CN111178399A - Data processing method and device, electronic equipment and computer readable storage medium - Google Patents

Data processing method and device, electronic equipment and computer readable storage medium Download PDF

Info

Publication number
CN111178399A
CN111178399A (application CN201911284423.XA)
Authority
CN
China
Prior art keywords
samples
minority
features
sample
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911284423.XA
Other languages
Chinese (zh)
Inventor
刘志煌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911284423.XA priority Critical patent/CN111178399A/en
Publication of CN111178399A publication Critical patent/CN111178399A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Abstract

The embodiments of the disclosure provide a data processing method and apparatus, an electronic device, and a computer-readable storage medium, belonging to the field of computer technologies. The method comprises the following steps: acquiring feature information of objects; obtaining the correlation degree of features according to the feature information of the objects; clustering the objects according to the correlation degree of the features to obtain a clustering result; taking the objects accounting for the smaller proportion of the clustering result as minority class samples and the objects accounting for the larger proportion as majority class samples, the ratio of the number of minority class samples to the number of majority class samples being 1:N, where N is the data imbalance ratio, a positive integer greater than 1; diffusing the minority class samples based on the minority class samples and the majority class samples to obtain synthesized minority class samples; and training a classification model according to the minority class samples, the majority class samples, and the synthesized minority class samples. Through the scheme provided by the embodiments of the disclosure, samples can be labeled automatically, saving a great deal of manpower and material resources.

Description

Data processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In the related art, the labels of the samples in a training data set used to train a classification model are all assigned manually, which consumes a large amount of manpower and material resources, is inefficient and costly, and is prone to errors during the manual labeling process.
Therefore, a new data processing method and apparatus, an electronic device, and a computer-readable storage medium are needed.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure.
Disclosure of Invention
The embodiment of the disclosure provides a data processing method and device, an electronic device and a computer-readable storage medium, which can automatically identify the class of a sample in a training data set used for training a classification model and automatically mark a training sample.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
The embodiments of the present disclosure provide a data processing method, comprising the following steps: acquiring feature information of objects; obtaining the correlation degree of features according to the feature information of the objects; clustering the objects according to the correlation degree of the features to obtain a clustering result; taking the objects accounting for the smaller proportion of the clustering result as minority class samples and the objects accounting for the larger proportion as majority class samples, the ratio of the number of minority class samples to the number of majority class samples being 1:N, where N is the data imbalance ratio, a positive integer greater than 1; diffusing the minority class samples based on the minority class samples and the majority class samples to obtain synthesized minority class samples; and training a classification model according to the minority class samples, the majority class samples, and the synthesized minority class samples.
The embodiments of the present disclosure provide a data processing apparatus, comprising: a feature information acquisition unit, configured to acquire feature information of objects; a feature correlation obtaining unit, configured to obtain the correlation degree of features according to the feature information of the objects; a clustering result obtaining unit, configured to cluster the objects according to the correlation degree of the features to obtain a clustering result; a sample class determining unit, configured to take the objects accounting for the smaller proportion of the clustering result as minority class samples and the objects accounting for the larger proportion as majority class samples, the ratio of the number of minority class samples to the number of majority class samples being 1:N, where N is the data imbalance ratio; a minority class sample synthesis unit, configured to diffuse the minority class samples based on the minority class samples and the majority class samples to obtain synthesized minority class samples; and a classification model training unit, configured to train a classification model according to the minority class samples, the majority class samples, and the synthesized minority class samples.
The disclosed embodiments provide a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the data processing method as described in the above embodiments.
An embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data processing method as described in the above embodiments.
In the technical solutions provided in some embodiments of the present disclosure, on the one hand, the correlation degree of features is obtained according to the acquired feature information of objects, the objects are clustered according to the correlation degree of the features to obtain a clustering result, the objects accounting for the smaller proportion of the clustering result are taken as minority class samples and the objects accounting for the larger proportion as majority class samples, and the ratio of the number of minority class samples to the number of majority class samples is 1:N, where N is the data imbalance ratio, a positive integer greater than 1; the classes of the samples can thus be identified automatically, with high identification efficiency, low cost, and good operability in industry. On the other hand, the minority class samples are diffused based on the minority class samples and the majority class samples to obtain synthesized minority class samples; by adding minority class samples, the class imbalance problem is alleviated without covering the information of the minority class samples, and a classification model can be trained according to the minority class samples, the majority class samples, and the synthesized minority class samples, improving the classification accuracy of the trained classification model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
fig. 1 shows a schematic diagram of an exemplary system architecture to which a data processing method or a data processing apparatus of an embodiment of the present disclosure may be applied;
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device used to implement embodiments of the present disclosure;
FIG. 3 schematically shows a flow diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 4 schematically shows a flow chart of a data processing method according to another embodiment of the present disclosure;
FIG. 5 is a diagram illustrating a processing procedure of step S320 shown in FIG. 3 in one embodiment;
FIG. 6 is a diagram illustrating a processing procedure of step S330 shown in FIG. 3 in one embodiment;
FIG. 7 is a schematic diagram illustrating a processing procedure of step S330 shown in FIG. 3 in another embodiment;
FIG. 8 is a diagram illustrating a processing procedure of step S350 shown in FIG. 3 in one embodiment;
FIG. 9 is a schematic diagram illustrating a processing procedure of step S350 shown in FIG. 3 in another embodiment;
FIG. 10 schematically shows a flow chart of a data processing method according to a further embodiment of the present disclosure;
FIG. 11 schematically illustrates a flow diagram of a data processing method according to yet another embodiment of the present disclosure;
FIG. 12 schematically illustrates a flow diagram of a data processing method according to yet another embodiment of the present disclosure;
fig. 13 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which the data processing method or data processing apparatus of the embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, a network 103, and a server 104. The network 103 serves as a medium for providing communication links between the terminal devices 101, 102 and the server 104. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102 to interact with the server 104 over the network 103 to receive or send messages or the like. The terminal devices 101, 102 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, wearable devices, virtual reality devices, smart homes, and the like.
The server 104 may be a server that provides various services, such as a background management server that provides support for devices operated by the user using the terminal apparatus 101, 102. The background management server can analyze and process the received data such as the request and feed back the processing result to the terminal equipment.
The server 104 may, for example, obtain feature information of objects; the server 104 may, for example, obtain the correlation degree of features according to the feature information of the objects; the server 104 may, for example, cluster the objects according to the correlation degree of the features to obtain a clustering result; the server 104 may, for example, take the objects accounting for the smaller proportion of the clustering result as minority class samples and the objects accounting for the larger proportion as majority class samples, where the ratio of the number of minority class samples to the number of majority class samples is 1:N, N being the data imbalance ratio, a positive integer greater than 1; the server 104 may, for example, diffuse the minority class samples based on the minority class samples and the majority class samples to obtain synthesized minority class samples; the server 104 may, for example, train the classification model according to the minority class samples, the majority class samples, and the synthesized minority class samples.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is only illustrative, and the server 104 may be a physical server or may be composed of a plurality of servers, and there may be any number of terminal devices, networks and servers according to actual needs.
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output section 207 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 208 including a hard disk and the like; and a communication section 209 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 209 performs communication processing via a network such as the internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 210 as necessary, so that a computer program read out therefrom is installed into the storage section 208 as necessary.
In particular, the processes described below with reference to the flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The computer program, when executed by the Central Processing Unit (CPU) 201, performs various functions defined in the methods and/or apparatus of the present application.
It should be noted that the computer readable storage medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method as described in the embodiments below. For example, the electronic device may implement the steps shown in fig. 3, 4, 5,6, 7, 8,9, 10, 11, or 12.
Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, endowing machines with the capabilities of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
The technical solution provided by the embodiment of the present disclosure relates to aspects such as a machine learning technique of an artificial intelligence technique, and is illustrated by the following specific embodiments.
Fig. 3 schematically shows a flow chart of a data processing method according to an embodiment of the present disclosure. The method provided by this embodiment of the present disclosure may be executed by any electronic device with computing and processing capability, e.g., the terminal devices 101 and 102 and/or the server 104 in fig. 1. In the following description, the server 104 is taken as the execution subject by way of example.
As shown in fig. 3, a data processing method provided by an embodiment of the present disclosure may include the following steps.
In step S310, feature information of the object is acquired.
In the embodiment of the present disclosure, the object may have different meanings according to different classification scenarios, for example, if different classes of the user need to be identified, the object may be the user; for another example, if different categories of goods need to be identified, the object may be a good, which is not limited by the present disclosure.
Accordingly, when the meaning of the object and the applicable classification scenario change, the feature information of the object changes as well. For example, if the emotional tendency of users needs to be identified, the feature information is any relevant information that can assist in identifying a user's emotional tendency, such as comment information posted by the user and the user's like and sharing behavior; for another example, if the category of a commodity needs to be identified, the feature information is any relevant information that can assist in identifying the category to which the commodity belongs, such as its name, model, and manufacturer.
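As an illustration only, the feature information of user objects might be organized as one record per object before being turned into a feature matrix; the Python sketch below uses hypothetical field names that are not part of the disclosure.

```python
# Hypothetical feature records for user objects; the field names are
# illustrative assumptions, not taken from the disclosure.
user_features = [
    {"user_id": 1, "num_comments": 12, "num_likes": 340, "num_shares": 5},
    {"user_id": 2, "num_comments": 0,  "num_likes": 3,   "num_shares": 0},
]

# Feature matrix: one row per object, one column per feature.
feature_names = ["num_comments", "num_likes", "num_shares"]
X = [[rec[name] for name in feature_names] for rec in user_features]
```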
In step S320, a correlation of the features is obtained according to the feature information of the object.
For example, the method in fig. 5 below may be referred to obtain the correlation of the feature corresponding to the feature information.
In step S330, the objects are clustered according to the relevance of the features, and a clustering result is obtained.
For example, referring to the methods in fig. 6 and 7 below, a clustering result of the object may be obtained.
In step S340, the objects accounting for the smaller proportion of the clustering result are taken as minority class samples and the objects accounting for the larger proportion are taken as majority class samples; the ratio of the number of minority class samples to the number of majority class samples is 1:N, where N is the data imbalance ratio, a positive integer greater than 1.
For example, a data imbalance ratio N may be preset, and the value of N may be set according to the actual scenario, e.g., N = 100 or N = 100000. The data imbalance ratio is the ratio of the number of majority class samples to the number of minority class samples in the original training dataset, where the minority class is the class with fewer samples in the training dataset and the majority class is the class with more samples. Taking two-class classification of users' emotional tendency as an example: if the training dataset contains 100 objects with negative emotional tendency and 10000 objects with positive emotional tendency, then the objects with negative emotional tendency form the minority class, the objects with positive emotional tendency form the majority class, and the data imbalance ratio of the training dataset is N = 100.
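A minimal sketch (not part of the disclosure) of computing the data imbalance ratio N from class labels, using the numbers of this example:

```python
from collections import Counter

# Labels from the example above: 10000 positive, 100 negative objects.
labels = ["positive"] * 10000 + ["negative"] * 100

counts = Counter(labels)
minority = min(counts, key=counts.get)    # "negative"
majority = max(counts, key=counts.get)    # "positive"
N = counts[majority] // counts[minority]  # data imbalance ratio N = 100
```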
In step S350, the minority class samples are diffused based on the minority class samples and the majority class samples, and a synthesized minority class sample is obtained.
In practical applications, the numbers of samples in different classes of an original training dataset are rarely equal, i.e., the data or classes are imbalanced. If such a dataset is used directly to train a classification model, the data distribution and sample characteristics of the minority class are difficult to learn, so minority class samples are easily misclassified, which in turn degrades the classification performance of the model.
In the related art, there are two methods for dealing with data imbalance: oversampling and undersampling. Oversampling keeps the number of majority class samples unchanged and reduces the difference in sample counts between classes by repeatedly sampling with replacement from the minority class. However, because it keeps the majority class unchanged and repeatedly draws the same minority class samples, oversampling in the related art easily causes overfitting. Undersampling keeps the minority class samples unchanged and achieves class balance by discarding part of the majority class samples. When the class imbalance ratio is too large, undersampling severely reduces the number of majority class samples, causing a loss of sample information.
In this embodiment, the minority class samples are diffused based on both the minority class samples and the majority class samples to obtain synthesized minority class samples, i.e., new minority class samples added so that the originally imbalanced data tends toward class balance. The minority class samples may be synthesized between pairs of minority class samples or between minority class samples and majority class samples, which avoids the risk of increased model noise caused by blindly adding minority class samples at random as in traditional approaches, and makes the classification model more stable and effective.
In step S360, a classification model is trained according to the minority class samples, the majority class samples and the synthesized minority class samples.
In the embodiment of the present disclosure, the added new synthesized minority class samples are added to the original training data set to update the training data set, and the classification model is trained with the updated training data set. The classification model is a mathematical model constructed by applying a mathematical logic method and a mathematical language, and is a mathematical model which enables a computer to learn new knowledge from existing data, namely, to perform systematic learning according to a training data set, such as how to classify comments, news or works, how to optimize classification results, and the like. The training process is a process of determining model parameters according to existing data by using training samples in a training data set and combining class labels corresponding to the training samples.
In the embodiment of the present disclosure, the original training data set is updated by using the added synthesized minority samples, and then the updated training data set, that is, the plurality of labeled minority samples, majority samples, and newly added synthesized minority samples, is used as a training sample to train the classification model. The updated training data set is added with a few types of synthesized samples, so that the class distribution tends to be balanced, the classification error rate of the total samples is reduced, and the overall classification performance is enhanced. That is to say, the trained classification model can be simply deployed, and then the classification result with high accuracy can be quickly obtained.
The classification model is trained with the goal of reducing the classification error rate over all samples, which requires the data within each class to be distributed evenly. The oversampling method in the related art randomly samples with replacement from the minority class multiple times, i.e., a generated new sample lies at an arbitrary position between two minority class samples; this merely expands the number of samples without changing the distribution characteristics of the minority class, has little influence on the classification boundary, easily causes sample overlap, and does not improve the classification model. In this embodiment, the added synthesized minority class samples are not limited to samples between pairs of minority class samples; the relationship between minority class samples and majority class samples is also considered, so that the minority class boundary is expanded while the number of samples is expanded.
In the embodiments of the present disclosure, the classification model may be used in any classification scenario; for example, it may be applied to classification of commodity reviews or to diagnosis in biomedicine. Specifically, corresponding classification categories may be designed according to the actual application, so as to train various different classification models. For example, the model may be any one or a combination of an RF (Random Forest) model, a GBDT (Gradient Boosting Decision Tree) model, a neural network model, a deep learning model, and the like. The classification model may be a binary classification model or a multi-class classification model.
In the following description, a two-class RF model is taken as an example for illustration, but the scope of the disclosure is not limited thereto.
The RF model is an ensemble learning algorithm that trains and predicts with a plurality of decision trees, using CART (Classification And Regression Tree) decision trees as weak learners. The input to the RF model is a training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ and the number of weak-classifier iterations $T$, where $x_i$ is the feature information of the $i$-th sample and $y_i$ is its class label (for example, the class label of minority class samples is assumed to be 0 and that of majority class samples to be 1, though the disclosure is not limited thereto), $i$ is a positive integer with $1 \le i \le m$, and $m$ and $T$ are positive integers greater than or equal to 1. The output is the final strong classifier $f(x)$. Specifically:
for T1, 2, T is a positive integer greater than or equal to 1 and less than or equal to T:
random sampling is carried out on the training data set for the t time, and m times are collected in total to obtain a sampling set D containing m samplesm
Using a sample set DmTraining mth decision tree model Gm(x)When the nodes of the decision tree model are trained, a part of sample features are selected from all sample features on the nodes, and an optimal feature is selected from the randomly selected part of sample features to divide left and right subtrees of the decision tree. The T weak learners cast the most votes for the category or one of the categories as the final predicted category.
The RF model can be trained in parallel, which improves training speed and efficiency. Meanwhile, because the features used for node splitting are selected at random, the model can still be trained efficiently when the feature dimension of the samples is high. In addition, because random sampling is adopted, the trained classification model has small variance and strong generalization capability, and the RF model is not sensitive to the absence of some features.
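As a minimal sketch only, a two-class RF model of this kind could be trained with scikit-learn's RandomForestClassifier; the library, toy data, and hyperparameters below are illustrative assumptions rather than the disclosure's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data: X holds feature information, y holds class labels
# (0 = minority class, 1 = majority class, as assumed in the text).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) > 0.1).astype(int)

# n_estimators plays the role of T (number of weak learners); max_features
# controls the random subset of features considered at each node split.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             n_jobs=-1, random_state=0)
clf.fit(X, y)
pred = clf.predict(X[:5])   # majority vote of the T trees
```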
According to the data processing method provided by the embodiments of the present disclosure, on the one hand, the correlation degree of features is obtained according to the acquired feature information of objects, the objects are clustered according to the correlation degree of the features to obtain a clustering result, the objects accounting for the smaller proportion of the clustering result are taken as minority class samples and the objects accounting for the larger proportion as majority class samples, and the ratio of the number of minority class samples to the number of majority class samples is 1:N, where N is the data imbalance ratio; the classes of the samples can thus be identified automatically, with high identification efficiency, low cost, and good operability in industry. On the other hand, the minority class samples are diffused based on the minority class samples and the majority class samples to obtain synthesized minority class samples; by adding minority class samples, the class imbalance problem is alleviated without covering the information of the minority class samples, and a classification model can be trained according to the minority class samples, the majority class samples, and the synthesized minority class samples, improving the classification accuracy of the trained classification model.
Fig. 4 schematically shows a flow chart of a data processing method according to another embodiment of the present disclosure.
As shown in fig. 4, compared with the above embodiment, the method provided by this embodiment of the present disclosure differs in that, before step S320, the method may further include the following steps.
In step S410, a variance of the feature is obtained according to the feature information of the object.
For example, take the feature corresponding to certain feature information, and suppose there are $n$ objects in total ($n$ being a positive integer greater than or equal to 1), so that the feature has $n$ feature values, e.g., $(10, 5, 6, 8, 9, \ldots)$. The variance of the $n$ feature values of this feature is calculated according to the following formula:
$$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - h)^2 \qquad (1)$$
where $x_i$ is the $i$-th feature value and $h$ is the mean of the $n$ feature values.
In step S420, if the variance of the feature is smaller than the variance threshold, the feature with the variance smaller than the variance threshold is filtered.
The value of the variance threshold can be set according to actual requirements. For example, the variance threshold may be set to 1. With reference to a similar method as described above, the variance of all features in the feature set can be calculated. If the variance of a feature is less than 1, the feature is removed from the feature set, i.e., the feature with variance less than the variance threshold is filtered.
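A minimal sketch of this variance filter, assuming the features are held in a NumPy matrix and using the example threshold of 1:

```python
import numpy as np

# Rows are objects, columns are features; the second feature varies little.
X = np.array([[10, 0.2], [5, 0.1], [6, 0.3], [8, 0.2], [9, 0.1]])

variances = X.var(axis=0)          # population variance per feature, formula (1)
variance_threshold = 1.0
keep = variances >= variance_threshold
X_filtered = X[:, keep]            # features with variance < 1 are removed
```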
Fig. 5 is a schematic diagram illustrating a processing procedure of step S320 illustrated in fig. 3 in an embodiment. It should be noted that, in the embodiment of the present disclosure, if the step in the embodiment of fig. 4 is further included before the step S320, the feature processed in fig. 5 refers to other features remaining after the filtering of the variance threshold.
As shown in fig. 5, in the embodiment of the present disclosure, the step S320 may further include the following steps.
In step S321, correlation between features is obtained from the feature information of the object.
For example, the correlation $p(X, Y)$ between any two features $X$ and $Y$ can be calculated using the Pearson correlation coefficient:
$$p(X, Y) = \frac{\sum_{j=1}^{n}(X_j - \bar{X})(Y_j - \bar{Y})}{\sqrt{\sum_{j=1}^{n}(X_j - \bar{X})^2}\sqrt{\sum_{j=1}^{n}(Y_j - \bar{Y})^2}} \qquad (2)$$
where $X_j$ denotes the $j$-th feature value of feature $X$ and $Y_j$ denotes the $j$-th feature value of feature $Y$; features $X$ and $Y$ are each assumed to be $n$-dimensional vectors, i.e., each feature has $n$ feature values, $n$ being a positive integer greater than or equal to 1; $\bar{X}$ denotes the mean of feature $X$ and $\bar{Y}$ denotes the mean of feature $Y$.
It should be noted that the embodiments of the present disclosure are not limited to calculating the correlation between two features by using the above formula (2).
In step S322, an average correlation of the features is obtained based on the correlation between the features.
For example, the average correlation $\mathrm{Rel}(X)$ of any one feature $X$ can be calculated using the following formula:
$$\mathrm{Rel}(X) = \frac{1}{Q}\sum_{q=1}^{Q} p(X, f_q) \qquad (3)$$
where it is assumed that there are $Q$ features in total in the feature set, $Q$ is a positive integer greater than or equal to 1, and $f_q$ denotes the $q$-th feature in the feature set.
In step S323, the correlation of the feature is determined according to the average correlation of the feature.
In the embodiments of the present disclosure, the correlation degree of any one feature $X$ may be defined as its average correlation with all features in the feature set, i.e., the correlation degree of feature $X$ equals $\mathrm{Rel}(X)$. However, the present disclosure is not limited thereto; for example, the correlation degree of feature $X$ may instead be the square root of its average correlation, i.e., $\sqrt{\mathrm{Rel}(X)}$, or the square of its average correlation, i.e., $\mathrm{Rel}(X)^2$, and so on.
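A minimal sketch of steps S321 to S323, assuming a NumPy feature matrix and taking Rel(X) itself as the correlation degree (the first of the definitions above); the descending sort anticipates step S331 below.

```python
import numpy as np

# X: n objects (rows) by Q features (columns), after variance filtering.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

corr = np.corrcoef(X, rowvar=False)   # Q x Q matrix of p(X, Y), formula (2)
rel = corr.mean(axis=1)               # Rel(X) of each feature, formula (3)
order = np.argsort(rel)[::-1]         # descending correlation degree (step S331)
```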
Fig. 6 is a schematic diagram illustrating a processing procedure of step S330 shown in fig. 3 in an embodiment.
As shown in fig. 6, in the embodiment of the present disclosure, the step S330 may further include the following steps.
In step S331, the features are sorted in descending order according to the correlation of the features, and an ordered feature sequence is obtained.
Each feature in the feature set is sorted in descending order according to its corresponding $\mathrm{Rel}(X)$, and an ordered feature sequence is output.
It should be noted that, although the description here sorts the features in descending order of correlation degree, in other embodiments the features may instead be sorted in ascending order of correlation degree; in that case, when features are selected for clustering in the subsequent steps, the last M1 features or the last M2 features are selected instead.
In step S332, the top M1 features are selected from the ordered sequence of features, and M1 is a positive integer greater than or equal to 1.
In the embodiment of the present disclosure, objects with a predetermined ratio may be randomly selected from all the objects as training samples, and the predetermined ratio may be, for example, 1:3, but the present disclosure is not limited thereto, and the value of the predetermined ratio may be adjusted according to an actual scene, or even all the objects may be directly used as training samples.
Then, the first M1 features are selected from the ordered feature sequence; for example, M1 is assumed to be 10 here, but the present disclosure is not limited thereto, and the value of M1 may be set according to actual needs.
In step S333, the objects are clustered based on the top M1 features.
For example, k-means clustering may be performed on the selected training samples based on the first 10 features; taking a two-class classification model as an example, the value of k may be 2. If the classification model is a three-class model, the value of k may be 3, and so on. It should be noted that the specific clustering algorithm of the present disclosure is not limited to k-means clustering.
k-means clustering is an iteratively solved cluster analysis algorithm: k objects are randomly selected as initial cluster centers, the distance between each object and each cluster center is calculated, and each object is assigned to the nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. After all training samples have been assigned, the cluster center of each cluster is recalculated based on the objects currently in the cluster. This process repeats until some termination condition is met, e.g., no (or a minimum number of) objects are reassigned to different clusters, no (or a minimum number of) cluster centers change again, or the sum of squared errors reaches a local minimum.
In step S334, if, after the objects are clustered based on the first M1 features, the ratio of the number of objects in the smaller cluster to the number of objects in the larger cluster is 1:N, the result of clustering the objects based on the first M1 features is taken as the clustering result.
For example, taking a two-class classification model, clustering the training samples based on the first 10 features yields two clusters. The objects in the cluster with fewer objects are called minority class samples, and the objects in the cluster with more objects are called majority class samples. If the ratio of the number of minority class samples to the number of majority class samples is 1:N, where N is the data imbalance ratio, the selection of further features for clustering may stop, and the two clusters based on the first 10 features are taken directly as the clustering result; for example, the minority class samples are labeled with the class label "0" and the majority class samples with the class label "1", though the present disclosure is not limited thereto.
Fig. 7 is a schematic diagram illustrating a processing procedure of step S330 illustrated in fig. 3 in another embodiment.
As shown in fig. 7, in the embodiment of the present disclosure, the step S330 may further include the following steps.
In step S335, if, after the objects are clustered based on the first M1 features, the ratio of the number of objects in the smaller cluster to the number of objects in the larger cluster is not 1:N, the first M2 features are selected from the ordered feature sequence, M2 being a positive integer greater than M1.
For example, if, after 2-means clustering of the training samples based on the first 10 features of the ordered feature sequence, the ratio of minority class samples to majority class samples is not equal to 1:N, a stepwise strategy may be adopted: one new feature from the ordered feature sequence is added each time and the training samples are re-clustered. For instance, with M2 = 11, the first 11 features of the ordered feature sequence are selected.
Of course, the number of features incremented at a time is not limited to one, and may be two, three, or more. Meanwhile, the number of features added each time may be equal or different, for example, the first 10 features are selected for the first time, the first 13 features are selected for the second time, the first 18 features are selected for the third time, and so on.
In step S336, the objects are clustered based on the top M2 features.
For example, 2-means clustering is performed again on the training samples based on the first 11 features to obtain two new clusters, where the objects in the new cluster with fewer objects are called minority class samples and the objects in the other, larger cluster are called majority class samples.
In step S337, if, after the objects are clustered based on the first M2 features, the ratio of the number of objects in the smaller cluster to the number of objects in the larger cluster is 1:N, the result of clustering the objects based on the first M2 features is taken as the clustering result.
For example, if the two new clusters obtained after re-clustering the training samples based on the first 11 features satisfy a ratio of the number of minority class samples to the number of majority class samples equal to 1:N, the two new clusters are taken as the clustering result, the minority class samples in the two new clusters are labeled with the class label "0", and the majority class samples with the class label "1", though the present disclosure is not limited thereto. If the two new clusters based on the first 11 features do not satisfy the 1:N ratio of minority to majority class samples, the first 12 features may be selected from the ordered feature sequence to re-cluster the training samples, and so on iteratively, until the two clusters after some round of clustering satisfy the 1:N ratio; the iteration then stops, and the automatic identification of the class label of each object in the training samples is complete.
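A minimal sketch of the incremental loop of steps S331 to S337, assuming a two-class scenario and scikit-learn's KMeans; since real cluster sizes rarely hit 1:N exactly, the tolerance check is an illustrative assumption, and the function name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def auto_label(X_sorted, N, m_start=10, tol=0.1):
    """X_sorted: samples x features, with columns already in descending
    order of correlation degree (e.g. X[:, order] from the sketch above).
    Returns (labels, m) with 0 = minority class, 1 = majority class."""
    n_feat = X_sorted.shape[1]
    for m in range(m_start, n_feat + 1):        # add one feature per round
        km = KMeans(n_clusters=2, n_init=10, random_state=0)
        cluster = km.fit_predict(X_sorted[:, :m])
        sizes = np.bincount(cluster, minlength=2)
        minority_cluster = int(np.argmin(sizes))
        ratio = sizes.max() / sizes.min()       # majority : minority
        if abs(ratio - N) <= tol * N:           # approximately 1:N
            return (cluster != minority_cluster).astype(int), m
    return None, None                           # no prefix reached the ratio
```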
According to the data processing method provided by the embodiments of the present disclosure, a method for quantifying the discriminability of each feature based on statistical analysis is designed, so that high-quality class labels for training samples are constructed automatically. This solves the problem in the related art that large numbers of training samples must be labeled manually, enables automatic labeling of training-sample class labels in various business scenarios, and has good operability in industry.
Fig. 8 is a schematic diagram illustrating a processing procedure of step S350 shown in fig. 3 in an embodiment. In the embodiments of the present disclosure, since the minority class samples account for only a small proportion of the entire training sample set, the minority class samples are diffused using SMOTE (Synthetic Minority Oversampling Technique) to address the class imbalance problem in the training dataset.
As shown in fig. 8, in the embodiment of the present disclosure, the step S350 may further include the following steps.
In step S351, a target minority class sample is determined from the minority class samples.
Assume there are $d$ minority class samples and $d \times N$ majority class samples in total in the training dataset, $d$ being a positive integer greater than or equal to 1. Take the $i1$-th minority class sample $X_{i1}$ among the $d$ minority class samples as the target minority class sample, where $i1$ is a positive integer with $1 \le i1 \le d$.
In step S352, neighbor samples of the target minority class sample are obtained.
Taking the $i1$-th minority class sample $X_{i1}$ as an example, its distance to all training samples (including minority class and majority class samples) is calculated using the Euclidean distance as the metric, and the $K1$ samples closest to $X_{i1}$ are selected as its $K1$ neighbor samples (these may be all minority class samples, all majority class samples, or partly minority class samples with the rest majority class samples). $K1$ is a positive integer greater than or equal to 1 whose value may be chosen according to the actual scenario.
In step S353, a distance weight of the neighboring sample is obtained according to the distance between the neighboring sample and the target minority sample.
The $K1$ neighbor samples are sorted in ascending order of their distance to the $i1$-th minority class sample $X_{i1}$, from nearest to farthest. Denote the distances of the $K1$ neighbor samples to $X_{i1}$, from nearest to farthest, as $J_1, J_2, \ldots, J_{K1}$, i.e., $J_1 < J_2 < \ldots < J_{K1}$. Based on the principle that synthesized minority class samples should be diffused close to the minority class boundary, and that the closer a neighbor is, the higher its probability of being used for synthesis, let the $j1$-th neighbor sample $X_{ij(\mathrm{near})}$ have distance weight $D_{j1}$, where $j1$ is a positive integer with $1 \le j1 \le K1$; the distance weights of the $K1$ neighbor samples of $X_{i1}$ then satisfy $D_1 > D_2 > \ldots > D_{K1}$. Accordingly, the distance weight of each neighbor sample may be set inversely proportional to its distance, e.g., $D_{j1} = 1/J_{j1}$, where $J_{j1}$ is the distance between the $j1$-th neighbor sample and the $i1$-th minority class sample $X_{i1}$.
In step S354, a class weight of the neighboring sample is obtained according to the sample class of the neighboring sample.
For the $K1$ neighbor samples of the $i1$-th minority class sample $X_{i1}$: if the class label of the $j1$-th neighbor sample $X_{ij(\mathrm{near})}$ is the majority class, i.e., different from the class label of $X_{i1}$, the class weight of that neighbor sample may be a preset constant $S_{j1}$, e.g., $S_{j1} = 1$; if the class label of the $j1$-th neighbor sample $X_{ij(\mathrm{near})}$ is the minority class, i.e., the same as that of $X_{i1}$, the class weight of that neighbor sample may be the sum of the preset constant $S_{j1}$ and another constant $\delta > 0$, i.e., $S_{j1}' = S_{j1} + \delta$; if $S_{j1} = 1$, then $S_{j1}' = 1 + \delta$.
In step S355, a combination weight of the neighboring samples is obtained according to the distance weight and the class weight of the neighboring samples.
The distance weight $D_{j1}$ and the class weight $S_{j1}$ (or $S_{j1}'$) of the $j1$-th neighbor sample $X_{ij(\mathrm{near})}$ are combined to obtain the combination weight $W_{ij(\mathrm{near})}$ of the $j1$-th neighbor sample of the $i1$-th minority class sample $X_{i1}$. In some embodiments, the combination weight may be the sum of the distance weight and the class weight; in other embodiments, it may be their product, e.g., $W_{ij(\mathrm{near})} = D_{j1} \times S_{j1}$.
In step S356, a synthesized minority sample number between the target minority sample and the neighbor sample is determined according to the data imbalance multiplying factor N and the combined weight of the neighbor sample.
In the embodiments of the present disclosure, if the updated training dataset (i.e., the training dataset after the synthesized minority class samples are added to the original training dataset) is to have a 1:1 ratio of minority class to majority class samples, it may be determined from the data imbalance ratio $N$ that, for the $i1$-th minority class sample $X_{i1}$ in the original training dataset, a total of $(N-1)$ synthesized minority class sample points need to be interpolated within the range of its $K1$ neighbor samples. Specifically, the interpolation may be distributed according to the combination weight of each neighbor sample; for example, the number $N_{j1}$ of synthesized minority class samples to be interpolated between $X_{i1}$ and its $j1$-th neighbor sample $X_{ij(\mathrm{near})}$ is:
$$N_{j1} = (N-1) \times \frac{W_{ij(\mathrm{near})}}{\sum_{j1=1}^{K1} W_{ij(\mathrm{near})}} \qquad (4)$$
Referring to this method, similar diffusion is performed for each minority class sample, obtaining the number of synthesized minority class samples to be interpolated between each minority class sample and each of its corresponding neighbor samples. It should be noted that the present disclosure does not limit the ratio of minority class to majority class samples in the updated training dataset to 1:1; the number of synthesized minority class samples interpolated within the range of the $K1$ neighbor samples of each minority class sample may be adjusted according to the actual situation.
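A minimal sketch of steps S351 to S356 under the example choices above (Euclidean distance, distance weight 1/J, class-weight bonus δ for same-class neighbors, product combination weight, and formula (4) for the allocation); the function and parameter names are hypothetical.

```python
import numpy as np

def neighbor_plan(X, y, target_idx, K1=5, N=100, delta=0.5):
    """For the minority sample X[target_idx] (y uses 0 = minority,
    1 = majority), return the indices of its K1 neighbor samples and how
    many synthesized samples to interpolate toward each of them."""
    dist = np.linalg.norm(X - X[target_idx], axis=1)  # Euclidean distances
    dist[target_idx] = np.inf                  # exclude the sample itself
    nbr = np.argsort(dist)[:K1]                # K1 nearest samples, any class
    D = 1.0 / dist[nbr]                        # distance weight D = 1/J
    S = np.where(y[nbr] == y[target_idx], 1.0 + delta, 1.0)  # class weight
    W = D * S                                  # combination weight W = D * S
    counts = np.round((N - 1) * W / W.sum()).astype(int)  # formula (4);
    # rounding may make the total differ slightly from N - 1
    return nbr, counts
```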
Fig. 9 is a schematic diagram illustrating a processing procedure of step S350 illustrated in fig. 3 in another embodiment.
As shown in fig. 9, in the embodiment of the present disclosure, the step S350 may further include the following steps.
In step S357, if the neighbor sample is a minority class sample, the synthesized minority class sample is inserted between the neighbor sample and the target minority class sample.
In the embodiment of the present disclosure, following the embodiment of fig. 8, after the number of synthesized minority class samples to be interpolated among the K1 neighbor samples of each minority class sample is determined, the position range in which the synthesized minority class samples are interpolated is determined next. Assume interpolation is performed between the original i1-th minority class sample Xi1 and its j1-th neighbor sample Xij(near). If the class label of Xij(near) is a minority class sample, the newly interpolated synthesized minority class sample Xi2 lies in the range between the two homogeneous points, i.e. Xi2 = Xi1 + ε1 × (Xij(near) − Xi1), where ε1 ∈ (0,1); that is, the synthesized minority class sample Xi2 may be inserted at any position on the line connecting the i1-th minority class sample Xi1 and its j1-th neighbor sample Xij(near).
In step S358, if the neighboring sample is a majority class sample, the synthesized minority class sample is inserted between the neighboring sample and the target minority class sample and near the position of the target minority class sample.
If the class label of Xij(near) is a majority class sample, the newly interpolated synthesized minority class sample Xi2 lies in the range between the two heterogeneous points, close to the original i1-th minority class sample Xi1, i.e. Xi2 = Xi1 + ε2 × (Xij(near) − Xi1), where ε2 ∈ (0,0.5); that is, the synthesized minority class sample Xi2 may be inserted on the line connecting the i1-th minority class sample Xi1 and its j1-th neighbor sample Xij(near), at a position close to the i1-th minority class sample Xi1.
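Continuing the illustrative sketch above (again with hypothetical names), the interpolation rule of steps S357 and S358 could be written as follows, drawing ε1 from (0, 1) for a minority-class neighbor and ε2 from (0, 0.5) for a majority-class neighbor:

    import numpy as np

    def interpolate_synthetic(x_i, neighbor, neighbor_is_minority, count, rng=None):
        """Generate `count` synthetic points on the segment from x_i toward
        neighbor; majority-class neighbors keep the points near x_i."""
        if rng is None:
            rng = np.random.default_rng()
        high = 1.0 if neighbor_is_minority else 0.5
        eps = rng.uniform(0.0, high, size=count)      # epsilon_1 or epsilon_2
        return x_i + eps[:, None] * (neighbor - x_i)  # X_i2 = X_i1 + eps*(X_ij - X_i1)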
In the SMOTE method in the related art, a selected target minority sample is combined with other minority samples closest to the target minority sample to generate a synthesized minority sample.
In the embodiment of the present disclosure, by contrast, each minority class sample has K1 neighbor samples, so with d minority class samples a total of d × K1 neighbor samples must be screened organically in order to achieve class balance, avoid overlapping interpolated samples, and expand the minority class boundary. Specifically, according to the data imbalance ratio N, it is determined that each minority class sample needs (N − 1) samples interpolated within the range of its K1 neighbor samples, and the number of synthesized minority class samples corresponding to each neighbor sample is then obtained according to that neighbor's combined weight. After the number of synthesized minority class samples to be interpolated for each neighbor sample of a minority class sample is determined, the position range of the interpolated synthesized minority class samples is determined. Further, an interpolated synthesized minority class sample may lie between two samples that both belong to the minority class; in that case the class label of the corresponding neighbor sample is a minority class sample and ε1 takes a value between 0 and 1. Because the class label of a neighbor sample may be either a majority class sample or a minority class sample, an interpolated synthesized minority class sample may also lie between a minority class sample and a majority class sample; if the neighbor sample is a majority class sample, ε2 takes a value between 0 and 0.5. In this way, a synthesized sample may lie anywhere between two minority class samples, and a synthesized sample lying between a minority class sample and a majority class sample approaches the position of the minority class sample, so that the interpolated synthesized minority class samples stay close to their corresponding target minority class samples and the minority class boundary is expanded. That is, by improving the original imbalanced training data set, new samples are reasonably interpolated between minority class samples and majority class samples, so that the training data set becomes balanced and a more reliable, stable and highly accurate classification model is obtained.
According to the data processing method provided by the embodiment of the present disclosure, the distribution characteristics of the minority class samples and of their neighbor samples are considered comprehensively, and an adaptive synthesis strategy is set for the neighbor samples according to the different degrees of influence of samples in different regions, so that the classification effect of the classification model is effectively improved. Using the improved SMOTE to address the problem that the number of minority class samples in the training data set is too small ensures that the information of the minority class samples is not drowned out, and greatly improves the classification accuracy of a classification model trained on the updated training data set.
Fig. 10 schematically shows a flow chart of a data processing method according to a further embodiment of the present disclosure.
As shown in fig. 10, the method provided by the embodiment of the present disclosure differs from the above embodiment in that it may further include at least one of the following steps before step S350.
In step S1010, if the number of missing values in the features is greater than the missing threshold, the features with the number of missing values greater than the missing threshold are filtered.
In practical applications, part of the data may be missing for various reasons (for example, the information temporarily cannot be acquired, is omitted during collection, a certain attribute does not exist for the object, the information is considered unimportant, the cost of acquiring it is too high, or the system has strict real-time requirements and a judgment or decision must be made before the information can be acquired), so that only part of the data can be observed; the missing part is called missing values. For example, a missing threshold may be set equal to the sample data size of the training data set × 0.4 (the present disclosure is not limited thereto). If the number of missing feature values of a feature exceeds the missing threshold, the feature may be filtered out; conversely, if the number of missing feature values of a feature does not exceed the missing threshold, the feature may be retained in the feature set.
In step S1020, if the feature is a single-valued feature, the single-valued feature is filtered.
In the embodiment of the present disclosure, a single-valued feature is a feature whose feature values are all the same. For example, if the gender feature is collected for 10000 users in total and all of them are male, the feature is removed from the feature set.
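As a minimal sketch of steps S1010 and S1020 (the function name, the 0.4 ratio default and the pandas-based representation are illustrative assumptions, not taken from the original disclosure):

    import pandas as pd

    def filter_features(df: pd.DataFrame, missing_ratio: float = 0.4) -> pd.DataFrame:
        """Drop features whose missing count exceeds len(df) * missing_ratio
        (step S1010) and single-valued features (step S1020)."""
        threshold = len(df) * missing_ratio
        keep = [c for c in df.columns
                if df[c].isna().sum() <= threshold       # missing-value filter
                and df[c].nunique(dropna=True) > 1]      # single-value filter
        return df[keep]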
In step S1030, the outliers in the feature are discarded.
An abnormal value is a feature value that, according to existing knowledge of the objective object, is judged to deviate from the normal result due to external interference, human error, and the like. In the embodiment of the present disclosure, a value range may be defined for the feature values of a feature; if a feature value falls outside this range, it is considered an abnormal value and is discarded. Abnormal values whose feature values are too large may also be discarded according to the feature distribution; for example, all feature values of a feature are sorted in descending order from large to small, and the values in the top one-thousandth (this proportion may be selected according to the actual situation, e.g., one ten-thousandth or one-millionth) are discarded as abnormal values.
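A sketch of step S1030, assuming both an explicit value range and a distribution-based cutoff (the names lo, hi and top_quantile are illustrative):

    import pandas as pd

    def drop_outliers(s: pd.Series, lo=None, hi=None, top_quantile=0.001) -> pd.Series:
        """Replace abnormal values with NaN: values outside [lo, hi] and the
        largest `top_quantile` share of the distribution (step S1030)."""
        s = s.where(s <= s.quantile(1.0 - top_quantile))  # drop the top 0.1%
        if lo is not None:
            s = s.where(s >= lo)
        if hi is not None:
            s = s.where(s <= hi)
        return s

Masking abnormal values with NaN rather than deleting rows lets the missing-value filling of step S1040 handle the discarded values uniformly.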
In step S1040, a padding process is performed on missing values in the feature in which the number of missing values is less than or equal to the missing threshold.
The missing values of the features remaining in the feature set are filled. For example, if a feature is a continuous feature, its missing values may be filled with the average of all non-missing feature values of that feature; a continuous feature is one whose possible feature values are infinite. If a feature is a discrete feature, its missing values may be filled with a preset constant; a discrete feature is one whose feature values are selected from a finite set. The preset constant may be a constant C outside the finite set, and filling the missing values of the discrete feature with the constant C outside the finite set informs the computer system that this is a special category. However, the present disclosure is not limited to filling missing values with a mean value or a preset constant; for example, missing values may also be filled with the mode, i.e., the feature value of the feature that occurs most frequently.
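A corresponding sketch of step S1040 (discrete_cols and the constant "C" are illustrative; in practice the constant must lie outside each discrete feature's value set):

    import pandas as pd

    def fill_missing(df: pd.DataFrame, discrete_cols, fill_constant="C") -> pd.DataFrame:
        """Fill continuous features with their mean and discrete features
        with a constant outside their value set (step S1040)."""
        for col in df.columns:
            if col in discrete_cols:
                df[col] = df[col].fillna(fill_constant)   # special extra category
            else:
                df[col] = df[col].fillna(df[col].mean())  # mean of non-missing values
        return df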
In step S1050, the feature is subjected to a derivation process.
Feature derivation means deriving new features by performing transformation processing on the original features, for example squaring a feature, adding or subtracting features, or taking the cube root of a feature.
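For instance, a minimal sketch of step S1050 (the derived column names are invented for illustration):

    import numpy as np
    import pandas as pd

    def derive_features(df: pd.DataFrame, a: str, b: str) -> pd.DataFrame:
        """Derive new features from numeric columns `a` and `b` (step S1050)."""
        df[f"{a}_squared"] = df[a] ** 2      # feature square
        df[f"{a}_plus_{b}"] = df[a] + df[b]  # feature addition
        df[f"{a}_cbrt"] = np.cbrt(df[a])     # cube root of the feature
        return df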
In step S1060, if the feature is a continuous feature, discretization is performed on the continuous feature.
For example, the continuous features may be subjected to binning discretization, and the binning mode may be any one of equal frequency binning, equidistant binning, chi-square binning and the like.
In step S1070, if the feature is a discrete feature, one-hot encoding processing is performed on the discrete feature.
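Steps S1060 and S1070 could, for example, look as follows in pandas (toy data; the bin count of 10 is an arbitrary choice):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    s = pd.Series(rng.exponential(size=1000))         # toy continuous feature
    equal_freq = pd.qcut(s, q=10, duplicates="drop")  # equal-frequency binning
    equidistant = pd.cut(s, bins=10)                  # equidistant binning

    city = pd.Series(["T1", "T2", "T1", "T3"])        # toy discrete feature
    one_hot = pd.get_dummies(city, prefix="city")     # one-hot encoding (S1070)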
In step S1080, the features are selected by chi-square test, and features of a preset dimension are selected.
In the embodiment of the present disclosure, a chi-square test may be used for feature selection, selecting the top preset-dimension features that are most correlated with the classification recognition performed by the classification model; for example, the top 200 features are selected as the final feature set.
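A sketch of step S1080 using scikit-learn (the dataset here is synthetic; chi-square feature scoring requires non-negative inputs, hence the scaling step):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.preprocessing import MinMaxScaler

    X, y = make_classification(n_samples=500, n_features=300, random_state=0)
    X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features
    X_top = SelectKBest(chi2, k=200).fit_transform(X, y)  # keep the top 200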
It should be noted that the sequence between the steps S1010-S1080 is not limited to the above example, and the sequence of the steps S1010-S1080 may be arbitrarily adjusted, for example, the abnormal value is discarded first, and then the missing value is processed; for example, discretizing the continuous feature and then filtering the single-valued feature.
In some embodiments, steps S1010-S1080 may be performed between step S340 and step S350. That is, after a batch of training samples is obtained through clustering, data preprocessing and feature selection are performed on the features of the training samples, and the minority class samples are then diffused on the preset-dimension features obtained after data preprocessing and feature selection, so as to obtain the synthesized minority class samples. In this way, during the process of obtaining the training samples, the correlation degree is calculated based on the features in the original training data set, which better reflects the characteristics of the data themselves and avoids data noise introduced in the data preprocessing process; for example, feature derivation may introduce some transformed features.
In other embodiments, steps S1010-S1080 may also be performed between step S310 and step S320; that is, data preprocessing and feature selection are first performed on the acquired feature information, and the variance calculation and correlation calculation are then performed on the preset-dimension features obtained after data preprocessing and feature selection. Performing data preprocessing and feature selection first can reduce the amount of computation for subsequently obtaining the training samples and increase the operation speed.
The embodiment of the present disclosure designs a more complete set of feature preprocessing and feature selection methods tailored to the data characteristics of the training data set, which improves the contribution of the features to the model, reduces the data volume of the classification model training process while maintaining classification accuracy, and improves the speed and efficiency of model training.
The following description takes as an example the case where the classification model is a random forest model, the object to be identified is a user to be identified, and the classification model is used for identifying financial KOL (Key Opinion Leader) users and non-financial-KOL users.
KOL is a concept in marketing, generally defined as a person who possesses more and more accurate product information, is accepted or trusted by the relevant group, and exerts a greater influence on the purchasing behavior of that group. Unlike ordinary opinion leaders, key opinion leaders are often authoritative figures in an industry or field who are easily recognized and trusted in the dissemination of information, without relying on their own activity level. KOLs typically show three characteristics. The first is persistent involvement: compared with others in the group, a KOL intervenes in a certain class of products for a longer time and more intensively, and therefore knows the products better and has broader information sources, more knowledge and richer experience. The second is interpersonal communication: compared with ordinary people, a KOL is more sociable and talkative, has strong social skills and interpersonal communication skills, actively participates in various activities, is good at making friends, is fond of holding forth, is the opinion center and information distribution center of a group, and is highly influential to others. The third is personality: a KOL is open-minded, quick to accept new things, attentive to changes in fashion and trends, and willing to be the first to use new products; in marketing terms, a KOL is an early adopter of new products.
Financial KOL refers to a user with a higher value in the financial field, particularly in terms of financial investments and trading behavior.
The identification and mining of financial KOLs is of great significance to related applications in the financial field. By mining users with high potential and strong dissemination power in the financial field, the operation and delivery of financial products and related services can be guided, and promotion can be targeted more precisely at the intended user groups, achieving twice the result with half the effort. For example, in the fields of financial investment and financing loans, mining target customers for products has a significant effect on enhancing the dissemination effect among customer groups and improving the PV (Page View) and UV (Unique Visitor) of products, so mining such users is of great significance to the promotion, operation and sale of financial products; in financial information and forums, mining and guiding financial KOLs can effectively drive market trends and the direction of public opinion. Therefore, accurately and effectively identifying and mining financial KOL user groups plays a crucial role in investment, financing and financial events.
Therefore, mining financial KOL user groups has very high value in financial applications such as financial product operation. In the related art, methods for mining and identifying financial KOL users mainly include two kinds: expanding a target user group based on a user relationship network, and obtaining classification probabilities from a target mining model generated based on a general classification model.
The method based on the user relationship network first obtains the portrait features of a target user group, constructs a relationship network from seed users toward other users through the portrait features, calculates relationship-degree information between the other users and the seed users, and extracts the other users matching the conditions as target users according to the relationship-degree information.
The method based on the general classification model obtains historical data of all users through a plurality of dimensional features, trains a plurality of mining models for user prediction, determines a target mining model based on these mining models, and determines target users from all users through the target mining model.
The KOL mining methods in the above related art have at least the following problems:
1. The class labels of the training samples used to build the machine learning model mostly rely on manual work, or are divided simply and roughly by setting thresholds and rules; the difference and value of each feature for identifying KOL users are thus not well quantified, and automatic identification of the class labels of training samples is not well realized.
2. The method based on constructing a relationship network has to handle a very large number of network nodes and complex node relationships when modeling the user relationship chain, and the process of constructing and training the network is very time-consuming.
3. For the task of identifying KOL users, the number of KOL users is usually much smaller than that of non-KOL users; the KOL classification methods in the related art do not solve this problem well, and their feature processing does not handle outliers and missing values well.
Fig. 11 schematically shows a flow chart of a data processing method according to yet another embodiment of the present disclosure.
As shown in fig. 11, the method provided by the embodiment of the present disclosure is different from the above-described embodiment in that the method may further include the following steps.
In step S1110, feature information of the object to be recognized is acquired.
For example, the characteristic information of the user to be identified may include any information related to identifying whether the user is a financial KOL user, such as financial information, social information, and the like.
In step S1120, the feature information of the object to be recognized is processed through the classification model, so as to obtain a recognition result of the object to be recognized, where the recognition result is a financial key opinion leader or a non-financial key opinion leader.
For example, the feature information of the user to be recognized is input into the trained classification model, and the classification model automatically outputs the recognition result of whether the user to be recognized is the financial KOL.
Fig. 12 schematically shows a flow chart of a data processing method according to yet another embodiment of the present disclosure. Embodiments of the present disclosure provide a financial KOL user mining scheme that combines improved SMOTE and RF models.
As shown in fig. 12, a data processing method provided by an embodiment of the present disclosure may include the following steps.
In step S1210, social and financial information of the user is obtained, and features of the user are constructed.
For example, historical data of users on social products and financial products is collected over the last M days (M being a positive integer greater than or equal to 1), including but not limited to: the number of people with whom red packets are sent and received, the number of red-packet transactions, and the red-packet amount; the corresponding person counts, transaction counts and amounts for transfers and for payments; the number of social comments published; the ratio of sent to received messages; the number of followers; the number of interactions; and the number of invited questions. Based on these collected historical data, the features of the user are constructed. For example, assuming M takes the values 7 days (one week), 15 days (half a month), 30 days (one month), 90 days (three months), 180 days (half a year) and 365 days (one year) respectively, the constructed initial feature set may include the following features: feature 1 is the number of people with whom red packets were sent and received in the last 7 days, feature 2 is the number of red-packet transactions in the last 7 days, feature 3 is the amount of red packets sent and received in the last 7 days, features 4 to 6 are the corresponding person count, transaction count and amount for transfers in the last 7 days, features 7 to 9 are the corresponding person count, transaction count and amount for payments in the last 7 days, feature 10 is the number of social comments published in the last 7 days, feature 11 is the ratio of sent to received messages in the last 7 days, feature 12 is the number of followers in the last 7 days (i.e., how many users follow this user), feature 13 is the number of interactions in the last 7 days, and feature 14 is the number of invited questions in the last 7 days; features 15 to 28 are the same statistics over the last 15 days; and so on for the other time windows. In other words, a large number of features are constructed in advance, and the feature values of all these features collected for P users together constitute the initial feature set.
In step S1220, differences between the features are quantified, the features are ranked and scored, and a batch of training samples is obtained.
In the embodiment of the present disclosure, based on statistical analysis and quantification of feature differences, important features obtained by analysis are ranked and sorted to recall (through data filtering) a batch of training samples, so as to form an original training data set, and a category label of each training sample in the training data set is automatically identified.
Taking a feature set that obeys a long-tail distribution (a subtype of the heavy-tailed distribution, also known as Zipf's law) as an example, the differences between the features are quantified based on statistical analysis methods, important features are extracted based on the maximum-discrimination principle, the extracted important features are comprehensively ranked and scored according to their importance based on the maximum-correlation principle, and an ordered feature sequence is output. The specific steps include:
1. The variance of all the features in the feature set is calculated. Features with larger variances are more useful; assuming a variance threshold of 1 is set, features with variances less than 1 are filtered out.
2. The correlation between the features is calculated using the above equation (2), and the correlation degree of each feature is calculated using the above equation (3).
3. Samples (users) are selected at a predetermined ratio, for example 1:3, as training samples. The top M1 features are then selected from the ordered feature sequence, where M1 may be set to 10, and K-Means clustering is performed on the training samples over these feature dimensions with K set to 2. Using a stepping strategy, one new feature from the ordered feature sequence is added each time and the clustering is redone, until the ratio of the smaller class to the total number of samples reaches the preset ratio of financial KOLs to all users, which may for example be set to 1/100, i.e., the data imbalance ratio N is 100 (a sketch of this stepping strategy follows below). This iteration is performed so that the smaller class can be taken as carrying the financial KOL category label.
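A minimal sketch of the stepping strategy, assuming X_sorted is the sample matrix with columns already ordered by feature relevance (the function name, the tolerance and the KMeans settings are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def label_by_stepped_kmeans(X_sorted, target_ratio=0.01, m1=10, tol=1e-3):
        """Cluster with K=2 on the top-m features, adding one feature per
        round until the smaller cluster's share is close to target_ratio."""
        for m in range(m1, X_sorted.shape[1] + 1):
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
                X_sorted[:, :m])
            minority = np.argmin(np.bincount(labels))
            if abs(np.mean(labels == minority) - target_ratio) < tol:
                break
        return labels == minority  # True marks candidate financial KOLs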
In step S1230, data is preprocessed and feature selection is performed.
The specific processes of data preprocessing and feature selection can refer to the above steps S1010-S1080.
In step S1240, financial KOL user diffusion is performed using SMOTE to update the training data set.
Because financial KOL users account for only a small proportion of the overall user population, the embodiment of the present disclosure uses the improved SMOTE to diffuse financial KOL users and thereby address the class imbalance problem. For the specific implementation process, reference may be made to the embodiments of fig. 8 and fig. 9. The newly interpolated synthesized minority class samples are added to the original training data set, i.e., minority class samples are added, so that the updated training data set contains as many minority class samples as majority class samples.
In step S1250, the RF model is trained using the updated training data set.
A loss function of the RF model is constructed in advance. The feature information of the training samples in the training data set is input into the RF model, the RF model outputs the prediction category of each training sample, the loss function is calculated according to the category labels of the training samples (for example, the category label of a financial KOL user is 0 and that of a non-financial-KOL user is 1) and the prediction categories, and the model parameters are optimized to minimize the value of the loss function. Iteration continues until a stopping condition is met, for example a preset number of iterations is reached or the loss function reaches a preset value.
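As a minimal sketch of step S1250 (toy stand-in data; with scikit-learn's RandomForestClassifier the iterative fitting described above is handled internally by the library, so the sketch reduces to a single fit call):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 20))    # stand-in for the balanced features
    y_train = rng.integers(0, 2, size=1000)  # 0 = financial KOL, 1 = non-KOL
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)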
In step S1260, social and financial information of the user to be identified is obtained, and the characteristics of the user to be identified are obtained.
Social and financial information of the current user to be identified is acquired, and the features of the user to be identified are constructed according to this information.
In step S1270, the characteristics of the user to be identified are input to the RF model, and the identification result of whether the user to be identified is a financial KOL user is output.
After the financial KOL users are diffused in step S1240, the RF model is trained, and then the features of the user to be identified are input into the trained RF model, so as to automatically output the identification result of whether the user to be identified is the financial KOL user.
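Continuing the training sketch above, steps S1260 and S1270 then amount to a prediction call (the feature vector here is a toy stand-in):

    # Classify a user to be identified with the rf model fitted above.
    x_new = rng.normal(size=(1, 20))
    is_financial_kol = rf.predict(x_new)[0] == 0  # label 0 denotes financial KOL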
The data processing method provided by the embodiment of the present disclosure discloses a scheme for mining financial KOL users that combines the improved SMOTE and the RF model, and relates to the mining of high-end financial users. On one hand, based on user information, the differences of the features are quantified through statistical analysis, so that the category labels of training samples are identified automatically. This solves the problem that existing methods require a large number of training samples to be labeled manually, reduces manual labeling, enables automatic identification of financial KOL users in business scenarios, and has good operability in industry. On the other hand, a more complete set of preprocessing and feature selection methods is designed for the characteristics of financial KOL data, making the feature-processing links such as handling abnormal values and missing values more complete, and improving the contribution of the features to the model. In addition, for the characteristics of financial-field users and KOL mining, the improved SMOTE is used to diffuse the KOL user group, solving the problem that conventional classification methods do not identify the features of the minority KOL samples distinctly enough because of the large difference in size between the financial KOL user group and the ordinary user group; the RF model is then combined to model and output the classification result of whether a user to be identified is a financial KOL user.
The data processing method provided by the embodiment of the present disclosure can be widely applied to the mining of key opinion leaders related to finance. For example, in stock-selection news and information scenarios, mining the opinion leaders of financial events helps to better understand industry viewpoints and market conditions, so as to effectively analyze future trends and even provide some guidance and supervision; in financial product operation and delivery scenarios, mining target user groups with dissemination power and influence over the product enables more accurate delivery and better diffusion among user groups, improving the product's active user volume and growth. Both the identification and the mining of financial KOLs are application scenarios of the present disclosure. Compared with financial KOL user mining methods in the related art, the embodiment of the present disclosure is the first to propose diffusing the financial KOL sample population using the improved SMOTE in the financial KOL identification process, and designs a system that constructs KOL category labels in an unsupervised manner based on statistical analysis and then generalizes them, which has high reference value and practical significance. On the whole, the method provided by the embodiment of the present disclosure has good operability, innovativeness and process completeness in financial KOL identification, and has high industrial application value and guiding significance.
Fig. 13 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 13, a data processing apparatus 1300 provided in an embodiment of the present disclosure may include: the device comprises a feature information acquisition unit 1310, a feature correlation degree acquisition unit 1320, a clustering result acquisition unit 1330, a sample class determination unit 1340, a minority class sample synthesis unit 1350 and a classification model training unit 1360.
The characteristic information acquiring unit 1310 may be configured to acquire characteristic information of an object. The feature correlation obtaining unit 1320 may be configured to obtain the correlation of the features according to the feature information of the object. The clustering result obtaining unit 1330 may be configured to perform clustering on the objects according to the relevance of the features to obtain a clustering result. The sample type determining unit 1340 may be configured to use fewer objects in the clustering result as minority class samples, and use more objects in the clustering result as majority class samples, where a ratio of the number of the minority class samples to the number of the majority class samples is 1: N, where N is a positive integer greater than 1 and is a data imbalance multiplying factor. The minority sample synthesis unit 1350 may be configured to diffuse the minority samples based on the minority samples and the majority samples to obtain synthesized minority samples. The classification model training unit 1360 may be used to train a classification model based on the minority class samples, the majority class samples, and the synthesized minority class samples.
In an exemplary embodiment, before the feature correlation obtaining unit 1320, the data processing apparatus 1300 may further include: a feature variance obtaining unit, configured to obtain a variance of the feature according to feature information of the object; a variance feature filtering unit, configured to filter a feature with a variance smaller than a variance threshold if the variance of the feature is smaller than the variance threshold.
In an exemplary embodiment, the feature correlation obtaining unit 1320 may include: a feature correlation obtaining unit, configured to obtain correlation between features according to feature information of the object; an average correlation obtaining unit, configured to obtain an average correlation of the features based on a correlation between the features; the correlation determination unit may be configured to determine the correlation of the feature according to the average correlation of the feature.
In an exemplary embodiment, the clustering result obtaining unit 1330 may include: the feature sorting unit may be configured to perform descending order on the features according to the relevance of the features to obtain an ordered feature sequence; a first feature extracting unit, configured to extract the first M1 features from the ordered feature sequence, where M1 is a positive integer greater than or equal to 1; a first clustering unit, which may be configured to perform clustering processing on the objects based on the top M1 features; the first result obtaining unit may be configured to, if a ratio of a number of objects with a small proportion to a number of objects with a large proportion is 1: N after the objects are clustered based on the first M1 features, take a result of clustering the objects based on the first M1 features as the clustering result.
In an exemplary embodiment, the clustering result obtaining unit 1330 may further include: a second feature extracting unit, configured to extract, if a ratio of a number of objects with a small proportion to a number of objects with a large proportion is not 1: N after clustering the objects based on the first M1 features, the first M2 features from the ordered feature sequence, where M2 is a positive integer greater than M1; a second clustering unit, which may be configured to cluster the objects based on the top M2 features; the second result obtaining unit may be configured to, if a ratio of the number of objects with a small proportion to the number of objects with a large proportion is 1: N after the objects are clustered based on the first M2 features, take a result of the objects being clustered based on the first M2 features as the clustering result.
In an exemplary embodiment, the minority class sample synthesis unit 1350 may include: a target sample determination unit operable to determine a target minority class sample from the minority class samples; a neighbor sample obtaining unit, configured to obtain neighbor samples of the target minority sample; a distance weight obtaining unit, configured to obtain a distance weight of the neighboring sample according to a distance between the neighboring sample and the target minority sample; a class weight obtaining unit, configured to obtain a class weight of the neighboring sample according to a sample class of the neighboring sample; a combined weight obtaining unit, configured to obtain a combined weight of the neighboring samples according to the distance weight and the class weight of the neighboring samples; a sample synthesis number unit, configured to determine a number of synthesized minority class samples between the target minority class sample and the neighboring sample according to the data imbalance multiplying factor N and the combined weight of the neighboring sample.
In an exemplary embodiment, the minority sample synthesis unit 1350 may further include: a first sample insertion unit, configured to insert the synthesized minority class sample between the neighbor sample and the target minority class sample if the neighbor sample is a minority class sample; a second sample insertion unit, configured to insert the synthesized minority class sample between the neighbor sample and the target minority class sample and near the target minority class sample if the neighbor sample is a majority class sample.
In an exemplary embodiment, before the minority class sample synthesis unit 1350, the data processing apparatus 1300 may further include at least one of: the system comprises a missing feature filtering unit, a single-value feature filtering unit, an abnormal value processing unit, a missing value filling unit, a feature derivation unit, a feature discretization processing unit, a feature encoding unit and a feature selecting unit.
The missing feature filtering unit may be configured to filter the feature with the number of missing values greater than the missing threshold value if the number of missing values in the feature is greater than the missing threshold value. The single-valued feature filtering unit may be configured to filter the single-valued feature if the feature is a single-valued feature. The outlier processing unit may be adapted to discard outliers in the feature. The missing value padding unit may be configured to perform padding processing on missing values in the feature in which the number of missing values is less than or equal to the missing threshold. The feature derivation unit may be configured to derive the feature. The feature discretization processing unit may be configured to discretize the continuous type feature if the feature is the continuous type feature. The feature encoding unit may be configured to perform a one-hot encoding process on the discrete feature if the feature is a discrete feature. The feature selection unit may be configured to select the feature by using chi-square test, and select a feature of a preset dimension.
In an exemplary embodiment, the classification model may be a random forest model. The data processing apparatus 1300 may further include: the device comprises an object information acquisition unit to be identified, a recognition unit and a recognition unit, wherein the object information acquisition unit can be used for acquiring the characteristic information of an object to be identified; the identification result obtaining unit may be configured to process the feature information of the object to be identified through the classification model, and obtain an identification result of the object to be identified, where the identification result is a financial key opinion leader or a non-financial key opinion leader.
The specific implementation of each unit in the data processing apparatus provided in the embodiment of the present disclosure may refer to the content in the data processing method, and is not described herein again.
It should be noted that although several units of the apparatus for performing actions are mentioned in the above detailed description, this division is not mandatory. Indeed, according to the embodiments of the present disclosure, the features and functions of two or more units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided and embodied by a plurality of units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A data processing method, comprising:
acquiring characteristic information of an object;
obtaining the correlation degree of the characteristics according to the characteristic information of the object;
clustering the objects according to the relevance of the features to obtain a clustering result;
taking the objects accounting for a smaller proportion of the clustering result as minority class samples, and taking the objects accounting for a larger proportion of the clustering result as majority class samples, wherein the ratio of the number of the minority class samples to the number of the majority class samples is 1:N, N being a data imbalance ratio and a positive integer greater than 1;
diffusing the minority class samples based on the minority class samples and the majority class samples to obtain synthesized minority class samples;
training a classification model according to the minority class samples, the majority class samples and the synthesized minority class samples.
2. The data processing method of claim 1, wherein before the obtaining the correlation of features from the feature information of the object, the method further comprises:
obtaining the variance of the characteristics according to the characteristic information of the object;
and if the variance of the features is smaller than the variance threshold value, filtering the features with the variance smaller than the variance threshold value.
3. The data processing method according to claim 1, wherein the obtaining a correlation of features from the feature information of the object comprises:
obtaining the correlation between the features according to the feature information of the object;
obtaining an average correlation of features based on the correlation between the features;
and determining the relevance of the features according to the average relevance of the features.
4. The data processing method according to any one of claims 1 to 3, wherein the clustering the objects according to the relevance of the features to obtain a clustering result comprises:
according to the correlation degree of the features, performing descending arrangement on the features to obtain an ordered feature sequence;
selecting the first M1 features from the ordered feature sequence, wherein M1 is a positive integer greater than or equal to 1;
clustering the object based on the top M1 features;
and if the ratio of the number of the objects with less proportion to the number of the objects with more proportion is 1: N after the objects are clustered based on the first M1 features, taking the result of clustering the objects based on the first M1 features as the clustering result.
5. The data processing method according to claim 4, wherein the clustering the objects according to the relevance of the features to obtain a clustering result further comprises:
if the ratio of the number of the objects with less proportion to the number of the objects with more proportion is not 1: N after the objects are clustered based on the first M1 features, selecting the first M2 features from the ordered feature sequence, wherein M2 is a positive integer larger than M1;
clustering the object based on the top M2 features;
and if the ratio of the number of the objects with less proportion to the number of the objects with more proportion is 1: N after the objects are clustered based on the first M2 features, taking the result of clustering the objects based on the first M2 features as the clustering result.
6. The data processing method of claim 1, wherein the diffusing the minority class samples based on the minority class samples and the majority class samples to obtain composite minority class samples comprises:
determining a target minority class sample from the minority class samples;
obtaining neighbor samples of the target minority sample;
obtaining distance weight of the neighbor sample according to the distance between the neighbor sample and the target minority sample;
obtaining the class weight of the neighbor sample according to the sample class of the neighbor sample;
obtaining the combined weight of the neighboring samples according to the distance weight and the category weight of the neighboring samples;
determining the number of synthesized minority class samples between the target minority class sample and the neighbor sample according to the data imbalance multiplying factor N and the combined weight of the neighbor sample.
7. The data processing method of claim 6, wherein the diffusing the minority class samples based on the minority class samples and the majority class samples to obtain composite minority class samples further comprises:
if the neighbor sample is a minority class sample, inserting the synthesized minority class sample between the neighbor sample and the target minority class sample;
if the neighbor sample is a majority class sample, inserting the composite minority class sample between the neighbor sample and the target minority class sample and near the target minority class sample.
8. The data processing method of claim 1, wherein before the diffusing the minority class samples based on the minority class samples and the majority class samples to obtain composite minority class samples, the method further comprises at least one of:
if the number of missing values in the features is larger than a missing threshold, filtering the features of which the number of missing values is larger than the missing threshold;
if the feature is a single-valued feature, filtering the single-valued feature;
discarding outliers in the features;
filling missing values in the features of which the number of the missing values is less than or equal to the missing threshold value;
performing a derivation process on the features;
if the feature is a continuous feature, discretizing the continuous feature;
if the feature is a discrete feature, performing one-hot encoding processing on the discrete feature;
and selecting the features by utilizing chi-square test, and selecting the features with preset dimensionality.
9. The data processing method of claim 1, wherein the classification model is a random forest model; wherein the method further comprises:
acquiring characteristic information of an object to be identified;
and processing the characteristic information of the object to be identified through the classification model to obtain an identification result of the object to be identified, wherein the identification result is a financial key opinion leader or a non-financial key opinion leader.
10. A data processing apparatus, comprising:
a characteristic information acquisition unit for acquiring characteristic information of an object;
a feature correlation obtaining unit configured to obtain correlation of features according to feature information of the object;
a clustering result obtaining unit, configured to perform clustering processing on the object according to the relevance of the feature to obtain a clustering result;
the sample type determining unit is used for taking the objects accounting for a smaller proportion of the clustering result as minority class samples and taking the objects accounting for a larger proportion of the clustering result as majority class samples, wherein the ratio of the number of the minority class samples to the number of the majority class samples is 1:N, N being a data imbalance ratio and a positive integer greater than 1;
a minority sample synthesis unit, configured to diffuse the minority sample based on the minority sample and the majority sample, and obtain a synthesized minority sample;
and the classification model training unit is used for training a classification model according to the minority class samples, the majority class samples and the synthesized minority class samples.
11. An electronic device, comprising:
one or more processors;
a storage device configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the data processing method of any one of claims 1 to 9.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 9.
CN201911284423.XA 2019-12-13 2019-12-13 Data processing method and device, electronic equipment and computer readable storage medium Pending CN111178399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911284423.XA CN111178399A (en) 2019-12-13 2019-12-13 Data processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911284423.XA CN111178399A (en) 2019-12-13 2019-12-13 Data processing method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111178399A true CN111178399A (en) 2020-05-19

Family

ID=70657291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911284423.XA Pending CN111178399A (en) 2019-12-13 2019-12-13 Data processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111178399A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738303A (en) * 2020-05-28 2020-10-02 华南理工大学 Long-tail distribution image identification method based on hierarchical learning
CN111738303B (en) * 2020-05-28 2023-05-23 华南理工大学 Long-tail distribution image recognition method based on hierarchical learning
CN111914927A (en) * 2020-07-30 2020-11-10 北京智能工场科技有限公司 Mobile app user gender identification method and system for optimizing data imbalance state
CN112269841A (en) * 2020-09-24 2021-01-26 华控清交信息科技(北京)有限公司 Data generation method and device for data generation
CN112932497A (en) * 2021-03-10 2021-06-11 中山大学 Unbalanced single-lead electrocardiogram data classification method and system
CN112633426A (en) * 2021-03-11 2021-04-09 腾讯科技(深圳)有限公司 Method and device for processing data class imbalance, electronic equipment and storage medium
CN113990352A (en) * 2021-10-22 2022-01-28 平安科技(深圳)有限公司 User emotion recognition and prediction method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination