CN112329816A - Data classification method and device, electronic equipment and readable storage medium - Google Patents

Data classification method and device, electronic equipment and readable storage medium

Info

Publication number
CN112329816A
Authority
CN
China
Prior art keywords
behavior
data
target
sample
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011075308.4A
Other languages
Chinese (zh)
Inventor
薛淼
孟格思
李敏
王瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202011075308.4A
Publication of CN112329816A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a data classification method and device, an electronic device, and a readable storage medium, and relate to the field of computer technology. A terminal side program records data that characterizes a user's behavior, so the target behavior data set built from that data, and the target behavior characteristics determined from the data set, also characterize the user's behavior. A target behavior category representing the user's behavior pattern can therefore be predicted from the target behavior characteristics using a pre-trained behavior classification model. Because the training sample set of that model includes generated samples (that is, virtual samples produced by a behavior feature sample generation model), the training sample set is large enough for the behavior classification model to be fully trained, and the trained model can predict the user's behavior category more accurately.

Description

Data classification method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data classification method and apparatus, an electronic device, and a readable storage medium.
Background
At present, as living standards improve, more and more services that depend on personal credit are appearing, such as car rental services. In these services the user's personal credit is important: once a user commits a breach of credit (for example, defrauding the provider of a vehicle or using a rented vehicle for illegal activities), the service provider (for example, a car rental company) can suffer heavy losses.
Because each user uses such services infrequently, little historical data is stored overall, and it is difficult to classify users' credit ratings using the stored historical data alone.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data classification method, an apparatus, an electronic device, and a readable storage medium, so that a trained behavior classification model can more accurately predict a behavior class of a user.
In a first aspect, a data classification method is provided, and the method includes:
acquiring a target behavior data set, wherein the target behavior data set comprises data on a plurality of human-computer interaction behaviors and network access behaviors recorded through a terminal side program;
determining target behavior characteristics based on the target behavior data set, wherein the target behavior characteristics are used for representing the portrait corresponding to the target behavior data set; and
determining, with the target behavior characteristics as input to a pre-trained behavior classification model, the target behavior category output by the model, wherein the pre-trained behavior classification model is obtained by training on a training sample set, the training sample set comprises a plurality of generated samples, and the generated samples are generated by a pre-trained behavior feature sample generation model.
Optionally, the behavior feature sample generation model includes a generator module and a discriminator module, and the behavior feature sample generation model is trained based on the following steps:
acquiring a first preset number of real behavior feature samples, wherein the real behavior feature samples are used for representing behavior features corresponding to the acquired behavior data;
generating a second preset number of virtual behavior feature samples based on the generator module;
determining a loss function between the real behavior feature sample and the virtual behavior feature sample based on the discriminator module; and
adjusting parameters of the behavior feature sample generation model based on the loss function.
Optionally, the method further includes:
obtaining a plurality of virtual behavior characteristics generated by the generator module after the parameters are adjusted;
performing discrimination operation on the plurality of virtual behavior features based on the discriminator module, and determining discrimination probabilities corresponding to the plurality of virtual behavior features, wherein the discrimination probabilities are used for representing the probability that the discriminator module judges that the virtual behavior features are real behavior features or representing the probability that the discriminator module judges that the virtual behavior features are virtual behavior features; and
in response to the discrimination probability not being within a preset threshold range, adjusting parameters of the behavior feature sample generation model so that the discrimination probability falls within the preset threshold range.
Optionally, the training sample set further includes the real behavior feature sample, a label corresponding to the real behavior feature sample, and a label corresponding to the generated sample;
the behavior classification model is trained based on the following steps:
acquiring a training sample set;
taking the generated sample and the real behavior feature sample as input, and determining a behavior class output by the behavior classification model; and
adjusting parameters of the behavior classification model based on the behavior category, the label corresponding to the real behavior feature sample and the label corresponding to the generated sample.
Optionally, the target behavior data set includes basic information, performance information and buried point information, where the basic information is used to represent inherent attributes, the performance information is used to represent credit behaviors, and the buried point information is used to represent data collected by a preset buried point (event tracking) algorithm;
the obtaining of the target behavior data set includes:
acquiring the buried point information from a preset first database, wherein the first database is used for collecting and preprocessing the buried point data to determine the buried point information, and the preprocessing comprises data structuring;
acquiring the basic information and the performance information from a preset second database, wherein the second database is used for collecting and preprocessing basic data and performance data to determine the basic information and the performance information; and
determining the target behavior data set based on the basic information, the performance information and the buried point information.
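As a rough illustration of the three acquisition steps above, the following sketch merges the three kinds of information into one data set. The in-memory dictionaries, the user-ID key, and all field names are hypothetical stand-ins; the text does not specify how the two databases are keyed or queried.

```python
from typing import Any, Dict

# Hypothetical in-memory stand-ins for the two preset databases, keyed by
# user ID (the keying scheme is an assumption, not stated in the text).
FIRST_DATABASE: Dict[str, Dict[str, Any]] = {
    "user_a": {"page_views": 12, "order_clicks": 3},  # buried point information
}
SECOND_DATABASE: Dict[str, Dict[str, Dict[str, Any]]] = {
    "user_a": {
        "basic": {"age": 30, "city": "Beijing"},
        "performance": {"overdue_count": 0},
    },
}


def get_target_behavior_dataset(user_id: str) -> Dict[str, Dict[str, Any]]:
    """Combine the three kinds of information into one target behavior data set."""
    buried_point = FIRST_DATABASE.get(user_id, {})
    record = SECOND_DATABASE.get(user_id, {})
    return {
        "basic": dict(record.get("basic", {})),
        "performance": dict(record.get("performance", {})),
        "buried_point": dict(buried_point),
    }
```

Keeping the three groups under separate keys is only one possible layout; a flat record would serve equally well for later feature extraction.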
Optionally, the first database is a data warehouse hive processor, and the second database is a full link processor.
Optionally, the behavior feature sample generation model is built based on a generative adversarial network (GAN), and the behavior classification model includes at least one of a logistic regression model, a gradient boosting decision tree (GBDT) model, XGBoost (a distributed gradient boosting library), a deep learning model, or an end-to-end model.
In a second aspect, there is provided an apparatus for data classification, the apparatus comprising:
a first acquisition module, configured to acquire a target behavior data set, where the target behavior data set comprises data on a plurality of human-computer interaction behaviors and network access behaviors recorded through a terminal side program;
a first determining module, configured to determine target behavior characteristics based on the target behavior data set, where the target behavior characteristics are used to represent the portrait corresponding to the target behavior data set; and
a second determining module, configured to determine, with the target behavior characteristics as input to a pre-trained behavior classification model, the target behavior category output by the model, where the pre-trained behavior classification model is obtained by training on a training sample set, the training sample set comprises a plurality of generated samples, and the generated samples are generated by a pre-trained behavior feature sample generation model.
Optionally, the behavior feature sample generation model includes a generator module and a discriminator module, and the apparatus further includes:
a second acquisition module, configured to acquire a first preset number of real behavior feature samples, where the real behavior feature samples are used to represent behavior features corresponding to the acquired behavior data;
the generator module, configured to generate a second preset number of virtual behavior feature samples;
the discriminator module, configured to determine a loss function between the real behavior feature sample and the virtual behavior feature sample; and
a first adjusting module, configured to adjust the parameters of the behavior feature sample generation model based on the loss function.
Optionally, the apparatus further comprises:
a third acquisition module, configured to acquire a plurality of virtual behavior features generated by the generator module after the parameters are adjusted;
the discriminator module, configured to perform a discrimination operation on the plurality of virtual behavior features and determine discrimination probabilities corresponding to the plurality of virtual behavior features, where the discrimination probabilities are used to represent the probability that the discriminator module judges the virtual behavior features to be real behavior features, or the probability that the discriminator module judges the virtual behavior features to be virtual behavior features; and
a second adjusting module, configured to adjust the parameters of the behavior feature sample generation model in response to the discrimination probability not being within a preset threshold range, so that the discrimination probability falls within the preset threshold range.
Optionally, the training sample set further includes the real behavior feature sample, a label corresponding to the real behavior feature sample, and a label corresponding to the generated sample;
the device further comprises:
a fourth acquisition module, configured to acquire a training sample set;
a third determining module, configured to determine a behavior category output by the behavior classification model, using the generated sample and the real behavior feature sample as inputs; and
a third adjusting module, configured to adjust the parameters of the behavior classification model based on the behavior category, the label corresponding to the real behavior feature sample and the label corresponding to the generated sample.
Optionally, the target behavior data set includes basic information, performance information and buried point information, where the basic information is used to represent inherent attributes, the performance information is used to represent credit behaviors, and the buried point information is used to represent data collected by a preset buried point (event tracking) algorithm;
the first obtaining module is specifically configured to:
acquiring the buried point information from a preset first database, wherein the first database is used for collecting and preprocessing the buried point data to determine the buried point information, and the preprocessing comprises data structuring;
acquiring the basic information and the performance information from a preset second database, wherein the second database is used for collecting and preprocessing basic data and performance data to determine the basic information and the performance information; and
determining the target behavior data set based on the basic information, the performance information and the buried point information.
Optionally, the first database is a data warehouse hive processor, and the second database is a full link processor.
Optionally, the behavior feature sample generation model is built based on a generative adversarial network (GAN), and the behavior classification model includes at least one of a logistic regression model, a gradient boosting decision tree (GBDT) model, XGBoost (a distributed gradient boosting library), a deep learning model, or an end-to-end model.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is used to store one or more computer program instructions, where the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium on which computer program instructions are stored, which when executed by a processor implement the method according to the first aspect.
In the embodiment of the invention, the terminal side program can record data on human-computer interaction behaviors and network access behaviors, and this data can be used to characterize the user's behavior. The target behavior data set corresponding to the data, and the target behavior characteristics that the server determines from that data set, can therefore also characterize the user's behavior. The server can then predict a target behavior category, which represents the user's behavior pattern, based on the pre-trained behavior classification model and the target behavior characteristics. In addition, because the training sample set of the pre-trained behavior classification model includes generated samples (that is, virtual samples generated by the behavior feature sample generation model), the training sample set contains enough samples for the behavior classification model to be fully trained, so the trained model can predict the user's behavior category more accurately.
Drawings
The above and other objects, features and advantages of the embodiments of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a diagram of a data classification system according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data classification method according to an embodiment of the present invention;
FIG. 3 is a flow chart of another data classification method provided by the embodiment of the invention;
FIG. 4 is a schematic diagram of a training process of a behavior feature sample generation model according to an embodiment of the present invention;
FIG. 5 is a flow chart of another data classification method provided by the embodiment of the invention;
FIG. 6 is an exemplary flowchart of a data classification method provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a data classification apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the embodiment of the invention, target users can be classified, based on the pre-trained behavior classification model, using the human-computer interaction behavior data and network access behavior data of a plurality of target users recorded by the terminal side program.
Specifically, FIG. 1 is a schematic diagram of a data classification system according to an embodiment of the present invention, which includes a plurality of terminal devices 1, a server 2, and a user a.
The terminal device 1 may be a smart phone, a tablet computer, a personal computer (PC), or the like, and the server 2 may be a single server, a server cluster configured in a distributed manner, or a cloud server.
In the embodiment of the present invention, the terminal device 1 may be a smart phone or a PC used by the user a, and when the user a uses the terminal device 1, data of a human-computer interaction behavior and a network access behavior are often generated, and these data may be used to characterize a behavior classification attribute of the user.
Furthermore, the server 2 may collect data of human-computer interaction behavior and network access behavior generated by the user a using the terminal device 1 based on the network communication connection with each terminal device 1, and then determine the behavior class of the user a according to the data corresponding to the user a based on a preset behavior classification model.
In one scenario, the behavior category of the user a may be a credit level classification of the user a, that is, the data of the man-machine interaction behavior and the network access behavior generated by the user a using the terminal device 1 may be specifically used to characterize the credit level of the user, and further, the server 2 may determine the credit level classification (for example, excellent credit, good credit, or poor credit, etc.) corresponding to the user a according to the data corresponding to the user a and a preset behavior classification model.
In another scenario, the behavior category of the user a may be a habit category of the user a, and the human-computer interaction behavior and the network access behavior data generated by the user a using the terminal device 1 may also be specifically used to characterize the behavior habit of the user, and further, the server 2 may determine the behavior habit category (for example, robust type or risk preference type) corresponding to the user a according to the data corresponding to the user a and a preset behavior classification model.
For clarity, the data classification method provided in an embodiment of the present invention is described in detail below with reference to a specific embodiment. As shown in FIG. 2, the specific steps are as follows:
At step 100, a target behavior data set is obtained.
The target behavior data set comprises a plurality of man-machine interaction behaviors and network access behavior data recorded through a terminal side program.
At step 200, target behavior characteristics are determined based on the target behavior dataset.
The target behavior characteristics are used for representing the portrait corresponding to the target behavior data set.
At step 300, the target behavior characteristics are taken as input to the pre-trained behavior classification model, and the target behavior category output by the model is determined.
The pre-trained behavior classification model is obtained by training on a training sample set, the training sample set comprises a plurality of generated samples, and the generated samples are generated by the pre-trained behavior feature sample generation model.
In the embodiment of the invention, the terminal side program can record data on human-computer interaction behaviors and network access behaviors, and this data can be used to characterize the user's behavior. The target behavior data set corresponding to the data, and the target behavior characteristics that the server determines from that data set, can therefore also characterize the user's behavior. The server can then predict a target behavior category, which represents the user's behavior pattern, based on the pre-trained behavior classification model and the target behavior characteristics. In addition, because the training sample set of the pre-trained behavior classification model includes generated samples (that is, virtual samples generated by the behavior feature sample generation model), the training sample set contains enough samples for the behavior classification model to be fully trained, so the trained model can predict the user's behavior category more accurately.
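Steps 100 to 300 can be pictured with the following minimal sketch. The record fields, the two features, the category names, and the fixed logistic weights are all hypothetical; they stand in for a real feature pipeline and a real pre-trained behavior classification model, neither of which is specified here.

```python
import math
from typing import Dict, List

# Hypothetical category names and model weights, for illustration only.
CATEGORIES = ["poor credit", "good credit"]
WEIGHTS = [0.8, -1.5]  # one weight per feature
BIAS = 0.2


def extract_features(dataset: List[Dict[str, float]]) -> List[float]:
    """Step 200: reduce recorded behavior events to a fixed-length feature vector."""
    n = len(dataset)
    if n == 0:
        return [0.0, 0.0]
    mean_session = sum(e["session_minutes"] for e in dataset) / n
    failed_ratio = sum(e["failed_payment"] for e in dataset) / n
    return [mean_session / 60.0, failed_ratio]


def classify(features: List[float]) -> str:
    """Step 300: feed the features to a stand-in pre-trained classifier."""
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    prob_good = 1.0 / (1.0 + math.exp(-score))
    return CATEGORIES[1] if prob_good >= 0.5 else CATEGORIES[0]


# Step 100: the target behavior data set (hypothetical records).
target_dataset = [
    {"session_minutes": 30.0, "failed_payment": 0.0},
    {"session_minutes": 90.0, "failed_payment": 0.0},
]
target_features = extract_features(target_dataset)
target_category = classify(target_features)
```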
Further, the behavior feature sample generation model may be a model established based on a Generative Adversarial Network (GAN), and specifically may include a generator module and a discriminator module.
The GAN model is a deep learning model, and the trained GAN model can generate data with specified attributes based on random noise.
The generator module can be a neural network, the input of the generator module can be random data (random noise), and designated data is output.
The discriminator module may also be a neural network, and the input of the discriminator module may be a real data set and a virtual data set (i.e., a data set generated by the generator module), and the output is the discrimination result, for example, after the discriminator module receives the data a, if the output of the discriminator module is "1", the discriminator module judges that the data a is real data, that is, the data a is from the real data set, and if the output of the discriminator module is "0", the discriminator module judges that the data a is virtual data, that is, the data a is from the virtual data set.
In summary, in the embodiment of the present invention, the generator module may be configured to generate virtual data, and the discriminator module may be configured to judge whether data is virtual data; that is, the combination of the generator module and the discriminator module forms an adversarial network.
In order to make the virtual data generated by the generator module similar to the real data, the embodiment of the invention can train the behavior feature sample generation model until the trained discriminator module cannot distinguish real data from virtual data, that is, until the discrimination probability of the trained discriminator module is 50% for both real data and virtual data (the discrimination probability refers to the probability that the discriminator module judges a given piece of data to be real data or virtual data).
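The 50% figure matches the known equilibrium of the standard GAN objective. The text does not state which objective is used, so the following assumes the original minimax formulation; the symbols G (generator), D (discriminator), p_data (real data distribution), p_z (noise distribution) and p_g (distribution of generated data) are introduced here for illustration.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]

D^{*}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)},
\qquad
p_g = p_{\text{data}} \;\Rightarrow\; D^{*}(x) = \tfrac{1}{2}
```

Under this assumption, once the generated distribution matches the real one, the best possible discriminator outputs 1/2 for every input, which is the "cannot distinguish" condition described above.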
Specifically, as shown in FIG. 3, the behavior feature sample generation model may be trained based on the following steps:
At step 31, a first preset number of real behavior feature samples are obtained.
The real behavior feature sample is used for representing behavior features corresponding to the collected behavior data.
For better explanation, an embodiment of the present invention further provides a schematic diagram of the training process of the behavior feature sample generation model. As shown in FIG. 4, the diagram takes a GAN model as an example and includes: a generator module, a discriminator module, random noise X, a virtual behavior feature sample Y, a real behavior feature sample Z, and a loss function.
With reference to the content shown in fig. 4, the first preset number of real behavior feature samples in step 31 is the real behavior feature sample Z in fig. 4, where the first preset number may be appropriately adjusted according to an actual situation, and the first preset number is not limited in the embodiment of the present invention.
At step 32, a second preset number of virtual behavior feature samples are generated based on the generator module.
With reference to the content shown in fig. 4, the second preset number of virtual behavior feature samples is the virtual behavior feature sample Y in fig. 4, specifically, the random noise X may be used as an input of the generator module, and the generator module may generate the virtual behavior feature sample Y based on the random noise X, where the second preset number may be adjusted according to an actual situation, and the embodiment of the present invention is not limited.
At step 33, based on the discriminator module, a loss function between the real behavior feature samples and the virtual behavior feature samples is determined.
The loss function is used to adjust the model parameters; common choices include the hinge loss function, the cross-entropy loss function, and the exponential loss function.
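For reference, the three loss functions named above can be written in their usual per-sample textbook forms as follows. This is a sketch of the standard definitions, not of any particular loss configuration described in this document; the label conventions ({-1, +1} for hinge and exponential loss, {0, 1} for cross-entropy) are the conventional ones and are assumed here.

```python
import math


def hinge_loss(label: int, score: float) -> float:
    """Hinge loss for a label in {-1, +1} and a raw model score."""
    return max(0.0, 1.0 - label * score)


def cross_entropy_loss(label: int, prob: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for a label in {0, 1} and a predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def exponential_loss(label: int, score: float) -> float:
    """Exponential loss for a label in {-1, +1} and a raw model score."""
    return math.exp(-label * score)
```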
At step 34, parameters of the behavior feature sample generation model are adjusted based on the loss function.
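Steps 31 to 34 can be sketched on a deliberately tiny example. Everything below is an assumption made for illustration: samples are one-dimensional, the generator is the affine map y = a*z + b, the discriminator is a single logistic unit D(x) = sigmoid(w*x + c), the loss is cross-entropy, and the generator is updated with the commonly used non-saturating loss -log D(G(z)). A real behavior feature sample generation model would use neural networks for both modules.

```python
import math
import random
from typing import Dict, List, Optional


def sigmoid(s: float) -> float:
    return 1.0 / (1.0 + math.exp(-s))


def train_step(params: Dict[str, float], real: List[float], n_fake: int,
               lr: float = 0.05, rng: Optional[random.Random] = None) -> float:
    """Run steps 32-34 once on 1-D samples; return the discriminator loss."""
    rng = rng or random.Random()
    a, b, w, c = params["a"], params["b"], params["w"], params["c"]

    # Step 32: the generator maps random noise X to virtual samples Y.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_fake)]
    fake = [a * z + b for z in noise]

    # Step 33: the discriminator scores both sets; cross-entropy loss.
    d_real = [sigmoid(w * x + c) for x in real]
    d_fake = [sigmoid(w * y + c) for y in fake]
    eps = 1e-12
    loss = (-sum(math.log(p + eps) for p in d_real) / len(real)
            - sum(math.log(1.0 - p + eps) for p in d_fake) / n_fake)

    # Step 34 (discriminator): gradient of the loss with respect to w and c.
    grad_w = (sum(-(1.0 - p) * x for p, x in zip(d_real, real)) / len(real)
              + sum(p * y for p, y in zip(d_fake, fake)) / n_fake)
    grad_c = (sum(-(1.0 - p) for p in d_real) / len(real)
              + sum(d_fake) / n_fake)

    # Step 34 (generator): gradient of -log D(G(z)) with respect to a and b.
    grad_a = sum(-(1.0 - p) * w * z for p, z in zip(d_fake, noise)) / n_fake
    grad_b = sum(-(1.0 - p) * w for p in d_fake) / n_fake

    params["w"], params["c"] = w - lr * grad_w, c - lr * grad_c
    params["a"], params["b"] = a - lr * grad_a, b - lr * grad_b
    return loss


# Step 31: hypothetical real samples, here drawn around 3.0.
rng = random.Random(0)
real_samples = [rng.gauss(3.0, 0.5) for _ in range(64)]
params = {"a": 1.0, "b": 0.0, "w": 0.1, "c": 0.0}
loss = train_step(params, real_samples, n_fake=64, rng=rng)
```

Updating both modules in the same step is a simplification; alternating several discriminator updates with one generator update is an equally valid schedule.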
In practical applications, a behavior feature sample generation model that has been adjusted only once has usually not converged, so the embodiment of the present invention needs to verify the adjusted behavior feature sample generation model. Specifically, the process may be executed as follows: obtaining a plurality of virtual behavior features generated by the generator module after the parameters are adjusted; performing a discrimination operation on the plurality of virtual behavior features based on the discriminator module, and determining discrimination probabilities corresponding to the plurality of virtual behavior features; and adjusting parameters of the behavior feature sample generation model in response to the discrimination probability not being within the preset threshold range, so that the discrimination probability falls within the preset threshold range.
The discrimination probability is used to represent the probability that the discriminator module judges the virtual behavior feature to be a real behavior feature, or the probability that the discriminator module judges the virtual behavior feature to be a virtual behavior feature.
It should be noted that the preset threshold range may be adjusted according to the actual situation; for example, the preset threshold range may be 0.49-0.51. Alternatively, a single threshold (for example, 0.5) may be set directly, that is, when the discrimination probability output by the discriminator module is 0.5, the training of the behavior feature sample generation model is complete.
Combining the training and verification processes above, training the behavior feature sample generation model is essentially a process of cyclically adjusting parameters: after m real data samples and n virtual data samples generated by the generator module are acquired, the discriminator module distinguishes the real data from the virtual data, and a loss function is calculated.
Then, the embodiment of the present invention may cyclically update the parameters of the discriminator module and of the generator module k times based on the verification process; when the verification passes (that is, the discrimination probability output by the discriminator module after the parameters are adjusted is within the preset threshold range), the training of the behavior feature sample generation model is complete.
In the embodiment of the present invention, the values of m, n, and k are not limited.
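As a minimal sketch of the verification loop described above, the following Python code adjusts parameters up to k times and stops once the discrimination probability falls within the preset threshold range. The function names, the callback parameters `generate`, `discriminate` and `adjust_parameters`, and the use of the mean to aggregate the probabilities of the n virtual samples are assumptions made for illustration; the embodiment does not specify how the probabilities are aggregated.

```python
from statistics import mean
from typing import Callable, Sequence


def discrimination_in_range(
    probs: Sequence[float], low: float = 0.49, high: float = 0.51
) -> bool:
    """Return True when the mean discrimination probability lies in [low, high]."""
    return low <= mean(probs) <= high


def train_with_verification(
    generate: Callable[[int], list],
    discriminate: Callable[[list], list],
    adjust_parameters: Callable[[], None],
    n: int,
    k: int,
    low: float = 0.49,
    high: float = 0.51,
) -> bool:
    """Adjust parameters up to k times; stop as soon as verification passes.

    generate, discriminate and adjust_parameters are hypothetical callbacks
    standing in for the generator module, the discriminator module and one
    loss-based parameter update of the generation model.
    """
    for _ in range(k):
        adjust_parameters()            # one round of parameter adjustment
        virtual = generate(n)          # n virtual behavior features
        probs = discriminate(virtual)  # one probability per virtual feature
        if discrimination_in_range(probs, low, high):
            return True                # verification passed, training done
    return False                       # not converged within k rounds
```

A return value of False indicates that the model did not pass verification within k rounds, in which case a caller might continue training or revisit the threshold range.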
After the training of the behavior feature sample generation model is completed, the behavior feature sample generation model can be used for outputting a generation sample, the generation sample and the real behavior feature sample can be used as training samples of the behavior classification model, and then the behavior classification model can be trained based on a training set containing the generation sample and the real behavior feature sample.
The behavior classification model comprises at least one of a logistic regression (LR) model, a gradient boosting decision tree (GBDT) model, an XGBoost (a distributed gradient boosting library) model, a deep learning model, or an end-to-end model.
Each of these models can be used as part of the behavior classification model. The LR model is commonly used for classification tasks in machine learning; it is essentially a generalized linear model, and has the advantages of a simple structure, fast training, and a good probabilistic interpretation of its output.
The GBDT model is a decision tree model trained with the gradient boosting strategy, and can classify data based on decision trees.
In addition, the GBDT model used alone is prone to overfitting, and therefore, in practical applications, the GBDT model and the LR model may be combined to implement a data classification function, that is, behavior classification is implemented by the GBDT + LR model.
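The embodiment does not specify how the GBDT model and the LR model are combined. One common approach, assumed here purely for illustration, is to encode the leaf that each tree assigns to a sample as a one-hot vector and to feed the concatenated vector to the LR model. In the sketch below, `tree_a` and `tree_b` are hand-written stand-ins for trained trees, and their thresholds, as well as the LR weights passed by the caller, are hypothetical.

```python
import math
from typing import Callable, List, Sequence

# A "tree" here is any function mapping a feature vector to a leaf index.
Tree = Callable[[Sequence[float]], int]


def leaf_one_hot(
    x: Sequence[float], trees: Sequence[Tree], leaves_per_tree: Sequence[int]
) -> List[float]:
    """Concatenate one one-hot block per tree, marking the leaf that x reaches."""
    encoded: List[float] = []
    for tree, n_leaves in zip(trees, leaves_per_tree):
        leaf = tree(x)
        encoded.extend(1.0 if i == leaf else 0.0 for i in range(n_leaves))
    return encoded


def lr_probability(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Logistic regression output: sigmoid of the weighted feature sum."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hand-written stand-ins for trained trees (hypothetical thresholds).
def tree_a(x: Sequence[float]) -> int:
    return 0 if x[0] < 0.5 else 1


def tree_b(x: Sequence[float]) -> int:
    if x[1] < 2.0:
        return 0
    return 1 if x[1] < 5.0 else 2
```

In such a combination the trees supply non-linear feature crossings, while the LR stage keeps the probabilistic output noted above.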
XGBoost is a scalable machine learning system available as an open-source software package, and its impact has been widely recognized in a large number of machine learning and data mining challenges. In the embodiment of the invention, XGBoost can still classify well as the data volume keeps growing.
The deep learning model is a model established based on a deep neural network, and can realize accurate classification based on good learning ability.
Unlike a traditional machine learning model, which consists of a plurality of independent modules, the end-to-end model integrates the modules and treats them as a whole, which simplifies the model training process and increases fault tolerance.
Further, as shown in fig. 5, the process of training the behavior classification model may include the following steps:
in step 51, a set of training samples is obtained.
In addition to the generated samples, the training sample set comprises the real behavior feature samples, the labels corresponding to the real behavior feature samples, and the labels corresponding to the generated samples.
The label of a real behavior feature sample can be manually annotated based on the data corresponding to that sample, and the label of a generated sample can be assigned based on the similarity between the generated sample and each real behavior feature sample, that is, the generated sample can take the same label as the real behavior feature sample most similar to it.
In step 52, the generated samples and the real behavior feature samples are used as input, and the behavior category output by the behavior classification model is determined.
In step 53, parameters of the behavior classification model are adjusted based on the behavior class, the label corresponding to the real behavior feature sample, and the label corresponding to the generated sample.
In the embodiment of the invention, because the trained behavior feature sample generation model can generate a large number of generated samples, the training sample set of the behavior classification model contains a sufficient number of samples, so that the behavior classification model can be fully trained and the trained behavior classification model can accurately predict the behavior category.
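The labelling rule described for step 51, in which a generated sample takes the label of the most similar real behavior feature sample, may be sketched as follows. Euclidean distance is assumed as the similarity measure, since the embodiment does not name one, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def label_generated(
    generated: Sequence[Sequence[float]],
    real_samples: Sequence[Sequence[float]],
    real_labels: Sequence[str],
) -> List[Tuple[Sequence[float], str]]:
    """Give each generated sample the label of its nearest real sample."""
    labelled = []
    for g in generated:
        nearest = min(
            range(len(real_samples)),
            key=lambda i: euclidean(g, real_samples[i]),
        )
        labelled.append((g, real_labels[nearest]))
    return labelled


def build_training_set(
    generated: Sequence[Sequence[float]],
    real_samples: Sequence[Sequence[float]],
    real_labels: Sequence[str],
) -> List[Tuple[Sequence[float], str]]:
    """Training set = labelled real samples + labelled generated samples."""
    real = list(zip(real_samples, real_labels))
    return real + label_generated(generated, real_samples, real_labels)
```

The resulting list of (sample, label) pairs can then serve as the input and supervision for steps 52 and 53.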
When the training of the behavior classification model is completed, a target behavior class may be determined based on the trained behavior classification model and the target behavior dataset.
The target behavior data set can comprise basic information, performance information and buried point (event tracking) information, wherein the basic information is used for representing inherent attributes, the performance information is used for representing credit behaviors, and the buried point information is used for representing data collected by a preset buried point algorithm.
For example, in a car rental application scenario, the target user may be a user of a car rental service, and the basic information in the target behavior data set may include the gender, age, year and month of birth, nature of work, marital status, and the like of the target user.
The performance information may be used to characterize the target user's performance during and after the rental, such as whether the target user returned the rental car on schedule, whether the target user rented other cars at the same time, whether the target user continued to rent after the rental was completed, and the like.
The buried point information can be used to characterize the network access behavior data of the target user, such as the sequential behavior (click sequence, etc.) of the target user in the car rental program.
Specifically, the step of obtaining the target behavior data set may be performed as: acquiring buried point information from a preset first database; acquiring basic information and performance information from a preset second database; and determining a target behavior data set based on the basic information, the performance information and the buried point information.
The first database is used for collecting and preprocessing the data of the buried points to determine the information of the buried points, the second database is used for collecting and preprocessing basic data and performance data to determine the basic information and the performance information, and the preprocessing comprises data structuring processing.
Further, the first database may be a Hive data warehouse processor and the second database may be a full-link processor.
Hive is a data warehouse tool based on Hadoop (a distributed system infrastructure) and can be used for data extraction, transformation and loading. In practical application, Hive can be used for storing and preprocessing big data, and the full-link processor can be used for storing and preprocessing basic data and performance data.
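The acquisition of the target behavior data set from the two databases may be sketched as follows. The two dictionaries are in-memory stand-ins for the first database and the second database, and the field names, the sample records and the ordering of events by timestamp are assumptions made for illustration rather than details given in the embodiment.

```python
from typing import Any, Dict, List

# In-memory stand-ins for the two stores (hypothetical data and field names).
FIRST_DB: Dict[str, List[Dict[str, Any]]] = {
    "u1": [{"event": "click_rent", "ts": 2}, {"event": "open_app", "ts": 1}],
}
SECOND_DB: Dict[str, Dict[str, Dict[str, Any]]] = {
    "u1": {"basic": {"age": 30}, "performance": {"returned_on_time": True}},
}


def get_target_behavior_dataset(
    user_id: str,
    first_db: Dict[str, List[Dict[str, Any]]],
    second_db: Dict[str, Dict[str, Dict[str, Any]]],
) -> Dict[str, Any]:
    """Merge buried point, basic and performance information for one user."""
    buried = first_db.get(user_id, [])   # buried point information
    record = second_db.get(user_id, {})  # basic and performance information
    return {
        "basic": record.get("basic", {}),
        "performance": record.get("performance", {}),
        # Order events by timestamp so sequential behavior is preserved.
        "buried_point": sorted(buried, key=lambda e: e["ts"]),
    }
```

Keeping the three kinds of information under separate keys leaves the later feature determination free to treat sequential buried point data differently from static attributes.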
With reference to the foregoing embodiments, fig. 6 is an exemplary flowchart of a data classification method provided in an embodiment of the present invention. The flowchart involves: a buried point information acquisition device, a basic information acquisition device, a performance information acquisition device, a hive processor, a full-link processor, a user credit portrait module and a behavior classification model.
Specifically, the buried point information acquisition device can acquire buried point information corresponding to a target user, the basic information acquisition device can acquire basic information corresponding to the target user, and the performance information acquisition device can acquire performance information corresponding to the target user.
Then, the buried point information acquisition device can send the acquired buried point information to the hive processor, so that the hive processor preprocesses (data structuring processing) and stores the buried point information corresponding to the target user.
The basic information acquisition device can send the acquired basic information to the full-link processor, so that the full-link processor can preprocess and store the basic information corresponding to the target user, and similarly, the performance information acquisition device can send the acquired performance information to the full-link processor, so that the full-link processor can preprocess and store the performance information corresponding to the target user.
When credit rating classification is performed on a target user, the user credit portrait module can extract the buried point information of the target user from the hive processor and extract the basic information and the performance information of the target user from the full-link processor, and then the user credit portrait module can determine the target behavior characteristics corresponding to the target user based on the buried point information, the basic information and the performance information of the target user, wherein the target behavior characteristics represent the user portrait of the target user.
Further, the user credit portrait module may send the target behavior feature to a pre-trained behavior classification model, and the pre-trained behavior classification model may then output a credit rating classification corresponding to the target user (the credit rating classification of the target user is the target behavior category of the target user).
In the embodiment of the invention, a terminal side program can record credit behavior data of a target user, so that a user credit portrait module can determine a credit portrait of the target user according to the credit behavior data of the target user, and further, a pre-trained behavior classification model can predict credit rating classification of the target user according to the credit portrait of the target user.
Based on the same technical concept, an embodiment of the present invention further provides a data classification apparatus, as shown in fig. 7, the apparatus includes: a first obtaining module 71, a first determining module 72 and a second determining module 73;
a first obtaining module 71, configured to obtain a target behavior data set, where the target behavior data set includes a plurality of human-computer interaction behaviors and network access behavior data recorded through a terminal-side program;
a first determining module 72, configured to determine a target behavior feature based on the target behavior data set, where the target behavior feature is used to represent a portrait corresponding to the target behavior data set; and
a second determining module 73, configured to determine, based on the pre-trained behavior classification model and with the target behavior feature as an input, a target behavior category output by the pre-trained behavior classification model, where the pre-trained behavior classification model is determined based on training of a training sample set, the training sample set includes a plurality of generated samples, and the generated samples are generated by the pre-trained behavior feature sample generation model.
In the embodiment of the invention, the terminal-side program can record human-computer interaction behavior data and network access behavior data. These data characterize the behavior of the user, so the target behavior data set built from them, and the target behavior features that the server determines from that data set, also characterize the behavior of the user. The server can therefore predict, based on the pre-trained behavior classification model and the target behavior features, a target behavior category representing the behavior pattern of the user. In addition, because the training sample set of the pre-trained behavior classification model includes generated samples (i.e., virtual samples generated by the behavior feature sample generation model), the number of samples in the training sample set is sufficient, so the behavior classification model can be fully trained and the trained behavior classification model can predict the behavior category of the user more accurately.
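The cooperation of the three modules of the apparatus may be sketched as a simple composition, in which each module is represented by a callable. The class name, the field names and the `run` method are hypothetical, and the embodiment does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class DataClassificationApparatus:
    """Chains the first obtaining, first determining and second determining modules."""

    obtain: Callable[[str], Dict[str, Any]]                 # first obtaining module
    determine_feature: Callable[[Dict[str, Any]], List[float]]  # first determining module
    classify: Callable[[List[float]], str]                  # second determining module

    def run(self, target_id: str) -> str:
        dataset = self.obtain(target_id)           # target behavior data set
        feature = self.determine_feature(dataset)  # target behavior feature
        return self.classify(feature)              # target behavior category
```

Because each module is passed in as a callable, a pre-trained behavior classification model of any of the kinds listed earlier could be supplied as `classify` without changing the other two modules.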
Fig. 8 is a schematic diagram of an electronic device of an embodiment of the invention. As shown in fig. 8, the electronic device is a general data classification device with a general computer hardware structure, which includes at least a processor 81 and a memory 82. The processor 81 and the memory 82 are connected by a bus 83. The memory 82 is adapted to store instructions or programs executable by the processor 81. The processor 81 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 81 implements the processing of data and the control of other devices by executing instructions stored by the memory 82 to perform the method flows of embodiments of the present invention as described above. The bus 83 connects the above components together, and also connects the above components to a display controller 84, a display device and an input/output (I/O) device 85. The input/output (I/O) device 85 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, or another device known in the art. Typically, the input/output device 85 is coupled to the system through an input/output (I/O) controller 86.
It should be noted that, when the processor 81 is used for executing the program stored in the memory 82, it is also used for implementing other steps described in the foregoing method embodiment, and reference may be made to the relevant description in the foregoing method embodiment, which is not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device) or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps in the method of the above embodiments may be accomplished by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps in the method of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (16)

1. A method of data classification, the method comprising:
acquiring a target behavior data set, wherein the target behavior data set comprises a plurality of man-machine interaction behaviors and network access behavior data recorded through a terminal side program;
determining target behavior characteristics based on the target behavior data set, wherein the target behavior characteristics are used for representing the portrait corresponding to the target behavior data set; and
determining, based on a pre-trained behavior classification model and with the target behavior characteristics as input, a target behavior category output by the pre-trained behavior classification model, wherein the pre-trained behavior classification model is determined based on training of a training sample set, the training sample set comprises a plurality of generation samples, and the generation samples are generated by a pre-trained behavior characteristic sample generation model.
2. The method of claim 1, wherein the behavior feature sample generation model comprises a generator module and a discriminator module, and wherein the behavior feature sample generation model is trained based on the following steps:
acquiring a first preset number of real behavior feature samples, wherein the real behavior feature samples are used for representing behavior features corresponding to the acquired behavior data;
generating a second preset number of virtual behavior feature samples based on the generator module;
determining a loss function between the real behavior feature sample and the virtual behavior feature sample based on the discriminator module; and
adjusting parameters of the behavior feature sample generation model based on the loss function.
3. The method of claim 2, further comprising:
obtaining a plurality of virtual behavior characteristics generated by the generator module after the parameters are adjusted;
performing discrimination operation on the plurality of virtual behavior features based on the discriminator module, and determining discrimination probabilities corresponding to the plurality of virtual behavior features, wherein the discrimination probabilities are used for representing the probability that the discriminator module judges that the virtual behavior features are real behavior features or representing the probability that the discriminator module judges that the virtual behavior features are virtual behavior features; and
in response to the discrimination probability not being within a preset threshold range, adjusting parameters of the behavior feature sample generation model so that the discrimination probability is within the preset threshold range.
4. The method according to claim 2 or 3, wherein the training sample set further comprises the real behavior feature sample, a label corresponding to the real behavior feature sample, and a label corresponding to the generated sample;
the behavior classification model is trained based on the following steps:
acquiring a training sample set;
taking the generated sample and the real behavior feature sample as input, and determining a behavior class output by the behavior classification model; and
adjusting parameters of the behavior classification model based on the behavior category, the label corresponding to the real behavior feature sample and the label corresponding to the generated sample.
5. The method according to claim 1, wherein the target behavior data set comprises basic information, performance information and buried point information, the basic information is used for representing inherent attributes, the performance information is used for representing credit behavior, and the buried point information is used for representing data collected by a preset buried point algorithm;
the obtaining of the target behavior data set includes:
acquiring the buried point information from a preset first database, wherein the first database is used for collecting and preprocessing the buried point data to determine the buried point information, and the preprocessing comprises data structuring;
acquiring the basic information and the performance information from a preset second database, wherein the second database is used for collecting and preprocessing basic data and performance data to determine the basic information and the performance information; and
determining the target behavior data set based on the basic information, the performance information and the buried point information.
6. The method of claim 5, wherein the first database is a data warehouse hive processor and the second database is a full link processor.
7. The method of claim 1, wherein the behavior feature sample generation model is built based on a generative adversarial network (GAN), and wherein the behavior classification model comprises at least one of a logistic regression model, a gradient boosting decision tree (GBDT) model, an XGBoost (distributed gradient boosting library) model, a deep learning model, or an end-to-end model.
8. An apparatus for classifying data, the apparatus comprising:
a first obtaining module, configured to obtain a target behavior data set, where the target behavior data set includes a plurality of human-computer interaction behaviors and network access behavior data recorded through a terminal-side program;
a first determining module, configured to determine a target behavior feature based on the target behavior data set, where the target behavior feature is used to characterize a portrait corresponding to the target behavior data set; and
a second determining module, configured to determine, based on a pre-trained behavior classification model and with the target behavior feature as input, a target behavior category output by the pre-trained behavior classification model, wherein the pre-trained behavior classification model is determined based on training of a training sample set, the training sample set comprises a plurality of generation samples, and the generation samples are generated by a pre-trained behavior characteristic sample generation model.
9. The apparatus of claim 8, wherein the behavior feature sample generation model comprises a generator module and a discriminator module, the apparatus further comprising:
the second acquisition module is used for acquiring a first preset number of real behavior feature samples, and the real behavior feature samples are used for representing behavior features corresponding to the acquired behavior data;
the generator module is used for generating a second preset number of virtual behavior feature samples;
the discriminator module is used for determining a loss function between the real behavior feature sample and the virtual behavior feature sample; and
the first adjusting module is used for adjusting the parameters of the behavior feature sample generation model based on the loss function.
10. The apparatus of claim 9, further comprising:
the third acquisition module is used for acquiring a plurality of virtual behavior characteristics generated by the generator module after the parameters are adjusted;
the discriminator module is used for performing discrimination operation on the plurality of virtual behavior features and determining discrimination probabilities corresponding to the plurality of virtual behavior features, wherein the discrimination probabilities are used for representing the probability that the discriminator module judges that the virtual behavior features are real behavior features or representing the probability that the discriminator module judges that the virtual behavior features are virtual behavior features; and
the second adjusting module is used for adjusting the parameters of the behavior feature sample generation model in response to the discrimination probability not being within the preset threshold range, so that the discrimination probability is within the preset threshold range.
11. The apparatus according to claim 9 or 10, wherein the training sample set further includes the real behavior feature sample, a label corresponding to the real behavior feature sample, and a label corresponding to the generated sample;
the device further comprises:
the fourth acquisition module is used for acquiring a training sample set;
a third determining module, configured to determine a behavior category output by the behavior classification model, using the generated sample and the real behavior feature sample as inputs; and
the third adjusting module is used for adjusting the parameters of the behavior classification model based on the behavior category, the label corresponding to the real behavior feature sample and the label corresponding to the generated sample.
12. The apparatus of claim 8, wherein the target behavior data set comprises basic information, performance information and buried point information, the basic information is used for representing inherent attributes, the performance information is used for representing credit behavior, and the buried point information is used for representing data collected by a preset buried point algorithm;
the first obtaining module is specifically configured to:
acquiring the buried point information from a preset first database, wherein the first database is used for collecting and preprocessing the buried point data to determine the buried point information, and the preprocessing comprises data structuring;
acquiring the basic information and the performance information from a preset second database, wherein the second database is used for collecting and preprocessing basic data and performance data to determine the basic information and the performance information; and
determining the target behavior data set based on the basic information, the performance information and the buried point information.
13. The apparatus of claim 12, wherein the first database is a data warehouse hive processor and the second database is a full link processor.
14. The apparatus of claim 8, wherein the behavior feature sample generation model is built based on a generative adversarial network (GAN), and wherein the behavior classification model comprises at least one of a logistic regression model, a gradient boosting decision tree (GBDT) model, an XGBoost (distributed gradient boosting library) model, a deep learning model, or an end-to-end model.
15. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-7.
16. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202011075308.4A 2020-10-09 2020-10-09 Data classification method and device, electronic equipment and readable storage medium Pending CN112329816A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011075308.4A CN112329816A (en) 2020-10-09 2020-10-09 Data classification method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112329816A true CN112329816A (en) 2021-02-05

Family

ID=74313425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011075308.4A Pending CN112329816A (en) 2020-10-09 2020-10-09 Data classification method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112329816A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948412A (en) * 2021-04-21 2021-06-11 携程旅游网络技术(上海)有限公司 Flight inventory updating method, system, electronic equipment and storage medium
CN112990480A (en) * 2021-03-10 2021-06-18 北京嘀嘀无限科技发展有限公司 Method and device for building model, electronic equipment and storage medium
CN113011966A (en) * 2021-03-18 2021-06-22 中国光大银行股份有限公司 Credit scoring method and device based on deep learning
CN113850309A (en) * 2021-09-15 2021-12-28 支付宝(杭州)信息技术有限公司 Training sample generation method and federal learning method
CN114282684A (en) * 2021-12-24 2022-04-05 支付宝(杭州)信息技术有限公司 Method and device for training user-related classification model and classifying users
CN114493781A (en) * 2022-01-25 2022-05-13 工银科技有限公司 User behavior prediction method and device, electronic equipment and storage medium
CN114510305A (en) * 2022-01-20 2022-05-17 北京字节跳动网络技术有限公司 Model training method and device, storage medium and electronic equipment
CN115035722A (en) * 2022-06-20 2022-09-09 浙江嘉兴数字城市实验室有限公司 Road safety risk prediction method based on combination of spatio-temporal features and social media

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359686A (en) * 2018-10-18 2019-02-19 西安交通大学 A kind of user's portrait method and system based on Campus Network Traffic
CN109492104A (en) * 2018-11-09 2019-03-19 北京京东尚科信息技术有限公司 Training method, classification method, system, equipment and the medium of intent classifier model
CN109543740A (en) * 2018-11-14 2019-03-29 哈尔滨工程大学 A kind of object detection method based on generation confrontation network
CN109766911A (en) * 2018-12-04 2019-05-17 深圳先进技术研究院 A kind of behavior prediction method
CN110580268A (en) * 2019-08-05 2019-12-17 西北大学 Credit scoring integrated classification system and method based on deep learning
CN110647921A (en) * 2019-09-02 2020-01-03 腾讯科技(深圳)有限公司 User behavior prediction method, device, equipment and storage medium
CN110781929A (en) * 2019-10-12 2020-02-11 腾讯科技(深圳)有限公司 Training method, prediction device, medium, and apparatus for credit prediction model
CN111461168A (en) * 2020-03-02 2020-07-28 平安科技(深圳)有限公司 Training sample expansion method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
任娟: "《网店运营》", vol. 1, 30 June 2020, 北京理工大学出版社, pages: 35 *
赵宏田: "《用户画像方法论与工程化解决方案》", vol. 1, 31 May 2020, 机械工业出版社, pages: 213 - 219 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990480A (en) * 2021-03-10 2021-06-18 北京嘀嘀无限科技发展有限公司 Method and device for building model, electronic equipment and storage medium
CN113011966A (en) * 2021-03-18 2021-06-22 中国光大银行股份有限公司 Credit scoring method and device based on deep learning
CN112948412A (en) * 2021-04-21 2021-06-11 携程旅游网络技术(上海)有限公司 Flight inventory updating method, system, electronic equipment and storage medium
CN112948412B (en) * 2021-04-21 2024-03-12 携程旅游网络技术(上海)有限公司 Flight inventory updating method, system, electronic device and storage medium
CN113850309A (en) * 2021-09-15 2021-12-28 支付宝(杭州)信息技术有限公司 Training sample generation method and federal learning method
CN114282684A (en) * 2021-12-24 2022-04-05 支付宝(杭州)信息技术有限公司 Method and device for training user-related classification model and classifying users
CN114510305A (en) * 2022-01-20 2022-05-17 北京字节跳动网络技术有限公司 Model training method and device, storage medium and electronic equipment
CN114510305B (en) * 2022-01-20 2024-01-23 北京字节跳动网络技术有限公司 Model training method and device, storage medium and electronic equipment
CN114493781A (en) * 2022-01-25 2022-05-13 工银科技有限公司 User behavior prediction method and device, electronic equipment and storage medium
CN115035722A (en) * 2022-06-20 2022-09-09 浙江嘉兴数字城市实验室有限公司 Road safety risk prediction method based on combination of spatio-temporal features and social media
CN115035722B (en) * 2022-06-20 2024-04-05 浙江嘉兴数字城市实验室有限公司 Road safety risk prediction method based on combination of space-time characteristics and social media

Similar Documents

Publication Publication Date Title
CN112329816A (en) Data classification method and device, electronic equipment and readable storage medium
CN108416198B (en) Device and method for establishing human-machine recognition model and computer readable storage medium
CN107229708B (en) Personalized travel service big data application system and method
CN109634698B (en) Menu display method and device, computer equipment and storage medium
CN107391760A (en) User interest recognition methods, device and computer-readable recording medium
WO2024067387A1 (en) User portrait generation method based on characteristic variable scoring, device, vehicle, and storage medium
CN109635010B (en) User characteristic and characteristic factor extraction and query method and system
CN109685104B (en) Determination method and device for recognition model
CN113011889B (en) Account anomaly identification method, system, device, equipment and medium
EP3726441A1 (en) Company bankruptcy prediction system and operating method therefor
CN103810162A (en) Method and system for recommending network information
CN112215696A (en) Personal credit evaluation and interpretation method, device, equipment and storage medium based on time sequence attribution analysis
CN110647995A (en) Rule training method, device, equipment and storage medium
CN112463859B (en) User data processing method and server based on big data and business analysis
CN111210332A (en) Method and device for generating post-loan management strategy and electronic equipment
CN114187036A (en) Internet advertisement intelligent recommendation management system based on behavior characteristic recognition
CN112070559A (en) State acquisition method and device, electronic equipment and storage medium
CN114238764A (en) Course recommendation method, device and equipment based on recurrent neural network
JP2021018466A (en) Rule extracting apparatus, information processing apparatus, rule extracting method, and rule extracting program
CN111784360B (en) Anti-fraud prediction method and system based on network link backtracking
CN117608889A (en) Log semantic based anomaly detection method and related equipment
CN117235633A (en) Mechanism classification method, mechanism classification device, computer equipment and storage medium
CN112801784A (en) Bit currency address mining method and device for digital currency exchange
CN111309706A (en) Model training method and device, readable storage medium and electronic equipment
CN110472680B (en) Object classification method, device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination