CN113705072A - Data processing method, data processing device, computer equipment and storage medium - Google Patents
- Publication number
- CN113705072A (application number CN202110396801.4A)
- Authority
- CN
- China
- Prior art keywords
- sample data
- feature
- sample
- characteristic
- type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Human Resources & Organizations (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Strategic Management (AREA)
- Economics (AREA)
- Evolutionary Computation (AREA)
- Entrepreneurship & Innovation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Development Economics (AREA)
- Marketing (AREA)
- Educational Administration (AREA)
- General Business, Economics & Management (AREA)
- Medical Informatics (AREA)
- Tourism & Hospitality (AREA)
- Quality & Reliability (AREA)
- Game Theory and Decision Science (AREA)
- Operations Research (AREA)
- Computer Hardware Design (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Geometry (AREA)
- Evolutionary Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a data processing method and apparatus, a computer device, and a storage medium. The method includes: acquiring a sample data set under a model training scenario, the sample data set containing M sample data, where M is a positive integer; acquiring N feature types to be analyzed, and acquiring the feature value of each piece of sample data under each feature type, where N is a positive integer; and generating training index information of the sample data set under each feature type according to the feature value of each piece of sample data under each feature type, the training index information being used to assist in determining, from the N feature types, the feature type matched with the model training scenario. Through the method and apparatus, the accuracy of the determined feature type matched with the model training scenario can be improved.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
With the continuous development of computer networks, Artificial Intelligence (AI) technologies have penetrated many aspects of life. For example, a model can be trained through machine-learning technology so that the trained model can be used to discriminate and predict data.
In the prior art, when a model is trained, a user generally selects in advance, based on experience, one or more data feature types suitable for model training, and the model is then trained using the features of the sample data under the selected data feature types.
Because the data feature types associated with the model to be trained are chosen arbitrarily when the user selects them based on experience, the data feature types the user selects for model training are very likely to be inaccurate.
Disclosure of Invention
The application provides a data processing method and apparatus, a computer device, and a storage medium, which can improve the accuracy of the determined feature type matched with a model training scenario.
One aspect of the present application provides a data processing method, including:
acquiring a sample data set under a model training scenario, the sample data set containing M sample data, where M is a positive integer;
acquiring N feature types to be analyzed, and acquiring the feature value of each piece of sample data under each feature type, where N is a positive integer;
and generating training index information of the sample data set under each feature type according to the feature value of each piece of sample data under each feature type, the training index information being used to assist in determining, from the N feature types, the feature type matched with the model training scenario.
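The three steps above can be sketched as a small driver routine. This is an illustrative sketch only: `get_feature_value` and `index_fn` are hypothetical callables standing in for the feature-value lookup and the index computations described in the optional implementations.

```python
def generate_training_index_info(sample_data_set, feature_types,
                                 get_feature_value, index_fn):
    """For each of the N feature types, collect the feature value of each of
    the M sample data, then derive one piece of training index information."""
    training_index_info = {}
    for feature_type in feature_types:                      # N feature types
        values = [get_feature_value(sample, feature_type)   # one value per sample
                  for sample in sample_data_set]            # M sample data
        training_index_info[feature_type] = index_fn(feature_type, values)
    return training_index_info
```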
One aspect of the present application provides a data processing apparatus, including:
a sample acquisition module, configured to acquire a sample data set under a model training scenario, the sample data set containing M sample data, where M is a positive integer;
a feature acquisition module, configured to acquire N feature types to be analyzed and to acquire the feature value of each piece of sample data under each feature type, where N is a positive integer;
and an index generation module, configured to generate training index information of the sample data set under each feature type according to the feature value of each piece of sample data under each feature type, the training index information being used to assist in determining, from the N feature types, the feature type matched with the model training scenario.
Optionally, the N feature types include an i-th feature type, where i is a positive integer less than or equal to N, and the i-th feature type has t target feature values, where t is a positive integer;
the manner in which the index generation module generates the training index information of the sample data set under each feature type according to the feature value of each piece of sample data under each feature type includes:
determining, according to the feature value of each piece of sample data under the i-th feature type, the feature-value frequency corresponding to each of the t target feature values;
and generating the training index information of the sample data set under the i-th feature type according to the feature-value frequency corresponding to each target feature value.
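The frequency computation above reduces to counting how often each target feature value occurs among the M feature values. A minimal sketch, assuming feature values are hashable characters or numbers as in the description:

```python
from collections import Counter

def value_frequencies(feature_values, target_values):
    """Count, for each of the t target feature values of the i-th feature
    type, how many of the M sample data take that value."""
    counts = Counter(feature_values)
    return {v: counts.get(v, 0) for v in target_values}
```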
Optionally, the M sample data include negative sample data and positive sample data, and the N feature types include a j-th feature type, where j is a positive integer less than or equal to N;
the manner in which the index generation module generates the training index information of the sample data set under each feature type according to the feature value of each piece of sample data under each feature type includes:
dividing the M sample data according to the feature value of each piece of sample data under the j-th feature type to obtain K1 sample data bins corresponding to the j-th feature type, where K1 is a positive integer less than or equal to M;
acquiring the negative-sample frequency of the negative sample data contained in each of the K1 sample data bins;
determining, according to the negative-sample frequencies respectively corresponding to the K1 sample data bins, the frequency variation trend among those negative-sample frequencies;
and determining the frequency variation trend as the training index information of the sample data set under the j-th feature type.
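One plausible realization of this binning-and-trend step is equal-frequency binning over samples sorted by feature value; the `label == 0` encoding for negative sample data is an assumption, not something the application fixes:

```python
def negative_frequency_trend(samples, k1):
    """samples: (feature_value, label) pairs; label 0 marks negative sample
    data (assumed encoding). Splits the sorted samples into K1 bins and
    reports how the negative-sample frequency varies across the bins."""
    ordered = sorted(samples)                        # order by feature value
    size = len(ordered) // k1
    bins = [ordered[i * size:(i + 1) * size] for i in range(k1 - 1)]
    bins.append(ordered[(k1 - 1) * size:])           # last bin takes remainder
    freqs = [sum(1 for _, label in b if label == 0) / len(b) for b in bins]
    if all(a <= b for a, b in zip(freqs, freqs[1:])):
        trend = "increasing"
    elif all(a >= b for a, b in zip(freqs, freqs[1:])):
        trend = "decreasing"
    else:
        trend = "non-monotonic"
    return freqs, trend
```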
Optionally, the M sample data include negative sample data and positive sample data, and the N feature types include an s-th feature type, where s is a positive integer less than or equal to N;
the manner in which the index generation module generates the training index information of the sample data set under each feature type according to the feature value of each piece of sample data under each feature type includes:
dividing the M sample data according to the feature value of each piece of sample data under the s-th feature type to obtain K2 sample data bins corresponding to the s-th feature type, where K2 is a positive integer less than or equal to M;
acquiring the number of negative samples of the negative sample data contained in each of the K2 sample data bins;
acquiring the number of positive samples of the positive sample data contained in each of the K2 sample data bins;
and determining the training index information of the sample data set under the s-th feature type according to the numbers of negative samples and positive samples respectively corresponding to the K2 sample data bins.
Optionally, the manner in which the index generation module determines the training index information of the sample data set under the s-th feature type according to the numbers of negative samples and positive samples respectively corresponding to the K2 sample data bins includes:
determining the sample distinguishing weights respectively corresponding to the K2 sample data bins according to the numbers of negative and positive sample data respectively corresponding to the K2 sample data bins;
determining the sample distinguishing index value corresponding to the s-th feature type according to the number of negative samples, the number of positive samples, and the sample distinguishing weight respectively corresponding to the K2 sample data bins;
and determining the sample distinguishing index value corresponding to the s-th feature type as the training index information of the sample data set under the s-th feature type.
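The sample distinguishing weight per bin and the sample distinguishing index value per feature type resemble the weight-of-evidence (WOE) and information-value (IV) statistics commonly used in feature screening. The sketch below is written under that assumption and is not the application's stated formula; `eps` guards against empty bins:

```python
import math

def distinguishing_index(bins, eps=1e-6):
    """bins: (negative_count, positive_count) per sample data bin (K2 bins).
    Returns the per-bin sample distinguishing weights (WOE-like) and the
    feature type's sample distinguishing index value (IV-like)."""
    total_neg = sum(n for n, _ in bins)
    total_pos = sum(p for _, p in bins)
    weights, index_value = [], 0.0
    for neg, pos in bins:
        neg_share = neg / total_neg + eps    # this bin's share of negatives
        pos_share = pos / total_pos + eps    # this bin's share of positives
        w = math.log(pos_share / neg_share)  # sample distinguishing weight
        weights.append(w)
        index_value += (pos_share - neg_share) * w
    return weights, index_value
```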
Optionally, the apparatus further includes:
a sorting module, configured to sort the N feature types in descending order of the sample distinguishing index value of the sample data set under each feature type, to obtain the sorted N feature types;
and a feature selection module, configured to determine the first L feature types among the sorted N feature types as the feature types matched with the model training scenario, where L is a positive integer less than or equal to N.
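The sorting and feature selection modules amount to ranking the feature types by index value and truncating to the first L, for example:

```python
def select_feature_types(index_values, l):
    """index_values: {feature_type: sample distinguishing index value} for
    the N feature types. Returns the first L types after descending sort."""
    ranked = sorted(index_values, key=index_values.get, reverse=True)
    return ranked[:l]
```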
Optionally, the M sample data are M sample users;
the apparatus further includes:
a model acquisition module, configured to acquire a model to be trained under a model training scenario;
a target sample acquisition module, configured to acquire a target sample user and acquire the feature values of the target sample user under the L feature types, the target sample user carrying a user label, where the user label is an abnormal-user label or a normal-user label;
and a model training module, configured to train the model to be trained according to the feature values of the target sample user under the L feature types and the user label carried by the target sample user, to obtain a target model;
where the target model is used to discriminate and predict the user attribute of a predicted user, the user attribute being a normal-user attribute or an abnormal-user attribute.
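The application does not fix a model family for the model to be trained. As an illustrative stand-in only, the sketch below trains a minimal logistic-regression model on the feature values of target sample users under the L feature types, with labels encoded as 1 for the abnormal-user label and 0 for the normal-user label (an assumed encoding):

```python
import math

def train_target_model(rows, labels, epochs=200, lr=0.5):
    """rows: per-user feature values under the L feature types; labels: 1 for
    an abnormal-user label, 0 for a normal-user label (assumed encoding).
    Returns a function mapping feature values to an abnormal probability."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted abnormal probability
            g = p - y                        # log-loss gradient factor
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    def predict(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    return predict
```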
Optionally, the apparatus further includes:
a predicted user acquisition module, configured to acquire a predicted user and acquire the feature values of the predicted user under the L feature types;
a feature prediction module, configured to input the feature values of the predicted user under the L feature types into the target model and predict the user attribute of the predicted user in the target model;
and an early warning module, configured to determine the predicted user as an abnormal user if the user attribute of the predicted user is the abnormal-user attribute, and to perform an early warning operation on the abnormal user.
Optionally, the M sample data include negative sample data and positive sample data, and the N feature types include a z-th feature type, where z is a positive integer less than or equal to N;
the manner in which the index generation module generates the training index information of the sample data set under each feature type according to the feature value of each piece of sample data under each feature type includes:
dividing the M sample data according to the feature value of each piece of sample data under the z-th feature type to obtain K3 sample data bins corresponding to the z-th feature type, where K3 is a positive integer less than or equal to M;
acquiring the negative-sample frequency of the negative sample data contained in each of the K3 sample data bins;
acquiring the positive-sample frequency of the positive sample data contained in each of the K3 sample data bins;
obtaining the model discrimination corresponding to the z-th feature type according to the negative-sample frequency and positive-sample frequency respectively corresponding to the K3 sample data bins;
and determining the model discrimination as the training index information of the sample data set under the z-th feature type.
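One common way to read a model discrimination score derived from per-bin negative and positive sample frequencies is as a Kolmogorov-Smirnov (KS)-style statistic, i.e. the maximum gap between the two cumulative distributions over the K3 bins. This interpretation is an assumption, not the application's stated formula:

```python
def model_discrimination(neg_freqs, pos_freqs):
    """neg_freqs / pos_freqs: each bin's share of all negative / positive
    samples over the K3 bins (each list sums to 1). Returns the maximum
    cumulative gap between the two distributions (KS-style)."""
    cum_neg = cum_pos = best = 0.0
    for n, p in zip(neg_freqs, pos_freqs):
        cum_neg += n
        cum_pos += p
        best = max(best, abs(cum_neg - cum_pos))
    return best
```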
Optionally, the manner in which the sample acquisition module acquires the sample data set under the model training scenario includes:
acquiring the sample data set in a sample data system;
the manner in which the feature acquisition module acquires the feature value of each piece of sample data under each feature type includes:
acquiring, in a metadata system, the feature value of each piece of sample data under each feature type, based on an access interface between the sample data system and the metadata system.
Optionally, the manner in which the sample acquisition module acquires the sample data set under the model training scenario includes:
acquiring a sample acquisition request sent by a client, the sample acquisition request carrying a sample retrieval field;
acquiring the sample data set according to the sample retrieval field;
the apparatus is further configured to:
transmit the training index information of the sample data set under each feature type to the client, so that the client displays the training index information of the sample data set under each feature type on a client interface.
An aspect of the application provides a computer device, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method of the above aspect.
An aspect of the application provides a computer-readable storage medium storing a computer program that includes program instructions which, when executed by a processor, cause the processor to perform the method of the above aspect.
According to an aspect of the application, a computer program product or computer program is provided, including computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the various optional implementations of the above aspect.
In the present application, a sample data set under a model training scenario can first be acquired, the sample data set containing M sample data, where M is a positive integer. N feature types to be analyzed can be acquired, together with the feature value of each piece of sample data under each feature type, where N is a positive integer. Training index information of the sample data set under each feature type can then be generated from those feature values, the training index information being used to assist in determining, from the N feature types, the feature type matched with the model training scenario. Because the feature type matched with the model training scenario can be selected from the N feature types more accurately through the training index information, the model under the model training scenario can in turn be trained more accurately, so that the trained model has better model performance.
Drawings
To illustrate the technical solutions of the present application or the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a schematic view of a feature measurement scenario provided herein;
FIG. 3 is a schematic flow chart diagram of a data processing method provided herein;
FIG. 4 is a schematic diagram of an interface for managing sample data at a client according to the present application;
FIGS. 5a-5b are schematic views of a data acquisition scenario provided herein;
FIG. 6 is a schematic view of a model training scenario provided herein;
FIGS. 7a-7b are schematic diagrams of an interface for displaying measurement results provided herein;
FIG. 8 is a schematic flow chart diagram of a data processing method provided herein;
FIG. 9 is a schematic diagram of a data processing apparatus provided in the present application;
fig. 10 is a schematic structural diagram of a computer device provided in the present application.
Detailed Description
The technical solutions in the present application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only a part, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application.
The application relates to artificial intelligence. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The present application relates mainly to machine learning in artificial intelligence. Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The machine learning involved in the present application mainly concerns how to accurately select the feature types used for training a model under a model training scenario; for details, refer to the description of the embodiment corresponding to fig. 3 below.
The application also relates to blockchain technology. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, each data block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, and an application service layer. The blocks are connected to one another in the chronological order of their generation; once a new block is added to the blockchain it cannot be removed, and the blocks record the data submitted by the nodes in the blockchain system. In the present application, the generated training index information corresponding to each feature type can be added to the blockchain for storage, which ensures that the training index information cannot be tampered with, so that the authentic training index information of each feature type can be obtained whenever it is needed later.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present disclosure. As shown in fig. 1, the network architecture may include a server 200 and a terminal device cluster, and the terminal device cluster may include one or more terminal devices, where the number of terminal devices is not limited herein. As shown in fig. 1, the plurality of terminal devices may specifically include a terminal device 100a, a terminal device 101a, terminal devices 102a, …, and a terminal device 103 a; as shown in fig. 1, the terminal device 100a, the terminal device 101a, the terminal devices 102a, …, and the terminal device 103a may all be in network connection with the server 200, so that each terminal device may perform data interaction with the server 200 through the network connection.
The server 200 shown in fig. 1 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal device may be: the intelligent terminal comprises intelligent terminals such as a smart phone, a tablet computer, a notebook computer, a desktop computer and an intelligent television.
The terminal device 101a, the terminal devices 102a, …, and the terminal device 103a in the terminal device cluster may be terminal devices of a common user registered in a transaction client, and the transaction client may be any client capable of performing data transaction. User information registered by the user in the transaction client of the terminal device 101a, the terminal devices 102a, …, and the terminal device 103a, and related transaction records when the user performs data transaction in the transaction client may all be synchronized to the server 200, and the server 200 may be a background server of the transaction client.
In addition, the terminal device 100a may be the terminal device of measurement personnel, i.e., related staff in the client background. The terminal device 100a may include a measurement client, which is used to measure the user feature types of the users of the transaction client. By measuring these user feature types, feature types more suitable for training a model can be selected from among them, so that the trained model can more accurately distinguish abnormal users from normal users. An abnormal user refers to a user with abnormal transaction behavior in the transaction client, and a normal user refers to a user without abnormal transaction behavior in the transaction client. The background server of the measurement client may also be the server 200. It can be understood that the server 200 may store the related transaction data and user information of all users of the transaction client, so that the measurement personnel can request, through the measurement client, that the server 200 measure the user feature types of the users of the transaction client.
Please refer to fig. 2, which is a schematic view of a feature measurement scenario provided in the present application. As shown in fig. 2, the measurement personnel may upload sample users in the measurement client of the terminal device 100a (for example, the user identification of a sample user may represent the corresponding user), or may request the corresponding sample users through a sample retrieval field in the measurement client. For example, if the sample retrieval field is "users having transaction behavior in a certain transaction time period", the requested sample users may include the users having transaction behavior in that time period; when the measurement personnel request sample users in the measurement client, the request is made to the server 200. It can therefore be understood that the terminal device 100a may, in response to the operations of the measurement personnel, submit M sample users (e.g., the M sample users in block 100b) to the server 200 through the measurement client. The M sample users may be uploaded by the measurement personnel at the measurement client, may be acquired by the measurement client through a sample retrieval field request, or may be partly uploaded and partly acquired through such a request; the M sample users constitute a sample data set.
Furthermore, the measurement personnel may submit, in the measurement client of the terminal device 100a, N feature types (feature type 1 to feature type N) for the M sample users, together with a request (which may be called a measurement request) to measure the N feature types through the feature values of the M sample users under the N feature types. The measurement client may send the measurement request to the server 200, and the server 200 may, according to the measurement request, measure the N feature types through the feature values of the M sample users under the N feature types, as described below.
The server 200 may obtain the feature value of each sample user (of the M sample users) under each feature type (of the N feature types). A feature value need not be an index value; it may represent any feature of the sample user, such as a gender feature (male or female) or a user account feature (the character string of a user account). The feature under any feature type may be represented by a unique character or numerical value, and the feature value may likewise be a character or numerical value that uniquely represents the corresponding feature; a sample user has one feature value under one feature type. As shown in fig. 2, the server 200 may obtain the feature values of the M sample users under feature type 1 (as in block 101b), under feature type 2 (as in block 102b), ..., and under feature type N (as in block 103b).
Furthermore, the server 200 may calculate, from the feature values of the M sample users under each feature type, the training index information corresponding to each feature type; as shown in fig. 2, this may include training index information 1 corresponding to feature type 1 through training index information N corresponding to feature type N. The server 200 may then produce a recommendation ranking of the N feature types through the training index information corresponding to each feature type, placing the feature types most suitable for model training first. The server 200 may send the recommendation ranking of the N feature types to the measurement client, and/or send the training index information corresponding to each feature type to the measurement client. The measurement client may display the obtained recommendation ranking and/or training index information on a client interface of the terminal device 100a, and the measurement personnel can view them to select the feature types for model training from among the N feature types. As shown in block 105b, the selected feature type 106b for model training may be one measured to distinguish abnormal users from normal users more easily; the measurement personnel may use feature type 106b for model training, train an abnormal user distinction model, and then identify, through that model, the probability that a user to be identified is an abnormal user.
The M sample users may be the M sample data in the embodiment corresponding to fig. 3 below, and the abnormal user distinguishing model may be the target model obtained by training in that embodiment. The specific processes of generating the training index information corresponding to the various feature types from the sample users' feature values, and of training the abnormal user distinguishing model, are both described in the embodiment corresponding to fig. 3 below.
By this method, the effect of each feature type on model training can be measured through the feature values of the sample users under the various feature types, yielding training index information corresponding to each feature type. With that training index information, the feature type used for model training can be accurately selected from the N feature types, which in turn can improve the accuracy with which the trained abnormal user distinguishing model identifies abnormal users.
Referring to fig. 3, fig. 3 is a schematic flow chart of a data processing method provided in the present application. The execution subject in the embodiment of the present application may be one computer device or a cluster formed by a plurality of computer devices. The computer device can be a server or a terminal device; therefore, the execution subject may be a server, a terminal device, or a server and a terminal device together. Here, the description takes a server as the execution subject. As shown in fig. 3, the method may include:
Step S101, acquiring a sample data set in a model training scenario; the sample data set contains M sample data, where M is a positive integer;
In this application, a client (such as the measurement and calculation client in the embodiment corresponding to fig. 2) may be provided for a user. The client may run on the user's terminal device and may be a web client or application software. The client supports the user in uploading or acquiring sample data in a model training scenario and in initiating a measurement and calculation request for one or more feature types of the sample data so as to obtain a measurement and calculation result (such as the training index information corresponding to the various feature types described below). The measurement and calculation result may assist the user in selecting a feature type adapted to the model training scenario; the features (such as the feature values described below) of the sample data under the selected feature type may then be used to train the model in the model training scenario.
Therefore, the server can obtain a sample data set under a model training scenario. The sample data set may comprise M sample data, where M is a positive integer whose specific value is determined by the actual application scenario. The sample data set may be uploaded by a user at the client and then sent to the server by the client. Alternatively, the user may enter a sample retrieval field at the client and submit a sample acquisition request to the server through the client; the request carries the sample retrieval field, which is a field used to acquire the corresponding sample data.
Optionally, the sample data may be sample users; that is, the sample data set may be a sample user set, and the M sample data may refer to M sample users. For example, if the sample retrieval field is "a user marked by the police bureau", the acquired sample data associated with the field may be users marked by the police bureau. If the sample retrieval field is "a user who has transacted within a certain time period", the acquired sample data may be users who transacted within that time period. The specific content of the sample retrieval field is determined by the actual application scenario and is not limited here.
The data system may be a system in a server, or may be a system that is not in a server but is accessible by the server. The client may be a client of an organization or an enterprise, and the user having access to the client may include all or part of employees of the organization or the enterprise. The data system can comprise a sample data system, and supports the operations of uploading sample data in the sample data system, acquiring the sample data in the sample data system, managing the sample data in the sample data system and the like by a user having access right to the client through the client.
Furthermore, the data system may further include a metadata system, which can be understood as containing all the service data. Therefore, if a user acquires sample data from the data system through the client (for example, retrieves it through the sample retrieval field), the sample data may be acquired from the metadata system, and sample data acquired from the metadata system may also be uploaded to the sample data system.
If the sample data is a sample user, the components of the sample data uploaded in the sample data system may include: sample source (e.g., metadata system or user upload), sample identification (e.g., user identification of sample user), and sample label (e.g., good or bad label of sample user). In other words, the sample data in the sample data system can be uploaded according to a specified canonical format, so that a subsequent user can conveniently retrieve, view and manage the sample data in the sample data system at a client.
Optionally, the good or bad label of the sample user may be a label that the sample user is a normal user (which may be referred to as a normal user label) or a label that the sample user is an abnormal user (which may be referred to as an abnormal user label), for example, the sample user carrying the normal user label indicates that the transaction behavior is normal, and the sample user carrying the abnormal user label indicates that the transaction behavior is abnormal.
The model training scenario can be determined according to the type of the model which needs to be trained actually. For example, if the model to be trained is a face recognition model, the model training scene may refer to a scene of face recognition; for another example, if the model to be trained is an abnormal user distinguishing model, the model training scenario may be a scenario distinguished by an abnormal user.
Therefore, according to the above process, the sample data set acquired by the server may be uploaded by the user at the client, acquired by the user from the metadata system through the sample search field, or directly selected from the existing sample data in the sample data system.
Referring to fig. 4, fig. 4 is a schematic diagram of an interface for managing sample data at a client according to the present application. As shown in block 101c of fig. 4, operations such as uploading a sample, registering a sample, viewing a record of the uploaded sample, and retrieving an existing sample table from a sample library (e.g., a sample data system) may be performed in the client interface 100c of the client, and the intelligent analysis platform may be the aforementioned measurement client. As shown in fig. 4, a sample table 102c and a sample table 103c existing in the sample data system are also displayed in the client interface 100c, one sample table may include one or more sample data, and one sample table may have information such as a sample name (e.g., "sample abc" and "sample bcd" herein), an ID field (e.g., an identifier of the sample table), a label meaning (i.e., a meaning of a sample label carried by each sample data), a sample description (e.g., related description information of the sample table), a creation time, and a valid time period.
Step S102, acquiring N feature types to be analyzed, and respectively acquiring the feature value of each sample data under each feature type; N is a positive integer;
In the application, the server may further obtain N feature types to be analyzed. The N feature types may be entered by the user at the client and then submitted to the server by the client; alternatively, the client may contain a feature type list, and the N feature types may be those the user selects from that list before the client submits them to the server. The N feature types may be feature types associated with the model training scenario: by entering them, the user asks which of the N feature types yield user features that are more effective for training the model in that scenario. For example, if the sample data are sample users and the model training scenario is abnormal user distinction, it is necessary to determine which of the N feature types provide user features that are more effective for distinguishing abnormal users.
For example, if the sample data is a sample user, the N feature types may include a feature type of an age of the user, a feature type of a gender of the user, a feature type of a transaction time of the user, a feature type of a bank card transacted by the user, a feature type of an account number of the user, and the like.
The server may further obtain a feature value of each of the M sample data under each of the N feature types. A feature value is not necessarily numerical; it characterizes a feature of the sample data, and one sample data may have one feature value under one feature type. For example, the N feature types may include the feature type of the user's age, in which case the feature value of a sample data may be the age of the sample user (e.g., 18, 28, or 38 years old). For another example, the N feature types may include the feature type of the user's gender, in which case the feature value may be the gender of the sample user (e.g., male or female). For another example, where the N feature types include the feature type of the user's transaction time, the feature value may be the transaction time of the sample user (e.g., xx hour xx minute xx second).
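The one-value-per-sample-per-feature-type relationship described above can be sketched as a small lookup structure; the user IDs, feature type names, and values below are purely illustrative assumptions, not data from the application:

```python
# Illustrative only: hypothetical user IDs, feature type names, and values.
# Each sample user has exactly one feature value under each feature type,
# and the value need not be numeric (e.g., gender or a time string).
feature_values = {
    "user_001": {"age": 28, "gender": "male",   "transaction_time": "13:05:42"},
    "user_002": {"age": 35, "gender": "female", "transaction_time": "09:47:10"},
}

def get_feature_value(sample_id, feature_type):
    """Return the single feature value a sample has under one feature type."""
    return feature_values[sample_id][feature_type]

print(get_feature_value("user_001", "gender"))  # -> male
```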
As can be seen from the above, the metadata system may contain all the service data, so the server may obtain the feature values of the sample data from the metadata system. An access interface can be arranged between the sample data system and the metadata system, opening the access restriction between them so that the two systems can access each other. Therefore, after the sample data set is obtained in the sample data system, the server can access the metadata system through that access interface and obtain the feature values of the sample data from the metadata system.
Referring to fig. 5a-5b, fig. 5a-5b are schematic views of a data acquisition scenario provided in the present application. As shown in fig. 5a, the sample data may be sample users, and the metadata system may include an access layer, a base layer, a feature layer, and a market layer. The access layer may include the user's bottom-layer service data (such as transaction data), including data such as a risk control cache, transaction logs, and BI (big data analysis); the base layer can include data such as transaction details, a calendar, a week table, and a month table; the feature layer may include feature information such as the user's account, mobile phone number, and device (e.g., the device used to log in to the user's account); the market layer may include users such as those in the transaction market, those subject to policy interception (e.g., users whose transactions were intercepted), and 110 portraits (e.g., users profiled by the police bureau). It can thus be understood that the sample data set may be obtained from the market layer and the feature types of the sample data from the feature layer; if a needed feature type is not in the feature layer, a more detailed feature type of the sample data may further be obtained from the base layer or the access layer.
Optionally, the feature types of the sample user may generally have 9 categories of feature types, where the 9 categories of features may include a feature type of a user account, a feature type of a bank card (e.g., a bank card used in a user transaction), a feature type of an identity card (e.g., an identity card used in registering a user account), a feature type of a device (e.g., a device in which a user logs in a user account), a feature type of a mobile phone number (e.g., a mobile phone number associated with a user account), a feature type of a two-dimensional code (e.g., a two-dimensional code scanned when a user conducts a transaction), a feature type of an IP (communication address), a feature type of a merchant (e.g., a merchant conducting a transaction by a user), and a feature type of a user text (e.g., related text content when a user conducts a transaction). There may be more detailed feature types under the 9 major feature types, and all of the feature types may include feature types of several dimensions (e.g., ten thousand dimensions), so that the above N feature types may be some or all of the feature types of the several dimensions. 
As shown in block 101f, the training index information corresponding to each of the N feature types obtained in step S103 may come from scenarios such as in-team scheduled feature analysis, out-of-team temporary feature analysis, feature analysis when constructing a special-project (e.g., for a specific project) analysis database, and/or feature analysis of external cooperation data. In other words, the method provided in the present application can be used wherever the actual application scenario requires feature type analysis. The analysis result is the training index information corresponding to the feature type, which can be used not only to select feature types for training a model but also to analyze the features of the corresponding feature type.
Further, as shown in fig. 5b, the data system of the present application may include 3 systems: a feature measurement and calculation system, a sample data system, and a metadata system. Each of the 3 systems may have an access interface, and the 3 systems may access each other through the access interfaces between them. Opening the access restrictions among the 3 systems means that, when measuring and calculating feature types, sample data can be obtained from the sample data system and feature values of the sample data can be obtained from the metadata system; the feature types can then be measured and calculated from the obtained sample data and feature values by the feature measurement and calculation system. Triggering the measurement and calculation of feature types and displaying the measurement and calculation results can be achieved through the feature measurement and calculation client of that system, which may be the client mentioned above or the measurement and calculation client in fig. 2.
Step S103, generating training index information of the sample data set under each feature type according to the feature value of each sample data under each feature type, wherein the training index information is used for assisting in determining the feature type matched with the model training scene from N feature types;
In the application, the server may generate training index information of the sample data set under each feature type according to the feature value of each sample data under that feature type; the training index information may be understood as statistical feature information over the feature values of the M sample data under each feature type. The server can send the training index information to the client, and the client can display the training index information corresponding to each feature type on its interface. The training index information can serve as the basis on which the user selects the feature type for model training from the N feature types (i.e., the feature type matched to the model training scenario), so the user can select more accurately, and the model can in turn be trained more accurately with the selected feature type.
The training index information corresponding to each feature type may be generated in several ways:
The first way of generating training index information: determine, from the feature value of each sample data under the ith feature type, the feature value frequency count corresponding to each of the t target feature values; then generate the training index information of the sample data set under the ith feature type from those feature value frequency counts. The N feature types may include an ith feature type, where i is a positive integer less than or equal to N; the ith feature type may have t target feature values, t is a positive integer, and the ith feature type may be any one of the N feature types. For example, if the ith feature type is the gender type, it may have 2 (i.e., t equals 2) target feature values, namely the target feature value "male" and the target feature value "female". For another example, the ith feature type may be the type of transaction time period within a day; if each hour of the day is one transaction time period, the day contains 24 transaction time periods, so the ith feature type may have 24 (i.e., t equals 24) target feature values, the 24 transaction time periods of the day.
Therefore, the server may determine, from the feature value of each sample data under the ith feature type, the feature value frequency count corresponding to each of the t target feature values, where the feature value frequency count refers to the number of times each target feature value appears. For example, suppose the ith feature type is the gender type, M is equal to 6 (i.e., the sample data set includes 6 sample data), the sample data are sample users, and among the 6 sample users 2 are male and 4 are female; then the feature value frequency count of the target feature value "male" is equal to 2 and that of the target feature value "female" is equal to 4.
Further, feature types are classified as continuous or discrete; the gender feature type, for example, is discrete. For a discrete type, the corresponding target feature values form categories, such as the "male" target feature value and the "female" target feature value. Therefore, for a discrete feature the frequency count of each target feature value (i.e., the feature value frequency count) can be calculated, and the frequency and cumulative frequency of each target feature value can be calculated from those counts.
For example, if 3 kinds of target feature values are included, i.e., target feature value 1, target feature value 2, and target feature value 3, and their feature value frequency counts are 5, 10, and 15 respectively, then the frequency of target feature value 1 may be equal to 1/6 (i.e., 5/(5+10+15)), the frequency of target feature value 2 may be equal to 1/3 (i.e., 10/(5+10+15)), and the frequency of target feature value 3 may be equal to 1/2 (i.e., 15/(5+10+15)). The cumulative frequency over target feature values 1, 2, and 3 is 1/6 + 1/3 + 1/2 = 1.
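The frequency count, frequency, and cumulative frequency statistics above can be sketched as follows; this is a minimal illustration of the worked example, not the application's implementation, and the value names are hypothetical:

```python
from collections import Counter

def discrete_training_index(values):
    """Per target feature value: frequency count, frequency, cumulative frequency."""
    counts = Counter(values)              # feature value frequency counts
    total = sum(counts.values())
    info, cumulative = {}, 0.0
    for value, count in counts.items():   # insertion order: first occurrence
        frequency = count / total
        cumulative += frequency
        info[value] = {"count": count, "frequency": frequency,
                       "cumulative": cumulative}
    return info

# The worked example: frequency counts 5, 10, 15 for three target values.
values = ["v1"] * 5 + ["v2"] * 10 + ["v3"] * 15
stats = discrete_training_index(values)
print(stats["v1"]["frequency"], stats["v3"]["cumulative"])  # ~1/6 and ~1.0
```

Note that the cumulative frequency over all target values sums to 1, the sanity check the text describes.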
Furthermore, it should be noted that if a sample data has no feature value under a certain feature type (for example, a sample user has not registered a gender), missing value filling may be performed on it: the missing feature value under that feature type is marked as "empty". This process may be understood as data cleansing of the sample data, so that a later failure to acquire the feature value does not cause a data acquisition failure. Note that if one or more sample data lack a feature value under a certain feature type and are filled in this way, the feature type gains one more target feature value, namely "empty". Since the frequency of the "empty" target feature value can also be calculated, even when some sample data have no feature value under a feature type, the cumulative frequency of the target feature values under that feature type should in general still equal 1; if it does not, the related statistics (such as frequency counts and frequencies) for the target feature values of that feature type are wrong, and the problem needs to be located and the statistics recomputed. By filling missing values, even sample data that have feature values under only some feature types (i.e., when feature value coverage is low) can still be utilized, improving the selection range and utilization rate of the sample data.
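A minimal sketch of the missing value filling step described above, using "empty" as the placeholder target feature value; the record layout and field names are illustrative assumptions:

```python
def fill_missing(samples, feature_type, placeholder="empty"):
    """Mark a sample's missing feature value under one feature type.

    The placeholder becomes an extra target feature value, so later
    statistics (and their cumulative frequency of 1) still cover the sample.
    """
    for sample in samples:
        if sample.get(feature_type) is None:
            sample[feature_type] = placeholder
    return samples

users = [{"gender": "male"}, {"gender": None}, {}]
fill_missing(users, "gender")
print([u["gender"] for u in users])  # -> ['male', 'empty', 'empty']
```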
Therefore, the feature value frequency count, frequency, and cumulative frequency of each target feature value calculated as described above may be used as the training index information corresponding to the ith feature type (i.e., the training index information of the sample data set under the ith feature type).
It will be appreciated that for a continuous feature type, corresponding frequency counts, frequencies, and cumulative frequencies may also be calculated; however, they are calculated not per target feature value category but per segment, i.e., per bin. For age, for example, 0-50 years may be calculated as one bin and 50-100 years as another. Furthermore, for each bin one may calculate the frequency count of the sample data it contains, the frequency count of good samples (for example, sample data carrying normal user tags), the frequency count of bad samples (for example, sample data carrying abnormal user tags), and the proportion (i.e., frequency) of bad samples in the bin; these may likewise be used as the training index information corresponding to the continuous feature type.
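The per-bin statistics for a continuous feature type can be sketched as below; the "good"/"bad" labels and the bin edges are illustrative assumptions, not the application's actual data layout:

```python
def bin_statistics(samples, edges):
    """Per-bin totals, good/bad frequency counts, and bad-sample frequency.

    samples: (feature_value, label) pairs, label "good" or "bad".
    edges:   bin boundaries, e.g. [0, 50, 100] -> bins [0, 50) and [50, 100).
    """
    stats = [{"total": 0, "good": 0, "bad": 0} for _ in range(len(edges) - 1)]
    for value, label in samples:
        for k in range(len(edges) - 1):
            if edges[k] <= value < edges[k + 1]:
                stats[k]["total"] += 1
                stats[k][label] += 1
                break
    for s in stats:
        s["bad_rate"] = s["bad"] / s["total"] if s["total"] else 0.0
    return stats

ages = [(20, "good"), (30, "bad"), (60, "bad"), (70, "bad")]
stats = bin_statistics(ages, [0, 50, 100])
print(stats[0])  # bin [0, 50): 2 samples, 1 bad -> bad_rate 0.5
```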
Optionally, in the present application, if a sample data in the sample data set has no feature value under any of the N feature types, the sample data may be filtered out of the sample data set, and the filtered sample data set may then be used when generating the training index information corresponding to each feature type.
The sample data set may be used as an observation group, and for statistical information such as frequency counts, frequencies, and cumulative frequencies (all belonging to the training index information), the server may obtain the statistical information of a comparison group, e.g., all users in the transaction market, in the same manner as for the observation group (i.e., obtain the comparison group's training index information). The server can then send the statistical information of the observation group and of the comparison group to the client, and the client can display the two side by side for comparison, so that a measurer can analyze the observation group's statistics more deeply and intuitively against the comparison group's; this also helps the user select the feature type for the model from the N feature types.
Second way of generating training index information: for a continuous feature type, such as the age feature type, the M sample data may be binned (i.e., segmented) by that feature type. The binning manner may be a default binning manner set by the system or a user-defined binning manner; either may be, for example, equal-distance binning or equal-frequency binning of the M sample data. In the user-defined manner, the user may input one or more binning interval points through the client, and the M sample data may then be binned by those interval points.
The N feature types may include a jth feature type, where j is a positive integer less than or equal to N, and the jth feature type may be any one of the N feature types that is continuous. If the jth feature type is the continuous age type, the feature value of each of the M sample data under the jth feature type may be an age, with one sample data corresponding to one age. The M sample data may then be binned, for example, every 30 consecutive years according to each sample data's age: sample data with ages 1 to 30 form one bin, ages 31 to 60 another, ages 61 to 90 another, and so on. It is thus understood that equal-distance binning may refer to binning at the same age interval (e.g., every 30 years), and equal-frequency binning may refer to binning such that the frequency count (i.e., the number) of sample data contained in each bin is the same.
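Equal-distance and equal-frequency binning as described above can be sketched as follows; this is a simplified illustration, since the application leaves the exact binning manner to the system default or the user's definition:

```python
def equal_distance_edges(values, n_bins):
    """Equal-width bin boundaries (e.g. one bin per 30 consecutive years)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(n_bins)] + [hi]

def equal_frequency_bins(values, n_bins):
    """Bins holding (as near as possible) the same number of sample data."""
    ordered = sorted(values)
    size, rem = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for k in range(n_bins):
        end = start + size + (1 if k < rem else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

ages = [18, 22, 25, 31, 40, 55, 61, 70, 75]
print(equal_distance_edges(ages, 3))   # -> [18.0, 37.0, 56.0, 75]
print(equal_frequency_bins(ages, 3))   # three bins of three ages each
```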
The bins into which the M sample data are divided by the jth feature type may be referred to as sample data bins; their number may be K1, where K1 is a positive integer less than or equal to M whose specific value is determined by the actual application scenario. The M sample data may include negative sample data and positive sample data: negative sample data are sample data carrying a negative sample tag (such as the abnormal user tag above), positive sample data are sample data carrying a positive sample tag (such as the normal user tag above), and both kinds of tag may be added to the sample data by the user. Therefore, the server may obtain the negative sample frequency count (i.e., the number of negative sample data) and the negative sample frequency (i.e., the proportion of negative sample data) in each of the K1 sample data bins; one sample data bin may correspond to one negative sample frequency. For example, if a sample data bin includes 10 sample data of which 6 are negative sample data, the negative sample frequency corresponding to that bin is equal to 6/10.
Therefore, the frequency variation trend between the negative sample frequencies respectively corresponding to the K1 sample data bins can be obtained from those frequencies. The trend can be embodied by a frequency variation curve that passes through (i.e., includes) the negative sample frequencies corresponding to the K1 bins; in other words, the curve can be drawn from those frequencies. Optionally, if the negative sample frequencies of several consecutive sample data bins among the K1 bins (the number of consecutive bins being determined by the actual application scenario) increase or decrease monotonically in sequence, the frequency variation trend may further include that increasing or decreasing trend between those bins' negative sample frequencies. The frequency variation trend between the negative sample frequencies corresponding to the K1 sample data bins can then be used as the training index information under the jth feature type.
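The frequency variation trend between consecutive bins' negative sample frequencies can be derived as sketched below; this is an illustrative reading of the trend description above, not the application's implementation, and the rates are hypothetical:

```python
def frequency_trend(bad_rates):
    """Pairwise trend between consecutive bins' negative sample frequencies."""
    trend = []
    for prev, cur in zip(bad_rates, bad_rates[1:]):
        if cur > prev:
            trend.append("increasing")
        elif cur < prev:
            trend.append("decreasing")
        else:
            trend.append("flat")
    return trend

# Hypothetical negative sample frequencies for four consecutive bins.
rates = [0.05, 0.10, 0.30, 0.25]
print(frequency_trend(rates))  # -> ['increasing', 'increasing', 'decreasing']
```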
A third way of generating training index information: the N feature types may include an s-th feature type, where s is a positive integer less than or equal to N, and the s-th feature type may be any one of the N feature types that is continuous. Again, the M sample data may include several negative sample data and several positive sample data. Similarly, the server may divide (i.e., bin) the M sample data according to the feature value of each sample data under the s-th feature type to obtain K2 sample data bins corresponding to the s-th feature type, where K2 is a positive integer less than or equal to M; the binning manner is determined by the actual application scenario, such as the system's default binning manner or a user-defined binning manner.
Furthermore, the server may obtain the number of negative sample data (which may be referred to as the negative sample number) and the number of positive sample data (which may be referred to as the positive sample number) contained in each of the K2 sample data bins; one sample data bin may correspond to one negative sample number and one positive sample number. The server may further calculate the total number of negative sample data and the total number of positive sample data contained in the M sample data. The server may then calculate the sample distinguishing weight (i.e., the Weight of Evidence, or WOE value) corresponding to each sample data bin from that bin's negative and positive sample numbers and the total numbers of negative and positive sample data contained in the sample data set, see the following formula (1):
WOE_f = ln( (bad_f / bad_Y) / (good_f / good_Y) )    (1)

The f-th sample data bin may refer to any one of the K2 sample data bins, and f is a positive integer less than or equal to K2. WOE_f refers to the sample distinguishing weight of the f-th sample data bin, bad_f indicates the number of negative sample data contained in the f-th sample data bin, good_f indicates the number of positive sample data contained in the f-th sample data bin, bad_Y represents the total number of negative sample data contained in the sample data set, and good_Y represents the total number of positive sample data contained in the sample data set. The WOE value can be used to evaluate how likely the negative sample data are to fall in each sample data bin: the higher the WOE value, the higher the probability that the sample data in the corresponding bin are negative sample data; conversely, the smaller the WOE value, the lower that probability.
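The sample distinguishing weight (WOE value) described above can be sketched in code. The standard Weight of Evidence definition ln((bad_f/bad_Y)/(good_f/good_Y)) is assumed, and the counts in the example are hypothetical:

```python
import math

def woe(bad_f, good_f, bad_total, good_total):
    """Weight of Evidence of one bin: ln((bad_f/bad_total) / (good_f/good_total)).

    Higher WOE -> sample data in the bin are more likely negative samples.
    """
    return math.log((bad_f / bad_total) / (good_f / good_total))

# Hypothetical bin: 6 of 20 negative samples and 4 of 80 positive samples.
print(round(woe(6, 4, 20, 80), 4))  # ln(0.3 / 0.05) = ln(6) ≈ 1.7918
```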
After the sample distinguishing weights corresponding to the respective sample data bins are calculated, a sample distinguishing index value (i.e., an Information Value, or IV value) corresponding to the s-th feature type may be calculated from those sample distinguishing weights, see the following formula (2):

IV_s = Σ (from f = 1 to K2) ( bad_f / bad_Y − good_f / good_Y ) × WOE_f    (2)
wherein the characters in formula (2) have the same meanings as the corresponding characters in formula (1).
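Under the definitions above, formulas (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the example counts are made up, not taken from the patent.

```python
import math

def woe_iv(bins, bad_total, good_total):
    """Compute the WOE value of each bin (formula (1)) and the IV value
    of the feature type (formula (2)).

    bins: ordered list of (bad_f, good_f) counts per sample data bin.
    bad_total / good_total: bad_Y and good_Y over the whole sample set.
    """
    woes, iv = [], 0.0
    for bad_f, good_f in bins:
        # WOE_f = ln((bad_f / bad_Y) / (good_f / good_Y))
        woe_f = math.log((bad_f / bad_total) / (good_f / good_total))
        woes.append(woe_f)
        # IV accumulates (bad share - good share) * WOE_f over the bins
        iv += (bad_f / bad_total - good_f / good_total) * woe_f
    return woes, iv

# Example: two bins; 100 negative and 900 positive sample data overall
woes, iv = woe_iv([(80, 300), (20, 600)], 100, 900)
```

A bin holding a disproportionate share of the negative sample data (the first bin above) gets a positive WOE, and the feature's IV is always non-negative.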
More specifically, when a feature type is of the discrete type, its sample distinguishing weights and sample distinguishing index value may be calculated on the same principle as for the s-th feature type above: for a discrete feature type, the sample data sharing one target feature value may be understood as the sample data of one sample data bin. For example, for the gender feature type, the sample data whose feature value is "male" may be understood as belonging to one sample data bin, and the sample data whose feature value is "female" as belonging to another. In other words, the WOE value and IV value described above can be calculated for both discrete and continuous feature types. The IV value may be used to evaluate how well a feature type distinguishes negative sample data from positive sample data: a larger IV value indicates that the corresponding feature type distinguishes negative and positive sample data better, and conversely, a smaller IV value indicates that it distinguishes them worse.
The server may use the calculated sample distinguishing weight of each sample data bin under the s-th feature type, together with the sample distinguishing index value of the s-th feature type, as the training index information corresponding to the s-th feature type.
More specifically, the server may calculate the sample distinguishing index value of each of the N feature types on the same principle as for the s-th feature type. The server may then sort the N feature types in descending order of their sample distinguishing index values to obtain the sorted N feature types; a feature type ranked closer to the front can be understood as being better adapted to the model training scenario, and a feature type better adapted to the model training scenario is more suitable for training the model under that scenario.
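As a small illustration of this descending-order selection, the feature names and IV values below are made up:

```python
# Hypothetical IV values per feature type (illustrative numbers only)
iv_by_feature = {'age': 0.42, 'gender': 0.08, 'amount': 0.31}

# Sort the feature types in descending order of their IV values
ranked = sorted(iv_by_feature, key=iv_by_feature.get, reverse=True)

# Keep the top-L feature types as the ones adapted to the scenario
L = 2
selected = ranked[:L]
```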
Therefore, the server may take the top-ranked L feature types among the sorted N feature types as the feature types adapted to the model training scenario, where L is a positive integer less than or equal to N and the specific value of L is determined by the actual application scenario. Furthermore, if the M sample data are M sample users, the server may further obtain a model to be trained under the model training scenario; for example, if the model training scenario is a scenario for distinguishing abnormal users, the model to be trained is an initial model that needs to be trained for distinguishing abnormal users. The server may also obtain target sample users, the number of which is determined by the actual application scenario. Each target sample user may carry a user label, which may be an abnormal user label or a normal user label, and different target sample users may carry the same or different user labels. The actual user attribute of a target sample user carrying an abnormal user label is the abnormal user attribute, indicating that the target sample user is an abnormal user, for example that its transaction behavior is abnormal; conversely, the actual user attribute of a target sample user carrying a normal user label is the normal user attribute, indicating that the target sample user is a normal user, for example that its transaction behavior is normal. The target sample users may be sample users different from the M sample users.
Further, the server may obtain the feature values of the target sample users under the L feature types and input them into the model to be trained, so as to train the model. The training process may be as follows: the model to be trained predicts the user attribute of a target sample user (an abnormal user attribute or a normal user attribute) from the input feature values of that user under the L feature types; the server obtains a prediction loss function of the model from the difference between the predicted user attribute and the actual user attribute indicated by the user label carried by the target sample user; the model parameters are then corrected through the prediction loss function. This achieves the purpose of training the model to be trained. The trained model (e.g., the model whose parameters have been corrected to convergence) may be called the target model, and the target model can be used to predict the user attribute of a predicted user, where the predicted user may be any user whose transaction behavior needs to be evaluated.
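The predict-measure-correct loop described above can be sketched with a minimal logistic-regression trainer in pure Python. The patent does not fix a model architecture; this is only one possible instantiation, and the feature values are made up:

```python
import math

def train_logistic(samples, labels, epochs=1000, lr=0.5):
    """Minimal gradient-descent training loop: predict the user
    attribute, compare it with the label, correct the parameters."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):      # y = 1: abnormal, 0: normal
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))     # predicted probability of "abnormal"
            err = p - y                        # gradient of the log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# One illustrative feature; users with a large feature value are "abnormal"
samples, labels = [[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1]
w, b = train_logistic(samples, labels)
```

After training, `predict` plays the role of the target model: a probability above a chosen threshold marks the predicted user as an abnormal user.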
For example, the server may obtain the predicted user from the data system, or the predicted user may also be a user submitted by the user at the client, and the server may obtain the feature values of the predicted user in the L feature types, and may input the feature values of the predicted user in the L feature types into the target model, so that the user attribute of the predicted user may be predicted through the target model. Further, if the user attribute of the predicted user obtained through prediction is an abnormal user attribute, the predicted user can be used as an abnormal user, and an early warning operation can be performed on the abnormal user, for example, an abnormal prompt message for the abnormal user can be returned to the client to early warn relevant staff.
Referring to fig. 6, fig. 6 is a schematic view of a model training scenario provided in the present application. As shown in fig. 6, the target sample data may refer to the target sample user, and the feature values of the target sample data in the L feature types may be input into the model to be trained 100g, so that the model to be trained 100g is trained through the feature values of the target sample data in the L feature types, and then the target model 101g is obtained through training. Subsequently, the server may obtain the predicted user, may input the feature values of the predicted user in the L feature types into the target model, and may predict the user attribute 102g of the predicted user through the target model. Further, if the user attribute 102g is an abnormal user attribute, the predicted user may be taken as an abnormal user, and an early warning operation may be performed on the abnormal user.
A fourth way of generating training index information: the N feature types may include a z-th feature type, where z is a positive integer less than or equal to N, and the z-th feature type may be any one of the N feature types belonging to a continuous type. Also, the M sample data may include several negative sample data and several positive sample data. Similarly, the server may divide (i.e., bin) the M sample data according to the eigenvalue of each sample data in the z-th characteristic type to obtain K3 sample data bins corresponding to the z-th characteristic type, where K3 is a positive integer less than or equal to M, and the bin dividing mode is determined according to an actual application scenario, such as a default bin dividing mode of the system or a user-defined bin dividing mode.
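The default and custom binning modes mentioned here, such as equidistant (equal-width) and equal-frequency binning, can be sketched as follows. This is a simplified illustration under those two common conventions, not the patent's actual implementation:

```python
def equidistant_bins(values, k):
    """Split a continuous feature's value range into k equal-width
    intervals (an 'equidistant' binning mode)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # Clamp the top edge so the maximum value lands in the last bin
        idx = min(int((v - lo) / width), k - 1) if width else 0
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding (roughly) the same
    number of sample data each (an 'equal frequency' binning mode)."""
    s = sorted(values)
    size, rem = divmod(len(s), k)
    bins, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < rem else 0)
        bins.append(s[start:end])
        start = end
    return bins
```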
Furthermore, the server may obtain the frequency of negative sample data (which may be referred to as the negative sample frequency) contained in each of the K3 sample data bins, where the negative sample frequency may be the ratio of the number of negative sample data in a bin to the number of all sample data contained in that bin; the server may likewise obtain the frequency of positive sample data (the positive sample frequency) contained in each bin, i.e., the ratio of the number of positive sample data in a bin to the number of all sample data contained in that bin. Since the z-th feature type is continuous, the K3 sample data bins have a definite ordering, so a corresponding model discrimination (i.e., a Kolmogorov-Smirnov statistic, abbreviated as the KS evaluation index) can be calculated from the negative sample frequency and the positive sample frequency of each sample data bin, according to the following formula (3):
ks = max{ |cum(bad_g-rate) − cum(good_g-rate)|, 1 ≤ g ≤ K3 }    (3)
the g sample data split box may be any one of K3 sample data split boxes, cum (bad)g-rate) represents the sum of the negative sample frequency corresponding to the g-th sample data bin and the negative sample frequency of the sample data bin ordered before the g-th sample data bin, cum (good)g-rate) represents the sum of the positive sample frequency corresponding to the g-th sample data bin and the positive sample frequency of the sample data bin ordered before the g-th sample data bin. The KS evaluation index is used for evaluating the distinguishing and predicting capacity of the model, the value range of the KS evaluation index can be 0-1, the larger the KS evaluation index is, the better the distinguishing and predicting capacity of the model is indicated, the larger the characteristic type of the KS evaluation index is used for training the model is, the stronger the distinguishing and predicting capacity of the model obtained by training on positive and negative sample data can be made, otherwise, the smaller the KS evaluation index is, the worse the distinguishing and predicting capacity of the model is indicated, and the smaller the characteristic type of the KS evaluation index is used for training the model, the worse the distinguishing and predicting capacity of the model obtained by training on the positive and negative sample data can be made. The model discrimination obtained by the calculation may be used as training index information corresponding to the z-th feature type, and the server may also calculate model discriminations corresponding to other continuous feature types in the same manner as the model discrimination corresponding to the z-th feature type is calculated.
It can be understood that the 4 ways of generating training index information described above yield 4 types of training index information; one, two, three, or all four of them may be selected, according to the actual application scenario, as the training index information corresponding to a feature type. The server may perform recommendation sorting on the N feature types through the training index information (for example, sorting the N feature types by IV value), and may further generate text description information corresponding to each feature type according to its training index information. The text description information describes the training index information of a feature type in text form, for example: describing in text the feature value frequency of each target feature value of the t-th feature type, the frequency variation trend between the negative sample frequencies of the sample data bins of the j-th feature type, the sample distinguishing index value corresponding to each sample data bin of the s-th feature type, or the model discrimination of the z-th feature type.
The server may further generate an intuitive schematic diagram such as a table schematic diagram, a histogram, or a graph corresponding to each feature type according to the training index information corresponding to each feature type, for example, generate a table schematic diagram corresponding to a frequency of a feature value of each target feature value of the t-th feature type, generate a graph corresponding to a frequency variation trend between negative sample frequencies respectively corresponding to each sample data bin of the j-th feature type, generate a graph corresponding to a sample discrimination index value corresponding to each sample data bin of the s-th feature type, and generate a graph corresponding to a model discrimination of each feature type.
Further, the server may send the obtained index information (the index information indicates training index information of each feature type) such as recommendation ranking, text description information, table schematic, histogram, or graph for the N feature types to the client, so that the client may display the obtained index information on a client interface, and through the index information, a user may clearly see how well each feature type of the N feature types is adapted to the model training scenario, and the index information may be used as a basis for the user to select a feature type adapted to the model training scenario from the N feature types, so that the feature type selected by the user may perform more accurate training on the model in the model training scenario.
Referring to fig. 7a-7b, fig. 7a-7b are schematic views of an interface for displaying a measurement result provided by the present application. As shown in fig. 7a, the training index information (including frequency count (i.e., number) and frequency) of the observation group (i.e., the sample data set) and the training index information of the comparison group are visually compared and displayed on the client interface 100d of the client. As shown in fig. 7a, in an area 101d of the client interface 100d, the frequency counts of the observation group and the comparison group in different sample data bins (e.g., the bin of 0 to 50 and the bin of 51 to 100) are displayed in comparison; the total numbers of the observation group and the comparison group, and the number in each bin, are shown in comparison through a table in the area 102d of the client interface 100d; and the frequency counts and frequencies of the observation group and the comparison group are shown in comparison through a table in the area 103d of the client interface 100d.
As shown in fig. 7b, a histogram 101e in a client interface 104e of the client displays the frequency counts (i.e., the number) of the negative sample data and the positive sample data contained in each sample data bin (including the bin 1, the bin 2, the bin 3, and the bin 4), a curve 102e is also displayed in the histogram 101e, the curve 102e contains the proportion of the negative sample data in each sample data bin, and the curve 102e shows the variation trend between the proportions of the negative sample data in each sample data bin (i.e., the frequency variation trend). The graph 103e in the client interface 104e also shows the magnitude change law of the WOE values of the individual bins.
In addition, the client interface 104e further includes a button 100e (i.e., a button "download data"), and data (such as a histogram 101e, a graph 103e, and the like) displayed in the client interface and training index information corresponding to each feature type can be downloaded through the button 100e, where the downloaded data may be in a word format (an editable text format) or a pdf format (a portable document format), and the like.
The method comprises the steps of firstly, obtaining a sample data set under a model training scene; the sample data set contains M sample data, wherein M is a positive integer; n types of characteristic types to be analyzed can be obtained, and the characteristic value of each sample data under each type of characteristic can be respectively obtained; n is a positive integer; and generating training index information of the sample data set under each feature type according to the feature value of each sample data under each feature type, wherein the training index information is used for assisting in determining the feature type matched with the model training scene from the N feature types. Therefore, the training index information of the sample data set under each feature type can be generated according to the feature value of each sample data under each feature type, the feature type matched with the model training scene can be more accurately selected from the N feature types through the training index information, and then the model under the model training scene can be more accurately trained through the feature type matched with the model training scene, so that the trained model has better model performance.
Referring to fig. 8, fig. 8 is a schematic flow chart of a data processing method provided in the present application. Referring to fig. 8, the method may include:
first, the following steps s1 to s3 may be front-end page operations.
Step s 1: obtaining a sample;
The user is supported to obtain a sample in the client; the sample may be the sample data set described above, and the sample data set may include M sample data.
Step s 2: obtaining characteristics;
The user is supported to obtain feature types in the client; if the user does not submit a customized binning mode, the system default binning mode may be used.
Step s 3: submitting the task to a background for execution:
The user is supported to submit, in the client, a measuring and calculating task for the feature types to the background.
Further, the following steps s4 to s8 may be steps of performing characteristic descriptive statistics in the background.
Step s 4: processing original data;
the server can judge whether the training index information of each feature type needs to be obtained through the feature values of the comparison group under each feature type, wherein the comparison group can be selected by a user in the client, if the user selects the required comparison group in the client, the server can judge that the training index information of each feature type needs to be obtained through the feature values of the comparison group under each feature type, otherwise, if the user does not select the required comparison group in the client, the server can judge that the training index information of each feature type does not need to be obtained through the feature values of the comparison group under each feature type. The process of obtaining the training index information of each feature type by comparing the feature values of the group under each feature type is the same as the process of obtaining the training index information of each feature type by observing the feature values of the group (i.e., the sample data set) under each feature type.
The server may perform feature association: that is, after acquiring the sample data set and the N feature types, the server may associate (i.e., acquire) from the data system the feature values of each sample data under the N feature types. The server may also filter out the IDs of sample data (e.g., the user identifications of sample users) that have no feature value under a feature type. The server may further obtain, for each feature type, a sample data list composed of the sample data having no feature value under that feature type; one feature type may correspond to one sample data list, and if all M sample data have feature values under a certain feature type, that feature type may have no sample data list.
Step s 5: distinguishing variable types;
the variable is a feature type, the server may distinguish which of the N feature types are discrete feature types and which are continuous feature types, and for the discrete feature types, a processing manner of classifying the feature values may be adopted, for example, for the gender feature type, the feature values may be classified into a category of "male" and a category of "female". For the continuous type feature type, a processing manner of binning (i.e., segmenting) the feature value may be adopted.
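The distinction in step s5 can be sketched with a simple heuristic: numeric feature values are treated as continuous, anything else (such as the gender strings) as discrete. This is an illustrative assumption, not the patent's stated rule:

```python
def split_variable_types(feature_values):
    """Sketch of step s5: split feature types into discrete and
    continuous ones.

    feature_values: dict mapping feature-type name -> list of values.
    """
    discrete, continuous = [], []
    for name, values in feature_values.items():
        if all(isinstance(v, (int, float)) for v in values):
            continuous.append(name)   # numeric values -> bin (segment) them
        else:
            discrete.append(name)     # categorical values -> classify them
    return discrete, continuous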
Step s 6: index processing;
the server may perform missing value filling on sample data without a feature value under the feature type, for example, may fill a feature value corresponding to the sample data to "null". Furthermore, the server may calculate statistical information between the characteristic values of the M sample data, where the statistical information may include frequency, cumulative frequency, and frequency variation trend of the negative sample frequency in each bin. The statistical information of the sample data under the discrete feature type and the statistical information of the sample data under the continuous feature type may be calculated separately. One feature type may correspond to one statistical information, and the statistical information may be training index information corresponding to the feature type.
Step s 7: outputting the data;
the server may output statistical information of the calculated sample data under the discrete type feature types, output statistical information of the calculated sample data under the continuous type feature types, and derive a sample data list corresponding to each feature type, where the sample data list includes sample data without a feature value under a corresponding feature type.
Step s 8: uploading the combined result to a distributed storage system (HDFS);
the server may merge the statistical information corresponding to the discrete feature types, the statistical information corresponding to the continuous feature types, and the sample data list corresponding to each feature type, and store the merged data in the distributed storage system, and then, when the data needs to be acquired, the data may be acquired from the distributed storage system.
By obtaining the statistical information corresponding to the discrete feature type and the statistical information corresponding to the continuous feature type, a result 103h may be obtained, where the result 103h may include contents of 5 parts, specifically, contents including descriptive statistics (which may be text description information of the statistical information), a feature distribution diagram (such as a histogram or a graph showing the statistical information), a frequency distribution list (i.e., a table showing the frequency), a frequency variation trend (e.g., a variation trend curve of the frequency of negative sample data in each bin), and a proportion of good and bad samples (e.g., the frequency of the good and bad samples in each bin).
Further, the following steps s9 to s12 may be steps of performing feature validity analysis in the background.
Step s 9: processing original data;
The feature values of the M sample data under each feature type are acquired.
Step s 10: measuring and calculating the feature effectiveness;
also, the discrete type feature type and the continuous type feature type may be calculated separately. The server may also perform missing value filling on sample data for which no feature value exists under the feature type. For the discrete type of feature, the server may calculate the proportion of negative sample data and the proportion of positive sample data in the sample data to which the feature values of different types of the feature type belong, respectively, for example, if the feature type of the discrete type is a gender feature type, the server may calculate the proportion of negative sample data in the sample data with the characteristic value of "male" to all the sample data with the characteristic value of "male", the server may further calculate the proportion of negative sample data in the sample data with the characteristic value of "male" to the sample data set, the server may further calculate the proportion of negative sample data in the sample data with the characteristic value of "female", and the proportion of negative sample data in the sample data with the characteristic value of "female" to the sample data set; the server may further calculate a ratio of positive sample data to all sample data having a feature value of "male" in sample data having a feature value of "male", the server may further calculate a ratio of positive sample data to the sample data set in sample data having a feature value of "male", the server may further calculate a ratio of positive sample data to all sample data having a feature value of "female" in sample data having a feature value of "female", and the server may further calculate a ratio of positive sample data to the sample data set in sample data having a feature value of "female".
For the continuous characteristic type, the server can calculate the proportion of the negative sample data in each sample data sub-box to the sample data in the sub-box to which the negative sample data belongs, the proportion of the negative sample data in each sample data sub-box to the sample data in the sample data set, the proportion of the positive sample data in each sample data sub-box to the sample data in the sub-box to which the positive sample data belongs, and the proportion of the positive sample data in each sample data sub-box to the sample data in the sample data set.
Furthermore, through the various ratios obtained by the above calculation, index values (which may be training index information corresponding to the feature types) such as a WOE value, an IV value, and a KS evaluation index, which are respectively corresponding to each feature type, can be calculated, and the WOE value, the IV value, and the KS evaluation index can be used to evaluate the validity of the corresponding feature type, that is, to evaluate whether the corresponding feature type is adapted to the model training scene.
Step s 11: storing the result;
the server may upload the various proportions and index values calculated in the steps s9 to s12 to the distributed storage system, and may acquire the proportions and index values from the distributed storage system if needed.
By obtaining various proportions and index values, a result 104h can be obtained, where the result 104h can include 5 parts of content, and specifically includes characteristic monotonicity judgment (for example, whether the frequency of the negative sample data is monotonically increased or monotonically decreased among the bins, and a more effective application point can be found by monotonicity judgment, for example, the frequency of the negative sample data is monotonically increased from the application point), a characteristic IV value, a characteristic KS evaluation index, a characteristic WOE value, and the like.
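The "characteristic monotonicity judgment" in result 104h can be sketched as a check on the ordered bins' negative-sample frequencies (an illustrative helper, with made-up rates):

```python
def bad_rate_monotonicity(bad_rates):
    """Judge whether the negative-sample frequency across the ordered
    bins is monotonically increasing, decreasing, constant, or neither."""
    pairs = list(zip(bad_rates, bad_rates[1:]))
    inc = all(a <= b for a, b in pairs)
    dec = all(a >= b for a, b in pairs)
    if inc and dec:
        return 'constant'
    if inc:
        return 'increasing'
    if dec:
        return 'decreasing'
    return 'none'
```

A strictly increasing (or decreasing) trend suggests an effective application point, since the risk signal moves consistently across the bins.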
The steps (i.e., the steps s4 to s8) for performing the characteristic descriptive statistics and the steps (i.e., the steps s9 to s12) for performing the characteristic validity analysis may be executed in parallel, and the steps of the two parts are not sequential and do not affect each other. Optionally, according to the user requirement, if the user only needs the feature descriptive statistics, only the step s4 to the step s8 may be performed, and the step s9 to the step s12 are not performed; if the user only needs feature validity, only the steps s9 to s12 may be performed, and the steps s4 to s8 may not be performed; if the user requires both feature descriptive statistics and feature validity, steps s 4-s 12 may be performed. The server can send the obtained result 103h and/or result 104h to the client, the client displays the result 103h and/or result 104h to the user on a client interface, the result 103h and/or result 104h is obtained through training index information corresponding to each feature type, and the user can be guided to select the feature type for model training through the result 103h and/or result 104h, namely the feature type effective for model training is obtained.
Further, the following code explains the relevant steps involved in the present application:
for the related step flow of the above feature descriptive statistics, the flow constructs several different functions for the variable type (discrete variable) and the binning result (such as equidistant binning, equal frequency binning or custom binning), and takes the processing flow of the discrete variable (i.e. discrete feature type) as an example for code description, please refer to the following:
The function for calculating the frequency count, frequency, and cumulative frequency of a discrete variable is as follows:
def discri_var_distribute(df_feature, var_name, result_path):
missing value filling: missing values in the original data may be filled in using fillna ('missing');
counting frequency: the frequency counts may be obtained with df_discri = df_feature[var_name].value_counts(sort=False), and checked with print(df_discri);
calculating the frequency: the frequency may be calculated by dividing each bin's count by the total, e.g. df_discri['sub_total_num_percentage'] = df_discri['sub_total_num'] / df_discri['sub_total_num'].sum();
cumulative frequency calculation: the cumulative frequency may be calculated as df_discri['sub_total_num_percentage_add'] = df_discri['sub_total_num_percentage'].cumsum(), and the result may be checked with print(df_discri);
The result data (such as frequency counts and frequencies) obtained by the above calculation can be written into a local csv (comma-separated values) file in append mode, as in the following code:
df_discri.to_csv(result_path, mode='a', header=False, sep='\t', index=False)
For the above feature validity step flow, for example, when calculating the WOE values of discrete variables, in order to avoid type errors caused by some numerical discrete variables, the discrete variables may be forcibly converted to the str data type (a string data type) for processing, as in the following code:
for var in cfg.discrete_var_list:
    df_feature[var] = df_feature[var].astype(str)

This code forces each discrete variable to be of the str data type (astype(str) being the idiomatic pandas conversion).
The method comprises the steps of firstly, obtaining a sample data set under a model training scene; the sample data set contains M sample data, wherein M is a positive integer; n types of characteristic types to be analyzed can be obtained, and the characteristic value of each sample data under each type of characteristic can be respectively obtained; n is a positive integer; and generating training index information of the sample data set under each feature type according to the feature value of each sample data under each feature type, wherein the training index information is used for assisting in determining the feature type matched with the model training scene from the N feature types. Therefore, the training index information of the sample data set under each feature type can be generated according to the feature value of each sample data under each feature type, the feature type matched with the model training scene can be more accurately selected from the N feature types through the training index information, and then the model under the model training scene can be more accurately trained through the feature type matched with the model training scene, so that the trained model has better model performance.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus provided in the present application. The data processing apparatus may be a computer program (including program code) running on a computer device; for example, the data processing apparatus may be application software, and it may be configured to execute the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 9, the data processing apparatus 1 may include: a sample acquisition module 101, a feature acquisition module 102, and an index generation module 103;
a sample obtaining module 101, configured to obtain a sample data set in a model training scenario; the sample data set contains M sample data, wherein M is a positive integer;
the feature obtaining module 102 is configured to obtain N types of features to be analyzed, and obtain a feature value of each sample data in each type of features; n is a positive integer;
the index generating module 103 is configured to generate training index information of the sample data set in each feature type according to a feature value of each sample data in each feature type, where the training index information is used to assist in determining a feature type adapted to a model training scenario from among the N feature types.
Optionally, the N feature types include an ith feature type, where i is a positive integer less than or equal to N; the ith characteristic type has t target characteristic values, and t is a positive integer;
the mode of the index generation module 103 generating training index information of the sample data set in each feature type according to the feature value of each sample data in each feature type includes:
determining, according to the characteristic value of each sample data under the ith characteristic type, the characteristic value frequency corresponding to each target characteristic value among the t target characteristic values;
and generating training index information of the sample data set under the ith characteristic type according to the frequency of the characteristic value corresponding to each target characteristic value.
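For the frequency-based option above, the characteristic value frequency of each target characteristic value can be obtained directly with pandas; the feature values and counts in this sketch are hypothetical:

```python
import pandas as pd

# Hypothetical feature values of M = 6 sample data under the ith feature
# type, with t = 3 target feature values: "low", "mid", "high".
values = pd.Series(["low", "mid", "mid", "high", "low", "mid"])

# Feature value frequency per target feature value (absolute counts),
# and the frequency ratio over all M sample data.
freq = values.value_counts()
freq_ratio = values.value_counts(normalize=True)

print(freq.to_dict())        # {'mid': 3, 'low': 2, 'high': 1}
print(freq_ratio.to_dict())
```

The counts and ratios together are one plausible form of the training index information for this feature type.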
Optionally, the M sample data include negative sample data and positive sample data; the N characteristic types comprise a jth characteristic type, and j is a positive integer less than or equal to N;
the mode of the index generation module 103 generating training index information of the sample data set in each feature type according to the feature value of each sample data in each feature type includes:
dividing M sample data according to the characteristic value of each sample data under the jth characteristic type to obtain K1 sample data sub-boxes corresponding to the jth characteristic type; k1 is a positive integer less than or equal to M;
respectively obtaining negative sample frequencies of negative sample data contained in K1 sample data sub-boxes;
determining the frequency variation trend among the negative sample frequencies respectively corresponding to the K1 sample data sub-boxes according to the negative sample frequencies respectively corresponding to the K1 sample data sub-boxes;
and determining the frequency variation trend as training index information of the sample data set under the jth characteristic type.
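A minimal sketch of this binning-and-trend step, under the assumptions that equal-frequency binning (pd.qcut) is used and that a label of 1 marks a negative sample; the data and names here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a numeric feature and a binary label
# (1 = negative sample), built so the negative frequency rises with
# the feature value.
df = pd.DataFrame({"feature": np.arange(100)})
df["label"] = (df["feature"] >= 60).astype(int)

# Divide the M sample data into K1 = 5 equal-frequency sample data sub-boxes.
df["bin"] = pd.qcut(df["feature"], q=5, labels=False)

# Negative sample frequency of each sub-box.
neg_freq = df.groupby("bin")["label"].mean()

# Frequency variation trend across the sub-boxes.
diffs = np.diff(neg_freq.to_numpy())
trend = ("rising" if (diffs >= 0).all()
         else "falling" if (diffs <= 0).all()
         else "mixed")
print(neg_freq.to_list(), trend)
```

A monotonic trend across the sub-boxes is commonly read as the feature separating negative from positive samples consistently.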
Optionally, the M sample data include negative sample data and positive sample data; the N characteristic types comprise an s characteristic type, wherein s is a positive integer less than or equal to N;
the mode of the index generation module 103 generating training index information of the sample data set in each feature type according to the feature value of each sample data in each feature type includes:
dividing M sample data according to the characteristic value of each sample data under the s type of characteristic type to obtain K2 sample data sub-boxes corresponding to the s type of characteristic type; k2 is a positive integer less than or equal to M;
respectively acquiring the number of negative samples of the negative sample data contained in K2 sample data sub-boxes;
respectively acquiring the number of positive samples of positive sample data contained in K2 sample data sub-boxes;
and determining training index information of the sample data set under the s-th characteristic type according to the quantity of the negative samples and the quantity of the positive samples respectively corresponding to the K2 sample data sub-boxes.
Optionally, the method for determining training index information of the sample data set in the s-th feature type by the index generating module 103 according to the negative sample number and the positive sample number respectively corresponding to the K2 sample data sub-boxes includes:
determining sample distinguishing weights corresponding to K2 sample data sub-boxes respectively according to the quantity of negative sample data and positive sample data corresponding to the K2 sample data sub-boxes respectively;
determining a sample distinguishing index value corresponding to the s-th characteristic type according to the number of negative samples, the number of positive samples and the sample distinguishing weight respectively corresponding to the K2 sample data sub-boxes;
and determining the sample distinguishing index value corresponding to the s-th characteristic type as training index information of the sample data set under the s-th characteristic type.
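Read together, these steps match the standard woe/IV construction: the sample distinguishing weight of a sub-box is its woe value, and the sample distinguishing index value is the information value (IV) summed over sub-boxes. Whether the embodiment uses exactly this form is an assumption; a sketch with hypothetical sub-box counts:

```python
import numpy as np
import pandas as pd

# Hypothetical negative / positive sample counts of K2 = 4 sample data sub-boxes.
bins = pd.DataFrame({
    "neg": [5, 10, 20, 40],
    "pos": [40, 30, 20, 10],
})
neg_total = bins["neg"].sum()  # 75
pos_total = bins["pos"].sum()  # 100

# Sample distinguishing weight per sub-box (woe):
# woe_i = ln((neg_i / neg_total) / (pos_i / pos_total)).
bins["woe"] = np.log((bins["neg"] / neg_total) / (bins["pos"] / pos_total))

# Sample distinguishing index value for the feature type (IV):
# IV = sum_i (neg_i / neg_total - pos_i / pos_total) * woe_i.
iv = ((bins["neg"] / neg_total - bins["pos"] / pos_total) * bins["woe"]).sum()
print(round(iv, 4))
```

Each IV term is non-negative, so a larger IV indicates a feature type that separates negative and positive samples more strongly.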
Optionally, the apparatus 1 further includes:
the sorting module 104 is configured to sort the N feature types according to a descending order of the sample distinguishing index values of the sample data set under each feature type, so as to obtain N sorted feature types;
the feature selection module 105 is configured to determine the top L feature types of the sequenced N feature types as feature types adapted to the model training scenario; l is a positive integer less than or equal to N.
Optionally, the M sample data are M sample users;
the above apparatus 1 further comprises: a model acquisition module 106, a target sample acquisition module 107 and a model training module 108;
the model obtaining module 106 is configured to obtain a model to be trained in a model training scene;
a target sample obtaining module 107, configured to obtain a target sample user and obtain feature values of the target sample user under the L feature types; the target sample user carries a user label, and the user label is an abnormal user label or a normal user label;
the model training module 108 is used for training a model to be trained according to the feature values of the target sample user under the L feature types and the user label carried by the target sample user to obtain a target model;
the target model is used for distinguishing and predicting the user attribute of the predicted user, and the user attribute is a normal user attribute or an abnormal user attribute.
Optionally, the apparatus 1 further includes: a predicted user obtaining module 109, a feature prediction module 110 and an early warning module 111;
a predicted user obtaining module 109, configured to obtain a predicted user, and obtain feature values of the predicted user in L feature types;
the feature prediction module 110 is configured to input the feature values of the predicted user under the L feature types into the target model, and predict the user attribute of the predicted user in the target model;
and the early warning module 111 is configured to determine the predicted user as an abnormal user if the user attribute of the predicted user is the abnormal user attribute, and perform an early warning operation on the abnormal user.
Optionally, the M sample data include negative sample data and positive sample data; the N characteristic types comprise a z characteristic type, and z is a positive integer less than or equal to N;
the mode of the index generation module 103 generating training index information of the sample data set in each feature type according to the feature value of each sample data in each feature type includes:
dividing M sample data according to the characteristic value of each sample data under the z-th characteristic type to obtain K3 sample data sub-boxes corresponding to the z-th characteristic type; k3 is a positive integer less than or equal to M;
respectively obtaining negative sample frequencies of negative sample data contained in K3 sample data sub-boxes;
respectively acquiring positive sample frequencies of positive sample data contained in K3 sample data sub-boxes;
according to the negative sample frequency and the positive sample frequency respectively corresponding to the K3 sample data sub-boxes, obtaining the model discrimination corresponding to the z-th characteristic type;
and determining the model discrimination as the training index information of the sample data set under the z-th characteristic type.
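One common realization of such a "model discrimination" measure computed from per-sub-box negative and positive sample frequencies is the KS statistic; that the embodiment uses exactly this form is an assumption. A sketch with hypothetical sub-box counts:

```python
import pandas as pd

# Hypothetical per-sub-box counts for K3 = 5 sample data sub-boxes,
# ordered by ascending feature value.
bins = pd.DataFrame({
    "neg": [2, 4, 8, 16, 30],
    "pos": [30, 25, 20, 15, 10],
})

# Negative / positive sample frequency of each sub-box.
neg_rate = bins["neg"] / bins["neg"].sum()
pos_rate = bins["pos"] / bins["pos"].sum()

# KS-style model discrimination: maximum gap between the cumulative
# negative and cumulative positive distributions.
ks = (neg_rate.cumsum() - pos_rate.cumsum()).abs().max()
print(round(ks, 4))
```

A larger gap means the feature type concentrates negative and positive samples in different value ranges, i.e. discriminates better.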
Optionally, the mode of acquiring the sample data set in the model training scenario by the sample acquisition module 101 includes:
acquiring a sample data set in a sample data system;
the manner in which the feature obtaining module 102 obtains the feature value of each sample data under each feature type respectively includes:
and respectively acquiring the characteristic value of each sample data under each characteristic type in the metadata system based on an access interface between the sample data system and the metadata system.
Optionally, the mode of acquiring the sample data set in the model training scenario by the sample acquisition module 101 includes:
acquiring a sample acquisition request sent by a client; the sample acquisition request carries a sample retrieval field;
acquiring a sample data set according to the sample retrieval field;
the above-described device 1 is also used for:
and transmitting the training index information of the sample data set under each characteristic type to the client so that the client displays the training index information of the sample data set under each characteristic type on a client interface.
According to an embodiment of the present application, the steps involved in the data processing method shown in fig. 3 may be performed by respective modules in the data processing apparatus 1 shown in fig. 9. For example, step S101 shown in fig. 3 may be performed by the sample acquisition module 101 in fig. 9, and step S102 shown in fig. 3 may be performed by the feature acquisition module 102 in fig. 9; step S103 shown in fig. 3 may be performed by the index generation module 103 in fig. 9.
In the embodiment of the present application, a sample data set in a model training scenario is first acquired; the sample data set contains M sample data, where M is a positive integer. N feature types to be analyzed can be acquired, and the feature value of each sample data under each feature type can be acquired respectively, where N is a positive integer. Training index information of the sample data set under each feature type is then generated according to the feature value of each sample data under each feature type, and this training index information is used to assist in determining, from among the N feature types, the feature type adapted to the model training scenario. In this way, the feature type adapted to the model training scenario can be selected from the N feature types more accurately through the training index information, the model in the model training scenario can in turn be trained more accurately with the adapted feature types, and the trained model therefore has better model performance.
According to an embodiment of the present application, the modules of the data processing apparatus 1 shown in fig. 9 may be separately or wholly combined into one or several units, or one (or more) of the units may be further split into multiple functionally smaller sub-units; either way, the same operations can be implemented without affecting the technical effects of the embodiments of the present application. The above modules are divided based on logical functions; in practical applications, the function of one module may be implemented by multiple units, or the functions of multiple modules may be implemented by one unit. In other embodiments of the present application, the data processing apparatus 1 may also include other units; in practical applications, these functions may also be implemented with the assistance of, or in cooperation among, multiple units.
According to an embodiment of the present application, the data processing apparatus 1 shown in fig. 9 may be constructed by running a computer program (including program code) capable of executing the steps of the corresponding method shown in fig. 3 on a general-purpose computer device, such as a computer including a central processing unit (CPU), a random access memory (RAM), a read-only memory (ROM), and other processing and storage elements, thereby implementing the data processing method of the embodiment of the present application. The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the computing device via that medium, and executed therein.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device provided in the present application. As shown in fig. 10, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; furthermore, the computer device 1000 may also include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication among these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 10, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 10, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring a sample data set under a model training scene; the sample data set contains M sample data, wherein M is a positive integer;
acquiring N types of feature types to be analyzed, and respectively acquiring a feature value of each sample data under each type of feature; n is a positive integer;
and generating training index information of the sample data set under each feature type according to the feature value of each sample data under each feature type, wherein the training index information is used for assisting in determining the feature type matched with the model training scene from the N feature types.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the data processing method in the embodiment corresponding to fig. 3, and may also perform the description of the data processing apparatus 1 in the embodiment corresponding to fig. 9, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: the present application further provides a computer-readable storage medium, and the computer-readable storage medium stores the aforementioned computer program executed by the data processing apparatus 1, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the data processing method in the embodiment corresponding to fig. 3 can be performed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer storage medium referred to in the present application, reference is made to the description of the embodiments of the method of the present application.
By way of example, the program instructions described above may be executed on one computer device, or on multiple computer devices located at one site, or distributed across multiple sites and interconnected by a communication network, which may comprise a blockchain network.
The computer readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash card (flash card), and the like, provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
A computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device performs the description of the data processing method in the embodiment corresponding to fig. 3, which is described above, and therefore, the description thereof will not be repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application.
The terms "first," "second," and the like in the description, claims, and drawings of the embodiments of the present application are used to distinguish different objects, not to describe a particular order. Furthermore, the term "comprises" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, apparatus, or product that comprises a series of steps or modules is not limited to the listed steps or modules, but may further include steps or modules that are not listed or that are inherent to such a process, method, apparatus, or product.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure is only a preferred embodiment of the present application and certainly cannot be taken to limit the scope of the claims of the present application; therefore, equivalent variations made according to the claims of the present application still fall within the scope covered by the present application.
Claims (15)
1. A method of data processing, the method comprising:
acquiring a sample data set under a model training scene; the sample data set comprises M sample data, wherein M is a positive integer;
acquiring N types of feature types to be analyzed, and respectively acquiring a feature value of each sample data under each type of feature; n is a positive integer;
and generating training index information of the sample data set under each feature type according to the feature value of each sample data under each feature type, wherein the training index information is used for assisting in determining the feature type matched with the model training scene from the N feature types.
2. The method of claim 1, wherein the N feature types include an ith feature type, i being a positive integer less than or equal to N; the ith characteristic type has t target characteristic values, and t is a positive integer;
the generating, according to the feature value of each sample data in each feature type, training index information of the sample data set in each feature type includes:
according to the characteristic value of each sample data under the ith characteristic type, determining the frequency of the characteristic value corresponding to each target characteristic value in the t target characteristic values;
and generating training index information of the sample data set under the ith characteristic type according to the frequency of the characteristic value corresponding to each target characteristic value.
3. The method of claim 1, wherein the M sample data comprises negative and positive sample data; the N characteristic types comprise a jth characteristic type, and j is a positive integer less than or equal to N;
the generating, according to the feature value of each sample data in each feature type, training index information of the sample data set in each feature type includes:
dividing the M sample data according to the characteristic value of each sample data under the jth characteristic type to obtain K1 sample data sub-boxes corresponding to the jth characteristic type; k1 is a positive integer less than or equal to M;
respectively acquiring negative sample frequencies of the negative sample data included in the K1 sample data sub-boxes;
determining the frequency variation trend among the negative sample frequencies respectively corresponding to the K1 sample data sub-boxes according to the negative sample frequencies respectively corresponding to the K1 sample data sub-boxes;
and determining the frequency variation trend as training index information of the sample data set under the jth characteristic type.
4. The method of claim 1, wherein the M sample data comprises negative and positive sample data; the N characteristic types comprise an s characteristic type, and s is a positive integer less than or equal to N;
the generating, according to the feature value of each sample data in each feature type, training index information of the sample data set in each feature type includes:
dividing the M sample data according to the characteristic value of each sample data under the s-th characteristic type to obtain K2 sample data sub-boxes corresponding to the s-th characteristic type; k2 is a positive integer less than or equal to M;
respectively acquiring the number of negative samples of the negative sample data in the K2 sample data sub-boxes;
respectively acquiring the number of positive samples of the positive sample data contained in the K2 sample data sub-boxes;
and determining training index information of the sample data set under the s-th characteristic type according to the quantity of the negative samples and the quantity of the positive samples respectively corresponding to the K2 sample data sub-boxes.
5. The method according to claim 4, wherein the determining training index information of the sample data set in the s-th feature type according to the number of negative samples and the number of positive samples respectively corresponding to the K2 sample data bins comprises:
determining sample distinguishing weights respectively corresponding to the K2 sample data sub-boxes according to the quantity of negative sample data and positive sample data respectively corresponding to the K2 sample data sub-boxes;
determining a sample distinguishing index value corresponding to the s-th characteristic type according to the number of negative samples, the number of positive samples and the sample distinguishing weight respectively corresponding to the K2 sample data sub-boxes;
and determining the sample distinguishing index value corresponding to the s-th feature type as training index information of the sample data set under the s-th feature type.
6. The method of claim 5, further comprising:
sorting the N characteristic types according to the descending order of the sample distinguishing index values of the sample data set under each characteristic type to obtain the sorted N characteristic types;
determining the first L feature types in the sequenced N feature types as the feature types matched with the model training scene; l is a positive integer less than or equal to N.
7. The method of claim 6, wherein said M sample data are M sample users;
the method further comprises the following steps:
obtaining a model to be trained under the model training scene;
acquiring a target sample user, and acquiring the characteristic value of the target sample user under the L characteristic types; the target sample user carries a user tag; the user label is an abnormal user label or a normal user label;
training the model to be trained according to the characteristic values of the target sample user under the L characteristic types and the user label carried by the target sample user to obtain a target model;
the target model is used for distinguishing and predicting the user attribute of the predicted user, and the user attribute is a normal user attribute or an abnormal user attribute.
8. The method of claim 7, further comprising:
acquiring the predicted user and acquiring the characteristic value of the predicted user under the L characteristic types;
inputting the characteristic values of the predicted user under the L characteristic types into the target model, and predicting the user attribute of the predicted user in the target model;
and if the user attribute of the predicted user is the abnormal user attribute, determining the predicted user as the abnormal user, and performing early warning operation on the abnormal user.
9. The method of claim 1, wherein the M sample data comprises negative and positive sample data; the N characteristic types comprise a z characteristic type, and z is a positive integer less than or equal to N;
the generating, according to the feature value of each sample data in each feature type, training index information of the sample data set in each feature type includes:
dividing the M sample data according to the characteristic value of each sample data under the z-th characteristic type to obtain K3 sample data sub-boxes corresponding to the z-th characteristic type; k3 is a positive integer less than or equal to M;
respectively acquiring negative sample frequencies of the negative sample data included in the K3 sample data sub-boxes;
respectively acquiring positive sample frequencies of positive sample data included in the K3 sample data sub-boxes;
obtaining the model discrimination corresponding to the z-th characteristic type according to the negative sample frequency and the positive sample frequency respectively corresponding to the K3 sample data sub-boxes;
and determining the model discrimination as the training index information of the sample data set under the z-th characteristic type.
10. The method of claim 1, wherein the obtaining of the sample data set in a model training scenario comprises:
acquiring the sample data set in a sample data system;
the respectively obtaining the feature value of each sample data under each feature type includes:
and respectively acquiring the characteristic value of each sample data under each characteristic type in the metadata system based on an access interface between the sample data system and the metadata system.
11. The method of claim 1, wherein the obtaining of the sample data set in a model training scenario comprises:
acquiring a sample acquisition request sent by a client; the sample acquisition request carries a sample retrieval field;
acquiring the sample data set according to the sample retrieval field;
the method further comprises the following steps:
and sending the training index information of the sample data set under each characteristic type to the client, so that the client displays the training index information of the sample data set under each characteristic type on a client interface.
12. A data processing apparatus, characterized in that the apparatus comprises:
the sample acquisition module is used for acquiring a sample data set in a model training scene; the sample data set comprises M sample data, wherein M is a positive integer;
the characteristic acquisition module is used for acquiring N types of characteristic types to be analyzed and respectively acquiring a characteristic value of each sample data under each type of characteristic; n is a positive integer;
and the index generation module is used for generating training index information of the sample data set under each characteristic type according to the characteristic value of each sample data under each characteristic type, wherein the training index information is used for assisting in determining the characteristic type matched with the model training scene from the N characteristic types.
14. The apparatus according to claim 12, wherein the M sample data comprise negative sample data and positive sample data; the N characteristic types comprise a j-th characteristic type, and j is a positive integer less than or equal to N;
the index generation module is further configured to:
dividing the M sample data according to the characteristic value of each sample data under the j-th characteristic type to obtain K1 sample data sub-boxes corresponding to the j-th characteristic type; K1 is a positive integer less than or equal to M;
respectively acquiring negative sample frequencies of the negative sample data included in the K1 sample data sub-boxes;
determining the frequency variation trend across the K1 sample data sub-boxes according to the negative sample frequencies respectively corresponding to the K1 sample data sub-boxes;
and determining the frequency variation trend as the training index information of the sample data set under the j-th characteristic type.
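The frequency variation trend of claim 14 — how the negative sample frequency behaves across ordered sub-boxes — can be sketched as a simple monotonicity check. The three-way classification (increasing / decreasing / non-monotonic) is an assumption about what "trend" means here; the claim itself does not enumerate the possible trend values.

```python
def frequency_trend(neg_freqs):
    """Classify the trend of negative sample frequencies across ordered bins."""
    diffs = [b - a for a, b in zip(neg_freqs, neg_freqs[1:])]
    if all(d >= 0 for d in diffs):
        return "monotonically increasing"
    if all(d <= 0 for d in diffs):
        return "monotonically decreasing"
    return "non-monotonic"

# Bins ordered by characteristic value; a monotonic trend suggests the
# characteristic type separates negative samples consistently.
print(frequency_trend([0.05, 0.12, 0.30]))
```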
14. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 11.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program adapted to be loaded by a processor to perform the method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110396801.4A CN113705072A (en) | 2021-04-13 | 2021-04-13 | Data processing method, data processing device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113705072A true CN113705072A (en) | 2021-11-26 |
Family
ID=78648005
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110396801.4A Pending CN113705072A (en) | 2021-04-13 | 2021-04-13 | Data processing method, data processing device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113705072A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114444576A (en) * | 2021-12-30 | 2022-05-06 | 北京达佳互联信息技术有限公司 | Data sampling method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3985578A1 (en) | Method and system for automatically training machine learning model | |
CN110223168B (en) | Label propagation anti-fraud detection method and system based on enterprise relationship map | |
EP3522078A1 (en) | Explainable artificial intelligence | |
US20170109657A1 (en) | Machine Learning-Based Model for Identifying Executions of a Business Process | |
CN106096657B (en) | Based on machine learning come the method and system of prediction data audit target | |
US20170109676A1 (en) | Generation of Candidate Sequences Using Links Between Nonconsecutively Performed Steps of a Business Process | |
US20170109668A1 (en) | Model for Linking Between Nonconsecutively Performed Steps in a Business Process | |
CN107908606A (en) | Method and system based on different aforementioned sources automatic report generation | |
US20170109667A1 (en) | Automaton-Based Identification of Executions of a Business Process | |
Viscusi et al. | Digital information asset evaluation: Characteristics and dimensions | |
CN108885628A (en) | Data analysing method candidate's determination device | |
CN111882420A (en) | Generation method of response rate, marketing method, model training method and device | |
Sood et al. | Hybridization of cluster-based LDA and ANN for student performance prediction and comments evaluation | |
CN108268624A (en) | User data method for visualizing and system | |
US20170109638A1 (en) | Ensemble-Based Identification of Executions of a Business Process | |
CN113537807A (en) | Enterprise intelligent wind control method and device | |
CN113095408A (en) | Risk determination method and device and server | |
CN112631889A (en) | Portrayal method, device and equipment for application system and readable storage medium | |
CN110378739B (en) | Data traffic matching method and device | |
CN115545103A (en) | Abnormal data identification method, label identification method and abnormal data identification device | |
JP2021018466A (en) | Rule extracting apparatus, information processing apparatus, rule extracting method, and rule extracting program | |
CN113705072A (en) | Data processing method, data processing device, computer equipment and storage medium | |
CN108711074A (en) | Business sorting technique, device, server and readable storage medium storing program for executing | |
US20170109670A1 (en) | Crowd-Based Patterns for Identifying Executions of Business Processes | |
US20170109637A1 (en) | Crowd-Based Model for Identifying Nonconsecutive Executions of a Business Process |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||