CN113515383A - System resource data allocation method and device - Google Patents

Info

Publication number
CN113515383A
Authority
CN
China
Prior art keywords
target
training sample
sample
initial
target training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110854993.9A
Other languages
Chinese (zh)
Other versions
CN113515383B (en)
Inventor
袁世聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110854993.9A
Publication of CN113515383A
Application granted
Publication of CN113515383B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The specification relates to the technical field of machine learning and discloses a system resource data allocation method and device. The method comprises: obtaining an initial training sample set; determining a sample space center based on the initial training sample set, and constructing a target training sample set from the initial training sample set and the sample space center, wherein the target training sample set comprises a plurality of target training samples distributed spherically around the sample space center; calculating an anomaly score for each of the plurality of target training samples, and determining a target label for each target training sample according to its anomaly score, to obtain a target label set corresponding to the target training sample set; and constructing a target classifier using the target label set, so as to allocate system resource data to a target user based on the target classifier's risk prediction result for that user. The scheme can improve the accuracy and efficiency of system resource data allocation.

Description

System resource data allocation method and device
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a method and an apparatus for allocating system resource data.
Background
With the rapid development of big data service platform technology, the types of financial resource data services and the available service channels have become increasingly diverse and convenient, and predicting user risk has become increasingly important for financial institutions. For example, online loan services offered through convenient service channels involve relatively little manual intervention, so if user risk prediction is not accurate enough, resource data may be allocated unreasonably and user experience may suffer.
Currently, machine learning and deep learning techniques are used to model anti-fraud scenarios in order to manage the fraud risk present in transactions. One of the technical problems most often encountered in this modeling process is sample imbalance. In a data set collected through normal business, most samples are normal samples, that is, non-fraudulent samples, and only a very few are black samples, that is, fraudulent samples. For example, there may be only 10 black samples among 270,000 samples, a ratio that is difficult to model with machine learning methods.
At present, there are only two mainstream approaches to sample imbalance: oversampling and undersampling. Each has its own problem. Oversampling essentially reuses the few minority samples in the data set repeatedly, which inevitably leads to overfitting of the trained model and impairs its generalization in the final application. Undersampling randomly discards some normal samples, which loses useful information and lowers the accuracy of the trained model. In either case, the resource data allocated to the corresponding users becomes less accurate, so resource data allocation may be unreasonable and the user experience is degraded.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the specification provides a method and a device for allocating system resource data, which can improve the accuracy and the efficiency of allocating the system resource data.
An embodiment of the present specification provides a method for allocating system resource data, including: acquiring an initial training sample set, wherein the initial training sample set comprises a plurality of initial training samples, and the initial training samples comprise feature data for representing user risk features; determining a sample space center based on the initial training sample set, and constructing a target training sample set according to the initial training sample set and the sample space center, wherein the target training sample set comprises a plurality of target training samples which are distributed in a spherical manner around the sample space center; calculating an abnormal score of each target training sample in the plurality of target training samples, determining a target label of each target training sample according to the abnormal score of each target training sample, and obtaining a target label set corresponding to the target training sample set, wherein the abnormal score is used for representing the distance between the target training sample and the space center of the sample, and the target label represents the risk category corresponding to the target training sample; and constructing a target classifier by using the target label set so as to distribute system resource data to the target user based on the risk prediction result of the target classifier on the target user.
In one embodiment, the initial training samples include features in multiple dimensions, and accordingly, determining a sample space center based on the initial training sample set includes: forming the sample space center from the average value of the features in each of the multiple dimensions.
In one embodiment, constructing a target training sample set from the initial training sample set and the sample space center comprises: inputting the initial training sample set into a trained autoencoder, which outputs the target training sample set; the trained autoencoder is obtained by training against an objective function, and the objective function serves to minimize the distance between the target training samples and the sample space center.
In one embodiment, the objective function is:
$$M = \frac{1}{n}\sum_{i=1}^{n} \left\| \phi(x_i; W) - c \right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{L} \left\| W^{l} \right\|_F^2$$
where $M$ is the objective function, $W$ denotes the parameters of the autoencoder to be trained, $x_i$ is the $i$-th initial training sample, $\phi(x_i; W)$ denotes the target training sample obtained by linear combination of $x_i$ with $W$, $n$ is the total number of initial training samples, $c$ is the sample space center, $\lambda$ is a regularization parameter, $L$ is the number of layers of the neural network corresponding to the autoencoder, $W^{l}$ denotes the parameters of the $l$-th layer, and $\|\cdot\|_F$ denotes the Frobenius norm.
In one embodiment, calculating an anomaly score for each of the plurality of target training samples comprises: calculating the abnormal score of each target training sample according to the following formula:
$$s(x) = \left\| \phi(x; W^{*}) - c \right\|^2$$
where $s$ is the anomaly score, $\phi(x; W^{*})$ is the target training sample corresponding to an initial training sample $x$, $c$ is the sample space center, and $W^{*}$ denotes the trained parameters of the autoencoder.
In one embodiment, determining the target label of each target training sample according to its anomaly score comprises: acquiring a preset black sample proportion; multiplying the preset black sample proportion by the total number of target training samples in the target training sample set to obtain a first number; and sorting the target training samples in descending order of anomaly score, and setting the target labels of the top first number of target training samples to risky.
In one embodiment, allocating system resource data to a target user based on the target classifier's risk prediction result for the target user comprises: acquiring initial feature data of the target user; inputting the initial feature data of the target user into the trained autoencoder to obtain target feature data; inputting the target feature data into the target classifier to obtain a risk prediction result for the target user; and allocating system resource data to the target user based on that risk prediction result.
An embodiment of the present specification further provides a system resource data allocation apparatus, including: an acquisition module, configured to acquire an initial training sample set, wherein the initial training sample set comprises a plurality of initial training samples, and the initial training samples comprise feature data representing user risk features; a construction module, configured to determine a sample space center based on the initial training sample set, and to construct a target training sample set from the initial training sample set and the sample space center, wherein the target training sample set includes a plurality of target training samples distributed spherically around the sample space center; a determining module, configured to calculate an anomaly score for each of the plurality of target training samples and to determine a target label for each target training sample according to its anomaly score, so as to obtain a target label set corresponding to the target training sample set, wherein the anomaly score characterizes the distance between the target training sample and the sample space center, and the target label represents the risk category corresponding to the target training sample; and a construction module, configured to construct a target classifier using the target label set, so as to allocate system resource data to a target user based on the target classifier's risk prediction result for the target user.
Embodiments of the present specification further provide a computer device, including a processor and a memory for storing processor-executable instructions, where the processor executes the instructions to implement the steps of the system resource data allocation method described in any of the above embodiments.
Embodiments of the present specification also provide a computer readable storage medium, on which computer instructions are stored, and when executed, the instructions implement the steps of the system resource data allocation method described in any of the above embodiments.
In an embodiment of the present specification, a system resource data allocation method is provided. A server may obtain an initial training sample set that includes a plurality of initial training samples containing feature data representing user risk features. A sample space center may be determined based on the initial training sample set, and a target training sample set is constructed from the initial training sample set and the sample space center, such that the target training samples are distributed spherically around the sample space center. An anomaly score may then be calculated for each target training sample; the anomaly score characterizes the distance between the target training sample and the sample space center. A target label is determined for each target training sample according to its anomaly score, yielding a target label set corresponding to the target training sample set. A target classifier is then constructed using the target label set. The target classifier can be used to predict the risk category of a target user, and system resource data can be allocated to the target user based on that risk category.
In this scheme, the sample space of the data set is reconstructed for an initial training sample set with imbalanced samples, so that the target training samples in the reconstructed target training sample set surround the determined sample space center. The anomaly score of each target training sample can then be calculated; it represents the distance between that sample and the sample space center. Labels are determined from the anomaly scores, and samples far from the center can be labeled as black samples. This raises the proportion of black samples and addresses the sample imbalance problem. Model training is then performed using the data set and the reconstructed label set, and the resulting target classifier can accurately predict the risk category of a target user, thereby improving the accuracy and efficiency of system resource data allocation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, are incorporated in and constitute a part of this specification, and are not intended to limit the specification. In the drawings:
FIG. 1 is a flow diagram illustrating a method for system resource data allocation in one embodiment of the present description;
FIG. 2 shows a schematic diagram of an auto-encoder in one embodiment of the present description;
FIG. 3 illustrates a flow diagram of an overall model training for an anti-fraud scenario in one embodiment of the present description;
FIG. 4 shows a flow diagram of a sample imbalance handling scheme in one embodiment of the present description;
FIG. 5 is a schematic diagram of a system resource data allocation apparatus in one embodiment of the present specification;
FIG. 6 shows a schematic diagram of a computer device in one embodiment of the present description.
Detailed Description
The principles and spirit of the present description will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely to enable those skilled in the art to better understand and to implement the present description, and are not intended to limit the scope of the present description in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present description may be embodied as a system, an apparatus, a method, or a computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
The embodiment of the specification provides a system resource data distribution method. In one scenario example of the present specification, a server may obtain an initial training sample set. The initial training sample set may include a plurality of initial training samples having feature data characterizing a user risk. The server may determine a sample space center based on the initial training sample set. Then, a target training sample set may be constructed according to the initial training sample set and the sample space center, and a plurality of target training samples in the target training sample set are distributed spherically around the sample space center. The server may then calculate an anomaly score for each of the plurality of target training samples. The anomaly score may be used to characterize how far the target training sample is from the center of the sample space. The server can determine the target label of each target training sample according to the abnormal score of each target training sample, and obtain a target label set corresponding to the target training sample set. The server may then build a target classifier using the target training sample set and the target label set. The target classifier may be used to predict risk categories for the target user, and may assign system resource data to the target user based on the risk categories for the target user.
Fig. 1 shows a flowchart of a system resource data allocation method in an embodiment of the present specification. Although the present specification provides method operational steps or apparatus configurations as illustrated in the following examples or figures, more or fewer operational steps or modular units may be included in the methods or apparatus based on conventional or non-inventive efforts. In the case of steps or structures which do not logically have the necessary cause and effect relationship, the execution sequence of the steps or the module structure of the apparatus is not limited to the execution sequence or the module structure described in the embodiments and shown in the drawings. When the described method or module structure is applied in an actual device or end product, the method or module structure according to the embodiments or shown in the drawings can be executed sequentially or executed in parallel (for example, in a parallel processor or multi-thread processing environment, or even in a distributed processing environment).
Specifically, as shown in fig. 1, a method for allocating system resource data provided by one embodiment of the present specification may include the following steps.
Step S101, obtaining an initial training sample set, wherein the initial training sample set comprises a plurality of initial training samples, and the initial training samples comprise feature data for representing user risk features.
The method in the embodiments of the present specification may be applied to a server. The server may be a single server, a server cluster, or a cloud server; its specific form is not limited in this application. The server may obtain an initial training sample set, which may include a plurality of initial training samples. Each initial training sample may include feature data characterizing the user's risk features. The feature data may be extracted from the user's business data stored in a business system of a financial institution. Feature extraction may be performed through feature engineering. The extraction method and the feature types may be set according to the actual application scenario and are not limited here. The feature data may also include features that the server extracts from user information acquired from platforms associated with the financial institution.
The pre-constructed initial training sample set may be stored locally or in a database. The server may retrieve the initial training sample set when allocating system resource data or constructing a prediction model. If an initial training sample set is composed of information on the users of a specified product or a specified service scenario, an identifier may be assigned to each initial training sample set. The server may then obtain the initial training sample set with the matching identifier according to the needs of the current test scenario, for use in system resource data allocation in that scenario. Much of the business data in current business systems is updated quickly, so the initial training sample set may be updated dynamically at intervals to keep its information accurate and thereby improve prediction accuracy.
Step S102, determining a sample space center based on the initial training sample set, and constructing a target training sample set according to the initial training sample set and the sample space center, wherein the target training sample set comprises a plurality of target training samples, and the plurality of target training samples are distributed in a spherical manner around the sample space center.
After obtaining the initial training sample set, the sample space center may be determined. The feature data in the initial training sample may include features in multiple dimensions. Accordingly, the sample space is a multi-dimensional space. In one embodiment, for each of the plurality of features, a feature value having the highest distribution probability is determined, and the feature value having the highest distribution probability for each feature is combined as a sample space center. Then, a target training sample set can be constructed according to the initial training sample set and the sample space center, so that each target training sample in a plurality of target training samples in the target training sample set is distributed in a spherical manner around the sample space center.
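As an illustrative sketch of the embodiment that combines the most probable value of each feature, the snippet below picks, for every dimension, the value that occurs most often in the initial training sample set. It assumes discrete-valued features; for continuous features some binning or density estimate would be needed, which the specification does not detail. The function name `mode_center` and the toy data are hypothetical.

```python
from collections import Counter


def mode_center(samples):
    """Return, per feature dimension, the most frequent value across samples."""
    dims = len(samples[0])
    center = []
    for d in range(dims):
        counts = Counter(row[d] for row in samples)
        # most_common(1) returns [(value, count)]; ties go to the value seen first
        center.append(counts.most_common(1)[0][0])
    return center


# Toy data (hypothetical): four samples with three discrete features each
samples = [[1, 0, 3], [1, 2, 3], [4, 2, 3], [1, 2, 5]]
print(mode_center(samples))
```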
Step S103, calculating an abnormal score of each target training sample in the plurality of target training samples, determining a target label of each target training sample according to the abnormal score of each target training sample, and obtaining a target label set corresponding to the target training sample set, wherein the abnormal score is used for representing the distance between the target training sample and the space center of the sample, and the target label represents the risk category corresponding to the target training sample.
After the set of target training samples is obtained, an anomaly score for each of the plurality of target training samples may be calculated. The anomaly score may be used to characterize how far the target training sample is from the center of the sample space. The farther the distance, the larger the anomaly score, and the closer the distance, the smaller the anomaly score. And then, determining the target label of each target training sample according to the abnormal score of each target training sample to obtain a target label set corresponding to the target training sample set. The risk category may be at risk, no risk, etc., or may be a risk grade of high risk, medium risk, low risk, etc.
In one embodiment, the labels of target training samples with an abnormality score greater than a preset score may be set to be risky, and the labels of target training samples with an abnormality score not greater than a preset score may be set to be risk-free.
In another embodiment, the label of the target training sample with the abnormal score larger than the first preset score may be set as high risk, the label of the target training sample with the abnormal score larger than the second preset score but not larger than the first preset score may be set as medium risk, and the label of the target training sample with the abnormal score smaller than the second preset score may be set as low risk.
In another embodiment, the target training samples in the target training sample set may be sorted in descending order of anomaly score, and the labels of the top preset number of target training samples are set to risky.
In another embodiment, the target training samples in the target training sample set may be sorted in descending order of anomaly score; the labels of the top first preset number of target training samples are set to risky, the labels of the bottom second preset number of target training samples are set to low risk, and the labels of the remaining target training samples in the middle are set to medium risk.
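The threshold-based and rank-based labeling embodiments above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and label strings are hypothetical, a score exactly equal to the second preset score is treated as low risk (the text leaves that case open), and the rank-based variant assumes the two preset numbers together do not exceed the number of samples.

```python
def label_by_thresholds(score, first_preset, second_preset):
    """Three-level labeling by score thresholds (first_preset > second_preset)."""
    if score > first_preset:
        return "high"
    if score > second_preset:
        return "medium"
    return "low"  # assumption: a score equal to second_preset counts as low


def label_by_rank(scores, first_count, second_count):
    """Three-level labeling by rank; assumes first_count + second_count <= len(scores)."""
    # Indices of the samples sorted by descending anomaly score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = ["medium"] * len(scores)
    for i in order[:first_count]:
        labels[i] = "risky"
    for i in order[len(scores) - second_count:]:
        labels[i] = "low"
    return labels


scores = [0.9, 0.1, 0.5, 0.3, 0.7]
print([label_by_thresholds(s, 0.8, 0.4) for s in scores])
print(label_by_rank(scores, first_count=1, second_count=2))
```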
And step S104, constructing a target classifier by using the target label set, and distributing system resource data to the target user based on a risk prediction result of the target classifier on the target user.
After the target label set corresponding to the target training sample set is obtained, a target classifier can be constructed by using the target training sample set and the target label set, and the target classifier can also be constructed by using the initial training sample set and the target label set. Namely, model training can be performed by using a target training sample set or an initial training sample set and a target label set, so as to obtain a trained target classifier. The target classifier may be used to predict risk categories for the user. In one embodiment, feature data of the target user may be obtained and input into the target classifier to obtain the target risk category. System resource data allocation may then be performed based on the target risk category.
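The prediction and allocation flow can be sketched as a small pipeline: encode the user's features, classify the result, then look up an allocation. Everything concrete here is a hypothetical stand-in, since the specification does not fix the classifier type or the allocation rule: `encode` is an identity function in place of the trained autoencoder, `classify` is a simple distance threshold in place of the trained target classifier, and `policy` is an invented mapping from risk category to a resource amount.

```python
def allocate_for_user(initial_features, encode, classify, policy):
    """Run encoder -> classifier -> allocation lookup for one target user."""
    target_features = encode(initial_features)
    risk = classify(target_features)
    return risk, policy[risk]


# Hypothetical stand-ins for the trained components
center = [0.0, 0.0]


def encode(x):
    return list(x)  # placeholder for the trained autoencoder


def classify(z):
    # placeholder classifier: far from the center means risky
    dist_sq = sum((a - b) ** 2 for a, b in zip(z, center))
    return "risky" if dist_sq > 4.0 else "no_risk"


policy = {"risky": 0, "no_risk": 10000}  # invented allocation amounts

print(allocate_for_user([0.5, 0.5], encode, classify, policy))
print(allocate_for_user([3.0, 3.0], encode, classify, policy))
```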
According to the method in the above embodiment, the sample space of the data set is reconstructed for an initial training sample set with imbalanced samples, so that the target training samples in the reconstructed target training sample set surround the determined sample space center. The anomaly score of each target training sample can then be calculated; it represents the distance between that sample and the sample space center. Labels are determined from the anomaly scores, and samples far from the center can be labeled as black samples. This raises the proportion of black samples and addresses the sample imbalance problem. Model training is then performed using the data set and the reconstructed label set, and the resulting target classifier can accurately predict the risk category of a target user, thereby improving the accuracy and efficiency of system resource data allocation.
In some embodiments of the present description, the initial training samples may include features in multiple dimensions, and accordingly, determining a sample space center based on the initial training sample set may include: forming the sample space center from the average value of the features in each of the multiple dimensions.
In particular, features of multiple dimensions may be included in the initial training sample. In determining the sample space center, the sample space center may be composed of an average of features of each of the plurality of dimensions. By the method, the center of the sample space can be conveniently and simply determined.
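A minimal sketch of the mean-based center: each coordinate of the center is the average of that feature over all initial training samples. The function name `mean_center` and the toy data are hypothetical.

```python
def mean_center(samples):
    """Return the per-dimension mean of a list of equal-length feature vectors."""
    n = len(samples)
    # zip(*samples) iterates over columns, i.e. one feature dimension at a time
    return [sum(column) / n for column in zip(*samples)]


samples = [[1.0, 2.0], [3.0, 6.0]]
print(mean_center(samples))
```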
In some embodiments of the present description, constructing a target training sample set from the initial training sample set and the sample space center may include: inputting the initial training sample set into a trained autoencoder, which outputs the target training sample set; the trained autoencoder is obtained by training against an objective function, and the objective function serves to minimize the distance between the target training samples and the sample space center.
Specifically, the initial training sample set may be input to a trained autoencoder, which outputs the target training sample set. The trained autoencoder is obtained by training against an objective function. Fig. 2 shows a schematic diagram of an autoencoder in an embodiment of the present disclosure. As shown in fig. 2, an autoencoder can be regarded as a neural network whose distinguishing property is that its input layer and output layer have the same dimension. This network generally does not classify or output a single value. Instead, it reconstructs the sample space of the input data set: feeding the data set to the trained autoencoder yields a new data set, namely the target training sample set. The target training samples are distributed spherically around the determined sample space center, and the farther a target training sample is from this center, the higher the probability that it is anomalous.
As shown in fig. 2, the autoencoder is a multi-layer fully connected neural network. Because of the limited size of the drawing, each neuron drawn in fig. 2 represents 10 actual neurons. The input layer and the output layer in fig. 2 each have 137 neurons, corresponding to a data set with 137 feature dimensions. The hidden layer in the middle of fig. 2 has 4 layers, which are 128-dimensional, 64-dimensional and 128-dimensional respectively. The activation function used between layers is ReLU, and the optimizer uses the adaptive Adam method, which computes and separately stores the momentum change of each parameter.
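To illustrate what keeping a separate momentum state for each parameter means in Adam, the sketch below applies one standard Adam update to a list of scalar parameters. The hyperparameter defaults are the commonly used ones and are an assumption; the specification does not state the learning rate or decay rates actually used. The function name `adam_step` and the toy problem are hypothetical.

```python
import math


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; m and v hold per-parameter moment estimates."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g      # first moment (momentum)
        v[i] = beta2 * v[i] + (1 - beta2) * g * g  # second moment
        m_hat = m[i] / (1 - beta1 ** t)            # bias correction for step t >= 1
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


# Toy use: one step on f(w) = (w - 3)^2 starting from w = 0
params, m, v = [0.0], [0.0], [0.0]
grads = [2 * (params[0] - 3.0)]
adam_step(params, grads, m, v, t=1, lr=0.1)
print(params)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate toward the minimum.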
In some embodiments of the present description, the objective function is:
$$M = \min_{W} \frac{1}{n}\sum_{i=1}^{n} \left\| \phi(x_i; W) - c \right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{L} \left\| W^{l} \right\|_F^2$$
where M is the objective function, W denotes the parameters of the self-encoder to be trained, $x_i$ is the i-th initial training sample, $\phi(x_i; W)$ represents the target training sample obtained by linearly combining $x_i$ with W, n is the total number of initial training samples, c is the sample space center, $\lambda$ is a regularization parameter, L is the number of layers of the neural network corresponding to the self-encoder, and $\|\cdot\|_F$ denotes the Frobenius norm.
In the above embodiment, the term of the objective function before the plus sign linearly combines each sample with the parameters W to obtain a new point in the sample space, namely a target training sample, and then computes the average distance from all newly generated points to the center c. The distances are squared so that no negative values occur. The term after the plus sign is the regularization term. The self-encoder is trained by minimizing the whole expression.
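As a worked illustration of the two terms, the following plain-Python sketch evaluates the objective for already-mapped samples. The function names are hypothetical, and the factor 1/2 on the regularization term is an assumption borrowed from the common weight-decay form; the mapped samples would in practice come from the self-encoder.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def frobenius_sq(W):
    """Squared Frobenius norm of a weight matrix given as a list of rows."""
    return sum(w * w for row in W for w in row)


def objective(mapped, center, weights, lam):
    """Average squared distance of the mapped samples to the center, plus
    (lam / 2) times the summed squared Frobenius norms of all layers."""
    n = len(mapped)
    distance_term = sum(squared_distance(z, center) for z in mapped) / n
    reg_term = 0.5 * lam * sum(frobenius_sq(W) for W in weights)
    return distance_term + reg_term


# Two mapped samples at distance 1 from the center, one 2x2 identity layer.
mapped = [[1.0, 0.0], [0.0, 1.0]]
center = [0.0, 0.0]
weights = [[[1.0, 0.0], [0.0, 1.0]]]
value = objective(mapped, center, weights, lam=0.1)  # 1.0 + 0.1
```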
In some embodiments of the present description, calculating the abnormality score for each of the plurality of target training samples may include: calculating the abnormal score of each target training sample according to the following formula:
$$s(x) = \left\| \phi(x; W^{*}) - c \right\|^2$$
where s is the abnormality score, $\phi(x; W^{*})$ is the target training sample corresponding to an initial training sample x, c is the sample space center, and $W^{*}$ denotes the trained parameters of the self-encoder.
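A minimal sketch of the score computation follows, assuming the target training samples have already been produced by the trained self-encoder. The function name and the toy values are illustrative only.

```python
def anomaly_score(z, center):
    """Squared Euclidean distance between a target training sample z
    and the sample space center."""
    return sum((a - b) ** 2 for a, b in zip(z, center))


center = [1.0, 1.0]
target_samples = [[1.0, 1.0], [2.0, 1.0], [4.0, 5.0]]
scores = [anomaly_score(z, center) for z in target_samples]
# The third sample lies farthest from the center and gets the highest score.
```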
In some embodiments of the present disclosure, determining the target label of each target training sample according to the abnormal score of each target training sample may include: acquiring a preset black sample proportion; multiplying the preset black sample proportion by the total number of the target training samples in the target training sample set to obtain a first number; and sorting the abnormal scores of the target training samples in a descending order, and determining the target labels of the first number of the target training samples ranked in the front as risky.
Specifically, the server may obtain a preset black sample ratio. The preset black sample proportion can be set manually, or can be calculated by the server according to the total number of samples in the initial training sample set and a preset algorithm. The server may multiply the preset black sample ratio by the total number of the target training samples in the target training sample set to obtain the first number. Thereafter, the anomaly scores for each target training sample may be sorted in descending order and the target labels of the first number of target training samples ranked in the front may be determined to be at risk. By means of the method, a part of white samples with high abnormal scores can be converted into black samples, the problem of sample imbalance can be relieved, and accuracy of the model trained on the basis of the target training sample set and the target label set is improved.
Some target training samples with low abnormality scores may correspond to initial training samples that are black samples; if every low-scoring sample were directly labeled as a white sample, the trained model could be inaccurate. The server may therefore obtain an initial label set corresponding to the initial training sample set. The initial label set comprises the initial labels corresponding to the initial training samples in the initial training sample set and is used for representing the risk categories corresponding to the initial training samples. In some embodiments of the present description, determining the target label of each target training sample according to the abnormality score of each target training sample may include: acquiring a preset black sample proportion; multiplying the preset black sample proportion by the total number of the target training samples in the target training sample set to obtain a first number; and sorting the abnormality scores of the target training samples in descending order, determining the target labels of the first number of top-ranked target training samples as risky, and setting the target label of every other target training sample in the target training sample set to the initial label of its corresponding initial training sample, to obtain a target label set. A model trained on the target label set obtained in this way is more accurate.
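The labeling rule can be sketched in plain Python as follows. This is an illustration under assumptions: labels are encoded as 1 (risky, black sample) and 0 (not risky, white sample), the first number is rounded down to an integer, and the function name is hypothetical. When no initial labels are supplied, the remaining samples default to 0.

```python
def assign_target_labels(scores, black_ratio, initial_labels=None):
    """Label the top-scoring fraction of samples as risky (1). Every other
    sample keeps its initial label if one is given, else gets 0."""
    n = len(scores)
    first_number = int(black_ratio * n)  # rounding down is an assumption
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    risky = set(ranked[:first_number])
    labels = []
    for i in range(n):
        if i in risky:
            labels.append(1)
        elif initial_labels is not None:
            labels.append(initial_labels[i])
        else:
            labels.append(0)
    return labels


scores = [0.2, 9.0, 0.1, 4.0]
initial_labels = [1, 0, 0, 0]  # sample 0 is a known black sample
without_initial = assign_target_labels(scores, black_ratio=0.5)
with_initial = assign_target_labels(scores, 0.5, initial_labels)
```

In the second call, sample 0 stays labeled as risky even though its score is low, which is the situation the paragraph above is concerned with.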
For model training, the target classifier may be obtained by training either on the target training sample set with the target label set, or on the initial training sample set with the target label set.
When training on the initial training sample set with the target label set, the target label of each target training sample is used as the label of its corresponding initial training sample.
When training on the target training sample set with the target label set, feature data must be converted into the format of the target training samples before it is passed to the resulting target classifier. Thus, in some embodiments of the present description, allocating system resource data to a target user based on a risk prediction result of the target classifier for the target user may include: acquiring initial characteristic data of a target user; inputting the initial characteristic data of the target user into the trained self-encoder to obtain target characteristic data; inputting the target characteristic data into the target classifier to obtain a risk prediction result of the target user; and allocating system resource data to the target user based on the risk prediction result of the target user. In this way, the risk category of the target user can be predicted by the target classifier, so that system resources are allocated to the target user more appropriately, improving the efficiency and accuracy of resource allocation as well as the resource utilization rate.
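The four prediction-time steps can be sketched as a small pipeline. Everything concrete here is a hypothetical stand-in: the encoder, the classifier, the risk labels, and the allocation policy are placeholders chosen for illustration, since the specification does not fix a particular classifier or allocation rule.

```python
def predict_and_allocate(initial_features, encoder, classifier, allocate):
    """Encode the user's initial features, classify the encoded features,
    and allocate resources according to the predicted risk."""
    target_features = encoder(initial_features)
    risk = classifier(target_features)
    return allocate(risk)


def toy_encoder(x):
    """Stand-in for the trained self-encoder (hypothetical)."""
    return [2.0 * v for v in x]


def toy_classifier(z):
    """Stand-in for the target classifier (hypothetical threshold rule)."""
    return "risky" if sum(z) > 10.0 else "not_risky"


def toy_allocate(risk):
    """Hypothetical allocation policy keyed by risk category."""
    return {"risky": "manual_review_queue", "not_risky": "fast_track"}[risk]


low = predict_and_allocate([1.0, 2.0], toy_encoder, toy_classifier, toy_allocate)
high = predict_and_allocate([4.0, 3.0], toy_encoder, toy_classifier, toy_allocate)
```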
The above method is described below with reference to a specific example, however, it should be noted that the specific example is only for better describing the present specification and should not be construed as an undue limitation on the present specification.
In this specific embodiment, a system resource allocation method is provided that addresses the sample imbalance problem in anti-fraud scenarios in deep learning, and can thereby improve the accuracy of the risk prediction model for the anti-fraud scenario.
Referring to fig. 3, a flowchart of the overall model training for the anti-fraud scenario in an embodiment of the present specification is shown. As shown in fig. 3, the flow includes the following steps: first, anti-fraud transaction feature information provided by the business is obtained from a data warehouse, and the data set is preprocessed. Next, the data set is transformed using the sample imbalance processing method provided by this embodiment, and features are constructed through feature engineering. Finally, a model is trained on the newly generated data set to obtain a prediction model, which is then used to obtain prediction results. The present embodiment mainly concerns the steps before feature engineering, chiefly addressing sample imbalance in the data set, and the following description focuses on these parts.
Referring to fig. 4, a flow chart of a sample imbalance processing scheme in an embodiment of the present disclosure is shown. As shown in fig. 4, a new hypersphere surrounding a set center can be formed from the original sample space through a self-encoder of a deep neural network, and then the abnormal samples outside the hypersphere are converted into black samples. The following explains the respective portions related to the above-described flow:
Some of the processing steps of "data preprocessing" are as follows:
1.1 data selection. The transactions selected in the present embodiment may be transactions that occurred between January 3, 2019 and December 29, 2019. The features related to anti-fraud transaction risk prediction fall into two categories: basic information of the transaction and account information of both parties to the transaction. The data ranges, and thus the data tables involved, can be determined by category.
1.2 data preprocessing. The data columns related to the basic transaction information and to the account information of both parties are examined in the different tables, and the related columns are joined on the transaction id to form the original features. Columns with missing values are completed according to the following rule: null values in three types of data, namely ratios related to the transaction amount over a past period, ratios related to the transaction account over a past period, and ratios related to the transaction frequency over a past period, are filled with the maximum value; null values in all other data columns are filled with 0.
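The missing-value rule can be sketched in plain Python as follows. The column names are hypothetical, rows are represented as dictionaries with None marking a missing value, and taking "the maximum value" to mean the largest observed value in the same column is an assumption.

```python
def fill_missing(rows, max_fill_columns):
    """Fill None values: columns listed in max_fill_columns get the largest
    observed value of that column; all other columns get 0."""
    columns = sorted({c for row in rows for c in row})
    col_max = {}
    for c in max_fill_columns:
        observed = [row[c] for row in rows if row.get(c) is not None]
        col_max[c] = max(observed) if observed else 0
    filled = []
    for row in rows:
        new_row = {}
        for c in columns:
            value = row.get(c)
            if value is None:
                value = col_max[c] if c in col_max else 0
            new_row[c] = value
        filled.append(new_row)
    return filled


rows = [
    {"amount_ratio": 0.2, "txn_count": 3},
    {"amount_ratio": None, "txn_count": None},
    {"amount_ratio": 0.7, "txn_count": 5},
]
filled = fill_missing(rows, max_fill_columns=["amount_ratio"])
```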
The relevant steps for "sample imbalance treatment" are as follows:
2.1 determine the sample space center. The choice of the sample center varies with the data set and the modeling task. In some image recognition tasks, the sharpest picture of each category in the data set is taken as the sample center. In the present invention, the mean of the features in each dimension is used as the corresponding coordinate to form the sample space center.
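Computing the center as the per-dimension mean can be sketched as follows; the function name and the toy samples are illustrative only.

```python
def sample_space_center(samples):
    """Return the point whose i-th coordinate is the mean of the i-th
    feature over all samples."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


samples = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
center = sample_space_center(samples)
```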
2.2 construct the self-encoder. Fig. 2 is a schematic diagram of the self-encoder in an embodiment of the present specification. As shown in fig. 2, the self-encoder can be regarded as a neural network whose input layer and output layer have the same dimension. Such a network does not classify or output a single value; instead it reconstructs the sample space of the input data set. That is, feeding the data set to the trained self-encoder yields a new data set, namely the target training sample set. The target training samples are distributed spherically around the determined sample space center, and the farther a target training sample lies from this center, the higher the probability that it is anomalous.
As shown in fig. 2, the self-encoder is a multi-layered fully-connected neural network. Because of the limited size of the drawing, each neuron drawn in fig. 2 represents 10 neurons in practice. In fig. 2, the Input Layer and the Output Layer each have 137 neurons, corresponding to the 137 feature dimensions of the data set. The Hidden Layer in the middle of fig. 2 has 4 layers, which are 128-dimensional, 64-dimensional and 128-dimensional respectively. The activation function used between layers is ReLU, and the optimizer is Adam, an adaptive method that computes and stores a separate momentum estimate for each parameter.
2.3 constructing an objective function. The objective function optimized in self-encoder training is shown in formula (1), where c represents the sample center and $x_i$ represents a sample in the data set. The first term requires the features extracted from all samples to lie as close as possible to the center, and the second term is L2 regularization. Optimizing the objective function draws the samples closer to the center on average, and in the process the network learns their common characteristics.
$$M = \min_{W} \frac{1}{n}\sum_{i=1}^{n} \left\| \phi(x_i; W) - c \right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{L} \left\| W^{l} \right\|_F^2 \tag{1}$$
where M is the objective function, W denotes the parameters of the self-encoder to be trained, $x_i$ is the i-th initial training sample, $\phi(x_i; W)$ represents the target training sample obtained by linearly combining $x_i$ with W, n is the total number of initial training samples, c is the sample space center, $\lambda$ is a regularization parameter, L is the number of layers of the neural network corresponding to the self-encoder, and $\|\cdot\|_F$ denotes the Frobenius norm. Increasing $\lambda$ penalizes fitting the data too closely and improves the generalization ability of the model.
2.4 key points in training the self-encoder. First, the center cannot be fixed at the origin, nor can the center c be placed in the neural network as a free variable to be trained iteratively. Either choice drives the final result toward the trivial solution, i.e., the all-zero solution. In practice, the spatial mean point of the samples at the first iteration of neural network training may be used as the center. Second, the neural network cannot use a bounded (saturating) activation function, which is why ReLU is preferred as the activation function. Suppose the network has a saturating activation function with upper bound B. If some feature k is positive for all input samples, the network can keep only that feature and increase the weight on k until the output saturates at B; subsequent layers then only need to map B to c, which again yields a degenerate solution.
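The first point can be illustrated with a small numerical sketch. It assumes a single bias-free ReLU layer and toy weights, and it reads "the mean point of the first iteration" as the mean of the network outputs under the initial weights; these are assumptions made for the illustration.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def forward(x, weights):
    """Bias-free fully connected layers with ReLU after each layer."""
    h = list(x)
    for W in weights:
        h = relu([sum(w * a for w, a in zip(row, h)) for row in W])
    return h


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def mean_sq_distance(points, c):
    """Average squared distance of the points to c."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in points) / len(points)


samples = [[1.0, 2.0], [3.0, 1.0]]

# With all-zero weights every sample maps to the origin, so a center fixed
# at the origin gives zero loss without learning anything (trivial solution).
zero_weights = [[[0.0, 0.0], [0.0, 0.0]]]
collapsed = [forward(x, zero_weights) for x in samples]
loss_origin = mean_sq_distance(collapsed, [0.0, 0.0])

# Fixing the center at the mean of the outputs under the initial weights
# makes the all-zero weights costly, so the collapse is no longer optimal.
initial_weights = [[[0.5, 0.0], [0.0, 0.5]]]
c = mean_point([forward(x, initial_weights) for x in samples])
loss_fixed_c = mean_sq_distance(collapsed, c)
```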
2.5 testing. After the trained self-encoder produces the reconstructed new data set, the abnormality scores of all samples can be calculated with formula (2), and the portion of samples with the highest abnormality scores is then converted into black samples according to the proportion of abnormal samples provided by the business.
$$s(x) = \left\| \phi(x; W^{*}) - c \right\|^2 \tag{2}$$
The relevant steps of "training the model and predicting" are as follows:
The converted new data set is input into a classification model for training, and the trained model is then used for prediction.
With the method of the above embodiment, for the same machine learning model, a model trained on the data set processed by this method outperforms a model trained on the original data set in accuracy, recall rate and comprehensive prediction performance, and predicts the fraud risk of a transaction more accurately. When the model is applied in financial institutions such as banks, transactions with fraud risk can be predicted accurately before they occur, and the relevant personnel can act on the model prediction results, which helps prevent clients from being defrauded, reduces losses, and improves user experience.
Based on the same inventive concept, the embodiment of the present specification further provides a system resource data allocation apparatus, as described in the following embodiments. Because the principle of the system resource data allocation apparatus for solving the problem is similar to the system resource data allocation method, the implementation of the system resource data allocation apparatus can refer to the implementation of the system resource data allocation method, and repeated details are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated. Fig. 5 is a block diagram of a structure of a system resource data allocation apparatus according to an embodiment of the present specification, and as shown in fig. 5, the apparatus includes: an acquisition module 501, a construction module 502, a determination module 503, and a construction module 504, the structure of which is described below.
The obtaining module 501 is configured to obtain an initial training sample set, where the initial training sample set includes a plurality of initial training samples, and the initial training samples include feature data for characterizing a user risk feature.
The constructing module 502 is configured to determine a sample space center based on the initial training sample set, and construct a target training sample set according to the initial training sample set and the sample space center, where the target training sample set includes a plurality of target training samples that are distributed spherically around the sample space center.
The determining module 503 is configured to calculate an abnormal score of each target training sample in the plurality of target training samples, determine a target label of each target training sample according to the abnormal score of each target training sample, and obtain a target label set corresponding to the target training sample set, where the abnormal score is used to represent a distance between the target training sample and the spatial center of the sample, and the target label represents a risk category corresponding to the target training sample.
The construction module 504 is configured to construct a target classifier using the target label set, so as to allocate system resource data to the target user based on a risk prediction result of the target classifier on the target user.
In some embodiments of the present description, the construction module may be specifically configured to form the sample space center from the average value of the features in each of the plurality of dimensions.
In some embodiments of the present description, the construction module may be specifically configured to: inputting the initial training sample set into a trained self-encoder, and outputting the target training sample set; the trained self-encoder is obtained through training optimization of an objective function, and the objective function is used for enabling the distance between a target training sample and the center of the sample space to be minimum.
In some embodiments of the present description, the objective function may be:
$$M = \min_{W} \frac{1}{n}\sum_{i=1}^{n} \left\| \phi(x_i; W) - c \right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{L} \left\| W^{l} \right\|_F^2$$
where M is the objective function, W denotes the parameters of the self-encoder to be trained, $x_i$ is the i-th initial training sample, $\phi(x_i; W)$ represents the target training sample obtained by linearly combining $x_i$ with W, n is the total number of initial training samples, c is the sample space center, $\lambda$ is a regularization parameter, L is the number of layers of the neural network corresponding to the self-encoder, and $\|\cdot\|_F$ denotes the Frobenius norm.
In some embodiments of the present description, the determining module may be specifically configured to: calculating the abnormal score of each target training sample according to the following formula:
$$s(x) = \left\| \phi(x; W^{*}) - c \right\|^2$$
where s is the abnormality score, $\phi(x; W^{*})$ is the target training sample corresponding to an initial training sample x, c is the sample space center, and $W^{*}$ denotes the trained parameters of the self-encoder.
In some embodiments of the present description, the determining module may be specifically configured to: acquiring a preset black sample proportion; multiplying the preset black sample proportion by the total number of the target training samples in the target training sample set to obtain a first number; and sorting the abnormal scores of the target training samples in a descending order, and determining the target labels of the first number of the target training samples ranked in the front as risky.
In some embodiments of the present description, allocating system resource data to the target user based on the risk prediction result of the target classifier for the target user may include: acquiring initial characteristic data of a target user; inputting the initial characteristic data of the target user into the trained self-encoder to obtain target characteristic data; inputting the target characteristic data into the target classifier to obtain a risk prediction result of the target user; allocating system resource data to the target user based on the risk prediction result of the target user.
From the above description, it can be seen that the embodiments of the present specification achieve the following technical effects: the method comprises the steps of reconstructing a sample space of a data set for an initial training sample set with unbalanced samples, enabling target training samples in a target training sample set obtained through reconstruction to surround a set sample space central point, then calculating abnormal scores of the target training samples, enabling the abnormal scores to represent the distances between the training samples and the sample space central point, then determining labels of the target training samples according to the abnormal scores, and determining samples far away from the central point as black samples, so that the proportion of the black samples is improved, the problem of unbalanced samples is solved, then performing model training by using the data set and the reconstructed label set, enabling an obtained target classifier to accurately predict risk categories of target users, and further improving the accuracy and the efficiency of system resource data distribution.
The embodiment of the present specification further provides a computer device, which may specifically refer to a schematic structural diagram of a computer device based on the system resource data allocation method provided in the embodiment of the present specification, shown in fig. 6, where the computer device may specifically include an input device 61, a processor 62, and a memory 63. Wherein the memory 63 is for storing processor executable instructions. The processor 62, when executing the instructions, performs the steps of the system resource data allocation method described in any of the embodiments above.
In this embodiment, the input device may be one of the main apparatuses for information exchange between a user and a computer system. The input device may include a keyboard, a mouse, a camera, a scanner, a light pen, a handwriting input board, a voice input device, etc.; the input device is used to input raw data and a program for processing the data into the computer. The input device can also acquire and receive data transmitted by other modules, units and devices. The processor may be implemented in any suitable way. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The memory may in particular be a memory device used in modern information technology for storing information. The memory may include multiple levels, and in a digital system, the memory may be any memory as long as it can store binary data; in an integrated circuit, a circuit without a physical form and with a storage function is also called a memory, such as a RAM, a FIFO and the like; in the system, the storage device in physical form is also called a memory, such as a memory bank, a TF card and the like.
In this embodiment, the functions and effects of the specific implementation of the computer device can be explained in comparison with other embodiments, and are not described herein again.
The present specification also provides a computer storage medium based on the system resource data allocation method, and the computer storage medium stores computer program instructions, and when the computer program instructions are executed, the computer storage medium implements the steps of the system resource data allocation method in any of the above embodiments.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer storage medium can be explained by comparing with other embodiments, and are not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present specification described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed over a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different from that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, embodiments of the present description are not limited to any specific combination of hardware and software.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the description should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above description is only a preferred embodiment of the present disclosure, and is not intended to limit the present disclosure, and it will be apparent to those skilled in the art that various modifications and variations can be made in the embodiment of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present specification shall be included in the protection scope of the present specification.

Claims (10)

1. A method for allocating system resource data, comprising:
acquiring an initial training sample set, wherein the initial training sample set comprises a plurality of initial training samples, and the initial training samples comprise feature data for representing user risk features;
determining a sample space center based on the initial training sample set, and constructing a target training sample set according to the initial training sample set and the sample space center, wherein the target training sample set comprises a plurality of target training samples which are distributed in a spherical manner around the sample space center;
calculating an abnormal score of each target training sample in the plurality of target training samples, determining a target label of each target training sample according to the abnormal score of each target training sample, and obtaining a target label set corresponding to the target training sample set, wherein the abnormal score is used for representing the distance between the target training sample and the space center of the sample, and the target label represents the risk category corresponding to the target training sample;
and constructing a target classifier by using the target label set so as to distribute system resource data to the target user based on the risk prediction result of the target classifier on the target user.
2. The method of claim 1, wherein the initial training samples comprise features in a plurality of dimensions, and wherein determining a sample spatial center based on the initial set of training samples comprises:
and forming the sample space center by the average value of the features of each dimension in the plurality of dimensions.
3. The method of claim 1, wherein constructing a target training sample set from the initial training sample set and the sample space center comprises:
inputting the initial training sample set into a trained self-encoder, and outputting the target training sample set;
the trained self-encoder is obtained through training optimization of an objective function, and the objective function is used for enabling the distance between a target training sample and the center of the sample space to be minimum.
4. The method of claim 3, wherein the objective function is:
$$M = \min_{W} \frac{1}{n}\sum_{i=1}^{n} \left\| \phi(x_i; W) - c \right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{L} \left\| W^{l} \right\|_F^2$$
where M is the objective function, W denotes the parameters of the self-encoder to be trained, $x_i$ is the i-th initial training sample, $\phi(x_i; W)$ represents the target training sample obtained by linearly combining $x_i$ with W, n is the total number of initial training samples, c is the sample space center, $\lambda$ is a regularization parameter, L is the number of layers of the neural network corresponding to the self-encoder, and $\|\cdot\|_F$ denotes the Frobenius norm.
5. The method of claim 3, wherein calculating an anomaly score for each of the plurality of target training samples comprises:
calculating the abnormal score of each target training sample according to the following formula:
$$s(x) = \left\| \phi(x; W^{*}) - c \right\|^2$$
where s is the abnormality score, $\phi(x; W^{*})$ is the target training sample corresponding to an initial training sample x, c is the sample space center, and $W^{*}$ denotes the trained parameters of the self-encoder.
6. The method of claim 1, wherein determining the target label for each target training sample based on the anomaly score for each target training sample comprises:
acquiring a preset black sample proportion;
multiplying the preset black sample proportion by the total number of the target training samples in the target training sample set to obtain a first number;
and sorting the abnormal scores of the target training samples in a descending order, and determining the target labels of the first number of the target training samples ranked in the front as risky.
7. The method of claim 3, wherein allocating system resource data to the target user based on the risk prediction result of the target classifier for the target user comprises:
acquiring initial characteristic data of a target user;
inputting the initial characteristic data of the target user into the trained self-encoder to obtain target characteristic data;
inputting the target characteristic data into the target classifier to obtain a risk prediction result of the target user;
allocating system resource data to the target user based on the risk prediction result of the target user.
8. A system resource data allocation apparatus, comprising:
an acquisition module, configured to acquire an initial training sample set, wherein the initial training sample set comprises a plurality of initial training samples, and the initial training samples comprise feature data for representing user risk features;
a construction module, configured to determine a sample space center based on the initial training sample set, and construct a target training sample set according to the initial training sample set and the sample space center, where the target training sample set includes a plurality of target training samples that are distributed spherically around the sample space center;
a determining module, configured to calculate an anomaly score of each target training sample in the plurality of target training samples, determine a target label of each target training sample according to the anomaly score of each target training sample, and obtain a target label set corresponding to the target training sample set, wherein the anomaly score represents the distance between the target training sample and the sample space center, and the target label represents the risk category corresponding to the target training sample;
and a construction module, configured to construct a target classifier by using the target label set, so as to allocate system resource data to a target user based on the risk prediction result of the target classifier for the target user.
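The work of the construction module that builds the target training sample set can be illustrated with a small sketch. It rests on several assumptions that the claim does not confirm: the self-encoder is replaced by a hypothetical bias-free linear map W, the sample space center c is taken as the mean of the initial encodings, and W is trained by plain gradient descent to reduce the mean squared distance of the encodings to c. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=1.0, size=(200, 4))  # toy initial training samples
W = rng.normal(scale=0.5, size=(4, 2))             # hypothetical linear encoder

# Assumption: the sample space center is the mean of the initial encodings.
c = (X @ W).mean(axis=0)

def mean_sq_distance(weights):
    """Mean squared distance of the encoded samples to the fixed center c."""
    return float(np.mean(np.sum((X @ weights - c) ** 2, axis=1)))

initial_loss = mean_sq_distance(W)
for _ in range(200):
    # Gradient of the mean squared distance with respect to W.
    grad = (2.0 / len(X)) * (X.T @ (X @ W - c))
    W -= 0.02 * grad
final_loss = mean_sq_distance(W)

Z = X @ W  # target training samples, pulled toward the center c
print(initial_loss, final_loss)
```

After training, the encodings Z lie closer to c on average than the initial encodings did, which is the sense in which the target training samples gather around the sample space center in this toy setting.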
9. A computer device comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium having computer instructions stored thereon which, when executed, implement the steps of the method of any one of claims 1 to 7.
CN202110854993.9A 2021-07-28 2021-07-28 System resource data distribution method and device Active CN113515383B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110854993.9A CN113515383B (en) 2021-07-28 2021-07-28 System resource data distribution method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110854993.9A CN113515383B (en) 2021-07-28 2021-07-28 System resource data distribution method and device

Publications (2)

Publication Number Publication Date
CN113515383A 2021-10-19
CN113515383B 2024-02-20

Family

ID=78067732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110854993.9A Active CN113515383B (en) 2021-07-28 2021-07-28 System resource data distribution method and device

Country Status (1)

Country Link
CN (1) CN113515383B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654054A (en) * 2015-12-30 2016-06-08 上海颐本信息科技有限公司 Semi-supervised neighbor propagation learning and multi-visual dictionary model-based intelligent video analysis method
US20170132512A1 (en) * 2015-11-06 2017-05-11 Google Inc. Regularizing machine learning models
CN111915437A (en) * 2020-06-30 2020-11-10 深圳前海微众银行股份有限公司 RNN-based anti-money laundering model training method, device, equipment and medium
US20210028885A1 (en) * 2019-07-26 2021-01-28 Maxim Integrated Products, Inc. Cnn-based demodulating and decoding systems and methods for universal receiver
CN112836742A (en) * 2021-02-02 2021-05-25 中国工商银行股份有限公司 System resource adjusting method, device and equipment
CN113011722A (en) * 2021-03-04 2021-06-22 中国工商银行股份有限公司 System resource data allocation method and device

Also Published As

Publication number Publication date
CN113515383B (en) 2024-02-20

Similar Documents

Publication Title
US10713597B2 (en) Systems and methods for preparing data for use by machine learning algorithms
CN110569322A (en) Address information analysis method, device and system and data acquisition method
CN110378786B (en) Model training method, default transmission risk identification method, device and storage medium
CN111080442A (en) Credit scoring model construction method, device, equipment and storage medium
CN109948149A (en) A kind of file classification method and device
CN111210072B (en) Prediction model training and user resource limit determining method and device
CN111325248A (en) Method and system for reducing pre-loan business risk
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN114612251A (en) Risk assessment method, device, equipment and storage medium
CN111062806B (en) Personal finance credit risk evaluation method, system and storage medium
CN112785005A (en) Multi-target task assistant decision-making method and device, computer equipment and medium
CN113011722A (en) System resource data allocation method and device
CN115545103A (en) Abnormal data identification method, label identification method and abnormal data identification device
CN111563187A (en) Relationship determination method, device and system and electronic equipment
CN112836750A (en) System resource allocation method, device and equipment
CN113515383B (en) System resource data distribution method and device
CN115731030A (en) Method, device and storage medium for mining bank consumption loan customer requirements
CN115375453A (en) System resource allocation method and device
CN111984637B (en) Missing value processing method and device in data modeling, equipment and storage medium
CN114170000A (en) Credit card user risk category identification method, device, computer equipment and medium
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment
CN110837847A (en) User classification method and device, storage medium and server
CN111459990A (en) Object processing method, system, computer readable storage medium and computer device
CN111079992A (en) Data processing method, device and storage medium
CN111127184A (en) Distributed combined credit evaluation method

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant