CN114978616B - Construction method and device of risk assessment system, and risk assessment method and device - Google Patents

Construction method and device of risk assessment system, and risk assessment method and device

Info

Publication number
CN114978616B
Authority
CN
China
Prior art keywords
samples
labeling
sample
risk
risk assessment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210486217.2A
Other languages
Chinese (zh)
Other versions
CN114978616A (en
Inventor
张长浩
傅欣艺
王维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210486217.2A priority Critical patent/CN114978616B/en
Publication of CN114978616A publication Critical patent/CN114978616A/en
Application granted granted Critical
Publication of CN114978616B publication Critical patent/CN114978616B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security

Abstract

Embodiments of this specification provide a method for constructing a risk assessment system, comprising: training a first risk assessment model using a first labeling event sample set, the first labeling event sample set comprising a first number of black samples and a second number of white samples, the first number being greater than the second number; processing a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score for each gray sample, wherein each gray sample has been identified as a risk sample by an existing risk control technology; selecting, based on the predicted risk scores, a portion of the plurality of gray samples as an expansion of the black samples in a second labeling event sample set, the second labeling event sample set initially comprising a third number of black samples and a fourth number of white samples, the third number being less than the fourth number; and training a second risk assessment model using the expanded second labeling event sample set, wherein the trained second risk assessment model is used for constructing the risk assessment system.

Description

Construction method and device of risk assessment system, and risk assessment method and device
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a method and an apparatus for constructing a risk assessment system, and a risk assessment method and an apparatus.
Background
With the rapid development of computer networks, network security issues are increasingly prominent. Many high-risk operational activities, such as account theft, traffic attacks, and fraudulent transactions, may threaten network security or user information security. In many scenarios, for the purposes of network security and risk prevention and control, user risk types, user operation behaviors, operation events, and the like need to be analyzed and processed, and the degree of risk related to a user needs to be evaluated.
However, existing risk assessment methods are limited in effectiveness. Therefore, a solution is needed that can effectively improve the accuracy of risk assessment so as to better meet practical application requirements.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method and an apparatus for constructing a risk assessment system, in which a training sample set with a high concentration of black samples is constructed and used to screen gray samples, which carry a large amount of black sample information but could not previously be used effectively, so as to effectively improve the accuracy of identifying risk samples.
According to a first aspect, there is provided a method for constructing a risk assessment system, including: training a first risk assessment model using a first labeling event sample set, the first labeling event sample set including a first number of black samples and a second number of white samples, the first number being greater than the second number; processing a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score for each gray sample, each gray sample having been identified as a risk sample by an existing risk control technology; selecting, based on the predicted risk scores, a portion of the plurality of gray samples as an expansion of the black samples in a second labeling event sample set, the second labeling event sample set initially including a third number of black samples and a fourth number of white samples, the third number being less than the fourth number; and training a second risk assessment model using the expanded second labeling event sample set, the trained second risk assessment model being used to construct the risk assessment system.
In one embodiment, prior to training the first risk assessment model with the first labeling event sample set, the method further comprises: splitting the second labeling event sample set into two labeling subsets; training two corresponding risk assessment models using the two labeling subsets, the two models being used for constructing the risk assessment system; cross scoring the two labeling subsets using the two trained risk assessment models to obtain a predicted risk score for each labeling sample in the second labeling event sample set; and selecting the first number of black samples and the second number of white samples from the second labeling event sample set, based on the predicted risk score of each labeling sample, to form the first labeling event sample set.
In a specific embodiment, selecting the first number of black samples and the second number of white samples from the second labeling event sample set based on the predicted risk score of each labeling sample comprises: sorting the predicted risk scores of the labeling samples in descending order; and, according to the sorting result, selecting from the plurality of black samples the first number of black samples ranked at the front, and selecting from the plurality of white samples the second number of white samples ranked at the back.
In a specific embodiment, after obtaining the predicted risk score of each labeling sample in the second labeling event sample set, the method further comprises: for each labeling sample, removing the labeling sample from the second labeling event sample set if it is a black sample whose predicted risk score is smaller than a first threshold, or if it is a white sample whose predicted risk score is larger than a second threshold; and training a third risk assessment model based on the second labeling event sample set after this removal, the trained third risk assessment model being used for constructing the risk assessment system.
In a specific embodiment, before splitting the second labeling event sample set, which comprises the plurality of black samples and the plurality of white samples, into two labeling subsets, the method further comprises: acquiring a third labeling event sample set corresponding to a first historical period and a fourth labeling event sample set corresponding to a second historical period, the first historical period being earlier than the second historical period; training a fourth risk assessment model using the third labeling event sample set, and predicting the fourth labeling event sample set using the trained fourth risk assessment model to obtain a predicted risk score for each fourth labeling sample; and performing feature expansion on each fourth labeling sample using its predicted risk score to obtain a corresponding fifth labeling sample, the fifth labeling samples being used to form the second labeling event sample set.
Further, in a more specific embodiment, before training a fourth risk assessment model using the third labeling sample, the method further comprises: splitting the characteristic dimension of each third labeling sample in the third labeling event sample set according to a preset mode to obtain a preset number of sub-samples, wherein the sub-samples are correspondingly classified into the preset number of sub-sample sets; wherein the fourth risk assessment model includes the predetermined number of sub-models; the training of the fourth risk assessment model by using the third labeling sample, and the predicting of the fourth labeling event sample set by using the trained fourth risk assessment model, to obtain the predicted risk score of each fourth labeling sample, includes: correspondingly training the predetermined number of sub-models by using the predetermined number of sub-sample sets; and processing each fourth labeling sample by using the preset number of sub-models respectively to obtain a preset number of predicted risk scores corresponding to the fourth labeling samples.
Further, in one example, for each fourth labeling sample, the feature expansion is performed on the fourth labeling sample by using its predicted risk score to obtain a corresponding fifth labeling sample, which includes: and carrying out preset calculation on each fourth labeling sample based on the preset number of predicted risk scores of the fourth labeling samples, and carrying out feature expansion on the fourth labeling samples by utilizing calculation results to obtain corresponding fifth labeling samples.
In one embodiment, the trained first risk assessment model is used to construct the risk assessment system.
In one embodiment, each risk assessment model in the risk assessment system is implemented based on a tree model.
According to a second aspect, there is provided a risk assessment method comprising: obtaining a target event sample to be detected; inputting the target event sample into a risk assessment system constructed by adopting the method of any one of the first aspect to obtain a plurality of risk scores predicted by a plurality of risk assessment models; and determining a risk assessment result of the target event sample based on the plurality of risk scores.
In one embodiment, determining a risk assessment result for the target event sample based on the number of risk scores comprises: calculating the average value of a plurality of risk scores, and if the average value is larger than a score threshold value, determining that the risk exists as the risk assessment result; or determining the number of risk scores greater than a score threshold value in the plurality of risk scores, and if the number of risk scores is greater than the score threshold value, determining that the risk exists as the risk assessment result.
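The two decision rules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the default threshold values, and the separate `count_threshold` parameter used for the second rule are assumptions introduced here.

```python
from statistics import mean


def assess_by_mean(scores, score_threshold=0.5):
    """Rule 1: risk is present if the mean of the model scores exceeds the threshold."""
    return mean(scores) > score_threshold


def assess_by_count(scores, score_threshold=0.5, count_threshold=1):
    """Rule 2: risk is present if enough individual scores exceed the threshold.

    `count_threshold` is an assumed, separate cut-off on the number of
    high scores; it is not named in the text.
    """
    high = sum(1 for s in scores if s > score_threshold)
    return high > count_threshold


# Hypothetical scores produced by three risk assessment models for one event.
scores = [0.9, 0.7, 0.2]
print(assess_by_mean(scores), assess_by_count(scores))
```

The mean rule lets one very confident model dominate, while the count rule behaves like a vote; which is preferable depends on how well calibrated the individual models are.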
According to a third aspect, there is provided an apparatus for constructing a risk assessment system, comprising: a first training unit configured to train a first risk assessment model using a first labeling event sample set, the first labeling event sample set including a first number of black samples and a second number of white samples, the first number being greater than the second number; a gray sample prediction unit configured to process a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score for each gray sample, each gray sample having been identified as a risk sample by an existing risk control technology; a gray sample screening unit configured to select, based on the predicted risk scores, a portion of the plurality of gray samples as an expansion of the black samples in a second labeling event sample set, the second labeling event sample set initially including a third number of black samples and a fourth number of white samples, the third number being less than the fourth number; and a second training unit configured to train a second risk assessment model using the expanded second labeling event sample set, the trained second risk assessment model being used to construct the risk assessment system.
According to a fourth aspect, there is provided a risk assessment apparatus comprising: a target sample acquisition unit configured to acquire a target event sample to be detected; the risk prediction unit is configured to input the target event sample into the risk assessment system constructed by adopting the method according to any one of the first aspect, so as to obtain a plurality of risk scores predicted by a plurality of risk assessment models; and a result determining unit configured to determine a risk assessment result of the target event sample based on the plurality of risk scores.
According to a fifth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first or second aspect.
According to a sixth aspect, there is provided a computing device comprising a memory having executable code stored therein and a processor which when executing the executable code implements the method of the first or second aspect.
By adopting the method and the device provided by the embodiment of the specification, the training sample set with higher black sample concentration is constructed, and the gray samples which have a large amount of black sample information and cannot be effectively utilized originally are screened and used, so that the identification accuracy of the risk samples is effectively improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments below are briefly introduced, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a schematic diagram of a build implementation of a risk assessment system according to one embodiment;
FIG. 2 illustrates a flow diagram of a method of constructing a risk assessment system, according to one embodiment;
FIG. 3 illustrates a flow chart of a risk assessment method according to one embodiment;
FIG. 4 illustrates a schematic construction of a build device of a risk assessment system according to one embodiment;
FIG. 5 illustrates a schematic structural diagram of a risk assessment apparatus according to one embodiment.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
As mentioned above, existing risk assessment methods have limited effectiveness, and one of the main reasons is that the collected historical risk control data suffers from label imbalance. Specifically, because the vast majority of events in a risk control scenario are benign, the proportion of samples labeled black is extremely small, which makes machine learning modeling very difficult. In addition, the inventors found that risk control scenarios contain a large number of high-risk samples (also called gray samples) identified by existing risk control technology; these samples are usually blocked in time, so no user report is received for them, yet with high probability they are black samples.
Thus, the inventors propose to enhance risk identification using gray samples, which in fact carry a large amount of black sample information. FIG. 1 illustrates a schematic diagram of an implementation of constructing a risk assessment system according to one embodiment. As shown in FIG. 1, a first risk assessment model is first trained using an event sample set with a high black sample concentration, that is, one that includes a large number of black samples and a small number of white samples. The trained first risk assessment model is then used to screen the gray sample set and pick out the black samples in it. Next, the screened black samples are used to expand the original labeling event sample set, which initially has a small number of black samples and a large number of white samples, so that a second risk assessment model can be trained on the expanded, label-balanced event sample set and used for constructing the risk assessment system. A risk assessment system constructed in this way can identify risk samples accurately.
Scheme steps for realizing the above inventive concept are described below in conjunction with specific embodiments.
Fig. 2 shows a schematic flow diagram of a method for constructing a risk assessment system according to an embodiment, where the execution subject of the method may be any apparatus, platform, or device cluster with computing, processing capabilities, etc. As shown in fig. 2, the method comprises the steps of:
Step S210: training a first risk assessment model using a first labeling event sample set; the first labeling event sample set includes a first number of black samples and a second number of white samples, the first number being greater than the second number.
Step S220: processing a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score for each gray sample; each gray sample has been identified as a risk sample by an existing risk control technology.
Step S230: selecting, based on the predicted risk scores, a portion of the plurality of gray samples as an expansion of the black samples in a second labeling event sample set; the second labeling event sample set initially includes a third number of black samples and a fourth number of white samples, the third number being less than the fourth number.
Step S240: training a second risk assessment model using the expanded second labeling event sample set; the trained second risk assessment model is used to construct the risk assessment system.
The above steps are described in detail as follows:
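As a rough orientation, steps S210 to S240 can be sketched end to end as below. This is only an illustrative outline under stated assumptions: `build_risk_system`, `toy_train`, the `(features, label)` tuple layout, and the `gray_threshold` value are all hypothetical names and choices introduced here, and the toy trainer merely stands in for whatever model the embodiment actually uses.

```python
def build_risk_system(first_set, gray_samples, second_set, train_model, gray_threshold=0.8):
    """Sketch of steps S210-S240.

    first_set / second_set: lists of (features, label); label 1 = black, 0 = white.
    gray_samples: feature vectors flagged by an existing risk control technology.
    train_model: hypothetical callable mapping labeled samples to a scoring
        function (features -> risk score in [0, 1]).
    """
    # S210: train the first model on the black-heavy labeled set.
    first_model = train_model(first_set)
    # S220: score every gray sample with the trained first model.
    gray_scores = [first_model(x) for x in gray_samples]
    # S230: keep high-scoring gray samples and add them as black samples.
    selected = [x for x, s in zip(gray_samples, gray_scores) if s > gray_threshold]
    expanded = second_set + [(x, 1) for x in selected]
    # S240: train the second model on the expanded set.
    second_model = train_model(expanded)
    return [first_model, second_model], expanded


def toy_train(samples):
    """Hypothetical stand-in trainer: scores a sample by its first feature, clipped to [0, 1]."""
    return lambda x: max(0.0, min(1.0, x[0]))


first_set = [([0.9], 1), ([0.8], 1), ([0.1], 0)]
second_set = [([0.9], 1), ([0.1], 0), ([0.2], 0)]
gray = [[0.95], [0.3]]
models, expanded = build_risk_system(first_set, gray, second_set, toy_train)
```

Passing the trainer in as a callable keeps the sketch independent of any particular model family.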
Step S210: a first risk assessment model is trained using the first labeling event sample set. It should be noted that the first risk assessment model is used to screen black samples out of the gray samples. Considering that gray samples carry no labels but have a very high risk level, it is proposed to construct a labeling data set with a high black sample concentration that matches the data distribution of the gray samples, and to train the risk assessment model on it, so that black samples among the gray samples can be screened accurately.
The first labeling event sample set described above may be determined based on the collected original labeling event sample set (also called the second labeling event sample set). Note that labeling event samples are event samples that carry labels: a sample whose label indicates risk is a black sample, and a sample whose label indicates no risk is a white sample. The events to which the event samples relate may include transaction events, access events, login events, and the like. The features in an event sample may include the time of occurrence of the event, the network address, the geographic location, device information of the terminal device involved (e.g., device ID, model), and user information of the user involved (e.g., gender, age, hobbies, activity level, network behavior preferences).
On the other hand, the number of black samples in the first labeling event sample set is more than the number of white samples, and the number of white samples in the second labeling event sample set is more than the number of black samples. Thus, the first set of annotation event samples may be constructed in such a way that a large number of white samples are removed from the second set of annotation event samples. In one embodiment, a portion of the white samples may be selected from the second set of labeling event samples for rejection, and then the remaining black and white samples may be combined into the first set of labeling event samples.
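A minimal sketch of this simplest construction is given below. The text does not say how the white samples to be removed are chosen; random sampling with a fixed seed is an assumption made here, as are the function name and the `keep_white` parameter.

```python
import random


def build_black_heavy_set(second_set, keep_white, seed=0):
    """Keep all black samples and only `keep_white` white samples.

    second_set: list of (features, label); label 1 = black, 0 = white.
    ASSUMPTION: the white samples to keep are drawn uniformly at random.
    """
    blacks = [s for s in second_set if s[1] == 1]
    whites = [s for s in second_set if s[1] == 0]
    if keep_white >= len(blacks):
        # The first set must contain more black samples than white samples.
        raise ValueError("keep_white must be smaller than the number of black samples")
    rng = random.Random(seed)
    return blacks + rng.sample(whites, keep_white)


# Hypothetical imbalanced set: 3 black samples and 10 white samples.
second_set = [([i], 1) for i in range(3)] + [([i], 0) for i in range(10)]
first_set = build_black_heavy_set(second_set, keep_white=2)
```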
In another embodiment, for efficient use of the first labeling event sample set, and so that the constructed first labeling event sample set closely approximates the data distribution of the gray samples, it is proposed to introduce a two-tower structure in which the two towers cross-validate each other. Specifically, the second labeling event sample set is split into two labeling subsets, two corresponding risk assessment models are trained using the two labeling subsets, and the two labeling subsets are cross scored using the two trained risk assessment models to obtain a predicted risk score for each labeling sample in the second labeling event sample set; further, based on the predicted risk scores, the first number of black samples and the second number of white samples are selected from the second labeling event sample set to form the first labeling event sample set.
Further, in a specific embodiment, the black sample and the white sample in the second labeling event sample set may be divided into two equal parts, and then one of the black samples and one of the white samples form one labeling subset, and the other of the black samples and the other of the white samples form the other labeling subset, so as to obtain two labeling subsets, which are referred to as a first labeling subset and a second labeling subset for convenience of description.
In a particular embodiment, a risk assessment model m_A is trained using the first labeling subset, and a risk assessment model m_B is trained using the second labeling subset. Next, the trained risk assessment model m_A is used to score the first labeling subset, and the trained risk assessment model m_B is used to score the second labeling subset, so that the predicted risk score (or predicted risk probability) of each labeling sample in the second labeling event sample set can be obtained. Training two risk assessment models on the split original labeling data set in this way saves training time, improves training efficiency, and can effectively prevent overfitting; further, cross-validation scoring enables an accurate assessment of the risk level of every labeling sample.
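The split and scoring steps can be sketched as follows. This sketch assumes the out-of-fold reading of cross scoring, in which each model scores the subset it was not trained on; swap the two scoring lines if the literal pairing in the text is intended. The function names and the `train_model` callable are hypothetical, and the trainer shown is only a placeholder.

```python
import random


def split_in_two(samples, seed=0):
    """Split labeled samples into two subsets, halving blacks and whites separately."""
    rng = random.Random(seed)
    subset_a, subset_b = [], []
    for label in (1, 0):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        half = len(group) // 2
        subset_a += group[:half]
        subset_b += group[half:]
    return subset_a, subset_b


def cross_score(subset_a, subset_b, train_model):
    """Train one model per subset and score every sample.

    Returns a list of (features, label, score) plus the two trained models.
    """
    model_a = train_model(subset_a)
    model_b = train_model(subset_b)
    # ASSUMPTION: each model scores the subset it was NOT trained on.
    scored = [(x, y, model_b(x)) for x, y in subset_a]
    scored += [(x, y, model_a(x)) for x, y in subset_b]
    return scored, (model_a, model_b)


def toy_train(samples):
    """Hypothetical placeholder trainer: the score is simply the first feature."""
    return lambda x: x[0]


samples = [([0.9], 1), ([0.8], 1), ([0.7], 1), ([0.6], 1),
           ([0.1], 0), ([0.2], 0), ([0.3], 0), ([0.4], 0)]
subset_a, subset_b = split_in_two(samples)
scored, (m_a, m_b) = cross_score(subset_a, subset_b, toy_train)
```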
In a specific embodiment, the predicted risk scores of the labeling samples are sorted in descending order, that is, from largest to smallest. It should be understood that the description herein mainly takes as an example the case in which a higher risk score corresponds to a higher degree of risk. Then, based on the sorting result, the first number of black samples ranked at the front are selected, and the second number of white samples ranked at the back are selected. In another specific embodiment, from the plurality of black samples (i.e., the third number of black samples) in the second labeling event sample set, the samples whose predicted risk score is greater than a first threshold (e.g., 0.8) are selected as the first number of black samples, and from the plurality of white samples (i.e., the fourth number of white samples), the samples whose predicted risk score is less than a second threshold (e.g., 0.1) are selected as the second number of white samples.
In this manner, the first labeling event sample set may be constructed from the selected first number of black samples and second number of white samples. In addition, the trained risk assessment models m_A and m_B may be used to build the risk assessment system, that is, they may be integrated into the risk assessment system.
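Both selection variants can be illustrated with a short sketch. The `(features, label, score)` tuple layout and the function names are assumptions made for illustration; the threshold defaults are the example values 0.8 and 0.1 from the text.

```python
def select_by_rank(scored, n_black, n_white):
    """Pick the n_black highest-scoring blacks and the n_white lowest-scoring whites.

    scored: list of (features, label, score); label 1 = black, 0 = white.
    """
    ranked = sorted(scored, key=lambda t: t[2], reverse=True)  # descending by score
    blacks = [t for t in ranked if t[1] == 1][:n_black]
    ranked_whites = [t for t in ranked if t[1] == 0]
    whites = ranked_whites[-n_white:] if n_white else []
    return [(x, y) for x, y, _ in blacks + whites]


def select_by_threshold(scored, black_min=0.8, white_max=0.1):
    """Keep blacks scoring above black_min and whites scoring below white_max."""
    return [(x, y) for x, y, s in scored
            if (y == 1 and s > black_min) or (y == 0 and s < white_max)]


# Hypothetical cross-scored samples.
scored = [([1], 1, 0.95), ([2], 1, 0.85), ([3], 1, 0.4),
          ([4], 0, 0.05), ([5], 0, 0.3), ([6], 0, 0.7)]
by_rank = select_by_rank(scored, n_black=2, n_white=1)
by_threshold = select_by_threshold(scored)
```

Either way the result keeps confidently risky black samples and confidently benign white samples, which is what gives the first set its high black sample concentration.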
Further, a first risk assessment model may be trained using the first labeling event sample set. In one embodiment, the first risk assessment model may be implemented as a tree model, a deep learning (DL) model, or a Bayesian network model. In a specific embodiment, the tree model may be a gradient boosting decision tree (GBDT), an ID3 classification tree, or a C4.5 classification tree. In a specific embodiment, the features of each sample in the first labeling event sample set may be input into the first risk assessment model to obtain a corresponding risk assessment result; a training gradient is then calculated based on the risk assessment result and the corresponding sample label (for example, 1 for a black sample and 0 for a white sample), and the model parameters of the first risk assessment model are adjusted by back propagation according to the training gradient. The model parameters are updated iteratively in this way until they converge, yielding the trained first risk assessment model.
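The gradient-based training loop described above can be illustrated with a deliberately small stand-in. The sketch below uses plain logistic regression trained by batch gradient descent, which is a substitution made here for brevity and not one of the model types named in the text; the learning rate, iteration cap, and tolerance are arbitrary assumed values.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, lr=0.5, max_iters=2000, tol=1e-6):
    """Fit logistic regression by gradient descent.

    samples: list of (features, label); label 1 = black, 0 = white.
    Returns a scoring function features -> predicted risk probability.
    """
    dim = len(samples[0][0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_iters):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        step_w = [lr * g / n for g in grad_w]
        step_b = lr * grad_b / n
        w = [wi - si for wi, si in zip(w, step_w)]
        b -= step_b
        # Stop once the parameter updates are negligibly small.
        if max(abs(s) for s in step_w) < tol and abs(step_b) < tol:
            break
    return lambda x: sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical black-heavy training set with a single feature.
first_set = [([2.0], 1), ([1.5], 1), ([1.8], 1), ([-1.0], 0)]
first_model = train_logistic(first_set)
```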
From this, a trained first risk assessment model can be obtained. In addition, the trained first risk assessment model may be used to construct a risk assessment system.
Then, in step S220, the plurality of gray samples are processed using the trained first risk assessment model to obtain a predicted risk score for each gray sample. Note that a gray sample is a sample that an existing risk control technology has identified as a risk sample; it is a black sample with high probability, but without manual labeling it cannot be determined whether it actually is one. In one embodiment, existing risk control technologies may include risk control rules, risk control strategies, risk control models, and the like. In one embodiment, in an online transaction scenario, when an existing risk control technology identifies a risk in the current transaction, it typically intervenes in the transaction, for example by blocking the transaction from proceeding or by closing it. Thus, even if the transaction is in fact risky, it is never completed and no user report is received. In this case, the transaction sample may be used as a gray sample.
In this step, each of the collected gray samples is scored using the trained first risk assessment model, so that a predicted risk score is obtained for each gray sample. Next, in step S230, based on the predicted risk score of each gray sample, a portion of the plurality of gray samples is selected as an expansion of the black samples in the second labeling event sample set.
In one embodiment, gray samples with predicted risk scores greater than a preset threshold may be selected, labeled with risk, and supplemented as black samples to the second set of labeled event samples. It is to be understood that the preset threshold may be set manually. For example, a data distribution map of the predicted risk score may be drawn according to a plurality of predicted risk scores corresponding to a plurality of gray samples, and then the threshold value may be set according to the data distribution map.
In another embodiment, the predicted risk scores of the plurality of gray samples may be sorted from high to low, and the gray samples ranked within a predetermined range (e.g., the top 100,000 or the top 30%) are then selected and added to the second labeling event sample set as black samples.
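The two screening options for gray samples (a score threshold, or a top-ranked range) can be sketched as below. The function names and parameters are hypothetical, and the sketch assumes that selected gray samples are simply given the black label 1.

```python
def select_gray_by_threshold(gray_samples, scores, threshold):
    """Label as black every gray sample whose score exceeds the threshold."""
    return [(x, 1) for x, s in zip(gray_samples, scores) if s > threshold]


def select_gray_top(gray_samples, scores, top_n=None, top_fraction=None):
    """Label as black the top_n (or top fraction of) gray samples by score."""
    if top_n is None and top_fraction is None:
        raise ValueError("give either top_n or top_fraction")
    ranked = sorted(zip(gray_samples, scores), key=lambda t: t[1], reverse=True)
    if top_n is None:
        top_n = int(len(ranked) * top_fraction)
    return [(x, 1) for x, _ in ranked[:top_n]]


# Hypothetical gray samples and their predicted risk scores.
gray = ["g1", "g2", "g3", "g4"]
scores = [0.9, 0.2, 0.95, 0.6]
by_threshold = select_gray_by_threshold(gray, scores, threshold=0.8)
by_top_n = select_gray_top(gray, scores, top_n=2)
by_fraction = select_gray_top(gray, scores, top_fraction=0.5)
```

A fixed threshold adapts the number of added samples to the score distribution, whereas a top-ranked range fixes that number in advance, which makes the resulting label balance easier to control.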
By the method, the actual black samples can be selected from the gray sample set and supplemented to the second labeling event sample set, so that the labels in the expanded second labeling event sample set are balanced, namely the difference between the numbers of the black and white samples is effectively reduced.
Then, in step S240, a second risk assessment model is trained using the expanded second labeling event sample set, and the trained second risk assessment model is used to construct the risk assessment system. In one embodiment, the second risk assessment model may be based on the same model algorithm as the first risk assessment model or may be different.
It should be noted that, compared with the set before expansion, the expanded second labeling event sample set has been supplemented with a large number of black samples screened from the gray samples, so its sample distribution changes from label-imbalanced to label-balanced. A second risk assessment model trained on this label-balanced labeling sample set therefore has better prediction performance.
From this, the trained second risk assessment model may be integrated into the risk assessment system.
According to another embodiment, the inventors further propose removing abnormal samples from the second labeling event sample set and training a third risk assessment model using the remaining samples, which effectively improves model performance and strengthens the generalization capability of the model. Specifically, for each labeling sample in the second labeling event sample set before expansion: if the labeling sample is a black sample and the predicted risk score obtained for it during the cross-validation is smaller than a first threshold (e.g., 0.6), it is removed from the second labeling event sample set; if the labeling sample is a white sample and the predicted risk score obtained for it during the cross-validation is greater than a second threshold (e.g., 0.4), it is likewise removed from the second labeling event sample set.
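The removal rule can be expressed in a few lines. The sketch below assumes the cross-validation scores are available as `(features, label, score)` tuples, a layout chosen here for illustration, and uses the example thresholds 0.6 and 0.4 from the text as defaults.

```python
def remove_abnormal(scored, black_min=0.6, white_max=0.4):
    """Drop black samples scored below black_min and white samples scored above white_max.

    scored: list of (features, label, score); label 1 = black, 0 = white.
    Returns the remaining (features, label) pairs.
    """
    kept = []
    for x, y, s in scored:
        if y == 1 and s < black_min:
            continue  # labeled black, but the models consider it low risk
        if y == 0 and s > white_max:
            continue  # labeled white, but the models consider it high risk
        kept.append((x, y))
    return kept


# Hypothetical cross-validation scores.
scored = [([1], 1, 0.9), ([2], 1, 0.3), ([3], 0, 0.1), ([4], 0, 0.7)]
cleaned = remove_abnormal(scored)
```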
Further, a third risk assessment model may be trained based on labeled samples that remain after the abnormal black-and-white samples are removed. In one embodiment, the third risk assessment model may be trained directly based on the labeled samples. In another embodiment, the third risk assessment model may be trained using the remaining annotation samples from the second set of annotation event samples and the black samples screened from the gray sample set as described above. Thus, the trained third risk assessment model may be integrated into the risk assessment system.
According to an embodiment of a further aspect, risk control is a highly adversarial scenario: a model trained in the previous month may already fail by the next month and must be retrained with newly acquired samples. Therefore, the inventor proposes to also make use of the samples collected earlier, so as to improve the accuracy of the currently trained model.
Specifically, the second labeling event sample set may be constructed based on the following steps: first, a third labeling event sample set corresponding to a first historical period is obtained, and a fourth labeling event sample set corresponding to a second historical period is obtained, wherein the first historical period (e.g., February 2022) is earlier than the second historical period (e.g., March 2022); then, a fourth risk assessment model is trained by using the third labeling event sample set, and the fourth labeling event sample set is predicted by using the trained fourth risk assessment model to obtain a predicted risk score of each fourth labeling sample; and then, for each fourth labeling sample, feature expansion is performed on the fourth labeling sample by using its predicted risk score to obtain a corresponding fifth labeling sample, wherein the fifth labeling samples are used for forming the second labeling event sample set.
In a specific embodiment, to further improve the usability of the expanded feature, after the third labeling event sample set is obtained, the feature dimensions of each third labeling sample may be split in a preset manner to obtain a plurality of (m) sub-samples, which are correspondingly classified into m sub-sample sets. For example, for transaction event labeling samples, assume that dimensions 1-50 are features of the transaction user, dimensions 51-100 are features of the transaction merchant, and dimensions 101-150 are features of the transaction order; each transaction event labeling sample can then be split into 3 sub-samples that share the label of the original sample, and the 3 sub-samples are correspondingly classified into 3 sub-sample sets. Further, the m sub-sample sets may be used to correspondingly train m sub-models that constitute the fourth risk assessment model, i.e., the i-th sub-model is trained using the i-th sub-sample set. Then, each fourth labeling sample is processed by the trained m sub-models respectively to obtain the predicted risk scores corresponding to the fourth labeling sample.
In a specific embodiment, for each fourth labeling sample, the predicted risk score of the fourth labeling sample can be directly used as a new feature to perform feature expansion. In another specific embodiment, if each fourth labeling sample has m predicted risk scores, a predetermined calculation (such as taking the mean or the median) may be performed on the m predicted risk scores, and the calculation result may be used as a new feature, thereby implementing feature expansion.
Therefore, feature expansion of the current training data can be realized by utilizing the past training data, so that the usability of the second labeling event sample set is improved, and the recognition effect of the risk assessment system is further improved.
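As an illustration of the feature expansion described above, the sketch below splits a sample's feature vector into slices, scores each slice with its own sub-model, and appends an aggregate of the scores as a new feature. It rests on several assumptions not stated in the text: sub-models are represented as plain callables returning a score, each sub-model is applied to the same feature slice it was trained on, and the names `split_features`, `expand_features`, and `boundaries` are hypothetical.

```python
from statistics import mean, median
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a trained sub-model: feature slice -> risk score.
SubModel = Callable[[Sequence[float]], float]


def split_features(
    features: Sequence[float], boundaries: Sequence[Tuple[int, int]]
) -> List[List[float]]:
    """Split a feature vector into slices given half-open (start, end) index ranges."""
    return [list(features[start:end]) for start, end in boundaries]


def expand_features(
    features: Sequence[float],
    sub_models: Sequence[SubModel],
    boundaries: Sequence[Tuple[int, int]],
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> List[float]:
    """Append one aggregated sub-model score to the original features."""
    slices = split_features(features, boundaries)
    # The i-th sub-model scores the i-th feature slice (an assumption).
    scores = [model(part) for model, part in zip(sub_models, slices)]
    return list(features) + [aggregate(scores)]
```

Passing `aggregate=median` gives the median variant mentioned in the text; for the 150-dimensional example, `boundaries` would be `[(0, 50), (50, 100), (100, 150)]` with zero-based indices.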
In this way, a risk assessment system comprising at least the trained second risk assessment model can be constructed; further, the trained risk assessment model m_A, risk assessment model m_B, third risk assessment model, or fourth risk assessment model may also be integrated in the risk assessment system. Thus, a risk assessment system may be constructed that includes one or more trained risk assessment models.
According to an embodiment of a further aspect, the inventor proposes that the idea of ensemble learning can be used more fully: more risk assessment models can be trained, so that the accuracy of the finally integrated prediction result is improved. This specifically comprises the following steps:
1) The original labeling event sample set (namely, the second labeling event sample set) is obtained and split into a first labeling subset and a second labeling subset, which are then used to correspondingly train risk assessment model m_A and risk assessment model m_B.
2) The trained risk assessment model m_B is used to process the first labeling subset to obtain a predicted risk score of each first labeling sample; according to the predicted risk scores, a large number of white samples are removed from the first labeling subset so that the number of original first black samples is greater than the number of remaining first white samples, thereby forming a first high-black-concentration sample set, which is used to train risk assessment model m_C. Similarly, risk assessment model m_A is used to process the second labeling subset to obtain a predicted risk score of each second labeling sample; according to the predicted risk scores, a large number of white samples are removed from the second labeling subset so that the number of original second black samples is greater than the number of remaining second white samples, thereby forming a second high-black-concentration sample set, which is used to train risk assessment model m_D.
3) A gray sample set is acquired and split into two gray sample subsets. Risk assessment model m_C is used to process the first gray sample subset, so that a first part of gray samples is selected from the first gray sample subset according to the predicted risk score of each first gray sample and added to the first labeling subset as black samples, and risk assessment model m_E is trained using the expanded first labeling subset. Similarly, risk assessment model m_D is used to process the second gray sample subset, so that a second part of gray samples is selected from the second gray sample subset according to the predicted risk score of each second gray sample and added to the second labeling subset as black samples, and risk assessment model m_F is trained using the expanded second labeling subset.
4) According to the predicted risk score of each first labeling sample in the first labeling subset, first abnormal samples, including high-scoring first white samples and low-scoring first black samples, are removed from the first labeling subset, and risk assessment model m_G is trained using the remaining first black and white samples together with the first part of gray samples screened as black samples. Similarly, according to the predicted risk score of each second labeling sample in the second labeling subset, second abnormal samples, including high-scoring second white samples and low-scoring second black samples, are removed from the second labeling subset, and risk assessment model m_H is trained using the remaining second black and white samples together with the second part of gray samples screened as black samples.
From the above, 8 trained risk assessment models, namely m_A to m_H, can be obtained and then integrated to obtain a risk assessment system with excellent prediction performance.
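Steps 1) and 2) start from a cross-scoring pattern: each half of the labeled data is scored by the model trained on the other half. A minimal sketch of that pattern is given below. The `train` argument is a hypothetical placeholder for whatever training routine is used (the text mentions tree models elsewhere), the even split by position is an assumption, and the function name `cross_score` is introduced here for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]        # (features, label); 1 = black, 0 = white
Model = Callable[[Sequence[float]], float]  # features -> predicted risk score
Trainer = Callable[[List[Sample]], Model]   # hypothetical training routine


def cross_score(labeled: List[Sample], train: Trainer):
    """Split the labeled set in two, train m_A and m_B, and cross-score the halves."""
    half = len(labeled) // 2
    subset_1, subset_2 = labeled[:half], labeled[half:]

    m_a = train(subset_1)  # risk assessment model m_A, trained on the first subset
    m_b = train(subset_2)  # risk assessment model m_B, trained on the second subset

    # Each subset is scored by the model that did NOT see it during training.
    scores_1 = [m_b(features) for features, _ in subset_1]
    scores_2 = [m_a(features) for features, _ in subset_2]
    return (subset_1, scores_1), (subset_2, scores_2), (m_a, m_b)
```

The per-sample scores returned here are what the later steps consume: white-sample removal in step 2) and abnormal-sample removal in step 4).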
According to an embodiment of yet another aspect, the present specification also discloses a method for using the risk assessment system constructed above. Fig. 3 shows a flow chart of a risk assessment method according to one embodiment; the execution subject of the method may be any platform, server, or cluster of devices with computing and processing capabilities. As shown in fig. 3, the method comprises the following steps:
Step S310, obtaining a target event sample to be detected; step S320, inputting the target event sample into a constructed risk assessment system to obtain a plurality of risk scores predicted by a plurality of risk assessment models; step S330, determining a risk assessment result of the target event sample based on the several risk scores.
The above steps are described in detail as follows:
First, in step S310, a target event sample to be detected is acquired. Illustratively, payment information is obtained in response to a user triggering a payment operation, forming a payment event sample.
Next, in step S320, the target event sample is input into the constructed risk assessment system, so as to obtain a plurality of risk scores predicted by a plurality of risk assessment models. It should be understood that the constructed risk assessment system may include one or more risk assessment models as described above, and accordingly, one or more risk scores corresponding to the target event samples may be predicted.
Then, in step S330, a risk assessment result of the target event sample is determined based on the several risk scores.
In one embodiment, in the case that the several risk scores consist of a single risk score, that risk score may be directly compared with a preset score threshold (e.g., 0.75); if the risk score is greater than the score threshold, the target event is determined to be at risk, otherwise it is determined that there is no risk. In another embodiment, in the case that the several risk scores comprise a plurality of risk scores, the average value of the plurality of risk scores may be obtained, and the target event is determined to be at risk if the average value is greater than the score threshold, otherwise it is determined that there is no risk; alternatively, the number of risk scores among the plurality of risk scores that are greater than the score threshold may be determined, and if that number (e.g., 6) is greater than a number threshold (e.g., 4), the target event is determined to be at risk, otherwise it is determined that there is no risk.
By this method, the risk assessment system can be put to use, and an accurate risk identification result is obtained.
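The decision rules of step S330 can be sketched as a single function. This is an illustrative sketch only: the function name `assess_risk` and the convention of selecting the voting rule by passing `count_threshold` are assumptions made here; the default 0.75 is the example score threshold from the text.

```python
from statistics import mean
from typing import Optional, Sequence


def assess_risk(
    scores: Sequence[float],
    score_threshold: float = 0.75,
    count_threshold: Optional[int] = None,
) -> bool:
    """Return True if the target event is judged to be at risk."""
    if not scores:
        raise ValueError("at least one risk score is required")
    if len(scores) == 1:
        # Single model: compare its score with the score threshold directly.
        return scores[0] > score_threshold
    if count_threshold is None:
        # Several models, averaging rule.
        return mean(scores) > score_threshold
    # Several models, voting rule: count the scores above the score threshold.
    votes = sum(1 for s in scores if s > score_threshold)
    return votes > count_threshold
```

With eight models, for example, `assess_risk(scores, count_threshold=4)` flags the event when more than four of the eight scores exceed 0.75.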
Corresponding to the construction and use methods of the risk assessment system, the embodiment of the present specification also discloses construction and use devices. FIG. 4 shows a schematic construction diagram of a construction apparatus of a risk assessment system according to one embodiment, as shown in FIG. 4, the apparatus comprising the following units:
a first training unit 410 configured to train a first risk assessment model using a first labeling event sample set; the first labeling event sample set includes a first number of black samples and a second number of white samples, the first number being greater than the second number; a gray sample prediction unit 420 configured to process a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score for each of the gray samples; each gray sample is identified as a risk sample by the existing risk control technology; a gray sample screening unit 430 configured to select a part of gray samples from the plurality of gray samples based on the predicted risk scores as an extension to the black samples in a second labeling event sample set; the second labeling event sample set initially comprises a third number of black samples and a fourth number of white samples, the third number being less than the fourth number; a second training unit 440 configured to train a second risk assessment model using the expanded second labeling event sample set; the trained second risk assessment model is used to construct the risk assessment system.
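The work of the gray sample screening unit 430 can be illustrated with the short sketch below. The text only says that gray samples are selected "based on the predicted risk score"; the concrete rule used here (promote gray samples whose score is at least `min_score`) and the names `expand_with_gray` and `min_score` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (features, label); 1 = black, 0 = white


def expand_with_gray(
    labeled: List[Sample],
    gray_features: List[Sequence[float]],
    gray_scores: Sequence[float],
    min_score: float = 0.9,  # assumed cut-off, not taken from the text
) -> List[Sample]:
    """Add high-scoring gray samples to the labeled set as black samples."""
    promoted = [
        (features, 1)
        for features, score in zip(gray_features, gray_scores)
        if score >= min_score
    ]
    return list(labeled) + promoted
```

The returned list corresponds to the expanded second labeling event sample set on which the second risk assessment model is trained.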
In one embodiment, the apparatus further comprises a first sample set determination unit 450 comprising the following subunits: a first splitting subunit 451 configured to split the second labeling event sample set into two labeling subsets; a training subunit 452 configured to correspondingly train two risk assessment models using the two labeling subsets, for constructing the risk assessment system; a scoring subunit 453 configured to cross score the two labeling subsets by using the trained two risk assessment models, so as to obtain a predicted risk score of each labeling sample in the second labeling event sample set; and a selecting subunit 454 configured to select, based on the predicted risk score of each labeling sample, the first number of black samples and the second number of white samples from the second labeling event sample set, to form the first labeling event sample set.
Further, in a specific embodiment, the selecting subunit 454 is specifically configured to: sort the labeling samples in reverse order of their predicted risk scores; and, according to the reverse-sorting result, select the first number of black samples ranked at the front from the plurality of black samples, and select the second number of white samples ranked at the rear from the plurality of white samples.
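A literal reading of this selection rule can be sketched as follows. Two assumptions are made and should be checked against the original specification: "reverse sorting" is taken to mean descending order of predicted risk score, and "front" and "rear" are taken to mean the highest-scoring black samples and the lowest-scoring white samples respectively. The name `select_first_set` and the tuple layout are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (features, label, predicted risk score); 1 = black, 0 = white.
ScoredSample = Tuple[Sequence[float], int, float]


def select_first_set(
    samples: List[ScoredSample], n_black: int, n_white: int
) -> List[ScoredSample]:
    """Pick top-ranked black samples and bottom-ranked white samples."""
    # Assumption: "reverse sorting" means descending by predicted risk score.
    ranked = sorted(samples, key=lambda s: s[2], reverse=True)
    blacks = [s for s in ranked if s[1] == 1]
    whites = [s for s in ranked if s[1] == 0]
    chosen_blacks = blacks[:n_black]
    # Guard n_white == 0, because whites[-0:] would return the whole list.
    chosen_whites = whites[-n_white:] if n_white > 0 else []
    return chosen_blacks + chosen_whites
```

Choosing `n_black` larger than `n_white` yields a set in which black samples outnumber white samples, as required of the first labeling event sample set.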
In a specific embodiment, the apparatus further comprises: an abnormal sample rejection unit 460 configured to: for each labeling sample, reject the labeling sample from the second labeling event sample set if it is a black sample and its predicted risk score is less than a first threshold value, or if it is a white sample and its predicted risk score is greater than a second threshold value; and train a third risk assessment model based on the second labeling event sample set after the rejection processing; the trained third risk assessment model is used for constructing the risk assessment system.
In a specific embodiment, the apparatus further comprises a second sample set determination unit 470 comprising the following subunits: a sample set obtaining subunit 471 configured to obtain a third labeling event sample set corresponding to a first historical period and obtain a fourth labeling event sample set corresponding to a second historical period, the first historical period being earlier than the second historical period; a prediction subunit 472 configured to train a fourth risk assessment model using the third labeling event sample set, and predict the fourth labeling event sample set using the trained fourth risk assessment model to obtain a predicted risk score of each fourth labeling sample therein; and a feature expansion subunit 473 configured to perform, for each fourth labeling sample, feature expansion on the fourth labeling sample by using its predicted risk score to obtain a corresponding fifth labeling sample for forming the second labeling event sample set.
Still further, in a more specific embodiment, the second sample set determining unit 470 further includes: a second splitting subunit 474 configured to split the feature dimensions of each third labeling sample in the third labeling event sample set in a preset manner to obtain a predetermined number of sub-samples, which are correspondingly classified into the predetermined number of sub-sample sets; wherein the fourth risk assessment model includes the predetermined number of sub-models; and wherein the prediction subunit 472 is specifically configured to: correspondingly train the predetermined number of sub-models by using the predetermined number of sub-sample sets; and process each fourth labeling sample by using the predetermined number of sub-models respectively to obtain the predetermined number of predicted risk scores corresponding to the fourth labeling sample.
In one example, the feature expansion subunit 473 is specifically configured to: for each fourth labeling sample, perform a predetermined calculation based on the predetermined number of predicted risk scores of the fourth labeling sample, and perform feature expansion on the fourth labeling sample by using the calculation result to obtain the corresponding fifth labeling sample.
In one embodiment, the trained first risk assessment model is used to construct the risk assessment system.
In one embodiment, each risk assessment model in the risk assessment system is implemented based on a tree model.
Fig. 5 shows a schematic structural diagram of a risk assessment apparatus according to an embodiment, wherein the apparatus shown comprises:
a target sample acquiring unit 510 configured to acquire a target event sample to be detected; the risk prediction unit 520 is configured to input the target event sample into the risk assessment system constructed by adopting the embodiment, so as to obtain a plurality of risk scores predicted by a plurality of risk assessment models; a result determining unit 530 configured to determine a risk assessment result of the target event sample based on the several risk scores.
In one embodiment, the result determination unit 530 is specifically configured to: calculate the average value of the plurality of risk scores, and if the average value is greater than a score threshold value, determine that the risk assessment result is that a risk exists; or determine the number of risk scores greater than the score threshold value among the plurality of risk scores, and if that number is greater than a number threshold value, determine that the risk assessment result is that a risk exists.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2 or 3.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, implements the method described in connection with fig. 2 or 3. Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention in further detail. They are not to be construed as limiting the scope of the invention; any modifications, equivalents, improvements, etc. made on the basis of the teachings of the invention are intended to fall within that scope.

Claims (15)

1. A method of constructing a risk assessment system, comprising:
training a first risk assessment model by using a first labeling event sample set; the first labelling event sample set includes a first number of black samples and a second number of white samples, the first number being greater than the second number;
processing a plurality of gray samples by using the trained first risk assessment model to obtain predicted risk scores of all the gray samples; each gray sample is identified as a risk sample by the existing risk control technology;
selecting a part of gray samples from the plurality of gray samples based on the prediction risk score as an expansion for black samples in a second labeling event sample set; the second labeling event sample set initially comprises a third number of black samples and a fourth number of white samples, the third number being less than the fourth number;
training a second risk assessment model by using the expanded second labeling event sample set; the trained second risk assessment model is used for constructing the risk assessment system;
wherein the event sample relates to an event including a transaction event, an access event, or a login event.
2. The method of claim 1, wherein prior to training the first risk assessment model with the first set of labeling event samples, the method further comprises:
splitting the second labeling event sample set into two labeling subsets;
training two risk assessment models correspondingly by using the two labeling subsets, and constructing the risk assessment system;
Cross scoring is carried out on the two annotation subsets by using the trained two risk assessment models, and predicted risk scores of all annotation samples in the second annotation event sample set are obtained;
and selecting the first number of black samples and the second number of white samples from the second labeling event sample set based on the prediction risk scores of the labeling samples to form the first labeling event sample set.
3. The method of claim 2, wherein selecting the first number of black samples and the second number of white samples from the second set of labeling event samples based on the predicted risk score for the respective labeling sample comprises:
the prediction risk scores of the labeling samples are subjected to reverse sequencing;
selecting the first number of black samples from the plurality of black samples, which are arranged in a front position, and selecting the second number of white samples from the plurality of white samples, which are arranged in a rear position, according to the reverse ordering result.
4. The method of claim 2, wherein after deriving the predicted risk score for each annotation sample in the second set of annotation event samples, the method further comprises:
for each labeling sample, rejecting the labeling sample from the second labeling event sample set if the labeling sample is a black sample and its predicted risk score is smaller than a first threshold value, or if the labeling sample is a white sample and its predicted risk score is larger than a second threshold value;
training a third risk assessment model based on the second labeling event sample set subjected to the elimination processing; and the trained third risk assessment model is used for constructing the risk assessment system.
5. The method of claim 2, wherein prior to splitting the second set of annotation event samples comprising the plurality of black samples and the plurality of white samples into two annotation subsets, the method further comprises:
acquiring a third labeling event sample set corresponding to the first historical period, and acquiring a fourth labeling event sample set corresponding to the second historical period; the first historical period is earlier than the second historical period;
training a fourth risk assessment model by using the third labeling event sample set, and predicting the fourth labeling event sample set by using the trained fourth risk assessment model to obtain predicted risk scores of all fourth labeling samples;
And carrying out feature expansion on each fourth labeling sample by utilizing the predicted risk score of each fourth labeling sample to obtain a corresponding fifth labeling sample, wherein the fifth labeling sample is used for forming the second labeling event sample set.
6. The method of claim 5, wherein prior to training a fourth risk assessment model using the third set of labeling event samples, the method further comprises:
splitting the characteristic dimension of each third labeling sample in the third labeling event sample set according to a preset mode to obtain a preset number of sub-samples, wherein the sub-samples are correspondingly classified into the preset number of sub-sample sets;
wherein the fourth risk assessment model includes the predetermined number of sub-models; the training of the fourth risk assessment model by using the third labeling event sample set, and the predicting of the fourth labeling event sample set by using the trained fourth risk assessment model, to obtain the predicted risk score of each fourth labeling event sample, includes:
correspondingly training the predetermined number of sub-models by using the predetermined number of sub-sample sets;
and processing each fourth labeling sample by using the preset number of sub-models respectively to obtain a preset number of predicted risk scores corresponding to the fourth labeling samples.
7. The method of claim 6, wherein for each fourth labeling sample, feature expansion is performed on the fourth labeling sample by using its predicted risk score to obtain a corresponding fifth labeling sample, including:
and carrying out preset calculation on each fourth labeling sample based on the preset number of predicted risk scores of the fourth labeling samples, and carrying out feature expansion on the fourth labeling samples by utilizing calculation results to obtain corresponding fifth labeling samples.
8. The method of claim 1, wherein the trained first risk assessment model is used to construct the risk assessment system.
9. The method of claim 1, wherein each risk assessment model in the risk assessment system is implemented based on a tree model.
10. A risk assessment method, comprising:
obtaining a target event sample to be detected;
inputting the target event sample into a risk assessment system constructed by the method of any one of claims 1-9 to obtain a plurality of risk scores predicted by a plurality of risk assessment models;
and determining a risk assessment result of the target event sample based on the plurality of risk scores.
11. The method of claim 10, wherein determining a risk assessment result for the target event sample based on the number of risk scores comprises:
calculating the average value of a plurality of risk scores, and if the average value is larger than a score threshold value, determining that the risk exists as the risk assessment result; or,
and determining the number of risk scores greater than a score threshold value in the plurality of risk scores, and if the number of risk scores is greater than the number threshold value, determining that the risk exists as the risk assessment result.
12. A construction apparatus of a risk assessment system, comprising:
a first training unit configured to train a first risk assessment model using a first set of annotation event samples; the first labelling event sample set includes a first number of black samples and a second number of white samples, the first number being greater than the second number;
a gray sample prediction unit configured to process a plurality of gray samples by using the trained first risk assessment model to obtain predicted risk scores of the gray samples; each gray sample is identified as a risk sample by the existing risk control technology;
a gray sample screening unit configured to select a part of gray samples from the plurality of gray samples based on the prediction risk score as an extension to black samples in a second labeling event sample set; the second labeling event sample set initially comprises a third number of black samples and a fourth number of white samples, the third number being less than the fourth number;
A second training unit configured to train a second risk assessment model using the expanded second set of annotation event samples; the trained second risk assessment model is used for constructing the risk assessment system;
wherein the event sample relates to an event including a transaction event, an access event, or a login event.
13. A risk assessment apparatus comprising:
a target sample acquisition unit configured to acquire a target event sample to be detected;
a risk prediction unit configured to input the target event sample into a risk assessment system constructed by the method according to any one of claims 1 to 9, so as to obtain a plurality of risk scores predicted by a plurality of risk assessment models;
and a result determining unit configured to determine a risk assessment result of the target event sample based on the plurality of risk scores.
14. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-11.
15. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-11.
CN202210486217.2A 2022-05-06 2022-05-06 Construction method and device of risk assessment system, and risk assessment method and device Active CN114978616B (en)

Publications (2)

Publication Number Publication Date
CN114978616A CN114978616A (en) 2022-08-30
CN114978616B true CN114978616B (en) 2024-01-09

Family

ID=82981196

Country Status (1)

Country Link
CN (1) CN114978616B (en)

CN115438747A (en) Abnormal account recognition model training method, device, equipment and medium
CN111783088B (en) Malicious code family clustering method and device and computer equipment
CN113052604A (en) Object detection method, device, equipment and storage medium
CN114201999A (en) Abnormal account identification method, system, computing device and storage medium
CN114707990B (en) User behavior pattern recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant