CN114978616A - Method and device for constructing risk assessment system and method and device for risk assessment - Google Patents


Publication number
CN114978616A
Authority
CN
China
Prior art keywords
sample
samples
risk
labeled
risk assessment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210486217.2A
Other languages
Chinese (zh)
Other versions
CN114978616B (en
Inventor
张长浩
傅欣艺
王维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210486217.2A priority Critical patent/CN114978616B/en
Publication of CN114978616A publication Critical patent/CN114978616A/en
Application granted granted Critical
Publication of CN114978616B publication Critical patent/CN114978616B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the present specification provides a method for constructing a risk assessment system, including: training a first risk assessment model using a first labeled event sample set, wherein the first labeled event sample set comprises a first number of black samples and a second number of white samples, and the first number is larger than the second number; processing a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score of each gray sample, wherein each gray sample has been identified as a risk sample by an existing risk-control technique; based on the predicted risk scores, selecting some of the gray samples as an extension of the black samples in a second labeled event sample set, wherein the second labeled event sample set initially comprises a third number of black samples and a fourth number of white samples, the third number being less than the fourth number; and training a second risk assessment model using the expanded second labeled event sample set, wherein the trained second risk assessment model is used for constructing the risk assessment system.

Description

Method and device for constructing risk assessment system and method and device for risk assessment
Technical Field
One or more embodiments of the present disclosure relate to the field of machine learning technologies, and in particular, to a method and an apparatus for constructing a risk assessment system, and a method and an apparatus for risk assessment.
Background
With the rapid development of computer networks, the network security problem is increasingly highlighted. There are many high-risk operational behaviors, such as account theft, traffic attacks, fraudulent transactions, etc., that may threaten network security or user information security. In consideration of network security and risk prevention and control, in many scenarios, it is necessary to analyze and process a user risk type, a user operation behavior or an operation event, and the like, and evaluate a risk degree related to a user so as to perform risk prevention and control.
However, currently available risk assessment approaches have limited effectiveness. Therefore, a scheme is needed to effectively improve the accuracy of risk assessment, so as to better meet the requirements of practical application.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method and an apparatus for constructing a risk assessment system. A training sample set with a high concentration of black samples is constructed and used to screen gray samples, which carry a large amount of black sample information but could not previously be used effectively, so as to improve the accuracy of identifying risk samples.
According to a first aspect, there is provided a method of constructing a risk assessment system, comprising: training a first risk assessment model using a first labeled event sample set, the first labeled event sample set comprising a first number of black samples and a second number of white samples, the first number being larger than the second number; processing a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score of each gray sample, each gray sample having been identified as a risk sample by an existing risk-control technique; based on the predicted risk scores, selecting some of the gray samples as an extension of the black samples in a second labeled event sample set, the second labeled event sample set initially comprising a third number of black samples and a fourth number of white samples, the third number being less than the fourth number; and training a second risk assessment model using the expanded second labeled event sample set, the trained second risk assessment model being used for constructing the risk assessment system.
In one embodiment, prior to training the first risk assessment model with the first labeled event sample set, the method further comprises: splitting the second labeled event sample set into two labeled subsets; training two risk assessment models with the two labeled subsets respectively, the two models being used for constructing the risk assessment system; cross-scoring the two labeled subsets using the two trained risk assessment models to obtain a predicted risk score of each labeled sample in the second labeled event sample set; and selecting the first number of black samples and the second number of white samples from the second labeled event sample set based on the predicted risk score of each labeled sample to form the first labeled event sample set.
In a specific embodiment, selecting the first number of black samples and the second number of white samples from the second labeled event sample set based on the predicted risk score of each labeled sample includes: sorting the predicted risk scores of the labeled samples in descending order; and, according to the sorted result, selecting the first number of black samples ranked at the front from the plurality of black samples and the second number of white samples ranked at the rear from the plurality of white samples.
In a specific embodiment, after obtaining the predicted risk score of each labeled sample in the second labeled event sample set, the method further comprises: for each labeled sample, if the predicted risk score of the labeled sample is less than a first threshold value under the condition that the labeled sample is a black sample, removing the labeled sample from the second labeled event sample set, or if the predicted risk score of the labeled sample is greater than a second threshold value under the condition that the labeled sample is a white sample, removing the labeled sample from the second labeled event sample set; training a third risk assessment model based on the second labeled event sample set subjected to rejection processing; and the trained third risk assessment model is used for constructing the risk assessment system.
In a specific embodiment, before splitting the second set of annotated event samples comprising a plurality of black samples and a plurality of white samples into two annotated subsets, the method further comprises: acquiring a third labeled event sample set corresponding to the first historical time period, and acquiring a fourth labeled event sample set corresponding to the second historical time period; the first history period is earlier than the second history period; training a fourth risk assessment model by using the third labeled event sample set, and predicting the fourth labeled event sample set by using the trained fourth risk assessment model to obtain a predicted risk score of each fourth labeled sample; and performing feature expansion on each fourth labeled sample by utilizing the predicted risk score of the fourth labeled sample to obtain a corresponding fifth labeled sample for forming the second labeled event sample set.
Further, in a more specific embodiment, prior to training the fourth risk assessment model with the third labeled event sample set, the method further comprises: splitting the feature dimensions of each third labeled sample in the third labeled event sample set in a preset manner to obtain a predetermined number of sub-samples, and assigning the sub-samples to the predetermined number of sub-sample sets respectively; wherein the fourth risk assessment model comprises the predetermined number of sub-models. Training the fourth risk assessment model with the third labeled event sample set, and predicting the fourth labeled event sample set with the trained fourth risk assessment model to obtain a predicted risk score of each fourth labeled sample, includes: training the predetermined number of sub-models with the predetermined number of sub-sample sets respectively; and processing each fourth labeled sample with each of the predetermined number of sub-models to obtain the predetermined number of predicted risk scores corresponding to the fourth labeled sample.
Further, in an example, performing feature expansion on each fourth labeled sample using its predicted risk score to obtain a corresponding fifth labeled sample includes: performing a predetermined calculation on the predetermined number of predicted risk scores of the fourth labeled sample, and performing feature expansion on the fourth labeled sample using the calculation result to obtain the corresponding fifth labeled sample.
In one embodiment, the trained first risk assessment model is used to construct the risk assessment system.
In one embodiment, each risk assessment model in the risk assessment system is implemented based on a tree model.
According to a second aspect, there is provided a method of risk assessment comprising: acquiring a target event sample to be detected; inputting the target event sample into a risk assessment system constructed by adopting the method in any one of the first aspect to obtain a plurality of risk scores predicted by a plurality of risk assessment models; determining a risk assessment result of the target event sample based on the plurality of risk scores.
In one embodiment, determining a risk assessment result for the target event sample based on the plurality of risk scores comprises: calculating the average of the plurality of risk scores and, if the average is greater than a score threshold, determining that the risk assessment result is "at risk"; or determining how many of the plurality of risk scores are greater than a score threshold and, if that count is greater than a count threshold, determining that the risk assessment result is "at risk".
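The two decision rules above can be illustrated with a minimal Python sketch. The function names, the default thresholds, and the example scores are hypothetical and chosen only for illustration; the source does not specify concrete values.

```python
from statistics import mean


def assess_by_mean(scores, score_threshold=0.5):
    """Return True ("at risk") if the average risk score exceeds the score threshold."""
    return mean(scores) > score_threshold


def assess_by_count(scores, score_threshold=0.5, count_threshold=1):
    """Return True ("at risk") if the number of scores above the score
    threshold is greater than the count threshold."""
    high = sum(1 for s in scores if s > score_threshold)
    return high > count_threshold


# Hypothetical scores predicted by three risk assessment models for one event.
scores = [0.9, 0.7, 0.2]
print(assess_by_mean(scores))
print(assess_by_count(scores))
```

The averaging rule lets one very confident model dominate, while the counting rule requires agreement among several models; which one fits better likely depends on how correlated the models are.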
According to a third aspect, there is provided an apparatus for constructing a risk assessment system, including: a first training unit configured to train a first risk assessment model using a first labeled event sample set, the first labeled event sample set comprising a first number of black samples and a second number of white samples, the first number being larger than the second number; a gray sample prediction unit configured to process a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score of each gray sample, each gray sample having been identified as a risk sample by an existing risk-control technique; a gray sample screening unit configured to select, based on the predicted risk scores, some of the gray samples as an extension of the black samples in a second labeled event sample set, the second labeled event sample set initially comprising a third number of black samples and a fourth number of white samples, the third number being less than the fourth number; and a second training unit configured to train a second risk assessment model using the expanded second labeled event sample set, the trained second risk assessment model being used for constructing the risk assessment system.
According to a fourth aspect, there is provided a risk assessment apparatus comprising: a target sample acquisition unit configured to acquire a target event sample to be detected; a risk prediction unit configured to input the target event sample into a risk assessment system constructed by the method of any one of the first aspect, and obtain a plurality of risk scores predicted by a plurality of risk assessment models; and a result determination unit configured to determine a risk assessment result of the target event sample based on the plurality of risk scores.
According to a fifth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first or second aspect.
According to a sixth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor which, when executing the executable code, implements the method of the first or second aspect.
With the method and the apparatus provided by the embodiments of this specification, a training sample set with a high concentration of black samples is constructed and used to screen gray samples, which carry a large amount of black sample information but could not previously be used effectively, so that the identification accuracy for risk samples is effectively improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 shows a schematic diagram of a build implementation of a risk assessment system according to one embodiment;
FIG. 2 shows a flowchart of a method of constructing a risk assessment system according to one embodiment;
FIG. 3 illustrates a flow diagram of a risk assessment method according to one embodiment;
FIG. 4 shows a schematic block diagram of a construction apparatus of a risk assessment system according to one embodiment;
FIG. 5 shows a schematic structural diagram of a risk assessment device according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
As mentioned above, existing risk assessment methods have limited effectiveness, and one of the main reasons is label imbalance in the collected historical risk-control data. In particular, since the vast majority of events in a risk-control scenario are benign, the concentration of black sample labels is extremely low, which makes machine learning modeling very difficult. In addition, the inventors find that risk-control scenarios contain a large number of high-risk samples (also called gray samples) identified by existing risk-control techniques. Such samples are usually blocked in time and therefore have no user report, yet they in fact belong to the black samples.
Thus, the inventors propose to improve the risk identification effect by using gray samples, which actually carry a large amount of black sample information. FIG. 1 is a schematic diagram illustrating an implementation of constructing a risk assessment system according to an embodiment. As shown in FIG. 1, a first risk assessment model is trained using an event sample set with a high black sample concentration, that is, one containing a large number of black samples and a small number of white samples. The trained first risk assessment model is then used to screen the gray sample set and pick out the black samples in it. Next, the screened black samples are used to expand an original labeled event sample set, which originally contains a small number of black samples and a large number of white samples, and a second risk assessment model is trained on the expanded, label-balanced event sample set for constructing the risk assessment system. A risk assessment system constructed in this way can identify risk samples accurately.
The steps of the scheme for realizing the above inventive concept are described below with reference to specific embodiments.
Fig. 2 is a schematic flow chart of a method for constructing a risk assessment system according to an embodiment, and an execution subject of the method may be any device, platform, or equipment cluster with computing and processing capabilities. As shown in fig. 2, the method comprises the steps of:
Step S210, training a first risk assessment model using a first labeled event sample set, where the first labeled event sample set comprises a first number of black samples and a second number of white samples, and the first number is larger than the second number;
Step S220, processing a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score of each gray sample, where each gray sample has been identified as a risk sample by an existing risk-control technique;
Step S230, based on the predicted risk scores, selecting some of the gray samples as an extension of the black samples in a second labeled event sample set, where the second labeled event sample set initially comprises a third number of black samples and a fourth number of white samples, the third number being less than the fourth number;
Step S240, training a second risk assessment model using the expanded second labeled event sample set, where the trained second risk assessment model is used for constructing the risk assessment system.
The steps are described in detail as follows:
Step S210, training a first risk assessment model using the first labeled event sample set. It should be noted that the first risk assessment model is used for screening black samples out of the gray samples. Considering that gray samples are unlabeled but carry a high risk, a labeled data set with a high black sample concentration, which fits the data distribution of the gray samples, is used to train this risk assessment model so that black samples can be accurately screened from the gray samples.
The first labeled event sample set may be determined based on the acquired original labeled event sample set (also referred to as the second labeled event sample set). Note that a labeled event sample is an event sample carrying a label: a sample whose label indicates risk is called a black sample, and a sample whose label indicates no risk is called a white sample. The events to which the event samples relate may include transaction events, access events, login events, and the like. The features in an event sample may include the time at which the event occurred, the network address, the geographical location, device information of the terminal device involved (e.g., device ID, model, etc.), and user information of the user involved (e.g., gender, age, hobbies, frequent locations, network behavior preferences, etc.).
On the other hand, the number of black samples in the first labeled event sample set is greater than the number of white samples, and the number of white samples in the second labeled event sample set is greater than the number of black samples. Thus, the first set of annotated event samples may be constructed in a manner that removes a large number of white samples from the second set of annotated event samples. In one embodiment, a part of the white samples may be selected from the second labeled event sample set for culling, and then the remaining black and white samples may be combined into the first labeled event sample set.
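As a rough illustration of this culling approach, the following Python sketch keeps all black samples and only a small random subset of the white samples. The function name, the `(features, label)` tuple layout, the `white_keep` parameter, and the use of random selection are assumptions made for this example; the source only states that part of the white samples is culled.

```python
import random


def build_black_heavy_set(samples, white_keep, seed=0):
    """Build a set with more black than white samples by culling whites.

    samples: list of (features, label) pairs, label 1 = black, 0 = white.
    white_keep: number of white samples to retain (hypothetical parameter).
    """
    blacks = [s for s in samples if s[1] == 1]
    whites = [s for s in samples if s[1] == 0]
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    kept_whites = rng.sample(whites, min(white_keep, len(whites)))
    return blacks + kept_whites


# Toy data: 3 black samples and 10 white samples.
toy = [([1.0], 1)] * 3 + [([0.0], 0)] * 10
first_set = build_black_heavy_set(toy, white_keep=2)
print(len(first_set))
```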
In another embodiment, to make effective use of the labeled data and to make the constructed first labeled event sample set closely approximate the data distribution of the gray samples, it is proposed to introduce a two-tower structure with cross-validation. Specifically, the second labeled event sample set is divided into two labeled subsets, two risk assessment models are trained with the two labeled subsets respectively, and the two labeled subsets are cross-scored using the two trained risk assessment models to obtain the predicted risk score of each labeled sample in the second labeled event sample set; then, based on the predicted risk scores, the first number of black samples and the second number of white samples are selected from the second labeled event sample set to form the first labeled event sample set.
Further, in a specific embodiment, the black samples and the white samples in the second labeled event sample set may each be divided into two equal parts; one part of the black samples and one part of the white samples are then combined into one labeled subset, and the other part of the black samples and the other part of the white samples are combined into another labeled subset. For convenience of description, the two labeled subsets obtained are referred to as the first labeled subset and the second labeled subset.
In a particular embodiment, a risk assessment model m_A is trained using the first labeled subset, and a risk assessment model m_B is trained using the second labeled subset. Next, the trained risk assessment model m_A scores the second labeled subset and the trained risk assessment model m_B scores the first labeled subset, that is, each model scores the subset it was not trained on, so as to obtain the predicted risk score (or predicted risk probability) of each labeled sample in the second labeled event sample set. Training two risk assessment models on a split of the original labeled data set in this way saves training time, improves training efficiency, and effectively prevents overfitting; furthermore, the cross-validation scoring allows the risk degree of each labeled sample to be assessed accurately.
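The split-and-cross-score procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: cross scoring is taken to mean that each model scores the subset it was not trained on, `train_toy_model` is a hypothetical centroid-based stand-in for a real risk assessment model (the source suggests tree models such as GBDT), and every subset is assumed to contain at least one black and one white sample.

```python
import math


def train_toy_model(samples):
    """Hypothetical stand-in trainer: scores a sample by how much closer it
    lies to the black centroid than to the white centroid (result in [0, 1])."""
    def centroid(label):
        rows = [x for x, y in samples if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    black_c, white_c = centroid(1), centroid(0)

    def score(x):
        d_black, d_white = math.dist(x, black_c), math.dist(x, white_c)
        total = d_black + d_white
        return 0.5 if total == 0 else d_white / total

    return score


def split_in_two(samples):
    """Split black and white samples separately into two halves each."""
    blacks = [s for s in samples if s[1] == 1]
    whites = [s for s in samples if s[1] == 0]
    return blacks[::2] + whites[::2], blacks[1::2] + whites[1::2]


def cross_score(subset_a, subset_b, train=train_toy_model):
    """Train m_A on subset A and m_B on subset B, then let each model
    score the other subset. Returns (features, label, score) triples."""
    model_a, model_b = train(subset_a), train(subset_b)
    scored_a = [(x, y, model_b(x)) for x, y in subset_a]
    scored_b = [(x, y, model_a(x)) for x, y in subset_b]
    return scored_a + scored_b


data = [([0.0], 0), ([0.1], 0), ([0.2], 0), ([0.3], 0), ([0.9], 1), ([1.0], 1)]
subset_a, subset_b = split_in_two(data)
for features, label, risk in cross_score(subset_a, subset_b):
    print(features, label, round(risk, 3))
```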
In a specific embodiment, the predicted risk scores of the labeled samples are first sorted in descending order, that is, from largest to smallest. It is to be understood that this description assumes that a higher risk score corresponds to a higher risk level. Then, according to the sorted result, a first number of black samples ranked at the front and a second number of white samples ranked at the rear are selected. In another specific embodiment, from the plurality of black samples (the third number of black samples) in the second labeled event sample set, those whose predicted risk score is greater than a first threshold (e.g., 0.8) are selected as the first number of black samples, and from the plurality of white samples (the fourth number of white samples), those whose predicted risk score is less than a second threshold (e.g., 0.1) are selected as the second number of white samples.
In this way, the first labeled event sample set may be constructed from the selected first number of black samples and second number of white samples. It should be noted that the trained risk assessment models m_A and m_B described above may be used to construct, or may be integrated into, the risk assessment system.
Further, the first risk assessment model may be trained using the first labeled event sample set. In one embodiment, the first risk assessment model may be implemented as a tree model, a Deep Learning (DL) model, or a Bayesian network model. In a specific embodiment, the tree model may be a Gradient Boosting Decision Tree (GBDT), an ID3 classification tree, or a C4.5 classification tree. On the other hand, in a specific embodiment, the features of each sample in the first labeled event sample set may be input into the first risk assessment model to obtain a corresponding risk assessment result, and a training gradient is then calculated based on the risk assessment result and the corresponding sample label (for example, a black sample label is 1 and a white sample label is 0). The model parameters of the first risk assessment model are then adjusted according to the training gradient using back propagation, and are iteratively updated over multiple rounds until they converge, which yields the trained first risk assessment model.
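Both selection strategies can be sketched in a few lines of Python. The function names and the `(features, label, score)` triple layout are hypothetical; the thresholds 0.8 and 0.1 are the example values given in the text.

```python
def select_by_rank(scored, n_black, n_white):
    """Sort by predicted risk score in descending order, then take the
    n_black highest-ranked black samples and the n_white lowest-ranked
    white samples. scored: list of (features, label, score) triples."""
    ranked = sorted(scored, key=lambda r: r[2], reverse=True)
    blacks = [r for r in ranked if r[1] == 1][:n_black]
    whites = [r for r in ranked if r[1] == 0]
    whites = whites[-n_white:] if n_white > 0 else []
    return blacks + whites


def select_by_threshold(scored, black_min=0.8, white_max=0.1):
    """Keep black samples scoring above black_min and white samples
    scoring below white_max."""
    return [
        r for r in scored
        if (r[1] == 1 and r[2] > black_min) or (r[1] == 0 and r[2] < white_max)
    ]


scored = [("a", 1, 0.95), ("b", 1, 0.5), ("c", 0, 0.05), ("d", 0, 0.3), ("e", 1, 0.85)]
print([r[0] for r in select_by_rank(scored, n_black=2, n_white=1)])
print([r[0] for r in select_by_threshold(scored)])
```

Either way, the selected set keeps only the black samples the cross-scoring models are most confident about and the white samples they consider safest, which is what pushes its distribution toward that of the gray samples.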
From the above, a trained first risk assessment model can be obtained. It should be noted that the trained first risk assessment model may be used to construct a risk assessment system.
Then, in step S220, the trained first risk assessment model is used to process a plurality of gray samples to obtain the predicted risk score of each gray sample. It should be noted that a gray sample is a sample that has been identified as a risk sample by an existing risk-control technique. It is a black sample with high probability, but it has not been labeled manually, so it cannot be determined whether it is actually a black sample. In one embodiment, the existing risk-control technique may include a risk-control rule, a risk-control strategy, a risk-control model, or the like. For example, in an online transaction scenario, when an existing risk-control technique recognizes that the current transaction is risky, it usually intervenes in the transaction, for example by blocking or closing it. Therefore, even if the transaction is a risky transaction, no report from the user will be received afterwards, because the transaction was never completed. Such a transaction sample may be taken as a gray sample.
In this step, the trained first risk assessment model is used to predict each of the collected gray samples, so as to obtain the predicted risk score of each gray sample. Next, in step S230, based on the predicted risk score of each gray sample, some of the gray samples are selected as an extension of the black samples in the second labeled event sample set.
In one embodiment, the gray samples with the predicted risk score greater than the preset threshold may be selected and labeled as risk-bearing samples to be supplemented as black samples in the second labeled event sample set. It is to be understood that the preset threshold value can be set manually. For example, a data distribution map of the predicted risk scores may be drawn based on a plurality of predicted risk scores corresponding to a plurality of gray samples, and the threshold value may be set based on the data distribution map.
In another embodiment, the plurality of predicted risk scores corresponding to the plurality of gray samples may be ranked from high to low, and the gray samples ranked within a predetermined range (e.g., the top 100,000 or the top 30%) may then be selected as black samples to be added to the second labeled event sample set.
Therefore, the real black samples can be selected from the gray sample set and supplemented to the second labeled event sample set, so that the labels in the expanded second labeled event sample set are balanced, namely, the difference of the number of the black and white samples is effectively reduced.
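A minimal Python sketch of the two gray sample selection rules and of the subsequent expansion follows. The function names, the `(sample, score)` pair layout, and the example threshold and fraction are assumptions for illustration only.

```python
def select_gray_by_threshold(gray_scores, threshold):
    """Keep gray samples whose predicted risk score exceeds the threshold.
    gray_scores: list of (sample, score) pairs."""
    return [g for g, s in gray_scores if s > threshold]


def select_gray_top_fraction(gray_scores, fraction=0.3):
    """Rank gray samples from high to low score and keep the top fraction."""
    ranked = sorted(gray_scores, key=lambda gs: gs[1], reverse=True)
    k = int(len(ranked) * fraction)
    return [g for g, _ in ranked[:k]]


def extend_with_gray(labeled, selected_gray):
    """Add the selected gray samples to the labeled set as black samples (label 1)."""
    return labeled + [(g, 1) for g in selected_gray]


gray_scores = [("g1", 0.9), ("g2", 0.4), ("g3", 0.7), ("g4", 0.1)]
second_set = [("b1", 1), ("w1", 0), ("w2", 0), ("w3", 0)]
chosen = select_gray_by_threshold(gray_scores, threshold=0.6)
expanded = extend_with_gray(second_set, chosen)
print(sum(1 for _, y in expanded if y == 1), sum(1 for _, y in expanded if y == 0))
```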
Then, in step S240, a second risk assessment model is trained by using the extended second annotated event sample set, and the trained second risk assessment model is used for constructing the risk assessment system. In one embodiment, the model algorithm on which the second risk assessment model is based may be the same as or different from the first risk assessment model.
It should be noted that, compared with the set before expansion, the expanded second labeled event sample set has been supplemented with black samples screened from a large number of gray samples, so its sample distribution changes from label-imbalanced to label-balanced, and the second risk assessment model trained on this label-balanced labeled sample set has excellent prediction performance.
Thus, the trained second risk assessment model can be integrated into the risk assessment system.
According to another embodiment, the inventors further propose removing abnormal samples from the second labeled event sample set and training a third risk assessment model on the retained samples, which effectively improves the model's performance and strengthens its generalization capability. Specifically, for each labeled sample in the second labeled event sample set before expansion, if the labeled sample is a black sample and the predicted risk score obtained during the cross validation is smaller than a first threshold (e.g., 0.6), the labeled sample is removed from the second labeled event sample set; or, if the labeled sample is a white sample and the predicted risk score obtained during the cross validation is larger than a second threshold (e.g., 0.4), the labeled sample is removed from the second labeled event sample set.
Further, a third risk assessment model may be trained based on the labeled samples retained after the abnormal black and white samples are removed. In one embodiment, the third risk assessment model may be trained directly based on these annotated samples. In another embodiment, the third risk assessment model may be trained using the labeled samples retained in the second labeled event sample set and the black samples selected from the gray sample set. Thus, the trained third risk assessment model may be integrated into the risk assessment system.
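The removal rule described above can be sketched as follows, using the example thresholds 0.6 and 0.4 from the text. The function name `remove_abnormal` and the `(label, score)` tuple layout are assumptions made for this illustration only.

```python
from typing import List, Tuple

# (label, predicted risk score); label 1 = black sample, label 0 = white sample
Scored = Tuple[int, float]


def remove_abnormal(
    samples: List[Scored],
    black_threshold: float = 0.6,
    white_threshold: float = 0.4,
) -> List[Scored]:
    """Keep only samples whose cross-validation score agrees with their label.

    A black sample scoring below `black_threshold` or a white sample
    scoring above `white_threshold` is treated as abnormal and dropped.
    """
    kept = []
    for label, score in samples:
        if label == 1 and score < black_threshold:
            continue  # low-scoring black sample: abnormal
        if label == 0 and score > white_threshold:
            continue  # high-scoring white sample: abnormal
        kept.append((label, score))
    return kept


if __name__ == "__main__":
    scored = [(1, 0.9), (1, 0.3), (0, 0.1), (0, 0.8), (1, 0.6), (0, 0.4)]
    print(remove_abnormal(scored))
```

Samples whose score equals a threshold are retained here, since the text only removes scores strictly smaller or strictly larger than the respective threshold.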
According to an embodiment of another aspect, risk control is a highly adversarial scenario: a model trained in the previous month may become ineffective in the following month and must be retrained on newly collected samples. The inventors therefore propose also making use of previously collected samples to improve the accuracy of the currently trained model.
Specifically, the second labeled event sample set may be constructed through the following steps. First, a third labeled event sample set corresponding to a first historical period and a fourth labeled event sample set corresponding to a second historical period are obtained, where the first historical period (e.g., February 2022) is earlier than the second historical period (e.g., March 2022). Second, a fourth risk assessment model is trained using the third labeled event sample set, and the trained fourth risk assessment model is used to predict the fourth labeled event sample set, yielding a predicted risk score for each fourth labeled sample. Then, for each fourth labeled sample, feature expansion is performed using its predicted risk score to obtain a corresponding fifth labeled sample, and the fifth labeled samples form the second labeled event sample set.
In a specific embodiment, to further improve the usability of the expanded features, after the third labeled event sample set is obtained, the feature dimensions of each sample may be split in a preset manner to obtain a plurality of (denoted m) sub-samples, which are correspondingly assigned to m sub-sample sets. For example, for labeled transaction event samples, assuming that dimensions 1 to 50 are features of the transaction user, dimensions 51 to 100 are features of the transaction merchant, and dimensions 101 to 150 are features of the transaction order, each labeled transaction event sample can be split into 3 sub-samples that share the label of the original sample, and the 3 sub-samples are correspondingly assigned to 3 sub-sample sets. Further, the m sub-models forming the fourth risk assessment model may be trained correspondingly using the m sub-sample sets; that is, the i-th sub-model is trained using the i-th sub-sample set. Then, each fourth labeled sample is processed by each of the trained m sub-models to obtain m predicted risk scores corresponding to that fourth labeled sample.
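The feature-dimension split can be sketched as follows. The half-open index ranges in `boundaries` and the function names are assumptions made for this illustration; the specification only requires that the split follow a preset manner and that the sub-samples share the original label.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label)
Range = Tuple[int, int]               # half-open index range [start, end)


def split_features(features: Sequence[float], boundaries: List[Range]) -> List[Sequence[float]]:
    """Cut one feature vector into len(boundaries) consecutive parts."""
    return [features[start:end] for start, end in boundaries]


def build_subsample_sets(samples: List[Sample], boundaries: List[Range]) -> List[List[Sample]]:
    """Return m sub-sample sets; the i-th set holds the i-th part of every sample.

    Every sub-sample keeps the label of the sample it was cut from.
    """
    subsets: List[List[Sample]] = [[] for _ in boundaries]
    for features, label in samples:
        for i, part in enumerate(split_features(features, boundaries)):
            subsets[i].append((part, label))
    return subsets


if __name__ == "__main__":
    # User, merchant, and order features, as in the example in the text.
    ranges = [(0, 50), (50, 100), (100, 150)]
    data = [(list(range(150)), 1), ([0.0] * 150, 0)]
    user_set, merchant_set, order_set = build_subsample_sets(data, ranges)
    print(len(user_set), len(merchant_set[0][0]), order_set[0][1])
```

The i-th sub-sample set would then be used to train the i-th sub-model.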
In a specific embodiment, for each fourth labeled sample, its predicted risk score can be used directly as a new feature for feature expansion. In another specific embodiment, if each fourth labeled sample has m predicted risk scores, a predetermined calculation (such as taking the mean or the median) may be performed on the m predicted risk scores, and the result is used as the new feature, thereby implementing feature expansion.
In this way, historical training data can be used to expand the features of the current training data, which improves the usability of the second labeled event sample set and, in turn, the recognition performance of the risk assessment system.
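As a minimal sketch of the feature expansion, the code below scores one fourth labeled sample with m sub-models and appends the mean or median of the m scores as one additional feature. The sub-models are represented by arbitrary callables, and the names `predict_scores` and `expand_features` are assumptions made for this illustration.

```python
from statistics import mean, median
from typing import Callable, List, Sequence, Tuple

SubModel = Callable[[Sequence[float]], float]  # maps a feature slice to a risk score
Range = Tuple[int, int]                        # half-open index range [start, end)


def predict_scores(
    features: Sequence[float], boundaries: List[Range], submodels: List[SubModel]
) -> List[float]:
    """Score each feature slice with the sub-model trained on that slice."""
    return [
        model(features[start:end])
        for (start, end), model in zip(boundaries, submodels)
    ]


def expand_features(
    features: Sequence[float], scores: Sequence[float], how: str = "mean"
) -> List[float]:
    """Append one aggregated score to the feature vector (the fifth labeled sample)."""
    if how == "mean":
        extra = mean(scores)
    elif how == "median":
        extra = median(scores)
    else:
        raise ValueError(f"unsupported aggregation: {how}")
    return list(features) + [extra]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    # Placeholder sub-models standing in for models trained on the earlier period.
    models = [lambda part: 0.25, lambda part: 0.75]
    s = predict_scores(x, [(0, 2), (2, 4)], models)
    print(expand_features(x, s))
```

When the fourth risk assessment model is a single model, `scores` holds one value and the same function appends it unchanged.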
In the above manner, a risk assessment system including at least the trained second risk assessment model may be constructed; further, the trained risk assessment model m_A, risk assessment model m_B, third risk assessment model, or fourth risk assessment model may also be integrated into the risk assessment system. Thus, a risk assessment system including one or more trained risk assessment models can be constructed.
According to an embodiment of a further aspect, the inventors propose making fuller use of the idea of ensemble learning by training more risk assessment models, thereby further improving the accuracy of the final integrated prediction result. This specifically includes the following steps:
1) Obtaining the original labeled event sample set (i.e., the second labeled event sample set), splitting it into a first labeled subset and a second labeled subset, and correspondingly training risk assessment model m_A and risk assessment model m_B.
2) Using the trained risk assessment model m_B to process the first labeled subset to obtain a predicted risk score for each first labeled sample; according to the predicted risk scores, removing a large number of white samples from the first labeled subset so that the original first black samples outnumber the remaining first white samples, thereby forming a first high-black-concentration sample set; and training risk assessment model m_C using the first high-black-concentration sample set. Similarly, using risk assessment model m_A to process the second labeled subset to obtain a predicted risk score for each second labeled sample; according to the predicted risk scores, removing a large number of white samples from the second labeled subset so that the original second black samples outnumber the remaining second white samples, thereby forming a second high-black-concentration sample set; and training risk assessment model m_D using the second high-black-concentration sample set.
3) Obtaining a gray sample set and splitting it into two gray sample subsets; using risk assessment model m_C to process the first gray sample subset, selecting a first part of gray samples from the first gray sample subset according to the predicted risk score of each first gray sample, using the first part of gray samples as black samples to expand the first labeled subset, and training risk assessment model m_E using the expanded first labeled subset. Similarly, using risk assessment model m_D to process the second gray sample subset, selecting a second part of gray samples from the second gray sample subset according to the predicted risk score of each second gray sample, using the second part of gray samples as black samples to expand the second labeled subset, and training risk assessment model m_F using the expanded second labeled subset.
4) According to the predicted risk score of each first labeled sample in the first labeled subset, removing first abnormal samples from the first labeled subset, where the first abnormal samples include first white samples with high scores and first black samples with low scores, and training risk assessment model m_G using the remaining first black and white samples together with the first part of gray samples selected as black samples. Similarly, according to the predicted risk score of each second labeled sample in the second labeled subset, removing second abnormal samples from the second labeled subset, where the second abnormal samples include second white samples with high scores and second black samples with low scores, and training risk assessment model m_H using the remaining second black and white samples together with the second part of gray samples selected as black samples.
From the above, 8 trained risk assessment models, namely m_A to m_H, can be obtained and then integrated into a risk assessment system with excellent prediction performance.
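The cross scoring used in steps 1) and 2), in which each labeled subset is scored by the model trained on the other subset, can be sketched as follows. The training routine is passed in as a callable, and `toy_trainer` is a deliberately simple hypothetical stand-in (a midpoint threshold on the first feature), not the tree model mentioned elsewhere in the specification.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]          # (feature vector, label); 1 = black
Scorer = Callable[[Sequence[float]], float]   # trained model: features -> risk score
Trainer = Callable[[List[Sample]], Scorer]    # training routine: samples -> model


def cross_score(
    subset_1: List[Sample], subset_2: List[Sample], train: Trainer
) -> Tuple[List[float], List[float]]:
    """Train m_A on subset 1 and m_B on subset 2, then score each subset
    with the model that has NOT seen it during training."""
    model_a = train(subset_1)
    model_b = train(subset_2)
    scores_1 = [model_b(x) for x, _ in subset_1]  # subset 1 scored by m_B
    scores_2 = [model_a(x) for x, _ in subset_2]  # subset 2 scored by m_A
    return scores_1, scores_2


def toy_trainer(samples: List[Sample]) -> Scorer:
    """Hypothetical trainer: threshold halfway between the class means
    of the first feature. Expects at least one sample of each label."""
    black_mean = mean(x[0] for x, y in samples if y == 1)
    white_mean = mean(x[0] for x, y in samples if y == 0)
    midpoint = (black_mean + white_mean) / 2
    return lambda x: 1.0 if x[0] > midpoint else 0.0


if __name__ == "__main__":
    first = [([0.9], 1), ([0.1], 0)]
    second = [([0.8], 1), ([0.2], 0)]
    print(cross_score(first, second, toy_trainer))
```

Because every sample is scored by a model that was not trained on it, the scores can be used for the later white-sample removal and abnormal-sample removal without the labels leaking into the scores.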
According to an embodiment of another aspect, the present specification further discloses a method for using the risk assessment system constructed above. Fig. 3 shows a flowchart of a risk assessment method according to an embodiment; the method can be executed by any platform, server, or device cluster with computing and processing capability. As shown in fig. 3, the method comprises the steps of:
step S310, obtaining a target event sample to be detected; step S320, inputting the target event sample into a constructed risk assessment system to obtain a plurality of risk scores predicted by a plurality of risk assessment models; step S330, determining the risk evaluation result of the target event sample based on the plurality of risk scores.
The above steps are described in detail as follows:
first, in step S310, a target event sample to be detected is acquired. Illustratively, in response to the triggering of the payment operation by the user, the payment information is acquired, and a payment event sample is formed.
Next, in step S320, the target event sample is input into the constructed risk assessment system, and a plurality of risk scores predicted by a plurality of risk assessment models are obtained. It should be understood that the constructed risk assessment system may include one or more of the risk assessment models described above, and accordingly, one or more risk scores corresponding to the target event sample may be predicted.
Then, in step S330, a risk assessment result of the target event sample is determined based on the risk scores.
In one embodiment, where the several risk scores consist of a single risk score, that score may be directly compared with a preset score threshold (e.g., 0.75); if it is greater, the target event is determined to be at risk, otherwise it is determined to be risk-free. In another embodiment, where the several risk scores consist of multiple risk scores, their average may be computed; the target event is determined to be at risk if the average is greater than the score threshold and risk-free otherwise. Alternatively, the number of risk scores greater than the score threshold may be counted; if that number (e.g., 6) is greater than a number threshold (e.g., 4), the target event is determined to be at risk, otherwise risk-free.
In this way, the risk assessment system can be put to use and an accurate risk identification result can be obtained.
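The decision rules of step S330 can be sketched as follows, using the example values 0.75 and 4 from the text as defaults. The function name `assess` and the `rule` parameter are assumptions made for this illustration.

```python
from statistics import mean
from typing import Sequence


def assess(
    scores: Sequence[float],
    score_threshold: float = 0.75,
    count_threshold: int = 4,
    rule: str = "mean",
) -> bool:
    """Return True if the target event is judged to be at risk.

    - one score: compare it with the score threshold;
    - rule="mean": compare the average score with the score threshold;
    - rule="count": count the scores above the score threshold and
      compare that count with the number threshold.
    """
    if not scores:
        raise ValueError("at least one risk score is required")
    if len(scores) == 1:
        return scores[0] > score_threshold
    if rule == "mean":
        return mean(scores) > score_threshold
    if rule == "count":
        n_high = sum(1 for s in scores if s > score_threshold)
        return n_high > count_threshold
    raise ValueError(f"unsupported rule: {rule}")


if __name__ == "__main__":
    eight_scores = [0.8] * 6 + [0.1] * 2
    print(assess([0.8]), assess(eight_scores, rule="count"))
```

With eight models, six scores above 0.75 exceed the number threshold of 4, so the event is judged to be at risk under the counting rule even though the mean rule would depend on how low the remaining two scores are.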
Corresponding to the construction and use methods of the risk assessment system, the embodiment of the specification also discloses a construction and use device. Fig. 4 is a schematic structural diagram of a device for constructing the risk assessment system according to an embodiment, and as shown in fig. 4, the device comprises the following units:
a first training unit 410 configured to train a first risk assessment model using the first labeled event sample set, where the first labeled event sample set comprises a first number of black samples and a second number of white samples, and the first number is larger than the second number; a gray sample prediction unit 420 configured to process a plurality of gray samples using the trained first risk assessment model to obtain a predicted risk score of each gray sample, each gray sample having been identified as a risk sample by existing risk control techniques; a gray sample screening unit 430 configured to select, based on the predicted risk scores, a part of the gray samples from the plurality of gray samples as an extension of the black samples in the second labeled event sample set, where the second labeled event sample set initially comprises a third number of black samples and a fourth number of white samples, and the third number is less than the fourth number; a second training unit 440 configured to train a second risk assessment model using the expanded second labeled event sample set, where the trained second risk assessment model is used for constructing the risk assessment system.
In one embodiment, the apparatus further comprises a first sample set determination unit 450 comprising the sub-units: a first splitting subunit 451 configured to split the second labeled event sample set into two labeled subsets; a training subunit 452 configured to correspondingly train two risk assessment models using the two labeled subsets, for constructing the risk assessment system; a scoring subunit 453 configured to cross-score the two labeled subsets using the two trained risk assessment models to obtain a predicted risk score of each labeled sample in the second labeled event sample set; a selecting subunit 454 configured to select, based on the predicted risk score of each labeled sample, the first number of black samples and the second number of white samples from the second labeled event sample set, so as to form the first labeled event sample set.
Further, in a specific embodiment, the selecting subunit 454 is specifically configured to: sort the predicted risk scores of the labeled samples in descending order; and, according to the sorting result, select the first number of black samples ranked at the front from the plurality of black samples and select the second number of white samples ranked at the rear from the plurality of white samples.
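The selection performed by the selecting subunit 454 can be sketched as follows, assuming that the reverse sorting means sorting by predicted risk score from high to low, so that the front holds the highest-scoring samples and the rear the lowest-scoring ones. The function name `select_high_black` and the `(label, score)` layout are assumptions made for this illustration.

```python
from typing import List, Tuple

# (label, predicted risk score); label 1 = black sample, label 0 = white sample
Scored = Tuple[int, float]


def select_high_black(samples: List[Scored], n_black: int, n_white: int) -> List[Scored]:
    """Build the first labeled event sample set.

    Takes the n_black highest-scoring black samples and the n_white
    lowest-scoring white samples; n_black is expected to exceed n_white.
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    blacks = [s for s in ranked if s[0] == 1][:n_black]   # front of the ranking
    whites = [s for s in ranked if s[0] == 0]
    whites = whites[max(len(whites) - n_white, 0):]       # rear of the ranking
    return blacks + whites


if __name__ == "__main__":
    scored = [(1, 0.9), (1, 0.2), (1, 0.7), (0, 0.8), (0, 0.1), (0, 0.3)]
    print(select_high_black(scored, n_black=2, n_white=1))
```

The resulting set contains more black than white samples, and the samples it keeps are those on which label and score agree most strongly.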
In a specific embodiment, the apparatus further comprises: an abnormal sample removing unit 460, configured to remove the marked sample from the second marked event sample set if the marked sample is a black sample and the predicted risk score is smaller than a first threshold, or remove the marked sample from the second marked event sample set if the marked sample is a white sample and the predicted risk score is larger than a second threshold; training a third risk assessment model based on the second labeled event sample set subjected to rejection processing; and the trained third risk assessment model is used for constructing the risk assessment system.
In a specific embodiment, the apparatus further comprises a second sample set determination unit 470 comprising the sub-units: a sample set obtaining subunit 471, configured to obtain a third labeled event sample set corresponding to the first historical period, and obtain a fourth labeled event sample set corresponding to the second historical period; the first history period is earlier than the second history period; the predicting subunit 472 is configured to train a fourth risk assessment model by using the third labeled event sample set, and predict the fourth labeled event sample set by using the trained fourth risk assessment model to obtain a predicted risk score of each fourth labeled sample; a feature expansion subunit 473, configured to perform feature expansion on each fourth labeled sample by using the predicted risk score to obtain a corresponding fifth labeled sample, which is used to form the second labeled event sample set.
Still further, in a more specific embodiment, the second sample set determining unit 470 further includes a second splitting subunit 474 configured to: for each third labeled sample in the third labeled event sample set, split its feature dimensions in a preset manner to obtain a predetermined number of sub-samples, and correspondingly assign the sub-samples to the predetermined number of sub-sample sets. The fourth risk assessment model comprises the predetermined number of sub-models, and the predicting subunit 472 is specifically configured to: correspondingly train the predetermined number of sub-models using the predetermined number of sub-sample sets; and process each fourth labeled sample with each of the predetermined number of sub-models to obtain the predetermined number of predicted risk scores corresponding to that fourth labeled sample.
In one example, the feature expansion subunit 473 is specifically configured to: perform a predetermined calculation on the predetermined number of predicted risk scores of each fourth labeled sample, and perform feature expansion on that fourth labeled sample using the calculation result to obtain the corresponding fifth labeled sample.
In one embodiment, the trained first risk assessment model is used to construct the risk assessment system.
In one embodiment, each risk assessment model in the risk assessment system is implemented based on a tree model.
FIG. 5 shows a schematic structural diagram of a risk assessment device according to one embodiment, wherein the device shown comprises:
a target sample acquiring unit 510 configured to acquire a target event sample to be detected; a risk prediction unit 520 configured to input the target event sample into the risk assessment system constructed according to the above embodiments, to obtain several risk scores predicted by several risk assessment models; a result determination unit 530 configured to determine a risk assessment result of the target event sample based on the several risk scores.
In one embodiment, the result determining unit 530 is specifically configured to: calculate the average of the several risk scores and, if the average is greater than a score threshold, determine that the risk assessment result is at risk; or determine the number of risk scores among the several risk scores that are greater than the score threshold and, if that number is greater than a number threshold, determine that the risk assessment result is at risk.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2 or fig. 3.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2 or fig. 3. Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objects, technical solutions, and advantages of the present invention in detail. It should be understood that they are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (15)

1. A method for constructing a risk assessment system comprises the following steps:
training a first risk assessment model by using a first labeled event sample set; the first marked event sample set comprises a first number of black samples and a second number of white samples, and the first number is greater than the second number;
processing a plurality of gray samples by using the trained first risk assessment model to obtain the predicted risk score of each gray sample, each gray sample having been identified as a risk sample by existing risk control techniques;
based on the predicted risk score, selecting partial gray samples from the plurality of gray samples as extensions of black samples in a second labeled event sample set; the second annotated event sample set initially comprises a third number of black samples and a fourth number of white samples, the third number being less than the fourth number;
training a second risk assessment model by using the expanded second annotation event sample set; and the trained second risk assessment model is used for constructing the risk assessment system.
2. The method of claim 1, wherein prior to training the first risk assessment model with the first set of annotated event samples, the method further comprises:
splitting the second annotated event sample set into two annotated subsets;
correspondingly training two risk assessment models by utilizing the two labeling subsets for constructing the risk assessment system;
performing cross scoring on the two labeling subsets by using the two trained risk assessment models to obtain a predicted risk score of each labeling sample in the second labeling event sample set;
and selecting the black samples with the first quantity and the white samples with the second quantity from the second labeled sample set based on the predicted risk score of each labeled sample to form the first labeled event sample set.
3. The method of claim 2, wherein selecting the first number of black samples and the second number of white samples from the second set of labeled samples based on the predicted risk score of the respective labeled sample comprises:
sorting the predicted risk scores of the labeled samples in descending order;
selecting, according to the sorting result, the first number of black samples ranked at the front from the plurality of black samples and the second number of white samples ranked at the rear from the plurality of white samples.
4. The method of claim 2, wherein after obtaining the predicted risk score for each annotated sample in the second set of annotated event samples, the method further comprises:
for each labeled sample, if the predicted risk score of the labeled sample is less than a first threshold value under the condition that the labeled sample is a black sample, removing the labeled sample from the second labeled event sample set, or if the predicted risk score of the labeled sample is greater than a second threshold value under the condition that the labeled sample is a white sample, removing the labeled sample from the second labeled event sample set;
training a third risk assessment model based on the second labeled event sample set subjected to rejection processing; and the trained third risk assessment model is used for constructing the risk assessment system.
5. The method of claim 2, wherein prior to splitting a second set of annotated event samples, comprising a plurality of black samples and a plurality of white samples, into two annotated subsets, the method further comprises:
acquiring a third labeled event sample set corresponding to the first historical time period, and acquiring a fourth labeled event sample set corresponding to the second historical time period; the first history period is earlier than the second history period;
training a fourth risk assessment model by using the third labeled event sample set, and predicting the fourth labeled event sample set by using the trained fourth risk assessment model to obtain a predicted risk score of each fourth labeled sample;
and performing feature expansion on each fourth labeled sample by utilizing the predicted risk score of the fourth labeled sample to obtain a corresponding fifth labeled sample for forming the second labeled event sample set.
6. The method of claim 5, wherein prior to training a fourth risk assessment model with the third annotated sample, the method further comprises:
for each third labeling sample in the third labeling event sample set, splitting the feature dimension of each third labeling sample according to a preset mode to obtain a preset number of subsamples, and correspondingly classifying the subsamples into the preset number of subsample sets;
wherein the fourth risk assessment model comprises the predetermined number of sub-models; training a fourth risk assessment model by using the third labeled sample, and predicting the fourth labeled event sample set by using the trained fourth risk assessment model to obtain a predicted risk score of each fourth labeled sample, including:
correspondingly training the sub models with the preset number by utilizing the sub sample sets with the preset number;
and respectively processing each fourth labeled sample by using the sub models with the preset number to obtain the predicted risk scores with the preset number corresponding to the fourth labeled sample.
7. The method of claim 6, wherein, for each fourth labeled sample, performing feature expansion on the fourth labeled sample by using the predicted risk score thereof to obtain a corresponding fifth labeled sample, comprising:
performing a predetermined calculation on the predetermined number of predicted risk scores of the fourth labeled sample, and performing feature expansion on the fourth labeled sample by using the calculation result to obtain the corresponding fifth labeled sample.
8. The method of claim 1, wherein the trained first risk assessment model is used to construct the risk assessment system.
9. The method of claim 1, wherein each risk assessment model in the risk assessment system is implemented based on a tree model.
10. A method of risk assessment, comprising:
acquiring a target event sample to be detected;
inputting the target event sample into a risk assessment system constructed by the method of any one of claims 1-9 to obtain a plurality of risk scores predicted by a plurality of risk assessment models;
determining a risk assessment result of the target event sample based on the plurality of risk scores.
11. The method of claim 10, wherein determining a risk assessment result for the target event sample based on the plurality of risk scores comprises:
calculating the average value of the plurality of risk scores, and if the average value is greater than a score threshold, determining that the risk assessment result is at risk; or
determining the number of risk scores, among the plurality of risk scores, that are greater than the score threshold, and if that number is greater than a number threshold, determining that the risk assessment result is at risk.
12. A risk assessment system construction apparatus comprising:
a first training unit configured to train a first risk assessment model using a first set of annotated event samples; the first marked event sample set comprises a first number of black samples and a second number of white samples, and the first number is larger than the second number;
a gray sample prediction unit configured to process a plurality of gray samples by using the trained first risk assessment model to obtain a predicted risk score of each gray sample, each gray sample having been identified as a risk sample by existing risk control techniques;
a gray sample screening unit configured to select a part of gray samples from the plurality of gray samples as extensions to black samples in a second labeled event sample set based on the predicted risk score; the second annotated event sample set initially comprises a third number of black samples and a fourth number of white samples, the third number being less than the fourth number;
the second training unit is configured to train a second risk assessment model by using the expanded second labeled event sample set; and the trained second risk assessment model is used for constructing the risk assessment system.
13. A risk assessment device comprising:
the target sample acquisition unit is configured to acquire a target event sample to be detected;
a risk prediction unit configured to input the target event sample into a risk assessment system constructed by the method of any one of claims 1 to 9, and obtain a plurality of risk scores predicted by a plurality of risk assessment models;
a result determination unit configured to determine a risk assessment result of the target event sample based on the plurality of risk scores.
14. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.
15. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that when executed by the processor implements the method of any of claims 1-11.
CN202210486217.2A 2022-05-06 2022-05-06 Construction method and device of risk assessment system, and risk assessment method and device Active CN114978616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210486217.2A CN114978616B (en) 2022-05-06 2022-05-06 Construction method and device of risk assessment system, and risk assessment method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210486217.2A CN114978616B (en) 2022-05-06 2022-05-06 Construction method and device of risk assessment system, and risk assessment method and device

Publications (2)

Publication Number Publication Date
CN114978616A true CN114978616A (en) 2022-08-30
CN114978616B CN114978616B (en) 2024-01-09

Family

ID=82981196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210486217.2A Active CN114978616B (en) 2022-05-06 2022-05-06 Construction method and device of risk assessment system, and risk assessment method and device

Country Status (1)

Country Link
CN (1) CN114978616B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150339477A1 (en) * 2014-05-21 2015-11-26 Microsoft Corporation Risk assessment modeling
CN105743877A (en) * 2015-11-02 2016-07-06 哈尔滨安天科技股份有限公司 Network security threat information processing method and system
CN110020746A (en) * 2019-02-20 2019-07-16 阿里巴巴集团控股有限公司 A kind of risk prevention system method, apparatus, processing equipment and system
CN110147823A (en) * 2019-04-16 2019-08-20 阿里巴巴集团控股有限公司 A kind of air control model training method, device and equipment
US20190303569A1 (en) * 2017-06-16 2019-10-03 Alibaba Group Holding Limited Data type recognition, model training and risk recognition methods, apparatuses and devices
US20200210899A1 (en) * 2017-11-22 2020-07-02 Alibaba Group Holding Limited Machine learning model training method and device, and electronic device
CN113420789A (en) * 2021-05-31 2021-09-21 北京经纬信息技术有限公司 Method, device, storage medium and computer equipment for predicting risk account
CN113537630A (en) * 2021-08-04 2021-10-22 支付宝(杭州)信息技术有限公司 Training method and device of business prediction model
CN114154556A (en) * 2021-11-03 2022-03-08 同盾科技有限公司 Training method and device of sample prediction model, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
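He Mingke; Qian Wenbin: "The whole process of logistics finance risk management", Systems Engineering, no. 05 *
Yang Yun; Sun Hong; Kang Zheng; Wu Qunhong: "Application of the decision tree model ID3 algorithm in risk assessment of public health emergencies", Chinese Preventive Medicine, no. 01 *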

Also Published As

Publication number Publication date
CN114978616B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN109922032B (en) Method, device, equipment and storage medium for determining risk of logging in account
CN109388675B (en) Data analysis method, device, computer equipment and storage medium
CN111652290B (en) Method and device for detecting adversarial sample
CN109816200B (en) Task pushing method, device, computer equipment and storage medium
CN111950643B (en) Image classification model training method, image classification method and corresponding device
CN110471945B (en) Active data processing method, system, computer equipment and storage medium
CN111459922A (en) User identification method, device, equipment and storage medium
CN111340233B (en) Training method and device of machine learning model, and sample processing method and device
CN109271957B (en) Face gender identification method and device
CN112927061A (en) User operation detection method and program product
CN112801231B (en) Decision model training method and device for business object classification
CN110135428B (en) Image segmentation processing method and device
CN111159241A (en) Click conversion estimation method and device
CN108416662B (en) Data verification method and device
CN117522586A (en) Financial abnormal behavior detection method and device
CN115204322B (en) Behavior link abnormity identification method and device
CN110880117A (en) False service identification method, device, equipment and storage medium
CN114978616B (en) Construction method and device of risk assessment system, and risk assessment method and device
CN114003648B (en) Identification method and device for risk transaction gang, electronic equipment and storage medium
CN110414845B (en) Risk assessment method and device for target transaction
CN112906785B (en) Zero sample object type identification method, device and equipment based on fusion
CN114742644A (en) Method and device for training multi-scenario risk control system and predicting business object risk
CN113469816A (en) Digital currency identification method, system and storage medium based on multigroup technology
Liyanage et al. What matters the most? optimal quick classification of urban issue reports by importance
CN112446428A (en) Image data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant