CN113487208A - Risk assessment method and device - Google Patents
Risk assessment method and device
- Publication number
- CN113487208A (application number CN202110808789.3A)
- Authority
- CN
- China
- Prior art keywords
- risk
- sample
- objects
- regression model
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
Abstract
An embodiment of the present specification provides a risk assessment method, including: acquiring n first samples corresponding to n objects to be evaluated, wherein each first sample comprises a review tag indicating whether the corresponding object has been selected as a risk review object by a specific organization; training a review regression model using the n first samples, and determining n inverse Mills ratio (IMR) values corresponding to the n objects based on the trained review regression model; obtaining m second samples corresponding to m objects, each second sample including the corresponding IMR value and a risk category label determined based on the result of a risk review of the corresponding object by the specific organization; training a tree model using the m second samples; and determining, based on the n IMR values, n risk prediction probabilities corresponding to the n objects using the trained tree model.
Description
Technical Field
One or more embodiments of the present disclosure relate to the field of machine learning technologies, and in particular, to a method and an apparatus for risk assessment using machine learning.
Background
With the progress of science and technology and the development of society, more and more service platforms have emerged. They serve large user bases or operate in different countries or regions, providing a variety of services to meet users' diverse needs. Risk detection is required before a business is launched or while a service is being provided. For example, whether a region carries a high money laundering risk is evaluated to help decide whether to launch business there, or risky users are identified so that service can be interrupted in time, preventing malicious operations from harming the interests of legitimate users or of the service platform.
Therefore, a solution is needed to efficiently and accurately perform risk assessment so as to prevent the loss of users and enterprises in terms of property and the like.
Disclosure of Invention
One or more embodiments of the present disclosure describe a risk assessment method and apparatus, which can efficiently and accurately assess risk of an object to be assessed.
According to a first aspect, there is provided a risk assessment method comprising: acquiring n first samples corresponding to n objects to be evaluated, wherein each first sample comprises a review tag indicating whether the corresponding object has been selected as a risk review object by a specific organization; training a review regression model using the n first samples, and determining n inverse Mills ratio (IMR) values corresponding to the n objects based on the trained review regression model; obtaining m second samples corresponding to m objects, each second sample including the corresponding IMR value and a risk category label determined based on the result of a risk review of the corresponding object by the specific organization; training a tree model using the m second samples; and determining, based on the n IMR values, n risk prediction probabilities corresponding to the n objects using the trained tree model.
In one embodiment, after determining n risk prediction probabilities for the n objects, the method further comprises: taking the n risk prediction probabilities as n risk probability labels, and constructing n third samples, wherein each third sample also comprises an IMR value of a corresponding object; and training a risk regression model by using the n third samples, and determining a risk probability confidence interval of each object in the n objects under a preset confidence level based on the trained risk regression model.
In a specific embodiment, after determining the risk probability confidence interval of each object of the n objects under the preset confidence level, the method further includes: constructing a corresponding fourth sample based on the risk prediction probability corresponding to each object and the interval endpoint of the risk probability confidence interval; clustering n fourth samples corresponding to the n objects to obtain a plurality of clusters; and determining the risk level corresponding to each class cluster as the risk level of the object corresponding to the sample in the class cluster.
In another specific embodiment, the review regression model and the risk regression model both belong to a logistic regression model; or, the review regression model and the risk regression model both belong to the probit regression model.
In one embodiment, the respective object is one of: region, user, commodity, event.
In one embodiment, the respective objects are regions; each first sample further includes the degree of economic freedom and/or the foreign immigration situation of the corresponding region; and/or each second sample further includes the degree of economic freedom and/or the foreign immigration situation of the corresponding region.
In one embodiment, each second sample further includes an indication of whether the corresponding object has been selected by the specific organization as a risk review object.
According to a second aspect, there is provided a risk assessment apparatus comprising: a first sample acquisition unit configured to acquire n first samples corresponding to n objects to be evaluated, each first sample including a review tag indicating whether the corresponding object has been selected as a risk review object by a specific organization; a first training unit configured to train a review regression model using the n first samples; an IMR value determination unit configured to determine n inverse Mills ratio (IMR) values corresponding to the n objects based on the trained review regression model; a second sample acquisition unit configured to acquire m second samples corresponding to m objects, each second sample including the corresponding IMR value and a risk category label determined based on the result of a risk review of the corresponding object by the specific organization; a second training unit configured to train a tree model using the m second samples; and a probability prediction unit configured to determine, based on the n IMR values, n risk prediction probabilities corresponding to the n objects using the trained tree model.
In one embodiment, the apparatus further comprises: a third sample construction unit, configured to construct n third samples by using the n risk prediction probabilities as n risk probability labels, wherein each third sample further includes an IMR value of a corresponding object; a third training unit configured to train a risk regression model using the n third samples; and the confidence interval prediction unit is configured to determine a risk probability confidence interval of each object in the n objects under a preset confidence level based on the trained risk regression model.
In a specific embodiment, the apparatus further comprises: the fourth sample construction unit is configured to construct a corresponding fourth sample based on the risk prediction probability corresponding to each object and the interval endpoint of the risk probability confidence interval; the clustering unit is configured to cluster n fourth samples corresponding to the n objects to obtain a plurality of clusters; and the risk level determining unit is configured to determine the risk level corresponding to each class cluster as the risk level of the object corresponding to the sample in the class cluster.
In another specific embodiment, the review regression model and the risk regression model both belong to a logistic regression model; or, the review regression model and the risk regression model both belong to the probit regression model.
In one embodiment, the respective object is one of: region, user, commodity, event.
In one embodiment, the respective objects are regions; each first sample further includes the degree of economic freedom and/or the foreign immigration situation of the corresponding region; and/or each second sample further includes the degree of economic freedom and/or the foreign immigration situation of the corresponding region.
In one embodiment, each second sample further includes an indication of whether the corresponding object has been selected by the specific organization as a risk review object.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
With the method and apparatus provided by the embodiments of this specification, annotation data is first constructed based on the risk review list of a specific organization, and the inverse Mills ratio (IMR) of every object to be evaluated is determined in order to correct sample selection bias; annotation data is then constructed based on the review results and the IMR values, and a tree model is trained to predict the risk probability of the objects to be evaluated. Furthermore, a risk regression model can be fitted based on the risk probabilities and the IMR values to determine a confidence interval at a preset confidence level; clustering is then performed based on the risk probabilities and the interval endpoints, and the risk level calibrated for each cluster serves as the risk level of the corresponding objects. Efficient and accurate risk assessment can thus be achieved, preventing property and other losses to users and enterprises.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 illustrates an implementation architecture diagram of a risk assessment method according to one embodiment;
FIG. 2 shows a schematic flow diagram of a risk assessment method according to one embodiment;
FIG. 3 illustrates a decision tree included in a tree model according to one embodiment;
FIG. 4 illustrates an implementation architecture diagram of a risk assessment method according to another embodiment;
FIG. 5 shows a schematic structural diagram of a risk assessment apparatus according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
The embodiment of the specification discloses a risk assessment method which can efficiently and accurately carry out risk assessment, so that loss of users and enterprises in the aspects of property and the like is prevented.
FIG. 1 is a diagram illustrating an implementation architecture of a risk assessment method according to one embodiment. As shown in FIG. 1, n first samples corresponding to n objects to be risk assessed are first acquired, where the review tag in each first sample indicates whether the corresponding object has been selected as a risk review object by a specific organization. A review regression model is then trained using the n first samples in order to determine n inverse Mills ratio (IMR) values for the n objects; the IMR values can be used to correct sample selection bias. Finally, based on the n IMR values, n risk prediction probabilities corresponding to the n objects are determined using a tree model trained on the risk review results of the specific organization. Accurate risk assessment of the objects to be evaluated can thereby be achieved.
The following describes the steps of the above method with reference to specific examples.
FIG. 2 shows a schematic flow diagram of a risk assessment method according to one embodiment. The method may be executed by any device, device cluster, or platform with computing and processing capabilities. As shown in FIG. 2, the method comprises the following steps: step S210, acquiring n first samples corresponding to n objects to be evaluated, wherein each first sample comprises a review tag indicating whether the corresponding object has been selected as a risk review object by a specific organization; step S220, training a review regression model using the n first samples, and determining n inverse Mills ratio (IMR) values corresponding to the n objects based on the trained review regression model; step S230, obtaining m second samples corresponding to m objects, each second sample including the corresponding IMR value and a risk category label, the risk category label being determined based on the result of a risk review of the corresponding object by the specific organization; step S240, training a tree model using the m second samples; step S250, determining, based on the n IMR values, n risk prediction probabilities corresponding to the n objects using the trained tree model.
The above steps are described in detail below.
First, in step S210, n first samples corresponding to n objects to be evaluated are obtained, each first sample including a review tag indicating whether the corresponding object has been selected as a risk review object by a specific organization. It should be understood that each first sample also includes sample features; to distinguish them in the description, the sample features in the first samples are referred to as first sample features.
In one embodiment, the object to be evaluated (or referred to as a business object) may be a user, such as a natural person, a business, an enterprise, or the like. Accordingly, in a specific embodiment, assuming that the business object is a natural human user, the first sample feature thereof may include: user age, gender, occupation, address, consumption (e.g., type of consumption or frequency of consumption), etc. In another specific embodiment, assuming that the business object is a merchant or business, the first sample feature may include: registration duration, registration category, size, turnover, net profit, etc.
In another embodiment, the business object to be evaluated may include one or more of the following: country, region, jurisdiction. Thus, in a particular embodiment, the first sample features relate to the following aspects or dimensions: economy (e.g., degree of economic development, regional resident income, gross regional product), society (e.g., unemployment rate or poverty rate), politics (e.g., government expenditure on education and on residents' health), environment (e.g., percentage of urban electricity usage, number of urban electricity consumers), etc. Further, where the risk to be assessed is money laundering risk, in a more specific embodiment the degree of economic freedom and/or the foreign immigration situation may be designed as first sample features. The degree of economic freedom is measured along four dimensions, namely rule of law, government size, regulatory efficiency and market openness, and provides a comprehensive and objective indicator of a country's political and economic development for country risk assessment. The rationale for the foreign immigration feature is that, according to data published by an anti-terrorism center for a certain year, thousands of people were wanted, about two thirds of whom were immigrants from the countries concerned; in the course of repeated outward migration, such individuals may be exposed to extremist ideas, spread them to others, and give rise to violent terrorist incidents. In the machine learning models subsequently established, these two factors were indeed shown to have a significant impact on national money laundering risk.
In yet another embodiment, the object to be evaluated may be a commodity. Accordingly, in a particular embodiment, the first sample feature may include: commodity category, place of production, shelf life, cost price, selling price, raw materials, selling conditions, etc.
In yet another embodiment, the object to be evaluated may be an event, such as an access event, a login event, or the like. Accordingly, in a particular embodiment, the first sample feature may include: the time or the time period of the event, the operation account, the network environment (such as the network type and the IP address), and the like.
The objects to be evaluated and the sample features in the first samples have been described above. The review tag in a first sample indicates whether the corresponding object has been selected by a specific organization as a risk review object; for example, the review tag is 1 if the object was selected and 0 if it was not. Note that the specific organization may be one or more related organizations. In one example, where the risk to be assessed is money laundering risk, the specific organization may be the FATF (Financial Action Task Force on Money Laundering) and/or the OFAC (Office of Foreign Assets Control of the US Department of the Treasury). In another example, where the risk to be assessed is payment risk, the specific organization may be a large payment platform.
After the n first samples are obtained, in step S220, a review regression model is trained using the n first samples, and n inverse Mills ratio (IMR) values corresponding to the n objects are determined based on the trained review regression model.
Note that risk label data could be constructed directly from the review results (risky or not) of the objects reviewed by the specific organization and used to train a risk identification model. However, whether the objects not reviewed by the specific organization carry risk is unknown, so constructing samples in this way introduces sample selection bias. By performing steps S210 and S220, the IMR values of the objects to be evaluated are calculated and used in building the model that subsequently outputs the risk result, which effectively corrects the sample selection bias and improves the accuracy and usability of the risk identification result.
In one embodiment, the review regression model is a logistic regression model. In another embodiment, the review regression model is a probit regression model.
After the review regression model is trained using the n first samples, a trained review regression model is obtained. For each of the n objects, the IMR value can be determined from the model parameters of the trained review regression model, the first sample features and the review tag in the first sample corresponding to the object, and the probability density function and cumulative distribution function of the standard normal distribution; n IMR values corresponding to the n objects are thereby determined. The specific calculation of the IMR may follow existing methods and is not described in detail here.
After the IMR values for correcting the sample selection bias are determined, in order to eliminate the influence of missing feature values and data imbalance and to obtain a stable risk prediction result, in step S230, m second samples corresponding to m objects are obtained, each second sample including the corresponding IMR value and a risk category label determined based on the result of a risk review of the corresponding object by the specific organization. It is to be understood that the m objects have risk review results from the specific organization, so they must have been selected by the specific organization as risk review objects. In one embodiment, the m objects are included in the n objects.
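The source leaves the IMR calculation to existing methods. As a minimal sketch, assuming a probit review regression model and the common Heckman-style convention (φ/Φ for selected objects, −φ/(1−Φ) for unselected ones), the computation could look as follows. The function names, the weight vector, and the bias are hypothetical and are not taken from the patent.

```python
import math


def std_normal_pdf(z: float) -> float:
    """Probability density function of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills_ratio(features, weights, bias, review_tag):
    """IMR of one object under an assumed probit review regression model.

    features   : first sample features of the object
    weights    : fitted coefficients of the review model (hypothetical)
    bias       : fitted intercept of the review model (hypothetical)
    review_tag : 1 if the object was selected for risk review, else 0
    """
    # Linear index of the probit model.
    z = bias + sum(w * x for w, x in zip(weights, features))
    pdf, cdf = std_normal_pdf(z), std_normal_cdf(z)
    # Assumed convention: selected objects use pdf/cdf,
    # unselected objects use -pdf/(1 - cdf).
    if review_tag == 1:
        return pdf / cdf
    return -pdf / (1.0 - cdf)


imr_selected = inverse_mills_ratio([0.0], [1.0], 0.0, review_tag=1)
imr_unselected = inverse_mills_ratio([0.0], [1.0], 0.0, review_tag=0)
```

For very large or very small linear indices the denominators approach zero, so a practical version would need to clip them; that safeguard is omitted here for brevity.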
In one embodiment, after performing risk review of the m objects, the specific organization generates a black list and/or a gray list containing high-risk objects. The risk category labels of business objects on the black and gray lists may then be set to risky (e.g., a label value of 1), while the risk category labels of business objects not on the black and gray lists may be set to no risk (e.g., a label value of 0). In this way, risk is given a definition. In a specific embodiment, where the risk to be assessed is the money laundering risk of a country/jurisdiction, the risk category labels of countries/jurisdictions on the black and gray lists issued by the FATF and/or OFAC may be set to risky, and those of countries/jurisdictions not on the lists may be set to no risk. Setting the risk category label according to the black and gray lists issued by the FATF or OFAC after review defines money laundering risk reasonably and effectively, which improves the usability of the second samples.
In another embodiment, after performing risk review of the m objects, the specific organization generates a risk level (e.g., high risk, medium risk, low risk, etc.) for each object, and the determined risk levels can be used directly as the risk category labels of the corresponding objects.
It should be understood that each second sample may include sample features other than the IMR value; to distinguish them in the description, the sample features in the second samples are referred to as second sample features. Generally, the first sample features and the second sample features differ, because the models trained on the first samples and on the second samples have different emphases. In one embodiment, the second sample features may also include whether the corresponding object has been selected by the specific organization as a risk review object. In another embodiment, where the object to be evaluated is a country/jurisdiction, the second sample features may further include: tax burden, population, resident income index, trade freedom, etc.
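The list-based labeling described above can be sketched as follows. This is only an illustration: the function name and the placeholder object identifiers "A", "B" and "C" are hypothetical and do not correspond to any real list entry.

```python
def risk_category_labels(reviewed_objects, black_list, gray_list):
    """Label each reviewed object 1 (risky) if it is on the black or gray
    list, otherwise 0 (no risk)."""
    flagged = set(black_list) | set(gray_list)
    return {obj: 1 if obj in flagged else 0 for obj in reviewed_objects}


# Placeholder identifiers only; not real countries or jurisdictions.
labels = risk_category_labels(["A", "B", "C"], black_list=["A"], gray_list=["C"])
```

Only objects that were actually reviewed receive a label here, which matches the statement that the m objects are those selected for risk review.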
Based on the m second samples corresponding to the m objects acquired above, in step S240, the tree model is trained using the m second samples. In one embodiment, the tree model may be based on the GBDT (Gradient Boosting Decision Tree) algorithm, the XGBoost (eXtreme Gradient Boosting) algorithm, the CART (Classification And Regression Tree) algorithm, or the like.
To facilitate understanding, note that the trained tree model may include a plurality of decision trees. FIG. 3 illustrates a decision tree included in the tree model according to one embodiment. The decision tree includes a root node 31 and a plurality of leaf nodes (e.g., leaf node 35), with a plurality of parent nodes (e.g., parent nodes 32, 33, and 34) between the root node and the leaf nodes. Each parent node has a corresponding split feature and split threshold, where the split feature is one of the feature items of the second samples. In one example, the split feature x_1 of parent node 32 is the annual income of residents, and the split threshold v_1 is 10 thousand dollars.
Further, in the training process, the root node 31 corresponds to the m second samples, and each second sample is routed to a leaf node along a prediction path in the decision tree. The risk category labels of the one or more second samples routed to the same leaf node can then be tallied to obtain the risk probability value corresponding to that leaf node. For example, if 8 risky samples and 2 non-risky samples are routed to leaf node 35, the risk statistical probability corresponding to leaf node 35 is 0.8.
After the training of the tree model is completed, in step S250, based on the n IMR values of the n objects, n risk prediction probabilities corresponding to the n objects are determined by using the trained tree model. Specifically, n input samples corresponding to the n objects may be constructed based on the n IMR values, and the input samples have the same sample feature items as the second sample, except that there is no risk category label in the input samples. Based on this, n input samples are input into the trained tree model, and n risk prediction probabilities can be obtained. For example, assuming that an input sample is divided into leaf nodes 35 along the bold prediction path from the root node shown in fig. 3, it can be determined that the risk prediction probability of the input sample is 0.8.
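The routing of samples to leaf nodes and the leaf-level risk statistics can be illustrated with a small sketch. It assumes a single hand-built tree with one split on the IMR value; the split structure, the threshold 0.5, and the feature name "imr" are hypothetical, and no split learning (as in GBDT, XGBoost, or CART) is performed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    # Internal node: split feature and threshold. Leaf: list of training labels.
    feature: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch taken when value <= threshold
    right: Optional["Node"] = None   # branch taken when value > threshold
    labels: Optional[List[int]] = None


def find_leaf(root: Node, sample: Dict[str, float]) -> Node:
    """Follow the prediction path from the root to a leaf node."""
    node = root
    while node.labels is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node


def record_training_labels(root: Node, samples, labels) -> None:
    """Route each second sample to its leaf and store its risk category label."""
    for sample, label in zip(samples, labels):
        find_leaf(root, sample).labels.append(label)


def predict_risk(root: Node, sample: Dict[str, float]) -> float:
    """Risk prediction probability = share of risky training samples in the leaf."""
    leaf = find_leaf(root, sample)
    return sum(leaf.labels) / len(leaf.labels)


# Hypothetical one-split tree on the IMR value.
tree = Node(feature="imr", threshold=0.5, left=Node(labels=[]), right=Node(labels=[]))
train_samples = [{"imr": 0.9}] * 10 + [{"imr": 0.1}] * 5
train_labels = [1] * 8 + [0] * 2 + [0] * 5
record_training_labels(tree, train_samples, train_labels)

# The input sample has the same feature items as a second sample but no label.
risk = predict_risk(tree, {"imr": 0.8})  # right leaf: 8 risky out of 10 samples
```

The right leaf reproduces the 8-to-2 example from the text. A leaf that received no training samples would cause a division by zero in this sketch, so real tree libraries handle that case separately.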
Thus, n risk prediction probabilities corresponding to the n objects can be obtained. Further, in one embodiment, the n risk prediction probabilities may be used as the risk prediction results for the n objects. In another embodiment, for each object, the risk prediction probability is compared with a predetermined probability threshold (e.g., 0.7); if the risk prediction probability is greater than the threshold, the risk prediction result is determined to be at risk, and otherwise it is determined to be no risk.
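The threshold comparison in the second embodiment amounts to the following sketch. The function name `classify` and the result strings are illustrative assumptions; the strict "greater than" comparison and the example threshold of 0.7 follow the text.

```python
def classify(risk_probabilities, threshold=0.7):
    """Map each risk prediction probability to a risk / no-risk result."""
    return ["risk" if p > threshold else "no risk" for p in risk_probabilities]

print(classify([0.8, 0.7, 0.35]))  # ['risk', 'no risk', 'no risk']
```

A probability exactly equal to the threshold is treated as no risk, because the text requires the probability to be greater than the threshold.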
According to another embodiment, as shown in fig. 4, after obtaining n risk prediction probabilities, the method may further include: based on the n risk prediction probabilities, a risk regression model is trained, and n risk probability confidence intervals for the n objects are output by using the risk regression model. Further, the final risk assessment result can be determined in a clustering mode.
Specifically, in an embodiment, after step S250, the method may further include: taking the n risk prediction probabilities as n risk probability labels and constructing n third samples, where each third sample also includes the IMR value of the corresponding object; and training a risk regression model using the n third samples, and determining, based on the trained risk regression model, a risk probability confidence interval for each of the n objects at a preset confidence level. It should be understood that the third sample may include other sample features besides the IMR value. To distinguish them in the description, the sample features in the third sample are referred to as third sample features, and the third sample features may be entirely or partially different from the first sample features and the second sample features. In a specific embodiment, assuming that the business object is a country/jurisdiction, the third sample features may further include: medium- and long-term political risk, forensic assessment, whether the country/jurisdiction participates in a treaty, etc.
The risk regression model described above is based on the same algorithm as the review regression model: for example, both may be logistic regression models, or both may be probit regression models.
After the risk regression model is fitted with the n third samples, n risk probability confidence intervals of the n objects at a predetermined confidence level (e.g., 0.95 or 0.98) may be determined based on the fitted (or trained) risk regression model and the sample characteristics in the n third samples. It should be noted that, in order to improve the accuracy of risk assessment, the inventor considers introducing a risk probability confidence interval, so that the accuracy of a risk assessment result can be further improved through the interval length and the endpoint value.
Further, in a specific embodiment, for each of the n objects, the average of the corresponding risk prediction probability and the two interval endpoints of the risk probability confidence interval may be determined as the final risk assessment result. In another specific embodiment, for each object, a corresponding fourth sample may be constructed based on the corresponding risk prediction probability and the two interval endpoints of the risk probability confidence interval, and the n fourth samples corresponding to the n objects are clustered to obtain a plurality of class clusters; the risk level corresponding to each class cluster is then determined as the risk level of the objects corresponding to the samples in that class cluster. In one example, a k-means clustering algorithm may be used, with the number of class clusters predetermined, such as 3 or 5. In another example, a density-based clustering algorithm may be employed. In one example, after the class clusters are obtained, staff may determine the risk level corresponding to each class cluster according to the sample features of the fourth samples in it, and use that risk level as the risk level of the objects corresponding to those fourth samples, that is, the final risk assessment result. The risk assessment results obtained in this way have high accuracy and usability.
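The clustering embodiment can be sketched with a plain k-means over fourth samples of the form (risk prediction probability, lower endpoint, upper endpoint). This is an illustration only: the data values are made up, the deterministic initialization is a convenience of the sketch, and the automatic ranking of clusters into "low", "medium", and "high" stands in for the manual calibration by staff that the text describes.

```python
import math

def kmeans(points, k, iterations=100):
    """Plain Lloyd's k-means; returns (cluster index per point, centroids)."""
    # Deterministic start: k points spread evenly through the sorted input.
    ordered = sorted(points)
    centroids = [ordered[i * len(ordered) // k] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids are too
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Hypothetical fourth samples: (probability, interval lower, interval upper).
fourth_samples = [
    (0.10, 0.05, 0.18), (0.12, 0.06, 0.20),
    (0.50, 0.40, 0.60), (0.55, 0.42, 0.66),
    (0.85, 0.78, 0.92), (0.90, 0.80, 0.95),
]
assignment, centroids = kmeans(fourth_samples, k=3)

# Stand-in for manual calibration: rank clusters by mean risk probability.
order = sorted(range(3), key=lambda c: centroids[c][0])
names = ["low", "medium", "high"]
level_of_cluster = {c: names[rank] for rank, c in enumerate(order)}
levels = [level_of_cluster[a] for a in assignment]
print(levels)  # ['low', 'low', 'medium', 'medium', 'high', 'high']
```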
In summary, with the risk assessment method disclosed in the embodiments of the present specification, annotation data is constructed based on a risk review list of a specific organization, and the inverse Mills ratio (IMR) of every object to be assessed is determined in order to correct sample selection bias. Annotation data is then constructed based on the review results and the IMR values, and a tree model is trained to predict the risk probability of the objects to be assessed. Furthermore, a risk regression model can be fitted based on the risk probabilities and the IMR values so as to determine a confidence interval at a preset confidence level; clustering is then performed based on the risk probability and the interval endpoints, and the risk level calibrated for each class cluster serves as the risk level of the corresponding objects. In this way, efficient and accurate risk assessment can be realized, helping to prevent property and other losses to users and enterprises.
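For readers unfamiliar with the inverse Mills ratio used throughout the method, the textbook form for a probit model is the standard normal density divided by the standard normal cumulative distribution, both evaluated at the fitted linear index. The sketch below shows only that textbook form; it assumes the probit variant of the review regression model, the function name is hypothetical, and the computation defined elsewhere in the specification takes precedence.

```python
import math

def inverse_mills_ratio(z):
    """Textbook inverse Mills ratio phi(z) / Phi(z) for a probit index z."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return pdf / cdf

for z in (-1.0, 0.0, 1.0):
    print(z, inverse_mills_ratio(z))
```

The ratio decreases as the index grows, so objects that the review model considers unlikely to be selected receive larger IMR values. For strongly negative indices the cumulative distribution underflows to zero, so a production implementation would need a numerically safer formulation.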
Corresponding to the risk assessment method, the embodiments of the specification also disclose a risk assessment device. Fig. 5 is a schematic structural diagram of a risk assessment apparatus according to an embodiment. As shown in fig. 5, the apparatus 500 includes:
a first sample acquiring unit 502 configured to acquire n first samples corresponding to n objects to be evaluated, each first sample including a review tag indicating whether the corresponding object is selected as a risk review object by a specific organization; a first training unit 504 configured to train a review regression model using the n first samples; an IMR value determining unit 506 configured to determine n inverse Mills ratios (IMRs) corresponding to the n objects based on the trained review regression model; a second sample acquiring unit 508 configured to acquire m second samples corresponding to the m objects, each second sample including a corresponding IMR value and a risk category label determined based on a result of a risk review of the corresponding object by the specific organization; a second training unit 510 configured to train a tree model using the m second samples; and a probability prediction unit 512 configured to determine, based on the n IMR values, n risk prediction probabilities corresponding to the n objects by using the trained tree model.
In one embodiment, the apparatus 500 further comprises: a third sample construction unit 514, configured to construct n third samples by using the n risk prediction probabilities as n risk probability labels, where each third sample further includes an IMR value of a corresponding object; a third training unit 516 configured to train a risk regression model using the n third samples; a confidence interval prediction unit 518, configured to determine a confidence interval of the risk probability of each object in the n objects under a preset confidence level based on the trained risk regression model.
In a specific embodiment, the apparatus 500 further comprises: a fourth sample construction unit 520 configured to construct a corresponding fourth sample based on the risk prediction probability corresponding to each object and the interval endpoints of the risk probability confidence interval; a clustering unit 522 configured to cluster the n fourth samples corresponding to the n objects to obtain a plurality of class clusters; and a risk level determining unit 524 configured to determine the risk level corresponding to each class cluster as the risk level of the objects corresponding to the samples in that class cluster.
In another specific embodiment, the review regression model and the risk regression model are both logistic regression models; or, the review regression model and the risk regression model are both probit regression models.
In one embodiment, the respective object is one of: region, user, commodity, event.
In one embodiment, the respective objects are regions; each first sample further includes the degree of economic freedom and/or the foreign immigration condition of the corresponding region; and/or each second sample further includes the degree of economic freedom and/or the foreign immigration condition of the corresponding region.
In one embodiment, each second sample further includes an indication of whether the corresponding object was selected by the specific organization as the risk review object.
In summary, with the risk assessment device disclosed in the embodiments of the present specification, annotation data is constructed based on a risk review list of a specific organization, and the inverse Mills ratio (IMR) of every object to be assessed is determined in order to correct sample selection bias. Annotation data is then constructed based on the review results and the IMR values, and a tree model is trained to predict the risk probability of the objects to be assessed. Furthermore, a risk regression model can be fitted based on the risk probabilities and the IMR values so as to determine a confidence interval at a preset confidence level; clustering is then performed based on the risk probability and the interval endpoints, and the risk level calibrated for each class cluster serves as the risk level of the corresponding objects. In this way, efficient and accurate risk assessment can be realized, helping to prevent property and other losses to users and enterprises.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The embodiments described above further explain the objects, technical solutions, and advantages of the present invention in detail. It should be understood that they are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.
Claims (16)
1. A method of risk assessment, comprising:
acquiring n first samples corresponding to n objects to be evaluated, wherein each first sample comprises a review tag which indicates whether the corresponding object is selected as a risk review object by a specific organization;
training a review regression model by using the n first samples, and determining n Inverse Mills Ratios (IMRs) corresponding to the n objects based on the trained review regression model;
obtaining m second samples corresponding to the m objects, each second sample including a corresponding IMR value and a risk category label determined based on a result of a risk review of the corresponding object by the specific organization;
training a tree model using the m second samples;
and determining n risk prediction probabilities corresponding to the n objects by utilizing the trained tree model based on the n IMR values.
2. The method of claim 1, wherein after determining n risk prediction probabilities for the n objects, the method further comprises:
taking the n risk prediction probabilities as n risk probability labels, and constructing n third samples, wherein each third sample also comprises an IMR value of a corresponding object;
and training a risk regression model by using the n third samples, and determining a risk probability confidence interval of each object in the n objects under a preset confidence level based on the trained risk regression model.
3. The method of claim 2, wherein after determining a risk probability confidence interval for each of the n objects at a preset confidence level, the method further comprises:
constructing a corresponding fourth sample based on the risk prediction probability corresponding to each object and the interval endpoint of the risk probability confidence interval;
clustering n fourth samples corresponding to the n objects to obtain a plurality of clusters;
and determining the risk level corresponding to each class cluster as the risk level of the object corresponding to the sample in the class cluster.
4. The method of claim 2, wherein the review regression model and the risk regression model are both logistic regression models; or,
the review regression model and the risk regression model are both probit regression models.
5. The method of claim 1, wherein the respective object is one of: region, user, commodity, event.
6. The method of claim 1, wherein the respective object is a region;
each first sample further comprises the degree of economic freedom and/or the foreign immigration condition of the corresponding region; and/or,
each second sample further comprises the degree of economic freedom and/or the foreign immigration condition of the corresponding region.
7. The method of claim 1, wherein each second sample further comprises whether the corresponding object was selected by the specific organization as the risk review object.
8. A risk assessment device comprising:
a first sample acquisition unit configured to acquire n first samples corresponding to n objects to be evaluated, each first sample including a review tag indicating whether the corresponding object is selected as a risk review object by a specific organization;
a first training unit configured to train a review regression model using the n first samples;
an IMR value determination unit configured to determine n Inverse Mills Ratios (IMRs) corresponding to the n objects based on the trained review regression model;
a second sample acquisition unit configured to acquire m second samples corresponding to the m objects, each second sample including a corresponding IMR value and a risk category label determined based on a result of a risk review of the corresponding object by the specific organization;
a second training unit configured to train a tree model using the m second samples;
and a probability prediction unit configured to determine n risk prediction probabilities corresponding to the n objects by using the trained tree model based on the n IMR values.
9. The apparatus of claim 8, wherein the apparatus further comprises:
a third sample construction unit, configured to construct n third samples by using the n risk prediction probabilities as n risk probability labels, wherein each third sample further includes an IMR value of a corresponding object;
a third training unit configured to train a risk regression model using the n third samples;
and a confidence interval prediction unit configured to determine a risk probability confidence interval of each object in the n objects under a preset confidence level based on the trained risk regression model.
10. The apparatus of claim 9, wherein the apparatus further comprises:
a fourth sample construction unit configured to construct a corresponding fourth sample based on the risk prediction probability corresponding to each object and the interval endpoint of the risk probability confidence interval;
a clustering unit configured to cluster n fourth samples corresponding to the n objects to obtain a plurality of clusters;
and a risk level determining unit configured to determine the risk level corresponding to each class cluster as the risk level of the object corresponding to the sample in the class cluster.
11. The apparatus of claim 9, wherein the review regression model and the risk regression model are both logistic regression models; or,
the review regression model and the risk regression model are both probit regression models.
12. The apparatus of claim 8, wherein the respective object is one of: region, user, commodity, event.
13. The apparatus of claim 8, wherein the respective objects are regions;
each first sample further comprises the degree of economic freedom and/or the foreign immigration condition of the corresponding region; and/or,
each second sample further comprises the degree of economic freedom and/or the foreign immigration condition of the corresponding region.
14. The apparatus of claim 8, wherein each second sample further comprises whether the corresponding object was selected by the specific organization as the risk review object.
15. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-7.
16. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that when executed by the processor implements the method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110808789.3A CN113487208B (en) | 2021-07-16 | 2021-07-16 | Risk assessment method and risk assessment device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113487208A (en) | 2021-10-08 |
CN113487208B (en) | 2024-06-18 |
Family
ID=77941914
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110808789.3A Active CN113487208B (en) | 2021-07-16 | 2021-07-16 | Risk assessment method and risk assessment device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113487208B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114282684A (en) * | 2021-12-24 | 2022-04-05 | 支付宝(杭州)信息技术有限公司 | Method and device for training user-related classification model and classifying users |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160104163A1 (en) * | 2014-10-14 | 2016-04-14 | Jpmorgan Chase Bank, N.A. | Identifying Potentially Risky Transactions |
JP2019150018A (en) * | 2018-02-28 | 2019-09-12 | 国立大学法人 宮崎大学 | Cell determination device, cell determination method and program |
CN111080123A (en) * | 2019-12-14 | 2020-04-28 | 支付宝(杭州)信息技术有限公司 | User risk assessment method and device, electronic equipment and storage medium |
CN111291900A (en) * | 2020-03-05 | 2020-06-16 | 支付宝(杭州)信息技术有限公司 | Method and device for training risk recognition model |
CN111383101A (en) * | 2020-03-25 | 2020-07-07 | 深圳前海微众银行股份有限公司 | Post-loan risk monitoring method, device, equipment and computer-readable storage medium |
CN111461216A (en) * | 2020-03-31 | 2020-07-28 | 浙江邦盛科技有限公司 | Case risk identification method based on machine learning |
US20210021258A1 (en) * | 2019-07-19 | 2021-01-21 | University of Florida Research Foundation, Incorporated | Method And Apparatus For Eliminating Crosstalk Effects In High Switching-Speed Power Modules |
US20210166197A1 (en) * | 2017-03-14 | 2021-06-03 | iMitig8 Risk LLC | System and method for providing risk recommendation, mitigation and prediction |
Non-Patent Citations (1)
Title |
---|
Huang Rongbing: "Does corporate greenwashing behavior affect auditors' decisions?", Auditing Research (审计研究), no. 03, 28 May 2020 (2020-05-28) * |
Also Published As
Publication number | Publication date |
---|---|
CN113487208B (en) | 2024-06-18 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20230118. Address after: 200120 Floor 15, No. 447, Nanquan North Road, Free Trade Pilot Zone, Pudong New Area, Shanghai. Applicant after: Alipay.com Co.,Ltd. Address before: 310000 801-11 section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province. Applicant before: Alipay (Hangzhou) Information Technology Co.,Ltd. |
GR01 | Patent grant | |