CN116384750A - Method and computing device for generating marking sample and training risk rating prediction model - Google Patents

Method and computing device for generating marking sample and training risk rating prediction model

Info

Publication number
CN116384750A
Authority
CN
China
Prior art keywords
risk
samples
sample
marked
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310398899.6A
Other languages
Chinese (zh)
Inventor
靳佳为
李洪世
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhige Digital Technology Co ltd
Original Assignee
Shenzhen Zhige Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhige Digital Technology Co ltd filed Critical Shenzhen Zhige Digital Technology Co ltd
Priority to CN202310398899.6A priority Critical patent/CN116384750A/en
Publication of CN116384750A publication Critical patent/CN116384750A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a method and a computing device for generating marked samples and training a risk rating prediction model. A method of generating marked samples based on e-commerce data includes: performing dimension reduction on a risk index space containing a plurality of risk indexes, ranking the samples, obtaining initial marked samples, and placing the initial marked samples into a sample space; and repeating the following steps until the number of marked samples in the sample space reaches a threshold: training a classification model using the marked samples in the sample space, labeling samples with the trained classification model, and expanding the sample space with the newly labeled samples. Because samples are labeled by semi-supervised learning, labels and training samples are generated from the data itself, which saves labor.

Description

Method and computing device for generating marking sample and training risk rating prediction model
Technical Field
The application relates to the technical field of machine learning and business big data, and in particular to a method and a computing device for generating marked samples and training a risk rating prediction model.
Background
With the development of network computing technology, e-commerce generates a large amount of business big data. For example, compared with traditional industries, e-commerce produces a vast amount of raw data along its ecosystem value chain. Acquiring, processing, and making efficient use of such data can assist business operations and support business decisions.
For example, e-commerce big data can be used to train various machine learning models. In general, however, the data must be labeled before it can be used for training, which is time-consuming and labor-intensive, especially for massive e-commerce data.
Therefore, a low-cost method of labeling e-commerce big data is needed so that the data can be fully used to assist or support the business operations of an enterprise.
The above information disclosed in the background section is only for enhancement of understanding of the background of the application and therefore it may contain information that does not form the prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The application aims to provide a method and a computing device for generating marked samples and training a risk rating prediction model based on e-commerce data, in which samples are labeled by semi-supervised learning, so that labels and training samples are generated from the data and manual labor is saved.
Other features and advantages of the present application will become apparent from the detailed description set forth below, or may be learned in part by practice of the application.
According to an aspect of the present application, there is provided a method of generating marked samples based on e-commerce data, comprising: performing dimension reduction on a risk index space containing a plurality of risk indexes, ranking the samples, obtaining initial marked samples, and placing the initial marked samples into a sample space; and repeating the following steps until the number of marked samples in the sample space reaches a threshold: training a classification model using the marked samples in the sample space, labeling samples with the trained classification model, and expanding the sample space with the newly labeled samples.
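The iterative labeling loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: averaging the indicators stands in for the dimension reduction, a nearest-centroid rule stands in for the classification model, and the names `seed_labels`, `self_train`, `seed_frac`, `batch`, and `threshold` are hypothetical.

```python
from statistics import mean

def reduce_to_score(sample):
    # Stand-in for dimension reduction: collapse all indicators to one score.
    return mean(sample)

def seed_labels(samples, seed_frac=0.2):
    # Rank samples by the reduced score and label only the two extremes.
    order = sorted(range(len(samples)), key=lambda i: reduce_to_score(samples[i]))
    n_seed = max(1, int(len(samples) * seed_frac))
    labeled = {i: 0 for i in order[:n_seed]}          # lowest scores: label 0
    labeled.update({i: 1 for i in order[-n_seed:]})   # highest scores: label 1
    return labeled

def train_centroids(samples, labeled):
    # "Train" a nearest-centroid classifier on the current sample space.
    centroids = {}
    for label in (0, 1):
        rows = [samples[i] for i, y in labeled.items() if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def self_train(samples, threshold, batch=2):
    labeled = seed_labels(samples)
    while len(labeled) < threshold:
        centroids = train_centroids(samples, labeled)
        unlabeled = [i for i in range(len(samples)) if i not in labeled]
        if not unlabeled:
            break
        # Confidence is the gap between the distances to the two centroids.
        scored = []
        for i in unlabeled:
            d0 = distance(samples[i], centroids[0])
            d1 = distance(samples[i], centroids[1])
            scored.append((abs(d0 - d1), i, 0 if d0 < d1 else 1))
        scored.sort(reverse=True)
        for _, i, label in scored[:batch]:
            labeled[i] = label  # expand the sample space
    return labeled

samples = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.4, 0.5],
           [0.6, 0.5], [0.7, 0.8], [0.9, 0.8], [0.8, 0.9],
           [0.15, 0.25], [0.85, 0.75]]
labels = self_train(samples, threshold=8)
```

Only the most confident predictions are added in each round, so that early labeling mistakes are less likely to be reinforced; the batch size and the confidence measure are design choices of this sketch.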
According to another aspect of the present application, there is provided a method of training a risk rating prediction model, comprising: generating marked samples by the foregoing method; taking at least part of the marked samples as training samples; selecting a plurality of risk indicators; dividing the plurality of risk indicators into at least one risk dimension; and training a random forest model based on the training samples, the plurality of risk indicators and the at least one risk dimension, wherein the random forest model comprises a first group of decision trees and a second group of decision trees, the first group of decision trees randomly acquires a plurality of the marked samples and the plurality of risk indicators, and the second group of decision trees randomly acquires the training samples and respectively acquires the risk indicators of each risk dimension.
According to another aspect of the present application, there is provided a computing device comprising: a processor; a memory having a computer program stored thereon; the aforementioned method is implemented when the processor executes the computer program.
According to another aspect of the present application, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method as described above.
According to some embodiments, samples are labeled by semi-supervised learning, so that labels and training samples are generated from the data and labor is saved.
According to some embodiments, a random forest is used as the underlying model in place of logistic regression. When the random forest model generates its sub-decision trees, a specific number of subtrees select only the risk indicators of a specific risk dimension. Embedding the user risk portrait function into the risk rating model in this way saves time and computation cost, since the risk portrait is produced by the random forest model itself. The model prediction results obtained by the method according to the example embodiment are thereby more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
Fig. 1 shows a schematic diagram of an application scenario of the technical solution of the present application.
FIG. 2A illustrates a random forest model for risk rating prediction using business big data according to an example embodiment of the present application.
FIG. 2B illustrates a training pattern of a random forest model for risk rating predictions using business big data according to an example embodiment of the present application.
FIG. 3 illustrates a method for risk rating with business big data using a random forest model according to an example embodiment of the present application.
FIG. 4 illustrates a process of normalizing risk indicators according to an example embodiment.
FIG. 5 illustrates a flow chart of a method of training a risk rating prediction model according to an embodiment of the present application.
FIG. 6 illustrates a flow chart of a method for sample tagging through semi-supervised learning according to an embodiment of the present application.
FIG. 7 illustrates an example of overall risk prediction and risk portrayal in accordance with an example embodiment of the present application.
FIG. 8 illustrates a block diagram of a computing device according to an example embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present application. One skilled in the relevant art will recognize, however, that the aspects of the application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will appreciate that the embodiments described herein may be combined with other embodiments.
Traditionally, financial institutions have managed the risk of e-commerce businesses mainly through offline due diligence. An enterprise's risk is assessed by investigating aspects such as the company's operating status, company property clues, legal property clues (houses, cars, etc.), bank credit status, debt status, and litigation. The investigation data sources mainly include enterprise financial reports, bank statements, tax returns, industrial and commercial information platforms, real estate bureau databases, a medium network access database, and the like. The main problems of this approach are that the labor and time costs of offline investigation are relatively high, the credibility of the data (financial reports) and the clarity of the data (bank statements) cannot be guaranteed, and the data cannot be obtained in batches. Data acquisition typically relies on manual processing, which cannot keep up with the massive data generated on e-commerce platforms.
In addition, financial institutions typically make risk rating predictions based on human experience or by building risk scorecards. First, the risk indicators are binned in combination with the sample labels (based on a logistic regression algorithm), that is, continuous data is discretized; for example, the variable age can be binned into 0-18, 18-30, 30-45, 45-60, etc. Then, the risk scores of the different intervals of each indicator are calculated. Finally, the risk indicators of the target user are matched to the risk scores of the corresponding intervals, and the scores are summed to obtain a total risk score.
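The scorecard procedure described above (bin each indicator, look up the score of the matching interval, sum the scores) can be illustrated with a short sketch. The bin edges, scores, and the names `SCORECARD`, `bin_index`, and `total_score` are hypothetical; the sketch also assumes left-closed intervals, since the text lists overlapping boundaries such as 0-18 and 18-30.

```python
from bisect import bisect_right

# Hypothetical scorecard: bin edges and one score per interval.
SCORECARD = {
    "age": {"edges": [18, 30, 45, 60], "scores": [5, 20, 35, 30, 15]},
    "refund_rate": {"edges": [0.05, 0.15], "scores": [40, 25, 5]},
}

def bin_index(value, edges):
    # Interval i covers [edges[i-1], edges[i]); values below edges[0] fall in bin 0.
    return bisect_right(edges, value)

def total_score(indicators):
    # Match each indicator to its interval score and sum the scores.
    total = 0
    for name, value in indicators.items():
        card = SCORECARD[name]
        total += card["scores"][bin_index(value, card["edges"])]
    return total

score = total_score({"age": 37, "refund_rate": 0.08})
```

Because the total is a plain sum of per-indicator scores, the scorecard behaves much like a linear model, which is the limitation discussed in the text.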
A risk scorecard cannot produce a user risk portrait, for example scoring an e-commerce business along different risk dimensions (inventory, sales). Furthermore, its accuracy is limited: because its form is very simple (very similar to a linear model), it has difficulty fitting the true distribution of the data. At the same time, the industry currently lacks solutions that can effectively use e-commerce big data for risk rating.
Therefore, the embodiment of the application provides a method for generating a marking sample and training a risk rating prediction model based on electronic commerce data, wherein the electronic commerce data is utilized to generate the sample, and reliable financing basis can be provided for a financial institution through the establishment and training of a machine learning model.
The technical scheme of the present application will be described in detail with reference to examples.
Fig. 1 shows a schematic diagram of an application scenario of the technical solution of the present application.
Referring to fig. 1, in an e-commerce system, data generated along the e-commerce value chain accumulates in the database of an e-commerce platform. To acquire this accumulated data, the e-commerce business can authorize the data processing system according to the embodiment of the application to use the data through an API of the e-commerce platform. In addition, by interfacing with other main participants in e-commerce (including third-party payment providers, logistics providers, and warehouse service providers), the system can obtain the user's raw e-commerce data in multiple dimensions (sales, inventory, traffic, policy violations, logistics, settlement, and the like) in real time, and can use distributed techniques for storage and computation.
After receiving the authorization, the data processing system according to the embodiment of the application pulls the original electronic commerce data of the corresponding electronic commerce in the electronic commerce platform to a storage system associated with the data processing system. According to some embodiments, the storage system may be a distributed storage system.
The data processing system according to the embodiment of the application processes the data, for example by standardization, to obtain data that can be used subsequently. Then, drawing on industry experience, business models, financial models, statistical models, and the like, it obtains a risk rating result and a risk portrait of the e-commerce enterprise by means such as machine learning, and provides them to financial institutions such as banks as a reliable basis for financing.
FIG. 2A illustrates a random forest model for risk rating prediction using business big data according to an example embodiment of the present application.
The random forest model shown in fig. 2A may provide financing basis for financial institutions by predicting risk ratings for enterprises based on business big data (e.g., e-commerce big data).
FIG. 2B illustrates a training pattern of a random forest model for risk rating predictions using business big data according to an example embodiment of the present application.
A random forest constructs a plurality of decision trees. When a sample needs to be predicted, the prediction of each tree in the forest for that sample is collected, and the final result is selected from these predictions by voting. The randomness lies in randomly selecting features and randomly selecting samples, so that the trees in the forest are both similar and diverse. As a Bagging algorithm in ensemble learning, the random forest samples the original e-commerce data set to obtain new data sets: one sample is randomly selected from the original data set and added to the new data set, and this operation is repeated to form different training sets. In other words, the random forest can independently and randomly extract a plurality of subsets from the majority class, train on each subset together with the minority-class data to generate a plurality of base classifiers, and then weight the base classifiers to form a new classifier, which addresses the problem of data imbalance. The random forest is a basic and commonly used non-linear classification and regression method.
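The two mechanisms mentioned above, sampling with replacement to build each training set and voting across trees, can be sketched in a few lines. The helper names `bootstrap` and `majority_vote` are illustrative, and the per-tree predictions are made-up placeholders rather than outputs of trained trees.

```python
import random
from collections import Counter

def bootstrap(dataset, rng):
    # Draw len(dataset) samples with replacement to form one training set.
    return [rng.choice(dataset) for _ in dataset]

def majority_vote(predictions):
    # The forest's final result is the label predicted by the most trees.
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
dataset = list(range(10))
# One bootstrap training set per tree; the sets differ from tree to tree.
training_sets = [bootstrap(dataset, rng) for _ in range(3)]
# Placeholder predictions of five trees for one sample.
final = majority_vote(["high", "low", "high", "high", "low"])
```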
Referring to fig. 2A, a random forest model according to an example embodiment includes n+i decision trees: the n decision trees of the first set obtain k risk indicators, and the i decision trees of the second set each obtain the risk indicators of one particular dimension among the i risk dimensions.
Referring to fig. 2B, when training the model, the first set of decision trees randomly acquire a plurality of marking samples and the plurality of risk indexes, and the second set of decision trees randomly acquire training samples and respectively acquire risk indexes of each risk dimension.
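One way to picture the two groups of trees is as a per-tree training plan: each first-group tree receives a random set of samples and k randomly chosen indicators, while each second-group tree receives a random set of samples and only the indicators of its own risk dimension. The sketch below builds such a plan without training any trees; the dimension-to-indicator mapping and the names `DIMENSIONS` and `plan_trees` are hypothetical.

```python
import random

# Hypothetical mapping of risk dimensions to their risk indicators.
DIMENSIONS = {
    "sales": ["sales_ratio", "traffic_conversion"],
    "inventory": ["inventory_turnover", "redundant_inventory_ratio"],
    "returns": ["refund_rate", "complaint_count"],
}

def plan_trees(n_first, k, n_samples, seed=0):
    rng = random.Random(seed)
    all_indicators = [ind for inds in DIMENSIONS.values() for ind in inds]
    plans = []
    # First group: n trees, each with random sample rows and k random indicators.
    for _ in range(n_first):
        plans.append({
            "group": "first",
            "rows": [rng.randrange(n_samples) for _ in range(n_samples)],
            "indicators": rng.sample(all_indicators, k),
        })
    # Second group: one tree per risk dimension, limited to that dimension's indicators.
    for dim, inds in DIMENSIONS.items():
        plans.append({
            "group": "second",
            "dimension": dim,
            "rows": [rng.randrange(n_samples) for _ in range(n_samples)],
            "indicators": list(inds),
        })
    return plans

plans = plan_trees(n_first=4, k=3, n_samples=8)
```

With three risk dimensions and n = 4, the plan contains n + i = 7 trees, matching the n+i structure described for fig. 2A.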
The risk indicators may include, but are not limited to, for example, year-over-year sales, inventory turnover, traffic conversion, number of infringement complaints, refund rate, and the like. Risk dimensions include, but are not limited to, inventory, sales, returns, settlement, and the like. These risk indicators may be obtained based on business big data.
In the random forest model of the example embodiment, the client risk portrait function is embedded into the risk rating model, which saves time and computation cost, since the risk portrait is produced by the random forest algorithm itself.
FIG. 3 illustrates a method for risk rating prediction using e-commerce big data by a random forest model according to an example embodiment of the present application.
Referring to fig. 3, at S301, a risk sample of a target customer is obtained, the risk sample having a plurality of risk indices, the plurality of risk indices being divisible into at least one risk dimension.
According to an example embodiment, the plurality of risk indicators may include a time-slice based statistical indicator.
For example, the plurality of risk indicators may include, but are not limited to, time-slice-based year-over-year sales, inventory turnover, traffic conversion, number of infringement complaints, refund rate, and the like.
The plurality of risk indicators may be divided into at least one risk dimension; for example, year-over-year sales, inventory turnover, and return rate may be assigned to three risk dimensions, namely sales, inventory, and settlement, respectively.
According to an example embodiment, a risk sample may be obtained and a risk indicator of the sample may be normalized by a method described later with reference to fig. 4.
At S303, a plurality of risk indicators are placed into a random forest model for calculation.
According to an example embodiment, the random forest model includes a first set of decision trees and a second set of decision trees. The first set of decision trees obtains the plurality of risk indexes, and the second set of decision trees respectively obtains the risk indexes of each risk dimension.
At S305, the output of the random forest model is obtained to yield an overall risk prediction and a risk portrait.
For example, the results (e.g., averages) of the first and second sets of decision trees may be taken as overall risk predictions for the target user, and the risk predictions for the respective risk dimensions of the second set of decision trees may be taken as risk portraits, see the overall risk predictions and examples of risk portraits given in FIG. 7.
According to some embodiments, the output of the first set of decision trees is an average overdue probability, and the outputs of the second set of decision trees are a return risk, an inventory risk, a base risk, a settlement performance, and a sales performance, respectively.
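The aggregation described for S305 can be sketched as follows, assuming each tree outputs a single numeric risk value and that a plain average is used for the overall prediction (the text gives averaging only as an example). The function name `aggregate` and the sample values are hypothetical.

```python
from statistics import mean

def aggregate(first_group_outputs, second_group_outputs):
    # second_group_outputs maps each risk dimension to its tree's prediction.
    all_outputs = list(first_group_outputs) + list(second_group_outputs.values())
    overall = mean(all_outputs)            # overall risk prediction
    portrait = dict(second_group_outputs)  # per-dimension risk portrait
    return overall, portrait

overall, portrait = aggregate(
    [0.25, 0.5, 0.75],
    {"return": 0.5, "inventory": 0.25, "settlement": 0.75},
)
```

The risk portrait falls out of the same forward pass as the overall prediction, which is why no separate profiling model is needed.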
According to some embodiments, the overall risk prediction value serves as a prediction of the future operating condition of the target user. According to some embodiments, the method is used by a financial institution to rate the risk of an e-commerce business.
According to some embodiments, the predicted values of the plurality of samples are weighted averaged according to a particular indicator of the plurality of samples. For example, the predicted values may be weighted averaged according to the sales index.
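As a small worked example of the weighted average mentioned above, the sketch below weights the predicted values of three samples by a sales figure. The numbers and the name `weighted_average` are illustrative only.

```python
def weighted_average(predictions, weights):
    # Weight each sample's predicted value by its indicator (here, sales).
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight

# Hypothetical predicted values and sales figures for three samples.
preds = [0.2, 0.4, 0.8]
sales = [100, 300, 100]
result = weighted_average(preds, sales)  # (20 + 120 + 80) / 500
```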
According to an example embodiment of the present application, a random forest is used as the underlying algorithm in place of logistic regression. When the random forest algorithm generates its sub-decision trees, a specific number of subtrees select only the risk indicators of a specific risk dimension. Embedding the user risk portrait function into the risk rating model in this way saves time and computation cost, since the risk portrait is produced by the random forest algorithm itself. The prediction results obtained by the method according to the example embodiment are thereby more accurate, and the likelihood of overfitting is reduced at the same time.
According to some embodiments, after the risk indicators are obtained, they can be compared with previously obtained risk indicators to detect abnormal changes and issue an early warning. For example, early warning information can be sent when the year-over-year sales decline exceeds that of 80% of peer competitors, when the redundant inventory ratio exceeds that of 80% of peer competitors, or when daily sales are more than 3 standard deviations above the average sales of the last 30 days, so that risk can be kept to a minimum.
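The 3-standard-deviation rule for daily sales can be sketched as below. The sketch assumes the sample standard deviation (the text does not say whether the sample or population form is meant), and the name `sales_alert` and the sales history are hypothetical.

```python
from statistics import mean, stdev

def sales_alert(history, today, n_sigma=3.0):
    # history: daily sales over the last 30 days (at least two values).
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation (an assumption)
    return today > mu + n_sigma * sigma

history = [100, 102, 98, 101, 99] * 6  # 30 days of stable sales
alert_spike = sales_alert(history, 150)
alert_normal = sales_alert(history, 103)
```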
FIG. 4 illustrates a process of normalizing risk indicators according to an example embodiment.
After the business big data is obtained through the data interface, the obtained original electronic commerce data can be subjected to statistical processing to generate a marked sample and a risk index. The risk indicator may then be normalized for use in prediction or for training the model. By data standardization, the convergence speed and precision of the model can be improved, and the influence of time, region, class and the like can be removed.
According to some embodiments, the risk indicator may first be determined in conjunction with an RFM model, a financial model, an e-commerce operation index system, and the like.
The RFM model is an important tool for measuring customer value and a customer's ability to generate profit. Among the many analysis modes of Customer Relationship Management (CRM), the RFM model is widely used. The model describes the value of a customer through three indicators: how recently the customer purchased (Recency), how often the customer purchases (Frequency), and how much the customer spends (Monetary).
The financial model classifies, sorts and links various information of enterprises according to a main line of value creation so as to complete the functions of analysis, prediction, evaluation and the like of financial performance of the enterprises. The overall operation index may include a traffic class index, a sales conversion index, a commodity class index, and the like.
According to some embodiments, as risk indicators, the goods return rate may be defined as the ratio of the number of returned orders to the total number of orders, the payment return rate may be defined as the ratio of the total amount the platform pays into the customer's account to the total sales amount on the platform, and the sales rate may be defined as the ratio of the number of items sold to the average inventory.
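The three ratio definitions above (returned orders over total orders, amount paid into the customer's account over total sales amount, and items sold over average inventory) translate directly into code. The statistics, field names, and the helper `safe_ratio` below are hypothetical.

```python
def safe_ratio(numerator, denominator):
    # Return None instead of dividing by zero when there is no activity.
    return numerator / denominator if denominator else None

# Hypothetical aggregated statistics for one merchant and one time slice.
stats = {
    "returned_orders": 12, "total_orders": 240,
    "amount_paid_to_account": 45000.0, "total_sales_amount": 50000.0,
    "items_sold": 600, "average_inventory": 200,
}

indicators = {
    "goods_return_rate": safe_ratio(stats["returned_orders"], stats["total_orders"]),
    "payment_return_rate": safe_ratio(stats["amount_paid_to_account"],
                                      stats["total_sales_amount"]),
    "sales_rate": safe_ratio(stats["items_sold"], stats["average_inventory"]),
}
```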
At S401, an e-commerce data sample is acquired.
In an e-commerce system, data generated along the e-commerce value chain accumulates in the database of an e-commerce platform. To acquire this accumulated data, with the authorization of the e-commerce business, the data can be pulled through an API (application program interface) of the e-commerce platform and saved to a storage system, and then processed and saved as e-commerce data samples. The e-commerce data samples can then be obtained from the storage system. Risk samples of at least one time window are acquired from the e-commerce data of at least one preset period according to a sliding time window of the preset period. Using time windows in this way expands the number of risk samples, which is particularly useful for meeting the sample count required for model training.
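To illustrate how a sliding time window multiplies the number of samples, the sketch below enumerates overlapping windows over a date range; each window would yield one risk sample per merchant. The 28-day window, 7-day step, and the name `sliding_windows` are assumptions made for the example.

```python
from datetime import date, timedelta

def sliding_windows(start, end, window_days, step_days):
    # Return (window_start, window_end) pairs, both inclusive, within [start, end].
    windows = []
    current = start
    while current + timedelta(days=window_days - 1) <= end:
        windows.append((current, current + timedelta(days=window_days - 1)))
        current += timedelta(days=step_days)
    return windows

windows = sliding_windows(date(2023, 1, 1), date(2023, 3, 1),
                          window_days=28, step_days=7)
```

Here two months of data produce five overlapping windows instead of two disjoint ones.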
At S403, a plurality of time slices for performing statistical calculations on the e-commerce data samples are determined.
According to an example embodiment, time slices, such as 0-7 days, 8-14 days, 15-21 days, 22-28 days, etc., may be set within a time window to count e-commerce data samples, such as counting time slice statistics of the amount of orders, the amount of returns, etc., in each sample. The number of risk indicators may be expanded by multiple time slice statistics, as described in detail below.
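The time-slice statistics can be sketched as a simple grouping step. The slice boundaries follow the example in the text (so the first slice, days 0-7, is one day longer than the others), and the record layout `(day_offset, amount, is_return)` and the names `SLICES`, `slice_of`, and `slice_statistics` are hypothetical.

```python
from collections import defaultdict

# Time slices as (first_day, last_day) offsets inside one time window.
SLICES = [(0, 7), (8, 14), (15, 21), (22, 28)]

def slice_of(day_offset):
    for index, (first, last) in enumerate(SLICES):
        if first <= day_offset <= last:
            return index
    return None  # outside the time window

def slice_statistics(orders):
    # orders: iterable of (day_offset, amount, is_return) tuples.
    stats = defaultdict(lambda: {"order_amount": 0.0, "return_count": 0})
    for day_offset, amount, is_return in orders:
        index = slice_of(day_offset)
        if index is None:
            continue  # ignore records outside the window
        stats[index]["order_amount"] += amount
        stats[index]["return_count"] += int(is_return)
    return dict(stats)

orders = [(0, 10.0, False), (7, 5.0, True), (9, 20.0, False), (30, 99.0, False)]
slice_stats = slice_statistics(orders)
```

Each statistic computed per slice becomes its own column, which is how several time slices expand the number of risk indicators.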
At S405, for each of the plurality of time slices, index statistics are computed on the e-commerce data samples screened by attribute dimension combination, and the risk indicators are calculated to obtain risk samples.
For example, time-slice statistics of indexes such as the number of returned orders and the total sales amount are computed on the e-commerce data samples for each attribute dimension combination of product category, region, and time window; the risk indicators are then calculated from the time-slice statistics and the definition of each risk indicator, yielding risk samples that each include a plurality of risk indicators. Tables 1 and 2 give statistics and risk indicators for example risk samples.
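Table 1. Example multidimensional statistics
[Table provided as an image in the original publication: Figure BDA0004178727540000091]
Table 2. Risk index example
[Table provided as an image in the original publication: Figure BDA0004178727540000092]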
At S407, the risk indicators are standardized within each attribute dimension combination to eliminate or reduce deviations caused by differing units and scales.
According to an example embodiment, a set of risk samples with the same combination of attribute dimensions is screened, and the mean and standard deviation of risk indicators in the set are calculated.
According to some embodiments, the risk indicator may be z-score (zero-mean) normalized. The normalized result is
$$z = \frac{x - \bar{x}}{s}$$
where x is the risk indicator value, $\bar{x}$ is the mean of the risk indicator, and s is its standard deviation. Table 3 shows the example results after normalization.
Table 3. Risk index standardization example
[Table provided as an image in the original publication: Figure BDA0004178727540000103]
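The group-wise z-score standardization of S407 can be sketched as follows: samples sharing the same attribute dimension combination are grouped, and each indicator value is standardized with the mean and standard deviation of its own group. The sketch assumes the sample standard deviation and assigns 0.0 when a group has no spread; the field names and `zscore_by_group` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def zscore_by_group(samples, key_fields, indicator):
    # Group risk samples that share the same attribute dimension combination.
    groups = defaultdict(list)
    for s in samples:
        groups[tuple(s[f] for f in key_fields)].append(s)
    result = []
    for members in groups.values():
        values = [m[indicator] for m in members]
        mu = mean(values)
        sd = stdev(values) if len(values) > 1 else 0.0
        for m in members:
            # z = (x - mean) / standard deviation, computed within the group.
            z = (m[indicator] - mu) / sd if sd else 0.0
            result.append({**m, indicator + "_z": z})
    return result

samples = [
    {"category": "toys", "region": "EU", "refund_rate": 0.02},
    {"category": "toys", "region": "EU", "refund_rate": 0.04},
    {"category": "toys", "region": "EU", "refund_rate": 0.06},
    {"category": "shoes", "region": "US", "refund_rate": 0.10},
]
normalized = zscore_by_group(samples, ("category", "region"), "refund_rate")
```

Standardizing within a group means a refund rate is judged against merchants of the same category and region rather than against the whole population.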
FIG. 5 illustrates a flow chart of a method of training a risk rating prediction model according to an embodiment of the present application.
Referring to fig. 5, at S501, a labeled training sample is acquired.
Training samples may be labeled in a variety of ways. For example, samples may be labeled manually. Alternatively, samples may be labeled by semi-supervised learning, so that labels are generated from the data itself, as described below with reference to fig. 6.
According to some embodiments, sample tagging may be performed by a method described later with reference to fig. 6 using semi-supervised learning, and at least a portion of the tagged sample is used as a training sample.
At S503, a plurality of risk indicators are selected.
According to an example embodiment, the indicators may first be screened, retaining the top k risk indicators by importance ranking, to reduce the computational load of the model.
According to some embodiments, a simple logistic regression model may be used, with the regression coefficients serving as the screening criterion. L1 or L2 regularization may also be used for screening.
According to some embodiments, a KS test ranking may be applied to the indicator space, retaining the top k risk indicators.
The KS test (Kolmogorov-Smirnov test) is used to test whether a distribution conforms to a given theoretical distribution, or whether two empirical distributions differ significantly. In risk control, the KS test is often used to evaluate the discriminating power of a risk indicator: the greater the discrimination, the stronger the risk ranking ability of the indicator. The KS statistic is built on the empirical cumulative distribution function (ECDF). The test statistic is:

$$D = \max_x \lvert B(x) - G(x) \rvert$$

where B(x) is the cumulative proportion of bad samples whose value of the specific indicator is less than or equal to x, and
G(x) is the cumulative proportion of good samples whose value of the specific indicator is less than or equal to x.
The test procedure is as follows:
(1) Assume the null hypothesis H0: B(x) = G(x).
(2) Calculate the absolute difference between the cumulative frequency of good samples and that of bad samples for the specific indicator; the maximum absolute difference is D, i.e., D = max{|B(x) - G(x)|}.
(3) Take D as the KS score of the specific indicator and rank the indicators by KS score.
According to some embodiments, the method further comprises normalizing the plurality of risk indicators, as described with reference to fig. 4.
By screening the features of the risk indexes, risk rank scores with higher accuracy can be obtained, and the operation task of the model can be lightened.
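The KS-based screening can be sketched as below. The sketch assumes B(x) and G(x) are the empirical cumulative distribution functions of the indicator over bad and good samples, respectively; the function names and the example indicator names are hypothetical.

```python
def ks_statistic(values, is_bad):
    """Return D = max over x of |B(x) - G(x)| for one indicator."""
    n_bad = sum(1 for b in is_bad if b)
    n_good = len(is_bad) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one bad and one good sample")

    pairs = sorted(zip(values, is_bad))
    d = 0.0
    bad_seen = good_seen = 0
    i = 0
    while i < len(pairs):
        x = pairs[i][0]
        # Consume all ties at x so both ECDFs are evaluated at the same point.
        while i < len(pairs) and pairs[i][0] == x:
            if pairs[i][1]:
                bad_seen += 1
            else:
                good_seen += 1
            i += 1
        d = max(d, abs(bad_seen / n_bad - good_seen / n_good))
    return d


def rank_by_ks(indicators, is_bad, k):
    """Score every indicator and keep the k indicators with the largest KS score."""
    scores = {name: ks_statistic(vals, is_bad) for name, vals in indicators.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return top, scores


indicators = {
    "refund_rate": [0.1, 0.2, 0.8, 0.9],  # separates good from bad perfectly
    "turnover": [1, 2, 1, 2],             # identical distribution in both groups
}
is_bad = [False, False, True, True]
top, scores = rank_by_ks(indicators, is_bad, k=1)
```

In this toy data the refund rate reaches D = 1 and the turnover indicator reaches D = 0, so only the refund rate is retained.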
At S505, the plurality of risk indicators are divided into at least one risk dimension. For example, risk indicators may be grouped by inventory dimension, sales dimension, market dimension, user dimension, financial dimension, etc., to obtain a rank score for different risk dimensions for a target user.
At S507, a random forest model is trained based on the training samples and the plurality of risk indicators and the at least one risk dimension.
According to an example embodiment, the random forest model includes a first group of decision trees and a second group of decision trees. Each tree in the first group randomly samples the labeled training samples and the plurality of risk indicators; each tree in the second group randomly samples the training samples and takes the risk indicators of a single risk dimension.
The results (e.g., averages) output by the first and second sets of decision trees may be used as overall risk predictors.
In addition, the samples and risk indicators of a target user can be fed into the trained random forest model, and the outputs of the second group of decision trees are taken, dimension by dimension, as the risk prediction values of the corresponding risk dimensions, for example the inventory dimension.
According to some embodiments, the average OOB (out-of-bag) score, or equivalently the out-of-bag error rate, of the first and second groups of decision trees may also be used as an evaluation criterion for tuning the parameters of the random forest model, thereby obtaining an optimized model.
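The division of labor between the two groups of trees can be illustrated with the following sketch. It covers only the assignment of risk indicators to trees and the aggregation of tree outputs; bootstrap sampling and tree training are omitted, and the function names, parameters, and indicator names are hypothetical.

```python
import random


def plan_trees(dimensions, n_first, n_per_dimension, max_features, seed=0):
    """Return one (group, dimension, indicators) tuple per decision tree."""
    rng = random.Random(seed)
    all_indicators = [name for names in dimensions.values() for name in names]
    plan = []
    # First group: each tree draws a random subset of all risk indicators.
    for _ in range(n_first):
        k = min(max_features, len(all_indicators))
        plan.append(("first", None, rng.sample(all_indicators, k)))
    # Second group: each tree is restricted to the indicators of one risk dimension.
    for dim, names in dimensions.items():
        for _ in range(n_per_dimension):
            plan.append(("second", dim, list(names)))
    return plan


def aggregate(plan, tree_outputs):
    """Average tree outputs into an overall score and one score per risk dimension."""
    overall = sum(tree_outputs) / len(tree_outputs)
    by_dim = {}
    for (group, dim, _), out in zip(plan, tree_outputs):
        if group == "second":
            by_dim.setdefault(dim, []).append(out)
    portrait = {dim: sum(v) / len(v) for dim, v in by_dim.items()}
    return overall, portrait


dimensions = {
    "inventory": ["inventory_turnover"],
    "sales": ["refund_rate", "traffic_conversion"],
}
plan = plan_trees(dimensions, n_first=2, n_per_dimension=1, max_features=2)
# Placeholder per-tree predictions for one target user, in plan order.
overall, portrait = aggregate(plan, [0.2, 0.4, 0.6, 0.8])
```

Because every second-group tree sees the indicators of exactly one dimension, its output can be read as the score of that dimension, while all trees together still contribute to the overall prediction.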
FIG. 6 illustrates a flow chart of a method for sample tagging through semi-supervised learning according to an embodiment of the present application.
In training a model, a large number of labeled training samples are required. Training samples may be labeled in a variety of labeling ways. For example, the training samples may be determined by labeling the samples with manual labeling. Manual labeling often takes a lot of labor and time and is sometimes difficult to accomplish due to practical conditions. The sample labeling process can also be performed by means of semi-supervised learning, so that labels are generated through data to generate training samples.
When predicting risk ratings for overdue bank loans of e-commerce merchants, labeling follows the idea of transfer learning: the task of predicting the probability that a user's future loan becomes overdue is transferred to the task of predicting the user's future operating condition. Transfer learning applies knowledge or patterns learned in one domain or task to a different but related domain or problem. Unsupervised transfer learning is a transfer learning task in which the target domain has no labeled data (at present, B-side enterprise data widely lacks labels). Transfer learning rests on the observation that some features in the feature space are domain-specific, while others are shared across domains and generalizable; that is, enterprise operating conditions and loan overdue probabilities have a large number of shared features.
According to an exemplary embodiment, a classifier is trained with labeled data, and the unlabeled data is then classified with this classifier. Unlabeled samples classified with high confidence are selected and used to further train the classifier. For example, after the unlabeled data is fed into the classifier, samples with an output probability > 0.95 are marked as negative samples, and samples with an output probability < 0.05 are marked as positive samples.
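The confidence-threshold rule in the example above can be written as a short sketch. The thresholds and the mapping of high probability to the negative class follow the example in the text; the function name and the dictionary layout are hypothetical.

```python
def pseudo_label(probabilities, high=0.95, low=0.05):
    """Split unlabeled samples into confidently labeled ones and the remainder."""
    labeled, remaining = {}, []
    for sample_id, p in probabilities.items():
        if p > high:
            labeled[sample_id] = "negative"
        elif p < low:
            labeled[sample_id] = "positive"
        else:
            remaining.append(sample_id)  # not confident enough; stays unlabeled
    return labeled, remaining


labeled, remaining = pseudo_label({"a": 0.99, "b": 0.50, "c": 0.01})
```

Samples left in `remaining` can be scored again after the classifier has been retrained on the enlarged labeled set.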
Referring to fig. 6, at S601, the dimension of the risk indicator space is reduced and the samples are sorted to obtain initial marked samples.
According to an example embodiment, the dimension of the risk indicator space may be reduced, and the samples sorted, by principal component analysis (PCA).
According to an embodiment, PCA replaces the original k features with a smaller number m of new features, each a linear combination of the old features. These linear combinations capture as much of the sample variance as possible and are uncorrelated with one another. The mapping from old features to new features captures the inherent variability in the data. According to an embodiment, m may be set to 1, so that each sample corresponds to a single risk value (a one-dimensional feature space), and the samples may be sorted by risk value. Table 4 gives an example of sample ordering after dimension reduction of the risk indicator space. Head samples and tail samples of the sorted samples are selected according to a certain proportion and marked as positive samples and negative samples, respectively, to obtain the initial marked samples, which are placed in a sample space. S603 and S605 may then be repeated until the number of marked samples in the sample space reaches a threshold.
TABLE 4 sample ordering example after Risk index space dimension reduction
Figure BDA0004178727540000131
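Step S601 with m = 1 can be sketched in plain Python as follows. The first principal component is approximated by power iteration on the sample covariance matrix, which is a stand-in for any PCA routine. The sign of a principal component is arbitrary, so which end of the ordering counts as the head is an assumption here, as are the function names, the proportion (assumed below 0.5), and the toy data.

```python
import math


def first_component_scores(rows, iterations=200):
    """Project each sample onto the first principal component (one risk value per sample)."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Sample covariance matrix of the k indicators.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    # Power iteration: repeated multiplication converges to the dominant eigenvector.
    v = [1.0 / (j + 1) for j in range(k)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(k)) for c in centered]


def initial_labels(ids, scores, proportion):
    """Mark the head of the descending ordering positive and the tail negative."""
    order = sorted(range(len(ids)), key=lambda i: scores[i], reverse=True)
    m = max(1, int(len(ids) * proportion))
    labels = {ids[i]: "positive" for i in order[:m]}
    labels.update({ids[i]: "negative" for i in order[-m:]})
    return labels


ids = ["s1", "s2", "s3", "s4", "s5"]
rows = [[0.0, 0.1], [1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 4.0]]
scores = first_component_scores(rows)
labels = initial_labels(ids, scores, proportion=0.2)
```

Only the two extremes of the ordering receive labels; the samples in between remain unlabeled and are handled by the iterative steps S603 and S605.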
At S603, a classification model is trained using the labeled samples in the sample space.
The marked samples in the sample space are split into a training set and a test set and used to train a classification model; for example, the training set is fed into a decision tree model to obtain a trained classification model.
At S605, sample labeling is performed by the trained classification model, and the sample space is expanded by using the obtained labeled sample.
The previously unlabeled samples are fed into the trained classification model to obtain their predicted label probabilities and are sorted by that probability. Head samples and tail samples of the sorted samples are selected according to a certain proportion, marked as positive samples and negative samples, respectively, and placed in the sample space alongside the existing marked samples, thereby expanding the sample space.
Therefore, sample labeling is performed by semi-supervised learning, so that labels and training samples are generated from the data itself, saving labor. In addition, collinearity between features can be eliminated.
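The loop over S603 and S605 can be sketched as below. To keep the example self-contained, a nearest-centroid scorer stands in for the decision tree classifier, and the train/test split is omitted; both are simplifications, and all names, the proportion, and the threshold are hypothetical.

```python
import math


def train_centroids(labeled):
    """Stand-in classifier: one centroid per class (label 1 = positive, 0 = negative)."""
    def centroid(label):
        rows = [f for f, y in labeled.values() if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    return centroid(1), centroid(0)


def predict_proba(model, features):
    """Score in [0, 1]; closer to the positive centroid gives a higher value."""
    pos, neg = model
    d_pos, d_neg = math.dist(features, pos), math.dist(features, neg)
    if d_pos + d_neg == 0:
        return 0.5
    return d_neg / (d_pos + d_neg)


def self_train(labeled, unlabeled, proportion, threshold):
    """Repeat training and labeling until the sample space holds `threshold` samples."""
    labeled, unlabeled = dict(labeled), dict(unlabeled)
    while len(labeled) < threshold and unlabeled:
        model = train_centroids(labeled)  # S603: train on the current sample space
        ranked = sorted(unlabeled,
                        key=lambda i: predict_proba(model, unlabeled[i]),
                        reverse=True)     # S605: score and sort the unlabeled samples
        m = max(1, int(len(ranked) * proportion))
        for i in ranked[:m]:
            labeled[i] = (unlabeled.pop(i), 1)      # head -> positive
        for i in ranked[-m:]:
            if i in unlabeled:                      # skip overlap when few samples remain
                labeled[i] = (unlabeled.pop(i), 0)  # tail -> negative
    return labeled


seed_labels = {"a": ([0.0], 0), "b": ([10.0], 1)}
pool = {"c": [1.0], "d": [9.0], "e": [5.5], "f": [2.0]}
result = self_train(seed_labels, pool, proportion=0.25, threshold=6)
final_labels = {k: y for k, (_, y) in result.items()}
```

In this toy run, the first pass labels the two most extreme samples and the second pass, with updated centroids, labels the remaining two.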
FIG. 8 illustrates a block diagram of a computing device according to an example embodiment of the present application.
As shown in fig. 8, the computing device 30 includes a processor 12 and a memory 14. Computing device 30 may also include a bus 22, a network interface 16, and an I/O interface 18. The processor 12, memory 14, network interface 16, and I/O interface 18 may communicate with each other via a bus 22.
The processor 12 may include one or more general purpose CPUs (Central Processing Unit, central processing units), microprocessors, or application specific integrated circuits, etc. for executing associated program instructions.
Memory 14 may include machine-system-readable media in the form of volatile memory, such as Random Access Memory (RAM), read Only Memory (ROM), and/or cache memory. Memory 14 is used to store one or more programs including instructions as well as data. The processor 12 may read instructions stored in the memory 14 to perform the methods described above in accordance with embodiments of the present application.
Computing device 30 may also communicate with one or more networks through network interface 16. The network interface 16 may be a wired network interface or a wireless network interface, or may be a virtual network interface.
Computing device 30 may also communicate with one or more external devices (e.g., audio input devices, audio output devices, cameras, keyboards, mice, displays, various types of sensors, etc.) through input/output (I/O) interface 18.
Bus 22 may include an address bus, a data bus, a control bus, and the like. Bus 22 provides a path for exchanging information between the components.
It should be noted that, in the implementation, the computing device 30 may further include other components necessary to achieve normal operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method. The computer readable storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, DVDs, CD-ROMs, micro-drives, and magneto-optical disks, ROM, RAM, EPROM, EEPROM, DRAM, VRAM, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), network storage devices, cloud storage devices, or any type of media or device suitable for storing instructions and/or data.
The present application also provides a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform part or all of the steps of any one of the methods described in the method embodiments above.
It will be clear to a person skilled in the art that the solution of the present application may be implemented by means of software and/or hardware. "Unit" and "module" in this specification refer to software and/or hardware capable of performing a specific function, either alone or in combination with other components, where the hardware may be, for example, a field programmable gate array, an integrated circuit, or the like.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through service interfaces, devices, or units, and may be electrical or take other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application.
The embodiments of the present application have been described and illustrated in detail above. It should be clearly understood that this application describes how to make and use particular examples, but is not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Those skilled in the art will readily appreciate from the description of example embodiments that the risk rating prediction method according to embodiments of the present application has at least one or more of the following advantages.
According to some embodiments, the enterprise risk rating prediction can be provided by converting big data generated in the operation of an electronic commerce into a risk index system and then into a risk rating model.
According to some embodiments, risk scoring portraits for different dimensions are provided based on the dimension partitioning (inventory, sales, settlement, etc.) of the risk indicator space derived from big data of the electronic commerce.
According to some embodiments, financial institutions may conduct risk admission rating during the admission phase through risk operation reports based on these highly trusted risk indicators according to the present application, saving manpower and time, and resulting in a relatively more reliable result.
According to some embodiments, enterprise risk ratings are predicted by machine learning based on e-commerce big data, thereby providing a reliable financing basis for financial institutions.
According to some embodiments, the trained random forest model embeds the client risk portrait function into the risk rating model, saving time and computation cost, with the risk portrait carried by the random forest algorithm itself.
According to some embodiments, sample labeling is performed in a semi-supervised learning mode, so that labels are generated through data to generate training samples, and labor is saved.
According to some embodiments, a random forest is used in place of logistic regression as the underlying model. When the random forest model generates its decision subtrees, a specified number of subtrees are assigned the risk indicators of a specific risk dimension. Embedding the user risk portrait function into the risk rating model in this way saves time and computation cost, since the risk portrait is carried by the random forest model itself. The model prediction results obtained by the method according to the example embodiment are thereby more accurate.
The foregoing may be better understood in light of the following clauses:
Clause 1, a method of generating a marked sample based on electronic commerce data, comprising:
performing dimension reduction and sequencing on a risk index space containing a plurality of risk indexes to obtain an initial mark sample and placing the initial mark sample into a sample space;
the following steps are repeatedly performed until the number of marked samples in the sample space reaches a threshold value:
training a classification model using the labeled samples in the sample space;
and labeling samples through the trained classification model, and expanding the sample space by using the obtained labeled samples.
Clause 2, the method of clause 1, wherein the dimension reducing and ordering the risk indicator space comprising the plurality of risk indicators, obtaining an initial marked sample and placing the initial marked sample in the sample space, comprises:
substituting the plurality of risk indicators with fewer risk features by PCA;
sorting samples according to the risk features;
and marking the head sample and the tail sample in the sequenced samples as positive samples and negative samples respectively according to a certain proportion, and placing the positive samples and the negative samples in a sample space as initial marked samples.
Clause 3, the method of clause 1, wherein the dimension reducing and ordering the risk indicator space comprising the plurality of risk indicators, obtaining an initial marked sample and placing the initial marked sample in the sample space, comprises:
replacing a plurality of risk indicators with a risk feature, the risk feature being a linear combination of the plurality of risk indicator features;
sorting the samples by the value of the risk feature;
and marking the head sample and the tail sample in the sequenced samples as positive samples and negative samples respectively according to a certain proportion, and placing the positive samples and the negative samples in a sample space as initial marked samples.
Clause 4, the method of clause 1, wherein training a classification model using the labeled samples in the sample space comprises:
placing the marked samples in the sample space into a decision tree model for training to obtain a trained classification model.
Clause 5, the method of clause 4, wherein labeling samples through the trained classification model and expanding the sample space with the resulting labeled samples comprises:
placing the previously unlabeled samples into the trained classification model to obtain their predicted label probabilities, and sorting them;
and marking the head sample and the tail sample in the sequenced samples as positive samples and negative samples respectively according to a certain proportion, and placing the positive samples and the negative samples in a sample space of the existing marked samples.
Clause 6, the method of clause 1, wherein the plurality of risk indicators comprises one or more of sales odds ratio, inventory turnover, flow conversion, infringement complaint number, and refund rate.
Clause 7, a method of training a risk rating predictive model, comprising:
generating a labeled sample using the method of any one of clauses 1-6;
taking at least part of the marked sample as a training sample;
selecting a plurality of risk indicators;
dividing the plurality of risk indicators into at least one risk dimension;
training a random forest model based on the training samples and the plurality of risk indicators and the at least one risk dimension,
the random forest model comprises a first group of decision trees and a second group of decision trees, the first group of decision trees randomly acquire the plurality of marking samples and the plurality of risk indexes, and the second group of decision trees randomly acquire the training samples and respectively acquire the risk indexes of each risk dimension.
Clause 8, the method of clause 7, wherein the results output by the first set of decision trees and the second set of decision trees are used as the overall risk prediction value output by the model; and the risk prediction value of each risk dimension output by the second set of decision trees is used as a risk portrait.
Clause 9, the method of clause 7, wherein the selecting a plurality of risk indicators comprises:
filtering the indicators and retaining a plurality of risk indicators with the top importance ranks.
Clause 10, the method of clause 7, wherein the selecting a plurality of risk indicators comprises:
applying a KS test ranking to the indicator space and retaining a plurality of risk indicators with the top importance ranks.
Clause 11, the method of clause 7, wherein the plurality of risk indicators comprise one or more of sales compliance ratio, inventory turnover, flow conversion, infringement complaint number, and refund rate.
Clause 12, the method of clause 11, wherein the at least one risk dimension comprises: at least one of return risk, inventory risk, base risk, settlement performance, and sales performance.
Clause 13, the method of clause 7, wherein the method further comprises: standardizing the plurality of risk indicators.
Clause 14, the method of clause 7, wherein the method is used by a financial institution to model loan risk ratings for e-commerce merchants.
Clause 15, a computing device, comprising:
a processor;
a memory having a computer program stored thereon;
The method of any one of clauses 1-14 being implemented when the processor executes the computer program.
Exemplary embodiments of the present application are specifically illustrated and described above. It is to be understood that this application is not limited to the details of construction, arrangement or method of implementation described herein; on the contrary, the intention is to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (15)

1. A method of generating a marked sample based on e-commerce data, comprising:
performing dimension reduction on a risk index space containing a plurality of risk indexes, sequencing samples, obtaining an initial marked sample, and placing the initial marked sample into a sample space;
the following steps are repeatedly performed until the number of marked samples in the sample space reaches a threshold value:
training a classification model using the labeled samples in the sample space;
and labeling samples through the trained classification model, and expanding the sample space by using the obtained labeled samples.
2. The method of claim 1, wherein said dimension reducing and sorting samples of a risk indicator space containing a plurality of risk indicators, obtaining initial marked samples and placing them in a sample space, comprises:
substituting the plurality of risk indicators with fewer risk features by PCA;
sorting samples according to the risk features;
and marking the head sample and the tail sample in the sequenced samples as positive samples and negative samples respectively according to a certain proportion, and placing the positive samples and the negative samples in a sample space as initial marked samples.
3. The method of claim 1, wherein said dimension reducing and sorting samples of a risk indicator space containing a plurality of risk indicators, obtaining initial marked samples and placing them in a sample space, comprises:
replacing a plurality of risk indicators with a risk feature, the risk feature being a linear combination of the plurality of risk indicator features;
sorting the samples by the value of the risk feature;
and marking the head sample and the tail sample in the sequenced samples as positive samples and negative samples respectively according to a certain proportion, and placing the positive samples and the negative samples in a sample space as initial marked samples.
4. The method of claim 1, wherein training a classification model using the labeled samples in the sample space comprises:
and placing the marked samples in the sample space into a decision tree model for training to obtain a trained classification model.
5. The method of claim 4, wherein sample tagging by the trained classification model and expanding the sample space with the resulting tagged samples comprises:
placing the sample which is not marked before into a trained classification model to obtain the predicted marking probability of the sample which is not marked before and sequencing;
and marking the head sample and the tail sample in the sequenced samples as positive samples and negative samples respectively according to a certain proportion, and placing the positive samples and the negative samples in a sample space of the existing marked samples.
6. The method of claim 1, wherein the plurality of risk indicators comprises one or more of a sales-to-cycle ratio, an inventory turnover, a flow conversion, an infringement complaint number, and a refund rate.
7. A method of training a risk rating predictive model, comprising:
generating a labeled sample using the method of any one of claims 1-6;
taking at least part of the marked sample as a training sample;
selecting a plurality of risk indicators;
dividing the plurality of risk indicators into at least one risk dimension;
training a random forest model based on the training samples and the plurality of risk indicators and the at least one risk dimension,
the random forest model comprises a first group of decision trees and a second group of decision trees, the first group of decision trees randomly acquire the plurality of marking samples and the plurality of risk indexes, and the second group of decision trees randomly acquire the training samples and respectively acquire the risk indexes of each risk dimension.
8. The method of claim 7, wherein the results output by the first set of decision trees and the second set of decision trees are used as the overall risk prediction value output by the model; and the risk prediction value of each risk dimension output by the second set of decision trees is used as a risk portrait.
9. The method of claim 7, wherein the selecting a plurality of risk indicators comprises:
the indexes are filtered, and a plurality of risk indexes with the top importance rank are reserved.
10. The method of claim 7, wherein the selecting a plurality of risk indicators comprises:
a KS test ranking is applied to the index space, retaining a plurality of risk indices with top importance ranks.
11. The method of claim 7, wherein the plurality of risk indicators includes one or more of sales equivalence ratio, inventory turnover, flow conversion, infringement complaint number, and refund rate.
12. The method of claim 11, wherein the at least one risk dimension comprises: at least one of return risk, inventory risk, base risk, settlement performance, and sales performance.
13. The method of claim 7, further comprising: standardizing the plurality of risk indicators.
14. The method of claim 7, wherein the method is used for modeling loan risk ratings by financial institutions for electronic commerce.
15. A computing device, comprising:
a processor;
a memory having a computer program stored thereon;
the method of any of claims 1-14 being implemented when the processor executes the computer program.
CN202310398899.6A 2023-04-04 2023-04-04 Method and computing device for generating marking sample and training risk rating prediction model Pending CN116384750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310398899.6A CN116384750A (en) 2023-04-04 2023-04-04 Method and computing device for generating marking sample and training risk rating prediction model

Publications (1)

Publication Number Publication Date
CN116384750A true CN116384750A (en) 2023-07-04

Family

ID=86980430

Country Status (1)

Country Link
CN (1) CN116384750A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination