CN115545912B - Credit risk prediction method and device based on green identification information - Google Patents


Info

Publication number
CN115545912B
Authority
CN
China
Prior art keywords
credit
category
green
sample
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211513284.5A
Other languages
Chinese (zh)
Other versions
CN115545912A (en
Inventor
罗文辉
朱赛
张楠
梁重庆
吉秋红
连霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
United Equatorial Environmental Assessment Co ltd
Original Assignee
United Equatorial Environmental Assessment Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by United Equatorial Environmental Assessment Co ltd
Priority to CN202211513284.5A
Publication of CN115545912A
Application granted
Publication of CN115545912B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a credit risk prediction method and device based on green identification information. The method comprises the following steps: receiving green credit application information, and determining a corresponding green credit category according to the green credit application information; acquiring complete samples of the green credit category; when the number of complete samples is smaller than a preset number, acquiring samples of an approximate green credit category and of a related traditional category; mixing and fusing the approximate green credit category samples and the related traditional category samples to generate virtual samples; verifying the virtual samples through a preset classification decision tree by using the Gini coefficient; and adding the virtual samples that pass verification to the samples to train a random forest, inputting the received green credit application information into the trained random forest, and outputting a credit risk prediction result. The accuracy of credit risk prediction can thereby be improved when sufficient samples are lacking.

Description

Credit risk prediction method and device based on green identification information
Technical Field
The invention relates to the technical field of credit risk prediction, in particular to a credit risk prediction method and device based on green identification information.
Background
Green credit is a new type of credit policy. Its essence is to properly handle the relationship between the financial industry and sustainable development. It mainly takes the form of financing for ecological protection, ecological construction and green industry, for which new financial systems and financial instruments are constructed.
Credit is a financial instrument whose risk must be predicted so that possible risks can be avoided as far as possible. The most important of these is default risk, since a default causes the loss of all or part of the loan. Traditional credit risk prediction can be carried out with a mathematical model, in particular a neural network model, and the risk prediction result serves as an important reference for credit approval.
When a neural network model is used to predict green credit risk, the number of samples for each kind of green credit is too small, because green credit comes in many varieties, project cycles are long and green credit has only a short history. The neural network model therefore cannot be fully trained, and its prediction accuracy is low. In addition, green credit financing projects are mostly emission-reduction and energy-saving projects whose profit points are concentrated in carbon emission trading, so their influencing factors differ greatly from those of traditional credit. The input data are consequently scattered, and the prediction of the neural network model is not accurate enough to serve as a reference for green credit approval.
Disclosure of Invention
The embodiments of the invention provide a credit risk prediction method and device based on green identification information, which are used for solving the technical problem in the prior art that prediction accuracy is reduced because credit samples are too few and the data are scattered.
In a first aspect, an embodiment of the present invention provides a credit risk prediction method based on green identification information, including:
receiving credit application information, extracting keywords for the field, loan direction and project name from the credit application information, and determining multi-level industry category content according to the keywords;
judging, according to the lowest-level classification content in the multi-level content, whether the application is for green credit, and determining the corresponding green credit category when it is;
acquiring a complete sample of the green credit category;
when the number of the complete samples of the green credit category is smaller than the preset number of samples, determining a related traditional category and an approximate green credit category according to the green credit category;
acquiring an approximate green credit category sample and a related traditional category sample;
mixing and fusing the approximate green credit category sample and the related traditional category sample to generate a virtual sample;
verifying the virtual sample through a preset classification decision tree by using the Gini coefficient, and screening to obtain virtual samples that pass verification;
adding the virtual sample which is qualified in verification into the sample to train the random forest, inputting the received green credit application information into the trained random forest, and outputting a credit risk prediction result based on green identification information;
wherein the mixing and fusing of the approximate green credit category sample and the related traditional category sample comprises the following steps:
dividing initial data of related traditional category samples into project data and credit applicant data, and respectively establishing a project data matrix and a credit applicant data matrix;
extracting, for each assessment node in the approximate green credit category sample, the proportional coefficients of the corresponding project data, the policy subsidy proportion and the environmental benefit conversion parameters, and generating assessment node influence matrices, wherein each assessment node influence matrix corresponds to one project assessment node;
multiplying the project data matrix and the credit applicant data matrix by each assessment node influence matrix respectively to obtain a project data matrix and a credit applicant data matrix corresponding to each assessment node;
and extracting corresponding elements from the project data matrix and the credit applicant data matrix corresponding to each assessment node to serve as data of a virtual sample.
Further, the random forest includes:
five decision trees, wherein the data input to each decision tree includes at least part of the data input to another decision tree.
Further, the five decision trees include:
a profitability decision tree, an operating capacity decision tree, an environmental benefit decision tree, an emission reduction yield decision tree, and a subsidy yield decision tree.
Further, the determining the relevant traditional category according to the green credit category includes:
determining a minimum level corresponding to the green credit category;
acquiring the name of the same level as the minimum level;
and determining the related traditional category according to the names of the same level as the minimum level.
In a second aspect, an embodiment of the present invention further provides a credit risk prediction apparatus based on green identification information, including:
the receiving module is used for receiving credit application information, extracting keywords for the field, loan direction and project name from the credit application information, and determining multi-level industry category content according to the keywords;
the judging module is used for judging, according to the lowest-level classification content in the multi-level content, whether the application is for green credit, and determining the corresponding green credit category when it is;
the green credit category complete sample acquisition module is used for acquiring a complete sample of the green credit category;
the determining module is used for determining a relevant traditional category and an approximate green credit category according to the green credit category when the number of the complete samples of the green credit category is smaller than the preset number of samples;
an approximate and related sample acquisition module for acquiring approximate green credit category samples and related traditional category samples;
the generation module is used for carrying out mixed fusion on the approximate green credit category sample and the related traditional category sample to generate a virtual sample;
the verification module is used for verifying the virtual sample through a preset classification decision tree by using the Gini coefficient, and screening to obtain virtual samples that pass verification;
the training module is used for adding the virtual samples which are qualified in verification into the samples to train the random forest, inputting the received green credit application information into the trained random forest, and outputting a credit risk prediction result based on the green identification information;
the generating module comprises:
the dividing unit is used for dividing the related traditional category sample initial data into project data and credit applicant data, and respectively establishing a project data matrix and a credit applicant data matrix;
the extraction unit is used for extracting, for each assessment node in the approximate green credit category sample, the proportional coefficients of the corresponding project data, the policy subsidy proportion and the environmental benefit conversion parameters, and generating assessment node influence matrices, wherein each assessment node influence matrix corresponds to one project assessment node;
the multiplying unit is used for multiplying the project data matrix and the credit applicant data matrix by each assessment node influence matrix respectively to obtain a project data matrix and a credit applicant data matrix corresponding to each assessment node;
and a unit used for extracting corresponding elements from the project data matrix and the credit applicant data matrix corresponding to each assessment node as data of the virtual sample.
Further, the random forest includes:
five decision trees, wherein the data input to each decision tree includes at least part of the data input to another decision tree.
Further, the five decision trees include:
a profitability decision tree, an operating capacity decision tree, an environmental benefit decision tree, an emission reduction yield decision tree, and a subsidy yield decision tree.
Still further, the determining module includes:
the minimum level determining unit is used for determining a minimum level corresponding to the green credit category;
an obtaining unit, configured to obtain a name of the same level as the minimum level;
and the related traditional category determining unit is used for determining the related traditional category according to the name of the same level as the minimum level.
The credit risk prediction method and device based on green identification information provided by the embodiments of the invention receive green credit application information and determine the corresponding green credit category according to it; acquire complete samples of the green credit category; when the number of complete samples of the green credit category is smaller than the preset number of samples, determine a related traditional category according to the green credit category; acquire related traditional category samples and incomplete samples of the green credit category; acquire samples of the approximate green credit category and the related traditional category; mix and fuse the approximate green credit category samples and the related traditional category samples to generate virtual samples; mix and fuse the incomplete samples of the green credit category and the related traditional category samples to generate virtual samples; verify the virtual samples through a preset classification decision tree by using the Gini coefficient; and add the virtual samples that pass verification to the samples to train the random forest, input the received green credit application information into the trained random forest, and output a credit risk prediction result based on the green identification information. The similarity and correlation between the sample data of the approximate green credit category and the related traditional category, on the one hand, and the sample data of the green credit category, on the other, are used to fuse the approximate green credit category samples and the related traditional category samples, which preserves the correlation between the data and generates virtual green credit category sample data. The correlation between the green sample identification data is then used to verify the reliability of the virtual green credit category sample data through the preset classification decision tree by using the Gini coefficient. The decision forest is trained with the virtual green credit category sample data that pass the reliability verification together with the complete samples of the green credit category. This effectively addresses the shortage of training samples, and the trained decision forest is used to predict the credit risk of the green credit category.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 is a flowchart of a credit risk prediction method based on green identification information according to an embodiment of the present invention;
FIG. 2 is a flowchart of a credit risk prediction method based on green identification information according to a second embodiment of the present invention;
FIG. 3 is a block diagram of a credit risk prediction device based on green identification information according to a third embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1 is a flowchart of a credit risk prediction method based on green identification information according to an embodiment of the present invention, where the embodiment is applicable to a situation of accurately predicting green credit risk in the absence of a sufficient green credit category sample, and specifically includes the following steps:
Step 110, receiving credit application information, reading keywords for the field, loan direction and project name from the credit application information, and determining multi-level industry category content according to the keywords.
In this embodiment, the green credit applicant or the green credit customer manager may enter green credit application information in the green credit approval system. The green credit application information may include applicant information, such as the applicant's name, time of establishment, recent financial data and loan data, and project information, such as the project category, design indicators, implementation periods, indicators corresponding to each period, total project investment, predicted project credits, policy subsidies and environmental benefit conversion parameters. The corresponding keywords are extracted according to the preset data types in the entered application information and their preset positions in the application form.
The multi-level industry category content is then determined according to the extracted keywords. In this embodiment, it may be determined based on the keywords extracted from the project information. The national economic industry classification is organized into four levels of categories by inclusion relationship, each with a corresponding code, and the specific descriptions listed under the fourth-level categories can serve as a supplementary fifth level. The project name and category in the keywords can be matched against the national economic industry classification and then against the corresponding subclasses to obtain the multi-level industry category content.
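The keyword matching described above can be sketched as follows. This is only a minimal illustration: the table entries, the placeholder codes such as `L1-A`, and the helper name `match_industry` are hypothetical and are not taken from the national economic industry classification itself.

```python
# Hypothetical excerpt of a four-level industry table; codes are placeholders.
INDUSTRY_TABLE = [
    {
        "levels": ["L1-A energy supply", "L2-A1 heat supply",
                   "L3-A11 heat production", "L4-A111 heat production and supply"],
        # Descriptions under the fourth level act as a supplementary fifth level.
        "descriptions": ["biomass heating", "coal-fired heating"],
    },
    {
        "levels": ["L1-B manufacturing", "L2-B1 equipment",
                   "L3-B11 power equipment", "L4-B111 boiler manufacturing"],
        "descriptions": ["boiler manufacturing"],
    },
]


def match_industry(keywords, table=INDUSTRY_TABLE):
    """Return the four levels plus the matched fifth-level description, or None."""
    text = " ".join(k.lower() for k in keywords)
    for entry in table:
        for description in entry["descriptions"]:
            if description in text:
                return entry["levels"] + [description]
    return None
```

A call such as `match_industry(["Biomass heating plant"])` returns a five-element list whose last element is the lowest-level classification used in the next step.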
Step 120, judging whether the application is for green credit according to the lowest-level classification content in the multi-level content, and determining the corresponding green credit category when it is.
The lowest-level classification in the extracted multi-level content, namely the fourth-level or fifth-level content, can be matched against the content of the green industry catalog, and whether the application is for green credit is judged according to the matching result. When the match succeeds, the application is determined to be green credit, and the matched subclass is taken as the corresponding green credit category. For example, the green fuel category may be determined from "biomass heating, waste heat and excess pressure utilization" in the fifth-level classification.
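As a rough sketch of this judgment, the lowest-level content can be looked up in a catalog mapping. The single catalog entry below follows the example in the text; the dictionary layout and the function name `green_credit_category` are assumptions made for illustration.

```python
# Hypothetical green industry catalog: lowest-level description -> green credit category.
GREEN_CATALOG = {
    "biomass heating, waste heat and excess pressure utilization": "green fuel",
}


def green_credit_category(multilevel_content, catalog=GREEN_CATALOG):
    """Return the green credit category for the lowest level, or None if not green."""
    lowest = multilevel_content[-1].strip().lower()
    return catalog.get(lowest)
```

A return value of `None` means the match failed and the application is not treated as green credit.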
Step 130, acquiring complete samples of the green credit category.
All information on previous credits of the same green credit category is acquired as complete samples. Since green credit runs over a long period that is typically divided into multiple stages, all data at each stage node should be collected, especially time-varying data such as the policy subsidy proportion and the environmental benefit conversion parameters.
Step 140, determining a related traditional category and an approximate green credit category according to the green credit category when the number of complete samples of the green credit category is smaller than a preset number of samples.
Training a neural network model or other classifier generally requires a sufficient number of samples. Since green projects run for a long time, usually 4 to 7 years, few green projects have been executed through all of their stages. It is therefore first necessary to determine the number of fully executed projects of the green credit category, and if that number is smaller than the preset number of samples, additional virtual samples are generated for training the classifier. In this embodiment, the virtual samples may be generated from information on the related traditional category and the approximate green credit category. The related traditional category may be a traditional industry in the same industry category as the green credit category in the national economic industry classification table described above. For example, if the green credit category is the waste biomass energy heat and pressure supply category, the corresponding traditional category is the coal-fired energy heat and pressure supply category. The approximate green credit category may be determined from the environmental benefit effect of the green credit project: a category whose environmental benefit effect is the same as or similar to that of the green credit project is selected as the approximate green credit category. Taking the biomass energy heat and pressure supply category as an example, the environmental benefit effect achieved is energy conservation and emission reduction, so the natural gas heat and pressure supply category can be used as an approximate green credit category.
Illustratively, the determining the relevant traditional category from the green credit category may include: determining a minimum level corresponding to the green credit category; acquiring the name of the same level as the minimum level; and determining the related traditional category according to the names of the same level as the minimum level.
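The selection logic of step 140 might look like the sketch below. The threshold value, the sibling table and the environmental benefit tags are all hypothetical placeholders; only the three category names follow the heat and pressure supply example in the text.

```python
PRESET_SAMPLE_COUNT = 50  # hypothetical threshold, not a value from the source

# Hypothetical names at the same lowest level under one parent category.
SIBLINGS = {
    "heat and pressure supply": [
        "biomass energy heat and pressure supply",
        "coal-fired energy heat and pressure supply",
        "natural gas heat and pressure supply",
    ],
}

# Hypothetical environmental benefit tags, listed for green categories only.
GREEN_BENEFIT = {
    "biomass energy heat and pressure supply": "energy saving and emission reduction",
    "natural gas heat and pressure supply": "energy saving and emission reduction",
}


def needs_virtual_samples(complete_count, preset=PRESET_SAMPLE_COUNT):
    """Virtual samples are needed when complete samples fall short of the preset."""
    return complete_count < preset


def related_categories(green_category):
    """Return (related traditional categories, approximate green categories)."""
    traditional = []
    for children in SIBLINGS.values():
        if green_category in children:
            # Same-level names that are not green are treated as traditional.
            traditional = [c for c in children if c not in GREEN_BENEFIT]
    benefit = GREEN_BENEFIT[green_category]
    approximate = [c for c, b in GREEN_BENEFIT.items()
                   if c != green_category and b == benefit]
    return traditional, approximate
```

Here the related traditional category comes from names at the same lowest level, while the approximate green credit category is chosen by matching the environmental benefit effect, mirroring the two criteria in the text.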
Step 150, acquiring samples of the approximate green credit category and the related traditional category.
Because related traditional category samples cannot reflect the long-cycle characteristic of green credit and contain no data on the policy subsidy proportion or the environmental benefit conversion parameters, samples of the approximate green credit category and incomplete samples of the green credit category are needed as a supplement when the related traditional category samples are later processed, so as to ensure the completeness of the virtual sample data.
Step 160, mixing and fusing the approximate green credit category samples and the related traditional category samples to generate virtual samples.
The approximate green credit category samples include the data at the specific node of each stage over the complete project cycle, in particular changing data such as the policy subsidy proportion and the environmental benefit conversion parameters, while the corresponding related traditional category can provide similar information on project construction cost, operation, cost and the like. By fusing the two, a virtual sample that embodies the credit characteristics of the green credit category can therefore be generated.
Optionally, the mixing and fusing of the approximate green credit category samples and the related traditional category samples may include: dividing the initial data of the related traditional category samples into project data and credit applicant data, and respectively establishing a project data matrix and a credit applicant data matrix; extracting, for each assessment node in the approximate green credit category sample, the proportional coefficients of the corresponding project data, the policy subsidy proportion and the environmental benefit conversion parameters, and generating assessment node influence matrices, wherein each assessment node influence matrix corresponds to one project assessment node; multiplying the project data matrix and the credit applicant data matrix by each assessment node influence matrix respectively to obtain a project data matrix and a credit applicant data matrix corresponding to each assessment node; and extracting corresponding elements from the project data matrix and the credit applicant data matrix corresponding to each assessment node to serve as data of a virtual sample.
The assessment nodes may be the assessment time nodes reached as a green credit project is executed. Because the green project cycle is long, the subsidy and the green benefit at each time node differ across the stages of investment, trial operation, small-scale pilot and full-scale production. The project data and the policy subsidy proportion corresponding to each time node therefore need to be considered.
In addition, the project data matrix may comprise the various financial data related to the project, filled in sequentially according to the correspondence between financial data categories and matrix element positions. The project data matrix should be an m×m matrix, and missing entries can be filled with 0. Correspondingly, the number of rows of each assessment node influence matrix is the same as the number of columns of the corresponding project data matrix.
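The fusion by matrix multiplication can be illustrated with the following pure-Python sketch. The helper names, the diagonal form of the influence matrix and the choice to keep every element of the resulting matrices are assumptions; the source does not specify which elements are extracted or how the influence matrix is laid out.

```python
def pad_square(values, m):
    """Fill an m x m matrix row by row, padding missing entries with 0."""
    if len(values) > m * m:
        raise ValueError("too many values for an m x m matrix")
    flat = list(values) + [0.0] * (m * m - len(values))
    return [flat[i * m:(i + 1) * m] for i in range(m)]


def matmul(a, b):
    """Plain matrix product; columns of a must equal rows of b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(project, applicant, influence_by_node):
    """Build one virtual sample per assessment node.

    project / applicant: m x m matrices from a related traditional category sample.
    influence_by_node: {node name: m x m influence matrix} derived from an
    approximate green credit category sample (hypothetical layout).
    """
    samples = {}
    for node, influence in influence_by_node.items():
        p = matmul(project, influence)
        q = matmul(applicant, influence)
        # Assumption: all elements are kept as the virtual sample's data.
        samples[node] = [v for row in p for v in row] + [v for row in q for v in row]
    return samples
```

For instance, with `project = pad_square([100, 20, 5], 2)` and an influence matrix `[[1.0, 0.0], [0.0, 0.5]]` for a "trial operation" node, the second column of the project data is halved at that node while the first is left unchanged.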
Step 170, verifying the virtual samples through a preset classification decision tree by using the Gini coefficient.
The most important part of generating a decision tree is feature selection, that is, selecting one feature from the features in the training data as the splitting criterion of the current node. Different quantitative criteria for this choice give rise to different decision tree algorithms. Using the Gini coefficient, the feature that gives the greatest gain can be selected as the test feature, and the samples at the node are divided into subsets by that feature so that samples of different classes are mixed as little as possible within each subset and the information (entropy) required to classify the samples in each subset is minimal.
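For reference, the standard Gini impurity used by CART-style decision trees, which is presumably the coefficient meant here, can be computed as below. The function names and the per-feature threshold dictionary are illustrative choices, not part of the source.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_feature(rows, labels, thresholds):
    """Pick the feature whose split leaves the least class mixing."""
    return min(thresholds, key=lambda f: split_gini(rows, labels, f, thresholds[f]))
```

A lower weighted impurity after the split means the subsets are purer, which corresponds to the requirement that different classes be mixed as little as possible within each subset.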
In this embodiment, the preset classification decision tree may first be trained with the small number of complete samples of the green credit category, after which the Gini coefficient of its first node is calculated. The tree is then trained with a small number of the generated virtual samples, and the Gini coefficient of the first node is calculated again. The closer the two values are, the closer the classification characteristics at the first node. This indicates that the virtual samples are of high quality and consistent with complete samples of the real green credit category, so they can be used to train the subsequent risk prediction model.
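One way to picture this comparison is sketched below. It makes two assumptions that the source does not state: an exhaustive search for the best single-threshold split stands in for training the preset tree's first node, and a fixed `tolerance` decides whether the two first-node values are close enough.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def root_gini(rows, labels):
    """Lowest weighted Gini over all single-feature threshold splits."""
    n = len(labels)
    best = gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            best = min(best, score)
    return best


def virtual_batch_ok(real, virtual, tolerance=0.1):
    """Accept the virtual batch if its first-node Gini is close to the real one.

    real / virtual: (rows, labels) tuples; tolerance is a hypothetical cutoff.
    """
    return abs(root_gini(*real) - root_gini(*virtual)) <= tolerance
```

A virtual batch whose labels separate along the same feature as the real samples passes, while a batch whose labels are poorly separated yields a noticeably higher first-node value and is screened out.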
Step 180, adding the virtual samples that pass verification to the samples to train the random forest, inputting the received green credit application information into the trained random forest, and outputting a credit risk prediction result based on the green identification information.
The random forest can be trained with the large number of virtual samples that have passed verification. Trained on a large number of samples, the random forest can produce the corresponding risk prediction result accurately. After training is completed, the currently received green credit application information is input into the trained random forest, and the green credit risk result output by the random forest is received.
This embodiment receives green credit application information and determines the corresponding green credit category according to it; acquires complete samples of the green credit category; when the number of complete samples of the green credit category is smaller than the preset number of samples, determines a related traditional category according to the green credit category; acquires related traditional category samples and incomplete samples of the green credit category; acquires samples of the approximate green credit category and the related traditional category; mixes and fuses the approximate green credit category samples and the related traditional category samples to generate virtual samples; mixes and fuses the incomplete samples of the green credit category and the related traditional category samples to generate virtual samples; verifies the virtual samples through a preset classification decision tree by using the Gini coefficient; and adds the virtual samples that pass verification to the samples to train the random forest, inputs the received green credit application information into the trained random forest, and outputs a credit risk prediction result based on the green identification information. The similarity and correlation between the sample data of the approximate green credit category and the related traditional category, on the one hand, and the sample data of the green credit category, on the other, are used to fuse the approximate green credit category samples and the related traditional category samples, which preserves the correlation between the data and generates virtual green credit category sample data. The correlation between the green sample identification data is then used to verify the reliability of the virtual green credit category sample data through the preset classification decision tree by using the Gini coefficient. The decision forest is trained with the virtual green credit category sample data that pass the reliability verification together with the complete samples of the green credit category. This effectively addresses the shortage of training samples, and the trained decision forest is used to predict the credit risk of the green credit category.
Embodiment two
Fig. 2 is a flowchart of a credit risk prediction method based on green identification information according to a second embodiment of the present invention. Referring to fig. 2, this embodiment optimizes the random forest structure on the basis of the above embodiment. The random forest includes five decision trees: a profitability decision tree, an operating capacity decision tree, an environmental benefit decision tree, an emission reduction yield decision tree, and a subsidy yield decision tree.
Correspondingly, the credit risk prediction method based on the green identification information provided by the embodiment comprises the following steps:
Step 210, receiving credit application information, extracting keywords for the field, loan direction, and project name from the credit application information, and determining industry category multi-level content according to the keywords.
Step 220, judging whether the credit is green credit according to the minimum classification content in the multi-level content, and determining the corresponding green credit category when it is green credit.
Step 230, obtaining complete samples of the green credit category.
Step 240, determining a related traditional category and an approximate green credit category according to the green credit category when the number of complete samples of the green credit category is smaller than a preset number of samples.
Step 250, obtaining approximate green credit category samples and related traditional category samples.
Step 260, mixing and fusing the approximate green credit category samples and the related traditional category samples to generate virtual samples.
Step 270, verifying the virtual samples through a preset classification decision tree using the Gini coefficient, and screening out the virtual samples that pass verification.
Step 280, adding the virtual samples that pass verification to the samples to train a random forest, inputting the received green credit application information into the trained random forest, and outputting a credit risk prediction result based on the green identification information, wherein the random forest includes five decision trees: a profitability decision tree, an operational capacity decision tree, an environmental benefit decision tree, an emission reduction yield decision tree, and a subsidy yield decision tree.
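The control flow of the steps above, in which real samples are supplemented only when they are too few and only verified virtual samples are admitted, can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Sample` type and the `generate_virtual` and `is_qualified` callbacks are hypothetical placeholders for the generation and verification steps.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: one credit record as a dictionary of fields.
Sample = Dict[str, object]


def assemble_training_set(
    complete: Sequence[Sample],
    preset_count: int,
    generate_virtual: Callable[[], List[Sample]],
    is_qualified: Callable[[Sample], bool],
) -> List[Sample]:
    """Return the samples that would be used to train the random forest."""
    training = list(complete)
    # Enough complete samples of the green credit category: no augmentation.
    if len(training) >= preset_count:
        return training
    # Otherwise generate virtual samples (assumed to wrap the category lookup,
    # sample acquisition, and mixing/fusing steps) ...
    virtual = generate_virtual()
    # ... and keep only those that pass verification.
    training.extend(s for s in virtual if is_qualified(s))
    return training
```

Passing the two steps in as callbacks keeps the threshold logic independent of how fusion and verification are actually carried out.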
A random forest is a classifier with advantages such as balancing errors and computing proximities between cases, so a random forest is used as the classifier in this embodiment. On the one hand, because most of the training data are constructed virtual data, the error-balancing property can reduce recognition errors. On the other hand, the five selected decision trees (the profitability decision tree, the operational capacity decision tree, the environmental benefit decision tree, the emission reduction yield decision tree, and the subsidy yield decision tree) each output a result that reflects the green credit in the corresponding aspect. The evaluation standard of green credit differs from that of traditional credit: stable recovery of interest is not the only decision point, and social benefit and economic benefit must be considered together. The five decision trees are set for this reason, and they also influence one another. For example, operation and profit are inseparable from environmental benefit, and emission reduction income is related to both environmental benefit and profitability. Because the data used by the decision trees are correlated, controlling and constraining the input data across the five decision trees can further enhance the similarity among the decision trees in the random forest, so that the discrimination results of the decision trees tend to be consistent and contradictory discrimination results caused by overfitting to particular types of input are avoided.
In addition, since the five decision trees also affect one another, the data input to each decision tree is set to include at least part of the data input to another decision tree. Because the data generated for the individual evaluation indexes of green credit are correlated, this further strengthens the constraint among the decision trees and helps keep the discrimination results of the decision trees consistent.
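The input-sharing constraint can be illustrated with a small sketch. The feature names below are hypothetical, since the text does not list the actual inputs of each tree; the sketch only shows one way to check that every tree shares at least one input with another tree and to project a record onto a tree's inputs.

```python
from typing import Dict, List

# Hypothetical feature names, chosen for illustration only.
TREE_FEATURES: Dict[str, List[str]] = {
    "profitability": ["net_margin", "revenue", "emission_reduction_income"],
    "operational_capacity": ["asset_turnover", "revenue"],
    "environmental_benefit": ["co2_reduced", "asset_turnover", "net_margin"],
    "emission_reduction_yield": ["emission_reduction_income", "co2_reduced"],
    "subsidy_yield": ["subsidy_ratio", "co2_reduced"],
}


def shares_input_with_another(tree: str, features: Dict[str, List[str]]) -> bool:
    """True if the tree's inputs overlap with the inputs of some other tree."""
    own = set(features[tree])
    return any(own & set(cols) for name, cols in features.items() if name != tree)


def project_record(record: Dict[str, float], tree: str,
                   features: Dict[str, List[str]]) -> Dict[str, float]:
    """Select only the fields of a record that the given tree consumes."""
    return {name: record[name] for name in features[tree]}
```

A check such as `all(shares_input_with_another(t, TREE_FEATURES) for t in TREE_FEATURES)` could be run before training to confirm that the overlap constraint holds for a chosen feature layout.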
In this embodiment the random forest structure is optimized so that the random forest includes five decision trees: a profitability decision tree, an operational capacity decision tree, an environmental benefit decision tree, an emission reduction yield decision tree, and a subsidy yield decision tree. The error-balancing property reduces recognition errors, and the mutual correlation of the discrimination factors further enhances the similarity among the decision trees in the random forest, so that the discrimination results of the decision trees tend to be consistent and overfitting and contradictory discrimination results can be avoided. The accuracy of credit risk prediction based on green identification information using the random forest is thereby further improved.
Example three
Fig. 3 is a block diagram of a credit risk prediction device based on green identification information according to a third embodiment of the present invention. Referring to Fig. 3, the credit risk prediction device based on green identification information includes:
a receiving module 310, configured to receive credit application information, extract keywords for the field, loan direction, and project name from the credit application information, and determine industry category multi-level content according to the keywords;
a judging module 320, configured to judge whether the content is green credit according to the minimum classification content in the multi-level content, and determine a corresponding green credit category when the content is green credit;
a green credit category complete sample acquisition module 330 for acquiring a complete sample of the green credit category;
a determining module 340, configured to determine a relevant traditional category and an approximate green credit category according to the green credit category when the number of complete samples of the green credit category is less than a preset number of samples;
an approximate and related sample acquisition module 350 for acquiring approximate green credit category samples and related traditional category samples;
a generating module 360, configured to mix and fuse the approximate green credit category samples and the related traditional category samples to generate virtual samples;
a verification module 370, configured to verify the virtual samples through a preset classification decision tree using the Gini coefficient, and screen out the virtual samples that pass verification;
the training module 380 is configured to add the virtual sample that is qualified in verification into a sample to train the random forest, input the received green credit application information into the trained random forest, and output a credit risk prediction result based on the green identification information.
The credit risk prediction device based on green identification information provided by this embodiment receives green credit application information and determines the corresponding green credit category from it; acquires complete samples of the green credit category; when the number of complete samples of the green credit category is smaller than the preset number of samples, determines a related traditional category according to the green credit category; acquires related traditional category samples and incomplete samples of the green credit category; acquires approximate green credit category samples and related traditional category samples; mixes and fuses the approximate green credit category samples with the related traditional category samples to generate virtual samples; mixes and fuses the incomplete samples of the green credit category with the related traditional category samples to generate virtual samples; verifies the virtual samples through a preset classification decision tree using the Gini coefficient; and adds the virtual samples that pass verification to the samples to train the random forest, inputs the received green credit application information into the trained random forest, and outputs a credit risk prediction result based on the green identification information. By exploiting the similarity and correlation between the approximate green credit category and related traditional category samples and the green credit category sample data, the approximate green credit category samples and the related traditional category samples are fused to generate virtual green credit category sample data while preserving the correlation between the data.
Using the correlation among the green sample identification data, the reliability of the virtual green credit category sample data is verified through a preset classification decision tree using the Gini coefficient. The random forest is then trained with the virtual green credit category sample data that pass the reliability verification, together with the complete samples of the green credit category. This effectively addresses the shortage of training samples, and the trained random forest is used to predict the credit risk of the green credit category.
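For reference, the Gini measure used by classification decision trees can be computed as in the sketch below. The text does not spell out how the measure is applied to individual virtual samples, so the `leaf_is_reliable` rule and its `max_impurity` threshold are assumptions made for illustration: a virtual sample is treated as reliable when the leaf it falls into is sufficiently pure.

```python
from collections import Counter
from typing import Sequence


def gini_impurity(labels: Sequence[str]) -> float:
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_split_gini(left: Sequence[str], right: Sequence[str]) -> float:
    """Size-weighted Gini impurity of a binary split, as used by CART."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


def leaf_is_reliable(leaf_labels: Sequence[str], max_impurity: float = 0.2) -> bool:
    """Assumed screening rule: accept samples landing in a sufficiently pure leaf."""
    return gini_impurity(leaf_labels) <= max_impurity
```

Lower impurity means the labels in a node agree more strongly, which is why a purity threshold is one plausible way to screen virtual samples.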
On the basis of the above embodiments, the generating module includes:
a dividing unit, used for dividing the initial data of the related traditional category samples into project data and credit applicant data, and respectively establishing a project data matrix and a credit applicant data matrix;
an extraction unit, used for extracting, for each assessment node in the approximate green credit category samples, the proportional coefficient of the project data, the policy subsidy proportion, and the environmental benefit conversion parameter, and generating assessment node influence matrices, wherein each assessment node influence matrix corresponds to one project assessment node;
a multiplying unit, used for multiplying the project data matrix and the credit applicant data matrix by each assessment node influence matrix respectively, to obtain a project data matrix and a credit applicant data matrix corresponding to each assessment node;
and a unit used for extracting corresponding elements from the project data matrix and the credit applicant data matrix corresponding to each assessment node as the data of the virtual samples.
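The matrix operations performed by these units can be sketched as follows. The matrix shapes are not given in the text, so this sketch assumes that both data matrices have one row per sample and the same number of columns as each influence matrix, and that an influence matrix is diagonal, with entries standing in for the proportional coefficient, policy subsidy proportion, and environmental benefit conversion parameter. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; columns of a must match rows of b."""
    if len(a[0]) != len(b):
        raise ValueError("incompatible matrix shapes")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def diagonal(values: List[float]) -> Matrix:
    """Build a diagonal influence matrix from per-column coefficients."""
    size = len(values)
    return [[values[i] if i == j else 0.0 for j in range(size)] for i in range(size)]


def fuse(project: Matrix, applicant: Matrix, influences: List[Matrix]) -> Matrix:
    """Produce one virtual row per (assessment node, source sample) pair."""
    virtual: Matrix = []
    for influence in influences:
        project_node = matmul(project, influence)      # project data at this node
        applicant_node = matmul(applicant, influence)  # applicant data at this node
        for p_row, a_row in zip(project_node, applicant_node):
            virtual.append(p_row + a_row)              # corresponding elements
    return virtual
```

For example, with `project = [[100.0, 10.0]]`, `applicant = [[2.0, 4.0]]`, and a single influence matrix `diagonal([1.0, 0.5])`, the sketch yields the virtual row `[100.0, 5.0, 2.0, 2.0]`.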
On the basis of the above embodiments, the random forest includes:
five decision trees, wherein the data input to each decision tree includes at least part of the data input to another decision tree.
On the basis of the above embodiments, the five decision trees include:
a profitability decision tree, an operational capacity decision tree, an environmental benefit decision tree, an emission reduction yield decision tree, and a subsidy yield decision tree.
On the basis of the above embodiments, the determining module includes:
the minimum level determining unit is used for determining a minimum level corresponding to the green credit category;
an obtaining unit, configured to obtain a name of the same level as the minimum level;
and the related traditional category determining unit is used for determining the related traditional category according to the name of the same level as the minimum level.
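The lookup performed by these three units can be sketched with a small industry taxonomy. The taxonomy contents and the rule that a same-level name counts as a related traditional category when it is not itself a green credit category are assumptions made for illustration; the text only states that the related traditional category is determined from names at the same level as the minimum level.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical taxonomy: parent path -> names at the next (minimum) level.
TAXONOMY: Dict[Tuple[str, ...], List[str]] = {
    ("Energy", "Power generation"): [
        "Solar power generation",
        "Wind power generation",
        "Thermal power generation",
    ],
}
GREEN_NAMES: Set[str] = {"Solar power generation", "Wind power generation"}


def related_traditional_categories(
    green_path: Tuple[str, ...],
    taxonomy: Dict[Tuple[str, ...], List[str]],
    green_names: Set[str],
) -> List[str]:
    """Return same-level names that are not green credit categories."""
    # The last element of the path is the minimum level of the green category.
    *parent, minimum_level_name = green_path
    # Names at the same level are the entries listed under the same parent.
    same_level = taxonomy.get(tuple(parent), [])
    return [
        name for name in same_level
        if name != minimum_level_name and name not in green_names
    ]
```

With the sample taxonomy, the path `("Energy", "Power generation", "Solar power generation")` yields `["Thermal power generation"]`.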
The credit risk prediction device based on the green identification information provided by the embodiment of the invention can execute the credit risk prediction method based on the green identification information provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (8)

1. A credit risk prediction method based on green identification information, comprising:
receiving credit application information, extracting keywords of a field, a loan direction, and a project name from the credit application information, and determining industry category multi-level content according to the keywords;
judging whether the credit is green credit according to the minimum classification content in the multi-level content, and determining the corresponding green credit category when it is green credit;
acquiring a complete sample of the green credit category;
when the number of the complete samples of the green credit category is smaller than the preset sample number, determining a related traditional category and an approximate green credit category according to the green credit category;
acquiring an approximate green credit category sample and a related traditional category sample;
mixing and fusing the approximate green credit category sample and the related traditional category sample to generate a virtual sample;
verifying the virtual sample through a preset classification decision tree by using the Gini coefficient, and screening to obtain a virtual sample qualified in verification;
adding the virtual sample which is qualified in verification into the sample to train the random forest, inputting the received green credit application information into the trained random forest, and outputting a credit risk prediction result based on green identification information;
wherein the mixing and fusing of the approximate green credit category sample and the related traditional category sample comprises:
dividing initial data of the related traditional category samples into project data and credit applicant data, and respectively establishing a project data matrix and a credit applicant data matrix;
extracting, for each assessment node in the approximate green credit category sample, a proportional coefficient of the project data, a policy subsidy proportion, and an environmental benefit conversion parameter, and generating assessment node influence matrices, wherein each assessment node influence matrix corresponds to one project assessment node;
multiplying the project data matrix and the credit applicant data matrix by each assessment node influence matrix respectively to obtain a project data matrix and a credit applicant data matrix corresponding to each assessment node;
and extracting corresponding elements from the project data matrix and the credit applicant data matrix corresponding to each assessment node to serve as data of a virtual sample.
2. The method of claim 1, wherein the random forest comprises:
five decision trees, wherein the data input to each decision tree comprises at least part of the data input to another decision tree.
3. The method of claim 2, wherein the five decision trees comprise:
a profitability decision tree, an operational capacity decision tree, an environmental benefit decision tree, an emission reduction yield decision tree, and a subsidy yield decision tree.
4. The method of claim 1, wherein said determining a related traditional category according to said green credit category comprises:
determining a minimum level corresponding to the green credit category;
acquiring the name of the same level as the minimum level;
and determining the related traditional category according to the names of the same level as the minimum level.
5. A credit risk prediction apparatus based on green identification information, comprising:
the receiving module is used for receiving credit application information, extracting keywords of a field, a loan direction, and a project name from the credit application information, and determining industry category multi-level content according to the keywords;
the judging module is used for judging whether the multi-level content is green credit according to the minimum classification content in the multi-level content, and determining the corresponding green credit category when the multi-level content is green credit;
the green credit category complete sample acquisition module is used for acquiring a complete sample of the green credit category;
the determining module is used for determining a relevant traditional category and an approximate green credit category according to the green credit category when the number of the complete samples of the green credit category is smaller than the preset number of samples;
an approximate and related sample acquisition module for acquiring approximate green credit category samples and related traditional category samples;
the generation module is used for carrying out mixed fusion on the approximate green credit category sample and the related traditional category sample to generate a virtual sample;
the verification module is used for verifying the virtual sample through a preset classification decision tree by using the Gini coefficient, and screening to obtain a virtual sample qualified in verification;
the training module is used for adding the virtual samples which are qualified in verification into the samples to train the random forest, inputting the received green credit application information into the trained random forest, and outputting a credit risk prediction result based on the green identification information;
the generating module comprises:
the dividing unit is used for dividing the related traditional category sample initial data into project data and credit applicant data, and respectively establishing a project data matrix and a credit applicant data matrix;
the extraction unit is used for extracting, for each assessment node in the approximate green credit category sample, the proportional coefficient of the project data, the policy subsidy proportion, and the environmental benefit conversion parameter, and generating assessment node influence matrices, wherein each assessment node influence matrix corresponds to one project assessment node;
the multiplying unit is used for multiplying the project data matrix and the credit applicant data matrix by each assessment node influence matrix respectively to obtain a project data matrix and a credit applicant data matrix corresponding to each assessment node;
and a unit used for extracting corresponding elements from the project data matrix and the credit applicant data matrix corresponding to each assessment node as data of the virtual sample.
6. The apparatus of claim 5, wherein the random forest comprises:
five decision trees, wherein the data input to each decision tree comprises at least part of the data input to another decision tree.
7. The apparatus of claim 6, wherein the five decision trees comprise:
a profitability decision tree, an operational capacity decision tree, an environmental benefit decision tree, an emission reduction yield decision tree, and a subsidy yield decision tree.
8. The apparatus of claim 5, wherein the determining module comprises:
the minimum level determining unit is used for determining a minimum level corresponding to the green credit category;
an obtaining unit, configured to obtain a name of the same level as the minimum level;
and the related traditional category determining unit is used for determining the related traditional category according to the name of the same level as the minimum level.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211513284.5A CN115545912B (en) 2022-11-30 2022-11-30 Credit risk prediction method and device based on green identification information

Publications (2)

Publication Number Publication Date
CN115545912A CN115545912A (en) 2022-12-30
CN115545912B true CN115545912B (en) 2023-04-25

Family

ID=84722470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211513284.5A Active CN115545912B (en) 2022-11-30 2022-11-30 Credit risk prediction method and device based on green identification information

Country Status (1)

Country Link
CN (1) CN115545912B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898476A (en) * 2018-06-14 2018-11-27 中国银行股份有限公司 A kind of loan customer credit-graded approach and device
CN110175911A (en) * 2019-06-03 2019-08-27 卓尔智联(武汉)研究院有限公司 Credit approval results pre-judging method and relevant device based on deep learning
CN113962800A (en) * 2021-10-26 2022-01-21 度小满科技(北京)有限公司 Model training and overdue risk prediction method, device, equipment and medium
CN114881765A (en) * 2022-05-20 2022-08-09 中国银行股份有限公司 Credit item risk identification method and device
CN115018628A (en) * 2022-06-06 2022-09-06 中国银行股份有限公司 Green loan risk prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant