CN114519508A - Credit risk assessment method based on time sequence deep learning and legal document information

Info

Publication number
CN114519508A
Authority
CN
China
Prior art keywords
legal document
data
legal
information
credit risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210085355.XA
Other languages
Chinese (zh)
Inventor
许伟
杜玮
王明明
周宣晔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China filed Critical Renmin University of China
Priority to CN202210085355.XA priority Critical patent/CN114519508A/en
Publication of CN114519508A publication Critical patent/CN114519508A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/18Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Computation (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Finance (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Technology Law (AREA)
  • Accounting & Taxation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Educational Administration (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Primary Health Care (AREA)

Abstract

The invention relates to a credit risk assessment method based on time-series deep learning and legal document information, comprising the following steps: determining the optimal observation period, and classifying judgments according to the litigation status of the loan applicant and the judgment result; crawling legal judgment documents within a set time, configuring document entity extraction rules and a dictionary, and extracting legal document entities with a rule-based extraction method; preprocessing the extracted legal document data, and extracting events from the text of the legal documents; selecting legal document features with strong predictive power using the RFE recursive feature selection method; and building a mixed data set and training an LSTM model to obtain an evaluation model for assessing credit risk. The method enables early identification, early warning and early discovery of risk, allows customer risk warnings to be raised in time, improves the quality of risk control, provides a more accurate and reliable basis for banks' anti-fraud and loan application decisions, supports user risk management through algorithmic practice, and effectively reduces banks' non-performing loan ratio and non-performing loan amount.

Description

Credit risk assessment method based on time sequence deep learning and legal document information
Technical Field
The invention relates to a credit risk assessment method based on time sequence deep learning and legal document information, and relates to the field of computer science and technology.
Background
In the field of credit risk assessment, well-developed financial markets continuously generate large volumes of transaction-level data containing temporal information. Faced with such time-series transaction data, traditional machine learning models cannot overcome the vanishing-gradient and exploding-gradient problems that arise when training on long sequences, so large-scale data is not fully mined or applied. In addition, legal judgment documents contain rich information that reflects a borrower's default risk to some extent, yet the feature sets of conventional credit risk assessment models often ignore it.
Conventional credit risk assessment includes expert-based and statistics-based methods. Expert-based methods are highly subjective, and statistics-based methods struggle to capture complex relationships among variables. With the rise of machine learning and deep learning, time-series prediction problems are increasingly framed as regression problems, so machine learning and deep learning models can be applied without the restrictions of basic statistical assumptions; they therefore have a wider range of application and have become more popular.
Most traditional machine learning algorithms rely on basic statistical data and cannot make good use of the temporal relationships in the data. As a result, the large volume of transaction-level data generated in the credit field is not fully used when building risk assessment models, and the models built in the prior art have low prediction accuracy. During pre-loan review, credit officers face information asymmetry, such as insufficient information about the borrower, which leads to a high non-performing loan ratio, a large non-performing loan amount and low transaction efficiency for the bank.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a credit risk assessment method, system, electronic device and storage medium based on time series deep learning and legal document information, which can improve the accuracy of credit risk assessment prediction by analyzing data including time series relationship and mining time series information included in the data.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the invention provides a credit risk assessment method based on time series deep learning and legal document information, which is characterized by comprising the following steps:
determining the optimal observation period, and classifying the judgment according to the litigation state and the judgment result of the subject to be evaluated;
crawling a legal decision document within a set time, configuring document entity extraction rules and a dictionary, and extracting legal document entities by adopting a rule-based extraction method;
preprocessing the extracted legal document data, and extracting events from the text information of the legal document;
selecting legal document features with strong prediction capability by using an RFE recursive feature selection method;
and setting a mixed data set and training an LSTM model to obtain an evaluation model for evaluating the credit risk.
Further, the optimal observation period is determined with the chi-square test, and the legal document judgments are divided into four categories: judgments in which the subject is the defendant and the result is unfavorable to the subject; judgments in which the subject is the defendant and the result is favorable; judgments in which the subject is the plaintiff and the result is unfavorable; and judgments in which the subject is the plaintiff and the result is favorable.
Further, the legal judgment documents within the set time are crawled using a crawler framework together with Selenium in Python to simulate browser access.
Further, preprocessing the extracted legal document data comprises deduplication, missing value processing and/or data import. Deduplication is performed on the basis of the case number of the legal document under examination and the corresponding court information; missing value processing includes converting the credit data into vectorized data of a uniform format; data import includes merging data from the various data sources into one data set.
Further, event extraction is carried out on the legal document text information, and the event extraction comprises the following steps:
a regular-expression matching method is adopted: a keyword table is defined manually based on expert experience; several dictionaries for regular-expression matching are defined in Python code according to the keyword table; the crawled legal document text is looped over and matched line by line against every entry in the defined Python dictionaries; when a match succeeds, the information at the keyword's position is extracted and stored in the corresponding Python list; and the feature field information collected for each legal document is written to a file in a loop, yielding the initial feature set of the legal documents.
Further, the legal document features with strong prediction ability are selected by using an RFE recursive feature selection method, which comprises the following steps:
S1, constructing an initial feature set F of the legal documents, the initial feature set comprising case number, title, cause of action, case type, trial court, release date, litigation status, case facts, parties, publication date, judgment date, trial procedure, judgment result and amount involved;
S2, setting the initial feature set Fx to the original data set, setting the optimal feature set Fy to empty, and denoting the root mean square error of the optimal feature subset as Rx; generating decision trees by bootstrap resampling of Fx, building a random forest classification model whose final classification result is obtained by voting, and training the model with all feature variables;
S3, calculating the importance of each feature variable and the root mean square error Rx, and sorting the feature variables in descending order of the absolute value |C| of the feature score;
S4, deleting the ranked features Fi from the subset Fx one at a time until the feature set Fx is empty; if the root mean square error Ry of the feature subset Fy is smaller than Rx, Rx is updated to Ry; otherwise S3 and S4 are executed again;
S5, outputting the optimal feature subset Fy, wherein the optimal feature variables obtained comprise the case number, release date, litigation status, case facts, judgment result and amount involved in the legal documents.
Further, the mixed data set includes the optimal feature variable set from the legal document information, the user's time-series data recorded in chronological order, and demographic data.
In a second aspect, the present invention provides a credit risk assessment system based on time series deep learning and legal document information, the system comprising:
the data acquisition unit is configured to determine an optimal observation period and classify the judgment according to the litigation state and the judgment result of the subject to be evaluated;
the entity extraction unit is configured to crawl legal decision documents within set time, configure document entity extraction rules and dictionaries, and extract legal document entities by adopting a rule-based extraction method;
an event extraction unit configured to perform preprocessing on the extracted legal document data and perform event extraction on the legal document text information;
the characteristic acquisition unit is configured to select legal document characteristics with strong prediction capability by using an RFE recursive characteristic selection method;
and the risk evaluation unit is configured to set the mixed data set and train the LSTM model to obtain an evaluation model for evaluating the credit risk.
In a third aspect, the present invention provides an electronic device, which at least includes a processor and a memory, where the memory stores a computer program, and the processor executes the computer program to perform the method.
In a fourth aspect, the present invention provides a computer storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor to perform the method.
Due to the adoption of the technical scheme, the invention has the following characteristics:
1. Features are extracted from legal documents, all features are screened with the RFE feature selection algorithm, and an LSTM-based time-series credit risk assessment model is built. By analyzing data that contains temporal relationships and mining the temporal information in it, the invention improves the accuracy of credit risk prediction. This quickly and efficiently addresses the pain point of information asymmetry between borrowers and banks, helps to some extent with personal credit risk rating and credit management assessment at financial institutions such as banks, eases the difficulty individuals face in obtaining loans, and effectively reduces banks' non-performing loan ratio and non-performing loan amount. Banks gain an advantage in personal loan approval, credit line confirmation and similar tasks, and can improve transaction efficiency and service quality;
2. The invention uses the LSTM model to uncover the information in time-series data and, on that basis, introduces legal judgment document information as supplementary features. Credit risk is assessed by combining financial and non-financial information, so that users' default risk is predicted reasonably. This further helps financial credit institutions such as banks to assess their asset position reasonably, keeping the asset-liability ratio at a relatively stable level and maintaining the safety and stability of the financial credit market;
3. The user credit risk assessment framework based on time-series deep learning and legal document information helps financial credit institutions such as banks assess and predict users' default probability more reasonably and effectively. Banks can screen high-quality customers and filter out high-risk ones; risk can be identified, warned of and discovered early; customer risk warnings can be raised in time; the quality of risk control improves; banks' anti-fraud and loan application decisions gain a more accurate and reliable basis; algorithmic practice supports user risk management; and banks' non-performing loan ratio and non-performing loan amount are effectively reduced.
In conclusion, the invention can be widely applied to credit risk assessment by banks.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Like reference numerals refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic diagram illustrating a credit evaluation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of legal document information extraction according to an embodiment of the present invention;
FIG. 3 is a flow chart of legal document feature set construction according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an implementation of RFE according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an LSTM according to an embodiment of the present invention;
FIG. 6 is a flow chart of an LSTM-based time series credit risk assessment model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" may be intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless specifically identified as an order of performance. It should also be understood that additional or alternative steps may be used.
The invention uses published legal judgment document information as non-financial information, together with demographic information and the borrower's time-series transaction information, to build an LSTM-based user credit risk assessment framework. This makes it easier for banks to screen high-quality customers and filter out high-risk ones, enables early identification, early warning and early discovery of risk, allows customer risk warnings to be raised in time, and improves the quality of risk control. Legal judgment documents contain rich information and reflect a borrower's default risk to some extent. Because judgments are published on the Internet, using them lowers the bank's information collection cost and helps ensure the authenticity of the information. Extracting features from judgment text with text mining techniques can significantly improve the recognition performance of the prediction model. The time-series deep learning model LSTM is therefore used to process mixed data consisting of time-series data and demographic data, and an LSTM-based user credit risk assessment framework is finally constructed.
Embodiment 1: As shown in FIG. 1, the method for constructing a credit risk assessment model based on time-series deep learning and legal document information provided in this embodiment includes:
S1, as shown in FIG. 2, determining the optimal observation period and classifying judgments according to the litigation status of the loan applicant and the judgment result.
Specifically, to select the optimal observation period before extracting legal document information, this embodiment uses a conventional test of association, the chi-square test, to test the correlation between the observation period and loan default and to find the observation period most strongly correlated with default.
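The observation-period test can be sketched as follows. This is a minimal illustration, assuming each candidate period is summarized by a 2x2 contingency table (judgment present or absent against default or no default). The function names and the counts are hypothetical and are not taken from the patent.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # a degenerate table carries no association signal
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def best_observation_period(tables):
    """Return the period with the largest chi-square value and all scores.

    `tables` maps a candidate period (in years) to the counts
    (judgment & default, judgment & no default,
     no judgment & default, no judgment & no default).
    """
    scores = {period: chi_square_2x2(*counts) for period, counts in tables.items()}
    return max(scores, key=scores.get), scores


# Made-up counts for three candidate periods (illustration only).
toy_tables = {
    1: (30, 70, 120, 780),
    2: (60, 90, 90, 760),
    3: (65, 135, 85, 715),
}
best_period, period_scores = best_observation_period(toy_tables)
print(best_period, period_scores)
```

With these toy counts the two-year window has the largest statistic; real counts would come from the bank's own loan and judgment records.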
To determine which judgments are useful for predicting credit risk, the legal document judgments within the selected observation period are classified into four categories: the applicant is the defendant and the result is unfavorable; the applicant is the defendant and the result is favorable; the applicant is the plaintiff and the result is unfavorable; and the applicant is the plaintiff and the result is favorable. A logistic regression model, a classical machine learning method, is then used to examine how the number of judgments in each category relates to the default probability. It is found that the category in which the applicant is the defendant and the result is unfavorable (for example, the loan applicant is the legal representative of a defendant company that loses the case and must compensate the plaintiff for its losses) may have a greater influence on credit risk.
S2, as shown in FIG. 3, a crawler framework together with Selenium in Python is used to simulate browser access and crawl the legal judgment documents of the last two years; document entity extraction rules and dictionaries are configured, and legal document entities are extracted with a rule-based extraction method.
The keyword table is defined manually based on expert experience. Legal document entities are extracted with a rule-based method in which keywords mark and identify the extraction task and serve as the trigger vocabulary that activates it. Although legal documents are stored in an unstructured format, their content is itself structured and the language of judgment documents is relatively fixed: keywords such as case number, release date, judge, litigation status and judgment result appear as labels, and the content following the colon after each keyword is the feature information to be extracted. The crawled legal documents are matched with regular expressions, and the corresponding feature field information is extracted according to the keywords.
S3, preprocessing the legal document data and extracting events from the document information.
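The four-way classification can be expressed as a small lookup. The sketch below is illustrative only: the enum, the function name and the string values for litigation status are assumptions, since the patent does not specify a data representation.

```python
from collections import Counter
from enum import Enum


class JudgmentCategory(Enum):
    DEFENDANT_UNFAVORABLE = "defendant, unfavorable result"
    DEFENDANT_FAVORABLE = "defendant, favorable result"
    PLAINTIFF_UNFAVORABLE = "plaintiff, unfavorable result"
    PLAINTIFF_FAVORABLE = "plaintiff, favorable result"


def classify_judgment(litigation_status, favorable):
    """Map (litigation status, outcome) to one of the four categories.

    `litigation_status` is "defendant" or "plaintiff" (hypothetical encoding);
    `favorable` is True when the result favors the loan applicant.
    """
    table = {
        ("defendant", False): JudgmentCategory.DEFENDANT_UNFAVORABLE,
        ("defendant", True): JudgmentCategory.DEFENDANT_FAVORABLE,
        ("plaintiff", False): JudgmentCategory.PLAINTIFF_UNFAVORABLE,
        ("plaintiff", True): JudgmentCategory.PLAINTIFF_FAVORABLE,
    }
    try:
        return table[(litigation_status, favorable)]
    except KeyError:
        raise ValueError(f"unknown litigation status: {litigation_status!r}")


# Per-applicant counts of each category could then feed a logistic regression.
toy_judgments = [("defendant", False), ("defendant", False), ("plaintiff", True)]
category_counts = Counter(classify_judgment(s, f) for s, f in toy_judgments)
print(category_counts)
```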
Specifically, preprocessing the legal document data comprises deduplication, missing value processing and/or data import. Deduplication is performed on the basis of the case number of the legal document under examination and the corresponding court information; missing value processing includes converting the credit data into vectorized data of a uniform format; data import includes merging data from the various data sources into one data set.
Events are extracted from the text of all crawled legal documents as follows:
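The deduplication step can be sketched as below, assuming each crawled document has already been parsed into a dictionary. The field names `case_number` and `court` and the sample values are hypothetical placeholders.

```python
def deduplicate(records):
    """Keep the first record for each (case number, court) pair."""
    seen = set()
    unique = []
    for record in records:
        # Normalize whitespace so trivially different copies collapse together.
        key = (record["case_number"].strip(), record["court"].strip())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


toy_records = [
    {"case_number": "CASE-001", "court": "Court A"},
    {"case_number": "CASE-001 ", "court": "Court A"},  # duplicate with stray space
    {"case_number": "CASE-001", "court": "Court B"},   # same number, other court
]
unique_records = deduplicate(toy_records)
print(len(unique_records))
```

Keying on the pair rather than the case number alone reflects the text's use of both the case number and the court information.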
Because the language of judgment documents is relatively fixed and follows certain patterns, a regular-expression matching method is adopted. A keyword table is defined manually based on expert experience, and several dictionaries for regular-expression matching are defined in the Python code according to the keyword table (such as re _ fact 1 { ' case by (·) (} 1,2} - \ d {1,2}) } and re _ fact 2 { ' commission litigant attorney-original person ': can entrust lition attorney):?). The crawled legal document text is looped over and matched line by line against all keywords in the defined dictionaries. When a match succeeds, the information at the keyword's position is extracted and stored in the corresponding Python list; the feature field information collected for each legal document is written to a file in a loop; and the initial feature set of the legal documents is finally obtained, which serves as the input to RFE recursive feature selection.
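The line-by-line keyword matching can be sketched as follows. The patterns here are hypothetical English stand-ins written for illustration; they are not the patent's dictionaries, which would target the Chinese labels used in real judgment documents.

```python
import re

# Hypothetical keyword table: field name -> pattern capturing the text after the colon.
FIELD_PATTERNS = {
    "case_number": re.compile(r"^Case number[:：]\s*(.+)$"),
    "release_date": re.compile(r"^Release date[:：]\s*(\d{4}-\d{1,2}-\d{1,2})$"),
    "litigation_status": re.compile(r"^Litigation status[:：]\s*(.+)$"),
    "judgment_result": re.compile(r"^Judgment result[:：]\s*(.+)$"),
}


def extract_fields(text):
    """Match every line against every pattern and collect the captured values."""
    found = {name: [] for name in FIELD_PATTERNS}
    for line in text.splitlines():
        line = line.strip()
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.match(line)
            if match:
                found[name].append(match.group(1).strip())
    return found


sample_text = """Case number: CASE-001
Release date: 2021-3-15
Litigation status: defendant
Judgment result: claim upheld against the defendant
This line has no keyword and is ignored.
"""
fields = extract_fields(sample_text)
print(fields)
```

Each value list could then be written out per document (for example with the standard `csv` module) to form the initial feature set.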
S4, selecting legal document features with strong predictive power using the RFE recursive feature selection method, wherein the text features finally selected comprise the case number, release date, litigation status, case facts, judgment result and amount involved.
As shown in FIG. 4, specifically, the RFE recursive feature selection method is used to select legal document features with strong predictive power, and the optimal legal document feature subset is selected by calculating the root mean square error of the model. The selection process, from input to output, is as follows:
S41, constructing and inputting an initial feature set F, which consists of all available features: case number, title, cause of action, case type, trial court, release date, litigation status, case facts, parties, publication date, judgment date, trial procedure, judgment result and amount involved;
S42, setting the initial feature set Fx to the original data set, setting the optimal feature set Fy to empty, and denoting the root mean square error of the optimal feature subset as Rx; generating decision trees by bootstrap resampling of Fx, building a random forest classification model whose final classification result is obtained by voting, and training the model with all feature variables;
S43, calculating the importance of each feature variable and the root mean square error Rx, and sorting the feature variables in descending order of the absolute value |C| of the feature score, wherein the feature score of the i-th feature is calculated as follows:
Figure BDA0003487566540000071
In the formula, w_i is the weight of the i-th feature and c_i is the feature score of the i-th feature.
S44, deleting the ranked features Fi from the subset Fx one at a time until the feature set Fx is empty; if the root mean square error Ry of the feature subset Fy is smaller than Rx, Rx is updated to Ry; otherwise S43 and S44 are executed again;
S45, outputting the optimal feature subset Fy, wherein the optimal feature variables finally obtained comprise the case number, release date, litigation status, case facts, judgment result and amount involved in the legal documents.
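The elimination loop of steps S41 to S45 can be sketched in plain Python. To stay dependency-free, the scorer below is a stand-in (an average of one-variable least-squares fits, with absolute correlation as the importance), not the random forest the text describes; a random forest scorer could be passed in through the same `fit_and_score` callback. All names and the toy data are hypothetical.

```python
import math


def univariate_fit(x, y):
    """Least-squares slope, intercept and Pearson correlation of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    corr = sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0
    return slope, my - slope * mx, corr


def fit_and_score(X, y, features):
    """Stand-in model: average the one-variable predictions; return (RMSE, importances)."""
    fits = {f: univariate_fit(X[f], y) for f in features}
    preds = [
        sum(fits[f][1] + fits[f][0] * X[f][i] for f in features) / len(features)
        for i in range(len(y))
    ]
    error = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y))
    return error, {f: fits[f][2] for f in features}


def recursive_feature_elimination(X, y, scorer):
    """Drop the weakest feature each round; keep the subset with the lowest RMSE."""
    current = list(X)                            # Fx: features still in play
    best_subset, best_rmse = [], float("inf")    # Fy and Rx
    while current:
        error, importances = scorer(X, y, current)
        if error < best_rmse:
            best_rmse, best_subset = error, list(current)
        weakest = min(current, key=lambda f: abs(importances[f]))
        current.remove(weakest)
    return best_subset, best_rmse


# Toy data: one informative feature, one partly informative, one pure noise.
toy_X = {
    "judgment_result": [0, 0, 0, 1, 1, 1],
    "litigation_status": [0, 0, 1, 1, 1, 1],
    "title_length": [3, 1, 2, 2, 1, 3],
}
toy_y = [0, 0, 0, 1, 1, 1]
best_subset, best_rmse = recursive_feature_elimination(toy_X, toy_y, fit_and_score)
print(best_subset, best_rmse)
```

The loop keeps track of the best subset seen so far, which is one reasonable reading of the comparison between Ry and Rx in the steps above.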
S5, splitting a mixed data set, consisting of the optimal feature variable set extracted from the legal document information, the user's time-series data recorded in chronological order (such as the bill amounts the user generates over time) and demographic data (such as the user's sex, age and education level), into a training set, a test set and a validation set;
S6, as shown in FIGS. 5 and 6, training the LSTM model and, after training is completed, evaluating the model.
The LSTM (long short-term memory) network is a variant of the recurrent neural network (RNN) and improves on the basic algorithmic idea of the RNN.
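A minimal sketch of assembling and splitting the mixed data set follows. The 70/15/15 ratio, the record layout and the field names are assumptions made for illustration; the patent does not state the split proportions.

```python
import random


def build_sample(legal_features, time_series, demographics, label):
    """Bundle the three data sources for one user into a single record."""
    return {
        "legal_features": legal_features,   # selected legal document features
        "time_series": time_series,         # e.g. bill amounts in time order
        "demographics": demographics,       # e.g. sex, age, education level
        "label": label,                     # 1 = default, 0 = no default
    }


def split_dataset(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Shuffle reproducibly and split into training, test and validation sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_test = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


toy_samples = [
    build_sample(
        {"litigation_status": "defendant", "judgment_result": "unfavorable"},
        [100.0 + i, 120.0 + i, 90.0 + i],
        {"age": 30 + i},
        label=i % 2,
    )
    for i in range(20)
]
train_set, test_set, validation_set = split_dataset(toy_samples)
print(len(train_set), len(test_set), len(validation_set))
```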
Specifically, the training set data and the test set data are first read in a defined ratio. The training set data is divided into two parts of labels and features. Since the LSTM model is a model for processing time series data, dimension change is performed on two-dimensional training set data according to a time series order by using a Reshape method of a numpy array in Python. After dimension conversion, the training process of the model is started. And (3) training by using the two models, and splicing the tensor output by the LSTM model and the tensor output by the fully-connected neural network together in a row after one round of training of the two models is finished. And then, through a preset full connection layer, the classified prediction of the samples is realized. And then entering logic judgment, if the iteration times do not meet the preset requirement, calculating corresponding loss through a loss function, updating relevant parameters of the LSTM layer and the fully-connected neural network through an optimization function, and performing a new round of model parameter iteration updating until the model meets the preset iteration termination condition. After the training of the model is completed, the evaluation of the model is performed.
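The dimension change can be illustrated with NumPy. This sketch assumes each two-dimensional row stores the time steps one after another (all features of step 1, then all features of step 2, and so on); that layout and the function name are assumptions, not details given in the patent.

```python
import numpy as np


def to_lstm_input(flat_rows, timesteps, n_features):
    """Reshape (samples, timesteps * n_features) into (samples, timesteps, n_features)."""
    flat = np.asarray(flat_rows, dtype=float)
    if flat.ndim != 2 or flat.shape[1] != timesteps * n_features:
        raise ValueError("each row must hold timesteps * n_features values")
    return flat.reshape(flat.shape[0], timesteps, n_features)


# Two users, three time steps, two features per step (made-up numbers).
flat_rows = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
]
sequences = to_lstm_input(flat_rows, timesteps=3, n_features=2)
print(sequences.shape)
```

The resulting three-dimensional array matches the (batch, time, feature) layout that common LSTM implementations accept when configured batch-first.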
Further, for time series data, processing via the LSTM model is required. All parameters involved in the LSTM model, including all relevant parameters in the three gating mechanisms, need to be initialized before data entry. The data is then passed into the LSTM layer for a round of training. For non-time series data, which are not suitable for processing by the LSTM model, learning and prediction in a three-layer fully-connected neural network initialized in advance are required.
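The reshaping and splicing described above can be illustrated with a minimal numpy sketch. The sizes (4 samples, 6 time steps, 5 features per step), the branch widths and the random stand-in outputs are assumptions made for illustration; the embodiment does not state them, and splicing along the feature axis is only one reading of "by row".

```python
import numpy as np

# Hypothetical sizes: 4 samples, 6 time steps, 5 features per time step.
n_samples, n_steps, n_feats = 4, 6, 5

# Two-dimensional layout: one row per sample, time steps laid out side by side.
flat = np.arange(n_samples * n_steps * n_feats, dtype=float)
flat = flat.reshape(n_samples, n_steps * n_feats)

# Reshape to (samples, time steps, features), the layout recurrent layers expect.
seq = flat.reshape(n_samples, n_steps, n_feats)

# Random stand-ins for the two branch outputs (not real model outputs).
rng = np.random.default_rng(0)
lstm_out = rng.normal(size=(n_samples, 8))    # assumed LSTM branch width
dense_out = rng.normal(size=(n_samples, 16))  # assumed dense branch width

# Splice the two outputs per sample before the final classification layer.
merged = np.concatenate([lstm_out, dense_out], axis=1)
```

Each sample keeps its own row, so the final fully-connected layer sees the sequence summary and the static summary of the same borrower side by side.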
The following describes in detail the implementation process of the credit risk assessment method based on time series deep learning and legal document information according to the present invention by using specific embodiments.
As shown in fig. 1, the credit risk assessment method based on time series deep learning and legal document information provided by this embodiment includes:
1. Event extraction is performed on the legal documents to extract the optimal feature variables (case number, release date, litigation state, case, judgment result and amount involved), and this information is used to assess the credit risk of the borrower.
As shown in fig. 2, to mine information from the judgments, the present embodiment uses a text mining method to convert legal documents into structured information.
2. A legal document feature set is constructed.
To determine which legal judgment documents are useful for predicting credit risk, and to facilitate subsequent targeted crawling of legal documents, two aspects are considered: the time range of the legal documents and the judgment category of the legal documents.
In terms of time, the time span between the judgment date and the loan application date is analyzed to determine the optimal observation period.
In terms of category, the judgments are classified according to the litigation state of the loan applicant and the judgment result, and the judgment categories relevant to credit risk assessment are determined.
To select the optimal observation period before extracting the legal document information, this embodiment tests the association between each observation period and loan default using the chi-square test, and uses logistic regression to check the predictive power of each observation period. The results show that the variable indicating the existence of legal judgment documents within 2 years has the largest chi-squared value and therefore the strongest association with loan default.
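As a rough illustration of the observation-period check, the sketch below computes the Pearson chi-squared statistic for a 2x2 table (judgment present or absent versus default or no default) and picks the period with the largest value. All counts are invented placeholders, not figures from this embodiment, and the function and variable names are hypothetical.

```python
def chi_square_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, r in enumerate(row_totals):
        for j, col in enumerate(col_totals):
            expected = r * col / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Invented counts: rows = judgment exists in the period / does not,
# columns = defaulted / did not default.
periods = {
    "1 year": [[30, 70], [100, 800]],
    "2 years": [[60, 90], [70, 780]],
    "3 years": [[65, 135], [65, 735]],
}
scores = {name: chi_square_2x2(t) for name, t in periods.items()}
best_period = max(scores, key=scores.get)
```

A larger statistic means the presence of a judgment in that window is less plausibly independent of default, which is the criterion the text uses to prefer one window over another.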
3. To determine which judgments are useful for predicting credit risk, four legal document categories are considered; the classification criteria are detailed in Table 1.
With respect to category, this embodiment classifies the judgments according to the litigation state of the loan applicant and the judgment result, and determines the judgment categories relevant to assessing credit risk.
Plaintiff, defendant and judgment result: because the litigation state and the judgment result influence credit risk differently, the judgment documents are classified according to these two factors. Litigation states and judgment results are each divided into two groups, namely non-negative and negative. Based on this division, the selected legal documents are divided into four categories (C1-C4), with the classification criteria shown in the following table:
TABLE 1 Legal document categories
Figure BDA0003487566540000091
The judgments within the selected observation period are divided into four categories according to the litigation state and the judgment result. A logistic regression model is used to check the default probability associated with the number of judgments in each category; borrowers with C4 judgments are found to have the highest default probability, which indicates that judgments belonging to C4 probably have the greatest influence on credit risk.
4. After the time range and the judgment categories of the legal documents are determined, the legal judgment documents published on the judgment document website are crawled in a targeted manner: judgment documents from the last two years are crawled using the Scrapy framework and a Selenium-driven simulated browser in Python. To handle missing values, the k nearest neighbors are found using the k-nearest neighbor (KNN) algorithm, and each missing value is filled in with the average of those neighbors.
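The missing-value step can be sketched in plain Python as follows. This is a simplified illustration that assumes numeric features, at least k complete rows, and distances measured only on the columns the incomplete row actually has; the function name and the sample data are hypothetical.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the column mean over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        known = [i for i, v in enumerate(r) if v is not None]

        # Euclidean distance on the columns that are present in this row.
        def dist(other, row=r, cols=known):
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in cols))

        neighbours = sorted(complete, key=dist)[:k]
        filled.append([
            v if v is not None else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(r)
        ])
    return filled

rows = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [50.0, 500.0],
    [2.5, None],  # second feature is missing
]
result = knn_impute(rows, k=2)
```

In the sample, the incomplete row sits between the second and third rows, so its missing value is taken from those two neighbors rather than from the distant fourth row.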
5. Features are extracted from the legal documents.
As shown in fig. 3, structured information is extracted using keywords and regular expressions. Legal documents are stored in an unstructured format, but their content is structured and the language of judgment documents is relatively fixed. For example, each document comprises a preamble and a body; the body comprises several chapters, each chapter several sections, each section several articles, each article several paragraphs, and so on. The feature information to be extracted often follows keywords such as case number, release date, judge, litigation state, judgment result and litigation request. Based on these characteristics, regular expressions are used to match the characteristic strings in the document body and extract the information corresponding to each optimal feature of the legal document. RFE recursive feature selection can then be used to select features with strong predictive power to add to the model as complementary features, so that this non-financial information is combined with financial and user-specific information to help assess the credit risk of the borrower.
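A minimal sketch of keyword-plus-regular-expression extraction is given below. The field labels, the patterns and the sample text are all hypothetical stand-ins in English; real judgment documents would need patterns tuned to their actual wording.

```python
import re

# Hypothetical keyword -> pattern table (not taken from the embodiment).
FIELD_PATTERNS = {
    "case_number": re.compile(r"Case No\.\s*:\s*(\S+)"),
    "release_date": re.compile(r"Release date\s*:\s*(\d{4}-\d{2}-\d{2})"),
    "litigation_state": re.compile(r"Litigation state\s*:\s*(plaintiff|defendant)", re.I),
    "amount": re.compile(r"Amount involved\s*:\s*([\d,]+(?:\.\d+)?)"),
}

def extract_fields(text):
    """Scan the text line by line and keep the first match for each field."""
    record = {}
    for line in text.splitlines():
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.search(line)
            if match and name not in record:
                record[name] = match.group(1)
    if "amount" in record:
        # Strip thousands separators so the amount can be used numerically.
        record["amount"] = float(record["amount"].replace(",", ""))
    return record

sample = """Case No.: (2021)-Example-0001
Release date: 2021-06-30
Litigation state: defendant
Amount involved: 12,500.00 yuan"""
fields = extract_fields(sample)
```

Keeping the patterns in one table makes it easy to add or adjust a field without touching the matching loop.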
6. As shown in fig. 4, the RFE automatic feature selection method is used to determine the role of each extracted legal document feature and to extract the optimal legal document features. The RFE procedure is as follows:
61. The initial feature set is all available features.
62. A logistic regression model is fitted on the current feature set, training the model with all extracted feature variables.
63. The importance of each feature variable is calculated and the variables are ranked.
64. For each subset size S_i (i = 1, ..., S), the S_i most important feature variables are kept and the feature set is updated.
65. Jump to step 62 until the importance of all features has been rated.
66. The performance of the model obtained for each subset is calculated and compared to determine the optimal feature variable set.
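The loop in steps 61 to 66 can be sketched as below. This simplified version drops one feature per round instead of stepping through preset subset sizes, and it takes the ranking and scoring functions as arguments so that no particular model is implied; in the embodiment these roles are played by a logistic regression. The importance values and the scoring rule are toy numbers invented for the example.

```python
def recursive_feature_elimination(features, rank_fn, score_fn):
    """Drop the least important feature each round; return the best-scoring subset."""
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > 1:
        importance = rank_fn(current)  # dict: feature -> importance
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
        score = score_fn(current)
        if score > best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score

# Toy importance values (invented), standing in for fitted model coefficients.
TOY_IMPORTANCE = {
    "case_number": 0.30, "release_date": 0.40, "litigation_state": 0.70,
    "judgment_result": 0.90, "amount": 0.60, "title": 0.01, "trial_court": 0.05,
}

def toy_rank(subset):
    return {f: TOY_IMPORTANCE[f] for f in subset}

def toy_score(subset):
    # Reward informative features and charge a small cost per feature kept.
    return sum(TOY_IMPORTANCE[f] for f in subset) - 0.1 * len(subset)

best_subset, best_score = recursive_feature_elimination(
    list(TOY_IMPORTANCE), toy_rank, toy_score
)
```

Because the score of every intermediate subset is compared, the procedure returns the subset that performed best along the way rather than simply the last one standing.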
7. The screened basic information features and the screened legal document features are combined into an overall feature set. The data include not only basic demographic information such as marital status, education, age and gender, but also each user's time-series transaction information and legal document features within a six-month window. The overall feature set is described as follows:
TABLE 2 Overall feature set including legal document features
Figure BDA0003487566540000111
8. The training set data and the test set data are read according to the defined ratio. The training set data is divided into labels and features. Since the LSTM model processes time-series data, the two-dimensional training set data is reshaped in time-series order. After the dimension conversion, model training begins. As shown in figs. 5 and 6, the specific training process is as follows:
81. All of the mixed data is pulled, and the demographic data table, the time-series data table and the legal document data table are joined into one wide table on their common primary key, the unique borrower ID. The wide table contains all features of the demographic data and the time-series data together with the previously obtained optimal feature variable set of the legal document data.
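The wide-table construction in step 81 can be sketched with plain dictionaries keyed by ID. The choice of a left join on the demographic table (keeping users that have no legal document record and filling their legal columns with None) is an assumption, since the embodiment does not say how unmatched IDs are handled; all column names and values are hypothetical.

```python
def left_join_on_id(base, *others):
    """Join tables (dicts keyed by ID) onto `base`, filling absent columns with None."""
    wide = {}
    for key, row in base.items():
        merged = dict(row)
        for table in others:
            columns = {c for r in table.values() for c in r}
            found = table.get(key, {})
            for c in sorted(columns):
                merged[c] = found.get(c)
        wide[key] = merged
    return wide

# Hypothetical sample tables keyed by ID.
demographic = {1: {"age": 35, "education": 2}, 2: {"age": 48, "education": 1}}
time_series = {
    1: {"bill_m1": 1200.0, "bill_m2": 900.0},
    2: {"bill_m1": 300.0, "bill_m2": 450.0},
}
legal = {1: {"litigation_state": "defendant", "amount": 12500.0}}

wide = left_join_on_id(demographic, time_series, legal)
```

The None entries produced here are the kind of gaps that the missing-value processing in the following step would then have to fill.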
82. The K samples nearest to the sample with missing data are determined by Euclidean distance or correlation analysis, and the missing values are filled in with the weighted average of the K values.
83. The SMOTE algorithm is used to handle the imbalanced data. The algorithm proceeds as follows:
(1) For each sample x in the minority class, the Euclidean distance from x to all samples in the minority-class sample set is calculated to obtain the k nearest neighbors of x.
(2) A sampling ratio is set according to the sample imbalance ratio to determine the sampling multiplier N, and for each minority-class sample x several samples are randomly selected from its k nearest neighbors; a selected neighbor is denoted x_n.
(3) For each randomly selected neighbor x_n, a new sample is constructed from the original sample according to the following formula:
x_new = x + rand(0, 1) × (x_n − x)
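The interpolation formula above can be tried out with a short standard-library sketch. The parameter names, the fixed seed and the sample points are assumptions for illustration; a production system would more likely rely on an existing library implementation.

```python
import math
import random

def smote(minority, k=2, n_new_per_sample=1, seed=0):
    """Create synthetic minority samples by interpolating towards random neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # k nearest minority-class neighbours of x (x itself excluded).
        others = [m for m in minority if m is not x]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        for _ in range(n_new_per_sample):
            xn = rng.choice(neighbours)
            gap = rng.random()  # plays the role of rand(0, 1)
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, xn)])
    return synthetic

# Hypothetical minority-class points.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]]
synthetic = smote(minority, k=2, n_new_per_sample=2)
```

Every synthetic point lies on the segment between a minority sample and one of its neighbors, so the new samples stay inside the region already occupied by the minority class.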
84. The feature variables are converted into vector signals.
85. The feature signal is fed into the input layer, and the signal output from the input layer to the hidden layer is calculated with the sigmoid activation function.
86. The input and output signals of the input gate, the input signal and state value of the memory cell, the input and output signals of the forget gate, and the input and output signals of the output gate are calculated in sequence.
87. The final memory cell output vector is calculated.
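For reference, the gate computations named in steps 86 and 87 are commonly written as follows. This is the standard LSTM formulation, not equations quoted from this embodiment; the symbols ($x_t$ for the input at step $t$, $h_t$ for the hidden output, $c_t$ for the memory cell state, $W$, $U$, $b$ for weights and biases, $\sigma$ for the sigmoid function, $\odot$ for element-wise multiplication) are assumptions introduced here.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell input)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell state)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(memory cell output)}
\end{aligned}
```

The three gates correspond to the three gating mechanisms whose parameters are initialized before training.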
88. The final output vector of the memory cell is used as the input of the next hidden layer, and steps 85 and 86 are repeated to obtain the output of the next hidden layer.
89. Similarly, the outputs of the third and fourth hidden layers are obtained.
810. The error function (logarithmic loss function) is calculated from the predicted value, and the weights are updated.
811. The signal obtained in step 89 is passed through the output layer to obtain the predicted value.
812. Step 82 to step 811 are repeated until the maximum number of iterations is reached.
813. Regarding the learning rate: the learning rate determines how strongly the weights are updated according to the loss function. When the learning rate is high, the model trains quickly but may perform relatively poorly; when the learning rate is low, training takes a long time. Because the data volume in this embodiment is relatively small, a learning rate of 0.001 is selected to ensure the accuracy of the model prediction. Regarding parameter tuning, the LSTM model has 7 parameters in total. The parameter table for the LSTM model is as follows:
TABLE 3 LSTM model parameter values
Figure BDA0003487566540000131
For the fully-connected neural network, the non-time-series part of the data is extracted to form a 24000 × 6 × 5 three-dimensional list. The list is then flattened into an input of shape batch size × 30 and fed into a three-layer fully-connected neural network for training. The activation function chosen in this embodiment is the sigmoid function, the error function is the mean squared error, and the optimization function is Adam. There are 3 hidden layers with 60, 30 and 16 nodes in sequence, the batch size is 90, and the number of iterations is set to 900.
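The shapes described above can be checked with a forward-pass-only numpy sketch. The weights are random and untrained, the input values are random placeholders, and no loss or Adam update is shown; only the layer widths (60, 30, 16), the sigmoid activation, the batch size of 90 and the 6 × 5 to 30 flattening come from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# One batch of 90 samples, each a 6 x 5 block of placeholder values.
static = rng.normal(size=(90, 6, 5))
x = static.reshape(len(static), -1)  # flatten to (90, 30)

# Input width followed by the three hidden-layer widths from the text.
layer_sizes = [30, 60, 30, 16]
weights = [
    rng.normal(scale=0.1, size=(m, n))
    for m, n in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n) for n in layer_sizes[1:]]

# Forward pass through the three sigmoid layers.
h = x
for w, b in zip(weights, biases):
    h = sigmoid(h @ w + b)
```

The resulting 16-wide representation per sample is what would be spliced with the LSTM output before the final classification layer.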
814. After training, the model is saved. The model outputs a real number between 0 and 1 representing the customer's default probability, helping the bank evaluate and predict more reasonably and effectively whether a customer poses a default risk, so that the bank can screen for high-quality customers and filter out high-risk ones.
Example two: correspondingly, the embodiment provides a credit risk assessment system based on time sequence deep learning and legal document information. The system provided by this embodiment can implement the credit risk assessment method based on time series deep learning and legal document information of the first embodiment, and the system can be implemented by software, hardware or a combination of software and hardware. For convenience of description, the present embodiment is described with the functions divided into various units, which are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in one or more pieces. For example, the system may comprise integrated or separate functional modules or units to perform the corresponding steps in the method of an embodiment. Since the system of the present embodiment is basically similar to the method embodiment, the description process of the present embodiment is relatively simple, and reference may be made to part of the description of the first embodiment for relevant points.
The credit risk assessment system based on time series deep learning and legal document information provided by the embodiment comprises:
a data acquisition unit configured to determine an optimal observation period and classify the decisions according to litigation conditions of the loan applicant and the decision result;
the entity extraction unit is configured to crawl legal decision documents within set time, configure document entity extraction rules and dictionaries, and extract legal document entities by adopting a rule-based extraction method;
an event extraction unit configured to perform preprocessing on the extracted legal document data and perform event extraction on the legal document text information;
the characteristic acquisition unit is configured to select legal document characteristics with strong prediction capability by using an RFE recursive characteristic selection method;
and the risk evaluation unit is configured to set the mixed data set and train the LSTM model to obtain an evaluation model for evaluating the credit risk.
Example three: the present embodiment provides an electronic device corresponding to the credit risk assessment method based on time series deep learning and legal document information provided in the first embodiment, where the electronic device may be an electronic device for a client, such as a mobile phone, a notebook computer, a tablet computer, a desktop computer, and the like, to execute the method of the first embodiment.
As shown in fig. 7, the electronic device includes a processor, a memory, a communication interface, and a bus, and the processor, the memory, and the communication interface are connected by the bus to perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The memory stores a computer program capable of running on the processor, and the processor executes the credit risk assessment method based on time series deep learning and legal document information provided by the embodiment when running the computer program. Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In some implementations, the logic instructions in the memory may be implemented as software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an optical disk, and various other media capable of storing program code.
In other implementations, the processor may be various general-purpose processors such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), and the like, and is not limited herein.
Example four: the time-series deep learning and legal document information-based credit risk assessment method of this embodiment can be embodied as a computer program product, which can include a computer-readable storage medium having computer-readable program instructions embodied thereon for executing the time-series deep learning and legal document information-based credit risk assessment method of this embodiment.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any combination of the foregoing.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is brief, and for the relevant points reference may be made to the corresponding description of the method embodiment. In the description herein, references to "one embodiment," "some implementations," or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the specification. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A credit risk assessment method based on time sequence deep learning and legal document information is characterized by comprising the following steps:
determining the optimal observation period, and classifying the judgment according to the litigation state and the judgment result of the subject to be evaluated;
crawling a legal decision document within a set time, configuring document entity extraction rules and a dictionary, and extracting legal document entities by adopting a rule-based extraction method;
preprocessing the extracted legal document data, and extracting events from the text information of the legal document;
selecting legal document features with strong prediction capability by using an RFE recursive feature selection method;
and setting a mixed data set and training an LSTM model to obtain an evaluation model for evaluating the credit risk.
2. The credit risk assessment method based on time series deep learning and legal document information according to claim 1, wherein the chi-square test is adopted to determine the optimal observation period; legal document judgments are divided into four categories: judgments in which the litigation state is defendant and the judgment result is unfavorable to the defendant; judgments in which the litigation state is defendant and the judgment result is favorable to the defendant; judgments in which the litigation state is plaintiff and the judgment result is unfavorable to the plaintiff; and judgments in which the litigation state is plaintiff and the judgment result is favorable to the plaintiff.
3. The credit risk assessment method based on time series deep learning and legal document information according to claim 1, wherein the legal judgment documents within the set time are crawled by using the Scrapy framework together with a Selenium simulated browser in Python.
4. The credit risk assessment method based on time series deep learning and legal document information as claimed in claim 1, wherein the extracted legal document data is preprocessed by data deduplication, missing value processing and/or data import, wherein the data deduplication comprises deduplication processing based on the case number of the legal document to be detected and the corresponding court information data of the legal document; missing value processing includes converting credit data to obtain vectorized data of a uniform format; data import includes merging data from various data sources to form a data set.
5. The credit risk assessment method based on time series deep learning and legal document information as claimed in claim 3, wherein the event extraction of the legal document text information comprises:
a keyword list is defined for regular-expression matching, and several dictionaries for regular-expression matching are defined in Python code according to the keyword list; the crawled legal document text data is traversed in a loop and matched line by line against all entries of the defined Python dictionaries; after a successful match, the information at the position of the keyword is extracted and stored in the corresponding Python list; the collected legal document feature field information is then written to a file in a loop for storage, thereby obtaining the initial feature set of the legal document.
6. The credit risk assessment method based on time series deep learning and legal document information as claimed in claim 1, wherein the RFE recursive feature selection method is used to select legal document features with strong prediction ability, comprising:
s1, constructing an initial feature set F of the legal document, wherein the initial feature set comprises case numbers, titles, case groups, case types, trial court, release dates, litigation states, cases, parties, public dates, referee dates, trial programs, judgment results and involved amount;
s2, setting an initial feature set Fx as an original data set, setting an optimal feature set Fy as a null, setting a root mean square error value of an optimal feature subset as Rx, generating a decision tree by carrying out bootstrap resampling on the Fx for modeling, establishing a random forest classification model, obtaining a final classification result by voting, and training the model by using all feature variables;
s3, calculating the importance of each characteristic variable, sequencing the characteristic variables, calculating a root mean square error value Rx, and sequencing the characteristic variables in a descending order according to the absolute value | C | of the characteristic score;
s4, deleting ranked features Fi in the subset Fx until the feature set Fx is empty, if the root mean square error value Ry of the feature subset Fy is smaller than Rx, then Ry is equal to Rx, otherwise, executing S3 and S4;
and S5, outputting the optimal feature subset Fy, wherein the obtained optimal feature variables comprise case numbers, release dates, litigation states, cases, judgment results and money in the legal documents.
7. The time-series deep learning and legal document information-based credit risk assessment method according to claim 1, wherein the mixed data set comprises an optimal feature variable set in the legal document information, time-series related data of the user recorded in time sequence, and demographic data.
8. A credit risk assessment system based on time series deep learning and legal document information, the system comprising:
the data acquisition unit is configured to determine an optimal observation period and classify the judgment according to the litigation state of the evaluation subject and the judgment result;
the entity extraction unit is configured to crawl legal decision documents within set time, configure document entity extraction rules and dictionaries, and extract legal document entities by adopting a rule-based extraction method;
an event extraction unit configured to perform preprocessing on the extracted legal document data and perform event extraction on the legal document text information;
the characteristic acquisition unit is configured to select legal document characteristics with strong prediction capability by using an RFE recursive characteristic selection method;
and the risk evaluation unit is configured to set the mixed data set and train the LSTM model to obtain an evaluation model for evaluating the credit risk.
9. An electronic device comprising at least a processor and a memory, the memory having stored thereon a computer program, characterized in that the processor, when executing the computer program, executes to carry out the method of any of claims 1 to 7.
10. A computer storage medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 7.
CN202210085355.XA 2022-01-25 2022-01-25 Credit risk assessment method based on time sequence deep learning and legal document information Pending CN114519508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210085355.XA CN114519508A (en) 2022-01-25 2022-01-25 Credit risk assessment method based on time sequence deep learning and legal document information

Publications (1)

Publication Number Publication Date
CN114519508A true CN114519508A (en) 2022-05-20

Family

ID=81596653

Country Status (1)

Country Link
CN (1) CN114519508A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115796285A (en) * 2023-02-13 2023-03-14 上海百事通法务信息技术有限公司浙江分公司 Litigation case prejudging method and device based on engineering model and electronic equipment
CN116342246A (en) * 2023-03-06 2023-06-27 浙江孚临科技有限公司 Method, device and storage medium for evaluating risk of default
CN116342246B (en) * 2023-03-06 2024-04-23 浙江孚临科技有限公司 Method, device and storage medium for evaluating risk of default

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination