WO2017067153A1 - Credit risk assessment method and apparatus based on text analysis, and storage medium - Google Patents

Credit risk assessment method and apparatus based on text analysis, and storage medium

Info

Publication number
WO2017067153A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
features
feature
credit risk
classifier
Prior art date
Application number
PCT/CN2016/081998
Other languages
English (en)
French (fr)
Inventor
刘宏志
蒋杰
王巨宏
管刚
吴中海
张兴
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2017067153A1
Priority to US15/728,128 (US11164075B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06 Asset management; Financial planning or analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Definitions

  • The invention relates to the field of internet finance, and in particular to a credit risk assessment method and apparatus based on text analysis, and a storage medium.
  • P2P online lending is growing at an astonishing rate and has attracted widespread attention, bringing both opportunities and challenges. Owing to China's particular historical background, P2P online lending has developed rapidly there and has reached a relatively large scale. China's financial sector is regulated to a certain extent, and existing financial services no longer satisfy the increasingly diverse financial needs of a large number of SMEs and individuals, which has driven the rapid development of P2P online lending. Because P2P online lending innovates so quickly while supervision is lacking, it is prone to systemic risks such as amount and maturity mismatches, illegal fund raising, and liquidity traps. In terms of payment, there is still no complete authentication system and supervision is lacking, which exposes users to transaction fraud, privacy leaks, and other risks. In terms of financing, credit risk has become more prominent as the efficiency of the use of social funds improves.
  • To solve at least one problem in the prior art, embodiments of the present invention provide a credit risk assessment method and apparatus based on text analysis, and a storage medium, which can effectively evaluate the credit risk of a borrower and thereby give investors an important basis for investment decisions.
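  • An embodiment of the present invention provides a credit risk assessment method based on text analysis, where the method includes:
  • acquiring a text of a borrower;
  • analyzing the text to obtain a basic language feature, the basic language feature being used to predict whether the borrower will default;
  • inputting the basic language feature into a preset credit risk assessment model, and obtaining a credit risk value of the borrower output by the credit risk assessment model; and
  • outputting the credit risk value of the borrower.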
  • An embodiment of the present invention provides a credit risk assessment apparatus based on text analysis, where the apparatus includes a first acquisition unit, an analysis unit, a processing unit, and an output unit, where:
  • the first acquisition unit is configured to acquire a text of the borrower;
  • the analysis unit is configured to analyze the text to obtain a basic language feature, where the basic language feature is used to predict whether the borrower will default;
  • the processing unit is configured to input the basic language feature into a preset credit risk assessment model, and obtain a credit risk value of the borrower output by the credit risk assessment model;
  • the output unit is configured to output the credit risk value of the borrower.
  • An embodiment of the present invention further provides a computer storage medium.
  • Embodiments of the present invention provide a credit risk assessment method and apparatus based on text analysis, and a storage medium, in which a borrower's text is obtained; the text is analyzed to obtain a basic language feature, the basic language feature being used to predict whether the borrower will default; the basic language feature is input into a preset credit risk assessment model, and the credit risk value of the borrower output by the credit risk assessment model is obtained; and the credit risk value of the borrower is output. In this way, the credit risk of the borrower can be effectively evaluated, giving investors an important basis for investment decisions.
  • FIG. 1 is a schematic flowchart of an implementation of a credit risk assessment method based on text analysis according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram showing the relationship between abstract text features and basic language features in an embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing a general process of feature selection according to an embodiment of the present invention.
  • FIG. 4-1 is a schematic diagram comparing the credit evaluation effects of financial features and text features according to an embodiment of the present invention.
  • FIG. 4-2 is a schematic diagram comparing the credit evaluation effects of financial features and financial + text features according to an embodiment of the present invention.
  • FIG. 4-3 is a schematic diagram showing the influence of different text features on credit evaluation in an embodiment of the present invention.
  • FIG. 5-1 is a schematic structural diagram of a credit risk assessment system based on multiple classifiers according to an embodiment of the present invention.
  • FIG. 5-2 is a performance comparison diagram of combining different numbers of classifiers according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a credit risk assessment apparatus based on text analysis according to Embodiment 6 of the present invention.
  • In P2P online lending, a P2P online lending company usually acts as an intermediary platform that displays the lending information of both parties. Investors and borrowers transact online through free bidding, and the company earns a service fee when a transaction succeeds.
  • The general process of P2P online lending can also be described simply as a loan transaction carried out over the network between individuals. When the loan matures, the borrower must repay the principal and pay interest to the lender; the lender, while earning this income, bears the risk that the principal is not repaid.
  • Credit evaluation, also called credit rating, plays an important role in building a credit system. It forms a comprehensive understanding of an enterprise or individual according to certain indicators and methods, and assesses the credit level scientifically and objectively from the collected information. Its main aim is to obtain the probability that the borrower under investigation will default and to judge whether the borrower can fulfil the agreed obligations on time; in P2P lending, this means repaying the money on time. Credit evaluation is fundamentally a classification problem in data mining: it divides a population into two or more subsets according to different characteristics. In loan credit evaluation, borrowers are generally classified into creditworthy “good” users and “bad” users with credit risk, that is, the positive and negative examples of the classification. Classifying borrowers into these two categories using historical credit data helps investors understand the potential risk of an investment.
  • Such data is called credit data.
  • Credit data can be divided into structured data and unstructured data. Social network comments, user-uploaded audio and video, and user-filled applications, which exist as text, images, audio, video, and so on, are all unstructured data.
  • Hard information refers to accurate, logical, and traceable information, that is, information that can be directly verified. It can be quantified, recorded in documents, and accurately transmitted; examples include financial statements and salary levels.
  • Soft information refers to information that is given subjectively by the information provider and cannot be directly verified by others.
  • 28,853 loan records generated on the Prosper platform between January 1, 2006 and December 31, 2008 are used as training data.
  • When a borrower applies for a loan through the P2P lending platform, the borrower needs to fill out a loan application description.
  • The application description, together with the borrower's financial information, can be used as training data to study the features that affect credit and to tune the model trained on these features, forming an effective credit risk assessment system.
  • The credit risk of a loan is assessed from the borrower's textual features. For example, relevant data (borrowers' text descriptions) is obtained from the world's largest P2P online lending platform; machine learning methods and statistical methods are then used to extract six abstract text features from each borrower's text description; and these six abstract text features are used to assess the borrower's willingness to repay and repayment ability. The six features are subjectivity, deceptiveness, text readability, emotion, personality, and way of thinking.
  • The credit risk of P2P online lending is determined by two factors: repayment ability and willingness to repay.
  • Repayment ability is the major factor: it determines whether the borrower can repay on time, which depends on the borrower's economic status.
  • Willingness to repay, as the subordinate factor, depends on the borrower's ideas and attitudes.
  • The embodiment of the invention provides a credit risk assessment method based on text analysis, which is applied to a computing device.
  • The computing device can be an electronic device with information processing capabilities, such as a personal computer, a server, an industrial computer, or a notebook computer.
  • The functions implemented by the method can be realized by a processor in the computing device calling program code.
  • The program code can be stored in a computer storage medium.
  • The computing device therefore includes at least a processor and a storage medium.
  • FIG. 1 is a schematic flowchart of an implementation of a credit risk assessment method based on text analysis according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
  • Step S101: acquiring a text of the borrower.
  • The text may be any text written by the borrower regarding the loan; for example, the borrower's loan application may serve as the text of the borrower in the embodiment of the present invention.
  • Step S102: analyzing the text to obtain a basic language feature, where the basic language feature is used to predict whether the borrower will default.
  • The basic language feature may be extracted from the text by natural language processing methods, such as a topic model method. Natural language processing refers to methods and theories that use an automated machine as a tool to identify, transmit, store, and understand text at different granularities, such as words, sentences, paragraphs, and documents, through computable methods. It can perform word segmentation, part-of-speech tagging, structural analysis, and even meaning understanding, so as to obtain more features that represent the text from different aspects.
  • The basic language feature includes at least a statistical feature, a part-of-speech feature, an emotional feature, an entity feature, and a temporal feature of the text. The statistical feature includes a sentence feature, a word feature, and a punctuation feature, where: the sentence feature includes at least the total number of sentences, the average sentence length, the maximum sentence length, and the number of question sentences; the word feature includes at least the average word length, the maximum number of word types, the total number of words, the average number of occurrences of a word, and the maximum number of occurrences of a word; and the punctuation feature includes at least the proportion of question marks and the proportion of exclamation marks.
  • Step S103: inputting the basic language feature into a preset credit risk assessment model, and obtaining the credit risk value of the borrower output by the credit risk assessment model.
  • The credit risk assessment model is established in advance; the process of establishing it is described below.
  • The credit risk assessment model may be a single classifier or a credit risk assessment system composed of multiple classifiers. A classifier may be regarded as an expert system for a certain domain or aspect, and a credit risk assessment system consisting of multiple classifiers can be seen as a hybrid expert system.
  • Step S104: outputting the credit risk value of the borrower.
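  • Steps S101 to S104 can be sketched end to end as follows. This is only a minimal illustration: `extract_basic_features`, `toy_risk_model`, and the scoring rule inside it are hypothetical placeholders, not the model described in this document.

```python
import re

def extract_basic_features(text):
    """S102: a tiny stand-in for basic language feature extraction."""
    sentences = [s for s in re.split(r"[.?!]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {"sentence_count": len(sentences), "word_count": len(words)}

def toy_risk_model(features):
    """S103: hypothetical preset model returning a risk value in (0, 1]."""
    # Placeholder rule (an assumption for illustration only):
    # shorter descriptions are scored as riskier.
    return 1.0 / (1.0 + features["word_count"] / 10.0)

def assess_credit_risk(text, model=toy_risk_model):
    """Run S102 and S103 on a borrower text and return the risk value."""
    features = extract_basic_features(text)  # S102
    return model(features)                   # S103

if __name__ == "__main__":
    # S101: the borrower's text, here a made-up loan description.
    borrower_text = "I need a loan to consolidate my debt. I have a stable job."
    # S104: output the credit risk value.
    print(round(assess_credit_risk(borrower_text), 3))
```

  • Passing the model as a parameter keeps the pipeline unchanged when the placeholder is swapped for a trained single classifier or a multi-classifier system.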
  • The method further includes Step S100, establishing the credit risk assessment model, which includes:
  • Step S111: acquiring training data.
  • The training data is text written by borrowers about their loans.
  • Step S112: analyzing the training data to obtain basic language features of the training data.
  • Step S112 is similar to step S102 above and is described in detail in the following embodiments.
  • Step S113: using the basic language features as parameters, and using a machine learning method to establish classifiers corresponding to different abstract text features.
  • The machine learning method includes: an artificial neural network method, a support vector machine method, a decision tree method, a Bayesian method, a random forest method, and a logistic regression method.
  • Different machine learning methods may also be used to establish classifiers for the same abstract text feature. For example, for deceptiveness, one classifier may be established with the artificial neural network method, another with the Bayesian method, and another with the random forest method; the classifier with the highest accuracy is then used as the classifier corresponding to that abstract text feature.
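  • Selecting the most accurate candidate for one abstract text feature might look like the sketch below. The candidates are assumed to be already trained and are represented as plain callables; the names `accuracy` and `select_best_classifier` and the toy data are illustrative assumptions.

```python
def accuracy(classifier, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for x, y in zip(samples, labels) if classifier(x) == y)
    return correct / len(labels)

def select_best_classifier(candidates, samples, labels):
    """Return (name, classifier) for the candidate with the highest accuracy."""
    return max(candidates.items(),
               key=lambda item: accuracy(item[1], samples, labels))

if __name__ == "__main__":
    # Toy validation data: one numeric feature, label 1 = deceptive.
    samples = [1, 2, 3, 4]
    labels = [0, 0, 1, 1]
    # Stand-ins for classifiers trained with different methods.
    candidates = {
        "neural_network": lambda x: int(x > 2),
        "bayes": lambda x: 1,
        "random_forest": lambda x: int(x > 3),
    }
    name, _ = select_best_classifier(candidates, samples, labels)
    print(name)
```

  • In practice the accuracy would be measured on held-out data rather than on the training set, so that the comparison between methods is not biased toward overfitted models.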
  • Using the basic language features as parameters includes: inputting the basic language features into the classifier corresponding to each abstract text feature according to the relationship between the basic language features and the abstract text features.
  • Subjectivity corresponds to part-of-speech features and emotional features;
  • deceptiveness corresponds to part-of-speech features, emotional features, entity features, and temporal features;
  • readability corresponds to statistical features;
  • emotion corresponds to emotional features;
  • personality corresponds to statistical features;
  • way of thinking corresponds to part-of-speech features and entity features.
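  • The correspondence above can be held in a small lookup table that routes each group of basic language features to the classifier for one abstract text feature. The key names and the `select_inputs` helper below are illustrative assumptions.

```python
# Abstract text feature -> groups of basic language features it draws on,
# following the correspondence listed above.
FEATURE_MAP = {
    "subjectivity": ["part_of_speech", "emotion"],
    "deceptiveness": ["part_of_speech", "emotion", "entity", "tense"],
    "readability": ["statistical"],
    "emotion": ["emotion"],
    "personality": ["statistical"],
    "way_of_thinking": ["part_of_speech", "entity"],
}

def select_inputs(basic_features, abstract_feature):
    """Collect only the basic-feature groups mapped to one abstract feature."""
    selected = {}
    for group in FEATURE_MAP[abstract_feature]:
        for name, value in basic_features[group].items():
            selected[f"{group}.{name}"] = value
    return selected

if __name__ == "__main__":
    # Made-up feature values, grouped by basic language feature type.
    basic = {
        "statistical": {"total_sentences": 3},
        "part_of_speech": {"noun": 5},
        "emotion": {"positive": 2},
        "entity": {"time": 1},
        "tense": {"past": 1},
    }
    print(select_inputs(basic, "way_of_thinking"))
```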
  • Step S114: using the classifiers as basic classifiers, and using a decision tree algorithm to perform decision fusion to form the credit risk assessment model.
  • That is, the classifiers corresponding to the abstract text features serve as the basic classifiers, and the decision tree algorithm fuses their decisions into the credit risk assessment model.
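  • Decision fusion with a decision tree can be sketched by treating each basic classifier's output as one input feature of the tree. The code below is a deliberately small tree learner over binary outputs using Gini impurity; it is an assumed simplification for illustration, and the source does not specify which decision tree algorithm is used.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """Grow a tree over binary base-classifier outputs (0 or 1 per feature)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority  # leaf: predicted class

    def split_cost(f):
        left = [y for r, y in zip(rows, labels) if r[f] == 0]
        right = [y for r, y in zip(rows, labels) if r[f] == 1]
        if not left or not right:
            return float("inf")  # split separates nothing
        return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)

    best = min(features, key=split_cost)
    if split_cost(best) == float("inf"):
        return majority
    rest = [f for f in features if f != best]
    branches = {}
    for value in (0, 1):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in subset],
                                     [y for _, y in subset], rest)
    return (best, branches)

def predict(tree, row):
    """Follow the base-classifier outputs down the tree to a fused decision."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[row[feature]]
    return tree

if __name__ == "__main__":
    # Each row holds the 0/1 outputs of two hypothetical basic classifiers.
    rows = [{"deceptiveness": 1, "emotion": 0}, {"deceptiveness": 1, "emotion": 1},
            {"deceptiveness": 0, "emotion": 0}, {"deceptiveness": 0, "emotion": 1}]
    labels = ["default", "default", "repay", "repay"]
    tree = build_tree(rows, labels, ["deceptiveness", "emotion"])
    print(predict(tree, {"deceptiveness": 1, "emotion": 0}))
```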
  • Establishing the credit risk assessment model further includes: segmenting the training data into sentences according to sentence-ending punctuation marks, and computing statistics on the segmented training data to obtain the statistical features.
  • The sentence-ending punctuation marks include at least the period, the question mark, and the exclamation mark.
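  • A minimal sketch of this segmentation and counting is shown below. The feature names, the simple regular-expression tokenizer, and the choice to compute the punctuation proportions over a fixed set of marks are assumptions made for illustration.

```python
import re
from collections import Counter

WORD = r"[A-Za-z']+"

def statistical_features(text):
    """Segment on '.', '?', '!' and compute simple sentence/word/punctuation statistics."""
    sentences = [s.strip() for s in re.split(r"[.?!]+", text) if s.strip()]
    words = re.findall(WORD, text.lower())
    counts = Counter(words)
    sentence_lengths = [len(re.findall(WORD, s)) for s in sentences]
    # Denominator for the punctuation proportions (at least 1 to avoid division by zero).
    punctuation = max(sum(text.count(p) for p in ".?!,;:"), 1)
    return {
        "total_sentences": len(sentences),
        "avg_sentence_length": sum(sentence_lengths) / max(len(sentences), 1),
        "max_sentence_length": max(sentence_lengths, default=0),
        # Runs of question marks approximate the number of question sentences.
        "question_sentences": len(re.findall(r"\?+", text)),
        "total_words": len(words),
        "word_types": len(counts),
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),
        "max_word_occurrences": max(counts.values(), default=0),
        "question_mark_ratio": text.count("?") / punctuation,
        "exclamation_mark_ratio": text.count("!") / punctuation,
    }

if __name__ == "__main__":
    print(statistical_features("I need a loan. Can you help? Thank you!"))
```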
  • Embodiments of the present invention provide a credit risk assessment method and apparatus based on text analysis, in which a borrower's text is obtained; the text is analyzed to obtain a basic language feature, the basic language feature being used to predict whether the borrower will default; the basic language feature is input into a preset credit risk assessment model, and the credit risk value of the borrower output by the credit risk assessment model is obtained; and the credit risk value of the borrower is output.
  • In this way, the credit risk of the borrower can be effectively evaluated, giving investors an important basis for investment decisions.
  • FIG. 2 is a schematic diagram showing the relationship between the abstract text features and the basic language features in the embodiment of the present invention. As shown in FIG. 2, in order to mine useful information from the borrower's text information, various abstract text features are first identified from the text information, where the abstract text features describe various aspects of the borrower; the basic language features are then constructed and combined according to the abstract text features.
  • Based on knowledge of psychology and linguistics, six abstract text features for credit risk assessment are identified from the textual descriptions: deceptiveness, subjectivity, emotion, text readability, personality, and way of thinking.
  • Deceptiveness is used to distinguish deceivers from honest people.
  • Deceptiveness is defined along four dimensions: cognitive load, internal imagination, dissociation, and negative emotion.
  • Specificity and cohesion are often used to measure the magnitude of cognitive load.
  • Specificity can be obtained from the MRC Psycholinguistic Database using the Coh-Metrix program, and cohesion is often closely related to the number of connectives. Studies have shown that deceptive descriptive texts are highly specific and very cohesive.
  • Dissociation is related to the use of personal pronouns. To distance themselves from a false story, deceivers tend to use more third-person words (such as "her" and "he") when describing it.
  • Negative emotion is associated with the use of emotional words: because of increased guilt, deceivers tend to use more negative vocabulary than honest people.
  • Subjectivity analysis is a kind of text mining used to evaluate whether a text tends toward the subjective or the objective, that is, whether it conveys information about the objective world or the individual's feelings. Studies have shown that borrowers whose texts contain more objective information are more likely to default. Once a series of objective information about the loan has been provided, a borrower with good credit focuses in the text description on explaining the purpose of the loan and gives more subjective information, whereas a borrower with default risk is unwilling to touch on unpleasant facts and uses a lot of objective information in the description. Therefore, subjectivity is closely related to the subjectivity of the vocabulary, to other emotional and entity features that reflect human thought, and to the use of modal verbs, numerals, adjectives, and adverbs.
  • Emotion captures the emotional orientation of the borrower's text description. By processing the text, it is judged whether the borrower is positive or negative, friendly or not, and the borrower's viewpoint, emotion, and attitude are understood at a deeper level. Combined analysis of the basic emotional features forms a more comprehensive, multi-dimensional picture of the emotion of the text. The more positively a borrower approaches life, the more creditworthy the borrower is, and vice versa.
  • Personality traits include behavior, temperament, emotion, and inner spirit.
  • The formation of personality traits is a long-term, stable process that affects many different aspects of individual behavior. For example, people who are willing to share and who motivate themselves positively are less likely to default than pessimistic people. The more important a difference in personality is, the more easily it is reflected in individual words, so the linguistic features of a text also reflect the characteristics of an individual's personality.
  • Personality can be defined in five dimensions, known as the Big Five. The first is extroversion. Extroverts are more willing to communicate with others; they tend to use short sentences and fewer word types; their texts make heavy use of verbs, pronouns, adverbs, and interjections; the emotions of their texts are mostly positive and optimistic; and their texts contain more social vocabulary. Among the many factors that influence introverted and extroverted personality traits, the most important dimension can be selected by calculating formality:
  • Personality traits are closely related to linguistic features: not only to the characteristics of the words themselves (such as word length and word type), but also to the emotional tendency of words (positive, negative), part of speech (adjectives, verbs), person and tense (third person, past tense), and the specific meaning of words (social vocabulary).
  • Functional vocabulary reflects the way the author communicates rather than the actual content described. It is more consistent with the author's social environment and psychological world, and the functional vocabulary used changes correspondingly as events develop and the author's cognition changes.
  • Perceived complexity describes the richness of the argumentation, that is, the degree of differentiation between conflicting options and the integration between different solutions; it is usually expressed by exclusive words and conjunctions.
  • People who tell true stories are more inclined to use exclusive words.
  • The two cognitive mechanisms, causal vocabulary and insight vocabulary, often appear in descriptions of past events and reflect reflection on what has happened. A person who is unsure about what is being described prefers to use tentative vocabulary and filler words as a buffer, and excessive use of tentative vocabulary indicates that the truth of the story is questionable. Therefore, the way of thinking is related to part of speech and to perceptual entity words, such as words that describe reasons and opinions. Combining these two kinds of basic features makes it possible to abstract the author's way of thinking and so reflect more vividly the author's true intention when writing the text.
  • The readability of a text is a relatively traditional indicator for measuring text that reflects the author's educational level and social status. It has been used in fields such as product feedback, purchase intention, and social media information review, and it indicates how difficult the text is for a reader to understand.
  • The readability of text is measured in three dimensions: vocabulary type, lexical legibility, and lexical complexity.
  • For vocabulary type, Formula (1) is usually used, a measure of vocabulary richness that does not depend on the length of the text.
  • In Formula (1), N is the length of the text, and
  • V(i,N) is the number of word types that appear exactly i times in the text.
  • Lexical legibility and lexical complexity are also closely related to the sentence length, the word length, and the vocabulary type of the text.
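  • Formula (1) itself is not reproduced in this text. The description (a length-independent richness measure defined through N and V(i,N)) is consistent with Yule's characteristic K, commonly written as K = 10^4 × (Σ i²·V(i,N) − N) / N². The sketch below assumes that form; it is an assumption, not a confirmed reconstruction of Formula (1).

```python
from collections import Counter

def yules_k(words):
    """Yule's characteristic K (assumed form of the vocabulary richness measure).

    N is the number of word tokens; V(i, N) is the number of word types
    that occur exactly i times. Lower K means a richer vocabulary.
    """
    n = len(words)
    if n == 0:
        return 0.0
    freq = Counter(words)              # word type -> number of occurrences
    spectrum = Counter(freq.values())  # i -> V(i, N)
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)

if __name__ == "__main__":
    print(yules_k("the loan will pay for the new roof of the house".split()))
```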
  • Abstract text features can reflect the author's default intention and credit habits at an abstract level, but they cannot be extracted directly from the text. Therefore, according to the meaning of the abstract text features and the textual factors that affect them, five basic language features are summarized. These can be obtained directly from the text through machine learning methods and statistical methods, and are then used to represent the abstract text features and ultimately to capture their intrinsic relationship with the willingness to repay, serving as features for predicting default.
  • The most easily extracted features can be obtained directly through statistics, such as the number of sentences, the number of words, and the length of words in the text. These features are easy to calculate and reflect, to varying degrees, the writer's writing attitude and even attitude to life. For example, sentence length reflects the readability of the text to a certain extent: the longer the sentences in a text, the harder it is to read, and the more meaning it expresses. In addition, for the same statistical object, such as words, quantity and category carry different meanings: the number of words indicates the length of the text, and the number of word categories indicates the vocabulary used in the text, reflecting the diversity of word usage.
  • The embodiment of the present invention adopts a statistical method, which is simple and easy to implement.
  • Seventeen simple statistical features are extracted, representing sentence features and word features respectively; the meaning of each feature is listed in Table 2-1.
  • A maximum entropy model is used to perform part-of-speech tagging on words, that is, to assign each word a part-of-speech category, such as adverb or conjunction.
  • The key problem of the maximum entropy model is feature selection.
  • The selected features directly affect the accuracy of the tagging. The parts of speech of adjacent words in a text affect each other, and the part of speech of a word is also related to its own suffix and to the adjacent words. Therefore, the contextual characteristics of the words in the text and the characteristics of the words themselves are both considered in forming the maximum entropy feature template, as shown in Table 2-2:
  • wi, wi+1, ti, ti-1, and ti-2 denote, respectively, the current word, the previous word, the current part of speech, the part of speech of the previous word, and the part of speech of the word before that.
  • The general features apply to every word, and a rare feature is added only when the word matches the template described by the rare feature type.
  • The training corpus of the maximum entropy model is the Penn Treebank, and the part-of-speech tag set it provides is used.
  • The part-of-speech features used in this embodiment are at the word level, giving 36 word-level part-of-speech features. Because this classification is too fine-grained (for example, singular and plural nouns belong to different classes, and the comparative and superlative forms of adjectives belong to two different classes), these categories are merged into 12 part-of-speech categories, and the number of words in each category is calculated. The categories and their specific types are shown in Table 2-3 and Table 2-4.
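  • Merging fine-grained tags into coarse categories and counting them might be sketched as follows. The tags are real Penn Treebank tags, but the grouping shown covers only a few categories and is an illustrative assumption; the actual 12 categories are those of Tables 2-3 and 2-4, which are not reproduced here.

```python
from collections import Counter

# Illustrative partial grouping of Penn Treebank tags (assumed, not Table 2-3).
COARSE_GROUPS = {
    "noun": ["NN", "NNS", "NNP", "NNPS"],
    "verb": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "adjective": ["JJ", "JJR", "JJS"],
    "adverb": ["RB", "RBR", "RBS"],
    "pronoun": ["PRP", "PRP$"],
    "modal": ["MD"],
}
# Invert to a fine-tag -> coarse-category lookup.
COARSE_TAG = {tag: coarse for coarse, tags in COARSE_GROUPS.items() for tag in tags}

def coarse_pos_counts(tagged_tokens):
    """Count coarse categories in a list of (word, Penn Treebank tag) pairs."""
    return Counter(COARSE_TAG.get(tag, "other") for _, tag in tagged_tokens)

if __name__ == "__main__":
    tagged = [("I", "PRP"), ("need", "VBP"), ("lower", "JJR"), ("payments", "NNS")]
    print(coarse_pos_counts(tagged))
```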
  • The emotion lexicon method is used to extract the emotional features.
  • Using the General Inquirer classification dictionary, the number and types of emotional polarity words are counted by matching the words of each category in the dictionary against the words of the experimental text.
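  • Lexicon-based counting of this kind can be sketched as below. `TOY_LEXICON` is a hypothetical miniature stand-in; the real General Inquirer categories and word lists are far larger and are not reproduced here.

```python
import re

# Hypothetical miniature lexicon used only for illustration.
TOY_LEXICON = {
    "positive": {"good", "happy", "stable", "thank"},
    "negative": {"bad", "debt", "lost", "sorry"},
}

def lexicon_features(text, lexicon=TOY_LEXICON):
    """Count matching words (tokens) and distinct matching words (types) per category."""
    words = re.findall(r"[a-z']+", text.lower())
    features = {}
    for category, vocab in lexicon.items():
        hits = [w for w in words if w in vocab]
        features[f"{category}_count"] = len(hits)
        features[f"{category}_types"] = len(set(hits))
    return features

if __name__ == "__main__":
    print(lexicon_features("I lost my job and the debt is bad, bad news. Thank you."))
```

  • The same routine applies to any category of a classification dictionary, so entity words can be counted by passing a different lexicon.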
  • The classification information in the General Inquirer dictionary comes from the Harvard IV-4 dictionary, the Lasswell value dictionary, and other sources, 156 in total. Fifteen features were selected according to whether they relate to the attitude of the writer, as shown in Table 2-5:
  • Entity features generally carry some practical meaning, such as time, space, and causal goals. Studies have shown that descriptive texts about real events contain more spatial and temporal information than descriptions of events fabricated from imagination. When judging whether a borrower will default, real and fabricated text descriptions can be distinguished by examining the entity information they contain. The General Inquirer classification dictionary is used here as well: the number and types of entity words are counted by matching the words of each category in the dictionary against the words of the experimental text. Nine entity features are finally obtained, as shown in Table 2-6:
  • Temporal features are extracted in two ways. First, because the Penn Treebank annotated corpus also marks the past, present, and future tense of verbs, the trained maximum entropy model can tag the text to obtain verb tense. Second, a dictionary of common English phrases denoting past, present, and future time is used to find matching time expressions in each sentence and so determine when the event described by the sentence occurs. Finally, the verb tense and the time adverbials of the sentence are combined to obtain the temporal features of the sentence.
  • This embodiment of the present invention introduces the natural language processing method and the machine learning method used in the first embodiment.
  • Natural language processing is the body of methods and theory for recognizing, transmitting, storing, and understanding language at different granularities (words, sentences, paragraphs, and documents) using automated machines as tools. It can perform word segmentation, part-of-speech tagging, structural analysis, and even meaning understanding on text, so as to obtain more features that represent the text from different aspects.
  • Part of speech is also called a word class, which refers to the basic grammatical attributes of vocabulary. It is usually divided according to the form, function and grammatical meaning of the words.
  • Part-of-speech tagging labels each word of a language with the word class to which it belongs. It is one of the basic and important tasks in natural language processing. Methods are usually divided into rule-based and statistics-based ones. A rule-based tagger first assigns every possible word class tag to each word in the sentence by dictionary lookup, then applies rules to gradually eliminate wrong tags until the correct result remains. Examples of part-of-speech tagging are as follows:
  • Entropy describes the uncertainty of the value of a random variable.
  • Entropy is positively correlated with this uncertainty: the larger the entropy, the closer the random variable is to a uniform distribution.
  • Under the principle of maximum entropy, the distribution with the largest entropy, that is, the most uniform one, should be selected among all distributions that satisfy the known constraints.
  • Statistical modeling based on the principle of maximum entropy is the best choice that can be made when nothing more is known about the distribution, since any choice that does not maximize entropy amounts to subjectively adding information that the data does not contain.
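The link between entropy and uniformity can be checked numerically with the Shannon entropy H(X) = -Σ p(x) log2 p(x). The two distributions below are made-up examples over four outcomes; this sketch illustrates the entropy measure only, not the maximum entropy tagger itself.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = entropy([0.25, 0.25, 0.25, 0.25])  # no outcome is favoured
skewed = entropy([0.7, 0.1, 0.1, 0.1])       # one outcome dominates
```

The uniform distribution reaches the maximum for four outcomes (2 bits), while the skewed one has lower entropy, which is why choosing the maximum entropy distribution adds the fewest assumptions.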
  • this embodiment uses the maximum entropy model to perform part-of-speech tagging on text.
  • The texts people write when they comment, write articles, or submit applications carry a great deal of emotional color and inclination, which can reflect the author's personality and attitude toward life to some extent, for example positive or negative, or approval or disapproval.
  • Emotions are judged from the unstructured data of people's writings.
  • Machine learning, taken literally, means that a machine learns as humans do: it draws insight from a data set and brings out the meaning behind the data.
  • The subject of this study is the effect of text on credit evaluation, which requires judging the borrower's credit level from the feature set mined from the text. The required information is hard to obtain directly from the raw text data or even from the feature set, so machine learning algorithms are needed to process this unordered data and turn it into quantitative features that a computer can handle. A model is then constructed that takes the text represented by these features as input and outputs the category to which the borrower's credit level most likely belongs.
  • The main tasks of machine learning are classification and regression, which match the task of this embodiment.
  • Classification determines the category to which an instance belongs based on its feature information.
  • Regression fits a best-fit curve to given data points. Both are supervised learning: what is to be predicted, that is, the class information of the target variable, must be known. The data is usually divided into a training set and a test set.
  • Whether the language features in the text submitted by P2P platform borrowers can improve the accuracy of credit evaluation, and whether these features have predictive value, is an important part of the research.
  • the text contains a large amount of information about the author himself. In addition to the semantic content, writing style, writing habits, etc. can also reflect the writer's personality characteristics, and even the credit level. But in general, the grammar, semantics, and sentimental tendencies contained in the text cannot be directly represented and processed by the computer, so these features need to be identified and quantified for use in text analysis.
  • The experimental data is processed to obtain the financial feature data and the text feature data mentioned above. With the financial features as the baseline, the text feature data and the combined text and financial feature data are each tested, and the effect on the credit evaluation model is observed in order to study the role of text features in credit assessment.
  • This embodiment uses five basic classification algorithms commonly used in machine learning, namely decision trees, naive Bayes, logistic regression, neural networks, and random forests, and explores the classification effect of text features in credit evaluation across these different models.
  • Experimental data: for the experimental data in this embodiment, refer to the data description.
  • The Listing data downloaded from the database cannot be used directly: it is not plain text and must be extracted from the XML format.
  • The other two texts also contain XML tags, so the tags and other non-text content are filtered out before features are extracted.
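A minimal sketch of this clean-up step, assuming the markup is simple enough for a regular expression (a real pipeline might prefer an XML parser). The element names in the sample string are invented for illustration and are not the actual Prosper export schema.

```python
import html
import re


def strip_markup(raw):
    """Drop XML/HTML tags, decode entities, and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


sample = "<Description><b>Need</b> a loan &amp; will repay.</Description>"
clean = strip_markup(sample)
```

Tags are replaced with a space rather than removed outright, so that words separated only by a tag are not glued together.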
  • The items in a loan record differ greatly in scale: the loan-to-income ratio lies between 0 and 1, loan amounts are in the thousands, and word counts, vocabulary features and the like range from a few to several hundred. Such large differences in magnitude easily cause weight imbalance, so after feature extraction every feature is transformed into the same range, that is, normalized.
  • This experiment uses simple min-max processing, also called dispersion standardization, which linearly transforms the original data so that the resulting values are mapped into the range 0 to 1.
  • The conversion function is as follows: $x^{*} = \dfrac{x - \min}{\max - \min}$, where
  • max is the maximum of the feature over the sample data,
  • min is the minimum of the feature over the sample data, and
  • x is the actual value to be transformed.
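The min-max transformation described above, which maps each value x to (x - min) / (max - min), can be sketched in a few lines. The sample word counts are made up, and returning zeros for a constant feature is an assumption of this sketch, since the source does not say how that case is handled.

```python
def min_max_normalize(values):
    """Linearly map values into [0, 1] using the column minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to scale, so map everything to 0.0 (assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


word_counts = [12, 40, 180, 96]  # made-up values for one feature
scaled = min_max_normalize(word_counts)
```

The same function would be applied to each feature column separately, so that a ratio in [0, 1] and an amount in the thousands end up on a comparable scale.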
  • FIG. 3 is a schematic diagram of a general process for describing feature selection according to an embodiment of the present invention.
  • A search algorithm is used to determine the feature subset.
  • This experiment mainly adopts best-first search, a complete-search method. The feature subset starts at size 1; each candidate feature is tried in turn, the classification effect of the subset with that feature added is computed, and an evaluation function then judges the classification effect of the subset.
  • The wrapper evaluation method is adopted: it selects a feature subset for each specific classifier by classifying the samples and using the classifier's error rate as the metric, so the resulting classification effect is good.
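A wrapper-style search can be sketched as a loop that grows the subset while the evaluation score keeps improving. The sketch below is a simplified greedy forward selection, not the exact best-first search of the embodiment, and `toy_accuracy` with its `GAIN` table is a made-up stand-in for training a classifier and measuring its accuracy.

```python
def forward_select(features, evaluate):
    """Greedy wrapper selection: add the best feature until the score stops rising."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every subset formed by adding one remaining feature.
        score, feature = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical per-feature contributions; a real evaluate() would run a classifier.
GAIN = {"word_count": 0.05, "negative_words": 0.03, "noise": -0.01}


def toy_accuracy(subset):
    return 0.60 + sum(GAIN[f] for f in subset)


chosen, score = forward_select(list(GAIN), toy_accuracy)
```

In this toy run the feature that lowers the score is left out, which mirrors the point that adding features can add noise.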
  • This embodiment employs five common machine learning classifiers.
  • The decision tree classifier uses the information gain ratio as its measure and splits the data each time on the feature with the highest information gain ratio. The confidence factor is set to 0.005 to prune the decision tree.
  • The logistic regression classifier uses the sigmoid function and determines the best regression coefficients by stochastic gradient ascent.
  • The neural network classifier is a back-propagation neural network with the sigmoid activation function.
  • The random forest classifier uses 100 trees.
  • The naive Bayes classifier is a common machine learning classifier.
  • Cross-validation is a practical way, when the amount of data is not large enough, to randomly cut the data sample into smaller subsets.
  • Some of the subsets are used as training samples to train the classifier, and the remaining subsets are used as a test set to verify the classifier's accuracy and other indicators.
  • Five-fold cross-validation divides the data set into five parts; each time one part is selected as the test set and the remaining four are used as the training set, so that five experiments are performed, and the accuracies obtained in these experiments are averaged as the estimate of the algorithm's accuracy.
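The five-fold procedure can be sketched without any library. The helper names, the fixed shuffle seed, and the majority-class model below are all illustrative assumptions; any classifier exposing a fit function and a predict function could be plugged in instead.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_val_accuracy(X, y, fit, predict, k=5):
    """Average the per-fold accuracy over k train/test splits."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


# Placeholder model: always predicts the most frequent training label.
def fit_majority(X, y):
    return max(set(y), key=y.count)


def predict_majority(model, x):
    return model


X = [[i] for i in range(10)]   # toy feature rows
y = [1] * 10                   # toy labels, all the same class
mean_accuracy = cross_val_accuracy(X, y, fit_majority, predict_majority)
```

Averaging over the five folds gives every record exactly one turn in the test set, which is what makes the estimate usable on a modest data set.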
  • The influence of text on credit evaluation is assessed mainly by accuracy. For test data with a known repayment or default record, accuracy is the proportion of records for which the classification produced by the algorithm in this embodiment matches the original default record.
  • This embodiment uses financial feature data, text feature data, and combined financial and text feature data in turn as input to train and test the models, with the financial feature data serving as the control for comparison.
  • The accuracy of each classifier model for credit risk assessment is calculated. Adding features also adds noise, and too many features can cause feature overload and reduce the classification effect, so feature selection is performed on the feature data before model training. The results are then compared with those of classification on financial features.
  • the final prediction results of the three feature data on the five classifier models are shown in Table 6-2.
  • The classification results predicted from text features alone are compared with those predicted from financial features alone. The data shows that, for most classifier models, the accuracy using text features is lower than the accuracy using financial features, but the values are close. In particular, the accuracy of the random forest after feature selection is 67.42%, about 0.1% higher than with financial features; the neural network reaches 67.83% with text features versus 68.37% with financial features, a gap of about 0.5 percentage points. On the other hand, predictions from combined financial and text features improve to varying degrees over predictions from financial features alone.
  • The classification accuracy is improved by a notable 0.5 percentage points.
  • The accuracy improves noticeably after adding text features, with the highest improvement about 3%. This also shows that text features can improve the accuracy of credit evaluation classification. However, the result after adding all features is slightly lower than the results of adding the text, emotion, and part-of-speech features separately. As the number of features increases, accuracy does not increase as expected, possibly because noise increases along with the number of features and lowers the classification effect. The data therefore shows that text features can improve the accuracy of credit evaluation, but that more text features are not necessarily more helpful.
  • The seven basic classifiers include six text analysis classifiers.
  • The six text analysis classifiers correspond to six abstract text features, which represent different aspects of the borrower: subjectivity, deception, readability of the text, emotion, personality characteristics of the user, and way of thinking.
  • Each classifier takes basic language features as input and predicts whether the borrower will fail to repay; the outputs of the seven classifiers are then combined by the fusion system.
  • Logistic regression is used for deceptive classifiers, subjective classifiers, and personality classifiers; random forests are used for readability classifiers, sentiment classifiers, and basic loan classifiers; multi-layer perceptrons are used for thinking mode classifiers.
  • the decision tree is used to fuse the results of different classifiers.
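The two-layer arrangement, base classifiers first and a tree-based fusion on top, can be sketched with a one-level decision tree (a stump) standing in for the full decision tree. This is a deliberate simplification, and the rows of base-classifier outputs below are invented; in the embodiment they would be the predictions of the seven trained classifiers.

```python
def fit_stump(meta_X, y):
    """Pick the base-classifier column (optionally inverted) that best matches y."""
    best = None
    for col in range(len(meta_X[0])):
        for flip in (False, True):
            preds = [(row[col] ^ 1) if flip else row[col] for row in meta_X]
            acc = sum(p == t for p, t in zip(preds, y)) / len(y)
            if best is None or acc > best[0]:
                best = (acc, col, flip)
    return best


def predict_stump(stump, row):
    """Apply the fitted stump to one row of base-classifier outputs."""
    _, col, flip = stump
    return (row[col] ^ 1) if flip else row[col]


# Each row holds the 0/1 default predictions of three hypothetical base classifiers.
meta_X = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
y = [1, 0, 1, 0, 1]  # made-up true default labels
stump = fit_stump(meta_X, y)
fused = predict_stump(stump, [0, 1, 1])
```

A deeper tree would combine several base outputs instead of relying on one, but the data flow (base predictions in, fused decision out) stays the same.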
  • the experimental data in this embodiment comes from the Prosper website.
  • the Prosper website has a large number of users and is a well-known P2P online service platform.
  • The loan records from 2006 to 2008 were extracted, because loans from this period now have a final repayment result, whether default or on-time repayment. According to statistics, 28,853 loan records with a definite outcome are available for these three years.
  • Prosper assigns one of seven states to the repayment history: current, late, paid, charge-off, defaulted, repurchased, and canceled. Since the experimental data consists entirely of completed records, the current and late states do not occur. The data is then divided into two categories, default and non-default. Default comprises charge-off and defaulted, 9,937 records in total; non-default comprises the remaining categories, 18,916 records in total. The ratio of default to non-default records is roughly 1:1.9.
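The mapping from repayment state to a binary label can be written down directly from the description above. The state strings follow the names given in the text; their exact spelling in the raw Prosper data is an assumption of this sketch.

```python
# States treated as default, per the description; everything else is non-default.
DEFAULT_STATES = {"charge-off", "defaulted"}


def to_label(state):
    """Return 1 for a default record and 0 for a non-default record."""
    return 1 if state.lower() in DEFAULT_STATES else 0


n_default, n_non_default = 9937, 18916   # counts reported in the text
total = n_default + n_non_default
non_default_per_default = n_non_default / n_default
```

The two counts add up to the 28,853 records reported for the data set, and their quotient gives the class imbalance that the classifiers have to cope with.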
  • One is the description item in the Listing table, submitted by the borrower to describe their own situation and the reason for the loan. It is a detailed description of the loan that the borrower personally filled out.
  • The remaining two are the description and the endorsement in the registered-user table Member, which describe the borrower's own situation and recommendations for the borrower, respectively.
  • The description text in the Listing table is used, focusing mainly on the borrower's own description of the loan, in order to mine the borrower's credit status.
  • 70 underlying features and their combinations are extracted to form 6 abstract text features.
  • The features input into the model form a feature network: upper-level features are expressed through the underlying features and represent their abstract meaning, so that the borrower's credit level is expressed step by step.
  • the machine learning classifier according to the embodiment of the present invention will be described below.
  • 1) The decision tree classifier uses the information gain ratio as its measure and splits the data each time on the feature with the highest information gain ratio; the confidence factor is set to 0.005 to prune the decision tree;
  • 2) the logistic regression classifier uses the sigmoid function and determines the best regression coefficients by stochastic gradient ascent;
  • 3) the neural network classifier is a back-propagation neural network with the sigmoid activation function;
  • 4) the random forest classifier uses 100 trees;
  • 5) the naive Bayes classifier.
  • Financial feature data, text feature data, and combined financial and text feature data are each used as input to train and test the models, with the financial feature data serving as the control for comparison.
  • Feature selection is performed on the feature data before model training, and the results are then compared with those of classification on loan features. The final prediction results of the three kinds of feature data on the five classifier models are shown in Table 5-2.
  • The classification results predicted from text features alone are compared with those predicted from financial features alone. The data shows that, for most classifier models, the accuracy using text features is lower than the accuracy using financial features, but the values are close.
  • The accuracy of the random forest after feature selection is 67.42%, about 0.1% higher than with financial features; the neural network reaches 67.83% with text features versus 68.37% with financial features, a gap of about 0.5 percentage points.
  • Predictions from combined financial and text features improve to varying degrees over predictions from financial features alone.
  • FIG. 5-2 is a performance comparison diagram of combining different numbers of classifiers according to an embodiment of the present invention. As shown in FIG. 5-2, the six classifiers that use text features alone and the classifier that uses financial features are each trained independently with three kinds of classifier (logistic regression, random forest, and neural network) to obtain classification results. One classification result is selected for each classifier as input to the second-layer classifier, and the final classification is obtained by training the upper-layer classifier.
  • The accuracy is 71.35%, which is more than 1 percentage point above the highest accuracy of financial analysis (70.19%) and 0.75 percentage points above the highest accuracy of a single classifier (70.6%).
  • the correct rate also rises, and both are better than financial analysis and single classifiers in listing.
  • Boosting and bagging are decision fusion based on a single type of classifier. Their classification effect is slightly worse than decision fusion over different classifiers, but their variance shows that they make the classification more stable.
  • Using different classifiers as base classifiers, so that the decisions being fused come from different sources, lets the final result take different aspects into account and apply in different situations, and makes it more accurate. That is, a result can be checked by different algorithms, and the more types of algorithm are used, the greater the probability that a wrong result is detected.
  • The data shows that the classification accuracy of multi-classifier integration based on weighted voting and simple voting is higher than that of the other decision fusion methods.
  • This parallel integration algorithm over different base classifiers takes the classification ability of each base classifier into account by giving them different weights, which yields more accurate predictions. Because it approaches the problem from different angles, the hybrid classifier shows a more marked improvement after final decision fusion, and its effect is the best among the multi-classifier integration methods implemented in this embodiment.
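Weighted voting over heterogeneous base classifiers can be sketched as summing a weight per predicted label. The votes and weights below are invented for illustration; how the embodiment derives its weights is not specified here, so treating them as given numbers is an assumption.

```python
def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers carry the largest total weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


votes = [1, 1, 0, 0, 0]                 # made-up outputs of five base classifiers
weights = [0.9, 0.9, 0.5, 0.5, 0.5]     # hypothetical per-classifier weights
weighted = weighted_vote(votes, weights)
simple = weighted_vote(votes, [1.0] * len(votes))  # equal weights = simple voting
```

In this toy case simple voting follows the three weaker classifiers, while weighted voting follows the two stronger ones, which is the behaviour the weighting is meant to provide.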
  • The highest accuracy achieved by a single classifier on financial + text features is 70.6%, using logistic regression, and the highest accuracy achieved using financial features alone is 70.19%.
  • Compared with using financial features, the hybrid classifier improves the prediction noticeably, by more than 1 percentage point. It also improves on the single classifier using text + financial features, and because multi-classifier integration is a decision fusion that combines the results of multiple classifiers, its results are more reliable and stable. The multi-classifier integration method therefore plays an important role in classification for credit evaluation.
  • Text analysis and ensemble learning are used to evaluate the credit risk of online P2P loans.
  • The seven classifiers include six text analysis classifiers, corresponding to six abstract text features, and one traditional credit analysis classifier. Experimental results show that the classifiers based on text features perform close to those based on traditional financial features, including FICO scores and DTI.
  • Text information is a good choice when traditional financial information is gradually becoming unavailable.
  • Adding text features can improve the performance of the whole credit risk assessment system, which means that textual information is a good complement to traditional sources of financial information.
  • When textual information is combined with traditional information, credit risk assessment performance can be improved.
  • An embodiment of the present invention further provides a credit risk assessment apparatus based on text analysis. The apparatus includes a first acquiring unit, an analyzing unit, a processing unit, an output unit, and an establishing unit, and each module included in these units can be implemented by a processor in a computing device or, alternatively, by a logic circuit. In implementation, the processor can be a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), or a field programmable gate array (FPGA).
  • The apparatus 600 includes a first obtaining unit 601, an analyzing unit 602, a processing unit 603, and an output unit 604, wherein:
  • the first obtaining unit 601 is configured to acquire a text of the borrower;
  • the analyzing unit 602 is configured to analyze the text to obtain a basic language feature, where the basic language feature is used to predict whether the borrower defaults;
  • the processing unit 603 is configured to input the basic language feature into a preset credit risk assessment model, and obtain a credit risk value of the borrower outputted from the credit risk assessment model;
  • the output unit 604 is configured to output the credit risk value of the borrower.
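The chain of units 601 to 604 can be pictured as four small methods composed in order. This is a structural sketch only: the class and method names, the single word-count feature, and the rule inside `toy_model` are all hypothetical and stand in for the real feature extraction and the preset credit risk assessment model.

```python
class CreditRiskApparatus:
    """Toy counterpart of units 601-604; names are illustrative, not from the source."""

    def __init__(self, model):
        self.model = model  # callable: feature dict -> credit risk value

    def obtain(self, record):       # first obtaining unit 601
        return record["description"]

    def analyze(self, text):        # analyzing unit 602
        return {"word_count": len(text.split())}

    def process(self, features):    # processing unit 603
        return self.model(features)

    def output(self, risk):         # output unit 604
        return {"credit_risk": risk}

    def assess(self, record):
        """Run the four units in sequence on one borrower record."""
        return self.output(self.process(self.analyze(self.obtain(record))))


def toy_model(features):
    # Hypothetical rule for illustration: very short descriptions score as riskier.
    return 0.8 if features["word_count"] < 5 else 0.3


apparatus = CreditRiskApparatus(toy_model)
result = apparatus.assess({"description": "Need money fast"})
```

Keeping the model as an injected callable mirrors the text's separation between the processing unit and the preset model that it feeds.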
  • The device further includes an establishing unit configured to establish the credit risk assessment model. The establishing unit further comprises an obtaining module, an analyzing module, a first establishing module and a fusion module, wherein:
  • the obtaining module is configured to acquire training data;
  • the analyzing module is configured to analyze the training data to obtain basic language features of the training data;
  • the first establishing module is configured to use the basic language feature as a parameter, and use a machine learning method to establish a classifier corresponding to different abstract text features;
  • the fusion module is configured to use the classifier as a basic classifier, and use a decision tree algorithm to perform decision fusion to form a credit risk assessment model.
  • Using the basic language feature as a parameter in the establishing module includes: inputting the basic language features into the classifier corresponding to each abstract text feature, according to the relationship between the basic language features and the abstract text features.
  • The establishing unit further includes a segmentation module and a statistics module, wherein the segmentation module is configured to segment the training data according to the punctuation marks of sentences, and the statistics module is configured to compute statistics over the training data to obtain statistical features.
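The segmentation and statistics modules can be sketched as a sentence splitter followed by a few counts. The choice of sentence-ending punctuation and the three statistics computed are assumptions made for illustration; the source does not enumerate the statistical features at this point.

```python
import re


def split_sentences(text):
    """Split on sentence-ending punctuation and drop empty pieces."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def statistical_features(text):
    """Compute simple counts over the segmented text."""
    sentences = split_sentences(text)
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


stats = statistical_features("I need a loan. I will repay on time! Thanks")
```

Statistics of this kind are cheap to compute and can feed abstract features such as readability.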
  • the establishing unit further includes a second establishing module and a determining module, where:
  • the second establishing module is configured to establish a classifier corresponding to the abstract text feature by using different machine learning methods
  • the determining module is configured to determine a classifier with the highest accuracy as a classifier corresponding to the abstract text feature.
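The determining step amounts to keeping, for each abstract text feature, the candidate learner with the highest measured accuracy. The accuracy figures and learner names below are placeholders invented for the sketch, not results from the source.

```python
def select_best(accuracy_by_method):
    """Return the name of the method with the highest accuracy."""
    return max(accuracy_by_method, key=accuracy_by_method.get)


# Hypothetical validation accuracies for one abstract text feature.
candidates = {"logistic_regression": 0.68, "random_forest": 0.66, "mlp": 0.67}
best_method = select_best(candidates)
```

Repeating this per abstract feature yields one chosen classifier for each, which then serve as the base classifiers for decision fusion.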
  • If the foregoing credit risk assessment method based on text analysis is implemented in the form of a software function module and sold or used as a stand-alone product, it may also be stored in a computer readable storage medium.
  • The technical solution of the embodiments of the present invention may, in essence, be embodied in the form of a software product that is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the methods described in the various embodiments of the present invention.
  • The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a mobile hard disk, a read only memory (ROM), a magnetic disk, or an optical disk.
  • An embodiment of the present invention further provides a computer storage medium storing computer executable instructions, the computer executable instructions being used to perform the credit risk assessment method based on text analysis in the embodiments of the present invention.
  • The disclosed apparatus and method may be implemented in other manners.
  • The device embodiments described above are merely illustrative.
  • The division into units is only a division by logical function; there may be other division manners, such as: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • The coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of other forms.
  • The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated into one unit;
  • the unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • The foregoing program may be stored in a computer readable storage medium, and when executed, the program performs the steps of the foregoing method embodiments; the foregoing storage medium includes various media that can store program code, such as a removable storage device, a read only memory (ROM), a magnetic disk, or an optical disk.
  • The above-described integrated unit of the present invention may be stored in a computer readable storage medium if it is implemented in the form of a software function module and sold or used as a standalone product. Based on such understanding, the technical solution of the embodiments of the present invention, in essence, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the methods described in the embodiments of the present invention.
  • The foregoing storage medium includes various media that can store program code, such as a mobile storage device, a ROM, a magnetic disk, or an optical disk.
  • The borrower's text is obtained; the text is analyzed to obtain basic language features, which are used to predict whether the borrower will default; the basic language features are input into the preset credit risk assessment model to obtain the credit risk value of the borrower output by the model; and the credit risk value of the borrower is output. In this way, the borrower's credit risk can be effectively evaluated, giving investors an important basis for investment decisions.

Abstract

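A credit risk assessment method and apparatus based on text analysis, and a storage medium, wherein the method includes: acquiring a text of a borrower (S101); analyzing the text to obtain basic language features, the basic language features being used to predict whether the borrower will default (S102); inputting the basic language features into a preset credit risk assessment model to obtain a credit risk value of the borrower output by the credit risk assessment model (S103); and outputting the credit risk value of the borrower (S104).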

Description

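Credit risk assessment method and apparatus based on text analysis, and storage medium
This patent application claims priority to Chinese Patent Application No. 201510695316.1, filed on October 22, 2015 by Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司) and entitled "Credit risk assessment method and apparatus based on text analysis", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of Internet finance, and in particular to a credit risk assessment method and apparatus based on text analysis, and a storage medium.
Background
Computers and networks have spread rapidly with the times, and the Internet has, almost unnoticed, become closely tied to every aspect of people's lives. In recent years its influence has also extended into finance, and Internet finance has gradually come into public view. In theory, any finance-related business handled online counts as Internet finance. It is commonly divided into six categories: big-data finance, third-party payment, P2P (Peer-to-Peer) online lending, crowdfunding, information-based financial institutions, and Internet finance portals.
P2P online lending, an emerging area of the Internet finance industry, is growing at an astonishing rate and attracting wide attention, and opportunities and challenges have arisen together. Owing to China's particular historical background, P2P online lending has developed especially quickly there and on a relatively large scale. China's financial sector is subject to a degree of financial regulation, and the increasingly diverse financial needs of many small and medium-sized enterprises and individuals are not met by existing financial services, which has driven the rapid growth of P2P online lending. For the same reason, overly rapid innovation and a lack of supervision readily give rise to systemic risks such as amount and maturity mismatches, illegal fund-raising, and liquidity traps. On the payment side there is still no complete authentication system and funds lack supervision, so the industry faces risks such as transaction fraud and privacy leakage; on the financing side, credit risk has become prominent as the efficiency with which social funds are used has increased.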
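Summary
In view of this, embodiments of the present invention provide a credit risk assessment method and apparatus based on text analysis, and a storage medium, to solve at least one problem in the prior art. They can effectively assess a borrower's credit risk and thereby give investors an important basis for investment decisions.
The technical solutions of the embodiments of the present invention are implemented as follows:
In a first aspect, an embodiment of the present invention provides a credit risk assessment method based on text analysis, the method including:
acquiring a text of a borrower;
analyzing the text to obtain basic language features, the basic language features being used to predict whether the borrower will default;
inputting the basic language features into a preset credit risk assessment model to obtain a credit risk value of the borrower output by the credit risk assessment model;
outputting the credit risk value of the borrower.
In a second aspect, an embodiment of the present invention provides a credit risk assessment apparatus based on text analysis, the apparatus including a first obtaining unit, an analyzing unit, a processing unit, and an output unit, wherein:
the first obtaining unit is configured to acquire a text of a borrower;
the analyzing unit is configured to analyze the text to obtain basic language features, the basic language features being used to predict whether the borrower will default;
the processing unit is configured to input the basic language features into a preset credit risk assessment model to obtain a credit risk value of the borrower output by the credit risk assessment model;
the output unit is configured to output the credit risk value of the borrower.
In a third aspect, an embodiment of the present invention provides a computer storage medium storing computer-executable instructions, the computer-executable instructions being used to perform the credit risk assessment method based on text analysis provided by the embodiment of the first aspect of the present invention.
Embodiments of the present invention provide a credit risk assessment method and apparatus based on text analysis, and a storage medium, in which: a text of a borrower is acquired; the text is analyzed to obtain basic language features, the basic language features being used to predict whether the borrower will default; the basic language features are input into a preset credit risk assessment model to obtain a credit risk value of the borrower output by the credit risk assessment model; and the credit risk value of the borrower is output. In this way, the borrower's credit risk can be effectively assessed, giving investors an important basis for investment decisions.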
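Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a credit risk assessment method based on text analysis according to Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the relationship between abstract text features and basic language features in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the general feature selection process in an embodiment of the present invention;
Fig. 4-1 is a schematic comparison of credit assessment results using financial features and text features in an embodiment of the present invention;
Fig. 4-2 is a schematic comparison of credit assessment results using financial features and financial plus text features in an embodiment of the present invention;
Fig. 4-3 is a schematic diagram of the influence of different text features on credit assessment in an embodiment of the present invention;
Fig. 5-1 is a schematic architecture diagram of a credit risk assessment system based on multiple classifiers in an embodiment of the present invention;
Fig. 5-2 is a performance comparison for combining different numbers of classifiers in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a credit risk assessment apparatus based on text analysis according to Embodiment 6 of the present invention.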
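Detailed Description
In a typical P2P lending process, the P2P lending company acts as an intermediary platform that displays the lending and borrowing information of both parties. Investors and borrowers trade online through free bidding, and the company earns a service fee when a transaction succeeds. Put simply, P2P lending is a loan transaction carried out online between individuals: the borrower must repay the principal at maturity and pay interest to the lender, while the lender earns a return and bears the risk that the principal is not repaid.
Credit is a relationship of mutual trust that arises in transactions among individuals, groups, and goods. It is an inevitable product of social and economic development and an indispensable part of a market economy. In P2P lending, whether the borrower is a small or medium-sized enterprise or an individual, its credit level is a key factor in an investor's decision to invest.
Credit assessment, also called credit rating, plays an important role in building a credit system. It examines an enterprise or individual comprehensively according to certain indicators and methods, and evaluates the credit level scientifically and objectively from the collected information. Its main purpose is to estimate how likely the borrower under examination is to default, that is, to judge whether the borrower can fulfil what was agreed on time, which in P2P lending means repaying the loan on time. Credit assessment is fundamentally a classification problem in data mining: a population is divided into two or more subsets according to different features. In lending, borrowers are usually classified as trustworthy "good" users and "bad" users who carry credit risk, that is, the positive and negative examples of the classification. Classifying the two categories from historical credit data helps investors understand the potential risk of an investment.
Credit data: credit assessment uses many kinds of data for qualitative and quantitative analysis or for training models, and such data are called credit data. By form they can be divided into structured and unstructured data. Comments on social networks, audio and video uploaded by users, and applications filled in by users exist as text, pictures, audio, or video, and are all unstructured data. According to whether information is easy to perceive and accept, data in the financial field are divided into soft information and hard information. Hard information is precise, logical, and traceable, that is, it can be directly verified; it can be quantified, recorded in documents, and transmitted accurately, for example financial statements and wage levels. Soft information, in contrast, is given subjectively by the information provider and cannot be directly verified by others.
In the embodiments of the present invention, 28,853 loan records generated on the Prosper platform between January 1, 2006 and December 31, 2008 are used as training data. When a borrower applies for a loan through a P2P lending platform, the borrower must write a loan application description. This description, a piece of text written subjectively by the borrower, is used as training data together with the borrower's financial information to study which features affect credit and, by tuning models trained on these features, to build an effective credit risk assessment system.
The technical solutions of the present invention are described in further detail below with reference to the drawings and specific embodiments.
In the embodiments of the present invention, the credit risk of a loan is assessed through the borrower's text features. For example, relevant data (borrowers' text descriptions) are obtained from the world's largest P2P lending platform; machine learning and statistical methods are then used to extract six abstract text features from the descriptions, and these six features are used to assess the borrower's willingness and ability to repay. The six features are subjectivity, deception, readability of the text, sentiment, the user's personality traits, and thinking style.
Credit risk in P2P lending is determined by two factors, willingness to repay and ability to repay. Ability to repay, the primary factor, refers to whether the borrower is able to repay on time, which depends on the borrower's economic situation. Willingness to repay, the secondary factor, depends on the borrower's thinking and attitudes.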
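Embodiment 1
An embodiment of the present invention provides a credit risk assessment method based on text analysis. The method is applied to a computing device, which in practice may be a personal computer, a server, an industrial computer, a notebook computer, or another electronic device with information processing capability. The functions of the method can be implemented by a processor in the computing device invoking program code, and the program code can be stored in a computer storage medium; the computing device therefore includes at least a processor and a storage medium.
Fig. 1 is a schematic flowchart of the credit risk assessment method based on text analysis according to Embodiment 1. As shown in Fig. 1, the method includes:
Step S101: obtain a borrower's text;
Here, the text may be any text written by the borrower about the loan; for example, an application written by the borrower to the lender can serve as the borrower's text in the embodiments of the present invention.
Step S102: analyze the text to obtain basic language features, the basic language features being used to predict whether the borrower will default;
Here, in practice, natural language processing methods, such as topic models, can be used to extract the basic language features from the text. Natural language processing refers to methods and theories that use automated machines as tools to recognize, transmit, store, and understand language by computable means at different granularities such as words, sentences, paragraphs, and documents. It can perform word segmentation, part-of-speech tagging, structural analysis, and even meaning understanding, and thereby obtains more features that represent the text from different angles.
Here, the basic language features include at least statistical features, part-of-speech features, sentiment features, entity features, and tense features of the text. The statistical features include sentence features, word features, and punctuation features, where: the sentence features include at least the total number of sentences, the average sentence length, the maximum sentence length, and the proportion of interrogative sentences; the word features include at least the average word length, the longest word, the number of word types, the total number of words, the average number of occurrences per word, and the maximum number of occurrences of a word; and the punctuation features include at least the proportion of question marks and the proportion of exclamation marks.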
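Step S103: input the basic language features into a preset credit risk assessment model to obtain the borrower's credit risk value output by the credit risk assessment model;
Here, the credit risk assessment model is built in advance; the process of building it is described below. In the embodiments of the present invention, the credit risk assessment model may be a single classifier or a credit risk assessment system composed of multiple classifiers. A single classifier can be regarded as an expert system for one field or aspect, and a credit risk assessment system composed of multiple classifiers can be regarded as a mixture-of-experts system.
Step S104: output the borrower's credit risk value.
In the embodiments of the present invention, the method further includes step S100, building the credit risk assessment model, which includes:
Step S111: obtain training data;
Here, the training data are texts written by borrowers about their loans.
Step S112: analyze the training data to obtain the basic language features of the training data;
Here, step S112 is similar to step S102 above and is described in the following embodiments.
Step S113: using the basic language features as parameters, build classifiers corresponding to different abstract text features by machine learning methods;
Here, the abstract text features include deception, subjectivity, sentiment, readability of the text, personality traits, and thinking style. The machine learning methods include artificial neural networks, support vector machines, decision trees, Bayesian methods, random forests, and logistic regression. In practice, different machine learning methods can also be used to build classifiers for the same abstract text feature. Taking deception as an example, an artificial neural network classifier, a Bayesian classifier, and a random forest classifier can each be built, and the classifier with the highest accuracy is then used as the classifier for that abstract text feature.
Here, using the basic language features as parameters includes: inputting the basic language features into the classifier corresponding to each abstract text feature according to the relationship between the basic language features and the abstract text features. The relationship is shown in Fig. 2: subjectivity corresponds to part-of-speech features and sentiment features; deception corresponds to part-of-speech, sentiment, entity, and tense features; readability corresponds to statistical features; sentiment corresponds to sentiment features; personality traits correspond to statistical, part-of-speech, sentiment, entity, and tense features; and thinking style corresponds to part-of-speech features and entity features.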
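The routing of basic feature groups to abstract-feature classifiers, and the choice of the most accurate learner per abstract feature, can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary keys, the function names `inputs_for` and `pick_best`, and the layout of `basic_features` are all assumptions introduced here.

```python
# Basic-feature groups routed to each abstract-feature classifier,
# following the relationship described for Fig. 2. Key names are hypothetical.
FEATURE_GROUPS = {
    "subjectivity": ["pos", "sentiment"],
    "deception": ["pos", "sentiment", "entity", "tense"],
    "readability": ["statistical"],
    "sentiment": ["sentiment"],
    "personality": ["statistical", "pos", "sentiment", "entity", "tense"],
    "thinking_style": ["pos", "entity"],
}


def inputs_for(abstract_feature, basic_features):
    """Concatenate the feature groups that feed one abstract-feature classifier.

    basic_features maps a group name to a list of numbers (assumed layout).
    """
    vector = []
    for group in FEATURE_GROUPS[abstract_feature]:
        vector.extend(basic_features[group])
    return vector


def pick_best(accuracy_by_method):
    """Return the learning method with the highest validation accuracy."""
    return max(accuracy_by_method, key=accuracy_by_method.get)
```

For instance, a thinking-style classifier would receive only the part-of-speech and entity values, while a personality classifier would receive all five groups.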
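Step S114: use the classifiers as base classifiers and perform decision fusion with a decision tree algorithm to form the credit risk assessment model.
Here, the classifiers corresponding to the abstract text features serve as the base classifiers, and a decision tree algorithm fuses their decisions to form the credit risk assessment model.
In the embodiments of the present invention, building the credit risk assessment model further includes: splitting the training data according to sentence-ending punctuation marks, and computing statistics on the split training data to obtain the statistical features.
Here, the sentence-ending punctuation marks include at least the period, the question mark, and the exclamation mark.
Embodiments of the present invention provide a credit risk assessment method and apparatus based on text analysis, in which a borrower's text is obtained; the text is analyzed to obtain basic language features used to predict whether the borrower will default; the basic language features are input into a preset credit risk assessment model to obtain the borrower's credit risk value output by the model; and the borrower's credit risk value is output. In this way, the borrower's credit risk can be assessed effectively, giving investors an important basis for investment decisions.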
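Embodiment 2
This embodiment describes the abstract text features and basic language features of Embodiment 1. Fig. 2 is a schematic diagram of the relationship between them. As shown in Fig. 2, to mine useful information from a borrower's text, the abstract text features, which describe different aspects of the borrower, are first identified from the text; the basic language features are then constructed and combined according to the abstract text features.
2.1 Abstract text features
Drawing on psychology and linguistics, six abstract text features useful for credit risk assessment are identified from the text description: deception, subjectivity, sentiment, readability of the text, personality traits, and thinking style.
1) Deception
Deception is used to distinguish deceivers from honest writers. This embodiment defines deception along four dimensions: cognitive load, internal imagination, dissociation, and negative emotion. Deceivers not only fabricate facts that do not exist but must also avoid being exposed, so they often spend more cognitive resources and incur a higher cognitive load to tell a simple story. Cognitive load is usually measured by concreteness and cohesion. Concreteness can be obtained by the Coh-Metrix program from the MRC Psycholinguistic Database, while cohesion is closely related to the number of connectives. Studies show that deceptive descriptions have high concreteness and very low cohesion.
Internal imagination is related to the use of entity words and tense words. In general, descriptions of events that were actually experienced contain more information, such as time ("today", "yesterday", "this month") and place ("here", "there", "street"), none of which comes from internal imagination.
Dissociation is related to the use of personal pronouns. To dissociate themselves from a fabricated story, deceivers tend to use more third-person words (such as "she" and "he") when telling it.
Negative emotion is related to the use of emotion words. Because lying increases guilt, deceivers tend to use more negative words than honest writers.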
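2) Subjectivity
Subjectivity analysis is a form of text mining that assesses whether a text leans toward the subjective or the objective, that is, whether it carries more information about the objective world or focuses on personal feelings. Studies show that loans whose texts contain more objective information are more likely to default. After providing a set of objective facts about the loan, borrowers with good credit put more emphasis on explaining the purpose of the loan in their descriptions, which involves more subjective information, whereas borrowers at risk of default are reluctant to touch on unpleasant facts and fill their descriptions with objective information. Subjectivity is therefore closely related to sentiment features such as the subjectivity of the vocabulary, to entity features that reflect a person's thoughts and opinions, and to the use of modal verbs, numerals, adjectives, and adverbs.
3) Sentiment
Sentiment captures the overall emotional direction of the borrower's description. By processing the text, it judges whether the borrower is positive or negative, friendly or not, and so gives a deeper view of the borrower's opinions, emotions, and attitudes. Combining the basic sentiment features yields a more complete picture of the emotional side of the text. The more positive and optimistic a borrower's attitude toward life, the higher the borrower's creditworthiness, and vice versa.
4) Personality traits
The most basic difference between people is their personality, which covers behavior, temperament, emotion, and inner spirit. Personality forms over a long period and is stable, and it influences many aspects of individual behavior; for example, people who are generous and positive have a lower default risk than people who are stingy and pessimistic. The more important a personality difference is, the more likely it is to surface in individual words, and the linguistic features of a text faithfully reflect the writer's personality.
Personality can be defined along five dimensions, the well-known Big Five. The first is extraversion. Extraverts are more willing to communicate with others, tend to use short sentences and fewer word types, use more verbs, pronouns, adverbs, and interjections, write with a mostly positive and optimistic tone, and use more social words. Among the many factors related to introversion and extraversion, the formality measure can be used to select the most important dimensions for calculation:
F=(noun freq+adjective freq+preposition freq+article freq-pronoun freq-verb freq-adverb freq-interjection freq+100)/2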
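The formality measure F can be computed directly from part-of-speech frequencies. The sketch below assumes each frequency is expressed as a percentage of all words in the text (the source does not state the scale; a 0 to 100 scale is assumed here because of the +100 offset), and the function and key names are illustrative only.

```python
def formality(freq):
    """Formality score F from part-of-speech frequencies.

    freq maps a part-of-speech name to its frequency, assumed to be a
    percentage of all words in the text (an assumption, not from the source).
    """
    formal = freq["noun"] + freq["adjective"] + freq["preposition"] + freq["article"]
    contextual = freq["pronoun"] + freq["verb"] + freq["adverb"] + freq["interjection"]
    return (formal - contextual + 100) / 2
```

Under this reading, a text with 30% nouns, 10% adjectives, 10% prepositions, 10% articles, 10% pronouns, 20% verbs, 5% adverbs, and no interjections scores (60 - 35 + 100) / 2 = 62.5; a higher score indicates a more formal, less conversational style.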
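Studies have found that each dimension has a small but significant connection with linguistic features. Neurotic people prefer to express themselves in the first person singular, and their texts contain more negative words and few positive words. Emotionally stable people, by contrast, use more positive words and use articles more often. Conscientious people try to avoid negations, negative words, and modal verbs. Texts by open people contain more long words and tentative words, and these writers are less inclined to use the first person singular and the past tense. Finally, agreeableness describes whether a person is easy to get along with; people who use more swear words, negative words, and anger words tend to be harder to get along with.
Personality is thus closely related to linguistic features. It is related not only to properties of the words themselves (such as word length and word types) but also to the emotional polarity of words (positive, negative), part of speech (adjective, verb), person and tense (third person, past tense), and the specific meaning of words (social words).
5) Thinking style
Besides content words, which have a definite meaning and express the ideas of the text, a text contains a large number of function words. Function words reflect how the author communicates rather than what is actually described, and they track the author's social environment and real psychological state more closely. As events develop and the author's understanding changes, the function words used change accordingly.
First, cognitive complexity describes the richness of reasoning, that is, the degree of differentiation between competing options and the degree of integration among different solutions; it is usually indicated by exclusive words and conjunctions. People who tell true stories are more inclined to use exclusive words. When describing complex, concrete information, most people use more prepositions, cognitive words, and long words. Causal words and insight words, two kinds of cognitive mechanism, often appear in descriptions of past events and reflect thinking about what has already happened. A person who is unsure about what is being described prefers to hedge with tentative words and filler words, and heavy use of tentative words suggests that the truth of the story is in doubt. Thinking style is therefore related to part of speech and to cognitive entity words that express causes and insights, and the author's thinking style can be abstracted from a combination of these two kinds of basic features so as to reflect more vividly the author's real intention when writing the description.
6) Readability of the text
Readability is a fairly traditional measure of text that reflects the author's education level, social status, and so on. It has been used in areas such as product feedback, purchase intention, and social media comments, and the way a text is written affects how easily readers understand it. Readability is measured along three dimensions: lexical diversity, lexical readability, and lexical complexity.
Lexical diversity is introduced first. A text that uses fewer word types should be easier to read. Formula (1), a measure of lexical richness that does not depend on text length, is usually used.
Figure PCTCN2016081998-appb-000001
In formula (1), N is the length of the text and V(i, N) is the number of word types that occur i times. Lexical readability and lexical complexity are also closely related to the sentence length, word length, and number of word types of the text.
Studies show that loans whose texts are more readable are more likely not to default. A person who is well educated and has a stable, high income writes a clearer and more readable loan description, and the corresponding creditworthiness is better.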
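2.2 Basic language features
Abstract text features reflect the author's intention to default and credit habits at an abstract level, but they cannot be extracted directly from the text. Therefore, based on the meaning of each abstract text feature and the textual factors that influence it, five kinds of basic language features are summarized. These can be obtained directly from the text by machine learning and statistical methods and are used to represent the abstract text features, ultimately capturing the underlying relationship with willingness to repay and serving as features for predicting default.
1) Statistical features
Intuitively, the part of a text that is easiest to extract is what can be obtained by direct counting, such as the number of sentences, the number of words, and word lengths. These features are easy to compute and reflect, to varying degrees, the writer's attitude toward writing and even toward life. For example, sentence length reflects readability to some extent: the longer the sentences in a text, the harder it is to read and the more obscure its meaning. In addition, for the same object of counting, such as words, the count and the number of types mean different things: the number of words represents the length of the text, while the number of word types represents the vocabulary used and reflects the diversity of word use.
There are many ways to extract these features. The embodiments of the present invention use a statistical method, which is simple and easy to carry out. First, to count sentence-related features, the text must be split into sentences. Sentences are split and recognized by the punctuation marks commonly used in English to end a sentence, such as the period, question mark, and exclamation mark, with special handling for quotation marks and brackets. Second, each sentence is tokenized and word-related features are counted. At present, 17 simple statistical features are extracted at two granularities, sentence features and word features; Table 2-1 lists the meaning of each feature.
Table 2-1 Simple statistical features of the text
Figure PCTCN2016081998-appb-000002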
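A few of the statistical features named in this description (sentence count, average and maximum sentence length, proportion of questions, word count, word types, average word length) can be computed with a short routine like the one below. It is only a sketch: the regular expressions, the function name, and the output keys are assumptions made here, the special handling of quotation marks and brackets mentioned in the description is left out, and the full set of 17 features in Table 2-1 is not reproduced.

```python
import re


def statistical_features(text):
    """Compute a handful of simple sentence- and word-level statistics."""
    # Split after '.', '?' or '!' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    sentence_lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]

    # Count occurrences of each word type, ignoring case.
    counts = {}
    for w in words:
        counts[w.lower()] = counts.get(w.lower(), 0) + 1

    n_sent = len(sentences)
    return {
        "sentence_count": n_sent,
        "avg_sentence_length": sum(sentence_lengths) / n_sent if n_sent else 0.0,
        "max_sentence_length": max(sentence_lengths, default=0),
        "question_ratio": sum(s.endswith("?") for s in sentences) / n_sent if n_sent else 0.0,
        "word_count": len(words),
        "word_types": len(counts),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
    }
```

Calling `statistical_features("I need a loan. Can you help me? Thank you!")` yields three sentences, ten words, and nine distinct word types, with one third of the sentences being questions.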
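2) Part-of-speech features
In the embodiments of the present invention, a maximum entropy model is used to tag words with their part of speech, that is, to assign each word a part-of-speech category such as adverb or conjunction. The key issue in a maximum entropy model is feature selection, since the selected features directly affect tagging accuracy. The parts of speech of adjacent words in a text influence each other, and the part of speech is also related to the word's own suffix and to neighboring words. Contextual features of the word and features of the word itself are therefore selected together to form the maximum entropy feature templates shown in Table 2-2:
Table 2-2 Feature templates for training the maximum entropy model
No.  Feature type  Feature template
1    General       w_i = X & t_i = T
2    General       t_{i-1} = T1 & t_i = T
3    General       t_{i-1} = T1 & t_{i-2} = T2 & t_i = T
4    General       w_{i+1} = X1 & t_i = T
5    Rare          suffix S of w_i, |S| < 5 & t_i = T
6    Rare          prefix P of w_i, 1 < |P| < 5 & t_i = T
7    Rare          w_i contains a digit & t_i = T
8    Rare          w_i contains an uppercase letter & t_i = T
9    Rare          w_i contains a hyphen & t_i = T
Here w_i, w_{i+1}, t_i, t_{i-1}, and t_{i-2} denote the current word, the next word, the tag of the current word, the tag of the previous word, and the tag of the word before that, respectively. General features apply to every word; Rare features are added only when the word matches a template of the Rare feature type.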
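The templates in Table 2-2 can be read as a recipe for generating contextual predicates at each position, each of which is then paired with a candidate tag t_i by the maximum entropy learner. The sketch below shows that generation step only. The function name, the string encoding of the predicates, and the boundary symbols `<S>` and `</S>` are assumptions introduced for illustration; the decision of when a word counts as rare is left to the caller.

```python
def template_features(words, tags, i, rare):
    """Contextual predicates for position i, following Table 2-2.

    words: all tokens of the sentence.
    tags:  tags already assigned to positions before i.
    rare:  whether the Rare templates should be applied to words[i].
    """
    w = words[i]
    prev1 = tags[i - 1] if i >= 1 else "<S>"
    prev2 = tags[i - 2] if i >= 2 else "<S>"
    nxt = words[i + 1] if i + 1 < len(words) else "</S>"

    # General templates 1-4.
    feats = [f"w={w}", f"t-1={prev1}", f"t-1={prev1},t-2={prev2}", f"w+1={nxt}"]

    if rare:
        # Template 5: suffixes with |S| < 5.
        for k in range(1, min(4, len(w)) + 1):
            feats.append(f"suffix={w[-k:]}")
        # Template 6: prefixes with 1 < |P| < 5.
        for k in range(2, min(4, len(w)) + 1):
            feats.append(f"prefix={w[:k]}")
        # Templates 7-9: digit, uppercase letter, hyphen.
        if any(c.isdigit() for c in w):
            feats.append("has_digit")
        if any(c.isupper() for c in w):
            feats.append("has_upper")
        if "-" in w:
            feats.append("has_hyphen")
    return feats
```

For the sentence "The lead paint is unsafe." with "The" already tagged Det, position 1 ("lead") produces the predicates for the current word, the previous tag, the previous two tags, and the next word "paint".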
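The training corpus for the maximum entropy model comes from the Penn Treebank, and the part-of-speech tag set it provides is followed. The part-of-speech features used in this embodiment are mainly at the word level, which gives 36 word-level part-of-speech features in total. Because this classification is too fine-grained (for example, singular and plural nouns belong to different classes, as do adjectives and their comparative forms), the categories are merged into 12 broad part-of-speech classes, and the count and the specific categories contained in each class are computed, as shown in Table 2-3 and Table 2-4.
Table 2-3 Combined part-of-speech features of the text
Figure PCTCN2016081998-appb-000003
Figure PCTCN2016081998-appb-000004
Table 2-4 Uncombined part-of-speech features of the text
Figure PCTCN2016081998-appb-000005
3) Sentiment features
Beyond the directly countable features and the part-of-speech features, the ultimate goal is to examine the role of text in credit assessment, that is, whether the writer carries a default risk. Features about the writer's emotional tendency therefore also need to be extracted; they directly reflect the writer's attitude to life and values and, to a large extent, the risk of default. Training a machine learning model for positive/negative sentiment requires a large amount of labeling and training cost, and labeling requires professional knowledge of linguistic classification. The embodiments of the present invention therefore extract sentiment features with a sentiment lexicon. The General Inquirer category dictionary is used, and the number and types of sentiment-polarity words are counted by matching the words under each category of the dictionary against the words of the experimental text. The category information in the General Inquirer dictionary comes from four sources, including the Harvard IV-4 dictionary and the Lasswell value dictionary, for 156 categories in total. According to whether a category relates to the writer's attitudes and views, 15 features are finally selected, as shown in Table 2-5:
Table 2-5 Sentiment features of the text
Figure PCTCN2016081998-appb-000006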
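The lexicon-matching step described for the sentiment features amounts to counting, for each dictionary category, how many tokens of the text fall in that category and how many distinct words they represent. The sketch below illustrates this with a tiny stand-in lexicon; the category names and word lists are invented for the example and are not the General Inquirer categories, and the function name is hypothetical.

```python
import re

# Toy stand-in lexicon. The real General Inquirer categories and word
# lists are not reproduced here; these entries are illustrative only.
LEXICON = {
    "positive": {"good", "happy", "stable", "thank"},
    "negative": {"bad", "late", "debt", "lost"},
}


def lexicon_features(text, lexicon=LEXICON):
    """Count matching tokens and distinct matching words per category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = {}
    for category, vocab in lexicon.items():
        hits = [t for t in tokens if t in vocab]
        feats[category + "_count"] = len(hits)
        feats[category + "_types"] = len(set(hits))
    return feats
```

The same counting routine would apply to the entity features, which the description also derives from dictionary categories; only the lexicon passed in would differ.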
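4) Entity features
Entity features generally carry some concrete meaning, such as time, space, or cause and goal. Studies show that descriptions of real events contain more spatial and temporal information than descriptions of events invented from imagination. When judging whether a loan will default, genuine and fabricated descriptions can be told apart by how much entity information they contain. The General Inquirer category dictionary is used here as well: the number and types of entity words are counted by matching the words under each category of the dictionary against the words of the experimental text. Nine entity features are finally obtained, as shown in Table 2-6:
Table 2-6 Entity features of the text
Figure PCTCN2016081998-appb-000007
5) Tense features
Tense features are extracted in two ways. First, because the Penn Treebank annotated corpus also marks tense attributes of verbs such as past, present, and future, the maximum entropy model can be trained to obtain tense features from the text. Second, a dictionary of common English phrases expressing past, present, and future time is used to look up matching words in a sentence and determine when the event currently described takes place. Finally, the verbs and time adverbials of a sentence are combined to obtain its tense features.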
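Embodiment 3
This embodiment describes the natural language processing methods and machine learning methods of Embodiment 1.
3.1 Natural language processing methods
In P2P lending, text submitted by the borrower, such as the description of the reason for borrowing, influences credit assessment. Such user text is usually written in natural language, that is, the spoken or written language people use every day. Natural language differs markedly from computer languages and numbers: a computer cannot represent or understand it directly, nor use it directly in computation. Yet natural language, being composed of grammar, words, sentences, and other elements, carries a great deal of information and can reflect a person's character, feelings, and other complex emotions. Simple statistical methods or natural language processing methods are therefore needed to process and analyze the text and extract quantifiable features that represent some dimension of its information, so that a computer can use these features in computation and subsequent processing.
In text processing, besides simple counting of words and sentences, natural language processing methods are used more and more widely. Natural language processing refers to methods and theories that use automated machines as tools to recognize, transmit, store, and understand language by computable means at different granularities such as words, sentences, paragraphs, and documents. It can perform word segmentation, part-of-speech tagging, structural analysis, and even meaning understanding, and thereby obtains more features that represent the text from different angles.
1) Part-of-speech tagging
Part of speech, also called word class, is the basic grammatical attribute of a word, usually determined by the word's form, function, and grammatical meaning. Part-of-speech tagging labels each word of a language with its word class and is one of the basic and important tasks in natural language processing. Methods are usually divided into rule-based and statistical. Rule-based tagging first looks up each word of the sentence in a dictionary to mark all possible tags, and then applies rules to remove wrong tags step by step until the correct result remains. An example of part-of-speech tagging:
Example sentence: The lead paint is unsafe.
Tagging result: The/Det lead/N paint/N is/V unsafe/Adj.
Entropy describes the uncertainty of a variable's value and is positively correlated with that uncertainty: the larger the entropy, the closer the random variable is to a uniform distribution. When not all information about a distribution is available, the maximum entropy principle says to choose, among the distributions consistent with what is known, the one with the largest entropy, that is, the most uniform one. Statistical modeling by the maximum entropy principle is the best choice that can be made when the distribution is not fully known, because any choice that departs from it implicitly adds information that is not in the data.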
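The statement that the uniform distribution has the largest entropy can be checked numerically. The short sketch below computes Shannon entropy in bits for a discrete distribution; the function name is chosen here for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    return -sum(x * math.log2(x) for x in p if x > 0)
```

Over four outcomes, the uniform distribution `[0.25, 0.25, 0.25, 0.25]` has entropy 2 bits, any skewed distribution such as `[0.7, 0.1, 0.1, 0.1]` has less, and a distribution that puts all mass on one outcome has entropy 0.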
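The maximum entropy principle was proposed by E. T. Jaynes in 1957 and is widely used in many fields. The maximum entropy method represents the known knowledge in the sample data through features and adds constraints so that the model expectation of each feature equals its observed expectation, which turns the task into an optimization problem. When building a maximum entropy model, one only needs to decide which features are useful, not how they are used. The general statement of the maximum entropy method is as follows:
Given sample data $O = \{(m_1, n_1), (m_2, n_2), \ldots, (m_l, n_l)\}$ with $m_i \in M$ and $n_i \in N$, find a model distribution $p(m, n)$ that satisfies the following two conditions:
(1) $p(m, n)$ maximizes the entropy $H(p)$, that is, $p^* = \arg\max_p H(p)$;
(2) $p(m, n)$ agrees with the known statistics of the sample data.
Solving the maximum entropy model is then equivalent to solving the following constrained optimization problem:
$p^* = \arg\max_p H(p)$
subject to $E_p[f_j] = E_{\tilde{p}}[f_j]$,
where $1 \le j \le k$, and
$\sum_{m,n} p(m, n) = 1$.
The two sides of the first constraint are the model expectation and the observed expectation, respectively (writing $f_j$ for the j-th feature and $\tilde{p}$ for the empirical distribution of the sample). The maximum entropy model places no requirement on the correlation between features and, according to this embodiment, does not suffer from overfitting. Considering both simplicity of implementation and classification quality, this embodiment uses a maximum entropy model for part-of-speech tagging.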
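2) Sentiment analysis
The text people write when commenting, writing articles, or submitting applications carries a great deal of the author's emotional coloring and inclination, and reflects to some extent the author's personality and attitude to life, for example positive or negative, approving or disapproving. Put simply, sentiment analysis determines the emotional tendency implicit in people's written, unstructured data.
Text, as unstructured data, is hard to understand and process automatically. Sentiment analysis therefore usually extracts words, sentences, paragraphs, and documents separately and analyzes them at different levels, turning the text into structured data. By what is mined, it can be further divided into opinion extraction, opinion mining, sentiment mining, and subjectivity analysis. This embodiment focuses on mining the emotional tendency of the text, extracting the sentiment words in it, and judging the author's emotional attitude.
For sentiment analysis, one option is to use popular, mature, open sentiment lexicons. These usually divide words into sentiment categories on different grounds, such as part of speech or emotional coloring, so that each word is labeled from several angles and described comprehensively, which also reveals regularities of sentiment words across categories. The other option is to treat the task as a classification problem and apply a machine learning classification algorithm to obtain the attitude of the text.
3.2 Machine learning
Machine learning literally means making machines understand and learn as people do: it draws insight from data sets and reveals the real meaning behind the data. This embodiment studies the role of text in credit assessment and must judge a borrower's credit level from the set of features mined from the text. The required information is hard to read directly from the raw text, or even from the feature set, so machine learning algorithms are used to process these unordered data and turn them into quantified features a computer can recognize and process. A model is constructed that takes the text represented by these features as input and outputs the category to which the borrower's credit level most likely belongs. The main tasks of machine learning are classification and regression, which match the task of this embodiment. Classification decides which category an instance belongs to from its feature information. Regression fits an optimal curve to given data points. Both are supervised learning: what is to be predicted, that is, the class information of the target variable, must be known, and the data are usually divided into a training set and a test set.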
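Embodiment 4
Whether the language features in the text submitted by borrowers on a P2P platform can improve the accuracy of credit assessment, and whether usable predictions can be made from these language features, is an important subject of study. Text contains rich information about the writer. Besides the semantic content, the writing style and writing habits also reflect the writer's personality and even credit level. In general, however, the grammar, semantics, and emotional tendency of a text cannot be represented and processed directly by a computer, so these features must be identified and quantified before they can be used in text analysis.
To understand the language features in loan texts comprehensively, two complementary steps, explanation and prediction, are proposed. On the one hand, by summarizing the relevant linguistics and psychology literature and applying econometric models, the role of different linguistic information in signaling potential risk is studied and suitable features are selected for prediction. On the other hand, several common machine learning methods are used to assess credit with these language features, and the results are analyzed to reveal the predictive power of the linguistic information.
Building on the preceding introduction of abstract text features and basic language features, this embodiment describes the experimental procedure and results.
4.1 Experimental procedure
Finance-related features are of two kinds. One is basic financial features, the financial information a user must fill in at registration and loan application. The other is credit features, which must be derived from the user's historical credit records or purchased from specialized agencies. The experimental data are processed to obtain financial feature data and the text feature data described above. With financial features as the baseline, experiments are run on text features alone, on text features plus basic financial features, and on text features merged with financial features, to observe the performance of the credit assessment model and study the role of text features in credit assessment. This embodiment uses five common basic classification algorithms, namely decision tree, naive Bayes, logistic regression, neural network, and random forest, and explores the classification performance of text features in credit assessment across these machine learning models.
1) Experimental data. For the experimental data of this embodiment, see the data description.
2) Data preprocessing and normalization.
The extracted text cannot be used directly and must first be preprocessed. The Listing data downloaded from the database are not in plain text format and must be extracted from XML. The other two texts also contain XML tags, so tags and other content unrelated to the text are filtered out before feature extraction.
The value ranges of the entries in a loan record differ greatly. For example, the loan-to-income ratio lies between 0 and 1, the loan amount is in the thousands, and counts such as part-of-speech and vocabulary features range from a few to a few hundred. Such large differences between feature values easily unbalance the weights, so after feature extraction each feature is transformed to the same range, that is, normalized. This experiment uses simple min-max scaling, also called deviation standardization, a linear transformation of the original data that maps the results to values between 0 and 1. The transformation is:
$x' = \dfrac{x - \min}{\max - \min}$
where max is the maximum and min the minimum of the feature, and x is the actual value to be transformed.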
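Min-max scaling of one feature column can be sketched as below. The function name is illustrative, and the handling of a constant column (where the maximum equals the minimum) is a choice made for this sketch that the source does not discuss.

```python
def min_max_scale(values):
    """Linearly map a list of numbers onto [0, 1] using its own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: returning zeros is an assumption made here.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For loan amounts of 1000, 5000, and 25000, the scaled values are 0, about 0.167, and 1, which puts them on the same scale as a loan-to-income ratio that already lies between 0 and 1.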
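3) Feature selection
Feeding too many features into a machine learning model not only lengthens training time but often lowers classification performance. Among a large number of input features there may be irrelevant features or features that depend on each other, which amounts to introducing noise. When the noise introduced outweighs the gain from the added features, classification accuracy falls.
Feature selection was proposed to solve this kind of problem. It removes irrelevant or redundant features from the current set of M extracted features and keeps only the subset that helps classification, thereby reducing the dimensionality of the data set. Fig. 3 is a schematic diagram of the general feature selection process in an embodiment of the present invention. As shown in Fig. 3, an initial subset is set first. A search algorithm is then used to determine the feature subset. This experiment mainly uses best-first search, a complete search method. The size of the feature subset starts from 1; candidate subsets are enumerated, the classification performance of the subset after each newly added feature is computed, and an evaluation function is used to judge that performance. This experiment uses the wrapper evaluation method, which selects different feature subsets for different classifiers: it trial-classifies the samples and uses the classifier's error rate as the measure, and therefore gives good classification results.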
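The wrapper idea, growing the subset one feature at a time and scoring each candidate subset by trial classification, can be illustrated with a simplified greedy forward search. Note that this is a stand-in for, not a reproduction of, the best-first search the experiment describes: it never backtracks. The callback `evaluate`, which is assumed to return an accuracy estimate for a feature subset (for example from cross-validation with the chosen classifier), and the function name are hypothetical.

```python
def forward_select(features, evaluate):
    """Greedy wrapper-style forward selection.

    features: candidate feature names.
    evaluate: callback returning a score (higher is better) for a list of
              feature names; assumed to wrap a trial classification.
    Stops when adding any remaining feature no longer improves the score.
    """
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every subset obtained by adding one more feature.
        score, feat = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = score
    return selected, best
```

With a scoring function in which two features help and a third adds only noise, the search keeps the two useful features and stops before adding the noisy one, which mirrors the observation that extra features can lower accuracy.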
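4) Model training
Training on data with a machine learning algorithm and predicting with the resulting model usually follows these steps. 1) Prepare the input data, that is, the extracted text features and loan features mentioned earlier in this embodiment; the training data must also contain the already classified target variable. 2) Train the algorithm. This is where the machine learning algorithm actually starts to learn: the formatted data are fed to the algorithm, which extracts knowledge or information from them and forms a model that can be used for prediction, that is, the corresponding model parameters are obtained. 3) Test the algorithm. Before the model is used, how well the algorithm works must be tested. All the machine learning algorithms used in this embodiment are supervised, so they are evaluated by comparing known target values with predicted values; if the output is unsatisfactory, the model is corrected and tested again.
This embodiment uses five common machine learning classifiers. 1) A decision tree classifier that uses the information gain ratio as its measure, splitting the data each time on the feature with the highest information gain ratio; the confidence factor for pruning the tree is set to 0.005. 2) A logistic regression classifier that uses the sigmoid function and determines the best regression coefficients by stochastic gradient ascent. 3) A neural network classifier; the experiment uses a back-propagation neural network with the sigmoid activation function. 4) A random forest classifier built from 100 trees. 5) A naive Bayes classifier.
5) Cross-validation
After a machine learning model is trained, its accuracy must be validated. The experiments in this embodiment use five-fold cross-validation. Cross-validation is a practical method that randomly divides the data samples into smaller subsets when the amount of data is not large enough. One subset is held out as the test set to measure accuracy and other indicators, and the remaining subsets are used as training samples to train the classifier. Five-fold cross-validation divides the data set into five parts; each time one part is chosen as the test set and the remaining four parts as the training set, so that five experiments are run, and the accuracies obtained are averaged as the estimate of the algorithm's accuracy.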
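Five-fold cross-validation as described, five disjoint folds with each fold serving once as the test set and the mean accuracy taken over the five runs, can be sketched as follows. The function names, the fixed random seed, and the `fit_and_score` callback (assumed to train on the training indices and return the accuracy on the test indices) are hypothetical.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for k disjoint folds of n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validate(n, fit_and_score, k=5):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)
```

Each sample index appears in exactly one test fold, so every record contributes once to the accuracy estimate.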
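6) Evaluation metric
This embodiment mainly uses accuracy to evaluate the influence of text on credit assessment. Accuracy is the percentage, over test data with known repayment or default outcomes, of records for which the classification produced by the algorithm matches the recorded outcome.
4.2 Experimental results
Having introduced the extracted text features and the experimental procedure, the following describes the experiments run from different angles and their results, and compares and analyzes the final results.
1) Influence of text features on credit assessment classification
To study the influence of text features on credit assessment classification, this embodiment trains and tests models using financial feature data, text feature data, and combined financial and text feature data as input, with the financial feature data as the control. The five classifiers mentioned above are used, and the accuracy of each classifier model for credit risk assessment is computed. Because adding features brings noise, and too many features can cause feature overload and reduce classification performance, feature selection is applied to the feature data before model training. The results are then compared with the classification results of the financial features. The final predictions of the three kinds of feature data on the five classifier models are shown in Table 5-2.
First, the classification results of predicting with text features alone are compared with those of predicting with financial features alone. For most classifier models, accuracy with text features is lower than with financial features, but the values are close. In particular, after feature selection the random forest reaches an accuracy of 67.42%, about 0.1 percentage points higher than with financial features; the neural network reaches 67.83%, about 0.5 percentage points below its 68.37% with financial features. In addition, predictions with combined financial and text features improve on predictions with financial features alone to varying degrees for every classifier.
In current P2P lending, objective, quantifiable data related to the borrower, such as credit ratings, historical data, and asset data, are not fully public, and some can be obtained only by purchase. At the same time, the existing personal credit assessment system is not yet complete, and many borrowers do not have enough financial feature data for credit assessment. Financial features are therefore costly to obtain and come from few sources. Text features are much easier to obtain. When a borrower applies for a loan, the application description can be used to extract text features and assess the borrower's credit. Where low cost and easy access matter, text features can replace financial features for credit risk assessment and yield results that differ little and remain within an acceptable range. The comparison of classification results is shown in Fig. 4-1.
In addition, the classification results after adding text features to the financial features are compared with the results using financial features alone, as shown in Fig. 4-2. For all five classifier models, accuracy improves to varying degrees after text features are added. This shows that the text features extracted in this embodiment can not only replace financial features as a slightly less accurate but cheaper means of credit assessment, but also improve the prediction accuracy of credit risk when added to financial features. After text features are added, the highest classification accuracy reaches 70.6%, and the random forest classifier improves by about 3 percentage points. Besides classification with a single classifier, this embodiment also introduces multi-classifier ensembles to explore whether integration can further improve accuracy.
2) Influence of the number and types of text features on credit assessment classification
According to the preceding results, text features can to some extent substitute for financial features in credit assessment, and adding text features improves the assessment. The experiment therefore further explores whether the number and the types of text features affect classification. A logistic classifier and a random forest classifier are each trained on different features of the listing description text and compared with the classification results obtained using only the control variables, the financial loan features. As shown in Fig. 4-3, for the logistic regression classifier, accuracy drops slightly after the simple statistical text features are added, but rises after sentiment features or part-of-speech features are added, with sentiment features giving the largest gain; after all text features are added, accuracy improves clearly, by about 0.5 percentage points. For the random forest classifier, accuracy improves markedly after text features are added, by up to about 3 percentage points, which again shows that text features can improve the accuracy of credit classification; however, the result with all features added is slightly lower than the results with statistical, sentiment, or part-of-speech features added separately. Accuracy may fail to rise as expected when the number of features grows because noise grows with it and degrades classification. The data therefore show that text features can improve the accuracy of credit assessment, although adding more text features does not always bring a further gain.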
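Embodiment 5
A credit risk assessment system is designed as an ensemble of multiple classifiers, as shown in Fig. 5-1. First, seven base classifiers are built. They include six text analysis classifiers corresponding to the six abstract text features, which characterize different aspects of the borrower: subjectivity, deception, readability of the text, sentiment, the user's personality traits, and thinking style. Each classifier takes basic language features as input and predicts whether the borrower will fail to repay; a fusion system then integrates the outputs of the seven classifiers.
Logistic regression is used for the deception, subjectivity, and personality classifiers; random forest is used for the readability classifier, the sentiment classifier, and the basic loan classifier; a multilayer perceptron is used for the thinking-style classifier; and a decision tree is used to fuse the results of the different classifiers.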
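The fusion stage can be pictured as a small decision tree trained on the base classifiers' outputs. The sketch below is a deliberately simplified stand-in: it assumes each base classifier emits a binary vote (1 = predicted default), splits on plain information gain rather than the gain ratio with pruning mentioned for the decision tree classifier elsewhere in the description, and uses invented names throughout (`build_tree`, `fuse`, and the vote keys).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def build_tree(rows, labels, features):
    """Train a small decision tree on binary base-classifier votes.

    rows: list of dicts mapping base-classifier name -> 0/1 vote.
    labels: true outcomes. features: names still available for splitting.
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def gain(f):
        remainder = 0.0
        for v in (0, 1):
            subset = [y for r, y in zip(rows, labels) if r[f] == v]
            if subset:
                remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(features, key=gain)
    if gain(best) <= 1e-12:  # no informative split left
        return majority
    node = {"feature": best, "default": majority, "children": {}}
    rest = [f for f in features if f != best]
    for v in (0, 1):
        part = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        if part:
            node["children"][v] = build_tree(
                [r for r, _ in part], [y for _, y in part], rest)
    return node


def fuse(tree, votes):
    """Walk the tree with one record's base-classifier votes."""
    while isinstance(tree, dict):
        tree = tree["children"].get(votes[tree["feature"]], tree["default"])
    return tree
```

In a full system the rows would hold seven votes per loan, one from each base classifier, produced on data the fusion tree has not otherwise seen.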
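5.1 Experimental data
The experimental data in this embodiment come from the Prosper website, a well-known P2P online service platform with a large number of users. From the data Prosper provides, the loan records from 2006 to 2008 are extracted, because loans from this period have all reached a final repayment outcome, whether default or on-time repayment. In total, these three years contain 28,853 usable loan records with a definite outcome.
Prosper divides repayment status into seven states: current, late, paid, charge-off, defaulted, repurchased, and cancelled. Because the experimental data are all completed records, the current and late states do not occur. The data are divided into two classes, default and non-default. Default comprises charge-off and defaulted, 9,937 records in total; non-default comprises the remaining states, 18,916 records in total. The ratio of default to non-default records is about 1:1.9.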
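The class counts can be checked with a few lines of arithmetic, which also gives the accuracy a trivial classifier would reach by always predicting the majority class. This baseline is not reported in the source; it is computed here only as context for reading the accuracies in the experiments.

```python
defaulted = 9937    # charge-off + defaulted records
repaid = 18916      # all remaining completed records

total = defaulted + repaid                 # 28853 records overall
ratio = repaid / defaulted                 # non-default records per default record
majority_baseline = repaid / total         # accuracy of always predicting "no default"

print(total, round(ratio, 2), round(100 * majority_baseline, 2))
```

The counts sum to the 28,853 records stated above, the ratio of default to non-default works out to roughly 1:1.90, and always predicting non-default would be right for about 65.56% of records.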
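First, eight basic loan features describing the loan are extracted from the loan records in Prosper's database. These are quantifiable, structured features, such as the loan-to-income ratio, the website rating, and the loan amount. Because the loan's own features are described directly by numbers and mostly relate to the borrower's repayment ability and wealth, they are used as the control variables of the experiment and describe the ability to repay. These features are shown in Table 5-1:
Table 5-1 Financial loan features
Figure PCTCN2016081998-appb-000010
Second, three kinds of text description are available when a borrower applies. One is the description item in the listing table, submitted by the borrower to describe the borrower's own situation and the reason for borrowing; it is a detailed description of the loan written by the borrower personally. The other two are the description and endorsement items in the registered-user table Member, which describe the borrower's own situation and a recommendation about the borrower, respectively. This embodiment uses the description text in the listing table and focuses on the borrower's own description of the loan in order to uncover the borrower's credit status.
Using the feature selection and extraction methods above, 70 low-level features are extracted from the text and combined into 6 abstract text features. The features finally input to the model form a feature network in which upper-level features are represented by lower-level features and stand for their abstract meaning, expressing the borrower's credit level step by step.
5.2 Classifiers
The machine learning classifiers involved in the embodiments of the present invention are as follows. 1) A decision tree classifier that uses the information gain ratio as its measure, splitting the data each time on the feature with the highest information gain ratio, with the confidence factor for pruning set to 0.005; 2) a logistic regression classifier that uses the sigmoid function and determines the best regression coefficients by stochastic gradient ascent; 3) a neural network classifier, for which the experiment uses a back-propagation neural network with the sigmoid activation function; 4) a random forest classifier built from 100 trees; 5) a naive Bayes classifier.
5.3 Experiment 1
Influence of text features on credit assessment classification.
To study the influence of text features on credit assessment classification, this embodiment trains and tests models using financial feature data, text feature data, and combined financial and text feature data as input, with the financial feature data as the control. The five classifiers mentioned above are used, and the accuracy of each classifier model for credit risk assessment is computed. Because adding features brings noise, and too many features can cause feature overload and reduce classification performance, feature selection is applied to the feature data before model training. The results are then compared with the classification results of the loan features. The final predictions of the three kinds of feature data on the five classifier models are shown in Table 5-2.
First, the classification results of predicting with text features alone are compared with those of predicting with financial features alone. For most classifier models, accuracy with text features is lower than with financial features, but the values are close. In particular, after feature selection the random forest reaches an accuracy of 67.42%, about 0.1 percentage points higher than with financial features; the neural network reaches 67.83%, about 0.5 percentage points below its 68.37% with financial features. In addition, predictions with combined financial and text features improve on predictions with financial features alone to varying degrees for every classifier.
Table 5-2 Results of single classifiers on different feature data
                  Bayes   Logistic  Decision tree  Neural network  Random forest
Financial         69.26%  70.19%    69.85%         68.37%          67.3%
Text              67.3%   67.60%    68.7%          67.83%          67.42%
Financial + text  69.69%  70.6%     70.54%         69.2%           70.22%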
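The gains discussed for Table 5-2 can be recomputed from the table itself. The snippet below transcribes the reported accuracies and derives, per classifier, the improvement from adding text features to financial features; the variable names are chosen here for illustration.

```python
# Accuracies (%) transcribed from Table 5-2.
ACC = {
    "financial": {"Bayes": 69.26, "Logistic": 70.19, "DecisionTree": 69.85,
                  "NeuralNet": 68.37, "RandomForest": 67.30},
    "text": {"Bayes": 67.30, "Logistic": 67.60, "DecisionTree": 68.70,
             "NeuralNet": 67.83, "RandomForest": 67.42},
    "financial+text": {"Bayes": 69.69, "Logistic": 70.60, "DecisionTree": 70.54,
                       "NeuralNet": 69.20, "RandomForest": 70.22},
}

# Gain, in percentage points, from adding text features to financial features.
gain = {m: round(ACC["financial+text"][m] - ACC["financial"][m], 2)
        for m in ACC["financial"]}
largest = max(gain, key=gain.get)

# Gap between financial-only and text-only accuracy for the neural network.
nn_gap = round(ACC["financial"]["NeuralNet"] - ACC["text"]["NeuralNet"], 2)
```

Every classifier gains from the added text features, with the random forest gaining the most at 2.92 percentage points, and the neural network's text-only accuracy trails its financial-only accuracy by 0.54 percentage points.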
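5.4 Experiment 2
This experiment evaluates the performance obtained by combining multiple classifiers, each of which considers only one aspect of knowledge. Besides the classifier based on basic loan features, there are six text classifiers that use text features to describe different aspects of the borrower: readability of the text, deception, subjectivity, sentiment, personality traits, and thinking style.
Fig. 5-2 is a performance comparison for combining different numbers of classifiers in an embodiment of the present invention. As shown in Fig. 5-2, for the six classifiers that each use text features alone and the classifier that uses financial features, logistic regression, random forest, and neural network models are each trained independently on the data to obtain classification results. One classification result is selected for each classifier as input to the second-layer classifier, and the final classification result is obtained by training the upper-layer classifier.
The best classification result is obtained when the upper-layer classifier is a decision tree and the seven lower-layer classifiers use the following models (deception: logistic regression; readability of the text: random forest; sentiment: random forest; thinking style: neural network; subjectivity: logistic regression; personality traits: logistic regression; financial analysis: random forest). The accuracy is 71.35%, more than 1 percentage point above the best financial-only accuracy of 70.19% and 0.75 percentage points above the best single-classifier accuracy of 70.6%. It can also be seen that accuracy rises as more text classifiers are added, and is in every case better than financial analysis alone and than a single classifier on the listing text.
From this analysis of each multi-classifier ensemble method it follows that an ensemble of multiple classifiers predicts somewhat better than a single classifier. Finally, the highest accuracy achievable by each ensemble algorithm is taken as that algorithm's result and compared with the result on the original financial feature data, giving comparative data for the different ensemble algorithms.
Among these ensemble algorithms, different algorithms improve classification accuracy by different amounts. Boosting and bagging fuse the decisions of classifiers of the same type; their classification results are slightly worse than those of decision fusion over different classifiers, but their variance shows that the results are more stable. Using different classifiers as base classifiers, so that the decision providers in the fusion differ from one another, lets the final result take different aspects into account and remain applicable in different situations, so the final result is more accurate. In other words, a correct result can withstand checking by different algorithms, and the more types of algorithm are used, the more likely a wrong result is to be identified. The data show that ensemble algorithms based on weighted and simple voting improve classification accuracy more than several of the other decision fusion methods. Such parallel ensembles of different base classifiers take into account the classification ability of each base classifier and assign them different weights, giving more accurate predictions. The hybrid classifier approaches the problem from different angles; after the final decision fusion its diversity is more pronounced, and its result is the best among the ensemble methods implemented in this embodiment.
Taking all the experimental results together, the highest accuracy reached by a single classifier using financial plus text features is 70.6%, with logistic regression, and the highest accuracy reached using financial features alone is 70.19%, also with logistic regression (see Table 5-2). After multi-classifier integration, whichever algorithm is used, prediction improves clearly over financial features alone; the hybrid classifier improves by more than 1 percentage point, and it also improves somewhat over a single classifier using text plus financial features. Because a multi-classifier ensemble fuses the decisions of several classifiers, its results are also more reliable and stable. Multi-classifier ensemble methods therefore play an important role in credit classification for credit assessment.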
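5.5 Discussion
The results of Experiment 1 show that the performance of different classifiers on text features is close to their performance on traditional loan features, which include the FICO score and DTI (debt-to-income ratio). Collecting and verifying FICO scores and DTI is relatively costly. Moreover, one goal of online P2P lending is to serve people who have no commercial credit record, that is, no FICO score.
In these cases, using text analysis to assess credit risk is a good option. The results of Experiments 1 and 2 show that adding text features improves the performance of the whole credit risk assessment system. Adding text features directly to an existing random-forest-based system can raise its accuracy from about 67% to about 70%. Combining multiple text classifiers with the basic loan classifier raises accuracy further, to about 71%. All of this indicates that text information is a good complement to traditional financial information. Financial features such as DTI focus on assessing the borrower's ability to repay, while text features focus on assessing the borrower's willingness to repay.
5.6 Conclusion
This embodiment uses text analysis and ensemble learning to assess the credit risk of online P2P loans. First, a conceptual model with six abstract text features is designed; the six features explore the borrower's thinking from different angles.
Then, an ensemble credit risk assessment system based on seven classifiers is designed. The seven classifiers comprise six text analysis classifiers corresponding to the six abstract text features and one traditional credit analysis classifier. The experimental results show that the performance of different classifiers on text features is close to their performance on traditional financial features, including the FICO score and DTI.
Text information is therefore a good alternative when traditional financial information is missing or unavailable. In addition, adding text features improves the performance of the whole credit risk assessment system, which means that text information is a good complement to traditional financial information: combining the two improves the performance of credit risk assessment.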
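Embodiment 6
Based on the foregoing embodiments, an embodiment of the present invention further provides a credit risk assessment apparatus based on text analysis. The first obtaining unit, analysis unit, processing unit, output unit, and establishing unit of the apparatus, and the modules included in each unit, can all be implemented by a processor in a computing device, or by logic circuits. In practice, the processor may be a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or the like.
Fig. 6 is a schematic structural diagram of the credit risk assessment apparatus based on text analysis according to Embodiment 6. As shown in Fig. 6, the apparatus 600 includes a first obtaining unit 601, an analysis unit 602, a processing unit 603, and an output unit 604, where:
the first obtaining unit 601 is configured to obtain a borrower's text;
the analysis unit 602 is configured to analyze the text to obtain basic language features, the basic language features being used to predict whether the borrower will default;
the processing unit 603 is configured to input the basic language features into a preset credit risk assessment model to obtain the borrower's credit risk value output by the credit risk assessment model;
the output unit 604 is configured to output the borrower's credit risk value.
In the embodiments of the present invention, the apparatus further includes an establishing unit configured to build the credit risk assessment model, the establishing unit further including an obtaining module, an analysis module, an establishing module, and a fusion module, where:
the obtaining module is configured to obtain training data;
the analysis module is configured to analyze the training data to obtain the basic language features of the training data;
the first establishing module is configured to use the basic language features as parameters and build classifiers corresponding to different abstract text features by machine learning methods;
the fusion module is configured to use the classifiers as base classifiers and perform decision fusion with a decision tree algorithm to form the credit risk assessment model.
In the embodiments of the present invention, using the basic language features as parameters in the establishing module includes: inputting the basic language features into the classifier corresponding to each abstract text feature according to the relationship between the basic language features and the abstract text features.
In the embodiments of the present invention, the establishing unit further includes a splitting module and a statistics module, where the splitting module is configured to split the training data according to sentence-ending punctuation marks, and the statistics module is configured to compute statistics on the split training data to obtain the statistical features.
In the embodiments of the present invention, the establishing unit further includes a second establishing module and a determining module, where:
the second establishing module is configured to build classifiers corresponding to the same abstract text feature using different machine learning methods;
the determining module is configured to determine the classifier with the highest accuracy as the classifier corresponding to the abstract text feature.
It should be noted that the description of the apparatus embodiment above is similar to the description of the method embodiments and has similar beneficial effects, so it is not repeated here. For technical details not disclosed in the apparatus embodiment, refer to the description of the method embodiments of the present invention.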
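It should be noted that, in the embodiments of the present invention, if the above credit risk assessment method based on text analysis is implemented in the form of software functional modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. On this understanding, the technical solutions of the embodiments of the present invention, in essence or in the part that contributes to the prior art, can be embodied as a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. The embodiments of the present invention are thus not limited to any specific combination of hardware and software.
Accordingly, an embodiment of the present invention further provides a computer storage medium storing computer-executable instructions, the computer-executable instructions being used to perform the credit risk assessment method based on text analysis in the embodiments of the present invention.
It should be understood that "one embodiment" or "an embodiment" mentioned throughout the specification means that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present invention. Therefore, "in one embodiment" or "in an embodiment" appearing in various places throughout the specification does not necessarily refer to the same embodiment. Moreover, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic and should not limit the implementation of the embodiments of the present invention in any way. The sequence numbers of the embodiments of the present invention are for description only and do not indicate that one embodiment is better than another.
It should be noted that, in this document, the terms "include" and "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may stand alone as a unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disc.
Alternatively, if the integrated unit of the present invention is implemented in the form of software functional modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. On this understanding, the technical solutions of the embodiments of the present invention, in essence or in the part that contributes to the prior art, can be embodied as a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The above are merely implementations of the present invention, and the scope of protection of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be subject to the scope of protection of the claims.
Industrial Applicability
In the embodiments of the present invention, a borrower's text is obtained; the text is analyzed to obtain basic language features used to predict whether the borrower will default; the basic language features are input into a preset credit risk assessment model to obtain the borrower's credit risk value output by the model; and the borrower's credit risk value is output. In this way, the borrower's credit risk can be assessed effectively, giving investors an important basis for investment decisions.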

Claims (16)

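  1. A credit risk assessment method based on text analysis, the method comprising:
    obtaining a borrower's text;
    analyzing the text to obtain basic language features, the basic language features being used to predict whether the borrower will default;
    inputting the basic language features into a preset credit risk assessment model to obtain the borrower's credit risk value output by the credit risk assessment model;
    outputting the borrower's credit risk value.
  2. The method according to claim 1, wherein building the credit risk assessment model comprises:
    obtaining training data;
    analyzing the training data to obtain the basic language features of the training data;
    using the basic language features as parameters, building classifiers corresponding to different abstract text features by machine learning methods;
    using the classifiers corresponding to the abstract text features as base classifiers, and performing decision fusion with a decision tree algorithm to form the credit risk assessment model.
  3. The method according to claim 2, wherein the basic language features comprise at least statistical features, part-of-speech features, sentiment features, entity features, and tense features of the text; wherein the statistical features comprise sentence features, word features, and punctuation features, wherein: the sentence features comprise at least the total number of sentences, the average sentence length, the maximum sentence length, and the proportion of interrogative sentences; the word features comprise at least the average word length, the longest word, the number of word types, the total number of words, the average number of occurrences per word, and the maximum number of occurrences of a word; and the punctuation features comprise at least the proportion of question marks and the proportion of exclamation marks.
  4. The method according to claim 2, wherein the abstract text features comprise deception, subjectivity, sentiment, readability, personality traits, and thinking style.
  5. The method according to claim 2, wherein using the basic language features as parameters comprises:
    inputting the basic language features into the classifier corresponding to each abstract text feature according to the relationship between the basic language features and the abstract text features.
  6. The method according to claim 2, wherein building the credit risk assessment model further comprises: splitting the training data according to sentence-ending punctuation marks, and computing statistics on the split training data to obtain the statistical features.
  7. The method according to claim 6, wherein the sentence-ending punctuation marks comprise at least the period, the question mark, and the exclamation mark.
  8. The method according to any one of claims 2 to 7, wherein building the credit risk assessment model further comprises:
    building classifiers corresponding to the same abstract text feature using different machine learning methods;
    using the classifier with the highest accuracy as the classifier corresponding to the abstract text feature.
  9. The method according to claim 8, wherein the machine learning methods comprise: an artificial neural network method, a support vector machine method, a decision tree method, a Bayesian method, a random forest method, and a logistic regression method.
  10. The method according to claim 9, wherein using the classifiers as base classifiers comprises: using the classifier corresponding to the logistic regression method as a base classifier.
  11. A credit risk assessment apparatus based on text analysis, the apparatus comprising a first obtaining unit, an analysis unit, a processing unit, and an output unit, wherein:
    the first obtaining unit is configured to obtain a borrower's text;
    the analysis unit is configured to analyze the text to obtain basic language features, the basic language features being used to predict whether the borrower will default;
    the processing unit is configured to input the basic language features into a preset credit risk assessment model to obtain the borrower's credit risk value output by the credit risk assessment model;
    the output unit is configured to output the borrower's credit risk value.
  12. The apparatus according to claim 11, wherein the apparatus further comprises an establishing unit configured to build the credit risk assessment model, the establishing unit further comprising an obtaining module, an analysis module, an establishing module, and a fusion module, wherein:
    the obtaining module is configured to obtain training data;
    the analysis module is configured to analyze the training data to obtain the basic language features of the training data;
    the first establishing module is configured to use the basic language features as parameters and build classifiers corresponding to different abstract text features by machine learning methods;
    the fusion module is configured to use the classifiers as base classifiers and perform decision fusion with a decision tree algorithm to form the credit risk assessment model.
  13. The apparatus according to claim 12, wherein using the basic language features as parameters in the establishing module comprises: inputting the basic language features into the classifier corresponding to each abstract text feature according to the relationship between the basic language features and the abstract text features.
  14. The apparatus according to claim 12, wherein the establishing unit further comprises a splitting module and a statistics module, wherein the splitting module is configured to split the training data according to sentence-ending punctuation marks, and the statistics module is configured to compute statistics on the split training data to obtain the statistical features.
  15. The apparatus according to any one of claims 12 to 14, wherein the establishing unit further comprises a second establishing module and a determining module, wherein:
    the second establishing module is configured to build classifiers corresponding to the same abstract text feature using different machine learning methods;
    the determining module is configured to determine the classifier with the highest accuracy as the classifier corresponding to the abstract text feature.
  16. A computer storage medium storing computer-executable instructions, the computer-executable instructions being used to perform the credit risk assessment method based on text analysis according to any one of claims 1 to 10.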
PCT/CN2016/081998 2015-10-22 2016-05-13 Credit risk assessment method and apparatus based on text analysis, and storage medium WO2017067153A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/728,128 US11164075B2 (en) 2015-10-22 2017-10-09 Evaluation method and apparatus based on text analysis, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510695316.1A CN106611375A (zh) 2015-10-22 2015-10-22 Credit risk assessment method and apparatus based on text analysis
CN201510695316.1 2015-10-22

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/728,128 Continuation US11164075B2 (en) 2015-10-22 2017-10-09 Evaluation method and apparatus based on text analysis, and storage medium

Publications (1)

Publication Number Publication Date
WO2017067153A1 true WO2017067153A1 (zh) 2017-04-27

Family

ID=58556635

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081998 WO2017067153A1 (zh) 2015-10-22 2016-05-13 Credit risk assessment method and apparatus based on text analysis, and storage medium

Country Status (3)

Country Link
US (1) US11164075B2 (zh)
CN (1) CN106611375A (zh)
WO (1) WO2017067153A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
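CN103093280A (zh) * 2011-10-31 2013-05-08 铭传大学 Credit default prediction method and apparatus
CN103154991A (zh) * 2010-07-23 2013-06-12 汤森路透环球资源公司 Credit risk collection
CN104616198A (zh) * 2015-02-12 2015-05-13 哈尔滨工业大学 P2P online lending risk prediction system based on text analysis
CN104866969A (zh) * 2015-05-25 2015-08-26 百度在线网络技术(北京)有限公司 Personal credit data processing method and apparatus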

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789498B (zh) * 2012-07-16 2014-08-06 钱钢 Ensemble-learning-based sentiment classification method and system for Chinese review text
US20150142446A1 (en) * 2013-11-21 2015-05-21 Global Analytics, Inc. Credit Risk Decision Management System And Method Using Voice Analytics
CN104657422B (zh) * 2015-01-16 2018-05-15 北京邮电大学 Intelligent content-publishing classification method based on a classification decision tree

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
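CN103154991A (zh) * 2010-07-23 2013-06-12 汤森路透环球资源公司 Credit risk collection
CN103093280A (zh) * 2011-10-31 2013-05-08 铭传大学 Credit default prediction method and apparatus
CN104616198A (zh) * 2015-02-12 2015-05-13 哈尔滨工业大学 P2P online lending risk prediction system based on text analysis
CN104866969A (zh) * 2015-05-25 2015-08-26 百度在线网络技术(北京)有限公司 Personal credit data processing method and apparatus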

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
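WO2019137050A1 (zh) * 2018-01-12 2019-07-18 阳光财产保险股份有限公司 Real-time fraud detection method and apparatus in internet credit scenarios, and server
CN108932530A (zh) * 2018-06-29 2018-12-04 新华三大数据技术有限公司 Method and apparatus for constructing a label system
CN109840281A (zh) * 2019-02-27 2019-06-04 浪潮软件集团有限公司 Self-learning intelligent determination method based on the random forest algorithm
CN109992668A (zh) * 2019-04-04 2019-07-09 上海冰鉴信息科技有限公司 Self-attention-based enterprise public opinion analysis method and apparatus
CN109992668B (zh) * 2019-04-04 2023-02-21 上海冰鉴信息科技有限公司 Self-attention-based enterprise public opinion analysis method and apparatus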
US11082454B1 (en) * 2019-05-10 2021-08-03 Bank Of America Corporation Dynamically filtering and analyzing internal communications in an enterprise computing environment
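CN110427615A (zh) * 2019-07-17 2019-11-08 宁波深擎信息科技有限公司 Attention-mechanism-based method for analyzing the modifying tense of financial events
CN111061815A (zh) * 2019-12-13 2020-04-24 携程计算机技术(上海)有限公司 Session data classification method
CN111061815B (zh) * 2019-12-13 2023-04-25 携程计算机技术(上海)有限公司 Session data classification method
CN111652627A (zh) * 2020-07-07 2020-09-11 中国银行股份有限公司 Risk assessment method and apparatus
CN111652627B (zh) * 2020-07-07 2024-04-23 中国银行股份有限公司 Risk assessment method and apparatus
CN117876104A (zh) * 2024-03-13 2024-04-12 湖南三湘银行股份有限公司 Intelligent credit management and control method and system based on an AI language model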

Also Published As

Publication number Publication date
US11164075B2 (en) 2021-11-02
CN106611375A (zh) 2017-05-03
US20180032870A1 (en) 2018-02-01

Similar Documents

Publication Publication Date Title
WO2017067153A1 (zh) Credit risk assessment method and apparatus based on text analysis, and storage medium
Craja et al. Deep learning for detecting financial statement fraud
US11580459B2 (en) Systems and methods for extracting specific data from documents using machine learning
US10515153B2 (en) Systems and methods for automatically assessing constructed recommendations based on sentiment and specificity measures
Liang et al. Analyzing credit risk among Chinese P2P-lending businesses by integrating text-related soft information
US20230056987A1 (en) Semantic map generation using hierarchical clause structure
WO2018184518A1 (zh) Microblog data processing method and apparatus, computer device, and storage medium
Minhas et al. From spin to swindle: Identifying falsification in financial text
Ma et al. A credit risk assessment model of borrowers in P2P lending based on BP neural network
Biswas et al. Scope of sentiment analysis on news articles regarding stock market and GDP in struggling economic condition
Mnif et al. Big data tools for Islamic financial analysis
Lutz et al. Predicting sentence-level polarity labels of financial news using abnormal stock returns
Arvanitis et al. Real-time investors’ sentiment analysis from newspaper articles
Wang et al. Does digitalization sufficiently empower female entrepreneurs? Evidence from their online gender identities and crowdfunding performance
Huang et al. Central bank communication: one size does not fit all
Liu et al. Supporting features updating of apps by analyzing similar products in App stores
Hu et al. Infiagent-dabench: Evaluating agents on data analysis tasks
Chen et al. COVID risk narratives: a computational linguistic approach to the econometric identification of narrative risk during a pandemic
Kim et al. Opinion mining-based term extraction sentiment classification modeling
Zhang et al. Behind the scenes: The role of writing guideline design in online charitable crowdfunding market
Sun Textual features of peer review predict top-cited papers: An interpretable machine learning perspective
Dash Information Extraction from Unstructured Big Data: A Case Study of Deep Natural Language Processing in Fintech
Craja et al. Deep Learning application for fraud detection in financial statements
Moniz Textual analysis of intangible information
Zheng et al. The Effects of Sentiment Evolution in Financial Texts: A Word Embedding Approach

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16856596

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.08.2018)

122 Ep: pct application non-entry in european phase

Ref document number: 16856596

Country of ref document: EP

Kind code of ref document: A1