CN110659817A - Data processing method and device, machine readable medium and equipment - Google Patents

Data processing method and device, machine readable medium and equipment

Info

Publication number
CN110659817A
Authority
CN
China
Prior art keywords
data
value
model
component
data table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910872797.7A
Other languages
Chinese (zh)
Inventor
周曦
姚志强
胡佩涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cloud From Enterprise Development Co Ltd
Original Assignee
Shanghai Cloud From Enterprise Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cloud From Enterprise Development Co Ltd filed Critical Shanghai Cloud From Enterprise Development Co Ltd
Priority to CN201910872797.7A priority Critical patent/CN110659817A/en
Publication of CN110659817A publication Critical patent/CN110659817A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Accounting & Taxation (AREA)
  • Mathematical Physics (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Analysis (AREA)
  • Educational Administration (AREA)
  • Pure & Applied Mathematics (AREA)
  • Operations Research (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Marketing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Tourism & Hospitality (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Game Theory and Decision Science (AREA)
  • Technology Law (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention provides a data processing method comprising the following steps: acquiring a service request of a financial service object; acquiring the attribute and behavior data of the financial service object; matching a corresponding service model according to the service type of the service request, wherein the service model is generated by training with a plurality of application components; and processing the attribute and behavior data of the financial service object through the service model, and outputting a financial service processing result. The invention removes the requirement that operators master modeling skills and enables business experts to generate and edit their own scoring cards using existing application components. The application components turn the programming techniques of programming experts into general-purpose components that other people can use, which facilitates the popularization and promotion of artificial intelligence technology.

Description

Data processing method and device, machine readable medium and equipment
Technical Field
The present invention relates to the field of financial technologies, and in particular, to a data processing method, apparatus, machine-readable medium, and device.
Background
With the development of artificial intelligence, the technology has gradually moved out of the laboratory and into various industries and daily life. Artificial intelligence can recognize patterns, predict future events, formulate rules, and drive automated workflows quickly; it gives users a good experience and achieves high accuracy in specific application scenarios. These capabilities are rapidly becoming a competitive factor for successful financial services enterprises. The scoring card is a common tool in the financial field and stands to benefit from artificial intelligence.
However, the popularization of the current artificial intelligence technology in the financial industry has the following limitations:
1. Recruiting professionals is costly
Artificial intelligence techniques depend on statistics and computer science and take extensive training to master, so the expertise is concentrated among holders of doctoral and master's degrees and other people with professional skills. The rapid increase in demand has caused personnel costs to surge, and small financial institutions struggle to sustain such high labor costs.
Recruited professionals typically do not have a financial background and are not familiar with banking.
Recruiting specialized financial AI talent is therefore not a suitable approach for small and medium-sized financial institutions;
2. Training existing personnel is difficult
As noted above, artificial intelligence techniques depend on statistics and computer science and take extensive training to master. Existing bank staff cannot quickly learn artificial intelligence modeling methods. Even with some knowledge, they cannot optimize a model skillfully.
Therefore, there is a need to solve the above problems.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention provides a data processing method, device, machine-readable medium and apparatus, which are used to solve the problems of the prior art.
To achieve the above and other related objects, the present invention provides a data processing method, including:
acquiring the attribute and behavior data of the financial business object;
and processing the attribute and the behavior data of the financial business object and outputting a financial business processing result.
Optionally, processing the attribute and behavior data of the financial business object includes inputting the attribute and behavior data of the financial business object into the generated business model.
Optionally, the attributes of the financial business object include name, age, region, occupation, income, education level, and asset condition.
Optionally, the behavior data includes whether a loan has been taken out and whether it is overdue.
Optionally, the processing result includes whether to grant credit and the amount and interest rate of the credit.
Optionally, the method for generating the business model includes:
preprocessing sample data;
performing binning on the preprocessed sample data to output a binned data table;
calculating WoE values for each data field in the binned data table to output a WoE value data table;
calculating the IV value of each data field in the binned data table according to the WoE value data table to output an IV value data table;
screening the data fields in the IV value data table according to a set screening threshold;
outputting a scoring card model according to the screened data fields and the binned data table;
and outputting the corresponding scoring card according to the WoE value data table and the model parameters of the scoring card model.
Optionally, the method for generating a business model further includes:
evaluating the scoring card model;
and outputting the scoring card model and the evaluation indexes of the scoring card model.
Optionally, the preprocessing the sample data includes:
receiving sample data;
sampling the sample data to output a first data table;
processing missing data in the first data table to output a second data table;
and processing the abnormal value in the second data table to output a third data table.
To achieve the above and other related objects, the present invention also provides a data processing method, including:
acquiring a service request of a financial service object;
acquiring the attribute and behavior data of the financial business object;
matching a corresponding service model according to the service type of the service request, wherein the service model is generated by training of a plurality of application components;
and processing the attribute and behavior data of the financial service object through the corresponding application component, and outputting a financial service processing result.
Optionally, the attributes of the financial business object include name, age, region, occupation, income, education level, and asset condition.
Optionally, the behavior data includes whether a loan has been taken out and whether it is overdue.
Optionally, the processing result includes whether to grant credit and the amount and interest rate of the credit.
Optionally, the application component comprises:
a data preprocessing component for preprocessing the sample data;
a data binning component for binning the preprocessed sample data to output a binned data table;
a WoE value calculation component for calculating WoE values for the respective data fields in the binned data table to output a WoE value data table;
an IV value calculation component for calculating the IV value of each data field in the binned data table according to the WoE value data table to output an IV value data table;
a feature selection component for screening the data fields in the IV value data table according to a set screening threshold;
a model generation component for outputting a scoring card model according to the screened data fields and the binned data table;
and a scoring card generation component for outputting the corresponding scoring card according to the WoE value data table and the model parameters of the scoring card model.
Optionally, the application component further comprises:
an evaluation component for evaluating the scoring card model;
and a derivation component for outputting the scoring card model and the evaluation indexes of the scoring card model.
Optionally, the data preprocessing component includes:
a data receiving component for receiving sample data;
the data sampling component is used for sampling the sample data to output a first data table;
the missing value processing component is used for processing the missing data in the first data table to output a second data table;
and the abnormal value processing component is used for processing the abnormal value in the second data table to output a third data table.
Optionally, processing the missing data includes replacing the missing value with one of the following: the preceding value, the following value, the maximum value, the minimum value, the mean value, or a user-defined value.
Optionally, processing the outliers comprises filling the outliers, and the filling method comprises mode filling, median filling, mean filling, or specified-value filling.
Optionally, the binning process includes equal-frequency binning, equal-width binning, or chi-square binning.
Optionally, the scoring card model is a logistic regression model, a probabilistic regression model, a decision tree, or a neural network.
Optionally, the scoring card generating component converts the output value of the scoring card model into the scoring card score.
Optionally, the evaluation indexes for evaluating the effect of the scoring card model include the AUC value, KS value, Accuracy value, Precision value, Recall value, ROC curve, KS curve, and PR curve.
Optionally, the evaluation index of the output scoring card model includes an AUC value, a KS value, an Accuracy value, a Precision value, a Recall value, an ROC curve, a KS curve, a PR curve, a scoring card scale, and a scoring card detail.
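Two of the evaluation indexes listed above, the AUC value and the KS value, can be illustrated with a minimal sketch. The function names `auc` and `ks` and the toy labels and scores are assumptions made for illustration; the patent does not specify how the indexes are computed. Here AUC is taken as the fraction of (positive, negative) pairs ranked correctly, and KS as the largest gap between the true positive rate and the false positive rate over all thresholds.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks(labels, scores):
    """Kolmogorov-Smirnov statistic: max |TPR - FPR| over all thresholds."""
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    best = 0.0
    for t in sorted(set(scores)):
        tpr = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t) / pos_total
        fpr = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t) / neg_total
        best = max(best, abs(tpr - fpr))
    return best


# Hypothetical toy data: 1 marks the positive class, scores are model outputs.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores), ks(labels, scores))
```

The pairwise AUC is quadratic in the sample size, which is acceptable for a sketch but not for large tables.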
To achieve the above and other related objects, the present invention also provides a data processing apparatus comprising:
the service request acquisition module is used for acquiring a service request of a financial service object;
the data acquisition module is used for acquiring the attribute and behavior data of the financial business object;
the model matching module is used for matching a corresponding service model according to the service type of the service request, and the service model is generated by training a plurality of application components;
and the result output module is used for processing the attribute and the behavior data of the financial business object through the corresponding business model and outputting a financial business processing result.
Optionally, the attributes of the financial business object include name, age, region, occupation, income, education level, and asset condition.
Optionally, the behavior data includes whether a loan has been taken out and whether it is overdue.
Optionally, the processing result includes whether to grant credit and the amount and interest rate of the credit.
Optionally, the business model is generated by a business model generating component, and the business model generating component includes:
a data preprocessing component for preprocessing the sample data;
a data binning component for binning the preprocessed sample data to output a binned data table;
a WoE value calculation component for calculating WoE values for the respective data fields in the binned data table to output a WoE value data table;
an IV value calculation component for calculating the IV value of each data field in the binned data table according to the WoE value data table to output an IV value data table;
a feature selection component for screening the data fields in the IV value data table according to a set screening threshold;
a model generation component for outputting a scoring card model according to the screened data fields and the binned data table;
and a scoring card generation component for outputting the corresponding scoring card according to the WoE value data table and the model parameters of the scoring card model.
Optionally, the business model generating component further comprises:
an evaluation component for evaluating the scoring card model;
and a derivation component for outputting the scoring card model and the evaluation indexes of the scoring card model.
Optionally, the data preprocessing component includes:
a data receiving component for receiving sample data;
the data sampling component is used for sampling the sample data to output a first data table;
the missing value processing component is used for processing the missing data in the first data table to output a second data table;
and the abnormal value processing component is used for processing the abnormal value in the second data table to output a third data table.
Optionally, processing the missing data includes replacing the missing value with one of the following: the preceding value, the following value, the maximum value, the minimum value, the mean value, or a user-defined value.
Optionally, processing the outliers comprises filling the outliers, and the filling method comprises mode filling, median filling, mean filling, or specified-value filling.
Optionally, the binning process includes equal-frequency binning, equal-width binning, or chi-square binning.
Optionally, the scoring card model is a logistic regression model, a probabilistic regression model, a decision tree, or a neural network.
Optionally, the scoring card generating component converts the output value of the scoring card model into the scoring card score.
Optionally, the evaluation indexes for evaluating the effect of the scoring card model include the AUC value, KS value, Accuracy value, Precision value, Recall value, ROC curve, KS curve, and PR curve.
Optionally, the evaluation index of the output scoring card model includes an AUC value, a KS value, an Accuracy value, a Precision value, a Recall value, an ROC curve, a KS curve, a PR curve, a scoring card scale, and a scoring card detail.
To achieve the above and other related objects, the present invention also provides an apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform one or more of the methods described previously.
To achieve the above objects and other related objects, the present invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform one or more of the methods described above.
As described above, the data processing method, apparatus, machine-readable medium and device provided by the present invention have the following beneficial effects:
The invention removes the requirement that operators master modeling skills and enables business experts to generate and edit their own scoring cards using existing application components. The application components turn the programming techniques of programming experts into general-purpose components that other people can use, which facilitates the popularization and promotion of artificial intelligence technology.
Drawings
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for generating a score card model according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating exemplary pre-processing of sample data according to an embodiment of the present invention;
FIG. 4 is a flow chart of a data processing method according to another embodiment of the present invention;
FIG. 5 is a diagram illustrating application components included in generating a score card model according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a data preprocessing component according to an embodiment of the present invention;
FIG. 7 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
As shown in fig. 1, the present invention provides a data processing method, including:
S10, acquiring the attribute and behavior data of the financial business object;
The attributes of the financial business object include name, age, region, occupation, income, education level, and asset condition. The behavior data includes whether a loan has been taken out and whether it is overdue.
S11, processing the attribute and behavior data of the financial business object and outputting a financial business processing result.
The processing result includes whether to grant credit and the amount and interest rate of the credit.
In one embodiment, processing the attribute and behavior data of the financial business object includes inputting the attribute and behavior data of the financial business object into the generated business model.
In an embodiment, as shown in fig. 2, the method for generating the business model includes:
S221, preprocessing sample data;
As shown in FIG. 3, preprocessing the sample data includes:
S2210, receiving sample data;
In one embodiment, the sample data received by the data receiving component is presented in the form of a data table, including instance name, file path, table name, and field format.
S2211, sampling the sample data to output a first data table;
Specifically, the data is sampled by random and stratified sampling: sample data is drawn randomly according to a given proportion or number, each draw is independent, and the first data table is output. In one embodiment, the data may also be sampled with replacement.
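The sampling step can be sketched as follows, assuming rows are held as a list of dictionaries. The function names, the `overdue` field used as the stratum key, and the fixed seed are hypothetical choices for illustration only.

```python
import random
from collections import defaultdict


def random_sample(rows, fraction, seed=0, replace=False):
    """Draw about `fraction` of the rows at random, optionally with replacement."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    if replace:
        return [rng.choice(rows) for _ in range(k)]  # a row may appear more than once
    return rng.sample(rows, k)  # each row appears at most once


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical sample data: 100 customers, 20 of them overdue.
rows = [{"id": i, "overdue": int(i % 5 == 0)} for i in range(100)]
first_table = stratified_sample(rows, key="overdue", fraction=0.2)
```

Stratifying on the target field keeps the share of overdue customers in the first data table the same as in the source, which matters when overdue cases are rare.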
S2212, processing the missing data in the first data table to output a second data table;
Because business data frequently contains missing values, the component fills the missing entries with values that are convenient for modeling, which improves modeling quality. Specifically, a null value or a specified value may be replaced with the preceding value, the following value, the maximum value, the minimum value, the mean value, or a user-defined value; a character-type null value or an empty string may be replaced with the preceding value, the following value, or a user-defined value.
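A minimal sketch of these replacement strategies for a single numeric column is given below. The function name `fill_missing`, the method labels, and the use of `None` to mark a missing entry are assumptions for illustration.

```python
def fill_missing(values, method="mean", custom=None):
    """Return a copy of `values` with None entries replaced according to `method`."""
    present = [v for v in values if v is not None]
    filled = list(values)
    for i, v in enumerate(filled):
        if v is not None:
            continue
        if method == "previous":
            # Preceding value; stays None if the column starts with a gap.
            filled[i] = filled[i - 1] if i > 0 else None
        elif method == "next":
            # First non-missing value after position i, if any.
            filled[i] = next((x for x in values[i + 1:] if x is not None), None)
        elif method == "max":
            filled[i] = max(present)
        elif method == "min":
            filled[i] = min(present)
        elif method == "mean":
            filled[i] = sum(present) / len(present)
        elif method == "custom":
            filled[i] = custom
        else:
            raise ValueError(f"unknown method: {method}")
    return filled


# Hypothetical income column with two missing entries.
incomes = [3000, None, 5000, None, 4000]
print(fill_missing(incomes, "mean"))
print(fill_missing(incomes, "previous"))
```

For character-type fields only the `previous`, `next`, and `custom` methods apply, which matches the restriction stated in the text.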
S2213, processing the abnormal values in the second data table to output a third data table.
Because outliers often occur in business data, the component first finds the outliers and then replaces them with values that are convenient for modeling, which improves modeling quality. In one embodiment, outliers are identified with a box plot and then filled, producing the third data table. The filling methods include mode filling, median filling, mean filling, and specified-value filling: mode filling replaces outliers with the mode of the selected data field, median filling replaces them with the median of the selected data field, mean filling replaces them with the mean of the selected data field, and specified-value filling replaces them with NA or another special value.
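The box plot rule and the filling methods can be sketched as below. The 1.5 × IQR fences are the conventional box plot rule and are an assumption here, since the patent does not state the fence multiplier; computing the fill value from the non-outlying values only is likewise an illustrative choice, and the function names are hypothetical.

```python
import statistics


def box_plot_fences(values, k=1.5):
    """Lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def fill_outliers(values, method="median", specified=None):
    """Replace values outside the box plot fences using the chosen fill method."""
    low, high = box_plot_fences(values)
    inliers = [v for v in values if low <= v <= high]
    if method == "mode":
        fill = statistics.mode(inliers)
    elif method == "median":
        fill = statistics.median(inliers)
    elif method == "mean":
        fill = statistics.mean(inliers)
    elif method == "specified":
        fill = specified  # e.g. None standing in for NA
    else:
        raise ValueError(f"unknown method: {method}")
    return [v if low <= v <= high else fill for v in values]


# Hypothetical field with one obvious outlier.
amounts = [10, 12, 11, 13, 12, 11, 200]
third_column = fill_outliers(amounts, "median")
```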
In an embodiment, pre-processing the sample data further comprises: extracting partial fields from the sample data.
S222, performing binning on the preprocessed sample data to output a binned data table;
Binning is a necessary step in building a scoring card. It smooths stored data values by consulting their neighbors (surrounding values). Bin depth means that different bins hold the same number of data points, and bin width is the value range covered by each bin. Binning methods include equal-frequency binning, equal-width binning, and chi-square binning; chi-square binning is particularly common and can be applied to both discrete and continuous data.
The basic idea of chi-square binning is to infer from sample data whether the distribution of the population differs significantly from the expected distribution, or whether two categorical variables are related or independent. The null hypothesis is typically that the observed frequency does not differ from the expected frequency, or that the two variables are independent of each other. In practice, the chi-square value is calculated on the assumption that the null hypothesis holds; it represents the degree of deviation between the observed values and the theoretical values.
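The chi-square value described above can be sketched for a small contingency table whose rows are two adjacent bins and whose columns are the two classes. The function name `chi_square` is hypothetical. The merging rule mentioned in the comment (merge the adjacent pair with the smallest statistic, as in the ChiMerge algorithm) is a common way to build chi-square bins and is an assumption, since the patent does not spell out the merging procedure.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the hypothesis that bin and class are independent.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows: two adjacent bins; columns: (not overdue, overdue) counts. Hypothetical data.
similar_bins = [[10, 10], [10, 10]]    # same class mix, good candidates to merge
different_bins = [[20, 0], [0, 20]]    # opposite class mix, should stay separate
print(chi_square(similar_bins), chi_square(different_bins))
```

A small statistic means the two bins have nearly the same class distribution, so merging them loses little information.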
Equal-frequency binning: the boundary values of the intervals are chosen so that each interval contains an approximately equal number of instances. For example, with N = 10, each interval should contain about 10% of the instances.
Equal-width binning: the range from the minimum value to the maximum value is divided into N equal parts. If A is the minimum value and B is the maximum value, the length of each interval is W = (B - A)/N, and the interval boundary values are A + W, A + 2W, …, A + (N - 1)W. Only the boundaries are considered here, so the number of instances in each part may not be equal.
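The two boundary rules can be compared with a short sketch. The function names and the toy data are assumptions; bins are taken as half-open intervals, so a value equal to a boundary falls into the bin to its right.

```python
import bisect


def equal_width_edges(values, n_bins):
    """Interior boundaries A + W, A + 2W, ..., A + (N-1)W with W = (B - A) / N."""
    a, b = min(values), max(values)
    w = (b - a) / n_bins
    return [a + k * w for k in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior boundaries chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def bin_counts(values, edges):
    """Count how many values land in each bin defined by the interior edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return counts


# Hypothetical skewed field: eleven small values and one large one.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100]
width_edges = equal_width_edges(values, 4)
freq_edges = equal_frequency_edges(values, 4)
print(width_edges, bin_counts(values, width_edges))
print(freq_edges, bin_counts(values, freq_edges))
```

On skewed data the equal-width bins are badly unbalanced, while the equal-frequency bins each hold a quarter of the instances.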
S223, calculating WoE values of each data field in the binned data table to output a WoE value data table;
WoE (Weight of Evidence) is a form of encoding for the original independent variables, and it allows a logistic regression model to be converted into the standard scorecard format. WoE reflects the contribution of each independent variable; after WoE encoding, the independent variables are standardized to some degree and are insensitive to outliers.
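The patent text does not give the WoE formula. A minimal sketch under the widely used definition, WoE_i = ln((good_i / good_total) / (bad_i / bad_total)), is shown below; some references use the inverse ratio, so the sign convention is an assumption, as are the function name and the bin counts. The sketch also assumes every bin contains at least one good and one bad instance.

```python
import math


def woe_table(bins):
    """Map each bin label to its WoE value.

    `bins` maps a bin label to a (good_count, bad_count) pair.
    """
    good_total = sum(good for good, _ in bins.values())
    bad_total = sum(bad for _, bad in bins.values())
    return {
        label: math.log((good / good_total) / (bad / bad_total))
        for label, (good, bad) in bins.items()
    }


# Hypothetical binned income field: (not overdue, overdue) counts per bin.
income_bins = {"low": (50, 50), "high": (150, 50)}
income_woe = woe_table(income_bins)
print(income_woe)
```

Under this convention a positive WoE marks a bin with a higher share of good customers than the population as a whole, and a negative WoE marks a riskier bin.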
S224, calculating the IV value of each data field in the binned data table according to the WoE value data table to output an IV value data table;
IV stands for Information Value.
The most important and direct criterion for selecting the variables that enter the model is their predictive power, and IV is one index that can be used to measure the predictive power of an independent variable.
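A minimal sketch of the IV calculation follows, using the common definition IV = Σ (good_i / good_total - bad_i / bad_total) × WoE_i with WoE_i = ln of the same two shares. The formula is not stated in the patent text, so it is an assumption here, as are the function name and the counts; every bin is assumed to contain both good and bad instances.

```python
import math


def information_value(bins):
    """IV of one field; `bins` maps a bin label to (good_count, bad_count)."""
    good_total = sum(good for good, _ in bins.values())
    bad_total = sum(bad for _, bad in bins.values())
    iv = 0.0
    for good, bad in bins.values():
        good_share = good / good_total
        bad_share = bad / bad_total
        woe = math.log(good_share / bad_share)
        iv += (good_share - bad_share) * woe  # each term is non-negative
    return iv


# Hypothetical fields: one that separates the classes and one that does not.
income_bins = {"low": (50, 50), "high": (150, 50)}
region_bins = {"north": (100, 50), "south": (100, 50)}
print(information_value(income_bins), information_value(region_bins))
```

A field whose bins all have the same good-to-bad mix gets an IV of zero, which is why IV can serve as a ranking of predictive power.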
S225, screening the data fields in the IV value data table according to the set screening threshold;
In this embodiment, the IV value is used to screen features; generally, an IV value of 0.3 or more indicates strong predictive power. Sometimes a variable that is inherently important to the business has a low IV because of sample problems. To address this, the platform provides a flexible manual feature selection function, with which a user can eliminate features with poor correlation or strong consistency according to expert experience.
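Threshold screening combined with manual selection might look like the sketch below. The function name, the `force_drop` parameter, and the IV figures are hypothetical. The `force_keep` parameter is an added assumption: the text only describes manually eliminating features, but retaining a business-critical field with a low IV is a natural counterpart.

```python
def select_fields(iv_table, threshold=0.3, force_keep=(), force_drop=()):
    """Keep fields whose IV meets the threshold, then apply manual overrides."""
    selected = {field for field, iv in iv_table.items() if iv >= threshold}
    selected |= set(force_keep)   # expert keeps a field despite a low IV (assumption)
    selected -= set(force_drop)   # expert removes a field despite a high IV
    return sorted(selected)


# Hypothetical IV value data table.
iv_table = {"age": 0.35, "income": 0.42, "region": 0.05, "occupation": 0.12}
print(select_fields(iv_table))
print(select_fields(iv_table, force_keep={"occupation"}, force_drop={"age"}))
```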
S226, outputting a scoring card model according to the screened data fields and the binned data table;
The scoring card model can be a logistic regression model, a probabilistic regression model, a decision tree, or a neural network. In this embodiment, the scoring card model is a logistic regression model, which is simple, stable, highly interpretable, technically mature, and easy to test and deploy, and is the algorithm most frequently used for scoring card models.
S227, outputting the corresponding scoring card according to the WoE value data table and the model parameters of the scoring card model.
When building a scoring card model, logistic regression is often used to model the data. However, at prediction time logistic regression returns a probability value, not a scoring card score. Therefore, in this embodiment, the corresponding scoring card is generated from the WoE value data table and the model parameters of the scoring card model.
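One common way to turn logistic regression parameters and WoE values into scoring card points is the "points to double the odds" (PDO) scaling, sketched below. The patent does not specify the conversion, so the scaling constants (a base score of 600 at good-to-bad odds of 50:1, with 20 points doubling the odds), the function names, and the sign convention (the model predicts the log-odds of a bad outcome from WoE-encoded fields) are all assumptions.

```python
import math


def scaling(base_score=600, base_odds=50, pdo=20):
    """Return (factor, offset) so that score = offset + factor * ln(good:bad odds)."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset


def build_scorecard(intercept, coefs, woe_tables, factor, offset):
    """Points per bin for a model logit(p_bad) = intercept + sum(coef * woe)."""
    card = {"base_points": offset - factor * intercept}
    for field, coef in coefs.items():
        card[field] = {b: -factor * coef * w for b, w in woe_tables[field].items()}
    return card


def score(card, applicant_bins):
    """Total score: base points plus the points of the bin the applicant falls in."""
    return card["base_points"] + sum(card[f][b] for f, b in applicant_bins.items())


# Hypothetical model parameters and WoE value data table.
factor, offset = scaling()
woe_tables = {"income": {"low": -0.5, "high": 0.4}}
card = build_scorecard(intercept=-3.0, coefs={"income": -1.0},
                       woe_tables=woe_tables, factor=factor, offset=offset)
print(score(card, {"income": "high"}), score(card, {"income": "low"}))
```

Because the score is linear in the log-odds, the total is a sum of per-field points, which is what allows the scoring card to be presented as a table that business staff can read and edit.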
As shown in fig. 4, the present invention further provides a data processing method, including:
S20, acquiring a service request of the financial service object;
S21, acquiring the attribute and behavior data of the financial business object;
The attributes of the financial business object include name, age, region, occupation, income, education level, and asset condition. The behavior data includes whether a loan has been taken out and whether it is overdue.
S22, matching a corresponding service model according to the service type of the service request, wherein the service model is generated by training with a plurality of application components;
S23, processing the attribute and behavior data of the financial service object through the corresponding application component, and outputting a financial service processing result. The processing result includes whether to grant credit and the amount and interest rate of the credit.
In an embodiment, the method further comprises configuring parameters of the application component.
Generally, different service types correspond to different service models; this embodiment takes the scoring card model as an example. As shown in FIG. 5, there may be a plurality of application components, and each application component may implement a complete function that a user can use directly after selecting the component, which makes the components convenient to use. Each application component can also be adjusted according to actual needs before use, so that different combinations of application components realize different functions, which makes operation more flexible. Input and output links are established between the required application components according to the steps for generating the financial business processing result.
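The idea of configurable components linked by their inputs and outputs can be sketched as a simple chain. The `Component` and `Pipeline` classes and the two example steps are hypothetical and only illustrate the linking; they are not the platform's actual interfaces.

```python
class Component:
    """A named processing step with configurable parameters (hypothetical design)."""

    def __init__(self, name, func, **params):
        self.name = name
        self.func = func
        self.params = params

    def run(self, table):
        return self.func(table, **self.params)


class Pipeline:
    """Links components so that each one's output table is the next one's input."""

    def __init__(self, components):
        self.components = list(components)

    def run(self, table):
        for component in self.components:
            table = component.run(table)
        return table


def drop_missing(table, field):
    """Example step: remove rows where `field` is missing."""
    return [row for row in table if row.get(field) is not None]


def cap_values(table, field, upper):
    """Example step: cap `field` at `upper`."""
    return [{**row, field: min(row[field], upper)} for row in table]


pipeline = Pipeline([
    Component("missing_values", drop_missing, field="income"),
    Component("outliers", cap_values, field="income", upper=10000),
])
sample = [{"income": 3000}, {"income": None}, {"income": 250000}]
result = pipeline.run(sample)
```

Reordering or swapping entries in the component list changes the behavior without touching the code of any component, which is the flexibility the text describes.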
In one embodiment, the generation of the score card is described as a specific embodiment.
A scoring card: the credit scoring card is one of the most common financial wind control means, and is used for scoring the credit of a client by using a certain credit scoring model according to various attributes and behavior data of the client, and accordingly determining whether to give credit or not and the amount and interest rate of the credit so as to identify and reduce transaction risks in financial transactions.
The application components may specifically include a data preprocessing component 110, a data binning component 111, a WoE value calculation component 112, an IV value calculation component 113, a feature selection component 114, a model generation component 115, and a scorecard generation component 116.
The data preprocessing component 110 is configured to preprocess sample data; the preprocessing of the sample data specifically refers to processing the sample data into data meeting requirements.
As shown in fig. 6, the data preprocessing component includes:
a data receiving component 1110 for receiving sample data;
in one embodiment, the sample data received by the data receiving component is presented in the form of a data table, including instance name, file path, table name, field format.
A data sampling component 1111, configured to sample the sample data to output a first data table;
Specifically, the data may be sampled randomly or by stratification: the sample data is sampled at random according to a given proportion or number, each draw being independent, and a first data table is finally output. In one embodiment, the data may also be sampled with replacement.
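As an illustration only, the sampling modes described above might be sketched in Python as follows. The function names, the `overdue` field used as the stratification key and the fixed seed are assumptions made for this sketch and do not appear in the embodiment.

```python
import random
from collections import defaultdict

def random_sample(rows, fraction, seed=0):
    """Simple random sample without replacement, by proportion."""
    rng = random.Random(seed)
    return rng.sample(rows, round(len(rows) * fraction))

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same proportion independently inside each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample

def sample_with_replacement(rows, n, seed=0):
    """Draw n rows; the same row may be drawn more than once."""
    rng = random.Random(seed)
    return rng.choices(rows, k=n)

# Hypothetical sample data: 100 rows, 10 of which are overdue.
rows = [{"id": i, "overdue": int(i % 10 == 0)} for i in range(100)]
first_table = stratified_sample(rows, key=lambda r: r["overdue"], fraction=0.2)
```

Stratifying on the label keeps the share of overdue rows in the first data table equal to that in the sample data, which plain random sampling only achieves on average.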
A missing value processing component 1112, configured to process missing data in the first data table to output a second data table;
Because values are frequently missing in business data, this component fills the missing positions with data that is convenient for modeling, so as to improve modeling quality. Specifically, a null value or a specified value may be replaced with the preceding value, the following value, the maximum value, the minimum value, the mean value, or a user-defined value; a character-type null value or an empty character string may be replaced with the preceding value, the following value, or a user-defined value.
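The replacement strategies listed above could be sketched as below. This is a minimal sketch for a single data field held as a Python list in which `None` marks a missing value; the function name `fill_missing` and the strategy labels are assumptions of the sketch.

```python
from statistics import mean

def fill_missing(values, strategy="mean", custom=None):
    """Return a copy of `values` with None replaced according to `strategy`."""
    present = [v for v in values if v is not None]
    if strategy in ("mean", "max", "min"):
        fill = {"mean": mean, "max": max, "min": min}[strategy](present)
        return [fill if v is None else v for v in values]
    if strategy == "custom":
        return [custom if v is None else v for v in values]
    if strategy == "previous":
        out, last = [], None
        for v in values:
            last = v if v is not None else last  # carry the preceding value forward
            out.append(last)
        return out
    if strategy == "next":
        # Filling with the following value is forward filling on the reversed list.
        return fill_missing(values[::-1], "previous")[::-1]
    raise ValueError(f"unknown strategy: {strategy}")
```

A leading missing value has no preceding value, so under the `previous` strategy it stays `None` in this sketch; the embodiment does not say how that case is handled.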
An abnormal value processing component 1113, configured to process the abnormal value in the second data table to output a third data table.
Because outliers often occur in business data, this component first finds the outliers and then replaces them with data that is convenient for modeling, so as to improve modeling quality. In one embodiment, outliers are identified using a box plot, and the abnormal data in the third data table is filled. The filling methods include mode filling, median filling, mean filling and specified-value filling: mode filling replaces outliers with the mode of the selected data field, median filling with its median, mean filling with its mean, and specified-value filling with NA or another special value.
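A minimal sketch of box-plot outlier detection followed by median filling is given below. The embodiment does not state the whisker rule; the conventional 1.5 × IQR rule is assumed here, and the function names are hypothetical.

```python
from statistics import median, quantiles

def boxplot_bounds(values, k=1.5):
    """Lower and upper whisker limits: Q1 - k*IQR and Q3 + k*IQR (k=1.5 assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def fill_outliers_with_median(values, k=1.5):
    """Replace values outside the whiskers with the median of the data field."""
    low, high = boxplot_bounds(values, k)
    fill = median(values)
    return [v if low <= v <= high else fill for v in values]

# Hypothetical data field with one obvious outlier.
field = [10, 12, 11, 13, 12, 11, 300]
cleaned = fill_outliers_with_median(field)
```

Mode, mean or specified-value filling would differ only in how `fill` is computed.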
In an embodiment, the application component may further comprise a data source reading component for extracting partial fields from the sample data.
The data binning component 111 is used for performing binning processing on the preprocessed sample data to output a binning data table;
Binning is a necessary step in building a scorecard. It smooths the stored data values by considering their neighbors (surrounding values). The bin depth indicates that different bins contain the same number of data items, and the bin width indicates the value range of each bin. Binning methods include equal-frequency binning, equal-width binning and chi-square binning; chi-square binning is particularly common and can be applied to both discrete and continuous data.
The basic idea of chi-square binning is to infer from sample data whether the distribution of the population differs significantly from the expected distribution, or whether two categorical variables are related or independent. The null hypothesis is generally that the observed frequencies do not differ from the expected frequencies, or that the two variables are independent of each other. In practice, the chi-square value is calculated under the assumption that the null hypothesis holds; it represents the degree of deviation between the observed values and the theoretical values.
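To make the deviation between observed and expected frequencies concrete, the chi-square statistic of a contingency table can be computed as sketched below. Representing two adjacent bins as rows of (good count, bad count) is an assumption of this sketch; the embodiment does not describe the merging procedure itself.

```python
def chi_square(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if bin membership and good/bad label were independent.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Two adjacent bins with the same good/bad mix: no deviation from independence.
same_mix = chi_square([[10, 10], [10, 10]])
# Two adjacent bins with opposite good/bad mix: large deviation.
opposite_mix = chi_square([[20, 0], [0, 20]])
```

In a ChiMerge-style procedure, adjacent bins with a small chi-square value would be candidates for merging because their label distributions are similar.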
Equal-frequency binning: the interval boundaries are chosen so that each interval contains approximately the same number of instances. For example, with N = 10, each interval should contain about 10% of the instances.
Equal-width binning: the range from the minimum value to the maximum value is divided into N equal parts. If A is the minimum value and B is the maximum value, the width of each interval is W = (B - A)/N, and the interval boundaries are A + W, A + 2W, …, A + (N-1)W. Only the boundaries are considered here, so the number of instances in each part may differ.
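The two boundary rules above can be sketched as follows. The function names are hypothetical, and bins are treated as left-closed and right-open, which is an assumption of the sketch.

```python
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Interior boundaries A + W, A + 2W, ..., A + (N-1)W with W = (B - A) / N."""
    a, b = min(values), max(values)
    w = (b - a) / n_bins
    return [a + i * w for i in range(1, n_bins)]

def equal_frequency_edges(values, n_bins):
    """Interior boundaries chosen so each bin holds roughly len(values) / N items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def assign_bin(value, edges):
    """Index of the left-closed, right-open bin that contains `value`."""
    return bisect_right(edges, value)

width_edges = equal_width_edges([0, 10], 5)
freq_edges = equal_frequency_edges(list(range(100)), 10)
```

With heavily skewed data the equal-width bins can be nearly empty at one end, whereas the equal-frequency bins stay balanced by construction.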
A WoE value calculating component 112 for calculating WoE values for the respective data fields in the binned data table to output a WoE value data table;
WoE (Weight of Evidence) makes it possible to convert a logistic regression model into the standard scorecard format; it is a form of encoding for the original independent variables. WoE reflects the contribution of each value of an independent variable; after WoE encoding, the independent variable is standardized to some extent and is insensitive to outliers.
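The embodiment does not give the WoE formula. A commonly used definition, assumed here, is the logarithm of the ratio between a bin's share of bad users and its share of good users; the opposite sign convention (good over bad) is equally common. The sketch also assumes every bin contains at least one good and one bad user.

```python
import math

def woe_per_bin(bins):
    """bins: list of (good_count, bad_count) per bin. Returns one WoE per bin.

    Assumed convention: WoE = ln((bad_i / bad_total) / (good_i / good_total)).
    """
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    return [math.log((b / bad_total) / (g / good_total)) for g, b in bins]

# Hypothetical counts for one data field split into three bins.
woe = woe_per_bin([(50, 10), (30, 10), (20, 30)])
```

Under this convention a positive WoE marks a bin in which bad users are over-represented, which matches the direction of odds = p/(1-p) with p the probability of a bad user.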
An IV value calculating component 113 for calculating IV values of the respective data fields in the binned data table from the WoE value data table to output an IV value data table;
IV stands for Information Value.
The most important and most direct criterion for choosing which variables enter the model is their predictive power, and IV is one index that can be used to measure the predictive power of an independent variable.
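The IV formula is likewise not spelled out in the embodiment. A minimal sketch under the usual definition, IV = sum over bins of (bad share - good share) × WoE, is shown below; the WoE convention and the bin counts are assumptions carried into this sketch.

```python
import math

def information_value(bins):
    """bins: list of (good_count, bad_count) per bin for one data field."""
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    iv = 0.0
    for g, b in bins:
        good_share = g / good_total
        bad_share = b / bad_total
        woe = math.log(bad_share / good_share)   # assumed WoE convention
        iv += (bad_share - good_share) * woe     # each term is non-negative
    return iv

# Hypothetical counts for one data field split into three bins.
iv = information_value([(50, 10), (30, 10), (20, 30)])
# A field whose bins all have the same good/bad mix carries no information.
iv_flat = information_value([(10, 5), (20, 10)])
```

Because each term multiplies a difference by the logarithm of the corresponding ratio, the result does not depend on which of the two WoE sign conventions is used.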
A feature selection component 114 for screening data fields in the IV value data table according to a set screening threshold;
In this embodiment, features are screened by IV value; generally, a feature with an IV value of 0.3 or more has strong predictive power. Sometimes a variable that is inherently important to the business has a low IV because of sample problems. To address this, the platform provides a flexible manual feature selection function, with which a user can eliminate some features with poor correlation or strong consistency according to expert experience.
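Threshold screening combined with manual overrides might look like the sketch below. The function name, the `force_keep` and `force_drop` parameters and the IV values are hypothetical; 0.3 is used as the default threshold only because the text names it as the level of strong predictive power.

```python
def select_features(iv_by_field, threshold=0.3, force_keep=(), force_drop=()):
    """Keep fields whose IV reaches the threshold, then apply manual overrides."""
    kept = {field for field, iv in iv_by_field.items() if iv >= threshold}
    kept |= set(force_keep)   # business-critical fields with a low IV
    kept -= set(force_drop)   # fields removed according to expert experience
    return sorted(kept)

# Hypothetical IV value data table.
iv_table = {"zhima_score": 0.45, "step_number": 0.31, "age": 0.05}
selected = select_features(iv_table, force_keep=["age"])
```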
The model generation component 115 is used for outputting a scorecard model according to the screened data fields and the binned data table. The scorecard model may be a logistic regression model, a probability regression model, a decision tree or a neural network. In this embodiment, the logistic regression model is chosen; it is simple, stable, highly interpretable, technically mature, and easy to monitor and deploy, and it is the algorithm most frequently used for scorecard models.
When building scorecard models, logistic regression is often used to model the data. However, when used for prediction, logistic regression returns a probability value rather than a scorecard score. The scorecard generating component 116 is therefore used to generate the corresponding scorecard from the WoE value data table and the model parameters of the scorecard model.
In one embodiment, the conversion into a scorecard is described in detail.
Scorecard definition
Let the probability that a user is bad be: P(Y=1|x)=p
The probability that the user is good is then: P(Y=0|x)=1-p
The ratio of bad users to good users (bad users in the numerator) can be calculated; it is called the odds:
odds={p}/{1-p}
The score scale of the scorecard can be defined by expressing the score as a linear function of the logarithm of the odds:
score=A+B*ln(odds)
where A and B are constants.
Scoring card conversion
The conversion steps are as follows:
Set the score to p_{0} when odds=Theta_{0}.
Set PDO (points to double the odds) as the number of points by which the score increases each time the odds double.
Substituting the score p_{0} at odds=Theta_{0} and the score p_{0}+PDO at odds=2*Theta_{0} into the score equation gives:
p_{0}=A+B*ln(Theta_{0})
p_{0}+PDO=A+B*ln(2*Theta_{0})
The values of A and B can then be calculated:
B={PDO}/{ln(2)}
A=p_{0}-B*ln(Theta_{0})
Typically, the score is rounded to the nearest integer to simplify the presentation and interpretation of the scorecard. Rounding makes the score approximate, but the effect is small and negligible.
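The conversion can be checked numerically with the sketch below. The function names are hypothetical, and Theta_{0} = 1.0 is an illustrative assumption, since the embodiment gives only the reference score (500) and PDO (20).

```python
import math

def scorecard_constants(p0, theta0, pdo):
    """Solve p0 = A + B*ln(theta0) and p0 + PDO = A + B*ln(2*theta0) for A and B."""
    b = pdo / math.log(2)
    a = p0 - b * math.log(theta0)
    return a, b

def score(odds, a, b):
    """score = A + B*ln(odds), rounded to the nearest integer."""
    return round(a + b * math.log(odds))

# Reference score and PDO from the embodiment; theta0 is an assumed value.
A, B = scorecard_constants(p0=500, theta0=1.0, pdo=20)
```

Each doubling of the odds then moves the score by exactly PDO points: `score(1.0, A, B)`, `score(2.0, A, B)` and `score(4.0, A, B)` give 500, 520 and 540.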
To make the scorecard easier for business personnel to use, it can be presented in more detail, showing how the different values of each variable affect the scorecard result.
It is known that
p(Y=1|x)={e^{Theta x}}/{1+e^{Theta x}}
p(Y=0|x)={1}/{1+e^{Theta x}}
odds={p(Y=1|x)}/{p(Y=0|x)}=e^{Theta x}
then, the score card can be expressed as:
score=A+B*ln(odds)
score=A+B*ln(e^{Theta*x})
score=A+B*sum{Theta_{i}*x_{i}}
where Theta_{i}*x_{i}=(Theta_{i}*w_{i1})*delta_{i1}+(Theta_{i}*w_{i2})*delta_{i2}+...+(Theta_{i}*w_{im})*delta_{im},
m is the number of values taken by x_{i} after binning;
w_{im} is the WoE value corresponding to the m-th value of the variable x_{i};
delta_{im} is a binary variable that equals 1 if x_{i} takes the m-th value after binning and 0 otherwise.
In this embodiment, the reference score is 500, and PDO is 20.
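The decomposition into per-bin points can be sketched as follows. Since exactly one delta_{im} equals 1 for each variable, the term B*Theta_{i}*w_{im} is the number of points contributed by the bin that x_{i} falls into. The coefficients and WoE values below are hypothetical, and A = 500 corresponds to the assumption Theta_{0} = 1.

```python
import math

PDO = 20                  # from the embodiment
B = PDO / math.log(2)     # B = PDO / ln(2)

def bin_points(theta_i, woe_by_bin):
    """Points contributed by each bin of one variable: B * Theta_i * w_im."""
    return [B * theta_i * w for w in woe_by_bin]

def total_score(a, thetas, woe_tables, chosen_bins):
    """score = A + sum over variables of the points of the bin each one falls in."""
    total = a
    for theta_i, woe_by_bin, m in zip(thetas, woe_tables, chosen_bins):
        total += bin_points(theta_i, woe_by_bin)[m]
    return round(total)

# Hypothetical model: two variables with three and two bins respectively.
thetas = [1.0, 0.5]
woe_tables = [[-0.4, 0.0, 0.6], [0.2, -0.3]]
example = total_score(500, thetas, woe_tables, chosen_bins=[2, 0])
```

Tabulating `bin_points` for every variable yields one row per bin, which is the layout a scorecard table presents to business personnel.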
The final scorecard is shown in Table 1.
TABLE 1
[Table 1 is provided as an image in the original publication.]
Generally, after the scorecard model is built, its effect needs to be evaluated, so the application components further comprise an evaluation component 117 for evaluating the scorecard model. During testing, test data is input into the scorecard model, and the effect of the model is evaluated through the output indexes. The evaluation indexes may include the AUC value, KS value, Accuracy value, Precision value, Recall value, ROC curve, KS curve and PR curve.
AUC (Area Under the Curve) is the probability that, when a sample A is drawn at random from all positive examples and a sample B is drawn at random from all negative examples, the classifier assigns A a higher probability of being positive than B. When the ROC curve is drawn, all samples are first sorted by the classifier's predicted probability, so AUC reflects the classifier's ability to rank samples: the larger the AUC, the better the ranking, that is, the more positive examples are ranked ahead of negative examples. A larger AUC indicates a more accurate algorithm and model; a value above 0.7 is generally sufficient for the model to go online.
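The probabilistic definition above translates directly into a pairwise count, sketched below. The function name is hypothetical, and counting ties as half a win is a common convention assumed here.

```python
def auc_pairwise(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half (assumed convention)
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities.
auc = auc_pairwise([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Here three of the four positive-negative pairs are ordered correctly, so the result is 0.75. The pairwise loop is quadratic in the number of samples and is meant only to mirror the definition.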
The KS value is the maximum distance between the two lines in the KS plot and reflects the discriminating ability of the classifier. A larger KS indicates a more accurate algorithm and model; a value above 0.7 is generally sufficient for the model to go online.
Accuracy refers to the ratio of the number of correctly predicted samples to the total number of predicted samples, regardless of whether the predicted samples are positive or negative examples.
Precision refers to the ratio of the number of correctly predicted positive samples to the number of all predicted positive samples, i.e., how many of all predicted positive samples are true positive samples. Precision only focuses on the part predicted as positive samples, while Accuracy considers all samples.
Recall refers to the ratio of the number of correctly predicted positive samples to the total number of true positive samples, i.e., how many of the actual positive samples are correctly found.
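The three ratios defined above can be computed from the confusion counts, as in the following sketch. The function names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)   # correct predictions over all samples

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)                    # true positives over predicted positives

def recall(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)                    # true positives over actual positives

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
```

For these labels the counts are tp = 2, fp = 1, tn = 4, fn = 1, giving an accuracy of 0.75 and a precision and recall of 2/3 each.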
ROC curve (Receiver Operating Characteristic): the ROC curve is commonly used to compare models in binary classification problems and expresses the trade-off between the true positive rate (TPR) and the false positive rate (FPR). It is drawn by plotting TPR on the vertical axis against FPR on the horizontal axis under different classification thresholds. The ROC curve can be viewed as a "contest" between the positive and negative examples among all samples as the threshold moves. The closer the curve is to the upper left corner, the more positive examples are ranked ahead of negative examples, and the better the overall performance of the model.
KS curve (Kolmogorov-Smirnov): this index measures the difference between the cumulative distributions of good and bad samples. The greater the difference between the cumulative distributions of good and bad samples, the greater the KS index and the stronger the risk discrimination ability of the model.
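A minimal sketch of the KS statistic as the largest gap between the two cumulative distributions is given below. Label 1 is taken to mean a bad sample, which is an assumption of the sketch, as is the function name.

```python
def ks_statistic(labels, scores):
    """Maximum gap between the cumulative shares of bad (1) and good (0) samples."""
    pairs = sorted(zip(scores, labels))
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    cum_bad = cum_good = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume all samples that share the same score before comparing.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return best

# Hypothetical labels and model outputs.
ks = ks_statistic([0, 0, 1, 0, 1, 1], [0.1, 0.2, 0.3, 0.4, 0.8, 0.9])
```

Plotting the two cumulative shares against the score gives the KS curve; the statistic is the largest vertical distance between them.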
PR curve (Precision-Recall): the PR curve plots Precision against Recall. Like the ROC curve, it uses TPR (Recall), and the effect of the classifier can likewise be measured by the area under the curve (AUC). The difference is that the ROC curve uses FPR whereas the PR curve uses Precision, so both indexes of the PR curve focus on the positive examples. Because the positive examples are the main concern in class-imbalanced problems, the PR curve is widely considered superior to the ROC curve in that case.
In one embodiment, the application components further comprise an export component 118 for outputting the scorecard model and its evaluation indexes, including but not limited to the aforementioned indexes (AUC, KS, Accuracy, Recall) and visual reports (ROC, KS, P-R); the output may further include the scorecard scale and the scorecard details.
As shown in fig. 7, the present invention also provides a data processing apparatus, including:
a service request obtaining module 10, configured to obtain a service request of a financial service object;
the data acquisition module 11 is used for acquiring the attribute and behavior data of the financial business object;
the model matching module 12 is configured to match a corresponding service model according to the service type of the service request, where the service model is generated by training a plurality of application components;
and the result output module 13 is used for processing the attribute and the behavior data of the financial service object through the corresponding service model and outputting a financial service processing result.
The attributes of the financial business object include name, age, region, occupation, income, education level and asset status. The behavior data include whether a loan has been taken out and whether a payment has been overdue. The processing result includes whether to grant credit and the amount and interest rate of the credit.
In an embodiment, the apparatus further includes a parameter configuration module configured to configure a parameter of the application component.
In this embodiment, the business model is generated by a business model generating component, which may include a plurality of application components. Each application component may implement a complete function, so that a user can use it directly once it is selected, which improves convenience. Each application component can also be adjusted according to actual needs before use, so that different combinations of application components realize different functions, which improves flexibility. Input and output links are established between the required application components according to the steps for generating the financial business processing result.
The generation of a scorecard is described below as a specific embodiment.
Scorecard: the credit scorecard is one of the most common financial risk-control tools. It scores a client's credit with a credit scoring model based on the client's attributes and behavior data, and the score is used to decide whether to grant credit and at what amount and interest rate, so as to identify and reduce transaction risk in financial transactions.
The application components may specifically include a data preprocessing component 110, a data binning component 111, a WoE value calculation component 112, an IV value calculation component 113, a feature selection component 114, a model generation component 115, and a scorecard generation component 116.
The data preprocessing component 110 is configured to preprocess sample data; the preprocessing of the sample data specifically refers to processing the sample data into data meeting requirements.
The data preprocessing component comprises:
a data receiving component 1110 for receiving sample data;
in one embodiment, the sample data received by the data receiving component is presented in the form of a data table, including instance name, file path, table name, field format.
A data sampling component 1111, configured to sample the sample data to output a first data table;
Specifically, the data may be sampled randomly or by stratification: the sample data is sampled at random according to a given proportion or number, each draw being independent, and a first data table is finally output. In one embodiment, the data may also be sampled with replacement.
A missing value processing component 1112, configured to process missing data in the first data table to output a second data table;
Because values are frequently missing in business data, this component fills the missing positions with data that is convenient for modeling, so as to improve modeling quality. Specifically, a null value or a specified value may be replaced with the preceding value, the following value, the maximum value, the minimum value, the mean value, or a user-defined value; a character-type null value or an empty character string may be replaced with the preceding value, the following value, or a user-defined value.
An abnormal value processing component 1113, configured to process the abnormal value in the second data table to output a third data table.
Because outliers often occur in business data, this component first finds the outliers and then replaces them with data that is convenient for modeling, so as to improve modeling quality. In one embodiment, outliers are identified using a box plot, and the abnormal data in the third data table is filled. The filling methods include mode filling, median filling, mean filling and specified-value filling: mode filling replaces outliers with the mode of the selected data field, median filling with its median, mean filling with its mean, and specified-value filling with NA or another special value.
In an embodiment, the application component may further comprise a data source reading component for extracting partial fields from the sample data.
The data binning component 111 is used for performing binning processing on the preprocessed sample data to output a binning data table;
Binning is a necessary step in building a scorecard. It smooths the stored data values by considering their neighbors (surrounding values). The bin depth indicates that different bins contain the same number of data items, and the bin width indicates the value range of each bin. Binning methods include equal-frequency binning, equal-width binning and chi-square binning; chi-square binning is particularly common and can be applied to both discrete and continuous data.
The basic idea of chi-square binning is to infer from sample data whether the distribution of the population differs significantly from the expected distribution, or whether two categorical variables are related or independent. The null hypothesis is generally that the observed frequencies do not differ from the expected frequencies, or that the two variables are independent of each other. In practice, the chi-square value is calculated under the assumption that the null hypothesis holds; it represents the degree of deviation between the observed values and the theoretical values.
Equal-frequency binning: the interval boundaries are chosen so that each interval contains approximately the same number of instances. For example, with N = 10, each interval should contain about 10% of the instances.
Equal-width binning: the range from the minimum value to the maximum value is divided into N equal parts. If A is the minimum value and B is the maximum value, the width of each interval is W = (B - A)/N, and the interval boundaries are A + W, A + 2W, …, A + (N-1)W. Only the boundaries are considered here, so the number of instances in each part may differ.
A WoE value calculating component 112 for calculating WoE values for the respective data fields in the binned data table to output a WoE value data table;
WoE (Weight of Evidence) makes it possible to convert a logistic regression model into the standard scorecard format; it is a form of encoding for the original independent variables. WoE reflects the contribution of each value of an independent variable; after WoE encoding, the independent variable is standardized to some extent and is insensitive to outliers.
An IV value calculating component 113 for calculating IV values of the respective data fields in the binned data table from the WoE value data table to output an IV value data table;
IV stands for Information Value.
The most important and most direct criterion for choosing which variables enter the model is their predictive power, and IV is one index that can be used to measure the predictive power of an independent variable.
A feature selection component 114 for screening data fields in the IV value data table according to a set screening threshold;
In this embodiment, features are screened by IV value; generally, a feature with an IV value of 0.3 or more has strong predictive power. Sometimes a variable that is inherently important to the business has a low IV because of sample problems. To address this, the platform provides a flexible manual feature selection function, with which a user can eliminate some features with poor correlation or strong consistency according to expert experience.
The model generation component 115 is used for outputting a scorecard model according to the screened data fields and the binned data table. The scorecard model may be a logistic regression model, a probability regression model, a decision tree or a neural network. In this embodiment, the logistic regression model is chosen; it is simple, stable, highly interpretable, technically mature, and easy to monitor and deploy, and it is the algorithm most frequently used for scorecard models.
When building scorecard models, logistic regression is often used to model the data. However, when used for prediction, logistic regression returns a probability value rather than a scorecard score. The scorecard generating component 116 is therefore used to generate the corresponding scorecard from the WoE value data table and the model parameters of the scorecard model.
In one embodiment, the conversion into a scorecard is described in detail.
Scorecard definition
Let the probability that a user is bad be: P(Y=1|x)=p
The probability that the user is good is then: P(Y=0|x)=1-p
The ratio of bad users to good users (bad users in the numerator) can be calculated; it is called the odds:
odds={p}/{1-p}
The score scale of the scorecard can be defined by expressing the score as a linear function of the logarithm of the odds:
score=A+B*ln(odds)
where A and B are constants.
Scoring card conversion
The conversion steps are as follows:
Set the score to p_{0} when odds=Theta_{0}.
Set PDO (points to double the odds) as the number of points by which the score increases each time the odds double.
Substituting the score p_{0} at odds=Theta_{0} and the score p_{0}+PDO at odds=2*Theta_{0} into the score equation gives:
p_{0}=A+B*ln(Theta_{0})
p_{0}+PDO=A+B*ln(2*Theta_{0})
The values of A and B can then be calculated:
B={PDO}/{ln(2)}
A=p_{0}-B*ln(Theta_{0})
Typically, the score is rounded to the nearest integer to simplify the presentation and interpretation of the scorecard. Rounding makes the score approximate, but the effect is small and negligible.
To make the scorecard easier for business personnel to use, it can be presented in more detail, showing how the different values of each variable affect the scorecard result.
It is known that
p(Y=1|x)={e^{Theta x}}/{1+e^{Theta x}}
p(Y=0|x)={1}/{1+e^{Theta x}}
odds={p(Y=1|x)}/{p(Y=0|x)}=e^{Theta x}
then, the score card can be expressed as:
score=A+B*ln(odds)
score=A+B*ln(e^{Theta*x})
score=A+B*sum{Theta_{i}*x_{i}}
where Theta_{i}*x_{i}=(Theta_{i}*w_{i1})*delta_{i1}+(Theta_{i}*w_{i2})*delta_{i2}+...+(Theta_{i}*w_{im})*delta_{im},
m is the number of values taken by x_{i} after binning;
w_{im} is the WoE value corresponding to the m-th value of the variable x_{i};
delta_{im} is a binary variable that equals 1 if x_{i} takes the m-th value after binning and 0 otherwise.
In this embodiment, the reference score is 500, and PDO is 20.
The final scorecard is shown in Table 2.
TABLE 2
Serial number   Variable      Interval (left-closed, right-open)   Score
1               zhima_score   -inf, 603                            16
2               zhima_score   603, 611                             17
3               zhima_score   611, 615                             18
4               zhima_score   615, 635                             19
5               zhima_score   635+                                 21
6               step_number   -inf, -1                             13
7               step_number   -1, 3.6667                           17
8               step_number   3.6667, 1944                         16
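As an illustration, the rows of Table 2 listed above can be applied to a client by looking up the interval that each variable falls into. The dictionary layout and function names are hypothetical, and adding up the per-variable scores is an assumption based on the additive form score=A+B*sum{Theta_{i}*x_{i}}.

```python
import math

# Rows of Table 2 as (lower bound, upper bound, score); intervals are
# left-closed and right-open, as stated in the table header.
SCORECARD = {
    "zhima_score": [(-math.inf, 603, 16), (603, 611, 17), (611, 615, 18),
                    (615, 635, 19), (635, math.inf, 21)],
    "step_number": [(-math.inf, -1, 13), (-1, 3.6667, 17), (3.6667, 1944, 16)],
}

def variable_score(variable, value):
    """Score of the interval that contains `value` for the given variable."""
    for low, high, points in SCORECARD[variable]:
        if low <= value < high:
            return points
    raise ValueError(f"{variable}={value} is not covered by the listed rows")

def client_score(attributes):
    """Sum of the interval scores over all variables of one client."""
    return sum(variable_score(name, value) for name, value in attributes.items())

# Hypothetical client.
total = client_score({"zhima_score": 610, "step_number": 2})
```

For this client both variables fall into intervals worth 17 points, so the total is 34. Values of `step_number` at or above 1944 are not covered by the rows listed above, so the sketch raises an error for them rather than guessing a score.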
Generally, after the scorecard model is built, its effect needs to be evaluated, so the application components further comprise an evaluation component 117 for evaluating the scorecard model. During testing, test data is input into the scorecard model, and the effect of the model is evaluated through the output indexes. The evaluation indexes may include the AUC value, KS value, Accuracy value, Precision value, Recall value, ROC curve, KS curve and PR curve.
AUC (Area Under the Curve) is the probability that, when a sample A is drawn at random from all positive examples and a sample B is drawn at random from all negative examples, the classifier assigns A a higher probability of being positive than B. When the ROC curve is drawn, all samples are first sorted by the classifier's predicted probability, so AUC reflects the classifier's ability to rank samples: the larger the AUC, the better the ranking, that is, the more positive examples are ranked ahead of negative examples. A larger AUC indicates a more accurate algorithm and model; a value above 0.7 is generally sufficient for the model to go online.
The KS value is the maximum distance between the two lines in the KS plot and reflects the discriminating ability of the classifier. A larger KS indicates a more accurate algorithm and model; a value above 0.7 is generally sufficient for the model to go online.
Accuracy refers to the ratio of the number of correctly predicted samples to the total number of predicted samples, regardless of whether the predicted samples are positive or negative examples.
Precision refers to the ratio of the number of correctly predicted positive samples to the number of all predicted positive samples, i.e., how many of all predicted positive samples are true positive samples. Precision only focuses on the part predicted as positive samples, while Accuracy considers all samples.
Recall refers to the ratio of the number of correctly predicted positive samples to the total number of true positive samples, i.e., how many of the actual positive samples are correctly found.
ROC curve (Receiver Operating Characteristic): the ROC curve is commonly used to compare models in binary classification problems and expresses the trade-off between the true positive rate (TPR) and the false positive rate (FPR). It is drawn by plotting TPR on the vertical axis against FPR on the horizontal axis under different classification thresholds. The ROC curve can be viewed as a "contest" between the positive and negative examples among all samples as the threshold moves. The closer the curve is to the upper left corner, the more positive examples are ranked ahead of negative examples, and the better the overall performance of the model.
KS curve (Kolmogorov-Smirnov): this index measures the difference between the cumulative distributions of good and bad samples. The greater the difference between the cumulative distributions of good and bad samples, the greater the KS index and the stronger the risk discrimination ability of the model.
PR curve (Precision-Recall): the PR curve plots Precision against Recall. Like the ROC curve, it uses TPR (Recall), and the effect of the classifier can likewise be measured by the area under the curve (AUC). The difference is that the ROC curve uses FPR whereas the PR curve uses Precision, so both indexes of the PR curve focus on the positive examples. Because the positive examples are the main concern in class-imbalanced problems, the PR curve is widely considered superior to the ROC curve in that case.
In one embodiment, the application components further comprise an export component 118 for outputting the scorecard model and its evaluation indexes, including but not limited to the aforementioned indexes (AUC, KS, Accuracy, Recall) and visual reports (ROC, KS, P-R); the output may further include the scorecard scale and the scorecard details.
An embodiment of the present application further provides a device, which may include: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the method of fig. 1. In practical applications, the device may serve as a terminal device or as a server. Examples of the terminal device include a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and the like.
The present embodiment also provides a non-volatile readable storage medium in which one or more modules (programs) are stored; when the one or more modules are applied to a device, the device is caused to execute the instructions of the steps included in the data processing method of fig. 1 according to the present embodiment.
Fig. 8 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes modules for executing the functions of the modules of the data processing apparatus described above; for the specific functions and technical effects, refer to the foregoing embodiments, which are not repeated here.
Fig. 9 is a schematic hardware structure diagram of a terminal device according to an embodiment of the present application. FIG. 9 is a specific embodiment of the implementation of FIG. 8. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 4 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, the second processor 1201 is provided in the processing component 1200. The terminal device may further include: a communication component 1203, a power component 1204, a multimedia component 1205, a voice component 1206, an input/output interface 1207, and/or a sensor component 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing component 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the data processing method described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power component 1204 provides power to the various components of the terminal device. The power component 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia component 1205 includes a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received voice signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the voice component 1206 further comprises a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, the relative positioning of its components, and the presence or absence of user contact with the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate wired or wireless communication between the terminal device and other devices. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device can log onto a GPRS network and establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 involved in the embodiment of fig. 9 can be implemented as the input device in the embodiment of fig. 8.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (40)

1. A method of data processing, the method comprising:
acquiring the attribute and behavior data of the financial business object;
and processing the attribute and the behavior data of the financial business object and outputting a financial business processing result.
2. The data processing method of claim 1, wherein processing the attribute and behavior data of the financial business object comprises inputting the attribute and behavior data of the financial business object into a generated business model.
3. The data processing method of claim 1, wherein the attributes of the financial business object include name, age, locale, occupation, income, education level, and asset condition.
4. The data processing method of claim 1, wherein the behavior data includes whether a loan has occurred and whether it is overdue.
5. The data processing method according to claim 1, wherein the processing result includes whether or not to give credit, and the amount and interest rate of the credit.
6. The data processing method of claim 2, wherein the method for generating the business model comprises:
preprocessing sample data;
performing box separation processing on the preprocessed sample data to output a box separation data table;
calculating WoE values for each data field in the binned data table to output a WoE value data table;
calculating the IV value of each data field in the box data table according to the WoE value data table to output an IV value data table;
screening the data fields in the IV value data table according to a set screening threshold value;
outputting a scoring card model according to the screened data fields and the box data table;
and outputting corresponding scoring cards according to the WoE value data tables and the model parameters of the scoring card models.
7. The data processing method of claim 6, wherein the method for generating the business model further comprises:
evaluating the scoring card model;
and outputting the rating card model and the evaluation indexes of the rating card model.
8. The data processing method of claim 6, wherein the pre-processing the sample data comprises:
receiving sample data;
sampling the sample data to output a first data table;
processing missing data in the first data table to output a second data table;
and processing the abnormal value in the second data table to output a third data table.
9. A method of data processing, the method comprising:
acquiring a service request of a financial service object;
acquiring the attribute and behavior data of the financial business object;
matching a corresponding service model according to the service type of the service request, wherein the service model is generated by training of a plurality of application components;
and processing the attribute and behavior data of the financial service object through the service model, and outputting a financial service processing result.
10. A data processing method according to claim 9, characterized in that the method further comprises configuring parameters of said application components.
11. The data processing method of claim 9, wherein the attributes of the financial business object include name, age, location, occupation, income, education level, and asset condition.
12. The data processing method of claim 9, wherein the behavior data includes whether a loan has occurred and whether it is overdue.
13. The data processing method according to claim 9, wherein the processing result includes whether or not to give credit, and the amount and interest rate of the credit.
14. The data processing method of claim 9, wherein the application component comprises:
the data preprocessing component is used for preprocessing the sample data;
the data binning component is used for binning the preprocessed sample data to output a binning data table;
a WoE value calculation component for calculating WoE values for respective data fields in the binned data table to output a WoE value data table;
an IV value calculating component for calculating the IV value of each data field in the bin data table according to the WoE value data table to output an IV value data table;
the characteristic selection component is used for screening the data fields in the IV value data table according to a set screening threshold value;
the model generation component is used for outputting a scoring card model according to the screened data fields and the box data table;
and the scoring card generating component is used for outputting the corresponding scoring card according to the WoE value data table and the model parameters of the scoring card model.
15. The data processing method of claim 14, wherein the application component further comprises:
an evaluation component for evaluating the scoring card model;
and the derivation component is used for outputting the rating card model and the evaluation indexes of the rating card model.
16. The data processing method of claim 14, wherein the data pre-processing component comprises:
a data receiving component for receiving sample data;
the data sampling component is used for sampling the sample data to output a first data table;
the missing value processing component is used for processing the missing data in the first data table to output a second data table;
and the abnormal value processing component is used for processing the abnormal value in the second data table to output a third data table.
17. The data processing method of claim 16, wherein processing the missing data comprises replacing the missing value with one of a null value or a specified value: pre-value, post-value, maximum value, minimum value, mean value, a self-defined value.
18. The data processing method of claim 16, wherein processing the outliers comprises filling the outliers, and wherein the filling comprises mode filling, median filling, mean filling, and specified value filling.
19. The data processing method of claim 14, wherein the binning process comprises equal frequency binning, equal width binning, chi-square binning.
20. The data processing method of claim 14, wherein the scoring card model is a logistic regression model, a probabilistic regression model, a decision tree, a neural network.
21. The data processing method of claim 14, wherein the scorecard generation component converts the output value of the scorecard model into a scorecard score.
22. The data processing method of claim 15, wherein the evaluation index for evaluating the effect of the scorecard model comprises an AUC value, a KS value, an Accuracy value, a Precision value, a Recall value, a ROC curve, a KS curve, and a PR curve.
23. The data processing method of claim 15, wherein the evaluation index of the outputted score card model includes an AUC value, a KS value, an Accuracy value, a Precision value, a Recall value, a ROC curve, a KS curve, a PR curve, a score card scale, and a score card detail.
24. A data processing apparatus, characterized in that the apparatus comprises:
the service request acquisition module is used for acquiring a service request of a financial service object;
the data acquisition module is used for acquiring the attribute and behavior data of the financial business object;
the model matching module is used for matching a corresponding service model according to the service type of the service request, and the service model is generated by training a plurality of application components;
and the result output module is used for processing the attribute and the behavior data of the financial business object through the corresponding business model and outputting a financial business processing result.
25. The data processing apparatus of claim 24, further comprising a parameter configuration module configured to configure parameters of the application components.
26. The data processing apparatus of claim 24, wherein the attributes of the financial business object include name, age, locale, occupation, income, education level, and asset condition.
27. The data processing apparatus of claim 24, wherein the behavior data comprises whether a loan has occurred and whether it is overdue.
28. The data processing apparatus of claim 24, wherein the processing result includes whether or not to give credit and the amount and interest rate of the credit.
29. The data processing apparatus of claim 24, wherein the business model is generated by a business model generation component comprising:
the data preprocessing component is used for preprocessing the sample data;
the data binning component is used for binning the preprocessed sample data to output a binning data table;
a WoE value calculation component for calculating WoE values for respective data fields in the binned data table to output a WoE value data table;
an IV value calculating component for calculating the IV value of each data field in the bin data table according to the WoE value data table to output an IV value data table;
the characteristic selection component is used for screening the data fields in the IV value data table according to a set screening threshold value;
the model generation component is used for outputting a scoring card model according to the screened data fields and the box data table;
and the scoring card generating component is used for outputting the corresponding scoring card according to the WoE value data table and the model parameters of the scoring card model.
30. The data processing apparatus of claim 29, wherein the business model generation component further comprises:
an evaluation component for evaluating the scoring card model;
and the derivation component is used for outputting the rating card model and the evaluation indexes of the rating card model.
31. The data processing apparatus of claim 28, wherein the data pre-processing component comprises:
a data receiving component for receiving sample data;
the data sampling component is used for sampling the sample data to output a first data table;
the missing value processing component is used for processing the missing data in the first data table to output a second data table;
and the abnormal value processing component is used for processing the abnormal value in the second data table to output a third data table.
32. The data processing apparatus of claim 31, wherein processing missing data comprises replacing missing values by one of a null value or a specified value: pre-value, post-value, maximum value, minimum value, mean value, a self-defined value.
33. The data processing apparatus of claim 31, wherein processing the outliers comprises filling the outliers, and wherein the filling method comprises mode filling, median filling, mean filling, and specified value filling.
34. The data processing apparatus of claim 29, wherein the binning process comprises equal frequency binning, equal width binning, chi-square binning.
35. The data processing apparatus of claim 29, wherein the scoring card model is a logistic regression model, a probabilistic regression model, a decision tree, a neural network.
36. The data processing apparatus of claim 29, wherein the scorecard generation component converts the output value of the scorecard model into a scorecard score.
37. The data processing apparatus of claim 30, wherein the evaluation index for evaluating the effect of the scorecard model comprises an AUC value, a KS value, an Accuracy value, a Precision value, a Recall value, a ROC curve, a KS curve, a PR curve.
38. The data processing apparatus of claim 30, wherein the evaluation index of the outputted scorecard model includes an AUC value, a KS value, an Accuracy value, a Precision value, a Recall value, a ROC curve, a KS curve, a PR curve, a scorecard scale, and a scorecard detail.
39. An apparatus, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method recited by one or more of claims 1-8 or 9-23.
40. One or more machine readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the method of one or more of claims 1-8 or 9-23.
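
The binning, WoE, IV and field-screening steps recited in the claims can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it uses equal-width binning only, takes WoE as ln(bad share / good share) with label 1 meaning "bad" (for example, overdue), adds a smoothing count `eps` to avoid log(0), and uses 0.02 as the screening threshold. The names `equal_width_bins`, `woe_iv` and `select_fields` are hypothetical.

```python
import math


def equal_width_bins(values, n_bins):
    """Assign each value a bin index in [0, n_bins) using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def woe_iv(bin_ids, labels, eps=0.5):
    """Return ({bin: WoE}, IV) for one data field.

    labels: 1 = bad, 0 = good (assumed encoding).
    eps: additive smoothing count (an assumption of this sketch).
    """
    bins = sorted(set(bin_ids))
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    woe, iv = {}, 0.0
    for b in bins:
        bad = sum(1 for i, y in zip(bin_ids, labels) if i == b and y == 1)
        good = sum(1 for i, y in zip(bin_ids, labels) if i == b and y == 0)
        bad_share = (bad + eps) / (total_bad + eps * len(bins))
        good_share = (good + eps) / (total_good + eps * len(bins))
        woe[b] = math.log(bad_share / good_share)
        iv += (bad_share - good_share) * woe[b]
    return woe, iv


def select_fields(table, labels, n_bins=3, iv_threshold=0.02):
    """Keep the fields whose IV reaches the screening threshold; return {field: IV}."""
    kept = {}
    for name, values in table.items():
        _, iv = woe_iv(equal_width_bins(values, n_bins), labels)
        if iv >= iv_threshold:
            kept[name] = iv
    return kept


labels = [1, 1, 1, 0, 0, 0, 0, 0]
table = {
    "income": [10, 12, 15, 40, 45, 50, 55, 60],  # low income coincides with bad
    "branch": [5, 5, 5, 5, 5, 5, 5, 5],          # constant, carries no information
}
selected = select_fields(table, labels)
print(sorted(selected))
```

In this toy table the constant field has an IV of zero and is screened out, while the income field is kept. The retained fields and their WoE tables would then feed the model-generation and scoring-card steps, which are not shown here.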
CN201910872797.7A 2019-09-16 2019-09-16 Data processing method and device, machine readable medium and equipment Pending CN110659817A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910872797.7A CN110659817A (en) 2019-09-16 2019-09-16 Data processing method and device, machine readable medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910872797.7A CN110659817A (en) 2019-09-16 2019-09-16 Data processing method and device, machine readable medium and equipment

Publications (1)

Publication Number Publication Date
CN110659817A true CN110659817A (en) 2020-01-07

Family

ID=69037346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910872797.7A Pending CN110659817A (en) 2019-09-16 2019-09-16 Data processing method and device, machine readable medium and equipment

Country Status (1)

Country Link
CN (1) CN110659817A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909775A (en) * 2019-11-08 2020-03-24 支付宝(杭州)信息技术有限公司 Data processing method and device and electronic equipment
CN111311128A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Consumption financial credit scoring card development method based on third-party data
CN113010493A (en) * 2021-03-16 2021-06-22 北京云从科技有限公司 Data quality online analysis method and device, machine readable medium and equipment
CN113987182A (en) * 2021-10-28 2022-01-28 深圳永安在线科技有限公司 Fraud entity identification method, device and related equipment based on security intelligence
CN115841279A (en) * 2023-02-20 2023-03-24 塔比星信息技术(深圳)有限公司 Supply chain data evaluation method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087196A (en) * 2018-08-20 2018-12-25 北京玖富普惠信息技术有限公司 Credit-graded approach, system, computer equipment and readable medium
CN109636591A (en) * 2018-12-28 2019-04-16 浙江工业大学 A kind of credit scoring card development approach based on machine learning

Similar Documents

Publication Publication Date Title
CN110245213B (en) Questionnaire generation method, device, equipment and storage medium
CN110659817A (en) Data processing method and device, machine readable medium and equipment
US8196066B1 (en) Collaborative gesture-based input language
US11521115B2 (en) Method and system of detecting data imbalance in a dataset used in machine-learning
CN110909165A (en) Data processing method, device, medium and electronic equipment
CN108681970A (en) Finance product method for pushing, system and computer storage media based on big data
CN110909222B (en) User portrait establishing method and device based on clustering, medium and electronic equipment
CN112733042A (en) Recommendation information generation method, related device and computer program product
CN113051317B (en) Data mining model updating method, system, computer equipment and readable medium
CN112598294A (en) Method, device, machine readable medium and equipment for establishing scoring card model on line
CN111898675B (en) Credit wind control model generation method and device, scoring card generation method, machine readable medium and equipment
CN112163642A (en) Wind control rule obtaining method, device, medium and equipment
CN112328869A (en) User loan willingness prediction method and device and computer system
CN112308143A (en) Sample screening method, system, equipment and medium based on diversity
CN115271931A (en) Credit card product recommendation method and device, electronic equipment and medium
CN114078008A (en) Abnormal behavior detection method, device, equipment and computer readable storage medium
CN112115710B (en) Industry information identification method and device
WO2022245469A1 (en) Rule-based machine learning classifier creation and tracking platform for feedback text analysis
CN111275683B (en) Image quality grading processing method, system, device and medium
CN113392920A (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
CN110060183A (en) Client intelligent matching process, device, computer equipment and storage medium
CN112417197B (en) Sorting method, sorting device, machine readable medium and equipment
CN110728243B (en) Business management method, system, equipment and medium for right classification
CN114357184A (en) Item recommendation method and related device, electronic equipment and storage medium
CN113010493A (en) Data quality online analysis method and device, machine readable medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20200106

Address after: 102300 Room 102, floor 1, building 3, No. 20 Yong'an Road, Shilong Economic Development Zone, Mentougou District, Beijing

Applicant after: Beijing Yuncong Technology Co., Ltd

Address before: 201203 Shanghai City, Pudong New Area China Zuchongzhi Road (Shanghai) Free Trade Zone No. 1077 Building 2 room 1135-A

Applicant before: Shanghai cloud from enterprise development Co., Ltd.

SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200107