CN116503872B - Trusted client mining method based on machine learning - Google Patents


Info

Publication number
CN116503872B
CN116503872B CN202310757418.6A
Authority
CN
China
Prior art date
Legal status
Active
Application number
CN202310757418.6A
Other languages
Chinese (zh)
Other versions
CN116503872A (en)
Inventor
严松
黄奎
刘利科
Current Assignee
Beijing Jixian Information Technology Co ltd
Original Assignee
Beijing Jixian Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jixian Information Technology Co ltd filed Critical Beijing Jixian Information Technology Co ltd
Priority to CN202310757418.6A priority Critical patent/CN116503872B/en
Publication of CN116503872A publication Critical patent/CN116503872A/en
Application granted granted Critical
Publication of CN116503872B publication Critical patent/CN116503872B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a trusted client mining method based on machine learning, belonging to the technical field of information processing and analysis.

Description

Trusted client mining method based on machine learning
Technical Field
The invention relates to the technical field of information processing and analysis, in particular to a trusted client mining method based on machine learning.
Background
Conventional trusted client mining relies on manual data review: based on the materials provided by a client, staff check information such as registered capital, registration time, operation period, transaction duration, gross profit, transaction amount, order quantity, unit price per customer, number of overdue repayments, overdue amount, and overdue days, and then estimate the client's credit class. This existing approach is time-consuming, slow to audit, and unsuitable for rapidly mining trusted clients.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a trusted client mining method based on machine learning, which solves the problems of long processing time and low auditing speed in existing trusted client mining.
In order to achieve the aim of the invention, the invention adopts the following technical scheme: a trusted client mining method based on machine learning comprises the following steps:
s1, shooting text data submitted by a client to obtain a text image;
s2, performing image and text recognition on the text image to obtain text data;
s3, extracting client characteristic information from the text data;
s4, processing the client characteristic information by adopting a classification model, and dividing the credit rating of the client;
and s5, carrying out graded credit granting for the clients according to their credit levels, and filing the trusted clients.
Further, the step S2 includes the following sub-steps:
s21, extracting a character image from the text image;
s22, extracting features of the character image by adopting a feature extraction model to obtain an image feature sequence;
s23, processing the image feature sequence by adopting a character recognition model to obtain text data.
Further, the step S21 includes the steps of:
s211, carrying out gray level processing on the text image to obtain a gray level image;
s212, finding all pixel points meeting the edge condition from the gray level image to serve as text pixel points, wherein the edge condition is as follows:
$$\left|\frac{1}{N_1}\sum_{i=1}^{N_1} p_i-\frac{1}{N_2}\sum_{j=1}^{N_2} q_j\right|>T$$
wherein $g$ is the gray value of the pixel point under test on the gray-scale map, $N_1$ is the number of pixel points on one side of that pixel's neighborhood range, $N_2$ is the number of pixel points on the other side of the neighborhood range, $p_i$ is the pixel value of the $i$-th pixel on the one side, $q_j$ is the pixel value of the $j$-th pixel on the other side, and $T$ is a distance threshold; a pixel whose two neighborhood sides differ by more than $T$, and whose own gray value lies close to one side while far from the other, is taken as a text pixel point;
s213, forming the gray values of all the text pixel points into a character image.
The beneficial effects of the above further scheme are: and (3) carrying out graying treatment on the text image to obtain a gray level image, screening out text pixel points by using pixel values of the text pixel points and pixel values of background pixel points, wherein two pixel values exist in a vicinity range of one pixel point, the pixel points are the text pixel points with high probability, whether a difference value of the two pixel values is larger than a distance threshold value is calculated, if so, the pixel values on two sides are larger, meanwhile, the pixel points are similar to the pixel value of the pixel point on one side, are far away from the pixel value of the pixel point on the other side, the pixel points are further determined to be edge text pixel points on the text, all edge text pixel points are extracted, and text features are extracted, so that the effects of quickly reducing the image features and accurately extracting the text pixel points are achieved.
Further, the feature extraction model in S22 includes: a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a depth convolution layer, a first normalization layer, a second normalization layer, a maximum pooling layer, an average pooling layer, a Concat layer, a first adder A1 and a second adder A2;
the input end of the first convolution layer is used as the input end of the feature extraction model, and the output end of the first convolution layer is respectively connected with the input end of the depth convolution layer, the input end of the maximum pooling layer, the input end of the average pooling layer and the input end of the second adder A2; the input end of the first normalization layer is connected with the output end of the depth convolution layer, and the output end of the first normalization layer is connected with the input end of the second convolution layer; the input end of the first adder A1 is respectively connected with the output end of the maximum pooling layer and the output end of the average pooling layer, and the output end of the first adder A1 is connected with the input end of the second normalization layer; the input end of the Concat layer is respectively connected with the output end of the second convolution layer and the output end of the second normalization layer, and the output end of the Concat layer is connected with the input end of the third convolution layer; the output end of the third convolution layer is connected with the input end of the second adder A2; the input end of the fourth convolution layer is connected with the output end of the second adder A2, and the output end of the fourth convolution layer is used as the output end of the feature extraction model.
The beneficial effects of the above further scheme are: the invention processes the character image by a first convolution layer, divides the character image into multiple paths, inputs different paths, extracts depth features by the path of the depth convolution layer, extracts significant features by a maximum pooling layer, extracts average features by an average pooling layer, and connects the first convolution layer and a second adder A2 to realize identity mapping, thereby solving the problem of gradient disappearance.
Further, in the formula shared by the normalization layers, $y_i$ is the $i$-th output, $x_i$ is the $i$-th input, $\omega$ is the weight of the normalization layer, $b$ is the normalization-layer bias, $n$ is the number of inputs of the normalization layer, $\lambda$ is a normalization coefficient, and $\prod$ denotes the product.
Further, the character recognition model in S23 includes: a first LSTM layer, a second LSTM layer, an attention layer, a fully connected layer, and a Softmax layer;
the input end of the first LSTM layer is connected with the first input end of the attention layer and is used as the input end of the character recognition model; the input end of the second LSTM layer is connected with the output end of the first LSTM layer, and the output end of the second LSTM layer is connected with the second input end of the attention layer; the input end of the full-connection layer is connected with the output end of the attention layer, and the output end of the full-connection layer is connected with the input end of the Softmax layer; the output end of the Softmax layer is used as the output end of the character recognition model.
Further, in the expression of the attention layer, $A$ is the output of the attention layer, $\sigma$ is the activation function, $W_1$ is the first weight of the attention layer, $W_2$ is the second weight of the attention layer, $\mathrm{AvgPool}(\cdot)$ denotes average-pooling, $\mathrm{MaxPool}(\cdot)$ denotes max-pooling, $f_i$ is the $i$-th input feature vector, $n$ is the number of input feature vectors, $\|\cdot\|_2$ is the two-norm, and $b$ is the bias of the attention layer.
The beneficial effects of the above further scheme are: the invention carries out weighting treatment on the characteristics of the input attention layer, reflects the proportion of each quantity in the input characteristics according to the proportion of each quantity, avoids the pooling treatment from wiping out the data characteristics, carries out the maximum pooling treatment and the average pooling treatment, and respectively gives weights to increase the attention to the characteristics.
Further, in the classification model of S4, $y$ is the output of the classification model, $x_i$ is the $i$-th item of customer characteristic information input to the model, $h_i$ is the threshold for the $i$-th item of customer characteristic information, $\omega_i$ is its weight, $b_i$ is its bias, $K$ is the number of kinds of extracted customer characteristic information, $\tanh$ is the hyperbolic tangent function, and $\lambda$ is a proportionality coefficient.
The beneficial effects of the above further scheme are: in the classification model, each piece of customer characteristic information has a corresponding threshold value, if the customer characteristic information is smaller than the threshold value, the classification model plays a role in reducing the credit level of the customer, different weights and biases are given to each piece of customer characteristic information, different importance degrees of different customer characteristic information are achieved, the credit level of the customer is calculated by adopting a hyperbolic tangent function, a proportionality coefficient is set, the credit level is amplified, and the credit level of the customer is conveniently distinguished.
Further, in the loss function of the classification model, $L$ is the loss, $e$ indexes the counted training iterations, $\hat{c}_e$ is the credit level predicted at the $e$-th iteration, $c_e$ is the actual credit level at the $e$-th iteration, $m$ is the number of counted iterations, $M$ is the number of actual iterations, $k$ is the number of iterations satisfying the condition that the difference between predicted and actual credit levels exceeds $\varepsilon$, and $\varepsilon$ is a loss difference threshold.
The beneficial effects of the above further scheme are: the invention adopts the difference square of the actual credit level and the predicted credit level as the main content of the loss function, and the condition of multiple times of training is adopted, so that the classification model can achieve higher precision on the whole, the influence of higher precision on the judgment of the training degree of the classification model is prevented, and the loss difference threshold value is set in the inventionFurther assists in judging the training degree of the classification model in the training process, and when the difference between the actual credit level and the predicted credit level is smaller, the classification model is +.>Equal to->When the calculated loss value is smaller, the training degree of the judgment classification model can be more accurate, if the difference between the actual credit level and the predicted credit level is smaller, but +.>Not equal to->Then->The value of 2 or more corresponds to multiplying the difference between the actual credit rating and the predicted credit rating by a scaling factor, and the loss value is increased, in which case the classification model still needs to be trained.
In summary, the invention has the following beneficial effects: image processing is applied to the recorded customer data and its text is extracted, yielding the customer characteristic information; a classification model then automatically divides the customer's credit rating according to that information, and customers are granted credit by grade according to their credit ratings, realizing a fully automatic and rapid trusted client mining method.
Drawings
FIG. 1 is a flow chart of a method of mining trusted clients based on machine learning;
FIG. 2 is a schematic diagram of a feature extraction model;
fig. 3 is a schematic diagram of a text recognition model.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of the embodiments; to those of ordinary skill in the art, all inventions making use of the inventive concept fall within the protection of the spirit and scope of the invention as defined by the appended claims.
As shown in fig. 1, a trusted client mining method based on machine learning includes the following steps:
s1, shooting text data submitted by a client to obtain a text image;
s2, performing image and text recognition on the text image to obtain text data;
the step S2 comprises the following sub-steps:
s21, extracting a character image from the text image;
the step S21 comprises the following steps:
s211, carrying out gray level processing on the text image to obtain a gray level image;
s212, finding all pixel points meeting the edge condition from the gray level image to serve as text pixel points, wherein the edge condition is as follows:
$$\left|\frac{1}{N_1}\sum_{i=1}^{N_1} p_i-\frac{1}{N_2}\sum_{j=1}^{N_2} q_j\right|>T$$
wherein $g$ is the gray value of the pixel point under test on the gray-scale map, $N_1$ is the number of pixel points on one side of that pixel's neighborhood range, $N_2$ is the number of pixel points on the other side of the neighborhood range, $p_i$ is the pixel value of the $i$-th pixel on the one side, $q_j$ is the pixel value of the $j$-th pixel on the other side, and $T$ is a distance threshold; a pixel whose two neighborhood sides differ by more than $T$, and whose own gray value lies close to one side while far from the other, is taken as a text pixel point;
in this embodiment, the distance threshold is set according to the experimental conditions.
S213, forming the gray values of all the text pixel points into a character image.
The text image is grayed to obtain a gray-scale map, and text pixel points are screened out by comparing the pixel values of text pixels with those of background pixels. When two distinct pixel values exist within the neighborhood range of a pixel point, that point is very likely a text pixel, so the method checks whether the difference between the two values exceeds the distance threshold. If it does, and the point's own value is close to the pixel value on one side while far from that on the other side, the point is determined to be an edge text pixel. Extracting all edge text pixels extracts the text features, achieving the effects of quickly reducing the image features and accurately extracting the text pixel points.
In the present embodiment, it can be seen from the characteristics of the words and background of the text material that the text pixel value is typically less than the background pixel value: text images as black, with gray values approaching 0, while the background approaches 255, so the difference between text and background pixel values is large.
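The screening rule of S212 can be sketched in plain Python. The neighborhood shape, the use of side means, and the threshold value are assumptions; the patent supplies only the variable definitions and the behavior described above:

```python
def is_text_pixel(row, c, T=60):
    """Edge-condition test for one pixel of a gray-scale map row (a sketch).

    Splits a small horizontal neighborhood into a side before the point and
    a side after it, compares the side means, and checks that the point's
    own gray value hugs one side while the two side means differ by more
    than the distance threshold T. Neighborhood shape and T are assumed.
    """
    g = float(row[c])
    left = row[max(c - 2, 0):c]      # pixels on one side of the neighborhood
    right = row[c + 1:c + 3]         # pixels on the other side
    if not left or not right:
        return False
    p_mean = sum(left) / len(left)
    q_mean = sum(right) / len(right)
    near = min(abs(g - p_mean), abs(g - q_mean))
    far = max(abs(g - p_mean), abs(g - q_mean))
    return abs(p_mean - q_mean) > T and near < far

# black text (gray value 0) on a white background (255)
row = [255, 255, 255, 255, 255, 255, 0, 0, 0]
print(is_text_pixel(row, 6))   # boundary text pixel -> True
print(is_text_pixel(row, 2))   # deep in the background -> False
```

On this sketch, a pixel at the black/white boundary passes the condition while a pixel deep in the uniform background does not.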
S22, extracting features of the character image by adopting a feature extraction model to obtain an image feature sequence;
as shown in fig. 2, the feature extraction model in S22 includes: a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a depth convolution layer, a first normalization layer, a second normalization layer, a maximum pooling layer, an average pooling layer, a Concat layer, a first adder A1 and a second adder A2;
the input end of the first convolution layer is used as the input end of the feature extraction model, and the output end of the first convolution layer is respectively connected with the input end of the depth convolution layer, the input end of the maximum pooling layer, the input end of the average pooling layer and the input end of the second adder A2; the input end of the first normalization layer is connected with the output end of the depth convolution layer, and the output end of the first normalization layer is connected with the input end of the second convolution layer; the input end of the first adder A1 is respectively connected with the output end of the maximum pooling layer and the output end of the average pooling layer, and the output end of the first adder A1 is connected with the input end of the second normalization layer; the input end of the Concat layer is respectively connected with the output end of the second convolution layer and the output end of the second normalization layer, and the output end of the Concat layer is connected with the input end of the third convolution layer; the output end of the third convolution layer is connected with the input end of the second adder A2; the input end of the fourth convolution layer is connected with the output end of the second adder A2, and the output end of the fourth convolution layer is used as the output end of the feature extraction model.
The invention first processes the character image with the first convolution layer and then splits the result into multiple paths with different inputs: the depth convolution path extracts depth features, the maximum pooling layer extracts salient features, and the average pooling layer extracts average features. Connecting the first convolution layer directly to the second adder A2 realizes an identity mapping, which alleviates the vanishing-gradient problem.
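The branch-and-merge topology described above can be made concrete with a toy sketch. The layer bodies below are stand-in functions on 1-D lists, not real convolutions or poolings; only the wiring (first conv, the depthwise and pooling branches, adder A1, the Concat layer, and the A2 identity skip) follows the text, and every function name is illustrative:

```python
def conv(x, k=1.0, b=0.0):
    """Stand-in for a convolution layer (here just an affine map)."""
    return [k * v + b for v in x]

def depthwise_conv(x):
    """Stand-in for the depth convolution layer."""
    return [0.01 * v * v for v in x]

def max_pool(x):
    """Stand-in max pooling: broadcast the salient (maximum) feature."""
    m = max(x)
    return [m] * len(x)

def avg_pool(x):
    """Stand-in average pooling: broadcast the average feature."""
    a = sum(x) / len(x)
    return [a] * len(x)

def normalize(x):
    """Stand-in for a normalization layer (scale magnitudes to at most 1)."""
    s = max(abs(v) for v in x) or 1.0
    return [v / s for v in x]

def add(a, b):
    """An adder: element-wise sum of two branches."""
    return [u + v for u, v in zip(a, b)]

def feature_extract(x):
    t = conv(x)                                         # first convolution layer
    branch1 = conv(normalize(depthwise_conv(t)))        # depth-feature path
    branch2 = normalize(add(max_pool(t), avg_pool(t)))  # pooling paths via adder A1
    merged = branch1 + branch2                          # Concat layer
    reduced = conv(merged)[:len(t)]                     # third conv (toy channel reduction)
    out = add(reduced, t)                               # adder A2 with identity skip
    return conv(out)                                    # fourth convolution layer

features = feature_extract([1.0, 2.0, 3.0])
```

The identity connection from the first convolution layer into adder A2 is what the text credits with avoiding vanishing gradients.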
In the formula shared by the first and second normalization layers, $y_i$ is the $i$-th output, $x_i$ is the $i$-th input, $\omega$ is the weight of the normalization layer, $b$ is the normalization-layer bias, $n$ is the number of inputs of the normalization layer, $\lambda$ is a normalization coefficient, and $\prod$ denotes the product.
S23, processing the image feature sequence by adopting a character recognition model to obtain text data.
As shown in fig. 3, the character recognition model in S23 includes: a first LSTM layer, a second LSTM layer, an attention layer, a fully connected layer, and a Softmax layer;
the input end of the first LSTM layer is connected with the first input end of the attention layer and is used as the input end of the character recognition model; the input end of the second LSTM layer is connected with the output end of the first LSTM layer, and the output end of the second LSTM layer is connected with the second input end of the attention layer; the input end of the full-connection layer is connected with the output end of the attention layer, and the output end of the full-connection layer is connected with the input end of the Softmax layer; the output end of the Softmax layer is used as the output end of the character recognition model.
In the expression of the attention layer, $A$ is the output of the attention layer, $\sigma$ is the activation function, $W_1$ is the first weight of the attention layer, $W_2$ is the second weight of the attention layer, $\mathrm{AvgPool}(\cdot)$ denotes average-pooling, $\mathrm{MaxPool}(\cdot)$ denotes max-pooling, $f_i$ is the $i$-th input feature vector, $n$ is the number of input feature vectors, $\|\cdot\|_2$ is the two-norm, and $b$ is the bias of the attention layer.
The invention weights the features entering the attention layer so that each quantity's proportion within the input features is reflected, preventing the pooling operations from wiping out the data characteristics; both maximum pooling and average pooling are applied and given separate weights, increasing the attention paid to the features.
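A minimal sketch of such a weighted dual-pooling attention follows. The patent's exact expression survives only as symbol definitions, so the combination below (an activation applied to w1*avg + w2*max plus a bias, per feature position) is an assumed form, and the weight values are placeholders:

```python
import math

def attention(features, w1=0.6, w2=0.4, bias=0.0):
    """Assumed attention form: per position, combine the average-pooled
    and max-pooled statistics with separate weights, add a bias, and
    squash with a sigmoid activation."""
    n = len(features[0])
    avg = [sum(f[i] for f in features) / len(features) for i in range(n)]
    mx = [max(f[i] for f in features) for i in range(n)]
    return [1.0 / (1.0 + math.exp(-(w1 * a + w2 * m + bias)))
            for a, m in zip(avg, mx)]

feats = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
weights = attention(feats)   # one attention weight per feature position
```

Both poolings contribute, weighted separately, so neither the salient nor the average statistics are wiped out.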
S3, extracting client characteristic information from the text data;
since text data is extracted by performing text recognition in step S2 and data vectors are obtained in the computer system, only the required text feature vector is extracted in step S3 to obtain client feature information, which corresponds to the extraction of the corresponding data vector from the storage unit, and the information of the client is known.
S4, processing the client characteristic information by adopting a classification model, and dividing the credit rating of the client;
the classification model in the S4 is as follows:
wherein ,for the output of the classification model, +.>Input for classification model->Customer characteristic information->Is->Customer characteristic information thresholdValue of->Is->Weight of the customer characteristic information, +.>Is->Bias of customer characteristic information->For the kind of the extracted customer characteristic information +.>As hyperbolic tangent function, +.>Is a proportionality coefficient.
In the classification model, each item of customer characteristic information has a corresponding threshold; if the item is smaller than its threshold, it acts to reduce the customer's credit level. Each item is given its own weight and bias, reflecting the different importance of different items of customer characteristic information. The customer's credit level is computed with a hyperbolic tangent function, and a proportionality coefficient is set to amplify the credit level, making customers' credit levels easy to distinguish.
In this embodiment, the customer characteristic information may first be normalized to ensure each quantity lies in the range 0-1, which makes the contribution of each quantity within the classification model easy to measure; after normalization, the customer characteristic information threshold may be set to 0.5.
In this embodiment, the types of client characteristic information include registered capital and registration time. Taking these two types as an example of the classification model: if the registered capital is ten million and the maximum registered capital is set to twenty million, the normalized value of the client characteristic information corresponding to the registered-capital type is 0.5; if the registration time is 6 years and the maximum is set to 30 years, the value corresponding to the registration-time type is 0.2. That is, the client characteristic information described in the invention is a value obtained by quantifying the client information. This description merely illustrates the use of the classification model; the specific usage procedure can be set according to requirements, and such settings do not affect the structure of the classification model of the present invention.
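The quantification in this example can be reproduced directly; the cap values (twenty million for capital, 30 years for registration time) are the ones assumed in the example above:

```python
def quantify(value, cap):
    """Normalize one item of client information into the 0-1 range."""
    return min(value / cap, 1.0)

registered_capital = quantify(10_000_000, 20_000_000)
registered_time = quantify(6, 30)
print(registered_capital, registered_time)  # 0.5 0.2
```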
The types of customer characteristic information are selected freely according to the business direction of each enterprise or bank.
The classification model of the invention summarizes and counts the various kinds of client information, thereby obtaining the client's credit level.
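A sketch of the grading computation follows. The patent's formula is given only through its symbol definitions, so the form lam * tanh(sum of w_i*(x_i - h_i) + b_i) is an assumption assembled from those definitions (per-item thresholds, weights and biases, a hyperbolic tangent, and a proportionality coefficient):

```python
import math

def credit_score(x, w, h, b, lam=5.0):
    """Assumed form of the classification model: items below their
    threshold pull the score down, tanh squashes the weighted sum, and
    the proportionality coefficient lam amplifies the result."""
    s = sum(wi * (xi - hi) + bi for xi, wi, hi, bi in zip(x, w, h, b))
    return lam * math.tanh(s)

# normalized features from the example: capital 0.5, registration time 0.2
low = credit_score([0.5, 0.2], w=[1.0, 1.0], h=[0.5, 0.5], b=[0.0, 0.0])
high = credit_score([1.0, 1.0], w=[1.0, 1.0], h=[0.5, 0.5], b=[0.0, 0.0])
```

A client whose every feature clears its threshold gets a positive amplified score, while sub-threshold features drag the score negative.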
In the loss function of the classification model, $L$ is the loss, $e$ indexes the counted training iterations, $\hat{c}_e$ is the credit level predicted at the $e$-th iteration, $c_e$ is the actual credit level at the $e$-th iteration, $m$ is the number of counted iterations, $M$ is the number of actual iterations, $k$ is the number of iterations satisfying the condition that the difference between predicted and actual credit levels exceeds $\varepsilon$, and $\varepsilon$ is a loss difference threshold.
The invention adopts the squared difference between the actual and predicted credit levels as the main content of the loss function, accumulated over multiple training iterations, so that the classification model reaches high precision overall. To keep a deceptively small loss from obscuring the model's true training degree, a loss difference threshold $\varepsilon$ is set to further assist in judging the training degree during training. When the difference between the actual and predicted credit levels is small at every iteration, $k$ stays at its minimum and the computed loss value is small, so the training degree of the classification model can be judged accurately. If the difference is small at most iterations but $k$ is not at its minimum, i.e. $k$ takes a value of 2 or more, this is equivalent to multiplying the difference between the actual and predicted credit levels by a scaling factor, which increases the loss value; in that case the classification model still needs training.
The actual credit level in the invention is a label assigned manually, from experience, to each training sample of the classification model.
The higher the accuracy of the classification model training in the invention, the higher the classification accuracy of the client credit rating.
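The loss described above can be sketched as follows; the exact role of the count k is reconstructed from the description (it scales the squared error once too many iterations exceed the loss difference threshold), so this form is an assumption:

```python
def training_loss(pred, actual, eps=0.5):
    """Assumed loss: mean squared difference between predicted and
    actual credit levels, multiplied by k, the count (at least 1) of
    iterations whose absolute difference exceeds the threshold eps."""
    diffs = [p - a for p, a in zip(pred, actual)]
    k = max(1, sum(1 for d in diffs if abs(d) > eps))
    return k * sum(d * d for d in diffs) / len(diffs)

perfect = training_loss([1.0, 2.0], [1.0, 2.0])  # 0.0
scaled = training_loss([3.0, 4.0], [1.0, 2.0])   # 8.0: k == 2 doubles the mean squared error
```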
S5, carrying out graded credit granting for the clients according to their credit levels, and filing the trusted clients.
In this embodiment, the credit rating of the client includes: bronze, silver, gold, platinum, and diamond customers, or primary, secondary, tertiary, and quaternary customers, etc.
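Mapping the amplified score onto the named grades of S5 can be done with simple cut points; the boundaries below are illustrative only and not taken from the patent:

```python
TIERS = ["bronze", "silver", "gold", "platinum", "diamond"]

def credit_tier(score):
    """Bucket an amplified score (roughly in [-5, 5]) into five tiers."""
    cuts = [-3.0, -1.0, 1.0, 3.0]   # assumed grade boundaries
    for i, c in enumerate(cuts):
        if score < c:
            return TIERS[i]
    return TIERS[-1]

print(credit_tier(-4.0))  # bronze
print(credit_tier(0.0))   # gold
print(credit_tier(4.5))   # diamond
```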
In summary, the beneficial effects of the embodiment of the invention are as follows: image processing is applied to the recorded customer data and its text is extracted, yielding the customer characteristic information; a classification model then automatically divides the customer's credit rating according to that information, and customers are granted credit by grade according to their credit ratings, realizing a fully automatic and rapid trusted client mining method.
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in its protection scope.

Claims (4)

1. A machine learning based trusted client mining method, characterized by comprising the following steps:
s1, shooting text data submitted by a client to obtain a text image;
s2, performing image and text recognition on the text image to obtain text data;
the step S2 comprises the following sub-steps:
s21, extracting a text image from the text image;
s22, extracting features of the character image by adopting a feature extraction model to obtain an image feature sequence;
s23, processing the image feature sequence by adopting a character recognition model to obtain text data;
the feature extraction model in S22 includes: a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a depth convolution layer, a first normalization layer, a second normalization layer, a maximum pooling layer, an average pooling layer, a Concat layer, a first adder A1 and a second adder A2;
the input end of the first convolution layer is used as the input end of the feature extraction model, and the output end of the first convolution layer is respectively connected with the input end of the depth convolution layer, the input end of the maximum pooling layer, the input end of the average pooling layer and the input end of the second adder A2; the input end of the first normalization layer is connected with the output end of the depth convolution layer, and the output end of the first normalization layer is connected with the input end of the second convolution layer; the input end of the first adder A1 is respectively connected with the output end of the maximum pooling layer and the output end of the average pooling layer, and the output end of the first adder A1 is connected with the input end of the second normalization layer; the input end of the Concat layer is respectively connected with the output end of the second convolution layer and the output end of the second normalization layer, and the output end of the Concat layer is connected with the input end of the third convolution layer; the output end of the third convolution layer is connected with the input end of the second adder A2; the input end of the fourth convolution layer is connected with the output end of the second adder A2, and the output end of the fourth convolution layer is used as the output end of the feature extraction model;
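The layer wiring recited above can be sketched as a PyTorch module. This is a minimal sketch under stated assumptions, not the patented implementation: the channel width, 3×3 kernels, stride-1 padded pooling (so both adders receive matching shapes), and the use of `BatchNorm2d` for the two normalization layers are all choices not fixed by the claim.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of the claimed wiring: conv1 feeds a depthwise branch and a
    pooling branch; the branches are concatenated, projected by conv3,
    added back to conv1's output (adder A2), and refined by conv4."""
    def __init__(self, in_ch=1, ch=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)  # depth convolution layer
        self.norm1 = nn.BatchNorm2d(ch)   # first normalization layer (assumed form)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.maxpool = nn.MaxPool2d(3, stride=1, padding=1)  # stride 1 keeps shapes addable
        self.avgpool = nn.AvgPool2d(3, stride=1, padding=1)
        self.norm2 = nn.BatchNorm2d(ch)   # second normalization layer
        self.conv3 = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.conv4 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        f1 = self.conv1(x)
        branch_a = self.conv2(self.norm1(self.depthwise(f1)))
        branch_b = self.norm2(self.maxpool(f1) + self.avgpool(f1))  # first adder A1
        fused = self.conv3(torch.cat([branch_a, branch_b], dim=1))  # Concat layer
        return self.conv4(fused + f1)                               # second adder A2

out = FeatureExtractor()(torch.randn(1, 1, 32, 100))
```

Stride-1, padded pooling and convolutions are assumed throughout so that adder A2 can add conv3's output to conv1's output elementwise, as the claim's connections require.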
the character recognition model in S23 comprises: a first LSTM layer, a second LSTM layer, an attention layer, a fully connected layer, and a Softmax layer;
the input end of the first LSTM layer is connected with the first input end of the attention layer and is used as the input end of the character recognition model; the input end of the second LSTM layer is connected with the output end of the first LSTM layer, and the output end of the second LSTM layer is connected with the second input end of the attention layer; the input end of the full-connection layer is connected with the output end of the attention layer, and the output end of the full-connection layer is connected with the input end of the Softmax layer; the output end of the Softmax layer is used as the output end of the character recognition model;
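The recognition model's connections can likewise be sketched in PyTorch. The hidden sizes, class count, and the reading of the attention layer as a pooled sigmoid gate over the input sequence (in the spirit of claim 4's AvgPool/MaxPool form) are assumptions; the claim fixes only the layer order and which outputs feed the attention layer's two inputs.

```python
import torch
import torch.nn as nn

class CharRecognizer(nn.Module):
    """Sketch of the claimed model: two stacked LSTMs, an attention layer
    fed by both the raw input and the second LSTM's output, then a fully
    connected layer and Softmax."""
    def __init__(self, feat_dim=32, hidden=64, n_classes=100):
        super().__init__()
        self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)
        # project back to feat_dim so the attention gate can mix with the input
        self.lstm2 = nn.LSTM(hidden, feat_dim, batch_first=True)
        # attention read as sigmoid(W1*AvgPool(h) + W2*MaxPool(h)), per claim 4
        self.w1 = nn.Linear(feat_dim, feat_dim, bias=False)
        self.w2 = nn.Linear(feat_dim, feat_dim, bias=True)
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, x):                 # x: (batch, seq_len, feat_dim)
        h1, _ = self.lstm1(x)
        h2, _ = self.lstm2(h1)
        avg = h2.mean(dim=1)              # average pooling over time
        mx = h2.max(dim=1).values         # max pooling over time
        gate = torch.sigmoid(self.w1(avg) + self.w2(mx))
        attended = x * gate.unsqueeze(1)  # first attention input: the raw sequence
        return torch.softmax(self.fc(attended), dim=-1)

probs = CharRecognizer()(torch.randn(2, 25, 32))
```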
s3, extracting client characteristic information from the text data;
the types of the customer characteristic information include: register capital, register time, business scope, business deadline, stakeholder information, high management information, transaction duration, gross profit, transaction amount, amount of orders, and guest unit price;
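Before classification, the extracted feature types can be collected into a fixed-order numeric vector. A minimal sketch follows; the field names and default handling are illustrative, not from the original.

```python
# Illustrative fixed ordering of the claimed feature types (names assumed).
FEATURE_ORDER = [
    "registered_capital", "registered_years", "business_scope_breadth",
    "business_term_years", "shareholder_count", "executive_count",
    "cooperation_years", "gross_profit", "transaction_amount",
    "order_count", "avg_order_value",
]

def to_feature_vector(record):
    """Map an extracted-field dict to a stable-length vector;
    missing fields default to 0.0."""
    return [float(record.get(name, 0.0)) for name in FEATURE_ORDER]

vec = to_feature_vector({"registered_capital": 5e6, "order_count": 120})
```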
s4, processing the client characteristic information by adopting a classification model, and dividing the credit rating of the client;
s5, grading credit giving is carried out on the clients according to the credit grades of the clients, and the credit giving clients are filed;
the classification model in the S4 is as follows:
wherein y is the output of the classification model, x is the input of the classification model, x_j is the j-th item of customer characteristic information, T_j is the threshold of the j-th item of customer characteristic information, w_j is the weight of the j-th item of customer characteristic information, b_j is the bias of the j-th item of customer characteristic information, M is the number of kinds of extracted customer characteristic information, tanh is the hyperbolic tangent function, and k is a proportionality coefficient;
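The classification formula itself is rendered as an image in the original publication and is not recoverable; the following is one conjectured form consistent with the recited variables (per-feature weight, threshold, and bias inside a summation, scaled by a proportionality coefficient and passed through tanh), offered only as an illustration.

```python
import math

def classify_credit(x, w, b, T, k=1.0):
    """Conjectured form, NOT the patented formula: tanh of the
    proportionally scaled sum of threshold-shifted weighted features.
    x, w, b, T are per-feature lists of equal length M."""
    s = sum(w[j] * (x[j] - T[j]) + b[j] for j in range(len(x)))
    return math.tanh(k * s)

y = classify_credit(x=[1.0, 2.0], w=[0.5, 0.25], b=[0.0, 0.1], T=[0.5, 1.0])
```

The tanh squashes the score into (-1, 1), which can then be bucketed into discrete credit levels.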
the loss function of the classification model is as follows:
wherein L is the loss function, N is the statistical number of training iterations, ŷ_i is the credit level predicted in the i-th training iteration, y_i is the actual credit level in the i-th training iteration, i is the training-iteration index, N_s is the actual number of training iterations, N_c is the number of training iterations satisfying the condition associated with the loss difference threshold, and ε is the loss difference threshold.
2. The machine learning based trusted client mining method of claim 1, wherein S21 comprises the steps of:
s211, carrying out gray level processing on the text image to obtain a gray level image;
s212, finding all pixel points meeting the edge condition from the gray level image to serve as text pixel points, wherein the edge condition is as follows:
wherein g is the gray value of any pixel point on the gray-scale map, n_1 is the number of pixel points on one side of the neighborhood range of that pixel point, n_2 is the number of pixel points on the other side of the neighborhood range, p_i is the pixel value of the i-th pixel point on the other side of the neighborhood, q_i is the pixel value of the i-th pixel point on the one side of the neighborhood, and d is a distance threshold;
s213, forming the gray values of all the text pixels into a text image.
3. The machine learning based trusted client mining method of claim 1, wherein the formula of the normalization layer is:
wherein y_l is the l-th output of the normalization layer, x_l is the l-th input of the normalization layer, w is the weight of the normalization layer, b is the bias of the normalization layer, n is the number of inputs of the normalization layer, γ is the normalization coefficient, and · denotes the product.
4. The machine learning based trusted client mining method of claim 1, wherein the expression of the attention layer is:
wherein F is the output of the attention layer, σ is the activation function, W_1 is the first weight of the attention layer, W_2 is the second weight of the attention layer, AvgPool(·) denotes average-pooling processing, MaxPool(·) denotes max-pooling processing, x_i is the i-th input feature vector, n is the number of input feature vectors, ‖·‖ is the two-norm operation, and b is the bias of the attention layer.
CN202310757418.6A 2023-06-26 2023-06-26 Trusted client mining method based on machine learning Active CN116503872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310757418.6A CN116503872B (en) 2023-06-26 2023-06-26 Trusted client mining method based on machine learning


Publications (2)

Publication Number Publication Date
CN116503872A CN116503872A (en) 2023-07-28
CN116503872B true CN116503872B (en) 2023-09-05

Family

ID=87321639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310757418.6A Active CN116503872B (en) 2023-06-26 2023-06-26 Trusted client mining method based on machine learning

Country Status (1)

Country Link
CN (1) CN116503872B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934468B (en) * 2023-09-15 2023-12-22 成都运荔枝科技有限公司 Trusted client grading method based on semantic recognition


Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US10354168B2 (en) * 2016-04-11 2019-07-16 A2Ia S.A.S. Systems and methods for recognizing characters in digitized documents
CN110751261B (en) * 2018-07-23 2024-05-28 第四范式(北京)技术有限公司 Training method and system and prediction method and system for neural network model
US10943274B2 (en) * 2018-08-28 2021-03-09 Accenture Global Solutions Limited Automation and digitizalization of document processing systems
US11423538B2 (en) * 2019-04-16 2022-08-23 Covera Health Computer-implemented machine learning for detection and statistical analysis of errors by healthcare providers
JP2022161564A (en) * 2021-04-09 2022-10-21 株式会社日立製作所 System for training machine learning model recognizing character of text image

Patent Citations (16)

Publication number Priority date Publication date Assignee Title
CN108734338A (en) * 2018-04-24 2018-11-02 阿里巴巴集团控股有限公司 Credit risk forecast method and device based on LSTM models
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN111062236A (en) * 2019-05-05 2020-04-24 杭州魔蝎数据科技有限公司 Data authorization method and device based on artificial intelligence
CN111652870A (en) * 2020-06-02 2020-09-11 集美大学诚毅学院 Cloth defect detection method and device, storage medium and electronic equipment
CN112348654A (en) * 2020-09-23 2021-02-09 民生科技有限责任公司 Automatic assessment method, system and readable storage medium for enterprise credit line
CN112138403A (en) * 2020-10-19 2020-12-29 腾讯科技(深圳)有限公司 Interactive behavior recognition method and device, storage medium and electronic equipment
CN112069321A (en) * 2020-11-11 2020-12-11 震坤行网络技术(南京)有限公司 Method, electronic device and storage medium for text hierarchical classification
WO2022134588A1 (en) * 2020-12-21 2022-06-30 深圳壹账通智能科技有限公司 Method for constructing information review classification model, and information review method
CN112613501A (en) * 2020-12-21 2021-04-06 深圳壹账通智能科技有限公司 Information auditing classification model construction method and information auditing method
CN112819604A (en) * 2021-01-19 2021-05-18 浙江省农村信用社联合社 Personal credit evaluation method and system based on fusion neural network feature mining
CN113963147A (en) * 2021-09-26 2022-01-21 西安交通大学 Key information extraction method and system based on semantic segmentation
CN114118186A (en) * 2021-10-11 2022-03-01 西安理工大学 Calligraphy image style classification method based on directional feature enhancement
CN114841792A (en) * 2021-12-22 2022-08-02 云汉芯城(上海)互联网科技股份有限公司 Client credit line prediction method based on machine learning
CN114971294A (en) * 2022-05-27 2022-08-30 平安银行股份有限公司 Data acquisition method, device, equipment and storage medium
CN114882011A (en) * 2022-06-13 2022-08-09 浙江理工大学 Fabric flaw detection method based on improved Scaled-YOLOv4 model
CN115688002A (en) * 2022-11-02 2023-02-03 阿里云计算有限公司 Classification method and device, method and device for training classification model and classification model

Non-Patent Citations (1)

Title
Cost-sensitive and ensemble-learning-based credit evaluation method for online lending and its application; Wang Haomin; China Doctoral Dissertations Full-text Database (Basic Sciences) (No. 07, 2020); A002-80 *


Similar Documents

Publication Publication Date Title
Franceschetti et al. Do bankrupt companies manipulate earnings more than the non-bankrupt ones?
CN109583966B (en) High-value customer identification method, system, equipment and storage medium
Chi et al. Bankruptcy prediction: Application of logit analysis in export credit risks
CN108648023A (en) A kind of businessman&#39;s passenger flow forecast method of fusion history mean value and boosted tree
CN116503872B (en) Trusted client mining method based on machine learning
CN112102073A (en) Credit risk control method and system, electronic device and readable storage medium
CN115545712A (en) Fraud prediction method, device, equipment and storage medium for transaction behaviors
US20160379138A1 (en) Classifying test data based on a maximum margin classifier
CN112365352A (en) Anti-cash-out method and device based on graph neural network
CN116227939A (en) Enterprise credit rating method and device based on graph convolution neural network and EM algorithm
CN115545909A (en) Approval method, device, equipment and storage medium
Faizova et al. The Impact of Digitalization Risks on the Business Processes of an Insurance Company
CN111428510B (en) Public praise-based P2P platform risk analysis method
CN110570301B (en) Risk identification method, device, equipment and medium
Fikriya et al. Support Vector Machine Predictive Analysis Implementation: Case Study of Tax Revenue in Government of South Lampung
Kao et al. Bayesian behavior scoring model
CN113191771A (en) Buyer account period risk prediction method
KR20220001127U (en) Commercial Real Estate Simulator using Public data and Vehicle Analysis
Hargreaves Machine learning application to identify good credit customers
Manawadu et al. Microfinance interest rate prediction and automate the loan application
CN115953166B (en) Customer information management method and system based on big data intelligent matching
CN117994017A (en) Method for constructing retail credit risk prediction model and online credit service Scoredelta model
CN118071482A (en) Method for constructing retail credit risk prediction model and consumer credit business Scorebetad model
CN117994016A (en) Method for constructing retail credit risk prediction model and consumer credit business Scorebeta model
CN118071483A (en) Method for constructing retail credit risk prediction model and personal credit business Scorepsi model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant