Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, the present embodiment provides an electronic device 1 comprising at least one processor 11 and a memory 12; one processor is shown in fig. 1 as an example. The processor 11 and the memory 12 are connected by a bus 10. The memory 12 stores instructions executable by the processor 11, and when the instructions are executed by the processor 11, the electronic device 1 can execute all or part of the flow of the methods in the embodiments described below, so as to automatically identify the flow direction information of document information.
In an embodiment, the electronic device 1 may be a mobile phone, a notebook computer, a desktop computer, or the like.
Please refer to fig. 2, which shows a method for identifying an information flow direction according to an embodiment of the present application. The method can be executed by the electronic device 1 shown in fig. 1 and can be applied to a scenario of predicting policy loss information, so as to identify the flow direction information of document information by using a target recognition model according to a flow feature set of the document information. The method comprises the following steps:
Step 201: acquiring the document information to be processed.
In this step, the document information may be an insurance document or any of various order documents, such as a car insurance policy, a personal health policy, or a purchase order. The document information to be processed may be a document that is about to expire, such as a car insurance policy approaching its expiry date. Policy information that will expire within a preset time (for example, three months) may be selected as the document information to be processed, where the preset time may be determined based on actual historical statistical data. There may be multiple pieces of document information to be processed. An externally provided CSV (Comma-Separated Values) structured data file may be imported into a data storage structure through a data import function, or the soon-to-expire policy data file to be predicted may be imported into a data storage structure in an ETL (Extract-Transform-Load) manner, in which business-system data are loaded into a data warehouse after extraction, cleaning, and transformation.
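As a sketch of the selection described above (field names and the hard-coded 90-day window are hypothetical stand-ins for the preset time), selecting soon-to-expire policies might look like:

```python
from datetime import date, timedelta

# Hypothetical record layout: each policy is a dict with an "expiry" date.
# The 90-day window stands in for the "preset time (such as three months)";
# in practice it would come from historical statistics.
def select_expiring(policies, today, window_days=90):
    """Return the policies whose expiry date falls within the window."""
    cutoff = today + timedelta(days=window_days)
    return [p for p in policies if today <= p["expiry"] <= cutoff]

policies = [
    {"id": "A001", "expiry": date(2023, 4, 1)},
    {"id": "A002", "expiry": date(2023, 9, 1)},
    {"id": "A003", "expiry": date(2023, 2, 10)},
]
due = select_expiring(policies, today=date(2023, 3, 15))
print([p["id"] for p in due])  # → ['A001']
```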
Step 202: analyzing the document information according to preset variables to generate a flow feature set of the document information.
In this step, a preset variable is a characteristic variable representing the flow direction of the document information and may be set based on historical statistical information of the actual scenario. There may be multiple preset variables. Data analysis is performed on the document information to be processed based on the preset variables; the document information contains data corresponding to each preset variable, and the set formed by the data corresponding to all the preset variables serves as the flow feature set of the document information.
Step 203: inputting the flow feature set into a target recognition model and identifying the flow direction information of the document information.
In this step, the target recognition model can automatically recognize the flow direction information of the document information. Training samples can be collected based on historical statistical data of the actual scenario, and an algorithm model can then be trained to obtain the target recognition model. The flow feature set of the document information obtained in step 202 is input into the target recognition model, which outputs the flow direction information of the document information. The flow direction information may indicate whether the document will be lost in the future, for example, whether the policy customer will renew and be retained.
According to the above information flow direction identification method, the document information to be processed is analyzed based on the preset variables to obtain the flow feature set of the document information; the flow feature set is then input into the target recognition model, which outputs the flow direction information of the document information, so that the future flow direction of the document information is automatically identified and predicted.
Please refer to fig. 3, which shows an information flow direction identification method according to an embodiment of the present application. The method can be executed by the electronic device 1 shown in fig. 1 and can be applied to a scenario of policy loss prediction, so as to identify the flow direction information of document information by using a target recognition model according to a flow feature set of the document information. The method comprises the following steps:
Step 301: acquiring data of a plurality of historical documents.
In this step, before the document information to be processed is identified, preset variables need to be selected based on historical statistical data. The historical documents are of the same type as the document information to be processed, for example both being vehicle insurance policies. Taking the prediction of vehicle insurance policy customer loss as an example, data can be primarily screened from the historical data, and key factors influencing customer loss on vehicle insurance policies can be mined.
In one embodiment, the sample size, the number of variables, the field-missing condition, and the binary response y of the model are determined, and the meanings of 0 and 1 for y are defined. The data of the plurality of historical documents can be stored as wide-table data, and the data can be divided by information dimension into: customer data, vehicle data, insurance data, life-insurance customer data, business member data, and brand data dimensions. The specific wide-table data are shown in table 1:
TABLE 1 possible significant factors for policy customer churn
Step 302: cleaning invalid data from the data of each historical document to generate a historical variable set.
In this step, taking the prediction of vehicle insurance policy loss information as an example: since the historical documents in the historical database contain a large amount of information, invalid data are first cleaned from the data of each historical document in order to obtain more accurate variable factors capable of representing the document flow direction. The imported data can be checked for logical consistency, missing values, and abnormal values through a data checking function, so as to judge whether the data meet the conditions for loss prediction. After invalid data are cleaned, a historical variable set is finally generated.
In one embodiment, the actual meaning of each variable in all historical document data is first interpreted, and the variables are divided into continuous variables, categorical variables, date variables, and non-functional variables, as shown in table 2:
TABLE 2 Practical significance and classification of variables
Then, for the processing of missing data values: variables whose missing values account for more than 80% of records can be deleted, and if a missing value of a variable actually means 0, the value is filled with 0. For a field whose values are evenly distributed within the variable, missing values can be filled with the mean; for an unevenly distributed field, missing values can be filled with the median or with 0.
For example, for the consecutive renewal count, a missing value indicates that the policy was not renewed, so 0 can be used for filling, the filled value indicating that the number of consecutive renewals is 0. As shown in Table 3, the statistical distribution of the variable shows 143792 missing values for the consecutive renewal count. The data after missing-value filling are shown in table 4.
TABLE 3 Missing values of the consecutive renewal count

var | mean | median | 0% | 1% | 10% | 25% | 50% | 75% | 90% | 99% | 100% | nmiss
RENEWNUM | 2.20620647218415 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 5 | 6 | 6 | 143792
TABLE 4 Consecutive renewal count after missing-value filling

var | mean | median | 0% | 1% | 10% | 25% | 50% | 75% | 90% | 99% | 100% | nmiss
RENEWNUM | 1.1001032063709 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 6 | 6 | 0
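The filling rules above can be sketched as follows. The 80% drop threshold follows the text, while the mean-versus-median closeness test used here to decide between mean and median filling is only an illustrative proxy for "evenly distributed":

```python
import statistics

# A minimal sketch of the missing-value rules above. None marks a missing
# entry; the 80% threshold and the fill strategies follow the text, but the
# uniformity test (mean close to median) is an illustrative proxy only.
def clean_column(values, drop_ratio=0.8):
    n_missing = sum(v is None for v in values)
    if n_missing / len(values) > drop_ratio:
        return None                       # drop the variable entirely
    present = [v for v in values if v is not None]
    mean, median = statistics.mean(present), statistics.median(present)
    # roughly even distribution -> mean fill; skewed -> median fill
    fill = mean if abs(mean - median) <= 0.1 * max(abs(median), 1) else median
    return [fill if v is None else v for v in values]

print(clean_column([1, 2, 3, None]))   # mean ≈ median, so mean fill
print(clean_column([None] * 9 + [5]))  # 90% missing, variable dropped
```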
In one embodiment, for the treatment of variable outliers, values beyond the 1% and 99% quantiles may be replaced (capped) or left untreated.
For example, as shown in Table 5, the applicant age variable takes the value 999 at the 100% quantile, which actually represents a missing value.
TABLE 5 Missing values of applicant age

var | mean | median | 0% | 1% | 10% | 25% | 50% | 75% | 90% | 99% | 100% | nmiss
APPLIAGE | 41.4385538555948 | 39 | 0 | 23 | 28 | 33 | 39 | 48 | 55 | 67 | 999 | 0
In Table 5, 999 can be replaced with NA to express it as a true missing value. The statistical distribution is then seen to be uneven, so the missing values are filled with the median, and the 1% and 99% quantiles are used for capping. The data after replacing 999 with NA are shown in table 6:
TABLE 6 Missing-value treatment of applicant age

var | mean | median | 0% | 1% | 10% | 25% | 50% | 75% | 90% | 99% | 100% | nmiss
APPLIAGE | 40.623209798995 | 39 | 0 | 23 | 28 | 33 | 39 | 48 | 55 | 66 | 91 | 244
As shown in Table 7, the statistical distribution of the applicant age variable after treatment of missing values and outliers is:

TABLE 7 Applicant age after missing-value and outlier treatment

var | mean | median | 0% | 1% | 10% | 25% | 50% | 75% | 90% | 99% | 100% | nmiss
APPLIAGE | 40.6127564469115 | 39 | 23 | 23 | 28 | 33 | 39 | 48 | 55 | 66 | 66 | 0
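The age treatment above (recode the 999 sentinel to missing, fill with the median, cap at the 1% and 99% quantiles) can be sketched as follows; the tiny sample and the nearest-rank quantile are illustrative only:

```python
import statistics

# Sketch of the applicant-age treatment described above.
def quantile(sorted_vals, q):
    # nearest-rank quantile, illustrative only
    idx = min(int(q * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]

def treat_age(values, sentinel=999):
    # 1) sentinel -> missing
    vals = [None if v == sentinel else v for v in values]
    # 2) median fill (distribution is uneven)
    present = sorted(v for v in vals if v is not None)
    med = statistics.median(present)
    vals = [med if v is None else v for v in vals]
    # 3) cap at the 1% / 99% quantiles
    lo, hi = quantile(sorted(vals), 0.01), quantile(sorted(vals), 0.99)
    return [min(max(v, lo), hi) for v in vals]

ages = [23, 39, 48, 999, 33, 55, 28, 999, 66]
print(treat_age(ages))  # → [23, 39, 48, 39, 33, 55, 28, 39, 66]
```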
In one embodiment, some original variables may be processed and derived into new variables according to their statistical distributions; the new variables represent information flow more significantly and are more useful to the model.
For example, consider the variable counting the number of non-auto insurance policies held.
From the distribution statistics, the proportion of missing values for this count is very large, and a missing value actually means that the customer holds no non-auto insurance. Therefore, samples with such a policy are marked as 1 and samples without one are marked as 0, generating a binary derived variable indicating whether non-auto insurance is held, as shown in table 8:
TABLE 8 Distribution data of the non-auto policy count
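The derivation above can be sketched in a few lines; None denotes a missing count, and the input data are illustrative:

```python
# Sketch of the derived binary variable described above: a missing or zero
# count of non-auto policies means "none held" (0), anything else means 1.
def derive_has_nonauto(counts):
    return [0 if c in (None, 0) else 1 for c in counts]

print(derive_has_nonauto([None, 2, 0, 1, None]))  # → [0, 1, 0, 1, 0]
```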
Step 303: removing, from the historical variable set, the historical variables whose contribution rate to the information flow of the historical documents is smaller than a preset contribution threshold, so as to generate a plurality of preset variables.
In this step, the historical variables in the historical variable set are further analyzed. To improve the accuracy with which the variables represent policy customer loss information, historical variables with a small contribution to information flow are removed; the remaining historical variables with large contribution rates can be used as the preset variables.
Step 304: training a plurality of mathematical algorithm models respectively according to the plurality of preset variables to generate a plurality of preset recognition models.
In this step, after the preset variables are selected, a target recognition model needs to be established. The historical data sets of the preset variables are taken as training sample documents, and a plurality of mathematical algorithm models are trained with these training sample documents, so that each mathematical algorithm model produces a preset recognition model. For example, the cleaned and screened historical data of the preset variables can be re-integrated into a new table and divided into a training set and a validation set, where the ratio of the training set to the validation set can be 7:3 or 8:2.
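A minimal sketch of the 7:3 training/validation split described above (the seed and sample data are illustrative):

```python
import random

# Shuffle the samples deterministically and cut them at the 7:3 ratio
# described above; an 8:2 split would use train_ratio=0.8.
def split(samples, train_ratio=0.7, seed=42):
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, valid = split(list(range(10)))
print(len(train), len(valid))  # → 7 3
```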
In an embodiment, after the preset variables that finally enter the model are determined, mathematical algorithm models such as a CART classification tree, naive Bayes, KNN, gradient boosting, and XGBoost can be constructed. While comparing the precision of the models (for example, when constructing the XGBoost algorithm model), the parameters of each mathematical algorithm model can be automatically tuned so that they are optimal within their value ranges.
Step 305: calculating the truth degree of each preset recognition model based on the historical documents.
In this step, the historical documents may include document flow information, such as actual historical data indicating that the customer of sample document A did not renew and was lost, or that the customer of sample document B renewed, and so on. The recognition result of each preset recognition model from step 304 for sample document A is compared with the actual historical data of sample document A; if they are the same, the recognition result of that preset recognition model is true. By analogy, the recognition truth degree of each preset recognition model over the training sample set is counted.
In an embodiment, the truth degree of each model may be represented by the AUC (Area Under Curve, defined as the area enclosed between the ROC (Receiver Operating Characteristic) curve and the coordinate axes) value of that preset recognition model on the training set.
Step 306: judging whether there are, among the plurality of preset recognition models, multiple equivalent preset recognition models sharing the same, maximum truth degree. If yes, go to step 308; otherwise go to step 307.
In this step, truth degrees can be compared through the AUC values of the plurality of preset recognition models: the preset recognition model with the largest AUC value is selected first, and it is then judged whether several preset recognition models share that largest AUC value. If so, step 308 is entered; otherwise step 307 is entered.
Step 307: selecting the preset recognition model with the maximum truth degree as the target recognition model, and proceeding to step 310.
In this step, if there are no equivalent preset recognition models among the plurality of preset recognition models, that is, only one preset recognition model has the largest AUC value, that preset recognition model is taken as the target recognition model.
Step 308: calculating the accuracy of the confusion matrix of each equivalent preset recognition model.
In this step, if there are multiple equivalent preset recognition models, that is, multiple preset recognition models with the same largest AUC value, their confusion matrices need to be further calculated. The accuracy, sensitivity, hit rate, specificity, etc. of each confusion matrix can be calculated and stored as a dataframe (data frame).
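The confusion-matrix metrics used for tie-breaking can be sketched as follows (hit rate is omitted for brevity; labels and predictions are illustrative):

```python
# Sketch of confusion-matrix metrics for breaking ties between models with
# equal AUC: accuracy, sensitivity (recall), and specificity.
def confusion_metrics(actual, predicted):
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return {
        "accuracy": (tp + tn) / len(actual),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

m = confusion_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(m)  # accuracy 0.6, sensitivity ≈ 0.667, specificity 0.5
```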
Step 309: selecting, from the equivalent preset recognition models, the one whose confusion matrix has the maximum accuracy as the target recognition model.
In this step, the equivalent preset recognition model with the highest confusion-matrix accuracy may be selected as the target recognition model and stored in model form.
Step 310: acquiring the document information to be processed. See the description of step 201 in the above embodiments for details.
Step 311: identifying an initial variable set contained in the document information.
In this step, the document information to be processed includes information related to the whole document; the initial data are relatively complicated, so an initial variable set that contributes to information flow identification needs to be identified.
Step 312: cleaning invalid data from the initial variable set to generate a valid variable set.
In this step, similarly, missing or abnormal invalid data in the initial variable set are found and cleaned, so that the valid variable set represents the information flow characteristics more accurately and the calculation efficiency is improved.
Step 313: extracting the data sets corresponding to the preset variables from the valid variable set as the flow feature set of the document information.
In this step, based on the plurality of preset variables set in step 303, data are read from the valid variable set, valid data are assigned to each preset variable, and the flow feature set of the document information is generated.
Step 314: inputting the flow feature set into the target recognition model and identifying the flow direction information of the document information. See the description of step 203 in the above embodiments for details.
Please refer to fig. 4A, which shows an information flow direction identification method according to an embodiment of the present application. The method can be executed by the electronic device 1 shown in fig. 1 and can be applied to a scenario of policy loss prediction, so as to identify the flow direction information of document information by using a target recognition model according to a flow feature set of the document information. The method comprises the following steps:
Step 401: acquiring data of a plurality of historical documents. See the description of step 301 in the above embodiments for details.
Step 402: cleaning invalid data from the data of each historical document to generate a historical variable set. See the description of step 302 in the above embodiments for details.
Step 403: acquiring actual historical flow information of the historical documents.
In this step, the actual historical flow information may be the final renewal or loss information of a historical document's customer. For example, if the historical document is vehicle insurance policy A, then after policy A expired, whether the customer finally renewed, switched to another type of insurance, or did not renew can serve as actual historical flow information. Such data are recorded in the history of each document, and the corresponding actual historical flow information can be obtained from the historical database through statistical analysis.
Step 404: calculating the degree of correlation between each historical variable and the actual historical flow information.
In this step, correlation analysis may be employed to calculate the degree of correlation. Correlation analysis is a common variable-screening method in data analysis; correlation analysis between an explanatory variable x and the response variable y, as well as between explanatory variables x1 and x2, is used to obtain the degree of correlation between each historical variable and the actual historical flow information.
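A sketch of the correlation computation described above, using the Pearson coefficient on illustrative data (the variable names are hypothetical):

```python
import math

# Pearson correlation sketch for scoring a historical variable x against
# the loss indicator y.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

renew = [1, 2, 3, 4, 5]  # e.g. renewal count (illustrative data)
churn = [1, 1, 0, 0, 0]  # loss flag
print(round(pearson(renew, churn), 3))  # → -0.866
```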
Step 405: removing, from the historical variable set, the historical variables whose degree of correlation is smaller than the preset contribution threshold, so as to generate the plurality of preset variables.
In this step, some variables can be effectively eliminated based on correlation. For example, the historical variables and the actual historical flow information are treated as explanatory variables x and response variable y respectively. By calculating index values of an explanatory variable x, variables that contribute nothing to the response variable y, or whose contribution rate is smaller than the preset contribution threshold, can be effectively eliminated. For example, as shown in table 9, since the index values for the classes 0 and 1 are each about 100, it is determined that this variable has no significant influence on y and can be eliminated.
TABLE 9 variable distribution data
In one embodiment, a stepwise regression method may be used to perform variable screening on the historical variable set. The cleaned and screened data are re-integrated into a new table and divided into a training set and a validation set at a ratio of 7:3 or 8:2. A logistic regression model is then established, with the following modeling criteria:
1. The number of variables that finally enter the model is between 8 and 15. In the present embodiment, 11 variables finally enter the model.
2. There is no high correlation between the explanatory variables x1 and x2 entering the model (positive or negative correlation less than 0.6).
3. As shown in table 10, the correlation between an explanatory variable x entering the model and the response variable y must not be too low, i.e., not lower than a preset value.
TABLE 10 Correlation between selected explanatory variables x and the response variable y
4. After the logistic regression model is established, it needs to be evaluated against the following important indexes:
I. The regression coefficient p-value Pr(>|z|) is less than 0.05, as shown in table 11.
TABLE 11 Regression coefficients of selected variables

var | Estimate | Std.Error | z value | Pr(>|z|) | signif.
(Intercept) | 1.49E+00 | 1.32E-01 | 11.299 | <2e-16 | ***
DISCOUNT | -5.77E-01 | 8.86E-02 | -6.514 | 7.30E-11 | ***
RENEWNUM | -2.61E-01 | 1.58E-02 | -16.511 | <2e-16 | ***
UNDERWRITESTART | 1.83E-03 | 7.51E-04 | 2.437 | 0.0148 | *
PRDGROUP11 | -5.04E-01 | 3.74E-02 | -13.486 | <2e-16 | ***
APPLIAGE | -1.05E-02 | 1.27E-03 | -8.284 | <2e-16 | ***
II. The preset contribution threshold range of the explanatory variables is [5%, 40%]. After the logistic regression model is established on the historical variable set, the explanatory variables are normalized and a new logistic regression model is re-established; the degree to which each explanatory variable's coefficient explains the response variable y, relative to the whole, is the contribution ratio of that explanatory variable x, as shown in table 12.
TABLE 12 Contribution rates of selected variables within the preset contribution threshold
IV. AUC values and ROC curves:
As shown in fig. 4B, an ROC curve is generated for part of the historical variable set of the samples. Here:
The abscissa of the ROC curve is the FPR (false positive rate), FPR = FP/(FP + TN), i.e., the number of negative samples predicted as positive divided by the actual number of negative samples; the value range of FPR is [0, 1].
The ordinate of the ROC curve is the TPR (true positive rate), TPR = TP/(TP + FN), i.e., the number of positive samples predicted as positive divided by the actual number of positive samples; the value range of TPR is [0, 1].
The ROC curve is actually formed by connecting a number of points. Each threshold (a score above the threshold is classified as 1, otherwise as 0) corresponds to one group of classification results, that is, one pair of FPR and TPR values; multiple thresholds produce multiple points, which form the ROC curve.
The area under the ROC curve is the AUC value. The physical meaning of the AUC value is: randomly select one sample from each of the two classes 0 and 1, predict both samples with the classifier, and let p_1 be the probability that sample 1 is classified into class 1 and p_0 the probability that sample 0 is classified into class 1; then the probability that p_1 > p_0 is the AUC value. That is, the AUC value reflects the ability of the classifier to rank samples.
In this embodiment, the AUC values of the historical variables entering the logistic regression model lie within [0.5, 1].
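The pairwise definition of the AUC given above can be computed directly, as in this sketch (scores and labels are illustrative):

```python
from itertools import product

# Sketch of the AUC definition above: over every (positive, negative) pair,
# count how often the positive sample gets the higher score, ties as half.
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0]
print(auc(scores, labels))  # → 1.0 (every positive outscores every negative)
```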
V. The Lift chart of the regression model generated from the historical variable set as the training set should show a gradually descending trend. The Lift chart essentially measures the discriminating power of the model; in general, the ratio (Resp_index) of each group's actual response rate to the overall average level in the Lift chart is required to be in a descending trend. For example, based on the partial historical variable set shown in table 13 as training samples, the corresponding Lift-train chart is shown in fig. 4C, where the solid line represents the index against the overall average level, i.e., the ratio of the group's actual response rate to the overall average, and the dotted line represents the overall average level. The Lift chart shows a gradually descending trend, so the historical variable set in table 13 can be retained as preset variables.
TABLE 13 Partial historical variable set
In table 13: Cnt: the train data set is divided into 10 bins of equal sample size. Total: the predicted response rates are ranked and placed into the 10 bins in order. Resp: the actual number of responses in each bin. Rate: the proportion of actual responses in each bin. Resp_index: the ratio of the actual response rate in each bin to the overall average.
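A sketch of the Resp_index computation described above; two bins are used instead of ten for brevity, and the data are illustrative:

```python
# Sketch of the Lift computation above: rank samples by predicted score,
# split into equal bins, and compare each bin's actual response rate with
# the overall average (Resp_index).
def lift(scores, actual, bins=2):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    size = len(scores) // bins
    overall = sum(actual) / len(actual)
    out = []
    for b in range(bins):
        idx = order[b * size:(b + 1) * size]
        rate = sum(actual[i] for i in idx) / len(idx)
        out.append(rate / overall)
    return out

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
actual = [1, 1, 0, 1, 0, 0]
# a well-ranked model gives a descending Resp_index
print(lift(scores, actual))
```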
VI. There is no significant difference in performance between the training set and the validation set, avoiding overfitting. That is, the AUC values of the corresponding ROC curves show no significant difference between the training set (train) and the validation set (test). Fig. 4D shows the ROC curve of the training set, where AUC = 0.7284; fig. 4E shows the ROC curve of the validation set, where AUC = 0.7249. The difference between the two is small, so the corresponding training set and validation set meet the requirements of prediction. If the difference were large, the model would have over-learned the features of the training set and overfitting would occur, in which case the model would need to be adjusted. Whether the difference is significant can be determined based on historical statistical data and analysis of the actual application scenario.
The historical variables satisfying the above conditions may be used as preset variables.
In an embodiment, the VIF (Variance Inflation Factor) value may be calculated to perform variable screening on the historical variable set; for example, the historical variables whose VIF values are less than 2 may be retained, effectively reducing the collinearity of the regression model, as shown in table 14.
TABLE 14 Selected variables with VIF values less than 2

var | vif.fit.
AGENTID1 | 1.122457
DISCOUNT | 1.403647
RENEWNUM | 1.472236
PRDGROUP1 | 1.328946
APPLIAGE | 1.058096
APPLINOCARNUMS1 | 1.021844
ISJTCUST | 1.018463
PURCHASEPRICE | 1.041563
APPLICARMINYEAR1 | 1.476288
RATE | 1.051276
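As a sketch of the VIF screen, for the special case of one explanatory variable regressed on a single other regressor, VIF = 1/(1 - R^2), where R^2 equals the squared correlation between the two; the data below are illustrative, not the values in table 14:

```python
# VIF sketch for a pair of explanatory variables: with a single other
# regressor, R^2 is the squared Pearson correlation, and VIF = 1/(1 - R^2).
def vif_pair(x1, x2):
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r2 = cov * cov / (v1 * v2)
    return 1.0 / (1.0 - r2)

discount = [0.6, 0.9, 0.7, 1.0, 0.8]  # illustrative values
renewnum = [2, 1, 5, 3, 4]
v = vif_pair(discount, renewnum)
print(round(v, 3), v < 2)  # variables with VIF below 2 are retained
```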
In one embodiment, a statistical distribution plot of the historical variable set in RStudio (a development environment for the R statistical analysis language) can be used to analyze whether an explanatory variable x has an influence on the response variable y, thereby helping screen preset variables with an appropriate contribution rate out of the historical variable set.
For example, take the explanatory variable x = premium discount. A statistical distribution plot of the premium discount x against the response variable y (policy customer churn) is drawn through the rattle interface in R. If the analysis of the plot shows that the larger the discount, the lower the loss proportion, the explanatory variable x has an influence on the response variable y, and the explanatory variable x = premium discount can be provisionally retained as a preliminary screen.
In an embodiment, by combining the above variable-screening methods and performing a final round of data analysis, the significant factors (preset variables) influencing customer churn are screened out of the historical variable set, as shown in table 15:
TABLE 15 Preset variables

Meaning | Variable
Whether the customer churned | REFLAG
Agent | AGENTID1
Premium discount | DISCOUNT
Consecutive renewal count | RENEWNUM
Insurance product group | PRDGROUP1
Applicant age | APPLIAGE
Whether the customer holds non-auto insurance | APPLINOCARNUMS1
Whether a group customer | ISJTCUST
New-car purchase price | PURCHASEPRICE
Earliest year of the customer's auto insurance | APPLICARMINYEAR1
Brand churn rate | RATE
Step 406: training a plurality of mathematical algorithm models respectively according to the plurality of preset variables to generate a plurality of preset recognition models. See the description of step 304 in the above embodiments for details.
Step 407: calculating the truth degree of each preset recognition model based on the historical documents. See the description of step 305 in the above embodiments for details.
Step 408: judging whether there are, among the plurality of preset recognition models, multiple equivalent preset recognition models sharing the same, maximum truth degree. If yes, go to step 410; otherwise go to step 409. See the description of step 306 in the above embodiments for details.
Step 409: if there are no equivalent preset recognition models among the plurality of preset recognition models, selecting the preset recognition model with the maximum truth degree as the target recognition model, and proceeding to step 412. See the description of step 307 in the above embodiments for details.
Step 410: if there are multiple equivalent preset recognition models among the plurality of preset recognition models, calculating the accuracy of the confusion matrix of each equivalent preset recognition model. See the description of step 308 in the above embodiments for details.
Step 411: selecting, from the equivalent preset recognition models, the one whose confusion matrix has the maximum accuracy as the target recognition model. See the description of step 309 in the above embodiments for details.
Step 412: acquiring the document information to be processed. See the description of step 201 in the above embodiments for details.
Step 413: analyzing the document information according to the data dimensions corresponding to the document information to generate a plurality of initial variables.
In this step, the data dimensions may correspond to the type of the document information. For example, the data dimensions of a vehicle insurance policy can be divided into: customer data, vehicle data, insurance data, life-insurance customer data, business member data, and brand data dimensions. The document information is analyzed based on the corresponding data dimensions to obtain a plurality of initial variables.
Step 414: analyzing actual data of each initial variable in the document information, and classifying all the initial variables according to a preset classification rule.
In this step, the actual meaning of each initial variable in all the document information to be processed is read. The preset classification rule may be similar to the classification scheme shown in table 2, in which the initial variables are classified into continuous variables, categorical variables, date variables, and non-functional variables.
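A minimal sketch of one such classification rule follows; the rule, the non-functional list, and the variable names and sample values are illustrative assumptions, not the scheme of table 2.

```python
from datetime import date

# Illustrative rule in the spirit of step 414: sort each initial variable
# into continuous, categorical, date, or non-functional by inspecting a
# sample value. The non-functional list is hypothetical.
def classify_variable(name, sample, non_functional=("policy_id",)):
    if name in non_functional:
        return "non-functional"
    if isinstance(sample, date):
        return "date"
    if isinstance(sample, bool):  # bool subclasses int; treat as categorical
        return "categorical"
    if isinstance(sample, (int, float)):
        return "continuous"
    return "categorical"

variables = {"policy_id": "A1", "premium": 1200.5,
             "brand": "X", "expiry": date(2020, 1, 1)}
print({n: classify_variable(n, v) for n, v in variables.items()})
```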
Step 415: and generating an initial variable set according to the actual data and the classification result.
In this step, the actual data, i.e., the meaning of each initial variable in the actual scenario, is combined with the classification result to form the initial variable set, which may be stored in a manner similar to that shown in table 2.
Step 416: and carrying out invalid data cleaning on the initial variable set to generate an effective variable set. See the description of step 312 in the above embodiments for details.
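One plausible form of such invalid-data cleaning is sketched below; the missing-rate threshold, fill value, and field names are assumptions, not details specified by the embodiment.

```python
# Hypothetical cleaning for step 416: drop variables whose missing rate
# exceeds a threshold, then fill remaining gaps with a default value,
# turning the initial variable set into the effective variable set.
def clean_variables(records, missing_threshold=0.5, fill=0):
    n = len(records)
    kept = [c for c in records[0]
            if sum(r[c] is None for r in records) / n <= missing_threshold]
    return [{c: (r[c] if r[c] is not None else fill) for c in kept}
            for r in records]

raw = [{"premium": 1200, "mileage": None,  "note": None},
       {"premium": 950,  "mileage": 30000, "note": None}]
# "note" is dropped (entirely missing); the missing "mileage" value is filled
print(clean_variables(raw))
```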
Step 417: extracting the data set corresponding to the preset variables from the effective variable set as the flow feature set of the document information. See the description of step 313 in the above embodiments for details.
Step 418: inputting the flow feature set into the target recognition model to identify the flow direction information of the document information. See the description of step 203 in the above embodiments for details.
According to the information flow direction identification method above, because application scenarios (here, vehicle insurance policy data) and data platforms differ, recognition models can be constructed separately for each. Different sets of document information contain different data, so the variables that finally enter the model differ, and the resulting recognition models differ as well. When customer attrition probability prediction needs to be performed on a certain type of document data, the original data of the branch to be predicted is first processed in the database, with missing values and abnormal values filled. A database connection is then established in the development environment through ODBC (Open Database Connectivity), the processed data to be predicted is read, and the previously stored target recognition model is loaded. Finally, customer attrition probability prediction is performed on the data, and the prediction results are stored in a data frame and written back to the data storage structure. The prediction results include, but are not limited to, an attrition/persistence flag and the corresponding probability. The prediction results can be sent to the foreground terminal for page display; for example, the prediction result of each policy can be displayed on the foreground page. Export of bulk files may also be provided.
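The scoring flow described above can be sketched as follows. A trivial stand-in model and hypothetical field names replace the real target recognition model and the ODBC read, both of which depend on the deployment.

```python
import pickle

# Stand-in for the stored target recognition model; a real deployment would
# load the trained model and read the processed records over an ODBC
# connection instead of using the in-memory rows below.
class StubModel:
    def predict_proba(self, rows):
        return [min(0.99, 0.3 + 0.1 * r["claims"]) for r in rows]

with open("model.pkl", "wb") as f:     # stand-in for the saved model file
    pickle.dump(StubModel(), f)

with open("model.pkl", "rb") as f:     # load the previously stored model
    model = pickle.load(f)

rows = [{"claims": 0}, {"claims": 5}]  # processed records to be predicted
probs = model.predict_proba(rows)
results = [{"flag": "loss" if p >= 0.5 else "persistence", "probability": p}
           for p in probs]             # written back to the data store
print(results)
```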
Please refer to fig. 5, which is an information flow direction identification apparatus 500 according to an embodiment of the present application, and the apparatus can be applied to the electronic device 1 shown in fig. 1 and can be applied to a scenario of predicting policy loss information, so as to identify the flow direction information of the document information by using a target identification model according to a flow feature set of the document information. The device includes: the first obtaining module 501, the analyzing module 502 and the identifying module 503 are as follows:
the first obtaining module 501 is configured to obtain document information to be processed. See the description of step 201 in the above embodiments for details.
The analyzing module 502 is configured to analyze the document information according to a preset variable, and generate a flow feature set of the document information. See the description of step 202 in the above embodiments for details.
The identifying module 503 is configured to input the flow feature set into the target recognition model and identify the flow direction information of the document information. See the description of step 203 in the above embodiments for details.
In one embodiment, the parsing module 502 is configured to: identify an initial variable set contained in the document information; perform invalid data cleaning on the initial variable set to generate an effective variable set; and extract the data set corresponding to the preset variables from the effective variable set as the flow feature set of the document information. See the description of steps 311 to 313 in the above embodiments for details.
In one embodiment, identifying the initial variable set contained in the document information includes: parsing the document information according to the data dimensions corresponding to the document information to generate a plurality of initial variables; analyzing the actual data of each initial variable in the document information and classifying all the initial variables according to a preset classification rule; and generating the initial variable set according to the actual data and the classification result. See the description of steps 413 to 415 in the above embodiments for details.
In one embodiment, the apparatus further comprises: a second obtaining module 504, configured to obtain data of a plurality of historical documents; a cleaning module 505, configured to perform invalid data cleaning on the data of each historical document to generate a historical variable set; and a removing module 506, configured to generate a plurality of preset variables after removing, from the historical variable set, the historical variables whose contribution rate to the information flow direction of the historical documents is smaller than a preset contribution threshold. See the description of steps 301 to 303 in the above embodiments for details.
In one embodiment, the removing module 506 is configured to: obtain the actual historical flow direction information of the historical documents; calculate the correlation degree between each historical variable and the actual historical flow direction information; and generate a plurality of preset variables after removing, from the historical variable set, the historical variables whose correlation degree is smaller than the preset contribution threshold. See the description of steps 404 to 405 in the above embodiments for details.
In one embodiment, the apparatus further comprises: a training module 507, configured to train a plurality of mathematical algorithm models according to the plurality of preset variables to generate a plurality of preset recognition models; a calculation module, configured to calculate the truth degree of each preset recognition model based on the historical documents; a determining module 508, configured to determine whether a plurality of equivalent preset recognition models with the same, maximum truth degree exist among the plurality of preset recognition models; and a selecting module 509, configured to select, if no equivalent preset recognition models exist among the plurality of preset recognition models, the preset recognition model with the maximum truth degree as the target recognition model. See the description of steps 304 to 307 in the above embodiments for details.
In an embodiment, the calculation module is further configured to calculate the accuracy of the confusion matrix of each equivalent preset recognition model if a plurality of equivalent preset recognition models exist among the plurality of preset recognition models. The selecting module 509 is further configured to select, from among the plurality of equivalent preset recognition models, the one with the maximum accuracy of the confusion matrix as the target recognition model. See the description of steps 308 to 309 in the above embodiments for details.
For a detailed description of the information flow identification device 500, please refer to the description of the related method steps in the above embodiments.
An embodiment of the present invention further provides a non-transitory electronic-device-readable storage medium, including a program that, when run on an electronic device, causes the electronic device to perform all or part of the flows of the methods in the above embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk (Hard Disk Drive, HDD), a Solid State Drive (SSD), or the like. The storage medium may also comprise a combination of the above kinds of memories.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.