
CN111178377A - Visual feature screening method, server and storage medium

Info

Publication number: CN111178377A
Application number: CN201910977284.2A
Authority: CN
Inventors: 龚燕; 梁树峰; 李希加; 徐斌
Original assignee: Weikun Shanghai Technology Service Co Ltd
Current assignee: Weikun Shanghai Technology Service Co Ltd
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2020-05-19

Abstract

The invention discloses a visual feature screening method, a server and a storage medium. Numerical and non-numerical feature variables are extracted from user data within a preset period; the numerical feature variables are screened by variance; the mean values of the feature variables for each type of user are displayed in a radar chart; the association degrees among the screened feature variables, and between the feature variables and the target variable, are calculated and displayed in a heat map; and the feature variables are screened again according to an instruction issued on the basis of the heat map and according to a preset condition.

Description

Visual feature screening method, server and storage medium

Technical Field

The invention relates to the technical field of computers, in particular to a visual feature screening method, a server and a computer storage medium.

Background

In a machine-learning-based prediction or classification model, the historical data used to train the model is multidimensional: a large number of feature variables are derived from the data, the total data volume is large, and continuous and discrete feature variables are present at the same time. Two problems therefore remain to be solved. The first is how to screen out, from a large amount of feature data, the target features closely related to the prediction or classification result, so that the model predicts or classifies more accurately. The second is that, for users who do not know how machine learning works, a screening process that behaves like a black box makes feature screening poorly interpretable.

Disclosure of Invention

The main object of the invention is to provide a visual feature screening method, a server and a computer storage medium, so as to solve the technical problems in the prior art that feature screening for prediction or classification with a machine learning model has low accuracy and low interpretability.

In order to achieve the above object, the present invention provides a visual feature screening method, including the steps of:

acquiring data of a user in a preset period, and extracting a first feature set, a second feature set and a target variable from the data of the user according to a data type and a data identifier, wherein the first feature set comprises a numerical feature variable, the second feature set comprises a non-numerical feature variable, and the target variable is a classification result of the user;

acquiring the variance of each feature variable in the first feature set, and removing the feature variables of which the variances do not meet a first preset condition in the first feature set to obtain a third feature set;

drawing and displaying a radar chart coordinate system in a display area, calculating, based on the data of each type of user, a mean value of each feature variable in the third feature set, and drawing a polygon corresponding to the data of each type of user in the radar chart coordinate system according to the mean values;

receiving a first instruction, determining a target user according to the first instruction, acquiring the feature values of the target user for the feature variables in the third feature set, and drawing and displaying a polygon corresponding to the target user in the radar chart coordinate system according to the feature values;

calculating a first association degree between the feature variables in the second feature set and the third feature set, and calculating a second association degree between each feature variable in the second feature set and the third feature set and the target variable;

drawing and displaying a corresponding heat map in the display area according to the first association degree and the second association degree, wherein the heat map comprises a plurality of cells, each cell corresponds to one first association degree or one second association degree, and each cell is filled with a color corresponding to its association degree;

receiving a second instruction issued based on the heat map, and removing the feature variables that the second instruction indicates are to be removed from the second feature set and the third feature set to obtain a fourth feature set;

and removing the feature variables of which the second association degrees do not meet a second preset condition from the fourth feature set.

Optionally, after the step of acquiring the variance of each feature variable in the first feature set and removing the feature variables whose variances do not satisfy the first preset condition to obtain a third feature set, the method further includes:

calculating the Pearson correlation coefficient between each feature variable in the third feature set and the target variable, and removing the feature variables whose Pearson correlation coefficients fall outside a preset coefficient threshold range.

Optionally, the association degree is a mutual information value, and the second preset condition is that the mutual information value is smaller than a mutual information threshold.

Optionally, the data of the user is financial data of the user, and the step of acquiring the data of the user in a preset period and extracting a first feature set from the data of the user includes:

acquiring financial data of the user in the preset period, and obtaining a first numerical feature variable from the financial data;

performing a year-over-year and/or period-over-period comparison on the financial data to obtain a second numerical feature variable;

and taking the union of the first numerical feature variable and the second numerical feature variable as the first feature set.

Optionally, after the step of acquiring the user data in the preset period, the method further includes:

performing dimensionality reduction processing on the user data to obtain data points of the user data in a two-dimensional space;

and drawing a two-dimensional coordinate system in the display area, and displaying the data points in the two-dimensional coordinate system.

Optionally, after the step of acquiring the data of the user in the preset period and extracting the first feature set, the second feature set and the target variable from the data of the user, the method further includes:

drawing and displaying, in the display area, a probability distribution map of each feature variable in the first feature set and the second feature set;

receiving a third instruction issued based on the probability distribution map, and removing the feature variables that the third instruction indicates are to be removed from the second feature set and the third feature set.

receiving a fourth instruction, and acquiring a target characteristic variable indicated and displayed by the fourth instruction, wherein the target characteristic variable is a characteristic variable in the first characteristic set or the second characteristic set;

acquiring the characteristic value of the target characteristic variable at continuous time points;

and drawing a two-dimensional coordinate system in the display area, and displaying the characteristic values of the target characteristic variable at continuous time points in the two-dimensional coordinate system.

Optionally, after the step of removing from the fourth feature set the feature variables whose second association degrees do not satisfy the second preset condition, the method further includes:

acquiring user data to be classified, and extracting feature variables to be screened from the user data to be classified;

matching the feature variables to be screened against the fourth feature set to obtain matched feature variables;

inputting the matched feature variables into a preset classification model for processing to obtain a user classification result;

and for each matched feature variable of each type of user, acquiring the corresponding statistical features, and drawing a corresponding box plot in the display area according to the statistical features, wherein the statistical features comprise a maximum value, a minimum value, a median and two quartiles.
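As a hedged illustration of the statistical features named above, the following Python sketch computes the five values a box plot needs for one matched feature variable within one class of users. The function name `five_number_summary`, the sample values and the choice of the inclusive quartile method are assumptions made for illustration; the source does not say how the quartiles are computed.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the minimum, two quartiles, median and maximum of one feature variable.

    Assumption: quartiles use the 'inclusive' method of statistics.quantiles;
    the source does not specify a quartile convention.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": med, "q3": q3, "max": max(values)}


# Hypothetical feature values of one matched feature variable for one user class.
sample_values = [1, 2, 3, 4, 5]
summary = five_number_summary(sample_values)
```

A plotting layer would draw the box from `q1` to `q3`, mark `median` inside it, and extend the whiskers to `min` and `max`.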

Further, to achieve the above object, the present invention also provides a server comprising: a memory, a processor and a processing program of visual feature screening stored on the memory and executable on the processor, the processing program of visual feature screening implementing the steps of the visual feature screening method as described above when executed by the processor.

In addition, in order to achieve the above object, the present invention further provides a computer storage medium, wherein the computer storage medium stores a processing program for visual feature screening, and the processing program for visual feature screening implements the steps of the visual feature screening method as described above when executed by a processor.

According to the visual feature screening method, the server and the computer storage medium provided by the embodiments of the invention, numerical and non-numerical feature variables are extracted from user data within a preset period; the numerical feature variables are screened by variance; the mean values of the feature variables for each type of user are displayed in a radar chart; the association degrees among the screened feature variables, and between the feature variables and the target variable, are calculated and displayed in a heat map; and the feature variables are screened again according to an instruction issued based on the heat map and according to a preset condition. By graphically displaying the mean values of the feature variables, the association degrees among the feature variables and the association degrees between the feature variables and the target variable, the method shows intuitively how the feature variables influence the target variable and lets workers take part directly in the feature screening process on the basis of the graphical display. This increases the interpretability of the feature screening process and improves the accuracy of feature screening.

Drawings

FIG. 1 is a schematic diagram of a server according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart diagram illustrating a visual feature screening method according to a first embodiment of the present invention;

FIG. 3 is a radar chart in a first embodiment of the visual feature screening method of the present invention;

FIG. 4 is a heat map in a first embodiment of the visual feature screening method of the present invention;

FIG. 5 is a schematic flow chart of a visual feature screening method according to a second embodiment of the present invention;

FIG. 6 is a two-dimensional view of a visual feature screening method according to a third embodiment of the present invention;

FIG. 7 is a probability distribution diagram in a fourth embodiment of the visual feature screening method of the present invention;

FIG. 8 is a time-series diagram of a feature variable in a fifth embodiment of the visual feature screening method of the present invention;

FIG. 9 is a box plot in a sixth embodiment of the visual feature screening method of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The main solution of the embodiments of the invention is as follows: acquiring data of a user in a preset period, and extracting a first feature set, a second feature set and a target variable from the data of the user according to a data type and a data identifier, wherein the first feature set comprises numerical feature variables, the second feature set comprises non-numerical feature variables, and the target variable is a classification result of the user; acquiring the variance of each feature variable in the first feature set, and removing from the first feature set the feature variables whose variances do not meet a first preset condition, to obtain a third feature set; drawing and displaying a radar chart coordinate system in a display area, calculating, based on the data of each type of user, a mean value of each feature variable in the third feature set, and drawing a polygon corresponding to the data of each type of user in the radar chart coordinate system according to the mean values; receiving a first instruction, determining a target user according to the first instruction, acquiring the feature values of the target user for the feature variables in the third feature set, and drawing and displaying a polygon corresponding to the target user in the radar chart coordinate system according to the feature values; calculating a first association degree between the feature variables in the second feature set and the third feature set, and calculating a second association degree between each feature variable in the second feature set and the third feature set and the target variable; drawing and displaying a corresponding heat map in the display area according to the first association degree and the second association degree, wherein the heat map comprises a plurality of cells, each cell corresponds to one first association degree or one second association degree, and each cell is filled with a color corresponding to its association degree; receiving a second instruction issued based on the heat map, and removing the feature variables that the second instruction indicates are to be removed from the second feature set and the third feature set to obtain a fourth feature set; and removing from the fourth feature set the feature variables whose second association degrees do not meet a second preset condition.

The invention provides a feature screening method for a bond default model. Numerical and non-numerical feature variables are extracted from user data within a preset period; the numerical feature variables are screened by variance; the mean values of the feature variables for each type of user are displayed in a radar chart; the association degrees among the screened feature variables, and between the feature variables and the target variable, are calculated and displayed in a heat map; and the feature variables are screened again according to an instruction issued based on the heat map and according to a preset condition. By graphically displaying the mean values of the feature variables, the association degrees among the feature variables and the association degrees between the feature variables and the target variable, the method shows intuitively how the feature variables influence the target variable and lets workers take part directly in the feature screening process on the basis of the graphical display. This increases the interpretability of the feature screening process and improves the accuracy of feature screening.

FIG. 1 is a schematic structural diagram of a server according to an embodiment of the present invention.

As shown in FIG. 1, the server may include a processor 1001 (such as a CPU), a communication bus 1002 and a memory 1003. The communication bus 1002 enables communication among these components. The memory 1003 may be a high-speed RAM or a non-volatile memory (e.g., a disk memory). Alternatively, the memory 1003 may be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the server structure shown in FIG. 1 does not limit the server, which may include more or fewer components than shown, combine some components, or arrange the components differently.

As shown in FIG. 1, the memory 1003, which is a kind of computer storage medium, may contain an operating system and a processing program for the visual feature screening method.

Referring to fig. 2, a first embodiment of the present invention provides a visual feature screening method, including the following steps:

Step S10, acquiring user data in a preset period, and extracting a first feature set, a second feature set and a target variable from the user data according to a data type and a data identifier, wherein the first feature set comprises numerical feature variables, the second feature set comprises non-numerical feature variables, and the target variable is a classification result of the user;

In this embodiment, the user is a company user, and the user data is the financial data of a company. The financial data of a company falls into several categories, including the company's financial report data, stock change data, senior management public opinion data, industry data and historical default data. Each category comprises multiple data items. For example, the financial report data includes sales income, net main business income, total liabilities, total current assets and the like; the stock change data includes shareholder pledge data, shareholder holding-reduction data and the like; the senior management public opinion data includes senior management violation records, business suspension and bankruptcy records, merger, acquisition and restructuring data and the like; the industry data includes industry default rates; and the historical default data includes default records of bonds of the same type, bond default records of the same company and the like. Different categories of financial data have different preset periods. The preset period may be expressed in years or months, for example two years or six months.

Each category of financial data is identified by a unique data identifier, which may be an ID number or a category name. The data type of each category of financial data may be labeled in advance; alternatively, a recognition model may extract features of the financial data and automatically identify its data type from those features. The data types are numerical and non-numerical.

Therefore, after the financial report data and the stock change data are separated from the company's financial data according to their data identifiers, they are identified as numerical data according to their data types, and numerical calculation is then performed on them to obtain the first feature set, which consists of numerical feature variables.

After the senior management public opinion data, the industry data and the historical default data are separated from the company's financial data according to their data identifiers, they can be identified as non-numerical data according to their data types, because they are all evaluative or descriptive information reflecting the company's financial condition. Semantic analysis is then performed on them to obtain the second feature set, which consists entirely of non-numerical feature variables.

It should be noted that the data acquired in this embodiment for the preset period is the data of multiple users, that is, the financial data of multiple companies, and that the financial data has a time dimension and a feature dimension. For example, if the financial data comes from 4000 companies, then each category of financial data, such as the financial report data, contains data from 4000 companies. If the financial report data is further subdivided into 64 subclasses, each subclass can be regarded as a feature, and each feature has one feature value per company. These 4000 feature values together form one feature variable, so analyzing and processing the financial report data yields 64 feature variables, each with 4000 feature values. If the preset period is five years, so that five years of financial report data are acquired for the 4000 companies, and the data of each subclass is monthly, then each feature value has a dimension of 5 × 12, that is, 60 data points.

Taking the financial report data as an example: the company's total liabilities and total assets can be obtained from the financial report data, and the ratio of total liabilities to total assets gives the company's asset-liability ratio. The asset-liability ratio is a feature variable that reflects what proportion of the company's total assets is financed by borrowing. The company's sales income and average accounts receivable can be obtained from the financial report data, and their ratio gives the accounts receivable turnover rate, a feature variable that reflects the company's short-term debt-paying ability. The company's net main business income and average total current assets can be obtained from the financial report data, and their ratio gives the current asset turnover rate, a feature variable that reflects how well the company uses its assets.
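The three ratios described above can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary keys and the figures for the hypothetical company are assumptions, not part of the source.

```python
def asset_liability_ratio(total_liabilities, total_assets):
    """Share of total assets financed by borrowing."""
    return total_liabilities / total_assets


def receivables_turnover(sales_income, average_receivables):
    """Sales income divided by average accounts receivable."""
    return sales_income / average_receivables


def current_asset_turnover(net_main_business_income, average_current_assets):
    """Net main business income divided by average total current assets."""
    return net_main_business_income / average_current_assets


# Hypothetical financial report figures for one company (illustrative only).
report = {
    "total_liabilities": 60.0,
    "total_assets": 100.0,
    "sales_income": 240.0,
    "average_receivables": 40.0,
    "net_main_business_income": 150.0,
    "average_current_assets": 50.0,
}

feature_values = {
    "asset_liability_ratio": asset_liability_ratio(
        report["total_liabilities"], report["total_assets"]),
    "receivables_turnover": receivables_turnover(
        report["sales_income"], report["average_receivables"]),
    "current_asset_turnover": current_asset_turnover(
        report["net_main_business_income"], report["average_current_assets"]),
}
```

Each entry of `feature_values` is one feature value for one company; collecting the same entry across all companies gives one feature variable.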

It is understood that, before the feature variables are extracted from the company's financial report data and stock change data, the data needs to be preprocessed, where the preprocessing includes operations such as deleting abnormal values and filling in missing data.
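One possible form of this preprocessing is sketched below in Python. The source names only the two operations (deleting abnormal values and filling in missing data); the rule used here, which treats values outside a caller-supplied range as abnormal and fills gaps with the median of the remaining values, is an assumption, as is the function name `preprocess_feature`.

```python
from statistics import median


def preprocess_feature(values, lower, upper):
    """Delete abnormal values, then fill missing entries.

    Assumptions (not from the source): a value is abnormal when it lies
    outside [lower, upper]; missing entries are represented by None and are
    filled with the median of the remaining values.
    Raises statistics.StatisticsError if no valid value remains.
    """
    # Step 1: delete abnormal values by turning them into missing entries.
    cleaned = [v if v is not None and lower <= v <= upper else None for v in values]
    # Step 2: fill every missing entry with the median of the observed values.
    fill = median(v for v in cleaned if v is not None)
    return [fill if v is None else v for v in cleaned]


# Hypothetical asset-liability ratios for five companies: one gap, one outlier.
raw = [0.5, None, 0.7, 99.0, 0.6]
clean = preprocess_feature(raw, lower=0.0, upper=1.0)
```

Keeping the list length unchanged preserves the one-value-per-company alignment that the feature variable matrix relies on.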

The feature variables are obtained by preprocessing the financial report data, the stock change data, the senior management public opinion data, the industry data and the historical default data. For example, if the data comes from 4000 companies, the financial report data yields 64 feature variables, the stock change data yields 5 feature variables, and the senior management public opinion data, the shareholder investment data and the historical default data together yield 8 feature variables, then a feature variable matrix of dimension 4000 × 77 is obtained.

Further, more feature variables can be obtained as training data for the bond default risk assessment model through the following steps: 1) acquiring the financial data of a company in a preset period, and obtaining a first numerical feature variable from the financial data; 2) performing a year-over-year and/or period-over-period comparison on the financial data to obtain a second numerical feature variable; 3) taking the union of the first numerical feature variable and the second numerical feature variable as the first feature set.

For example, a period-over-period comparison of the equity attributable to shareholders of the parent company across different periods gives the change in that equity. Specifically, the time period T1 is set to one quarter, and the data D1, D2, D3 and D4 of the four consecutive quarters of 2018 are acquired for each parent company. The period-over-period growth rate of the second quarter is (D2 - D1) / D1 × 100%, and the growth rates of the third and fourth quarters are obtained by analogy.
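The growth-rate calculation above can be sketched as follows. The function name `growth_rate`, its `lag` parameter and the quarterly figures are illustrative assumptions; with `lag=1` the function gives the period-over-period rate (D2 - D1) / D1 × 100%, and with quarterly data and `lag=4` it gives a year-over-year rate.

```python
def growth_rate(series, lag=1):
    """Percentage change of each value relative to the value `lag` periods earlier.

    lag=1 compares consecutive periods (period-over-period); for quarterly
    data, lag=4 compares with the same quarter of the previous year
    (year-over-year). A zero base value raises ZeroDivisionError.
    """
    return [(cur - prev) / prev * 100.0 for prev, cur in zip(series, series[lag:])]


# Hypothetical quarterly equity figures D1..D4 for one company in 2018.
quarters_2018 = [100.0, 110.0, 99.0, 99.0]
qoq = growth_rate(quarters_2018)  # growth rates for Q2, Q3 and Q4
```

Each resulting rate can serve as a second numerical feature variable alongside the raw figures.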

In this embodiment, the visual feature screening method is applied to a bond default model, which takes the feature variables derived from the financial data of a target company user as input and predicts that company user's bond default result. The target variable of the bond default model is therefore the bond default result of the company user, which is also the classification result of the company user. The bond default result may be a binary classification, such as "default" or "non-default", or a bond default risk grade, such as "high default risk", "medium default risk" or "low default risk".

It should be noted that the company financial data processed in this embodiment is historical data, so each company has already been labeled in advance with a classification result: either "default" or "non-default", or else "high default risk", "medium default risk" or "low default risk".

Step S20, acquiring the variance of each feature variable in the first feature set, and removing from the first feature set the feature variables whose variances do not meet a first preset condition, to obtain a third feature set;

A large variance indicates that the values of a feature variable are widely dispersed, so that differences in its values carry information about the target. Conversely, if the variance of a feature is close to zero, that is, all samples have very similar values for the feature, the feature contributes very little to model training and should be discarded. The first preset condition is therefore that the variance is greater than a preset variance threshold.

The variance of each feature variable in the first feature set is calculated, and the feature variables whose variance is less than or equal to the preset variance threshold are removed from the first feature set to obtain the third feature set. The variances of the feature variables in the third feature set are therefore all greater than the preset variance threshold.

It should be noted that, if the data for the preset period comes from multiple companies, for example 4000 companies, the preset period is 5 years, and the data of each feature variable in the first feature set is monthly, then the variance of each feature variable is calculated over 4000 × 5 × 12 = 240,000 data points.
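A minimal Python sketch of the variance screening in step S20 follows. The function name `variance_screen`, the feature names, the threshold value and the use of the population variance are assumptions for illustration; the source states only that feature variables whose variance is less than or equal to the preset threshold are removed.

```python
from statistics import pvariance


def variance_screen(features, threshold):
    """Keep only the feature variables whose variance exceeds the threshold.

    `features` maps a feature name to the flat list of all its data points
    (all companies and all time points). Assumption: population variance.
    """
    return {name: values for name, values in features.items()
            if pvariance(values) > threshold}


# Hypothetical first feature set with one nearly constant feature variable.
first_feature_set = {
    "cash_ratio": [0.10, 0.45, 0.80, 0.30],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],
}
third_feature_set = variance_screen(first_feature_set, threshold=0.01)
```

Here `constant_flag` has zero variance and is dropped, while `cash_ratio` is kept.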

Step S30, drawing and displaying a radar chart coordinate system in a display area, calculating, based on the data of each type of user, the mean value of each feature variable in the third feature set, and drawing polygons corresponding to the data of each type of user in the radar chart coordinate system according to the mean values;

As shown in FIG. 3, the feature variables in the third feature set include: equity ratio, revenue equity growth rate, working capital turnover rate, times interest earned, ratio of tangible assets to debt, cash ratio, ratio of current liabilities to total liabilities, and ratio of equity attributable to shareholders of the parent company to total invested capital. The mean values of the feature variables in the third feature set are calculated for the "non-default" (i.e., normal) company users, and the corresponding polygon is drawn and displayed in the radar chart coordinate system according to these mean values; likewise, the mean values of the feature variables in the third feature set are calculated for the "default" company users, and the corresponding polygon is drawn and displayed in the radar chart coordinate system according to these mean values.
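The per-class means and the polygon geometry of step S30 can be sketched without a plotting library. Everything named here is an assumption for illustration: the functions `class_means` and `radar_polygon`, the two feature names, the sample rows, and the choice to scale each axis by a per-feature maximum so that all axes share the range 0 to 1.

```python
import math
from statistics import mean


def class_means(rows, labels, feature_names):
    """Mean of every feature variable, computed separately for each class label."""
    result = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        result[label] = [mean(row[name] for row in members) for name in feature_names]
    return result


def radar_polygon(values, axis_max):
    """Vertices (x, y) of a radar chart polygon.

    Axis k points at angle 2*pi*k/n; the radius on each axis is the value
    divided by that axis's maximum (an assumed scaling convention).
    """
    n = len(values)
    vertices = []
    for k, (value, top) in enumerate(zip(values, axis_max)):
        angle = 2.0 * math.pi * k / n
        radius = value / top
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices


# Hypothetical data: three companies, two feature variables, two classes.
names = ["cash_ratio", "equity_ratio"]
rows = [
    {"cash_ratio": 0.2, "equity_ratio": 1.0},
    {"cash_ratio": 0.4, "equity_ratio": 3.0},
    {"cash_ratio": 0.1, "equity_ratio": 4.0},
]
labels = ["non-default", "non-default", "default"]

means = class_means(rows, labels, names)
axis_max = [max(row[name] for row in rows) for name in names]
polygons = {label: radar_polygon(vals, axis_max) for label, vals in means.items()}
```

A single target user, as in the first instruction of the method, would be drawn the same way by passing that user's own feature values to `radar_polygon` instead of class means.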

Step S40, receiving a first instruction, determining a target user according to the first instruction, acquiring the feature values of the target user for the feature variables in the third feature set, and drawing and displaying a polygon corresponding to the target user in the radar chart coordinate system according to the feature values;

The visual feature screening method provided by this embodiment is applied to a feature screening device comprising a processor, a memory and a display, and a worker uses the feature screening device to screen features in visual form.

When the display of the feature screening device shows the polygons corresponding to each type of user data in the radar chart coordinate system, for example the polygon for the "non-default" (i.e., normal) company users and the polygon for the "default" company users in FIG. 3, a worker who wants to see the polygon of a particular company user in the radar chart coordinate system may issue the first instruction by touching the display, by voice or by keyboard, to instruct the feature screening device to display the polygon corresponding to that company user's data.

As shown in FIG. 3, the feature screening device determines the company user according to the received first instruction, obtains the feature values of that company user for the feature variables in the third feature set, and draws and displays the polygon corresponding to the company user in the radar chart coordinate system according to the feature values.

Step S50, calculating a first degree of association between the feature variables in the second feature set and the third feature set, and calculating a second degree of association between each feature variable in the second feature set and the third feature set and the target variable.

In this embodiment, the feature variables screened out through visualization are the ones that, for a company user to be predicted, are input into the bond default model to predict the company's bond default result. The degree of association between a feature variable and the target variable reflects the influence of the feature variable on the correct prediction of the bond default model: the greater the degree of association, the greater the influence of the feature variable on the model's prediction result; the smaller the degree of association, the smaller that influence.

Therefore, in this step, the degree of association between the feature variables is calculated, and if the degree of association between two feature variables is high, one of them needs to be removed. The degree of association between each feature variable and the target variable is also calculated, so that feature variables weakly associated with the target variable can be removed.

The degree of association between variables may be calculated in several ways. The Pearson correlation coefficient between the variables may be calculated and used as the degree of association, in which case the association degree threshold is a preset Pearson correlation coefficient threshold range. Alternatively, all the variables may be mapped into a vector space and the distance between them used as the degree of association, in which case the association degree threshold is a distance threshold. The mutual information value between the variables may also be calculated and used as the degree of association, in which case the association degree threshold is a mutual information threshold. Preferably, this embodiment uses the MIC (maximal information coefficient) algorithm, that is, the maximum mutual information value algorithm, for two reasons. First, the feature variables whose degrees of association are to be calculated include both continuous and discrete feature variables, and because the MIC algorithm discretizes continuous feature variables, it can process both types at once. Second, the relationships between the feature variables and the target variable are both linear and nonlinear, and the MIC algorithm captures both kinds of relationship.
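For comparison with the preferred MIC approach, the first option mentioned above, the Pearson correlation coefficient, can be sketched in plain Python. The function names `pearson` and `pairwise_pearson` and the sample feature variables are illustrative assumptions. Note that this coefficient captures only linear relationships, which is one reason the embodiment prefers MIC.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences.

    Raises ZeroDivisionError when either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def pairwise_pearson(features):
    """Coefficient for every unordered pair of feature variables."""
    names = list(features)
    return {(a, b): pearson(features[a], features[b])
            for i, a in enumerate(names) for b in names[i + 1:]}


# Hypothetical feature variables: v rises with u, w falls as u rises.
sample = {"u": [1.0, 2.0, 3.0], "v": [2.0, 4.0, 6.0], "w": [3.0, 2.0, 1.0]}
associations = pairwise_pearson(sample)
```

A pair whose coefficient lies outside the preset threshold range would be flagged as highly associated, and one of the two feature variables removed.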

Calculating, by the MIC algorithm, the mutual information values between the characteristic variables X in the second characteristic set and the third characteristic set, and between X and the target variable Y, comprises the following steps:

1) forming a scatter diagram of the sample data by taking the input characteristic variable X and the corresponding target variable Y as the sample data;

The target variable Y has the same dimension as the characteristic variable X. The two are combined to obtain a two-dimensional variable matrix [X, Y]; each two-dimensional element of the matrix is regarded as a data point, and the data points are distributed in a two-dimensional space.

2) Gridding a scatter diagram formed by sample data in m rows and n columns to obtain a probability distribution value;

The frequency with which data points fall in the (x, y)-th grid cell is taken as the estimate of P(x, y) according to the following equation:

$$P(x, y) = \frac{\text{number of data points in the } (x, y)\text{-th grid cell}}{\text{total number of data points}}$$

The frequency of the data point falling on the x-th row is taken as the estimate of P (x), and an estimate of P (y) is obtained in the same way.

3) And calculating to obtain the maximum mutual information value of X and Y according to the probability distribution value.

Mutual information of the characteristic variable X and the target variable Y is calculated according to the following formula, and the value of the mutual information is converted into a (0,1) interval by using a normalization factor:
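The formula itself does not appear in this text. Assuming that the standard definition of mutual information and the usual MIC normalization factor are intended (an assumption, not confirmed by the source), it can be written with the estimates P(x, y), P(x) and P(y) from step 2) on a grid of m rows and n columns as:

$$I(X;Y) = \sum_{x=1}^{m}\sum_{y=1}^{n} P(x,y)\,\log_2\frac{P(x,y)}{P(x)\,P(y)}, \qquad I_{\mathrm{norm}}(X;Y) = \frac{I(X;Y)}{\log_2 \min(m,n)}$$

Terms with P(x, y) = 0 contribute nothing to the sum. Since 0 ≤ I(X;Y) ≤ log2 min(m, n), the normalized value lies between 0 and 1.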

4) Because the data points can be partitioned by an m × n grid in more than one way, the normalized mutual information is calculated for each grid partition, and the value at the grid resolution that maximizes the normalized mutual information is selected as the mutual information value (MIC value) of the characteristic variable X and the target variable Y.
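Steps 1) to 4) can be illustrated with a minimal Python sketch. It is a simplification rather than the MIC algorithm itself: it uses equal-width grid cells only, whereas MIC also searches over the positions of the grid lines, and the function names and the `max_bins` limit are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def bin_index(value, lo, hi, bins):
    """Map a value to an equal-width bin index in [0, bins - 1]."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)


def normalized_grid_mi(xs, ys, m, n):
    """Normalized mutual information of the points (xs, ys) on an m x n grid."""
    total = len(xs)
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    # Step 2: count the data points falling in each grid cell.
    cells = Counter(
        (bin_index(x, x_lo, x_hi, m), bin_index(y, y_lo, y_hi, n))
        for x, y in zip(xs, ys)
    )
    rows, cols = Counter(), Counter()
    for (i, j), count in cells.items():
        rows[i] += count
        cols[j] += count
    # Step 3: mutual information from the frequency estimates.
    mi = 0.0
    for (i, j), count in cells.items():
        p_xy = count / total
        # p_xy / (p_x * p_y) simplifies to count * total / (rows * cols).
        mi += p_xy * math.log2(count * total / (rows[i] * cols[j]))
    return mi / math.log2(min(m, n))


def max_grid_mi(xs, ys, max_bins=4):
    """Step 4: keep the largest normalized value over the tried grid sizes."""
    return max(
        normalized_grid_mi(xs, ys, m, n)
        for m in range(2, max_bins + 1)
        for n in range(2, max_bins + 1)
    )
```

Under these assumptions, a variable compared with itself scores close to 1, and two variables whose points are spread evenly over every grid cell score 0.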

In the screening process of this step, the linear and nonlinear relationships between the characteristic variables and the target variable are considered together, so that the characteristic variables most useful for estimating the target are finally obtained, which can improve the accuracy with which the bond default model predicts the bond default results of companies.

Step S60, drawing and displaying a corresponding thermodynamic diagram (heat map) in the display area according to the first relevance and the second relevance, wherein the thermodynamic diagram comprises a plurality of cells, each cell corresponds to one of the first relevance or the second relevance, and the cells are filled with corresponding colors according to the first relevance or the second relevance;

As shown in the thermodynamic diagram of fig. 4, a first degree of association is calculated between a plurality of characteristic variables (including the ratio of management fees to total business revenue, the receivable turnover rate, the receivable turnover days, the equity multiplier, and the like). The color of each square cell in the thermodynamic diagram represents the degree of association between two characteristic variables: the darker the color of the cell, the greater the degree of association.

Step S70, receiving a second instruction issued based on the thermodynamic diagram, and removing the feature variable that the second instruction indicates to remove from the second feature set and the third feature set to obtain a fourth feature set;

When the display of the feature screening device displays a first thermodynamic diagram corresponding to the first relevance and a second thermodynamic diagram corresponding to the second relevance, a worker may find from the first thermodynamic diagram that the relevance between feature variable 1 and feature variable 2 is high, so that one of the two feature variables needs to be removed. If the worker then finds from the second thermodynamic diagram that the relevance between feature variable 2 and the target variable is lower than the relevance between feature variable 1 and the target variable, the worker issues a second instruction by touching the display, by voice or through a keyboard to instruct the feature screening device to remove feature variable 2. The feature screening device removes feature variable 2 from the feature set to which it belongs according to the received second instruction.
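The decision rule the worker applies here can be sketched in Python. In the embodiment the decision is made by the worker looking at the two thermodynamic diagrams; the sketch below only mirrors that rule, and the function name, the dictionary layout and the `threshold` parameter are hypothetical.

```python
def drop_redundant(pair_assoc, target_assoc, threshold):
    """Return the set of feature names to remove.

    pair_assoc:   dict mapping (feature_a, feature_b) -> first relevance
    target_assoc: dict mapping feature -> second relevance (with the target)
    threshold:    first relevance at or above which a pair counts as redundant
    """
    removed = set()
    # Handle the most strongly associated pairs first.
    for (a, b), assoc in sorted(pair_assoc.items(), key=lambda kv: -kv[1]):
        if assoc < threshold or a in removed or b in removed:
            continue
        # Remove the variable that is less associated with the target variable.
        weaker = a if target_assoc[a] < target_assoc[b] else b
        removed.add(weaker)
    return removed
```

For example, with `pair_assoc = {("f1", "f2"): 0.9}`, `target_assoc = {"f1": 0.6, "f2": 0.3}` and a threshold of 0.8, the sketch selects `f2` for removal, matching the case of feature variable 2 described above.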

Step S80, removing the feature variable whose second degree of association does not satisfy a second preset condition from the fourth feature set.

After the degrees of association between the characteristic variables and the target variable are obtained, the second preset condition may be that the degree of association between the characteristic variable and the target variable is greater than or equal to a preset association degree threshold, or that it falls within a preset association degree range; alternatively, the characteristic variables may be sorted in descending order of the degree of association and a preset number of characteristic variables taken from the top of the order.
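The three alternative forms of the second preset condition can be sketched as follows. The function names and the dictionary input (feature name mapped to its second relevance) are assumptions made for illustration.

```python
def keep_by_threshold(assoc, threshold):
    """Keep features whose association with the target is >= threshold."""
    return [name for name, value in assoc.items() if value >= threshold]


def keep_by_range(assoc, low, high):
    """Keep features whose association lies within [low, high]."""
    return [name for name, value in assoc.items() if low <= value <= high]


def keep_top_n(assoc, n):
    """Sort features by association in descending order and keep the first n."""
    ranked = sorted(assoc, key=assoc.get, reverse=True)
    return ranked[:n]
```

Features not returned by the chosen function are the ones removed from the fourth feature set in step S80.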

In this embodiment, numerical characteristic variables and non-numerical characteristic variables are extracted from the user data in the preset period. The numerical characteristic variables are first screened by a variance method, and the mean values of the characteristic variables of different types of users are displayed by radar maps. The degrees of association between the screened characteristic variables, and between the characteristic variables and the target variable, are then calculated and displayed by thermodynamic diagrams, and the characteristic variables are screened again according to the instruction issued based on the thermodynamic diagrams and according to the preset condition. By graphically displaying the mean values of the characteristic variables and the degrees of association among the characteristic variables and with the target variable, this visual screening method shows intuitively how the characteristic variables influence the target variable and lets workers participate directly in the feature screening process, which increases the interpretability of the feature screening process and improves the accuracy of feature screening.

Further, referring to fig. 5, a second embodiment of the present invention provides a visual feature screening method based on the first embodiment, where the present embodiment further includes, after step S20:

step S90, calculating a pearson correlation coefficient between each feature variable in the third feature set and the target variable, and removing feature variables whose pearson correlation coefficients are outside a preset coefficient threshold range.

The characteristic variables of the bond default model are derived by processing the financial data of a company in a preset period. The financial data is divided into multiple categories, including company financial report data, stock change data, public opinion data on the company's senior management, industry data and historical default data. The financial data of each category comprises multiple data items, and each data item yields one characteristic variable, so the number of characteristic variables of the bond default model to be screened is large.

After step S20, the third feature set still includes a large number of feature variables, from which the feature variables most valuable for predicting the target variable need to be screened out. Because the third feature set includes numerical variables, the relationship between the numerical variables and the target variable is linear, and the Pearson correlation coefficient measures linear relationships, the Pearson correlation coefficient between each feature variable in the third feature set and the target variable can be calculated, and the feature variables that contribute little to predicting the target variable are then removed according to that coefficient.
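Step S90 can be sketched in Python as below. The source does not specify the preset coefficient threshold range; the sketch assumes that a feature is kept when the absolute value of its Pearson correlation coefficient is at least `min_abs_r`. That parameter, the function names, and the handling of constant variables are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant variable is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def screen_by_pearson(features, target, min_abs_r):
    """Keep features whose |r| with the target is at least min_abs_r."""
    return {
        name: values
        for name, values in features.items()
        if abs(pearson(values, target)) >= min_abs_r
    }
```

Using the absolute value keeps strongly negatively correlated features as well, since they are as informative for a linear relationship as positively correlated ones.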

The characteristic variables screened in this step are further subjected to screening and removing operations in steps S70 and S80, and finally the characteristic variable which is most valuable for prediction of the target variable can be selected.

In this embodiment, the Pearson correlation coefficient between each feature variable in the third feature set and the target variable is calculated, and the feature variables whose Pearson correlation coefficients are outside the preset coefficient threshold range are removed. That is, different feature screening methods are adopted for numerical and non-numerical feature variables; in particular, the numerical feature variables undergo more levels of screening, and both the linear and the nonlinear relationships between the feature variables and the target variable are considered in the screening process, so that the feature loss rate is reduced and the accuracy of feature variable screening is improved.

Further, a third embodiment of the present invention provides a visual feature screening method based on the first embodiment or the second embodiment, where the present embodiment further includes, after step S10:

step S100, performing dimension reduction processing on the user data to obtain data points of the user data in a two-dimensional space;

and step S110, drawing a two-dimensional coordinate system in the display area, and displaying the data point in the two-dimensional coordinate system.

As shown in fig. 6, the multidimensional data of non-default company users and of default company users are each reduced to two dimensions using the t-SNE (t-distributed stochastic neighbor embedding) dimension reduction algorithm, and a distribution diagram of the reduced data is shown in a two-dimensional coordinate system. When the display of the feature screening device displays the two-dimensional distribution diagram of the dimension-reduced data, the worker can see intuitively whether an obvious boundary exists between the data of the default company users and the data of the non-default company users.

It should be noted that, in the multidimensional data of the company user, each piece of dimensional data corresponds to a feature variable.

Further, a dimension reduction process may be performed once for the feature variables screened out in each of the above steps S20, S70 and S80, and a distribution diagram of the dimension-reduced feature variables of the different types of company users is drawn and displayed in the display area. By checking whether an obvious boundary exists between the data of the different types of company users, the worker can determine whether features effective for predicting the target variable have been screened out: an obvious boundary indicates that effective features have been screened out, and the absence of one indicates that further screening is required.

In this embodiment, the data of the user is subjected to the dimension reduction processing and displayed, so that whether clear boundaries exist among different types of user data can be visually displayed, and the effectiveness of the feature variables corresponding to the user data on the prediction target variables can be visually displayed.

Further, a fourth embodiment of the present invention provides a visual feature screening method based on the first embodiment or the second embodiment, where the present embodiment further includes, after step S10:

step S120, drawing and displaying the probability distribution map of each characteristic variable in the first characteristic set and the second characteristic set in the display area;

step S130, receiving a third instruction based on the probability distribution map, and removing the feature variable that the third instruction indicates to remove from the second feature set and the third feature set.

As shown in fig. 7, the feature set includes feature variables such as physical assets, the ratio of physical assets to amounts in charge, and profit per share at the end of the period, and a probability distribution map of these feature variables is drawn and displayed in the display area. When the display of the feature screening device displays the probability distribution map of the feature variables, if the worker viewing it finds that the feature variable "physical assets" does not present a regular distribution, the worker considers that this feature variable does not contribute to the target prediction and issues a third instruction by touching the display, by voice or through a keyboard to instruct the feature screening device to remove it. The feature screening device removes the feature variable "physical assets" from the feature set to which it belongs according to the received third instruction.

In the embodiment, by drawing and displaying the probability distribution of the characteristic variables in the display area and screening the characteristic variables again according to the instruction sent based on the probability distribution diagram, the visual screening method for graphically displaying the probability distribution of the characteristic variables enables the staff to directly remove the characteristic variables which do not contribute to the target prediction based on the probability density distribution of the characteristic variables, so that the interpretability of the characteristic screening process is further increased and the accuracy of the characteristic screening is improved.

Further, a fifth embodiment of the present invention provides a visual feature screening method based on the first embodiment or the second embodiment, where the present embodiment further includes, after step S10:

step S140, receiving a fourth instruction, and acquiring a target feature variable indicated and displayed by the fourth instruction, where the target feature variable is a feature variable in the first feature set or the second feature set;

step S150, acquiring the characteristic value of the target characteristic variable at continuous time points;

step S160, drawing a two-dimensional coordinate system in the display area, and displaying the feature values of the target feature variable at the continuous time points in the two-dimensional coordinate system.

As shown in fig. 8, the horizontal axis is a time axis and the vertical axis is the value of the target characteristic variable at each continuous time point. By observing how the value of the target characteristic variable changes over the continuous time points, it is possible to determine whether the data change is time-dependent; for example, profit is generally low in spring and high in winter. This helps to determine whether a year-over-year ratio (same ratio) or a period-over-period ratio (ring ratio) needs to be calculated from the data of the characteristic variable to eliminate the influence of time on the characteristic variable.
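The two ratios mentioned above can be computed as in the following minimal sketch. The function names and the `periods_per_year` parameter are hypothetical, the ratios are expressed as relative changes, and the sketch assumes that no reference value is zero.

```python
def year_over_year(values, periods_per_year=12):
    """Relative change versus the same period one year earlier (same ratio)."""
    return [
        (values[i] - values[i - periods_per_year]) / values[i - periods_per_year]
        for i in range(periods_per_year, len(values))
    ]


def period_over_period(values):
    """Relative change versus the immediately preceding period (ring ratio)."""
    return [
        (values[i] - values[i - 1]) / values[i - 1]
        for i in range(1, len(values))
    ]
```

For a strongly seasonal series, the period-over-period values swing with the season, while the year-over-year values compare like with like and so remove most of the seasonal effect.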

In this embodiment, by displaying the characteristic values of the characteristic variables at the continuous time points in the display area, the worker can intuitively judge whether the data change of a characteristic variable is related to time and determine whether the year-over-year ratio (same ratio) or the period-over-period ratio (ring ratio) of the characteristic variable needs to be calculated, so as to obtain data more beneficial to feature screening.

Further, a sixth embodiment of the present invention provides a visual feature screening method based on the first embodiment or the second embodiment, where the present embodiment further includes, after step S80:

step S170, acquiring user data to be classified, and extracting characteristic variables to be screened from the user data to be classified;

step S180, matching the characteristic variables to be screened in the fourth characteristic set to obtain matched characteristic variables;

step S190, inputting the matched characteristic variables into a preset classification model for processing to obtain a user classification result;

step S200, acquiring corresponding statistical characteristics for each matched characteristic variable of each type of user, and drawing a corresponding box plot in the display area according to the statistical characteristics, wherein the statistical characteristics comprise a maximum value, a minimum value, a median and two quartiles.

As shown in fig. 9, an example of box plots of the statistical characteristics of some of the characteristic variables is given. In the display area, the box plots corresponding to the data of a plurality of characteristic variables are arranged side by side, visually displaying distribution information such as the median, tail length, outliers and distribution interval of the data of each characteristic variable. By observing the length of each box and line segment, the size of the interquartile range, whether the normal values are concentrated or scattered, and the skewness of the data distribution can be roughly estimated.
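The statistical characteristics named in step S200 can be obtained with the Python standard library, as in the sketch below. The function names and the nested-dictionary input are assumptions, and the "inclusive" quartile method is one of several common conventions; the source does not say which one is used.

```python
import statistics


def box_stats(values):
    """Five-number summary used to draw one box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def box_stats_by_class(samples):
    """samples: dict mapping user class -> dict mapping feature -> values."""
    return {
        user_class: {name: box_stats(vals) for name, vals in feats.items()}
        for user_class, feats in samples.items()
    }
```

The difference `q3 - q1` gives the interquartile range, which corresponds to the length of the box in the plot.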

In this embodiment, by drawing and displaying the box plots corresponding to the statistical characteristics of the characteristic variables in the display area, the worker can directly observe and compare the distribution characteristics of a plurality of characteristic variables, thereby further improving the interpretability of the feature screening process.

Further, a seventh embodiment of the present invention provides a visual feature screening method based on the first embodiment, where the visual feature screening method is applied to a bond default model, and the embodiment further includes, after step S80:

step S210, inputting the third feature set and the fourth feature set into the bond default model for processing, and training parameters in the bond default model.

The bond default model is a machine learning model and can be constructed based on a neural network, a deep neural network, a support vector machine, a random forest algorithm or the like.

In this embodiment, the third feature set and the fourth feature set after the removing operation are obtained, the feature variables in the two sets are both used as sample data and input into the bond default model, and parameters in the bond default model are trained.

After the bond default model is trained, predicting the bond default risk of the company according to the following steps:

step S220, acquiring data of a user to be predicted, and acquiring a characteristic variable to be screened according to the data of the user to be predicted;

Financial data of the company user to be predicted in a preset period is acquired. The financial data has a time dimension and a feature dimension; each feature dimension is regarded as a characteristic variable to be screened, and each characteristic variable corresponds to one item of sub-data of the financial data. For example, if the preset period is five years and the data of each subclass is monthly data, each characteristic variable has 5 × 12 = 60 dimensions.

Step S230, matching the feature variables to be screened in the third feature set and the fourth feature set respectively to obtain matched feature variables;

For example, if the set of feature variables to be screened is {A, B, C, D, E, F, G}, the third feature set contains {A, B}, and the fourth feature set contains {E, G}, then the matched feature variables are {A, B, E, G};

step S240, inputting the matched characteristic variables into the bond default model for processing, and predicting the classification result of the user to be predicted.
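The matching in step S230 amounts to keeping the candidate feature variables that appear in either screened set. A minimal sketch, with a hypothetical function name, is:

```python
def match_features(candidates, third_set, fourth_set):
    """Keep candidates present in the third or fourth feature set, in order."""
    selected = set(third_set) | set(fourth_set)
    return [name for name in candidates if name in selected]
```

Preserving the order of the candidates keeps the columns passed to the model in a stable, predictable arrangement.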

In this embodiment, the third feature set and the fourth feature set are input into the bond default model for processing, and parameters in the bond default model are trained to obtain a bond default model with high prediction accuracy.

The present invention also provides a server, comprising: a memory, a processor, and a visual feature screening processing program stored on the memory and executable on the processor, wherein the visual feature screening processing program, when executed by the processor, implements the steps of the visual feature screening method described above.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a processing program for visual feature screening is stored, and when executed by a processor, the processing program for visual feature screening implements the steps of the visual feature screening processing method.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and can certainly also be implemented by hardware, although in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product, which is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disk) and includes several instructions for enabling a terminal server (which may be a mobile phone, a computer, a server or a network server) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A visual feature screening method is characterized by comprising the following steps:

2. The visual feature screening method according to claim 1, wherein the step of obtaining the variance of each feature variable in the first feature set and removing the feature variables in the first feature set whose variances do not satisfy the first preset condition to obtain a third feature set further includes:

3. The visual feature screening method according to claim 1, wherein the degree of association is a mutual information value, and the second preset condition is that the mutual information value is smaller than a mutual information threshold value.

4. The visual feature screening method of claim 1, wherein the user data is financial data of a user, the step of obtaining the user data in a preset period comprises the steps of:

5. The visual feature screening method of any one of claims 1 to 4, wherein the step of obtaining the user's data in a preset period is followed by further comprising:

6. The visual feature screening method according to any one of claims 1 to 4, wherein the step of acquiring the user data in a preset period and extracting the first feature set, the second feature set and the target variable from the user data further comprises the following steps:

7. The visual feature screening method according to any one of claims 1 to 4, wherein the step of acquiring the user data in a preset period and extracting the first feature set, the second feature set and the target variable from the user data further comprises the following steps:

8. A visual feature screening method according to any one of claims 1 to 4, wherein the step of removing the feature variables whose second relevance does not satisfy the second preset condition from the fourth feature set further includes:

9. A server, characterized in that the server comprises: memory, a processor and a processing program of visual feature variable screening stored on the memory and executable on the processor, which when executed by the processor implements the steps of the visual feature variable screening method according to any one of claims 1 to 6.

10. A storage medium, characterized in that the storage medium has stored thereon a processing program for visual feature variable filtering, which when executed by a processor implements the steps of the visual feature variable filtering method according to any one of claims 1 to 8.