CN116842454A - Financial asset classification method and system based on support vector machine algorithm - Google Patents
- Publication number
- CN116842454A CN202310661037.8A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/04—Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/06—Asset management; Financial planning or analysis
Abstract
The invention provides a financial asset classification method and system based on a support vector machine (SVM) algorithm, aimed at improving the accuracy and stability of financial asset classification. Drawing on features and data specific to finance and accounting, the invention builds a financial asset classification model with the SVM algorithm and classifies and predicts financial assets by learning from, and optimizing over, training data. The model is applicable to financial investment decisions and risk management, where it can provide accurate support and guidance.
Description
Technical Field
The invention relates to the field of finance and accounting, and in particular to a financial asset classification method and system based on a support vector machine algorithm.
Background
In financial markets, accurate classification and prediction of financial assets is critical to investment decisions and risk management. Only by accurately understanding the nature and trends of financial assets can investors make informed decisions, reduce risk, and achieve better returns. Proper classification and reliable prediction of financial assets are therefore essential to investors' success in the market.
The SVM has strong learning and generalization ability, but the predictive performance of an SVM model depends closely on the choice of its parameters. Finding an effective method to search for the optimal SVM parameters, and thereby obtain higher classification accuracy, is therefore an active research topic.
Although the existing SVM can handle nonlinear classification through a kernel function, large-scale data analysis in fields such as finance, biomedicine and astronomical measurement involves data sets that are more complex or abstract and places higher demands on the classification accuracy of the classical SVM, so another approach is needed. Empirical results reported by researchers show that the fruit fly optimization algorithm (FOA) is easy to implement and has strong local search capability. The FOA can be used to optimize the parameters of an SVM model and improve its predictive performance. However, most samples in the financial field are imbalanced; in asset classification in particular, most assets are classified as measured at amortized cost, and this imbalance limits the effectiveness of FOA-based optimization of the SVM.
The imbalanced-sample processing method proposed by the invention, HSMOTE (Hierarchical Synthesis Minority Oversampling Technique), combines well with the fruit fly optimization algorithm. Building on SMOTE, it reconstructs the samples by processing the sample feature vectors hierarchically, which removes the constraint that imbalanced samples place on model optimization and lets the FOA search for hyperparameters that fit the model better.
Based on the HSMOTE imbalanced-sample processing method and the FOA parameter optimization method, the invention combines HSMOTE and FOA with the SVM model to construct an FOA-HSMOTE-SVM financial asset classification model, and compares it with an SVM whose parameters are obtained by grid search. The experimental results show that the FOA-HSMOTE-SVM model improves the accuracy of financial asset classification while also being better suited to imbalanced samples.
Disclosure of Invention
The technical problem to be solved by the invention is the insufficient accuracy of financial asset classification in the prior art. The invention provides an SVM optimized by the fruit fly algorithm that can adapt to imbalanced samples, with the aim of improving the accuracy and stability of financial asset classification.
The model builds on the core principles and advantages of the support vector machine, optimizes the kernel function with the fruit fly algorithm, incorporates features and data specific to the financial field, and classifies and predicts financial assets by learning from, and optimizing over, training data.
Consider the classification of financial assets, in which a single category of assets accounts for the largest share of the samples. HSMOTE handles this imbalance problem well: it improves the validity and soundness of the parameters and has excellent noise resistance.
HSMOTE is built on SMOTE and is designed around the different data types in the sample feature vectors. Because the method uses one-hot encoding, the vectors mix Boolean and floating-point data; applying SMOTE directly tends to make the Boolean data fluctuate over too wide a range, which affects model construction.
Combined with the FOA, which is simple to apply and has strong local search capability, the SVM parameters are optimized and an improved SVM financial asset classification model is constructed. The verification results show that the FOA-HSMOTE-SVM financial asset classification model performs well, and the method can give enterprises, investors and related financial support institutions an auxiliary means of accurately classifying financial assets in batches.
The technical solution of the invention is a method for classifying financial assets based on an SVM algorithm, comprising the following steps:
step 1, collecting data: collecting financial-market data, including corporate financial statement data, bond and stock data, company operating data, and industry and market data;
step 2, preprocessing data: performing data cleaning, sample reconstruction, feature extraction and feature selection on the data collected in step 1 to extract effective features related to financial asset classification;
step 3, constructing a financial asset classification model:
optimizing the SVM parameters with the fruit fly optimization algorithm, where, in the discriminant function, $K(x_i, y_i)$ is the kernel function, $x_i$ and $y_i$ represent different feature values of the samples, $b$ is a constant, and $a_i$ ($i = 1, 2, \ldots, n$) are the Lagrange multipliers;
performing global optimization of the kernel function $K(x_i, y_i)$ to construct the financial asset classification model;
step 4, feeding the data preprocessed in step 2 into the financial asset classification model constructed in step 3, and comparing its accuracy, recall and F1 value with those of a model built without the sample reconstruction of step 2;
step 5, selecting the financial asset classification model with the better indexes according to the comparison in step 4.
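The discriminant function of step 3 is not legible in the text, so the following is only the textbook form of a kernel SVM decision function, given as an assumption rather than as the formula of the invention. Here $x$ is the sample to classify, $x_i$ are training samples and $y_i \in \{-1, +1\}$ are their class labels (a different use of $y_i$ from the text above). The RBF kernel with width parameter $g$ is likewise an assumed choice.

```latex
f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} a_i \, y_i \, K(x_i, x) + b \right),
\qquad
K(x_i, x) = \exp\!\left( -g \, \lVert x_i - x \rVert^{2} \right)
```

Under this reading, the quantities that an optimizer such as the FOA would tune are the kernel parameter $g$ and the penalty coefficient $C$ that bounds the multipliers $a_i$.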
Preferably, in step 1 and step 2, the following data are collected:
(1) Corporate financial statement data:
Income statement: includes operating revenue, net profit and the like, and can be used to evaluate a company's business model and profitability. Cash flow statement: in particular operating cash flows, which can be used to analyze the company's cash inflows and outflows.
(2) Bond and stock data:
Bond and stock issuance documents: include the bond terms and conditions, which can be used to understand the bond's repayment arrangements and cash flow provisions. Bond repayment plan: records the repayment schedule and amounts of the bond, and can be used to analyze the bond's contractual cash flows.
(3) Company operating data:
Sales contract data: includes sales contract amounts, collection terms and the like, and can be used to evaluate the contractual cash flows between the company and its customers. Supplier contract data: includes purchase contract amounts, payment terms and the like, and can be used to evaluate the contractual cash flows between the company and its suppliers.
(4) Industry and market data:
Industry reports and studies: describe the common business models and cash flow characteristics of the industry and provide a reference for comparison and analysis. Market indexes and competitive conditions: describe the competitive environment and industry trends, and are used to evaluate the company's strengths and weaknesses in managing financial assets.
Preferably, in step 2, the collected data are cleaned: outliers and records with missing data are removed, as are non-trading equity instrument investments designated as measured at fair value through other comprehensive income, finally yielding 200 valid records. Feature engineering, comprising feature extraction and feature selection, is then performed to extract effective features related to financial asset classification. The most representative and important features are selected, namely the business model under which the company manages the financial assets and the contractual cash flow characteristics of the financial assets, in order to reduce model complexity and improve classification accuracy.
Table 1 Business models under which the company manages financial assets
(1) The business model and the contractual cash flows are taken as the feature values x, and the output variable is the classification of the financial asset: 1 denotes financial assets measured at amortized cost, 2 denotes financial assets measured at fair value with changes recognized in other comprehensive income, and 3 denotes financial assets measured at fair value with changes recognized in current profit or loss. To reduce the bias caused by special cases, stocks designated as non-trading equity instrument investments measured at fair value through other comprehensive income are removed in the data preprocessing stage, and the remaining stocks are classified directly as class 3 assets.
Table 2 Classification of financial assets managed by the company
(2) A financial asset classification judgment is made for the collected stocks and bonds, using one-hot encoding. For example, for a bond that passes the cash flow test and whose business model is 1, the vector is x = [1, 0, …]. For a bond that passes the cash flow test and whose business model is 2, the vector is x = [0, 1, 0, …]. Likewise, if the bond does not meet the condition of passing the cash flow test and its business model is another business model, the vector is x = [0, 1, …].
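The one-hot encoding described in (2) can be illustrated with a minimal Python sketch. The business-model labels, the position of the cash flow test flag and the function name `encode_asset` are hypothetical choices made for illustration; the text does not specify the exact vector layout.

```python
# Hypothetical labels for the business models; the exact categories and
# their order are assumptions, not taken from the patent text.
BUSINESS_MODELS = ["model_1", "model_2", "other"]


def encode_asset(business_model, passes_cash_flow_test):
    """Return a feature vector: one-hot business model plus a Boolean flag."""
    one_hot = [1 if business_model == name else 0 for name in BUSINESS_MODELS]
    # Assumed layout: the cash flow test result is appended as a 0/1 feature.
    return one_hot + [1 if passes_cash_flow_test else 0]


print(encode_asset("model_1", True))
print(encode_asset("other", False))
```

Every feature produced this way is Boolean, which is the situation in which, according to the description, plain SMOTE interpolation becomes problematic once such features are mixed with floating-point ones.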
(3) Preferably, in step 3, imbalanced samples are processed with the HSMOTE algorithm. HSMOTE inserts synthetic samples into the minority class to reduce the skew of the data and thereby improve the prediction accuracy of the model. The steps of the HSMOTE algorithm are as follows:
For each sample $X$ in the minority class, compute the Euclidean distance from $X$ to every other minority-class sample, find its $K$ nearest neighbors, and record the neighbor indices. Set a sampling rate vector $N = (\omega_1, \omega_2, \omega_3, \ldots, \omega_n)$ for the minority class according to the imbalance ratio between the minority and majority classes. For every minority-class sample $X$, randomly select samples $X_i$ ($i = 1, 2, \ldots, N$) from its $K$ nearest neighbors; the feature vector of the $i$-th sample is denoted $x_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in})$.
The variability between samples is taken into account, using the mean of the $j$-th feature over all samples. The conflict between samples is also taken into account, where $r_{ij}$ is the correlation coefficient between the $i$-th and $j$-th indexes.
The fluctuation coefficient $f$ of each data type is taken into account; through testing, the fluctuation coefficients of the Boolean and floating-point features of the samples are set to 0.5 and 2, respectively.
The weights are then computed from these quantities.
1) Each neighbor sample $X_i$ is combined with the original sample $X$ according to $X_{new} = X + \mathrm{rand}(0,1) \times N \odot (X_i - X)$ to synthesize a new sample, where $N \odot (X_i - X)$ is the Hadamard product of the vectors;
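The synthesis rule $X_{new} = X + \mathrm{rand}(0,1) \times N \odot (X_i - X)$ can be sketched in plain Python as follows. This is a minimal illustration, not the implementation of the invention: the function names, the toy minority samples and the values of the rate vector `rates` are hypothetical, and the weighting that produces $N$ from the variability, conflict and fluctuation coefficients is not reproduced here.

```python
import math
import random


def k_nearest(sample, others, k):
    """Return the k samples in `others` closest to `sample` (Euclidean)."""
    return sorted(others, key=lambda o: math.dist(sample, o))[:k]


def synthesize(sample, neighbor, rates, rng):
    """X_new = X + rand(0,1) * N (Hadamard) (X_i - X), element by element."""
    r = rng.random()
    return [x + r * n * (xi - x) for x, xi, n in zip(sample, neighbor, rates)]


def oversample(minority, rates, k, n_new, rng):
    """Create n_new synthetic minority samples from random nearest neighbors."""
    new_samples = []
    while len(new_samples) < n_new:
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbor = rng.choice(k_nearest(base, others, k))
        new_samples.append(synthesize(base, neighbor, rates, rng))
    return new_samples


rng = random.Random(0)
# Toy minority class: two one-hot (Boolean) features and one float feature.
minority = [[1, 0, 0.20], [1, 0, 0.35], [0, 1, 0.30], [1, 0, 0.25]]
rates = [0.5, 0.5, 1.0]  # hypothetical per-feature rate vector N
print(oversample(minority, rates, k=2, n_new=3, rng=rng))
```

Because the rate vector is applied per feature, Boolean features can be given a smaller step than floating-point features, which is the effect the description attributes to HSMOTE.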
The FOA is an optimization procedure that simulates the foraging behavior of a fruit fly swarm and relies on the swarm's cooperation mechanism. The algorithm consists of two parts, an olfactory search and a visual search, and its only key parameters are the population size and the maximum number of iterations. Compared with other intelligent algorithms, the FOA is easier to understand and operate and has stronger local search capability, and it has been applied in fields such as multi-knapsack problems, financial crisis early warning, neural network parameter optimization and logistics services. The specific procedure is as follows:
1) Set the population size sizepop, the maximum number of iterations maxgen, the swarm position range LR, the single-flight range FR of a fruit fly, and other relevant parameter values. The position of the swarm is given in (X, Y) coordinates, and the initial position is $X_{axis} = \mathrm{rand}(LR)$, $Y_{axis} = \mathrm{rand}(LR)$.
2) Give each fruit fly in the swarm a random flight direction and distance; the new position of fruit fly $i$ is $X_i = X_{axis} + \mathrm{rand}(FR)$, $Y_i = Y_{axis} + \mathrm{rand}(FR)$.
3) Compute the distance $DIST_i$ of each fruit fly's position from the origin: $DIST_i = \sqrt{X_i^2 + Y_i^2}$.
4) Compute the taste concentration judgment value $S_i$ and the taste concentration value $Smell_i$ of each fruit fly in the swarm: $S_i = 1/DIST_i$, $Smell_i = \mathrm{fitness}(S_i)$, where fitness is the fitness function (objective function);
5) Select the fruit fly with the best taste concentration in the current swarm and record its taste concentration value and position:
[bestSmell, bestIndex] = min(Smell)
6) The other fruit flies in the swarm fly toward that position, using the best taste concentration value and the corresponding position:
SmellBest = bestSmell
$X_{axis} = X(\mathrm{bestIndex})$
$Y_{axis} = Y(\mathrm{bestIndex})$
7) Repeat sub-steps 2) to 6) until the number of iterations reaches maxgen.
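Sub-steps 1) to 7) can be sketched as a short Python function. Several details are assumptions made for illustration: `rand(LR)` is read as a uniform draw from [0, LR], `rand(FR)` as a uniform draw from [-FR, FR], and the swarm position is only moved when the best fly of a generation improves on the best value found so far, a common variant that the text does not spell out. The toy fitness function is hypothetical.

```python
import math
import random


def foa_minimize(fitness, sizepop=20, maxgen=100, lr=1.0, fr=1.0, seed=0):
    """Minimize fitness(S) with a basic fruit fly optimization loop."""
    rng = random.Random(seed)
    # Step 1: random initial swarm position.
    x_axis, y_axis = rng.uniform(0, lr), rng.uniform(0, lr)
    best_smell, best_s = math.inf, None
    for _ in range(maxgen):
        flies = []
        for _ in range(sizepop):
            # Step 2: random flight direction and distance.
            x = x_axis + rng.uniform(-fr, fr)
            y = y_axis + rng.uniform(-fr, fr)
            # Step 3: distance from the origin (guarded against zero).
            dist = max(math.hypot(x, y), 1e-12)
            # Step 4: judgment value S and smell concentration.
            s = 1.0 / dist
            flies.append((fitness(s), s, x, y))
        # Step 5: best fly of this generation.
        smell, s, x, y = min(flies)
        # Step 6: move the swarm there if it improves on the best so far.
        if smell < best_smell:
            best_smell, best_s = smell, s
            x_axis, y_axis = x, y
    return best_s, best_smell


# Toy objective with its minimum at S = 0.5.
print(foa_minimize(lambda s: (s - 0.5) ** 2))
```

The candidate solution is the scalar S = 1/DIST, so a single (X, Y) swarm searches one parameter; tuning several parameters requires one such pair per parameter.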
In the final FOA-HSMOTE-SVM procedure, assets measured at amortized cost are the majority-class samples $S_{maj}$, and assets measured at fair value with changes recognized in other comprehensive income are the minority-class samples $S_{min}$. The specific steps are as follows:
1) For each sample point $(X_{smin}, Y_{smin})$ in $S_{min}$, compute its nearest neighbors and randomly draw one of them, for a total of $|S_{maj} - S_{min}|/2$ draws; multiply the difference between the neighbor and the original sample point $(X_{smin}, Y_{smin})$ by a random number $\delta$ in $[0, 1]$ and add the original sample point to obtain a new credit risk sample;
2) Repeat 1) until the number of artificially synthesized credit risk samples reaches $|S_{maj} - S_{min}|/2$;
3) Initialize the model parameters: select the kernel function parameter g and the penalty coefficient C of the SVM, determine the objective function used as the fruit fly taste concentration judgment function, and set the number of iterations maxgen and the population size sizepop of the fruit fly optimization algorithm, together with termination parameters such as bestSmell, where maxgen is 100 and sizepop is 20;
4) Optimize the parameters of the SVM early warning model with the FOA: compute the fruit fly taste concentration value $Smell_i$ from the two formulas $DIST_i = \sqrt{X_i^2 + Y_i^2}$ and $S_i = 1/DIST_i$, and iterate;
5) Terminate the algorithm when bestSmell falls below the specified value, obtaining the parameters with the best concentration value; the optimal parameters and $X_{new}$ are then input as the sample set to the FOA-HSMOTE-SVM financial asset classification model.
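Steps 3) to 5) can be sketched as follows, with one (X, Y) swarm position per tuned parameter so that the FOA searches C and g together. The function `cv_error` is a hypothetical stand-in for the cross-validated error of an SVM trained with the given (C, g); in practice it would be replaced by a real training and validation run. The threshold `tol` and the use of one coordinate pair per parameter are likewise assumptions.

```python
import math
import random


def cv_error(C, g):
    """Hypothetical stand-in for the cross-validated SVM error at (C, g)."""
    return (math.log10(C) - 1.0) ** 2 + (math.log10(g) + 1.0) ** 2


def foa_tune(fitness, sizepop=20, maxgen=100, fr=1.0, tol=1e-6, seed=0):
    """Search (C, g) with the FOA; returns ((C, g), best fitness value)."""
    rng = random.Random(seed)
    # One (X, Y) swarm position per parameter: index 0 -> C, index 1 -> g.
    axes = [[rng.random(), rng.random()] for _ in range(2)]
    best = (math.inf, None, None)
    for _ in range(maxgen):
        for _ in range(sizepop):
            pos = [[a + rng.uniform(-fr, fr) for a in axis] for axis in axes]
            # S = 1 / DIST gives a positive candidate value per parameter.
            C, g = (1.0 / max(math.hypot(x, y), 1e-12) for x, y in pos)
            smell = fitness(C, g)
            if smell < best[0]:
                best = (smell, (C, g), pos)
        axes = best[2]  # the swarm moves to the best position found so far
        if best[0] < tol:  # stop once bestSmell is below the threshold
            break
    return best[1], best[0]


(best_C, best_g), err = foa_tune(cv_error)
print(best_C, best_g, err)
```

With maxgen = 100 and sizepop = 20, as in the text, the loop evaluates the fitness at most 2000 times, which is the cost to weigh against a grid search over the same two parameters.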
(4) Preferably, in step 4, the data set of 200 records is divided into a training set and a validation (test) set in a 70%/30% ratio: the training set contains 140 records and the test set contains 60 records. For the linearly inseparable case, the improved support vector machine algorithm takes slack variables into account and introduces a penalty factor C.
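The formula that introduces the penalty factor is not legible in the text. As an assumption, the standard soft-margin SVM objective is shown below, with slack variables $\xi_i$, labels $y_i \in \{-1, +1\}$, weight vector $w$ and feature map $\phi$; these symbols are the textbook ones and are not taken from the patent.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\ i = 1, \ldots, n
```

A larger $C$ penalizes margin violations more heavily, which is why $C$ is one of the parameters searched by the FOA.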
In addition, the model can be combined with Bayesian optimization, where $w$ ranges over all possible models, $x$ is the model input, $y$ is the model output, $m$ is the mean of the model parameters, and $\Sigma$ is the covariance matrix of the models, using a Gaussian process:
$P(y \mid x, D) = \int P(y \mid x, w)\, P(w \mid D)\, dw \sim N(m, \Sigma)$
to further search for and determine the final hyperparameters. Finally, the model is evaluated and optimized using methods such as cross-validation, with indexes including accuracy, recall and F1 score. The model is compared with an SVM whose hyperparameters are obtained by grid search and with an FOA-SVM, and the results show that the model generalizes better to unseen data.
Table 3 comparison of test results of three classification models
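The evaluation indexes named above (accuracy, recall and F1 score) can be computed as in the following minimal sketch. The label lists are made-up toy data, not results from the patent, and the per-class, one-vs-rest treatment is an assumption about how the three-class problem is scored.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """One-vs-rest precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels using the class codes 1, 2 and 3 from the description.
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 3]
print(accuracy(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred, positive=2))
```

On imbalanced data, recall and F1 for the minority class are more informative than overall accuracy, which is consistent with the description comparing all three indexes.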
Preferably, in step 5, the trained SVM model is used to predict and classify the financial asset data in the validation set. The prediction and classification results show that the accuracy of the model is up to 98%, indicating an excellent classification effect.
Drawings
FIG. 1 is a specific workflow diagram of a method of classifying financial assets
FIG. 2 is a flow chart of the financial asset classification process for stocks and bonds
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
The technical solution of the invention is a method for classifying financial assets based on an SVM algorithm, comprising the following steps:
step 1, collecting data: collecting financial-market data, including corporate financial statement data, bond and stock data, company operating data, and industry and market data;
step 2, preprocessing data: performing data cleaning, sample reconstruction, feature extraction and feature selection on the data collected in step 1 to extract effective features related to financial asset classification;
step 3, constructing a financial asset classification model:
optimizing the SVM parameters with the fruit fly optimization algorithm, where, in the discriminant function, $K(x_i, y_i)$ is the kernel function, $x_i$ and $y_i$ represent different feature values of the samples, $b$ is a constant, and $a_i$ ($i = 1, 2, \ldots, n$) are the Lagrange multipliers;
performing global optimization of the kernel function $K(x_i, y_i)$ to construct the financial asset classification model;
step 4, feeding the data preprocessed in step 2 into the financial asset classification model constructed in step 3, and comparing its accuracy, recall and F1 value with those of a model built without the sample reconstruction of step 2;
step 5, selecting the financial asset classification model with the better indexes according to the comparison in step 4.
Preferably, in step 1 and step 2, the following data are collected:
(1) Corporate financial statement data:
Income statement: includes operating revenue, net profit and the like, and can be used to evaluate a company's business model and profitability. Cash flow statement: in particular operating cash flows, which can be used to analyze the company's cash inflows and outflows.
(2) Bond and stock data:
Bond and stock issuance documents: include the bond terms and conditions, which can be used to understand the bond's repayment arrangements and cash flow provisions. Bond repayment plan: records the repayment schedule and amounts of the bond, and can be used to analyze the bond's contractual cash flows.
(3) Company operating data:
Sales contract data: includes sales contract amounts, collection terms and the like, and can be used to evaluate the contractual cash flows between the company and its customers. Supplier contract data: includes purchase contract amounts, payment terms and the like, and can be used to evaluate the contractual cash flows between the company and its suppliers.
(4) Industry and market data:
Industry reports and studies: describe the common business models and cash flow characteristics of the industry and provide a reference for comparison and analysis. Market indexes and competitive conditions: describe the competitive environment and industry trends, and are used to evaluate the company's strengths and weaknesses in managing financial assets.
Preferably, in step 2, the collected data are cleaned: outliers and records with missing data are removed, as are non-trading equity instrument investments designated as measured at fair value through other comprehensive income, finally yielding 200 valid records. Feature engineering, comprising feature extraction and feature selection, is then performed to extract effective features related to financial asset classification. The most representative and important features are selected, namely the business model under which the company manages the financial assets and the contractual cash flow characteristics of the financial assets, in order to reduce model complexity and improve classification accuracy.
(1) The business model and the contractual cash flows are taken as the feature values x, and the output variable is the classification of the financial asset: 1 denotes financial assets measured at amortized cost, 2 denotes financial assets measured at fair value with changes recognized in other comprehensive income, and 3 denotes financial assets measured at fair value with changes recognized in current profit or loss. To reduce the bias caused by special cases, stocks designated as non-trading equity instrument investments measured at fair value through other comprehensive income are removed in the data preprocessing stage, and the remaining stocks are classified directly as class 3 assets.
(2) A financial asset classification judgment is made for the collected stocks and bonds. For example, for a bond that passes the cash flow test and whose business model is 1, the vector is x = [1, 0, …]. For a bond that passes the cash flow test and whose business model is 2, the vector is x = [0, 1, 0, …]. Likewise, if the bond does not meet the condition of passing the cash flow test and its business model is another business model, the vector is x = [0, 1, …].
(3) Preferably, in step 3, unbalanced samples are processed by using an HSMOTE algorithm, and the HSMOTE method inserts artificial samples into a few samples to reduce the excessive inclination degree of data, so as to improve the prediction accuracy of the model. The HSMOTE is provided on the basis of SMOTE and is designed according to different types of sample feature vectors, one-hot coding is adopted in the method, so that Boolean type data and floating point type data are mixed in the vectors, and the Boolean type data fluctuation range is easy to be larger by directly adopting SMOTE processing, and the model construction is influenced.
The HSMOTE algorithm steps are as follows:
For each sample X in the minority class, calculate the Euclidean distance from X to every other sample in the minority class, find the K nearest neighbor samples, and record the neighbor indices;
1) Set the sampling rate vector $N = (\omega_1, \omega_2, \omega_3, \ldots, \omega_n)$ of the minority class samples according to the imbalance ratio between the minority class samples and the majority class samples. For every minority class sample X, randomly select $X_i$ ($i = 1, 2, \ldots, N$) from its K nearest neighbor samples; the feature vector of the i-th sample can be expressed as $x_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in})$. The inter-sample variability is taken into account, using the average of all samples for the j-th feature.
2) Consider the conflict between samples, where $r_{ij}$ is the correlation coefficient between the i-th index and the j-th index.
3) Consider the fluctuation coefficient f of the different data types; through testing, the fluctuation coefficients of the Boolean and floating-point features of a sample are set to 0.5 and 2, respectively.
The weighting is then computed.
4) Each neighbor sample $X_i$ is combined with the original sample X according to $X_{new} = X + \mathrm{rand}(0,1) \times N \odot (X_i - X)$ to synthesize a new sample, where $N \odot (X_i - X)$ is the Hadamard product of the vectors;
5) The synthesized new samples and the original training sample set are merged into a new training sample set, on which the model is trained.
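The neighbor search and the synthesis formula of step 4) can be sketched as follows. This is a minimal illustration only: the helper names `k_nearest` and `synthesize` are hypothetical, and the sampling rate vector is passed in as a given input with made-up values, because the weighting formula that produces it is not reproduced in the text.

```python
import math
import random


def k_nearest(index, minority, k):
    """Indices of the k minority samples closest (Euclidean) to minority[index]."""
    dists = [
        (math.dist(minority[index], other), j)
        for j, other in enumerate(minority)
        if j != index
    ]
    dists.sort()
    return [j for _, j in dists[:k]]


def synthesize(x, neighbor, rate, rng=random):
    """X_new = X + rand(0,1) * N (Hadamard) (X_i - X) for one neighbor X_i.

    `rate` plays the role of the sampling rate vector N; how it is derived
    is an assumption left outside this sketch.
    """
    if not (len(x) == len(neighbor) == len(rate)):
        raise ValueError("x, neighbor and rate must have the same length")
    u = rng.random()  # a single draw from [0, 1) shared by all features
    return [xj + u * wj * (nj - xj) for xj, nj, wj in zip(x, neighbor, rate)]


if __name__ == "__main__":
    # Two one-hot (Boolean) features followed by one floating-point feature.
    minority = [[1.0, 0.0, 10.0], [1.0, 0.0, 14.0], [0.0, 1.0, 30.0]]
    rate = [0.2, 0.2, 1.0]  # hypothetical per-feature rates
    j = k_nearest(0, minority, k=1)[0]
    print(synthesize(minority[0], minority[j], rate))
```

Because the rate is applied feature by feature, a small entry for a Boolean feature keeps the synthetic value close to the original 0 or 1, which matches the motivation given for HSMOTE.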
FOA performs optimization by simulating the foraging behavior of a fruit fly (Drosophila) swarm, based on the swarm's cooperative mechanism. The algorithm consists of two parts, visual search and olfactory search, and its only key parameters are the population size and the maximum number of iterations. Compared with other intelligent algorithms, FOA is easier to understand and operate and has strong local search capability; it has been applied in fields such as the multi-knapsack problem, financial crisis early warning, neural network parameter optimization, and logistics services. The specific operation flow is as follows:
1) Set the population size sizepop, the maximum number of iterations maxgen, the fruit fly swarm position range LR, the single-flight range FR of a fruit fly, and other relevant parameter values. The position of each individual in the swarm is given in (X, Y) coordinates, and the initial position is: $X_{axis} = \mathrm{rand}(LR)$, $Y_{axis} = \mathrm{rand}(LR)$.
2) Give each fruit fly in the swarm a random flight direction and distance; the new position of individual i is: $X_i = X_{axis} + \mathrm{rand}(FR)$, $Y_i = Y_{axis} + \mathrm{rand}(FR)$.
3) Calculate the distance $DIST_i$ of the individual's position from the origin: $DIST_i = \sqrt{X_i^2 + Y_i^2}$.
4) Calculate the taste concentration determination value $S_i$ and the taste concentration value $Smell_i$ of each fruit fly in the swarm: $S_i = 1/DIST_i$, $Smell_i = \mathrm{fitness}(S_i)$, where fitness is the fitness function or objective function;
5) Select the fruit fly with the best taste concentration in the current population and record its taste concentration value and position:
[bestSmell,bestIndex]=min(Smell)
6) The other fruit flies in the swarm move toward that position according to the best taste concentration value and the corresponding position information:
SmellBest=bestSmell
X_axis=X(bestIndex)
Y_axis=Y(bestIndex)
7) Repeat sub-steps 2) to 6) until the number of iterations reaches maxgen.
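The seven sub-steps above can be sketched in a few lines of Python. This is an illustrative reading, not the patented implementation: the function name `foa_minimize` is hypothetical, rand(LR) and rand(FR) are assumed to be uniform draws in [-LR, LR] and [-FR, FR], and the swarm is assumed to move only when the best fly of a generation improves on the best value seen so far.

```python
import math
import random


def foa_minimize(fitness, sizepop=20, maxgen=100, lr=10.0, fr=1.0, seed=None):
    """Minimize fitness(S) with a basic fruit fly optimization loop."""
    rng = random.Random(seed)
    # Step 1: random initial swarm position.
    x_axis = rng.uniform(-lr, lr)
    y_axis = rng.uniform(-lr, lr)
    best_s, best_smell = None, math.inf
    for _ in range(maxgen):  # Step 7: repeat until maxgen is reached.
        flies = []
        for _ in range(sizepop):
            # Step 2: random flight direction and distance.
            xi = x_axis + rng.uniform(-fr, fr)
            yi = y_axis + rng.uniform(-fr, fr)
            # Step 3: distance from the origin.
            dist = math.hypot(xi, yi)
            if dist == 0.0:
                continue  # S = 1/DIST would be undefined
            # Step 4: concentration determination value and concentration.
            s = 1.0 / dist
            flies.append((fitness(s), xi, yi, s))
        if not flies:
            continue
        # Step 5: best fly of this generation (smallest Smell).
        smell, xi, yi, s = min(flies)
        # Step 6: the swarm gathers at that position if it is an improvement.
        if smell < best_smell:
            best_smell, best_s = smell, s
            x_axis, y_axis = xi, yi
    return best_s, best_smell


if __name__ == "__main__":
    # Toy objective whose minimum lies at S = 0.5.
    print(foa_minimize(lambda s: (s - 0.5) ** 2, seed=1))
```

In the setting described here, `fitness` would wrap the training and scoring of an SVM for the parameters decoded from S; the toy quadratic above only stands in for that objective.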
In the final FOA-HSMOTE-SVM computation, assets measured at amortized cost are taken as the majority class samples $S_{maj}$, and assets measured at fair value with changes recorded in other comprehensive income are taken as the minority class samples $S_{min}$. The specific steps are as follows:
1) For each sample point $(X_{smin}, Y_{smin})$ in $S_{min}$, compute its neighbors and randomly extract one, with $|S_{maj} - S_{min}|/2$ neighbors drawn in total; multiply the difference between the neighbor and the original sample point $(X_{smin}, Y_{smin})$ by a random number δ in [0,1] and add the result to the original sample point, thereby obtaining a new credit risk sample;
2) Repeat 1) until the number of artificially synthesized credit risk samples reaches $|S_{maj} - S_{min}|/2$;
3) Initialize the model parameters: select the kernel function parameter g and the penalty coefficient C of the SVM, determine the objective function of the fruit fly taste concentration determination function, and set the number of iterations maxgen and the population size sizepop of the fruit fly optimization algorithm, together with termination parameters such as bestSmell, where maxgen is 100 and sizepop is 20;
4) Optimize the parameters of the SVM early warning model with FOA: calculate the fruit fly taste concentration value $Smell_i$ from the two formulas $DIST_i = \sqrt{X_i^2 + Y_i^2}$ and $S_i = 1/DIST_i$, and iterate;
5) Terminate the algorithm when bestSmell is smaller than the specified value, obtain the parameters with the optimal concentration value, and input the optimal parameters together with $x_{new}$ as the sample set into the FOA-HSMOTE-SVM financial asset classification model.
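Steps 1) and 2) of this procedure, the generation of $|S_{maj} - S_{min}|/2$ synthetic minority samples, might look like the sketch below. The function name `oversample_minority`, the neighbor count `k`, and the use of integer division for the target count are assumptions made for illustration; plain interpolation is shown here, without the per-feature weighting of HSMOTE.

```python
import math
import random


def oversample_minority(s_min, n_maj, k=5, seed=None):
    """Create |S_maj - S_min| // 2 synthetic points by neighbor interpolation."""
    if len(s_min) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    target = abs(n_maj - len(s_min)) // 2
    synthetic = []
    while len(synthetic) < target:  # Step 2: repeat until the count is reached.
        p = rng.choice(s_min)
        # Step 1: nearest neighbors of p within the minority class.
        neighbors = sorted(
            (q for q in s_min if q is not p),
            key=lambda q: math.dist(p, q),
        )[:k]
        q = rng.choice(neighbors)
        delta = rng.random()  # random number in [0, 1)
        # New sample = original point + delta * (neighbor - original point).
        synthetic.append([pj + delta * (qj - pj) for pj, qj in zip(p, q)])
    return synthetic


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    for row in oversample_minority(minority, n_maj=11, k=2, seed=0):
        print(row)
```

With 3 minority samples and 11 majority samples the sketch produces 4 synthetic points, each lying on a segment between two existing minority samples.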
(4) Preferably, in step 4, the data set, which contains 200 records in total, is divided into a training set and a test set at a ratio of 70%: the training set contains 140 records and the test set contains 60 records. For the linearly inseparable case, the improved support vector machine algorithm takes slack variables into account and introduces a penalty factor C.
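For orientation, the textbook soft-margin formulation with slack variables and a penalty factor C is given below. It is offered as the standard form only, on the assumption that the expression intended here is of this kind; the symbols $w$, $b$, $\xi_i$ and $\phi$ follow common SVM notation and are not taken from the text.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad
\xi_i \ge 0,\qquad i = 1,\dots,n .
```

A larger C penalizes margin violations more heavily, while a smaller C tolerates more misclassified training points in exchange for a wider margin.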
Meanwhile, Bayesian optimization can be incorporated into the model, where w ranges over all possible models, x is the input of the model, y is the output of the model, m is the mean of the model parameters, and Σ is the covariance matrix between models. Using a Gaussian process:
$P(y \mid x, D) = \int P(y \mid x, w)\,P(w \mid D)\,dw \sim N(m, \Sigma)$
the final hyperparameters are further searched for and determined. Finally, the model is evaluated and optimized using methods such as cross-validation, with indexes including accuracy, recall, and F1 score. The model is compared with an SVM and an FOA-SVM whose hyperparameters are obtained by grid search, and the results show that the model has better generalization capability on unknown data.
Preferably, in step 5, the trained SVM model is used to predict and classify the financial asset data in the test set. The prediction and classification results show that the accuracy of the model reaches up to 98%, indicating that the model has an excellent classification effect.
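The 70% split and the evaluation indexes mentioned above (accuracy, recall, F1) can be illustrated with the following sketch. The helper names `split_70_30` and `evaluate` are hypothetical, and macro-averaging over the three asset classes is an assumption, since the text does not state how the per-class values are aggregated.

```python
import random


def split_70_30(records, ratio=0.7, seed=None):
    """Shuffle the records and split them into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true, y_pred, labels=(1, 2, 3)):
    """Return (accuracy, macro recall, macro F1) for the given class labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    recalls, f1s = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    return accuracy, sum(recalls) / len(labels), sum(f1s) / len(labels)


if __name__ == "__main__":
    train, test = split_70_30(range(200), seed=0)
    print(len(train), len(test))
    print(evaluate([1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3]))
```

Running the same `evaluate` call on the predictions of each candidate model (SVM with grid search, FOA-SVM, and the proposed model) gives the side-by-side comparison described in the text.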
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (12)
1. A financial asset classification method based on a support vector machine algorithm, characterized by comprising the following steps:
step 1, collecting data: collecting financial market related data, including company financial statement data, bond and stock related data, company operating data, and industry and market data;
step 2, preprocessing data: performing data cleaning, sample reconstruction, feature extraction and feature selection operations on the related data collected in step 1 to extract effective features related to the classification of financial assets;
step 3, constructing a financial asset classification model:
optimizing the SVM parameters based on the fruit fly optimization algorithm, wherein, in the discriminant function, $K(x_i, y_i)$ is the kernel function, $x_i$ and $y_i$ respectively represent different feature values of the samples, b is a constant, and $a_i$ is a Lagrange multiplier, $i = 1, 2, \ldots, n$;
performing global optimization on the kernel function $K(x_i, y_i)$ to construct the financial asset classification model;
step 4, feeding the data preprocessed in step 2 into the financial asset classification model constructed in step 3, and comparing its accuracy, recall and F1 value with those of a model constructed without the sample reconstruction of step 2;
step 5, selecting the financial asset classification model with the better indexes obtained from the comparison in step 4.
2. The financial asset classification method of claim 1, wherein reconstructing the samples in step 2 comprises the following sub-steps:
step 2-a-1, processing unbalanced samples with the HSMOTE algorithm: for each sample X in the minority class, calculating the Euclidean distance to every other sample, finding the K nearest neighbor samples, and recording the neighbor indices;
step 2-a-2, setting the sampling rate vector $N = (\omega_1, \omega_2, \omega_3, \ldots, \omega_n)$ of the minority class samples according to the imbalance ratio between the minority class samples and the majority class samples; for every minority class sample X, randomly selecting $X_i$ ($i = 1, 2, \ldots, N$) from its K nearest neighbor samples, the feature vector of the i-th sample being denoted $x_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in})$;
step 2-a-3, considering the inter-sample variability, using the average of all samples for the j-th feature;
step 2-a-4, considering the conflict between samples, where $r_{ij}$ is the correlation coefficient between the i-th index and the j-th index;
step 2-a-5, considering the fluctuation coefficient f of the different data types, and setting, through testing, the fluctuation coefficients of the Boolean and floating-point features of a sample to 0.5 and 2, respectively;
computing the weighting;
step 2-a-6, combining each neighbor sample $X_i$ with the original sample X according to $X_{new} = X + \mathrm{rand}(0,1) \times N \odot (X_i - X)$ to synthesize a new sample, where $N \odot (X_i - X)$ is the Hadamard product of the vectors.
3. The financial asset classification method of claim 1, wherein the data collected in step 1 comprise:
1) company financial statement data: including indexes such as operating revenue and net profit;
2) bond and stock related data: including bond terms and conditions and bond repayment plans;
3) company operating data: including sales contract data and supplier contract data;
4) industry and market data: including industry reports and research.
4. The financial asset classification method of claim 1, wherein the data cleaning in step 2 comprises the following sub-steps:
step 2-b-1, performing data cleaning on the acquired data, removing outliers and missing data, selecting 200 valid records, and dividing them at a ratio of 70%, the training set comprising 140 records and the test set comprising 60 records;
step 2-b-2, selecting the business model and the contractual cash flow as the input variables, the final output variable being the classification of the financial asset, wherein financial assets are classified into three types: 1 represents financial assets measured at amortized cost, 2 represents financial assets measured at fair value with changes recorded in other comprehensive income, and 3 represents financial assets measured at fair value with changes recorded in current profit or loss.
5. The financial asset classification method of claim 1, wherein, for the linearly inseparable case, the support vector machine algorithm takes slack variables into account and introduces a penalty factor C.
6. The financial asset classification method according to any one of claims 1 to 4, wherein step 3 comprises the following sub-steps:
step 3-1, setting parameter values such as the population size sizepop, the maximum number of iterations maxgen, the fruit fly swarm position range LR, and the single-flight range FR of a fruit fly; the position of each individual in the swarm is given in (X, Y) coordinates, and the initial position is: $X_{axis} = \mathrm{rand}(LR)$, $Y_{axis} = \mathrm{rand}(LR)$;
step 3-2, giving each fruit fly in the swarm a random flight direction and distance, the new position of individual i being: $X_i = X_{axis} + \mathrm{rand}(FR)$, $Y_i = Y_{axis} + \mathrm{rand}(FR)$;
step 3-3, calculating the distance $DIST_i$ of the individual's position from the origin: $DIST_i = \sqrt{X_i^2 + Y_i^2}$;
step 3-4, calculating the taste concentration determination value $S_i$ and the taste concentration value $Smell_i$ of each fruit fly in the swarm: $S_i = 1/DIST_i$, $Smell_i = \mathrm{fitness}(S_i)$, wherein fitness is the discriminant function of step 3;
step 3-5, selecting the fruit fly with the best taste concentration in the current population, and recording its taste concentration value and position:
[bestSmell,bestIndex]=min(Smell)
step 3-6, moving the other fruit flies in the swarm toward that position according to the best taste concentration value and the corresponding position information:
SmellBest=bestSmell
X_axis=X(bestIndex)
Y_axis=Y(bestIndex)
step 3-7, repeating sub-steps 3-2 to 3-6 until the number of iterations reaches maxgen.
7. The financial asset classification method of claim 5, wherein assets measured at amortized cost are taken as the majority class samples $S_{maj}$ and assets measured at fair value with changes recorded in other comprehensive income are taken as the minority class samples $S_{min}$, and the method for constructing the financial asset classification model FOA-SMOTE-SVM comprises the following steps:
1) for each sample point $(X_{smin}, Y_{smin})$ in $S_{min}$, computing its neighbors and randomly extracting one, with $|S_{maj} - S_{min}|/2$ neighbors drawn in total; multiplying the difference between the neighbor and the original sample point $(X_{smin}, Y_{smin})$ by a random number δ in [0,1] and adding the result to the original sample point, thereby obtaining a new credit risk sample;
2) repeating 1) until the number of artificially synthesized credit risk samples reaches $|S_{maj} - S_{min}|/2$;
3) initializing the model parameters: selecting the kernel function parameter g and the penalty coefficient C of the SVM, determining the objective function of the fruit fly taste concentration determination function, and setting the number of iterations maxgen and the population size sizepop of the fruit fly optimization algorithm, together with termination parameters such as bestSmell, where maxgen is 100 and sizepop is 20;
4) optimizing the parameters of the SVM early warning model with FOA: calculating the fruit fly taste concentration value $Smell_i$ from the two formulas $DIST_i = \sqrt{X_i^2 + Y_i^2}$ and $S_i = 1/DIST_i$, and iterating;
5) terminating the algorithm when bestSmell is smaller than the specified value, obtaining the parameters with the optimal concentration value, and inputting the optimal parameters together with $x_{new}$ as the sample set into the FOA-SMOTE-SVM financial asset classification model.
8. The financial asset classification method of claim 6, wherein the constructed financial asset classification model incorporates Bayesian optimization, where w ranges over all possible models, x is the input of the model, y is the output of the model, m is the mean of the model parameters, and Σ is the covariance matrix between models, using a Gaussian process:
$P(y \mid x, D) = \int P(y \mid x, w)\,P(w \mid D)\,dw \sim N(m, \Sigma)$
to further search for and determine the final hyperparameters.
9. The financial asset classification method of claim 1, wherein step 4 comprises feeding the test set data into the financial asset classification method to evaluate the classification effect, and comparing the accuracy, recall and F1 value with those of an SVM and an FOA-SVM whose hyperparameters are obtained by grid search.
10. The financial asset classification method according to claim 3 or 8, characterized in that, in order to reduce the bias caused by special cases among the stocks, stocks designated as non-trading equity instrument investments measured at fair value with changes recorded in other comprehensive income are removed in the data preprocessing stage, and the remaining stocks are classified directly as class 3 assets.
11. A financial asset classification system, comprising:
a data collection unit, used for collecting financial market related data, including company financial statement data, bond and stock related data, company operating data, and industry and market data;
a data preprocessing unit, used for performing data cleaning, sample reconstruction, feature extraction and feature selection operations on the related data collected by the data collection unit, so as to extract effective features related to financial asset classification;
a financial asset classification model construction unit, used for constructing a financial asset classification model based on FOA-HSMOTE-SVM;
and an evaluation unit, used for feeding the data preprocessed by the data preprocessing unit into the constructed financial asset classification model and evaluating it against the model constructed from samples that have not been balanced.
12. A computer readable storage medium having stored therein at least one executable instruction that when executed on an electronic device causes the electronic device to perform the operations of the financial asset classification method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310661037.8A CN116842454B (en) | 2023-06-06 | 2023-06-06 | Financial asset classification method and system based on support vector machine algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310661037.8A CN116842454B (en) | 2023-06-06 | 2023-06-06 | Financial asset classification method and system based on support vector machine algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116842454A (en) | 2023-10-03 |
CN116842454B (en) | 2024-04-30 |
Family
ID=88164339
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310661037.8A Active CN116842454B (en) | 2023-06-06 | 2023-06-06 | Financial asset classification method and system based on support vector machine algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116842454B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8463053B1 (en) * | 2008-08-08 | 2013-06-11 | The Research Foundation Of State University Of New York | Enhanced max margin learning on multimodal data mining in a multimedia database |
CN103941131A (en) * | 2014-05-14 | 2014-07-23 | 国家电网公司 | Transformer fault detecting method based on simplified set unbalanced SVM (support vector machine) |
CN108229750A (en) * | 2018-01-17 | 2018-06-29 | 河南理工大学 | A kind of stock yield Forecasting Methodology |
CN109002918A (en) * | 2018-07-16 | 2018-12-14 | 国网浙江省电力有限公司经济技术研究院 | Based on drosophila optimization algorithm-support vector machines electricity sales amount prediction technique |
CN109299741A (en) * | 2018-06-15 | 2019-02-01 | 北京理工大学 | A kind of network attack kind identification method based on multilayer detection |
CN110223193A (en) * | 2019-03-27 | 2019-09-10 | 东北电力大学 | The method of discrimination and system of operation of power networks state are used for based on fuzzy clustering and RS-KNN model |
CN111639695A (en) * | 2020-05-26 | 2020-09-08 | 温州大学 | Method and system for classifying data based on improved drosophila optimization algorithm |
CN114742177A (en) * | 2022-06-08 | 2022-07-12 | 南京信息工程大学 | Meteorological data classification method based on AGA-XGboost and GWO-SVM |
US20220222931A1 (en) * | 2019-06-06 | 2022-07-14 | NEC Laboratories Europe GmbH | Diversity-aware weighted majority vote classifier for imbalanced datasets |
Non-Patent Citations (3)
Title |
---|
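JIE SUN et al.: "Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting", INFORMATION FUSION, vol. 54 * |
张晴丽: "Research on credit risk early warning of Chinese listed manufacturing companies based on FOA-SMOTE-SVM", 中国优秀硕士论文电子期刊, no. 2, pages 24 - 25 * |
林浩;李雷孝;王慧;: "A survey of research and applications of support vector machines in intelligent transportation systems", 计算机科学与探索, no. 06 * |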
Also Published As
Publication number | Publication date |
---|---|
CN116842454B (en) | 2024-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR20010102452A (en) | Methods and systems for finding value and reducing risk | |
CN111783829A (en) | Financial anomaly detection method and device based on multi-label learning | |
Todorovic et al. | Improving audit opinion prediction accuracy using metaheuristics-tuned XGBoost algorithm with interpretable results through SHAP value analysis | |
CN112700324A (en) | User loan default prediction method based on combination of Catboost and restricted Boltzmann machine | |
Özorhan et al. | Short-term trend prediction in financial time series data | |
Cao et al. | Bond rating using support vector machine | |
Chimonaki et al. | Identification of financial statement fraud in Greece by using computational intelligence techniques | |
Qadi et al. | Explaining credit risk scoring through feature contribution alignment with expert risk analysts | |
Reddy et al. | Machine Learning based Loan Eligibility Prediction using Random Forest Model | |
Rofik et al. | The Optimization of Credit Scoring Model Using Stacking Ensemble Learning and Oversampling Techniques | |
CN110322347A (en) | A kind of shot and long term strategy multiple-factor quantization capitalized method and device | |
Awad et al. | Using data mining tools to prediction of going concern on auditor opinion-empirical study in iraqi commercial | |
Alizadeh et al. | Introducing a hybrid data mining model to evaluate customer loyalty | |
CN116842454B (en) | Financial asset classification method and system based on support vector machine algorithm | |
Wu et al. | Customer churn prediction for commercial banks using customer-value-weighted machine learning models | |
Alireza et al. | Evaluation of the Financial Ratio Capability to Predict the Financial Crisis of Companies. | |
Terzi et al. | Comparison of financial distress prediction models: Evidence from turkey | |
Hiwase et al. | Review on application of data mining in life insurance | |
Marouani et al. | Predictive Modeling to Investigate and Forecast Customer Behaviour in the Banking Sector | |
Zekić-Sušac et al. | Selecting neural network architecture for investment profitability predictions | |
Nandi et al. | Susceptibility of Changes in Public Sector Banks with Special Emphasis on Some Newly Merged Banks: Focus on Cash Flow Risk | |
Horemuz | Application of Machine Learning to Financial Trading | |
Davalos et al. | Deriving rules for forecasting air carrier financial stress and insolvency: A genetic algorithm approach | |
Cheng | Predicting stock returns by decision tree combining neural network | |
Özorhan | Forecasting direction of exchange rate fluctuations with two dimensional patterns and currency strength |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |