CN115239502A

CN115239502A - Analyst simulation method, analyst simulation system, electronic device and storage medium

Info

Publication number: CN115239502A
Application number: CN202210856513.7A
Authority: CN
Inventors: 胡志勇; 胡立昂; 苏振伟
Original assignee: Guangzhou Smart Finance And Taxation Technology Co ltd
Current assignee: Guangzhou Smart Finance And Taxation Technology Co ltd
Priority date: 2022-07-20
Filing date: 2022-07-20
Publication date: 2022-10-25

Abstract

The invention relates to the field of financial data analysis and processing, and provides an analyst simulation method, an analyst simulation system, an electronic device and a storage medium. The analyst simulation method is applied to an analyst simulation system configured in the electronic device. Surplus analysis data of existing analysts is associated and aligned with accounting information data to form a basic database, which is divided into a training data set and a test data set; a machine learning model is constructed, trained with the training data set and tested with the test data set; and surplus data analysis is performed with the tested machine learning model. The invention enables effective artificial intelligence analysis of the performance of companies that lack analyst tracking, and provides reference information to help investors make investment decisions.

Description

Analyst simulation method, analyst simulation system, electronic device and storage medium

Technical Field

The invention relates to the field of financial data analysis and processing, in particular to an analyst simulation method, an analyst simulation system, electronic equipment and a storage medium.

Background

With the continued development of the capital market and the growing influence of the financial system on economic growth, the role and function of securities analysts (hereinafter, analysts) have attracted increasing attention. Analysts collect data and information on listed companies, use professional knowledge to analyze the future prospects of those companies and their industries, and produce reference suggestions for investment decisions. Analysts' information and recommendations, and in particular the surplus (earnings) analysis they give, are an important basis on which many investors make investment decisions.

China has many listed companies, but analysts track and provide analysis information for only a portion of them. Many listed companies therefore lack analysts who can track them and provide analysis information quickly and effectively, which is a significant obstacle for domestic investors making investment decisions.

Disclosure of Invention

Statistical and computer techniques are maturing, and artificial intelligence methods such as machine learning can provide effective solutions to the above problems through data-driven inference. The invention collects the analysis data of existing analysts in real time and, by means of artificial intelligence models and methods such as machine learning, analyzes the performance of companies that lack analyst tracking.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

an analyst simulation method applied to an analyst simulation system configured in an electronic device communicatively connected to a first device and communicatively connected to a second device, the method comprising:

acquiring surplus analysis data of an existing analyst from the first equipment, and acquiring accounting information data from the second equipment;

the method comprises the steps that existing analyst surplus analysis data and accounting information data are aligned in a correlation mode to form a basic database, and the basic database is divided into a training data set and a testing data set;

constructing a machine learning model, training the machine learning model by using a training data set, and testing the trained machine learning model by using a test data set;

and analyzing and predicting surplus data by using the tested machine learning model.

Furthermore, before the existing analyst surplus analysis data is associated and aligned with the accounting information data, screening and averaging processing is performed on the existing analyst surplus analysis data, and dimension reduction, missing value processing and normalization are performed on the accounting information data.

Further, the analyst simulation method further comprises the following steps:

constructing a filling data set and a full analysis data set according to data obtained by surplus data analysis and prediction;

and carrying out validity verification on the filling data set and the full analysis data set.

Further, the associating and aligning the surplus analysis data of the existing analysts with the accounting information data to form the basic database refers to associating and aligning the surplus analysis data of the existing analysts based on the same listed company, the same year and the same quarter with the accounting information data to form the basic database with the associated data of the listed company having the surplus analysis data of the existing analysts.

Further, the machine learning models are modeled separately by year and industry to form model groups.

Further, the screening and averaging processing of the existing analyst surplus analysis data means that, after a listed company publishes a quarterly financial statement, the corresponding analysts' surplus analysis data for the following years are screened out and, based on this data, the average of all analysts' surplus analyses for that quarter is adopted.

Further, the processing of missing values of the accounting information data refers to the processing of filling up the missing of part of index data of part of companies, and includes the following steps:

s1: replacing the missing value with the index mean value of the company;

s2: and if the step S1 fails, replacing the missing value by the index mean value of the industry where the company is located.

Further, the padding data set is a data set formed by adding data of a listed company having analyst surplus analysis data to data obtained by surplus analysis using a machine learning model performed on a listed company having no analyst surplus analysis data.

Further, the full-analysis data set refers to a data set obtained by performing surplus analysis on all listed companies by using a machine learning model.

Further, the validity verification of the filling data set and the full analysis data set means that convergence tests are respectively performed on the existing analyst surplus analysis data, the filling data set and the full analysis data set in the basic database by using a Spearman Rank correlation coefficient test method and a Pearson correlation coefficient test method, and the surplus analysis of the machine learning model is proved to be effective by the convergence.

Further, the analyst simulation method further comprises the following steps:

the method comprises the steps of obtaining information of product technology development, industry technology and market development trend of listed companies, identifying product characteristics and industry status of the listed companies, forming industry prospect texts of a single company, integrating surplus analysis data of a machine learning model and the industry prospect texts of the company in a plurality of years in the future, and forming company and industry research reports in a quarterly period.

An analysts simulation system configured in an electronic device communicatively coupled to a first device and communicatively coupled to a second device, comprising:

an acquisition component for acquiring surplus analysis data of an existing analyst from the first device and accounting information data from the second device;

the correlation component is used for correlating and aligning the surplus analysis data of the existing analysts and the accounting information data to form a basic database, and dividing the basic database into a training data set and a testing data set;

the model component is used for constructing a machine learning model, training the machine learning model by utilizing a training data set and testing the trained machine learning model by utilizing a testing data set;

an analysis component for utilizing the tested machine learning model for surplus data analysis prediction.

Furthermore, the analyst simulation system further comprises a data preprocessing component, which is used for screening and equalizing the existing analyst surplus analysis data before the existing analyst surplus analysis data is associated and aligned with the accounting information data, and performing dimension reduction, missing value processing and normalization on the accounting information data.

Further, the analyst simulation system also comprises a verification component, which is used for constructing a filling data set and a full analysis data set according to data obtained by performing surplus data analysis and prediction; and carrying out validity verification on the filling data set and the full analysis data set.

Further, the associating and aligning the existing analyst surplus analysis data with the accounting information data to form the basic database means that the existing analyst surplus analysis data based on the same listed company, the same year and the same quarter is associated and aligned with the accounting information data to form the basic database with the associated data of the listed company having the existing analyst surplus analysis data.

Further, the processing of the missing value of the accounting information data refers to the filling up of the missing of part of the index data of part of companies, and includes the following steps:

s1: replacing the missing value with the index mean value of the company;

s2: and if the step S1 fails, replacing the missing value by using the index mean value of the industry where the company is located.

The filling data set is a data set formed by adding data of a listed company having analyst surplus analysis data to data obtained by surplus analysis using a machine learning model performed on a listed company having no analyst surplus analysis data.

Further, the validity verification of the filled data set and the full analysis data set means that the existing analyst surplus analysis data, the filled data set and the full analysis data set in the basic database are subjected to convergence test by a Spearman Rank correlation coefficient test method and a Pearson correlation coefficient test method respectively, and the surplus analysis of the machine learning model is proved to be effective by convergence.

Furthermore, the analyst simulation system further comprises a reporting component, wherein the reporting component is used for acquiring information of product technology development, industry technology and market development trend of listed companies, identifying product characteristics and industry status of the listed companies, forming a single company industry prospect text, integrating surplus analysis data of a machine learning model and the industry prospect text of the company in a plurality of years in the future, and forming company and industry research reports in a quarterly period.

An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the analyst simulation method described above.

A computer-readable storage medium storing a computer program which, when executed by a processor, implements the analyst simulation method described above.

The invention relates to the field of financial data analysis and processing, and provides an analyst simulation method, an analyst simulation system, an electronic device and a storage medium. The analyst simulation method is applied to an analyst simulation system configured in the electronic device, and the electronic device is communicatively connected to a first device and to a second device. The method and the system acquire surplus analysis data of existing analysts from the first device and accounting information data from the second device; associate and align the existing analyst surplus analysis data with the accounting information data to form a basic database, and divide the basic database into a training data set and a test data set; construct a machine learning model based on an AdaBoost regression tree model, train it with the training data set and test the trained model with the test data set; and perform surplus data analysis with the tested machine learning model. The invention enables effective artificial intelligence analysis of the performance of companies that lack analyst tracking. By combining the performance data from this analysis with automatic text technology, it also automatically provides company and industry research reports, giving investors timely and comprehensive reference information when they make investment decisions.

Drawings

FIG. 1 is a flow chart of an analyst simulation method;

FIG. 2 is a regression flow chart of the AdaBoost algorithm.

Detailed Description

The embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

The embodiments of the present disclosure are described below with specific examples, and other advantages and effects of the present disclosure will be readily apparent to those skilled in the art from the disclosure in the specification. It is to be understood that the described embodiments are merely illustrative of some, and not restrictive, of the embodiments of the disclosure. The disclosure may be carried into practice or applied to various other specific embodiments, and various modifications and changes may be made in the details within the description and the drawings without departing from the spirit of the disclosure. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

Example one

At present, statistical and computer techniques are developing rapidly, and artificial intelligence methods such as machine learning can provide effective solutions through data-driven inference. The invention collects the analysis data of existing analysts in real time and, by means of artificial intelligence models and methods such as machine learning, analyzes the performance of companies that lack analyst tracking.

Fig. 1 is a flowchart of an analysts simulation method applied to an analysts simulation system, where the analysts simulation system is configured in an electronic device, the electronic device is communicatively connected to a first device, and the electronic device is communicatively connected to a second device, and the method includes:

constructing a machine learning model, training the machine learning model by using a training data set, and testing the trained machine learning model by using a test data set; the machine learning model can be constructed based on an AdaBoost regression tree model, a support vector machine, an adaptive group Lasso and the like, and can also be constructed based on a convolutional neural network, a recurrent neural network, a long short-term memory model and the like;

In specific implementation, before the existing analyst surplus analysis data is associated and aligned with the accounting information data, screening and averaging processing is performed on the existing analyst surplus analysis data, and dimension reduction, missing value processing and normalization are performed on the accounting information data.

In a specific implementation, the analyst simulation method further includes the following steps:

In a specific implementation, the associating and aligning the surplus analysis data of the existing analysts with the accounting information data to form the basic database refers to associating and aligning the surplus analysis data of the existing analysts based on the same listed company, the same year and the same quarter with the accounting information data, and forming the basic database with the associated data of the listed company with the surplus analysis data of the existing analysts.

In a specific implementation, the machine learning models are modeled by year and industry respectively to form a model group.

In specific implementation, the screening and averaging processing of the existing analyst surplus analysis data means that, after a listed company publishes a quarterly financial statement, the corresponding analysts' surplus analysis data for the following years are screened out and, based on this data, the average of all analysts' surplus analyses for that quarter is adopted.

In specific implementation, the missing value processing of the accounting information data refers to filling up missing of part of index data of part of companies, and includes the following steps:

s1: replacing the missing value with the index mean value of the company;

In a specific implementation, the filling data set refers to a data set formed by adding data of a listed company having analyst surplus analysis data to data obtained by surplus analysis of a listed company having no analyst surplus analysis data using a machine learning model.

In a specific implementation, the full-analysis data set refers to a data set obtained by performing surplus analysis on all listed companies by using a machine learning model.
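As a minimal sketch of how the two data sets might be assembled, assume the analyst averages and the model outputs are held in plain dictionaries keyed by company. The function name `build_data_sets` and the dictionary layout are illustrative assumptions and are not part of the disclosure.

```python
def build_data_sets(analyst_eps, model_eps):
    """Assemble the filling and full-analysis data sets.

    analyst_eps: {company: analyst mean EPS}, tracked companies only (assumed layout).
    model_eps:   {company: model-simulated EPS}, all listed companies (assumed layout).
    """
    # Full-analysis data set: model output for every listed company.
    full = dict(model_eps)
    # Filling data set: analyst data where it exists, model output otherwise.
    filling = {company: analyst_eps.get(company, value)
               for company, value in model_eps.items()}
    return filling, full


if __name__ == "__main__":
    analyst = {"A": 0.50}               # company A has analyst tracking
    model = {"A": 0.45, "B": 0.20}      # the model covers A and B
    filling, full = build_data_sets(analyst, model)
    print(filling)
    print(full)
```

In this sketch the filling data set keeps the analyst value for company A and falls back to the model value for company B, while the full-analysis data set uses the model value for both.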

In specific implementation, the validity verification of the padding data set and the full-analysis data set means that convergence tests are respectively performed on existing analyst surplus analysis data, the padding data set and the full-analysis data set in the basic database by using a Spearman Rank correlation coefficient test method and a Pearson correlation coefficient test method, and the surplus analysis of the machine learning model is proved to be effective by the convergence.
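The two correlation coefficients named above can be computed as in the following sketch, which uses only the Python standard library. The function names are illustrative assumptions, and the text does not state what coefficient value counts as convergence, so no threshold is applied here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical EPS values, for illustration only.
    analyst_eps = [0.52, 0.31, 1.10, 0.75, 0.08]
    simulated_eps = [0.49, 0.35, 1.02, 0.80, 0.11]
    print(pearson(analyst_eps, simulated_eps))
    print(spearman(analyst_eps, simulated_eps))
```

The same two functions could be applied pairwise to the existing analyst data, the filling data set and the full-analysis data set.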

an association component for associating and aligning the surplus analysis data of the existing analysts with the accounting information data to form a basic database, and for dividing the basic database into a training data set and a test data set;

the model component is used for constructing a machine learning model, training the machine learning model by utilizing a training data set and testing the trained machine learning model by utilizing a testing data set; the machine learning model can be constructed based on an AdaBoost regression tree model, a support vector machine, an adaptive group Lasso and the like, and can also be constructed based on a convolutional neural network, a recurrent neural network, a long short-term memory model and the like;

In a specific implementation, the analyst simulation system further includes a data preprocessing component, which is configured to perform screening and equalization processing on the surplus analysis data of the analyst, and perform dimension reduction, missing value processing and normalization on the accounting information data before performing association and alignment between the surplus analysis data of the analyst and the accounting information data.

In specific implementation, the analyst simulation system further comprises a verification component, which is used for constructing a filling data set and a full-analysis data set according to data obtained by performing surplus data analysis and prediction; and carrying out validity verification on the filling data set and the full analysis data set.

In specific implementation, the screening and averaging processing of the existing analyst surplus analysis data means that, after a listed company publishes a quarterly financial statement, the corresponding analysts' surplus analysis data for the following years are screened out and, based on this data, the average of all analysts' surplus analyses for that quarter is adopted.

s1: replacing the missing value with the index mean value of the company;

In specific implementation, the validity verification of the filled data set and the full-analysis data set means that the existing analyst surplus analysis data, the filled data set and the full-analysis data set in the basic database are subjected to convergence inspection by a Spearman Rank correlation coefficient inspection method and a Pearson correlation coefficient inspection method respectively, and the surplus analysis of the machine learning model is proved to be effective by convergence.

In a specific implementation, the analyst simulation system further comprises a report component, which is used for acquiring information of product technical development, industry technology and market development trend of the listed companies, identifying product characteristics and industry positions of the listed companies, forming a single company industry prospect text, integrating surplus analysis data of a machine learning model and the industry prospect text of the company for several years in the future, and forming company and industry research reports in a quarterly period.

Example two

In this embodiment, an analyst simulation system based on a group of AdaBoost regression tree models is constructed from existing analyst surplus analysis data and accounting information data. Combined with public capital-market information sources and intelligent text analysis technology, it serves as an example of an intelligent research system that provides company performance analysis within 3 years. First, the basic database is built, starting from obtaining and organizing the analysts' surplus analysis data, filling missing values of the accounting information indexes, and so on. Second, the AdaBoost regression tree model is used for modeling and training to construct the analyst simulation system based on a machine learning model. Third, the convergence of the data produced by the intelligent model simulation is verified with the Spearman rank correlation coefficient test and the Pearson correlation coefficient test to prove the effectiveness of the simulation. Finally, intelligent reports on company and industry performance are provided by intelligent text analysis technology in conjunction with public capital-market information sources.

Fig. 1 is a flowchart of an analysts simulation method applied to an analysts simulation system configured in an electronic device communicatively connected to a first device and a second device, including:

acquiring surplus analysis data of an existing analyst from the first device, and acquiring accounting information data from the second device;

the electronic equipment firstly performs screening and equalization processing on surplus analysis data of the existing analysts through the analyst simulation system, and performs dimension reduction, missing value processing and normalization on accounting information data;

then, the surplus analysis data of the existing analysts and the accounting information data are aligned in a correlation mode to form a basic database, and the basic database is divided into a training data set and a testing data set;

establishing a machine learning model, wherein the machine learning models are built separately by year and by industry to form a model group, then training the machine learning model with the training data set and testing the trained machine learning model with the test data set; the machine learning model can be constructed based on an AdaBoost regression tree model, a support vector machine, an adaptive group Lasso and the like, and can also be constructed based on a convolutional neural network, a recurrent neural network, a long short-term memory model and the like;

finally, surplus data is analyzed and predicted by using the tested machine learning model;

constructing a filling data set and a full analysis data set according to data obtained by performing surplus data analysis and prediction;

carrying out validity verification on the filling data set and the full analysis data set;

obtaining information on the product technology development, industry technology and market development trends of a listed company, identifying the product characteristics and industry position of the listed company, forming an industry prospect text for the single company, and integrating the machine learning model's surplus analysis data for the company over the coming years with the industry prospect text to form quarterly company and industry research reports.

The present embodiment relates to the existing surplus analysis data of analysts. Typically, analysts analyze the annual EPS (earnings per share) of a company for the next three years. It should be noted that this embodiment uses the average of the profit analyses of all analysts in the same period as the existing analysis value, which is the parameter index used to train and test the machine learning model. Taking one listed company as an example, Table 1 shows the EPS data in the company's 2013-2017 reports, where S4 denotes annual report data. Taking the first quarter (S1) of 2014 as an example, analysts analyze the EPS for the three years 2014, 2015 and 2016. Similarly, after the S2 financial report of 2014 is released, analysts still analyze the EPS for 2014, 2015 and 2016. Because the second-quarter 2014 report adds to the surplus information that analysts aggregate, their analysis data are adjusted accordingly. By analogy, the EPS data published in each new financial report changes the analysts' analysis results. In addition, when the year-end financial report is published, the years analyzed move forward by one year. For example, after the 2014 annual report (S4) is released, the analysts' analysis period changes to 2015, 2016 and 2017.

TABLE 1 profit prediction and information consolidation of company reports made by analysts during 2013-2017
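The rolling three-year analysis window described in this embodiment can be sketched as a small helper. The function name `forecast_years` and the integer encoding of quarters S1 to S4 as 1 to 4 are illustrative assumptions.

```python
def forecast_years(report_year, quarter, horizon=3):
    """Return the years analysts cover after a given quarterly report.

    After S1, S2 or S3 of a year the window starts at that year; after the
    annual report (S4) the window moves forward by one year.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    start = report_year + 1 if quarter == 4 else report_year
    return [start + offset for offset in range(horizon)]


if __name__ == "__main__":
    print(forecast_years(2014, 1))  # window after the 2014 S1 report
    print(forecast_years(2014, 4))  # window after the 2014 annual report
```

For the 2014 S1 and S2 reports this gives 2014, 2015 and 2016, and for the 2014 annual report it gives 2015, 2016 and 2017, matching the example in the text.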

In a specific implementation, before the surplus analysis data of the existing analyst is associated and aligned with the accounting information data, screening and averaging processing is performed on the surplus analysis data of the existing analyst, and missing value processing and normalization are performed on the accounting information data, which includes two parts of data processing:

(1) The screening and averaging processing of the existing analyst surplus analysis data means that, after a listed company publishes a quarterly financial statement, the corresponding analysts' surplus analysis data for the following years are screened out, and the average of all analysts' surplus analyses for that quarter is adopted. For example, in this embodiment the surplus analysis data of existing analysts is obtained by data capture. From this data, the analysts' surplus analysis results for the company over the next three years are screened out (after the listed company's financial statements for the different quarters are released) and are denoted analyst EPS_t, analyst EPS_{t+1} and analyst EPS_{t+2}. Because a company may be tracked by several analysts at the same time, the analyst surplus analysis result is taken as the average of all analysts' surplus analysis data, and the existing analyst surplus analysis database is finally constructed (in this embodiment, the number of companies tracked by analysts is set to 1/4 of the listed companies, that is, the existing analyst analysis database holds the data of that 1/4 of listed companies).
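The screening and averaging step might look like the following sketch. The tuple layout `(company, year, quarter, target_year, eps)` and the function name `consensus_eps` are illustrative assumptions, and the offsets 0, 1 and 2 stand for EPS_t, EPS_{t+1} and EPS_{t+2}.

```python
from collections import defaultdict
from statistics import fmean


def consensus_eps(forecasts, horizon=3):
    """Screen forecasts to the three-year window and average across analysts.

    forecasts: iterable of (company, year, quarter, target_year, eps), where
    (year, quarter) identifies the quarterly report the forecast follows.
    Returns {(company, year, quarter, offset): mean EPS}, offset in 0..horizon-1.
    """
    buckets = defaultdict(list)
    for company, year, quarter, target_year, eps in forecasts:
        # After the annual report (quarter 4) the window starts one year later.
        first_year = year + 1 if quarter == 4 else year
        offset = target_year - first_year
        if 0 <= offset < horizon:  # screening: drop forecasts outside the window
            buckets[(company, year, quarter, offset)].append(eps)
    return {key: fmean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    # Hypothetical forecasts from two analysts, for illustration only.
    sample = [
        ("A", 2014, 1, 2014, 0.50),
        ("A", 2014, 1, 2014, 0.75),
        ("A", 2014, 1, 2015, 1.00),
        ("A", 2014, 1, 2018, 9.00),  # outside the window, screened out
    ]
    for key, value in sorted(consensus_eps(sample).items()):
        print(key, value)
```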

(2) In this embodiment, 618 items of accounting and financial information for each quarter of the listed companies are obtained from public information, and the important information that influences analysts' predictions, including but not limited to EPS, the proportion of circulating shares, net assets per share and business income per share, is extracted by dimension reduction methods such as stepwise regression, principal component analysis and partial least squares regression. Because data are missing for a few companies, the missing index values need to be processed. The missing value processing of accounting information data refers to filling in index data that are missing for some companies, and includes the following steps: S1: replacing the missing value with the company's mean value of the index; S2: if step S1 fails, replacing the missing value with the index mean value of the industry in which the company is located.
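Steps S1 and S2 can be sketched as a two-level fallback. The row layout (dictionaries with `company` and `industry` keys, and `None` for a missing value) and the function name `fill_missing` are illustrative assumptions. The sketch treats S1 as failing when the company has no observed value for the index at all, which is one plausible reading of the text.

```python
from statistics import fmean


def fill_missing(rows, indicator):
    """Fill missing values of one indicator: company mean first, then industry mean.

    rows: list of dicts with 'company', 'industry' and the indicator key,
    whose value is a float or None when missing. Returns a new list of dicts.
    """
    by_company, by_industry = {}, {}
    for row in rows:
        value = row[indicator]
        if value is not None:
            by_company.setdefault(row["company"], []).append(value)
            by_industry.setdefault(row["industry"], []).append(value)

    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left unchanged
        if row[indicator] is None:
            if row["company"] in by_company:        # S1: company mean
                row[indicator] = fmean(by_company[row["company"]])
            elif row["industry"] in by_industry:    # S2: industry mean
                row[indicator] = fmean(by_industry[row["industry"]])
            # If both steps fail, the value stays None in this sketch.
        filled.append(row)
    return filled


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    rows = [
        {"company": "A", "industry": "steel", "nav": 1.0},
        {"company": "A", "industry": "steel", "nav": None},
        {"company": "B", "industry": "steel", "nav": None},
    ]
    for row in fill_missing(rows, "nav"):
        print(row)
```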

Secondly, the processed index data are normalized with the following formula:

$$x' = \frac{x - \mu}{\sigma}$$

where $x$ is an index value, $x'$ is its normalized value, $\mu$ is the mean of all sample data and $\sigma$ is the standard deviation of all sample data. Finally, the accounting information database of all listed companies is constructed.
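A normalization by the mean and standard deviation of all samples can be sketched as follows. The text does not say whether the population or the sample standard deviation is meant, so the use of the population form (`statistics.pstdev`) here is an assumption, as is the function name `z_score`.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("cannot normalize a constant indicator")
    return [(value - mu) / sigma for value in values]


if __name__ == "__main__":
    # Hypothetical net-assets-per-share values, for illustration only.
    print(z_score([1.0, 2.0, 3.0]))
```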

Because the surplus analysis data of existing analysts is affected by the information a company discloses and by its industry in the current year, the surplus analysis data in the existing analyst surplus analysis database (1/4 of listed companies) is associated and aligned with the accounting information data of the corresponding companies in the accounting information database (all listed companies) to form the basic database. That is, the basic database holds the analyst surplus analysis data of the 1/4 of listed companies that have analyst tracking, together with their associated accounting information data. The basic database is then divided into a training data set and a test data set, and machine learning model groups are established by year and by industry (the 1/4 of listed companies are modeled and trained separately by industry for each year). The machine learning model can be based on machine learning methods such as an AdaBoost regression tree model, a support vector machine or an adaptive group Lasso, on deep learning methods such as a convolutional neural network, a recurrent neural network or a long short-term memory model, or on the corresponding model-averaged results. The models predict the analysts' surplus analysis for each quarter, namely analyst EPS_{t+1} and analyst EPS_{t+2}, and the model with the best training and testing performance for the quarter is used for the predictive analysis. In this embodiment, taking the AdaBoost regression tree model as an example, after modeling and training, the model performs simulation analysis and prediction of the surplus on the test data set, that is, it predicts simulated EPS_t, simulated EPS_{t+1} and simulated EPS_{t+2}.
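The association alignment and the per-year, per-industry grouping can be sketched with plain dictionaries. The row layout, the function names, the 20% test fraction and the seeded random shuffle are all illustrative assumptions; the text does not specify how the training and test sets are split.

```python
import random
from collections import defaultdict


def build_basic_database(analyst_rows, accounting_rows):
    """Inner-join analyst and accounting rows on (company, year, quarter)."""
    accounting = {(r["company"], r["year"], r["quarter"]): r
                  for r in accounting_rows}
    basic = []
    for row in analyst_rows:
        key = (row["company"], row["year"], row["quarter"])
        if key in accounting:  # keep only companies present in both sources
            basic.append({**accounting[key], **row})
    return basic


def split_by_group(basic, test_fraction=0.2, seed=0):
    """Group rows by (year, industry) and split each group into train and test."""
    groups = defaultdict(list)
    for row in basic:
        groups[(row["year"], row["industry"])].append(row)
    rng = random.Random(seed)
    splits = {}
    for key, rows in groups.items():
        rows = rows[:]          # copy before shuffling
        rng.shuffle(rows)
        n_test = int(len(rows) * test_fraction)
        splits[key] = {"train": rows[n_test:], "test": rows[:n_test]}
    return splits


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    analyst = [{"company": "A", "year": 2014, "quarter": 1, "eps_t": 0.5}]
    accounting = [
        {"company": "A", "year": 2014, "quarter": 1, "industry": "steel", "nav": 3.2},
        {"company": "B", "year": 2014, "quarter": 1, "industry": "steel", "nav": 1.0},
    ]
    basic = build_basic_database(analyst, accounting)
    print(basic)
    print(split_by_group(basic).keys())
```

One model per `(year, industry)` key would then be trained on the `train` rows of that group, which corresponds to the model group described above.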

The specific training and testing process is as follows:

(1) Given a data set with a total of N input samples, $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x$ is a quantifiable information feature and $y$ is the analyst surplus analysis data.

(2) Initializing the weights of a training sample data set:

where $w_{1i}$ is the initial weight of the $i$-th accounting information feature data, $i = 1, 2, \ldots, N$.

(3) Training the weak regressors: with $k$ denoting the iteration index of the weak regressor, for $k = 1, 2, \ldots, M$:

(a) Train on the training set with weight distribution $D_k$ to obtain a weak regressor $G_k(x)$.

(b) Compute the maximum error of the weak regressor $G_k(x)$:

$$E_k = \max_i \lvert y_i - G_k(x_i) \rvert, \quad i = 1, 2, \ldots, N \tag{M3}$$

(c) Calculate the error for each sample:

(d) Compute the regression error rate $e_k$ of the weak regressor $G_k(x)$:

where $w_{ki}$ is the weight of the $i$-th feature data at the $k$-th iteration.

(e) Compute the weight $\alpha_k$ of the weak regressor $G_k(x)$ from the error rate:

(f) Update the weight distribution $w_{k+1,i}$ of the sample set:

where $Z_k$ is a normalization factor, given by:

(4) Repeat steps (a) to (f) for M iterations, and construct the final strong regressor as follows:

(5) After the final strong regressor is constructed, it is tested with the test data set, and $R^2$ is selected as the evaluation index. $R^2$ reflects the fitting effect of the model and takes values in $[0, 1]$; the larger $R^2$ is, the better the model fits:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $\hat{y}_i$ is the analysis data of the model, i.e. of the final strong regressor, $y_i$ is the surplus analysis data of the real analysts, $\bar{y}$ is the mean of the real analysts' surplus analysis data in the test sample, and $n$ is the number of samples.

Through parameter tuning, this embodiment ensures that $R^2$ on the training set is above 0.9 and that $R^2$ on the test set is above 0.8.
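Steps (2) to (4) follow the pattern of the standard AdaBoost.R2 regression algorithm. For reference, that algorithm can be written in the notation of this section as below. This is a sketch of the standard algorithm rather than a statement of the embodiment's exact formulas: the linear form of the per-sample error and the weighted-median combination rule are assumptions, since AdaBoost.R2 also admits squared and exponential error forms.

$$
\begin{aligned}
w_{1i} &= \frac{1}{N}, \quad i = 1, 2, \ldots, N && \text{(initial weights)} \\
E_k &= \max_i \lvert y_i - G_k(x_i) \rvert && \text{(maximum error)} \\
e_{ki} &= \frac{\lvert y_i - G_k(x_i) \rvert}{E_k} && \text{(linear per-sample error)} \\
e_k &= \sum_{i=1}^{N} w_{ki}\, e_{ki} && \text{(regression error rate)} \\
\alpha_k &= \frac{e_k}{1 - e_k} && \text{(weak regressor weight)} \\
w_{k+1,i} &= \frac{w_{ki}}{Z_k}\, \alpha_k^{\,1 - e_{ki}}, \qquad Z_k = \sum_{i=1}^{N} w_{ki}\, \alpha_k^{\,1 - e_{ki}} && \text{(weight update)} \\
f(x) &= \text{weighted median of } G_1(x), \ldots, G_M(x) \text{ with weights } \ln\frac{1}{\alpha_k} && \text{(strong regressor)}
\end{aligned}
$$

Because $\alpha_k < 1$ whenever $e_k < 0.5$, samples with a small relative error $e_{ki}$ are multiplied by a smaller factor, so the next weak regressor concentrates on the samples that are currently fitted worst.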

Therefore, the machine learning model may be constructed based on an AdaBoost regression tree model, a support vector machine, an adaptive group Lasso and the like, or based on a convolutional neural network, a recurrent neural network, a long short-term memory model and the like.
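For the AdaBoost regression tree case, one boosting round, that is, steps (b) to (f) of the procedure above, and the $R^2$ index of step (5) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it uses the linear error form of standard AdaBoost.R2, takes the outputs of an already fitted weak regressor as input, and the function names `boosting_round` and `r_squared` are hypothetical.

```python
def boosting_round(y, pred, w):
    """One AdaBoost.R2-style round for an already fitted weak regressor.

    y    : true targets y_i (analyst surplus analysis data)
    pred : weak regressor outputs G_k(x_i)
    w    : current sample weights w_ki, summing to 1
    Returns (e_k, alpha_k, new_weights). Assumes the linear error form.
    """
    abs_err = [abs(yi - pi) for yi, pi in zip(y, pred)]
    E_k = max(abs_err)                               # (b) maximum error
    if E_k == 0:
        return 0.0, 0.0, list(w)                     # perfect fit: nothing to reweight
    e_ki = [a / E_k for a in abs_err]                # (c) per-sample relative error
    e_k = sum(wi * ei for wi, ei in zip(w, e_ki))    # (d) weighted error rate
    if e_k >= 0.5:
        # Standard AdaBoost.R2 stops boosting when a weak regressor is this poor.
        raise ValueError("weak regressor error rate must be below 0.5")
    alpha_k = e_k / (1.0 - e_k)                      # (e) regressor weight
    raw = [wi * alpha_k ** (1.0 - ei) for wi, ei in zip(w, e_ki)]
    Z_k = sum(raw)                                   # normalization factor
    return e_k, alpha_k, [r / Z_k for r in raw]      # (f) updated weights

def r_squared(y_true, y_pred):
    """Coefficient of determination used as the evaluation index in step (5)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical example: only the last sample is mispredicted, so its weight grows.
e_k, alpha_k, new_w = boosting_round([1, 2, 3, 4], [1, 2, 3, 6], [0.25] * 4)
```

In the hypothetical example the mispredicted sample ends up with the largest weight, which is the behaviour that lets later weak regressors focus on hard samples.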

The surplus data analysis is carried out by utilizing the tested machine learning model, and a filling data set and a full analysis data set are constructed according to data obtained by carrying out surplus data analysis prediction:

a. Filling data set: the filling data set is formed by using the machine learning model to perform surplus analysis on the listed companies that have no analyst surplus analysis data, and adding the data of the listed companies that do have analyst surplus analysis data; that is, the surplus analysis data predicted by the machine learning model for the other 3/4 of companies are added to the data of the existing 1/4 of listed companies.

b. Full analysis data set: the full analysis data set is obtained by performing surplus analysis on all listed companies by using a machine learning model.
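The difference between the two data sets can be made concrete with a short Python sketch. It assumes, purely for illustration, that both inputs are dictionaries mapping a company identifier to an EPS value; the function name and the sample figures are hypothetical.

```python
def build_datasets(analyst_eps, model_eps):
    """Build the filling data set and the full analysis data set.

    analyst_eps : {company: EPS from real analysts}, tracked companies only
    model_eps   : {company: EPS predicted by the tested model}, all companies
    """
    full = dict(model_eps)         # model output for every listed company
    filling = dict(model_eps)      # start from the model output ...
    filling.update(analyst_eps)    # ... and keep real analyst data where it exists
    return filling, full

# Hypothetical figures: company "A" is tracked by analysts, "B" to "D" are not.
filling, full = build_datasets({"A": 1.00}, {"A": 1.10, "B": 0.50, "C": 0.70, "D": 0.20})
```

Here `filling["A"]` keeps the analyst value while `full["A"]` holds the model value, so the two data sets differ only for companies that already have analyst tracking.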

The embodiment ensures the validity of the analysis data by verifying the convergence of the analysis result of the machine learning model.

(1) Machine learning model analysis result convergence test

(a) Spearman Rank correlation coefficient test method

The Spearman rank correlation coefficient is used to estimate the correlation between two variables X and Y, where the relationship between the variables can be described by a monotonic function. If neither set of values contains repeated elements, then ρ between the two variables reaches +1 or -1 when one variable can be expressed as a perfect monotonic function of the other (i.e., the two variables follow a consistent trend).

Assume that the two random variables are X and Y, each with N elements, and that the $i$-th ($1 \le i \le N$) values of the two variables are denoted $X_i$ and $Y_i$ respectively. Sorting X and Y (both ascending or both descending) yields two rank sets x and y, where the elements $x_i$ and $y_i$ are the rank of $X_i$ in X and the rank of $Y_i$ in Y respectively. Subtracting the elements of x and y pairwise gives the rank difference set d, where $d_i = x_i - y_i$, $1 \le i \le N$. The Spearman rank correlation coefficient between the random variables X and Y can be calculated from x, y or d as follows:

$$\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)} \tag{M11}$$

Taking one company as an example, the organization of the data and the idea behind calculating the Spearman rank correlation coefficient are described with reference to Table 2. From the fourth quarter (Q4) of 2014 to the third quarter (Q3) of 2017, the analysts analyzed the 2017 EPS in 12 quarters. First, the number of quarters between the quarter in which the analysis was made and the fourth quarter (Q4) of 2017 is taken as the distance variable, and the distance variable is sorted in ascending order to obtain the distance ranking y (when two values of a variable are equal, their rank is the average of their positions). For example, the fourth quarter (Q4) of 2014 is 12 quarters away from the fourth quarter (Q4) of 2017. Next, the absolute error between the analysts' analysis value for the quarter and the actual EPS is taken as the analysis deviation, and the analysis deviations are sorted in ascending order to obtain the deviation ranking x. From the deviation ranking x and the distance ranking y, $d_i = x_i - y_i$ ($1 \le i \le 12$) is calculated. Finally, the rank correlation coefficient is calculated according to formula M11 and used to test how the analysts' analysis accuracy changes with the quarterly distance. The larger the rank correlation coefficient, the more accurate the analysis becomes as the analysis date approaches the analyzed year, and the more effectively the analysts' surplus analysis converges.

Table 2. Arrangement of the analysts' surplus analysis and calculations for the company

The rank correlation coefficient of the analysts' surplus analysis of the company for 2017 is calculated to be 0.837, which shows that the analysts' analysis results converge effectively. That is, by continuously accumulating surplus information about the company, the analysts adjust their analysis values ever closer to the real data.
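The ranking and rank-difference calculation described above can be sketched in Python as follows. This is a minimal sketch: the function names are hypothetical, tied values receive the average of their positions as described, and the rank-difference formula is exact only when there are no ties (with ties it is an approximation). The sample figures are invented for illustration and are not the Table 2 data.

```python
def average_ranks(values):
    """Ascending ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(deviation, distance):
    """Spearman rank correlation between analysis deviation and quarterly distance."""
    x = average_ranks(deviation)             # deviation ranking x
    y = average_ranks(distance)              # distance ranking y
    n = len(x)
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical analyst: the deviation shrinks steadily as the distance shrinks.
rho = spearman_rho([0.50, 0.40, 0.30, 0.20], [4, 3, 2, 1])
```

A coefficient close to +1 for a company indicates that the analysis deviation falls as the analyzed year approaches, which is the convergence property the test looks for.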

To test how strongly the distance y and the analysis deviation x are correlated, a Spearman rank hypothesis test is introduced to further verify the validity of the analysts' surplus analysis values. The Spearman rank correlation coefficient test is applied to the two data sets obtained by model analysis, and the proportion of cases satisfying rank correlation coefficient > 0 and significance P < 0.05 is counted, with the following results:

Table 3. Proportions satisfying the Spearman rank correlation coefficient test

As can be seen from Table 3, the full analysis data set based on machine learning model analysis has the highest proportion satisfying rank correlation coefficient > 0 and significance P < 0.05; that is, the data analyzed by the trained machine learning model are closer to the real data and show a more pronounced convergence trend. The filling data set comes next, and the human analyst surplus analysis database comes last.

(b) Pearson correlation coefficient test method

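The Pearson correlation, also known as the product-moment correlation, is calculated for two variables X and Y; the Pearson correlation coefficient between the two variables is given by the following equation, where $\bar{X}$ and $\bar{Y}$ are the means of X and Y:

$$r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\, \sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \tag{M12}$$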

The Pearson correlation coefficient of the analysts' surplus analysis for 2017 of the company listed in Table 2 is calculated to be 0.779, which shows a positive linear correlation between the analysts' analysis deviation and the distance. That is, as the distance shrinks, the analysis deviation becomes smaller and the analysts' analysis accuracy improves. Taking the analysis deviation and the distance as variables, the Pearson correlation coefficient is calculated with equation M12 to describe the relationship between the analysts' analysis deviation and the distance. The Pearson correlation coefficient test is applied to the two data sets obtained by model analysis, and the proportions satisfying Pearson correlation coefficient > 0 and significance P < 0.05 are given in Table 4 as follows:

Table 4. Proportions satisfying the Pearson correlation coefficient test

The full analysis data set based on machine learning model analysis has the highest proportion satisfying Pearson correlation coefficient > 0 and significance P < 0.05; that is, the data analyzed by the trained machine learning model are closer to the real data and show a more pronounced convergence trend. The filling data set comes next, and the human analyst surplus analysis database comes last.
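The Pearson coefficient referred to above as equation M12 can be computed, in its standard product-moment form, with the short Python sketch below. The function name and the sample figures are hypothetical; the inputs stand for the analysis deviation and the quarterly distance of one company.

```python
import math

def pearson_r(xs, ys):
    """Standard product-moment correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical company whose deviation is exactly proportional to the distance.
r = pearson_r([0.1, 0.2, 0.3, 0.4], [1, 2, 3, 4])
```

Unlike the Spearman coefficient, this value depends on the magnitudes of the deviations and not only on their order, so the two tests complement each other.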

the model component is used for constructing a machine learning model, training the machine learning model with a training data set and testing the trained machine learning model with a test data set; the machine learning model may be constructed based on an AdaBoost regression tree model, a support vector machine, an adaptive group Lasso and the like, or based on a convolutional neural network, a recurrent neural network, a long short-term memory model and the like;

In a specific implementation, the machine learning models are modeled separately by year and industry to form a model group.

In a specific implementation, the missing-value processing of the accounting information data refers to filling in the index data that are missing for some companies, and includes the following steps:

S1: replacing the missing value with the company's mean value for the index;
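Step S1 can be sketched in Python as follows, assuming for illustration that one company's values of one index over time are held in a list with `None` marking a missing entry. The function name is hypothetical, and what happens when every value is missing is left to later steps.

```python
def fill_with_company_mean(series):
    """Replace missing entries (None) with the mean of the company's known values."""
    known = [v for v in series if v is not None]
    if not known:
        return list(series)      # nothing to average; left unchanged
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in series]

# Hypothetical index with one missing quarter.
filled = fill_with_company_mean([1.0, None, 3.0])
```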

In a specific embodiment, the filling data set is a data set formed by using the machine learning model to perform surplus analysis on the listed companies that have no analyst surplus analysis data, and adding the data of the listed companies that have analyst surplus analysis data.

In a specific implementation, the analyst simulation system further comprises a reporting component, which is used for acquiring information on the product technology development, industry technology and market development trends of listed companies, identifying the product characteristics and industry status of the listed companies, forming an industry prospect text for each single company, integrating the machine learning model's surplus analysis data for several future years with the company's industry prospect text, and forming company and industry research reports on a quarterly basis.

Capturing the information of the product technical development trend and the industry technical and market development trend of the listed companies from the public information source;

identifying the characteristics of company products and the industry status of companies to form a single company industry prospect text;

integrating the intelligent simulation analysis of the performance of a single company (not tracked by an analyst) in the last three years with its industry prospect text, and forming company and industry research reports on a quarterly basis.

The invention relates to the field of financial data analysis and processing, and provides an analyst simulation method, an analyst simulation system, an electronic device and a storage medium. The analyst simulation method is applied to an analyst simulation system configured in the electronic device, and the electronic device is communicatively connected to a first device and to a second device. The method and the system acquire surplus analysis data of existing analysts from the first device and accounting information data from the second device; associate and align the surplus analysis data of the existing analysts with the accounting information data to form a basic database, and divide the basic database into a training data set and a test data set; construct a machine learning model based on an AdaBoost regression tree model, train it with the training data set and test the trained model with the test data set; and perform surplus data analysis with the tested machine learning model. The invention achieves fast and effective artificial intelligence analysis of the performance of companies that lack analyst tracking, so that the analysis results can give investors timely and comprehensive information for reference when making investment decisions; in addition, by combining the performance data from the artificial intelligence analysis with automatic text technology, it automatically provides investors with company and industry research reports.
The method is based on the real-time predictive analysis of existing analysts and uses intelligent simulation technologies such as machine learning to characterize their predictive analysis behavior, that is, to realize an intelligent simulation of real analysts. Through this intelligent simulation, quarterly research reports on company performance are provided for listed companies that lack analyst tracking, filling a gap in the basis for investment decisions in China's capital market.

In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically coupled, may be electrically coupled or may be in communication with each other; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

The above description is for illustrative purposes only and is not intended to limit the present invention, and any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention should be included within the scope of the present invention as defined by the appended claims.

Claims

1. An analyst simulation method applied to an analyst simulation system configured in an electronic device, the electronic device being communicatively coupled to a first device and to a second device, the method comprising:

associating and aligning the surplus analysis data of the existing analysts with the accounting information data to form a basic database, and dividing the basic database into a training data set and a test data set;

2. The analyst simulation method of claim 1, wherein, before the existing analyst surplus analysis data is associated and aligned with the accounting information data, the existing analyst surplus analysis data is screened and averaged, and the accounting information data is subjected to dimensionality reduction, missing-value processing and normalization.

3. The analyst simulation method of claim 1, further comprising the steps of:

4. The analyst simulation method of claim 1, wherein associating and aligning the existing analyst surplus analysis data with the accounting information data to form the basic database means associating and aligning the existing analyst surplus analysis data with the accounting information data on the basis of the same listed company, the same year and the same quarter, so as to form a basic database containing the associated data of the listed companies that have existing analyst surplus analysis data.

5. The analyst simulation method of claim 1 wherein the machine learning model is modeled separately by year and industry to form a model set.

6. The analyst simulation method of claim 2, wherein the screening and averaging of the existing analyst surplus analysis data comprises: after each quarterly financial report of a listed company is published, screening the corresponding analysts' surplus analysis data for several future years; and, based on the existing analysts' surplus analysis data for several future years following each quarterly financial report, taking the average of all analysts' surplus analyses for that quarter.

7. The analyst simulation method of claim 2, wherein the missing-value processing of the accounting information data is a filling process for index data that are missing for some companies, and comprises the following steps:

S1: replacing the missing value with the company's mean value for the index;

8. The analyst simulation method of claim 3, wherein the filling data set is a data set formed by using a machine learning model to perform surplus analysis on the listed companies that have no analyst surplus analysis data, and adding the data of the listed companies that have analyst surplus analysis data.

9. The analyst simulation method of claim 3, wherein the full analysis data set is a data set obtained by performing surplus analysis on all listed companies using a machine learning model.

10. The analyst simulation method of claim 3, wherein the validity verification of the filling data set and the full analysis data set is performed by applying a convergence test, using the Spearman rank correlation coefficient test method and the Pearson correlation coefficient test method respectively, to the existing analyst surplus analysis data in the basic database, the filling data set and the full analysis data set, and convergence proves that the surplus analysis of the machine learning model is valid.

11. The analyst simulation method of any of claims 1-10, further comprising the steps of:

the method comprises the steps of obtaining information of product technology development, industry technology and market development trend of a listed company, identifying product characteristics and industry positions of the listed company, forming a single company industry prospect text, integrating surplus analysis data of a machine learning model and the industry prospect text of the company for several years in the future, and forming a company and industry research report in a quarterly period.

12. An analyst simulation system configured in an electronic device, the electronic device being communicatively coupled to a first device and to a second device, the system comprising:

an analysis component for utilizing the tested machine learning model for surplus data analytics prediction.

13. The analyst simulation system of claim 12, further comprising a data preprocessing component configured to, before the existing analyst surplus analysis data is associated and aligned with the accounting information data, screen and average the existing analyst surplus analysis data and perform dimensionality reduction, missing-value processing and normalization on the accounting information data.

14. The analyst simulation system of claim 12, further comprising a validation component configured to construct a filling data set and a full analysis data set based on the data obtained from surplus data analysis prediction, and to verify the validity of the filling data set and the full analysis data set.

15. The analyst simulation system of claim 12, wherein associating and aligning the existing analyst surplus analysis data with the accounting information data to form the basic database means associating and aligning the existing analyst surplus analysis data with the accounting information data on the basis of the same listed company, the same year and the same quarter, so as to form a basic database containing the associated data of the listed companies that have existing analyst surplus analysis data.

16. The analyst simulation system of claim 12, wherein the machine learning models are modeled separately by year and industry to form model sets.

17. The analyst simulation system of claim 13, wherein the screening and averaging of the existing analyst surplus analysis data comprises: after each quarterly financial report of a listed company is published, screening the corresponding analysts' surplus analysis data for several future years; and, based on the analysts' surplus analysis data for several future years following each quarterly financial report, taking the average of all analysts' surplus analyses for that quarter.

18. The analyst simulation system of claim 13, wherein the missing-value processing of the accounting information data is a filling process for index data that are missing for some companies, and comprises the following steps:

S1: replacing the missing value with the company's mean value for the index;

19. The analyst simulation system of claim 14, wherein the filling data set is a data set formed by using a machine learning model to perform surplus analysis on the listed companies that have no analyst surplus analysis data, and adding the data of the listed companies that already have analyst surplus analysis data.

20. The analyst simulation system of claim 14, wherein the full analysis data set is a data set obtained by performing surplus analysis on all listed companies using a machine learning model.

21. The analyst simulation system of claim 14, wherein the validity verification of the filling data set and the full analysis data set is performed by applying a convergence test, using the Spearman rank correlation coefficient test method and the Pearson correlation coefficient test method respectively, to the existing analyst surplus analysis data in the basic database, the filling data set and the full analysis data set, and convergence proves that the surplus analysis of the machine learning model is valid.

22. The analyst simulation system of any of claims 12-21, further comprising a reporting component configured to obtain information on technology development, industry technology and market development trends of the products of the listed companies, identify characteristics and industry status of the products of the listed companies, form a single company industry prospect text, and integrate machine learning model surplus analysis data and the industry prospect text of the company for several years in the future to form company and industry research reports in a quarterly period.

23. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the analyst simulation method of any of claims 1-11.

24. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the analyst simulation method of any of claims 1-11.