CN114418776A - Data processing method, device, terminal equipment and medium - Google Patents

Data processing method, device, terminal equipment and medium Download PDF

Info

Publication number
CN114418776A
CN114418776A CN202111657708.0A
Authority
CN
China
Prior art keywords
data
training
prediction model
factor data
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111657708.0A
Other languages
Chinese (zh)
Inventor
赵洋
包荣鑫
陈龙
田多
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Valueonline Technology Co ltd
Original Assignee
Shenzhen Valueonline Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Valueonline Technology Co ltd filed Critical Shenzhen Valueonline Technology Co ltd
Priority to CN202111657708.0A priority Critical patent/CN114418776A/en
Publication of CN114418776A publication Critical patent/CN114418776A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06Asset management; Financial planning or analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Abstract

The embodiment of the application is suitable for the technical field of deep learning, and provides a data processing method, a device, terminal equipment and a medium, wherein the method comprises the following steps: acquiring training data, wherein the training data comprises a plurality of factor data; processing the multiple factor data by taking a preset time length as a window period to obtain time sequence factor data corresponding to the training data; performing second-order polynomial processing on the multiple factor data to obtain polynomial factor data corresponding to training data; training a preset prediction model by adopting factor data, time sequence factor data and polynomial factor data to obtain a target prediction model; receiving data to be predicted, and determining target data in the data to be predicted; and inputting the target data into a target prediction model for prediction to obtain a corresponding prediction result. By the method, the accuracy of the prediction model can be improved.

Description

Data processing method, device, terminal equipment and medium
Technical Field
The present application belongs to the field of deep learning technologies, and in particular, to a data processing method, apparatus, terminal device, and medium.
Background
As regulation of the securities industry tightens, more and more false-statement events by listed companies, such as financial fraud and annual-report fraud, are being disclosed. False statements by listed companies often mislead investors and cause them financial losses.
However, there is currently no scientific and reasonable method for assessing the losses investors suffer from false statements by listed companies, so such losses cannot be accurately assessed.
Disclosure of Invention
In view of this, the embodiments of the present application provide a data processing method, an apparatus, a terminal device, and a medium, by which investment loss of an investor due to false statement of a listed company can be quantitatively calculated, thereby guaranteeing benefits of the investor.
A first aspect of an embodiment of the present application provides a data processing method, including:
acquiring training data, wherein the training data comprises a plurality of factor data;
processing the plurality of factor data by taking a preset time length as a window period to obtain time sequence factor data corresponding to the training data;
performing second-order polynomial processing on the factor data to obtain polynomial factor data corresponding to the training data;
training a preset prediction model by adopting the factor data, the time sequence factor data and the polynomial factor data to obtain a target prediction model;
receiving data to be predicted, and determining target data in the data to be predicted;
and inputting the target data into the target prediction model for prediction to obtain a corresponding prediction result.
A second aspect of an embodiment of the present application provides a data processing apparatus, including:
an acquisition module, configured to acquire training data, where the training data includes multiple factor data;
the time sequence processing module is used for processing the factor data by taking a preset time length as a window period to obtain time sequence factor data corresponding to the training data;
the polynomial processing module is used for carrying out second-order polynomial processing on the factor data to obtain polynomial factor data corresponding to the training data;
the training module is used for training a preset prediction model by adopting the factor data, the time sequence factor data and the polynomial factor data to obtain a target prediction model;
the device comprises a receiving module, a prediction module and a prediction module, wherein the receiving module is used for receiving data to be predicted and determining target data in the data to be predicted; and the prediction module is used for inputting the target data into the target prediction model for prediction to obtain a corresponding prediction result.
A third aspect of embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method according to the first aspect.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, which, when executed by a processor, implements the method according to the first aspect as described above.
A fifth aspect of embodiments of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to perform the method of the first aspect.
Compared with the prior art, the embodiment of the application has the following advantages:
according to the embodiment of the application, the corresponding prediction model can be obtained by training with training data. In the training process of the prediction model, the corresponding time sequence factor data and the corresponding polynomial factor data can be obtained based on the factor data of the training data, and the prediction model is trained by adopting the factor data, the time sequence factor data and the polynomial factor data, which is equivalent to the training process, so that the expression capacity of the training data is improved, and the accuracy of the obtained prediction model can be higher. After the training of the prediction model is completed, a corresponding prediction result can be obtained according to the training data and the prediction model. When the investment loss of an investor caused by false statement of a company is calculated, a stock price trend simulation model can be obtained according to the existing stock trading data in a training mode; then, the income which the investor should obtain can be calculated on the premise that the listed company has no false statement according to the stock trading data and the stock price trend simulation model of the investor; comparing the actual income obtained by the investor with the income due, the loss of the investor due to the false statement of the company can be obtained. In the embodiment of the application, the loss of the investor caused by the false statement can be quantitatively calculated, and the benefit of the investor is guaranteed.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a flow chart illustrating steps of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 3 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The technical solution of the present application will be described below by way of specific examples.
Referring to fig. 1, a schematic flow chart illustrating steps of a data processing method provided in an embodiment of the present application is shown, which may specifically include the following steps:
s101, obtaining training data, wherein the training data comprises a plurality of factor data.
The method provided by the embodiment of the application can be applied to terminal devices such as a mobile phone, a tablet personal computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the embodiment of the application does not limit the specific types of the terminal devices at all.
The data processing method in the embodiment of the application establishes the prediction model according to the existing data, so that data prediction can be carried out based on the prediction model. The prediction model has corresponding transactions to be predicted, and the training data may include factor data corresponding to various factors affecting the transactions to be predicted, and results corresponding to each factor data.
The training data may be data related to a transaction to be predicted, and the factor data may affect the result of the transaction in the prediction process.
The method in the embodiment of the application can be particularly applied to a scene of simulating the stock price trend. The training data can be stock price data of a plurality of companies which are similar to the market value of the company to be evaluated, belong to the same block and do not have false statements in a time node period. Factor data affecting stock prices can include a wide variety, for example, the factor data shown in Table one can be selected.
Table one: (the factor data table is provided as an image in the original publication and is not reproduced here)
for example, according to the multiple factor data in table one, the training data can be processed into a multidimensional vector, and then model prediction is performed by using the multidimensional vector.
The training data needs to be preprocessed before it is used for model training. For example, outlier detection and data filling are required. Specifically, the Z-score method may be used to determine outliers in the training data: if a value differs from the mean by more than three times the standard deviation, it is judged to be an outlier. Abnormal values can then be refilled from the preceding and following data.
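The preprocessing described here can be sketched in a few lines (a minimal pure-Python illustration of the 3σ rule and the fill-from-neighbors step; the function names are our own, not from the patent):

```python
from statistics import mean, pstdev

def mark_outliers(values, k=3.0):
    """Replace values more than k standard deviations from the mean with None."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v if abs(v - mu) <= k * sigma else None for v in values]

def fill_bidirectional(values):
    """Fill None entries with the following non-null value, then cover any
    edge cases left over with the preceding non-null value."""
    filled = list(values)
    # backward pass: take the following value
    for i in range(len(filled) - 2, -1, -1):
        if filled[i] is None:
            filled[i] = filled[i + 1]
    # forward pass: cover trailing nulls at the edge
    for i in range(1, len(filled)):
        if filled[i] is None:
            filled[i] = filled[i - 1]
    return filled
```

With 20 values of 10.0 and one of 1000.0, the extreme value lies beyond three standard deviations and is nulled, then refilled from its neighbor.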
And S102, processing the multiple factor data by taking a preset time length as a window period to obtain time sequence factor data corresponding to the training data.
Specifically, each item of training data corresponds to a time, and the preset time length can be 3, 5, or 7 days. For example, if the training data covers 2 years, that period can be divided with windows of 3, 5, and 7 days respectively, and the mean and variance of the training data within each window period are then calculated for each factor. The mean and variance in each window period serve as one item of factor data; such factor data reflects the characteristics of the training data over the preset time length, which is equivalent to endowing the data with a time-series characteristic.
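The window-period statistics can be illustrated with a minimal sketch (the function name is our own; in practice each factor column would be processed this way for each of the 3-, 5-, and 7-day windows):

```python
from statistics import mean, pvariance

def window_features(series, window):
    """Mean and (population) variance of each full window of `window`
    consecutive values in the series."""
    feats = []
    for i in range(len(series) - window + 1):
        chunk = series[i:i + window]
        feats.append((mean(chunk), pvariance(chunk)))
    return feats
```

For a 4-day series and a 3-day window this yields two (mean, variance) pairs, one per window position.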
And S103, performing second-order polynomial processing on the plurality of factor data to obtain polynomial factor data corresponding to the training data.
Specifically, the factor data may be subjected to second-order polynomial feature processing. Assuming there are two features (a, b), the second-order polynomial expansion is (1, a, b, a², ab, b²), and the new feature dimensions are (a², ab, b²). For example, in the training process of the stock price trading-trend prediction model, second-order polynomial processing can be performed on multiple features of company public sentiment and industry public sentiment to obtain new polynomial features.
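A minimal sketch of the second-order expansion for an arbitrary number of features (the generalization beyond two features is our own illustration):

```python
from itertools import combinations_with_replacement

def poly2_features(features):
    """All degree-2 products of the input features,
    e.g. (a, b) -> (a*a, a*b, b*b)."""
    out = []
    for i, j in combinations_with_replacement(range(len(features)), 2):
        out.append(features[i] * features[j])
    return out
```

For (a, b) = (2, 3) this produces (a², ab, b²) = (4, 6, 9), matching the expansion described above.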
And S104, training a preset prediction model by adopting the factor data, the time sequence factor data and the polynomial factor data to obtain a target prediction model.
Specifically, the original factor, the time sequence factor, and the polynomial factor may be combined together to finally form a plurality of different characteristic factors, and then the prediction model is trained by using data corresponding to the characteristic factors.
The above prediction model may employ a LightGBM model. LightGBM is a lightweight gradient boosting decision tree model proposed by Microsoft; it uses a leaf-wise leaf growth strategy with a depth constraint and uses histograms for difference acceleration.
The prediction model may include parameters and hyper-parameters, the hyper-parameters may be the depth of the tree, etc., the values of the hyper-parameters may be set by a user, and the values of the parameters may be obtained through training. In order to make the prediction result of the prediction model more accurate, an appropriate value of the hyper-parameter needs to be selected.
In order to select a suitable hyper-parameter value, a grid search may be performed, the accuracy of the model at each preset hyper-parameter value is determined, and then the hyper-parameter value with the best effect is selected.
In order to select proper parameter values, the training data can be divided into a plurality of data sets, and two of them are randomly selected as a training combination, one serving as the training set and the other as the test set. This yields a plurality of trained prediction models, and the one with the smallest error among them is taken as an intermediate prediction model. The intermediate prediction model is then trained with the factor data, the time sequence factor data, and the polynomial factor data to obtain the target prediction model.
In addition, a preset number of trainings may be performed with each training combination, the preset number being equal to the number of values the hyper-parameter can take, thereby determining a training result for each hyper-parameter value.
Illustratively, assume the hyper-parameter has 3 values: A, B, and C, and the training data is divided into 3 data sets x1, x2, and x3; random combination then yields 3 training combinations {x1, x2}, {x1, x3}, and {x2, x3}. For each training combination, training is performed three times, once for each of the hyper-parameter values A, B, and C, giving 3 corresponding errors. The errors of the models trained with {x1, x2} at hyper-parameter values A, B, and C are y1, y2, and y3 respectively; with {x1, x3} they are z1, z2, and z3; with {x2, x3} they are t1, t2, and t3. The model corresponding to the smallest of these 9 errors can be selected as the intermediate model, which is then trained with the factor data, time sequence factor data, and polynomial factor data of the training data to obtain the final prediction model.
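The exhaustive selection over (training-pair, hyper-parameter) combinations can be sketched as follows; the helper `train_and_score` is a hypothetical stand-in for fitting a model and measuring its test error, not part of the patent:

```python
from itertools import combinations

def select_intermediate(datasets, hyperparams, train_and_score):
    """Try every (training-pair, hyper-parameter) combination and return
    (error, pair_indices, hyper-parameter) for the smallest error."""
    best = None
    for pair in combinations(range(len(datasets)), 2):
        train_set, test_set = datasets[pair[0]], datasets[pair[1]]
        for hp in hyperparams:
            err = train_and_score(train_set, test_set, hp)
            if best is None or err < best[0]:
                best = (err, pair, hp)
    return best
```

With 3 data sets and 3 hyper-parameter values this evaluates the 9 errors of the example above and returns the minimum.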
In another possible implementation, again assume the hyper-parameter has 3 values A, B, and C, and the training data is divided into 3 data sets x1, x2, and x3, giving the 3 training combinations {x1, x2}, {x1, x3}, and {x2, x3}. For each training combination, training is performed three times, once per hyper-parameter value: the errors with {x1, x2} at values A, B, and C are y1, y2, and y3; with {x1, x3} they are z1, z2, and z3; with {x2, x3} they are t1, t2, and t3. The average of y1, z1, and t1 is then the error value for hyper-parameter value A; similarly, the error values for B and C are determined, and the hyper-parameter is set to whichever value has the smallest error. In addition, the average of y1, y2, and y3 gives the error value of the prediction models trained with the first training combination; similarly, the error values for the second and third training combinations are determined, and the combination with the smallest error value is selected.
The three prediction models corresponding to the training combination with the smallest error value are then selected, and the possible combinations of the parameter values of these 3 prediction models are determined. Among these parameter combinations, the one whose error is smallest when the hyper-parameter takes its minimum-error value is chosen, and the corresponding intermediate model is obtained from it. The minimum-error hyper-parameter value and the minimum-error parameter combination together define the intermediate model, which is then trained with the training data to obtain the final prediction model.
S105, receiving data to be predicted, and determining target data in the data to be predicted.
Specifically, the data to be predicted may be data related to a preset transaction to be predicted, and the data to be predicted may be processed to obtain target data, where the target data may be consistent with the input parameter form of the prediction model.
And S106, inputting the target data into the target prediction model for prediction to obtain a corresponding prediction result.
Specifically, the target data is input into the prediction model for calculation, and a corresponding prediction result is obtained.
In the embodiment of the application, time sequence factor data and polynomial factor data are adopted, various characteristics of training data are fused in the training process, and the expressive power of the data is enhanced; in addition, the model is trained by adopting the time sequence factor data in the training process, which is equivalent to that the time sequence characteristics of the data are considered in the training process, so that the prediction accuracy of the prediction model can be improved.
It should be noted that, the sequence numbers of the steps in the foregoing embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The method can be applied in particular to a stock price simulation scenario, in which a trading-trend prediction model is obtained by training on trading data. The stock trading-trend prediction model can predict the trading trend from an investor's trading data, from which the investor's predicted trading result is calculated; the investor's actual trading result is calculated from the existing actual stock price trend; and the investor's loss due to the listed company's false statement can then be calculated from the predicted and actual trading results.
False statements by listed companies can mislead investors into making wrong investments, resulting in investment losses. Generally speaking, investor losses arise from the superposition of two risks: "systematic risk" and "non-systematic risk". Systematic risk, i.e., market risk, refers to the impact on security prices of environmental factors such as overall politics, economy, and society. Non-systematic risk, also known as "non-market risk" or "diversifiable risk", refers, in contrast to systematic risk, to risk unrelated to the fluctuations of speculative financial markets such as the stock, futures, and foreign exchange markets.
Compensation for investors is borne jointly by the listed company that made the false statement, intermediary agencies, and law firms; institutions such as stock exchanges are not involved. Systematic risk is therefore excluded, and only shareholder losses caused by non-systematic risk are considered. Existing methods for evaluating the amount of compensation due to false statements fall into three categories: the broad-market index, comparable stocks of the same kind, and stock trend simulation. The broad-market index method selects, from several candidate composite indexes and component indexes, the index with the highest relevance as the reference. The comparable-stock method preferentially refers to the most relevant industry index, or to the sector market value when there is no industry index, and may in special cases refer to regional sector indexes, concept sector indexes, and the like. Stock trend simulation simulates the stock trend from the implementation day to the disclosure day based on multi-factor machine learning algorithms such as linear regression, LASSO regression, and recurrent neural networks. The index-based method is called the "unified relative proportion method", and the simulation-based method is called the "yield curve synchronous comparison method". The "unified relative proportion method" relies on the broad-market or industry index, considers index and stock price fluctuations only at two points (the disclosure day and the reference day), and lacks a global analysis of stock price fluctuations over the entire interval affected by the false statement. It is therefore difficult for this approach to objectively reflect the impact of non-systematic risk within the interval affected by the false statement.
Therefore, the present embodiment proposes a multi-factor model for compensation evaluation.
The total loss of an investor over the investment process, Loss, is composed of two parts, the systematic-risk loss Loss_system and the non-systematic-risk loss Loss_non-system:

Loss = Loss_system + Loss_non-system
Currently, there is no authoritative research on the influence factors and quantification of non-systematic risk, whereas the influence factors of systematic-risk loss are relatively few and its evaluation is relatively simple. Therefore, to calculate the amount the listed company should pay, the investor's simulated profit and loss can be calculated from the other factors on the premise that no false statement occurred. The non-systematic risk loss is calculated as follows (in all formulas below, subscript b denotes buying and subscript s denotes selling):
Loss_non-system = Loss − Loss_system = c_b × (p_r − p_s)
where c_b represents the total buying cost of the investor's stock, p_r represents the investor's total loss proportion, and p_s represents the proportion of the investor's systematic-risk loss. The total loss proportion can be expressed using the total selling revenue c_s as follows:

p_r = (a_b × n_b − a_s × n_s) / (a_b × n_b)
where a_b and a_s represent the average buying and selling prices respectively, and n_b and n_s represent the numbers of shares bought and sold respectively.
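The loss decomposition above can be sketched numerically; note that `total_loss_ratio` encodes one plausible reading of the p_r formula, whose original appears only as an image in the publication:

```python
def non_system_loss(c_b, p_r, p_s):
    """Loss_non-system = c_b * (p_r - p_s)."""
    return c_b * (p_r - p_s)

def total_loss_ratio(a_b, n_b, a_s, n_s):
    """Total loss as a fraction of the total buying cost (assumed reading:
    p_r = (a_b*n_b - a_s*n_s) / (a_b*n_b))."""
    c_b = a_b * n_b  # total buying cost
    c_s = a_s * n_s  # total selling revenue
    return (c_b - c_s) / c_b
```

Buying 100 shares at an average of 10 and selling them at 8 gives p_r = 0.2; with a systematic-risk proportion p_s = 0.05 and c_b = 1000, the non-systematic loss is 150.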
The average buying price is calculated from the first effective purchase. The average buying price after any given day of investment is calculated as:

a_b = (c_pre + Σ c_today) / (n_pre + Σ n_today)

where c_today and n_today represent the stock cost and the number of shares bought on the current day, summed respectively to give the day's total cost and total quantity, and c_pre and n_pre represent the total stock cost and total number of shares before the current day.
The investor's systematic-risk loss proportion p_s can be obtained by the multi-factor model simulation algorithm; its calculation is similar to that of the total loss proportion p_r. The multi-factor model algorithm is the stock price trend simulation model obtained from stock trading data using the method in this application.
In this embodiment, the training data may be stock price data of a plurality of companies that have a market value similar to the company under evaluation, belong to the same sector, and made no false statement within the time-node period.
This training data is then subjected to outlier detection and data filling. For outlier detection, the Z-score method can be used, which is based on the statistical 3σ principle: if a value differs from the mean by more than three times the standard deviation, it is judged to be an outlier. This is because the data falls within the interval (μ − 3σ, μ + 3σ) with probability 99.7%, so the probability of falling outside this range is:

P(|x − μ| > 3σ) = 1 − 0.997 = 0.003

This is an event with a very small probability, and data outside the interval (μ − 3σ, μ + 3σ) is therefore judged to be an outlier. The formula of the Z-score method can accordingly be summarized as:

z = (x − μ) / σ

where μ is the mean of the feature and σ is its standard deviation. Outlier detection is performed on all original data features, and the values of outliers are replaced with null values.
Since stock price data is affected by its time series, the null values in the training data can be filled in this embodiment by a bidirectional fill using the preceding and following values. Each null is first filled with the following non-null value; to avoid edge data remaining unfilled, a second pass fills with the preceding non-null value. After this bidirectional filling, no feature contains any null values.
In this embodiment, the ensemble model LightGBM may be used as the base model for training, but stock price data is time-related and belongs to time-series data, which cannot itself be used as input to LightGBM. Therefore, in the feature engineering of the time-series data in this embodiment, the statistical features of each dimension are computed over a window period and used as input.
Stock prices are affected more by short-term fluctuations than by long-term ones; therefore, three periods of 3 days, 5 days and 7 days can be selected as window periods, and the mean and variance of original factors such as "fluctuation range", "turnover rate", "volatility" and "beta factor" are computed over each window. The mean and variance are calculated respectively as:

x̄ = (1/n) Σ_{i=1}^{n} x_i

s² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²

where n is the window period, taking the values 3, 5 and 7. Second-order polynomial feature processing is then applied to the factor data: for two features (a, b), the second-order polynomial expansion is (1, a, b, a², ab, b²), so the new feature dimensions are (a², ab, b²). Second-order polynomial processing is applied to several company public sentiment and industry public sentiment features to obtain new polynomial features. Finally, the original factors, the time sequence factors and the polynomial factors are combined to form a number of different feature factors.
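The window statistics and second-order polynomial terms above can be sketched as follows (the column-naming scheme and helper names are illustrative assumptions; `ddof=0` matches the 1/n population-variance formula):

```python
import pandas as pd

def window_features(df, factor_cols, windows=(3, 5, 7)):
    """Rolling mean and population variance of each factor over each window."""
    out = {}
    for col in factor_cols:
        for n in windows:
            roll = df[col].rolling(window=n)
            out[f"{col}_mean_{n}"] = roll.mean()
            out[f"{col}_var_{n}"] = roll.var(ddof=0)  # 1/n variance
    return pd.DataFrame(out, index=df.index)

def poly2_features(df, cols):
    """Second-order terms (a^2, ab, b^2, ...) for the given columns."""
    out = {}
    for i, a in enumerate(cols):
        for b in cols[i:]:
            out[f"{a}*{b}"] = df[a] * df[b]
    return pd.DataFrame(out, index=df.index)
```

Concatenating the original factors with the outputs of these two helpers yields the combined feature set fed to the model.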
LightGBM is a lightweight gradient boosting decision tree model proposed by Microsoft. It uses a leaf-wise growth strategy with a depth limit and a histogram-based algorithm with histogram subtraction for acceleration, giving faster training, lower memory usage and higher accuracy.
LightGBM is implemented on the basis of decision trees: first a decision tree is trained, then the negative gradient of the current loss function is computed as an approximation of the residual of the current model, and a new decision tree is trained to fit it; this process is repeated until training finishes. LightGBM follows the GBDT strategy during training and, similarly to logistic regression, uses the log-likelihood as the loss function in the binary classification task:

L(y, F) = log(1 + exp(−2yF)), y ∈ {−1, 1}

where x denotes the input features and y the class label; F(x) is the prediction function of the binary classifier. Since the trees are constructed along the direction of the negative gradient, the current negative gradient of the prediction function F_{t−1}(x) is computed as:

ỹ_i = −[∂L(y_i, F(x_i)) / ∂F(x_i)]_{F(x) = F_{t−1}(x)} = 2y_i / (1 + exp(2y_i F_{t−1}(x_i)))
After the negative gradient values are obtained, the learner still uses a decision tree as the base learner and performs a line search to compute the optimal fitted value γ of each leaf node:

γ_tj = argmin_γ Σ_{x_i ∈ R_tj} log(1 + exp(−2y_i (F_{t−1}(x_i) + γ)))

where R_tj is the sample region of the j-th leaf node of decision tree t, and the value range of j is the number of leaf nodes of decision tree t. The learner is then updated according to the recursion formula:

F_t(x) = F_{t−1}(x) + Σ_j γ_tj · I(x ∈ R_tj)
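A numerical sketch of one boosting step under this loss follows; the grid-based line search for γ is a stand-in for illustration (the embodiment does not specify the search method), and the function names are assumptions:

```python
import numpy as np

def negative_gradient(y, F_prev):
    """Pseudo-residuals of L(y, F) = log(1 + exp(-2yF)), y in {-1, +1}."""
    return 2.0 * y / (1.0 + np.exp(2.0 * y * F_prev))

def leaf_value(y_leaf, F_leaf, grid=np.linspace(-2.0, 2.0, 401)):
    """Line search for the gamma minimising the leaf's total loss,
    here approximated by evaluating the loss on a fixed grid."""
    losses = [np.log1p(np.exp(-2.0 * y_leaf * (F_leaf + g))).sum()
              for g in grid]
    return grid[int(np.argmin(losses))]
```

For a leaf containing only positive labels the search pushes γ toward its upper bound, while a balanced leaf settles near zero, matching the behaviour of the formula above.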
In this embodiment, the model effect is verified by combining block pairwise (two-fold) cross validation with grid search to find the optimal parameter combination.
Block cross validation randomly divides all the data D into k parts:

D = {D_1, D_2, ..., D_k}

Any two of the k parts are then selected, one as the training set and the other as the test set, giving a total of

k(k − 1)

different combinations (each ordered pair of distinct parts). The block cross validation mean error Ē used to evaluate the model effect is defined as:

Ē = (1 / (k(k − 1))) · Σ_{i=1}^{k(k−1)} E_i

where E_i is the test error of the i-th train/test combination.
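This block cross validation procedure can be sketched as follows; the callback signature `train_eval(train_block, test_block)` is an illustrative assumption standing in for training and scoring a model:

```python
import random
from itertools import permutations

def block_cv_mean_error(data, k, train_eval):
    """Randomly split `data` into k blocks; for every ordered pair of
    distinct blocks, use one as the training set and the other as the
    test set, and average the k*(k-1) resulting errors."""
    idx = list(range(len(data)))
    random.shuffle(idx)
    blocks = [[data[i] for i in idx[j::k]] for j in range(k)]
    errors = [train_eval(blocks[a], blocks[b])
              for a, b in permutations(range(k), 2)]
    return sum(errors) / len(errors)
```

With k = 4, twelve train/test evaluations are averaged into a single mean error Ē.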
Grid search uses an exhaustive method: the tunable hyperparameters of LightGBM are varied over candidate values, and each combination is evaluated by block cross validation with Ē as the criterion — a smaller Ē indicates a better-performing parameter combination. Through multiple rounds of training and validation, the optimal parameter combination for the current data is finally determined. The learner can then be updated continuously according to the recursion formula to obtain the final strong regression learner F_t(x), which can predict the daily stock price trend in the absence of false statements.
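The exhaustive grid search can be sketched generically as below; `evaluate(params)` is an assumed callback returning the cross-validation mean error for one combination, and the example parameter names (`num_leaves`, `learning_rate`) are typical LightGBM hyperparameters used here only for illustration:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Evaluate every hyperparameter combination exhaustively and
    return the combination with the smallest error."""
    names = list(param_grid)
    best_params, best_err = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        err = evaluate(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```

The winning combination is then used to train the final model on the full training data.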
The LightGBM model obtained by training with the optimal parameter combination is loaded, and the individual stock price trend and the related factors are used as input data. The loss proportion p_s attributable to the investor's systemic risk is calculated from the predicted simulated stock price trend and the real stock price trend, and from it the non-systemic-risk loss Loss_non-system is computed. Loss_non-system may be regarded as the amount of compensation that the listed company needs to pay the investor.
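The patent text does not give the exact formulas for p_s and Loss_non-system, so the following is only one plausible reading, sketched under the assumption that the simulated (false-statement-free) loss measures the systemic share of the investor's actual loss:

```python
def non_systemic_loss(actual_loss, simulated_loss):
    """Hypothetical split of an investor's loss: the loss implied by the
    simulated price trend is treated as systemic (p_s), and the remainder
    of the actual loss is the non-systemic part to be compensated."""
    if actual_loss <= 0:  # no loss, nothing to compensate
        return 0.0
    # Systemic proportion p_s, clamped to [0, 1].
    p_s = min(max(simulated_loss / actual_loss, 0.0), 1.0)
    return actual_loss * (1.0 - p_s)
```

Under this reading, an investor who lost 10 per share while the simulated trend implies a 4-per-share market-driven loss would be compensated 6 per share.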
In this embodiment, feature engineering is performed on the stock transaction data to extract multiple kinds of features, including factor data, time sequence factor data and polynomial factor data, which fully enhances the expressive power of the data. The time series data is also processed into a format suitable as input to the LightGBM model, improving the accuracy of stock trading trend prediction. In addition, this embodiment proposes simulating the trend of each individual stock in the absence of false statements using a multi-factor model, with gradient boosting iterative decision tree training via the LightGBM algorithm, ensuring the accuracy and robustness of the model. This embodiment also provides an investor compensation algorithm containing a complete evaluation procedure for the loss proportion and the compensation amount: given only an investor's historical transaction records, the loss proportion and the final amount to be compensated can be obtained. Finally, the model training method combining block cross validation and grid search can find the optimal parameter combination of the model, avoiding overfitting and giving the model stronger generality and robustness.
Referring to fig. 2, a schematic diagram of a data processing apparatus according to an embodiment of the present application is shown, and may specifically include an obtaining module 21, a timing processing module 22, a polynomial processing module 23, a training module 24, a receiving module 25, and a predicting module 26, where:
an obtaining module 21, configured to obtain training data, where the training data includes multiple factor data;
the time sequence processing module 22 is configured to process the multiple factor data by using a preset time length as a window period to obtain time sequence factor data corresponding to the training data;
the polynomial processing module 23 is configured to perform second-order polynomial processing on the multiple factor data to obtain polynomial factor data corresponding to the training data;
the training module 24 is configured to train a preset prediction model by using the factor data, the time sequence factor data, and the polynomial factor data to obtain a target prediction model;
the receiving module 25 is configured to receive data to be predicted and determine target data in the data to be predicted;
and the prediction module 26 is configured to input the target data into the target prediction model for prediction, so as to obtain a corresponding prediction result.
In one possible processing manner, the timing processing module 22 includes:
a determining submodule for determining factor data of the training data in each window period;
a calculation submodule for calculating a mean and a variance of each of the factor data for each of the window periods;
and the time sequence factor data determining submodule is used for taking the mean value and the variance of each type of factor data in each window period as the time sequence factor data.
In one possible processing manner, the training module 24 includes:
a dividing submodule for dividing the training data into a plurality of data sets;
a training combination determining submodule, configured to determine a plurality of training combinations according to the plurality of data sets, where each training combination includes two data sets;
the first training submodule is used for training the prediction model by adopting a plurality of training combinations respectively to obtain a plurality of trained prediction models;
the intermediate prediction model determining submodule is used for taking a prediction model with the minimum error in the trained prediction models as an intermediate prediction model;
and the second training submodule is used for training the intermediate prediction model by adopting the factor data, the time sequence factor data and the polynomial factor data to obtain the target prediction model.
In one possible processing manner, the first training submodule includes:
the training unit is used for training the prediction model for preset times by taking a data set of the training combination as a training set aiming at any one training combination;
and the test unit is used for calculating the error of the prediction model after each training by adopting the other data set of the training combination as a test set.
In one possible processing manner, the prediction model includes a hyperparameter, and the hyperparameter has a plurality of corresponding values, and the apparatus further includes:
a hyperparameter-value error determining module, configured to perform one round of training with the training combination for each value of the hyperparameter, to obtain the error corresponding to that value of the hyperparameter.
In one possible processing manner, the training module 24 further includes:
a hyperparameter value determination sub-module for determining the error of the intermediate prediction model at each hyperparameter value;
and the selection submodule is used for selecting the value of the hyperparameter with the minimum error as the value of the hyperparameter of the intermediate prediction model.
In one possible processing manner, the prediction module 26 includes:
and the transaction trend information determining submodule is used for inputting the transaction data of a preset type into the target prediction model to obtain transaction trend information.
For the apparatus embodiment, since it is substantially similar to the method embodiment, it is described relatively simply, and reference may be made to the description of the method embodiment section for relevant points.
Fig. 3 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 3, the terminal device 3 of this embodiment includes: at least one processor 30 (only one shown in fig. 3), a memory 31, and a computer program 32 stored in the memory 31 and executable on the at least one processor 30, the processor 30 implementing the steps of any of the various method embodiments described above when executing the computer program 32.
The terminal device 3 may be a desktop computer, a notebook computer, a palmtop computer, a cloud terminal device, or another terminal device. The terminal device may include, but is not limited to, the processor 30 and the memory 31. Those skilled in the art will appreciate that fig. 3 is only an example of the terminal device 3 and does not constitute a limitation of the terminal device 3, which may include more or fewer components than those shown, combine some components, or use different components, and may further include, for example, an input/output device, a network access device, and the like.
The Processor 30 may be a Central Processing Unit (CPU), and the Processor 30 may be other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 31 may in some embodiments be an internal storage unit of the terminal device 3, such as a hard disk or a memory of the terminal device 3. The memory 31 may also be an external storage device of the terminal device 3 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the terminal device 3. The memory 31 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 31 may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A data processing method, comprising:
acquiring training data, wherein the training data comprises a plurality of factor data;
processing the multiple factor data by taking a preset time length as a window period to obtain time sequence factor data corresponding to the training data;
performing second-order polynomial processing on the plurality of factor data to obtain polynomial factor data corresponding to the training data;
training a preset prediction model by adopting the factor data, the time sequence factor data and the polynomial factor data to obtain a target prediction model;
receiving data to be predicted, and determining target data in the data to be predicted;
and inputting the target data into the target prediction model for prediction to obtain a corresponding prediction result.
2. The method of claim 1, wherein the processing the plurality of factor data with a preset time duration as a window period to obtain time sequence factor data corresponding to the training data comprises:
determining factor data of the training data in each window period;
calculating the mean and variance of each of the factor data over each of the window periods;
and taking the mean value and the variance of each kind of factor data in each window period as the time sequence factor data.
3. The method according to claim 1 or 2, wherein the training a preset prediction model using the factor data, the time-series factor data and the polynomial factor data to obtain a target prediction model comprises:
dividing the training data into a plurality of data sets;
determining a plurality of training combinations according to the plurality of data sets, wherein each training combination comprises two data sets;
respectively adopting a plurality of training combinations to train the prediction model to obtain a plurality of trained prediction models;
taking a prediction model with the minimum error in the trained prediction models as an intermediate prediction model;
and training the intermediate prediction model by adopting the factor data, the time sequence factor data and the polynomial factor data to obtain the target prediction model.
4. The method of claim 3, wherein said training said predictive model with a plurality of said training combinations, respectively, to obtain a plurality of trained predictive models, comprises:
aiming at any one training combination, taking a data set of the training combination as a training set to train the prediction model for preset times;
and calculating the error of the prediction model after each training by using the other data set of the training combination as a test set.
5. The method of claim 3, wherein the predictive model includes a hyperparameter, the hyperparameter having a corresponding plurality of values, further comprising:
and for each value of the hyper-parameter, performing one-time training by adopting the training combination to obtain an error corresponding to the value of the hyper-parameter.
6. The method of claim 5, wherein after said using a prediction model with a smallest error among said plurality of trained prediction models as an intermediate prediction model, further comprising:
determining an error of the intermediate predictive model at each of the hyper-parameter values;
and selecting the value of the hyperparameter with the minimum error as the value of the hyperparameter of the intermediate prediction model.
7. The method of claim 1, wherein the training data is transaction data, and the inputting the target data into the target prediction model for prediction to obtain a corresponding prediction result comprises:
and inputting transaction data of a preset type into the target prediction model to obtain transaction trend information.
8. A data processing apparatus, comprising:
an acquisition module, configured to acquire training data, where the training data includes multiple factor data;
the time sequence processing module is used for processing the factor data by taking a preset time length as a window period to obtain time sequence factor data corresponding to the training data;
the polynomial processing module is used for carrying out second-order polynomial processing on the factor data to obtain polynomial factor data corresponding to the training data;
the training module is used for training a preset prediction model by adopting the factor data, the time sequence factor data and the polynomial factor data to obtain a target prediction model;
the device comprises a receiving module, a prediction module and a prediction module, wherein the receiving module is used for receiving data to be predicted and determining target data in the data to be predicted;
and the prediction module is used for inputting the target data into the target prediction model for prediction to obtain a corresponding prediction result.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202111657708.0A 2021-12-30 2021-12-30 Data processing method, device, terminal equipment and medium Pending CN114418776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111657708.0A CN114418776A (en) 2021-12-30 2021-12-30 Data processing method, device, terminal equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111657708.0A CN114418776A (en) 2021-12-30 2021-12-30 Data processing method, device, terminal equipment and medium

Publications (1)

Publication Number Publication Date
CN114418776A true CN114418776A (en) 2022-04-29

Family

ID=81269601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111657708.0A Pending CN114418776A (en) 2021-12-30 2021-12-30 Data processing method, device, terminal equipment and medium

Country Status (1)

Country Link
CN (1) CN114418776A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306819A (en) * 2023-03-22 2023-06-23 大连海事大学 Hyperspectral cross calibration method and device based on spectrum reconstruction and electronic equipment


Similar Documents

Publication Publication Date Title
Collin‐Dufresne et al. Do prices reveal the presence of informed trading?
Meinshausen et al. Monte Carlo methods for the valuation of multiple‐exercise options
TW530234B (en) Methods and systems for efficiently sampling portfolios for optimal underwriting
Farboodi et al. Where has all the big data gone?
Iori et al. Empirical analyses of networks in finance
Burnie Exploring the interconnectedness of cryptocurrencies using correlation networks
US20220207326A1 (en) Anomaly detection, data prediction, and generation of human-interpretable explanations of anomalies
Bakhach et al. TSFDC: A trading strategy based on forecasting directional change
CN110991936A (en) Enterprise grading and rating method, device, equipment and medium
US20110137781A1 (en) Intermarket Analysis
CN110796539A (en) Credit investigation evaluation method and device
BenSaïda et al. Value‐at‐risk under market shifts through highly flexible models
CN111695938A (en) Product pushing method and system
Sekerke Bayesian risk management: A guide to model risk and sequential learning in financial markets
CN114418776A (en) Data processing method, device, terminal equipment and medium
JP2018514889A (en) Method and system for calculating and providing an initial margin based on an initial margin standard model
CN109767333A (en) Select based method, device, electronic equipment and computer readable storage medium
CN115860924A (en) Supply chain financial credit risk early warning method and related equipment
Kartiwi et al. Sukuk rating prediction using voting ensemble strategy
CN113222767A (en) Data processing method and device for indexing securities combination
CN112396455A (en) Pricing method, apparatus, device and medium for data assets
Minnoor et al. Nifty price prediction from Nifty SGX using machine learning, neural networks and sentiment analysis
US20190279301A1 (en) Systems and Methods Using an Algorithmic Solution for Analyzing a Portfolio of Stocks, Commodities and/or Other Financial Assets Based on Individual User Data to Achieve Desired Risk Based Financial Goals
CN112734559B (en) Enterprise credit risk evaluation method and device and electronic equipment
Lusk Evaluation of the predictive validity of the CapitalCube Market navigation platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination