CN112785090A - Model training method, type prediction method, device and computing equipment

Info

Publication number: CN112785090A
Application number: CN202110233777.2A
Authority: CN
Inventors: 周茜; 浦婧蕾; 王毅
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-03-03
Filing date: 2021-03-03
Publication date: 2021-05-11

Abstract

The embodiments of this specification disclose a model training method, a type prediction method, corresponding apparatuses, and a computing device. The type prediction method comprises: acquiring a plurality of index data of a business object in a historical period; and inputting the index data into a decision tree-based ensemble learning classification model to obtain the type of the business object in a future period. The model training method, type prediction method, apparatuses, and computing device of the embodiments can improve the accuracy of the prediction result.

Description

Model training method, type prediction method, device and computing equipment

Technical Field

The embodiment of the specification relates to the technical field of computers, in particular to a model training method, a type prediction device and computing equipment.

Background

With the development of science and technology, the application of artificial intelligence technology brings various conveniences to people's daily life. In some scenarios, the type of a business object needs to be predicted. For example, it may be desirable to predict whether a stock is a newly entered heavily held stock.

In the related art, a logistic regression model may be selected to predict the type of a business object. However, the logistic regression model is relatively simple and tends to underfit when a large number of features are processed, which makes the prediction result inaccurate.

Disclosure of Invention

The embodiment of the specification provides a model training method, a type prediction device and computing equipment, so that the accuracy of a prediction result is improved. The technical scheme of the embodiment of the specification is as follows.

In a first aspect of embodiments of the present specification, there is provided a model training method, including:

acquiring a plurality of index data and labels of the business object, wherein the labels are used for representing the type of the business object;

screening target index data from the plurality of index data;

and training the integrated learning classification model based on the decision tree according to the target index data and the label.

In a second aspect of embodiments of the present specification, there is provided a type prediction method including:

acquiring a plurality of index data of a business object in a historical period;

and inputting the index data into an integrated learning classification model based on the decision tree to obtain the type of the business object in the future period.

In a third aspect of embodiments of the present specification, there is provided a model training apparatus including:

the acquisition unit is used for acquiring a plurality of index data and labels of business objects, wherein the labels are used for representing the types of the business objects;

the screening unit is used for screening target index data from the index data;

and the training unit is used for training the integrated learning classification model based on the decision tree according to the target index data and the label.

In a fourth aspect of embodiments of the present specification, there is provided a type prediction apparatus including:

the acquisition unit is used for acquiring a plurality of index data of the business object in a historical period;

and the input unit is used for inputting the index data into the integrated learning classification model based on the decision tree to obtain the type of the business object in the future time period.

In a fifth aspect of embodiments of the present specification, there is provided a computing device comprising:

at least one processor;

a memory storing program instructions configured to be suitable for execution by the at least one processor, the program instructions comprising instructions for performing the method of the first or second aspect.

According to the technical scheme provided by the embodiment of the specification, the integrated learning classification model based on the decision tree can be trained by using the index data and the labels of the business objects. In addition, the types of the business objects can be predicted by using the integrated learning classification model based on the decision tree, so that the accuracy of the prediction result is improved.

Drawings

To illustrate the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of this specification; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a model training method in an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a model training process in an embodiment of the present disclosure;

FIG. 3 is a flow chart illustrating a type prediction method according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of a type prediction apparatus in an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a computing device in an embodiment of the present specification.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.

Considering that the integrated learning classification model based on the decision tree has high accuracy, high efficiency and strong interpretability, the embodiment of the specification predicts the type of the business object by using the integrated learning classification model based on the decision tree.

Please refer to fig. 1. The embodiment of the specification provides a model training method. The model training method can be applied to a server. The server may be one server, a server cluster including a plurality of servers, or a server deployed in the cloud. The model training method can be used for training the integrated learning classification model based on the decision tree. The integrated learning classification model based on the decision tree can be an integrated learning model based on decision tree implementation. The decision tree-based ensemble learning classification model may be an XGBoost model. Of course, the integrated learning classification model based on the Decision Tree may also be other models, such as GBDT (Gradient Boosting Decision Tree).

The model training method may include the following steps.

Step S11: obtaining a plurality of index data and labels of the business objects, wherein the labels are used for representing the types of the business objects.

In some embodiments, the index data may include market index data of the business object and financial index data of an enterprise associated with the business object. Including both market index data and financial index data makes the index data more comprehensive, which can improve the training effect. The label may be used to indicate the type of the business object.

The business object may include a stock. The label may be used to indicate whether the stock is a newly entered heavily held stock. A heavily held stock may be a stock held in large quantities by institutions. Specifically, for example, a fund heavily held stock may be a stock held by fund companies that accounts for more than 20% of the stock's circulating market value. A newly entered heavily held stock may be a stock that was not a heavily held stock in the previous quarter and becomes a heavily held stock in the current quarter. The index data may include market index data of the stock and financial index data of the enterprise associated with the stock. The market index data can reflect the valuation, stock price, trading volume, and the like of the stock. The financial index data can reflect the company's operating condition in terms of profitability, operating capacity, cash flow, and the like.

For example, a plurality of index data of a business object can be as shown in table 1 below.

TABLE 1

Of course, in practice, the business object may be other financial objects such as futures or bonds.

In some embodiments, the server may collect metrics data and tags for one or more business objects. In practice, for each business object, the server may collect a plurality of index data of the business object and tags corresponding to the index data. For example, the server may collect a plurality of index data of the business object and tags corresponding to the plurality of index data from the internet. Specifically, for example, the server may collect, from the internet, data on the financial index of the listed company for each quarter during 2016 to 2020, data on the market quotation index of the stocks issued by the listed company for each quarter, and tags of the stocks issued by the listed company for each quarter.

Step S13: and screening target index data from the plurality of index data.

In some embodiments, in order to improve the training effect of the decision tree-based ensemble learning classification model, the server may filter the index data to obtain target index data.

The server can analyze the correlation among the index data in order to screen target index data from the plurality of index data. Specifically, the server may determine a correlation coefficient between every two of the plurality of index data, and may screen out a plurality of target index data such that the correlation coefficient between every two of the target index data satisfies a first condition. The correlation coefficient between index data represents the degree of correlation between them: the larger the coefficient, the stronger the correlation; the smaller the coefficient, the weaker the correlation. In practice, the correlation coefficient between two index data may be an empirical value, or it may be obtained by calculation. For example, the server may calculate a consistency coefficient (coefficient of consistency) between every two index data as the correlation coefficient between them. The first condition may include: the correlation coefficient is less than or equal to a first threshold. In this way, the selected target index data are only weakly correlated with one another.

For example, the plurality of index data may include index data A, B, C, D, and E. The correlation coefficient between A and B is 0.5, between A and C is 0.6, between A and D is 0.75, between A and E is 0.85, between B and C is 0.8, between B and D is 0.86, between B and E is 0.9, between C and D is 0.2, between C and E is 0.3, and between D and E is 0.88. The first condition may include: the correlation coefficient is less than or equal to 0.8. The server may then select index data A, C, and D from index data A, B, C, D, and E as the target index data.
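The pairwise screening in this example can be sketched as follows. This is only an illustration under stated assumptions: the specification does not say how the server chooses among subsets, so the helper names (`corr`, `satisfies_first_condition`, `largest_valid_subsets`) and the exhaustive search for the largest valid subsets are hypothetical choices. With the coefficients above, A, C, D is one of two largest subsets that satisfy the first condition.

```python
from itertools import combinations

# Pairwise correlation coefficients taken from the example above (symmetric).
CORR = {
    ("A", "B"): 0.5, ("A", "C"): 0.6, ("A", "D"): 0.75, ("A", "E"): 0.85,
    ("B", "C"): 0.8, ("B", "D"): 0.86, ("B", "E"): 0.9,
    ("C", "D"): 0.2, ("C", "E"): 0.3, ("D", "E"): 0.88,
}

def corr(a, b):
    """Look up a coefficient regardless of argument order."""
    return CORR[(a, b)] if (a, b) in CORR else CORR[(b, a)]

def satisfies_first_condition(subset, threshold):
    """True if every pair in the subset has a coefficient <= threshold."""
    return all(corr(a, b) <= threshold for a, b in combinations(subset, 2))

def largest_valid_subsets(names, threshold):
    """Return every largest subset whose pairs all satisfy the first condition.

    Exhaustive search (an assumption, not the patent's stated procedure);
    only practical for a handful of indicators.
    """
    for size in range(len(names), 0, -1):
        found = [s for s in combinations(names, size)
                 if satisfies_first_condition(s, threshold)]
        if found:
            return found
    return []

candidates = largest_valid_subsets(["A", "B", "C", "D", "E"], threshold=0.8)
print(candidates)
```

Because the search is exponential in the number of indicators, a practical system would more likely use a greedy pass or a clustering of correlated indicators; that choice is left open here.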

Alternatively, the server may further determine a correlation coefficient between the metric data and the tag; index data having a correlation coefficient satisfying the second condition may be screened out from the plurality of index data as target index data. The correlation coefficient between the index data and the label may be used to represent the correlation between the index data and the label. The correlation coefficient between the index data and the label may be an empirical value. Alternatively, the correlation coefficient between the index data and the label may be obtained by calculation. For example, the server may calculate a shrarp value between the index data and the tag as a correlation coefficient between the index data and the tag. The second condition may include: the correlation coefficient is greater than or equal to a second threshold. Therefore, the screened target index data can be index data with large influence on the label.
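A minimal sketch of screening by correlation with the label is given below. It assumes the absolute Pearson correlation as the coefficient, which the specification does not prescribe; the indicator names, the toy values, and the function `screen_by_label` are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (norm_x * norm_y)

def screen_by_label(indicators, labels, second_threshold):
    """Keep indicators whose |correlation with the label| >= second_threshold."""
    return [name for name, values in indicators.items()
            if abs(pearson(values, labels)) >= second_threshold]

# Toy data for illustration only (not from the specification).
toy_indicators = {
    "pe_ratio": [1.0, 2.0, 3.0, 4.0],
    "turnover": [2.0, 1.0, 2.0, 1.0],
}
toy_labels = [0, 0, 1, 1]

selected = screen_by_label(toy_indicators, toy_labels, second_threshold=0.5)
print(selected)
```

Taking the absolute value treats strongly negative correlations as informative too; if only positive association matters, the `abs` call can be dropped.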

In some embodiments, there may be some data missing from the multiple metric data of the business object. For example, a company has missing certain index data for a certain quarter. To this end, the server may fill in missing data. For example, the server may populate the missing data with an average or mode.
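Filling missing values with the average or the mode could look like the sketch below. It assumes that missing entries are represented as `None`; the function name `fill_missing` and its `strategy` parameter are hypothetical.

```python
from statistics import mean, mode

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or mode of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate the fill value from
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

print(fill_missing([1.0, None, 3.0]))           # mean fill
print(fill_missing([2, 2, None, 5], "mode"))    # mode fill
```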

In some embodiments, the server may normalize the metric data. Specifically, the server can adopt a Z-score method or a Min-Max method to carry out normalization processing on the index data.
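The two normalization methods named above can be sketched as follows. This is a minimal illustration; the use of the population standard deviation for the Z-score and the handling of constant series are assumptions, since the specification does not state them.

```python
from statistics import mean, pstdev

def z_score(values):
    """Z-score normalization: (v - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant series: map everything to 0
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-Max normalization: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]

print(z_score([1.0, 2.0, 3.0]))
print(min_max([2.0, 4.0, 6.0]))
```

Tree-based models are generally insensitive to monotonic rescaling of features, so this step mainly matters for the correlation analysis or when other model families are compared.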

Step S15: and training the integrated learning classification model based on the decision tree according to the target index data and the label.

In some embodiments, the decision tree based ensemble learning classification model may include an XGBoost model.

Please refer to FIG. 2. The XGBoost model may be an additive model. Specifically, the XGBoost model may be expressed as

$$f_M(x) = \sum_{m=1}^{M} T(x, \theta_m)$$

where $M$ represents the number of decision trees in the XGBoost model, $f_M(x)$ represents the prediction result of the XGBoost model, $T(x, \theta_m)$ represents the prediction result of the m-th decision tree, $\theta_m$ represents the parameters of the m-th decision tree, and $x$ represents the input of the XGBoost model.

The training process of the XGBoost model may be implemented based on a forward stagewise algorithm. Specifically, during training, one decision tree may be added to the XGBoost model in each iteration, so that the XGBoost model can be expressed as

$$f_m(x) = f_{m-1}(x) + T(x, \theta_m)$$

where $f_{m-1}(x)$ represents the prediction result of the current XGBoost model, $T(x, \theta_m)$ represents the prediction result of the newly added m-th decision tree, and $f_m(x)$ represents the prediction result of the XGBoost model after the m-th decision tree is added. The optimization goal of each iteration may be to make

$$\sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + T(x_i, \theta_m)\big)$$

attain its minimum value.

The parameters of the XGBoost model are obtained by solving

$$\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + T(x_i, \theta_m)\big)$$

where $y_i$ denotes the label of the i-th training sample, $x_i$ denotes the index data of that sample, $N$ denotes the number of training samples, $L$ denotes the loss function, and $\arg\min$ denotes the value of $\theta_m$ at which the sum attains its minimum.

Of course, the integrated learning classification model based on the Decision Tree may also be other models, such as GBDT (Gradient Boosting Decision Tree).
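The forward stagewise procedure can be illustrated with a deliberately simplified sketch. It is not XGBoost: it assumes a squared-error loss, a single input feature, and one-split trees (stumps), and it omits the regularization and second-order approximation that XGBoost uses. Under squared error, minimizing the loss of $f_{m-1}(x) + T(x, \theta_m)$ reduces to fitting the new tree to the current residuals. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree T(x, theta) to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    if best is None:
        # All inputs identical: fall back to a constant prediction.
        constant = sum(residuals) / len(residuals)
        return xs[0], constant, constant
    return best[1], best[2], best[3]

def predict_stump(theta, x):
    """Evaluate one stump; theta = (threshold, left value, right value)."""
    threshold, left_value, right_value = theta
    return left_value if x <= threshold else right_value

def fit_additive_model(xs, ys, num_trees):
    """Forward stagewise fitting: f_m = f_{m-1} + T(x, theta_m)."""
    trees = []
    predictions = [0.0] * len(xs)  # f_0(x) = 0
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        theta = fit_stump(xs, residuals)
        trees.append(theta)
        predictions = [p + predict_stump(theta, x)
                       for p, x in zip(predictions, xs)]
    return trees

def predict(trees, x):
    """f_M(x) is the sum of all tree predictions."""
    return sum(predict_stump(theta, x) for theta in trees)

# Toy data for illustration only.
trees = fit_additive_model([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0], num_trees=3)
print(predict(trees, 1.5), predict(trees, 3.5))
```

For a classification task such as the one described here, a logistic loss would normally replace the squared error, and a library implementation would be used in practice.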

In some embodiments, the target metric data and the labels of the business objects may constitute training data. For example, the server may collect, from the internet, quarterly financial index data for the listed companies, quarterly market quotation index data for stocks issued by the listed companies, and quarterly labels for stocks issued by the listed companies over 2016-2020. Then, for each business object, the server may construct 20 training data for the business object according to the target index data for 20 quarters during 2016-2020 for the business object and the tags for 20 quarters.
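As a rough sketch of how the 20 quarterly training samples per business object might be assembled: the dictionary layout keyed by (year, quarter) and the function names are assumptions made for illustration, and the feature values are placeholders.

```python
def quarters(start_year, end_year):
    """List (year, quarter) pairs for every quarter in the inclusive range."""
    return [(year, quarter)
            for year in range(start_year, end_year + 1)
            for quarter in range(1, 5)]

def build_training_data(target_indicators, labels, periods):
    """Pair each period's target index data with that period's label.

    target_indicators: dict mapping (year, quarter) -> list of feature values
    labels:            dict mapping (year, quarter) -> 0 or 1
    Periods missing from either mapping are skipped.
    """
    return [(target_indicators[p], labels[p])
            for p in periods
            if p in target_indicators and p in labels]

periods = quarters(2016, 2020)

# Placeholder values for one stock (illustration only).
toy_indicators = {p: [float(i)] for i, p in enumerate(periods)}
toy_labels = {p: i % 2 for i, p in enumerate(periods)}

samples = build_training_data(toy_indicators, toy_labels, periods)
print(len(samples))
```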

The server may train the decision tree based ensemble learning classification model according to the training data. The server specifically can adopt a gradient descent method or a Newton method to train the integrated learning classification model based on the decision tree.

The model training method in the embodiment of the specification can acquire a plurality of index data and labels of the business object, wherein the labels are used for representing the type of the business object; target index data can be screened from a plurality of index data; the decision tree-based ensemble learning classification model can be trained according to target index data and labels. Therefore, the integrated learning classification model based on the decision tree can be trained by using the index data and the labels of the business objects, and a basis is provided for predicting the types of the business objects by using the integrated learning classification model based on the decision tree.

Please refer to fig. 3. The embodiment of the specification provides a classification method. The classification method may be applied to a server. The server may be one server, a server cluster including a plurality of servers, or a server deployed in the cloud.

The classification method may include the following steps.

Step S21: and acquiring a plurality of index data of the business object in a historical period.

In some embodiments, the indicator data may include market indicator data for the business object and financial indicator data for a business associated with the business object. The index data can comprise market situation index data and financial index data, so that the index data for predicting the stock types is more comprehensive, and the accuracy of prediction is improved. For example, the business object may include stocks. The index data may include market index data for the stock, and financial index data for a business associated with the stock. Wherein the market index data can reflect valuation, stock price, volume of trades, and the like of the stocks. The financial index data can reflect the operation of the company in terms of profitability, operational capacity, cash flow, and the like.

In some embodiments, the server may obtain a plurality of metric data of the business object over a historical period. The length of the historical period may be one quarter or one year. For example, the server may obtain a plurality of metric data for the business object over the last quarter. Wherein, the type of the index data can be a specified type. For example, the specified type may be a type of the target index data in the embodiment corresponding to fig. 1.

Step S23: and inputting the index data into an integrated learning classification model based on the decision tree to obtain the type of the business object in the future period.

In some embodiments, the integrated learning classification model based on the decision tree may be obtained by training based on the model training method of the embodiment corresponding to fig. 1. The decision tree based ensemble learning classification model may include an XGBoost model.

In some embodiments, the type of a business object may differ from one period to another. For example, the business object may be a stock that is a newly entered heavily held stock in one quarter and is not a newly entered heavily held stock in another quarter. The server can input the index data into the decision tree-based ensemble learning classification model to obtain the type of the business object in a future period. The length of the future period may be one quarter or one year. The type may be selected from a first type and a second type, whose meanings depend on the business object. Taking a stock as the business object, the first type may indicate that the stock is a newly entered heavily held stock, and the second type may indicate that the stock is not a newly entered heavily held stock.

For example, the server may obtain a plurality of index data of the business object in the last quarter; the index data may be input to a decision tree based ensemble learning classification model to obtain the type of the business object in the next quarter.
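The prediction step can be sketched as below. Everything here is illustrative: `stand_in_model` is a placeholder for a trained decision tree-based ensemble model, and the indicator names, the 0.5 cutoff, and the function `predict_type` are assumptions rather than details given in the specification.

```python
FIRST_TYPE = "newly entered heavily held stock"
SECOND_TYPE = "not a newly entered heavily held stock"

def predict_type(model, last_quarter_indicators, target_names, cutoff=0.5):
    """Predict the next-quarter type from last-quarter index data.

    Only the indicators screened during training (target_names) are passed
    to the model, in the same order as at training time.
    """
    features = [last_quarter_indicators[name] for name in target_names]
    score = model(features)  # assumed to return a score in [0, 1]
    return FIRST_TYPE if score >= cutoff else SECOND_TYPE

def stand_in_model(features):
    """Placeholder for a trained model; NOT a real classifier."""
    return 0.9 if features[0] > 1.0 else 0.1

last_quarter = {"pe_ratio": 1.8, "turnover": 0.4, "unused_indicator": 7.0}
result = predict_type(stand_in_model, last_quarter, ["pe_ratio", "turnover"])
print(result)
```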

The classification method of the embodiment of the specification can acquire a plurality of index data of the business object in a historical period; the index data can be input into an integrated learning classification model based on a decision tree to obtain the type of the business object in the future period. Therefore, the accuracy of the prediction result can be improved by predicting the type of the business object by using the integrated learning classification model based on the decision tree.

Please refer to fig. 4. The embodiment of the specification provides a model training device, and the device can comprise the following units.

An obtaining unit 31, configured to obtain a plurality of index data and a tag of a business object, where the tag is used to indicate a type of the business object;

a screening unit 33 configured to screen target index data from the plurality of index data;

and the training unit 35 is configured to train the ensemble learning classification model based on the decision tree according to the target index data and the label.

Please refer to FIG. 5. The embodiment of the present specification provides a type prediction apparatus, which may include the following units.

An obtaining unit 41, configured to obtain a plurality of index data of the business object in a history period;

and the input unit 43 is used for inputting the index data into the integrated learning classification model based on the decision tree to obtain the type of the business object in the future time period.

Please refer to fig. 6. The embodiment of the specification also provides a computing device.

The computing device may include a memory and a processor.

In the present embodiment, the Memory includes, but is not limited to, a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), and the like. The memory may be used to store computer instructions.

In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The processor may be configured to execute the computer instructions to implement the embodiments corresponding to FIG. 1 or FIG. 3.

It should be noted that, in the present specification, each embodiment is described in a progressive manner, and the same or similar parts in each embodiment may be referred to each other, and each embodiment focuses on differences from other embodiments. In particular, for the apparatus embodiment and the computing device embodiment, since they are substantially similar to the method embodiment, the description is relatively simple, and reference may be made to some descriptions of the method embodiment for relevant points. In addition, it is understood that one skilled in the art, after reading this specification document, may conceive of any combination of some or all of the embodiments listed in this specification without the need for inventive faculty, which combinations are also within the scope of the disclosure and protection of this specification.

In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or a software improvement (an improvement to a method flow). However, as technology has advanced, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer programs a digital system onto a single PLD, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Furthermore, nowadays, instead of manually making integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained merely by describing the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

From the above description of the embodiments, it is clear to those skilled in the art that the present specification can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present specification may be essentially or partially implemented in the form of software products, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.

The description is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

While the specification has been described with examples, those skilled in the art will appreciate that there are numerous variations and permutations of the specification that do not depart from the spirit of the specification, and it is intended that the appended claims include such variations and modifications that do not depart from the spirit of the specification.

Claims

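1. A model training method, comprising:

acquiring a plurality of index data and labels of a business object, wherein the labels are used for representing the type of the business object;

screening target index data from the plurality of index data;

and training a decision tree-based ensemble learning classification model according to the target index data and the labels.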

2. The method of claim 1, the indicator data comprising market indicator data for a business object and financial indicator data for a business associated with the business object; the business object comprises a stock;

the decision tree-based ensemble learning classification model comprises an XGboost model.

3. The method of claim 1, wherein the screening target metric data from the plurality of metric data comprises:

determining a correlation coefficient between every two of the plurality of index data; and screening a plurality of target index data from the plurality of index data, wherein the correlation coefficient between every two of the plurality of target index data meets a first condition.

4. The method of claim 1, wherein the screening target metric data from the plurality of metric data comprises:

determining a correlation coefficient between the index data and the label;

and screening out index data of which the correlation coefficient meets a second condition from the plurality of index data as target index data.

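5. A type prediction method, comprising:

acquiring a plurality of index data of a business object in a historical period;

and inputting the index data into a decision tree-based ensemble learning classification model to obtain the type of the business object in a future period.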

6. The method of claim 5, wherein the business object comprises a stock, and wherein the type is selected from a first type and a second type, the first type being used to indicate that the stock is a newly entered heavily held stock, and the second type being used to indicate that the stock is not a newly entered heavily held stock.

7. The method of claim 5, wherein the decision tree-based ensemble learning classification model is trained based on the method of any one of claims 1-4; the decision tree-based ensemble learning classification model comprises an XGBoost model.

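8. A model training apparatus comprising:

the acquisition unit is used for acquiring a plurality of index data and labels of business objects, wherein the labels are used for representing the types of the business objects;

the screening unit is used for screening target index data from the index data;

and the training unit is used for training a decision tree-based ensemble learning classification model according to the target index data and the labels.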

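9. A type prediction apparatus comprising:

the acquisition unit is used for acquiring a plurality of index data of the business object in a historical period;

and the input unit is used for inputting the index data into a decision tree-based ensemble learning classification model to obtain the type of the business object in a future period.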

10. A computing device, comprising:

at least one processor;

a memory storing program instructions configured for execution by the at least one processor, the program instructions comprising instructions for performing the method of any of claims 1-7.