CN111882416A

CN111882416A - Training method and related device of risk prediction model

Info

Publication number: CN111882416A
Application number: CN202010720354.9A
Authority: CN
Inventors: 李招; 张彬杰
Original assignee: Weikun Shanghai Technology Service Co Ltd
Current assignee: Weikun Shanghai Technology Service Co Ltd
Priority date: 2020-07-24
Filing date: 2020-07-24
Publication date: 2020-11-03

Abstract

The application relates to the field of block storage systems and artificial intelligence, and discloses a risk prediction model training method and a related device, wherein the method comprises the following steps: acquiring a first financial data set, wherein the first financial data set comprises M pieces of first financial data corresponding to a plurality of first fields; vectorizing, for the first financial data set, the plurality of pieces of first financial data associated with each of the plurality of first fields to obtain a plurality of first vectors; determining the correlation between every two first vectors in the plurality of vectors by adopting a preset feature selection algorithm; determining a second financial data set from the first financial data set according to the correlation between each two first vectors; training a risk prediction model using the second financial data set. By implementing the embodiment of the application, the training period of the risk prediction model is shortened, and the training complexity is reduced.

Description

Training method and related device of risk prediction model

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and a related apparatus for training a risk prediction model.

Background

With the rapid development of emerging technologies, various industries begin to utilize deep learning, neural networks and the like to realize risk prediction. For example, the risk of default of the enterprise is predicted through a risk prediction model. Generally, before the risk of the enterprise default is predicted through the risk prediction model, the risk prediction model needs to be trained. In the prior art, when a risk prediction model is trained, a financial data set is often adopted directly. Due to the fact that the data volume of the financial data set is large, the training period of the risk prediction model is long, and the training complexity is high.

Disclosure of Invention

The embodiment of the application provides a training method and a related device of a risk prediction model, and by implementing the embodiment of the application, the training period of the risk prediction model is shortened, and the training complexity is reduced.

The first aspect of the present application provides a method for training a risk prediction model, including:

acquiring a first financial data set, wherein the first financial data set comprises M pieces of first financial data corresponding to a plurality of first fields, the plurality of first fields comprise a first field A and a first field B, the first field A is associated with X pieces of first financial data, the first field B is associated with Y pieces of first financial data, and M = X + Y, wherein M, X and Y are integers greater than 1;

vectorizing, for the first financial data set, the plurality of pieces of first financial data associated with each of the plurality of first fields to obtain a plurality of first vectors;

determining the correlation between every two first vectors in the plurality of first vectors by adopting a preset feature selection algorithm;

determining a second financial data set from the first financial data set according to the correlation between each two first vectors;

training a risk prediction model using the second financial data set.

A second aspect of the present application provides a training apparatus for a risk prediction model, including:

a processing module, configured to: obtain a first financial data set, where the first financial data set includes M pieces of first financial data corresponding to a plurality of first fields, the plurality of first fields include a first field A and a first field B, the first field A is associated with X pieces of first financial data, the first field B is associated with Y pieces of first financial data, M = X + Y, and M, X, and Y are integers greater than 1; vectorize, for the first financial data set, the plurality of pieces of first financial data associated with each of the plurality of first fields to obtain a plurality of first vectors; determine the correlation between every two first vectors in the plurality of first vectors by adopting a preset feature selection algorithm; determine a second financial data set from the first financial data set according to the correlation between every two first vectors; and train a risk prediction model using the second financial data set.

A third aspect of the application provides an electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and include instructions for performing the steps of any of the above methods of training a risk prediction model.

A fourth aspect of the application provides a computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement any of the above methods of training a risk prediction model.

It can be seen that, in the above technical solution, the correlation between the financial data is mined, the second financial data set is determined from the first financial data set according to that correlation, and the risk prediction model is trained using the second financial data set. This reduces the data used for training the risk prediction model, which shortens the training period and reduces the training complexity.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Wherein:

FIG. 1 is a schematic diagram of a training system for a risk prediction model according to an embodiment of the present disclosure;

FIG. 2 is a schematic flowchart of a method for training a risk prediction model according to an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of a training method for a risk prediction model according to an embodiment of the present disclosure;

FIG. 4 is a schematic flowchart of a training method for a risk prediction model according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a risk prediction model training apparatus according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Details are given below.

The terms "first" and "second" in the description and claims of the present application and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Referring to fig. 1, fig. 1 is a schematic diagram of a training system of a risk prediction model provided in an embodiment of the present application, where the training system 100 of the risk prediction model includes a training device 110 of the risk prediction model. The risk prediction model training device 110 is used to process and store the first financial data set. The training system 100 of the risk prediction model may include an integrated single device or multiple devices, and for convenience of description, the training system 100 of the risk prediction model is generally referred to as an electronic device. It will be apparent that the electronic device may include various handheld devices, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem having wireless communication capability, as well as various forms of User Equipment (UE), Mobile Stations (MS), terminal equipment (terminal device), and the like.

With reference to fig. 1, an embodiment of the present application provides a method for training a risk prediction model, and the following describes the embodiment of the present application in detail.

Referring to fig. 2, fig. 2 is a schematic flowchart of a training method of a risk prediction model according to an embodiment of the present application. The risk prediction model training method can be applied to an electronic device, as shown in fig. 2, and includes:

201. Acquiring a first financial data set, wherein the first financial data set comprises M pieces of first financial data corresponding to a plurality of first fields, the plurality of first fields comprise a first field A and a first field B, the first field A is associated with X pieces of first financial data, the first field B is associated with Y pieces of first financial data, and M = X + Y, wherein M, X and Y are integers greater than 1.

The first fields may include, for example, fields such as basic information of listed and bond-issuing enterprises, financial reports, audit opinions, credit ratings, negative events, shareholders and equity, and penalties imposed by the securities regulatory commission. Specifically, the first fields may include, for example, the percentage increase in net profit within 3 years, the credit rating rise within 3 years, the number of negative events within 3 years, the three-year average net profit, and the like, which are not limited herein.

The first financial data may include, for example, the percentage increase in net profit within 3 years, the credit rating rise within 3 years, the number of negative events within 3 years, the three-year average net profit, and the like, which are not limited herein.

For example, Table 1 shows a first financial data set provided in an embodiment of the present application.

TABLE 1

It can be seen that in Table 1, one first field is the credit rating rise within 3 years, one first field is the number of negative events within 3 years, and one first field is the three-year average net profit. For the first field "credit rating rise within 3 years", the corresponding first financial data includes 15%, 11%, and so on. For the first field "number of negative events within 3 years", the corresponding first financial data includes 8, 3, and so on. For the first field "three-year average net profit", the corresponding first financial data includes 9000, 11000, and so on.

Wherein X may be equal to or different from Y, and is not particularly limited. Further, the first field a and the first field B are two different fields in the plurality of first fields.

202. Vectorizing, for the first financial data set, the plurality of pieces of first financial data associated with each of the plurality of first fields to obtain a plurality of first vectors.

With reference to Table 1, for the first field "credit rating rise within 3 years", the corresponding first vector is formed from its first financial data (15%, 11%, and so on).

For the first field "number of negative events within 3 years", the corresponding first vector is formed from its first financial data (8, 3, and so on).

For the first field "three-year average net profit", the corresponding first vector is formed from its first financial data (9000, 11000, and so on).
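Step 202 can be illustrated with a minimal sketch that groups the pieces of financial data by field and produces one vector per field. The record layout (a list of `(field, value)` pairs), the field identifiers, and the function name `vectorize_by_field` are assumptions made for illustration; the sample values are the ones mentioned for Table 1.

```python
from collections import defaultdict


def vectorize_by_field(records):
    """Group (field, value) records into one vector (list of floats) per field.

    The order of the values inside each vector follows the order of the records.
    """
    vectors = defaultdict(list)
    for field, value in records:
        vectors[field].append(float(value))
    return dict(vectors)


# Hypothetical records built from the values mentioned for Table 1.
records = [
    ("credit_rating_rise_3y", 0.15),
    ("credit_rating_rise_3y", 0.11),
    ("negative_events_3y", 8),
    ("negative_events_3y", 3),
    ("net_profit_avg_3y", 9000),
    ("net_profit_avg_3y", 11000),
]

first_vectors = vectorize_by_field(records)
```

Each key of `first_vectors` is a first field and each value is the first vector associated with that field.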

203. Determining the correlation between every two first vectors in the plurality of first vectors by adopting a preset feature selection algorithm.

The preset feature selection algorithm may be, for example, a feature selection algorithm based on the Pearson correlation coefficient.
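As an illustration of step 203, the following minimal sketch computes the Pearson correlation coefficient for every pair of vectors. It assumes that all vectors have the same length (that is, X = Y), which the source does not require, and it treats a constant vector as uncorrelated with everything. The function names and sample values are hypothetical.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / (norm_u * norm_v)


def pairwise_correlations(vectors):
    """Map every unordered pair of field names to its Pearson coefficient."""
    return {
        (a, b): pearson(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


# Hypothetical vectors: "b" is a multiple of "a", "c" decreases as "a" grows.
sample_vectors = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
correlations = pairwise_correlations(sample_vectors)
```

In this sample, the pair `("a", "b")` is perfectly positively correlated and the pair `("a", "c")` is perfectly negatively correlated, up to floating-point rounding.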

204. Determining a second financial data set from the first financial data set according to the correlation between each two first vectors.

Optionally, in a possible implementation, the correlation between every two first vectors includes a correlation between a second vector and a third vector, where the second vector is any one of the plurality of first vectors and the third vector is any one of the plurality of first vectors other than the second vector. In this case, determining the second financial data set from the first financial data set according to the correlation between every two first vectors includes:

if the correlation between the second vector and the third vector is higher than a preset correlation, retaining the plurality of pieces of first financial data corresponding to the second vector and deleting the plurality of pieces of first financial data corresponding to the third vector to obtain the second financial data set; or deleting the plurality of pieces of first financial data corresponding to the second vector and retaining the plurality of pieces of first financial data corresponding to the third vector to obtain the second financial data set.

The preset correlation may be set by an administrator or may be configured in the electronic device.

In addition, if the correlation between the second vector and the third vector is lower than the preset correlation, the plurality of pieces of first financial data corresponding to both the second vector and the third vector are retained to obtain the second financial data set.

Therefore, in this technical solution, the data used for training the risk prediction model is reduced on the basis of correlation, which shortens the training period of the risk prediction model and reduces the training complexity.
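The selection rule of step 204 can be sketched as follows, assuming the pairwise correlations have already been computed and are passed in as a dictionary. Comparing the absolute value of the coefficient with the preset correlation, and always keeping the first field of a pair, are assumptions; the source only states that one of the two fields is retained. The function name `select_fields` and the sample numbers are hypothetical.

```python
def select_fields(fields, correlations, preset_correlation):
    """Return the fields kept after removing one field of each highly correlated pair.

    fields: ordered list of field names.
    correlations: dict mapping (field_a, field_b) to a correlation coefficient.
    preset_correlation: threshold above which a pair counts as redundant.
    """
    dropped = set()
    for (field_a, field_b), r in correlations.items():
        if field_a in dropped or field_b in dropped:
            continue  # one side of this pair has already been deleted
        if abs(r) > preset_correlation:  # assumption: compare the absolute value
            dropped.add(field_b)  # keep the "second vector", delete the "third"
    return [f for f in fields if f not in dropped]


# Hypothetical correlations for three fields.
example_fields = ["a", "b", "c"]
example_correlations = {("a", "b"): 0.95, ("a", "c"): 0.10, ("b", "c"): 0.20}
kept_fields = select_fields(example_fields, example_correlations, 0.8)
```

With a preset correlation of 0.8, only the pair `("a", "b")` exceeds the threshold, so field `"b"` is deleted and `kept_fields` contains `"a"` and `"c"`. The second financial data set then consists of the data associated with the kept fields.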

205. Training a risk prediction model using the second financial data set.

Referring to fig. 3, fig. 3 is a schematic flowchart of a training method for a risk prediction model according to an embodiment of the present application. The method for training the risk prediction model may be applied to an electronic device, where, as shown in fig. 3, the acquiring a first financial data set includes:

301. Acquiring an initial financial data set from at least one blockchain, wherein the initial financial data set comprises N pieces of initial financial data corresponding to a plurality of initial fields, the plurality of initial fields comprise an initial field A and an initial field B, the initial field A is associated with S pieces of initial financial data, the initial field B is associated with T pieces of initial financial data, N = S + T, and N, S and T are integers greater than 1.

A blockchain is a chained data structure that links data blocks in chronological order, and is a distributed ledger that is cryptographically guaranteed to be tamper-resistant and unforgeable. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

Further, the properties of a blockchain include openness, consensus, decentralization, trustlessness, transparency, anonymity of both parties, tamper resistance, traceability, and the like. Openness and transparency mean that anyone can participate in the blockchain network, each device can act as a node, and each node is allowed to obtain a complete copy of the database. Based on a consensus mechanism, the nodes jointly maintain the whole blockchain through competitive computation. When any node fails, the remaining nodes can still work normally. Decentralization and trustlessness mean that the blockchain is a peer-to-peer network formed by many nodes together, with no centralized device or management body. Data exchange between nodes is verified by digital signatures, so mutual trust is not needed, and as long as the exchange follows the rules set by the system, nodes cannot deceive one another. Transparency and anonymity mean that the operating rules of the blockchain are public and all data is public, so each transaction is visible to all nodes. Because nodes do not need to trust each other, they do not need to disclose their identities, and each participating node is anonymous. Tamper resistance and traceability mean that a modification of the database by a single node, or even by several nodes, cannot affect the databases of other nodes unless more than 51% of the nodes in the entire network can be controlled to make the modification at the same time, which is almost impossible. In the blockchain, each transaction is linked to two adjacent blocks by cryptographic means, so any transaction record can be traced.

In particular, a blockchain is a new distributed infrastructure and computing paradigm that uses the blockchain data structure to verify and store data, uses distributed node consensus algorithms to generate and update data, uses cryptography to secure data transmission and access, and uses smart contracts composed of automated script code to program and manipulate data. The tamper resistance of blockchain technology therefore fundamentally changes the centralized way of establishing credit, and effectively improves the immutability and security of data. A smart contract allows all terms to be written as programs that execute automatically on the blockchain, so that when the conditions that trigger the smart contract are met, the blockchain enforces the content of the contract without obstruction from any external force. This guarantees the validity and enforceability of the contract, and can greatly reduce cost and improve efficiency. Each node on the blockchain holds the same ledger, which ensures that the recording process of the ledger is open and transparent. Blockchain technology enables peer-to-peer, open, and transparent direct interaction, which makes efficient, large-scale information exchange without a centralized intermediary a reality.
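The chained, tamper-evident structure described above can be sketched with a toy hash chain. This is only an illustration of how linking each block to the hash of the previous block makes modifications detectable; it is not the blockchain platform used in the embodiment, and it omits consensus, digital signatures, and smart contracts. The block layout and the function names are assumptions.

```python
import copy
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, timestamp, data, prev_hash):
    """SHA-256 digest over the block contents and the previous block's hash."""
    payload = json.dumps(
        {"index": index, "timestamp": timestamp, "data": data, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, timestamp, data):
    """Append a block that is linked to the current last block by its hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "timestamp": timestamp,
        "data": data,
        "prev_hash": prev_hash,
        "hash": block_hash(index, timestamp, data, prev_hash),
    })


def is_valid(chain):
    """Check every stored hash and every link to the previous block."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else GENESIS_PREV_HASH
        if block["prev_hash"] != expected_prev:
            return False
        recomputed = block_hash(
            block["index"], block["timestamp"], block["data"], block["prev_hash"]
        )
        if block["hash"] != recomputed:
            return False
    return True


# Hypothetical financial records stored in chronological order.
chain = []
append_block(chain, 1, {"field": "negative_events_3y", "value": 8})
append_block(chain, 2, {"field": "negative_events_3y", "value": 3})

# Modifying an earlier record without recomputing the hashes is detected.
tampered = copy.deepcopy(chain)
tampered[0]["data"]["value"] = 0
```

Here `is_valid(chain)` holds for the untouched chain, while `is_valid(tampered)` fails because the stored hash of the first block no longer matches its contents.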

The initial fields may include, for example, fields such as basic information of listed and bond-issuing enterprises, financial reports, audit opinions, credit ratings, negative events, shareholders and equity, and penalties imposed by the securities regulatory commission. Specifically, the initial fields may include, for example, the percentage increase in net profit within 3 years, the credit rating rise within 3 years, the number of negative events within 3 years, the three-year average net profit, and the like, which are not limited herein.

The initial financial data may include, for example, the percentage increase in net profit within 3 years, the credit rating rise within 3 years, the number of negative events within 3 years, the three-year average net profit, and the like, which are not limited herein.

S may or may not be equal to T, which is not limited herein. Further, the initial field A and the initial field B are two different fields of the plurality of initial fields.

302. Determining sparsity of the initial set of financial data.

Optionally, in a possible implementation, the determining sparsity of the initial financial data set includes: constructing a matrix from the initial set of financial data, a column of elements in the matrix corresponding to a plurality of pieces of initial financial data associated with an initial field of the plurality of initial fields; determining the number of sparse elements of each column of elements in the matrix, wherein initial data corresponding to the sparse elements are zero; and determining the sparsity corresponding to the matrix according to the number of sparse elements of each column of elements in the matrix.

The matrix has n rows and m columns, where n is the number of initial fields, m is the number of pieces of initial financial data associated with an initial field K, and the initial field K is the initial field with the most associated initial financial data among the plurality of initial fields.

For example, Table 2 shows an initial financial data set provided in an embodiment of the present application.

TABLE 2

It can be seen that in Table 2, one initial field is the credit rating rise within 3 years, one initial field is the number of negative events within 3 years, and one initial field is the three-year average net profit. For the initial field "credit rating rise within 3 years", the corresponding initial financial data includes 15%, 0%, 11%, and so on. For the initial field "number of negative events within 3 years", the corresponding initial financial data includes 8, 0, 5, and so on. For the initial field "three-year average net profit", the corresponding initial financial data includes 9000, 11000, 15000, and so on. Further, the matrix may be

$$\begin{pmatrix} 15\% & 8 & 9000 \\ 0\% & 0 & 11000 \\ 11\% & 5 & 15000 \end{pmatrix}$$

It can be seen that the matrix is a 3-row and 3-column matrix, the first column is initial financial data associated with a field of "credit rating rise in 3 years", the second column is initial financial data associated with a field of "number of negative events in 3 years", and the third column is initial financial data associated with a field of "three-year average of net profit in 3 years". The number of the sparse elements in the first column is 1, the number of the sparse elements in the second column is 1, and the number of the sparse elements in the third column is 0. Therefore, the sparsity corresponding to this matrix is 2.

Therefore, in this technical solution, determining the sparsity prepares for subsequently obtaining the first financial data set.
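The sparsity computation of step 302 can be sketched as follows, using the 3 x 3 example in which each column holds the initial financial data of one initial field. Summing the per-column counts of zero elements is the reading that matches the stated result of 2; the function names are hypothetical, and percentages are written as fractions.

```python
def column_zero_counts(matrix):
    """Number of sparse (zero) elements in each column of a row-major matrix."""
    n_cols = len(matrix[0])
    return [sum(1 for row in matrix if row[j] == 0) for j in range(n_cols)]


def sparsity(matrix):
    """Sparsity of the matrix, taken here as the total number of zero elements."""
    return sum(column_zero_counts(matrix))


# Columns: credit rating rise within 3 years, number of negative events
# within 3 years, three-year average net profit (values from the Table 2 example).
matrix = [
    [0.15, 8, 9000],
    [0.00, 0, 11000],
    [0.11, 5, 15000],
]

zero_counts = column_zero_counts(matrix)  # [1, 1, 0]
matrix_sparsity = sparsity(matrix)        # 2
```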

303. If the sparsity is less than a threshold, determining, for the initial financial data set, whether the plurality of pieces of initial financial data associated with at least one of the plurality of initial fields do not satisfy a preset distribution.

If yes, go to step 304; if not, go to step 305.

The preset distribution may be, for example, a Gaussian distribution.

304. Deleting, from the initial financial data set, the plurality of pieces of initial financial data associated with the at least one initial field to obtain a remaining initial financial data set; determining the remaining initial financial data set as the first financial data set.

305. Determining the initial financial data set as the first financial data set.

It can be seen that, in the above technical solution, the first financial data set is determined by determining the sparsity of the initial financial data set and, when the sparsity is less than the threshold, determining whether the plurality of pieces of initial financial data associated with at least one of the plurality of initial fields do not satisfy the preset distribution. This also provides more reliable training data for the subsequent training of the risk prediction model.
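Steps 303 to 305 can be sketched as follows. The source only says that the preset distribution may be Gaussian and does not say how conformity is checked; this sketch substitutes a Jarque-Bera statistic compared against the 5% critical value of a chi-squared distribution with 2 degrees of freedom (5.991), which is an approximation that is rough for small samples. The behaviour when the sparsity is not below the threshold is not specified in the source, so the sketch returns `None` in that case. All names and sample values are hypothetical.

```python
CHI2_CRITICAL_DF2_5PCT = 5.991  # 5% critical value, chi-squared with 2 d.o.f.


def jarque_bera(values):
    """Jarque-Bera statistic: n/6 * (skewness^2 + (kurtosis - 3)^2 / 4)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    if m2 == 0:
        return float("inf")  # assumption: a constant column is treated as non-Gaussian
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


def derive_first_dataset(initial, sparsity_value, threshold):
    """initial: dict mapping each initial field to its list of values."""
    if sparsity_value >= threshold:
        return None  # this branch is not specified in the source
    non_gaussian = {
        field for field, values in initial.items()
        if jarque_bera(values) > CHI2_CRITICAL_DF2_5PCT
    }
    if non_gaussian:
        # Step 304: delete the data of the fields that fail the distribution check.
        return {f: v for f, v in initial.items() if f not in non_gaussian}
    # Step 305: the initial financial data set is used as the first financial data set.
    return dict(initial)


# Hypothetical data: one roughly symmetric field and one heavily skewed field.
initial_dataset = {
    "symmetric_field": [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2],
    "skewed_field": [0, 0, 0, 0, 0, 0, 0, 0, 0, 100],
}
first_dataset = derive_first_dataset(initial_dataset, sparsity_value=2, threshold=5)
```

In this sample the skewed field fails the check and is removed, so `first_dataset` keeps only `"symmetric_field"`. A library routine for normality testing could be used in place of the hand-written statistic.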

Referring to fig. 4, fig. 4 is a schematic flowchart of a training method for a risk prediction model according to an embodiment of the present application. The method for training the risk prediction model may be applied to an electronic device, wherein, as shown in fig. 4, the training the risk prediction model using the second financial data set, where the second financial data set includes a plurality of pieces of second financial data associated with each of a plurality of second fields, includes:

401. Vectorizing, for the second financial data set, the plurality of pieces of second financial data associated with each second field to obtain a plurality of fourth vectors.

402. Obtaining a vector corresponding to a preset field.

The vector corresponding to the preset field may be a negative vector or a positive vector. Further, when the initial financial data set meets a first preset strategy, a vector corresponding to the preset field is a negative vector; and when the initial financial data set meets a second preset strategy, the vector corresponding to the preset field is a positive vector.

It is understood that the first preset policy and the second preset policy may be set by an administrator or may be configured in the electronic device. The preset field may be set by an administrator or may be configured in the electronic device.

403. Determining the distance between each fourth vector in the plurality of fourth vectors and the vector corresponding to the preset field.

404. Determining a third financial data set from the second financial data set based on the distance.

Optionally, the distance includes a distance between a fifth vector and the vector corresponding to the preset field, where the fifth vector is any one of the plurality of fourth vectors. In this case, determining the third financial data set from the second financial data set according to the distance includes:

if the distance between the fifth vector and the vector corresponding to the preset field is higher than a preset distance, retaining the plurality of pieces of second financial data corresponding to the fifth vector to obtain the third financial data set;

and if the distance between the fifth vector and the vector corresponding to the preset field is lower than the preset distance, deleting the plurality of pieces of second financial data corresponding to the fifth vector to obtain the third financial data set.

The preset distance may be set by an administrator or may be configured in the electronic device.

It can be seen that, in the above technical solution, the third financial data set is determined based on the distance, so that the data for training the risk prediction model is reduced again, which shortens the training period of the risk prediction model and reduces the training complexity.
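Steps 403 and 404 can be sketched as follows. The source does not name the distance measure, so Euclidean distance is an assumption, as is the requirement that every fourth vector has the same length as the vector of the preset field. A distance exactly equal to the preset distance is not covered by the source; in this sketch such a field is deleted. The function names and sample values are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def select_by_distance(fourth_vectors, preset_vector, preset_distance):
    """Keep the fields whose vector is farther than preset_distance from preset_vector."""
    return {
        field: vector
        for field, vector in fourth_vectors.items()
        if euclidean(vector, preset_vector) > preset_distance
    }


# Hypothetical fourth vectors and preset-field vector.
fourth_vectors = {
    "a": [0.0, 0.0],
    "b": [3.0, 4.0],
    "c": [1.0, 1.0],
}
preset_vector = [0.0, 0.0]
third_vectors = select_by_distance(fourth_vectors, preset_vector, preset_distance=2.0)
```

Only field `"b"`, at distance 5 from the preset vector, exceeds the preset distance of 2, so the third financial data set in this sample keeps the data of `"b"` alone.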

405. Training the risk prediction model using the third financial dataset.

Therefore, in the technical scheme, the third financial data set is determined from the second financial data set, and the data for training the risk prediction model is reduced again, so that the training period of the risk prediction model is shortened, and the training complexity is also reduced.

Referring to fig. 5, fig. 5 is a schematic diagram of a training apparatus for a risk prediction model according to an embodiment of the present application. As shown in fig. 5, a training apparatus 500 for a risk prediction model provided in an embodiment of the present application may include:

a processing module 501, configured to: obtain a first financial data set, where the first financial data set includes M pieces of first financial data corresponding to a plurality of first fields, the plurality of first fields include a first field A and a first field B, the first field A is associated with X pieces of first financial data, the first field B is associated with Y pieces of first financial data, M = X + Y, and M, X, and Y are integers greater than 1; vectorize, for the first financial data set, the plurality of pieces of first financial data associated with each of the plurality of first fields to obtain a plurality of first vectors; determine the correlation between every two first vectors in the plurality of first vectors by adopting a preset feature selection algorithm; determine a second financial data set from the first financial data set according to the correlation between every two first vectors; and train a risk prediction model using the second financial data set.

Optionally, when acquiring the first financial data set, the processing module 501 is configured to: acquire an initial financial data set from at least one blockchain, where the initial financial data set includes N pieces of initial financial data corresponding to a plurality of initial fields, the plurality of initial fields include an initial field A and an initial field B, the initial field A is associated with S pieces of initial financial data, the initial field B is associated with T pieces of initial financial data, N = S + T, and N, S, and T are integers greater than 1; determine a sparsity of the initial financial data set; if the sparsity is less than a threshold, determine, for the initial financial data set, whether the plurality of pieces of initial financial data associated with at least one of the plurality of initial fields do not satisfy a preset distribution; if so, delete, from the initial financial data set, the plurality of pieces of initial financial data associated with the at least one initial field to obtain a remaining initial financial data set, and determine the remaining initial financial data set as the first financial data set; if not, determine the initial financial data set as the first financial data set.

Optionally, when determining the sparsity of the initial financial data set, the processing module 501 is configured to construct a matrix according to the initial financial data set, where a column of elements in the matrix corresponds to multiple pieces of initial financial data associated with one of the multiple initial fields; determining the number of sparse elements of each column of elements in the matrix, wherein initial data corresponding to the sparse elements are zero; and determining the sparsity corresponding to the matrix according to the number of sparse elements of each column of elements in the matrix.

Optionally, the correlation between every two first vectors includes a correlation between a second vector and a third vector, where the second vector is any one of the multiple first vectors, and the third vector is any one of the multiple first vectors except the second vector, and when a second financial data set is determined from the first financial data set according to the correlation between every two first vectors, the processing module 501 is configured to, if the correlation between the second vector and the third vector is higher than a preset correlation, retain the multiple pieces of first financial data corresponding to the second vector, and delete the multiple pieces of first financial data corresponding to the third vector, so as to obtain the second financial data set; or deleting the plurality of pieces of first financial data corresponding to the second vector, and reserving the plurality of pieces of first financial data corresponding to the third vector to obtain the second financial data set.

Optionally, when the risk prediction model is trained by using the second financial data set, where the second financial data set includes a plurality of pieces of second financial data associated with each of a plurality of second fields, the processing module 501 is configured to vectorize, for the second financial data set, the plurality of pieces of second financial data associated with each of the plurality of second fields to obtain a plurality of fourth vectors; obtaining a vector corresponding to a preset field; determining a distance between each fourth vector in the plurality of fourth vectors and the vector corresponding to the preset field; determining a third financial data set from the second financial data set based on the distance; training the risk prediction model using the third financial dataset.

Optionally, the distance includes a distance between a fifth vector and the vector corresponding to the preset field, where the fifth vector is any one of the plurality of fourth vectors. When determining a third financial data set from the second financial data set according to the distance, the processing module 501 is configured to, if the distance between the fifth vector and the vector corresponding to the preset field is higher than a preset distance, retain the plurality of pieces of second financial data corresponding to the fifth vector to obtain the third financial data set.
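The distance-based screening can be sketched as follows. The passage does not name a distance metric, so the sketch assumes the Euclidean distance (`numpy.linalg.norm`). It follows the condition as stated, retaining a field when its distance is higher than the preset distance. The passage does not say what happens to the other fields; the sketch simply leaves them out, which is an assumption. All names and values are hypothetical.

```python
from typing import Dict

import numpy as np


def select_by_distance(
    fourth_vectors: Dict[str, np.ndarray],
    preset_field_vector: np.ndarray,
    preset_distance: float,
) -> Dict[str, np.ndarray]:
    """Retain fields whose distance to the preset-field vector exceeds the preset distance.

    Assumptions: Euclidean distance; fields that do not exceed the preset
    distance are omitted from the result.
    """
    return {
        name: vector
        for name, vector in fourth_vectors.items()
        if np.linalg.norm(vector - preset_field_vector) > preset_distance
    }


# Hypothetical vector for the preset field and two fourth vectors.
preset_field_vector = np.array([0.0, 1.0, 0.0, 1.0])
fourth_vectors = {
    "field_a": np.array([0.0, 1.0, 0.0, 1.0]),  # distance 0 to the preset vector
    "field_b": np.array([3.0, 4.0, 3.0, 4.0]),  # distance 6 to the preset vector
}
third_vectors = select_by_distance(fourth_vectors, preset_field_vector, preset_distance=1.0)
```

With a preset distance of 1.0, only `field_b` is retained.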

Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present application.

An embodiment of the application provides an electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the one or more programs comprise instructions for performing the steps of any one of the risk prediction model training methods described above. As shown in fig. 6, an electronic device of a hardware operating environment according to an embodiment of the present application may include:

a processor 601, such as a CPU.

A memory 602, which may be a high-speed RAM or a non-volatile memory such as a disk memory.

A communication interface 603 for implementing connection communication between the processor 601 and the memory 602.

Those skilled in the art will appreciate that the configuration of the electronic device shown in fig. 6 is not intended to be limiting and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.

As shown in fig. 6, the memory 602 may include an operating system, a network communication module, and one or more programs. The operating system is a program that manages and controls the hardware and software resources of the electronic device and supports the execution of the one or more programs. The network communication module is used for communication among the components in the memory 602 and with other hardware and software in the electronic device.

In the electronic device shown in fig. 6, the processor 601 is configured to execute one or more programs in the memory 602, and implement the following steps: acquiring a first financial data set, wherein the first financial data set comprises M pieces of first financial data corresponding to a plurality of first fields, the plurality of first fields comprise a first field A and a first field B, the first field A is associated with X pieces of first financial data, the first field B is associated with Y pieces of first financial data, and M is X + Y, wherein M, X and Y are integers greater than 1; vectorizing, for the first financial data set, the plurality of pieces of first financial data associated with each of the plurality of first fields to obtain a plurality of first vectors; determining the correlation between every two first vectors in the plurality of first vectors by adopting a preset feature selection algorithm; determining a second financial data set from the first financial data set according to the correlation between each two first vectors; training a risk prediction model using the second financial data set.

For specific implementation of the electronic device related to the present application, reference may be made to various embodiments of the risk prediction model training method, which are not described herein again.
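To make the relationship between fields, pieces of data and first vectors concrete, the following sketch groups M pieces of first financial data by field and builds one vector per field. The text does not define the vectorization method here, so treating each field's numeric values as a vector is an assumption, and the record layout `(field, value)` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

import numpy as np


def vectorize_by_field(records: List[Tuple[str, float]]) -> Dict[str, np.ndarray]:
    """Group (field, value) records by field and return one vector per field.

    Assumption: each piece of financial data is a single numeric value.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for field, value in records:
        grouped[field].append(value)
    return {field: np.asarray(values, dtype=float) for field, values in grouped.items()}


# Hypothetical first financial data set: X = 3 pieces for field A, Y = 3 for field B.
records = [
    ("A", 1.0), ("A", 2.0), ("A", 3.0),
    ("B", 0.5), ("B", 0.7), ("B", 0.9),
]
first_vectors = vectorize_by_field(records)
total_pieces = sum(len(vector) for vector in first_vectors.values())  # M = X + Y
```

Each resulting vector can then be passed to the correlation step, and `total_pieces` equals the number of input records, matching M = X + Y.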

The present application further provides a computer-readable storage medium for storing a computer program, the stored computer program being executable by a processor to perform the following steps: acquiring a first financial data set, wherein the first financial data set comprises M pieces of first financial data corresponding to a plurality of first fields, the plurality of first fields comprise a first field A and a first field B, the first field A is associated with X pieces of first financial data, the first field B is associated with Y pieces of first financial data, and M = X + Y, wherein M, X and Y are integers greater than 1; vectorizing, for the first financial data set, the plurality of pieces of first financial data associated with each of the plurality of first fields to obtain a plurality of first vectors; determining the correlation between every two first vectors in the plurality of first vectors by adopting a preset feature selection algorithm; determining a second financial data set from the first financial data set according to the correlation between every two first vectors; training a risk prediction model using the second financial data set.

For specific implementation of the computer-readable storage medium related to the present application, reference may be made to the embodiments of the risk prediction model training method, which are not described herein again.

The computer readable storage medium may be non-volatile or volatile.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that the acts and modules involved are not necessarily required for this application.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for training a risk prediction model, comprising:

training a risk prediction model using the second financial data set.

2. The method of claim 1, wherein said obtaining a first set of financial data comprises:

acquiring an initial financial data set from at least one blockchain, wherein the initial financial data set comprises N pieces of initial financial data corresponding to a plurality of initial fields, the initial fields comprise initial fields A and initial fields B, the initial fields A are associated with S pieces of initial financial data, the initial fields B are associated with T pieces of initial financial data, N is S + T, and N, S and T are integers greater than 1;

determining a sparsity of the initial set of financial data;

if the sparsity is less than a threshold, determining, for the initial financial data set, whether the plurality of pieces of initial financial data associated with at least one initial field in the plurality of initial fields do not satisfy a preset distribution;

if so, deleting, from the initial financial data set, the plurality of pieces of initial financial data associated with the at least one initial field to obtain a remaining initial financial data set, and determining the remaining initial financial data set as the first financial data set;

if not, determining the initial financial data set as the first financial data set.

3. The method of claim 2, wherein said determining sparsity of said initial set of financial data comprises:

constructing a matrix from the initial set of financial data, a column of elements in the matrix corresponding to a plurality of pieces of initial financial data associated with one of the plurality of initial fields;

determining the number of sparse elements of each column of elements in the matrix, wherein initial data corresponding to the sparse elements are zero;

and determining the sparsity corresponding to the matrix according to the number of sparse elements of each column of elements in the matrix.

4. The method of any one of claims 1-3, wherein the correlation between every two first vectors comprises a correlation between a second vector and a third vector, the second vector being any one of the plurality of first vectors, and the third vector being any one of the plurality of first vectors except the second vector, and wherein the determining a second financial data set from the first financial data set according to the correlation between every two first vectors comprises:

5. The method of claim 1, wherein training a risk prediction model using the second set of financial data, the second set of financial data including a plurality of pieces of second financial data associated with each of a plurality of second fields, comprises:

vectorizing, for the second financial data set, the plurality of pieces of second financial data associated with each second field to obtain a plurality of fourth vectors;

obtaining a vector corresponding to a preset field;

determining a distance between each fourth vector in the plurality of fourth vectors and the vector corresponding to the preset field;

determining a third financial data set from the second financial data set based on the distance;

training the risk prediction model using the third financial dataset.

6. The method of claim 5, wherein the distance comprises a distance between a fifth vector and the vector corresponding to the preset field, the fifth vector being any one of the plurality of fourth vectors, and wherein the determining a third financial data set from the second financial data set according to the distance comprises:

7. A training device for a risk prediction model, comprising:

8. The apparatus of claim 7, wherein the processing module, in acquiring the first set of financial data, is configured to

determining a sparsity of the initial set of financial data;

if so, deleting, from the initial financial data set, the plurality of pieces of initial financial data associated with the at least one initial field to obtain a remaining initial financial data set; determining the remaining initial financial data set as the first financial data set;

9. An electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the one or more programs comprising instructions for performing the steps of the method of any of claims 1-6.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program, and the computer program is executed by a processor to implement the method of any of claims 1-6.