CN112905690A

CN112905690A - Financial time sequence data mining method and system based on hypergraph

Info

Publication number: CN112905690A
Application number: CN202110356371.3A
Authority: CN
Inventors: 鲁多; 李荣华; 高玉金; 秦宏超; 王国仁; 金福生
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2021-04-01
Filing date: 2021-04-01
Publication date: 2021-06-04

Abstract

The invention discloses a hypergraph-based financial time series data mining method and system, relating to the technical field of financial data processing. The method comprises: obtaining financial time series data, the financial time series data including a plurality of users and a plurality of transaction relationships, each transaction relationship being an interaction between two or more users; constructing a financial hypergraph knowledge graph using hypergraph theory and the financial time series data; and mining the financial hypergraph knowledge graph using a frequent subgraph mining algorithm. The invention can represent financial time series data accurately and improve mining precision.

Description

Financial time sequence data mining method and system based on hypergraph

Technical Field

The invention relates to the technical field of financial data processing, in particular to a financial time sequence data mining method and system based on a hypergraph.

Background

Finance is a core component of national competitiveness, and preventing and controlling financial risk is a current national strategic need. At present, financial time series data are mined mainly to find feature vectors that mark financial risk. In such data a transaction relationship often involves several parties, yet in practice the data are represented with binary relationships. A binary relationship cannot represent such data accurately, which in turn risks poor results in subsequent mining.

Disclosure of Invention

The invention aims to provide a financial time sequence data mining method and system based on a hypergraph, so as to achieve the purposes of accurately expressing financial time sequence data and improving mining precision.

In order to achieve the purpose, the invention provides the following scheme:

A hypergraph-based financial time series data mining method, comprising:

acquiring financial time series data; the financial time series data includes a plurality of users and a plurality of transaction relationships; each transaction relationship is an interaction between two or more users;

constructing a financial hypergraph knowledge graph by using hypergraph theory and the financial time series data; the nodes of the financial hypergraph knowledge graph are the users, the hyperedges of the financial hypergraph knowledge graph are the transaction relationships, and each hyperedge comprises two or more nodes;

and mining the financial hypergraph knowledge graph by using a frequent subgraph mining algorithm.

Optionally, after mining the financial hypergraph knowledge graph by using the frequent subgraph mining algorithm, the method further includes:

filtering the frequent sub-hypergraphs to obtain feature vectors that mark financial risk; a frequent sub-hypergraph is a subgraph obtained by mining the financial hypergraph knowledge graph with the frequent subgraph mining algorithm.

Optionally, before mining the financial hypergraph knowledge graph by using the frequent subgraph mining algorithm, the method further includes:

removing low-frequency hyperedges and low-frequency nodes from the financial hypergraph knowledge graph by using a pruning algorithm to obtain an updated financial hypergraph knowledge graph.

Optionally, the mining of the financial hypergraph knowledge graph by using a frequent subgraph mining algorithm specifically includes:

mining the updated financial hypergraph knowledge graph by using an improved gSpan algorithm.

Optionally, the mining of the updated financial hypergraph knowledge graph by using the improved gSpan algorithm specifically includes:

sorting the hyperedges in the updated financial hypergraph knowledge graph from high to low by hyperedge frequency to obtain an ordered hyperedge set;

performing recursive mining on the hyperedges in the ordered hyperedge set;

wherein the recursive mining process comprises the following steps:

judging whether the current DFS code of the current-stage sub-hypergraph is the minimum DFS code of the current-stage sub-hypergraph, and if so, performing rightmost path expansion on the current-stage sub-hypergraph; the current DFS code is the expansion order of the previous-stage sub-hypergraph; the sub-hypergraph is the hypergraph obtained after the hyperedges in the ordered hyperedge set are expanded;

calculating the support of the sub-hypergraph obtained by the rightmost path expansion;

and judging whether the support is greater than or equal to a first set threshold, and if so, adding the rightmost-path-expanded sub-hypergraph to the frequent sub-hypergraph set, updating the current-stage sub-hypergraph to the rightmost-path-expanded sub-hypergraph, and returning to the step of judging whether the current DFS code of the current-stage sub-hypergraph is the minimum DFS code of the current-stage sub-hypergraph.

Optionally, the method further includes:

when the current DFS code of the current-stage sub-hypergraph is not the minimum DFS code of the current-stage sub-hypergraph, stopping rightmost path expansion of the current-stage sub-hypergraph, and performing recursive mining on the hyperedges in the ordered hyperedge set that have not yet been mined, until all hyperedges in the ordered hyperedge set have been mined.

Optionally, the method further includes:

when the support is smaller than the first set threshold, stopping expansion of the rightmost-path-expanded sub-hypergraph, performing another form of rightmost path expansion on the current-stage sub-hypergraph, and returning to the step of calculating the support of the rightmost-path-expanded sub-hypergraph, until all forms of rightmost path expansion have been performed on the current-stage sub-hypergraph.

Optionally, before expanding a hyperedge in the ordered hyperedge set, the method further includes:

calculating the variance of the number of nodes in the hyperedges;

judging whether the variance is smaller than a second set threshold;

if yes, expanding the hyperedge;

if not, truncating the hyperedge and returning to the step of calculating the variance of the number of nodes in the hyperedges.

Optionally, the calculating of the support of the rightmost-path-expanded sub-hypergraph specifically includes:

segmenting the updated financial hypergraph knowledge graph to obtain a plurality of financial knowledge sub-hypergraphs;

calculating the MNI support of the rightmost-path-expanded sub-hypergraph in each financial knowledge sub-hypergraph;

and sorting all the MNI supports, and taking the minimum MNI support as the support of the rightmost-path-expanded sub-hypergraph.

A hypergraph-based financial timing data mining system, comprising:

the acquisition module is used for acquiring financial time sequence data; the financial timing data includes a plurality of users and a plurality of transaction relationships; the transaction relationship is an interactive relationship between two or more users;

the construction module is used for constructing a financial hypergraph knowledge graph by utilizing a hypergraph theory and the financial time sequence data; the nodes of the financial hypergraph knowledge graph are the users, the hyperedges of the financial hypergraph knowledge graph are the transaction relations, and the hyperedges comprise two or more nodes;

and the mining module is used for mining the financial hypergraph knowledge graph by using a frequent subgraph mining algorithm.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

By constructing the financial knowledge graph as a hypergraph, the invention effectively improves expressive power, allows the multi-party relationships in financial time series data to be fully represented, and effectively improves the mining results.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of rightmost path expansion;

FIG. 2 is a diagram of an example of MNI support;

FIG. 3 is a schematic diagram of a multi-party guarantee relationship in an ordinary knowledge graph;

FIG. 4 is a schematic diagram of a multi-party guarantee relationship in a hypergraph knowledge graph;

FIG. 5 is a schematic flow chart of a financial timing data mining method based on hypergraphs according to the present invention;

FIG. 6 is a schematic diagram of a financial timing data mining system based on hypergraphs according to the present invention;

FIG. 7 is a flow chart of the financial rule mining method based on hypergraph knowledge-graph of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Interpretation of terms:

Knowledge graph: a graph-based data structure comprising nodes and edges. Each node represents an entity and each edge represents a relationship between entities. A knowledge graph is a semantic network, that is, a labeled graph.

Hypergraph: a hypergraph is denoted G(V, E), where V is the node set and E is the hyperedge set. A hyperedge is an edge that connects two or more nodes.

Graph stream: a time-ordered stream of graphs, in which each time point corresponds to one graph. The graphs within a period of time are collectively called a graph stream.

Frequent subgraph mining algorithm: an algorithm for mining the subgraphs that appear frequently in a graph.

Financial time series data: data characterized by multi-frequency timing, high heterogeneity, and sharp-peaked, fat-tailed distributions.
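The terms above can be illustrated with a minimal sketch. It assumes that each hyperedge carries a label, a set of two or more users, and an integer timestamp; the names `Hyperedge` and `split_into_stream`, the window size, and the sample data are hypothetical and do not come from the invention.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    label: str        # relationship type, e.g. "guarantee" (assumed field)
    nodes: frozenset  # the two or more users taking part
    time: int         # timestamp of the transaction (assumed to be an integer)


def split_into_stream(hyperedges, window):
    """Group hyperedges into consecutive time windows, one hypergraph per window."""
    stream = defaultdict(list)
    for edge in hyperedges:
        stream[edge.time // window].append(edge)
    return [stream[key] for key in sorted(stream)]


edges = [
    Hyperedge("guarantee", frozenset({"a", "b", "c", "d"}), time=1),
    Hyperedge("transfer", frozenset({"a", "b"}), time=2),
    Hyperedge("transfer", frozenset({"c", "d"}), time=11),
]

# V is the union of all hyperedge node sets; E is the list of hyperedges.
node_set = set().union(*(edge.nodes for edge in edges))
# With a window of 10 time units the three hyperedges fall into two graphs.
stream = split_into_stream(edges, window=10)
```

Storing the node set as a `frozenset` keeps a hyperedge hashable, so hyperedges can be counted or used as dictionary keys in later steps.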

Example one

In financial time series data, a single transaction relationship often involves multiple parties rather than a simple binary relationship; a multi-party guarantee is one example. A traditional knowledge graph can represent only binary relationships, so the multi-party relationships in financial time series data are difficult to represent with a traditional knowledge graph structure.

The common technical approach today is to mine a traditional financial knowledge graph. All multi-party relationships are first converted into binary relationships, a financial knowledge graph is constructed, and rules are then mined with the gSpan algorithm, a classical algorithm for frequent subgraph mining on static graphs.

The steps of the gSpan algorithm are as follows:

1. Preprocessing: prune the low-frequency edges and low-frequency nodes in the financial knowledge graph and relabel the remaining edges.

2. Sort the remaining edges from high to low by frequency.

3. Following the sorted order, select an unexpanded high-frequency edge and expand it recursively.

3.1 Judge whether the minimum DFS code is satisfied; if not, go to 3.

3.2 Perform rightmost path expansion on the edge.

3.3 Judge whether the expanded subgraph satisfies the support threshold; if so, return to 3.1 for recursive expansion; otherwise, return to 3.2.
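The control flow of steps 3.1 to 3.3 can be sketched as a generic pattern-growth recursion. This is only an illustration of the loop structure under stated assumptions: the function `grow` and its callback parameters are hypothetical, and the toy example mines frequent item sets as a stand-in, so it contains no DFS codes and no graph matching.

```python
def grow(pattern, is_canonical, extensions, support, min_support, found):
    """Recursive pattern growth following the shape of steps 3.1 to 3.3."""
    # 3.1: a non-canonical pattern was already reached by another route.
    if not is_canonical(pattern):
        return
    # 3.2: try each possible extension of the current pattern in turn.
    for child in extensions(pattern):
        # 3.3: keep and recurse only on extensions that are frequent enough;
        # otherwise fall through to the next extension.
        if support(child) >= min_support:
            found.append(child)
            grow(child, is_canonical, extensions, support, min_support, found)


# Toy stand-in: patterns are tuples of items, the data are item sets.
transactions = [{"x", "y", "z"}, {"x", "y"}, {"x", "z"}, {"y", "z"}]
items = sorted(set().union(*transactions))


def is_canonical(pattern):
    return list(pattern) == sorted(pattern)


def extensions(pattern):
    return [pattern + (i,) for i in items if not pattern or i > pattern[-1]]


def support(pattern):
    return sum(1 for t in transactions if set(pattern) <= t)


found = []
grow((), is_canonical, extensions, support, min_support=2, found=found)
# found now holds every item tuple contained in at least two transactions.
```

In a graph setting, `is_canonical` would be the minimum DFS code test, `extensions` would enumerate rightmost path expansions, and `support` would be the MNI support.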

A DFS code is an ordering of edges. For a given knowledge graph, every depth-first traversal order corresponds to one DFS code. If the labels on the edges can be compared, then for a given knowledge graph and a given lexicographic order there exists a unique minimum DFS code. The minimum DFS code encodes a subgraph uniquely and so prevents the same subgraph from being expanded repeatedly. If a candidate subgraph does not satisfy the minimum DFS code, the subgraph has already been expanded, and its expansion is stopped.

As shown in FIG. 1, rightmost path expansion works as follows. For a given traversal order, the nodes are numbered in that order, with the starting point denoted v₀ and the end point denoted vₙ; the shortest path from the starting point to the end point is called the rightmost path. Rightmost path expansion takes two forms. One is forward edge expansion, which extends from the end point to other nodes on the rightmost path, with a priority that follows the traversal order of the nodes. The other is backward edge expansion, which extends edges outward from the rightmost path, with a priority opposite to the traversal order of the nodes. Rightmost path expansion is applied recursively to the frequent subgraphs until no new frequent subgraph is generated.

The support is usually the MNI support. To compute it, the subgraphs isomorphic to the pattern are first found in the graph to be matched; the corresponding nodes in each isomorphic subgraph are called the images of the pattern nodes in the graph. The MNI support is the minimum, over all pattern nodes, of the number of distinct graph nodes onto which that pattern node is mapped. FIG. 2 is a diagram of an example of MNI support; the MNI support in FIG. 2 is 2.
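The definition above can be sketched in a few lines, assuming the isomorphic matches have already been found and are given as dictionaries from pattern node to graph node. The function name `mni_support` and the sample embeddings are hypothetical; they are not the data of FIG. 2.

```python
def mni_support(pattern_nodes, embeddings):
    """Minimum number of distinct graph nodes that any pattern node maps to.

    embeddings: list of dicts, each mapping every pattern node to a graph node.
    """
    if not embeddings or not pattern_nodes:
        return 0
    images = {p: set() for p in pattern_nodes}
    for embedding in embeddings:
        for p in pattern_nodes:
            images[p].add(embedding[p])
    return min(len(nodes) for nodes in images.values())


# Three matches of a two-node pattern (u, v) in some graph.
embeddings = [
    {"u": 1, "v": 2},
    {"u": 1, "v": 3},
    {"u": 4, "v": 3},
]
# u maps to {1, 4} and v maps to {2, 3}, so the MNI support is 2.
result = mni_support(["u", "v"], embeddings)
```

Counting distinct images per pattern node, rather than counting matches, is what keeps the measure from growing when many matches overlap on the same nodes.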

FIG. 3 is a schematic diagram of a multi-party guarantee relationship in an ordinary knowledge graph. As shown in FIG. 3, three persons A, B, and C provide a guarantee for the house. FIG. 3 can show only that each of A, B, and C provides a guarantee for the house; it cannot express that the three provide the guarantee jointly. FIG. 4 is a schematic diagram of a multi-party guarantee relationship in a hypergraph knowledge graph. As shown in FIG. 4, a single hyperedge containing four nodes represents the multi-party guarantee relationship more faithfully. Because an ordinary knowledge graph has this limitation, it cannot express multi-party relationships accurately, and the final mining result is poor.
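The loss of information described above can be shown with a small sketch. It assumes a hyperedge is stored as a set of node names and that the binary view keeps one guarantor-to-target pair per participant; the function `to_binary_edges` and the node names are hypothetical.

```python
def to_binary_edges(hyperedges, target):
    """Flatten each hyperedge into (guarantor, target) pairs, i.e. binary relations."""
    pairs = set()
    for nodes in hyperedges:
        for node in nodes:
            if node != target:
                pairs.add((node, target))
    return pairs


# One joint guarantee by a, b, and c (a single four-node hyperedge) ...
joint = [frozenset({"a", "b", "c", "house"})]
# ... versus three independent guarantees.
separate = [
    frozenset({"a", "house"}),
    frozenset({"b", "house"}),
    frozenset({"c", "house"}),
]

# Both situations collapse to the same binary edges,
# while the hypergraph representation keeps them apart.
same_binary_view = to_binary_edges(joint, "house") == to_binary_edges(separate, "house")
same_hypergraph = set(joint) == set(separate)
```

Here `same_binary_view` is true and `same_hypergraph` is false, which is the distinction between FIG. 3 and FIG. 4.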

In addition, the conventional MNI support can be calculated only on a single large graph. It cannot handle a time-ordered graph stream well: the graphs of the stream have to be merged into one graph before the calculation, which reduces the accuracy of the result.

In view of this, the present embodiment provides a financial time series data mining method based on a hypergraph, as shown in fig. 5, including:

Step 101: acquiring financial time series data; the financial time series data includes a plurality of users and a plurality of transaction relationships; each transaction relationship is an interaction between two or more users.

Step 102: constructing a financial hypergraph knowledge graph by using hypergraph theory and the financial time series data; the nodes of the financial hypergraph knowledge graph are the users, the hyperedges of the financial hypergraph knowledge graph are the transaction relationships, and each hyperedge comprises two or more nodes.

Step 103: mining the financial hypergraph knowledge graph by using a frequent subgraph mining algorithm.

Preferably, after step 103 is executed, the method further includes: filtering the frequent sub-hypergraphs to obtain feature vectors that mark financial risk; a frequent sub-hypergraph is a subgraph obtained by mining the financial hypergraph knowledge graph with the frequent subgraph mining algorithm.

Preferably, before step 103 is executed, the method further includes: removing low-frequency hyperedges and low-frequency nodes from the financial hypergraph knowledge graph by using a pruning algorithm to obtain an updated financial hypergraph knowledge graph.

Preferably, step 103 comprises: mining the updated financial hypergraph knowledge graph by using an improved gSpan algorithm. Specifically: sorting the hyperedges in the updated financial hypergraph knowledge graph from high to low by hyperedge frequency to obtain an ordered hyperedge set; and performing recursive mining on the hyperedges in the ordered hyperedge set.

Wherein the recursive mining process comprises the following steps:

Judging whether the current DFS code of the current-stage sub-hypergraph is the minimum DFS code of the current-stage sub-hypergraph, and if so, performing rightmost path expansion on the current-stage sub-hypergraph; the current DFS code is the expansion order of the previous-stage sub-hypergraph; the sub-hypergraph is the hypergraph obtained after the hyperedges in the ordered hyperedge set are expanded.

Calculating the support of the rightmost-path-expanded sub-hypergraph. Specifically: segmenting the updated financial hypergraph knowledge graph to obtain a plurality of financial knowledge sub-hypergraphs; calculating the MNI support of the rightmost-path-expanded sub-hypergraph in each financial knowledge sub-hypergraph; and sorting all the MNI supports and taking the minimum MNI support as the support of the rightmost-path-expanded sub-hypergraph.
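The support calculation over a segmented graph can be sketched as follows, assuming that the matches of the pattern have already been found separately in each segment. The names `mni_support` and `stream_support` and the sample matches are hypothetical.

```python
def mni_support(pattern_nodes, embeddings):
    """MNI support of a pattern in one graph, given its matches in that graph."""
    if not embeddings or not pattern_nodes:
        return 0
    images = {p: set() for p in pattern_nodes}
    for embedding in embeddings:
        for p in pattern_nodes:
            images[p].add(embedding[p])
    return min(len(nodes) for nodes in images.values())


def stream_support(pattern_nodes, embeddings_per_graph):
    """Support over a segmented graph: the smallest per-segment MNI support."""
    per_graph = [mni_support(pattern_nodes, e) for e in embeddings_per_graph]
    return min(per_graph, default=0)


# Matches of a two-node pattern in two time segments (illustrative data).
segment_1 = [{"u": 1, "v": 2}, {"u": 3, "v": 4}]  # MNI support 2
segment_2 = [{"u": 5, "v": 6}]                    # MNI support 1
support = stream_support(["u", "v"], [segment_1, segment_2])
```

Because the minimum is taken, a pattern counts as frequent only if it reaches the threshold in every segment; with a single segment the result equals the ordinary MNI support.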

In addition, before expanding a hyperedge in the ordered hyperedge set, the method further comprises: calculating the variance of the number of nodes in the hyperedges; judging whether the variance is smaller than a second set threshold; if yes, expanding the hyperedge; if not, truncating the hyperedge and returning to the step of calculating the variance of the number of nodes in the hyperedges.
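The variance check can be sketched as below. This is a simplified single pass rather than the repeated check described above, and it assumes population variance, nodes ordered from most to least frequent, and a fixed truncation length; `encode_hyperedges`, `variance_threshold`, and `max_len` are hypothetical names and parameters.

```python
from collections import Counter
from statistics import pvariance


def encode_hyperedges(hyperedges, variance_threshold, max_len):
    """Turn each hyperedge into a tuple of nodes, truncating when sizes vary widely."""
    freq = Counter(node for edge in hyperedges for node in edge)
    sizes = [len(edge) for edge in hyperedges]
    # Truncate only when the variance of hyperedge sizes reaches the threshold.
    truncate = pvariance(sizes) >= variance_threshold
    codes = []
    for edge in hyperedges:
        # Most frequent nodes first; ties are broken by node name.
        ordered = sorted(edge, key=lambda node: (-freq[node], node))
        codes.append(tuple(ordered[:max_len] if truncate else ordered))
    return codes


hyperedges = [{"a", "b"}, {"a", "c"}, {"a", "b", "c", "d", "e", "f"}]
# Sizes are 2, 2, and 6, so the variance (32/9) exceeds a threshold of 1.0.
truncated = encode_hyperedges(hyperedges, variance_threshold=1.0, max_len=3)
# With a threshold of 10.0 the hyperedges are kept whole.
full = encode_hyperedges(hyperedges, variance_threshold=10.0, max_len=3)
```

Ordering nodes by frequency before truncating means that the nodes dropped from a long hyperedge are the least frequent ones.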

Example two

In order to achieve the above object, this embodiment further provides a financial time series data mining system based on a hypergraph, as shown in fig. 6, including:

An obtaining module 201, configured to obtain financial time series data; the financial time series data includes a plurality of users and a plurality of transaction relationships; each transaction relationship is an interaction between two or more users.

A construction module 202, configured to construct a financial hypergraph knowledge graph by using hypergraph theory and the financial time series data; the nodes of the financial hypergraph knowledge graph are the users, the hyperedges of the financial hypergraph knowledge graph are the transaction relationships, and each hyperedge comprises two or more nodes.

A mining module 203, configured to mine the financial hypergraph knowledge graph by using a frequent subgraph mining algorithm.

Example three

The embodiment provides a financial rule mining method based on a hypergraph knowledge graph, and a corresponding flow chart is shown in fig. 7.

1. Constructing the financial hypergraph knowledge graph. For the financial time series data, the users are taken as nodes and the interactions (transactions) between users are taken as edges. An interaction may involve more than two users, in which case it is represented by a hyperedge. This completes the construction of the financial hypergraph knowledge graph.

2. Mining the financial hypergraph knowledge graph. The constructed financial hypergraph knowledge graph is denoted G, and rule mining is performed on it.

2.1 Counting label frequencies in G. Specifically, the number of occurrences of each label on the nodes and hyperedges of the financial hypergraph knowledge graph is counted, and the labels are sorted from high to low by frequency.

2.2 Pruning by frequency. Using the label frequency ordering from the previous step and a preset threshold, labels whose frequency is below the threshold are deleted, including node labels and hyperedge labels; that is, low-frequency nodes and low-frequency hyperedges are deleted from the financial hypergraph knowledge graph. The resulting financial hypergraph knowledge graph is denoted G1.

2.3 Relabeling. In G1, the hyperedges are relabeled by frequency so that the higher the frequency of a hyperedge, the smaller its label in lexicographic order, giving a new financial hypergraph knowledge graph G2. The hyperedges are then sorted in lexicographic order to obtain an ordered hyperedge set E.

2.4 Recursive mining of the hyperedges. Every hyperedge in the hyperedge set E is a frequent hyperedge, that is, a frequent single-edge hypergraph. The hyperedge set E is traversed, and the following recursive mining is performed on each hyperedge in E to obtain the frequent sub-hypergraphs, which are the final rules.

2.4.1 For the hypergraph to be expanded, first judge whether it satisfies the minimum DFS code; the initial frequent single-edge hypergraphs all satisfy the minimum DFS code. If it does not, stop expanding the current hypergraph, return to step 2.4, and mine the next frequent single-edge hypergraph.

A DFS code is a sequence of edges; for a given graph, every depth-first traversal order of the graph corresponds to one DFS code. If the labels on the edges can be compared, then for a given graph and a given lexicographic order there exists a unique minimum DFS code. The minimum DFS code encodes a graph uniquely and prevents the same subgraph from being expanded repeatedly. If the candidate hypergraph to be expanded does not satisfy the minimum DFS code, the hypergraph has already been expanded; its expansion is stopped, and the method returns to step 2.4 to mine the next frequent single-edge hypergraph.

To perform the check, all depth-first search trees of the hypergraph to be expanded are traversed, the minimum DFS code of the hypergraph is calculated, and this minimum DFS code is compared with the DFS code obtained by the current expansion. The current DFS code is the DFS code constructed according to the order of expansion.

In DFS coding, edges and their associated nodes need to be converted into n-tuples. Because the number of nodes in a hyperedge is not fixed, the length of each encoded hyperedge is not fixed either, and two strategies are provided to avoid large differences in encoded length. When the variance of the number of nodes per hyperedge is small, all nodes in the hyperedge are represented directly, arranged from high to low by frequency. When the variance is large, the encoding is truncated where necessary to reduce unnecessary storage: a threshold is set from the statistics of the nodes and the encoding is truncated at that threshold, with everything else handled as in the first strategy.

2.4.2 Perform rightmost path expansion on the hypergraph to be expanded that satisfies the minimum DFS code. In a hypergraph, several edges may exist between the two nodes to be connected during rightmost path expansion, so a traversal order of the edges is defined here: priority is determined by the lexicographic order of the edge labels, and the smaller the label in lexicographic order (that is, the higher its frequency), the higher its priority.

2.4.3 Calculate the support of the expanded hypergraph. If the support is greater than or equal to the threshold, the expanded hypergraph meets the requirement and expansion continues by returning to step 2.4.1; otherwise, stop expanding the expanded hypergraph, return to step 2.4.2, and perform another form of rightmost path expansion on the hypergraph.

The support used here is an MNI-based support. The pattern to be matched is the sub-hypergraph obtained so far, and matching is carried out in the whole hypergraph. Because the whole graph may be too large to load into memory at once, it can be divided by time into a plurality of graphs, with the support calculated independently on each graph. Once the whole graph has been divided, the traditional MNI support can no longer be calculated, so the MNI-based support is used instead. When the whole graph is not divided, that is, when there is only one graph, the MNI-based support and the traditional MNI support are the same, so the MNI-based support can be regarded as a generalization of MNI.

The MNI-based support is calculated as follows, taking the multi-graph case as the example; the single-graph case is a special case of it. The MNI support is calculated on each graph, and the final support is obtained from these per-graph values.

3. The rules obtained are basic rules; they are filtered by rule or by matching to obtain higher-order rules.
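Steps 2.1 to 2.3 above can be sketched as follows. The sketch assumes each hyperedge is given as a pair of an edge label and a tuple of node labels, that one threshold is shared by node and hyperedge labels, and that new labels are integers starting at 0; `prune_and_relabel`, `min_freq`, and the sample data are hypothetical.

```python
from collections import Counter


def prune_and_relabel(hyperedges, min_freq):
    """hyperedges: list of (edge_label, tuple_of_node_labels) pairs."""
    # Step 2.1: count how often each hyperedge label and node label occurs.
    edge_freq = Counter(label for label, _ in hyperedges)
    node_freq = Counter(n for _, nodes in hyperedges for n in nodes)

    # Step 2.2: drop infrequent hyperedge labels and infrequent node labels.
    pruned = []
    for label, nodes in hyperedges:
        if edge_freq[label] < min_freq:
            continue
        kept = tuple(n for n in nodes if node_freq[n] >= min_freq)
        if len(kept) >= 2:  # a hyperedge needs at least two nodes
            pruned.append((label, kept))

    # Step 2.3: the more frequent a label, the smaller its new label.
    remaining = Counter(label for label, _ in pruned)
    ranked = sorted(remaining, key=lambda old: (-remaining[old], old))
    new_label = {old: rank for rank, old in enumerate(ranked)}
    ordered = sorted((new_label[label], nodes) for label, nodes in pruned)
    return ordered, new_label


edges = [
    ("transfer", ("p", "q")),
    ("transfer", ("p", "r")),
    ("guarantee", ("p", "q", "r")),
    ("guarantee", ("q", "r", "s")),
    ("transfer", ("q", "r")),
    ("loan", ("p", "q")),
]
ordered_edges, label_map = prune_and_relabel(edges, min_freq=2)
```

In this example the `loan` label and the node `s` fall below the threshold and are removed, `transfer` receives the smallest new label, and `ordered_edges` plays the role of the ordered hyperedge set E.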

Compared with the prior art, the invention has the following technical effects:

1. The ordinary knowledge graph is improved so that it can express multi-party relationships, allowing rules to be mined more accurately.

2. The invention improves the traditional MNI support so that it can be applied to graph streams. The MNI-based support provided by the invention is combined with the classical support, so that graph streams are handled better: the MNI support can be calculated within each graph, the information in each graph is counted more accurately, and more useful information is obtained.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A financial time sequence data mining method based on hypergraph is characterized by comprising

2. The hypergraph-based financial time series data mining method according to claim 1, wherein after mining the financial knowledge hypergraph graph using a frequent subgraph mining algorithm, further comprising:

3. The hypergraph-based financial time series data mining method of claim 1, further comprising, before mining the financial hypergraph knowledge graph by using the frequent subgraph mining algorithm:

4. The hypergraph-based financial time series data mining method according to claim 3, wherein the mining the financial knowledge hypergraph graph by using a frequent subgraph mining algorithm specifically comprises:

5. The hypergraph-based financial time series data mining method according to claim 4, wherein the mining of the updated financial knowledge hypergraph atlas by using the improved gSpan algorithm specifically comprises:

performing recursive mining on the super edges in the ordered super edge set;

wherein the recursive mining process comprises the following steps:

6. The hypergraph-based financial time series data mining method of claim 5, further comprising:

7. The hypergraph-based financial time series data mining method of claim 5, further comprising:

8. The hypergraph-based financial time series data mining method of claim 5, further comprising, before expanding a hyperedge in the ordered hyperedge set:

calculating the variance of the number of nodes in the hyperedges;

judging whether the variance is smaller than a second set threshold;

if yes, expanding the hyperedge;

9. The hypergraph-based financial timing data mining method according to claim 5, wherein the calculating the support of the rightmost extended sub-hypergraph specifically comprises:

10. A hypergraph-based financial timing data mining system, comprising: