CN115617882A

CN115617882A - Time sequence diagram data generation method and system with structural constraint based on GAN

Info

Publication number: CN115617882A
Application number: CN202211638436.4A
Authority: CN
Inventors: 李松; 齐逸岩; 刘力铭; 幺宝刚
Original assignee: International Digital Economy Academy IDEA
Current assignee: International Digital Economy Academy IDEA
Priority date: 2022-12-20
Filing date: 2022-12-20
Publication date: 2023-01-17
Anticipated expiration: 2042-12-20
Also published as: CN115617882B

Abstract

The invention discloses a GAN-based method and system for generating timing graph (time sequence diagram) data with structural constraints. Within a GAN network structure, a timing graph sequence generation network generates simulated timing graph sequence data, and real timing graph sequence data is obtained by sampling a real timing graph of a target field. A timing graph sequence discrimination network yields a first loss value between the simulated and the real timing graph sequence data. The subgraph distribution distance value of the simulated timing graph sequence data is compared with that of the real timing graph sequence data to obtain a second loss value. The timing graph sequence generation network is optimized according to the first loss value and the second loss value so that it learns the subgraph distribution of the real timing graph; once trained, it can generate high-quality timing graph data that approximates the real timing graph.

Description

Time sequence diagram data generation method and system with structural constraint based on GAN

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a system for generating timing diagram data with structural constraint based on GAN.

Background

Many aspects of real life, such as interpersonal communication networks, chemical molecules, and biological information, can be represented as graphs. As graph computing technology matures, graph representation learning methods are increasingly applied in fields such as finance, recommendation, and medical treatment, and in the financial field in particular.

In order to obtain a graph representation learning model with high prediction accuracy, a large amount of graph data is required to train the graph representation learning model; due to the sensitivity of financial data, it is difficult to acquire a large amount of real transaction data. Therefore, graph data generation methods are often employed to generate large amounts of simulated graph data to assist in the training of graph representation learning models.

Current static graph data generation methods are not suitable for timing graphs in fields such as finance, and existing methods do not consider how the distribution of a graph's temporal subgraphs changes during graph data generation. As a result, the generated graph data differs considerably from the real timing graph in subgraph distribution and is of low quality.

Thus, the prior art is in need of improvement and advancement.

Disclosure of Invention

The invention mainly aims to provide a GAN-based method, system, intelligent terminal, and storage medium for generating timing graph data with structural constraints, which can solve the problem that currently generated graph data differs greatly from the real timing graph in subgraph distribution and is of low quality.

In order to achieve the above object, a first aspect of the present invention provides a GAN-based timing graph data generation method with structural constraints, where the method includes:

acquiring noise data, inputting the noise data into a timing graph sequence generation network, and generating simulated timing graph sequence data representing a timing graph of a target field;

sampling a real timing graph of the target field to obtain real timing graph sequence data;

inputting the simulated timing graph sequence data and the real timing graph sequence data into a timing graph sequence discrimination network to obtain a first loss value;

calculating and comparing a subgraph distribution distance value of the simulated timing graph sequence data with a subgraph distribution distance value of the real timing graph sequence data to obtain a second loss value for constraining the subgraph distribution;

obtaining a total loss value according to the first loss value and the second loss value;

optimizing model parameters of the timing graph sequence generation network until the total loss value meets a set condition, to obtain a trained timing graph sequence generation network;

and inputting noise data into the trained timing graph sequence generation network to obtain the timing graph data with structural constraints.

Optionally, the simulation sequence data and the real sequence data are sequence data of a sequence diagram, and calculating a subgraph distribution distance value of the sequence data of the sequence diagram includes:

calculating a subgraph structure distance value corresponding to each class of subgraph structures in a preset subgraph structure class based on sequence data of the time sequence chart;

and accumulating all the subgraph structure distance values to obtain the subgraph distribution distance value.

Optionally, calculating a subgraph structure distance value corresponding to the subgraph structure based on the sequence data of the time chart, including:

counting the number of the subgraph structures in sequence data of the time sequence chart to obtain the number of predicted subgraphs;

and subtracting the number of the predicted subgraphs from the number of the real subgraphs, and then squaring to obtain the subgraph structure distance value.

Optionally, the accumulating all sub-graph structure distance values to obtain the sub-graph distribution distance value includes:

acquiring the weight corresponding to each type of sub-graph structures in the preset sub-graph structure type;

and based on the weight, performing weighted accumulation on subgraph structure distance values corresponding to all classes of subgraph structures to obtain the subgraph distribution distance value.
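By way of non-limiting illustration, the squared count difference and the weighted accumulation described above can be sketched in Python as follows. The function name, the dictionary representation of the subgraph counts, and the class labels "triangle" and "star" are assumptions made for this sketch and do not come from the original text; how the subgraph structures are counted is outside the scope of the sketch.

```python
def subgraph_distribution_distance(predicted_counts, real_counts, weights=None):
    """Weighted sum over subgraph classes of (real count - predicted count) squared.

    predicted_counts / real_counts: dict mapping a subgraph class label to the
    number of such structures counted in the sequence data (hypothetical layout).
    weights: optional dict of per-class weights; defaults to 1.0 for every class.
    """
    classes = sorted(real_counts)
    if weights is None:
        weights = {c: 1.0 for c in classes}
    total = 0.0
    for c in classes:
        # Subgraph structure distance value for one class of subgraph structure.
        distance = (real_counts[c] - predicted_counts.get(c, 0)) ** 2
        total += weights[c] * distance
    return total


# Hypothetical counts for two illustrative subgraph classes.
real = {"triangle": 5, "star": 1}
predicted = {"triangle": 3, "star": 1}
unweighted = subgraph_distribution_distance(predicted, real)
weighted = subgraph_distribution_distance(
    predicted, real, weights={"triangle": 0.5, "star": 2.0}
)
```

In this sketch the weights are passed in directly; in the optional scheme described in the text they would instead be produced by the gating network and the fully connected layer.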

Optionally, a gating network constructed based on LSTM is further provided, and the obtaining of the weight corresponding to each type of sub-graph structure in the preset sub-graph structure category includes:

inputting the real sequence diagram sequence data into the gating network to obtain a weight vector;

and inputting the weight vector into a full-connection layer to obtain the weight corresponding to each type of sub-graph structure.

Optionally, the sequence data of the timing graph includes several triple data describing the edges of the timing graph, and the triple data includes a start node, a stop node and a timestamp constituting the edges of the timing graph.

Optionally, the sequence diagram generation network is constructed based on an LSTM model, and a time constraint module is further disposed in the sequence diagram generation network, and the time constraint module is configured to constrain timestamps in the triple data according to a time sequence.

A second aspect of the present invention provides a GAN-based timing diagram data generation system with structural constraints, wherein the system comprises:

the timing graph sequence generation module is used for acquiring noise data and either inputting the noise data into a trained timing graph sequence generation network to obtain timing graph data with structural constraints, or inputting the noise data into the timing graph sequence generation network to generate simulated timing graph sequence data representing a timing graph of the target field;

the timing graph sequence sampling module is used for sampling a real timing graph of the target field to obtain real timing graph sequence data;

the timing graph sequence discrimination module is used for inputting the simulated timing graph sequence data and the real timing graph sequence data into the timing graph sequence discrimination network to obtain a first loss value;

the subgraph distribution constraint module is used for calculating and comparing a subgraph distribution distance value of the simulated timing graph sequence data with a subgraph distribution distance value of the real timing graph sequence data to obtain a second loss value for constraining the subgraph distribution;

the optimization module is used for obtaining a total loss value according to the first loss value and the second loss value, and for optimizing the model parameters of the timing graph sequence generation module until the total loss value meets the set condition, to obtain the trained timing graph sequence generation module.

Optionally, the subgraph distribution constraint module further includes a gated neural network, and the gated neural network is configured to obtain a weight corresponding to each class of subgraph structure in the preset subgraph structure class according to the real sequence data of the time sequence diagram.

A third aspect of the present invention provides an intelligent terminal, where the intelligent terminal includes a memory, a processor, and a GAN-based timing graph data generating program with structural constraints, stored in the memory and executable on the processor, and when the GAN-based timing graph data generating program with structural constraints is executed by the processor, the intelligent terminal implements any one of the steps of the GAN-based timing graph data generating method with structural constraints.

A fourth aspect of the present invention provides a computer-readable storage medium, where a GAN-based timing graph data generation program with structural constraint is stored, and when executed by a processor, the GAN-based timing graph data generation program with structural constraint implements any one of the steps of the GAN-based timing graph data generation method with structural constraint.

As can be seen from the above, compared with the prior art, the method is based on the GAN network structure. The timing graph sequence generation network generates simulated timing graph sequence data, and the real timing graph of the target field is sampled to obtain real timing graph sequence data. The timing graph sequence discrimination network yields the first loss value between the simulated and the real timing graph sequence data, and the subgraph distribution distance value of the simulated timing graph sequence data is compared with that of the real timing graph sequence data to obtain the second loss value. The timing graph sequence generation network is optimized according to the first loss value and the second loss value, so that during training it learns the subgraph distribution of the real timing graph; the trained timing graph sequence generation network can therefore generate high-quality timing graph data that approximates the real timing graph.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a block diagram of a data generation architecture of a timing graph with structural constraints based on GAN according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method for generating timing diagram data with structural constraints based on GAN according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a sequence data generating network of the embodiment of FIG. 2;

FIG. 4 is a schematic diagram of the encoding and decoding of the embodiment of FIG. 2 for consecutive time stamp data;

FIG. 5 is a diagram of nine classes of sub-graph structures for the embodiment of FIG. 2;

FIG. 6 is a detailed flowchart of step S400 in the embodiment of FIG. 2;

FIG. 7 is a schematic structural diagram of a GAN-based timing diagram data generation system with structural constraints according to an embodiment of the present invention;

FIG. 8 is a schematic block diagram of an internal structure of an intelligent terminal according to an embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, models, structures, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when …" or "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted depending on the context to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.

With the development of artificial intelligence, graph representation learning methods are widely applied in many areas of life, such as finance, recommendation, and medical treatment. How realistic the graph data used to train a graph representation learning model is has a great influence on the prediction accuracy of that model.

Current static graph data generation methods are not suitable for timing graphs in fields such as finance, and existing methods do not consider how the distribution of a graph's temporal subgraphs changes during graph data generation. As a result, the generated graph data differs considerably from the real timing graph in subgraph distribution and is of low quality, and a graph representation learning model trained on the generated graph data predicts poorly.

In order to solve the problems in the prior art, the invention provides a timing graph data generation method with structural constraints based on a GAN (Generative Adversarial Network), in which the subgraph structure distribution in the graph is fully considered during timing graph data generation and is made to approach that of the real graph data of the target field.

Exemplary method

The embodiment of the invention provides a GAN-based timing graph data generation method with structural constraints, used here for generating timing graph data in the financial field; FIG. 1 is an architecture block diagram of the embodiment. During training, the timing graph sequence generation module generates a time-series random walk sequence from input noise data. This generated sequence, together with a time-series random walk sequence sampled from the real graph data, is input into the timing graph sequence discrimination module, which judges whether the generated sequence comes from the real graph data. The subgraph distribution constraint module constrains the subgraph distribution of the generated time-series random walk sequence to approach the subgraph distribution in the real graph data, according to subgraph patterns predefined for the specific scenario (the predefined subgraph structures in FIG. 1). After training, noise data is input into the trained timing graph sequence generation module to obtain timing graph data with structural constraints. Specifically, as shown in FIG. 2, the present embodiment includes the following steps:

step S100: acquiring noise data, inputting the noise data into a sequence generation network of a time sequence chart, and generating simulated time sequence chart sequence data for representing a time sequence chart of a target field;

specifically, in GAN networks, noise data that obeys a certain distribution is generally used to generate the target data, which in the present invention is the simulated timing graph sequence data. The noise data can be described as $z \sim p(z)$ with $z \in \mathbb{R}^{h}$, where $h$ is the vector length of the hidden layer and $p(z)$ is the distribution of the noise data, generally a uniform distribution or a Gaussian distribution; that is, the noise data is an $h$-dimensional real vector generated from a uniform or Gaussian distribution.
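As a minimal, non-limiting sketch, an h-dimensional noise vector drawn from a uniform or Gaussian distribution could be produced as follows. The function name, the parameter names, and the concrete ranges (standard normal, uniform on [-1, 1]) are assumptions made for illustration; the text does not specify them.

```python
import random


def sample_noise(h, distribution="gaussian", rng=None):
    """Return an h-dimensional real-valued noise vector as a list of floats."""
    rng = rng if rng is not None else random.Random()
    if distribution == "gaussian":
        # Assumed parameters: mean 0, standard deviation 1.
        return [rng.gauss(0.0, 1.0) for _ in range(h)]
    if distribution == "uniform":
        # Assumed range: [-1, 1].
        return [rng.uniform(-1.0, 1.0) for _ in range(h)]
    raise ValueError(f"unsupported distribution: {distribution}")


# Example: an 8-dimensional uniform noise vector with a fixed seed.
z = sample_noise(8, "uniform", random.Random(0))
```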

Because the present invention is directed to timing graphs, timing graph sequence data is used to characterize a timing graph. The timing graph sequence data can be represented as a time-series random walk sequence $s = \{x, (u_1, v_1, t_1), \ldots, (u_L, v_L, t_L), y\}$, in which $x$ indicates whether the sequence is a start sequence: $x$ is 1 when the sequence is the start sequence of a time-series random walk, and 0 for other sequences. The time-series random walk sequence comprises a plurality of triple data, and each triple $(u_i, v_i, t_i)$ represents one edge in the timing graph, where $u_i$ and $v_i$ are the start node and the end node of the edge and $t_i$ is the timestamp of the edge (used to store, for example, the start or end time or the duration of the edge). $L$ is the length of the timing graph sequence data, that is, the total number of edges characterizing the timing graph. $y$ indicates whether the sequence is terminated: $y$ is 1 when the sequence is terminated, and 0 otherwise.
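For illustration only, the sequence data described above (a start flag, a list of edge triples, and a termination flag) could be held in a small Python structure such as the following. The class name, field names, and helper method are hypothetical and are not taken from the original text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TimingGraphSequence:
    is_start: int                          # 1 if this is a start sequence, else 0
    edges: List[Tuple[int, int, float]]    # (start node, end node, timestamp)
    is_end: int                            # 1 if the sequence is terminated, else 0

    def __post_init__(self):
        if self.is_start not in (0, 1) or self.is_end not in (0, 1):
            raise ValueError("start and end flags must be 0 or 1")

    @property
    def length(self) -> int:
        """Length of the sequence data, i.e. the number of edges it contains."""
        return len(self.edges)

    def timestamps_ordered(self) -> bool:
        """True if the edge timestamps are non-decreasing along the sequence."""
        stamps = [t for _, _, t in self.edges]
        return all(a <= b for a, b in zip(stamps, stamps[1:]))


# Example: a terminated start sequence with two edges.
seq = TimingGraphSequence(1, [(0, 1, 0.5), (1, 2, 1.5)], 1)
```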

In this embodiment, the timing graph sequence generation module is a timing graph sequence generation network, which corresponds to the generator in a generative adversarial network. After the noise data is input into the timing graph sequence generation network, simulated timing graph sequence data that can represent a timing graph of the financial field is output; that is, the timing graph data of the financial field can be obtained by parsing the simulated timing graph sequence data.

Because the Long Short-Term Memory (LSTM) model, when processing time-series data, can selectively retain important information in a sequence according to the sequence characteristics and ignore irrelevant information, it improves the ability to process time-series data and can generate such data better. Preferably, as in this embodiment, the timing graph sequence generation network is constructed based on the LSTM model. Of course, the timing graph sequence generation network can also be constructed based on other existing memory network models, such as an RNN.

The output of the timing graph sequence generation network is a time-series random walk sequence consisting of the start flag $x$, the triples $(u_i, v_i, t_i)$, and the end flag $y$. The data in the sequence can be divided into two categories: the continuous timestamp data $t_i$ and the discrete data $x$, $y$, $u_i$, $v_i$. Of the discrete data, $x$ and $y$ are binary data (whose value is 0 or 1), while $u_i$ and $v_i$ are N-class categorical data (N is the total number of nodes in the timing graph); for example, the vector 00100... denotes that the node is the third node. The timing graph sequence generation network therefore includes both the decoding and encoding of the discrete values and the decoding and encoding of the continuous values $t_i$.

Referring to FIG. 3, when the discrete data (the start flag, the end flag, and the start node and end node of each edge) is encoded and decoded, the decoder of the timing graph sequence generation network first maps the output of the LSTM unit and then uses the Gumbel-Max reparameterization technique to generate discrete class values. Gumbel-Max reparameterization provides a method for sampling from a categorical distribution; it is a conventional technical means in the art and is not described here again. Optionally, Gumbel-Softmax may also be used for the reparameterization.

Taking one item of the discrete node data as an example, the decoder expression involves a mapping matrix and a bias term applied to the output of the LSTM unit, a vector formed by independent and identically distributed samples from the standard Gumbel distribution, a "temperature" hyperparameter, and the selection of the maximum value of the resulting vector. The other item of discrete node data is processed in the same way.
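The Gumbel-Max reparameterization mentioned above can be illustrated with the following minimal Python sketch. It shows the standard technique only (add independent standard Gumbel noise to each class score and take the arg-max) and is not the decoder expression of the embodiment; the function name and the `logits` and `temperature` parameters are assumptions of this sketch, with `logits` standing in for the mapped LSTM output.

```python
import math
import random


def gumbel_max_sample(logits, temperature=1.0, rng=None):
    """Sample a class index from unnormalized scores using the Gumbel-Max trick.

    Returns the sampled index and the corresponding one-hot vector.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng if rng is not None else random.Random()
    scores = []
    for logit in logits:
        u = max(rng.random(), 1e-12)       # keep u strictly inside (0, 1)
        g = -math.log(-math.log(u))        # one standard Gumbel sample
        # Dividing by a positive temperature does not change the arg-max;
        # it matters only for the softmax relaxation (Gumbel-Softmax).
        scores.append((logit + g) / temperature)
    index = max(range(len(scores)), key=scores.__getitem__)
    one_hot = [1 if k == index else 0 for k in range(len(scores))]
    return index, one_hot


# Example: with one dominant score, the dominant class is selected.
index, one_hot = gumbel_max_sample([0.0, 0.0, 100.0], rng=random.Random(1))
```

The one-hot output matches the N-class node encoding described in the text, in which a vector such as 00100... marks the third node.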

For the two remaining items of discrete data, the decoding process is the same as for the discrete data in the example above, and the specific expressions take the same form.

The output of the decoder is then input to the encoder. In the encoder, the node data and the start flag are each encoded into a vector by a Dense operation, and the resulting vectors are input into the next LSTM unit. The Dense operation refers to passing each of these scalars through a Dense layer (a fully connected neural network layer), which maps it to the corresponding vector.

FIG. 4 illustrates the encoding and decoding process for continuous timestamp data. In the decoder (Decode), a deconvolution layer first expands the output vector of the LSTM unit into a matrix whose dimensions are those of the deconvolution layer output; one or more rows of this matrix are then uniformly sampled and averaged to obtain an average vector; finally, a Dense layer (a fully connected neural network layer) maps the average vector to a continuous scalar, the timestamp. In the encoder (Encode), another Dense layer maps the scalar timestamp to a vector. After these two Dense mappings, the continuous timestamp data finally becomes a vector that is input to the next LSTM unit.

Further, to ensure that the generated timing graph sequence data satisfies the time constraint, that is, that the timestamp data in each triple is constrained according to time order, a time constraint mechanism is also designed in the decoder; for example, clock constraints are employed.
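As a purely illustrative sketch, one simple way to keep generated timestamps in time order is to clamp each timestamp so that it is never earlier than the previous one. This is an assumption made for illustration and is not the time constraint mechanism of the embodiment, whose expression is not reproduced in this text; the function name and the `start_time` parameter are hypothetical.

```python
def enforce_time_order(timestamps, start_time=0.0):
    """Return a copy of the timestamps made non-decreasing by clamping.

    Each timestamp is raised to the running maximum if it would otherwise
    fall before the preceding timestamp (or before start_time).
    """
    ordered = []
    previous = start_time
    for t in timestamps:
        previous = max(previous, t)
        ordered.append(previous)
    return ordered


# Example: the out-of-order values 1.0 and 4.0 are raised to 3.0 and 5.0.
constrained = enforce_time_order([3.0, 1.0, 5.0, 4.0])
```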

After the above processing, the timing graph sequence generation network first generates the start flag of the simulated timing graph sequence data, then generates a vector recording the initial time, then successively generates L groups of triple data, and finally generates the end flag y and outputs the simulated timing graph sequence data. Because the sequence data is generated randomly from the noise data, it is also referred to as a time-series random walk sequence.

Step S200: sampling a real time sequence chart of a target field to obtain real time sequence chart sequence data;

step S300: inputting the simulation sequence diagram sequence data and the real sequence diagram sequence data into a sequence diagram sequence discrimination network to obtain a first loss value;

specifically, a GAN network architecture is adopted to train the timing graph sequence generation network, so that during training it can learn the characteristics of the real timing graph data of the target field. In this embodiment, the timing graph sequence discrimination module is a timing graph sequence discrimination network, which is equivalent to the discriminator in the GAN. The timing graph sequence discrimination network and the timing graph sequence generation network together form the GAN network architecture. Through adversarial training of the two networks, the timing graph sequence generation network learns to extract the characteristics of the sequence data, and the timing graph sequence discrimination network learns to identify abnormal data. Owing to the adversarial training approach, both networks are continuously improved according to the learned characteristics: the ability of the generation network to generate realistic timing graph data and the ability of the discrimination network to distinguish generated graph data from real graph data both increase, and the construction of the timing graph sequence generation network is finally realized.

In this embodiment, the target field is the financial field. After real timing graph samples in the financial field are collected as training samples, they are sampled to obtain real timing graph sequence data. The simulated timing graph sequence data and the real timing graph sequence data are then input into the timing graph sequence discrimination network and compared to judge whether the simulated timing graph sequence data comes from the real graph data.

When the real timing graph data $G = (V, E, T)$ is sampled, a random walk is used to obtain the real timing graph sequence data, where V is the set of all nodes, E is the set of all edges, and T is the set of timestamps corresponding to the edges. Each sampling process yields one set of real timing graph sequence data. The specific steps of the sampling process are as follows:

step S110: initializing sampling parameters;

specifically, the total number of edges to be sampled this time is initialized (it may be larger than the total number of edges of the real timing graph), a sequence length counter i is set to its initial value, and the discrete start flag is set to indicate the start of the real timing graph sequence data.

Step S120: sampling according to the average sampling probability to obtain a first edge of the real sequence data of the time sequence diagram;

specifically, the first edge in the real timing graph sequence data is obtained by sampling from the real graph data with the average sampling probability, and the sequence length counter i is updated accordingly;

Step S130: sampling for the next time to obtain the next edge of the real sequence data of the time sequence diagram;

specifically, the next edge is sampled from the real graph data according to the sampling probability and the sequence length counter i is updated, and this is repeated until the sequence length counter i equals the set total number of sampled edges. The sampling probability uses a normalization function so that the probabilities of all edges in the real graph data being sampled add up to 1;

Step S140: when the number of sampled edges equals the set total number of sampled edges, finishing the sampling; otherwise, returning to step S120 for the next round of sampling.

Specifically, when all edges in the real timing graph have been sampled, one round of sampling is finished. If, during the current round, the number of sampled edges i equals the set total number of sampled edges, the sampling process ends and the end flag is set to indicate that the real timing graph sequence data terminates. If the number of sampled edges is still insufficient after the round is finished, the process returns to step S120 for the next round of sampling, until the number of sampled edges equals the set total number of sampled edges.
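Steps S110 to S140 can be illustrated with the following simplified Python sketch. The first edge is drawn with equal probability from all edges; the rule used here for choosing each subsequent edge (edges leaving the current end node at a timestamp no earlier than the current one, with equal normalized weights, and a restart from all edges when no such edge exists) is an assumption of this sketch, since the sampling probability of the embodiment is not reproduced in this text. All function and key names are hypothetical.

```python
import random


def sample_timing_walk(edges, total_edges, rng=None):
    """Sample a time-series random walk of total_edges edges.

    edges: list of (start node, end node, timestamp) triples of the real graph.
    Returns a dict with a start flag, the sampled edges, and an end flag.
    """
    if not edges or total_edges < 1:
        raise ValueError("need at least one edge and total_edges >= 1")
    rng = rng if rng is not None else random.Random()
    # S110/S120: the first edge is drawn uniformly from all edges.
    walk = [rng.choice(edges)]
    # S130/S140: keep sampling until the requested number of edges is reached.
    while len(walk) < total_edges:
        _, end_node, current_time = walk[-1]
        # Assumed candidate rule (not from the source text).
        candidates = [
            e for e in edges if e[0] == end_node and e[2] >= current_time
        ]
        if not candidates:
            candidates = edges              # assumed fallback: start a new round
        weights = [1.0] * len(candidates)   # placeholder weights
        total = sum(weights)
        probabilities = [w / total for w in weights]  # normalized to sum to 1
        walk.append(rng.choices(candidates, weights=probabilities, k=1)[0])
    return {"is_start": 1, "edges": walk, "is_end": 1}


# Example on a small hypothetical graph with three timestamped edges.
graph_edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 0, 3.0)]
walk = sample_timing_walk(graph_edges, 5, random.Random(0))
```

Because the requested number of edges may exceed the number of edges in the real graph, the sketch allows edges to be sampled more than once.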

After the real timing graph sequence data is obtained by sampling, the simulated timing graph sequence data and the real timing graph sequence data are input into the timing graph sequence discrimination network, and a first loss value is obtained from the loss function of the discrimination network in order to judge whether the simulated timing graph sequence data comes from the real graph data. For example, a cross-entropy loss function in the discrimination network compares the real timing graph sequence data with the simulated timing graph sequence data to obtain the first loss value. This embodiment uses an LSTM-based classifier (its symbol is not reproduced in this text). The input of the classifier is the output of the timing graph sequence generation network together with the real timing graph sequence data obtained by sampling. The classifier classifies the real timing graph sequence data and the simulated timing graph sequence data respectively, and the classification result is used to predict whether the simulated timing graph sequence data comes from the real graph data. The specific loss function is:

[Formula not reproduced in this text.]

In this formula, one operand is the real timing graph sequence data and the other is the simulated timing graph sequence data generated by the timing graph sequence generation network.

The first loss value given by this loss function is reduced by optimizing the model parameters of the timing graph sequence generation network and of the timing graph sequence discrimination network, so that the simulated timing graph sequence data gradually approximates the real timing graph sequence data.
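As an illustration of a cross-entropy discrimination loss, the sketch below computes the standard binary cross-entropy over classifier outputs. It is an assumption that the embodiment's formula has this form; the function names are hypothetical, `p_real` and `p_fake` stand for the classifier's probabilities that real and simulated sequences are real, and the non-saturating generator loss is a common substitute rather than something stated in the text.

```python
import math


def discriminator_loss(p_real, p_fake, eps=1e-12):
    """Binary cross-entropy: real sequences are labeled 1, simulated ones 0."""
    real_term = -sum(math.log(p + eps) for p in p_real) / len(p_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in p_fake) / len(p_fake)
    return real_term + fake_term


def generator_loss(p_fake, eps=1e-12):
    """Non-saturating generator loss: small when simulated data is scored as real."""
    return -sum(math.log(p + eps) for p in p_fake) / len(p_fake)
```

A classifier that cannot tell the two sources apart outputs 0.5 everywhere, which gives a discrimination loss of 2 ln 2.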

Step S400: calculating and comparing a subgraph distribution distance value of the simulated timing graph sequence data with a subgraph distribution distance value of the real timing graph sequence data to obtain a second loss value for constraining the subgraph distribution;

Specifically, the subgraph distribution describes which types of subgraph structures exist in the timing graph sequence data, together with information such as the number, connection relationships and temporal correlation of each type of subgraph structure. The subgraph structure is also referred to as a minimum spanning tree.

To make the simulated timing graph sequence data generated by the timing graph sequence generation network more similar to the real timing graph sequence data, the generated timing graph should also approach the real timing graph in its subgraph distribution. Referring to fig. 1, a subgraph distribution constraint module guides the timing graph sequence generation module to learn the subgraph distribution characteristics of the real graph data in the target field, which further improves the quality of the generated timing graph data.

This embodiment uses a subgraph distribution distance value to measure the similarity between the simulated timing graph sequence data and the real timing graph sequence data. The subgraph distribution constraint module works as follows: a subgraph distribution distance value is calculated for the real timing graph sequence data and for the simulated timing graph sequence data respectively, and a second loss value is calculated from the two distance values to constrain the subgraph distribution of the generated simulated data to be close to that of the real graph data.

Specifically, analysis of various subgraph structures yields the nine types of subgraph structures shown in fig. 5. The subgraph structures in the timing graph data differ from scenario to scenario and may include one or more of the structures in fig. 5. In the financial anti-money-laundering scenario of this embodiment, ring subgraphs usually exist between nodes, i.e. the subgraph structure of real financial graph data is a ring subgraph structure (e.g. subgraph structures 5, 6 and 7 in fig. 5). Real graph data samples are selected according to the pre-selected subgraph structure. For example, this embodiment mainly selects real graph data samples with a ring subgraph structure, so that the subgraph distribution of the timing graph data generated by the timing graph sequence generation module is constrained to be close to the subgraph distribution of real financial graph data.

On the basis of these subgraph structures, a subgraph structure distance value is calculated for each type of subgraph structure, and the values are accumulated to obtain the subgraph distribution distance value. As shown in fig. 6, this specifically includes the following steps:

Step S410: calculating a subgraph structure distance value corresponding to each subgraph structure in a preset subgraph structure category;

Specifically, taking the real timing graph sequence data as an example, the subgraph structure distance value for a given class of subgraph structure (e.g. the 4th subgraph structure in fig. 5) is calculated as follows. While the real graph data G = (V, E, T) is sampled by random walk, the number of subgraphs matching that class in the real timing graph sequence data is counted to obtain the number of real subgraphs; alternatively, the real graph sample is labeled in advance to obtain the number of real subgraphs. The subgraph distribution prediction module then identifies the subgraph structures of each class in the real timing graph sequence data and counts them to obtain the number of predicted subgraphs. The number of predicted subgraphs is subtracted from the number of real subgraphs, and the difference is squared to obtain the subgraph structure distance value. In this embodiment, the subgraph distribution prediction module is built on an LSTM, and the number of subgraphs corresponding to each subgraph structure in the real timing graph sequence data is predicted by the LSTM unit. Optionally, other common techniques in the field may be used to count the occurrences of a subgraph structure in the real timing graph data, for example a greedy path query algorithm on the timing graph, a path query method based on graph transformation, or a TTL algorithm.

The subgraph structure distance value of the simulated timing graph sequence data under each class of subgraph structure is calculated in a similar way. The number of real subgraphs is obtained as before, either by counting the subgraphs of that class in the real timing graph sequence data while the real graph data G = (V, E, T) is sampled by random walk, or by labeling the real graph sample in advance. The subgraph distribution prediction module identifies the subgraph structures of each class in the simulated timing graph sequence data and counts them to obtain the number of predicted subgraphs. The number of predicted subgraphs is subtracted from the number of real subgraphs, and the difference is squared to obtain the subgraph structure distance value.

It should be noted that, in order to increase the processing speed and efficiency, the sub-graph structure distance value is calculated in this embodiment by simply counting the number of sub-graphs corresponding to the sub-graph structure, and in other scenarios, the sub-graph structure distance value may be calculated with reference to other items related to the sub-graph structure.
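Counting one class of subgraph structure can be sketched as follows. The example counts directed three-node rings whose edge timestamps strictly increase, which is only a hypothetical stand-in for one of the ring structures in fig. 5; the function name and the (start, end, timestamp) tuple layout are assumptions.

```python
from collections import defaultdict


def count_temporal_triangles(edges):
    """Count directed rings a->b->c->a whose timestamps strictly increase."""
    out = defaultdict(list)  # start node -> [(end node, timestamp)]
    for u, v, t in edges:
        out[u].append((v, t))
    count = 0
    for a, b, t1 in edges:
        if a == b:
            continue  # a self-loop cannot open a three-node ring
        for c, t2 in out[b]:
            if c in (a, b) or t2 <= t1:
                continue
            # close the ring back to the starting node with a later edge
            count += sum(1 for d, t3 in out[c] if d == a and t3 > t2)
    return count
```

Because the first edge of a ring is the one with the smallest timestamp, each ring is counted exactly once.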

Step S420: accumulating all the subgraph structure distance values to obtain the subgraph distribution distance value.

Specifically, the subgraph structure distance values under all subgraph structure categories are accumulated to obtain the subgraph distribution distance value. The formula itself is not reproduced in this text; from the definitions given, the expression for the real timing graph sequence data takes the following form (symbol names chosen here):

$$D(S_{real}) = \sum_{i=1}^{k} \left( n_i - \hat{n}_i(S_{real}) \right)^2$$

where $n_i$ is the number of real subgraphs of the i-th subgraph type in the real timing graph data, $\hat{n}_i(S_{real})$ is the number of subgraphs of the i-th type predicted by the subgraph distribution prediction module for the real timing graph data, k is the total number of preset subgraph structure types (e.g. the 9 types in fig. 5, k = 9), and $S_{real}$ is the real timing graph sequence data sampled from the real graph data by the sampling module.
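The accumulation in step S420 amounts to a sum of squared count differences. The sketch below assumes the per-type counts are already available as two lists of equal length; the function name is hypothetical.

```python
def subgraph_distribution_distance(real_counts, predicted_counts):
    """Sum over subgraph types of (real count - predicted count) squared."""
    if len(real_counts) != len(predicted_counts):
        raise ValueError("both count lists must cover the same subgraph types")
    return sum((n - p) ** 2 for n, p in zip(real_counts, predicted_counts))
```

For example, real counts [3, 0, 2] against predicted counts [1, 0, 5] give 4 + 0 + 9 = 13.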

The expression for calculating the subgraph distribution distance value of the simulated timing graph sequence data is the same as the above expression, except that the simulated timing graph sequence data generated by the timing graph sequence generation network is used in place of the sampled real timing graph sequence data.

The loss function of the subgraph distribution prediction module is constructed by combining the subgraph distribution distance value of the real timing graph sequence data with the subgraph distribution distance value of the simulated timing graph sequence data. The specific expression is:

[Formula not reproduced in this text.]

In this expression, one symbol denotes the timing graph sequence generation network and another denotes the distance function used to compute the subgraph distribution distance values.

A second loss value is calculated from this loss function to measure the difference between the subgraph distribution of the real timing graph sequence data and that of the simulated timing graph sequence data, and the model parameters of the timing graph sequence generation network, the timing graph sequence discrimination network and the subgraph distribution prediction module are optimized according to the second loss value.

In one embodiment, the degree of similarity between the subgraph distribution of the real timing graph sequence data and that of the simulated timing graph sequence data can be calculated directly with a graph neural network to obtain the second loss value.

Because the subgraph distribution of the timing graph data is taken into account during generation, the generated data approaches the real graph data more closely, the quality of the generated graph data improves, and the training effect of the graph representation learning model improves in turn.

To further improve the accuracy of the subgraph distribution distance value, this embodiment also provides a gating network in the subgraph distribution prediction module to learn the importance of the various subgraph structures in the timing graph data. The gating network is built on an LSTM model: the real timing graph sequence data is input into the LSTM unit, whose memory cell keeps track of the number of each kind of subgraph structure in the real timing graph sequence data. The LSTM model finally outputs the numbers of the various subgraph structures as a vector, which is input into a fully connected layer and mapped into the interval (0, 1) by a sigmoid function to obtain the weight corresponding to each type of subgraph structure. The specific expression of the gating network is:

[Formula not reproduced in this text.]

In this expression, the input is the hidden vector output by the final unit of the LSTM network, and the vector is mapped to a scalar representing the estimated number of subgraphs of the i-th type. The model expression of the fully connected layer is:

[Formula not reproduced in this text.]

in which one parameter is the weight of the fully connected network layer and the other is described as a weight vector. The gating network may be a standalone LSTM network, or it may be the LSTM network on which the timing graph sequence generation network is based.
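The final stage of the gating network, a fully connected layer followed by a sigmoid, can be sketched as below. The LSTM itself is omitted: `hidden` stands in for its final hidden vector, `weight` holds one row per subgraph type, and treating the second parameter as an additive bias is an assumption, since the text only calls it a weight vector. All names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function with values in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_weights(hidden, weight, bias):
    """Fully connected layer plus sigmoid: one weight per subgraph type."""
    return [
        sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(weight, bias)
    ]
```

With k = 9 subgraph types, `weight` would have nine rows and the function would return nine values, each strictly between 0 and 1.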

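After the weights of the various subgraph structures are obtained, the subgraph structure distance values corresponding to all types of subgraph structures are weighted by these weights and accumulated to obtain the weighted subgraph distribution distance value. The formula itself is not reproduced in this text; from the definitions given, it takes the following form for the real timing graph data (symbol names chosen here):

$$D_w = \sum_{i=1}^{k} w_i \left( n_i - \hat{n}_i \right)^2$$

where $n_i$ is the real number of subgraphs of the i-th type in the real timing graph data, $\hat{n}_i$ is the number of subgraphs of the i-th type estimated by the subgraph distribution prediction module for the real timing graph, $w_i$ is the weight of the i-th subgraph type, and k is the total number of preset subgraph structure types.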

By adopting the weighting method, the subgraph distribution distance value in the real graph time sequence data and the subgraph distribution distance value in the simulated time sequence data are calculated, so that the second loss value is more accurate, and the optimization effect is improved.
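The effect of the weights can be seen in a short sketch of the weighted accumulation. The function name is hypothetical, and the weights are assumed to be supplied as one value in (0, 1) per subgraph type.

```python
def weighted_subgraph_distance(real_counts, predicted_counts, weights):
    """Weighted sum over subgraph types of squared count differences."""
    if not (len(real_counts) == len(predicted_counts) == len(weights)):
        raise ValueError("counts and weights must cover the same subgraph types")
    return sum(
        w * (n - p) ** 2
        for n, p, w in zip(real_counts, predicted_counts, weights)
    )
```

A subgraph type with a small weight contributes little to the second loss value even when its count is far off, so the optimization concentrates on the structures the gating network considers important.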

Step S500: obtaining a total loss value according to the first loss value and the second loss value;

Step S600: optimizing the model parameters of the timing graph sequence generation network until the total loss value meets the set condition, to obtain the trained timing graph sequence generation network.

Step S700: inputting the noise data into the trained timing graph sequence generation network to obtain timing graph data with structural constraints.

Specifically, when the timing graph sequence generation network is trained, a total loss function is constructed from the loss function of the subgraph distribution prediction module and the loss function of the timing graph sequence discrimination network:

[Formula not reproduced in this text.]

The timing graph sequence generation network, the timing graph sequence discrimination network and the subgraph distribution prediction module are trained jointly according to the total loss function, and the timing graph sequence generation network is optimized until the total loss value calculated from the total loss function reaches the set precision requirement. After training is finished, the trained timing graph sequence generation network is obtained. To generate timing graph data, noise data is input into the trained timing graph sequence generation network, which yields timing graph data with structural constraints.
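The stopping rule of steps S500 and S600 can be sketched as a loop that combines the two loss values and halts once the total meets a tolerance. How the two losses are combined is not reproduced in the text, so the plain sum with an optional `weight` factor is an assumption, as are the names `total_loss`, `train_until` and `step_fn`.

```python
def total_loss(first_loss, second_loss, weight=1.0):
    """Combine the discrimination loss and the subgraph distribution loss.

    Assumption: a (weighted) sum; the source formula is not available.
    """
    return first_loss + weight * second_loss


def train_until(step_fn, tolerance, max_steps=1000):
    """Run training steps until the total loss is at or below `tolerance`.

    `step_fn` performs one joint update and returns (first_loss, second_loss).
    """
    history = []
    for _ in range(max_steps):
        first, second = step_fn()
        history.append(total_loss(first, second))
        if history[-1] <= tolerance:
            break
    return history
```

In a real setup `step_fn` would update the generation network, the discrimination network and the subgraph distribution prediction module; here it is left abstract so that only the control flow is shown.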

In summary, in this embodiment the simulated timing graph sequence data is generated by the timing graph sequence generation network, and the real timing graph is sampled to obtain the real timing graph sequence data. The timing graph sequence discrimination network yields a first loss value between the simulated and the real timing graph sequence data, and a second loss value is obtained from the subgraph distribution distance values of the simulated and the real timing graph sequence data. The timing graph sequence generation network is optimized according to the first loss value and the second loss value, so that the trained timing graph sequence generation network can generate accurate, high-quality graph data.

Exemplary System

As shown in fig. 7, corresponding to the GAN-based timing graph data generating method with structural constraint, an embodiment of the present invention further provides a GAN-based timing graph data generating system with structural constraint, where the system includes:

a timing graph sequence generation module 600, configured to acquire noise data and either input the noise data into the timing graph sequence generation network to generate simulated timing graph sequence data characterizing a target field, or input the noise data into the trained timing graph sequence generation network to obtain timing graph sequence data with structural constraints;

a timing graph sequence sampling module 610, configured to sample the real timing graph of the target field to obtain the real timing graph sequence data;

a timing graph sequence discrimination module 620, configured to input the simulated timing graph sequence data and the real timing graph sequence data into the timing graph sequence discrimination network to obtain a first loss value;

a subgraph distribution constraint module 630, configured to calculate and compare a subgraph distribution distance value of the simulated timing graph sequence data with a subgraph distribution distance value of the real timing graph sequence data to obtain a second loss value for constraining the subgraph distribution;

an optimization module 640, configured to obtain a total loss value according to the first loss value and the second loss value, and to optimize the model parameters of the timing graph sequence generation network until the total loss value meets the set condition, to obtain the trained timing graph sequence generation network.

During training, the timing graph sequence generation module 600 outputs simulated timing graph sequence data from the acquired noise data. The simulated data and the real timing graph sequence data sampled from the real graph data by the timing graph sequence sampling module 610 are then input into the timing graph sequence discrimination module 620, which judges whether the simulated timing graph sequence data comes from the real graph data. The subgraph distribution constraint module 630 constrains the subgraph distribution of the timing graph data generated by the timing graph sequence generation module to be close to that of the real graph data, according to the subgraph patterns predefined for the specific scenario. Because the difference between the generated graph data and the real graph data in subgraph structure is also taken into account, the generated graph data is closer to the real data and of higher quality; in tests, the quality measured by the MMD (Maximum Mean Discrepancy) index of the node degree is improved by 30% compared with the prior art. After training is finished, noise data is input into the trained timing graph sequence generation network to obtain timing graph sequence data with structural constraints, which can be used to train a graph representation learning model.
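For orientation, a node-degree MMD comparison between a real and a generated edge list might be computed as in the sketch below. The text does not state which kernel or estimator was used in the tests, so the Gaussian kernel, the biased estimator, the bandwidth `sigma` and both function names are assumptions.

```python
import math
from collections import Counter


def degrees(edges):
    """Node degrees (in plus out) from (start, end, timestamp) edges."""
    deg = Counter()
    for u, v, _ in edges:
        deg[u] += 1
        deg[v] += 1
    return list(deg.values())


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two samples, Gaussian kernel."""
    def kernel(a, b):
        return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

    def mean_kernel(p, q):
        return sum(kernel(a, b) for a in p for b in q) / (len(p) * len(q))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

Calling `mmd_squared(degrees(real_edges), degrees(generated_edges))` gives a value near zero when the two degree distributions match, and a larger value as they diverge.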

Optionally, the sub-graph distribution constraint module 630 further includes a gating network constructed based on LSTM, and the gating network is configured to obtain a weight corresponding to each type of sub-graph structure according to the real sequence diagram sequence data. By learning the weight of each sub-graph structure through the gating network, the authenticity of the timing graph data generated by the timing graph sequence generation module 600 can be further improved.

Specifically, in this embodiment, the specific functions of each module of the GAN-based timing diagram data generation system with structural constraint may refer to the corresponding descriptions in the GAN-based timing diagram data generation method with structural constraint, and are not described herein again.

Based on the above embodiment, the present invention further provides an intelligent terminal, a schematic block diagram of which may be as shown in fig. 8. The intelligent terminal comprises a processor, a memory, a network interface and a display screen connected through a system bus. The processor of the intelligent terminal provides computing and control capability. The memory of the intelligent terminal comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a GAN-based timing graph data generation program with structural constraints. The internal memory provides an environment for running the operating system and the GAN-based timing graph data generation program with structural constraints stored in the non-volatile storage medium. The network interface of the intelligent terminal is used to connect to and communicate with an external terminal through a network. When executed by the processor, the GAN-based timing graph data generation program with structural constraints implements the steps of any one of the GAN-based timing graph data generation methods with structural constraints. The display screen of the intelligent terminal may be a liquid crystal display screen or an electronic ink display screen.

It will be understood by those skilled in the art that the block diagram of fig. 8 shows only the part of the structure related to the solution of the present invention and does not limit the intelligent terminal to which the solution is applied; a specific intelligent terminal may include more or fewer components than those shown in the figure, combine some components, or have a different arrangement of components.

In one embodiment, an intelligent terminal is provided, where the intelligent terminal includes a memory, a processor, and a GAN-based timing graph data generation program with structural constraints stored in the memory and executable on the processor, and the GAN-based timing graph data generation program with structural constraints performs the following operations when executed by the processor:

calculating and comparing a subgraph distribution distance value of the simulated sequence data with a subgraph distribution distance value of the real sequence data to obtain a second loss value for restricting the subgraph distribution distance value;

acquiring the weight corresponding to each type of sub-graph structure in the preset sub-graph structure type;

and based on the weight, performing weighted accumulation on subgraph structure distance values corresponding to the subgraph structures of all classes to obtain the subgraph distribution distance value.

The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a GAN-based time series graph data generating program with structural constraint, and when the GAN-based time series graph data generating program with structural constraint is executed by a processor, the steps of any GAN-based time series graph data generating method with structural constraint according to the embodiment of the present invention are implemented.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art would appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the above modules or units is only one type of logical function division, and the actual implementation may be implemented by another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed.

The integrated modules/units described above, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which implements the steps of the method embodiments when executed by a processor. The computer program includes computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the above computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the contents contained in the computer-readable storage medium may be increased or decreased as required by legislation and patent practice in the jurisdiction.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention and should be construed as being included therein.

Claims

1. A time sequence diagram data generation method with structural constraint based on GAN is characterized by comprising the following steps:

calculating and comparing a subgraph distribution distance value of the simulated sequence data with a subgraph distribution distance value of the real sequence data to obtain a second loss value for restricting subgraph distribution;

and inputting the noise data into a trained time sequence chart sequence generation network to obtain time sequence chart data with structural constraint.

2. The GAN-based timing graph data generation method with structural constraints according to claim 1, wherein the simulated timing graph sequence data and the real timing graph sequence data are timing graph sequence data, and the calculating the subgraph distribution distance value of the timing graph sequence data comprises:

3. The GAN-based timing graph data generation method with structural constraints as claimed in claim 2, wherein calculating a subgraph distance value corresponding to the subgraph based on the timing graph sequence data comprises:

4. The GAN-based timing graph data generating method with structural constraints as claimed in claim 2, wherein the accumulating all sub-graph structure distance values to obtain the sub-graph distribution distance value comprises:

5. The GAN-based timing graph data generation method with structural constraints according to claim 4, wherein a gating network constructed based on LSTM is further provided, and the obtaining the weight corresponding to each sub-graph structure in the preset sub-graph structure category comprises:

6. The GAN-based timing graph data with structural constraints generating method as claimed in claim 1, wherein the timing graph sequence data comprises a number of triple data describing edges of the timing graph, the triple data comprising a start node, a stop node and a time stamp constituting the edges of the timing graph.

7. The GAN-based timing graph data generation method with structural constraints according to claim 6, wherein the timing graph sequence generation network is constructed based on an LSTM model, and a time constraint module is further disposed in the timing graph sequence generation network and is used for constraining the timestamps in the triple data according to the time sequence.

8. A GAN-based timing graph data generation system with structural constraints, the system comprising:

9. The GAN-based timing graph data generating system with structural constraints as claimed in claim 8 wherein the sub-graph distribution constraint module further comprises a gated neural network for obtaining a weight corresponding to each sub-graph structure in the preset sub-graph structure class according to the real timing graph sequence data.

10. An intelligent terminal, characterized in that the intelligent terminal comprises a memory, a processor and a GAN-based timing graph data generation program with structural constraints stored on the memory and operable on the processor, wherein the GAN-based timing graph data generation program with structural constraints realizes the steps of the GAN-based timing graph data generation method with structural constraints according to any one of claims 1 to 7 when executed by the processor.

11. A computer-readable storage medium, wherein the computer-readable storage medium stores thereon a GAN-based timing graph data generation program with structural constraints, and the GAN-based timing graph data generation program with structural constraints, when executed by a processor, implements the steps of the GAN-based timing graph data generation method with structural constraints according to any one of claims 1 to 7.