CN116596150A - Event prediction method of Transformer Hawkes process model based on multi-branch self-attention - Google Patents

Event prediction method of Transformer Hawkes process model based on multi-branch self-attention

Info

Publication number
CN116596150A
CN116596150A (Application No. CN202310616233.3A)
Authority
CN
China
Prior art keywords
event
attention
sequence
branch
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310616233.3A
Other languages
Chinese (zh)
Inventor
Gao Tengda
Wu Chunlei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN202310616233.3A priority Critical patent/CN116596150A/en
Publication of CN116596150A publication Critical patent/CN116596150A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an event prediction method based on a multi-branch self-attention Transformer Hawkes process model. Previous methods have mainly applied recurrent neural networks or self-attention models directly. However, we note that conventional self-attention models ignore the differences in completeness of the dependency relationships learned from different perspectives, which biases the learning of the overall sequence information. Furthermore, Transformer-based models suffer from poor local perception, so local information is easily overlooked. The invention proposes, for the first time, a multi-branch self-attention Transformer Hawkes process model for learning the information of an event sequence. A multi-branch self-attention network is designed to mine more accurate sequence information, and a local perception enhancement module is proposed to address insensitivity to local context information. The effectiveness of the proposed model is demonstrated by extensive experiments on four real-world data sets, including Retweets, and two synthetic data sets.

Description

Event prediction method of Transformer Hawkes process model based on multi-branch self-attention
Technical Field
The invention relates to a method for predicting events and belongs to the technical field of deep learning and temporal point processes.
Background
In the information age, countless events worth recording occur every day, such as hospital visits, stock trades, earthquakes, and trending topics on social platforms. These data can be stored as event sequences, where each sequence records all events that occur at irregular intervals over a continuous period, and each event record stores the event type and the time of occurrence. Event sequences contain a wealth of regular patterns; by learning the laws governing event occurrence and analyzing the relationships between events, the time and type of future events can be predicted. How to extract useful regularities from event sequences is therefore the key research question. Point process models can model discrete event sequences in continuous time well, so they have attracted wide attention driven by the needs of an intelligent society, and the Hawkes process is one of the most widely used point process models.
The self-exciting process, also called the Hawkes process, assumes that every historical event exerts an excitatory, cumulative influence on future events, so it can simulate the dependency between events well and has been widely applied. However, the traditional Hawkes process model has unavoidable shortcomings: the relationships between different events in reality are uncertain, as events may excite each other, inhibit each other, or be unrelated, so the parametric assumptions of the model must be modified for each scenario, and the traditional method struggles with such cases. At the same time, the traditional Hawkes process model lacks nonlinear fitting capability, which limits its expressive power and makes it difficult to cope with very complex situations.
In recent years, with the development of machine learning, especially deep learning, many powerful network architectures have emerged in computer vision and natural language processing. The RNN is one of the most successful: designed for processing sequence data, it can naturally learn potential relationships between data points and has been widely used and improved. Nan Du et al. converted event sequence information into vectors fed into an RNN and learned a nonlinear intensity function from historical data, thereby modeling the sequence. Combining point process models with deep learning improves the expressive power on sequence data and allows the time and category of future events to be predicted more accurately, without requiring hand-crafted parametric assumptions in the model design.
However, RNNs suffer from inherent drawbacks: even when equipped with gating mechanisms for preserving long-term and short-term memory, RNN-based models still have difficulty capturing long-range dependencies in event sequences. In addition, recurrent neural networks are relatively hard to train, since the problems of vanishing and exploding gradients remain difficult to overcome, which limits their modeling performance.
In recent years, the Transformer architecture has achieved great success in natural language processing and has subsequently produced exciting results in speech signal processing, image classification, object detection, semantic segmentation, and other fields. The Transformer has excellent long-range dependency capture ability and can effectively solve the long-term dependency problem that RNNs struggle with. Qiang Zhang et al. first proposed modeling the Hawkes process with a self-attention mechanism; similarly, Simiao Zuo et al. proposed the Transformer Hawkes process model based on the Transformer encoder architecture, directly using a Transformer encoder to encode the historical representation of the sequence. However, these methods do not consider the completeness of the dependencies learned from the data sequence from different perspectives. We note that in existing Transformer Hawkes process models, each attention head focuses on information from a different representation subspace and learns the sequence information independently; the extracted hidden representations are all global information obtained from different perspectives, and they are simply combined by concatenation, ignoring the differences in completeness between the independent pieces of sequence information, which biases the learning of the overall sequence information. In addition, the traditional Transformer-based Hawkes process model has poor local perception, which also distorts the understanding of the sequence information.
Disclosure of Invention
The invention aims to solve the problem that event prediction methods based on the Transformer Hawkes process model rarely consider the relationships between pieces of information from different feature subspaces and simply concatenate all of them. Moreover, the existing Transformer-based Hawkes process models have poor local perception, so information learning deviations occur for event sequences with short time intervals.
The technical solution adopted to solve the above problems is as follows:
S1, constructing a multi-branch self-attention module, which extracts a more accurate historical representation of the sequence through differentiated fusion of multi-perspective information.
S2, constructing a local perception enhancement module, which further processes local context information through causal convolution, enhancing local perception to improve modeling performance.
S3, constructing a multi-branch self-attention Transformer sequence history representation encoding network architecture by combining the networks of S1 and S2.
S4, constructing an event prediction module, which predicts the occurrence of future events from the sequence history representation extracted in S3.
S5, constructing the multi-branch self-attention Transformer Hawkes process model framework by combining the networks of S3 and S4.
S6, training the multi-branch self-attention Transformer Hawkes process model.
The multi-branch self-attention module differentiates the different pieces of representation information obtained from multiple perspectives according to the input, so as to extract a more accurate hidden representation of the sequence. By considering the contribution of each piece of representation information to the final hidden representation of the sequence, corresponding weights are assigned to the outputs of the different representation subspaces, which reduces the bias in the overall understanding of the sequence. The detailed operation is described below:
Given a data set whose events fall into M categories, for any event sequence in the data set, each event type k_i is one-hot encoded (an M-dimensional vector whose k_i-th element is 1 and whose remaining elements are 0) to obtain k'_i, which is then multiplied by an embedding matrix to generate the event type code c_i:
C = W_e K'    #(1)
where W_e ∈ R^(D×M) is an embedding matrix, so each event category is represented by a D-dimensional code vector, with D = 128.
The time information is encoded with sine and cosine functions (Equation 2): for the time t_i we use p(t_i) as its encoding, where [p(t_i)]_j is the j-th element of the temporal code.
In this way we obtain the event type code C and the event time code P of the sequence, and the embedded code of the whole event sequence can be expressed as X = (C + P)^T. The embedded codes of all events are then input into the multi-branch self-attention Transformer sequence history representation encoding network.
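As an illustration, the sketch below builds the event embedding X = C + P from a type embedding and a sine/cosine temporal encoding, using PyTorch for concreteness. The exact sinusoidal form (Equation 2 is not reproduced in the converted text) follows the usual Transformer Hawkes process convention and is therefore an assumption, as are the class and parameter names.

```python
import torch
import torch.nn as nn

class EventEmbedding(nn.Module):
    """Minimal sketch of the event embedding X = C + P (Equations 1-2).
    The sinusoidal temporal encoding below is an assumed standard form;
    the patent only states that sine and cosine functions are used."""
    def __init__(self, num_types: int, d_model: int = 128):
        super().__init__()
        # Plays the role of W_e applied to the one-hot type vector k' (Eq. 1).
        self.type_emb = nn.Embedding(num_types, d_model)
        self.d_model = d_model

    def temporal_encoding(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, seq_len) event timestamps -> (batch, seq_len, D) temporal code P.
        idx = torch.arange(self.d_model, device=t.device)
        denom = torch.pow(10000.0, (2 * (idx // 2)).float() / self.d_model)
        angles = t.unsqueeze(-1) / denom
        p = torch.zeros_like(angles)
        p[..., 0::2] = torch.sin(angles[..., 0::2])   # even dimensions: sine
        p[..., 1::2] = torch.cos(angles[..., 1::2])   # odd dimensions: cosine
        return p

    def forward(self, types: torch.Tensor, times: torch.Tensor) -> torch.Tensor:
        # types: (batch, seq_len) integer categories; times: (batch, seq_len) timestamps.
        return self.type_emb(types) + self.temporal_encoding(times)   # X = C + P
```

In this sketch X has shape (batch, sequence length, D), corresponding to the transposed form X = (C + P)^T used above; it is then fed into the encoder stack described next.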
Given the embedded code X of an event sequence, X is input into the multi-branch Transformer encoder layers with enhanced local attention. Each encoder layer contains H branches, and each branch has a self-attention module and a local attention enhancement network module. In each branch, X is projected by three transformation matrices to generate Q, K and V:
Q = X W_Q, K = X W_K, V = X W_V    #(3)
where W_Q, W_K ∈ R^(D×D_K) and W_V ∈ R^(D×D_V) are linear projection matrices, with D_K = D_V = 64.
To prevent the event at time t_i from attending to events at later times t_j (j > i), a mask is applied and the (i, j) entries with j > i are set to 0.
The output S_i of the attention head is computed from Q, K and V:
S_i = softmax(Q K^T / sqrt(D_K)) V    #(4)
where K^T denotes the matrix transpose of K and softmax() is the softmax function.
The outputs of the different attention heads are given different weights and aggregated through a transformation matrix:
S'_i = α_i S_i W_A    #(5)
where W_A ∈ R^(D_V×D) is an aggregation matrix and α_i is a learned parameter with Σ_{i=1}^{H} α_i = 1, so that the sequence information learned from different perspectives is given different importance. Residual connection and layer normalization then give the final output S'_i:
S'_i = S'_i + X    #(6)
S'_i = LayerNorm(S'_i)    #(7)
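A minimal sketch of one weighted self-attention branch (Equations 3-7) follows. Two details are assumptions made for the illustration: the causal mask is applied to the attention scores rather than to Q itself, and the branch weights α are normalized with a softmax so that they sum to one; the layer and parameter names are likewise illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedSelfAttention(nn.Module):
    """Sketch of one self-attention branch with a learned branch weight alpha."""
    def __init__(self, d_model: int = 128, d_k: int = 64, d_v: int = 64, n_branches: int = 4):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)     # W_Q
        self.w_k = nn.Linear(d_model, d_k, bias=False)     # W_K
        self.w_v = nn.Linear(d_model, d_v, bias=False)     # W_V
        self.w_a = nn.Linear(d_v, d_model, bias=False)      # aggregation matrix W_A
        self.alpha = nn.Parameter(torch.ones(n_branches))   # learned branch weights
        self.norm = nn.LayerNorm(d_model)
        self.d_k = d_k

    def forward(self, x: torch.Tensor, branch_idx: int) -> torch.Tensor:
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)                  # Eq. 3
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)            # Q K^T / sqrt(D_K)
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, device=x.device, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float('-inf'))                # no attention to future events
        s = F.softmax(scores, dim=-1) @ v                                  # Eq. 4
        alpha = F.softmax(self.alpha, dim=0)[branch_idx]                   # branch weights sum to 1
        s = alpha * self.w_a(s)                                            # Eq. 5
        return self.norm(s + x)                                            # Eqs. 6-7
```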
The local perception enhancement module further learns the local context information of the representation under a single perspective through causal convolution, improving the model's local perception of the representation and thus the accuracy of its overall understanding of the sequence. After obtaining the output S'_i of a single attention head, it is input into the local perception enhancement network:
F_A = S'_i W_1 + b_1    #(8)
F_B = ReLU(F_A)    #(9)
F_C = CCNN(F_B)    #(10)
F_D = F_C W_2 + b_2    #(11)
where W_1, W_2, b_1 and b_2 are neural network parameters, ReLU() is the ReLU function, and CCNN is a causal convolution layer with kernel size 3, stride 1 and padding 2. The closer an event is, the more critical its influence on future events; by applying causal convolution over adjacent events, the local perception enhancement mechanism of the invention lets the model learn a more reasonable hidden representation of each event.
The result F_D is then weighted by β_i to obtain the final output of the branch, where β_i is a learned parameter with Σ_{i=1}^{H} β_i = 1. After residual connection and normalization, the fused branch output B is input into the next encoder layer; after learning through multiple encoder layers, the hidden representation B of the event sequence is finally output.
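The local perception enhancement step (Equations 8-11 plus the β weighting) can be sketched as below. The causal convolution is realized by padding and then dropping the trailing outputs so that position i only sees positions up to i; the hidden width d_ff and the softmax normalization of β are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalPerceptionEnhancement(nn.Module):
    """Sketch of the local perception enhancement module of one branch."""
    def __init__(self, d_model: int = 128, d_ff: int = 256, n_branches: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)                   # W_1, b_1
        self.ccnn = nn.Conv1d(d_ff, d_ff, kernel_size=3, stride=1, padding=2)
        self.fc2 = nn.Linear(d_ff, d_model)                   # W_2, b_2
        self.beta = nn.Parameter(torch.ones(n_branches))      # learned branch weights
        self.norm = nn.LayerNorm(d_model)

    def forward(self, s: torch.Tensor, branch_idx: int) -> torch.Tensor:
        f = F.relu(self.fc1(s))                                # Eqs. 8-9
        # Eq. 10: convolve over the time axis and drop the last two outputs,
        # so each position only aggregates itself and the two preceding events.
        f = self.ccnn(f.transpose(1, 2))[..., :-2]
        f = self.fc2(f.transpose(1, 2))                        # Eq. 11
        beta = F.softmax(self.beta, dim=0)[branch_idx]         # branch weights sum to 1
        return self.norm(beta * f + s)                         # weighted output + residual + norm
```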
In the multi-branch self-attention Transformer sequence history representation encoding network architecture, each branch consists of a weighted self-attention module and a local perception enhancement module; fusing the differently weighted branches allows the model to learn more accurate sequence information.
The event prediction module predicts future events from the hidden representation of the input event sequence. The hidden representation is fed into separate prediction networks to obtain the predicted event type and event time. The detailed operation is described below:
Given the historical hidden representation h_i of an event (k_i, t_i), where h_i = B(i, :), the probability distribution of the next event type is obtained through a fully connected layer:
k̂_{i+1} = argmax(softmax(h_i W_type + b_type))
where W_type and b_type are the parameters of the fully connected layer for event type prediction and argmax() selects the most probable type of the next event k̂_{i+1}.
Similarly, the occurrence time of the next event is predicted by:
t̂_{i+1} = h_i W_time + b_time
where W_time and b_time are the parameters of the fully connected layer for event time prediction.
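A sketch of the two prediction heads described above: one fully connected layer producing a distribution over event types, and one producing a scalar time. The head and variable names are illustrative.

```python
import torch
import torch.nn as nn

class EventPredictor(nn.Module):
    """Sketch of the event prediction module: type head and time head over h_i."""
    def __init__(self, d_model: int = 128, num_types: int = 10):
        super().__init__()
        self.type_head = nn.Linear(d_model, num_types)   # W_type, b_type
        self.time_head = nn.Linear(d_model, 1)            # W_time, b_time

    def forward(self, h: torch.Tensor):
        type_logits = self.type_head(h)                                   # scores per event type
        next_type = type_logits.softmax(dim=-1).argmax(dim=-1)            # most probable next type
        next_time = self.time_head(h).squeeze(-1)                         # predicted occurrence time
        return next_type, next_time
```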
The event prediction method based on the multi-branch self-attention Transformer Hawkes process model thus comprises the multi-branch self-attention Transformer sequence history representation encoding network and the event prediction network module.
Finally, the multi-branch self-attention Transformer Hawkes process model is trained as follows:
The training loss of the model consists of the likelihood loss, the time loss and the type loss; the model is trained with the Adam optimizer, and the loss is formulated as follows:
where R is the number of sequences in the data set, and γ_type and γ_time are hyperparameters that help keep training stable. First, the time loss L_time and the type loss L_type are computed from the predicted and ground-truth event times and types, as sketched below.
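The formulas for these two losses are not reproduced in the converted text; the sketch below shows an assumed standard form (squared-error time loss, cross-entropy type loss, and a weighted combination with the negative log-likelihood). The specific loss forms, the weighting scheme and the function names are illustrative assumptions, not the patent's exact definitions.

```python
import torch
import torch.nn.functional as F

def training_loss(log_likelihood, pred_times, true_times, type_logits, true_types,
                  gamma_time: float = 0.01, gamma_type: float = 1.0):
    """Assumed combination of the three loss terms for one sequence:
    negative log-likelihood + gamma_type * type loss + gamma_time * time loss."""
    time_loss = F.mse_loss(pred_times, true_times, reduction='sum')        # assumed L_time
    type_loss = F.cross_entropy(type_logits.transpose(1, 2), true_types,   # assumed L_type
                                reduction='sum')
    return -log_likelihood + gamma_type * type_loss + gamma_time * time_loss
```

During training this loss would be summed over the R sequences in the data set and minimized with the Adam optimizer, as stated above.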
to maximize likelihood function, the likelihood loss is calculatedTaking a negative value, the calculation of the likelihood function comprises two parts, namely event log likelihood and non-event log likelihood, and the calculation mode is as follows:
event log likelihoodThe calculation mode of (2) is as follows:
In an event sequence, the relationship between the occurrence probability and the occurrence time of each type of event is determined by the corresponding conditional intensity function. We use the obtained hidden representation of the event sequence to compute the conditional intensity function λ(t|H_t), where H_t = {(t_j, k_j) : t_j < t} is the history of events before time t. We define a different conditional intensity function for each class of event: in a sequence with M event categories, for each event category k ∈ {1, 2, ..., M} there is a conditional intensity function λ_k(t|H_t):
where softplus(·) is the softplus function and β_k is its softness parameter.
The conditional intensity function of the whole sequence is the sum of the per-category intensities:
λ(t|H_t) = Σ_{k=1}^{M} λ_k(t|H_t)
log likelihood for non-eventsBecause of the existence of the softplus function, the integral is not calculated in a closed form, proper approximation is needed, and the methods for approximating the non-event log likelihood are Monte Carlo integral, numerical integral method and the like.
The Monte Carlo integration method approximates the integral by averaging the intensity at points u_i sampled uniformly from each interval [t_{j-1}, t_j] and scaling by the interval length. The resulting estimate is unbiased, but the sampling makes it computationally expensive.
The numerical integration method approximates the integral without sampling; it is therefore faster, but introduces a small bias.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a novel multi-branch self-attention Transformer Hawkes process model for event prediction. Sequence information obtained from multiple perspectives is fused in a differentiated way, enabling more accurate learning of the sequence information and improving the accuracy of event prediction.
2. The invention provides a local perception enhancement mechanism that further processes the local context information of the sequence and enhances the local perception of the model, further improving modeling performance and the accuracy of event prediction.
Drawings
Fig. 1 is a schematic structural diagram of the event prediction method based on the multi-branch self-attention Transformer Hawkes process model.
FIG. 2 is a schematic diagram of the multi-branch self-attention Transformer sequence history representation encoding network.
FIG. 3 is a schematic diagram of the event prediction module.
FIG. 4 compares the log-likelihood of the multi-branch self-attention Transformer Hawkes process model with event prediction models of other network structures on four real-world data sets (Retweets, MIMIC-II, StackOverflow, Financial) and two synthetic data sets.
FIG. 5 compares event type prediction accuracy.
FIG. 6 compares the prediction error of event occurrence time.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The invention is further illustrated in the following figures and examples.
Fig. 1 is a schematic structural diagram of the event prediction method based on the multi-branch self-attention Transformer Hawkes process model. As shown in Fig. 1, the overall event prediction framework consists of two main parts, a sequence history representation encoding module and an event prediction module; the sequence history representation encoding module comprises the multi-branch self-attention and the local perception enhancement.
FIG. 2 is a schematic diagram of the multi-branch self-attention Transformer sequence history representation encoding network. As shown in FIG. 2, the embedded code X of a given event sequence is input into the multi-branch Transformer encoder layers with enhanced local attention. Each encoder layer contains H branches, and each branch has a self-attention module and a local attention enhancement network module. In each branch, X is projected by three transformation matrices to generate Q, K and V:
Q = X W_Q, K = X W_K, V = X W_V    #(3)
where W_Q, W_K ∈ R^(D×D_K) and W_V ∈ R^(D×D_V) are linear projection matrices, with D_K = D_V = 64.
To prevent the event at time t_i from attending to events at later times t_j (j > i), a mask is applied and the (i, j) entries with j > i are set to 0.
The output S_i of the attention head is computed from Q, K and V:
S_i = softmax(Q K^T / sqrt(D_K)) V    #(4)
where K^T denotes the matrix transpose of K and softmax() is the softmax function.
The outputs of the self-attention modules in the different branches are given different weights and aggregated through a transformation matrix:
S'_i = α_i S_i W_A    #(5)
where W_A ∈ R^(D_V×D) is an aggregation matrix and α_i is a learned parameter with Σ_{i=1}^{H} α_i = 1, so that the sequence information learned from different perspectives is given different importance. Residual connection and layer normalization then give the final output S'_i:
S'_i = S'_i + X    #(6)
S'_i = LayerNorm(S'_i)    #(7)
As shown in FIG. 2, the local perception enhancement network (Local Perception Enhancement Networks) of the invention further processes the local context information of the sequence, enhancing the local perception of the model and thus further improving modeling performance.
The local perception enhancement module further learns the local context information of the representation under a single perspective through causal convolution, improving the model's local perception of the representation and the accuracy of its overall understanding of the sequence. After obtaining the output S'_i of a single attention head, it is input into the local perception enhancement network:
F_A = S'_i W_1 + b_1    #(8)
F_B = ReLU(F_A)    #(9)
F_C = CCNN(F_B)    #(10)
F_D = F_C W_2 + b_2    #(11)
where W_1, W_2, b_1 and b_2 are neural network parameters, ReLU() is the ReLU function, and CCNN is a causal convolution layer with kernel size 3, stride 1 and padding 2. The closer an event is, the more critical its influence on future events; by applying causal convolution over adjacent events, the local perception enhancement mechanism of the invention lets the model learn a more reasonable hidden representation of each event.
The result F_D is then weighted by β_i to obtain the final output of the branch, where β_i is a learned parameter with Σ_{i=1}^{H} β_i = 1. After residual connection and normalization, the fused branch output B is input into the next encoder layer; after learning through multiple encoder layers, the hidden representation B of the event sequence is finally output.
FIG. 3 is a schematic diagram of the event prediction module. The event prediction module predicts future events from the hidden representation of the input event sequence. The hidden representation is fed into separate prediction networks to obtain the predicted event type and event time. The detailed operation is described below:
Given the historical hidden representation h_i of an event (k_i, t_i), where h_i = B(i, :), the probability distribution of the next event type is obtained through a fully connected layer:
k̂_{i+1} = argmax(softmax(h_i W_type + b_type))
where W_type and b_type are the parameters of the fully connected layer for event type prediction and argmax() selects the most probable type of the next event k̂_{i+1}.
Similarly, the occurrence time of the next event is predicted by:
t̂_{i+1} = h_i W_time + b_time
where W_time and b_time are the parameters of the fully connected layer for event time prediction.
FIG. 4 compares the log-likelihood of the multi-branch self-attention Transformer Hawkes process model with event prediction models of other network structures on four real-world data sets (Retweets, MIMIC-II, StackOverflow, Financial) and two synthetic data sets. As shown in FIG. 4, the multi-branch self-attention Transformer Hawkes process model models the data more accurately than the other models.
FIG. 5 compares event type prediction accuracy. As shown in FIG. 5, the multi-branch self-attention Transformer Hawkes process model achieves higher accuracy in predicting the type of the next event.
FIG. 6 compares the prediction error of event occurrence time. As shown in FIG. 6, the multi-branch self-attention Transformer Hawkes process model achieves a smaller root mean square error in predicting event occurrence times.
The invention provides a multi-branch self-attention Transformer Hawkes process model for event prediction. Sequence information obtained from multiple perspectives is fused in a differentiated way, enabling more accurate learning of the sequence information. In addition, a local perception enhancement module is added to further process the local context information of the sequence and enhance the local perception of the model, which further improves modeling performance and the accuracy of event prediction. Extensive experiments on four real-world data sets (Retweets, MIMIC-II, StackOverflow, Financial) and two synthetic data sets show that the model achieves good results in sequence modeling and event prediction. In future work, we will continue to explore how to better learn the dependency information of sequences and integrate it effectively for future event prediction.
Finally, the above examples are provided only to illustrate the invention; any modification, improvement or substitution of the above embodiments shall fall within the scope of the claims of the invention.

Claims (2)

1. An event prediction method based on a multi-branch self-attention Transformer Hawkes process model, characterized in that
the method comprises the following steps:
S1, constructing a multi-branch self-attention module, which extracts a more accurate historical representation of the sequence through differentiated fusion of multi-perspective information.
S2, constructing a local perception enhancement module, which further processes local context information through causal convolution, enhancing local perception to improve modeling performance.
S3, constructing a multi-branch self-attention Transformer sequence history representation encoding network architecture by combining the networks of S1 and S2.
S4, constructing an event prediction module, which predicts the occurrence of future events from the sequence history representation extracted in S3.
S5, constructing the multi-branch self-attention Transformer Hawkes process model framework by combining the networks of S3 and S4.
S6, training the multi-branch self-attention Transformer Hawkes process model.
The event prediction method based on a multi-branch self-attention Transformer Hawkes process model according to claim 1, wherein the specific process of S1 is:
The multi-branch self-attention module differentiates the different pieces of representation information obtained from multiple perspectives according to the input, so as to extract a more accurate hidden representation of the sequence. By considering the contribution of each piece of representation information to the final hidden representation of the sequence, corresponding weights are assigned to the outputs of the different representation subspaces, which reduces the bias in the overall understanding of the sequence. The detailed operation is described below:
Given a data set whose events fall into M categories, for any event sequence in the data set, each event type k_i is one-hot encoded (an M-dimensional vector whose k_i-th element is 1 and whose remaining elements are 0) to obtain k'_i, which is then multiplied by an embedding matrix to generate the event type code c_i:
C = W_e K'    #(1)
where W_e ∈ R^(D×M) is an embedding matrix, so each event category is represented by a D-dimensional code vector, with D = 128.
The time information is encoded with sine and cosine functions (Equation 2): for the time t_i we use p(t_i) as its encoding, where [p(t_i)]_j is the j-th element of the temporal code.
In this way we obtain the event type code C and the event time code P of the sequence, and the embedded code of the whole event sequence can be expressed as X = (C + P)^T. The embedded codes of all events are then input into the multi-branch self-attention Transformer sequence history representation encoding network.
Given the embedded code X of an event sequence, X is input into the multi-branch Transformer encoder layers with enhanced local attention. Each encoder layer contains H branches, and each branch has a self-attention module and a local attention enhancement network module. In each branch, X is projected by three transformation matrices to generate Q, K and V:
Q = X W_Q, K = X W_K, V = X W_V    #(3)
where W_Q, W_K ∈ R^(D×D_K) and W_V ∈ R^(D×D_V) are linear projection matrices, with D_K = D_V = 64.
To prevent the event at time t_i from attending to events at later times t_j (j > i), a mask is applied and the (i, j) entries with j > i are set to 0.
The output S_i of the attention head is computed from Q, K and V:
S_i = softmax(Q K^T / sqrt(D_K)) V    #(4)
where K^T denotes the matrix transpose of K and softmax() is the softmax function.
The outputs of the different attention heads are given different weights and aggregated through a transformation matrix:
S'_i = α_i S_i W_A    #(5)
where W_A ∈ R^(D_V×D) is an aggregation matrix and α_i is a learned parameter with Σ_{i=1}^{H} α_i = 1, so that the sequence information learned from different perspectives is given different importance. Residual connection and layer normalization then give the final output S'_i:
S'_i = S'_i + X    #(6)
S'_i = LayerNorm(S'_i)    #(7)
The event prediction method based on a multi-branch self-attention Transformer Hawkes process model according to claim 1, wherein the specific process of S2 is:
The local perception enhancement module further learns the local context information of the representation under a single perspective through causal convolution, improving the model's local perception of the representation and the accuracy of its overall understanding of the sequence. After obtaining the output S'_i of a single attention head, it is input into the local perception enhancement network:
F_A = S'_i W_1 + b_1    #(8)
F_B = ReLU(F_A)    #(9)
F_C = CCNN(F_B)    #(10)
F_D = F_C W_2 + b_2    #(11)
where W_1, W_2, b_1 and b_2 are neural network parameters, ReLU() is the ReLU function, and CCNN is a causal convolution layer with kernel size 3, stride 1 and padding 2. The closer an event is, the more critical its influence on future events; by applying causal convolution over adjacent events, the local perception enhancement mechanism of the invention lets the model learn a more reasonable hidden representation of each event.
The result F_D is then weighted by β_i to obtain the final output of the branch, where β_i is a learned parameter with Σ_{i=1}^{H} β_i = 1. After residual connection and normalization, the fused branch output B is input into the next encoder layer; after learning through multiple encoder layers, the hidden representation B of the event sequence is finally output.
The event prediction method based on a multi-branch self-attention Transformer Hawkes process model according to claim 1, wherein the specific process of S3 is:
In the multi-branch self-attention Transformer sequence history representation encoding network architecture, each branch consists of a weighted self-attention module and a local perception enhancement module; fusing the differently weighted branches allows the model to learn more accurate sequence information.
The event prediction method based on a multi-branch self-attention Transformer Hawkes process model according to claim 1, wherein the specific process of S4 is:
The event prediction module predicts future events from the hidden representation of the input event sequence. The hidden representation is fed into separate prediction networks to obtain the predicted event type and event time. The detailed operation is described below:
Given the historical hidden representation h_i of an event (k_i, t_i), where h_i = B(i, :), the probability distribution of the next event type is obtained through a fully connected layer:
k̂_{i+1} = argmax(softmax(h_i W_type + b_type))
where W_type and b_type are the parameters of the fully connected layer for event type prediction and argmax() selects the most probable type of the next event k̂_{i+1}.
Similarly, the occurrence time of the next event is predicted by:
t̂_{i+1} = h_i W_time + b_time
where W_time and b_time are the parameters of the fully connected layer for event time prediction.
The event prediction method based on a multi-branch self-attention Transformer Hawkes process model according to claim 1, wherein the specific process of S5 is:
The event prediction method based on the multi-branch self-attention Transformer Hawkes process model comprises the multi-branch self-attention Transformer sequence history representation encoding network and the event prediction network module.
2. The event prediction method based on a multi-branch self-attention Transformer Hawkes process model according to claim 1, wherein the specific process of S6 is:
The multi-branch self-attention Transformer Hawkes process model is trained as follows:
The training loss of the model consists of the likelihood loss, the time loss and the type loss; the model is trained with the Adam optimizer, and the loss is formulated as follows:
where R is the number of sequences in the data set, and γ_type and γ_time are hyperparameters that help keep training stable. First, the time loss L_time and the type loss L_type are computed from the predicted and ground-truth event times and types.
To maximize the likelihood function, the likelihood loss L_likelihood is taken as the negative log-likelihood. The log-likelihood consists of two parts, the event log-likelihood and the non-event log-likelihood, computed as follows:
The event log-likelihood is the sum, over all observed events, of the logarithm of the conditional intensity at the event time.
In an event sequence, the relationship between the occurrence probability and the occurrence time of each type of event is determined by the corresponding conditional intensity function. We use the obtained hidden representation of the event sequence to compute the conditional intensity function λ(t|H_t), where H_t = {(t_j, k_j) : t_j < t} is the history of events before time t. We define a different conditional intensity function for each class of event: in a sequence with M event categories, for each event category k ∈ {1, 2, ..., M} there is a conditional intensity function λ_k(t|H_t):
where softplus(·) is the softplus function and β_k is its softness parameter.
The conditional intensity function of the whole sequence is the sum of the per-category intensities:
λ(t|H_t) = Σ_{k=1}^{M} λ_k(t|H_t)
For the non-event log-likelihood, because of the softplus function the integral has no closed form and must be approximated; common approximations include Monte Carlo integration and numerical integration.
The Monte Carlo integration method approximates the integral by averaging the intensity at points u_i sampled uniformly from each interval [t_{j-1}, t_j] and scaling by the interval length. The resulting estimate is unbiased, but the sampling makes it computationally expensive.
The numerical integration method approximates the integral without sampling; it is therefore faster, but introduces a small bias.
CN202310616233.3A 2023-05-29 2023-05-29 Event prediction method of Transformer Hawkes process model based on multi-branch self-attention Pending CN116596150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310616233.3A 2023-05-29 2023-05-29 Event prediction method of Transformer Hawkes process model based on multi-branch self-attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310616233.3A 2023-05-29 2023-05-29 Event prediction method of Transformer Hawkes process model based on multi-branch self-attention

Publications (1)

Publication Number Publication Date
CN116596150A 2023-08-15

Family

ID=87611388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310616233.3A Pending CN116596150A (en) 2023-05-29 Event prediction method of Transformer Hawkes process model based on multi-branch self-attention

Country Status (1)

Country Link
CN (1) CN116596150A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332377A (en) * 2023-12-01 2024-01-02 西南石油大学 Discrete time sequence event mining method and system based on deep learning
CN117332377B (en) * 2023-12-01 2024-02-02 西南石油大学 Discrete time sequence event mining method and system based on deep learning
CN117493068A (en) * 2024-01-03 2024-02-02 安徽思高智能科技有限公司 Root cause positioning method, equipment and storage medium for micro-service system
CN117493068B (en) * 2024-01-03 2024-03-26 安徽思高智能科技有限公司 Root cause positioning method, equipment and storage medium for micro-service system

Similar Documents

Publication Publication Date Title
CN116596150A (en) Event prediction method of Transformer Hawkes process model based on multi-branch self-attention
CN111985205A (en) Aspect level emotion classification model
CN111626764A (en) Commodity sales volume prediction method and device based on Transformer + LSTM neural network model
CN111242351A (en) Tropical cyclone track prediction method based on self-encoder and GRU neural network
CN114896434B (en) Hash code generation method and device based on center similarity learning
CN115222998B (en) Image classification method
CN110956309A (en) Flow activity prediction method based on CRF and LSTM
CN115630651B (en) Text generation method and training method and device of text generation model
CN115757919A (en) Symmetric deep network and dynamic multi-interaction human resource post recommendation method
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN112559741B (en) Nuclear power equipment defect record text classification method, system, medium and electronic equipment
CN117076931B (en) Time sequence data prediction method and system based on conditional diffusion model
CN114003900A (en) Network intrusion detection method, device and system for secondary system of transformer substation
CN116258504A (en) Bank customer relationship management system and method thereof
CN115422945A (en) Rumor detection method and system integrating emotion mining
CN113011495A (en) GTN-based multivariate time series classification model and construction method thereof
Nayak et al. Learning a sparse dictionary of video structure for activity modeling
CN111158640B (en) One-to-many demand analysis and identification method based on deep learning
CN110879833B (en) Text prediction method based on light weight circulation unit LRU
CN113361505B (en) Non-specific human sign language translation method and system based on contrast decoupling element learning
CN118036657A (en) AIS data-based method for bidirectional RNN track completion method with encoding and decoding functions
CN115796365A (en) Financial time sequence prediction method and device based on predictable factor decomposition
Zhu et al. SEREVIT: A Novel Vision Transformer Based Double-Attention Network for Detecting Pneumonia
CN115358240A (en) Named entity identification method for intelligent financial product recommendation system
Wang et al. Well Logging Stratigraphic Correlation Algorithm Based on Semantic Segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication