WO2023229094A1

WO2023229094A1 - Method and apparatus for predicting actions

Info

Publication number: WO2023229094A1
Application number: PCT/KR2022/008848
Authority: WO
Inventors: 공다영; 이준석; 김만진; 조민수; 하성종
Original assignee: 주식회사 엔씨소프트; 포항공과대학교 산학협력단
Priority date: 2022-05-27
Filing date: 2022-06-22
Publication date: 2023-11-30

Abstract

A method and an apparatus for predicting actions are disclosed. According to one embodiment, a method for predicting actions comprises the operations of: generating, on the basis of features extracted from an input frame sequence, an encoder input matrix corresponding to the input frame sequence; generating an encoder output matrix based on the encoder input matrix by using an encoder composed of one or more encoder layers including a first self-attention layer as a sublayer; generating a decoder output matrix based on the encoder output matrix and an action query matrix by using a decoder composed of one or more decoder layers including a second self-attention layer and a cross-attention layer as sublayers; identifying, on the basis of the encoder output matrix, one or more actions observed in the input frame sequence; and predicting, in parallel, on the basis of the decoder output matrix, one or more future actions that will occur sequentially after the one or more actions.

Description

Behavior prediction methods and devices

The disclosed embodiments relate to behavior prediction technology.

Behavior prediction technology, which predicts future behavior based on behavior observed in a video, is an important research area for the development of artificial intelligence-based systems, for example, robots that provide services based on predicted human behavior.

Previously proposed behavior prediction techniques predict future behavior from the labels of the behavior observed in the video rather than from the visual information of the video itself, so prediction performance depends on the labels of the input behavior.

In addition, previously proposed behavior prediction techniques predict future behaviors sequentially, with each prediction conditioned on the previously predicted future behavior. If an earlier prediction is wrong, the error accumulates and affects the subsequent predictions, and the sequential structure also makes prediction slow.

In addition, previously proposed behavior prediction techniques compress the information about the behavior observed in a video into a single summary vector and predict future behavior from that vector, so detailed information about the observed behavior cannot be properly utilized.

The disclosed embodiments are intended to provide an apparatus and method for behavior prediction.

A behavior prediction method according to an embodiment includes: generating an encoder input matrix corresponding to an input frame sequence based on features extracted from the input frame sequence; generating an encoder output matrix based on the encoder input matrix using an encoder composed of one or more encoder layers including a first self-attention layer as a sub-layer; generating a decoder output matrix based on the encoder output matrix and an action query matrix using a decoder composed of one or more decoder layers including a second self-attention layer and a cross-attention layer as sub-layers; identifying one or more actions observed in the input frame sequence based on the encoder output matrix; and predicting, in parallel, one or more future actions that will occur sequentially after the one or more actions based on the decoder output matrix.

The action query matrix may be a learnable parameter determined through training of a prediction model including the encoder and the decoder.

The encoder may include a plurality of encoder layers that each include the first self-attention layer as a sub-layer and are executed sequentially, and the encoder output matrix may be the output matrix of the last encoder layer among the plurality of encoder layers.

The first self-attention layer included in the first encoder layer among the plurality of encoder layers may perform self-attention based on a matrix generated by combining the encoder input matrix and a position embedding matrix, and the first self-attention layer included in the n-th encoder layer (where n is a natural number satisfying 2≤n≤N, and N is the number of the plurality of encoder layers) may perform self-attention based on a matrix generated by combining the output matrix of the (n-1)-th encoder layer among the plurality of encoder layers and the position embedding matrix.

The first self-attention layer may include a plurality of attention heads that each perform self-attention based on a matrix input to the first self-attention layer.

The decoder may include a plurality of decoder layers that each include the second self-attention layer and the cross-attention layer as sub-layers and are executed sequentially, and the decoder output matrix may be the output matrix of the last decoder layer among the plurality of decoder layers.

The second self-attention layer included in the first decoder layer among the plurality of decoder layers may perform self-attention based on a matrix generated by combining a preset initial decoder input matrix and the action query matrix, and the second self-attention layer included in the m-th decoder layer (where m is a natural number satisfying 2≤m≤M, and M is the number of the plurality of decoder layers) may perform self-attention based on a matrix generated by combining the output matrix of the (m-1)-th decoder layer among the plurality of decoder layers and the action query matrix.

The second self-attention layer may include a plurality of attention heads that each perform self-attention based on a matrix input to the second self-attention layer.

The cross attention layer may perform cross attention based on a first input matrix, which combines the encoder output matrix and a position embedding matrix, and a second input matrix, which combines the action query matrix with a matrix generated by performing layer normalization on the self-attention matrix generated by the second self-attention layer.

The cross attention layer may include a plurality of attention heads that respectively perform cross attention based on the first input matrix and the second input matrix.

The behavior prediction method may further include predicting, in parallel, a duration for each of the one or more future actions based on the decoder output matrix.

A behavior prediction device according to an embodiment includes one or more processors and a memory storing one or more programs executed by the one or more processors, wherein the one or more processors: generate an encoder input matrix corresponding to an input frame sequence based on features extracted from the input frame sequence; generate an encoder output matrix based on the encoder input matrix using an encoder composed of one or more encoder layers including a first self-attention layer as a sub-layer; generate a decoder output matrix based on the encoder output matrix and an action query matrix using a decoder composed of one or more decoder layers including a second self-attention layer and a cross-attention layer as sub-layers; identify one or more actions observed in the input frame sequence based on the encoder output matrix; and predict, in parallel, one or more future actions that will occur sequentially after the one or more actions based on the decoder output matrix.

The one or more processors may predict a duration for each of the one or more future actions in parallel based on the decoder output matrix.

According to the disclosed embodiments, future actions that will occur sequentially are predicted in parallel using features of the frame sequence extracted from the video. Visual information of the frame sequence is thereby used for the prediction, and the accumulation of errors that occurs when future actions are predicted one after another can be prevented. Accordingly, the accuracy of predicting future behavior can be improved and the time required for prediction can be reduced.

Figure 1 is a diagram showing the configuration of an artificial neural network-based behavior prediction model according to an embodiment.

Figure 2 is a diagram showing the configuration of an encoder and decoder according to an embodiment.

Figure 3 is a diagram showing the configuration of an encoder layer according to one embodiment.

Figure 4 is a diagram illustrating an example of a first self-attention layer according to an embodiment.

Figure 5 is a diagram showing the configuration of a decoder layer according to one embodiment.

Figure 6 is a diagram illustrating an example of a second self-attention layer according to an embodiment.

Figure 7 is a diagram illustrating an example of a cross attention layer according to an embodiment.

Figure 8 is a flowchart of a behavior prediction method according to one embodiment.

FIG. 9 is a block diagram illustrating a computing environment including a computing device according to an embodiment.

Hereinafter, specific embodiments will be described with reference to the drawings. The detailed description below is provided to provide a comprehensive understanding of the methods, devices and/or systems described herein. However, this is only an example and the present invention is not limited thereto.

In describing the embodiments, if it is determined that a detailed description of related known technology may unnecessarily obscure the point, the detailed description will be omitted. In addition, the terms described below are defined in consideration of their functions and may vary depending on the intention or custom of the user or operator. Therefore, their definitions should be based on the content of this specification as a whole. The terminology used in the detailed description is only for describing embodiments and should in no way be limiting. Unless explicitly stated otherwise, singular forms include plural meanings. In this description, expressions such as “comprising” or “including” are intended to indicate certain features, numbers, steps, operations, elements, or parts or combinations thereof, and should not be construed to exclude the existence or possibility of one or more other features, numbers, steps, operations, elements, or parts or combinations thereof beyond those described.

Hereinafter, a matrix refers to a rectangular array composed of two or more values. Unless otherwise specified, the term matrix is used to encompass not only a matrix composed of multiple rows and multiple columns, but also a matrix composed of one column and multiple rows (i.e., a column vector) and a matrix composed of one row and multiple columns (i.e., a row vector).

Below, the attention for any two matrices $A$ and $B$ is defined as Equation 1 below.

[Equation 1]

$$\mathrm{Attention}(A, B) = \sigma\big(\mathrm{score}(Q, K)\big)\,V$$

At this time, σ represents the softmax function and score() represents the attention score function used to calculate the attention score. The attention score function may be, for example, a scaled dot-product function, but depending on the embodiment, various other types of attention score functions may be used.

Meanwhile, Q represents a query matrix generated from matrix A, and K and V represent a key matrix and a value matrix generated from matrix B, respectively.

Meanwhile, in the following, self-attention is defined as attention for two identical matrices (i.e., when A=B), and, to distinguish it from self-attention, attention for two different matrices (i.e., when A≠B) will be referred to as cross-attention.
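The attention definition above can be illustrated with a minimal NumPy sketch. It assumes the scaled dot-product score function that the text names as one option; the function names, the random weights, and the matrix sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(A, B, W_q, W_k, W_v):
    """Attention(A, B): query from A, key and value from B.

    Assumption: score() is the scaled dot-product function.
    """
    Q = A @ W_q
    K = B @ W_k
    V = B @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8))   # 4 rows, 8 columns
B = rng.normal(size=(6, 8))   # 6 rows, 8 columns
# Random stand-ins for weights that would normally be learned.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

self_out = attention(A, A, W_q, W_k, W_v)    # self-attention (A = B)
cross_out = attention(A, B, W_q, W_k, W_v)   # cross-attention (A != B)
print(self_out.shape, cross_out.shape)
```

In both cases the output has one row per row of A, because the query matrix is generated from A; only the source of the keys and values differs.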

Referring to FIG. 1, the behavior prediction model 100 according to one embodiment includes an input unit 110, an encoder 120, a decoder 130, a behavior classification unit 140, and a behavior prediction unit 150.

According to one embodiment, the behavior prediction model 100 may be an artificial neural network-based model that identifies, in a video of a moving object such as a person or an animal, the action taken by the object and predicts the future action that the object will take after the identified action.

The input unit 110 generates an encoder input matrix corresponding to the input frame sequence based on features extracted from the input frame sequence.

According to one embodiment, the input frame sequence may include a plurality of frames sequentially extracted from at least part of the entire playback section of the video. For example, if a video consists of T frames, the input frame sequence may include $T_0$ frames sequentially sampled at a preset sampling interval from the first frame to the αT-th frame of the video. That is, the number $T_0$ of frames included in the input frame sequence may satisfy Equation 2 below.

[Equation 2]

$$T_0 = \left\lfloor \frac{\alpha T}{\tau} \right\rfloor$$

Meanwhile, in Equation 2, τ represents the preset sampling interval, α represents the observation rate for the video, and α may be preset to a value between 0 and 1.
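As a rough illustration of the sampling described above, the following sketch computes which frames would be selected. It assumes that the number of sampled frames is the integer part of αT/τ and that sampling starts at the first frame; the function name and the 0-based indexing are hypothetical.

```python
import math

def sample_frame_indices(T, alpha, tau):
    """Return 0-based indices of frames sampled every `tau` frames
    from the observed portion (the first alpha * T frames) of a video.

    Assumption: the number of sampled frames is floor(alpha * T / tau).
    """
    T0 = math.floor(alpha * T / tau)
    return [k * tau for k in range(T0)]

indices = sample_frame_indices(T=1000, alpha=0.5, tau=3)
print(len(indices), indices[:4], indices[-1])  # 166 [0, 3, 6, 9] 495
```

With an observation rate of 0.5, only the first 500 of the 1000 frames are observed, and a sampling interval of 3 reduces them to 166 input frames.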

Meanwhile, according to one embodiment, the encoder input matrix may be generated based on features extracted from each frame included in the input frame sequence. Specifically, the input unit 110 may extract feature vectors for each frame included in the input frame sequence. At this time, the feature vector for each frame may be a vector of a preset dimension extracted from each frame using, for example, a feature extraction model based on a convolutional neural network (CNN).

As a specific example, the feature vector for each frame may be extracted using a 3D convolutional neural network that can extract spatial and temporal features of each frame included in the input frame sequence, such as the I3D (Inflated 3D ConvNet) model. However, the method of extracting a feature vector from an input frame sequence is not necessarily limited to the above-described example, and various methods for extracting visual features of an image may be used depending on the embodiment.

Meanwhile, according to one embodiment, the input unit 110 may generate an encoder input matrix based on a feature matrix composed of the feature vectors for each frame included in the input frame sequence. Specifically, the input unit 110 may generate a matrix of a preset size from the feature matrix using a linear layer and then generate the encoder input matrix using an activation function. As a specific example, when the activation function is ReLU (Rectified Linear Unit), the encoder input matrix may be generated according to Equation 3 below.

[Equation 3]

$$X_0 = \mathrm{ReLU}(F\,W_F)$$

At this time, $F \in \mathbb{R}^{T_0 \times C}$ is the feature matrix, $W_F \in \mathbb{R}^{C \times D}$ is the weight matrix multiplied by the linear layer, and $X_0 \in \mathbb{R}^{T_0 \times D}$ represents the encoder input matrix. Additionally, C and D are each preset hyperparameters.
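The linear layer followed by ReLU can be sketched as below. This is only an illustration: the sizes, the random weights, and the absence of a bias term are assumptions, and the random feature matrix stands in for features that would come from a feature extraction model.

```python
import numpy as np

def encoder_input(features, weight):
    """Project a (T0, C) feature matrix to (T0, D) and apply ReLU.

    Assumption: the linear layer has no bias term.
    """
    return np.maximum(features @ weight, 0.0)

rng = np.random.default_rng(0)
T0, C, D = 5, 16, 8                        # hypothetical sizes
features = rng.normal(size=(T0, C))        # stand-in for per-frame features
weight = rng.normal(size=(C, D)) * 0.1     # stand-in for a learned weight
X0 = encoder_input(features, weight)
print(X0.shape)  # (5, 8)
```

Each row of the result corresponds to one sampled frame, now expressed in the D-dimensional space used by the encoder layers.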

The encoder 120 includes one or more encoder layers including a first self-attention layer as a sub-layer, and generates an encoder output matrix based on the encoder input matrix using the one or more encoder layers.

According to one embodiment, the first self-attention layer may be an artificial neural network that performs self-attention based on the input matrix of the encoder layer.

The decoder 130 includes one or more decoder layers including a second self-attention layer and a cross-attention layer as sub-layers, and generates a decoder output matrix based on the action query matrix and the encoder output matrix using the one or more decoder layers.

According to one embodiment, the second self-attention layer may be an artificial neural network that performs self-attention based on the action query matrix. Additionally, the cross attention layer may be an artificial neural network that performs cross attention based on the encoder output matrix and the action query matrix.

Specifically, Figure 2 is a diagram showing the configuration of an encoder and decoder according to an embodiment.

Referring to FIG. 2, the encoder 120 may be composed of N encoder layers 120-1, 120-2, and 120-N (where N is a natural number satisfying 1≤N), each of which includes a first self-attention layer and which are executed sequentially.

According to one embodiment, when the encoder 120 includes a plurality of encoder layers, the encoder layers (120-1, 120-2, 120-N) may have the same structure, and the first encoder layer (120-1) may use the encoder input matrix $X_0$ generated by the input unit 110 as input. In addition, each of the remaining encoder layers (120-2, 120-N) other than the first encoder layer (120-1) may use the matrix output from the previous encoder layer as input, and the encoder output matrix $X_N$ finally output from the encoder 120 may be the matrix output from the last encoder layer (120-N).

Meanwhile, the decoder 130 may be composed of M decoder layers 130-1, 130-2, and 130-M (where M is a natural number satisfying 1≤M), each of which includes a second self-attention layer and a cross-attention layer and which are executed sequentially.

According to one embodiment, the decoder layers (130-1, 130-2, 130-M) included in the decoder 130 may have the same structure, and the first decoder layer (130-1) may use the initial decoder input matrix $Y_0$ as input. In addition, each of the remaining decoder layers (130-2, 130-M) other than the first decoder layer (130-1) may use the matrix output from the previous decoder layer as input, and the decoder output matrix $Y_M$ finally output from the decoder 130 may be the matrix output from the last decoder layer (130-M).

Meanwhile, the number of encoder layers and decoder layers may change depending on the embodiment.

Specifically, each encoder layer 120-1, 120-2, and 120-N shown in FIG. 2 may have the same configuration as the encoder layer 310 shown in FIG. 3, for example.

Referring to FIG. 3, the encoder layer 310 may include a first self-attention layer 311, layer normalization (312, 314), and a feed-forward network (FFN) (313).

The first self-attention layer 311 may perform self-attention on a matrix $X'_{n-1}$ generated by combining the input matrix $X_{n-1}$ input to the encoder layer 310 (where 1≤n≤N) and the position embedding matrix $P$. At this time, the matrix $X'_{n-1}$ input to the first self-attention layer 311 may be generated by adding the input matrix $X_{n-1}$ and the position embedding matrix $P$ as shown in Equation 4 below.

[Equation 4]

$$X'_{n-1} = X_{n-1} + P$$

Meanwhile, when the encoder layer 310 is the first encoder layer (i.e., n=1), the input matrix $X_{n-1}$ input to the encoder layer 310 may be the encoder input matrix $X_0$ generated by the input unit 110. Additionally, if the encoder layer 310 is the second or a later encoder layer (i.e., 2≤n≤N), the input matrix $X_{n-1}$ input to the encoder layer 310 may be the output matrix of the (n-1)-th encoder layer.

Meanwhile, the position embedding matrix $P$ may be composed of $T_0$ position embedding vectors corresponding to the $T_0$ frames included in the input frame sequence. At this time, each of the $T_0$ position embedding vectors may be a D-dimensional vector indicating the position (order), within the input frame sequence, of the corresponding frame among the $T_0$ frames. For example, the position embedding matrix $P$ may be generated through positional encoding according to Equations 5 and 6 below.

[Equation 5]

$$P_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/D}}\right)$$

[Equation 6]

$$P_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/D}}\right)$$

At this time, pos represents the position within the input frame sequence, and i represents the index of the dimension within the position embedding vector.
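A positional encoding of this kind can be sketched as follows. The sketch assumes the widely used sinusoidal form, with sine on even dimensions and cosine on odd dimensions and a base of 10000; the function name and sizes are hypothetical.

```python
import numpy as np

def positional_embedding(T0, D):
    """Build a (T0, D) sinusoidal position embedding matrix.

    Assumption: even columns use sin, odd columns use cos,
    with frequency 1 / 10000^(2i / D).
    """
    pos = np.arange(T0)[:, None]            # frame positions 0 .. T0-1
    two_i = np.arange(0, D, 2)[None, :]     # even dimension indices 2i
    angle = pos / np.power(10000.0, two_i / D)
    P = np.zeros((T0, D))
    P[:, 0::2] = np.sin(angle)
    P[:, 1::2] = np.cos(angle[:, : D // 2])
    return P

P = positional_embedding(T0=6, D=8)
print(P.shape)  # (6, 8)
```

Because the matrix has one row per frame and D columns, it can be added element-wise to the encoder layer input, which has the same shape.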

Meanwhile, the first layer normalization 312 may perform a residual connection by adding the self-attention matrix $Z$, generated by the first self-attention layer 311 using the matrix $X'_{n-1}$, and the input matrix $X_{n-1}$ of the encoder layer 310, and then perform layer normalization. Specifically, the matrix $X''_{n-1}$ generated through the first layer normalization 312 may be expressed as Equation 7 below.

[Equation 7]

$$X''_{n-1} = \mathrm{LayerNorm}(Z + X_{n-1})$$

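Meanwhile, the matrix $X''_{n-1}$ generated through the first layer normalization 312 passes through the feed-forward network 313, and the second layer normalization 314 may perform a residual connection by adding the output of the feed-forward network 313 and the input matrix $X''_{n-1}$ of the feed-forward network 313, and then perform layer normalization to generate the output matrix $X_n$ of the encoder layer 310. Specifically, the output matrix $X_n$ of the encoder layer 310 may be expressed as Equation 8 below.

[Equation 8]

$$X_n = \mathrm{LayerNorm}\big(\mathrm{FFN}(X''_{n-1}) + X''_{n-1}\big)$$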
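The data flow through one encoder layer (position embedding added to the layer input, self-attention, and two residual-plus-normalization steps around the feed-forward network) can be sketched as below. The sub-layers are passed in as functions; the toy self-attention without learned projections, the two-layer ReLU feed-forward network, and the layer normalization without learnable scale and shift are all simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Assumption: no learnable scale or shift parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(X_prev, P, self_attention, ffn):
    Z = self_attention(X_prev + P)          # self-attention on X' = X + P
    X_mid = layer_norm(Z + X_prev)          # residual + first layer norm
    return layer_norm(ffn(X_mid) + X_mid)   # residual + second layer norm

rng = np.random.default_rng(0)
T0, D = 5, 8                                # hypothetical sizes
X0 = rng.normal(size=(T0, D))
P = rng.normal(size=(T0, D))                # stand-in position embedding
W1 = rng.normal(size=(D, 4 * D)) * 0.1      # stand-ins for learned weights
W2 = rng.normal(size=(4 * D, D)) * 0.1

def toy_self_attention(X):
    # Projection-free scaled dot-product self-attention (illustration only).
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

def toy_ffn(X):
    # Assumption: two linear layers with a ReLU in between.
    return np.maximum(X @ W1, 0.0) @ W2

X1 = encoder_layer(X0, P, toy_self_attention, toy_ffn)
print(X1.shape)  # (5, 8)
```

Note that the residual connection around the self-attention adds the layer input without the position embedding, as described in the text.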

Meanwhile, according to one embodiment, the first self-attention layer 311 may include a plurality of attention heads that perform self-attention in parallel on the matrix $X'_{n-1}$ input to the first self-attention layer 311. In this case, the self-attention matrix $Z$ generated by the first self-attention layer 311 may be generated based on the output matrices of the plurality of attention heads, each of which takes the input matrix $X'_{n-1}$ of the first self-attention layer 311 as input.

Specifically, each attention head included in the first self-attention layer 311 may generate a query matrix, a key matrix, and a value matrix from the matrix $X'_{n-1}$ input to the first self-attention layer 311, and may generate an output matrix by performing self-attention using the generated query matrix, key matrix, and value matrix.

For example, if the number of attention heads included in the first self-attention layer 311 is $h_1$, the output matrix $Z_k$ of the k-th attention head (where 1≤k≤$h_1$) may be generated according to Equation 9 below.

[Equation 9]

$$Z_k = \sigma\big(\mathrm{score}(Q_k, K_k)\big)\,V_k$$

At this time, the query matrix $Q_k$, key matrix $K_k$, and value matrix $V_k$ may each be generated according to Equations 10 to 12 below.

[Equation 10]

$$Q_k = X'_{n-1}\,W_k^{Q}$$

[Equation 11]

$$K_k = X'_{n-1}\,W_k^{K}$$

[Equation 12]

$$V_k = X'_{n-1}\,W_k^{V}$$

In Equations 10 to 12, $W_k^{Q}$, $W_k^{K}$, and $W_k^{V}$ represent weight matrices determined through training of the behavior prediction model 100, and may be different for each attention head.

Meanwhile, in Equation 9, score($Q_k$, $K_k$) may be, for example, a scaled dot-product function as in Equation 13 below, but depending on the embodiment, various other types of score functions may be used.

[Equation 13]

$$\mathrm{score}(Q_k, K_k) = \frac{Q_k K_k^{\top}}{\sqrt{d}}$$

Here, $d$ denotes the number of columns of the key matrix $K_k$.

Meanwhile, the first self-attention layer 311 may generate the self-attention matrix $Z$ using the output matrix of each attention head. Specifically, the first self-attention layer 311 may generate the self-attention matrix $Z$ by concatenating the output matrices of the attention heads as shown in Equation 14 below.

[Equation 14]

$$Z = \mathrm{Concat}(Z_1, \ldots, Z_{h_1})\,W^{O}$$

At this time, $W^{O}$ represents a weight matrix determined through training of the behavior prediction model 100.
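The multi-head self-attention described above (per-head query, key, and value projections, attention within each head, concatenation, and a final weight multiplication) can be sketched as follows. The scaled dot-product score, the per-head width of D divided by the number of heads, and all names and sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """X: (T0, D); W_q, W_k, W_v: one weight matrix per head; W_o: output weight."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv             # per-head projections
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # assumed scaled dot-product
        heads.append(softmax(scores) @ V)            # head output Z_k
    return np.concatenate(heads, axis=-1) @ W_o      # concatenate, then project

rng = np.random.default_rng(0)
T0, D, h1 = 5, 8, 2                 # hypothetical sizes
d = D // h1                         # assumed per-head width
X = rng.normal(size=(T0, D))
W_q = [rng.normal(size=(D, d)) for _ in range(h1)]
W_k = [rng.normal(size=(D, d)) for _ in range(h1)]
W_v = [rng.normal(size=(D, d)) for _ in range(h1)]
W_o = rng.normal(size=(h1 * d, D))
Z = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
print(Z.shape)  # (5, 8)
```

Each head uses its own weights, so the heads can attend to different relationships among the frames before their outputs are merged.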

Meanwhile, Figure 4 is a diagram showing an example of a first self-attention layer according to an embodiment.

In the example shown in FIG. 4, for convenience of explanation, it is assumed that the first self-attention layer 311 includes three attention heads 410, 420, and 430 (i.e., $h_1$ = 3), but the number of attention heads included in the first self-attention layer 311 is not limited to the illustrated example and may be set differently depending on the embodiment.

Referring to FIG. 4, each attention head (410, 420, 430) may generate the query matrix $Q_k$ by using a linear layer (411, 421, 431) to multiply the input matrix $X'_{n-1}$ by the weight matrix $W_k^{Q}$. In addition, each attention head (410, 420, 430) may generate the key matrix $K_k$ by using a linear layer (412, 422, 432) to multiply the input matrix $X'_{n-1}$ by the weight matrix $W_k^{K}$. In addition, each attention head (410, 420, 430) may generate the value matrix $V_k$ by using a linear layer (413, 423, 433) to multiply the input matrix $X'_{n-1}$ by the weight matrix $W_k^{V}$.

Afterwards, each attention head (410, 420, 430) may perform attention (414, 424, 434) using the generated query matrix $Q_k$, key matrix $K_k$, and value matrix $V_k$, thereby generating output matrices $Z_1$, $Z_2$, and $Z_3$ that each satisfy Equation 9 described above.

Afterwards, the first self-attention layer 311 may concatenate (440) the output matrices $Z_1$, $Z_2$, and $Z_3$ of the attention heads 410, 420, and 430 and then multiply the result by the weight matrix $W^{O}$ using the linear layer 450, thereby generating a self-attention matrix $Z$ that satisfies Equation 14 described above.

Specifically, each decoder layer 130-1, 130-2, and 130-M shown in FIG. 2 may have the same configuration as the decoder layer 510 shown in FIG. 5, for example.

Referring to FIG. 5, the decoder layer 510 may include a second self-attention layer 511, layer normalization 512, 514, and 516, a cross attention layer 513, and a feed-forward network 515.

The second self-attention layer 511 may perform self-attention on a matrix $Y'_{m-1}$ generated by combining the input matrix $Y_{m-1}$ input to the decoder layer 510 (where 1≤m≤M) and the action query matrix $Y$. At this time, the matrix $Y'_{m-1}$ input to the second self-attention layer 511 may be generated by adding the input matrix $Y_{m-1}$ and the action query matrix $Y$ as shown in Equation 15 below.

[Equation 15]

$$Y'_{m-1} = Y_{m-1} + Y$$

According to one embodiment, the action query matrix $Y$ is a matrix including L action queries, each of which is a D-dimensional vector, and is a learnable parameter determined through training of the behavior prediction model 100. Additionally, the number L of action queries included in the action query matrix is a preset hyperparameter and may be set to different values depending on the embodiment.

Meanwhile, if the decoder layer 510 is the first decoder layer (i.e., m=1), the input matrix $Y_{m-1}$ input to the decoder layer 510 may be the preset initial decoder input matrix $Y_0$. At this time, the initial decoder input matrix $Y_0$ may be, for example, a matrix in which all elements are 0.

Additionally, if the decoder layer 510 is the second or a subsequent decoder layer (i.e., 2≤m≤M), the input matrix $Y_{m-1}$ input to the decoder layer 510 may be the output matrix of the (m-1)-th decoder layer.
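A small sketch makes one consequence of this setup concrete: when the initial decoder input matrix is all zeros, the matrix fed to the second self-attention layer of the first decoder layer is simply the action query matrix itself. The sizes below are hypothetical, and the random values stand in for action queries that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 8                                  # hypothetical: 4 action queries, 8 dimensions
action_queries = rng.normal(size=(L, D))     # stand-in for the learned action query matrix
Y0 = np.zeros((L, D))                        # initial decoder input: all elements 0

# Input to the second self-attention layer of the first decoder layer.
first_layer_input = Y0 + action_queries
print(np.array_equal(first_layer_input, action_queries))  # True
```

Each of the L rows corresponds to one future action slot, which is what allows all future actions to be decoded at once rather than one after another.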

Meanwhile, the first layer normalization 512 may perform a residual connection by adding the self-attention matrix $Z'$, generated by the second self-attention layer 511 using the matrix $Y'_{m-1}$, and the input matrix $Y_{m-1}$ of the decoder layer 510, and then perform layer normalization. Specifically, the matrix $Y''_{m-1}$ generated through the first layer normalization 512 may be expressed as Equation 16 below.

[Equation 16]

$$Y''_{m-1} = \mathrm{LayerNorm}(Z' + Y_{m-1})$$

The cross attention layer 513 may perform cross attention on a matrix $Y'''_{m-1}$, which combines the matrix $Y''_{m-1}$ generated through the first layer normalization 512 and the action query matrix $Y$, and a matrix $X'_N$, which combines the encoder output matrix $X_N$ and the position embedding matrix $P$. At this time, the matrix $Y'''_{m-1}$ may be generated by adding the matrix $Y''_{m-1}$ and the action query matrix $Y$ as shown in Equation 17 below, and the matrix $X'_N$ may likewise be generated by adding the encoder output matrix $X_N$ and the position embedding matrix $P$ as shown in Equation 18 below.

[Equation 17]

$$Y'''_{m-1} = Y''_{m-1} + Y$$

[Equation 18]

$$X'_N = X_N + P$$

Meanwhile, the second layer normalization 514 may perform a residual connection by adding the cross attention matrix $Z''$ generated by the cross attention layer 513 and the matrix $Y''_{m-1}$ generated by the first layer normalization 512, and then perform layer normalization. Specifically, the matrix $Y''''_{m-1}$ generated through the second layer normalization 514 may be expressed as Equation 19 below.

[Equation 19]

$$Y''''_{m-1} = \mathrm{LayerNorm}(Z'' + Y''_{m-1})$$

Meanwhile, the matrix $Y''''_{m-1}$ generated through the second layer normalization 514 passes through the feed-forward network 515, and the third layer normalization 516 may perform a residual connection by adding the output of the feed-forward network 515 and the input matrix $Y''''_{m-1}$ of the feed-forward network 515, and then perform layer normalization to generate the output matrix $Y_m$ of the decoder layer 510. Specifically, the output matrix $Y_m$ of the decoder layer 510 may be expressed as Equation 20 below.

[Equation 20]

$$Y_m = \mathrm{LayerNorm}\big(\mathrm{FFN}(Y''''_{m-1}) + Y''''_{m-1}\big)$$
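The order of operations inside one decoder layer can be sketched as below: self-attention over the queries, cross attention against the encoder output, and a feed-forward network, each followed by a residual connection and layer normalization. To keep the sketch short, attention is shown without learned projections or multiple heads, layer normalization has no learnable parameters, and the feed-forward network is a two-layer ReLU network; all of these, along with the names and sizes, are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attend(A, B):
    # Projection-free attention: queries from A, keys and values from B.
    return softmax(A @ B.T / np.sqrt(A.shape[-1])) @ B

def decoder_layer(Y_prev, Y_query, X_N, P, ffn):
    Y1 = Y_prev + Y_query                      # self-attention input
    Y2 = layer_norm(attend(Y1, Y1) + Y_prev)   # first layer normalization
    Z_cross = attend(Y2 + Y_query, X_N + P)    # cross attention with encoder output
    Y4 = layer_norm(Z_cross + Y2)              # second layer normalization
    return layer_norm(ffn(Y4) + Y4)            # third layer normalization

rng = np.random.default_rng(0)
T0, L, D = 6, 4, 8                             # hypothetical sizes
X_N = rng.normal(size=(T0, D))                 # stand-in encoder output
P = rng.normal(size=(T0, D))                   # stand-in position embedding
Y_query = rng.normal(size=(L, D))              # stand-in action queries
W1 = rng.normal(size=(D, 4 * D)) * 0.1
W2 = rng.normal(size=(4 * D, D)) * 0.1
ffn = lambda X: np.maximum(X @ W1, 0.0) @ W2

Y1_out = decoder_layer(np.zeros((L, D)), Y_query, X_N, P, ffn)
print(Y1_out.shape)  # (4, 8)
```

The output keeps one row per action query regardless of how many frames the encoder processed, since the queries always come from the decoder side.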

Meanwhile, according to one embodiment, the second self-attention layer 511 may include a plurality of attention heads that perform self-attention in parallel on the matrix $Y'_{m-1}$ input to the second self-attention layer 511. In this case, the self-attention matrix generated by the second self-attention layer 511 may be generated based on the output matrices of the plurality of attention heads, each of which takes the input matrix $Y'_{m-1}$ of the second self-attention layer 511 as input.

Specifically, each attention head included in the second self-attention layer 511 may generate a query matrix, a key matrix, and a value matrix from the matrix $Y'_{m-1}$ input to the second self-attention layer 511, and may generate an output matrix by performing self-attention using the generated query matrix, key matrix, and value matrix.

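For example, if the number of attention heads included in the second self-attention layer 511 is $h_2$, the output matrix $Z'_l$ of the l-th attention head (where 1≤l≤$h_2$) may be generated according to Equation 21 below.

[Equation 21]

$$Z'_l = \sigma\big(\mathrm{score}(Q'_l, K'_l)\big)\,V'_l$$

In Equation 21, the query matrix $Q'_l$, key matrix $K'_l$, and value matrix $V'_l$ may each be generated according to Equations 22 to 24 below.

[Equation 22]

$$Q'_l = Y'_{m-1}\,W'^{Q}_{l}$$

[Equation 23]

$$K'_l = Y'_{m-1}\,W'^{K}_{l}$$

[Equation 24]

$$V'_l = Y'_{m-1}\,W'^{V}_{l}$$

In Equations 22 to 24, $W'^{Q}_{l}$, $W'^{K}_{l}$, and $W'^{V}_{l}$ represent weight matrices determined through training of the behavior prediction model 100, and may be different for each attention head.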

Meanwhile, in Equation 21, score($Q'_l$, $K'_l$) may be, for example, a scaled dot-product function as in Equation 25 below, but depending on the embodiment, various other types of score functions may be used.

[Equation 25]

$$\mathrm{score}(Q'_l, K'_l) = \frac{Q'_l\,{K'_l}^{\top}}{\sqrt{d'}}$$

Here, $d'$ denotes the number of columns of the key matrix $K'_l$.

Meanwhile, the second self-attention layer 511 may generate the self-attention matrix $Z'$ using the output matrix of each attention head. Specifically, the second self-attention layer 511 may generate the self-attention matrix $Z'$ by concatenating the output matrices of the attention heads as shown in Equation 26 below.

[Equation 26]

$$Z' = \mathrm{Concat}(Z'_1, \ldots, Z'_{h_2})\,W'^{O}$$

At this time, $W'^{O}$ represents a weight matrix determined through training of the behavior prediction model 100.

Meanwhile, Figure 6 is a diagram showing an example of a second self-attention layer according to an embodiment.

In the example shown in FIG. 6, for convenience of explanation, it is assumed that the second self-attention layer 511 includes three attention heads 610, 620, and 630 (i.e., $h_2$ = 3), but the number of attention heads included in the second self-attention layer 511 is not limited to the illustrated example and may be set differently depending on the embodiment.

Referring to FIG. 6, each attention head (610, 620, 630) may generate the query matrix $Q'_l$ by using a linear layer (611, 621, 631) to multiply the input matrix $Y'_{m-1}$ by the weight matrix $W'^{Q}_{l}$. In addition, each attention head (610, 620, 630) may generate the key matrix $K'_l$ by using a linear layer (612, 622, 632) to multiply the input matrix $Y'_{m-1}$ by the weight matrix $W'^{K}_{l}$. In addition, each attention head (610, 620, 630) may generate the value matrix $V'_l$ by using a linear layer (613, 623, 633) to multiply the input matrix $Y'_{m-1}$ by the weight matrix $W'^{V}_{l}$.

Afterwards, each attention head (610, 620, 630) may perform attention (614, 624, 634) using the generated query matrix $Q'_l$, key matrix $K'_l$, and value matrix $V'_l$, thereby generating output matrices $Z'_1$, $Z'_2$, and $Z'_3$ that each satisfy Equation 21 described above.

Afterwards, the second self-attention layer 511 may concatenate (640) the output matrices $Z'_1$, $Z'_2$, and $Z'_3$ of the attention heads (610, 620, 630) and then multiply the result by the weight matrix $W'^{O}$ using the linear layer (650), thereby generating a self-attention matrix $Z'$ that satisfies Equation 26 described above.

Referring again to FIG. 5, the cross attention layer 513 according to one embodiment may include a plurality of attention heads that perform, in parallel, cross attention on the two matrices $X'_N$ and $Y'''_{m-1}$ input to the cross attention layer 513. In this case, the cross attention matrix generated by the cross attention layer 513 may be generated based on the output matrices of the plurality of attention heads, each of which takes the matrices $X'_N$ and $Y'''_{m-1}$ as input.

Specifically, each attention head included in the cross attention layer 513 may generate a query matrix from the matrix $Y'''_{m-1}$, and a key matrix and a value matrix from the matrix $X'_N$, among the two matrices input to the cross attention layer 513. Additionally, each attention head may generate an output matrix by performing cross attention using the generated query matrix, key matrix, and value matrix.

For example, if the number of attention heads included in the cross attention layer 513 is h₃, the output matrix Z''_u of the u-th (where 1≤u≤h₃) attention head can be generated according to Equation 27 below.

[Equation 27]

In Equation 27, the query matrix Q'' _u , the key matrix K'' _u , and the value matrix V'' _u can each be generated according to Equations 28 to 30 below.

[Equation 28]

[Equation 29]

[Equation 30]

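In Equations 28 to 30, the three weight matrices are the weight matrices that the u-th attention head multiplies by its input matrices to generate the query matrix Q''_u, the key matrix K''_u, and the value matrix V''_u, respectively.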

Meanwhile, in Equation 27, score(Q''_u, K''_u) may be, for example, a scaled dot-product function such as Equation 31 below, but depending on the embodiment, various score functions other than the scaled dot-product function may be used.

[Equation 31]
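As a point of reference, a scaled dot-product score and the head output computed from it are conventionally written as follows. This is a sketch of the standard attention formulation, not a restatement of Equations 27 and 31 themselves; the symbol d_k (taken here to be the number of columns of K''_u) and the row-wise softmax are assumptions that do not come from the text above.

```latex
\mathrm{score}(Q''_u, K''_u) = \frac{Q''_u \, {K''_u}^{\top}}{\sqrt{d_k}},
\qquad
Z''_u = \mathrm{softmax}\!\left(\mathrm{score}(Q''_u, K''_u)\right) V''_u
```

Under this form, each row of Z''_u is a weighted average of the rows of V''_u, with the weights given by how strongly the corresponding query row matches each key row.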

Meanwhile, the cross attention layer 513 can generate a cross attention matrix using the output matrix of each attention head. Specifically, the cross attention layer 513 can generate a cross attention matrix Z'' by concatenating the output matrices of the attention heads as shown in Equation 32 below.

[Equation 32]

At this time, the concatenated output matrices are multiplied by a weight matrix using a linear layer.

Meanwhile, Figure 7 is a diagram showing an example of a cross attention layer according to an embodiment.

In the example shown in FIG. 7, for convenience of explanation, it is assumed that the cross attention layer 513 includes three attention heads 710, 720, and 730 (i.e., h₃ = 3), but the number of attention heads included in the cross attention layer 513 is not limited to the example shown and may be set differently depending on the embodiment.

Referring to FIG. 7, each attention head (710, 720, 730) can generate the query matrix Q''_u by multiplying the input matrix Y'''_{m-1} by a weight matrix using a linear layer (711, 721, 731). In addition, each attention head (710, 720, 730) can generate the key matrix K''_u by multiplying the input matrix X'_N by a weight matrix using the linear layer (712, 722, 732), and can generate the value matrix V''_u by multiplying the input matrix X'_N by a weight matrix using the linear layer (713, 723, 733).

Afterwards, each attention head (710, 720, 730) performs attention (714, 724, 734) using its generated query matrix Q''_u, key matrix K''_u, and value matrix V''_u, thereby generating output matrices Z''₁, Z''₂, and Z''₃ that satisfy Equation 27 described above.

Afterwards, the cross attention layer 513 concatenates (740) the output matrices Z''₁, Z''₂, and Z''₃ of the attention heads (710, 720, 730) and then multiplies the result by a weight matrix using the linear layer (750), thereby generating a cross attention matrix Z'' that satisfies Equation 32 described above.
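The data flow of FIG. 7 can be sketched in a few lines of Python. This is a minimal illustration and not the disclosed implementation: the function names, the absence of bias terms, the row-wise softmax, the scaled dot-product score, and the toy dimensions are all assumptions made for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(y, x, w_q, w_k, w_v):
    """One head: query from y, key and value from x (assumed scaled dot-product score)."""
    q, k, v = matmul(y, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def cross_attention(y, x, heads, w_o):
    """Concatenate the head outputs column-wise, then apply the output weights."""
    outputs = [attention_head(y, x, *w) for w in heads]
    concat = [sum((o[i] for o in outputs), []) for i in range(len(y))]
    return matmul(concat, w_o)

# Toy example (hypothetical sizes): 2 query rows, 3 frame rows, width 2, three heads of width 1.
y = [[1.0, 2.0], [3.0, 4.0]]              # stands in for Y'''_{m-1}
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stands in for X'_N
head = ([[0.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]])  # (W_q, W_k, W_v), made-up values
z = cross_attention(y, x, [head, head, head], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The result `z` has one row per row of `y`, so the number of query rows is preserved regardless of how many frame rows `x` contains. Passing the same matrix for both `y` and `x` gives the self-attention arrangement of FIG. 6, in which the query, key, and value matrices are all derived from one input.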

Referring again to FIG. 1, the behavior classification unit 140 identifies one or more behaviors observed in the input frame sequence based on the encoder output matrix output from the encoder 120.

Specifically, the behavior classification unit 140 can identify the behavior corresponding to each frame of the input frame sequence based on the encoder output matrix.

According to one embodiment, the behavior classification unit 140 can generate, based on the encoder output matrix, a probability distribution over I behavior classes for each of the T₀ frames included in the input frame sequence, and can classify the behavior corresponding to each frame into one of the I behavior classes based on the generated probability distribution. At this time, the type and number of behavior classes may vary depending on the training data used to train the behavior prediction model 100.

Specifically, according to one embodiment, the behavior classification unit 140 can apply a fully connected layer to the encoder output matrix to generate a matrix representing the probability distribution of each of the T₀ frames over the I behavior classes.

[Equation 33]

At this time, the weight matrix in Equation 33 is the weight matrix that the fully connected layer multiplies by the encoder output matrix.

Meanwhile, the behavior prediction unit 150 predicts, in parallel, one or more future actions that will occur sequentially after the one or more behaviors identified by the behavior classification unit 140, based on the decoder output matrix output from the decoder 130.

Specifically, the decoder output matrix Y_M consists of L D-dimensional vectors corresponding to the L behavior queries included in the behavior query matrix Y, and the behavior prediction unit 150 can use the decoder output matrix Y_M to predict, in parallel, L future actions that will occur sequentially. At this time, predicting in parallel L future actions that will occur sequentially means that the prediction result for the (t-1)-th future action need not be used to predict the t-th (here, 2≤t≤L) future action among the L future actions.

Meanwhile, according to one embodiment, the behavior prediction unit 150 can generate a probability distribution over I+1 behavior classes for each of the L future actions, and can classify each of the L future actions into one of the I+1 behavior classes based on the generated probability distribution. At this time, the I+1 behavior classes may include the above-described I behavior classes and a dummy class indicating that the action does not belong to any of the I behavior classes.

Specifically, according to one embodiment, the behavior prediction unit 150 can apply a fully connected layer to the decoder output matrix Y_M as shown in Equation 34 below and then apply the softmax function, thereby generating a matrix representing the probability distribution of each of the L future actions over the I+1 behavior classes.

[Equation 34]

At this time, the weight matrix in Equation 34 is the weight matrix that the fully connected layer multiplies by the decoder output matrix Y_M, and can be determined through training the behavior prediction model 100.
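The parallel classification described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name, the omission of a bias term, and the choice of the last class index as the dummy class are hypothetical and are not specified in the text.

```python
import math

def predict_future_actions(y_m, weights, num_known_classes):
    """Classify each row of the decoder output on its own.

    y_m: L rows of D values (one row per behavior query).
    weights: D rows of I+1 values (assumed fully connected layer, no bias).
    num_known_classes: I; index I is treated here as the dummy class (assumption).
    """
    predictions = []
    for row in y_m:
        # Fully connected layer: one logit per behavior class.
        logits = [sum(v * w for v, w in zip(row, col)) for col in zip(*weights)]
        # Softmax over the I+1 classes.
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        label = None if best == num_known_classes else best
        predictions.append((label, probs))
    return predictions

# Toy example (made-up values): L = 2 queries, D = 2, I = 2 known classes plus a dummy class.
toy_y_m = [[1.0, 0.0], [0.0, 1.0]]
toy_weights = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
toy_predictions = predict_future_actions(toy_y_m, toy_weights, num_known_classes=2)
```

No iteration of the loop reads the result of another iteration, which is the sense in which the L future actions are predicted in parallel: the prediction for the t-th action does not consume the prediction for the (t-1)-th action. In this sketch a query that falls into the dummy class is reported as `None`.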

Meanwhile, according to one embodiment, the behavior prediction unit 150 may predict the duration of each of one or more predicted future behaviors in parallel.

Specifically, the behavior prediction unit 150 can apply a fully connected layer to the decoder output matrix Y_M as shown in Equation 35 below to generate an L-dimensional vector containing the duration of each of the L future actions as an element.

[Equation 35]

At this time, the weight matrix in Equation 35 is the weight matrix that the fully connected layer multiplies by the decoder output matrix Y_M, and can be determined through training the behavior prediction model 100. Meanwhile, the j-th element d_j of the L elements included in the generated vector may be a value between 0 and 1 that represents the relative duration of the j-th future action among the L predicted future actions, and the sum of the L elements included in the vector may be 1 (i.e., d_1 + d_2 + ... + d_L = 1).

Meanwhile, according to one embodiment, when the total number of frames of the video from which the input frame sequence is extracted is T, the L future actions predicted by the behavior prediction unit 150 may be actions predicted to occur in the βT frames that follow the αT frames, among all T frames, from which the input frame sequence was extracted, and the actual duration predicted for the j-th future action may be βTd_j. At this time, β is a preset prediction ratio satisfying β∈[0, 1-α].
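As a worked illustration of the relation between relative and actual durations, the following Python sketch converts the values d_j into frame spans. The function name, the tuple layout, and the example numbers are hypothetical; only the quantities αT, βT, and βTd_j come from the description above.

```python
def future_segments(labels, rel_durations, total_frames, alpha, beta):
    """Convert relative durations d_j into frame spans after the observed frames."""
    start = alpha * total_frames    # the observed part covers the first alpha*T frames
    horizon = beta * total_frames   # the following beta*T frames are predicted
    segments = []
    for label, d_j in zip(labels, rel_durations):
        length = horizon * d_j      # actual duration beta*T*d_j
        segments.append((label, start, start + length))
        start += length
    return segments

# Made-up example: T = 1000 frames, alpha = 0.2, beta = 0.5, L = 3 predicted actions.
example = future_segments(["a", "b", "c"], [0.5, 0.25, 0.25], 1000, 0.2, 0.5)
```

With these numbers the prediction horizon is βT = 500 frames, so the three actions last 250, 125, and 125 frames. Because the d_j sum to 1, the spans exactly fill the horizon, which here runs from frame 200 to frame 700.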

Meanwhile, according to one embodiment, the behavior prediction model 100 can be trained by using the behavior class of the actual behavior observed in each of the T₀ frames included in the input frame sequence extracted from the αT frames as the ground truth for the behavior to be identified by the behavior classification unit 140, and by using the behavior class of the actual behavior observed in the βT frames as the ground truth for the future action to be predicted by the behavior prediction unit 150. In addition, depending on the embodiment, when the behavior prediction unit 150 is configured to predict the duration of each predicted future action along with the future action itself, the behavior prediction model 100 can be trained by using the behavior class and duration of the actual behavior observed in the βT frames as the ground truth for the future action and duration to be predicted by the behavior prediction unit 150.

The method shown in FIG. 8 may be performed, for example, by computing device 12 shown in FIG. 9 .

Referring to FIG. 8, computing device 12 generates an encoder input matrix corresponding to the input frame sequence based on features extracted from the input frame sequence (810).

Thereafter, the computing device 12 generates an encoder output matrix based on the encoder input matrix using the encoder 120 composed of one or more encoder layers including the first self-attention layer as a sub-layer (820).

Thereafter, the computing device 12 generates a decoder output matrix based on the encoder output matrix and the behavior query matrix using the decoder 130, which is composed of one or more decoder layers including a second self-attention layer and a cross-attention layer as sub-layers (830).

At this time, according to one embodiment, the behavior query matrix may be a learnable parameter determined through training the behavior prediction model 100.

Computing device 12 then identifies one or more behaviors observed in the input frame sequence based on the encoder output matrix (840).

Computing device 12 then predicts in parallel one or more future actions that will occur sequentially after the one or more actions based on the decoder output matrix (850).

Meanwhile, depending on the embodiment, computing device 12 may predict the duration for each of one or more future actions in parallel based on the decoder output matrix.

Meanwhile, in the flowchart shown in FIG. 8, at least some of the steps may be performed in a different order, combined with other steps and performed together, omitted, divided into detailed steps, or performed with one or more steps not shown added.

In the embodiment shown in Figure 9, each component may have different functions and capabilities in addition to those described below, and may include additional components in addition to those described below.

The illustrated computing environment 10 includes a computing device 12 .

Computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. Processor 14 may cause computing device 12 to operate in accordance with the example embodiments noted above. For example, processor 14 may execute one or more programs stored on computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions, which, when executed by the processor 14, may be configured to cause computing device 12 to perform operations according to example embodiments.

Computer-readable storage medium 16 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In one embodiment, computer-readable storage medium 16 may be memory (volatile memory such as random access memory, non-volatile memory, or an appropriate combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, another form of storage medium that can be accessed by computing device 12 and store desired information, or a suitable combination thereof.

Communication bus 18 interconnects various other components of computing device 12, including processor 14 and computer-readable storage medium 16.

Computing device 12 may also include one or more input/output interfaces 22 that provide an interface for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interface 22 and the network communication interface 26 are connected to the communication bus 18. Input/output device 24 may be coupled to other components of computing device 12 through input/output interface 22. Exemplary input/output devices 24 may include input devices such as a pointing device (such as a mouse or trackpad), a keyboard, a touch input device (such as a touchpad or touch screen), a voice or sound input device, various types of sensor devices, and/or imaging devices, and/or output devices such as display devices, printers, speakers, and/or network cards. The exemplary input/output device 24 may be included within the computing device 12 as a component constituting the computing device 12, or may be connected to the computing device 12 as a separate device distinct from the computing device 12.

Although the present invention has been described in detail above through representative embodiments, those skilled in the art will understand that various modifications to the above-described embodiments are possible without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims set forth below and by equivalents of those claims.

Claims

An action prediction method comprising:

generating an encoder input matrix corresponding to an input frame sequence based on features extracted from the input frame sequence;

generating an encoder output matrix based on the encoder input matrix using an encoder composed of one or more encoder layers including a first self-attention layer as a sub-layer;

generating a decoder output matrix based on the encoder output matrix and an action query matrix using a decoder composed of one or more decoder layers including a second self-attention layer and a cross-attention layer as sub-layers;

identifying one or more actions observed in the input frame sequence based on the encoder output matrix; and

predicting in parallel one or more future actions that will occur sequentially after the one or more actions based on the decoder output matrix.
In claim 1,

The behavior query matrix is a learnable parameter determined through learning a prediction model including the encoder and the decoder.
In claim 1,

The encoder includes a plurality of encoder layers each including the first self-attention layer as a sub-layer and performed sequentially,

The encoder output matrix is an output matrix of the last encoder layer among the plurality of encoder layers.
In claim 3,

The first self-attention layer included in the first encoder layer among the plurality of encoder layers performs self-attention based on a matrix generated by combining the encoder input matrix and the position embedding matrix,

A behavior prediction method in which the first self-attention layer included in the n-th encoder layer (where n is a natural number satisfying 2≤n≤N, and N is the number of the plurality of encoder layers) among the plurality of encoder layers performs self-attention based on a matrix generated by combining the output matrix of the (n-1)-th encoder layer among the plurality of encoder layers and the position embedding matrix.
In claim 1,

The first self-attention layer includes a plurality of attention heads that each perform self-attention based on a matrix input to the first self-attention layer.
In claim 1,

The decoder includes a plurality of decoder layers each including the second self-attention layer and the cross-attention layer as sub-layers and performed sequentially,

The decoder output matrix is an output matrix of the last decoder layer among the plurality of decoder layers.
In claim 6,

The second self-attention layer included in the first decoder layer among the plurality of decoder layers performs self-attention based on a matrix generated by combining a preset initial decoder input matrix and the action query matrix,

A behavior prediction method in which the second self-attention layer included in the m-th decoder layer (where m is a natural number satisfying 2≤m≤M, and M is the number of the plurality of decoder layers) among the plurality of decoder layers performs self-attention based on a matrix generated by combining the output matrix of the (m-1)-th decoder layer among the plurality of decoder layers and the behavior query matrix.
In claim 1,

The second self-attention layer includes a plurality of attention heads that each perform self-attention based on a matrix input to the second self-attention layer.
In claim 1,

A behavior prediction method in which the cross attention layer performs cross attention based on a first input matrix combining the encoder output matrix and a position embedding matrix, and a second input matrix combining the action query matrix and a matrix generated by performing layer normalization on the self-attention matrix generated by the second self-attention layer.
In claim 9,

The cross attention layer includes a plurality of attention heads that each perform cross attention based on the first input matrix and the second input matrix.
In claim 1,

The action prediction method further comprising predicting in parallel a duration for each of the one or more future actions based on the decoder output matrix.
An action prediction device comprising:

one or more processors; and

a memory storing one or more programs to be executed by the one or more processors,

wherein the one or more processors:

generate an encoder input matrix corresponding to an input frame sequence based on features extracted from the input frame sequence,

generate an encoder output matrix based on the encoder input matrix using an encoder composed of one or more encoder layers including a first self-attention layer as a sub-layer,

generate a decoder output matrix based on the encoder output matrix and an action query matrix using a decoder composed of one or more decoder layers including a second self-attention layer and a cross-attention layer as sub-layers,

identify one or more actions observed in the input frame sequence based on the encoder output matrix, and

predict in parallel one or more future actions that will occur sequentially after the one or more actions, based on the decoder output matrix.
In claim 12,

The behavior query matrix is a learnable parameter that is determined through learning a prediction model including the encoder and the decoder.
In claim 12,

The encoder includes a plurality of encoder layers each including the first self-attention layer as a sub-layer and performed sequentially,

The encoder output matrix is an output matrix of the last encoder layer among the plurality of encoder layers.
In claim 14,

The first self-attention layer included in the first encoder layer among the plurality of encoder layers performs self-attention based on a matrix generated by combining the encoder input matrix and the position embedding matrix,

A behavior prediction device in which the first self-attention layer included in the n-th encoder layer (where n is a natural number satisfying 2≤n≤N, and N is the number of the plurality of encoder layers) among the plurality of encoder layers performs self-attention based on a matrix generated by combining the output matrix of the (n-1)-th encoder layer among the plurality of encoder layers and the position embedding matrix.
In claim 12,

The first self-attention layer includes a plurality of attention heads that each perform self-attention based on a matrix input to the first self-attention layer.
In claim 12,

The decoder includes a plurality of decoder layers each including the second self-attention layer and the cross-attention layer as sub-layers and performed sequentially,

The decoder output matrix is an output matrix of the last decoder layer among the plurality of decoder layers.
In claim 17,

The second self-attention layer included in the first decoder layer among the plurality of decoder layers performs self-attention based on a matrix generated by combining a preset initial decoder input matrix and the action query matrix,

A behavior prediction device in which the second self-attention layer included in the m-th decoder layer (where m is a natural number satisfying 2≤m≤M, and M is the number of the plurality of decoder layers) among the plurality of decoder layers performs self-attention based on a matrix generated by combining the output matrix of the (m-1)-th decoder layer among the plurality of decoder layers and the behavior query matrix.
In claim 12,

The second self-attention layer includes a plurality of attention heads that each perform self-attention based on a matrix input to the second self-attention layer.
In claim 12,

A behavior prediction device in which the cross attention layer performs cross attention based on a first input matrix combining the encoder output matrix and a position embedding matrix, and a second input matrix combining the action query matrix and a matrix generated by performing layer normalization on the self-attention matrix generated by the second self-attention layer.
In claim 20,

The cross attention layer includes a plurality of attention heads that each perform cross attention based on the first input matrix and the second input matrix.
In claim 12,

wherein the one or more processors predict a duration for each of the one or more future actions in parallel based on the decoder output matrix.