CN116635911A - Action recognition method and related device, storage medium

Info

Publication number: CN116635911A
Application number: CN202180060722.4A
Authority: CN (China)
Prior art keywords: segment, video, aggregation, features, inter
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 萧人豪, 陈佳伟, 何朝文
Current Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an action recognition method, a related device, and a storage medium. The action recognition method includes: dividing a video to be recognized into a plurality of video segments; extracting a plurality of segment-level features of a plurality of modalities for each video segment; performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain an aggregate feature of each video segment; and obtaining an action recognition result of the video to be recognized by using the aggregate features of the video segments. The method and device can improve the accuracy of action recognition based on multiple modalities.

Description

Action recognition method and related device, storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular to an action recognition method, a related device, and a storage medium.
Background
In recent years, action recognition has attracted widespread attention in industry owing to its wide application in scenarios such as security and motion analysis. Among the many related technical routes, introducing multiple modalities provides more information, so its recognition accuracy should in theory be better than that of action recognition based on a single modality. In practice, however, it has been found that simply introducing multiple modalities does not yield better recognition accuracy than single-modality action recognition. In view of this, how to improve the accuracy of action recognition with multiple modalities is a problem to be solved.
Disclosure of Invention
The present application mainly solves the technical problem of providing an action recognition method, a related device, and a storage medium that can improve the accuracy of action recognition based on multiple modalities.
In order to solve the above technical problem, one technical solution adopted by the present application is to provide an action recognition method, including: dividing a video to be recognized into a plurality of video segments; extracting a plurality of segment-level features of a plurality of modalities for each video segment; performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain an aggregate feature of each video segment; and obtaining an action recognition result of the video to be recognized by using the aggregate features of the video segments.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide an action recognition device, which includes a video dividing module configured to divide a video to be recognized into a plurality of video segments; a feature extraction module configured to extract a plurality of segment-level features of a plurality of modalities for each video segment; a feature aggregation module configured to perform multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain an aggregate feature of each video segment; and a result prediction module configured to obtain an action recognition result of the video to be recognized by using the aggregate features of the video segments.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide an electronic device including a non-transitory memory and a processor coupled to each other, wherein the non-transitory memory stores program instructions and the processor executes the program instructions to implement the above action recognition method.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide a computer-readable storage medium storing program instructions, wherein a processor executes the program instructions to implement the above action recognition method.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide an electronic device including a non-transitory memory and a processor coupled to each other, wherein the non-transitory memory stores program instructions and the processor executes the program instructions to implement: dividing a video to be recognized into a plurality of video segments; extracting a plurality of segment-level features of a plurality of modalities for each video segment; performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain an aggregate feature of each video segment; and obtaining an action recognition result of the video to be recognized by using the aggregate features of the video segments.
The beneficial effects of the present application are as follows: unlike the prior art, the video to be recognized is divided into a plurality of video segments, a plurality of segment-level features of a plurality of modalities are extracted for each video segment, and inter-segment aggregation and inter-modality aggregation are performed on these segment-level features to obtain an aggregate feature of each video segment. This facilitates modeling of the correlation between different video segments, so that the aggregate feature of each video segment contains both multi-modality feature information and inter-segment correlation. Using the aggregate features of the video segments to predict the action recognition result of the video to be recognized therefore helps improve the accuracy of action recognition.
Drawings
For a clearer description of the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below illustrate only some embodiments of the present application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of an embodiment of the action recognition method of the present application;
FIG. 2 is a schematic framework diagram of an embodiment of an action recognition model;
FIG. 3 is a schematic framework diagram of an embodiment of action recognition based on a single modality;
FIG. 4 is a schematic framework diagram of an embodiment of action recognition based on simple inter-modality aggregation;
FIG. 5 is a schematic framework diagram of an embodiment of an action recognition device of the present application;
FIG. 6 is a schematic framework diagram of an embodiment of an electronic device of the present application;
FIG. 7 is a schematic framework diagram of an embodiment of a computer-readable storage medium of the present application;
FIG. 8 is a schematic framework diagram of another embodiment of an electronic device of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to fall within the scope of the present application.
Referring to FIG. 1, FIG. 1 is a flowchart illustrating an embodiment of an action recognition method according to the present application. Specifically, the method may include the following steps:
Step S11: The video to be recognized is divided into a plurality of video segments.
In one embodiment, each video segment may include at least one image frame, for example 2 frames, 3 frames, 4 frames, and so on, which is not limited herein. The number of frames included in each video segment may be the same or different, which is also not limited herein. It should be noted that the frames included in each video segment are consecutive.
In one embodiment, the number of video segments may be 2, 3, or more, which is not limited herein. To facilitate the division, the number of frames included in each video segment may be preset to T, and the video segments are then obtained by dividing the video to be recognized into groups of T consecutive frames.
In one embodiment, in order to improve action recognition efficiency, an action recognition model may be trained in advance; the training process of the action recognition model is described in the related embodiments below and is not repeated here. Referring to FIG. 2, FIG. 2 is a schematic framework diagram of an embodiment of an action recognition model. As shown in FIG. 2, the video to be recognized is divided into 3 video segments, and each video segment includes 3 image frames. It should be noted that the division shown in FIG. 2 is only one possible division in practical applications and does not limit the division actually adopted.
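As an illustrative sketch only, and not as a limitation of the division actually adopted, the grouping of a video into segments of T consecutive frames may, for example, be written as follows; the tensor layout (num_frames, C, H, W) and the helper name split_into_segments are assumptions made for the example.

import torch

def split_into_segments(frames: torch.Tensor, t: int) -> list:
    """Split a video tensor of shape (num_frames, C, H, W) into consecutive
    segments of t frames each; trailing frames that do not fill a whole
    segment are dropped in this sketch."""
    num_segments = frames.shape[0] // t
    return [frames[i * t:(i + 1) * t] for i in range(num_segments)]

# Example: a 9-frame video split into 3 segments of 3 frames each,
# matching the division illustrated in FIG. 2.
video = torch.randn(9, 3, 224, 224)
segments = split_into_segments(video, t=3)
assert len(segments) == 3 and segments[0].shape[0] == 3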
Step S12: A plurality of segment-level features of a plurality of modalities are extracted for each video segment.
In one embodiment, the plurality of modalities may include, but is not limited to, a visual modality and an auditory modality, which is not limited herein. Where the plurality of modalities includes a visual modality, feature extraction may be performed on the images contained in each video segment to obtain the segment-level feature of that video segment for the visual modality. Where the plurality of modalities includes an auditory modality, the audio data corresponding to each video segment may be obtained, acoustic parameters may be extracted from the audio data, and feature extraction may then be performed on the acoustic parameters corresponding to each video segment to obtain the segment-level feature of that video segment for the auditory modality. That is, where the plurality of modalities includes a visual modality and an auditory modality, the plurality of segment-level features of the plurality of modalities may include a segment-level feature for the visual modality and a segment-level feature for the auditory modality. Other cases can be handled similarly and are not enumerated here.
In a specific embodiment, the acoustic parameters may include, but are not limited to, the log-mel spectrum (i.e., log-mel), which is not limited herein.
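As an illustrative sketch only, the log-mel acoustic parameters mentioned above may, for example, be computed with torchaudio as follows; the sample rate, n_fft, and n_mels settings are assumptions, since the embodiments do not specify them.

import torch
import torchaudio

# Assumed settings; the description does not fix a sample rate or mel parameters.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400, n_mels=64)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) audio of one video segment -> (1, 64, time) log-mel."""
    return torch.log(mel(waveform) + 1e-6)

spec = log_mel(torch.randn(1, 16000))  # log-mel acoustic parameters of one segment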
In another specific embodiment, as mentioned above, in order to improve action recognition efficiency, an action recognition model may be trained in advance. The action recognition model may include a feature extraction network, and the plurality of segment-level features of the plurality of modalities of each video segment may be extracted with the feature extraction network. Further, with continued reference to FIG. 2, where the plurality of modalities includes a visual modality and an auditory modality, the feature extraction network may include a first extraction network N_v and a second extraction network N_a, where the first extraction network N_v is used to extract segment-level features for the visual modality and the second extraction network N_a is used to extract segment-level features for the auditory modality. On this basis, a segment-level feature for the visual modality and a segment-level feature for the auditory modality can be extracted for each video segment. In addition, the first extraction network N_v may include, but is not limited to, 3D-ResNeXt and the like; for example, the first extraction network N_v may be a 101-layer 3D-ResNeXt. The second extraction network N_a may include, but is not limited to, ResNet and the like; for example, the second extraction network N_a may be a 50-layer ResNet.
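As an illustrative sketch only, a feature extraction network with a first extraction network N_v and a second extraction network N_a may, for example, be assembled as follows; the sketch assumes a recent torchvision and substitutes the lighter r3d_18 and resnet50 backbones for the 3D-ResNeXt-101 and ResNet-50 networks mentioned above, and the feature dimension is an assumption.

import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.video import r3d_18

class SegmentFeatureExtractor(nn.Module):
    """Extracts one visual and one auditory segment-level feature per video
    segment. The backbones below (r3d_18 for N_v, resnet50 for N_a) are
    lightweight stand-ins for the networks mentioned in the description."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.n_v = r3d_18(weights=None)                          # visual backbone N_v
        self.n_v.fc = nn.Linear(self.n_v.fc.in_features, dim)
        self.n_a = resnet50(weights=None)                        # auditory backbone N_a
        self.n_a.conv1 = nn.Conv2d(1, 64, 7, 2, 3, bias=False)   # accept 1-channel log-mel input
        self.n_a.fc = nn.Linear(self.n_a.fc.in_features, dim)

    def forward(self, clip: torch.Tensor, log_mel_spec: torch.Tensor):
        # clip: (B, 3, T, H, W) frames; log_mel_spec: (B, 1, n_mels, time)
        return self.n_v(clip), self.n_a(log_mel_spec)

extractor = SegmentFeatureExtractor()
f_v, f_a = extractor(torch.randn(1, 3, 3, 112, 112), torch.randn(1, 1, 64, 96))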
Step S13: and carrying out multi-layer mode/inter-segment aggregation on the segment level features of each video segment and the segment level features of other video segments to obtain the aggregation features of each video segment.
In the embodiments of the present disclosure, inter-segment aggregation may be performed first and inter-modality aggregation afterwards; alternatively, inter-modality aggregation may be performed first and inter-segment aggregation afterwards, which is not limited herein.
In one embodiment, when multi-layer inter-modality aggregation is performed before inter-segment aggregation, inter-modality aggregation may first be performed on the plurality of segment-level features of the plurality of modalities of each video segment to obtain a first aggregate feature of that video segment, and inter-segment aggregation may then be performed on the first aggregate feature of each video segment and the first aggregate features of the other video segments to obtain the aggregate feature of each video segment.
In a specific embodiment, for each video segment, the plurality of segment-level features of the plurality of modalities of the video segment may be concatenated to implement inter-modality aggregation and obtain the first aggregate feature of the video segment. For ease of description, the i-th video segment may be denoted as x_i, the segment-level feature for the visual modality extracted by the first extraction network N_v may be denoted as N_v(x_i), and the segment-level feature for the auditory modality extracted by the second extraction network N_a may be denoted as N_a(x_i), so the first aggregate feature g_i of the video segment x_i can be expressed as:

g_i = N_v(x_i) ⊕ N_a(x_i)  (1)

In the above formula (1), ⊕ denotes concatenation.
In another specific embodiment, a bidirectional attention mechanism may be used to perform inter-segment aggregation on all the first aggregate features to obtain the aggregate feature of each video segment. Specifically, each of the plurality of video segments may in turn be taken as the current segment, the degree of association between the current segment and the video segments other than the current segment may be obtained based on the bidirectional attention mechanism, and the first aggregate features of the other video segments may be aggregated using the degrees of association to obtain the aggregate feature of the current segment. For ease of description, taking the i-th video segment as the current segment and denoting the inter-segment aggregation operation as B, the aggregate feature B(g_i) can be expressed as:

B(g_i) = W_z Σ_j ((W_q g_i)^T (W_k g_j) / R) W_v g_j  (2)

In the above formula (2), g_i denotes the first aggregate feature of the current segment (i.e., the i-th video segment), g_j denotes the first aggregate feature of a video segment other than the current segment, T denotes the transpose operation, W_q, W_k, W_v and W_z denote linear transformation matrices, and R denotes a normalization factor; (W_q g_i)^T (W_k g_j) denotes the degree of association between the current segment (i.e., the i-th video segment) and the j-th video segment. That is, for the case of performing inter-modality aggregation first and then inter-segment aggregation, the aggregate feature M(x_i) of the video segment x_i can be expressed as:

M(x_i) = B(N_v(x_i) ⊕ N_a(x_i))  (3)
in yet another specific embodiment, to enhance inter-segment aggregation, after the aggregate features of each video segment are obtained, inter-segment aggregation may also be performed on all newly obtained aggregate features to update the aggregate features of each video segment. It should be noted that, the step of taking the obtained aggregate feature of each video segment as the new first aggregate feature of each video segment may be performed only once, that is, may be performed for two times of inter-segment aggregation during the motion recognition process, or may be performed for multiple times, that is, may be performed for three or more times of inter-segment aggregation during the motion recognition process, which is not limited herein. For ease of description, the first inter-fragment aggregation may be expressed as:
B 1 =B(g) (4)
further, the nth inter-fragment aggregation can be expressed as:
B n =B(B n-1 ) (5)
It should be noted that the specific process of performing inter-segment aggregation again may refer to the description of formula (2) above; that is, each video segment may in turn be taken as the current segment, the degree of association between the current segment and the video segments other than the current segment may be obtained based on the bidirectional attention mechanism, and the most recently obtained aggregate features of the other video segments may be aggregated using the degrees of association to obtain the new aggregate feature of the current segment. That is, when inter-segment aggregation is performed again, g in formula (2) denotes the most recently obtained aggregate features.
In another embodiment, when inter-segment aggregation is performed before inter-modality aggregation, each of the plurality of modalities may in turn be taken as the current modality, and inter-segment aggregation may be performed on the segment-level feature of the current modality of each video segment and the segment-level features of the current modality of the other video segments to obtain a second aggregate feature of the current modality of each video segment; inter-modality aggregation is then performed on the second aggregate features of the plurality of modalities of each video segment to obtain the aggregate feature of each video segment.
In a specific embodiment, taking a plurality of modalities including a visual modality and an auditory modality as an example, the visual modality may first be taken as the current modality, so that the segment-level features of the visual modality are aggregated across segments to obtain the second aggregate feature of the visual modality of each video segment; the auditory modality may then be taken as the current modality, so that the segment-level features of the auditory modality are aggregated across segments to obtain the second aggregate feature of the auditory modality of each video segment. Specifically, the step of performing inter-segment aggregation on the segment-level features of the visual modality may refer to the description of formula (2) above; that is, each video segment may in turn be taken as the current segment, the degree of association between the current segment and the other video segments may be obtained based on the bidirectional attention mechanism, and the segment-level features of the visual modality of the other video segments may be aggregated using the degrees of association to obtain the second aggregate feature of the visual modality of the current segment; in this case g in formula (2) denotes the segment-level features of the visual modality. Similarly, the step of performing inter-segment aggregation on the segment-level features of the auditory modality may also refer to the description of formula (2) above, with g in formula (2) denoting the segment-level features of the auditory modality.
In another specific embodiment, for each video segment, the second aggregate features of the plurality of modalities of the video segment may be concatenated to implement inter-modality aggregation and obtain the aggregate feature of the video segment. Still taking a plurality of modalities including a visual modality and an auditory modality as an example, for each video segment, the second aggregate feature of the auditory modality and the second aggregate feature of the visual modality of the video segment may be concatenated to aggregate the auditory and visual modalities and obtain the aggregate feature of the video segment. That is, for the case of performing inter-segment aggregation first and then inter-modality aggregation, the aggregate feature M(x_i) of the video segment x_i can be expressed as:

M(x_i) = B_v(N_v(x_i)) ⊕ B_a(N_a(x_i))  (6)

In the above formula (6), B_v denotes the inter-segment aggregation performed on the visual modality (i.e., the bidirectional attention mechanism corresponding to the visual modality), and B_a denotes the inter-segment aggregation performed on the auditory modality (i.e., the bidirectional attention mechanism corresponding to the auditory modality).
In yet another specific embodiment, in order to strengthen inter-segment aggregation, before inter-modality aggregation is performed on the second aggregate features of the plurality of modalities of each video segment to obtain the aggregate feature of each video segment, inter-segment aggregation may be performed again on all the newly obtained second aggregate features of the current modality to update the second aggregate feature of the current modality of each video segment. It should be noted that this updating step may be performed only once, i.e., inter-segment aggregation is performed twice per modality in the action recognition process, or multiple times, i.e., inter-segment aggregation is performed three or more times per modality in the action recognition process, which is not limited herein.
In yet another embodiment, as described above, in order to improve action recognition efficiency, an action recognition model may be trained in advance. As shown in FIG. 2, the action recognition model may include a feature aggregation network, and the feature aggregation network may be used to perform inter-segment aggregation and inter-modality aggregation on all segment-level features to obtain the aggregate feature of each video segment. Specifically, the feature aggregation network may include a bidirectional attention mechanism layer; the bidirectional attention mechanism is described with reference to formula (2) above and is not repeated here. As can be seen from the foregoing description, in the multi-modality case the bidirectional attention mechanism not only models the correlation between different video segments but also combines multi-modality information such as the auditory modality and the visual modality, and the receptive field of each video segment expands rapidly, so the method adapts to both short and long videos; the aggregate feature obtained by aggregation can be regarded as a high-dimensional representation with context information, which helps improve subsequent prediction accuracy.
Step S14: Obtain the action recognition result of the video to be recognized by using the aggregate features of the video segments.
In the embodiments of the present disclosure, the action recognition result may include the action category present in the video to be recognized; for example, the action recognition result may indicate that a "kicking action" is present in the video to be recognized, or that a "riding action" is present in the video to be recognized, and further examples are not enumerated here.
In one embodiment, the aggregate features of the plurality of video segments may be fused to obtain a video-level feature of the video to be recognized, and the action recognition result may be predicted using the video-level feature. Specifically, the aggregate features of the plurality of video segments may be fused, for example by summation, to obtain the video-level feature of the video to be recognized.
In one embodiment, as described above, in order to improve action recognition efficiency, an action recognition model may be trained in advance. The action recognition model may include a result prediction network, and the result prediction network may be used to process the aggregate features of the plurality of video segments to obtain the action recognition result of the video to be recognized. Specifically, the result prediction network may include a fully connected layer; after the aggregate features of the plurality of video segments are fused to obtain the video-level feature of the video to be recognized, the video-level feature may be input into the fully connected layer to predict the action recognition result. For ease of description, the aggregate feature of the first video segment may be denoted as M(x_1), the aggregate feature of the second video segment as M(x_2), and so on, with the aggregate feature of the C-th video segment denoted as M(x_C); the action recognition result Y' can then be expressed as:

Y' = F(M(x_1) + M(x_2) + … + M(x_C))  (7)

In the above formula (7), F denotes the fully connected layer. In addition, the fully connected layer has a hidden layer that takes in the video-level feature obtained by fusing the aggregate features of the plurality of video segments.
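As an illustrative sketch only, the result prediction of formula (7) may, for example, be implemented as follows; the hidden width and the number of preset action categories are assumptions.

import torch
import torch.nn as nn

class ResultPrediction(nn.Module):
    """Sketch of formula (7): the per-segment aggregate features M(x_i) are
    summed into a video-level feature, which a fully connected head with one
    assumed hidden layer maps to scores over action categories."""
    def __init__(self, dim: int, num_classes: int, hidden: int = 512):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, num_classes))

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        # m: (C, dim) aggregate features of the C video segments
        video_level = m.sum(dim=0)   # M(x_1) + M(x_2) + ... + M(x_C)
        return self.f(video_level)   # Y': scores over the preset action categories

head = ResultPrediction(dim=1024, num_classes=400)   # 400 categories is an assumption
scores = head(torch.randn(3, 1024))
predicted_category = scores.argmax().item()          # index of the recognized action category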
In one embodiment, comparative analysis shows that the action recognition accuracy of the embodiments of the present disclosure is significantly improved; please refer to Table 1, which compares the recognition accuracy of different action recognition methods.

Table 1 Comparison of recognition accuracy of different action recognition methods

  Auditory modality only: 8.29%
  Visual modality only: 66.4%
  Simple inter-modality aggregation: 64.5%
  Inter-segment aggregation then inter-modality aggregation (present disclosure): 69.52%
  Inter-modality aggregation then inter-segment aggregation (present disclosure): 70.11%
It should be noted that, referring to FIG. 3, FIG. 3 is a schematic framework diagram of an embodiment of action recognition based on a single modality. As shown in FIG. 3, unlike the embodiments of the present disclosure, an action recognition method using only the auditory modality or only the visual modality extracts segment-level features of a single modality (e.g., the visual modality or the auditory modality) for the plurality of video segments and predicts the action recognition result using these single-modality segment-level features. As shown in Table 1, action recognition based on the auditory modality alone achieves a recognition accuracy of only 8.29%, making it difficult to perform the action recognition task, while action recognition based on the visual modality alone achieves 66.4%. In addition, referring to FIG. 4, FIG. 4 is a schematic framework diagram of an embodiment of action recognition based on simple inter-modality aggregation. As shown in FIG. 4, the action recognition method based on simple inter-modality aggregation extracts segment-level features of a plurality of modalities (e.g., the segment-level feature N_v(x_i) of the visual modality and the segment-level feature N_a(x_i) of the auditory modality) for each of the plurality of video segments, and then performs simple inter-modality aggregation on the segment-level features of each video segment, for example concatenating N_v(x_i) and N_a(x_i) to obtain a naive aggregate feature M_naive(x_i) of the video segment. The aggregate feature of each video segment is then processed by a prediction network F including one or more fully connected layers to obtain the segment-level prediction y'_i = F(M_naive(x_i)). On this basis, the segment-level predictions of the plurality of video segments are arithmetically averaged to obtain the action recognition result of the video to be recognized. Unlike the embodiments of the present disclosure, simple inter-modality aggregation does not model the correlation between different video segments, and because action durations differ, even a complex action may span multiple video segments, which becomes the performance bottleneck of action recognition based on simple inter-modality aggregation. As shown in Table 1, the action recognition method based on simple inter-modality aggregation achieves a recognition accuracy of only 64.5%, even worse than action recognition based on the visual modality alone. It can be seen that modeling the correlation between different video segments, as in the embodiments of the present disclosure, plays an extremely important role.
In contrast to the single-modality action recognition method and the action recognition method based on simple inter-modality aggregation described above, the embodiments of the present disclosure exploit both the correlation between different video segments and inter-modality aggregation. The method that performs inter-modality aggregation and then inter-segment aggregation achieves a recognition accuracy of 70.11%, which is 3.71% and 5.61% better than the method using only the visual modality and the method based on simple inter-modality aggregation, respectively. This demonstrates the importance of the inter-modality and inter-segment aggregation proposed by the embodiments of the present disclosure. In addition, the method that performs inter-segment aggregation and then inter-modality aggregation achieves a recognition accuracy of 69.52%, which is slightly lower than the method that performs inter-modality aggregation and then inter-segment aggregation, but still better than the single-modality action recognition method and the method based on simple inter-modality aggregation. It can be seen that modeling the correlation between video segments within each modality also helps inter-modality aggregation. Moreover, comparing the inter-modality-then-inter-segment method with the inter-segment-then-inter-modality method shows that the earlier inter-modality aggregation is performed, the higher the action recognition accuracy.
In the above solution, the video to be recognized is divided into a plurality of video segments, a plurality of segment-level features of a plurality of modalities are extracted for each video segment, and inter-segment aggregation and inter-modality aggregation are performed on these segment-level features to obtain the aggregate feature of each video segment. This facilitates modeling of the correlation between different video segments, so that the aggregate feature of each video segment contains both multi-modality feature information and inter-segment correlation; using the aggregate features of the plurality of video segments to predict the action recognition result of the video to be recognized therefore helps improve the accuracy of action recognition.
In some disclosed embodiments, as described above, in order to improve action recognition efficiency, an action recognition model may be trained in advance. A sample video labelled with a sample action category may be obtained and divided into a plurality of sample video segments. On this basis, a plurality of sample segment-level features of a plurality of modalities are extracted for each sample video segment using the feature extraction network of the action recognition model, inter-segment aggregation and inter-modality aggregation are performed on all sample segment-level features using the feature aggregation network of the action recognition model to obtain the sample aggregate feature of each sample video segment, and the sample aggregate features of the plurality of sample video segments are processed by the result prediction network of the action recognition model to predict the predicted action category of the sample video. Finally, the network parameters of the action recognition model may be adjusted based on the difference between the sample action category and the predicted action category. The network structures and processing procedures of the feature extraction network, the feature aggregation network and the result prediction network are described in the foregoing disclosed embodiments and are not repeated here.
In one embodiment, the result prediction network may predict the predicted probability values of a plurality of preset action categories; a loss value of the action recognition model may then be calculated with a cross-entropy loss function based on the sample action category and the predicted probability values of the plurality of preset action categories, and the network parameters of the action recognition model may be adjusted using the loss value.
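As an illustrative sketch only, the training step described above may, for example, be written as follows; the model and optimizer interfaces are assumptions that compose the sketches above, and CrossEntropyLoss stands in for the cross-entropy loss function.

import torch
import torch.nn as nn

# Illustrative training step, assuming `model(clips, log_mels)` composes the
# sketches above and returns class logits for one labelled sample video.
criterion = nn.CrossEntropyLoss()

def training_step(model, optimizer, clips, log_mels, sample_action_category: int) -> float:
    logits = model(clips, log_mels)                      # predicted action-category scores
    target = torch.tensor([sample_action_category])      # labelled sample action category
    loss = criterion(logits.unsqueeze(0), target)        # cross-entropy loss value
    optimizer.zero_grad()
    loss.backward()                                      # adjust network parameters
    optimizer.step()
    return loss.item()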
In another embodiment, the sample video segments in which the sample action category is present may be separated by K1 sample video segments, where K1 may be less than a first value (e.g., 2, 3, etc.); that is, the sample video segments containing the sample action category may be close to each other. In addition, the sample video segments in which the sample action category is present may be separated by K2 sample video segments, where K2 may be greater than a second value (e.g., 4, 5, etc.); that is, the sample video segments containing the sample action category may be far apart. Trained on such samples, the action recognition model can model the correlation between different video segments regardless of how many video segments an action spans, which helps improve the recognition accuracy of the action recognition model.
Referring to FIG. 5, FIG. 5 is a schematic framework diagram of an action recognition device 50 according to an embodiment of the application. As shown in FIG. 5, the action recognition device 50 includes a video dividing module 51, a feature extraction module 52, a feature aggregation module 53 and a result prediction module 54. The video dividing module 51 is configured to divide a video to be recognized into a plurality of video segments; the feature extraction module 52 is configured to extract a plurality of segment-level features of a plurality of modalities for each video segment; the feature aggregation module 53 is configured to perform inter-segment aggregation and inter-modality aggregation on all segment-level features to obtain the aggregate feature of each video segment; and the result prediction module 54 is configured to predict the action recognition result of the video to be recognized using the aggregate features of the plurality of video segments, the action recognition result including the action category present in the video to be recognized.
It should be noted that the description of the action recognition method in the above disclosed embodiments also applies to the action recognition device in the exemplary embodiments of the present disclosure and is not repeated here.
Referring to FIG. 6, FIG. 6 is a schematic framework diagram of an electronic device 60 according to an embodiment of the application. As shown in FIG. 6, the electronic device 60 includes a memory 61, a processor 62, and a computer program 63 stored on the memory 61 and executable on the processor 62; the processor 62 implements the action recognition method in any of the above disclosed embodiments when executing the computer program 63. Specifically, the electronic device 60 may include, but is not limited to, a computer, a server, and the like, which is not limited herein.
Referring to FIG. 7, FIG. 7 is a schematic framework diagram of a computer-readable storage medium 70 according to an embodiment of the application. As shown in FIG. 7, the computer-readable storage medium 70 stores a computer program 71 which, when executed by a processor, implements the action recognition method in any of the disclosed embodiments described above.
Referring to FIG. 8, FIG. 8 is another schematic framework diagram of an electronic device 80 according to an embodiment of the application. As shown in FIG. 8, the electronic device 80 includes a non-transitory memory 81 and a processor 82 coupled to each other, wherein the non-transitory memory 81 stores program instructions 83 and the processor 82 executes the program instructions to implement the following.
Performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain the aggregate feature of each video segment includes: performing inter-modality aggregation on the segment-level features of the plurality of modalities of each video segment to obtain a first aggregate feature of each video segment; and performing inter-segment aggregation on the first aggregate feature of each video segment and the first aggregate features of the other video segments to obtain the aggregate feature of each video segment.
After performing inter-segment aggregation on the first aggregate feature of each video segment and the first aggregate features of the other video segments to obtain the aggregate feature of each video segment, the processor executes the program instructions to further implement: taking the obtained aggregate feature of each video segment as the new first aggregate feature of each video segment, and performing another inter-segment aggregation on the new first aggregate feature of each video segment and the new first aggregate features of the other video segments to update the aggregate feature of each video segment.
Performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain the aggregate feature of each video segment includes: taking each modality of the plurality of modalities as the current modality, and performing inter-segment aggregation on the segment-level feature of the current modality of each video segment and the segment-level features of the current modality of the other video segments to obtain a second aggregate feature of the current modality of each video segment, so as to obtain second aggregate features of the plurality of modalities of each video segment; and performing inter-modality aggregation on the second aggregate features of the plurality of modalities of each video segment to obtain the aggregate feature of each video segment.
Before performing inter-modality aggregation on the second aggregate features of the plurality of modalities of each video segment to obtain the aggregate feature of each video segment, the processor executes the program instructions to further implement: taking the obtained second aggregate feature of the current modality of each video segment as the new segment-level feature of the current modality of each video segment, and performing another inter-segment aggregation on the new segment-level feature of the current modality of each video segment and the new segment-level features of the current modality of the other video segments to update the second aggregate feature of the current modality of each video segment.
Obtaining the action recognition result of the video to be recognized by using the aggregate features of the video segments includes: combining the aggregate features of the video segments to obtain a video-level feature of the video to be recognized; and predicting the action recognition result using the video-level feature.
Extracting a plurality of segment-level features of a plurality of modalities for each video segment includes: extracting the segment-level features of the plurality of modalities of each video segment through a feature extraction network of an action recognition model. Performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain the aggregate feature of each video segment includes: performing the multi-layer inter-modality/inter-segment aggregation through a feature aggregation network of the action recognition model to obtain the aggregate feature of each video segment. Obtaining the action recognition result of the video to be recognized by using the aggregate features of the video segments includes: processing the aggregate features of the video segments through a result prediction network of the action recognition model to obtain the action recognition result of the video to be recognized.
The plurality of modalities includes an auditory modality and a visual modality; the feature extraction network includes a first extraction network for extracting segment-level features for the visual modality and a second extraction network for extracting segment-level features for the auditory modality; and/or the feature aggregation network includes a bidirectional attention mechanism layer; and/or the result prediction network includes a fully connected layer.
Specifically, the electronic device 80 may include, but is not limited to: computers, servers, etc., again without limitation.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of the various embodiments focuses on the differences between the embodiments; for parts that are the same or similar, the embodiments may be referred to one another, and these are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Claims (19)

1. An action recognition method, comprising:
dividing a video to be recognized into a plurality of video segments;
extracting a plurality of segment-level features of a plurality of modalities for each video segment;
performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of other video segments to obtain an aggregate feature of each video segment; and
obtaining an action recognition result of the video to be recognized by using the aggregate features of each video segment.
2. The method of claim 1, wherein performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of other video segments to obtain the aggregate feature of each video segment comprises:
performing inter-modality aggregation on the segment-level features of the plurality of modalities of each video segment to obtain a first aggregate feature of each video segment; and
performing inter-segment aggregation on the first aggregate feature of each video segment and the first aggregate features of other video segments to obtain the aggregate feature of each video segment.
3. The method of claim 2, wherein after performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of other video segments to obtain the aggregate feature of each video segment, the method further comprises:
taking the obtained aggregate feature of each video segment as a new first aggregate feature of each video segment, and performing another inter-segment aggregation on the new first aggregate feature of each video segment and the new first aggregate features of other video segments to update the aggregate feature of each video segment.
4. The method of claim 1, wherein performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of other video segments to obtain the aggregate feature of each video segment comprises:
taking each modality of the plurality of modalities as a current modality, and performing inter-segment aggregation on the segment-level feature of the current modality of each video segment and the segment-level features of the current modality of other video segments to obtain a second aggregate feature of the current modality of each video segment, so as to obtain second aggregate features of the plurality of modalities of each video segment; and
performing inter-modality aggregation on the second aggregate features of the plurality of modalities of each video segment to obtain the aggregate feature of each video segment.
5. The method of claim 4, wherein before performing inter-modality aggregation on the second aggregate features of the plurality of modalities of each video segment to obtain the aggregate feature of each video segment, the method further comprises:
taking the obtained second aggregate feature of the current modality of each video segment as a new segment-level feature of the current modality of each video segment, and performing another inter-segment aggregation on the new segment-level feature of the current modality of each video segment and the new segment-level features of the current modality of other video segments to update the second aggregate feature of the current modality of each video segment.
6. The method of claim 1, wherein obtaining the action recognition result of the video to be recognized by using the aggregate features of the video segments comprises:
combining the aggregate features of the video segments to obtain a video-level feature of the video to be recognized; and
predicting the action recognition result by using the video-level feature.
7. The method of claim 1, wherein extracting a plurality of segment-level features of a plurality of modalities for each video segment comprises:
extracting the segment-level features of the plurality of modalities of each video segment through a feature extraction network of an action recognition model;
wherein performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain the aggregate feature of each video segment comprises:
performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments through a feature aggregation network of the action recognition model to obtain the aggregate feature of each video segment;
and wherein obtaining the action recognition result of the video to be recognized by using the aggregate features of the video segments comprises:
processing the aggregate features of the video segments through a result prediction network of the action recognition model to obtain the action recognition result of the video to be recognized.
8. The method of claim 7, wherein the plurality of modalities comprises an auditory modality and a visual modality;
the feature extraction network comprises a first extraction network for extracting segment-level features for the visual modality and a second extraction network for extracting segment-level features for the auditory modality;
and/or the feature aggregation network comprises a bidirectional attention mechanism layer;
and/or the result prediction network comprises a fully connected layer.
9. An action recognition device, comprising:
a video dividing module, configured to divide a video to be recognized into a plurality of video segments;
a feature extraction module, configured to extract a plurality of segment-level features of a plurality of modalities for each video segment;
a feature aggregation module, configured to perform multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of other video segments to obtain an aggregate feature of each video segment; and
a result prediction module, configured to obtain an action recognition result of the video to be recognized by using the aggregate features of the video segments.
10. An electronic device, comprising a non-transitory memory and a processor coupled to each other, wherein the non-transitory memory stores program instructions, and the processor executes the program instructions to implement the action recognition method of any one of claims 1 to 8.
11. A non-transitory computer-readable storage medium having program instructions stored therein, wherein a processor executes the program instructions to implement the action recognition method of any one of claims 1 to 8.
12. An electronic device, comprising:
a non-transitory memory and a processor coupled to each other, wherein the non-transitory memory stores program instructions, and the processor executes the program instructions to implement:
dividing a video to be recognized into a plurality of video segments;
extracting a plurality of segment-level features of a plurality of modalities for each video segment;
performing multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of other video segments to obtain an aggregate feature of each video segment; and
obtaining an action recognition result of the video to be recognized by using the aggregate features of each video segment.
13. The electronic device of claim 12, wherein
performing the multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain the aggregation feature of each video segment comprises:
performing inter-modality aggregation on the segment-level features of the plurality of modalities of each video segment to obtain a first aggregation feature of each video segment; and
performing inter-segment aggregation on the first aggregation feature of each video segment and the first aggregation features of the other video segments to obtain the aggregation feature of each video segment.
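One way to realize the ordering in claim 13 (fuse the modalities within each segment first, then let segments exchange information) is sketched below; the linear fusion and the self-attention layer are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ModalityFirstAggregation(nn.Module):
    """Sketch of claim 13: inter-modality aggregation within each segment,
    followed by inter-segment aggregation across all segments."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.inter_modality = nn.Linear(2 * d_model, d_model)   # fuse the two modalities per segment
        self.inter_segment = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, num_segments, d_model)
        first = self.inter_modality(torch.cat([visual, audio], dim=-1))  # first aggregation feature
        # Every segment attends to every other segment (inter-segment aggregation).
        aggregated, _ = self.inter_segment(first, first, first)
        return aggregated
```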
14. The electronic device of claim 13, wherein
after the inter-segment aggregation is performed on the first aggregation feature of each video segment and the first aggregation features of the other video segments to obtain the aggregation feature of each video segment, the processor executes the program instructions to further implement:
taking the obtained aggregation feature of each video segment as a new first aggregation feature of each video segment, and performing another inter-segment aggregation on the new first aggregation feature of each video segment and the new first aggregation features of the other video segments to update the aggregation feature of each video segment.
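Claim 14 repeats the inter-segment aggregation, feeding each round's output back in as the new first aggregation feature. A hypothetical way to express this repetition is a stack of attention layers applied in a loop; the depth num_layers and the layer choice are assumptions.

```python
import torch
import torch.nn as nn

class StackedInterSegmentAggregation(nn.Module):
    """Sketch of claim 14: the aggregation feature produced by one round of
    inter-segment aggregation becomes the input of the next round."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, num_heads, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, first_aggregation: torch.Tensor) -> torch.Tensor:
        feats = first_aggregation                  # (batch, num_segments, d_model)
        for layer in self.layers:
            feats, _ = layer(feats, feats, feats)  # update the aggregation feature of each segment
        return feats
```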
15. The electronic device of claim 12, wherein
performing the multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain the aggregation feature of each video segment comprises:
taking each modality of the plurality of modalities as a current modality, and performing inter-segment aggregation on the segment-level features of the current modality of each video segment and the segment-level features of the current modality of the other video segments to obtain a second aggregation feature of the current modality of each video segment, so as to obtain second aggregation features of the plurality of modalities for each video segment; and
performing inter-modality aggregation on the second aggregation features of the plurality of modalities of each video segment to obtain the aggregation feature of each video segment.
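Claim 15 reverses the order relative to claim 13: each modality is first aggregated across segments on its own, and only then are the per-modality results fused. A minimal sketch, with assumed names and an assumed linear fusion, follows.

```python
import torch
import torch.nn as nn

class SegmentFirstAggregation(nn.Module):
    """Sketch of claim 15: per-modality inter-segment aggregation, then
    inter-modality aggregation of the resulting second aggregation features."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.visual_inter_segment = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.audio_inter_segment = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.inter_modality = nn.Linear(2 * d_model, d_model)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, num_segments, d_model)
        v2, _ = self.visual_inter_segment(visual, visual, visual)  # second aggregation feature (visual)
        a2, _ = self.audio_inter_segment(audio, audio, audio)      # second aggregation feature (auditory)
        # Inter-modality aggregation of the two second aggregation features.
        return self.inter_modality(torch.cat([v2, a2], dim=-1))
```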
16. The electronic device of claim 15, wherein
before the inter-modality aggregation is performed on the second aggregation features of the plurality of modalities of each video segment to obtain the aggregation feature of each video segment, the processor executes the program instructions to further implement:
taking the obtained second aggregation feature of the current modality of each video segment as a new segment-level feature of the current modality of each video segment, and performing another inter-segment aggregation on the new segment-level feature of the current modality of each video segment and the new segment-level features of the current modality of the other video segments to update the second aggregation feature of the current modality of each video segment.
17. The electronic device of claim 12, wherein
obtaining the action recognition result of the video to be recognized by using the aggregation features of the video segments comprises:
concatenating the aggregation features of the video segments to obtain a video-level feature of the video to be recognized; and
predicting the action recognition result by using the video-level feature.
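Claim 17 concatenates the per-segment aggregation features into a single video-level feature before classification. A small sketch is given below, using a freshly initialized fully connected layer as a stand-in for the trained result prediction network; the number of classes is an assumption.

```python
import torch
import torch.nn as nn

def predict_from_aggregated(aggregated: torch.Tensor, num_classes: int = 10) -> torch.Tensor:
    """Sketch of claim 17: concatenate per-segment aggregation features into a
    video-level feature, then classify it with a fully connected layer."""
    batch, num_segments, d_model = aggregated.shape
    video_level = aggregated.reshape(batch, num_segments * d_model)  # concatenation across segments
    classifier = nn.Linear(num_segments * d_model, num_classes)      # stand-in result prediction network
    return classifier(video_level)                                   # action recognition scores

# Example: two videos, each with 4 segments of 512-dimensional aggregation features.
scores = predict_from_aggregated(torch.randn(2, 4, 512))  # -> (2, 10)
```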
18. The electronic device of claim 12, wherein
extracting the segment-level features of the plurality of modalities for each video segment respectively comprises:
extracting the segment-level features of the plurality of modalities of each video segment through a feature extraction network of an action recognition model;
performing the multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments to obtain the aggregation feature of each video segment comprises:
performing the multi-layer inter-modality/inter-segment aggregation on the segment-level features of each video segment and the segment-level features of the other video segments through a feature aggregation network of the action recognition model to obtain the aggregation feature of each video segment;
obtaining the action recognition result of the video to be recognized by using the aggregation features of the video segments comprises:
processing the aggregation features of the video segments through a result prediction network of the action recognition model to obtain the action recognition result of the video to be recognized.
19. The electronic device of claim 18, wherein
the plurality of modalities comprises an auditory modality and a visual modality;
the feature extraction network comprises a first extraction network for extracting segment-level features relating to the visual modality and a second extraction network for extracting segment-level features relating to the auditory modality;
and/or, the feature aggregation network includes a bidirectional attention mechanism layer;
and/or the result prediction network comprises a fully connected layer.
CN202180060722.4A 2020-07-16 2021-06-11 Action recognition method and related device, storage medium Pending CN116635911A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202063052897P 2020-07-16 2020-07-16
US63/052,897 2020-07-16
PCT/CN2021/099709 WO2022012239A1 (en) 2020-07-16 2021-06-11 Action recognition method and related device, storage medium

Publications (1)

Publication Number Publication Date
CN116635911A true CN116635911A (en) 2023-08-22

Family

ID=79556124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180060722.4A Pending CN116635911A (en) 2020-07-16 2021-06-11 Action recognition method and related device, storage medium

Country Status (2)

Country Link
CN (1) CN116635911A (en)
WO (1) WO2022012239A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612524A (en) * 2022-02-07 2023-08-18 北京字跳网络技术有限公司 Action recognition method and device, electronic equipment and storage medium
CN114612769B (en) * 2022-03-14 2023-05-26 电子科技大学 Integrated sensing infrared imaging ship detection method integrated with local structure information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229280B (en) * 2017-04-20 2020-11-13 北京市商汤科技开发有限公司 Time domain action detection method and system, electronic equipment and computer storage medium
CN108108699A (en) * 2017-12-25 2018-06-01 重庆邮电大学 Merge deep neural network model and the human motion recognition method of binary system Hash
CN110032942B (en) * 2019-03-15 2021-10-08 中山大学 Action identification method based on time domain segmentation and feature difference
CN110309784A (en) * 2019-07-02 2019-10-08 北京百度网讯科技有限公司 Action recognition processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2022012239A1 (en) 2022-01-20

Similar Documents

Publication Publication Date Title
CN110751224B (en) Training method of video classification model, video classification method, device and equipment
EP3477506A1 (en) Video detection method, server and storage medium
US20170300744A1 (en) Method and apparatus for determining identity identifier of face in face image, and terminal
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN110909205B (en) Video cover determination method and device, electronic equipment and readable storage medium
KR20210134528A (en) Video processing method, apparatus, electronic device and storage medium and computer program
US11914639B2 (en) Multimedia resource matching method and apparatus, storage medium, and electronic apparatus
CN108874832B (en) Target comment determination method and device
CN108108743B (en) Abnormal user identification method and device for identifying abnormal user
US11625433B2 (en) Method and apparatus for searching video segment, device, and medium
CN106973244A (en) Using it is Weakly supervised for image match somebody with somebody captions
CN112860943A (en) Teaching video auditing method, device, equipment and medium
EP2657884A2 (en) Identifying multimedia objects based on multimedia fingerprint
DE112020004052T5 (en) SEQUENCE MODELS FOR AUDIO SCENE RECOGNITION
EP3725084B1 (en) Deep learning on image frames to generate a summary
CN116635911A (en) Action recognition method and related device, storage medium
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
EP3620982B1 (en) Sample processing method and device
KR20170035892A (en) Recognition of behavioural changes of online services
CN111259243A (en) Parallel recommendation method and system based on session
CN110633594A (en) Target detection method and device
Ghai et al. A deep-learning-based image forgery detection framework for controlling the spread of misinformation
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
JP2023535108A (en) Video tag recommendation model training method, video tag determination method, device, electronic device, storage medium and computer program therefor
CN110248195B (en) Method and apparatus for outputting information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination