CN114627560A - Motion recognition method, motion recognition model training method and related device

Motion recognition method, motion recognition model training method and related device

Info

Publication number
CN114627560A
CN114627560A
Authority
CN
China
Prior art keywords
video
action
motion
network
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210520433.4A
Other languages
Chinese (zh)
Inventor
白云超
熊涛
魏乃科
潘华东
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202210520433.4A
Publication of CN114627560A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/24: Classification techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a motion recognition method, a motion recognition model training method and a related device. The motion recognition method includes: acquiring a video to be recognized containing a motion to be recognized and an action video containing a target action; inputting the action video and the video to be recognized into a motion recognition model, and extracting features of the action video and the video to be recognized respectively through a feature extraction network in the motion recognition model to obtain first feature data and second feature data; and performing cross-correlation processing on the second feature data using the first feature data in a cross-correlation network of the motion recognition model, thereby recognizing whether the motion to be recognized contained in the video to be recognized is the target action. In this way, motion recognition can be achieved by learning from a single action sample only once, which improves recognition efficiency.

Description

Motion recognition method, motion recognition model training method and related device
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a motion recognition method, a motion recognition model training method, and a related apparatus.
Background
In the field of behavior recognition, with the rapid development of technologies such as artificial intelligence and pattern recognition, human behavior recognition has attracted increasing attention. In the security field in particular, effective recognition and analysis of human behavior can prevent and avoid safety accidents, so human behavior recognition is an urgent need. Although classification methods based on deep learning models have raised the accuracy of human action recognition to a new level in recent years, human actions are complex and variable, and training a classifier that covers all actions is very difficult. Whenever a new, special action is added, data must be collected and the model retrained, and this iteration process is complex and cumbersome.
Disclosure of Invention
The invention mainly solves the technical problem of providing a motion recognition method, a motion recognition model training method and a related device, which can realize motion recognition by learning only once from a single action sample, thereby improving recognition efficiency.
In order to solve the above technical problem, one technical scheme adopted by the invention is to provide a motion recognition method, including: acquiring a video to be recognized containing a motion to be recognized and an action video containing a target action; inputting the action video and the video to be recognized into a motion recognition model, and extracting features of the action video and the video to be recognized respectively through a feature extraction network in the motion recognition model to obtain first feature data and second feature data; and performing cross-correlation processing on the second feature data using the first feature data in a cross-correlation network of the motion recognition model, and recognizing whether the motion to be recognized contained in the video to be recognized is the target action.
Performing cross-correlation processing on the second feature data using the first feature data in the cross-correlation network of the motion recognition model includes: determining a convolution kernel based on the size of the first feature data; performing convolution processing on the second feature data using the convolution kernel to obtain a target feature value; and determining whether the motion to be recognized is the target action based on the video segment corresponding to the target feature value in the video to be recognized.
Determining the convolution kernel based on the size of the first feature data includes: pooling the first feature data to change its size, and using the pooled first feature data as the convolution kernel.
Performing convolution processing on the second feature data using the convolution kernel to obtain the target feature value includes: performing convolution processing on the second feature data using the convolution kernel to obtain a cross-correlation feature value; and performing convolution processing on the cross-correlation feature value using the convolution kernel to obtain the target feature value.
The feature extraction network is a twin network, and extracting features of the action video and the video to be recognized using the feature extraction network to obtain the first feature data and the second feature data includes: performing feature extraction on the action video using a first network of the twin network to obtain the first feature data; and performing feature extraction on the video to be recognized using a second network of the twin network to obtain the second feature data.
The first network and the second network each include a pose extraction network and an action extraction network, and performing feature extraction on the action video and the video to be recognized using the feature extraction network includes: inputting the action video and the video to be recognized into the pose extraction network respectively to obtain pose information in the action video and in the video to be recognized; and inputting the pose information of the action video and of the video to be recognized into the action extraction network, performing 3D convolution processing on the pose information, and extracting the pose changes of the pose information to obtain the first feature data and the second feature data.
The pose information in the action video and in the video to be recognized includes pose information of a human body, and the pose information of the human body includes heat maps of human body key points.
In order to solve the above technical problem, another technical scheme adopted by the invention is to provide a motion recognition model training method, including: acquiring a reference learning sample and a plurality of material samples, wherein at least one material sample contains the same action category as the reference learning sample; inputting the reference learning sample and the plurality of material samples into an initial motion recognition model, wherein the initial motion recognition model includes an initial feature extraction network and an initial cross-correlation network, and extracting features of the reference learning sample and the plurality of material samples respectively using the initial feature extraction network to obtain first sample feature data and second sample feature data; performing cross-correlation processing on the second sample feature data using the first sample feature data in the initial cross-correlation network to obtain a sample feature value, and using the sample feature value to identify whether the video to be recognized contains the action category; and adjusting parameters of the initial motion recognition model based on the sample feature value and the action recognition result to obtain a motion recognition model, so as to realize the motion recognition method described above.
Adjusting the parameters based on the sample feature values and the action recognition results includes: obtaining the similarity among the plurality of material samples using the sample feature values and calculating a triplet loss; comparing the action recognition result with the reference learning sample and calculating a cross entropy loss; and updating the parameters of the initial motion recognition model based on the triplet loss and the cross entropy loss.
In order to solve the above technical problem, another technical scheme adopted by the invention is to provide an electronic device including a processor configured to execute instructions to implement the motion recognition method or the motion recognition model training method described above.
In order to solve the above technical problem, another technical scheme adopted by the invention is to provide a computer readable storage medium storing instructions/program data executable to implement the motion recognition method or the motion recognition model training method described above.
The beneficial effects of the invention are: different from the prior art, the method uses an action video as an action sample; after extracting features from the action video and the video to be recognized respectively, it obtains the correlation between the two through cross-correlation processing, directly realizes action classification of the video to be recognized, and obtains the target action. Motion recognition is thus achieved by learning only once from a single action sample, which improves recognition efficiency and accuracy.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a motion recognition method according to the present application;
FIG. 2 is a schematic flow chart diagram of an embodiment of the feature extraction method of the present application;
FIG. 3 is a heat map of human body key points according to the present application;
FIG. 4 is a block diagram of the action extraction network of the present application;
FIG. 5 is a schematic flow chart diagram illustrating another embodiment of a motion recognition method of the present application;
FIG. 6 is a block diagram of a motion recognition model of the present application;
FIG. 7 is a schematic diagram of the structure of the cross-correlation network of the present application;
FIG. 8 is a schematic flow chart diagram illustrating an embodiment of a motion recognition model training method according to the present application;
FIG. 9 is a schematic illustration of the location of the loss function of the present application;
FIG. 10 is a schematic structural diagram of a motion recognition device according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
The present application provides a motion recognition method that uses an action video as an action sample. After extracting features from the action video and the video to be recognized respectively, the method performs cross-correlation processing on the video to be recognized using the action video, obtains the correlation between the two, directly realizes action classification of the video to be recognized, and obtains the target action. Motion recognition is thus achieved by learning only once from a single action sample, which improves recognition efficiency and accuracy.
Referring to FIG. 1, FIG. 1 is a schematic flow chart of an embodiment of a motion recognition method according to the present application. It should be noted that, as long as substantially the same result is obtained, the method is not limited to the flow sequence shown in FIG. 1. As shown in FIG. 1, the present embodiment includes:
S110: Acquire a video to be recognized containing a motion to be recognized and an action video containing a target action.
The video to be recognized contains one or more types of motions to be recognized, which need to be identified from the video. To this end, an action video containing the target action is obtained and used as a reference video for the video to be recognized; the motion to be recognized is identified by analyzing the video to be recognized against the action video.
S130: Input the action video and the video to be recognized into the motion recognition model, and extract features of the action video and the video to be recognized respectively through a feature extraction network in the motion recognition model to obtain first feature data and second feature data.
The present application establishes a motion recognition model that includes a feature extraction network and a cross-correlation network. The feature extraction network extracts the motion features in the action video and in the video to be recognized respectively, for subsequent processing by the cross-correlation network. Specifically, the feature extraction network performs feature extraction on the action video to obtain the first feature data, and performs feature extraction on the video to be recognized to obtain the second feature data.
S150: Perform cross-correlation processing on the second feature data using the first feature data in a cross-correlation network of the motion recognition model, and recognize whether the motion to be recognized contained in the video to be recognized is the target action.
The first feature data and the second feature data are input into the cross-correlation network, and cross-correlation processing is performed on the second feature data with the first feature data as the reference: the two are compared, and the cross-correlation operation yields the degree of correlation between each part of the second feature data and the first feature data. In this way, whether the motion to be recognized in the video to be recognized is the target action can be identified, and the video segment corresponding to the second feature data with a high degree of correlation is selected to obtain the target action in the video to be recognized.
In this embodiment, an action video is used as the action sample. After features are extracted from the action video and the video to be recognized respectively, cross-correlation processing yields the correlation between the two, directly realizing action classification of the video to be recognized and obtaining the target action. Motion recognition is achieved by learning only once from a single action sample, which improves recognition efficiency and accuracy. A minimal sketch of this overall structure follows.
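Assuming a PyTorch-style implementation (an assumption; the patent specifies no code), the model described above might be organized as follows. The module names and the use of one shared-weight extractor standing in for both twin branches are illustrative only.

    import torch
    import torch.nn as nn

    class MotionRecognitionModel(nn.Module):
        """Sketch: twin feature extraction followed by cross-correlation."""

        def __init__(self, feature_extractor: nn.Module, cross_corr: nn.Module):
            super().__init__()
            # One shared-weight extractor plays the role of the two twin branches.
            self.feature_extractor = feature_extractor
            self.cross_corr = cross_corr

        def forward(self, action_video: torch.Tensor, query_video: torch.Tensor) -> torch.Tensor:
            f1 = self.feature_extractor(action_video)  # first feature data (reference action)
            f2 = self.feature_extractor(query_video)   # second feature data (video to be recognized)
            # Cross-correlate the query features against the reference features.
            return self.cross_corr(f1, f2)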
The action video and the video to be recognized may be videos of human or animal motion; typically, they are human action videos. The feature extraction network applies the same feature extraction processing to the action video and to the video to be recognized. In one embodiment, the feature extraction network is any neural network capable of such processing, and the same network can process the action video and the video to be recognized in two separate passes. In another embodiment, the feature extraction network is a twin network, and the two branch networks of the twin network extract features from the action video and the video to be recognized respectively: the first network of the twin network performs feature extraction on the action video to obtain the first feature data, and the second network of the twin network performs feature extraction on the video to be recognized to obtain the second feature data. Specifically, please refer to FIG. 2, which is a schematic flow chart of an embodiment of the feature extraction method of the present application. It should be noted that, as long as substantially the same result is obtained, the method is not limited to the flow sequence shown in FIG. 2. As shown in FIG. 2, the present embodiment includes:
S210: Input the action video into the pose extraction network of the first network to obtain the pose information in the action video.
The twin network includes two branch networks, a first network and a second network, each of which includes a pose extraction network and an action extraction network. The pose extraction network is a human pose estimation network, used to distill the video information and obtain the information most relevant to the motion state of the human body. In one embodiment, the pose extraction network consists of Resnet34 followed by three deconvolution layers and extracts heat maps of the detected human key points. Please refer to FIG. 3, which is a heat map of human body key points according to the present application. In the present application, a heat map is a Gaussian distribution of pixel values with a standard deviation of 1, centered at each human key point coordinate and taking its maximum at the center. A color image contains too much redundant information, which makes the model computation heavy, whereas bare key point coordinates contain too little information and lack robustness; moreover, although the joint points are strongly correlated, directly regressing the value of each coordinate point cannot effectively capture and exploit this correlation. The present application therefore uses human body heat maps as the pose information of the human body. The pose extraction network outputs a feature map sequence of fixed size, characterized by the number of feature channels (C), the temporal length (T), the feature map height (H) and the feature map width (W).
S230: Input the pose information of the action video into the action extraction network of the first network, perform 3D convolution processing on the pose information, and extract the pose changes of the pose information to obtain the first feature data.
The action extraction network extracts the motion feature information of the human body on the basis of the pose information. The pose extraction network yields, for each frame of the action video, pose data in the spatial dimensions, characterized by the number of feature channels, temporal length, feature map height and feature map width. This pose information is input into the action extraction network, 3D convolution over the added time dimension is applied, and the temporal change of the pose information is extracted to obtain the first feature data. Specifically, referring to FIG. 4, which is a structural diagram of the action extraction network of the present application: in one embodiment, pose information with a channel count (C) of 17, a temporal length (T) of 32, a feature map height (H) of 64 and a feature map width (W) of 48 is obtained, and 3D convolution is performed on this 17 x 32 x 64 x 48 pose information, yielding features with a channel count (C) of 32 and unchanged T, H and W; 3D convolution is then performed again on these 32 x 32 x 64 x 48 features. The number of 3D convolutions is not specifically limited in this embodiment. The convolution kernels of these 3D convolutions are learnable. A code sketch of this feature extraction stage is given after the step descriptions below.
S250: Input the video to be recognized into the pose extraction network of the second network to obtain the pose information of the human body in the video to be recognized.
S270: Input the pose information of the video to be recognized into the action extraction network of the second network, perform 3D convolution processing on the pose information, and extract the pose changes of the pose information to obtain the second feature data.
Likewise, the second network includes a pose extraction network and an action extraction network. Because the first network and the second network are twin networks, the second network processes the video to be recognized in the same way as the first network processes the action video, and the details are not repeated here.
The order of the above steps S210, S230, S250 and S270 is not limited; steps S250 and S270 may be performed before steps S210 and S230.
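As a concrete illustration (assumed PyTorch, not the patent's code), the sketch below renders Gaussian key-point heat maps as in step S210 and passes a sequence of them through a small 3D-convolutional action extractor as in step S230. The 3 x 3 x 3 kernels, the ReLU activations and the second convolution's 32-channel output width are assumptions; the description fixes only the 17-channel input, the 32-channel intermediate and the 64 x 48 map size.

    import torch
    import torch.nn as nn

    def keypoint_heatmaps(keypoints, height=64, width=48, sigma=1.0):
        """Render one Gaussian heat map per human key point (peak 1.0 at the point)."""
        ys = torch.arange(height).view(height, 1).float()
        xs = torch.arange(width).view(1, width).float()
        maps = []
        for (x, y) in keypoints:  # (x, y) pixel coordinates of one key point
            g = torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            maps.append(g)
        return torch.stack(maps, dim=0)  # (num_keypoints, H, W)

    # Action extraction stage: learnable 3D convolutions over (C, T, H, W) pose sequences.
    action_extractor = nn.Sequential(
        nn.Conv3d(17, 32, kernel_size=3, padding=1),  # 17x32x64x48 -> 32x32x64x48
        nn.ReLU(inplace=True),
        nn.Conv3d(32, 32, kernel_size=3, padding=1),  # second 3D convolution
        nn.ReLU(inplace=True),
    )

    # 17 key-point heat maps per frame, stacked over 32 frames: (N, C, T, H, W).
    frames = [keypoint_heatmaps([(24.0, 32.0)] * 17) for _ in range(32)]
    pose_seq = torch.stack(frames, dim=1).unsqueeze(0)  # (1, 17, 32, 64, 48)
    features = action_extractor(pose_seq)               # (1, 32, 32, 64, 48)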
In this embodiment, a twin network is used as the feature extraction network, and the behavior is first represented by human body heat maps, which retains a large amount of information about human motion postures while reducing the redundancy of spatial information. Further feature extraction in the time dimension is then applied to the extracted pose information by 3D convolution to obtain pose change features. This improves the accuracy of feature extraction and supports the subsequent cross-correlation processing.
Referring to FIG. 5, FIG. 5 is a schematic flow chart of another embodiment of the motion recognition method of the present application. It should be noted that, as long as substantially the same result is obtained, the method is not limited to the flow sequence shown in FIG. 5. As shown in FIG. 5, the present embodiment includes:
S510: Acquire a video to be recognized containing a motion to be recognized and an action video containing a target action.
S530: Input the action video and the video to be recognized into the motion recognition model, and extract features of the action video and the video to be recognized respectively through a feature extraction network in the motion recognition model to obtain first feature data and second feature data.
Referring to FIG. 6, FIG. 6 is a structural diagram of a motion recognition model according to the present application. The motion recognition model includes a twin network and a cross-correlation network. The action video and the video to be recognized are input into the two branches of the twin network of the motion recognition model respectively, and features are extracted from each using the feature extraction method described above to obtain the first feature data and the second feature data.
S550: Determine a convolution kernel based on the size of the first feature data.
Because the first feature data and the second feature data are output by the same model, their sizes differ little. For the subsequent cross-correlation processing, the first feature data is therefore first pooled to change its size, and the pooled first feature data is used as the convolution kernel.
S570: Perform convolution processing on the second feature data using the convolution kernel to obtain a target feature value.
Cross-correlation processing is performed on the second feature data using the first feature data; in this embodiment, the first feature data serves as the convolution kernel of a convolution over the second feature data. Specifically, the pooled first feature data is used as the convolution kernel, and 3D convolution processing is applied to the second feature data to update and learn the second feature data, obtaining the target feature value. The 3D convolution in the cross-correlation network differs from the 3D convolution in the action extraction network: in the cross-correlation network, the convolution kernel is a fixed parameter, namely the pooled first feature data. The second feature data of the video to be recognized is convolved with the first feature data of the action video so as to update the features of the video to be recognized; where the action in the video to be recognized matches the action in the action video, the corresponding target feature value becomes smaller. The operation of the 3D convolution in the cross-correlation network can be written as:
F(out) = F * K
where F denotes the second feature data, K denotes the convolution kernel, that is, the pooled first feature data, F(out) denotes the output target feature value, the operation * denotes 3D convolution, and the operation is applied over the n feature layers. The cross-correlation network uses the known features to update the unknown features and screens the unknown information by exploiting the fact that identical features are more similar to each other, thereby recognizing the action through a single pass of learning. A code sketch of this operation follows.
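One way to realize this fixed-kernel 3D cross-correlation is sketched below in PyTorch (an assumption; the patent provides no code). The description only states that the pooled first feature data serves as the convolution kernel, so the 3 x 3 x 3 kernel size and the folding of the pooled features into an (out_channels, in_channels, kT, kH, kW) weight are illustrative choices; the 64- and 128-channel outputs anticipate the two-convolution embodiment described next.

    import torch
    import torch.nn.functional as F

    def cross_correlate(f1: torch.Tensor, f2: torch.Tensor,
                        out_channels: int, k=(3, 3, 3)) -> torch.Tensor:
        """Convolve f2 with a fixed (non-learned) kernel derived from pooled f1.

        f1, f2: feature tensors of shape (1, C, T, H, W).
        """
        # Pool the reference features down to the kernel's spatio-temporal size.
        kernel = F.adaptive_avg_pool3d(f1, k)          # (1, C1, kT, kH, kW)
        # Collapse channels, then broadcast to the required weight shape
        # (this channel folding is an assumption, see above).
        kernel = kernel.mean(dim=1, keepdim=True)      # (1, 1, kT, kH, kW)
        weight = kernel.expand(out_channels, f2.shape[1], *k).contiguous()
        pad = tuple(s // 2 for s in k)                 # keep T, H, W unchanged
        return F.conv3d(f2, weight, padding=pad)

    f1 = torch.randn(1, 32, 32, 64, 48)  # first feature data (reference action)
    f2 = torch.randn(1, 32, 32, 64, 48)  # second feature data (query video)
    corr = cross_correlate(f1, f2, out_channels=64)       # (1, 64, 32, 64, 48)
    target = cross_correlate(f1, corr, out_channels=128)  # (1, 128, 32, 64, 48)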
In this embodiment, the 3D convolution operation may be performed once or multiple times. Referring to FIG. 7, which is a schematic structural diagram of the cross-correlation network of the present application: as shown in FIG. 7, two 3D convolution operations are performed in this embodiment. The pooled first feature data is used as the convolution kernel, and 3D convolution processing is applied to the second feature data to update and learn it, obtaining cross-correlation feature values with a channel count (C) of 64, a temporal length (T) of 32, a feature map height (H) of 64 and a feature map width (W) of 48. The pooled first feature data is then again used as the convolution kernel, and 3D convolution processing is applied to the cross-correlation feature values to update and learn them, obtaining target feature values with a channel count (C) of 128, a temporal length (T) of 32, a feature map height (H) of 64 and a feature map width (W) of 48.
S590: Determine whether the motion to be recognized is the target action based on the video segment corresponding to the target feature value in the video to be recognized.
The target feature values are subjected to average pooling and normalization, where the normalization uses softmax activation. The values are mapped back to the video segments of the video to be recognized according to their magnitudes, and the video segment with the smaller target feature value is selected as the target action in the video to be recognized. A sketch of this scoring step follows.
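A minimal sketch of the scoring step under the same PyTorch assumptions: pooling the target feature values over channels and space to one value per time step is an illustrative choice, softmax normalizes the values, and the segment with the smaller score is picked, following the description above.

    import torch
    import torch.nn.functional as F

    def pick_target_segment(target: torch.Tensor) -> int:
        """target: (1, C, T, H, W) target feature values; returns a time index."""
        pooled = target.mean(dim=(1, 3, 4))   # average pooling -> (1, T)
        scores = F.softmax(pooled, dim=-1)    # normalization -> (1, T)
        # The description selects the segment with the smaller target value.
        return int(scores.argmin(dim=-1).item())

    segment = pick_target_segment(torch.randn(1, 128, 32, 64, 48))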
In this embodiment, an action video is used as the action sample. After features are extracted from the action video and the video to be recognized through the twin network, cross-correlation processing yields the correlation between the two, directly realizing action classification of the video to be recognized and obtaining the target action: the known features are used to update the unknown features, and the unknown information is screened by exploiting the fact that identical features are more similar to each other, so that action recognition is achieved through a single pass of learning. Moreover, because the convolution kernels are obtained by learning and differ for different actions, the efficiency and accuracy of action recognition are higher.
Referring to FIG. 8, FIG. 8 is a schematic flow chart of an embodiment of a motion recognition model training method according to the present application. It should be noted that, as long as substantially the same result is obtained, the method is not limited to the flow sequence shown in FIG. 8. As shown in FIG. 8, the present embodiment includes:
S810: Obtain a reference learning sample and a plurality of material samples.
A reference learning sample and a plurality of material samples are obtained. The reference learning sample contains the action categories to be recognized, with one piece of video material per category; the material samples cover multiple action categories, with several pieces of video material per category. At least one material sample contains the same action category as the reference learning sample.
S830: Input the reference learning sample and the plurality of material samples into an initial motion recognition model, wherein the initial motion recognition model includes an initial feature extraction network and an initial cross-correlation network, and extract features of the reference learning sample and the plurality of material samples respectively using the initial feature extraction network to obtain first sample feature data and second sample feature data.
An initial motion recognition model including an initial feature extraction network and an initial cross-correlation network is established. The reference learning sample and the plurality of material samples are input into the initial motion recognition model, and feature extraction is performed with the initial feature extraction network to obtain the first sample feature data and the second sample feature data respectively.
S850: Perform cross-correlation processing on the second sample feature data using the first sample feature data in the initial cross-correlation network to obtain a sample feature value, and use the sample feature value to identify whether the video to be recognized contains the action category.
A series of convolution operations is performed with the initial cross-correlation network to obtain the sample feature values, and the resulting features are normalized, where the normalization uses a softmax operation. The sample feature values are used to identify whether the video to be recognized contains the target action, yielding the action recognition result.
S870: Adjust parameters of the initial motion recognition model based on the sample feature value and the action recognition result to obtain the motion recognition model.
In the embodiment of the present application, the parameters of the initial motion recognition model are adjusted using two loss functions: a triplet loss and a cross entropy loss. Triplet Loss is a loss function in deep learning whose training data comprise an anchor example, a positive example and a negative example; similarity between samples is learned by optimizing the distance between the anchor and the positive example to be smaller than the distance between the anchor and the negative example. Specifically, please refer to FIG. 9, which is a schematic illustration of the location of the loss functions of the present application. The features output by the model before softmax are used to calculate the triplet loss function: the similarity among the plurality of material samples is obtained from the sample feature values, and the triplet loss is calculated. After softmax, a cross entropy loss function is attached: the action recognition result is compared with the reference learning sample, and the cross entropy loss is calculated. The triplet loss and the cross entropy loss are combined as the total loss of the model, and the total loss is used to update the parameters of the initial motion recognition model. A sketch of this combined objective follows.
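A minimal sketch of the training objective, assuming PyTorch; the margin, the equal weighting of the two terms, the dictionary layout of the samples and the random sampling policy are all assumptions for illustration. Note that PyTorch's CrossEntropyLoss takes raw pre-softmax logits and applies log-softmax internally, which matches attaching the cross entropy after the softmax stage described above.

    import random
    import torch
    import torch.nn as nn

    triplet_loss = nn.TripletMarginLoss(margin=1.0)  # margin value is an assumption
    ce_loss = nn.CrossEntropyLoss()  # expects pre-softmax logits and integer labels

    def sample_triplet(reference: dict, materials: dict):
        """Draw (anchor, positive, negative) clips: the anchor comes from the
        reference learning sample, the positive from material samples of the same
        action category, the negative from a different category."""
        category = random.choice(list(reference))
        anchor = reference[category]
        positive = random.choice(materials[category])
        other = random.choice([c for c in materials if c != category])
        negative = random.choice(materials[other])
        return anchor, positive, negative

    def total_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                   negative_emb: torch.Tensor, logits: torch.Tensor,
                   labels: torch.Tensor) -> torch.Tensor:
        """Embeddings are taken before softmax; logits feed the cross entropy."""
        return triplet_loss(anchor_emb, positive_emb, negative_emb) + ce_loss(logits, labels)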
In this embodiment, the triplet loss makes the feature representations of same-category samples in the video to be recognized as close as possible and those of different-category samples as far apart as possible through learning; that is, by optimizing the values of the network's output feature layer, the distances between feature values of different categories are enlarged and the distances between feature values of the same category are reduced. The cross entropy loss, combined with the true labels, then further refines the output values, improving the accuracy of action recognition.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a motion recognition device according to an embodiment of the present disclosure. In this embodiment, the motion recognition apparatus includes an acquisition module 01, a recognition module 02, and a cross-correlation module 03.
The acquisition module 01 is used to acquire a video to be recognized containing a motion to be recognized and an action video containing a target action. The recognition module 02 is used to input the action video and the video to be recognized into the motion recognition model and to extract features of each through a feature extraction network in the motion recognition model to obtain first feature data and second feature data. The cross-correlation module 03 is used to perform cross-correlation processing on the second feature data using the first feature data in a cross-correlation network of the motion recognition model and to recognize whether the motion to be recognized contained in the video to be recognized is the target action. The motion recognition device uses an action video as an action sample; after extracting features from the action video and the video to be recognized respectively, it obtains the correlation between the two through cross-correlation processing, directly realizes action classification of the video to be recognized, and obtains the target action. Motion recognition is thus achieved by learning only once from a single action sample, which improves recognition efficiency and accuracy.
Referring to FIG. 11, FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. In this embodiment, the electronic device 11 includes a processor 12.
The processor 12 may also be referred to as a CPU (Central Processing Unit). The processor 12 may be an integrated circuit chip having signal processing capabilities. The processor 12 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor 12 may be any conventional processor or the like.
The electronic device 11 may further include a memory (not shown) for storing instructions and data required for the processor 12 to operate.
The processor 12 is configured to execute instructions to implement the methods provided by any of the embodiments of the motion recognition method or motion recognition model training method of the present application, and any non-conflicting combinations thereof.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present disclosure. The computer readable storage medium 21 of the embodiments of the present application stores instructions/program data 22, which instructions/program data 22, when executed, implement the methods provided by any of the embodiments of the motion recognition method or motion recognition model training method of the present application, as well as any non-conflicting combinations. The instructions/program data 22 may form a program file stored in the storage medium 21 in the form of a software product, so that a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) executes all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium 21 includes: various media capable of storing program codes, such as a usb disk, a mobile hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices, such as a computer, a server, a mobile phone, and a tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above description is only an embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present specification and the attached drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (11)

1. A method of motion recognition, the method comprising:
acquiring a video to be recognized containing a motion to be recognized and an action video containing a target action;
inputting the action video and the video to be recognized into a motion recognition model, and extracting features of the action video and the video to be recognized respectively through a feature extraction network in the motion recognition model to obtain first feature data and second feature data; and
performing cross-correlation processing on the second feature data by using the first feature data in a cross-correlation network of the motion recognition model, and recognizing whether the motion to be recognized contained in the video to be recognized is the target action.
2. The motion recognition method according to claim 1, wherein the performing cross-correlation processing on the second feature data by using the first feature data in the cross-correlation network of the motion recognition model comprises:
determining a convolution kernel based on a size of the first feature data;
performing convolution processing on the second feature data by using the convolution kernel to obtain a target feature value; and
determining whether the motion to be recognized is the target action based on a video segment corresponding to the target feature value in the video to be recognized.
3. The motion recognition method according to claim 2, wherein the determining a convolution kernel based on the size of the first feature data comprises:
pooling the first feature data to change the size of the first feature data, and using the pooled first feature data as the convolution kernel.
4. The motion recognition method according to claim 2, wherein the performing convolution processing on the second feature data by using the convolution kernel to obtain a target feature value comprises:
performing convolution processing on the second feature data by using the convolution kernel to obtain a cross-correlation feature value; and
performing convolution processing on the cross-correlation feature value by using the convolution kernel to obtain the target feature value.
5. The motion recognition method according to claim 1, wherein the feature extraction network is a twin network, and the extracting features of the action video and the video to be recognized by using the feature extraction network to obtain the first feature data and the second feature data comprises:
performing feature extraction on the action video by using a first network of the twin network to obtain the first feature data; and
performing feature extraction on the video to be recognized by using a second network of the twin network to obtain the second feature data.
6. The motion recognition method according to claim 5, wherein the first network and the second network each comprise a pose extraction network and an action extraction network, and the performing feature extraction on the action video and the video to be recognized by using the feature extraction network comprises:
inputting the action video and the video to be recognized into the pose extraction network respectively to obtain pose information in the action video and the video to be recognized; and
inputting the pose information in the action video and the video to be recognized into the action extraction network, performing 3D convolution processing on the pose information, and extracting pose changes of the pose information to obtain the first feature data and the second feature data.
7. The motion recognition method according to claim 6, wherein the pose information in the action video and the video to be recognized comprises pose information of a human body, and the pose information of the human body comprises a heat map of human body key points.
8. A method for motion recognition model training, the method comprising:
acquiring a reference learning sample and a plurality of material samples, wherein at least one material sample contains the same action category as the reference learning sample;
inputting the reference learning sample and the plurality of material samples into an initial motion recognition model, wherein the initial motion recognition model comprises an initial feature extraction network and an initial cross-correlation network, and extracting features of the reference learning sample and the plurality of material samples respectively by using the initial feature extraction network to obtain first sample feature data and second sample feature data;
performing cross-correlation processing on the second sample feature data by using the first sample feature data in the initial cross-correlation network to obtain a sample feature value, and identifying whether the video to be recognized contains the action category by using the sample feature value; and
adjusting parameters of the initial motion recognition model based on the sample feature value and the action recognition result to obtain a motion recognition model.
9. The motion recognition model training method according to claim 8, wherein the adjusting the parameters of the initial motion recognition model based on the sample feature values and the action recognition results comprises:
obtaining the similarity among the plurality of material samples by using the sample feature values, and calculating a triplet loss;
comparing the action recognition result with the reference learning sample, and calculating a cross entropy loss; and
updating the parameters of the initial motion recognition model based on the triplet loss and the cross entropy loss.
10. An electronic device comprising a processor for executing instructions to implement the motion recognition method of any of claims 1-7 or the motion recognition model training method of any of claims 8-9.
11. A computer-readable storage medium for storing instructions executable to implement the motion recognition method of any one of claims 1-7 or the motion recognition model training method of any one of claims 8-9.
CN202210520433.4A 2022-05-13 2022-05-13 Motion recognition method, motion recognition model training method and related device Pending CN114627560A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210520433.4A CN114627560A (en) 2022-05-13 2022-05-13 Motion recognition method, motion recognition model training method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210520433.4A CN114627560A (en) 2022-05-13 2022-05-13 Motion recognition method, motion recognition model training method and related device

Publications (1)

Publication Number Publication Date
CN114627560A 2022-06-14

Family

ID=81907373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210520433.4A Pending CN114627560A (en) 2022-05-13 2022-05-13 Motion recognition method, motion recognition model training method and related device

Country Status (1)

Country Link
CN (1) CN114627560A (en)


Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107786848A (en) * 2017-10-30 2018-03-09 周燕红 The method, apparatus of moving object detection and action recognition, terminal and storage medium
CN108256433A (en) * 2017-12-22 2018-07-06 银河水滴科技(北京)有限公司 A kind of athletic posture appraisal procedure and system
CN109101876A (en) * 2018-06-28 2018-12-28 东北电力大学 Human bodys' response method based on long memory network in short-term
CN110363140A (en) * 2019-07-15 2019-10-22 成都理工大学 A kind of human action real-time identification method based on infrared image
CN110674837A (en) * 2019-08-15 2020-01-10 深圳壹账通智能科技有限公司 Video similarity obtaining method and device, computer equipment and storage medium
US20210405759A1 (en) * 2020-06-24 2021-12-30 AR & NS Investment, LLC Cross-correlation system and method for spatial detection using a network of RF repeaters
CN111767881A (en) * 2020-07-06 2020-10-13 中兴飞流信息科技有限公司 Self-adaptive crowd density estimation device based on AI technology
CN112085717A (en) * 2020-09-04 2020-12-15 厦门大学 Video prediction method and system for laparoscopic surgery
CN112132866A (en) * 2020-09-22 2020-12-25 厦门大学 Target object tracking method, device and equipment and computer readable storage medium
CN112101326A (en) * 2020-11-18 2020-12-18 北京健康有益科技有限公司 Multi-person posture recognition method and device
CN112906520A (en) * 2021-02-04 2021-06-04 中国科学院软件研究所 Gesture coding-based action recognition method and device
CN112950675A (en) * 2021-03-18 2021-06-11 深圳市商汤科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN113052061A (en) * 2021-03-22 2021-06-29 中国石油大学(华东) Speed skating athlete motion identification method based on human body posture estimation
CN112927172A (en) * 2021-05-10 2021-06-08 北京市商汤科技开发有限公司 Training method and device of image processing network, electronic equipment and storage medium
CN113052151A (en) * 2021-06-01 2021-06-29 四川泓宝润业工程技术有限公司 Unmanned aerial vehicle automatic landing guiding method based on computer vision
CN113569805A (en) * 2021-08-13 2021-10-29 北京建筑大学 Action recognition method and device, electronic equipment and storage medium
CN113850189A (en) * 2021-09-26 2021-12-28 北京航空航天大学 Embedded twin network real-time tracking method applied to maneuvering platform
CN114240811A (en) * 2021-11-29 2022-03-25 浙江大学 Method for generating new image based on multiple images

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Song Peng et al., "Adaptive Siamese network tracking algorithm with holistic feature channel recognition", Journal of Zhejiang University (Engineering Science) *
Luo Yihang et al., "Lightweight high-speed tracking algorithm based on Siamese network", Electronics Optics & Control *
Wang Ping et al., "A fall detection method based on human body posture in video", Modern Electronics Technique *
Hu Jing, "Pose-based dance action recognition", China Excellent Master's Theses Full-text Database, Information Science and Technology *

Similar Documents

Publication Publication Date Title
US20210012198A1 (en) Method for training deep neural network and apparatus
WO2019100724A1 (en) Method and device for training multi-label classification model
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN107529650B (en) Closed loop detection method and device and computer equipment
Qin et al. Saliency detection via cellular automata
CN108681746B (en) Image identification method and device, electronic equipment and computer readable medium
WO2017088432A1 (en) Image recognition method and device
CN108288051B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN111523621A (en) Image recognition method and device, computer equipment and storage medium
JP7414901B2 (en) Living body detection model training method and device, living body detection method and device, electronic equipment, storage medium, and computer program
CN112395979B (en) Image-based health state identification method, device, equipment and storage medium
Xia et al. Loop closure detection for visual SLAM using PCANet features
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
CN107578007A (en) A kind of deep learning face identification method based on multi-feature fusion
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN110222718B (en) Image processing method and device
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
WO2022160591A1 (en) Crowd behavior detection method and apparatus, and electronic device, storage medium and computer program product
CN112529068B (en) Multi-view image classification method, system, computer equipment and storage medium
CN112836625A (en) Face living body detection method and device and electronic equipment
CN111783997B (en) Data processing method, device and equipment
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN111340213B (en) Neural network training method, electronic device, and storage medium
Neggaz et al. An Intelligent handcrafted feature selection using Archimedes optimization algorithm for facial analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220614)