CN117593789A - Action recognition model training method, device and equipment based on pre-training - Google Patents

Action recognition model training method, device and equipment based on pre-training

Info

Publication number
CN117593789A
Authority
CN
China
Prior art keywords
video data
training
feature extraction
sample
spatial
Prior art date
Legal status
Pending
Application number
CN202311519700.7A
Other languages
Chinese (zh)
Inventor
许铁
杨冬平
林峰
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority: CN202311519700.7A
Publication of CN117593789A
Legal status: Pending


Classifications

    • G (Physics) > G06 (Computing; Calculating or Counting) > G06V (Image or video recognition or understanding)
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data > G06V 40/20 Movements or behaviour, e.g. gesture recognition > G06V 40/23 Recognition of whole body movements, e.g. for sport training
    • G06V 10/20 Image preprocessing > G06V 10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/40 Extraction of image or video features > G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/70 Recognition or understanding using pattern recognition or machine learning > G06V 10/82 Using neural networks
    • G06V 20/40 Scenes; scene-specific elements in video content > G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/40 Scenes; scene-specific elements in video content > G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y02T (Climate change mitigation technologies related to transportation) > Y02T 10/10 Internal combustion engine [ICE] based vehicles > Y02T 10/40 Engine management systems


Abstract

A pre-training-based action recognition model training method, apparatus, and device. Spatial data for training a spatial feature extraction model and temporal data for training a temporal feature extraction model are generated, and the two models are pre-trained on them; collected video data is then determined as a third sample, the action type corresponding to the video data is determined as the label of the third sample, and the action recognition model is trained. Because the encoding layer of the action recognition model is built from a reservoir network group composed of two reservoir network models and is pre-trained, the pre-trained action recognition model generalizes better in both temporal and spatial feature extraction; after the pre-trained model is further trained, the trained action recognition model can recognize action types from multi-channel features, so the accuracy of action type recognition is higher.

Description

Action recognition model training method, device and equipment based on pre-training
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method, apparatus, and device for training a motion recognition model based on pre-training.
Background
As the field of artificial intelligence develops, ever higher accuracy is demanded of action recognition models. However, training an existing action recognition model usually requires a large number of training samples, so training is costly.
In the prior art, to cope with small sample sizes and excessive training cost, reservoir computing has been widely applied to the training of action recognition models because of its fast learning, low training cost, and similar characteristics.
However, when an action recognition model is trained with reservoir computing, the inductive bias of reservoir computing is insufficient, so a model trained with reservoir computing alone generalizes poorly and its accuracy after training is low. Accordingly, the present specification provides a method, apparatus, and device for training an action recognition model based on pre-training.
Disclosure of Invention
The present disclosure provides a method, an apparatus, and a device for training a motion recognition model based on pre-training, so as to partially solve the above-mentioned problems in the prior art.
The technical solution adopted in this specification is as follows:
This specification provides a pre-training-based action recognition model training method. The action recognition model comprises an encoding layer and a decoding layer, the encoding layer comprises a spatial feature extraction model to be trained and a temporal feature extraction model to be trained, and both models to be trained are reservoir networks. The method comprises the following steps:
generating, in response to a model training instruction, spatial data for training the spatial feature extraction model and temporal data for training the temporal feature extraction model, the spatial data and the temporal data each consisting of a plurality of video data segments that are continuous in content;
determining a first sample and a label of the first sample from adjacent video data segments in the spatial data and pre-training the spatial feature extraction model, and determining a second sample and a label of the second sample from adjacent video data segments in the temporal data and pre-training the temporal feature extraction model;
determining collected video data as a third sample, and determining the action type corresponding to the video data as the label of the third sample;
and inputting the third sample into the encoding layer of the action recognition model, formed by the pre-trained spatial feature extraction model and the pre-trained temporal feature extraction model, to determine spatial features and temporal features, inputting the spatial features and the temporal features into the decoding layer of the action recognition model to determine a predicted action type, and training the action recognition model according to the predicted action type and the label of the third sample, the trained action recognition model being used to recognize the action type corresponding to input video data.
Optionally, generating the spatial data for training the spatial feature extraction model specifically includes:
determining a spatial motion trajectory in a video space;
generating a plurality of motion points in the video space according to the start position of the spatial motion trajectory, the center of the Gaussian distribution of the motion points being the start position;
generating video data of the continuous motion of the plurality of motion points by moving the center of the Gaussian distribution along the spatial motion trajectory;
and segmenting the video data of continuous motion when the position of the center of the Gaussian distribution changes, to determine a plurality of continuous video data segments as the spatial data for training the spatial feature extraction model.
Optionally, determining a first sample and a label of the first sample from adjacent video data segments in the spatial data and pre-training the spatial feature extraction model specifically includes:
determining adjacent video data segments in the spatial data, and taking the earlier of the adjacent video data segments as the first sample;
determining, from the other video data segment, the coordinates of the center of the Gaussian distribution of the moving points in that segment as the label corresponding to the first sample;
inputting the first sample into the spatial feature extraction model, and taking the coordinates of the center of the Gaussian distribution of the moving points in the predicted next video segment output by the spatial feature extraction model as a first output result;
and determining a loss from the difference between the first output result and the label corresponding to the first sample, and pre-training the spatial feature extraction model with reducing the loss as the optimization objective.
Optionally, generating the temporal data for training the temporal feature extraction model specifically includes:
determining a second motion trajectory in the video space;
generating, according to the second motion trajectory, video data of a target object moving along the second motion trajectory;
and segmenting the continuous video data of the target object by time length to determine the temporal data for training the temporal feature extraction model.
Optionally, determining a second sample and a label of the second sample from adjacent video data segments in the temporal data and pre-training the temporal feature extraction model specifically includes:
determining adjacent video data segments in the temporal data, and taking the earlier of the adjacent video data segments as the second sample;
determining, from the other video data segment, a feature vector of the moving direction of the target object as the label corresponding to the second sample;
and inputting the second sample into the temporal feature extraction model, taking the feature vector of the moving direction of the target object in the predicted next video segment output by the temporal feature extraction model as a second output result, and pre-training the temporal feature extraction model with reducing the difference between the second output result and the label corresponding to the second sample as the optimization objective.
Optionally, the collected video data is data collected by an event camera.
Optionally, the encoding layer further includes a preset dynamic attention network;
inputting the third sample into the encoding layer of the action recognition model, formed by the pre-trained spatial feature extraction model and the pre-trained temporal feature extraction model, to determine spatial features and temporal features specifically includes:
determining, according to the third sample, the optical flow features corresponding to the third sample, and inputting the third sample into the dynamic attention network to determine first-resolution video data;
cropping the third sample according to the first-resolution video data to determine second-resolution video data;
and inputting the first-resolution video data and the second-resolution video data into the temporal feature extraction model to determine a first-resolution temporal feature and a second-resolution temporal feature as the temporal features, and inputting the first-resolution video data, the second-resolution video data and the optical flow features into the spatial feature extraction model to determine a first-resolution spatial feature, a second-resolution spatial feature and an optical-flow spatial feature as the spatial features.
This specification provides a pre-training-based action recognition model training apparatus. The action recognition model comprises an encoding layer and a decoding layer, the encoding layer comprises a spatial feature extraction model to be trained and a temporal feature extraction model to be trained, and both models to be trained are reservoir networks. The apparatus comprises:
a response module, configured to generate, in response to a model training instruction, spatial data for training the spatial feature extraction model and temporal data for training the temporal feature extraction model, the spatial data and the temporal data each consisting of a plurality of video data segments that are continuous in content;
a pre-training module, configured to determine a first sample and a label of the first sample from adjacent video data segments in the spatial data and pre-train the spatial feature extraction model, and to determine a second sample and a label of the second sample from adjacent video data segments in the temporal data and pre-train the temporal feature extraction model;
an acquisition module, configured to determine collected video data as a third sample, and to determine the action type corresponding to the video data as the label of the third sample;
and a training module, configured to input the third sample into the encoding layer of the action recognition model, formed by the pre-trained spatial feature extraction model and the pre-trained temporal feature extraction model, to determine spatial features and temporal features, input the spatial features and the temporal features into the decoding layer of the action recognition model to determine a predicted action type, and train the action recognition model according to the predicted action type and the label of the third sample, the trained action recognition model being used to recognize the action type corresponding to input video data.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the pre-training based motion recognition model training method described above.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the pre-training based motion recognition model training method described above when executing the program.
At least one of the above technical solutions adopted in this specification can achieve the following beneficial effects:
In the pre-training-based action recognition model training method provided in this specification, spatial data for training the spatial feature extraction model and temporal data for training the temporal feature extraction model are generated, the spatial feature extraction model and the temporal feature extraction model are pre-trained, collected video data is determined as a third sample, the action type corresponding to the video data is determined as the label of the third sample, and the action recognition model is trained.
In this method, the encoding layer of the action recognition model is built as a reservoir network group composed of two reservoir network models, and the temporal and spatial feature extraction capabilities are each trained with video data generated specifically for that capability. The pre-trained action recognition model therefore generalizes better in both temporal and spatial feature extraction; after the pre-trained model is further trained, the trained action recognition model can recognize action types from multi-channel features, so the accuracy of action type recognition is higher.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate exemplary embodiments of the present specification and, together with the description, serve to explain them; they are not intended to unduly limit the specification. In the drawings:
fig. 1 is a schematic flow chart of a training method of an action recognition model based on pre-training according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of spatial data provided in an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of time data according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an action recognition process according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a training device for a pre-training-based motion recognition model according to an embodiment of the present disclosure;
Fig. 6 is a schematic diagram of the electronic device corresponding to Fig. 1 provided in the present specification.
Detailed Description
With the development of internet technology, users shoot ever more video and the variety of video keeps growing, which has given rise to computer vision. In many fields, such as medicine, sports, education, and short video, the accuracy required when recognizing the action type corresponding to video data is higher and higher; for example, intelligent video surveillance, medical diagnosis and monitoring, intelligent human-computer interaction, and identity recognition all require accurate recognition of action types.
Currently, reservoir computing is commonly used to build action recognition models. Reservoir computing is a neural-network computing framework that generally consists of an input layer, a reservoir, and an output layer, where the reservoir is a recurrent neural network. The recurrent structure allows reservoir computing to mine the temporal information in its input, but it also makes reservoir computing sensitive mainly to temporal order while remaining insensitive to other kinds of features, for which it cannot form a corresponding bias. Action type recognition is difficult to perform accurately from input temporal features alone, so a model built purely on reservoir computing has low accuracy after training when recognizing action types.
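For concreteness, the reservoir update that such a framework relies on can be sketched as follows in Python with NumPy; this is a minimal illustration, and the reservoir size, input scaling, spectral radius, and leak rate below are assumptions rather than values given in this specification.

```python
import numpy as np

class Reservoir:
    """Minimal echo-state-style reservoir: fixed random weights;
    only a linear readout on its states is trained later."""

    def __init__(self, n_inputs, n_units=500, spectral_radius=0.9,
                 input_scale=0.5, leak=0.3, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = input_scale * rng.uniform(-1, 1, (n_units, n_inputs))
        w = rng.uniform(-0.5, 0.5, (n_units, n_units))
        # Rescale recurrent weights so their largest eigenvalue magnitude
        # equals the chosen spectral radius (a common echo-state heuristic).
        w *= spectral_radius / max(abs(np.linalg.eigvals(w)))
        self.w = w
        self.leak = leak
        self.n_units = n_units

    def run(self, inputs):
        """inputs: (T, n_inputs) sequence; returns the (T, n_units) state trajectory."""
        x = np.zeros(self.n_units)
        states = np.empty((len(inputs), self.n_units))
        for t, u in enumerate(inputs):
            pre = self.w_in @ u + self.w @ x
            x = (1 - self.leak) * x + self.leak * np.tanh(pre)
            states[t] = x
        return states
```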
Based on this, the present disclosure provides a pre-training-based action recognition model training method. In one or more embodiments of the present disclosure, the action recognition model includes an encoding layer and a decoding layer. The encoding layer is a reservoir network group formed by at least two reservoir networks, includes at least a temporal feature extraction model and a spatial feature extraction model, and is configured to extract the temporal features and spatial features of input video data; the decoding layer is configured to determine, from the temporal features and spatial features output by the encoding layer, the action type corresponding to the input video data.
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present invention based on the embodiments herein.
The technical solutions provided by the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a training method of an action recognition model based on pre-training provided in the present specification, which includes the following steps:
s100: in response to model training instructions, generating spatial data for training the spatial feature extraction model space, and temporal data for training the temporal feature extraction model, the spatial data and the temporal data being comprised of a plurality of video data segments that are contiguous in content.
In one or more embodiments of the present specification, the device that performs the action recognition model training method is not limited; it may be, for example, a personal computer, a mobile terminal, or a server. However, since the subsequent steps involve operations with high computational resource requirements, such as model training and feature extraction, which are generally performed by a server, the following description takes a server performing the training method as an example. The server may be a single device or may be composed of multiple devices, for example a distributed server, which is not limited in this specification.
The server responds to a model training instruction. If the action recognition model to be trained is already stored on the server, the model training instruction may simply be a piece of code that starts model training; if the model to be trained is not stored on the server, the model training instruction may include the address of, or related information about, the action recognition model to be trained. In one or more embodiments of the present disclosure, the specific content or form of the model training instruction is not limited.
When the model training instruction is received, the server generates spatial data for pre-training the spatial feature extraction model and temporal data for pre-training the temporal feature extraction model. The temporal data and the spatial data are a plurality of continuous video data segments in the format of video data captured by an event camera (dynamic vision sensor, DVS), i.e., DVS data.
Of course, since the spatial feature extraction model and the temporal feature extraction model are both reservoir networks trained on their respective training data, the temporal data and the spatial data may also be generated by other devices, in which case the server acquires them after receiving the model training instruction; this is not limited in one or more embodiments of the present disclosure.
The server then trains the untrained spatial feature extraction model and the untrained temporal feature extraction model with the spatial data and the temporal data respectively, so that the two reservoir models in the encoding layer of the action recognition model acquire a bias toward spatial feature extraction and a bias toward temporal feature extraction, respectively.
Specifically, when generating the spatial data for pre-training the spatial feature extraction model, a spatial motion trajectory in a video space may first be determined, and a plurality of motion points are then generated in the video space according to the start position of the spatial motion trajectory, with the center of the Gaussian distribution of the motion points placed at the start position. Video data of the continuous motion of the plurality of motion points is generated by moving the center of the Gaussian distribution along the spatial motion trajectory; the continuous video data is segmented whenever the center of the Gaussian distribution changes, and the resulting plurality of continuous video data segments are taken as the spatial data for training the spatial feature extraction model.
Alternatively, a plurality of continuous spatial video data segments may be generated first, each segment containing a plurality of motion points; for each segment, the center of the Gaussian distribution of its motion points can be determined from the motion points it contains, and the centers corresponding to the successive segments move along a preset trajectory.
Specifically, as shown in Fig. 2, which is a schematic diagram of spatial data according to an embodiment of the present disclosure, each rectangular frame is a video data segment of preset resolution, and the dotted line in a segment is the motion trajectory of the center of the Gaussian distribution of the motion points. For each video data segment, the larger solid black dot and the hollow (dashed) dot both mark centers of the Gaussian distribution of the motion points in that segment, and the smaller black dots are the motion points. The solid arrow on the right indicates the temporal order of the video data segments. The center of the Gaussian distribution of the motion points moves linearly in an arbitrary direction for a preset distance, changes direction at random after covering that distance, and, when it reaches the boundary of the video, is re-initialized to a random position and continues moving in its original direction. It should be noted that this video content is merely an example provided in one or more embodiments of the present disclosure; the spatial feature extraction model may also be trained with other, similar video data, for example by replacing the Gaussian distribution of the moving points with a uniform distribution, or by using the peak of a gamma distribution of the moving points instead of the center of a Gaussian distribution. This is not limiting in this specification.
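As an illustration of this data generation, the sketch below (Python, NumPy) produces DVS-like binary clips of points scattered around a Gaussian center that drifts along a piecewise-linear trajectory. The frame size, number of points, standard deviation, step length, and fixed segment length are assumed values; this specification only fixes the qualitative behaviour described above, and it segments at center changes rather than at a fixed frame count.

```python
import numpy as np

def make_spatial_clips(n_segments=50, frames_per_segment=10, size=46,
                       n_points=20, sigma=3.0, step=1.5, seg_dist=15.0, seed=0):
    """Generate DVS-like binary frames of points scattered around a Gaussian
    center that drifts along a piecewise-linear trajectory.  Returns a list of
    segments, each of shape (frames_per_segment, size, size), plus the center
    coordinate at the end of each segment (used later as the label)."""
    rng = np.random.default_rng(seed)
    center = rng.uniform(0, size, 2)
    direction = rng.uniform(-1, 1, 2)
    direction /= np.linalg.norm(direction)
    travelled = 0.0
    segments, centers = [], []
    for _ in range(n_segments):
        frames = np.zeros((frames_per_segment, size, size), dtype=np.uint8)
        for f in range(frames_per_segment):
            # Advance the Gaussian center; change direction after seg_dist,
            # re-initialise the position when it leaves the frame.
            center = center + step * direction
            travelled += step
            if travelled >= seg_dist:
                direction = rng.uniform(-1, 1, 2)
                direction /= np.linalg.norm(direction)
                travelled = 0.0
            if np.any(center < 0) or np.any(center >= size):
                center = rng.uniform(0, size, 2)
            pts = rng.normal(center, sigma, (n_points, 2)).astype(int)
            pts = np.clip(pts, 0, size - 1)
            frames[f, pts[:, 1], pts[:, 0]] = 1
        segments.append(frames)
        centers.append(center.copy())
    return segments, centers
```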
In addition, when generating the temporal data for pre-training the temporal feature extraction model, a temporal motion trajectory in a video space is determined, and a target object is then generated whose center is located at the start point of the temporal motion trajectory. Video data of the continuous movement of the target object is generated by moving the center of the target object along the temporal motion trajectory; the continuous video data of the target object is then segmented by time length to determine the temporal data for training the temporal feature extraction model.
Specifically, as shown in Fig. 3, which is a schematic diagram of temporal data provided in an embodiment of the present disclosure, each rectangular frame is the video space of a video data segment of preset resolution. Within each video data segment, the solid rectangle is the start position of the target moving along the preset temporal trajectory, and the dotted rectangle is the end position of the target within that segment. The solid arrow on the right indicates the temporal order of the video data segments; the end position of the target in a temporally earlier segment is the start position of the target in the adjacent, temporally later segment. Of course, a temporal feature of video data is a feature of the video data that changes over time, and using the direction of motion as the stand-in for temporal feature extraction is merely an example provided in one or more embodiments of the present specification; the temporal feature extraction model may also be trained on other features standing in for temporal feature extraction, such as speed or rotation angle, which is not limited in the present specification.
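Analogously, a sketch for generating such temporal pre-training clips might look as follows; the square target, its size, its speed, the fixed segment length, and the bounce at the frame boundary are illustrative assumptions not fixed by this specification.

```python
import numpy as np

def make_temporal_clips(n_segments=50, frames_per_segment=10, size=46,
                        target=5, speed=1.0, seed=0):
    """Generate segments of a solid square target drifting across the frame.
    Returns the segments, each (frames_per_segment, size, size), together with
    the unit direction vector of the motion in each segment (the label)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0, size - target, 2)
    direction = rng.uniform(-1, 1, 2)
    direction /= np.linalg.norm(direction)
    segments, directions = [], []
    for _ in range(n_segments):
        frames = np.zeros((frames_per_segment, size, size), dtype=np.uint8)
        for f in range(frames_per_segment):
            pos = pos + speed * direction
            # Bounce off the frame boundary so the target stays visible.
            for axis in range(2):
                if pos[axis] < 0 or pos[axis] > size - target:
                    direction[axis] *= -1
                    pos[axis] = np.clip(pos[axis], 0, size - target)
            x, y = pos.astype(int)
            frames[f, y:y + target, x:x + target] = 1
        segments.append(frames)
        directions.append(direction.copy())
    return segments, directions
```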
To facilitate the subsequent explanation of the pre-training of the temporal feature extraction model and the spatial feature extraction model, the following description takes the two kinds of video content above as examples.
S102: and determining a first sample and a label of the first sample according to adjacent video data segments in the spatial data, pre-training the spatial feature extraction model, determining a second sample and a label of the second sample according to adjacent video data segments in the time data, and pre-training the time feature extraction model.
After time data and space data are acquired, the server starts to train the two reserve pool models of the coding layer respectively, a pre-trained model determined through time data training is a pre-trained time feature extraction model, and a pre-trained model determined through space data training is a pre-trained space feature extraction model.
Specifically, when the spatial feature extraction model is trained, adjacent video data segments in the spatial data are determined, wherein when a previous video data segment is used as a first sample of the spatial feature extraction model to be trained, coordinates of centers of gaussian distribution of each moving light spot can be determined according to contents of a next video data segment of the video data segment and used as labels corresponding to the first sample. Inputting the first sample into a spatial feature extraction model to be trained, enabling the spatial feature extraction model to be trained to predict the coordinates of the center of Gaussian distribution of a light spot moving in the next video data segment according to the input first sample, outputting the coordinates of the prediction center, taking the difference between the coordinates of the prediction center and the labels as an optimization target, and training the spatial feature extraction model to be trained. And when the accuracy of the prediction result reaches a preset value, determining a pre-trained spatial feature extraction model.
When training the temporal feature extraction model, adjacent video data segments in the temporal data are determined. The earlier video data segment is used as a second sample for the temporal feature extraction model to be trained, and the moving direction of the target object can be determined from the content of the following video data segment and used as the label corresponding to the second sample. The second sample is input into the temporal feature extraction model to be trained, which outputs the direction of movement of the target object in the next video data segment as a second output result. The temporal feature extraction model is trained with reducing the difference between the second output result and the label corresponding to the second sample as the optimization objective, and the pre-trained temporal feature extraction model is thereby determined.
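In a reservoir network, this kind of pre-training typically amounts to fitting only a linear readout on the reservoir states, for example with ridge regression. The sketch below reuses the hypothetical Reservoir class and data generators from the earlier sketches; flattening each frame into the reservoir input, summarising a clip by its final state, and the closed-form ridge solution are illustrative choices rather than details fixed by this specification.

```python
import numpy as np

def ridge_readout(states, targets, alpha=1e-2):
    """Closed-form ridge regression: fit W_out so that states @ W_out ~= targets."""
    s, y = np.asarray(states), np.asarray(targets)
    return np.linalg.solve(s.T @ s + alpha * np.eye(s.shape[1]), s.T @ y)

def pretrain_readout(reservoir, segments, labels):
    """For each pair of adjacent segments, drive the reservoir with the earlier
    segment (the first/second sample) and regress its final state onto the label
    derived from the later segment (Gaussian center or moving direction)."""
    feats, targs = [], []
    for i in range(len(segments) - 1):
        inputs = segments[i].reshape(len(segments[i]), -1).astype(float)
        states = reservoir.run(inputs)
        feats.append(states[-1])     # final reservoir state summarises the clip
        targs.append(labels[i + 1])  # label comes from the *next* segment
    return ridge_readout(feats, targs)

# Usage sketch (names taken from the earlier assumed helpers):
# spatial_segs, centers = make_spatial_clips()
# temporal_segs, dirs = make_temporal_clips()
# spatial_res = Reservoir(n_inputs=46 * 46, seed=1)
# temporal_res = Reservoir(n_inputs=46 * 46, seed=2)
# w_out_spatial = pretrain_readout(spatial_res, spatial_segs, centers)
# w_out_temporal = pretrain_readout(temporal_res, temporal_segs, dirs)
```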
S104: and determining the collected video data as a third sample, and determining the action type corresponding to the video data as an annotation of the third sample.
Since the pre-trained time feature extraction model and the pre-trained space feature extraction model are already determined through pre-training, that is, the motion recognition model to be trained at the moment has a certain space feature extraction capability and a certain time feature extraction capability, that is, the bias of the space feature and the bias of the time feature are formed, the server further trains the pre-trained motion recognition model in order to further ensure that the motion recognition model accurately recognizes the motion type in the video data.
Specifically, the video data for training the motion recognition model is first obtained as the third sample, where the video data is actually collected video data of human motion, and of course, if the motion recognition model is trained, the video data may be collected according to the actually trained model. Then, the action type corresponding to the video data is acquired, wherein the action type can be marked manually or can be acquired in other modes.
S106: inputting the third sample, determining spatial characteristics and temporal characteristics by a coding layer of an action recognition model determined by a pre-trained spatial characteristic extraction model and a pre-trained temporal characteristic extraction model, inputting the spatial characteristics and the temporal characteristics into a decoding layer of the action recognition model, determining a predicted action type, training the action recognition model according to the predicted action type and the labeling of the third sample, and using the trained action recognition model for recognizing the action type corresponding to the input video data.
After determining the third sample, the server starts to train the whole motion recognition model, specifically, the server inputs the third sample into the motion recognition model to be trained, performs extraction of spatial features and temporal features on the third sample through an encoding layer of the motion recognition model, namely a pre-trained spatial feature extraction model and a pre-trained temporal feature extraction model, encodes the spatial features and the temporal features, then inputs the encoded result into a decoding layer, and the decoding layer determines the motion type of the third sample as an output result according to decoding of the encoded result, performs training on the motion recognition model by taking the difference between the reduced output result and the label as an optimization target, and determines the motion recognition model after training.
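A minimal sketch of this stage, assuming the two pre-trained reservoirs from the previous sketches, third-sample clips already preprocessed to the reservoirs' input resolution, and a simple softmax classifier standing in for the decoding layer (whose exact form this specification leaves open), could look as follows; scikit-learn's LogisticRegression is used here only as a convenient trainable decoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode_clip(clip, spatial_res, temporal_res):
    """Encoding layer: run one clip (T, H, W) through both pre-trained reservoirs
    and concatenate their final states into a multi-channel code."""
    flat = clip.reshape(len(clip), -1).astype(float)
    spatial_code = spatial_res.run(flat)[-1]
    temporal_code = temporal_res.run(flat)[-1]
    return np.concatenate([spatial_code, temporal_code])

def train_action_recognizer(clips, action_labels, spatial_res, temporal_res):
    """Third-sample training: encode every labelled clip, then fit the decoder
    so that the difference between its predictions and the labels is reduced."""
    codes = np.stack([encode_clip(c, spatial_res, temporal_res) for c in clips])
    decoder = LogisticRegression(max_iter=1000)
    decoder.fit(codes, action_labels)
    return decoder

# Usage sketch:
# decoder = train_action_recognizer(third_sample_clips, action_types,
#                                   spatial_res, temporal_res)
# predicted = decoder.predict(encode_clip(new_clip, spatial_res, temporal_res)[None])
```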
In the pre-training-based action recognition model training method shown in Fig. 1, spatial data for training the spatial feature extraction model and temporal data for training the temporal feature extraction model are generated, the spatial feature extraction model and the temporal feature extraction model are pre-trained, the collected video data is determined as a third sample, the action type corresponding to the video data is determined as the label of the third sample, and the action recognition model is trained.
In this method, the encoding layer of the action recognition model is built as a reservoir network group composed of two reservoir network models, and the temporal and spatial feature extraction capabilities are each trained with video data generated specifically for that capability. The pre-trained action recognition model therefore generalizes better in both temporal and spatial feature extraction; after the pre-trained model is further trained, the trained action recognition model can recognize action types from multi-channel features, so the accuracy of action type recognition is higher.
The temporal data and spatial data generated in step S100 and the video data collected as the third sample in step S104 are all in the DVS data format. Of course, RGB data may also be obtained, but RGB data should be preprocessed before being used for model training.
Specifically, video data of a preset resolution is determined from the acquired video data, a preset number of frames is determined from that video data and ordered, a difference frame is determined from each pair of adjacent frames and normalized, and the processed frames are arranged in sequence to form the video data used as the third sample.
It should be noted that, because of the difference between the acquisition principles of RGB data and DVS data, RGB data may contain a large amount of static video content; therefore, before the data conversion, the RGB data should be clipped to determine the video segment that contains the human motion.
Of course, the above method handles the case where the acquired third sample is RGB data, and during action type recognition it can likewise be used to convert input RGB video data into a data format the action recognition model can recognize. During training, data in both formats may also be collected and, after preprocessing, used as the third sample to train the action recognition model to be trained.
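One possible implementation of this RGB-to-DVS-like conversion, assuming the clip has already been trimmed to the segment containing the motion and is provided as an array of frames, is sketched below; the 46x46 target resolution and ten difference frames mirror values used elsewhere in this description, and OpenCV is used for resizing and grey-scale conversion.

```python
import numpy as np
import cv2

def rgb_to_dvs_like(frames, size=46, n_frames=10):
    """Convert an RGB clip (sequence of HxWx3 frames) into normalised
    difference frames with a DVS-like structure: resize, convert to grey,
    difference adjacent frames, then min-max normalise each difference."""
    frames = frames[:n_frames + 1]                      # need one extra frame
    grey = [cv2.cvtColor(cv2.resize(f, (size, size)), cv2.COLOR_BGR2GRAY)
            .astype(np.float32) for f in frames]
    diffs = []
    for prev, nxt in zip(grey[:-1], grey[1:]):
        d = np.abs(nxt - prev)
        spread = d.max() - d.min()
        diffs.append((d - d.min()) / spread if spread > 0 else d * 0.0)
    return np.stack(diffs)                              # (n_frames, size, size)
```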
In addition, since the collected third sample generally has a high resolution, extracting the temporal and spatial features directly from it requires considerable computational resources. Therefore, on the basis of the action recognition model training method shown in Fig. 1, the encoding layer of the action recognition model to be trained may further comprise a dynamic attention network for downsampling the collected third sample: the third sample is input into the dynamic attention network to determine first-resolution video data, and the third sample is then cropped according to the first-resolution video data to determine second-resolution video data.
For example, if the resolution of the collected third sample is 1260×1260, the dynamic attention network may downsample the third sample, for example by convolution or sampling, to determine video data with a resolution of 46×46 as the first-resolution video data, and a 46×46 crop around the subject performing the action to be recognized may then be taken from the third sample as the second-resolution video data. The first resolution, the second resolution, and the crop taken from the third sample can be chosen according to actual needs. When cropping the second-resolution video data, one or more embodiments of the present disclosure do not limit how the subject to be recognized is located: the action subject may be determined from the optical flow map of the third sample, or identified with other models.
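As an illustration of this two-resolution scheme, the sketch below downsamples the full frames to the first resolution and crops a same-sized window around the most active region, located here from a simple motion-energy map; this map stands in for the dynamic attention network and for subject localisation, neither of which is pinned down by this specification.

```python
import numpy as np
import cv2

def two_resolution_views(clip, out=46):
    """clip: (T, H, W) high-resolution DVS-like frames.
    Returns (first_res, second_res), each of shape (T, out, out):
    a global downsampled view and a crop around the most active region."""
    h, w = clip.shape[1:]
    first = np.stack([cv2.resize(f.astype(np.float32), (out, out)) for f in clip])
    # Motion-energy map: where, summed over time, most events/changes occur.
    energy = clip.astype(np.float32).sum(axis=0)
    cy, cx = np.unravel_index(np.argmax(energy), energy.shape)
    y0 = int(np.clip(cy - out // 2, 0, h - out))
    x0 = int(np.clip(cx - out // 2, 0, w - out))
    second = clip[:, y0:y0 + out, x0:x0 + out].astype(np.float32)
    return first, second
```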
Then, the first-resolution video data and the second-resolution video data are input into the temporal feature extraction model to determine a first-resolution temporal feature and a second-resolution temporal feature, and the first-resolution video data and the second-resolution video data are input into the spatial feature extraction model to determine a first-resolution spatial feature and a second-resolution spatial feature.
The two temporal features and two spatial features thus determined are then input into the decoding layer of the action recognition model to determine the action type corresponding to the input third sample.
Of course, to extract better spatial features from the third sample, the optical flow features of the third sample may be determined from the input third sample and then input into the spatial feature extraction model together with the second-resolution video to determine the spatial features of the third sample, so that the resulting spatial features of the third sample are fused with historical temporal information.
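This specification does not say how the optical flow features are computed; one common choice, shown here purely as an assumption, is dense Farneback optical flow on the frames via OpenCV, with the per-frame flow fields resized to the same resolution as the other inputs to the spatial feature extraction model.

```python
import numpy as np
import cv2

def optical_flow_features(clip, out=46):
    """clip: (T, H, W) frames, values in [0, 1] or [0, 255].
    Returns (T-1, out, out, 2) dense Farneback flow fields, resized so they can
    be fed to the spatial feature extraction model alongside the video views."""
    frames = [f if f.dtype == np.uint8 else np.uint8(255 * f / max(f.max(), 1e-6))
              for f in clip]
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(cv2.resize(flow, (out, out)))
    return np.stack(flows)
```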
Since both the first-resolution video data and the second-resolution video data are only parts of the video data derived from the third sample, the temporal feature extraction model and the spatial feature extraction model consume fewer computational resources than feature extraction directly from the third sample would require.
Of course, feature extraction from the first-resolution video data is feature extraction over the global original sample, while feature extraction from the second-resolution video data is feature extraction over the details of the third sample. Performing temporal and spatial feature extraction on the first-resolution and second-resolution video data and decoding the extracted features to determine the corresponding action type therefore leaves the accuracy unchanged, while, as the first-resolution and second-resolution video data are only parts of the video data derived from the third sample, the computational resources consumed by the temporal and spatial feature extraction models remain relatively small compared with those required for feature extraction directly from the third sample.
In addition, when performing spatial feature extraction on the video data at the two resolutions, the accuracy of the action recognition model can be further improved by also extracting the optical flow features of the third sample and inputting them into the spatial feature extraction model to determine the spatial features of those optical flow features. The decoding layer of the action recognition model then determines the action type corresponding to the third sample from the temporal and spatial features of the two resolutions together with the spatial features of the optical flow features, and the action type so determined is more accurate than one determined only from the temporal and spatial features of the first-resolution and second-resolution video data.
The following describes the recognition process of a fully trained action recognition model provided in the present specification, as shown in Fig. 4, which depicts the complete architecture of an action recognition model according to an embodiment of the present disclosure.
When video data is input into the action recognition model, the preprocessing module of the model first preprocesses it: it first judges whether the input video data is DVS data and, if not, converts it to obtain data with the same structure as DVS data. Once the video data is DVS data, or after the conversion, the video data is preprocessed, for example denoised. Optical flow is then extracted from the preprocessed video data to determine the optical flow features of the input video data.
The video data is then input into the encoding layer of the action recognition model, where the dynamic attention model of the encoding layer performs resolution extraction to determine video data of a preset first resolution.
The first-resolution video data and the second-resolution video data are input into the temporal feature extraction model to determine the global temporal features and the detail temporal features of the input video data; the optical flow features are input into the spatial feature extraction model to determine the spatial features of the optical flow features of the input video data; and the first-resolution video data and the second-resolution video data are input into the spatial feature extraction model to determine the global spatial features and the detail spatial features of the input video data.
Finally, the decoding layer of the action recognition model decodes the global and detail temporal features, the global and detail spatial features, and the spatial features of the optical flow features output by the encoding layer, and determines the action type corresponding to the input video data from the decoding result. The decoding layer may consist of a feature selector and at least one vector selector for the corresponding class; this is not limiting in this specification.
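To tie the recognition flow of Fig. 4 together, the sketch below chains the hypothetical helpers from the earlier examples into one inference call. The classifier used as the decoder is assumed to have been fitted on codes produced by this same five-channel encoding (the earlier training sketch adapted accordingly); the feature-selector and vector-selector structure mentioned above is not reproduced here.

```python
import numpy as np

def recognize_action(clip, spatial_res, temporal_res, decoder, out=46):
    """Inference path of Fig. 4, chaining the helpers assumed in earlier
    sketches: two-resolution views -> temporal/spatial reservoir codes
    (plus the spatial code of the optical flow) -> decoding layer -> action type."""
    first, second = two_resolution_views(clip, out=out)     # dynamic-attention stand-in
    flow = optical_flow_features(clip, out=out)             # (T-1, out, out, 2)
    flow_mag = np.linalg.norm(flow, axis=-1)                 # keep one channel per pixel

    def code(reservoir, frames):
        return reservoir.run(frames.reshape(len(frames), -1).astype(float))[-1]

    features = np.concatenate([
        code(temporal_res, first), code(temporal_res, second),  # global / detail temporal
        code(spatial_res, first), code(spatial_res, second),    # global / detail spatial
        code(spatial_res, flow_mag),                             # spatial code of optical flow
    ])
    return decoder.predict(features[None])[0]
```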
Based on the same idea as the pre-training-based action recognition model training method provided in one or more embodiments of the present disclosure, the present disclosure further provides a corresponding pre-training-based action recognition model training apparatus, as shown in Fig. 5.
Fig. 5 is a schematic diagram of a pre-training-based motion recognition model training apparatus provided in the present specification, where the apparatus is configured to perform the pre-training-based motion recognition model training method provided in fig. 1.
A pre-training-based action recognition model training apparatus, wherein the action recognition model comprises an encoding layer and a decoding layer, the encoding layer comprises a spatial feature extraction model to be trained and a temporal feature extraction model to be trained, and both models to be trained are reservoir networks, the apparatus comprising:
a response module 500, configured to generate, in response to a model training instruction, spatial data for training the spatial feature extraction model and temporal data for training the temporal feature extraction model, the spatial data and the temporal data each consisting of a plurality of video data segments that are continuous in content;
a pre-training module 501, configured to determine a first sample and a label of the first sample from adjacent video data segments in the spatial data and pre-train the spatial feature extraction model, and to determine a second sample and a label of the second sample from adjacent video data segments in the temporal data and pre-train the temporal feature extraction model;
an acquisition module 502, configured to determine collected video data as a third sample, and to determine the action type corresponding to the video data as the label of the third sample;
and a training module 503, configured to input the third sample into the encoding layer of the action recognition model, formed by the pre-trained spatial feature extraction model and the pre-trained temporal feature extraction model, to determine spatial features and temporal features, input the spatial features and the temporal features into the decoding layer of the action recognition model to determine a predicted action type, and train the action recognition model according to the predicted action type and the label of the third sample, the trained action recognition model being used to recognize the action type corresponding to input video data.
Optionally, the response module 500 is specifically configured to determine a spatial motion trajectory in a video space, generate a plurality of motion points in the video space according to the start position of the spatial motion trajectory with the center of the Gaussian distribution of the motion points at the start position, generate video data of the continuous motion of the plurality of motion points by moving the center of the Gaussian distribution along the spatial motion trajectory, segment the video data of continuous motion when the center of the Gaussian distribution changes, and determine a plurality of continuous video data segments as the spatial data for training the spatial feature extraction model.
Optionally, the pre-training module 501 is specifically configured to determine adjacent video data segments in the spatial data, take the earlier segment as the first sample, determine from the other segment the coordinates of the center of the Gaussian distribution of the moving points as the label corresponding to the first sample, input the first sample into the spatial feature extraction model, take the coordinates of the center of the Gaussian distribution of the moving points in the predicted next video segment output by the spatial feature extraction model as a first output result, and pre-train the spatial feature extraction model with reducing the difference between the first output result and the label corresponding to the first sample as the optimization objective.
Optionally, the response module 500 is specifically configured to determine a second motion trajectory in the video space, generate, according to the second motion trajectory, video data of the target object moving along the second motion trajectory, segment the continuous video data of the target object by time length, and determine the temporal data for training the temporal feature extraction model.
Optionally, the pre-training module 501 is specifically configured to determine adjacent video data segments in the temporal data, take the earlier segment as the second sample, determine from the other segment a feature vector of the moving direction of the target object as the label corresponding to the second sample, input the second sample into the temporal feature extraction model, take the feature vector of the moving direction of the target object in the predicted next video segment output by the temporal feature extraction model as a second output result, and pre-train the temporal feature extraction model with reducing the difference between the second output result and the label corresponding to the second sample as the optimization objective.
Optionally, when the encoding layer further includes a preset dynamic attention network, the training module 503 is further configured to determine, according to the third sample, the optical flow features corresponding to the third sample, input the third sample into the dynamic attention network to determine first-resolution video data, crop the third sample according to the first-resolution video data to determine second-resolution video data, input the first-resolution video data and the second-resolution video data into the temporal feature extraction model to determine a first-resolution temporal feature and a second-resolution temporal feature as the temporal features, and input the first-resolution video data, the second-resolution video data and the optical flow features into the spatial feature extraction model to determine a first-resolution spatial feature, a second-resolution spatial feature and an optical-flow spatial feature as the spatial features.
The present specification also provides a schematic structural diagram of the electronic device shown in Fig. 6. At the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and non-volatile storage, as illustrated in Fig. 6, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the pre-training-based action recognition model training method described above with respect to Fig. 1. Of course, the present description does not exclude other implementations, such as logic devices or combinations of hardware and software; that is, the execution subject of the processing flows below is not limited to logic units and may also be hardware or logic devices.
In the 1990s, improvements to a technology could be clearly distinguished as improvements in hardware (e.g., improvements to circuit structures such as diodes, transistors, and switches) or improvements in software (improvements to the method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures: designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD) (e.g., a field programmable gate array, FPGA) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled is written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can readily be obtained merely by lightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Or the means for performing various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described by dividing its functions into various units. Of course, when implementing the present specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise," "include," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
This specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing is merely an embodiment of the present specification and is not intended to limit the present specification. Various modifications and variations of the present specification will be apparent to those skilled in the art. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present specification shall fall within the scope of the claims of the present specification.

Claims (10)

1. A pre-training-based action recognition model training method, characterized in that the action recognition model comprises an encoding layer and a decoding layer, the encoding layer comprises a spatial feature extraction model to be trained and a temporal feature extraction model to be trained, both the spatial feature extraction model to be trained and the temporal feature extraction model to be trained are reservoir networks, and the method comprises the following steps:
generating, in response to a model training instruction, spatial data for training the spatial feature extraction model and temporal data for training the temporal feature extraction model, the spatial data and the temporal data each being composed of a plurality of video data segments that are continuous in content;
determining a first sample and a label of the first sample according to adjacent video data segments in the spatial data and pre-training the spatial feature extraction model, and determining a second sample and a label of the second sample according to adjacent video data segments in the temporal data and pre-training the temporal feature extraction model;
determining collected video data as a third sample, and determining an action type corresponding to the video data as a label of the third sample;
inputting the third sample into the encoding layer of the action recognition model, which is determined by the pre-trained spatial feature extraction model and the pre-trained temporal feature extraction model, to determine spatial features and temporal features; inputting the spatial features and the temporal features into the decoding layer of the action recognition model to determine a predicted action type; and training the action recognition model according to the predicted action type and the label of the third sample, the trained action recognition model being used for recognizing the action type corresponding to input video data.
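For illustration only, the following is a minimal sketch of the training flow of claim 1, using echo-state-style reservoir networks for the encoding layer and a ridge-regression readout as the decoding layer. The class and function names, the leaky-tanh reservoir update, the readout form, and all sizes are assumptions chosen for the example; the claim does not prescribe them.

```python
# Illustrative sketch only: echo-state-style reservoirs as the encoding layer,
# ridge-regression readout as the decoding layer. Names, sizes, and the use of
# plain NumPy are assumptions, not the claimed implementation.
import numpy as np

rng = np.random.default_rng(0)

class Reservoir:
    """Fixed random recurrent network; only downstream readouts are trained."""
    def __init__(self, n_in, n_res, spectral_radius=0.9, leak=0.3):
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        self.W = W * (spectral_radius / np.max(np.abs(np.linalg.eigvals(W))))
        self.leak = leak

    def run(self, frames):                        # frames: (T, n_in)
        x = np.zeros(self.W.shape[0])
        for u in frames:                          # leaky-integrator update
            x = (1 - self.leak) * x + self.leak * np.tanh(self.W_in @ u + self.W @ x)
        return x                                  # final state used as the feature

def encode(spatial_res, temporal_res, clip):
    """Encoding layer: concatenate spatial and temporal reservoir features."""
    return np.concatenate([spatial_res.run(clip), temporal_res.run(clip)])

def train_decoder(features, labels, n_classes, reg=1e-3):
    """Decoding layer: ridge-regression readout onto one-hot action labels."""
    X = np.asarray(features)                      # (N, d)
    Y = np.eye(n_classes)[np.asarray(labels)]     # (N, n_classes)
    return np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)

# toy usage: 20 stand-in "third samples" of 16 frames of 8x8 video, 3 action types
spatial_res, temporal_res = Reservoir(64, 200), Reservoir(64, 200)
clips  = [rng.random((16, 64)) for _ in range(20)]
labels = rng.integers(0, 3, 20)                   # stand-in action-type labels
feats  = [encode(spatial_res, temporal_res, c) for c in clips]
W_out  = train_decoder(feats, labels, n_classes=3)
predicted_action = int(np.argmax(feats[0] @ W_out))
```

Because the reservoir weights stay fixed, only the decoding-layer readout is fitted on the labeled third samples; this is the usual reason for choosing reservoir networks in a pre-training-then-fine-tuning setup.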
2. The method of claim 1, wherein generating spatial data for training the spatial feature extraction model specifically comprises:
determining a first motion trajectory in a video space;
generating a plurality of moving light spots in the video space according to a start point position of the first motion trajectory, wherein the Gaussian distribution of the moving light spots is centered at the start point position;
moving the center of the Gaussian distribution along the first motion trajectory to generate video data of the plurality of moving light spots in continuous motion;
and segmenting the video data of continuous motion whenever the center position of the Gaussian distribution changes, and determining a plurality of continuous video data segments as the spatial data for training the spatial feature extraction model.
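As a hedged illustration of claim 2, the sketch below generates light spots drawn from a Gaussian distribution centered at the start of a motion trajectory, moves the center along the trajectory, and starts a new video data segment each time the center position changes. The frame size, spot count, frames per position, and the straight-line trajectory are assumptions, not values from the patent.

```python
# A sketch under assumptions: Gaussian-distributed light spots whose shared
# center follows the first motion trajectory; a new segment starts whenever
# the center position changes.
import numpy as np

rng = np.random.default_rng(0)

def render(points, size=32):
    """Rasterize point coordinates into a binary frame."""
    frame = np.zeros((size, size))
    idx = np.clip(np.round(points).astype(int), 0, size - 1)
    frame[idx[:, 1], idx[:, 0]] = 1.0
    return frame

def make_spatial_data(trajectory, n_spots=30, frames_per_pos=4, sigma=2.0):
    """Return content-continuous video data segments, one per center position."""
    offsets = rng.normal(0.0, sigma, (n_spots, 2))    # Gaussian spread around the center
    segments = []
    for center in trajectory:                         # center moves along the trajectory
        seg = [render(np.asarray(center) + offsets) for _ in range(frames_per_pos)]
        segments.append(np.stack(seg))                # one segment per center position
    return segments

# first motion trajectory: a straight line across a 32x32 video space
trajectory = [(4 + 2 * t, 4 + 2 * t) for t in range(12)]
spatial_data = make_spatial_data(trajectory)
print(len(spatial_data), spatial_data[0].shape)       # 12 segments of shape (4, 32, 32)
```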
3. The method of claim 2, wherein determining a first sample and a label of the first sample according to adjacent video data segments in the spatial data and pre-training the spatial feature extraction model specifically comprises:
determining adjacent video data segments in the spatial data, and taking the earlier of the adjacent video data segments as the first sample;
determining, according to the other video data segment, the coordinates of the center of the Gaussian distribution of the moving light spots in that video data segment as the label corresponding to the first sample;
inputting the first sample into the spatial feature extraction model, and determining, as a first output result, the coordinates of the center of the Gaussian distribution of the plurality of moving light spots in the predicted next video output by the spatial feature extraction model;
and determining a loss according to the difference between the first output result and the label corresponding to the first sample, and pre-training the spatial feature extraction model with reducing the loss as the optimization objective.
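The following sketch illustrates the pre-training objective of claim 3 under stated assumptions: the earlier of two adjacent segments is the first sample, the Gaussian center of the later segment is its label, and a linear readout on the reservoir state is fitted so that the squared loss between prediction and label is reduced. It reuses the Reservoir class and spatial_data from the sketches above; the center-of-mass label estimate and the ridge readout are illustrative choices, not the patented procedure.

```python
# Sketch of the claim 3 objective, reusing Reservoir and spatial_data from the
# sketches above. The center-of-mass label estimate and the ridge readout are
# assumptions; any fit that reduces the prediction loss would serve the claim.
import numpy as np

def center_of_mass(segment):
    """Estimate the Gaussian center (x, y) from the last frame of a segment."""
    ys, xs = np.nonzero(segment[-1])
    return np.array([xs.mean(), ys.mean()])

def pretrain_spatial(reservoir, segments, reg=1e-3):
    """Fit a readout mapping the reservoir state of segment t to the center of segment t+1."""
    states, targets = [], []
    for earlier, later in zip(segments[:-1], segments[1:]):     # adjacent segments
        states.append(reservoir.run(earlier.reshape(len(earlier), -1)))  # first sample
        targets.append(center_of_mass(later))                   # label from the later segment
    X, Y = np.asarray(states), np.asarray(targets)
    W_out = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    loss = np.mean((X @ W_out - Y) ** 2)                        # quantity being reduced
    return W_out, loss

# usage with the sketches above:
# res_s = Reservoir(32 * 32, 300)
# W_out, loss = pretrain_spatial(res_s, spatial_data)
```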
4. The method of claim 1, wherein generating temporal data for training the temporal feature extraction model specifically comprises:
determining a second motion trajectory in the video space;
generating, according to the second motion trajectory, video data of a target object moving along the second motion trajectory;
and segmenting the continuous video data of the target object by time length to determine the temporal data for training the temporal feature extraction model.
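A minimal sketch of the temporal-data generation of claim 4, under assumptions: a single target object (here a small square) is rendered as it moves along the second motion trajectory, and the continuous video is then split into fixed-length segments. Object shape, frame size, segment length, and the circular trajectory are illustrative.

```python
# Sketch under assumptions: a small square target object follows the second
# motion trajectory and the continuous video is split into fixed-length segments.
import numpy as np

def render_object(center, size=32, half=1):
    """Draw a small square target object at `center`."""
    frame = np.zeros((size, size))
    x, y = np.clip(np.round(center).astype(int), half, size - 1 - half)
    frame[y - half:y + half + 1, x - half:x + half + 1] = 1.0
    return frame

def make_temporal_data(trajectory, segment_len=8):
    """Render the trajectory frame by frame, then split the video by time length."""
    video = np.stack([render_object(np.asarray(p)) for p in trajectory])
    n_seg = len(video) // segment_len
    return [video[i * segment_len:(i + 1) * segment_len] for i in range(n_seg)]

# second motion trajectory: a circle in the 32x32 video space
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
trajectory = np.stack([16 + 10 * np.cos(t), 16 + 10 * np.sin(t)], axis=1)
temporal_data = make_temporal_data(trajectory)
print(len(temporal_data), temporal_data[0].shape)   # 8 segments of shape (8, 32, 32)
```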
5. The method of claim 4, wherein determining a second sample and a label of the second sample according to adjacent video data segments in the temporal data and pre-training the temporal feature extraction model specifically comprises:
determining adjacent video data segments in the temporal data, and taking the earlier of the adjacent video data segments as the second sample;
determining, according to the other video data segment, a movement direction feature vector of the target object as the label corresponding to the second sample;
and inputting the second sample into the temporal feature extraction model, determining, as a second output result, the movement direction feature vector of the target object in the predicted next video output by the temporal feature extraction model, taking the difference between the second output result and the label corresponding to the second sample as the loss, and pre-training the temporal feature extraction model with reducing the loss as the optimization objective.
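The sketch below mirrors claim 5 under the same assumptions as above, reusing the Reservoir class and temporal_data from the earlier sketches: the earlier of two adjacent segments is the second sample, a unit vector for the target object's movement direction in the later segment is its label, and a linear readout is fitted to reduce the squared loss. Estimating the direction from the displacement of the object's center of mass is an assumption made for illustration.

```python
# Sketch of the claim 5 objective, reusing Reservoir and temporal_data from the
# sketches above. The direction estimate from center-of-mass displacement is an
# assumption, not the patented definition of the movement direction feature vector.
import numpy as np

def direction_vector(segment):
    """Unit vector of the target object's displacement across a segment."""
    def com(frame):
        ys, xs = np.nonzero(frame)
        return np.array([xs.mean(), ys.mean()])
    d = com(segment[-1]) - com(segment[0])
    return d / (np.linalg.norm(d) + 1e-8)

def pretrain_temporal(reservoir, segments, reg=1e-3):
    """Fit a readout predicting the next segment's movement-direction vector."""
    states  = [reservoir.run(s.reshape(len(s), -1)) for s in segments[:-1]]   # second samples
    targets = [direction_vector(s) for s in segments[1:]]                     # labels
    X, Y = np.asarray(states), np.asarray(targets)
    W_out = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return W_out, np.mean((X @ W_out - Y) ** 2)                               # readout and loss

# usage with the sketches above:
# res_t = Reservoir(32 * 32, 300)
# W_out, loss = pretrain_temporal(res_t, temporal_data)
```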
6. The method of claim 1, wherein the collected video data is data collected by an event camera.
7. The method of claim 1, wherein the encoding layer further comprises a preset dynamic attention network;
and inputting the third sample into the encoding layer of the action recognition model, which is determined by the pre-trained spatial feature extraction model and the pre-trained temporal feature extraction model, to determine spatial features and temporal features specifically comprises:
determining, according to the third sample, optical flow features corresponding to the third sample, inputting the third sample into the dynamic attention network, and determining first-resolution video data;
cropping the third sample according to the first-resolution video data to determine second-resolution video data;
and inputting the first-resolution video data and the second-resolution video data into the temporal feature extraction model to determine a first-resolution temporal feature and a second-resolution temporal feature as the temporal features, and inputting the first-resolution video data, the second-resolution video data, and the optical flow features into the spatial feature extraction model to determine a first-resolution spatial feature, a second-resolution spatial feature, and an optical-flow spatial feature as the spatial features.
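For claim 7, the sketch below stands in for the encoding step with deliberately simple components: the dynamic attention network is replaced by a motion-energy map that picks where to crop, the optical flow features are approximated by frame differencing, and the first and second resolutions are a downsampled view and a high-resolution crop. All of these stand-ins are assumptions; the claim does not fix the form of the attention network or of the optical flow computation.

```python
# Stand-in sketch for the claim 7 encoding step: motion energy as a proxy for
# the dynamic attention network, frame differencing as a proxy for optical flow,
# block-mean downsampling for the first resolution and a high-resolution crop
# for the second resolution. All of these are assumptions for illustration.
import numpy as np

def optical_flow_features(clip):
    """Crude motion feature: absolute frame-to-frame differences."""
    return np.abs(np.diff(clip, axis=0))

def dynamic_attention(clip, low_size=16):
    """Return first-resolution (downsampled) video plus an attention center."""
    T, H, W = clip.shape
    fy, fx = H // low_size, W // low_size
    low = clip[:, :fy * low_size, :fx * low_size] \
              .reshape(T, low_size, fy, low_size, fx).mean(axis=(2, 4))
    energy = np.abs(np.diff(clip, axis=0)).sum(axis=0)        # where the motion is
    cy, cx = np.unravel_index(np.argmax(energy), energy.shape)
    return low, (cy, cx)

def crop_second_resolution(clip, center, win=8):
    """Crop a high-resolution window of the third sample around the attention center."""
    T, H, W = clip.shape
    cy = int(np.clip(center[0], win, H - win))
    cx = int(np.clip(center[1], win, W - win))
    return clip[:, cy - win:cy + win, cx - win:cx + win]

# usage with a stand-in third sample of 16 frames of 32x32 video:
rng = np.random.default_rng(0)
third_sample = rng.random((16, 32, 32))
flow = optical_flow_features(third_sample)                    # (15, 32, 32)
first_res, center = dynamic_attention(third_sample)           # (16, 16, 16), (cy, cx)
second_res = crop_second_resolution(third_sample, center)     # (16, 16, 16)
# first_res and second_res feed the temporal model; all three feed the spatial model.
```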
8. A pre-training-based action recognition model training apparatus, characterized in that the action recognition model comprises an encoding layer and a decoding layer, the encoding layer comprises a spatial feature extraction model to be trained and a temporal feature extraction model to be trained, both the spatial feature extraction model to be trained and the temporal feature extraction model to be trained are reservoir networks, and the apparatus comprises:
a response module, configured to generate, in response to a model training instruction, spatial data for training the spatial feature extraction model and temporal data for training the temporal feature extraction model, the spatial data and the temporal data each being composed of a plurality of video data segments that are continuous in content;
a pre-training module, configured to determine a first sample and a label of the first sample according to adjacent video data segments in the spatial data and pre-train the spatial feature extraction model, and to determine a second sample and a label of the second sample according to adjacent video data segments in the temporal data and pre-train the temporal feature extraction model;
an acquisition module, configured to determine collected video data as a third sample, and determine an action type corresponding to the video data as the label of the third sample;
a training module, configured to input the third sample, determine spatial features and temporal features through the encoding layer of the action recognition model determined by the pre-trained spatial feature extraction model and the pre-trained temporal feature extraction model, input the spatial features and the temporal features into the decoding layer of the action recognition model, and determine a predicted action type; and to train the action recognition model according to the predicted action type and the label of the third sample, the trained action recognition model being used for recognizing the action type corresponding to input video data.
9. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of the preceding claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-7 when executing the program.
CN202311519700.7A 2023-11-14 2023-11-14 Action recognition model training method, device and equipment based on pre-training Pending CN117593789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311519700.7A CN117593789A (en) 2023-11-14 2023-11-14 Action recognition model training method, device and equipment based on pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311519700.7A CN117593789A (en) 2023-11-14 2023-11-14 Action recognition model training method, device and equipment based on pre-training

Publications (1)

Publication Number Publication Date
CN117593789A true CN117593789A (en) 2024-02-23

Family

ID=89914337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311519700.7A Pending CN117593789A (en) 2023-11-14 2023-11-14 Action recognition model training method, device and equipment based on pre-training

Country Status (1)

Country Link
CN (1) CN117593789A (en)

Similar Documents

Publication Publication Date Title
CN108320296B (en) Method, device and equipment for detecting and tracking target object in video
CN112784857B (en) Model training and image processing method and device
CN111010590A (en) Video clipping method and device
CN111311634A (en) Face image detection method, device and equipment
CN112465029B (en) Instance tracking method and device
CN116343314B (en) Expression recognition method and device, storage medium and electronic equipment
CN116188971A (en) Robot character recognition method, device and storage medium
CN115600157B (en) Data processing method and device, storage medium and electronic equipment
CN111238523A (en) Method and device for predicting motion trail
CN111031351A (en) Method and device for predicting target object track
CN116312480A (en) Voice recognition method, device, equipment and readable storage medium
CN116543264A (en) Training method of image classification model, image classification method and device
CN117197781B (en) Traffic sign recognition method and device, storage medium and electronic equipment
CN112990099B (en) Method and device for detecting lane line
CN116186330B (en) Video deduplication method and device based on multi-mode learning
CN116863484A (en) Character recognition method, device, storage medium and electronic equipment
CN117593789A (en) Action recognition model training method, device and equipment based on pre-training
CN115830633A (en) Pedestrian re-identification method and system based on multitask learning residual error neural network
CN116824291A (en) Remote sensing image learning method, device and equipment
CN114187355A (en) Image calibration method and device
CN114926437A (en) Image quality evaluation method and device
CN112561961A (en) Instance tracking method and device
CN116935055B (en) Attention mask-based weak supervision semantic segmentation method and device
CN114528923B (en) Video target detection method, device, equipment and medium based on time domain context
CN117880444B (en) Human body rehabilitation exercise video data generation method guided by long-short time features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination