CN118279781A - Data processing method, action recognition method and device for model training

Info

Publication number
CN118279781A
Authority
CN
China
Prior art keywords
video frames
training
key areas
backfill
video
Prior art date
Legal status
Pending
Application number
CN202211733719.7A
Other languages
Chinese (zh)
Inventor
周洪汉
罗进
刘永新
许敏
王亚东
聂彦岭
单颖
王赛君
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202211733719.7A
Publication of CN118279781A
Status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

An embodiment of the invention provides a data processing method for model training, an action recognition method, and corresponding devices. A training data set is acquired, the training data set comprising a plurality of video frames arranged in sequence, each video frame having a corresponding label. Key areas of the video frames are extracted in turn, with the label of each video frame used as the label of the key areas in that frame. The key areas are stored in turn to the tail of a preset queue to obtain a queue containing a plurality of key areas. The video frames are then selected in turn, and the key areas currently at the head of the queue are filled into them to obtain backfill video frames. Soft labels of the backfill video frames are calculated based on the labels of the backfill video frames and the labels of the key areas filled into them, and the backfill video frames together with their corresponding soft labels are used to train a preset action recognition model. By exchanging key areas between different samples through the queue, the video frames come to carry labels corresponding to different scenes, which improves the robustness of model recognition.

Description

Data processing method, action recognition method and device for model training
Technical Field
The present invention relates to the field of video action recognition technology, and in particular to a data processing method for model training, an action recognition method, an action recognition model training method, a data processing device for model training, an action recognition device, and an action recognition model training device.
Background
With the rapid development of computer vision, action recognition technology is being applied to more and more practical scenarios. For example, in grassroots social governance, surveillance video of public areas must be analyzed and reported with real-time action recognition so that sudden public-safety incidents can be handled in a timely manner.
In the prior art, a twin (Siamese) network is generally used to process a labeled action video and a video to be identified separately; alternatively, the video is pre-processed into a frame-image sequence and an optical-flow image sequence, appearance features and action-representation features are extracted separately, and the two kinds of features are then fused based on an attention mechanism. However, existing video action recognition methods still have shortcomings: most do not consider action recognition across multiple scenes, are limited by the size of the training data set, and see a single action occur only a few times in different scenes.
Disclosure of Invention
In view of the above problems, embodiments of the present application have been made to provide a data processing method for model training, an action recognition method, an action recognition model training method, a data processing apparatus for model training, an action recognition apparatus, and an action recognition model training apparatus that overcome or at least partially solve the above problems.
The embodiment of the invention discloses a data processing method for model training, which comprises the following steps:
acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag;
sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames;
sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas;
sequentially selecting the video frames, and filling the key areas currently positioned at the head of the queue into the video frames to obtain backfill video frames;
and calculating a soft tag of the backfill video frame based on the tag of the backfill video frame and the tag of the key area filled in the backfill video frame, wherein the backfill video frames and the soft tags corresponding to the backfill video frames are used for training a preset action recognition model.
Optionally, the step of sequentially extracting key areas of the video frame includes:
dividing the video frame into a first preset number of areas in sequence; the region contains pixel information;
calculating a score for each region in the video frame based on the pixel information;
and selecting a second preset number of areas as key areas according to the scores.
Optionally, the method for calculating the score of each region in the video frame based on the pixel information includes:
converting the region into a region vector;
and calculating the score of the region vector by adopting the pixel information based on an attention mechanism, thereby obtaining the score of the region corresponding to the region vector.
Optionally, the step of calculating the soft tag of the backfill video frame based on the tag of the backfill video frame and the tag of the key region filled in the backfill video frame includes:
calculating the similarity between the label of the backfill video frame and the label of the key area filled in the backfill video frame;
And calculating the soft label based on the similarity, the label of the backfill video frame and the label of the key area filled in the backfill video frame.
The embodiment of the invention also discloses an action recognition method, which comprises the following steps:
acquiring a sample to be identified;
Inputting the sample to be identified into an action recognition model; the action recognition model is obtained by training with samples containing backfill video frames; the samples containing backfill video frames are obtained by acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag; sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames; sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain the backfill video frames;
obtaining a prediction label output by the action recognition model; the predictive label is used for classifying the sample to be identified.
The embodiment of the invention also discloses a training method of the motion recognition model, which comprises the following steps:
Acquiring a plurality of training samples; the training samples have soft labels; the training samples are obtained by acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag; sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames; sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain the backfill video frames;
Inputting a plurality of training samples into an action recognition model; the action recognition model comprises a multi-layer perceptron; the multi-layer perceptron is used for removing noise generated in the process of processing the training sample by the action recognition model;
Obtaining a prediction label output by the action recognition model;
comparing the predictive label with the soft label of the training sample to obtain the current training loss;
based on the current training loss, the motion recognition model is adjusted and trained again with a plurality of training samples until the current training loss meets the preset training condition, thereby obtaining the trained motion recognition model.
Optionally, each of the video frames has the same preset size.
The embodiment of the invention also discloses a data processing device for model training, which comprises:
The first acquisition module is used for acquiring a training data set; wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag;
The extraction module is used for sequentially extracting key areas of the video frames and taking tags of the video frames as tags of the key areas in the video frames;
The storage module is used for sequentially storing the key areas to the tail parts of preset queues to obtain queues containing a plurality of key areas;
The selecting module is used for sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain backfill video frames;
The computing module is used for computing soft labels of the backfill video frames based on the labels of the backfill video frames and labels of key areas filled in the backfill video frames; the backfill video frames and the soft labels corresponding to the backfill video frames are used for training a preset action recognition model.
The embodiment of the invention also discloses a motion recognition device, which comprises:
The second acquisition module is used for acquiring a sample to be identified;
The first input module is used for inputting the sample to be identified into an action recognition model; the action recognition model is obtained by training with samples containing backfill video frames; the samples containing backfill video frames are obtained by acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag; sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames; sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain the backfill video frames;
The first output module is used for obtaining the prediction label output by the action recognition model; the predictive label is used for classifying the sample to be identified.
The embodiment of the invention also discloses a motion recognition model training device, which comprises:
The third acquisition module is used for acquiring a plurality of training samples; the training samples have soft labels; the training samples are obtained by acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag; sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames; sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain the backfill video frames;
The second input module is used for inputting a plurality of training samples into the action recognition model; the action recognition model comprises a multi-layer perceptron; the multi-layer perceptron is used for removing noise generated in the process of processing the training sample by the action recognition model;
the second output module is used for obtaining the prediction label output by the action recognition model;
the training loss module is used for comparing the prediction label with the soft label of the training sample to obtain the current training loss;
And the adjusting module is used for adjusting the action recognition model based on the current training loss, and training the action recognition model by adopting a plurality of training samples again until the current training loss meets the preset training condition, so as to obtain the action recognition model after training.
The embodiment of the invention also discloses electronic equipment, which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method according to the embodiment of the present invention when executing the program stored in the memory.
Embodiments of the invention also disclose one or more computer-readable media having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the methods described in the embodiments of the invention.
The embodiments of the invention have the following advantages. A training data set is acquired, the training data set comprising a plurality of video frames arranged in sequence, each video frame having a corresponding label; key areas of the video frames are extracted in turn, with the label of each video frame used as the label of the key areas in that frame; the key areas are stored in turn to the tail of a preset queue to obtain a queue containing a plurality of key areas; the video frames are selected in turn and the key areas currently at the head of the queue are filled into them to obtain backfill video frames; and soft labels of the backfill video frames are calculated based on the labels of the backfill video frames and the labels of the key areas filled into them, the backfill video frames and their corresponding soft labels being used to train a preset action recognition model. By exchanging key areas between different samples through the queue, the backfill video frames come to carry labels corresponding to different scenes, so a model trained with the backfill video frames can be applied to scenes as different as a looking-back scene, a basketball scene, and a falling scene, which improves the robustness of model recognition. Further, by calculating the similarity between the key areas to weight the soft label, the relationship between background and foreground is recorded, so the model does not completely ignore the background information of an action and can learn more complete information. In addition, a multi-layer perceptron removes noise that may exist in the hidden representation, which improves the efficiency of model training.
Drawings
FIG. 1 is a flow chart of the steps of a data processing method for model training in an embodiment of the present invention;
FIG. 2 is a flow chart of steps of another data processing method for model training in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of a resulting backfill video frame in an embodiment of the present invention;
FIG. 4 is a flow chart of steps of a method for motion recognition in an embodiment of the present invention;
FIG. 5 is a flowchart illustrating steps of a method for training an action recognition model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of training an action recognition model in an embodiment of the invention;
FIG. 7 is a block diagram of a data processing apparatus for model training in accordance with an embodiment of the present invention;
FIG. 8 is a block diagram of an action recognition device in an embodiment of the present invention;
FIG. 9 is a block diagram of a training device for motion recognition models in an embodiment of the present invention;
FIG. 10 is a block diagram of an electronic device in an embodiment of the invention;
FIG. 11 is a schematic diagram of a computer readable medium in an embodiment of the invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Referring to fig. 1, a flowchart illustrating steps of a data processing method for model training according to an embodiment of the present invention may specifically include the following steps:
Step 101, acquiring a training data set; wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag;
The training data set in the embodiment of the invention may be derived from a data set published on the Internet. It comprises a plurality of video clips, each of which contains a plurality of video frames arranged in sequence, for example in the playing order of the clip. Each video frame has a corresponding label, which indicates that the frame belongs to a certain type of scene, such as a looking-back scene, a basketball scene, or a falling scene; the labels may be added manually or computed by a model. The number of video frames in each clip is determined by conditions such as video quality, hardware, and the training strategy, and may be 16 frames, 24 frames, and so on; the invention is not limited in this regard.
Step 102, sequentially extracting key areas of the video frames, and taking tags of the video frames as tags of the key areas in the video frames;
Based on the video clips, the key areas of the video frames can be extracted in the order in which the frames are arranged. Within a video frame, different areas contain different information: static information such as trees and houses, and dynamic information such as the action information of a person in motion. The dynamic information usually serves as the key information, so the areas containing it can be taken as the key areas of the video frame, and the label of the video frame is used as the label of the key areas in that frame. After the key areas are extracted, the rest of the video frame can serve as the background.
Step 103, sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas;
To reuse the key areas extracted from the video frames cyclically, a queue can be preset in which key areas corresponding to video frames of different scenes are stored. Each time the key areas of a video frame are obtained, they are saved to the tail of the queue, so that key areas stored earlier sit toward the head and those stored later toward the tail. The queue can hold a limited number of key areas, and the key areas stored first are output first.
Step 104, sequentially selecting the video frames, and filling the key areas currently positioned at the head of the queue into the video frames to obtain backfill video frames;
Because the queue is first-in, first-out, for a video frame whose key areas have been extracted, the key areas currently at the head of the queue can be fetched and used as the foreground. Those head-of-queue key areas were extracted from video frames of other scenes, so their labels differ from the label of the current video frame; filling them into the video frame therefore yields a backfill video frame containing both foreground and background, and makes the key areas appear in different scenes.
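This queue-and-backfill mechanism can be illustrated with a short sketch. It is not the patent's implementation; the deque capacity, the data layout of a key area, and the function name below are assumptions made for illustration (frames are assumed to be numpy arrays):

```python
from collections import deque

QUEUE_CAPACITY = 64  # assumed bound on how many key-area sets the queue holds
queue = deque(maxlen=QUEUE_CAPACITY)

def backfill(frame, key_areas, frame_label):
    """frame: numpy array (H, W, C); key_areas: list of
    (row_slice, col_slice, patch_pixels, patch_label) extracted from it."""
    head = queue.popleft() if queue else None  # first-in, first-out
    queue.append(key_areas)                    # this frame's key areas go to the tail
    if head is None:                           # queue was empty: nothing to backfill
        return frame, frame_label, frame_label
    out = frame.copy()
    for rows, cols, patch, _ in head:          # head areas come from another scene
        out[rows, cols] = patch                # paste them in as the new foreground
    head_label = head[0][3]                    # label of the filled-in key areas
    return out, frame_label, head_label        # both labels feed the soft label
```

Calling backfill on each frame in sequence reproduces the first-in, first-out exchange: the areas pasted into the current frame always come from the oldest entry still in the queue.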
Step 105, calculating a soft label of the backfill video frame based on the label of the backfill video frame and the label of the key area filled in the backfill video frame; the backfill video frames and the soft labels corresponding to the backfill video frames are used for training a preset action recognition model.
The key areas and the video frame into which they are filled come from different scenes, so the label of the video frame differs from the label of the key areas. After backfilling, the backfill video frame therefore carries multiple labels: the label of the original video frame and the label of the key areas filled into it. Moreover, backfilling makes the action information of the key areas appear in a different scene and changes the content of the video frame, so the original label may no longer apply, and a single label can no longer classify the frame. For this reason, the soft label of the backfill video frame is calculated based on the label of the backfill video frame and the label of the key areas filled into it. A preset action recognition model is then trained based on the backfill video frames and their corresponding soft labels, so that videos carrying multiple labels can be recognized.
In the embodiment of the invention, a training data set comprising a plurality of sequentially arranged video frames, each with a corresponding label, is acquired; key areas of the video frames are extracted in turn, with the label of each video frame used as the label of its key areas; the key areas are stored in turn to the tail of a preset queue to obtain a queue containing a plurality of key areas; the video frames are selected in turn and the key areas currently at the head of the queue are filled into them to obtain backfill video frames; and soft labels of the backfill video frames are calculated based on the labels of the backfill video frames and the labels of the key areas filled into them, the backfill video frames and their corresponding soft labels being used to train a preset action recognition model. Through this backfill processing, the video frames carry labels corresponding to different scenes, so a model trained with the backfill video frames can be applied to scenes as different as a looking-back scene, a basketball scene, and a falling scene, improving the robustness of model recognition.
Referring to fig. 2, a flowchart illustrating steps of another data processing method for model training according to an embodiment of the present invention may specifically include the following steps:
step 201, acquiring a training data set; wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag;
The training data set in the embodiment of the invention may be derived from a data set published on the Internet. It comprises a plurality of video clips, each of which contains a plurality of video frames arranged in sequence, for example in the playing order of the clip. Each video frame has a corresponding label, which indicates that the frame belongs to a certain type of scene, such as a looking-back scene, a basketball scene, or a falling scene; the labels may be added manually or computed by a model. The number of video frames in each clip is determined by conditions such as video quality, hardware, and the training strategy, and may be 16 frames, 24 frames, and so on; the invention is not limited in this regard.
Step 202, sequentially extracting key areas of the video frames, and taking tags of the video frames as tags of the key areas in the video frames;
Based on the video clips, the key areas of the video frames can be extracted in the order in which the frames are arranged. Within a video frame, different areas contain different information: static information such as trees and houses, and dynamic information such as the action information of a person in motion. The dynamic information usually serves as the key information, so the areas containing it can be taken as the key areas of the video frame, and the label of the video frame is used as the label of the key areas in that frame. After the key areas are extracted, the rest of the video frame can serve as the background.
In an alternative embodiment of the present invention, the step 202 includes:
Sub-step S111, dividing the video frame into a first preset number of areas in sequence, each area containing pixel information.
For the video frames in the training data set, the key areas can be extracted in the order in which the frames are arranged. To extract them, a video frame is first divided evenly into a first preset number of areas; in an example, a frame may be divided evenly into 9 areas, or into 16 areas, and the invention is not limited in this regard. Each divided area contains pixel information of the video frame: a frame forms a picture out of many pixels, and once the frame is divided into several areas, different areas show different parts of the picture, so the pixel information they contain differs.
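As a minimal sketch of this even division (the function name, the default 3×3 grid matching the 9-area example, and the numpy layout are assumptions):

```python
import numpy as np

def split_into_areas(frame: np.ndarray, grid: int = 3):
    """Evenly divide a frame into grid*grid areas, keeping each area's
    position (row/column slices) together with its pixel information."""
    h, w = frame.shape[:2]
    rh, rw = h // grid, w // grid
    areas = []
    for i in range(grid):
        for j in range(grid):
            rows = slice(i * rh, (i + 1) * rh)
            cols = slice(j * rw, (j + 1) * rw)
            areas.append((rows, cols, frame[rows, cols]))
    return areas
```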
Sub-step S112, calculating a score for each region in the video frame based on the pixel information.
Since the pixel information contained in each region is different, a score corresponding to each region in the video frame can be calculated based on the pixel information, and the key region and the non-key region in the video frame can be determined by the score.
Sub-step S113, selecting a second preset number of regions as key areas according to the scores.
After the scores of the regions are calculated, a second preset number of regions with the highest scores are selected as key areas, in order from the highest score to the lowest. In an example, if a video frame is divided evenly into 9 regions, the 3 regions with the highest scores can be selected as key areas.
In an alternative embodiment of the present invention, the substep S112 includes:
substep S1121, converting said region into a region vector;
After the video frame is divided equally into a first preset number of regions, each region may be converted into a region vector.
Sub-step S1122, calculating the score of the region vector by using the pixel information based on an attention mechanism, thereby obtaining the score of the region corresponding to the region vector.
After the region vectors are obtained, the score of the current region vector is calculated, based on an attention mechanism, from the pixel information of the current region and of at least one surrounding region, and that score is used as the score of the corresponding region. Calculating a score in this way for every region vector gives the score of each region in the video frame.
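The patent does not spell out the attention computation, so the following is only one plausible reading: each region vector is scored by the attention it receives from the other region vectors under scaled dot-product attention, and the top-scoring regions are kept as key areas.

```python
import torch
import torch.nn.functional as F

def select_key_areas(region_vectors: torch.Tensor, top_k: int = 3):
    """region_vectors: (N, D) tensor, one flattened pixel vector per region.
    Returns the indices of the top_k regions by attention score."""
    d = region_vectors.shape[-1]
    attn = region_vectors @ region_vectors.T / d ** 0.5  # pairwise similarities
    weights = F.softmax(attn, dim=-1)                    # attention each region pays out
    scores = weights.mean(dim=0)                         # attention each region receives
    return torch.topk(scores, k=top_k).indices
```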
Step 203, sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas;
To reuse the key areas extracted from the video frames cyclically, a queue can be preset in which key areas corresponding to video frames of different scenes are stored. Each time the key areas of a video frame are obtained, they are saved to the tail of the queue, so that key areas stored earlier sit toward the head and those stored later toward the tail. The queue can hold a limited number of key areas, and the key areas stored first are output first.
Step 204, sequentially selecting the video frames, and filling the key areas currently positioned at the head of the queue into the video frames to obtain backfill video frames;
Because the queue is first-in, first-out, for a video frame whose key areas have been extracted, the key areas currently at the head of the queue can be fetched and used as the foreground. Those head-of-queue key areas were extracted from video frames of other scenes, so their labels differ from the label of the current video frame; filling them into the video frame therefore yields a backfill video frame containing both foreground and background, and makes the action information corresponding to the key areas appear in different scenes.
As shown in fig. 3, a schematic diagram of obtaining a backfill video frame in the embodiment of the present invention: after a video frame 301 is divided into different areas, the score of each area is calculated through an attention mechanism to obtain scored areas 302, and the 3 areas with the highest scores are taken as key areas. After key area extraction 303, the key areas are saved to the tail of a queue 304; the key areas at the head of the queue are then fetched and key area backfill 305 is performed to obtain a backfill video frame 306.
Step 205, calculating the similarity between the label of the backfill video frame and the label of the key area filled in the backfill video frame;
The label of the video frame differs from the label of the key areas filled into it, so after the backfill video frame is obtained it carries not only the label of the original video frame but also the label of the filled-in key areas, and the similarity between the label of the backfill video frame and the label of the key areas filled into it can be calculated.
In one example, the similarity λ may be calculated as follows. For a video frame Xi with label yi, the N highest-scoring area patches (PL = {pL,1, …, pL,N}, yi) are selected and saved to the tail of the queue, and the N patches at the head of the queue are (P1 = {p1,1, …, p1,N}, yj), where yj is the label corresponding to the N head-of-queue patches. The similarity is computed with the Gaussian kernel
λ = exp(−‖PL − P1‖ / σ),
where the numerator ‖PL − P1‖ is the modulus (norm) of the difference of the two sets of area vectors, i.e. their distance in space, and the denominator σ is the kernel radius, which can be determined preliminarily from the magnitude of the numerator and then adjusted manually on the basis of a large number of experiments.
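Reading the formula above as a Gaussian kernel over the two patch stacks, a sketch of the computation might be (the exact kernel form and the default σ are assumptions):

```python
import torch

def similarity(p_tail: torch.Tensor, p_head: torch.Tensor, sigma: float = 1.0):
    """p_tail, p_head: (N, D) stacks of the N area vectors saved to the queue
    tail and fetched from the queue head; sigma is the kernel radius, set
    roughly from the magnitude of the numerator and then tuned by experiment."""
    distance = torch.linalg.norm(p_tail - p_head)  # modulus of the difference
    return torch.exp(-distance / sigma).item()     # similarity lambda in (0, 1]
```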
Step 206, calculating the soft label based on the similarity, the label of the backfill video frame, and the label of the key area filled in the backfill video frame.
After the similarity between the label of the backfill video frame and the label of the key areas filled into it is obtained, the soft label is calculated from it by the formula λ·yi + (1 − λ)·yj, where λ is the similarity, yi is the label of the backfill video frame, and yj is the label of the key areas filled into it. The preset action recognition model is then trained based on the backfill video frames and their corresponding soft labels.
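With λ in hand, the mixing itself is one line; here the labels are assumed to be one-hot vectors over the scene classes:

```python
import torch

def soft_label(lam: float, y_i: torch.Tensor, y_j: torch.Tensor) -> torch.Tensor:
    """y_i: one-hot label of the backfill video frame's original scene;
    y_j: one-hot label of the key areas filled into it."""
    return lam * y_i + (1.0 - lam) * y_j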
In the embodiment of the invention, a training data set comprising a plurality of sequentially arranged video frames, each with a corresponding label, is acquired; key areas of the video frames are extracted in turn, with the label of each video frame used as the label of its key areas; the key areas are stored in turn to the tail of a preset queue to obtain a queue containing a plurality of key areas; the video frames are selected in turn and the key areas currently at the head of the queue are filled into them to obtain backfill video frames; the similarity between the label of each backfill video frame and the label of the key areas filled into it is calculated; and the soft label is calculated based on that similarity, the label of the backfill video frame, and the label of the filled-in key areas. Through the backfill of key areas, the video frames carry labels corresponding to different scenes, so a model trained on them can be applied to scenes as different as a looking-back scene, a basketball scene, and a falling scene, improving the robustness of model recognition; further, calculating the similarity between the key areas records the relationship between background and foreground, so the background information is not completely ignored.
Referring to fig. 4, a flowchart illustrating steps of a method for identifying actions provided in an embodiment of the present invention may specifically include the following steps:
Step 401, obtaining a sample to be identified;
In the embodiment of the application, the sample to be identified may be a video segment obtained from video surveillance in any of multiple scenes, for example a video segment of group behavior or a video segment of a fall; group behavior can be recognized in the former, and fall detection performed on the latter, and so on. The application does not limit the specific application scenario.
Step 402, inputting the sample to be identified into an action recognition model; the action recognition model is obtained by training with samples containing backfill video frames; the samples containing backfill video frames are obtained by acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag; sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames; sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain the backfill video frames;
After a sample to be identified is obtained, it is input into the action recognition model, which is obtained by training with samples containing backfill video frames. The samples containing backfill video frames are obtained from a training data set comprising a plurality of video frames arranged in sequence, each with a corresponding label; the key areas of the video frames are extracted in turn, with the label of each video frame used as the label of its key areas; the key areas are stored in turn to the tail of a preset queue to obtain a queue containing a plurality of key areas; and the video frames are selected in turn, with the key areas currently at the head of the queue filled into them to obtain the backfill video frames.
Step 403, obtaining a prediction label output by the action recognition model; the predictive label is used for classifying the sample to be identified.
The action recognition model can recognize the sample to be recognized, and output the prediction label of the sample to be recognized, so as to obtain the classification of the sample to be recognized.
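As a sketch of this inference path (the model interface and the class-name list are assumed for illustration):

```python
import torch

@torch.no_grad()
def recognize(model: torch.nn.Module, clip: torch.Tensor, class_names: list):
    """clip: (T, C, H, W) stack of frames of the sample to be identified."""
    model.eval()
    probs = model(clip.unsqueeze(0))         # add a batch dimension
    predicted = probs.argmax(dim=-1).item()  # index of the prediction label
    return class_names[predicted]            # classification of the sample
```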
In the embodiment of the invention, a sample to be identified is acquired and input into the action recognition model, and the prediction label output by the model is obtained, so that the action type in the sample to be identified can be recognized with the action recognition model.
Referring to fig. 5, a flowchart illustrating steps of a training method for an action recognition model according to an embodiment of the present invention may specifically include the following steps:
Step 501, acquiring a plurality of training samples; the training samples have soft labels; the training samples are obtained by acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag; sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames; sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain the backfill video frames;
In the embodiment of the application, the training samples may be video segments obtained from video surveillance in any of multiple scenes, for example video segments of group behavior or of falls; group behavior can be recognized in the former, and fall detection performed on the latter, and so on. The application does not limit the specific application scenario.
In the embodiment of the invention, a plurality of training samples can be acquired to train the action recognition model. Each training sample comprises a plurality of video frames, for example 16 or 24 frames; the specific number is determined by conditions such as video quality, hardware, and the training strategy, and the invention is not limited in this regard. The training samples are obtained from a training data set comprising a plurality of video frames arranged in sequence, each with a corresponding label; the key areas of the video frames are extracted in turn, with the label of each video frame used as the label of its key areas; the key areas are stored in turn to the tail of a preset queue to obtain a queue containing a plurality of key areas; and the video frames are selected in turn, with the key areas currently at the head of the queue filled into them to obtain the backfill video frames.
Step 502, inputting a plurality of training samples into an action recognition model; the action recognition model comprises a multi-layer perceptron; the multi-layer perceptron is used for removing noise generated in the process of processing the training sample by the action recognition model;
A multi-layer perceptron (MLP) is a feed-forward artificial neural network; besides its input and output layers, it may have multiple hidden layers in between. Here it is used to remove noise that may exist in the hidden representation produced while the action recognition model processes the training samples.
After a plurality of training samples are obtained, the plurality of training samples are input into the motion recognition model, so that the motion recognition model is trained.
Step 503, obtaining a prediction label output by the action recognition model;
In an alternative embodiment of the present invention, the action recognition model may comprise a backbone network, a multi-layer perceptron, and a normalization layer (Softmax). The backbone network may be a common video encoder such as a 3-dimensional convolutional neural network (C3D), a 3D residual network (ResNet3D), or a ViT (Vision Transformer), and encodes the training samples into a low-dimensional representation; the multi-layer perceptron removes noise that the low-dimensional representation may contain. After the multi-layer perceptron, processing by the normalization layer yields the prediction label of the training sample.
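A minimal sketch of this three-stage structure, with the backbone passed in and the hidden size chosen purely for illustration:

```python
import torch.nn as nn

class ActionRecognitionModel(nn.Module):
    """Backbone video encoder -> multi-layer perceptron (removes noise in
    the low-dimensional representation) -> softmax normalization layer."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone  # e.g. a C3D / 3D-ResNet / ViT encoder
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),  # hidden layer(s)
            nn.Linear(feat_dim, num_classes),
        )
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, clips):
        low_dim = self.backbone(clips)  # encode clips into a low-dim representation
        return self.softmax(self.mlp(low_dim))
```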
Step 504, comparing the predicted label with the soft label of the training sample to obtain a current training loss;
After the prediction label is obtained, it is compared with the soft label of the current training sample, and the training loss between them is calculated. In an example, the cross-entropy loss between the prediction label and the soft label may be used as the current training loss.
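Because the target is a soft label rather than a class index, the cross entropy is taken against the whole distribution; a sketch:

```python
import torch

def soft_cross_entropy(pred_probs: torch.Tensor, soft_targets: torch.Tensor):
    """pred_probs: (B, C) softmax outputs; soft_targets: (B, C) soft labels."""
    return -(soft_targets * torch.log(pred_probs + 1e-8)).sum(dim=-1).mean()
```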
And 505, adjusting the motion recognition model based on the current training loss, and training the motion recognition model by adopting a plurality of training samples again until the current training loss meets the preset training condition, so as to obtain the motion recognition model after training.
After the current training loss is obtained, the action recognition model is adjusted based on it, and a plurality of training samples are used again to train the model until the current training loss meets the preset training condition, giving the trained action recognition model. In an example, a loss threshold may be set; when the current training loss is less than the threshold, it is determined that the preset training condition is met and training is complete.
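Putting these pieces together, the adjust-and-retrain loop might be sketched as follows (the optimizer, learning rate, and loss threshold are assumptions; soft_cross_entropy is the sketch above):

```python
import torch

def train(model, loader, loss_threshold: float = 0.05, lr: float = 1e-4):
    """loader yields (clips, soft_labels) built from backfill video frames."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    current_loss = float("inf")
    while current_loss > loss_threshold:  # preset training condition
        for clips, soft_labels in loader:
            pred = model(clips)
            loss = soft_cross_entropy(pred, soft_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            current_loss = loss.item()
    return model
```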
In an alternative embodiment of the present invention, each of the video frames has the same preset size.
The video frames may be standardized and set to the same size, and in one example, the video frames may be set to 300 x 500 resolution. The invention does not limit the size of the video frame.
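For instance, with torchvision (treating 300 as the height and 500 as the width, though the text does not fix the order):

```python
import torch
from torchvision.transforms.functional import resize

def standardize(frames: torch.Tensor, height: int = 300, width: int = 500):
    """frames: (T, C, H, W); returns every frame at the same preset size."""
    return resize(frames, [height, width], antialias=True)
```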
As shown in fig. 6, a schematic diagram of training an action recognition model in the embodiment of the present invention: after the video frames in a training sample 601 are divided into different areas, the score of each area is calculated through an attention mechanism to obtain scored areas 602, and the 3 areas with the highest scores are taken as key areas. After key area extraction 603, the key areas are saved to the tail of a queue 604; the key areas at the head of the queue are fetched and key area backfill 605 is performed to obtain a backfill video frame 606, whose soft label is calculated. The backfill video frame 606 is input into a backbone network 607, and the resulting low-dimensional representation is input into a multi-layer perceptron 608, which removes noise the representation may contain. After the multi-layer perceptron 608, processing by the normalization layer yields the prediction label 609 of the training sample. The prediction label is compared with the soft label of the current training sample and the training loss between them is calculated; the action recognition model is adjusted based on the current training loss and trained again with a plurality of training samples until the current training loss meets the preset training condition, giving the trained action recognition model.
In the embodiment of the invention, a plurality of training samples with soft labels are acquired and input into the action recognition model; the prediction labels output by the model are compared with the soft labels of the training samples to obtain the current training loss; and the model is adjusted based on that loss and trained again with the training samples until the loss meets the preset training condition, yielding the trained action recognition model. The model can thus be trained against the soft labels and prediction labels, while the multi-layer perceptron suppresses noise that may exist in the hidden representation, improving the accuracy of model training and the efficiency of recognition.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 7, a block diagram of a data processing apparatus for model training according to an embodiment of the present invention is shown, which may specifically include the following modules:
A first acquisition module 701, configured to acquire a training data set; wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag;
the extracting module 702 is configured to sequentially extract key areas of the video frames, and take tags of the video frames as tags of the key areas in the video frames;
A saving module 703, configured to sequentially save the key areas to a tail of a preset queue, so as to obtain a queue containing a plurality of key areas;
a selecting module 704, configured to sequentially select the video frames, and fill the key area currently located at the head of the queue into the video frames to obtain backfilled video frames;
A calculating module 705, configured to calculate a soft tag of the backfill video frame based on the tag of the backfill video frame and a tag of a key region filled in the backfill video frame; the backfill video frames and the soft labels corresponding to the backfill video frames are used for training a preset action recognition model.
In an alternative embodiment of the present invention, the extracting module 702 includes:
The dividing sub-module is used for dividing the video frame into a first preset number of areas in sequence; the region contains pixel information;
an evaluation sub-module for calculating a score for each region in the video frame based on the pixel information;
And the key region sub-module is used for selecting a second preset number of regions as key regions according to the evaluation.
In an alternative embodiment of the invention, the evaluation sub-module comprises:
a conversion subunit for converting the region into a region vector;
and a region scoring subunit for calculating the score of the region vector by using the pixel information based on an attention mechanism, thereby obtaining the score of the region corresponding to the region vector.
In an alternative embodiment of the present invention, the computing module 705 includes:
The similarity sub-module is used for calculating the similarity between the label of the backfill video frame and the label of the key area filled in the backfill video frame;
and the soft label sub-module is used for calculating the soft label based on the similarity, the label of the backfill video frame, and the label of the key area filled in the backfill video frame.
Referring to fig. 8, a block diagram of a motion recognition device provided in an embodiment of the present invention is shown, which may specifically include the following modules:
A second obtaining module 801, configured to obtain a sample to be identified;
a first input module 802, configured to input the sample to be identified into an action recognition model; the action recognition model is obtained by training with samples containing backfill video frames; the samples containing backfill video frames are obtained by acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag; sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames; sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain the backfill video frames;
a first output module 803, configured to obtain a prediction tag output by the motion recognition model; the predictive label is used for classifying the sample to be identified.
Referring to fig. 9, a block diagram of a training device for an action recognition model according to an embodiment of the present invention is shown, which may specifically include the following modules:
A third obtaining module 901, configured to obtain a plurality of training samples; the training samples have soft labels; the training samples are obtained by acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag; sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames; sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain the backfill video frames;
A second input module 902, configured to input a plurality of training samples into an action recognition model; the action recognition model comprises a multi-layer perceptron; the multi-layer perceptron is used for removing noise generated in the process of processing the training sample by the action recognition model;
A second output module 903, configured to obtain a prediction tag output by the motion recognition model;
A training loss module 904, configured to compare the predicted label with the soft label of the training sample to obtain a current training loss;
And the adjustment module 905 is configured to adjust the motion recognition model based on the current training loss, and re-use a plurality of training samples to train the motion recognition model until the current training loss meets a preset training condition, thereby obtaining a trained motion recognition model.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In addition, the embodiment of the invention also provides an electronic device, as shown in fig. 10, which comprises a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, wherein the processor 1001, the communication interface 1002 and the memory 1003 complete communication with each other through the communication bus 1004,
A memory 1003 for storing a computer program;
the processor 1001 is configured to execute a program stored in the memory 1003, and implement the following steps:
acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag;
sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames;
sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas;
sequentially selecting the video frames, and filling the key areas currently positioned at the head of the queue into the video frames to obtain backfill video frames;
and calculating a soft tag of the backfill video frame based on the tag of the backfill video frame and the tag of the key area filled in the backfill video frame, wherein the backfill video frames and the soft tags corresponding to the backfill video frames are used for training a preset action recognition model.
Optionally, the step of sequentially extracting key areas of the video frame includes:
dividing the video frame into a first preset number of areas in sequence; the region contains pixel information;
calculating a score for each region in the video frame based on the pixel information;
and selecting a second preset number of areas as key areas according to the scores.
Optionally, the method for calculating the score of each region in the video frame based on the pixel information includes:
converting the region into a region vector;
and calculating the score of the region vector by adopting the pixel information based on an attention mechanism, thereby obtaining the score of the region corresponding to the region vector.
Optionally, the step of calculating the soft tag of the backfill video frame based on the tag of the backfill video frame and the tag of the key region filled in the backfill video frame includes:
calculating the similarity between the label of the backfill video frame and the label of the key area filled in the backfill video frame;
And calculating the soft label based on the similarity, the label of the backfill video frame and the label of the key area filled in the backfill video frame.
Or implement the following steps:
acquiring a sample to be identified;
inputting the sample to be identified into an action recognition model; the action recognition model is obtained by training with samples containing backfill video frames; the samples containing backfill video frames are obtained by acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding tag; sequentially extracting key areas of the video frames, and taking the tags of the video frames as the tags of the key areas in the video frames; sequentially storing the key areas to the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key areas currently positioned at the head of the queue into the video frames to obtain the backfill video frames;
obtaining a prediction label output by the action recognition model; the predictive label is used for classifying the sample to be identified.
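A minimal inference sketch for these steps, assuming a PyTorch model that accepts a clip tensor of shape (T, C, H, W); the framework choice and tensor shapes are assumptions, not details fixed by this embodiment.

```python
import torch

def recognize(model, clip):
    """clip: float tensor of shape (T, C, H, W) holding the video frames
    of the sample to be identified. Returns the predicted class index."""
    model.eval()
    with torch.no_grad():
        logits = model(clip.unsqueeze(0))   # add a batch dimension
    return int(logits.argmax(dim=-1))       # prediction label classifies the sample
```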
Alternatively, the processor 1001 implements the steps of the action recognition model training method:
acquiring a plurality of training samples, each having a soft label, wherein the training samples are obtained by: acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding label; sequentially extracting key areas of the video frames, and taking the label of each video frame as the label of the key areas in that video frame; sequentially storing the key areas at the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key area currently at the head of the queue into the selected video frame to obtain backfill video frames;
inputting the plurality of training samples into an action recognition model, wherein the action recognition model comprises a multi-layer perceptron used for removing noise generated while the action recognition model processes the training samples;
obtaining a prediction label output by the action recognition model;
comparing the prediction label with the soft label of the training sample to obtain a current training loss;
and adjusting the action recognition model based on the current training loss and training it again with the plurality of training samples until the current training loss meets a preset training condition, thereby obtaining a trained action recognition model.
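A sketch of this training loop under stated assumptions: PyTorch, an Adam optimizer, and a soft cross-entropy as the way the prediction is compared with the soft label; the optimizer, learning rate, and the loss threshold standing in for the preset training condition are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def train_action_model(model, loader, epochs=10, lr=1e-4, loss_threshold=0.05):
    """loader yields (clips, soft_labels) built from backfill video frames;
    model is assumed to end in a multi-layer-perceptron head."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for clips, soft_labels in loader:
            log_probs = F.log_softmax(model(clips), dim=-1)
            # Soft cross-entropy between the soft label and the prediction.
            loss = -(soft_labels * log_probs).sum(dim=-1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        if loss.item() < loss_threshold:    # preset training condition (assumed)
            break
    return model
```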
Optionally, each of the video frames has the same preset size.
The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, the bus is represented by a single bold line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include Random Access Memory (RAM) or non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment provided by the present invention, as shown in fig. 11, there is further provided a computer-readable storage medium 1101 having instructions stored therein which, when run on a computer, cause the computer to perform the data processing method for model training, the action recognition method, or the action recognition model training method described in the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the data processing method for model training, the action recognition method, or the action recognition model training method described in the above embodiments.
In the above embodiments, the methods may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), semiconductor media (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments refer to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant points, refer to the partial description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (12)

1. A data processing method for model training, comprising:
acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding label;
sequentially extracting key areas of the video frames, and taking the label of each video frame as the label of the key areas in that video frame;
sequentially storing the key areas at the tail of a preset queue to obtain a queue containing a plurality of key areas;
sequentially selecting the video frames, and filling the key area currently at the head of the queue into the selected video frame to obtain a backfill video frame;
calculating a soft label of the backfill video frame based on the label of the backfill video frame and the label of the key area filled into it; the backfill video frames and their corresponding soft labels are used for training a preset action recognition model.
2. The method of claim 1, wherein the step of sequentially extracting key areas of the video frames comprises:
sequentially dividing the video frame into a first preset number of areas, each area containing pixel information;
calculating a score for each area in the video frame based on its pixel information;
and selecting a second preset number of areas as key areas according to their scores.
3. The method of claim 2, wherein calculating the score of each area in the video frame based on the pixel information comprises:
converting each area into an area vector;
and scoring each area vector with an attention mechanism applied to the pixel information, the resulting score serving as the score of the corresponding area.
4. The method of claim 1, wherein the step of calculating the soft label of the backfill video frame based on the label of the backfill video frame and the label of the key area filled into the backfill video frame comprises:
calculating the similarity between the label of the backfill video frame and the label of the key area filled into the backfill video frame;
and calculating the soft label based on the similarity, the label of the backfill video frame, and the label of the key area filled into the backfill video frame.
5. A method of motion recognition, the method comprising:
acquiring a sample to be identified;
inputting the sample to be identified into an action recognition model, wherein the action recognition model is obtained by training with samples containing backfill video frames, and those samples are obtained by: acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding label; sequentially extracting key areas of the video frames, and taking the label of each video frame as the label of the key areas in that video frame; sequentially storing the key areas at the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key area currently at the head of the queue into the selected video frame to obtain the backfill video frames;
obtaining a prediction label output by the action recognition model, the prediction label being used for classifying the sample to be identified.
6. A method of training a motion recognition model, the method comprising:
acquiring a plurality of training samples, each having a soft label, wherein the training samples are obtained by: acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding label; sequentially extracting key areas of the video frames, and taking the label of each video frame as the label of the key areas in that video frame; sequentially storing the key areas at the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key area currently at the head of the queue into the selected video frame to obtain backfill video frames;
inputting the plurality of training samples into an action recognition model, wherein the action recognition model comprises a multi-layer perceptron used for removing noise generated while the action recognition model processes the training samples;
obtaining a prediction label output by the action recognition model;
comparing the prediction label with the soft label of the training sample to obtain a current training loss;
and adjusting the action recognition model based on the current training loss and training it again with the plurality of training samples until the current training loss meets a preset training condition, thereby obtaining a trained action recognition model.
7. The method of claim 6, wherein each of the video frames has the same preset size.
8. A data processing apparatus for model training, comprising:
The first acquisition module is used for acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding label;
the extraction module is used for sequentially extracting key areas of the video frames and taking the label of each video frame as the label of the key areas in that video frame;
the storage module is used for sequentially storing the key areas at the tail of a preset queue to obtain a queue containing a plurality of key areas;
the selecting module is used for sequentially selecting the video frames and filling the key area currently at the head of the queue into the selected video frame to obtain a backfill video frame;
the computing module is used for calculating a soft label of the backfill video frame based on the label of the backfill video frame and the label of the key area filled into it, wherein the backfill video frames and their corresponding soft labels are used for training a preset action recognition model.
9. An action recognition device, the device comprising:
The second acquisition module is used for acquiring a sample to be identified;
the first input module is used for inputting the sample to be identified into an action recognition model, wherein the action recognition model is obtained by training with samples containing backfill video frames, and those samples are obtained by: acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding label; sequentially extracting key areas of the video frames, and taking the label of each video frame as the label of the key areas in that video frame; sequentially storing the key areas at the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key area currently at the head of the queue into the selected video frame to obtain the backfill video frames;
the first output module is used for obtaining the prediction label output by the action recognition model, the prediction label being used for classifying the sample to be identified.
10. An action recognition model training apparatus, the apparatus comprising:
The third acquisition module is used for acquiring a plurality of training samples, each having a soft label, wherein the training samples are obtained by: acquiring a training data set, wherein the training data set comprises a plurality of video frames arranged in sequence, each video frame having a corresponding label; sequentially extracting key areas of the video frames, and taking the label of each video frame as the label of the key areas in that video frame; sequentially storing the key areas at the tail of a preset queue to obtain a queue containing a plurality of key areas; and sequentially selecting the video frames and filling the key area currently at the head of the queue into the selected video frame to obtain backfill video frames;
The second input module is used for inputting a plurality of training samples into the action recognition model; the action recognition model comprises a multi-layer perceptron; the multi-layer perceptron is used for removing noise generated in the process of processing the training sample by the action recognition model;
the second output module is used for obtaining the prediction label output by the action recognition model;
the training loss module is used for comparing the prediction label with the soft label of the training sample to obtain the current training loss;
and the adjusting module is used for adjusting the action recognition model based on the current training loss and training it again with the plurality of training samples until the current training loss meets the preset training condition, thereby obtaining a trained action recognition model.
11. An electronic device, comprising: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method according to any one of claims 1-4, 5, or 6-7 when executing the program stored in the memory.
12. One or more computer-readable media having instructions stored thereon which, when executed by one or more processors, cause the processors to perform the method according to any one of claims 1-4, 5, or 6-7.
CN202211733719.7A 2022-12-30 2022-12-30 Data processing method, action recognition method and device for model training Pending CN118279781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211733719.7A CN118279781A (en) 2022-12-30 2022-12-30 Data processing method, action recognition method and device for model training

Publications (1)

Publication Number Publication Date
CN118279781A true CN118279781A (en) 2024-07-02

Family

ID=91637506

Country Status (1)

Country Link
CN (1) CN118279781A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination