CN112036252A - Method and device for constructing action labeling model and video action labeling

Info

Publication number
CN112036252A
Authority
CN
China
Legal status
Pending
Application number
CN202010775545.5A
Other languages
Chinese (zh)
Inventor
周慧子
高宗
陈彦宇
马雅奇
谭龙田
刘欢
Current Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Lianyun Technology Co Ltd
Priority to CN202010775545.5A
Publication of CN112036252A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition


Abstract

The embodiments of the invention relate to a method and a device for constructing an action labeling model and labeling video actions. The method comprises the following steps: inputting a video (i) into a model (i), so that the model (i) performs a labeling operation on the actions of a target person in the video (i) and outputs the video (i) carrying label data (2i-1), where i is a positive integer greater than or equal to 1; performing a calibration operation on the label data (2i-1) in the video (i) carrying the label data (2i-1) to generate the video (i) carrying label data (2i); inputting the video (i) carrying the label data (2i) into the model (i) for training to obtain a model (i+1); and, if the similarity of the corresponding operation parameters of the model (i+1) and the model (i) is within a preset range, taking the model (i+1) as the action labeling model. The method can thus obtain a larger amount of labeled sample data from a small amount of sample data to train the model.

Description

Method and device for constructing action labeling model and video action labeling
Technical Field
The embodiments of the invention relate to the field of action recognition, and in particular to a method and a device for constructing an action labeling model and labeling video actions.
Background
In the field of vision-based human action recognition, existing action recognition techniques first require training an action recognition model. Model training usually requires manually labeling a large number of learning samples: a classification model is first trained on these samples, image features are then extracted from the video using the classification model, and the extracted image features are classified and labeled manually.
Disclosure of Invention
In view of this, in order to solve the technical problem that a large amount of sample data needs to be labeled manually in the early stage of traditional model training, embodiments of the present invention provide a method and an apparatus for constructing an action labeling model and labeling video actions.
In a first aspect, an embodiment of the present invention provides a method for constructing an action tagging model, including:
inputting a video (i) into a model (i), so that the model (i) performs a labeling operation on the action of a target person in the video (i), and outputting the video (i) carrying label data (2i-1), wherein i is a positive integer greater than or equal to 1;
performing a calibration operation on the tag data (2i-1) in the video (i) carrying the tag data (2i-1) to generate the video (i) carrying the tag data (2 i);
inputting the video (i) carrying the label data (2i) into the model (i) for training to obtain a model (i + 1);
and if the similarity of the corresponding operation parameters of the model (i +1) and the model (i) is within a preset range, taking the model (i +1) as an action labeling model.
In one possible embodiment, the method further comprises:
determining action types corresponding to actions generated by a target person in the video (i) and start-stop time corresponding to each action type;
and performing calibration operation on the label data (2i-1) based on the action types and the start-stop time corresponding to each action type, correcting type errors and/or start-stop time errors in the label data (2i-1), and generating the video (i) carrying the label data (2 i).
In one possible embodiment, the method further comprises:
if the similarity of the model (i +1) and the corresponding operation parameter of the model (i) is not in a preset range, the training step of the model (i +1) is continuously executed by adjusting the operation parameter in the model (i + 1).
In a second aspect, an embodiment of the present invention provides a video action marking method, including:
inputting video data to be labeled to an action labeling model so that the action labeling model can identify action types of a target person in the video data and the start-stop time corresponding to each action type;
and the action labeling model executes labeling operation on the action of the target person based on the action types and the start and stop time corresponding to each action type, and outputs the video data carrying label data.
In one possible embodiment, the method further comprises:
dividing the video into a plurality of frame images according to time sequence information;
determining the target person from a plurality of frame images based on a preset region frame containing the motion change of the target person;
extracting a group of action features corresponding to the target person from each frame image to obtain a plurality of groups of action features corresponding to the plurality of frame images;
matching the plurality of groups of action features with standard action features stored in an action feature database one by one, and taking an action type corresponding to the standard action feature with the similarity exceeding a set threshold as an action type corresponding to the action feature;
grouping the plurality of frame images according to time sequence information based on the action types to obtain a plurality of video clips formed by a plurality of groups of frame images corresponding to the action types;
and determining the starting and ending time respectively corresponding to the action types in the video based on the starting and ending time respectively corresponding to the starting and ending frame images in the video clip.
In one possible embodiment, the method further comprises:
and outputting the action types and the start-stop time corresponding to each action type together with the video data in a tag form.
In a third aspect, an embodiment of the present invention provides a device for constructing an action tagging model, including:
the acquisition module is used for inputting a video (i) into a model (i), so that the model (i) executes a labeling operation on the action of a target person in the video (i), and outputs the video (i) carrying label data (2 i-1);
a verification module, configured to perform a calibration operation on the tag data (2i-1) in the video (i) carrying the tag data (2i-1), and generate the video (i) carrying the tag data (2 i);
the training module is used for inputting the video (i) carrying the label data (2i) into the model (i) for training to obtain a model (i + 1);
and the determining module is used for taking the model (i +1) as an action labeling model if the similarity of the corresponding operation parameters of the model (i +1) and the model (i) is within a preset range.
In a fourth aspect, an embodiment of the present invention provides a video motion marking device, including:
the marking module is used for inputting video data to be marked to an action marking model, so as to enable the action marking model to identify action types of a target person in the video data and the start-stop time corresponding to each action type;
and the output module is used for executing the tagging operation on the action of the target person by the action tagging model based on the action types and the start and stop time corresponding to each action type and outputting the video data carrying tag data.
In a fifth aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, wherein the processor is configured to execute the program for constructing an action tagging model or the video action marking program stored in the memory, so as to implement the method for constructing an action tagging model according to the first aspect or the video action marking method according to the second aspect.
In a sixth aspect, an embodiment of the present invention provides a storage medium, including: the storage medium stores one or more programs, which are executable by one or more processors to implement the method for constructing an action tagging model according to the first aspect or the method for marking video actions according to the second aspect.
According to the method for constructing an action labeling model provided by the embodiment of the invention, a video (i) is input into a model (i), so that the model (i) performs a labeling operation on the actions of the target person in the video (i) and outputs the video (i) carrying label data (2i-1), where i is a positive integer greater than or equal to 1; a calibration operation is performed on the label data (2i-1) in the video (i) carrying the label data (2i-1) to generate the video (i) carrying label data (2i); the video (i) carrying the label data (2i) is input into the model (i) for training to obtain a model (i+1); and, if the similarity of the corresponding operation parameters of the model (i+1) and the model (i) is within a preset range, the model (i+1) is taken as the action labeling model. The method can therefore train the model by deriving a larger amount of labeled sample data from a small amount of sample data, which saves the time of manually labeling a large amount of data in the early stage, allows the trained and optimized model to be reused, and improves the efficiency of production-line process action detection.
Drawings
Fig. 1 is a schematic flow chart of a method for constructing an action tagging model according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of another method for constructing an action tagging model according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of a video action marking method according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of another video motion marking method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an apparatus for constructing an action tagging model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a video motion marking apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
Fig. 1 is a schematic flow chart of a method for constructing an action tagging model according to an embodiment of the present invention, and as shown in fig. 1, the method specifically includes:
s11, inputting a video (i) into a model (i), so that the model (i) performs a labeling operation on the action of a target person in the video (i), and outputting the video (i) carrying label data (2i-1), wherein i is a positive integer greater than or equal to 1.
The invention can be applied to a workshop production line to monitor whether the manual actions of staff on the line meet the standard, since differences in action sequence, action amplitude and the like may affect product quality and production efficiency. The action labeling model provided by the invention therefore performs a labeling operation on the actions of staff in the surveillance video of the workshop production line, so that product quality can be controlled at a later stage.
When i is equal to 1, model 1 is an initial classification model built by a developer from experience and not yet trained or optimized. A video 1 of a workshop production line is input into model 1; model 1 preliminarily and automatically distinguishes the actions in the video and performs a preliminary labeling operation on the distinguished actions. The label content comprises the action type of each action of the staff and the start-stop time corresponding to each action type. Finally, video 1 carrying label data 1 is output.
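The patent does not prescribe a concrete data format for the label data. As a minimal illustrative sketch (all class, field and action-type names below are hypothetical), label data 1 could be represented as a list of labeled segments, each holding an action type and its start-stop time:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ActionLabel:
    """One labeled segment: an action type and its start/stop times in seconds."""
    action_type: str
    start_time: float
    stop_time: float

@dataclass
class LabeledVideo:
    """A video together with the label data produced by a model."""
    video_path: str
    labels: List[ActionLabel]

# Hypothetical label data 1 as the untrained model 1 might output it for a
# 60-second production-line video (the third segment is misidentified).
label_data_1 = LabeledVideo(
    video_path="workshop_line_video_1.mp4",
    labels=[
        ActionLabel("mount fan blade", 0.0, 28.0),
        ActionLabel("mount gasket", 29.0, 40.0),
        ActionLabel("drive nut", 41.0, 60.0),
    ],
)
```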
It should be noted that a plurality of action types are stored in the initial classification model; the action type includes at least one of:
and (3) carrying out fan blade taking and loading actions, gasket loading actions, fastening glue coating actions and/or nut beating actions by staff in the video.
S12, performing calibration operation on the label data (2i-1) in the video (i) carrying the label data (2i-1) to generate the video (i) carrying the label data (2 i).
In the embodiment of the invention, a worker checks the video with label data output by the model. The label data are displayed synchronously as the video plays, and when the label data are wrong the worker can modify them directly. The video can be replayed or played at a higher speed, which makes it convenient for the worker to calibrate the label data and saves time. Video 1 carrying label data 2 is finally generated.
For example, a video with a total duration of 60 seconds contains three action types: seconds 0-30 are a first action type, seconds 31-40 a second action type, and seconds 41-60 a third action type. The label data generated by the initial model for this video may state that the start-stop time of the first action type is 0-28 seconds and that of the second action type is 29-40 seconds, and may misidentify the action in seconds 41-60 as a fourth action type. The worker therefore needs to calibrate the label data to generate the video with standard label data.
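Continuing the sketch above (and reusing its hypothetical ActionLabel and LabeledVideo structures), the calibration step could simply replace the model's preliminary labels with the worker's corrected action types and start-stop times, yielding label data 2:

```python
from typing import List

def calibrate(labeled_video: LabeledVideo, corrections: List[ActionLabel]) -> LabeledVideo:
    """Replace the model's preliminary labels with the worker's corrected labels."""
    return LabeledVideo(video_path=labeled_video.video_path, labels=corrections)

# Worker corrections for the 60-second example: the first action actually runs
# 0-30 s, the second 31-40 s, and the misidentified third segment is renamed.
label_data_2 = calibrate(
    label_data_1,
    corrections=[
        ActionLabel("mount fan blade", 0.0, 30.0),
        ActionLabel("mount gasket", 31.0, 40.0),
        ActionLabel("apply fastening glue", 41.0, 60.0),
    ],
)
```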
S13, inputting the video (i) carrying the label data (2i) into the model (i) for training to obtain a model (i + 1).
Video 1 with label data 2, as calibrated by the worker, is input into model 1 again, and model 1 undergoes one round of deep-learning training on video 1 with label data 2, yielding model 2.
Further, another video 2, with the same background as video 1 but a different action sequence of the target person and different start-stop times for each action, is input into model 2; model 2 performs the labeling operation on the actions in video 2 and outputs video 2 carrying label data 3.
Further, the worker checks video 2 with label data 3 output by model 2, calibrates label data 3, and generates video 2 with label data 4.
Further, video 2 with label data 4 is input into model 2 again, so that model 2 undergoes one round of deep-learning training on video 2 with label data 4, yielding model 3.
And S14, if the similarity of the corresponding operation parameters of the model (i +1) and the model (i) is within a preset range, taking the model (i +1) as an action labeling model.
When the model performs the labeling operation on the video, the operation parameters are adjusted accordingly. When the similarity of the operation parameters of two successive models is within the preset threshold range, the model (i+1) is determined to be a trained model, and the model (i+1) is taken as the action labeling model.
For example, the similarity of the operation parameters of model 2 (obtained by training model 1 once) and model 3 (obtained by training model 2 once) is 95%, and the range of similarity between two successive models that indicates successful training is set to 90%-98%. Since 95% falls within this range, model 3 is determined to be successfully trained; model 3 can be used as the action labeling model for the scene represented by video 1 and video 2, and can label the actions of persons in videos of that scene.
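The patent does not define how the similarity of the operation parameters of two successive models is computed. One plausible reading, shown below purely as an assumption, is a cosine similarity over the flattened parameter vectors, checked against the preset range:

```python
import numpy as np

def parameter_similarity(params_a, params_b) -> float:
    """Assumed metric: cosine similarity between two flattened parameter sets."""
    a = np.concatenate([np.ravel(p) for p in params_a])
    b = np.concatenate([np.ravel(p) for p in params_b])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_converged(params_prev, params_next, low: float = 0.90, high: float = 0.98) -> bool:
    """Model (i+1) counts as trained when the similarity falls in the preset range."""
    return low <= parameter_similarity(params_prev, params_next) <= high
```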
According to the method for constructing an action labeling model provided by the embodiment of the invention, a video (i) is input into a model (i), so that the model (i) performs a labeling operation on the actions of the target person in the video (i) and outputs the video (i) carrying label data (2i-1), where i is a positive integer greater than or equal to 1; a calibration operation is performed on the label data (2i-1) in the video (i) carrying the label data (2i-1) to generate the video (i) carrying label data (2i); the video (i) carrying the label data (2i) is input into the model (i) for training to obtain a model (i+1); and, if the similarity of the corresponding operation parameters of the model (i+1) and the model (i) is within a preset range, the model (i+1) is taken as the action labeling model. The method can therefore train the model by deriving a larger amount of labeled sample data from a small amount of sample data, which saves the time of manually labeling a large amount of data in the early stage and allows the trained and optimized model to be reused.
Fig. 2 is a schematic flow chart of another method for constructing an action tagging model according to an embodiment of the present invention, and as shown in fig. 2, the method specifically includes:
s21, inputting the video (i) into a model (i), so that the model (i) performs labeling operation on the action of the target person in the video (i), and outputting the video (i) carrying label data (2 i-1).
A video of a workshop production line is input into an initial model designed by a developer from experience and not yet trained or optimized. Algorithms such as target trajectory tracking, image feature extraction, transfer learning or feature fusion are configured in the initial model. Using these algorithms, the initial model preliminarily and automatically identifies the target person in the video, recognizes and distinguishes the actions of the target person, and preliminarily labels the recognized actions. The label content is the type of each action and the start-stop time corresponding to each action type in the video; the labels are output together with the video, and the label content is displayed synchronously while the video plays.
And S22, determining action types corresponding to actions generated by the target person in the video (i), and the starting and ending time corresponding to each action type.
In the embodiment of the invention, after the video carrying the label data is output, the worker checks the video and determines, against the standard, the action types of the target person appearing in sequence throughout the video and the start-stop time of each action type.
For example, the video shows a guide pipe being welded during air-conditioner production, with a total duration of 10 seconds. The action types include picking up the welding tool, holding the weld, and/or removing the welding tool. When the bend near the throttle valve is welded, the weld must be held for at least 3 seconds to ensure that the welding rod is fully melted. The start-stop time corresponding to picking up the welding tool is 0-4 seconds, that corresponding to holding the weld is 5-7 seconds, and that corresponding to removing the welding tool is 8-10 seconds.
S23, performing calibration operation on the label data (2i-1) based on the action types and the start and stop time corresponding to each action type, correcting type errors and/or start and stop time errors in the label data (2i-1), and generating the video (i) carrying the label data (2 i).
Because the initial model has not been deeply trained and optimized, the preliminarily generated label content may contain recognition errors. The worker calibrates the preliminary label data against the standard action types appearing in the video and the start-stop time corresponding to each action type. When the error is a misidentified action type, the worker can directly enter the correct action type name in a text box; when the error is in the start-stop time corresponding to an action type, the worker can correct it by dragging the progress bar. A video with standard label data is finally generated.
S24, inputting the video (i) carrying the label data (2i) into the model (i) for training to obtain a model (i + 1).
After the worker has calibrated the video labels, the program automatically re-inputs the video carrying the standard label data into the initial model and performs deep-learning training on the model, generating the model after one round of training.
And S25, if the similarity of the corresponding operation parameters of the model (i +1) and the model (i) is within a preset range, taking the model (i +1) as an action labeling model.
When the model performs the labeling operation on the video, the operation parameters are adjusted accordingly. When the similarity of the operation parameters of two successive models is within the preset threshold range, the model (i+1) is determined to be a trained model, and the model (i+1) is taken as the action labeling model.
For example, when i is 3, model 2 is obtained by training model 1 once, model 3 by training model 2 once, and model 4 by training model 3 once. The similarity of the operation parameters of model 4 and model 3 is 97%, and the range of similarity between two successive models that indicates successful training is set to 95%-99%. Since 97% falls within this range, model 4 is determined to be successfully trained; model 4 can be used as the action labeling model for the scene represented by the training videos, and can label the actions of persons in videos of that scene.
And S26, if the similarity of the model (i +1) and the corresponding operation parameter of the model (i) is not in a preset range, continuing to execute the training step of the model (i +1) by adjusting the operation parameter in the model (i + 1).
If the similarity of the corresponding operation parameters of the model (i+1) and the model (i) is not within the preset threshold range, the model (i+1) needs to be trained again. The model optimization process of this scheme is completed automatically, and the worker only needs to choose whether to optimize, which saves labor.
During deep-learning training, each time the model finishes labeling a video it outputs the video with new label data; the worker checks and calibrates the new label data, the calibrated video with standard data is input into the model again for training, and a new model is output. When the similarity of the operation parameters of two successively trained models is within the preset range, training is considered complete, and the model (i+1) is taken as the action labeling model.
For example, if the range of similarity between two successive models that indicates successful training is set to 80%-90%, and the similarity of the corresponding operation parameters of the model (i+1) and the model (i) is 70% or 95%, then the model (i+1) is considered not yet trained and unusable, and needs to be trained again.
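Putting steps S21 to S26 together, the construction loop could be sketched as below. The callables label_video, worker_calibrate, train_once and get_params are placeholders for the model's labeling operation, the manual calibration, one round of deep-learning training, and parameter extraction; is_converged is the assumed similarity check from the earlier sketch.

```python
def build_action_labeling_model(initial_model, videos,
                                label_video, worker_calibrate, train_once, get_params,
                                low: float = 0.80, high: float = 0.90):
    """Iterate label -> calibrate -> retrain until two successive models have
    operation parameters whose similarity lies in the preset range (S21-S26)."""
    model = initial_model
    for video in videos:
        # S21: model (i) labels video (i), producing label data (2i-1)
        preliminary_labels = label_video(model, video)
        # S22-S23: the worker corrects action types and start-stop times -> label data (2i)
        calibrated_labels = worker_calibrate(video, preliminary_labels)
        # S24: one round of deep-learning training yields model (i+1)
        next_model = train_once(model, video, calibrated_labels)
        # S25: stop when the parameter similarity is within the preset range
        if is_converged(get_params(model), get_params(next_model), low, high):
            return next_model
        # S26: otherwise keep adjusting the parameters with further training rounds
        model = next_model
    return model
```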
According to the method for constructing an action labeling model provided by the embodiment of the invention, a video (i) is input into a model (i), so that the model (i) performs a labeling operation on the actions of the target person in the video (i) and outputs the video (i) carrying label data (2i-1), where i is a positive integer greater than or equal to 1; a calibration operation is performed on the label data (2i-1) in the video (i) carrying the label data (2i-1) to generate the video (i) carrying label data (2i); the video (i) carrying the label data (2i) is input into the model (i) for training to obtain a model (i+1); and, if the similarity of the corresponding operation parameters of the model (i+1) and the model (i) is within a preset range, the model (i+1) is taken as the action labeling model. The method can therefore train the model cyclically with a small amount of sample data, saving the time of manually labeling sample data and improving the efficiency of model training and optimization.
Fig. 3 is a schematic flow chart of a video motion marking method according to an embodiment of the present invention, and as shown in fig. 3, the method specifically includes:
and S31, inputting the video data to be labeled to an action labeling model, so that the action labeling model identifies the action types of the target characters in the video data and the start and stop time corresponding to each action type.
In the embodiment of the invention, the successfully trained and optimized action labeling model can be applied to the labeling of other production-line process action videos. A video to be labeled is input into the action labeling model; the model automatically identifies the target person in the video, the action types of the target person, and the start-stop time corresponding to each action type, and outputs the action types and start-stop times together with the video.
And S32, the action labeling model executes labeling operation on the action of the target person based on the action types and the start and stop time corresponding to each action type, and outputs the video data carrying label data.
The action labeling model identifies the target person in the video, the action types of the target person, and the start-stop time corresponding to each action type, and outputs the action types and start-stop times together with the video in the form of labels. A worker can check against the label data in the video whether the actions of the production-line operator meet the standard, so that product quality is monitored.
The action labeling model can also be applied to the fields of video monitoring, video retrieval, human-computer interaction and the like.
According to the video action marking method provided by the embodiment of the invention, video data to be labeled are input into the action labeling model, so that the action labeling model identifies the action types of the target person in the video data and the start-stop time corresponding to each action type; the action labeling model performs a labeling operation on the actions of the target person based on the action types and the start-stop time corresponding to each action type, and outputs the video data carrying label data. The method can be applied to the labeling of production-line process action videos, and a worker can check against the label data in the video whether the actions of the production-line operator meet the standard, so that product quality is controlled.
Fig. 4 is a schematic flow chart of another video motion marking method according to an embodiment of the present invention, and as shown in fig. 4, the method specifically includes:
and S41, dividing the video into a plurality of frame images according to the time sequence information.
In the embodiment of the invention, the trained action labeling model is used to identify the actions of the target person in a production-line process action video and to label the identified action types and the start-stop time of each action type. First, the video is input into the action labeling model, which segments the video into a plurality of video frame images as the video plays.
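The patent does not name a library for splitting the video into frame images; one common way to do it, shown here as an assumption, is OpenCV's VideoCapture, keeping each frame together with its timestamp so the time-sequence information is preserved.

```python
import cv2

def split_into_frames(video_path: str):
    """Split a video into time-ordered (timestamp_in_seconds, frame_image) pairs."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if the container reports no FPS
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append((index / fps, frame))
        index += 1
    capture.release()
    return frames
```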
And S42, determining the target person from the plurality of frame images based on a preset area frame containing the motion change of the target person.
A target trajectory tracking algorithm is used: the target person is first captured within a preset region frame covering the target person's motion, and the target person is then located in each of the segmented video frame images, so that the target person's trajectory is tracked.
S43, extracting a set of motion features corresponding to the target person from each of the frame images to obtain a plurality of sets of motion features corresponding to the plurality of frame images.
An image feature extraction algorithm or a feature fusion algorithm is used to extract features from each video frame image. Several image features can be extracted from each frame, from which a group of action features corresponding to the target person in that frame is identified; the groups of action features of the target person corresponding to the plurality of frame images are finally obtained.
And S44, matching the action characteristics with standard action characteristics stored in an action characteristic database one by one, and taking the action type corresponding to the standard action characteristic with the similarity exceeding a set threshold as the action type corresponding to the action characteristic.
The action feature database of the action labeling model stores standard action features corresponding to a plurality of action types. The groups of action features of the target person identified from the plurality of video frame images are compared with the standard action features stored in the database, and the action type whose standard action features have a similarity exceeding a set threshold (for example, 98%) is taken as the action type of the corresponding group of features in each frame image.
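A minimal sketch of this matching step, assuming the extracted action features and the stored standard features are fixed-length vectors and that similarity is measured by cosine similarity (the 0.98 threshold is taken from the example above; the database layout is hypothetical):

```python
import numpy as np
from typing import Dict, Optional

def match_action_type(features: np.ndarray,
                      feature_db: Dict[str, np.ndarray],
                      threshold: float = 0.98) -> Optional[str]:
    """Return the action type whose standard features are most similar to the
    extracted features, provided the similarity exceeds the set threshold."""
    best_type, best_score = None, threshold
    for action_type, standard in feature_db.items():
        score = float(np.dot(features, standard) /
                      (np.linalg.norm(features) * np.linalg.norm(standard)))
        if score > best_score:
            best_type, best_score = action_type, score
    return best_type
```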
And S45, grouping the frame images according to the time sequence information based on the action types to obtain a plurality of video clips formed by a plurality of groups of frame images corresponding to the action types.
In the embodiment of the invention, because one action type corresponds to a plurality of action features, the action types identified for the groups of features in successive frame images repeat. Frame images with the same action type are grouped according to the time-sequence information, giving one group of frame images per action type; each group forms a video clip in time order, and a plurality of video clips corresponding to the plurality of action types are finally obtained.
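Assuming each frame image has already been assigned an action type (for example by the matching sketch above), the grouping step can be sketched as merging consecutive frames that share the same action type into one video clip per run, in time order.

```python
from typing import List, Tuple

def group_into_clips(frame_types: List[Tuple[float, str]]) -> List[Tuple[str, float, float]]:
    """Group time-ordered (timestamp, action_type) pairs into
    (action_type, start_time, stop_time) video clips."""
    clips: List[Tuple[str, float, float]] = []
    for timestamp, action_type in frame_types:
        if clips and clips[-1][0] == action_type:
            clips[-1] = (action_type, clips[-1][1], timestamp)  # extend the current clip
        else:
            clips.append((action_type, timestamp, timestamp))   # start a new clip
    return clips
```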
And S46, determining the start-stop time corresponding to each of the action types in the video based on the start-stop time corresponding to each of the start-stop frame images in the video clip.
And S47, outputting the action types and the start and stop time corresponding to each action type together with the video data in a label form.
The times corresponding to the first and last video frame images in each video clip are taken as the start and stop times of that clip, from which the start-stop time corresponding to each action type is obtained. The identified action types of the target person and the start-stop time corresponding to each action type are output together with the video data in the form of labels, producing a video with label data; a worker can check against the label data whether the actions of the production-line operator meet the standard, so that product quality is monitored.
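Steps S46 and S47 could then emit the clips as label data alongside the video, for example as a JSON sidecar file; the file layout and field names below are hypothetical.

```python
import json
from typing import Iterable, Tuple

def export_labels(video_path: str, clips: Iterable[Tuple[str, float, float]], out_path: str) -> None:
    """Write the action types and their start-stop times as label data for the video."""
    label_data = {
        "video": video_path,
        "labels": [
            {"action_type": action_type, "start_time": start, "stop_time": stop}
            for action_type, start, stop in clips
        ],
    }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(label_data, f, ensure_ascii=False, indent=2)
```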
According to the video action marking method provided by the embodiment of the invention, video data to be labeled are input into the action labeling model, so that the action labeling model identifies the action types of the target person in the video data and the start-stop time corresponding to each action type; the action labeling model performs a labeling operation on the actions of the target person based on the action types and the start-stop time corresponding to each action type, and outputs the video data carrying label data. The method can be applied to the labeling of production-line process action videos, and a worker can check against the label data in the video whether the actions of the production-line operator meet the standard, so that product quality is controlled.
Fig. 5 is a schematic structural diagram of a device for constructing an action tagging model according to an embodiment of the present invention, which specifically includes:
an obtaining module 501, configured to input a video (i) into a model (i), so that the model (i) performs a tagging operation on an action of a target person in the video (i), and outputs the video (i) carrying tag data (2 i-1);
a verification module 502, configured to perform a calibration operation on the tag data (2i-1) in the video (i) carrying the tag data (2i-1), and generate the video (i) carrying the tag data (2 i);
a training module 503, configured to input the video (i) carrying the label data (2i) into the model (i) for training, so as to obtain a model (i + 1);
a determining module 504, configured to use the model (i +1) as an action tagging model if similarity between the model (i +1) and a corresponding operation parameter of the model (i) is within a preset range.
The verification module is specifically used for determining action types corresponding to actions generated by a target person in the video (i) and start-stop time corresponding to each action type; and performing calibration operation on the label data (2i-1) based on the action types and the start-stop time corresponding to each action type, correcting type errors and/or start-stop time errors in the label data (2i-1), and generating the video (i) carrying the label data (2 i).
In a possible embodiment, the determining module is further configured to continue to perform the training step on the model (i +1) by adjusting the operation parameters in the model (i +1) if the similarity between the model (i +1) and the corresponding operation parameters of the model (i) is not within a preset range.
The apparatus for constructing an action tagging model provided in this embodiment may be the apparatus for constructing an action tagging model shown in fig. 5, and may perform all the steps of the method for constructing an action tagging model shown in fig. 1-2, so as to achieve the technical effect of the method for constructing an action tagging model shown in fig. 1-2, and refer to the related description of fig. 1-2, which is not repeated herein for brevity.
Fig. 6 is a schematic structural diagram of a video motion marking device according to an embodiment of the present invention, which specifically includes:
the marking module 601 is configured to input video data to be marked to an action marking model, so that the action marking model identifies action types of a target person in the video data and start-stop time corresponding to each action type;
an output module 602, configured to execute, by the action tagging model, a tagging operation on the action of the target person based on the action types and the start-stop time corresponding to each action type, and output the video data carrying tag data.
The marking module is specifically used for dividing the video into a plurality of frame images according to time sequence information; determining the target person from a plurality of frame images based on a preset region frame containing the motion change of the target person; extracting a group of action features corresponding to the target person from each frame image to obtain a plurality of groups of action features corresponding to the plurality of frame images; matching the plurality of groups of action features with standard action features stored in an action feature database one by one, and taking an action type corresponding to the standard action feature with the similarity exceeding a set threshold as an action type corresponding to the action feature; grouping the plurality of frame images according to time sequence information based on the action types to obtain a plurality of video clips formed by a plurality of groups of frame images corresponding to the action types; and determining the starting and ending time respectively corresponding to the action types in the video based on the starting and ending time respectively corresponding to the starting and ending frame images in the video clip.
And the output module is specifically used for outputting the action types and the start-stop time corresponding to each action type together with the video data in a tag form.
The video motion marking device provided in this embodiment may be the video motion marking device shown in fig. 6, and may perform all the steps of the video motion marking method shown in fig. 3 to 4, so as to achieve the technical effect of the video motion marking method shown in fig. 3 to 4, and please refer to the description related to fig. 3 to 4 for brevity, which is not described herein again.
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention, where the computer device 700 shown in fig. 7 includes: at least one processor 701, memory 702, at least one network interface 704, and other user interfaces 703. The various components in the computer device 700 are coupled together by a bus system 705. It is understood that the bus system 705 is used to enable communications among the components. The bus system 705 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various busses are labeled in figure 7 as the bus system 705.
The user interface 703 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touch pad, or touch screen).
It is to be understood that the memory 702 in embodiments of the present invention may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 702 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 702 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system 7021 and application programs 7022.
The operating system 7021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application 7022 includes various applications, such as a Media Player (Media Player), a Browser (Browser), and the like, for implementing various application services. Programs that implement methods in accordance with embodiments of the present invention can be included within application program 7022.
In the embodiment of the present invention, by calling a program or an instruction stored in the memory 702, specifically, a program or an instruction stored in the application 7022, the processor 701 is configured to perform the method steps provided in any of the embodiments corresponding to fig. 1 and fig. 2, for example, including:
inputting a video (i) into a model (i), so that the model (i) performs a labeling operation on the action of a target person in the video (i), and outputting the video (i) carrying label data (2i-1), wherein i is a positive integer greater than or equal to 1; performing a calibration operation on the tag data (2i-1) in the video (i) carrying the tag data (2i-1) to generate the video (i) carrying the tag data (2 i); inputting the video (i) carrying the label data (2i) into the model (i) for training to obtain a model (i + 1); and if the similarity of the corresponding operation parameters of the model (i +1) and the model (i) is within a preset range, taking the model (i +1) as an action labeling model.
In one possible implementation, determining action types corresponding to actions generated by a target person in the video (i), and a start-stop time corresponding to each action type;
and performing calibration operation on the label data (2i-1) based on the action types and the start-stop time corresponding to each action type, correcting type errors and/or start-stop time errors in the label data (2i-1), and generating the video (i) carrying the label data (2 i).
In a possible embodiment, if the similarity between the model (i +1) and the corresponding operation parameter of the model (i) is not within a preset range, the training step for the model (i +1) is continuously performed by adjusting the operation parameter of the model (i + 1).
Or, the processor 701 is configured to execute the method steps provided by any embodiment of the methods in the embodiments corresponding to fig. 3 and fig. 4, for example, including:
inputting video data to be labeled to an action labeling model so that the action labeling model can identify action types of the target person in the video data and the start-stop time corresponding to each action type; and the action labeling model executes a labeling operation on the actions of the target person based on the action types and the start-stop time corresponding to each action type, and outputs the video data carrying label data.
In one possible embodiment, the video is divided into a plurality of frame images according to the time sequence information; determining the target person from a plurality of frame images based on a preset region frame containing the motion change of the target person; extracting a group of action features corresponding to the target person from each frame image to obtain a plurality of groups of action features corresponding to the plurality of frame images; matching the plurality of groups of action features with standard action features stored in an action feature database one by one, and taking an action type corresponding to the standard action feature with the similarity exceeding a set threshold as an action type corresponding to the action feature; grouping the plurality of frame images according to time sequence information based on the action types to obtain a plurality of video clips formed by a plurality of groups of frame images corresponding to the action types; and determining the starting and ending time respectively corresponding to the action types in the video based on the starting and ending time respectively corresponding to the starting and ending frame images in the video clip.
In one possible embodiment, the action types and the start and stop time corresponding to each action type are output together with the video data in a tag form.
The method disclosed in the above embodiments of the present invention may be applied to the processor 701, or implemented by the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be implemented by integrated logic circuits of hardware in the processor 701 or by instructions in the form of software. The processor 701 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be implemented directly by a hardware decoding processor, or by a combination of hardware and software elements in the decoding processor. The software elements may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 702, and the processor 701 reads the information in the memory 702 and performs the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The computer device provided in this embodiment may be the computer device shown in fig. 7, and may perform all the steps of the method for constructing the motion labeling model shown in fig. 1-2 and the method for video motion labeling shown in fig. 3-4, so as to achieve the technical effects of the method for constructing the motion labeling model shown in fig. 1-2 and the method for video motion labeling shown in fig. 3-4, which are specifically described with reference to fig. 1-2 and fig. 3-4, and are not described herein for brevity.
The embodiment of the invention also provides a storage medium (computer readable storage medium). The storage medium herein stores one or more programs. Among others, the storage medium may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; the memory may also comprise a combination of memories of the kind described above.
The one or more programs in the storage medium are executable by the one or more processors to implement the method for constructing an action labeling model and the video action marking method executed on the computer device side.
The processor is configured to execute the program for constructing an action labeling model and the video action marking program stored in the memory, so as to implement the following steps of the method for constructing an action labeling model executed on the computer device side:
inputting a video (i) into a model (i), so that the model (i) performs a labeling operation on the action of a target person in the video (i), and outputting the video (i) carrying label data (2i-1), wherein i is a positive integer greater than or equal to 1; performing a calibration operation on the tag data (2i-1) in the video (i) carrying the tag data (2i-1) to generate the video (i) carrying the tag data (2 i); inputting the video (i) carrying the label data (2i) into the model (i) for training to obtain a model (i + 1); and if the similarity of the corresponding operation parameters of the model (i +1) and the model (i) is within a preset range, taking the model (i +1) as an action labeling model.
In one possible implementation, determining action types corresponding to actions generated by a target person in the video (i), and a start-stop time corresponding to each action type; and performing calibration operation on the label data (2i-1) based on the action types and the start-stop time corresponding to each action type, correcting type errors and/or start-stop time errors in the label data (2i-1), and generating the video (i) carrying the label data (2 i).
In a possible embodiment, if the similarity between the model (i +1) and the corresponding operation parameter of the model (i) is not within a preset range, the training step for the model (i +1) is continuously performed by adjusting the operation parameter of the model (i + 1).
Alternatively, the following steps of the video action marking method executed on the computer device side are implemented:
inputting video data to be labeled to an action labeling model so that the action labeling model can identify action types of the target person in the video data and the start-stop time corresponding to each action type; and the action labeling model executes a labeling operation on the actions of the target person based on the action types and the start-stop time corresponding to each action type, and outputs the video data carrying label data.
In one possible embodiment, the video is divided into a plurality of frame images according to the time sequence information; determining the target person from a plurality of frame images based on a preset region frame containing the motion change of the target person; extracting a group of action features corresponding to the target person from each frame image to obtain a plurality of groups of action features corresponding to the plurality of frame images; matching the plurality of groups of action features with standard action features stored in an action feature database one by one, and taking an action type corresponding to the standard action feature with the similarity exceeding a set threshold as an action type corresponding to the action feature; grouping the plurality of frame images according to time sequence information based on the action types to obtain a plurality of video clips formed by a plurality of groups of frame images corresponding to the action types; and determining the starting and ending time respectively corresponding to the action types in the video based on the starting and ending time respectively corresponding to the starting and ending frame images in the video clip.
In one possible embodiment, the action types and the start and stop time corresponding to each action type are output together with the video data in a tag form.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (12)

1. A method for constructing an action labeling model, characterized by comprising the following steps:
inputting a video (i) into a model (i), so that the model (i) performs a labeling operation on the action of a target person in the video (i) and outputs the video (i) carrying label data (2i-1), wherein i is a positive integer greater than or equal to 1;
performing a calibration operation on the label data (2i-1) in the video (i) carrying the label data (2i-1), to generate the video (i) carrying label data (2i);
inputting the video (i) carrying the label data (2i) into the model (i) for training, to obtain a model (i+1);
and if the similarity between the corresponding operating parameters of the model (i+1) and the model (i) is within a preset range, taking the model (i+1) as the action labeling model.
2. The method according to claim 1, characterized in that the label data (2i-1) or the label data (2i) comprises: an action type and the start-stop time corresponding to the action type;
the performing a calibration operation on the label data (2i-1) in the video (i) carrying the label data (2i-1) to generate the video (i) carrying the label data (2i) comprises:
determining the action types corresponding to the actions performed by the target person in the video (i) and the start-stop time corresponding to each action type;
and performing the calibration operation on the label data (2i-1) based on the action types and the start-stop time corresponding to each action type, correcting action type errors and/or start-stop time errors in the label data (2i-1), and generating the video (i) carrying the label data (2i).
3. The method of claim 1, wherein if i = 1, the model (1) is an initial classification model;
a plurality of action types are stored in the initial classification model;
the action types include at least one of the following:
the target person picking up and installing a fan blade, installing a gasket, applying fastening glue, and/or tightening a nut.
4. The method according to claim 1, wherein videos (i) with different values of i correspond to videos in which the target person performs different action sequences in the same scene.
5. The method according to any one of claims 1-4, further comprising:
if the similarity between the corresponding operating parameters of the model (i+1) and the model (i) is not within the preset range, adjusting the operating parameters of the model (i+1) and continuing to execute the training step for the model (i+1).
6. A video action labeling method, characterized by comprising the following steps:
inputting video data to be labeled into an action labeling model, so that the action labeling model identifies action types of a target person in the video data and the start-stop time corresponding to each action type;
and the action labeling model performing a labeling operation on the actions of the target person based on the action types and the start-stop time corresponding to each action type, and outputting the video data carrying label data.
7. The method of claim 6, wherein the action labeling model identifying the action types of the target person in the video data and the start-stop time corresponding to each action type comprises:
dividing the video into a plurality of frame images according to time sequence information;
determining the target person from the plurality of frame images based on a preset region frame containing the motion changes of the target person;
extracting a group of action features corresponding to the target person from each frame image, to obtain a plurality of groups of action features corresponding to the plurality of frame images;
matching the plurality of groups of action features one by one against standard action features stored in an action feature database, and taking the action type corresponding to the standard action feature whose similarity exceeds a set threshold as the action type corresponding to the action feature;
grouping the plurality of frame images according to time sequence information based on the action types, to obtain a plurality of video clips formed by the groups of frame images corresponding to the action types;
and determining the start-stop time corresponding to each action type in the video based on the start-stop times corresponding to the first and last frame images of each video clip.
8. The method of claim 7, wherein performing the labeling operation on the actions of the target person based on the action types and the start-stop time corresponding to each action type, and outputting the video data carrying label data, comprises:
outputting the action types and the start-stop time corresponding to each action type together with the video data in the form of labels.
9. An apparatus for constructing an action labeling model, characterized by comprising:
an acquisition module, configured to input a video (i) into a model (i), so that the model (i) performs a labeling operation on the action of a target person in the video (i) and outputs the video (i) carrying label data (2i-1);
a verification module, configured to perform a calibration operation on the label data (2i-1) in the video (i) carrying the label data (2i-1), to generate the video (i) carrying label data (2i);
a training module, configured to input the video (i) carrying the label data (2i) into the model (i) for training, to obtain a model (i+1);
and a determining module, configured to take the model (i+1) as the action labeling model if the similarity between the corresponding operating parameters of the model (i+1) and the model (i) is within a preset range.
10. A video action labeling device, characterized by comprising:
a labeling module, configured to input video data to be labeled into an action labeling model, so that the action labeling model identifies action types of a target person in the video data and the start-stop time corresponding to each action type;
and an output module, configured to have the action labeling model perform the labeling operation on the actions of the target person based on the action types and the start-stop time corresponding to each action type, and output the video data carrying label data.
11. A computer device, characterized by comprising: a processor and a memory, wherein the processor is configured to execute a construction program for an action labeling model or a video action labeling program stored in the memory, so as to implement the method for constructing an action labeling model according to any one of claims 1 to 5 or the video action labeling method according to any one of claims 6 to 8.
12. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the method for constructing an action labeling model according to any one of claims 1 to 5 or the video action labeling method according to any one of claims 6 to 8.
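Claims 1 to 5 describe an iterative label-calibrate-retrain loop whose stopping criterion is the similarity of the operating parameters of successive models. The following is a minimal sketch only, not the claimed implementation: it assumes a generic Model object exposing label(), train() and parameters_vector() methods, cosine similarity over the flattened operating parameters as the similarity measure, and a 0.99 threshold, none of which are fixed by the claims; calibrate stands in for the manual calibration that corrects action type and start-stop time errors in the label data.

import numpy as np

def parameter_similarity(model_a, model_b):
    # cosine similarity between flattened operating parameters (one possible measure)
    a, b = model_a.parameters_vector(), model_b.parameters_vector()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_action_labeling_model(model, videos, calibrate, similarity_threshold=0.99):
    for video in videos:                          # video(i), i = 1, 2, ...
        labeled = model.label(video)              # video(i) carrying label data(2i-1)
        corrected = calibrate(labeled)            # video(i) carrying label data(2i)
        next_model = model.train(corrected)       # train model(i) to obtain model(i+1)
        if parameter_similarity(next_model, model) >= similarity_threshold:
            return next_model                     # parameters stable: use as the action labeling model
        model = next_model                        # otherwise continue with the next round
    return model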
CN202010775545.5A 2020-08-04 2020-08-04 Method and device for constructing action labeling model and video action labeling Pending CN112036252A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010775545.5A CN112036252A (en) 2020-08-04 2020-08-04 Method and device for constructing action labeling model and video action labeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010775545.5A CN112036252A (en) 2020-08-04 2020-08-04 Method and device for constructing action labeling model and video action labeling

Publications (1)

Publication Number Publication Date
CN112036252A true CN112036252A (en) 2020-12-04

Family

ID=73582340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010775545.5A Pending CN112036252A (en) 2020-08-04 2020-08-04 Method and device for constructing action labeling model and video action labeling

Country Status (1)

Country Link
CN (1) CN112036252A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114487A1 (en) * 2017-10-12 2019-04-18 Google Llc Generating a video segment of an action from a video
CN108804658A (en) * 2018-06-08 2018-11-13 Oppo广东移动通信有限公司 Image processing method and device, storage medium, electronic equipment
CN109660865A (en) * 2018-12-17 2019-04-19 杭州柚子街信息科技有限公司 Make method and device, medium and the electronic equipment of video tab automatically for video

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821818A (en) * 2022-06-29 2022-07-29 广东信聚丰科技股份有限公司 Motion data analysis method and system based on intelligent sports
CN114821818B (en) * 2022-06-29 2022-09-16 广东信聚丰科技股份有限公司 Motion data analysis method and system based on intelligent sports

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination