CN116469167A - Method and system for obtaining character action fragments based on character actions in video - Google Patents

Method and system for obtaining character action fragments based on character actions in video

Info

Publication number
CN116469167A
Authority
CN
China
Prior art keywords
video
character
action
numbers
clips
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310395288.6A
Other languages
Chinese (zh)
Inventor
韩继泽
刘永辉
谢恩鹏
王志亮
杜浩
温连龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Langchao Ultra Hd Intelligent Technology Co ltd
Original Assignee
Shandong Langchao Ultra Hd Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Langchao Ultra Hd Intelligent Technology Co ltd filed Critical Shandong Langchao Ultra Hd Intelligent Technology Co ltd
Priority to CN202310395288.6A priority Critical patent/CN116469167A/en
Publication of CN116469167A publication Critical patent/CN116469167A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 - Recognition of whole body movements, e.g. for sport training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for acquiring character action fragments based on character actions in video, belongs to the technical field of big data processing, and aims to solve the technical problem of how to quickly acquire character actions in a video and give the accurate starting time of each action. The method comprises the following steps: dividing an acquired video stream into a plurality of video clips at equal time intervals, numbering the clips to obtain video frame numbers, and recording the starting time of the video frames in each clip; performing character position recognition based on the video image set corresponding to each video clip to obtain the characters in the video images and the position of each character; performing character action recognition using the video image set corresponding to each video clip and the position of each character to obtain the action category of each character, and numbering each character to obtain the video frame character numbers; and performing feature matching on the video clips based on the character feature vectors, and merging video clips of the same character and the same action category to obtain new video clips.

Description

Method and system for obtaining character action fragments based on character actions in video
Technical Field
The invention relates to the technical field of big data processing, in particular to a method and a system for acquiring a character action fragment based on a character action in a video.
Background
With the progress of AI technology, video action recognition has developed rapidly on the basis of image classification and object detection. For a given video clip, it can be identified whether the clip contains a given character action. This recognition is independent of the length of the video: even if the video contains redundant parts, or the character action appears only partially, recognition can still be performed, although incompletely. However, there is no effective way to identify the exact starting position of the action.
How to quickly acquire the actions of characters in a video and give the accurate starting time of each action is a technical problem to be solved.
Disclosure of Invention
The technical task of the invention is to provide a method and a system for acquiring character action fragments based on character actions in video, so as to solve the technical problem of how to quickly acquire character actions in a video and give the accurate starting time of each action.
In a first aspect, the present invention provides a method for obtaining a character action segment based on a character action in a video, comprising the steps of:
dividing the acquired video stream into a plurality of video clips at equal time intervals, constructing a video image set based on video images corresponding to the video clips for each video clip, numbering the video clips to obtain video frame numbers, and recording the starting time of video frames in the video clips;
for each video clip, carrying out character position identification based on the video image set corresponding to the video clip to obtain the characters in the video image and the character position corresponding to each character;
for each video clip, carrying out character action recognition by using a video image set corresponding to the video clip and the character position of each character to obtain the action type of each character, and numbering each character to obtain the character number of the video frame;
and for all video clips in the video stream, carrying out feature matching on the video clips based on the character feature vectors, combining the video clips with the same character and the same action category to obtain a new video clip, and updating the video frame numbers, the video frame starting time, the video frame character numbers and the action category and the character feature vector corresponding to each video frame number corresponding to the new video clip.
Preferably, for each video clip, the video image set corresponding to the video clip is taken as input, and character position recognition is carried out through a target recognition model constructed based on the YOLO algorithm to obtain the characters and character positions in the video clip.
Preferably, for each video clip, the video image set and the character positions corresponding to the video clip are taken as input, and action recognition is carried out through an action recognition model constructed based on the SlowFast algorithm to obtain the action of each character.
Preferably, for all video clips in the video stream, feature comparison is performed on the video clips based on time sequence, and the video clips with the same person and the same action result are combined, including the following steps:
for the characters in each frame of video image in the video clip, carrying out feature extraction, based on the character position, on the image area at the character position in the video image through a Siamese (twin) network to obtain a multi-dimensional feature vector as the character feature vector, each character corresponding to its own character feature vector;
constructing characterization information based on the video frame number, the video frame start time, the video frame character numbers, the action results corresponding to the video frame character numbers, and the character feature vectors of the video clip;
according to the time sequence of the video stream, carrying out feature comparison on two adjacent video clips in the video stream based on the character feature vectors, and applying the following feature comparison principle: if the comparison result of the character feature vectors of the two adjacent video clips meets the threshold, the two clips are judged to contain the same character and their action results are compared; if the action comparison result also meets the threshold, the two clips are judged to be a continuation of the same action of the same character and are merged into a new video clip, the video frame start time corresponding to the new video clip is updated, and the video frame number, the video frame character numbers, and the actions and character feature vectors corresponding to the video frame character numbers take the values of the earlier video clip in the time sequence;
and carrying out feature comparison between the new video clip and the next video clip that has not yet been compared, based on the character feature vectors, and applying the feature comparison principle until the character feature comparison result does not meet the threshold, or the character feature comparison result meets the threshold but the action comparison result does not meet the threshold.
In a second aspect, the present invention provides a system for obtaining character action fragments based on character actions in video, which obtains character action fragments by the method for obtaining character action fragments based on character actions in video according to any one of the implementations of the first aspect, the system comprising:
the video segment acquisition module is used for dividing an acquired video stream into a plurality of video segments at equal time intervals, constructing a video image set based on video images corresponding to the video segments for each video segment, numbering the video segments to obtain video frame numbers, and recording the starting time of video frames in the video segments;
the character recognition module is used for recognizing the character positions of each video clip based on the video image set corresponding to the video clip to obtain the characters in the video image and the character positions corresponding to each character;
the motion recognition module is used for recognizing the motions of the characters according to the video image set corresponding to the video clips and the character positions of the characters to obtain the motion types of the characters, numbering the characters and obtaining the character numbers of the video frames;
and the video segment merging module is used for carrying out feature matching on the video segments based on the character feature vectors, merging the video segments with the same character and the same action category to obtain a new video segment, and updating the video frame numbers, the video frame starting time, the video frame character numbers and the action category and the character feature vectors corresponding to each video frame number corresponding to the new video segment.
Preferably, for each video clip, the person recognition module is configured to perform person position recognition by using a video image set corresponding to the video clip as input and using a target recognition model constructed based on YOLO algorithm to obtain a person and a person position in the video clip.
Preferably, for each video clip, the action recognition module is configured to perform action recognition by using the video image set and the character positions corresponding to the video clip as input, through an action recognition model constructed based on the SlowFast algorithm, so as to obtain the action of each character.
Preferably, the video clip merging module is configured to perform the following:
for the characters in each frame of video image in the video clip, carrying out feature extraction, based on the character position, on the image area at the character position in the video image through a Siamese (twin) network to obtain a multi-dimensional feature vector as the character feature vector, each character corresponding to its own character feature vector;
constructing characterization information based on the video frame number, the video frame start time, the video frame character numbers, the action results corresponding to the video frame character numbers, and the character feature vectors of the video clip;
according to the time sequence of the video stream, carrying out feature comparison on two adjacent video clips in the video stream based on the character feature vectors, and applying the following feature comparison principle: if the comparison result of the character feature vectors of the two adjacent video clips meets the threshold, the two clips are judged to contain the same character and their action results are compared; if the action comparison result also meets the threshold, the two clips are judged to be a continuation of the same action of the same character and are merged into a new video clip, the video frame start time corresponding to the new video clip is updated, and the video frame number, the video frame character numbers, and the actions and character feature vectors corresponding to the video frame character numbers take the values of the earlier video clip in the time sequence;
and carrying out feature comparison between the new video clip and the next video clip that has not yet been compared, based on the character feature vectors, and applying the feature comparison principle until the character feature comparison result does not meet the threshold, or the character feature comparison result meets the threshold but the action comparison result does not meet the threshold.
The method and system for acquiring character action fragments based on character actions in video according to the present invention have the following advantages: the video stream is segmented at equal time intervals into a plurality of video clips; for each video clip, the video frames are numbered, and the character positions and action categories of the characters in the clip are obtained; character feature vectors are extracted based on the character positions; feature matching is performed on the video clips through the character feature vectors, and video clips of the same character and the same action category are merged into a new video clip. In this way, accurate clip information of a character action is obtained, character actions can be acquired quickly, and the starting time of each action can be given accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a block flow diagram of a method for obtaining character action fragments based on character actions in video according to embodiment 1;
FIG. 2 is a block flow diagram of motion recognition in a method for obtaining a character motion segment based on a character motion in a video according to embodiment 1;
FIG. 3 is a flowchart of character feature comparison and video clip merging in the method for obtaining character action fragments based on character actions in video according to embodiment 1.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples, so that those skilled in the art can better understand the invention and implement it, but the examples are not meant to limit the invention, and the technical features of the embodiments of the invention and the examples can be combined with each other without conflict.
The embodiment of the invention provides a method and a system for acquiring character action fragments based on character actions in video, which are used to solve the technical problem of how to quickly acquire character actions in a video and give the accurate starting time of each action.
Example 1:
The invention discloses a method for acquiring character action fragments based on character actions in video, which comprises the following steps:
s100, segmenting an acquired video stream into a plurality of video clips at equal time intervals, constructing a video image set based on video images corresponding to the video clips for each video clip, numbering the video clips to obtain video frame numbers, and recording the starting time of video frames in the video clips;
s200, for each video clip, carrying out character position identification based on a view image set corresponding to the video clip to obtain characters in the video image and character positions corresponding to each character;
s300, for each video clip, carrying out character action recognition by using a video image set corresponding to the video clip and the character position of each character to obtain the action type of each character, and numbering each character to obtain the character number of the video frame;
s400, for all video clips in the video stream, performing feature matching on the video clips based on the character feature vectors, merging the video clips with the same character and the same action category to obtain a new video clip, and updating the video frame numbers, the video frame start time, the video frame character numbers and the action category and the character feature vectors corresponding to each video frame number corresponding to the new video clip.
In this embodiment, step S100 intercepts small video clips from the video stream at a fixed interval, for example 2 seconds per clip, uniquely numbers the video frame picture set of each clip to obtain the video frame number, and records the start and end positions of the video frames, where the start position is understood as the start time.
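By way of illustration only, the following is a minimal sketch of this clip-segmentation step, assuming OpenCV is available and that the stream can be opened from a file path or URL; the 2-second clip length, the VideoClip container and its field names are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of step S100 (assumed implementation, not the patented method itself):
# split a video stream into equal-length clips, number each clip, and record its start time.
import cv2
from dataclasses import dataclass, field

@dataclass
class VideoClip:                 # hypothetical container for one small clip
    clip_number: int             # "video frame number" (unique clip id)
    start_time: float            # start time of the clip's first frame, in seconds
    frames: list = field(default_factory=list)

def split_stream(source: str, clip_seconds: float = 2.0) -> list:
    cap = cv2.VideoCapture(source)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    frames_per_clip = max(1, int(round(fps * clip_seconds)))
    clips, frame_idx = [], 0
    current = VideoClip(clip_number=0, start_time=0.0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx and frame_idx % frames_per_clip == 0:
            clips.append(current)
            current = VideoClip(clip_number=len(clips), start_time=frame_idx / fps)
        current.frames.append(frame)
        frame_idx += 1
    if current.frames:
        clips.append(current)
    cap.release()
    return clips
```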
In step S200 of this embodiment, for each video clip, a set of video images corresponding to the video clip is taken as input, and person position recognition is performed through a target recognition model constructed based on YOLO algorithm, so as to obtain a person and a person position in the video clip.
For the target recognition model, a model is first built according to the YOLO algorithm; a sample set consisting of video images is then acquired, and character categories and character position information are marked in the video images as label information. Model training and testing are carried out on the target recognition model based on the sample set and the label information to obtain a trained target recognition model, and character position recognition is carried out through the trained target recognition model to obtain the characters and character positions in the video clips.
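As a hedged illustration of the inference side of this step, the sketch below uses the Ultralytics YOLO package with pretrained COCO weights and keeps only the "person" class; the patent only requires a target recognition model built on the YOLO algorithm, so the specific package, weights file and class filtering are assumptions.

```python
# Sketch of person detection for the frames of one clip (assumed ultralytics-based implementation).
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")   # assumed pretrained weights; a custom-trained model could be loaded instead

def detect_persons(frames):
    """Return, per frame, a list of (x1, y1, x2, y2) person bounding boxes."""
    boxes_per_frame = []
    for frame in frames:
        result = detector(frame, verbose=False)[0]
        persons = [tuple(map(int, box.xyxy[0].tolist()))
                   for box in result.boxes
                   if int(box.cls) == 0]     # COCO class 0 is "person"
        boxes_per_frame.append(persons)
    return boxes_per_frame
```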
In step S300 of this embodiment, for each video clip, the video image set and the character positions corresponding to the video clip are taken as input, and action recognition is performed through an action recognition model constructed based on the SlowFast algorithm, so as to obtain the action of each character.
For the action recognition model, a model is first built according to the SlowFast algorithm; a sample set consisting of video images is then acquired, and character categories and character position information are marked in the video images as label information. Model training and testing are carried out on the action recognition model based on the sample set and the label information to obtain a trained action recognition model, and character action recognition is carried out through the trained action recognition model to obtain the action category of each character in the video clips.
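To illustrate how a SlowFast model consumes a clip, the sketch below packs the clip's frames into the slow and fast pathways and runs the pretrained slowfast_r50 checkpoint published through torch.hub by the PyTorchVideo project; the speed ratio alpha, the frame count and resolution, and the use of this particular checkpoint are assumptions, not requirements of the patent.

```python
# Sketch of SlowFast inference for one clip (assumed use of the public slowfast_r50 checkpoint).
import torch

model = torch.hub.load("facebookresearch/pytorchvideo", "slowfast_r50", pretrained=True)
model.eval()

def pack_pathways(frames: torch.Tensor, alpha: int = 4):
    """frames: (C, T, H, W) normalized clip tensor (e.g. T=32, 256x256 for this checkpoint).
    Returns [slow, fast] pathway tensors with a batch dimension added."""
    fast = frames
    slow_idx = torch.linspace(0, frames.shape[1] - 1, frames.shape[1] // alpha).long()
    slow = torch.index_select(frames, 1, slow_idx)   # slow pathway keeps 1/alpha of the frames
    return [slow.unsqueeze(0), fast.unsqueeze(0)]

@torch.no_grad()
def classify_clip(frames: torch.Tensor) -> int:
    """Return the index of the most likely action class for one clip."""
    logits = model(pack_pathways(frames))
    return int(logits.argmax(dim=1))
```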
In this embodiment, step S400 performs character feature comparison and video clip merging; as a specific implementation, it includes the following operations:
(1) For the characters in each frame of video image in the video clip, feature extraction is carried out, based on the character position, on the image area at the character position in the video image through a Siamese (twin) network to obtain a multi-dimensional feature vector as the character feature vector, each character corresponding to its own character feature vector;
(2) Characterization information is constructed based on the video frame number, the video frame start time, the video frame character numbers, the action results corresponding to the video frame character numbers, and the character feature vectors of the video clip;
(3) According to the time sequence of the video stream, feature comparison is carried out on two adjacent video clips in the video stream based on the character feature vectors, and the following feature comparison principle is applied: if the comparison result of the character feature vectors of the two adjacent video clips meets the threshold, the two clips are judged to contain the same character and their action results are compared; if the action comparison result also meets the threshold, the two clips are judged to be a continuation of the same action of the same character and are merged into a new video clip, the video frame start time corresponding to the new video clip is updated, and the video frame number, the video frame character numbers, and the actions and character feature vectors corresponding to the video frame character numbers take the values of the earlier video clip in the time sequence;
(4) Feature comparison is then carried out between the new video clip and the next video clip that has not yet been compared, based on the character feature vectors, and the feature comparison principle is applied until the character feature comparison result does not meet the threshold, or the character feature comparison result meets the threshold but the action comparison result does not meet the threshold.
When the character feature vectors are compared, the character feature vectors are normalized, a multiplication operation is then carried out on the two normalized character feature vectors, and whether the two character feature vectors are identical or similar is judged based on the multiplication result.
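Normalizing both vectors and multiplying them is equivalent to computing their cosine similarity; a minimal sketch, assuming NumPy feature vectors and an illustrative threshold value:

```python
# Sketch of the character feature comparison: L2-normalize both vectors, take the inner
# product (cosine similarity), and compare it against a threshold.
import numpy as np

def same_character(vec_a: np.ndarray, vec_b: np.ndarray, threshold: float = 0.8) -> bool:
    a = vec_a / (np.linalg.norm(vec_a) + 1e-12)
    b = vec_b / (np.linalg.norm(vec_b) + 1e-12)
    return float(np.dot(a, b)) >= threshold   # 0.8 is an assumed threshold, not from the patent
```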
Based on the above operations, in the specific implementation process, the picture area at the human body position in each frame picture of a small video clip is represented as features according to the character position, and a multi-dimensional feature vector is obtained. If there are several characters, several feature vectors are generated, and these are combined with the video frame character numbers produced by the video character action recognition step to obtain the final characterization information, which comprises the video frame number, the video frame start and end positions, the video frame character numbers, and the character action and character feature vector corresponding to each video frame character number.
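The characterization information described above can be held in a simple per-clip record; the field names below are illustrative assumptions rather than terms defined by the patent:

```python
# Sketch of the per-clip characterization information (hypothetical field names).
from dataclasses import dataclass

@dataclass
class ClipRecord:
    frame_number: int          # video frame (clip) number
    start_time: float          # video frame start position, in seconds
    end_time: float            # video frame end position, in seconds
    character_numbers: list    # video frame character numbers
    actions: dict              # character number -> action category
    features: dict             # character number -> feature vector (e.g. numpy array)
```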
Then, according to the time sequence of the video, feature matching is performed on adjacent small video clips. If the character features of two adjacent small video clips are very similar and their action categories are the same, the two small clips can be regarded as a continuation of the same action of the same character; the information of the two clips is merged and the video frame lengths are combined to obtain new video frame start and end positions, while the other fields, namely the video frame number, the video frame character numbers, and the actions and character feature vectors corresponding to the video frame character numbers, take the values of the earlier clip in the time sequence. If the feature comparison shows that the features are dissimilar, the clips are interpreted as actions of different characters, no merging operation is performed, and they are treated as separate, different actions. Note that if the character features match but the action categories differ, the clips are regarded as different actions of the same character, and likewise no merging operation is performed. In this way, video clips of the same action are merged, and the accurate start and end positions of the action clip are obtained.
The above operation is carried out cyclically: the human body features of the new video clip and the next video clip are compared, and clips of the same action of the same character are merged, so as to obtain the start and end positions of the whole action.
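Putting the pieces together, the following is a minimal sketch of the cyclic comparison and merging, reusing the ClipRecord and same_character helpers assumed above; consistent with the rule described above, a merge only extends the end position while the earlier clip's numbering, character numbers, actions and feature vectors are kept.

```python
# Sketch of merging adjacent clips that show the same character performing the same action.
def merge_clips(records: list) -> list:
    """records: ClipRecord list in time order; returns the merged list of clips."""
    merged = []
    for rec in records:
        if merged:
            prev = merged[-1]
            # Continuation: some character in the previous clip matches a character in this
            # clip (feature comparison) and performs the same action category.
            continuation = any(
                same_character(prev.features[p], rec.features[c]) and prev.actions[p] == rec.actions[c]
                for p in prev.character_numbers
                for c in rec.character_numbers
            )
            if continuation:
                prev.end_time = rec.end_time   # extend the clip; earlier fields are retained
                continue
        merged.append(rec)
    return merged
```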
According to the method, the small action clips appearing in the video are stored and their features are generated; the small clips of the same action are then merged through feature matching to generate a large clip containing the whole action, so that the actions of characters in the video can be acquired quickly and the accurate start and end positions of the actions can be given.
Example 2:
The invention discloses a system for acquiring character action fragments based on character actions in video, which comprises a video segment acquisition module, a character recognition module, an action recognition module and a video segment merging module.
The video segment acquisition module is used for dividing an acquired video stream into a plurality of video segments at equal time intervals, constructing a video image set based on video images corresponding to the video segments for each video segment, numbering the video segments to obtain video frame numbers, and recording the starting time of video frames in the video segments.
In this embodiment, the video segment acquisition module is configured to intercept small video clips from the video stream at a fixed interval, for example 2 seconds per clip, to uniquely number the video frame picture set of each clip to obtain the video frame number, and to record the start and end positions of the video frames, where the start position is understood as the start time.
And for each video clip, the person identification module is used for carrying out person position identification based on the video image set corresponding to the video clip, so as to obtain the person in the video image and the person position corresponding to each person.
In this embodiment, for each video clip, the person recognition module is configured to perform person position recognition by using a set of video images corresponding to the video clip as input and using a target recognition model constructed based on YOLO algorithm to obtain a person and a person position in the video clip.
For the target recognition model, the person recognition module is configured to perform the following operations: a model is first built according to the YOLO algorithm; a sample set consisting of video images is then acquired, and character categories and character position information are marked in the video images as tag information; model training and testing are carried out on the target recognition model based on the sample set and the tag information to obtain a trained target recognition model; and character position recognition is carried out through the trained target recognition model to obtain the characters and character positions in the video clips. Alternatively, the person recognition module is configured with a target recognition model already trained through the above operations, and the trained target recognition model is called to recognize the character category and character position.
And for each video clip, the action recognition module is used for carrying out character action recognition according to the video image set corresponding to the video clip and the character position of each character to obtain the action type of each character, and numbering each character to obtain the character number of the video frame.
For each video clip, the action recognition module in this embodiment takes the video image set and the character positions corresponding to the video clip as input, and performs action recognition through an action recognition model constructed based on the SlowFast algorithm to obtain the action of each character.
For the action recognition model, the action recognition module is configured to perform the following operations: a model is first built according to the SlowFast algorithm; a sample set consisting of video images is then acquired, and character categories and character position information are marked in the video images as tag information; model training and testing are carried out on the action recognition model based on the sample set and the tag information to obtain a trained action recognition model; and character action recognition is carried out through the trained action recognition model to obtain the action category of each character in the video clips. Alternatively, the action recognition module is configured with an action recognition model already trained through the above operations, and the trained action recognition model is called to recognize the action category.
And for all video clips in the video stream, the video clip merging module is used for carrying out feature comparison on the video clips based on time sequence, merging the video clips with the same person and the same action result to obtain a new video clip, and updating the video frame number, the video frame person number and the video frame starting time corresponding to the new video clip.
In this embodiment, the video segment merging module is configured to compare character features and merge video segments; as a specific implementation, it is configured to perform the following operations:
(1) For the characters in each frame of video image in the video clip, feature extraction is carried out, based on the character position, on the image area at the character position in the video image through a Siamese (twin) network to obtain a multi-dimensional feature vector as the character feature vector, each character corresponding to its own character feature vector;
(2) Characterization information is constructed based on the video frame number, the video frame start time, the video frame character numbers, the action results corresponding to the video frame character numbers, and the character feature vectors of the video clip;
(3) According to the time sequence of the video stream, feature comparison is carried out on two adjacent video clips in the video stream based on the character feature vectors, and the following feature comparison principle is applied: if the comparison result of the character feature vectors of the two adjacent video clips meets the threshold, the two clips are judged to contain the same character and their action results are compared; if the action comparison result also meets the threshold, the two clips are judged to be a continuation of the same action of the same character and are merged into a new video clip, the video frame start time corresponding to the new video clip is updated, and the video frame number, the video frame character numbers, and the actions and character feature vectors corresponding to the video frame character numbers take the values of the earlier video clip in the time sequence;
(4) Feature comparison is then carried out between the new video clip and the next video clip that has not yet been compared, based on the character feature vectors, and the feature comparison principle is applied until the character feature comparison result does not meet the threshold, or the character feature comparison result meets the threshold but the action comparison result does not meet the threshold.
When the character feature vectors are compared, the character feature vectors are normalized, a multiplication operation is then carried out on the two normalized character feature vectors, and whether the two character feature vectors are identical or similar is judged based on the multiplication result.
Based on the specific operation, the module workflow is as follows:
Firstly, the picture area at the human body position in each frame picture of a small video clip is represented as features according to the character position, and a multi-dimensional feature vector is obtained. If there are several characters, several feature vectors are generated, and these are combined with the video frame character numbers produced by the video character action recognition step to obtain the final characterization information, which comprises the video frame number, the video frame start and end positions, the video frame character numbers, and the character action and character feature vector corresponding to each video frame character number;
Then, according to the time sequence of the video, feature matching is performed on adjacent small video clips. If the character features of two adjacent small video clips are very similar and their action categories are the same, the two small clips can be regarded as a continuation of the same action of the same character; the information of the two clips is merged and the video frame lengths are combined to obtain new video frame start and end positions, while the other fields, namely the video frame number, the video frame character numbers, and the actions and character feature vectors corresponding to the video frame character numbers, take the values of the earlier clip in the time sequence. If the feature comparison shows that the features are dissimilar, the clips are interpreted as actions of different characters, no merging operation is performed, and they are treated as separate, different actions. Note that if the character features match but the action categories differ, the clips are regarded as different actions of the same character, and likewise no merging operation is performed. In this way, video clips of the same action are merged, and the accurate start and end positions of the action clip are obtained.
Finally, the above operation is carried out cyclically: the human body features of the new video clip and the next video clip are compared, and clips of the same action of the same character are merged, so as to obtain the start and end positions of the whole action.
While the invention has been illustrated and described in detail in the drawings and in the preferred embodiments, the invention is not limited to the disclosed embodiments, and it will be appreciated by those skilled in the art that the technical features of the various embodiments described above may be combined to produce further embodiments of the invention, which are also within the scope of the invention.

Claims (8)

1. A method for obtaining character action fragments based on character actions in video, comprising the steps of:
dividing the acquired video stream into a plurality of video clips at equal time intervals, constructing a video image set based on video images corresponding to the video clips for each video clip, numbering the video clips to obtain video frame numbers, and recording the starting time of video frames in the video clips;
for each video clip, carrying out character position identification based on the video image set corresponding to the video clip to obtain characters in the video image and character positions corresponding to each character;
for each video clip, carrying out character action recognition by using a video image set corresponding to the video clip and the character position of each character to obtain the action type of each character, and numbering each character to obtain the character number of the video frame;
and for all video clips in the video stream, carrying out feature matching on the video clips based on the character feature vectors, combining the video clips with the same character and the same action category to obtain a new video clip, and updating the video frame numbers, the video frame starting time, the video frame character numbers and the action category and the character feature vector corresponding to each video frame number corresponding to the new video clip.
2. The method for obtaining a character action segment based on a character action in video according to claim 1, wherein for each video segment, a set of video images corresponding to the video segment is taken as input, and character position recognition is performed through a target recognition model constructed based on YOLO algorithm, so as to obtain a character and a character position in the video segment.
3. The method for obtaining character action fragments based on character actions in video according to claim 1, wherein for each video clip, action recognition is performed through an action recognition model constructed based on the SlowFast algorithm, with the video image set and the character positions corresponding to the video clip as inputs, to obtain the action of each character.
4. A method for obtaining character action fragments based on character actions in video according to any one of claims 1-3, wherein for all video fragments in a video stream, feature comparison is performed on the video fragments based on time sequence, and the video fragments of the same character and having the same action result are combined, comprising the steps of:
for the characters in each frame of video image in the video clip, carrying out feature extraction, based on the character position, on the image area at the character position in the video image through a Siamese (twin) network to obtain a multi-dimensional feature vector as the character feature vector, each character corresponding to its own character feature vector;
constructing characterization information based on the video frame number, the video frame start time, the video frame character numbers, the action results corresponding to the video frame character numbers, and the character feature vectors of the video clip;
according to the time sequence of the video stream, carrying out feature comparison on two adjacent video clips in the video stream based on the character feature vectors, and applying the following feature comparison principle: if the comparison result of the character feature vectors of the two adjacent video clips meets the threshold, the two clips are judged to contain the same character and their action results are compared; if the action comparison result also meets the threshold, the two clips are judged to be a continuation of the same action of the same character and are merged into a new video clip, the video frame start time corresponding to the new video clip is updated, and the video frame number, the video frame character numbers, and the actions and character feature vectors corresponding to the video frame character numbers take the values of the earlier video clip in the time sequence;
and carrying out feature comparison between the new video clip and the next video clip that has not yet been compared, based on the character feature vectors, and applying the feature comparison principle until the character feature comparison result does not meet the threshold, or the character feature comparison result meets the threshold but the action comparison result does not meet the threshold.
5. A system for obtaining a character action fragment based on a character action in a video, wherein the system comprises:
the video segment acquisition module is used for dividing an acquired video stream into a plurality of video segments at equal time intervals, constructing a video image set based on video images corresponding to the video segments for each video segment, numbering the video segments to obtain video frame numbers, and recording the starting time of video frames in the video segments;
the character recognition module is used for recognizing the character positions of each video clip based on the video image set corresponding to the video clip to obtain characters in the video image and the character positions corresponding to each character;
the motion recognition module is used for recognizing the motions of the characters according to the video image set corresponding to the video clips and the character positions of the characters to obtain the motion types of the characters, numbering the characters and obtaining the character numbers of the video frames;
and the video segment merging module is used for carrying out feature matching on the video segments based on the character feature vectors, merging the video segments with the same character and the same action category to obtain a new video segment, and updating the video frame numbers, the video frame starting time, the video frame character numbers and the action category and the character feature vectors corresponding to each video frame number corresponding to the new video segment.
6. The system for capturing segments of human actions based on human actions in video of claim 5 wherein for each video segment, said human recognition module is configured to perform human location recognition on a target recognition model constructed based on YOLO algorithm with a set of video images corresponding to said video segment as input to obtain a human and a human location in the video segment.
7. The system for obtaining character action fragments based on character actions in video according to claim 5, wherein for each video clip, the action recognition module is configured to perform action recognition through an action recognition model constructed based on the SlowFast algorithm, with the video image set and the character positions corresponding to the video clip as inputs, to obtain the action of each character.
8. The system for capturing character action segments based on character actions in video according to any one of claims 5-7, wherein the video segment merging module is configured to perform the following:
for the characters in each frame of video image in the video clip, carrying out feature extraction, based on the character position, on the image area at the character position in the video image through a Siamese (twin) network to obtain a multi-dimensional feature vector as the character feature vector, each character corresponding to its own character feature vector;
constructing characterization information based on the video frame number, the video frame start time, the video frame character numbers, the action results corresponding to the video frame character numbers, and the character feature vectors of the video clip;
according to the time sequence of the video stream, carrying out feature comparison on two adjacent video clips in the video stream based on the character feature vectors, and applying the following feature comparison principle: if the comparison result of the character feature vectors of the two adjacent video clips meets the threshold, the two clips are judged to contain the same character and their action results are compared; if the action comparison result also meets the threshold, the two clips are judged to be a continuation of the same action of the same character and are merged into a new video clip, the video frame start time corresponding to the new video clip is updated, and the video frame number, the video frame character numbers, and the actions and character feature vectors corresponding to the video frame character numbers take the values of the earlier video clip in the time sequence;
and carrying out feature comparison between the new video clip and the next video clip that has not yet been compared, based on the character feature vectors, and applying the feature comparison principle until the character feature comparison result does not meet the threshold, or the character feature comparison result meets the threshold but the action comparison result does not meet the threshold.
CN202310395288.6A 2023-04-10 2023-04-10 Method and system for obtaining character action fragments based on character actions in video Pending CN116469167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310395288.6A CN116469167A (en) 2023-04-10 2023-04-10 Method and system for obtaining character action fragments based on character actions in video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310395288.6A CN116469167A (en) 2023-04-10 2023-04-10 Method and system for obtaining character action fragments based on character actions in video

Publications (1)

Publication Number Publication Date
CN116469167A true CN116469167A (en) 2023-07-21

Family

ID=87183725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310395288.6A Pending CN116469167A (en) 2023-04-10 2023-04-10 Method and system for obtaining character action fragments based on character actions in video

Country Status (1)

Country Link
CN (1) CN116469167A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117676245A (en) * 2024-01-31 2024-03-08 深圳市积加创新技术有限公司 Context video generation method and device
CN117676245B (en) * 2024-01-31 2024-06-11 深圳市积加创新技术有限公司 Context video generation method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination