CN113515998A - Video data processing method and device and readable storage medium

Video data processing method and device and readable storage medium

Info

Publication number
CN113515998A
Authority
CN
China
Prior art keywords
motion
action
segment
event
video frame
Prior art date
Legal status
Pending
Application number
CN202011580149.3A
Other languages
Chinese (zh)
Inventor
赵天昊
袁微
田思达
楼剑
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011580149.3A
Publication of CN113515998A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234345Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment

Abstract

The application discloses a video data processing method, a video data processing device, and a readable storage medium. The method includes: acquiring a motion event segment for a target object in a video to be processed; if the motion event segment includes a continuous motion action segment, identifying motion state segmentation points in the continuous motion action segment and splitting the continuous motion action segment at those points to obtain at least two individual motion action segments; and tracking a key part of the target object in each individual motion action segment to obtain the key part position information corresponding to each individual motion action segment, and identifying a first event label corresponding to the continuous motion action segment according to the key part position information, where the first event label is used to record the action type corresponding to each individual motion action segment in the motion script information associated with the target object. The method and device can improve the accuracy of action type recognition and the generation efficiency of motion script information.

Description

Video data processing method and device and readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a video data processing method, device, and readable storage medium.
Background
With the rapid development of internet technology and the improvement of computing power, the performance of video processing technology has improved greatly.
Video processing technology can classify, detect, and analyze videos; it is a very challenging subject in the field of computer vision and has received wide attention in both academia and industry. Existing artificial intelligence recognition technology can only perform behavior recognition on objects in a static state and performs poorly when recognizing behaviors in a motion state. In a motion scene (for example, a figure skating program), it is therefore difficult for existing models to recognize the types of continuous motion actions, so the corresponding action types can only be identified manually in order to record the motion script information. As a result, the recognition accuracy and efficiency for action types are low, a large amount of manpower is consumed, and human error is possible, so the generation efficiency of the motion script information is low.
Disclosure of Invention
The embodiment of the application provides a video data processing method, a device and a readable storage medium, which can improve the accuracy of identifying action types and improve the generation efficiency of motion script information.
An embodiment of the present application provides a video data processing method, including:
acquiring a motion event segment aiming at a target object in a video to be processed;
if the motion event segment comprises a continuous motion action segment, identifying a motion state segmentation point in the continuous motion action segment, and splitting the continuous motion action segment according to the motion state segmentation point to obtain at least two independent motion action segments;
respectively tracking the key part of the target object in each single motion action segment to obtain the key part position information corresponding to each single motion action segment, and identifying a first event label corresponding to the continuous motion action segment according to the key part position information; the first event label is used for recording the action type corresponding to each individual motion action segment in the motion script information associated with the target object.
An aspect of an embodiment of the present application provides a video data processing apparatus, including:
the acquisition module is used for acquiring a motion event segment aiming at a target object in a video to be processed;
the splitting module is used for identifying a motion state splitting point in the continuous motion action segment if the motion event segment comprises the continuous motion action segment, and splitting the continuous motion action segment according to the motion state splitting point to obtain at least two independent motion action segments;
the first identification module is used for tracking the key part of the target object in each single motion action segment respectively to obtain the key part position information corresponding to each single motion action segment respectively, and identifying a first event label corresponding to the continuous motion action segment according to the key part position information; the first event label is used for recording the action type corresponding to each individual motion action segment in the motion script information associated with the target object.
Wherein, the acquisition module includes:
the frame extracting unit is used for extracting frames of a video to be processed containing a target object to obtain a first video frame sequence;
the first feature extraction unit is used for inputting the first video frame sequence into a feature extraction network to obtain picture features corresponding to the first video frame sequence;
the temporal action segmentation unit is used for inputting the picture features into an action segmentation network and acquiring the timing feature corresponding to the first video frame sequence, where the timing feature comprises the start-stop timestamp corresponding to a motion event in the first video frame sequence; and for extracting the motion event segment for that motion event from the first video frame sequence according to the start-stop timestamp corresponding to the motion event in the timing feature.
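As an illustration of the acquisition module described above, the following minimal Python sketch runs a frame sequence through a feature extraction network and a temporal action segmentation network to obtain start-stop frame indices for motion events. The function name, the network objects, and the returned format are illustrative assumptions rather than elements of the disclosure.

```python
# Minimal sketch of the acquisition module: picture features are extracted per
# frame, and a temporal action segmentation network returns start/stop pairs.
from typing import List, Tuple
import torch

def acquire_motion_event_segments(
    frames: torch.Tensor,               # first video frame sequence, shape (T, C, H, W)
    feature_net: torch.nn.Module,       # feature extraction network (e.g. an RGB CNN)
    segmentation_net: torch.nn.Module,  # temporal action segmentation network
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices for each detected motion event."""
    with torch.no_grad():
        picture_features = feature_net(frames)           # per-frame picture features
        start_stop = segmentation_net(picture_features)  # assumed to yield (start, end) pairs
    return [(int(s), int(e)) for s, e in start_stop]
```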
Wherein, the continuous motion action segment is a continuous jump action segment;
a splitting module comprising:
the state dividing unit is used for determining the motion state corresponding to each video frame in the continuous jump motion segment; dividing the continuous jump action segment into at least two flight video frame sequences and at least three landing video frame sequences according to the motion state; the motion state refers to the state of the target object in the motion process;
the splitting unit is used for determining the middle moment of a target landing video frame sequence as a motion state segmentation point, and splitting the continuous jump action segment according to the motion state segmentation points to obtain at least two individual jump action segments; the target landing video frame sequences are the video frame sequences, among the at least three landing video frame sequences, other than the first landing video frame sequence and the last landing video frame sequence.
Wherein the motion state comprises a flight state and a landing state; the continuous jump action segment comprises a video frame T_i, where i is a positive integer and i is less than the total number of video frames in the continuous jump action segment;
a state division unit comprising:
a flight sequence determining subunit, used for obtaining consecutive video frames T_i to T_{i+n} from the continuous jump action segment, where n is a positive integer and i + n is less than or equal to the total number of video frames in the continuous jump action segment; if the motion states corresponding to the consecutive video frames T_i to T_{i+n-1} are all the flight state, the motion states corresponding to the video frame T_{i+n} and the video frame T_{i-1} are both the landing state, and the number of video frames from T_i to T_{i+n-1} is greater than the flight frame number threshold, determining the consecutive video frames T_i to T_{i+n-1} as one flight video frame sequence; at least two flight video frame sequences exist in the continuous jump action segment;
and the landing sequence determining subunit is used for determining the video frame sequences in the continuous jump action segment other than the at least two flight video frame sequences as the at least three landing video frame sequences.
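The splitting logic above can be illustrated with the following sketch, which assumes a per-frame motion state ("air" or "ground") supplied by an upstream classifier and a hypothetical airborne-frame threshold; the interior landing gaps between airborne runs yield split points at their middle frames.

```python
# Hedged sketch of splitting a continuous jump segment at the midpoints of the
# interior landing sequences; "air"/"ground" labels and the threshold are
# illustrative assumptions.
from typing import List

def split_points(states: List[str], min_air_frames: int = 5) -> List[int]:
    """Return frame indices at which a continuous jump segment is split."""
    # 1. collect airborne runs longer than the threshold
    air_runs, start = [], None
    for i, s in enumerate(states + ["ground"]):      # sentinel closes a trailing run
        if s == "air" and start is None:
            start = i
        elif s != "air" and start is not None:
            if i - start > min_air_frames:
                air_runs.append((start, i - 1))
            start = None
    # 2. landing sequences are the gaps between airborne runs; interior gaps
    #    (not the first or last landing sequence) give a split point at their midpoint
    points = []
    for (_, end_a), (start_b, _) in zip(air_runs[:-1], air_runs[1:]):
        points.append((end_a + 1 + start_b - 1) // 2)  # middle frame of the landing gap
    return points
```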
The first event label further comprises the action evaluation quality and the number of flight rotations;
the video data processing apparatus further includes:
the first frame extracting module is used for extracting frames of each single jumping motion segment respectively to obtain at least two second video frame sequences;
the first action understanding module is used for inputting the at least two second video frame sequences into an action understanding network, and outputting, through the action understanding network, the action evaluation quality and the number of flight rotations corresponding to the continuous jump action segment;
the first identification module is specifically used for tracking the key part of the target object in each individual jump action segment to obtain the key part position information corresponding to each individual jump action segment, and identifying the action type corresponding to each individual jump action segment according to the key part position information; and is specifically used for generating the first event label corresponding to the continuous jump action segment according to the action evaluation quality, the number of flight rotations, and the action type.
Wherein the at least two second video frame sequences comprise a second video frame sequence S, and the action understanding network comprises m non-local components, m being a positive integer;
a first action understanding module comprising:
the second feature extraction unit is used for splitting the second video frame sequence S into m subsequences, and performing feature extraction on the m subsequences through m non-local components to obtain m intermediate action features; respectively performing one-dimensional convolution on the m intermediate motion characteristics to obtain m one-dimensional convolution motion characteristics;
the individual output unit is used for combining the m one-dimensional convolution action features through a fully connected layer in the action understanding network to obtain a target action feature, and outputting, through a classification layer in the action understanding network, the individual action evaluation quality and the individual number of flight rotations corresponding to the target action feature;
and the merging output unit is used for generating, when the individual action evaluation quality and the individual number of flight rotations corresponding to each of the at least two second video frame sequences have been obtained, the action evaluation quality and the number of flight rotations corresponding to the continuous jump action segment according to the individual action evaluation quality and the individual number of flight rotations corresponding to each second video frame sequence.
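A structural sketch (PyTorch) of such a multi-task action understanding network is given below: the second video frame sequence is split into m subsequences, each branch applies a non-local style component followed by a one-dimensional convolution, a fully connected layer fuses the branch features, and two heads output the individual action evaluation quality and the individual number of flight rotations. The use of a transformer encoder layer as a stand-in for the non-local component, and all layer sizes, are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ActionUnderstandingNet(nn.Module):
    def __init__(self, m: int, feat_dim: int, num_quality: int, num_rotation: int):
        super().__init__()
        # feat_dim is assumed divisible by the number of attention heads,
        # and the input frame count divisible by m
        self.m = m
        self.non_local = nn.ModuleList(
            nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True)  # stand-in for a non-local component
            for _ in range(m)
        )
        self.conv1d = nn.ModuleList(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1) for _ in range(m)
        )
        self.fc = nn.Linear(m * feat_dim, feat_dim)             # fully connected fusion of the m branches
        self.quality_head = nn.Linear(feat_dim, num_quality)    # individual action evaluation quality
        self.rotation_head = nn.Linear(feat_dim, num_rotation)  # individual number of flight rotations

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, feat_dim) per-frame features of one second video frame sequence
        branch_feats = []
        for sub, nl, conv in zip(torch.chunk(x, self.m, dim=1), self.non_local, self.conv1d):
            h = nl(sub)                                  # intermediate action feature
            h = conv(h.transpose(1, 2)).mean(dim=2)      # one-dimensional convolution + temporal pooling
            branch_feats.append(h)
        fused = self.fc(torch.cat(branch_feats, dim=1))  # target action feature
        return self.quality_head(fused), self.rotation_head(fused)
```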
Wherein, the video data processing device further comprises:
the model training module is used for labeling continuous jump action sample segments, uniformly extracting K video frames from the labeled continuous jump action sample segments, taking P consecutive video frames from the K video frames as an input sequence, inputting the input sequence into an initial action understanding network, and outputting, through the initial action understanding network, the predicted action evaluation quality and the predicted number of flight rotations corresponding to the continuous jump action sample segments; P is equal to the length of the model input data corresponding to the initial action understanding network;
and the model correction module is used for generating a quality loss function according to the predicted action evaluation quality, generating an action loss function according to the predicted number of flight rotations, generating a target loss function according to the quality loss function and the action loss function, and correcting model parameters in the initial action understanding network through the target loss function to obtain the action understanding network.
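The combined training objective described above can be sketched as follows, under the assumption that the action evaluation quality is treated as a regression target and the number of flight rotations as a classification target; the weighting factor alpha is hypothetical.

```python
# Sketch of the target loss: a quality loss plus an action loss.
import torch.nn.functional as F

def target_loss(pred_quality, true_quality, pred_rotation_logits, true_rotation, alpha: float = 1.0):
    quality_loss = F.mse_loss(pred_quality, true_quality)               # quality loss (regression assumption)
    action_loss = F.cross_entropy(pred_rotation_logits, true_rotation)  # action loss (classification assumption)
    return quality_loss + alpha * action_loss                           # target loss
```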
There are at least two pieces of key part position information; the at least two single motion action segments comprise a single motion action segment H;
a first identification module comprising:
the position information acquisition unit is used for tracking the key part of the target object in the single motion action segment H to obtain the at least two pieces of key part position information corresponding to the single motion action segment H;
the action recognition unit is used for predicting the action type prediction probability corresponding to each of the at least two pieces of key part position information, averaging the at least two predicted action type prediction probabilities to obtain an average action type prediction probability, and recognizing the action type corresponding to the single motion action segment H according to the average action type prediction probability;
and the label generating unit is used for generating, when the action type corresponding to each of the at least two single motion action segments has been obtained, the first event label corresponding to the continuous motion action segment according to the action type corresponding to each single motion action segment.
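A minimal sketch of this probability-averaging step is shown below; each classifier is an assumed callable that returns a probability vector over action types for one key part position representation.

```python
import numpy as np

def classify_action(position_features, classifiers):
    """Average the per-representation action-type probabilities and take the arg-max."""
    # each classifier is assumed to return a probability vector over action types
    probs = [np.asarray(clf(feat)) for clf, feat in zip(classifiers, position_features)]
    avg_prob = np.mean(probs, axis=0)   # average action type prediction probability
    return int(np.argmax(avg_prob))     # index of the recognised action type
```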
The at least two pieces of key part position information comprise global key part position information, local key part position information, fusion key part position information, global displacement key part position information and local displacement key part position information;
a position information acquisition unit including:
the first position information acquisition subunit is used for tracking the key part of the target object in the single motion action segment H to obtain the global coordinate of the key part as the global key part position information corresponding to the single motion action segment H;
the second position information acquisition subunit is used for acquiring local coordinates of the key part according to the global key part position information, and the local coordinates are used as local key part position information corresponding to the single motion action segment H;
a third position information obtaining subunit, configured to fuse the global coordinates corresponding to the target key position in the global key position information and the local coordinates corresponding to the key positions other than the target key position in the local key position information, to obtain fused key position information corresponding to the individual motion action segment H;
a fourth position information obtaining subunit, configured to use coordinate displacement of the global key position information between every two adjacent video frames in the individual motion segment H as global displacement key position information corresponding to the individual motion segment H;
and the fifth position information acquiring subunit is configured to use coordinate displacement of the local key part position information between every two adjacent video frames in the single motion action segment H as the local displacement key part position information corresponding to the single motion action segment H.
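The five representations above can be illustrated with the following sketch over tracked joint coordinates; choosing the first listed joint as both the target key part and the origin of the local coordinates is an assumption made only for illustration.

```python
import numpy as np

def keypart_features(global_xy: np.ndarray, target_idx: int = 0):
    """global_xy: (frames, joints, 2) global key part coordinates per video frame."""
    local_xy = global_xy - global_xy[:, target_idx:target_idx + 1, :]  # local key part positions
    fused = local_xy.copy()
    fused[:, target_idx, :] = global_xy[:, target_idx, :]              # global target part + local other parts
    global_disp = np.diff(global_xy, axis=0)                           # global displacement between adjacent frames
    local_disp = np.diff(local_xy, axis=0)                             # local displacement between adjacent frames
    return global_xy, local_xy, fused, global_disp, local_disp
```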
Wherein, the video data processing device further comprises:
the second frame extracting module is used for extracting frames of the rotation action segments to obtain a third video frame sequence if the motion event segments comprise the rotation action segments;
the second action understanding module is used for inputting the third video frame sequence into an action understanding network and identifying a second event label corresponding to the rotating action segment through a non-local component in the action understanding network; and the second event label is used for recording the action type and action evaluation quality corresponding to the rotating action segment in the motion script information associated with the target object.
Wherein, the video data processing device further comprises:
the third action understanding module is used for, if the motion event segment comprises a non-continuous jump action segment, extracting frames from the non-continuous jump action segment to obtain a fourth video frame sequence, and inputting the fourth video frame sequence into the action understanding network to obtain the action evaluation quality and the number of flight rotations corresponding to the non-continuous jump action segment;
the second identification module is used for tracking the key part of the target object in the non-continuous jump action segment to obtain the key part position information corresponding to the non-continuous jump action segment, and identifying the action type corresponding to the non-continuous jump action segment according to the key part position information; and is used for determining the action evaluation quality, the number of flight rotations, and the action type corresponding to the non-continuous jump action segment as a third event label corresponding to the non-continuous jump action segment; the third event label is used for recording, in the motion script information associated with the target object, the action evaluation quality, the number of flight rotations, and the action type corresponding to the non-continuous jump action segment.
Wherein, the video data processing device further comprises:
the first tag determining module is used for taking a start-stop timestamp corresponding to the entry event segment as a fourth event tag corresponding to the entry event segment if the motion event segment comprises the entry event segment; the fourth event tag is used for recording a start time stamp and an end time stamp corresponding to the entry event segment in the motion script information associated with the target object;
the second tag determining module is used for taking a start-stop timestamp corresponding to an exit event segment as a fifth event tag corresponding to the exit event segment if the motion event segment comprises the exit event segment; the fifth event tag is used for recording a start timestamp and an end timestamp corresponding to the exit event segment in the motion script information associated with the target object;
the third tag determining module is used for taking a start-stop timestamp corresponding to the score display event segment as a sixth event tag corresponding to the score display event segment if the sports event segment comprises the score display event segment; the sixth event tag is used for recording a start time stamp and an end time stamp corresponding to the score display event segment in the motion script information associated with the target object.
Wherein, the video data processing device further comprises:
the system comprises a script information generating module, a script information generating module and a script generating module, wherein the script information generating module is used for sequentially arranging first event labels corresponding to continuous motion action segments in motion script information associated with a target object according to the time sequence of start and stop timestamps;
the important segment intercepting module is used for intercepting important event segments from the video to be processed according to the corresponding start-stop timestamps of the motion events and splicing the important event segments into a video collection; the motion event segment belongs to the important event segment.
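A minimal sketch of these two modules is given below; the event label fields and the cut/concatenate helpers are hypothetical placeholders for whatever video toolkit is used.

```python
def build_script_and_album(event_labels, cut_clip, concat_clips):
    """event_labels: list of dicts with 'start', 'end', 'label'.
    cut_clip / concat_clips: assumed helpers from the video toolkit in use."""
    # arrange the event labels chronologically by their start-stop timestamps
    script = sorted(event_labels, key=lambda e: e["start"])
    # cut each important event span and splice the clips into a video collection
    clips = [cut_clip(e["start"], e["end"]) for e in script]
    album = concat_clips(clips)
    return script, album
```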
An aspect of an embodiment of the present application provides a computer device, including: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing a computer program, and the processor is used for calling the computer program to execute the method in the embodiment of the present application.
An aspect of the present embodiment provides a computer-readable storage medium, in which a computer program is stored, where the computer program is adapted to be loaded by a processor and to execute the method in the present embodiment.
An aspect of the embodiments of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, and a processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the method in the embodiments of the present application.
According to the embodiments of the application, motion event segments for a target object can be automatically extracted from a video to be processed, continuous motion action segments in the motion event segments can be split, and the key part position information of each individual motion action segment can be extracted; the action type corresponding to each individual motion action segment can then be identified based on the key part position information, so that continuous motion action segments are classified automatically and accurately, which improves the recognition accuracy and efficiency of action types. In addition, the motion script information associated with the target object can be generated automatically according to the event labels corresponding to the continuous motion action segments, avoiding the large amount of manpower otherwise needed to register the motion script information and improving the generation efficiency of the motion script information.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a system architecture diagram according to an embodiment of the present application;
fig. 2 is a schematic view of a video data processing scene provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a video data processing method according to an embodiment of the present application;
figs. 4a to 4c are schematic views of a video data processing scene provided by an embodiment of the present application;
fig. 5 is a schematic flowchart of a video data processing method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a three-dimensional human skeleton according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a video data processing method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an action understanding network provided in an embodiment of the present application;
figs. 9a to 9b are schematic flow charts illustrating a video data processing method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) technology is a science that studies how to make machines "see"; it uses cameras and computers, instead of human eyes, to identify, track, and measure targets, and performs further image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include data processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence, deep learning technology and other technologies, and the specific process is explained by the following embodiment.
Please refer to fig. 1, which is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 1, the system architecture may include a server 10a and a terminal cluster, and the terminal cluster may include terminal device 10b, terminal device 10c, ..., and terminal device 10d. Communication connections may exist among the terminal devices in the cluster; for example, terminal device 10b may be connected to terminal device 10c and to terminal device 10d. Meanwhile, any terminal device in the cluster may have a communication connection with the server 10a; for example, a communication connection exists between terminal device 10b and the server 10a. The connection manner is not limited: the connection may be direct or indirect, wired or wireless, or established in other manners, which is not limited in this application.
It should be understood that each terminal device in the terminal cluster shown in fig. 1 may be installed with an application client; when the application client runs on a terminal device, it may exchange data with the server 10a shown in fig. 1. The application client may be a multimedia client (e.g., a video client), a social client, an entertainment client (e.g., a game client), an education client, a live-streaming client, or any other client with a frame-sequence (e.g., frame animation) loading and playing function. The application client may be an independent client, or an embedded sub-client integrated in another client (e.g., a multimedia client, a social client, or an education client), which is not limited herein. The server 10a provides services for the terminal cluster through its communication function. When a terminal device (terminal device 10b, 10c, or 10d) acquires a video segment A and needs to process it, for example, to clip important event segments (such as highlight segments) from video segment A and to classify and label them, the terminal device may send video segment A to the server 10a through the application client. After receiving video segment A, the server 10a may acquire a motion event segment B for a target object from video segment A (e.g., a segment in which a player performs a jumping or rotating action in a figure skating game video) and detect the motion event segment B. If the server detects that motion event segment B includes a continuous motion action segment C, the server 10a may further identify motion state segmentation points in segment C and split segment C at those points to obtain at least two individual motion action segments. The server 10a may then obtain the key part position information corresponding to each individual motion action segment and identify a first event label for segment C according to that information. Subsequently, the server 10a may record, according to the first event label, the action type corresponding to each individual motion action segment in the motion script information associated with the target object, and may also cut important event segments from video segment A according to the start-stop timestamp corresponding to motion event segment B and splice them into a video collection. The motion script information and the video collection may then be returned to the application client of the terminal device, which displays them on its screen after receiving them. The motion script information may further include information such as the start timestamp, end timestamp, and action evaluation quality corresponding to motion event segment B; only the identification of the action type is taken as an example here.
It is understood that the method provided by the embodiments of the present application can be executed by a computer device, including but not limited to a terminal device or a server; the server 10a in the embodiment of the present application may be such a computer device. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal device may be a smart phone, tablet computer, notebook computer, desktop computer, palmtop computer, mobile internet device (MID), wearable device (e.g., smart watch or smart bracelet), smart television, smart vehicle, or the like. The numbers of terminal devices and servers are not limited, and terminal devices and servers may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
In the following, the processing of a video clip of a figure skating game is taken as an example. Please refer to fig. 2, which is a scene diagram of video data processing provided in an embodiment of the present application. As shown in fig. 2, the computer device implementing this video data processing scenario may include modules for feature extraction, event classification, multi-task action understanding, and skeleton structure classification; these modules may run on the server 10a shown in fig. 1 or on the terminal device, which is not limited herein. As shown in fig. 2, the end user may upload a locally stored video segment 20a of a figure skating game through an application client on the terminal device, or may send the resource address (e.g., a video link) corresponding to video segment 20a through the application client. After receiving the video processing request sent by the terminal device, the computer device may call a relevant interface (such as a web interface, or another form, which is not limited in this embodiment) to pass video segment 20a to a video processing model that contains the feature extraction, event classification, multi-task action understanding, and skeleton structure classification modules mentioned above. The video processing model first extracts frames from video segment 20a to obtain a first video frame sequence 20b, then extracts the picture features corresponding to each video frame in the first video frame sequence 20b to form a picture feature sequence 20c, and then obtains the timing features corresponding to the first video frame sequence 20b from the picture feature sequence 20c. The timing features may include the start-stop time sequence corresponding to the motion events, that is, the sequence formed by the start-stop timestamps of each motion event, where a start-stop timestamp consists of a start timestamp and an end timestamp. According to the picture features and the timing features, the first video frame sequence 20b may be split into several shorter video frame sequences, and the event type corresponding to each shorter video frame sequence may be obtained, so that the video frame sequences whose event type is a motion event can be extracted from the first video frame sequence 20b as motion event segments. In addition, according to the obtained start-stop timestamps corresponding to the motion events, corresponding segments can be cut from video segment 20a; for ease of understanding and distinction, the cut segments are referred to as important event segments, and they can be spliced into a video collection 20f and returned to the terminal device of the end user.
In fact, since the motion event segments are extracted from the first video frame sequence 20b obtained by frame extraction, the motion event segments can be regarded as a subset of the important event segments; that is, "important event" has the same meaning here as "motion event". An important event (or motion event) may be an event whose picture contains the target object, or an event that has special meaning even though the target object does not appear in the picture. It is understood that the specific meaning of an important event may differ between scenes; for example, in a figure skating game, important events may include the entry of a player, the exit of the player, non-continuous jumping, rotation, score display, and the like.
Using a video database containing massive videos, the computer device can train deep neural networks to generate the feature extraction, event classification, multi-task action understanding, and skeleton structure classification modules; the specific generation process is described in the subsequent embodiments.
After the motion event segments and their corresponding event types are obtained through the above steps, the motion event segments need to be further filtered and detected because more detailed information (such as the action type) must be acquired. For example, in a figure skating game, action recognition and classification for continuous jump action segments, non-continuous jump action segments, and rotation action segments clearly have greater value and significance, so entry segments, exit segments, and score display segments do not need further detection. Because continuous jump action segments are special, namely the time interval between the successive jump actions in a continuous jump action segment is short, the computer device may, in order to improve the accuracy and efficiency of action recognition, first split each continuous jump action segment into several individual jump action segments. Frames are then extracted from the individual jump action segments, the non-continuous jump action segments, and the rotation action segments to obtain the corresponding video frame sequences. As shown in fig. 2, assuming that 10 video frame sequences are obtained, namely video frame sequence B1, video frame sequence B2, ..., video frame sequence B9, and video frame sequence B10, these 10 video frame sequences are input into the multi-task action understanding module, and the event tag corresponding to each video frame sequence can be obtained, as shown by event tag 20d in fig. 2. Event tag 20d specifically includes the action type and the action evaluation quality; in a figure skating game, for individual jump action segments and non-continuous jump action segments, event tag 20d may further include the corresponding number of flight rotations.
However, since it is difficult to accurately classify jump actions in a figure skating game, the embodiment of the present application uses a skeleton structure classification module to classify jump actions (including individual jump action segments and non-continuous jump action segments) separately, to improve classification accuracy. Assume that, among the 10 video frame sequences, video frame sequence B1, video frame sequence B2, and video frame sequence B3 are all individual jump action segments obtained by splitting a certain continuous jump action segment. The following takes the jump action classification of video frame sequence B1 as an example; the process for the other video frame sequences is the same. The specific process is as follows: the computer device tracks the key parts of the target object (i.e., the competing player in the continuous jump action segment) in video frame sequence B1 to obtain at least two pieces of key part position information, predicts the action type prediction probability corresponding to each piece of key part position information, and averages the at least two predicted action type prediction probabilities to obtain an average action type prediction probability. The jump action type corresponding to video frame sequence B1 can then be identified from the average action type prediction probability and used as part of the event tag corresponding to video frame sequence B1. It is understood that the jump action classification of the video frame sequences obtained by frame extraction from non-continuous jump action segments can also follow the process used for video frame sequence B1. In addition, after frames are extracted from video frame sequence B1, a second video frame sequence is obtained; this second video frame sequence is input into the multi-task action understanding module, which outputs the action evaluation quality and the number of flight rotations corresponding to video frame sequence B1. The corresponding action type, action evaluation quality, and number of flight rotations can then together serve as the event tag corresponding to video frame sequence B1. The event tags corresponding to the other video frame sequences are generated in the same way as for video frame sequence B1 and will not be described again here.
Further, the computer device may generate a script list 20e associated with the competing player according to the event tags corresponding to the respective motion events. Specifically, the event tags may be arranged in the script list 20e in chronological order of their start-stop timestamps. For the entry event, the exit event, and the score display event, the corresponding event tag may simply be the start timestamp and end timestamp of the event; as shown in fig. 2, for example, "t1-t2 player entry" may be displayed in script list 20e, indicating that the player enters between time t1 and time t2 of video segment 20a. As can be seen from script list 20e, the action evaluation quality in a figure skating game mainly consists of two parts: a base value (BV) and an execution score (GOE). It should be noted that, for a continuous jump action segment, the final event tag is obtained by collecting and fusing the event tags of the individual jump action segments obtained by the earlier splitting. For example, the final base value of the continuous jump action segment can be obtained by adding the base values of each individual jump action segment, and the final execution score can be obtained by adding the execution scores of each individual jump action segment. The final number of flight rotations can be obtained by summing the numbers of flight rotations of each individual jump action segment; optionally, the numbers of flight rotations of the individual jump action segments can instead be arranged in chronological order to form an array as the final number of flight rotations of the continuous jump action segment. For example, assuming the number of flight rotations of video frame sequence B1 is C1, of video frame sequence B2 is C2, and of video frame sequence B3 is C3, and, as shown by script list 20e in fig. 2, the three video frame sequences all belong to a continuous jump action segment from time t3 to time t4, the corresponding final number of flight rotations can be shown as "C1+C2+C3". The final base value, final execution score, and final number of flight rotations can then be output as the event tag corresponding to the continuous jump action segment.
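The label fusion described above can be sketched as follows, assuming each per-jump label is a dictionary carrying a base value, an execution score, a number of flight rotations, and an action type.

```python
# Hedged sketch of fusing per-jump labels into the final label of a continuous
# jump segment; the dictionary keys are illustrative assumptions.
def fuse_jump_labels(jump_labels):
    return {
        "base_value": sum(j["base_value"] for j in jump_labels),            # summed BV
        "execution": sum(j["execution"] for j in jump_labels),              # summed GOE
        "rotations": "+".join(str(j["rotations"]) for j in jump_labels),    # e.g. "C1+C2+C3"
        "action_types": [j["action_type"] for j in jump_labels],            # in temporal order
    }
```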
Finally, the computer device can package the script list 20e and the video collection 20f obtained by splicing the important event segments, and return the corresponding resource address to the terminal device of the end user through the interface in the form of a uniform resource locator (URL), so that the terminal device can respond to the end user's trigger operation on the resource address and display the video collection 20f and the script list 20e on the screen.
It can be understood that the computer device may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The modules mentioned above can therefore be distributed over multiple physical servers or cloud servers; that is, the computation for all video clips uploaded by end users can be completed in parallel through distribution or clustering, so that video clips can be automatically cut and classified and event labels generated quickly.
In summary, based on deep neural networks, the embodiments of the present application can quickly obtain, from a video to be processed, the motion event segments for a target object and the event type corresponding to each motion event segment, and then classify each motion event segment and add event labels. For a continuous motion action segment (such as a continuous jump action segment), the segment can first be split into several individual motion action segments, and accurate action classification can then be achieved using the key part position information obtained by skeleton detection, so the accuracy and efficiency of action type recognition are improved. Finally, all important event segments can be automatically cut from the video to be processed and a script list generated automatically; in an actual service scenario this can replace manual operation, saving a large amount of labor cost, reducing human error, and improving the generation efficiency of motion script information.
Further, please refer to fig. 3, which is a flowchart illustrating a video data processing method according to an embodiment of the present application. The video data processing method may be executed by the terminal device or the server described in fig. 1, or may be executed by both the terminal device and the server. As shown in fig. 3, the method may include the steps of:
step S101, acquiring a motion event segment for a target object in a video to be processed;
specifically, please refer to fig. 4a to fig. 4c together, which are schematic views of a scene of video data processing according to an embodiment of the present application. As shown in fig. 4a, the terminal user uploads the video clip 30b as a video to be processed through the terminal device 30a, and finally, a video collection and a corresponding diary formed by splicing a plurality of important event clips in the video clip 30b can be obtained. First, the terminal device 30a may display some prompt information on the screen so that the terminal user may complete the relevant operation according to the prompt information, for example, a text prompt content of "drag file to here, or click upload" may be displayed in the area 301a, indicating that the terminal user may upload by dragging the local to-be-processed video stored in the area 301a, or, click the upload control to select the local to-be-processed video for upload, if the to-be-processed video is not stored locally, as shown in the area 302a in the terminal device 30a, the terminal user may input a resource link of the to-be-processed video (for example, paste the resource link thereof to copy and correspond to an input box), and the subsequent server may obtain the to-be-processed video by using the resource link, where the link may be in a URL format, an HTML format, an UBB format, or the like, the terminal user may select according to the specific format of the resource link corresponding to the video to be processed, and in addition, increase or decrease of the link format option may be performed according to the actual service scene, which is not limited in the embodiment of the present application. Assuming that the end user selects a local one of the video segments 30b for processing, the end device 30a may send the video segment 30b to the server 30c (i.e., corresponding to the server 10a in fig. 1) in response to a triggering operation (e.g., a clicking operation) by the end user with respect to the "confirm upload" control 303 a. The video segment 30b includes one or more target objects, and the number of the target objects is not limited herein.
After server 30c obtains video segment 30b, it may decode video segment 30b and extract frames using a video editor or a video editing algorithm, for example Adobe Premiere Pro (a common video editing software developed by Adobe), FFmpeg (Fast Forward MPEG, a set of open-source computer programs that can record and convert digital audio and video and turn them into streams), or OpenCV (a cross-platform computer vision and machine learning library released under a BSD-style license that runs on various operating systems). Server 30c may obtain every frame image of video segment 30b and then uniformly sample the obtained frames at a fixed time interval to obtain a first video frame sequence 30d. Assuming that 99 video frames are obtained after frame extraction, as shown in fig. 4a, uniformly sampling video segment 30b yields first video frame 301d, first video frame 302d, ..., first video frame 398d, and first video frame 399d, which are combined into the first video frame sequence 30d in video time order. The specific fixed time interval used for uniform frame extraction needs to be determined according to the actual situation so that video frames of every important event segment can be sampled evenly, which is not limited in this embodiment of the present application.
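One possible realisation of this uniform frame extraction step, using OpenCV, is sketched below; the 0.5-second sampling interval is a hypothetical choice.

```python
import cv2

def extract_frames(video_path: str, interval_s: float = 0.5):
    """Uniformly sample one frame per fixed time interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, int(round(fps * interval_s)))  # frames to skip between samples
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)                 # keep one frame per interval
        idx += 1
    cap.release()
    return frames                                # the first video frame sequence
```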
Optionally, if terminal device 30a is equipped with a video editor or can run a video editing algorithm, the end user may first decode video segment 30b and extract frames on terminal device 30a to generate the corresponding first video frame sequence 30d, and then send the first video frame sequence 30d to server 30c. The process of generating the first video frame sequence 30d locally from video segment 30b is the same as the process by which server 30c generates it, and is therefore not repeated here.
Further, server 30c inputs the first video frame sequence 30d into the feature extraction network 301e for feature extraction to obtain picture features 30f corresponding to the first video frame sequence 30d. Since video segments in actual service are in color, that is, the extracted video frames are color pictures (for example, RGB-mode color pictures with three color channels: red, green, and blue), the feature extraction network 301e may specifically be an RGB convolutional network that extracts the RGB picture features of each video frame and fuses them into the picture features 30f corresponding to the first video frame sequence 30d. Server 30c can then input the picture features 30f into the action segmentation network 30g (also called a temporal action segmentation network) to obtain the start-stop timestamps corresponding to the motion events; the timing feature 30h corresponding to the first video frame sequence 30d is obtained by arranging the start-stop timestamps of each motion event in chronological order. For example, timing feature 30h may include j timestamps, timestamp t1, timestamp t2, timestamp t3, ..., timestamp tj, where j is a positive integer and every two adjacent timestamps form a start-stop pair, that is, a start-stop pair consists of a start timestamp and an end timestamp, such as timestamp t1-timestamp t2 or timestamp t3-timestamp t4. It can be understood that the end timestamp of a previous start-stop pair may be equal to the start timestamp of the next pair, for example t2 may equal t3, but for convenience of description and distinction they are labeled t2 and t3 respectively. Further, according to the timing feature 30h, server 30c may extract motion event segments 30i for the motion events from the first video frame sequence 30d. As shown in fig. 4a, if j equals 18, 9 motion event segments can be obtained, namely motion event segment 301i, motion event segment 302i, ..., and motion event segment 309i. The video frames of motion event segment 301i, motion event segment 302i, ..., and motion event segment 309i are input into the convolutional network 302e for event classification to obtain the specific event type of each motion event segment. For example, when video segment 30b is a video segment of a figure skating game, the event types may include an entry event type, an exit event type, a non-continuous jump event type, a continuous jump event type, a rotation event type, a score display event type, and so on.
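The pairing of adjacent timestamps into start-stop pairs and the slicing of the frame sequence into motion event segments can be sketched as follows; the event classifier is an assumed callable that maps a frame slice to one of the event types (or background).

```python
def segments_from_timestamps(frames, timestamps, frame_rate, classify_event):
    """frames: decoded frame list; timestamps: ordered t1..tj in seconds;
    classify_event: assumed callable mapping a frame slice to an event type."""
    segments = []
    # every two adjacent timestamps form one start-stop pair (t1-t2, t3-t4, ...)
    for start_t, end_t in zip(timestamps[0::2], timestamps[1::2]):
        seg_frames = frames[int(start_t * frame_rate): int(end_t * frame_rate)]
        segments.append({"start": start_t, "end": end_t,
                         "event_type": classify_event(seg_frames)})
    return segments
```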
The feature extraction network 301e and the convolutional network 302e may be the same network model, or two independent network models with the same network structure, and may specifically be RGB convolutional networks. An RGB convolutional network may be composed of the convolution layers of a CNN, in which case it is equivalent to an (N+1)-class picture classifier (i.e., N types of motion events plus background); by performing convolution processing on video frames, the RGB convolutional network can be used both to extract picture features and to classify events.
The RGB convolutional network and the action segmentation network are both deep neural networks and need to be trained with sufficient sample data so that their outputs conform to the expected values. Specifically, for the RGB convolutional network, a large number of videos of the same type are collected as training videos. For example, if a network that can process videos of figure skating games is needed, a large number of figure skating game video segments can be used as the training set. All training video segments in the training set are subjected to frame extraction at fixed time intervals to form a first training sequence; the video frames in the first training sequence can then be labeled manually according to event type, the labeled first training sequence is input into the initial RGB convolutional network for training, and the initialization parameters in the initial RGB convolutional network are updated through a back propagation algorithm until the loss value reaches the expected effect, so that an RGB convolutional network capable of extracting picture features and performing event classification is finally obtained. For the action segmentation network, all training video segments in the training set are first labeled manually according to event type, i.e., the start-stop times and event types corresponding to the motion events are annotated; then, frames are extracted from the labeled training video segments at fixed time intervals, the extracted video frames are input into the trained RGB convolutional network for feature extraction, the obtained picture features form a second training sequence, and the second training sequence is used to train an initial action segmentation network, finally yielding the action segmentation network that outputs the start-stop timestamps of motion events. In addition, besides the video data collected to train the networks, a part of the video data needs to be set aside as a test set for approximately estimating the generalization ability of the trained networks and verifying their final effect, which is not repeated herein in the embodiments of the present application.
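As a hedged illustration of the training procedure described above, the following sketch treats the RGB convolutional network as an (N+1)-class picture classifier trained with cross-entropy and back propagation. The PyTorch backbone, layer sizes, class count and function names are assumptions made for demonstration and are not the network structure of the embodiments.

import torch
import torch.nn as nn

num_event_types = 6                      # illustrative N, e.g. entry, exit, jump, combo jump, spin, score display
model = nn.Sequential(                   # placeholder backbone, not the patent's exact structure
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, num_event_types + 1),  # +1 class for background
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(frames, labels):
    # frames: (batch, 3, H, W) RGB video frames from the first training sequence
    # labels: (batch,) manually annotated event types (N+1 classes)
    optimizer.zero_grad()
    loss = criterion(model(frames), labels)
    loss.backward()                      # back propagation updates the initialization parameters
    optimizer.step()
    return loss.item()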
Step S102, if the motion event segment comprises a continuous motion action segment, identifying a motion state segmentation point in the continuous motion action segment, and splitting the continuous motion action segment according to the motion state segmentation point to obtain at least two independent motion action segments;
specifically, the server may identify whether the motion event segment includes a continuous motion action segment. Since the subsequent action classification and action quality evaluation models operate on each individual motion action segment, if a continuous motion action segment exists in the motion event segment, the server needs to split it before performing action classification and action quality evaluation, in order to improve identification accuracy; the continuous motion action segment may be, for example, a continuous jump action segment in a figure skating game. The server first identifies the motion state corresponding to each video frame in the continuous jump action segment, where the motion state refers to the state of the target object during motion. According to the motion state corresponding to each video frame, the server can divide the continuous jump action segment into at least two flight video frame sequences and at least three landing video frame sequences, determine the middle moment of each target landing video frame sequence as a motion state division point, and split the continuous jump action segment according to the motion state division points, finally obtaining at least two individual jump action segments, where the target landing video frame sequences are the landing video frame sequences other than the first and the last of the at least three landing video frame sequences.
Referring to fig. 4b in conjunction with the above step S101, as shown in fig. 4b, it is assumed that the video segment 30b is a video segment of a figure skating game, one motion event segment 303i of the motion event segments 30i is a continuous jump action segment, and the continuous jump action segment 303i includes 16 video frames in total, specifically, video frame T1, video frame T2, video frames T3, …, video frame T15, and video frame T16. It should be noted that two motion states may occur in the continuous jump action segment, namely a flight state and a landing state, where the flight state refers to the state in which the target object is in the air, and the landing state refers to the state in which the target object is on the ground; for example, the motion state corresponding to the video frame T1 is the landing state. After the server 30c obtains the motion states of the video frames T1-T16, the continuous jump action segment 303i may be divided into 3 landing video frame sequences and 2 flight video frame sequences. As shown in fig. 4b, the video frame T1 is the landing video frame sequence 1, the video frames T2-T5 are the flight video frame sequence 1, the video frames T6-T12 are the landing video frame sequence 2, the video frames T13-T15 are the flight video frame sequence 2, and the video frame T16 is the landing video frame sequence 3. After removing the landing video frame sequence 1 and the landing video frame sequence 3, only the landing video frame sequence 2 remains as the target landing video frame sequence, and the middle time of the landing video frame sequence 2, that is, the time T' corresponding to the video frame T9, is selected as the motion state division point. The server 30c may then split the continuous jump action segment 303i at the motion state division point into a single jump action segment 3031i and a single jump action segment 3032i, where the single jump action segment 3031i comprises video frames T1-T9 and the single jump action segment 3032i comprises video frames T10-T16. The splitting process of the other continuous jump action segments in the motion event segments 30i is the same as that of the continuous jump action segment 303i; if another continuous jump action segment includes more than three landing video frame sequences, the number of target landing video frame sequences is two or more, so that three or more individual jump action segments can finally be obtained, which is not described herein again.
If the continuous jump action segment includes a video frame T_i, where i is a positive integer and i is smaller than the total number of video frames in the continuous jump action segment, the specific process of dividing the continuous jump action segment into at least two flight video frame sequences and at least three landing video frame sequences may be as follows: obtain the consecutive video frames T_i to T_(i+n) from the continuous jump action segment, where n is a positive integer and i+n is less than or equal to the total number of video frames in the continuous jump action segment; if the motion states corresponding to the consecutive video frames T_i to T_(i+n-1) are all flight states, the motion states corresponding to the video frame T_(i+n) and the video frame T_(i-1) are both landing states, and the number of video frames from T_i to T_(i+n-1) is greater than or equal to the flight number threshold, then determine the consecutive video frames T_i to T_(i+n-1) as one flight video frame sequence. At least two such flight video frame sequences exist in the continuous jump action segment, and the video frame sequences in the continuous jump action segment other than the at least two flight video frame sequences may then be determined as the at least three landing video frame sequences.
Referring to fig. 4b again, for the continuous jump action segment 303i, the server 30c may recognize that the motion state corresponding to the video frame T1 is a landing state, the motion states corresponding to the video frames T2, T3, T4 and T5 are all flight states, and the motion state corresponding to the video frame T6 is a landing state. It is easy to see that 4 video frames are included from the video frame T2 to the video frame T5; if the flight number threshold is less than 4 — for example, the flight number threshold is 3 — the video frames T2-T5 may be determined as one flight video frame sequence. The determination of the other flight video frame sequences is consistent with the above process, and the landing video frame sequences are formed by the frames remaining in the continuous jump action segment 303i after the flight video frame sequences are removed. In short, the server may examine every two adjacent video frames in the landing state: if the intermediate video frames between them are in the flight state and their number is greater than or equal to the flight number threshold, the intermediate video frames may be considered to form a flight video frame sequence; if their number is less than the flight number threshold, the intermediate video frames may be considered to belong to a landing video frame sequence.
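The splitting rule described above can be summarized, purely for illustration, by the following Python sketch; the function name, the 0-based frame indexing and the default threshold value are assumptions, not part of the embodiments.

def split_continuous_jump(states, flight_threshold=3):
    """Sketch of the splitting rule described above. states is a list of
    per-frame motion states, e.g. ['land', 'air', 'air', ...]; the return
    value is a list of 0-based frame indices used as motion state division
    points."""
    # 1. group consecutive frames with the same state into runs (state, start, end)
    runs, start = [], 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[start]:
            runs.append((states[start], start, i - 1))
            start = i
    # 2. an 'air' run shorter than the threshold is treated as landing
    #    (smoothing of per-frame mis-detections); adjacent landing runs are merged
    smoothed = []
    for s, b, e in runs:
        if s == 'air' and (e - b + 1) < flight_threshold:
            s = 'land'
        if smoothed and smoothed[-1][0] == s:
            smoothed[-1] = (s, smoothed[-1][1], e)
        else:
            smoothed.append((s, b, e))
    # 3. landing runs other than the first and the last are target landing
    #    sequences; the middle frame of each is a motion state division point
    landing_runs = [(b, e) for s, b, e in smoothed if s == 'land']
    return [(b + e) // 2 for b, e in landing_runs[1:-1]]

# Example matching fig. 4b: 16 frames T1-T16 with threshold 3 give one division
# point at index 8 (video frame T9), yielding single jump segments T1-T9 and T10-T16.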
It should be noted that the identification of motion states used to split the continuous jump action segment can be implemented with a picture classifier. Specifically, the picture classifier can be based on a binary picture convolutional network: the video frames extracted from continuous jump action sample segments are first composed into a third training sequence, the third training sequence is labeled frame by frame as either "flight" or "landing", and the labeled third training sequence is then input into a binary picture convolutional network for training, finally yielding a trained picture classifier. Subsequently, all video frames in a continuous jump action segment can be input into the picture classifier, which outputs whether each frame is in the flight state. If not all of M consecutive video frames are in the flight state, these video frames can be determined as belonging to a landing video frame sequence, which reduces the influence of misjudgments by the picture classifier (M is equal to the flight number threshold). Finally, the continuous jump action segment can be split according to the obtained motion state division points, that is, it can be segmented into a plurality of individual jump action segments that are connected end to end.
Step S103, respectively tracking the key part of the target object in each single motion action segment to obtain the key part position information respectively corresponding to each single motion action segment, and identifying the first event label corresponding to the continuous motion action segment according to the key part position information; the first event label is used for recording the action type corresponding to each individual motion action segment in the motion script information associated with the target object.
Specifically, the server may track a key portion of the target object in each individual movement segment, so as to obtain position information of the key portion corresponding to each individual jumping movement segment, and may identify an action type corresponding to each individual jumping movement segment according to the position information of the key portion, and further may generate a first event tag corresponding to a continuous jumping movement segment according to the action type, where the first event tag is used to record an action type corresponding to each individual movement segment in the movement session information associated with the target object. In addition, the action evaluation quality and the number of flight rotations corresponding to each individual movement action segment can be obtained through the action understanding network, and the specific process can be seen from the following steps S201 to S202 in the embodiment corresponding to fig. 7.
It should be noted that although the action understanding network can also identify the action type corresponding to each individual motion action segment, its accuracy in classifying individual motion actions (e.g., jump actions) is low. Therefore, in the embodiment of the present application, the individual motion action segments are classified separately, and the classification accuracy is improved by using the key part position information (which may also be referred to as three-dimensional human skeleton information). The acquisition of the key part position information and the identification of the action type may be performed in the action classification network, and the action classification network and the action understanding network may run in parallel, so as to improve the execution efficiency of the entire model.
Referring to fig. 4c, taking the processing of the single jump motion segment 3031i as an example for description, as shown in fig. 4c, the action type 30o corresponding to the single jump motion segment 3031i can be obtained by inputting the single jump motion segment 3031i into the action classification network 30l, and the processing procedures of the single jump motion segment 3032i in fig. 4b are the same, which is not described herein again, and the action type 30r corresponding to the single jump motion segment 3032i can be obtained finally. All action types corresponding to the consecutive jumping action segment 303i can be obtained by arranging the action types 30o and 30r in time sequence, so that the obtained all action types can form a first event label corresponding to the consecutive jumping action segment 303i, and the process of generating the first event labels corresponding to other consecutive jumping action segments is consistent with the above process.
The action classification network may specifically include a three-dimensional skeleton detection network and a graph convolution classification network, and is dedicated to classifying jump actions. The key part position information includes at least two kinds of information. Assuming that the at least two individual motion action segments include an individual motion action segment H, please refer to fig. 5, which is a schematic flow diagram of a video data processing method provided in an embodiment of the present application. As shown in fig. 5, the specific steps performed in the action classification network may include:
step S1031, tracking key parts of the target object in the single movement motion segment H to obtain at least two pieces of key part position information corresponding to the single movement motion segment H;
specifically, the server may input the individual motion action segment H into the three-dimensional skeleton detection network, detect the key parts on the three-dimensional human skeleton of the target object in each video frame through the three-dimensional skeleton detection network, and track these key parts, thereby obtaining at least two pieces of key part position information corresponding to the individual motion action segment H. The key parts of the target object may specifically include the head, neck, shoulders, elbows, waist, hips, knees, feet, and the like; these key parts are abstracted in the video frames into individual human key points so that tracking and detection can be performed better.
The at least two pieces of key part position information may include global key part position information, local key part position information, fused key part position information, global displacement key part position information, and local displacement key part position information. The server may track the key parts of the target object in the individual motion action segment H to obtain the global coordinates of the key parts as the global key part position information corresponding to the individual motion action segment H. Please refer to fig. 6, which is a structural schematic diagram of a three-dimensional human skeleton provided in an embodiment of the present application. As shown in fig. 6, there are 17 human key points corresponding to the key parts, including key point P1, key point P2, key points P3, …, key point P16, and key point P17. The server may obtain the three-dimensional global coordinates corresponding to key points P1-P17 in each video frame of the individual motion action segment H; for example, the three-dimensional global coordinates (x1, y1, z1) of the key point P1 may be obtained. The three-dimensional global coordinates need to be generated based on an established coordinate system, and which coordinate system is established may be chosen according to actual needs, which is not limited in the embodiments of the present application.
Further, the local coordinates of the key parts, that is, the local coordinates of each key part relative to its parent node, may be obtained from the global key part position information as the local key part position information corresponding to the individual motion action segment H. Referring to fig. 6 again, for example, when the key point P17 is taken as a child node, its parent node is the key point P15, and the corresponding local coordinates are obtained by subtracting the three-dimensional global coordinates of the key point P15 from the three-dimensional global coordinates of the key point P17. After the global key part position information and the local key part position information are obtained, the global coordinates corresponding to the target key parts in the global key part position information and the local coordinates corresponding to the key parts other than the target key parts in the local key part position information can be fused to obtain the fused key part position information corresponding to the individual motion action segment H. Since the overall position and direction of the human body are controlled by the hip joints alone, while the other joints control the local posture of the human body, in the embodiment of the present application the three hip joints can be used as the target key parts, and the global coordinates of the three hip joints are fused with the local coordinates of the other joints. Referring to fig. 6, the three hip joints correspond to the key point P1, the key point P2, and the key point P3, and the other joints correspond to the remaining key points; the global coordinates corresponding to the key points P1, P2, and P3 and the local coordinates corresponding to the key points P4-P17 are taken and combined together to obtain the fused key part position information. Compared with the two independent information streams of global key part position information and local key part position information, the fused key part position information makes better use of the internal relation between global and local information; when training the graph convolution network, if the fused key part position information is used as input, only one training pass is needed, whereas if the global key part position information and the local key part position information are used as two separate input streams, two training passes are needed, so the former has higher training efficiency.
In addition, the coordinate displacement of the global key part position information between every two adjacent video frames in the individual motion action segment H (i.e., the coordinate displacement of the global key part position information relative to the previous frame in time sequence) may be used as the global displacement key part position information corresponding to the individual motion action segment H. For example, for the key point P1, the global coordinates of the key point P1 in the previous video frame may be subtracted from its global coordinates in the current video frame to obtain the current coordinate displacement. It should be noted that the first video frame in the individual motion action segment H has no previous video frame, so the coordinate displacements corresponding to all key points in the first video frame may default to zero. Similarly, the coordinate displacement of the local key part position information between every two adjacent video frames in the individual motion action segment H (i.e., the coordinate displacement of the local key part position information relative to the previous frame in time sequence) may be used as the local displacement key part position information corresponding to the individual motion action segment H.
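The five information streams described in this step can be illustrated by the following sketch; the array shapes, the parent-joint table and the assumption that the three hip joints are root joints (pointing to themselves) are illustrative only and do not reflect the exact coordinate conventions of the embodiments.

import numpy as np

def build_information_streams(global_coords, parents, hip_joints=(0, 1, 2)):
    """Sketch of the five information streams described above.
    global_coords: (T, 17, 3) array of per-frame 3D global coordinates of the key points.
    parents: length-17 sequence; parents[k] is the parent joint of key point k
             (root joints may point to themselves, giving zero local coordinates)."""
    glob = global_coords                                   # 1) global key part position information
    local = glob - glob[:, list(parents), :]               # 2) local: child minus parent joint
    fused = local.copy()                                   # 3) fused: hip joints keep global coords,
    fused[:, list(hip_joints), :] = glob[:, list(hip_joints), :]   #    other joints keep local coords
    glob_disp = np.zeros_like(glob)                        # 4) global displacement vs. previous frame
    glob_disp[1:] = glob[1:] - glob[:-1]                   #    (first frame defaults to zero)
    local_disp = np.zeros_like(local)                      # 5) local displacement vs. previous frame
    local_disp[1:] = local[1:] - local[:-1]
    return glob, local, fused, glob_disp, local_disp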
It can be understood that, in addition to 17 key points shown in fig. 6, the key points may be increased or decreased according to actual needs, and the present application is not limited thereto.
Step S1032, predicting action type prediction probabilities corresponding to each piece of key part position information in the at least two pieces of key part position information respectively, carrying out average processing on the predicted at least two action type prediction probabilities to obtain average action type prediction probabilities, and identifying action types corresponding to the single movement action segments H according to the average action type prediction probabilities;
specifically, the 5 sets of information streams obtained in step S1031, namely the global key part position information, the local key part position information, the fused key part position information, the global displacement key part position information, and the local displacement key part position information, are input into the graph convolution classification network, which may output the action type prediction probability corresponding to each set of information streams. The obtained 5 action type prediction probabilities are averaged to obtain the average action type prediction probability, which may be a vector in which each element corresponds to the prediction probability of one action type; the action type corresponding to the largest element in the average action type prediction probability is then taken as the action type corresponding to the individual motion action segment H.
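A minimal sketch of the averaging and selection described above is given below; the variable names and shapes are assumptions for illustration.

import numpy as np

def classify_single_jump(stream_probs, action_names):
    # stream_probs: list of probability vectors, one per information stream, each of
    # shape (num_action_types,), output by the graph convolution classification network
    avg = np.mean(np.stack(stream_probs), axis=0)   # average action type prediction probability
    return action_names[int(np.argmax(avg))]        # action type with the largest average probability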
It should be noted that, in the embodiment of the present application, taking the 5 sets of information streams as input is the preferred scheme, because the action type obtained in this way is the most accurate. However, 4 sets of information streams, namely the global key part position information, the local key part position information, the global displacement key part position information, and the local displacement key part position information, may also be used as input, or only 3 sets of information streams, namely the fused key part position information, the global displacement key part position information, and the local displacement key part position information, may be used as input; the action classification accuracy of these latter two schemes is lower.
Step S1033, when the action type corresponding to each individual motion segment in the at least two individual motion segments is obtained, generating a first event label corresponding to the continuous motion segment according to the action type corresponding to each individual motion segment.
Specifically, the processing procedure of the other individual motion action segments in the continuous motion action segment is the same as the processing procedure of the individual motion action segment H in the above steps S1031 to S1032, and is not repeated here. When the action type corresponding to each of the at least two individual motion action segments is obtained, the action types corresponding to each individual motion action segment can be arranged and combined in sequence to obtain the first event label corresponding to the continuous motion action segment. The first event labels corresponding to all continuous motion action segments are then arranged together according to the time sequence of their start-stop timestamps to form the motion script information associated with the target object, which is finally stored in a record list and returned to the terminal user.
It should be noted that the graph convolution classification network also needs to be trained before use. Specifically, data labeling is performed first: for example, when classifying video segments of figure skating games, the jump action sample segments (including single jump action sample segments and discontinuous jump action sample segments) can be labeled with the action name of the jump action (for example, front jump, back jump, etc.) and the action evaluation quality (including the base score and the execution score) according to the action type. Then the three-dimensional global coordinates of all video frames in the jump action sample segments can be extracted using the three-dimensional skeleton detection network, the 5 sets of information streams (i.e., the key part position information) described in the above step S1031 are generated from the three-dimensional global coordinates, and the obtained 5 sets of information streams are used as input to train the initial graph convolution network. After the trained graph convolution classification network is obtained, the jump action segments can be classified separately, thereby achieving higher jump action classification accuracy relative to the action understanding network.
Further, please refer to fig. 7, which is a flowchart illustrating a video data processing method according to an embodiment of the present application. As shown in fig. 7, the method includes the following steps S201 to S205, and the steps S201 to S205 are a specific embodiment of the step S103 in the embodiment corresponding to fig. 3, and the method includes the following steps:
step S201, respectively performing frame extraction on each single jumping motion segment to obtain at least two second video frame sequences;
specifically, in the scene of a figure skating game, the continuous motion action segment is a continuous jump action segment and each individual motion action segment is an individual jump action segment, which can be seen in fig. 4a to 4c. After obtaining the single jump action segment 3031i and the single jump action segment 3032i, as shown in fig. 4c, a second video frame sequence 30j can be obtained by performing frame extraction on the single jump action segment 3031i; similarly, a second video frame sequence corresponding to each individual jump action segment can be obtained by performing frame extraction on the single jump action segment 3032i.
Step S202, inputting at least two second video frame sequences into an action understanding network, and outputting the action evaluation quality and the number of flight rotations corresponding to the continuous jump action segment through the action understanding network;
specifically, assuming that the at least two second video frame sequences include a second video frame sequence S, the second video frame sequence S may be split into m subsequences, and feature extraction is performed on the m subsequences respectively by m non-local components in the action understanding network, so that m intermediate action features may be obtained. Further, in the embodiment of the present application, the conventional averaging processing is replaced with one-dimensional convolution processing, that is, two one-dimensional convolutions are performed on the m intermediate action features to obtain m one-dimensional convolution action features, where the dimension of a one-dimensional convolution action feature differs from the dimension of an intermediate action feature. The m one-dimensional convolution action features are then combined through a fully connected layer in the action understanding network to obtain the target action feature, and the individual action evaluation quality and the individual number of flight rotations corresponding to the target action feature can be output through a classification layer in the action understanding network. When the individual action evaluation quality and the individual number of flight rotations corresponding to each of the at least two second video frame sequences are obtained, the individual action evaluation qualities corresponding to each second video frame sequence may be added to generate the action evaluation quality corresponding to the continuous jump action segment, and at the same time the individual numbers of flight rotations corresponding to each second video frame sequence may be added to generate the number of flight rotations X1 corresponding to the continuous jump action segment. Alternatively, the individual numbers of flight rotations of the second video frame sequences may be arranged and combined in chronological order, in which case the number of flight rotations of the continuous jump action segment may be understood as an array X2 containing each individual number of flight rotations; subsequently, the number of flight rotations X1 and the array X2 may both be recorded in the motion script information.
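For illustration, the aggregation described above may be sketched as follows; the dictionary field names are assumptions, and the real per-segment values come from the action understanding network.

def aggregate_continuous_jump(individual_results):
    # individual_results: chronologically ordered list of dicts, one per single jump
    # segment, e.g. {'quality': 8.5, 'rotations': 3} (field names are illustrative)
    total_quality = sum(r['quality'] for r in individual_results)
    rotation_total = sum(r['rotations'] for r in individual_results)      # X1 in the text
    rotation_array = [r['rotations'] for r in individual_results]         # X2 in the text
    return {'quality': total_quality,
            'rotations_total': rotation_total,
            'rotations_per_jump': rotation_array}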
Referring to fig. 4c again, the second video frame sequence 30j obtained by performing frame extraction on the single jump action segment 3031i is input into the action understanding network 30k, so that the action evaluation quality 30m (in this case, an individual action evaluation quality) and the number of flight rotations 30n (in this case, an individual number of flight rotations) corresponding to the single jump action segment 3031i can be obtained. The processing of the single jump action segment 3032i is consistent with this, and finally the action evaluation quality 30p and the number of flight rotations 30q corresponding to the single jump action segment 3032i can also be obtained, which is not described herein again. The action evaluation quality 30m and the action evaluation quality 30p are added to obtain the total action evaluation quality corresponding to the continuous jump action segment 303i, and the number of flight rotations 30n and the number of flight rotations 30q are added, or combined together in chronological order, to obtain the total number of flight rotations corresponding to the continuous jump action segment 303i.
It should be noted that the action understanding network may also classify the rotation action segments to obtain action types and action evaluation qualities corresponding to the rotation action segments, and the specific process may refer to step S306 in the embodiment corresponding to fig. 9a described below.
Step S203, respectively tracking the key part of the target object in each individual jumping motion segment to obtain the key part position information respectively corresponding to each individual jumping motion segment, and identifying the motion type respectively corresponding to each individual jumping motion segment according to the key part position information;
specifically, as shown in fig. 4c, the corresponding action type 30o can be obtained by inputting the single jump action segment 3031i into the action classification network 30l, the corresponding action type 30r can be obtained by inputting the single jump action segment 3032i into the action classification network 30l, and finally, all the action types corresponding to the continuous jump action segment 303i can be obtained by arranging the action types 30o and 30r in time sequence. For a specific implementation process, reference may be made to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
And step S204, generating a first event label corresponding to the continuous jump action segment according to the action evaluation quality, the number of flight rotations and the action types.
Specifically, the obtained motion estimation quality, the number of revolutions of flight and the motion type are arranged together, so that the first event label corresponding to the consecutive jump motion segment can be obtained, for example, the first event label corresponding to the consecutive jump motion segment 303i can be formed by the total motion estimation quality and the total number of revolutions of flight obtained in step S202 and all the motion types obtained in step S203, and the process of generating the first event labels corresponding to other consecutive jump motion segments is consistent with the above process.
It should be understood that all numbers shown in the embodiments of the present application are illustrative examples; in actual applications, the actual values shall prevail.
Furthermore, before the action understanding network is applied, the initial action understanding network needs to be trained to obtain a usable action understanding network. The action understanding network may also be referred to as a multitask action understanding network, since it can perform action classification and action quality evaluation at the same time; please refer to fig. 8, which is a schematic structural diagram of an action understanding network provided in an embodiment of the present application. As shown in fig. 8, in the embodiment of the present application, instead of using a C3D network (3D Convolutional Network) as the main network, m non-local components are used to form the main network, where m is a positive integer, and the action understanding network adopts a hard parameter sharing mechanism, that is, the m non-local components share parameters. First, the jump action sample segments need to be labeled: the action types and the action evaluation quality (including the base score and the execution score) in the jump action sample segments can be annotated, K video frames are uniformly extracted from each labeled jump action sample segment (if the segment has fewer than K frames, it is padded with zero frames), and then P consecutive video frames are randomly extracted from the K video frames as an input sequence to form the training and test sets. Since the number of video frames input to each non-local component must be a fixed value, P is equal to the length of the model input data corresponding to the initial action understanding network; the value of P is very close to the value of K, and the length of the model input data can be adjusted according to the length of the actual video segments. Further, using the predicted action evaluation quality and the predicted number of flight rotations corresponding to a jump action sample segment output by the initial action understanding network, a quality loss function can be generated according to the difference between the predicted action evaluation quality and the actual action evaluation quality, an action loss function can be generated according to the difference between the predicted number of flight rotations and the actual number of flight rotations, and a target loss function can be generated from the obtained quality loss function and action loss function (for example, by summing them); the model parameters in the initial action understanding network are then corrected through the target loss function, so that a well-trained action understanding network can be obtained.
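A hedged sketch of the target loss construction described above is given below; treating both heads as regression targets with mean squared error is an assumption, since the embodiment only specifies that each loss reflects the difference between the predicted and actual values and that the two losses may be combined, for example by summation.

import torch
import torch.nn as nn

mse = nn.MSELoss()

def target_loss(pred_quality, true_quality, pred_rotations, true_rotations):
    # quality loss: difference between predicted and actual action evaluation quality
    quality_loss = mse(pred_quality, true_quality)
    # action loss: difference between predicted and actual number of flight rotations
    action_loss = mse(pred_rotations, true_rotations)
    # target loss used for back propagation (summation is one of the combinations mentioned above)
    return quality_loss + action_loss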
As shown in fig. 8, an initial motion understanding network based on non-local components is trained by an input sequence containing P consecutive video frames, the overall structure of the initial motion understanding network includes m non-local components (also referred to as non-local modules) sharing parameters, each of the non-local components can respectively process segments of P/m consecutive frames, as shown in fig. 8, non-local component 1 processes segment 1, non-local component 2 processes segment 2, …, non-local component m processes segment m, outputs of the m non-local components are combined through one-dimensional convolution, and then the motion type, the number of flight rotations, and the motion evaluation quality (including the base score and the execution score) are respectively output through different fully connected layers.
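The overall structure of fig. 8 may be sketched, under assumptions, as follows. The module standing in for a non-local component, the feature dimension, the head sizes and the requirement that P be divisible by m are illustrative choices and not the actual network of the embodiments; because the m non-local components share parameters (hard sharing), a single module is applied to each of the m sub-segments.

import torch
import torch.nn as nn

class ActionUnderstandingSketch(nn.Module):
    """Illustrative multitask structure following fig. 8, not the patent's exact network."""
    def __init__(self, m=4, feat_dim=128, num_action_types=7, num_rotation_classes=5):
        super().__init__()
        self.m = m
        self.component = nn.Sequential(            # placeholder standing in for one non-local component
            nn.Conv3d(3, feat_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.combine = nn.Sequential(              # two one-dimensional convolutions over the
            nn.Conv1d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),   # m intermediate action features
            nn.Conv1d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(), nn.Flatten())
        self.action_head = nn.Linear(feat_dim * m, num_action_types)       # action type
        self.rotation_head = nn.Linear(feat_dim * m, num_rotation_classes) # number of flight rotations
        self.quality_head = nn.Linear(feat_dim * m, 2)                     # base score, execution score

    def forward(self, clip):
        # clip: (batch, 3, P, H, W); the P frames are split into m segments of P/m frames
        segments = torch.chunk(clip, self.m, dim=2)
        feats = torch.stack([self.component(s) for s in segments], dim=2)  # (batch, feat_dim, m)
        fused = self.combine(feats)                                        # target action feature
        return self.action_head(fused), self.rotation_head(fused), self.quality_head(fused)

A forward pass on one clip returns three outputs that correspond to the three fully connected heads shown in fig. 8.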
It should be understood that although the action understanding network illustrated in fig. 8 only shows non-local components, one-dimensional convolution layers and fully connected layers, in practical applications the network structure of the action understanding network may further include an input layer, a feature extraction layer, a normalization (BatchNorm) layer, an activation layer, an output layer, and the like. In addition, an action understanding network capable of classifying event type segments other than single jump action segments can be trained according to actual needs; for example, action type recognition and action quality evaluation can be performed on rotation action segments, and details are not repeated here.
In the embodiment of the present application, a model composed of a plurality of deep neural networks can acquire the motion event segments for the target object in the video to be processed and identify the event type corresponding to each motion event segment. When a motion event segment is detected to include a continuous motion action segment, the continuous motion action segment can be split by identifying the motion state division points in it, so that at least two individual motion action segments are obtained. Further, the key parts of the target object can be tracked in each individual motion action segment to obtain the key part position information corresponding to each individual motion action segment, and the action types corresponding to the continuous motion action segment can be identified according to the key part position information, so that automatic and accurate classification of continuous motion action segments can be realized and the corresponding action evaluation quality can be generated at the same time, which improves the identification accuracy and efficiency for action types. In addition, the motion script information associated with the target object can be automatically generated according to the event labels corresponding to the continuous motion action segments, which avoids consuming a large amount of manpower to register the motion script information and improves the generation efficiency of the motion script information.
Please refer to fig. 9 a-9 b, which are schematic flow charts of a video data processing method according to an embodiment of the present application. As shown in fig. 9a, the method may comprise the steps of:
step S301, performing frame extraction on a video to be processed containing a target object to obtain a first video frame sequence;
specifically, as shown in fig. 9b, the video to be processed is a video clip of a figure skating game, and therefore the target object in the video to be processed is a player. After the frames of the video to be processed are decimated at regular time intervals, a first video frame sequence 40a can be obtained.
Step S302, inputting the first video frame sequence into a feature extraction network, and acquiring picture features corresponding to the first video frame sequence;
specifically, the feature extraction network may be an RGB convolutional network, as shown in fig. 9b, the first video frame sequence 40a is input into the RGB convolutional network 40b, and the picture features corresponding to the first video frame sequence 40a can be extracted through the RGB convolutional network 40 b.
Step S303, inputting the picture characteristics into a time sequence action segmentation network to obtain a time sequence corresponding to the motion event;
specifically, as shown in fig. 9b, the picture features obtained in step S302 are input into the time-series action segmentation network 40c (corresponding to the action segmentation network in fig. 4 a), so as to obtain start-stop timestamps corresponding to each motion event, the start-stop timestamps may form a time sequence corresponding to the motion event, the first video frame sequence 40a may be segmented into a plurality of motion event segments according to the time sequence, a video frame that does not belong to a motion event segment may be considered as a background picture, and the specific process may refer to step S101 in the embodiment corresponding to fig. 3.
Step S304, inputting the motion event segment into a feature extraction network to obtain an event type corresponding to the motion event segment;
specifically, as shown in fig. 9b, the video frames corresponding to the motion event segments obtained in step S303 are input into the RGB convolutional network 40d, so as to obtain the event type corresponding to each motion event segment, where the event types may specifically include an entry event type, an exit event type, a score display event type, a discontinuous jump event type, a continuous jump event type, and a rotation event type; accordingly, the entry event segment, the exit event segment, the score display event segment, the discontinuous jump action segment, the continuous jump action segment, and the rotation action segment may be extracted from the first video frame sequence 40a. The specific process of identifying the event type can be referred to in step S101 of the embodiment corresponding to fig. 3.
In addition, for the entry event segment, the start-stop timestamp corresponding to the entry event segment may be used as a fourth event tag corresponding to the entry event segment, where the fourth event tag is used to record the start timestamp and the end timestamp corresponding to the entry event segment in the motion script information associated with the participating player;
for the exit event segment, the start-stop timestamp corresponding to the exit event segment may be used as a fifth event tag corresponding to the exit event segment, where the fifth event tag is used to record the start timestamp and the end timestamp corresponding to the exit event segment in the motion script information associated with the participating player;
for the score display event segment, the start-stop timestamp corresponding to the score display event segment may be used as a sixth event tag corresponding to the score display event segment, where the sixth event tag is used to record the start timestamp and the end timestamp corresponding to the score display event segment in the motion script information associated with the participating player. The score display event segment is a segment in which the current scores of the participating player are displayed in the relevant area; for example, the total score of the player can be displayed on a large screen in the venue, and the score from each judge and the player's ranking can be displayed in detail.
Step S305, dividing the continuous jump motion segment into a plurality of independent jump motion segments;
specifically, the continuous jump motion segment obtained in step S304 is split to obtain a plurality of individual jump motion segments, and the specific process may refer to step S102 in the embodiment corresponding to fig. 3.
Step S306, performing frame extraction on the discontinuous jump action segment, the single jump action segments and the rotation action segment, and sending the obtained video frames into the multitask action understanding network to obtain the detailed action type, the number of flight rotations and the score;
specifically, as shown in fig. 9b, the frame extraction is performed on the individual jump motion segment obtained in step S305, so as to obtain a second video frame sequence, the second video frame sequence is sent to the multitask motion understanding network 40e (corresponding to the motion understanding network 30k in fig. 4 c), so as to obtain the number of flight rotations (i.e., the number of individual flight rotations) and the score (i.e., the individual motion evaluation quality) corresponding to each individual jump motion segment, and then the first event label corresponding to the consecutive jump motion segment can be obtained after the summary, and the specific process may be referred to step S201-step S202 in the embodiment corresponding to fig. 7.
Optionally, frame extraction is performed on the rotation action segment obtained in step S304 to obtain a third video frame sequence, and the third video frame sequence is input into the multitask action understanding network 40e; the action type corresponding to the rotation action segment can be identified and the score predicted simultaneously through the non-local components in the multitask action understanding network, and the action type and the predicted score are then used as the second event tag corresponding to the rotation action segment, where the specific process is consistent with the process of identifying an individual jump action segment and is not described here again. The second event tag may be used to record the action type and the action evaluation quality (i.e., the predicted score) corresponding to the rotation action segment in the motion script information associated with the participating player.
Optionally, frame extraction is performed on the discontinuous jump action segment obtained in step S304 to obtain a fourth video frame sequence, and the fourth video frame sequence is input into the multitask action understanding network 40e, so as to obtain the score (i.e., the action evaluation quality) and the number of flight rotations corresponding to the discontinuous jump action segment.
Step S307, sending the discontinuous jump motion segment and the single jump motion segment into a three-dimensional skeleton detection network, and detecting a three-dimensional human body skeleton for each video frame;
specifically, in order to more accurately distinguish the slipping motion, especially the jumping motion, with the subtle differences, as shown in fig. 9b, the discontinuous jumping motion segments and the single jumping motion segments may be sent to a three-dimensional skeleton detection network 40f, which is also called a time-series convolution skeleton detection network, and each video frame may be detected to obtain three-dimensional human skeleton information (i.e., the position information of the key portion), and the specific process may refer to step S1031 in the embodiment corresponding to fig. 5.
Step S308, inputting the three-dimensional human skeleton information into the graph convolution classification network, and further determining the detailed action type of the jump action;
specifically, as shown in fig. 9b, the three-dimensional human skeleton information obtained in step S307 is input into the graph convolution classification network 40g (here, the graph convolution classification network 40g and the three-dimensional skeleton detection network 40f correspond to the action classification network 30l in fig. 4 c), so as to obtain detailed action types corresponding to the discontinuous jump action segment and the single jump action segment, and the specific process may refer to step S1032 in the embodiment corresponding to fig. 5.
In this case, the number of flight rotations, the score, and the detailed action type corresponding to each individual jump action segment may be summarized to obtain a first event tag corresponding to a consecutive jump action segment, and the number of flight rotations, the score, and the detailed action type corresponding to a non-consecutive jump action segment may be used as a third event tag corresponding to a non-consecutive jump action segment.
Although the multitask action understanding network 40e, the three-dimensional skeleton detection network 40f, and the graph convolution classification network 40g illustrated in fig. 9b are configured in series, in actual use the multitask action understanding network 40e may be executed in parallel with the three-dimensional skeleton detection network 40f and the graph convolution classification network 40g.
And step S309, outputting the intercepted important event segment, and generating a script according to the corresponding event label.
Specifically, according to the time sequence obtained in step S303, the important event segments can be cut from the processed video, and all the important event segments are spliced to obtain a video collection. An important event segment has the same meaning as a motion event segment, but the important event segment is a complete segment while the motion event segment is a segment obtained after frame extraction, so the motion event segment can be understood as a subset of the important event segment. According to the start-stop timestamp corresponding to each motion event segment, the obtained first event label, second event label, third event label, fourth event label, fifth event label, and sixth event label can be arranged in chronological order to form the motion script information associated with the participating player, and a text script can be generated according to the motion script information.
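For illustration only, the following sketch shows how the event labels could be arranged chronologically into a text script and how the start-stop timestamps could drive the cutting of important event segments; the field names ('start', 'end', 'event_type', 'action_types', 'score') are assumptions, not the format used by the embodiments.

def build_motion_script(event_tags):
    # event_tags: list of dicts, one per motion event segment, each carrying at least
    # a start-stop timestamp pair, an event type and the type-specific fields
    script = sorted(event_tags, key=lambda tag: tag['start'])   # chronological order
    lines = []
    for tag in script:
        detail = tag.get('action_types') or tag.get('score') or ''
        lines.append(f"{tag['start']} - {tag['end']}  {tag['event_type']}  {detail}")
    return "\n".join(lines)

def cut_highlights(event_tags):
    # placeholder: each (start, end) pair would be cut from the full-resolution video
    # and the resulting important event segments spliced into the video collection
    return [(tag['start'], tag['end']) for tag in sorted(event_tags, key=lambda t: t['start'])]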
The specific classification of the motion script information is as follows:
(Table: example layout of the motion script information, listing event entries such as "jump" together with their action names and related details; the table is provided as an image in the original publication and is not reproduced in the text.)
In the above table, "jump" refers to a discontinuous jump action segment, and the action name refers to the detailed action type. For example, for a jump action (including the discontinuous jump action segment and the single jump action segment), the detailed action types may include Axel (forward outside jump), Flip (back inside toe jump), Toeloop (back outside toe jump), Loop (back outside loop jump), Lutz (hook-hand jump), Salchow (back inside loop jump), Euler (back outside half-loop jump), and the like. For the rotation action, the detailed action types may include Sit (sit spin), Camel (camel spin), Layback (layback spin), Upright (upright spin), Combo (combination spin), and the like.
The form of the text script can be seen in the above table, and other templates can be used to generate the text script as required. Optionally, for the continuous jump action segments, as described in step S306, the labels corresponding to each individual jump action segment contained in a continuous jump action segment may be collected and fused and then recorded in the motion script information, or the labels corresponding to each individual jump action segment may be recorded directly in the motion script information, which is not limited in the embodiments of the present application.
It should be noted that, by the method provided by the embodiment of the present application, a set of multi-modal intelligent script tools can be obtained, and besides being applied to the video clips of the figure skating game, the method can also be applied to other similar video clips or related video clips, and the embodiment of the present application is not limited.
By providing an intelligent script scheme based on multiple modalities (that is, multiple video processing technologies based on deep neural networks), the embodiment of the present application can automatically extract the motion event segments for a target object from the video to be processed; by performing processing such as splitting and three-dimensional human skeleton detection on the continuous motion action segments in the motion event segments, automatic and accurate classification of jump actions can be realized; action classification and score prediction can be performed on all motion event segments at the same time; and through the computation of the multiple modalities, a complete script with classifications and scores, together with a video collection composed of the corresponding video segments, can be output automatically. Therefore, the method provided by the present application can improve the identification accuracy and efficiency for action types, and can automatically generate the motion script information associated with the target object according to the event labels corresponding to the motion event segments, thereby avoiding consuming a large amount of manpower to register the motion script information and improving the generation efficiency of the motion script information.
Fig. 10 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application. The video data processing apparatus may be a computer program (including program code) running on a computer device, for example, the video data processing apparatus is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 10, the video data processing apparatus 1 may include: an acquisition module 101, a splitting module 102 and a first identification module 103;
an obtaining module 101, configured to obtain a motion event segment for a target object in a video to be processed;
the splitting module 102 is configured to identify a motion state segmentation point in the continuous motion action segment if the motion event segment includes the continuous motion action segment, and split the continuous motion action segment according to the motion state segmentation point to obtain at least two independent motion action segments;
the first identification module 103 is configured to track a key part of the target object in each individual motion segment, obtain key part position information corresponding to each individual motion segment, and identify a first event tag corresponding to a continuous motion segment according to the key part position information; the first event label is used for recording action types corresponding to each single motion action segment in the motion script information associated with the target object;
the first identifying module 103 is specifically configured to track the key location of the target object in each individual jump motion segment, obtain the location information of the key location corresponding to each individual jump motion segment, and identify the motion type corresponding to each individual jump motion segment according to the location information of the key location; and is specifically used for generating a first event label corresponding to the continuous jump action segment according to the action evaluation quality, the emptying rotation number and the action type.
The specific implementation of the function of the obtaining module 101 may refer to step S101 in the embodiment corresponding to fig. 3, the specific implementation of the function of the splitting module 102 may refer to step S102 in the embodiment corresponding to fig. 3, and the specific implementation of the function of the first identifying module 103 may refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 10, the video data processing apparatus 1 may further include: a first framing module 104, a first action understanding module 105;
a first frame extracting module 104, configured to respectively perform frame extraction on each individual jumping motion segment to obtain at least two second video frame sequences;
and a first action understanding module 105, configured to input at least two second video frame sequences into the action understanding network, and output action evaluation quality and flight rotation number corresponding to the continuous jump action segment through the action understanding network.
The specific functional implementation manner of the first frame extracting module 104 may refer to step S103 in the embodiment corresponding to fig. 3, or may refer to step S201 in the embodiment corresponding to fig. 7, and the specific functional implementation manner of the first action understanding module 105 may refer to step S202 in the embodiment corresponding to fig. 7, which is not described herein again.
Referring to fig. 10, the video data processing apparatus 1 may further include: a model training module 106 and a model modification module 107;
the model training module 106 is configured to label continuous jump action sample segments, uniformly extract K video frames from the labeled continuous jump action sample segments, extract P consecutive video frames from the K video frames as an input sequence, input the input sequence into an initial action understanding network, and output the predicted action evaluation quality and the predicted number of flight rotations corresponding to the continuous jump action sample segments through the initial action understanding network; P is equal to the length of the model input data corresponding to the initial action understanding network;
and the model correction module 107 is configured to generate a quality loss function according to the predicted action evaluation quality, generate an action loss function according to the predicted number of flight rotations, generate a target loss function according to the quality loss function and the action loss function, and correct the model parameters in the initial action understanding network through the target loss function to obtain the action understanding network.
The specific functional implementation manners of the model training module 106 and the model modification module 107 may refer to the steps in the embodiment corresponding to fig. 8, and are not described herein again.
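To make the training and correction steps above concrete, here is a hypothetical PyTorch-style sketch: K frames are sampled uniformly from a labeled continuous jump sample segment, P consecutive frames of them form the input sequence, and the target loss combines a quality loss with an action loss. Treating the evaluation quality as a regression target, treating the rotation count as a class label, and summing the two losses with equal weight are assumptions rather than details given in the text.

```python
# Hypothetical training-step sketch for the initial action understanding network.
import random
import torch
import torch.nn as nn

def make_input_sequence(frames: torch.Tensor, K: int, P: int) -> torch.Tensor:
    """frames: (T, C, H, W). Uniformly sample K frames, then take P consecutive ones (K >= P)."""
    T = frames.shape[0]
    idx = torch.linspace(0, T - 1, K).long()        # uniform sampling of K video frames
    sampled = frames[idx]
    start = random.randint(0, K - P)                # P consecutive frames as the input sequence
    return sampled[start:start + P]

def training_step(model: nn.Module,
                  clip: torch.Tensor,               # (P, C, H, W) input sequence
                  quality_label: torch.Tensor,      # labeled action evaluation quality (scalar)
                  rotations_label: torch.Tensor,    # labeled number of flight rotations (class id)
                  optimizer: torch.optim.Optimizer) -> float:
    pred_quality, pred_rotation_logits = model(clip.unsqueeze(0))   # assumed model outputs
    quality_loss = nn.functional.mse_loss(pred_quality.squeeze(), quality_label)
    action_loss = nn.functional.cross_entropy(pred_rotation_logits, rotations_label.unsqueeze(0))
    target_loss = quality_loss + action_loss        # target loss from quality loss and action loss
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return target_loss.item()
```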
Referring to fig. 10, the video data processing apparatus 1 may further include: a second frame extraction module 108, a second action understanding module 109;
a second frame extracting module 108, configured to, if the motion event segment includes a rotation motion segment, perform frame extraction on the rotation motion segment to obtain a third video frame sequence;
a second action understanding module 109, configured to input the third video frame sequence into an action understanding network, and identify, through a non-local component in the action understanding network, a second event tag corresponding to the rotation action segment; and the second event label is used for recording the action type and action evaluation quality corresponding to the rotating action segment in the motion script information associated with the target object.
The specific functional implementation manner of the second frame extracting module 108 and the second action understanding module 109 may refer to step S306 in the embodiment corresponding to fig. 9a, and the first frame extracting module 104 and the second frame extracting module 108 may be combined into one module, which is not described herein again.
Referring to fig. 10, the video data processing apparatus 1 may further include: a third action understanding module 110, a second recognition module 111;
a third action understanding module 110, configured to, if the motion event segment includes a discontinuous jump action segment, perform frame extraction on the discontinuous jump action segment to obtain a fourth video frame sequence, and input the fourth video frame sequence into the action understanding network to obtain the action evaluation quality and the number of flight rotations corresponding to the discontinuous jump action segment;
the second identification module 111 is configured to track the key part of the target object in the discontinuous jump action segment to obtain the key part position information corresponding to the discontinuous jump action segment, and identify the action type corresponding to the discontinuous jump action segment according to the key part position information; and is further configured to determine the action evaluation quality, the number of flight rotations and the action type corresponding to the discontinuous jump action segment as a third event label corresponding to the discontinuous jump action segment; the third event label is used for recording the action evaluation quality, the number of flight rotations and the action type corresponding to the discontinuous jump action segment in the motion script information associated with the target object.
For specific functional implementation of the third action understanding module 110 and the second identifying module 111, refer to steps S306 to S308 in the embodiment corresponding to fig. 9a, and the first action understanding module 105, the second action understanding module 109, and the third action understanding module 110 may be combined into one module, and the second identifying module 111 and the first identifying module 103 may be combined into one module, which is not described herein again.
Referring to fig. 10, the video data processing apparatus 1 may further include: a first tag determination module 112, a second tag determination module 113, a third tag determination module 114;
a first tag determining module 112, configured to, if the motion event segment includes an entry event segment, use a start-stop timestamp corresponding to the entry event segment as a fourth event tag corresponding to the entry event segment; the fourth event tag is used for recording a start time stamp and an end time stamp corresponding to the entry event segment in the motion script information associated with the target object;
a second tag determining module 113, configured to, if the motion event segment includes a field withdrawal event segment, use a start-stop timestamp corresponding to the field withdrawal event segment as a fifth event tag corresponding to the field withdrawal event segment; the fifth event tag is used for recording a start timestamp and an end timestamp corresponding to the field withdrawal event segment in the motion script information associated with the target object;
a third tag determining module 114, configured to, if the motion event segment includes a score display event segment, take a start-stop timestamp corresponding to the score display event segment as a sixth event tag corresponding to the score display event segment; the sixth event tag is used for recording a start time stamp and an end time stamp corresponding to the score display event segment in the motion script information associated with the target object.
For specific functional implementation manners of the first tag determining module 112, the second tag determining module 113, and the third tag determining module 114, reference may be made to step S304 in the embodiment corresponding to fig. 9a, which is not described herein again.
Referring to fig. 10, the video data processing apparatus 1 may further include: a script information generating module 115 and an important segment intercepting module 116;
a script information generating module 115, configured to sequentially arrange the first event tags corresponding to the continuous motion segments in the motion script information associated with the target object according to the time sequence of the start-stop timestamps;
the important segment intercepting module 116 is configured to intercept important event segments from the video to be processed according to the start-stop timestamps corresponding to the motion events, and splice the important event segments into a video collection; the motion event segment belongs to the important event segment.
The specific functional implementation manners of the script information generating module 115 and the important fragment intercepting module 116 may refer to step S309 in the embodiment corresponding to fig. 9a, which is not described herein again.
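The important segment intercepting module can be read as a cut-and-concatenate step driven by start-stop timestamps. Below is a minimal sketch that uses ffmpeg only as one possible external tool; the file names and timestamps are invented for illustration.

```python
# Hypothetical sketch: cut important event segments by timestamp and splice them.
import subprocess

def cut_and_concat(src: str, spans, out: str = "highlights.mp4") -> None:
    """spans: list of (start_seconds, end_seconds) for the important event segments."""
    parts = []
    for k, (start, end) in enumerate(spans):
        part = f"part_{k}.mp4"
        # Trim one important event segment out of the source video.
        subprocess.run(["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end),
                        "-c", "copy", part], check=True)
        parts.append(part)
    with open("concat_list.txt", "w") as f:
        f.writelines(f"file '{p}'\n" for p in parts)
    # Splice the trimmed segments into one video collection.
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "concat_list.txt", "-c", "copy", out], check=True)

# Example (illustrative timestamps only):
# cut_and_concat("competition.mp4", [(120.0, 131.9), (305.2, 318.7), (402.0, 409.5)])
```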
Referring to fig. 10, the obtaining module 101 may include: a frame extracting unit 1011, a first feature extraction unit 1012 and a time sequence action division unit 1013;
the frame extracting unit 1011 is configured to perform frame extraction on a video to be processed, which includes a target object, to obtain a first video frame sequence;
a first feature extraction unit 1012, configured to input the first video frame sequence into a feature extraction network, and obtain picture features corresponding to the first video frame sequence;
a time sequence action division unit 1013, configured to input the picture features into an action segmentation network to obtain the time sequence features corresponding to the first video frame sequence, where the time sequence features include a start-stop timestamp corresponding to a motion event in the first video frame sequence; and to extract a motion event segment for the motion event from the first video frame sequence according to the start-stop timestamp corresponding to the motion event in the time sequence features.
The specific functional implementation manners of the frame extracting unit 1011, the first feature extracting unit 1012 and the time sequence action dividing unit 1013 can refer to step S101 in the embodiment corresponding to fig. 3, and are not described herein again.
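The chain formed by the three units above can be sketched schematically as follows. The interfaces of `feature_net` and `segment_net` are assumptions that stand in for the feature extraction network and the action segmentation network named in the text.

```python
# Schematic sketch of the acquisition flow: frame extraction -> picture features
# -> temporal action segmentation -> motion event segments with start/stop timestamps.
import torch

def acquire_motion_event_segments(frames: torch.Tensor, feature_net, segment_net, fps: float):
    """frames: (T, C, H, W) tensor holding the first video frame sequence."""
    with torch.no_grad():
        picture_features = feature_net(frames)        # per-frame picture features
        events = segment_net(picture_features)        # assumed to return dicts with frame indices and type
    segments = []
    for ev in events:
        start_f, end_f = ev["start_frame"], ev["end_frame"]
        segments.append({
            "type": ev["type"],                        # e.g. "continuous_jump", "spin", "score_display"
            "start_ts": start_f / fps,                 # start timestamp
            "end_ts": end_f / fps,                     # end timestamp
            "frames": frames[start_f:end_f + 1],       # the motion event segment itself
        })
    return segments
```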
Referring also to fig. 10, the splitting module 102 may include: a state division unit 1021 and a split unit 1022;
a state division unit 1021 for determining a motion state corresponding to each video frame in the continuous jump motion segment; dividing the continuous jump action segment into at least two flight video frame sequences and at least three landing video frame sequences according to the motion state; the motion state refers to the state of the target object in the motion process;
the splitting unit 1022 is configured to determine the middle moment of a target landing video frame sequence as a motion state segmentation point, and split the continuous jump action segment according to the motion state segmentation point to obtain at least two independent jump action segments; the target landing video frame sequence is a video frame sequence, among the at least three landing video frame sequences, other than the first landing video frame sequence and the last landing video frame sequence.
The specific functional implementation manner of the state dividing unit 1021 and the splitting unit 1022 may refer to step S102 in the embodiment corresponding to fig. 3, which is not described herein again.
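The splitting rule (take the middle moment of every intermediate landing sequence as a motion state segmentation point) can be illustrated with a small, self-contained sketch; frame indices stand in for timestamps and the numbers are made up.

```python
# Toy sketch of the splitting unit: split a continuous jump segment at the middle
# frame of each landing sequence that lies between two flight sequences.
def split_points(landing_sequences):
    """landing_sequences: list of (start_frame, end_frame) tuples in temporal order."""
    return [(start + end) // 2                          # middle moment as motion state split point
            for start, end in landing_sequences[1:-1]]  # skip first and last landing sequences

def split_continuous_jump(frames, landing_sequences):
    """Split the continuous jump segment into individual jump segments."""
    cuts = [0] + split_points(landing_sequences) + [len(frames)]
    return [frames[cuts[k]:cuts[k + 1]] for k in range(len(cuts) - 1)]

# Example: three landing sequences around two flight phases -> one split point,
# so a two-jump combination is divided into two individual jump segments.
frames = list(range(100))                            # stand-in for 100 video frames
pieces = split_continuous_jump(frames, [(0, 9), (40, 55), (90, 99)])
print(len(pieces), [len(p) for p in pieces])         # 2 [47, 53]; split point is frame 47
```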
Referring also to fig. 10, the first action understanding module 105 may include: a second feature extraction unit 1051, an individual output unit 1052, and a combined output unit 1053;
a second feature extraction unit 1051, configured to split the second video frame sequence S into m subsequences, and perform feature extraction on the m subsequences through m non-local components, respectively, to obtain m intermediate action features; respectively performing one-dimensional convolution on the m intermediate motion characteristics to obtain m one-dimensional convolution motion characteristics;
an individual output unit 1052, configured to merge the m one-dimensional convolution action features through a full connection layer in the action understanding network to obtain a target action feature, and output an individual action evaluation quality and an individual flight rotation number corresponding to the target action feature through a classification layer in the action understanding network;
and a merge output unit 1053, configured to, when the individual action evaluation quality and the individual number of flight rotations respectively corresponding to each of the at least two second video frame sequences are obtained, generate the action evaluation quality and the number of flight rotations corresponding to the continuous jump action segment according to the individual action evaluation quality and the individual number of flight rotations respectively corresponding to each second video frame sequence.
For specific functional implementation manners of the second feature extraction unit 1051, the single output unit 1052 and the combined output unit 1053, reference may be made to step S202 in the embodiment corresponding to fig. 7, which is not described herein again.
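A hypothetical PyTorch sketch of this forward pass is given below: the frame features of one second video frame sequence are split into m subsequences, each subsequence passes through a non-local component (stubbed here as a small encoder) and a one-dimensional convolution, the results are merged by a fully connected layer into the target action feature, and two heads output the individual action evaluation quality and the individual number of flight rotations. The layer sizes, the temporal pooling and the stubbed non-local block are assumptions; in practice the per-sequence outputs would then be combined (for example averaged) over the at least two second video frame sequences.

```python
# Hypothetical sketch of the action understanding network's per-sequence forward pass.
import torch
import torch.nn as nn

class ActionUnderstandingNet(nn.Module):
    def __init__(self, m: int = 4, feat_dim: int = 256, num_rotation_classes: int = 5):
        super().__init__()
        self.m = m
        # One non-local component per subsequence, stubbed as a small encoder here.
        self.non_local = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU()) for _ in range(m)])
        self.conv1d = nn.ModuleList(
            [nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1) for _ in range(m)])
        self.fc = nn.Linear(m * feat_dim, feat_dim)        # merge into the target action feature
        self.quality_head = nn.Linear(feat_dim, 1)         # individual action evaluation quality
        self.rotation_head = nn.Linear(feat_dim, num_rotation_classes)  # individual flight rotations

    def forward(self, seq_feats: torch.Tensor):
        """seq_feats: (T, feat_dim) frame features of one second video frame sequence."""
        subs = torch.chunk(seq_feats, self.m, dim=0)       # split into m subsequences
        pooled = []
        for k, sub in enumerate(subs):
            inter = self.non_local[k](sub)                 # intermediate action feature
            conv = self.conv1d[k](inter.t().unsqueeze(0))  # one-dimensional convolution over time
            pooled.append(conv.mean(dim=2).squeeze(0))     # temporal pooling per subsequence
        target_feat = torch.relu(self.fc(torch.cat(pooled)))  # merged target action feature
        return self.quality_head(target_feat), self.rotation_head(target_feat)

net = ActionUnderstandingNet()
quality, rotation_logits = net(torch.randn(32, 256))       # 32 frame features of width 256
print(quality.shape, rotation_logits.shape)                # torch.Size([1]) torch.Size([5])
```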
Referring to fig. 10, the first identification module 103 may include: a position information acquisition unit 1031, an action recognition unit 1032, a tag generation unit 1033;
a position information obtaining unit 1031, configured to track the key part of the target object in the individual motion action segment H to obtain at least two pieces of key part position information corresponding to the individual motion action segment H;
the action recognition unit 1032 is used for predicting action type prediction probabilities corresponding to each piece of key part position information in the at least two pieces of key part position information respectively, averaging the predicted at least two action type prediction probabilities to obtain an average action type prediction probability, and recognizing an action type corresponding to the single motion action segment H according to the average action type prediction probability;
a tag generating unit 1033, configured to, when an action type corresponding to each individual motion segment in the at least two individual motion segments is obtained, generate a first event tag corresponding to the continuous motion segment according to the action type corresponding to each individual motion segment.
For specific functional implementation manners of the location information obtaining unit 1031, the action identifying unit 1032 and the tag generating unit 1033, refer to steps S1031 to S1032 in the embodiment corresponding to fig. 5, which is not described herein again.
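The averaging step in the action recognition unit amounts to averaging one prediction probability vector per kind of key part position information and taking the most probable class; the action type names and probability values below are purely illustrative.

```python
# Toy sketch of averaging action type prediction probabilities per key part representation.
import numpy as np

ACTION_TYPES = ["axel", "lutz", "flip", "loop", "toe_loop", "salchow"]  # illustrative labels

def classify_individual_segment(per_feature_probs) -> str:
    """per_feature_probs: list of probability vectors, one per key part position
    representation (global, local, fused, displacement, ...)."""
    avg = np.mean(np.stack(per_feature_probs), axis=0)   # average action type prediction probability
    return ACTION_TYPES[int(np.argmax(avg))]

# Example with two made-up probability vectors over six action types.
p_global = np.array([0.05, 0.60, 0.10, 0.05, 0.15, 0.05])
p_local = np.array([0.10, 0.40, 0.20, 0.10, 0.10, 0.10])
print(classify_individual_segment([p_global, p_local]))  # -> lutz
```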
Referring to fig. 10, the state division unit 1021 may include: a flight sequence determination subunit 10211 and a landing sequence determination subunit 10212;
the flight sequence determination subunit 10211 is configured to obtain consecutive video frames Ti to Ti+n from the continuous jump action segment, where n is a positive integer and i+n is less than or equal to the total number of video frames in the continuous jump action segment; and, if the motion states corresponding to the consecutive video frames Ti to Ti+n-1 are all flight states, the motion states corresponding to the video frame Ti+n and the video frame Ti-1 are both landing states, and the number of video frames from Ti to Ti+n-1 is greater than a flight number threshold, to determine the consecutive video frames Ti to Ti+n-1 as a flight video frame sequence; at least two flight video frame sequences exist in the continuous jump action segment;
the landing sequence determination subunit 10212 is configured to determine the video frame sequences in the continuous jump action segment other than the at least two flight video frame sequences as at least three landing video frame sequences.
For specific functional implementation of the flight sequence determination subunit 10211 and the landing sequence determination subunit 10212, reference may be made to step S102 in the embodiment corresponding to fig. 3, which is not described herein again.
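A simplified sketch of the state division rule follows: runs of consecutive frames in the flight state that are longer than the threshold become flight video frame sequences, and everything else is treated as a landing video frame sequence. In this toy version short flight runs are not merged with their neighbouring landing runs, which is a simplification.

```python
# Toy sketch of dividing a continuous jump segment by per-frame motion state.
from itertools import groupby

def divide_by_motion_state(states, flight_count_threshold: int = 3):
    """states: per-frame motion state, each entry either 'flight' or 'landing'."""
    flight_runs, landing_runs, pos = [], [], 0
    for state, run in groupby(states):
        length = len(list(run))
        span = (pos, pos + length - 1)                 # (start_frame, end_frame) of this run
        if state == "flight" and length > flight_count_threshold:
            flight_runs.append(span)                   # a flight video frame sequence
        else:
            landing_runs.append(span)                  # counted as a landing video frame sequence
        pos += length
    return flight_runs, landing_runs

# Example: two flight phases separated and surrounded by landing phases.
states = ["landing"] * 10 + ["flight"] * 20 + ["landing"] * 15 + ["flight"] * 18 + ["landing"] * 8
print(divide_by_motion_state(states))
# ([(10, 29), (45, 62)], [(0, 9), (30, 44), (63, 70)])
```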
Referring to fig. 10 together, the position information obtaining unit 1031 may include: a first position information obtaining subunit 10311, a second position information obtaining subunit 10312, a third position information obtaining subunit 10313, a fourth position information obtaining subunit 10314 and a fifth position information obtaining subunit 10315;
a first position information obtaining subunit 10311, configured to track the key part of the target object in the individual motion action segment H to obtain the global coordinates of the key part, which serve as the global key part position information corresponding to the individual motion action segment H;
a second position information obtaining subunit 10312, configured to obtain the local coordinates of the key part according to the global key part position information, which serve as the local key part position information corresponding to the individual motion action segment H;
a third position information obtaining subunit 10313, configured to fuse the global coordinates corresponding to a target key part in the global key part position information with the local coordinates corresponding to the key parts other than the target key part in the local key part position information, to obtain the fused key part position information corresponding to the individual motion action segment H;
a fourth position information obtaining subunit 10314, configured to use the coordinate displacement of the global key part position information between every two adjacent video frames in the individual motion action segment H as the global displacement key part position information corresponding to the individual motion action segment H;
and a fifth position information obtaining subunit 10315, configured to use the coordinate displacement of the local key part position information between every two adjacent video frames in the individual motion action segment H as the local displacement key part position information corresponding to the individual motion action segment H.
For specific functional implementation manners of the first position information obtaining subunit 10311, the second position information obtaining subunit 10312, the third position information obtaining subunit 10313, the fourth position information obtaining subunit 10314, and the fifth position information obtaining subunit 10315, reference may be made to step S1031 in the embodiment corresponding to fig. 5, which is not described herein again.
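The five key part position representations listed above can be sketched as simple array operations on tracked coordinates. Interpreting the local coordinates as coordinates relative to one reference key part, and the array shapes used below, are assumptions made only for illustration.

```python
# Hypothetical sketch of the five key part position representations for one
# individual motion action segment; global_coords has shape (T, J, 2):
# T video frames, J key parts, (x, y) image coordinates.
import numpy as np

def key_part_representations(global_coords: np.ndarray, target_part: int = 0):
    local = global_coords - global_coords[:, [target_part], :]     # local coords relative to a reference key part
    fused = local.copy()
    fused[:, target_part, :] = global_coords[:, target_part, :]    # global coords for the target key part, local for the rest
    global_disp = np.diff(global_coords, axis=0)                   # displacement between adjacent frames (global)
    local_disp = np.diff(local, axis=0)                            # displacement between adjacent frames (local)
    return {
        "global": global_coords,
        "local": local,
        "fused": fused,
        "global_displacement": global_disp,
        "local_displacement": local_disp,
    }

reps = key_part_representations(np.random.rand(16, 17, 2))         # 16 frames, 17 key parts
print({name: rep.shape for name, rep in reps.items()})
```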
The embodiment of the application can acquire a motion event segment for a target object in a video to be processed. When the motion event segment is detected to include a continuous motion action segment, the continuous motion action segment can be split by identifying motion state segmentation points in it, thereby obtaining at least two individual motion action segments. Further, the key part of the target object can be tracked in each individual motion action segment to obtain the key part position information corresponding to each individual motion action segment, and a first event label corresponding to the continuous motion action segment can be identified according to the key part position information; the obtained first event label can be used for recording the action type corresponding to each individual motion action segment in the motion script information associated with the target object. As can be seen from the above, in the embodiment of the present application, the motion event segment for the target object can be automatically extracted from the video to be processed, and by splitting the continuous motion action segment in the motion event segment and extracting the key part position information, automatic and accurate classification of the continuous motion action segment can be realized while the corresponding action evaluation quality is generated, thereby improving the identification accuracy and identification efficiency for action types.
Further, please refer to fig. 11, which is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 11, the computer device 1000 may include: a processor 1001, a network interface 1004 and a memory 1005; in addition, the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 11, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 11, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring a motion event segment aiming at a target object in a video to be processed;
if the motion event segment comprises a continuous motion action segment, identifying a motion state segmentation point in the continuous motion action segment, and splitting the continuous motion action segment according to the motion state segmentation point to obtain at least two independent motion action segments;
respectively tracking the key part of the target object in each single motion action segment to obtain the key part position information corresponding to each single motion action segment, and identifying a first event label corresponding to the continuous motion action segment according to the key part position information; the first event label is used for recording the action type corresponding to each individual motion action segment in the motion script information associated with the target object.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the video data processing method in any one of the embodiments corresponding to fig. 3, fig. 5, fig. 7, and fig. 9a, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program executed by the aforementioned video data processing apparatus 1, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the video data processing method in any one of the embodiments corresponding to fig. 3, fig. 5, fig. 7, and fig. 9a can be executed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application.
The computer-readable storage medium may be an internal storage unit of the video data processing apparatus provided in any of the foregoing embodiments or of the computer device, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Further, here, it is to be noted that: embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided by any one of the corresponding embodiments of fig. 3, fig. 5, fig. 7, and fig. 9 a.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above disclosure is only intended to describe the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application; equivalent variations and modifications made in accordance with the present application still fall within the scope covered by the present application.

Claims (15)

1. A method of processing video data, comprising:
acquiring a motion event segment aiming at a target object in a video to be processed;
if the motion event segment comprises a continuous motion action segment, identifying a motion state segmentation point in the continuous motion action segment, and splitting the continuous motion action segment according to the motion state segmentation point to obtain at least two independent motion action segments;
respectively tracking the key part of the target object in each single motion action segment to obtain the position information of the key part corresponding to each single motion action segment, and identifying a first event label corresponding to the continuous motion action segment according to the position information of the key part; the first event label is used for recording the action type corresponding to each single motion action segment in the motion script information associated with the target object.
2. The method according to claim 1, wherein the obtaining of the motion event segment for the target object in the video to be processed comprises:
performing frame extraction on a video to be processed containing a target object to obtain a first video frame sequence;
inputting the first video frame sequence into a feature extraction network to obtain picture features corresponding to the first video frame sequence;
inputting the picture characteristics into an action segmentation network to obtain time sequence characteristics corresponding to the first video frame sequence; the timing feature comprises a start-stop timestamp corresponding to a motion event in the first sequence of video frames;
and extracting a motion event segment aiming at the motion event from the first video frame sequence according to the corresponding start-stop time stamp of the motion event in the time sequence characteristic.
3. The method according to claim 1, wherein the continuous motion action segment is a continuous jump action segment; the identifying a motion state segmentation point in the continuous motion action segment, and splitting the continuous motion action segment according to the motion state segmentation point to obtain at least two independent motion action segments comprises:
determining the motion state corresponding to each video frame in the continuous jump action segment; the motion state refers to the state of the target object in the motion process;
dividing the continuous jump action segment into at least two flight video frame sequences and at least three landing video frame sequences according to the motion state;
determining the middle moment of a target landing video frame sequence as a motion state segmentation point, and splitting the continuous jump action segment according to the motion state segmentation point to obtain at least two independent jump action segments; the target landing video frame sequence is a video frame sequence, among the at least three landing video frame sequences, other than the first landing video frame sequence and the last landing video frame sequence.
4. The method according to claim 3, wherein the motion states include a flight state and a landing state; the continuous jump action segment comprises a video frame Ti, i is a positive integer, and i is less than the total number of video frames in the continuous jump action segment;
the dividing the continuous jump action segment into at least two flight video frame sequences and at least three landing video frame sequences according to the motion state comprises:
obtaining consecutive video frames Ti to Ti+n from the continuous jump action segment; n is a positive integer, and i+n is less than or equal to the total number of video frames in the continuous jump action segment;
if the motion states corresponding to the consecutive video frames Ti to Ti+n-1 are all flight states, the motion states corresponding to the video frame Ti+n and the video frame Ti-1 are both landing states, and the number of video frames from Ti to Ti+n-1 is greater than a flight number threshold, determining the consecutive video frames Ti to Ti+n-1 as a flight video frame sequence; at least two flight video frame sequences exist in the continuous jump action segment;
and determining the video frame sequences in the continuous jump action segment other than the at least two flight video frame sequences as at least three landing video frame sequences.
5. The method according to claim 1, wherein the continuous motion action segment is a continuous jump action segment, and each individual motion action segment is an individual jump action segment; the first event tag further comprises an action evaluation quality and a number of flight rotations; the method further comprises:
respectively performing frame extraction on each single jumping motion segment to obtain at least two second video frame sequences;
inputting the at least two second video frame sequences into an action understanding network, and outputting the action evaluation quality and the number of flight rotations corresponding to the continuous jump action segments through the action understanding network;
the tracking the key part of the target object in each individual motion segment respectively to obtain the key part position information corresponding to each individual motion segment respectively, and identifying the first event label corresponding to the continuous motion segment according to the key part position information, including:
respectively tracking the key part of the target object in each single jump action fragment to obtain the position information of the key part corresponding to each single jump action fragment, and identifying the action type corresponding to each single jump action fragment according to the position information of the key part;
and generating a first event label corresponding to the continuous jump action segment according to the action evaluation quality, the number of flight rotations and the action type.
6. The method according to claim 5, wherein said at least two second sequences of video frames comprise a second sequence of video frames S; the action understanding network comprises m non-local components, wherein m is a positive integer; the inputting the at least two second video frame sequences into an action understanding network, and outputting the action evaluation quality and the number of flight rotations corresponding to the continuous jump action segments through the action understanding network, comprises:
splitting the second video frame sequence S into m subsequences, and respectively performing feature extraction on the m subsequences through the m non-local components to obtain m intermediate action features;
respectively performing one-dimensional convolution on the m intermediate motion characteristics to obtain m one-dimensional convolution motion characteristics;
merging the m one-dimensional convolution action features through a full connection layer in the action understanding network to obtain target action features, and outputting individual action evaluation quality and individual flight rotation turns corresponding to the target action features through a classification layer in the action understanding network;
and when the individual action evaluation quality and the individual number of flight rotations respectively corresponding to each second video frame sequence of the at least two second video frame sequences are obtained, generating the action evaluation quality and the number of flight rotations corresponding to the continuous jump action segment according to the individual action evaluation quality and the individual number of flight rotations respectively corresponding to each second video frame sequence.
7. The method of claim 5, further comprising:
marking the continuous jump action sample segments, uniformly extracting K video frames from the marked continuous jump action sample segments, extracting continuous P video frames from the K video frames as an input sequence, inputting the input sequence into an initial action understanding network, and outputting predicted action evaluation quality and predicted flight rotation number corresponding to the continuous jump action sample segments through the initial action understanding network; p is equal to the length of model input data corresponding to the initial action understanding network;
and generating a quality loss function according to the predicted action evaluation quality, generating an action loss function according to the predicted flight rotation number, generating a target loss function according to the quality loss function and the action loss function, and correcting model parameters in the initial action understanding network through the target loss function to obtain an action understanding network.
8. The method according to claim 1, wherein the information quantity of the key part position information is at least two; the at least two separate motion segments comprise a separate motion segment H; the tracking the key part of the target object in each individual motion action segment respectively to obtain the key part position information corresponding to each individual motion action segment respectively, and identifying a first event label corresponding to the continuous motion action segment according to the key part position information, includes:
tracking key parts of the target object in the single motion action segment H to obtain position information of at least two key parts corresponding to the single motion action segment H;
predicting action type prediction probabilities corresponding to each piece of key part position information in the at least two pieces of key part position information respectively, averaging the predicted at least two action type prediction probabilities to obtain an average action type prediction probability, and identifying an action type corresponding to the single movement action segment H according to the average action type prediction probability;
and when the action type corresponding to each single motion action segment in the at least two single motion action segments is obtained, generating a first event label corresponding to the continuous motion action segment according to the action type corresponding to each single motion action segment.
9. The method of claim 8, wherein the at least two key site location information comprises global key site location information, local key site location information, fusion key site location information, global displacement key site location information, and local displacement key site location information; the tracking the key parts of the target object in the single motion action segment H to obtain the position information of at least two key parts corresponding to the single motion action segment H includes:
tracking a key part of the target object in the single motion action segment H to obtain a global coordinate of the key part, wherein the global coordinate is used as global key part position information corresponding to the single motion action segment H;
obtaining the local coordinates of the key part according to the global key part position information, and using the local coordinates as the local key part position information corresponding to the single motion action segment H;
fusing global coordinates corresponding to a target key part in the global key part position information with local coordinates corresponding to key parts except the target key part in the local key part position information to obtain fused key part position information corresponding to the single motion action segment H;
coordinate displacement of the global key part position information between every two adjacent video frames in the single motion action segment H is used as global displacement key part position information corresponding to the single motion action segment H;
and taking the coordinate displacement of the local key part position information between every two adjacent video frames in the single motion action segment H as the local displacement key part position information corresponding to the single motion action segment H.
10. The method of claim 1, further comprising:
if the motion event segment comprises a rotation action segment, performing frame extraction on the rotation action segment to obtain a third video frame sequence;
inputting the third video frame sequence into an action understanding network, and identifying a second event label corresponding to the rotating action segment through a non-local component in the action understanding network; the second event label is used for recording the action type and the action evaluation quality corresponding to the rotating action segment in the motion script information associated with the target object.
11. The method of any one of claims 1-10, further comprising:
if the motion event segment comprises a discontinuous jump motion segment, performing frame extraction on the discontinuous jump motion segment to obtain a fourth video frame sequence, and inputting the fourth video frame sequence into a motion understanding network to obtain motion evaluation quality and flight rotation number corresponding to the discontinuous jump motion segment;
tracking a key part of the target object in the discontinuous jump action segment to obtain key part position information corresponding to the discontinuous jump action segment, and identifying an action type corresponding to the discontinuous jump action segment according to the key part position information;
determining the action evaluation quality, the number of flight rotations and the action type corresponding to the discontinuous jump action segment as a third event label corresponding to the discontinuous jump action segment; and the third event label is used for recording the action evaluation quality, the number of flight rotations and the action type corresponding to the discontinuous jump action segment in the motion script information associated with the target object.
12. The method of claim 2, further comprising:
if the motion event segment comprises an entry event segment, taking a start-stop timestamp corresponding to the entry event segment as a fourth event tag corresponding to the entry event segment; the fourth event tag is used for recording a start timestamp and an end timestamp corresponding to the entry event segment in the motion script information associated with the target object;
if the motion event segment comprises a field withdrawal event segment, taking a start-stop timestamp corresponding to the field withdrawal event segment as a fifth event tag corresponding to the field withdrawal event segment; the fifth event tag is used for recording a start timestamp and an end timestamp corresponding to the field withdrawal event segment in the motion script information associated with the target object;
if the motion event segment comprises a score display event segment, taking a start-stop timestamp corresponding to the score display event segment as a sixth event tag corresponding to the score display event segment; the sixth event tag is used for recording a start time stamp and an end time stamp corresponding to the score display event segment in the motion script information associated with the target object.
13. The method of claim 2, further comprising:
sequentially arranging the first event labels corresponding to the continuous motion action segments in the motion script information associated with the target object according to the time sequence of the start-stop timestamps;
intercepting important event fragments from the video to be processed according to the corresponding start-stop timestamps of the motion events, and splicing the important event fragments into a video collection; the motion event segment belongs to the important event segment.
14. A computer device, comprising: a processor, a memory, and a network interface;
the processor is coupled to the memory and the network interface, wherein the network interface is configured to provide data communication functionality, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method of any of claims 1-13.
15. A computer-readable storage medium, in which a computer program is stored which is adapted to be loaded by a processor and to carry out the method of any one of claims 1 to 13.
CN202011580149.3A 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium Pending CN113515998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011580149.3A CN113515998A (en) 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium

Publications (1)

Publication Number Publication Date
CN113515998A true CN113515998A (en) 2021-10-19

Family

ID=78060292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011580149.3A Pending CN113515998A (en) 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN113515998A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214330A (en) * 2018-08-30 2019-01-15 北京影谱科技股份有限公司 Video Semantic Analysis method and apparatus based on video timing information
CN110309784A (en) * 2019-07-02 2019-10-08 北京百度网讯科技有限公司 Action recognition processing method, device, equipment and storage medium
CN110427807A (en) * 2019-06-21 2019-11-08 诸暨思阔信息科技有限公司 A kind of temporal events motion detection method
CN110472554A (en) * 2019-08-12 2019-11-19 南京邮电大学 Table tennis action identification method and system based on posture segmentation and crucial point feature
CN110765854A (en) * 2019-09-12 2020-02-07 昆明理工大学 Video motion recognition method
CN111523510A (en) * 2020-05-08 2020-08-11 国家邮政局邮政业安全中心 Behavior recognition method, behavior recognition device, behavior recognition system, electronic equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638106A (en) * 2022-03-21 2022-06-17 中国民用航空飞行学院 Radar control simulation training method based on Internet
CN114638106B (en) * 2022-03-21 2023-05-02 中国民用航空飞行学院 Radar control simulation training method based on Internet
CN115086759A (en) * 2022-05-13 2022-09-20 北京达佳互联信息技术有限公司 Video processing method, video processing device, computer equipment and medium
CN115798040A (en) * 2022-11-23 2023-03-14 广州市锐星信息科技有限公司 Automatic segmentation system for cardio-pulmonary resuscitation AI
CN115798040B (en) * 2022-11-23 2023-06-23 广州市锐星信息科技有限公司 Automatic segmentation system of cardiopulmonary resuscitation AI

Similar Documents

Publication Publication Date Title
CN112565825B (en) Video data processing method, device, equipment and medium
CN110781347B (en) Video processing method, device and equipment and readable storage medium
CN110784759B (en) Bullet screen information processing method and device, electronic equipment and storage medium
US9118886B2 (en) Annotating general objects in video
CN113515998A (en) Video data processing method and device and readable storage medium
US20200089661A1 (en) System and method for providing augmented reality challenges
CN112015949B (en) Video generation method and device, storage medium and electronic equipment
US9047376B2 (en) Augmenting video with facial recognition
KR101816113B1 (en) Estimating and displaying social interest in time-based media
CN113515997B (en) Video data processing method and device and readable storage medium
CN113766299B (en) Video data playing method, device, equipment and medium
CN113518256A (en) Video processing method and device, electronic equipment and computer readable storage medium
CN113542777A (en) Live video editing method and device and computer equipment
CN111586466B (en) Video data processing method and device and storage medium
CN113365147A (en) Video editing method, device, equipment and storage medium based on music card point
CN114339360B (en) Video processing method, related device and equipment
WO2014100936A1 (en) Method, platform, and system for manufacturing associated information library of video and for playing video
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN113822254B (en) Model training method and related device
CN111126390A (en) Correlation method and device for identifying identification pattern in media content
CN108540817B (en) Video data processing method, device, server and computer readable storage medium
CN110287934B (en) Object detection method and device, client and server
CN112287771A (en) Method, apparatus, server and medium for detecting video event
CN116049490A (en) Material searching method and device and electronic equipment
CN114449346B (en) Video processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination