CN113515997A - Video data processing method and device and readable storage medium - Google Patents


Info

Publication number
CN113515997A
Authority
CN
China
Prior art keywords: motion, segment, action, video, motion segment
Prior art date
Legal status
Granted
Application number
CN202011580130.9A
Other languages
Chinese (zh)
Other versions
CN113515997B (en)
Inventor
袁微
赵天昊
田思达
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011580130.9A
Publication of CN113515997A
Application granted
Publication of CN113515997B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/25 Fusion techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application discloses a video data processing method, a device, and a readable storage medium. The method includes: identifying the shot type of video frames that contain a target object in a video to be processed, obtaining motion segments from the video to be processed, and dividing the motion segments into normal motion segments and playback motion segments according to the shot type; determining the shot type as the first motion tag corresponding to a motion segment, the first motion tag being used to record, in the motion script information associated with the target object, the shot type corresponding to that motion segment; and performing multi-task action recognition on the normal motion segment to obtain the second motion tag corresponding to the normal motion segment and associating the second motion tag with the related playback motion segment, the second motion tag being used to record, in the motion script information, the action type and action evaluation quality corresponding to the motion segment. By accurately distinguishing normal motion segments from playback motion segments, the method and device reduce the waste of computing resources.

Description

Video data processing method and device and readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for processing video data, and a readable storage medium.
Background
With the rapid development of internet technology and the growth of computing power, the performance of video processing technology has improved greatly.
Video processing technology can classify, detect, and analyze videos; it is a challenging topic in computer vision and has received wide attention in both academia and industry. Existing artificial intelligence recognition techniques can recognize all motion segments in a video, but a diving-competition video often contains playback motion segments, and the object shown in a playback motion segment is the same as that in the corresponding normal-shot motion segment, so the same object is recognized repeatedly and computing resources are wasted.
Disclosure of Invention
The embodiments of the application provide a video data processing method, a video data processing device, and a readable storage medium, which can reduce the waste of computing resources by accurately distinguishing normal motion segments from playback motion segments.
An embodiment of the present application provides a video data processing method, including:
identifying the shot type of video frames that contain a target object in a video to be processed, acquiring motion segments in the video to be processed, and dividing the motion segments into normal motion segments and playback motion segments according to the shot type;
determining the shot type as the first motion tag corresponding to the motion segment; the first motion tag is used for recording, in the motion script information associated with the target object, the shot type corresponding to the motion segment;
performing multi-task action recognition on the normal motion segment, acquiring the second motion tag corresponding to the normal motion segment, and associating the second motion tag with the playback motion segment; the second motion tag is used for recording, in the motion script information, the action type and action evaluation quality corresponding to the motion segment.
An aspect of an embodiment of the present application provides a video data processing apparatus, including:
the shot identification module, used for identifying the shot type of video frames that contain a target object in a video to be processed, acquiring motion segments in the video to be processed, and dividing the motion segments into normal motion segments and playback motion segments according to the shot type;
the first tag determining module, used for determining the shot type as the first motion tag corresponding to the motion segment; the first motion tag is used for recording, in the motion script information associated with the target object, the shot type corresponding to the motion segment;
the second tag determining module, used for performing multi-task action recognition on the normal motion segment, acquiring the second motion tag corresponding to the normal motion segment, and associating the second motion tag with the playback motion segment; the second motion tag is used for recording, in the motion script information, the action type and action evaluation quality corresponding to the motion segment.
Wherein, the shot identification module includes:
the first frame extracting unit is used for extracting frames of a video to be processed containing a target object to obtain a first video frame sequence;
the image classification unit is used for inputting the first video frame sequence into an image classification network and outputting, through the image classification network, the shot type corresponding to each video frame in the first video frame sequence; the shot types include a normal shot type and a playback shot type;
and the segment dividing unit is used for acquiring a start-stop time sequence corresponding to the motion segment in the first video frame sequence, and dividing the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type.
Wherein, the segment dividing unit includes:
the feature extraction subunit is used for inputting the first video frame sequence into a feature coding network and outputting, through the feature coding network, the picture features corresponding to the first video frame sequence;
the time sequence acquisition subunit is used for inputting the picture features into a temporal action segmentation network and outputting, through the temporal action segmentation network, a start-stop time sequence used to identify the motion segments in the first video frame sequence; the start-stop time sequence includes at least two start timestamps T1 and at least two end timestamps T2;
a segment determining subunit, configured to obtain at least two second video frame sequences in the first video frame sequence according to the at least two start timestamps T1 and the at least two end timestamps T2, and determine each of the at least two second video frame sequences as a motion segment; if the shot type corresponding to the motion segment is a normal shot type, determining the motion segment as a normal motion segment; and if the shot type corresponding to the motion segment is the playback shot type, determining the motion segment as a playback motion segment.
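For illustration, the pairing of start/end timestamps and the shot-type-based split described above might look like the following Python sketch. The function split_motion_segments, the majority vote over per-frame shot types, and the dictionary layout are assumptions, since the text only states that a segment's shot type decides the division.

```python
from typing import List, Tuple

def split_motion_segments(
    frame_times: List[float],          # timestamp of each frame in the first video frame sequence
    shot_types: List[str],             # per-frame shot type: "normal" or "playback"
    start_stop_pairs: List[Tuple[float, float]],  # (T1, T2) pairs from the temporal action segmentation network
):
    """Divide motion segments into normal and playback segments (hypothetical helper)."""
    normal_segments, playback_segments = [], []
    for t1, t2 in start_stop_pairs:
        # collect the indices of frames that fall inside [T1, T2] -> one second video frame sequence
        idx = [i for i, t in enumerate(frame_times) if t1 <= t <= t2]
        if not idx:
            continue
        # majority vote over the per-frame shot types decides the segment's shot type (an assumption)
        n_playback = sum(1 for i in idx if shot_types[i] == "playback")
        segment = {"start": t1, "end": t2, "frames": idx}
        if n_playback > len(idx) / 2:
            playback_segments.append(segment)
        else:
            normal_segments.append(segment)
    return normal_segments, playback_segments
```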
The first tag determining module is specifically used for identifying, through the image classification network, the shot angle type corresponding to a playback motion segment when the motion segment is a playback motion segment; the shot angle type includes any one of a front view angle type, a side-front view angle type, a side-rear view angle type, or a top view angle type; and determining the shot angle type as the first motion tag corresponding to the playback motion segment.
Wherein the second tag determination module comprises:
the second frame extracting unit is used for extracting N video frames from the normal motion segment to form a third video frame sequence; n is a positive integer;
the action understanding unit is used for splitting the third video frame sequence into M subsequences and inputting the M subsequences into a multi-task action understanding network, where the multi-task action understanding network contains M non-local components and M is a positive integer; performing feature extraction on the M subsequences through the M non-local components (one non-local component per subsequence) to obtain M intermediate action features; fusing the M intermediate action features into a fused feature and performing a temporal operation on the fused feature through a one-dimensional convolution layer in the multi-task action understanding network to obtain the time-sequence feature corresponding to the normal motion segment; inputting the time-sequence feature into a fully connected layer in the multi-task action understanding network and feeding the feature data output by the fully connected layer into a score regressor and a label classifier respectively; and outputting the action evaluation quality corresponding to the normal motion segment through the score regressor, outputting the action type corresponding to the normal motion segment through the label classifier, and taking the action evaluation quality and the action type as the second motion tag corresponding to the normal motion segment. A structural sketch of such a network is given below.
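Below is a minimal PyTorch-style sketch of a network with the shape described above: parameter-sharing branches for the M subsequences, feature fusion, one-dimensional temporal convolutions, and a fully connected layer feeding a score regressor and a label classifier. It is an illustrative reconstruction, not the patented model; in particular, a simple 3-D convolutional block stands in for the non-local components, all layer sizes are assumptions, and in practice one classifier head per action attribute (take-off manner, rotation count, and so on) would be attached.

```python
import torch
import torch.nn as nn

class MultiTaskActionNet(nn.Module):
    """Illustrative sketch: M parameter-sharing branches, fused features,
    1-D temporal convolutions, and two heads (score regressor, label classifier).
    The backbone and all layer sizes are assumptions."""

    def __init__(self, m_branches: int = 4, feat_dim: int = 256, num_action_classes: int = 10):
        super().__init__()
        self.m = m_branches
        # A single shared feature extractor plays the role of the M
        # parameter-sharing non-local components described in the text.
        self.branch = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((1, 1, 1)),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # one-dimensional convolution layers for the temporal operation
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        self.fc = nn.Linear(feat_dim, feat_dim)
        self.score_regressor = nn.Linear(feat_dim, 1)                     # action evaluation quality
        self.label_classifier = nn.Linear(feat_dim, num_action_classes)  # action type

    def forward(self, clips):
        # clips: list of M tensors, each of shape (B, 3, N/M, H, W) -- one subsequence per branch
        feats = [self.branch(c) for c in clips]      # M intermediate action features, each (B, feat_dim)
        fused = torch.stack(feats, dim=2)            # fused feature, shape (B, feat_dim, M)
        temporal = self.temporal(fused).mean(dim=2)  # time-sequence feature, shape (B, feat_dim)
        h = torch.relu(self.fc(temporal))
        return self.score_regressor(h).squeeze(-1), self.label_classifier(h)
```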
The second tag determining module is specifically configured to take, according to the start-stop time sequence corresponding to the motion segments, the normal motion segment that precedes a playback motion segment as the target segment, and to acquire the second motion tag corresponding to the target segment as the second motion tag corresponding to the playback motion segment.
Wherein, the video data processing device further comprises:
the segment clipping module is used for clipping important event segments from the video to be processed according to the start-stop time sequence corresponding to the motion segments; the motion segments belong to the important event segments;
and the segment splicing module is used for generating the tag text corresponding to an important event segment according to the association between the important event segment and the motion segment and the first and second motion tags corresponding to the motion segment, adding the tag text to the important event segment, and splicing the important event segments with added tag text into a video collection.
Wherein the tag text comprises a plurality of sub-tags having different tag types;
the segment splicing module is specifically used for acquiring a tag fusion template, where the tag fusion template includes a filling position for each of at least two tag types; adding each sub-tag in the tag text to the filling position corresponding to its tag type to obtain the filled tag fusion template; and performing pixel fusion of the filled tag fusion template and the important event segment to obtain the annotated important event segment, and splicing the annotated important event segments into the video collection. A minimal illustration of the pixel-fusion step is sketched below.
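The pixel-fusion step can be pictured as rendering each sub-tag at the filling position of its tag type and overlaying the result on the frames of the important event segment. The OpenCV sketch below is only an illustration; the fixed label positions and the overlay_labels helper are assumptions.

```python
import cv2

# Hypothetical template: one filling position (x, y) per tag type.
LABEL_POSITIONS = {
    "shot_type": (40, 60),
    "action_type": (40, 100),
    "score": (40, 140),
}

def overlay_labels(frame, sub_labels):
    """Draw each sub-label at the filling position of its tag type (illustrative)."""
    out = frame.copy()
    for label_type, text in sub_labels.items():
        pos = LABEL_POSITIONS.get(label_type)
        if pos is None:
            continue
        cv2.putText(out, text, pos, cv2.FONT_HERSHEY_SIMPLEX,
                    0.8, (255, 255, 255), 2, cv2.LINE_AA)
    return out
```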
Wherein, the video data processing device further comprises:
the first training module is used for labeling the action evaluation quality label and the action type label corresponding to each target action segment in a motion sample segment; uniformly extracting K video frames from the labeled motion sample segment, taking N consecutive video frames from the K video frames as an input sequence, and feeding the input sequence into an initial action understanding network, where K and N are positive integers, K is greater than N, and N equals the input length expected by the initial action understanding network; processing N/M consecutive video frames in each of the M non-local components of the initial action understanding network and outputting the predicted action evaluation quality and predicted action type for each target action segment, where M is a positive integer and N is an integer multiple of M; generating a quality loss function from the predicted action evaluation quality and the action evaluation quality label, and an action loss function from the predicted action type and the action type label; and generating a target loss function from the quality loss function and the action loss function, and correcting the model parameters of the initial action understanding network through the target loss function to obtain the multi-task action understanding network. A sketch of such a combined objective follows.
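A minimal sketch of the combined objective described above, assuming mean-squared error for the quality loss, cross-entropy for the action loss, and a simple weighted sum for the target loss; the actual loss functions and weighting are not fixed by the description above.

```python
import torch.nn.functional as F

def multitask_loss(pred_score, score_label, pred_logits, action_label, alpha: float = 1.0):
    """Target loss = quality loss + alpha * action loss (the weighting is an assumption)."""
    quality_loss = F.mse_loss(pred_score, score_label)        # action evaluation quality
    action_loss = F.cross_entropy(pred_logits, action_label)  # action type
    return quality_loss + alpha * action_loss
```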
Wherein, the video data processing device further comprises:
the second training module is used for labeling the shot type label corresponding to each target action segment in the motion sample segment; performing frame extraction on the labeled motion sample segment to obtain a fourth video frame sequence; inputting the fourth video frame sequence into a lightweight convolutional neural network and outputting, through the lightweight convolutional neural network, the predicted shot type corresponding to each target action segment; and generating a shot loss function from the predicted shot type and the shot type label, and correcting the model parameters of the lightweight convolutional neural network through the shot loss function to obtain the image classification network. A minimal training-step sketch is given below.
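Training the image classification network is a standard image-classification fine-tuning problem; a minimal training step might look like the following, with torchvision's MobileNetV2 standing in for the lightweight convolutional neural network and the class count chosen arbitrarily.

```python
import torch
import torch.nn as nn
from torchvision import models

num_shot_classes = 5  # e.g. normal shot plus several playback shot angles (assumed)
model = models.mobilenet_v2(num_classes=num_shot_classes)  # lightweight CNN stand-in
criterion = nn.CrossEntropyLoss()                          # shot loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(frames, shot_labels):
    """frames: (B, 3, H, W) video frames; shot_labels: (B,) shot-type labels."""
    optimizer.zero_grad()
    loss = criterion(model(frames), shot_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```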
Wherein, the video data processing device further comprises:
the third training module is used for labeling the start-stop timestamp label corresponding to each target action segment in the motion sample segment; splitting the labeled motion sample segment into S sample sub-segments according to a splitting rule and performing frame extraction on the S sample sub-segments to obtain a fifth video frame sequence, S being a positive integer; updating the start-stop timestamp labels according to the duration of each sample sub-segment to obtain updated start-stop timestamp labels; inputting the fifth video frame sequence into the feature coding network to obtain the picture feature matrix corresponding to each sample sub-segment; inputting the picture feature matrices into an initial action segmentation network and outputting, through the initial action segmentation network, the predicted start-stop timestamps of each target action segment in each sample sub-segment; and generating a time loss function from the predicted start-stop timestamps and the updated start-stop timestamp labels, and correcting the model parameters of the initial action segmentation network through the time loss function to obtain the temporal action segmentation network. The label-update step is sketched below.
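The label-update step, re-expressing the absolute start-stop timestamp labels relative to each sample sub-segment, might be written as follows; the clipping of actions that straddle a sub-segment boundary is an assumption.

```python
def update_timestamp_labels(action_labels, subsegment_start, subsegment_end):
    """Re-express absolute start/stop timestamp labels relative to one sample
    sub-segment (illustrative; the clipping behaviour is an assumption)."""
    updated = []
    for t_start, t_end in action_labels:
        # keep only target action segments that overlap this sub-segment
        if t_end <= subsegment_start or t_start >= subsegment_end:
            continue
        updated.append((
            max(t_start, subsegment_start) - subsegment_start,
            min(t_end, subsegment_end) - subsegment_start,
        ))
    return updated
```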
The video to be processed is a diving competition video and the target object is a competitor. The second tag determining module is specifically used for inputting the normal motion segment into the multi-task action understanding network to obtain the take-off manner, arm-strength dive attribute, rotation posture, number of rotations, number of side rotations, and action evaluation quality corresponding to the normal motion segment; determining the take-off manner, arm-strength dive attribute, rotation posture, number of rotations, and number of side rotations as the action type corresponding to the normal motion segment; and determining the action type and the action evaluation quality as the second motion tag corresponding to the normal motion segment, the second motion tag being used to record, in the motion script information associated with the competitor, the action type and action evaluation quality corresponding to the normal motion segment.
An aspect of an embodiment of the present application provides a computer device, including: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing a computer program, and the processor is used for calling the computer program to execute the method in the embodiment of the present application.
An aspect of the present embodiment provides a computer-readable storage medium, in which a computer program is stored, where the computer program is adapted to be loaded by a processor and to execute the method in the present embodiment.
An aspect of the embodiments of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, and a processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the method in the embodiments of the present application.
By identifying the shot type of the video frames that contain the target object in the video to be processed and acquiring the motion segments in the video, the embodiments of the application can accurately separate normal motion segments from playback motion segments: the motion segments are divided into normal motion segments and playback motion segments according to the shot type, only the normal motion segments are fed into the multi-task action understanding network for recognition, and the corresponding action type and action evaluation quality are obtained. The shot type serves as the first motion tag, the action type and action evaluation quality serve as the second motion tag, and the motion script information is generated from the first and second motion tags. Because only the normal motion segments need to be recognized, the waste of computing resources is reduced; because the motion script information is generated automatically rather than registered manually, a great deal of manpower is saved and the generation efficiency of the motion script information is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a system architecture diagram according to an embodiment of the present application;
fig. 2 is a schematic view of a video data processing scene provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a video data processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a video data processing method according to an embodiment of the present application;
fig. 5 a-5 b are schematic flow charts illustrating a video data processing method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a multitask action understanding network according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Its basic technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Its software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of making machines "see": it uses cameras and computers in place of human eyes to identify, track, and measure targets, and further processes the resulting images into a form more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include data processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence, deep learning technology and other technologies, and the specific process is explained by the following embodiment.
Please refer to fig. 1, which is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 1, the system architecture may include a server 10a and a terminal cluster, and the terminal cluster may include terminal device 10b, terminal device 10c, ..., and terminal device 10d. Communication connections may exist among the terminal devices in the cluster; for example, terminal device 10b may be connected to terminal device 10c and to terminal device 10d. Meanwhile, any terminal device in the cluster may have a communication connection with the server 10a; for example, a communication connection exists between terminal device 10b and the server 10a. The connection manner is not limited: devices may be connected directly or indirectly by wired communication, directly or indirectly by wireless communication, or in other manners, which is not limited in this application.
It should be understood that each terminal device in the terminal cluster shown in fig. 1 may be installed with an application client, and when the application client runs on a terminal device it may exchange data with the server 10a shown in fig. 1. The application client may be a multimedia client (e.g., a video client), a social client, an entertainment client (e.g., a game client), an education client, a live-streaming client, or another client with a frame-sequence (e.g., frame-animation) loading and playing function. The application client may be an independent client or an embedded sub-client integrated into another client (e.g., a multimedia client, a social client, or an education client), which is not limited here. The server 10a provides services for the terminal cluster through its communication function. When a terminal device (which may be terminal device 10b, 10c, or 10d) acquires a video segment A and needs to process it, for example to clip important event segments (such as highlight segments) from video segment A and to classify and label them, the terminal device may send video segment A to the server 10a through the application client. After the server 10a receives video segment A, it may identify the shot types of all video frames that contain the target object in video segment A and obtain a motion segment B from video segment A according to the identified start-stop time sequence (for example, the segment in which a competitor performs a diving action in a diving-competition video), so that motion segment B can be divided into a normal motion segment C and a playback motion segment D according to the shot type. The server 10a may then determine the shot type as the first motion tag corresponding to motion segment B, which can subsequently be used to record the shot type corresponding to motion segment B in the motion script information associated with the target object. Further, the server 10a may perform multi-task action recognition on the normal motion segment C to obtain the second motion tag corresponding to the normal motion segment C and associate the second motion tag with the playback motion segment D, so that the action type and the action evaluation quality corresponding to motion segment B can be recorded in the motion script information according to the second motion tag. Finally, the motion script information can be generated from the first and second motion tags, the important event segments can be clipped from video segment A and spliced into a video collection according to the start-stop time sequence corresponding to motion segment B, and the motion script information and the video collection can be returned to the application client of the terminal device, which displays them on its screen after receiving them from the server 10a.
The motion script information may further include information such as the start timestamp and end timestamp corresponding to motion segment B; here the description takes shot type identification, action type identification, and action quality evaluation as examples. It should be noted that the number of motion segments B, normal motion segments C, and playback motion segments D may each be one or more, which is not limited in this application. It should be understood that a normal motion segment and a playback motion segment both capture the same live action, but their shooting angles and playback speeds may differ: the shooting angle of a normal motion segment is usually fixed, the shooting angle of a playback motion segment may vary (e.g., a top view, a front view, etc.), and the playback speed of a playback motion segment is lower than that of a normal motion segment.
It is understood that the method provided by the embodiment of the present application can be executed by a computer device, including but not limited to a terminal device or a server. The server 10a in the embodiment of the present application may be a computer device. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a palm computer, a Mobile Internet Device (MID), a wearable device (e.g., a smart watch, a smart bracelet, etc.), a smart television, a smart car, or the like. The number of the terminal devices and the number of the servers are not limited, and the terminal devices and the servers may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
It can be understood that the present application can be applied to various forms of video clipping schemes; the following description takes the processing of a diving-competition video segment as an example. Please refer to fig. 2, which is a schematic diagram of a video data processing scene provided by an embodiment of the present application. As shown in fig. 2, the computer device implementing this scene may include modules such as shot type identification, start-stop time sequence acquisition, and multi-task action recognition; these modules may run on the server 10a shown in fig. 1, or on the terminal device 10b, 10c, or 10d shown in fig. 1. As shown in fig. 2, the terminal user uploads a video segment 20b of a diving competition as the video to be processed through the terminal device 20a, and finally obtains a video collection spliced from multiple important event segments in video segment 20b together with the corresponding script. First, the terminal device 20a may display prompt information on the screen so that the terminal user can complete the relevant operations; for example, the text prompt "drag file to here, or click to upload" may be displayed in area 201a, indicating that the terminal user may upload a locally stored video to be processed by dragging it into area 201a, or by clicking the "click to upload" control and selecting the local video. If the video to be processed is not stored locally, as shown in area 202a of terminal device 20a, the terminal user may input a resource link of the video to be processed (for example, by copying and pasting the link into the corresponding input box), and the server 20c can later fetch the video to be processed through the resource link. The link may be in URL format, HTML format, UBB format, etc.; the terminal user may select the format according to the resource link of the video to be processed, the format options may be extended or reduced according to the actual service scenario, or the terminal device 20a may automatically identify the resource link entered by the terminal user, which is not limited in this embodiment. Assuming that the terminal user selects a local video segment 20b for processing, the terminal device 20a may respond to a triggering operation (e.g., a click) on the "confirm upload" control 203a by sending a video processing request to the server 20c (corresponding to the server 10a in fig. 1) and, at the same time, sending video segment 20b to the server 20c. Video segment 20b contains pictures of a target object (here, a competitor), and the number of target objects may be one or more, which is not limited here.
After receiving the video processing request sent by the terminal device 20a, the server 20c may call a relevant interface (e.g., a web interface, or another form, which is not limited in this embodiment) to pass the video segment 20b to the video processing model. The video processing model may include the modules mentioned above, such as shot type identification, start-stop time sequence acquisition, and multi-task action recognition; it can automatically cut out all the diving segments contained in a long video segment and complete scoring and labeling. Specifically, the video processing model first performs frame extraction on video segment 20b to obtain a first video frame sequence 20d, and then identifies the shot type corresponding to each video frame in the first video frame sequence 20d. At the same time, it performs feature extraction on the first video frame sequence 20d, extracting the picture feature corresponding to each video frame to form a picture feature sequence, from which a start-stop time sequence identifying the motion segments in the first video frame sequence 20d is obtained. The start-stop time sequence is the sequence formed by the start-stop timestamps of each motion segment, where a pair of start-stop timestamps consists of a start timestamp and an end timestamp. Further, several shorter video frame sequences can be extracted from the first video frame sequence 20d as motion segments according to the start-stop time sequence, and the obtained motion segments can then be divided into normal motion segments and playback motion segments according to the shot type. When a motion segment is a playback motion segment, the server 20c can further identify the shot angle type corresponding to the playback motion segment as part of its shot type, and the shot type corresponding to each motion segment can then be used as the first motion tag of that segment. In addition, according to the start-stop time sequence of the obtained motion segments, corresponding segments can be clipped out of video segment 20b; for ease of understanding, the clipped segments are called important event segments. A relevant interface can then be called to splice the important event segments into a video collection 20f and return it to the terminal device of the terminal user. In fact, since the motion segments are extracted from the first video frame sequence 20d obtained by frame extraction, the motion segments can be regarded as a subset of the important event segments, where an important event (also called a motion event) refers to an event whose picture contains the target object, or an event with special meaning even though its picture does not contain the target object. The specific meaning of an important event may differ across scenes; for example, in a diving competition, an important event may refer to a diving event, that is, the moment when a competitor performs a diving action.
Using a video database containing massive videos, the computer device can train deep neural networks to build the modules for shot type identification, start-stop time sequence acquisition, multi-task action recognition, and so on; the specific training process is described in the following embodiments.
After the normal motion segments and the playback motion segments are obtained through the above process, more detailed information (for example, the action types) still needs to be acquired, so the motion segments also need to be recognized and analyzed. In a diving competition, the views of different diving actions in the normal motion segments are consistent and their durations are close, so, to avoid wasting computing resources on repeatedly recognizing the same content, action recognition and score prediction only need to be performed on the normal motion segments, and each playback motion segment reuses the result of its corresponding normal motion segment; the server therefore performs multi-task action recognition on the normal motion segments first. As shown in fig. 2, assume that 9 motion segments are obtained after segment division, namely motion segment D1, motion segment D2, ..., motion segment D8, and motion segment D9, where motion segments D1, D4, and D7 are normal motion segments, motion segments D2 and D3 are playback motion segments related to D1, motion segments D5 and D6 are playback motion segments related to D4, and motion segments D8 and D9 are playback motion segments related to D7. The server 20c may input motion segments D1, D4, and D7 into a multi-task action understanding network (i.e., the multi-task action recognition module, a network that performs action quality evaluation and multi-label action classification at the same time), and the network outputs the action type and the action evaluation quality (i.e., the predicted score) of each of the motion segments D1, D4, and D7, from which the second motion tag corresponding to each motion segment can be obtained. In a diving competition scene, the action types may include, but are not limited to, the take-off manner, the arm-strength dive attribute, the rotation posture, the number of rotations, and the number of side rotations.
Each motion segment has its corresponding motion tags (the first motion tag and the second motion tag). The following takes the motion tags corresponding to motion segments D1, D2, and D3 as an example; the tags of the other motion segments are generated in the same way. As mentioned above, motion segment D1 is a normal motion segment, so its first motion tag is its shot type, which may be labeled, for example, "normal shot"; a "normal shot" can be understood as the video sequence captured by the live camera. Assume that the take-off manner corresponding to motion segment D1 is "knee-hold", the arm-strength dive attribute is "non-arm-strength dive", the rotation posture is "jump forward facing the pool", the number of rotations is 4.5, the number of side rotations is 0, and the action evaluation quality is 97.349174 points; then "knee-hold", "non-arm-strength dive", "jump forward facing the pool", 4.5 rotations, 0 side rotations, and 97.349174 points together form the second motion tag corresponding to motion segment D1. For the playback motion segments related to D1 (i.e., motion segments D2 and D3), the server 20c may directly take the second motion tag of D1 as the second motion tag of D2 and D3, and further obtain the shot angle types of the two playback motion segments to form their respective first motion tags; for example, the shot angle type of motion segment D2 is the side-front view type, so its first motion tag may be "playback shot - side-front view". The first motion tag of motion segment D3 is generated in the same way as that of D2 and is not repeated here.
Further, the server 20c may generate a text script 20e associated with the competitors based on the motion tags corresponding to each motion segment. Specifically, the motion tags corresponding to each motion segment may be arranged in the text script 20e in the order of the start-stop timestamps of the segments. As shown in fig. 2, for example, the entry "t1-t2 normal shot, score: 97.349174 points, take-off manner: knee-hold, non-arm-strength dive, rotation posture: jump forward facing the pool, number of rotations: 4.5, number of side rotations: 0" indicates that a competitor performs a diving action from time t1 to time t2 in video segment 20b, that the motion segment from t1 to t2 is a normal motion segment (corresponding to motion segment D1), and records the detailed action type and predicted score of that dive. For the interval from time t3 to time t4 in video segment 20b, the corresponding playback motion segment (corresponding to motion segment D2) is recorded in the text script 20e; besides the detailed shot type of the playback motion segment (such as "playback shot - side-front view"), its action type and predicted score are also recorded, and it can be seen that the action type and predicted score of the two segments are the same. Optionally, to avoid cluttering the text script 20e with repeated information, only the shot type may be shown for a playback motion segment, with the other information referring to the motion script information of the preceding normal motion segment. It should be understood that the text script 20e shown in fig. 2 is only an exemplary template provided by this embodiment; its specific form can be designed according to actual requirements, which is not limited in this embodiment.
Finally, the server 20c may package the text script 20e and the video collection 20f obtained by splicing the important event segments, and return the corresponding resource address 20g, expressed as a Uniform Resource Locator (URL), to the terminal device of the terminal user through the relevant interface, so that the terminal device can respond to the terminal user's triggering operation on the resource address 20g and display the video collection 20f and the text script 20e on its screen.
It is understood that the server 20c may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. Therefore, the modules mentioned above can be distributed on a plurality of physical servers or a plurality of cloud servers, that is, the calculation of all the video segments uploaded by the terminal users is completed in parallel through distribution or clustering, and then the video segments can be automatically cut and classified quickly and accurately, and complete motion script information can be generated.
To sum up, based on deep neural networks, the embodiments of the present application can quickly obtain the motion segments for a target object from the video to be processed, divide them into normal motion segments and playback motion segments, and classify and tag each motion segment. Accurate separation of normal and playback motion segments reduces the waste of computing resources and improves the accuracy and efficiency of action type recognition; finally, all important event segments can be automatically cut out of the video to be processed and the motion script can be generated automatically.
Further, please refer to fig. 3, which is a flowchart illustrating a video data processing method according to an embodiment of the present application. The video data processing method may be executed by the terminal device or the server described in fig. 1, or may be executed by both the terminal device and the server. As shown in fig. 3, the method may include the steps of:
step S101, identifying the shot type of a video frame aiming at a target object in a video to be processed, acquiring a motion segment in the video to be processed, and dividing the motion segment into a normal motion segment and a playback motion segment according to the shot type;
specifically, a terminal user uploads a locally stored video clip as a to-be-processed video through a terminal device, or inputs a resource link corresponding to the to-be-processed video (for a specific process, see the embodiment corresponding to fig. 2), and a server may obtain the to-be-processed video, where the to-be-processed video includes target objects, and the specific number of the target objects may be one or more. Further, a video editor or a video editing algorithm may be used to decode and frame a video to be processed, for example, through Adobe Premiere Pro (a piece of commonly used video editing software developed by Adobe corporation), Fast Forward Mpeg (FFmpeg for short, which is a set of Open source computer programs that can record, convert digital audio and video, and convert them into streams), and Open CV (a cross-platform computer vision and machine learning software library that is licensed based on berkeley software suite and can be run on various operating systems), a server may obtain each frame image of the video to be processed, and further may perform uniform frame extraction on the obtained multi-frame images at fixed time intervals, so as to obtain a first video frame sequence. The specific fixed time interval selected during uniform frame extraction needs to be determined according to actual conditions, so that the video frames of each important event segment can be uniformly acquired, which is not limited in this embodiment of the present application.
Optionally, if the terminal device is equipped with a video editor or can run a video editing algorithm, the terminal user may decode and frame-extract the video to be processed on the terminal device to generate a corresponding first video frame sequence, and then send the first video frame sequence to the server. The process of locally generating the first video frame sequence corresponding to the video to be processed is consistent with the process of generating the first video frame sequence corresponding to the video to be processed by the server, and therefore details are not repeated here.
Further, the server may input the obtained first video frame sequence into the image classification network, which outputs the shot type corresponding to each video frame in the first video frame sequence. The start-stop time sequence corresponding to the motion segments in the first video frame sequence is also obtained: specifically, the first video frame sequence may be input into the feature coding network, which extracts the picture features corresponding to the first video frame sequence. Since video segments in the actual service are in color, i.e., the extracted video frames are color pictures (for example, an RGB picture has the three color channels Red, Green, and Blue), the feature coding network may specifically be a MobileNet convolutional neural network (a lightweight convolutional neural network proposed by Google that replaces standard convolutions with depthwise separable convolutions, greatly reducing the amount of computation and the number of model parameters and increasing the model speed). The picture RGB features of each video frame are extracted through the MobileNet convolutional neural network and fused to obtain the picture features corresponding to the first video frame sequence. The server can then input the picture features into the temporal action segmentation network (a network that locates the start time and end time of a target action), which outputs the start-stop timestamps identifying the motion segments in the first video frame sequence; arranging the start-stop timestamps of each motion segment in order yields the start-stop time sequence, where each pair of start-stop timestamps includes a start timestamp and an end timestamp. A feature-extraction sketch is given below.
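As an illustration of the feature coding step, the sketch below uses torchvision's MobileNetV2 backbone as a stand-in feature coding network and produces one picture-feature vector per frame; the backbone choice, preprocessing values, and pooling are assumptions rather than the patented configuration.

```python
import torch
from torchvision import models, transforms

# MobileNetV2 backbone used as a stand-in feature coding network (an assumption).
backbone = models.mobilenet_v2(weights="IMAGENET1K_V1").features.eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_frames(frames):
    """frames: list of H x W x 3 uint8 RGB images -> (num_frames, feature_dim) picture features."""
    batch = torch.stack([preprocess(f) for f in frames])
    feats = backbone(batch)        # (N, 1280, 7, 7) for 224 x 224 inputs
    return feats.mean(dim=[2, 3])  # global average pool -> (N, 1280)
```

The resulting feature sequence is what the temporal action segmentation network would consume to predict the start-stop timestamps.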
Assume that the start-stop time sequence includes at least two start timestamps T1 and at least two end timestamps T2, where a start timestamp T1 and the end timestamp T2 that immediately follows it form a pair. The video frames of the first video frame sequence that fall between the same pair of start timestamp T1 and end timestamp T2 are determined as a second video frame sequence, so at least two second video frame sequences can be obtained, each of which is a motion segment. The motion segments are then divided into normal motion segments and playback motion segments according to the shot type of the video frames in each segment: if the shot type corresponding to a motion segment is the normal shot type, the segment is determined to be a normal motion segment; if it is the playback shot type, the segment is determined to be a playback motion segment. In this way, normal and playback motion segments can be distinguished.
It should be noted that the image classification network and the feature coding network may be the same network model, or two independent network models with the same network structure; specifically, both may be MobileNet convolutional neural networks, which can both identify shot types and extract picture features.
Step S102, determining the shot type as a first motion label corresponding to the motion segment; the first motion tag is used for recording the shot type corresponding to the motion segment in the motion script information associated with the target object;
specifically, the server may generate a first motion tag corresponding to each of the motion segments according to the shot type corresponding to each of the motion segments obtained in step S101, where the first motion tag may be used to record the shot type corresponding to the motion segment in the motion script information associated with the target object, for example, for a normal motion segment, in a diving competition scene, the shot angles of different diving actions are consistent, so that the shot angle types may not be distinguished, and the corresponding shot type is a normal shot type, and the specific content of the corresponding first motion tag may be "normal shot"; for the playback motion segment, the detailed lens angle type corresponding to the playback motion segment may be further identified through the image classification network, and then a first motion tag corresponding to the playback motion segment may be generated according to the lens angle type, for example, see the playback motion segment D2 in the embodiment corresponding to fig. 2, where the first motion tag is "playback lens-side front view". The lens angle type may specifically include any one of a front side view angle type, a side front view angle type, a side rear view angle type, or a top view angle type, and of course, other lens angle types may be added according to practical applications, which is not limited in the embodiments of the present application.
Step S103, performing multi-task action recognition on the normal motion segment, acquiring a second motion tag corresponding to the normal motion segment, and associating the second motion tag with the playback motion segment; the second motion tag is used for recording, in the motion script information, the action type and action evaluation quality corresponding to the motion segment.
Specifically, one or more normal motion segments can be obtained through step S101, and the server may extract N consecutive video frames from each normal motion segment to form a third video frame sequence for that segment. Only one normal motion segment is discussed here; multiple normal motion segments are processed in the same way, either in parallel or one after another in chronological order, which is not limited in this embodiment. The multi-task action understanding network adopted in this embodiment performs action type recognition and action quality evaluation at the same time. The network contains M non-local components (also called non-local modules), which share parameters, and the number of video frames fed into each non-local component must be a fixed value. The server therefore first splits the third video frame sequence into M subsequences, each containing N/M consecutive video frames, and inputs the M subsequences into the M non-local components of the multi-task action understanding network, one subsequence per component. Each component performs feature extraction on its subsequence, yielding M intermediate action features. These M intermediate action features are fused to obtain a fused feature, and a temporal operation is performed on the fused feature through two one-dimensional convolution layers in the multi-task action understanding network to obtain the time-sequence feature of the normal motion segment. The time-sequence feature is input into a fully connected layer of the multi-task action understanding network, and the feature data output by the fully connected layer is fed into a score regressor and a label classifier respectively (both connected to the fully connected layer). The score regressor outputs the action evaluation quality of the normal motion segment and the label classifier outputs its action type, so the obtained action evaluation quality and action type can serve as the second motion tag of the normal motion segment. M and N are positive integers, N equals the input length expected by the action understanding network, and N is an integer multiple of M. The second motion tag is used to record, in the motion script information, the action type and action evaluation quality of each motion segment, where the action evaluation quality can be understood as a predicted score (such as a diving score) for the action performed by the target object. The number of label classifiers can be set according to how many kinds of action types actually need to be output. A short inference sketch follows.
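Continuing the illustrative MultiTaskActionNet sketch given with the apparatus description, splitting the N sampled frames into M consecutive subsequences and running them through the network could look like this; shapes and names follow that earlier sketch and remain assumptions.

```python
import torch

def predict_action(frames, net, m: int = 4):
    """frames: (N, 3, H, W) tensor of N consecutive frames from a normal motion segment.
    Splits them into M subsequences of N/M frames and returns (score, action logits)."""
    n = frames.shape[0]
    assert n % m == 0, "N must be an integer multiple of M"
    # (M, N/M, 3, H, W) -> list of M clips shaped (1, 3, N/M, H, W) for the shared branches
    clips = [chunk.permute(1, 0, 2, 3).unsqueeze(0)
             for chunk in frames.reshape(m, n // m, *frames.shape[1:])]
    with torch.no_grad():
        score, logits = net(clips)
    return score.item(), logits
```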
Further, since the playback motion segment is only a slow-motion playback of the associated normal motion segment, the playback motion segment may directly multiplex the second motion tag corresponding to the associated normal motion segment; that is, the server may associate the second motion tag with the playback motion segment. Specifically, according to the start-stop time sequence corresponding to the motion segments in step S101, the normal motion segment preceding a certain playback motion segment may be taken as the target segment, and the second motion tag corresponding to the target segment may then be obtained and used as the second motion tag corresponding to that playback motion segment. At this point, each motion segment has its corresponding first motion tag and second motion tag, and the server may arrange the first motion tag and the second motion tag corresponding to each motion segment in order of the start-stop timestamps corresponding to each motion segment to form the motion script information associated with the target object, and may also generate a text script according to the motion script information; a specific form may refer to the text script 20e in fig. 2.
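The tag multiplexing described above may be sketched in Python as follows; the segment representation and all names are illustrative assumptions, the only point being that each playback segment reuses the second motion tag of the nearest preceding normal segment.

def propagate_second_tags(segments, second_tags):
    # segments: list of (start, end, shot_type) tuples in chronological order
    # second_tags: dict mapping the index of each normal segment to its second motion tag
    result = {}
    last_normal = None
    for idx, (_, _, shot_type) in enumerate(segments):
        if shot_type == "normal":
            last_normal = idx
            result[idx] = second_tags[idx]
        elif last_normal is not None:            # playback segment: multiplex the previous tag
            result[idx] = second_tags[last_normal]
    return result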
The embodiment of the present application can identify the shot type of a video frame for a target object in a video to be processed and acquire the start-stop time sequence corresponding to the motion segments in the video to be processed, so that the motion segments can be divided into normal motion segments and playback motion segments according to the shot type and the start-stop time sequence, realizing accurate segmentation of normal motion segments and playback motion segments. The normal motion segments can be input into a multitask action understanding network for identification to obtain the corresponding action type and action evaluation quality; the shot type can be used as a first motion tag, the action type and the action evaluation quality can be used as a second motion tag, and finally motion script information can be generated according to the first motion tag and the second motion tag. Therefore, by identifying only the normal motion segments, the waste of computing resources can be reduced, and the consumption of a large amount of manpower for classifying actions and registering motion script information can be avoided, thereby improving the identification accuracy and efficiency of action types, improving the accuracy of action quality evaluation, automatically scoring the actions of the target object, and improving the generation efficiency of the motion script information by generating it automatically.
The current editing of video clips of a diving game has the following problems: manual script recording consumes a great deal of manpower; the prior art cannot automatically and accurately edit diving action segments, so that the clipped diving moments (namely the moments of normal diving shots and replay shots in a match) are mixed with background moments (namely the moments of the match that do not include diving shots), or normal shots and slow-motion replay shots are mixed together in the output, which affects the viewing experience; in addition, the prior art can hardly distinguish diving action types and turns accurately, can hardly evaluate action quality accurately, and can hardly label diving actions automatically. The method provided by the present application can solve these problems, and provides a set of multi-modal intelligent script tools that automatically clip all the diving pictures in a diving competition through the calculation results of multi-modal networks and automatically output a script form containing action scores and action types.
In the following, the processing of a video clip of a diving game is taken as an example for description. Please refer to fig. 4, which is a schematic flow chart of a video data processing method provided in an embodiment of the present application. As shown in fig. 4, the video data processing method can be applied to an intelligent script scheme for a diving game, wherein intelligent script recording refers to using an algorithm to automatically locate all the moments at which a sports event (such as a diving event, including normal shots and playback shots) occurs in a video clip and to describe the event in text. The method may specifically comprise the following steps:
a) extracting frames from the video clip of the diving game at a fixed time interval (for example, 0.5 seconds);
b) inputting the video frames obtained in step a) into an image classification network to obtain the shot type category, and simultaneously inputting the video frames into a feature coding network to extract the RGB features of the pictures;
c) inputting the RGB features of the pictures obtained in the step b) into a time sequence action segmentation network to obtain a start-stop time sequence and a shot type of a diving action, and editing a diving segment from a diving competition video segment, wherein corresponding video frames can be extracted from the video frames in the step a) according to the start-stop time sequence to form an action segment sequence (including a normal shot and a multi-view playback shot), and the action segment sequence belongs to the diving segment;
d) sending video frames corresponding to the action fragment sequence of the normal shot into a multi-task action understanding network, wherein the network can simultaneously realize two tasks of action quality evaluation and multi-label action classification, and further can obtain detailed action labels (including a take-off mode, an arm strength diving property, a rotation posture, a rotation number and a side rotation number) and action scores of the action fragment sequence;
e) splicing the diving segments clipped in the step c) into a video collection serving as a video abstract;
f) generating a script list according to the start-stop time sequence of the diving action obtained in step c) and the action labels and action scores predicted in step d).
According to the method provided by the embodiment of the present application, by using a multi-modal technology (namely, a plurality of video processing technologies including picture classification, time sequence action segmentation, action classification and action quality evaluation), all the diving pictures in one diving game can be accurately edited, and a complete script of the whole diving game can be generated.
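The flow a) to f) above can be orchestrated roughly as in the Python sketch below. Every callable in the nets dictionary is a placeholder for the corresponding trained network or step, and the segment representation is an assumption of this sketch, not the disclosed implementation.

def process_diving_video(video_path, nets, interval_s=0.5):
    # nets: dict of callables standing in for the networks used in steps a)-d)
    frames, times = nets["extract_frames"](video_path, interval_s)        # step a)
    shot_types = nets["image_classifier"](frames)                         # step b)
    rgb_features = nets["feature_encoder"](frames)                        # step b)
    segments = nets["temporal_segmenter"](rgb_features, shot_types)       # step c)

    last = None
    for seg in segments:                                                  # step d)
        if seg["shot"] == "normal":
            last = nets["action_understanding"](seg["frames"])
        seg["labels"], seg["score"] = last if last else (None, None)      # playback reuses tags

    highlight = [(seg["start"], seg["end"]) for seg in segments]          # step e): clip ranges
    script = [(seg["start"], seg["end"], seg["shot"],                     # step f)
               seg["labels"], seg["score"]) for seg in segments]
    return highlight, script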
Further, the processing procedure of the video clip of the diving game is described in detail. Please refer to fig. 5 a-5 b together, which are schematic flow charts of a video data processing method according to an embodiment of the present application. The video data processing method may be executed by the terminal device or the server described in fig. 1, or may be executed by both the terminal device and the server. As shown in fig. 5 a-5 b, the method may comprise the steps of:
step S201, performing frame extraction on a video to be processed containing a target object to obtain a first video frame sequence;
specifically, as shown in fig. 5a, the terminal user uploads a video clip 30a of a diving game as the video to be processed through a terminal device, and in this case the target object refers to the players participating in the game in the video clip 30 a. After receiving the video clip 30a, the server 30b (which may correspond to the server 10a in fig. 1) may perform uniform frame extraction on the video clip 30a at a fixed time interval (e.g. 0.5 seconds), so as to obtain a corresponding first video frame sequence 30 c. Assuming that 700 video frames are obtained after frame extraction, as shown in fig. 5a, uniform frame extraction on the video clip 30a yields a first video frame c1, a first video frame c2, …, a first video frame c699 and a first video frame c700, and these frames constitute the first video frame sequence 30c in video time order.
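The uniform frame extraction at a fixed time interval may, for example, be implemented with OpenCV as sketched below; the function name and the return format are illustrative assumptions only.

import cv2

def extract_frames(video_path, interval_s=0.5):
    # Uniformly sample one frame every interval_s seconds from the video clip.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps * interval_s)), 1)
    frames, timestamps, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
            timestamps.append(idx / fps)     # timestamp of the sampled frame, in seconds
        idx += 1
    cap.release()
    return frames, timestamps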
Step S202, inputting the first video frame sequence into an image classification network, and outputting a lens type corresponding to each video frame in the first video frame sequence through the image classification network;
specifically, as shown in fig. 5a, the server 30b inputs the first video frame sequence 30c into the image classification network 301d, and may acquire the respective shot types corresponding to the first video frame c1, the first video frame c2, …, the first video frame c699, and the first video frame c700, thereby forming a shot type sequence 30e corresponding to the first video frame sequence 30 c.
Step S203, inputting the first video frame sequence into a feature coding network, and outputting picture features corresponding to the first video frame sequence through the feature coding network;
specifically, as shown in fig. 5a, the server 30b may input the first video frame sequence 30c into the feature coding network 302d, so as to obtain the picture RGB features respectively corresponding to the first video frame c1, the first video frame c2, …, the first video frame c699, and the first video frame c700, and fuse the picture RGB features corresponding to each video frame, thereby obtaining the picture feature 30f corresponding to the first video frame sequence 30 c. The feature coding network 302d and the image classification network 301d may be the same network.
It should be noted that before the image classification network and the feature coding network are used in actual services, these network models need to be trained with sufficient sample data so that their outputs conform to the expected values. Specifically, a large number of diving game videos are collected as motion sample segments, and a shot type label (including a normal shot type and a playback shot type) and a start-stop timestamp corresponding to each target action segment (i.e. diving action segment) are marked in the motion sample segments. The marked motion sample segments are subjected to frame extraction to obtain a fourth video frame sequence; the fourth video frame sequence is then input into a lightweight convolutional neural network (specifically, a mobilenet convolutional neural network), and the predicted shot type corresponding to each target action segment can be output through the lightweight convolutional neural network. A lens loss function is then generated according to the predicted shot type and the shot type label, and the model parameters in the lightweight convolutional neural network are corrected through the lens loss function, so as to obtain a trained image classification network. The training process of the feature coding network is consistent with that of the image classification network and is not repeated here.
In addition, for the motion scene of the diving game, the specific content of the motion session record information may refer to the lens types and the action label classifications in table 1 below, and may also be increased or decreased according to the actual business needs, which is not limited in this application embodiment.
Table 1. Lens types and action tag classifications. (The original table image is not reproduced here; as described in this embodiment, the lens types cover one normal shot type, five playback slow-shot angle types and a background class, and the action tags cover the take-off mode, the arm-strength diving property, the rotation posture, the number of rotations, the number of side rotations, and the action score.)
For example, it is necessary to obtain an image classification network and a feature coding network that can process a video clip of a diving game, and the training process thereof can be as follows:
a) Data annotation: all long videos of the diving competition (namely motion sample segments) are divided into a training set and a test set; for every diving action (namely target action segment) appearing in each long video, the start (take-off) and end (entry into the water) time pair (namely start-stop timestamps) of the action is annotated, and the shot type of each diving action is annotated; the annotation content is shown in Table 1.
b) Data arrangement: frames are extracted from the long videos at a fixed frequency (e.g. 2 FPS); according to the time pairs annotated in a), the label of a video frame falling within an annotated time interval is the corresponding shot annotation label in a), and the label of a video frame not falling within any time interval is "background", i.e. a non-diving picture.
c) Model training: a mobilenet convolutional neural network is trained with the video frame data obtained in b), yielding an image classification network with 7 classes (1 normal-shot class + 5 playback slow-shot classes + background), which gives the shot type classification of each video frame. The other function of this network is to act as the feature encoder: the features of its fully connected layer are output as the picture RGB features and input into the subsequent time sequence action segmentation network.
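A possible sketch of such a 7-class shot classifier that doubles as the feature encoder, built on torchvision's mobilenet_v2 backbone, is given below; the backbone choice, the pooled 1280-dimensional feature and all names are assumptions of this sketch, and the network actually used may differ.

import torch.nn as nn
from torchvision import models

class ShotClassifier(nn.Module):
    # 7 classes: 1 normal shot + 5 playback (slow-motion) shots + background.
    # The pooled backbone features are reused as the picture RGB features.
    def __init__(self, num_classes=7, feat_dim=1280):
        super().__init__()
        backbone = models.mobilenet_v2()
        self.features = backbone.features                    # convolutional trunk
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(feat_dim, num_classes)   # shot-type head

    def forward(self, x, return_features=False):
        f = self.pool(self.features(x)).flatten(1)           # picture RGB feature
        logits = self.classifier(f)
        return (logits, f) if return_features else logits

Training such a classifier would typically minimise a cross-entropy loss over the labelled frames, which would play the role of the lens loss function described above; the exact loss form used in the disclosure is not specified here.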
Step S204, inputting the picture characteristics into a time sequence action segmentation network, and outputting a start-stop time sequence for identifying the motion segment in the first video frame sequence through the time sequence action segmentation network; dividing the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type;
specifically, as shown in fig. 5a, the server 30b may input the picture feature 30f obtained in step S203 into the time sequence action segmentation network 30g, and may obtain through it the start-stop timestamps of each diving action segment (i.e. motion segment) in the first video frame sequence 30c; the start-stop timestamps corresponding to each diving action segment are arranged in order and combined to obtain the start-stop time sequence 30h. For example, the start-stop time sequence 30h may include j timestamps t1, t2, t3, ..., tj, where j is a positive integer, and every two adjacent timestamps constitute a pair of start-stop timestamps; that is, a pair of start-stop timestamps includes a start timestamp and an end timestamp, such as timestamp t1 to timestamp t2, or timestamp t3 to timestamp t4. It can be understood that the end timestamp of the previous pair of start-stop timestamps may be equal to the start timestamp of the next pair, e.g. t2 may equal t3, but for ease of illustration and distinction the timestamps are still labeled here as t2 and t3, respectively. Further, diving action segments 30i may be obtained from the first video frame sequence 30c according to the start-stop time sequence 30h; as shown in fig. 5a, assuming that j equals 12, 6 diving action segments may be obtained, including the diving action segment 301i from timestamp t1 to timestamp t2, the diving action segments 302i, … from timestamp t3 to timestamp t4, and the diving action segment 306i from timestamp t11 to timestamp t12. Furthermore, the diving action segments 30i can be divided into normal motion segments and playback motion segments according to the start-stop time sequence 30h and the shot type sequence 30e; for example, the diving action segment 301i and the diving action segment 304i can be classified as normal motion segments, while the diving action segment 302i, the diving action segment 303i, the diving action segment 305i and the diving action segment 306i can be classified as playback motion segments. Since the diving action segment 301i is, in time order, the normal motion segment preceding the diving action segment 302i and the diving action segment 303i, an association relationship can be established among the diving action segment 301i, the diving action segment 302i and the diving action segment 303i; similarly, an association relationship can also be established among the diving action segment 304i, the diving action segment 305i and the diving action segment 306i. The specific division process may refer to step S101 in the embodiment corresponding to fig. 3 above and is not described in detail here.
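The pairing of timestamps and the division into normal and playback segments, together with the association of each playback segment with its preceding normal segment, may be sketched as follows; the shot_of helper and the dictionary representation are illustrative assumptions.

def build_segments(timestamps, shot_of):
    # timestamps: [t1, t2, ..., tj] in chronological order (j even); consecutive
    # pairs (t1, t2), (t3, t4), ... delimit diving action segments.
    # shot_of(start, end): assumed helper returning the shot type of that interval,
    # derived from the shot type sequence produced by the image classification network.
    segments = []
    current_normal = None
    for start, end in zip(timestamps[0::2], timestamps[1::2]):
        shot = shot_of(start, end)
        seg = {"start": start, "end": end, "shot": shot, "associated_normal": None}
        if shot == "normal":
            current_normal = seg
        else:                                 # playback: link to the preceding normal segment
            seg["associated_normal"] = current_normal
        segments.append(seg)
    return segments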
Similarly, the time sequence action segmentation network needs to be trained before it can be used for segmentation. Specifically, a start-stop timestamp label and a shot type label corresponding to each target action segment are marked in the motion sample segments, the marked motion sample segments are then divided into S sample sub-segments according to a division rule (S is a positive integer), and frames are extracted from the S sample sub-segments to obtain a fifth video frame sequence; the start-stop timestamp labels then need to be updated according to the time length of each sample sub-segment to obtain updated start-stop timestamp labels. Further, the fifth video frame sequence is input into the feature coding network trained in step S203 to obtain the picture feature matrix corresponding to each sample sub-segment, the picture feature matrix is input into the initial action segmentation network, and the predicted start-stop timestamps corresponding to each target action segment in each sample sub-segment are output through the initial action segmentation network. A time loss function can thus be generated according to the predicted start-stop timestamps and the updated start-stop timestamp labels, and the model parameters in the initial action segmentation network are corrected through the time loss function, finally yielding the trained time sequence action segmentation network.
For example, it is necessary to obtain a time sequence action segmentation network that can process a video clip of a diving game, and the training process thereof can be as follows:
a) data annotation: the data labeling process is consistent with the data labeling process of the feature coding network and the image classification network.
b) Data arrangement: all training and test long videos are divided into several shorter videos of a certain time length (for example, 10 diving-competition long videos of 100 minutes each may be divided into 50 shorter videos of 20 minutes each). This conveniently enlarges the number of training samples and keeps the durations of the videos in the training set as uniform as possible; at the same time, the division avoids cutting through a diving action so as not to damage the integrity of the diving action. This is the division rule mentioned above. The shorter videos are then frame-extracted at a fixed frequency (e.g. 2 FPS). In addition, the corresponding time pair sequences in the annotation files are updated accordingly (i.e. the start-stop timestamp labels are updated).
c) Model training: first, the video frame sequence extracted from each shorter video in b) is input into the trained feature coding network to obtain the picture RGB feature matrix of each video (which may be determined jointly by the time dimension and the feature dimension) and the shot label vector of each video (which may be determined jointly by the time dimension and the shot labels). Then, with the feature matrix as input, the time sequence action segmentation network is trained to output the start and stop timestamps of all diving action segments, and the shot labels of all diving action segments are obtained from the shot label vector. Finally, each video yields a sequence set {(start timestamp, end timestamp, shot label) | Xn}, where Xn is the number of diving actions detected in that video.
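One possible post-processing sketch that turns per-frame predictions into the sequence set {(start timestamp, end timestamp, shot label)} is shown below. The assumption that the segmentation network can be read out as per-frame action/background decisions, and the majority vote over shot labels, are choices of this sketch and are not specified by the disclosure.

from collections import Counter

def frames_to_segments(frame_times, is_action, shot_labels):
    # frame_times: timestamp of each decimated frame (e.g. at 2 FPS)
    # is_action: per-frame boolean derived from the time sequence action segmentation network
    # shot_labels: per-frame shot label from the image classification network
    segments, start = [], None
    for i, active in enumerate(is_action):
        if active and start is None:
            start = i
        elif not active and start is not None:
            majority = Counter(shot_labels[start:i]).most_common(1)[0][0]
            segments.append((frame_times[start], frame_times[i - 1], majority))
            start = None
    if start is not None:
        majority = Counter(shot_labels[start:]).most_common(1)[0][0]
        segments.append((frame_times[start], frame_times[-1], majority))
    return segments   # {(start timestamp, end timestamp, shot label) | Xn}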
Step S205, determining the shot type as a first motion label corresponding to the motion segment;
specifically, as shown in fig. 5a, since the diving action segment 301i and the diving action segment 304i are both normal motion segments, the first motion tags corresponding to these two motion segments may be "normal shot". The diving action segment 302i, the diving action segment 303i, the diving action segment 305i and the diving action segment 306i are playback motion segments, so the first motion tags of these segments may include a detailed lens angle type in addition to "playback shot"; assuming that the lens angle type corresponding to the diving action segment 302i is the top view angle type, its first motion tag may be "playback shot - top view angle". The process of generating the first motion tags corresponding to the other playback motion segments is consistent with that of the diving action segment 302i.
Step S206, extracting N video frames from the normal motion segment to form a third video frame sequence; splitting the third video frame sequence into M subsequences, inputting the M subsequences into a multitask action understanding network, outputting an action type and action evaluation quality corresponding to a normal motion segment through the multitask action understanding network, determining the action type and the action evaluation quality as a second motion tag corresponding to the normal motion segment, and associating the second motion tag with a playback motion segment;
specifically, the diving action segment 301i is taken as an example for explanation; the processing of the diving action segment 304i may refer to that of the diving action segment 301i. As shown in fig. 5b, a normal-shot diving action segment generally lasts about 4 seconds, so according to an empirical value 96 consecutive video frames (i.e. N equals 96) can be extracted from the diving action segment 301i, and the extracted 96 video frames constitute a third video frame sequence. The third video frame sequence can then be split into M subsequences; in this embodiment M may equal 12, that is, the third video frame sequence is split into subsequence i1, subsequence i2, …, subsequence i12, and the 12 subsequences are respectively input into the 12 non-local components in the multitask action understanding network 30k, so that each non-local component processes 8 consecutive video frames. Finally the action type 301l and the action evaluation quality 302l corresponding to the diving action segment 301i can be obtained, and the action type 301l and the action evaluation quality 302l can constitute the second motion tag 30l corresponding to the diving action segment 301i. Further, the second motion tag 30l may be assigned to the associated diving action segment 302i and diving action segment 303i; the specific process may refer to step S103 in the embodiment corresponding to fig. 3 and is not described here again.
It should be noted that, in the diving competition scenario, as shown in Table 1 above, the action types may include, but are not limited to, the take-off mode (including the straight body, knee-holding body and free/turning postures), the arm-strength diving property (including arm-strength diving and non-arm-strength diving), the rotation posture (including forward take-off facing the pool, backward take-off facing the board (platform), backward take-off facing the pool, and inward take-off facing the board (platform)), the number of rotations, and the number of side rotations. For example, the second motion tag 30l corresponding to the diving action segment 301i may be "66.538055 points, free posture, arm-strength diving, backward take-off facing the pool, 2 rotations, 1.5 side rotations", which respectively correspond to the action evaluation quality, the take-off mode, the arm-strength diving property, the rotation posture, the number of rotations, and the number of side rotations.
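For illustration, the second motion tag described above could be represented as a small record such as the following; all field names are hypothetical and do not appear in the disclosure.

from dataclasses import dataclass

@dataclass
class SecondMotionTag:
    score: float           # action evaluation quality, e.g. 66.538055
    takeoff_mode: str      # straight body, knee-holding body or free posture
    armstand: str          # arm-strength diving or non-arm-strength diving
    rotation_posture: str  # e.g. backward take-off facing the pool
    rotations: float       # number of rotations, e.g. 2
    side_rotations: float  # number of side rotations, e.g. 1.5

    def as_text(self):
        return (f"{self.score} points, {self.takeoff_mode}, {self.armstand}, "
                f"{self.rotation_posture}, {self.rotations} rotations, "
                f"{self.side_rotations} side rotations")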
Furthermore, the multitask action understanding network needs to be trained before it can be used. Please refer to fig. 6, which is a schematic structural diagram of a multitask action understanding network according to an embodiment of the present application. As shown in fig. 6, in the embodiment of the present application, instead of using a C3D network (3D Convolutional Network) as the main network, M non-local components are used to form the main network, where M is a positive integer, and the multitask action understanding network uses a hard parameter sharing mechanism, that is, the M non-local components can share parameters. First, data labeling needs to be performed on the motion sample segments: on the basis of the data labeling for the image classification network in step S203, an action evaluation quality label and an action type label corresponding to each target action segment need to be labeled in the motion sample segments; the specific content of the labeling may refer to Table 1, and the action evaluation quality label and the action type label here are the real action evaluation quality and action type. Furthermore, K video frames can be uniformly extracted from the labeled motion sample segment, and N consecutive video frames are then extracted from the K video frames as an input sequence, which is input into the initial action understanding network, where K and N are both positive integers and K is greater than N, with the value of K very close to the value of N; since the number of video frames input into each non-local component must be a fixed value, N is equal to the length of the model input data corresponding to the initial action understanding network. Further, by processing consecutive groups of N/M video frames with the M non-local components in the initial action understanding network, the predicted action evaluation quality and the predicted action type corresponding to each target action segment can be output; it can be understood that N is an integer multiple of M. As shown in fig. 6, each group of N/M consecutive video frames may constitute a segment, and each non-local component may process one segment individually, for example, non-local component 1 processes segment 1 and non-local component 2 processes segment 2. Furthermore, a quality loss function can be generated according to the gap between the predicted action evaluation quality and the action evaluation quality label, an action loss function can be generated according to the gap between the predicted action type and the action type label, and a target loss function can be generated from the quality loss function and the action loss function (for example, by summing them); the model parameters in the initial action understanding network are corrected through the target loss function, and a trained multitask action understanding network can thus be obtained.
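A minimal Python sketch of the joint training objective is given below, assuming a mean squared error for the quality loss and a cross-entropy for the action loss (the disclosure does not fix the loss forms), a model such as the MultiTaskActionNet sketch given earlier, and only one classifier head for brevity, whereas the embodiment uses five label classifiers.

import torch.nn as nn

mse = nn.MSELoss()           # quality loss (assumed form)
ce = nn.CrossEntropyLoss()   # action loss (assumed form)

def target_loss(pred_score, score_label, pred_logits, action_label):
    quality_loss = mse(pred_score.squeeze(-1), score_label)
    action_loss = ce(pred_logits, action_label)
    return quality_loss + action_loss          # summed into the target loss function

def train_step(model, optimizer, subsequences, score_label, action_label):
    # one training step over a batch of labeled target action segments
    optimizer.zero_grad()
    pred_score, pred_logits = model(subsequences)
    loss = target_loss(pred_score, score_label, pred_logits, action_label)
    loss.backward()
    optimizer.step()
    return loss.item()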
It should be understood that, although fig. 6 only marks the non-local components, the one-dimensional convolutions, and the fully connected layer of the multitask action understanding network, in practical applications the network structure may further include an input layer, a feature extraction layer, a normalization (BatchNorm, BN) layer, an activation layer, an output layer, and the like, which are not described in detail here.
For example, a multitask action understanding network that can process video clips of a diving game needs to be obtained, and the training process can be as follows:
a) data annotation: on the basis of the data annotation of the image classification network, an action score (i.e., an action evaluation quality label) and 5 fine action labels (i.e., action type labels) are additionally annotated for each diving action, as shown in table 1.
b) Data arrangement: because the viewing angles of the different normal-shot diving actions are consistent and their durations are similar, only the normal-shot diving action segments are annotated and subjected to label analysis, and the playback shots reuse the results of the normal shots. For each normal shot, 107 frames are uniformly extracted (i.e. K equals 107), and the collection of frame sequences forms the data set.
c) Model training: a multitask action understanding network based on non-local modules over a resnet (residual neural network) backbone is trained with the action segment frame sequences obtained in b). During training, 96 frames (i.e. N equals 96) are randomly extracted from the 107 frames as the network input, thereby realizing data enhancement. The whole network structure comprises 12 (M equals 12) non-local modules (i.e. non-local components) sharing parameters, each processing a segment of 8 consecutive frames and outputting a 2048-dimensional feature vector (also called an intermediate action feature, taken from the resnet fully connected layer). The output vectors of the 12 non-local modules are spliced into a 12 × 2048 matrix (also called the fusion feature), a time sequence feature is obtained by performing a time sequence operation with two one-dimensional convolution layers (conv1d), and the time sequence feature is fed into 2 branches: one branch is connected with the score regressor through a fully connected layer, and the other branch is connected with the 5 label classifiers through a fully connected layer. The entire network may be trained end-to-end, outputting the prediction score and the multi-label categories.
Step S207, intercepting an important event segment from the video to be processed according to the start-stop time sequence corresponding to the motion segment; generating a label text corresponding to the important event fragment according to the incidence relation between the important event fragment and the motion fragment and the first motion label and the second motion label corresponding to the motion fragment, adding the label text into the important event fragment, and splicing the important event fragment added with the label text into a video collection;
specifically, as shown in fig. 5b, important event segments may be intercepted from the video clip 30a according to the start-stop time sequence 30h; for example, an important event segment 301m may be obtained by intercepting the segment from timestamp t1 to timestamp t2 in the video clip 30a, and an important event segment 302m may be obtained by intercepting the segment from timestamp t3 to timestamp t4. There is an association relationship between the important event segment 301m and the diving action segment 301i, that is, the important event segment 301m is the complete segment corresponding to the diving action segment 301i; similarly, there is an association relationship between the important event segment 302m and the diving action segment 302i. Taking the important event segment 301m as an example: at this time no label text is displayed in the picture of the important event segment 301m, so the server 30b may obtain the first motion tag and the second motion tag corresponding to the diving action segment 301i and generate the label text corresponding to the important event segment 301m from them. That is, the label text includes a plurality of sub-labels of different label types, specifically the shot type, the action type and the action evaluation quality corresponding to the diving action segment 301i; as described in the above steps, the shot type is the normal shot type, the action type includes "free posture", "arm-strength diving", "backward take-off facing the pool", 2 rotations and 1.5 side rotations, and the action evaluation quality is 66.538055 points. A label fusion template may therefore be obtained first; the label fusion template specifies the filling position corresponding to each label type, for example a sub-label such as the shot type needs to be filled in the uppermost part of the label fusion template. Further, each sub-label in the label text is added to the corresponding filling position according to the label type to which it belongs to obtain an added label fusion template, and the added label fusion template and the important event segment can then be subjected to pixel fusion to obtain the added important event segment. As shown in fig. 5b, the added important event segment 301m is obtained according to the above process, and the added label fusion template m1 can be displayed in its picture: "normal shot, score [66.538055], free posture, arm-strength diving, backward take-off facing the pool, 2 rotations, 1.5 side rotations"; similarly, the picture of the added important event segment 302m may display the added label fusion template m2: "playback - top view, score [66.538055], free posture, arm-strength diving, backward take-off facing the pool, 2 rotations, 1.5 side rotations". The processing of the other important event segments is consistent and is not described in detail here. Finally, all the added important event segments are spliced to obtain a video collection 30m, which can serve as the video summary of the video clip 30a. In addition, the specific form of the label fusion template can be set according to actual needs, and the embodiment of the present application is not limited thereto.
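As an illustration only, burning the fused label text into the frames of an important event segment and splicing the segments into the video collection could be done with OpenCV roughly as follows; the font, text position, codec and output size are assumptions of this sketch, not part of the disclosed label fusion template.

import cv2

def overlay_label(frames, label_text, position=(40, 60)):
    # Draw the label text onto each frame of an important event segment.
    out = []
    for frame in frames:
        frame = frame.copy()
        cv2.putText(frame, label_text, position,
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
        out.append(frame)
    return out

def write_highlight(segment_frame_lists, path, fps=25, size=(1280, 720)):
    # Concatenate the labelled important event segments into a single video collection.
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    for frames in segment_frame_lists:
        for frame in frames:
            writer.write(cv2.resize(frame, size))
    writer.release()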
And step S208, generating motion script information related to the target object according to the start-stop time sequence corresponding to the motion segment, the first motion label and the second motion label.
Specifically, as shown in fig. 5b, the server 30b may arrange and combine the first motion tag and the second motion tag corresponding to each motion segment in order of the start-stop time sequence 30h, so as to obtain the motion script information 30n associated with each player, store the motion script information 30n in a script list, and then pack it together with the video collection 30m and return them to the terminal user. The specific form of the script may refer to the script 20e in fig. 2, or other templates may be used to generate the script according to actual needs.
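The assembly of the script list may be sketched as follows; the segment fields and the output format are illustrative assumptions only.

def build_script(segments):
    # segments: list of dicts with start, end, first_tag (shot type) and
    # second_tag (action type and action evaluation quality), one per motion segment
    script = []
    for seg in sorted(segments, key=lambda s: s["start"]):   # start-stop time order
        script.append({
            "start": seg["start"],
            "end": seg["end"],
            "shot": seg["first_tag"],      # first motion tag
            "action": seg["second_tag"],   # second motion tag
        })
    return script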
It should be noted that the multi-modal intelligent script tool provided by the embodiment of the present application can be applied not only to video clips of diving games but also to other similar or related video clips; the embodiment of the present application is not limited thereto.
It should be understood that the numbers shown in the above embodiments are merely illustrative; in practical applications, the actual values shall prevail.
The embodiment of the present application can, based on a deep neural network, identify the shot type of a video frame for a target object in a video to be processed and obtain the start-stop time sequence corresponding to the motion segments in the video to be processed, so that the motion segments can be divided into normal motion segments and playback motion segments according to the shot type and the start-stop time sequence, achieving accurate segmentation of normal motion segments and playback motion segments. The normal motion segments can be input into a multitask action understanding network for identification to obtain the corresponding action type and action evaluation quality; the shot type can be used as the first motion tag, the action type and the action evaluation quality can be used as the second motion tag, and finally the motion script information can be generated according to the first motion tag and the second motion tag. Therefore, in the process of processing the video clips of a diving game, all the diving segments can be automatically segmented from the video clips of the diving game, classified by shot angle, given labels such as the take-off mode, the rotation posture and the number of rotations, and scored, and the script list can then be generated automatically. Identifying only the normal motion segments reduces the waste of computing resources and avoids consuming a large amount of manpower for action classification and registration of motion script information, thereby improving the accuracy and efficiency of action type identification, improving the accuracy of action quality evaluation, and improving the generation efficiency of the motion script information. In actual services, the method can replace manual video editing and script editing, thereby saving a large amount of labor cost and reducing human errors.
Fig. 7 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application. The video data processing apparatus may be a computer program (including program code) running on a computer device, for example, the video data processing apparatus is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 7, the video data processing apparatus 1 may include: a lens identification module 11, a first label determination module 12 and a second label determination module 13;
the shot identification module 11 is configured to identify a shot type of a video frame for a target object in a video to be processed, acquire a motion segment in the video to be processed, and divide the motion segment into a normal motion segment and a playback motion segment according to the shot type;
a first label determining module 12, configured to determine the type of the shot as a first motion label corresponding to the motion segment; the first motion tag is used for recording the shot type corresponding to the motion segment in the motion script information associated with the target object;
the first tag determining module 12 is specifically configured to, when the motion segment is a playback motion segment, identify a lens angle type corresponding to the playback motion segment through an image classification network; the lens angle type comprises any one of a front side view angle type, a side front view angle type, a side rear view angle type or a top view angle type; determining the type of the shot angle as a first motion label corresponding to the playback motion segment;
the second tag determining module 13 is configured to perform multi-task action recognition on the normal motion segment, acquire a second motion tag corresponding to the normal motion segment, and associate the second motion tag with the playback motion segment; the second motion tag is used for recording the motion type and the motion evaluation quality corresponding to the motion segment in the motion script information;
the second tag determining module 13 is specifically configured to, according to the start-stop time sequence corresponding to the motion segment, use a normal motion segment before the playback motion segment as a target segment, and obtain a second motion tag corresponding to the target segment, as a second motion tag corresponding to the playback motion segment;
the second tag determining module 13 is specifically configured to input the normal motion segment into the multitask action understanding network to obtain a take-off mode, an arm strength diving attribute, a rotation posture, a rotation number, a side turning number and action evaluation quality corresponding to the normal motion segment, and determine the take-off mode, the arm strength diving attribute, the rotation posture, the rotation number and the side turning number as action types corresponding to the normal motion segment; determining the action type and the action evaluation quality as a second motion label corresponding to the normal motion segment; the second motion label is used for recording the motion type and the motion evaluation quality corresponding to the normal motion segment in the motion session information associated with the participating players.
The specific functional implementation manner of the lens identification module 11 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S201 to step S204 in the embodiment corresponding to fig. 5a to 5b, the specific functional implementation manner of the first tag determination module 12 may refer to step S102 in the embodiment corresponding to fig. 3, or may refer to step S205 in the embodiment corresponding to fig. 5a to 5b, and the specific functional implementation manner of the second tag determination module 13 may refer to step S103 in the embodiment corresponding to fig. 3, or may refer to step S206 in the embodiment corresponding to fig. 5a to 5b, which is not described again here.
Referring to fig. 7, the video data processing apparatus 1 may further include: a segment intercepting module 14 and a segment splicing module 15;
the segment intercepting module 14 is configured to intercept an important event segment from the video to be processed according to the start-stop time sequence corresponding to the motion segment; the motion segment belongs to the important event segment;
the segment splicing module 15 is configured to generate a tag text corresponding to the important event segment according to the association relationship between the important event segment and the motion segment, and the first motion tag and the second motion tag corresponding to the motion segment, add the tag text to the important event segment, and splice the important event segment after the tag text is added into the video collection;
the tag text includes a plurality of sub-tags having different tag types;
the segment splicing module 15 is specifically configured to obtain a label fusion template; the label fusion template comprises filling positions corresponding to at least two label types respectively; adding each sub-label in the label text to a corresponding filling position according to the type of the label to which the sub-label belongs, and obtaining a label fusion template after the addition; and performing pixel fusion on the added label fusion template and the important event fragment to obtain the added important event fragment, and splicing the added important event fragment to obtain the video collection.
The specific functional implementation manner of the fragment intercepting module 14 and the fragment splicing module 15 may refer to step S207 in the embodiments corresponding to fig. 5a to 5b, which is not described herein again.
Referring to fig. 7, the video data processing apparatus 1 may further include: a first training module 16, a second training module 17, a third training module 18;
the first training module 16 is configured to mark, in the motion sample segment, an action evaluation quality label and an action type label corresponding to each target action segment; uniformly extracting K video frames from the marked motion sample segment, extracting continuous N video frames from the K video frames as an input sequence, and inputting the input sequence into an initial action understanding network; k and N are positive integers, and K is greater than N; n is equal to the length of model input data corresponding to the initial action understanding network; respectively processing continuous N/M video frames through M non-local components in the initial action understanding network, and outputting predicted action evaluation quality and predicted action type corresponding to each target action segment; m is a positive integer, and N is an integer multiple of M; generating a quality loss function according to the predicted action evaluation quality and the action evaluation quality label, and generating an action loss function according to the predicted action type and the action type label; generating a target loss function according to the quality loss function and the action loss function, and correcting model parameters in the initial action understanding network through the target loss function to obtain a multi-task action understanding network;
the second training module 17 is configured to mark a shot type label corresponding to each target action segment in the motion sample segment; performing frame extraction on the marked motion sample segment to obtain a fourth video frame sequence; inputting the fourth video frame sequence into a lightweight convolutional neural network, and outputting a prediction lens type corresponding to each target action segment through the lightweight convolutional neural network; generating a lens loss function according to the predicted lens type and the lens type label, and correcting model parameters in the lightweight convolutional neural network through the lens loss function to obtain an image classification network;
a third training module 18, configured to mark, in the motion sample segment, a start-stop timestamp label corresponding to each target action segment; dividing the marked motion sample segment into S sample sub-segments according to a division rule, and performing frame extraction on the S sample sub-segments to obtain a fifth video frame sequence; s is a positive integer; updating the start-stop timestamp labels according to the time length of each sample sub-segment to obtain updated start-stop timestamp labels; inputting the fifth video frame sequence into a feature coding network to obtain a picture feature matrix corresponding to each sample sub-segment; inputting the picture characteristic matrix into an initial action segmentation network, and outputting a prediction start-stop timestamp corresponding to each target action segment in each sample sub-segment through the initial action segmentation network; and generating a time loss function according to the predicted start-stop timestamp and the updated start-stop timestamp label, and correcting the model parameters in the initial action segmentation network through the time loss function to obtain the time sequence action segmentation network.
The specific implementation of the function of the first training module 16 may refer to step S206 in the embodiment corresponding to fig. 5a to 5b, the specific implementation of the function of the second training module 17 may refer to step S203 in the embodiment corresponding to fig. 5a to 5b, and the specific implementation of the function of the third training module 18 may refer to step S204 in the embodiment corresponding to fig. 5a to 5b, which is not described herein again.
Referring to fig. 7, the lens identification module 11 may include: a first frame extracting unit 111, an image classifying unit 112 and a segment dividing unit 113;
a first frame extracting unit 111, configured to perform frame extraction on a video to be processed including a target object to obtain a first video frame sequence;
an image classification unit 112, configured to input the first video frame sequence into an image classification network, and output a shot type corresponding to each video frame in the first video frame sequence through the image classification network; the shot types include a normal shot type and a playback shot type;
the segment dividing unit 113 is configured to obtain a start-stop time sequence corresponding to a motion segment in the first video frame sequence, and divide the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type.
The specific functional implementation manner of the first frame extracting unit 111 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S201 in the embodiment corresponding to fig. 5a to 5b, the specific functional implementation manner of the image classifying unit 112 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S202 in the embodiment corresponding to fig. 5a to 5b, and the specific functional implementation manner of the segment dividing unit 113 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S203 to step S204 in the embodiment corresponding to fig. 5a to 5b, which is not described again here.
Referring to fig. 7 together, the second tag determination module 13 may include: a second frame extracting unit 131, an action understanding unit 132;
a second frame extracting unit 131, configured to extract N video frames from the normal motion segment to form a third video frame sequence; n is a positive integer;
an action understanding unit 132, configured to split the third video frame sequence into M subsequences, and input the M subsequences into a multitask action understanding network; the multitask action understanding network comprises M non-local components; m is a positive integer; respectively extracting the features of the M subsequences through M non-local components in the multitask action understanding network to obtain M intermediate action features; one non-local component corresponds to one subsequence; fusing the M intermediate action characteristics to obtain fused characteristics, and performing time sequence operation on the fused characteristics through a one-dimensional convolution layer in a multitask action understanding network to obtain time sequence characteristics corresponding to normal motion segments; inputting the time sequence characteristics into a full connection layer in the multitask action understanding network, and respectively inputting the characteristic data output by the full connection layer into a score regressor and a label classifier; and outputting the action evaluation quality corresponding to the normal motion segment through a score regressor, outputting the action type corresponding to the normal motion segment through a label classifier, and taking the action evaluation quality and the action type as a second motion label corresponding to the normal motion segment.
The specific functional implementation manner of the second frame extracting unit 131 and the action understanding unit 132 may refer to step S103 in the embodiment corresponding to fig. 3, or may refer to step S206 in the embodiment corresponding to fig. 5a to 5b, which is not described herein again.
Referring to fig. 7 together, the fragment dividing unit 113 may include: a feature extraction subunit 1131, a timing sequence acquisition subunit 1132, and a fragment determination subunit 1133;
a feature extraction subunit 1131, configured to input the first video frame sequence into a feature coding network, and output, through the feature coding network, picture features corresponding to the first video frame sequence;
a timing acquisition subunit 1132, configured to input the picture characteristics into a timing action segmentation network, and output, through the timing action segmentation network, a start-stop time sequence used for identifying a motion segment in the first video frame sequence; the start-stop time sequence includes at least two start time stamps T1 and at least two end time stamps T2;
a segment determining subunit 1133, configured to obtain at least two second video frame sequences from the first video frame sequence according to the at least two start timestamps T1 and the at least two end timestamps T2, and determine each of the at least two second video frame sequences as a motion segment; if the shot type corresponding to the motion segment is a normal shot type, determining the motion segment as a normal motion segment; and if the shot type corresponding to the motion segment is the playback shot type, determining the motion segment as a playback motion segment.
The specific functional implementation manner of the feature extraction subunit 1131 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S203 in the embodiment corresponding to fig. 5a to 5b, and the specific functional implementation manner of the timing acquisition subunit 1132 and the fragment determination subunit 1133 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S204 in the embodiment corresponding to fig. 5a to 5b, which is not described herein again.
The embodiment of the present application can, based on a deep neural network, identify the shot type of a video frame for a target object in a video to be processed and obtain the start-stop time sequence corresponding to the motion segments in the video to be processed, so that the motion segments can be divided into normal motion segments and playback motion segments according to the shot type and the start-stop time sequence, achieving accurate segmentation of normal motion segments and playback motion segments. The normal motion segments can be input into a multitask action understanding network for identification to obtain the corresponding action type and action evaluation quality; the shot type can be used as the first motion tag, the action type and the action evaluation quality can be used as the second motion tag, and finally the motion script information can be generated according to the first motion tag and the second motion tag. Therefore, the waste of computing resources can be reduced by identifying only the normal motion segments, the consumption of a large amount of manpower for action classification and registration of motion script information can be avoided, the accuracy and efficiency of action type identification can be improved, the accuracy of action quality evaluation can be improved, and the generation efficiency of the motion script information can be improved; in actual services, manual video editing and script editing can be replaced, thereby saving a large amount of labor cost and reducing human errors.
Please refer to fig. 8, which is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 8, the computer device 1000 may include a processor 1001, a network interface 1004 and a memory 1005, and the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, e.g., at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 8, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 8, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
identifying the shot type of a video frame aiming at a target object in a video to be processed, acquiring a motion segment in the video to be processed, and dividing the motion segment into a normal motion segment and a playback motion segment according to the shot type;
determining the type of the shot as a first motion label corresponding to the motion segment; the first motion tag is used for recording the shot type corresponding to the motion segment in the motion script information associated with the target object;
performing multi-task action recognition on the normal motion segment, acquiring a second motion label corresponding to the normal motion segment, and associating the second motion label with the playback motion segment; the second motion tag is used for recording the motion type and motion evaluation quality corresponding to the motion segment in the motion script information.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the video data processing method in the embodiment corresponding to any one of fig. 3, fig. 4, and fig. 5a to fig. 5b, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program executed by the aforementioned video data processing apparatus 1, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the video data processing method in the embodiment corresponding to any one of fig. 3, fig. 4, and fig. 5a to fig. 5b can be executed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application.
The computer-readable storage medium may be an internal storage unit of the video data processing apparatus provided in any of the foregoing embodiments or of the computer device, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
Further, it should be noted that embodiments of the present application also provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the method provided in the embodiment corresponding to any one of fig. 3, fig. 4, and fig. 5a to fig. 5b.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether these functions are implemented in hardware or software depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation should not be considered to go beyond the scope of the present application.
The above disclosure is merely a preferred embodiment of the present application and is not intended to limit the scope of the present application; equivalent variations and modifications made in accordance with the claims of the present application therefore still fall within the scope of the present application.

Claims (15)

1. A method of processing video data, comprising:
identifying a shot type of a video frame for a target object in a video to be processed, acquiring a motion segment in the video to be processed, and dividing the motion segment into a normal motion segment and a playback motion segment according to the shot type;
determining the shot type as a first motion tag corresponding to the motion segment; the first motion tag is used for recording the shot type corresponding to the motion segment in the motion script information associated with the target object;
performing multi-task action recognition on the normal motion segment, acquiring a second motion tag corresponding to the normal motion segment, and associating the second motion tag with the playback motion segment; the second motion tag is used for recording the motion type and the motion evaluation quality corresponding to the motion segment in the motion script information.
2. The method according to claim 1, wherein the identifying a shot type of a video frame for a target object in a video to be processed, acquiring a motion segment in the video to be processed, and dividing the motion segment into a normal motion segment and a playback motion segment according to the shot type comprises:
performing frame extraction on a video to be processed containing a target object to obtain a first video frame sequence;
inputting the first video frame sequence into an image classification network, and outputting, through the image classification network, a shot type corresponding to each video frame in the first video frame sequence; the shot types comprise a normal shot type and a playback shot type;
and acquiring a start-stop time sequence corresponding to the motion segment in the first video frame sequence, and dividing the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type.
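As a rough illustration of claim 2, the sketch below samples frames with OpenCV and classifies each one as a normal or playback shot. The MobileNetV2 backbone, the one-frame-per-second sampling rate, and the 224x224 input size are assumptions standing in for the unspecified image classification network, not details disclosed by the application.

```python
import cv2
import torch
import torchvision.models as models

SHOT_TYPES = ["normal", "playback"]

def extract_first_sequence(video_path, frames_per_second=1):
    """Uniformly sample the video to build the first video frame sequence."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(fps // frames_per_second), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.resize(frame, (224, 224)))
        idx += 1
    cap.release()
    return frames

def classify_shot_types(frames, model=None):
    """Return a shot type ('normal' or 'playback') for every sampled frame."""
    model = model or models.mobilenet_v2(num_classes=len(SHOT_TYPES)).eval()
    batch = torch.stack([torch.from_numpy(f).permute(2, 0, 1).float() / 255.0
                         for f in frames])
    with torch.no_grad():
        logits = model(batch)                 # (num_frames, 2)
    return [SHOT_TYPES[i] for i in logits.argmax(dim=1).tolist()]
```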
3. The method of claim 2, wherein the obtaining a start-stop time sequence corresponding to a motion segment in the first video frame sequence, and dividing the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type comprises:
inputting the first video frame sequence into a feature coding network, and outputting picture features corresponding to the first video frame sequence through the feature coding network;
inputting the picture features into a time sequence action segmentation network, and outputting, through the time sequence action segmentation network, a start-stop time sequence for identifying a motion segment in the first video frame sequence; the start-stop time sequence comprises at least two start time stamps T1 and at least two end time stamps T2;
acquiring at least two second video frame sequences in the first video frame sequence according to the at least two start time stamps T1 and the at least two end time stamps T2, and determining the at least two second video frame sequences as motion segments;
if the shot type corresponding to the motion segment is a normal shot type, determining the motion segment as a normal motion segment;
and if the shot type corresponding to the motion segment is a playback shot type, determining the motion segment as a playback motion segment.
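The splitting step of claim 3 can be illustrated as follows, assuming the time sequence action segmentation network has already produced frame-index start/stop pairs. The majority-vote rule used to assign one shot type to a whole segment, and the dict-based segment records, are assumptions made only for illustration.

```python
from collections import Counter

def split_motion_segments(frames, shot_types, start_stop_pairs):
    """start_stop_pairs: list of (T1, T2) frame-index pairs, one per motion segment."""
    normal_segments, playback_segments = [], []
    for t1, t2 in start_stop_pairs:
        segment = {
            "start": t1,
            "end": t2,
            "frames": frames[t1:t2],                       # a second video frame sequence
            "shot_type": Counter(shot_types[t1:t2]).most_common(1)[0][0],
        }
        if segment["shot_type"] == "normal":
            normal_segments.append(segment)                # normal motion segment
        else:
            playback_segments.append(segment)              # playback motion segment
    return normal_segments, playback_segments
```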
4. The method of claim 2, wherein the determining the shot type as the first motion tag corresponding to the motion segment comprises:
when the motion segment is a playback motion segment, identifying a lens angle type corresponding to the playback motion segment through the image classification network; the lens angle type comprises any one of a front side view angle type, a side front view angle type, a side rear view angle type or a top view angle type;
and determining the lens angle type as a first motion tag corresponding to the playback motion segment.
5. The method according to claim 1, wherein the performing multitask action recognition on the normal motion segment and acquiring the second motion tag corresponding to the normal motion segment includes:
extracting N video frames from the normal motion segment to form a third video frame sequence; n is a positive integer;
splitting the third video frame sequence into M subsequences, and inputting the M subsequences into a multitask action understanding network; the multitask action understanding network comprises M non-local components; m is a positive integer;
respectively performing feature extraction on the M subsequences through the M non-local components in the multitask action understanding network to obtain M intermediate action features; one non-local component corresponds to one subsequence;
fusing the M intermediate action features to obtain a fused feature, and performing time sequence operation on the fused feature through a one-dimensional convolution layer in the multi-task action understanding network to obtain a time sequence feature corresponding to the normal motion segment;
inputting the time sequence features into a fully connected layer in the multitask action understanding network, and respectively inputting the feature data output by the fully connected layer into a score regressor and a label classifier;
and outputting, through the score regressor, the action evaluation quality corresponding to the normal motion segment, outputting, through the label classifier, the action type corresponding to the normal motion segment, and taking the action evaluation quality and the action type as a second motion tag corresponding to the normal motion segment.
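A simplified PyTorch sketch of the multi-task action understanding network outlined in claim 5 follows. The claim does not disclose concrete layer sizes, so the toy per-frame encoder, the self-attention stand-in for the non-local components, and all dimensions are assumptions; only the overall flow (M subsequences, M non-local components, fusion, one-dimensional convolution, fully connected layer, score regressor and label classifier) follows the claim.

```python
import torch
import torch.nn as nn

class NonLocalComponent(nn.Module):
    """Stand-in for one non-local component: self-attention over one subsequence."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (batch, frames_per_subsequence, dim)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)             # residual connection, as in non-local blocks

class MultiTaskActionNet(nn.Module):
    def __init__(self, num_action_types=10, dim=256, m_components=4):
        super().__init__()
        self.m = m_components
        self.frame_encoder = nn.Linear(3 * 112 * 112, dim)       # toy per-frame encoder
        self.non_local = nn.ModuleList(
            [NonLocalComponent(dim) for _ in range(m_components)])
        self.temporal_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # 1-D conv layer
        self.fc = nn.Linear(dim, dim)                             # fully connected layer
        self.score_regressor = nn.Linear(dim, 1)                  # action evaluation quality
        self.label_classifier = nn.Linear(dim, num_action_types)  # action type

    def forward(self, frames):                # frames: (batch, N, 3, 112, 112)
        feats = self.frame_encoder(frames.flatten(2))             # (batch, N, dim)
        subsequences = feats.chunk(self.m, dim=1)                 # M subsequences
        intermediate = [blk(s) for blk, s in zip(self.non_local, subsequences)]
        fused = torch.cat(intermediate, dim=1)                    # fused feature
        temporal = self.temporal_conv(fused.transpose(1, 2)).mean(dim=2)  # time sequence feature
        shared = self.fc(temporal)
        return self.score_regressor(shared).squeeze(-1), self.label_classifier(shared)

# Example: a batch of 2 normal motion segments, N = 16 frames, M = 4 subsequences.
# quality, action_logits = MultiTaskActionNet()(torch.randn(2, 16, 3, 112, 112))
```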
6. The method of claim 2, wherein associating the second motion tag with the playback motion segment comprises:
and according to the start-stop time sequence corresponding to the motion segment, taking a normal motion segment before the playback motion segment as a target segment, and acquiring a second motion tag corresponding to the target segment as a second motion tag corresponding to the playback motion segment.
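Claim 6 amounts to propagating the tag of the most recent preceding normal segment onto each playback segment, as in the following sketch; the dict-based segment records are an assumed representation.

```python
def propagate_tags_to_playback(segments):
    """segments: dicts sorted by 'start', each with 'shot_type' and, for normal
    segments, a 'second_tag' entry (action type + action evaluation quality)."""
    last_normal_tag = None
    for segment in sorted(segments, key=lambda s: s["start"]):
        if segment["shot_type"] == "normal":
            last_normal_tag = segment.get("second_tag")       # remember the target segment's tag
        elif last_normal_tag is not None:
            segment["second_tag"] = last_normal_tag           # playback reuses the preceding tag
    return segments
```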
7. The method of claim 2, further comprising:
clipping an important event segment from the video to be processed according to the start-stop time sequence corresponding to the motion segment; the motion segment belongs to the important event segment;
generating a label text corresponding to the important event segment according to the association relationship between the important event segment and the motion segment and according to the first motion tag and the second motion tag corresponding to the motion segment, adding the label text into the important event segment, and splicing the important event segments added with the label text into a video collection.
8. The method of claim 7, wherein the label text comprises a plurality of sub-labels having different label types; the adding the label text into the important event segment, and splicing the important event segments added with the label text into a video collection comprises:
acquiring a label fusion template; the label fusion template comprises filling positions corresponding to at least two label types respectively;
adding each sub-label in the label text to the filling position corresponding to the label type to which the sub-label belongs, to obtain an added label fusion template;
and performing pixel fusion on the added label fusion template and the important event segment to obtain an added important event segment, and splicing the added important event segments to obtain a video collection.
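One way to picture the label fusion of claim 8: render each sub-label at the filling position defined by its label type, pixel-fuse the rendered template into the clip frames, and splice the clips. The OpenCV text rendering, the slot coordinates, and the alpha-blend fusion below are illustrative assumptions, not the template defined by the application.

```python
import cv2
import numpy as np

TEMPLATE_SLOTS = {                     # label type -> (x, y) filling position (assumed)
    "shot_type": (20, 40),
    "action_type": (20, 80),
    "action_quality": (20, 120),
}

def render_label_template(sub_labels, size=(720, 1280)):
    """Draw each sub-label at the filling position of its label type."""
    template = np.zeros((*size, 3), dtype=np.uint8)
    for label_type, text in sub_labels.items():
        position = TEMPLATE_SLOTS.get(label_type)
        if position is not None:
            cv2.putText(template, str(text), position, cv2.FONT_HERSHEY_SIMPLEX,
                        1.0, (255, 255, 255), 2)
    return template

def fuse_template_into_clip(frames, template, alpha=0.8):
    """Pixel-fuse the rendered template into every frame of an important event segment."""
    fused_frames = []
    for frame in frames:
        overlay = cv2.resize(template, (frame.shape[1], frame.shape[0]))
        mask = overlay.any(axis=2, keepdims=True)              # only blend where text was drawn
        blended = (alpha * overlay + (1 - alpha) * frame).astype(frame.dtype)
        fused_frames.append(np.where(mask, blended, frame))
    return fused_frames

def splice_video_collection(labelled_clips):
    """Concatenate the label-added important event segments into one video collection."""
    return [frame for clip in labelled_clips for frame in clip]
```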
9. The method of claim 5, further comprising:
marking an action evaluation quality label and an action type label corresponding to each target action segment in the motion sample segment;
uniformly extracting K video frames from the marked motion sample segment, extracting N consecutive video frames from the K video frames as an input sequence, and inputting the input sequence into an initial action understanding network; K and N are positive integers, and K is greater than N; N is equal to the length of the model input data corresponding to the initial action understanding network;
respectively processing N/M consecutive video frames through M non-local components in the initial action understanding network, and outputting the predicted action evaluation quality and the predicted action type corresponding to each target action segment; M is a positive integer, and N is an integer multiple of M;
generating a quality loss function according to the predicted action evaluation quality and the action evaluation quality label, and generating an action loss function according to the predicted action type and the action type label;
and generating a target loss function according to the quality loss function and the action loss function, and correcting model parameters in the initial action understanding network through the target loss function to obtain the multitask action understanding network.
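Claim 9's target loss can be read as a weighted sum of a regression loss on the predicted action evaluation quality and a classification loss on the predicted action type. The sketch below assumes mean squared error, cross-entropy, and equal weights; the claim only requires that the two losses be combined.

```python
import torch.nn.functional as F

def target_loss(pred_quality, quality_label, pred_action_logits, action_label,
                quality_weight=1.0, action_weight=1.0):
    quality_loss = F.mse_loss(pred_quality, quality_label)            # quality loss function
    action_loss = F.cross_entropy(pred_action_logits, action_label)   # action loss function
    return quality_weight * quality_loss + action_weight * action_loss  # target loss function
```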
10. The method of claim 2, further comprising:
marking a shot type label corresponding to each target action segment in the motion sample segment;
performing frame extraction on the marked motion sample segment to obtain a fourth video frame sequence;
inputting the fourth video frame sequence into a lightweight convolutional neural network, and outputting, through the lightweight convolutional neural network, a predicted shot type corresponding to each target action segment;
and generating a shot loss function according to the predicted shot type and the shot type label, and correcting the model parameters in the lightweight convolutional neural network through the shot loss function to obtain an image classification network.
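A minimal training-loop sketch for claim 10, with torchvision's MobileNetV2 standing in for the lightweight convolutional neural network; the backbone choice, optimizer, learning rate, and epoch count are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

def train_image_classification_network(frame_batches, label_batches,
                                        num_shot_types=2, epochs=5):
    """frame_batches: (batch, 3, 224, 224) tensors from the marked motion sample segments;
    label_batches: matching (batch,) tensors of shot type labels."""
    model = models.mobilenet_v2(num_classes=num_shot_types)   # lightweight CNN stand-in
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for frames, labels in zip(frame_batches, label_batches):
            logits = model(frames)                            # predicted shot type
            shot_loss = F.cross_entropy(logits, labels)       # shot loss function
            optimizer.zero_grad()
            shot_loss.backward()
            optimizer.step()
    return model                                              # the image classification network
```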
11. The method of claim 3, further comprising:
marking a start-stop timestamp label corresponding to each target action segment in the motion sample segment;
dividing the marked motion sample segment into S sample sub-segments according to a division rule, and performing frame extraction on the S sample sub-segments to obtain a fifth video frame sequence; s is a positive integer;
updating the start-stop timestamp labels according to the time length of each sample sub-segment to obtain updated start-stop timestamp labels;
inputting the fifth video frame sequence into the feature coding network to obtain a picture feature matrix corresponding to each sample sub-segment;
inputting the picture characteristic matrix into an initial action segmentation network, and outputting a prediction start-stop timestamp corresponding to each target action segment in each sample sub-segment through the initial action segmentation network;
and generating a time loss function according to the predicted start-stop timestamp and the updated start-stop timestamp label, and correcting model parameters in the initial action segmentation network through the time loss function to obtain the time sequence action segmentation network.
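The time loss of claim 11 can be sketched as a regression between the predicted and labelled start/stop timestamps, with the label update step shifting each label into its sample sub-segment's local time axis. The Smooth L1 loss and the tensor shapes below are assumptions.

```python
import torch.nn.functional as F

def update_timestamp_labels(start_stop_labels, sub_segment_offsets):
    """Shift each label into its sample sub-segment's local time axis (the update step)."""
    return start_stop_labels - sub_segment_offsets.unsqueeze(-1)

def time_loss(pred_start_stop, updated_start_stop_labels):
    """Both tensors: (num_target_action_segments, 2) holding [start, stop] per action."""
    return F.smooth_l1_loss(pred_start_stop, updated_start_stop_labels)
```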
12. The method according to any one of claims 1 to 11, wherein the video to be processed is a diving competition video and the target object is a competitor, and the performing multi-task action recognition on the normal motion segment and acquiring a second motion tag corresponding to the normal motion segment comprises:
inputting the normal motion segment into a multitask action understanding network to obtain a take-off mode, an armstand diving attribute, a rotation posture, a rotation number, a twist number and an action evaluation quality corresponding to the normal motion segment, and determining the take-off mode, the armstand diving attribute, the rotation posture, the rotation number and the twist number as the action types corresponding to the normal motion segment;
and determining the action types and the action evaluation quality as a second motion tag corresponding to the normal motion segment; the second motion tag is used for recording the motion type and the motion evaluation quality corresponding to the normal motion segment in the motion script information associated with the competitor.
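For the diving-specific case of claim 12, the second motion tag can be produced by several output heads on top of a shared feature from the multi-task action understanding network, roughly as sketched below; the number of classes per head is purely illustrative and not disclosed by the application.

```python
import torch
import torch.nn as nn

class DivingHeads(nn.Module):
    """Output heads over a shared feature from the multi-task action understanding network."""
    def __init__(self, dim=256):
        super().__init__()
        self.take_off = nn.Linear(dim, 6)            # take-off mode
        self.armstand = nn.Linear(dim, 2)            # armstand diving attribute
        self.rotation_posture = nn.Linear(dim, 4)    # rotation posture
        self.rotation_number = nn.Linear(dim, 10)    # rotation number, binned
        self.twist_number = nn.Linear(dim, 10)       # twist number, binned
        self.quality = nn.Linear(dim, 1)             # action evaluation quality

    def forward(self, shared_feature):               # shared_feature: (batch, dim)
        action_types = {
            "take_off": self.take_off(shared_feature),
            "armstand": self.armstand(shared_feature),
            "rotation_posture": self.rotation_posture(shared_feature),
            "rotation_number": self.rotation_number(shared_feature),
            "twist_number": self.twist_number(shared_feature),
        }
        return action_types, self.quality(shared_feature).squeeze(-1)
```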
13. A video data processing apparatus, comprising:
the shot identification module is used for identifying a shot type of a video frame for a target object in a video to be processed, acquiring a motion segment in the video to be processed, and dividing the motion segment into a normal motion segment and a playback motion segment according to the shot type;
a first tag determining module, configured to determine the shot type as a first motion tag corresponding to the motion segment; the first motion tag is used for recording the shot type corresponding to the motion segment in the motion script information associated with the target object;
the second tag determining module is used for performing multi-task action recognition on the normal motion segment, acquiring a second motion tag corresponding to the normal motion segment, and associating the second motion tag with the playback motion segment; the second motion tag is used for recording the motion type and the motion evaluation quality corresponding to the motion segment in the motion script information.
14. A computer device, comprising: a processor, a memory, and a network interface;
the processor is coupled to the memory and the network interface, wherein the network interface is configured to provide data communication functionality, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method of any of claims 1-12.
15. A computer-readable storage medium, in which a computer program is stored which is adapted to be loaded by a processor and to carry out the method of any one of claims 1 to 12.
CN202011580130.9A 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium Active CN113515997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011580130.9A CN113515997B (en) 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011580130.9A CN113515997B (en) 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium

Publications (2)

Publication Number Publication Date
CN113515997A true CN113515997A (en) 2021-10-19
CN113515997B CN113515997B (en) 2024-01-19

Family

ID=78061003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011580130.9A Active CN113515997B (en) 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN113515997B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006164487A (en) * 2004-11-15 2006-06-22 Sharp Corp Waveform equalizer and equalization method, waveform equalization program, and computer readable recording medium which records waveform equalization program
CN111479130A (en) * 2020-04-02 2020-07-31 腾讯科技(深圳)有限公司 Video positioning method and device, electronic equipment and storage medium
CN111770359A (en) * 2020-06-03 2020-10-13 苏宁云计算有限公司 Event video clipping method, system and computer readable storage medium
CN112084371A (en) * 2020-07-21 2020-12-15 中国科学院深圳先进技术研究院 Film multi-label classification method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杜春炜: "Analysis of Key Technologies of Intelligent Analysis and Their New Applications" (智能分析的关键技术解析及新应用), 中国安防 (China Security & Protection), no. 07 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222165A (en) * 2021-12-31 2022-03-22 咪咕视讯科技有限公司 Video playing method, device, equipment and computer storage medium
CN114222165B (en) * 2021-12-31 2023-11-10 咪咕视讯科技有限公司 Video playing method, device, equipment and computer storage medium
CN114419527A (en) * 2022-04-01 2022-04-29 腾讯科技(深圳)有限公司 Data processing method, data processing equipment and computer readable storage medium
CN117156078A (en) * 2023-11-01 2023-12-01 腾讯科技(深圳)有限公司 Video data processing method and device, electronic equipment and storage medium
CN117156078B (en) * 2023-11-01 2024-02-02 腾讯科技(深圳)有限公司 Video data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113515997B (en) 2024-01-19

Similar Documents

Publication Publication Date Title
CN110781347B (en) Video processing method, device and equipment and readable storage medium
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN109803180B (en) Video preview generation method and device, computer equipment and storage medium
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN113515997B (en) Video data processing method and device and readable storage medium
CN110784759B (en) Bullet screen information processing method and device, electronic equipment and storage medium
WO2022184117A1 (en) Deep learning-based video clipping method, related device, and storage medium
CN113542777B (en) Live video editing method and device and computer equipment
CN109040779B (en) Caption content generation method, device, computer equipment and storage medium
CN113365147B (en) Video editing method, device, equipment and storage medium based on music card point
CN113515998A (en) Video data processing method and device and readable storage medium
CN111160134A (en) Human-subject video scene analysis method and device
WO2022116545A1 (en) Interaction method and apparatus based on multi-feature recognition, and computer device
CN113542865A (en) Video editing method, device and storage medium
WO2014100936A1 (en) Method, platform, and system for manufacturing associated information library of video and for playing video
WO2023056835A1 (en) Video cover generation method and apparatus, and electronic device and readable medium
CN113411517B (en) Video template generation method and device, electronic equipment and storage medium
CN114697741B (en) Multimedia information playing control method and related equipment
CN113709584A (en) Video dividing method, device, server, terminal and storage medium
WO2016203469A1 (en) A digital media reviewing system and methods thereof
CN113747258A (en) Online course video abstract generation system and method
CN114220051B (en) Video processing method, application program testing method and electronic equipment
US20230101254A1 (en) Creation and use of digital humans
CN111739027B (en) Image processing method, device, equipment and readable storage medium
Bazzica et al. Exploiting scene maps and spatial relationships in quasi-static scenes for video face clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant