CN113515997B - Video data processing method and device and readable storage medium - Google Patents


Info

Publication number
CN113515997B
Authority
CN
China
Prior art keywords
motion
segment
action
motion segment
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011580130.9A
Other languages
Chinese (zh)
Other versions
CN113515997A (en)
Inventor
袁微
赵天昊
田思达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011580130.9A
Publication of CN113515997A
Application granted
Publication of CN113515997B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a video data processing method and device and a readable storage medium. The method includes: identifying the shot type of video frames that show a target object in a video to be processed, obtaining motion segments in the video to be processed, and dividing the motion segments into normal motion segments and playback motion segments according to the shot type; determining the shot type as a first motion tag corresponding to the motion segment, the first motion tag being used to record the shot type corresponding to the motion segment in motion scene record information associated with the target object; and performing multi-task action recognition on the normal motion segment to obtain a second motion tag corresponding to the normal motion segment, and associating the second motion tag with the playback motion segment, the second motion tag being used to record the action type and action evaluation quality corresponding to the motion segment in the motion scene record information. With the method and device, normal motion segments and playback motion segments can be distinguished accurately, and the waste of computing resources is reduced.

Description

Video data processing method and device and readable storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a video data processing method, apparatus, and readable storage medium.
Background
With the rapid development of internet technology and the continuous growth of computing power, the performance of video processing technology has improved greatly.
Video processing technology can be used to classify, detect, and analyze videos; it is a very challenging task in the field of computer vision and has received wide attention from both academia and industry. Existing artificial-intelligence recognition techniques can recognize all motion segments in a video clip. However, video clips of a diving competition often contain playback motion segments, and the content shown in a playback motion segment is identical to that of the corresponding motion segment under the normal shot, so the same content may be recognized repeatedly, which wastes computing resources.
Disclosure of Invention
The embodiment of the application provides a video data processing method, a video data processing device and a readable storage medium, which can accurately distinguish normal motion segments from playback motion segments and reduce the waste of computing resources.
In one aspect, an embodiment of the present application provides a video data processing method, including:
identifying the shot type of video frames that show a target object in a video to be processed, obtaining motion segments in the video to be processed, and dividing the motion segments into normal motion segments and playback motion segments according to the shot type;
determining the shot type as a first motion tag corresponding to the motion segment, the first motion tag being used to record the shot type corresponding to the motion segment in motion scene record information associated with the target object; and
performing multi-task action recognition on the normal motion segment, obtaining a second motion tag corresponding to the normal motion segment, and associating the second motion tag with the playback motion segment, the second motion tag being used to record the action type and action evaluation quality corresponding to the motion segment in the motion scene record information.
An aspect of an embodiment of the present application provides a video data processing apparatus, including:
a shot recognition module, configured to identify the shot type of video frames that show a target object in a video to be processed, obtain motion segments in the video to be processed, and divide the motion segments into normal motion segments and playback motion segments according to the shot type;
a first tag determining module, configured to determine the shot type as a first motion tag corresponding to the motion segment, the first motion tag being used to record the shot type corresponding to the motion segment in motion scene record information associated with the target object; and
a second tag determining module, configured to perform multi-task action recognition on the normal motion segment, obtain a second motion tag corresponding to the normal motion segment, and associate the second motion tag with the playback motion segment, the second motion tag being used to record the action type and action evaluation quality corresponding to the motion segment in the motion scene record information.
The shot recognition module includes:
a first frame extraction unit, configured to extract frames from the video to be processed containing the target object to obtain a first video frame sequence;
an image classification unit, configured to input the first video frame sequence into an image classification network and output, through the image classification network, the shot type corresponding to each video frame in the first video frame sequence, the shot type including a normal shot type and a playback shot type; and
a segment dividing unit, configured to obtain a start-stop time sequence corresponding to the motion segments in the first video frame sequence and divide the motion segments into normal motion segments and playback motion segments according to the start-stop time sequence and the shot type.
The segment dividing unit includes:
a feature extraction subunit, configured to input the first video frame sequence into a feature encoding network and output, through the feature encoding network, the picture features corresponding to the first video frame sequence;
a time sequence acquisition subunit, configured to input the picture features into a temporal action segmentation network and output, through the temporal action segmentation network, a start-stop time sequence identifying the motion segments in the first video frame sequence, the start-stop time sequence including at least two start timestamps T1 and at least two end timestamps T2; and
a segment determining subunit, configured to obtain at least two second video frame sequences from the first video frame sequence according to the at least two start timestamps T1 and the at least two end timestamps T2, and determine the at least two second video frame sequences as motion segments; if the shot type corresponding to a motion segment is the normal shot type, the motion segment is determined to be a normal motion segment; if the shot type corresponding to a motion segment is the playback shot type, the motion segment is determined to be a playback motion segment.
The first tag determining module is specifically configured to, when a motion segment is a playback motion segment, identify the shot angle type corresponding to the playback motion segment through the image classification network, the shot angle type including any one of a direct side view type, a side-front view type, a side-rear view type, or a top view type, and determine the shot angle type as the first motion tag corresponding to the playback motion segment.
Wherein the second tag determination module includes:
a second frame extraction unit, configured to extract N video frames from the normal motion segment to form a third video frame sequence, where N is a positive integer; and
an action understanding unit, configured to split the third video frame sequence into M sub-sequences and input the M sub-sequences into a multi-task action understanding network, where the multi-task action understanding network includes M non-local components and M is a positive integer; perform feature extraction on the M sub-sequences through the M non-local components, respectively, to obtain M intermediate action features, one non-local component corresponding to one sub-sequence; fuse the M intermediate action features to obtain a fused feature, and perform a temporal operation on the fused feature through a one-dimensional convolution layer in the multi-task action understanding network to obtain the temporal feature corresponding to the normal motion segment; input the temporal feature into a fully connected layer in the multi-task action understanding network, and input the feature data output by the fully connected layer into a score regressor and a label classifier, respectively; and output, through the score regressor, the action evaluation quality corresponding to the normal motion segment, output, through the label classifier, the action type corresponding to the normal motion segment, and take the action evaluation quality and the action type as the second motion tag corresponding to the normal motion segment.
The second tag determining module is specifically configured to take the normal motion segment immediately preceding the playback motion segment as a target segment according to the start-stop time sequence corresponding to the motion segments, and obtain the second motion tag corresponding to the target segment as the second motion tag corresponding to the playback motion segment.
The video data processing apparatus further includes:
a segment clipping module, configured to clip important event segments from the video to be processed according to the start-stop time sequence corresponding to the motion segments, the motion segments belonging to the important event segments; and
a segment splicing module, configured to generate the tag text corresponding to an important event segment according to the association between the important event segment and a motion segment and the first motion tag and second motion tag corresponding to the motion segment, add the tag text to the important event segment, and splice the important event segments with the added tag text into a video highlight.
The tag text includes a plurality of sub-tags having different tag types.
The segment splicing module is specifically configured to obtain a tag fusion template, the tag fusion template including filling positions respectively corresponding to at least two tag types; add each sub-tag in the tag text to the corresponding filling position according to its tag type to obtain the filled tag fusion template; perform pixel fusion on the filled tag fusion template and the important event segment to obtain the tagged important event segment; and splice the tagged important event segments to obtain the video highlight.
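As an illustration of the label-fusion step described above, the following sketch draws each sub-tag into a template image at a per-type filling position, pixel-blends the template onto the frames of the important event segment, and concatenates the blended segments. The filling positions, font, and blending weight are illustrative assumptions and are not specified by the present application.

```python
# Hedged sketch of tag fusion and highlight splicing (OpenCV assumed available).
import cv2
import numpy as np

FILL_POSITIONS = {"shot": (30, 40), "score": (30, 80), "action": (30, 120)}  # assumed positions

def fuse_labels(segment_frames, sub_labels):
    h, w = segment_frames[0].shape[:2]
    template = np.zeros((h, w, 3), dtype=np.uint8)
    for label_type, text in sub_labels.items():          # add each sub-tag at its filling position
        cv2.putText(template, text, FILL_POSITIONS[label_type],
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
    # pixel fusion of the filled template with every frame of the important event segment
    return [cv2.addWeighted(f, 1.0, template, 0.8, 0) for f in segment_frames]

def splice_highlight(event_segments):
    """Concatenate the tagged important event segments into one highlight sequence."""
    return [frame for seg in event_segments for frame in seg]
```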
The video data processing apparatus further includes:
a first training module, configured to mark the action evaluation quality label and the action type label corresponding to each target action segment in motion sample segments; uniformly extract K video frames from the marked motion sample segments, extract N consecutive video frames from the K video frames as an input sequence, and input the input sequence into an initial action understanding network, where K and N are positive integers, K is greater than N, and N is equal to the input length of the model corresponding to the initial action understanding network; process each group of N/M consecutive video frames through one of the M non-local components in the initial action understanding network, and output the predicted action evaluation quality and predicted action type corresponding to each target action segment, where M is a positive integer and N is an integer multiple of M; generate a quality loss function according to the predicted action evaluation quality and the action evaluation quality label, and generate an action loss function according to the predicted action type and the action type label; and generate a target loss function from the quality loss function and the action loss function, and correct the model parameters of the initial action understanding network through the target loss function to obtain the multi-task action understanding network.
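A minimal sketch of the target loss function described above, assuming the quality loss is a mean-squared-error regression loss and the action loss is a sum of cross-entropy losses over the label classifiers, combined by simple addition; the actual loss forms and weighting are not specified here.

```python
# Hedged sketch: combined quality + action loss for the multi-task network.
import torch.nn.functional as F

def target_loss(pred_score, score_label, pred_action_logits, action_labels):
    quality_loss = F.mse_loss(pred_score, score_label)              # quality loss function
    action_loss = sum(F.cross_entropy(logits, labels)               # action loss function
                      for logits, labels in zip(pred_action_logits, action_labels))
    return quality_loss + action_loss                               # target loss function
```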
The video data processing apparatus further includes:
a second training module, configured to mark the shot type label corresponding to each target action segment in the motion sample segments; extract frames from the marked motion sample segments to obtain a fourth video frame sequence; input the fourth video frame sequence into a lightweight convolutional neural network, and output, through the lightweight convolutional neural network, the predicted shot type corresponding to each target action segment; and generate a shot loss function according to the predicted shot type and the shot type label, and correct the model parameters of the lightweight convolutional neural network through the shot loss function to obtain the image classification network.
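For illustration only, the following sketch fine-tunes a lightweight convolutional neural network (a torchvision MobileNetV2 is assumed as a stand-in) with a cross-entropy shot loss; the data loader, hyper-parameters, and backbone choice are assumptions, not the application's prescribed setup.

```python
# Hedged sketch: training the shot-type image classification network.
import torch
import torch.nn as nn
import torchvision.models as models

def train_shot_classifier(loader, num_shot_types, epochs=5, lr=1e-4):
    net = models.mobilenet_v2()                                   # lightweight CNN stand-in
    net.classifier[-1] = nn.Linear(net.last_channel, num_shot_types)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, shot_labels in loader:                        # frames from marked motion sample segments
            loss = nn.functional.cross_entropy(net(frames), shot_labels)  # shot loss function
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net
```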
The video data processing apparatus further includes:
a third training module, configured to mark the start-stop timestamp label corresponding to each target action segment in the motion sample segments; divide the marked motion sample segments into S sample sub-segments according to a division rule, and extract frames from the S sample sub-segments to obtain a fifth video frame sequence, S being a positive integer; update the start-stop timestamp labels according to the duration of each sample sub-segment to obtain updated start-stop timestamp labels; input the fifth video frame sequence into the feature encoding network to obtain the picture feature matrix corresponding to each sample sub-segment; input the picture feature matrices into an initial action segmentation network, and output, through the initial action segmentation network, the predicted start-stop timestamps corresponding to each target action segment in each sample sub-segment; and generate a time loss function according to the predicted start-stop timestamps and the updated start-stop timestamp labels, and correct the model parameters of the initial action segmentation network through the time loss function to obtain the temporal action segmentation network.
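The update of the start-stop timestamp labels can be pictured with the following sketch, which assumes equal-length sample sub-segments and that each target action segment falls entirely within one sub-segment; the actual division rule is not limited to this.

```python
# Hedged sketch: shift start-stop labels into the local time base of each sub-segment.
def update_timestamp_labels(labels, total_duration, s):
    """`labels` is a list of (start, end) pairs in seconds on the original sample segment."""
    sub_len = total_duration / s
    updated = [[] for _ in range(s)]
    for start, end in labels:
        idx = min(int(start // sub_len), s - 1)       # sub-segment containing the action
        offset = idx * sub_len
        updated[idx].append((start - offset, end - offset))
    return updated
```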
The second tag determining module is specifically configured to input the normal motion segment into the multi-task action understanding network to obtain the take-off mode, armstand dive attribute, rotation posture, number of rotations, number of side rotations, and action evaluation quality corresponding to the normal motion segment, and determine the take-off mode, armstand dive attribute, rotation posture, number of rotations, and number of side rotations as the action type corresponding to the normal motion segment; and determine the action type and the action evaluation quality as the second motion tag corresponding to the normal motion segment, the second motion tag being used to record the action type and action evaluation quality corresponding to the normal motion segment in the motion scene record information associated with the competing athlete.
In one aspect, a computer device is provided, including: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, where the network interface is used to provide a data communication function, the memory is used to store a computer program, and the processor is used to call the computer program to execute the method in the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, the computer program being adapted to be loaded by a processor and to perform a method according to embodiments of the present application.
In one aspect, the embodiments of the present application provide a computer program product or computer program that includes computer instructions stored in a computer readable storage medium; a processor of a computer device reads the computer instructions from the computer readable storage medium and executes them, so that the computer device performs the method in the embodiments of the present application.
According to the embodiments of the present application, the shot type of the video frames showing a target object in a video to be processed can be identified, and the motion segments in the video to be processed can be obtained, so that the motion segments can be divided into normal motion segments and playback motion segments according to the shot type, achieving accurate segmentation of normal and playback motion segments. Only the normal motion segments are input into the multi-task action understanding network for recognition to obtain the corresponding action type and action evaluation quality; the shot type can be used as the first motion tag, the action type and action evaluation quality can be used as the second motion tag, and the motion scene record information can finally be generated from the first motion tag and the second motion tag. Therefore, recognizing only the normal motion segments reduces the waste of computing resources, manual registration of the motion scene record information can be avoided, and automatically generating the motion scene record information improves its generation efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a video data processing scenario according to an embodiment of the present application;
Fig. 3 is a flowchart of a video data processing method according to an embodiment of the present application;
Fig. 4 is a flowchart of a video data processing method according to an embodiment of the present application;
Figs. 5a-5b are schematic flow diagrams of a video data processing method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a multi-task action understanding network according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Artificial intelligence (AI) is a theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV) is the science of how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to recognize, track, and measure targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include data processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
The solutions provided in the embodiments of the present application relate to artificial intelligence technologies such as computer vision and deep learning, and the specific processes are described in the following embodiments.
Please refer to Fig. 1, which is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in Fig. 1, the system architecture may include a server 10a and a terminal cluster, and the terminal cluster may include terminal device 10b, terminal device 10c, ..., and terminal device 10d. Communication connections may exist within the terminal cluster; for example, a communication connection exists between terminal device 10b and terminal device 10c, and between terminal device 10b and terminal device 10d. Meanwhile, any terminal device in the terminal cluster may have a communication connection with the server 10a; for example, a communication connection exists between terminal device 10b and the server 10a. The communication connection is not limited to a particular manner: it may be a direct or indirect wired connection, a direct or indirect wireless connection, or another manner, which is not limited herein.
It should be understood that each terminal device in the terminal cluster shown in Fig. 1 may be provided with an application client, and when the application client runs on a terminal device it may exchange data with the server 10a shown in Fig. 1. The application client may be a client with a frame sequence (e.g., frame animation) loading and playing function, such as a multimedia client (e.g., a video client), a social client, an entertainment client (e.g., a game client), an education client, or a live-streaming client. The application client may be an independent client or an embedded sub-client integrated in another client (e.g., a multimedia, social, or education client), which is not limited herein. The server 10a provides services for the terminal cluster through its communication function. When a terminal device (which may be terminal device 10b, 10c, or 10d) obtains a video clip A and needs to process it, for example to clip important event segments (such as highlight segments) from video clip A and to classify and tag them, the terminal device may send video clip A to the server 10a through the application client. After receiving video clip A, the server 10a may identify, for the target object, the shot type of every video frame in video clip A, and may obtain motion segments B (for example, the segments in a diving competition video in which a competitor performs a dive) from video clip A according to the identified start-stop time sequence, so that the motion segments B can be divided into normal motion segments C and playback motion segments D according to the shot type. The server 10a may then determine the shot type as the first motion tag corresponding to each motion segment B, which can later be used to record the shot type corresponding to the motion segment B in the motion scene record information associated with the target object. Further, the server 10a may perform multi-task action recognition on the normal motion segments C to obtain the second motion tag corresponding to each normal motion segment C and associate the second motion tag with the related playback motion segment D, so that the action type and action evaluation quality corresponding to each motion segment B can be recorded in the motion scene record information according to the second motion tag. Finally, the motion scene record information can be generated from the first motion tags and the second motion tags, the important event segments can be clipped from video clip A according to the start-stop time sequence corresponding to the motion segments B and spliced into a video highlight, and the motion scene record information and the video highlight can be returned to the application client of the terminal device. After receiving them, the application client may display the motion scene record information and the video highlight on the screen of the terminal device.
The motion scene record information may also include information such as the start timestamp and end timestamp corresponding to each motion segment B; only shot type recognition, action type recognition, and action quality evaluation are described here as examples. It should be noted that the number of motion segments B, normal motion segments C, and playback motion segments D may each be one or more, which is not limited herein. It can be understood that a normal motion segment and its playback motion segment both record the same live action, but the two may differ in shooting angle and playback speed: the shooting angle of the normal motion segment is usually fixed, the shooting angle of the playback motion segment may vary (e.g., a top view, a direct front view, and so on), and the playback speed of the playback motion segment is lower than that of the normal motion segment.
It is understood that the methods provided in the embodiments of the present application may be performed by a computer device, which includes but is not limited to a terminal device or a server; the server 10a in the embodiments of the present application may be such a computer device. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. The terminal device may be a smartphone, tablet computer, notebook computer, desktop computer, palmtop computer, mobile internet device (MID), wearable device (e.g., smart watch or smart band), smart television, or smart in-vehicle terminal capable of running the application client. The number of terminal devices and servers is not limited, and the terminal devices and the server may be connected directly or indirectly through wired or wireless communication, which is not limited herein.
It can be understood that the present application may be applied to various forms of video clipping. The following description takes processing a video clip of a diving competition as an example; please refer to Fig. 2, which is a schematic diagram of a video data processing scenario according to an embodiment of the present application. As shown in Fig. 2, the computer device implementing this scenario may include modules for shot type recognition, start-stop time sequence acquisition, and multi-task action recognition. These modules may run on the server 10a shown in Fig. 1, or on terminal device 10b, 10c, or 10d shown in Fig. 1, which is not limited herein; a server is taken as the execution subject only as an example. As shown in Fig. 2, an end user uploads a video clip 20b of a diving competition as the video to be processed through the terminal device 20a and finally obtains a video highlight spliced from multiple important event segments in video clip 20b together with the corresponding scene record. First, the terminal device 20a may display prompt information on its screen so that the end user can complete the related operations; for example, the text prompt "drag a file here, or click to upload" may be displayed in area 201a, indicating that the end user can drag a locally stored video to be processed into area 201a or select it by clicking the "click to upload" control. If no video to be processed is stored locally, as shown in area 202a of the terminal device 20a, the end user may enter a resource link of the video to be processed (for example, copy and paste the link into the corresponding input box), and the server 20c may later obtain the video through this link. The link format may be URL, HTML, UBB, or the like; the end user may choose according to the specific format of the resource link, the supported formats may be increased or decreased according to the actual business scenario, or the terminal device 20a may automatically recognize the format of the entered link, which is not limited in the present application. Assuming that the end user selects a local video clip 20b for processing, the terminal device 20a may, in response to the end user's trigger operation (e.g., a click) on the "confirm upload" control 203a, send a video processing request to the server 20c (corresponding to the server 10a in Fig. 1) and send video clip 20b to the server 20c. Video clip 20b contains pictures of a target object (here, a competitor), and the number of target objects may be one or more, which is not limited herein.
After receiving the video processing request sent by the terminal device 20a, the server 20c invokes a relevant interface (such as a web interface; video clip 20b may also be passed to the video processing model in other forms, which is not limited in the embodiments of the present application). The video processing model may include the modules for shot type recognition, start-stop time sequence acquisition, and multi-task action recognition mentioned above, and can automatically clip all diving segments contained in a received long video clip and complete scoring and labeling. Specifically, the video processing model first extracts frames from video clip 20b to obtain a first video frame sequence 20d and identifies the shot type corresponding to each video frame in the first video frame sequence 20d. At the same time, it may perform feature extraction on the first video frame sequence 20d to obtain the picture feature corresponding to each video frame, forming a picture feature sequence, from which a start-stop time sequence identifying the motion segments in the first video frame sequence 20d can be obtained, that is, the sequence of start-stop timestamps corresponding to the motion segments, where a start-stop timestamp pair consists of a start timestamp and an end timestamp. Further, according to the start-stop time sequence, several shorter video frame sequences can be extracted from the first video frame sequence 20d as motion segments, and the motion segments can then be divided into normal motion segments and playback motion segments according to the shot type. When a motion segment is a playback motion segment, the server 20c may further identify the shot angle type corresponding to the playback motion segment as part of its shot type, and the shot type corresponding to each motion segment may then be used as the first motion tag corresponding to that motion segment. In addition, according to the start-stop time sequence corresponding to the motion segments, the corresponding segments can be clipped from video clip 20b; for ease of understanding and distinction, the clipped segments are called important event segments, and the relevant interface can be invoked to splice the important event segments into the video highlight 20f and return it to the end user's terminal device. In fact, since the motion segments are extracted from the frame-sampled first video frame sequence 20d, a motion segment can be regarded as a subset of an important event segment. An important event (which may also be called a motion event) may refer to an event whose pictures contain the target object, or an event whose pictures do not contain the target object but that has special significance. The specific meaning of an important event differs in different scenarios; for example, in a diving competition, an important event may refer to a diving event, that is, the event in which a competitor performs a dive.
Using a video database containing massive videos, the computer device can train deep neural networks to produce the modules for shot type recognition, start-stop time sequence acquisition, and multi-task action recognition; the specific generation process is described in the following embodiments.
After the normal motion segments and playback motion segments are obtained through the above process, more detailed information (such as the action type) needs to be acquired, so the motion segments need to be recognized and detected. In a diving competition, the different dives in the normal motion segments have consistent viewing angles and similar durations; to reduce the waste of computing resources caused by repeatedly recognizing the same content, action recognition and score prediction are performed only on the normal motion segments, and each playback motion segment reuses the result of its corresponding normal motion segment. The server therefore performs multi-task action recognition on the normal motion segments. As shown in Fig. 2, assume that 9 motion segments are obtained after segment division, namely motion segments D1, D2, ..., D8, and D9, where motion segments D1, D4, and D7 are normal motion segments, motion segments D2 and D3 are playback motion segments related to motion segment D1, motion segments D5 and D6 are playback motion segments related to motion segment D4, and motion segments D8 and D9 are playback motion segments related to motion segment D7. The server 20c may input motion segments D1, D4, and D7 into the multi-task action understanding network (i.e., the multi-task action recognition module, a network that performs action quality assessment and multi-label action classification simultaneously) and output the action type and action evaluation quality (i.e., the prediction score) corresponding to each of these motion segments. In a diving competition scenario, the action types may include, but are not limited to, the take-off mode, the armstand dive attribute, the rotation posture, the number of rotations, and the number of side rotations.
Each motion segment has corresponding motion tags (a first motion tag and a second motion tag). The motion tags corresponding to motion segments D1, D2, and D3 are taken as an example; the motion tags of the other motion segments are generated in the same way. As mentioned above, motion segment D1 is a normal motion segment, so its first motion tag is its shot type, which may be recorded, for example, as "normal shot". A "normal shot" can be understood as the footage captured by the camera shooting the scene live, as opposed to a replay. Assuming that the take-off mode corresponding to motion segment D1 is "tuck", the armstand attribute is "non-armstand dive", the rotation posture is "forward, facing the pool", the number of rotations is 4.5, the number of side rotations is 0, and the action evaluation quality is 97.349174 points, the second motion tag corresponding to motion segment D1 may be composed of "tuck", "non-armstand dive", "forward, facing the pool", "4.5 rotations", "0 side rotations", and "97.349174 points". For the playback motion segments related to motion segment D1 (i.e., motion segments D2 and D3), the server 20c may directly take the second motion tag of motion segment D1 as the second motion tag of motion segment D2 and of motion segment D3, and then obtain the shot angle type corresponding to each of the two playback motion segments to form their respective first motion tags; for example, the first motion tag corresponding to motion segment D2 may be the side-front view type. The process of generating the first motion tag corresponding to motion segment D3 is identical to that of motion segment D2 and is not repeated here.
Further, based on the motion tags corresponding to each motion segment, the server 20c may generate a text scene record 20e associated with the competitor. Specifically, the motion tags corresponding to each motion segment may be arranged in the text scene record 20e in the order of the start-stop timestamps corresponding to the motion segments. As shown in Fig. 2, for example, "t1-t2 normal shot, score: 97.349174, take-off mode: tuck, non-armstand dive, rotation posture: forward, facing the pool, number of rotations: 4.5, number of side rotations: 0" may be displayed in the text scene record 20e, indicating that a competitor performs a dive from time t1 to time t2 in video clip 20b, that the motion segment from t1 to t2 is a normal motion segment (corresponding to motion segment D1), and that the detailed action type and prediction score of the dive are recorded. From time t3 to time t4 in video clip 20b there is a corresponding playback motion segment (corresponding to motion segment D2); in the text scene record 20e, in addition to its detailed shot type (e.g., "playback shot - side-front view"), its corresponding action type and prediction score are recorded, and it can be seen that they are the same as those of motion segment D1. Optionally, to avoid displaying too much repeated information in the text scene record 20e, only the shot type may be displayed at the position corresponding to a playback motion segment, with the other information referring to the motion scene record information of the preceding normal motion segment. It can be understood that the text scene record 20e shown in Fig. 2 is merely an exemplary template provided in the embodiments of the present application; its specific form may be designed according to actual needs, which is not limited in the embodiments of the present application.
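Purely as an illustration of how one entry of the text scene record 20e might be assembled from a segment's motion tags, the sketch below mirrors the example layout above; the field names, dictionary keys, and formatting are hypothetical.

```python
# Hedged sketch: format one text scene record entry from a segment's motion tags.
def format_record_entry(seg):
    line = f'{seg["start"]}-{seg["end"]} {seg["first_tag"]}'
    if seg["shot_type"] == "normal":          # playback entries may show only the shot type
        tag = seg["second_tag"]
        line += (f', score: {tag["score"]:.6f}, take-off mode: {tag["take_off"]}, '
                 f'{tag["armstand"]}, rotation posture: {tag["posture"]}, '
                 f'rotations: {tag["rotations"]}, side rotations: {tag["side_rotations"]}')
    return line
```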
Finally, the server 20c may package the text scene record 20e and the video highlight 20f spliced from the important event segments, and return the corresponding resource address 20g, expressed as a uniform resource locator (URL), to the end user's terminal device through the relevant interface, so that the terminal device can, in response to the end user's trigger operation on the resource address 20g, display the video highlight 20f and the text scene record 20e on its screen.
It can be appreciated that the server 20c may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. The modules mentioned above may be distributed over multiple physical servers or multiple cloud servers; that is, the computation for all video clips uploaded by end users can be completed in parallel in a distributed or clustered manner, so that the video clips can be automatically clipped and classified and complete motion scene record information can be generated quickly and accurately.
In summary, based on deep neural networks, the embodiments of the present application can quickly obtain the motion segments of a target object from a video to be processed and divide them into normal motion segments and playback motion segments, so that each motion segment can be classified and given motion tags. Therefore, accurate segmentation of normal motion segments and playback motion segments can be achieved, the waste of computing resources is reduced, and the accuracy and efficiency of action type recognition are improved. Finally, all important event segments can be automatically clipped from the video to be processed and the scene record can be generated automatically, which can replace manual work in actual business scenarios, saving considerable labor cost, reducing human error, and improving the generation efficiency of the motion scene record information.
Further, please refer to fig. 3, which is a flowchart illustrating a video data processing method according to an embodiment of the present application. The video data processing method may be performed by the terminal device or the server as shown in fig. 1, or may be performed by both the terminal device and the server, and in this embodiment, the method is described as being performed by the server. As shown in fig. 3, the method may include the steps of:
Step S101: identify the shot type of the video frames showing a target object in a video to be processed, obtain the motion segments in the video to be processed, and divide the motion segments into normal motion segments and playback motion segments according to the shot type.
specifically, the terminal user uploads the video clip stored locally through the terminal device as the video to be processed, or inputs a resource link corresponding to the video to be processed (for a specific process, see the embodiment corresponding to fig. 2, above), and the server may obtain the video to be processed, where the video to be processed includes one or more target objects, where the specific number of target objects may be one or more. The video to be processed may be decoded and frame-extracted by a video editor or video editing algorithm, for example, through Adobe Premiere Pro (a common video editing software developed by Adobe corporation), fast Forward Mpeg (FFmpeg for short, is a set of Open source computer programs that may be used to record, convert digital audio and video, and convert it into streams), and Open CV (a cross-platform computer vision and machine learning software library issued based on the license of berkeley software suite, which may be run on a variety of operating systems), and the server may obtain each frame image of the video to be processed, and may further uniformly frame-extract the obtained multi-frame images at fixed time intervals, thereby obtaining the first video frame sequence. The specific fixed time interval selected during the frame extraction is determined according to the actual situation, and it is preferable to uniformly obtain the video frame of each important event segment.
Optionally, if the terminal device is provided with a video editor or can run a video editing algorithm, the end user may first decode and frame-sample the video to be processed on the terminal device to generate the corresponding first video frame sequence, and then send the first video frame sequence to the server. The process of locally generating the first video frame sequence corresponding to the video to be processed is the same as the process by which the server generates it, so it is not described again here.
Further, the server may input the obtained first video frame sequence into the image classification network and output, through the image classification network, the shot type corresponding to each video frame in the first video frame sequence. In addition, the start-stop time sequence corresponding to the motion segments in the first video frame sequence can be obtained. Specifically, the first video frame sequence can be input into the feature encoding network, and the picture features corresponding to the first video frame sequence can be extracted through the feature encoding network. Because video clips in actual business are in color, the extracted video frames are color pictures; for example, a color picture in RGB mode has three color channels, Red, Green, and Blue. The feature encoding network may specifically be a MobileNet convolutional neural network (a lightweight convolutional neural network proposed by Google), which replaces standard convolutions with depthwise separable convolutions, greatly reducing the amount of computation and the number of model parameters and improving model speed. The picture RGB features of each video frame can be extracted through the MobileNet convolutional neural network and fused to obtain the picture features corresponding to the first video frame sequence. The server may further input the picture features into the temporal action segmentation network (a network for locating the start time and end time at which a target action occurs), output through the temporal action segmentation network the start-stop timestamps identifying the motion segments in the first video frame sequence, and arrange the start-stop timestamps corresponding to each motion segment in chronological order to obtain the start-stop time sequence. For example, for a video clip of a diving competition, the start-stop timestamps corresponding to the diving pictures and the non-diving pictures in the clip can be obtained through the temporal action segmentation network, making it easy to clip out the diving segments. A start-stop timestamp pair includes a start timestamp and an end timestamp.
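As a rough illustration of the feature-encoding step, the sketch below maps each sampled frame to a pooled feature vector with a MobileNetV2 backbone from torchvision, assumed here as a stand-in for the feature encoding network; the resulting feature matrix is what would be fed to the temporal action segmentation network.

```python
# Hedged sketch: per-frame picture features from a MobileNetV2 backbone (assumption).
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.mobilenet_v2()
backbone.classifier = torch.nn.Identity()   # keep the 1280-d pooled feature per frame
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_frames(frames):
    """Map the first video frame sequence to a (num_frames, 1280) picture feature matrix."""
    batch = torch.stack([preprocess(f[:, :, ::-1].copy()) for f in frames])  # BGR -> RGB
    return backbone(batch)
```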
Assume that the start-stop time sequence includes at least two start timestamps T1 and at least two end timestamps T2. A start timestamp T1 and the end timestamp T2 immediately following it form a timestamp pair, and the video frames in the first video frame sequence that fall between this start timestamp T1 and end timestamp T2 can be determined as a second video frame sequence; at least two second video frame sequences can thus be obtained, each of which is a motion segment. The motion segments can then be divided into normal motion segments and playback motion segments according to the shot type corresponding to the video frames in each motion segment: if the shot type corresponding to a motion segment is the normal shot type, the motion segment is determined to be a normal motion segment; if the shot type corresponding to a motion segment is the playback shot type, the motion segment is determined to be a playback motion segment. In this way, normal motion segments and playback motion segments can be distinguished.
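A simplified sketch of pairing the start and end timestamps and dividing the resulting motion segments by shot type; resolving the segment-level shot type by a majority vote over per-frame predictions is an assumption made for this sketch, not a requirement of the present application.

```python
# Hedged sketch: slice motion segments from the frame sequence and split by shot type.
def split_motion_segments(frames, shot_labels, start_stop_pairs):
    normal, playback = [], []
    for t1, t2 in start_stop_pairs:                  # (start frame index, end frame index)
        segment = frames[t1:t2 + 1]                  # second video frame sequence = motion segment
        labels = shot_labels[t1:t2 + 1]
        # majority vote over per-frame shot types decides the segment's shot type
        if labels.count("playback") > len(labels) / 2:
            playback.append((t1, t2, segment))
        else:
            normal.append((t1, t2, segment))
    return normal, playback
```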
It should be noted that the image classification network and the feature encoding network may be the same network model, or two independent network models with the same network structure; specifically, each may be a MobileNet convolutional neural network providing both shot type recognition and image feature extraction.
Step S102: determine the shot type as the first motion tag corresponding to the motion segment, the first motion tag being used to record the shot type corresponding to the motion segment in the motion scene record information associated with the target object.
specifically, the server may generate a first motion label corresponding to each of the lens types corresponding to each of the motion segments obtained in the step S101, where the first motion label may be used to record the lens type corresponding to the motion segment in the motion scene information associated with the target object, for example, for a normal motion segment, in a diving match scene, the lens angles of different diving actions are consistent, so that the lens angle types may not be distinguished, the corresponding lens type is a normal lens type, and then the specific content of the corresponding first motion label may be "normal lens"; for the playback motion segment, the detailed lens angle type corresponding to the playback motion segment can be further identified through the image classification network, and further, a first motion label corresponding to the playback motion segment can be generated according to the lens angle type, for example, see the playback motion segment D2 in the embodiment corresponding to fig. 2, where the first motion label is a "playback lens-side front view". The lens angle type may specifically include any one of a positive side view angle type, a positive front view angle type, a side rear view angle type, or a top view angle type, and of course, other lens angle types may be added according to practical applications.
Step S103: perform multi-task action recognition on the normal motion segment, obtain the second motion tag corresponding to the normal motion segment, and associate the second motion tag with the playback motion segment, the second motion tag being used to record the action type and action evaluation quality corresponding to the motion segment in the motion scene record information.
Specifically, through the above step S101, one or more normal motion segments may be obtained, and the server may randomly extract N continuous video frames from each normal motion segment to form a third video frame sequence corresponding to that normal motion segment. Only one normal motion segment is used here as an example to describe the processing; it is to be understood that multiple normal motion segments may be processed in parallel or sequentially in temporal order, which is not limited in the embodiments of the present application. The multitask action understanding network adopted in the embodiments of the present application can simultaneously realize action type recognition and action quality assessment. The network is composed of M non-local components (also called non-local modules), and the M non-local components can share parameters. Because the number of video frames input to each non-local component must be a fixed value, the server can split the third video frame sequence into M subsequences, each subsequence containing N/M continuous video frames. The M subsequences can be respectively input into the M non-local components in the multitask action understanding network, and each non-local component processes one subsequence, specifically performing feature extraction on the subsequence, so that M intermediate action features can be obtained. The M intermediate action features are then fused to obtain a fusion feature, and a time sequence operation is performed on the fusion feature through two one-dimensional convolution layers in the multitask action understanding network to obtain the time sequence feature corresponding to the normal motion segment. The time sequence feature is input into a fully connected layer in the multitask action understanding network, and the feature data output by the fully connected layer is respectively input into a score regressor and label classifiers; the score regressor outputs the action evaluation quality corresponding to the normal motion segment, the label classifiers output the action type corresponding to the normal motion segment, and the action type and the action evaluation quality together serve as the second motion label corresponding to the normal motion segment. Wherein M and N are positive integers, N is equal to the model input data length corresponding to the action understanding network, and N is an integer multiple of M. The second motion label is used for recording, in the motion script information, the action type and the action evaluation quality corresponding to each motion segment, and the action evaluation quality can be understood as a prediction score (such as a diving action score) generated according to the action executed by the target object. The number of label classifiers can be determined according to the number of action type categories that actually need to be output.
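A hypothetical PyTorch sketch of the multitask action understanding pipeline described above is given below. Each of the M branches in this application is a parameter-shared non-local component; in this sketch a single shared stand-in encoder takes its place, and all dimensions, kernel sizes and head counts are illustrative assumptions rather than the claimed implementation.

```python
# Hypothetical sketch of the multitask action understanding head described above.
# A shared stand-in encoder replaces the parameter-shared non-local components;
# the 2048-d features, kernel sizes and number of label heads are illustrative.
import torch
import torch.nn as nn

class MultiTaskActionNet(nn.Module):
    def __init__(self, m_components=12, feat_dim=2048, num_label_heads=5, classes_per_head=8):
        super().__init__()
        # stand-in for one parameter-shared non-local component: maps a subsequence
        # of N/M frames to one intermediate action feature vector
        self.component = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # two 1-D convolution layers performing the time sequence operation over M features
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(512, 256)
        self.score_regressor = nn.Linear(256, 1)            # action evaluation quality
        self.label_classifiers = nn.ModuleList(             # action type heads
            [nn.Linear(256, classes_per_head) for _ in range(num_label_heads)])
        self.m = m_components

    def forward(self, subsequences):
        # subsequences: tensor of shape (M, 3, N/M, H, W) for one normal motion segment
        feats = torch.stack([self.component(s.unsqueeze(0)).squeeze(0) for s in subsequences])
        fused = feats.t().unsqueeze(0)                      # (1, feat_dim, M) fusion feature
        temporal = self.temporal(fused)                     # time sequence feature
        hidden = torch.relu(self.fc(temporal))
        score = self.score_regressor(hidden)                # predicted action score
        labels = [head(hidden) for head in self.label_classifiers]
        return score, labels
```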
Further, since a playback motion segment is merely a slow-motion replay of its associated normal motion segment, the playback motion segment may directly multiplex the second motion label corresponding to the associated normal motion segment; that is, the server may associate the second motion label with the playback motion segment. Specifically, according to the start-stop time sequence corresponding to the motion segments in step S101, the normal motion segment immediately preceding a certain playback motion segment may be taken as the target segment, and the second motion label corresponding to the target segment may then be obtained as the second motion label corresponding to the playback motion segment. At this point, each motion segment has obtained its corresponding first motion label and second motion label. The server can arrange the first motion label and the second motion label corresponding to each motion segment in order of the start-stop time stamps corresponding to each motion segment to form the motion script information associated with the target object, and a text script can be generated according to the motion script information; a specific form can be seen in the text script 20e in fig. 2.
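A minimal sketch of this label multiplexing is shown below, assuming the motion segments are kept as a time-ordered list of dictionaries; the field names are illustrative only.

```python
# Hypothetical sketch: a playback motion segment reuses the second motion label of the
# preceding normal motion segment (the "target segment"). Segment dicts are illustrative.
def attach_second_labels(segments):
    """segments: list of dicts sorted by start time, each with keys
    'shot' ('normal'/'playback') and, for normal segments, 'second_label'."""
    last_normal = None
    for seg in segments:
        if seg["shot"] == "normal":
            last_normal = seg
        elif last_normal is not None:
            # playback is only a slow-motion replay of the preceding normal segment
            seg["second_label"] = last_normal["second_label"]
    return segments
```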
According to the embodiments of the present application, the lens type of the video frames for the target object in the video to be processed can be identified, and the start-stop time sequence corresponding to the motion segments in the video to be processed can be obtained, so that the motion segments can be divided into normal motion segments and playback motion segments according to the lens type and the start-stop time sequence, realizing accurate segmentation of normal motion segments and playback motion segments. The normal motion segments can be input into the multitask action understanding network for recognition to obtain the corresponding action type and action evaluation quality; the lens type can be used as the first motion label, and the action type and the action evaluation quality can be used as the second motion label; finally, the motion script information can be generated according to the first motion label and the second motion label. Therefore, by recognizing only the normal motion segments, waste of computing resources can be reduced, and the large amount of manpower otherwise consumed in classifying actions and registering motion script information can be avoided, so that the accuracy and efficiency of action type recognition can be improved, the accuracy of action quality assessment can be improved, the actions of the target object can be scored automatically, and the motion script information can be generated automatically and efficiently.
The current editing of video clips of diving matches has the following problems: manual recording requires a lot of manpower; the prior art cannot automatically and accurately edit the diving action segments, so that the clipped diving moments (namely, the moments of the match containing normal diving shots and replay shots) are mixed with background moments (namely, the moments of the match not containing diving shots), or normal shots and slow-motion replay shots are output mixed together, which affects the viewing experience; in addition, it is difficult for the prior art to accurately distinguish the type of the diving action and the number of rotations, to accurately evaluate the action quality, and to automatically score the diving action. The method provided by the present application can solve these problems: it provides a set of multi-modal intelligent script tools that automatically clip all diving pictures in a diving match through the calculation results of multi-modal networks and automatically output a script containing action scores and action types.
The following describes an example of processing a video clip of a diving match; please refer to fig. 4, which is a flow chart of a video data processing method according to an embodiment of the present application. As shown in fig. 4, the video data processing method can be applied to an intelligent script scheme for diving matches, where the intelligent script refers to using an algorithm to automatically locate all the moments at which a sports event (such as a diving action, including normal shots and playback shots) occurs in a video clip, and describing the event in words. The method specifically comprises the following steps:
a) The video clips of the diving match are frame-extracted at fixed time intervals (for example, 0.5 seconds);
b) Inputting the video frames obtained in a) into an image classification network to obtain the lens type category, and simultaneously inputting the video frames into a feature encoding network to extract the picture RGB features;
c) Inputting the picture RGB features obtained in b) into the time sequence action segmentation network to obtain the start-stop time sequence and the shot types of the occurrence of the diving actions, and clipping the diving segments out of the diving match video clip; in addition, the corresponding video frames can be extracted from the video frames in a) according to the start-stop time sequence to form action segment sequences (comprising normal shots and multi-view playback shots), and these action segment sequences belong to the diving segments;
d) Video frames corresponding to the action fragment sequences of the normal shots are sent into a multi-task action understanding network, and the network can simultaneously realize two tasks of action quality assessment and multi-label action classification, so that detailed action labels (comprising a take-off mode, arm strength diving attributes, rotation postures, rotation numbers and side rotation numbers) and action scores of the action fragment sequences can be obtained;
e) Splicing the diving clips in c) into video highlights as video abstracts;
f) Generating a script (record sheet) based on the start-stop time sequence of the occurrence of the diving actions obtained in c), and the action labels and action scores predicted in d).
The method provided by the embodiment of the application can accurately clip all diving pictures in a diving match and generate a complete scene record list of the whole diving match by using a multi-mode technology (namely a plurality of video processing technologies including picture classification, time sequence action segmentation, action classification and action quality evaluation).
Further, the processing procedure of the video clip of the diving game will be described in detail. Fig. 5 a-5 b are schematic flow diagrams of a video data processing method according to an embodiment of the present application. The video data processing method may be performed by the terminal device or the server as shown in fig. 1, or may be performed by both the terminal device and the server, and in this embodiment, the method is described as being performed by the server. As shown in fig. 5 a-5 b, the method may comprise the steps of:
step S201, frame extraction is carried out on a video to be processed containing a target object, and a first video frame sequence is obtained;
specifically, as shown in fig. 5a, the end user uploads a video segment 30a of a diving match as the video to be processed through a held terminal device; in this case, the target objects are the competitors in the video segment 30a. After receiving the video segment 30a, the server 30b (which may correspond to the server 10a in fig. 1) may uniformly extract frames from the video segment 30a at a fixed time interval (for example, 0.5 seconds) to obtain the corresponding first video frame sequence 30c. Assuming that 700 video frames are obtained after the frame extraction, as shown in fig. 5a, the first video frame c1, the first video frame c2, …, the first video frame c699, and the first video frame c700 may be obtained after uniformly extracting frames from the video segment 30a, and these frames may form the first video frame sequence 30c according to the video time sequence.
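A minimal frame-extraction sketch is given below, assuming OpenCV is used as the decoder; the 0.5-second interval matches the example above, and the fallback frame rate is an assumption added for robustness.

```python
# Hypothetical frame-extraction sketch: sample one frame every 0.5 seconds from the
# uploaded video segment. OpenCV is an assumption; any decoder with seek support works.
import cv2

def extract_first_frame_sequence(video_path, interval_s=0.5):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back to 25 fps if metadata is missing
    step = max(int(round(fps * interval_s)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)              # BGR image; convert to RGB before encoding
        idx += 1
    cap.release()
    return frames
```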
Step S202, inputting a first video frame sequence into an image classification network, and outputting a lens type corresponding to each video frame in the first video frame sequence through the image classification network;
specifically, as shown in fig. 5a, the server 30b inputs the first video frame sequence 30c into the image classification network 301d, and may obtain the shot types corresponding to the first video frame c1, the first video frame c2, …, the first video frame c699, and the first video frame c700, so as to form a shot type sequence 30e corresponding to the first video frame sequence 30c.
Step S203, inputting the first video frame sequence into a feature encoding network, and outputting picture features corresponding to the first video frame sequence through the feature encoding network;
specifically, as shown in fig. 5a, the server 30b may input the first video frame sequence 30c into the feature encoding network 302d, so as to obtain the picture RGB features corresponding to the first video frame c1, the first video frame c2, …, the first video frame c699, and the first video frame c700, and fuse the picture RGB features corresponding to each video frame, so as to obtain the picture feature 30f corresponding to the first video frame sequence 30c. Wherein the feature encoding network 302d and the image classification network 301d may be the same network.
It should be noted that, before the image classification network and the feature encoding network are used in an actual service, these network models need to be trained with enough sample data so that the output of the network models matches the expected values. Specifically, a large number of diving match videos are collected as motion sample segments, and a shot type label (including the normal shot type and the playback shot types) and start-stop time stamps can be marked for each target action segment (namely each diving action segment) in the motion sample segments. The marked motion sample segments are frame-extracted to obtain a fourth video frame sequence, which is then input into a lightweight convolutional neural network (specifically, a MobileNet), and the predicted shot type corresponding to each target action segment can be output through the lightweight convolutional neural network. Further, a lens loss function can be generated according to the predicted shot type and the shot type label, and the model parameters in the lightweight convolutional neural network are corrected through the lens loss function, so that the trained image classification network is obtained. The training process of the feature encoding network is consistent with that of the image classification network and will not be described in detail here.
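The following is a hypothetical training-loop sketch for the image classification network, assuming a torchvision MobileNetV2 backbone, a cross-entropy lens loss over per-frame shot type labels, and a 7-way output (1 normal shot class, 5 playback shot classes and background, as described later); none of these concrete choices is mandated by the embodiments of the present application.

```python
# Hypothetical training sketch for the image classification network: a lightweight
# (MobileNet-style) backbone trained with a cross-entropy "lens loss" on per-frame
# shot type labels. Backbone choice and 7-way output are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

def train_shot_classifier(loader, num_shot_types=7, epochs=10, lr=1e-3):
    model = torchvision.models.mobilenet_v2(num_classes=num_shot_types)
    lens_loss = nn.CrossEntropyLoss()
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, shot_labels in loader:          # frames: (B, 3, H, W), shot_labels: (B,)
            logits = model(frames)
            loss = lens_loss(logits, shot_labels)   # predicted shot type vs. shot type label
            optim.zero_grad()
            loss.backward()
            optim.step()
    return model
```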
In addition, for the sports scene of the diving race, the specific content of the sports scene information can refer to the lens type and the action label classification in table 1 below, and can be increased or decreased according to the actual service requirement, which is not limited in the embodiment of the present application.
TABLE 1 shot type and action tag Classification
For example, there is a need to obtain an image classification network and a feature encoding network that can process video clips of a diving race, and the training process can be as follows:
a) And (3) data marking: all long videos of the diving match (namely, movement sample fragments) are divided into a training set and a testing set, for all diving actions (namely, target action fragments) occurring in each long video, starting (starting) and ending (falling) time pairs (namely, starting and ending time stamps) of the action are marked, and the shot type of each diving action is marked, wherein the marking content is shown in the table 1.
b) And (3) data arrangement: and (3) extracting frames of the long video at a fixed frequency (e.g. 2 FPS), wherein the labels of the video frames in the range of the labeling time interval are the corresponding lens labeling labels in a) according to the time pair labeled in a), and the labels of the video frames not in any time interval range are the "background", namely the non-diving pictures.
c) Model training: a MobileNet convolutional neural network is trained with the video frame data obtained in b) to obtain an image classification network with 7 classes (1 normal shot class + 5 playback slow-motion shot classes + background), which gives the shot type classification of each video frame. The other function of this network is to act as the feature encoder: its fully connected layer features are output as the picture RGB features to be input to the subsequent time sequence action segmentation network.
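The per-frame labelling in item b) can be sketched as follows; the tuple layout of the annotations is an illustrative assumption.

```python
# Hypothetical data-arrangement sketch: frames extracted at a fixed frequency are labelled
# with the annotated shot label when they fall inside a labelled time pair, and with
# "background" otherwise. Data structures are illustrative assumptions.
def label_frames(frame_times, annotations, background="background"):
    """frame_times: timestamps (seconds) of the frames extracted at a fixed frequency.
    annotations: list of (start_time, end_time, shot_label) tuples from step a)."""
    labels = []
    for t in frame_times:
        label = background
        for start, end, shot_label in annotations:
            if start <= t <= end:
                label = shot_label
                break
        labels.append(label)
    return labels
```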
Step S204, inputting the picture characteristics into a time sequence action segmentation network, and outputting a start-stop time sequence for identifying the motion segment in the first video frame sequence through the time sequence action segmentation network; dividing the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type;
specifically, as shown in fig. 5a, the server 30b may input the picture feature 30f obtained in the above step S203 into the time sequence action segmentation network 30g, and may obtain, through the time sequence action segmentation network 30g, the start-stop time stamps of each diving action segment (i.e., motion segment) in the first video frame sequence 30c. The start-stop time stamps corresponding to each diving action segment are combined in order to obtain the start-stop time sequence 30h; for example, the start-stop time sequence 30h may include a time stamp t1, a time stamp t2, a time stamp t3, …, and a time stamp tj, where j is a positive integer and every two adjacent time stamps form a pair of start-stop time stamps, that is, a pair of start-stop time stamps includes a start time stamp and an end time stamp, for example time stamp t1 - time stamp t2 and time stamp t3 - time stamp t4. It is understood that the end time stamp in a previous pair of start-stop time stamps may coincide with the start time stamp in the next pair, for example time stamp t2 and time stamp t3; for convenience they are still described separately here. Further, according to the start-stop time sequence 30h, the diving action segments 30i may be obtained from the first video frame sequence 30c. As shown in fig. 5a, assuming that j is equal to 12, 6 diving action segments may be obtained, including a diving action segment 301i from time stamp t1 to time stamp t2, a diving action segment 302i from time stamp t3 to time stamp t4, …, and a diving action segment 306i from time stamp t11 to time stamp t12. Furthermore, the diving action segments 30i may be divided into normal motion segments and playback motion segments according to the start-stop time sequence 30h and the shot type sequence 30e; for example, the diving action segment 301i and the diving action segment 304i may be divided into normal motion segments, and the diving action segment 302i, the diving action segment 303i, the diving action segment 305i and the diving action segment 306i may be divided into playback motion segments. Since the diving action segment 301i is, in time sequence, the normal motion segment immediately preceding the diving action segment 302i and the diving action segment 303i, an association relationship between the diving action segment 301i and the diving action segments 302i and 303i may be established; similarly, an association relationship between the diving action segment 304i and the diving action segments 305i and 306i may also be established. For the specific dividing process, reference may be made to step S101 in the embodiment corresponding to fig. 3.
Likewise, the time sequence action segmentation network needs to be trained before use. Specifically, first, the start-stop time stamp label and the shot type label corresponding to each target action segment are marked in the motion sample segments. The marked motion sample segments are then divided into S sample sub-segments (S is a positive integer) according to a division rule, the S sample sub-segments are frame-extracted to obtain a fifth video frame sequence, and the start-stop time stamp labels need to be updated according to the time length of each sample sub-segment to obtain the updated start-stop time stamp labels. Further, the fifth video frame sequence is input into the feature encoding network already trained in step S203 to obtain the picture feature matrix corresponding to each sample sub-segment; the picture feature matrix is input into the initial action segmentation network, and the predicted start-stop time stamps corresponding to each target action segment in each sample sub-segment can be output through the initial action segmentation network. A time loss function can then be generated according to the predicted start-stop time stamps and the updated start-stop time stamp labels, the model parameters in the initial action segmentation network can be corrected through the time loss function, and finally the trained time sequence action segmentation network is obtained.
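The update of the start-stop time stamp labels after splitting can be sketched as follows; the assumption here is that every annotated action falls entirely inside one sample sub-segment, which is exactly what the division rule guarantees.

```python
# Hypothetical sketch of updating start-stop time stamp labels after a long video is split
# into sample sub-segments: each annotation is re-expressed in the local time base of the
# sub-segment that contains it. Sub-segment boundaries are assumed to avoid diving actions.
def update_timestamp_labels(annotations, subsegment_bounds):
    """annotations: list of (start, end) pairs in the long video's time base.
    subsegment_bounds: list of (seg_start, seg_end) pairs covering the long video."""
    updated = {i: [] for i in range(len(subsegment_bounds))}
    for start, end in annotations:
        for i, (seg_start, seg_end) in enumerate(subsegment_bounds):
            if seg_start <= start and end <= seg_end:
                # shift the pair into the sub-segment's local time base
                updated[i].append((start - seg_start, end - seg_start))
                break
    return updated
```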
For example, a need exists to obtain a time-series action segmentation network that can process video clips of a diving race, and the training process can be as follows:
a) And (3) data marking: the data marking process is consistent with the data marking process of the feature coding network and the image classification network.
b) And (3) data arrangement: all training and testing long videos are segmented into a plurality of short videos (for example, about 10-field 100-minute diving match videos can be segmented into about 50-section 20-minute short videos) according to a certain duration, so that the number of training sets is conveniently enlarged, the duration of the videos in the training sets is conveniently unified as much as possible, meanwhile, diving actions are avoided when the videos are segmented, and the integrity of the diving actions is prevented from being damaged, namely the segmentation rule. And then frame the shorter duration video at a fixed frequency (e.g., 2 FPS). In addition, the corresponding time pair sequence in the annotation file is also updated accordingly (i.e. the start-stop timestamp label is updated).
c) Model training: firstly, inputting a video frame sequence extracted from each video with shorter duration in the step b) into the trained feature coding network to obtain a picture RGB feature matrix (which can be jointly determined by a time dimension and a feature dimension) of each video and a lens label vector (which can be jointly determined by the time dimension and the lens label) of each video. Then, the feature matrix is used as input, the time sequence action segmentation network is trained, start and stop time stamps of all the diving action fragments are output, and meanwhile, lens labels of all the diving action fragments can be obtained according to the lens label vector. Finally, each video gets a sequence set { (start timestamp, end timestamp, shot label) |xn }, xn is the number of diving actions detected by the video.
Step S205, determining the lens type as a first motion label corresponding to the motion segment;
specifically, as shown in fig. 5a, since the above-mentioned diving action segment 301i and diving action segment 304i are both normal motion segments, the first motion labels corresponding to them may both be "normal shot". The diving action segment 302i, the diving action segment 303i, the diving action segment 305i and the diving action segment 306i are playback motion segments, so their first motion labels may include the detailed lens angle type in addition to "playback shot"; assuming that the lens angle type corresponding to the diving action segment 302i is the top view type, its first motion label may be "playback shot - top view". The process of generating the first motion labels corresponding to the other playback motion segments is consistent with that of the diving action segment 302i.
Step S206, extracting N video frames from the normal motion segment to form a third video frame sequence; splitting a third video frame sequence into M subsequences, inputting the M subsequences into a multi-task action understanding network, outputting an action type and an action evaluation quality corresponding to a normal motion segment through the multi-task action understanding network, determining the action type and the action evaluation quality as a second motion label corresponding to the normal motion segment, and associating the second motion label with a playback motion segment;
Specifically, taking the diving action segment 301i as an example (the processing of the diving action segment 304i is the same), as shown in fig. 5b, a diving action segment of a normal shot generally lasts about 4 seconds, so according to an empirical value, 96 continuous video frames (i.e., N equal to 96) can be extracted from the diving action segment 301i, and the extracted 96 video frames form the third video frame sequence. The third video frame sequence can then be split into M subsequences; in this embodiment, M can be equal to 12, that is, splitting the third video frame sequence yields the subsequences i1, i2, …, i12. These 12 subsequences are respectively input into the 12 non-local components in the multitask action understanding network 30k, and each non-local component processes 8 continuous video frames. Finally, the action type 301l and the action evaluation quality 302l corresponding to the diving action segment 301i can be obtained, the action type 301l and the action evaluation quality 302l can form the second motion label 30l corresponding to the diving action segment 301i, and the second motion label 30l can further be assigned to the associated diving action segment 302i; for the specific process, reference may be made to step S103 in the embodiment corresponding to fig. 3, which is not repeated here.
In the diving match scenario, as shown in table 1, the action types may include, but are not limited to, the jump mode (including postures such as the straight body, the leg-holding body, and tumbling and twisting), the arm-force diving attribute (including arm-force diving and non-arm-force diving), the rotation posture (including a forward jump facing the pool, a backward jump facing the board (platform), a reverse jump facing the pool, and an inward jump facing the board (platform)), the rotation number, and the side turn number. For example, the second motion label 30l corresponding to the diving action segment 301i may specifically be "66.538055 points, free, arm-force diving, reverse jump facing the pool, 2 turns, 1.5 side turns", which correspond to the action evaluation quality, the jump mode, the arm-force diving attribute, the rotation posture, the rotation number, and the side turn number, respectively.
Furthermore, the multitask action understanding network needs to be trained before use. Fig. 6 is a schematic structural diagram of a multitask action understanding network according to an embodiment of the present application. As shown in fig. 6, the embodiment of the present application discards the conventional idea of using a C3D network (3D Convolutional Network) as the backbone network and instead adopts M non-local components to form the backbone network, where M is a positive integer; the multitask action understanding network adopts a hard parameter sharing mechanism, that is, the M non-local components can share parameters. First, the motion sample segments need to be marked with data: on the basis of the data marking of the image classification network in step S203, an action evaluation quality label and an action type label corresponding to each target action segment are additionally marked in the motion sample segments (for the specific content, see table 1 above); the action evaluation quality label and the action type label here are the true action evaluation quality and action type. Then K video frames can be uniformly extracted from the marked motion sample segments, and N continuous video frames are further extracted from the K video frames as an input sequence to be input into the initial action understanding network, where K and N are positive integers and K is larger than N but close to N; because the number of video frames input to each non-local component must be a fixed value, N is equal to the model input data length corresponding to the initial action understanding network. Further, the M non-local components in the initial action understanding network each process N/M consecutive video frames, and the predicted action evaluation quality and the predicted action type corresponding to each target action segment can be output; it can be understood that N is an integer multiple of M. As shown in fig. 6, each group of N/M consecutive video frames constitutes a segment, and each non-local component processes one segment individually, e.g., non-local component 1 processes segment 1, non-local component 2 processes segment 2, and so on. Further, a quality loss function can be generated according to the difference between the predicted action evaluation quality and the action evaluation quality label, an action loss function can be generated according to the difference between the predicted action type and the action type label, and a target loss function can be generated from the quality loss function and the action loss function (for example, by summing them); the model parameters in the initial action understanding network are corrected through the target loss function, and the trained multitask action understanding network is obtained.
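A minimal sketch of the target loss is given below, assuming mean-squared error for the quality loss, cross-entropy for each label head of the action loss, and a plain sum for combining them; these concrete choices are illustrative.

```python
# Hypothetical sketch of the target loss: the quality loss (score regression) and the
# action loss (multi-label classification) are summed and back-propagated together.
import torch
import torch.nn as nn

mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()

def target_loss(pred_score, score_label, pred_label_logits, action_labels):
    """pred_score: (B, 1); score_label: (B, 1);
    pred_label_logits: list of (B, C_k) tensors, one per label head;
    action_labels: list of (B,) class-index tensors, one per label head."""
    quality_loss = mse(pred_score, score_label)
    action_loss = sum(ce(logits, labels)
                      for logits, labels in zip(pred_label_logits, action_labels))
    return quality_loss + action_loss
```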
It should be understood that, although the multitask action understanding network illustrated in fig. 6 only marks the non-local components, the one-dimensional convolution layers and the fully connected layers, in practical application the network structure of the multitask action understanding network may further include an input layer, a feature extraction layer, a batch normalization (BN) layer, an activation layer, an output layer, and the like, which are not described herein again.
For example, a need exists to obtain a multi-task action understanding network that can process video clips of a diving race, the training process of which can be as follows:
a) And (3) data marking: based on the data labeling of the image classification network, an action score (i.e., an action evaluation quality label) and 5 kinds of fine action labels (i.e., action type labels) are additionally labeled for each diving action, as shown in table 1.
b) And (3) data arrangement: because different diving action visual angles of the normal lens are consistent and the duration is similar, only scoring and label analysis are carried out on the diving action fragments of the normal lens, and the result of multiplexing the normal lens by the lens is played back. 107 frames (i.e., K equals 107) are uniformly extracted for each action segment of the normal shot, and the set of frame sequences constitutes a dataset.
c) Model training: training a non-local (non-local) neural network (resnet), i.e. residual neural network), based multitasking action understanding network using the action segment frame sequence obtained in b). During training, 96 frames (namely, N is equal to 96) are randomly extracted from 107 frames to serve as network input, so that data enhancement is realized. The overall network structure comprises 12 non-local modules (i.e. non-local components) sharing parameters (i.e. M equals 12), each processing segments of consecutive 8 frames, outputting feature vectors (also called intermediate motion features) of 2048 dimensions (resnet fully connected layer). The output vectors of the 12 nonlocal modules are spliced into a matrix (also called fusion characteristic) of 12 x 2048, time sequence operation is carried out by two one-dimensional convolution layers (convld) to obtain time sequence characteristics, the time sequence characteristics are input into 2 branches, one branch is connected with a fractional regression device through a full connection layer, and the other branch is connected with 5 tag classifiers through the full connection layer. The whole network can be trained end to end, and the prediction scores and the multi-label categories are output.
Step S207, according to the start-stop time sequence corresponding to the motion segment, an important event segment is intercepted from the video to be processed; generating a tag text corresponding to the important event fragment according to the association relation between the important event fragment and the motion fragment, and the first motion tag and the second motion tag corresponding to the motion fragment, adding the tag text into the important event fragment, and splicing the important event fragment added with the tag text into a video highlight;
specifically, as shown in fig. 5b, important event segments may be cut out from the video segment 30a according to the start-stop time sequence 30h; for example, the segment from time stamp t1 to time stamp t2 in the video segment 30a may be cut out to obtain an important event segment 301m, and the segment from time stamp t3 to time stamp t4 may be cut out to obtain an important event segment 302m. Taking the important event segment 301m as an example, no tag text is displayed in its picture yet. The server 30b may obtain the first motion label and the second motion label corresponding to the diving action segment 301i and generate the tag text corresponding to the important event segment 301m according to the first motion label and the second motion label; that is, the tag text includes multiple sub-tags with different tag types, specifically including the lens type, the action type and the action evaluation quality corresponding to the diving action segment 301i. To add the tag text, a tag fusion template may be obtained first, where the tag fusion template specifies a filling position for each tag type; for example, the sub-tag of the lens type may need to be filled in the uppermost part of the tag fusion template. Further, each sub-tag in the tag text is added to its corresponding filling position according to the tag type to obtain the filled tag fusion template, and the filled tag fusion template and the important event segment can then be subjected to pixel fusion to obtain the tagged important event segment. As shown in fig. 5b, the tagged important event segment 301m is obtained according to the above procedure, and the filled tag fusion template m1 may be displayed in its picture: "normal shot, score [66.538055], free, arm-force diving, reverse jump facing the pool, 2 turns, 1.5 side turns"; similarly, for the tagged important event segment 302m, the filled tag fusion template m2 can be displayed in its picture: "playback shot - top view, score [66.538055], free, arm-force diving, reverse jump facing the pool, 2 turns, 1.5 side turns". The processing procedures of the other important event segments are consistent and will not be described in detail here. Finally, all the tagged important event segments are spliced to obtain a video highlight 30m, and the video highlight 30m can be used as a video abstract of the video segment 30a. In addition, the specific form of the tag fusion template can be set according to actual needs, which is not limited in the embodiments of the present application.
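As a simplified illustration of adding the tag text to an important event segment, the sketch below draws the fused label string onto each frame with OpenCV; the real tag fusion template would assign a filling position per tag type, which is omitted here as an assumption of this sketch.

```python
# Hypothetical sketch of adding the tag text to an important event segment: the label
# fusion step is reduced to drawing the fused label string onto every frame; per-tag-type
# filling positions of the real template are not modelled.
import cv2

def add_tag_text(frames, tag_text, position=(40, 60)):
    """frames: list of BGR images of one important event segment.
    tag_text: e.g. 'normal shot, score [66.538055], free, arm-force diving, ...'."""
    tagged = []
    for frame in frames:
        canvas = frame.copy()
        cv2.putText(canvas, tag_text, position, cv2.FONT_HERSHEY_SIMPLEX,
                    1.0, (255, 255, 255), 2, cv2.LINE_AA)
        tagged.append(canvas)
    return tagged
```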
Step S208, motion script information associated with the target object is generated according to the start-stop time sequence corresponding to the motion segment, the first motion label and the second motion label.
Specifically, as shown in fig. 5b, the server 30b may sequentially arrange and combine the first motion tag and the second motion tag corresponding to each motion segment according to the start-stop time sequence 30h, so as to obtain the motion jotting information 30n associated with each participating athlete, further store the motion jotting information 30n in a jotting note, and package the motion jotting information with the video highlight 30m later to return to the end user. The specific form of the script can be referred to as script 20e in fig. 2, and other templates can be used to generate script according to actual needs.
It should be noted that the multi-modal intelligent script tool provided by the embodiments of the present application may also be applied to other similar or related video clips besides the video clips of diving matches, which is not limited in the embodiments of the present application.
It will be appreciated that the numbers used in the above embodiments are illustrative; in practical applications, the actual numbers shall prevail.
According to the embodiments of the present application, based on deep neural networks, the lens type of the video frames for the target object in the video to be processed is identified, and the start-stop time sequence corresponding to the motion segments in the video to be processed is obtained, so that the motion segments can be divided into normal motion segments and playback motion segments according to the lens type and the start-stop time sequence, realizing accurate segmentation of normal motion segments and playback motion segments. The normal motion segments can be input into the multitask action understanding network for recognition to obtain the corresponding action type and action evaluation quality; the lens type can be used as the first motion label, the action type and the action evaluation quality can be used as the second motion label, and finally the motion script information can be generated according to the first motion label and the second motion label. Therefore, in the process of processing a video clip of a diving match, all diving segments can be automatically segmented from the video clip, the segments can be classified according to shot angles, labels such as the take-off mode, the rotation posture and the number of rotations can be added, the actions can be scored, and a script can be generated automatically. By recognizing only the normal motion segments, waste of computing resources can be reduced, and the large amount of manpower otherwise consumed in classifying actions and registering motion script information can be avoided, so that the accuracy and efficiency of action type recognition can be improved, the accuracy of action quality assessment can be improved, and the generation efficiency of the motion script information can be improved; in actual services, manual video editing and script editing can be replaced, saving a large amount of labor cost and avoiding human errors.
Fig. 7 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application. The video data processing apparatus may be a computer program (including program code) running on a computer device, for example the video data processing apparatus is an application software; the device can be used for executing corresponding steps in the method provided by the embodiment of the application. As shown in fig. 7, the video data processing apparatus 1 may include: a lens recognition module 11, a first tag determination module 12, a second tag determination module 13;
the lens identification module 11 is used for identifying the lens type of a video frame aiming at a target object in the video to be processed, acquiring a motion segment in the video to be processed, and dividing the motion segment into a normal motion segment and a playback motion segment according to the lens type;
a first tag determining module 12, configured to determine a lens type as a first motion tag corresponding to a motion segment; the first motion label is used for recording the lens type corresponding to the motion segment in the motion script information associated with the target object;
the first tag determining module 12 is specifically configured to identify, when the motion segment is a playback motion segment, a lens angle type corresponding to the playback motion segment through the image classification network; the lens angle type includes any one of a front side view angle type, a side front view angle type, a side rear view angle type, or a top view angle type; determining the lens angle type as a first motion label corresponding to the playback motion segment;
The second tag determining module 13 is configured to perform multitasking action recognition on the normal motion segment, obtain a second motion tag corresponding to the normal motion segment, and associate the second motion tag with the playback motion segment; the second motion label is used for recording the motion type and motion evaluation quality corresponding to the motion segment in the motion script information;
the second tag determining module 13 is specifically configured to use a previous normal motion segment of the playback motion segment as a target segment according to a start-stop time sequence corresponding to the motion segment, and obtain a second motion tag corresponding to the target segment as a second motion tag corresponding to the playback motion segment;
the second tag determining module 13 is specifically configured to input a normal motion segment into the multitask action understanding network to obtain the jump mode, the arm-force diving attribute, the rotation posture, the rotation number, the side turn number and the action evaluation quality corresponding to the normal motion segment, and determine the jump mode, the arm-force diving attribute, the rotation posture, the rotation number and the side turn number as the action type corresponding to the normal motion segment; determine the action type and the action evaluation quality as the second motion label corresponding to the normal motion segment; the second motion label is used for recording, in the motion script information associated with the participating athletes, the action type and the action evaluation quality corresponding to the normal motion segment.
The specific function implementation of the lens identifying module 11 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S201 to step S204 in the embodiment corresponding to fig. 5a to 5b, the specific function implementation of the first tag determining module 12 may refer to step S102 in the embodiment corresponding to fig. 3, or may refer to step S205 in the embodiment corresponding to fig. 5a to 5b, and the specific function implementation of the second tag determining module 13 may refer to step S103 in the embodiment corresponding to fig. 3, or may refer to step S206 in the embodiment corresponding to fig. 5a to 5b, which will not be repeated here.
Referring to fig. 7, the video data processing apparatus 1 may further include: a segment intercepting module 14 and a segment splicing module 15;
the segment intercepting module 14 is used for intercepting important event segments from the video to be processed according to the start-stop time sequence corresponding to the motion segment; the motion segment belongs to an important event segment;
the segment splicing module 15 is configured to generate a tag text corresponding to the important event segment according to an association relationship between the important event segment and the motion segment, and a first motion tag and a second motion tag corresponding to the motion segment, add the tag text to the important event segment, and splice the important event segment with the tag text added into a video highlight;
The tag text includes a plurality of sub-tags having different tag types;
the segment splicing module 15 is specifically configured to obtain a tag fusion template; the label fusion template comprises at least two filling positions corresponding to the label types respectively; each sub-label in the label text is added to a corresponding filling position according to the label type to obtain an added label fusion template; and carrying out pixel fusion on the added tag fusion template and the important event fragment to obtain an added important event fragment, and splicing the added important event fragment to obtain the video highlight.
The specific functional implementation manner of the segment intercepting module 14 and the segment stitching module 15 may refer to step S207 in the embodiment corresponding to fig. 5 a-5 b, and will not be described herein.
Referring to fig. 7, the video data processing apparatus 1 may further include: a first training module 16, a second training module 17, a third training module 18;
the first training module 16 is configured to label, in the motion sample segment, a motion evaluation quality label and a motion type label corresponding to each target motion segment; uniformly extracting K video frames from the marked motion sample fragments, extracting continuous N video frames from the K video frames as an input sequence, and inputting the input sequence into an initial action understanding network; k and N are positive integers, K is larger than N; n is equal to the length of the model input data corresponding to the initial action understanding network; processing the continuous N/M video frames through M non-local components in the initial action understanding network respectively, and outputting predicted action evaluation quality and predicted action types corresponding to each target action segment; m is a positive integer, N is an integer multiple of M; generating a quality loss function according to the predicted action evaluation quality and the action evaluation quality label, and generating an action loss function according to the predicted action type and the action type label; generating a target loss function according to the mass loss function and the action loss function, and correcting model parameters in the initial action understanding network through the target loss function to obtain a multi-task action understanding network;
The second training module 17 is configured to label a shot type label corresponding to each target action segment in the motion sample segment; extracting frames from the marked motion sample fragments to obtain a fourth video frame sequence; inputting a fourth video frame sequence into a lightweight convolutional neural network, and outputting a predicted lens type corresponding to each target action segment through the lightweight convolutional neural network; generating a lens loss function according to the predicted lens type and the lens type label, and correcting model parameters in the lightweight convolutional neural network through the lens loss function to obtain an image classification network;
a third training module 18, configured to mark start and stop timestamp labels corresponding to each target action segment in the motion sample segments; dividing the marked motion sample segment into S sample sub-segments according to a dividing rule, and performing frame extraction on the S sample sub-segments to obtain a fifth video frame sequence; s is a positive integer; updating the start and stop time stamp label according to the time length of each sample sub-segment to obtain an updated start and stop time stamp label; inputting the fifth video frame sequence into a feature coding network to obtain a picture feature matrix corresponding to each sample sub-segment; inputting the picture feature matrix into an initial action segmentation network, and outputting a predicted start-stop time stamp corresponding to each target action segment in each sample sub-segment through the initial action segmentation network; generating a time loss function according to the predicted start-stop time stamp and the updated start-stop time stamp label, and correcting model parameters in the initial action segmentation network through the time loss function to obtain the time sequence action segmentation network.
The specific function implementation manner of the first training module 16 may refer to step S206 in the embodiment corresponding to fig. 5 a-5 b, the specific function implementation manner of the second training module 17 may refer to step S203 in the embodiment corresponding to fig. 5 a-5 b, and the specific function implementation manner of the third training module 18 may refer to step S204 in the embodiment corresponding to fig. 5 a-5 b, which will not be repeated here.
Referring to fig. 7, the lens recognition module 11 may include: a first frame extraction unit 111, an image classification unit 112, and a segment division unit 113;
a first frame extracting unit 111, configured to extract frames from a video to be processed including a target object, to obtain a first video frame sequence;
an image classification unit 112, configured to input the first video frame sequence into an image classification network, and output a shot type corresponding to each video frame in the first video frame sequence through the image classification network; the shot type includes a normal shot type and a playback shot type;
and a segment dividing unit 113, configured to obtain a start-stop time sequence corresponding to a motion segment in the first video frame sequence, and divide the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type.
The specific function implementation manner of the first frame extracting unit 111 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S201 in the embodiment corresponding to fig. 5 a-5 b, the specific function implementation manner of the image classifying unit 112 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S202 in the embodiment corresponding to fig. 5 a-5 b, the specific function implementation manner of the segment dividing unit 113 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S203-step S204 in the embodiment corresponding to fig. 5 a-5 b, which will not be repeated herein.
Referring also to fig. 7, the second tag determination module 13 may include: a second frame extraction unit 131 and an action understanding unit 132;
the second frame extracting unit 131 is configured to extract N video frames from the normal motion segment to form a third video frame sequence; n is a positive integer;
an action understanding unit 132 for splitting the third video frame sequence into M sub-sequences, inputting the M sub-sequences into the multitasking action understanding network; the multitasking action understanding network includes M non-local components; m is a positive integer; respectively extracting features of M subsequences by M non-local components in a multitask action understanding network to obtain M intermediate action features; a non-local component corresponds to a sub-sequence; fusing M intermediate action features to obtain fused features, and performing time sequence operation on the fused features through a one-dimensional convolution layer in a multi-task action understanding network to obtain time sequence features corresponding to normal motion segments; inputting the time sequence characteristics into a full-connection layer in a multi-task action understanding network, and respectively inputting the characteristic data output by the full-connection layer into a fractional regressor and a label classifier; outputting motion estimation quality corresponding to the normal motion segment through the fractional regression, outputting motion type corresponding to the normal motion segment through the tag classifier, and taking the motion estimation quality and the motion type as a second motion tag corresponding to the normal motion segment.
The specific functional implementation manner of the second frame extracting unit 131 and the action understanding unit 132 may refer to step S103 in the embodiment corresponding to fig. 3, or may refer to step S206 in the embodiment corresponding to fig. 5 a-5 b, which is not described herein.
Referring to fig. 7 together, the segment dividing unit 113 may include: a feature extraction subunit 1131, a timing acquisition subunit 1132, and a fragment determination subunit 1133;
the feature extraction subunit 1131 is configured to input the first video frame sequence into a feature encoding network, and output, through the feature encoding network, a picture feature corresponding to the first video frame sequence;
a timing acquisition subunit 1132, configured to input the picture feature into a timing action segmentation network, and output, through the timing action segmentation network, a start-stop time sequence for identifying a motion segment in the first video frame sequence; the start-stop time sequence comprises at least two start time stamps T1 and at least two end time stamps T2;
a segment determining subunit 1133, configured to obtain at least two second video frame sequences from the first video frame sequences according to the at least two start time stamps T1 and the at least two end time stamps T2, and determine the at least two second video frame sequences as motion segments; if the lens type corresponding to the motion segment is the normal lens type, determining the motion segment as a normal motion segment; and if the shot type corresponding to the motion segment is the playback shot type, determining the motion segment as the playback motion segment.
The specific functional implementation manner of the feature extraction subunit 1131 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S203 in the embodiment corresponding to fig. 5 a-5 b, the specific functional implementation manner of the timing acquisition subunit 1132 and the segment determination subunit 1133 may refer to step S101 in the embodiment corresponding to fig. 3, or may refer to step S204 in the embodiment corresponding to fig. 5 a-5 b, which are not repeated here.
According to the method and the device for processing the video frame, based on the deep neural network, the lens type of the video frame aiming at the target object in the video to be processed is identified, and the start-stop time sequence corresponding to the motion segment in the processed video is obtained, so that the motion segment can be divided into the normal motion segment and the playback motion segment according to the lens type and the start-stop time sequence, the normal motion segment and the playback motion segment are accurately segmented, the normal motion segment can be input into a multitask motion understanding network for identification, the corresponding motion type and the motion evaluation quality are obtained, the lens type can be used as a first motion tag, the motion type and the motion evaluation quality can be used as a second motion tag, and finally the motion field information can be generated according to the first motion tag and the second motion tag. Therefore, the waste of calculation resources can be reduced by only identifying the normal motion segment, and the consumption of a large amount of manpower to conduct motion classification and registration of motion script information can be avoided, so that the accuracy rate of identifying motion types and the identification efficiency can be improved, the accuracy rate of motion quality evaluation can be improved, the generation efficiency of motion script information can be improved, manual video editing and script editing can be replaced in actual business, and a large amount of labor cost and human errors can be saved.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 8, the computer device 1000 may include a processor 1001, a network interface 1004, and a memory 1005; in addition, the computer device 1000 may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display and a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in fig. 8, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.
In the computer device 1000 shown in Fig. 8, the network interface 1004 may provide a network communication function, the user interface 1003 is mainly used to provide an input interface for a user, and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement the following:
identifying the shot type of a video frame aiming at a target object in the video to be processed, acquiring a motion segment in the video to be processed, and dividing the motion segment into a normal motion segment and a playback motion segment according to the shot type;
determining the shot type as a first motion label corresponding to the motion segment; the first motion label is used for recording the shot type corresponding to the motion segment in the motion script information associated with the target object;
performing multi-task action recognition on the normal motion segment, acquiring a second motion label corresponding to the normal motion segment, and associating the second motion label with the playback motion segment; the second motion label is used for recording the action type and the action evaluation quality corresponding to the motion segment in the motion script information.
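By way of non-limiting illustration only, the following Python sketch outlines these three processor-implemented steps as a single pipeline. The callables classify_shot, segment_actions, and understand_action stand in for the image classification network, the time sequence action segmentation network, and the multi-task action understanding network respectively; their names and signatures are assumptions made for the example and are not taken from the embodiments.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

def build_motion_script(frames: List[Any],
                        classify_shot: Callable[[Any], str],                            # per-frame shot type
                        segment_actions: Callable[[List[Any]], List[Tuple[int, int]]],  # [(T1, T2), ...]
                        understand_action: Callable[[List[Any]], Tuple[str, float]]     # (action type, quality)
                        ) -> List[Dict[str, Any]]:
    # Step 1: identify the shot type of every frame and obtain the start-stop time sequence.
    shot_types = [classify_shot(f) for f in frames]
    script: List[Dict[str, Any]] = []
    last_second_label = None
    for t1, t2 in segment_actions(frames):
        segment = frames[t1:t2]
        # The dominant per-frame shot type decides normal vs. playback for the segment.
        first_label = Counter(shot_types[t1:t2]).most_common(1)[0][0]
        if first_label == "normal":
            # Step 3: multi-task recognition runs on normal motion segments only.
            action_type, quality = understand_action(segment)
            last_second_label = (action_type, quality)
        # Step 2: the shot type is the first motion label; a playback segment
        # reuses the second motion label of the preceding normal segment.
        script.append({"first_label": first_label, "second_label": last_second_label})
    return script
```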
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the video data processing method described in any of the foregoing embodiments corresponding to Fig. 3, Fig. 4, or Fig. 5a-5b, which will not be repeated here. The description of the beneficial effects of the same method is likewise omitted.
It should further be noted that the embodiments of the present application also provide a computer-readable storage medium, which stores the computer program executed by the video data processing apparatus 1 mentioned above. The computer program includes program instructions that, when executed by a processor, can perform the video data processing method described in any of the embodiments corresponding to Fig. 3, Fig. 4, or Fig. 5a-5b, which is therefore not described here in detail. The description of the beneficial effects of the same method is likewise omitted. For technical details not disclosed in the computer-readable storage medium embodiments of the present application, refer to the description of the method embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the video data processing apparatus provided in any of the foregoing embodiments or of the foregoing computer device, for example, a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
It should further be noted that the embodiments of the present application also provide a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided by any of the embodiments corresponding to Fig. 3, Fig. 4, or Fig. 5a-5b above.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be regarded as going beyond the scope of the present application.
The foregoing disclosure is merely a preferred embodiment of the present application and is not intended to limit the scope of the claims; equivalent variations made in accordance with the claims of the present application therefore still fall within the scope of the present application.

Claims (15)

1. A method of video data processing, comprising:
when a first video frame sequence in a video to be processed containing a target object is obtained, performing shot type identification on the first video frame sequence to obtain a shot type corresponding to each video frame in the first video frame sequence;
extracting features of the first video frame sequence to obtain picture features corresponding to the first video frame sequence, obtaining a start-stop time sequence for identifying a motion segment in the first video frame sequence based on the picture features, and dividing the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type;
determining the shot type as a first motion label corresponding to the motion segment; the first motion label is used for recording the shot type corresponding to the motion segment in motion script information associated with the target object;
performing action type recognition on the normal motion segment through a multi-task action understanding network to obtain an action type corresponding to the normal motion segment, performing action quality evaluation on the normal motion segment through the multi-task action understanding network to obtain an action evaluation quality corresponding to the normal motion segment, using the action evaluation quality and the action type as a second motion label corresponding to the normal motion segment, and associating the second motion label with the playback motion segment; the second motion label is used for recording the action type and the action evaluation quality corresponding to the motion segment in the motion script information.
2. The method according to claim 1, wherein when a first video frame sequence in a video to be processed including a target object is acquired, performing shot type identification on the first video frame sequence to obtain a shot type corresponding to each video frame in the first video frame sequence, including:
extracting frames of a video to be processed containing a target object to obtain a first video frame sequence;
inputting the first video frame sequence into an image classification network, and carrying out shot type identification on each video frame in the first video frame sequence through the image classification network to obtain a shot type corresponding to each video frame in the first video frame sequence.
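Purely as an illustrative sketch of this claim (not part of the claimed subject matter), the Python snippet below performs the frame extraction step, assuming OpenCV as the decoder; the target sampling rate of 5 frames per second is an arbitrary example value, and the image_classifier callable in the commented usage line is a hypothetical stand-in for the image classification network.

```python
import cv2  # assumed decoder; any frame source could be used instead

def extract_first_frame_sequence(video_path: str, target_fps: float = 5.0):
    """Uniformly sample frames from the video to be processed to form the first video frame sequence."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(native_fps / target_fps))
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:      # keep one frame every `step` frames
            frames.append(frame)
        index += 1
    capture.release()
    return frames

# shot_types = [image_classifier(frame) for frame in extract_first_frame_sequence("match.mp4")]
```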
3. The method of claim 1, wherein the shot types include a normal shot type and a playback shot type;
the step of extracting the features of the first video frame sequence to obtain picture features corresponding to the first video frame sequence, obtaining a start-stop time sequence for identifying a motion segment in the first video frame sequence based on the picture features, and dividing the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type, wherein the step of dividing the motion segment comprises the following steps:
inputting the first video frame sequence into a feature coding network, and carrying out feature extraction on the first video frame sequence through the feature coding network to obtain picture features corresponding to the first video frame sequence;
inputting the picture characteristics into a time sequence action segmentation network, and outputting a start-stop time sequence for identifying a motion segment in the first video frame sequence through the time sequence action segmentation network; the start-stop time sequence comprises at least two start time stamps T1 and at least two end time stamps T2;
acquiring at least two second video frame sequences from the first video frame sequence according to the at least two start time stamps T1 and the at least two end time stamps T2, and determining the at least two second video frame sequences as motion segments;
if the shot type corresponding to the motion segment is the normal shot type, determining the motion segment as a normal motion segment;
and if the shot type corresponding to the motion segment is the playback shot type, determining the motion segment as a playback motion segment.
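As a hypothetical illustration of how the start-stop time sequence could be applied (not part of the claims), the sketch below cuts second video frame sequences out of the first video frame sequence and divides them into normal and playback motion segments by their dominant shot type. The conversion from timestamps in seconds to frame indices via sample_fps is an assumption of this example.

```python
from collections import Counter
from typing import List, Sequence, Tuple

def divide_motion_segments(frames: Sequence,
                           shot_types: Sequence[str],               # "normal" / "playback" per frame
                           start_stop: List[Tuple[float, float]],   # [(T1, T2), ...] in seconds
                           sample_fps: float):
    normal_segments, playback_segments = [], []
    for t1, t2 in start_stop:
        i1, i2 = int(t1 * sample_fps), int(t2 * sample_fps)
        if i2 <= i1:
            continue                                                # skip degenerate intervals
        second_sequence = frames[i1:i2]                             # a second video frame sequence
        dominant = Counter(shot_types[i1:i2]).most_common(1)[0][0]  # shot type of the motion segment
        if dominant == "playback":
            playback_segments.append(second_sequence)
        else:
            normal_segments.append(second_sequence)
    return normal_segments, playback_segments
```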
4. The method of claim 2, wherein the determining the shot type as the first motion label corresponding to the motion segment comprises:
when the motion segment is a playback motion segment, identifying a shot angle type corresponding to the playback motion segment through the image classification network; the shot angle type comprises any one of a front-side view angle type, a front view angle type, a rear-side view angle type, or an overhead view angle type;
and determining the shot angle type as the first motion label corresponding to the playback motion segment.
5. The method according to claim 1, wherein the performing action type recognition on the normal motion segment through the multi-task action understanding network to obtain the action type corresponding to the normal motion segment, performing action quality evaluation on the normal motion segment through the multi-task action understanding network to obtain the action evaluation quality corresponding to the normal motion segment, and taking the action evaluation quality and the action type as the second motion label corresponding to the normal motion segment comprises:
extracting N video frames from the normal motion segment to form a third video frame sequence; n is a positive integer;
splitting the third video frame sequence into M subsequences, and inputting the M subsequences into the multi-task action understanding network; the multi-task action understanding network comprises M non-local components; M is a positive integer;
performing, through the M non-local components in the multi-task action understanding network, feature extraction on the M subsequences respectively to obtain M intermediate action features; one non-local component corresponds to one subsequence;
fusing the M intermediate action features to obtain fusion features, and performing a time sequence operation on the fusion features through a one-dimensional convolution layer in the multi-task action understanding network to obtain time sequence features corresponding to the normal motion segment;
inputting the time sequence features into a fully connected layer in the multi-task action understanding network, and respectively inputting feature data output by the fully connected layer into a score regressor and a label classifier;
outputting the action evaluation quality corresponding to the normal motion segment through the score regressor, outputting the action type corresponding to the normal motion segment through the label classifier, and taking the action evaluation quality and the action type as the second motion label corresponding to the normal motion segment.
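Purely as an illustrative sketch of the data flow in this claim (not part of the claimed subject matter), the PyTorch module below splits the frame-level features of a normal motion segment into M subsequences, encodes each with a self-attention layer standing in for the non-local component, fuses the intermediate features, applies a one-dimensional convolution as the time sequence operation, and feeds a fully connected layer into a score regressor and a label classifier. The feature sizes, the use of nn.MultiheadAttention, and the pooling choice are assumptions of the example, not details from the patent.

```python
import torch
import torch.nn as nn

class MultiTaskActionHead(nn.Module):
    """Illustrative stand-in for the multi-task action understanding network."""

    def __init__(self, m: int = 4, feat_dim: int = 256, num_actions: int = 48):
        super().__init__()
        self.m = m
        # one branch per subsequence, standing in for the M non-local components
        self.branches = nn.ModuleList(
            [nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True) for _ in range(m)]
        )
        self.temporal_conv = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.fc = nn.Linear(feat_dim, feat_dim)
        self.score_regressor = nn.Linear(feat_dim, 1)              # action evaluation quality
        self.label_classifier = nn.Linear(feat_dim, num_actions)   # action type

    def forward(self, x: torch.Tensor):
        # x: (batch, N, feat_dim) frame-level features of a normal motion segment, N divisible by M
        subsequences = torch.chunk(x, self.m, dim=1)
        intermediate = [attn(s, s, s)[0] for attn, s in zip(self.branches, subsequences)]
        fused = torch.cat(intermediate, dim=1)                     # fuse the M intermediate action features
        temporal = self.temporal_conv(fused.transpose(1, 2)).transpose(1, 2)
        hidden = torch.relu(self.fc(temporal.mean(dim=1)))         # pooled time sequence feature
        return self.score_regressor(hidden).squeeze(-1), self.label_classifier(hidden)

# quality, action_logits = MultiTaskActionHead()(torch.randn(2, 64, 256))
```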
6. The method of claim 1, wherein the associating the second motion label with the playback motion segment comprises:
taking the normal motion segment preceding the playback motion segment as a target segment according to the start-stop time sequence corresponding to the motion segment, and using the second motion label corresponding to the target segment as the second motion label corresponding to the playback motion segment.
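A small sketch of this association rule, for illustration only: iterate over the motion segments in start-time order and let each playback motion segment inherit the second motion label of the most recent normal motion segment. The dictionary keys are hypothetical.

```python
def associate_playback_labels(segments):
    """segments: list of dicts ordered by start timestamp, each with a 'kind'
    ('normal' or 'playback') and, for normal segments, a 'second_label'."""
    last_label = None
    for segment in segments:
        if segment["kind"] == "normal":
            last_label = segment.get("second_label")
        elif last_label is not None:
            segment["second_label"] = last_label  # playback reuses the preceding normal segment's label
    return segments
```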
7. The method as recited in claim 1, further comprising:
intercepting an important event segment from the video to be processed according to the start-stop time sequence corresponding to the motion segment; the motion segment belongs to the important event segment;
generating tag text corresponding to the important event segment according to the association relationship between the important event segment and the motion segment and according to the first motion label and the second motion label corresponding to the motion segment, adding the tag text to the important event segment, and splicing the important event segment with the added tag text into a video highlight.
8. The method of claim 7, wherein the tag text comprises a plurality of sub-tags having different tag types; and the adding the tag text to the important event segment and splicing the important event segment with the added tag text into a video highlight comprises:
acquiring a tag fusion template; the tag fusion template comprises at least two filling positions respectively corresponding to tag types;
adding each sub-tag in the tag text to the corresponding filling position according to its tag type to obtain an added tag fusion template;
and performing pixel fusion on the added tag fusion template and the important event segment to obtain an added important event segment, and splicing the added important event segment to obtain the video highlight.
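By way of a hypothetical example only (not part of the claims), one possible form of the pixel fusion is alpha blending of a rendered tag template onto each frame of the important event segment. The template mapping and the draw_text helper are assumptions introduced for this sketch.

```python
import numpy as np

def fuse_tag_template(frame: np.ndarray, tag_text: dict, template: dict, draw_text) -> np.ndarray:
    """Fill each sub-tag into the filling position of its tag type, then pixel-fuse
    the rendered template with an important-event frame by alpha blending.
    `template` maps a tag type to an (x, y) filling position and `draw_text`
    rasterises a string onto a canvas; both are assumed helpers."""
    overlay = np.zeros_like(frame, dtype=np.float32)
    for tag_type, sub_tag in tag_text.items():
        x, y = template[tag_type]                    # filling position for this tag type
        overlay = draw_text(overlay, sub_tag, (x, y))
    alpha = (overlay.sum(axis=-1, keepdims=True) > 0).astype(np.float32) * 0.8  # template opacity
    fused = frame.astype(np.float32) * (1.0 - alpha) + overlay * alpha
    return fused.astype(frame.dtype)
```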
9. The method as recited in claim 1, further comprising:
labeling an action evaluation quality label and an action type label corresponding to each target action segment in a motion sample segment;
uniformly extracting K video frames from the labeled motion sample segment, extracting N consecutive video frames from the K video frames as an input sequence, and inputting the input sequence into an initial action understanding network; K and N are positive integers, K is greater than N, and N is equal to the length of the model input data corresponding to the initial action understanding network;
respectively processing every N/M consecutive video frames through M non-local components in the initial action understanding network, and outputting a predicted action evaluation quality and a predicted action type corresponding to each target action segment; M is a positive integer, and N is an integer multiple of M;
generating a quality loss function according to the predicted action evaluation quality and the action evaluation quality label, and generating an action loss function according to the predicted action type and the action type label;
and generating a target loss function according to the quality loss function and the action loss function, and correcting model parameters in the initial action understanding network through the target loss function to obtain the multi-task action understanding network.
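For illustration only (not part of the claims), the sketch below combines a quality loss and an action loss into a target loss. The choice of mean squared error for the regression term, cross-entropy for the classification term, and equal weighting are assumptions of the example; the patent does not specify the loss forms.

```python
import torch
import torch.nn.functional as F

def target_loss(pred_quality: torch.Tensor, quality_label: torch.Tensor,
                action_logits: torch.Tensor, action_label: torch.Tensor,
                quality_weight: float = 1.0, action_weight: float = 1.0) -> torch.Tensor:
    quality_loss = F.mse_loss(pred_quality, quality_label)      # quality loss function
    action_loss = F.cross_entropy(action_logits, action_label)  # action loss function
    return quality_weight * quality_loss + action_weight * action_loss
```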
10. The method as recited in claim 2, further comprising:
labeling a shot type label corresponding to each target action segment in the motion sample segment;
extracting frames from the labeled motion sample segment to obtain a fourth video frame sequence;
inputting the fourth video frame sequence into a lightweight convolutional neural network, and outputting a predicted shot type corresponding to each target action segment through the lightweight convolutional neural network;
and generating a shot loss function according to the predicted shot type and the shot type label, and correcting model parameters in the lightweight convolutional neural network through the shot loss function to obtain the image classification network.
11. The method according to claim 3, further comprising:
marking a start-stop time stamp label corresponding to each target action segment in the motion sample segment;
dividing the marked motion sample segment into S sample sub-segments according to a dividing rule, and performing frame extraction on the S sample sub-segments to obtain a fifth video frame sequence; s is a positive integer;
updating the start-stop time stamp label according to the time length of each sample sub-segment to obtain an updated start-stop time stamp label;
inputting the fifth video frame sequence into the feature coding network to obtain a picture feature matrix corresponding to each sample sub-segment;
inputting the picture feature matrix into an initial action segmentation network, and outputting a predicted start-stop time stamp corresponding to each target action segment in each sample sub-segment through the initial action segmentation network;
generating a time loss function according to the predicted start-stop time stamp and the updated start-stop time stamp label, and correcting model parameters in the initial action segmentation network through the time loss function to obtain a time sequence action segmentation network.
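As a hypothetical illustration of the label update step (not part of the claims), one possible convention is to re-express each ground-truth start-stop timestamp relative to the start of the sample sub-segment that contains it; this interpretation, and the container-based matching, are assumptions of the sketch.

```python
from typing import List, Tuple

def update_timestamp_labels(sub_segment_bounds: List[Tuple[float, float]],
                            labels: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """sub_segment_bounds: [(start, end), ...] of the S sample sub-segments, in seconds
    of the original sample; labels: [(t1, t2), ...] ground-truth start-stop timestamps."""
    updated = []
    for t1, t2 in labels:
        for s_start, s_end in sub_segment_bounds:
            if s_start <= t1 and t2 <= s_end:
                updated.append((t1 - s_start, t2 - s_start))  # shift into sub-segment-local time
                break
    return updated
```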
12. The method according to any one of claims 1-11, wherein the video to be processed is a diving competition video and the target object is a contestant;
and the performing action type recognition on the normal motion segment through the multi-task action understanding network to obtain the action type corresponding to the normal motion segment, performing action quality evaluation on the normal motion segment through the multi-task action understanding network to obtain the action evaluation quality corresponding to the normal motion segment, and taking the action evaluation quality and the action type as the second motion label corresponding to the normal motion segment comprises:
inputting the normal motion segment into the multi-task action understanding network, and performing action type recognition on the normal motion segment through the multi-task action understanding network to obtain a jump mode, an arm force jump attribute, a rotation posture, a number of rotation circles, and a number of side rotation circles corresponding to the normal motion segment, wherein the jump mode, the arm force jump attribute, the rotation posture, the number of rotation circles, and the number of side rotation circles are determined as the action type corresponding to the normal motion segment;
performing action quality evaluation on the normal motion segment through the multi-task action understanding network to obtain the action evaluation quality corresponding to the normal motion segment;
determining the action type and the action evaluation quality as the second motion label corresponding to the normal motion segment; the second motion label is used for recording the action type and the action evaluation quality corresponding to the normal motion segment in the motion script information associated with the contestant.
13. A video data processing apparatus, comprising:
a shot identification module, used for carrying out shot type identification on a first video frame sequence in a video to be processed containing a target object when the first video frame sequence is acquired, so as to obtain a shot type corresponding to each video frame in the first video frame sequence;
a segment dividing module, used for extracting features of the first video frame sequence to obtain picture features corresponding to the first video frame sequence, obtaining a start-stop time sequence for identifying a motion segment in the first video frame sequence based on the picture features, and dividing the motion segment into a normal motion segment and a playback motion segment according to the start-stop time sequence and the shot type;
a first label determining module, used for determining the shot type as a first motion label corresponding to the motion segment; the first motion label is used for recording the shot type corresponding to the motion segment in motion script information associated with the target object;
and a second label determining module, used for performing action type recognition on the normal motion segment through a multi-task action understanding network to obtain an action type corresponding to the normal motion segment, performing action quality evaluation on the normal motion segment through the multi-task action understanding network to obtain an action evaluation quality corresponding to the normal motion segment, taking the action evaluation quality and the action type as a second motion label corresponding to the normal motion segment, and associating the second motion label with the playback motion segment; the second motion label is used for recording the action type and the action evaluation quality corresponding to the motion segment in the motion script information.
14. A computer device, comprising: a processor, a memory, and a network interface;
wherein the processor is connected to the memory and the network interface, the network interface is configured to provide a data communication function, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method of any one of claims 1-12.
15. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded by a processor and to perform the method of any of claims 1-12.
CN202011580130.9A 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium Active CN113515997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011580130.9A CN113515997B (en) 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium


Publications (2)

Publication Number Publication Date
CN113515997A CN113515997A (en) 2021-10-19
CN113515997B true CN113515997B (en) 2024-01-19

Family

ID=78061003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011580130.9A Active CN113515997B (en) 2020-12-28 2020-12-28 Video data processing method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN113515997B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222165B (en) * 2021-12-31 2023-11-10 咪咕视讯科技有限公司 Video playing method, device, equipment and computer storage medium
CN114419527B (en) * 2022-04-01 2022-06-14 腾讯科技(深圳)有限公司 Data processing method, equipment and computer readable storage medium
CN117156078B (en) * 2023-11-01 2024-02-02 腾讯科技(深圳)有限公司 Video data processing method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006164487A (en) * 2004-11-15 2006-06-22 Sharp Corp Waveform equalizer and equalization method, waveform equalization program, and computer readable recording medium which records waveform equalization program
CN111479130A (en) * 2020-04-02 2020-07-31 腾讯科技(深圳)有限公司 Video positioning method and device, electronic equipment and storage medium
CN111770359A (en) * 2020-06-03 2020-10-13 苏宁云计算有限公司 Event video clipping method, system and computer readable storage medium
CN112084371A (en) * 2020-07-21 2020-12-15 中国科学院深圳先进技术研究院 Film multi-label classification method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Analysis of key technologies of intelligent analysis and new applications; Du Chunwei; China Security & Protection (中国安防), Issue 07; full text *

Also Published As

Publication number Publication date
CN113515997A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
CN110781347B (en) Video processing method, device and equipment and readable storage medium
CN113515997B (en) Video data processing method and device and readable storage medium
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN111683209B (en) Mixed-cut video generation method and device, electronic equipment and computer-readable storage medium
CN111582241B (en) Video subtitle recognition method, device, equipment and storage medium
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN109803180B (en) Video preview generation method and device, computer equipment and storage medium
CN110784759B (en) Bullet screen information processing method and device, electronic equipment and storage medium
WO2022184117A1 (en) Deep learning-based video clipping method, related device, and storage medium
CN111553267B (en) Image processing method, image processing model training method and device
CN113542777B (en) Live video editing method and device and computer equipment
CN112967212A (en) Virtual character synthesis method, device, equipment and storage medium
CN113542865B (en) Video editing method, device and storage medium
CN113515998A (en) Video data processing method and device and readable storage medium
CN111160134A (en) Human-subject video scene analysis method and device
CN111209897A (en) Video processing method, device and storage medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN113822254B (en) Model training method and related device
CN111405314A (en) Information processing method, device, equipment and storage medium
CN113822127A (en) Video processing method, video processing device, video processing equipment and storage medium
CN114697741B (en) Multimedia information playing control method and related equipment
CN113411517B (en) Video template generation method and device, electronic equipment and storage medium
CN113747258B (en) Online course video abstract generation system and method
CN113824950A (en) Service processing method, device, equipment and medium
CN113709584A (en) Video dividing method, device, server, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant