CN110516572B - Method for identifying sports event video clip, electronic equipment and storage medium - Google Patents


Info

Publication number
CN110516572B
CN110516572B (application CN201910759733.6A)
Authority
CN
China
Prior art keywords
action
sample data
preset model
video clip
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910759733.6A
Other languages
Chinese (zh)
Other versions
CN110516572A (en)
Inventor
徐鸣谦
徐嵩
李琳
杜欧杰
王科
Current Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and MIGU Culture Technology Co Ltd
Priority to CN201910759733.6A
Publication of CN110516572A
Application granted
Publication of CN110516572B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Abstract

An embodiment of the invention provides a method for identifying a sports event video clip, an electronic device, and a storage medium. The method comprises: identifying the action category of the sports event video clip with a first preset model, the first preset model being trained with first sample data related to the action categories; and, if the accuracy of the recognition result is lower than a preset threshold, re-recognizing the action category with a second preset model and taking the re-recognition result as the final recognition result of the action category. The second preset model is trained with second sample data related to the relative position of a target reference object in the sports event video clip, where the relative position is the position between the target reference object and the body part that triggers the action. The method, electronic device, and storage medium provided by the embodiments of the invention can accurately identify sports event video clips, and also have the advantages of high efficiency, simplicity, and strong universality.

Description

Method for identifying video clip of sports event, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method, an electronic device, and a storage medium for identifying a video clip of a sporting event.
Background
When a video app needs to publish short highlight clips of sports games covering different scenes (such as goals, points, and penalty kicks), game videos can, in addition to traditional manual editing, be clipped automatically by AI using deep-learning algorithms. AI automatic clipping first requires scene recognition on the game video, and several deep-learning methods can already recognize certain video scenes: for example, a 3D convolutional neural network reaches an average accuracy of 83.6% on the Kinetics dataset (human actions), and an LSTM network reaches an average accuracy of 88.6% on the UCF-101 dataset (101 action classes). It can be seen that the existing schemes recognize single-person action scenes well, but the effect is not ideal for scenes of sports games, particularly basketball and football, where the scene recognition rate is only about 60%. This is mainly because such scenes contain crowded clusters of players, large spans of player interaction, and varied environmental conditions such as multiple viewing angles, changing illumination, and low resolution. As a result, the training samples are highly complex and the recognition accuracy of the classification model is low.
Therefore, how to accurately identify sports event video clips while avoiding the above defects is a problem that urgently needs to be solved.
Disclosure of Invention
To solve the problems in the prior art, embodiments of the present invention provide a method, an electronic device, and a storage medium for identifying a video clip of a sporting event.
The embodiment of the invention provides a method for identifying a video clip of a sports event, which comprises the following steps:
identifying the action category of the sports event video clip by adopting a first preset model; the training of the first preset model adopts first sample data; the first sample data is data related to action categories;
if the accuracy of the recognition result is lower than a preset threshold, re-recognizing the action category by adopting a second preset model, and taking the re-recognition result as the final recognition result of the action category; the training of the second preset model adopts second sample data; the second sample data is data related to the relative position of a target reference object in the sports event video clip; the relative position is the position between the target reference object and the body part that triggers the action.
An embodiment of the present invention provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein,
the processor implements the following method steps when executing the program:
identifying the action category of the sports event video clip by adopting a first preset model; the training of the first preset model adopts first sample data; the first sample data is data related to action categories;
if the accuracy of the recognition result is lower than a preset threshold, re-recognizing the action category by adopting a second preset model, and taking the re-recognition result as the final recognition result of the action category; the training of the second preset model adopts second sample data; the second sample data is data related to the relative position of a target reference object in the sports event video clip; the relative position is the position between the target reference object and the body part that triggers the action.
An embodiment of the invention provides a non-transitory computer readable storage medium having a computer program stored thereon, which when executed by a processor implements the following method steps:
identifying the action category of the sports event video clip by adopting a first preset model; the training of the first preset model adopts first sample data; the first sample data is data related to action categories;
if the accuracy of the recognition result is lower than a preset threshold, re-recognizing the action category by adopting a second preset model, and taking the re-recognition result as the final recognition result of the action category; the training of the second preset model adopts second sample data; the second sample data is data related to the relative position of a target reference object in the sports event video clip; the relative position is the position between the target reference object and the body part that triggers the action.
According to the method for identifying sports event video clips, the electronic device, and the storage medium provided by the embodiments of the invention, the action category of the sports event video clip is identified a second time, and the model used for this second identification is trained with second sample data related to the relative position of a target reference object in the clip; the sports event video clip can therefore be identified accurately, with the further advantages of high efficiency, simplicity, and strong universality.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow diagram of an embodiment of a method for identifying video segments of a sporting event according to the present invention;
FIG. 2 is a flow chart illustrating another embodiment of a method for identifying video segments of a sporting event according to the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an embodiment of a method for identifying a video clip of a sporting event, as shown in fig. 1, the embodiment of the present invention provides a method for identifying a video clip of a sporting event, including the following steps:
s101: identifying the action category of the sports event video clip by adopting a first preset model; the training of the first preset model adopts first sample data; the first sample data is data related to an action category.
Specifically, the device identifies the action category of the sports event video clip by adopting the first preset model, which is trained with the first sample data; the first sample data is data related to the action categories. The device may be an electronic device. Taking basketball as an example, the action categories may include layup, dunk, rebound, and free throw. Because the technical motions of a layup and a dunk are highly similar, the first preset model has difficulty distinguishing whether the action category is a layup or a dunk; the accuracy of the recognition results for layups and dunks is therefore lower, usually below the accuracy of the recognition results for rebounds and free throws.
The first preset model may be a convolutional neural network combining non-local modules with the two-stream Inflated 3D convolution network (I3D). The embodiment of the present invention uses the two-stream I3D convolutional neural network as the base network and adds non-local modules to it to obtain a better global effect. The I3D convolutional neural network is a network structure that inflates the convolution and pooling kernels into 3D form, i.e., a time dimension is added to the original length and width of every convolution and pooling kernel. The non-local module extracts long-range spatio-temporal information from the video and gives the deep neural network long-term memory and global information; it is efficient, simple, and widely applicable, and can be conveniently embedded into an existing network framework. Therefore, the embodiment of the invention adopts a convolutional neural network combining non-local modules with I3D, trains it, and identifies the action category of the sports event video clip with the trained network. The training of the first preset model may refer to the description below.
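As a rough illustration of what a non-local module computes, the following is a minimal NumPy sketch of an embedded-Gaussian non-local block applied to a flattened spatio-temporal feature map: every position attends to every other position, so the output mixes in global context before the residual connection. The weight matrices are random stand-ins for learned 1×1×1 convolutions, and the shapes are toy values; this is not the patent's actual implementation.

```python
import numpy as np

def nonlocal_block(x, rng=np.random.default_rng(0)):
    """Embedded-Gaussian non-local block over a flattened spatio-temporal
    feature map x of shape (N, C), where N = T*H*W positions.
    Weight matrices are random stand-ins for learned 1x1x1 convolutions."""
    n, c = x.shape
    c_half = c // 2  # channel reduction inside the block
    w_theta = rng.standard_normal((c, c_half)) * 0.01
    w_phi = rng.standard_normal((c, c_half)) * 0.01
    w_g = rng.standard_normal((c, c_half)) * 0.01
    w_z = rng.standard_normal((c_half, c)) * 0.01

    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    attn = theta @ phi.T                          # (N, N) pairwise similarities
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over all positions
    y = attn @ g                                  # aggregate global context
    return x + y @ w_z                            # residual connection

# A toy feature map: 2 frames x 4x4 spatial x 8 channels, flattened.
feat = np.ones((2 * 4 * 4, 8))
out = nonlocal_block(feat)
```

Because the block ends in a residual connection, it can be dropped into an existing I3D backbone without changing tensor shapes, which is what makes it "conveniently embedded into an existing network framework".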
Fig. 2 is a flowchart of another embodiment of the method for identifying a sports event video clip according to the present invention. As shown in fig. 2, video stream data may be obtained first and then cut into video segments, i.e., the sports event video clips; the action category of each clip is then identified by prediction with the I3D-nonlocal model, i.e., with the first preset model.
S102: if the accuracy of the recognition result is lower than a preset threshold, re-recognizing the action category by adopting a second preset model, and taking the re-recognition result as the final recognition result of the action category; the training of the second preset model adopts second sample data; the second sample data is data related to the relative position of a target reference object in the sports event video clip; the relative position is the position between the target reference object and the body part that triggers the action.
Specifically, if the device judges that the accuracy of the obtained recognition result is lower than the preset threshold, it re-recognizes the action category with the second preset model and takes the re-recognition result as the final recognition result of the action category; the second preset model is trained with the second sample data, which is data related to the relative position of a target reference object in the sports event video clip; the relative position is the position between the target reference object and the body part that triggers the action. The preset threshold may be set according to the actual situation; for a basketball event scenario, 70% may be chosen, and the accuracy of a recognition result may be represented by the confidence of the action category.
Referring to the above example, if the accuracy of the recognition results for layup and dunk is lower than the preset threshold, the action category is re-recognized with the second preset model, which may be a convolutional neural network (CNN) classifier. Taking basketball as an example, the target reference object is the hoop, and the relative position of the target reference object is the position between the hoop and the hand. It can be understood that when the hoop is far from the hand, i.e., the distance between them is greater than a preset distance, the action category is a layup; when the hoop is close to the hand, i.e., the distance between them is smaller than the preset distance, the action category is a dunk. The preset distance can be set independently according to the actual situation. The training of the second preset model may refer to the description below.
The relative position may be detected by running the YOLO algorithm on each frame of the sports event video clip. YOLO stands for "You Only Look Once: Unified, Real-Time Object Detection": "Once" and "Unified" mean that the CNN only needs to run once and provides end-to-end prediction, while "Real-Time" indicates that the YOLO algorithm is fast.
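The hoop-versus-hand decision rule described above can be sketched as follows. The box format `(x1, y1, x2, y2)`, the 80-pixel preset distance, and the hard-coded detections are all illustrative assumptions; a real pipeline would obtain the boxes from a YOLO detector run on every frame.

```python
import math

# Hypothetical preset distance in pixels; in practice this would be
# tuned per camera setup, as the text notes it can be set independently.
PRESET_DISTANCE = 80.0

def center(box):
    """Center point of a detection box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def classify_frame(hoop_box, hand_box, preset=PRESET_DISTANCE):
    """Classify one frame from the hoop-to-hand distance:
    far apart -> layup, close together -> dunk."""
    (hx, hy), (px, py) = center(hoop_box), center(hand_box)
    dist = math.hypot(hx - px, hy - py)
    return "layup" if dist > preset else "dunk"

# Stand-in detections for two frames.
far = classify_frame((100, 50, 140, 90), (400, 300, 420, 320))   # -> layup
near = classify_frame((100, 50, 140, 90), (110, 60, 130, 80))    # -> dunk
```

Each frame of the clip yields one such per-frame label, which is what the per-category counting in the next section aggregates over.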
According to the method for identifying sports event video clips provided by the embodiment of the invention, the action category of the sports event video clip is identified a second time, and the model used for this second identification is trained with second sample data related to the relative position of a target reference object in the clip; the sports event video clip can therefore be identified accurately, with the advantages of high efficiency, simplicity, and strong universality.
On the basis of the above embodiment, the re-identifying the action category by using the second preset model, and taking the re-identification result as the final identification result of the action category includes:
and if the re-recognition result comprises a plurality of action categories, respectively acquiring the count of each action category, and taking the action category with the largest count as the final recognition result.
Specifically, if the device determines that the re-recognition result includes a plurality of action categories, it obtains the count of each category and uses the category with the largest count as the final recognition result. Referring to the above example, the categories may include layup and dunk: the counts of layup results and dunk results are obtained respectively; if the layup count is greater than the dunk count, layup is taken as the final recognition result, and if the dunk count is greater, dunk is taken as the final recognition result.
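The counting step above is a plain majority vote over the per-frame re-recognition results, which can be sketched as:

```python
from collections import Counter

def majority_category(frame_labels):
    """Pick the per-clip action category as the most frequent
    per-frame re-recognition result."""
    counts = Counter(frame_labels)
    category, _ = counts.most_common(1)[0]
    return category

# e.g. 5 frames re-recognized as layup and 3 as dunk -> layup wins
labels = ["layup"] * 5 + ["dunk"] * 3
winner = majority_category(labels)
```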
According to the method for identifying the video clips of the sports events, provided by the embodiment of the invention, the final identification result is determined according to the number of the action types, so that the video clips of the sports events can be further accurately identified.
On the basis of the above embodiment, the re-identifying the action category by using the second preset model, and taking the re-identification result as the final identification result of the action category includes:
and if the action category with the largest count is judged not to be unique, respectively acquiring the confidences of the tied action categories, and taking the action category with the largest confidence value as the final recognition result.
Specifically, if the device determines that the action category with the largest count is not unique, it obtains the confidence of each tied category and takes the category with the largest confidence value as the final recognition result. Referring to the above example, if the layup count equals the dunk count, the confidences of the layup and dunk recognitions are obtained respectively; if the layup confidence value is greater than the dunk confidence value, layup is taken as the final recognition result, and if it is smaller, dunk is taken as the final recognition result.
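Combining the two rules (majority count first, confidence as the tie-breaker) might look like the sketch below. The patent does not specify how per-category confidence is aggregated, so taking the maximum per-frame confidence within each tied category is an assumption made here for illustration.

```python
from collections import Counter

def final_category(frame_results):
    """frame_results: list of (category, confidence) pairs, one per frame.
    The most frequent category wins; on a count tie, the tied category
    with the highest confidence value wins (max-per-category is assumed)."""
    counts = Counter(cat for cat, _ in frame_results)
    top = counts.most_common(1)[0][1]
    tied = {cat for cat, n in counts.items() if n == top}
    if len(tied) == 1:
        return tied.pop()
    best = {}
    for cat, conf in frame_results:
        if cat in tied:
            best[cat] = max(best.get(cat, 0.0), conf)
    return max(best, key=best.get)

# Two layup frames vs two dunk frames: counts tie, dunk is more confident.
results = [("layup", 0.61), ("layup", 0.58), ("dunk", 0.72), ("dunk", 0.55)]
decision = final_category(results)
```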
According to the method for identifying the video clips of the sports events, provided by the embodiment of the invention, the final identification result is determined through the confidence coefficient value of the action type, so that the video clips of the sports events can be further accurately identified.
On the basis of the above embodiment, the relative position is detected by applying the YOLO algorithm to each frame of the sports event video clip.
Specifically, the device detects the relative position by applying the YOLO algorithm to each frame of the sports event video clip; reference is made to the above description, which is not repeated.
According to the method for identifying sports event video clips provided by the embodiment of the invention, obtaining the relative position by applying the YOLO algorithm to each frame of the clip ensures the accuracy of the obtained relative position, so that sports event video clips can be identified still more accurately.
On the basis of the above embodiment, the first preset model is a convolutional neural network combining a non-local module with the two-stream Inflated 3D convolution (I3D).
Specifically, the first preset model in the device is a convolutional neural network combining a non-local module with the two-stream Inflated 3D convolution (I3D); reference is made to the above description, which is not repeated.
According to the method for identifying sports event video clips provided by the embodiment of the invention, selecting the first preset model as a convolutional neural network combining a non-local module with the two-stream I3D allows sports event video clips to be identified still more accurately, with the advantages of high efficiency, simplicity, and strong universality.
On the basis of the above embodiment, the second preset model is a convolutional neural network CNN classifier.
Specifically, the second preset model in the device is a convolutional neural network (CNN) classifier; reference is made to the above description, which is not repeated.
According to the method for identifying the video clips of the sports events, provided by the embodiment of the invention, the second preset model is selected as the Convolutional Neural Network (CNN) classifier, so that the video clips of the sports events can be further accurately identified.
On the basis of the above embodiment, the training of the first preset model includes:
collecting the action category data of each sports event video clip as the first sample data.
Specifically, the device collects the action category data of each sports event video clip as the first sample data. Referring to the above example, the first sample data may be video clips of five categories — layup, dunk, rebound, free throw, and background — and each video clip may be 64 frames.
Preprocessing the first sample data, and training the convolutional neural network combining the nonlocal and the I3D by using the preprocessed first sample data.
Specifically, the device preprocesses the first sample data and trains the convolutional neural network combining non-local and I3D with the preprocessed data. The preprocessing can be divided into the following steps. First, since the sample counts of the different categories differ greatly, image-enhancement operations such as flipping, rotating, and adding noise are used to add training samples so that the categories are balanced. Second, every frame of the training samples is scaled to the same size — 256 × 320 in the embodiment of the present invention — to increase the model-loading speed during training. Third, each sampled picture is cut into several new pictures by random cropping; the embodiment of the invention cuts 3 pictures of size 224 × 224 as input samples of the algorithm, to increase the generalization of the training samples. Finally, the processed samples are converted into the model-reading format used by the embodiment of the invention, i.e., LMDB form, where the LMDB stores the sample directory addresses and the label data.
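The augmentation, resize, and crop steps above can be sketched in NumPy as follows. Nearest-neighbour resizing stands in for whatever scaling the real pipeline uses, rotation is omitted for brevity, and the LMDB-conversion step is not shown; all function names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_resize(frame, out_h=256, out_w=320):
    """Nearest-neighbour resize of an (H, W, 3) frame to 256x320,
    standing in for the uniform-scaling step."""
    h, w, _ = frame.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return frame[rows][:, cols]

def random_crops(frame, n=3, size=224):
    """Cut n random 224x224 crops from a resized frame."""
    h, w, _ = frame.shape
    crops = []
    for _ in range(n):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        crops.append(frame[top:top + size, left:left + size])
    return crops

def augment(frame):
    """Enhancement ops used to balance per-category sample counts:
    horizontal flip and additive noise (rotation omitted)."""
    flipped = frame[:, ::-1]
    noisy = np.clip(frame + rng.normal(0, 5, frame.shape), 0, 255)
    return [flipped, noisy]

raw = rng.integers(0, 256, size=(480, 640, 3)).astype(np.float64)
resized = nn_resize(raw)        # -> (256, 320, 3)
crops = random_crops(resized)   # -> three (224, 224, 3) network inputs
extras = augment(resized)       # -> extra samples for minority categories
```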
A suitable learning rate, iteration count, and training parameters are then chosen. In the embodiment of the present invention the sampling rate is set to 8, so a 64-frame short video yields 8 picture samples as input; the model is saved once every 400 iterations, and the error rate is lowest at 12800 iterations, so the model at that iteration count is selected as the trained first preset model.
And taking the convolutional neural network which is combined by the nonlocal and the I3D when a first preset condition is reached as a trained first preset model.
Specifically, the device takes the convolutional neural network which is formed by combining the nonlocal and the I3D when a first preset condition is reached as a trained first preset model. The first preset condition may be that the number of iterations reaches 12800, and is not particularly limited.
According to the method for identifying the video clips of the sports events, provided by the embodiment of the invention, the accuracy of the first preset model is ensured by training the first preset model, the video clips of the sports events can be further accurately identified, and the method has the advantages of high efficiency, simplicity and strong universality.
On the basis of the above embodiment, the training of the second preset model includes:
and collecting the relative position data corresponding to the target reference object of each sports event video clip as the second sample data.
Specifically, the device collects the relative position data corresponding to the target reference object of each sports event video clip as the second sample data. The second sample data may be the position data of the hoop relative to the hand in two categories of video clips, namely layup clips and dunk clips, and each clip may be about 64 frames.
And preprocessing the second sample data, and training the CNN classifier by adopting the preprocessed second sample data.
Specifically, the device preprocesses the second sample data, and trains the CNN classifier by using the preprocessed second sample data. The steps of preprocessing the second sample data and training the CNN classifier using the preprocessed second sample data may be the same as the steps of preprocessing the first sample data and training the convolutional neural network combining the non-local and I3D, and are not described again.
And taking the CNN classifier which meets a second preset condition as a trained second preset model.
Specifically, the device takes the CNN classifier that meets the second preset condition as a trained second preset model. The second preset condition may include that the iteration number reaches a preset number, or the model error is smaller than a preset error, which is not specifically limited, and the preset number and the preset error may be set independently according to the actual situation.
According to the method for identifying the video clips of the sports events, provided by the embodiment of the invention, the second preset model is trained, so that the accuracy of the second preset model is ensured, and the video clips of the sports events can be further accurately identified.
It should be noted that different action categories may be included in the same sports event video clip, e.g., layup, dunk, rebound, and free throw. With the first preset model, the accuracy of the recognition results for rebound and free throw is higher than the preset threshold, while the accuracy of the recognition results for layup and dunk is lower than the preset threshold. Therefore, after step S101, the method may further include the following step:
s102': if the target action type with the identification result accuracy lower than the preset threshold exists, re-identifying the target action type by adopting a second preset model, and determining a final identification result serving as the action type according to the re-identification result and the identification result higher than the preset threshold; training of the second preset model adopts second sample data; the second sample data is data related to the relative position of a target reference object in the video clip of the sports event; the relative position is a position between the target reference object and the trigger portion of the action type.
This is explained with reference to fig. 2: the recognition results above the preset threshold — rebound and free throw — are used directly to synthesize the short video, while layup and dunk are taken as target action categories and re-recognized with the second preset model. At this point the re-recognition results together with the above-threshold results may comprise layup, dunk, rebound, and free throw, and the final recognition result can be further determined from the counts of the action categories and their confidences. For how the final recognition result is determined from the counts and confidences, refer to the above description for two action categories, which is not repeated here.
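The overall two-stage routing in fig. 2 can be sketched as below: clips whose first-pass confidence clears the threshold keep their label, and the rest are handed to the second model. The callable `second_model` is a stand-in for the position-based CNN classifier, and the 70% threshold comes from the basketball example earlier.

```python
def merge_results(first_pass, second_model, threshold=0.70):
    """first_pass: list of (category, confidence) pairs from the first
    model, one per clip. Below-threshold clips are treated as target
    action categories and re-recognized by second_model; the rest keep
    their first-pass label."""
    final = []
    for clip_id, (cat, conf) in enumerate(first_pass):
        if conf < threshold:
            final.append(second_model(clip_id))
        else:
            final.append(cat)
    return final

# Stand-in second model: re-labels every low-confidence clip as "dunk".
second = lambda clip_id: "dunk"
first = [("rebound", 0.91), ("layup", 0.62), ("free throw", 0.88)]
merged = merge_results(first, second)  # only the layup clip is re-labeled
```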
Fig. 3 is a schematic structural diagram of an entity of an electronic device according to an embodiment of the present invention, and as shown in fig. 3, the electronic device includes: a processor (processor)301, a memory (memory)302, and a bus 303;
the processor 301 and the memory 302 complete communication with each other through a bus 303;
the processor 301 is configured to call program instructions in the memory 302 to perform the methods provided by the above method embodiments, including: identifying the action category of the sports event video clip by adopting a first preset model; the training of the first preset model adopts first sample data; the first sample data is data related to action categories; if the accuracy of the recognition result is lower than a preset threshold, re-recognizing the action category by adopting a second preset model, and taking the re-recognition result as the final recognition result of the action category; the training of the second preset model adopts second sample data; the second sample data is data related to the relative position of a target reference object in the sports event video clip; the relative position is the position between the target reference object and the trigger part of the action category.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example, including: identifying the action category of the sports event video clip by adopting a first preset model; the training of the first preset model adopts first sample data; the first sample data is data related to action categories; if the accuracy of the recognition result is lower than a preset threshold, re-recognizing the action category by adopting a second preset model, and taking the re-recognition result as the final recognition result of the action category; the training of the second preset model adopts second sample data; the second sample data is data related to the relative position of a target reference object in the sports event video clip; the relative position is the position between the target reference object and the trigger part of the action category.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example, including: identifying the action category of the sports event video clip by adopting a first preset model; the training of the first preset model adopts first sample data; the first sample data is data related to action categories; if the accuracy of the recognition result is lower than a preset threshold, re-recognizing the action category by adopting a second preset model, and taking the re-recognition result as the final recognition result of the action category; the training of the second preset model adopts second sample data; the second sample data is data related to the relative position of a target reference object in the sports event video clip; the relative position is the position between the target reference object and the trigger part of the action category.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by program instructions instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps comprising the method embodiments; and the aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. A method for identifying a video clip of a sporting event, comprising:
identifying the action category of the sports event video clip by adopting a first preset model; the training of the first preset model adopts first sample data; the first sample data is data related to action categories;
the first preset model is a convolutional neural network combining a non-local module with a two-stream Inflated 3D convolutional network (I3D); the training of the first preset model comprises: collecting data of each action category of each sports event video clip as the first sample data; preprocessing the first sample data, and training the convolutional neural network combining the non-local module and the I3D with the preprocessed first sample data; and taking the convolutional neural network combining the non-local module and the I3D when a first preset condition is reached as the trained first preset model;
if the accuracy of the recognition result is lower than a preset threshold, re-recognizing the action category by adopting a second preset model, and taking the re-recognition result as the final recognition result of the action category; the training of the second preset model adopts second sample data; the second sample data is data related to the relative position of a target reference object in the sports event video clip; the relative position is the position between the target reference object and the trigger part of the action category;
the second preset model is a convolutional neural network (CNN) classifier; the training of the second preset model comprises: collecting relative position data corresponding to the target reference object of each sports event video clip as the second sample data; preprocessing the second sample data, and training the CNN classifier with the preprocessed second sample data; and taking the CNN classifier that meets a second preset condition as the trained second preset model.
2. The method for identifying a video clip of a sporting event according to claim 1, wherein the re-recognizing the action category by adopting the second preset model and taking the re-recognition result as the final recognition result of the action category comprises:
if the re-recognition result comprises a plurality of action categories, acquiring the number of occurrences of each action category, and taking the action category with the largest number as the final recognition result.
3. The method for identifying a video clip of a sporting event according to claim 2, wherein the re-recognizing the action category by adopting the second preset model and taking the re-recognition result as the final recognition result of the action category comprises:
if the action category with the largest number is judged to be not unique, acquiring the confidences of all the action categories with the largest number, and taking the action category with the highest confidence as the final recognition result.
4. The method for identifying a video clip of a sporting event according to any one of claims 1 to 3, wherein the relative position is detected for each frame of the sports event video clip by adopting the YOLO algorithm.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 4 when executing the computer program.
6. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN201910759733.6A 2019-08-16 2019-08-16 Method for identifying sports event video clip, electronic equipment and storage medium Active CN110516572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910759733.6A CN110516572B (en) 2019-08-16 2019-08-16 Method for identifying sports event video clip, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110516572A CN110516572A (en) 2019-11-29
CN110516572B true CN110516572B (en) 2022-06-28

Family

ID=68625506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910759733.6A Active CN110516572B (en) 2019-08-16 2019-08-16 Method for identifying sports event video clip, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110516572B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209897B (en) * 2020-03-09 2023-06-20 深圳市雅阅科技有限公司 Video processing method, device and storage medium
CN111598035B (en) * 2020-05-22 2023-05-23 北京爱宾果科技有限公司 Video processing method and system
CN113542774B (en) * 2021-06-04 2023-10-20 北京格灵深瞳信息技术股份有限公司 Video synchronization method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9330726B1 (en) * 2015-10-09 2016-05-03 Sports Logic Group, LLC System capable of integrating user-entered game event data with corresponding video data
CN105608690A (en) * 2015-12-05 2016-05-25 陕西师范大学 Graph theory and semi supervised learning combination-based image segmentation method
CN107766839A * 2017-11-09 2018-03-06 清华大学 Action identification method and device based on neural network
CN107967491A * 2017-12-14 2018-04-27 北京木业邦科技有限公司 Machine re-learning method and device for plank identification, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN108810620B (en) Method, device, equipment and storage medium for identifying key time points in video
CN109145784B (en) Method and apparatus for processing video
KR102641115B1 (en) A method and apparatus of image processing for object detection
JP6719457B2 (en) Method and system for extracting main subject of image
CN110516572B (en) Method for identifying sports event video clip, electronic equipment and storage medium
CN107944427B (en) Dynamic face recognition method and computer readable storage medium
CN111738243B (en) Method, device and equipment for selecting face image and storage medium
US8805123B2 (en) System and method for video recognition based on visual image matching
KR101330636B1 (en) Face view determining apparatus and method and face detection apparatus and method employing the same
CN109791615A (en) For detecting and tracking the method, target object tracking equipment and computer program product of target object
CN110460838B (en) Lens switching detection method and device and computer equipment
CN114445768A (en) Target identification method and device, electronic equipment and storage medium
CN113435355A (en) Multi-target cow identity identification method and system
CN112417970A (en) Target object identification method, device and electronic system
CN110472561B (en) Football goal type identification method, device, system and storage medium
CN112613508A (en) Object identification method, device and equipment
CA3061908C (en) Ball trajectory tracking
CN111738042A (en) Identification method, device and storage medium
CN115376210B (en) Drowning behavior identification method, device, equipment and medium for preventing drowning in swimming pool
CN111814690A (en) Target re-identification method and device and computer readable storage medium
CN111950507A (en) Data processing and model training method, device, equipment and medium
CN115004245A (en) Target detection method, target detection device, electronic equipment and computer storage medium
Liu et al. Deep learning-based automatic player identification and logging in american football videos
CN111860261B (en) Passenger flow value statistical method, device, equipment and medium
CN114359669A (en) Picture analysis model adjusting method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant