WO2020258498A1 - 基于深度学习的足球比赛行为识别方法、装置及终端设备 (Deep-learning-based football match behavior recognition method, device, and terminal equipment) - Google Patents

基于深度学习的足球比赛行为识别方法、装置及终端设备 (Deep-learning-based football match behavior recognition method, device, and terminal equipment)

Info

Publication number
WO2020258498A1
WO2020258498A1 · PCT/CN2019/103168
Authority
WO
WIPO (PCT)
Prior art keywords: module, data, inception, network model, input image
Application number
PCT/CN2019/103168
Other languages
English (en)
French (fr)
Inventor
雷晨雨
李曼
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020258498A1 publication Critical patent/WO2020258498A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • This application belongs to the field of computer technology, and in particular relates to a deep-learning-based football match behavior recognition method and device, a computer non-volatile readable storage medium, and terminal equipment.
  • The goal of behavior recognition is to recognize common human behaviors occurring in real life. Accurate behavior recognition is challenging because human behavior is complex and highly diverse.
  • Player behavior in a football match video is a planned, highly collaborative multi-player (multi-agent) team behavior.
  • In the prior art, the entire recognition process is extremely complicated and therefore extremely time-consuming. Moreover, because each step introduces a certain accuracy deviation, stacking multiple complex steps together results in low accuracy of the final recognition result.
  • The embodiments of the present application provide a deep-learning-based football match behavior recognition method and device, a computer non-volatile readable storage medium, and terminal equipment, so as to solve the problems that existing football match behavior recognition methods are extremely time-consuming and have low accuracy.
  • The first aspect of the embodiments of the present application provides a deep-learning-based football match behavior recognition method, which may include: obtaining a football match video to be recognized; dividing the football match video into N video segments and extracting one frame of image from each video segment as an input image, N being an integer greater than 1; and processing the input images with a preset deep learning network model to obtain a behavior recognition result corresponding to the football match video.
  • The deep learning network model is composed of an Inception network model and a three-dimensional ResNet network model connected in cascade.
  • The Inception network model is used to learn the relationships between the pixels within each frame of the input images.
  • The three-dimensional ResNet network model is used to learn the relationships between the frames of the input images.
  • The second aspect of the embodiments of the present application provides a football match behavior recognition device, which may include modules for implementing the steps of the above football match behavior recognition method.
  • The third aspect of the embodiments of the present application provides a computer non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above football match behavior recognition method.
  • The fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the above football match behavior recognition method when executing the computer-readable instructions.
  • Through the embodiments of the present application, a deep learning network model composed of an Inception network model and a three-dimensional ResNet network model connected in cascade is used to perform behavior recognition.
  • The Inception network model learns the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model learns the relationships between the frames of the input images, which greatly simplifies the behavior recognition process and reduces the accuracy loss caused by stacking multiple complex steps in the prior art; while further reducing the time consumed, this also improves the accuracy of the final recognition result.
  • FIG. 1 is a flowchart of an embodiment of a method for recognizing a football match behavior based on deep learning in an embodiment of the application;
  • Figure 2 is a schematic flowchart of extracting a frame of image from each video segment as an input image
  • Figure 3 is a schematic diagram of the Inception network model
  • Figure 4 is a schematic flow chart of the data processing process of the Inception module
  • Figure 5 is a schematic flow chart of the data processing process of the three-dimensional convolution module
  • Figure 6 is a schematic diagram of decomposing the three-dimensional convolution operation in each three-dimensional convolution module into a spatial convolution operation and a temporal convolution operation;
  • FIG. 7 is a structural diagram of an embodiment of a football match behavior recognition device in an embodiment of the application.
  • FIG. 8 is a schematic block diagram of a terminal device in an embodiment of the application.
  • Referring to FIG. 1, an embodiment of a deep-learning-based football match behavior recognition method in an embodiment of the present application may include:
  • Step S101 Obtain a football match video to be recognized.
  • The football match video may be a video captured in real time by the user through the camera of a terminal device such as a mobile phone or tablet computer.
  • In one specific usage scenario of this embodiment, when the user wants to perform football match behavior recognition directly, he or she can turn on the behavior recognition mode of the terminal device by clicking a specific physical or virtual button before shooting; in this mode, the terminal device automatically processes each video shot by the user according to steps S102 and S103 to obtain the behavior recognition result.
  • The football match video may also be a video already stored in the terminal device, or a video obtained by the terminal device from a cloud server or another terminal device via a network.
  • In another specific usage scenario, when the user wants to perform behavior recognition on one or more existing football match videos, he or she can turn on the behavior recognition mode of the terminal device by clicking a specific physical or virtual button and select these football match videos (the order of clicking the button and selecting the videos is interchangeable, i.e., the videos may also be selected first and the behavior recognition mode turned on afterwards); the terminal device then automatically processes these football match videos according to steps S102 and S103 to obtain the behavior recognition results.
  • Step S102 Divide the football game video into N video segments, and extract one frame of image from each video segment as an input image.
  • N is an integer greater than 1, and its specific value can be set according to actual conditions, for example, it can be set to 3, 5, 10, 20 or other values. In this embodiment, it is preferably set to 5, that is, the football game video is divided into 5 video segments. The number of frames contained in each video segment may be the same or different, which is not specifically limited in this embodiment.
  • In this embodiment, one frame of image needs to be extracted from each of the N video segments as an input image.
  • In one specific implementation, for each video segment of the football match video, one frame of image may be randomly extracted from that segment as the input image of that segment.
  • For example, one frame of image is randomly extracted from the 1st video segment of the football match video as the 1st input image, one frame of image is randomly extracted from the 2nd video segment as the 2nd input image, one frame of image is randomly extracted from the 3rd video segment as the 3rd input image, and so on, until one frame of image is randomly extracted from the Nth video segment of the football match video as the Nth input image.
  • In another specific implementation of this embodiment, the input images can also be extracted through the process shown in FIG. 2:
  • Step S1021: Randomly extract one frame of image from the 1st video segment of the football match video as the 1st input image.
  • Step S1022: Separately calculate the image similarity between each frame of image in the nth video segment of the football match video and the (n-1)th input image, where 2 ≤ n ≤ N.
  • Specifically, the feature vector of each frame of image in the nth video segment of the football match video and the feature vector of the (n-1)th input image may first be calculated separately.
  • The feature vector of any image can be calculated with the Local Binary Patterns (LBP) algorithm, which constructs a measure of the relationship between a pixel and its surrounding pixels: for each pixel of the image, the gray value of the pixel is converted into an eight-bit binary sequence by comparing each pixel in the neighborhood centered on it with the center pixel. Taking the pixel value of the center point as the threshold, a neighboring point is binarized to 0 if its pixel value is less than that of the center point, and to 1 otherwise.
  • The 0/1 sequence obtained by binarization is treated as an 8-bit binary number, and converting this binary number to decimal gives the LBP value at the center point. After the LBP value of every pixel has been calculated, the statistical histogram of the LBP feature spectrum is determined as the feature vector of the image.
  • Because each point is quantized using the relationship between it and its surrounding points, the influence of illumination on the image can be eliminated more effectively after quantization: as long as a change in illumination is not large enough to change the ordering of the pixel values of two points, the LBP value does not change, which ensures the accuracy of the extracted feature information.
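  • The LBP computation just described can be sketched in NumPy as follows; this is a minimal 8-neighbor rendering written for this edit (not code from the patent), which ignores border pixels for simplicity:

```python
import numpy as np


def lbp_histogram(gray: np.ndarray) -> np.ndarray:
    """8-neighbor LBP feature vector: each interior pixel becomes an 8-bit
    code (neighbor >= center -> 1, else 0), then a 256-bin histogram of the
    LBP feature spectrum is taken as the image's feature vector."""
    h, w = gray.shape
    center = gray[1:h - 1, 1:w - 1].astype(np.int32)
    # offsets of the 8 neighbors, ordered so each contributes one bit
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx].astype(np.int32)
        codes |= (neighbor >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()  # normalized 256-dimensional feature vector
```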
  • After the feature vectors have been calculated, the image similarity SimDeg_f between each frame of image in the nth video segment of the football match video and the (n-1)th input image is computed from the two feature vectors, where: f is the index of a frame of image in the nth video segment, 1 ≤ f ≤ F_n; F_n is the number of frames in the nth video segment; the feature vector of the fth frame of image in the nth video segment is CharVecX_f = (VecX_{f,1}, VecX_{f,2}, ..., VecX_{f,d}, ..., VecX_{f,Dim}); the feature vector of the (n-1)th input image is CharVecY = (VecY_1, VecY_2, ..., VecY_d, ..., VecY_Dim); d is the dimension index of the feature vectors, 1 ≤ d ≤ Dim; Dim is the number of dimensions of the feature vectors; VecX_{f,d} and VecY_d are the components of the respective feature vectors in the dth dimension; and SimDeg_f is the resulting image similarity between the fth frame of image and the (n-1)th input image.
  • Step S1023: From the frames of the nth video segment, select the frame of image with the smallest image similarity to the (n-1)th input image as the nth input image. This keeps a certain difference between consecutive input images, increases the amount of information they contain, and helps improve the accuracy of the final recognition result.
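  • Steps S1022 and S1023 can then be sketched as follows, reusing lbp_histogram from the sketch above. The exact similarity formula appears only as an image in the original filing, so the Euclidean distance between histogram components used here is an assumption standing in for SimDeg_f:

```python
import numpy as np


def pick_next_input_image(segment_frames, prev_input_gray):
    """Return the frame of this segment least similar to the previous
    input image (S1023), comparing LBP feature vectors (S1022)."""
    prev_vec = lbp_histogram(prev_input_gray)
    distances = []
    for frame in segment_frames:  # frames assumed already grayscale
        diff = lbp_histogram(frame) - prev_vec
        distances.append(float(np.sqrt(np.sum(diff ** 2))))  # SimDeg_f stand-in
    # NOTE: with a distance as the stand-in, "smallest similarity" means the
    # LARGEST distance, so argmax selects the least similar frame.
    return segment_frames[int(np.argmax(distances))]
```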
  • Step S103 Use a preset deep learning network model to process the input image to obtain a behavior recognition result corresponding to the football match video.
  • the deep learning network model is composed of a cascade of an Inception network model and a three-dimensional ResNet network model, the Inception network model is used to learn the relationship between pixels in each frame of the input image, and the three-dimensional ResNet network The model is used to learn the relationship between the frames of the input image.
  • the Inception network model is composed of M Inception modules cascaded as shown in Figure 3.
  • M is an integer greater than 1, and its value can be set according to the actual situation; for example, it can be set to 2, 3, 5, 10, or other values, and it is preferably set to 3 in this embodiment.
  • the data processing process of the m-th Inception module can include the steps shown in Figure 4:
  • Step S401 Obtain input data of the m-th Inception module.
  • Specifically, the input data of the mth Inception module can be obtained according to the following formula:

    InputInception_m = InputImage             if m = 1
    InputInception_m = OutputInception_{m-1}  if 1 < m ≤ M

  • Here 1 ≤ m ≤ M, InputImage is the input image, OutputInception_{m-1} is the output data of the (m-1)th Inception module, and InputInception_m is the input data of the mth Inception module.
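  • Expressed as code, this piecewise definition is just sequential composition: the first module consumes the input images and each later module consumes its predecessor's output. A minimal sketch (names are illustrative):

```python
def run_inception_network(input_image, inception_modules):
    """InputInception_1 = InputImage; InputInception_m = OutputInception_{m-1}."""
    x = input_image
    for module in inception_modules:  # M cascaded Inception modules
        x = module(x)
    return x  # output of the Inception network model
```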
  • Step S402: Use CN convolution kernels of different scales to perform convolution operations on the input data of the mth Inception module, and extract the feature data of that input data at each of the different scales.
  • CN is an integer greater than 1, and its value can be set according to actual conditions. In this embodiment, convolution kernels at the three scales 1×1, 3×3, and 5×5 are preferably used to extract the feature data of the input data at the different scales.
  • Preferably, before the 3×3 and 5×5 convolution kernels are applied to the input data of the mth Inception module, an additional 1×1 convolution kernel may first be applied; this limits the number of channels, reduces the consumption of computing resources, and lowers the computing cost.
  • Step S403: Perform pooling on the input data of the mth Inception module, and extract the pooled data of that input data.
  • The scale used for pooling can be set according to actual conditions; in this embodiment, a 3×3 scale is preferably used. The specific pooling method can also be set according to actual conditions, including but not limited to average pooling and max pooling; in this embodiment, max pooling is preferably adopted. The output obtained after the input data of the mth Inception module has been pooled is the pooled data.
  • Preferably, after pooling, an additional 1×1 convolution kernel can be applied to the pooled data, which limits the number of channels, reduces the consumption of computing resources, and lowers the computing cost.
  • Step S404: Merge the pooled data of the input data of the mth Inception module and the feature data at the different scales into the output data of the mth Inception module.
  • Note that the merge (concatenate) here is a merge along the channel axis; that is, the number of dimensions (i.e., channels) describing the image increases, while the information within each dimension does not.
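  • Putting steps S401 to S404 together, one plausible PyTorch rendering of such an Inception module is sketched below. The patent fixes the branch scales (1×1, 3×3, 5×5, and 3×3 max pooling) and the 1×1 reductions, but the channel counts chosen here are assumptions for illustration:

```python
import torch
import torch.nn as nn


class InceptionModule(nn.Module):
    """Branches at several scales plus a pooled branch, concatenated
    along the channel axis (steps S402-S404)."""

    def __init__(self, in_ch, c1=64, c3=96, c5=32, cp=32, reduce=48):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(          # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, reduce, kernel_size=1),
            nn.Conv2d(reduce, c3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(          # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, reduce, kernel_size=1),
            nn.Conv2d(reduce, c5, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(      # 3x3 max pool, then 1x1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, cp, kernel_size=1))

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x),
                    self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # S404: merge along channels
```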
  • The three-dimensional ResNet network model in this embodiment is a network model obtained by replacing all two-dimensional convolutions in a traditional ResNet network with three-dimensional convolutions, so as to learn the relationships between the frames of images.
  • The two-dimensional convolutions of a traditional ResNet model can only extract the features of each frame separately, whereas with three-dimensional convolution, consecutive multi-frame images can be convolved with multiple different convolution kernels and the convolution results summed, so that the extracted image features include the temporal correlations between the frames.
  • The three-dimensional ResNet network model is composed of R cascaded three-dimensional convolution modules, where R is an integer greater than 1 and its value can be set according to actual conditions; for example, it can be set to 10, 15, 20, or other values, and in this embodiment it is preferably set to 17.
  • the data processing process of the rth three-dimensional convolution module may include the steps shown in FIG. 5:
  • Step S501 Obtain input data of the r-th three-dimensional convolution module.
  • Specifically, the input data of the rth three-dimensional convolution module can be obtained according to the following formula:

    InputConv_r = InputResNet       if r = 1
    InputConv_r = OutputConv_{r-1}  if 1 < r ≤ R

  • Here 1 ≤ r ≤ R, InputResNet is the input data of the three-dimensional ResNet network model (i.e., the output data of the Inception network model), OutputConv_{r-1} is the output data of the (r-1)th three-dimensional convolution module, and InputConv_r is the input data of the rth three-dimensional convolution module.
  • Step S502 Perform a spatial convolution operation on the input data of the r-th three-dimensional convolution module, and extract the spatial feature data of the input data of the r-th three-dimensional convolution module.
  • Step S503 Perform a temporal convolution operation on the spatial feature data of the input data of the rth three-dimensional convolution module, and extract the temporal and spatial feature data of the input data of the rth three-dimensional convolution module.
  • In this way, the three-dimensional convolution operation in each three-dimensional convolution module is decomposed into one spatial convolution operation and one temporal convolution operation.
  • As shown in FIG. 6, taking the case of 1 channel and 1 convolution kernel as an example, a t×d×d three-dimensional convolution operation is decomposed into a 1×d×d spatial convolution operation and a t×1×1 temporal convolution operation.
  • Here 1×d×d is simply the two-dimensional convolution d×d, i.e., the convolution that extracts spatial features, while t×1×1 is the convolution that extracts temporal frame features: t represents the time dimension, and 1×1 is a two-dimensional convolution covering a single spatial position (1×1 = 1), so it learns no spatial information and only learns information along the time dimension. Because plain three-dimensional convolution mixes spatial and temporal information together and is harder to optimize, processing them separately through this spatiotemporal decomposition makes optimization easier and lowers the overall loss of the whole model.
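  • A minimal PyTorch sketch of this factorization is shown below: a t×d×d 3D convolution is replaced by a 1×d×d spatial convolution followed by a t×1×1 temporal convolution. The module name and channel handling are illustrative, and the residual connection of the ResNet blocks is omitted for brevity:

```python
import torch.nn as nn


class SpatioTemporalConv(nn.Module):
    """Decomposes a t x d x d 3D convolution (steps S502-S503) into a
    spatial 1 x d x d convolution and a temporal t x 1 x 1 convolution.
    Tensors are shaped (batch, channels, time, height, width)."""

    def __init__(self, in_ch, out_ch, t=3, d=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch,
                                 kernel_size=(1, d, d),
                                 padding=(0, d // 2, d // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch,
                                  kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0))

    def forward(self, x):
        return self.temporal(self.spatial(x))  # S502, then S503
```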
  • After processing by the above deep learning network model, the final output is the behavior recognition result corresponding to the football match video.
  • Preferably, before the deep learning network model is used, it can be trained in advance on a large number of samples.
  • Specifically, a training sample set is first obtained from a preset database. The training sample set includes a number of training samples, and each training sample includes a sample input image extracted from a football match video together with the behavior category corresponding to that sample input image.
  • In this embodiment, the behavior categories include, but are not limited to, corner kicks, free kicks, throw-ins, goal kicks, kick-offs, passing, dribbling, shooting, goals, tackles, fouls, and so on. A large number of football match videos need to be collected in advance; football match videos of each of the above behavior categories are selected from them and their input images, here called sample input images, are extracted. The specific extraction process is the same as step S102 described above; for details, refer to the detailed description of step S102, which is not repeated here.
  • The training sample set is then used to train the deep learning network model. During training, the sample input image of each training sample is used as the input, and the behavior category corresponding to that sample input image is used as the target output.
  • It should be noted that in football match videos some behavior categories (such as dribbling and passing) have many training samples while others (such as shooting and fouls) have relatively few, which causes an imbalance of behavior categories during training. To avoid this, the loss function can be improved: the Cross Entropy Loss commonly used in existing networks is replaced by the Focal Loss function shown below:

    FL(p_t) = -(1 - p_t)^γ · log(p_t)

  • Here p_t is the exponential of the cross-entropy loss function (i.e., the model's predicted probability for the true class), and γ is a preset scale coefficient used to balance the proportions of the behavior categories; its value can be set according to the actual situation, for example 0.1, 0.2, 0.3, or other values, and in this embodiment it is preferably set to 0.1.
  • FL(p_t) is the loss function. It can be seen from this formula that the easier a sample is to classify, the larger p_t is and the smaller its contribution to the loss; relatively speaking, the weight of hard-to-classify samples becomes larger, which solves the problems of imbalanced behavior categories and large differences in classification difficulty among training samples in football match videos and improves the accuracy for behavior categories that are harder to classify.
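  • A minimal PyTorch sketch of this Focal Loss, written for this edit under the formula above (with p_t recovered as exp(-CE)):

```python
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 0.1) -> torch.Tensor:
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t = exp(-CE) is the
    predicted probability of the true behavior category."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log(p_t)
    p_t = torch.exp(-ce)
    return ((1.0 - p_t) ** gamma * ce).mean()
```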
  • In summary, through the embodiments of the present application there is no need to analyze the entire football match video; one frame of image is extracted from each video segment as the object of analysis, which greatly reduces the computational complexity and the time consumed by behavior recognition. A deep learning network model composed of an Inception network model and a three-dimensional ResNet network model in cascade is used to perform behavior recognition: the Inception network model learns the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model learns the relationships between the frames, which greatly simplifies the behavior recognition process, reduces the accuracy loss caused by stacking multiple complex steps in the prior art, and, while further reducing the time consumed, also improves the accuracy of the final recognition result.
  • FIG. 7 shows a structural diagram of an embodiment of a football game behavior recognition device provided in an embodiment of the present application.
  • a football match behavior recognition device may include:
  • the video acquisition module 701 is used to acquire a football match video to be recognized
  • the input image extraction module 702 is configured to divide the football game video into N video segments, and extract one frame of image from each video segment as an input image, where N is an integer greater than 1;
  • the behavior recognition module 703 is configured to process the input image using a preset deep learning network model to obtain a behavior recognition result corresponding to the football game video.
  • Further, the Inception network model is composed of M cascaded Inception modules, and the behavior recognition module may include:
  • the first input data obtaining unit is used to obtain the input data of the m-th Inception module
  • the convolution operation unit is used to perform convolution operations on the input data of the m-th Inception module by using CN convolution kernels of different scales to extract the characteristic data of the input data of the m-th Inception module at different scales, CN is an integer greater than 1;
  • the pooling processing unit is used to pool the input data of the m-th Inception module, and extract the pooled data of the input data of the m-th Inception module;
  • the data merging unit is used to merge the pooled data of the input data of the m-th Inception module and the characteristic data at different scales into the output data of the m-th Inception module.
  • the three-dimensional ResNet network model is composed of R three-dimensional convolution modules cascaded, and the behavior recognition module may further include:
  • the second input data obtaining unit is used to obtain the input data of the rth three-dimensional convolution module
  • the spatial convolution unit is used to perform a spatial convolution operation on the input data of the rth three-dimensional convolution module, and extract the spatial feature data of the input data of the rth three-dimensional convolution module;
  • the time convolution unit is used to perform a time convolution operation on the spatial feature data of the input data of the rth three-dimensional convolution module, and extract the temporal and spatial feature data of the input data of the rth three-dimensional convolution module.
  • the input image extraction module may include:
  • the first extraction unit is configured to randomly extract a frame of image from the first video segment of the football game video as the first frame of input image;
  • The image similarity calculation unit is used to separately calculate the image similarity between each frame of image in the nth video segment of the football match video and the (n-1)th input image, 2 ≤ n ≤ N;
  • The second extraction unit is used to select, from the frames of the nth video segment, the frame of image with the smallest image similarity to the (n-1)th input image as the nth input image.
  • Further, the football match behavior recognition device may further include:
  • The training sample set selection module is used to obtain a training sample set from a preset database, where the training sample set includes a number of training samples, and each training sample includes a sample input image extracted from a football match video and the behavior category corresponding to that sample input image;
  • the model training module is used to train the deep learning network model using the training sample set.
  • During training, the sample input image of each training sample is used as the input, the behavior category corresponding to that sample input image is used as the target output, and the loss function shown in the following formula is used:

    Loss(p_t) = -(1 - p_t)^γ · log(p_t)

  • Here p_t is the exponential of the cross-entropy loss function, γ is a preset scale coefficient, and Loss(p_t) is the loss function.
  • FIG. 8 shows a schematic block diagram of a terminal device provided by an embodiment of the present application. For ease of description, only parts related to the embodiment of the present application are shown.
  • the terminal device 8 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • The terminal device 8 may include: a processor 80, a memory 81, and computer-readable instructions 82 stored in the memory 81 and executable on the processor 80, such as computer-readable instructions for performing the aforementioned deep-learning-based football match behavior recognition method.
  • the processor 80 executes the computer-readable instructions 82, the steps in the above-mentioned embodiments of the method for recognizing football game behavior based on deep learning are implemented.
  • The computer-readable instructions 82 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 so as to complete the present invention.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 82 in the terminal device 8.
  • the processor 80 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (ASICs), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or memory of the terminal device 8.
  • the memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk equipped on the terminal device 8, a Smart Media Card (SMC), or a Secure Digital (SD) Card, Flash Card, etc. Further, the memory 81 may also include both an internal storage unit of the terminal device 8 and an external storage device.
  • the memory 81 is used to store the computer-readable instructions and other instructions and data required by the terminal device 8.
  • the memory 81 can also be used to temporarily store data that has been output or will be output.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can also be completed by instructing the relevant hardware through computer-readable instructions.
  • The computer-readable instructions can be stored in a non-volatile computer-readable storage medium.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Abstract

A deep-learning-based football match behavior recognition method, comprising: obtaining a football match video to be recognized (S101); dividing the football match video into N video segments and extracting one frame of image from each video segment as an input image (S102), N being an integer greater than 1; and processing the input images with a preset deep learning network model to obtain a behavior recognition result corresponding to the football match video (S103). The method uses an Inception network model to learn the relationships between the pixels within each frame of the input images and a three-dimensional ResNet network model to learn the relationships between the frames of the input images, which greatly simplifies the behavior recognition process, reduces the accuracy loss caused by stacking multiple complex steps in the prior art, and improves the accuracy of the recognition result while reducing the time consumed.

Description

Deep-learning-based football match behavior recognition method, device, and terminal equipment
This application claims priority to the Chinese patent application No. 201910562902.7, filed with the Chinese Patent Office on June 26, 2019 and entitled "Deep-learning-based football match behavior recognition method, device, and terminal equipment", the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of computer technology, and in particular relates to a deep-learning-based football match behavior recognition method and device, a computer non-volatile readable storage medium, and terminal equipment.
Background
The goal of behavior recognition is to recognize common human behaviors occurring in real life. Accurate behavior recognition is challenging because human behavior is complex and highly diverse. Player behavior in football match videos is a planned, highly collaborative multi-player (multi-agent) team behavior. When football match behavior is recognized in the prior art, the entire process is extremely complicated and therefore extremely time-consuming. Moreover, because each step introduces a certain accuracy deviation, stacking multiple complex steps together results in low accuracy of the final recognition result.
Technical Problem
In view of this, the embodiments of the present application provide a deep-learning-based football match behavior recognition method and device, a computer non-volatile readable storage medium, and terminal equipment, so as to solve the problems that existing football match behavior recognition methods are extremely time-consuming and have low accuracy.
Technical Solution
A first aspect of the embodiments of the present application provides a deep-learning-based football match behavior recognition method, which may include:
obtaining a football match video to be recognized;
dividing the football match video into N video segments, and extracting one frame of image from each video segment as an input image, N being an integer greater than 1;
processing the input images using a preset deep learning network model to obtain a behavior recognition result corresponding to the football match video, where the deep learning network model is composed of an Inception network model and a three-dimensional ResNet network model connected in cascade, the Inception network model is used to learn the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model is used to learn the relationships between the frames of the input images.
A second aspect of the embodiments of the present application provides a football match behavior recognition device, which may include modules for implementing the steps of the above football match behavior recognition method.
A third aspect of the embodiments of the present application provides a computer non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above football match behavior recognition method.
A fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the above football match behavior recognition method when executing the computer-readable instructions.
Beneficial Effects
With the embodiments of the present application, there is no need to analyze the entire football match video; instead, one frame of image is extracted from each video segment as the object of analysis, which greatly reduces the computational complexity and the time consumed by the behavior recognition process. Moreover, this embodiment performs behavior recognition with a deep learning network model composed of an Inception network model and a three-dimensional ResNet network model in cascade: the Inception network model learns the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model learns the relationships between the frames of the input images, which greatly simplifies the behavior recognition process, reduces the accuracy loss caused by stacking multiple complex steps in the prior art, and improves the accuracy of the final recognition result while further reducing the time consumed.
Brief Description of the Drawings
FIG. 1 is a flowchart of an embodiment of a deep-learning-based football match behavior recognition method in an embodiment of the present application;
FIG. 2 is a schematic flowchart of extracting one frame of image from each video segment as an input image;
FIG. 3 is a schematic diagram of the Inception network model;
FIG. 4 is a schematic flowchart of the data processing procedure of an Inception module;
FIG. 5 is a schematic flowchart of the data processing procedure of a three-dimensional convolution module;
FIG. 6 is a schematic diagram of decomposing the three-dimensional convolution operation in each three-dimensional convolution module into one spatial convolution operation and one temporal convolution operation;
FIG. 7 is a structural diagram of an embodiment of a football match behavior recognition device in an embodiment of the present application;
FIG. 8 is a schematic block diagram of a terminal device in an embodiment of the present application.
Embodiments of the Present Invention
Referring to FIG. 1, an embodiment of a deep-learning-based football match behavior recognition method in an embodiment of the present application may include:
Step S101: Obtain a football match video to be recognized.
The football match video may be a video captured in real time by the user through the camera of a terminal device such as a mobile phone or tablet computer. In one specific usage scenario of this embodiment, when the user wants to perform football match behavior recognition directly, he or she can turn on the behavior recognition mode of the terminal device by clicking a specific physical or virtual button before shooting the video; in this mode, the terminal device automatically processes every video shot by the user according to steps S102 and S103 to obtain the behavior recognition result.
The football match video may also be a video already stored in the terminal device, or a video obtained by the terminal device from a cloud server or another terminal device via a network. In another specific usage scenario of this embodiment, when the user wants to perform behavior recognition on one or more existing football match videos, he or she can turn on the behavior recognition mode of the terminal device by clicking a specific physical or virtual button and select these football match videos (the order of clicking the button and selecting the videos is interchangeable, i.e., the videos may also be selected first and the behavior recognition mode turned on afterwards); the terminal device then automatically processes these football match videos according to steps S102 and S103 to obtain the behavior recognition results.
Step S102: Divide the football match video into N video segments, and extract one frame of image from each video segment as an input image.
N is an integer greater than 1, and its specific value can be set according to the actual situation; for example, it can be set to 3, 5, 10, 20, or other values. In this embodiment it is preferably set to 5, i.e., the football match video is divided into 5 video segments. The numbers of frames contained in the video segments may be the same or different, which is not specifically limited in this embodiment.
In this embodiment, one frame of image needs to be extracted from each of the N video segments as an input image. In one specific implementation, for each video segment of the football match video, one frame of image may be randomly extracted from that segment as the input image of that segment. For example, one frame of image is randomly extracted from the 1st video segment of the football match video as the 1st input image, one frame of image is randomly extracted from the 2nd video segment as the 2nd input image, one frame of image is randomly extracted from the 3rd video segment as the 3rd input image, and so on, until one frame of image is randomly extracted from the Nth video segment of the football match video as the Nth input image.
In another specific implementation of this embodiment, the input images may also be extracted through the process shown in FIG. 2:
Step S1021: Randomly extract one frame of image from the 1st video segment of the football match video as the 1st input image.
Step S1022: Separately calculate the image similarity between each frame of image in the nth video segment of the football match video and the (n-1)th input image.
Here, 2 ≤ n ≤ N. Specifically, the feature vector of each frame of image in the nth video segment of the football match video and the feature vector of the (n-1)th input image may first be calculated separately.
In this embodiment, the feature vector of any image can be calculated with the Local Binary Patterns (LBP) algorithm. Specifically, a measure of the relationship between a pixel and its surrounding pixels is constructed: for each pixel of the image, the gray value of the pixel is converted into an eight-bit binary sequence by comparing each pixel in the neighborhood centered on it with the center pixel. Taking the pixel value of the center point as the threshold, a neighboring point is binarized to 0 if its pixel value is less than that of the center point, and to 1 otherwise; the 0/1 sequence obtained by binarization is treated as an 8-bit binary number, and converting this binary number to decimal gives the LBP value at the center point. After the LBP value of every pixel of the image has been calculated, the statistical histogram of the LBP feature spectrum is determined as the feature vector of the image.
Because each point is quantized using the relationship between it and its surrounding points, the influence of illumination on the image can be eliminated more effectively after quantization: as long as a change in illumination is not large enough to change the ordering of the pixel values of two points, the LBP value does not change, which ensures the accuracy of the extracted feature information.
After the feature vectors have been calculated, the image similarity between each frame of image in the nth video segment of the football match video and the (n-1)th input image can be calculated according to the following formula:
Figure PCTCN2019103168-appb-000001
where f is the index of a frame of image in the nth video segment, 1 ≤ f ≤ F_n; F_n is the number of frames in the nth video segment; the feature vector of the fth frame of image in the nth video segment is CharVecX_f = (VecX_{f,1}, VecX_{f,2}, ..., VecX_{f,d}, ..., VecX_{f,Dim}); the feature vector of the (n-1)th input image is CharVecY = (VecY_1, VecY_2, ..., VecY_d, ..., VecY_Dim); d is the dimension index of the feature vectors, 1 ≤ d ≤ Dim; Dim is the number of dimensions of the feature vectors; VecX_{f,d} is the component of the feature vector of the fth frame of image of the nth video segment in the dth dimension; VecY_d is the component of the feature vector of the (n-1)th input image in the dth dimension; and SimDeg_f is the image similarity between the fth frame of image of the nth video segment and the (n-1)th input image.
Step S1023: From the frames of the nth video segment, select the frame of image with the smallest image similarity to the (n-1)th input image as the nth input image.
In this way, a certain difference is maintained between any two consecutive input images, which increases the amount of information contained in the input images and helps improve the accuracy of the final recognition result.
Step S103: Process the input images using a preset deep learning network model to obtain a behavior recognition result corresponding to the football match video.
The deep learning network model is composed of an Inception network model and a three-dimensional ResNet network model connected in cascade; the Inception network model is used to learn the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model is used to learn the relationships between the frames of the input images.
The Inception network model is composed of M cascaded Inception modules as shown in FIG. 3. M is an integer greater than 1, and its value can be set according to the actual situation; for example, it can be set to 2, 3, 5, 10, or other values, and in this embodiment it is preferably set to 3. The data processing procedure of the mth Inception module may include the steps shown in FIG. 4:
Step S401: Obtain the input data of the mth Inception module.
Specifically, the input data of the mth Inception module can be obtained according to the following formula:
    InputInception_m = InputImage             if m = 1
    InputInception_m = OutputInception_{m-1}  if 1 < m ≤ M
where 1 ≤ m ≤ M, InputImage is the input image, OutputInception_{m-1} is the output data of the (m-1)th Inception module, and InputInception_m is the input data of the mth Inception module.
Step S402: Use CN convolution kernels of different scales to perform convolution operations on the input data of the mth Inception module, and extract the feature data of the input data of the mth Inception module at the different scales.
CN is an integer greater than 1, and its value can be set according to the actual situation. In this embodiment, convolution kernels at the three scales 1×1, 3×3, and 5×5 are preferably used to convolve the input data of the mth Inception module and extract its feature data at the different scales. Preferably, before the 3×3 and 5×5 convolution kernels are applied to the input data of the mth Inception module, an additional 1×1 convolution kernel may first be applied; this limits the number of channels, reduces the consumption of computing resources, and lowers the computing cost.
Step S403: Perform pooling on the input data of the mth Inception module, and extract the pooled data of the input data of the mth Inception module.
The scale used for pooling can be set according to the actual situation; in this embodiment, a 3×3 scale is preferably used. The specific pooling method can also be set according to the actual situation, including but not limited to average pooling and max pooling; in this embodiment, max pooling is preferably adopted. The output obtained after the input data of the mth Inception module has been pooled is the pooled data. Preferably, after pooling, an additional 1×1 convolution kernel may be applied to the pooled data, which limits the number of channels, reduces the consumption of computing resources, and lowers the computing cost.
Step S404: Merge the pooled data of the input data of the mth Inception module and the feature data at the different scales into the output data of the mth Inception module.
Note that the merge (concatenate) here is a merge of the channel counts of the individual data; that is, the number of dimensions (i.e., channels) describing the image increases, while the information within each dimension does not.
The three-dimensional ResNet network model in this embodiment is a network model obtained by replacing all two-dimensional convolutions in a traditional ResNet network with three-dimensional convolutions, so as to learn the relationships between the frames of images. The two-dimensional convolutions of a traditional ResNet model can only extract the features of each frame of image separately, whereas with three-dimensional convolution, consecutive multi-frame images can be convolved with multiple different convolution kernels and the convolution results summed, so that the extracted image features include the temporal correlations between the frames.
The three-dimensional ResNet network model is composed of R cascaded three-dimensional convolution modules. R is an integer greater than 1, and its value can be set according to the actual situation; for example, it can be set to 10, 15, 20, or other values, and in this embodiment it is preferably set to 17. The data processing procedure of the rth three-dimensional convolution module may include the steps shown in FIG. 5:
Step S501: Obtain the input data of the rth three-dimensional convolution module.
Specifically, the input data of the rth three-dimensional convolution module can be obtained according to the following formula:
    InputConv_r = InputResNet       if r = 1
    InputConv_r = OutputConv_{r-1}  if 1 < r ≤ R
where 1 ≤ r ≤ R, InputResNet is the input data of the three-dimensional ResNet network model, i.e., the output data of the Inception network model, OutputConv_{r-1} is the output data of the (r-1)th three-dimensional convolution module, and InputConv_r is the input data of the rth three-dimensional convolution module.
Step S502: Perform a spatial convolution operation on the input data of the rth three-dimensional convolution module, and extract the spatial feature data of the input data of the rth three-dimensional convolution module.
Step S503: Perform a temporal convolution operation on the spatial feature data of the input data of the rth three-dimensional convolution module, and extract the spatiotemporal feature data of the input data of the rth three-dimensional convolution module.
In this way, the three-dimensional convolution operation in each three-dimensional convolution module is decomposed into one spatial convolution operation and one temporal convolution operation. As shown in FIG. 6, taking the case of 1 channel and 1 convolution kernel as an example, a t×d×d three-dimensional convolution operation is decomposed into a 1×d×d spatial convolution operation and a t×1×1 temporal convolution operation. Here 1×d×d is simply the two-dimensional convolution d×d, i.e., the convolution that extracts spatial features, while t×1×1 is the convolution that extracts temporal frame features: t represents the time dimension, and 1×1 is a two-dimensional convolution with 1×1 = 1, i.e., it learns no spatial information and only learns information along the time dimension. Because three-dimensional convolution mixes spatial and temporal information together and is therefore difficult to optimize, processing the spatial and temporal information separately through the above spatiotemporal decomposition makes optimization easier, and the overall loss of the whole model is lower.
After the processing of the above deep learning network model, the final output is the behavior recognition result corresponding to the football match video.
Preferably, before the deep learning network model is used, it can be trained in advance on a large number of samples.
Specifically, a training sample set is first obtained from a preset database.
The training sample set includes a number of training samples, and each training sample includes a sample input image extracted from a football match video and the behavior category corresponding to that sample input image.
In this embodiment, the behavior categories include, but are not limited to, corner kicks, free kicks, throw-ins, goal kicks, kick-offs, passing, dribbling, shooting, goals, tackles, fouls, and so on. A large number of football match videos need to be collected in advance; football match videos of each of the above behavior categories are selected from them, and their input images, here called sample input images, are extracted. The specific extraction process is the same as step S102 described above; for details, refer to the detailed description of step S102, which is not repeated here.
Then, the training sample set is used to train the deep learning network model. During training, the sample input image of each training sample is used as the input, and the behavior category corresponding to that sample input image is used as the target output.
It should be noted in particular that in football match videos some behavior categories (such as dribbling and passing) have many training samples, while other behavior categories (such as shooting and fouls) have relatively few, which causes an imbalance of behavior categories when the training sample set is used to train the deep learning network model. To avoid this problem, this embodiment improves the loss function: the Cross Entropy Loss commonly used in existing networks is replaced by the Focal Loss function shown below:
FL(p_t) = -(1 - p_t)^γ · log(p_t)
where p_t is the exponential of the cross-entropy loss function, and γ is a preset scale coefficient used to balance the proportions of the behavior categories; its value can be set according to the actual situation, for example 0.1, 0.2, 0.3, or other values, and in this embodiment it is preferably set to 0.1. FL(p_t) is the loss function. It can be seen from this formula that the easier a sample is to classify, the larger p_t is and the smaller its contribution to the loss; relatively speaking, the weight of hard-to-classify samples therefore becomes larger, which nicely solves the problems of imbalanced behavior categories and large differences in classification difficulty among training samples in football match videos and improves the accuracy for behavior categories that are harder to classify.
In summary, with the embodiments of the present application there is no need to analyze the entire football match video; instead, one frame of image is extracted from each video segment as the object of analysis, which greatly reduces the computational complexity and the time consumed by the behavior recognition process. Moreover, this embodiment performs behavior recognition with a deep learning network model composed of an Inception network model and a three-dimensional ResNet network model in cascade: the Inception network model learns the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model learns the relationships between the frames of the input images, which greatly simplifies the behavior recognition process, reduces the accuracy loss caused by stacking multiple complex steps in the prior art, and improves the accuracy of the final recognition result while further reducing the time consumed.
Corresponding to the deep-learning-based football match behavior recognition method described in the above embodiments, FIG. 7 shows a structural diagram of an embodiment of a football match behavior recognition device provided by an embodiment of the present application.
In this embodiment, a football match behavior recognition device may include:
a video acquisition module 701, used to obtain a football match video to be recognized;
an input image extraction module 702, used to divide the football match video into N video segments and extract one frame of image from each video segment as an input image, N being an integer greater than 1;
a behavior recognition module 703, used to process the input images using a preset deep learning network model to obtain a behavior recognition result corresponding to the football match video.
Further, the Inception network model is composed of M cascaded Inception modules, and the behavior recognition module may include:
a first input data acquisition unit, used to obtain the input data of the mth Inception module;
a convolution operation unit, used to perform convolution operations on the input data of the mth Inception module with CN convolution kernels of different scales and extract the feature data of the input data of the mth Inception module at the different scales, CN being an integer greater than 1;
a pooling unit, used to pool the input data of the mth Inception module and extract the pooled data of the input data of the mth Inception module;
a data merging unit, used to merge the pooled data of the input data of the mth Inception module and the feature data at the different scales into the output data of the mth Inception module.
Further, the three-dimensional ResNet network model is composed of R cascaded three-dimensional convolution modules, and the behavior recognition module may further include:
a second input data acquisition unit, used to obtain the input data of the rth three-dimensional convolution module;
a spatial convolution unit, used to perform a spatial convolution operation on the input data of the rth three-dimensional convolution module and extract the spatial feature data of the input data of the rth three-dimensional convolution module;
a temporal convolution unit, used to perform a temporal convolution operation on the spatial feature data of the input data of the rth three-dimensional convolution module and extract the spatiotemporal feature data of the input data of the rth three-dimensional convolution module.
Further, the input image extraction module may include:
a first extraction unit, used to randomly extract one frame of image from the 1st video segment of the football match video as the 1st input image;
an image similarity calculation unit, used to separately calculate the image similarity between each frame of image in the nth video segment of the football match video and the (n-1)th input image, 2 ≤ n ≤ N;
a second extraction unit, used to select, from the frames of the nth video segment, the frame of image with the smallest image similarity to the (n-1)th input image as the nth input image.
Further, the football match behavior recognition device may further include:
a training sample set selection module, used to obtain a training sample set from a preset database, where the training sample set includes a number of training samples, and each training sample includes a sample input image extracted from a football match video and the behavior category corresponding to that sample input image;
a model training module, used to train the deep learning network model with the training sample set, where during training the sample input image of each training sample is used as the input, the behavior category corresponding to that sample input image is used as the target output, and the loss function shown in the following formula is used:
Loss(p_t) = -(1 - p_t)^γ · log(p_t)
where p_t is the exponential of the cross-entropy loss function, γ is a preset scale coefficient, and Loss(p_t) is the loss function.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the device, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis; for parts that are not detailed or recorded in one embodiment, reference may be made to the relevant descriptions of the other embodiments.
FIG. 8 shows a schematic block diagram of a terminal device provided by an embodiment of the present application; for ease of description, only the parts related to the embodiment of the present application are shown.
In this embodiment, the terminal device 8 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device 8 may include: a processor 80, a memory 81, and computer-readable instructions 82 stored in the memory 81 and executable on the processor 80, such as computer-readable instructions for performing the above deep-learning-based football match behavior recognition method. When the processor 80 executes the computer-readable instructions 82, the steps in the above embodiments of the deep-learning-based football match behavior recognition method are implemented.
Exemplarily, the computer-readable instructions 82 may be divided into one or more modules/units, which are stored in the memory 81 and executed by the processor 80 so as to complete the present invention. The one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and these instruction segments are used to describe the execution process of the computer-readable instructions 82 in the terminal device 8.
The processor 80 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or memory of the terminal device 8. The memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 8. Further, the memory 81 may also include both an internal storage unit of the terminal device 8 and an external storage device. The memory 81 is used to store the computer-readable instructions and the other instructions and data required by the terminal device 8. The memory 81 may also be used to temporarily store data that has been output or is about to be output.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division of the above functional units and modules is used only as an example; in practical applications, the above functions may be allocated to different functional units and modules as needed, i.e., the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the above integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other and are not used to limit the protection scope of this application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis; for parts that are not detailed or recorded in one embodiment, reference may be made to the relevant descriptions of the other embodiments.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application may also be completed by instructing the relevant hardware through computer-readable instructions, which may be stored in a computer non-volatile readable storage medium.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be completed by instructing the relevant hardware through computer-readable instructions, which may be stored in a computer non-volatile readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included within the protection scope of the present application.

Claims (20)

  1. A deep-learning-based football match behavior recognition method, characterized in that it comprises:
    obtaining a football match video to be recognized;
    dividing the football match video into N video segments, and extracting one frame of image from each video segment as an input image, N being an integer greater than 1;
    processing the input images using a preset deep learning network model to obtain a behavior recognition result corresponding to the football match video, wherein the deep learning network model is composed of an Inception network model and a three-dimensional ResNet network model connected in cascade, the Inception network model is used to learn the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model is used to learn the relationships between the frames of the input images.
  2. The football match behavior recognition method according to claim 1, characterized in that the Inception network model is composed of M cascaded Inception modules, wherein the data processing procedure of the mth Inception module comprises:
    obtaining the input data of the mth Inception module according to the following formula:
    InputInception_m = InputImage if m = 1; InputInception_m = OutputInception_{m-1} if 1 < m ≤ M
    wherein 1 ≤ m ≤ M, M is an integer greater than 1, InputImage is the input image, OutputInception_{m-1} is the output data of the (m-1)th Inception module, and InputInception_m is the input data of the mth Inception module;
    using CN convolution kernels of different scales to perform convolution operations on the input data of the mth Inception module, and extracting the feature data of the input data of the mth Inception module at the different scales, CN being an integer greater than 1;
    pooling the input data of the mth Inception module, and extracting the pooled data of the input data of the mth Inception module;
    merging the pooled data of the input data of the mth Inception module and the feature data at the different scales into the output data of the mth Inception module.
  3. The football match behavior recognition method according to claim 1, characterized in that the three-dimensional ResNet network model is composed of R cascaded three-dimensional convolution modules, wherein the data processing procedure of the rth three-dimensional convolution module comprises:
    obtaining the input data of the rth three-dimensional convolution module according to the following formula:
    InputConv_r = InputResNet if r = 1; InputConv_r = OutputConv_{r-1} if 1 < r ≤ R
    wherein 1 ≤ r ≤ R, R is an integer greater than 1, InputResNet is the input data of the three-dimensional ResNet network model, i.e., the output data of the Inception network model, OutputConv_{r-1} is the output data of the (r-1)th three-dimensional convolution module, and InputConv_r is the input data of the rth three-dimensional convolution module;
    performing a spatial convolution operation on the input data of the rth three-dimensional convolution module, and extracting the spatial feature data of the input data of the rth three-dimensional convolution module;
    performing a temporal convolution operation on the spatial feature data of the input data of the rth three-dimensional convolution module, and extracting the spatiotemporal feature data of the input data of the rth three-dimensional convolution module.
  4. The football match behavior recognition method according to claim 1, characterized in that said extracting one frame of image from each video segment as an input image comprises:
    randomly extracting one frame of image from the 1st video segment of the football match video as the 1st input image;
    separately calculating the image similarity between each frame of image in the nth video segment of the football match video and the (n-1)th input image, 2 ≤ n ≤ N;
    selecting, from the frames of the nth video segment, the frame of image with the smallest image similarity to the (n-1)th input image as the nth input image.
  5. The football match behavior recognition method according to any one of claims 1 to 4, characterized in that the training process of the deep learning network model comprises:
    obtaining a training sample set from a preset database, wherein the training sample set includes a number of training samples, and each training sample includes a sample input image extracted from a football match video and the behavior category corresponding to the sample input image;
    using the training sample set to train the deep learning network model, wherein during training the sample input image of each training sample is used as the input, the behavior category corresponding to the sample input image is used as the target output, and the loss function shown in the following formula is used:
    Loss(p_t) = -(1 - p_t)^γ · log(p_t)
    wherein p_t is the exponential of the cross-entropy loss function, γ is a preset scale coefficient, and Loss(p_t) is the loss function.
  6. A football match behavior recognition device, characterized in that it comprises:
    a video acquisition module, used to obtain a football match video to be recognized;
    an input image extraction module, used to divide the football match video into N video segments and extract one frame of image from each video segment as an input image, N being an integer greater than 1;
    a behavior recognition module, used to process the input images using a preset deep learning network model to obtain a behavior recognition result corresponding to the football match video, wherein the deep learning network model is composed of an Inception network model and a three-dimensional ResNet network model connected in cascade, the Inception network model is used to learn the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model is used to learn the relationships between the frames of the input images.
  7. The football match behavior recognition device according to claim 6, characterized in that the Inception network model is composed of M cascaded Inception modules, and the behavior recognition module comprises:
    a first input data acquisition unit, used to obtain the input data of the mth Inception module according to the following formula:
    InputInception_m = InputImage if m = 1; InputInception_m = OutputInception_{m-1} if 1 < m ≤ M
    wherein 1 ≤ m ≤ M, M is an integer greater than 1, InputImage is the input image, OutputInception_{m-1} is the output data of the (m-1)th Inception module, and InputInception_m is the input data of the mth Inception module;
    a convolution operation unit, used to perform convolution operations on the input data of the mth Inception module with CN convolution kernels of different scales and extract the feature data of the input data of the mth Inception module at the different scales, CN being an integer greater than 1;
    a pooling unit, used to pool the input data of the mth Inception module and extract the pooled data of the input data of the mth Inception module;
    a data merging unit, used to merge the pooled data of the input data of the mth Inception module and the feature data at the different scales into the output data of the mth Inception module.
  8. The football match behavior recognition device according to claim 6, characterized in that the three-dimensional ResNet network model is composed of R cascaded three-dimensional convolution modules, and the behavior recognition module comprises:
    a second input data acquisition unit, used to obtain the input data of the rth three-dimensional convolution module according to the following formula:
    InputConv_r = InputResNet if r = 1; InputConv_r = OutputConv_{r-1} if 1 < r ≤ R
    wherein 1 ≤ r ≤ R, R is an integer greater than 1, InputResNet is the input data of the three-dimensional ResNet network model, i.e., the output data of the Inception network model, OutputConv_{r-1} is the output data of the (r-1)th three-dimensional convolution module, and InputConv_r is the input data of the rth three-dimensional convolution module;
    a spatial convolution unit, used to perform a spatial convolution operation on the input data of the rth three-dimensional convolution module and extract the spatial feature data of the input data of the rth three-dimensional convolution module;
    a temporal convolution unit, used to perform a temporal convolution operation on the spatial feature data of the input data of the rth three-dimensional convolution module and extract the spatiotemporal feature data of the input data of the rth three-dimensional convolution module.
  9. The football match behavior recognition device according to claim 6, characterized in that the input image extraction module comprises:
    a first extraction unit, used to randomly extract one frame of image from the 1st video segment of the football match video as the 1st input image;
    an image similarity calculation unit, used to separately calculate the image similarity between each frame of image in the nth video segment of the football match video and the (n-1)th input image, 2 ≤ n ≤ N;
    a second extraction unit, used to select, from the frames of the nth video segment, the frame of image with the smallest image similarity to the (n-1)th input image as the nth input image.
  10. The football match behavior recognition device according to any one of claims 6 to 9, characterized in that it further comprises:
    a training sample set selection module, used to obtain a training sample set from a preset database, wherein the training sample set includes a number of training samples, and each training sample includes a sample input image extracted from a football match video and the behavior category corresponding to the sample input image;
    a model training module, used to train the deep learning network model with the training sample set, wherein during training the sample input image of each training sample is used as the input, the behavior category corresponding to the sample input image is used as the target output, and the loss function shown in the following formula is used:
    Loss(p_t) = -(1 - p_t)^γ · log(p_t)
    wherein p_t is the exponential of the cross-entropy loss function, γ is a preset scale coefficient, and Loss(p_t) is the loss function.
  11. A computer non-volatile readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the following steps:
    obtaining a football match video to be recognized;
    dividing the football match video into N video segments, and extracting one frame of image from each video segment as an input image, N being an integer greater than 1;
    processing the input images using a preset deep learning network model to obtain a behavior recognition result corresponding to the football match video, wherein the deep learning network model is composed of an Inception network model and a three-dimensional ResNet network model connected in cascade, the Inception network model is used to learn the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model is used to learn the relationships between the frames of the input images.
  12. The computer non-volatile readable storage medium according to claim 11, characterized in that the Inception network model is composed of M cascaded Inception modules, wherein the data processing procedure of the mth Inception module comprises:
    obtaining the input data of the mth Inception module according to the following formula:
    InputInception_m = InputImage if m = 1; InputInception_m = OutputInception_{m-1} if 1 < m ≤ M
    wherein 1 ≤ m ≤ M, M is an integer greater than 1, InputImage is the input image, OutputInception_{m-1} is the output data of the (m-1)th Inception module, and InputInception_m is the input data of the mth Inception module;
    using CN convolution kernels of different scales to perform convolution operations on the input data of the mth Inception module, and extracting the feature data of the input data of the mth Inception module at the different scales, CN being an integer greater than 1;
    pooling the input data of the mth Inception module, and extracting the pooled data of the input data of the mth Inception module;
    merging the pooled data of the input data of the mth Inception module and the feature data at the different scales into the output data of the mth Inception module.
  13. The computer non-volatile readable storage medium according to claim 11, characterized in that the three-dimensional ResNet network model is composed of R cascaded three-dimensional convolution modules, wherein the data processing procedure of the rth three-dimensional convolution module comprises:
    obtaining the input data of the rth three-dimensional convolution module according to the following formula:
    InputConv_r = InputResNet if r = 1; InputConv_r = OutputConv_{r-1} if 1 < r ≤ R
    wherein 1 ≤ r ≤ R, R is an integer greater than 1, InputResNet is the input data of the three-dimensional ResNet network model, i.e., the output data of the Inception network model, OutputConv_{r-1} is the output data of the (r-1)th three-dimensional convolution module, and InputConv_r is the input data of the rth three-dimensional convolution module;
    performing a spatial convolution operation on the input data of the rth three-dimensional convolution module, and extracting the spatial feature data of the input data of the rth three-dimensional convolution module;
    performing a temporal convolution operation on the spatial feature data of the input data of the rth three-dimensional convolution module, and extracting the spatiotemporal feature data of the input data of the rth three-dimensional convolution module.
  14. The computer non-volatile readable storage medium according to claim 11, characterized in that said extracting one frame of image from each video segment as an input image comprises:
    randomly extracting one frame of image from the 1st video segment of the football match video as the 1st input image;
    separately calculating the image similarity between each frame of image in the nth video segment of the football match video and the (n-1)th input image, 2 ≤ n ≤ N;
    selecting, from the frames of the nth video segment, the frame of image with the smallest image similarity to the (n-1)th input image as the nth input image.
  15. The computer non-volatile readable storage medium according to any one of claims 11 to 14, characterized in that the training process of the deep learning network model comprises:
    obtaining a training sample set from a preset database, wherein the training sample set includes a number of training samples, and each training sample includes a sample input image extracted from a football match video and the behavior category corresponding to the sample input image;
    using the training sample set to train the deep learning network model, wherein during training the sample input image of each training sample is used as the input, the behavior category corresponding to the sample input image is used as the target output, and the loss function shown in the following formula is used:
    Loss(p_t) = -(1 - p_t)^γ · log(p_t)
    wherein p_t is the exponential of the cross-entropy loss function, γ is a preset scale coefficient, and Loss(p_t) is the loss function.
  16. A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    obtaining a football match video to be recognized;
    dividing the football match video into N video segments, and extracting one frame of image from each video segment as an input image, N being an integer greater than 1;
    processing the input images using a preset deep learning network model to obtain a behavior recognition result corresponding to the football match video, wherein the deep learning network model is composed of an Inception network model and a three-dimensional ResNet network model connected in cascade, the Inception network model is used to learn the relationships between the pixels within each frame of the input images, and the three-dimensional ResNet network model is used to learn the relationships between the frames of the input images.
  17. The terminal device according to claim 16, characterized in that the Inception network model is composed of M cascaded Inception modules, wherein the data processing procedure of the mth Inception module comprises:
    obtaining the input data of the mth Inception module according to the following formula:
    InputInception_m = InputImage if m = 1; InputInception_m = OutputInception_{m-1} if 1 < m ≤ M
    wherein 1 ≤ m ≤ M, M is an integer greater than 1, InputImage is the input image, OutputInception_{m-1} is the output data of the (m-1)th Inception module, and InputInception_m is the input data of the mth Inception module;
    using CN convolution kernels of different scales to perform convolution operations on the input data of the mth Inception module, and extracting the feature data of the input data of the mth Inception module at the different scales, CN being an integer greater than 1;
    pooling the input data of the mth Inception module, and extracting the pooled data of the input data of the mth Inception module;
    merging the pooled data of the input data of the mth Inception module and the feature data at the different scales into the output data of the mth Inception module.
  18. The terminal device according to claim 16, characterized in that the three-dimensional ResNet network model is composed of R cascaded three-dimensional convolution modules, wherein the data processing procedure of the rth three-dimensional convolution module comprises:
    obtaining the input data of the rth three-dimensional convolution module according to the following formula:
    InputConv_r = InputResNet if r = 1; InputConv_r = OutputConv_{r-1} if 1 < r ≤ R
    wherein 1 ≤ r ≤ R, R is an integer greater than 1, InputResNet is the input data of the three-dimensional ResNet network model, i.e., the output data of the Inception network model, OutputConv_{r-1} is the output data of the (r-1)th three-dimensional convolution module, and InputConv_r is the input data of the rth three-dimensional convolution module;
    performing a spatial convolution operation on the input data of the rth three-dimensional convolution module, and extracting the spatial feature data of the input data of the rth three-dimensional convolution module;
    performing a temporal convolution operation on the spatial feature data of the input data of the rth three-dimensional convolution module, and extracting the spatiotemporal feature data of the input data of the rth three-dimensional convolution module.
  19. The terminal device according to claim 16, characterized in that said extracting one frame of image from each video segment as an input image comprises:
    randomly extracting one frame of image from the 1st video segment of the football match video as the 1st input image;
    separately calculating the image similarity between each frame of image in the nth video segment of the football match video and the (n-1)th input image, 2 ≤ n ≤ N;
    selecting, from the frames of the nth video segment, the frame of image with the smallest image similarity to the (n-1)th input image as the nth input image.
  20. The terminal device according to any one of claims 16 to 19, characterized in that the training process of the deep learning network model comprises:
    obtaining a training sample set from a preset database, wherein the training sample set includes a number of training samples, and each training sample includes a sample input image extracted from a football match video and the behavior category corresponding to the sample input image;
    using the training sample set to train the deep learning network model, wherein during training the sample input image of each training sample is used as the input, the behavior category corresponding to the sample input image is used as the target output, and the loss function shown in the following formula is used:
    Loss(p_t) = -(1 - p_t)^γ · log(p_t)
    wherein p_t is the exponential of the cross-entropy loss function, γ is a preset scale coefficient, and Loss(p_t) is the loss function.
PCT/CN2019/103168 2019-06-26 2019-08-29 Deep-learning-based football match behavior recognition method, device and terminal equipment WO2020258498A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910562902.7 2019-06-26
CN201910562902.7A CN110378245B (zh) 2019-06-26 2019-06-26 Deep-learning-based football match behavior recognition method, device and terminal equipment

Publications (1)

Publication Number Publication Date
WO2020258498A1 true WO2020258498A1 (zh) 2020-12-30

Family

ID=68250689

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103168 WO2020258498A1 (zh) 2019-06-26 2019-08-29 Deep-learning-based football match behavior recognition method, device and terminal equipment

Country Status (2)

Country Link
CN (1) CN110378245B (zh)
WO (1) WO2020258498A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836602A (zh) * 2021-01-21 2021-05-25 Behavior recognition method, apparatus, device and medium based on spatiotemporal feature fusion

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016461A (zh) * 2020-08-28 2020-12-01 Multi-target behavior recognition method and system
CN112446348B (zh) * 2020-12-08 2022-05-31 Behavior recognition method based on feature spectral flow
CN112580589A (zh) * 2020-12-28 2021-03-30 Behavior recognition method, medium and device based on the two-stream method considering imbalanced data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808150A (zh) * 2017-11-20 2018-03-16 Human video action recognition method, apparatus, storage medium and processor
US20180129906A1 (en) * 2016-11-07 2018-05-10 Qualcomm Incorporated Deep cross-correlation learning for object tracking
CN109492612A (zh) * 2018-11-28 2019-03-19 Fall detection method based on skeleton points and fall detection apparatus
CN109543556A (zh) * 2018-10-23 2019-03-29 Action recognition method, apparatus, medium and device
CN109871777A (zh) * 2019-01-23 2019-06-11 Behavior recognition system based on an attention mechanism

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152627B2 (en) * 2017-03-20 2018-12-11 Microsoft Technology Licensing, Llc Feature flow for video recognition
CN109753985A (zh) * 2017-11-07 2019-05-14 Video classification method and apparatus
CN108921162A (zh) * 2018-06-11 2018-11-30 License plate recognition method based on deep learning and related products

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180129906A1 (en) * 2016-11-07 2018-05-10 Qualcomm Incorporated Deep cross-correlation learning for object tracking
CN107808150A (zh) * 2017-11-20 2018-03-16 Human video action recognition method, apparatus, storage medium and processor
CN109543556A (zh) * 2018-10-23 2019-03-29 Action recognition method, apparatus, medium and device
CN109492612A (zh) * 2018-11-28 2019-03-19 Fall detection method based on skeleton points and fall detection apparatus
CN109871777A (zh) * 2019-01-23 2019-06-11 Behavior recognition system based on an attention mechanism

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836602A (zh) * 2021-01-21 2021-05-25 Behavior recognition method, apparatus, device and medium based on spatiotemporal feature fusion
CN112836602B (zh) * 2021-01-21 2024-04-05 Behavior recognition method, apparatus, device and medium based on spatiotemporal feature fusion

Also Published As

Publication number Publication date
CN110378245A (zh) 2019-10-25
CN110378245B (zh) 2023-07-21

Similar Documents

Publication Publication Date Title
WO2020258498A1 (zh) Deep-learning-based football match behavior recognition method, device and terminal equipment
WO2019100724A1 (zh) Method and apparatus for training a multi-label classification model
Zhang et al. Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network
CN107944020B (zh) Face image search method and apparatus, computer apparatus and storage medium
WO2021043168A1 (zh) Training method for a person re-identification network, and person re-identification method and apparatus
WO2020228446A1 (zh) Model training method and apparatus, terminal and storage medium
WO2019100723A1 (zh) Method and apparatus for training a multi-label classification model
Kao et al. Visual aesthetic quality assessment with a regression model
WO2020015075A1 (zh) Face image comparison method and apparatus, computer device and storage medium
WO2021022521A1 (zh) Data processing method, and method and device for training a neural network model
EP2806374B1 (en) Method and system for automatic selection of one or more image processing algorithm
WO2020252917A1 (zh) Blurred face image recognition method and apparatus, terminal device and medium
US10565713B2 (en) Image processing apparatus and method
US11455831B2 (en) Method and apparatus for face classification
WO2022042123A1 (zh) Image recognition model generation method and apparatus, computer device and storage medium
WO2021218469A1 (zh) Image data detection method and apparatus, computer device and storage medium
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN111935479A (zh) Target image determination method and apparatus, computer device and storage medium
CN112733767A (zh) Human body keypoint detection method and apparatus, storage medium and terminal device
Ji et al. Research on real-time tracking of table tennis ball based on machine learning with low-speed camera
Stojnić et al. Detection of pollen bearing honey bees in hive entrance images
CN110633630B (zh) Behavior recognition method and apparatus, and terminal device
CN110532971B (zh) Image processing and apparatus, training method, and computer-readable storage medium
CN112348008A (zh) Certificate information recognition method and apparatus, terminal device and storage medium
CN116168439A (zh) Lightweight lip-reading method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19934612

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19934612

Country of ref document: EP

Kind code of ref document: A1