CN107481263B

CN107481263B - Table tennis target tracking method, device, storage medium and computer equipment

Info

Publication number: CN107481263B
Application number: CN201710682438.6A
Authority: CN
Inventors: 任杰; 盛斌; 施之皓; 张本轩; 杨靖; 侯爽
Original assignee: Shanghai Jiaotong University; Shanghai University of Sport
Current assignee: Shanghai Jiaotong University; Shanghai University of Sport
Priority date: 2017-08-10
Filing date: 2017-08-10
Publication date: 2020-05-19
Anticipated expiration: 2037-08-10
Also published as: CN107481263A

Abstract

The invention relates to a table tennis target tracking method, a table tennis target tracking device, a storage medium and computer equipment. An image is acquired from a video containing the target and input into a preset convolutional neural network model, which processes it to obtain a bounding box of the target in the image. The bounding box of the target in the image is then input into a preset regression layer for regression processing to obtain a regressed bounding box corresponding to the target, wherein the preset regression layer comprises a low-layer convolutional layer of the preset convolutional neural network model. Lower convolutional layers contain more position information, whereas higher convolutional layers contain more semantic information, such as the class of the object. Therefore, when the bounding box obtained by the preset convolutional neural network model is input into the preset regression layer for regression processing, the semantic information of the high-layer convolutional layers and the position information of the low-layer convolutional layer can be considered at the same time, because the preset regression layer comprises the low-layer convolutional layer of the preset convolutional neural network model. The target in the input image can thus be correctly distinguished, and the bounding box of the target can be given accurately.

Description

Table tennis target tracking method, device, storage medium and computer equipment

Technical Field

The invention relates to the technical field of image processing, in particular to a table tennis target tracking method, a table tennis target tracking device, a storage medium and computer equipment.

Background

Target tracking is a classic problem in computer vision and has developed significantly over the last few decades: from the early Lucas-Kanade and mean-shift trackers, which are based on pure computer vision methods, through more complex trackers that integrate detection and learning, to today's tracking algorithms based on deep learning. The main deep learning model currently used for tracking is the convolutional neural network (CNN). In a typical CNN-based tracking algorithm, the CNN model mainly serves as a feature extractor. The bounding box obtained by current tracking algorithms is not accurate enough, and an inaccurate bounding box not only means an error in the position information but also directly causes the whole tracking frame to drift and even lose the target.

Disclosure of Invention

In view of the above, there is a need to provide a table tennis target tracking method, apparatus, storage medium and computer device that can make the resulting bounding box more accurate.

A table tennis target tracking method, the method comprising:

acquiring an image from a video containing a target;

inputting the image into a preset convolutional neural network model, and processing to obtain a bounding box of a target in the image;

inputting the bounding box of the target in the image into a preset regression layer to perform regression processing, and obtaining a regressed bounding box corresponding to the target, wherein the preset regression layer comprises a low-layer convolution layer of the preset convolution neural network model.

In one embodiment, inputting the image into the preset convolutional neural network model and processing it to obtain a bounding box of the target in the image includes:

inputting the image into a convolutional layer in the preset convolutional neural network model for a convolution operation to obtain a feature map of the image;

inputting the feature map into a pooling layer for a pooling operation to obtain a compressed feature map;

passing the compressed feature map through a fully connected layer to obtain a probability map of the image;

and obtaining the bounding box of the target in the image according to the probability map of the image.

In one embodiment, the preset regression layer includes: a fully connected layer, a region of interest pooling layer, and a low-layer convolutional layer in the preset convolutional neural network model.

In one embodiment, inputting the bounding box of the target in the image into the preset regression layer for regression processing to obtain the regressed bounding box corresponding to the target includes:

projecting the bounding box of the target in the image back onto the low-layer convolutional layer in the preset convolutional neural network model for processing to obtain a feature map;

inputting the feature map into the region of interest pooling layer for feature compression to obtain a compressed feature map;

and inputting the compressed feature map into the fully connected layer for processing to obtain the regressed bounding box corresponding to the target.

In one embodiment, the method further comprises:

acquiring a convolutional neural network training set for modeling, wherein the convolutional neural network training set comprises an image containing a target and an image not containing the target, and the image is acquired from a video containing the target;

labeling the image, setting a value inside an actual bounding box of a target in the image as a first value, and setting a value outside the actual bounding box of the target in the image as a second value;

inputting the convolutional neural network training set into a convolutional neural network with initialized network parameters for training to obtain a bounding box of the target in the image;

calculating network parameters of the modeled convolutional neural network according to the bounding box of the target in the image, the marked actual bounding box and a Softmax loss function;

and obtaining a preset convolutional neural network model according to the network parameters.

In one embodiment, the method further comprises:

obtaining a regression layer training set for modeling, wherein the regression layer training set comprises images containing targets, and the images are obtained from videos containing the targets;

labeling the image, and labeling the size of an actual bounding box of a target in the image;

inputting the regression layer training set into the preset convolutional neural network model for training to obtain a bounding box of the target in the image;

inputting the bounding box of the target in the image into a regression layer with initialized network parameters for regression processing to obtain the size of the regressed bounding box corresponding to the target;

calculating the network parameters of the modeled regression layer according to the size of the regressed bounding box corresponding to the target, the size of the labeled actual bounding box and a smooth L1 loss function;

and obtaining a preset regression layer according to the network parameters.
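The smooth L1 loss named in the regression-layer training steps above can be illustrated with a short sketch. It uses the commonly used definition with a transition point of 1 (quadratic below it, linear above it); the transition point, the plain summation over the four box values and the function names are assumptions, since the text only names the loss.

```python
from typing import Sequence


def smooth_l1(x: float) -> float:
    """Smooth L1 of a single residual: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def smooth_l1_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Sum of smooth L1 residuals between predicted and labeled box values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))


# Hypothetical example: four regressed box values against four labeled values.
loss = smooth_l1_loss([0.0, 0.0, 0.0, 0.0], [0.5, 2.0, 0.0, -3.0])
```

Compared with a squared error, the linear branch keeps large residuals from dominating the gradient, which is the usual reason for choosing this loss in bounding-box regression.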

A table tennis target tracking device, the device comprising:

the image acquisition module is used for acquiring an image from a video containing a target;

the convolutional neural network module is used for inputting the image into a preset convolutional neural network model and processing the image to obtain a bounding box of a target in the image;

and the regression layer module is used for inputting the bounding box of the target in the image into a preset regression layer to carry out regression processing so as to obtain a regressed bounding box corresponding to the target, wherein the preset regression layer comprises a low-layer convolution layer of the preset convolution neural network model.

In one embodiment, the regression layer module comprises:

the low-layer convolutional layer module is used for projecting the bounding box of the target in the image back onto the low-layer convolutional layer in the preset convolutional neural network model for processing to obtain a feature map;

the region of interest pooling layer module is used for inputting the feature map into the region of interest pooling layer for feature compression to obtain a compressed feature map;

and the fully connected layer module is used for inputting the compressed feature map into the fully connected layer for processing to obtain the regressed bounding box corresponding to the target.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

acquiring an image from a video containing a target;

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

acquiring an image from a video containing a target;

According to the table tennis target tracking method, the table tennis target tracking device, the storage medium and the computer equipment, the judgment capability of the preset convolutional neural network model on the boundary of the target in the image is poor, so that the bounding box of the target obtained through processing of the preset convolutional neural network model is not accurate enough. From the characteristics of the neural network, the lower layers of convolutional layers often contain more position information, and the higher layers of convolutional layers contain more semantic information, such as the type of object. Therefore, after the bounding box of the target obtained by the processing of the preset convolutional neural network model is input to the preset regression layer for regression processing, because the preset regression layer comprises the low-layer convolutional layer of the preset convolutional neural network model, the semantic information (target category and the like) of the high-layer convolutional layer and the position information of the low-layer convolutional layer can be considered at the same time, so that the target in the input image can be correctly distinguished, and the bounding box of the target can be accurately given.

Drawings

FIG. 1 is a diagram of the internal structure of a server in one embodiment;

FIG. 2 is a flow diagram of a table tennis target tracking method in one embodiment;

FIG. 3 is a flowchart illustrating the process of inputting the image into the predetermined convolutional neural network model to obtain the bounding box of the target in FIG. 2;

FIG. 4 is a flowchart illustrating that the bounding box of the target in the image in FIG. 2 is input into a preset regression layer to perform regression processing, and then a regression bounding box corresponding to the target is obtained;

FIG. 5 is a flow chart of training a predetermined convolutional neural network model;

FIG. 6 is a flow chart of training a pre-set regression layer;

FIG. 7 is a diagram illustrating the structure of a target tracking model in one embodiment;

FIG. 8 is a schematic diagram of a table tennis target tracking device in one embodiment;

FIG. 9 is a schematic diagram of the convolutional neural network module of FIG. 8;

FIG. 10 is a schematic structural diagram of a regression layer module shown in FIG. 8;

FIG. 11 is a schematic diagram of a table tennis target tracking device in one embodiment;

FIG. 12 is a schematic structural diagram of a table tennis target tracking device in one embodiment.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, but rather should be construed as broadly as the present invention is capable of modification in various respects, all without departing from the spirit and scope of the present invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

In one embodiment, as shown in FIG. 1, a server is also provided, comprising a processor, a non-volatile storage medium, an internal memory and a network interface connected by a system bus. The non-volatile storage medium stores an operating system and a table tennis target tracking device, which is used for performing a table tennis target tracking method. The processor provides computing and control capabilities and supports the operation of the whole server. The internal memory provides an environment for running the table tennis target tracking device stored in the non-volatile storage medium, and may store computer-readable instructions that, when executed by the processor, cause the processor to perform a table tennis target tracking method. The network interface may be used to receive video containing the tracking target, and the like.

In recent years, as computer vision technology has developed and gradually matured, specific applications of computers in the field of sports have continued to appear. In table tennis, the ball is tracked in each frame captured by the camera so that its position information is recorded. Because a table tennis ball is small, has few features and moves fast, a tracking algorithm needs to be specially designed to meet these requirements.

In one embodiment, as shown in FIG. 2, a table tennis target tracking method is provided, and more particularly a target tracking method in which the target is a table tennis ball. The method comprises the following steps:

Step 210, an image is acquired from a video containing a target.

Images are obtained frame by frame from a table tennis match video shot by a high-speed camera; some of the images contain the target, namely the table tennis ball, and some do not.

Step 220, inputting the image into a preset convolutional neural network model, and processing to obtain a bounding box of the target in the image.

A convolutional neural network (CNN) is one of the most representative network structures in deep learning. The preset convolutional neural network model is obtained by training a convolutional neural network in advance on a labeled training set. The labeled training set is a set of images obtained from table tennis match videos, comprising images that contain the target and images that do not, in which the actual bounding box of the target is labeled. The labeled table tennis images in the training set are input into the convolutional neural network model for convolution, pooling and fully connected layer processing to obtain a probability map of each image, and the bounding box of the target in the image is then obtained from the probability map. The actual bounding box is compared with the bounding box obtained for the target to get the network parameters of the modeled convolutional neural network, and the preset convolutional neural network model is then obtained according to these network parameters. The bounding box is a rectangular box enclosing the target.

An image obtained frame by frame from the table tennis match video shot by the high-speed camera is input into the preset convolutional neural network model for convolution, pooling and fully connected layer processing to obtain a probability map of the image, and the bounding box of the target in the image is then obtained from the probability map. Because the bounding box is determined by judging the boundary of the target through the probability map, and this judgment capability is poor, the obtained bounding box is not accurate enough. An inaccurate bounding box not only means an error in the position information but also directly causes the whole tracking frame to drift and even lose the target.

Step 230, inputting the bounding box of the target in the image into a preset regression layer for regression processing to obtain a regressed bounding box corresponding to the target, wherein the preset regression layer comprises a low-layer convolutional layer of the preset convolutional neural network model.

The preset regression layer includes: a fully connected layer, a region of interest pooling layer and a low-layer convolutional layer in the preset convolutional neural network model. To establish the preset regression layer, another labeled training set is first passed through the already established preset convolutional neural network model for convolution, pooling and fully connected layer processing to obtain a probability map of each image, and the bounding box of the target in the image is obtained from the probability map. This bounding box is then regarded as a candidate region, and the output bounding box is used as the input to the regression layer. Specifically, the bounding box of the target in the image is projected back onto the low-layer convolutional layer in the preset convolutional neural network model for processing to obtain a feature map; the feature map is input into the region of interest pooling layer for feature compression and then into the fully connected layer for processing to obtain the regressed bounding box corresponding to the target. The size of the regressed bounding box is compared with the size of the actual bounding box labeled in the training set to get the network parameters of the modeled regression layer, and the preset regression layer is obtained according to these network parameters.

The bounding box of the target, obtained by passing an image from the table tennis match video through the preset convolutional neural network model, is input into the preset regression layer for regression processing to obtain the regressed bounding box corresponding to the target. Specifically, the bounding box of the target in the image is projected back onto the low-layer convolutional layer in the preset convolutional neural network model for processing to obtain a feature map; the feature map is input into the region of interest pooling layer for feature compression and then into the fully connected layer for processing to obtain the regressed bounding box corresponding to the target. From the characteristics of the neural network, lower convolutional layers often contain more position information, and higher convolutional layers contain more semantic information, such as the type of object. Projecting the bounding box back onto the low-layer convolutional layer in the preset convolutional neural network model therefore takes into account the semantic information of the bounding box obtained through the high-layer convolutional layers, such as the category information of the target, and also integrates the position information of the low-layer convolutional layer. Thus, the target in the input image can be correctly distinguished and the bounding box of the target can be accurately given.

In this embodiment, the preset convolutional neural network model has poor capability of judging the boundary of the target in the image, so that the bounding box of the target obtained through processing by the preset convolutional neural network model is not accurate enough. From the characteristics of the neural network, the lower layers of convolutional layers often contain more position information, and the higher layers of convolutional layers contain more semantic information, such as the type of object. Therefore, after the bounding box of the target obtained by the processing of the preset convolutional neural network model is input to the preset regression layer for regression processing, because the preset regression layer comprises the low-layer convolutional layer of the preset convolutional neural network model, the semantic information (target category and the like) of the high-layer convolutional layer and the position information of the low-layer convolutional layer can be considered at the same time, so that the target in the input image can be correctly distinguished, and the bounding box of the target can be accurately given.
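The two-stage flow of steps 210 to 230 can be sketched as a small driver loop. This is only an illustrative outline: `coarse_model` and `regression_layer` are hypothetical callables standing in for the preset convolutional neural network model and the preset regression layer, and the convention that `coarse_model` returns `None` for a frame without a target is an assumption.

```python
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def track_frame(
    image: Any,
    coarse_model: Callable[[Any], Optional[Box]],
    regression_layer: Callable[[Any, Box], Box],
) -> Optional[Box]:
    """Step 220 followed by step 230 for a single frame."""
    coarse_box = coarse_model(image)           # bounding box from the CNN
    if coarse_box is None:                     # assumed: no target in this frame
        return None
    return regression_layer(image, coarse_box)  # regressed bounding box


def track_video(
    frames: Iterable[Any],
    coarse_model: Callable[[Any], Optional[Box]],
    regression_layer: Callable[[Any, Box], Box],
) -> Iterator[Optional[Box]]:
    """Step 210: take frames one by one and yield one result per frame."""
    for image in frames:
        yield track_frame(image, coarse_model, regression_layer)
```

In the described method the regression layer reuses the low-layer convolutional layer of the same network rather than being an independent model; the sketch separates the two stages only to make the data flow visible.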

In one embodiment, as shown in FIG. 3, inputting the image into the preset convolutional neural network model and processing it to obtain a bounding box of the target in the image includes:

Step 222, inputting the image into a convolutional layer in the preset convolutional neural network model for a convolution operation to obtain a feature map of the image.

An image of 100 × 100 pixels is input into the preset convolutional neural network model for the convolution operation. The convolutional layers in the convolutional neural network model are pre-trained using CaffeNet. The convolution operation may be performed multiple times to extract a feature map of the image.

Step 224, inputting the feature map into the pooling layer for a pooling operation to obtain the compressed feature map.

A pooling layer is arranged above the convolutional layer, and the extracted feature map of the image is input into the pooling layer for pooling, that is, feature compression, to obtain a compressed feature map. In particular, the pooling layer may be a spatial pyramid pooling layer, which retains more location information.
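To make the idea of spatial pyramid pooling concrete, the following sketch max-pools a single-channel feature map over grids of several resolutions and concatenates the results into a fixed-length vector. The pyramid levels (1, 2, 4), the use of max pooling and the single-channel input are assumptions chosen for illustration; the text does not specify them.

```python
from typing import List, Sequence


def spatial_pyramid_max_pool(
    fmap: Sequence[Sequence[float]], levels: Sequence[int] = (1, 2, 4)
) -> List[float]:
    """Max-pool fmap over an n x n grid for each n in levels and concatenate."""
    h, w = len(fmap), len(fmap[0])
    pooled: List[float] = []
    for n in levels:
        for i in range(n):
            # Bin rows: floor for the start, ceil for the end, at least one row.
            r0 = (i * h) // n
            r1 = max(r0 + 1, -(-((i + 1) * h) // n))
            for j in range(n):
                c0 = (j * w) // n
                c1 = max(c0 + 1, -(-((j + 1) * w) // n))
                pooled.append(
                    max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1))
                )
    return pooled


# A 4 x 4 map with values 0..15 yields 1 + 4 + 16 = 21 pooled values.
example = spatial_pyramid_max_pool([[r * 4 + c for c in range(4)] for r in range(4)])
```

The output length depends only on the chosen levels, not on the feature map size, while the finer grids keep coarse information about where in the map each response occurred.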

Step 226, passing the compressed feature map through a fully connected layer to obtain a probability map of the image.

The compressed feature map obtained by the pooling layer is passed through two fully connected layers, and the output 2500-dimensional vector is converted into a 50 × 50 matrix, i.e., a 50 × 50 probability map is output. Each element of the matrix is a probability value representing the probability that the pixel at the corresponding location in the input image belongs to the tracked object. For an image containing a tracked object, a connected region is typically output, and the probability values within this region are significantly higher than those outside it.
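The conversion of the 2500-dimensional output vector into a 50 × 50 probability map can be sketched as a plain reshape. Row-major ordering is an assumption, as the text does not say how the vector elements are laid out, and the function name is hypothetical.

```python
from typing import List, Sequence

MAP_SIZE = 50  # side length of the probability map stated in the text


def vector_to_probability_map(
    vec: Sequence[float], size: int = MAP_SIZE
) -> List[List[float]]:
    """Reshape a flat vector of size*size values into a size x size matrix.

    Assumes row-major order: element k goes to row k // size, column k % size.
    """
    if len(vec) != size * size:
        raise ValueError(f"expected {size * size} values, got {len(vec)}")
    return [list(vec[r * size:(r + 1) * size]) for r in range(size)]


# Hypothetical example: one high-probability element at flat index 51.
flat = [0.0] * 2500
flat[51] = 0.9
prob_map = vector_to_probability_map(flat)
```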

Step 228, obtaining a bounding box of the target in the image according to the probability map of the image.

A bounding box can be calculated by thresholding: the elements whose probability values are above a certain threshold are taken to lie within the bounding box, and this bounding box serves as a prediction of the target location.

In this embodiment, the image is input into the preset convolutional neural network model for the convolution operation, and a probability map of the image is obtained after the pooling operation and fully connected layer processing. A probability threshold is set, and locations whose probability exceeds it are determined to be likely locations of the target, so that the bounding box of the target in the image is obtained from the probability map of the image. Using the probability map to estimate the position of the target allows the rough position of the target to be obtained accurately and provides a basis for the subsequent regression processing.
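The thresholding step can be sketched as follows: every element of the probability map above the threshold is collected, and the tightest rectangle around those elements is returned. The threshold value of 0.5, the inclusive coordinate convention and the uniform scaling from the 50 × 50 map back to the 100 × 100 image are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

CellBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive cells


def bounding_box_from_probability_map(
    prob_map: Sequence[Sequence[float]], threshold: float = 0.5
) -> Optional[CellBox]:
    """Return the tightest box around all elements above threshold, or None."""
    xs, ys = [], []
    for y, row in enumerate(prob_map):
        for x, p in enumerate(row):
            if p > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:  # no element exceeds the threshold: treat as no target
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def scale_box_to_image(
    box: CellBox, map_size: int = 50, image_size: int = 100
) -> Tuple[float, float, float, float]:
    """Map inclusive cell indices to pixel coordinates, assuming uniform scaling."""
    s = image_size / map_size
    x0, y0, x1, y1 = box
    return (x0 * s, y0 * s, (x1 + 1) * s, (y1 + 1) * s)
```

A single stray element above the threshold would enlarge this box, which is consistent with the observation in the text that the box derived from the probability map is only a rough estimate.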

In one embodiment, the preset regression layer includes: a fully connected layer, a region of interest pooling layer and a low-layer convolutional layer in the preset convolutional neural network model.

In this embodiment, the preset regression layer includes a low-layer convolutional layer in the preset convolutional neural network model. The low-layer convolutional layer in the preset convolutional neural network model is appended directly after the CNN model, and its network parameters are not changed. Because the preset regression layer comprises the low-layer convolutional layer of the preset convolutional neural network model, the embodiment can simultaneously consider the semantic information (object type and the like) of the high-layer convolutional layers and the position information of the low-layer convolutional layer, so that the target in the input image can be correctly distinguished, and the bounding box of the target can be accurately given.

In an embodiment, as shown in FIG. 4, inputting the bounding box of the target in the image into the preset regression layer for regression processing to obtain the regressed bounding box corresponding to the target includes:

Step 232, projecting the bounding box of the target in the image back onto the low-layer convolutional layer in the preset convolutional neural network model for processing to obtain a feature map.

The regression layer follows the preset convolutional neural network model. From bottom to top, the regression layer consists of the low-layer convolutional layer in the preset convolutional neural network model, the region of interest pooling layer and the fully connected layer. The bounding box of the target, obtained by passing a table tennis image from the match video through the preset convolutional neural network model, is projected onto the low-layer convolutional layer in the preset convolutional neural network model for convolution processing to obtain a feature map.
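One simple way to picture the projection of an image-space bounding box onto a convolutional feature map is to divide its coordinates by the layer's stride. The sketch below does this with outward rounding and clamping. The stride of 4 and the 23 × 23 feature map size used in the example are illustrative assumptions (the text does not give the layer parameters), and the sketch ignores kernel-size offsets.

```python
import math
from typing import Tuple


def project_box_to_feature_map(
    box: Tuple[float, float, float, float],
    stride: int,
    fmap_w: int,
    fmap_h: int,
) -> Tuple[int, int, int, int]:
    """Map an image box (x0, y0, x1, y1) to half-open feature-map cell ranges.

    Rounds outward so the projected region covers the box, clamps to the map,
    and guarantees at least one cell in each direction.
    """
    x0, y0, x1, y1 = box
    fx0 = min(max(0, math.floor(x0 / stride)), fmap_w - 1)
    fy0 = min(max(0, math.floor(y0 / stride)), fmap_h - 1)
    fx1 = min(fmap_w, max(fx0 + 1, math.ceil(x1 / stride)))
    fy1 = min(fmap_h, max(fy0 + 1, math.ceil(y1 / stride)))
    return fx0, fy0, fx1, fy1


# Hypothetical example: a 20 x 20 pixel box in a 100 x 100 image.
roi = project_box_to_feature_map((40, 40, 60, 60), stride=4, fmap_w=23, fmap_h=23)
```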

Step 234, inputting the feature map into the region of interest pooling layer for feature compression to obtain a compressed feature map.

The region of interest pooling layer crops only the region of interest from the feature map for pooling and feature compression; in the embodiment of the invention the region of interest is the bounding box obtained in the previous step. That is, feature compression is performed on the bounding box obtained in the previous step to obtain a compressed feature map. Specifically, the bounding box is cropped from the feature map of the low-layer convolutional layer and scaled to a new 7 × 7 feature map.
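The crop-and-scale behaviour of region of interest pooling can be sketched on a single-channel feature map: the region is divided into a 7 × 7 grid of bins and each bin is reduced to one value. Max pooling within each bin, the half-open region convention and the single channel are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def roi_max_pool(
    fmap: Sequence[Sequence[float]],
    roi: Tuple[int, int, int, int],
    out_size: int = 7,
) -> List[List[float]]:
    """Crop roi = (x0, y0, x1, y1), half-open, and max-pool it to out_size x out_size."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        # Bin rows inside the region; at least one row per bin, so small
        # regions are effectively upsampled by repeating cells.
        r_off = (i * h) // out_size
        r0 = y0 + r_off
        r1 = y0 + max(r_off + 1, -(-((i + 1) * h) // out_size))
        row = []
        for j in range(out_size):
            c_off = (j * w) // out_size
            c0 = x0 + c_off
            c1 = x0 + max(c_off + 1, -(-((j + 1) * w) // out_size))
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical 14 x 14 feature map whose value encodes its position.
feature_map = [[r * 14 + c for c in range(14)] for r in range(14)]
full = roi_max_pool(feature_map, (0, 0, 14, 14))
small = roi_max_pool(feature_map, (2, 3, 5, 6))
```

Whatever the size of the region, the output is always 7 × 7, which gives the following fully connected layer an input of fixed length.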

Step 236, inputting the compressed feature map into a fully connected layer for processing to obtain the regressed bounding box corresponding to the target.

The compressed feature map is input into the fully connected layer for processing, which calculates the displacement in the x and y directions and the scaling of the length and width between the bounding box calculated by the CNN and the regressed bounding box. The regressed bounding box is then obtained from the bounding box calculated by the CNN, the displacement in the x and y directions, and the scaling of the length and width. Specifically, a fully connected layer is added on top of the feature map; considering that the position precision of the convolutional layer must not be too low, the embodiment of the invention chooses to crop on conv-1 (the first convolutional layer). The output of the fully connected layer is 4 real numbers, which represent the displacement in the x and y directions and the scaling of the length and width between the regressed bounding box and the bounding box calculated by the CNN.

In this embodiment, after the bounding box of the target obtained through the processing of the preset convolutional neural network model is input into the preset regression layer for regression processing, the semantic information (target type and the like) of the high-layer convolutional layers and the position information of the low-layer convolutional layer can be considered at the same time, because the preset regression layer includes the low-layer convolutional layer of the preset convolutional neural network model, so that the target in the input image can be correctly identified and the bounding box of the target can be accurately given. Finally, the regression layer calculates the displacement in the x and y directions and the scaling of the length and width between the regressed bounding box and the bounding box calculated by the CNN. The bounding box calculated by the CNN is thereby corrected, the regressed bounding box is more accurate, and the whole tracking frame is effectively prevented from drifting or even losing the target.
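How four regressed numbers can correct a bounding box is sketched below. The parameterization is an assumption borrowed from common bounding-box regression practice (shifts expressed as fractions of the box width and height, scalings expressed as logarithms); the text only states that the four outputs represent displacement in x and y and scaling of the length and width.

```python
import math
from typing import Tuple

CenterBox = Tuple[float, float, float, float]  # (center_x, center_y, width, height)


def apply_box_regression(
    box: CenterBox, deltas: Tuple[float, float, float, float]
) -> CenterBox:
    """Correct a coarse box with regressed (dx, dy, dw, dh).

    Assumed parameterization: the center moves by dx*width and dy*height,
    and width and height are multiplied by exp(dw) and exp(dh).
    """
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


# Hypothetical example: shift right by 10% of the width and double the height.
refined = apply_box_regression((50.0, 50.0, 10.0, 20.0), (0.1, 0.0, 0.0, math.log(2.0)))
```

With all four values equal to zero the coarse box is returned unchanged, so the regression layer only has to learn a correction.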

In one embodiment, as shown in FIG. 5, the table tennis target tracking method further comprises:

Step 510, obtaining a convolutional neural network training set for modeling, wherein the convolutional neural network training set comprises images containing targets and images not containing the targets, and the images are obtained from videos containing the targets.

Images are obtained frame by frame from a table tennis match video shot by a high-speed camera; some of the images contain the target and some do not. These images form the convolutional neural network training set, which is used for training to obtain the preset convolutional neural network model.

Step 520, labeling the image, setting the value inside the actual bounding box of the target in the image as a first value, and setting the value outside the actual bounding box of the target in the image as a second value.

The images in the convolutional neural network training set are labeled; the labeling can be performed manually or in other ways. The specific labeling mode is as follows: the values inside the actual bounding box of the target, namely the table tennis ball, in the image are set to a first value of 1, and the values outside the actual bounding box of the target in the image are set to a second value of 0. Of course, in other embodiments, other values may be used for labeling.
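The labeling rule (first value inside the actual bounding box, second value outside) can be sketched as a function that builds the target map for one image. The inclusive box convention and the use of `None` for an image that contains no target are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def make_label_map(
    height: int,
    width: int,
    box: Optional[Tuple[int, int, int, int]],
    inside: int = 1,
    outside: int = 0,
) -> List[List[int]]:
    """Build a height x width label map from an inclusive box (x0, y0, x1, y1).

    Elements inside the box get the first value, all others the second value.
    box=None (assumed convention for an image without a target) yields a map
    filled with the second value.
    """
    if box is None:
        return [[outside] * width for _ in range(height)]
    x0, y0, x1, y1 = box
    return [
        [inside if (x0 <= x <= x1 and y0 <= y <= y1) else outside for x in range(width)]
        for y in range(height)
    ]


# Hypothetical example: a 3 x 2 box inside a 5 x 5 map.
labels = make_label_map(5, 5, (1, 2, 3, 3))
```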

And 530, inputting the convolutional neural network training set into the convolutional neural network of the initialized network parameters for training to obtain a bounding box of the target in the image.

Initializing network parameters of the convolutional neural network, and inputting the images in the convolutional neural network training set into the convolutional neural network with the initialized network parameters for training. The specific training process is as follows: and inputting the image into a convolution layer in a preset convolution neural network model for convolution operation to obtain a characteristic diagram of the image. And arranging a classification head after the convolutional layer, wherein the classification head is used for classifying the classes of the objects in the image to obtain a classification result. And inputting the feature map into a spatial pyramid pooling layer to perform pooling operation, so as to obtain a compressed feature map. And (4) passing the compressed feature map through a full connection layer to obtain a probability map of the image, and obtaining a bounding box of the target in the image according to the probability map and the classification result of the image.

Step 540, calculating the network parameters of the modeled convolutional neural network according to the bounding box of the target in the image, the labeled actual bounding box, and the Softmax loss function.

The formula for the Softmax loss function is:

where p and s are the output probability map and the category score, respectively, and the target output class is the labeled counterpart of s. t_ij is an element of the target probability map and takes the value 0 or 1, and p_ij is the value at the corresponding point of the output probability map. L_cls is the Softmax loss function commonly used for classification tasks, and λ adjusts the relative weight of the loss terms of the two tasks.
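The formula itself is not reproduced in this text, so the following is only one plausible reading of the definitions above, not the patented formula. It assumes a per-point binary cross-entropy between p_ij and t_ij for the probability map, plus λ times a Softmax cross-entropy on the category scores; all function names are hypothetical.

```python
import math


def softmax_cross_entropy(scores, target_class):
    """Softmax loss L_cls for one image: -log softmax(scores)[target_class]."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target_class]


def probability_map_loss(p, t, eps=1e-12):
    """Assumed map term: binary cross-entropy summed over all points (i, j)."""
    total = 0.0
    for p_row, t_row in zip(p, t):
        for p_ij, t_ij in zip(p_row, t_row):
            p_ij = min(max(p_ij, eps), 1.0 - eps)  # keep log() finite
            total -= t_ij * math.log(p_ij) + (1 - t_ij) * math.log(1.0 - p_ij)
    return total


def combined_loss(p, t, scores, target_class, lam=1.0):
    """Map loss plus lam times the classification loss (assumed combination)."""
    return probability_map_loss(p, t) + lam * softmax_cross_entropy(scores, target_class)


t = [[0, 1], [0, 0]]               # labeled map: 1 inside the box, 0 outside
p_good = [[0.1, 0.9], [0.1, 0.1]]  # prediction close to the label
p_bad = [[0.9, 0.1], [0.9, 0.9]]   # prediction far from the label
loss_good = combined_loss(p_good, t, [2.0, 0.1], 0)
loss_bad = combined_loss(p_bad, t, [2.0, 0.1], 0)
print(loss_good < loss_bad)
```

Under this reading, a larger `lam` makes the classification head dominate training, and a smaller one favors the probability map.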

From the bounding box of the target in the image, the labeled actual bounding box, and the Softmax loss function, the parameters of the convolutional neural network model that minimize the loss function value over the whole convolutional neural network training set are calculated.

Step 550, obtaining the preset convolutional neural network model according to the network parameters.

The parameters that minimize the loss function value over the whole convolutional neural network training set are substituted into the convolutional neural network, yielding the preset convolutional neural network model of this embodiment of the invention.

In this embodiment, the convolutional neural network is trained on the convolutional neural network training set. After the convolution operation, the pooling operation, and the fully connected layer, a probability map of the image is obtained, and the bounding box of the target in the image is derived from the probability map. The actual bounding box is compared with the obtained bounding box through the Softmax loss function to find the parameters that minimize the loss function, namely the network parameters of the modeled convolutional neural network, and the preset convolutional neural network model is obtained from these network parameters.

In one embodiment, as shown in fig. 6, the table tennis target tracking method further comprises:

Step 610, obtaining a regression layer training set for modeling, wherein the regression layer training set comprises images containing the target, and the images are obtained from a video containing the target.

Images containing the target are selected from the frames extracted from a table tennis match video shot by a high-speed camera. These images form the regression layer training set, which is used to train the preset regression layer. The images in the regression layer training set differ from those in the convolutional neural network training set and are preferably taken from different video files, which helps improve the accuracy of the resulting model.

Step 620, labeling the image by marking the size of the actual bounding box of the target in the image.

The image may be annotated manually or by other means. Specifically, the coordinates of the center point of the actual bounding box of the target in the image, together with the length and width of that bounding box, are annotated.
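Annotation tools often record a box by its corners, while the annotation described here uses a center point plus length and width. A minimal conversion sketch follows; the helper names and the corner convention are assumptions made for illustration.

```python
def corners_to_center(x_min, y_min, x_max, y_max):
    """Convert corner coordinates to (center_x, center_y, width, height)."""
    w = x_max - x_min
    h = y_max - y_min
    return (x_min + w / 2.0, y_min + h / 2.0, w, h)


def center_to_corners(cx, cy, w, h):
    """Inverse conversion, back to (x_min, y_min, x_max, y_max)."""
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


annotation = corners_to_center(10, 20, 30, 28)
print(annotation)                      # (20.0, 24.0, 20, 8)
print(center_to_corners(*annotation))  # (10.0, 20.0, 30.0, 28.0)
```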

Step 630, inputting the regression layer training set into the preset convolutional neural network model to obtain a bounding box of the target in the image.

The images in the regression layer training set are input into the preset convolutional neural network model to obtain a bounding box of the target in each image. Specifically, the image is input into the convolutional layers of the preset convolutional neural network model for convolution operations, yielding a feature map of the image. The feature map is input into a spatial pyramid pooling layer for pooling, yielding a compressed feature map. The compressed feature map is passed through a fully connected layer to obtain a probability map of the image, and the bounding box of the target in the image is obtained from the probability map.
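The text does not say how the bounding box is read off the probability map. One simple possibility, shown here purely as an assumption, is to threshold the map and take the smallest box enclosing every point at or above the threshold. The threshold value 0.5 and the function name are hypothetical.

```python
def box_from_probability_map(prob_map, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) enclosing all points whose
    probability is at least `threshold`, or None if there are none."""
    xs, ys = [], []
    for y, row in enumerate(prob_map):
        for x, p in enumerate(row):
            if p >= threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # no target detected in this frame
    return (min(xs), min(ys), max(xs), max(ys))


prob_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.7, 0.6, 0.2],
]
coarse_box = box_from_probability_map(prob_map)
print(coarse_box)  # (1, 1, 2, 2)
```

A box obtained this way is only as precise as the resolution of the probability map, which is consistent with the later regression step that refines it.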

Step 640, inputting the bounding box of the target in the image into a regression layer with initialized network parameters for regression processing, to obtain the size of the regressed bounding box corresponding to the target.

The regression layer is attached directly after the preset convolutional neural network model (CNN) and consists, from bottom to top, of a low-layer convolutional layer of the preset convolutional neural network model, a region of interest pooling layer, and a fully connected layer. The network parameters of the regression layer are initialized, and the bounding box of the target in the image is input into the initialized regression layer for regression processing. Specifically, the bounding box of the target in the image is projected back onto the low-layer convolutional layer of the preset convolutional neural network model to obtain a feature map; the feature map is input into the region of interest pooling layer for feature compression and then into the fully connected layer, yielding the coordinates of the center point of the regressed bounding box corresponding to the target together with the length and width of the regressed bounding box.
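The projection and region of interest pooling steps can be sketched as follows. The sketch assumes that each cell of the low-layer feature map covers `stride` x `stride` image pixels and that the pooling layer takes the maximum over a fixed 2 x 2 grid of bins. Both are illustrative assumptions, as are the function names.

```python
def project_box(box, stride):
    """Map an image-space box (x_min, y_min, x_max, y_max) onto feature-map
    cells, assuming each cell covers `stride` pixels in each direction."""
    x_min, y_min, x_max, y_max = box
    return (x_min // stride, y_min // stride, x_max // stride, y_max // stride)


def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the cells inside `roi` (inclusive bounds) to out_size x out_size."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0 + 1, x1 - x0 + 1
    pooled = []
    for i in range(out_size):
        r0 = y0 + (i * h) // out_size
        r1 = y0 + -(-((i + 1) * h) // out_size)  # ceiling division
        row = []
        for j in range(out_size):
            c0 = x0 + (j * w) // out_size
            c1 = x0 + -(-((j + 1) * w) // out_size)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
roi = project_box((8, 8, 31, 31), stride=8)
pooled = roi_max_pool(feature_map, roi)
print(roi)     # (1, 1, 3, 3)
print(pooled)  # [[10, 11], [14, 15]]
```

Because the pooled output always has the same size, the fully connected layer that follows can regress the four box parameters from boxes of any size.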

Step 650, calculating the network parameters of the modeled regression layer according to the size of the regressed bounding box corresponding to the target, the size of the labeled actual bounding box, and the smoothL1 loss function.

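For simplicity, let t and v below denote the four parameters x, y, w, h of the labeled bounding box and of the bounding box output by the network, respectively. The loss function used to train the regression layer is

$$L(t, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(t_i - v_i)$$

where

$$\mathrm{smooth}_{L1}(z) = \begin{cases} 0.5\,z^{2}, & |z| < 1 \\ |z| - 0.5, & \text{otherwise} \end{cases}$$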

The training process is to find a set of model parameters that minimizes the loss function. Most neural networks are trained in the form of the following formula:

$$\theta^{*} = \arg\min_{\theta} \sum_{x \in T} L(x; \theta)$$

where θ represents the model parameters, T represents the entire training set, and L(x; θ) is the loss on training sample x. The formula means finding the model parameters that minimize the loss function value over the whole training set, that is, the network parameters of the modeled regression layer.
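As a worked illustration of the regression loss, the sketch below assumes the standard smooth L1 definition (quadratic for differences below 1 in magnitude, linear otherwise) summed over the four box parameters x, y, w, h. The function names are illustrative.

```python
def smooth_l1(z):
    """Standard smooth L1: 0.5 * z^2 if |z| < 1, else |z| - 0.5."""
    az = abs(z)
    return 0.5 * z * z if az < 1.0 else az - 0.5


def regression_loss(t, v):
    """Sum of smooth L1 terms over the box parameters (x, y, w, h).

    t is the labeled box, v is the box output by the network.
    """
    return sum(smooth_l1(t_i - v_i) for t_i, v_i in zip(t, v))


labeled = (10.0, 10.0, 4.0, 4.0)
predicted = (10.5, 10.0, 6.0, 4.0)
loss = regression_loss(labeled, predicted)
print(loss)  # 0.125 for the x error of 0.5, plus 1.5 for the w error of 2
```

The quadratic region keeps gradients small near the optimum, while the linear region limits the influence of badly wrong boxes compared with a purely squared error.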

Step 660, obtaining the preset regression layer according to the network parameters.

In this embodiment, it follows from the characteristics of neural networks that the lower convolutional layers tend to contain more position information, while the higher convolutional layers contain more semantic information, such as the category of the object. Projecting the bounding box back onto a low-layer convolutional layer of the preset convolutional neural network model therefore takes into account the semantic information (such as the category of the target) carried by the bounding box obtained through the higher convolutional layers, while also integrating the position information of the lower convolutional layer. A regression layer built in this way can correctly distinguish the target in the input image and accurately give its bounding box in actual use.

In one embodiment, a table tennis target tracking method is provided, as illustrated in fig. 7.

First, frames are extracted one by one from a table tennis match video shot by a high-speed camera; some of the images contain the table tennis ball as the target and some do not.

Further, the CNN 710 includes a convolutional layer 711, a pooling layer 712, and a fully connected layer 713. The image is input into the convolutional layer 711 of the preset convolutional neural network model for a convolution operation, yielding a feature map of the image. The feature map is input into the pooling layer 712 for a pooling operation, yielding a compressed feature map. The compressed feature map is passed through the fully connected layer 713 to obtain a probability map of the image, and the bounding box of the target in the image is obtained from the probability map.

Further, the preset regression layer 720 consists, from bottom to top, of a low-layer convolutional layer 721 of the preset convolutional neural network model, a region of interest pooling layer 722, and a fully connected layer 723. The bounding box of the target in the image is projected back onto the low-layer convolutional layer 721 of the preset convolutional neural network model to obtain a feature map, and the feature map is input into the region of interest pooling layer 722 for feature compression, yielding a compressed feature map. The compressed feature map is input into the fully connected layer 723 and processed to obtain the regressed bounding box corresponding to the target.

In one embodiment, as shown in FIG. 8, there is also provided a table tennis target tracking device 800, comprising: an image acquisition module 810, a convolutional neural network module 820, and a regression layer module 830.

The image acquisition module 810 is configured to acquire an image from a video containing the target.

The convolutional neural network module 820 is configured to input the image into the preset convolutional neural network model and obtain, after processing, a bounding box of the target in the image.

The regression layer module 830 is configured to input the bounding box of the target in the image into the preset regression layer for regression processing, to obtain a regressed bounding box corresponding to the target, where the preset regression layer includes a low-layer convolutional layer of the preset convolutional neural network model.
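The way the three modules cooperate can be sketched as a small pipeline. The class name `TableTennisTracker`, the callable signatures, and the stub detector and refiner are all hypothetical and stand in for the trained networks, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


@dataclass
class TableTennisTracker:
    detect: Callable[[dict], Optional[Box]]  # stands in for module 820
    refine: Callable[[dict, Box], Box]       # stands in for module 830

    def track(self, frames: Iterable[dict]) -> Iterator[Optional[Box]]:
        for frame in frames:  # role of module 810: one image per frame
            coarse = self.detect(frame)
            # Frames without a target produce no box to refine.
            yield None if coarse is None else self.refine(frame, coarse)


# Stub modules: each frame is a dict carrying a precomputed coarse box.
def fake_detect(frame):
    return frame.get("coarse")


def fake_refine(frame, box):
    x, y, w, h = box
    return (x + frame.get("shift", 0.0), y, w, h)


tracker = TableTennisTracker(detect=fake_detect, refine=fake_refine)
frames = [{"coarse": (10.0, 20.0, 4.0, 4.0), "shift": 0.5}, {}]
results = list(tracker.track(frames))
print(results)  # [(10.5, 20.0, 4.0, 4.0), None]
```

Keeping detection and refinement as separate callables mirrors the separation between the convolutional neural network module and the regression layer module, so either can be retrained or replaced independently.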

In one embodiment, as shown in FIG. 9, the convolutional neural network module 820 includes: a convolution operation module 822, a pooling operation module 824, a probability map generation module 826, and a bounding box generation module 828.

The convolution operation module 822 is configured to input the image into a convolutional layer of the preset convolutional neural network model for a convolution operation, so as to obtain a feature map of the image.

The pooling operation module 824 is configured to input the feature map into a pooling layer for a pooling operation, so as to obtain a compressed feature map.

The probability map generation module 826 is configured to pass the compressed feature map through a fully connected layer to obtain a probability map of the image.

The bounding box generation module 828 is configured to obtain the bounding box of the target in the image according to the probability map of the image.

In one embodiment, as shown in FIG. 10, the regression layer module 830 includes: a low-level convolutional layer module 832, a region of interest pooling layer module 834, and a fully connected layer module 836.

The low-layer convolutional layer module 832 is configured to project the bounding box of the target in the image back onto the low-layer convolutional layer of the preset convolutional neural network model for processing, so as to obtain a feature map.

The region of interest pooling layer module 834 is configured to input the feature map into the region of interest pooling layer for feature compression, so as to obtain a compressed feature map.

The fully connected layer module 836 is configured to input the compressed feature map into the fully connected layer and process it to obtain the regressed bounding box corresponding to the target.

In one embodiment, as shown in FIG. 11, the table tennis target tracking device 800 further comprises a preset convolutional neural network model training module 840, which comprises: a convolutional neural network training set acquisition module 841, an image labeling module 842, a training module 843, a network parameter calculation module 844, and a preset convolutional neural network model generation module 845.

The convolutional neural network training set acquisition module 841 is configured to acquire a convolutional neural network training set for modeling, where the training set includes images containing the target and images not containing the target, and the images are obtained from a video containing the target.

The image labeling module 842 is configured to label the image, set a value inside an actual bounding box of the object in the image as a first value, and set a value outside the actual bounding box of the object in the image as a second value.

The training module 843 is configured to input the convolutional neural network training set into a convolutional neural network with initialized network parameters for training, to obtain a bounding box of the target in the image.

The network parameter calculation module 844 is configured to calculate the network parameters of the modeled convolutional neural network according to the bounding box of the target in the image, the labeled actual bounding box, and the Softmax loss function.

The preset convolutional neural network model generation module 845 is configured to obtain the preset convolutional neural network model according to the network parameters.

In one embodiment, as shown in FIG. 12, the table tennis target tracking device 800 further comprises a regression layer training module 850, which comprises: a regression layer training set acquisition module 851, an image labeling module 852, a training module 853, a bounding box size generation module 854, a network parameter calculation module 855, and a preset regression layer establishing module 856.

The regression layer training set acquisition module 851 is configured to acquire a regression layer training set for modeling, where the regression layer training set includes images containing the target, and the images are obtained from a video containing the target.

The image labeling module 852 is configured to label the image by marking the size of the actual bounding box of the target in the image.

The training module 853 is configured to input the regression layer training set into the preset convolutional neural network model to obtain a bounding box of the target in the image.

The bounding box size generation module 854 is configured to input the bounding box of the target in the image into a regression layer with initialized network parameters for regression processing, to obtain the size of the regressed bounding box corresponding to the target.

The network parameter calculation module 855 is configured to calculate the network parameters of the modeled regression layer according to the size of the regressed bounding box corresponding to the target, the size of the labeled actual bounding box, and the smoothL1 loss function.

The preset regression layer establishing module 856 is configured to obtain the preset regression layer according to the network parameters.

In one embodiment, there is also provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of: acquiring an image from a video containing a target; inputting the image into a preset convolutional neural network model, and processing to obtain a bounding box of a target in the image; and inputting the bounding box of the target in the image into a preset regression layer to perform regression processing to obtain a regressed bounding box corresponding to the target, wherein the preset regression layer comprises a low-layer convolution layer of a preset convolution neural network model.

In one embodiment, the program further implements the following steps when executed by the processor: inputting the image into a convolutional layer of the preset convolutional neural network model for a convolution operation to obtain a feature map of the image; inputting the feature map into a pooling layer for a pooling operation to obtain a compressed feature map; passing the compressed feature map through a fully connected layer to obtain a probability map of the image; and obtaining the bounding box of the target in the image according to the probability map of the image.

In one embodiment, the preset regression layer includes: a fully connected layer, a region of interest pooling layer, and a low-layer convolutional layer of the preset convolutional neural network model.

In one embodiment, the program further implements the following steps when executed by the processor: projecting the bounding box of the target in the image back onto the low-layer convolutional layer of the preset convolutional neural network model for processing to obtain a feature map; inputting the feature map into the region of interest pooling layer for feature compression to obtain a compressed feature map; and inputting the compressed feature map into the fully connected layer for processing to obtain the regressed bounding box corresponding to the target.

In one embodiment, the program further implements the following steps when executed by the processor: acquiring a convolutional neural network training set for modeling, wherein the convolutional neural network training set comprises an image containing a target and an image not containing the target, and the image is acquired from a video containing the target; labeling the image, setting a value inside an actual bounding box of the target in the image as a first value, and setting a value outside the actual bounding box of the target in the image as a second value; inputting the convolutional neural network training set into a convolutional neural network of the initialized network parameters for training to obtain a bounding box of a target in an image; calculating network parameters of the modeled convolutional neural network according to the bounding box of the target in the image, the marked actual bounding box and the Softmax loss function; and obtaining a preset convolutional neural network model according to the network parameters.

In one embodiment, the program further implements the following steps when executed by the processor: obtaining a regression layer training set for modeling, wherein the regression layer training set comprises images containing targets, and the images are obtained from videos containing the targets; labeling the image, and labeling the size of an actual bounding box of a target in the image; inputting the regression layer training set into a preset convolutional neural network model for training to obtain a bounding box of a target in an image; inputting the bounding box of the target in the image into a regression layer of the initialized network parameters for regression processing to obtain the size of the regressed bounding box corresponding to the target; calculating the network parameters of the regression layer after modeling according to the size of the regression bounding box corresponding to the target, the size of the marked actual bounding box and a smoothL1 loss function; and obtaining a preset regression layer according to the network parameters.

In one embodiment, there is also provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring an image from a video containing a target; inputting the image into a preset convolutional neural network model, and processing to obtain a bounding box of a target in the image; and inputting the bounding box of the target in the image into a preset regression layer to perform regression processing to obtain a regressed bounding box corresponding to the target, wherein the preset regression layer comprises a low-layer convolution layer of a preset convolution neural network model.

In one embodiment, the processor further implements the following steps when executing the computer program: inputting the image into a convolutional layer of the preset convolutional neural network model for a convolution operation to obtain a feature map of the image; inputting the feature map into a pooling layer for a pooling operation to obtain a compressed feature map; passing the compressed feature map through a fully connected layer to obtain a probability map of the image; and obtaining the bounding box of the target in the image according to the probability map of the image.

In one embodiment, the preset regression layer includes: a fully connected layer, a region of interest pooling layer, and a low-layer convolutional layer of the preset convolutional neural network model.

In one embodiment, the processor further implements the following steps when executing the computer program: projecting the bounding box of the target in the image back onto the low-layer convolutional layer of the preset convolutional neural network model for processing to obtain a feature map; inputting the feature map into the region of interest pooling layer for feature compression to obtain a compressed feature map; and inputting the compressed feature map into the fully connected layer for processing to obtain the regressed bounding box corresponding to the target.

In one embodiment, the processor further implements the following steps when executing the computer program: acquiring a convolutional neural network training set for modeling, wherein the convolutional neural network training set comprises an image containing a target and an image not containing the target, and the image is acquired from a video containing the target; labeling the image, setting a value inside an actual bounding box of the target in the image as a first value, and setting a value outside the actual bounding box of the target in the image as a second value; inputting the convolutional neural network training set into a convolutional neural network of the initialized network parameters for training to obtain a bounding box of a target in an image; calculating network parameters of the modeled convolutional neural network according to the bounding box of the target in the image, the marked actual bounding box and the Softmax loss function; and obtaining a preset convolutional neural network model according to the network parameters.

In one embodiment, the processor further implements the following steps when executing the computer program: obtaining a regression layer training set for modeling, wherein the regression layer training set comprises images containing targets, and the images are obtained from videos containing the targets; labeling the image, and labeling the size of an actual bounding box of a target in the image; inputting the regression layer training set into a preset convolutional neural network model for training to obtain a bounding box of a target in an image; inputting the bounding box of the target in the image into a regression layer of the initialized network parameters for regression processing to obtain the size of the regressed bounding box corresponding to the target; calculating the network parameters of the regression layer after modeling according to the size of the regression bounding box corresponding to the target, the size of the marked actual bounding box and a smoothL1 loss function; and obtaining a preset regression layer according to the network parameters.

Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium; in the embodiments of the present invention, the program may be stored in a storage medium of a computer system and executed by at least one processor in the computer system to implement the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered within the scope of this specification as long as the combination involves no contradiction.

The above examples illustrate only several embodiments of the present invention. Their description is specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims

1. A table tennis target tracking method, the method comprising:

acquiring an image from a video containing a target;

inputting the bounding box of the target in the image into a preset regression layer for regression processing to obtain a regressed bounding box corresponding to the target, wherein the preset regression layer comprises a full connection layer, an interest region pooling layer and a lower convolution layer of the preset convolutional neural network model, the preset regression layer is obtained according to network parameters, the network parameters are obtained by projecting the bounding box of the target in the image into the lower convolution layer in the preset convolutional neural network model again for processing to obtain a characteristic diagram, inputting the characteristic diagram into the interest region pooling layer for characteristic compression, then inputting the full connection layer for processing to obtain a regressed bounding box corresponding to the target, and comparing the size of the regressed bounding box with the size of an actual bounding box marked in the marking training set to obtain the network parameters of the modeled regression layer.

2. The method of claim 1, wherein the inputting the image into a preset convolutional neural network model and processing the image into a bounding box of the object in the image comprises:

3. The method according to claim 1, wherein the obtaining of the regressed bounding box corresponding to the target after inputting the bounding box of the target in the image into a preset regression layer for regression processing comprises:

4. The method according to claim 1 or 2, characterized in that the method further comprises:

5. The method according to claim 1 or 3, characterized in that the method further comprises:

and obtaining a preset regression layer according to the network parameters.

6. A table tennis target tracking device, the device comprising:

a regression layer module for inputting the bounding box of the target in the image into a preset regression layer to carry out regression processing to obtain a regressed bounding box corresponding to the target, the preset regression layer comprises a full connection layer, an interest region pooling layer and a lower convolution layer of the preset convolution neural network model, the preset regression layer is obtained according to network parameters, the network parameters are obtained by projecting bounding boxes of targets in the images into a low-layer convolution layer in the preset convolution neural network model again for processing to obtain a feature map, the feature map is input into a region-of-interest pooling layer for feature compression, then the feature map is input into a full-connection layer for processing to obtain regression bounding boxes corresponding to the targets, and the size of the regression bounding boxes is compared with the size of the labeled actual bounding boxes in the labeling training set to obtain the network parameters of the modeled regression layer.

7. The apparatus of claim 6, wherein the convolutional neural network module 820 comprises:

the convolution operation module is used for inputting the image into a convolution layer in a preset convolution neural network model to carry out convolution operation so as to obtain a characteristic diagram of the image;

the pooling operation module is used for inputting the characteristic diagram into a pooling layer to carry out pooling operation so as to obtain a compressed characteristic diagram;

the probability map generation module is used for obtaining a probability map of the image by the compressed feature map through a full connection layer;

and the bounding box generation module of the target in the image is used for obtaining the bounding box of the target in the image according to the probability map of the image.

8. The apparatus of claim 6, wherein the regression layer module comprises:

and the full connection layer module is used for inputting the compressed feature map into a full connection layer to be processed to obtain a regression bounding box corresponding to the target.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a table tennis target tracking method according to any one of claims 1 to 5.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the table tennis target tracking method according to any one of claims 1 to 5 when executing the computer program.