WO2024041319A1 - Basketball shot recognition method and apparatus, device and storage medium - Google Patents

Basketball shot recognition method and apparatus, device and storage medium

Info

Publication number: WO2024041319A1
Application number: PCT/CN2023/110320
Authority: WIPO (PCT)
Prior art keywords: feature, features, self-attention, classification, shot
Other languages: French (fr), Chinese (zh)
Inventors: 王杰, 孔繁昊
Original assignees: 京东方科技集团股份有限公司, 成都京东方智慧科技有限公司
Application filed by 京东方科技集团股份有限公司 and 成都京东方智慧科技有限公司
Publication of WO2024041319A1

Classifications

    • G06V 20/42: Higher-level, semantic clustering, classification or understanding of video scenes (e.g. detection, labelling or Markovian modelling of sport events or news items) of sport video content
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/26: Image preprocessing; segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region (e.g. clustering-based techniques); detection of occlusion
    • G06V 10/44: Extraction of image or video features; local feature extraction by analysis of parts of the pattern (e.g. by detecting edges, contours, loops, corners, strokes or intersections); connectivity analysis (e.g. of connected components)
    • G06V 10/765: Image or video recognition or understanding using pattern recognition or machine learning; classification using rules for classification or partitioning the feature space
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Definitions

  • the present invention relates to the field of computer application technology, and in particular to a basketball shot recognition method, apparatus, device and storage medium.
  • sensors can be installed on the backboard, and the sensors can automatically determine whether the shot is scored.
  • the deployment cost of this method is relatively high, and a low-cost shot recognition method is urgently needed.
  • the present invention provides a shot recognition method, apparatus, device and storage medium to remedy deficiencies in the related art.
  • a shot recognition method including:
  • the image sequence of the backboard area to be identified is input into the shot classification network, and the shot recognition result output by the shot classification network is obtained; wherein the shot recognition result is used to represent whether the shot is successful.
  • the shot classification network is used to:
  • a feature map is extracted for each backboard area image in the input backboard area image sequence to obtain a feature map sequence; classification features are extracted based on the self-attention mechanism for the feature map sequence; and the shot recognition result is determined based on the classification features.
  • extracting classification features based on a self-attention mechanism includes:
  • a position code is added; wherein the position code includes: information characterizing the spatial position relationship between different feature points of each feature map in the feature map sequence, and information characterizing the temporal position relationship between different feature maps in the feature map sequence;
  • features are extracted from the spatial dimension and the temporal dimension to obtain classification features.
  • adding position coding to the feature map sequence includes:
  • each feature map in the feature map sequence is converted into a one-dimensional feature; the one-dimensional features are stacked to obtain a two-dimensional feature; and position encoding is added to the two-dimensional feature.
  • the shot classification network includes N cascaded preset self-attention modules, N ≥ 2; for the i-th preset self-attention module, 1 ≤ i ≤ N-1, its output is cascaded to the input of the (i+1)-th preset self-attention module; the preset self-attention module is used to extract features from the spatial dimension and the temporal dimension based on the self-attention mechanism;
  • extracting features from the spatial dimension and the temporal dimension based on the self-attention mechanism, for the feature map sequence after position encoding is added, includes:
  • the feature map sequence after position coding is added is input into the first preset self-attention module, and the classification features are determined based on the output of the N-th preset self-attention module.
  • the preset self-attention module is used to:
  • for the input features, features are serially extracted at least once from the spatial dimension based on the self-attention mechanism, and then, for the extracted features, features are serially extracted at least once from the time dimension based on the self-attention mechanism, and the extracted features are output; or
  • for the input features, features are serially extracted at least once from the time dimension based on the self-attention mechanism, and then, for the extracted features, features are serially extracted at least once from the spatial dimension based on the self-attention mechanism, and the extracted features are output.
  • extracting classification features based on a self-attention mechanism includes:
  • initial classification features are added to the feature map sequence, and features are extracted based on the self-attention mechanism for the feature map sequence after the initial classification features are added;
  • the current representation corresponding to the initial classification features is determined as the classification feature.
  • determining the shot recognition result based on the classification features includes:
  • the classification features are input into the pre-trained fully connected network, and the shot recognition result output by the fully connected network is obtained (see the sketch below).
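  • As a hedged illustration of this classification step: a minimal fully connected head in PyTorch, assuming a 256-dimensional classification feature and two output logits (shot made / missed); all names and sizes here are illustrative and not fixed by the patent.

```python
import torch

# Hypothetical fully connected head; the feature size 256 and the two
# output classes (shot successful / not successful) are assumptions.
fc_head = torch.nn.Linear(256, 2)

def recognize(classification_feature: torch.Tensor) -> int:
    logits = fc_head(classification_feature)  # scores for the two outcomes
    return int(logits.argmax())               # e.g. 1 = successful, 0 = not
```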
  • extracting classification features based on a self-attention mechanism includes:
  • the first m feature maps are determined as the feature map subsequence contained in the current sliding window; m ≥ 1 (see the sketch below);
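  • As a hedged reading of the sliding-window idea: a short Python generator that yields length-m subsequences of the feature map sequence, starting from the first m maps. The stride of 1 is an assumption; the text only fixes that the first m maps form the initial window.

```python
# Illustrative sliding window over the feature map sequence; stride 1 is
# an assumption not fixed by the text.
def sliding_windows(feature_maps, m):
    for start in range(len(feature_maps) - m + 1):
        yield feature_maps[start:start + m]  # current window's subsequence
```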
  • obtaining the image sequence of the backboard area to be identified includes:
  • for each determined backboard area, the image content containing the backboard area is cropped, the cropping result is adjusted to a preset image size, and the adjustment result is added to the backboard area image sequence to be identified; the backboard area image sequence to be identified is sorted by the temporal order between the video frames where the backboard area images are located (see the sketch below).
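  • A minimal sketch of this crop-and-resize step, assuming OpenCV; the function name, the 224x224 preset size, and pixel-aligned integer boxes are illustrative assumptions.

```python
import cv2

def build_region_sequence(frames, boxes, preset_size=(224, 224)):
    """frames: video frames (arrays) in temporal order; boxes: matching
    (x1, y1, x2, y2) backboard boxes, one per frame."""
    sequence = []
    for frame, (x1, y1, x2, y2) in zip(frames, boxes):
        crop = frame[y1:y2, x1:x2]            # image content containing the backboard area
        crop = cv2.resize(crop, preset_size)  # adjust to the preset image size
        sequence.append(crop)
    return sequence                           # keeps the frames' temporal order
```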
  • a shot recognition device including:
  • a backboard identification unit configured to: obtain a video to be identified; determine the backboard area in the video to be identified, and obtain an image sequence of the backboard area to be identified;
  • a classification network unit configured to: input the to-be-identified backboard area image sequence into a shot classification network, and obtain a shot recognition result output by the shot classification network; wherein the shot recognition result is used to characterize whether the shot is successful.
  • Figure 1 is a schematic flow chart of a shot recognition method according to an embodiment of the present invention
  • Figure 2 is a schematic structural diagram of a shot classification network according to an embodiment of the present invention.
  • Figure 3 is a schematic diagram of the principle of a shot classification network according to an embodiment of the present invention.
  • Figure 4 is a schematic structural diagram of a shot recognition device according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of the hardware structure of a computer device for implementing a method according to an embodiment of the present invention.
  • sensors can be installed on the backboard, and the sensors can automatically determine whether the shot is scored.
  • the deployment cost of this method is relatively high, and a low-cost shot recognition method is urgently needed.
  • the present invention provides a shot recognition method.
  • videos captured during basketball games can be analyzed to determine whether a shot scores.
  • using captured video for shot recognition does not require pre-deployment of hardware, which can reduce deployment costs.
  • the image content including the backboard area is often closely related to whether the basketball is thrown into the basket.
  • Other image content outside the backboard area has less to do with whether the basketball is thrown into the basket.
  • the continuous image content containing the backboard area can be used to determine whether the basket in the backboard area is vibrating, whether the mesh pocket under the basket is shaking, the basketball's movement trajectory in the backboard area, and so on, all of which can be used to determine whether the shot was successful.
  • the backboard area in the captured video can therefore be determined and the corresponding images combined into an image sequence for shot recognition.
  • the amount of data that needs to be recognized can be reduced without reducing the accuracy of shot recognition, the efficiency of shot recognition can be improved, and the calculation cost of shot recognition can also be reduced.
  • the recognition itself can be performed using deep learning.
  • the above method can perform shot recognition based on video without pre-deployed sensors, reducing deployment costs; and shot recognition can be performed on the backboard area determined in the video, which reduces redundant data and improves the efficiency of shot recognition.
  • a shot recognition method provided by an embodiment of the present invention will be explained in detail below.
  • Figure 1 is a schematic flow chart of a shot recognition method according to an embodiment of the present invention.
  • the embodiments of the present invention do not limit the execution subject of the method flow.
  • the execution subject can be a mobile device or a server.
  • the method may include the following steps.
  • S102 Determine the backboard area in the video to be identified, and obtain an image sequence of the backboard area to be identified.
  • S103 Input the image sequence of the backboard area to be identified into the shot classification network, and obtain the shot recognition result output by the shot classification network.
  • the shot classification network can be used to predict the shot recognition results for the input backboard area image sequence.
  • the shot recognition result can specifically be used to characterize whether the shot is successful.
  • the shot classification network can be trained in advance based on the shot samples whose sample characteristics are the backboard area image sequence and the corresponding shot labels; the shot labels are used to characterize whether the shot is successful.
  • the results of shot recognition can be used to calculate the score of basketball games.
  • the above method flow can identify shots based on the video to be identified, without pre-deployment of hardware such as sensors, using software algorithms directly. It only requires a device that can capture video, which may specifically be the camera of a handheld device or a surveillance camera on a basketball court, thus reducing deployment costs.
  • shot recognition can also be performed on the backboard area determined in the video to be recognized, reducing redundant data and thereby improving the efficiency of shot recognition.
  • Shot recognition through deep learning can improve the efficiency and accuracy of shot recognition.
  • S101 Obtain the video to be identified.
  • This method process does not limit the specific method of obtaining the video to be identified.
  • the video to be identified may specifically be a surveillance video of a basketball game, or a video of a basketball game captured by a camera of a handheld device.
  • the video to be identified can be a video clip of a basketball game, so as to identify whether the shot in the clip was successful.
  • the video to be recognized can be a clip of a shot in a basketball game, so that it can be directly and quickly identified whether the shot is successful.
  • the video to be recognized can be a 2-4 second shooting video clip.
  • the video to be identified can be a complete video of a basketball game, so that the number of successful shots in the basketball game can be easily identified, and the video location of successful shots can be conveniently located.
  • the specific method can be found in the explanation below.
  • This method process does not limit the shooting method of the video to be identified.
  • the shooting can be done with a handheld device or with a fixed angle of view.
  • the basketball game can be shot with a surveillance camera, or the basketball match can be shot with a fixed angle of view of the mobile phone.
  • a fixed angle of view can be used to shoot the backboard area.
  • S102 Determine the backboard area in the video to be identified, and obtain an image sequence of the backboard area to be identified.
  • This method process does not limit the way to determine the backboard area.
  • target detection can be used to identify the backboard area in each video frame of the video to be identified.
  • a pre-trained backboard area detection model can be used to identify each video frame in the video to be recognized and determine the backboard area.
  • the backboard area detection model can be trained using image samples and corresponding backboard area position labels.
  • the backboard area detection model may specifically use a target detection network trained based on YOLOX or RetinaNet.
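  • As a hedged sketch of such a detector: torchvision ships a RetinaNet implementation that could be fine-tuned on backboard annotations. The two-class setup (background plus backboard) and the score threshold are assumptions; the patent names YOLOX or RetinaNet but fixes no implementation.

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

# Hypothetical backboard detector: RetinaNet with 2 classes
# (background + backboard), to be fine-tuned on labeled image samples.
detector = retinanet_resnet50_fpn(num_classes=2)
detector.eval()

@torch.no_grad()
def detect_backboard(frame, score_thresh=0.5):
    """frame: float tensor of shape (3, H, W) with values in [0, 1]."""
    out = detector([frame])[0]
    keep = out["scores"] > score_thresh
    return out["boxes"][keep]  # (N, 4) boxes for detected backboard areas
```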
  • video frames determined to contain the backboard area can be added to the backboard area image sequence to be identified.
  • to further reduce redundant data, the video frames containing the backboard area can be cropped to obtain the image content containing the backboard area, which is then added to the image sequence of the backboard area to be identified.
  • This embodiment does not limit the specific cropping method, as long as the cropping result includes the determined backboard area image.
  • to facilitate using image data near the backboard area for shot recognition, the crop can be chosen so that it contains the basket.
  • the size of the cropping result is larger than the determined size of the backboard area, so that the cropping result can include image data near the backboard area.
  • the position of a rectangular frame in the video frames can be determined such that, across the consecutive video frames, the determined backboard area is contained within the rectangular frame; cropping can then be performed according to the position of the rectangular frame.
  • the position of the rectangular frame can be expanded and then cropped, so that the image content near the basket area can be cropped.
  • specifically, the minimum outer bounding box of the backboard areas detected at the same position across these video frames can be determined; after expanding the outer bounding box by a factor of 1.5, the corresponding image content is cropped from the corresponding position of each image (see the sketch below).
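  • A hedged sketch of this bounding-box step; clamping the expanded box to the frame borders is an added safeguard not spelled out in the text, and the function name is illustrative.

```python
# Union of per-frame backboard boxes, expanded by a ratio (1.5 in the text).
def expanded_union_box(boxes, frame_w, frame_h, ratio=1.5):
    x1 = min(b[0] for b in boxes); y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes); y2 = max(b[3] for b in boxes)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2        # center of the union box
    w, h = (x2 - x1) * ratio, (y2 - y1) * ratio  # expanded width / height
    return (max(0, int(cx - w / 2)), max(0, int(cy - h / 2)),
            min(frame_w, int(cx + w / 2)), min(frame_h, int(cy + h / 2)))
```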
  • to facilitate shot recognition, the different cropping results can be adjusted to the same size, specifically to the same resolution.
  • obtaining the image sequence of the backboard area to be identified may include: for each backboard area determined, crop the image content containing the backboard area, adjust the cropping result to the preset image size, and add the adjustment result into the image sequence of the backboard area to be identified.
  • different backboard areas in the backboard area image sequence to be identified may have the same image size.
  • Obtaining the image sequence of the backboard area to be identified may include: for the determined different backboard areas, cropping the image content containing each backboard area, adjusting the crops to the same image size, and adding the adjustment results to the image sequence of the backboard area to be identified.
  • the backboard area images in the sequence of backboard area images to be identified may have a sequence, so that shot recognition can be facilitated based on the sequence.
  • the sequence of backboard area images to be identified may be sorted in the temporal order of the video frames where the backboard area images are located.
  • the backboard area determined for the video to be identified may be the same backboard area captured in the video to be identified.
  • the backboard area in the video to be identified can be determined through a backboard area detection model, and then the same backboard area captured in the video to be identified can be determined using a tracking algorithm. Specifically, it may be the same backboard area contained between different video frames in the video to be identified.
  • one or more backboard areas may be determined for the video to be identified.
  • the video to be identified can be shot from a fixed angle so that it captures the same backboard area, and the shooting situation at that backboard area can then be easily determined through the shot recognition method.
  • the video to be identified can capture multiple different backboard areas, and then the shooting conditions of each backboard area can be determined through the shot recognition method for different backboard areas.
  • an image sequence of the backboard area to be identified can be constructed separately.
  • the method may be to crop the image content containing the backboard area, adjust the cropping result to a preset image size, and add the adjustment result to the image sequence of the backboard area to be identified.
  • the determined different backboard areas can correspond to different backboard area image sequences to be identified, and subsequent shot classification networks can be used for shot recognition respectively.
  • each video frame in the video to be identified can be checked for whether it contains a backboard area; for the frames that do, the different video frames containing the same backboard area can be determined, and thereby the one or more backboard areas photographed in the video to be identified can be determined.
  • different video frames containing the same backboard area can be determined through a tracking algorithm.
  • this method process does not limit the method of obtaining the backboard area image sequence.
  • an image sequence of the backboard area to be identified can be obtained.
  • the image content containing the backboard area can be cropped out, and the image sequence of the backboard area to be identified corresponding to that backboard area can then be obtained.
  • specifically, the cropped image content can be adjusted to a preset size, and the adjustment result is then added to the image sequence of the backboard area to be identified.
  • an image sequence of the backboard area to be identified can be obtained respectively.
  • for each backboard area, the image content containing it can be cropped out from the one or more video frames containing that backboard area, and the corresponding image sequence of the backboard area to be identified can then be obtained.
  • the cropped image content can be adjusted to a preset size, and then the adjustment result is added to the image sequence of the backboard area to be identified.
  • S103 Input the image sequence of the backboard area to be identified into the shot classification network, and obtain the shot recognition result output by the shot classification network.
  • the shot classification network can be used to predict the shot recognition results for the input backboard area image sequence.
  • the shot recognition result can specifically be used to characterize whether the shot is successful.
  • the shot classification network can be trained in advance based on the shot samples whose sample characteristics are the backboard area image sequence and the corresponding shot labels; the shot labels can be used to characterize whether the shot is successful.
  • the results of shot recognition can be used to calculate the score of basketball games.
  • the process of this method does not specifically limit the shot classification network's processing of the image sequence of the backboard area to be identified.
  • the shot classification network can perform image recognition on each backboard area image in the image sequence of the backboard area to be identified, and identify whether the shot is successful.
  • alternatively, the shot classification network can perform image recognition by integrating at least two consecutive backboard area images in the image sequence of the backboard area to be identified, and identify whether the shot is successful.
  • specifically, the image recognition may be performed by integrating 8 or more consecutive frames of backboard area images.
  • the shot classification network can use a self-attention mechanism to extract features for shot recognition.
  • in this way, the associations between different backboard area images within at least two consecutive backboard area images can be learned.
  • used as features for shot recognition, these associations can better distinguish successful shots from failed shots, thereby improving shot recognition accuracy.
  • the shot classification network can be used to: extract classification features based on the self-attention mechanism for the input backboard area image sequence; and determine the shot recognition result based on the classification features.
  • the classification features may be features used to predict shot recognition results.
  • the process of this method does not limit the specific way to extract features based on the self-attention mechanism.
  • to facilitate extracting classification features based on the self-attention mechanism, a feature map can first be extracted for each backboard area image in the backboard area image sequence, and features can then be extracted from the feature maps based on the self-attention mechanism.
  • specifically, extracting classification features based on the self-attention mechanism for the input backboard area image sequence may include: extracting a feature map for each backboard area image in the input backboard area image sequence to obtain a feature map sequence; and, for the feature map sequence, extracting classification features based on the self-attention mechanism.
  • the order of the feature maps in the feature map sequence can be the same as the order of the corresponding backboard area images in the backboard area image sequence.
  • the order of the feature maps in the feature map sequence can be the same as the temporal order between the video frames where the corresponding backboard area images are located.
  • features can also be extracted directly from the backboard area image sequence based on the self-attention mechanism.
  • This embodiment does not limit the method of extracting the feature map of the backboard area image.
  • feature maps can be extracted using pre-trained convolutional networks.
  • the extracted feature map may include at least one of the following kinds of information: detail information, edge information, noise information, spatial relationship information, etc.; this embodiment does not specifically limit this.
  • extracting a feature map for each backboard area image in the input backboard area image sequence may include: extracting a feature map based on a pre-trained convolutional network for each backboard area image in the input backboard area image sequence.
  • This embodiment does not limit the structure of the convolutional network; specifically, it can be a two-dimensional convolutional network, which offers high performance and fast speed in image feature extraction (see the sketch below).
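  • A minimal sketch of per-image feature-map extraction with a pretrained 2D convolutional backbone; the choice of ResNet-18 (with its classifier head removed) is an assumption, not mandated by the patent.

```python
import torch
import torchvision

# Hypothetical 2D convolutional backbone: ResNet-18 without its average
# pooling and fully connected layers, so it outputs spatial feature maps.
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet18(weights="DEFAULT").children())[:-2]
)
backbone.eval()

@torch.no_grad()
def extract_feature_maps(images: torch.Tensor) -> torch.Tensor:
    """images: (n, 3, H, W) backboard area images in temporal order.
    Returns (n, C, h, w): one feature map per image, forming the sequence."""
    return backbone(images)
```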
  • the shot classification network can extract features from the time dimension based on the self-attention mechanism.
  • since each backboard area image is determined from a video frame, and the video frames have a temporal sequence in the video to be identified, a temporal sequence relationship (that is, a temporal position relationship) can be determined between the different backboard area images in the sequence to be identified; features can therefore be extracted in the time dimension based on the self-attention mechanism, and the associations between different backboard area images can be learned in the time dimension.
  • the position encoding may include information characterizing the temporal positional relationship between different backboard area images in the backboard area image sequence to be identified.
  • the temporal position relationship between different backboard area images can be the same as the temporal position relationship between video frames where the backboard area image is located.
  • the shot classification network can be used to: add the above position coding to the backboard area image sequence to be identified, so as to determine the temporal position relationship between different backboard area images; and, for the backboard area image sequence to be identified after position coding is added, extract features from the time dimension based on the self-attention mechanism to obtain classification features.
  • in some embodiments, the shot classification network can extract feature maps for the backboard area images in advance to obtain a feature map sequence.
  • in that case, the shot classification network can be used to add position coding to the feature map sequence; the position coding includes information characterizing the temporal position relationship between different feature maps in the sequence; the temporal position relationship between different feature maps can be the same as the temporal position relationship between the video frames where the corresponding backboard area images are located.
  • the shot classification network can be used to extract features from the time dimension based on the self-attention mechanism for the feature map sequence after adding position encoding to obtain classification features.
  • features can be extracted from the time dimension based on the self-attention mechanism to improve the feature extraction effect of the shot classification network and improve the recognition accuracy of the shot classification network.
  • the position coding may include information characterizing the temporal positional relationship between different backboard area images in the backboard area image sequence to be identified, and may specifically include the timestamp, in the video to be identified, of the video frame where each backboard area image is located.
  • information characterizing the temporal positional relationship between the video frames where the backboard area images are located can thus be added as position coding, so that this relationship can be used to improve the feature extraction effect of the shot classification network and improve its recognition accuracy.
  • the shot classification network can extract features from the spatial dimension based on the self-attention mechanism.
  • there is a spatial position relationship between different pixels in a single backboard area image, which can be characterized by two-dimensional spatial coordinates.
  • features can be extracted based on the self-attention mechanism in the spatial dimension, and the association between different image contents in a single backboard area image can be learned.
  • the position encoding may include information characterizing the spatial positional relationship between image content at different positions in each backboard area image in the backboard area image sequence to be identified.
  • it may include information characterizing the spatial positional relationship between different pixels in each backboard area image in the backboard area image sequence to be identified.
  • the shot classification network can be used to: add the above position coding to the backboard area image sequence to be identified, so as to determine the spatial position relationship between the image content at different positions in each backboard area image; and, for the backboard area image sequence to be identified after position coding is added, extract features from the spatial dimension based on the self-attention mechanism to obtain classification features.
  • in some embodiments, the shot classification network can extract feature maps for the backboard area images in advance to obtain a feature map sequence.
  • in that case, the shot classification network can be used to: add position coding to the feature map sequence, where the position coding includes information characterizing the spatial position relationship between different feature points in each feature map; and, for the feature map sequence after position encoding is added, extract features from the spatial dimension based on the self-attention mechanism to obtain classification features.
  • features can be extracted from the spatial dimension based on the self-attention mechanism to improve the feature extraction effect of the shot classification network and improve the recognition accuracy of the shot classification network.
  • the shot classification network can extract features from the spatial dimension and the temporal dimension based on the self-attention mechanism.
  • the feature extraction effect of the shot classification network can be improved, and the recognition accuracy of the shot classification network can be improved.
  • the position coding may include information characterizing the spatial positional relationship between image contents at different positions in each backboard area image in the backboard area image sequence to be identified, and information characterizing the temporal positional relationship between different backboard area images in that sequence.
  • the temporal position relationship between different backboard area images can be the same as the temporal position relationship between video frames where the backboard area image is located.
  • the shot classification network can be used to: add the above position coding to the image sequence of the backboard area to be identified; and, for the image sequence after position coding is added, extract features from the spatial dimension and the time dimension based on the self-attention mechanism to obtain classification features.
  • This embodiment does not limit the order and times of extracting features from the spatial dimension and extracting features from the time dimension.
  • in some embodiments, the shot classification network can extract feature maps for the backboard area images in advance to obtain a feature map sequence.
  • in that case, the shot classification network can be used to add position coding to the feature map sequence;
  • the position coding includes: information characterizing the spatial position relationship between different feature points of each feature map in the feature map sequence, and information characterizing the temporal position relationship between different feature maps in the feature map sequence; for the feature map sequence after position encoding is added, features are extracted from the spatial dimension and the time dimension based on the self-attention mechanism to obtain classification features.
  • that is, extracting classification features based on the self-attention mechanism may include: adding position coding to the feature map sequence, the position coding including information characterizing the spatial position relationship between different feature points of each feature map in the feature map sequence and information characterizing the temporal position relationship between different feature maps in the feature map sequence; and, for the feature map sequence after position coding is added, extracting features from the spatial dimension and the time dimension based on the self-attention mechanism to obtain the classification features.
  • to this end, the feature maps can be integrated into an overall feature.
  • this overall feature has both a time dimension and a space dimension, so that, after a simple feature conversion, features can conveniently be extracted from the temporal and spatial dimensions based on the self-attention mechanism.
  • each feature map in the feature map sequence can be converted into a one-dimensional feature; the one-dimensional features can be stacked and converted to obtain a two-dimensional feature.
  • the two-dimensional feature is the integrated overall feature.
  • This embodiment does not limit the method of converting feature maps into one-dimensional features.
  • converting a feature map into a one-dimensional feature can be done by adding all the feature points in the feature map, in order, to the one-dimensional feature.
  • for a feature map of size a*b, a one-dimensional feature of length c can be obtained through conversion, where c = a*b.
  • This embodiment does not limit the way of stacking one-dimensional features.
  • the one-dimensional features can be stacked according to the temporal relationship between the corresponding feature maps to obtain a two-dimensional feature.
  • for example, stacking a one-dimensional features of length b yields a two-dimensional feature of size a*b.
  • the feature map sequence as a whole can be regarded as an overall feature and then adjusted.
  • the feature map sequence includes n feature maps of size a*b, and the entire feature map sequence can be regarded as a three-dimensional feature of a*b*n.
  • each a*b feature map can be converted into a 1*c one-dimensional feature, that is, the a*b*n three-dimensional feature can be converted into a c*n two-dimensional feature, where c = a*b.
  • adding position coding to the feature map sequence may thus include: converting each feature map in the feature map sequence into a one-dimensional feature; performing stack conversion processing on the one-dimensional features to obtain a two-dimensional feature; and adding position encoding to the obtained two-dimensional feature.
  • specifically, the one-dimensional features converted from the individual feature maps in the feature map sequence can be stacked to obtain the two-dimensional feature.
  • adding position coding to the two-dimensional feature can mean: for each one-dimensional feature stacked in the two-dimensional feature, adding to each feature point information representing the spatial position relationship, which may specifically include the feature point's coordinate information in the feature map; and, for each one-dimensional feature as a whole, adding information representing the temporal position relationship, which may specifically include timestamps (see the sketch below).
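  • A hedged PyTorch reading of the flatten, stack, and encode steps, using learned position embeddings added along the spatial and temporal axes; the patent leaves the concrete form of the encoding open, so the zero-initialized parameters below are an assumption, and the sizes are illustrative.

```python
import torch

n, C, h, w = 8, 256, 7, 7                # illustrative: n maps of C x h x w
maps = torch.randn(n, C, h, w)           # feature map sequence

tokens = maps.flatten(2).transpose(1, 2) # (n, h*w, C): one feature per map position
spatial_pos = torch.nn.Parameter(torch.zeros(1, h * w, C))  # spatial position code
temporal_pos = torch.nn.Parameter(torch.zeros(n, 1, C))     # temporal position code
encoded = tokens + spatial_pos + temporal_pos  # broadcast over maps and positions
```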
  • extracting features from the spatial dimension and the time dimension can then include: for the two-dimensional feature after position encoding is added, extracting features from the spatial dimension and the time dimension based on the self-attention mechanism.
  • the sequence in the spatial dimension or the sequence in the time dimension can be obtained by transposing, so as to facilitate extracting features based on the self-attention mechanism from the time dimension and the spatial dimension respectively.
  • for example, a two-dimensional feature of size a*b can be obtained by stacking and then adding position encoding; between the a one-dimensional features there is a temporal position relationship, and within each one-dimensional feature of length b there is a spatial position relationship.
  • features can be extracted from the time dimension based on the self-attention mechanism.
  • the a*b two-dimensional feature can then be transposed to obtain a b*a two-dimensional feature, so that features can be extracted from the spatial dimension based on the self-attention mechanism.
  • since the self-attention mechanism keeps the input and output features the same size, features can also be extracted serially from the spatial dimension based on the self-attention mechanism first, the extracted features can then be transposed, and features can then be serially extracted from the temporal dimension based on the self-attention mechanism.
  • for example, a two-dimensional feature of size a*b can be obtained by stacking and then adding position encoding; between the a one-dimensional features there is a temporal position relationship, and within each one-dimensional feature of length b there is a spatial position relationship.
  • features can first be extracted from the time dimension based on the self-attention mechanism to obtain a first feature of size a*b; the first feature can then be transposed to obtain a second feature of size b*a; and, for the second feature, features can be extracted from the spatial dimension based on the self-attention mechanism (see the sketch below).
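  • A hedged sketch of this transpose trick with standard multi-head self-attention; the 8-step, 49-position, 256-channel sizes are illustrative, and nn.MultiheadAttention stands in for whatever attention layer the network actually trains.

```python
import torch

dim = 256
attn_time = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
attn_space = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

x = torch.randn(49, 8, dim)  # 49 spatial positions x 8 time steps x channels
x, _ = attn_time(x, x, x)    # attend over the time dimension ("first feature")
x = x.transpose(0, 1)        # transpose: 8 x 49 x channels ("second feature")
x, _ = attn_space(x, x, x)   # attend over the spatial dimension
```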
  • the process of this method does not limit the order between extracting features based on the self-attention mechanism from the spatial dimension and extracting features based on the self-attention mechanism from the temporal dimension.
  • features can be extracted from the spatial dimension based on the self-attention mechanism in parallel, and features can be extracted from the temporal dimension based on the self-attention mechanism, and then the extracted features can be synthesized to obtain classification features.
  • This embodiment does not limit the method of integrating features.
  • for example, the features can be spliced: the features extracted from the spatial dimension based on the self-attention mechanism and the features extracted from the temporal dimension based on the self-attention mechanism can be spliced into one feature, which serves as the classification feature.
  • features can be serially extracted at least once from the spatial dimension based on the self-attention mechanism, and features can be serially extracted at least once from the temporal dimension based on the self-attention mechanism, and then the extracted features can be synthesized to obtain classification features.
  • serially extracting features at least once from the spatial dimension based on the self-attention mechanism may include: determining the two-dimensional feature with position encoding added as the current feature; and looping over the following steps until a preset number of cycles is reached: extracting features from the spatial dimension based on the self-attention mechanism for the current feature, and determining the extracted features as the current feature.
  • the preset number of times may be at least once.
  • serially extracting features at least once from the time dimension based on the self-attention mechanism may include: determining the two-dimensional feature with position encoding added as the current feature; and looping over the following steps until a preset number of cycles is reached: extracting features from the time dimension based on the self-attention mechanism for the current feature, and determining the extracted features as the current feature.
  • the preset number of times may be at least once.
  • features can be extracted serially from the spatial dimension and the temporal dimension based on the self-attention mechanism.
  • This embodiment does not limit the order and times between extracting features from the spatial dimension based on the self-attention mechanism and extracting features from the temporal dimension based on the self-attention mechanism.
  • multiple features can be extracted serially from the spatial dimension based on the self-attention mechanism, multiple features can be extracted serially from the time dimension based on the self-attention mechanism, or features can be extracted serially multiple times based on the self-attention mechanism while alternating between the spatial dimension and the time dimension.
  • for example, features can be serially extracted at least once from the spatial dimension based on the self-attention mechanism, and then, for the extracted features, features can be serially extracted at least once from the time dimension based on the self-attention mechanism.
  • for the features so extracted, features can again be extracted at least once in series from the spatial dimension based on the self-attention mechanism, and then at least once in series from the time dimension based on the self-attention mechanism.
  • specifically, a preset number of feature extraction steps can be performed serially for the feature map sequence with position encoding added.
  • the feature extraction step may include: for the input features, serially extracting features at least once from the spatial dimension based on the self-attention mechanism, and then, for the extracted features, serially extracting features at least once from the time dimension based on the self-attention mechanism.
  • alternatively, the feature extraction step may include: for the input features, serially extracting features at least once from the time dimension based on the self-attention mechanism, and then, for the extracted features, serially extracting features at least once from the spatial dimension based on the self-attention mechanism.
  • the output of any feature extraction step can be input into the subsequent feature extraction step.
  • different feature extraction steps executed serially may be different from each other.
  • for example, the weights of the self-attention mechanism may differ, or the number or order of feature extractions may differ.
  • specifically, the feature map sequence with position coding added can be determined as the current feature, and the following steps can be executed in a loop until a preset loop stop condition is met: for the current feature, serially extract features at least once from the spatial dimension based on the self-attention mechanism; then, for the extracted features, serially extract features at least once from the time dimension based on the self-attention mechanism; and determine the extracted features as the current feature.
  • This embodiment does not limit specific preset loop stop conditions.
  • the preset loop stop condition may include at least one of the following: the loop reaches a preset number of times, the total number of feature extraction times reaches a preset number of times, the time taken to extract features reaches a preset length of time, etc.
  • in different loop iterations, the way features are serially extracted from the spatial dimension based on the self-attention mechanism can differ; specifically, the weights of the self-attention mechanism may differ, or the number of serial extractions may differ.
  • when the preset loop stop condition is met, the loop is stopped; for example, the loop may be stopped after features have been serially extracted at least once from the spatial dimension based on the self-attention mechanism.
  • the number of times of feature extraction can be different in different loop processes.
  • the number of times to extract features is not specifically limited.
  • for example, in the first cycle, features can be serially extracted three times from the spatial dimension based on the self-attention mechanism for the current feature, and then features can be serially extracted twice from the time dimension based on the self-attention mechanism for the extracted features.
  • in the second cycle, features can be serially extracted once from the spatial dimension based on the self-attention mechanism for the current feature, and then features can be serially extracted five times from the time dimension based on the self-attention mechanism for the extracted features.
  • alternatively, the feature map sequence with position coding added can be determined as the current feature, and the following steps can be executed in a loop until the preset loop stop condition is met: for the current feature, serially extract features at least once from the time dimension based on the self-attention mechanism; then, for the extracted features, serially extract features at least once from the spatial dimension based on the self-attention mechanism; and determine the extracted features as the current feature.
  • in implementation, features can be serially extracted at least once from the spatial dimension based on the self-attention mechanism, and the extracted features can then be transposed, so that features can be serially extracted at least once from the time dimension based on the self-attention mechanism; the extracted features can then be transposed again, and, for the transposed features, features can be serially extracted at least once from the spatial dimension based on the self-attention mechanism.
  • a preset self-attention module in the shot classification network can be used to implement the steps of extracting features based on the self-attention mechanism.
  • the preset self-attention module can be used to extract features from the spatial dimension and/or the temporal dimension and output them based on the self-attention mechanism for the input features.
  • one or more preset self-attention modules may be included in the shot classification network.
  • if the shot classification network needs to extract features from the spatial dimension based on the self-attention mechanism,
  • one or more preset self-attention modules can be preset.
  • extracting features from the spatial dimension based on the self-attention mechanism may then include: inputting the feature map sequence with position coding added into the one or more preset self-attention modules, and obtaining the output features.
  • in the case of N cascaded preset self-attention modules, N ≥ 2,
  • for the i-th preset self-attention module, 1 ≤ i ≤ N-1,
  • its output can be cascaded to the input of the (i+1)-th preset self-attention module.
  • similarly, if the shot classification network needs to extract features from the time dimension based on the self-attention mechanism,
  • one or more preset self-attention modules can be preset.
  • if the shot classification network needs to extract features from the spatial and temporal dimensions based on the self-attention mechanism,
  • the preset self-attention modules can include one or more modules that, for the input features, serially extract features at least once from the spatial dimension based on the self-attention mechanism and output them.
  • specifically, the shot classification network may include N cascaded preset self-attention modules, N ≥ 2; for the i-th preset self-attention module, 1 ≤ i ≤ N-1, its output is cascaded to the input of the (i+1)-th preset self-attention module.
  • the preset self-attention module can be used to extract features from the spatial and temporal dimensions based on the self-attention mechanism.
  • extracting features from the spatial dimension and the time dimension based on the self-attention mechanism may include: inputting the feature map sequence with position coding added into the first preset self-attention module, and determining the classification features based on the output of the N-th preset self-attention module (see the sketch below).
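  • A hedged sketch of N cascaded modules, with a simple residual space-time attention block standing in for the trained "preset self-attention module"; all sizes and the choice N = 4 are illustrative.

```python
import torch

class SpaceTimeBlock(torch.nn.Module):
    """Illustrative preset self-attention module: spatial attention, then
    temporal attention, each with a residual connection."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn_space = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_time = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                    # x: (time, space, dim)
        x = x + self.attn_space(x, x, x)[0]  # attend over spatial positions
        x = x.transpose(0, 1)                # (space, time, dim)
        x = x + self.attn_time(x, x, x)[0]   # attend over time steps
        return x.transpose(0, 1)             # back to (time, space, dim)

N = 4  # the patent only requires N >= 2
cascade = torch.nn.Sequential(*[SpaceTimeBlock() for _ in range(N)])
# the output of module i feeds the input of module i+1; the classification
# features are then determined from the N-th module's output.
```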
  • This embodiment does not limit the order and number of times, within a preset self-attention module, of extracting features from the spatial dimension and extracting features from the temporal dimension based on the self-attention mechanism.
  • for example, the preset self-attention module can be used to serially execute a preset number of feature extraction steps for the input features.
  • the preset self-attention module can be used to: for the input features, serially extract features at least once from the spatial dimension based on the self-attention mechanism; then, for the extracted features, serially extract features at least once from the temporal dimension based on the self-attention mechanism; and output the extracted features.
  • alternatively, the preset self-attention module can be used to: for the input features, serially extract features at least once from the time dimension based on the self-attention mechanism; then, for the extracted features, serially extract features at least once from the spatial dimension based on the self-attention mechanism; and output the extracted features.
  • the preset self-attention module can also be used to: determine the input feature as the current feature; and perform the following steps in a loop until the preset loop stop condition is met: for the current feature, serially extract features at least once from the spatial dimension based on the self-attention mechanism, then serially extract features at least once from the time dimension based on the self-attention mechanism for the extracted features, and determine the extracted features as the current feature.
  • alternatively, the preset self-attention module can be used to: determine the input feature as the current feature; and perform the following steps in a loop until the preset loop stop condition is met: for the current feature, serially extract features at least once from the time dimension based on the self-attention mechanism, then serially extract features at least once from the spatial dimension based on the self-attention mechanism for the extracted features, and determine the extracted features as the current feature.
  • the weights of the self-attention mechanism can differ between different preset self-attention modules; specifically, the weights and other parameters of each preset self-attention module can be determined through model training.
  • the order and number of features extracted from the spatial dimension based on the self-attention mechanism and from the temporal dimension based on the self-attention mechanism can be different between different preset self-attention modules.
  • for example, one preset self-attention module can be used to: first serially extract features at least once from the spatial dimension based on the self-attention mechanism for the input features, and then serially extract features at least once from the time dimension based on the self-attention mechanism for the extracted features, and output them.
  • another preset self-attention module can be used to: first serially extract features at least once from the time dimension based on the self-attention mechanism for the input features, and then serially extract features at least once from the spatial dimension based on the self-attention mechanism for the extracted features, and output them.
  • the preset self-attention module may be a self-attention layer, which is used to extract features based on the self-attention mechanism for input features.
  • the shot classification network can include one self-attention layer, or multiple self-attention layers in series, so that the parameters of the self-attention layers can be determined through model training.
  • the parameters of a self-attention layer include its weights; after model training, the parameters of different self-attention layers usually differ.
  • the self-attention layer can be used to extract features based on the self-attention mechanism from the spatial and temporal dimensions of the input features.
  • for example, features can first be serially extracted at least once from the spatial dimension based on the self-attention mechanism for the input features, and then features can be serially extracted at least once from the time dimension based on the self-attention mechanism for the extracted features and output; it is also possible to first serially extract features at least once from the time dimension based on the self-attention mechanism for the input features, and then serially extract features at least once from the spatial dimension based on the self-attention mechanism for the extracted features and output them.
  • This method flow does not limit the specific way to obtain classification features.
  • the classification features may be extracted based on the self-attention mechanism for the backboard area image sequence.
  • the above embodiment explains the features extracted based on the self-attention mechanism for the backboard area image sequence.
  • This embodiment does not limit the way classification features are obtained from the features extracted by the self-attention mechanism.
  • features extracted based on the self-attention mechanism can be synthesized to obtain classification features.
  • features extracted based on the self-attention mechanism can be directly determined as classification features.
  • the features extracted based on the self-attention mechanism can also be pooled, specifically average pooling or maximum pooling, to obtain classification features.
  • classification initial features can be added to the features input to the self-attention mechanism.
  • the classification initial features are not features from the backboard area image sequence to be identified.
  • the classification initial features can be used to comprehensively represent the feature information in the backboard area image sequence to be identified that the self-attention mechanism learns through multiple rounds of feature extraction, without affecting the original features in the backboard area image sequence to be identified.
  • for example, the self-attention mechanism may learn which backboard area image content has the greater correlation with the shot recognition result in the spatial dimension, or which image content has the greater correlation with the shot recognition result in the time dimension.
  • the current representation corresponding to the initial feature of classification can be determined among the features extracted based on the self-attention mechanism.
  • the current representation corresponding to the initial classification feature can be determined as the classification feature and used for subsequent prediction of the shot recognition result.
  • extracting classification features based on the self-attention mechanism for the feature map sequence may include: adding classification initial features to the feature map sequence; extracting features based on the self-attention mechanism for the feature map sequence after the classification initial features are added; and, from the extracted features, determining the current representation corresponding to the classification initial features as the classification feature.
  • the initial classification features are not features in the image sequence of the backboard area to be identified. Therefore, position coding can be set for the initial classification features to facilitate subsequent feature extraction based on the self-attention mechanism.
  • position coding can be added to the feature map sequence after adding the initial features for classification.
  • the classification initial features can be added first, and then the position coding can be added.
  • the position encoding set for the classification initial features usually lies outside the position range of the feature map sequence itself.
  • the classification initial features can be added before the first feature map in the feature map sequence, or after the last feature map.
  • the classification initial features can be added before the first feature point in each feature map in the feature map sequence, or after the last feature point.
  • this method process does not limit the specific form.
  • the classification initial features may belong to parameters in the shot classification network, and the specific values of the classification initial features may be determined through model training.
  • during training, the values of the classification initial features can be continuously adjusted; after training is completed, the final adjusted values are fixed and used for shot recognition.
  • the smaller the size of the classification initial features, the fewer computing resources are required for their adjustment, which can also improve the stability of model training.
  • the size of the initial classification features can be 1*1.
  • the classification initial features can be copied multiple times to meet the feature-size requirements of the self-attention mechanism. Specifically, this can be implemented as a broadcast operation.
  • if the feature size that needs to be expanded is 1*N, the 1*1 classification initial feature can be copied N times and combined into a 1*N feature, which is then added to the features used in the subsequent self-attention feature extraction steps.
  • the classification initial features can be easily adjusted during the model training process to improve the stability of the model training.
  • classification initial features can also be of other sizes.
  • the size of the initial features for classification can include one or more feature maps, which can be added to the feature map sequence as new features in the time dimension.
  • the size of the initial features for classification may include one or more newly added feature points at the same position in each feature map as new features in the spatial dimension of the feature map sequence.
  • the size of the classification initial features can include one or more feature maps, plus one or more new feature points at the same position in each feature map, serving as new features of the feature map sequence in both the time and space dimensions.
  • the features required to be extended by the self-attention mechanism can be one or more feature maps. This allows multiple copies of the initial classification features to be combined into a feature map for feature expansion.
  • the features that the self-attention mechanism requires to be expanded can be one or more feature points added at the same position in each feature map. Therefore, multiple copies of the classification initial features can be combined into feature points at the same position in each feature map for feature expansion.
  • one or more one-dimensional features can be added based on the one-dimensional features that have temporal relationships among the two-dimensional features, and the newly added one-dimensional features are the initial classification features.
  • one or more feature points can be added based on each one-dimensional feature that has a temporal relationship in the two-dimensional feature, and the added feature points are the initial features for classification.
  • one or more one-dimensional features can be added alongside the temporally related one-dimensional features in the two-dimensional features, and one or more feature points can further be added to each one-dimensional feature of the current two-dimensional features.
  • the newly added feature part is the classification initial feature.
  • a two-dimensional feature of size a*b can be obtained by stacking, with position encoding then added to the two-dimensional feature. Among the a one-dimensional features there is a temporal position relationship; within each one-dimensional feature of length b there is a spatial position relationship.
  • the initial feature for classification can be the feature of 1*b, which is added to the two-dimensional feature of a*b to obtain the two-dimensional feature of (a+1)*b.
  • the initial feature for classification can be the feature of 1*a, which is added to the two-dimensional feature of a*b to obtain the two-dimensional feature of a*(b+1).
  • the classification initial feature can also be a 1*1 feature. For an a*b two-dimensional feature that needs to be expanded to (a+1)*b, the classification initial feature can be copied b times and combined into a 1*b feature, which is added to the a*b two-dimensional feature to obtain the (a+1)*b two-dimensional feature. Specifically, the copies can be produced by a broadcast operation.
  • the two-dimensional features after adding the classification initial features can be the two-dimensional features of (a+1)*(b+1), and the new part compared to the two-dimensional features of a*b can be the classification initial features.
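  • A minimal sketch of the broadcast-and-concatenate step described above, assuming the a*b two-dimensional feature is a plain tensor and using a learnable 1*1 parameter for the classification initial feature:

```python
import torch
import torch.nn as nn

class ClassTokenExpander(nn.Module):
    """Sketch: broadcast a 1*1 learnable classification initial feature to a
    1*b row and prepend it to an a*b two-dimensional feature, giving (a+1)*b."""

    def __init__(self):
        super().__init__()
        self.cls_init = nn.Parameter(torch.zeros(1, 1))  # 1*1 trainable initial feature

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        a, b = feats.shape
        cls_row = self.cls_init.expand(1, b)        # broadcast copy to 1*b
        return torch.cat([cls_row, feats], dim=0)   # (a+1)*b two-dimensional feature

# After self-attention, row 0 holds the current representation of the
# classification initial feature, i.e. the classification feature.
```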
  • This method flow does not limit the specific method of determining the shot recognition result based on the classification features.
  • when the classification features are obtained, they can be input into a pre-trained fully connected network to obtain the shot recognition result output by the fully connected network.
  • a fully connected network can be used to predict the shot recognition result based on the input features; specifically, it can predict the shot recognition result based on the input classification features.
  • other model structures can also be used to predict the shot recognition result from the classification features; this embodiment does not specifically limit the choice.
  • classification features can be processed, and then a fully connected network can be used to predict the shot recognition result.
  • the shot classification network includes a fully connected network.
  • the fully connected network is also trained, so that the fully connected network can be used to predict the shot recognition results and improve the prediction accuracy and prediction efficiency.
  • the fully connected network trained in the shot classification network can be used to predict the shot recognition results based on the classification features.
  • the classification features can be pooled.
  • average pooling or maximum pooling can be performed to obtain classification features of the preset feature size, and the classification features of the preset feature size can then be input into the pre-trained fully connected network to obtain the shot recognition result output by the fully connected network.
  • the preset feature size can be smaller than the original size of the classification feature, thereby reducing the amount of data and calculation and improving prediction efficiency.
  • determining the shot recognition result based on the classification features may include: performing pooling processing on the classification features to obtain to-be-input features of a preset feature size; and inputting the to-be-input features into the pre-trained fully connected network to obtain the shot recognition result output by the fully connected network.
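  • A hedged sketch of such a prediction head, assuming average pooling to a preset size of 64 and a two-class output (shot successful / unsuccessful); all layer widths are assumptions:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch: average-pool the classification feature to a preset size,
    then predict the shot recognition result with a fully connected network."""

    def __init__(self, pooled_dim: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(pooled_dim)   # preset feature size
        self.fc = nn.Sequential(
            nn.Linear(pooled_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 2),                          # successful / unsuccessful shot
        )

    def forward(self, cls_feature: torch.Tensor) -> torch.Tensor:
        # cls_feature: (batch, dim); pool down to the preset size, then classify
        pooled = self.pool(cls_feature.unsqueeze(1)).squeeze(1)
        return self.fc(pooled)                         # logits over the two classes
```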
  • This embodiment does not limit the specific form of the shot recognition result, as long as it can indicate whether the shot is successful.
  • the process of this method can be based on the shot classification network to perform shot recognition on the video to be recognized.
  • a video clip of the shooting process can be obtained as the video to be identified, and the shot classification network is used to perform shot recognition.
  • the video to be identified may also include longer basketball game clips, which may include one or more shooting processes.
  • the video to be identified can be divided into multiple segments, so that shot recognition can be performed on multiple segments respectively.
  • when the video to be recognized includes multiple shooting processes, dividing it into segments for separate shot recognition can improve the accuracy of shot recognition, and also makes it possible to locate the video position of each shooting process and of each successful shot.
  • This embodiment does not limit the way in which the video to be recognized is divided into segments.
  • the video to be recognized can be directly divided into multiple segments of preset segment duration.
  • a sliding window mechanism can be used to divide the segments.
  • this embodiment does not limit the order in which shot recognition is performed.
  • shot recognition can be performed in parallel for different divided video segments, or shot recognition can be performed serially.
  • the image sequence of the backboard area to be identified can be directly divided and shot recognition can be performed separately.
  • inputting the backboard area image sequence to be identified into the shot classification network and obtaining the shot recognition result output by the shot classification network may include: dividing the backboard area image sequence to be identified into multiple backboard area image subsequences, inputting each resulting subsequence into the shot classification network, and obtaining the shot recognition result output by the shot classification network.
  • This embodiment does not limit the way of dividing the backboard area image sub-sequence.
  • a sliding window mechanism can be used. Since there may be overlapping parts between different segments divided by the sliding window mechanism, when the backboard area image sequence to be identified is divided into subsequences based on the sliding window mechanism, there is no need to repeatedly determine the backboard area for the video frames in the overlapping parts, which improves efficiency and saves computing resources.
  • extracting classification features based on the self-attention mechanism may include: for the feature map sequence, determining the first m feature maps as the feature map subsequence contained in the current sliding window, m≥1; then performing the following steps in a loop until the current sliding window can no longer move backward: extracting classification features based on the self-attention mechanism for the feature map subsequence contained in the current sliding window; moving the sliding window backward by a preset sliding step.
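  • A minimal sketch of this sliding-window loop over the feature map sequence (the helper name is hypothetical):

```python
def sliding_windows(feature_maps, m, step=1):
    """Yield length-m feature-map subsequences, sliding by `step`, until the
    window can no longer move backward (i.e. would run past the sequence)."""
    start = 0
    while start + m <= len(feature_maps):
        yield feature_maps[start:start + m]
        start += step

# e.g. 10 feature maps with m=3 and step=1 -> (10 - 3) // 1 + 1 = 8 subsequences
```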
  • This embodiment does not limit the order in which subsequent shot recognition steps are performed for different feature map subsequences divided by the sliding window mechanism.
  • the subsequent shot recognition steps can be performed in parallel for the different feature map subsequences divided by the sliding window mechanism, or executed serially. Executing the subsequent shot recognition steps in parallel can improve the efficiency of shot recognition.
  • classification features are extracted based on the self-attention mechanism for the feature map subsequence contained in the current sliding window.
  • extracting classification features based on the self-attention mechanism for the feature map subsequence contained in the current sliding window may include: adding position coding to the feature map subsequence contained in the current sliding window, where the position coding may include information characterizing the spatial position relationship between different feature points of each feature map in the subsequence, and information characterizing the temporal position relationship between different feature maps in the subsequence; and, for the subsequence after position coding is added, extracting features from the spatial and temporal dimensions based on the self-attention mechanism to obtain classification features.
  • adding position coding to the feature map subsequence may include: converting each feature map in the feature map subsequence into a one-dimensional feature; performing stack conversion processing on the one-dimensional features to obtain a two-dimensional feature; and adding position coding to the obtained two-dimensional feature.
  • stacking conversion processing may be performed on all or part of the converted one-dimensional features to obtain two-dimensional features, and then position coding may be added to the obtained two-dimensional features.
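  • A toy sketch of this flatten-stack-encode preprocessing, assuming the feature maps arrive as a (T, H, W) tensor; the additive index-based encoding below is a stand-in for whatever learned or fixed position encoding the network actually uses:

```python
import torch

def flatten_stack_encode(fmaps: torch.Tensor) -> torch.Tensor:
    """Flatten each H*W feature map into a one-dimensional feature, stack the
    results into a two-dimensional feature, and add an additive position
    encoding carrying temporal (row) and spatial (column) information."""
    T, H, W = fmaps.shape
    stacked = fmaps.reshape(T, H * W)                             # T one-dimensional features
    t_pos = torch.arange(T, dtype=fmaps.dtype).unsqueeze(1)       # temporal positions per row
    s_pos = torch.arange(H * W, dtype=fmaps.dtype).unsqueeze(0)   # spatial positions per column
    return stacked + t_pos + s_pos                                # position-coded 2D feature
```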
  • extracting classification features based on the self-attention mechanism for the feature map subsequence may include: adding classification initial features to the feature map subsequence; extracting features based on the self-attention mechanism for the feature map subsequence after the classification initial features are added; and, from the extracted features, determining the current representation corresponding to the classification initial features as the classification feature.
  • the video clips in the video to be identified represented by the corresponding sub-sequences can be determined based on the shot recognition results, so that the video location of the successful shot can be located.
  • the number of successful shots in the video to be recognized can also be determined based on the number of shot recognition results that represent successful shots, so as to facilitate subsequent calculation of scores in basketball games.
  • the structure of the shot classification network is explained below.
  • the process of this method does not limit the specific structure of the shot classification network.
  • the following explanations are provided for illustrative purposes.
  • the shot classification network can be used to: extract classification features based on the self-attention mechanism for the input backboard area image sequence; and determine the shot recognition result based on the classification features.
  • the shot classification network may include a module for extracting classification features based on a self-attention mechanism, and a module for determining a shot recognition result based on the classification features.
  • the extracted feature map sequence needs to be preprocessed, for example with operations such as adding position coding, dividing subsequences, and adding classification initial features.
  • the shot classification network is also used to further extract classification features based on the self-attention mechanism based on the preprocessed feature map sequence, and finally predict the shot recognition results based on the classification features.
  • the structure of the shot classification network may include a feature map extraction module, a feature preprocessing module, a self-attention feature extraction module and a prediction module.
  • the structure of the shot classification network is not specifically limited, and this embodiment is only used for illustrative explanation.
  • the feature map extraction module may be used to extract a feature map sequence of the input backboard area image sequence, and output the extracted feature map sequence.
  • the feature preprocessing module can be used to preprocess the feature map sequence output by the feature map extraction module.
  • the feature preprocessing module can be used to: add position coding to the feature map sequence output by the feature map extraction module; and output the feature map sequence after adding position coding.
  • the position coding may include: information characterizing the spatial position relationship between different feature points of each feature map in the feature map sequence, and information characterizing the temporal position relationship between different feature maps in the feature map sequence.
  • the feature preprocessing module can also be used to: add classification initial features to the feature map sequence output by the feature map extraction module; and output the feature map sequence after adding the classification initial features.
  • the feature preprocessing module can also be used to: add classification initial features and position coding to the feature map sequence output by the feature map extraction module; and output the feature map sequence after adding classification initial features and position coding.
  • the order in which the classification initial features and position codes are added is not limited. Specifically, the initial classification features can be added first, and then the position coding can be added.
  • the preprocessing in the feature preprocessing module may include at least adding position coding, thereby facilitating subsequent feature extraction based on the self-attention mechanism.
  • the feature preprocessing module can also be used to divide the feature map sequence output by the feature map extraction module into feature map subsequences, and the divided feature map subsequences can be directly output.
  • the feature preprocessing module can also be used to: divide the feature map subsequences for the feature map sequence output by the feature map extraction module; for each divided feature map subsequence, add classification initial features and/or position codes and output them.
  • the sliding window mechanism can be used to divide the feature map subsequences.
  • the feature preprocessing module can be used to: for the feature map sequence output by the feature map extraction module, determine the first m feature maps as the feature map subsequence contained in the current sliding window, m≥1; then perform the following steps in a loop until the current sliding window can no longer move backward: for the feature map subsequence contained in the current sliding window, add classification initial features and/or position coding and output the result; move the sliding window backward by the preset sliding step.
  • the self-attention feature extraction module can be used to: take the position-coded feature map sequence or feature map subsequence output by the feature preprocessing module, perform feature extraction based on the self-attention mechanism, and output classification features.
  • Whether feature extraction is performed from the spatial dimension or the time dimension can be determined based on the added position coding.
  • alternatively, the self-attention feature extraction module can be used to: add position coding to the feature map sequence or feature map subsequence output by the feature preprocessing module, perform feature extraction based on the self-attention mechanism, and output classification features.
  • the self-attention feature extraction module includes a preset self-attention module, or a plurality of cascaded preset self-attention modules.
  • the prediction module can be used to predict the shot recognition result based on the classification features output by the self-attention feature extraction module.
  • a fully connected network can be used for prediction.
  • Figure 2 is a schematic structural diagram of a shot classification network according to an embodiment of the present invention.
  • the shot classification network can include: feature map extraction module, feature preprocessing module, self-attention feature extraction module and prediction module.
  • the output of the feature map extraction module can be cascaded to the input of the feature preprocessing module.
  • the output of the feature preprocessing module can be cascaded to the input of the self-attention feature extraction module.
  • the self-attention feature extraction module can include multiple cascaded preset self-attention modules, and the output of the self-attention feature extraction module can be cascaded to the input of the prediction module.
  • the backboard area image sequence can be input into the shot classification network, that is, into the feature map extraction module.
  • the output of the shot classification network, that is, the output of the prediction module, is the shot recognition result, which can be used to determine whether a shot was successful.
  • Figure 3 is a schematic principle diagram of a shot classification network according to an embodiment of the present invention.
  • the shot classification network can include: feature map extraction module, feature preprocessing module, self-attention feature extraction module and prediction module.
  • the feature map extraction module can be used to: for each backboard area image in the input backboard area image sequence, use the pre-trained two-dimensional CNN network to extract a two-dimensional feature map, thereby obtaining and outputting the feature map sequence.
  • the output of the feature map extraction module can be cascaded to the input of the feature preprocessing module.
  • the feature preprocessing module can be used to: divide the input feature map sequence into multiple feature map subsequences based on the sliding window mechanism. For each feature map in each feature map subsequence, it is converted into one-dimensional features, and then all the converted one-dimensional features are stacked to obtain two-dimensional features. Then, classification initial features and position coding are added to the two-dimensional features.
  • the feature preprocessing module can be used to: output two-dimensional features that add classification initial features and position encoding.
  • the output of the feature preprocessing module can be cascaded to the input of the self-attention feature extraction module.
  • the sliding window length is 3 and the sliding step size is 1, and 8 feature map subsequences can be divided.
  • the classification initial features can be added from the time dimension to obtain 4*n two-dimensional features, and then position coding can be added. You can also add classification initial features from the spatial dimension to obtain 3*(n+1) two-dimensional features, and then add position coding.
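  • As a worked example (the sequence length of 10 feature maps is an assumption consistent with the 8 subsequences above): with a window length of 3 and a sliding step of 1, the number of subsequences is (10 - 3) / 1 + 1 = 8; and if each subsequence stacks into a 3*n two-dimensional feature, prepending the classification initial feature in the time dimension yields a 4*n feature, while prepending it in the spatial dimension yields 3*(n+1).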
  • the position coding can include: information characterizing the spatial position relationship between different feature points of each feature map in the feature map sequence, and characterizing different feature maps in the feature map sequence. information about the temporal relationship between them.
  • the self-attention feature extraction module can include 3 cascaded preset self-attention modules.
  • in each preset self-attention module, features can be extracted once from the spatial dimension based on the self-attention mechanism for the input features, and then extracted once from the time dimension based on the self-attention mechanism.
  • the output of the self-attention feature extraction module can be the classification features. Specifically, from the features output by the last cascaded preset self-attention module, the current representation corresponding to the classification initial feature is determined as the classification feature.
  • the current representation corresponding to the classification initial feature can also be pooled to obtain a fixed-size feature, which is determined as a classification feature.
  • the output of the self-attention feature extraction module can be cascaded to the input of the prediction module.
  • the prediction module can be used to predict the shot recognition result, using the pre-trained fully connected network, based on the classification features output by the self-attention feature extraction module.
  • the embodiment of the present invention also provides a specific method embodiment.
  • this method embodiment proposes a vision-based shot recognition algorithm, which can run on a mobile phone or an ordinary personal computer. Shot recognition can be realized using a mobile phone camera or a stadium surveillance camera, which has the advantages of simple operation and low cost.
  • shot recognition is used to determine whether a player's shot scores.
  • the shot recognition algorithm allows automatic scoring, letting players focus on the game.
  • the video data of the basketball game can be divided into multiple video segments based on the sliding window mechanism.
  • the length of the sliding window may be 2 seconds, and the sliding step may be 1 second.
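  • A minimal sketch of this segmentation, with the 2-second window and 1-second step from above (the helper name is hypothetical):

```python
def split_video_segments(duration_s, window_s=2.0, step_s=1.0):
    """Divide a video of `duration_s` seconds into overlapping segments
    using a sliding window (2 s window, 1 s step by default)."""
    segments, start = [], 0.0
    while start + window_s <= duration_s:
        segments.append((start, start + window_s))
        start += step_s
    return segments

# e.g. a 5 s clip -> [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]
```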
  • a target detection network trained on an open-source architecture (such as YOLOX or RetinaNet) can be used to detect backboards.
  • the shot classification network is a custom deep neural network, which includes a two-dimensional convolutional network and a self-attention mechanism.
  • two-dimensional convolutional networks have proven their advantages of high performance and fast speed in image feature extraction over many years of practice, for example on image data sets such as ImageNet.
  • the self-attention module has shown great potential in processing sequence data in natural language.
  • the shot classification network can be divided into two sections during network design.
  • the first section uses a two-dimensional convolutional network to extract underlying image-level features.
  • the second section uses self-attention to fuse multi-frame image features in the time domain.
  • the spatial features can be fine-tuned through transposition + self-attention mechanism.
  • the extracted features can be reused in subsequent processing, reducing unnecessary repeated calculations.
  • the self-attention in the second section can effectively perform multi-frame image fusion processing in the time domain to obtain optimal results.
  • Each frame of image can be processed using a three-layer two-dimensional convolutional network with shared weights.
  • the kernel size of each layer is 3; the stride of the first layer is 2, and the stride of the last two layers is 1.
  • each layer applies batch normalization, uses ReLU as the activation function, and outputs 32, 64, and 128 channels respectively.
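  • Under the stated hyper-parameters, the per-frame network could be sketched in PyTorch as follows; the RGB input (3 channels) and the padding of 1 are assumptions:

```python
import torch.nn as nn

# kernel size 3; strides 2, 1, 1; batch norm + ReLU; 32/64/128 output channels
frame_cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(128), nn.ReLU(),
)
# Weight sharing across frames: fold time into the batch dimension, e.g.
# feats = frame_cnn(frames.reshape(B * T, 3, H, W))
```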
  • the features are organized as T x S x C time-dimension data, which is processed simultaneously, where T is the number of image frames (typical value 32), S is the spatial dimension (typical value 1024), and C is the number of feature channels, i.e. the number of output channels of the above convolutional network (typical value 128).
  • a self-attention module can be used to process the spliced features after position encoding and classification initial features are added; specifically, features can be extracted from the spatial dimension based on the self-attention mechanism. After this processing is completed, the extracted features are transposed to S x T x C and another self-attention module is used to process them, specifically extracting features from the time dimension based on the self-attention mechanism. After that processing is completed, the feature sequence is restored to T x S x C.
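  • A sketch of this transpose trick with the typical shapes above; the head count is an assumption, and nn.MultiheadAttention stands in for whatever attention block the network actually uses:

```python
import torch
import torch.nn as nn

spatial_attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
temporal_attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

x = torch.randn(32, 1024, 128)   # (T, S, C) with the typical values above
x, _ = spatial_attn(x, x, x)     # attend across S within each frame
x = x.transpose(0, 1)            # transpose to (S, T, C)
x, _ = temporal_attn(x, x, x)    # attend across T at each spatial position
x = x.transpose(0, 1)            # restore (T, S, C)
```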
  • the output of the classification network indicates whether the corresponding shot is scored; if a goal is scored, the corresponding points are counted.
  • the input used is multiple continuous frame images around the backboard.
  • the backboard is a large target and does not move, so the detection difficulty is extremely low; there is no need to explicitly detect the basketball or the basket.
  • This method embodiment provides a vision-based shot recognition algorithm, which can determine whether a goal is scored based on the video input of a camera, thereby achieving automatic scoring of the game. It can run directly on a mobile phone or an ordinary personal computer, and has the advantages of simple deployment and portability.
  • the embodiment of the present invention also provides an apparatus embodiment.
  • Figure 4 is a schematic structural diagram of a shot recognition device according to an embodiment of the present invention.
  • the device may include the following units.
  • the backboard identification unit 401 is used to: obtain the video to be identified; determine the backboard area in the video to be identified, and obtain an image sequence of the backboard area to be identified.
  • the classification network unit 402 is used to: input the image sequence of the backboard area to be identified into the shot classification network, and obtain the shot recognition result output by the shot classification network; the shot recognition result is used to characterize whether the shot is successful.
  • Embodiments of the present invention also provide a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, any one of the above method embodiments is implemented.
  • An embodiment of the present invention also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform any of the above method embodiments.
  • Figure 5 is a schematic hardware structure diagram of a computer device configured with a method according to an embodiment of the present invention.
  • the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040 and a bus 1050.
  • the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040 implement communication connections between each other within the device through the bus 1050.
  • the processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided by the embodiments of the present invention.
  • the memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc.
  • the memory 1020 can store operating systems and other application programs. When the technical solution provided by the embodiment of the present invention is implemented through software or firmware, the relevant program code is stored in the memory 1020 and called and executed by the processor 1010.
  • the input/output interface 1030 is used to connect the input/output module to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions.
  • Input devices can include keyboards, mice, touch screens, microphones, various sensors, etc., and output devices can include monitors, speakers, vibrators, indicator lights, etc.
  • the communication interface 1040 is used to connect a communication module (not shown in the figure) to realize communication interaction between this device and other devices.
  • the communication module can realize communication through wired means (such as USB, network cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
  • Bus 1050 includes a path that carries information between various components of the device (eg, processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
  • although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, during specific implementation the device may also include other components necessary for normal operation.
  • the above-mentioned device may also include only the components necessary to implement the embodiments of the present invention, and does not necessarily include all the components shown in the figures.
  • Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, any one of the above method embodiments can be implemented.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, in which information storage can be implemented by any method or technology.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which may be used to store information that can be accessed by a computing device.
  • computer-readable media does not include transient computer-readable media (transitory media), such as modulated data signals and carrier waves.
  • the embodiments of the present invention can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the essence or the contribution part of the technical solutions of the embodiments of the present invention can be embodied in the form of software products.
  • the computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments, or certain parts of the embodiments, of the present invention.
  • a typical implementation device is a computer, which may be in the form of a personal computer, a laptop, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • each embodiment in this specification is described in a progressive manner; for the same or similar parts among the various embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. Where embodiments are similar, the description is kept relatively brief.
  • the device embodiments described above are only illustrative.
  • the modules described as separate components may or may not be physically separated.
  • the functions of each module may be integrated in the same device or implemented in multiple pieces of software and/or hardware. Some or all of the modules can also be selected according to actual needs to achieve the purpose of the solution of this embodiment, and persons of ordinary skill in the art can understand and implement it without creative effort.
  • terms such as "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.
  • "plurality" refers to two or more, unless expressly limited otherwise.


Abstract

Disclosed in the present invention are a basketball shot recognition method and apparatus, a device and a storage medium. The method comprises: acquiring a video to be recognized; determining a backboard region in the video to be recognized, so as to obtain a backboard region image sequence to be recognized; and inputting into a basketball shot classification network the backboard region image sequence to be recognized, and acquiring a basketball shot recognition result output by the basketball shot classification network, the basketball shot recognition result being used for representing whether a basketball shot is successful.

Description

A shot recognition method, apparatus, device and storage medium

Technical Field

The present invention relates to the field of computer application technology, and in particular to a shot recognition method, apparatus, device and storage medium.

Background

In basketball games, referees usually need to manually judge whether a shot is scored. However, in many cases it is difficult to find a referee to make the judgment. For example, in amateur games or basketball training, considering the labor cost of referees, it is often difficult to find a referee, and players have to distract themselves from play to judge whether a shot scores.

In the related art, sensors can be installed on the backboard to automatically determine whether a shot is scored. However, the deployment cost of this approach is relatively high, and a low-cost shot recognition method is urgently needed.

Summary of the Invention

The present invention provides a shot recognition method, apparatus, device and storage medium to address the deficiencies in the related art.

According to a first aspect of the embodiments of the present invention, a shot recognition method is provided, including:

acquiring a video to be identified;

determining the backboard area in the video to be identified to obtain a backboard area image sequence to be identified;

inputting the backboard area image sequence to be identified into a shot classification network, and obtaining a shot recognition result output by the shot classification network, where the shot recognition result is used to represent whether a shot is successful.
Optionally, the shot classification network is used to:

extract a feature map for each backboard area image in the input backboard area image sequence to obtain a feature map sequence; extract classification features for the feature map sequence based on a self-attention mechanism; and determine the shot recognition result based on the classification features.

Optionally, extracting classification features for the feature map sequence based on the self-attention mechanism includes:

adding position coding to the feature map sequence, where the position coding includes information characterizing the spatial position relationship between different feature points of each feature map in the feature map sequence, and information characterizing the temporal position relationship between different feature maps in the feature map sequence;

extracting features from the spatial dimension and the temporal dimension based on the self-attention mechanism for the feature map sequence after position coding is added, to obtain classification features.

Optionally, adding position coding to the feature map sequence includes:

converting each feature map in the feature map sequence into a one-dimensional feature;

stacking the one-dimensional features to obtain a two-dimensional feature;

adding position coding to the two-dimensional feature.

Optionally, the shot classification network includes N cascaded preset self-attention modules, N≥2; for the i-th preset self-attention module, 1≤i≤N-1, its output is cascaded to the input of the (i+1)-th preset self-attention module; the preset self-attention modules are used to extract features from the spatial dimension and the temporal dimension based on the self-attention mechanism;

extracting features from the spatial and temporal dimensions based on the self-attention mechanism for the feature map sequence after position coding is added includes:

inputting the feature map sequence after position coding into the first preset self-attention module, and determining the classification features based on the output of the N-th preset self-attention module.

Optionally, the preset self-attention module is used to:

for input features, serially extract features at least once from the spatial dimension based on the self-attention mechanism, then serially extract features at least once from the temporal dimension based on the self-attention mechanism for the extracted features, and output the extracted features;

or,

for input features, serially extract features at least once from the temporal dimension based on the self-attention mechanism, then serially extract features at least once from the spatial dimension based on the self-attention mechanism for the extracted features, and output the extracted features.
Optionally, extracting classification features for the feature map sequence based on the self-attention mechanism includes:

adding classification initial features to the feature map sequence;

extracting features based on the self-attention mechanism for the feature map sequence after the classification initial features are added;

determining, from the extracted features, the current representation corresponding to the classification initial features as the classification feature.

Optionally, determining the shot recognition result based on the classification features includes:

performing pooling processing on the classification features to obtain to-be-input features of a preset feature size;

inputting the to-be-input features into a pre-trained fully connected network, and obtaining the shot recognition result output by the fully connected network.

Optionally, extracting classification features for the feature map sequence based on the self-attention mechanism includes:

for the feature map sequence, determining the first m feature maps as the feature map subsequence contained in the current sliding window, m≥1;

performing the following steps in a loop until the current sliding window can no longer move backward: extracting classification features based on the self-attention mechanism for the feature map subsequence contained in the current sliding window; moving the sliding window backward by a preset sliding step.

Optionally, obtaining the backboard area image sequence to be identified includes:

for each determined backboard area, cropping the image content containing the backboard area, adjusting the cropping result to a preset image size, and adding the adjustment result to the backboard area image sequence to be identified, in which the backboard area images are ordered by the temporal order of the video frames in which they are located.
According to a second aspect of the embodiments of the present invention, a shot recognition apparatus is provided, including:

a backboard identification unit configured to: acquire a video to be identified; determine the backboard area in the video to be identified, and obtain a backboard area image sequence to be identified;

a classification network unit configured to: input the backboard area image sequence to be identified into a shot classification network, and obtain a shot recognition result output by the shot classification network, where the shot recognition result is used to characterize whether a shot is successful.

According to the above embodiments, by performing shot recognition on the video to be identified, there is no need to pre-deploy hardware such as sensors, which reduces deployment costs.

It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present invention.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
Figure 1 is a schematic flow chart of a shot recognition method according to an embodiment of the present invention;

Figure 2 is a schematic structural diagram of a shot classification network according to an embodiment of the present invention;

Figure 3 is a schematic principle diagram of a shot classification network according to an embodiment of the present invention;

Figure 4 is a schematic structural diagram of a shot recognition apparatus according to an embodiment of the present invention;

Figure 5 is a schematic hardware structure diagram of a computer device configured with a method according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention; rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the appended claims.
In basketball games, referees usually need to manually judge whether a shot is scored. However, in many cases it is difficult to find a referee to make the judgment. For example, in amateur games or basketball training, considering the labor cost of referees, it is often difficult to find a referee, and players have to distract themselves from play to judge whether a shot scores.

In the related art, sensors can be installed on the backboard to automatically determine whether a shot is scored. However, the deployment cost of this approach is relatively high, and a low-cost shot recognition method is urgently needed.

In order to reduce the cost of shot recognition, the present invention provides a shot recognition method.

In this method, videos captured during basketball games can be analyzed to determine whether a shot scores.

Compared with approaches such as deploying sensors, shooting video for shot recognition does not require pre-deployed hardware, which reduces deployment costs.

Furthermore, since judging whether a shot scores usually amounts to judging whether the basketball passes through the basket in the backboard area, the image content containing the backboard area is closely related to whether the basketball enters the basket, while image content outside the backboard area is only weakly related.

For example, continuous image content containing the backboard area can be used to judge whether the basket in the backboard area vibrates, whether the net under the basket sways, the trajectory of the basketball within the backboard area, and so on, and can therefore be used to judge whether a shot scores.

Therefore, in this method, the backboard area in the captured video can be determined and assembled into an image sequence for shot recognition.

By determining the backboard area, the amount of data that needs to be processed can be reduced without lowering the accuracy of shot recognition, improving the efficiency of shot recognition and reducing its computational cost.

Specifically, deep learning can be used for the recognition.

The above method performs shot recognition on video without pre-deployed sensors, reducing deployment costs; and by determining the backboard area in the video, it removes redundant data and thereby improves the efficiency of shot recognition.

A shot recognition method provided by an embodiment of the present invention is explained in detail below.
如图1所示,图1是根据本发明实施例示出的一种投篮识别方法的流程示意图。As shown in Figure 1, Figure 1 is a schematic flow chart of a shot recognition method according to an embodiment of the present invention.
本发明实施例并不限定本方法流程的执行主体。可选地,执行主体可以是移动设备,也可以是服务端。The embodiments of the present invention do not limit the execution subject of the method flow. Optionally, the execution subject can be a mobile device or a server.
该方法可以包括以下步骤。The method may include the following steps.
S101:获取待识别视频。S101: Obtain the video to be identified.
S102:确定待识别视频中的篮板区域,得到待识别篮板区域图像序列。S102: Determine the backboard area in the video to be identified, and obtain an image sequence of the backboard area to be identified.
S103:将待识别篮板区域图像序列输入投篮分类网络,获取投篮分类网络输出的投篮识别结果。S103: Input the image sequence of the backboard area to be identified into the shot classification network, and obtain the shot recognition result output by the shot classification network.
可选地,投篮分类网络可以用于针对输入的篮板区域图像序列,预测投篮识别结果。投篮识别结果具体可以用于表征投篮是否成功。Alternatively, the shot classification network can be used to predict the shot recognition results for the input backboard area image sequence. The shot recognition result can specifically be used to characterize whether the shot is successful.
可选地,可以预先根据样本特征为篮板区域图像序列的投篮样本,和对应的投篮标签训练投篮分类网络;投篮标签用于表征投篮是否成功。Optionally, the shot classification network can be trained in advance based on the shot samples whose sample characteristics are the backboard area image sequence and the corresponding shot labels; the shot labels are used to characterize whether the shot is successful.
其中,投篮识别的结果可以用于计算篮球比赛的得分。Among them, the results of shot recognition can be used to calculate the score of basketball games.
In the above method flow, shot recognition is performed on the video to be recognized, so there is no need to deploy hardware such as sensors in advance; shot recognition is carried out directly by software algorithms, and all that is required is a device capable of capturing video, which may specifically be the camera of a handheld device, a surveillance camera at a basketball court, or the like, thereby reducing deployment cost.
In addition, shot recognition can be performed on the backboard area determined from the video to be recognized, reducing redundant data and thus improving the efficiency of shot recognition.
Performing shot recognition through deep learning can improve both the efficiency and the accuracy of shot recognition.
Each step is explained in detail below.
1. S101: Obtain a video to be recognized.
This method flow does not limit the specific way of obtaining the video to be recognized.
Optionally, the video to be recognized may specifically be surveillance video of a basketball game, or video of a basketball game captured by the camera of a handheld device.
Optionally, the video to be recognized may be a video clip of a basketball game, so as to recognize whether a shot in that clip scored. Specifically, the video to be recognized may be a shot-attempt clip from a basketball game, so that whether the shot scored can be recognized directly and quickly.
For example, the video to be recognized may be a 2-4 second clip of a shot attempt.
Optionally, the video to be recognized may be the complete video of a basketball game, so that the number of successful shots in the game can be counted and the video positions of the successful shots can be conveniently located. The specific approach is explained later.
This method flow does not limit how the video to be recognized is captured.
Optionally, it may be captured by a handheld device or from a fixed viewpoint; for example, a surveillance camera may film the basketball game, or a mobile phone may film the game after its viewpoint has been fixed.
Optionally, to facilitate the subsequent determination of the backboard area, the backboard area may be filmed from a fixed viewpoint.
2. S102: Determine the backboard area in the video to be recognized, and obtain a backboard area image sequence to be recognized.
This method flow does not limit the way of determining the backboard area.
Optionally, object detection may be used to recognize the backboard area in each video frame of the video to be recognized.
Optionally, a pre-trained backboard area detection model may be used to perform recognition on each video frame of the video to be recognized and determine the backboard area.
The backboard area detection model may be trained using image samples and corresponding backboard area position labels.
Optionally, the backboard area detection model may specifically be an object detection network trained based on YOLOX or RetinaNet.
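As a concrete illustration, the following is a minimal sketch of such per-frame detection, assuming a hypothetical RetinaNet fine-tuned so that class index 1 means "backboard"; the torchvision constructor is real, but the weights, class index, and 0.5 score threshold are illustrative assumptions rather than part of this disclosure.

```python
import torch
import torchvision

# Hypothetical fine-tuned detector: class 1 = "backboard". weights=None only
# builds the architecture; in practice the fine-tuned weights would be loaded.
model = torchvision.models.detection.retinanet_resnet50_fpn(weights=None, num_classes=2)
model.eval()

def detect_backboards(frames, score_thresh=0.5):
    """frames: list of (3, H, W) float tensors in [0, 1], one per video frame.
    Returns, per frame, the (x1, y1, x2, y2) boxes of candidate backboard areas."""
    with torch.no_grad():
        outputs = model(frames)  # list of {"boxes", "scores", "labels"}
    return [out["boxes"][out["scores"] > score_thresh] for out in outputs]
```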
As for the backboard area image sequence to be recognized, optionally, the video frames determined to contain the backboard area may be added to the backboard area image sequence to be recognized.
Optionally, to further reduce redundant data, the video frames containing the backboard area may be cropped to obtain the image content containing the backboard area, which is then added to the backboard area image sequence to be recognized.
This embodiment does not limit the specific cropping method, as long as the cropping result contains the determined backboard area image.
Optionally, to make it convenient to use the image data near the backboard area for shot recognition, when cropping the video frames containing the backboard area, the size of the cropping result may be made larger than the size of the determined backboard area, so that the cropping result contains image data near the backboard area.
Optionally, for consecutive video frames containing the backboard area, a rectangular box position in the video frames may be determined such that, across the consecutive frames, every determined backboard area is contained within the rectangular box. Cropping can then be performed according to the rectangular box position.
Specifically, the rectangular box may be enlarged before cropping, so that the image content near the hoop area is obtained by cropping.
Specifically, for consecutive video frames, the minimum bounding box enclosing the backboard areas detected at the same position across these frames may be determined; the bounding box is then enlarged by a factor of 1.5, and the corresponding image is cropped from the corresponding position of each frame.
Optionally, since the cropped image contents are not of uniform size, to facilitate subsequent computation and model input, the cropping results may be adjusted so that different cropping results have the same size. Specifically, different cropping results may be adjusted to the same resolution.
Therefore, optionally, obtaining the backboard area image sequence to be recognized may include: for each determined backboard area, cropping the image content containing that backboard area, adjusting the cropping result to a preset image size, and adding the adjusted result to the backboard area image sequence to be recognized.
Optionally, different backboard areas in the backboard area image sequence to be recognized may have the same image size. Obtaining the backboard area image sequence to be recognized may include: for the different determined backboard areas, cropping out the image contents containing the different backboard areas, adjusting them to the same image size, and adding the adjusted results to the backboard area image sequence to be recognized.
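A minimal sketch of the cropping procedure described above, assuming that detections of the same backboard are given per frame as (x1, y1, x2, y2) boxes; the 1.5 enlargement factor follows the example in the text, while the 224 x 224 output size stands in for the unspecified "preset image size" and is an illustrative assumption:

```python
import numpy as np
import cv2

def crop_backboard_sequence(frames, boxes, out_size=(224, 224), scale=1.5):
    """frames: list of H x W x 3 uint8 arrays; boxes: per-frame (x1, y1, x2, y2)
    detections of the same backboard. Takes the minimum box enclosing all
    detections, enlarges it by `scale` around its center, crops every frame at
    that position, and resizes the crops to a common size."""
    b = np.asarray(boxes, dtype=np.float32)
    x1, y1 = b[:, 0].min(), b[:, 1].min()      # minimum enclosing box
    x2, y2 = b[:, 2].max(), b[:, 3].max()
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2      # enlarge around the center
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    H, W = frames[0].shape[:2]
    xa, ya = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)
    xb, yb = min(int(cx + w / 2), W), min(int(cy + h / 2), H)
    return [cv2.resize(f[ya:yb, xa:xb], out_size) for f in frames]
```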
In an optional embodiment, the backboard area images in the backboard area image sequence to be recognized may be ordered, so that shot recognition can conveniently be performed according to this order.
Optionally, the backboard area image sequence to be recognized may be ordered by the temporal order of the video frames in which the backboard area images are located.
In an optional embodiment, the backboard area determined for the video to be recognized may be the same backboard area filmed throughout the video to be recognized.
Optionally, the backboard area in the video to be recognized may be determined by the backboard area detection model, and a tracking algorithm may then be used to determine the same backboard area filmed in the video to be recognized; specifically, the same backboard area contained across different video frames of the video to be recognized.
Optionally, one or more backboard areas may be determined for the video to be recognized.
For example, the video to be recognized may keep filming the same backboard area from a fixed position, after which the shooting situation at that backboard area can conveniently be determined by the shot recognition method.
For example, the video to be recognized may capture multiple different backboard areas, after which the shooting situation at each backboard area can be determined separately by the shot recognition method.
Optionally, for each backboard area determined in the video to be recognized, a separate backboard area image sequence to be recognized may be constructed. Specifically, the image content containing that backboard area may be cropped, the cropping result adjusted to the preset image size, and the adjusted result added to that backboard area image sequence to be recognized.
The different determined backboard areas may correspond to different backboard area image sequences to be recognized, each of which can be fed to the subsequent shot classification network for shot recognition.
Optionally, each video frame of the video to be recognized may be checked for whether it contains a backboard area; then, among the frames containing a backboard area, the different video frames containing the same backboard area may be determined, so that the one or more backboard areas filmed in the video to be recognized can be determined.
Specifically, the different video frames containing the same backboard area can be determined by a tracking algorithm.
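The disclosure does not fix a particular tracking algorithm; since a backboard is essentially static in the frame, a simple greedy IoU association is one plausible sketch (the 0.5 IoU threshold is an illustrative assumption):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def group_same_backboard(per_frame_boxes, iou_thresh=0.5):
    """per_frame_boxes: list (one entry per frame) of detected boxes.
    Greedily links detections across frames whose IoU exceeds the threshold,
    yielding one track per physical backboard."""
    tracks = []  # each track: list of (frame_idx, box)
    for t, boxes in enumerate(per_frame_boxes):
        for box in boxes:
            for track in tracks:
                if iou(track[-1][1], box) > iou_thresh:
                    track.append((t, box))
                    break
            else:
                tracks.append([(t, box)])
    return tracks
```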
For the one or more backboard areas determined in the video to be recognized, this method flow does not limit the method of obtaining the backboard area image sequence.
Optionally, a backboard area image sequence to be recognized may be obtained for any one backboard area determined in the video to be recognized. Specifically, for any determined backboard area, the image content containing that backboard area may be cropped from the one or more video frames containing it, so as to obtain the backboard area image sequence to be recognized corresponding to that backboard area. Specifically, the cropped image content may be adjusted to a preset size, and the adjusted result added to the backboard area image sequence to be recognized.
Optionally, a backboard area image sequence to be recognized may be obtained separately for each backboard area determined in the video to be recognized. Specifically, for each determined backboard area, the image content containing that backboard area may be cropped from the one or more video frames containing it, so as to obtain the backboard area image sequence to be recognized corresponding to that backboard area. Specifically, the cropped image content may be adjusted to a preset size, and the adjusted result added to the backboard area image sequence to be recognized.
3. S103: Input the backboard area image sequence to be recognized into the shot classification network, and obtain the shot recognition result output by the shot classification network.
Optionally, the shot classification network may be used to predict a shot recognition result for the input backboard area image sequence. The shot recognition result may specifically be used to characterize whether the shot scored.
Optionally, the shot classification network may be trained in advance on shot samples whose sample features are backboard area image sequences, together with corresponding shot labels; a shot label may be used to characterize whether the shot scored.
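A hedged sketch of one possible training step for this binary made/missed classification, assuming the network outputs one logit per input sequence and that shot labels are encoded as 1.0 for a made shot and 0.0 for a miss; the batching, loss, and optimizer choices are illustrative assumptions, not prescribed by this disclosure:

```python
import torch.nn as nn

def training_step(model, clips, labels, optimizer):
    """clips: (B, T, C, H, W) batch of backboard area image sequences;
    labels: (B,) float tensor, 1.0 = made shot, 0.0 = miss."""
    criterion = nn.BCEWithLogitsLoss()
    logits = model(clips).squeeze(-1)  # (B,) raw scores, one per sequence
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```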
Among other uses, the shot recognition results can be used to compute the score of a basketball game.
The shot classification network is explained in detail below.
1) Self-attention mechanism.
This method flow does not specifically limit how the shot classification network processes the backboard area image sequence to be recognized.
Optionally, the shot classification network may perform image recognition on each backboard area in the backboard area image sequence to be recognized separately, to recognize whether the shot scored.
Optionally, the shot classification network may combine at least two consecutive backboard areas in the backboard area image sequence to be recognized for image recognition, to recognize whether the shot scored. Specifically, 8 or more consecutive frames of backboard area images may be combined for image recognition.
Since determining that a shot scored requires determining that the basketball passed through the hoop in the backboard area, and the basketball passing through the hoop is usually a motion process, combining multiple consecutive frames for shot recognition can improve the accuracy of shot recognition.
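For instance, grouping the sequence into runs of consecutive images could look like the following sketch; the window length of 8 follows the example above, while the stride is an illustrative assumption:

```python
def consecutive_windows(images, window=8, stride=1):
    """Split the backboard area image sequence into runs of `window`
    consecutive images, so the network can reason over the ball's motion
    rather than a single frame."""
    return [images[i:i + window] for i in range(0, len(images) - window + 1, stride)]
```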
In an optional embodiment, for at least two consecutive backboard area images in the backboard area image sequence to be recognized, the shot classification network may use a self-attention mechanism to extract features for shot recognition.
Based on the self-attention mechanism, the associations between the different backboard area images among the at least two consecutive backboard area images can be learned; used as features for shot recognition, these can better distinguish successful shots from missed shots, thereby improving the accuracy of shot recognition.
Therefore, optionally, the shot classification network may be used to: extract classification features based on a self-attention mechanism for the input backboard area image sequence; and determine the shot recognition result according to the classification features.
The classification features may be the features used to predict the shot recognition result.
This method flow does not limit the specific way of extracting features based on the self-attention mechanism.
In an optional embodiment, to facilitate extracting classification features based on the self-attention mechanism, feature maps may first be extracted from the backboard area images in the backboard area image sequence, and features may then be extracted from the feature maps based on the self-attention mechanism.
Optionally, extracting classification features based on the self-attention mechanism for the input backboard area image sequence may include: extracting a feature map from each backboard area image in the input backboard area image sequence to obtain a feature map sequence; and extracting classification features from the feature map sequence based on the self-attention mechanism.
The order of the feature maps in the feature map sequence may be the same as the order of the corresponding backboard area images in the backboard area image sequence.
Optionally, the order of the feature maps in the feature map sequence may be the same as the temporal order of the video frames in which the corresponding backboard area images are located.
In this embodiment, by extracting feature maps from the backboard area images, more of the information in the backboard area images can be learned and mined, so that classification features can be extracted better based on the self-attention mechanism and the accuracy of shot recognition can be improved.
Of course, optionally, features may also be extracted directly from the backboard area image sequence based on the self-attention mechanism.
This embodiment does not limit the way of extracting feature maps from the backboard area images.
Optionally, a pre-trained convolutional network may be used to extract the feature maps. The extracted feature maps may include at least one of the following kinds of information: detail information, edge information, noise information, spatial relationship information, and so on. This embodiment imposes no specific limitation.
Therefore, optionally, extracting a feature map from each backboard area image in the input backboard area image sequence may include: for each backboard area image in the input backboard area image sequence, extracting a feature map based on a pre-trained convolutional network.
This embodiment does not limit the structure of the convolutional network. It may specifically be a two-dimensional convolutional network, which has the advantages of high performance and high speed in image feature extraction.
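As one possible realization, the sketch below uses a ResNet-18 trunk from torchvision as the two-dimensional convolutional network; the specific backbone and the 224 x 224 input resolution are illustrative assumptions, since the disclosure does not fix the architecture:

```python
import torch
import torch.nn as nn
import torchvision

# 2D convolutional backbone: a ResNet-18 with its average-pooling and
# fully-connected head removed, so it outputs spatial feature maps.
resnet = torchvision.models.resnet18(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])

def extract_feature_maps(images: torch.Tensor) -> torch.Tensor:
    """images: (T, 3, 224, 224) backboard area image sequence.
    Returns (T, 512, 7, 7) feature maps, one per image, in the same order."""
    with torch.no_grad():
        return backbone(images)
```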
2) Temporal dimension and spatial dimension.
When extracting features based on the self-attention mechanism, it is usually necessary to add a position encoding to the input sequence to determine the positional relationships between the different elements of the sequence, so that the associations between the different elements of the sequence can be learned.
a) Temporal dimension.
In an optional embodiment, the shot classification network may extract features from the temporal dimension based on the self-attention mechanism.
For the backboard area image sequence to be recognized, since the backboard area images are determined from video frames, and the video frames have a temporal order in the video to be recognized, a temporal order, that is, a temporal positional relationship, can be determined between the different backboard area images in the sequence. Features can therefore be extracted in the temporal dimension based on the self-attention mechanism, learning the associations between different backboard area images in the temporal dimension.
Through the self-attention mechanism, the several backboard area images most strongly associated with the shot recognition result in the temporal dimension can be learned.
Optionally, the position encoding may include information characterizing the temporal positional relationships between different backboard area images in the backboard area image sequence to be recognized. The temporal positional relationships between different backboard area images may be the same as the temporal positional relationships between the video frames in which the backboard area images are located.
Correspondingly, optionally, the shot classification network may be used to: add the above position encoding to the backboard area image sequence to be recognized, determining the temporal positional relationships between the different backboard area images; and, for the backboard area image sequence to be recognized with the position encoding added, extract features from the temporal dimension based on the self-attention mechanism to obtain the classification features.
Optionally, since the shot classification network may first extract feature maps from the backboard area images to obtain a feature map sequence, the shot classification network may be used to: add a position encoding to the feature map sequence, where the position encoding includes information characterizing the temporal positional relationships between different feature maps in the feature map sequence; the temporal positional relationships between different feature maps may be the same as the temporal positional relationships between the video frames in which the feature maps' corresponding backboard area images are located, that is, the temporal positional relationships between the feature maps' corresponding video frames.
Furthermore, the shot classification network may be used to extract features from the temporal dimension based on the self-attention mechanism for the feature map sequence with the position encoding added, obtaining the classification features.
In this embodiment, features can be extracted from the temporal dimension based on the self-attention mechanism, improving the feature extraction effect of the shot classification network and improving its recognition accuracy.
Optionally, the position encoding may include information characterizing the temporal positional relationships between different backboard area images in the backboard area image sequence to be recognized; specifically, it may include the timestamp, within the video to be recognized, of the video frame in which each backboard area image is located.
In this embodiment, it can be determined based on the timestamps whether adjacent backboard area images come from consecutive or nearby video frames; in other words, information characterizing the temporal positional relationships between the video frames in which the backboard area images are located can be added as the position encoding, so that these temporal positional relationships can be exploited to improve the feature extraction effect of the shot classification network and improve its recognition accuracy.
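One common way to realize such a timestamp-based position encoding is a sinusoidal code evaluated at the frame timestamps, so that consecutive or nearby frames receive nearby codes; the sinusoidal form itself is an illustrative assumption, since the disclosure only requires that the encoding carry the temporal positional information:

```python
import math
import torch

def temporal_encoding(timestamps: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal position encoding evaluated at the frame timestamps
    (frame indices or seconds). timestamps: (T,); returns (T, dim).
    `dim` is assumed to be even."""
    pos = timestamps.float().unsqueeze(1)  # (T, 1)
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(len(timestamps), dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe
```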
b) Spatial dimension.
In another optional embodiment, the shot classification network may extract features from the spatial dimension based on the self-attention mechanism.
For a single backboard area image in the backboard area image sequence to be recognized, spatial positional relationships exist between the image contents at different positions.
Optionally, spatial positional relationships exist between different pixels of a single backboard area image, which may specifically be characterized by two-dimensional spatial coordinates.
Therefore, for a backboard area image, features can be extracted in the spatial dimension based on the self-attention mechanism, learning the associations between different image contents within a single backboard area image.
Through the self-attention mechanism, the image content of a backboard area image most strongly associated with the shot recognition result in the spatial dimension can be learned.
By extracting features in the spatial dimension based on the self-attention mechanism for each backboard area image in the backboard area image sequence to be recognized, the image content of each backboard area image most strongly associated with the shot recognition result in the spatial dimension can be learned.
Optionally, the position encoding may include information characterizing the spatial positional relationships between image contents at different positions within each backboard area image in the backboard area image sequence to be recognized.
Specifically, it may include information characterizing the spatial positional relationships between different pixels within each backboard area image in the backboard area image sequence to be recognized.
Correspondingly, optionally, the shot classification network may be used to: add the above position encoding to the backboard area image sequence to be recognized, determining the spatial positional relationships between image contents at different positions within each backboard area image; and, for the backboard area image sequence to be recognized with the position encoding added, extract features from the spatial dimension based on the self-attention mechanism to obtain the classification features.
Optionally, since the shot classification network may first extract feature maps from the backboard area images to obtain a feature map sequence, the shot classification network may be used to: add a position encoding to the feature map sequence, where the position encoding includes information characterizing the spatial positional relationships between different feature points of each feature map in the feature map sequence; and, for the feature map sequence with the position encoding added, extract features from the spatial dimension based on the self-attention mechanism to obtain the classification features.
In this embodiment, features can be extracted from the spatial dimension based on the self-attention mechanism, improving the feature extraction effect of the shot classification network and improving its recognition accuracy.
c) Combining the temporal and spatial dimensions.
In another optional embodiment, the shot classification network may extract features from both the spatial and temporal dimensions based on the self-attention mechanism.
Through the self-attention mechanism, on the one hand, the image content of each backboard area image most strongly associated with the shot recognition result in the spatial dimension can be learned; on the other hand, the several backboard area images most strongly associated with the shot recognition result in the temporal dimension can be learned.
Since features can be extracted by combining the temporal and spatial dimensions, in this embodiment the feature extraction effect of the shot classification network can be improved, and its recognition accuracy can be improved.
Optionally, the position encoding may include information characterizing the spatial positional relationships between image contents at different positions within each backboard area image in the backboard area image sequence to be recognized, as well as information characterizing the temporal positional relationships between different backboard area images in the sequence.
The temporal positional relationships between different backboard area images may be the same as the temporal positional relationships between the video frames in which the backboard area images are located.
Correspondingly, optionally, the shot classification network may be used to: add the above position encoding to the backboard area image sequence to be recognized; and, for the sequence with the position encoding added, extract features from the spatial and temporal dimensions based on the self-attention mechanism to obtain the classification features.
This embodiment does not limit the order or the number of times of extracting features from the spatial dimension and from the temporal dimension.
Optionally, since the shot classification network may first extract feature maps from the backboard area images to obtain a feature map sequence, the shot classification network may be used to: add a position encoding to the feature map sequence, where the position encoding includes information characterizing the spatial positional relationships between different feature points of each feature map in the feature map sequence, and information characterizing the temporal positional relationships between different feature maps in the sequence; and, for the feature map sequence with the position encoding added, extract features from the spatial and temporal dimensions based on the self-attention mechanism to obtain the classification features.
Therefore, optionally, extracting classification features from the feature map sequence based on the self-attention mechanism may include: adding a position encoding to the feature map sequence, the position encoding including information characterizing the spatial positional relationships between different feature points of each feature map in the feature map sequence and information characterizing the temporal positional relationships between different feature maps in the sequence; and, for the feature map sequence with the position encoding added, extracting features from the spatial and temporal dimensions based on the self-attention mechanism to obtain the classification features.
For the feature map sequence, in an optional embodiment, since features need to be extracted from both the spatial and temporal dimensions, to facilitate feature extraction based on the self-attention mechanism the feature maps may be integrated into a single overall feature. This overall feature can have a temporal dimension and a spatial dimension, so that simple feature transformations make it convenient to extract features from the temporal and spatial dimensions based on the self-attention mechanism.
Optionally, each feature map in the feature map sequence may be converted into a one-dimensional feature, and the one-dimensional features may be stacked to obtain a two-dimensional feature. The two-dimensional feature is the integrated overall feature.
This embodiment does not limit the way of converting a feature map into a one-dimensional feature.
Optionally, converting a feature map into a one-dimensional feature may consist of adding all the feature points of the feature map into the one-dimensional feature.
For example, for a feature map of size a*b, a one-dimensional feature of length c can be obtained through conversion, where c = a*b.
This embodiment does not limit the way of stacking the one-dimensional features.
Optionally, the one-dimensional features may be stacked according to the temporal relationships between the corresponding feature maps, obtaining the two-dimensional feature.
For example, from a one-dimensional features of length b, an a*b two-dimensional feature can be obtained by stacking.
As for the conversion of the feature map sequence, optionally, the feature map sequence as a whole may be regarded as one overall feature and then adjusted.
For example, if the feature map sequence contains n feature maps of size a*b, the sequence as a whole can be regarded as an a*b*n three-dimensional feature.
For this three-dimensional feature, each a*b feature map can be converted into a 1*c one-dimensional feature; that is, the a*b*n three-dimensional feature is converted into a c*n two-dimensional feature, where c = a*b.
Optionally, adding a position encoding to the feature map sequence may specifically include: converting each feature map in the feature map sequence into a one-dimensional feature; stacking the one-dimensional features to obtain a two-dimensional feature; and adding a position encoding to the resulting two-dimensional feature.
Specifically, the one-dimensional features converted from the individual feature maps in the feature map sequence may be stacked to obtain the two-dimensional feature.
Optionally, adding a position encoding to the two-dimensional feature may consist of, for each one-dimensional feature stacked in the two-dimensional feature, adding to each of its feature points information characterizing spatial positional relationships, which may specifically include the feature point's coordinates in the feature map; and, for each one-dimensional feature as a whole, adding information characterizing temporal positional relationships, which may specifically include a timestamp.
Correspondingly, for the feature map sequence with the position encoding added, extracting features from the spatial and temporal dimensions based on the self-attention mechanism may include: for the two-dimensional feature with the position encoding added, extracting features from the spatial and temporal dimensions based on the self-attention mechanism.
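A minimal sketch of this flatten-and-stack conversion; the concrete sizes in the usage example are illustrative:

```python
import torch

def stack_feature_maps(maps: torch.Tensor) -> torch.Tensor:
    """maps: (n, a, b) tensor holding n feature maps of size a x b in temporal
    order. Each map is flattened into a one-dimensional feature of length
    c = a * b; stacking the n results gives the two-dimensional overall
    feature described above."""
    n, a, b = maps.shape
    return maps.reshape(n, a * b)  # (n, c) with c = a * b

# Example: 16 feature maps of size 7 x 7 become a 16 x 49 two-dimensional
# feature; a position encoding of the same shape (temporal index per row,
# spatial coordinate per column) can then simply be added to it.
x = stack_feature_maps(torch.randn(16, 7, 7))  # shape (16, 49)
```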
In an optional embodiment, for the two-dimensional feature with the position encoding added, a sequence in the spatial dimension or a sequence in the temporal dimension can be obtained by transposition, making it convenient to extract features based on the self-attention mechanism from the temporal dimension and the spatial dimension respectively.
For example, from a one-dimensional features of length b, an a*b two-dimensional feature can be obtained by stacking, to which the position encoding is then added. Between the a one-dimensional features there are temporal positional relationships; within a one-dimensional feature of length b there are spatial positional relationships.
Therefore, for the a*b two-dimensional feature, features can be extracted from the temporal dimension based on the self-attention mechanism. In addition, the a*b two-dimensional feature can be transposed to obtain a b*a two-dimensional feature, so that features can be extracted from the spatial dimension based on the self-attention mechanism.
Optionally, since the input features and output features of the self-attention mechanism have the same size, features can be extracted serially from the spatial dimension based on the self-attention mechanism, the extracted features can then be transposed, and features can then be extracted serially from the temporal dimension based on the self-attention mechanism.
Of course, features can also first be extracted serially from the temporal dimension based on the self-attention mechanism, the extracted features transposed, and features then extracted serially from the spatial dimension based on the self-attention mechanism.
For example, from a one-dimensional features of length b, an a*b two-dimensional feature can be obtained by stacking, to which the position encoding is then added. Between the a one-dimensional features there are temporal positional relationships; within a one-dimensional feature of length b there are spatial positional relationships.
Therefore, for the a*b two-dimensional feature, features can be extracted from the temporal dimension based on the self-attention mechanism, obtaining an a*b first feature. The first feature can then be transposed to obtain a b*a second feature, and features can be extracted from the second feature from the spatial dimension based on the self-attention mechanism.
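The sketch below illustrates this transposition trick with standard multi-head self-attention. An embedding dimension D is added to each feature point, which is a practical assumption since attention layers operate on vectors, and the sizes T=8, S=49, D=64 are illustrative; as noted above, input and output sizes match, which is what makes repeated transposition possible:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
tokens = torch.randn(8, 49, 64)              # (T, S, D): frames x positions x channels

# Spatial attention: each of the T frames is one sequence of S tokens.
space_out, _ = attn(tokens, tokens, tokens)  # (8, 49, 64)

# Temporal attention: transpose so each of the S spatial positions is one
# sequence of T tokens, attend, then transpose back. (In practice the spatial
# and temporal steps would use separately trained attention layers.)
t = tokens.transpose(0, 1)                   # (49, 8, 64)
time_out, _ = attn(t, t, t)
time_out = time_out.transpose(0, 1)          # back to (8, 49, 64)
```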
This method flow does not limit the order between extracting features from the spatial dimension based on the self-attention mechanism and extracting features from the temporal dimension based on the self-attention mechanism.
In an optional embodiment, features may be extracted in parallel from the spatial dimension based on the self-attention mechanism and from the temporal dimension based on the self-attention mechanism, after which the extracted features may be combined to obtain the classification features.
This embodiment does not limit the way of combining the features. Optionally, it may be feature concatenation; specifically, the features extracted from the spatial dimension based on the self-attention mechanism and the features extracted from the temporal dimension based on the self-attention mechanism may be concatenated into one feature, which serves as the classification feature.
This embodiment does not limit the number of feature extractions. Optionally, features may be serially extracted at least once from the spatial dimension based on the self-attention mechanism, and serially extracted at least once from the temporal dimension based on the self-attention mechanism, after which the extracted features may be combined to obtain the classification features.
Optionally, serially extracting features at least once from the spatial dimension based on the self-attention mechanism may include: determining the two-dimensional feature with the position encoding added as the current feature; and cyclically executing the following step until a preset number of cycles is reached: extracting features from the current feature in the spatial dimension based on the self-attention mechanism, and determining the extracted features as the current feature. The preset number of cycles may be at least one.
Optionally, serially extracting features at least once from the temporal dimension based on the self-attention mechanism may include: determining the two-dimensional feature with the position encoding added as the current feature; and cyclically executing the following step until a preset number of cycles is reached: extracting features from the current feature in the temporal dimension based on the self-attention mechanism, and determining the extracted features as the current feature. The preset number of cycles may be at least one.
In another optional embodiment, features may be extracted serially from the spatial and temporal dimensions based on the self-attention mechanism.
This embodiment does not limit the order or the number of times of extracting features from the spatial dimension based on the self-attention mechanism and from the temporal dimension based on the self-attention mechanism.
Optionally, features may be serially extracted multiple times from the spatial dimension based on the self-attention mechanism, or serially extracted multiple times from the temporal dimension based on the self-attention mechanism, or serially extracted multiple times alternating between the spatial and temporal dimensions based on the self-attention mechanism.
Optionally, for the feature map sequence with the position encoding added, features may be serially extracted at least once from the spatial dimension based on the self-attention mechanism, and then, for the extracted features, serially extracted at least once from the temporal dimension based on the self-attention mechanism.
Optionally, for the feature map sequence with the position encoding added, features may be serially extracted at least once from the spatial dimension based on the self-attention mechanism, and then, for the extracted features, serially extracted at least once from the temporal dimension based on the self-attention mechanism; then, for the extracted features, features may again be serially extracted at least once from the spatial dimension based on the self-attention mechanism and, for the extracted features, serially extracted at least once from the temporal dimension based on the self-attention mechanism.
Optionally, a preset number of feature extraction steps may be executed serially on the feature map sequence with the position encoding added.
Optionally, the feature extraction step may include: for the input features, serially extracting features at least once from the spatial dimension based on the self-attention mechanism, and then, for the extracted features, serially extracting features at least once from the temporal dimension based on the self-attention mechanism.
Optionally, the feature extraction step may include: for the input features, serially extracting features at least once from the temporal dimension based on the self-attention mechanism, and then, for the extracted features, serially extracting features at least once from the spatial dimension based on the self-attention mechanism.
Optionally, there may be a cascade relationship between the different serially executed feature extraction steps. Specifically, the features extracted in any feature extraction step may be input into the next feature extraction step.
Optionally, the different serially executed feature extraction steps may differ from one another. Specifically, the weights of the self-attention mechanism may differ, or the number or order of feature extractions may differ.
Optionally, the feature map sequence with the position encoding added may be determined as the current feature, and the following steps executed cyclically until a preset loop stop condition is met: for the current feature, serially extract features at least once from the spatial dimension based on the self-attention mechanism, and then, for the extracted features, serially extract features at least once from the temporal dimension based on the self-attention mechanism; determine the extracted features as the current feature.
This embodiment does not limit the specific preset loop stop condition.
Optionally, the preset loop stop condition may include at least one of the following: the loop reaching a preset number of iterations, the total number of feature extractions reaching a preset number, the time spent extracting features reaching a preset duration, and so on.
Optionally, in different iterations of the loop, the method of serially extracting features from the spatial dimension based on the self-attention mechanism may differ; specifically, the weights of the self-attention mechanism may differ, or the number of serial extractions may differ.
Optionally, the loop may be allowed to stop partway through an iteration; specifically, during the loop, the loop may be stopped after features have been serially extracted at least once from the spatial dimension based on the self-attention mechanism.
In different iterations of the loop, the number of feature extractions may differ; the number of feature extractions is not specifically limited.
For example, in the first iteration, features may be serially extracted three times from the spatial dimension based on the self-attention mechanism for the current feature, and then serially extracted twice from the temporal dimension based on the self-attention mechanism for the extracted features. In the second iteration, features may be serially extracted once from the spatial dimension based on the self-attention mechanism for the current feature, and then serially extracted five times from the temporal dimension based on the self-attention mechanism for the extracted features.
Optionally, the feature map sequence with the position encoding added may be determined as the current feature, and the following steps executed cyclically until the preset loop stop condition is met: for the current feature, serially extract features at least once from the temporal dimension based on the self-attention mechanism, and then, for the extracted features, serially extract features at least once from the spatial dimension based on the self-attention mechanism; determine the extracted features as the current feature.
For a specific explanation, refer to the above embodiments.
Taking the two-dimensional feature obtained by stacking one-dimensional features as an example, optionally, for the two-dimensional feature with the position encoding added, features may be serially extracted at least once from the spatial dimension based on the self-attention mechanism; the extracted features may then be transposed, and, for the transposed features, features may be serially extracted at least once from the temporal dimension based on the self-attention mechanism. The extracted features may be transposed again, and, for the transposed features, features may be serially extracted at least once from the spatial dimension based on the self-attention mechanism. For an explanation of serially extracting features at least once, refer to the above embodiments.
d) Preset self-attention modules.
In an optional embodiment, in the shot classification network, preset self-attention modules may be used to implement the steps of extracting features based on the self-attention mechanism.
Optionally, a preset self-attention module may be used to extract features from the spatial dimension and/or the temporal dimension for the input features based on the self-attention mechanism, and to output them.
For a specific explanation of extracting features from the spatial dimension and/or the temporal dimension based on the self-attention mechanism, refer to the above embodiments.
Optionally, the shot classification network may include one or more preset self-attention modules.
Optionally, in the case where the shot classification network needs to extract features from the spatial dimension based on the self-attention mechanism, one or more preset self-attention modules may be selected that are used to serially extract features at least once from the spatial dimension for the input features based on the self-attention mechanism and to output them.
Correspondingly, for the feature map sequence with the position encoding added, extracting features from the spatial dimension based on the self-attention mechanism may include: inputting the feature map sequence with the position encoding added into the one or more preset self-attention modules and obtaining the output features.
The multiple preset self-attention modules may be in a cascade relationship.
Optionally, among N cascaded preset self-attention modules, with N ≥ 2, the output of the i-th preset self-attention module, for 1 ≤ i ≤ N-1, may be cascaded to the input of the (i+1)-th preset self-attention module.
Optionally, in the case where the shot classification network needs to extract features from the temporal dimension based on the self-attention mechanism, one or more preset self-attention modules may be selected that are used to serially extract features at least once from the temporal dimension for the input features based on the self-attention mechanism and to output them.
Optionally, in the case where the shot classification network needs to extract features from the spatial and temporal dimensions based on the self-attention mechanism, one or more preset self-attention modules may be selected that are used to extract features from the spatial and temporal dimensions for the input features based on the self-attention mechanism and to output them.
Of course, multiple preset self-attention modules may also be selected, including one or more preset self-attention modules used to serially extract features at least once from the spatial dimension for the input features based on the self-attention mechanism and output them, and one or more preset self-attention modules used to serially extract features at least once from the temporal dimension for the input features based on the self-attention mechanism and output them.
For a specific explanation, refer to the above embodiments.
In an optional embodiment, the shot classification network may include N cascaded preset self-attention modules, with N ≥ 2; the output of the i-th preset self-attention module, for 1 ≤ i ≤ N-1, is cascaded to the input of the (i+1)-th preset self-attention module.
A preset self-attention module may be used to extract features from the spatial and temporal dimensions based on the self-attention mechanism.
Correspondingly, optionally, for the feature map sequence with the position encoding added, extracting features from the spatial and temporal dimensions based on the self-attention mechanism may include: inputting the feature map sequence with the position encoding added into the first preset self-attention module, and determining the classification features based on the output of the N-th preset self-attention module.
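A minimal sketch of such a cascade, where `block_factory` builds one preset self-attention module (for example, an instance of the space/time block sketched at the end of this section); the factory-based construction is an illustrative assumption:

```python
import torch.nn as nn

class CascadedAttention(nn.Module):
    """N cascaded preset self-attention modules: the output of module i feeds
    the input of module i+1, and the classification feature is taken from the
    Nth module's output. After training, each module holds its own weights."""
    def __init__(self, block_factory, n_blocks: int):
        super().__init__()
        self.blocks = nn.ModuleList([block_factory() for _ in range(n_blocks)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)  # output of block i becomes input of block i+1
        return x          # output of the Nth block -> classification feature

# Usage (with the SpaceTimeBlock sketched below):
#   net = CascadedAttention(lambda: SpaceTimeBlock(dim=64, heads=4), n_blocks=4)
```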
本实施例并不限定预设自注意力模块中,从空间维度基于自注意力机制提取特征,和从时间维度基于自注意力机制提取特征之间的次序和次数。This embodiment is not limited to the order and number of times in the preset self-attention module between extracting features from the spatial dimension based on the self-attention mechanism and extracting features from the temporal dimension based on the self-attention mechanism.
Optionally, the preset self-attention module may be used to serially execute a preset number of feature extraction steps on the input features. For a detailed explanation, refer to the foregoing embodiments.
Optionally, the preset self-attention module may be used to: for the input features, serially extract features at least once from the spatial dimension based on the self-attention mechanism; then, for the extracted features, serially extract features at least once from the temporal dimension based on the self-attention mechanism; and output the extracted features.
Optionally, the preset self-attention module may be used to: for the input features, serially extract features at least once from the temporal dimension based on the self-attention mechanism; then, for the extracted features, serially extract features at least once from the spatial dimension based on the self-attention mechanism; and output the extracted features.
Optionally, the preset self-attention module may be used to: determine the input features as the current features, and repeat the following steps until a preset loop stop condition is met: for the current features, serially extract features at least once from the spatial dimension based on the self-attention mechanism, and then, for the extracted features, serially extract features at least once from the temporal dimension based on the self-attention mechanism; determine the extracted features as the current features. For a detailed explanation, refer to the foregoing embodiments.
Optionally, the preset self-attention module may be used to: determine the input features as the current features, and repeat the following steps until a preset loop stop condition is met: for the current features, serially extract features at least once from the temporal dimension based on the self-attention mechanism, and then, for the extracted features, serially extract features at least once from the spatial dimension based on the self-attention mechanism; determine the extracted features as the current features. For a detailed explanation, refer to the foregoing embodiments.
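As an illustration of the spatial-then-temporal variant above, the following is a minimal sketch of one preset self-attention module, assuming PyTorch, a (T, S, C) feature layout (T feature maps, S spatial positions, C channels), and a chosen head count; the class name and the omission of residual connections and normalization are assumptions of the sketch, not details from this disclosure.

```python
import torch
import torch.nn as nn

class PresetSelfAttentionModule(nn.Module):
    """Sketch: one spatial-then-temporal preset self-attention module."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, S, C). Spatial dimension: each of the T frames attends
        # across its S positions (T acts as the batch axis).
        x, _ = self.spatial_attn(x, x, x)
        # Temporal dimension: transpose so attention runs across the T frames.
        x = x.transpose(0, 1)                 # (S, T, C)
        x, _ = self.temporal_attn(x, x, x)
        return x.transpose(0, 1)              # restore (T, S, C)
```

Swapping the two attention calls yields the temporal-then-spatial variant, and wrapping the body in a loop with a stop condition yields the repeated-extraction variants described above.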
Optionally, the weights of the self-attention mechanism may differ between different preset self-attention modules; specifically, the weights and other parameters of each preset self-attention module may be determined through model training.
Optionally, the order and the number of times that features are extracted from the spatial dimension and from the temporal dimension based on the self-attention mechanism may also differ between different preset self-attention modules.
For example, one preset self-attention module may first serially extract features at least once from the spatial dimension based on the self-attention mechanism for the input features, and then serially extract features at least once from the temporal dimension for the extracted features and output them; another preset self-attention module may first serially extract features at least once from the temporal dimension for the input features, and then serially extract features at least once from the spatial dimension for the extracted features and output them.
In a specific embodiment, the preset self-attention module may be a self-attention layer used to extract features from the input features based on the self-attention mechanism.
The shot classification network may include one self-attention layer or multiple self-attention layers in series, so that the parameters of the self-attention layers, including their weights, can be determined by training the shot classification network. After model training, the parameters of different self-attention layers usually differ.
Optionally, the self-attention layer may be used to extract features from the input features in the spatial and temporal dimensions based on the self-attention mechanism. Specifically, features may first be serially extracted at least once from the spatial dimension for the input features, and then serially extracted at least once from the temporal dimension for the extracted features and output; alternatively, features may first be serially extracted at least once from the temporal dimension, and then at least once from the spatial dimension, and output.
3) Classification features.
This method flow does not limit the specific way in which the classification features are obtained.
In an optional embodiment, the classification features may be extracted from the backboard area image sequence based on the self-attention mechanism.
The foregoing embodiments explain the features extracted from the backboard area image sequence based on the self-attention mechanism; this embodiment does not limit the way in which the classification features are obtained from them.
Optionally, the features extracted based on the self-attention mechanism may be aggregated to obtain the classification features.
For example, the features extracted based on the self-attention mechanism may be directly determined as the classification features.
The features extracted based on the self-attention mechanism may also be pooled, specifically by average pooling or max pooling, to obtain the classification features.
Optionally, a classification initial feature may be added to the features input into the self-attention mechanism.
The classification initial feature is not a feature of the backboard area image sequence to be recognized. Through repeated feature extraction, it can be used to aggregate the feature information that the self-attention mechanism has learned from the backboard area image sequence to be recognized, without affecting the original features of that sequence.
For example, it may aggregate the image content of the backboard area images that the self-attention mechanism has learned to be strongly associated with the shot recognition result in the spatial dimension, or the several backboard area images that it has learned to be strongly associated with the shot recognition result in the temporal dimension.
Since the input and output features of the self-attention mechanism have the same size, the current representation corresponding to the classification initial feature can be located among the features extracted based on the self-attention mechanism.
Specifically, the current representation corresponding to the classification initial feature may be determined as the classification feature and used for the subsequent prediction of the shot recognition result.
In the process of training the shot classification network, the classification initial feature can be used, via the self-attention mechanism, to learn the feature information in the backboard area image sequence to be recognized.
Optionally, in the case where a feature map sequence has been extracted from the backboard area image sequence to be recognized, extracting the classification features from the feature map sequence based on the self-attention mechanism may include: adding a classification initial feature to the feature map sequence; extracting features from the feature map sequence with the added classification initial feature based on the self-attention mechanism; and, from the extracted features, determining the current representation corresponding to the classification initial feature as the classification features.
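This use of a trainable classification initial feature resembles the class-token pattern of transformer classifiers. Below is a minimal sketch, assuming PyTorch, a single attention layer, and a (B, T, C) sequence of flattened feature maps; all names, sizes, and the zero initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClsTokenHead(nn.Module):
    """Sketch: prepend a learned classification initial feature, run
    self-attention, then read the token's representation back out."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # The classification initial feature is a trainable parameter.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, channels))
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (B, T, C), i.e. T flattened feature maps per sample.
        cls = self.cls_token.expand(seq.size(0), -1, -1)   # (B, 1, C)
        x = torch.cat([cls, seq], dim=1)                   # (B, T+1, C)
        x, _ = self.attn(x, x, x)
        # The token's current representation becomes the classification feature.
        return x[:, 0]                                     # (B, C)
```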
Optionally, in combination with the foregoing embodiments on position encoding: since the classification initial feature is not a feature of the backboard area image sequence to be recognized, a position code may be set for it to facilitate subsequent feature extraction based on the self-attention mechanism.
Optionally, position encoding may be added to the feature map sequence after the classification initial feature has been added.
Specifically, taking the two-dimensional feature obtained by stacking multiple one-dimensional features as an example, the classification initial feature may be added first, followed by the position encoding.
The position code set for the classification initial feature usually lies outside the feature map sequence proper. For example, in the temporal dimension, the classification initial feature may be added before the first feature map in the sequence or after the last one; in the spatial dimension, it may be added before the first feature point or after the last feature point of each feature map in the sequence.
This method flow does not limit the specific form of the classification initial feature.
Optionally, the classification initial feature may be a parameter of the shot classification network, and its specific values may be determined through model training.
Specifically, the values of the classification initial feature can be adjusted continuously while training the shot classification network; when training ends, the finally adjusted values are fixed and used for shot recognition.
Optionally, since the values of the classification initial feature need to be adjusted, a smaller classification initial feature requires fewer computing resources to adjust, which can improve the stability of model training.
This method flow does not limit the size of the classification initial feature. Optionally, its size may be 1*1; when it is actually used, it can be replicated multiple times to meet the requirements of the self-attention mechanism, specifically by a broadcast operation.
For example, if the self-attention mechanism requires an extended feature of size 1*N, the 1*1 classification initial feature can be replicated N times and assembled into a 1*N feature, which is then added to the features for the subsequent self-attention feature extraction steps.
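A minimal sketch of this broadcast operation, assuming PyTorch; N and the feature sizes are illustrative values, not from this disclosure.

```python
import torch

# A 1*1 trainable classification initial feature (illustrative values).
cls_init = torch.nn.Parameter(torch.zeros(1, 1))

N = 1024  # assumed width required by the self-attention input
# Broadcast (expand) the single value to a 1*N row without copying memory;
# gradients still flow back to the one underlying parameter.
cls_row = cls_init.expand(1, N)

features = torch.randn(32, N)                 # e.g. 32 stacked one-dimensional features
extended = torch.cat([cls_row, features], 0)  # (32+1) x N, as described above
```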
In this embodiment, a 1*1 classification initial feature can be adjusted conveniently during model training, improving the stability of training.
Of course, the classification initial feature may also have other sizes.
Optionally, the classification initial feature may comprise one or more feature maps, which can be added to the feature map sequence as features newly added in the temporal dimension.
Optionally, the classification initial feature may comprise one or more feature points newly added at the same position in each feature map, as features newly added in the spatial dimension of the feature map sequence.
Optionally, the classification initial feature may comprise one or more feature maps together with one or more feature points newly added at the same position in each feature map, serving as features newly added in both the temporal and the spatial dimension of the feature map sequence.
Of course, optionally, the features that the self-attention mechanism requires to be extended may be one or more feature maps; the classification initial feature can then be replicated multiple times and assembled into feature maps used to extend the features.
Optionally, the features that the self-attention mechanism requires to be extended may be one or more feature points newly added at the same position in each feature map; the classification initial feature can then be replicated multiple times and assembled into feature points at the same position in each feature map, used to extend the features.
In an optional embodiment, take the two-dimensional feature obtained by stacking multiple one-dimensional features as an example. Since the one-dimensional features stacked in the two-dimensional feature correspond to feature maps, a temporal relationship exists between them, so new feature points can be added to the two-dimensional feature in both dimensions.
Optionally, one or more one-dimensional features may be added alongside the temporally related one-dimensional features in the two-dimensional feature; the newly added one-dimensional features are the classification initial feature.
Optionally, one or more feature points may be added to each of the temporally related one-dimensional features in the two-dimensional feature; the newly added feature points are the classification initial feature.
Optionally, one or more one-dimensional features may be added alongside the temporally related one-dimensional features in the two-dimensional feature, and one or more feature points may further be added to each one-dimensional feature of the current two-dimensional feature; the newly added portion is the classification initial feature.
For example, for a one-dimensional features of length b, an a*b two-dimensional feature can be obtained by stacking, and position encoding is then added to it. Among the a one-dimensional features there is a temporal position relationship; within each one-dimensional feature of length b there is a spatial position relationship.
The classification initial feature may be a 1*b feature, added to the a*b two-dimensional feature to obtain an (a+1)*b two-dimensional feature.
The classification initial feature may be a 1*a feature, added to the a*b two-dimensional feature to obtain an a*(b+1) two-dimensional feature.
The classification initial feature may also be a 1*1 feature. If the a*b two-dimensional feature needs to be extended to an (a+1)*b two-dimensional feature, the classification initial feature can be replicated b times, assembled into a 1*b feature, and added to the a*b two-dimensional feature to obtain the (a+1)*b two-dimensional feature; specifically, the classification initial feature can be replicated by a broadcast operation.
Likewise, if the a*b two-dimensional feature needs to be extended to an a*(b+1) two-dimensional feature, the classification initial feature can be replicated a times, assembled into a 1*a feature, and added to obtain the a*(b+1) two-dimensional feature, again replicating the classification initial feature by a broadcast operation.
The two-dimensional feature after adding the classification initial feature may also be an (a+1)*(b+1) two-dimensional feature, in which the portion newly added relative to the a*b two-dimensional feature is the classification initial feature.
Regarding the method of determining the shot recognition result based on the classification features:
This method flow does not limit the specific method of determining the shot recognition result based on the classification features.
In an optional embodiment, once the classification features are obtained, they can be input into a pre-trained fully connected network, and the shot recognition result output by the fully connected network obtained.
Optionally, the fully connected network may be used to predict the shot recognition result from the input features, specifically from the input classification features.
Of course, other model structures may also be used to predict the shot recognition result from the classification features; this embodiment does not specifically limit this.
Optionally, in order to reduce the amount of computation and improve prediction efficiency, the classification features may first be processed and the fully connected network then used to predict the shot recognition result.
The shot classification network includes the fully connected network, and in the process of training the shot classification network the fully connected network is trained as well, so that it can be used to predict the shot recognition result, improving prediction accuracy and efficiency.
When the shot classification network is used for shot recognition, the trained fully connected network within it can be used to predict the shot recognition result based on the classification features.
Specifically, the classification features may be pooled, optionally by average pooling or max pooling, to obtain classification features of a preset feature size, which are then input into the pre-trained fully connected network to obtain the shot recognition result it outputs.
The preset feature size may be smaller than the original size of the classification features, thereby reducing the amount of data and computation and improving prediction efficiency.
Therefore, optionally, determining the shot recognition result based on the classification features may include: pooling the classification features to obtain features to be input of a preset feature size; and inputting the features to be input into the pre-trained fully connected network to obtain the shot recognition result it outputs.
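A minimal sketch of this pooling-plus-fully-connected prediction step, assuming PyTorch; the layer width, the choice of average pooling, and the two-class (made/missed) output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch: pool the classification feature to a preset size, then let
    a small fully connected network emit made/missed logits."""

    def __init__(self, channels: int = 128, num_classes: int = 2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)        # average pooling to 1 x C
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, cls_feat: torch.Tensor) -> torch.Tensor:
        # cls_feat: (B, L, C) representation of the classification initial feature.
        x = self.pool(cls_feat.transpose(1, 2)).squeeze(-1)  # (B, C)
        return self.fc(x)  # logits: shot successful vs. not
```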
This embodiment does not limit the specific form of the shot recognition result, as long as it can indicate whether the shot is successful.
4) Sliding window.
As explained in the foregoing embodiments, this method flow can perform shot recognition on the video to be recognized based on the shot classification network.
In an optional embodiment, since the shooting process in a basketball game is usually short, a clip of the shooting process can be obtained as the video to be recognized and shot recognition performed using the shot classification network.
Optionally, the video to be recognized may also comprise a longer basketball game segment, which may contain one or more shooting processes.
In order to improve the accuracy of shot recognition, in an optional embodiment the video to be recognized may be divided into multiple segments, so that shot recognition can be performed on each segment separately.
Since the video to be recognized may contain multiple shooting processes, dividing it into segments and performing shot recognition on each makes it easier to improve the accuracy of shot recognition, and also makes it possible to locate the video positions of the shooting processes and of the successful shots.
This embodiment does not limit the way in which the video to be recognized is divided into segments.
Optionally, the video to be recognized may be directly divided into multiple segments of a preset segment duration.
Optionally, in order to reduce the possibility that a single shooting process is split across different segments, a sliding window mechanism may be used for the division.
Specifically, starting from the first video frame of the video to be recognized, a sliding window length and a sliding step may be determined; the video clip contained in the window at each slide is then obtained through the sliding window mechanism as a divided video segment.
This embodiment does not limit the order in which shot recognition is performed on the different divided video segments.
Optionally, shot recognition may be performed on the divided video segments in parallel or serially.
Performing shot recognition in parallel can improve the efficiency of shot recognition.
In an optional embodiment, since the backboard area needs to be determined for subsequent shot recognition, the backboard area image sequence to be recognized can be divided directly and shot recognition performed on each division separately.
Optionally, inputting the backboard area image sequence to be recognized into the shot classification network and obtaining the shot recognition result it outputs may include: dividing the backboard area image sequence to be recognized into multiple backboard area image subsequences, inputting each divided subsequence into the shot classification network separately, and obtaining the shot recognition results output by the shot classification network.
This embodiment does not limit the way of dividing the backboard area image subsequences; specifically, a sliding window mechanism may be used. Since the different segments divided by the sliding window mechanism may overlap, when the backboard area image sequence to be recognized is divided into subsequences based on the sliding window mechanism, the backboard area need not be determined repeatedly for the video frames in the overlapping parts, which improves efficiency and saves computing resources.
In an optional embodiment, feature maps may need to be extracted from the backboard area images, and the different segments divided by the sliding window mechanism may overlap.
Therefore, to reduce the cost of feature extraction and improve the efficiency of feature extraction and shot recognition, the feature map sequence itself can be divided into multiple subsequences for subsequent shot recognition. On the basis of subsequences divided by the sliding window mechanism, feature maps need not be extracted repeatedly for the backboard area images in the overlapping segments, which improves efficiency and saves computing resources.
Optionally, extracting the classification features from the feature map sequence based on the self-attention mechanism may include: determining the first m feature maps of the feature map sequence as the feature map subsequence contained in the current sliding window, with m≥1; and repeating the following steps until the current sliding window can no longer move backward: extracting the classification features based on the self-attention mechanism for the feature map subsequence contained in the current sliding window, and moving the sliding window backward by a preset sliding step.
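The windowing loop above can be sketched as a small generator; a minimal sketch in Python, where the names and the commented downstream call are illustrative assumptions.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def sliding_subsequences(feature_maps: Sequence[T], m: int, step: int) -> Iterator[List[T]]:
    """Yield the feature-map subsequence held by each sliding window:
    start with the first m feature maps, then move backward by `step`
    until the window can no longer move. Over 10 maps, m=3 and step=1
    give 8 windows, matching the example later in this description."""
    start = 0
    while start + m <= len(feature_maps):
        yield list(feature_maps[start:start + m])
        start += step

# Usage sketch: each window's subsequence is fed to the self-attention
# stage separately (possibly in parallel), reusing the shared feature maps.
# for sub in sliding_subsequences(feature_map_sequence, m=3, step=1):
#     cls_feat = attention_stage(sub)   # hypothetical downstream call
```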
This embodiment does not limit the order in which the subsequent shot recognition steps are performed for the different feature map subsequences divided by the sliding window mechanism.
Optionally, the subsequent steps of shot recognition may be performed in parallel for the different feature map subsequences divided by the sliding window mechanism, or serially; performing them in parallel can improve the efficiency of shot recognition.
It should be noted that, in an optional embodiment, the classification features are extracted based on the self-attention mechanism for the feature map subsequence contained in the current sliding window; for the specific way of extracting the classification features, refer to the foregoing embodiments.
Optionally, extracting the classification features based on the self-attention mechanism for the feature map subsequence contained in the current sliding window may include: adding position encoding to the feature map subsequence contained in the current sliding window, where the position encoding may include information characterizing the spatial position relationship between different feature points of each feature map in the subsequence and information characterizing the temporal position relationship between different feature maps in the subsequence; and, for the position-encoded feature map subsequence, extracting features from the spatial and temporal dimensions based on the self-attention mechanism to obtain the classification features.
Optionally, adding position encoding to the feature map subsequence may include: converting each feature map in the subsequence into a one-dimensional feature; stacking the one-dimensional features to obtain a two-dimensional feature; and adding position encoding to the resulting two-dimensional feature.
Optionally, the stacking may specifically be performed on all or part of the converted one-dimensional features to obtain the two-dimensional feature, after which position encoding may be added to it.
Optionally, extracting the classification features based on the self-attention mechanism for the feature map subsequence may include: adding a classification initial feature to the feature map subsequence; extracting features based on the self-attention mechanism for the feature map subsequence with the added classification initial feature; and, from the extracted features, determining the current representation corresponding to the classification initial feature as the classification features.
For a detailed explanation, refer to the foregoing embodiments.
In this embodiment, by dividing the feature map sequence into multiple subsequences based on the sliding window mechanism for subsequent shot recognition, computing resources can be saved and both the efficiency and the accuracy of shot recognition improved.
Since multiple shot recognition results can be obtained for the multiple divided subsequences, the video clip of the video to be recognized represented by the corresponding subsequence can be determined based on each shot recognition result, so that the video position of a successful shot can be located.
The number of successful shots in the video to be recognized can also be determined from the number of shot recognition results indicating a successful shot, facilitating the subsequent calculation of the score of the basketball game.
5) Structure of the shot classification network.
On the basis of the function of the shot classification network explained in the foregoing embodiments, its structure can be specified.
This method flow does not limit the specific structure of the shot classification network; the following explanation is illustrative.
In an optional embodiment, the shot classification network may be used to: extract classification features from the input backboard area image sequence based on the self-attention mechanism; and determine the shot recognition result based on the classification features.
Accordingly, optionally, the shot classification network may include a module for extracting classification features based on the self-attention mechanism and a module for determining the shot recognition result based on the classification features.
Optionally, in its processing the shot classification network usually needs to extract the feature map sequence of the backboard area image sequence and preprocess the extracted feature map sequence, for example by adding position encoding, dividing subsequences, adding a classification initial feature, and other preprocessing operations.
The shot classification network is also used to further extract classification features from the preprocessed feature map sequence based on the self-attention mechanism, and finally to predict the shot recognition result based on the classification features.
Therefore, optionally, the structure of the shot classification network may include a feature map extraction module, a feature preprocessing module, a self-attention feature extraction module, and a prediction module.
Of course, the structure of the shot classification network is not specifically limited; this embodiment is illustrative only.
The feature map extraction module may be used to extract the feature map sequence of the input backboard area image sequence and to output the extracted feature map sequence.
The feature preprocessing module may be used to preprocess the feature map sequence output by the feature map extraction module.
Specifically, the feature preprocessing module may be used to add position encoding to the feature map sequence output by the feature map extraction module and to output the position-encoded feature map sequence.
Optionally, the position encoding may include information characterizing the spatial position relationship between different feature points of each feature map in the feature map sequence and information characterizing the temporal position relationship between different feature maps in the sequence.
The feature preprocessing module may also be used to add a classification initial feature to the feature map sequence output by the feature map extraction module and to output the feature map sequence with the added classification initial feature.
The feature preprocessing module may also be used to add both a classification initial feature and position encoding to the feature map sequence output by the feature map extraction module and to output the resulting feature map sequence.
The order in which the classification initial feature and the position encoding are added is not limited; specifically, the classification initial feature may be added first, followed by the position encoding.
Optionally, the preprocessing in the feature preprocessing module may at least include adding position encoding, so as to facilitate the subsequent feature extraction based on the self-attention mechanism.
The feature preprocessing module may also be used to divide the feature map sequence output by the feature map extraction module into feature map subsequences, which can be output directly.
The feature preprocessing module may also be used to divide the feature map sequence output by the feature map extraction module into feature map subsequences, and to add a classification initial feature and/or position encoding to each divided subsequence before outputting it.
Specifically, the sliding window mechanism may be used to divide the feature map subsequences.
The feature preprocessing module may be used to: for the feature map sequence output by the feature map extraction module, determine the first m feature maps as the feature map subsequence contained in the current sliding window, with m≥1; and repeat the following steps until the current sliding window can no longer move backward: add a classification initial feature and/or position encoding to the feature map subsequence contained in the current sliding window and output it, then move the sliding window backward by a preset sliding step.
Optionally, in the case where the feature preprocessing module adds position encoding to the feature map sequence or subsequence, the self-attention feature extraction module may be used to perform feature extraction based on the self-attention mechanism on the position-encoded feature map sequence or subsequence output by the feature preprocessing module, and to output the classification features.
Whether feature extraction is performed in the spatial dimension or the temporal dimension can be determined from the added position encoding.
Optionally, in the case where the feature preprocessing module does not add position encoding to the feature map sequence or subsequence, the self-attention feature extraction module may be used to add position encoding to the feature map sequence or subsequence output by the feature preprocessing module, perform feature extraction based on the self-attention mechanism, and output the classification features.
Regarding the self-attention feature extraction module: in an optional embodiment, it includes one preset self-attention module or multiple cascaded preset self-attention modules.
For an explanation of the preset self-attention module, refer to the foregoing embodiments.
Optionally, the prediction module may be used to predict the shot recognition result from the classification features output by the self-attention feature extraction module; specifically, a fully connected network may be used for the prediction.
For ease of understanding, Fig. 2 is a schematic structural diagram of a shot classification network according to an embodiment of the present invention.
The shot classification network may include a feature map extraction module, a feature preprocessing module, a self-attention feature extraction module, and a prediction module.
The output of the feature map extraction module may be cascaded to the input of the feature preprocessing module, and the output of the feature preprocessing module may be cascaded to the input of the self-attention feature extraction module; the self-attention feature extraction module may include multiple cascaded preset self-attention modules, and its output may be cascaded to the input of the prediction module.
In addition, the backboard area image sequence can be input into the shot classification network, that is, into the feature map extraction module.
The output of the shot classification network, namely the output of the prediction module, is the shot recognition result, which can be used to determine whether the shot is successful.
For ease of understanding, Fig. 3 is a schematic diagram of the principle of a shot classification network according to an embodiment of the present invention.
The shot classification network may include a feature map extraction module, a feature preprocessing module, a self-attention feature extraction module, and a prediction module.
The feature map extraction module may be used to extract a two-dimensional feature map from each backboard area image of the input backboard area image sequence using a pre-trained two-dimensional CNN network, thereby obtaining and outputting the feature map sequence.
The output of the feature map extraction module may be cascaded to the input of the feature preprocessing module.
The feature preprocessing module may be used to: divide the input feature map sequence into multiple feature map subsequences based on the sliding window mechanism; convert each feature map in each subsequence into a one-dimensional feature; stack all the converted one-dimensional features to obtain a two-dimensional feature; and then add a classification initial feature and position encoding to the two-dimensional feature.
The feature preprocessing module may be used to output the two-dimensional feature with the added classification initial feature and position encoding; its output may be cascaded to the input of the self-attention feature extraction module.
Specifically, for an input feature map sequence containing 10 feature maps, 8 feature map subsequences can be divided based on the sliding window mechanism with a sliding window length of 3 and a sliding step of 1.
For feature maps of size a*b, each of the 3 feature maps in a feature map subsequence can be converted into a 1*n one-dimensional feature with n = a*b, and the 3 resulting one-dimensional features stacked into a 3*n two-dimensional feature.
The classification initial feature can then be added in the temporal dimension to obtain a 4*n two-dimensional feature, after which position encoding is added; alternatively, the classification initial feature can be added in the spatial dimension to obtain a 3*(n+1) two-dimensional feature, after which position encoding is added.
Position encoding is added to the two-dimensional feature after the classification initial feature has been added; the position encoding may include information characterizing the spatial position relationship between different feature points of each feature map in the feature map sequence and information characterizing the temporal position relationship between different feature maps in the sequence.
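A minimal sketch of this preprocessing for one window, assuming PyTorch, the temporal-dimension variant (prepending a 1*n classification initial feature), and a single learned position-encoding table; the names and the learned-table choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeaturePreprocess(nn.Module):
    """Sketch: flatten each a*b feature map to 1*n (n = a*b), stack the
    T maps of one window into a T*n feature, prepend a 1*n classification
    initial feature, and add a learned position encoding."""

    def __init__(self, n: int, T: int = 3):
        super().__init__()
        self.cls_row = nn.Parameter(torch.zeros(1, n))   # classification initial feature
        self.pos = nn.Parameter(torch.zeros(T + 1, n))   # position-encoding table

    def forward(self, maps: torch.Tensor) -> torch.Tensor:
        # maps: (T, a, b) feature maps of one sliding-window subsequence.
        flat = maps.reshape(maps.size(0), -1)            # T x n
        x = torch.cat([self.cls_row, flat], dim=0)       # (T+1) x n
        return x + self.pos                              # add position encoding
```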
The self-attention feature extraction module may include 3 cascaded preset self-attention modules; in each preset self-attention module, features can be extracted once from the spatial dimension based on the self-attention mechanism for the input features, and then once from the temporal dimension based on the self-attention mechanism for the extracted features.
The output of the self-attention feature extraction module may be the classification feature. Specifically, from the features output by the last cascaded preset self-attention module, the current representation corresponding to the classification initial feature is determined as the classification feature.
Specifically, the current representation corresponding to the classification initial feature may also be pooled to obtain a fixed-size feature, which is determined as the classification feature.
The output of the self-attention feature extraction module may be cascaded to the input of the prediction module.
The prediction module may be used to predict the shot recognition result from the classification features output by the self-attention feature extraction module, using a pre-trained fully connected network.
To facilitate understanding, an embodiment of the present invention further provides a specific method embodiment.
In professional basketball games, referees keep the score, but in ordinary amateur games or in training there are usually no referees and the players have to keep score themselves.
Most current solutions require additional sensors, and some even require a wearable device on the player's wrist, which is inconvenient and costly.
This method embodiment proposes a vision-based shot recognition algorithm that can run on a mobile phone or an ordinary personal computer; shot recognition can be achieved using the phone's built-in camera or the court's surveillance camera, with the advantages of simple operation and low cost.
In a basketball game, shot recognition is used to judge whether a player's shot goes in, that is, whether it scores. A shot recognition algorithm enables automatic scoring, allowing players to focus on the game.
In this method embodiment, the video data of a basketball game can be divided into multiple video segments based on the sliding window mechanism.
Specifically, the sliding window length may be 2 seconds and the sliding step 1 second.
For each video segment, the following steps may be performed.
1. For each frame of the input video, detect the backboard using a target detection network trained on an open-source architecture (e.g., YOLOX or RetinaNet).
2. For the input video, take the minimal bounding box enclosing the backboard detected at the same position across all frames.
3. Enlarge the bounding box by a factor of 1.5, then crop the image content of the corresponding backboard area from the corresponding position of each video frame.
4. Resize the cropped image content to a fixed-size image (e.g., 64x64). Use the shot classification network to perform shot recognition, that is, to classify whether a goal was scored.
The shot classification network is a custom deep neural network comprising a two-dimensional convolutional network and a self-attention mechanism.
a) Years of practice have shown that two-dimensional convolutional networks offer high performance and high speed in image feature extraction. Large image datasets such as ImageNet can provide pre-training data for two-dimensional convolutional networks, reducing the amount of target-domain data required while improving network generalization.
b) The self-attention module has shown great potential for processing sequential data in natural language processing.
c) Based on the above, the shot classification network can be divided into two stages at design time.
The first stage uses a two-dimensional convolutional network for low-level, image-level feature extraction.
The second stage uses self-attention to fuse the multi-frame image features in the temporal domain.
To further improve the extraction of spatial features, the second stage can fine-tune the spatial features by means of transposition plus self-attention.
Since the first stage performs only image-level feature processing, the features it extracts can be reused in subsequent processing, reducing unnecessary repeated computation.
The self-attention of the second stage can effectively fuse multi-frame images in the temporal domain to obtain optimal results.
The network is defined as follows.
a) Each frame image can be processed by a weight-shared three-layer two-dimensional convolutional network. Each layer has a kernel size of 3; the first layer has a stride of 2 and the last two layers a stride of 1; each layer uses batch norm with a ReLU activation; and the output channel counts are 32, 64, and 128, respectively.
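A minimal sketch of this weight-shared backbone, assuming PyTorch; padding=1 and a 3-channel input are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class Backbone2D(nn.Module):
    """Sketch of the three-layer 2-D CNN: kernel 3, strides 2/1/1,
    batch norm and ReLU per layer, output channels 32/64/128."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        layers, strides, channels = [], (2, 1, 1), (32, 64, 128)
        c_in = in_channels
        for c_out, s in zip(channels, strides):
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=s, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            ]
            c_in = c_out
        self.net = nn.Sequential(*layers)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, 64, 64) cropped backboard image -> (B, 128, 32, 32)
        return self.net(frame)
```

Under these assumptions, a 64x64 crop yields a 128-channel 32x32 map, consistent with the typical spatial dimension S = 1024 and channel count C = 128 given below.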
b) Reshape the two-dimensional features generated in the previous step into one-dimensional features and concatenate them along the temporal dimension. The resulting concatenated feature has size T x S x C, where T is the temporal dimension, i.e. the number of image frames processed at once (typically 32), S is the spatial dimension (typically 1024), and C is the number of feature channels, i.e. the number of output channels of the above convolutional network (typically 128).
c) Process the result with 8 mixed self-attention layers.
Before the mixed self-attention processing, position encoding and a classification initial feature are added to the input features.
One self-attention module processes the concatenated features after the position encoding and classification initial feature have been added, specifically extracting features in the spatial dimension based on the self-attention mechanism. After this processing, the extracted features are transposed to S x T x C and processed by another self-attention module, specifically extracting features in the temporal dimension based on the self-attention mechanism; after processing, the feature order is restored to T x S x C.
The above processing is repeated 8 times, finally yielding a T x S x C feature.
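A minimal sketch of step c), assuming PyTorch and the typical sizes above; the head count, and the omission of the position encoding, classification initial feature, residual connections, and normalization, are simplifying assumptions of the sketch.

```python
import torch
import torch.nn as nn

class MixedAttentionStack(nn.Module):
    """Sketch: 8 mixed layers, each running self-attention over the
    spatial axis of a T x S x C feature, transposing to S x T x C for
    temporal self-attention, then restoring T x S x C."""

    def __init__(self, C: int = 128, depth: int = 8, heads: int = 8):
        super().__init__()
        self.spatial = nn.ModuleList(
            nn.MultiheadAttention(C, heads, batch_first=True) for _ in range(depth))
        self.temporal = nn.ModuleList(
            nn.MultiheadAttention(C, heads, batch_first=True) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, S, C), e.g. T=32, S=1024, C=128.
        for sp, tp in zip(self.spatial, self.temporal):
            x, _ = sp(x, x, x)          # attention across the S positions
            x = x.transpose(0, 1)       # S x T x C
            x, _ = tp(x, x, x)          # attention across the T frames
            x = x.transpose(0, 1)       # restore T x S x C
        return x
```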
d) From the resulting features, extract the current representation corresponding to the classification initial feature and determine it as the classification feature. Resize the classification feature to 1 x C by pooling, input it into a fully connected network, and obtain the classification result output by the fully connected network.
5. The result output by the classification network indicates whether the corresponding shot scored; if it did, the score is recorded accordingly.
With this method embodiment, the input is a set of consecutive frame images around the backboard. In practical use only the backboard needs to be detected; the backboard is a large, immobile target, so detection is extremely easy, and neither the basketball nor the hoop needs to be detected explicitly.
At the same time, because consecutive frames are used as input, whether a goal was scored can be judged automatically from auxiliary cues following the shot, for example the swinging of the net.
This method embodiment provides a vision-based shot recognition algorithm that can judge from a camera's video input whether a goal was scored, thereby achieving automatic scoring of the game. It can run directly on a mobile phone or an ordinary personal computer and has the advantages of simple deployment and portability.
Furthermore, it requires no modification of the hoop or backboard, does not interfere with the players' shooting, runs on a mobile phone or an ordinary personal computer, needs no professional installation, and is easy to use.
Corresponding to the above method embodiments, an embodiment of the present invention further provides an apparatus embodiment.
As shown in Fig. 4, Fig. 4 is a schematic structural diagram of a shot recognition apparatus according to an embodiment of the present invention.
The apparatus may include the following units.
A backboard recognition unit 401, configured to: obtain a video to be recognized; and determine the backboard area in the video to be recognized to obtain a backboard area image sequence to be recognized.
A classification network unit 402, configured to: input the backboard area image sequence to be recognized into the shot classification network and obtain the shot recognition result output by the shot classification network, the shot recognition result being used to indicate whether the shot is successful.
For a specific explanation, refer to the above method embodiments.
An embodiment of the present invention further provides a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any of the above method embodiments when executing the program.
An embodiment of the present invention further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform any of the above method embodiments.
图5是根据本发明实施例示出的一种配置本发明实施例方法的计算机设备硬件结构示意图,该设备可以包括:处理器1010、存储器1020、输入/输出接口1030、通信接口1040和总线1050。其中处理器1010、存储器1020、输入/输出接口1030和通信接口1040通过总线1050实现彼此之间在设备内部的通信连接。Figure 5 is a schematic hardware structure diagram of a computer device configured to configure a method according to an embodiment of the present invention. The device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040 and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040 implement communication connections between each other within the device through the bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided by the embodiments of the present invention.
The memory 1020 may be implemented as ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs. When the technical solutions provided by the embodiments of the present invention are implemented in software or firmware, the relevant program code is stored in the memory 1020 and is invoked and executed by the processor 1010.
The input/output interface 1030 is used to connect input/output modules for information input and output. An input/output module may be configured within the device as a component (not shown in the figure) or externally connected to the device to provide the corresponding functions. Input devices may include keyboards, mice, touch screens, microphones, and various sensors; output devices may include displays, speakers, vibrators, and indicator lights.
The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable the device to communicate and interact with other devices. The communication module may communicate by wired means (such as USB or a network cable) or by wireless means (such as a mobile network, WiFi, or Bluetooth).
The bus 1050 includes a path that carries information between the components of the device (for example, the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).
It should be noted that although the above device shows only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050, in specific implementations the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may alternatively include only the components necessary to implement the embodiments of the present invention, rather than all of the components shown in the figure.
Embodiments of the present invention further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the above method embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
From the description of the above embodiments, those skilled in the art can clearly understand that the embodiments of the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solutions of the embodiments of the present invention, or the part that makes a contribution, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of the present invention.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail transceiver, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, since the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief, and the relevant parts of the description of the method embodiment may be consulted. The apparatus embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separate, and when implementing the embodiments of the present invention, the functions of the modules may be realized in the same piece, or in multiple pieces, of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The above are only specific implementations of the embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the embodiments of the present invention, and these improvements and refinements shall also be regarded as falling within the protection of the embodiments of the present invention.
In the present invention, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance. The term "plurality" means two or more, unless expressly limited otherwise.
Other embodiments of the invention will readily occur to those skilled in the art from consideration of the specification and practice of the disclosure herein. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the invention indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (13)

  1. A shot recognition method, comprising:
    obtaining a video to be recognized;
    determining the backboard region in the video to be recognized, and obtaining a sequence of backboard region images to be recognized;
    inputting the sequence of backboard region images into a shot classification network, and obtaining a shot recognition result output by the shot classification network; wherein the shot recognition result is used to indicate whether the shot was successful.
  2. The method according to claim 1, wherein the shot classification network is configured to:
    extract a feature map for each backboard region image in the input sequence of backboard region images, obtaining a feature map sequence;
    extract classification features from the feature map sequence based on a self-attention mechanism;
    determine the shot recognition result according to the classification features.
  3. The method according to claim 2, wherein extracting classification features from the feature map sequence based on a self-attention mechanism comprises:
    adding a position encoding to the feature map sequence; wherein the position encoding includes: information characterizing the spatial positional relationships between different feature points within each feature map in the sequence, and information characterizing the temporal positional relationships between different feature maps in the sequence;
    extracting features, based on the self-attention mechanism, from the position-encoded feature map sequence along the spatial dimension and the temporal dimension, obtaining the classification features.
  4. The method according to claim 3, wherein adding the position encoding to the feature map sequence comprises:
    converting each feature map in the feature map sequence into a one-dimensional feature;
    stacking the one-dimensional features to obtain a two-dimensional feature;
    adding the position encoding to the two-dimensional feature.
  5. The method according to claim 3, wherein the shot classification network includes N cascaded preset self-attention modules, N ≥ 2; for the i-th preset self-attention module, 1 ≤ i ≤ N−1, its output is cascaded to the input of the (i+1)-th preset self-attention module; each preset self-attention module is used to extract features, based on the self-attention mechanism, along the spatial dimension and the temporal dimension;
    extracting features, based on the self-attention mechanism, from the position-encoded feature map sequence along the spatial dimension and the temporal dimension comprises:
    inputting the position-encoded feature map sequence into the first preset self-attention module, and determining the classification features based on the output of the N-th preset self-attention module.
  6. The method according to claim 5, wherein the preset self-attention module is configured to:
    for the input features, serially extract features at least once along the spatial dimension based on the self-attention mechanism, then, for the extracted features, serially extract features at least once along the temporal dimension based on the self-attention mechanism, and output the extracted features;
    or,
    for the input features, serially extract features at least once along the temporal dimension based on the self-attention mechanism, then, for the extracted features, serially extract features at least once along the spatial dimension based on the self-attention mechanism, and output the extracted features.
  7. The method according to claim 2, wherein extracting classification features from the feature map sequence based on a self-attention mechanism comprises:
    adding an initial classification feature to the feature map sequence;
    extracting features, based on the self-attention mechanism, from the feature map sequence with the added initial classification feature;
    determining, from the extracted features, the current representation corresponding to the initial classification feature as the classification feature.
  8. The method according to claim 2 or 7, wherein determining the shot recognition result according to the classification features comprises:
    pooling the classification features to obtain an input feature of a preset feature size;
    inputting the input feature into a pre-trained fully connected network, and obtaining the shot recognition result output by the fully connected network.
  9. The method according to claim 2, wherein extracting classification features from the feature map sequence based on a self-attention mechanism comprises:
    determining the first m feature maps of the feature map sequence as the feature map subsequence contained in the current sliding window, m ≥ 1;
    repeating the following steps until the current sliding window can no longer move backward: extracting classification features, based on the self-attention mechanism, from the feature map subsequence contained in the current sliding window; and moving the sliding window backward by a preset sliding step.
  10. The method according to claim 1, wherein obtaining the sequence of backboard region images to be recognized comprises:
    for each determined backboard region, cropping the image content containing that backboard region, resizing the cropped result to a preset image size, and adding the result to the sequence of backboard region images to be recognized; the sequence is ordered by the temporal order of the video frames in which the backboard region images are located.
  11. A shot recognition apparatus, comprising:
    a backboard identification unit, configured to: obtain a video to be recognized; and determine the backboard region in the video to be recognized, obtaining a sequence of backboard region images to be recognized;
    a classification network unit, configured to: input the sequence of backboard region images into a shot classification network, and obtain a shot recognition result output by the shot classification network; wherein the shot recognition result is used to indicate whether the shot was successful.
  12. An electronic device, comprising:
    at least one processor; and,
    a memory communicatively connected to the at least one processor; wherein,
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 10.
  13. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 10.
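By way of illustration only, claim 2's per-image feature extraction could be realized with any convolutional backbone; the patent does not name one. The following sketch assumes a torchvision ResNet-18 with its pooling and classification layers removed:

```python
# Hedged sketch of claim 2: extract a feature map for each backboard-region
# image, yielding a feature map sequence. The ResNet-18 backbone is an
# assumption, not part of the disclosure.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FeatureMapExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to the last convolutional stage; drop the
        # global pooling and classification layers so feature maps survive.
        self.stem = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, clips):
        # clips: (batch, frames, 3, H, W) backboard-region image sequence
        b, t, c, h, w = clips.shape
        maps = self.stem(clips.reshape(b * t, c, h, w))  # (b*t, C', H', W')
        return maps.reshape(b, t, *maps.shape[1:])       # feature map sequence
```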
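Claims 3 and 4 flatten each feature map into a one-dimensional feature, stack the results into a two-dimensional layout, and add a position encoding carrying both within-frame spatial relationships and across-frame temporal relationships. A minimal sketch, assuming learned embeddings (one plausible choice; the claims do not fix the encoding scheme):

```python
# Sketch of claims 3-4: flatten each (C, H, W) feature map to one-dimensional
# token features, stack per-frame tokens into a two-dimensional (tokens x
# channels) layout, and add spatial and temporal position encodings.
import torch
import torch.nn as nn

class SpatioTemporalPositionEncoding(nn.Module):
    def __init__(self, num_frames, height, width, channels):
        super().__init__()
        # One embedding per spatial location (shared across frames) and one
        # per frame index (shared across locations); both broadcast in forward.
        self.spatial = nn.Parameter(torch.zeros(1, 1, height * width, channels))
        self.temporal = nn.Parameter(torch.zeros(1, num_frames, 1, channels))

    def forward(self, feature_maps):
        # feature_maps: (batch, frames, channels, height, width)
        tokens = feature_maps.flatten(3)    # one-dimensional: (b, t, c, h*w)
        tokens = tokens.transpose(2, 3)     # stacked 2-D layout: (b, t, h*w, c)
        return tokens + self.spatial + self.temporal
```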
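The cascaded modules of claims 5 and 6 extract features along the spatial and temporal dimensions in series, in either order. The sketch below shows one assumed realization, spatial attention within each frame followed by temporal attention across frames, each applied once; claim 6's alternative ordering would simply swap the two steps, and "at least once" permits repeating either step. The layer sizes and the use of nn.MultiheadAttention are assumptions.

```python
import torch
import torch.nn as nn

class PresetSelfAttentionModule(nn.Module):
    """One of the N cascaded modules: attends over space, then over time."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (batch, frames, tokens, channels)
        b, t, n, c = x.shape
        # Spatial step: tokens within each frame attend to one another.
        s = x.reshape(b * t, n, c)
        q = self.norm1(s)
        s = s + self.spatial_attn(q, q, q, need_weights=False)[0]
        # Temporal step: each spatial location attends across frames.
        u = s.reshape(b, t, n, c).transpose(1, 2).reshape(b * n, t, c)
        q = self.norm2(u)
        u = u + self.temporal_attn(q, q, q, need_weights=False)[0]
        return u.reshape(b, n, t, c).transpose(1, 2)

# Claim 5's cascade of N modules is then a sequential stack, e.g.
# cascade = build_cascade(4, 256); y = cascade(tokens)
def build_cascade(n_modules, channels):
    return nn.Sequential(*[PresetSelfAttentionModule(channels) for _ in range(n_modules)])
```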
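Claim 7's "initial classification feature" behaves like a learnable classification token whose final representation becomes the classification feature, and claim 8 pools that feature to a preset size before a pre-trained fully connected network. A minimal sketch, assuming a two-class made/missed output:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, channels, pooled_size=256, num_classes=2):
        super().__init__()
        # Claim 7: a learnable initial classification feature prepended to
        # the token sequence; its final representation is the classification feature.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, channels))
        # Claim 8: pooling to a preset feature size, then a fully connected net.
        self.pool = nn.AdaptiveAvgPool1d(pooled_size)
        self.fc = nn.Linear(pooled_size, num_classes)

    def prepend_cls(self, tokens):
        # tokens: (batch, sequence, channels)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        return torch.cat([cls, tokens], dim=1)

    def classify(self, attended):
        # attended: (batch, 1 + sequence, channels); position 0 holds the
        # current representation of the initial classification feature.
        cls_feature = attended[:, 0, :]                          # (batch, channels)
        pooled = self.pool(cls_feature.unsqueeze(1)).squeeze(1)  # preset size
        return self.fc(pooled)                                   # made/missed logits
```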
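Finally, claim 9's sliding-window traversal of the feature map sequence and claim 10's crop-and-resize preprocessing might look as follows. The window length m, stride, and helper names are illustrative assumptions; the claims fix only that m ≥ 1, that the window advances by a preset step until it can no longer move, and that crops keep the temporal order of their source frames.

```python
import cv2
import numpy as np

def sliding_window_classify(feature_maps, extract_classification, m=8, stride=4):
    """Claim 9: start from the first m feature maps, classify the current
    window, then slide by the preset step until the window cannot move."""
    results = []
    start = 0
    while start + m <= len(feature_maps):
        window = feature_maps[start:start + m]
        results.append(extract_classification(window))
        start += stride
    return results

def build_crop_sequence(frames, boxes, size=(112, 112)):
    """Claim 10: crop each detected backboard region, resize it to the preset
    image size, and keep the crops in the temporal order of their frames."""
    crops = []
    for frame, (x, y, w, h) in zip(frames, boxes):  # frames already time-ordered
        crops.append(cv2.resize(frame[y:y + h, x:x + w], size))
    return np.stack(crops)
```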
PCT/CN2023/110320 2022-08-23 2023-07-31 Basketball shot recognition method and apparatus, device and storage medium WO2024041319A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211014399.XA CN115376047A (en) 2022-08-23 2022-08-23 Shooting identification method, device, equipment and storage medium
CN202211014399.X 2022-08-23

Publications (1)

Publication Number Publication Date
WO2024041319A1 true WO2024041319A1 (en) 2024-02-29

Family

ID=84067387

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/110320 WO2024041319A1 (en) 2022-08-23 2023-07-31 Basketball shot recognition method and apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN115376047A (en)
WO (1) WO2024041319A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376047A (en) * 2022-08-23 2022-11-22 京东方科技集团股份有限公司 Shooting identification method, device, equipment and storage medium
CN116109981B (en) * 2023-01-31 2024-04-12 北京智芯微电子科技有限公司 Shooting recognition method, basketball recognition device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701460A (en) * 2016-01-07 2016-06-22 王跃明 Video-based basketball goal detection method and device
US20170262995A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Video analysis with convolutional attention recurrent neural networks
CN110942022A (en) * 2019-11-25 2020-03-31 维沃移动通信有限公司 Shooting data output method and electronic equipment
CN115376047A (en) * 2022-08-23 2022-11-22 京东方科技集团股份有限公司 Shooting identification method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN115376047A (en) 2022-11-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23856409

Country of ref document: EP

Kind code of ref document: A1