CN109934188B - Slide switching detection method, system, terminal and storage medium

Slide switching detection method, system, terminal and storage medium

Info

Publication number
CN109934188B
Authority
CN
China
Prior art keywords
convolution
network model
module
residual
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910208617.5A
Other languages
Chinese (zh)
Other versions
CN109934188A
Inventor
马然
刘致金
李凯
沈礼权
安平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN201910208617.5A priority Critical patent/CN109934188B/en
Publication of CN109934188A publication Critical patent/CN109934188A/en
Application granted granted Critical
Publication of CN109934188B publication Critical patent/CN109934188B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a slide switching detection method, which comprises the following steps: connecting a three-classification output layer behind a convolutional neural network structure to obtain classification information of the video frame volumes, yielding a three-classification convolutional neural network model; designing a time-space residual error network model on the basis of the structure of the three-classification convolutional neural network model, the 3D convolution module in the 3D ConvNet network, and the residual error module in the ResNet network; extracting the time and space characteristics of the video frames with the 3D convolution module, fusing the residual error module into the 3D convolution module to obtain a 3D convolution residual error module, and constructing a time and space residual error network model for video frame volume classification. The invention also provides a corresponding detection system, a terminal, and a computer-readable storage medium. The method overcomes the interference of lens movement, speaker movement, and the switching of multiple PTZ lenses in lecture video, and achieves better accuracy than existing methods.

Description

Slide switching detection method, system, terminal and storage medium
Technical Field
The invention relates to a video information processing method, and in particular to a slide switching detection method and system based on a time and space residual error deep learning network model.
Background
With the wave of informatization and the development of multimedia technology, digital video has found ever wider application thanks to the intuitiveness, concreteness, and efficiency of video information, and the internet spreads this visual content widely. Online learning has become an important way to acquire knowledge: people record learning videos of various forms in meeting rooms or classrooms with intelligent devices, and these videos spread to more people through the internet. However, these videos receive no structured processing, and learning websites present the entire video to the user. If a user is interested in a certain knowledge point, the user often has to browse the whole video to find it, which consumes a great deal of time and energy. According to statistics, about 400 hours of video are uploaded to YouTube every minute. If none of these videos are processed, many learners will be overwhelmed by them and their interest in learning will decline. Therefore, for online education and other applications, automatically extracting representative information from lecture videos and summarizing them is very important. Among the relevant techniques, slide switching detection is one of the most critical in lecture video summarization and is an important research topic.
A large proportion of lecture videos contain slide shows, and for such videos slide switch detection is an important research point of lecture video summarization. A scene containing the lecturer, the projected slides, and the audience is recorded into a lecture video through a PTZ (pan-tilt-zoom) camera. According to the recording mode, lecture videos can be divided into three types: still-camera recording, moving-camera recording, and camera-switching recording. Because a lecture video records not only the projection area but also the speaker and the audience, the speaker, the audience, and their backgrounds cause interference to slide switching detection, such as camera movement, camera switching, and speaker movement. Moreover, a slide switch often occurs within a short time as the content of the projection area changes, making the switching moment difficult to recognize manually. Hence, lecture video slide switch detection is a meaningful and challenging task.
Owing to complex noise interference, scholars at home and abroad have proposed detection methods for different types of video. Some methods detect the image similarity of adjacent frames using visual features such as color histograms, SIFT, HOG, and wavelets. However, these methods do not account for the interference of speaker movement, shot movement, and shot switching; for example, a shot switch from a computer screen to the speaker causes a large video change. Other approaches are directed at specific video types, such as single-shot and fixed-shot videos without shot switching. All of these methods have their own limitations.
The applicant's earlier Chinese patent application No. 201710878115.4 discloses a slide switch detection method based on a sparse time-varying graph. For lecture videos of the lecturer, the slides, and the audience shot by multiple cameras, the video is first segmented through feature point detection and matching, and a sparse graph is built at each time point with each video segment as a node, so that the slide switching detection problem is converted into a mapping-graph adjacency matrix problem; a change between adjacency matrices reflects a slide switch. That method works well on still-shot and shot-switching lecture videos, but produces large errors when complicated camera motion, such as shot movement, zooming, and switching, occurs simultaneously in the video. In addition, that application relies on conventional image feature points and ignores the switching information between adjacent frames.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a method, system, terminal, and storage medium for detecting slide switching based on a time-space residual error network model, which can effectively solve the slide switching detection problem under interference such as lens movement/zooming, speaker movement, and lens switching. Compared with the prior art, detecting slide switching with the time-space residual error network model overcomes the interference of lens movement/zooming, speaker movement, and switching among multiple PTZ lenses in lecture video, achieves high accuracy, and handles a wide range of lecture video types.
The invention adopts a 3D ConvNet convolutional neural network, which extends the convolution kernel from 2D to 3D, to extract the spatial and temporal characteristics of the video. As the number of stacked convolutional layers increases, the 3D ConvNet consumes more memory, which makes the model harder to train. To solve this problem, the invention employs a residual network model (ResNet). The new convolutional network model provided by the invention not only saves training time but is also easier to train and obtains better slide switching detection results.
According to a first aspect of the present invention, there is provided a method for detecting a slide switch based on a time-space residual network model, comprising:
segmenting a video recorded through a single or multiple shots containing slides, a presenter and/or a viewer into a plurality of video frame volumes containing video frames;
designing a convolutional neural network structure by adopting a design principle of a network structure for extracting the spatial domain characteristics of the picture; connecting a three-classification output layer behind the convolutional neural network structure, wherein the three-classification output layer is used for obtaining classification information of the video frame volume to obtain a three-classification convolutional neural network model; designing a time-space residual error network model on the basis of the structure of the three-classification network model, a 3D convolution module in the 3D ConvNet network and a residual error module in a residual error network model ResNet network;
extracting the time and space characteristics of the video frame by using a 3D convolution module in a 3D ConvNet network, fusing a residual module in a residual network model ResNet into the 3D convolution module in the 3D ConvNet network to obtain a 3D convolution residual module, and constructing a time and space residual network model for video frame volume classification; wherein:
dividing a training video into a plurality of video frame volumes containing video frames, classifying the video frame volumes, and then sending the video frame volumes into a time-space residual error network model for training to obtain a trained time-space residual error network model;
and sending the video frame volume of the test video into a trained time and space residual error network model to obtain a classification result, and detecting the slide switching moment.
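By way of illustration only, the segmentation step can be sketched as follows. This is a minimal sketch assuming OpenCV for decoding, a 112 × 112 frame size, and two consecutive frames per video frame volume (as in the embodiment described later); the helper name and parameters are our own choices, not taken from the patent:

```python
import cv2
import numpy as np

def video_to_frame_volumes(path, size=(112, 112)):
    """Split a lecture video into two-frame volumes (hypothetical helper)."""
    cap = cv2.VideoCapture(path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frame = cv2.resize(frame, size)
        frames.append(frame.transpose(2, 0, 1))  # HWC -> CHW
        ok, frame = cap.read()
    cap.release()
    # Volume i stacks consecutive frames (i, i+1) along a new depth axis,
    # giving shape (C, D=2, H, W) -- the layout expected by 3D convolution.
    return [np.stack([frames[i], frames[i + 1]], axis=1).astype(np.float32) / 255.0
            for i in range(len(frames) - 1)]
```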
Preferably, the structure of the three-classification convolutional neural network model is a 12-layer convolutional neural network structure comprising 8 convolutional layers and 4 fully-connected layers. As the network deepens, the width and height of the image are reduced according to a fixed rule: after each pooling, the width and height of the image are halved and the number of channels is doubled. The final output layer is a three-classification output layer for obtaining classification information of the video frame volume. The network structure is regular, uses relatively few hyper-parameters, and focuses on keeping the network simple.
Preferably, the design principle of the network structure for extracting the spatial domain features of the picture mainly follows the following two design principles:
-if the time and space feature map sizes of the 3D convolution residual module input and output are the same, the number of channels of the convolution kernel of the convolutional neural network does not change;
if the size of the time and space feature map output by the 3D convolution residual module is half of the size of the input time and space feature map, the number of channels of the convolution kernel of the convolution neural network is doubled to ensure the consistency of time complexity.
Preferably, the 3D convolution module in the 3D ConvNet network applies a 3D convolutional layer and a 3D pooling layer to model and extract the temporal and spatial feature maps of the video frames, and the residual module of the residual network model ResNet applies short connections and identity mapping to improve learning efficiency; the residual module of ResNet is fused into the 3D convolution module of the 3D ConvNet network to obtain a 3D convolution residual module; the short connection of the 3D convolution residual module comprises a 1 × 1 3D convolutional layer, which ensures that the output of the 3D convolution residual module is consistent in dimension with the mapped output of the 1 × 1 3D convolutional layer.
Preferably, the short connection of the 3D convolution residual module includes a 1 × 1 3D convolution layer, and the method for ensuring that the output of the 3D convolution residual module and the output of the 1 × 1 3D convolution layer after mapping have consistent dimensions is:
two convolutional layers are included in the 3D convolution residual module, so the residual mapping F(x) is expressed as

F(x) = ω2σ(ω1x + b1) + b2

where x represents the input; ω1 and ω2 represent the weight coefficients of the first and second convolutional layers; b1 and b2 represent the bias terms of the first and second convolutional layers; and σ denotes the ReLU activation function:

σ(x) = max(0, x)

To make the dimensions of the input x and the residual mapping F(x) the same, a 1 × 1 3D convolutional layer is added on the short connection, yielding a weighted mapping H(x), expressed as

H(x) = Wsx

where Ws is a weight matrix used to match the dimensions of the input x and the residual mapping F(x);

the mapping equation Z(x) then becomes:

Z(x) = F(x) + H(x).
Preferably, the time-space residual error network model has eight convolutional layers followed by four fully-connected layers: the convolutional layers come first and are connected in sequence, and the four fully-connected layers are connected after the last convolutional layer.
Preferably, in the time and space residual error network model, the loss function is a cross entropy loss function for classifying networks.
Preferably, the cross entropy loss function is:
loss(x, class) = −x[class] + log( Σj exp(x[j]) )
where x represents the input, class represents the true value of the class to which the input belongs, n is the number of video frames of the input, x [ class ] represents the score obtained in the input by the class to which it belongs, and x [ j ] represents the score obtained in the input by the j-th class.
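As a small numeric check, the formula can be evaluated directly and compared against PyTorch's built-in cross entropy; the scores below are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Illustrative scores for one video frame volume over the three classes.
x = torch.tensor([[2.0, 0.5, -1.0]])   # shape (batch=1, classes=3)
target = torch.tensor([0])             # index of the true class ("class" above)

# Manual evaluation: -x[class] + log(sum_j exp(x[j]))
manual = -x[0, 0] + torch.logsumexp(x[0], dim=0)

# PyTorch's built-in cross entropy gives the same value.
builtin = F.cross_entropy(x, target)
print(manual.item(), builtin.item())   # both ~0.2414
```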
Preferably, the video frame volumes are classified and then sent to a time-space residual error network model for training, wherein a single-path network model is adopted to extract time-space domain features from two frames in the input video frame volumes, and an Adam algorithm is used to train the time-space residual error network model.
According to a second aspect of the present invention, there is provided a slide switch detection system based on a time-space residual error network model, comprising:
a segmentation module: segmenting a video recorded through a single or multiple shots containing slides, a presenter and/or a viewer into a plurality of video frame volumes containing video frames;
a classification network structure design module: designing a convolutional neural network structure using the design principle of a network structure for extracting the spatial-domain features of the picture, namely: if the input and output feature maps of the classification network structure have the same size, the number of channels of the convolution kernel of the convolutional neural network does not change; if the size of the feature map output by the classification network structure is half that of the input feature map, the number of channels of the convolution kernel is doubled to ensure consistency in time complexity; the rear end of the convolutional neural network is connected with a three-classification output layer for obtaining classification information of the video frame volume, forming the classification network structure;
the time and space residual error network model construction module comprises: extracting the time and space characteristics of the video frame by using a 3D convolution module in a 3D ConvNet network, fusing a residual module in a residual network model ResNet into the 3D convolution module in the 3D ConvNet network to obtain a 3D convolution residual module, and constructing a time and space residual network model for video frame volume classification;
a training module: dividing a training video into a plurality of video frame volumes containing video frames, classifying the video frame volumes, and then sending the video frame volumes into a time-space residual error network model for training to obtain a trained time-space residual error network model;
a detection module: and sending the video frame volume of the test video into a trained time and space residual error network model, outputting the class of the video frame volume, and detecting the slide switching time.
According to a third aspect of the present invention, there is provided a terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor is operable to execute the program to perform the above-mentioned slide switching detection method based on a time-space residual network model.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, is operable to carry out the above-mentioned slide switching detection method based on the time-space residual error network model.
The invention provides a slide switching detection method, system, terminal, and storage medium based on a time and space residual error network model, relating to a slide switching detection technique built on 3D ConvNet and ResNet. Given a lecture video recorded through a single or multiple shots containing slides, a presenter, and an audience, the invention aims to detect the moments at which the slides switch. Because spatial and temporal features in the video are important for detection, 3D ConvNet is used to extract temporal and spatial features. Because the videos are long, ResNet is incorporated to optimize the network model, improving both processing time and detection accuracy. The method first divides an input training video into a plurality of video frame volumes containing video frames, then divides the video frame volumes into three classes and sends them into the classification network model for training. The video frame volumes of the test video are then sent into the trained classification network model, and the slide switching moments are detected from the classes of the video frame volumes output by the model. The invention overcomes the interference of lens movement, speaker movement, and multi-PTZ-lens switching in lecture video with good accuracy, achieves higher accuracy than existing methods, and enlarges the range of lecture video types that can be processed.
The invention designs a basic classification network model using current well-performing design principles for network structures that extract picture spatial-domain features, and converts the slide switching detection problem into a video frame volume classification problem. 3D ConvNet (a deep 3-dimensional convolutional neural network) is added to the basic model to extract the spatial and temporal features of the video frames, and the residual network model ResNet is merged into the network structure to improve training efficiency, constructing a new network model. During model training, every two frames of the video are taken as one video frame volume and sent to the network. Because slide switch detection has been turned into a classification problem, the loss function is designed as a cross-entropy loss function.
Compared with the prior art, the invention has the following beneficial effects:
According to the invention, introducing the 3D convolution residual module into the time-space residual deep learning network model allows extraction not only of the spatial-domain features of the video frames but also of the visual change features between adjacent frames, giving the method a strong advantage in handling slide switches with obvious visual change in lecture video. Dividing the lecture video into a plurality of video frame volumes containing video frames and sending them to the learning network for classification learning lets the model learn various interference characteristics, such as lens movement/zooming, speaker movement, and switching among multiple PTZ lenses, so that it can process many types of lecture videos; the invention thus achieves higher detection accuracy and handles a wider range of lecture video types. In addition, the invention requires no additional material such as text, voice, or electronic slides.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of a detection method according to an embodiment of the present invention;
FIG. 2 is an input video of a lecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating three classifications of video frame volumes according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a 3D convolution residual module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an overall temporal-spatial residual network model according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a slide switch detection result according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit it in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Embodiments of the present invention provide a slide switch detection technique based on a time and space residual network model built from 3D ConvNet and ResNet, with the goal of detecting the moments at which slides switch in a given lecture video recorded through a single or multiple shots containing slides, a speaker, and an audience. Because spatial and temporal features in the video are important for detection, embodiments of the invention use 3D ConvNet to extract temporal and spatial features. Because the videos are long, the invention incorporates ResNet to optimize the network model, improving both processing time and detection accuracy. After an input training video is divided into a plurality of video frame volumes containing video frames, the video frame volumes are divided into three classes through the three-classification output layer and sent into the classification network model for training; the video frame volumes of the test video are then sent into the trained classification network model, and the slide switching moments are detected from the classes output by the model.
The application environment of the following embodiments is as follows: the overall network model, shown in fig. 5, was implemented under Ubuntu 16.04 in the PyTorch environment.
Referring to fig. 1, in an embodiment of the present invention, a slide switching detection method based on a temporal-spatial residual network model may first segment the video to be detected, recorded through a single or multiple shots and containing slides, speakers, and/or audiences, into multiple video frame volumes containing video frames; then the following steps are carried out:
step 1: the structural design of the three-classification convolutional neural network model is as follows: designing a 12-layer three-classification convolutional neural network structure, which mainly follows the following two design principles:
(1) if the sizes of the input and output feature maps are the same, the number of channels of the convolution kernel does not change;
(2) if the size of the output feature map is half that of the input feature map, the number of channels of the convolution kernel is doubled to ensure consistency in time complexity.
As the network grows deeper, the width and height of the image are reduced regularly: after each pooling, the width and height are halved and the number of channels is doubled; for example, after the image size is reduced from 224 × 224 to 112 × 112, the number of channels increases from 64 to 128. The final output layer is a three-classification output for obtaining classification information of the video frame volume. The network structure is regular, uses relatively few hyper-parameters, and focuses on keeping the network simple.
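One stage of this rule can be sketched as follows; the 3 × 3 kernel and 2 × 2 max pooling are our assumptions, chosen only to show the halving of width and height and the doubling of channels (64 at 224 × 224 to 128 at 112 × 112):

```python
import torch
import torch.nn as nn

stage = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # channel count doubles
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),         # width and height halve
)
x = torch.randn(1, 64, 224, 224)
print(stage(x).shape)  # torch.Size([1, 128, 112, 112])
```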
Step 2: designing the 3D convolution residual module. The time-space feature extraction of the 3D ConvNet network uses a 3D convolution module, which applies a 3D convolutional layer and a 3D pooling layer, while the residual module of the ResNet network improves learning efficiency through short connections and identity mapping. The residual module of the residual network model ResNet is fused into the 3D convolution module of the 3D ConvNet network to obtain the 3D convolution residual module; a 1 × 1 3D convolutional layer added on the short connection ensures that the dimensions of the input and the mapping are consistent.
The network structure of the 3D convolution residual module is shown in fig. 4: it contains two 3D convolutional layers and a short connection containing a 1 × 1 3D convolutional layer. The first and second 3D convolutional layers are connected one after the other, so the input video frame volume is output after two 3D convolutions with its size reduced and its channel count increased. The output of the input video frame volume after the 1 × 1 3D convolutional layer has the same size as the output after the two 3D convolutions, and the two are superimposed as the final output.
In this step, the input and output dimensions before and after the 3D convolutional layers are matched as follows:
the 3D convolution residual module contains two convolutional layers, so the residual mapping F(x) is expressed as

F(x) = ω2σ(ω1x + b1) + b2

where σ denotes the ReLU activation function:

σ(x) = max(0, x)

To make the dimensions of the input x and the residual mapping F(x) the same, a 1 × 1 3D convolutional layer added on the short connection gives the weighted mapping H(x), expressed as

H(x) = Wsx

where Ws is a weight matrix used to match the dimensions of the input x and the residual mapping F(x).
Thus, the mapping equation becomes:

Z(x) = F(x) + H(x)
and step 3: constructing a time and space residual error network model: and (3) adding the convolution residual error module designed in the step (2) into the network structure designed in the step (1) to obtain the time and space residual error network model. The time and space residual error network model has eight convolutional layers and four fully-connected layers, the network structure is shown in fig. 5, the 3D convolution in the figure indicates that the layer belongs to the 3D convolutional layer, 64 indicates that the number of convolution channels is 64,/2 indicates that the size of the video frame volume is reduced by half, and the fully-connected indicates that the layer belongs to the fully-connected layer. The solid short lines on the 3D convolution layer indicate that the video frame is convolved into the layer and has no size change, so the short connections can be used directly for identity mapping. The size of the video frame volume with the dotted short connection value on the 3D convolution layer is reduced by half after the convolution of the layer, the number of channels is doubled, and the same output size dimension is ensured after the 3D convolution layer with the 1 multiplied by 1 is added on the short connection, so that a 3D convolution residual module is formed. After eight layers of convolution layers, the size of the video frame is continuously reduced by half, and the number of channels is doubled. Because the number of channels of the video frame volume output after the layer 8 volume layer is large, the dimension reduction is carried out by using the four fully-connected layers, and finally the fully-connected layer 4 comprises three nodes, which shows that the final output is of three types and is a three-classification network.
The loss function adopts a cross entropy loss function commonly used by classification networks.
Specifically, the cross entropy loss function is as follows:
loss(x, class) = −x[class] + log( Σj exp(x[j]) )
where x represents the input, class represents the true value of the classification, and n is the number of input video frames.
Step 4: model training. A single-path time and space residual error network model is adopted to extract space and time domain features from the two frames in each input video frame volume; the model is trained using the Adam algorithm.
In this step, the network parameter optimization algorithm is the Adam algorithm, with the batch size (mini-batch) set to 128 and the algorithm parameters set to β1 = 0.9 and β2 = 0.999, where β1 is the exponential decay rate of the first-moment estimate and β2 is the exponential decay rate of the second-moment estimate. The penalty multiplier for weight decay is set to 5 × 10⁻⁴. The initial learning rate is 0.001 and decays by a factor of 10 as training time increases. The model is stored after training finishes.
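A sketch of this training configuration under the stated hyper-parameters follows; `train_loader` is a hypothetical DataLoader yielding (volume, label) mini-batches of size 128, and the 10-epoch step of the learning-rate schedule is our assumption about how the decay is applied:

```python
import torch
import torch.nn as nn

model = TemporalSpatialResNet()  # from the previous sketch
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,              # initial learning rate
    betas=(0.9, 0.999),    # beta1, beta2: first/second-moment decay rates
    weight_decay=5e-4,     # weight-decay penalty multiplier
)
# Decay the learning rate by a factor of 10 as training progresses
# (the 10-epoch step size is our assumption about the schedule).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = nn.CrossEntropyLoss()

num_epochs = 30  # illustrative
for epoch in range(num_epochs):
    for volumes, labels in train_loader:  # hypothetical DataLoader, batch size 128
        optimizer.zero_grad()
        loss = criterion(model(volumes), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()

torch.save(model.state_dict(), "slide_switch_model.pt")  # store the trained model
```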
Step 5: the video frame volumes to be detected are input into the time and space residual error network model trained in step 4 to obtain the corresponding classification results, and the corresponding slide switching moments are obtained from the classification results.
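A minimal sketch of this detection step, assuming the trained model and the segmentation helper from the earlier sketches, and assuming (our choice, since the embodiment does not fix the mapping) that class index 2 denotes a slide switch:

```python
import torch

model.eval()
volumes = video_to_frame_volumes("test_lecture.mp4")  # helper sketched earlier
switch_frames = []
with torch.no_grad():
    for i, volume in enumerate(volumes):
        logits = model(torch.from_numpy(volume).unsqueeze(0))  # shape (1, 3)
        if logits.argmax(dim=1).item() == 2:  # assumed "slide switch" class
            switch_frames.append(i)           # switch lies between frames i, i+1
print("slide switches detected after frames:", switch_frames)
```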
Corresponding to the method, the embodiment of the present invention further provides a slide switching detection system based on a time and space residual error network model, which can be used to implement the method. The system specifically comprises:
a segmentation module: segmenting a video recorded through a single or multiple shots containing slides, a presenter and/or a viewer into a plurality of video frame volumes containing video frames;
a classification network model design module: designing a convolutional neural network as a basic network model by adopting a design principle of a network structure for extracting image spatial domain characteristics; connecting a three-classification output layer behind the basic network model, wherein the three-classification output layer is used for obtaining classification information of the video frame volume to form a three-classification network model;
the design principle of the network structure for extracting the spatial domain features of the picture is as follows: if the sizes of the input characteristic graph and the output characteristic graph of the three-classification network model are the same, the number of channels of a convolution kernel of the convolution neural network is not changed; if the size of the characteristic diagram output by the three-classification network model is half of the size of the input characteristic diagram, the number of channels of a convolution kernel of the convolution neural network is doubled so as to ensure the consistency of time complexity;
the time and space residual error network model construction module comprises: extracting the time and space characteristics of the video frame by using a 3D convolution module in a 3D ConvNet network, fusing a residual module in a residual network model ResNet into the 3D convolution module in the 3D ConvNet network to obtain a 3D convolution residual module, and constructing a time and space residual network model for video frame volume classification;
a training module: dividing a training video into a plurality of video frame volumes containing video frames, classifying the video frame volumes, and then sending the video frame volumes into a time-space residual error network model for training to obtain a trained time-space residual error network model;
a detection module: and sending the video frame volume of the test video into a trained time and space residual error network model, outputting the class of the video frame volume, and detecting the slide switching time.
The embodiment of the invention also provides a terminal, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor can be used for executing the slide switching detection method based on the time-space residual error network model when executing the program.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, is capable of executing the above-mentioned slide switching detection method based on a time-space residual error network model.
The above method and system perform slide switching detection based on the time-space residual error network model. In this embodiment, the input lecture video is shown in fig. 2; the video frame volume, a combination containing a plurality of video frames, is shown in fig. 3; and the detection result after slide switching detection is shown in fig. 6. In the input lecture video, lens movement, speaker movement, and lens switching between the speaker and the slides could cause persons, or switches between persons and slides, to be falsely detected; because the method based on the time and space residual error network model handles these interferences, such false detections do not occur in the result.
It should be noted that, the steps in the method provided by the present invention may be implemented by using corresponding modules, devices, units, and the like in the system, and those skilled in the art may refer to the technical solution of the system to implement the step flow of the method, that is, the embodiment in the system may be understood as a preferred example for implementing the method, and details are not described herein.
Those skilled in the art will appreciate that, in addition to implementing the system and its various modules, devices, units provided by the present invention in pure computer readable program code, the system and its various devices provided by the present invention can be implemented with the same functionality in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like by entirely logically programming method steps. Therefore, the system and various devices thereof provided by the present invention can be regarded as a hardware component, and the devices included in the system and various devices thereof for realizing various functions can also be regarded as structures in the hardware component; means for performing the functions may also be regarded as structures within both software modules and hardware components for performing the methods.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (10)

1. A slide switching detection method based on a time and space residual error network model is characterized by comprising the following steps:
segmenting a video recorded through a single or multiple shots containing slides, a presenter and/or a viewer into a plurality of video frame volumes containing video frames;
designing a convolutional neural network structure by adopting a design principle of a network structure for extracting the spatial domain characteristics of the picture; connecting a three-classification output layer behind the convolutional neural network structure, wherein the three-classification output layer is used for obtaining classification information of the video frame volume to obtain a three-classification convolutional neural network model; designing a time-space residual error network model on the basis of the structure of the three-classification convolutional neural network model, a 3D convolutional module in the 3D ConvNet network and a residual error module in a residual error network model ResNet network;
extracting the time and space characteristics of the video frame by using a 3D convolution module in a 3D ConvNet network, fusing a residual module in a residual network model ResNet into the 3D convolution module in the 3D ConvNet network to obtain a 3D convolution residual module, and constructing a time and space residual network model for video frame volume classification; wherein:
dividing a training video into a plurality of video frame volumes containing video frames, classifying the video frame volumes, and then sending the video frame volumes into a time-space residual error network model for training to obtain a trained time-space residual error network model;
and sending the video frame volume of the test video into a trained time and space residual error network model to obtain a classification result, and detecting the slide switching moment.
2. The method according to claim 1, wherein the structure of the three-classification convolutional neural network model is a 12-layer convolutional neural network structure comprising 8 convolutional layers and 4 fully-connected layers; as the network deepens, the width and height of the image are continuously reduced, being halved after each pooling, while the number of channels is doubled; the last output layer is a three-classification output layer;
and/or the presence of a gas in the gas,
the design principle of the network structure for extracting the spatial domain features of the picture comprises the following steps:
-if the time and space feature map sizes of the 3D convolution residual module input and output are the same, the number of channels of the convolution kernel of the convolutional neural network does not change;
if the size of the time and space feature map output by the 3D convolution residual module is half of the size of the input time and space feature map, the number of channels of the convolution kernel of the convolution neural network is doubled to ensure the consistency of time complexity.
3. The method according to claim 1, wherein the 3D convolution module in the 3D ConvNet network applies a 3D convolution layer and a 3D pooling layer to model and extract the temporal and spatial feature maps of the video frames, and the residual module of the residual network model ResNet network applies short connections and identity mapping to improve model learning efficiency; fusing a residual error module in the residual error network model ResNet into a 3D convolution module in a 3D ConvNet network to obtain a 3D convolution residual error module; the short connection of the 3D convolution residual module comprises a 1 x 1 3D convolution layer, so as to ensure that the output of the 3D convolution residual module is consistent with the dimension of the output of the 1 x 1 3D convolution layer after mapping.
4. The method of claim 3, wherein the short connection of the 3D convolution residual module comprises a 1 × 1 3D convolutional layer, and the dimensions of the output of the 3D convolution residual module and of the mapped output of the 1 × 1 3D convolutional layer are kept consistent as follows:
two convolutional layers are included in the 3D convolution residual module, so the residual mapping F(x) is expressed as

F(x) = ω2σ(ω1x + b1) + b2

where x represents the input; ω1 and ω2 represent the weight coefficients of the first and second convolutional layers; b1 and b2 represent the bias terms of the first and second convolutional layers; and σ denotes the ReLU activation function:

σ(x) = max(0, x)

To make the dimensions of the input x and the residual mapping F(x) the same, a 1 × 1 3D convolutional layer is added on the short connection, yielding a weighted mapping H(x), expressed as

H(x) = Wsx

where Ws is a weight matrix used to match the dimensions of the input x and the residual mapping F(x);

the mapping equation Z(x) then becomes:

Z(x) = F(x) + H(x).
5. The method for detecting slide switching based on the time-space residual error network model according to claim 1, wherein the time-space residual error network model has eight convolutional layers connected in sequence and four fully-connected layers connected to the rear end of the last convolutional layer;
in the time and space residual error network model, the loss function adopts a cross entropy loss function for a classification network.
6. The method of claim 5, wherein the cross-entropy loss function is:
loss(x, class) = −x[class] + log( Σj exp(x[j]) )
where x represents the input, class represents the true value of the class to which the input belongs, n is the number of video frames of the input, x [ class ] represents the score obtained in the input by the class to which it belongs, and x [ j ] represents the score obtained in the input by the j-th class.
7. The method for detecting slide switching based on temporal-spatial residual network model according to any one of claims 1-6, characterized in that the video frame volumes are classified and then sent to the temporal-spatial residual network model for training, wherein a one-way network model is adopted to extract temporal-spatial features from two frames of the input video frame volumes, and the temporal-spatial residual network model is trained by using Adam algorithm.
8. A slide switch detection system based on a time-space residual network model for implementing the method of any one of claims 1-7, comprising:
a segmentation module: segmenting a video recorded through a single or multiple shots containing slides, a presenter and/or a viewer into a plurality of video frame rolls containing video frames;
a classification network structure design module: designing a convolutional neural network structure using the design principle of a network structure for extracting the spatial-domain features of the picture, namely: if the input and output feature maps of the classification network structure have the same size, the number of channels of the convolution kernel of the convolutional neural network does not change; if the size of the feature map output by the classification network structure is half that of the input feature map, the number of channels of the convolution kernel is doubled to ensure consistency in time complexity; the rear end of the convolutional neural network structure is connected with a three-classification output layer to obtain classification information of the video frame volume, forming the classification network structure;
the time and space residual error network model construction module comprises: extracting the time and space characteristics of the video frame by using a 3D convolution module in a 3D ConvNet network, fusing a residual module in a residual network model ResNet into the 3D convolution module in the 3D ConvNet network to obtain a 3D convolution residual module, and constructing a time and space residual network model for video frame volume classification;
a training module: dividing a training video into a plurality of video frame volumes containing video frames, classifying the video frame volumes, and then sending the video frame volumes into a time-space residual error network model for training to obtain a trained time-space residual error network model;
a detection module: and sending the video frame volume of the test video into a trained time and space residual error network model, outputting the class of the video frame volume, and detecting the slide switching time.
9. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the program when executed by the processor is operable to perform the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.
CN201910208617.5A 2019-03-19 2019-03-19 Slide switching detection method, system, terminal and storage medium Active CN109934188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910208617.5A CN109934188B (en) 2019-03-19 2019-03-19 Slide switching detection method, system, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910208617.5A CN109934188B (en) 2019-03-19 2019-03-19 Slide switching detection method, system, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN109934188A CN109934188A (en) 2019-06-25
CN109934188B 2020-10-30

Family

ID=66987669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910208617.5A Active CN109934188B (en) 2019-03-19 2019-03-19 Slide switching detection method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN109934188B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738041A (en) * 2019-09-30 2020-10-02 北京沃东天骏信息技术有限公司 Video segmentation method, device, equipment and medium
CN110830734B (en) * 2019-10-30 2022-03-18 新华智云科技有限公司 Abrupt change and gradual change lens switching identification method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424460B2 (en) * 2014-03-10 2016-08-23 Case Western Reserve University Tumor plus adjacent benign signature (TABS) for quantitative histomorphometry

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105718130A (en) * 2014-12-01 2016-06-29 珠海金山办公软件有限公司 Page switching method and apparatus for lantern slides
CN105957531A (en) * 2016-04-25 2016-09-21 上海交通大学 Speech content extracting method and speech content extracting device based on cloud platform
CN107920280A (en) * 2017-03-23 2018-04-17 广州思涵信息科技有限公司 The accurate matched method and system of video, teaching materials PPT and voice content
CN107798687A (en) * 2017-09-26 2018-03-13 上海大学 A kind of lantern slide switching detection method based on sparse time-varying figure
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sparse Time-Varying Graphs for Slide Transition Detection in Lecture Videos; Zhijin Liu et al.; ICIG 2017; 2017-12-31; pp. 567-576 *
Research on video stream classification with fused 3D convolutional neural networks; Pei Songwen et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 2018-10-31; Vol. 39, No. 10; pp. 2266-2270 *

Also Published As

Publication number Publication date
CN109934188A (en) 2019-06-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant