CN109862395B - Video stream hidden information detection method and device - Google Patents


Publication number
CN109862395B
CN109862395B (application CN201910250765A)
Authority
CN
China
Prior art keywords
video frame
neural network
video stream
quantization
video
Prior art date
Legal status
Active
Application number
CN201910250765.3A
Other languages
Chinese (zh)
Other versions
CN109862395A (en)
Inventor
陈性元
杜学绘
孙奕
罗远焱
秦若熙
张东巍
曹利峰
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force
Priority to CN201910250765.3A
Publication of CN109862395A
Application granted
Publication of CN109862395B
Legal status: Active
Anticipated expiration

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application discloses a method and a device for detecting hidden information in a video stream. The method comprises the following steps: calling a decoding library to decompress a video stream to be detected to obtain a plurality of video frames; performing a convolution operation on each video frame with A preset convolution kernels to obtain A residual noise matrices of each video frame; performing a quantization-truncation operation on the A residual noise matrices of each video frame with B different quantization-truncation parameters to obtain A×B quantized residual noise matrices of each video frame; and inputting the A×B quantized residual noise matrices of each video frame into a preset convolutional neural network model for hidden-information detection to obtain a detection result for the video stream to be detected. After the video stream undergoes decompression, convolution, and quantization truncation, hidden-information detection is performed by the preset convolutional neural network model, which automatically extracts the detection features from the quantized residual noise matrices. This saves time, effort, and computing resources, and improves both the detection efficiency and the quality of the detection result.

Description

Video stream hidden information detection method and device
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a method and an apparatus for detecting hidden information in a video stream.
Background
With the rapid development of science and technology, video technology has become widely used in daily life, which makes it easy for an attacker to mount attacks by hiding information in a video stream, causing security threats such as malicious-code diffusion and information leakage. In particular, a video stream encoded with the H.264 algorithm offers high compression efficiency, wide network adaptability, and a large number of Discrete Cosine Transform (DCT) coefficients, so the probability that information is hidden in the DCT coefficients of an H.264 video stream is relatively high.
Since hiding information in a video stream by modifying DCT coefficients affects, to varying degrees, the spatial correlation of the data in the video stream, the hidden information can be detected based on the spatial correlation of the decompressed video-stream data. At present, a convolution and a quantization-truncation operation are first performed on the video frames obtained by decompressing the video stream; detection features are then designed manually from the quantized, truncated data; and finally the hidden information is detected with a support vector machine or an ensemble classifier.
The inventors have found that the detection features of this method are limited: they are obtained by manual design, which consumes time, labor, and computing resources; the detection features strongly influence the detection result; and because video has a more complex structure than other multimedia data, the required detection features have a higher dimensionality, making their manual design especially challenging.
Disclosure of Invention
The technical problem to be solved by the present application is to provide a method and an apparatus for detecting hidden information in a video stream that greatly save time, effort, and computing resources, and that improve, to a certain extent, the detection efficiency and the quality of the detection result.
In a first aspect, an embodiment of the present application provides a method for detecting hidden information in a video stream, where the method includes:
calling a decoding library to decompress the video stream to be detected to obtain a plurality of video frames;
performing a convolution operation on each video frame with A preset convolution kernels to obtain A residual noise matrices of each video frame;
performing a quantization-truncation operation on the A residual noise matrices of each video frame with B different quantization-truncation parameters to obtain A×B quantized residual noise matrices of each video frame;
and inputting the A×B quantized residual noise matrices of each video frame into a preset convolutional neural network model for hidden-information detection to obtain a detection result for the video stream to be detected.
Optionally, the A preset convolution kernels are 16 convolution kernels of size 4×4, and the constructor of the 16 4×4 preset convolution kernels is:

$$K^{(m,n)}(i,j) = w(m)\,w(n)\cos\frac{\pi m(2i+1)}{2N}\cos\frac{\pi n(2j+1)}{2N},\qquad 0 \le i, j, m, n \le N-1,$$

wherein

$$w(0) = \sqrt{1/N},\qquad w(k) = \sqrt{2/N}\ \ (1 \le k \le N-1),$$

N = 4.
Optionally, the preset convolutional neural network model includes B convolutional neural network sub-models, where each convolutional neural network sub-model includes A first convolutional layers, a global average pooling layer, 2 fully connected layers, and a softmax layer; and shortcut connections are included between different first convolutional layers among the A first convolutional layers.
Optionally, if the data dimensions of the input data and the output data at the two ends of the shortcut connection are the same, the shortcut connection is an identity shortcut connection.
Optionally, if the data dimensions of the input data and the output data at the two ends of the shortcut connection are different, the shortcut connection includes a second convolutional layer, and the second convolutional layer is used to match the data dimensions of the input data and the output data.
In a second aspect, an embodiment of the present application provides an apparatus for detecting hidden information in a video stream, where the apparatus includes:
a video frame obtaining unit, configured to call a decoding library to decompress the video stream to be detected to obtain a plurality of video frames;
a residual noise matrix obtaining unit, configured to perform a convolution operation on each video frame with A preset convolution kernels to obtain A residual noise matrices of each video frame;
a quantized residual noise matrix obtaining unit, configured to perform a quantization-truncation operation on the A residual noise matrices of each video frame with B different quantization-truncation parameters to obtain A×B quantized residual noise matrices of each video frame;
and a detection result obtaining unit, configured to input the A×B quantized residual noise matrices of each video frame into a preset convolutional neural network model for hidden-information detection to obtain a detection result for the video stream to be detected.
Optionally, the A preset convolution kernels are 16 convolution kernels of size 4×4, and the constructor of the 16 4×4 preset convolution kernels is:

$$K^{(m,n)}(i,j) = w(m)\,w(n)\cos\frac{\pi m(2i+1)}{2N}\cos\frac{\pi n(2j+1)}{2N},\qquad 0 \le i, j, m, n \le N-1,$$

wherein

$$w(0) = \sqrt{1/N},\qquad w(k) = \sqrt{2/N}\ \ (1 \le k \le N-1),$$

N = 4.
Optionally, the preset convolutional neural network model includes B convolutional neural network sub-models, where each convolutional neural network sub-model includes A first convolutional layers, a global average pooling layer, 2 fully connected layers, and a softmax layer; and shortcut connections are included between different first convolutional layers among the A first convolutional layers.
Optionally, if the data dimensions of the input data and the output data at the two ends of the shortcut connection are the same, the shortcut connection is an identity shortcut connection.
Optionally, if the data dimensions of the input data and the output data at the two ends of the shortcut connection are different, the shortcut connection includes a second convolutional layer, and the second convolutional layer is used to match the data dimensions of the input data and the output data.
Compared with the prior art, the present application has the following advantages:
With the technical solution of the embodiments of the present application, a decoding library is called to decompress the video stream to be detected to obtain a plurality of video frames; a convolution operation is performed on each video frame with A preset convolution kernels to obtain A residual noise matrices of each video frame; a quantization-truncation operation is performed on the A residual noise matrices of each video frame with B different quantization-truncation parameters to obtain A×B quantized residual noise matrices of each video frame; and the A×B quantized residual noise matrices of each video frame are input into a preset convolutional neural network model for hidden-information detection to obtain a detection result for the video stream to be detected. Thus, after the video stream undergoes decompression, convolution, and quantization truncation, hidden-information detection is performed by the preset convolutional neural network model, which automatically extracts the detection features from the quantized residual noise matrices; this greatly saves time, effort, and computing resources, and improves, to a certain extent, the detection efficiency and the quality of the detection result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a system framework related to an application scenario in an embodiment of the present application;
fig. 2 is a flowchart illustrating a method for detecting hidden information in a video stream according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a convolutional neural network sub-model in a preset convolutional neural network model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a partial convolutional neural network submodel provided in an embodiment of the present application;
fig. 5 is a schematic diagram of structures related to a shortcut connection according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an apparatus for detecting hidden information in a video stream according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, an H.264 video stream, with its high compression efficiency and wide network adaptability, carries a large number of DCT coefficients, so the probability that information is hidden in those DCT coefficients is relatively high. Common methods for hiding information by modifying DCT coefficients include the Lin method, the Ma method, the Nakajima K method, and the Wong K method. Although these methods suppress the influence of the information hiding on the video stream as much as possible, they still affect, to varying degrees, the spatial correlation of the data in the video stream, so the hidden information can be detected based on the spatial correlation of the decompressed video-stream data. In the prior art, decompression, convolution, and quantization-truncation operations are performed on the video stream, detection features are designed manually from the quantized, truncated data, and the hidden information is then detected with a support vector machine or an ensemble classifier. However, the inventors have found that the detection features of the prior-art method are obtained by manual design, which consumes time, labor, and computing resources; the detection features strongly influence the detection result; and because video has a more complex structure than other multimedia data, the required detection features have a higher dimensionality, making their manual design especially challenging.
To solve this problem, in the embodiments of the present application, a decoding library is called to decompress the video stream to be detected to obtain a plurality of video frames; a convolution operation is performed on each video frame with A preset convolution kernels to obtain A residual noise matrices of each video frame; a quantization-truncation operation is performed on the A residual noise matrices of each video frame with B different quantization-truncation parameters to obtain A×B quantized residual noise matrices of each video frame; and the A×B quantized residual noise matrices of each video frame are input into a preset convolutional neural network model for hidden-information detection to obtain a detection result for the video stream to be detected. Thus, after the video stream undergoes decompression, convolution, and quantization truncation, hidden-information detection is performed by the preset convolutional neural network model, which automatically extracts the detection features from the quantized residual noise matrices; this greatly saves time, effort, and computing resources, and improves, to a certain extent, the detection efficiency and the quality of the detection result.
For example, one scenario of the embodiments of the present application may be the one shown in fig. 1, which includes a user terminal 101 and a processor 102 that interact with each other. The user sends the video stream to be detected to the processor 102 through the user terminal 101, and the processor 102 calls a decoding library to decompress the video stream to be detected to obtain a plurality of video frames. The processor 102 performs a convolution operation on each video frame with A preset convolution kernels to obtain A residual noise matrices of each video frame. The processor 102 performs a quantization-truncation operation on the A residual noise matrices of each video frame with B different quantization-truncation parameters to obtain A×B quantized residual noise matrices of each video frame. The processor 102 inputs the A×B quantized residual noise matrices of each video frame into a preset convolutional neural network model for hidden-information detection and obtains a detection result for the video stream to be detected. The processor 102 sends the detection result to the user terminal 101 for presentation to the user.
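The convolution and quantization-truncation steps of this scenario can be sketched end to end in numpy (decoding is omitted; the frames, kernels, and parameter values below are illustrative placeholders, not the patent's actual ones):

```python
import numpy as np

def conv2d_valid(frame, kernel):
    """Naive 'valid' 2-D convolution of a grayscale frame with one kernel."""
    kh, kw = kernel.shape
    h, w = frame.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(frame[i:i + kh, j:j + kw] * kernel)
    return out

def detect_pipeline(frames, kernels, qt_params):
    """Per frame: A residual noise matrices (convolution), then
    A*B quantized residual noise matrices (quantization truncation)."""
    results = []
    for frame in frames:
        residuals = [conv2d_valid(frame, k) for k in kernels]   # A matrices
        quantized = [np.clip(np.round(r / q), -t, t)            # A*B matrices
                     for (t, q) in qt_params
                     for r in residuals]
        results.append(quantized)
    return results
```

With A = 2 dummy kernels and B = 3 parameter pairs, each frame yields 6 quantized residual noise matrices, which would then feed the convolutional neural network model.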
It should be understood that, although the actions in the above application scenario are described as being performed by the processor 102, the present application does not limit the executing subject, as long as the actions disclosed in the embodiments of the present application are performed.
It is to be understood that the above scenario is only one example of a scenario provided in the embodiment of the present application, and the embodiment of the present application is not limited to this scenario.
The following describes in detail a specific implementation manner of the method and apparatus for detecting hidden information in a video stream according to an embodiment of the present application, with reference to the accompanying drawings.
Exemplary method
Referring to fig. 2, a flowchart of a method for detecting hidden information in a video stream in an embodiment of the present application is shown. In this embodiment, the method may include, for example, the steps of:
step 201: and calling a decoding library to decompress the video stream to be detected to obtain a plurality of video frames.
It can be understood that an H.264 video stream currently conceals information by modifying DCT coefficients: after the DCT transforms the image signal from the spatial domain to the frequency domain, information is hidden in the DCT coefficients of the high-frequency components, to which the human visual system is insensitive. Because modifying the DCT coefficients affects the spatial correlation of the data in the video stream, and because the inverse DCT performed while decoding the video stream converts the DCT coefficients back to spatial-domain pixel values, the modifications are spread over all pixels, which makes detecting the hidden information possible. Therefore, in the embodiments of the present application, the hidden information may be detected from the differences caused by modifying the DCT coefficients; that is, the video stream to be detected is first decompressed to transform it from the frequency domain to the spatial domain and obtain a series of video frames for subsequent processing.
Step 202: and performing convolution operation on each video frame by utilizing A preset convolution cores to obtain A residual error noise matrixes of each video frame.
It can be understood that, in the video frames obtained in step 201, the image-content signal is strong while the noise signal generated by the hidden information is weak, which is unfavorable for subsequent detection. It is therefore necessary to convolve the video frames obtained in step 201 with a set of preset convolution kernels to suppress the image-content signal and amplify the noise signal generated by the hidden information; that is, residual noise matrices favorable to hidden-information detection can be obtained by convolving the video frames with the preset convolution kernels.
It should be noted that residual noise matrices obtained with preset convolution kernels of different sizes influence the subsequent detection performance differently. Because most video-stream information hiding is carried out by modifying the DCT coefficients obtained by transforming the residual matrices of the 4×4 luminance blocks of a video frame, convolving the video frames with 4×4 convolution kernels yields residual noise matrices from which the hidden information is most easily detected. Therefore, in some implementations of the embodiments of the present application, the A preset convolution kernels are 16 convolution kernels of size 4×4, and the constructor of the 16 4×4 convolution kernels is:

$$K^{(m,n)}(i,j) = w(m)\,w(n)\cos\frac{\pi m(2i+1)}{2N}\cos\frac{\pi n(2j+1)}{2N},\qquad 0 \le i, j, m, n \le N-1,$$

wherein

$$w(0) = \sqrt{1/N},\qquad w(k) = \sqrt{2/N}\ \ (1 \le k \le N-1),$$

N = 4.
For example, in a video-stream hidden-information detection experiment, convolution kernels of various sizes, such as 2×2, 3×3, 4×4, 5×5, and 8×8, may each be convolved with the video frames and the corresponding detection results compared; the residual noise matrices obtained with the 4×4 convolution kernels give the best hidden-information detection result. The 16 4×4 convolution kernels obtained from the constructor are then convolved with each video frame, and each video frame yields 16 residual noise matrices after the convolution operation.
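Since the constructor appears only as an image in the original text, its exact form is not directly recoverable; a construction consistent with the surrounding definitions (16 kernels, size 4×4, weights w(·), N = 4) is the set of 2-D DCT basis patterns, assumed here as a sketch:

```python
import numpy as np

N = 4

def w(k):
    # assumed normalization weights: w(0) = sqrt(1/N), w(k) = sqrt(2/N) otherwise
    return np.sqrt(1.0 / N) if k == 0 else np.sqrt(2.0 / N)

def dct_kernel(m, n):
    """The (m, n)-th 4x4 kernel: the outer product of two 1-D DCT basis vectors."""
    i = np.arange(N)
    row = w(m) * np.cos(np.pi * m * (2 * i + 1) / (2 * N))
    col = w(n) * np.cos(np.pi * n * (2 * i + 1) / (2 * N))
    return np.outer(row, col)

kernels = [dct_kernel(m, n) for m in range(N) for n in range(N)]  # 16 kernels
```

Under this assumption the 16 kernels form an orthonormal basis of the 4×4 blocks, so each residual noise matrix projects the frame onto one DCT frequency component.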
Step 203: and performing quantization truncation operation on the A residual error noise matrixes of each video frame by using B different quantization truncation parameters to obtain an A multiplied by B quantization residual error noise matrix of each video frame.
It can be understood that, in view of the quality of the detection result, the residual noise matrices obtained in step 202 need to be diversified. Since the quantization-truncation operation discretizes and diversifies the features, several different quantization-truncation parameters are used to perform the quantization-truncation operation on the residual noise matrices obtained in step 202, yielding more quantized residual noise matrices. The subsequent convolutional neural network model is thus supplied with diversified residual noise matrices, which gives the subsequent hidden-information detection result better quality.
For example, in the video-stream hidden-information detection experiment, the quantization-truncation operation may be performed on the residual noise matrices with 4 different groups of quantization-truncation parameters, (T=4, Q=1, 2, 4), (T=6, Q=1, 2, 4), (T=8, Q=1, 2, 4), and (T=10, Q=1, 2, 4), and the corresponding detection results compared; the quantized residual noise matrices obtained with the group (T=8, Q=1, 2, 4) give the best hidden-information detection result. The 3 different quantization-truncation parameters (T=8, Q=1), (T=8, Q=2), and (T=8, Q=4) are therefore selected, and the quantization-truncation operation is performed on the 16 residual noise matrices to obtain 48 quantized residual noise matrices.
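The quantization-truncation operation of this experiment can be sketched as follows (the exact form is an assumption: Q is taken as the quantization step and T as the truncation threshold; the residual matrices are random stand-ins):

```python
import numpy as np

def quantize_truncate(residual, Q, T):
    """Quantization truncation: scale by 1/Q, round to integers, clamp to [-T, T]."""
    return np.clip(np.round(residual / Q), -T, T)

# 16 residual noise matrices x 3 (T, Q) parameter pairs -> 48 quantized matrices
residuals = [np.random.randn(8, 8) * 10 for _ in range(16)]
params = [(8, 1), (8, 2), (8, 4)]   # the selected (T, Q) pairs
quantized = [quantize_truncate(r, q, t) for (t, q) in params for r in residuals]
```

Each output matrix takes one of at most 2T + 1 = 17 integer values, which is what discretizes the features, while the three Q values diversify them.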
Step 204: and inputting A multiplied by B quantization residual error noise matrixes of each video frame into a preset convolutional neural network model for hidden information detection, and obtaining a detection result of the video stream to be detected.
It can be understood that the A×B quantized residual noise matrices obtained in step 203 may be divided into B groups according to the B different quantization-truncation parameters, and the B groups are input into the preset convolutional neural network model for hidden-information detection; correspondingly, the preset convolutional neural network model is composed of B parallel, identical convolutional neural network sub-models. Each convolutional neural network sub-model includes A first convolutional layers, a global average pooling layer, 2 fully connected layers, and a softmax layer; fig. 3 shows the structure of a convolutional neural network sub-model in the preset convolutional neural network model. The first convolutional layers and the global average pooling layer are the core of the model: by learning and optimizing the relevant parameters, they automatically extract the hidden-information detection features from the quantized residual noise matrices; the 2 fully connected layers and the softmax layer obtain the detection result from those detection features.
For example, the 48 quantized residual noise matrices may be divided into 3 groups according to the 3 different quantization-truncation parameters and input into the 3 parallel, identical convolutional neural network sub-models of the preset convolutional neural network model. The portion of each convolutional neural network sub-model composed of the 16 first convolutional layers and the global average pooling layer outputs a 256-dimensional feature vector, as shown in the partial sub-model diagram of fig. 4.
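Global average pooling is what reduces the final feature maps to the 256-dimensional vector: each channel's map is averaged over its spatial dimensions. A minimal sketch (the 256×8×8 feature-map shape is an assumption for illustration):

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse a (C, H, W) stack of feature maps to a C-dimensional vector
    by averaging each map over its spatial dimensions."""
    return feature_maps.mean(axis=(1, 2))

# assumed output of the sub-model's convolutional portion: 256 channels
feature_maps = np.random.randn(256, 8, 8)
feature_vector = global_average_pool(feature_maps)   # 256-dimensional
```

Because pooling is parameter-free, the feature dimensionality is fixed by the channel count alone, regardless of the spatial size of the quantized residual noise matrices.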
It should be noted that each of the A first convolutional layers uses 1 batch normalization (BN) layer and a nonlinear activation function (ReLU) layer. The BN layer is introduced to solve the problem of vanishing gradients during back-propagation through the preset convolutional neural network model, and the ReLU layer is introduced to add nonlinearity and improve the expressive capacity of the preset convolutional neural network model.
It should be noted that, in theory, the deeper a convolutional neural network is, the better it learns, i.e., the higher the detection accuracy of the trained convolutional neural network model; in practice, however, an overly deep convolutional neural network suffers from vanishing or exploding gradients during training, so that the detection accuracy saturates or even degrades as layers are added. Following the idea of residual learning, shortcut connections are therefore added between different first convolutional layers of the A first convolutional layers in each convolutional neural network sub-model to build residual learning units that conduct the gradient along the residual path. Thus, in some implementations of the embodiments of the present application, shortcut connections are included between different first convolutional layers of the A first convolutional layers.
It should be noted that some of the A first convolutional layers of a convolutional neural network sub-model use a convolution stride of 2; when the spatial size of the input data is halved, the number of convolution kernels is doubled so that the data dimension of the output data is preserved, maintaining the structural complexity of the sub-model. The remaining first convolutional layers use a stride of 1, for which the data dimensions of the input data and the output data are equal. The role of a shortcut connection between different first convolutional layers is to make the data dimensions of the input data and the output data at its two ends match. Therefore, in some implementations of the embodiments of the present application, if the data dimensions of the input data and the output data at the two ends of the shortcut connection are the same, the shortcut connection is an identity shortcut connection; if they are different, the shortcut connection includes a second convolutional layer, which is used to match the data dimensions of the input data and the output data. Fig. 5 shows the structures related to the shortcut connection.
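The two shortcut cases can be sketched in numpy: an identity shortcut when shapes match, and a projection (the "second convolution layer", assumed here to be a 1×1 convolution with stride 2, a common choice) when the spatial size halves and the channel count doubles:

```python
import numpy as np

def projection_shortcut(x, weights):
    """Assumed form of the second convolution layer: a 1x1 convolution with
    stride 2. x: (C_in, H, W); weights: (C_out, C_in)."""
    subsampled = x[:, ::2, ::2]                      # stride-2 spatial subsampling
    return np.einsum('oc,chw->ohw', weights, subsampled)

def residual_unit(x, branch_out, proj_weights=None):
    """Add the shortcut to the branch output; project only if shapes differ."""
    if x.shape == branch_out.shape:
        shortcut = x                                 # identity shortcut connection
    else:
        shortcut = projection_shortcut(x, proj_weights)
    return branch_out + shortcut
```

A unit with input (16, 8, 8) and branch output (32, 4, 4) thus needs a (32, 16) projection matrix, while a shape-preserving unit adds the input unchanged.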
Through the various implementations provided by this embodiment, a decoding library is called to decompress the video stream to be detected to obtain a plurality of video frames; a convolution operation is performed on each video frame with A preset convolution kernels to obtain A residual noise matrices of each video frame; a quantization-truncation operation is performed on the A residual noise matrices of each video frame with B different quantization-truncation parameters to obtain A×B quantized residual noise matrices of each video frame; and the A×B quantized residual noise matrices of each video frame are input into a preset convolutional neural network model for hidden-information detection to obtain a detection result for the video stream to be detected. Thus, after the video stream undergoes decompression, convolution, and quantization truncation, hidden-information detection is performed by the preset convolutional neural network model, which automatically extracts the detection features from the quantized residual noise matrices; this greatly saves time, effort, and computing resources, and improves, to a certain extent, the detection efficiency and the quality of the detection result.
Exemplary devices
Referring to fig. 6, a schematic structural diagram of an apparatus for detecting hidden information in a video stream in an embodiment of the present application is shown. In this embodiment, the apparatus may specifically include:
a video frame obtaining unit 601, configured to invoke a decoding library to decompress a video stream to be detected to obtain a plurality of video frames;
a residual noise matrix obtaining unit 602, configured to perform a convolution operation on each video frame by using A preset convolution kernels, so as to obtain A residual noise matrices of each video frame;
a quantized residual noise matrix obtaining unit 603, configured to perform a quantization truncation operation on the A residual noise matrices of each video frame by using B different quantization truncation parameters, so as to obtain A×B quantized residual noise matrices of each video frame;
a detection result obtaining unit 604, configured to input the A×B quantized residual noise matrices of each video frame into a preset convolutional neural network model for hidden information detection, so as to obtain a detection result of the video stream to be detected.
In an optional implementation manner of the embodiment of the present application, the A preset convolution kernels are 16 4×4 convolution kernels, and the constructor of the 16 4×4 preset convolution kernels is:
Figure BDA0002012339650000101
wherein the content of the first and second substances,
Figure BDA0002012339650000102
N=4。
In an optional implementation manner of the embodiment of the present application, the preset convolutional neural network model includes B convolutional neural network sub-models, where each convolutional neural network sub-model includes A first convolutional layers, a global average pooling layer, 2 fully-connected layers, and a softmax layer; and shortcut connections are included between different first convolutional layers among the A first convolutional layers.
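A minimal sketch of the sub-model's classification head follows (global average pooling, two fully-connected layers, then softmax). The layer widths, the ReLU nonlinearity, and the two-class (cover/stego) output are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classifier_head(feature_maps, w1, b1, w2, b2):
    """Global average pooling -> FC -> ReLU -> FC -> softmax."""
    pooled = feature_maps.mean(axis=(1, 2))   # average each channel over space
    hidden = np.maximum(w1 @ pooled + b1, 0)  # first fully-connected layer, ReLU
    logits = w2 @ hidden + b2                 # second fully-connected layer
    return softmax(logits)                    # class probabilities

feats = np.random.randn(64, 4, 4)             # output of the A first conv layers (toy)
p = classifier_head(feats,
                    np.random.randn(32, 64), np.zeros(32),
                    np.random.randn(2, 32), np.zeros(2))
assert p.shape == (2,) and np.isclose(p.sum(), 1.0)
```

Global average pooling reduces each channel to one value regardless of spatial size, which keeps the fully-connected layers small.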
In an optional implementation manner of the embodiment of the present application, if the data dimensions of the input data and the output data at the two ends of the shortcut connection are the same, the shortcut connection is an identity shortcut connection.
In an optional implementation manner of the embodiment of the present application, if the data dimensions of the input data and the output data at the two ends of the shortcut connection are different, the shortcut connection includes a second convolutional layer, and the second convolutional layer is used to match the data dimensions of the input data and the output data.
Through the various implementations provided by this embodiment, a decoding library is called to decompress the video stream to be detected, yielding a plurality of video frames; a convolution operation is performed on each video frame with A preset convolution kernels to obtain A residual noise matrices of each video frame; a quantization truncation operation is performed on the A residual noise matrices of each video frame with B different quantization truncation parameters to obtain A×B quantized residual noise matrices of each video frame; and the A×B quantized residual noise matrices of each video frame are input into a preset convolutional neural network model for hidden information detection, yielding a detection result for the video stream to be detected. Because the model automatically extracts detection features from the quantized residual noise matrices after the decompression, convolution, and quantization truncation operations, time, effort, and computing resources are greatly saved, and both the detection efficiency and the quality of the detection results for video stream hidden information can be improved to a certain degree.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may refer to one another. Since the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application in any way. Although the present application has been disclosed above by way of preferred embodiments, they do not limit it. Using the methods and technical content disclosed above, those skilled in the art can make numerous possible variations and modifications to the technical solution, or derive equivalent embodiments, without departing from the scope of the claimed solution. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present application, without departing from the content of the technical solution, still falls within the protection scope of the technical solution of the present application.

Claims (8)

1. A method for detecting hidden information of a video stream, comprising:
calling a decoding library to decompress a video stream to be detected to obtain a plurality of video frames;
performing a convolution operation on each video frame by using A preset convolution kernels to obtain A residual noise matrices of each video frame;
performing a quantization truncation operation on the A residual noise matrices of each video frame by using B different quantization truncation parameters to obtain A×B quantized residual noise matrices of each video frame; and
inputting the A×B quantized residual noise matrices of each video frame into a preset convolutional neural network model for hidden information detection to obtain a detection result of the video stream to be detected;
wherein the preset convolutional neural network model comprises B convolutional neural network sub-models, each convolutional neural network sub-model comprising A first convolutional layers, a global average pooling layer, 2 fully-connected layers, and a softmax layer; and shortcut connections are included between different first convolutional layers among the A first convolutional layers.
2. The method of claim 1, wherein the A preset convolution kernels are 16 4×4 convolution kernels, and the constructor of the 16 4×4 preset convolution kernels is:
Figure FDA0002967504320000011
wherein the content of the first and second substances,
Figure FDA0002967504320000012
3. The method of claim 1, wherein if the data dimensions of the input data and the output data at the two ends of the shortcut connection are the same, the shortcut connection is an identity shortcut connection.
4. The method of claim 1, wherein if the data dimensions of the input data and the output data at the two ends of the shortcut connection are different, the shortcut connection comprises a second convolutional layer for matching the data dimensions of the input data and the output data.
5. An apparatus for detecting hidden information in a video stream, comprising:
a video frame obtaining unit, configured to call a decoding library to decompress a video stream to be detected to obtain a plurality of video frames;
a residual noise matrix obtaining unit, configured to perform a convolution operation on each video frame by using A preset convolution kernels to obtain A residual noise matrices of each video frame;
a quantized residual noise matrix obtaining unit, configured to perform a quantization truncation operation on the A residual noise matrices of each video frame by using B different quantization truncation parameters to obtain A×B quantized residual noise matrices of each video frame; and
a detection result obtaining unit, configured to input the A×B quantized residual noise matrices of each video frame into a preset convolutional neural network model for hidden information detection to obtain a detection result of the video stream to be detected;
wherein the preset convolutional neural network model comprises B convolutional neural network sub-models, each convolutional neural network sub-model comprising A first convolutional layers, a global average pooling layer, 2 fully-connected layers, and a softmax layer; and shortcut connections are included between different first convolutional layers among the A first convolutional layers.
6. The apparatus of claim 5, wherein the A preset convolution kernels are 16 4×4 convolution kernels, and the constructor of the 16 4×4 preset convolution kernels is:
Figure FDA0002967504320000021
wherein the content of the first and second substances,
Figure FDA0002967504320000022
7. The apparatus of claim 5, wherein if the data dimensions of the input data and the output data at the two ends of the shortcut connection are the same, the shortcut connection is an identity shortcut connection.
8. The apparatus of claim 5, wherein if the data dimensions of the input data and the output data at the two ends of the shortcut connection are different, the shortcut connection comprises a second convolutional layer for matching the data dimensions of the input data and the output data.
CN201910250765.3A 2019-03-29 2019-03-29 Video stream hidden information detection method and device Active CN109862395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910250765.3A CN109862395B (en) 2019-03-29 2019-03-29 Video stream hidden information detection method and device


Publications (2)

Publication Number Publication Date
CN109862395A CN109862395A (en) 2019-06-07
CN109862395B true CN109862395B (en) 2021-05-04

Family

ID=66902584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910250765.3A Active CN109862395B (en) 2019-03-29 2019-03-29 Video stream hidden information detection method and device

Country Status (1)

Country Link
CN (1) CN109862395B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376309B (en) * 2022-06-29 2024-04-26 华南理工大学 Missing traffic data restoration method based on multi-view time matrix decomposition

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107197297A (en) * 2017-06-14 2017-09-22 Institute of Information Engineering, Chinese Academy of Sciences A video steganalysis method for detecting DCT-coefficient-based steganography
CN108769700A (en) * 2018-05-31 2018-11-06 Xi'an University of Technology A robust video steganography method that reduces H.264 inter-frame drift distortion
CN109348211A (en) * 2018-08-06 2019-02-15 Institute of Acoustics, Chinese Academy of Sciences A general hidden-information detection method for inter-frame coding within video frames

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7792377B2 (en) * 2007-04-25 2010-09-07 Huper Laboratories Co., Ltd. Method of image authentication and restoration
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A Data Hiding Algorithm for H.264/AVC Video"; Xiaojing Ma et al.; IEEE Transactions on Circuits and Systems for Video Technology; 2010-08-30; full text *
"Research on the Theory and Methods of Digital Video Information Hiding" (数字视频信息隐藏理论与方法研究); Yao Yuanzhi; China Doctoral Dissertations Full-text Database; 2017-06-15; full text *

Also Published As

Publication number Publication date
CN109862395A (en) 2019-06-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant