WO2021142904A1 - Video analysis method and related model training method, device and apparatus therefor - Google Patents

Video analysis method and related model training method, device and apparatus therefor

Info

Publication number
WO2021142904A1
WO2021142904A1 (PCT/CN2020/078656)
Authority
WO
WIPO (PCT)
Prior art keywords
information
offset
feature
feature map
video
Prior art date
Application number
PCT/CN2020/078656
Other languages
French (fr)
Chinese (zh)
Inventor
邵昊 (SHAO Hao)
刘宇 (LIU Yu)
Original Assignee
北京市商汤科技开发有限公司 (Beijing SenseTime Technology Development Co., Ltd.)
Application filed by 北京市商汤科技开发有限公司 (Beijing SenseTime Technology Development Co., Ltd.)
Priority to KR1020217013635A priority Critical patent/KR20210093875A/en
Priority to JP2021521512A priority patent/JP7096431B2/en
Publication of WO2021142904A1 publication Critical patent/WO2021142904A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Definitions

  • This application relates to the field of artificial intelligence technology, in particular to a video analysis method and related model training methods, equipment, and devices.
  • Conventional neural network models are generally designed with static images as processing objects, and thus do not directly model the temporal information in videos.
  • The embodiments of the present application provide a video analysis method and related model training methods, equipment, and devices.
  • In a first aspect, an embodiment of the present application provides a video analysis method, including: obtaining a video to be analyzed; using a preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different time steps of the video to be analyzed; using an offset prediction network to predict the first multi-dimensional feature map to obtain offset information; using the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map, and obtaining a second multi-dimensional feature map based on the offset feature information; and using the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
  • Because the embodiment of the present application processes the video to be analyzed with a preset network model, the processing speed of video analysis is improved; and because the temporal offset jointly interleaves spatial and temporal information, analysis performed on this basis improves the accuracy of video analysis.
  • In some embodiments, before using the offset information to perform the temporal offset on at least part of the feature information of the first multi-dimensional feature map and obtaining the second multi-dimensional feature map based on the offset feature information, the method further includes: using a weight prediction network to predict the first multi-dimensional feature map to obtain weight information. In this case, the offset-and-combine step includes: using the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map; using the weight information to weight the offset feature information; and obtaining the second multi-dimensional feature map based on the weighted feature information.
  • In this way, the processing steps of offsetting and weighting directly produce feature information in which spatial and temporal information are jointly interleaved, which improves both the processing speed and the accuracy of video analysis.
  • In some embodiments, the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension, and using the offset information to perform the temporal offset on at least part of the feature information of the first multi-dimensional feature map includes: selecting at least one set of feature information from the first multi-dimensional feature map along the preset dimension, where each set of feature information includes the feature information corresponding to different time steps in the same preset dimension; and using the offset information to offset the at least one set of feature information along the temporal dimension.
  • Because only the selected sets of feature information are offset along the temporal dimension, the amount of computation for the offset processing is reduced, which further improves the processing speed of video analysis.
  • In some embodiments, the preset dimension is a channel dimension; and/or the offset information includes a first number of offset values and the at least one set of feature information includes a first number of sets of first feature information, and using the offset information to offset the at least one set of feature information along the temporal dimension includes: using the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information, where i is a positive integer less than or equal to the first number.
  • In this way, feature information in which spatial and temporal information are jointly interleaved is obtained directly, which improves the processing speed and accuracy of video analysis.
  • In some embodiments, using the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information includes: obtaining the numerical range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the range is a preset value; shifting the i-th set of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th set of third feature information, and shifting the i-th set of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th set of fourth feature information; using the difference between the i-th offset value and the lower limit as a weight to weight the i-th set of third feature information to obtain the i-th first weighted result, and using the difference between the upper limit and the i-th offset value as a weight to weight the i-th set of fourth feature information to obtain the i-th second weighted result; and taking the sum of the i-th first weighted result and the i-th second weighted result as the i-th set of second feature information.
  • In this way, the offset processing of the first feature information is simple and fast, which improves the processing speed of video analysis.
  • In some embodiments, the video to be analyzed includes a second number of frame images, and the weight information includes a second number of weight values. Using the weight information to weight the offset feature information includes: for each set of offset feature information, using the j-th weight value in the weight information to weight the feature value corresponding to the j-th time step in the current set, to obtain the corresponding weighted set of feature information, where j is a positive integer less than or equal to the second number.
  • Because some feature information at the two ends of the temporal dimension is shifted out during the offset, re-weighting each time step of the offset feature information in this way improves the accuracy of video analysis. A minimal sketch of this re-weighting follows.
  • In some embodiments, obtaining the second multi-dimensional feature map based on the weighted feature information includes: combining the weighted feature information with the feature information in the first multi-dimensional feature map that was not offset to form the second multi-dimensional feature map.
  • Combining the weighted feature information with the non-offset feature information of the first multi-dimensional feature map reduces the computational load and improves the processing speed of video analysis.
  • In some embodiments, using the weight prediction network to predict the first multi-dimensional feature map to obtain the weight information includes: using the first down-sampling layer of the weight prediction network to down-sample the first multi-dimensional feature map to obtain a first down-sampling result; using the first convolutional layer of the weight prediction network to perform convolution processing on the first down-sampling result to obtain a first feature extraction result; and using the first activation layer of the weight prediction network to perform non-linear processing on the first feature extraction result to obtain the weight information.
  • Processing the first multi-dimensional feature map layer by layer through the first down-sampling layer, the first convolutional layer, and the first activation layer yields the weight information with an effectively simplified network structure and fewer network parameters, which speeds up the convergence of model training for video analysis and helps avoid over-fitting, thereby improving the accuracy of video analysis.
  • In some embodiments, using the offset prediction network to predict the first multi-dimensional feature map to obtain the offset information includes: using the second down-sampling layer of the offset prediction network to down-sample the first multi-dimensional feature map to obtain a second down-sampling result; using the second convolutional layer of the offset prediction network to perform convolution processing on the second down-sampling result to obtain a second feature extraction result; using the first fully connected layer of the offset prediction network to perform feature connection on the second feature extraction result to obtain a first feature connection result; using the second activation layer of the offset prediction network to perform non-linear processing on the first feature connection result to obtain a non-linear processing result; using the second fully connected layer of the offset prediction network to perform feature connection on the non-linear processing result to obtain a second feature connection result; and using the third activation layer of the offset prediction network to perform non-linear processing on the second feature connection result to obtain the offset information.
  • This effectively simplifies the network structure of the offset prediction network and reduces its network parameters, which speeds up the convergence of model training for video analysis and helps avoid over-fitting, thereby improving the accuracy of video analysis.
  • In some embodiments, the preset network model includes at least one convolutional layer, and using the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map includes: using a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map. If the preset network model has more than one convolutional layer, then after the second multi-dimensional feature map is obtained, and before the preset network model analyzes the second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, the method further includes: using a convolutional layer of the preset network model that has not yet performed feature extraction to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map; performing the step of using the offset prediction network to predict offset information, and the subsequent steps, on the new first multi-dimensional feature map to obtain a new second multi-dimensional feature map; and repeating the above steps until all convolutional layers of the preset network model have performed feature extraction.
  • Finally, the fully connected layer of the preset network model analyzes the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, which improves the accuracy of video analysis.
  • In some embodiments, the video to be analyzed includes several frames of images, and using the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map includes: using the preset network model to perform feature extraction on the several frames of images to obtain a feature map corresponding to each frame of image; and splicing the several feature maps according to the temporal order of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
  • Because a feature map is extracted per frame and the feature maps are spliced directly in the temporal order of their corresponding images in the video to be analyzed, the processing load of feature extraction is reduced, which improves the processing speed of video analysis.
  • In a second aspect, an embodiment of the present application provides a model training method for video analysis, including: obtaining a sample video, where the sample video includes preset annotation information; using a preset network model to perform feature extraction on the sample video to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time steps of the sample video; using an offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information; using the offset information to perform a temporal offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtaining a second sample multi-dimensional feature map based on the offset feature information; using the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video; calculating a loss value from the preset annotation information and the analysis result information; and adjusting the parameters of the preset network model and the offset prediction network based on the loss value.
  • In this way, the temporal information of the sample video is modeled directly, which improves the speed of model training; and because the temporal offset jointly interleaves spatial and temporal information, subsequent analysis performed on this basis improves the accuracy of video analysis. A minimal sketch of one training step follows.
  • In a third aspect, an embodiment of the present application provides a video analysis device, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module, and a network analysis module. The video acquisition module is configured to acquire a video to be analyzed; the feature extraction module is configured to use a preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different time steps of the video to be analyzed; the offset prediction module is configured to use an offset prediction network to predict the first multi-dimensional feature map to obtain offset information; the offset processing module is configured to use the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map and obtain a second multi-dimensional feature map based on the offset feature information; and the network analysis module is configured to use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
  • In some embodiments, the device further includes a weight prediction module configured to use a weight prediction network to predict the first multi-dimensional feature map to obtain weight information. The offset processing module is configured to use the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map, use the weight information to weight the offset feature information, and obtain the second multi-dimensional feature map based on the weighted feature information.
  • In some embodiments, the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension, and the offset processing module is configured to select at least one set of feature information from the first multi-dimensional feature map along the preset dimension, where each set of feature information includes the feature information corresponding to different time steps in the same preset dimension, and to use the offset information to offset the at least one set of feature information along the temporal dimension.
  • In some embodiments, the preset dimension is a channel dimension; and/or the offset information includes a first number of offset values and the at least one set of feature information includes a first number of sets of first feature information, and the offset processing module is configured to use the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information, where i is a positive integer less than or equal to the first number.
  • In some embodiments, the offset processing module is configured to: obtain the numerical range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the range is a preset value; shift the i-th set of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th set of third feature information, and shift the i-th set of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th set of fourth feature information; use the difference between the i-th offset value and the lower limit as a weight to weight the i-th set of third feature information to obtain the i-th first weighted result, and use the difference between the upper limit and the i-th offset value as a weight to weight the i-th set of fourth feature information to obtain the i-th second weighted result; and take the sum of the i-th first weighted result and the i-th second weighted result as the i-th set of second feature information.
  • In some embodiments, the video to be analyzed includes a second number of frame images, and the weight information includes the second number of weight values; the offset processing module is configured to, for each set of offset feature information, use the j-th weight value in the weight information to weight the feature value corresponding to the j-th time step in the current set to obtain the corresponding weighted set of feature information, where j is a positive integer less than or equal to the second number.
  • In some embodiments, the offset processing module is configured to combine the weighted feature information with the feature information in the first multi-dimensional feature map that was not offset to form the second multi-dimensional feature map.
  • In some embodiments, the weight prediction module is configured to: use the first down-sampling layer of the weight prediction network to down-sample the first multi-dimensional feature map to obtain a first down-sampling result; use the first convolutional layer of the weight prediction network to perform convolution processing on the first down-sampling result to obtain a first feature extraction result; and use the first activation layer of the weight prediction network to perform non-linear processing on the first feature extraction result to obtain the weight information.
  • In some embodiments, the offset prediction module is configured to: use the second down-sampling layer of the offset prediction network to down-sample the first multi-dimensional feature map to obtain a second down-sampling result; use the second convolutional layer of the offset prediction network to perform convolution processing on the second down-sampling result to obtain a second feature extraction result; use the first fully connected layer of the offset prediction network to perform feature connection on the second feature extraction result to obtain a first feature connection result; use the second activation layer of the offset prediction network to perform non-linear processing on the first feature connection result to obtain a non-linear processing result; use the second fully connected layer of the offset prediction network to perform feature connection on the non-linear processing result to obtain a second feature connection result; and use the third activation layer of the offset prediction network to perform non-linear processing on the second feature connection result to obtain the offset information.
  • In some embodiments, the preset network model includes at least one convolutional layer, and the feature extraction module is configured to use a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map. If the preset network model has more than one convolutional layer, the feature extraction module is further configured to use a convolutional layer that has not yet performed feature extraction to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map; the offset prediction module is further configured to use the offset prediction network to predict the new first multi-dimensional feature map to obtain new offset information; the offset processing module is further configured to use the new offset information to perform a temporal offset on at least part of the feature information of the new first multi-dimensional feature map and obtain a new second multi-dimensional feature map based on the offset feature information; and the network analysis module is further configured to use the fully connected layer of the preset network model to analyze the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed.
  • In some embodiments, the video to be analyzed includes several frames of images, and the feature extraction module is configured to use the preset network model to perform feature extraction on the several frames of images to obtain a feature map corresponding to each frame of image, and to splice the several feature maps according to the temporal order of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
  • In a fourth aspect, an embodiment of the present application provides a model training device for video analysis, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module, a network analysis module, a loss calculation module, and a parameter adjustment module. The video acquisition module is configured to acquire a sample video, where the sample video includes preset annotation information; the feature extraction module is configured to use a preset network model to perform feature extraction on the sample video to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time steps of the sample video; the offset prediction module is configured to use an offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information; the offset processing module is configured to use the offset information to perform a temporal offset on at least part of the feature information of the first sample multi-dimensional feature map and obtain a second sample multi-dimensional feature map based on the offset feature information; the network analysis module is configured to use the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video; the loss calculation module is configured to calculate a loss value from the preset annotation information and the analysis result information; and the parameter adjustment module is configured to adjust the parameters of the preset network model and the offset prediction network based on the loss value.
  • In a fifth aspect, an embodiment of the present application provides an electronic device including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the video analysis method in the first aspect described above, or the model training method for video analysis in the second aspect described above.
  • In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium on which program instructions are stored. When the program instructions are executed by a processor, the video analysis method in the first aspect described above, or the model training method for video analysis in the second aspect described above, is implemented.
  • In a seventh aspect, an embodiment of the present application provides a computer program, including computer-readable code. When the computer-readable code runs in an electronic device, a processor in the electronic device executes it to implement the video analysis method in the first aspect described above, or the model training method for video analysis in the second aspect described above.
  • In the above solutions, the temporal information of the video to be analyzed is modeled directly, which improves the processing speed of video analysis; and because the temporal offset jointly interleaves spatial and temporal information, analysis performed on this basis improves the accuracy of video analysis.
  • FIG. 1 is a schematic flowchart of an embodiment of a video analysis method according to the present application.
  • FIG. 2 is a schematic diagram of an embodiment of a video analysis processing process.
  • FIG. 3 is a schematic diagram of an embodiment of each stage of video analysis.
  • FIG. 4 is a schematic flowchart of an embodiment of step S14 in FIG. 1.
  • FIG. 5 is a schematic flowchart of another embodiment of the video analysis method of the present application.
  • FIG. 6 is a schematic diagram of another embodiment of a video analysis processing process.
  • FIG. 7 is a schematic flowchart of an embodiment of a model training method for video analysis according to the present application.
  • FIG. 8 is a schematic diagram of the framework of an embodiment of the video analysis device of the present application.
  • FIG. 9 is a schematic diagram of the framework of an embodiment of a model training device for video analysis according to the present application.
  • FIG. 10 is a schematic diagram of the framework of an embodiment of an electronic device of the present application.
  • FIG. 11 is a schematic diagram of the framework of an embodiment of a computer-readable storage medium according to the present application.
  • The terms "system" and "network" are often used interchangeably in this document.
  • The term "and/or" in this document merely describes an association relationship between associated objects, indicating that three relationships are possible; for example, "A and/or B" can mean: A exists alone, A and B exist at the same time, or B exists alone.
  • The character "/" in this document generally indicates that the associated objects before and after it are in an "or" relationship.
  • The term "multiple" in this document means two or more.
  • FIG. 1 is a schematic flowchart of an embodiment of a video analysis method according to the present application.
  • The video analysis method of the present application may be executed by an electronic device with processing capability, such as a microcomputer, a server, or a tablet computer, or implemented by a processor executing program code. The method may include the following steps:
  • Step S11: Obtain the video to be analyzed.
  • The video to be analyzed may include several frames of images.
  • For example, the video to be analyzed may include 8 frames of images, 16 frames of images, or 24 frames of images, and so on.
  • In one scenario, the video to be analyzed may be a surveillance video shot by a surveillance camera, used to analyze the behavior of a target object in the surveillance video, for example, whether the target object falls down or walks normally.
  • In another scenario, the video to be analyzed may be a video in a video library, used to classify the videos in the library, for example, as a football match video, a basketball match video, or a ski match video.
  • Step S12: Use the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map.
  • The above-mentioned preset network model may be a two-dimensional neural network model, for example, ResNet-50 or ResNet-101, which is not specifically limited here.
  • A ResNet is constructed from residual blocks, each of which uses several parameterized layers to learn the residual representation between its input and output.
  • The first multi-dimensional feature map contains feature information at different time steps of the video to be analyzed.
  • FIG. 2 is a schematic diagram of an embodiment of a video analysis processing process. As shown in FIG. 2, the abscissa represents the different time steps of the temporal dimension T, and the squares at the different time steps represent the feature information at those time steps.
  • As described above, the video to be analyzed includes several frames of images.
  • Feature extraction may be performed on each frame of the video to be analyzed through the preset network model to obtain the feature map corresponding to each frame of image.
  • The feature maps are then spliced according to the temporal order of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map. For example, if the video to be analyzed includes 8 frames of images, the preset network model may be used to perform feature extraction on the 8 frames to obtain the feature map of each frame, and the 8 feature maps are spliced directly according to the temporal order of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map. A minimal sketch of this step is given below.
  • Step S13: Use the offset prediction network to predict the first multi-dimensional feature map to obtain offset information.
  • The offset prediction network is used to predict the offset information, so that a temporal offset can subsequently be performed based on it, completing the integration of temporal and spatial information.
  • In some embodiments, the offset prediction network may itself be a preset network model, so that the first multi-dimensional feature map can be fed to it and the offset information obtained directly.
  • In some embodiments, the offset prediction network may include a down-sampling layer, a convolutional layer, a fully connected layer, an activation layer, a second fully connected layer, and a second activation layer connected in sequence. The offset prediction network therefore contains only a handful of layers, of which only the convolutional layer and the fully connected layers contain network parameters. This simplifies the network structure and reduces the network parameters, thereby reducing network capacity, improving convergence speed, and helping avoid over-fitting, so that the trained model is as accurate as possible and the accuracy of video analysis is improved.
  • The down-sampling layer (denoted as the second down-sampling layer) of the offset prediction network may be used to down-sample the first multi-dimensional feature map to obtain a down-sampling result (denoted as the second down-sampling result).
  • The down-sampling layer may specifically be an average pooling layer, and the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension (for example, a channel dimension). Averaging over the spatial positions gives the down-sampling result, which can be expressed as:
  \( z_{c,t} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} U_{c,t,h,w} \)
  where c and t index the preset dimension (for example, the channel dimension) and the temporal dimension respectively, \( z_{c,t} \) denotes the (c, t)-th element of the down-sampling result, H and W denote the height and width of the feature map, and \( U_{c,t,h,w} \) denotes the element of the first multi-dimensional feature map at channel c, time t, and spatial position (h, w).
  • The convolutional layer of the offset prediction network (denoted as the second convolutional layer) may be used to perform convolution processing on the down-sampling result (that is, the second down-sampling result) to obtain a feature extraction result (denoted as the second feature extraction result).
  • The convolutional layer of the offset prediction network may include the same number of convolution kernels as the number of frames of the video to be analyzed, and the size of each convolution kernel may be, for example, 3×3.
  • The first fully connected layer of the offset prediction network is then used to perform feature connection on the feature extraction result (that is, the second feature extraction result) to obtain a feature connection result (denoted as the first feature connection result).
  • The first fully connected layer of the offset prediction network may contain the same number of neurons as the number of frames of the video to be analyzed.
  • The first activation layer of the offset prediction network (which may be recorded as the second activation layer) is used to perform non-linear processing on the feature connection result (that is, the first feature connection result) to obtain a non-linear processing result.
  • The first activation layer of the offset prediction network may be a rectified linear unit (ReLU) activation layer.
  • The second fully connected layer of the offset prediction network is used to perform feature connection on the non-linear processing result to obtain a feature connection result (denoted as the second feature connection result); the second activation layer of the offset prediction network (which may be recorded as the third activation layer) then performs non-linear processing on the second feature connection result to obtain the offset information.
  • The second activation layer of the offset prediction network may be a Sigmoid activation layer, so that each element of the offset information is constrained to lie between 0 and 1.
  • The above process can be expressed as:
  \( \text{offset}_{raw} = \sigma_2\big(W_2\, \sigma_1\big(W_1\, F_{1dconv}(z)\big)\big) \)
  where z denotes the down-sampling result, \( F_{1dconv} \) denotes the convolutional layer of the offset prediction network, \( W_1 \) denotes the first fully connected layer, \( \sigma_1 \) denotes the first activation layer, \( W_2 \) denotes the second fully connected layer, \( \sigma_2 \) denotes the second activation layer, and \( \text{offset}_{raw} \) denotes the offset information.
  • In some embodiments, the offset information obtained from the second activation layer may be further constrained so that each element lies in the interval \( (-T/2, T/2) \), where T represents the number of frames of the video to be analyzed.
  • Specifically, 0.5 may be subtracted from each element of the offset information obtained by using the second activation layer of the offset prediction network to perform non-linear processing on the feature connection result, and each difference may be multiplied by the number of frames of the video to be analyzed to obtain the constrained offset information.
  • The above constraint processing can be expressed as:
  \( \text{offset} = (\text{offset}_{raw} - 0.5) \times T \)
  where \( \text{offset}_{raw} \) denotes the offset information produced by the second activation layer, T denotes the number of frames of the video to be analyzed, and offset denotes the constrained offset information.
  • Step S14: Use the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a second multi-dimensional feature map based on the offset feature information.
  • The at least part of the feature information may be selected along a preset dimension (for example, the channel dimension) of the first multi-dimensional feature map.
  • For example, if the number of channels in the channel dimension of the first multi-dimensional feature map is C, the at least part of the feature information may occupy a subset of those C channels.
  • Alternatively, the offset information may be used to perform a temporal offset on all the feature information of the first multi-dimensional feature map, which is not limited here.
  • In some embodiments, at least one set of feature information may be selected from the first multi-dimensional feature map along a preset dimension (for example, the channel dimension), where each set of feature information includes the feature information corresponding to different time steps in the same preset dimension, and the offset information is used to offset the at least one set of feature information along the temporal dimension.
  • In this case, the second fully connected layer of the offset prediction network may contain the same number of neurons as the number of selected sets of feature information, so that the number of elements in the offset information equals the number of selected sets.
  • Each element of the offset information may then be used to offset the corresponding set of feature information along the temporal dimension, for example by one time unit, or by two time units, and so on, which is not specifically limited here.
  • After the temporal offset, the at least part of the feature information that was offset may be spliced with the part of the feature information in the first multi-dimensional feature map that was not offset to obtain the second multi-dimensional feature map.
  • That is, the offset channels and the non-offset channels are spliced along the channel dimension to obtain the second multi-dimensional feature map, as in the sketch below.
  • Step S15: Use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
  • For example, the fully connected layer of the preset network model may be used to perform feature connection on the second multi-dimensional feature map, and the softmax layer of the preset network model may be used to perform regression, so as to obtain the category of the video to be analyzed (for example, a football match video or a ski match video) or the behavior category of a target object in the video to be analyzed (for example, walking normally, falling down, or running); other application scenarios can be deduced by analogy and are not enumerated here.
  • In some embodiments, the above-mentioned offset prediction network may be embedded before a convolutional layer of the preset network model.
  • For example, when the preset network model is ResNet-50, the offset prediction network may be embedded before the convolutional layer in each residual block.
  • The preset network model may include at least one convolutional layer, so in the feature extraction process a convolutional layer of the preset network model may be used to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map.
  • The number of convolutional layers of the preset network model may be more than one, for example 2, 3, or 4. In that case, before the second multi-dimensional feature map is analyzed to obtain the analysis result information of the video to be analyzed, a convolutional layer of the preset network model that has not yet performed feature extraction may be used to perform feature extraction on the second multi-dimensional feature map, and the offset prediction and temporal offset steps are repeated.
  • Finally, the fully connected layer of the preset network model analyzes the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed.
  • FIG. 3 is a schematic diagram of an embodiment of each stage of video analysis.
  • As shown in FIG. 3, the video to be analyzed first passes through the first convolutional layer of the preset network model for feature extraction to obtain a first multi-dimensional feature map, and a temporal offset is performed through the above-mentioned related steps to obtain a second multi-dimensional feature map.
  • The second multi-dimensional feature map is then input into the second convolutional layer for feature extraction to obtain a new first multi-dimensional feature map (denoted as the first multi-dimensional feature map in the figure), which is temporally offset through the above-mentioned related steps to obtain a new second multi-dimensional feature map (denoted as the second multi-dimensional feature map in the figure). Similarly, the third convolutional layer performs feature extraction on the new second multi-dimensional feature map to obtain another new first multi-dimensional feature map, which is temporally offset through the above-mentioned related steps to obtain another new second multi-dimensional feature map.
  • At this point, all three convolutional layers of the preset network model have completed the feature extraction step, and the fully connected layer of the preset network model analyzes the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed. This multi-stage flow is sketched below.
  • In the above solution, feature extraction is performed on the video to be analyzed to obtain a first multi-dimensional feature map containing feature information at different time steps of the video to be analyzed; the offset prediction network predicts the first multi-dimensional feature map to obtain offset information; the offset information is used to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map; and a second multi-dimensional feature map is obtained based on the offset feature information.
  • In this way, the temporal information of the video to be analyzed is modeled directly, which improves the processing speed of video analysis, and because the temporal offset jointly interleaves spatial and temporal information, analysis performed on this basis improves the accuracy of video analysis.
  • In some embodiments, the offset information includes a first number of offset values.
  • At least part of the first multi-dimensional feature map may be divided along a preset dimension (for example, the channel dimension) into a first number of sets of first feature information; that is, the at least one set of feature information includes a first number of sets of first feature information.
  • In this case, using the offset information to offset the at least one set of feature information along the temporal dimension may include: using the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information, where i is a positive integer less than or equal to the first number.
  • For example, the first offset value in the offset information is used to offset the first set of first feature information along the temporal dimension to obtain the first set of second feature information, and the second offset value is used to offset the second set of first feature information along the temporal dimension to obtain the second set of second feature information; when the first number takes other values, the rest can be deduced by analogy and is not enumerated here.
  • In some embodiments, using the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information may include the following steps:
  • Step S141: Obtain the numerical range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the range is a preset value.
  • For example, the preset value may be 1: the lower limit of the range is the i-th offset value rounded down, and the upper limit is the i-th offset value rounded up.
  • That is, for the i-th offset value \( O_i \), its range can be expressed as \( (n_0, n_0 + 1) \), where \( n_0 \in \mathbb{N} \).
  • When the offset value is 0.8, the range is 0 to 1; when the offset value is 1.4, the range is 1 to 2. Other offset values can be treated in the same way and are not enumerated here.
  • Step S142: Shift the i-th set of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th set of third feature information, and shift the i-th set of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th set of fourth feature information.
  • For example, the i-th set of first feature information can be expressed as \( U_{c,t} \). When the range of the i-th offset value is \( (n_0, n_0 + 1) \), shifting the i-th set of first feature information along the temporal dimension by the upper-limit number of time units gives the i-th set of third feature information \( U_{c,t+n_0+1} \), and shifting it by the lower-limit number of time units gives the i-th set of fourth feature information \( U_{c,t+n_0} \).
  • In some embodiments, each offset value may be a decimal.
  • For example, when the range of each offset value is 0 to 1 (that is, the upper limit is 1, the lower limit is 0, and the preset value is 1), then for the i-th set of first feature information \( U_{c,t} \), the corresponding third feature information can be expressed as \( U_{c,t+1} \) and the corresponding fourth feature information as \( U_{c,t} \).
  • In some embodiments, the range of the first feature information in the temporal dimension is [1, T], where T equals the number of frames of the video to be analyzed; for example, the T of the first feature information [1 0 0 0 0 0 0 1] is 8.
  • During the temporal offset, feature information may be shifted out entirely, so that the first feature information becomes a zero vector and the gradient vanishes during training. To alleviate this problem, buffers may be set for the temporal intervals (0, 1) and (T, T + 1) after the temporal offset, so that feature information shifted beyond time T + 1, or before time 0, is fixed to 0 in the buffer.
  • For example, when the i-th offset value is 0.4, the range it belongs to is 0 to 1, so the first feature information [1 0 0 0 0 0 0 1] is shifted by the upper-limit number (that is, 1) of time units to obtain the corresponding third feature information [0 1 0 0 0 0 0 0], and shifted by the lower-limit number (that is, 0) of time units to obtain the corresponding fourth feature information [1 0 0 0 0 0 0 1].
  • When the first feature information and the offset value take other values, the rest can be deduced by analogy and is not enumerated here.
  • Step S143: Use the difference between the i-th offset value and the lower limit as a weight to weight the i-th set of third feature information to obtain the i-th first weighted result, and use the difference between the upper limit and the i-th offset value as a weight to weight the i-th set of fourth feature information to obtain the i-th second weighted result.
  • With the i-th offset value expressed as \( O_i \) and its range as \( (n_0, n_0 + 1) \), the difference between \( O_i \) and the lower limit \( n_0 \), that is \( O_i - n_0 \), is used as the weight for the i-th set of third feature information \( U_{c,t+n_0+1} \), giving the first weighted result \( (O_i - n_0)\, U_{c,t+n_0+1} \); and the difference between the upper limit \( n_0 + 1 \) and \( O_i \), that is \( n_0 + 1 - O_i \), is used as the weight for the i-th set of fourth feature information \( U_{c,t+n_0} \), giving the second weighted result \( (n_0 + 1 - O_i)\, U_{c,t+n_0} \).
  • As before, when each offset value is a decimal in the range 0 to 1 (that is, the upper limit is 1, the lower limit is 0, and the preset value is 1), then for the first feature information \( U_{c,t} \) the corresponding third feature information is \( U_{c,t+1} \) and the corresponding fourth feature information is \( U_{c,t} \), so the first weighted result can be expressed as \( O_i\, U_{c,t+1} \) and the second weighted result as \( (1 - O_i)\, U_{c,t} \).
  • Continuing the numerical example, the corresponding third feature information is [0 1 0 0 0 0 0 0] and the corresponding fourth feature information is [1 0 0 0 0 0 0 1], so with the offset value 0.4 the first weighted result can be expressed as [0 0.4 0 0 0 0 0 0] and the second weighted result as [0.6 0 0 0 0 0 0 0.6].
  • Step S144: Calculate the sum of the i-th first weighted result and the i-th second weighted result as the i-th set of second feature information.
  • In the general case, the first weighted result can be expressed as \( (O_i - n_0)\, U_{c,t+n_0+1} \) and the second weighted result as \( (n_0 + 1 - O_i)\, U_{c,t+n_0} \), so the i-th set of second feature information can be expressed as \( (n_0 + 1 - O_i)\, U_{c,t+n_0} + (O_i - n_0)\, U_{c,t+n_0+1} \).
  • When each offset value is a decimal in the range 0 to 1 (that is, the upper limit is 1, the lower limit is 0, and the preset value is 1), then for the first feature information \( U_{c,t} \) the first weighted result can be expressed as \( O_i\, U_{c,t+1} \) and the second weighted result as \( (1 - O_i)\, U_{c,t} \), so the i-th set of second feature information can be expressed as \( (1 - O_i)\, U_{c,t} + O_i\, U_{c,t+1} \).
• Continuing the example above, the corresponding first weighted result can be expressed as [0 0.4 0 0 0 0 0 0] and the corresponding second weighted result as [0.6 0 0 0 0 0 0 0.6], so the i-th group second feature information can be expressed as [0.6 0.4 0 0 0 0 0 0.6]; a code sketch of this fractional offset follows below.
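• To make the above interpolation concrete, the following is a minimal sketch of the fractional timing offset in Python (PyTorch), assuming a value range of (0, 1) for the offset value (i.e., n_0 = 0); the function name fractional_shift and the use of F.pad for the zero buffers are illustrative assumptions, not details from the application.

```python
import torch
import torch.nn.functional as F

def fractional_shift(u: torch.Tensor, offset: float) -> torch.Tensor:
    """Fractionally shift u along its last (time) axis.

    With the offset value in (0, 1), the result is the interpolation
    (1 - offset) * shift_by_0(u) + offset * shift_by_1(u); features shifted
    past either end fall into zero buffers and are discarded, matching the
    (0, 1) and (T, T+1) buffer intervals described above.
    """
    t = u.shape[-1]
    padded = F.pad(u, (1, 1))            # zero buffers at both ends
    shifted_up = padded[..., :t]         # shift by the upper limit (1 unit)
    shifted_low = padded[..., 1:t + 1]   # shift by the lower limit (0 units)
    return offset * shifted_up + (1.0 - offset) * shifted_low

# Reproducing the worked example: offset 0.4 applied to [1 0 0 0 0 0 0 1]
u = torch.tensor([1., 0., 0., 0., 0., 0., 0., 1.])
print(fractional_shift(u, 0.4))  # tensor([0.6, 0.4, 0, 0, 0, 0, 0, 0.6])
```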
• In addition, a symmetric offset strategy can be used during training, that is, only half of the offset values are trained, and a conversion calculation (for example, reversing the order) is performed on them to obtain the other half of the offset values, which can reduce the processing load during training.
• To summarize: the i-th group of first feature information is offset along the time series dimension by the upper-limit number of time series units to obtain the i-th group of third feature information, and offset along the time series dimension by the lower-limit number of time series units to obtain the i-th group of fourth feature information; the difference between the i-th offset value and the lower limit value is used as the weight to weight the i-th group of third feature information to obtain the i-th group first weighted result, and the difference between the upper limit value and the i-th offset value is used as the weight to weight the i-th group of fourth feature information to obtain the i-th group second weighted result; the sum of the i-th group first weighted result and the i-th group second weighted result is calculated as the i-th group second feature information.
  • FIG. 5 is a schematic flowchart of another embodiment of the video analysis method of the present application. Specifically, it can include the following steps:
  • Step S51 Obtain the video to be analyzed.
  • Step S52 Use the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map.
  • the first multi-dimensional feature map contains feature information in different time series corresponding to the video to be analyzed.
• For details, please refer to the relevant steps in the foregoing embodiment.
  • Step S53 Use the offset prediction network to predict the first multi-dimensional feature map to obtain offset information.
  • FIG. 6 is a schematic diagram of another embodiment of the video analysis processing process.
  • the first multi-dimensional feature map can be predicted by the offset prediction network.
  • Step S54 Use the weight prediction network to predict the first multi-dimensional feature map to obtain weight information.
• During the timing offset, the features at the two ends of the first feature information may be removed. Therefore, in order to re-evaluate the importance of each feature in the first feature information after the timing offset, and to better capture long-range information, an attention mechanism can be used to re-weight each feature in the first feature information after the timing offset; for this, the weight information needs to be obtained.
  • the weight prediction network can be used to predict the first multi-dimensional feature map to obtain weight information.
  • the weight prediction network may include a down-sampling layer, a convolutional layer, and an activation layer that are sequentially connected. Therefore, the weight prediction network contains only 3 layers, and only the convolutional layer contains network parameters, which can simplify the network structure to a certain extent and reduce network parameters, thereby reducing network capacity, improving convergence speed, and avoiding overfitting.
• In this way, the trained model can be made as accurate as possible, which in turn can improve the accuracy of video analysis.
• In some embodiments, using the weight prediction network to predict the first multi-dimensional feature map to obtain the weight information may include: using the down-sampling layer of the weight prediction network (denoted as the first down-sampling layer) to down-sample the first multi-dimensional feature map to obtain a down-sampling result (denoted as the first down-sampling result); using the convolutional layer of the weight prediction network (denoted as the first convolutional layer) to perform convolution processing on the first down-sampling result to obtain a feature extraction result (denoted as the first feature extraction result); and using the activation layer of the weight prediction network to perform nonlinear processing on the first feature extraction result to obtain the weight information.
  • the downsampling layer may be an average pooling layer.
  • the convolutional layer of the weight prediction network can include one convolution kernel, and the activation layer of the weight prediction network can be a Sigmoid activation layer, so that each element in the weight information can be constrained to be between 0 and 1.
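• Under these choices, the weight prediction network can be sketched as below; this is a minimal Python (PyTorch) sketch assuming the first multi-dimensional feature map has shape (N, C, T, H, W), with the average pooling acting on the spatial dimensions and the convolution (a single kernel, hence one output channel) acting along the time axis; the class name and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class WeightPredictionNet(nn.Module):
    """Sketch of the 3-layer weight prediction network: average-pooling
    down-sampling layer -> convolutional layer with one kernel -> Sigmoid
    activation layer constraining each weight to (0, 1)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.down = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep T, pool H and W away
        self.conv = nn.Conv1d(channels, 1, kernel_size, padding=kernel_size // 2)
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) -> pooled: (N, C, T) -> weights: (N, 1, T)
        pooled = self.down(x).flatten(2)
        return self.act(self.conv(pooled))
```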
  • the offset prediction network and the weight prediction network in the embodiments of the present application may be embedded before the convolutional layer of the preset network model.
• For example, when the preset network model is ResNet-50, the offset prediction network and the weight prediction network can be embedded before the convolutional layer of each residual block, so as to use the first multi-dimensional feature map to predict the offset information and the weight information for the subsequent offset and weighting processing. In this way, only a small number of network parameters need to be added to the existing network parameters of ResNet-50 to model the timing information, which is conducive to reducing the processing load of video analysis, improving the processing speed of video analysis, accelerating convergence during model training, avoiding over-fitting, and improving the accuracy of video analysis.
• When the preset network model is another model, this can be deduced by analogy; no examples are given here.
• Steps S53 and S54 can be performed in any order: for example, step S53 may be performed first and then step S54; or step S54 may be performed first and then step S53; or steps S53 and S54 may be performed at the same time, which is not limited here. In addition, step S54 only needs to be performed before the subsequent step S56, which is not limited here either.
  • Step S55 Use the offset information to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map.
  • Step S56 Use the weight information to perform weighting processing on the offset feature information.
  • the video to be analyzed may specifically include a second number of frame images, and the weight information may include a second number of weight values.
  • the second number may specifically be 8, 16, 24, etc., which are not specifically limited herein.
• For each group of offset feature information, the j-th weight value in the weight information can be used to weight the feature value corresponding to the j-th time sequence in the current group of feature information, obtaining the corresponding weighted group of feature information, where j is a positive integer less than or equal to the second number.
• For example, the weight information can be [0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.2]; using the j-th weight value in the weight information to weight the j-th feature value of the offset feature information [0.6 0.4 0 0 0 0 0 0.6] from the example above, the weighted feature information of the corresponding group is obtained as [0.12 0.04 0 0 0 0 0 0.12].
  • Step S57 Obtain a second multi-dimensional feature map based on the weighted feature information.
  • the second multi-dimensional feature map corresponding to the first multi-dimensional feature map can be obtained.
• In some embodiments, obtaining the second multi-dimensional feature map based on the weighted feature information may include: using the weighted feature information and the feature information of the first multi-dimensional feature map that is not offset to compose the second multi-dimensional feature map.
  • the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map can be spliced to obtain the second multi-dimensional feature map.
  • the obtained second multi-dimensional feature map has the same size as the first multi-dimensional feature map.
  • the weighted feature information can be directly combined to form the second multi-dimensional feature map.
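• The composition step can be sketched as follows in Python (PyTorch), assuming the first multi-dimensional feature map has shape (N, C, T), that the offset groups occupy the first c_shift channels (an assumption; the application does not fix their position), and that weights has shape (N, 1, T) as produced by the weight prediction sketch above.

```python
import torch

def compose_second_feature_map(x: torch.Tensor, shifted: torch.Tensor,
                               weights: torch.Tensor, c_shift: int) -> torch.Tensor:
    """Weight the offset channel groups per timestep, then splice them with
    the un-offset channels so the result keeps the size of x."""
    weighted = shifted * weights                    # (N, c_shift, T) * (N, 1, T)
    return torch.cat([weighted, x[:, c_shift:]], dim=1)
```

With the numbers above, multiplying [0.6 0.4 0 0 0 0 0 0.6] elementwise by [0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.2] indeed yields [0.12 0.04 0 0 0 0 0 0.12].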
  • Step S58 Use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
• In this embodiment, the weight prediction network is used to predict the first multi-dimensional feature map to obtain the weight information, the offset information is used to perform timing offset on at least part of the feature information of the first multi-dimensional feature map, the weight information is used to weight the offset feature information, and the second multi-dimensional feature map is obtained based on the weighted feature information. Spatially and temporally interleaved feature information can therefore be obtained directly through the offset and weighting steps, which is conducive to improving the processing speed and accuracy of video analysis.
  • FIG. 7 is a schematic flowchart of an embodiment of a model training method for video analysis according to the present application.
  • the model training method used for video analysis in the embodiments of the present application can be specifically executed by electronic devices with processing functions such as microcomputers, servers, tablet computers, or implemented by a processor executing program code. Specifically, it can include the following steps:
  • Step S71 Obtain a sample video.
  • the sample video includes preset annotation information.
• Taking behavior recognition as an example, the preset annotation information of the sample video may include, but is not limited to: falling, normal walking, running, and other annotation information; or, taking video classification as an example, the preset annotation information of the sample video may include, but is not limited to: football match video, basketball match video, ski match video, and other label information.
  • Other application scenarios can be deduced by analogy, so we will not give examples one by one here.
  • the sample video may include several frames of images, for example, may include 8 frames of images, or may also include 16 frames of images, or may also include 24 frames of images, which is not specifically limited here.
  • Step S72 Perform feature extraction on the sample video by using the preset network model to obtain the first sample multi-dimensional feature map.
• The above-mentioned preset network model may be a two-dimensional neural network model, for example, ResNet-50 or ResNet-101; there is no specific limitation here.
• The ResNet network is constructed from residual blocks (Residual Block), which use multiple parameterized layers to learn the residual representation between input and output.
  • the first sample multi-dimensional feature map contains feature information in different time series corresponding to the sample video.
  • FIG. 2 is a schematic diagram of an embodiment of a video analysis processing process. As shown in FIG. 2, the abscissa represents different time series in the time series dimension T, and the squares corresponding to the different time series represent feature information in the different time series.
• Since the sample video includes several frames of images, feature extraction can be performed on the several frames of the sample video through the preset network model to obtain the feature map corresponding to each frame of image, and the several feature maps can then be directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map. For example, if the sample video includes 8 frames of images, the preset network model can be used to perform feature extraction on these 8 frames to obtain the feature map of each frame, and the 8 feature maps are directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map; a sketch of this step follows below.
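• The per-frame extraction and splicing can be sketched as follows in Python (PyTorch); the function name is hypothetical, and the backbone is assumed to be any 2-D network mapping one (3, H, W) frame to a (C, h, w) feature map.

```python
import torch
import torch.nn as nn

def extract_sample_feature_map(frames: torch.Tensor, backbone: nn.Module) -> torch.Tensor:
    """frames: (T, 3, H, W), the T frame images of one sample video.

    Each frame is passed through the 2-D backbone independently, and the
    resulting per-frame feature maps are spliced along a new time axis,
    giving a first sample multi-dimensional feature map of shape (C, T, h, w).
    """
    feats = [backbone(f.unsqueeze(0)).squeeze(0) for f in frames]  # T maps (C, h, w)
    return torch.stack(feats, dim=1)  # spliced in frame-time order
```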
  • Step S73 Use the offset prediction network to predict the multi-dimensional feature map of the first sample to obtain offset information.
  • the weight prediction network can also be used to predict the first sample multi-dimensional feature map to obtain weight information.
  • the network structure of the weight prediction network refer to the relevant steps in the foregoing embodiment, which will not be repeated here.
  • Step S74 Use the offset information to perform time sequence offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain the second sample multi-dimensional feature map based on the offset feature information.
  • the weight information can also be used to weight the offset feature information, and based on the weighted feature information, the second sample multi-dimensional feature map can be obtained.
  • the preset network model may include at least one convolutional layer, and then one convolutional layer of the preset network model may be used to perform feature extraction on the sample video to obtain the first sample multi-dimensional feature map.
• In some embodiments, the number of convolutional layers of the preset network model may be more than one. In this case, a convolutional layer in the preset network model that has not yet performed feature extraction can be used to perform feature extraction on the second sample multi-dimensional feature map to obtain a new first sample multi-dimensional feature map, and the step of using the offset prediction network to predict the new first sample multi-dimensional feature map to obtain offset information, together with the subsequent steps, is executed to obtain a new second sample multi-dimensional feature map; the above steps are then repeated until all convolutional layers of the preset network model have completed the feature extraction step on the new second sample multi-dimensional feature map, as sketched below.
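• This repetition can be sketched as a loop over convolutional blocks in Python (PyTorch); the class and attribute names are hypothetical, and each shifter is assumed to bundle the offset prediction (and weight prediction) with the timing offset, as in the sketches above.

```python
import torch.nn as nn

class InterlacedBackbone(nn.Module):
    """Sketch: before each convolutional block, predict offsets from the
    current feature map and shift it, then run the block; repeat until every
    block has consumed a shifted feature map."""

    def __init__(self, blocks: nn.ModuleList, shifters: nn.ModuleList):
        super().__init__()
        self.blocks, self.shifters = blocks, shifters

    def forward(self, x):
        for block, shifter in zip(self.blocks, self.shifters):
            x = shifter(x)  # offset prediction + timing offset + weighting
            x = block(x)    # yields the next "first (sample) multi-dimensional feature map"
        return x
```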
  • Step S75 Use the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video.
  • the fully connected layer of the preset network model can be used to analyze the second sample multi-dimensional feature map to obtain the analysis result information of the sample video.
• For example, the fully connected layer of the preset network model can be used to perform feature connection on the second sample multi-dimensional feature map, and the softmax layer of the preset network model can be used to perform regression, so as to obtain the probability values of the sample video belonging to each category (such as football match video, ski event video, etc.), or the probability values of the sample video belonging to various behaviors (such as falling, normal walking, running, etc.). Other application scenarios can be deduced by analogy and are not exemplified here.
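• A minimal Python (PyTorch) sketch of this analysis head is given below; pooling the feature map before the fully connected layer is an assumption, as are the class and parameter names.

```python
import torch
import torch.nn as nn

class AnalysisHead(nn.Module):
    """Sketch: fully connected feature connection followed by softmax
    regression over categories (or behaviors)."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (N, C, T, h, w) -> pooled: (N, C) -> probabilities: (N, num_classes)
        pooled = feat.mean(dim=(2, 3, 4))
        return torch.softmax(self.fc(pooled), dim=1)
```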
  • Step S76 Calculate the loss value by using the preset label information and the analysis result information.
• For example, a mean square error (Mean Square Error) loss function or a cross-entropy loss function can be used to calculate the loss value between the preset annotation information and the analysis result information, which is not limited here.
  • Step S77 Adjust the parameters of the preset network model and the offset prediction network based on the loss value.
• In some embodiments, the weight prediction network can also be used to predict the first sample multi-dimensional feature map to obtain weight information, so that the weight information is used to weight the offset feature information, and the second sample multi-dimensional feature map is obtained based on the weighted feature information; based on the loss value, the parameters of the preset network model, the offset prediction network, and the weight prediction network can all be adjusted.
• Specifically, the parameters of the convolutional layers and the fully connected layer in the preset network model can be adjusted, the parameters of the convolutional layer and the fully connected layers in the offset prediction network can be adjusted, and the parameters of the convolutional layer in the weight prediction network can be adjusted.
  • a gradient descent method can be used to adjust the parameters, such as a batch gradient descent method and a stochastic gradient descent method.
  • the above step S72 and subsequent steps may be executed again until the calculated loss value meets the preset training end condition.
• The preset training end condition may include: the loss value is less than a preset loss threshold and no longer decreases; or the number of parameter adjustments reaches a preset times threshold; or a test video is used to verify that the network performance meets a preset requirement (for example, the accuracy rate reaches a preset accuracy threshold). A sketch of such a training loop follows below.
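• Steps S72 to S77 can be sketched as the following Python (PyTorch) training loop; the model is assumed to bundle the preset network model with the offset (and weight) prediction networks so that a single optimizer step adjusts all of their parameters, and to return raw logits; the loader, learning rate, and end-condition values are illustrative.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, max_epochs: int = 50,
          loss_threshold: float = 1e-3) -> None:
    """Sketch of the training procedure: forward pass, loss against the
    preset annotation information (step S76), gradient-descent parameter
    adjustment (step S77), repeated until a preset end condition holds."""
    criterion = nn.CrossEntropyLoss()  # or nn.MSELoss(), as noted above
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(max_epochs):    # epoch cap acts as a "times threshold"
        last_loss = float("inf")
        for videos, labels in loader:
            logits = model(videos)             # analysis result information
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            last_loss = loss.item()
        if last_loss < loss_threshold:         # preset training end condition
            return
```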
• In the above solution, the first sample multi-dimensional feature map is obtained by performing feature extraction on the sample video and contains feature information at different timings corresponding to the sample video; the offset prediction network is used to predict the first sample multi-dimensional feature map to obtain offset information, the offset information is used to perform timing offset on at least part of the feature information of the first sample multi-dimensional feature map, and the second sample multi-dimensional feature map is obtained based on the offset feature information. In this way, the timing information of the sample video can be modeled directly, which is beneficial to improving the speed of model training, and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis on this basis is conducive to subsequently improving the accuracy of video analysis.
  • FIG. 8 is a schematic diagram of a framework of an embodiment of a video analysis device 80 of the present application.
  • the video analysis device 80 includes a video acquisition module 81, a feature extraction module 82, an offset prediction module 83, an offset processing module 84, and a network analysis module 85; among them,
  • the video acquisition module 81 is configured to acquire the video to be analyzed
  • the feature extraction module 82 is configured to perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different timings corresponding to the video to be analyzed;
  • the offset prediction module 83 is configured to use an offset prediction network to predict the first multi-dimensional feature map to obtain offset information
  • the offset processing module 84 is configured to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information;
  • the network analysis module 85 is configured to analyze the second multi-dimensional feature map by using a preset network model to obtain analysis result information of the video to be analyzed.
  • the technical solution of the embodiment of the present application uses a preset network model to process the video to be analyzed, which is beneficial to improve the processing speed of video analysis, and through timing offset, spatial information and timing information can be jointly interleaved, so it is performed on this basis Analysis and processing help improve the accuracy of video analysis.
  • the video analysis device 80 further includes a weight prediction module configured to use a weight prediction network to predict the first multi-dimensional feature map to obtain weight information;
  • the offset processing module 84 is configured to use the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map; use the weight information to perform weighting processing on the offset feature information; based on the weighted feature information , Get the second multi-dimensional feature map.
  • the dimensions of the first multi-dimensional feature map include time series dimensions and preset dimensions
• the offset processing module 84 is configured to select at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, where each group of feature information includes feature information corresponding to different time series in the same preset dimension, and to use the offset information to offset the at least one group of feature information in the time series dimension.
  • the preset dimension is the channel dimension; and/or, the offset information includes a first number of offset values, and at least one set of characteristic information includes a first number of sets of first characteristic information.
• The offset processing module 84 is configured to use the i-th offset value in the offset information to offset the i-th group of first feature information in the time series dimension to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
  • the offset processing module 84 is configured to obtain the value range to which the i-th offset value belongs, and the difference between the upper limit value and the lower limit value of the value range is a preset value
• The timing offset processing unit includes a timing offset processing subunit, configured to: offset the i-th group of first feature information along the time series dimension by the upper-limit number of time series units to obtain the i-th group of third feature information, and offset the i-th group of first feature information along the time series dimension by the lower-limit number of time series units to obtain the i-th group of fourth feature information; use the difference between the i-th offset value and the lower limit value as the weight to weight the i-th group of third feature information to obtain the i-th group first weighted result, and use the difference between the upper limit value and the i-th offset value as the weight to weight the i-th group of fourth feature information to obtain the i-th group second weighted result; and calculate the sum of the i-th group first weighted result and the i-th group second weighted result as the i-th group of second feature information.
  • the video to be analyzed includes a second number of frame images, and the weight information includes a second number of weight values.
• the offset processing module 84 is configured to, for each group of offset feature information, use the j-th weight value in the weight information to weight the feature value corresponding to the j-th time sequence in the current group of feature information to obtain the corresponding weighted group of feature information, where j is a positive integer less than or equal to the second number.
  • the offset processing module 84 is configured to use the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map to form a second multi-dimensional feature map.
  • the weight prediction module is configured to use the first down-sampling layer of the weight prediction network to down-sample the first multi-dimensional feature map to obtain the first down-sampling result; use the weight to predict the first convolutional layer of the network Perform convolution processing on the first down-sampling result to obtain the first feature extraction result; use the first activation layer of the weight prediction network to perform non-linear processing on the first feature extraction result to obtain weight information.
• the offset prediction module 83 is configured to: use the second down-sampling layer of the offset prediction network to down-sample the first multi-dimensional feature map to obtain the second down-sampling result; use the second convolutional layer of the offset prediction network to perform convolution processing on the second down-sampling result to obtain the second feature extraction result; use the first fully connected layer of the offset prediction network to perform feature connection on the second feature extraction result to obtain the first feature connection result; use the second activation layer of the offset prediction network to perform nonlinear processing on the first feature connection result to obtain the nonlinear processing result; use the second fully connected layer of the offset prediction network to perform feature connection on the nonlinear processing result to obtain the second feature connection result; and use the third activation layer of the offset prediction network to perform nonlinear processing on the second feature connection result to obtain the offset information.
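• This layer sequence can be sketched as follows in Python (PyTorch); the pooling choice, kernel size, hidden width, and the ReLU/Sigmoid activations are assumptions (a Sigmoid output keeps each offset value inside a unit-length value range, consistent with the 0-to-1 range used in the examples above).

```python
import torch
import torch.nn as nn

class OffsetPredictionNet(nn.Module):
    """Sketch of the offset prediction network: down-sampling -> convolution
    -> fully connected -> activation -> fully connected -> activation."""

    def __init__(self, channels: int, t: int, num_offsets: int, hidden: int = 64):
        super().__init__()
        self.down = nn.AdaptiveAvgPool3d((None, 1, 1))           # second down-sampling layer
        self.conv = nn.Conv1d(channels, channels, 3, padding=1)  # second convolutional layer
        self.fc1 = nn.Linear(channels * t, hidden)               # first fully connected layer
        self.act1 = nn.ReLU()                                    # second activation layer
        self.fc2 = nn.Linear(hidden, num_offsets)                # second fully connected layer
        self.act2 = nn.Sigmoid()                                 # third activation layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) -> (N, C, T) -> (N, num_offsets) offset values
        y = self.down(x).flatten(2)
        y = self.conv(y).flatten(1)
        return self.act2(self.fc2(self.act1(self.fc1(y))))
```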
  • the preset network model includes at least one convolutional layer
• the feature extraction module 82 is configured to use a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map; if the number of convolutional layers of the preset network model is more than one, a convolutional layer in the preset network model that has not yet performed feature extraction is used to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map;
  • the offset prediction module 83 is further configured to use the offset prediction network to predict the new first multi-dimensional feature map to obtain new offset information;
  • the offset processing module 84 is further configured to use the new offset information to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a new second multi-dimensional feature map based on the offset feature information;
  • the network analysis module 85 is configured to analyze the new second multi-dimensional feature map by using the fully connected layer of the preset network model to obtain analysis result information of the video to be analyzed.
  • the video to be analyzed includes several frames of images
• the feature extraction module 82 is configured to perform feature extraction on the several frames of images using the preset network model to obtain a feature map corresponding to each frame of image, and to splice the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
• FIG. 9 is a schematic diagram of an embodiment of a model training device 90 for video analysis according to the present application.
  • the model training device 90 for video analysis includes a video acquisition module 91, a feature extraction module 92, an offset prediction module 93, an offset processing module 94, a network analysis module 95, a loss calculation module 96, and a parameter adjustment module 97; among them,
  • the video acquisition module 91 is configured to acquire a sample video, where the sample video includes preset annotation information
  • the feature extraction module 92 is configured to perform feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information corresponding to the sample video at different timings;
  • the offset prediction module 93 is configured to use the offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information
  • the offset processing module 94 is configured to use the offset information to perform timing offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain a second sample multi-dimensional feature map based on the offset feature information;
  • the network analysis module 95 is configured to analyze the second sample multi-dimensional feature map by using a preset network model to obtain analysis result information of the sample video;
  • the loss calculation module 96 is configured to calculate a loss value using preset label information and analysis result information
  • the parameter adjustment module 97 is configured to adjust the parameters of the preset network model and the offset prediction network based on the loss value.
• With the above solution, the timing information of the sample video can be modeled directly, which is beneficial to improving the speed of model training, and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis on this basis is conducive to subsequently improving the accuracy of video analysis.
• In some embodiments, the model training device 90 for video analysis may further include other modules to execute the relevant steps in the above-mentioned embodiment of the model training method for video analysis. Similarly, the video analysis device 80 may further include other modules to execute the relevant steps in the above-mentioned embodiment of the video analysis method.
  • FIG. 10 is a schematic diagram of a framework of an embodiment of the electronic device 100 of the present application.
  • the electronic device 100 includes a memory 101 and a processor 102 coupled to each other.
• The processor 102 is configured to execute the program instructions stored in the memory 101 to implement the steps of any of the foregoing video analysis method embodiments, or to implement the steps of any of the foregoing embodiments of the model training method for video analysis.
  • the electronic device 100 may include but is not limited to: a microcomputer and a server.
  • the electronic device 100 may also include mobile devices such as a notebook computer and a tablet computer, which are not limited herein.
  • the processor 102 is configured to control itself and the memory 101 to implement the steps in any of the foregoing video analysis method embodiments, or implement the steps in any of the foregoing model training method embodiments for video analysis.
  • the processor 102 may also be referred to as a central processing unit (Central Processing Unit, CPU).
  • the processor 102 may be an integrated circuit chip with signal processing capability.
  • the processor 102 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA), or other Programmable logic devices, discrete gates or transistor logic devices, discrete hardware components.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
• In addition, the processor 102 may be jointly implemented by multiple integrated circuit chips.
  • FIG. 11 is a schematic diagram of a framework of an embodiment of the computer-readable storage medium 110 of this application.
  • the computer-readable storage medium 110 stores program instructions 1101 that can be executed by a processor, and the program instructions 1101 are used to implement the steps of any of the foregoing video analysis method embodiments, or implement any of the foregoing model training method embodiments for video analysis. Steps in.
  • the computer-readable storage medium may be a volatile or non-volatile storage medium.
• The embodiment of the present application also provides a computer program including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes steps for implementing any of the above-mentioned video analysis method embodiments, or for implementing any of the above-mentioned embodiments of the model training method for video analysis.
  • the disclosed method and device can be implemented in other ways.
• The device implementation described above is only illustrative. For example, the division of modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
• The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
• If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
• Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the methods in the various embodiments of the present application.
• The aforementioned storage media include: a USB flash drive, a mobile hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other media that can store program code.

Abstract

Disclosed in embodiments of the present application are a video analysis method and a related model training method, device and apparatus therefor. The video analysis method comprises: obtaining a video to be analyzed; performing feature extraction on said video by using a preset network model to obtain a first multi-dimensional feature map, wherein the first multi-dimensional feature map comprises feature information on different timings corresponding to said video; predicting the first multi-dimensional feature map by using an offset prediction network to obtain offset information; performing timing offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtaining a second multi-dimensional feature map on the basis of the offset feature information; and analyzing the second multidimensional feature map by using the preset network model to obtain analysis result information of said video.

Description

Video analysis method and related model training method, equipment and device

Cross-references to related applications

This application is filed based on the Chinese patent application with application number 202010053048.4, filed on January 17, 2020, and claims the priority of that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a video analysis method and related model training methods, equipment, and devices.

Background technique

With the development of artificial intelligence technologies such as neural networks and deep learning, training neural network models and using the trained neural network models to complete tasks such as classification and detection has gradually been favored by people.

At present, neural network models are generally designed with static images as the processing object.

Summary of the invention

The embodiments of the present application provide a video analysis method and related model training methods, equipment, and devices.
In a first aspect, an embodiment of the present application provides a video analysis method, including: obtaining a video to be analyzed; performing feature extraction on the video to be analyzed using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different timings corresponding to the video to be analyzed; predicting the first multi-dimensional feature map using an offset prediction network to obtain offset information; performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information, and obtaining a second multi-dimensional feature map based on the offset feature information; and analyzing the second multi-dimensional feature map using the preset network model to obtain analysis result information of the video to be analyzed.

The embodiment of the present application processes the video to be analyzed through the preset network model, which is beneficial to improving the processing speed of video analysis, and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis on this basis is conducive to improving the accuracy of video analysis.
In some optional embodiments of the present application, before the performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information and obtaining the second multi-dimensional feature map based on the offset feature information, the method further includes: predicting the first multi-dimensional feature map using a weight prediction network to obtain weight information. The performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information and obtaining the second multi-dimensional feature map based on the offset feature information includes: performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information; weighting the offset feature information using the weight information; and obtaining the second multi-dimensional feature map based on the weighted feature information.

With the technical solutions of the embodiments of the present application, spatially and temporally interleaved feature information can be obtained directly through the offset and weighting processing steps, which is conducive to improving the processing speed and accuracy of video analysis.
In some optional embodiments of the present application, the dimensions of the first multi-dimensional feature map include a time series dimension and a preset dimension. The performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information includes: selecting at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, where each group of feature information includes feature information corresponding to different time series in the same preset dimension; and offsetting the at least one group of feature information in the time series dimension using the offset information.

With the technical solutions of the embodiments of the present application, at least one group of feature information is selected from the first multi-dimensional feature map according to the preset dimension, each group of feature information includes feature information corresponding to different time series in the same preset dimension, and the offset information is used to offset the at least one group of feature information in the time series dimension, so the amount of calculation for offset processing can be reduced, which is further conducive to improving the processing speed of video analysis.
In some optional embodiments of the present application, the preset dimension is a channel dimension; and/or, the offset information includes a first number of offset values, and the at least one group of feature information includes a first number of groups of first feature information. The offsetting the at least one group of feature information in the time series dimension using the offset information includes: offsetting the i-th group of first feature information in the time series dimension using the i-th offset value in the offset information to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.

With the technical solutions of the embodiments of the present application, by performing offset processing on a number of groups of first feature information equal to the number of offset values contained in the offset information, spatially and temporally interleaved feature information can be obtained directly, which is conducive to improving the processing speed and accuracy of video analysis.
In some optional embodiments of the present application, the offsetting the i-th group of first feature information in the time series dimension using the i-th offset value in the offset information to obtain the i-th group of second feature information includes: obtaining the value range to which the i-th offset value belongs, where the difference between the upper limit value and the lower limit value of the value range is a preset value; offsetting the i-th group of first feature information along the time series dimension by the upper-limit number of time series units to obtain the i-th group of third feature information, and offsetting the i-th group of first feature information along the time series dimension by the lower-limit number of time series units to obtain the i-th group of fourth feature information; weighting the i-th group of third feature information using the difference between the i-th offset value and the lower limit value as the weight to obtain the i-th group first weighted result, and weighting the i-th group of fourth feature information using the difference between the upper limit value and the i-th offset value as the weight to obtain the i-th group second weighted result; and calculating the sum of the i-th group first weighted result and the i-th group second weighted result as the i-th group of second feature information.

The technical solutions of the embodiments of the present application can conveniently and quickly perform offset processing on the first feature information, which is conducive to improving the processing speed of video analysis.
In some optional embodiments of the present application, the video to be analyzed includes a second number of frame images, and the weight information includes a second number of weight values. The weighting the offset feature information using the weight information includes: for each group of offset feature information, weighting the feature value corresponding to the j-th time sequence in the current group of feature information using the j-th weight value in the weight information to obtain the corresponding weighted group of feature information, where j is a positive integer less than or equal to the second number.

With the technical solutions of the embodiments of the present application, for each group of offset feature information, the j-th weight value in the weight information is used to weight the feature value corresponding to the j-th time sequence of the current group of feature information to obtain the corresponding weighted group of feature information, so that the feature information can be re-weighted when feature information at some ends is offset out, which is conducive to improving the accuracy of video analysis.
In some optional embodiments of the present application, the obtaining the second multi-dimensional feature map based on the weighted feature information includes: composing the second multi-dimensional feature map using the weighted feature information and the feature information of the first multi-dimensional feature map that is not offset.

With the technical solutions of the embodiments of the present application, combining the weighted feature information and the un-offset feature information of the first multi-dimensional feature map into the second multi-dimensional feature map can reduce the calculation load, which is conducive to improving the processing speed of video analysis.
In some optional embodiments of the present application, the predicting the first multi-dimensional feature map using the weight prediction network to obtain the weight information includes: down-sampling the first multi-dimensional feature map using the first down-sampling layer of the weight prediction network to obtain a first down-sampling result; performing convolution processing on the first down-sampling result using the first convolutional layer of the weight prediction network to obtain a first feature extraction result; and performing nonlinear processing on the first feature extraction result using the first activation layer of the weight prediction network to obtain the weight information.

With the technical solutions of the embodiments of the present application, the weight information can be obtained by processing the first multi-dimensional feature map layer by layer through the first down-sampling layer, the first convolutional layer, and the first activation layer, which can effectively simplify the network structure of the weight prediction network and reduce network parameters, helping to improve the convergence speed during training of the model used for video analysis and to avoid over-fitting, thereby helping to improve the accuracy of video analysis.
In some optional embodiments of the present application, the predicting the first multi-dimensional feature map using the offset prediction network to obtain the offset information includes: down-sampling the first multi-dimensional feature map using the second down-sampling layer of the offset prediction network to obtain a second down-sampling result; performing convolution processing on the second down-sampling result using the second convolutional layer of the offset prediction network to obtain a second feature extraction result; performing feature connection on the second feature extraction result using the first fully connected layer of the offset prediction network to obtain a first feature connection result; performing nonlinear processing on the first feature connection result using the second activation layer of the offset prediction network to obtain a nonlinear processing result; performing feature connection on the nonlinear processing result using the second fully connected layer of the offset prediction network to obtain a second feature connection result; and performing nonlinear processing on the second feature connection result using the third activation layer of the offset prediction network to obtain the offset information.

The technical solutions of the embodiments of the present application can effectively simplify the network structure of the offset prediction network and reduce network parameters, which helps to improve the convergence speed during training of the model used for video analysis and to avoid over-fitting, thereby helping to improve the accuracy of video analysis.
In some optional embodiments of the present application, the preset network model includes at least one convolutional layer, and the performing feature extraction on the video to be analyzed using the preset network model to obtain the first multi-dimensional feature map includes: using a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map. If the number of convolutional layers of the preset network model is more than one, then after the second multi-dimensional feature map is obtained, and before the preset network model is used to analyze the second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, the method further includes: using a convolutional layer in the preset network model that has not yet performed feature extraction to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map; executing the step of using the offset prediction network to predict the new first multi-dimensional feature map to obtain offset information and the subsequent steps, so as to obtain a new second multi-dimensional feature map; and repeating the above steps until all convolutional layers of the preset network model have completed the feature extraction step on the new second multi-dimensional feature map. The analyzing the second multi-dimensional feature map using the preset network model to obtain the analysis result information of the video to be analyzed includes: analyzing the second multi-dimensional feature map using the fully connected layer of the preset network model to obtain the analysis result information of the video to be analyzed.

With the technical solutions of the embodiments of the present application, when the preset network model includes more than one convolutional layer, a convolutional layer in the preset network model that has not yet performed feature extraction is used to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map, and steps such as offset prediction are re-executed until all convolutional layers in the preset network model have completed the feature extraction step on the new second multi-dimensional feature map, so that the fully connected layer of the preset network model analyzes the second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, which in turn can improve the accuracy of video analysis.
In some optional embodiments of the present application, the video to be analyzed includes several frames of images, and the performing feature extraction on the video to be analyzed using the preset network model to obtain the first multi-dimensional feature map includes: performing feature extraction on the several frames of images respectively using the preset network model to obtain a feature map corresponding to each frame of image; and splicing the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.

With the technical solutions of the embodiments of the present application, feature extraction is performed on the several frames of the video to be analyzed through the preset network model to obtain the feature map corresponding to each frame of image, and the several feature maps are directly spliced according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map, which can reduce the processing load of feature extraction on the video to be analyzed and is conducive to improving the processing speed of video analysis.
In a second aspect, an embodiment of the present application provides a model training method for video analysis, including: obtaining a sample video, where the sample video includes preset annotation information; performing feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time sequences corresponding to the sample video; predicting the first sample multi-dimensional feature map by using an offset prediction network to obtain offset information; performing a temporal offset on at least part of the feature information of the first sample multi-dimensional feature map by using the offset information, and obtaining a second sample multi-dimensional feature map based on the offset feature information; analyzing the second sample multi-dimensional feature map by using the preset network model to obtain analysis result information of the sample video; calculating a loss value by using the preset annotation information and the analysis result information; and adjusting parameters of the preset network model and the offset prediction network based on the loss value.
The technical solutions of the embodiments of the present application can directly model the temporal information of the sample video, which helps to increase the training speed of the model; and through the temporal offset, the spatial information and the temporal information can be jointly interleaved, so performing analysis and processing on this basis helps to subsequently improve the accuracy of video analysis.
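For illustration only, one optimization step of this training method might be sketched as follows in PyTorch. The method names `backbone.extract` and `backbone.analyze`, the `shift_fn` helper and the cross-entropy loss are assumptions of the sketch; this description does not fix a concrete loss function or optimizer.

```python
import torch.nn.functional as F

def train_step(backbone, offset_net, shift_fn, optimizer, sample_video, label):
    """One training step: forward through both networks, compute the loss
    against the preset annotation information, and adjust both networks."""
    feat1 = backbone.extract(sample_video)  # first sample multi-dimensional feature map
    offsets = offset_net(feat1)             # offset information
    feat2 = shift_fn(feat1, offsets)        # second sample multi-dimensional feature map
    logits = backbone.analyze(feat2)        # analysis result information
    loss = F.cross_entropy(logits, label)   # loss value from preset annotation information
    optimizer.zero_grad()
    loss.backward()                         # gradients reach both networks
    optimizer.step()                        # adjust parameters of both networks
    return loss.item()
```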
In a third aspect, an embodiment of the present application provides a video analysis device, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module and a network analysis module. The video acquisition module is configured to acquire a video to be analyzed; the feature extraction module is configured to perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different time sequences corresponding to the video to be analyzed; the offset prediction module is configured to predict the first multi-dimensional feature map by using an offset prediction network to obtain offset information; the offset processing module is configured to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information; and the network analysis module is configured to analyze the second multi-dimensional feature map by using the preset network model to obtain analysis result information of the video to be analyzed.
In some optional embodiments of the present application, the device further includes a weight prediction module configured to predict the first multi-dimensional feature map by using a weight prediction network to obtain weight information. The offset processing module is configured to: perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information; perform weighting processing on the offset feature information by using the weight information; and obtain the second multi-dimensional feature map based on the weighted feature information.
In some optional embodiments of the present application, the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension.
The offset processing module is configured to select at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, where each group of feature information includes feature information corresponding to different time sequences in the same preset dimension, and to offset the at least one group of feature information in the temporal dimension by using the offset information.
In some optional embodiments of the present application, the preset dimension is the channel dimension; and/or,
the offset information includes a first number of offset values, and the at least one group of feature information includes a first number of groups of first feature information;
the offset processing module is configured to offset the i-th group of first feature information in the temporal dimension by using the i-th offset value in the offset information, to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
In some optional embodiments of the present application, the offset processing module is configured to: obtain the value range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the value range is a preset value; shift the i-th group of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th group of third feature information, and shift the i-th group of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th group of fourth feature information; perform weighting processing on the i-th group of third feature information with the difference between the i-th offset value and the lower limit as the weight to obtain the i-th group of first weighted results, and perform weighting processing on the i-th group of fourth feature information with the difference between the upper limit and the i-th offset value as the weight to obtain the i-th group of second weighted results; and calculate the sum of the i-th group of first weighted results and the i-th group of second weighted results as the i-th group of second feature information.
In some optional embodiments of the present application, the video to be analyzed includes a second number of frames of images, and the weight information includes the second number of weight values. The offset processing module is configured to, for each group of offset feature information, perform weighting processing on the feature value corresponding to the j-th time sequence in the current group of feature information by using the j-th weight value in the weight information, to obtain the corresponding group of weighted feature information, where j is a positive integer less than or equal to the second number.
In some optional embodiments of the present application, the offset processing module is configured to compose the second multi-dimensional feature map from the weighted feature information and the feature information of the first multi-dimensional feature map that has not been offset.
In some optional embodiments of the present application, the weight prediction module is configured to: downsample the first multi-dimensional feature map by using a first downsampling layer of the weight prediction network to obtain a first downsampling result; perform convolution processing on the first downsampling result by using a first convolutional layer of the weight prediction network to obtain a first feature extraction result; and perform nonlinear processing on the first feature extraction result by using a first activation layer of the weight prediction network to obtain the weight information.
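For illustration only, such a weight prediction network might look as follows; the (N, C, T, H, W) input layout, the kernel size and the use of a sigmoid as the first activation layer are assumptions of the sketch rather than limitations of the present application:

```python
import torch.nn as nn

class WeightPredictionNet(nn.Module):
    """First downsampling layer -> first convolutional layer -> first
    activation layer, producing one weight per time step."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))                # first downsampling layer
        self.conv = nn.Conv1d(channels, 1, kernel_size=3, padding=1)  # first convolutional layer
        self.act = nn.Sigmoid()                                       # first activation layer (assumed sigmoid)

    def forward(self, x):                  # x: (N, C, T, H, W)
        z = self.pool(x).flatten(2)        # first downsampling result: (N, C, T)
        w = self.conv(z)                   # first feature extraction result: (N, 1, T)
        return self.act(w)                 # weight information, one value per time step
```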
In some optional embodiments of the present application, the offset prediction module is configured to: downsample the first multi-dimensional feature map by using a second downsampling layer of the offset prediction network to obtain a second downsampling result; perform convolution processing on the second downsampling result by using a second convolutional layer of the offset prediction network to obtain a second feature extraction result; perform feature connection on the second feature extraction result by using a first fully connected layer of the offset prediction network to obtain a first feature connection result; perform nonlinear processing on the first feature connection result by using a second activation layer of the offset prediction network to obtain a nonlinear processing result; perform feature connection on the nonlinear processing result by using a second fully connected layer of the offset prediction network to obtain a second feature connection result; and perform nonlinear processing on the second feature connection result by using a third activation layer of the offset prediction network to obtain the offset information.
In some optional embodiments of the present application, the preset network model includes at least one convolutional layer. The feature extraction module is configured to perform feature extraction on the video to be analyzed by using a convolutional layer of the preset network model to obtain the first multi-dimensional feature map; if the preset network model has more than one convolutional layer, it is further configured to perform feature extraction on the second multi-dimensional feature map by using a convolutional layer of the preset network model that has not yet performed feature extraction, to obtain a new first multi-dimensional feature map. The offset prediction module is further configured to predict the new first multi-dimensional feature map by using the offset prediction network to obtain new offset information. The offset processing module is further configured to perform a temporal offset on at least part of the feature information of the new first multi-dimensional feature map by using the new offset information, and obtain a new second multi-dimensional feature map based on the offset feature information. The network analysis module is further configured to analyze the new second multi-dimensional feature map by using the fully connected layer of the preset network model to obtain the analysis result information of the video to be analyzed.
In some optional embodiments of the present application, the video to be analyzed includes several frames of images. The feature extraction module is configured to perform feature extraction on each of the several frames of images separately by using the preset network model to obtain a feature map corresponding to each frame of image, and to splice the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
In a fourth aspect, an embodiment of the present application provides a model training device for video analysis, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module, a network analysis module, a loss calculation module and a parameter adjustment module. The video acquisition module is configured to acquire a sample video, where the sample video includes preset annotation information; the feature extraction module is configured to perform feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time sequences corresponding to the sample video; the offset prediction module is configured to predict the first sample multi-dimensional feature map by using an offset prediction network to obtain offset information; the offset processing module is configured to perform a temporal offset on at least part of the feature information of the first sample multi-dimensional feature map by using the offset information, and obtain a second sample multi-dimensional feature map based on the offset feature information; the network analysis module is configured to analyze the second sample multi-dimensional feature map by using the preset network model to obtain analysis result information of the sample video; the loss calculation module is configured to calculate a loss value by using the preset annotation information and the analysis result information; and the parameter adjustment module is configured to adjust parameters of the preset network model and the offset prediction network based on the loss value.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the video analysis method of the first aspect above, or to implement the model training method for video analysis of the second aspect above.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium on which program instructions are stored, where the program instructions, when executed by a processor, implement the video analysis method of the first aspect above, or implement the model training method for video analysis of the second aspect above.
In a seventh aspect, an embodiment of the present application provides a computer program including computer-readable code, where, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the code to implement the video analysis method of the first aspect above, or to implement the model training method for video analysis of the second aspect above.
The technical solutions of the embodiments of the present application can directly model the temporal information of the video to be analyzed, which helps to increase the processing speed of video analysis; and through the temporal offset, the spatial information and the temporal information can be jointly interleaved, so performing analysis and processing on this basis helps to improve the accuracy of video analysis.
Description of the drawings
FIG. 1 is a schematic flowchart of an embodiment of the video analysis method of the present application;
FIG. 2 is a schematic diagram of an embodiment of a video analysis processing procedure;
FIG. 3 is a schematic diagram of an embodiment of the stages of video analysis;
FIG. 4 is a schematic flowchart of an embodiment of step S14 in FIG. 1;
FIG. 5 is a schematic flowchart of another embodiment of the video analysis method of the present application;
FIG. 6 is a schematic diagram of another embodiment of a video analysis processing procedure;
FIG. 7 is a schematic flowchart of an embodiment of the model training method for video analysis of the present application;
FIG. 8 is a schematic framework diagram of an embodiment of the video analysis device of the present application;
FIG. 9 is a schematic framework diagram of an embodiment of the model training device for video analysis of the present application;
FIG. 10 is a schematic framework diagram of an embodiment of the electronic device of the present application;
FIG. 11 is a schematic framework diagram of an embodiment of the computer-readable storage medium of the present application.
Detailed description
The solutions of the embodiments of the present application are described in detail below with reference to the drawings of the specification.
In the following description, specific details such as particular system structures, interfaces and technologies are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist at the same time, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following associated objects. Furthermore, "multiple" herein means two or more than two.
Please refer to FIG. 1, which is a schematic flowchart of an embodiment of the video analysis method of the present application. The video analysis method of the present application may specifically be executed by an electronic device with processing functions, such as a microcomputer, a server or a tablet computer, or implemented by a processor executing program code. Specifically, the method may include the following steps:
Step S11: Obtain the video to be analyzed.
In the embodiments of the present application, the video to be analyzed may include several frames of images; for example, the video to be analyzed may include 8 frames of images, or 16 frames of images, or 24 frames of images, and so on, which is not specifically limited here. In one implementation scenario, the video to be analyzed may be a surveillance video captured by a surveillance camera, so as to analyze the behavior of a target object in the surveillance video, for example, the target object falling down, the target object walking normally, and so on. In another implementation scenario, the video to be analyzed may be a video in a video library, so as to classify the videos in the video library, for example, into football match videos, basketball match videos, skiing videos, and so on.
Step S12: Perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map.
In a specific implementation scenario, in order to further reduce the network parameters and the processing load, thereby increasing the processing speed, increasing the convergence speed during training and avoiding overfitting, the above-mentioned preset network model may be a two-dimensional neural network model, for example, ResNet-50 or ResNet-101, which is not specifically limited here. A ResNet network is built from residual blocks and uses multiple parameterized layers to learn a residual representation between the input and the output.
In the embodiments of the present application, the first multi-dimensional feature map contains feature information at different time sequences corresponding to the video to be analyzed. Please refer to FIG. 2, which is a schematic diagram of an embodiment of a video analysis processing procedure. As shown in FIG. 2, the abscissa represents different time sequences in the temporal dimension T, and the squares corresponding to different time sequences represent the feature information at those time sequences.
In one implementation scenario, the video to be analyzed includes several frames of images. In order to reduce the processing load of feature extraction on the video to be analyzed and increase the processing speed of video analysis, feature extraction may be performed separately on the several frames of the video to be analyzed through the preset network model to obtain a feature map corresponding to each frame, and the several feature maps may be spliced according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map. For example, if the video to be analyzed includes 8 frames of images, the preset network model may be used to perform feature extraction on each of these 8 frames to obtain a feature map for each frame, and the 8 feature maps are then directly spliced according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
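For illustration only, this per-frame scheme can be sketched in PyTorch as follows; the use of torchvision's ResNet-50 and the input sizes are assumptions of the sketch, not limitations of the present application:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# 2D preset network model (illustrative); the pooling/classifier head is
# dropped so that each frame yields a spatial feature map.
cnn = nn.Sequential(*list(resnet50(weights=None).children())[:-2])

video = torch.randn(8, 3, 224, 224)                      # 8 frames to be analyzed
with torch.no_grad():
    maps = [cnn(frame.unsqueeze(0)) for frame in video]  # one (1, 2048, 7, 7) map per frame
feat = torch.stack(maps, dim=2).squeeze(0)               # splice along time: (2048, 8, 7, 7)
print(feat.shape)                                        # first multi-dimensional feature map (C, T, H, W)
```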
Step S13: Predict the first multi-dimensional feature map by using an offset prediction network to obtain offset information.
Unlike conventional static images, videos usually focus more on the continuous behavior and actions of a target object. In order to better capture the inherent temporal semantics of a video, the temporal information and the spatial information in the video can be integrated. Therefore, in the embodiments of the present application, an offset prediction network is used to predict the offset information, so that a temporal offset can subsequently be performed based on this offset information, thereby completing the integration of temporal and spatial information. The offset prediction network may specifically be a preset network model, so that the first multi-dimensional feature map can be predicted through this network model to directly obtain the offset information.
In one implementation scenario, the offset prediction network may include a downsampling layer, a convolutional layer, a fully connected layer, an activation layer, a fully connected layer and an activation layer connected in sequence. The offset prediction network thus contains only five layers, among which only the convolutional layer and the fully connected layers contain network parameters, which can simplify the network structure to a certain extent and reduce the network parameters, thereby reducing the network capacity, increasing the convergence speed and avoiding overfitting, so that the trained model is as accurate as possible, which in turn can improve the accuracy of video analysis.
Exemplarily, the downsampling layer of the offset prediction network (denoted the second downsampling layer) may be used to downsample the first multi-dimensional feature map to obtain a downsampling result (denoted the second downsampling result). In a specific implementation scenario, the downsampling layer may specifically be an average pooling layer, and the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension (for example, the channel dimension); the above downsampling of the first multi-dimensional feature map can then be expressed as:

z_{c,t} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} U_{c,t,h,w}   (1)

In the above formula, c and t index the preset dimension (for example, the channel dimension) and the temporal dimension, respectively; z_{c,t} denotes the (c,t)-th element of the downsampling result; H and W denote the height and width of the feature map; and U_{c,t} denotes the (c,t)-th element of the first multi-dimensional feature map.
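Equation (1) is simply a global spatial average pooling; for illustration, with an assumed (C, T, H, W) layout it reduces to a single mean over the spatial axes:

```python
import torch

U = torch.randn(256, 8, 14, 14)   # first multi-dimensional feature map (C, T, H, W)
z = U.mean(dim=(2, 3))            # z[c, t] = (1/(H*W)) * sum over h, w of U[c, t, h, w]
print(z.shape)                    # (256, 8): one value per (channel, time) pair
```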
Further, the convolutional layer of the offset prediction network (denoted the second convolutional layer) may be used to perform convolution processing on the downsampling result (i.e., the second downsampling result) to obtain a feature extraction result (denoted the second feature extraction result). The convolutional layer of the offset prediction network may specifically contain the same number of convolution kernels as the number of frames of the video to be analyzed, and the size of each convolution kernel may be, for example, 3×3.
Further, the first fully connected layer of the offset prediction network is used to perform feature connection on the feature extraction result (i.e., the second feature extraction result) to obtain a feature connection result (denoted the first feature connection result). The first fully connected layer of the offset prediction network may contain the same number of neurons as the number of frames of the video to be analyzed.
Further, the first activation layer of the offset prediction network (which may be denoted the second activation layer) is used to perform nonlinear processing on the feature connection result (i.e., the first feature connection result) to obtain a nonlinear processing result. The first activation layer of the offset prediction network may be a Rectified Linear Unit (ReLU) activation layer.
Further, the second fully connected layer of the offset prediction network is used to perform feature connection on the nonlinear processing result to obtain a feature connection result (denoted the second feature connection result); the second activation layer of the offset prediction network (which may be denoted the third activation layer) is then used to perform nonlinear processing on the feature connection result (i.e., the second feature connection result) to obtain the offset information. The second activation layer of the offset prediction network may be a Sigmoid activation layer, so that each element of the offset information is constrained to lie between 0 and 1.
The above processing can specifically be expressed as:

offset_{raw} = σ(W_2 δ(W_1 F_{1dconv}(z)))   (2)

In the above formula, z denotes the downsampling result, F_{1dconv} denotes the convolutional layer of the offset prediction network, W_1 denotes the first fully connected layer of the offset prediction network, δ denotes the first activation layer of the offset prediction network, W_2 denotes the second fully connected layer of the offset prediction network, σ denotes the second activation layer of the offset prediction network, and offset_{raw} denotes the offset information.
In another implementation scenario, in order to improve the stability and performance of the model, the offset information produced by the above second activation layer may further be subjected to constraint processing, so that each element of the offset information is constrained to the range (-T/2, T/2), where T denotes the number of frames of the video to be analyzed. Specifically, 0.5 may be subtracted from each element of the offset information obtained by the nonlinear processing of the feature connection result with the second activation layer of the offset prediction network, and the resulting difference may be multiplied by the number of frames of the video to be analyzed, thereby obtaining the constrained offset information. The constraint processing can specifically be expressed as:

offset = (offset_{raw} − 0.5) × T   (3)

In the above formula, offset_{raw} denotes the offset information produced by the second activation layer, T denotes the number of frames of the video to be analyzed, and offset denotes the offset information constrained to (-T/2, T/2).
Step S14: Perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information.
In one implementation scenario, in order to offset the information corresponding to different time sequences within at least part of the feature information, thereby integrating the temporal information and the spatial information and improving the accuracy of video analysis, the at least part of the feature information may specifically be obtained by partitioning along a preset dimension (for example, the channel dimension). As shown in FIG. 2, in order to further reduce the processing load, when the first multi-dimensional feature map has C channels in the channel dimension, the above at least part of the feature information occupies only a fraction of these C channels (the specific fraction is shown in FIG. 2). In addition, the offset information may also be used to perform a temporal offset on all of the feature information of the first multi-dimensional feature map, which is not limited here.
In one implementation scenario, in order to reduce the amount of computation for the offset information and increase the processing speed of video analysis, at least one group of feature information may be selected from the first multi-dimensional feature map according to a preset dimension (for example, the channel dimension), where each group of feature information includes feature information corresponding to different time sequences in the same preset dimension, and the offset information is used to offset the at least one group of feature information in the temporal dimension. In this case, the second fully connected layer of the offset prediction network may contain the same number of neurons as the number of selected groups of feature information, so that the number of elements in the offset information equals the number of selected groups, and each element of the offset information can then be used to offset one group of feature information in the temporal dimension, for example, by one time unit, or by two time units, and so on, which is not specifically limited here.
After the temporal offset has been performed on at least part of the feature information of the first multi-dimensional feature map by using the offset information, the offset part of the feature information may be spliced with the part of the feature information of the first multi-dimensional feature map that has not been temporally offset, thereby obtaining the second multi-dimensional feature map. In a specific implementation scenario, referring to FIG. 2, the feature information obtained by temporally offsetting the selected subset of channels is spliced with the remaining un-offset channels to obtain the second multi-dimensional feature map.
Step S15: Analyze the second multi-dimensional feature map by using the preset network model to obtain analysis result information of the video to be analyzed.
In one implementation scenario, the fully connected layer of the preset network model may be used to perform feature connection on the second multi-dimensional feature map, and the softmax layer of the preset network model may be used to perform regression, so as to obtain the category to which the video to be analyzed belongs (e.g., football match video, skiing video, etc.), or the behavior category of the target object in the video to be analyzed (e.g., walking normally, falling down, running, etc.); other application scenarios can be deduced by analogy and are not enumerated here.
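For illustration only, this analysis step can be sketched as a pooled fully connected head followed by a softmax regression; the feature width and the class count are assumptions of the sketch:

```python
import torch
import torch.nn as nn

head = nn.Linear(2048, 400)                 # fully connected layer (400 classes assumed)

feat2 = torch.randn(1, 2048, 8, 7, 7)       # second multi-dimensional feature map
pooled = feat2.mean(dim=(2, 3, 4))          # collapse time and space: (1, 2048)
probs = torch.softmax(head(pooled), dim=1)  # softmax regression over categories
print(probs.argmax(dim=1))                  # analysis result, e.g. a behaviour class
```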
In one implementation scenario, to facilitate processing, the above offset prediction network may be embedded before a convolutional layer of the preset network model. For example, if the preset network model is ResNet-50, the offset prediction network may be embedded before the convolutional layer in each residual block.
In one implementation scenario, the preset network model may include at least one convolutional layer, so that during feature extraction, one convolutional layer of the preset network model may be used to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map.
In one implementation scenario, in order to improve the accuracy of video analysis, the preset network model may have more than one convolutional layer; for example, the number of convolutional layers may be 2, 3 or 4, and so on. Therefore, before the second multi-dimensional feature map is analyzed to obtain the analysis result information of the video to be analyzed, a convolutional layer of the preset network model that has not yet performed feature extraction may also be used to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map, where the new first multi-dimensional feature map can keep the same dimensionality in the temporal dimension. The step of predicting the new first multi-dimensional feature map with the offset prediction network to obtain offset information, together with the subsequent steps, is then executed to obtain a new second multi-dimensional feature map, and the above steps are repeated until all convolutional layers of the preset network model have completed feature extraction; finally, the fully connected layer of the preset network model is used to analyze the last obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed.
Please refer to FIG. 3, which is a schematic diagram of an embodiment of the stages of video analysis, taking a preset network model with 3 convolutional layers as an example. After the video to be analyzed passes through the first convolutional layer of the preset network model for feature extraction to obtain the first multi-dimensional feature map, a temporal offset is performed through the above-mentioned related steps to obtain the second multi-dimensional feature map. Before the fully connected layer of the preset network model performs analysis, this second multi-dimensional feature map may be further fed into the second convolutional layer for feature extraction to obtain a new first multi-dimensional feature map (denoted the first multi-dimensional feature map in the figure), and the new first multi-dimensional feature map is temporally offset through the above-mentioned related steps to obtain a new second multi-dimensional feature map (denoted the second multi-dimensional feature map in the figure). Similarly, the third convolutional layer performs feature extraction on this new second multi-dimensional feature map to obtain yet another new first multi-dimensional feature map (denoted the first multi-dimensional feature map in the figure), which is again temporally offset through the above-mentioned related steps to obtain a new second multi-dimensional feature map (denoted the second multi-dimensional feature map in the figure). At this point, all three convolutional layers of the preset network model have completed the feature extraction step, and the fully connected layer of the preset network model can be used to analyze the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed. Of course, in other embodiments, in order to reduce the amount of computation, the temporal offset step may also be added only after some of the convolutional layers.
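For illustration only, the alternation of FIG. 3 can be sketched as a loop over convolutional stages, each followed by an offset-predicted temporal shift; every submodule here is a placeholder in the spirit of the earlier sketches, and the stage count is whatever lists are passed in:

```python
import torch.nn as nn

class ShiftPipeline(nn.Module):
    """Each convolutional stage is followed by an offset-predicted temporal
    shift; a fully connected layer analyzes the final feature map."""

    def __init__(self, conv_blocks, offset_nets, shift_fn, classifier):
        super().__init__()
        self.conv_blocks = nn.ModuleList(conv_blocks)  # convolutional layers
        self.offset_nets = nn.ModuleList(offset_nets)  # one offset predictor per stage
        self.shift_fn = shift_fn                       # e.g. the shift_and_concat sketch above
        self.classifier = classifier                   # fully connected layer

    def forward(self, x):                              # x: (N, C, T, H, W)
        for conv, offset_net in zip(self.conv_blocks, self.offset_nets):
            x = conv(x)                                # new first multi-dimensional feature map
            offsets = offset_net(x)                    # offset information
            x = self.shift_fn(x, offsets)              # new second multi-dimensional feature map
        return self.classifier(x.mean(dim=(2, 3, 4)))  # analysis result information
```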
In the above solution, feature extraction is performed on the video to be analyzed to obtain the first multi-dimensional feature map, which contains feature information at different time sequences corresponding to the video to be analyzed; the offset prediction network is used to predict the first multi-dimensional feature map to obtain offset information, the offset information is used to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map, and the second multi-dimensional feature map is obtained based on the offset feature information. In this way, the temporal information of the video to be analyzed can be modeled directly, which helps to increase the processing speed of video analysis; and through the temporal offset, the spatial information and the temporal information can be jointly interleaved, so performing analysis and processing on this basis helps to improve the accuracy of video analysis.
Please refer to FIG. 4, which is a schematic flowchart of an embodiment of step S14 in FIG. 1. In the embodiments of the present application, the offset information includes a first number of offset values, and at least part of the first multi-dimensional feature map may be divided along the preset dimension (for example, the channel dimension) into a first number of groups of first feature information; that is, the at least one group of feature information includes a first number of groups of first feature information. Offsetting the at least one group of feature information in the temporal dimension by using the offset information may then include: offsetting the i-th group of first feature information in the temporal dimension by using the i-th offset value in the offset information, to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
Referring to FIG. 2, when the at least part of the first multi-dimensional feature map includes 2 groups of first feature information, the first offset value in the offset information may be used to offset the first group of first feature information in the temporal dimension to obtain the first group of second feature information, and the second offset value may be used to offset the second group of first feature information in the temporal dimension to obtain the second group of second feature information; when the above first number takes other values, this can be deduced by analogy and is not enumerated here.
Specifically, offsetting the i-th group of first feature information in the temporal dimension by using the i-th offset value in the offset information to obtain the i-th group of second feature information may include the following steps:
Step S141: Obtain the value range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the value range is a preset value.
In one implementation scenario, the preset value may be 1, the lower limit of the value range is the value obtained by rounding the i-th offset value down, and the upper limit is the value obtained by rounding the i-th offset value up; that is, for the i-th offset value O_i, its value range can be expressed as (n_0, n_0+1), where n_0 ∈ ℕ. For example, when the offset value is 0.8, its value range is 0 to 1; when the offset value is 1.4, its value range is 1 to 2; other offset values can be deduced by analogy and are not enumerated here. In this way, when the offset value is a non-integer, the subsequent temporal-offset processing flow can be simplified.
Step S142: Shift the i-th group of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th group of third feature information, and shift the i-th group of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th group of fourth feature information.
In the embodiments of the present application, the i-th group of first feature information can be expressed as U_{c,t}; hence, when the value range of the i-th offset value is expressed as (n_0, n_0+1), shifting the i-th group of first feature information along the temporal dimension by the upper-limit number of time units yields the i-th group of third feature information U_{c,t+n_0+1}, and shifting it along the temporal dimension by the lower-limit number of time units yields the i-th group of fourth feature information U_{c,t+n_0}.
In a specific implementation scenario, each offset value may be a non-integer. For example, when the value range of each offset value is 0 to 1, the above upper limit is 1, the lower limit is 0, and the preset value is 1, so for the i-th group of first feature information U_{c,t}, the corresponding third feature information can be expressed as U_{c,t+1} and the corresponding fourth feature information as U_{c,t}. In addition, the range of the first feature information in the temporal dimension is [1, T], where the value of T equals the number of frames of the video to be analyzed; for example, for the first feature information [1 0 0 0 0 0 0 1], T is 8. During the temporal offset, the first feature information might become a zero vector because its feature information is shifted out, which could cause vanishing gradients during training. To alleviate this problem, a buffer can be set for the feature information located in the (0, 1) and (T, T+1) temporal intervals after the offset, so that when feature information is shifted in time beyond T+1, or before 0, the buffer can be fixed to 0. For example, taking the first feature information U_{c,t} = [1 0 0 0 0 0 0 1] as an example, when the i-th offset value is 0.4, its value range is 0 to 1, so the first feature information is shifted by the upper-limit number of time units (i.e., 1) to obtain the corresponding third feature information [0 1 0 0 0 0 0 0], and shifted by the lower-limit number of time units (i.e., 0) to obtain the corresponding fourth feature information [1 0 0 0 0 0 0 1]. Other values of the first feature information and the offset value can be deduced by analogy and are not enumerated here.
Step S143: Perform weighting processing on the i-th group of third feature information with the difference between the i-th offset value and the lower limit as the weight to obtain the i-th group of first weighted results, and perform weighting processing on the i-th group of fourth feature information with the difference between the upper limit and the i-th offset value as the weight to obtain the i-th group of second weighted results.
Taking the i-th offset value denoted O_i as an example, when the value range of the i-th offset value is expressed as (n_0, n_0+1), the difference between the i-th offset value O_i and the lower limit n_0 (i.e., O_i − n_0) is used as the weight to perform weighting processing on the i-th group of third feature information U_{c,t+n_0+1}, yielding the corresponding first weighted result (O_i − n_0) · U_{c,t+n_0+1}; and the difference between the upper limit n_0+1 and the i-th offset value O_i (i.e., n_0+1 − O_i) is used as the weight to perform weighting processing on the i-th group of fourth feature information U_{c,t+n_0}, yielding the corresponding second weighted result (n_0+1 − O_i) · U_{c,t+n_0}.
In a specific implementation scenario, each offset value may be a non-integer. For example, when the value range of each offset value is 0 to 1, the above upper limit is 1, the lower limit is 0, and the preset value is 1, so for the first feature information U_{c,t}, the corresponding third feature information can be expressed as U_{c,t+1} and the corresponding fourth feature information as U_{c,t}; the first weighted result can then be expressed as O_i · U_{c,t+1} and the second weighted result as (1 − O_i) · U_{c,t}. Still taking the first feature information U_{c,t} expressed as the one-dimensional vector [1 0 0 0 0 0 0 1] as an example, when the i-th offset value is 0.4, the corresponding third feature information can be expressed as [0 1 0 0 0 0 0 0] and the corresponding fourth feature information as [1 0 0 0 0 0 0 1], so the first weighted result can be expressed as [0 0.4 0 0 0 0 0 0] and the second weighted result as [0.6 0 0 0 0 0 0 0.6]. Other values of the first feature information and the offset value can be deduced by analogy and are not enumerated here.
Step S144: Calculate the sum of the i-th group of first weighted results and the i-th group of second weighted results as the i-th group of second feature information.
Taking the i-th offset value denoted O_i as an example, the first weighted result can be expressed as (O_i − n_0) · U_{c,t+n_0+1} and the second weighted result as (n_0+1 − O_i) · U_{c,t+n_0}, so the i-th group of second feature information can be expressed as

(n_0+1 − O_i) · U_{c,t+n_0} + (O_i − n_0) · U_{c,t+n_0+1}
In a specific implementation scenario, each offset value may be a non-integer. For example, when the value range of each offset value is 0 to 1, the above upper limit is 1, the lower limit is 0, and the preset value is 1, so for the first feature information U_{c,t}, the first weighted result can be expressed as O_i · U_{c,t+1} and the second weighted result as (1 − O_i) · U_{c,t}, and the i-th group of second feature information can therefore be expressed as (1 − O_i) · U_{c,t} + O_i · U_{c,t+1}. Still taking the first feature information U_{c,t} expressed as the one-dimensional vector [1 0 0 0 0 0 0 1] as an example, when the i-th offset value is 0.4, the corresponding first weighted result can be expressed as [0 0.4 0 0 0 0 0 0] and the corresponding second weighted result as [0.6 0 0 0 0 0 0 0.6], so the i-th group of second feature information can be expressed as [0.6 0.4 0 0 0 0 0 0.6]. Other values of the first feature information and the offset value can be deduced by analogy and are not enumerated here.
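For illustration only, steps S141 to S144 can be reproduced for one group with a few lines of PyTorch; the last line checks the worked example above ([1 0 0 0 0 0 0 1] with offset 0.4):

```python
import math
import torch

def shift1d(v, k):
    """Zero-padded temporal shift of a 1-D tensor by k positions; the zeros
    play the role of the buffer described in step S142."""
    out = torch.zeros_like(v)
    if k == 0:
        out[:] = v
    elif k > 0:
        out[k:] = v[:-k]
    else:
        out[:k] = v[-k:]
    return out

def fractional_shift(u, offset):
    """Steps S141-S144 for one group: decompose a real-valued offset into its
    floor/ceil integer shifts and blend the two shifted copies linearly."""
    n0 = math.floor(offset)                 # lower limit of the value range (S141)
    third = shift1d(u, n0 + 1)              # shifted by the upper limit (S142)
    fourth = shift1d(u, n0)                 # shifted by the lower limit (S142)
    return (offset - n0) * third + (n0 + 1 - offset) * fourth   # S143 + S144

u = torch.tensor([1.0, 0, 0, 0, 0, 0, 0, 1])   # first feature information
print(fractional_shift(u, 0.4))                # tensor([0.6, 0.4, 0, 0, 0, 0, 0, 0.6])
```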
此外，在一个实施场景中，由于以组为单位将每组第一特征信息进行时序偏移，故在训练时可以采用对称偏移的策略，即只训练一半的偏移值，另一半偏移值由已训练的一半经转换计算（例如，颠倒其次序）得到，从而能够减轻训练时的处理负荷。In addition, in an implementation scenario, since each group of first feature information is time-shifted on a per-group basis, a symmetric offset strategy can be adopted during training: only half of the offset values are trained, and the other half are obtained from the trained half by a conversion calculation (for example, reversing their order), which can reduce the processing load during training.
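As a hedged illustration of one such conversion calculation, the sketch below (PyTorch; the offset values are illustrative) trains only the first half of the offsets and mirrors it to obtain the other half:

```python
import torch

half = torch.tensor([0.1, 0.4, 0.7, 0.9])  # the trained half of the offset values
offsets = torch.cat([half, half.flip(0)])  # conversion: reverse order for the other half
print(offsets)
# tensor([0.1000, 0.4000, 0.7000, 0.9000, 0.9000, 0.7000, 0.4000, 0.1000])
```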
区别于前述实施例，通过获取第i个偏移值所属的数值范围，且该数值范围的上限值与下限值之差为一预设数值，将第i组第一特征信息沿时序维度偏移上限值个时序单位，得到第i组第三特征信息，并将第i组第一特征信息沿时序维度偏移下限值个时序单位，得到第i组第四特征信息；以第i个偏移值与下限值之间的差作为权重对第i组第三特征信息进行加权处理，得到第i组第一加权结果，并以上限值与第i个偏移值之间的差作为权重对第i组第四特征信息进行加权处理，得到第i组第二加权结果；计算第i组第一加权结果和第i组第二加权结果之间的和，以作为第i组第二特征信息，进而能够方便、快速地对第一特征信息进行偏移处理，有利于提高视频分析的处理速度。Different from the foregoing embodiments, the value range to which the i-th offset value belongs is acquired, where the difference between the upper limit and the lower limit of this range is a preset value; the i-th group of first feature information is shifted along the time sequence dimension by the upper-limit number of time sequence units to obtain the i-th group of third feature information, and shifted along the time sequence dimension by the lower-limit number of time sequence units to obtain the i-th group of fourth feature information; the difference between the i-th offset value and the lower limit is used as a weight to weight the i-th group of third feature information, yielding the i-th group of first weighted results, and the difference between the upper limit and the i-th offset value is used as a weight to weight the i-th group of fourth feature information, yielding the i-th group of second weighted results; the sum of the i-th group of first weighted results and the i-th group of second weighted results is then computed as the i-th group of second feature information. In this way the first feature information can be shifted conveniently and quickly, which is beneficial to improving the processing speed of video analysis.
请参阅图5,图5是本申请视频分析方法另一实施例的流程示意图。具体而言,可以包括如下步骤:Please refer to FIG. 5, which is a schematic flowchart of another embodiment of the video analysis method of the present application. Specifically, it can include the following steps:
步骤S51:获取待分析视频。Step S51: Obtain the video to be analyzed.
具体可以参阅前述实施例中的相关步骤。For details, please refer to the relevant steps in the foregoing embodiment.
步骤S52:利用预设网络模型对待分析视频进行特征提取,得到第一多维特征图。Step S52: Use the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map.
本申请实施例中,第一多维特征图包含与待分析视频对应的不同时序上的特征信息。具体可以参阅前述实施例中的相关步骤。In this embodiment of the present application, the first multi-dimensional feature map contains feature information in different time series corresponding to the video to be analyzed. For details, please refer to the relevant steps in the foregoing embodiment.
步骤S53:利用偏移预测网络对第一多维特征图进行预测,得到偏移信息。Step S53: Use the offset prediction network to predict the first multi-dimensional feature map to obtain offset information.
请结合参阅图6，图6是视频分析处理过程另一实施例的示意图，如图6所示，第一多维特征图可以经过偏移预测网络进行预测，具体可以参阅前述实施例中的相关步骤。Please refer to FIG. 6, which is a schematic diagram of another embodiment of the video analysis processing process. As shown in FIG. 6, the first multi-dimensional feature map can be fed to the offset prediction network for prediction; for details, please refer to the relevant steps in the foregoing embodiments.
步骤S54:利用权重预测网络对第一多维特征图进行预测,得到权重信息。Step S54: Use the weight prediction network to predict the first multi-dimensional feature map to obtain weight information.
在时序偏移过程中，第一特征信息首末两端的特征可能会被移出，因此为了重新衡量经时序偏移后的第一特征信息中各特征的重要程度，以更好地获取长范围信息，可以采用注意力机制对经时序偏移后的第一特征信息中各特征进行重新加权处理，故需要获取权重信息。请继续结合参阅图6，可以利用权重预测网络对第一多维特征图进行预测，得到权重信息。During the time sequence shift, the features at the two ends of the first feature information may be shifted out. Therefore, in order to re-measure the importance of each feature in the time-shifted first feature information and better capture long-range information, an attention mechanism can be used to re-weight each feature of the time-shifted first feature information, for which weight information needs to be obtained. Continuing to refer to FIG. 6, the weight prediction network can be used to predict the first multi-dimensional feature map to obtain the weight information.
在一个实施场景中，权重预测网络可以包括顺序连接的降采样层、卷积层和激活层。因此，权重预测网络仅包含3层，且其中仅卷积层包含网络参数，可以在一定程度上简化网络结构，并减少网络参数，从而能够降低网络容量，提高收敛速度，避免过拟合，使得训练得到的模型尽可能地准确，进而能够提高视频分析的准确性。In an implementation scenario, the weight prediction network may include a down-sampling layer, a convolutional layer and an activation layer connected in sequence. The weight prediction network thus contains only three layers, of which only the convolutional layer contains network parameters. This simplifies the network structure and reduces the number of network parameters to a certain extent, thereby lowering the network capacity, speeding up convergence and avoiding over-fitting, so that the trained model is as accurate as possible, which in turn improves the accuracy of video analysis.
在一些可选实施例中，所述利用权重预测网络对所述第一多维特征图进行预测，得到权重信息，可以包括：利用权重预测网络的降采样层（记为第一降采样层）对第一多维特征图进行降采样，得到降采样结果（记为第一降采样结果）；利用权重预测网络的卷积层（记为第一卷积层）对降采样结果（即第一降采样结果）进行卷积处理，得到特征提取结果（记为第一特征提取结果）；利用权重预测网络的激活层对特征提取结果（即第一特征提取结果）进行非线性处理，得到权重信息。在一个具体的实施场景中，降采样层可以是平均池化层，具体可以参阅前述实施例中的相关步骤。权重预测网络的卷积层中可以包含1个卷积核，权重预测网络的激活层可以是Sigmoid激活层，从而能够将权重信息中的各个元素约束至0至1之间。In some optional embodiments, using the weight prediction network to predict the first multi-dimensional feature map to obtain the weight information may include: using the down-sampling layer of the weight prediction network (denoted as the first down-sampling layer) to down-sample the first multi-dimensional feature map to obtain a down-sampling result (denoted as the first down-sampling result); using the convolutional layer of the weight prediction network (denoted as the first convolutional layer) to perform convolution processing on the down-sampling result (i.e. the first down-sampling result) to obtain a feature extraction result (denoted as the first feature extraction result); and using the activation layer of the weight prediction network to perform non-linear processing on the feature extraction result (i.e. the first feature extraction result) to obtain the weight information. In a specific implementation scenario, the down-sampling layer may be an average pooling layer; for details, please refer to the relevant steps in the foregoing embodiments. The convolutional layer of the weight prediction network may contain one convolution kernel, and the activation layer may be a Sigmoid activation layer, so that each element of the weight information is constrained to lie between 0 and 1.
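To make this three-layer structure concrete, the following is a minimal PyTorch sketch under stated assumptions: the feature map is laid out as (N, C, T, H, W), spatial average pooling acts as the first down-sampling layer, and the kernel size is illustrative.

```python
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    """Sketch of the weight prediction branch: down-sampling (average pooling),
    one convolutional layer with a single kernel, and a Sigmoid activation."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, 1, kernel_size=3, padding=1)  # one kernel
        self.act = nn.Sigmoid()  # constrains each weight to lie between 0 and 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = x.mean(dim=(3, 4))  # down-sampling: spatial average pooling -> (N, C, T)
        return self.act(self.conv(pooled))  # (N, 1, T): one weight per time sequence
```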
此外，为了便于处理，本申请实施例中的偏移预测网络和权重预测网络可以嵌入在预设网络模型的卷积层之前。例如，预设网络模型为ResNet-50，偏移预测网络和权重预测网络可以嵌入在每个残差块的卷积层之前，从而分别利用第一多维特征图，预测得到偏移信息和权重信息，以便后续偏移与加权处理，从而能够在ResNet-50已有的网络参数的基础上，加入少量的网络参数实现时序信息的建模，有利于降低视频分析的处理负荷，提高视频分析的处理速度，且有利于加快模型训练时的收敛速度，避免过拟合，提高视频分析的准确度。当预设网络模型为其他模型时，可以以此类推，在此不再一一举例。In addition, to facilitate processing, the offset prediction network and the weight prediction network in the embodiments of the present application may be embedded before a convolutional layer of the preset network model. For example, if the preset network model is ResNet-50, the two networks may be embedded before the convolutional layer of each residual block, so that the offset information and the weight information are each predicted from the first multi-dimensional feature map for the subsequent shifting and weighting. Temporal information can thus be modeled by adding only a small number of network parameters on top of the existing parameters of ResNet-50, which helps reduce the processing load and increase the processing speed of video analysis, and also helps speed up convergence during model training, avoid over-fitting and improve the accuracy of video analysis. When the preset network model is another model, the same applies by analogy; no further examples are given here.
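A hedged sketch of such embedding, assuming PyTorch; `InterlacedBlock` and `shift_module` are illustrative names rather than the classes of any actual library:

```python
import torch.nn as nn

class InterlacedBlock(nn.Module):
    """Wrap an existing residual block so that offset/weight prediction and the
    temporal shift run on its input before the block's own convolutions."""
    def __init__(self, block: nn.Module, shift_module: nn.Module):
        super().__init__()
        self.shift = shift_module  # predicts offsets/weights, shifts and re-weights
        self.block = block         # the original ResNet-50 residual block, unchanged

    def forward(self, x):
        return self.block(self.shift(x))
```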
上述步骤S53和步骤S54可以按照先后顺序执行，例如，先执行步骤S53，后执行步骤S54；或者，先执行步骤S54，后执行步骤S53；或者，步骤S53和步骤S54同时执行，在此不做限定。此外，上述步骤S54先于后续的步骤S56执行即可，在此不做限定。The above step S53 and step S54 may be performed in either order: step S53 first and then step S54, or step S54 first and then step S53, or step S53 and step S54 at the same time, which is not limited here. In addition, it suffices that step S54 is performed before the subsequent step S56, which is likewise not limited here.
步骤S55:利用偏移信息对第一多维特征图的至少部分特征信息进行时序偏移。Step S55: Use the offset information to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map.
具体可以参阅前述实施例中的相关步骤。For details, please refer to the relevant steps in the foregoing embodiment.
步骤S56:利用权重信息对偏移后的特征信息进行加权处理。Step S56: Use the weight information to perform weighting processing on the offset feature information.
在一个实施场景中，待分析视频具体可以包括第二数量帧图像，权重信息可以包括第二数量个权重值，第二数量具体可以是8、16、24等等，在此不做具体限定。在加权处理时，即所述利用所述权重信息对偏移后的所述特征信息进行加权处理，包括：可以对偏移后的每组特征信息，分别利用权重信息中的第j个权重值对当前组特征信息中的第j个时序对应的特征值进行加权处理，得到加权处理后的对应组特征信息，其中，j为小于或等于第二数量的正整数。In an implementation scenario, the video to be analyzed may include a second number of frame images, and the weight information may include a second number of weight values; the second number may be 8, 16, 24, etc., which is not specifically limited here. The weighting processing, i.e. using the weight information to weight the shifted feature information, may include: for each shifted group of feature information, using the j-th weight value in the weight information to weight the feature value corresponding to the j-th time sequence in the current group of feature information, obtaining the weighted corresponding group of feature information, where j is a positive integer less than or equal to the second number.
以上述实施例中偏移处理后的特征信息[0.6 0.4 0 0 0 0 0 0.6]为例，权重信息可以为[0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.2]，则分别利用权重信息中的第j个权重值对上述特征信息中的第j个时序对应的特征值进行加权处理后，得到对应组的特征信息为[0.12 0.04 0 0 0 0 0 0.12]。当偏移后的特征信息、权重信息为其他数值时，可以以此类推，在此不再一一举例。Taking the shifted feature information [0.6 0.4 0 0 0 0 0 0.6] from the above embodiment as an example, the weight information may be [0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.2]; weighting the feature value at the j-th time sequence with the j-th weight value then yields the corresponding group of feature information [0.12 0.04 0 0 0 0 0 0.12]. When the shifted feature information or the weight information takes other values, the same applies by analogy; no further examples are given here.
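The weighting is simply an element-wise product over the time sequence; the following PyTorch snippet reproduces the numbers above:

```python
import torch

shifted = torch.tensor([0.6, 0.4, 0., 0., 0., 0., 0., 0.6])  # shifted group feature
weights = torch.tensor([0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2])  # one weight per time step
print(shifted * weights)
# tensor([0.1200, 0.0400, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1200])
```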
步骤S57:基于加权处理后的特征信息,得到第二多维特征图。Step S57: Obtain a second multi-dimensional feature map based on the weighted feature information.
请结合参阅图6，经过时序偏移和加权处理之后，即可得到与第一多维特征图对应的第二多维特征图。在一个实施场景中，所述基于所述加权处理后的所述特征信息，得到第二多维特征图，可以包括：利用加权处理后的特征信息以及第一多维特征图中未被偏移的特征信息，组成第二多维特征图。Referring to FIG. 6, after the time sequence shift and the weighting processing, the second multi-dimensional feature map corresponding to the first multi-dimensional feature map is obtained. In an implementation scenario, obtaining the second multi-dimensional feature map based on the weighted feature information may include: composing the second multi-dimensional feature map from the weighted feature information and the feature information of the first multi-dimensional feature map that was not shifted.
具体地,请结合参阅图2,可以将加权处理后的特征信息与第一多维特征图中未被偏移的特征信息进行拼接处理,得到第二多维特征图。得到的第二多维特征图与第一多维特征图具有相同的尺寸。此外,若第一多维特征图中的特征信息均进行了时序偏移处理,则可以直接将加权处理后的特征信息进行组合,作为第二多维特征图。Specifically, referring to FIG. 2, the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map can be spliced to obtain the second multi-dimensional feature map. The obtained second multi-dimensional feature map has the same size as the first multi-dimensional feature map. In addition, if all the feature information in the first multi-dimensional feature map has undergone time-series offset processing, the weighted feature information can be directly combined to form the second multi-dimensional feature map.
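A minimal sketch of this splicing, assuming PyTorch, a (N, C, T, H, W) layout, and that the shift was applied to channel groups; the names are illustrative:

```python
import torch

def compose_second_feature_map(weighted: torch.Tensor, unshifted: torch.Tensor) -> torch.Tensor:
    """Concatenate the weighted, shifted channel groups with the channels that were
    never shifted; the result has the same size as the first multi-dimensional map."""
    return torch.cat([weighted, unshifted], dim=1)  # splice along the channel dimension
```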
步骤S58:利用预设网络模型对第二多维特征图进行分析,得到待分析视频的分析结果信息。Step S58: Use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
具体可以参阅前述实施例中的相关步骤。For details, please refer to the relevant steps in the foregoing embodiment.
区别于前述实施例，利用权重预测网络对第一多维特征图进行预测，得到权重信息，并利用偏移信息对第一多维特征图的至少部分特征信息进行时序偏移，且利用权重信息对偏移后的特征信息进行加权处理，并基于加权处理后的特征信息，得到第二多维特征图，故通过偏移、加权的处理步骤能够直接得到空间、时序联合交错的特征信息，有利于提高视频分析的处理速度和准确度。Different from the foregoing embodiments, the weight prediction network is used to predict the first multi-dimensional feature map to obtain the weight information, the offset information is used to time-shift at least part of the feature information of the first multi-dimensional feature map, the weight information is used to weight the shifted feature information, and the second multi-dimensional feature map is obtained based on the weighted feature information. The shifting and weighting steps thus directly produce feature information in which spatial and temporal information are jointly interleaved, which is beneficial to improving the processing speed and accuracy of video analysis.
请参阅图7,图7是本申请用于视频分析的模型训练方法一实施例的流程示意图。本申请实施例用于视频分析的模型训练方法具体可以由微型计算机、服务器、平板电脑等具有处理功能的电子设备执行,或者由处理器执行程序代码实现。具体而言,可以包括如下步骤:Please refer to FIG. 7. FIG. 7 is a schematic flowchart of an embodiment of a model training method for video analysis according to the present application. The model training method used for video analysis in the embodiments of the present application can be specifically executed by electronic devices with processing functions such as microcomputers, servers, tablet computers, or implemented by a processor executing program code. Specifically, it can include the following steps:
步骤S71:获取样本视频。Step S71: Obtain a sample video.
本申请实施例中，样本视频包括预设标注信息。以对视频进行行为分析为例，样本视频的预设标注信息可以包括但不限于：摔倒、正常行走、奔跑等标注信息；或者，以对视频进行分类为例，样本视频的预设标注信息可以包括但不限于：足球赛事视频、篮球赛事视频、滑雪赛事视频等标注信息。其他应用场景可以以此类推，在此不再一一举例。In the embodiment of the present application, the sample video includes preset annotation information. Taking behavior analysis of videos as an example, the preset annotation information of the sample video may include, but is not limited to, annotations such as falling, normal walking and running; or, taking video classification as an example, the preset annotation information of the sample video may include, but is not limited to, annotations such as football match video, basketball match video and ski event video. Other application scenarios can be deduced by analogy; no further examples are given here.
本申请实施例中,样本视频可以包括若干帧图像,例如,可以包括8帧图像,或者,也可以包括16帧图像,或者,还可以包括24帧图像,在此不做具体限定。In the embodiment of the present application, the sample video may include several frames of images, for example, may include 8 frames of images, or may also include 16 frames of images, or may also include 24 frames of images, which is not specifically limited here.
步骤S72:利用预设网络模型对样本视频进行特征提取,得到第一样本多维特征图。Step S72: Perform feature extraction on the sample video by using the preset network model to obtain the first sample multi-dimensional feature map.
在一个具体的实施场景中，为了进一步减少网络参数，降低处理负荷，从而提高处理速度，提高训练时收敛速度，避免过拟合，上述预设网络模型可以是二维神经网络模型，例如，ResNet-50、ResNet-101等等，在此不做具体限定。ResNet网络是由残差块（Residual Block）构建的，通过使用多个有参层来学习输入、输出之间的残差表示。In a specific implementation scenario, in order to further reduce network parameters and the processing load, thereby increasing the processing speed, speeding up convergence during training and avoiding over-fitting, the above preset network model may be a two-dimensional neural network model, for example ResNet-50 or ResNet-101, which is not specifically limited here. A ResNet is built from residual blocks, which use several parameterized layers to learn a residual representation between input and output.
本申请实施例中，第一样本多维特征图包含与样本视频对应的不同时序上的特征信息。请结合参阅图2，图2是视频分析处理过程一实施例的示意图。如图2所示，横坐标表示时序维度T上的不同时序，不同时序所对应的方格表示不同时序上的特征信息。在一个实施场景中，样本视频包括若干帧图像。为了降低对样本视频进行特征提取的处理负荷，提高视频分析的处理速度，可以通过预设网络模型分别对样本视频的若干帧图像进行特征提取，得到每一帧图像对应的特征图，从而直接将若干个特征图按照与其对应的图像在样本视频中的时序进行拼接，得到第一样本多维特征图。例如，样本视频包括8帧图像，则可以利用预设网络模型分别对这8帧图像进行特征提取，得到每一帧图像的特征图，从而直接将8张特征图按照与其对应的图像在样本视频中的时序进行拼接，得到第一样本多维特征图。In the embodiment of the present application, the first sample multi-dimensional feature map contains feature information at different time sequences corresponding to the sample video. Please refer to FIG. 2, a schematic diagram of an embodiment of the video analysis processing process. As shown in FIG. 2, the abscissa represents different time sequences in the time sequence dimension T, and the squares at the different time sequences represent the feature information at those time sequences. In an implementation scenario, the sample video includes several frames of images. To reduce the processing load of feature extraction on the sample video and increase the processing speed of video analysis, feature extraction may be performed separately on the several frames of the sample video through the preset network model to obtain a feature map corresponding to each frame, and the several feature maps are then directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map. For example, if the sample video includes 8 frames of images, the preset network model may be used to extract features from each of the 8 frames to obtain a feature map per frame, and the 8 feature maps are directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map.
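A minimal sketch of this per-frame extraction and splicing, assuming PyTorch; the toy backbone and input sizes are illustrative:

```python
import torch
import torch.nn as nn

def first_sample_feature_map(frames: torch.Tensor, backbone: nn.Module) -> torch.Tensor:
    """frames: (T, C, H, W). Returns (1, C', T, H', W'): per-frame feature maps
    spliced along a temporal dimension in frame order."""
    feats = [backbone(frame.unsqueeze(0)) for frame in frames]  # each (1, C', H', W')
    return torch.stack(feats, dim=2)                            # insert the T dimension

frames = torch.randn(8, 3, 32, 32)  # an 8-frame sample video
fmap = first_sample_feature_map(frames, nn.Conv2d(3, 16, 3, padding=1))
print(fmap.shape)  # torch.Size([1, 16, 8, 32, 32])
```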
步骤S73:利用偏移预测网络对第一样本多维特征图进行预测,得到偏移信息。Step S73: Use the offset prediction network to predict the multi-dimensional feature map of the first sample to obtain offset information.
偏移预测网络的网络结构具体可以参考前述实施例中的相关步骤,在此不再赘述。在一个实施场景中,还可以利用权重预测网络对第一样本多维特征图进行预测,得到权重信息,权重预测网络的网络结构可以参考前述实施例中的相关步骤,在此不再赘述。For the specific network structure of the offset prediction network, reference may be made to the relevant steps in the foregoing embodiment, which will not be repeated here. In an implementation scenario, the weight prediction network can also be used to predict the first sample multi-dimensional feature map to obtain weight information. For the network structure of the weight prediction network, refer to the relevant steps in the foregoing embodiment, which will not be repeated here.
步骤S74:利用偏移信息对第一样本多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的特征信息得到第二样本多维特征图。Step S74: Use the offset information to perform time sequence offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain the second sample multi-dimensional feature map based on the offset feature information.
利用偏移信息对第一样本多维特征图的至少部分特征信息进行时序偏移的具体实施步骤,可以参考前述实施例中的相关步骤,在此不再赘述。在一个实施场景中,还可以利用权重信息对偏移后的特征信息进行加权处理,并基于加权处理后的特征信息,得到第二样本多维特征图,具体可以参考前述实施例中的相关步骤,在此不再赘述。For the specific implementation steps of using the offset information to perform timing offset on at least part of the feature information of the first sample multi-dimensional feature map, reference may be made to the relevant steps in the foregoing embodiment, which will not be repeated here. In an implementation scenario, the weight information can also be used to weight the offset feature information, and based on the weighted feature information, the second sample multi-dimensional feature map can be obtained. For details, please refer to the relevant steps in the foregoing embodiment. I won't repeat them here.
在一个实施场景中，预设网络模型可以包括至少一个卷积层，则可以利用预设网络模型的一个卷积层对样本视频进行特征提取，得到第一样本多维特征图。在一个具体的实施场景中，预设网络模型的卷积层的数量可以多于1个，则可以利用预设网络模型中未执行特征提取的卷积层对第二样本多维特征图进行特征提取，得到新的第一样本多维特征图，并执行利用偏移预测网络对新的第一样本多维特征图进行预测，得到偏移信息的步骤以及后续步骤，从而得到新的第二样本多维特征图，进而重复执行上述步骤，直至预设网络模型的所有卷积层均完成对新的第二样本多维特征图的特征提取步骤。In an implementation scenario, the preset network model may include at least one convolutional layer; one convolutional layer of the preset network model is then used to extract features from the sample video to obtain the first sample multi-dimensional feature map. In a specific implementation scenario, where the preset network model has more than one convolutional layer, a convolutional layer of the preset network model that has not yet performed feature extraction may be used to extract features from the second sample multi-dimensional feature map to obtain a new first sample multi-dimensional feature map; the step of predicting the new first sample multi-dimensional feature map with the offset prediction network to obtain offset information, and the subsequent steps, are then performed to obtain a new second sample multi-dimensional feature map. The above steps are repeated until all convolutional layers of the preset network model have completed the feature extraction step on the new second sample multi-dimensional feature map.
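The layer-by-layer loop can be sketched as follows (illustrative names; each shift module stands for the offset prediction, shifting and weighting described above):

```python
def forward_with_shifts(x, conv_stages, shift_modules):
    """Each convolutional stage yields a new first (sample) multi-dimensional
    feature map; offset prediction and temporal shifting turn it into a new
    second feature map before the next stage runs."""
    for stage, shift in zip(conv_stages, shift_modules):
        x = shift(stage(x))
    return x
```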
步骤S75:利用预设网络模型对第二样本多维特征图进行分析,得到样本视频的分析结果信息。Step S75: Use the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video.
具体地，可以利用预设网络模型的全连接层对第二样本多维特征图进行分析，得到样本视频的分析结果信息。在一个实施场景中，可以利用预设网络模型的全连接层对第二样本多维特征图进行特征连接，利用预设网络模型的softmax层进行回归，从而得到样本视频属于各个类别（如，足球赛事视频、滑雪赛事视频等）的概率值，或者得到样本视频属于各种行为（如，摔倒、正常行走、奔跑等）的概率值，其他应用场景中，可以以此类推，在此不再一一举例。Specifically, the fully connected layer of the preset network model may be used to analyze the second sample multi-dimensional feature map to obtain the analysis result information of the sample video. In an implementation scenario, the fully connected layer of the preset network model may be used to perform feature connection on the second sample multi-dimensional feature map, and the softmax layer of the preset network model may be used for regression, yielding the probability that the sample video belongs to each category (e.g. football match video, ski event video) or exhibits each behavior (e.g. falling, normal walking, running). Other application scenarios can be deduced by analogy; no further examples are given here.
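A hedged sketch of this classification head, assuming PyTorch; the feature sizes and the number of classes are illustrative:

```python
import torch
import torch.nn as nn

feature_map = torch.randn(1, 2048, 8, 7, 7)  # (N, C, T, H, W), sizes illustrative
pooled = feature_map.mean(dim=(2, 3, 4))     # global pooling -> (N, 2048)
fc = nn.Linear(2048, 4)                      # fully connected layer, e.g. 4 behaviors
probs = torch.softmax(fc(pooled), dim=1)     # softmax regression: class probabilities
print(probs.sum())                           # sums to 1 up to rounding
```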
步骤S76:利用预设标注信息和分析结果信息计算损失值。Step S76: Calculate the loss value by using the preset label information and the analysis result information.
具体地，可以利用均方误差（Mean Square Error）损失函数，或者交叉熵损失函数对预设标注信息和分析结果信息进行损失值计算，在此不做限定。Specifically, a mean square error (MSE) loss function or a cross-entropy loss function may be used to compute the loss value from the preset annotation information and the analysis result information, which is not limited here.
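For example, with PyTorch's cross-entropy loss (the logits and labels below are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 4)               # analysis result information for 2 sample videos
labels = torch.tensor([0, 3])            # class indices from the preset annotation information
loss = F.cross_entropy(logits, labels)   # the loss value
```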
步骤S77:基于损失值,调整预设网络模型和偏移预测网络的参数。Step S77: Adjust the parameters of the preset network model and the offset prediction network based on the loss value.
在一个实施场景中，如前述步骤，还可以利用权重预测网络对第一样本多维特征图进行预测，得到权重信息，从而利用权重信息对偏移后的特征信息进行加权处理，并基于加权处理后的特征信息，得到第二样本多维特征图；基于损失值，还可以调整预设网络模型和偏移预测网络、权重预测网络的参数。具体地，可以调整预设网络模型中的卷积层、全连接层的参数，并调整偏移预测网络中的卷积层、全连接层的参数，并调整权重预测网络中的卷积层的参数。具体地，可以采用梯度下降法来调整参数，例如批量梯度下降法、随机梯度下降法。In an implementation scenario, as in the foregoing steps, the weight prediction network may also be used to predict the first sample multi-dimensional feature map to obtain weight information; the weight information is used to weight the shifted feature information, and the second sample multi-dimensional feature map is obtained based on the weighted feature information. Based on the loss value, the parameters of the preset network model, the offset prediction network and the weight prediction network can then all be adjusted. Specifically, the parameters of the convolutional layers and fully connected layers in the preset network model, the parameters of the convolutional layers and fully connected layers in the offset prediction network, and the parameters of the convolutional layer in the weight prediction network may be adjusted. A gradient descent method, such as batch gradient descent or stochastic gradient descent, may be used to adjust the parameters.
在一个实施场景中，在调整参数之后，还可以重新执行上述步骤S72以及后续步骤，直至计算得到的损失值满足预设训练结束条件为止。具体地，预设训练结束条件可以包括：损失值小于一预设损失阈值，且损失值不再减小；或者，参数调整次数达到预设次数阈值；或者，利用测试视频测试网络性能达到预设要求（如，准确率达到一预设准确率阈值）。In an implementation scenario, after the parameters are adjusted, the above step S72 and the subsequent steps may be executed again until the computed loss value satisfies a preset training end condition. Specifically, the preset training end condition may include: the loss value is less than a preset loss threshold and no longer decreases; or the number of parameter adjustments reaches a preset count threshold; or the network performance, tested on test videos, meets a preset requirement (e.g. the accuracy reaches a preset accuracy threshold).
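A minimal sketch of this adjust-and-repeat loop, assuming PyTorch; the stand-in modules, learning rate, adjustment cap and loss threshold are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins for the preset network model and the two prediction branches.
model = nn.Linear(16, 4)
offset_net = nn.Linear(16, 8)
weight_net = nn.Linear(16, 8)
data = torch.randn(32, 16)
labels = torch.randint(0, 4, (32,))

params = (list(model.parameters()) + list(offset_net.parameters())
          + list(weight_net.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01)      # (stochastic) gradient descent

for step in range(100):                           # preset adjustment-count cap
    logits = model(data)                          # stands in for steps S72-S75
    loss = F.cross_entropy(logits, labels)        # step S76: compute the loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # step S77: adjust the parameters
    if loss.item() < 0.05:                        # preset training end condition
        break
```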
采用本申请实施例的技术方案，通过对样本视频进行特征提取，得到第一样本多维特征图，且第一样本多维特征图包含与样本视频对应的不同时序上的特征信息，并利用偏移预测网络对第一样本多维特征图进行预测，得到偏移信息，从而利用偏移信息对第一样本多维特征图的至少部分特征信息进行时序偏移，并基于偏移后的特征信息得到第二样本多维特征图，进而能够直接对样本视频的时序信息进行建模，有利于提高模型训练时的速度，且通过时序偏移，能够使空间信息和时序信息联合交错，故在此基础上进行分析处理，有利于后续提高视频分析的准确度。With the technical solution of the embodiment of the present application, feature extraction is performed on the sample video to obtain the first sample multi-dimensional feature map, which contains feature information at different time sequences corresponding to the sample video; the offset prediction network is used to predict the first sample multi-dimensional feature map to obtain offset information; the offset information is used to time-shift at least part of the feature information of the first sample multi-dimensional feature map, and the second sample multi-dimensional feature map is obtained based on the shifted feature information. The temporal information of the sample video can thus be modeled directly, which helps increase the training speed, and through the time sequence shift the spatial and temporal information are jointly interleaved, so performing analysis on this basis helps improve the accuracy of subsequent video analysis.
请结合参阅图8,图8是本申请视频分析装置80一实施例的框架示意图。视频分析装置80包括视频获取模块81、特征提取模块82、偏移预测模块83、偏移处理模块84和网络分析模块85;其中,Please refer to FIG. 8. FIG. 8 is a schematic diagram of a framework of an embodiment of a video analysis device 80 of the present application. The video analysis device 80 includes a video acquisition module 81, a feature extraction module 82, an offset prediction module 83, an offset processing module 84, and a network analysis module 85; among them,
视频获取模块81,配置为获取待分析视频;The video acquisition module 81 is configured to acquire the video to be analyzed;
特征提取模块82,配置为利用预设网络模型对待分析视频进行特征提取,得到第一多维特征图,其中,第一多维特征图包含与待分析视频对应的不同时序上的特征信息;The feature extraction module 82 is configured to perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different timings corresponding to the video to be analyzed;
偏移预测模块83,配置为利用偏移预测网络对第一多维特征图进行预测,得到偏移信息;The offset prediction module 83 is configured to use an offset prediction network to predict the first multi-dimensional feature map to obtain offset information;
偏移处理模块84,配置为利用偏移信息对第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的特征信息得到第二多维特征图;The offset processing module 84 is configured to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information;
网络分析模块85,配置为利用预设网络模型对第二多维特征图进行分析,得到待分析视频的分析结果信息。The network analysis module 85 is configured to analyze the second multi-dimensional feature map by using a preset network model to obtain analysis result information of the video to be analyzed.
本申请实施例的技术方案，通过预设网络模型对待分析视频进行处理，有利于提高视频分析的处理速度，且通过时序偏移，能够使空间信息和时序信息联合交错，故在此基础上进行分析处理，有利于提高视频分析的准确度。With the technical solution of the embodiment of the present application, the video to be analyzed is processed through the preset network model, which is beneficial to increasing the processing speed of video analysis, and through the time sequence shift the spatial and temporal information can be jointly interleaved, so performing analysis on this basis is beneficial to improving the accuracy of video analysis.
在一些实施例中,视频分析装置80还包括权重预测模块,配置为利用权重预测网络对第一多维特征图进行预测,得到权重信息;In some embodiments, the video analysis device 80 further includes a weight prediction module configured to use a weight prediction network to predict the first multi-dimensional feature map to obtain weight information;
偏移处理模块84,配置为利用偏移信息对第一多维特征图的至少部分特征信息进行时序偏移;利用权重信息对偏移后的特征信息进行加权处理;基于加权处理后的特征信息,得到第二多维特征图。The offset processing module 84 is configured to use the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map; use the weight information to perform weighting processing on the offset feature information; based on the weighted feature information , Get the second multi-dimensional feature map.
在一些实施例中，第一多维特征图的维度包括时序维度和预设维度，偏移处理模块84，配置为按照预设维度从第一多维特征图中选择至少一组特征信息，其中，每组特征信息包括同一预设维度上对应不同时序的特征信息，利用偏移信息对至少一组特征信息在时序维度上进行偏移。In some embodiments, the dimensions of the first multi-dimensional feature map include a time sequence dimension and a preset dimension, and the offset processing module 84 is configured to select at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, where each group of feature information includes feature information corresponding to different time sequences in the same preset dimension, and to use the offset information to shift the at least one group of feature information in the time sequence dimension.
在一些实施例中，预设维度为通道维度；和/或，偏移信息包括第一数量个偏移值，至少一组特征信息包括第一数量组第一特征信息，偏移处理模块84，配置为利用偏移信息中第i个偏移值对第i组第一特征信息在时序维度上进行偏移，得第i组第二特征信息，其中，i为小于或等于第一数量的正整数。In some embodiments, the preset dimension is the channel dimension; and/or the offset information includes a first number of offset values, the at least one group of feature information includes a first number of groups of first feature information, and the offset processing module 84 is configured to use the i-th offset value in the offset information to shift the i-th group of first feature information in the time sequence dimension to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
在一些实施例中，偏移处理模块84，配置为获取第i个偏移值所属的数值范围，且数值范围的上限值与下限值之差为一预设数值，时序偏移处理单元包括时序偏移处理子单元，用于将第i组第一特征信息沿时序维度偏移上限值个时序单位，得到第i组第三特征信息，并将第i组第一特征信息沿时序维度偏移下限值个时序单位，得到第i组第四特征信息；以第i个偏移值与下限值之间的差作为权重对第i组第三特征信息进行加权处理，得到第i组第一加权结果，并以上限值与第i个偏移值之间的差作为权重对第i组第四特征信息进行加权处理，得到第i组第二加权结果；计算第i组第一加权结果和第i组第二加权结果之间的和，以作为第i组第二特征信息。In some embodiments, the offset processing module 84 is configured to obtain the value range to which the i-th offset value belongs, the difference between the upper limit and the lower limit of the value range being a preset value, and includes a time sequence offset processing unit with a time sequence offset processing sub-unit configured to: shift the i-th group of first feature information along the time sequence dimension by the upper-limit number of time sequence units to obtain the i-th group of third feature information, and shift the i-th group of first feature information along the time sequence dimension by the lower-limit number of time sequence units to obtain the i-th group of fourth feature information; weight the i-th group of third feature information with the difference between the i-th offset value and the lower limit as the weight to obtain the i-th group of first weighted results, and weight the i-th group of fourth feature information with the difference between the upper limit and the i-th offset value as the weight to obtain the i-th group of second weighted results; and compute the sum of the i-th group of first weighted results and the i-th group of second weighted results as the i-th group of second feature information.
在一些实施例中，待分析视频包括第二数量帧图像，权重信息包括第二数量个权重值，偏移处理模块84，配置为对偏移后的每组特征信息，分别利用权重信息中第j个权重值对当前组特征信息中的第j个时序对应的特征值进行加权处理，得到加权处理后的对应组特征信息；其中，j为小于或等于第二数量的正整数。In some embodiments, the video to be analyzed includes a second number of frame images, the weight information includes a second number of weight values, and the offset processing module 84 is configured to, for each shifted group of feature information, use the j-th weight value in the weight information to weight the feature value corresponding to the j-th time sequence in the current group of feature information, obtaining the weighted corresponding group of feature information, where j is a positive integer less than or equal to the second number.
在一些实施例中,偏移处理模块84,配置为利用加权处理后的特征信息以及第一多维特征图中未被偏移的特征信息,组成第二多维特征图。In some embodiments, the offset processing module 84 is configured to use the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map to form a second multi-dimensional feature map.
在一些实施例中，权重预测模块，配置为利用权重预测网络的第一降采样层对第一多维特征图进行降采样，得到第一降采样结果；利用权重预测网络的第一卷积层对第一降采样结果进行卷积处理，得到第一特征提取结果；利用权重预测网络的第一激活层对第一特征提取结果进行非线性处理，得到权重信息。In some embodiments, the weight prediction module is configured to: down-sample the first multi-dimensional feature map using the first down-sampling layer of the weight prediction network to obtain a first down-sampling result; perform convolution processing on the first down-sampling result using the first convolutional layer of the weight prediction network to obtain a first feature extraction result; and perform non-linear processing on the first feature extraction result using the first activation layer of the weight prediction network to obtain the weight information.
在一些实施例中，偏移预测模块83，配置为利用偏移预测网络的第二降采样层对第一多维特征图进行降采样，得到第二降采样结果；利用偏移预测网络的第二卷积层对第二降采样结果进行卷积处理，得到第二特征提取结果；利用偏移预测网络的第一全连接层对第二特征提取结果进行特征连接，得到第一特征连接结果；利用偏移预测网络的第二激活层对第一特征连接结果进行非线性处理，得到非线性处理结果；利用偏移预测网络的第二全连接层对非线性处理结果进行特征连接，得到第二特征连接结果；利用偏移预测网络的第三激活层对第二特征连接结果进行非线性处理，得到偏移信息。In some embodiments, the offset prediction module 83 is configured to: down-sample the first multi-dimensional feature map using the second down-sampling layer of the offset prediction network to obtain a second down-sampling result; perform convolution processing on the second down-sampling result using the second convolutional layer of the offset prediction network to obtain a second feature extraction result; perform feature connection on the second feature extraction result using the first fully connected layer of the offset prediction network to obtain a first feature connection result; perform non-linear processing on the first feature connection result using the second activation layer of the offset prediction network to obtain a non-linear processing result; perform feature connection on the non-linear processing result using the second fully connected layer of the offset prediction network to obtain a second feature connection result; and perform non-linear processing on the second feature connection result using the third activation layer of the offset prediction network to obtain the offset information.
在一些实施例中，预设网络模型包括至少一个卷积层，特征提取模块82，配置为利用预设网络模型的卷积层对待分析视频进行特征提取，得到第一多维特征图；还配置为若预设网络模型的卷积层的数量多于1，利用预设网络模型中未执行特征提取的卷积层对第二多维特征图进行特征提取，得到新的第一多维特征图；In some embodiments, the preset network model includes at least one convolutional layer, and the feature extraction module 82 is configured to use a convolutional layer of the preset network model to extract features from the video to be analyzed to obtain the first multi-dimensional feature map, and, if the preset network model has more than one convolutional layer, to use a convolutional layer of the preset network model that has not yet performed feature extraction to extract features from the second multi-dimensional feature map to obtain a new first multi-dimensional feature map;
偏移预测模块83,还配置为利用偏移预测网络对新的第一多维特征图进行预测,得到新的偏移信息;The offset prediction module 83 is further configured to use the offset prediction network to predict the new first multi-dimensional feature map to obtain new offset information;
偏移处理模块84,还配置为利用新的偏移信息对第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的特征信息得到新的第二多维特征图;The offset processing module 84 is further configured to use the new offset information to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a new second multi-dimensional feature map based on the offset feature information;
网络分析模块85,配置为利用预设网络模型的全连接层对新的第二多维特征图进行分析,得到待分析视频的分析结果信息。The network analysis module 85 is configured to analyze the new second multi-dimensional feature map by using the fully connected layer of the preset network model to obtain analysis result information of the video to be analyzed.
在一些实施例中，待分析视频包括若干帧图像，特征提取模块82，配置为利用预设网络模型分别对若干帧图像进行特征提取，得到与每一帧图像对应的特征图；将若干个特征图按照与其对应的图像在待分析视频中的时序进行拼接，得到第一多维特征图。In some embodiments, the video to be analyzed includes several frames of images, and the feature extraction module 82 is configured to: use the preset network model to extract features from the several frames of images separately, obtaining a feature map corresponding to each frame of image; and splice the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
请参阅图9，图9是本申请用于视频分析的模型训练装置90一实施例的框架示意图。用于视频分析的模型训练装置90包括视频获取模块91、特征提取模块92、偏移预测模块93、偏移处理模块94、网络分析模块95、损失计算模块96和参数调整模块97；其中，Please refer to FIG. 9, which is a schematic diagram of a framework of an embodiment of a model training device 90 for video analysis according to the present application. The model training device 90 for video analysis includes a video acquisition module 91, a feature extraction module 92, an offset prediction module 93, an offset processing module 94, a network analysis module 95, a loss calculation module 96 and a parameter adjustment module 97, where:
视频获取模块91,配置为获取样本视频,其中,样本视频包括预设标注信息;The video acquisition module 91 is configured to acquire a sample video, where the sample video includes preset annotation information;
特征提取模块92,配置为利用预设网络模型对样本视频进行特征提取,得到第一样本多维特征图,其中,第一样本多维特征图包含与样本视频对应的不同时序上的特征信息;The feature extraction module 92 is configured to perform feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information corresponding to the sample video at different timings;
偏移预测模块93,配置为利用偏移预测网络对第一样本多维特征图进行预测,得到偏移信息;The offset prediction module 93 is configured to use the offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information;
偏移处理模块94,配置为利用偏移信息对第一样本多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的特征信息得到第二样本多维特征图;The offset processing module 94 is configured to use the offset information to perform timing offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain a second sample multi-dimensional feature map based on the offset feature information;
网络分析模块95,配置为利用预设网络模型对第二样本多维特征图进行分析,得到样本视频的分析结果信息;The network analysis module 95 is configured to analyze the second sample multi-dimensional feature map by using a preset network model to obtain analysis result information of the sample video;
损失计算模块96,配置为利用预设标注信息和分析结果信息计算损失值;The loss calculation module 96 is configured to calculate a loss value using preset label information and analysis result information;
参数调整模块97,配置为基于损失值,调整预设网络模型和偏移预测网络的参数。The parameter adjustment module 97 is configured to adjust the parameters of the preset network model and the offset prediction network based on the loss value.
通过上述方案，能够直接对样本视频的时序信息进行建模，有利于提高模型训练时的速度，且通过时序偏移，能够使空间信息和时序信息联合交错，故在此基础上进行分析处理，有利于后续提高视频分析的准确度。Through the above solution, the temporal information of the sample video can be modeled directly, which helps increase the training speed, and through the time sequence shift the spatial and temporal information can be jointly interleaved, so performing analysis on this basis is beneficial to improving the accuracy of subsequent video analysis.
在一些实施例中，用于视频分析的模型训练装置90还可以进一步包括其他模块，以执行上述用于视频分析的模型训练方法实施例中的相关步骤，具体可以参考上述视频分析装置实施例中的相关模块，在此不再赘述。In some embodiments, the model training device 90 for video analysis may further include other modules to execute the relevant steps in the above embodiments of the model training method for video analysis; for details, reference may be made to the related modules in the above video analysis device embodiments, which will not be repeated here.
请参阅图10，图10是本申请电子设备100一实施例的框架示意图。电子设备100包括相互耦接的存储器101和处理器102，处理器102用于执行存储器101中存储的程序指令，以实现上述任一视频分析方法实施例的步骤，或实现上述任一用于视频分析的模型训练方法实施例中的步骤。在一个具体的实施场景中，电子设备100可以包括但不限于：微型计算机、服务器，此外，电子设备100还可以包括笔记本电脑、平板电脑等移动设备，在此不做限定。Please refer to FIG. 10, which is a schematic diagram of a framework of an embodiment of the electronic device 100 of the present application. The electronic device 100 includes a memory 101 and a processor 102 coupled to each other, and the processor 102 is configured to execute program instructions stored in the memory 101 to implement the steps of any of the above video analysis method embodiments, or the steps of any of the above embodiments of the model training method for video analysis. In a specific implementation scenario, the electronic device 100 may include, but is not limited to, a microcomputer or a server; in addition, the electronic device 100 may also include mobile devices such as a notebook computer or a tablet computer, which is not limited here.
具体而言,处理器102用于控制其自身以及存储器101以实现上述任一视频分析方法实施例的步骤,或实现上述任一用于视频分析的模型训练方法实施例中的步骤。处理器102还可以称为中央处理单元(Central Processing Unit,CPU)。处理器102可能是一种集成电路芯片,具有信号的处理能力。处理器102还可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。另外,处理器102可以由集成电路芯片共同实现。Specifically, the processor 102 is configured to control itself and the memory 101 to implement the steps in any of the foregoing video analysis method embodiments, or implement the steps in any of the foregoing model training method embodiments for video analysis. The processor 102 may also be referred to as a central processing unit (Central Processing Unit, CPU). The processor 102 may be an integrated circuit chip with signal processing capability. The processor 102 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA), or other Programmable logic devices, discrete gates or transistor logic devices, discrete hardware components. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like. In addition, the processor 102 may be jointly implemented by an integrated circuit chip.
请参阅图11,图11为本申请计算机可读存储介质110一实施例的框架示意图。计算机可读存储介质110存储有能够被处理器运行的程序指令1101,程序指令1101用于实现上述任一视频分析方法实施例的步骤,或实现上述任一用于视频分析的模型训练方法实施例中的步骤。该计算机可读存储介质可以是易失性或非易失性存储介质。Please refer to FIG. 11, which is a schematic diagram of a framework of an embodiment of the computer-readable storage medium 110 of this application. The computer-readable storage medium 110 stores program instructions 1101 that can be executed by a processor, and the program instructions 1101 are used to implement the steps of any of the foregoing video analysis method embodiments, or implement any of the foregoing model training method embodiments for video analysis. Steps in. The computer-readable storage medium may be a volatile or non-volatile storage medium.
本申请实施例还提供一种计算机程序，包括计算机可读代码，当所述计算机可读代码在电子设备中运行时，所述电子设备中的处理器执行用于实现上述任一视频分析方法实施例的步骤，或实现上述任一用于视频分析的模型训练方法实施例中的步骤。An embodiment of the present application further provides a computer program, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the steps of any of the above video analysis method embodiments, or the steps of any of the above embodiments of the model training method for video analysis.
在本申请所提供的几个实施例中,应该理解到,所揭露的方法和装置,可以通过其它的方式实现。例如,以上所描述的装置实施方式仅仅是示意性的,例如,模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性、机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed method and device can be implemented in other ways. For example, the device implementation described above is only illustrative, for example, the division of modules or units is only a logical function division, and there may be other divisions in actual implementation, for example, units or components can be combined or integrated. To another system, or some features can be ignored, or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施方式方案的目的。The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施方式方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , Including several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) or a processor execute all or part of the steps of the methods in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program code .

Claims (27)

  1. 一种视频分析方法,包括:A video analysis method, including:
    获取待分析视频;Get the video to be analyzed;
    利用预设网络模型对所述待分析视频进行特征提取,得到第一多维特征图,其中,所述第一多维特征图包含与所述待分析视频对应的不同时序上的特征信息;Performing feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map includes feature information in different time series corresponding to the video to be analyzed;
    利用偏移预测网络对所述第一多维特征图进行预测,得到偏移信息;Predicting the first multi-dimensional feature map by using an offset prediction network to obtain offset information;
    利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的所述特征信息得到第二多维特征图;Using the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a second multi-dimensional feature map based on the offset feature information;
    利用所述预设网络模型对所述第二多维特征图进行分析,得到所述待分析视频的分析结果信息。The second multi-dimensional feature map is analyzed by using the preset network model to obtain analysis result information of the video to be analyzed.
  2. 根据权利要求1所述的视频分析方法,其中,在所述利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的所述特征信息得到第二多维特征图之前,所述方法还包括:The video analysis method according to claim 1, wherein, in the step of using the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map, and based on the offset feature Before the information obtains the second multi-dimensional feature map, the method further includes:
    利用权重预测网络对所述第一多维特征图进行预测,得到权重信息;Predicting the first multi-dimensional feature map by using a weight prediction network to obtain weight information;
    所述利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的所述特征信息得到第二多维特征图,包括:The step of using the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map, and obtaining a second multi-dimensional feature map based on the offset feature information, includes:
    利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移;Using the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map;
    利用所述权重信息对偏移后的所述特征信息进行加权处理;Performing weighting processing on the offset feature information by using the weight information;
    基于所述加权处理后的所述特征信息,得到第二多维特征图。Based on the feature information after the weighting process, a second multi-dimensional feature map is obtained.
  3. 根据权利要求1或2所述的视频分析方法,其中,所述第一多维特征图的维度包括时序维度和预设维度;The video analysis method according to claim 1 or 2, wherein the dimensions of the first multi-dimensional feature map include time series dimensions and preset dimensions;
    所述利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移,包括:The using the offset information to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map includes:
    按照预设维度从第一多维特征图中选择至少一组特征信息,其中,每组特征信息包括同一预设维度上对应不同时序的特征信息;Selecting at least one set of feature information from the first multi-dimensional feature map according to a preset dimension, where each set of feature information includes feature information corresponding to different time series in the same preset dimension;
    利用所述偏移信息对所述至少一组特征信息在时序维度上进行偏移。The offset information is used to offset the at least one set of feature information in a time series dimension.
  4. 根据权利要求3所述的视频分析方法,其中,所述预设维度为通道维度;和/或,The video analysis method according to claim 3, wherein the preset dimension is a channel dimension; and/or,
    所述偏移信息包括第一数量个偏移值,所述至少一组特征信息包括第一数量组第一特征信息;The offset information includes a first number of offset values, and the at least one set of characteristic information includes a first number of sets of first characteristic information;
    所述利用所述偏移信息对所述至少一组特征信息在时序维度上进行偏移包括:The using the offset information to offset the at least one set of feature information in a time series dimension includes:
    利用所述偏移信息中第i个所述偏移值对第i组所述第一特征信息在所述时序维度上进行偏移，得到第i组第二特征信息，其中，所述i为小于或等于所述第一数量的正整数。Using the i-th offset value in the offset information to offset the i-th group of the first feature information in the time sequence dimension to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
  5. 根据权利要求4所述的视频分析方法，其中，所述利用所述偏移信息中第i个所述偏移值对第i组所述第一特征信息在所述时序维度上进行偏移，得到第i组第二特征信息，包括：The video analysis method according to claim 4, wherein the using the i-th offset value in the offset information to offset the i-th group of the first feature information in the time sequence dimension to obtain the i-th group of second feature information includes:
    获取第i个所述偏移值所属的数值范围,且所述数值范围的上限值与下限值之差为一预设数值;Acquiring the numerical range to which the i-th said offset value belongs, and the difference between the upper limit and the lower limit of the numerical range is a preset value;
    将第i组所述第一特征信息沿所述时序维度偏移所述上限值个时序单位，得到第i组第三特征信息，并将第i组所述第一特征信息沿所述时序维度偏移所述下限值个时序单位，得到第i组第四特征信息；shifting the i-th group of the first feature information along the time sequence dimension by the upper-limit number of time sequence units to obtain the i-th group of third feature information, and shifting the i-th group of the first feature information along the time sequence dimension by the lower-limit number of time sequence units to obtain the i-th group of fourth feature information;
    以第i个所述偏移值与所述下限值之间的差作为权重对第i组所述第三特征信息进行加权处理，得到第i组第一加权结果，并以所述上限值与所述第i个偏移值之间的差作为权重对第i组所述第四特征信息进行加权处理，得到第i组第二加权结果；weighting the i-th group of the third feature information with the difference between the i-th offset value and the lower limit as the weight to obtain the i-th group of first weighted results, and weighting the i-th group of the fourth feature information with the difference between the upper limit and the i-th offset value as the weight to obtain the i-th group of second weighted results;
    计算所述第i组第一加权结果和第i组第二加权结果之间的和,以作为第i组所述第二特征信息。The sum between the first weighted result of the i-th group and the second weighted result of the i-th group is calculated as the second feature information of the i-th group.
  6. 根据权利要求3所述的视频分析方法，其中，所述待分析视频包括第二数量帧图像，所述权重信息包括所述第二数量个权重值；The video analysis method according to claim 3, wherein the video to be analyzed includes a second number of frame images, and the weight information includes the second number of weight values;
    所述利用所述权重信息对偏移后的所述特征信息进行加权处理,包括:The using the weight information to perform weighting processing on the offset feature information includes:
    对偏移后的每组特征信息，分别利用所述权重信息中第j个权重值对当前组特征信息中的第j个时序对应的特征值进行加权处理，得到加权处理后的对应组特征信息；For each shifted group of feature information, using the j-th weight value in the weight information to weight the feature value corresponding to the j-th time sequence in the current group of feature information, to obtain the weighted corresponding group of feature information;
    其中,所述j为小于或等于所述第二数量的正整数。Wherein, the j is a positive integer less than or equal to the second number.
  7. 根据权利要求2至6任一项所述的视频分析方法,其中,所述基于所述加权处理后的所述特征信息,得到第二多维特征图,包括:The video analysis method according to any one of claims 2 to 6, wherein the obtaining a second multi-dimensional feature map based on the feature information after the weighting process comprises:
    利用所述加权处理后的所述特征信息以及所述第一多维特征图中未被偏移的特征信息，组成所述第二多维特征图。The feature information after the weighting process and the feature information that is not shifted in the first multi-dimensional feature map are used to form the second multi-dimensional feature map.
  8. 根据权利要求2至6任一项所述的视频分析方法,其中,所述利用权重预测网络对所述第一多维特征图进行预测,得到权重信息,包括:The video analysis method according to any one of claims 2 to 6, wherein the using a weight prediction network to predict the first multi-dimensional feature map to obtain weight information includes:
    利用所述权重预测网络的第一降采样层对所述第一多维特征图进行降采样,得到第一降采样结果;Down-sampling the first multi-dimensional feature map by using the first down-sampling layer of the weight prediction network to obtain a first down-sampling result;
    利用所述权重预测网络的第一卷积层对所述第一降采样结果进行卷积处理,得到第一特征提取结果;Using the first convolutional layer of the weight prediction network to perform convolution processing on the first down-sampling result to obtain a first feature extraction result;
    利用所述权重预测网络的第一激活层对所述第一特征提取结果进行非线性处理,得到所述权重信息。The first activation layer of the weight prediction network is used to perform non-linear processing on the first feature extraction result to obtain the weight information.
  9. The video analysis method according to any one of claims 1 to 6, wherein the using an offset prediction network to predict the first multi-dimensional feature map to obtain offset information includes:
    down-sampling the first multi-dimensional feature map by using a second down-sampling layer of the offset prediction network to obtain a second down-sampling result;
    performing convolution processing on the second down-sampling result by using a second convolutional layer of the offset prediction network to obtain a second feature extraction result;
    performing feature connection on the second feature extraction result by using a first fully connected layer of the offset prediction network to obtain a first feature connection result;
    performing non-linear processing on the first feature connection result by using a second activation layer of the offset prediction network to obtain a non-linear processing result;
    performing feature connection on the non-linear processing result by using a second fully connected layer of the offset prediction network to obtain a second feature connection result;
    performing non-linear processing on the second feature connection result by using a third activation layer of the offset prediction network to obtain the offset information.
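The offset prediction network can be sketched the same way. Only the six-stage order (down-sample, convolution, fully connected, activation, fully connected, activation) comes from the claim; the hidden width, the ReLU, and the scaled tanh that keeps the predicted offsets in a bounded range are assumptions.

```python
import torch
import torch.nn as nn

class OffsetPredictor(nn.Module):
    """Illustrative offset prediction network; all sizes are assumptions."""
    def __init__(self, channels: int, frames: int, num_groups: int, max_offset: float = 2.0):
        super().__init__()
        self.max_offset = max_offset
        self.down = nn.AdaptiveAvgPool3d((None, 1, 1))                       # second down-sampling layer
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # second convolutional layer
        self.fc1 = nn.Linear(channels * frames, 256)                         # first fully connected layer
        self.act1 = nn.ReLU()                                                # second activation layer
        self.fc2 = nn.Linear(256, num_groups)                                # second fully connected layer
        self.act2 = nn.Tanh()                                                # third activation layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) -> one offset value per shifted channel group
        d = self.down(x).flatten(2)                      # (N, C, T) second down-sampling result
        f = self.conv(d).flatten(1)                      # (N, C*T) second feature extraction result
        h = self.act1(self.fc1(f))                       # feature connection + non-linear processing
        return self.max_offset * self.act2(self.fc2(h))  # (N, num_groups) bounded offset information
```

Bounding the output with a scaled tanh is one way to guarantee that every predicted offset stays inside a known numerical range, which is what the interpolation of claim 5 relies on.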
  10. The video analysis method according to any one of claims 1 to 6, wherein the preset network model includes at least one convolutional layer, and the using the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map includes:
    performing feature extraction on the video to be analyzed by using a convolutional layer of the preset network model to obtain the first multi-dimensional feature map;
    in a case where the number of convolutional layers of the preset network model is more than one, after the obtaining the second multi-dimensional feature map and before the analyzing the second multi-dimensional feature map by using the preset network model to obtain the analysis result information of the video to be analyzed, the method further includes:
    performing feature extraction on the second multi-dimensional feature map by using a convolutional layer of the preset network model that has not yet performed feature extraction, to obtain a new first multi-dimensional feature map;
    performing the step of predicting the new first multi-dimensional feature map by using the offset prediction network to obtain offset information, and the subsequent steps, to obtain a new second multi-dimensional feature map;
    repeating the above steps until all convolutional layers of the preset network model have completed feature extraction on the new second multi-dimensional feature map;
    the analyzing the second multi-dimensional feature map by using the preset network model to obtain the analysis result information of the video to be analyzed includes:
    analyzing the second multi-dimensional feature map by using a fully connected layer of the preset network model to obtain the analysis result information of the video to be analyzed.
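The control flow of this claim is a per-stage loop: each convolutional layer produces a new "first" feature map, an offset module turns it into a new "second" feature map, and a fully connected head analyzes the last one. The sketch below assumes the shift modules defined earlier; the pooling before the head and all names are illustrative.

```python
import torch
import torch.nn as nn

class VideoAnalysisModel(nn.Module):
    """Illustrative pipeline: conv stage -> offset shift, repeated, then an FC head."""
    def __init__(self, conv_stages: nn.ModuleList,
                 shift_modules: nn.ModuleList,
                 head: nn.Linear):
        super().__init__()
        self.conv_stages = conv_stages      # convolutional layers of the preset model
        self.shift_modules = shift_modules  # one offset-predict-and-shift module per stage
        self.head = head                    # fully connected analysis layer

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        x = video                           # (N, C, T, H, W)
        for conv, shift in zip(self.conv_stages, self.shift_modules):
            x = conv(x)     # new "first" multi-dimensional feature map
            x = shift(x)    # new "second" multi-dimensional feature map
        x = x.mean(dim=(2, 3, 4))           # pool time and space before the head (an assumption)
        return self.head(x)                 # analysis result information
```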
  11. The video analysis method according to any one of claims 1 to 6, wherein the video to be analyzed includes several frames of images, and the using a preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map includes:
    performing feature extraction on each of the several frames of images by using the preset network model to obtain a feature map corresponding to each frame of image;
    splicing the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
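A minimal sketch of this frame-wise extraction and time-ordered splicing, assuming a shared 2D backbone (any 2D CNN; the names are illustrative), might look as follows.

```python
import torch
import torch.nn as nn

def build_first_feature_map(frames: torch.Tensor, backbone: nn.Module) -> torch.Tensor:
    # frames: (T, 3, H, W), kept in the order they appear in the video
    per_frame = [backbone(f.unsqueeze(0)) for f in frames]  # each (1, C, H', W')
    stacked = torch.cat(per_frame, dim=0)                   # (T, C, H', W'), spliced in time order
    # rearrange so the time axis sits next to the channel axis: (C, T, H', W')
    return stacked.permute(1, 0, 2, 3)
```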
  12. A model training method for video analysis, including:
    acquiring a sample video, wherein the sample video includes preset annotation information;
    performing feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, wherein the first sample multi-dimensional feature map contains feature information on different time sequences corresponding to the sample video;
    predicting the first sample multi-dimensional feature map by using an offset prediction network to obtain offset information;
    performing time sequence offset on at least part of the feature information of the first sample multi-dimensional feature map by using the offset information, and obtaining a second sample multi-dimensional feature map based on the offset feature information;
    analyzing the second sample multi-dimensional feature map by using the preset network model to obtain analysis result information of the sample video;
    calculating a loss value by using the preset annotation information and the analysis result information;
    adjusting parameters of the preset network model and the offset prediction network based on the loss value.
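This training method reduces to a standard supervised step in which the loss gradient flows through both the preset model and the offset prediction network, so both are adjusted jointly. The cross-entropy loss and the single optimizer over the joint parameters below are assumptions; the claim fixes neither.

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module,             # preset model and offset network together
               optimizer: torch.optim.Optimizer,
               sample_video: torch.Tensor,   # (N, C, T, H, W)
               labels: torch.Tensor):        # preset annotation information
    optimizer.zero_grad()
    analysis_result = model(sample_video)    # forward pass through both networks
    loss = nn.functional.cross_entropy(analysis_result, labels)  # assumed loss choice
    loss.backward()                          # gradients reach both networks
    optimizer.step()                         # adjust the parameters jointly
    return loss.item()
```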
  13. A video analysis device, including:
    a video acquisition module, configured to acquire a video to be analyzed;
    a feature extraction module, configured to perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, wherein the first multi-dimensional feature map contains feature information on different time sequences corresponding to the video to be analyzed;
    an offset prediction module, configured to predict the first multi-dimensional feature map by using an offset prediction network to obtain offset information;
    an offset processing module, configured to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information;
    a network analysis module, configured to analyze the second multi-dimensional feature map by using the preset network model to obtain analysis result information of the video to be analyzed.
  14. The video analysis device according to claim 13, wherein the device further includes a weight prediction module, configured to predict the first multi-dimensional feature map by using a weight prediction network to obtain weight information;
    the offset processing module is configured to: perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information; perform weighting processing on the offset feature information by using the weight information; and obtain the second multi-dimensional feature map based on the weighted feature information.
  15. The video analysis device according to claim 13 or 14, wherein the dimensions of the first multi-dimensional feature map include a time sequence dimension and a preset dimension;
    the offset processing module is configured to select at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, wherein each group of feature information includes feature information corresponding to different time sequences in the same preset dimension, and to offset the at least one group of feature information in the time sequence dimension by using the offset information.
  16. The video analysis device according to claim 15, wherein the preset dimension is a channel dimension; and/or,
    the offset information includes a first number of offset values, and the at least one group of feature information includes a first number of groups of first feature information;
    the offset processing module is configured to offset the i-th group of the first feature information in the time sequence dimension by using the i-th offset value in the offset information to obtain an i-th group of second feature information, wherein i is a positive integer less than or equal to the first number.
  17. The video analysis device according to claim 16, wherein the offset processing module is configured to: acquire a numerical range to which the i-th offset value belongs, the difference between the upper limit value and the lower limit value of the numerical range being a preset value; offset the i-th group of the first feature information along the time sequence dimension by the upper limit number of time sequence units to obtain an i-th group of third feature information, and offset the i-th group of the first feature information along the time sequence dimension by the lower limit number of time sequence units to obtain an i-th group of fourth feature information; perform weighting processing on the i-th group of the third feature information with the difference between the i-th offset value and the lower limit value as a weight to obtain an i-th group first weighted result, and perform weighting processing on the i-th group of the fourth feature information with the difference between the upper limit value and the i-th offset value as a weight to obtain an i-th group second weighted result; and calculate the sum of the i-th group first weighted result and the i-th group second weighted result as the i-th group of the second feature information.
  18. The video analysis device according to claim 15, wherein the video to be analyzed includes a second number of frame images, and the weight information includes the second number of weight values;
    the offset processing module is configured to, for each group of offset feature information, perform weighting processing on the feature value corresponding to the j-th time sequence in the current group of feature information by using the j-th weight value in the weight information, to obtain a corresponding group of weighted feature information, wherein j is a positive integer less than or equal to the second number.
  19. The video analysis device according to any one of claims 14 to 18, wherein the offset processing module is configured to compose the second multi-dimensional feature map by using the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map.
  20. The video analysis device according to any one of claims 14 to 18, wherein the weight prediction module is configured to: down-sample the first multi-dimensional feature map by using a first down-sampling layer of the weight prediction network to obtain a first down-sampling result; perform convolution processing on the first down-sampling result by using a first convolutional layer of the weight prediction network to obtain a first feature extraction result; and perform non-linear processing on the first feature extraction result by using a first activation layer of the weight prediction network to obtain the weight information.
  21. The video analysis device according to any one of claims 13 to 18, wherein the offset prediction module is configured to: down-sample the first multi-dimensional feature map by using a second down-sampling layer of the offset prediction network to obtain a second down-sampling result; perform convolution processing on the second down-sampling result by using a second convolutional layer of the offset prediction network to obtain a second feature extraction result; perform feature connection on the second feature extraction result by using a first fully connected layer of the offset prediction network to obtain a first feature connection result; perform non-linear processing on the first feature connection result by using a second activation layer of the offset prediction network to obtain a non-linear processing result; perform feature connection on the non-linear processing result by using a second fully connected layer of the offset prediction network to obtain a second feature connection result; and perform non-linear processing on the second feature connection result by using a third activation layer of the offset prediction network to obtain the offset information.
  22. The video analysis device according to any one of claims 13 to 18, wherein the preset network model includes at least one convolutional layer;
    the feature extraction module is configured to perform feature extraction on the video to be analyzed by using a convolutional layer of the preset network model to obtain the first multi-dimensional feature map, and, in a case where the number of convolutional layers of the preset network model is more than one, is further configured to perform feature extraction on the second multi-dimensional feature map by using a convolutional layer of the preset network model that has not yet performed feature extraction, to obtain a new first multi-dimensional feature map;
    the offset prediction module is further configured to predict the new first multi-dimensional feature map by using the offset prediction network to obtain new offset information;
    the offset processing module is further configured to perform time sequence offset on at least part of the feature information of the new first multi-dimensional feature map by using the new offset information, and obtain a new second multi-dimensional feature map based on the offset feature information;
    the network analysis module is further configured to analyze the new second multi-dimensional feature map by using a fully connected layer of the preset network model to obtain the analysis result information of the video to be analyzed.
  23. The video analysis device according to any one of claims 13 to 18, wherein the video to be analyzed includes several frames of images;
    the feature extraction module is configured to perform feature extraction on each of the several frames of images by using the preset network model to obtain a feature map corresponding to each frame of image, and to splice the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
  24. A model training device for video analysis, including:
    a video acquisition module, configured to acquire a sample video, wherein the sample video includes preset annotation information;
    a feature extraction module, configured to perform feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, wherein the first sample multi-dimensional feature map contains feature information on different time sequences corresponding to the sample video;
    an offset prediction module, configured to predict the first sample multi-dimensional feature map by using an offset prediction network to obtain offset information;
    an offset processing module, configured to perform time sequence offset on at least part of the feature information of the first sample multi-dimensional feature map by using the offset information, and obtain a second sample multi-dimensional feature map based on the offset feature information;
    a network analysis module, configured to analyze the second sample multi-dimensional feature map by using the preset network model to obtain analysis result information of the sample video;
    a loss calculation module, configured to calculate a loss value by using the preset annotation information and the analysis result information;
    a parameter adjustment module, configured to adjust parameters of the preset network model and the offset prediction network based on the loss value.
  25. An electronic device, including a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the video analysis method according to any one of claims 1 to 11, or to implement the model training method according to claim 12.
  26. A computer-readable storage medium having program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the video analysis method according to any one of claims 1 to 11, or implement the model training method according to claim 12.
  27. A computer program, including computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the video analysis method according to any one of claims 1 to 11, or the model training method according to claim 12.
PCT/CN2020/078656 2020-01-17 2020-03-10 Video analysis method and related model training method, device and apparatus therefor WO2021142904A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020217013635A KR20210093875A (en) Video analysis method and related model training method, device, and apparatus
JP2021521512A JP7096431B2 (en) Video analysis method and related model training method, device, and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010053048.4 2020-01-17
CN202010053048.4A CN111291631B (en) 2020-01-17 2020-01-17 Video analysis method and related model training method, device and apparatus thereof

Publications (1)

Publication Number Publication Date
WO2021142904A1

Family

ID=71025430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/078656 WO2021142904A1 (en) 2020-01-17 2020-03-10 Video analysis method and related model training method, device and apparatus therefor

Country Status (5)

Country Link
JP (1) JP7096431B2 (en)
KR (1) KR20210093875A (en)
CN (1) CN111291631B (en)
TW (1) TWI761813B (en)
WO (1) WO2021142904A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417952B (en) * 2020-10-10 2022-11-11 北京理工大学 Environment video information availability evaluation method of vehicle collision prevention and control system
CN112464898A (en) * 2020-12-15 2021-03-09 北京市商汤科技开发有限公司 Event detection method and device, electronic equipment and storage medium
CN112949449B (en) * 2021-02-25 2024-04-19 北京达佳互联信息技术有限公司 Method and device for training staggered judgment model and method and device for determining staggered image

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626803B2 (en) * 2014-12-12 2017-04-18 Qualcomm Incorporated Method and apparatus for image processing in augmented reality systems
US10707837B2 (en) 2017-07-06 2020-07-07 Analog Photonics LLC Laser frequency chirping structures, methods, and applications
US11248905B2 (en) * 2017-08-16 2022-02-15 Kla-Tencor Corporation Machine learning in metrology measurements
US10430654B1 (en) * 2018-04-20 2019-10-01 Surfline\Wavetrak, Inc. Automated detection of environmental measures within an ocean environment using image data
CN109919025A (en) * 2019-01-30 2019-06-21 华南理工大学 Video scene Method for text detection, system, equipment and medium based on deep learning
CN110084742B (en) * 2019-05-08 2024-01-26 北京奇艺世纪科技有限公司 Parallax map prediction method and device and electronic equipment
CN110660082B (en) * 2019-09-25 2022-03-08 西南交通大学 Target tracking method based on graph convolution and trajectory convolution network learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199902A (en) * 2014-08-27 2014-12-10 中国科学院自动化研究所 Similarity measurement computing method of linear dynamical systems
US20170243058A1 (en) * 2014-10-28 2017-08-24 Watrix Technology Gait recognition method based on deep learning
CN108229522A (en) * 2017-03-07 2018-06-29 北京市商汤科技开发有限公司 Training method, attribute detection method, device and the electronic equipment of neural network
CN108229280A (en) * 2017-04-20 2018-06-29 北京市商汤科技开发有限公司 Time domain motion detection method and system, electronic equipment, computer storage media

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390731A1 (en) * 2020-06-12 2021-12-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for positioning key point, device, and storage medium
US11610389B2 (en) * 2020-06-12 2023-03-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for positioning key point, device, and storage medium

Also Published As

Publication number Publication date
JP2022520511A (en) 2022-03-31
TWI761813B (en) 2022-04-21
CN111291631B (en) 2023-11-07
CN111291631A (en) 2020-06-16
KR20210093875A (en) 2021-07-28
TW202129535A (en) 2021-08-01
JP7096431B2 (en) 2022-07-05


Legal Events

ENP: Entry into the national phase. Ref document number: 2021521512; Country of ref document: JP; Kind code of ref document: A.
121: Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 20913355; Country of ref document: EP; Kind code of ref document: A1.
NENP: Non-entry into the national phase. Ref country code: DE.
122: Ep: pct application non-entry in european phase. Ref document number: 20913355; Country of ref document: EP; Kind code of ref document: A1.