WO2021142904A1 - Video analysis method and related model training method, device and apparatus therefor - Google Patents

Video analysis method and related model training method, device and apparatus therefor

Info

Publication number
WO2021142904A1
WO2021142904A1 (PCT/CN2020/078656)
Authority
WO
WIPO (PCT)
Prior art keywords
information
offset
feature
feature map
video
Prior art date
Application number
PCT/CN2020/078656
Other languages
French (fr)
Chinese (zh)
Inventor
邵昊 (SHAO Hao)
刘宇 (LIU Yu)
Original Assignee
北京市商汤科技开发有限公司 (Beijing SenseTime Technology Development Co., Ltd.)
Application filed by 北京市商汤科技开发有限公司 (Beijing SenseTime Technology Development Co., Ltd.)
Priority to KR1020217013635A priority Critical patent/KR20210093875A/en
Priority to JP2021521512A priority patent/JP7096431B2/en
Publication of WO2021142904A1 publication Critical patent/WO2021142904A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Definitions

  • This application relates to the field of artificial intelligence technology, in particular to a video analysis method and related model training methods, equipment, and devices.
  • Conventional neural network models are generally designed with static images as processing objects, and thus do not directly model the temporal information in videos.
  • The embodiments of the present application provide a video analysis method and related model training methods, equipment, and devices.
  • In a first aspect, an embodiment of the present application provides a video analysis method, including: obtaining a video to be analyzed; using a preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different time steps of the video to be analyzed; using an offset prediction network to predict the first multi-dimensional feature map to obtain offset information; using the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map, and obtaining a second multi-dimensional feature map based on the offset feature information; and using the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
  • Because the embodiment of the present application processes the video to be analyzed with a preset network model, the processing speed of video analysis is improved; and because the temporal offset jointly interleaves spatial and temporal information, analysis performed on this basis improves the accuracy of video analysis.
  • In some embodiments, before using the offset information to perform the temporal offset on at least part of the feature information of the first multi-dimensional feature map and obtaining the second multi-dimensional feature map based on the offset feature information, the method further includes: using a weight prediction network to predict the first multi-dimensional feature map to obtain weight information. In this case, the offset-and-combine step includes: using the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map; using the weight information to weight the offset feature information; and obtaining the second multi-dimensional feature map based on the weighted feature information.
  • In this way, the processing steps of offsetting and weighting directly produce feature information in which spatial and temporal information are jointly interleaved, which improves both the processing speed and the accuracy of video analysis.
  • In some embodiments, the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension, and using the offset information to perform the temporal offset on at least part of the feature information of the first multi-dimensional feature map includes: selecting at least one set of feature information from the first multi-dimensional feature map along the preset dimension, where each set of feature information includes the feature information corresponding to different time steps in the same preset dimension; and using the offset information to offset the at least one set of feature information along the temporal dimension.
  • Because only the selected sets of feature information are offset along the temporal dimension, the amount of computation for the offset processing is reduced, which further improves the processing speed of video analysis.
  • In some embodiments, the preset dimension is a channel dimension; and/or the offset information includes a first number of offset values and the at least one set of feature information includes a first number of sets of first feature information, and using the offset information to offset the at least one set of feature information along the temporal dimension includes: using the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information, where i is a positive integer less than or equal to the first number.
  • In this way, feature information in which spatial and temporal information are jointly interleaved is obtained directly, which improves the processing speed and accuracy of video analysis.
  • In some embodiments, using the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information includes: obtaining the numerical range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the range is a preset value; shifting the i-th set of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th set of third feature information, and shifting the i-th set of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th set of fourth feature information; using the difference between the i-th offset value and the lower limit as a weight to weight the i-th set of third feature information to obtain the i-th first weighted result, and using the difference between the upper limit and the i-th offset value as a weight to weight the i-th set of fourth feature information to obtain the i-th second weighted result; and taking the sum of the i-th first weighted result and the i-th second weighted result as the i-th set of second feature information.
  • In this way, the offset processing of the first feature information is simple and fast, which improves the processing speed of video analysis.
  • In some embodiments, the video to be analyzed includes a second number of frame images, and the weight information includes a second number of weight values. Using the weight information to weight the offset feature information includes: for each set of offset feature information, using the j-th weight value in the weight information to weight the feature value corresponding to the j-th time step in the current set, to obtain the corresponding weighted set of feature information, where j is a positive integer less than or equal to the second number.
  • Because some feature information at the two ends of the temporal dimension is shifted out during the offset, re-weighting each time step of the offset feature information in this way improves the accuracy of video analysis. A minimal sketch of this re-weighting follows.
  • In some embodiments, obtaining the second multi-dimensional feature map based on the weighted feature information includes: combining the weighted feature information with the feature information in the first multi-dimensional feature map that was not offset to form the second multi-dimensional feature map.
  • Combining the weighted feature information with the non-offset feature information of the first multi-dimensional feature map reduces the computational load and improves the processing speed of video analysis.
  • In some embodiments, using the weight prediction network to predict the first multi-dimensional feature map to obtain the weight information includes: using the first down-sampling layer of the weight prediction network to down-sample the first multi-dimensional feature map to obtain a first down-sampling result; using the first convolutional layer of the weight prediction network to perform convolution processing on the first down-sampling result to obtain a first feature extraction result; and using the first activation layer of the weight prediction network to perform non-linear processing on the first feature extraction result to obtain the weight information.
  • Processing the first multi-dimensional feature map layer by layer through the first down-sampling layer, the first convolutional layer, and the first activation layer yields the weight information with an effectively simplified network structure and fewer network parameters, which speeds up the convergence of model training for video analysis and helps avoid over-fitting, thereby improving the accuracy of video analysis.
  • In some embodiments, using the offset prediction network to predict the first multi-dimensional feature map to obtain the offset information includes: using the second down-sampling layer of the offset prediction network to down-sample the first multi-dimensional feature map to obtain a second down-sampling result; using the second convolutional layer of the offset prediction network to perform convolution processing on the second down-sampling result to obtain a second feature extraction result; using the first fully connected layer of the offset prediction network to perform feature connection on the second feature extraction result to obtain a first feature connection result; using the second activation layer of the offset prediction network to perform non-linear processing on the first feature connection result to obtain a non-linear processing result; using the second fully connected layer of the offset prediction network to perform feature connection on the non-linear processing result to obtain a second feature connection result; and using the third activation layer of the offset prediction network to perform non-linear processing on the second feature connection result to obtain the offset information.
  • This effectively simplifies the network structure of the offset prediction network and reduces its network parameters, which speeds up the convergence of model training for video analysis and helps avoid over-fitting, thereby improving the accuracy of video analysis.
  • In some embodiments, the preset network model includes at least one convolutional layer, and using the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map includes: using a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map. If the preset network model has more than one convolutional layer, then after the second multi-dimensional feature map is obtained, and before the preset network model analyzes the second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, the method further includes: using a convolutional layer of the preset network model that has not yet performed feature extraction to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map; performing the step of using the offset prediction network to predict offset information, and the subsequent steps, on the new first multi-dimensional feature map to obtain a new second multi-dimensional feature map; and repeating the above steps until all convolutional layers of the preset network model have performed feature extraction.
  • Finally, the fully connected layer of the preset network model analyzes the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, which improves the accuracy of video analysis.
  • In some embodiments, the video to be analyzed includes several frames of images, and using the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map includes: using the preset network model to perform feature extraction on the several frames of images to obtain a feature map corresponding to each frame of image; and splicing the several feature maps according to the temporal order of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
  • Because a feature map is extracted per frame and the feature maps are spliced directly in the temporal order of their corresponding images in the video to be analyzed, the processing load of feature extraction is reduced, which improves the processing speed of video analysis.
  • In a second aspect, an embodiment of the present application provides a model training method for video analysis, including: obtaining a sample video, where the sample video includes preset annotation information; using a preset network model to perform feature extraction on the sample video to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time steps of the sample video; using an offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information; using the offset information to perform a temporal offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtaining a second sample multi-dimensional feature map based on the offset feature information; using the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video; calculating a loss value from the preset annotation information and the analysis result information; and adjusting the parameters of the preset network model and the offset prediction network based on the loss value.
  • In this way, the temporal information of the sample video is modeled directly, which improves the speed of model training; and because the temporal offset jointly interleaves spatial and temporal information, subsequent analysis performed on this basis improves the accuracy of video analysis. A minimal sketch of one training step follows.
  • In a third aspect, an embodiment of the present application provides a video analysis device, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module, and a network analysis module. The video acquisition module is configured to acquire a video to be analyzed; the feature extraction module is configured to use a preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different time steps of the video to be analyzed; the offset prediction module is configured to use an offset prediction network to predict the first multi-dimensional feature map to obtain offset information; the offset processing module is configured to use the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map and obtain a second multi-dimensional feature map based on the offset feature information; and the network analysis module is configured to use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
  • In some embodiments, the device further includes a weight prediction module configured to use a weight prediction network to predict the first multi-dimensional feature map to obtain weight information. The offset processing module is configured to use the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map, use the weight information to weight the offset feature information, and obtain the second multi-dimensional feature map based on the weighted feature information.
  • In some embodiments, the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension, and the offset processing module is configured to select at least one set of feature information from the first multi-dimensional feature map along the preset dimension, where each set of feature information includes the feature information corresponding to different time steps in the same preset dimension, and to use the offset information to offset the at least one set of feature information along the temporal dimension.
  • In some embodiments, the preset dimension is a channel dimension; and/or the offset information includes a first number of offset values and the at least one set of feature information includes a first number of sets of first feature information, and the offset processing module is configured to use the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information, where i is a positive integer less than or equal to the first number.
  • In some embodiments, the offset processing module is configured to: obtain the numerical range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the range is a preset value; shift the i-th set of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th set of third feature information, and shift the i-th set of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th set of fourth feature information; use the difference between the i-th offset value and the lower limit as a weight to weight the i-th set of third feature information to obtain the i-th first weighted result, and use the difference between the upper limit and the i-th offset value as a weight to weight the i-th set of fourth feature information to obtain the i-th second weighted result; and take the sum of the i-th first weighted result and the i-th second weighted result as the i-th set of second feature information.
  • In some embodiments, the video to be analyzed includes a second number of frame images, and the weight information includes the second number of weight values; the offset processing module is configured to, for each set of offset feature information, use the j-th weight value in the weight information to weight the feature value corresponding to the j-th time step in the current set to obtain the corresponding weighted set of feature information, where j is a positive integer less than or equal to the second number.
  • In some embodiments, the offset processing module is configured to combine the weighted feature information with the feature information in the first multi-dimensional feature map that was not offset to form the second multi-dimensional feature map.
  • In some embodiments, the weight prediction module is configured to: use the first down-sampling layer of the weight prediction network to down-sample the first multi-dimensional feature map to obtain a first down-sampling result; use the first convolutional layer of the weight prediction network to perform convolution processing on the first down-sampling result to obtain a first feature extraction result; and use the first activation layer of the weight prediction network to perform non-linear processing on the first feature extraction result to obtain the weight information.
  • In some embodiments, the offset prediction module is configured to: use the second down-sampling layer of the offset prediction network to down-sample the first multi-dimensional feature map to obtain a second down-sampling result; use the second convolutional layer of the offset prediction network to perform convolution processing on the second down-sampling result to obtain a second feature extraction result; use the first fully connected layer of the offset prediction network to perform feature connection on the second feature extraction result to obtain a first feature connection result; use the second activation layer of the offset prediction network to perform non-linear processing on the first feature connection result to obtain a non-linear processing result; use the second fully connected layer of the offset prediction network to perform feature connection on the non-linear processing result to obtain a second feature connection result; and use the third activation layer of the offset prediction network to perform non-linear processing on the second feature connection result to obtain the offset information.
  • In some embodiments, the preset network model includes at least one convolutional layer, and the feature extraction module is configured to use a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map. If the preset network model has more than one convolutional layer, the feature extraction module is further configured to use a convolutional layer that has not yet performed feature extraction to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map; the offset prediction module is further configured to use the offset prediction network to predict the new first multi-dimensional feature map to obtain new offset information; the offset processing module is further configured to use the new offset information to perform a temporal offset on at least part of the feature information of the new first multi-dimensional feature map and obtain a new second multi-dimensional feature map based on the offset feature information; and the network analysis module is further configured to use the fully connected layer of the preset network model to analyze the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed.
  • In some embodiments, the video to be analyzed includes several frames of images, and the feature extraction module is configured to use the preset network model to perform feature extraction on the several frames of images to obtain a feature map corresponding to each frame of image, and to splice the several feature maps according to the temporal order of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
  • In a fourth aspect, an embodiment of the present application provides a model training device for video analysis, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module, a network analysis module, a loss calculation module, and a parameter adjustment module. The video acquisition module is configured to acquire a sample video, where the sample video includes preset annotation information; the feature extraction module is configured to use a preset network model to perform feature extraction on the sample video to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time steps of the sample video; the offset prediction module is configured to use an offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information; the offset processing module is configured to use the offset information to perform a temporal offset on at least part of the feature information of the first sample multi-dimensional feature map and obtain a second sample multi-dimensional feature map based on the offset feature information; the network analysis module is configured to use the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video; the loss calculation module is configured to calculate a loss value from the preset annotation information and the analysis result information; and the parameter adjustment module is configured to adjust the parameters of the preset network model and the offset prediction network based on the loss value.
  • In a fifth aspect, an embodiment of the present application provides an electronic device including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the video analysis method in the first aspect described above, or the model training method for video analysis in the second aspect described above.
  • In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium on which program instructions are stored. When the program instructions are executed by a processor, the video analysis method in the first aspect described above, or the model training method for video analysis in the second aspect described above, is implemented.
  • In a seventh aspect, an embodiment of the present application provides a computer program, including computer-readable code. When the computer-readable code runs in an electronic device, a processor in the electronic device executes it to implement the video analysis method in the first aspect described above, or the model training method for video analysis in the second aspect described above.
  • In the above solutions, the temporal information of the video to be analyzed is modeled directly, which improves the processing speed of video analysis; and because the temporal offset jointly interleaves spatial and temporal information, analysis performed on this basis improves the accuracy of video analysis.
  • FIG. 1 is a schematic flowchart of an embodiment of a video analysis method according to the present application.
  • FIG. 2 is a schematic diagram of an embodiment of a video analysis processing process.
  • FIG. 3 is a schematic diagram of an embodiment of each stage of video analysis.
  • FIG. 4 is a schematic flowchart of an embodiment of step S14 in FIG. 1.
  • FIG. 5 is a schematic flowchart of another embodiment of the video analysis method of the present application.
  • FIG. 6 is a schematic diagram of another embodiment of a video analysis processing process.
  • FIG. 7 is a schematic flowchart of an embodiment of a model training method for video analysis according to the present application.
  • FIG. 8 is a schematic diagram of the framework of an embodiment of the video analysis device of the present application.
  • FIG. 9 is a schematic diagram of the framework of an embodiment of a model training device for video analysis according to the present application.
  • FIG. 10 is a schematic diagram of the framework of an embodiment of an electronic device of the present application.
  • FIG. 11 is a schematic diagram of the framework of an embodiment of a computer-readable storage medium according to the present application.
  • The terms "system" and "network" are often used interchangeably in this document.
  • The term "and/or" in this document merely describes an association relationship between associated objects, indicating that three relationships are possible; for example, "A and/or B" can mean: A exists alone, A and B exist at the same time, or B exists alone.
  • The character "/" in this document generally indicates that the associated objects before and after it are in an "or" relationship.
  • The term "multiple" in this document means two or more.
  • FIG. 1 is a schematic flowchart of an embodiment of a video analysis method according to the present application.
  • The video analysis method of the present application may be executed by an electronic device with processing capability, such as a microcomputer, a server, or a tablet computer, or implemented by a processor executing program code. The method may include the following steps:
  • Step S11: Obtain the video to be analyzed.
  • The video to be analyzed may include several frames of images.
  • For example, the video to be analyzed may include 8 frames of images, 16 frames of images, or 24 frames of images, and so on.
  • In one scenario, the video to be analyzed may be a surveillance video shot by a surveillance camera, used to analyze the behavior of a target object in the surveillance video, for example, whether the target object falls down or walks normally.
  • In another scenario, the video to be analyzed may be a video in a video library, used to classify the videos in the library, for example, as a football match video, a basketball match video, or a ski match video.
  • Step S12: Use the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map.
  • The above-mentioned preset network model may be a two-dimensional neural network model, for example, ResNet-50 or ResNet-101, which is not specifically limited here.
  • A ResNet is constructed from residual blocks, each of which uses several parameterized layers to learn the residual representation between its input and output.
  • The first multi-dimensional feature map contains feature information at different time steps of the video to be analyzed.
  • FIG. 2 is a schematic diagram of an embodiment of a video analysis processing process. As shown in FIG. 2, the abscissa represents the different time steps of the temporal dimension T, and the squares at the different time steps represent the feature information at those time steps.
  • As described above, the video to be analyzed includes several frames of images.
  • Feature extraction may be performed on each frame of the video to be analyzed through the preset network model to obtain the feature map corresponding to each frame of image.
  • The feature maps are then spliced according to the temporal order of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map. For example, if the video to be analyzed includes 8 frames of images, the preset network model may be used to perform feature extraction on the 8 frames to obtain the feature map of each frame, and the 8 feature maps are spliced directly according to the temporal order of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map. A minimal sketch of this step is given below.
  • Step S13: Use the offset prediction network to predict the first multi-dimensional feature map to obtain offset information.
  • The offset prediction network is used to predict the offset information, so that a temporal offset can subsequently be performed based on it, completing the integration of temporal and spatial information.
  • In some embodiments, the offset prediction network may itself be a preset network model, so that the first multi-dimensional feature map can be fed to it and the offset information obtained directly.
  • In some embodiments, the offset prediction network may include a down-sampling layer, a convolutional layer, a fully connected layer, an activation layer, a second fully connected layer, and a second activation layer connected in sequence. The offset prediction network therefore contains only a handful of layers, of which only the convolutional layer and the fully connected layers contain network parameters. This simplifies the network structure and reduces the network parameters, thereby reducing network capacity, improving convergence speed, and helping avoid over-fitting, so that the trained model is as accurate as possible and the accuracy of video analysis is improved.
  • The down-sampling layer (denoted as the second down-sampling layer) of the offset prediction network may be used to down-sample the first multi-dimensional feature map to obtain a down-sampling result (denoted as the second down-sampling result).
  • The down-sampling layer may specifically be an average pooling layer, and the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension (for example, a channel dimension). Averaging over the spatial positions gives the down-sampling result, which can be expressed as:
  \( z_{c,t} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} U_{c,t,h,w} \)
  where c and t index the preset dimension (for example, the channel dimension) and the temporal dimension respectively, \( z_{c,t} \) denotes the (c, t)-th element of the down-sampling result, H and W denote the height and width of the feature map, and \( U_{c,t,h,w} \) denotes the element of the first multi-dimensional feature map at channel c, time t, and spatial position (h, w).
  • The convolutional layer of the offset prediction network (denoted as the second convolutional layer) may be used to perform convolution processing on the down-sampling result (that is, the second down-sampling result) to obtain a feature extraction result (denoted as the second feature extraction result).
  • The convolutional layer of the offset prediction network may include the same number of convolution kernels as the number of frames of the video to be analyzed, and the size of each convolution kernel may be, for example, 3×3.
  • The first fully connected layer of the offset prediction network is then used to perform feature connection on the feature extraction result (that is, the second feature extraction result) to obtain a feature connection result (denoted as the first feature connection result).
  • The first fully connected layer of the offset prediction network may contain the same number of neurons as the number of frames of the video to be analyzed.
  • The first activation layer of the offset prediction network (which may be recorded as the second activation layer) is used to perform non-linear processing on the feature connection result (that is, the first feature connection result) to obtain a non-linear processing result.
  • The first activation layer of the offset prediction network may be a rectified linear unit (ReLU) activation layer.
  • The second fully connected layer of the offset prediction network is used to perform feature connection on the non-linear processing result to obtain a feature connection result (denoted as the second feature connection result); the second activation layer of the offset prediction network (which may be recorded as the third activation layer) then performs non-linear processing on the second feature connection result to obtain the offset information.
  • The second activation layer of the offset prediction network may be a Sigmoid activation layer, so that each element of the offset information is constrained to lie between 0 and 1.
  • The above process can be expressed as:
  \( \text{offset}_{raw} = \sigma_2\big(W_2\, \sigma_1\big(W_1\, F_{1dconv}(z)\big)\big) \)
  where z denotes the down-sampling result, \( F_{1dconv} \) denotes the convolutional layer of the offset prediction network, \( W_1 \) denotes the first fully connected layer, \( \sigma_1 \) denotes the first activation layer, \( W_2 \) denotes the second fully connected layer, \( \sigma_2 \) denotes the second activation layer, and \( \text{offset}_{raw} \) denotes the offset information.
  • In some embodiments, the offset information obtained from the second activation layer may be further constrained so that each element lies in the interval \( (-T/2, T/2) \), where T represents the number of frames of the video to be analyzed.
  • Specifically, 0.5 may be subtracted from each element of the offset information obtained by using the second activation layer of the offset prediction network to perform non-linear processing on the feature connection result, and each difference may be multiplied by the number of frames of the video to be analyzed to obtain the constrained offset information.
  • The above constraint processing can be expressed as:
  \( \text{offset} = (\text{offset}_{raw} - 0.5) \times T \)
  where \( \text{offset}_{raw} \) denotes the offset information produced by the second activation layer, T denotes the number of frames of the video to be analyzed, and offset denotes the constrained offset information.
  • Step S14: Use the offset information to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a second multi-dimensional feature map based on the offset feature information.
  • The at least part of the feature information may be selected along a preset dimension (for example, the channel dimension) of the first multi-dimensional feature map.
  • For example, if the number of channels in the channel dimension of the first multi-dimensional feature map is C, the at least part of the feature information may occupy a subset of those C channels.
  • Alternatively, the offset information may be used to perform a temporal offset on all the feature information of the first multi-dimensional feature map, which is not limited here.
  • In some embodiments, at least one set of feature information may be selected from the first multi-dimensional feature map along a preset dimension (for example, the channel dimension), where each set of feature information includes the feature information corresponding to different time steps in the same preset dimension, and the offset information is used to offset the at least one set of feature information along the temporal dimension.
  • In this case, the second fully connected layer of the offset prediction network may contain the same number of neurons as the number of selected sets of feature information, so that the number of elements in the offset information equals the number of selected sets.
  • Each element of the offset information may then be used to offset the corresponding set of feature information along the temporal dimension, for example by one time unit, or by two time units, and so on, which is not specifically limited here.
  • After the temporal offset, the at least part of the feature information that was offset may be spliced with the part of the feature information in the first multi-dimensional feature map that was not offset to obtain the second multi-dimensional feature map.
  • That is, the offset channels and the non-offset channels are spliced along the channel dimension to obtain the second multi-dimensional feature map, as in the sketch below.
  • Step S15: Use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
  • For example, the fully connected layer of the preset network model may be used to perform feature connection on the second multi-dimensional feature map, and the softmax layer of the preset network model may be used to perform regression, so as to obtain the category of the video to be analyzed (for example, a football match video or a ski match video) or the behavior category of a target object in the video to be analyzed (for example, walking normally, falling down, or running); other application scenarios can be deduced by analogy and are not enumerated here.
  • In some embodiments, the above-mentioned offset prediction network may be embedded before a convolutional layer of the preset network model.
  • For example, when the preset network model is ResNet-50, the offset prediction network may be embedded before the convolutional layer in each residual block.
  • The preset network model may include at least one convolutional layer, so in the feature extraction process a convolutional layer of the preset network model may be used to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map.
  • The number of convolutional layers of the preset network model may be more than one, for example 2, 3, or 4. In that case, before the second multi-dimensional feature map is analyzed to obtain the analysis result information of the video to be analyzed, a convolutional layer of the preset network model that has not yet performed feature extraction may be used to perform feature extraction on the second multi-dimensional feature map, and the offset prediction and temporal offset steps are repeated.
  • Finally, the fully connected layer of the preset network model analyzes the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed.
  • FIG. 3 is a schematic diagram of an embodiment of each stage of video analysis.
  • As shown in FIG. 3, the video to be analyzed first passes through the first convolutional layer of the preset network model for feature extraction to obtain a first multi-dimensional feature map, and a temporal offset is performed through the above-mentioned related steps to obtain a second multi-dimensional feature map.
  • The second multi-dimensional feature map is then input into the second convolutional layer for feature extraction to obtain a new first multi-dimensional feature map (denoted as the first multi-dimensional feature map in the figure), which is temporally offset through the above-mentioned related steps to obtain a new second multi-dimensional feature map (denoted as the second multi-dimensional feature map in the figure). Similarly, the third convolutional layer performs feature extraction on the new second multi-dimensional feature map to obtain another new first multi-dimensional feature map, which is temporally offset through the above-mentioned related steps to obtain another new second multi-dimensional feature map.
  • At this point, all three convolutional layers of the preset network model have completed the feature extraction step, and the fully connected layer of the preset network model analyzes the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed. This multi-stage flow is sketched below.
  • In the above solution, feature extraction is performed on the video to be analyzed to obtain a first multi-dimensional feature map containing feature information at different time steps of the video to be analyzed; the offset prediction network predicts the first multi-dimensional feature map to obtain offset information; the offset information is used to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map; and a second multi-dimensional feature map is obtained based on the offset feature information.
  • In this way, the temporal information of the video to be analyzed is modeled directly, which improves the processing speed of video analysis, and because the temporal offset jointly interleaves spatial and temporal information, analysis performed on this basis improves the accuracy of video analysis.
  • In some embodiments, the offset information includes a first number of offset values.
  • At least part of the first multi-dimensional feature map may be divided along a preset dimension (for example, the channel dimension) into a first number of sets of first feature information; that is, the at least one set of feature information includes a first number of sets of first feature information.
  • In this case, using the offset information to offset the at least one set of feature information along the temporal dimension may include: using the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information, where i is a positive integer less than or equal to the first number.
  • For example, the first offset value in the offset information is used to offset the first set of first feature information along the temporal dimension to obtain the first set of second feature information, and the second offset value is used to offset the second set of first feature information along the temporal dimension to obtain the second set of second feature information; when the first number takes other values, the rest can be deduced by analogy and is not enumerated here.
  • In some embodiments, using the i-th offset value in the offset information to offset the i-th set of first feature information along the temporal dimension to obtain the i-th set of second feature information may include the following steps:
  • Step S141: Obtain the numerical range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the range is a preset value.
  • For example, the preset value may be 1: the lower limit of the range is the i-th offset value rounded down, and the upper limit is the i-th offset value rounded up.
  • That is, for the i-th offset value \( O_i \), its range can be expressed as \( (n_0, n_0 + 1) \), where \( n_0 \in \mathbb{N} \).
  • When the offset value is 0.8, the range is 0 to 1; when the offset value is 1.4, the range is 1 to 2. Other offset values can be treated in the same way and are not enumerated here.
  • Step S142: Shift the i-th set of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th set of third feature information, and shift the i-th set of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th set of fourth feature information.
  • For example, the i-th set of first feature information can be expressed as \( U_{c,t} \). When the range of the i-th offset value is \( (n_0, n_0 + 1) \), shifting the i-th set of first feature information along the temporal dimension by the upper-limit number of time units gives the i-th set of third feature information \( U_{c,t+n_0+1} \), and shifting it by the lower-limit number of time units gives the i-th set of fourth feature information \( U_{c,t+n_0} \).
  • In some embodiments, each offset value may be a decimal.
  • For example, when the range of each offset value is 0 to 1 (that is, the upper limit is 1, the lower limit is 0, and the preset value is 1), then for the i-th set of first feature information \( U_{c,t} \), the corresponding third feature information can be expressed as \( U_{c,t+1} \) and the corresponding fourth feature information as \( U_{c,t} \).
  • In some embodiments, the range of the first feature information in the temporal dimension is [1, T], where T equals the number of frames of the video to be analyzed; for example, the T of the first feature information [1 0 0 0 0 0 0 1] is 8.
  • During the temporal offset, feature information may be shifted out entirely, so that the first feature information becomes a zero vector and the gradient vanishes during training. To alleviate this problem, buffers may be set for the temporal intervals (0, 1) and (T, T + 1) after the temporal offset, so that feature information shifted beyond time T + 1, or before time 0, is fixed to 0 in the buffer.
  • For example, when the i-th offset value is 0.4, the range it belongs to is 0 to 1, so the first feature information [1 0 0 0 0 0 0 1] is shifted by the upper-limit number (that is, 1) of time units to obtain the corresponding third feature information [0 1 0 0 0 0 0 0], and shifted by the lower-limit number (that is, 0) of time units to obtain the corresponding fourth feature information [1 0 0 0 0 0 0 1].
  • When the first feature information and the offset value take other values, the rest can be deduced by analogy and is not enumerated here.
  • Step S143: Use the difference between the i-th offset value and the lower limit as a weight to weight the i-th set of third feature information to obtain the i-th first weighted result, and use the difference between the upper limit and the i-th offset value as a weight to weight the i-th set of fourth feature information to obtain the i-th second weighted result.
  • With the i-th offset value expressed as \( O_i \) and its range as \( (n_0, n_0 + 1) \), the difference between \( O_i \) and the lower limit \( n_0 \), that is \( O_i - n_0 \), is used as the weight for the i-th set of third feature information \( U_{c,t+n_0+1} \), giving the first weighted result \( (O_i - n_0)\, U_{c,t+n_0+1} \); and the difference between the upper limit \( n_0 + 1 \) and \( O_i \), that is \( n_0 + 1 - O_i \), is used as the weight for the i-th set of fourth feature information \( U_{c,t+n_0} \), giving the second weighted result \( (n_0 + 1 - O_i)\, U_{c,t+n_0} \).
  • As before, when each offset value is a decimal in the range 0 to 1 (that is, the upper limit is 1, the lower limit is 0, and the preset value is 1), then for the first feature information \( U_{c,t} \) the corresponding third feature information is \( U_{c,t+1} \) and the corresponding fourth feature information is \( U_{c,t} \), so the first weighted result can be expressed as \( O_i\, U_{c,t+1} \) and the second weighted result as \( (1 - O_i)\, U_{c,t} \).
  • Continuing the numerical example, the corresponding third feature information is [0 1 0 0 0 0 0 0] and the corresponding fourth feature information is [1 0 0 0 0 0 0 1], so with the offset value 0.4 the first weighted result can be expressed as [0 0.4 0 0 0 0 0 0] and the second weighted result as [0.6 0 0 0 0 0 0 0.6].
  • Step S144: Calculate the sum of the i-th first weighted result and the i-th second weighted result as the i-th set of second feature information.
  • In the general case, the first weighted result can be expressed as \( (O_i - n_0)\, U_{c,t+n_0+1} \) and the second weighted result as \( (n_0 + 1 - O_i)\, U_{c,t+n_0} \), so the i-th set of second feature information can be expressed as \( (n_0 + 1 - O_i)\, U_{c,t+n_0} + (O_i - n_0)\, U_{c,t+n_0+1} \).
  • When each offset value is a decimal in the range 0 to 1 (that is, the upper limit is 1, the lower limit is 0, and the preset value is 1), then for the first feature information \( U_{c,t} \) the first weighted result can be expressed as \( O_i\, U_{c,t+1} \) and the second weighted result as \( (1 - O_i)\, U_{c,t} \), so the i-th set of second feature information can be expressed as \( (1 - O_i)\, U_{c,t} + O_i\, U_{c,t+1} \).
• Continuing the example above, the corresponding first weighted result can be expressed as [0 0.4 0 0 0 0 0 0] and the corresponding second weighted result as [0.6 0 0 0 0 0 0 0.6], so the i-th group second feature information can be expressed as [0.6 0.4 0 0 0 0 0 0.6]; a code sketch of this fractional offset follows below.
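• To make the above interpolation concrete, the following is a minimal sketch of the fractional timing offset in Python (PyTorch), assuming a value range of (0, 1) for the offset value (i.e., n_0 = 0); the function name fractional_shift and the use of F.pad for the zero buffers are illustrative assumptions, not details from the application.

```python
import torch
import torch.nn.functional as F

def fractional_shift(u: torch.Tensor, offset: float) -> torch.Tensor:
    """Fractionally shift u along its last (time) axis.

    With the offset value in (0, 1), the result is the interpolation
    (1 - offset) * shift_by_0(u) + offset * shift_by_1(u); features shifted
    past either end fall into zero buffers and are discarded, matching the
    (0, 1) and (T, T+1) buffer intervals described above.
    """
    t = u.shape[-1]
    padded = F.pad(u, (1, 1))            # zero buffers at both ends
    shifted_up = padded[..., :t]         # shift by the upper limit (1 unit)
    shifted_low = padded[..., 1:t + 1]   # shift by the lower limit (0 units)
    return offset * shifted_up + (1.0 - offset) * shifted_low

# Reproducing the worked example: offset 0.4 applied to [1 0 0 0 0 0 0 1]
u = torch.tensor([1., 0., 0., 0., 0., 0., 0., 1.])
print(fractional_shift(u, 0.4))  # tensor([0.6, 0.4, 0, 0, 0, 0, 0, 0.6])
```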
• In addition, a symmetric offset strategy can be used during training, that is, only half of the offset values are trained, and a conversion calculation (for example, reversing the order) is performed on them to obtain the other half of the offset values, which can reduce the processing load during training.
• To summarize: the i-th group of first feature information is offset along the time series dimension by the upper-limit number of time series units to obtain the i-th group of third feature information, and offset along the time series dimension by the lower-limit number of time series units to obtain the i-th group of fourth feature information; the difference between the i-th offset value and the lower limit value is used as the weight to weight the i-th group of third feature information to obtain the i-th group first weighted result, and the difference between the upper limit value and the i-th offset value is used as the weight to weight the i-th group of fourth feature information to obtain the i-th group second weighted result; the sum of the i-th group first weighted result and the i-th group second weighted result is calculated as the i-th group second feature information.
  • FIG. 5 is a schematic flowchart of another embodiment of the video analysis method of the present application. Specifically, it can include the following steps:
  • Step S51 Obtain the video to be analyzed.
  • Step S52 Use the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map.
  • the first multi-dimensional feature map contains feature information in different time series corresponding to the video to be analyzed.
• For details, please refer to the relevant steps in the foregoing embodiment.
  • Step S53 Use the offset prediction network to predict the first multi-dimensional feature map to obtain offset information.
  • FIG. 6 is a schematic diagram of another embodiment of the video analysis processing process.
  • the first multi-dimensional feature map can be predicted by the offset prediction network.
  • Step S54 Use the weight prediction network to predict the first multi-dimensional feature map to obtain weight information.
• During the timing offset, the features at the two ends of the first feature information may be removed. Therefore, in order to re-evaluate the importance of each feature in the first feature information after the timing offset, and to better capture long-range information, an attention mechanism can be used to re-weight each feature in the first feature information after the timing offset; for this, the weight information needs to be obtained.
  • the weight prediction network can be used to predict the first multi-dimensional feature map to obtain weight information.
  • the weight prediction network may include a down-sampling layer, a convolutional layer, and an activation layer that are sequentially connected. Therefore, the weight prediction network contains only 3 layers, and only the convolutional layer contains network parameters, which can simplify the network structure to a certain extent and reduce network parameters, thereby reducing network capacity, improving convergence speed, and avoiding overfitting.
• In this way, the trained model can be made as accurate as possible, which in turn can improve the accuracy of video analysis.
• In some embodiments, using the weight prediction network to predict the first multi-dimensional feature map to obtain the weight information may include: using the down-sampling layer of the weight prediction network (denoted as the first down-sampling layer) to down-sample the first multi-dimensional feature map to obtain a down-sampling result (denoted as the first down-sampling result); using the convolutional layer of the weight prediction network (denoted as the first convolutional layer) to perform convolution processing on the first down-sampling result to obtain a feature extraction result (denoted as the first feature extraction result); and using the activation layer of the weight prediction network to perform nonlinear processing on the first feature extraction result to obtain the weight information.
  • the downsampling layer may be an average pooling layer.
  • the convolutional layer of the weight prediction network can include one convolution kernel, and the activation layer of the weight prediction network can be a Sigmoid activation layer, so that each element in the weight information can be constrained to be between 0 and 1.
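• Under these choices, the weight prediction network can be sketched as below; this is a minimal Python (PyTorch) sketch assuming the first multi-dimensional feature map has shape (N, C, T, H, W), with the average pooling acting on the spatial dimensions and the convolution (a single kernel, hence one output channel) acting along the time axis; the class name and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class WeightPredictionNet(nn.Module):
    """Sketch of the 3-layer weight prediction network: average-pooling
    down-sampling layer -> convolutional layer with one kernel -> Sigmoid
    activation layer constraining each weight to (0, 1)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.down = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep T, pool H and W away
        self.conv = nn.Conv1d(channels, 1, kernel_size, padding=kernel_size // 2)
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) -> pooled: (N, C, T) -> weights: (N, 1, T)
        pooled = self.down(x).flatten(2)
        return self.act(self.conv(pooled))
```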
  • the offset prediction network and the weight prediction network in the embodiments of the present application may be embedded before the convolutional layer of the preset network model.
• For example, when the preset network model is ResNet-50, the offset prediction network and the weight prediction network can be embedded before the convolutional layer of each residual block, so as to use the first multi-dimensional feature map to predict the offset information and the weight information for the subsequent offset and weighting processing. In this way, only a small number of network parameters need to be added to the existing network parameters of ResNet-50 to model the timing information, which is conducive to reducing the processing load of video analysis, improving the processing speed of video analysis, accelerating convergence during model training, avoiding over-fitting, and improving the accuracy of video analysis.
• When the preset network model is another model, this can be deduced by analogy; no examples are given here.
• Steps S53 and S54 can be performed in any order: for example, step S53 may be performed first and then step S54; or step S54 may be performed first and then step S53; or steps S53 and S54 may be performed at the same time, which is not limited here. In addition, step S54 only needs to be performed before the subsequent step S56, which is not limited here either.
  • Step S55 Use the offset information to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map.
  • Step S56 Use the weight information to perform weighting processing on the offset feature information.
  • the video to be analyzed may specifically include a second number of frame images, and the weight information may include a second number of weight values.
  • the second number may specifically be 8, 16, 24, etc., which are not specifically limited herein.
• For each group of offset feature information, the j-th weight value in the weight information can be used to weight the feature value corresponding to the j-th time sequence in the current group of feature information, obtaining the corresponding weighted group of feature information, where j is a positive integer less than or equal to the second number.
• For example, the weight information can be [0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.2]; using the j-th weight value in the weight information to weight the j-th feature value of the offset feature information [0.6 0.4 0 0 0 0 0 0.6] from the example above, the weighted feature information of the corresponding group is obtained as [0.12 0.04 0 0 0 0 0 0.12].
  • Step S57 Obtain a second multi-dimensional feature map based on the weighted feature information.
  • the second multi-dimensional feature map corresponding to the first multi-dimensional feature map can be obtained.
• In some embodiments, obtaining the second multi-dimensional feature map based on the weighted feature information may include: using the weighted feature information and the feature information of the first multi-dimensional feature map that is not offset to compose the second multi-dimensional feature map.
  • the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map can be spliced to obtain the second multi-dimensional feature map.
  • the obtained second multi-dimensional feature map has the same size as the first multi-dimensional feature map.
  • the weighted feature information can be directly combined to form the second multi-dimensional feature map.
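• The composition step can be sketched as follows in Python (PyTorch), assuming the first multi-dimensional feature map has shape (N, C, T), that the offset groups occupy the first c_shift channels (an assumption; the application does not fix their position), and that weights has shape (N, 1, T) as produced by the weight prediction sketch above.

```python
import torch

def compose_second_feature_map(x: torch.Tensor, shifted: torch.Tensor,
                               weights: torch.Tensor, c_shift: int) -> torch.Tensor:
    """Weight the offset channel groups per timestep, then splice them with
    the un-offset channels so the result keeps the size of x."""
    weighted = shifted * weights                    # (N, c_shift, T) * (N, 1, T)
    return torch.cat([weighted, x[:, c_shift:]], dim=1)
```

With the numbers above, multiplying [0.6 0.4 0 0 0 0 0 0.6] elementwise by [0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.2] indeed yields [0.12 0.04 0 0 0 0 0 0.12].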
  • Step S58 Use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
• In this embodiment, the weight prediction network is used to predict the first multi-dimensional feature map to obtain the weight information, the offset information is used to perform timing offset on at least part of the feature information of the first multi-dimensional feature map, the weight information is used to weight the offset feature information, and the second multi-dimensional feature map is obtained based on the weighted feature information. Spatially and temporally interleaved feature information can therefore be obtained directly through the offset and weighting steps, which is conducive to improving the processing speed and accuracy of video analysis.
  • FIG. 7 is a schematic flowchart of an embodiment of a model training method for video analysis according to the present application.
  • the model training method used for video analysis in the embodiments of the present application can be specifically executed by electronic devices with processing functions such as microcomputers, servers, tablet computers, or implemented by a processor executing program code. Specifically, it can include the following steps:
  • Step S71 Obtain a sample video.
  • the sample video includes preset annotation information.
• Taking behavior recognition as an example, the preset annotation information of the sample video may include, but is not limited to: falling, normal walking, running, and other annotation information; or, taking video classification as an example, the preset annotation information of the sample video may include, but is not limited to: football match video, basketball match video, ski match video, and other label information.
  • Other application scenarios can be deduced by analogy, so we will not give examples one by one here.
  • the sample video may include several frames of images, for example, may include 8 frames of images, or may also include 16 frames of images, or may also include 24 frames of images, which is not specifically limited here.
  • Step S72 Perform feature extraction on the sample video by using the preset network model to obtain the first sample multi-dimensional feature map.
• The above-mentioned preset network model may be a two-dimensional neural network model, for example, ResNet-50 or ResNet-101; there is no specific limitation here.
• The ResNet network is constructed from residual blocks (Residual Block), which use multiple parameterized layers to learn the residual representation between input and output.
  • the first sample multi-dimensional feature map contains feature information in different time series corresponding to the sample video.
  • FIG. 2 is a schematic diagram of an embodiment of a video analysis processing process. As shown in FIG. 2, the abscissa represents different time series in the time series dimension T, and the squares corresponding to the different time series represent feature information in the different time series.
• Since the sample video includes several frames of images, feature extraction can be performed on the several frames of the sample video through the preset network model to obtain the feature map corresponding to each frame of image, and the several feature maps can then be directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map. For example, if the sample video includes 8 frames of images, the preset network model can be used to perform feature extraction on these 8 frames to obtain the feature map of each frame, and the 8 feature maps are directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map; a sketch of this step follows below.
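• The per-frame extraction and splicing can be sketched as follows in Python (PyTorch); the function name is hypothetical, and the backbone is assumed to be any 2-D network mapping one (3, H, W) frame to a (C, h, w) feature map.

```python
import torch
import torch.nn as nn

def extract_sample_feature_map(frames: torch.Tensor, backbone: nn.Module) -> torch.Tensor:
    """frames: (T, 3, H, W), the T frame images of one sample video.

    Each frame is passed through the 2-D backbone independently, and the
    resulting per-frame feature maps are spliced along a new time axis,
    giving a first sample multi-dimensional feature map of shape (C, T, h, w).
    """
    feats = [backbone(f.unsqueeze(0)).squeeze(0) for f in frames]  # T maps (C, h, w)
    return torch.stack(feats, dim=1)  # spliced in frame-time order
```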
  • Step S73 Use the offset prediction network to predict the multi-dimensional feature map of the first sample to obtain offset information.
  • the weight prediction network can also be used to predict the first sample multi-dimensional feature map to obtain weight information.
  • the network structure of the weight prediction network refer to the relevant steps in the foregoing embodiment, which will not be repeated here.
  • Step S74 Use the offset information to perform time sequence offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain the second sample multi-dimensional feature map based on the offset feature information.
  • the weight information can also be used to weight the offset feature information, and based on the weighted feature information, the second sample multi-dimensional feature map can be obtained.
  • the preset network model may include at least one convolutional layer, and then one convolutional layer of the preset network model may be used to perform feature extraction on the sample video to obtain the first sample multi-dimensional feature map.
• In some embodiments, the number of convolutional layers of the preset network model may be more than one. In this case, a convolutional layer in the preset network model that has not yet performed feature extraction can be used to perform feature extraction on the second sample multi-dimensional feature map to obtain a new first sample multi-dimensional feature map, and the step of using the offset prediction network to predict the new first sample multi-dimensional feature map to obtain offset information, together with the subsequent steps, is executed to obtain a new second sample multi-dimensional feature map; the above steps are then repeated until all convolutional layers of the preset network model have completed the feature extraction step on the new second sample multi-dimensional feature map, as sketched below.
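• This repetition can be sketched as a loop over convolutional blocks in Python (PyTorch); the class and attribute names are hypothetical, and each shifter is assumed to bundle the offset prediction (and weight prediction) with the timing offset, as in the sketches above.

```python
import torch.nn as nn

class InterlacedBackbone(nn.Module):
    """Sketch: before each convolutional block, predict offsets from the
    current feature map and shift it, then run the block; repeat until every
    block has consumed a shifted feature map."""

    def __init__(self, blocks: nn.ModuleList, shifters: nn.ModuleList):
        super().__init__()
        self.blocks, self.shifters = blocks, shifters

    def forward(self, x):
        for block, shifter in zip(self.blocks, self.shifters):
            x = shifter(x)  # offset prediction + timing offset + weighting
            x = block(x)    # yields the next "first (sample) multi-dimensional feature map"
        return x
```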
  • Step S75 Use the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video.
  • the fully connected layer of the preset network model can be used to analyze the second sample multi-dimensional feature map to obtain the analysis result information of the sample video.
• For example, the fully connected layer of the preset network model can be used to perform feature connection on the second sample multi-dimensional feature map, and the softmax layer of the preset network model can be used to perform regression, so as to obtain the probability values of the sample video belonging to each category (such as football match video, ski event video, etc.), or the probability values of the sample video belonging to various behaviors (such as falling, normal walking, running, etc.). Other application scenarios can be deduced by analogy and are not exemplified here.
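• A minimal Python (PyTorch) sketch of this analysis head is given below; pooling the feature map before the fully connected layer is an assumption, as are the class and parameter names.

```python
import torch
import torch.nn as nn

class AnalysisHead(nn.Module):
    """Sketch: fully connected feature connection followed by softmax
    regression over categories (or behaviors)."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (N, C, T, h, w) -> pooled: (N, C) -> probabilities: (N, num_classes)
        pooled = feat.mean(dim=(2, 3, 4))
        return torch.softmax(self.fc(pooled), dim=1)
```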
  • Step S76 Calculate the loss value by using the preset label information and the analysis result information.
• For example, a mean square error (Mean Square Error) loss function or a cross-entropy loss function can be used to calculate the loss value between the preset annotation information and the analysis result information, which is not limited here.
  • Step S77 Adjust the parameters of the preset network model and the offset prediction network based on the loss value.
• In some embodiments, the weight prediction network can also be used to predict the first sample multi-dimensional feature map to obtain weight information, so that the weight information is used to weight the offset feature information, and the second sample multi-dimensional feature map is obtained based on the weighted feature information; based on the loss value, the parameters of the preset network model, the offset prediction network, and the weight prediction network can all be adjusted.
• Specifically, the parameters of the convolutional layers and the fully connected layer in the preset network model can be adjusted, the parameters of the convolutional layer and the fully connected layers in the offset prediction network can be adjusted, and the parameters of the convolutional layer in the weight prediction network can be adjusted.
  • a gradient descent method can be used to adjust the parameters, such as a batch gradient descent method and a stochastic gradient descent method.
  • the above step S72 and subsequent steps may be executed again until the calculated loss value meets the preset training end condition.
• The preset training end condition may include: the loss value is less than a preset loss threshold and no longer decreases; or the number of parameter adjustments reaches a preset times threshold; or a test video is used to verify that the network performance meets a preset requirement (for example, the accuracy rate reaches a preset accuracy threshold). A sketch of such a training loop follows below.
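• Steps S72 to S77 can be sketched as the following Python (PyTorch) training loop; the model is assumed to bundle the preset network model with the offset (and weight) prediction networks so that a single optimizer step adjusts all of their parameters, and to return raw logits; the loader, learning rate, and end-condition values are illustrative.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, max_epochs: int = 50,
          loss_threshold: float = 1e-3) -> None:
    """Sketch of the training procedure: forward pass, loss against the
    preset annotation information (step S76), gradient-descent parameter
    adjustment (step S77), repeated until a preset end condition holds."""
    criterion = nn.CrossEntropyLoss()  # or nn.MSELoss(), as noted above
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(max_epochs):    # epoch cap acts as a "times threshold"
        last_loss = float("inf")
        for videos, labels in loader:
            logits = model(videos)             # analysis result information
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            last_loss = loss.item()
        if last_loss < loss_threshold:         # preset training end condition
            return
```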
• In the above solution, the first sample multi-dimensional feature map is obtained by performing feature extraction on the sample video and contains feature information at different timings corresponding to the sample video; the offset prediction network is used to predict the first sample multi-dimensional feature map to obtain offset information, the offset information is used to perform timing offset on at least part of the feature information of the first sample multi-dimensional feature map, and the second sample multi-dimensional feature map is obtained based on the offset feature information. In this way, the timing information of the sample video can be modeled directly, which is beneficial to improving the speed of model training, and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis on this basis is conducive to subsequently improving the accuracy of video analysis.
  • FIG. 8 is a schematic diagram of a framework of an embodiment of a video analysis device 80 of the present application.
  • the video analysis device 80 includes a video acquisition module 81, a feature extraction module 82, an offset prediction module 83, an offset processing module 84, and a network analysis module 85; among them,
  • the video acquisition module 81 is configured to acquire the video to be analyzed
  • the feature extraction module 82 is configured to perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different timings corresponding to the video to be analyzed;
  • the offset prediction module 83 is configured to use an offset prediction network to predict the first multi-dimensional feature map to obtain offset information
  • the offset processing module 84 is configured to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information;
  • the network analysis module 85 is configured to analyze the second multi-dimensional feature map by using a preset network model to obtain analysis result information of the video to be analyzed.
  • the technical solution of the embodiment of the present application uses a preset network model to process the video to be analyzed, which is beneficial to improve the processing speed of video analysis, and through timing offset, spatial information and timing information can be jointly interleaved, so it is performed on this basis Analysis and processing help improve the accuracy of video analysis.
  • the video analysis device 80 further includes a weight prediction module configured to use a weight prediction network to predict the first multi-dimensional feature map to obtain weight information;
  • the offset processing module 84 is configured to use the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map; use the weight information to perform weighting processing on the offset feature information; based on the weighted feature information , Get the second multi-dimensional feature map.
  • the dimensions of the first multi-dimensional feature map include time series dimensions and preset dimensions
• the offset processing module 84 is configured to select at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, where each group of feature information includes feature information corresponding to different time series in the same preset dimension, and to use the offset information to offset the at least one group of feature information in the time series dimension.
  • the preset dimension is the channel dimension; and/or, the offset information includes a first number of offset values, and at least one set of characteristic information includes a first number of sets of first characteristic information.
• The offset processing module 84 is configured to use the i-th offset value in the offset information to offset the i-th group of first feature information in the time series dimension to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
  • the offset processing module 84 is configured to obtain the value range to which the i-th offset value belongs, and the difference between the upper limit value and the lower limit value of the value range is a preset value
• The timing offset processing unit includes a timing offset processing subunit, configured to: offset the i-th group of first feature information along the time series dimension by the upper-limit number of time series units to obtain the i-th group of third feature information, and offset the i-th group of first feature information along the time series dimension by the lower-limit number of time series units to obtain the i-th group of fourth feature information; use the difference between the i-th offset value and the lower limit value as the weight to weight the i-th group of third feature information to obtain the i-th group first weighted result, and use the difference between the upper limit value and the i-th offset value as the weight to weight the i-th group of fourth feature information to obtain the i-th group second weighted result; and calculate the sum of the i-th group first weighted result and the i-th group second weighted result as the i-th group of second feature information.
  • the video to be analyzed includes a second number of frame images, and the weight information includes a second number of weight values.
• the offset processing module 84 is configured to, for each group of offset feature information, use the j-th weight value in the weight information to weight the feature value corresponding to the j-th time sequence in the current group of feature information to obtain the corresponding weighted group of feature information, where j is a positive integer less than or equal to the second number.
  • the offset processing module 84 is configured to use the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map to form a second multi-dimensional feature map.
  • the weight prediction module is configured to use the first down-sampling layer of the weight prediction network to down-sample the first multi-dimensional feature map to obtain the first down-sampling result; use the weight to predict the first convolutional layer of the network Perform convolution processing on the first down-sampling result to obtain the first feature extraction result; use the first activation layer of the weight prediction network to perform non-linear processing on the first feature extraction result to obtain weight information.
• the offset prediction module 83 is configured to: use the second down-sampling layer of the offset prediction network to down-sample the first multi-dimensional feature map to obtain the second down-sampling result; use the second convolutional layer of the offset prediction network to perform convolution processing on the second down-sampling result to obtain the second feature extraction result; use the first fully connected layer of the offset prediction network to perform feature connection on the second feature extraction result to obtain the first feature connection result; use the second activation layer of the offset prediction network to perform nonlinear processing on the first feature connection result to obtain the nonlinear processing result; use the second fully connected layer of the offset prediction network to perform feature connection on the nonlinear processing result to obtain the second feature connection result; and use the third activation layer of the offset prediction network to perform nonlinear processing on the second feature connection result to obtain the offset information.
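• This layer sequence can be sketched as follows in Python (PyTorch); the pooling choice, kernel size, hidden width, and the ReLU/Sigmoid activations are assumptions (a Sigmoid output keeps each offset value inside a unit-length value range, consistent with the 0-to-1 range used in the examples above).

```python
import torch
import torch.nn as nn

class OffsetPredictionNet(nn.Module):
    """Sketch of the offset prediction network: down-sampling -> convolution
    -> fully connected -> activation -> fully connected -> activation."""

    def __init__(self, channels: int, t: int, num_offsets: int, hidden: int = 64):
        super().__init__()
        self.down = nn.AdaptiveAvgPool3d((None, 1, 1))           # second down-sampling layer
        self.conv = nn.Conv1d(channels, channels, 3, padding=1)  # second convolutional layer
        self.fc1 = nn.Linear(channels * t, hidden)               # first fully connected layer
        self.act1 = nn.ReLU()                                    # second activation layer
        self.fc2 = nn.Linear(hidden, num_offsets)                # second fully connected layer
        self.act2 = nn.Sigmoid()                                 # third activation layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) -> (N, C, T) -> (N, num_offsets) offset values
        y = self.down(x).flatten(2)
        y = self.conv(y).flatten(1)
        return self.act2(self.fc2(self.act1(self.fc1(y))))
```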
  • the preset network model includes at least one convolutional layer
• the feature extraction module 82 is configured to use a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map; if the number of convolutional layers of the preset network model is more than one, a convolutional layer in the preset network model that has not yet performed feature extraction is used to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map;
  • the offset prediction module 83 is further configured to use the offset prediction network to predict the new first multi-dimensional feature map to obtain new offset information;
  • the offset processing module 84 is further configured to use the new offset information to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a new second multi-dimensional feature map based on the offset feature information;
  • the network analysis module 85 is configured to analyze the new second multi-dimensional feature map by using the fully connected layer of the preset network model to obtain analysis result information of the video to be analyzed.
  • the video to be analyzed includes several frames of images
• the feature extraction module 82 is configured to perform feature extraction on the several frames of images using the preset network model to obtain a feature map corresponding to each frame of image, and to splice the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
• FIG. 9 is a schematic diagram of an embodiment of a model training device 90 for video analysis according to the present application.
  • the model training device 90 for video analysis includes a video acquisition module 91, a feature extraction module 92, an offset prediction module 93, an offset processing module 94, a network analysis module 95, a loss calculation module 96, and a parameter adjustment module 97; among them,
  • the video acquisition module 91 is configured to acquire a sample video, where the sample video includes preset annotation information
  • the feature extraction module 92 is configured to perform feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information corresponding to the sample video at different timings;
  • the offset prediction module 93 is configured to use the offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information
  • the offset processing module 94 is configured to use the offset information to perform timing offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain a second sample multi-dimensional feature map based on the offset feature information;
  • the network analysis module 95 is configured to analyze the second sample multi-dimensional feature map by using a preset network model to obtain analysis result information of the sample video;
  • the loss calculation module 96 is configured to calculate a loss value using preset label information and analysis result information
  • the parameter adjustment module 97 is configured to adjust the parameters of the preset network model and the offset prediction network based on the loss value.
• With the above solution, the timing information of the sample video can be modeled directly, which is beneficial to improving the speed of model training, and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis on this basis is conducive to subsequently improving the accuracy of video analysis.
• In some embodiments, the model training device 90 for video analysis may further include other modules to execute the relevant steps in the above-mentioned embodiment of the model training method for video analysis. Similarly, the video analysis device 80 may further include other modules to execute the relevant steps in the above-mentioned embodiment of the video analysis method.
  • FIG. 10 is a schematic diagram of a framework of an embodiment of the electronic device 100 of the present application.
  • the electronic device 100 includes a memory 101 and a processor 102 coupled to each other.
• The processor 102 is configured to execute the program instructions stored in the memory 101 to implement the steps of any of the foregoing video analysis method embodiments, or to implement the steps of any of the foregoing embodiments of the model training method for video analysis.
  • the electronic device 100 may include but is not limited to: a microcomputer and a server.
  • the electronic device 100 may also include mobile devices such as a notebook computer and a tablet computer, which are not limited herein.
  • the processor 102 is configured to control itself and the memory 101 to implement the steps in any of the foregoing video analysis method embodiments, or implement the steps in any of the foregoing model training method embodiments for video analysis.
  • the processor 102 may also be referred to as a central processing unit (Central Processing Unit, CPU).
  • the processor 102 may be an integrated circuit chip with signal processing capability.
  • the processor 102 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA), or other Programmable logic devices, discrete gates or transistor logic devices, discrete hardware components.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
• In addition, the processor 102 may be jointly implemented by multiple integrated circuit chips.
  • FIG. 11 is a schematic diagram of a framework of an embodiment of the computer-readable storage medium 110 of this application.
  • the computer-readable storage medium 110 stores program instructions 1101 that can be executed by a processor, and the program instructions 1101 are used to implement the steps of any of the foregoing video analysis method embodiments, or implement any of the foregoing model training method embodiments for video analysis. Steps in.
  • the computer-readable storage medium may be a volatile or non-volatile storage medium.
• The embodiment of the present application also provides a computer program including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes steps for implementing any of the above-mentioned video analysis method embodiments, or for implementing any of the above-mentioned embodiments of the model training method for video analysis.
  • the disclosed method and device can be implemented in other ways.
• The device implementation described above is only illustrative. For example, the division of modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
• The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
• If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
• Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the methods in the various embodiments of the present application.
• The aforementioned storage media include: a USB flash drive, a mobile hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other media that can store program code.

Abstract

Disclosed in embodiments of the present application are a video analysis method and a related model training method, device and apparatus therefor. The video analysis method comprises: obtaining a video to be analyzed; performing feature extraction on said video by using a preset network model to obtain a first multi-dimensional feature map, wherein the first multi-dimensional feature map comprises feature information on different timings corresponding to said video; predicting the first multi-dimensional feature map by using an offset prediction network to obtain offset information; performing timing offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtaining a second multi-dimensional feature map on the basis of the offset feature information; and analyzing the second multidimensional feature map by using the preset network model to obtain analysis result information of said video.

Description

Video analysis method and related model training method, equipment and device

Cross-references to related applications

This application is filed based on the Chinese patent application with application number 202010053048.4, filed on January 17, 2020, and claims the priority of that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a video analysis method and related model training methods, equipment, and devices.

Background technique

With the development of artificial intelligence technologies such as neural networks and deep learning, training neural network models and using the trained neural network models to complete tasks such as classification and detection has gradually been favored by people.

At present, neural network models are generally designed with static images as the processing object.

Summary of the invention

The embodiments of the present application provide a video analysis method and related model training methods, equipment, and devices.
In a first aspect, an embodiment of the present application provides a video analysis method, including: obtaining a video to be analyzed; performing feature extraction on the video to be analyzed using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different timings corresponding to the video to be analyzed; predicting the first multi-dimensional feature map using an offset prediction network to obtain offset information; performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information, and obtaining a second multi-dimensional feature map based on the offset feature information; and analyzing the second multi-dimensional feature map using the preset network model to obtain analysis result information of the video to be analyzed.

The embodiment of the present application processes the video to be analyzed through the preset network model, which is beneficial to improving the processing speed of video analysis, and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis on this basis is conducive to improving the accuracy of video analysis.
In some optional embodiments of the present application, before the performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information and obtaining the second multi-dimensional feature map based on the offset feature information, the method further includes: predicting the first multi-dimensional feature map using a weight prediction network to obtain weight information. The performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information and obtaining the second multi-dimensional feature map based on the offset feature information includes: performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information; weighting the offset feature information using the weight information; and obtaining the second multi-dimensional feature map based on the weighted feature information.

With the technical solutions of the embodiments of the present application, spatially and temporally interleaved feature information can be obtained directly through the offset and weighting processing steps, which is conducive to improving the processing speed and accuracy of video analysis.
In some optional embodiments of the present application, the dimensions of the first multi-dimensional feature map include a time series dimension and a preset dimension. The performing timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information includes: selecting at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, where each group of feature information includes feature information corresponding to different time series in the same preset dimension; and offsetting the at least one group of feature information in the time series dimension using the offset information.

With the technical solutions of the embodiments of the present application, at least one group of feature information is selected from the first multi-dimensional feature map according to the preset dimension, each group of feature information includes feature information corresponding to different time series in the same preset dimension, and the offset information is used to offset the at least one group of feature information in the time series dimension, so the amount of calculation for offset processing can be reduced, which is further conducive to improving the processing speed of video analysis.
In some optional embodiments of the present application, the preset dimension is a channel dimension; and/or, the offset information includes a first number of offset values, and the at least one group of feature information includes a first number of groups of first feature information. The offsetting the at least one group of feature information in the time series dimension using the offset information includes: offsetting the i-th group of first feature information in the time series dimension using the i-th offset value in the offset information to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.

With the technical solutions of the embodiments of the present application, by performing offset processing on a number of groups of first feature information equal to the number of offset values contained in the offset information, spatially and temporally interleaved feature information can be obtained directly, which is conducive to improving the processing speed and accuracy of video analysis.
In some optional embodiments of the present application, the offsetting the i-th group of first feature information in the time series dimension using the i-th offset value in the offset information to obtain the i-th group of second feature information includes: obtaining the value range to which the i-th offset value belongs, where the difference between the upper limit value and the lower limit value of the value range is a preset value; offsetting the i-th group of first feature information along the time series dimension by the upper-limit number of time series units to obtain the i-th group of third feature information, and offsetting the i-th group of first feature information along the time series dimension by the lower-limit number of time series units to obtain the i-th group of fourth feature information; weighting the i-th group of third feature information using the difference between the i-th offset value and the lower limit value as the weight to obtain the i-th group first weighted result, and weighting the i-th group of fourth feature information using the difference between the upper limit value and the i-th offset value as the weight to obtain the i-th group second weighted result; and calculating the sum of the i-th group first weighted result and the i-th group second weighted result as the i-th group of second feature information.

The technical solutions of the embodiments of the present application can conveniently and quickly perform offset processing on the first feature information, which is conducive to improving the processing speed of video analysis.
In some optional embodiments of the present application, the video to be analyzed includes a second number of frame images, and the weight information includes a second number of weight values. The weighting the offset feature information using the weight information includes: for each group of offset feature information, weighting the feature value corresponding to the j-th time sequence in the current group of feature information using the j-th weight value in the weight information to obtain the corresponding weighted group of feature information, where j is a positive integer less than or equal to the second number.

With the technical solutions of the embodiments of the present application, for each group of offset feature information, the j-th weight value in the weight information is used to weight the feature value corresponding to the j-th time sequence of the current group of feature information to obtain the corresponding weighted group of feature information, so that the feature information can be re-weighted when feature information at some ends is offset out, which is conducive to improving the accuracy of video analysis.
In some optional embodiments of the present application, the obtaining the second multi-dimensional feature map based on the weighted feature information includes: composing the second multi-dimensional feature map using the weighted feature information and the feature information of the first multi-dimensional feature map that is not offset.

With the technical solutions of the embodiments of the present application, combining the weighted feature information and the un-offset feature information of the first multi-dimensional feature map into the second multi-dimensional feature map can reduce the calculation load, which is conducive to improving the processing speed of video analysis.
In some optional embodiments of the present application, the predicting the first multi-dimensional feature map using the weight prediction network to obtain the weight information includes: down-sampling the first multi-dimensional feature map using the first down-sampling layer of the weight prediction network to obtain a first down-sampling result; performing convolution processing on the first down-sampling result using the first convolutional layer of the weight prediction network to obtain a first feature extraction result; and performing nonlinear processing on the first feature extraction result using the first activation layer of the weight prediction network to obtain the weight information.

With the technical solutions of the embodiments of the present application, the weight information can be obtained by processing the first multi-dimensional feature map layer by layer through the first down-sampling layer, the first convolutional layer, and the first activation layer, which can effectively simplify the network structure of the weight prediction network and reduce network parameters, helping to improve the convergence speed during training of the model used for video analysis and to avoid over-fitting, thereby helping to improve the accuracy of video analysis.
In some optional embodiments of the present application, the predicting the first multi-dimensional feature map using the offset prediction network to obtain the offset information includes: down-sampling the first multi-dimensional feature map using the second down-sampling layer of the offset prediction network to obtain a second down-sampling result; performing convolution processing on the second down-sampling result using the second convolutional layer of the offset prediction network to obtain a second feature extraction result; performing feature connection on the second feature extraction result using the first fully connected layer of the offset prediction network to obtain a first feature connection result; performing nonlinear processing on the first feature connection result using the second activation layer of the offset prediction network to obtain a nonlinear processing result; performing feature connection on the nonlinear processing result using the second fully connected layer of the offset prediction network to obtain a second feature connection result; and performing nonlinear processing on the second feature connection result using the third activation layer of the offset prediction network to obtain the offset information.

The technical solutions of the embodiments of the present application can effectively simplify the network structure of the offset prediction network and reduce network parameters, which helps to improve the convergence speed during training of the model used for video analysis and to avoid over-fitting, thereby helping to improve the accuracy of video analysis.
In some optional embodiments of the present application, the preset network model includes at least one convolutional layer, and the performing feature extraction on the video to be analyzed using the preset network model to obtain the first multi-dimensional feature map includes: using a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map. If the number of convolutional layers of the preset network model is more than one, then after the second multi-dimensional feature map is obtained, and before the preset network model is used to analyze the second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, the method further includes: using a convolutional layer in the preset network model that has not yet performed feature extraction to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map; executing the step of using the offset prediction network to predict the new first multi-dimensional feature map to obtain offset information and the subsequent steps, so as to obtain a new second multi-dimensional feature map; and repeating the above steps until all convolutional layers of the preset network model have completed the feature extraction step on the new second multi-dimensional feature map. The analyzing the second multi-dimensional feature map using the preset network model to obtain the analysis result information of the video to be analyzed includes: analyzing the second multi-dimensional feature map using the fully connected layer of the preset network model to obtain the analysis result information of the video to be analyzed.

With the technical solutions of the embodiments of the present application, when the preset network model includes more than one convolutional layer, a convolutional layer in the preset network model that has not yet performed feature extraction is used to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map, and steps such as offset prediction are re-executed until all convolutional layers in the preset network model have completed the feature extraction step on the new second multi-dimensional feature map, so that the fully connected layer of the preset network model analyzes the second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, which in turn can improve the accuracy of video analysis.
In some optional embodiments of the present application, the video to be analyzed includes several frames of images, and the performing feature extraction on the video to be analyzed using the preset network model to obtain the first multi-dimensional feature map includes: performing feature extraction on the several frames of images respectively using the preset network model to obtain a feature map corresponding to each frame of image; and splicing the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.

With the technical solutions of the embodiments of the present application, feature extraction is performed on the several frames of the video to be analyzed through the preset network model to obtain the feature map corresponding to each frame of image, and the several feature maps are directly spliced according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map, which can reduce the processing load of feature extraction on the video to be analyzed and is conducive to improving the processing speed of video analysis.
In a second aspect, an embodiment of the present application provides a model training method for video analysis, including: obtaining a sample video, where the sample video includes preset annotation information; performing feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time sequences corresponding to the sample video; predicting the first sample multi-dimensional feature map by using an offset prediction network to obtain offset information; performing a temporal offset on at least part of the feature information of the first sample multi-dimensional feature map by using the offset information, and obtaining a second sample multi-dimensional feature map based on the offset feature information; analyzing the second sample multi-dimensional feature map by using the preset network model to obtain analysis result information of the sample video; calculating a loss value by using the preset annotation information and the analysis result information; and adjusting parameters of the preset network model and the offset prediction network based on the loss value.
The technical solutions of the embodiments of the present application can directly model the temporal information of the sample video, which helps to increase the training speed of the model; and through the temporal offset, the spatial information and the temporal information can be jointly interleaved, so performing analysis and processing on this basis helps to subsequently improve the accuracy of video analysis.
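For illustration only, one optimization step of this training method might be sketched as follows in PyTorch. The method names `backbone.extract` and `backbone.analyze`, the `shift_fn` helper and the cross-entropy loss are assumptions of the sketch; this description does not fix a concrete loss function or optimizer.

```python
import torch.nn.functional as F

def train_step(backbone, offset_net, shift_fn, optimizer, sample_video, label):
    """One training step: forward through both networks, compute the loss
    against the preset annotation information, and adjust both networks."""
    feat1 = backbone.extract(sample_video)  # first sample multi-dimensional feature map
    offsets = offset_net(feat1)             # offset information
    feat2 = shift_fn(feat1, offsets)        # second sample multi-dimensional feature map
    logits = backbone.analyze(feat2)        # analysis result information
    loss = F.cross_entropy(logits, label)   # loss value from preset annotation information
    optimizer.zero_grad()
    loss.backward()                         # gradients reach both networks
    optimizer.step()                        # adjust parameters of both networks
    return loss.item()
```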
In a third aspect, an embodiment of the present application provides a video analysis device, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module and a network analysis module. The video acquisition module is configured to acquire a video to be analyzed; the feature extraction module is configured to perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different time sequences corresponding to the video to be analyzed; the offset prediction module is configured to predict the first multi-dimensional feature map by using an offset prediction network to obtain offset information; the offset processing module is configured to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information; and the network analysis module is configured to analyze the second multi-dimensional feature map by using the preset network model to obtain analysis result information of the video to be analyzed.
In some optional embodiments of the present application, the device further includes a weight prediction module configured to predict the first multi-dimensional feature map by using a weight prediction network to obtain weight information. The offset processing module is configured to: perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information; perform weighting processing on the offset feature information by using the weight information; and obtain the second multi-dimensional feature map based on the weighted feature information.
In some optional embodiments of the present application, the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension.
The offset processing module is configured to select at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, where each group of feature information includes feature information corresponding to different time sequences in the same preset dimension, and to offset the at least one group of feature information in the temporal dimension by using the offset information.
In some optional embodiments of the present application, the preset dimension is the channel dimension; and/or,
the offset information includes a first number of offset values, and the at least one group of feature information includes a first number of groups of first feature information;
the offset processing module is configured to offset the i-th group of first feature information in the temporal dimension by using the i-th offset value in the offset information, to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
In some optional embodiments of the present application, the offset processing module is configured to: obtain the value range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the value range is a preset value; shift the i-th group of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th group of third feature information, and shift the i-th group of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th group of fourth feature information; perform weighting processing on the i-th group of third feature information with the difference between the i-th offset value and the lower limit as the weight to obtain the i-th group of first weighted results, and perform weighting processing on the i-th group of fourth feature information with the difference between the upper limit and the i-th offset value as the weight to obtain the i-th group of second weighted results; and calculate the sum of the i-th group of first weighted results and the i-th group of second weighted results as the i-th group of second feature information.
In some optional embodiments of the present application, the video to be analyzed includes a second number of frames of images, and the weight information includes the second number of weight values. The offset processing module is configured to, for each group of offset feature information, perform weighting processing on the feature value corresponding to the j-th time sequence in the current group of feature information by using the j-th weight value in the weight information, to obtain the corresponding group of weighted feature information, where j is a positive integer less than or equal to the second number.
In some optional embodiments of the present application, the offset processing module is configured to compose the second multi-dimensional feature map from the weighted feature information and the feature information of the first multi-dimensional feature map that has not been offset.
In some optional embodiments of the present application, the weight prediction module is configured to: downsample the first multi-dimensional feature map by using a first downsampling layer of the weight prediction network to obtain a first downsampling result; perform convolution processing on the first downsampling result by using a first convolutional layer of the weight prediction network to obtain a first feature extraction result; and perform nonlinear processing on the first feature extraction result by using a first activation layer of the weight prediction network to obtain the weight information.
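For illustration only, such a weight prediction network might look as follows; the (N, C, T, H, W) input layout, the kernel size and the use of a sigmoid as the first activation layer are assumptions of the sketch rather than limitations of the present application:

```python
import torch.nn as nn

class WeightPredictionNet(nn.Module):
    """First downsampling layer -> first convolutional layer -> first
    activation layer, producing one weight per time step."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))                # first downsampling layer
        self.conv = nn.Conv1d(channels, 1, kernel_size=3, padding=1)  # first convolutional layer
        self.act = nn.Sigmoid()                                       # first activation layer (assumed sigmoid)

    def forward(self, x):                  # x: (N, C, T, H, W)
        z = self.pool(x).flatten(2)        # first downsampling result: (N, C, T)
        w = self.conv(z)                   # first feature extraction result: (N, 1, T)
        return self.act(w)                 # weight information, one value per time step
```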
In some optional embodiments of the present application, the offset prediction module is configured to: downsample the first multi-dimensional feature map by using a second downsampling layer of the offset prediction network to obtain a second downsampling result; perform convolution processing on the second downsampling result by using a second convolutional layer of the offset prediction network to obtain a second feature extraction result; perform feature connection on the second feature extraction result by using a first fully connected layer of the offset prediction network to obtain a first feature connection result; perform nonlinear processing on the first feature connection result by using a second activation layer of the offset prediction network to obtain a nonlinear processing result; perform feature connection on the nonlinear processing result by using a second fully connected layer of the offset prediction network to obtain a second feature connection result; and perform nonlinear processing on the second feature connection result by using a third activation layer of the offset prediction network to obtain the offset information.
In some optional embodiments of the present application, the preset network model includes at least one convolutional layer. The feature extraction module is configured to perform feature extraction on the video to be analyzed by using a convolutional layer of the preset network model to obtain the first multi-dimensional feature map; if the preset network model has more than one convolutional layer, it is further configured to perform feature extraction on the second multi-dimensional feature map by using a convolutional layer of the preset network model that has not yet performed feature extraction, to obtain a new first multi-dimensional feature map. The offset prediction module is further configured to predict the new first multi-dimensional feature map by using the offset prediction network to obtain new offset information. The offset processing module is further configured to perform a temporal offset on at least part of the feature information of the new first multi-dimensional feature map by using the new offset information, and obtain a new second multi-dimensional feature map based on the offset feature information. The network analysis module is further configured to analyze the new second multi-dimensional feature map by using the fully connected layer of the preset network model to obtain the analysis result information of the video to be analyzed.
In some optional embodiments of the present application, the video to be analyzed includes several frames of images. The feature extraction module is configured to perform feature extraction on each of the several frames of images separately by using the preset network model to obtain a feature map corresponding to each frame of image, and to splice the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
In a fourth aspect, an embodiment of the present application provides a model training device for video analysis, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module, a network analysis module, a loss calculation module and a parameter adjustment module. The video acquisition module is configured to acquire a sample video, where the sample video includes preset annotation information; the feature extraction module is configured to perform feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time sequences corresponding to the sample video; the offset prediction module is configured to predict the first sample multi-dimensional feature map by using an offset prediction network to obtain offset information; the offset processing module is configured to perform a temporal offset on at least part of the feature information of the first sample multi-dimensional feature map by using the offset information, and obtain a second sample multi-dimensional feature map based on the offset feature information; the network analysis module is configured to analyze the second sample multi-dimensional feature map by using the preset network model to obtain analysis result information of the sample video; the loss calculation module is configured to calculate a loss value by using the preset annotation information and the analysis result information; and the parameter adjustment module is configured to adjust parameters of the preset network model and the offset prediction network based on the loss value.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the video analysis method of the first aspect above, or to implement the model training method for video analysis of the second aspect above.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium on which program instructions are stored, where the program instructions, when executed by a processor, implement the video analysis method of the first aspect above, or implement the model training method for video analysis of the second aspect above.
In a seventh aspect, an embodiment of the present application provides a computer program including computer-readable code, where, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the code to implement the video analysis method of the first aspect above, or to implement the model training method for video analysis of the second aspect above.
The technical solutions of the embodiments of the present application can directly model the temporal information of the video to be analyzed, which helps to increase the processing speed of video analysis; and through the temporal offset, the spatial information and the temporal information can be jointly interleaved, so performing analysis and processing on this basis helps to improve the accuracy of video analysis.
Description of the drawings
FIG. 1 is a schematic flowchart of an embodiment of the video analysis method of the present application;
FIG. 2 is a schematic diagram of an embodiment of a video analysis processing procedure;
FIG. 3 is a schematic diagram of an embodiment of the stages of video analysis;
FIG. 4 is a schematic flowchart of an embodiment of step S14 in FIG. 1;
FIG. 5 is a schematic flowchart of another embodiment of the video analysis method of the present application;
FIG. 6 is a schematic diagram of another embodiment of a video analysis processing procedure;
FIG. 7 is a schematic flowchart of an embodiment of the model training method for video analysis of the present application;
FIG. 8 is a schematic framework diagram of an embodiment of the video analysis device of the present application;
FIG. 9 is a schematic framework diagram of an embodiment of the model training device for video analysis of the present application;
FIG. 10 is a schematic framework diagram of an embodiment of the electronic device of the present application;
FIG. 11 is a schematic framework diagram of an embodiment of the computer-readable storage medium of the present application.
Detailed description
The solutions of the embodiments of the present application are described in detail below with reference to the drawings of the specification.
In the following description, specific details such as particular system structures, interfaces and technologies are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist at the same time, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following associated objects. Furthermore, "multiple" herein means two or more than two.
Please refer to FIG. 1, which is a schematic flowchart of an embodiment of the video analysis method of the present application. The video analysis method of the present application may specifically be executed by an electronic device with processing functions, such as a microcomputer, a server or a tablet computer, or implemented by a processor executing program code. Specifically, the method may include the following steps:
Step S11: Obtain the video to be analyzed.
In the embodiments of the present application, the video to be analyzed may include several frames of images; for example, the video to be analyzed may include 8 frames of images, or 16 frames of images, or 24 frames of images, and so on, which is not specifically limited here. In one implementation scenario, the video to be analyzed may be a surveillance video captured by a surveillance camera, so as to analyze the behavior of a target object in the surveillance video, for example, the target object falling down, the target object walking normally, and so on. In another implementation scenario, the video to be analyzed may be a video in a video library, so as to classify the videos in the video library, for example, into football match videos, basketball match videos, skiing videos, and so on.
Step S12: Perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map.
In a specific implementation scenario, in order to further reduce the network parameters and the processing load, thereby increasing the processing speed, increasing the convergence speed during training and avoiding overfitting, the above-mentioned preset network model may be a two-dimensional neural network model, for example, ResNet-50 or ResNet-101, which is not specifically limited here. A ResNet network is built from residual blocks and uses multiple parameterized layers to learn a residual representation between the input and the output.
In the embodiments of the present application, the first multi-dimensional feature map contains feature information at different time sequences corresponding to the video to be analyzed. Please refer to FIG. 2, which is a schematic diagram of an embodiment of a video analysis processing procedure. As shown in FIG. 2, the abscissa represents different time sequences in the temporal dimension T, and the squares corresponding to different time sequences represent the feature information at those time sequences.
In one implementation scenario, the video to be analyzed includes several frames of images. In order to reduce the processing load of feature extraction on the video to be analyzed and increase the processing speed of video analysis, feature extraction may be performed separately on the several frames of the video to be analyzed through the preset network model to obtain a feature map corresponding to each frame, and the several feature maps may be spliced according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map. For example, if the video to be analyzed includes 8 frames of images, the preset network model may be used to perform feature extraction on each of these 8 frames to obtain a feature map for each frame, and the 8 feature maps are then directly spliced according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
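For illustration only, this per-frame scheme can be sketched in PyTorch as follows; the use of torchvision's ResNet-50 and the input sizes are assumptions of the sketch, not limitations of the present application:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# 2D preset network model (illustrative); the pooling/classifier head is
# dropped so that each frame yields a spatial feature map.
cnn = nn.Sequential(*list(resnet50(weights=None).children())[:-2])

video = torch.randn(8, 3, 224, 224)                      # 8 frames to be analyzed
with torch.no_grad():
    maps = [cnn(frame.unsqueeze(0)) for frame in video]  # one (1, 2048, 7, 7) map per frame
feat = torch.stack(maps, dim=2).squeeze(0)               # splice along time: (2048, 8, 7, 7)
print(feat.shape)                                        # first multi-dimensional feature map (C, T, H, W)
```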
Step S13: Predict the first multi-dimensional feature map by using an offset prediction network to obtain offset information.
Unlike conventional static images, videos usually focus more on the continuous behavior and actions of a target object. In order to better capture the inherent temporal semantics of a video, the temporal information and the spatial information in the video can be integrated. Therefore, in the embodiments of the present application, an offset prediction network is used to predict the offset information, so that a temporal offset can subsequently be performed based on this offset information, thereby completing the integration of temporal and spatial information. The offset prediction network may specifically be a preset network model, so that the first multi-dimensional feature map can be predicted through this network model to directly obtain the offset information.
In one implementation scenario, the offset prediction network may include a downsampling layer, a convolutional layer, a fully connected layer, an activation layer, a fully connected layer and an activation layer connected in sequence. The offset prediction network thus contains only five layers, among which only the convolutional layer and the fully connected layers contain network parameters, which can simplify the network structure to a certain extent and reduce the network parameters, thereby reducing the network capacity, increasing the convergence speed and avoiding overfitting, so that the trained model is as accurate as possible, which in turn can improve the accuracy of video analysis.
Exemplarily, the downsampling layer of the offset prediction network (denoted the second downsampling layer) may be used to downsample the first multi-dimensional feature map to obtain a downsampling result (denoted the second downsampling result). In a specific implementation scenario, the downsampling layer may specifically be an average pooling layer, and the dimensions of the first multi-dimensional feature map include a temporal dimension and a preset dimension (for example, the channel dimension); the above downsampling of the first multi-dimensional feature map can then be expressed as:

z_{c,t} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} U_{c,t,h,w}   (1)

In the above formula, c and t index the preset dimension (for example, the channel dimension) and the temporal dimension, respectively; z_{c,t} denotes the (c,t)-th element of the downsampling result; H and W denote the height and width of the feature map; and U_{c,t} denotes the (c,t)-th element of the first multi-dimensional feature map.
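Equation (1) is simply a global spatial average pooling; for illustration, with an assumed (C, T, H, W) layout it reduces to a single mean over the spatial axes:

```python
import torch

U = torch.randn(256, 8, 14, 14)   # first multi-dimensional feature map (C, T, H, W)
z = U.mean(dim=(2, 3))            # z[c, t] = (1/(H*W)) * sum over h, w of U[c, t, h, w]
print(z.shape)                    # (256, 8): one value per (channel, time) pair
```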
Further, the convolutional layer of the offset prediction network (denoted the second convolutional layer) may be used to perform convolution processing on the downsampling result (i.e., the second downsampling result) to obtain a feature extraction result (denoted the second feature extraction result). The convolutional layer of the offset prediction network may specifically contain the same number of convolution kernels as the number of frames of the video to be analyzed, and the size of each convolution kernel may be, for example, 3×3.
Further, the first fully connected layer of the offset prediction network is used to perform feature connection on the feature extraction result (i.e., the second feature extraction result) to obtain a feature connection result (denoted the first feature connection result). The first fully connected layer of the offset prediction network may contain the same number of neurons as the number of frames of the video to be analyzed.
Further, the first activation layer of the offset prediction network (which may be denoted the second activation layer) is used to perform nonlinear processing on the feature connection result (i.e., the first feature connection result) to obtain a nonlinear processing result. The first activation layer of the offset prediction network may be a Rectified Linear Unit (ReLU) activation layer.
Further, the second fully connected layer of the offset prediction network is used to perform feature connection on the nonlinear processing result to obtain a feature connection result (denoted the second feature connection result); the second activation layer of the offset prediction network (which may be denoted the third activation layer) is then used to perform nonlinear processing on the feature connection result (i.e., the second feature connection result) to obtain the offset information. The second activation layer of the offset prediction network may be a Sigmoid activation layer, so that each element of the offset information is constrained to lie between 0 and 1.
The above processing can specifically be expressed as:

offset_{raw} = σ(W_2 δ(W_1 F_{1dconv}(z)))   (2)

In the above formula, z denotes the downsampling result, F_{1dconv} denotes the convolutional layer of the offset prediction network, W_1 denotes the first fully connected layer of the offset prediction network, δ denotes the first activation layer of the offset prediction network, W_2 denotes the second fully connected layer of the offset prediction network, σ denotes the second activation layer of the offset prediction network, and offset_{raw} denotes the offset information.
In another implementation scenario, in order to improve the stability and performance of the model, the offset information produced by the above second activation layer may further be subjected to constraint processing, so that each element of the offset information is constrained to the range (-T/2, T/2), where T denotes the number of frames of the video to be analyzed. Specifically, 0.5 may be subtracted from each element of the offset information obtained by the nonlinear processing of the feature connection result with the second activation layer of the offset prediction network, and the resulting difference may be multiplied by the number of frames of the video to be analyzed, thereby obtaining the constrained offset information. The constraint processing can specifically be expressed as:

offset = (offset_{raw} − 0.5) × T   (3)

In the above formula, offset_{raw} denotes the offset information produced by the second activation layer, T denotes the number of frames of the video to be analyzed, and offset denotes the offset information constrained to (-T/2, T/2).
Step S14: Perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information.
In one implementation scenario, in order to offset the information corresponding to different time sequences within at least part of the feature information, thereby integrating the temporal information and the spatial information and improving the accuracy of video analysis, the at least part of the feature information may specifically be obtained by partitioning along a preset dimension (for example, the channel dimension). As shown in FIG. 2, in order to further reduce the processing load, when the first multi-dimensional feature map has C channels in the channel dimension, the above at least part of the feature information occupies only a fraction of these C channels (the specific fraction is shown in FIG. 2). In addition, the offset information may also be used to perform a temporal offset on all of the feature information of the first multi-dimensional feature map, which is not limited here.
In one implementation scenario, in order to reduce the amount of computation for the offset information and increase the processing speed of video analysis, at least one group of feature information may be selected from the first multi-dimensional feature map according to a preset dimension (for example, the channel dimension), where each group of feature information includes feature information corresponding to different time sequences in the same preset dimension, and the offset information is used to offset the at least one group of feature information in the temporal dimension. In this case, the second fully connected layer of the offset prediction network may contain the same number of neurons as the number of selected groups of feature information, so that the number of elements in the offset information equals the number of selected groups, and each element of the offset information can then be used to offset one group of feature information in the temporal dimension, for example, by one time unit, or by two time units, and so on, which is not specifically limited here.
After the temporal offset has been performed on at least part of the feature information of the first multi-dimensional feature map by using the offset information, the offset part of the feature information may be spliced with the part of the feature information of the first multi-dimensional feature map that has not been temporally offset, thereby obtaining the second multi-dimensional feature map. In a specific implementation scenario, referring to FIG. 2, the feature information obtained by temporally offsetting the selected subset of channels is spliced with the remaining un-offset channels to obtain the second multi-dimensional feature map.
Step S15: Analyze the second multi-dimensional feature map by using the preset network model to obtain analysis result information of the video to be analyzed.
In one implementation scenario, the fully connected layer of the preset network model may be used to perform feature connection on the second multi-dimensional feature map, and the softmax layer of the preset network model may be used to perform regression, so as to obtain the category to which the video to be analyzed belongs (e.g., football match video, skiing video, etc.), or the behavior category of the target object in the video to be analyzed (e.g., walking normally, falling down, running, etc.); other application scenarios can be deduced by analogy and are not enumerated here.
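For illustration only, this analysis step can be sketched as a pooled fully connected head followed by a softmax regression; the feature width and the class count are assumptions of the sketch:

```python
import torch
import torch.nn as nn

head = nn.Linear(2048, 400)                 # fully connected layer (400 classes assumed)

feat2 = torch.randn(1, 2048, 8, 7, 7)       # second multi-dimensional feature map
pooled = feat2.mean(dim=(2, 3, 4))          # collapse time and space: (1, 2048)
probs = torch.softmax(head(pooled), dim=1)  # softmax regression over categories
print(probs.argmax(dim=1))                  # analysis result, e.g. a behaviour class
```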
In one implementation scenario, to facilitate processing, the above offset prediction network may be embedded before a convolutional layer of the preset network model. For example, if the preset network model is ResNet-50, the offset prediction network may be embedded before the convolutional layer in each residual block.
In one implementation scenario, the preset network model may include at least one convolutional layer, so that during feature extraction, one convolutional layer of the preset network model may be used to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map.
In one implementation scenario, in order to improve the accuracy of video analysis, the preset network model may have more than one convolutional layer; for example, the number of convolutional layers may be 2, 3 or 4, and so on. Therefore, before the second multi-dimensional feature map is analyzed to obtain the analysis result information of the video to be analyzed, a convolutional layer of the preset network model that has not yet performed feature extraction may also be used to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map, where the new first multi-dimensional feature map can keep the same dimensionality in the temporal dimension. The step of predicting the new first multi-dimensional feature map with the offset prediction network to obtain offset information, together with the subsequent steps, is then executed to obtain a new second multi-dimensional feature map, and the above steps are repeated until all convolutional layers of the preset network model have completed feature extraction; finally, the fully connected layer of the preset network model is used to analyze the last obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed.
Please refer to FIG. 3, which is a schematic diagram of an embodiment of the stages of video analysis, taking a preset network model with 3 convolutional layers as an example. After the video to be analyzed passes through the first convolutional layer of the preset network model for feature extraction to obtain the first multi-dimensional feature map, a temporal offset is performed through the above-mentioned related steps to obtain the second multi-dimensional feature map. Before the fully connected layer of the preset network model performs analysis, this second multi-dimensional feature map may be further fed into the second convolutional layer for feature extraction to obtain a new first multi-dimensional feature map (denoted the first multi-dimensional feature map in the figure), and the new first multi-dimensional feature map is temporally offset through the above-mentioned related steps to obtain a new second multi-dimensional feature map (denoted the second multi-dimensional feature map in the figure). Similarly, the third convolutional layer performs feature extraction on this new second multi-dimensional feature map to obtain yet another new first multi-dimensional feature map (denoted the first multi-dimensional feature map in the figure), which is again temporally offset through the above-mentioned related steps to obtain a new second multi-dimensional feature map (denoted the second multi-dimensional feature map in the figure). At this point, all three convolutional layers of the preset network model have completed the feature extraction step, and the fully connected layer of the preset network model can be used to analyze the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed. Of course, in other embodiments, in order to reduce the amount of computation, the temporal offset step may also be added only after some of the convolutional layers.
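For illustration only, the alternation of FIG. 3 can be sketched as a loop over convolutional stages, each followed by an offset-predicted temporal shift; every submodule here is a placeholder in the spirit of the earlier sketches, and the stage count is whatever lists are passed in:

```python
import torch.nn as nn

class ShiftPipeline(nn.Module):
    """Each convolutional stage is followed by an offset-predicted temporal
    shift; a fully connected layer analyzes the final feature map."""

    def __init__(self, conv_blocks, offset_nets, shift_fn, classifier):
        super().__init__()
        self.conv_blocks = nn.ModuleList(conv_blocks)  # convolutional layers
        self.offset_nets = nn.ModuleList(offset_nets)  # one offset predictor per stage
        self.shift_fn = shift_fn                       # e.g. the shift_and_concat sketch above
        self.classifier = classifier                   # fully connected layer

    def forward(self, x):                              # x: (N, C, T, H, W)
        for conv, offset_net in zip(self.conv_blocks, self.offset_nets):
            x = conv(x)                                # new first multi-dimensional feature map
            offsets = offset_net(x)                    # offset information
            x = self.shift_fn(x, offsets)              # new second multi-dimensional feature map
        return self.classifier(x.mean(dim=(2, 3, 4)))  # analysis result information
```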
In the above solution, feature extraction is performed on the video to be analyzed to obtain the first multi-dimensional feature map, which contains feature information at different time sequences corresponding to the video to be analyzed; the offset prediction network is used to predict the first multi-dimensional feature map to obtain offset information, the offset information is used to perform a temporal offset on at least part of the feature information of the first multi-dimensional feature map, and the second multi-dimensional feature map is obtained based on the offset feature information. In this way, the temporal information of the video to be analyzed can be modeled directly, which helps to increase the processing speed of video analysis; and through the temporal offset, the spatial information and the temporal information can be jointly interleaved, so performing analysis and processing on this basis helps to improve the accuracy of video analysis.
Please refer to FIG. 4, which is a schematic flowchart of an embodiment of step S14 in FIG. 1. In the embodiments of the present application, the offset information includes a first number of offset values, and at least part of the first multi-dimensional feature map may be divided along the preset dimension (for example, the channel dimension) into a first number of groups of first feature information; that is, the at least one group of feature information includes a first number of groups of first feature information. Offsetting the at least one group of feature information in the temporal dimension by using the offset information may then include: offsetting the i-th group of first feature information in the temporal dimension by using the i-th offset value in the offset information, to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
Referring to FIG. 2, when the at least part of the first multi-dimensional feature map includes 2 groups of first feature information, the first offset value in the offset information may be used to offset the first group of first feature information in the temporal dimension to obtain the first group of second feature information, and the second offset value may be used to offset the second group of first feature information in the temporal dimension to obtain the second group of second feature information; when the above first number takes other values, this can be deduced by analogy and is not enumerated here.
Specifically, offsetting the i-th group of first feature information in the temporal dimension by using the i-th offset value in the offset information to obtain the i-th group of second feature information may include the following steps:
Step S141: Obtain the value range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the value range is a preset value.
In one implementation scenario, the preset value may be 1, the lower limit of the value range is the value obtained by rounding the i-th offset value down, and the upper limit is the value obtained by rounding the i-th offset value up; that is, for the i-th offset value O_i, its value range can be expressed as (n_0, n_0+1), where n_0 ∈ ℕ. For example, when the offset value is 0.8, its value range is 0 to 1; when the offset value is 1.4, its value range is 1 to 2; other offset values can be deduced by analogy and are not enumerated here. In this way, when the offset value is a non-integer, the subsequent temporal-offset processing flow can be simplified.
Step S142: Shift the i-th group of first feature information along the temporal dimension by the upper-limit number of time units to obtain the i-th group of third feature information, and shift the i-th group of first feature information along the temporal dimension by the lower-limit number of time units to obtain the i-th group of fourth feature information.
In the embodiments of the present application, the i-th group of first feature information can be expressed as U_{c,t}; hence, when the value range of the i-th offset value is expressed as (n_0, n_0+1), shifting the i-th group of first feature information along the temporal dimension by the upper-limit number of time units yields the i-th group of third feature information U_{c,t+n_0+1}, and shifting it along the temporal dimension by the lower-limit number of time units yields the i-th group of fourth feature information U_{c,t+n_0}.
In a specific implementation scenario, each offset value may be a non-integer. For example, when the value range of each offset value is 0 to 1, the above upper limit is 1, the lower limit is 0, and the preset value is 1, so for the i-th group of first feature information U_{c,t}, the corresponding third feature information can be expressed as U_{c,t+1} and the corresponding fourth feature information as U_{c,t}. In addition, the range of the first feature information in the temporal dimension is [1, T], where the value of T equals the number of frames of the video to be analyzed; for example, for the first feature information [1 0 0 0 0 0 0 1], T is 8. During the temporal offset, the first feature information might become a zero vector because its feature information is shifted out, which could cause vanishing gradients during training. To alleviate this problem, a buffer can be set for the feature information located in the (0, 1) and (T, T+1) temporal intervals after the offset, so that when feature information is shifted in time beyond T+1, or before 0, the buffer can be fixed to 0. For example, taking the first feature information U_{c,t} = [1 0 0 0 0 0 0 1] as an example, when the i-th offset value is 0.4, its value range is 0 to 1, so the first feature information is shifted by the upper-limit number of time units (i.e., 1) to obtain the corresponding third feature information [0 1 0 0 0 0 0 0], and shifted by the lower-limit number of time units (i.e., 0) to obtain the corresponding fourth feature information [1 0 0 0 0 0 0 1]. Other values of the first feature information and the offset value can be deduced by analogy and are not enumerated here.
Step S143: Perform weighting processing on the i-th group of third feature information with the difference between the i-th offset value and the lower limit as the weight to obtain the i-th group of first weighted results, and perform weighting processing on the i-th group of fourth feature information with the difference between the upper limit and the i-th offset value as the weight to obtain the i-th group of second weighted results.
Taking the i-th offset value denoted O_i as an example, when the value range of the i-th offset value is expressed as (n_0, n_0+1), the difference between the i-th offset value O_i and the lower limit n_0 (i.e., O_i − n_0) is used as the weight to perform weighting processing on the i-th group of third feature information U_{c,t+n_0+1}, yielding the corresponding first weighted result (O_i − n_0) · U_{c,t+n_0+1}; and the difference between the upper limit n_0+1 and the i-th offset value O_i (i.e., n_0+1 − O_i) is used as the weight to perform weighting processing on the i-th group of fourth feature information U_{c,t+n_0}, yielding the corresponding second weighted result (n_0+1 − O_i) · U_{c,t+n_0}.
In a specific implementation scenario, each offset value may be a non-integer. For example, when the value range of each offset value is 0 to 1, the above upper limit is 1, the lower limit is 0, and the preset value is 1, so for the first feature information U_{c,t}, the corresponding third feature information can be expressed as U_{c,t+1} and the corresponding fourth feature information as U_{c,t}; the first weighted result can then be expressed as O_i · U_{c,t+1} and the second weighted result as (1 − O_i) · U_{c,t}. Still taking the first feature information U_{c,t} expressed as the one-dimensional vector [1 0 0 0 0 0 0 1] as an example, when the i-th offset value is 0.4, the corresponding third feature information can be expressed as [0 1 0 0 0 0 0 0] and the corresponding fourth feature information as [1 0 0 0 0 0 0 1], so the first weighted result can be expressed as [0 0.4 0 0 0 0 0 0] and the second weighted result as [0.6 0 0 0 0 0 0 0.6]. Other values of the first feature information and the offset value can be deduced by analogy and are not enumerated here.
Step S144: Calculate the sum of the i-th group of first weighted results and the i-th group of second weighted results as the i-th group of second feature information.
Taking the i-th offset value denoted O_i as an example, the first weighted result can be expressed as (O_i − n_0) · U_{c,t+n_0+1} and the second weighted result as (n_0+1 − O_i) · U_{c,t+n_0}, so the i-th group of second feature information can be expressed as

(n_0+1 − O_i) · U_{c,t+n_0} + (O_i − n_0) · U_{c,t+n_0+1}
In a specific implementation scenario, each offset value may be a non-integer. For example, when the value range of each offset value is 0 to 1, the above upper limit is 1, the lower limit is 0, and the preset value is 1, so for the first feature information U_{c,t}, the first weighted result can be expressed as O_i · U_{c,t+1} and the second weighted result as (1 − O_i) · U_{c,t}, and the i-th group of second feature information can therefore be expressed as (1 − O_i) · U_{c,t} + O_i · U_{c,t+1}. Still taking the first feature information U_{c,t} expressed as the one-dimensional vector [1 0 0 0 0 0 0 1] as an example, when the i-th offset value is 0.4, the corresponding first weighted result can be expressed as [0 0.4 0 0 0 0 0 0] and the corresponding second weighted result as [0.6 0 0 0 0 0 0 0.6], so the i-th group of second feature information can be expressed as [0.6 0.4 0 0 0 0 0 0.6]. Other values of the first feature information and the offset value can be deduced by analogy and are not enumerated here.
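For illustration only, steps S141 to S144 can be reproduced for one group with a few lines of PyTorch; the last line checks the worked example above ([1 0 0 0 0 0 0 1] with offset 0.4):

```python
import math
import torch

def shift1d(v, k):
    """Zero-padded temporal shift of a 1-D tensor by k positions; the zeros
    play the role of the buffer described in step S142."""
    out = torch.zeros_like(v)
    if k == 0:
        out[:] = v
    elif k > 0:
        out[k:] = v[:-k]
    else:
        out[:k] = v[-k:]
    return out

def fractional_shift(u, offset):
    """Steps S141-S144 for one group: decompose a real-valued offset into its
    floor/ceil integer shifts and blend the two shifted copies linearly."""
    n0 = math.floor(offset)                 # lower limit of the value range (S141)
    third = shift1d(u, n0 + 1)              # shifted by the upper limit (S142)
    fourth = shift1d(u, n0)                 # shifted by the lower limit (S142)
    return (offset - n0) * third + (n0 + 1 - offset) * fourth   # S143 + S144

u = torch.tensor([1.0, 0, 0, 0, 0, 0, 0, 1])   # first feature information
print(fractional_shift(u, 0.4))                # tensor([0.6, 0.4, 0, 0, 0, 0, 0, 0.6])
```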
此外，在一个实施场景中，由于以组为单位将每组第一特征信息进行时序偏移，故在训练时可以采用对称偏移的策略，即只训练一半的偏移值，另一半偏移值由已训练的一半经转换计算（例如，颠倒其次序）得到，从而能够减轻训练时的处理负荷。In addition, in an implementation scenario, since each group of first feature information is time-shifted on a per-group basis, a symmetric offset strategy can be adopted during training: only half of the offset values are trained, and the other half are obtained from the trained half by a conversion calculation (for example, reversing their order), which can reduce the processing load during training.
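As a hedged illustration of one such conversion calculation, the sketch below (PyTorch; the offset values are illustrative) trains only the first half of the offsets and mirrors it to obtain the other half:

```python
import torch

half = torch.tensor([0.1, 0.4, 0.7, 0.9])  # the trained half of the offset values
offsets = torch.cat([half, half.flip(0)])  # conversion: reverse order for the other half
print(offsets)
# tensor([0.1000, 0.4000, 0.7000, 0.9000, 0.9000, 0.7000, 0.4000, 0.1000])
```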
区别于前述实施例，通过获取第i个偏移值所属的数值范围，且该数值范围的上限值与下限值之差为一预设数值，将第i组第一特征信息沿时序维度偏移上限值个时序单位，得到第i组第三特征信息，并将第i组第一特征信息沿时序维度偏移下限值个时序单位，得到第i组第四特征信息；以第i个偏移值与下限值之间的差作为权重对第i组第三特征信息进行加权处理，得到第i组第一加权结果，并以上限值与第i个偏移值之间的差作为权重对第i组第四特征信息进行加权处理，得到第i组第二加权结果；计算第i组第一加权结果和第i组第二加权结果之间的和，以作为第i组第二特征信息，进而能够方便、快速地对第一特征信息进行偏移处理，有利于提高视频分析的处理速度。Different from the foregoing embodiments, the value range to which the i-th offset value belongs is acquired, where the difference between the upper limit and the lower limit of this range is a preset value; the i-th group of first feature information is shifted along the time sequence dimension by the upper-limit number of time sequence units to obtain the i-th group of third feature information, and shifted along the time sequence dimension by the lower-limit number of time sequence units to obtain the i-th group of fourth feature information; the difference between the i-th offset value and the lower limit is used as a weight to weight the i-th group of third feature information, yielding the i-th group of first weighted results, and the difference between the upper limit and the i-th offset value is used as a weight to weight the i-th group of fourth feature information, yielding the i-th group of second weighted results; the sum of the i-th group of first weighted results and the i-th group of second weighted results is then computed as the i-th group of second feature information. In this way the first feature information can be shifted conveniently and quickly, which is beneficial to improving the processing speed of video analysis.
请参阅图5,图5是本申请视频分析方法另一实施例的流程示意图。具体而言,可以包括如下步骤:Please refer to FIG. 5, which is a schematic flowchart of another embodiment of the video analysis method of the present application. Specifically, it can include the following steps:
步骤S51:获取待分析视频。Step S51: Obtain the video to be analyzed.
具体可以参阅前述实施例中的相关步骤。For details, please refer to the relevant steps in the foregoing embodiment.
步骤S52:利用预设网络模型对待分析视频进行特征提取,得到第一多维特征图。Step S52: Use the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map.
本申请实施例中,第一多维特征图包含与待分析视频对应的不同时序上的特征信息。具体可以参阅前述实施例中的相关步骤。In this embodiment of the present application, the first multi-dimensional feature map contains feature information in different time series corresponding to the video to be analyzed. For details, please refer to the relevant steps in the foregoing embodiment.
步骤S53:利用偏移预测网络对第一多维特征图进行预测,得到偏移信息。Step S53: Use the offset prediction network to predict the first multi-dimensional feature map to obtain offset information.
请结合参阅图6，图6是视频分析处理过程另一实施例的示意图，如图6所示，第一多维特征图可以经过偏移预测网络进行预测，具体可以参阅前述实施例中的相关步骤。Please refer to FIG. 6, which is a schematic diagram of another embodiment of the video analysis processing process. As shown in FIG. 6, the first multi-dimensional feature map can be fed to the offset prediction network for prediction; for details, please refer to the relevant steps in the foregoing embodiments.
步骤S54:利用权重预测网络对第一多维特征图进行预测,得到权重信息。Step S54: Use the weight prediction network to predict the first multi-dimensional feature map to obtain weight information.
在时序偏移过程中，第一特征信息首末两端的特征可能会被移出，因此为了重新衡量经时序偏移后的第一特征信息中各特征的重要程度，以更好地获取长范围信息，可以采用注意力机制对经时序偏移后的第一特征信息中各特征进行重新加权处理，故需要获取权重信息。请继续结合参阅图6，可以利用权重预测网络对第一多维特征图进行预测，得到权重信息。During the time sequence shift, the features at the two ends of the first feature information may be shifted out. Therefore, in order to re-measure the importance of each feature in the time-shifted first feature information and better capture long-range information, an attention mechanism can be used to re-weight each feature of the time-shifted first feature information, for which weight information needs to be obtained. Continuing to refer to FIG. 6, the weight prediction network can be used to predict the first multi-dimensional feature map to obtain the weight information.
在一个实施场景中，权重预测网络可以包括顺序连接的降采样层、卷积层和激活层。因此，权重预测网络仅包含3层，且其中仅卷积层包含网络参数，可以在一定程度上简化网络结构，并减少网络参数，从而能够降低网络容量，提高收敛速度，避免过拟合，使得训练得到的模型尽可能地准确，进而能够提高视频分析的准确性。In an implementation scenario, the weight prediction network may include a down-sampling layer, a convolutional layer and an activation layer connected in sequence. The weight prediction network thus contains only three layers, of which only the convolutional layer contains network parameters. This simplifies the network structure and reduces the number of network parameters to a certain extent, thereby lowering the network capacity, speeding up convergence and avoiding over-fitting, so that the trained model is as accurate as possible, which in turn improves the accuracy of video analysis.
在一些可选实施例中，所述利用权重预测网络对所述第一多维特征图进行预测，得到权重信息，可以包括：利用权重预测网络的降采样层（记为第一降采样层）对第一多维特征图进行降采样，得到降采样结果（记为第一降采样结果）；利用权重预测网络的卷积层（记为第一卷积层）对降采样结果（即第一降采样结果）进行卷积处理，得到特征提取结果（记为第一特征提取结果）；利用权重预测网络的激活层对特征提取结果（即第一特征提取结果）进行非线性处理，得到权重信息。在一个具体的实施场景中，降采样层可以是平均池化层，具体可以参阅前述实施例中的相关步骤。权重预测网络的卷积层中可以包含1个卷积核，权重预测网络的激活层可以是Sigmoid激活层，从而能够将权重信息中的各个元素约束至0至1之间。In some optional embodiments, using the weight prediction network to predict the first multi-dimensional feature map to obtain the weight information may include: using the down-sampling layer of the weight prediction network (denoted as the first down-sampling layer) to down-sample the first multi-dimensional feature map to obtain a down-sampling result (denoted as the first down-sampling result); using the convolutional layer of the weight prediction network (denoted as the first convolutional layer) to perform convolution processing on the down-sampling result (i.e. the first down-sampling result) to obtain a feature extraction result (denoted as the first feature extraction result); and using the activation layer of the weight prediction network to perform non-linear processing on the feature extraction result (i.e. the first feature extraction result) to obtain the weight information. In a specific implementation scenario, the down-sampling layer may be an average pooling layer; for details, please refer to the relevant steps in the foregoing embodiments. The convolutional layer of the weight prediction network may contain one convolution kernel, and the activation layer may be a Sigmoid activation layer, so that each element of the weight information is constrained to lie between 0 and 1.
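To make this three-layer structure concrete, the following is a minimal PyTorch sketch under stated assumptions: the feature map is laid out as (N, C, T, H, W), spatial average pooling acts as the first down-sampling layer, and the kernel size is illustrative.

```python
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    """Sketch of the weight prediction branch: down-sampling (average pooling),
    one convolutional layer with a single kernel, and a Sigmoid activation."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, 1, kernel_size=3, padding=1)  # one kernel
        self.act = nn.Sigmoid()  # constrains each weight to lie between 0 and 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = x.mean(dim=(3, 4))  # down-sampling: spatial average pooling -> (N, C, T)
        return self.act(self.conv(pooled))  # (N, 1, T): one weight per time sequence
```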
此外，为了便于处理，本申请实施例中的偏移预测网络和权重预测网络可以嵌入在预设网络模型的卷积层之前。例如，预设网络模型为ResNet-50，偏移预测网络和权重预测网络可以嵌入在每个残差块的卷积层之前，从而分别利用第一多维特征图，预测得到偏移信息和权重信息，以便后续偏移与加权处理，从而能够在ResNet-50已有的网络参数的基础上，加入少量的网络参数实现时序信息的建模，有利于降低视频分析的处理负荷，提高视频分析的处理速度，且有利于加快模型训练时的收敛速度，避免过拟合，提高视频分析的准确度。当预设网络模型为其他模型时，可以以此类推，在此不再一一举例。In addition, to facilitate processing, the offset prediction network and the weight prediction network in the embodiments of the present application may be embedded before a convolutional layer of the preset network model. For example, if the preset network model is ResNet-50, the two networks may be embedded before the convolutional layer of each residual block, so that the offset information and the weight information are each predicted from the first multi-dimensional feature map for the subsequent shifting and weighting. Temporal information can thus be modeled by adding only a small number of network parameters on top of the existing parameters of ResNet-50, which helps reduce the processing load and increase the processing speed of video analysis, and also helps speed up convergence during model training, avoid over-fitting and improve the accuracy of video analysis. When the preset network model is another model, the same applies by analogy; no further examples are given here.
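A hedged sketch of such embedding, assuming PyTorch; `InterlacedBlock` and `shift_module` are illustrative names rather than the classes of any actual library:

```python
import torch.nn as nn

class InterlacedBlock(nn.Module):
    """Wrap an existing residual block so that offset/weight prediction and the
    temporal shift run on its input before the block's own convolutions."""
    def __init__(self, block: nn.Module, shift_module: nn.Module):
        super().__init__()
        self.shift = shift_module  # predicts offsets/weights, shifts and re-weights
        self.block = block         # the original ResNet-50 residual block, unchanged

    def forward(self, x):
        return self.block(self.shift(x))
```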
上述步骤S53和步骤S54可以按照先后顺序执行，例如，先执行步骤S53，后执行步骤S54；或者，先执行步骤S54，后执行步骤S53；或者，步骤S53和步骤S54同时执行，在此不做限定。此外，上述步骤S54先于后续的步骤S56执行即可，在此不做限定。The above step S53 and step S54 may be performed in either order: step S53 first and then step S54, or step S54 first and then step S53, or step S53 and step S54 at the same time, which is not limited here. In addition, it suffices that step S54 is performed before the subsequent step S56, which is likewise not limited here.
步骤S55:利用偏移信息对第一多维特征图的至少部分特征信息进行时序偏移。Step S55: Use the offset information to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map.
具体可以参阅前述实施例中的相关步骤。For details, please refer to the relevant steps in the foregoing embodiment.
步骤S56:利用权重信息对偏移后的特征信息进行加权处理。Step S56: Use the weight information to perform weighting processing on the offset feature information.
在一个实施场景中，待分析视频具体可以包括第二数量帧图像，权重信息可以包括第二数量个权重值，第二数量具体可以是8、16、24等等，在此不做具体限定。在加权处理时，即所述利用所述权重信息对偏移后的所述特征信息进行加权处理，包括：可以对偏移后的每组特征信息，分别利用权重信息中的第j个权重值对当前组特征信息中的第j个时序对应的特征值进行加权处理，得到加权处理后的对应组特征信息，其中，j为小于或等于第二数量的正整数。In an implementation scenario, the video to be analyzed may include a second number of frame images, and the weight information may include a second number of weight values; the second number may be 8, 16, 24, etc., which is not specifically limited here. The weighting processing, i.e. using the weight information to weight the shifted feature information, may include: for each shifted group of feature information, using the j-th weight value in the weight information to weight the feature value corresponding to the j-th time sequence in the current group of feature information, obtaining the weighted corresponding group of feature information, where j is a positive integer less than or equal to the second number.
以上述实施例中偏移处理后的特征信息[0.6 0.4 0 0 0 0 0 0.6]为例，权重信息可以为[0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.2]，则分别利用权重信息中的第j个权重值对上述特征信息中的第j个时序对应的特征值进行加权处理后，得到对应组的特征信息为[0.12 0.04 0 0 0 0 0 0.12]。当偏移后的特征信息、权重信息为其他数值时，可以以此类推，在此不再一一举例。Taking the shifted feature information [0.6 0.4 0 0 0 0 0 0.6] from the above embodiment as an example, the weight information may be [0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.2]; weighting the feature value at the j-th time sequence with the j-th weight value then yields the corresponding group of feature information [0.12 0.04 0 0 0 0 0 0.12]. When the shifted feature information or the weight information takes other values, the same applies by analogy; no further examples are given here.
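The weighting is simply an element-wise product over the time sequence; the following PyTorch snippet reproduces the numbers above:

```python
import torch

shifted = torch.tensor([0.6, 0.4, 0., 0., 0., 0., 0., 0.6])  # shifted group feature
weights = torch.tensor([0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2])  # one weight per time step
print(shifted * weights)
# tensor([0.1200, 0.0400, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1200])
```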
步骤S57:基于加权处理后的特征信息,得到第二多维特征图。Step S57: Obtain a second multi-dimensional feature map based on the weighted feature information.
请结合参阅图6，经过时序偏移和加权处理之后，即可得到与第一多维特征图对应的第二多维特征图。在一个实施场景中，所述基于所述加权处理后的所述特征信息，得到第二多维特征图，可以包括：利用加权处理后的特征信息以及第一多维特征图中未被偏移的特征信息，组成第二多维特征图。Referring to FIG. 6, after the time sequence shift and the weighting processing, the second multi-dimensional feature map corresponding to the first multi-dimensional feature map is obtained. In an implementation scenario, obtaining the second multi-dimensional feature map based on the weighted feature information may include: composing the second multi-dimensional feature map from the weighted feature information and the feature information of the first multi-dimensional feature map that was not shifted.
具体地,请结合参阅图2,可以将加权处理后的特征信息与第一多维特征图中未被偏移的特征信息进行拼接处理,得到第二多维特征图。得到的第二多维特征图与第一多维特征图具有相同的尺寸。此外,若第一多维特征图中的特征信息均进行了时序偏移处理,则可以直接将加权处理后的特征信息进行组合,作为第二多维特征图。Specifically, referring to FIG. 2, the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map can be spliced to obtain the second multi-dimensional feature map. The obtained second multi-dimensional feature map has the same size as the first multi-dimensional feature map. In addition, if all the feature information in the first multi-dimensional feature map has undergone time-series offset processing, the weighted feature information can be directly combined to form the second multi-dimensional feature map.
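A minimal sketch of this splicing, assuming PyTorch, a (N, C, T, H, W) layout, and that the shift was applied to channel groups; the names are illustrative:

```python
import torch

def compose_second_feature_map(weighted: torch.Tensor, unshifted: torch.Tensor) -> torch.Tensor:
    """Concatenate the weighted, shifted channel groups with the channels that were
    never shifted; the result has the same size as the first multi-dimensional map."""
    return torch.cat([weighted, unshifted], dim=1)  # splice along the channel dimension
```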
步骤S58:利用预设网络模型对第二多维特征图进行分析,得到待分析视频的分析结果信息。Step S58: Use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
具体可以参阅前述实施例中的相关步骤。For details, please refer to the relevant steps in the foregoing embodiment.
区别于前述实施例，利用权重预测网络对第一多维特征图进行预测，得到权重信息，并利用偏移信息对第一多维特征图的至少部分特征信息进行时序偏移，且利用权重信息对偏移后的特征信息进行加权处理，并基于加权处理后的特征信息，得到第二多维特征图，故通过偏移、加权的处理步骤能够直接得到空间、时序联合交错的特征信息，有利于提高视频分析的处理速度和准确度。Different from the foregoing embodiments, the weight prediction network is used to predict the first multi-dimensional feature map to obtain the weight information, the offset information is used to time-shift at least part of the feature information of the first multi-dimensional feature map, the weight information is used to weight the shifted feature information, and the second multi-dimensional feature map is obtained based on the weighted feature information. The shifting and weighting steps thus directly produce feature information in which spatial and temporal information are jointly interleaved, which is beneficial to improving the processing speed and accuracy of video analysis.
请参阅图7,图7是本申请用于视频分析的模型训练方法一实施例的流程示意图。本申请实施例用于视频分析的模型训练方法具体可以由微型计算机、服务器、平板电脑等具有处理功能的电子设备执行,或者由处理器执行程序代码实现。具体而言,可以包括如下步骤:Please refer to FIG. 7. FIG. 7 is a schematic flowchart of an embodiment of a model training method for video analysis according to the present application. The model training method used for video analysis in the embodiments of the present application can be specifically executed by electronic devices with processing functions such as microcomputers, servers, tablet computers, or implemented by a processor executing program code. Specifically, it can include the following steps:
步骤S71:获取样本视频。Step S71: Obtain a sample video.
本申请实施例中，样本视频包括预设标注信息。以对视频进行行为分析为例，样本视频的预设标注信息可以包括但不限于：摔倒、正常行走、奔跑等标注信息；或者，以对视频进行分类为例，样本视频的预设标注信息可以包括但不限于：足球赛事视频、篮球赛事视频、滑雪赛事视频等标注信息。其他应用场景可以以此类推，在此不再一一举例。In the embodiment of the present application, the sample video includes preset annotation information. Taking behavior analysis of videos as an example, the preset annotation information of the sample video may include, but is not limited to, annotations such as falling, normal walking and running; or, taking video classification as an example, the preset annotation information of the sample video may include, but is not limited to, annotations such as football match video, basketball match video and ski event video. Other application scenarios can be deduced by analogy; no further examples are given here.
本申请实施例中,样本视频可以包括若干帧图像,例如,可以包括8帧图像,或者,也可以包括16帧图像,或者,还可以包括24帧图像,在此不做具体限定。In the embodiment of the present application, the sample video may include several frames of images, for example, may include 8 frames of images, or may also include 16 frames of images, or may also include 24 frames of images, which is not specifically limited here.
步骤S72:利用预设网络模型对样本视频进行特征提取,得到第一样本多维特征图。Step S72: Perform feature extraction on the sample video by using the preset network model to obtain the first sample multi-dimensional feature map.
在一个具体的实施场景中，为了进一步减少网络参数，降低处理负荷，从而提高处理速度，提高训练时收敛速度，避免过拟合，上述预设网络模型可以是二维神经网络模型，例如，ResNet-50、ResNet-101等等，在此不做具体限定。ResNet网络是由残差块（Residual Block）构建的，通过使用多个有参层来学习输入、输出之间的残差表示。In a specific implementation scenario, in order to further reduce network parameters and the processing load, thereby increasing the processing speed, speeding up convergence during training and avoiding over-fitting, the above preset network model may be a two-dimensional neural network model, for example ResNet-50 or ResNet-101, which is not specifically limited here. A ResNet is built from residual blocks, which use several parameterized layers to learn a residual representation between input and output.
本申请实施例中，第一样本多维特征图包含与样本视频对应的不同时序上的特征信息。请结合参阅图2，图2是视频分析处理过程一实施例的示意图。如图2所示，横坐标表示时序维度T上的不同时序，不同时序所对应的方格表示不同时序上的特征信息。在一个实施场景中，样本视频包括若干帧图像。为了降低对样本视频进行特征提取的处理负荷，提高视频分析的处理速度，可以通过预设网络模型分别对样本视频的若干帧图像进行特征提取，得到每一帧图像对应的特征图，从而直接将若干个特征图按照与其对应的图像在样本视频中的时序进行拼接，得到第一样本多维特征图。例如，样本视频包括8帧图像，则可以利用预设网络模型分别对这8帧图像进行特征提取，得到每一帧图像的特征图，从而直接将8张特征图按照与其对应的图像在样本视频中的时序进行拼接，得到第一样本多维特征图。In the embodiment of the present application, the first sample multi-dimensional feature map contains feature information at different time sequences corresponding to the sample video. Please refer to FIG. 2, a schematic diagram of an embodiment of the video analysis processing process. As shown in FIG. 2, the abscissa represents different time sequences in the time sequence dimension T, and the squares at the different time sequences represent the feature information at those time sequences. In an implementation scenario, the sample video includes several frames of images. To reduce the processing load of feature extraction on the sample video and increase the processing speed of video analysis, feature extraction may be performed separately on the several frames of the sample video through the preset network model to obtain a feature map corresponding to each frame, and the several feature maps are then directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map. For example, if the sample video includes 8 frames of images, the preset network model may be used to extract features from each of the 8 frames to obtain a feature map per frame, and the 8 feature maps are directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map.
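A minimal sketch of this per-frame extraction and splicing, assuming PyTorch; the toy backbone and input sizes are illustrative:

```python
import torch
import torch.nn as nn

def first_sample_feature_map(frames: torch.Tensor, backbone: nn.Module) -> torch.Tensor:
    """frames: (T, C, H, W). Returns (1, C', T, H', W'): per-frame feature maps
    spliced along a temporal dimension in frame order."""
    feats = [backbone(frame.unsqueeze(0)) for frame in frames]  # each (1, C', H', W')
    return torch.stack(feats, dim=2)                            # insert the T dimension

frames = torch.randn(8, 3, 32, 32)  # an 8-frame sample video
fmap = first_sample_feature_map(frames, nn.Conv2d(3, 16, 3, padding=1))
print(fmap.shape)  # torch.Size([1, 16, 8, 32, 32])
```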
步骤S73:利用偏移预测网络对第一样本多维特征图进行预测,得到偏移信息。Step S73: Use the offset prediction network to predict the multi-dimensional feature map of the first sample to obtain offset information.
偏移预测网络的网络结构具体可以参考前述实施例中的相关步骤,在此不再赘述。在一个实施场景中,还可以利用权重预测网络对第一样本多维特征图进行预测,得到权重信息,权重预测网络的网络结构可以参考前述实施例中的相关步骤,在此不再赘述。For the specific network structure of the offset prediction network, reference may be made to the relevant steps in the foregoing embodiment, which will not be repeated here. In an implementation scenario, the weight prediction network can also be used to predict the first sample multi-dimensional feature map to obtain weight information. For the network structure of the weight prediction network, refer to the relevant steps in the foregoing embodiment, which will not be repeated here.
步骤S74:利用偏移信息对第一样本多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的特征信息得到第二样本多维特征图。Step S74: Use the offset information to perform time sequence offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain the second sample multi-dimensional feature map based on the offset feature information.
利用偏移信息对第一样本多维特征图的至少部分特征信息进行时序偏移的具体实施步骤,可以参考前述实施例中的相关步骤,在此不再赘述。在一个实施场景中,还可以利用权重信息对偏移后的特征信息进行加权处理,并基于加权处理后的特征信息,得到第二样本多维特征图,具体可以参考前述实施例中的相关步骤,在此不再赘述。For the specific implementation steps of using the offset information to perform timing offset on at least part of the feature information of the first sample multi-dimensional feature map, reference may be made to the relevant steps in the foregoing embodiment, which will not be repeated here. In an implementation scenario, the weight information can also be used to weight the offset feature information, and based on the weighted feature information, the second sample multi-dimensional feature map can be obtained. For details, please refer to the relevant steps in the foregoing embodiment. I won't repeat them here.
在一个实施场景中，预设网络模型可以包括至少一个卷积层，则可以利用预设网络模型的一个卷积层对样本视频进行特征提取，得到第一样本多维特征图。在一个具体的实施场景中，预设网络模型的卷积层的数量可以多于1个，则可以利用预设网络模型中未执行特征提取的卷积层对第二样本多维特征图进行特征提取，得到新的第一样本多维特征图，并执行利用偏移预测网络对新的第一样本多维特征图进行预测，得到偏移信息的步骤以及后续步骤，从而得到新的第二样本多维特征图，进而重复执行上述步骤，直至预设网络模型的所有卷积层均完成对新的第二样本多维特征图的特征提取步骤。In an implementation scenario, the preset network model may include at least one convolutional layer; one convolutional layer of the preset network model is then used to extract features from the sample video to obtain the first sample multi-dimensional feature map. In a specific implementation scenario, where the preset network model has more than one convolutional layer, a convolutional layer of the preset network model that has not yet performed feature extraction may be used to extract features from the second sample multi-dimensional feature map to obtain a new first sample multi-dimensional feature map; the step of predicting the new first sample multi-dimensional feature map with the offset prediction network to obtain offset information, and the subsequent steps, are then performed to obtain a new second sample multi-dimensional feature map. The above steps are repeated until all convolutional layers of the preset network model have completed the feature extraction step on the new second sample multi-dimensional feature map.
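The layer-by-layer loop can be sketched as follows (illustrative names; each shift module stands for the offset prediction, shifting and weighting described above):

```python
def forward_with_shifts(x, conv_stages, shift_modules):
    """Each convolutional stage yields a new first (sample) multi-dimensional
    feature map; offset prediction and temporal shifting turn it into a new
    second feature map before the next stage runs."""
    for stage, shift in zip(conv_stages, shift_modules):
        x = shift(stage(x))
    return x
```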
步骤S75:利用预设网络模型对第二样本多维特征图进行分析,得到样本视频的分析结果信息。Step S75: Use the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video.
具体地，可以利用预设网络模型的全连接层对第二样本多维特征图进行分析，得到样本视频的分析结果信息。在一个实施场景中，可以利用预设网络模型的全连接层对第二样本多维特征图进行特征连接，利用预设网络模型的softmax层进行回归，从而得到样本视频属于各个类别（如，足球赛事视频、滑雪赛事视频等）的概率值，或者得到样本视频属于各种行为（如，摔倒、正常行走、奔跑等）的概率值，其他应用场景中，可以以此类推，在此不再一一举例。Specifically, the fully connected layer of the preset network model may be used to analyze the second sample multi-dimensional feature map to obtain the analysis result information of the sample video. In an implementation scenario, the fully connected layer of the preset network model may be used to perform feature connection on the second sample multi-dimensional feature map, and the softmax layer of the preset network model may be used for regression, yielding the probability that the sample video belongs to each category (e.g. football match video, ski event video) or exhibits each behavior (e.g. falling, normal walking, running). Other application scenarios can be deduced by analogy; no further examples are given here.
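A hedged sketch of this classification head, assuming PyTorch; the feature sizes and the number of classes are illustrative:

```python
import torch
import torch.nn as nn

feature_map = torch.randn(1, 2048, 8, 7, 7)  # (N, C, T, H, W), sizes illustrative
pooled = feature_map.mean(dim=(2, 3, 4))     # global pooling -> (N, 2048)
fc = nn.Linear(2048, 4)                      # fully connected layer, e.g. 4 behaviors
probs = torch.softmax(fc(pooled), dim=1)     # softmax regression: class probabilities
print(probs.sum())                           # sums to 1 up to rounding
```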
步骤S76:利用预设标注信息和分析结果信息计算损失值。Step S76: Calculate the loss value by using the preset label information and the analysis result information.
具体地，可以利用均方误差（Mean Square Error）损失函数，或者交叉熵损失函数对预设标注信息和分析结果信息进行损失值计算，在此不做限定。Specifically, a mean square error (MSE) loss function or a cross-entropy loss function may be used to compute the loss value from the preset annotation information and the analysis result information, which is not limited here.
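For example, with PyTorch's cross-entropy loss (the logits and labels below are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 4)               # analysis result information for 2 sample videos
labels = torch.tensor([0, 3])            # class indices from the preset annotation information
loss = F.cross_entropy(logits, labels)   # the loss value
```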
步骤S77:基于损失值,调整预设网络模型和偏移预测网络的参数。Step S77: Adjust the parameters of the preset network model and the offset prediction network based on the loss value.
在一个实施场景中，如前述步骤，还可以利用权重预测网络对第一样本多维特征图进行预测，得到权重信息，从而利用权重信息对偏移后的特征信息进行加权处理，并基于加权处理后的特征信息，得到第二样本多维特征图；基于损失值，还可以调整预设网络模型和偏移预测网络、权重预测网络的参数。具体地，可以调整预设网络模型中的卷积层、全连接层的参数，并调整偏移预测网络中的卷积层、全连接层的参数，并调整权重预测网络中的卷积层的参数。具体地，可以采用梯度下降法来调整参数，例如批量梯度下降法、随机梯度下降法。In an implementation scenario, as in the foregoing steps, the weight prediction network may also be used to predict the first sample multi-dimensional feature map to obtain weight information; the weight information is used to weight the shifted feature information, and the second sample multi-dimensional feature map is obtained based on the weighted feature information. Based on the loss value, the parameters of the preset network model, the offset prediction network and the weight prediction network can then all be adjusted. Specifically, the parameters of the convolutional layers and fully connected layers in the preset network model, the parameters of the convolutional layers and fully connected layers in the offset prediction network, and the parameters of the convolutional layer in the weight prediction network may be adjusted. A gradient descent method, such as batch gradient descent or stochastic gradient descent, may be used to adjust the parameters.
在一个实施场景中，在调整参数之后，还可以重新执行上述步骤S72以及后续步骤，直至计算得到的损失值满足预设训练结束条件为止。具体地，预设训练结束条件可以包括：损失值小于一预设损失阈值，且损失值不再减小；或者，参数调整次数达到预设次数阈值；或者，利用测试视频测试网络性能达到预设要求（如，准确率达到一预设准确率阈值）。In an implementation scenario, after the parameters are adjusted, the above step S72 and the subsequent steps may be executed again until the computed loss value satisfies a preset training end condition. Specifically, the preset training end condition may include: the loss value is less than a preset loss threshold and no longer decreases; or the number of parameter adjustments reaches a preset count threshold; or the network performance, tested on test videos, meets a preset requirement (e.g. the accuracy reaches a preset accuracy threshold).
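A minimal sketch of this adjust-and-repeat loop, assuming PyTorch; the stand-in modules, learning rate, adjustment cap and loss threshold are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins for the preset network model and the two prediction branches.
model = nn.Linear(16, 4)
offset_net = nn.Linear(16, 8)
weight_net = nn.Linear(16, 8)
data = torch.randn(32, 16)
labels = torch.randint(0, 4, (32,))

params = (list(model.parameters()) + list(offset_net.parameters())
          + list(weight_net.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01)      # (stochastic) gradient descent

for step in range(100):                           # preset adjustment-count cap
    logits = model(data)                          # stands in for steps S72-S75
    loss = F.cross_entropy(logits, labels)        # step S76: compute the loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # step S77: adjust the parameters
    if loss.item() < 0.05:                        # preset training end condition
        break
```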
采用本申请实施例的技术方案，通过对样本视频进行特征提取，得到第一样本多维特征图，且第一样本多维特征图包含与样本视频对应的不同时序上的特征信息，并利用偏移预测网络对第一样本多维特征图进行预测，得到偏移信息，从而利用偏移信息对第一样本多维特征图的至少部分特征信息进行时序偏移，并基于偏移后的特征信息得到第二样本多维特征图，进而能够直接对样本视频的时序信息进行建模，有利于提高模型训练时的速度，且通过时序偏移，能够使空间信息和时序信息联合交错，故在此基础上进行分析处理，有利于后续提高视频分析的准确度。With the technical solution of the embodiment of the present application, feature extraction is performed on the sample video to obtain the first sample multi-dimensional feature map, which contains feature information at different time sequences corresponding to the sample video; the offset prediction network is used to predict the first sample multi-dimensional feature map to obtain offset information; the offset information is used to time-shift at least part of the feature information of the first sample multi-dimensional feature map, and the second sample multi-dimensional feature map is obtained based on the shifted feature information. The temporal information of the sample video can thus be modeled directly, which helps increase the training speed, and through the time sequence shift the spatial and temporal information are jointly interleaved, so performing analysis on this basis helps improve the accuracy of subsequent video analysis.
请结合参阅图8,图8是本申请视频分析装置80一实施例的框架示意图。视频分析装置80包括视频获取模块81、特征提取模块82、偏移预测模块83、偏移处理模块84和网络分析模块85;其中,Please refer to FIG. 8. FIG. 8 is a schematic diagram of a framework of an embodiment of a video analysis device 80 of the present application. The video analysis device 80 includes a video acquisition module 81, a feature extraction module 82, an offset prediction module 83, an offset processing module 84, and a network analysis module 85; among them,
视频获取模块81,配置为获取待分析视频;The video acquisition module 81 is configured to acquire the video to be analyzed;
特征提取模块82,配置为利用预设网络模型对待分析视频进行特征提取,得到第一多维特征图,其中,第一多维特征图包含与待分析视频对应的不同时序上的特征信息;The feature extraction module 82 is configured to perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different timings corresponding to the video to be analyzed;
偏移预测模块83,配置为利用偏移预测网络对第一多维特征图进行预测,得到偏移信息;The offset prediction module 83 is configured to use an offset prediction network to predict the first multi-dimensional feature map to obtain offset information;
偏移处理模块84,配置为利用偏移信息对第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的特征信息得到第二多维特征图;The offset processing module 84 is configured to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information;
网络分析模块85,配置为利用预设网络模型对第二多维特征图进行分析,得到待分析视频的分析结果信息。The network analysis module 85 is configured to analyze the second multi-dimensional feature map by using a preset network model to obtain analysis result information of the video to be analyzed.
本申请实施例的技术方案，通过预设网络模型对待分析视频进行处理，有利于提高视频分析的处理速度，且通过时序偏移，能够使空间信息和时序信息联合交错，故在此基础上进行分析处理，有利于提高视频分析的准确度。With the technical solution of the embodiment of the present application, the video to be analyzed is processed through the preset network model, which is beneficial to increasing the processing speed of video analysis, and through the time sequence shift the spatial and temporal information can be jointly interleaved, so performing analysis on this basis is beneficial to improving the accuracy of video analysis.
在一些实施例中,视频分析装置80还包括权重预测模块,配置为利用权重预测网络对第一多维特征图进行预测,得到权重信息;In some embodiments, the video analysis device 80 further includes a weight prediction module configured to use a weight prediction network to predict the first multi-dimensional feature map to obtain weight information;
偏移处理模块84,配置为利用偏移信息对第一多维特征图的至少部分特征信息进行时序偏移;利用权重信息对偏移后的特征信息进行加权处理;基于加权处理后的特征信息,得到第二多维特征图。The offset processing module 84 is configured to use the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map; use the weight information to perform weighting processing on the offset feature information; based on the weighted feature information , Get the second multi-dimensional feature map.
在一些实施例中，第一多维特征图的维度包括时序维度和预设维度，偏移处理模块84，配置为按照预设维度从第一多维特征图中选择至少一组特征信息，其中，每组特征信息包括同一预设维度上对应不同时序的特征信息，利用偏移信息对至少一组特征信息在时序维度上进行偏移。In some embodiments, the dimensions of the first multi-dimensional feature map include a time sequence dimension and a preset dimension, and the offset processing module 84 is configured to select at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, where each group of feature information includes feature information corresponding to different time sequences in the same preset dimension, and to use the offset information to shift the at least one group of feature information in the time sequence dimension.
在一些实施例中，预设维度为通道维度；和/或，偏移信息包括第一数量个偏移值，至少一组特征信息包括第一数量组第一特征信息，偏移处理模块84，配置为利用偏移信息中第i个偏移值对第i组第一特征信息在时序维度上进行偏移，得第i组第二特征信息，其中，i为小于或等于第一数量的正整数。In some embodiments, the preset dimension is the channel dimension; and/or the offset information includes a first number of offset values, the at least one group of feature information includes a first number of groups of first feature information, and the offset processing module 84 is configured to use the i-th offset value in the offset information to shift the i-th group of first feature information in the time sequence dimension to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
在一些实施例中，偏移处理模块84，配置为获取第i个偏移值所属的数值范围，且数值范围的上限值与下限值之差为一预设数值，时序偏移处理单元包括时序偏移处理子单元，用于将第i组第一特征信息沿时序维度偏移上限值个时序单位，得到第i组第三特征信息，并将第i组第一特征信息沿时序维度偏移下限值个时序单位，得到第i组第四特征信息；以第i个偏移值与下限值之间的差作为权重对第i组第三特征信息进行加权处理，得到第i组第一加权结果，并以上限值与第i个偏移值之间的差作为权重对第i组第四特征信息进行加权处理，得到第i组第二加权结果；计算第i组第一加权结果和第i组第二加权结果之间的和，以作为第i组第二特征信息。In some embodiments, the offset processing module 84 is configured to obtain the value range to which the i-th offset value belongs, the difference between the upper limit and the lower limit of the value range being a preset value, and includes a time sequence offset processing unit with a time sequence offset processing sub-unit configured to: shift the i-th group of first feature information along the time sequence dimension by the upper-limit number of time sequence units to obtain the i-th group of third feature information, and shift the i-th group of first feature information along the time sequence dimension by the lower-limit number of time sequence units to obtain the i-th group of fourth feature information; weight the i-th group of third feature information with the difference between the i-th offset value and the lower limit as the weight to obtain the i-th group of first weighted results, and weight the i-th group of fourth feature information with the difference between the upper limit and the i-th offset value as the weight to obtain the i-th group of second weighted results; and compute the sum of the i-th group of first weighted results and the i-th group of second weighted results as the i-th group of second feature information.
在一些实施例中，待分析视频包括第二数量帧图像，权重信息包括第二数量个权重值，偏移处理模块84，配置为对偏移后的每组特征信息，分别利用权重信息中第j个权重值对当前组特征信息中的第j个时序对应的特征值进行加权处理，得到加权处理后的对应组特征信息；其中，j为小于或等于第二数量的正整数。In some embodiments, the video to be analyzed includes a second number of frame images, the weight information includes a second number of weight values, and the offset processing module 84 is configured to, for each shifted group of feature information, use the j-th weight value in the weight information to weight the feature value corresponding to the j-th time sequence in the current group of feature information, obtaining the weighted corresponding group of feature information, where j is a positive integer less than or equal to the second number.
在一些实施例中,偏移处理模块84,配置为利用加权处理后的特征信息以及第一多维特征图中未被偏移的特征信息,组成第二多维特征图。In some embodiments, the offset processing module 84 is configured to use the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map to form a second multi-dimensional feature map.
在一些实施例中，权重预测模块，配置为利用权重预测网络的第一降采样层对第一多维特征图进行降采样，得到第一降采样结果；利用权重预测网络的第一卷积层对第一降采样结果进行卷积处理，得到第一特征提取结果；利用权重预测网络的第一激活层对第一特征提取结果进行非线性处理，得到权重信息。In some embodiments, the weight prediction module is configured to: down-sample the first multi-dimensional feature map using the first down-sampling layer of the weight prediction network to obtain a first down-sampling result; perform convolution processing on the first down-sampling result using the first convolutional layer of the weight prediction network to obtain a first feature extraction result; and perform non-linear processing on the first feature extraction result using the first activation layer of the weight prediction network to obtain the weight information.
在一些实施例中，偏移预测模块83，配置为利用偏移预测网络的第二降采样层对第一多维特征图进行降采样，得到第二降采样结果；利用偏移预测网络的第二卷积层对第二降采样结果进行卷积处理，得到第二特征提取结果；利用偏移预测网络的第一全连接层对第二特征提取结果进行特征连接，得到第一特征连接结果；利用偏移预测网络的第二激活层对第一特征连接结果进行非线性处理，得到非线性处理结果；利用偏移预测网络的第二全连接层对非线性处理结果进行特征连接，得到第二特征连接结果；利用偏移预测网络的第三激活层对第二特征连接结果进行非线性处理，得到偏移信息。In some embodiments, the offset prediction module 83 is configured to: down-sample the first multi-dimensional feature map using the second down-sampling layer of the offset prediction network to obtain a second down-sampling result; perform convolution processing on the second down-sampling result using the second convolutional layer of the offset prediction network to obtain a second feature extraction result; perform feature connection on the second feature extraction result using the first fully connected layer of the offset prediction network to obtain a first feature connection result; perform non-linear processing on the first feature connection result using the second activation layer of the offset prediction network to obtain a non-linear processing result; perform feature connection on the non-linear processing result using the second fully connected layer of the offset prediction network to obtain a second feature connection result; and perform non-linear processing on the second feature connection result using the third activation layer of the offset prediction network to obtain the offset information.
在一些实施例中，预设网络模型包括至少一个卷积层，特征提取模块82，配置为利用预设网络模型的卷积层对待分析视频进行特征提取，得到第一多维特征图；还配置为若预设网络模型的卷积层的数量多于1，利用预设网络模型中未执行特征提取的卷积层对第二多维特征图进行特征提取，得到新的第一多维特征图；In some embodiments, the preset network model includes at least one convolutional layer, and the feature extraction module 82 is configured to use a convolutional layer of the preset network model to extract features from the video to be analyzed to obtain the first multi-dimensional feature map, and, if the preset network model has more than one convolutional layer, to use a convolutional layer of the preset network model that has not yet performed feature extraction to extract features from the second multi-dimensional feature map to obtain a new first multi-dimensional feature map;
偏移预测模块83,还配置为利用偏移预测网络对新的第一多维特征图进行预测,得到新的偏移信息;The offset prediction module 83 is further configured to use the offset prediction network to predict the new first multi-dimensional feature map to obtain new offset information;
偏移处理模块84,还配置为利用新的偏移信息对第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的特征信息得到新的第二多维特征图;The offset processing module 84 is further configured to use the new offset information to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a new second multi-dimensional feature map based on the offset feature information;
网络分析模块85,配置为利用预设网络模型的全连接层对新的第二多维特征图进行分析,得到待分析视频的分析结果信息。The network analysis module 85 is configured to analyze the new second multi-dimensional feature map by using the fully connected layer of the preset network model to obtain analysis result information of the video to be analyzed.
在一些实施例中，待分析视频包括若干帧图像，特征提取模块82，配置为利用预设网络模型分别对若干帧图像进行特征提取，得到与每一帧图像对应的特征图；将若干个特征图按照与其对应的图像在待分析视频中的时序进行拼接，得到第一多维特征图。In some embodiments, the video to be analyzed includes several frames of images, and the feature extraction module 82 is configured to: use the preset network model to extract features from the several frames of images separately, obtaining a feature map corresponding to each frame of image; and splice the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
请参阅图9，图9是本申请用于视频分析的模型训练装置90一实施例的框架示意图。用于视频分析的模型训练装置90包括视频获取模块91、特征提取模块92、偏移预测模块93、偏移处理模块94、网络分析模块95、损失计算模块96和参数调整模块97；其中，Please refer to FIG. 9, which is a schematic diagram of a framework of an embodiment of a model training device 90 for video analysis according to the present application. The model training device 90 for video analysis includes a video acquisition module 91, a feature extraction module 92, an offset prediction module 93, an offset processing module 94, a network analysis module 95, a loss calculation module 96 and a parameter adjustment module 97, where:
视频获取模块91,配置为获取样本视频,其中,样本视频包括预设标注信息;The video acquisition module 91 is configured to acquire a sample video, where the sample video includes preset annotation information;
特征提取模块92,配置为利用预设网络模型对样本视频进行特征提取,得到第一样本多维特征图,其中,第一样本多维特征图包含与样本视频对应的不同时序上的特征信息;The feature extraction module 92 is configured to perform feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information corresponding to the sample video at different timings;
偏移预测模块93,配置为利用偏移预测网络对第一样本多维特征图进行预测,得到偏移信息;The offset prediction module 93 is configured to use the offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information;
偏移处理模块94,配置为利用偏移信息对第一样本多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的特征信息得到第二样本多维特征图;The offset processing module 94 is configured to use the offset information to perform timing offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain a second sample multi-dimensional feature map based on the offset feature information;
网络分析模块95,配置为利用预设网络模型对第二样本多维特征图进行分析,得到样本视频的分析结果信息;The network analysis module 95 is configured to analyze the second sample multi-dimensional feature map by using a preset network model to obtain analysis result information of the sample video;
损失计算模块96,配置为利用预设标注信息和分析结果信息计算损失值;The loss calculation module 96 is configured to calculate a loss value using preset label information and analysis result information;
参数调整模块97,配置为基于损失值,调整预设网络模型和偏移预测网络的参数。The parameter adjustment module 97 is configured to adjust the parameters of the preset network model and the offset prediction network based on the loss value.
通过上述方案，能够直接对样本视频的时序信息进行建模，有利于提高模型训练时的速度，且通过时序偏移，能够使空间信息和时序信息联合交错，故在此基础上进行分析处理，有利于后续提高视频分析的准确度。Through the above solution, the temporal information of the sample video can be modeled directly, which helps increase the training speed, and through the time sequence shift the spatial and temporal information can be jointly interleaved, so performing analysis on this basis is beneficial to improving the accuracy of subsequent video analysis.
在一些实施例中，用于视频分析的模型训练装置90还可以进一步包括其他模块，以执行上述用于视频分析的模型训练方法实施例中的相关步骤，具体可以参考上述视频分析装置实施例中的相关模块，在此不再赘述。In some embodiments, the model training device 90 for video analysis may further include other modules to execute the relevant steps in the above embodiments of the model training method for video analysis; for details, reference may be made to the related modules in the above video analysis device embodiments, which will not be repeated here.
请参阅图10，图10是本申请电子设备100一实施例的框架示意图。电子设备100包括相互耦接的存储器101和处理器102，处理器102用于执行存储器101中存储的程序指令，以实现上述任一视频分析方法实施例的步骤，或实现上述任一用于视频分析的模型训练方法实施例中的步骤。在一个具体的实施场景中，电子设备100可以包括但不限于：微型计算机、服务器，此外，电子设备100还可以包括笔记本电脑、平板电脑等移动设备，在此不做限定。Please refer to FIG. 10, which is a schematic diagram of a framework of an embodiment of the electronic device 100 of the present application. The electronic device 100 includes a memory 101 and a processor 102 coupled to each other, and the processor 102 is configured to execute program instructions stored in the memory 101 to implement the steps of any of the above video analysis method embodiments, or the steps of any of the above embodiments of the model training method for video analysis. In a specific implementation scenario, the electronic device 100 may include, but is not limited to, a microcomputer or a server; in addition, the electronic device 100 may also include mobile devices such as a notebook computer or a tablet computer, which is not limited here.
具体而言,处理器102用于控制其自身以及存储器101以实现上述任一视频分析方法实施例的步骤,或实现上述任一用于视频分析的模型训练方法实施例中的步骤。处理器102还可以称为中央处理单元(Central Processing Unit,CPU)。处理器102可能是一种集成电路芯片,具有信号的处理能力。处理器102还可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。另外,处理器102可以由集成电路芯片共同实现。Specifically, the processor 102 is configured to control itself and the memory 101 to implement the steps in any of the foregoing video analysis method embodiments, or implement the steps in any of the foregoing model training method embodiments for video analysis. The processor 102 may also be referred to as a central processing unit (Central Processing Unit, CPU). The processor 102 may be an integrated circuit chip with signal processing capability. The processor 102 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA), or other Programmable logic devices, discrete gates or transistor logic devices, discrete hardware components. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like. In addition, the processor 102 may be jointly implemented by an integrated circuit chip.
请参阅图11,图11为本申请计算机可读存储介质110一实施例的框架示意图。计算机可读存储介质110存储有能够被处理器运行的程序指令1101,程序指令1101用于实现上述任一视频分析方法实施例的步骤,或实现上述任一用于视频分析的模型训练方法实施例中的步骤。该计算机可读存储介质可以是易失性或非易失性存储介质。Please refer to FIG. 11, which is a schematic diagram of a framework of an embodiment of the computer-readable storage medium 110 of this application. The computer-readable storage medium 110 stores program instructions 1101 that can be executed by a processor, and the program instructions 1101 are used to implement the steps of any of the foregoing video analysis method embodiments, or implement any of the foregoing model training method embodiments for video analysis. Steps in. The computer-readable storage medium may be a volatile or non-volatile storage medium.
本申请实施例还提供一种计算机程序，包括计算机可读代码，当所述计算机可读代码在电子设备中运行时，所述电子设备中的处理器执行用于实现上述任一视频分析方法实施例的步骤，或实现上述任一用于视频分析的模型训练方法实施例中的步骤。An embodiment of the present application further provides a computer program, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the steps of any of the above video analysis method embodiments, or the steps of any of the above embodiments of the model training method for video analysis.
在本申请所提供的几个实施例中,应该理解到,所揭露的方法和装置,可以通过其它的方式实现。例如,以上所描述的装置实施方式仅仅是示意性的,例如,模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性、机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed method and device can be implemented in other ways. For example, the device implementation described above is only illustrative, for example, the division of modules or units is only a logical function division, and there may be other divisions in actual implementation, for example, units or components can be combined or integrated. To another system, or some features can be ignored, or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施方式方案的目的。The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施方式方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , Including several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) or a processor execute all or part of the steps of the methods in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program code .

Claims (27)

  1. 一种视频分析方法,包括:A video analysis method, including:
    获取待分析视频;Get the video to be analyzed;
    利用预设网络模型对所述待分析视频进行特征提取,得到第一多维特征图,其中,所述第一多维特征图包含与所述待分析视频对应的不同时序上的特征信息;Performing feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map includes feature information in different time series corresponding to the video to be analyzed;
    利用偏移预测网络对所述第一多维特征图进行预测,得到偏移信息;Predicting the first multi-dimensional feature map by using an offset prediction network to obtain offset information;
    利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的所述特征信息得到第二多维特征图;Using the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a second multi-dimensional feature map based on the offset feature information;
    利用所述预设网络模型对所述第二多维特征图进行分析,得到所述待分析视频的分析结果信息。The second multi-dimensional feature map is analyzed by using the preset network model to obtain analysis result information of the video to be analyzed.
  2. 根据权利要求1所述的视频分析方法,其中,在所述利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的所述特征信息得到第二多维特征图之前,所述方法还包括:The video analysis method according to claim 1, wherein, in the step of using the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map, and based on the offset feature Before the information obtains the second multi-dimensional feature map, the method further includes:
    利用权重预测网络对所述第一多维特征图进行预测,得到权重信息;Predicting the first multi-dimensional feature map by using a weight prediction network to obtain weight information;
    所述利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移,并基于偏移后的所述特征信息得到第二多维特征图,包括:The step of using the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map, and obtaining a second multi-dimensional feature map based on the offset feature information, includes:
    利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移;Using the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map;
    利用所述权重信息对偏移后的所述特征信息进行加权处理;Performing weighting processing on the offset feature information by using the weight information;
    基于所述加权处理后的所述特征信息,得到第二多维特征图。Based on the feature information after the weighting process, a second multi-dimensional feature map is obtained.
  3. 根据权利要求1或2所述的视频分析方法,其中,所述第一多维特征图的维度包括时序维度和预设维度;The video analysis method according to claim 1 or 2, wherein the dimensions of the first multi-dimensional feature map include time series dimensions and preset dimensions;
    所述利用所述偏移信息对所述第一多维特征图的至少部分特征信息进行时序偏移,包括:The using the offset information to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map includes:
    按照预设维度从第一多维特征图中选择至少一组特征信息,其中,每组特征信息包括同一预设维度上对应不同时序的特征信息;Selecting at least one set of feature information from the first multi-dimensional feature map according to a preset dimension, where each set of feature information includes feature information corresponding to different time series in the same preset dimension;
    利用所述偏移信息对所述至少一组特征信息在时序维度上进行偏移。The offset information is used to offset the at least one set of feature information in a time series dimension.
  4. 根据权利要求3所述的视频分析方法,其中,所述预设维度为通道维度;和/或,The video analysis method according to claim 3, wherein the preset dimension is a channel dimension; and/or,
    所述偏移信息包括第一数量个偏移值,所述至少一组特征信息包括第一数量组第一特征信息;The offset information includes a first number of offset values, and the at least one set of characteristic information includes a first number of sets of first characteristic information;
    所述利用所述偏移信息对所述至少一组特征信息在时序维度上进行偏移包括:The using the offset information to offset the at least one set of feature information in a time series dimension includes:
    利用所述偏移信息中第i个所述偏移值对第i组所述第一特征信息在所述时序维度上进行偏移，得到第i组第二特征信息，其中，所述i为小于或等于所述第一数量的正整数。Using the i-th offset value in the offset information to offset the i-th group of the first feature information in the time sequence dimension to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
  5. 根据权利要求4所述的视频分析方法，其中，所述利用所述偏移信息中第i个所述偏移值对第i组所述第一特征信息在所述时序维度上进行偏移，得到第i组第二特征信息，包括：The video analysis method according to claim 4, wherein the using the i-th offset value in the offset information to offset the i-th group of the first feature information in the time sequence dimension to obtain the i-th group of second feature information includes:
    获取第i个所述偏移值所属的数值范围,且所述数值范围的上限值与下限值之差为一预设数值;Acquiring the numerical range to which the i-th said offset value belongs, and the difference between the upper limit and the lower limit of the numerical range is a preset value;
    将第i组所述第一特征信息沿所述时序维度偏移所述上限值个时序单位，得到第i组第三特征信息，并将第i组所述第一特征信息沿所述时序维度偏移所述下限值个时序单位，得到第i组第四特征信息；shifting the i-th group of the first feature information along the time sequence dimension by the upper-limit number of time sequence units to obtain the i-th group of third feature information, and shifting the i-th group of the first feature information along the time sequence dimension by the lower-limit number of time sequence units to obtain the i-th group of fourth feature information;
    以第i个所述偏移值与所述下限值之间的差作为权重对第i组所述第三特征信息进行加权处理，得到第i组第一加权结果，并以所述上限值与所述第i个偏移值之间的差作为权重对第i组所述第四特征信息进行加权处理，得到第i组第二加权结果；weighting the i-th group of the third feature information with the difference between the i-th offset value and the lower limit as the weight to obtain the i-th group of first weighted results, and weighting the i-th group of the fourth feature information with the difference between the upper limit and the i-th offset value as the weight to obtain the i-th group of second weighted results;
    计算所述第i组第一加权结果和第i组第二加权结果之间的和,以作为第i组所述第二特征信息。The sum between the first weighted result of the i-th group and the second weighted result of the i-th group is calculated as the second feature information of the i-th group.
  6. 根据权利要求3所述的视频分析方法，其中，所述待分析视频包括第二数量帧图像，所述权重信息包括所述第二数量个权重值；The video analysis method according to claim 3, wherein the video to be analyzed includes a second number of frame images, and the weight information includes the second number of weight values;
    所述利用所述权重信息对偏移后的所述特征信息进行加权处理,包括:The using the weight information to perform weighting processing on the offset feature information includes:
    对偏移后的每组特征信息，分别利用所述权重信息中第j个权重值对当前组特征信息中的第j个时序对应的特征值进行加权处理，得到加权处理后的对应组特征信息；For each shifted group of feature information, using the j-th weight value in the weight information to weight the feature value corresponding to the j-th time sequence in the current group of feature information, to obtain the weighted corresponding group of feature information;
    其中,所述j为小于或等于所述第二数量的正整数。Wherein, the j is a positive integer less than or equal to the second number.
  7. 根据权利要求2至6任一项所述的视频分析方法,其中,所述基于所述加权处理后的所述特征信息,得到第二多维特征图,包括:The video analysis method according to any one of claims 2 to 6, wherein the obtaining a second multi-dimensional feature map based on the feature information after the weighting process comprises:
    利用所述加权处理后的所述特征信息以及所述第一多维特征图中未被偏移的特征信息，组成所述第二多维特征图。The feature information after the weighting process and the feature information that is not shifted in the first multi-dimensional feature map are used to form the second multi-dimensional feature map.
  8. 根据权利要求2至6任一项所述的视频分析方法,其中,所述利用权重预测网络对所述第一多维特征图进行预测,得到权重信息,包括:The video analysis method according to any one of claims 2 to 6, wherein the using a weight prediction network to predict the first multi-dimensional feature map to obtain weight information includes:
    利用所述权重预测网络的第一降采样层对所述第一多维特征图进行降采样,得到第一降采样结果;Down-sampling the first multi-dimensional feature map by using the first down-sampling layer of the weight prediction network to obtain a first down-sampling result;
    利用所述权重预测网络的第一卷积层对所述第一降采样结果进行卷积处理,得到第一特征提取结果;Using the first convolutional layer of the weight prediction network to perform convolution processing on the first down-sampling result to obtain a first feature extraction result;
    利用所述权重预测网络的第一激活层对所述第一特征提取结果进行非线性处理,得到所述权重信息。The first activation layer of the weight prediction network is used to perform non-linear processing on the first feature extraction result to obtain the weight information.
  9. The video analysis method according to any one of claims 1 to 6, wherein the using an offset prediction network to predict the first multi-dimensional feature map to obtain offset information includes:
    down-sampling the first multi-dimensional feature map by using a second down-sampling layer of the offset prediction network to obtain a second down-sampling result;
    performing convolution processing on the second down-sampling result by using a second convolutional layer of the offset prediction network to obtain a second feature extraction result;
    performing feature connection on the second feature extraction result by using a first fully connected layer of the offset prediction network to obtain a first feature connection result;
    performing non-linear processing on the first feature connection result by using a second activation layer of the offset prediction network to obtain a non-linear processing result;
    performing feature connection on the non-linear processing result by using a second fully connected layer of the offset prediction network to obtain a second feature connection result;
    performing non-linear processing on the second feature connection result by using a third activation layer of the offset prediction network to obtain the offset information.
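The offset prediction network can be sketched the same way. Only the six-stage order (down-sample, convolution, fully connected, activation, fully connected, activation) comes from the claim; the hidden width, the ReLU, and the scaled tanh that keeps the predicted offsets in a bounded range are assumptions.

```python
import torch
import torch.nn as nn

class OffsetPredictor(nn.Module):
    """Illustrative offset prediction network; all sizes are assumptions."""
    def __init__(self, channels: int, frames: int, num_groups: int, max_offset: float = 2.0):
        super().__init__()
        self.max_offset = max_offset
        self.down = nn.AdaptiveAvgPool3d((None, 1, 1))                       # second down-sampling layer
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # second convolutional layer
        self.fc1 = nn.Linear(channels * frames, 256)                         # first fully connected layer
        self.act1 = nn.ReLU()                                                # second activation layer
        self.fc2 = nn.Linear(256, num_groups)                                # second fully connected layer
        self.act2 = nn.Tanh()                                                # third activation layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) -> one offset value per shifted channel group
        d = self.down(x).flatten(2)                      # (N, C, T) second down-sampling result
        f = self.conv(d).flatten(1)                      # (N, C*T) second feature extraction result
        h = self.act1(self.fc1(f))                       # feature connection + non-linear processing
        return self.max_offset * self.act2(self.fc2(h))  # (N, num_groups) bounded offset information
```

Bounding the output with a scaled tanh is one way to guarantee that every predicted offset stays inside a known numerical range, which is what the interpolation of claim 5 relies on.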
  10. The video analysis method according to any one of claims 1 to 6, wherein the preset network model includes at least one convolutional layer, and the using the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map includes:
    performing feature extraction on the video to be analyzed by using a convolutional layer of the preset network model to obtain the first multi-dimensional feature map;
    in a case where the number of convolutional layers of the preset network model is more than one, after the obtaining the second multi-dimensional feature map and before the analyzing the second multi-dimensional feature map by using the preset network model to obtain the analysis result information of the video to be analyzed, the method further includes:
    performing feature extraction on the second multi-dimensional feature map by using a convolutional layer of the preset network model that has not yet performed feature extraction, to obtain a new first multi-dimensional feature map;
    performing the step of predicting the new first multi-dimensional feature map by using the offset prediction network to obtain offset information, and the subsequent steps, to obtain a new second multi-dimensional feature map;
    repeating the above steps until all convolutional layers of the preset network model have completed feature extraction on the new second multi-dimensional feature map;
    the analyzing the second multi-dimensional feature map by using the preset network model to obtain the analysis result information of the video to be analyzed includes:
    analyzing the second multi-dimensional feature map by using a fully connected layer of the preset network model to obtain the analysis result information of the video to be analyzed.
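The control flow of this claim is a per-stage loop: each convolutional layer produces a new "first" feature map, an offset module turns it into a new "second" feature map, and a fully connected head analyzes the last one. The sketch below assumes the shift modules defined earlier; the pooling before the head and all names are illustrative.

```python
import torch
import torch.nn as nn

class VideoAnalysisModel(nn.Module):
    """Illustrative pipeline: conv stage -> offset shift, repeated, then an FC head."""
    def __init__(self, conv_stages: nn.ModuleList,
                 shift_modules: nn.ModuleList,
                 head: nn.Linear):
        super().__init__()
        self.conv_stages = conv_stages      # convolutional layers of the preset model
        self.shift_modules = shift_modules  # one offset-predict-and-shift module per stage
        self.head = head                    # fully connected analysis layer

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        x = video                           # (N, C, T, H, W)
        for conv, shift in zip(self.conv_stages, self.shift_modules):
            x = conv(x)     # new "first" multi-dimensional feature map
            x = shift(x)    # new "second" multi-dimensional feature map
        x = x.mean(dim=(2, 3, 4))           # pool time and space before the head (an assumption)
        return self.head(x)                 # analysis result information
```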
  11. The video analysis method according to any one of claims 1 to 6, wherein the video to be analyzed includes several frames of images, and the using a preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map includes:
    performing feature extraction on each of the several frames of images by using the preset network model to obtain a feature map corresponding to each frame of image;
    splicing the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
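A minimal sketch of this frame-wise extraction and time-ordered splicing, assuming a shared 2D backbone (any 2D CNN; the names are illustrative), might look as follows.

```python
import torch
import torch.nn as nn

def build_first_feature_map(frames: torch.Tensor, backbone: nn.Module) -> torch.Tensor:
    # frames: (T, 3, H, W), kept in the order they appear in the video
    per_frame = [backbone(f.unsqueeze(0)) for f in frames]  # each (1, C, H', W')
    stacked = torch.cat(per_frame, dim=0)                   # (T, C, H', W'), spliced in time order
    # rearrange so the time axis sits next to the channel axis: (C, T, H', W')
    return stacked.permute(1, 0, 2, 3)
```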
  12. A model training method for video analysis, including:
    acquiring a sample video, wherein the sample video includes preset annotation information;
    performing feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, wherein the first sample multi-dimensional feature map contains feature information on different time sequences corresponding to the sample video;
    predicting the first sample multi-dimensional feature map by using an offset prediction network to obtain offset information;
    performing time sequence offset on at least part of the feature information of the first sample multi-dimensional feature map by using the offset information, and obtaining a second sample multi-dimensional feature map based on the offset feature information;
    analyzing the second sample multi-dimensional feature map by using the preset network model to obtain analysis result information of the sample video;
    calculating a loss value by using the preset annotation information and the analysis result information;
    adjusting parameters of the preset network model and the offset prediction network based on the loss value.
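This training method reduces to a standard supervised step in which the loss gradient flows through both the preset model and the offset prediction network, so both are adjusted jointly. The cross-entropy loss and the single optimizer over the joint parameters below are assumptions; the claim fixes neither.

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module,             # preset model and offset network together
               optimizer: torch.optim.Optimizer,
               sample_video: torch.Tensor,   # (N, C, T, H, W)
               labels: torch.Tensor):        # preset annotation information
    optimizer.zero_grad()
    analysis_result = model(sample_video)    # forward pass through both networks
    loss = nn.functional.cross_entropy(analysis_result, labels)  # assumed loss choice
    loss.backward()                          # gradients reach both networks
    optimizer.step()                         # adjust the parameters jointly
    return loss.item()
```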
  13. A video analysis device, including:
    a video acquisition module, configured to acquire a video to be analyzed;
    a feature extraction module, configured to perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, wherein the first multi-dimensional feature map contains feature information on different time sequences corresponding to the video to be analyzed;
    an offset prediction module, configured to predict the first multi-dimensional feature map by using an offset prediction network to obtain offset information;
    an offset processing module, configured to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information;
    a network analysis module, configured to analyze the second multi-dimensional feature map by using the preset network model to obtain analysis result information of the video to be analyzed.
  14. The video analysis device according to claim 13, wherein the device further includes a weight prediction module, configured to predict the first multi-dimensional feature map by using a weight prediction network to obtain weight information;
    the offset processing module is configured to: perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information; perform weighting processing on the offset feature information by using the weight information; and obtain the second multi-dimensional feature map based on the weighted feature information.
  15. The video analysis device according to claim 13 or 14, wherein the dimensions of the first multi-dimensional feature map include a time sequence dimension and a preset dimension;
    the offset processing module is configured to select at least one group of feature information from the first multi-dimensional feature map according to the preset dimension, wherein each group of feature information includes feature information corresponding to different time sequences in the same preset dimension, and to offset the at least one group of feature information in the time sequence dimension by using the offset information.
  16. The video analysis device according to claim 15, wherein the preset dimension is a channel dimension; and/or,
    the offset information includes a first number of offset values, and the at least one group of feature information includes a first number of groups of first feature information;
    the offset processing module is configured to offset the i-th group of the first feature information in the time sequence dimension by using the i-th offset value in the offset information to obtain an i-th group of second feature information, wherein i is a positive integer less than or equal to the first number.
  17. The video analysis device according to claim 16, wherein the offset processing module is configured to: acquire a numerical range to which the i-th offset value belongs, the difference between the upper limit value and the lower limit value of the numerical range being a preset value; offset the i-th group of the first feature information along the time sequence dimension by the upper limit number of time sequence units to obtain an i-th group of third feature information, and offset the i-th group of the first feature information along the time sequence dimension by the lower limit number of time sequence units to obtain an i-th group of fourth feature information; perform weighting processing on the i-th group of the third feature information with the difference between the i-th offset value and the lower limit value as a weight to obtain an i-th group first weighted result, and perform weighting processing on the i-th group of the fourth feature information with the difference between the upper limit value and the i-th offset value as a weight to obtain an i-th group second weighted result; and calculate the sum of the i-th group first weighted result and the i-th group second weighted result as the i-th group of the second feature information.
  18. The video analysis device according to claim 15, wherein the video to be analyzed includes a second number of frame images, and the weight information includes the second number of weight values;
    the offset processing module is configured to, for each group of offset feature information, perform weighting processing on the feature value corresponding to the j-th time sequence in the current group of feature information by using the j-th weight value in the weight information, to obtain a corresponding group of weighted feature information, wherein j is a positive integer less than or equal to the second number.
  19. The video analysis device according to any one of claims 14 to 18, wherein the offset processing module is configured to compose the second multi-dimensional feature map by using the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map.
  20. The video analysis device according to any one of claims 14 to 18, wherein the weight prediction module is configured to: down-sample the first multi-dimensional feature map by using a first down-sampling layer of the weight prediction network to obtain a first down-sampling result; perform convolution processing on the first down-sampling result by using a first convolutional layer of the weight prediction network to obtain a first feature extraction result; and perform non-linear processing on the first feature extraction result by using a first activation layer of the weight prediction network to obtain the weight information.
  21. The video analysis device according to any one of claims 13 to 18, wherein the offset prediction module is configured to: down-sample the first multi-dimensional feature map by using a second down-sampling layer of the offset prediction network to obtain a second down-sampling result; perform convolution processing on the second down-sampling result by using a second convolutional layer of the offset prediction network to obtain a second feature extraction result; perform feature connection on the second feature extraction result by using a first fully connected layer of the offset prediction network to obtain a first feature connection result; perform non-linear processing on the first feature connection result by using a second activation layer of the offset prediction network to obtain a non-linear processing result; perform feature connection on the non-linear processing result by using a second fully connected layer of the offset prediction network to obtain a second feature connection result; and perform non-linear processing on the second feature connection result by using a third activation layer of the offset prediction network to obtain the offset information.
  22. The video analysis device according to any one of claims 13 to 18, wherein the preset network model includes at least one convolutional layer;
    the feature extraction module is configured to perform feature extraction on the video to be analyzed by using a convolutional layer of the preset network model to obtain the first multi-dimensional feature map, and, in a case where the number of convolutional layers of the preset network model is more than one, is further configured to perform feature extraction on the second multi-dimensional feature map by using a convolutional layer of the preset network model that has not yet performed feature extraction, to obtain a new first multi-dimensional feature map;
    the offset prediction module is further configured to predict the new first multi-dimensional feature map by using the offset prediction network to obtain new offset information;
    the offset processing module is further configured to perform time sequence offset on at least part of the feature information of the new first multi-dimensional feature map by using the new offset information, and obtain a new second multi-dimensional feature map based on the offset feature information;
    the network analysis module is further configured to analyze the new second multi-dimensional feature map by using a fully connected layer of the preset network model to obtain the analysis result information of the video to be analyzed.
  23. The video analysis device according to any one of claims 13 to 18, wherein the video to be analyzed includes several frames of images;
    the feature extraction module is configured to perform feature extraction on each of the several frames of images by using the preset network model to obtain a feature map corresponding to each frame of image, and to splice the several feature maps according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
  24. A model training device for video analysis, including:
    a video acquisition module, configured to acquire a sample video, wherein the sample video includes preset annotation information;
    a feature extraction module, configured to perform feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, wherein the first sample multi-dimensional feature map contains feature information on different time sequences corresponding to the sample video;
    an offset prediction module, configured to predict the first sample multi-dimensional feature map by using an offset prediction network to obtain offset information;
    an offset processing module, configured to perform time sequence offset on at least part of the feature information of the first sample multi-dimensional feature map by using the offset information, and obtain a second sample multi-dimensional feature map based on the offset feature information;
    a network analysis module, configured to analyze the second sample multi-dimensional feature map by using the preset network model to obtain analysis result information of the sample video;
    a loss calculation module, configured to calculate a loss value by using the preset annotation information and the analysis result information;
    a parameter adjustment module, configured to adjust parameters of the preset network model and the offset prediction network based on the loss value.
  25. An electronic device, including a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the video analysis method according to any one of claims 1 to 11, or to implement the model training method according to claim 12.
  26. A computer-readable storage medium having program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the video analysis method according to any one of claims 1 to 11, or implement the model training method according to claim 12.
  27. A computer program, including computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the video analysis method according to any one of claims 1 to 11, or the model training method according to claim 12.
PCT/CN2020/078656 2020-01-17 2020-03-10 Video analysis method and related model training method, device and apparatus therefor WO2021142904A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020217013635A KR20210093875A (en) Video analysis method and related model training method, device, and apparatus
JP2021521512A JP7096431B2 (en) Video analysis method and related model training method, device, and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010053048.4 2020-01-17
CN202010053048.4A CN111291631B (en) 2020-01-17 2020-01-17 Video analysis method and related model training method, device and apparatus thereof

Publications (1)

Publication Number Publication Date
WO2021142904A1

Family

ID=71025430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/078656 WO2021142904A1 (en) 2020-01-17 2020-03-10 Video analysis method and related model training method, device and apparatus therefor

Country Status (5)

Country Link
JP (1) JP7096431B2 (en)
KR (1) KR20210093875A (en)
CN (1) CN111291631B (en)
TW (1) TWI761813B (en)
WO (1) WO2021142904A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417952B (en) * 2020-10-10 2022-11-11 北京理工大学 Environment video information availability evaluation method of vehicle collision prevention and control system
CN112464898A (en) * 2020-12-15 2021-03-09 北京市商汤科技开发有限公司 Event detection method and device, electronic equipment and storage medium
CN112949449B (en) * 2021-02-25 2024-04-19 北京达佳互联信息技术有限公司 Method and device for training staggered judgment model and method and device for determining staggered image

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626803B2 (en) * 2014-12-12 2017-04-18 Qualcomm Incorporated Method and apparatus for image processing in augmented reality systems
US10707837B2 (en) 2017-07-06 2020-07-07 Analog Photonics LLC Laser frequency chirping structures, methods, and applications
US11248905B2 (en) * 2017-08-16 2022-02-15 Kla-Tencor Corporation Machine learning in metrology measurements
US10430654B1 (en) * 2018-04-20 2019-10-01 Surfline\Wavetrak, Inc. Automated detection of environmental measures within an ocean environment using image data
CN109919025A (en) * 2019-01-30 2019-06-21 华南理工大学 Video scene Method for text detection, system, equipment and medium based on deep learning
CN110084742B (en) * 2019-05-08 2024-01-26 北京奇艺世纪科技有限公司 Parallax map prediction method and device and electronic equipment
CN110660082B (en) * 2019-09-25 2022-03-08 西南交通大学 Target tracking method based on graph convolution and trajectory convolution network learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199902A (en) * 2014-08-27 2014-12-10 中国科学院自动化研究所 Similarity measurement computing method of linear dynamical systems
US20170243058A1 (en) * 2014-10-28 2017-08-24 Watrix Technology Gait recognition method based on deep learning
CN108229522A (en) * 2017-03-07 2018-06-29 北京市商汤科技开发有限公司 Training method, attribute detection method, device and the electronic equipment of neural network
CN108229280A (en) * 2017-04-20 2018-06-29 北京市商汤科技开发有限公司 Time domain motion detection method and system, electronic equipment, computer storage media

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390731A1 (en) * 2020-06-12 2021-12-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for positioning key point, device, and storage medium
US11610389B2 (en) * 2020-06-12 2023-03-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for positioning key point, device, and storage medium

Also Published As

Publication number Publication date
JP2022520511A (en) 2022-03-31
TWI761813B (en) 2022-04-21
CN111291631B (en) 2023-11-07
CN111291631A (en) 2020-06-16
KR20210093875A (en) 2021-07-28
TW202129535A (en) 2021-08-01
JP7096431B2 (en) 2022-07-05


Legal Events

ENP: Entry into the national phase. Ref document number: 2021521512; Country of ref document: JP; Kind code of ref document: A.
121: Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 20913355; Country of ref document: EP; Kind code of ref document: A1.
NENP: Non-entry into the national phase. Ref country code: DE.
122: Ep: pct application non-entry in european phase. Ref document number: 20913355; Country of ref document: EP; Kind code of ref document: A1.