WO2021142904A1 - Video analysis method and related model training method, device and apparatus - Google Patents


Info

Publication number
WO2021142904A1
WO2021142904A1 (PCT/CN2020/078656)
Authority
WO
WIPO (PCT)
Prior art keywords
information
offset
feature
feature map
video
Prior art date
Application number
PCT/CN2020/078656
Other languages
English (en)
Chinese (zh)
Inventor
邵昊
刘宇
Original Assignee
北京市商汤科技开发有限公司
Priority date
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to KR1020217013635A (published as KR20210093875A)
Priority to JP2021521512A (published as JP7096431B2)
Publication of WO2021142904A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/29 - Graphical models, e.g. Bayesian networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Definitions

  • This application relates to the field of artificial intelligence technology, in particular to a video analysis method and related model training methods, equipment, and devices.
  • In the related art, neural network models are generally designed with static images as processing objects.
  • the embodiments of the present application provide a video analysis method and related model training methods, equipment, and devices.
  • an embodiment of the present application provides a video analysis method, including: obtaining a video to be analyzed; using a preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different timings corresponding to the video to be analyzed; using an offset prediction network to predict the first multi-dimensional feature map to obtain offset information; using the offset information to perform a timing offset on at least part of the feature information of the first multi-dimensional feature map, and obtaining a second multi-dimensional feature map based on the offset feature information; and using the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
  • the embodiment of the application processes the video to be analyzed through the preset network model, which is beneficial to improving the processing speed of video analysis, and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis and processing on this basis is conducive to improving the accuracy of video analysis; a rough illustrative sketch of this flow is given below.
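  • The following is a minimal, hypothetical PyTorch-style sketch of the flow summarized above; the names backbone, offset_net, temporal_shift and classifier are illustrative assumptions rather than components defined by the patent:

```python
import torch

def analyze_video(frames, backbone, offset_net, temporal_shift, classifier):
    """Hypothetical end-to-end flow of the described method.

    frames: tensor of shape (T, C, H, W) -- the video to be analyzed.
    backbone, offset_net, classifier: assumed nn.Module instances.
    temporal_shift: function applying the predicted offsets (see later sketches).
    """
    # 1. Per-frame feature extraction with a preset 2D network model, stacked
    #    along the time dimension -> first multi-dimensional feature map.
    feats = torch.stack([backbone(frame.unsqueeze(0)).squeeze(0) for frame in frames])

    # 2. Predict offset information from the first multi-dimensional feature map.
    offsets = offset_net(feats)

    # 3. Shift at least part of the feature information along the time dimension,
    #    producing the second multi-dimensional feature map.
    feats = temporal_shift(feats, offsets)

    # 4. Analyze the shifted feature map to obtain the analysis result information.
    return classifier(feats)
```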
  • the offset information is used to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map, and the second multi-dimensional feature map is obtained based on the offset feature information.
  • in some embodiments, the method further includes: predicting the first multi-dimensional feature map by using a weight prediction network to obtain weight information. In this case, using the offset information to perform a timing offset on at least part of the feature information of the first multi-dimensional feature map and obtaining the second multi-dimensional feature map based on the offset feature information includes: performing the timing offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information; using the weight information to perform weighting processing on the offset feature information; and obtaining the second multi-dimensional feature map based on the weighted feature information.
  • the technical solution of the embodiment of the present application can directly obtain spatially and temporally interleaved feature information through the offset and weighting processing steps, which is beneficial to improving the processing speed and accuracy of video analysis.
  • the dimensions of the first multi-dimensional feature map include a time series dimension and a preset dimension; using the offset information to perform the time sequence offset on at least part of the feature information of the first multi-dimensional feature map includes: selecting at least one set of feature information from the first multi-dimensional feature map according to the preset dimension, where each set of feature information includes feature information corresponding to different time sequences in the same preset dimension; and using the offset information to offset the at least one set of feature information in the time series dimension.
  • selecting at least one set of feature information from the first multi-dimensional feature map according to the preset dimension, where each set of feature information includes feature information corresponding to different time series in the same preset dimension, and using the offset information to offset the at least one set of feature information in the time series dimension can reduce the amount of calculation for the offset processing, which is further conducive to improving the processing speed of video analysis.
  • the preset dimension is a channel dimension; and/or, the offset information includes a first number of offset values, and the at least one set of feature information includes a first number of sets of first feature information; using the offset information to offset the at least one set of feature information in the time sequence dimension includes: using the i-th offset value in the offset information to offset the i-th set of first feature information in the time sequence dimension to obtain the i-th set of second feature information, where i is a positive integer less than or equal to the first number.
  • in this way, spatially and temporally interleaved feature information can be directly obtained, which is conducive to improving the processing speed and accuracy of video analysis.
  • using the i-th offset value in the offset information to offset the i-th set of first feature information in the time series dimension to obtain the i-th set of second feature information includes: acquiring the numerical range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the numerical range is a preset value; shifting the i-th set of first feature information along the time sequence dimension by the upper limit number of time sequence units to obtain the i-th set of third feature information, and shifting the i-th set of first feature information along the time sequence dimension by the lower limit number of time sequence units to obtain the i-th set of fourth feature information; using the difference between the i-th offset value and the lower limit value as a weight to perform weighting processing on the i-th set of third feature information to obtain the i-th set of first weighted results, and using the difference between the upper limit value and the i-th offset value as a weight to perform weighting processing on the i-th set of fourth feature information to obtain the i-th set of second weighted results; and calculating the sum of the i-th set of first weighted results and the i-th set of second weighted results as the i-th set of second feature information.
  • the technical solutions of the embodiments of the present application can easily and quickly perform offset processing on the first feature information, which is beneficial to improve the processing speed of video analysis.
  • the video to be analyzed includes a second number of frame images, and the weight information includes a second number of weight values; using the weight information to perform weighting processing on the offset feature information includes: for each set of feature information after the offset, weighting the feature value corresponding to the j-th time sequence in the current set of feature information by using the j-th weight value in the weight information to obtain the corresponding weighted set of feature information, where j is a positive integer less than or equal to the second number.
  • by weighting, for each set of feature information after the offset, the feature value corresponding to the j-th time sequence with the j-th weight value in the weight information, the technical solution of the embodiment of the present application can re-weight the feature information when feature information at the ends is shifted out, which is beneficial to improving the accuracy of video analysis; a sketch of this per-time-step re-weighting is given below.
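  • The following is a minimal sketch of this per-time-step re-weighting, assuming PyTorch; the shapes and names are illustrative assumptions:

```python
import torch

def reweight_by_timestep(shifted_feats, weights):
    """Multiply the feature values at the j-th time step by the j-th weight value.

    shifted_feats: (C_sel, T, ...) -- sets of feature information after the offset.
    weights: (T,) -- one weight value per frame (the second number of weight values).
    """
    # Broadcast the per-time-step weights over the remaining dimensions.
    view_shape = (1, -1) + (1,) * (shifted_feats.dim() - 2)
    return shifted_feats * weights.view(view_shape)
```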
  • obtaining the second multi-dimensional feature map based on the weighted feature information includes: using the weighted feature information and the feature information in the first multi-dimensional feature map that has not been shifted to constitute the second multi-dimensional feature map.
  • combining the weighted feature information with the non-shifted feature information of the first multi-dimensional feature map into the second multi-dimensional feature map can reduce the calculation load and improve the processing speed of video analysis.
  • predicting the first multi-dimensional feature map using the weight prediction network to obtain the weight information includes: using the first down-sampling layer of the weight prediction network to down-sample the first multi-dimensional feature map to obtain a first down-sampling result; using the first convolutional layer of the weight prediction network to perform convolution processing on the first down-sampling result to obtain a first feature extraction result; and using the first activation layer of the weight prediction network to perform nonlinear processing on the first feature extraction result to obtain the weight information.
  • the first multi-dimensional feature map is processed layer by layer through the first down-sampling layer, the first convolutional layer, and the first activation layer to obtain the weight information, which effectively simplifies the network structure of the weight prediction network and reduces the network parameters; this is beneficial to improving the convergence speed of the model training for video analysis and to avoiding over-fitting, thereby helping to improve the accuracy of video analysis.
  • using the offset prediction network to predict the first multi-dimensional feature map to obtain the offset information includes: using the second down-sampling layer of the offset prediction network to down-sample the first multi-dimensional feature map to obtain a second down-sampling result; using the second convolutional layer of the offset prediction network to perform convolution processing on the second down-sampling result to obtain a second feature extraction result; using the first fully connected layer of the offset prediction network to perform feature connection on the second feature extraction result to obtain a first feature connection result; using the second activation layer of the offset prediction network to perform nonlinear processing on the first feature connection result to obtain a nonlinear processing result; using the second fully connected layer of the offset prediction network to perform feature connection on the nonlinear processing result to obtain a second feature connection result; and using the third activation layer of the offset prediction network to perform nonlinear processing on the second feature connection result to obtain the offset information.
  • the technical solutions of the embodiments of the present application can effectively simplify the network structure of the offset prediction network, reduce network parameters, help improve the convergence speed of the model training used for video analysis, and help avoid overfitting, thereby helping to improve The accuracy of video analysis.
  • the preset network model includes at least one convolutional layer; using the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map includes: using a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map. If the number of convolutional layers of the preset network model is more than one, then after the second multi-dimensional feature map is obtained, and before the preset network model is used to analyze the second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, the method further includes: using a convolutional layer of the preset network model that has not yet performed feature extraction to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map; performing the step of using the offset prediction network to predict the new first multi-dimensional feature map to obtain offset information and the subsequent steps to obtain a new second multi-dimensional feature map; and repeating the above steps until all convolutional layers of the preset network model have performed feature extraction.
  • in this way, the convolutional layers of the preset network model that have not yet performed feature extraction are used to extract features from the second multi-dimensional feature map, and the fully connected layer of the preset network model analyzes the finally obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed, which in turn can improve the accuracy of video analysis.
  • the video to be analyzed includes several frames of images; using the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map includes: using the preset network model to perform feature extraction on the several frames of images to obtain a feature map corresponding to each frame of image; and splicing the several feature maps according to the time sequence of the corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
  • feature extraction is performed on the several frames of the video to be analyzed through the preset network model to obtain a feature map corresponding to each frame of image, and the several feature maps are then directly spliced according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map; this can reduce the processing load of feature extraction for the video to be analyzed, which is beneficial to improving the processing speed of video analysis.
  • an embodiment of the present application provides a model training method for video analysis, including: obtaining a sample video, where the sample video includes preset annotation information; using a preset network model to perform feature extraction on the sample video to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time series corresponding to the sample video; using an offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information; using the offset information to perform a timing offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtaining a second sample multi-dimensional feature map based on the offset feature information; using the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video; calculating a loss value using the preset annotation information and the analysis result information; and adjusting the parameters of the preset network model and the offset prediction network based on the loss value.
  • the technical solution of the embodiment of the present application can directly model the timing information of the sample video, which is beneficial to improving the speed of model training, and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis and processing on this basis is conducive to subsequently improving the accuracy of video analysis.
  • an embodiment of the present application provides a video analysis device, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module, and a network analysis module. The video acquisition module is configured to acquire the video to be analyzed; the feature extraction module is configured to use a preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different timings corresponding to the video to be analyzed; the offset prediction module is configured to use an offset prediction network to predict the first multi-dimensional feature map to obtain offset information; the offset processing module is configured to use the offset information to perform a timing offset on at least part of the feature information of the first multi-dimensional feature map and obtain a second multi-dimensional feature map based on the offset feature information; and the network analysis module is configured to use the preset network model to analyze the second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed.
  • the device further includes a weight prediction module configured to use a weight prediction network to predict the first multi-dimensional feature map to obtain weight information;
  • the offset processing module is configured to use the offset information to perform a timing offset on at least part of the feature information of the first multi-dimensional feature map, use the weight information to perform weighting processing on the offset feature information, and obtain the second multi-dimensional feature map based on the weighted feature information.
  • the dimensions of the first multi-dimensional feature map include time series dimensions and preset dimensions
  • the offset processing module is configured to select at least one set of feature information from the first multi-dimensional feature map according to a preset dimension, wherein each set of feature information includes feature information corresponding to different time series in the same preset dimension; The offset information offsets the at least one set of feature information in the time sequence dimension.
  • the preset dimension is a channel dimension
  • the offset information includes a first number of offset values, and the at least one set of characteristic information includes a first number of sets of first characteristic information;
  • the offset processing module is configured to use the i-th offset value in the offset information to offset the i-th set of first feature information in the time sequence dimension to obtain the i-th set of second feature information, where i is a positive integer less than or equal to the first number.
  • the offset processing module is configured to obtain the numerical range to which the i-th offset value belongs, where the difference between the upper limit and the lower limit of the numerical range is a preset value; shift the i-th set of first feature information along the time sequence dimension by the upper limit number of time sequence units to obtain the i-th set of third feature information, and shift the i-th set of first feature information along the time sequence dimension by the lower limit number of time sequence units to obtain the i-th set of fourth feature information; use the difference between the i-th offset value and the lower limit value as a weight to perform weighting processing on the i-th set of third feature information to obtain the i-th set of first weighted results, and use the difference between the upper limit value and the i-th offset value as a weight to perform weighting processing on the i-th set of fourth feature information to obtain the i-th set of second weighted results; and calculate the sum of the i-th set of first weighted results and the i-th set of second weighted results as the i-th set of second feature information.
  • the video to be analyzed includes a second number of frame images, and the weight information includes a second number of weight values; the offset processing module is configured to, for each set of feature information after the offset, weight the feature value corresponding to the j-th time sequence in the current set of feature information by using the j-th weight value in the weight information to obtain the corresponding weighted set of feature information, where j is a positive integer less than or equal to the second number.
  • the offset processing module is configured to use the feature information after the weighting process and the feature information that has not been offset in the first multi-dimensional feature map to form The second multi-dimensional feature map.
  • the weight prediction module is configured to use the first down-sampling layer of the weight prediction network to down-sample the first multi-dimensional feature map to obtain a first down-sampling result; use the first convolutional layer of the weight prediction network to perform convolution processing on the first down-sampling result to obtain a first feature extraction result; and use the first activation layer of the weight prediction network to perform nonlinear processing on the first feature extraction result to obtain the weight information.
  • the offset prediction module is configured to use the second down-sampling layer of the offset prediction network to down-sample the first multi-dimensional feature map to obtain a second down-sampling result; use the second convolutional layer of the offset prediction network to perform convolution processing on the second down-sampling result to obtain a second feature extraction result; use the first fully connected layer of the offset prediction network to perform feature connection on the second feature extraction result to obtain a first feature connection result; use the second activation layer of the offset prediction network to perform nonlinear processing on the first feature connection result to obtain a nonlinear processing result; use the second fully connected layer of the offset prediction network to perform feature connection on the nonlinear processing result to obtain a second feature connection result; and use the third activation layer of the offset prediction network to perform nonlinear processing on the second feature connection result to obtain the offset information.
  • the preset network model includes at least one convolutional layer; the feature extraction module is configured to perform feature extraction on the video to be analyzed by using a convolutional layer of the preset network model to obtain the first multi-dimensional feature map, and, if the number of convolutional layers of the preset network model is more than one, is further configured to use a convolutional layer of the preset network model that has not yet performed feature extraction to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map; the offset prediction module is further configured to use the offset prediction network to predict the new first multi-dimensional feature map to obtain new offset information; the offset processing module is further configured to use the new offset information to perform a timing offset on at least part of the feature information of the new first multi-dimensional feature map and obtain a new second multi-dimensional feature map based on the offset feature information; and the network analysis module is further configured to use the fully connected layer of the preset network model to analyze the new second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed.
  • the video to be analyzed includes several frames of images; the feature extraction module is configured to use the preset network model to perform feature extraction on the several frames of images to obtain a feature map corresponding to each frame of image, and to splice the several feature maps according to the time sequence of the corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
  • an embodiment of the present application provides a model training device for video analysis, including a video acquisition module, a feature extraction module, an offset prediction module, an offset processing module, a network analysis module, a loss calculation module, and a parameter adjustment module. The video acquisition module is configured to acquire a sample video, where the sample video includes preset annotation information; the feature extraction module is configured to use a preset network model to perform feature extraction on the sample video to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information at different time series corresponding to the sample video; the offset prediction module is configured to use an offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information; the offset processing module is configured to use the offset information to perform a timing offset on at least part of the feature information of the first sample multi-dimensional feature map and obtain a second sample multi-dimensional feature map based on the offset feature information; the network analysis module is configured to use the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video; the loss calculation module is configured to use the preset annotation information and the analysis result information to calculate a loss value; and the parameter adjustment module is configured to adjust the parameters of the preset network model and the offset prediction network based on the loss value.
  • an embodiment of the present application provides an electronic device including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the video analysis method in the above first aspect of the embodiments of the present application, or the model training method for video analysis in the above second aspect.
  • an embodiment of the present application provides a computer-readable storage medium on which program instructions are stored; when the program instructions are executed by a processor, the video analysis method in the above first aspect of the embodiments of the present application is implemented, or the model training method for video analysis in the above second aspect is implemented.
  • an embodiment of the present application provides a computer program, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the code to implement the video analysis method in the above first aspect of the embodiments of the present application, or the model training method for video analysis in the above second aspect.
  • the technical solutions of the embodiments of the present application can directly model the timing information of the video to be analyzed, which is beneficial to improving the processing speed of video analysis, and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis and processing on this basis is conducive to improving the accuracy of video analysis.
  • FIG. 1 is a schematic flowchart of an embodiment of a video analysis method according to the present application
  • Figure 2 is a schematic diagram of an embodiment of a video analysis processing process
  • FIG. 3 is a schematic diagram of an embodiment of each stage of video analysis
  • FIG. 4 is a schematic flowchart of an embodiment of step S14 in FIG. 1;
  • FIG. 5 is a schematic flowchart of another embodiment of the video analysis method of the present application.
  • Fig. 6 is a schematic diagram of another embodiment of a video analysis processing process
  • FIG. 7 is a schematic flowchart of an embodiment of a model training method for video analysis according to the present application.
  • FIG. 8 is a schematic diagram of the framework of an embodiment of the video analysis device of the present application.
  • FIG. 9 is a schematic diagram of the framework of an embodiment of a model training device for video analysis according to the present application.
  • FIG. 10 is a schematic diagram of a framework of an embodiment of an electronic device of the present application.
  • FIG. 11 is a schematic diagram of a framework of an embodiment of a computer-readable storage medium according to the present application.
  • the terms "system" and "network" in this document are often used interchangeably.
  • the term "and/or" in this document is only an association relationship describing the associated objects, and means that three relationships may exist; for example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone.
  • the character "/” in this text generally indicates that the associated objects before and after are in an "or” relationship.
  • "many” in this document means two or more than two.
  • FIG. 1 is a schematic flowchart of an embodiment of a video analysis method according to the present application.
  • the video analysis method of the present application can be specifically executed by electronic devices with processing functions such as microcomputers, servers, tablet computers, or implemented by a processor executing program code. Specifically, it can include the following steps:
  • Step S11 Obtain the video to be analyzed.
  • the video to be analyzed may include several frames of images.
  • the video to be analyzed includes 8 frames of images, or the video to be analyzed includes 16 frames of images, or the video to be analyzed includes 24 frames of images, etc.
  • the video to be analyzed may be a surveillance video shot by a surveillance camera to analyze the behavior of the target object in the surveillance video, for example, the target object falls down, the target object walks normally, and so on.
  • the video to be analyzed may be a video in a video library to classify the videos in the video library, for example, a football match video, a basketball match video, a ski match video, and so on.
  • Step S12 Use the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map.
  • the above-mentioned preset network model may be a two-dimensional neural network model, for example, ResNet-50, ResNet-101, etc.; no specific limitation is made here.
  • the ResNet network is constructed by a Residual Block, which uses multiple parameterized layers to learn the residual representation between input and output.
  • the first multi-dimensional feature map contains feature information in different time series corresponding to the video to be analyzed.
  • FIG. 2 is a schematic diagram of an embodiment of a video analysis processing process. As shown in FIG. 2, the abscissa represents different time series in the time series dimension T, and the squares corresponding to the different time series represent feature information in the different time series.
  • the video to be analyzed includes several frames of images.
  • the feature extraction can be performed on several frames of the video to be analyzed through the preset network model, and the feature map corresponding to each frame of image can be obtained.
  • the feature maps are spliced according to the time sequence of the corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map. For example, if the video to be analyzed includes 8 frames of images, the preset network model can be used to perform feature extraction on the 8 frames of images to obtain the feature map of each frame of image, and the 8 feature maps are then directly spliced according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map; a sketch of this step is given below.
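  • The following is a minimal sketch of this per-frame extraction and splicing step, assuming PyTorch and a torchvision ResNet-50 backbone truncated before its classification head; the names and the (C, T, H, W) layout are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Assumed backbone: a 2D network model whose pooling and classification layers are
# removed so that it outputs a spatial feature map for each frame.
backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])

def build_first_feature_map(frames):
    """frames: (T, 3, H, W) -- e.g. 8 frames of the video to be analyzed."""
    per_frame = [backbone(frame.unsqueeze(0)) for frame in frames]   # each (1, C, H', W')
    # Splice along a new time dimension in the order the frames appear in the video.
    return torch.cat(per_frame, dim=0).permute(1, 0, 2, 3)           # (C, T, H', W')
```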
  • Step S13 Use the offset prediction network to predict the first multi-dimensional feature map to obtain offset information.
  • the offset prediction network is used to predict the offset information, so as to subsequently perform a time sequence offset based on the offset information, so as to complete the integration of time information and space.
  • the offset prediction network may specifically be a preset network model, so that the first multi-dimensional feature map can be predicted through the preset network model, and the offset information can be directly obtained.
  • the offset prediction network may include a down-sampling layer, a convolutional layer, a fully connected layer, an activation layer, a fully connected layer, and an activation layer that are sequentially connected. Therefore, the offset prediction network contains only 5 layers, and only the convolutional layer and the fully connected layers contain network parameters, which can simplify the network structure to a certain extent and reduce the network parameters, thereby reducing the network capacity, improving the convergence speed, and avoiding over-fitting, so that the trained model is as accurate as possible, which in turn improves the accuracy of video analysis.
  • the down-sampling layer (denoted as the second down-sampling layer) of the offset prediction network may be used to down-sample the first multi-dimensional feature map to obtain the down-sampling result (denoted as the second down-sampling result).
  • the down-sampling layer may specifically be an average pooling layer, and the dimensions of the first multi-dimensional feature map include a time series dimension and a preset dimension (for example, a channel dimension). The down-sampling is performed over the spatial dimensions, and the down-sampling result can be expressed as: z_{c,t} = (1 / (H × W)) · Σ_{h=1..H} Σ_{w=1..W} U_{c,t,h,w}
  • where c and t respectively index the preset dimension (for example, the channel dimension) and the time series dimension, z_{c,t} represents the (c, t)-th element in the down-sampling result, H and W represent the height and width of the feature map, and U_{c,t} represents the (c, t)-th spatial slice of the first multi-dimensional feature map; a sketch of this pooling step is given below.
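  • The following is a minimal sketch of this spatial average pooling, assuming PyTorch and a (C, T, H, W) layout for the first multi-dimensional feature map U (the layout is an assumption, not fixed by the text):

```python
import torch

def downsample(U):
    """Average pooling over the spatial dimensions: z[c, t] = mean over (h, w) of U[c, t, h, w]."""
    return U.mean(dim=(-2, -1))   # shape (C, T)
```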
  • the convolutional layer of the offset prediction network (denoted as the second convolutional layer) can be used to perform convolution processing on the down-sampling result (ie, the second down-sampling result) to obtain the feature extraction result (denoted as the second feature extraction result).
  • the convolution layer of the offset prediction network may specifically include the same number of convolution kernels as the number of frames of the video to be analyzed, and the size of the convolution kernel may be 3*3, for example.
  • the first fully connected layer (denoted as the first fully connected layer) of the offset prediction network is used to perform feature connection on the feature extraction result (that is, the second feature extraction result) to obtain the feature connection result (denoted as the first feature Connection result).
  • the first fully connected layer of the offset prediction network may contain the same number of neurons as the number of frames of the video to be analyzed.
  • the first activation layer (which can be denoted as the second activation layer) of the offset prediction network is used to perform nonlinear processing on the feature connection result (that is, the first feature connection result) to obtain the nonlinear processing result.
  • the first activation layer of the offset prediction network may be a linear rectification function (Rectified Linear Unit, ReLU) activation layer.
  • the second fully connected layer of the offset prediction network is used to perform feature connection on the nonlinear processing result to obtain a feature connection result (denoted as the second feature connection result); then the second activation layer of the offset prediction network (which can be denoted as the third activation layer) is used to perform nonlinear processing on the feature connection result (that is, the second feature connection result) to obtain the offset information.
  • the second activation layer of the offset prediction network can be a Sigmoid activation layer, so that each element in the offset information can be constrained to be between 0 and 1.
  • the processing of the offset prediction network can be expressed as: offset_raw = σ(W2 · δ(W1 · F_1dconv(z)))
  • where z represents the down-sampling result, F_1dconv represents the convolutional layer of the offset prediction network, W1 represents the first fully connected layer of the offset prediction network, δ represents the first activation layer of the offset prediction network, W2 represents the second fully connected layer of the offset prediction network, σ represents the second activation layer of the offset prediction network, and offset_raw represents the offset information.
  • the offset information obtained by the second activation layer can also be subjected to constraint processing, so that each element in the offset information is restricted to the range (−T/2, T/2), where T represents the number of frames of the video to be analyzed.
  • specifically, 0.5 can be subtracted from each element in the offset information obtained by using the second activation layer of the offset prediction network to perform nonlinear processing on the feature connection result, and the difference obtained by subtracting 0.5 can be multiplied by the number of frames of the video to be analyzed to obtain the offset information after constraint processing.
  • the above constraint processing can be expressed as: offset = (offset_raw − 0.5) × T, where offset_raw represents the offset information output by the second activation layer, T represents the number of frames of the video to be analyzed, and offset represents the constrained offset information; a sketch of the offset prediction network with this constraint is given below.
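  • The following is a minimal sketch of an offset prediction network with this layer sequence and constraint, assuming PyTorch; the exact channel and neuron counts, the 1D convolution, and the kernel size are assumptions and do not reproduce every detail stated in the text:

```python
import torch
import torch.nn as nn

class OffsetPredictor(nn.Module):
    """Down-sampling -> conv -> FC -> ReLU -> FC -> Sigmoid, then the (x - 0.5) * T constraint."""

    def __init__(self, channels, num_frames, num_groups):
        super().__init__()
        self.num_frames = num_frames
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # second convolutional layer (assumed 1D)
        self.fc1 = nn.Linear(channels * num_frames, num_frames)              # first fully connected layer
        self.relu = nn.ReLU()                                                # second activation layer (ReLU)
        self.fc2 = nn.Linear(num_frames, num_groups)                         # second fully connected layer
        self.sigmoid = nn.Sigmoid()                                          # third activation layer (Sigmoid)

    def forward(self, U):
        # U: first multi-dimensional feature map, assumed layout (C, T, H, W).
        z = U.mean(dim=(-2, -1)).unsqueeze(0)      # second down-sampling result, (1, C, T)
        x = self.conv(z).flatten(1)                # second feature extraction result
        x = self.relu(self.fc1(x))                 # first feature connection result + non-linearity
        offset_raw = self.sigmoid(self.fc2(x))     # elements constrained to (0, 1)
        # Constraint processing: offset = (offset_raw - 0.5) * T, giving values in (-T/2, T/2).
        return (offset_raw - 0.5) * self.num_frames
```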
  • Step S14 Use the offset information to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a second multi-dimensional feature map based on the offset feature information.
  • the at least part of the feature information may be selected along a preset dimension (for example, the channel dimension). For example, if the number of channels in the channel dimension of the first multi-dimensional feature map is C, the at least part of the feature information may occupy a subset of these C channels. Alternatively, the offset information can also be used to perform a time sequence offset on all the feature information of the first multi-dimensional feature map, which is not limited here.
  • At least one set of feature information may be selected from the first multi-dimensional feature map according to a preset dimension (for example, channel dimension), where Each set of feature information includes feature information corresponding to different time series in the same preset dimension (for example, channel dimension), and the offset information is used to offset the at least one set of feature information in the time series dimension.
  • the second fully connected layer of the offset prediction network can contain the same number of neurons as the number of selected sets of feature information, so that the number of elements in the offset information is the same as the number of selected sets of feature information.
  • each element in the offset information may be used to offset at least one set of feature information in the time sequence dimension. For example, the time sequence dimension is shifted by one time sequence unit, or the time sequence dimension is shifted by two time sequence units, etc., which is not specifically limited here.
  • the at least part of the feature information after the timing offset may be spliced with the part of the feature information in the first multi-dimensional feature map that has not been time-shifted to obtain the second multi-dimensional feature map. That is, the subset of channels whose feature information has been time-shifted and the remaining channels whose feature information has not been time-shifted are spliced along the channel dimension to obtain the second multi-dimensional feature map; a sketch of this recombination is given below.
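  • The following is a minimal sketch of this recombination, assuming PyTorch, a (C, T, H, W) layout, and that the shifted channels are the first num_shifted channels; all of these are illustrative assumptions:

```python
import torch

def recombine(first_feat, shifted_part, num_shifted):
    """Concatenate the time-shifted channels with the channels left untouched."""
    # first_feat: (C, T, H, W); shifted_part: (num_shifted, T, H, W).
    untouched = first_feat[num_shifted:]                 # channels that were not time-shifted
    return torch.cat([shifted_part, untouched], dim=0)   # second multi-dimensional feature map
```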
  • Step S15 Use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
  • the fully connected layer of the preset network model can be used to perform feature connection on the second multi-dimensional feature map, and the softmax layer of the preset network model can be used to perform regression, so as to obtain the category of the video to be analyzed (such as a football match video, a ski match video, etc.), or the behavior category of the target object in the video to be analyzed (for example, normal walking, falling, running, etc.); other application scenarios can be deduced by analogy and are not listed one by one here.
  • the above-mentioned offset prediction network may be embedded before the convolutional layer of the preset network model.
  • the preset network model is ResNet-50, and the offset prediction network can be embedded before the convolutional layer in each residual block.
  • the preset network model may include at least one convolutional layer, so in the feature extraction process, a convolutional layer of the preset network model may be used to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map .
  • the number of convolutional layers of the preset network model can be more than one.
  • the number of convolutional layers of the preset network model can be 2, 3, or 4, and so on. Therefore, before the second multi-dimensional feature map is analyzed and the analysis result information of the video to be analyzed is obtained, the convolutional layers of the preset network model that have not yet performed feature extraction can also be used to perform feature extraction on the second multi-dimensional feature map.
  • the fully connected layer of the preset network model is used to analyze the finally obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed.
  • Figure 3 is a schematic diagram of an embodiment of each stage of video analysis.
  • the video to be analyzed first undergoes feature extraction through the first convolutional layer of the preset network model to obtain a first multi-dimensional feature map, and the timing offset is performed through the above-mentioned related steps to obtain a second multi-dimensional feature map. The second multi-dimensional feature map can then be input into the second convolutional layer for feature extraction to obtain a new first multi-dimensional feature map (denoted as the first multi-dimensional feature map in the figure), and the new first multi-dimensional feature map is time-shifted through the above related steps to obtain a new second multi-dimensional feature map (denoted as the second multi-dimensional feature map in the figure). Similarly, the third convolutional layer is used to perform feature extraction on the new second multi-dimensional feature map to obtain a new first multi-dimensional feature map, which is again time-shifted through the above related steps to obtain a new second multi-dimensional feature map.
  • at this point, the three convolutional layers of the preset network model have all performed the feature extraction step, so the fully connected layer of the preset network model can be used to analyze the most recently obtained second multi-dimensional feature map to obtain the analysis result information of the video to be analyzed; a sketch of this stage-by-stage alternation is given below.
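  • The following is a minimal sketch of this stage-by-stage alternation between convolutional feature extraction and timing offset, assuming PyTorch; conv_layers, offset_nets, shift and fc are illustrative names, not components named by the patent:

```python
import torch

def staged_analysis(video_feat, conv_layers, offset_nets, shift, fc):
    """Alternate feature extraction and timing offset, then classify.

    conv_layers: the convolutional stages of the preset network model.
    offset_nets: one offset prediction network per stage.
    shift: function applying the offsets to produce the second multi-dimensional feature map.
    fc: the fully connected layer of the preset network model.
    """
    feat = video_feat
    for conv, offset_net in zip(conv_layers, offset_nets):
        feat = conv(feat)             # (new) first multi-dimensional feature map
        offsets = offset_net(feat)    # predict offset information
        feat = shift(feat, offsets)   # (new) second multi-dimensional feature map
    return fc(feat.flatten())         # analysis result information
```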
  • in summary, the first multi-dimensional feature map is obtained by performing feature extraction on the video to be analyzed, and contains feature information at different time series corresponding to the video to be analyzed; the offset prediction network is used to predict the first multi-dimensional feature map to obtain offset information, so that at least part of the feature information of the first multi-dimensional feature map is time-shifted using the offset information, and the second multi-dimensional feature map is obtained based on the offset feature information. In this way, the timing information of the video to be analyzed can be directly modeled, which is beneficial to improving the processing speed of video analysis, and through the timing offset, spatial information and timing information can be jointly interleaved, so analysis and processing performed on this basis are beneficial to improving the accuracy of video analysis.
  • the offset information includes a first number of offset values; at least part of the first multi-dimensional feature map can also be divided into a first number of sets of first feature information along a preset dimension (for example, the channel dimension), that is, the at least one set of feature information includes a first number of sets of first feature information.
  • using the offset information to offset the at least one set of feature information in the time series dimension may include: using the i-th offset value in the offset information to compare the i-th set of first feature information in the time series dimension Perform the offset to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
  • for example, the first offset value in the offset information can be used to offset the first set of first feature information in the time series dimension to obtain the first set of second feature information, and the second offset value in the offset information can be used to offset the second set of first feature information in the time series dimension to obtain the second set of second feature information; when the above first number takes other values, the rest can be deduced by analogy and no further examples are given here.
  • using the i-th offset value in the offset information to offset the i-th set of first feature information in the time sequence dimension to obtain the i-th set of second feature information may include the following steps:
  • Step S141 Obtain the value range to which the i-th offset value belongs, and the difference between the upper limit value and the lower limit value of the value range is a preset value.
  • the preset value can be 1; the lower limit of the value range is the value obtained by rounding the i-th offset value down, and the upper limit of the value range is the value obtained by rounding it up. That is, for the i-th offset value O_i, its value range can be expressed as (n_0, n_0 + 1), where n_0 ∈ N.
  • for example, when the offset value is 0.8, the value range is 0 to 1; when the offset value is 1.4, the value range is 1 to 2; when the offset value takes other values, the range can be obtained in the same way, and examples are not given one by one here.
  • Step S142 Shift the i-th group of first feature information along the time sequence dimension by the upper limit time sequence unit to obtain the i-th group of third feature information, and shift the i-th group of first feature information along the time sequence dimension by the lower limit value Time sequence unit, the fourth feature information of the i-th group is obtained.
  • the i-th set of first feature information can be expressed as U_{c,t}. When the numerical range of the i-th offset value is expressed as (n_0, n_0 + 1), the i-th set of first feature information is shifted by the upper limit number of time sequence units along the time sequence dimension, and the obtained i-th set of third feature information can be expressed as U_{c,t+n_0+1}; the i-th set of first feature information is shifted by the lower limit number of time sequence units along the time sequence dimension, and the obtained i-th set of fourth feature information can be expressed as U_{c,t+n_0}.
  • each offset value may be a decimal.
  • the value range of each offset value is 0 to 1, that is, the upper limit value is 1, the lower limit value is 0, and the preset value is 1. Therefore, for the i-th group of first feature information U c,t , the corresponding third feature information can be expressed as U c,t+1 , and the corresponding fourth feature information can be expressed as U c,t .
  • the range of the first feature information in the time sequence dimension is [1, T], where the value of T is equal to the number of frames of the video to be analyzed, for example, the T of the first feature information [1 0 0 0 0 0 0 1] is 8.
  • the first feature information may become a zero vector because feature information is removed during the timing offset process, so that the gradient disappears during training. To alleviate this problem, after the timing offset a buffer can be set for the (0, 1) time sequence interval and the (T, T+1) time sequence interval, so that when feature information is shifted beyond time T+1 or below time 0 in the time sequence dimension, the buffer can be fixed to 0.
  • for example, if the first feature information is [1 0 0 0 0 0 0 1] and the i-th offset value is 0.4, since the value range it belongs to is 0 to 1, the first feature information can be shifted by the upper limit (that is, 1) time sequence unit to obtain the corresponding third feature information [0 1 0 0 0 0 0 0], and shifted by the lower limit (that is, 0) time sequence units to obtain the corresponding fourth feature information [1 0 0 0 0 0 0 1]; when the first feature information and the offset value take other values, the rest can be deduced by analogy and no further examples are given here.
  • Step S143 Use the difference between the i-th offset value and the lower limit value as a weight to perform weighting processing on the i-th set of third feature information to obtain the i-th set of first weighted results, and use the difference between the upper limit value and the i-th offset value as a weight to perform weighting processing on the i-th set of fourth feature information to obtain the i-th set of second weighted results.
  • with the i-th offset value expressed as O_i and its numerical range expressed as (n_0, n_0 + 1), the difference between the i-th offset value and the lower limit value, that is, O_i − n_0, is used as the weight to perform weighting processing on the i-th set of third feature information (that is, U_{c,t+n_0+1}) to obtain the corresponding first weighted result (that is, (O_i − n_0) · U_{c,t+n_0+1}), and the difference between the upper limit value and the i-th offset value, that is, n_0 + 1 − O_i, is used as the weight to perform weighting processing on the i-th set of fourth feature information (that is, U_{c,t+n_0}) to obtain the corresponding second weighted result (that is, (n_0 + 1 − O_i) · U_{c,t+n_0}).
  • in an embodiment of the present application, each offset value may be a decimal whose value range is 0 to 1, that is, the upper limit value is 1, the lower limit value is 0, and the preset value is 1. Therefore, for the first feature information U_{c,t}, the corresponding third feature information can be expressed as U_{c,t+1}, the corresponding fourth feature information can be expressed as U_{c,t}, the first weighted result can be expressed as O_i · U_{c,t+1}, and the second weighted result can be expressed as (1 − O_i) · U_{c,t}.
  • continuing the above example in which the first feature information is [1 0 0 0 0 0 0 1] and the offset value is 0.4, the corresponding third feature information can be expressed as [0 1 0 0 0 0 0 0], the corresponding fourth feature information can be expressed as [1 0 0 0 0 0 0 1], the first weighted result can be expressed as [0 0.4 0 0 0 0 0 0], and the second weighted result can be expressed as [0.6 0 0 0 0 0 0 0.6].
  • Step S144 Calculate the sum between the first weighted result of the i-th group and the second weighted result of the i-th group as the second feature information of the i-th group.
  • the first weighted result can be expressed as (O_i − n_0) · U_{c,t+n_0+1} and the second weighted result can be expressed as (n_0 + 1 − O_i) · U_{c,t+n_0}, so the i-th set of second feature information can be expressed as (n_0 + 1 − O_i) · U_{c,t+n_0} + (O_i − n_0) · U_{c,t+n_0+1}.
  • each offset value may be a decimal.
  • the value range of each offset value is 0 to 1, that is, the upper limit value is 1, the lower limit value is 0, and the preset value is 1, so for the first feature information U c,t , the first The weighted result can be expressed as O i U c,t+1 , and the second weighted result can be expressed as (1-O i )U c,t , so the i-th group of second feature information can be expressed as (1-O i )U c,t +O i U c,t+1 .
  • Continuing the above example, the corresponding first weighted result can be expressed as [0 0.4 0 0 0 0 0 0], the corresponding second weighted result can be expressed as [0.6 0 0 0 0 0 0 0.6], and the i-th group of second feature information can be expressed as [0.6 0.4 0 0 0 0 0 0.6].
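  • By way of illustration only (not part of the original disclosure), the fractional timing offset described in steps S142 to S144 can be sketched in Python/NumPy as follows; the helper name shift_with_zero_fill, the array layout, and the concrete values are illustrative assumptions chosen to reproduce the example above:

      import numpy as np

      def shift_with_zero_fill(u, steps):
          # Shift a 1-D timing sequence by `steps` timing units; positions shifted in
          # from outside the valid range are filled with zeros, mirroring the zero
          # buffers described above.
          shifted = np.zeros_like(u)
          if steps == 0:
              return u.copy()
          if steps > 0:
              shifted[steps:] = u[:len(u) - steps]
          else:
              shifted[:len(u) + steps] = u[-steps:]
          return shifted

      u = np.array([1, 0, 0, 0, 0, 0, 0, 1], dtype=float)  # first feature information
      offset, lower, upper = 0.4, 0, 1                     # i-th offset value and its value range

      third = shift_with_zero_fill(u, upper)               # [0 1 0 0 0 0 0 0]
      fourth = shift_with_zero_fill(u, lower)              # [1 0 0 0 0 0 0 1]
      second = (offset - lower) * third + (upper - offset) * fourth
      print(second)                                        # [0.6 0.4 0.  0.  0.  0.  0.  0.6]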
  • In one implementation scenario, a symmetric offset strategy can be used during training; that is, only half of the offset values are trained, and a conversion calculation (for example, reversing the order) is performed on them to obtain the other half of the offset values, which can reduce the processing load during training.
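  • As a purely illustrative sketch of this symmetric offset strategy (the order-reversal below is only the example conversion mentioned above, not a mandatory rule):

      import numpy as np

      half = np.array([0.4, 0.1, 0.7, 0.3])          # half of the offset values are trained
      offsets = np.concatenate([half, half[::-1]])   # the other half is obtained by reversing the order
      print(offsets)                                 # [0.4 0.1 0.7 0.3 0.3 0.7 0.1 0.4]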
  • In this way, the i-th group of first feature information is offset along the time sequence dimension by the upper limit number of timing units to obtain the i-th group of third feature information, and the i-th group of first feature information is offset along the time sequence dimension by the lower limit number of timing units to obtain the i-th group of fourth feature information; the difference between the i-th offset value and the lower limit value is used as the weight to perform weighting processing on the i-th group of third feature information to obtain the i-th group first weighted result, and the difference between the upper limit value and the i-th offset value is used as the weight to perform weighting processing on the i-th group of fourth feature information to obtain the i-th group second weighted result; the sum of the i-th group first weighted result and the i-th group second weighted result is calculated as the i-th group of second feature information.
  • FIG. 5 is a schematic flowchart of another embodiment of the video analysis method of the present application. Specifically, it can include the following steps:
  • Step S51 Obtain the video to be analyzed.
  • Step S52 Use the preset network model to perform feature extraction on the video to be analyzed to obtain a first multi-dimensional feature map.
  • the first multi-dimensional feature map contains feature information in different time series corresponding to the video to be analyzed.
  • For details, please refer to the relevant steps in the foregoing embodiment.
  • Step S53 Use the offset prediction network to predict the first multi-dimensional feature map to obtain offset information.
  • FIG. 6 is a schematic diagram of another embodiment of the video analysis processing process.
  • the first multi-dimensional feature map can be predicted by the offset prediction network.
  • Step S54 Use the weight prediction network to predict the first multi-dimensional feature map to obtain weight information.
  • During the timing offset, the features at the head and tail ends of the first feature information may be removed. Therefore, in order to re-evaluate the importance of each feature in the first feature information after the timing offset and to better capture long-range information, an attention mechanism can be used to re-weight each feature in the shifted first feature information, and the weight information therefore needs to be obtained.
  • the weight prediction network can be used to predict the first multi-dimensional feature map to obtain weight information.
  • the weight prediction network may include a down-sampling layer, a convolutional layer, and an activation layer that are sequentially connected. Therefore, the weight prediction network contains only 3 layers, and only the convolutional layer contains network parameters, which can simplify the network structure to a certain extent and reduce network parameters, thereby reducing network capacity, improving convergence speed, and avoiding overfitting.
  • This helps make the trained model as accurate as possible, which in turn can improve the accuracy of video analysis.
  • Using the weight prediction network to predict the first multi-dimensional feature map to obtain the weight information may include: using the down-sampling layer of the weight prediction network (denoted as the first down-sampling layer) to down-sample the first multi-dimensional feature map to obtain a down-sampling result (denoted as the first down-sampling result); using the convolutional layer of the weight prediction network (denoted as the first convolutional layer) to perform convolution processing on the down-sampling result (i.e., the first down-sampling result) to obtain a feature extraction result (denoted as the first feature extraction result); and using the activation layer of the weight prediction network to perform nonlinear processing on the feature extraction result (i.e., the first feature extraction result) to obtain the weight information.
  • the downsampling layer may be an average pooling layer.
  • the convolutional layer of the weight prediction network can include one convolution kernel, and the activation layer of the weight prediction network can be a Sigmoid activation layer, so that each element in the weight information can be constrained to be between 0 and 1.
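  • For illustration, a minimal PyTorch sketch of a weight prediction network with this three-layer structure might look as follows; the tensor layout (N, C, T, H, W), the kernel size, and the class name are assumptions rather than details from the original disclosure:

      import torch
      import torch.nn as nn

      class WeightPredictionNet(nn.Module):
          # down-sampling layer + convolutional layer (one kernel) + Sigmoid activation layer;
          # only the convolutional layer carries learnable parameters
          def __init__(self, channels, kernel_size=3):
              super().__init__()
              self.conv = nn.Conv1d(channels, 1, kernel_size, padding=kernel_size // 2)
              self.act = nn.Sigmoid()

          def forward(self, x):
              # x: first multi-dimensional feature map, assumed shape (N, C, T, H, W)
              pooled = x.mean(dim=(3, 4))            # down-sampling: spatial average pooling -> (N, C, T)
              return self.act(self.conv(pooled))     # (N, 1, T): one weight in (0, 1) per time sequence

      weights = WeightPredictionNet(64)(torch.randn(2, 64, 8, 14, 14))
      print(weights.shape)                           # torch.Size([2, 1, 8])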
  • the offset prediction network and the weight prediction network in the embodiments of the present application may be embedded before the convolutional layer of the preset network model.
  • For example, when the preset network model is ResNet-50, the offset prediction network and the weight prediction network can be embedded before the convolutional layer of each residual block, so that the first multi-dimensional feature map is used to predict the offset information and the weight information for the subsequent offset and weighting processing. In this way, only a small number of network parameters needs to be added to the existing parameters of ResNet-50 to realize the modeling of timing information, which is conducive to reducing the processing load of video analysis, improving the processing speed of video analysis, accelerating convergence during model training, avoiding over-fitting, and improving the accuracy of video analysis.
  • When the preset network model is another model, this can be deduced by analogy, and no further examples are given here.
  • Steps S53 and S54 can be performed in any order; for example, step S53 is performed first and then step S54, or step S54 is performed first and then step S53, or steps S53 and S54 are performed at the same time, which is not limited here. In addition, the foregoing step S54 only needs to be performed before the subsequent step S56, which is also not limited here.
  • Step S55 Use the offset information to perform time sequence offset on at least part of the feature information of the first multi-dimensional feature map.
  • Step S56 Use the weight information to perform weighting processing on the offset feature information.
  • the video to be analyzed may specifically include a second number of frame images, and the weight information may include a second number of weight values.
  • the second number may specifically be 8, 16, 24, etc., which are not specifically limited herein.
  • For each group of offset feature information, the j-th weight value in the weight information can be used to perform weighting processing on the feature value corresponding to the j-th time sequence in the current group of feature information, to obtain the corresponding group of weighted feature information, where j is a positive integer less than or equal to the second number.
  • For example, for the offset feature information [0.6 0.4 0 0 0 0 0 0.6] obtained in the foregoing example, the weight information can be [0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.2]; the j-th weight value in the weight information is used to weight the feature value at the j-th time sequence, and the weighted feature information of the corresponding group is obtained as [0.12 0.04 0 0 0 0 0 0.12].
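  • Purely as an illustration of this weighting step (values taken from the example above, layout assumed):

      import numpy as np

      shifted = np.array([0.6, 0.4, 0, 0, 0, 0, 0, 0.6])            # offset feature information
      weights = np.array([0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2])  # one weight value per time sequence
      print(weights * shifted)                                      # [0.12 0.04 0.   0.   0.   0.   0.   0.12]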
  • Step S57 Obtain a second multi-dimensional feature map based on the weighted feature information.
  • the second multi-dimensional feature map corresponding to the first multi-dimensional feature map can be obtained.
  • In some embodiments, obtaining the second multi-dimensional feature map based on the weighted feature information may include: using the weighted feature information and the feature information of the first multi-dimensional feature map that is not offset to compose the second multi-dimensional feature map.
  • the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map can be spliced to obtain the second multi-dimensional feature map.
  • the obtained second multi-dimensional feature map has the same size as the first multi-dimensional feature map.
  • the weighted feature information can be directly combined to form the second multi-dimensional feature map.
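  • A minimal sketch (under an assumed tensor layout and channel selection) of how the weighted, offset feature information can be spliced back together with the feature information that was not offset:

      import torch

      def compose_second_feature_map(first_map, shifted_weighted, shifted_idx):
          # first_map:        (N, C, T, H, W) first multi-dimensional feature map
          # shifted_weighted: (N, len(shifted_idx), T, H, W) feature information after offset and weighting
          # shifted_idx:      channel indices that were selected for the timing offset
          second_map = first_map.clone()
          second_map[:, shifted_idx] = shifted_weighted   # splice the offset channels back in
          return second_map                               # same size as the first multi-dimensional feature map

      first_map = torch.randn(2, 64, 8, 14, 14)
      second_map = compose_second_feature_map(first_map, torch.randn(2, 16, 8, 14, 14), list(range(16)))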
  • Step S58 Use the preset network model to analyze the second multi-dimensional feature map to obtain analysis result information of the video to be analyzed.
  • In the above solution, the weight prediction network is used to predict the first multi-dimensional feature map to obtain the weight information, the offset information is used to perform a timing offset on at least part of the feature information of the first multi-dimensional feature map, the weight information is used to weight the offset feature information, and the second multi-dimensional feature map is obtained based on the weighted feature information. Therefore, feature information in which spatial information and timing information are jointly interleaved can be obtained directly through the offset and weighting steps, which is conducive to improving the processing speed and accuracy of video analysis.
  • FIG. 7 is a schematic flowchart of an embodiment of a model training method for video analysis according to the present application.
  • the model training method used for video analysis in the embodiments of the present application can be specifically executed by electronic devices with processing functions such as microcomputers, servers, tablet computers, or implemented by a processor executing program code. Specifically, it can include the following steps:
  • Step S71 Obtain a sample video.
  • the sample video includes preset annotation information.
  • The preset annotation information of the sample video may include, but is not limited to, annotation information such as fall, normal walking, and running; or, taking video classification as an example, it may include, but is not limited to, annotation information such as football match video, basketball match video, and ski match video.
  • Other application scenarios can be deduced by analogy, so we will not give examples one by one here.
  • the sample video may include several frames of images, for example, may include 8 frames of images, or may also include 16 frames of images, or may also include 24 frames of images, which is not specifically limited here.
  • Step S72 Perform feature extraction on the sample video by using the preset network model to obtain the first sample multi-dimensional feature map.
  • The above-mentioned preset network model may be a two-dimensional neural network model, for example, ResNet-50, ResNet-101, etc., which is not specifically limited here.
  • the ResNet network is constructed by a Residual Block, which uses multiple parameterized layers to learn the residual representation between input and output.
  • the first sample multi-dimensional feature map contains feature information in different time series corresponding to the sample video.
  • FIG. 2 is a schematic diagram of an embodiment of a video analysis processing process. As shown in FIG. 2, the abscissa represents different time series in the time series dimension T, and the squares corresponding to the different time series represent feature information in the different time series.
  • The sample video includes several frames of images.
  • Feature extraction can be performed on the several frames of images of the sample video through the preset network model to obtain the feature map corresponding to each frame of image, and the several feature maps are then directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map. For example, if the sample video includes 8 frames of images, the preset network model can be used to perform feature extraction on these 8 frames of images to obtain the feature map of each frame of image, and the 8 feature maps are directly spliced according to the time sequence of their corresponding images in the sample video to obtain the first sample multi-dimensional feature map.
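  • By way of illustration only, extracting a feature map for each frame with a two-dimensional backbone and splicing the maps along the timing dimension might be sketched as follows (the backbone, input size, and resulting layout are assumptions; a recent torchvision is assumed for the weights argument):

      import torch
      import torchvision

      frames = torch.randn(8, 3, 224, 224)                            # 8 frames of the sample video
      backbone = torchvision.models.resnet50(weights=None)
      trunk = torch.nn.Sequential(*list(backbone.children())[:-2])    # keep the convolutional trunk only

      with torch.no_grad():
          per_frame = trunk(frames)                                   # (8, 2048, 7, 7): one feature map per frame
      first_sample_map = per_frame.permute(1, 0, 2, 3)                # (C, T, H, W): spliced along the time sequence
      print(first_sample_map.shape)                                   # torch.Size([2048, 8, 7, 7])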
  • Step S73 Use the offset prediction network to predict the multi-dimensional feature map of the first sample to obtain offset information.
  • the weight prediction network can also be used to predict the first sample multi-dimensional feature map to obtain weight information.
  • the network structure of the weight prediction network refer to the relevant steps in the foregoing embodiment, which will not be repeated here.
  • Step S74 Use the offset information to perform time sequence offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain the second sample multi-dimensional feature map based on the offset feature information.
  • the weight information can also be used to weight the offset feature information, and based on the weighted feature information, the second sample multi-dimensional feature map can be obtained.
  • the preset network model may include at least one convolutional layer, and then one convolutional layer of the preset network model may be used to perform feature extraction on the sample video to obtain the first sample multi-dimensional feature map.
  • When the number of convolutional layers of the preset network model is more than one, a convolutional layer of the preset network model that has not yet performed feature extraction can be used to perform feature extraction on the second sample multi-dimensional feature map to obtain a new first sample multi-dimensional feature map, and the step of using the offset prediction network to predict the new first sample multi-dimensional feature map to obtain offset information and the subsequent steps are executed, thereby obtaining a new second sample multi-dimensional feature map; the above steps are then repeated until all convolutional layers of the preset network model have completed feature extraction.
  • Step S75 Use the preset network model to analyze the second sample multi-dimensional feature map to obtain analysis result information of the sample video.
  • the fully connected layer of the preset network model can be used to analyze the second sample multi-dimensional feature map to obtain the analysis result information of the sample video.
  • The fully connected layer of the preset network model can be used to perform feature connection on the second sample multi-dimensional feature map, and the softmax layer of the preset network model can be used to perform regression, so as to obtain the probability values of the sample video belonging to each category (such as football match video, ski event video, etc.), or the probability values of the sample video belonging to various behaviors (such as falling, normal walking, running, etc.). Other application scenarios can be deduced by analogy, and no further examples are given here.
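  • A minimal sketch of such a classification head (the global pooling step and the sizes are assumptions, not details from the original disclosure):

      import torch
      import torch.nn as nn

      num_classes = 3                                     # e.g. fall / normal walking / running
      second_sample_map = torch.randn(1, 2048, 8, 7, 7)   # (N, C, T, H, W), assumed layout

      pooled = second_sample_map.mean(dim=(2, 3, 4))      # global average over T, H, W -> (N, C)
      fc = nn.Linear(2048, num_classes)                   # fully connected layer
      probs = torch.softmax(fc(pooled), dim=1)            # softmax regression: probability per category/behavior
      print(probs)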
  • Step S76 Calculate the loss value by using the preset label information and the analysis result information.
  • a mean square error (Mean Square Error) loss function or a cross entropy loss function can be used to calculate the loss value of the preset label information and the analysis result information, which is not limited here.
  • Step S77 Adjust the parameters of the preset network model and the offset prediction network based on the loss value.
  • The weight prediction network can also be used to predict the first sample multi-dimensional feature map to obtain weight information, so that the weight information is used to perform weighting processing on the offset feature information and the second sample multi-dimensional feature map is obtained based on the weighted feature information; in this case, the parameters of the preset network model, the offset prediction network, and the weight prediction network can all be adjusted based on the loss value.
  • Specifically, the parameters of the convolutional layers and the fully connected layer in the preset network model can be adjusted, the parameters of the convolutional layer and the fully connected layers in the offset prediction network can be adjusted, and the parameters of the convolutional layer in the weight prediction network can be adjusted.
  • a gradient descent method can be used to adjust the parameters, such as a batch gradient descent method and a stochastic gradient descent method.
  • the above step S72 and subsequent steps may be executed again until the calculated loss value meets the preset training end condition.
  • the preset training end condition may include: the loss value is less than a preset loss threshold and the loss value no longer decreases, or the preset training end condition may also include: the number of parameter adjustments reaches the preset number of times threshold, or, The preset training end condition may also include: using a test video to test that the network performance meets a preset requirement (for example, the accuracy rate reaches a preset accuracy threshold).
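  • Purely as an illustrative sketch of the training procedure of steps S71 to S77 (the placeholder model, data, and hyper-parameters below are assumptions and not part of the original disclosure):

      import torch
      import torch.nn as nn

      # `model` stands in for the preset network model together with the offset prediction
      # network (and, optionally, the weight prediction network)
      model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 3 * 32 * 32, 3))      # placeholder model
      criterion = nn.CrossEntropyLoss()                                       # or nn.MSELoss(), as noted above
      optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                # (stochastic) gradient descent

      loss_threshold, max_steps = 1e-3, 1000                                  # preset training end conditions
      for step in range(max_steps):
          sample_videos = torch.randn(4, 8 * 3 * 32 * 32)                     # placeholder batch of sample videos
          labels = torch.randint(0, 3, (4,))                                  # preset annotation information
          logits = model(sample_videos)                                       # steps S72 to S75
          loss = criterion(logits, labels)                                    # step S76
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()                                                    # step S77
          if loss.item() < loss_threshold:                                    # preset training end condition
              break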
  • In the above solution, the first sample multi-dimensional feature map is obtained by performing feature extraction on the sample video and contains feature information of the sample video at different time sequences; the offset prediction network is used to predict the first sample multi-dimensional feature map to obtain offset information, the offset information is used to perform a timing offset on at least part of the feature information of the first sample multi-dimensional feature map, and the second sample multi-dimensional feature map is obtained based on the offset feature information. In this way, the timing information of the sample video can be modeled directly, which is beneficial to improving the speed of model training; moreover, through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis on this basis is conducive to subsequently improving the accuracy of video analysis.
  • FIG. 8 is a schematic diagram of a framework of an embodiment of a video analysis device 80 of the present application.
  • the video analysis device 80 includes a video acquisition module 81, a feature extraction module 82, an offset prediction module 83, an offset processing module 84, and a network analysis module 85; among them,
  • the video acquisition module 81 is configured to acquire the video to be analyzed
  • the feature extraction module 82 is configured to perform feature extraction on the video to be analyzed by using a preset network model to obtain a first multi-dimensional feature map, where the first multi-dimensional feature map contains feature information at different timings corresponding to the video to be analyzed;
  • the offset prediction module 83 is configured to use an offset prediction network to predict the first multi-dimensional feature map to obtain offset information
  • the offset processing module 84 is configured to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map by using the offset information, and obtain a second multi-dimensional feature map based on the offset feature information;
  • the network analysis module 85 is configured to analyze the second multi-dimensional feature map by using a preset network model to obtain analysis result information of the video to be analyzed.
  • The technical solution of the embodiments of the present application uses a preset network model to process the video to be analyzed, which is beneficial to improving the processing speed of video analysis; and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis and processing on this basis helps improve the accuracy of video analysis.
  • the video analysis device 80 further includes a weight prediction module configured to use a weight prediction network to predict the first multi-dimensional feature map to obtain weight information;
  • the offset processing module 84 is configured to use the offset information to perform timing offset on at least part of the feature information of the first multi-dimensional feature map; use the weight information to perform weighting processing on the offset feature information; based on the weighted feature information , Get the second multi-dimensional feature map.
  • the dimensions of the first multi-dimensional feature map include time series dimensions and preset dimensions
  • The offset processing module 84 is configured to select at least one set of feature information from the first multi-dimensional feature map according to the preset dimension, where each set of feature information includes feature information corresponding to different time sequences in the same preset dimension, and the offset information is used to offset the at least one set of feature information in the time sequence dimension.
  • The preset dimension is the channel dimension; and/or, the offset information includes a first number of offset values, and the at least one set of feature information includes a first number of groups of first feature information.
  • The offset processing module 84 is configured to use the i-th offset value in the offset information to offset the i-th group of first feature information in the time sequence dimension to obtain the i-th group of second feature information, where i is a positive integer less than or equal to the first number.
  • the offset processing module 84 is configured to obtain the value range to which the i-th offset value belongs, and the difference between the upper limit value and the lower limit value of the value range is a preset value
  • The timing offset processing unit includes a timing offset processing subunit, which is configured to: offset the i-th group of first feature information along the time sequence dimension by the upper limit number of timing units to obtain the i-th group of third feature information, and offset the i-th group of first feature information along the time sequence dimension by the lower limit number of timing units to obtain the i-th group of fourth feature information; use the difference between the i-th offset value and the lower limit value as the weight to perform weighting processing on the i-th group of third feature information to obtain the i-th group first weighted result, and use the difference between the upper limit value and the i-th offset value as the weight to perform weighting processing on the i-th group of fourth feature information to obtain the i-th group second weighted result; and calculate the sum of the i-th group first weighted result and the i-th group second weighted result as the i-th group of second feature information.
  • the video to be analyzed includes a second number of frame images, and the weight information includes a second number of weight values.
  • The offset processing module 84 is configured to, for each group of offset feature information, use the j-th weight value in the weight information to perform weighting processing on the feature value corresponding to the j-th time sequence in the current group of feature information, to obtain the corresponding group of weighted feature information, where j is a positive integer less than or equal to the second number.
  • the offset processing module 84 is configured to use the weighted feature information and the feature information that is not offset in the first multi-dimensional feature map to form a second multi-dimensional feature map.
  • The weight prediction module is configured to use the first down-sampling layer of the weight prediction network to down-sample the first multi-dimensional feature map to obtain the first down-sampling result; use the first convolutional layer of the weight prediction network to perform convolution processing on the first down-sampling result to obtain the first feature extraction result; and use the first activation layer of the weight prediction network to perform nonlinear processing on the first feature extraction result to obtain the weight information.
  • The offset prediction module 83 is configured to: use the second down-sampling layer of the offset prediction network to down-sample the first multi-dimensional feature map to obtain the second down-sampling result; use the second convolutional layer of the offset prediction network to perform convolution processing on the second down-sampling result to obtain the second feature extraction result; use the first fully connected layer of the offset prediction network to perform feature connection on the second feature extraction result to obtain the first feature connection result; use the second activation layer of the offset prediction network to perform nonlinear processing on the first feature connection result to obtain the nonlinear processing result; use the second fully connected layer of the offset prediction network to perform feature connection on the nonlinear processing result to obtain the second feature connection result; and use the third activation layer of the offset prediction network to perform nonlinear processing on the second feature connection result to obtain the offset information.
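  • For illustration, a minimal PyTorch sketch of an offset prediction network with this layer sequence might look as follows; the tensor layout, hidden size, number of offset values, and the concrete choice of ReLU/Sigmoid activations are assumptions:

      import torch
      import torch.nn as nn

      class OffsetPredictionNet(nn.Module):
          # down-sampling -> convolution -> fully connected -> activation -> fully connected -> activation
          def __init__(self, channels, seq_len, num_offsets, hidden=64):
              super().__init__()
              self.conv = nn.Conv1d(channels, channels, 3, padding=1)
              self.fc1 = nn.Linear(channels * seq_len, hidden)
              self.act1 = nn.ReLU()
              self.fc2 = nn.Linear(hidden, num_offsets)
              self.act2 = nn.Sigmoid()                    # e.g. constrains each offset value to (0, 1)

          def forward(self, x):
              # x: first multi-dimensional feature map, assumed shape (N, C, T, H, W)
              pooled = x.mean(dim=(3, 4))                 # down-sampling -> (N, C, T)
              feat = self.conv(pooled)                    # feature extraction result
              out = self.act1(self.fc1(feat.flatten(1)))  # first feature connection + nonlinear processing
              return self.act2(self.fc2(out))             # offset information, one value per group

      offsets = OffsetPredictionNet(64, seq_len=8, num_offsets=8)(torch.randn(2, 64, 8, 14, 14))
      print(offsets.shape)                                # torch.Size([2, 8])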
  • the preset network model includes at least one convolutional layer
  • The feature extraction module 82 is configured to use a convolutional layer of the preset network model to perform feature extraction on the video to be analyzed to obtain the first multi-dimensional feature map; if the number of convolutional layers of the preset network model is more than one, a convolutional layer of the preset network model that has not yet performed feature extraction is used to perform feature extraction on the second multi-dimensional feature map to obtain a new first multi-dimensional feature map;
  • the offset prediction module 83 is further configured to use the offset prediction network to predict the new first multi-dimensional feature map to obtain new offset information;
  • the offset processing module 84 is further configured to use the new offset information to perform a time sequence offset on at least part of the feature information of the first multi-dimensional feature map, and obtain a new second multi-dimensional feature map based on the offset feature information;
  • the network analysis module 85 is configured to analyze the new second multi-dimensional feature map by using the fully connected layer of the preset network model to obtain analysis result information of the video to be analyzed.
  • the video to be analyzed includes several frames of images
  • the feature extraction module 82 is configured to perform feature extraction on several frames of images using a preset network model to obtain a feature map corresponding to each frame of image;
  • The several feature maps are spliced according to the time sequence of their corresponding images in the video to be analyzed to obtain the first multi-dimensional feature map.
  • FIG. 9 is a schematic diagram of an embodiment of a model training device 90 for video analysis according to the present application.
  • the model training device 90 for video analysis includes a video acquisition module 91, a feature extraction module 92, an offset prediction module 93, an offset processing module 94, a network analysis module 95, a loss calculation module 96, and a parameter adjustment module 97; among them,
  • the video acquisition module 91 is configured to acquire a sample video, where the sample video includes preset annotation information
  • the feature extraction module 92 is configured to perform feature extraction on the sample video by using a preset network model to obtain a first sample multi-dimensional feature map, where the first sample multi-dimensional feature map contains feature information corresponding to the sample video at different timings;
  • the offset prediction module 93 is configured to use the offset prediction network to predict the first sample multi-dimensional feature map to obtain offset information
  • the offset processing module 94 is configured to use the offset information to perform timing offset on at least part of the feature information of the first sample multi-dimensional feature map, and obtain a second sample multi-dimensional feature map based on the offset feature information;
  • the network analysis module 95 is configured to analyze the second sample multi-dimensional feature map by using a preset network model to obtain analysis result information of the sample video;
  • the loss calculation module 96 is configured to calculate a loss value using preset label information and analysis result information
  • the parameter adjustment module 97 is configured to adjust the parameters of the preset network model and the offset prediction network based on the loss value.
  • In this way, the timing information of the sample video can be modeled directly, which is beneficial to improving the speed of model training; and through the timing offset, spatial information and timing information can be jointly interleaved, so performing analysis and processing on this basis is conducive to subsequently improving the accuracy of video analysis.
  • the model training device 90 for video analysis may further include other modules to execute the relevant steps in the above-mentioned embodiment of the model training method for video analysis.
  • The video analysis device 80 may further include other modules to execute the relevant steps in the above-mentioned video analysis method embodiment.
  • FIG. 10 is a schematic diagram of a framework of an embodiment of the electronic device 100 of the present application.
  • the electronic device 100 includes a memory 101 and a processor 102 coupled to each other.
  • The processor 102 is configured to execute program instructions stored in the memory 101 to implement the steps in any of the foregoing video analysis method embodiments, or to implement the steps in any of the foregoing embodiments of the model training method for video analysis.
  • the electronic device 100 may include but is not limited to: a microcomputer and a server.
  • the electronic device 100 may also include mobile devices such as a notebook computer and a tablet computer, which are not limited herein.
  • the processor 102 is configured to control itself and the memory 101 to implement the steps in any of the foregoing video analysis method embodiments, or implement the steps in any of the foregoing model training method embodiments for video analysis.
  • the processor 102 may also be referred to as a central processing unit (Central Processing Unit, CPU).
  • the processor 102 may be an integrated circuit chip with signal processing capability.
  • the processor 102 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA), or other Programmable logic devices, discrete gates or transistor logic devices, discrete hardware components.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • In addition, the processor 102 may be jointly implemented by multiple integrated circuit chips.
  • FIG. 11 is a schematic diagram of a framework of an embodiment of the computer-readable storage medium 110 of this application.
  • the computer-readable storage medium 110 stores program instructions 1101 that can be executed by a processor, and the program instructions 1101 are used to implement the steps of any of the foregoing video analysis method embodiments, or implement any of the foregoing model training method embodiments for video analysis. Steps in.
  • the computer-readable storage medium may be a volatile or non-volatile storage medium.
  • The embodiments of the present application also provide a computer program, including computer-readable code; when the computer-readable code is executed in an electronic device, the processor in the electronic device executes the steps for implementing any of the above-mentioned video analysis method embodiments or any of the above-mentioned model training method embodiments for video analysis.
  • the disclosed method and device can be implemented in other ways.
  • The device implementations described above are only illustrative. For example, the division of modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on such an understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods in the various embodiments of the present application.
  • The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application relate to a video analysis method and a related model training method, device, and apparatus. The video analysis method includes: obtaining a video to be analyzed; performing feature extraction on the video to be analyzed using a preset network model to obtain a first multi-dimensional feature map, the first multi-dimensional feature map containing feature information at different time sequences corresponding to the video to be analyzed; predicting the first multi-dimensional feature map using an offset prediction network to obtain offset information; performing a timing offset on at least part of the feature information of the first multi-dimensional feature map using the offset information, and obtaining a second multi-dimensional feature map based on the offset feature information; and analyzing the second multi-dimensional feature map using the preset network model to obtain analysis result information of the video to be analyzed.
PCT/CN2020/078656 2020-01-17 2020-03-10 Procédé d'analyse vidéo et procédé d'apprentissage de modèle associé, dispositif et appareil associés WO2021142904A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020217013635A KR20210093875A (ko) 2020-01-17 2020-03-10 비디오 분석 방법 및 연관된 모델 훈련 방법, 기기, 장치
JP2021521512A JP7096431B2 (ja) 2020-01-17 2020-03-10 ビデオ分析方法及びそれに関連するモデル訓練方法、機器、装置

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010053048.4A CN111291631B (zh) 2020-01-17 2020-01-17 视频分析方法及其相关的模型训练方法、设备、装置
CN202010053048.4 2020-01-17

Publications (1)

Publication Number Publication Date
WO2021142904A1 true WO2021142904A1 (fr) 2021-07-22

Family

ID=71025430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/078656 WO2021142904A1 (fr) 2020-01-17 2020-03-10 Procédé d'analyse vidéo et procédé d'apprentissage de modèle associé, dispositif et appareil associés

Country Status (5)

Country Link
JP (1) JP7096431B2 (fr)
KR (1) KR20210093875A (fr)
CN (1) CN111291631B (fr)
TW (1) TWI761813B (fr)
WO (1) WO2021142904A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390731A1 (en) * 2020-06-12 2021-12-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for positioning key point, device, and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417952B (zh) * 2020-10-10 2022-11-11 北京理工大学 一种车辆碰撞防控系统的环境视频信息可用性测评方法
CN112464898A (zh) * 2020-12-15 2021-03-09 北京市商汤科技开发有限公司 事件检测方法及装置、电子设备和存储介质
CN112949449B (zh) * 2021-02-25 2024-04-19 北京达佳互联信息技术有限公司 交错判断模型训练方法及装置和交错图像确定方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199902A (zh) * 2014-08-27 2014-12-10 中国科学院自动化研究所 一种线性动态系统的相似性度量计算方法
US20170243058A1 (en) * 2014-10-28 2017-08-24 Watrix Technology Gait recognition method based on deep learning
CN108229522A (zh) * 2017-03-07 2018-06-29 北京市商汤科技开发有限公司 神经网络的训练方法、属性检测方法、装置及电子设备
CN108229280A (zh) * 2017-04-20 2018-06-29 北京市商汤科技开发有限公司 时域动作检测方法和系统、电子设备、计算机存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626803B2 (en) * 2014-12-12 2017-04-18 Qualcomm Incorporated Method and apparatus for image processing in augmented reality systems
US10707837B2 (en) 2017-07-06 2020-07-07 Analog Photonics LLC Laser frequency chirping structures, methods, and applications
WO2019035854A1 (fr) * 2017-08-16 2019-02-21 Kla-Tencor Corporation Apprentissage machine par rapport à des mesures de métrologie
US10430654B1 (en) * 2018-04-20 2019-10-01 Surfline\Wavetrak, Inc. Automated detection of environmental measures within an ocean environment using image data
CN109919025A (zh) * 2019-01-30 2019-06-21 华南理工大学 基于深度学习的视频场景文本检测方法、系统、设备及介质
CN110084742B (zh) * 2019-05-08 2024-01-26 北京奇艺世纪科技有限公司 一种视差图预测方法、装置及电子设备
CN110660082B (zh) * 2019-09-25 2022-03-08 西南交通大学 一种基于图卷积与轨迹卷积网络学习的目标跟踪方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199902A (zh) * 2014-08-27 2014-12-10 中国科学院自动化研究所 一种线性动态系统的相似性度量计算方法
US20170243058A1 (en) * 2014-10-28 2017-08-24 Watrix Technology Gait recognition method based on deep learning
CN108229522A (zh) * 2017-03-07 2018-06-29 北京市商汤科技开发有限公司 神经网络的训练方法、属性检测方法、装置及电子设备
CN108229280A (zh) * 2017-04-20 2018-06-29 北京市商汤科技开发有限公司 时域动作检测方法和系统、电子设备、计算机存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390731A1 (en) * 2020-06-12 2021-12-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for positioning key point, device, and storage medium
US11610389B2 (en) * 2020-06-12 2023-03-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for positioning key point, device, and storage medium

Also Published As

Publication number Publication date
TWI761813B (zh) 2022-04-21
KR20210093875A (ko) 2021-07-28
JP7096431B2 (ja) 2022-07-05
CN111291631A (zh) 2020-06-16
JP2022520511A (ja) 2022-03-31
CN111291631B (zh) 2023-11-07
TW202129535A (zh) 2021-08-01

Similar Documents

Publication Publication Date Title
WO2021142904A1 (fr) Procédé d'analyse vidéo et procédé d'apprentissage de modèle associé, dispositif et appareil associés
CN111797893B (zh) 一种神经网络的训练方法、图像分类系统及相关设备
Singh et al. A deeply coupled ConvNet for human activity recognition using dynamic and RGB images
CN108132968B (zh) 网络文本与图像中关联语义基元的弱监督学习方法
CN107480261B (zh) 一种基于深度学习细粒度人脸图像快速检索方法
CN109754078A (zh) 用于优化神经网络的方法
WO2021190296A1 (fr) Procédé et dispositif de reconnaissance de geste dynamique
CN112784778B (zh) 生成模型并识别年龄和性别的方法、装置、设备和介质
CN112070044B (zh) 一种视频物体分类方法及装置
EP4244763A1 (fr) Réseau d'attention multirésolution pour la reconnaissance d'actions dans une vidéo
WO2021218470A1 (fr) Procédé et dispositif d'optimisation de réseau neuronal
CN114266897A (zh) 痘痘类别的预测方法、装置、电子设备及存储介质
CN115131613A (zh) 一种基于多向知识迁移的小样本图像分类方法
Tripathy et al. A real-time two-input stream multi-column multi-stage convolution neural network (TIS-MCMS-CNN) for efficient crowd congestion-level analysis
WO2022127333A1 (fr) Procédé et appareil d'apprentissage pour un modèle de segmentation d'image, procédé et appareil de segmentation d'image, et dispositif
WO2022088411A1 (fr) Procédé et appareil de détection d'image, procédé et appareil d'entraînement de modèle associé, ainsi que dispositif, support et programme
EP3995992A1 (fr) Procédé et système pour détecter une action dans un clip vidéo
CN115705706A (zh) 视频处理方法、装置、计算机设备和存储介质
CN117237756A (zh) 一种训练目标分割模型的方法、目标分割方法及相关装置
Negi et al. End-to-end residual learning-based deep neural network model deployment for human activity recognition
CN114155388B (zh) 一种图像识别方法、装置、计算机设备和存储介质
Yogaswara et al. Comparison of supervised learning image classification algorithms for food and non-food objects
CN112132175A (zh) 对象分类方法、装置、电子设备及存储介质
CN111275183A (zh) 视觉任务的处理方法、装置和电子系统
Treliński et al. Decision combination in classifier committee built on deep embedding features

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021521512

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20913355

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20913355

Country of ref document: EP

Kind code of ref document: A1