WO2021253938A1 - Neural network training method, video recognition method and apparatus - Google Patents

Neural network training method, video recognition method and apparatus

Info

Publication number
WO2021253938A1
Authority
WO
WIPO (PCT)
Prior art keywords
directed acyclic
node
neural network
acyclic graph
feature map
Prior art date
Application number
PCT/CN2021/086199
Other languages
English (en)
French (fr)
Inventor
王子豪
林宸
邵婧
盛律
闫俊杰
Original Assignee
深圳市商汤科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市商汤科技有限公司
Priority to KR1020227000769A (publication KR20220011208A)
Priority to JP2021570177A (publication JP7163515B2)
Publication of WO2021253938A1


Classifications

    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/042: Knowledge-based neural networks; logical representations of neural networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06V 10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 10/84: Image or video recognition or understanding using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/44: Event detection

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to a neural network training method, a video recognition method, and corresponding apparatus.
  • Video recognition refers to recognizing events that occur in a video.
  • In related art, a neural network designed for image recognition is generally adapted for video recognition after a simple transformation. Because such a network performs target recognition only in the image dimension, it ignores video features that cannot be extracted from individual images, which reduces its accuracy for video recognition.
  • the embodiments of the present disclosure provide at least a neural network training method, video recognition method and device.
  • In a first aspect, an embodiment of the present disclosure provides a neural network training method, including: obtaining sample videos and constructing a neural network including a plurality of directed acyclic graphs, the plurality of directed acyclic graphs including at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, where each edge of the directed acyclic graphs corresponds to multiple operation methods and each operation method has a corresponding weight parameter; training the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters; and selecting, based on the trained weight parameters, a target operation method for each edge of the multiple directed acyclic graphs, to obtain a trained neural network.
  • In this way, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the directed acyclic graphs corresponds to multiple operation methods. After the neural network is trained with the sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network can be obtained based on these weight parameters. A neural network trained by this method recognizes not only the spatial features in the image dimension but also the temporal features in the time dimension, so the trained neural network has a high recognition accuracy for video.
  • In a possible implementation, each directed acyclic graph includes two input nodes, and each node of the neural network corresponds to a feature map. Constructing a neural network including multiple directed acyclic graphs includes: using the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and using the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1. The feature map corresponding to the target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on the sampled video frames of the sample video, and the other input node is empty; the feature map of one input node of the second directed acyclic graph is the feature map output by the first directed acyclic graph, and the other input node is empty.
  • In a possible implementation, the feature map output by a directed acyclic graph is determined as follows: the feature maps corresponding to the nodes of the directed acyclic graph other than the input nodes are concatenated, and the concatenated feature map is used as the feature map output by the directed acyclic graph.
  • In a possible implementation, each edge in the directed acyclic graphs for extracting temporal features corresponds to multiple first operation methods, and each edge in the directed acyclic graphs for extracting spatial features corresponds to multiple second operation methods; the multiple first operation methods include the multiple second operation methods and at least one other operation method that is different from each of the second operation methods.
  • In a possible implementation, the neural network further includes a sampling layer connected to the first directed acyclic graph; the sampling layer is used to sample the sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the corresponding feature map, and input that feature map to the target input node of the first directed acyclic graph. The neural network also includes a fully connected layer connected to the output node of the last directed acyclic graph; the fully connected layer is used to determine the occurrence probabilities of the various events corresponding to the sample video based on the feature map output by the last directed acyclic graph. Training the neural network based on the sample videos and the event label corresponding to each sample video then includes: training the neural network based on the occurrence probabilities of the various events corresponding to the sample video calculated by the fully connected layer and the event label corresponding to each sample video, to obtain the trained weight parameters.
  • In a possible implementation, the feature map corresponding to each node in a directed acyclic graph other than the input nodes is obtained as follows: the feature map corresponding to the current node is generated according to the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each of those upper-level nodes.
  • In this way, the weight parameters can be used to control the influence that the operation methods on the edges between a node and its upper-level nodes have on that node's feature map; by controlling the weight parameters, the operation method corresponding to each such edge can be controlled, thereby changing the values of the node's feature map.
  • In a possible implementation, generating the feature map corresponding to the current node includes: for each current edge between the current node and an upper-level node pointing to the current node, processing the feature map of the upper-level node corresponding to the current edge with each of the operation methods corresponding to the current edge, to obtain the first intermediate feature map corresponding to each operation method on the current edge; performing a weighted summation of these first intermediate feature maps according to the weight parameters corresponding to the operation methods, to obtain the second intermediate feature map corresponding to the current edge; and performing a summation over the second intermediate feature maps of the multiple edges between the current node and its upper-level nodes, to obtain the feature map corresponding to the current node.
  • In this way, every candidate operation method contributes when determining the feature map of a node, which reduces the influence of any single operation method on the node's feature map and helps improve the recognition accuracy of the neural network.
  • In a possible implementation, selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters includes: for each edge of the directed acyclic graphs, using the operation method with the largest weight parameter on that edge as the target operation method corresponding to that edge.
  • In a possible implementation, selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters to obtain the trained neural network includes: for each node, when the number of edges pointing to the node is greater than a target number, determining the weight parameter of the target operation method corresponding to each edge pointing to the node; sorting the edges pointing to the node in descending order of those weight parameters and deleting all edges other than the first K, where K is the target number; and using the neural network after this deletion processing as the trained neural network.
  • In this way, on one hand the size of the neural network can be reduced, and on the other hand the calculation steps of the neural network can be reduced, improving its computational efficiency.
  • In a second aspect, embodiments of the present disclosure also provide a video recognition method, including: acquiring a video to be recognized; inputting the video to be recognized into a neural network trained by the above neural network training method and determining the occurrence probabilities of the various events corresponding to the video to be recognized; and regarding an event whose occurrence probability meets a preset condition as an event that occurs in the video to be recognized.
  • An embodiment of the present disclosure further provides a neural network training device, including: a construction part configured to obtain sample videos and construct a neural network including a plurality of directed acyclic graphs, the plurality of directed acyclic graphs including at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, where each edge of the directed acyclic graphs corresponds to multiple operation methods and each operation method has a corresponding weight parameter; a training part configured to train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters; and a selection part configured to select a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameters, so as to obtain a trained neural network.
  • In a possible implementation, each directed acyclic graph includes two input nodes, and each node of the neural network corresponds to a feature map; the construction part is further configured to: use the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and use the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1. The feature map corresponding to the target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on the sampled video frames of the sample video, and the other input node is empty; the feature map of one input node of the second directed acyclic graph is the feature map output by the first directed acyclic graph, and the other input node is empty.
  • In a possible implementation, the construction part is further configured to concatenate the feature maps corresponding to the nodes of the directed acyclic graph other than the input nodes, and use the concatenated feature map as the feature map output by the directed acyclic graph.
  • In a possible implementation, each edge in the directed acyclic graphs for extracting temporal features corresponds to multiple first operation methods, and each edge in the directed acyclic graphs for extracting spatial features corresponds to multiple second operation methods; the multiple first operation methods include the multiple second operation methods and at least one other operation method that is different from each of the second operation methods.
  • In a possible implementation, the neural network further includes a sampling layer connected to the first directed acyclic graph; the sampling layer is used to sample the sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the corresponding feature map, and input that feature map to the target input node of the first directed acyclic graph. The neural network also includes a fully connected layer connected to the output node of the last directed acyclic graph; the fully connected layer is used to determine the occurrence probabilities of the various events corresponding to the sample video based on the feature map of the output node. The training part is also configured to: train the neural network based on the occurrence probabilities of the various events corresponding to the sample video calculated by the fully connected layer and the event label corresponding to each sample video, to obtain the trained weight parameters.
  • In a possible implementation, the construction part is further configured to generate the feature map corresponding to the current node based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each of those upper-level nodes.
  • In a possible implementation, the construction part is further configured to: for each current edge between the current node and an upper-level node pointing to the current node, process the feature map of the upper-level node corresponding to the current edge with each of the operation methods corresponding to the current edge, to obtain the first intermediate feature map corresponding to each operation method on the current edge; perform a weighted summation of these first intermediate feature maps according to the weight parameters corresponding to the operation methods, to obtain the second intermediate feature map corresponding to the current edge; and perform a summation over the second intermediate feature maps of the multiple edges between the current node and its upper-level nodes, to obtain the feature map corresponding to the current node.
  • In a possible implementation, the selection part is further configured to, for each edge of the directed acyclic graphs, use the operation method with the largest weight parameter on that edge as the target operation method corresponding to that edge.
  • In a possible implementation, the selection part is further configured to: for each node, when the number of edges pointing to the node is greater than a target number, determine the weight parameter of the target operation method corresponding to each edge pointing to the node; sort the edges pointing to the node in descending order of those weight parameters and delete all edges other than the first K, where K is the target number; and use the neural network after this deletion processing as the trained neural network.
  • An embodiment of the present disclosure also provides a video recognition device, including: an acquiring part configured to acquire a video to be recognized; a first determining part configured to input the video to be recognized into a neural network trained by the neural network training method described in the first aspect or any possible implementation of the first aspect, and determine the occurrence probabilities of the various events corresponding to the video to be recognized; and a second determining part configured to regard an event whose occurrence probability meets a preset condition as an event that occurs in the video to be recognized.
  • Embodiments of the present disclosure also provide a computer device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor, and the processor communicates with the memory through the bus. When the machine-readable instructions are executed by the processor, the steps in the above first aspect or any possible implementation of the first aspect are executed, or the steps in the above second aspect are executed.
  • The embodiments of the present disclosure also provide a computer-readable storage medium having a computer program stored thereon; when the computer program is run by a processor, it executes the steps in the first aspect or any possible implementation of the first aspect, or the steps in the second aspect.
  • The embodiments of the present disclosure also provide a computer program, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the steps in the first aspect or any possible implementation of the first aspect, or executes the steps in the second aspect described above.
  • FIG. 1 shows a flowchart of a neural network training method provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of a network structure of a neural network including a directed acyclic graph provided by an embodiment of the present disclosure
  • FIG. 3a shows a schematic diagram of a processing process of temporal convolution provided by an embodiment of the present disclosure
  • FIG. 3b shows a schematic diagram of another temporal convolution processing process provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of a neural network structure provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a directed acyclic graph provided by an embodiment of the present disclosure
  • FIG. 6 shows a flowchart of a method for generating a feature map corresponding to a node provided by an embodiment of the present disclosure
  • FIG. 7 shows a schematic diagram of the overall structure of a constructed neural network provided by an embodiment of the present disclosure
  • FIG. 8 shows a schematic flowchart of a neural network training method provided by an embodiment of the present disclosure
  • FIG. 9 shows a schematic flowchart of a video recognition method provided by an embodiment of the present disclosure.
  • FIG. 10 shows a schematic structural diagram of a neural network training device provided by an embodiment of the present disclosure
  • FIG. 11 shows a schematic structural diagram of a video recognition device provided by an embodiment of the present disclosure
  • FIG. 12 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure
  • FIG. 13 shows a schematic structural diagram of another computer device provided by an embodiment of the present disclosure.
  • In related art, in the process of video recognition, an existing neural network for image recognition is generally modified for the task. Because such a network recognizes only in the image dimension, it ignores video features that cannot be extracted in that dimension, which affects the recognition accuracy of the neural network.
  • Related technology also uses evolution-based algorithms to search for neural networks for video recognition. However, this approach needs to train multiple neural networks in each round and then select the best-performing network for further adjustment; the amount of calculation in the adjustment process is large, and the training efficiency is low.
  • the embodiments of the present disclosure provide a neural network training method.
  • In this method, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the directed acyclic graphs corresponds to multiple operation methods. In this way, after the neural network is trained with the sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network can then be obtained based on these weight parameters. A neural network trained by this method recognizes not only the spatial features of the image dimension but also the temporal features of the time dimension, so it has a higher recognition accuracy for video.
  • The neural network training method provided in the embodiments of the present disclosure is generally executed by a computer device with certain computing capability. The computer device includes, for example, a terminal device, a server, or other processing equipment; the terminal device may be user equipment (UE), a mobile device, a user terminal, a personal computer, and so on. In some implementations, the method proposed in the embodiments of the present disclosure may also be implemented by a processor executing computer program code.
  • As shown in FIG. 1, a flowchart of a neural network training method provided by an embodiment of the present disclosure, the method includes steps 101 to 103, wherein:
  • Step 101: Obtain sample videos, and construct a neural network including multiple directed acyclic graphs. The multiple directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to multiple operation methods, and each operation method has a corresponding weight parameter.
  • Step 102 Train the neural network based on the sample video and the event label corresponding to each sample video to obtain a weight parameter after training.
  • Step 103 Based on the weight parameters after training, select a target operation method for each edge of the multiple directed acyclic graphs to obtain a trained neural network.
  • When constructing the neural network, the number of directed acyclic graphs used to extract temporal features and the number of directed acyclic graphs used to extract spatial features are preset. The nodes of a directed acyclic graph represent feature maps, and the edges between nodes represent operation methods.
  • In the process of constructing the neural network, the feature map output by the (N-1)-th directed acyclic graph can be used as the feature map of one input node of the (N+1)-th directed acyclic graph, and the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1. Each directed acyclic graph includes two input nodes, and either input node of the first directed acyclic graph of the neural network can be used as the target input node. The input of the target input node is the feature map obtained by performing feature extraction on the sampled video frames of the sample video, and the input node of the first directed acyclic graph other than the target input node is empty; the feature map corresponding to one input node of the second directed acyclic graph is the feature map output by the first directed acyclic graph, and the other input node is empty.
  • the directed acyclic graph may also include one, three or more input nodes.
  • In a possible implementation, the feature maps corresponding to the nodes of the directed acyclic graph other than the input nodes can be concatenated, and the concatenated feature map is used as the output feature map of the directed acyclic graph.
  • The network structure of a constructed neural network including directed acyclic graphs can be as shown in FIG. 2, which includes three directed acyclic graphs. The white dots indicate input nodes, and the black dots indicate the feature map obtained by concatenating the feature maps of the nodes other than the input nodes of a directed acyclic graph. One input node of the first directed acyclic graph corresponds to the feature map of the sampled video frames of the sample video, and the other input node is empty; the feature map corresponding to the output node of the first directed acyclic graph is used as one of the input nodes of the second directed acyclic graph, whose other input node is empty; the output feature maps of the second directed acyclic graph and the first directed acyclic graph are respectively used as the feature maps corresponding to the two input nodes of the third directed acyclic graph, and so on.
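  • As an illustration of this wiring, the following is a minimal sketch in PyTorch style, not taken from the patent; SearchNetwork, cells, and frame_features are assumed names, and each cell stands for one directed acyclic graph that accepts two input nodes, an empty input being passed as None:

```python
import torch
import torch.nn as nn

class SearchNetwork(nn.Module):
    """Chains directed acyclic graphs (cells) as described for FIG. 2."""

    def __init__(self, cells):
        super().__init__()
        self.cells = nn.ModuleList(cells)  # each cell takes two input nodes

    def forward(self, frame_features):
        outputs = []
        for i, cell in enumerate(self.cells):
            if i == 0:
                x0, x1 = frame_features, None  # target input node; other empty
            elif i == 1:
                x0, x1 = outputs[0], None      # output of the first DAG; other empty
            else:
                # outputs of DAGs N-1 and N feed DAG N+1
                x0, x1 = outputs[i - 2], outputs[i - 1]
            outputs.append(cell(x0, x1))
        return outputs[-1]
```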
  • In a possible implementation, each edge in the directed acyclic graphs used to extract temporal features corresponds to multiple first operation methods, and each edge in the directed acyclic graphs used to extract spatial features corresponds to multiple second operation methods; the multiple first operation methods include the multiple second operation methods and at least one other operation method that is different from each of the second operation methods.
  • Exemplarily, the multiple second operation methods corresponding to each edge in the directed acyclic graphs for extracting spatial features may include average pooling operations (such as 1×3×3 average pooling), maximum pooling operations (such as 1×3×3 maximum pooling), discrete convolution operations (such as 1×3×3 discrete convolution), and discrete convolutions with holes, that is, dilated discrete convolutions (such as 1×3×3 dilated discrete convolution); the multiple first operation methods corresponding to each edge in the directed acyclic graphs for extracting temporal features may include the average pooling operation, the maximum pooling operation, the discrete convolution operation, the dilated discrete convolution, and, in addition, different temporal convolutions.
  • the temporal convolution is used to extract temporal features.
  • The temporal convolution may be a temporal convolution with a size of 3+3×3, which means that the size of the convolution kernel in the temporal dimension is 3 and the size of the convolution kernel in the spatial dimension is 3×3. The processing process is illustrated in FIG. 3a, where Cin represents the input feature map, Cout represents the processed output feature map, ReLU represents the activation function, conv1×3×3 represents a convolution operation whose kernel size is 1 in the temporal dimension and 3×3 in the spatial dimension, conv3×1×1 represents a convolution operation whose kernel size is 3 in the temporal dimension and 1×1 in the spatial dimension, BatchNorm represents the normalization operation, and T and W, H represent the temporal and spatial dimensions respectively.
  • The temporal convolution may also be a temporal convolution with a size of 3+1×1, which means that the size of the convolution kernel in the temporal dimension is 3 and the size of the convolution kernel in the spatial dimension is 1×1; the processing process can be as shown in FIG. 3b, where conv1×1×1 represents a convolution whose kernel size is 1 in the temporal dimension and 1×1 in the spatial dimension. The meaning of the remaining symbols is the same as in FIG. 3a and will not be repeated here.
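  • A hedged PyTorch-style sketch of the 3+3×3 temporal convolution of FIG. 3a follows; it chains ReLU, conv1×3×3, conv3×1×1, and BatchNorm on a 5D video tensor of shape (batch, channels, T, H, W). The padding values and channel handling are illustrative assumptions; for the 3+1×1 variant of FIG. 3b, the first convolution's kernel would become (1, 1, 1):

```python
import torch
import torch.nn as nn

class TemporalConv3x3x3(nn.Module):
    """ReLU -> conv1x3x3 (spatial) -> conv3x1x1 (temporal) -> BatchNorm."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(),
            # kernel size 1 in time, 3x3 in space
            nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            # kernel size 3 in time, 1x1 in space
            nn.Conv3d(c_out, c_out, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.BatchNorm3d(c_out),
        )

    def forward(self, x):  # x: (batch, channels, T, H, W)
        return self.op(x)

# e.g. TemporalConv3x3x3(16, 16)(torch.randn(2, 16, 8, 32, 32))
```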
  • In the process of constructing the neural network, the structure of every directed acyclic graph used to extract temporal features is the same, but after the neural network training is completed, the target operation methods corresponding to the edges in different directed acyclic graphs used to extract temporal features may differ. Similarly, the structure of every directed acyclic graph used to extract spatial features is the same during construction, while after training the target operation methods corresponding to the edges in different directed acyclic graphs used to extract spatial features may also differ.
  • The directed acyclic graphs used to extract temporal features include two types: a first directed acyclic graph that changes the size and the number of channels of the input feature map, and a second directed acyclic graph that changes neither the size of the input feature map nor the number of channels. The first directed acyclic graph may include a first preset number of nodes, the second directed acyclic graph may include a second preset number of nodes, and the first and second preset numbers may be the same. Likewise, the directed acyclic graphs used to extract spatial features include two types: a third directed acyclic graph that changes the size and the number of channels of the input feature map, and a fourth directed acyclic graph that changes neither; the third directed acyclic graph may include a third preset number of nodes, the fourth directed acyclic graph may include a fourth preset number of nodes, and the third and fourth preset numbers may be the same.
  • The constructed neural network includes the above four types of directed acyclic graphs. The preset number of nodes for each type of directed acyclic graph includes the number of nodes at each level of the graph; after the number of nodes at each level is determined, the connection relationships between the nodes, and hence the directed acyclic graph itself, can be determined directly.
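  • The patent leaves the exact connection rule implicit; a common convention in such architectures, assumed in the sketch below, is that every non-input node receives an edge from every node that precedes it:

```python
def enumerate_edges(num_input_nodes, num_intermediate_nodes):
    """List (source, target) pairs, assuming each intermediate node is
    connected to every preceding node (an assumption, not a rule stated
    in the patent)."""
    edges = []
    first = num_input_nodes
    for j in range(first, first + num_intermediate_nodes):
        for i in range(j):
            edges.append((i, j))  # edge from node i to node j
    return edges

# enumerate_edges(2, 4) yields the 14 edges of a DAG with 2 input nodes
# and 4 intermediate nodes.
```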
  • Exemplarily, the network structure of a neural network containing four directed acyclic graphs can be as shown in FIG. 4. The sample video is first input to the sampling layer and sampled; feature extraction is then performed on the sampled video frames, and the result is input into the first directed acyclic graph. The output of the last directed acyclic graph is input into the fully connected layer, and the output of the fully connected layer is the output of the neural network.
  • In this way, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the directed acyclic graphs corresponds to multiple operation methods. After the neural network is trained with the sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network can be obtained based on these weight parameters. A neural network trained by this method recognizes not only the spatial features in the image dimension but also the temporal features in the time dimension, so it has a high recognition accuracy for video.
  • When determining the feature map corresponding to each node of a directed acyclic graph other than the input nodes, the feature map corresponding to the current node is generated according to the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each of those upper-level nodes. Exemplarily, if the nodes pointing to node 3 are node 0, node 1, and node 2, the feature map corresponding to node 3 can be determined based on the feature maps corresponding to node 0, node 1, and node 2 and the weight parameters of the operation methods corresponding to the edges between each of these nodes and node 3.
  • If the directed acyclic graph is a directed acyclic graph used to extract temporal features, the operation methods corresponding to the edges between node 0, node 1, node 2 and node 3 are first operation methods; if the directed acyclic graph is used to extract spatial features, the operation methods corresponding to those edges are second operation methods.
  • In this way, the weight parameters can be used to control the influence that the operation methods on the edges between a node and its upper-level nodes have on that node's feature map; by controlling the weight parameters, the operation method corresponding to each such edge can be controlled, thereby changing the values of the node's feature map.
  • As shown in FIG. 6, generating the feature map corresponding to the current node may include the following steps. Step 601: For the current edge between the current node and each upper-level node pointing to the current node, process the feature map of the upper-level node corresponding to the current edge based on each of the operation methods corresponding to the current edge, to obtain the first intermediate feature map corresponding to each operation method on the current edge.
  • Exemplarily, suppose the directed acyclic graph where the current node is located is a directed acyclic graph for temporal feature extraction, there are three current edges pointing to the current node, and each current edge corresponds to six first operation methods. Then, for any current edge, the feature map of the upper-level node connected to that edge can be processed separately by each of the operation methods corresponding to the edge, obtaining six first intermediate feature maps for that edge; since three current edges point to the current node, eighteen first intermediate feature maps are obtained in total. The case where the directed acyclic graph containing the current node is used for spatial feature extraction is handled in the same way, with the second operation methods.
  • Step 602 Perform a weighted summation of the first intermediate feature map corresponding to each of the operation methods corresponding to the current edge according to the weight parameters corresponding to each of the operation methods, to obtain a second intermediate feature map corresponding to the current edge.
  • the weight parameter is a model parameter to be trained.
  • the weight parameter may be randomly assigned, and then continuously adjusted during the training process of the neural network.
  • Each operation method corresponding to the current edge pointing to the current node has a corresponding weight parameter.
  • When the first intermediate feature maps corresponding to the operation methods are weighted and summed according to the corresponding weight parameters, the value at each position of a first intermediate feature map is multiplied by the weight parameter of the operation method corresponding to that map, and the products at corresponding positions are then added to obtain the second intermediate feature map corresponding to the current edge.
  • Continuing the above example, each current edge corresponds to six first operation methods, and each first operation method has a corresponding weight parameter, so each current edge corresponds to six first intermediate feature maps; the six first intermediate feature maps corresponding to each current edge are weighted and summed according to the corresponding weight parameters to obtain the second intermediate feature map corresponding to that edge.
  • Exemplarily, if both edge 1 and edge 2 point to the current node and the operation methods corresponding to edge 1 and edge 2 both include an average pooling operation, the weight parameter of the average pooling operation corresponding to edge 1 may be 70%, while the weight parameter of the average pooling operation corresponding to edge 2 may be 10%.
  • In some embodiments, the second intermediate feature map corresponding to the edge between the i-th node and the j-th node can be expressed as $\bar{o}^{(i,j)}(x_i) = \sum_{o \in O} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in O} \exp(\alpha_{o'}^{(i,j)})} \, o(x_i)$, where o and o' represent operation methods, O represents the set of operation methods between the i-th node and the j-th node, $\alpha_o^{(i,j)}$ represents the weight parameter of operation method o on that edge, $x_i$ represents the feature map corresponding to the i-th node, and $o(x_i)$ represents the first intermediate feature map obtained by applying operation method o to $x_i$.
  • Step 603 Perform a summation operation on the second intermediate feature maps corresponding to the multiple edges between the current node and each upper-level node pointing to the current node to obtain a feature map corresponding to the current node.
  • The sizes of the second intermediate feature maps are the same, so in a possible implementation the values at corresponding positions of the second intermediate feature maps can be added to obtain the feature map corresponding to the current node.
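  • Steps 601 to 603 can be sketched as follows in PyTorch style (an illustrative sketch; MixedEdge, ops, and node_feature are assumed names, and the softmax normalization follows the formula above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedEdge(nn.Module):
    """One edge of a DAG holding all candidate operations and their weights."""

    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                     # candidate operations
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # weight parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        # step 601: first intermediate feature maps; step 602: weighted sum
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

def node_feature(incoming):
    """Step 603: sum the second intermediate feature maps over all edges.

    `incoming` is a list of (edge, upper_level_feature_map) pairs."""
    return sum(edge(x) for edge, x in incoming)
```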
  • In a possible implementation, the constructed neural network also includes a sampling layer and a fully connected layer. The sampling layer is used to sample the video input to the neural network to obtain sampled video frames and to perform feature extraction on the sampled video frames to obtain the feature map corresponding to the sampled video frames; that feature map is then input to the target input node of the first directed acyclic graph.
  • the fully connected layer is used to determine the occurrence probability of various events corresponding to the sample video based on the feature map output by the last directed acyclic graph.
  • The overall structure of the constructed neural network is exemplarily shown in FIG. 7, which includes a sampling layer, three directed acyclic graphs, and a fully connected layer; the output of the fully connected layer is the output of the neural network.
  • In this way, every candidate operation method contributes when determining the feature map of a node, which reduces the influence of any single operation method on the node's feature map and helps improve the recognition accuracy of the neural network.
  • the event tags corresponding to the sample video are used to indicate events that occurred in the sample video.
  • the events that occurred in the sample video may include people running, puppies playing, two people playing badminton, and so on.
  • When training the neural network, the method shown in FIG. 8 can be used, including the following steps:
  • Step 801 Input the sample video into the neural network, and output the probability of occurrence of various events corresponding to the sample video.
  • The number of the various events corresponding to the sample video is the same as the number of event labels in the sample videos used to train the neural network. For example, if the neural network is trained with sample videos covering 400 event labels, then after any video is input to the neural network, the neural network can output the respective occurrence probabilities of the 400 events for that video.
  • Step 802 Determine a predicted event corresponding to the sample video based on the occurrence probabilities of various events corresponding to the sample video.
  • the corresponding event with the highest probability of occurrence can be determined as the event predicted by the neural network.
  • In some embodiments, a sample video may carry multiple event tags, for example both "puppy playing" and "two people playing badminton". Therefore, in the process of determining the predicted events corresponding to the sample video based on the occurrence probabilities of the various events, the events whose occurrence probability is greater than a preset probability can also be determined as the predicted events corresponding to the sample video.
  • Step 803 Determine the loss value in this training process based on the predicted event corresponding to the sample video and the event label of the sample video.
  • the cross-entropy loss in this training process can be determined based on the predicted event corresponding to the sample video and the event label of the sample video.
  • Step 804 Determine whether the loss value during this training process is less than a preset loss value.
  • If the judgment result is yes, step 805 is executed; if the judgment result is no, the parameter values of the neural network parameters in this training process are adjusted, and the process returns to step 801.
  • Here, the adjusted neural network parameters include the weight parameters of the operation methods corresponding to the edges of the directed acyclic graphs. Since each weight parameter affects the choice of the target operation method for an edge of a directed acyclic graph, these weight parameters serve as structural parameters of the neural network. The adjusted neural network parameters also include operating parameters, for example the sizes and weights of the convolution kernels of the convolution operations.
  • In the training process, a gradual learning-rate decay strategy can be used: a hyperparameter S is set in advance, meaning that the learning rate is decayed once every S optimization steps of the operating parameters and structural parameters, with a preset decay amplitude d. This achieves a gradual decay of the learning rate and thereby the synchronous learning, that is, synchronous optimization, of the structural parameters and the operating parameters.
  • In some embodiments, the structural parameters and the operating parameters are optimized according to the bilevel objective $\min_{\alpha} L(\omega^{*}(\alpha), \alpha)$ subject to $\omega^{*}(\alpha) = \arg\min_{\omega} L(\omega, \alpha)$, where $\alpha$ represents the structural parameters, $\omega$ represents the operating parameters, $L(\omega, \alpha)$ represents the loss value calculated based on $\omega$ when $\alpha$ is fixed, and $\omega^{*}(\alpha)$ represents the value of $\omega$ that, with $\alpha$ fixed, minimizes $L(\omega, \alpha)$ through training. $L(\omega^{*}(\alpha), \alpha)$ means that, keeping the optimized $\omega$ unchanged, the loss value is calculated based on $\alpha$ and $\alpha$ is trained so that $L(\omega^{*}(\alpha), \alpha)$ is smallest.
  • In this nested optimization, $\alpha$ needs to be adjusted continuously, and each adjustment of $\alpha$ requires retraining $\omega$; for example, if each training of $\omega$ requires 100 computations and $\alpha$ is adjusted 100 times, 10,000 computations are needed in total, so the amount of calculation is large.
  • In order to reduce the amount of calculation, the optimization is generally based on the following one-step approximation: $\nabla_{\alpha} L(\omega^{*}(\alpha), \alpha) \approx \nabla_{\alpha} L(\omega - \xi \nabla_{\omega} L(\omega, \alpha), \alpha)$, where $\xi$ represents the learning rate of the operating parameters; that is, a single gradient step on $\omega$ stands in for the fully retrained $\omega^{*}(\alpha)$.
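  • A hedged sketch of the resulting alternating optimization with the stepwise learning-rate decay described above follows; the optimizer choices, the validation split used for the structural parameters, and the methods structural_parameters() and operating_parameters() are assumptions, not details given by the patent:

```python
import torch

def alternating_search(model, train_iter, val_iter, loss_fn,
                       num_steps, S, d, lr_w, lr_a):
    opt_w = torch.optim.SGD(model.operating_parameters(), lr=lr_w)
    opt_a = torch.optim.Adam(model.structural_parameters(), lr=lr_a)
    # decay the operating-parameter learning rate by d once every S steps
    sched_w = torch.optim.lr_scheduler.StepLR(opt_w, step_size=S, gamma=d)
    for step in range(num_steps):
        # alpha step: update the structural parameters, using the current
        # omega in place of the fully trained omega*(alpha) (first-order)
        x_v, y_v = next(val_iter)
        opt_a.zero_grad()
        loss_fn(model(x_v), y_v).backward()
        opt_a.step()
        # omega step: one gradient update of the operating parameters
        x_t, y_t = next(train_iter)
        opt_w.zero_grad()
        loss_fn(model(x_t), y_t).backward()
        opt_w.step()
        sched_w.step()
```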
  • In this way, the network structure and the network parameters inside the neural network can be searched out together; compared with first determining the network structure and then determining the network parameters, the efficiency of determining the neural network is improved.
  • Step 805 Determine the trained neural network model based on the trained neural network parameters.
  • In some embodiments, a target operation method can be selected for each edge of the multiple directed acyclic graphs based on the trained weight parameters, and the neural network model obtained after a target operation method is determined for each edge is the trained neural network. In a possible implementation, for each edge of the directed acyclic graphs, the operation method with the largest weight parameter on that edge is used as the target operation method corresponding to that edge. Exemplarily, if the target number is two and three edges point to a node, the weight parameters of the target operation methods corresponding to the three edges pointing to the node can be determined respectively; the three edges are then sorted in descending order of those weight parameters, the first two edges are kept, and the third edge is deleted.
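  • A sketch of this discretization step follows (illustrative only; the data layout edge_alphas, mapping an edge (i, j) to the list of trained weight parameters of its candidate operations, is an assumption):

```python
def discretize(edge_alphas, num_nodes, K):
    """Keep the strongest operation per edge, then the K strongest
    incoming edges per node; returns {edge: chosen_op_index}."""
    chosen = {e: max(range(len(a)), key=a.__getitem__)  # argmax op per edge
              for e, a in edge_alphas.items()}
    kept = []
    for j in range(num_nodes):
        incoming = [e for e in edge_alphas if e[1] == j]
        # sort by the weight of each edge's target operation, descending
        incoming.sort(key=lambda e: edge_alphas[e][chosen[e]], reverse=True)
        kept.extend(incoming[:K])  # delete all edges after the first K
    return {e: chosen[e] for e in kept}

# discretize({(0, 2): [0.1, 0.7], (1, 2): [0.5, 0.2]}, num_nodes=3, K=1)
# -> {(0, 2): 1}
```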
  • In this way, on one hand the size of the neural network can be reduced, and on the other hand the calculation steps of the neural network can be reduced, improving its computational efficiency.
  • Based on the same concept, an embodiment of the present disclosure also provides a video recognition method. As shown in FIG. 9, a schematic flowchart of the video recognition method provided by the embodiment of the present disclosure, the method includes the following steps:
  • Step 901 Obtain a video to be recognized.
  • Step 902 Input the video to be recognized into a pre-trained neural network, and determine the occurrence probabilities of various events corresponding to the video to be recognized.
  • the neural network is obtained by training based on the neural network training method provided in the foregoing embodiment.
  • Step 903 Use an event whose occurrence probability meets a preset condition as an event that occurs in the to-be-recognized video.
  • the event whose occurrence probability meets the preset condition may be an event with the largest occurrence probability, or an event whose occurrence probability is greater than a preset probability value.
  • In a possible implementation, the neural network includes a sampling layer, a feature extraction layer, and a fully connected layer, and the feature extraction layer includes multiple directed acyclic graphs. The sampling layer can sample the video to be recognized to obtain multiple sampled video frames, then perform feature extraction on the sampled video frames to obtain the corresponding feature map, and input that feature map to the feature extraction layer.
  • the feature extraction layer includes multiple directed acyclic graphs for temporal feature extraction and directed acyclic graphs for spatial feature extraction.
  • the number of each type of directed acyclic graphs is preset.
  • the number of nodes in each type of directed acyclic graph is also preset.
  • The difference between a directed acyclic graph for temporal feature extraction and a directed acyclic graph for spatial feature extraction lies in the operation methods corresponding to their edges, as described above.
  • After the sampling layer inputs the feature map corresponding to the sampled video frames to the feature extraction layer, the feature map is input to the target input node of the first directed acyclic graph, whose other input node is empty; one input node of the second directed acyclic graph is connected to the output node of the first directed acyclic graph and its other input node is empty; one input node of the third directed acyclic graph is connected to the output node of the second directed acyclic graph and the other input node is connected to the output node of the first directed acyclic graph; and so on. The feature map corresponding to the output node of the last directed acyclic graph is input to the fully connected layer.
  • After receiving the feature map output by the feature extraction layer, the fully connected layer can determine the occurrence probabilities of the various events for the input video to be recognized based on the input feature map, where the various events corresponding to the video to be recognized may be the event tags corresponding to the sample videos used when the neural network was trained.
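  • An end-to-end sketch of steps 901 to 903 under assumed interfaces follows (the uniform frame sampling, the tensor layout, and the thresholding condition are illustrative; trained_net stands for a neural network trained as above):

```python
import torch

def sample_frames(video, num_frames=8):
    """Placeholder uniform sampling; video: (T, C, H, W) tensor."""
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    return video[idx].permute(1, 0, 2, 3).unsqueeze(0)  # (1, C, T, H, W)

def recognize(video, trained_net, event_names, threshold=0.5):
    with torch.no_grad():
        probs = trained_net(sample_frames(video)).squeeze(0)
    # step 903: keep events whose occurrence probability meets the condition
    return [name for name, p in zip(event_names, probs) if p > threshold]
```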
  • In this way, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the directed acyclic graphs corresponds to multiple operation methods; after the neural network is trained with sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network can be obtained based on these weight parameters. A neural network trained in this way recognizes not only the spatial features in the image dimension but also the temporal features in the time dimension, so it has a higher recognition accuracy for video.
  • Based on the same concept, the embodiment of the present disclosure also provides a neural network training device corresponding to the neural network training method. Since the principle by which the device in the embodiment of the present disclosure solves the problem is similar to that of the above neural network training method, the implementation of the device can refer to the implementation of the method, and repeated descriptions are omitted.
  • As shown in FIG. 10, a schematic diagram of the architecture of a neural network training device provided by an embodiment of the present disclosure, the device includes a construction part 1001, a training part 1002, and a selection part 1003, wherein:
  • the construction part 1001 is configured to obtain sample videos and construct a neural network including a plurality of directed acyclic graphs; the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to multiple operation methods, and each operation method has a corresponding weight parameter;
  • the training part 1002 is configured to train the neural network based on the sample video and the event label corresponding to each sample video to obtain the weight parameter after training;
  • the selection part 1003 is configured to select a target operation method for each edge of the plurality of directed acyclic graphs based on the weight parameters after training, so as to obtain a trained neural network.
  • In a possible implementation, each directed acyclic graph includes two input nodes, and each node of the neural network corresponds to a feature map; the construction part 1001 is further configured to: use the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and use the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1. The feature map corresponding to the target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on the sampled video frames of the sample video, and the other input node is empty; the feature map of one input node of the second directed acyclic graph is the feature map output by the first directed acyclic graph, and the other input node is empty.
  • In a possible implementation, the construction part 1001 is further configured to concatenate the feature maps corresponding to the nodes of the directed acyclic graph other than the input nodes, and use the concatenated feature map as the feature map output by the directed acyclic graph.
  • In a possible implementation, each edge in the directed acyclic graphs for extracting temporal features corresponds to multiple first operation methods, and each edge in the directed acyclic graphs for extracting spatial features corresponds to multiple second operation methods; the multiple first operation methods include the multiple second operation methods and at least one other operation method that is different from each of the second operation methods.
  • In a possible implementation, the neural network further includes a sampling layer connected to the first directed acyclic graph; the sampling layer is used to sample the sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the corresponding feature map, and input that feature map to the target input node of the first directed acyclic graph. The neural network also includes a fully connected layer connected to the output node of the last directed acyclic graph; the fully connected layer is used to determine the occurrence probabilities of the various events corresponding to the sample video based on the feature map output by the last directed acyclic graph. The training part 1002 is also configured to: train the neural network based on the occurrence probabilities of the various events corresponding to the sample video calculated by the fully connected layer and the event label corresponding to each sample video, to obtain the trained weight parameters.
  • the construction part 1001 is further configured to generate the feature map corresponding to a current node based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node.
  • the construction part 1001 is further configured to: for the current edge between the current node and each upper-level node pointing to the current node, process the feature map of the upper-level node corresponding to the current edge with each of the operation methods corresponding to the current edge, to obtain a first intermediate feature map corresponding to each of the operation methods of the current edge; perform a weighted sum of the first intermediate feature maps corresponding to the operation methods of the current edge according to the weight parameters corresponding to the respective operation methods, to obtain a second intermediate feature map corresponding to the current edge; and perform a summation operation on the second intermediate feature maps respectively corresponding to the multiple edges between the current node and the respective upper-level nodes, to obtain the feature map corresponding to the current node.
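As an illustration of this per-edge mixture, the following sketch implements one edge as a softmax-weighted sum of candidate operations, in the style commonly used for differentiable architecture search; the names (MixedOp, candidate_ops, alphas) and the PyTorch framing are assumptions for illustration, not taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the DAG: applies every candidate operation to the
    upper-level node's feature map and returns their weighted sum."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # One trainable weight parameter per operation method on this edge.
        self.alphas = nn.Parameter(torch.zeros(len(candidate_ops)))

    def forward(self, x):
        # First intermediate feature maps: one per operation method.
        firsts = [op(x) for op in self.ops]
        # Softmax turns the raw weight parameters into mixing weights.
        w = F.softmax(self.alphas, dim=0)
        # Second intermediate feature map: weighted sum over operations.
        return sum(wi * fi for wi, fi in zip(w, firsts))

# Shape-preserving candidate operations for a 5D video tensor (B, C, T, H, W).
ops = [nn.AvgPool3d(3, stride=1, padding=1),
       nn.MaxPool3d(3, stride=1, padding=1),
       nn.Conv3d(8, 8, kernel_size=(1, 3, 3), padding=(0, 1, 1))]
edge = MixedOp(ops)
out = edge(torch.randn(2, 8, 4, 16, 16))
```

The current node's feature map would then be the elementwise sum of the second intermediate feature maps produced by all of its incoming edges.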
  • the selection part 1003 is further configured to, for each edge of the directed acyclic graph, take the operation method with the largest weight parameter corresponding to that edge as the target operation method corresponding to that edge.
  • the selection part 1003 is further configured to: for each node, when the number of edges pointing to the node is greater than a target number, determine the weight parameter of the target operation method corresponding to each edge pointing to the node; sort the edges pointing to the node in descending order of the corresponding weight parameters, and delete all edges other than the top K edges, where K is the target number; and use the neural network after the deletion processing as the trained neural network.
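A small sketch of this discretization step: per edge, keep the operation with the largest trained weight; per node, keep only the top-K incoming edges ranked by that weight. The data layout (edge id mapped to per-operation weights) is an assumption for illustration:

```python
def discretize(incoming_edges, k):
    """incoming_edges: {edge_id: {op_name: trained_weight}} for one node."""
    # Target operation per edge: the op with the largest weight parameter.
    best = {e: max(ops, key=ops.get) for e, ops in incoming_edges.items()}
    # If more than K edges point at this node, keep the K edges whose
    # target operation has the largest weight, and delete the rest.
    ranked = sorted(incoming_edges,
                    key=lambda e: incoming_edges[e][best[e]],
                    reverse=True)
    return {e: best[e] for e in ranked[:k]}

edges = {"e0": {"avg_pool": 0.7, "max_pool": 0.1},
         "e1": {"avg_pool": 0.1, "max_pool": 0.3},
         "e2": {"avg_pool": 0.2, "max_pool": 0.25}}
print(discretize(edges, k=2))  # keeps e0 (avg_pool) and e1 (max_pool)
```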
  • an embodiment of the present disclosure also provides a video recognition device corresponding to the video recognition method.
  • the device includes: an acquiring part 1101, a first determining part 1102, and a second determining part 1103, wherein: the acquiring part 1101 is configured to acquire a video to be recognized; the first determining part 1102 is configured to input the video to be recognized into a neural network trained by the neural network training method described in the foregoing embodiments, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized; and the second determining part 1103 is configured to take an event whose corresponding occurrence probability meets a preset condition as the event occurring in the video to be recognized.
  • a schematic structural diagram of a computer device provided by an embodiment of this application includes a processor 1201, a memory 1202, and a bus 1203.
  • the memory 1202 is used to store execution instructions and includes a memory 12021 and an external memory 12022; the memory 12021, also called internal memory, is configured to temporarily store operation data in the processor 1201 and data exchanged with the external memory 12022 such as a hard disk; the processor 1201 exchanges data with the external memory 12022 through the memory 12021.
  • when the computer device is running, the processor 1201 and the memory 1202 communicate through the bus 1203, so that the processor 1201 executes the following instructions: obtain sample videos, and construct a neural network including a plurality of directed acyclic graphs, the plurality of directed acyclic graphs including at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of the directed acyclic graphs corresponding to a plurality of operation methods, and each operation method having a corresponding weight parameter; train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters; and select, based on the trained weight parameters, a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.
  • the embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, it executes the steps of the neural network training method described in the above method embodiments.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • the computer program product of the neural network training method provided by the embodiments of the present disclosure includes a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the steps of the neural network training method described in the above method embodiments, for which reference may be made to the above method embodiments; details are not repeated here.
  • a schematic structural diagram of a computer device 1300 provided in an embodiment of this application includes a processor 1301, a memory 1302, and a bus 1303.
  • the memory 1302 is used to store execution instructions and includes a memory 13021 and an external memory 13022; the memory 13021, also called internal memory, is configured to temporarily store operation data in the processor 1301 and data exchanged with the external memory 13022 such as a hard disk; the processor 1301 exchanges data with the external memory 13022 through the memory 13021.
  • when the computer device 1300 is running, the processor 1301 and the memory 1302 communicate through the bus 1303, so that the processor 1301 executes the following instructions: acquire a video to be recognized; input the video to be recognized into a neural network trained based on the neural network training method described in the foregoing embodiments, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized; and take an event whose corresponding occurrence probability meets a preset condition as the event occurring in the video to be recognized.
  • the embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, it executes the steps of the video recognition method described in the above method embodiments.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • the computer program product of the video recognition method provided by the embodiments of the present disclosure includes a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the steps of the video recognition method described in the above method embodiments, for which reference may be made to the above method embodiments; details are not repeated here.
  • the embodiments of the present disclosure also provide a computer program which, when executed by a processor, implements any one of the methods in the foregoing embodiments.
  • the computer program product can be implemented by hardware, software, or a combination thereof.
  • in an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
  • for the working processes of the system and device described above, reference can be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual couplings, direct couplings, or communication connections may be indirect couplings or communication connections between devices or units through some communication interfaces, and may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of a software functional unit and sold or used as an independent product, they can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • the embodiments of the present disclosure obtain sample videos and construct a neural network including multiple directed acyclic graphs; the multiple directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to multiple operation methods, and each operation method has a corresponding weight parameter; the neural network is trained based on the sample videos and the event label corresponding to each sample video to obtain trained weight parameters; and based on the trained weight parameters, a target operation method is selected for each edge of the multiple directed acyclic graphs to obtain a trained neural network.
  • the neural network constructed in the foregoing embodiments includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the directed acyclic graphs corresponds to multiple operation methods; in this way, after training the neural network with the sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network can be obtained based on these trained weight parameters. A neural network trained by this method performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, so the trained neural network has high recognition accuracy for videos.


Abstract

The present disclosure provides a neural network training method, a video recognition method, and apparatuses, including: obtaining sample videos and constructing a neural network including a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; training the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters; and selecting, based on the trained weight parameters, a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.

Description

Neural network training method, video recognition method and apparatus
Cross-Reference to Related Application
The present disclosure is based on, and claims priority to, Chinese Patent Application No. 202010567864.7, filed on June 19, 2020 and entitled "Neural network training method, video recognition method and apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and relates to a neural network training method, a video recognition method, and apparatuses.
Background
Video recognition refers to recognizing the events occurring in a video. In the related art, a neural network used for image recognition is generally adapted, after simple modification, for video recognition.
However, since a neural network for image recognition performs target recognition in the image dimension, it ignores some video features that cannot be extracted from the image dimension, which affects the accuracy of video recognition by the neural network.
Summary
Embodiments of the present disclosure provide at least a neural network training method, a video recognition method, and apparatuses.
In a first aspect, an embodiment of the present disclosure provides a neural network training method, including: obtaining sample videos and constructing a neural network including a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; training the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters; and selecting, based on the trained weight parameters, a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.
In the above method, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the directed acyclic graphs corresponds to multiple operation methods. In this way, after the neural network is trained with the sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network can be obtained based on these weight parameters. A neural network trained by this method performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, so it has high recognition accuracy for videos.
In some possible implementations, the directed acyclic graph includes two input nodes, and each node of the neural network corresponds to one feature map. Constructing the neural network including the plurality of directed acyclic graphs includes: taking the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and taking the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, N being an integer greater than 1; where the feature map corresponding to the target input node in the first directed acyclic graph of the neural network is a feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node apart from the target input node is empty; in the second directed acyclic graph of the neural network, the feature map of one input node is the feature map output by the first directed acyclic graph, and the other input node is empty.
In some possible implementations, the feature map output by a directed acyclic graph is determined as follows: the feature maps corresponding to the nodes other than the input nodes in the directed acyclic graph are concatenated, and the concatenated feature map is taken as the feature map output by the directed acyclic graph.
In some possible implementations, each edge in the directed acyclic graph for extracting temporal features corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting spatial features corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method that is different from each of the second operation methods.
In some possible implementations, the neural network further includes a sampling layer connected to the first directed acyclic graph; the sampling layer is used to sample a sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the feature map corresponding to the sampled video frames, and input the feature map corresponding to the sampled video frames to the target input node of the first directed acyclic graph. The neural network further includes a fully connected layer connected to the output node of the last directed acyclic graph; the fully connected layer is used to determine the occurrence probabilities of multiple events corresponding to the sample video based on the feature map output by the last directed acyclic graph. The training the neural network based on the sample videos and the event label corresponding to each sample video to obtain the trained weight parameters includes: training the neural network based on the occurrence probabilities of the multiple events corresponding to the sample video as calculated by the fully connected layer and the event label corresponding to each sample video, to obtain the trained weight parameters.
In some possible implementations, the feature map corresponding to each node other than the input nodes in the directed acyclic graph is obtained as follows: the feature map corresponding to a current node is generated based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node.
Through the above method, the weight parameters can control the influence that the operation methods on the edge between any node and its upper-level node have on that node's feature map; therefore, by controlling the weight parameters, the operation method corresponding to the edge between any node and its upper-level node can be controlled, thereby changing the value of that node's feature map.
In some possible implementations, the generating the feature map corresponding to the current node based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node includes: for the current edge between the current node and each upper-level node pointing to the current node, processing the feature map of the upper-level node corresponding to the current edge with each operation method corresponding to the current edge, to obtain a first intermediate feature map corresponding to each operation method of the current edge; performing a weighted sum of the first intermediate feature maps corresponding to the operation methods of the current edge according to the weight parameters corresponding to the respective operation methods, to obtain a second intermediate feature map corresponding to the current edge; and performing a summation operation on the second intermediate feature maps respectively corresponding to the multiple edges between the current node and the respective upper-level nodes pointing to the current node, to obtain the feature map corresponding to the current node.
Through this method, each operation method is used when determining a node's feature map, which reduces the influence of any single operation method on the node's feature map and helps improve the recognition accuracy of the neural network.
In some possible implementations, the selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameters includes: for each edge of the directed acyclic graph, taking the operation method with the largest weight parameter corresponding to that edge as the target operation method corresponding to that edge.
In some possible implementations, the selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameters to obtain the trained neural network includes: for each node, when the number of edges pointing to the node is greater than a target number, determining the weight parameter of the target operation method corresponding to each edge pointing to the node; sorting the edges pointing to the node in descending order of the corresponding weight parameters, and deleting all edges other than the top K edges, where K is the target number; and taking the neural network after the deletion processing as the trained neural network.
Through this method, on the one hand, the size of the neural network can be reduced; on the other hand, the computation steps of the neural network can be reduced, improving the computational efficiency of the neural network.
In a second aspect, an embodiment of the present disclosure also provides a video recognition method, including: acquiring a video to be recognized; inputting the video to be recognized into a neural network trained by the neural network training method of the first aspect or any possible implementation of the first aspect, and determining the occurrence probabilities of multiple events corresponding to the video to be recognized; and taking an event whose corresponding occurrence probability meets a preset condition as the event occurring in the video to be recognized.
In a third aspect, an embodiment of the present disclosure provides a neural network training apparatus, including: a construction part configured to obtain sample videos and construct a neural network including a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; a training part configured to train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters; and a selection part configured to select, based on the trained weight parameters, a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.
In some possible implementations, the directed acyclic graph includes two input nodes, and each node of the neural network corresponds to one feature map; the construction part is further configured to: take the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and take the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, N being an integer greater than 1; where the feature map corresponding to the target input node in the first directed acyclic graph of the neural network is a feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node apart from the target input node is empty; in the second directed acyclic graph of the neural network, the feature map of one input node is the feature map output by the first directed acyclic graph, and the other input node is empty.
In some possible implementations, the construction part is further configured to concatenate the feature maps corresponding to the nodes other than the input nodes in the directed acyclic graph, and take the concatenated feature map as the feature map output by the directed acyclic graph.
In some possible implementations, each edge in the directed acyclic graph for extracting temporal features corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting spatial features corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method that is different from each of the second operation methods.
In some possible implementations, the neural network further includes a sampling layer connected to the first directed acyclic graph, the sampling layer being used to sample the sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the feature map corresponding to the sampled video frames, and input the feature map corresponding to the sampled video frames to the target input node of the first directed acyclic graph; the neural network further includes a fully connected layer connected to the output node of the last directed acyclic graph, the fully connected layer being used to determine the occurrence probabilities of multiple events corresponding to the sample video based on the feature map of that output node; the training part is further configured to: train the neural network based on the occurrence probabilities of the multiple events corresponding to the sample video as calculated by the fully connected layer and the event label corresponding to each sample video, to obtain the trained weight parameters.
In some possible implementations, the construction part is further configured to generate the feature map corresponding to a current node based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node.
In some possible implementations, the construction part is further configured to: for the current edge between the current node and each upper-level node pointing to the current node, process the feature map of the upper-level node corresponding to the current edge with each operation method corresponding to the current edge, to obtain a first intermediate feature map corresponding to each operation method of the current edge; perform a weighted sum of the first intermediate feature maps corresponding to the operation methods of the current edge according to the weight parameters corresponding to the respective operation methods, to obtain a second intermediate feature map corresponding to the current edge; and perform a summation operation on the second intermediate feature maps respectively corresponding to the multiple edges between the current node and the respective upper-level nodes pointing to the current node, to obtain the feature map corresponding to the current node.
In some possible implementations, the selection part is further configured to, for each edge of the directed acyclic graph, take the operation method with the largest weight parameter corresponding to that edge as the target operation method corresponding to that edge.
In some possible implementations, the selection part is further configured to: for each node, when the number of edges pointing to the node is greater than a target number, determine the weight parameter of the target operation method corresponding to each edge pointing to the node; sort the edges pointing to the node in descending order of the corresponding weight parameters, and delete all edges other than the top K edges, where K is the target number; and take the neural network after the deletion processing as the trained neural network.
In a fourth aspect, an embodiment of the present disclosure also provides a video recognition apparatus, including: an acquiring part configured to acquire a video to be recognized; a first determining part configured to input the video to be recognized into a neural network trained by the neural network training method of the first aspect or any possible implementation of the first aspect, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized; and a second determining part configured to take an event whose corresponding occurrence probability meets a preset condition as the event occurring in the video to be recognized.
In a fifth aspect, an embodiment of the present disclosure also provides a computer device, including: a processor, a memory, and a bus; the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor and the memory communicate through the bus; when the machine-readable instructions are executed by the processor, the steps of the first aspect or any possible implementation of the first aspect, or the steps of the second aspect, are executed.
In a sixth aspect, an embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the first aspect or any possible implementation of the first aspect, or the steps of the second aspect, are executed.
In a seventh aspect, an embodiment of the present disclosure also provides a computer program, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the steps of the first aspect or any possible implementation of the first aspect, or the steps of the second aspect.
To make the above objectives, features, and advantages of the present disclosure more obvious and understandable, preferred embodiments are described in detail below in conjunction with the accompanying drawings.
Brief Description of the Drawings
In order to more clearly explain the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they illustrate embodiments consistent with the present disclosure and are used together with the specification to explain the technical solutions of the present disclosure. It should be understood that the following drawings only show certain embodiments of the present disclosure and should therefore not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
FIG. 1 shows a flowchart of a neural network training method provided by an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of the network structure of a neural network including directed acyclic graphs provided by an embodiment of the present disclosure;
FIG. 3a shows a schematic diagram of the processing procedure of a temporal convolution provided by an embodiment of the present disclosure;
FIG. 3b shows a schematic diagram of the processing procedure of another temporal convolution provided by an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a neural network structure provided by an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a directed acyclic graph provided by an embodiment of the present disclosure;
FIG. 6 shows a flowchart of a method for generating a feature map corresponding to a node provided by an embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of the overall structure of a constructed neural network provided by an embodiment of the present disclosure;
FIG. 8 shows a schematic flowchart of a neural network training method provided by an embodiment of the present disclosure;
FIG. 9 shows a schematic flowchart of a video recognition method provided by an embodiment of the present disclosure;
FIG. 10 shows a schematic architectural diagram of a neural network training apparatus provided by an embodiment of the present disclosure;
FIG. 11 shows a schematic architectural diagram of a video recognition apparatus provided by an embodiment of the present disclosure;
FIG. 12 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure;
FIG. 13 shows a schematic structural diagram of another computer device provided by an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings of the embodiments of the present disclosure. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, not all of them. The components of the embodiments of the present disclosure generally described and shown in the drawings here can be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present disclosure provided in the drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
In the related art, during video recognition, an existing image recognition neural network is generally adapted; however, an existing image recognition neural network performs recognition in the image dimension and ignores some video features that cannot be extracted from the image dimension, which affects the recognition accuracy of the neural network.
In addition, the related art also uses evolution-based algorithms to search for a neural network for video recognition. However, this approach requires training multiple neural networks each time before the best-performing neural network is selected for further adjustment; the amount of computation during the adjustment of the neural network is large, and the training efficiency is low.
The defects of the above solutions are results obtained by the inventors after practice and careful study; therefore, the discovery process of the above problems and the solutions proposed below in the embodiments of the present disclosure for the above problems should all be regarded as the inventors' contributions to the embodiments of the present disclosure.
On this basis, the embodiments of the present disclosure provide a neural network training method. The constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the directed acyclic graphs corresponds to multiple operation methods. In this way, after the neural network is trained with sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network is further obtained based on these weight parameters. A neural network trained by this method performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, so it has high recognition accuracy for videos.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined and explained in subsequent drawings.
To facilitate understanding of this embodiment, a neural network training method disclosed in the embodiments of the present disclosure is first introduced in detail. The execution subject of the neural network training method provided by the embodiments of the present disclosure is generally a computer device with certain computing capability, for example a terminal device, a server, or another processing device; the terminal device may be a user equipment (UE), a mobile device, a user terminal, a personal computer, or the like. In addition, the method proposed by the embodiments of the present disclosure can also be implemented by a processor executing computer program code.
Referring to FIG. 1, which is a flowchart of a neural network training method provided by an embodiment of the present disclosure, the method includes steps 101 to 103:
Step 101: obtain sample videos and construct a neural network including a plurality of directed acyclic graphs.
Here, the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter.
Step 102: train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters.
Step 103: based on the trained weight parameters, select a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.
Steps 101 to 103 are described in detail below.
In some possible implementations, when constructing the neural network, the number of directed acyclic graphs for extracting temporal features and the number of directed acyclic graphs for extracting spatial features are preset. The nodes of a directed acyclic graph represent feature maps, and the edges between nodes represent operation methods.
When constructing the neural network including the plurality of directed acyclic graphs, the feature map output by the (N-1)-th directed acyclic graph can be taken as the feature map of one input node of the (N+1)-th directed acyclic graph, and the feature map output by the N-th directed acyclic graph can be taken as the feature map of the other input node of the (N+1)-th directed acyclic graph, N being an integer greater than 1.
In some possible implementations, each directed acyclic graph includes two input nodes. Any one input node of the first directed acyclic graph of the neural network can be taken as the target input node; the input of the target input node is a feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node of the first directed acyclic graph apart from the target input node is empty. The feature map corresponding to one input node of the second directed acyclic graph of the neural network is the feature map output by the first directed acyclic graph, and the other input node is empty. In other embodiments, a directed acyclic graph may also include one, three, or more input nodes.
Here, when determining the feature map output by any directed acyclic graph, the feature maps corresponding to the nodes other than the input nodes in that graph can be concatenated, and the concatenated feature map is taken as the feature map output by that directed acyclic graph.
Exemplarily, the network structure of the constructed neural network including directed acyclic graphs can be as shown in FIG. 2, which includes three directed acyclic graphs. White dots represent input nodes, and black dots represent the feature map obtained by concatenating the feature maps corresponding to the nodes other than the input nodes in a directed acyclic graph. One input node of the first directed acyclic graph corresponds to the feature map of the sampled video frames of the sample video, and the other input node is empty; the feature map corresponding to the output node of the first directed acyclic graph serves as one input node of the second directed acyclic graph, and the other input node of the second directed acyclic graph is empty; the feature map output by the second directed acyclic graph and the feature map output by the first directed acyclic graph respectively serve as the feature maps corresponding to the two input nodes of the third directed acyclic graph, and so on.
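A rough sketch of this chaining, under the assumption that each directed acyclic graph ("cell") takes two input feature maps and returns one output feature map; forward_backbone, cells, and the stand-in toy_cell below are illustrative names, not taken from the patent:

```python
import torch

def forward_backbone(frames_feat, cells):
    """Chains cells so that cell N+1 receives the outputs of cells N and N-1."""
    prev_prev, prev = None, None
    for i, cell in enumerate(cells):
        if i == 0:
            out = cell(frames_feat, None)   # target input node; other is empty
        elif i == 1:
            out = cell(prev, None)          # from the first cell; other empty
        else:
            out = cell(prev, prev_prev)     # from cells N and N-1
        prev_prev, prev = prev, out
    return prev

# Stand-in cell: a real cell would be a DAG of mixed operations.
toy_cell = lambda a, b: a if b is None else a + b
x = torch.randn(2, 8, 4, 16, 16)            # (batch, C, T, H, W)
print(forward_backbone(x, [toy_cell] * 4).shape)
```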
In one implementation, each edge in the directed acyclic graph for extracting temporal features corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting spatial features corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method that is different from each of the second operation methods.
Exemplarily, the plurality of second operation methods corresponding to each edge in the directed acyclic graph for extracting spatial features may include an average pooling operation (e.g., 1×3×3 average pooling), a max pooling operation (e.g., 1×3×3 max pooling), a discrete convolution operation (e.g., 1×3×3 discrete convolution), and a dilated discrete convolution (e.g., 1×3×3 dilated discrete convolution); the plurality of first operation methods corresponding to each edge in the directed acyclic graph for extracting temporal features may include the average pooling operation, the max pooling operation, the discrete convolution operation, the dilated discrete convolution, and, in addition, different temporal convolutions.
Here, a temporal convolution is used to extract temporal features. Exemplarily, a temporal convolution may be a temporal convolution of size 3+3×3, meaning that the kernel size in the time dimension is 3 and the kernel size in the spatial dimensions is 3×3. Its processing procedure may exemplarily be as shown in FIG. 3a, where Cin denotes the input feature map, Cout denotes the output feature map after processing, ReLU denotes the activation function, conv1×3×3 denotes a convolution operation with kernel size 1 in the time dimension and 3×3 in the spatial dimensions, conv3×1×1 denotes a convolution operation with kernel size 3 in the time dimension and 1×1 in the spatial dimensions, BatchNorm denotes the normalization operation, and T, W, H denote the time dimension and the two spatial dimensions, respectively.
Exemplarily, a temporal convolution may also be a temporal convolution of size 3+1×1, meaning that the kernel size in the time dimension is 3 and the kernel size in the spatial dimensions is 1×1. Its processing procedure may exemplarily be as shown in FIG. 3b, where conv1×1×1 denotes a convolution operation with kernel size 1 in the time dimension and 1×1 in the spatial dimensions; the remaining symbols have the same meanings as in FIG. 3a and are not repeated here.
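A factorized temporal convolution of this kind could look roughly like the following PyTorch sketch (a spatial 1×s×s convolution followed by a temporal 3×1×1 convolution, with ReLU and BatchNorm as in FIGS. 3a/3b); the exact layer ordering and channel handling here are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TemporalConv(nn.Module):
    def __init__(self, channels, spatial=3):
        super().__init__()
        pad_s = spatial // 2
        self.body = nn.Sequential(
            nn.ReLU(),
            # kernel size 1 on the time axis T, spatial x spatial on H and W
            nn.Conv3d(channels, channels, kernel_size=(1, spatial, spatial),
                      padding=(0, pad_s, pad_s), bias=False),
            # kernel size 3 on the time axis T, 1x1 on H and W
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                      padding=(1, 0, 0), bias=False),
            nn.BatchNorm3d(channels),
        )

    def forward(self, x):  # x: (batch, channels, T, H, W)
        return self.body(x)

# spatial=3 gives the 3+3x3 variant; spatial=1 gives the 3+1x1 variant.
out = TemporalConv(8, spatial=3)(torch.randn(2, 8, 4, 16, 16))
```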
In some possible implementations, during the initial construction of the neural network, the structures of the directed acyclic graphs for extracting temporal features are the same; however, after the training of the neural network is completed, the target operation methods corresponding to the edges in different directed acyclic graphs for extracting temporal features may be different. Likewise, during the construction of the neural network, the structures of the directed acyclic graphs for extracting spatial features are also the same; after training is completed, the target operation methods corresponding to the edges in different directed acyclic graphs for extracting spatial features may also be different.
In some possible implementations, the directed acyclic graphs for extracting temporal features include two kinds: a first directed acyclic graph that changes the size and the number of channels of the input feature map, and a second directed acyclic graph that does not change the size or the number of channels of the input feature map. The first directed acyclic graph may include a first preset number of nodes, the second directed acyclic graph may include a second preset number of nodes, and the first preset number and the second preset number may be the same. The directed acyclic graphs for extracting spatial features may also include two kinds: a third directed acyclic graph that changes the size and the number of channels of the input feature map, and a fourth directed acyclic graph that does not change the size or the number of channels of the input feature map; the third directed acyclic graph may include a third preset number of nodes, the fourth directed acyclic graph may include a fourth preset number of nodes, and the third preset number and the fourth preset number may be the same.
Therefore, the constructed neural network includes the above four kinds of directed acyclic graphs. In practical applications, the preset number of nodes corresponding to each kind of directed acyclic graph includes the number of nodes at each level of that graph; after the number of nodes at each level is determined, the connection relationships between the nodes can be determined directly, and the directed acyclic graph is thereby determined.
Exemplarily, the network structure of a neural network containing the four kinds of directed acyclic graphs can be as shown in FIG. 4. After a sample video is input to the neural network, it can first pass through the sampling layer, which samples the sample video; feature extraction is then performed on the sampled video frames, and the result is input to the first directed acyclic graph; the last directed acyclic graph feeds into the fully connected layer, and the output of the fully connected layer is the output of the neural network.
It should be noted here that controlling the size and the number of channels of the feature maps through the directed acyclic graphs can, on the one hand, enlarge the receptive field of the neural network and, on the other hand, reduce the amount of computation of the neural network and improve computational efficiency. In the above method, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the directed acyclic graphs corresponds to multiple operation methods; in this way, after training the neural network with sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network is obtained based on these weight parameters. A network trained in this way performs spatial feature recognition in the image dimension as well as temporal feature recognition in the time dimension, and achieves high video recognition accuracy.
In some possible implementations, when determining the feature map corresponding to each node other than the input nodes in a directed acyclic graph, the feature map corresponding to the current node can be generated based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node.
Exemplarily, if the directed acyclic graph is as shown in FIG. 5, then when determining the feature map corresponding to node 3, the nodes pointing to node 3 are node 0, node 1, and node 2; the feature map corresponding to node 3 can therefore be determined based on the feature maps corresponding to node 0, node 1, and node 2, and the weight parameters of the operation methods corresponding to the edges between node 3 and each of node 0, node 1, and node 2.
Here, if the directed acyclic graph is one for extracting temporal features, the operation methods corresponding to the edges between node 3 and each of node 0, node 1, and node 2 are the first operation methods; if the directed acyclic graph is one for extracting spatial features, those operation methods are the second operation methods.
Through the above method, the weight parameters can control the influence that the operation methods on the edge between any node and its upper-level node have on that node's feature map; therefore, by controlling the weight parameters, the operation method corresponding to the edge between any node and its upper-level node can be controlled, thereby changing the value of that node's feature map.
The process of generating the feature map corresponding to a node can refer to the method shown in FIG. 6, which includes the following steps:
Step 601: for the current edge between the current node and each upper-level node pointing to the current node, process the feature map of the upper-level node corresponding to the current edge with each operation method corresponding to the current edge, to obtain a first intermediate feature map corresponding to each operation method of the current edge.
Exemplarily, if the directed acyclic graph in which the current node is located is one for temporal feature extraction, three current edges point to the current node, and each current edge corresponds to six first operation methods, then, for any current edge, the feature map corresponding to the upper-level node connected to that edge can be processed by each of the operation methods corresponding to that edge, yielding six first intermediate feature maps for that edge; since three current edges point to the current node, eighteen first intermediate feature maps are obtained in total.
If the directed acyclic graph in which the current node is located is one for spatial feature extraction, three current edges point to the current node, and each current edge corresponds to four second operation methods, then, similarly to the above computation, each current edge corresponds to four first intermediate feature maps, and twelve first intermediate feature maps are obtained in total.
Step 602: perform a weighted sum of the first intermediate feature maps corresponding to the operation methods of the current edge according to the weight parameters corresponding to the respective operation methods, to obtain a second intermediate feature map corresponding to the current edge.
The weight parameters are model parameters to be trained. In some possible implementations, the weight parameters can be initialized randomly and then continuously adjusted during the training of the neural network.
Each operation method corresponding to each current edge pointing to the current node has a corresponding weight parameter. When performing the weighted sum of the first intermediate feature maps according to the corresponding weight parameters, the values at corresponding positions of a first intermediate feature map can be multiplied by the weight parameter of the operation method corresponding to that feature map, and the multiplication results at corresponding positions are then added up to obtain the second intermediate feature map corresponding to that current edge.
Continuing the example in step 601, three edges point to the current node, each current edge corresponds to six first operation methods, and each first operation method has a corresponding weight parameter; each current edge can therefore correspond to six first intermediate feature maps, and the six first intermediate feature maps corresponding to each current edge are weighted and summed according to the corresponding weight parameters to obtain the second intermediate feature map corresponding to that edge.
It should be noted here that the weight parameters of the same operation method may differ across edges. For example, if edge 1 and edge 2 both point to the current node and the operation methods corresponding to both edges include the average pooling operation, the weight parameter of the average pooling operation corresponding to edge 1 may be 70%, while that corresponding to edge 2 may be 10%.
Exemplarily, when computing the second intermediate feature map corresponding to the edge between the i-th node and the j-th node, the following formula (1) can be used:

$$\bar{o}^{(i,j)}(x_i)=\sum_{o\in O}\frac{\exp\!\big(\alpha_o^{(i,j)}\big)}{\sum_{o'\in O}\exp\!\big(\alpha_{o'}^{(i,j)}\big)}\,o(x_i)\qquad(1)$$

where $o$ and $o'$ denote operation methods, $O$ denotes the set of operation methods between the i-th node and the j-th node, $\alpha_o^{(i,j)}$ denotes the weight parameter of operation method $o$ corresponding to the edge between the i-th node and the j-th node, $\alpha_{o'}^{(i,j)}$ denotes the weight parameter of operation method $o'$ corresponding to that edge, $o(x_i)$ denotes the result of applying operation method $o$ to the feature map $x_i$ of the i-th node, and $\bar{o}^{(i,j)}(x_i)$ denotes the second intermediate feature map corresponding to the edge between the i-th node and the j-th node.
Step 603: perform a summation operation on the second intermediate feature maps respectively corresponding to the multiple edges between the current node and the respective upper-level nodes pointing to the current node, to obtain the feature map corresponding to the current node.
Here, the sizes of the second intermediate feature maps are the same; when summing the second intermediate feature maps, the values at corresponding positions of the second intermediate feature maps can be added to obtain the feature map corresponding to the current node.
In addition, the constructed neural network also includes a sampling layer and a fully connected layer. The sampling layer is used to sample the video input to the neural network to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the feature map corresponding to the sampled video frames, and then input the feature map corresponding to the sampled video frames to the target input node of the first directed acyclic graph. The fully connected layer is used to determine the occurrence probabilities of multiple events corresponding to the sample video based on the feature map output by the last directed acyclic graph. In summary, the overall structure of the constructed neural network is exemplarily shown in FIG. 7, which includes three directed acyclic graphs, one fully connected layer, and one sampling layer; the output of the fully connected layer is the output of the neural network.
Through this method, each operation method is used when determining a node's feature map, which reduces the influence of any single operation method on the feature map corresponding to the node and helps improve the recognition accuracy of the neural network.
The event label corresponding to a sample video is used to indicate the event occurring in the sample video; exemplarily, the events occurring in sample videos may include a person running, a puppy playing, two people playing badminton, and the like. In some possible implementations, when training the constructed neural network based on the sample videos and their corresponding event labels, the method shown in FIG. 8 can be used, which includes the following steps:
Step 801: input the sample video into the neural network, and output the occurrence probabilities of multiple events corresponding to the sample video.
Here, the number of events corresponding to the sample video is the same as the number of kinds of event labels of the sample videos used to train the neural network. For example, if the neural network is trained with sample videos covering 400 kinds of event labels, then after any video is input to the neural network, the network can output the occurrence probabilities of each of the 400 events for that video.
Step 802: determine the predicted event corresponding to the sample video based on the occurrence probabilities of the multiple events corresponding to the sample video.
For example, the event with the largest occurrence probability can be determined as the event predicted by the neural network. In some other possible implementations, a sample video may carry multiple event labels, for example both a puppy-playing label and a two-people-playing-badminton label; therefore, when determining the predicted event corresponding to the sample video based on the occurrence probabilities of the multiple events, events whose occurrence probabilities are greater than a preset probability can also be determined as the predicted events corresponding to the sample video.
Step 803: determine the loss value of this training round based on the predicted event corresponding to the sample video and the event label of the sample video.
Exemplarily, the cross-entropy loss of this training round can be determined based on the predicted event corresponding to the sample video and the event label of the sample video.
Step 804: judge whether the loss value of this training round is less than a preset loss value.
If the judgment result is yes, step 805 is executed next; if the judgment result is no, the parameter values of the neural network parameters in this training round are adjusted, and the process returns to step 801.
Here, the adjusted neural network parameters include the weight parameters of the operation methods corresponding to the edges of the directed acyclic graphs; since these weight parameters can influence the selection of the target operation method for each edge of the directed acyclic graphs, they can serve as the structure parameters of the neural network. The adjusted neural network parameters also include operation parameters, which may include, for example, the sizes and weights of the convolution kernels of the convolution operations.
Since the convergence speeds of the structure parameters and the operation parameters differ considerably, if the operation parameters are in the early stage of learning and the learning rate is small, the structure parameters may converge too quickly; therefore, synchronized learning of the operation parameters and the structure parameters can be achieved by controlling the learning rate.
Exemplarily, a step-wise learning rate decay strategy can be adopted: a hyperparameter S can be preset, indicating that the learning rate is decayed once every S optimizations of the operation parameters and structure parameters, with a preset decay magnitude d. In this way the learning rate decays step by step, thereby achieving synchronized learning, i.e., synchronized optimization, of the structure parameters and the operation parameters.
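A minimal sketch of this step-wise decay, assuming a PyTorch setup where the operation parameters ω and structure parameters α are held by two optimizers; the model, data, and the concrete values of S and d below are illustrative stand-ins, not values from the patent:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 400)                    # stand-in for the operation parameters (omega)
alphas = torch.zeros(8, requires_grad=True)   # stand-in for the structure parameters (alpha)

S, d = 1000, 0.9                              # preset hyperparameters S and d from the text
opt_w = torch.optim.SGD(model.parameters(), lr=0.1)
opt_a = torch.optim.Adam([alphas], lr=3e-4)
# Decay the operation-parameter learning rate by factor d once every S steps.
sched = torch.optim.lr_scheduler.StepLR(opt_w, step_size=S, gamma=d)

for step in range(3000):
    x = torch.randn(4, 16)
    y = torch.randint(0, 400, (4,))
    # Toy coupling term so both parameter sets receive gradients in this sketch.
    loss = nn.functional.cross_entropy(model(x), y) + 0.0 * alphas.sum()
    opt_w.zero_grad(); opt_a.zero_grad()
    loss.backward()
    opt_w.step(); opt_a.step()                # synchronized optimization of omega and alpha
    sched.step()                              # step-wise learning rate decay
```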
In the prior art, parameter optimization is generally performed by the following formulas (2) and (3):

$$\omega^*(\alpha)=\mathop{\arg\min}_{\omega}\,L(\omega,\alpha)\qquad(2)$$

$$\min_{\alpha}\,L(\omega^*(\alpha),\alpha)\qquad(3)$$

In formula (2), $\alpha$ denotes the structure parameters and $\omega$ denotes the operation parameters; $L(\omega,\alpha)$ denotes the loss value computed based on $\omega$ with $\alpha$ fixed; $\omega^*(\alpha)$ denotes the value of $\omega$ that minimizes $L(\omega,\alpha)$ when $\alpha$ is fixed and $\omega$ is trained, i.e., the optimized $\omega$. In formula (3), $L(\omega^*(\alpha),\alpha)$ denotes the loss value computed based on $\alpha$ with the optimized $\omega$ kept unchanged, and $\alpha$ is trained to minimize $L(\omega^*(\alpha),\alpha)$. In this approach, $\alpha$ needs to be adjusted continuously, and each adjustment of $\alpha$ requires retraining $\omega$; exemplarily, if each training of $\omega$ requires 100 computations and $\alpha$ is adjusted 100 times, 10,000 computations are eventually needed, which is computationally expensive.
In the method provided by the embodiments of the present disclosure, parameter optimization is generally performed based on the following formulas:

$$\omega^*(\alpha)\approx\omega-\xi\,\nabla_{\omega}L(\omega,\alpha)$$

$$\nabla_{\alpha}L(\omega^*(\alpha),\alpha)\approx\nabla_{\alpha}L\big(\omega-\xi\,\nabla_{\omega}L(\omega,\alpha),\,\alpha\big)$$

In the above formulas, $\xi$ denotes the learning rate of the operation parameters, and $\nabla_{\omega}L(\omega,\alpha)$ denotes the gradient of $\omega$ computed based on $L(\omega,\alpha)$. When computing the optimized $\omega$, an approximate computation is adopted: each time the value of $\alpha$ is optimized, only a single computation is needed to optimize $\omega$, which can therefore be regarded as simultaneous optimization of $\alpha$ and $\omega$.
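A minimal sketch of this one-step approximation with plain tensors (loss_fn, w, and alpha are illustrative stand-ins): a single virtual gradient step w' = w - ξ∇ωL(ω, α) replaces the inner optimization, and the gradient with respect to α is taken through it:

```python
import torch

def alpha_grad(loss_fn, w, alpha, xi):
    """One-step unrolled gradient of the loss with respect to alpha."""
    # Gradient of the loss w.r.t. omega, kept differentiable (create_graph).
    gw = torch.autograd.grad(loss_fn(w, alpha), w, create_graph=True)[0]
    w_virtual = w - xi * gw          # one virtual step approximates omega*(alpha)
    # Differentiate the loss at the virtual omega with respect to alpha.
    return torch.autograd.grad(loss_fn(w_virtual, alpha), alpha)[0]

w = torch.randn(5, requires_grad=True)
alpha = torch.randn(3, requires_grad=True)
loss_fn = lambda w, a: (w.sum() * a.softmax(0)).pow(2).sum()  # toy loss
print(alpha_grad(loss_fn, w, alpha, xi=0.01))
```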
Based on this method, the network parameters inside the neural network can be searched at the same time as the neural network structure is searched; compared with the approach of first determining the network structure and then determining the network parameters, this improves the efficiency of determining the neural network.
Step 805: determine the trained neural network model based on the trained neural network parameters.
In some possible implementations, a target operation method can be selected for each edge of the multiple directed acyclic graphs based on the trained weight parameters; the neural network model after a target operation method has been determined for each edge is the trained neural network.
Exemplarily, when selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters, for each edge of the directed acyclic graph, the operation method with the largest weight parameter corresponding to that edge is taken as the target operation method corresponding to that edge.
In some other possible implementations, in order to reduce the size of the neural network and increase its computation speed, after a target operation method has been selected for each edge of the multiple directed acyclic graphs, the edges of the directed acyclic graphs can also be pruned, and the pruned neural network is then taken as the trained neural network.
Here, for each node, when the number of edges pointing to the node is greater than a target number, the weight parameter of the target operation method corresponding to each edge pointing to the node is determined; the edges pointing to the node are sorted in descending order of the corresponding weight parameters, and all edges other than the top K edges are deleted, where K is the target number; the neural network after the deletion processing is taken as the trained neural network.
Exemplarily, if the target number is two and three edges point to a certain node, the weight parameters of the target operation methods corresponding to the three edges pointing to that node can be determined respectively; the three edges are sorted from largest to smallest according to the weight parameters, the top two edges are retained, and the edge ranked third is deleted.
Through this method, on the one hand, the size of the neural network can be reduced; on the other hand, the computation steps of the neural network can be reduced, improving the computational efficiency of the neural network.
Based on the same concept, an embodiment of the present disclosure also provides a video recognition method. Referring to FIG. 9, which is a schematic flowchart of a video recognition method provided by an embodiment of the present disclosure, the method includes the following steps:
Step 901: acquire a video to be recognized.
Step 902: input the video to be recognized into a pre-trained neural network, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized.
Here, the neural network is trained based on the neural network training method provided by the above embodiments.
Step 903: take an event whose corresponding occurrence probability meets a preset condition as the event occurring in the video to be recognized.
Here, an event whose occurrence probability meets the preset condition may be the event with the largest occurrence probability, or an event whose occurrence probability is greater than a preset probability value.
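An illustrative end-to-end inference flow for these steps might look as follows, using the threshold variant of the preset condition; the network, frame sampler, and label set below are hypothetical stand-ins, not components from the patent:

```python
import torch
import torch.nn as nn

def recognize(video, trained_net, sample_frames, event_labels, thresh=0.5):
    frames = sample_frames(video)                 # sampling-layer stand-in
    with torch.no_grad():
        # Softmax here is an assumption; the trained head may already output probabilities.
        probs = trained_net(frames).softmax(-1)[0]
    return [(event_labels[i], float(p)) for i, p in enumerate(probs)
            if float(p) >= thresh]                # events meeting the preset condition

net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 5))  # stand-in network
labels = ["running", "dog playing", "badminton", "swimming", "cycling"]
video = torch.randn(3, 64, 32, 32)                # (C, T, H, W)
sampler = lambda v: v[:, ::8].reshape(1, -1)      # sample every 8th frame
print(recognize(video, net, sampler, labels, thresh=0.2))
```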
The detailed processing of the video to be recognized after it is input to the neural network is introduced below with reference to the embodiments. The neural network includes a sampling layer, a feature extraction layer, and a fully connected layer, and the feature extraction layer includes multiple directed acyclic graphs.
1) Sampling layer
After the video to be recognized is input to the neural network, it is first input to the sampling layer. The sampling layer can sample the video to be recognized to obtain multiple sampled video frames, then perform feature extraction on the sampled video frames to obtain the feature map corresponding to the sampled video frames, and then input the feature map corresponding to the sampled video frames to the feature extraction layer.
2) Feature extraction layer
The feature extraction layer includes multiple directed acyclic graphs for temporal feature extraction and directed acyclic graphs for spatial feature extraction. The number of directed acyclic graphs of each type is preset, as is the number of nodes within each type of directed acyclic graph. The differences between the directed acyclic graphs for temporal feature extraction and those for spatial feature extraction are shown in Table 1 below:
Table 1 (rendered as an image in the original publication): differences between the directed acyclic graphs for temporal feature extraction and those for spatial feature extraction.
After the sampling layer inputs the feature map corresponding to the sampled video frames into the feature extraction layer, the feature map corresponding to the sampled video frames can be input to the target input node of the first directed acyclic graph, with the other input node of the first directed acyclic graph being empty; one input node of the second directed acyclic graph is connected to the output node of the first directed acyclic graph, and the other input node is empty; one input node of the third directed acyclic graph is connected to the output node of the second directed acyclic graph, and another input node is connected to the output node of the first directed acyclic graph; and so on. The output node of the last directed acyclic graph inputs its corresponding feature map to the fully connected layer.
3) Fully connected layer
After the feature map corresponding to the output node of the directed acyclic graph is input to the fully connected layer, the fully connected layer can determine, based on the input feature map, the occurrence probabilities of the multiple events corresponding to the input video to be recognized, where the multiple events corresponding to the video to be recognized may be the event labels corresponding to the sample videos used when training the neural network.
In the method provided by the above embodiments, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the directed acyclic graphs corresponds to multiple operation methods; in this way, after the neural network is trained with sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network is further obtained based on these weight parameters. A neural network trained by this method performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, and has high video recognition accuracy.
Those skilled in the art can understand that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the execution order of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, an embodiment of the present disclosure also provides a neural network training apparatus corresponding to the neural network training method; since the principle by which the apparatus in the embodiments of the present disclosure solves the problem is similar to the above neural network training method of the embodiments of the present disclosure, the implementation of the apparatus can refer to the implementation of the method, and repeated parts are not described again.
Referring to FIG. 10, which is a schematic architectural diagram of a neural network training apparatus provided by an embodiment of the present disclosure, the apparatus includes: a construction part 1001, a training part 1002, and a selection part 1003; wherein:
the construction part 1001 is configured to obtain sample videos and construct a neural network including a plurality of directed acyclic graphs; the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
the training part 1002 is configured to train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters;
the selection part 1003 is configured to select a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameters, to obtain a trained neural network.
In some possible implementations, the directed acyclic graph includes two input nodes, and each node of the neural network corresponds to one feature map; the construction part 1001 is further configured to: take the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and take the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, N being an integer greater than 1; where the feature map corresponding to the target input node in the first directed acyclic graph of the neural network is a feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node apart from the target input node is empty; in the second directed acyclic graph of the neural network, the feature map of one input node is the feature map output by the first directed acyclic graph, and the other input node is empty.
In some possible implementations, the construction part 1001 is further configured to concatenate the feature maps corresponding to the nodes other than the input nodes in the directed acyclic graph, and take the concatenated feature map as the feature map output by the directed acyclic graph.
In some possible implementations, each edge in the directed acyclic graph for extracting temporal features corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting spatial features corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method that is different from each of the second operation methods.
In some possible implementations, the neural network further includes a sampling layer connected to the first directed acyclic graph, the sampling layer being used to sample the sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the feature map corresponding to the sampled video frames, and input the feature map corresponding to the sampled video frames to the target input node of the first directed acyclic graph; the neural network further includes a fully connected layer connected to the output node of the last directed acyclic graph, the fully connected layer being used to determine the occurrence probabilities of multiple events corresponding to the sample video based on the feature map output by the last directed acyclic graph; the training part 1002 is further configured to: train the neural network based on the occurrence probabilities of the multiple events corresponding to the sample video as calculated by the fully connected layer and the event label corresponding to each sample video, to obtain the trained weight parameters.
In some possible implementations, the construction part 1001 is further configured to generate the feature map corresponding to a current node based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node.
In some possible implementations, the construction part 1001 is further configured to: for the current edge between the current node and each upper-level node pointing to the current node, process the feature map of the upper-level node corresponding to the current edge with each operation method corresponding to the current edge, to obtain a first intermediate feature map corresponding to each operation method of the current edge; perform a weighted sum of the first intermediate feature maps corresponding to the operation methods of the current edge according to the weight parameters corresponding to the respective operation methods, to obtain a second intermediate feature map corresponding to the current edge; and perform a summation operation on the second intermediate feature maps respectively corresponding to the multiple edges between the current node and the respective upper-level nodes pointing to the current node, to obtain the feature map corresponding to the current node.
In some possible implementations, the selection part 1003 is further configured to, for each edge of the directed acyclic graph, take the operation method with the largest weight parameter corresponding to that edge as the target operation method corresponding to that edge.
In some possible implementations, the selection part 1003 is further configured to: for each node, when the number of edges pointing to the node is greater than a target number, determine the weight parameter of the target operation method corresponding to each edge pointing to the node; sort the edges pointing to the node in descending order of the corresponding weight parameters, and delete all edges other than the top K edges, where K is the target number; and take the neural network after the deletion processing as the trained neural network.
For descriptions of the processing flow of each part in the apparatus and the interaction flows between the parts, reference can be made to the relevant descriptions in the above method embodiments, which are not detailed here.
Based on the same inventive concept, an embodiment of the present disclosure also provides a video recognition apparatus corresponding to the video recognition method. Referring to FIG. 11, which is a schematic architectural diagram of a video recognition apparatus provided by an embodiment of the present disclosure, the apparatus includes: an acquiring part 1101, a first determining part 1102, and a second determining part 1103, wherein: the acquiring part 1101 is configured to acquire a video to be recognized; the first determining part 1102 is configured to input the video to be recognized into a neural network trained based on the neural network training method described in the above embodiments, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized; the second determining part 1103 is configured to take an event whose corresponding occurrence probability meets a preset condition as the event occurring in the video to be recognized.
Based on the same technical concept, an embodiment of this application also provides a computer device. Referring to FIG. 12, which is a schematic structural diagram of a computer device provided by an embodiment of this application, the device includes a processor 1201, a memory 1202, and a bus 1203. The memory 1202 is used to store execution instructions and includes a memory 12021 and an external memory 12022; the memory 12021, also called internal memory, is configured to temporarily store operation data in the processor 1201 and data exchanged with the external memory 12022 such as a hard disk; the processor 1201 exchanges data with the external memory 12022 through the memory 12021. When the computer device 1200 is running, the processor 1201 and the memory 1202 communicate through the bus 1203, so that the processor 1201 executes the following instructions:
obtain sample videos, and construct a neural network including a plurality of directed acyclic graphs; the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters;
based on the trained weight parameters, select a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.
An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, it executes the steps of the neural network training method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the neural network training method provided by the embodiments of the present disclosure includes a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the steps of the neural network training method described in the above method embodiments, for which reference may be made to the above method embodiments; details are not repeated here.
Based on the same technical concept, an embodiment of this application also provides a computer device. Referring to FIG. 13, which is a schematic structural diagram of a computer device 1300 provided by an embodiment of this application, the device includes a processor 1301, a memory 1302, and a bus 1303. The memory 1302 is used to store execution instructions and includes a memory 13021 and an external memory 13022; the memory 13021, also called internal memory, is configured to temporarily store operation data in the processor 1301 and data exchanged with the external memory 13022 such as a hard disk; the processor 1301 exchanges data with the external memory 13022 through the memory 13021. When the computer device 1300 is running, the processor 1301 and the memory 1302 communicate through the bus 1303, so that the processor 1301 executes the following instructions: acquire a video to be recognized; input the video to be recognized into a neural network trained based on the neural network training method described in the above embodiments, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized; take an event whose corresponding occurrence probability meets a preset condition as the event occurring in the video to be recognized.
An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, it executes the steps of the video recognition method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the video recognition method provided by the embodiments of the present disclosure includes a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the steps of the video recognition method described in the above method embodiments, for which reference may be made to the above method embodiments; details are not repeated here.
An embodiment of the present disclosure also provides a computer program which, when executed by a processor, implements any one of the methods of the foregoing embodiments. The computer program product can be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
Those skilled in the art can clearly understand that, for the convenience and brevity of description, the working processes of the systems and apparatuses described above can refer to the corresponding processes in the above method embodiments and are not repeated here. In the several embodiments provided by the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other division manners in actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only implementations of the embodiments of the present disclosure, used to illustrate the technical solutions of the embodiments of the present disclosure and not to limit them, and the protection scope of the embodiments of the present disclosure is not limited thereto. Although the embodiments of the present disclosure have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person familiar with this technical field can still, within the technical scope disclosed by the embodiments of the present disclosure, modify the technical solutions recorded in the foregoing embodiments, easily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered within the protection scope of the embodiments of the present disclosure. Therefore, the protection scope of the embodiments of the present disclosure shall be subject to the protection scope of the claims.
Industrial Applicability
The embodiments of the present disclosure obtain sample videos and construct a neural network including multiple directed acyclic graphs; the multiple directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to multiple operation methods, and each operation method has a corresponding weight parameter; the neural network is trained based on the sample videos and the event label corresponding to each sample video to obtain trained weight parameters; and based on the trained weight parameters, a target operation method is selected for each edge of the multiple directed acyclic graphs to obtain a trained neural network. The neural network constructed in the above embodiments includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge corresponds to multiple operation methods; in this way, after training with the sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network is obtained based on them. A network trained in this way performs spatial feature recognition in the image dimension as well as temporal feature recognition in the time dimension, and achieves high video recognition accuracy.

Claims (23)

  1. A neural network training method, comprising:
    obtaining sample videos and constructing a neural network including a plurality of directed acyclic graphs; the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each of the operation methods has a corresponding weight parameter;
    training the neural network based on the sample videos and an event label corresponding to each of the sample videos, to obtain trained weight parameters;
    selecting, based on the trained weight parameters, a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.
  2. The method according to claim 1, wherein the directed acyclic graph includes two input nodes, and each node of the neural network corresponds to one feature map;
    the constructing the neural network including the plurality of directed acyclic graphs includes:
    taking a feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and taking a feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, N being an integer greater than 1;
    wherein the feature map corresponding to a target input node in the first directed acyclic graph of the neural network is a feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node apart from the target input node is empty; in the second directed acyclic graph of the neural network, the feature map of one input node is the feature map output by the first directed acyclic graph, and the other input node is empty.
  3. The method according to claim 2, wherein the method further includes:
    concatenating feature maps corresponding to nodes other than the input nodes in the directed acyclic graph, and taking the concatenated feature map as the feature map output by the directed acyclic graph.
  4. The method according to any one of claims 1 to 3, wherein each edge in the directed acyclic graph for extracting temporal features corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting spatial features corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method that is different from each of the second operation methods.
  5. The method according to any one of claims 1 to 4, wherein the neural network further includes a sampling layer connected to the first directed acyclic graph, the sampling layer being used to sample a sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain a feature map corresponding to the sampled video frames, and input the feature map corresponding to the sampled video frames to a target input node of the first directed acyclic graph;
    the neural network further includes a fully connected layer connected to the last directed acyclic graph; the fully connected layer is used to determine occurrence probabilities of multiple events corresponding to the sample video based on the feature map output by the last directed acyclic graph;
    the training the neural network based on the sample videos and the event label corresponding to each of the sample videos to obtain the trained weight parameters includes:
    training the neural network based on the occurrence probabilities of the multiple events corresponding to the sample video as calculated by the fully connected layer and the event label corresponding to each of the sample videos, to obtain the trained weight parameters.
  6. The method according to any one of claims 2 to 5, wherein the method further includes:
    generating a feature map corresponding to a current node based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node.
  7. The method according to claim 6, wherein the generating the feature map corresponding to the current node based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node includes:
    for a current edge between the current node and each upper-level node pointing to the current node, processing the feature map of the upper-level node corresponding to the current edge with each of the operation methods corresponding to the current edge, to obtain a first intermediate feature map corresponding to each of the operation methods of the current edge;
    performing a weighted sum of the first intermediate feature maps corresponding to the operation methods of the current edge according to the weight parameters corresponding to the respective operation methods, to obtain a second intermediate feature map corresponding to the current edge;
    performing a summation operation on the second intermediate feature maps respectively corresponding to the multiple edges between the current node and the respective upper-level nodes pointing to the current node, to obtain the feature map corresponding to the current node.
  8. The method according to any one of claims 1 to 7, wherein the selecting, based on the trained weight parameters, a target operation method for each edge of the plurality of directed acyclic graphs includes:
    for each edge of the directed acyclic graph, taking the operation method with the largest weight parameter corresponding to that edge as the target operation method corresponding to that edge.
  9. The method according to claim 8, wherein the selecting, based on the trained weight parameters, a target operation method for each edge of the plurality of directed acyclic graphs to obtain the trained neural network includes:
    for each node, when the number of edges pointing to the node is greater than a target number, determining the weight parameter of the target operation method corresponding to each edge pointing to the node;
    sorting the edges pointing to the node in descending order of the corresponding weight parameters, and deleting all edges other than the top K edges, where K is the target number;
    taking the neural network after the deletion processing as the trained neural network.
  10. A video recognition method, comprising:
    acquiring a video to be recognized;
    inputting the video to be recognized into a neural network trained based on the neural network training method according to any one of claims 1 to 9, and determining occurrence probabilities of multiple events corresponding to the video to be recognized;
    taking an event whose corresponding occurrence probability meets a preset condition as the event occurring in the video to be recognized.
  11. A neural network training apparatus, comprising:
    a construction part configured to obtain sample videos and construct a neural network including a plurality of directed acyclic graphs; the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each of the operation methods has a corresponding weight parameter;
    a training part configured to train the neural network based on the sample videos and an event label corresponding to each of the sample videos, to obtain trained weight parameters;
    a selection part configured to select, based on the trained weight parameters, a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.
  12. The apparatus according to claim 11, wherein the construction part is further configured to take a feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and take a feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, N being an integer greater than 1; wherein the feature map corresponding to a target input node in the first directed acyclic graph of the neural network is a feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node apart from the target input node is empty; in the second directed acyclic graph of the neural network, the feature map of one input node is the feature map output by the first directed acyclic graph, and the other input node is empty.
  13. The apparatus according to claim 12, wherein the construction part is further configured to concatenate feature maps corresponding to nodes other than the input nodes in the directed acyclic graph, and take the concatenated feature map as the feature map output by the directed acyclic graph.
  14. The apparatus according to any one of claims 11 to 13, wherein each edge in the directed acyclic graph for extracting temporal features corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting spatial features corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method that is different from each of the second operation methods.
  15. The apparatus according to any one of claims 11 to 14, wherein the neural network further includes a sampling layer connected to the first directed acyclic graph, the sampling layer being used to sample the sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain a feature map corresponding to the sampled video frames, and input the feature map corresponding to the sampled video frames to the target input node of the first directed acyclic graph; the neural network further includes a fully connected layer connected to the last directed acyclic graph, the fully connected layer being used to determine occurrence probabilities of multiple events corresponding to the sample video based on the feature map output by the last directed acyclic graph;
    the training part is further configured to train the neural network based on the occurrence probabilities of the multiple events corresponding to the sample video as calculated by the fully connected layer and the event label corresponding to each of the sample videos, to obtain the trained weight parameters.
  16. The apparatus according to any one of claims 12 to 15, wherein the construction part is further configured to generate a feature map corresponding to a current node based on the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node.
  17. The apparatus according to claim 16, wherein the construction part is further configured to: for a current edge between the current node and each upper-level node pointing to the current node, process the feature map of the upper-level node corresponding to the current edge with each of the operation methods corresponding to the current edge, to obtain a first intermediate feature map corresponding to each of the operation methods of the current edge; perform a weighted sum of the first intermediate feature maps corresponding to the operation methods of the current edge according to the weight parameters corresponding to the respective operation methods, to obtain a second intermediate feature map corresponding to the current edge; and perform a summation operation on the second intermediate feature maps respectively corresponding to the multiple edges between the current node and the respective upper-level nodes pointing to the current node, to obtain the feature map corresponding to the current node.
  18. The apparatus according to any one of claims 11 to 17, wherein the selection part is further configured to, for each edge of the directed acyclic graph, take the operation method with the largest weight parameter corresponding to that edge as the target operation method corresponding to that edge.
  19. The apparatus according to claim 18, wherein the selection part is further configured to: for each node, when the number of edges pointing to the node is greater than a target number, determine the weight parameter of the target operation method corresponding to each edge pointing to the node; sort the edges pointing to the node in descending order of the corresponding weight parameters, and delete all edges other than the top K edges, where K is the target number; and take the neural network after the deletion processing as the trained neural network.
  20. A video recognition apparatus, comprising:
    an acquiring part configured to acquire a video to be recognized;
    a first determining part configured to input the video to be recognized into a neural network trained based on the neural network training method according to any one of claims 1 to 9, and determine occurrence probabilities of multiple events corresponding to the video to be recognized;
    a second determining part configured to take an event whose corresponding occurrence probability meets a preset condition as the event occurring in the video to be recognized.
  21. A computer device, comprising: a processor, a memory, and a bus, the memory storing machine-readable instructions executable by the processor; when the computer device runs, the processor and the memory communicate through the bus; when the machine-readable instructions are executed by the processor, the steps of the neural network training method according to any one of claims 1 to 9, or the steps of the video recognition method according to claim 10, are executed.
  22. A computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the neural network training method according to any one of claims 1 to 9, or the steps of the video recognition method according to claim 10, are executed.
  23. A computer program, comprising computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the steps of the neural network training method according to any one of claims 1 to 9, or the steps of the video recognition method according to claim 10.
PCT/CN2021/086199 2020-06-19 2021-04-09 Neural network training method, video recognition method and apparatus WO2021253938A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020227000769A KR20220011208A (ko) 2020-06-19 2021-04-09 Neural network training method, video recognition method and apparatus
JP2021570177A JP7163515B2 (ja) 2020-06-19 2021-04-09 Neural network training method, video recognition method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010567864.7A CN111767985B (zh) 2020-06-19 2020-06-19 Neural network training method, video recognition method and apparatus
CN202010567864.7 2020-06-19

Publications (1)

Publication Number Publication Date
WO2021253938A1

Family

ID=72721043

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086199 WO2021253938A1 (zh) 2020-06-19 2021-04-09 Neural network training method, video recognition method and apparatus

Country Status (5)

Country Link
JP (1) JP7163515B2 (zh)
KR (1) KR20220011208A (zh)
CN (1) CN111767985B (zh)
TW (1) TWI770967B (zh)
WO (1) WO2021253938A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767985B (zh) * 2020-06-19 2022-07-22 深圳市商汤科技有限公司 Neural network training method, video recognition method and apparatus
CN112598021A (zh) * 2020-11-27 2021-04-02 西北工业大学 Graph structure search method based on automatic machine learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281853A (zh) * 2014-09-02 2015-01-14 电子科技大学 Behavior recognition method based on a 3D convolutional neural network
CN109284820A (zh) * 2018-10-26 2019-01-29 北京图森未来科技有限公司 Structure search method and apparatus for a deep neural network
CN110705463A (zh) * 2019-09-29 2020-01-17 山东大学 Video human behavior recognition method and system based on a multi-modal two-stream 3D network
CN111767985A (zh) * 2020-06-19 2020-10-13 深圳市商汤科技有限公司 Neural network training method, video recognition method and apparatus

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10515304B2 (en) * 2015-04-28 2019-12-24 Qualcomm Incorporated Filter specificity as training criterion for neural networks
WO2017070656A1 (en) * 2015-10-23 2017-04-27 Hauptmann Alexander G Video content retrieval system
US10546211B2 (en) * 2016-07-01 2020-01-28 Google Llc Convolutional neural network on programmable two dimensional image processor
EP3306528B1 (en) * 2016-10-04 2019-12-25 Axis AB Using image analysis algorithms for providing traning data to neural networks
CN108664849A (zh) * 2017-03-30 2018-10-16 富士通株式会社 Apparatus and method for detecting events in video, and image processing device
US11010658B2 (en) * 2017-12-22 2021-05-18 Intel Corporation System and method for learning the structure of deep convolutional neural networks
CN108228861B (zh) * 2018-01-12 2020-09-01 第四范式(北京)技术有限公司 Method and system for performing feature engineering for machine learning
CN108334910B (zh) * 2018-03-30 2020-11-03 国信优易数据股份有限公司 Event detection model training method and event detection method
CN108985259B (zh) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 Human action recognition method and apparatus
JP7207630B2 (ja) * 2018-09-25 2023-01-18 Awl株式会社 Object recognition camera system, relearning system, and object recognition program
US20200167659A1 (en) * 2018-11-27 2020-05-28 Electronics And Telecommunications Research Institute Device and method for training neural network
CN110598598A (zh) * 2019-08-30 2019-12-20 西安理工大学 Two-stream convolutional neural network human behavior recognition method based on a limited sample set
CN110852168A (zh) * 2019-10-11 2020-02-28 西北大学 Pedestrian re-identification model construction method and apparatus based on neural architecture search

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281853A (zh) * 2014-09-02 2015-01-14 电子科技大学 Behavior recognition method based on a 3D convolutional neural network
CN109284820A (zh) * 2018-10-26 2019-01-29 北京图森未来科技有限公司 Structure search method and apparatus for a deep neural network
CN110705463A (zh) * 2019-09-29 2020-01-17 山东大学 Video human behavior recognition method and system based on a multi-modal two-stream 3D network
CN111767985A (zh) * 2020-06-19 2020-10-13 深圳市商汤科技有限公司 Neural network training method, video recognition method and apparatus

Also Published As

Publication number Publication date
TW202201285A (zh) 2022-01-01
JP7163515B2 (ja) 2022-10-31
CN111767985B (zh) 2022-07-22
KR20220011208A (ko) 2022-01-27
TWI770967B (zh) 2022-07-11
CN111767985A (zh) 2020-10-13
JP2022541712A (ja) 2022-09-27


Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021570177

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20227000769

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21826976

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.03.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21826976

Country of ref document: EP

Kind code of ref document: A1