CN111767985B - Neural network training method, video recognition method and device

Info

Publication number
CN111767985B
CN111767985B (application CN202010567864.7A)
Authority
CN
China
Prior art keywords
directed acyclic, node, neural network, graph, feature
Prior art date
Legal status
Active
Application number
CN202010567864.7A
Other languages
Chinese (zh)
Other versions
CN111767985A (en)
Inventor
王子豪
林宸
邵婧
盛律
闫俊杰
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN202010567864.7A (CN111767985B)
Publication of CN111767985A
Priority to KR1020227000769A (KR20220011208A)
Priority to PCT/CN2021/086199 (WO2021253938A1)
Priority to JP2021570177A (JP7163515B2)
Priority to TW110115206A (TWI770967B)
Application granted
Publication of CN111767985B

Classifications

    • G06N3/08 Learning methods
    • G06N3/045 Combinations of networks
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V10/84 Arrangements for image or video recognition or understanding using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/44 Event detection
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The disclosure provides a neural network training method, a video recognition method, and corresponding devices. The training method includes: acquiring sample videos and constructing a neural network comprising a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one for extracting spatial features, each edge of a directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters; and selecting a target operation method for each edge of the directed acyclic graphs based on the trained weight parameters to obtain a trained neural network.

Description

Neural network training method, video recognition method and device
Technical Field
The disclosure relates to the field of computer technology, and in particular to a neural network training method, a video recognition method, and corresponding devices.
Background
Video recognition refers to identifying events that occur in a video. In the related art, a neural network designed for image recognition is typically applied to video recognition after minor modification.
However, because such a neural network performs target recognition in the image dimension, it ignores video features that cannot be extracted from individual frames, which limits the accuracy of the neural network on video recognition.
Disclosure of Invention
Embodiments of the present disclosure provide at least a neural network training method, a video recognition method, and corresponding devices.
In a first aspect, an embodiment of the present disclosure provides a training method for a neural network, including:
acquiring sample videos, and constructing a neural network comprising a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of a directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
training the constructed neural network based on the sample videos and event labels corresponding to the sample videos to obtain trained weight parameters;
and selecting a target operation method for each edge of the directed acyclic graphs based on the trained weight parameters to obtain the trained neural network.
In the above method, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of a directed acyclic graph corresponds to a plurality of operation methods. After the network is trained on the sample videos, the trained weight parameters of the operation methods are obtained, and the trained neural network is then derived from those weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, and therefore achieves higher video recognition accuracy.
In one possible embodiment, the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
the constructing a neural network comprising a plurality of directed acyclic graphs, comprising:
taking the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and taking the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1;
the feature map corresponding to a target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node of the first graph is empty; the feature map of one input node of the second directed acyclic graph of the neural network is the feature map output by the first directed acyclic graph, and the other input node is empty.
In one possible embodiment, the feature map output by a directed acyclic graph is determined as follows:
concatenating the feature maps corresponding to the nodes of the directed acyclic graph other than its input nodes, and taking the concatenated feature map as the feature map output by the directed acyclic graph.
In a possible implementation manner, each edge in the directed acyclic graph for extracting the temporal feature corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting the spatial feature corresponds to a plurality of second operation methods; the plurality of first operation methods includes the plurality of second operation methods and at least one other operation method different from the second operation method.
In a possible implementation manner, the neural network further includes a sampling layer connected to the first directed acyclic graph, where the sampling layer is configured to sample a sample video to obtain a sampled video frame, perform feature extraction on the sampled video frame to obtain a feature map corresponding to the sampled video frame, and input the feature map corresponding to the sampled video frame into a target input node of the first directed acyclic graph;
the neural network further comprises a full connection layer connected with the output node of the last directed acyclic graph; the full connection layer is used for calculating the occurrence probability of various events corresponding to the sample video based on the feature graph output by the last directed acyclic graph;
the training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters includes:
training the constructed neural network based on the occurrence probability of various events corresponding to the sample videos calculated by the full connection layer and the event labels corresponding to the sample videos to obtain trained weight parameters.
In one possible implementation, the feature map corresponding to each node except the input node in the directed acyclic graph is obtained according to the following method:
generating the feature map corresponding to the node according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes.
The weight parameters control how strongly each operation method on an edge between a node and an upper-level node influences the node's feature map; by adjusting the weight parameters, the operation method corresponding to each edge can therefore be controlled, which in turn changes the value of the node's feature map.
In a possible implementation manner, generating the feature map corresponding to the node according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes includes:
for the edge between the node and each upper-level node pointing to the node, processing the feature map of the upper-level node with each operation method corresponding to the edge, to obtain a first intermediate feature map for each operation method;
performing a weighted summation of the first intermediate feature maps according to their corresponding weight parameters, to obtain the second intermediate feature map corresponding to the edge;
summing the second intermediate feature maps corresponding to the edges between the node and its upper-level nodes, to obtain the feature map corresponding to the node.
In this way, every operation method contributes when the feature map of a node is determined, which reduces the influence of any single operation method on the node's feature map and thus avoids degrading the recognition accuracy of the neural network.
In a possible embodiment, the selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameter includes:
for each edge of the directed acyclic graphs, taking the operation method with the largest weight parameter on that edge as the target operation method corresponding to the edge.
In one possible embodiment, the selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter to obtain a trained neural network includes:
for each node, when the number of edges pointing to the node is greater than a target number, determining the weight parameter of the target operation method corresponding to each edge pointing to the node;
sorting the edges pointing to the node in descending order of those weight parameters, and deleting all edges other than the top K, where K is the target number;
taking the neural network after this deletion processing as the trained neural network.
This reduces the size of the neural network and, at the same time, the number of computation steps, improving the computational efficiency of the neural network.
In a second aspect, an embodiment of the present disclosure further provides a video recognition method, including:
acquiring a video to be recognized;
inputting the video to be recognized into a neural network trained by the neural network training method of the first aspect or any one of its possible embodiments, and determining the occurrence probabilities of multiple events corresponding to the video to be recognized;
taking each event whose occurrence probability meets a preset condition as an event occurring in the video to be recognized.
In a third aspect, an embodiment of the present disclosure provides a training apparatus for a neural network, including:
a construction module, configured to acquire sample videos and construct a neural network comprising a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of a directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
the training module is used for training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters;
and the selecting module is used for selecting a target operation method for each edge of the directed acyclic graphs based on the trained weight parameters so as to obtain the trained neural network.
In one possible embodiment, the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
the building module, when building a neural network comprising a plurality of directed acyclic graphs, is configured to:
taking the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and taking the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1;
the feature map corresponding to a target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node of the first graph is empty; the feature map of one input node of the second directed acyclic graph of the neural network is the feature map output by the first directed acyclic graph, and the other input node is empty.
In a possible implementation, the building module is further configured to determine the feature map output by a directed acyclic graph as follows:
concatenating the feature maps corresponding to the nodes of the directed acyclic graph other than its input nodes, and taking the concatenated feature map as the feature map output by the directed acyclic graph.
In one possible implementation, each edge in the directed acyclic graph for extracting the temporal feature corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting the spatial feature corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method different from the second operation method.
In a possible implementation manner, the neural network further includes a sampling layer connected to the first directed acyclic graph, where the sampling layer is configured to sample a sample video to obtain a sampled video frame, perform feature extraction on the sampled video frame to obtain a feature map corresponding to the sampled video frame, and input the feature map corresponding to the sampled video frame into a target input node of the first directed acyclic graph;
the neural network further comprises a full connection layer connected with the output node of the last directed acyclic graph; the full connection layer is used for calculating, based on the feature map of the output node, the occurrence probabilities of the various events corresponding to the sample video;
the training module, when training the constructed neural network based on the sample videos and the event labels corresponding to each sample video to obtain trained weight parameters, is configured to:
training the constructed neural network based on the occurrence probability of various events corresponding to the sample videos calculated by the full connection layer and the event labels corresponding to the sample videos to obtain trained weight parameters.
In a possible embodiment, the building module is further configured to obtain a feature map corresponding to each node except the input node in the directed acyclic graph according to the following method:
generating the feature map corresponding to the node according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes.
In a possible implementation manner, when generating the feature map corresponding to the node according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes, the construction module is configured to:
for the edge between the node and each upper-level node pointing to the node, processing the feature map of the upper-level node with each operation method corresponding to the edge, to obtain a first intermediate feature map for each operation method;
performing a weighted summation of the first intermediate feature maps according to their corresponding weight parameters, to obtain the second intermediate feature map corresponding to the edge;
summing the second intermediate feature maps corresponding to the edges between the node and its upper-level nodes, to obtain the feature map corresponding to the node.
In a possible implementation, the selecting module, when selecting the target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameter, is configured to:
and aiming at each edge of the directed acyclic graph, taking the operation method with the maximum weight parameter corresponding to the edge as the target operation method corresponding to the edge.
In one possible embodiment, the selecting module, when selecting the target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter to obtain the trained neural network, is configured to:
for each node, when the number of edges pointing to the node is greater than a target number, determining the weight parameter of the target operation method corresponding to each edge pointing to the node;
sorting the edges pointing to the node in descending order of those weight parameters, and deleting all edges other than the top K, where K is the target number;
taking the neural network after this deletion processing as the trained neural network.
In a fourth aspect, an embodiment of the present disclosure further provides a video recognition apparatus, including:
an acquisition module, configured to acquire a video to be recognized;
a first determining module, configured to input the video to be recognized into a neural network trained by the neural network training method of the first aspect or any one of its possible embodiments, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized;
a second determining module, configured to take each event whose occurrence probability meets a preset condition as an event occurring in the video to be recognized.
In a fifth aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions, when executed by the processor, performing the steps of the first aspect, or any one of the possible implementations of the first aspect, or the second aspect.
In a sixth aspect, this disclosure also provides a computer-readable storage medium, where a computer program is stored, and the computer program is executed by a processor to perform the steps in the first aspect, or any one of the possible implementation manners of the first aspect, or to perform the steps in the second aspect.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly described below. The drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; for those of ordinary skill in the art, additional related drawings may be derived from them without inventive effort.
Fig. 1 shows a flowchart of a training method of a neural network provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a network structure of a neural network including a directed acyclic graph according to an embodiment of the present disclosure;
FIG. 3a is a schematic diagram illustrating a process of time convolution according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram illustrating another exemplary process of time convolution according to the embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a neural network architecture provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a directed acyclic graph provided by an embodiment of the present disclosure;
fig. 6 shows a flowchart of a method for generating a feature map corresponding to a node according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating an overall structure of a constructed neural network provided by an embodiment of the present disclosure;
FIG. 8 is a flow chart of a method for training a neural network provided by an embodiment of the present disclosure;
fig. 9 is a schematic flow chart illustrating a video recognition method provided by an embodiment of the present disclosure;
fig. 10 is a schematic diagram illustrating an architecture of a training apparatus for a neural network provided in an embodiment of the present disclosure;
fig. 11 is a schematic diagram illustrating an architecture of a video recognition apparatus provided in an embodiment of the present disclosure;
FIG. 12 is a schematic diagram illustrating a computer device according to an embodiment of the present disclosure;
fig. 13 shows a schematic structural diagram of another computer device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the disclosure, provided in the accompanying drawings, is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making any creative effort, shall fall within the protection scope of the disclosure.
In the related art, video recognition is generally performed by modifying an existing image recognition neural network. However, such a network performs recognition in the image dimension and ignores video features that cannot be extracted from it, which affects the recognition accuracy of the neural network.
In addition, in the related art the neural network for video recognition may be searched with an evolutionary algorithm. In that approach, each time the training of a batch of neural networks is completed, the best-performing network is selected and adjusted again; the adjustment process involves a large amount of computation, and training efficiency is low.
The above drawbacks were identified by the inventors through practice and careful study; the discovery of these problems and the solutions proposed below should therefore be regarded as the inventors' contribution to the present disclosure.
Based on this, the embodiments of the present disclosure provide a neural network training method in which the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of a directed acyclic graph corresponds to a plurality of operation methods. After the network is trained on the sample videos, the trained weight parameters of the operation methods are obtained, and the trained neural network is then derived from those weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, and therefore achieves higher video recognition accuracy.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
To facilitate understanding of the present embodiment, first, a training method for a neural network disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the training method for a neural network provided in the embodiments of the present disclosure is generally a computer device with certain computing power, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a personal computer, or the like, or a server or other processing device. Furthermore, the method proposed in the embodiments of the present disclosure can also be implemented by executing computer program codes by a processor.
Referring to fig. 1, a flowchart of a training method of a neural network provided in an embodiment of the present disclosure is shown, where the method includes steps 101 to 103, where:
Step 101, acquiring sample videos, and constructing a neural network comprising a plurality of directed acyclic graphs.
The plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of a directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter.
Step 102, training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos, to obtain trained weight parameters.
Step 103, selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters, to obtain the trained neural network.
The following is a detailed description of the above steps 101 to 103.
In one possible embodiment, the number of directed acyclic graphs used for extracting temporal features and the number used for extracting spatial features are preset when the neural network is constructed. The nodes of a directed acyclic graph represent feature maps, and the edges between nodes represent operation methods.
When constructing a neural network comprising a plurality of directed acyclic graphs, the feature map output by the (N-1)-th directed acyclic graph may be taken as the feature map of one input node of the (N+1)-th directed acyclic graph, and the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1.
In a possible implementation manner, each directed acyclic graph includes two input nodes. Any input node of the first directed acyclic graph of the neural network may serve as the target input node; the input of the target input node is the feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node of the first graph is empty. The feature map of one input node of the second directed acyclic graph is the feature map output by the first directed acyclic graph, and the other input node is empty. In other embodiments, a directed acyclic graph may also include one, three, or more input nodes.
When determining the feature map output by any directed acyclic graph, the feature maps corresponding to the nodes of the graph other than its input nodes may be concatenated (concat), and the concatenated feature map taken as the feature map output by the directed acyclic graph.
By way of example, the network structure of a constructed neural network containing directed acyclic graphs may be as shown in Fig. 2. Fig. 2 contains three directed acyclic graphs; white dots represent input nodes, and black dots represent the feature map obtained by concatenating the feature maps of a graph's non-input nodes. One input node of the first directed acyclic graph corresponds to the feature map of the sampled video frames of a sample video, and the other input node is empty; the feature map corresponding to the output node of the first graph serves as one input node of the second graph, whose other input node is empty; the feature maps output by the second and the first graphs serve as the feature maps of the two input nodes of the third graph, and so on.
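The chaining just described can be sketched in code as follows. This is a minimal sketch only: TinyCell is a hypothetical stand-in for a full directed acyclic graph with candidate operations on every edge, and modelling an empty input node as a zero tensor is an assumption.

```python
import torch
import torch.nn as nn

class TinyCell(nn.Module):
    """Placeholder for one directed acyclic graph (cell) with two input nodes."""
    def __init__(self, channels):
        super().__init__()
        # A real cell would apply weighted candidate operations on every edge;
        # here a single 1x1x1 convolution stands in for the whole graph.
        self.op = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, prev_prev, prev):
        if prev_prev is None:                      # empty input node
            prev_prev = torch.zeros_like(prev)
        return self.op(torch.cat([prev_prev, prev], dim=1))

cells = nn.ModuleList(TinyCell(8) for _ in range(4))
x = torch.randn(1, 8, 4, 16, 16)                   # stem feature map: (B, C, T, H, W)

prev_prev, prev = None, x                          # first cell: (empty, stem features)
for i, cell in enumerate(cells):
    out = cell(prev_prev, prev)
    # The second cell still has one empty input; from the third cell onward
    # the two inputs are the outputs of cells N-1 and N.
    prev_prev = None if i == 0 else prev
    prev = out
```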
In one embodiment, each edge in the directed acyclic graph for extracting the temporal feature corresponds to a plurality of first operation methods, each edge in the directed acyclic graph for extracting the spatial feature corresponds to a plurality of second operation methods, and the plurality of first operation methods include the plurality of second operation methods and at least one other operation method different from the second operation method.
For example, the plurality of second operation methods corresponding to each edge in the directed acyclic graph for extracting spatial features may include an average pooling operation (e.g., 1×3×3 average pooling), a maximum pooling operation (e.g., 1×3×3 maximum pooling), a discrete convolution operation (e.g., a 1×3×3 discrete convolution), and a dilated discrete convolution (e.g., a 1×3×3 dilated discrete convolution); the plurality of first operation methods corresponding to each edge in the directed acyclic graph for extracting temporal features may include these same operations together with different temporal convolutions.
The temporal convolutions are used to extract temporal features. Illustratively, a temporal convolution may have size 3+3×3, meaning that its convolution kernel has size 3 in the time dimension and 3×3 in the spatial dimensions. Its processing flow may be as shown in Fig. 3a, where Cin denotes the input feature map, Cout the output feature map after processing, ReLU an activation function, conv1×3×3 a convolution whose kernel has size 1 in the time dimension and 3×3 in the spatial dimensions, conv3×1×1 a convolution whose kernel has size 3 in the time dimension and 1×1 in the spatial dimensions, BatchNorm a normalization operation, and T, W, H the time dimension and the two spatial dimensions, respectively.
Illustratively, a temporal convolution may also have size 3+1×1, meaning that its kernel has size 3 in the time dimension and 1×1 in the spatial dimensions. Its processing flow may be as shown in Fig. 3b, where conv1×1×1 denotes a convolution whose kernel has size 1 in the time dimension and 1×1 in the spatial dimensions; the remaining symbols have the same meaning as in Fig. 3a and are not described again here.
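A minimal PyTorch sketch of the 3+3×3 temporal convolution of Fig. 3a follows; the placement of ReLU and BatchNorm follows the figure description, and the padding values are assumptions chosen here so that the feature-map size is preserved.

```python
import torch
import torch.nn as nn

class TemporalConv(nn.Module):
    """Sketch of the 3+3x3 temporal convolution: a spatial 1x3x3 convolution
    followed by a temporal 3x1x1 convolution, with ReLU and BatchNorm."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReLU(inplace=True),
            # kernel (T, H, W) = (1, 3, 3): spatial convolution on each frame
            nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            # kernel (T, H, W) = (3, 1, 1): convolution along the time dimension
            nn.Conv3d(c_out, c_out, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.BatchNorm3d(c_out),
        )

    def forward(self, x):          # x: (batch, C, T, H, W)
        return self.block(x)

y = TemporalConv(8, 16)(torch.randn(2, 8, 6, 32, 32))   # -> (2, 16, 6, 32, 32)
```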
In a possible implementation manner, when a neural network is initially constructed, the structures of the directed acyclic graphs used for extracting the time features are the same, but after the training of the neural network is completed, the target operation methods corresponding to the edges in the directed acyclic graphs used for extracting the time features may be different; similarly, when the neural network is constructed, the structures of the directed acyclic graphs used for extracting the spatial features are also the same, and after the training of the neural network is completed, the target operation methods corresponding to the edges in different directed acyclic graphs used for extracting the spatial features may also be different.
In one possible implementation, the directed acyclic graphs for extracting temporal features are of two kinds: first directed acyclic graphs, which change the size and channel number of the input feature map, and second directed acyclic graphs, which do not. A first directed acyclic graph may include a first preset number of nodes and a second directed acyclic graph a second preset number of nodes, and the two preset numbers may be equal. Likewise, the directed acyclic graphs for extracting spatial features are of two kinds: third directed acyclic graphs, which change the size and channel number of the input feature map, and fourth directed acyclic graphs, which do not; the third directed acyclic graph may include a third preset number of nodes and the fourth a fourth preset number, and these two preset numbers may also be equal.
The constructed neural network therefore includes these four kinds of directed acyclic graphs. In practical applications, the preset number of nodes for each directed acyclic graph specifies the number of nodes at each level of the graph; once the number of nodes at each level is determined, the connection relationships among the nodes can be determined directly, and the directed acyclic graph is thereby determined.
For example, as shown in Fig. 4, a sample video input into the neural network first passes through a sampling layer, which samples the video. Feature extraction is then performed on the sampled video frames, the result is input into the first directed acyclic graph, the output of the last directed acyclic graph is input into the full connection layer, and the output of the full connection layer is the output of the neural network.
It should be noted here that the size and the number of channels of the feature map are controlled by the directed acyclic graph, so that on one hand, the receptive field of the neural network can be expanded, and on the other hand, the calculation amount of the neural network can be reduced, and the calculation efficiency can be improved.
In a possible implementation manner, when determining the feature map corresponding to each node of a directed acyclic graph other than the input nodes, the feature map corresponding to the node may be generated according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes.
For example, if the directed acyclic graph is as shown in Fig. 5, the nodes pointing to node 3 are node 0, node 1, and node 2, so the feature map corresponding to node 3 may be determined from the feature maps corresponding to nodes 0, 1, and 2 and the weight parameters of the operation methods corresponding to the edges from nodes 0, 1, and 2 to node 3.
If the directed acyclic graph is used for extracting temporal features, the operation methods corresponding to the edges from nodes 0, 1, and 2 to node 3 are the first operation methods; if the graph is used for extracting spatial features, they are the second operation methods.
Specifically, when generating the feature map corresponding to the node, the method shown in fig. 6 may be referred to, including the following steps:
step 601, aiming at the edge between the current node and each previous-level node pointing to the current node, processing the feature graph of the previous-level node based on each operation method corresponding to the edge to obtain a first intermediate feature graph corresponding to each operation method.
For example, suppose the current node lies in a directed acyclic graph for temporal feature extraction, three edges point to the node, and each edge corresponds to six first operation methods. For any one of these edges, the feature map of the upper-level node connected by the edge is processed by each operation method corresponding to the edge, giving six first intermediate feature maps for that edge; with three edges pointing to the node, eighteen first intermediate feature maps are obtained in total.
If the current node lies in a directed acyclic graph for spatial feature extraction, three edges point to the node, and each edge corresponds to four second operation methods, then, by the same calculation, each edge yields four first intermediate feature maps and twelve first intermediate feature maps are obtained in total.
Step 602, performing weighted summation on the first intermediate feature maps respectively corresponding to the operation methods according to the corresponding weight parameters to obtain a second intermediate feature map corresponding to the edge.
The weight parameters are model parameters to be trained, and in one possible implementation, the weight parameters may be randomly assigned and then continuously adjusted in the training process of the neural network.
Each operation method on an edge pointing to the current node has a corresponding weight parameter. In the weighted summation, each first intermediate feature map is multiplied element-wise by the weight parameter of its operation method, and the products at corresponding positions are then added to obtain the second intermediate feature map for the edge.
Continuing the example of step 601: three edges point to the current node, each edge corresponds to six first operation methods, and each first operation method has a corresponding weight parameter, so each edge corresponds to six first intermediate feature maps; the six first intermediate feature maps of each edge are weighted and summed according to their weight parameters to obtain the second intermediate feature map for that edge.
It should be noted that the weight parameters of the same operation method may differ across edges. For example, if edge 1 and edge 2 both point to the current node and the operation methods of both edges include an average pooling operation, the weight parameter of the average pooling operation on edge 1 may be 70% while that on edge 2 is 10%.
For example, the second intermediate feature map corresponding to the edge between the i-th node and the j-th node may be calculated by the following formula:

$$\bar{o}^{(i,j)}(x_i)=\sum_{o\in\mathcal{O}}\frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o'\in\mathcal{O}}\exp\big(\alpha_{o'}^{(i,j)}\big)}\,o(x_i)$$

where $o$ and $o'$ denote operation methods, $\mathcal{O}$ denotes the set of operation methods on the edge between the i-th node and the j-th node, $\alpha_o^{(i,j)}$ denotes the weight parameter of operation method $o$ corresponding to that edge, $\alpha_{o'}^{(i,j)}$ denotes the weight parameter of operation method $o'$, $x_i$ denotes the feature map corresponding to the i-th node (so $o(x_i)$ is the result of applying $o$ to it), and $\bar{o}^{(i,j)}(x_i)$ denotes the second intermediate feature map corresponding to the edge.
Step 603, summing the second intermediate feature maps corresponding to the edges between the current node and the upper-level nodes pointing to it, to obtain the feature map corresponding to the current node.
The second intermediate feature maps all have the same size; the summation adds the values at corresponding positions of the second intermediate feature maps, and the result is the feature map corresponding to the current node.
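Steps 601 to 603 can be sketched in code as follows. This is a minimal PyTorch rendering, not the patent's implementation: the candidate operation set is illustrative only, and the softmax normalization of the weight parameters follows the formula given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the directed acyclic graph: step 601 applies every candidate
    operation method to the upstream feature map, and step 602 combines the
    first intermediate feature maps with softmax-normalized weight parameters."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))    # trainable weight parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        firsts = [op(x) for op in self.ops]                 # first intermediate feature maps
        return sum(w * f for w, f in zip(weights, firsts))  # second intermediate feature map

def candidate_ops():
    # Illustrative candidates only; the patent lists pooling, discrete and
    # dilated convolutions, and (for temporal graphs) temporal convolutions.
    return [nn.AvgPool3d(3, stride=1, padding=1),
            nn.MaxPool3d(3, stride=1, padding=1),
            nn.Identity()]

# Step 603: the node's feature map is the sum of the second intermediate
# feature maps of all edges pointing to it.
edges = [MixedOp(candidate_ops()) for _ in range(3)]        # three upper-level nodes
upstream = [torch.randn(1, 8, 4, 16, 16) for _ in range(3)]
node_feature = sum(edge(x) for edge, x in zip(edges, upstream))
```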
In addition, the constructed neural network further comprises a sampling layer and a full connection layer. The sampling layer samples the video input into the neural network to obtain sampled video frames, performs feature extraction on the sampled video frames to obtain their corresponding feature maps, and inputs those feature maps into the target input node of the first directed acyclic graph. The overall structure of the constructed neural network is shown, by way of example, in Fig. 7, which contains three directed acyclic graphs, one sampling layer, and one full connection layer; the output of the full connection layer is the output of the neural network.
The event labels corresponding to a sample video represent the events occurring in it; for example, the events occurring in a sample video may include a person running, a puppy playing, or two people playing badminton. In one possible implementation, training the constructed neural network based on the sample videos and their corresponding event labels may proceed as shown in Fig. 8, which includes the following steps:
step 801, inputting the sample video into a neural network, and outputting to obtain the occurrence probability of multiple events corresponding to the sample video.
Here, the number of event types predicted for a video equals the number of event-label types in the sample videos used for training. For example, if the neural network is trained with sample videos covering 400 types of event labels, then for any input video the network outputs the occurrence probability of each of those 400 event types.
Step 802, determining a predicted event corresponding to the sample video based on the occurrence probability of multiple events corresponding to the sample video.
For example, the event with the highest occurrence probability may be determined as the event predicted by the neural network. In another possible implementation, a sample video may carry multiple event labels, for example both an event label for a puppy playing and an event label for two people playing badminton; in that case, when determining the predicted events for the sample video from the occurrence probabilities of the multiple events, every event whose occurrence probability is greater than a preset probability may be determined as a predicted event for the sample video.
Step 803, determining a loss value for the training process based on the predicted events corresponding to the sample video and the event labels of the sample video.
For example, the cross entropy loss in the training process may be determined based on the predicted event corresponding to the sample video and the event label of the sample video.
Step 804, judging whether the loss value of the training process is smaller than a preset loss value.
If so, step 805 is performed next; otherwise, the parameter values of the neural network are adjusted and the process returns to step 801.
Here, the adjusted neural network parameters include weight parameters of operation methods corresponding to each edge of the directed acyclic graph, and since each weight parameter may affect selection of a target operation method corresponding to each edge of the directed acyclic graph, the weight parameters may be used as structural parameters of the neural network; the adjusted neural network parameters also include operation parameters, such as the size, weight, etc. of the convolution kernel of each convolution operation.
Because the convergence rates of the structural parameters and the operation parameters differ, while the operation parameters are still in the early stage of learning the structural parameters may converge too rapidly; synchronous learning of the operation parameters and the structural parameters can therefore be realized by controlling the learning rate.
For example, a stepwise learning-rate decay strategy may be adopted. Specifically, a hyper-parameter S may be preset, indicating that the learning rate is decayed once every S optimizations of the operation parameters and structural parameters, each time by a preset factor d. Decaying the learning rate gradually in this way enables synchronous learning, i.e., synchronous optimization, of the structural parameters and the operation parameters.
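A minimal sketch of this decay strategy, assuming hypothetical values for S and d; PyTorch's built-in step scheduler implements exactly this "decay by a factor every S steps" behaviour.

```python
import torch

S, d = 1000, 0.9                     # hypothetical values for the hyper-parameters S and d
params = [torch.nn.Parameter(torch.randn(3))]
optimizer = torch.optim.SGD(params, lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=S, gamma=d)

for step in range(5000):
    optimizer.zero_grad()
    loss = params[0].pow(2).sum()    # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()                 # multiplies the learning rate by d once every S steps
```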
In the prior art, parameter optimization generally uses the following formulas:

$$\omega^*(\alpha)=\operatorname*{arg\,min}_{\omega} L(\omega,\alpha) \tag{1}$$

$$\min_{\alpha}\; L\big(\omega^*(\alpha),\alpha\big) \tag{2}$$

In formula (1), $\alpha$ denotes the structural parameters, $\omega$ the operation parameters, and $L(\omega,\alpha)$ the loss value computed from $\omega$ with $\alpha$ fixed; $\omega^*(\alpha)$ is the value of $\omega$, obtained by training $\omega$ with $\alpha$ fixed, that minimizes $L(\omega,\alpha)$, i.e., the optimized $\omega$. In formula (2), with the optimized $\omega^*(\alpha)$ unchanged, $\alpha$ is trained based on its loss value so that $L(\omega^*(\alpha),\alpha)$ is minimal. In this method $\alpha$ must be adjusted continually, and $\omega$ must be retrained after every adjustment of $\alpha$; if, for example, each training of $\omega$ takes 100 computations and $\alpha$ is adjusted 100 times, 10000 computations are needed in total, which is a large computational cost.
In the method provided by the embodiments of the present disclosure, parameter optimization is instead generally based on the following approximation:

$$\nabla_{\alpha} L\big(\omega^*(\alpha),\alpha\big)\approx\nabla_{\alpha} L\big(\omega-\xi\,\nabla_{\omega}L(\omega,\alpha),\;\alpha\big)$$

where $\xi$ denotes the learning rate of the operation parameters and $\nabla_{\omega}L(\omega,\alpha)$ denotes the gradient of $\omega$ computed from $L(\omega,\alpha)$. Because the optimized $\omega$ is approximated by a single gradient step, each optimization of $\alpha$ requires only one computation for $\omega$; the procedure can therefore be regarded as optimizing $\alpha$ and $\omega$ simultaneously.
Based on this method, the network parameters of the neural network can be searched at the same time as its structure; compared with determining the network structure first and the network parameters afterwards, this improves the efficiency of determining the neural network.
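The simultaneous optimization can be sketched as the following first-order alternation; this is a simplification that drops the ξ-correction term inside the gradient above, and the toy two-candidate network, data, and learning rates are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySearchNet(nn.Module):
    """Toy stand-in: two candidate layers mixed by structural parameters alpha,
    so that both omega (the layer weights) and alpha are trainable."""
    def __init__(self):
        super().__init__()
        self.cand = nn.ModuleList([nn.Linear(16, 10), nn.Linear(16, 10)])
        self.alpha = nn.Parameter(torch.zeros(2))     # structural parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return w[0] * self.cand[0](x) + w[1] * self.cand[1](x)

net, loss_fn = ToySearchNet(), nn.CrossEntropyLoss()
xi = 0.025                                            # operation-parameter learning rate
w_opt = torch.optim.SGD((p for n, p in net.named_parameters() if n != "alpha"), lr=xi)
a_opt = torch.optim.Adam([net.alpha], lr=3e-4)

def batch():
    return torch.randn(4, 16), torch.randint(0, 10, (4,))

for _ in range(200):
    x_tr, y_tr = batch()                              # plays the role of a training batch
    w_opt.zero_grad()
    loss_fn(net(x_tr), y_tr).backward()               # one step on omega, alpha fixed
    w_opt.step()

    x_val, y_val = batch()                            # plays the role of a validation batch
    a_opt.zero_grad()
    loss_fn(net(x_val), y_val).backward()             # one step on alpha, omega fixed
    a_opt.step()
```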
Step 805, determining a trained neural network model based on the trained neural network parameters.
In a possible implementation manner, a target operation method may be selected for each edge of the multiple directed acyclic graphs based on the trained weight parameter, and the neural network model after determining the target operation method for each edge is the trained neural network.
Illustratively, when a target operation method is selected for each edge of the directed acyclic graphs based on the trained weight parameters, the operation method with the largest weight parameter on an edge is taken as the target operation method corresponding to that edge.
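A minimal sketch of this argmax selection, with an assumed candidate operation set and edge layout:

```python
import torch

# Minimal sketch of the argmax rule (candidate set and edges are assumed):
# for each edge, keep the operation whose trained weight is largest.
OPS = ["skip", "conv3x3", "conv5x5", "max_pool"]          # assumed candidates
edge_weights = {                                          # edge -> per-op weights
    ("node0", "node2"): torch.tensor([0.1, 0.6, 0.2, 0.1]),
    ("node1", "node2"): torch.tensor([0.5, 0.2, 0.2, 0.1]),
}

target_ops = {e: OPS[torch.argmax(w).item()] for e, w in edge_weights.items()}
print(target_ops)  # {('node0', 'node2'): 'conv3x3', ('node1', 'node2'): 'skip'}
```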
In another possible implementation, in order to reduce the size of the neural network and increase the computation speed of the neural network, after the target operation method is selected for each edge of the multiple directed acyclic graphs, the edges of the directed acyclic graphs may be further pruned, and then the pruned neural network is used as the trained neural network.
Specifically, for each node, when the number of edges pointing to the node is greater than a target number, the weight parameter of the target operation method corresponding to each edge pointing to the node is determined; the edges pointing to the node are then sorted in descending order of their weight parameters, the top K edges are retained, and the remaining edges are deleted, where K is the preset target number. The neural network after this deletion processing is taken as the trained neural network.
For example, if the target number is two and three edges point to a certain node, the weight parameters of the target operation methods corresponding to the three edges are determined, the three edges are sorted by weight parameter from large to small, the first two edges are retained, and the third edge is deleted.
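The same pruning rule in sketch form (data layout assumed):

```python
# Sketch of the pruning rule (data layout assumed): if more than K edges
# point to a node, keep the K edges whose target operation has the largest
# weight parameter and delete the rest.
K = 2
incoming = [            # (source node, weight of the edge's target operation)
    ("node0", 0.6),
    ("node1", 0.5),
    ("node3", 0.3),
]

incoming.sort(key=lambda edge: edge[1], reverse=True)  # descending by weight
kept, deleted = incoming[:K], incoming[K:]
print(kept)     # [('node0', 0.6), ('node1', 0.5)]
print(deleted)  # [('node3', 0.3)]
```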
Based on the same concept, the embodiment of the present disclosure further provides a video identification method, as shown in fig. 9, which is a schematic flow diagram of the video identification method provided by the embodiment of the present disclosure, and the method includes the following steps:
Step 901, acquiring a video to be identified.
Step 902, inputting the video to be recognized into a pre-trained neural network, and determining the occurrence probability of various events corresponding to the video to be recognized.
The neural network is obtained by training based on the training method of the neural network provided by the embodiment.
Step 903, taking the event whose corresponding occurrence probability meets the preset condition as the event occurring in the video to be identified.
The event whose occurrence probability meets the preset condition may be the event with the maximum occurrence probability, or any event whose occurrence probability is greater than a preset probability value.
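Both preset conditions can be sketched as follows (the event names and probabilities are made up for illustration):

```python
# Sketch of step 903 under the two preset conditions mentioned above
# (event names and probabilities are made up for illustration).
probs = {"fall": 0.72, "run": 0.18, "walk": 0.10}  # assumed network output

best_event = max(probs, key=probs.get)                        # max-probability rule
threshold_events = [e for e, p in probs.items() if p > 0.5]   # threshold rule
print(best_event, threshold_events)                           # fall ['fall']
```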
In the following, a specific embodiment is used to describe in detail how the neural network processes the video to be recognized after it is input. The neural network includes a sampling layer, a feature extraction layer, and a fully connected layer, and the feature extraction layer includes a plurality of directed acyclic graphs.
1) Sampling layer
After the video to be identified is input into the neural network, it first enters the sampling layer. The sampling layer samples the video to obtain a plurality of sampled video frames, performs feature extraction on the sampled video frames to obtain a feature map corresponding to each of them, and then inputs these feature maps into the feature extraction layer.
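A minimal sketch of such a sampling layer, assuming uniform frame sampling and a single convolutional stem (the frame count and channel sizes are illustrative, not the patent's configuration):

```python
import torch
from torch import nn

# Minimal sketch of a sampling layer: uniformly sample T frames from the
# video, then extract a per-frame feature map with a small convolutional
# stem. Frame count and channel sizes are assumptions.
class SamplingLayer(nn.Module):
    def __init__(self, num_frames=8, out_channels=16):
        super().__init__()
        self.num_frames = num_frames
        self.stem = nn.Conv2d(3, out_channels, kernel_size=3, padding=1)

    def forward(self, video):                # video: (frames, 3, H, W)
        idx = torch.linspace(0, video.shape[0] - 1, self.num_frames).long()
        sampled = video[idx]                 # uniformly sampled video frames
        return self.stem(sampled)            # feature map per sampled frame

feats = SamplingLayer()(torch.randn(64, 3, 32, 32))
print(feats.shape)                           # torch.Size([8, 16, 32, 32])
```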
2) Feature extraction layer
The feature extraction layer includes a plurality of directed acyclic graphs for temporal feature extraction and a plurality of directed acyclic graphs for spatial feature extraction. The number of each type of graph is preset, as is the number of nodes in each type of graph. The differences between the two types of graphs are shown in Table 1 below:
TABLE 1 (rendered as an image in the original). Per the description elsewhere in this disclosure, the two types of graphs differ in their candidate operation sets: each edge of a graph for temporal feature extraction corresponds to a plurality of first operation methods, each edge of a graph for spatial feature extraction corresponds to a plurality of second operation methods, and the first operation methods comprise the second operation methods plus at least one additional operation method.
After the sampling layer inputs the feature maps corresponding to the sampled video frames to the feature extraction layer, these feature maps are fed to the target input node of the first directed acyclic graph, whose other input node is empty. One input node of the second directed acyclic graph is connected to the output node of the first directed acyclic graph, and its other input node is empty. One input node of the third directed acyclic graph is connected to the output node of the second directed acyclic graph, and the other to the output node of the first directed acyclic graph, and so on; the output node of the last directed acyclic graph inputs its feature map to the fully connected layer.
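The feature flow between graphs can be sketched as follows; toy_cell is a stand-in for a directed acyclic graph, and "empty" inputs are modeled as None:

```python
# Schematic of the feature flow between graphs; toy_cell stands in for a
# directed acyclic graph and "empty" inputs are modeled as None.
def run_cells(cells, stem_features):
    prev_prev, prev = None, None
    for i, cell in enumerate(cells):
        if i == 0:
            out = cell(stem_features, None)   # target input node; other empty
        elif i == 1:
            out = cell(prev, None)            # first graph's output; other empty
        else:
            out = cell(prev, prev_prev)       # outputs of graphs N and N-1
        prev_prev, prev = prev, out
    return prev                               # goes to the fully connected layer

def toy_cell(a, b):
    return sum(x for x in (a, b) if x is not None)

print(run_cells([toy_cell] * 4, 1.0))         # 3.0
```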
3) Full connection layer
After the feature map corresponding to the output node of the last directed acyclic graph is input to the fully connected layer, the fully connected layer determines, based on this feature map, the occurrence probabilities of a plurality of events corresponding to the input video to be recognized; these events may be the event labels attached to the sample videos used when the neural network was trained.
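A minimal sketch of the fully connected layer, with assumed dimensions and a softmax producing one occurrence probability per event:

```python
import torch
from torch import nn

# Sketch of the fully connected layer (dimensions assumed): map the final
# feature map to one occurrence probability per event via softmax.
num_events, feat_dim = 5, 128
fc = nn.Linear(feat_dim, num_events)

feature = torch.randn(1, feat_dim)         # last graph's output, flattened
probs = torch.softmax(fc(feature), dim=1)  # occurrence probability per event
print(probs.sum().item())                  # ~1.0
```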
In the method provided by the above embodiment, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the graphs corresponds to a plurality of operation methods. After the neural network is trained with the sample videos, the trained weight parameters of the operation methods are obtained, and the trained neural network is derived from these weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, and therefore achieves higher video recognition precision.
It will be understood by those of skill in the art that in the above method of the present embodiment, the order of writing the steps does not imply a strict order of execution and does not impose any limitations on the implementation, as the order of execution of the steps should be determined by their function and possibly inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a training apparatus for a neural network corresponding to the training method for the neural network, and as the principle of solving the problem of the apparatus in the embodiment of the present disclosure is similar to the training method for the neural network described in the embodiment of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and the repeated parts are not described again.
Referring to fig. 10, there is shown a schematic architecture diagram of a training apparatus for a neural network according to an embodiment of the present disclosure, the apparatus includes: a construction module 1001, a training module 1002, and a selection module 1003; wherein,
a constructing module 1001, configured to obtain a sample video and construct a neural network including a plurality of directed acyclic graphs; the multiple directed acyclic graphs comprise at least one directed acyclic graph used for extracting time characteristics and at least one directed acyclic graph used for extracting space characteristics; each edge of the directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
a training module 1002, configured to train the constructed neural network based on the sample videos and event labels corresponding to each sample video, so as to obtain trained weight parameters;
a selecting module 1003, configured to select a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameter, so as to obtain a trained neural network.
In one possible embodiment, the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
the building module 1001, when building a neural network including a plurality of directed acyclic graphs, is configured to:
taking the feature graph output by the (N-1)th directed acyclic graph as a feature graph of one input node of the (N+1)th directed acyclic graph, and taking the feature graph output by the Nth directed acyclic graph as a feature graph of another input node of the (N+1)th directed acyclic graph; N is an integer greater than 1;
the characteristic graph corresponding to a target input node in a first directed acyclic graph of the neural network is a characteristic graph obtained after characteristic extraction is carried out on a sampling video frame of a sample video, and another input node except the target input node is empty; and the feature graph of one input node in the second directed acyclic graph of the neural network is the feature graph output by the first directed acyclic graph, and the other input node is empty.
In a possible implementation, the building module 1001 is further configured to determine the feature map output by the directed acyclic graph according to the following method:
and connecting feature graphs corresponding to other nodes except the input node in the directed acyclic graph in series, and taking the connected feature graphs as feature graphs output by the directed acyclic graph.
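A minimal sketch of this series connection, assuming the concatenation is along the channel dimension:

```python
import torch

# Minimal sketch of the series connection (channel-wise concatenation is an
# assumption of this sketch): concatenate the feature maps of all nodes
# other than the input nodes to form the graph's output feature map.
node_feats = [torch.randn(1, 8, 16, 16) for _ in range(4)]  # non-input nodes

graph_output = torch.cat(node_feats, dim=1)
print(graph_output.shape)  # torch.Size([1, 32, 16, 16])
```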
In one possible implementation, each edge in the directed acyclic graph for extracting the temporal feature corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting the spatial feature corresponds to a plurality of second operation methods; the plurality of first operation methods includes the plurality of second operation methods and at least one other operation method different from the second operation method.
In a possible implementation manner, the neural network further includes a sampling layer connected to the first directed acyclic graph, where the sampling layer is configured to sample a sample video to obtain a sampled video frame, perform feature extraction on the sampled video frame to obtain a feature map corresponding to the sampled video frame, and input the feature map corresponding to the sampled video frame into a target input node of the first directed acyclic graph;
the neural network further comprises a full connection layer connected with the output node of the last directed acyclic graph; the full connection layer is used for calculating the occurrence probability of various events corresponding to the sample video based on the feature graph output by the last directed acyclic graph;
the training module 1002 is configured to, when training the constructed neural network based on the sample videos and the event labels corresponding to each sample video to obtain trained weight parameters, be configured to:
and training the constructed neural network based on the occurrence probability of various events corresponding to the sample videos calculated by the full connection layer and the event label corresponding to each sample video to obtain the trained weight parameters.
In a possible implementation, the constructing module 1001 is further configured to obtain a feature map corresponding to each node except the input node in the directed acyclic graph according to the following method:
and generating the characteristic graph corresponding to the node according to the characteristic graph corresponding to each superior node pointing to the node and the weight parameters of the operation method corresponding to the edges between the node and each superior node pointing to the node.
In a possible implementation manner, the building module 1001, when generating the feature map corresponding to the node according to the feature map corresponding to each previous node pointing to the node and the weight parameter of the operation method corresponding to the edge between the node and each previous node pointing to the node, is configured to:
aiming at the edge between the node and each upper-level node pointing to the node, processing the feature graph of the upper-level node based on each operation method corresponding to the edge to obtain a first intermediate feature graph corresponding to each operation method;
carrying out weighted summation on the first intermediate characteristic graphs corresponding to the operation methods according to corresponding weight parameters to obtain second intermediate characteristic graphs corresponding to the edges;
and summing the second intermediate characteristic graphs respectively corresponding to the plurality of edges between the node and each upper-level node pointing to the node to obtain the characteristic graph corresponding to the node.
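The node computation described by this module can be sketched as follows; the candidate operations and tensor shapes are assumptions, and the weighted summation uses the raw weight parameters as stated above:

```python
import torch

# Sketch of the node computation (candidate operations and shapes are
# assumptions): per incoming edge, apply every candidate operation to the
# upstream feature map (first intermediate maps), weighted-sum them by the
# edge's weight parameters (second intermediate map), then sum over edges.
ops = [lambda x: x, lambda x: 0.5 * x, lambda x: x.relu()]  # assumed candidates

def node_feature(upstream_feats, edge_weights):
    # upstream_feats: one feature map per edge pointing to this node
    # edge_weights:   one per-operation weight tensor per such edge
    second_maps = []
    for feat, w in zip(upstream_feats, edge_weights):
        first_maps = [op(feat) for op in ops]                     # per-op maps
        second_maps.append(sum(wi * m for wi, m in zip(w, first_maps)))
    return sum(second_maps)                                       # sum over edges

x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
print(node_feature([x1, x2], [torch.randn(3), torch.randn(3)]).shape)
# torch.Size([4, 8])
```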
In a possible implementation, the selecting module 1003, when selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter, is configured to:
and aiming at each edge of the directed acyclic graph, taking the operation method with the maximum weight parameter corresponding to the edge as the target operation method corresponding to the edge.
In a possible implementation manner, the selecting module 1003, when selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter to obtain a trained neural network, is configured to:
for each node, determining a weight parameter of the target operation method corresponding to each edge pointing to the node under the condition that the number of the edges pointing to the node is larger than the target number;
sequencing all edges pointing to the node according to the descending order of the corresponding weight parameters, and deleting the rest edges except the edge of the previous K bits, wherein K is the target number;
and taking the neural network subjected to deletion processing as the trained neural network.
The description of the processing flow of each module in the apparatus and the interaction flow between the modules may refer to the relevant description in the above method embodiments, and will not be described in detail here.
Based on the same inventive concept, a video identification device corresponding to the video identification method is further provided in the embodiments of the present disclosure, and as shown in fig. 11, an architecture diagram of a video identification device provided in the embodiments of the present disclosure is shown, where the device includes: the obtaining module 1101, the first determining module 1102 and the second determining module 1103 specifically:
an obtaining module 1101, configured to obtain a video to be identified;
a first determining module 1102, configured to input the video to be recognized into a neural network obtained by training based on the training method of the neural network described in the foregoing embodiment, and determine occurrence probabilities of multiple events corresponding to the video to be recognized;
a second determining module 1103, configured to use an event whose corresponding occurrence probability meets a preset condition as an event occurring in the video to be identified.
Based on the same technical concept, the embodiment of the application also provides computer equipment. Referring to fig. 12, a schematic structural diagram of a computer device provided in the embodiment of the present application includes a processor 1201, a memory 1202, and a bus 1203. The storage 1202 is used for storing execution instructions, and includes a memory 12021 and an external storage 12022; the memory 12021 is also called an internal memory, and is used for temporarily storing operation data in the processor 1201 and data exchanged with an external storage 12022 such as a hard disk, and the processor 1201 exchanges data with the external storage 12022 through the memory 12021, and when the computer apparatus 1200 operates, the processor 1201 and the storage 1202 communicate with each other through the bus 1203, so that the processor 1201 executes the following instructions:
acquiring a sample video, and constructing a neural network comprising a plurality of directed acyclic graphs; the multiple directed acyclic graphs comprise at least one directed acyclic graph used for extracting time characteristics and at least one directed acyclic graph used for extracting space characteristics; each edge of the directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
training the constructed neural network based on the sample videos and event labels corresponding to the sample videos to obtain trained weight parameters;
and selecting a target operation method for each edge of the directed acyclic graphs based on the trained weight parameters to obtain a trained neural network.
Embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the training method for a neural network described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the neural network training method provided in the embodiments of the present disclosure includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute steps of the neural network training method described in the above method embodiments, which may be referred to in the above method embodiments specifically, and are not described herein again.
Based on the same technical concept, the embodiment of the application also provides computer equipment. Referring to fig. 13, a schematic structural diagram of a computer device 1300 provided in the embodiment of the present application includes a processor 1301, a memory 1302, and a bus 1303. The storage 1302 is used for storing execution instructions and includes a memory 13021 and an external storage 13022; the memory 13021 is also referred to as an internal memory, and is used for temporarily storing operation data in the processor 1301 and data exchanged with an external storage 13022 such as a hard disk, the processor 1301 exchanges data with the external storage 13022 through the memory 13021, and when the computer device 1300 runs, the processor 1301 and the storage 1302 communicate through the bus 1303, so that the processor 1301 executes the following instructions:
acquiring a video to be identified;
inputting the video to be recognized into a neural network obtained by training based on the neural network training method described in the above embodiment, and determining occurrence probabilities of various events corresponding to the video to be recognized;
and taking the event of which the corresponding occurrence probability meets the preset condition as the event occurring in the video to be identified.
The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the video identification method in the above-mentioned method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the video identification method provided in the embodiments of the present disclosure includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the steps of the video identification method described in the above method embodiments, which may be referred to specifically for the above method embodiments, and are not described herein again.
The embodiments of the present disclosure also provide a computer program, which when executed by a processor implements any one of the methods of the foregoing embodiments. The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: those skilled in the art can still make modifications or changes to the embodiments described in the foregoing embodiments, or make equivalent substitutions for some of the technical features, within the technical scope of the disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A method of training a neural network, comprising:
acquiring a sample video, and constructing a neural network comprising a plurality of directed acyclic graphs; the multiple directed acyclic graphs comprise at least one directed acyclic graph used for extracting time features and at least one directed acyclic graph used for extracting space features; each edge of the directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters;
selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters to obtain a trained neural network;
wherein the constructing a neural network comprising a plurality of directed acyclic graphs comprises:
taking the feature graph output by the (N-1)th directed acyclic graph as a feature graph of one input node of an (N+1)th directed acyclic graph, and taking the feature graph output by the Nth directed acyclic graph as a feature graph of the other input node of the (N+1)th directed acyclic graph; N is an integer greater than 1; the characteristic graph corresponding to a target input node in a first directed acyclic graph of the neural network is a characteristic graph obtained after characteristic extraction is carried out on a sampling video frame of a sample video, and another input node except the target input node is empty; and the feature graph of one input node in the second directed acyclic graph of the neural network is the feature graph output by the first directed acyclic graph, and the other input node is empty.
2. The method of claim 1, wherein the feature graph of the directed acyclic graph output is determined according to the following method:
and connecting feature graphs corresponding to other nodes except the input node in the directed acyclic graph in series, and taking the feature graph after the connection in series as the feature graph output by the directed acyclic graph.
3. The method according to claim 1 or 2, wherein there are a plurality of first operation methods corresponding to each edge in the directed acyclic graph for extracting the temporal feature, and there are a plurality of second operation methods corresponding to each edge in the directed acyclic graph for extracting the spatial feature; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method different from the second operation method.
4. The method according to any one of claims 1 to 3, wherein the neural network further comprises a sampling layer connected to the first directed acyclic graph, the sampling layer is configured to sample a sample video to obtain a sampled video frame, perform feature extraction on the sampled video frame to obtain a feature map corresponding to the sampled video frame, and input the feature map corresponding to the sampled video frame into a target input node of the first directed acyclic graph;
the neural network further comprises a fully connected layer connected with the last directed acyclic graph; the full connection layer is used for calculating the occurrence probability of various events corresponding to the sample video based on the feature graph output by the last directed acyclic graph;
the training of the constructed neural network based on the sample videos and the event labels corresponding to each sample video to obtain trained weight parameters comprises the following steps:
and training the constructed neural network based on the occurrence probability of various events corresponding to the sample videos calculated by the full connection layer and the event label corresponding to each sample video to obtain the trained weight parameters.
5. The method according to any one of claims 1 to 4, wherein the feature map corresponding to each node except the input node in the directed acyclic graph is obtained according to the following method:
and generating the characteristic graph corresponding to the node according to the characteristic graph corresponding to each superior node pointing to the node and the weight parameters of the operation method corresponding to the edges between the node and each superior node pointing to the node.
6. The method according to claim 5, wherein the generating the feature map corresponding to the node according to the feature map corresponding to each previous node pointing to the node and the weight parameters of the operation method corresponding to the edge between the node and each previous node pointing to the node comprises:
processing the feature graph of the previous-level node based on each operation method corresponding to the edge aiming at the edge between the node and each previous-level node pointing to the node to obtain a first intermediate feature graph corresponding to each operation method;
carrying out weighted summation on the first intermediate characteristic graphs respectively corresponding to the operation methods according to corresponding weight parameters to obtain second intermediate characteristic graphs corresponding to the edges;
and summing the second intermediate characteristic graphs respectively corresponding to the plurality of edges between the node and each upper-level node pointing to the node to obtain the characteristic graph corresponding to the node.
7. The method according to any one of claims 1 to 6, wherein the selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter comprises:
and aiming at each edge of the directed acyclic graph, taking the operation method with the maximum weight parameter corresponding to the edge as the target operation method corresponding to the edge.
8. The method of claim 7, wherein selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter to obtain a trained neural network comprises:
for each node, determining a weight parameter of the target operation method corresponding to each edge pointing to the node under the condition that the number of the edges pointing to the node is larger than the target number;
sequencing all edges pointing to the node according to the descending order of the corresponding weight parameters, and deleting the rest edges except the edge of the previous K bits, wherein K is the target number;
and taking the neural network subjected to deletion processing as the trained neural network.
9. A method for video recognition, comprising:
acquiring a video to be identified;
inputting the video to be recognized into a neural network obtained by training based on the neural network training method according to any one of claims 1 to 8, and determining the occurrence probability of a plurality of events corresponding to the video to be recognized;
and taking the event of which the corresponding occurrence probability meets the preset condition as the event occurring in the video to be identified.
10. An apparatus for training a neural network, comprising:
the system comprises a construction module, a data processing module and a data processing module, wherein the construction module is used for obtaining a sample video and constructing a neural network comprising a plurality of directed acyclic graphs; the multiple directed acyclic graphs comprise at least one directed acyclic graph used for extracting time characteristics and at least one directed acyclic graph used for extracting space characteristics; each edge of the directed acyclic graph corresponds to a plurality of operation methods respectively, and each operation method has a corresponding weight parameter; the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
the training module is used for training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters;
a selection module, configured to select a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameter, so as to obtain a trained neural network;
wherein the construction module, when constructing a neural network comprising a plurality of directed acyclic graphs, is configured to:
taking the feature graph output by the (N-1)th directed acyclic graph as a feature graph of one input node of the (N+1)th directed acyclic graph, and taking the feature graph output by the Nth directed acyclic graph as a feature graph of another input node of the (N+1)th directed acyclic graph; N is an integer greater than 1; the characteristic graph corresponding to a target input node in a first directed acyclic graph of the neural network is a characteristic graph obtained after characteristic extraction is carried out on a sampling video frame of a sample video, and another input node except the target input node is empty; and the feature graph of one input node in the second directed acyclic graph of the neural network is the feature graph output by the first directed acyclic graph, and the other input node is empty.
11. A video recognition apparatus, comprising:
the acquisition module is used for acquiring a video to be identified;
a first determining module, configured to input the video to be recognized into a neural network trained based on the training method of the neural network according to any one of claims 1 to 8, and determine occurrence probabilities of multiple events corresponding to the video to be recognized;
and the second determining module is used for taking the event of which the corresponding occurrence probability meets the preset condition as the event occurring in the video to be identified.
12. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when a computer device is run, the machine-readable instructions when executed by the processor performing the steps of the method of training a neural network as claimed in any one of claims 1 to 8, or performing the steps of the method of video recognition as claimed in claim 9.
13. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the method for training a neural network according to any one of claims 1 to 8, or performs the steps of the method for video recognition according to claim 9.