CN111767985B - Neural network training method, video recognition method and device

Info

Publication number
CN111767985B
CN111767985B (application CN202010567864.7A)
Authority
CN
China
Prior art keywords
directed acyclic, node, neural network, graph, feature
Prior art date
Legal status
Active
Application number
CN202010567864.7A
Other languages
Chinese (zh)
Other versions
CN111767985A (en)
Inventor
王子豪
林宸
邵婧
盛律
闫俊杰
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN202010567864.7A (CN111767985B)
Publication of CN111767985A
Priority to KR1020227000769A (KR20220011208A)
Priority to PCT/CN2021/086199 (WO2021253938A1)
Priority to JP2021570177A (JP7163515B2)
Priority to TW110115206A (TWI770967B)
Application granted
Publication of CN111767985B

Classifications

    • G06N3/08 Learning methods
    • G06N3/045 Combinations of networks
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V10/84 Arrangements for image or video recognition or understanding using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/44 Event detection
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The disclosure provides a neural network training method, a video recognition method, and corresponding devices. The training method includes: acquiring sample videos and constructing a neural network comprising a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one for extracting spatial features, each edge of a directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters; and selecting a target operation method for each edge of the directed acyclic graphs based on the trained weight parameters to obtain a trained neural network.

Description

Neural network training method, video recognition method and device
Technical Field
The disclosure relates to the field of computer technology, and in particular to a neural network training method, a video recognition method, and corresponding devices.
Background
Video recognition refers to identifying events that occur in a video. In the related art, a neural network designed for image recognition is typically applied to video recognition after minor modification.
However, because such a neural network performs target recognition in the image dimension, it ignores video features that cannot be extracted from individual frames, which limits the accuracy of the neural network on video recognition.
Disclosure of Invention
Embodiments of the present disclosure provide at least a neural network training method, a video recognition method, and corresponding devices.
In a first aspect, an embodiment of the present disclosure provides a training method for a neural network, including:
acquiring sample videos, and constructing a neural network comprising a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of a directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
training the constructed neural network based on the sample videos and event labels corresponding to the sample videos to obtain trained weight parameters;
and selecting a target operation method for each edge of the directed acyclic graphs based on the trained weight parameters to obtain the trained neural network.
In the above method, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of a directed acyclic graph corresponds to a plurality of operation methods. After the network is trained on the sample videos, the trained weight parameters of the operation methods are obtained, and the trained neural network is then derived from those weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, and therefore achieves higher video recognition accuracy.
In one possible embodiment, the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
the constructing a neural network comprising a plurality of directed acyclic graphs, comprising:
taking the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and taking the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1;
the feature map corresponding to a target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node of the first graph is empty; the feature map of one input node of the second directed acyclic graph of the neural network is the feature map output by the first directed acyclic graph, and the other input node is empty.
In one possible embodiment, the feature map output by a directed acyclic graph is determined as follows:
concatenating the feature maps corresponding to the nodes of the directed acyclic graph other than its input nodes, and taking the concatenated feature map as the feature map output by the directed acyclic graph.
In a possible implementation manner, each edge in the directed acyclic graph for extracting the temporal feature corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting the spatial feature corresponds to a plurality of second operation methods; the plurality of first operation methods includes the plurality of second operation methods and at least one other operation method different from the second operation method.
In a possible implementation manner, the neural network further includes a sampling layer connected to the first directed acyclic graph, where the sampling layer is configured to sample a sample video to obtain a sampled video frame, perform feature extraction on the sampled video frame to obtain a feature map corresponding to the sampled video frame, and input the feature map corresponding to the sampled video frame into a target input node of the first directed acyclic graph;
the neural network further comprises a full connection layer connected with the output node of the last directed acyclic graph; the full connection layer is used for calculating the occurrence probability of various events corresponding to the sample video based on the feature graph output by the last directed acyclic graph;
the training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters includes:
training the constructed neural network based on the occurrence probability of various events corresponding to the sample videos calculated by the full connection layer and the event labels corresponding to the sample videos to obtain trained weight parameters.
In one possible implementation, the feature map corresponding to each node except the input node in the directed acyclic graph is obtained according to the following method:
generating the feature map corresponding to the node according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes.
The weight parameters control how strongly each operation method on an edge between a node and an upper-level node influences the node's feature map; by adjusting the weight parameters, the operation method corresponding to each edge can therefore be controlled, which in turn changes the value of the node's feature map.
In a possible implementation manner, generating the feature map corresponding to the node according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes includes:
for the edge between the node and each upper-level node pointing to the node, processing the feature map of the upper-level node with each operation method corresponding to the edge, to obtain a first intermediate feature map for each operation method;
performing a weighted summation of the first intermediate feature maps according to their corresponding weight parameters, to obtain the second intermediate feature map corresponding to the edge;
summing the second intermediate feature maps corresponding to the edges between the node and its upper-level nodes, to obtain the feature map corresponding to the node.
In this way, every operation method contributes when the feature map of a node is determined, which reduces the influence of any single operation method on the node's feature map and thus avoids degrading the recognition accuracy of the neural network.
In a possible embodiment, the selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameter includes:
for each edge of the directed acyclic graphs, taking the operation method with the largest weight parameter on that edge as the target operation method corresponding to the edge.
In one possible embodiment, the selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter to obtain a trained neural network includes:
for each node, when the number of edges pointing to the node is greater than a target number, determining the weight parameter of the target operation method corresponding to each edge pointing to the node;
sorting the edges pointing to the node in descending order of those weight parameters, and deleting all edges other than the top K, where K is the target number;
taking the neural network after this deletion processing as the trained neural network.
This reduces the size of the neural network and, at the same time, the number of computation steps, improving the computational efficiency of the neural network.
In a second aspect, an embodiment of the present disclosure further provides a video recognition method, including:
acquiring a video to be recognized;
inputting the video to be recognized into a neural network trained by the neural network training method of the first aspect or any one of its possible embodiments, and determining the occurrence probabilities of multiple events corresponding to the video to be recognized;
taking each event whose occurrence probability meets a preset condition as an event occurring in the video to be recognized.
In a third aspect, an embodiment of the present disclosure provides a training apparatus for a neural network, including:
a construction module, configured to acquire sample videos and construct a neural network comprising a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of a directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
the training module is used for training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters;
and the selecting module is used for selecting a target operation method for each edge of the directed acyclic graphs based on the trained weight parameters so as to obtain the trained neural network.
In one possible embodiment, the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
the building module, when building a neural network comprising a plurality of directed acyclic graphs, is configured to:
taking the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and taking the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1;
the feature map corresponding to a target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node of the first graph is empty; the feature map of one input node of the second directed acyclic graph of the neural network is the feature map output by the first directed acyclic graph, and the other input node is empty.
In a possible implementation, the building module is further configured to determine the feature map output by a directed acyclic graph as follows:
concatenating the feature maps corresponding to the nodes of the directed acyclic graph other than its input nodes, and taking the concatenated feature map as the feature map output by the directed acyclic graph.
In one possible implementation, each edge in the directed acyclic graph for extracting the temporal feature corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting the spatial feature corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method different from the second operation method.
In a possible implementation manner, the neural network further includes a sampling layer connected to the first directed acyclic graph, where the sampling layer is configured to sample a sample video to obtain a sampled video frame, perform feature extraction on the sampled video frame to obtain a feature map corresponding to the sampled video frame, and input the feature map corresponding to the sampled video frame into a target input node of the first directed acyclic graph;
the neural network further comprises a full connection layer connected with the output node of the last directed acyclic graph; the full connection layer is used for calculating, based on the feature map of the output node, the occurrence probabilities of the various events corresponding to the sample video;
the training module, when training the constructed neural network based on the sample videos and the event labels corresponding to each sample video to obtain trained weight parameters, is configured to:
training the constructed neural network based on the occurrence probability of various events corresponding to the sample videos calculated by the full connection layer and the event labels corresponding to the sample videos to obtain trained weight parameters.
In a possible embodiment, the building module is further configured to obtain a feature map corresponding to each node except the input node in the directed acyclic graph according to the following method:
generating the feature map corresponding to the node according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes.
In a possible implementation manner, when generating the feature map corresponding to the node according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes, the construction module is configured to:
for the edge between the node and each upper-level node pointing to the node, processing the feature map of the upper-level node with each operation method corresponding to the edge, to obtain a first intermediate feature map for each operation method;
performing a weighted summation of the first intermediate feature maps according to their corresponding weight parameters, to obtain the second intermediate feature map corresponding to the edge;
summing the second intermediate feature maps corresponding to the edges between the node and its upper-level nodes, to obtain the feature map corresponding to the node.
In a possible implementation, the selecting module, when selecting the target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameter, is configured to:
and aiming at each edge of the directed acyclic graph, taking the operation method with the maximum weight parameter corresponding to the edge as the target operation method corresponding to the edge.
In one possible embodiment, the selecting module, when selecting the target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter to obtain the trained neural network, is configured to:
for each node, when the number of edges pointing to the node is greater than a target number, determining the weight parameter of the target operation method corresponding to each edge pointing to the node;
sorting the edges pointing to the node in descending order of those weight parameters, and deleting all edges other than the top K, where K is the target number;
taking the neural network after this deletion processing as the trained neural network.
In a fourth aspect, an embodiment of the present disclosure further provides a video recognition apparatus, including:
an acquisition module, configured to acquire a video to be recognized;
a first determining module, configured to input the video to be recognized into a neural network trained by the neural network training method of the first aspect or any one of its possible embodiments, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized;
a second determining module, configured to take each event whose occurrence probability meets a preset condition as an event occurring in the video to be recognized.
In a fifth aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions, when executed by the processor, performing the steps of the first aspect, or any one of the possible implementations of the first aspect, or the second aspect.
In a sixth aspect, this disclosure also provides a computer-readable storage medium, where a computer program is stored, and the computer program is executed by a processor to perform the steps in the first aspect, or any one of the possible implementation manners of the first aspect, or to perform the steps in the second aspect.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly described below. The drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; for those of ordinary skill in the art, additional related drawings may be derived from them without inventive effort.
Fig. 1 shows a flowchart of a training method of a neural network provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a network structure of a neural network including a directed acyclic graph according to an embodiment of the present disclosure;
FIG. 3a is a schematic diagram illustrating a process of time convolution according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram illustrating another exemplary process of time convolution according to the embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a neural network architecture provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a directed acyclic graph provided by an embodiment of the present disclosure;
fig. 6 shows a flowchart of a method for generating a feature map corresponding to a node according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating an overall structure of a constructed neural network provided by an embodiment of the present disclosure;
FIG. 8 is a flow chart of a method for training a neural network provided by an embodiment of the present disclosure;
fig. 9 is a schematic flow chart illustrating a video recognition method provided by an embodiment of the present disclosure;
fig. 10 is a schematic diagram illustrating an architecture of a training apparatus for a neural network provided in an embodiment of the present disclosure;
fig. 11 is a schematic diagram illustrating an architecture of a video recognition apparatus provided in an embodiment of the present disclosure;
FIG. 12 is a schematic diagram illustrating a computer device according to an embodiment of the present disclosure;
fig. 13 shows a schematic structural diagram of another computer device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the disclosure, provided in the accompanying drawings, is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making any creative effort, shall fall within the protection scope of the disclosure.
In the related art, video recognition is generally performed by modifying an existing image recognition neural network. However, such a network performs recognition in the image dimension and ignores video features that cannot be extracted from it, which affects the recognition accuracy of the neural network.
In addition, in the related art the neural network for video recognition may be searched with an evolutionary algorithm. In that approach, each time the training of a batch of neural networks is completed, the best-performing network is selected and adjusted again; the adjustment process involves a large amount of computation, and training efficiency is low.
The above drawbacks were identified by the inventors through practice and careful study; the discovery of these problems and the solutions proposed below should therefore be regarded as the inventors' contribution to the present disclosure.
Based on this, the embodiments of the present disclosure provide a neural network training method in which the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of a directed acyclic graph corresponds to a plurality of operation methods. After the network is trained on the sample videos, the trained weight parameters of the operation methods are obtained, and the trained neural network is then derived from those weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, and therefore achieves higher video recognition accuracy.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
To facilitate understanding of the present embodiment, first, a training method for a neural network disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the training method for a neural network provided in the embodiments of the present disclosure is generally a computer device with certain computing power, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a personal computer, or the like, or a server or other processing device. Furthermore, the method proposed in the embodiments of the present disclosure can also be implemented by executing computer program codes by a processor.
Referring to fig. 1, a flowchart of a training method of a neural network provided in an embodiment of the present disclosure is shown, where the method includes steps 101 to 103, where:
Step 101, acquiring sample videos, and constructing a neural network comprising a plurality of directed acyclic graphs.
The plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of a directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter.
Step 102, training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos, to obtain trained weight parameters.
Step 103, selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters, to obtain the trained neural network.
The following is a detailed description of the above steps 101 to 103.
In one possible embodiment, the number of directed acyclic graphs used for extracting temporal features and the number used for extracting spatial features are preset when the neural network is constructed. The nodes of a directed acyclic graph represent feature maps, and the edges between nodes represent operation methods.
When constructing a neural network comprising a plurality of directed acyclic graphs, the feature map output by the (N-1)-th directed acyclic graph may be taken as the feature map of one input node of the (N+1)-th directed acyclic graph, and the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1.
In a possible implementation manner, each directed acyclic graph includes two input nodes. Any input node of the first directed acyclic graph of the neural network may serve as the target input node; the input of the target input node is the feature map obtained by performing feature extraction on sampled video frames of a sample video, and the other input node of the first graph is empty. The feature map of one input node of the second directed acyclic graph is the feature map output by the first directed acyclic graph, and the other input node is empty. In other embodiments, a directed acyclic graph may also include one, three, or more input nodes.
When determining the feature map output by any directed acyclic graph, the feature maps corresponding to the nodes of the graph other than its input nodes may be concatenated (concat), and the concatenated feature map taken as the feature map output by the directed acyclic graph.
By way of example, the network structure of a constructed neural network containing directed acyclic graphs may be as shown in Fig. 2. Fig. 2 contains three directed acyclic graphs; white dots represent input nodes, and black dots represent the feature map obtained by concatenating the feature maps of a graph's non-input nodes. One input node of the first directed acyclic graph corresponds to the feature map of the sampled video frames of a sample video, and the other input node is empty; the feature map corresponding to the output node of the first graph serves as one input node of the second graph, whose other input node is empty; the feature maps output by the second and the first graphs serve as the feature maps of the two input nodes of the third graph, and so on.
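The chaining just described can be sketched in code as follows. This is a minimal sketch only: TinyCell is a hypothetical stand-in for a full directed acyclic graph with candidate operations on every edge, and modelling an empty input node as a zero tensor is an assumption.

```python
import torch
import torch.nn as nn

class TinyCell(nn.Module):
    """Placeholder for one directed acyclic graph (cell) with two input nodes."""
    def __init__(self, channels):
        super().__init__()
        # A real cell would apply weighted candidate operations on every edge;
        # here a single 1x1x1 convolution stands in for the whole graph.
        self.op = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, prev_prev, prev):
        if prev_prev is None:                      # empty input node
            prev_prev = torch.zeros_like(prev)
        return self.op(torch.cat([prev_prev, prev], dim=1))

cells = nn.ModuleList(TinyCell(8) for _ in range(4))
x = torch.randn(1, 8, 4, 16, 16)                   # stem feature map: (B, C, T, H, W)

prev_prev, prev = None, x                          # first cell: (empty, stem features)
for i, cell in enumerate(cells):
    out = cell(prev_prev, prev)
    # The second cell still has one empty input; from the third cell onward
    # the two inputs are the outputs of cells N-1 and N.
    prev_prev = None if i == 0 else prev
    prev = out
```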
In one embodiment, each edge in the directed acyclic graph for extracting the temporal feature corresponds to a plurality of first operation methods, each edge in the directed acyclic graph for extracting the spatial feature corresponds to a plurality of second operation methods, and the plurality of first operation methods include the plurality of second operation methods and at least one other operation method different from the second operation method.
For example, the plurality of second operation methods corresponding to each edge in the directed acyclic graph for extracting spatial features may include an average pooling operation (e.g., 1×3×3 average pooling), a maximum pooling operation (e.g., 1×3×3 maximum pooling), a discrete convolution operation (e.g., a 1×3×3 discrete convolution), and a dilated discrete convolution (e.g., a 1×3×3 dilated discrete convolution); the plurality of first operation methods corresponding to each edge in the directed acyclic graph for extracting temporal features may include these same operations together with different temporal convolutions.
The temporal convolutions are used to extract temporal features. Illustratively, a temporal convolution may have size 3+3×3, meaning that its convolution kernel has size 3 in the time dimension and 3×3 in the spatial dimensions. Its processing flow may be as shown in Fig. 3a, where Cin denotes the input feature map, Cout the output feature map after processing, ReLU an activation function, conv1×3×3 a convolution whose kernel has size 1 in the time dimension and 3×3 in the spatial dimensions, conv3×1×1 a convolution whose kernel has size 3 in the time dimension and 1×1 in the spatial dimensions, BatchNorm a normalization operation, and T, W, H the time dimension and the two spatial dimensions, respectively.
Illustratively, a temporal convolution may also have size 3+1×1, meaning that its kernel has size 3 in the time dimension and 1×1 in the spatial dimensions. Its processing flow may be as shown in Fig. 3b, where conv1×1×1 denotes a convolution whose kernel has size 1 in the time dimension and 1×1 in the spatial dimensions; the remaining symbols have the same meaning as in Fig. 3a and are not described again here.
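A minimal PyTorch sketch of the 3+3×3 temporal convolution of Fig. 3a follows; the placement of ReLU and BatchNorm follows the figure description, and the padding values are assumptions chosen here so that the feature-map size is preserved.

```python
import torch
import torch.nn as nn

class TemporalConv(nn.Module):
    """Sketch of the 3+3x3 temporal convolution: a spatial 1x3x3 convolution
    followed by a temporal 3x1x1 convolution, with ReLU and BatchNorm."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReLU(inplace=True),
            # kernel (T, H, W) = (1, 3, 3): spatial convolution on each frame
            nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            # kernel (T, H, W) = (3, 1, 1): convolution along the time dimension
            nn.Conv3d(c_out, c_out, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.BatchNorm3d(c_out),
        )

    def forward(self, x):          # x: (batch, C, T, H, W)
        return self.block(x)

y = TemporalConv(8, 16)(torch.randn(2, 8, 6, 32, 32))   # -> (2, 16, 6, 32, 32)
```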
In a possible implementation manner, when a neural network is initially constructed, the structures of the directed acyclic graphs used for extracting the time features are the same, but after the training of the neural network is completed, the target operation methods corresponding to the edges in the directed acyclic graphs used for extracting the time features may be different; similarly, when the neural network is constructed, the structures of the directed acyclic graphs used for extracting the spatial features are also the same, and after the training of the neural network is completed, the target operation methods corresponding to the edges in different directed acyclic graphs used for extracting the spatial features may also be different.
In one possible implementation, the directed acyclic graphs for extracting temporal features are of two kinds: first directed acyclic graphs, which change the size and channel number of the input feature map, and second directed acyclic graphs, which do not. A first directed acyclic graph may include a first preset number of nodes and a second directed acyclic graph a second preset number of nodes, and the two preset numbers may be equal. Likewise, the directed acyclic graphs for extracting spatial features are of two kinds: third directed acyclic graphs, which change the size and channel number of the input feature map, and fourth directed acyclic graphs, which do not; the third directed acyclic graph may include a third preset number of nodes and the fourth a fourth preset number, and these two preset numbers may also be equal.
The constructed neural network therefore includes these four kinds of directed acyclic graphs. In practical applications, the preset number of nodes for each directed acyclic graph specifies the number of nodes at each level of the graph; once the number of nodes at each level is determined, the connection relationships among the nodes can be determined directly, and the directed acyclic graph is thereby determined.
For example, as shown in Fig. 4, a sample video input into the neural network first passes through a sampling layer, which samples the video. Feature extraction is then performed on the sampled video frames, the result is input into the first directed acyclic graph, the output of the last directed acyclic graph is input into the full connection layer, and the output of the full connection layer is the output of the neural network.
It should be noted here that the size and the number of channels of the feature map are controlled by the directed acyclic graph, so that on one hand, the receptive field of the neural network can be expanded, and on the other hand, the calculation amount of the neural network can be reduced, and the calculation efficiency can be improved.
In a possible implementation manner, when determining the feature map corresponding to each node of a directed acyclic graph other than the input nodes, the feature map corresponding to the node may be generated according to the feature maps corresponding to the upper-level nodes pointing to the node and the weight parameters of the operation methods corresponding to the edges between the node and those upper-level nodes.
For example, if the directed acyclic graph is as shown in Fig. 5, the nodes pointing to node 3 are node 0, node 1, and node 2, so the feature map corresponding to node 3 may be determined from the feature maps corresponding to nodes 0, 1, and 2 and the weight parameters of the operation methods corresponding to the edges from nodes 0, 1, and 2 to node 3.
If the directed acyclic graph is used for extracting temporal features, the operation methods corresponding to the edges from nodes 0, 1, and 2 to node 3 are the first operation methods; if the graph is used for extracting spatial features, they are the second operation methods.
Specifically, when generating the feature map corresponding to the node, the method shown in fig. 6 may be referred to, including the following steps:
step 601, aiming at the edge between the current node and each previous-level node pointing to the current node, processing the feature graph of the previous-level node based on each operation method corresponding to the edge to obtain a first intermediate feature graph corresponding to each operation method.
For example, suppose the current node lies in a directed acyclic graph for temporal feature extraction, three edges point to the node, and each edge corresponds to six first operation methods. For any one of these edges, the feature map of the upper-level node connected by the edge is processed by each operation method corresponding to the edge, giving six first intermediate feature maps for that edge; with three edges pointing to the node, eighteen first intermediate feature maps are obtained in total.
If the current node lies in a directed acyclic graph for spatial feature extraction, three edges point to the node, and each edge corresponds to four second operation methods, then, by the same calculation, each edge yields four first intermediate feature maps and twelve first intermediate feature maps are obtained in total.
Step 602, performing weighted summation on the first intermediate feature maps respectively corresponding to the operation methods according to the corresponding weight parameters to obtain a second intermediate feature map corresponding to the edge.
The weight parameters are model parameters to be trained, and in one possible implementation, the weight parameters may be randomly assigned and then continuously adjusted in the training process of the neural network.
Each operation method on an edge pointing to the current node has a corresponding weight parameter. In the weighted summation, each first intermediate feature map is multiplied element-wise by the weight parameter of its operation method, and the products at corresponding positions are then added to obtain the second intermediate feature map for the edge.
Continuing the example of step 601: three edges point to the current node, each edge corresponds to six first operation methods, and each first operation method has a corresponding weight parameter, so each edge corresponds to six first intermediate feature maps; the six first intermediate feature maps of each edge are weighted and summed according to their weight parameters to obtain the second intermediate feature map for that edge.
It should be noted that the weight parameters of the same operation method may differ across edges. For example, if edge 1 and edge 2 both point to the current node and the operation methods of both edges include an average pooling operation, the weight parameter of the average pooling operation on edge 1 may be 70% while that on edge 2 is 10%.
For example, the second intermediate feature map corresponding to the edge between the i-th node and the j-th node may be calculated by the following formula:

$$\bar{o}^{(i,j)}(x_i)=\sum_{o\in\mathcal{O}}\frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o'\in\mathcal{O}}\exp\big(\alpha_{o'}^{(i,j)}\big)}\,o(x_i)$$

where $o$ and $o'$ denote operation methods, $\mathcal{O}$ denotes the set of operation methods on the edge between the i-th node and the j-th node, $\alpha_o^{(i,j)}$ denotes the weight parameter of operation method $o$ corresponding to that edge, $\alpha_{o'}^{(i,j)}$ denotes the weight parameter of operation method $o'$, $x_i$ denotes the feature map corresponding to the i-th node (so $o(x_i)$ is the result of applying $o$ to it), and $\bar{o}^{(i,j)}(x_i)$ denotes the second intermediate feature map corresponding to the edge.
Step 603, summing the second intermediate feature maps corresponding to the edges between the current node and the upper-level nodes pointing to it, to obtain the feature map corresponding to the current node.
The second intermediate feature maps all have the same size; the summation adds the values at corresponding positions of the second intermediate feature maps, and the result is the feature map corresponding to the current node.
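Steps 601 to 603 can be sketched in code as follows. This is a minimal PyTorch rendering, not the patent's implementation: the candidate operation set is illustrative only, and the softmax normalization of the weight parameters follows the formula given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the directed acyclic graph: step 601 applies every candidate
    operation method to the upstream feature map, and step 602 combines the
    first intermediate feature maps with softmax-normalized weight parameters."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))    # trainable weight parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        firsts = [op(x) for op in self.ops]                 # first intermediate feature maps
        return sum(w * f for w, f in zip(weights, firsts))  # second intermediate feature map

def candidate_ops():
    # Illustrative candidates only; the patent lists pooling, discrete and
    # dilated convolutions, and (for temporal graphs) temporal convolutions.
    return [nn.AvgPool3d(3, stride=1, padding=1),
            nn.MaxPool3d(3, stride=1, padding=1),
            nn.Identity()]

# Step 603: the node's feature map is the sum of the second intermediate
# feature maps of all edges pointing to it.
edges = [MixedOp(candidate_ops()) for _ in range(3)]        # three upper-level nodes
upstream = [torch.randn(1, 8, 4, 16, 16) for _ in range(3)]
node_feature = sum(edge(x) for edge, x in zip(edges, upstream))
```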
In addition, the constructed neural network further comprises a sampling layer and a full connection layer. The sampling layer samples the video input into the neural network to obtain sampled video frames, performs feature extraction on the sampled video frames to obtain their corresponding feature maps, and inputs those feature maps into the target input node of the first directed acyclic graph. The overall structure of the constructed neural network is shown, by way of example, in Fig. 7, which contains three directed acyclic graphs, one sampling layer, and one full connection layer; the output of the full connection layer is the output of the neural network.
The event labels corresponding to a sample video represent the events occurring in it; for example, the events occurring in a sample video may include a person running, a puppy playing, or two people playing badminton. In one possible implementation, training the constructed neural network based on the sample videos and their corresponding event labels may proceed as shown in Fig. 8, which includes the following steps:
step 801, inputting the sample video into a neural network, and outputting to obtain the occurrence probability of multiple events corresponding to the sample video.
Here, the number of event types predicted for a video equals the number of event-label types in the sample videos used for training. For example, if the neural network is trained with sample videos covering 400 types of event labels, then for any input video the network outputs the occurrence probability of each of those 400 event types.
Step 802, determining a predicted event corresponding to the sample video based on the occurrence probability of multiple events corresponding to the sample video.
For example, the event with the highest occurrence probability may be determined as the event predicted by the neural network. In another possible implementation, a sample video may carry multiple event labels, for example both an event label for a puppy playing and an event label for two people playing badminton; in that case, when determining the predicted events for the sample video from the occurrence probabilities of the multiple events, every event whose occurrence probability is greater than a preset probability may be determined as a predicted event for the sample video.
Step 803, determining a loss value for the training process based on the predicted events corresponding to the sample video and the event labels of the sample video.
For example, the cross entropy loss in the training process may be determined based on the predicted event corresponding to the sample video and the event label of the sample video.
Step 804, judging whether the loss value of the training process is smaller than a preset loss value.
If so, step 805 is performed next; otherwise, the parameter values of the neural network are adjusted and the process returns to step 801.
Here, the adjusted neural network parameters include weight parameters of operation methods corresponding to each edge of the directed acyclic graph, and since each weight parameter may affect selection of a target operation method corresponding to each edge of the directed acyclic graph, the weight parameters may be used as structural parameters of the neural network; the adjusted neural network parameters also include operation parameters, such as the size, weight, etc. of the convolution kernel of each convolution operation.
Because the convergence rates of the structural parameters and the operation parameters differ, while the operation parameters are still in the early stage of learning the structural parameters may converge too rapidly; synchronous learning of the operation parameters and the structural parameters can therefore be realized by controlling the learning rate.
For example, a stepwise learning-rate decay strategy may be adopted. Specifically, a hyper-parameter S may be preset, indicating that the learning rate is decayed once every S optimizations of the operation parameters and structural parameters, each time by a preset factor d. Decaying the learning rate gradually in this way enables synchronous learning, i.e., synchronous optimization, of the structural parameters and the operation parameters.
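A minimal sketch of this decay strategy, assuming hypothetical values for S and d; PyTorch's built-in step scheduler implements exactly this "decay by a factor every S steps" behaviour.

```python
import torch

S, d = 1000, 0.9                     # hypothetical values for the hyper-parameters S and d
params = [torch.nn.Parameter(torch.randn(3))]
optimizer = torch.optim.SGD(params, lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=S, gamma=d)

for step in range(5000):
    optimizer.zero_grad()
    loss = params[0].pow(2).sum()    # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()                 # multiplies the learning rate by d once every S steps
```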
In the prior art, parameter optimization generally uses the following formulas:

$$\omega^*(\alpha)=\operatorname*{arg\,min}_{\omega} L(\omega,\alpha) \tag{1}$$

$$\min_{\alpha}\; L\big(\omega^*(\alpha),\alpha\big) \tag{2}$$

In formula (1), $\alpha$ denotes the structural parameters, $\omega$ the operation parameters, and $L(\omega,\alpha)$ the loss value computed from $\omega$ with $\alpha$ fixed; $\omega^*(\alpha)$ is the value of $\omega$, obtained by training $\omega$ with $\alpha$ fixed, that minimizes $L(\omega,\alpha)$, i.e., the optimized $\omega$. In formula (2), with the optimized $\omega^*(\alpha)$ unchanged, $\alpha$ is trained based on its loss value so that $L(\omega^*(\alpha),\alpha)$ is minimal. In this method $\alpha$ must be adjusted continually, and $\omega$ must be retrained after every adjustment of $\alpha$; if, for example, each training of $\omega$ takes 100 computations and $\alpha$ is adjusted 100 times, 10000 computations are needed in total, which is a large computational cost.
In the method provided by the embodiments of the present disclosure, parameter optimization is instead generally based on the following approximation:

$$\nabla_{\alpha} L\big(\omega^*(\alpha),\alpha\big)\approx\nabla_{\alpha} L\big(\omega-\xi\,\nabla_{\omega}L(\omega,\alpha),\;\alpha\big)$$

where $\xi$ denotes the learning rate of the operation parameters and $\nabla_{\omega}L(\omega,\alpha)$ denotes the gradient of $\omega$ computed from $L(\omega,\alpha)$. Because the optimized $\omega$ is approximated by a single gradient step, each optimization of $\alpha$ requires only one computation for $\omega$; the procedure can therefore be regarded as optimizing $\alpha$ and $\omega$ simultaneously.
Based on this method, the network parameters of the neural network can be searched at the same time as its structure; compared with determining the network structure first and the network parameters afterwards, this improves the efficiency of determining the neural network.
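The simultaneous optimization can be sketched as the following first-order alternation; this is a simplification that drops the ξ-correction term inside the gradient above, and the toy two-candidate network, data, and learning rates are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySearchNet(nn.Module):
    """Toy stand-in: two candidate layers mixed by structural parameters alpha,
    so that both omega (the layer weights) and alpha are trainable."""
    def __init__(self):
        super().__init__()
        self.cand = nn.ModuleList([nn.Linear(16, 10), nn.Linear(16, 10)])
        self.alpha = nn.Parameter(torch.zeros(2))     # structural parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return w[0] * self.cand[0](x) + w[1] * self.cand[1](x)

net, loss_fn = ToySearchNet(), nn.CrossEntropyLoss()
xi = 0.025                                            # operation-parameter learning rate
w_opt = torch.optim.SGD((p for n, p in net.named_parameters() if n != "alpha"), lr=xi)
a_opt = torch.optim.Adam([net.alpha], lr=3e-4)

def batch():
    return torch.randn(4, 16), torch.randint(0, 10, (4,))

for _ in range(200):
    x_tr, y_tr = batch()                              # plays the role of a training batch
    w_opt.zero_grad()
    loss_fn(net(x_tr), y_tr).backward()               # one step on omega, alpha fixed
    w_opt.step()

    x_val, y_val = batch()                            # plays the role of a validation batch
    a_opt.zero_grad()
    loss_fn(net(x_val), y_val).backward()             # one step on alpha, omega fixed
    a_opt.step()
```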
Step 805, determining a trained neural network model based on the trained neural network parameters.
In a possible implementation manner, a target operation method may be selected for each edge of the multiple directed acyclic graphs based on the trained weight parameter, and the neural network model after determining the target operation method for each edge is the trained neural network.
Illustratively, when a target operation method is selected for each edge of the directed acyclic graphs based on the trained weight parameters, the operation method with the largest weight parameter on an edge is taken as the target operation method corresponding to that edge.
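A minimal sketch of this argmax selection, with an assumed candidate operation set and edge layout:

```python
import torch

# Minimal sketch of the argmax rule (candidate set and edges are assumed):
# for each edge, keep the operation whose trained weight is largest.
OPS = ["skip", "conv3x3", "conv5x5", "max_pool"]          # assumed candidates
edge_weights = {                                          # edge -> per-op weights
    ("node0", "node2"): torch.tensor([0.1, 0.6, 0.2, 0.1]),
    ("node1", "node2"): torch.tensor([0.5, 0.2, 0.2, 0.1]),
}

target_ops = {e: OPS[torch.argmax(w).item()] for e, w in edge_weights.items()}
print(target_ops)  # {('node0', 'node2'): 'conv3x3', ('node1', 'node2'): 'skip'}
```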
In another possible implementation, in order to reduce the size of the neural network and increase the computation speed of the neural network, after the target operation method is selected for each edge of the multiple directed acyclic graphs, the edges of the directed acyclic graphs may be further pruned, and then the pruned neural network is used as the trained neural network.
Specifically, for each node, when the number of edges pointing to the node is greater than a target number, the weight parameter of the target operation method corresponding to each edge pointing to the node is determined; the edges pointing to the node are then sorted in descending order of their weight parameters, the top K edges are retained, and the remaining edges are deleted, where K is the preset target number. The neural network after this deletion processing is taken as the trained neural network.
For example, if the target number is two and three edges point to a certain node, the weight parameters of the target operation methods corresponding to the three edges are determined, the three edges are sorted by weight parameter from large to small, the first two edges are retained, and the third edge is deleted.
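The same pruning rule in sketch form (data layout assumed):

```python
# Sketch of the pruning rule (data layout assumed): if more than K edges
# point to a node, keep the K edges whose target operation has the largest
# weight parameter and delete the rest.
K = 2
incoming = [            # (source node, weight of the edge's target operation)
    ("node0", 0.6),
    ("node1", 0.5),
    ("node3", 0.3),
]

incoming.sort(key=lambda edge: edge[1], reverse=True)  # descending by weight
kept, deleted = incoming[:K], incoming[K:]
print(kept)     # [('node0', 0.6), ('node1', 0.5)]
print(deleted)  # [('node3', 0.3)]
```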
Based on the same concept, the embodiment of the present disclosure further provides a video identification method, as shown in fig. 9, which is a schematic flow diagram of the video identification method provided by the embodiment of the present disclosure, and the method includes the following steps:
Step 901, acquiring a video to be identified.
Step 902, inputting the video to be recognized into a pre-trained neural network, and determining the occurrence probability of various events corresponding to the video to be recognized.
The neural network is obtained by training based on the training method of the neural network provided by the embodiment.
Step 903, taking the event whose corresponding occurrence probability meets the preset condition as the event occurring in the video to be identified.
The event whose occurrence probability meets the preset condition may be the event with the maximum occurrence probability, or any event whose occurrence probability is greater than a preset probability value.
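Both preset conditions can be sketched as follows (the event names and probabilities are made up for illustration):

```python
# Sketch of step 903 under the two preset conditions mentioned above
# (event names and probabilities are made up for illustration).
probs = {"fall": 0.72, "run": 0.18, "walk": 0.10}  # assumed network output

best_event = max(probs, key=probs.get)                        # max-probability rule
threshold_events = [e for e, p in probs.items() if p > 0.5]   # threshold rule
print(best_event, threshold_events)                           # fall ['fall']
```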
In the following, a specific embodiment is used to describe in detail how the neural network processes the video to be recognized after it is input. The neural network includes a sampling layer, a feature extraction layer, and a fully connected layer, and the feature extraction layer includes a plurality of directed acyclic graphs.
1) Sampling layer
After the video to be identified is input into the neural network, it first enters the sampling layer. The sampling layer samples the video to obtain a plurality of sampled video frames, performs feature extraction on the sampled video frames to obtain a feature map corresponding to each of them, and then inputs these feature maps into the feature extraction layer.
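A minimal sketch of such a sampling layer, assuming uniform frame sampling and a single convolutional stem (the frame count and channel sizes are illustrative, not the patent's configuration):

```python
import torch
from torch import nn

# Minimal sketch of a sampling layer: uniformly sample T frames from the
# video, then extract a per-frame feature map with a small convolutional
# stem. Frame count and channel sizes are assumptions.
class SamplingLayer(nn.Module):
    def __init__(self, num_frames=8, out_channels=16):
        super().__init__()
        self.num_frames = num_frames
        self.stem = nn.Conv2d(3, out_channels, kernel_size=3, padding=1)

    def forward(self, video):                # video: (frames, 3, H, W)
        idx = torch.linspace(0, video.shape[0] - 1, self.num_frames).long()
        sampled = video[idx]                 # uniformly sampled video frames
        return self.stem(sampled)            # feature map per sampled frame

feats = SamplingLayer()(torch.randn(64, 3, 32, 32))
print(feats.shape)                           # torch.Size([8, 16, 32, 32])
```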
2) Feature extraction layer
The feature extraction layer includes a plurality of directed acyclic graphs for temporal feature extraction and a plurality of directed acyclic graphs for spatial feature extraction. The number of each type of graph is preset, as is the number of nodes in each type of graph. The differences between the two types of graphs are shown in Table 1 below:
TABLE 1 (rendered as an image in the original). Per the description elsewhere in this disclosure, the two types of graphs differ in their candidate operation sets: each edge of a graph for temporal feature extraction corresponds to a plurality of first operation methods, each edge of a graph for spatial feature extraction corresponds to a plurality of second operation methods, and the first operation methods comprise the second operation methods plus at least one additional operation method.
After the sampling layer inputs the feature maps corresponding to the sampled video frames to the feature extraction layer, these feature maps are fed to the target input node of the first directed acyclic graph, whose other input node is empty. One input node of the second directed acyclic graph is connected to the output node of the first directed acyclic graph, and its other input node is empty. One input node of the third directed acyclic graph is connected to the output node of the second directed acyclic graph, and the other to the output node of the first directed acyclic graph, and so on; the output node of the last directed acyclic graph inputs its feature map to the fully connected layer.
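The feature flow between graphs can be sketched as follows; toy_cell is a stand-in for a directed acyclic graph, and "empty" inputs are modeled as None:

```python
# Schematic of the feature flow between graphs; toy_cell stands in for a
# directed acyclic graph and "empty" inputs are modeled as None.
def run_cells(cells, stem_features):
    prev_prev, prev = None, None
    for i, cell in enumerate(cells):
        if i == 0:
            out = cell(stem_features, None)   # target input node; other empty
        elif i == 1:
            out = cell(prev, None)            # first graph's output; other empty
        else:
            out = cell(prev, prev_prev)       # outputs of graphs N and N-1
        prev_prev, prev = prev, out
    return prev                               # goes to the fully connected layer

def toy_cell(a, b):
    return sum(x for x in (a, b) if x is not None)

print(run_cells([toy_cell] * 4, 1.0))         # 3.0
```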
3) Full connection layer
After the feature map corresponding to the output node of the last directed acyclic graph is input to the fully connected layer, the fully connected layer determines, based on this feature map, the occurrence probabilities of a plurality of events corresponding to the input video to be recognized; these events may be the event labels attached to the sample videos used when the neural network was trained.
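A minimal sketch of the fully connected layer, with assumed dimensions and a softmax producing one occurrence probability per event:

```python
import torch
from torch import nn

# Sketch of the fully connected layer (dimensions assumed): map the final
# feature map to one occurrence probability per event via softmax.
num_events, feat_dim = 5, 128
fc = nn.Linear(feat_dim, num_events)

feature = torch.randn(1, feat_dim)         # last graph's output, flattened
probs = torch.softmax(fc(feature), dim=1)  # occurrence probability per event
print(probs.sum().item())                  # ~1.0
```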
In the method provided by the above embodiment, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of the graphs corresponds to a plurality of operation methods. After the neural network is trained with the sample videos, the trained weight parameters of the operation methods are obtained, and the trained neural network is derived from these weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, and therefore achieves higher video recognition precision.
It will be understood by those of skill in the art that in the above method of the present embodiment, the order of writing the steps does not imply a strict order of execution and does not impose any limitations on the implementation, as the order of execution of the steps should be determined by their function and possibly inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a training apparatus for a neural network corresponding to the training method for the neural network, and as the principle of solving the problem of the apparatus in the embodiment of the present disclosure is similar to the training method for the neural network described in the embodiment of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and the repeated parts are not described again.
Referring to fig. 10, there is shown a schematic architecture diagram of a training apparatus for a neural network according to an embodiment of the present disclosure, the apparatus includes: a construction module 1001, a training module 1002, and a selection module 1003; wherein,
a constructing module 1001, configured to obtain a sample video and construct a neural network including a plurality of directed acyclic graphs; the multiple directed acyclic graphs comprise at least one directed acyclic graph used for extracting time characteristics and at least one directed acyclic graph used for extracting space characteristics; each edge of the directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
a training module 1002, configured to train the constructed neural network based on the sample videos and event labels corresponding to each sample video, so as to obtain trained weight parameters;
a selecting module 1003, configured to select a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameter, so as to obtain a trained neural network.
In one possible embodiment, the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
the building module 1001, when building a neural network including a plurality of directed acyclic graphs, is configured to:
taking the feature graph output by the (N-1)th directed acyclic graph as a feature graph of one input node of the (N+1)th directed acyclic graph, and taking the feature graph output by the Nth directed acyclic graph as a feature graph of another input node of the (N+1)th directed acyclic graph; N is an integer greater than 1;
the characteristic graph corresponding to a target input node in a first directed acyclic graph of the neural network is a characteristic graph obtained after characteristic extraction is carried out on a sampling video frame of a sample video, and another input node except the target input node is empty; and the feature graph of one input node in the second directed acyclic graph of the neural network is the feature graph output by the first directed acyclic graph, and the other input node is empty.
In a possible implementation, the building module 1001 is further configured to determine the feature map output by the directed acyclic graph according to the following method:
and connecting feature graphs corresponding to other nodes except the input node in the directed acyclic graph in series, and taking the connected feature graphs as feature graphs output by the directed acyclic graph.
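A minimal sketch of this series connection, assuming the concatenation is along the channel dimension:

```python
import torch

# Minimal sketch of the series connection (channel-wise concatenation is an
# assumption of this sketch): concatenate the feature maps of all nodes
# other than the input nodes to form the graph's output feature map.
node_feats = [torch.randn(1, 8, 16, 16) for _ in range(4)]  # non-input nodes

graph_output = torch.cat(node_feats, dim=1)
print(graph_output.shape)  # torch.Size([1, 32, 16, 16])
```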
In one possible implementation, each edge in the directed acyclic graph for extracting the temporal feature corresponds to a plurality of first operation methods, and each edge in the directed acyclic graph for extracting the spatial feature corresponds to a plurality of second operation methods; the plurality of first operation methods includes the plurality of second operation methods and at least one other operation method different from the second operation method.
In a possible implementation manner, the neural network further includes a sampling layer connected to the first directed acyclic graph, where the sampling layer is configured to sample a sample video to obtain a sampled video frame, perform feature extraction on the sampled video frame to obtain a feature map corresponding to the sampled video frame, and input the feature map corresponding to the sampled video frame into a target input node of the first directed acyclic graph;
the neural network further comprises a full connection layer connected with the output node of the last directed acyclic graph; the full connection layer is used for calculating the occurrence probability of various events corresponding to the sample video based on the feature graph output by the last directed acyclic graph;
the training module 1002 is configured to, when training the constructed neural network based on the sample videos and the event labels corresponding to each sample video to obtain trained weight parameters, be configured to:
and training the constructed neural network based on the occurrence probability of various events corresponding to the sample videos calculated by the full connection layer and the event label corresponding to each sample video to obtain the trained weight parameters.
In a possible implementation, the constructing module 1001 is further configured to obtain a feature map corresponding to each node except the input node in the directed acyclic graph according to the following method:
and generating the characteristic graph corresponding to the node according to the characteristic graph corresponding to each superior node pointing to the node and the weight parameters of the operation method corresponding to the edges between the node and each superior node pointing to the node.
In a possible implementation manner, the building module 1001, when generating the feature map corresponding to the node according to the feature map corresponding to each previous node pointing to the node and the weight parameter of the operation method corresponding to the edge between the node and each previous node pointing to the node, is configured to:
aiming at the edge between the node and each upper-level node pointing to the node, processing the feature graph of the upper-level node based on each operation method corresponding to the edge to obtain a first intermediate feature graph corresponding to each operation method;
carrying out weighted summation on the first intermediate characteristic graphs corresponding to the operation methods according to corresponding weight parameters to obtain second intermediate characteristic graphs corresponding to the edges;
and summing the second intermediate characteristic graphs respectively corresponding to the plurality of edges between the node and each upper-level node pointing to the node to obtain the characteristic graph corresponding to the node.
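The node computation described by this module can be sketched as follows; the candidate operations and tensor shapes are assumptions, and the weighted summation uses the raw weight parameters as stated above:

```python
import torch

# Sketch of the node computation (candidate operations and shapes are
# assumptions): per incoming edge, apply every candidate operation to the
# upstream feature map (first intermediate maps), weighted-sum them by the
# edge's weight parameters (second intermediate map), then sum over edges.
ops = [lambda x: x, lambda x: 0.5 * x, lambda x: x.relu()]  # assumed candidates

def node_feature(upstream_feats, edge_weights):
    # upstream_feats: one feature map per edge pointing to this node
    # edge_weights:   one per-operation weight tensor per such edge
    second_maps = []
    for feat, w in zip(upstream_feats, edge_weights):
        first_maps = [op(feat) for op in ops]                     # per-op maps
        second_maps.append(sum(wi * m for wi, m in zip(w, first_maps)))
    return sum(second_maps)                                       # sum over edges

x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
print(node_feature([x1, x2], [torch.randn(3), torch.randn(3)]).shape)
# torch.Size([4, 8])
```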
In a possible implementation, the selecting module 1003, when selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter, is configured to:
and aiming at each edge of the directed acyclic graph, taking the operation method with the maximum weight parameter corresponding to the edge as the target operation method corresponding to the edge.
In a possible implementation manner, the selecting module 1003, when selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter to obtain a trained neural network, is configured to:
for each node, determining a weight parameter of the target operation method corresponding to each edge pointing to the node under the condition that the number of the edges pointing to the node is larger than the target number;
sequencing all edges pointing to the node according to the descending order of the corresponding weight parameters, and deleting the rest edges except the edge of the previous K bits, wherein K is the target number;
and taking the neural network subjected to deletion processing as the trained neural network.
The description of the processing flow of each module in the apparatus and the interaction flow between the modules may refer to the relevant description in the above method embodiments, and will not be described in detail here.
Based on the same inventive concept, a video identification device corresponding to the video identification method is further provided in the embodiments of the present disclosure, and as shown in fig. 11, an architecture diagram of a video identification device provided in the embodiments of the present disclosure is shown, where the device includes: the obtaining module 1101, the first determining module 1102 and the second determining module 1103 specifically:
an obtaining module 1101, configured to obtain a video to be identified;
a first determining module 1102, configured to input the video to be recognized into a neural network obtained by training based on the training method of the neural network described in the foregoing embodiment, and determine occurrence probabilities of multiple events corresponding to the video to be recognized;
a second determining module 1103, configured to use an event whose corresponding occurrence probability meets a preset condition as an event occurring in the video to be identified.
Based on the same technical concept, the embodiment of the application also provides computer equipment. Referring to fig. 12, a schematic structural diagram of a computer device provided in the embodiment of the present application includes a processor 1201, a memory 1202, and a bus 1203. The storage 1202 is used for storing execution instructions, and includes a memory 12021 and an external storage 12022; the memory 12021 is also called an internal memory, and is used for temporarily storing operation data in the processor 1201 and data exchanged with an external storage 12022 such as a hard disk, and the processor 1201 exchanges data with the external storage 12022 through the memory 12021, and when the computer apparatus 1200 operates, the processor 1201 and the storage 1202 communicate with each other through the bus 1203, so that the processor 1201 executes the following instructions:
acquiring a sample video, and constructing a neural network comprising a plurality of directed acyclic graphs; the multiple directed acyclic graphs comprise at least one directed acyclic graph used for extracting time characteristics and at least one directed acyclic graph used for extracting space characteristics; each edge of the directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter;
training the constructed neural network based on the sample videos and event labels corresponding to the sample videos to obtain trained weight parameters;
and selecting a target operation method for each edge of the directed acyclic graphs based on the trained weight parameters to obtain a trained neural network.
Embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the training method for a neural network described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the neural network training method provided in the embodiments of the present disclosure includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute steps of the neural network training method described in the above method embodiments, which may be referred to in the above method embodiments specifically, and are not described herein again.
Based on the same technical concept, the embodiment of the application also provides computer equipment. Referring to fig. 13, a schematic structural diagram of a computer device 1300 provided in the embodiment of the present application includes a processor 1301, a memory 1302, and a bus 1303. The storage 1302 is used for storing execution instructions and includes a memory 13021 and an external storage 13022; the memory 13021 is also referred to as an internal memory, and is used for temporarily storing operation data in the processor 1301 and data exchanged with an external storage 13022 such as a hard disk, the processor 1301 exchanges data with the external storage 13022 through the memory 13021, and when the computer device 1300 runs, the processor 1301 and the storage 1302 communicate through the bus 1303, so that the processor 1301 executes the following instructions:
acquiring a video to be identified;
inputting the video to be recognized into a neural network obtained by training based on the neural network training method described in the above embodiment, and determining occurrence probabilities of various events corresponding to the video to be recognized;
and taking the event of which the corresponding occurrence probability meets the preset condition as the event occurring in the video to be identified.
The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the video identification method in the above-mentioned method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the video identification method provided in the embodiments of the present disclosure includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the steps of the video identification method described in the above method embodiments, which may be referred to specifically for the above method embodiments, and are not described herein again.
The embodiments of the present disclosure also provide a computer program, which when executed by a processor implements any one of the methods of the foregoing embodiments. The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: those skilled in the art can still make modifications or changes to the embodiments described in the foregoing embodiments, or make equivalent substitutions for some of the technical features, within the technical scope of the disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A method of training a neural network, comprising:
acquiring a sample video, and constructing a neural network comprising a plurality of directed acyclic graphs; the multiple directed acyclic graphs comprise at least one directed acyclic graph used for extracting time features and at least one directed acyclic graph used for extracting space features; each edge of the directed acyclic graph corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters;
selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters to obtain a trained neural network;
wherein the constructing a neural network comprising a plurality of directed acyclic graphs comprises:
taking the feature graph output by the (N-1)th directed acyclic graph as a feature graph of one input node of an (N+1)th directed acyclic graph, and taking the feature graph output by the Nth directed acyclic graph as a feature graph of the other input node of the (N+1)th directed acyclic graph; N is an integer greater than 1; the characteristic graph corresponding to a target input node in a first directed acyclic graph of the neural network is a characteristic graph obtained after characteristic extraction is carried out on a sampling video frame of a sample video, and another input node except the target input node is empty; and the feature graph of one input node in the second directed acyclic graph of the neural network is the feature graph output by the first directed acyclic graph, and the other input node is empty.
2. The method of claim 1, wherein the feature graph of the directed acyclic graph output is determined according to the following method:
and connecting feature graphs corresponding to other nodes except the input node in the directed acyclic graph in series, and taking the feature graph after the connection in series as the feature graph output by the directed acyclic graph.
3. The method according to claim 1 or 2, wherein there are a plurality of first operation methods corresponding to each edge in the directed acyclic graph for extracting the temporal feature, and there are a plurality of second operation methods corresponding to each edge in the directed acyclic graph for extracting the spatial feature; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method different from the second operation method.
4. The method according to any one of claims 1 to 3, wherein the neural network further comprises a sampling layer connected to the first directed acyclic graph, the sampling layer is configured to sample a sample video to obtain a sampled video frame, perform feature extraction on the sampled video frame to obtain a feature map corresponding to the sampled video frame, and input the feature map corresponding to the sampled video frame into a target input node of the first directed acyclic graph;
the neural network further comprises a fully connected layer connected with the last directed acyclic graph; the full connection layer is used for calculating the occurrence probability of various events corresponding to the sample video based on the feature graph output by the last directed acyclic graph;
the training of the constructed neural network based on the sample videos and the event labels corresponding to each sample video to obtain trained weight parameters comprises the following steps:
and training the constructed neural network based on the occurrence probability of various events corresponding to the sample videos calculated by the full connection layer and the event label corresponding to each sample video to obtain the trained weight parameters.
5. The method according to any one of claims 1 to 4, wherein the feature map corresponding to each node except the input node in the directed acyclic graph is obtained according to the following method:
and generating the characteristic graph corresponding to the node according to the characteristic graph corresponding to each superior node pointing to the node and the weight parameters of the operation method corresponding to the edges between the node and each superior node pointing to the node.
6. The method according to claim 5, wherein the generating the feature map corresponding to the node according to the feature map corresponding to each previous node pointing to the node and the weight parameters of the operation method corresponding to the edge between the node and each previous node pointing to the node comprises:
processing the feature graph of the previous-level node based on each operation method corresponding to the edge aiming at the edge between the node and each previous-level node pointing to the node to obtain a first intermediate feature graph corresponding to each operation method;
carrying out weighted summation on the first intermediate characteristic graphs respectively corresponding to the operation methods according to corresponding weight parameters to obtain second intermediate characteristic graphs corresponding to the edges;
and summing the second intermediate characteristic graphs respectively corresponding to the plurality of edges between the node and each upper-level node pointing to the node to obtain the characteristic graph corresponding to the node.
7. The method according to any one of claims 1 to 6, wherein the selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter comprises:
and aiming at each edge of the directed acyclic graph, taking the operation method with the maximum weight parameter corresponding to the edge as the target operation method corresponding to the edge.
8. The method of claim 7, wherein selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameter to obtain a trained neural network comprises:
for each node, determining a weight parameter of the target operation method corresponding to each edge pointing to the node under the condition that the number of the edges pointing to the node is larger than the target number;
sequencing all edges pointing to the node according to the descending order of the corresponding weight parameters, and deleting the rest edges except the edge of the previous K bits, wherein K is the target number;
and taking the neural network subjected to deletion processing as the trained neural network.
9. A method for video recognition, comprising:
acquiring a video to be identified;
inputting the video to be recognized into a neural network obtained by training based on the neural network training method according to any one of claims 1 to 8, and determining the occurrence probability of a plurality of events corresponding to the video to be recognized;
and taking the event of which the corresponding occurrence probability meets the preset condition as the event occurring in the video to be identified.
10. An apparatus for training a neural network, comprising:
the system comprises a construction module, a data processing module and a data processing module, wherein the construction module is used for obtaining a sample video and constructing a neural network comprising a plurality of directed acyclic graphs; the multiple directed acyclic graphs comprise at least one directed acyclic graph used for extracting time characteristics and at least one directed acyclic graph used for extracting space characteristics; each edge of the directed acyclic graph corresponds to a plurality of operation methods respectively, and each operation method has a corresponding weight parameter; the directed acyclic graph includes two input nodes; each node of the neural network corresponds to a feature map;
the training module is used for training the constructed neural network based on the sample videos and the event labels corresponding to the sample videos to obtain trained weight parameters;
a selection module, configured to select a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameter, so as to obtain a trained neural network;
wherein the construction module, when constructing a neural network comprising a plurality of directed acyclic graphs, is configured to:
taking the feature graph output by the (N-1)th directed acyclic graph as a feature graph of one input node of the (N+1)th directed acyclic graph, and taking the feature graph output by the Nth directed acyclic graph as a feature graph of another input node of the (N+1)th directed acyclic graph; N is an integer greater than 1; the characteristic graph corresponding to a target input node in a first directed acyclic graph of the neural network is a characteristic graph obtained after characteristic extraction is carried out on a sampling video frame of a sample video, and another input node except the target input node is empty; and the feature graph of one input node in the second directed acyclic graph of the neural network is the feature graph output by the first directed acyclic graph, and the other input node is empty.
11. A video recognition apparatus, comprising:
the acquisition module is used for acquiring a video to be identified;
a first determining module, configured to input the video to be recognized into a neural network trained based on the training method of the neural network according to any one of claims 1 to 8, and determine occurrence probabilities of multiple events corresponding to the video to be recognized;
and the second determining module is used for taking the event of which the corresponding occurrence probability meets the preset condition as the event occurring in the video to be identified.
12. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when a computer device is run, the machine-readable instructions when executed by the processor performing the steps of the method of training a neural network as claimed in any one of claims 1 to 8, or performing the steps of the method of video recognition as claimed in claim 9.
13. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the method for training a neural network according to any one of claims 1 to 8, or performs the steps of the method for video recognition according to claim 9.