CN112330711A - Model generation method, information extraction method and device and electronic equipment - Google Patents

Model generation method, information extraction method and device and electronic equipment

Info

Publication number
CN112330711A
Authority
CN
China
Prior art keywords
feature map
motion information
information extraction
appearance
extraction module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011357509.3A
Other languages
Chinese (zh)
Other versions
CN112330711B (en)
Inventor
刘倩
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202011357509.3A priority Critical patent/CN112330711B/en
Publication of CN112330711A publication Critical patent/CN112330711A/en
Application granted granted Critical
Publication of CN112330711B publication Critical patent/CN112330711B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a model generation method, an information extraction method and apparatus, and an electronic device, and belongs to the field of computer technology. In the method, a sample video frame is extracted from a sample video; appearance information of the sample video frame is extracted by an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and motion information of the sample video frame is obtained by a motion information extraction module, which extracts a gradient feature map and a difference feature map. Model training is then performed based on the appearance information and the motion information to obtain an information extraction model. Because the motion information and the appearance information can be acquired without pre-computing an optical flow map of the sample video frame as input and without using an additional 2D convolutional neural network, the cost can be reduced to a certain extent.

Description

Model generation method, information extraction method and device and electronic equipment
Technical Field
The present invention relates to the field of computer technology, and in particular to a model generation method, an information extraction method and apparatus, and an electronic device.
Background
With the rapid development of Internet technology, video has become one of the main carriers of content creation and dissemination on social media platforms. In order to better process a video, for example to perform video behavior recognition, it is generally necessary to extract both the appearance information and the motion information of the video. However, an ordinary 2D convolutional neural network can only extract the appearance information of the video frames in a video. Therefore, how to comprehensively extract both the appearance information and the motion information of a video has become a problem that urgently needs to be solved.
In the prior art, an optical flow map of the video is typically extracted first; the video frames are then used as the input of one 2D convolutional neural network, which extracts the appearance information, while the optical flow map is used as the input of another 2D convolutional neural network, which extracts the motion information. In this two-stream extraction scheme, two 2D convolutional neural networks are needed, and the optical flow map of the video must additionally be extracted. Since optical flow extraction consumes a large amount of computation, the overall computation amount and cost are high.
Disclosure of Invention
Embodiments of the present invention aim to provide a model generation method, an information extraction method and apparatus, and an electronic device, so as to solve the problem of the high cost of extracting appearance information and motion information. The specific technical solutions are as follows:
in a first aspect of the present invention, there is provided a method for generating a model, the method including:
extracting a sample video frame from a sample video;
extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and performing model training based on the appearance information and the motion information to obtain an information extraction model.
In a second aspect of the present invention, there is provided an information extraction method, including:
extracting a video frame to be extracted from a video to be extracted;
inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model; wherein the information extraction model is generated according to the method of any one of the first aspect.
In a third aspect of the present invention, there is also provided a model generation apparatus, including:
the first extraction module is used for extracting a sample video frame from a sample video;
the second extraction module is used for extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and the training module is used for carrying out model training based on the appearance information and the motion information to obtain an information extraction model.
In a fourth aspect of the present invention, there is also provided an information extraction apparatus, including:
the extraction module is used for extracting a video frame to be extracted from a video to be extracted;
the input module is used for inputting the video frames to be extracted into an information extraction model, and extracting appearance information and motion information from the video frames to be extracted through the information extraction model; wherein the information extraction model is generated by the apparatus according to any of the third aspects.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
The model generation method provided by the embodiment of the invention extracts a sample video frame from a sample video, extracts appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and obtains motion information of the sample video frame through a motion information extraction module that extracts a gradient feature map and a difference feature map, wherein the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. A motion information extraction module is embedded in the 2D convolutional neural network, the sample video frame is directly used as input, and the embedded motion information extraction module obtains the motion information by combining the gradient feature map and the difference feature map. Therefore, there is no need to pre-compute an optical flow map of the sample video frame as input or to use an additional 2D convolutional neural network, and the motion information and the appearance information can still be acquired, so the cost can be reduced to a certain extent.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1-1 is a schematic diagram of an appearance-motion information 2D convolutional neural network according to an embodiment of the present invention;
FIGS. 1-2 are flowcharts illustrating steps of a method for generating a model according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating steps of another method for generating a model according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating steps of a method for extracting information according to an embodiment of the present invention;
FIG. 4 is a block diagram of a model generation apparatus provided by an embodiment of the present invention;
fig. 5 is a block diagram of an information extraction apparatus according to an embodiment of the present invention;
fig. 6 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
First, a specific application scenario related to the embodiment of the present invention is described. At present, video behavior recognition is widely applied. When performing video behavior recognition, the cost of extracting the appearance information and the motion information of the video may have an important influence on the whole processing process.
Further, in the prior art, when appearance information and motion information are extracted, one 2D convolutional neural network is used to extract the appearance information, and another 2D convolutional neural network is used to extract the motion information based on an optical flow map, which results in a high cost. In order to balance the cost consumed against the comprehensiveness of the extracted information, embodiments of the present invention provide a model generation method, an information extraction method and apparatus, and an electronic device. In the method, based on a preset appearance-motion information 2D convolutional neural network obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance, the appearance information is extracted through the appearance information extraction module in the preset appearance-motion information 2D convolutional neural network, and the motion information is obtained through the motion information extraction module in that network.
For example, in one implementation, fig. 1-1 is a schematic structural diagram of an appearance-motion information 2D convolutional neural network provided by an embodiment of the present invention. As shown in fig. 1-1, the input layer, the appearance information extraction module, and the output layer may be original structures of the original 2D convolutional neural network: the input layer may be used to receive an input image, and the output layer may be used to output information such as the final processing result. The appearance-motion information 2D convolutional neural network may be formed by embedding a motion information extraction module in the original 2D convolutional neural network. It should be noted that fig. 1-1 is only an exemplary illustration; in practical applications, the number of each layer or module is not limited to that shown in the figure, and the convolutional neural network may further include other layers, for example a feature map generation layer located after the input layer and before the appearance information extraction module and the motion information extraction module, a fully connected layer between the appearance information extraction module and the output layer, and a pooling layer, an activation function layer, and the like between the motion information extraction module and the output layer; this is not limited in the embodiments of the present invention.
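For illustration only, the following is a minimal PyTorch-style sketch of how such an appearance-motion information 2D network could be assembled from the modules described above. All module names, layer sizes and parameters (e.g. `feature_dim`, `num_classes`) are assumptions made for this example and are not specified by the patent.

```python
import torch
import torch.nn as nn

class AppearanceMotion2DNet(nn.Module):
    """Toy appearance-motion 2D CNN: a feature map generation stem after the input,
    a parallel appearance branch and motion branch, and an output layer."""

    def __init__(self, num_classes=10, feature_dim=64):
        super().__init__()
        # feature map generation layer(s) after the input layer
        self.stem = nn.Sequential(
            nn.Conv2d(3, feature_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feature_dim),
            nn.ReLU(inplace=True),
        )
        # appearance information extraction module (ordinary 2D convolutions)
        self.appearance = nn.Sequential(
            nn.Conv2d(feature_dim, feature_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feature_dim),
            nn.ReLU(inplace=True),
        )
        # motion information extraction module (stand-in: any module mapping the
        # shared feature map to a motion feature map of the same shape)
        self.motion = nn.Sequential(
            nn.Conv2d(feature_dim, feature_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feature_dim),
            nn.ReLU(inplace=True),
        )
        # output layer
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feature_dim, num_classes)
        )

    def forward(self, frames):                            # frames: (N, 3, H, W)
        feat = self.stem(frames)
        fused = self.appearance(feat) + self.motion(feat) # element-wise fusion of both branches
        return self.head(fused)

logits = AppearanceMotion2DNet()(torch.randn(8, 3, 112, 112))
print(logits.shape)   # torch.Size([8, 10])
```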
In the embodiment of the present invention, a motion information extraction module is embedded in the 2D convolutional neural network, the sample video frame is directly used as input, and the embedded motion information extraction module acquires the motion information by combining the gradient feature map and the difference feature map. Therefore, there is no need to pre-compute an optical flow map of the sample video frame as input or to use an additional 2D convolutional neural network, and the motion information and the appearance information can still be acquired, so the cost can be reduced to a certain extent.
Fig. 1-2 is a flowchart illustrating steps of a model generation method according to an embodiment of the present invention, and as shown in fig. 1-2, the method may include:
step 101, extracting a sample video frame from a sample video.
In this embodiment of the present invention, the sample video may be obtained by receiving a video manually input by a user, or may be directly obtained from a network, which is not limited in this embodiment of the present invention.
Further, the specific number of sample video frames may be selected according to actual requirements, and the sample video frames may be all video frames contained in the sample video or only some specific video frames contained in the sample video; the embodiment of the present invention does not limit this. For example, the sample video may be divided into a plurality of video segments: 64 consecutive frames may be randomly extracted from the sample video, and 8 frames may then be sampled from these 64 frames at equal intervals.
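As a concrete illustration of this sampling scheme, the snippet below randomly picks a 64-frame clip and then takes 8 frame indices from it at equal intervals. The helper name and the use of frame indices (rather than decoded frames) are assumptions made for this example.

```python
import random

def sample_frame_indices(num_frames, clip_len=64, num_samples=8):
    """Randomly choose a clip of clip_len consecutive frames, then sample
    num_samples frame indices from it at equal intervals."""
    start = random.randint(0, max(0, num_frames - clip_len))
    step = clip_len // num_samples
    return [min(start + i * step, num_frames - 1) for i in range(num_samples)]

print(sample_frame_indices(300))  # e.g. [137, 145, 153, 161, 169, 177, 185, 193]
```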
102, extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance.
In this embodiment of the present invention, the operation of extracting the appearance information and the operation of acquiring the motion information may be performed simultaneously, or the operation of extracting the appearance information may be performed first, or the operation of acquiring the motion information may be performed first, which is not limited in this embodiment of the present invention.
Further, the motion information extraction module may be a module that is designed in advance and is capable of extracting the gradient feature map and the difference feature map from an input feature map and further extracting the motion information from them. The input feature map may be obtained by processing the sample video frame with the layers that precede the motion information extraction module; for example, a feature map generation layer generates the input feature map from the sample video frame, and this feature map generation layer may be a layer originally used in the ordinary 2D convolutional neural network to generate the input feature map. By embedding at least one motion information extraction module in the 2D convolutional neural network in advance, the appearance-motion information 2D convolutional neural network is generated, so that the sample video frame is processed directly by the appearance-motion information 2D convolutional neural network, and both the appearance information and the motion information can be obtained without inputting an optical flow map of the sample video frame or using an additional 2D convolutional neural network.
Wherein the 2D convolutional neural network may be a residual convolutional neural network, e.g., a 2D ResNet50 convolutional neural network. The specific number and the specific position of the embedded motion information extraction modules may be determined according to actual requirements, which is not limited in the embodiment of the present invention.
Further, when the appearance information of the sample video frame is extracted through an original appearance information extraction module in the preset appearance-motion information 2D convolutional neural network, the convolution operation is performed on the sample video frame by using a convolution kernel set in the original appearance information extraction module, so that the appearance information is extracted.
Further, with the sample video frame as input, the internal processing of the preset appearance-motion information 2D convolutional neural network extracts abstract feature maps of the input at different levels, from which the pre-embedded motion information extraction module can then extract the motion information.
And 103, performing model training based on the appearance information and the motion information to obtain an information extraction model.
In the embodiment of the present invention, during model training, a classification model may be used to perform classification according to the appearance information and the motion information to obtain a predicted category. Specifically, the motion behavior of the subject contained in the video frame may be determined according to the spatial-dimension features and the temporal-dimension features, and behavior classification may be performed on this basis. The parameters in the preset appearance-motion information 2D convolutional neural network are then adjusted according to the degree of deviation between the predicted category and the true category, and after the adjustment is completed, training continues by repeating the above steps until the condition for stopping training is met. Therefore, by training the preset appearance-motion information 2D convolutional neural network, the accuracy of the information extracted by the network can be improved, ensuring that the resulting information extraction model can accurately extract the appearance information and the motion information at a low cost.
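A hedged sketch of such a training procedure is given below: behavior classes are predicted from the network output, the deviation between prediction and true label is measured with a cross-entropy loss, and back-propagation adjusts the network parameters. The optimizer, learning rate and the fixed-epoch stopping condition are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

def train(model, data_loader, num_epochs=10, lr=1e-3, device="cpu"):
    """Toy training loop: predict a behavior class for each sampled frame batch and
    adjust the network parameters from the deviation between prediction and label."""
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()            # deviation between predicted and true category
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(num_epochs):              # stop condition here: fixed number of epochs
        for frames, labels in data_loader:       # frames: (N, 3, H, W), labels: (N,)
            frames, labels = frames.to(device), labels.to(device)
            logits = model(frames)               # classification from appearance + motion features
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                      # adjust parameters of the whole network
            optimizer.step()
    return model
```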
In summary, in the model generation method provided by the embodiment of the present invention, a sample video frame is extracted from a sample video, appearance information of the sample video frame is extracted through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and motion information of the sample video frame is obtained through a motion information extraction module that extracts a gradient feature map and a difference feature map, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. A motion information extraction module is embedded in the 2D convolutional neural network, the sample video frame is directly used as input, and the embedded motion information extraction module obtains the motion information by combining the gradient feature map and the difference feature map. Therefore, there is no need to pre-compute an optical flow map of the sample video frame as input or to use an additional 2D convolutional neural network, and the cost can be reduced to a certain extent.
Fig. 2 is a flowchart of steps of another model generation method provided in an embodiment of the present invention, and as shown in fig. 2, the method may include:
step 201, extracting a sample video frame from a sample video.
Specifically, for this step reference may be made to step 101, which is not repeated here in the embodiment of the present invention.
Step 202, extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network.
Step 203, extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance.
In this step, the motion information extraction module may be embedded in the original appearance information extraction module, and each original appearance information extraction module in the preset appearance-motion information 2D convolutional neural network may be connected to one motion information extraction module. Connecting a motion information extraction module to every original appearance information extraction module improves, to a greater extent, the motion information extraction capability of the preset appearance-motion information 2D convolutional neural network obtained after the embedding. The appearance information extraction module may be a convolution module, and the appearance information extraction module and the motion information extraction module in the appearance-motion information 2D convolutional neural network are parallel. For example, assuming that the selected 2D convolutional neural network is 2D ResNet50, the motion information extraction module may be embedded as a branch in each convolution module (bottleneck) of the 2D ResNet50, the convolution module being the appearance information extraction module. Accordingly, each convolution module after embedding forms an appearance-motion (AM) convolution module, and the 2D convolutional neural network after embedding forms the preset appearance-motion information 2D convolutional neural network.
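The following sketch shows one possible form of such an appearance-motion (AM) convolution module, with a motion branch placed in parallel with a ResNet-style bottleneck. The exact placement of the branch inside 2D ResNet50 is not fixed here, and the simplified motion branch is only a placeholder for the motion information extraction module described below; all layer sizes are assumptions for the example.

```python
import torch
import torch.nn as nn

class AMBottleneck(nn.Module):
    """Bottleneck (appearance branch) with a parallel motion branch, roughly in the
    spirit of embedding a motion module into each 2D ResNet50 bottleneck."""

    def __init__(self, channels, mid_channels):
        super().__init__()
        # appearance branch: standard 1x1 -> 3x3 -> 1x1 bottleneck
        self.appearance = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1), nn.BatchNorm2d(channels),
        )
        # motion branch: placeholder for the motion information extraction module
        self.motion = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.appearance(x) + self.motion(x) + x   # residual connection plus both branches
        return self.relu(out)

y = AMBottleneck(256, 64)(torch.randn(2, 256, 28, 28))
print(y.shape)  # torch.Size([2, 256, 28, 28])
```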
Specifically, the present step may include the following substeps:
substep (1): adjusting the number of channels of the input feature map to P; p < Q, which is the original number of channels of the input feature map.
In this step, the input feature map refers to the abstract feature map input to the motion information extraction module. The input feature map may be obtained by processing the sample video frame with the layers of the 2D convolutional neural network that precede the motion information extraction module. These preceding layers may be a plurality of convolutional layers, and through their processing, abstract feature maps of different levels can be extracted; the abstract feature map input to the motion information extraction module is the input feature map. The number of channels of the abstract feature map output by each layer may be determined by the number of dimensions set in that layer, that is, the specific value of Q may be determined by the number of dimensions set in the layer preceding the motion information extraction module. For example, the number of channels of the abstract feature map output by each layer may be equal to the number of dimensions set in the layer; assuming that the number of dimensions is set to 2048, Q may be 2048.
Further, the specific value of P may be set according to practical situations as long as P is ensured to be smaller than the original channel number Q. When adjusting the number of channels, the input feature map may be convolved by using a 1 × 1 convolution kernel to compress the number of channels of the input feature map, which is not limited in the embodiment of the present invention. In the embodiment of the invention, the number of the channels of the input characteristic diagram is adjusted to be P, namely, the number of the channels is reduced, so that the data volume needing to be processed in the subsequent steps can be reduced to a certain extent, and the required consumed calculation amount is further reduced.
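A minimal illustration of this channel compression, assuming a 1 × 1 convolution is used; the concrete values of Q and P are examples only.

```python
import torch
import torch.nn as nn

Q, P = 2048, 256                       # original and reduced channel numbers (example values)
x = torch.randn(1, Q, 7, 7)            # input feature map with Q channels
reduce = nn.Conv2d(Q, P, kernel_size=1)  # 1x1 convolution compresses the channel number
print(reduce(x).shape)                 # torch.Size([1, 256, 7, 7])
```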
Substep (2): and extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map.
In this step, the preset gradient direction may be set according to actual requirements; for example, the preset gradient directions may be the horizontal direction and the vertical direction. When the gradient feature map of the input feature map in the preset gradient direction is calculated, a preset gradient operator may be used to convolve the input feature map in the preset direction. The preset gradient operator may be a Sobel operator: the input feature map is convolved with the Sobel operator in the horizontal direction, i.e. the X-axis direction, to obtain a first convolution result Gx, and in the vertical direction, i.e. the Y-axis direction, to obtain a second convolution result Gy.
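As an illustration of this substep, the snippet below applies 3 × 3 Sobel kernels per channel (depthwise) to obtain Gx and Gy. Treating these two results directly as the gradient feature maps, rather than combining them further, is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def sobel_gradients(x):
    """Per-channel (depthwise) Sobel convolution of a feature map x of shape (N, C, H, W)."""
    c = x.shape[1]
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t().contiguous()
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)   # one horizontal kernel per channel
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)   # one vertical kernel per channel
    gx = F.conv2d(x, kx, padding=1, groups=c)     # gradient map in the X-axis direction (Gx)
    gy = F.conv2d(x, ky, padding=1, groups=c)     # gradient map in the Y-axis direction (Gy)
    return gx, gy

gx, gy = sobel_gradients(torch.randn(2, 16, 28, 28))
print(gx.shape, gy.shape)   # torch.Size([2, 16, 28, 28]) twice
```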
For example, the difference feature map may be extracted using neighboring feature maps in the time dimension. Specifically, because the motion regions corresponding to moving objects in the scene appear at different positions in different video frames, the difference values between the feature maps of neighboring frames can be calculated, and the difference feature map can then be generated from these difference values.
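A minimal sketch of building a difference feature map from the feature maps of neighboring frames along the time dimension; padding the last frame by repeating its difference is an assumption of this example.

```python
import torch

def difference_feature_map(feats):
    """feats: (T, C, H, W) feature maps of T neighboring frames.
    Returns per-frame differences to the next frame, same shape as feats."""
    diff = feats[1:] - feats[:-1]                 # frame-to-frame differences
    diff = torch.cat([diff, diff[-1:]], dim=0)    # pad so every frame has a difference map
    return diff

d = difference_feature_map(torch.randn(8, 16, 28, 28))
print(d.shape)   # torch.Size([8, 16, 28, 28])
```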
Substep (3): and splicing the gradient characteristic diagram and the difference characteristic diagram to obtain a target characteristic diagram.
In this step, the gradient feature map and the difference feature map may be spliced along the channel dimension, that is, the features of the same channel are spliced: for example, the feature of the first channel in the gradient feature map is spliced with the feature of the first channel in the difference feature map, and the feature of the x-th channel in the gradient feature map is spliced with the feature of the x-th channel in the difference feature map. Finally, the feature map obtained after splicing may be used as the target feature map. Splicing the gradient feature map and the difference feature map into the target feature map allows richer feature information to be gathered in the target feature map.
Substep (4): and carrying out convolution operation on the target characteristic graph to obtain the motion information.
In this step, since the feature information in the gradient feature map and the difference feature map is collected in the target feature map, the motion information can be extracted by performing convolution operation on the target feature map.
In an actual application scene, by combining the two characteristic graphs, more accurate motion information can be obtained to a certain extent.
Substep (5): and restoring the channel number of the input feature map to Q.
In this step, the number of channels may be restored to Q by performing the inverse of the operation performed in substep (1), for example by another convolution with a 1 × 1 convolution kernel whose number of output dimensions is set to Q. In an actual application scenario, other operations may still need to be performed on the input feature map after its time-dimension features have been extracted; therefore, in the embodiment of the present invention, restoring the feature map to its original state after the motion information is extracted ensures that subsequent processing of the input feature map is not affected. The specific type of the other operations may be set according to actual requirements; for example, another operation may be outputting a feature map with Q channels.
Further, the motion information extraction module may include a number adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer, and a number recovery layer; the quantity adjusting layer is used for realizing the operation in the substep (1), and the feature map generating layer can be used for realizing the operation in the substep (2); the splice layer may be used to implement the operations in sub-step (3); the feature map convolutional layer may be used to implement the operations in the sub-step (4); the number recovery layer may be used to implement the operations in sub-step (5).
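Putting the five layers together, a possible sketch of the motion information extraction module is shown below (number adjustment, gradient and difference feature map generation, splicing, feature map convolution, number recovery). The choice to splice Gx, Gy and the difference map, the zero-padding of the last frame, and all layer sizes are assumptions of this example, since the patent does not fix them here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionInfoExtraction(nn.Module):
    """Sketch of a motion information extraction module with the five layers described above."""

    def __init__(self, q_channels=256, p_channels=64):
        super().__init__()
        self.reduce = nn.Conv2d(q_channels, p_channels, 1)        # number adjustment layer (Q -> P)
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", sobel_x.reshape(1, 1, 3, 3))
        self.register_buffer("ky", sobel_x.t().reshape(1, 1, 3, 3))
        self.fuse = nn.Conv2d(3 * p_channels, p_channels, 3, padding=1)  # feature map convolution layer
        self.restore = nn.Conv2d(p_channels, q_channels, 1)       # number recovery layer (P -> Q)

    def forward(self, x):                          # x: (T, Q, H, W), feature maps of T frames
        x = self.reduce(x)
        c = x.shape[1]
        gx = F.conv2d(x, self.kx.repeat(c, 1, 1, 1), padding=1, groups=c)   # gradient feature map (X)
        gy = F.conv2d(x, self.ky.repeat(c, 1, 1, 1), padding=1, groups=c)   # gradient feature map (Y)
        diff = torch.cat([x[1:] - x[:-1], torch.zeros_like(x[:1])], dim=0)  # difference feature map
        target = torch.cat([gx, gy, diff], dim=1)  # splicing layer: target feature map
        motion = self.fuse(target)                 # convolution over the target feature map
        return self.restore(motion)                # back to the original Q channels

out = MotionInfoExtraction()(torch.randn(8, 256, 28, 28))
print(out.shape)   # torch.Size([8, 256, 28, 28])
```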
Further, each convolution layer in the above steps may be followed by a Batch Normalization (BN) layer and/or a rectified linear unit (ReLU) layer. Accordingly, after a layer completes its processing, the result can be further processed by the connected BN layer and/or ReLU layer and then passed to the next layer. During network training, the parameters of the neural network are adjusted, and these parameter changes alter the data distribution seen by other layers. Since the essence of network learning is learning the data distribution, if each batch follows a different distribution the network has to adapt to a different distribution in every iteration, which greatly reduces the training speed of the network; the BN layers alleviate this problem.
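A small example of a convolution followed by batch normalization and ReLU, as could be attached to each convolution layer above; the layer sizes are illustrative only.

```python
import torch.nn as nn

conv_bn_relu = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),      # batch normalization (BN) layer
    nn.ReLU(inplace=True),   # rectified linear unit (ReLU) layer
)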
And step 204, fusing the appearance information and the motion information.
In this step, the elements of the feature map corresponding to the appearance information and the elements of the feature map corresponding to the motion information may be added, and the feature map obtained after all elements are added is used as the fused information. Alternatively, the two feature maps may be directly spliced and connected to realize the fusion. In the embodiment of the present invention, fusing the appearance information and the motion information allows the different types of information to complement each other, so that both types of information are used during subsequent training, which to a certain extent improves the training accuracy.
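Both fusion options mentioned above, element-wise addition and channel-wise splicing, are shown in this minimal sketch; the tensor shapes are example values.

```python
import torch

appearance = torch.randn(2, 64, 28, 28)   # feature map carrying appearance information
motion = torch.randn(2, 64, 28, 28)       # feature map carrying motion information

fused_add = appearance + motion                     # element-wise addition
fused_cat = torch.cat([appearance, motion], dim=1)  # splicing along the channel dimension
print(fused_add.shape, fused_cat.shape)  # (2, 64, 28, 28) (2, 128, 28, 28)
```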
And step 205, performing model training based on the fused information to obtain an information extraction model.
Specifically, the specific training process in this step may refer to step 103, which is not described herein again in this embodiment of the present invention.
In summary, in the model generation method provided in the embodiment of the present invention, a sample video frame is extracted from a sample video, appearance information of the sample video frame is extracted through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and motion information of the sample video frame is extracted through a motion information extraction module, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance. And finally, performing model training based on the appearance information and the motion information to obtain an information extraction model. In the embodiment of the invention, the motion information is extracted by embedding the motion information extraction module in the 2D convolutional neural network and directly taking the sample video frame as input through the embedded motion information extraction module. Therefore, the motion information and the appearance information can be acquired without calculating an optical flow diagram of a sample video frame as input in advance and additionally using a 2D convolutional neural network, and the cost can be reduced to a certain extent.
Fig. 3 is a flowchart of steps of an information extraction method according to an embodiment of the present invention, and as shown in fig. 3, the method may include:
step 301, extracting a video frame to be extracted from a video to be extracted.
In the embodiment of the present invention, the video to be extracted may be obtained by receiving a video manually input by a user, or may be directly obtained from a network, and the like. Further, the specific sampling manner may refer to the foregoing steps, which is not limited in this embodiment of the present invention.
Step 302, inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model; wherein the information extraction model is generated according to any one of the foregoing model generation method embodiments.
In the embodiment of the present invention, the information extraction model is generated according to the foregoing model generation method embodiments, that is, the information extraction model has the capability of accurately extracting the appearance information and the motion information at a low cost. Therefore, by inputting the video frame to be extracted into the information extraction model, the appearance information and the motion information can be extracted at a low cost.
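A hedged sketch of this extraction step at inference time is given below; `trained_model` stands for any information extraction model trained as described above and is a hypothetical name used only for illustration.

```python
import torch

@torch.no_grad()
def extract_info(model, frames):
    """Run a trained information extraction model on frames of shape (N, 3, H, W)
    and return its outputs in evaluation mode."""
    model.eval()
    return model(frames)

# Example usage with a model trained as in the earlier sketch:
# outputs = extract_info(trained_model, frames_to_extract)
```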
In summary, in the information extraction method provided by the embodiment of the present invention, a video frame to be extracted is extracted from a video to be extracted, the video frame to be extracted is then input into the information extraction model, and appearance information and motion information are extracted from the video frame to be extracted through the information extraction model. Because the information extraction model has the capability of accurately extracting the appearance information and the motion information at a low cost, the appearance information and the motion information can be extracted at a low cost by inputting the video frame to be extracted into the model.
Fig. 4 is a block diagram of a model generation apparatus according to an embodiment of the present invention, and as shown in fig. 4, the apparatus 40 may include:
a first extraction module 401 is configured to extract a sample video frame from a sample video.
A second extraction module 402, configured to extract appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extract a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance.
A training module 403, configured to perform model training based on the appearance information and the motion information to obtain an information extraction model.
Optionally, each appearance information extraction module in the preset appearance-motion information 2D convolutional neural network is connected to one motion information extraction module.
Optionally, the second extracting module 402 is specifically configured to:
extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map; the input feature map is obtained by processing the sample video frame by a layer before the motion information extraction module.
And splicing the gradient characteristic diagram and the difference characteristic diagram to obtain a target characteristic diagram.
And carrying out convolution operation on the target characteristic graph to obtain the motion information.
Optionally, the second extracting module 402 is further specifically configured to:
before extracting the gradient feature map and the difference feature map of the input feature map in a preset gradient direction, the apparatus further includes: adjusting the number of channels of the input feature map to P; the P is less than Q, and the Q is the original channel number of the input feature map;
after performing convolution operation on the target feature map, the apparatus further includes: and restoring the channel number of the input feature map to Q.
Optionally, the appearance information extraction module is a convolution module in a preset time-space dimension 2D convolution neural network; the motion information extraction module is connected with the convolution module;
the motion information extraction module comprises a quantity adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer and a quantity recovery layer;
the quantity adjusting layer is used for realizing the operation of adjusting the channel quantity of the input feature map to P; the feature map generation layer is used for realizing the operation of extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map; the splicing layer is used for realizing the operation of splicing the gradient feature map and the difference feature map; the feature map convolution layer is used for realizing the operation of performing convolution operation on the target feature map; the quantity recovery layer is used for realizing the operation of recovering the channel quantity of the input feature diagram to Q
Optionally, the training module 403 is specifically configured to:
and fusing the appearance information and the motion information.
And carrying out model training based on the fused information to obtain an information extraction model.
In summary, in the model generation apparatus provided by the embodiment of the present invention, a sample video frame is extracted from a sample video; appearance information of the sample video frame is then extracted through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and motion information of the sample video frame is obtained through a motion information extraction module that extracts a gradient feature map and a difference feature map, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. A motion information extraction module is embedded in the 2D convolutional neural network, the sample video frame is directly used as input, and the embedded motion information extraction module obtains the motion information by combining the gradient feature map and the difference feature map. Therefore, there is no need to pre-compute an optical flow map of the sample video frame as input or to use an additional 2D convolutional neural network, and the cost can be reduced to a certain extent.
Fig. 5 is a block diagram of an information extracting apparatus according to an embodiment of the present invention, and as shown in fig. 5, the apparatus 50 may include:
an extracting module 501, configured to extract a video frame to be extracted from a video to be extracted.
An input module 502, configured to input the video frame to be extracted into an information extraction model, and extract appearance information and motion information from the video frame to be extracted through the information extraction model.
Wherein the information extraction model is generated according to the model generation device.
In summary, in the information extraction apparatus provided by the embodiment of the present invention, a video frame to be extracted is extracted from a video to be extracted, the video frame to be extracted is then input into the information extraction model, and appearance information and motion information are extracted from the video frame to be extracted through the information extraction model. Because the information extraction model has the capability of accurately extracting the appearance information and the motion information at a low cost, the appearance information and the motion information can be extracted at a low cost by inputting the video frame to be extracted into the model.
For the above device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant points, refer to the partial description of the method embodiment.
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601 is configured to implement the following steps when executing the program stored in the memory 603:
extracting a sample video frame from a sample video;
extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and performing model training based on the appearance information and the motion information to obtain an information extraction model.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to perform the model generation method or the information extraction method described in any of the above embodiments.
In yet another embodiment, the present invention further provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the model generation method or the information extraction method described in any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (16)

1. A method of model generation, the method comprising:
extracting a sample video frame from a sample video;
extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and performing model training based on the appearance information and the motion information to obtain an information extraction model.
2. The method according to claim 1, wherein one motion information extraction module is connected to each appearance information extraction module in the preset appearance-motion information 2D convolutional neural network.
3. The method of claim 1, wherein the extracting motion information of the sample video frame by a pre-embedded motion information extraction module comprises:
extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map; the input feature map is obtained by processing the sample video frame by a layer before the motion information extraction module;
splicing the gradient characteristic diagram and the difference characteristic diagram to obtain a target characteristic diagram;
and carrying out convolution operation on the target characteristic graph to obtain the motion information.
4. The method of claim 3,
before extracting the gradient feature map and the difference feature map of the input feature map in a preset gradient direction, the method further includes: adjusting the number of channels of the input feature map to P; the P is less than Q, and the Q is the original channel number of the input feature map;
after performing a convolution operation on the target feature map, the method further includes: and restoring the channel number of the input feature map to Q.
5. The method according to claim 4, wherein the appearance information extraction module is a convolution module in a preset time-space dimension 2D convolution neural network; the motion information extraction module is connected with the convolution module;
the motion information extraction module comprises a quantity adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer and a quantity recovery layer;
the quantity adjusting layer is used for realizing the operation of adjusting the channel quantity of the input feature map to P; the feature map generation layer is used for realizing the operation of extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map; the splicing layer is used for realizing the operation of splicing the gradient feature map and the difference feature map; the feature map convolution layer is used for realizing the operation of performing convolution operation on the target feature map; the quantity recovery layer is used for realizing the operation of recovering the channel quantity of the input feature diagram to Q.
6. The method according to any one of claims 1 to 5, wherein the performing model training based on the appearance information and the motion information to obtain an information extraction model comprises:
fusing the appearance information and the motion information;
and carrying out model training based on the fused information to obtain an information extraction model.
7. An information extraction method, characterized in that the method comprises:
extracting a video frame to be extracted from a video to be extracted;
inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model;
wherein the information extraction model is generated according to the method of any one of claims 1 to 6.
8. An apparatus for model generation, the apparatus comprising:
the first extraction module is used for extracting a sample video frame from a sample video;
the second extraction module is used for extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and the training module is used for carrying out model training based on the appearance information and the motion information to obtain an information extraction model.
9. The apparatus of claim 8, wherein one motion information extraction module is connected to each appearance information extraction module in the preset appearance-motion information 2D convolutional neural network.
10. The apparatus of claim 8, wherein the second extraction module is specifically configured to:
extract a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map, wherein the input feature map is obtained by processing the sample video frame by a layer preceding the motion information extraction module;
splice the gradient feature map and the difference feature map to obtain a target feature map;
and perform a convolution operation on the target feature map to obtain the motion information.
11. The apparatus of claim 10, wherein the second extraction module is further configured to:
before extracting the gradient feature map and the difference feature map of the input feature map in the preset gradient direction, adjust the number of channels of the input feature map to P, wherein P is less than Q and Q is the original number of channels of the input feature map;
and after performing the convolution operation on the target feature map, restore the number of channels of the input feature map to Q.
12. The apparatus of claim 11, wherein the appearance information extraction module is a convolution module in a preset spatio-temporal 2D convolutional neural network, and the motion information extraction module is connected to the convolution module;
the motion information extraction module comprises a quantity adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer and a quantity recovery layer;
the quantity adjustment layer is configured to adjust the number of channels of the input feature map to P; the feature map generation layer is configured to extract, according to the input feature map, a gradient feature map and a difference feature map of the input feature map in a preset gradient direction; the splicing layer is configured to splice the gradient feature map and the difference feature map to obtain the target feature map; the feature map convolution layer is configured to perform a convolution operation on the target feature map; and the quantity recovery layer is configured to restore the number of channels of the input feature map to Q.
13. The apparatus according to any one of claims 8 to 12, wherein the training module is specifically configured to:
fuse the appearance information and the motion information;
and perform model training based on the fused information to obtain the information extraction model.
14. An information extraction apparatus, characterized in that the apparatus comprises:
the extraction module is used for extracting a video frame to be extracted from a video to be extracted;
the input module is used for inputting the video frames to be extracted into an information extraction model, and extracting appearance information and motion information from the video frames to be extracted through the information extraction model;
wherein the information extraction model is generated by the apparatus of any one of claims 8 to 12.
15. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
and the processor is configured to implement the method of any one of claims 1 to 7 when executing the computer program stored in the memory.
16. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202011357509.3A 2020-11-26 2020-11-26 Model generation method, information extraction device and electronic equipment Active CN112330711B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011357509.3A CN112330711B (en) 2020-11-26 2020-11-26 Model generation method, information extraction device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112330711A true CN112330711A (en) 2021-02-05
CN112330711B CN112330711B (en) 2023-12-05

Family

ID=74308668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011357509.3A Active CN112330711B (en) 2020-11-26 2020-11-26 Model generation method, information extraction device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112330711B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
US20170262705A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Recurrent networks with motion-based attention for video understanding
CA3077830A1 (en) * 2016-12-05 2018-06-05 Avigilon Corporation System and method for appearance search
WO2020010979A1 (en) * 2018-07-10 2020-01-16 腾讯科技(深圳)有限公司 Method and apparatus for training model for recognizing key points of hand, and method and apparatus for recognizing key points of hand
WO2020019926A1 (en) * 2018-07-27 2020-01-30 腾讯科技(深圳)有限公司 Feature extraction model training method and apparatus, computer device, and computer readable storage medium
CN109447246A (en) * 2018-10-30 2019-03-08 北京字节跳动网络技术有限公司 Method and apparatus for generating model
WO2020098158A1 (en) * 2018-11-14 2020-05-22 平安科技(深圳)有限公司 Pedestrian re-recognition method and apparatus, and computer readable storage medium
WO2020108483A1 (en) * 2018-11-28 2020-06-04 腾讯科技(深圳)有限公司 Model training method, machine translation method, computer device and storage medium
CN109919087A (en) * 2019-03-06 2019-06-21 腾讯科技(深圳)有限公司 Video classification method, model training method and apparatus
WO2020177722A1 (en) * 2019-03-06 2020-09-10 腾讯科技(深圳)有限公司 Method for video classification, method and device for model training, and storage medium
WO2020177582A1 (en) * 2019-03-06 2020-09-10 腾讯科技(深圳)有限公司 Video synthesis method, model training method, device and storage medium
CN110070067A (en) * 2019-04-29 2019-07-30 北京金山云网络技术有限公司 The training method of video classification methods and its model, device and electronic equipment
CN110324664A (en) * 2019-07-11 2019-10-11 南开大学 Neural network-based video frame interpolation method and training method of its model
CN111259786A (en) * 2020-01-14 2020-06-09 浙江大学 Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIAN LIU ET AL: "CTM: Collaborative Temporal Modelling for Action Recognition", arXiv, pages 1-7 *

Similar Documents

Publication Publication Date Title
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
CN110909663B (en) Human body key point identification method and device and electronic equipment
CN111428805B (en) Method for detecting salient object, model, storage medium and electronic device
CN112418345B (en) Method and device for quickly identifying small targets with fine granularity
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
CN110544214A (en) Image restoration method and device and electronic equipment
CN112381763A (en) Surface defect detection method
CN114529574B (en) Image matting method and device based on image segmentation, computer equipment and medium
CN114943307B (en) Model training method and device, storage medium and electronic equipment
CN110119736B (en) License plate position identification method and device and electronic equipment
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN112765402A (en) Sensitive information identification method, device, equipment and storage medium
CN112580581A (en) Target detection method and device and electronic equipment
CN114638304A (en) Training method of image recognition model, image recognition method and device
CN115035347A (en) Picture identification method and device and electronic equipment
CN115601629A (en) Model training method, image recognition method, medium, device and computing equipment
CN110060264B (en) Neural network training method, video frame processing method, device and system
CN115131695A (en) Training method of video detection model, video detection method and device
CN114255493A (en) Image detection method, face detection device, face detection equipment and storage medium
CN112330711B (en) Model generation method, information extraction device and electronic equipment
CN116258873A (en) Position information determining method, training method and device of object recognition model
CN112329925B (en) Model generation method, feature extraction method, device and electronic equipment
CN113627460A (en) Target identification system and method based on time slice convolutional neural network
CN110909798A (en) Multi-algorithm intelligent studying and judging method, system and server
CN116758295B (en) Key point detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant