CN112330711A - Model generation method, information extraction method and device and electronic equipment - Google Patents

Model generation method, information extraction method and device and electronic equipment

Info

Publication number
CN112330711A
Authority
CN
China
Prior art keywords
feature map
motion information
information extraction
appearance
extraction module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011357509.3A
Other languages
Chinese (zh)
Other versions
CN112330711B (en)
Inventor
刘倩
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202011357509.3A priority Critical patent/CN112330711B/en
Publication of CN112330711A publication Critical patent/CN112330711A/en
Application granted granted Critical
Publication of CN112330711B publication Critical patent/CN112330711B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a model generation method, an information extraction method and apparatus, and an electronic device, and belongs to the field of computer technology. In the method, a sample video frame is extracted from a sample video; appearance information of the sample video frame is extracted by an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and motion information of the sample video frame is obtained by a motion information extraction module, which extracts a gradient feature map and a difference feature map. Model training is then performed based on the appearance information and the motion information to obtain an information extraction model. Because the motion information and the appearance information can be acquired without pre-computing an optical flow map of the sample video frame as input and without using an additional 2D convolutional neural network, the cost can be reduced to a certain extent.

Description

Model generation method, information extraction method and device and electronic equipment
Technical Field
The present invention relates to the field of computer technology, and in particular to a model generation method, an information extraction method and apparatus, and an electronic device.
Background
With the rapid development of Internet technology, video has become one of the main carriers of content creation and dissemination on social media platforms. In order to better process a video, for example to perform video behavior recognition, it is generally necessary to extract both the appearance information and the motion information of the video. However, an ordinary 2D convolutional neural network can only extract the appearance information of the video frames in a video. Therefore, how to comprehensively extract both the appearance information and the motion information of a video has become a problem that urgently needs to be solved.
In the prior art, an optical flow map of the video is typically extracted first; the video frames are then used as the input of one 2D convolutional neural network, which extracts the appearance information, while the optical flow map is used as the input of another 2D convolutional neural network, which extracts the motion information. In this two-stream extraction scheme, two 2D convolutional neural networks are needed, and the optical flow map of the video must additionally be extracted. Since optical flow extraction consumes a large amount of computation, the overall computation amount and cost are high.
Disclosure of Invention
Embodiments of the present invention aim to provide a model generation method, an information extraction method and apparatus, and an electronic device, so as to solve the problem of the high cost of extracting appearance information and motion information. The specific technical solutions are as follows:
in a first aspect of the present invention, there is provided a method for generating a model, the method including:
extracting a sample video frame from a sample video;
extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and performing model training based on the appearance information and the motion information to obtain an information extraction model.
In a second aspect of the present invention, there is provided an information extraction method, including:
extracting a video frame to be extracted from a video to be extracted;
inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model; wherein the information extraction model is generated according to the method of any one of the first aspect.
In a third aspect of the present invention, there is also provided a model generation apparatus, including:
the first extraction module is used for extracting a sample video frame from a sample video;
the second extraction module is used for extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and the training module is used for carrying out model training based on the appearance information and the motion information to obtain an information extraction model.
In a fourth aspect of the present invention, there is also provided an information extraction apparatus, including:
the extraction module is used for extracting a video frame to be extracted from a video to be extracted;
the input module is used for inputting the video frames to be extracted into an information extraction model, and extracting appearance information and motion information from the video frames to be extracted through the information extraction model; wherein the information extraction model is generated by the apparatus according to any of the third aspects.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
The model generation method provided by the embodiment of the invention extracts a sample video frame from a sample video, extracts appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and obtains motion information of the sample video frame through a motion information extraction module that extracts a gradient feature map and a difference feature map, wherein the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. A motion information extraction module is embedded in the 2D convolutional neural network, the sample video frame is directly used as input, and the embedded motion information extraction module obtains the motion information by combining the gradient feature map and the difference feature map. Therefore, there is no need to pre-compute an optical flow map of the sample video frame as input or to use an additional 2D convolutional neural network, and the motion information and the appearance information can still be acquired, so the cost can be reduced to a certain extent.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1-1 is a schematic diagram of an appearance-motion information 2D convolutional neural network according to an embodiment of the present invention;
FIGS. 1-2 are flowcharts illustrating steps of a method for generating a model according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating steps of another method for generating a model according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating steps of a method for extracting information according to an embodiment of the present invention;
FIG. 4 is a block diagram of a model generation apparatus provided by an embodiment of the present invention;
fig. 5 is a block diagram of an information extraction apparatus according to an embodiment of the present invention;
fig. 6 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
First, a specific application scenario related to the embodiment of the present invention is described. At present, video behavior recognition is widely applied. When performing video behavior recognition, the cost of extracting the appearance information and the motion information of the video may have an important influence on the whole processing process.
Further, in the prior art, when appearance information and motion information are extracted, one 2D convolutional neural network is used to extract the appearance information, and another 2D convolutional neural network is used to extract the motion information based on an optical flow map, which results in a high cost. In order to balance the cost consumed against the comprehensiveness of the extracted information, embodiments of the present invention provide a model generation method, an information extraction method and apparatus, and an electronic device. In the method, based on a preset appearance-motion information 2D convolutional neural network obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance, the appearance information is extracted through the appearance information extraction module in the preset appearance-motion information 2D convolutional neural network, and the motion information is obtained through the motion information extraction module in that network.
For example, in one implementation, fig. 1-1 is a schematic structural diagram of an appearance-motion information 2D convolutional neural network provided by an embodiment of the present invention. As shown in fig. 1-1, the input layer, the appearance information extraction module, and the output layer may be original structures of the original 2D convolutional neural network: the input layer may be used to receive an input image, and the output layer may be used to output information such as the final processing result. The appearance-motion information 2D convolutional neural network may be formed by embedding a motion information extraction module in the original 2D convolutional neural network. It should be noted that fig. 1-1 is only an exemplary illustration; in practical applications, the number of each layer or module is not limited to that shown in the figure, and the convolutional neural network may further include other layers, for example a feature map generation layer located after the input layer and before the appearance information extraction module and the motion information extraction module, a fully connected layer between the appearance information extraction module and the output layer, and a pooling layer, an activation function layer, and the like between the motion information extraction module and the output layer; this is not limited in the embodiments of the present invention.
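For illustration only, the following is a minimal PyTorch-style sketch of how such an appearance-motion information 2D network could be assembled from the modules described above. All module names, layer sizes and parameters (e.g. `feature_dim`, `num_classes`) are assumptions made for this example and are not specified by the patent.

```python
import torch
import torch.nn as nn

class AppearanceMotion2DNet(nn.Module):
    """Toy appearance-motion 2D CNN: a feature map generation stem after the input,
    a parallel appearance branch and motion branch, and an output layer."""

    def __init__(self, num_classes=10, feature_dim=64):
        super().__init__()
        # feature map generation layer(s) after the input layer
        self.stem = nn.Sequential(
            nn.Conv2d(3, feature_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feature_dim),
            nn.ReLU(inplace=True),
        )
        # appearance information extraction module (ordinary 2D convolutions)
        self.appearance = nn.Sequential(
            nn.Conv2d(feature_dim, feature_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feature_dim),
            nn.ReLU(inplace=True),
        )
        # motion information extraction module (stand-in: any module mapping the
        # shared feature map to a motion feature map of the same shape)
        self.motion = nn.Sequential(
            nn.Conv2d(feature_dim, feature_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feature_dim),
            nn.ReLU(inplace=True),
        )
        # output layer
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feature_dim, num_classes)
        )

    def forward(self, frames):                            # frames: (N, 3, H, W)
        feat = self.stem(frames)
        fused = self.appearance(feat) + self.motion(feat) # element-wise fusion of both branches
        return self.head(fused)

logits = AppearanceMotion2DNet()(torch.randn(8, 3, 112, 112))
print(logits.shape)   # torch.Size([8, 10])
```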
In the embodiment of the present invention, a motion information extraction module is embedded in the 2D convolutional neural network, the sample video frame is directly used as input, and the embedded motion information extraction module acquires the motion information by combining the gradient feature map and the difference feature map. Therefore, there is no need to pre-compute an optical flow map of the sample video frame as input or to use an additional 2D convolutional neural network, and the motion information and the appearance information can still be acquired, so the cost can be reduced to a certain extent.
Fig. 1-2 is a flowchart illustrating steps of a model generation method according to an embodiment of the present invention, and as shown in fig. 1-2, the method may include:
step 101, extracting a sample video frame from a sample video.
In this embodiment of the present invention, the sample video may be obtained by receiving a video manually input by a user, or may be directly obtained from a network, which is not limited in this embodiment of the present invention.
Further, the specific number of sample video frames may be selected according to actual requirements, and the sample video frames may be all video frames contained in the sample video or only some specific video frames contained in the sample video; the embodiment of the present invention does not limit this. For example, the sample video may be divided into a plurality of video segments: 64 consecutive frames may be randomly extracted from the sample video, and 8 frames may then be sampled from these 64 frames at equal intervals.
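As a concrete illustration of this sampling scheme, the snippet below randomly picks a 64-frame clip and then takes 8 frame indices from it at equal intervals. The helper name and the use of frame indices (rather than decoded frames) are assumptions made for this example.

```python
import random

def sample_frame_indices(num_frames, clip_len=64, num_samples=8):
    """Randomly choose a clip of clip_len consecutive frames, then sample
    num_samples frame indices from it at equal intervals."""
    start = random.randint(0, max(0, num_frames - clip_len))
    step = clip_len // num_samples
    return [min(start + i * step, num_frames - 1) for i in range(num_samples)]

print(sample_frame_indices(300))  # e.g. [137, 145, 153, 161, 169, 177, 185, 193]
```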
102, extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance.
In this embodiment of the present invention, the operation of extracting the appearance information and the operation of acquiring the motion information may be performed simultaneously, or the operation of extracting the appearance information may be performed first, or the operation of acquiring the motion information may be performed first, which is not limited in this embodiment of the present invention.
Further, the motion information extraction module may be a module that is designed in advance and is capable of extracting the gradient feature map and the difference feature map from an input feature map and further extracting the motion information from them. The input feature map may be obtained by processing the sample video frame with the layers that precede the motion information extraction module; for example, a feature map generation layer generates the input feature map from the sample video frame, and this feature map generation layer may be a layer originally used in the ordinary 2D convolutional neural network to generate the input feature map. By embedding at least one motion information extraction module in the 2D convolutional neural network in advance, the appearance-motion information 2D convolutional neural network is generated, so that the sample video frame is processed directly by the appearance-motion information 2D convolutional neural network, and both the appearance information and the motion information can be obtained without inputting an optical flow map of the sample video frame or using an additional 2D convolutional neural network.
Wherein the 2D convolutional neural network may be a residual convolutional neural network, e.g., a 2D ResNet50 convolutional neural network. The specific number and the specific position of the embedded motion information extraction modules may be determined according to actual requirements, which is not limited in the embodiment of the present invention.
Further, when the appearance information of the sample video frame is extracted through an original appearance information extraction module in the preset appearance-motion information 2D convolutional neural network, the convolution operation is performed on the sample video frame by using a convolution kernel set in the original appearance information extraction module, so that the appearance information is extracted.
Further, with the sample video frame as input, the internal processing of the preset appearance-motion information 2D convolutional neural network extracts abstract feature maps of the input at different levels, from which the pre-embedded motion information extraction module can then extract the motion information.
And 103, performing model training based on the appearance information and the motion information to obtain an information extraction model.
In the embodiment of the present invention, during model training, a classification model may be used to perform classification according to the appearance information and the motion information to obtain a predicted category. Specifically, the motion behavior of the subject contained in the video frame may be determined according to the spatial-dimension features and the temporal-dimension features, and behavior classification may be performed on this basis. The parameters in the preset appearance-motion information 2D convolutional neural network are then adjusted according to the degree of deviation between the predicted category and the true category, and after the adjustment is completed, training continues by repeating the above steps until the condition for stopping training is met. Therefore, by training the preset appearance-motion information 2D convolutional neural network, the accuracy of the information extracted by the network can be improved, ensuring that the resulting information extraction model can accurately extract the appearance information and the motion information at a low cost.
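A hedged sketch of such a training procedure is given below: behavior classes are predicted from the network output, the deviation between prediction and true label is measured with a cross-entropy loss, and back-propagation adjusts the network parameters. The optimizer, learning rate and the fixed-epoch stopping condition are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

def train(model, data_loader, num_epochs=10, lr=1e-3, device="cpu"):
    """Toy training loop: predict a behavior class for each sampled frame batch and
    adjust the network parameters from the deviation between prediction and label."""
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()            # deviation between predicted and true category
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(num_epochs):              # stop condition here: fixed number of epochs
        for frames, labels in data_loader:       # frames: (N, 3, H, W), labels: (N,)
            frames, labels = frames.to(device), labels.to(device)
            logits = model(frames)               # classification from appearance + motion features
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                      # adjust parameters of the whole network
            optimizer.step()
    return model
```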
In summary, in the model generation method provided by the embodiment of the present invention, a sample video frame is extracted from a sample video, appearance information of the sample video frame is extracted through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and motion information of the sample video frame is obtained through a motion information extraction module that extracts a gradient feature map and a difference feature map, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. A motion information extraction module is embedded in the 2D convolutional neural network, the sample video frame is directly used as input, and the embedded motion information extraction module obtains the motion information by combining the gradient feature map and the difference feature map. Therefore, there is no need to pre-compute an optical flow map of the sample video frame as input or to use an additional 2D convolutional neural network, and the cost can be reduced to a certain extent.
Fig. 2 is a flowchart of steps of another model generation method provided in an embodiment of the present invention, and as shown in fig. 2, the method may include:
step 201, extracting a sample video frame from a sample video.
Specifically, for this step reference may be made to step 101, which is not repeated here in the embodiment of the present invention.
Step 202, extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network.
Step 203, extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance.
In this step, the motion information extraction module may be embedded in the original appearance information extraction module, and each original appearance information extraction module in the preset appearance-motion information 2D convolutional neural network may be connected to one motion information extraction module. Connecting a motion information extraction module to every original appearance information extraction module improves, to a greater extent, the motion information extraction capability of the preset appearance-motion information 2D convolutional neural network obtained after the embedding. The appearance information extraction module may be a convolution module, and the appearance information extraction module and the motion information extraction module in the appearance-motion information 2D convolutional neural network are parallel. For example, assuming that the selected 2D convolutional neural network is 2D ResNet50, the motion information extraction module may be embedded as a branch in each convolution module (bottleneck) of the 2D ResNet50, the convolution module being the appearance information extraction module. Accordingly, each convolution module after embedding forms an appearance-motion (AM) convolution module, and the 2D convolutional neural network after embedding forms the preset appearance-motion information 2D convolutional neural network.
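The following sketch shows one possible form of such an appearance-motion (AM) convolution module, with a motion branch placed in parallel with a ResNet-style bottleneck. The exact placement of the branch inside 2D ResNet50 is not fixed here, and the simplified motion branch is only a placeholder for the motion information extraction module described below; all layer sizes are assumptions for the example.

```python
import torch
import torch.nn as nn

class AMBottleneck(nn.Module):
    """Bottleneck (appearance branch) with a parallel motion branch, roughly in the
    spirit of embedding a motion module into each 2D ResNet50 bottleneck."""

    def __init__(self, channels, mid_channels):
        super().__init__()
        # appearance branch: standard 1x1 -> 3x3 -> 1x1 bottleneck
        self.appearance = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1), nn.BatchNorm2d(channels),
        )
        # motion branch: placeholder for the motion information extraction module
        self.motion = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.appearance(x) + self.motion(x) + x   # residual connection plus both branches
        return self.relu(out)

y = AMBottleneck(256, 64)(torch.randn(2, 256, 28, 28))
print(y.shape)  # torch.Size([2, 256, 28, 28])
```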
Specifically, the present step may include the following substeps:
substep (1): adjusting the number of channels of the input feature map to P; p < Q, which is the original number of channels of the input feature map.
In this step, the input feature map refers to the abstract feature map input to the motion information extraction module. The input feature map may be obtained by processing the sample video frame with the layers of the 2D convolutional neural network that precede the motion information extraction module. These preceding layers may be a plurality of convolutional layers, and through their processing, abstract feature maps of different levels can be extracted; the abstract feature map input to the motion information extraction module is the input feature map. The number of channels of the abstract feature map output by each layer may be determined by the number of dimensions set in that layer, that is, the specific value of Q may be determined by the number of dimensions set in the layer preceding the motion information extraction module. For example, the number of channels of the abstract feature map output by each layer may be equal to the number of dimensions set in the layer; assuming that the number of dimensions is set to 2048, Q may be 2048.
Further, the specific value of P may be set according to practical situations as long as P is ensured to be smaller than the original channel number Q. When adjusting the number of channels, the input feature map may be convolved by using a 1 × 1 convolution kernel to compress the number of channels of the input feature map, which is not limited in the embodiment of the present invention. In the embodiment of the invention, the number of the channels of the input characteristic diagram is adjusted to be P, namely, the number of the channels is reduced, so that the data volume needing to be processed in the subsequent steps can be reduced to a certain extent, and the required consumed calculation amount is further reduced.
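A minimal illustration of this channel compression, assuming a 1 × 1 convolution is used; the concrete values of Q and P are examples only.

```python
import torch
import torch.nn as nn

Q, P = 2048, 256                       # original and reduced channel numbers (example values)
x = torch.randn(1, Q, 7, 7)            # input feature map with Q channels
reduce = nn.Conv2d(Q, P, kernel_size=1)  # 1x1 convolution compresses the channel number
print(reduce(x).shape)                 # torch.Size([1, 256, 7, 7])
```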
Substep (2): and extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map.
In this step, the preset gradient direction may be set according to actual requirements; for example, the preset gradient directions may be the horizontal direction and the vertical direction. When the gradient feature map of the input feature map in the preset gradient direction is calculated, a preset gradient operator may be used to convolve the input feature map in the preset direction. The preset gradient operator may be a Sobel operator: the input feature map is convolved with the Sobel operator in the horizontal direction, i.e. the X-axis direction, to obtain a first convolution result Gx, and in the vertical direction, i.e. the Y-axis direction, to obtain a second convolution result Gy.
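As an illustration of this substep, the snippet below applies 3 × 3 Sobel kernels per channel (depthwise) to obtain Gx and Gy. Treating these two results directly as the gradient feature maps, rather than combining them further, is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def sobel_gradients(x):
    """Per-channel (depthwise) Sobel convolution of a feature map x of shape (N, C, H, W)."""
    c = x.shape[1]
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t().contiguous()
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)   # one horizontal kernel per channel
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)   # one vertical kernel per channel
    gx = F.conv2d(x, kx, padding=1, groups=c)     # gradient map in the X-axis direction (Gx)
    gy = F.conv2d(x, ky, padding=1, groups=c)     # gradient map in the Y-axis direction (Gy)
    return gx, gy

gx, gy = sobel_gradients(torch.randn(2, 16, 28, 28))
print(gx.shape, gy.shape)   # torch.Size([2, 16, 28, 28]) twice
```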
For example, the difference feature map may be extracted using neighboring feature maps in the time dimension. Specifically, because the motion regions corresponding to moving objects in the scene appear at different positions in different video frames, the difference values between the feature maps of neighboring frames can be calculated, and the difference feature map can then be generated from these difference values.
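A minimal sketch of building a difference feature map from the feature maps of neighboring frames along the time dimension; padding the last frame by repeating its difference is an assumption of this example.

```python
import torch

def difference_feature_map(feats):
    """feats: (T, C, H, W) feature maps of T neighboring frames.
    Returns per-frame differences to the next frame, same shape as feats."""
    diff = feats[1:] - feats[:-1]                 # frame-to-frame differences
    diff = torch.cat([diff, diff[-1:]], dim=0)    # pad so every frame has a difference map
    return diff

d = difference_feature_map(torch.randn(8, 16, 28, 28))
print(d.shape)   # torch.Size([8, 16, 28, 28])
```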
Substep (3): and splicing the gradient characteristic diagram and the difference characteristic diagram to obtain a target characteristic diagram.
In this step, the gradient feature map and the difference feature map may be spliced along the channel dimension, that is, the features of the same channel are spliced: for example, the feature of the first channel in the gradient feature map is spliced with the feature of the first channel in the difference feature map, and the feature of the x-th channel in the gradient feature map is spliced with the feature of the x-th channel in the difference feature map. Finally, the feature map obtained after splicing may be used as the target feature map. Splicing the gradient feature map and the difference feature map into the target feature map allows richer feature information to be gathered in the target feature map.
Substep (4): and carrying out convolution operation on the target characteristic graph to obtain the motion information.
In this step, since the feature information in the gradient feature map and the difference feature map is collected in the target feature map, the motion information can be extracted by performing convolution operation on the target feature map.
In an actual application scene, by combining the two characteristic graphs, more accurate motion information can be obtained to a certain extent.
Substep (5): and restoring the channel number of the input feature map to Q.
In this step, the number of channels may be restored to Q by performing the inverse of the operation performed in substep (1), for example by another convolution with a 1 × 1 convolution kernel whose number of output dimensions is set to Q. In an actual application scenario, other operations may still need to be performed on the input feature map after its time-dimension features have been extracted; therefore, in the embodiment of the present invention, restoring the feature map to its original state after the motion information is extracted ensures that subsequent processing of the input feature map is not affected. The specific type of the other operations may be set according to actual requirements; for example, another operation may be outputting a feature map with Q channels.
Further, the motion information extraction module may include a number adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer, and a number recovery layer; the quantity adjusting layer is used for realizing the operation in the substep (1), and the feature map generating layer can be used for realizing the operation in the substep (2); the splice layer may be used to implement the operations in sub-step (3); the feature map convolutional layer may be used to implement the operations in the sub-step (4); the number recovery layer may be used to implement the operations in sub-step (5).
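Putting the five layers together, a possible sketch of the motion information extraction module is shown below (number adjustment, gradient and difference feature map generation, splicing, feature map convolution, number recovery). The choice to splice Gx, Gy and the difference map, the zero-padding of the last frame, and all layer sizes are assumptions of this example, since the patent does not fix them here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionInfoExtraction(nn.Module):
    """Sketch of a motion information extraction module with the five layers described above."""

    def __init__(self, q_channels=256, p_channels=64):
        super().__init__()
        self.reduce = nn.Conv2d(q_channels, p_channels, 1)        # number adjustment layer (Q -> P)
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", sobel_x.reshape(1, 1, 3, 3))
        self.register_buffer("ky", sobel_x.t().reshape(1, 1, 3, 3))
        self.fuse = nn.Conv2d(3 * p_channels, p_channels, 3, padding=1)  # feature map convolution layer
        self.restore = nn.Conv2d(p_channels, q_channels, 1)       # number recovery layer (P -> Q)

    def forward(self, x):                          # x: (T, Q, H, W), feature maps of T frames
        x = self.reduce(x)
        c = x.shape[1]
        gx = F.conv2d(x, self.kx.repeat(c, 1, 1, 1), padding=1, groups=c)   # gradient feature map (X)
        gy = F.conv2d(x, self.ky.repeat(c, 1, 1, 1), padding=1, groups=c)   # gradient feature map (Y)
        diff = torch.cat([x[1:] - x[:-1], torch.zeros_like(x[:1])], dim=0)  # difference feature map
        target = torch.cat([gx, gy, diff], dim=1)  # splicing layer: target feature map
        motion = self.fuse(target)                 # convolution over the target feature map
        return self.restore(motion)                # back to the original Q channels

out = MotionInfoExtraction()(torch.randn(8, 256, 28, 28))
print(out.shape)   # torch.Size([8, 256, 28, 28])
```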
Further, each convolution layer in the above steps may be followed by a Batch Normalization (BN) layer and/or a rectified linear unit (ReLU) layer. Accordingly, after a layer completes its processing, the result can be further processed by the connected BN layer and/or ReLU layer and then passed to the next layer. During network training, the parameters of the neural network are adjusted, and these parameter changes alter the data distribution seen by other layers. Since the essence of network learning is learning the data distribution, if each batch follows a different distribution the network has to adapt to a different distribution in every iteration, which greatly reduces the training speed of the network; the BN layers alleviate this problem.
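A small example of a convolution followed by batch normalization and ReLU, as could be attached to each convolution layer above; the layer sizes are illustrative only.

```python
import torch.nn as nn

conv_bn_relu = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),      # batch normalization (BN) layer
    nn.ReLU(inplace=True),   # rectified linear unit (ReLU) layer
)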
And step 204, fusing the appearance information and the motion information.
In this step, the elements of the feature map corresponding to the appearance information and the elements of the feature map corresponding to the motion information may be added, and the feature map obtained after all elements are added is used as the fused information. Alternatively, the two feature maps may be directly spliced and connected to realize the fusion. In the embodiment of the present invention, fusing the appearance information and the motion information allows the different types of information to complement each other, so that both types of information are used during subsequent training, which to a certain extent improves the training accuracy.
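Both fusion options mentioned above, element-wise addition and channel-wise splicing, are shown in this minimal sketch; the tensor shapes are example values.

```python
import torch

appearance = torch.randn(2, 64, 28, 28)   # feature map carrying appearance information
motion = torch.randn(2, 64, 28, 28)       # feature map carrying motion information

fused_add = appearance + motion                     # element-wise addition
fused_cat = torch.cat([appearance, motion], dim=1)  # splicing along the channel dimension
print(fused_add.shape, fused_cat.shape)  # (2, 64, 28, 28) (2, 128, 28, 28)
```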
And step 205, performing model training based on the fused information to obtain an information extraction model.
Specifically, the specific training process in this step may refer to step 103, which is not described herein again in this embodiment of the present invention.
In summary, in the model generation method provided in the embodiment of the present invention, a sample video frame is extracted from a sample video, appearance information of the sample video frame is extracted through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and motion information of the sample video frame is extracted through a motion information extraction module, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance. And finally, performing model training based on the appearance information and the motion information to obtain an information extraction model. In the embodiment of the invention, the motion information is extracted by embedding the motion information extraction module in the 2D convolutional neural network and directly taking the sample video frame as input through the embedded motion information extraction module. Therefore, the motion information and the appearance information can be acquired without calculating an optical flow diagram of a sample video frame as input in advance and additionally using a 2D convolutional neural network, and the cost can be reduced to a certain extent.
Fig. 3 is a flowchart of steps of an information extraction method according to an embodiment of the present invention, and as shown in fig. 3, the method may include:
step 301, extracting a video frame to be extracted from a video to be extracted.
In the embodiment of the present invention, the video to be extracted may be obtained by receiving a video manually input by a user, or may be directly obtained from a network, and the like. Further, the specific sampling manner may refer to the foregoing steps, which is not limited in this embodiment of the present invention.
Step 302, inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model; wherein the information extraction model is generated according to any one of the foregoing model generation method embodiments.
In the embodiment of the present invention, the information extraction model is generated according to the foregoing model generation method embodiments, that is, the information extraction model has the capability of accurately extracting the appearance information and the motion information at a low cost. Therefore, by inputting the video frame to be extracted into the information extraction model, the appearance information and the motion information can be extracted at a low cost.
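A hedged sketch of this extraction step at inference time is given below; `trained_model` stands for any information extraction model trained as described above and is a hypothetical name used only for illustration.

```python
import torch

@torch.no_grad()
def extract_info(model, frames):
    """Run a trained information extraction model on frames of shape (N, 3, H, W)
    and return its outputs in evaluation mode."""
    model.eval()
    return model(frames)

# Example usage with a model trained as in the earlier sketch:
# outputs = extract_info(trained_model, frames_to_extract)
```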
In summary, in the information extraction method provided by the embodiment of the present invention, a video frame to be extracted is extracted from a video to be extracted, the video frame to be extracted is then input into the information extraction model, and appearance information and motion information are extracted from the video frame to be extracted through the information extraction model. Because the information extraction model has the capability of accurately extracting the appearance information and the motion information at a low cost, the appearance information and the motion information can be extracted at a low cost by inputting the video frame to be extracted into the model.
Fig. 4 is a block diagram of a model generation apparatus according to an embodiment of the present invention, and as shown in fig. 4, the apparatus 40 may include:
a first extraction module 401 is configured to extract a sample video frame from a sample video.
A second extraction module 402, configured to extract appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extract a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance.
A training module 403, configured to perform model training based on the appearance information and the motion information to obtain an information extraction model.
Optionally, each appearance information extraction module in the preset appearance-motion information 2D convolutional neural network is connected to one motion information extraction module.
Optionally, the second extracting module 402 is specifically configured to:
extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map; the input feature map is obtained by processing the sample video frame by a layer before the motion information extraction module.
And splicing the gradient characteristic diagram and the difference characteristic diagram to obtain a target characteristic diagram.
And carrying out convolution operation on the target characteristic graph to obtain the motion information.
Optionally, the second extracting module 402 is further specifically configured to:
before extracting the gradient feature map and the difference feature map of the input feature map in a preset gradient direction, the apparatus further includes: adjusting the number of channels of the input feature map to P; the P is less than Q, and the Q is the original channel number of the input feature map;
after performing convolution operation on the target feature map, the apparatus further includes: and restoring the channel number of the input feature map to Q.
Optionally, the appearance information extraction module is a convolution module in a preset time-space dimension 2D convolution neural network; the motion information extraction module is connected with the convolution module;
the motion information extraction module comprises a quantity adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer and a quantity recovery layer;
the quantity adjusting layer is used for realizing the operation of adjusting the channel quantity of the input feature map to P; the feature map generation layer is used for realizing the operation of extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map; the splicing layer is used for realizing the operation of splicing the gradient feature map and the difference feature map; the feature map convolution layer is used for realizing the operation of performing convolution operation on the target feature map; the quantity recovery layer is used for realizing the operation of recovering the channel quantity of the input feature diagram to Q
Optionally, the training module 403 is specifically configured to:
and fusing the appearance information and the motion information.
And carrying out model training based on the fused information to obtain an information extraction model.
In summary, in the model generation apparatus provided by the embodiment of the present invention, a sample video frame is extracted from a sample video; appearance information of the sample video frame is then extracted through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and motion information of the sample video frame is obtained through a motion information extraction module that extracts a gradient feature map and a difference feature map, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. A motion information extraction module is embedded in the 2D convolutional neural network, the sample video frame is directly used as input, and the embedded motion information extraction module obtains the motion information by combining the gradient feature map and the difference feature map. Therefore, there is no need to pre-compute an optical flow map of the sample video frame as input or to use an additional 2D convolutional neural network, and the cost can be reduced to a certain extent.
Fig. 5 is a block diagram of an information extracting apparatus according to an embodiment of the present invention, and as shown in fig. 5, the apparatus 50 may include:
an extracting module 501, configured to extract a video frame to be extracted from a video to be extracted.
An input module 502, configured to input the video frame to be extracted into an information extraction model, and extract appearance information and motion information from the video frame to be extracted through the information extraction model.
Wherein the information extraction model is generated according to the model generation device.
In summary, in the information extraction apparatus provided by the embodiment of the present invention, a video frame to be extracted is extracted from a video to be extracted, the video frame to be extracted is then input into the information extraction model, and appearance information and motion information are extracted from the video frame to be extracted through the information extraction model. Because the information extraction model has the capability of accurately extracting the appearance information and the motion information at a low cost, the appearance information and the motion information can be extracted at a low cost by inputting the video frame to be extracted into the model.
For the above device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant points, refer to the partial description of the method embodiment.
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601 is configured to implement the following steps when executing the program stored in the memory 603:
extracting a sample video frame from a sample video;
extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and performing model training based on the appearance information and the motion information to obtain an information extraction model.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to perform the model generation method or the information extraction method described in any of the above embodiments.
In yet another embodiment, the present invention further provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the model generation method or the information extraction method described in any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (16)

1. A method of model generation, the method comprising:
extracting a sample video frame from a sample video;
extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and performing model training based on the appearance information and the motion information to obtain an information extraction model.
2. The method according to claim 1, wherein one motion information extraction module is connected to each appearance information extraction module in the preset appearance-motion information 2D convolutional neural network.
3. The method of claim 1, wherein the extracting motion information of the sample video frame by a pre-embedded motion information extraction module comprises:
extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map; the input feature map is obtained by processing the sample video frame by a layer before the motion information extraction module;
splicing the gradient characteristic diagram and the difference characteristic diagram to obtain a target characteristic diagram;
and carrying out convolution operation on the target characteristic graph to obtain the motion information.
4. The method of claim 3,
before extracting the gradient feature map and the difference feature map of the input feature map in a preset gradient direction, the method further includes: adjusting the number of channels of the input feature map to P; the P is less than Q, and the Q is the original channel number of the input feature map;
after performing a convolution operation on the target feature map, the method further includes: and restoring the channel number of the input feature map to Q.
5. The method according to claim 4, wherein the appearance information extraction module is a convolution module in a preset time-space dimension 2D convolution neural network; the motion information extraction module is connected with the convolution module;
the motion information extraction module comprises a quantity adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer and a quantity recovery layer;
the quantity adjusting layer is used for realizing the operation of adjusting the channel quantity of the input feature map to P; the feature map generation layer is used for realizing the operation of extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map; the splicing layer is used for realizing the operation of splicing the gradient feature map and the difference feature map; the feature map convolution layer is used for realizing the operation of performing convolution operation on the target feature map; the quantity recovery layer is used for realizing the operation of recovering the channel quantity of the input feature diagram to Q.
6. The method according to any one of claims 1 to 5, wherein the performing model training based on the appearance information and the motion information to obtain an information extraction model comprises:
fusing the appearance information and the motion information;
and carrying out model training based on the fused information to obtain an information extraction model.
7. An information extraction method, characterized in that the method comprises:
extracting a video frame to be extracted from a video to be extracted;
inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model;
wherein the information extraction model is generated according to the method of any one of claims 1 to 6.
8. An apparatus for model generation, the apparatus comprising:
the first extraction module is used for extracting a sample video frame from a sample video;
the second extraction module is used for extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and the training module is used for carrying out model training based on the appearance information and the motion information to obtain an information extraction model.
9. The apparatus of claim 8, wherein one motion information extraction module is connected to each appearance information extraction module in the preset appearance-motion information 2D convolutional neural network.
10. The apparatus of claim 8, wherein the second extraction module is specifically configured to:
extract a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map, wherein the input feature map is obtained by processing the sample video frame by a layer preceding the motion information extraction module;
splice the gradient feature map and the difference feature map to obtain a target feature map;
and perform a convolution operation on the target feature map to obtain the motion information.
11. The apparatus of claim 10, wherein the second extraction module is further configured to:
before extracting the gradient feature map and the difference feature map of the input feature map in the preset gradient direction, adjust the number of channels of the input feature map to P, wherein P is less than Q and Q is the original number of channels of the input feature map;
and after performing the convolution operation on the target feature map, restore the number of channels of the input feature map to Q.
12. The apparatus of claim 11, wherein the appearance information extraction module is a convolution module in a preset spatio-temporal 2D convolutional neural network, and the motion information extraction module is connected to the convolution module;
the motion information extraction module comprises a quantity adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer and a quantity recovery layer;
the quantity adjustment layer is configured to adjust the number of channels of the input feature map to P; the feature map generation layer is configured to extract, according to the input feature map, a gradient feature map and a difference feature map of the input feature map in a preset gradient direction; the splicing layer is configured to splice the gradient feature map and the difference feature map to obtain the target feature map; the feature map convolution layer is configured to perform a convolution operation on the target feature map; and the quantity recovery layer is configured to restore the number of channels of the input feature map to Q.
13. The apparatus according to any one of claims 8 to 12, wherein the training module is specifically configured to:
fuse the appearance information and the motion information;
and perform model training based on the fused information to obtain the information extraction model.
14. An information extraction apparatus, characterized in that the apparatus comprises:
the extraction module is used for extracting a video frame to be extracted from a video to be extracted;
the input module is used for inputting the video frames to be extracted into an information extraction model, and extracting appearance information and motion information from the video frames to be extracted through the information extraction model;
wherein the information extraction model is generated by the apparatus of any one of claims 8 to 12.
15. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
and the processor is configured to implement the method of any one of claims 1 to 7 when executing the computer program stored in the memory.
16. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202011357509.3A 2020-11-26 2020-11-26 Model generation method, information extraction device and electronic equipment Active CN112330711B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011357509.3A CN112330711B (en) 2020-11-26 2020-11-26 Model generation method, information extraction device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112330711A true CN112330711A (en) 2021-02-05
CN112330711B CN112330711B (en) 2023-12-05

Family

ID=74308668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011357509.3A Active CN112330711B (en) 2020-11-26 2020-11-26 Model generation method, information extraction device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112330711B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
US20170262705A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Recurrent networks with motion-based attention for video understanding
CA3077830A1 (en) * 2016-12-05 2018-06-05 Avigilon Corporation System and method for appearance search
WO2020010979A1 (en) * 2018-07-10 2020-01-16 腾讯科技(深圳)有限公司 Method and apparatus for training model for recognizing key points of hand, and method and apparatus for recognizing key points of hand
WO2020019926A1 (en) * 2018-07-27 2020-01-30 腾讯科技(深圳)有限公司 Feature extraction model training method and apparatus, computer device, and computer readable storage medium
CN109447246A (en) * 2018-10-30 2019-03-08 北京字节跳动网络技术有限公司 Method and apparatus for generating model
WO2020098158A1 (en) * 2018-11-14 2020-05-22 平安科技(深圳)有限公司 Pedestrian re-recognition method and apparatus, and computer readable storage medium
WO2020108483A1 (en) * 2018-11-28 2020-06-04 腾讯科技(深圳)有限公司 Model training method, machine translation method, computer device and storage medium
CN109919087A (en) * 2019-03-06 2019-06-21 腾讯科技(深圳)有限公司 Video classification method, model training method and apparatus
WO2020177722A1 (en) * 2019-03-06 2020-09-10 腾讯科技(深圳)有限公司 Method for video classification, method and device for model training, and storage medium
WO2020177582A1 (en) * 2019-03-06 2020-09-10 腾讯科技(深圳)有限公司 Video synthesis method, model training method, device and storage medium
CN110070067A (en) * 2019-04-29 2019-07-30 北京金山云网络技术有限公司 The training method of video classification methods and its model, device and electronic equipment
CN110324664A (en) * 2019-07-11 2019-10-11 南开大学 Neural network-based video frame interpolation method and training method of its model
CN111259786A (en) * 2020-01-14 2020-06-09 浙江大学 Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIAN LIU ET AL: "CTM: Collaborative Temporal Modelling for Action Recognition", arXiv, pages 1-7 *

Similar Documents

Publication Publication Date Title
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
CN110909663B (en) Human body key point identification method and device and electronic equipment
CN111428805B (en) Method for detecting salient object, model, storage medium and electronic device
CN112418345B (en) Method and device for quickly identifying small targets with fine granularity
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
CN110544214A (en) Image restoration method and device and electronic equipment
CN112381763A (en) Surface defect detection method
CN114529574B (en) Image matting method and device based on image segmentation, computer equipment and medium
CN114943307B (en) Model training method and device, storage medium and electronic equipment
CN110119736B (en) License plate position identification method and device and electronic equipment
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN112765402A (en) Sensitive information identification method, device, equipment and storage medium
CN112580581A (en) Target detection method and device and electronic equipment
CN114638304A (en) Training method of image recognition model, image recognition method and device
CN115035347A (en) Picture identification method and device and electronic equipment
CN115601629A (en) Model training method, image recognition method, medium, device and computing equipment
CN110060264B (en) Neural network training method, video frame processing method, device and system
CN115131695A (en) Training method of video detection model, video detection method and device
CN114255493A (en) Image detection method, face detection device, face detection equipment and storage medium
CN112330711B (en) Model generation method, information extraction device and electronic equipment
CN116258873A (en) Position information determining method, training method and device of object recognition model
CN112329925B (en) Model generation method, feature extraction method, device and electronic equipment
CN113627460A (en) Target identification system and method based on time slice convolutional neural network
CN110909798A (en) Multi-algorithm intelligent studying and judging method, system and server
CN116758295B (en) Key point detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant