CN112330711B - Model generation method, information extraction device and electronic equipment - Google Patents
Model generation method, information extraction device and electronic equipment
- Publication number
- CN112330711B (application CN202011357509.3A)
- Authority
- CN
- China
- Prior art keywords
- motion information
- feature map
- appearance
- information extraction
- extraction module
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a model generation method, an information extraction device and electronic equipment, and belongs to the technical field of computers. In the method, a sample video frame is extracted from a sample video; appearance information of the sample video frame is extracted by an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and a gradient feature map and a difference feature map are extracted by a motion information extraction module to obtain motion information of the sample video frame. Model training is then performed based on the appearance information and the motion information to obtain an information extraction model. An optical flow map of the sample video frame does not need to be computed in advance as input, and no additional 2D convolutional neural network is needed to obtain the motion information and the appearance information, so the cost can be reduced to a certain extent.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a model generating method, an information extracting device, and an electronic device.
Background
With the rapid development of internet technology, video has become one of the main carriers of content creation and dissemination on social media platforms. To better process video, for example for video behavior recognition, it is generally necessary to extract both the appearance information and the motion information of the video. However, a common 2D convolutional neural network can only extract the appearance information of the video frames in the video. Therefore, how to extract both the appearance information and the motion information of a video has become a problem to be solved.
In the prior art, an optical flow map of the video is generally extracted first; the video frames are then fed into one 2D convolutional neural network to extract the appearance information, while the optical flow map is fed into another 2D convolutional neural network to extract the motion information. This two-stream extraction approach requires two 2D convolutional neural networks and additionally requires the optical flow map of the video to be computed, and optical flow extraction is computationally expensive, so the overall computation and cost are high.
Disclosure of Invention
The embodiment of the invention aims to provide a model generation method, an information extraction device and electronic equipment, so as to solve the problem of high cost when appearance information and motion information are extracted. The specific technical scheme is as follows:
In a first aspect of the present invention, there is provided a model generation method, the method including:
extracting a sample video frame from a sample video;
extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module so as to acquire motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and performing model training based on the appearance information and the motion information to obtain an information extraction model.
In a second aspect of the present invention, there is provided an information extraction method, the method comprising:
extracting a video frame to be extracted from the video to be extracted;
inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model; wherein the information extraction model is generated according to the method of any one of the first aspects.
In a third aspect of the present invention, there is also provided a model generating apparatus, the apparatus including:
a first extraction module for extracting sample video frames from the sample video;
the second extraction module is used for extracting the appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module so as to acquire the motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and the training module is used for carrying out model training based on the appearance information and the motion information to obtain an information extraction model.
In a fourth aspect of the present invention, there is also provided an information extraction apparatus, the apparatus including:
the extraction module is used for extracting video frames to be extracted from the video to be extracted;
the input module is used for inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model; wherein the information extraction model is generated according to the apparatus of any one of the third aspects.
In yet another aspect of the invention, there is also provided a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the invention there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
According to the model generation method provided by the embodiment of the invention, a sample video frame is extracted from a sample video; appearance information of the sample video frame is then extracted through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and a gradient feature map and a difference feature map are extracted through a motion information extraction module to obtain motion information of the sample video frame, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. The motion information is obtained by embedding the motion information extraction module in the 2D convolutional neural network, directly taking the sample video frame as input, and combining the gradient feature map and the difference feature map through the embedded motion information extraction module. Therefore, an optical flow map of the sample video frame does not need to be computed in advance as input, and the motion information and the appearance information can be obtained without using an additional 2D convolutional neural network, so that the cost can be reduced to a certain extent.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1-1 is a schematic structural diagram of a 2D convolutional neural network for appearance-motion information according to an embodiment of the present invention;
FIGS. 1-2 are flowcharts of steps of a model generation method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of another model generation method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps of an information extraction method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a model generating apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of an information extraction apparatus provided by an embodiment of the present invention;
FIG. 6 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
First, a description will be given of a specific application scenario related to an embodiment of the present invention. Video behavior recognition is widely used at present. In the video behavior recognition, the cost of extracting the appearance information and the motion information of the video can have an important influence on the whole processing procedure.
Further, in the prior art, when appearance information and motion information are extracted, one 2D convolutional neural network is used to extract the appearance information and, based on an optical flow map, another 2D convolutional neural network is used to extract the motion information, which results in a high cost. In order to balance the consumed cost against the comprehensiveness of the extracted information, the embodiment of the invention provides a model generation method, an information extraction device and electronic equipment. In the method, based on a preset appearance-motion information 2D convolutional neural network obtained by embedding at least one motion information extraction module in a 2D convolutional neural network in advance, appearance information is extracted through the appearance information extraction module in the preset appearance-motion information 2D convolutional neural network, and motion information is acquired through the motion information extraction module in the preset appearance-motion information 2D convolutional neural network.
For example, in one implementation, fig. 1-1 is a schematic structural diagram of an appearance-motion information 2D convolutional neural network according to an embodiment of the present invention. As shown in fig. 1-1, the input layer, the appearance information extraction module and the output layer may be original structures of the original 2D convolutional neural network, where the input layer may be used to receive an input image and the output layer may be used to output information such as the final processing result. The appearance-motion information 2D convolutional neural network may be formed after a motion information extraction module is embedded in the original 2D convolutional neural network. It should be noted that fig. 1-1 is only an exemplary illustration; in practical applications, the actual number of each layer or module is not limited to that shown in the figure, and the convolutional neural network may include other layers, for example, a feature map generation layer located after the input layer and before the appearance information extraction module and the motion information extraction module, a fully connected layer connected between the appearance information extraction module and the output layer or between the motion information extraction module and the output layer, a pooling layer, an activation function layer, and so on, which is not limited in the embodiments of the present invention.
In the embodiment of the invention, the motion information is obtained by embedding the motion information extraction module in the 2D convolutional neural network, directly taking the sample video frame as input, and combining the gradient feature map and the difference feature map through the embedded motion information extraction module. Therefore, an optical flow map of the sample video frame does not need to be computed in advance as input, and the motion information and the appearance information can be obtained without using an additional 2D convolutional neural network, so that the cost can be reduced to a certain extent.
Fig. 1-2 are flowcharts of steps of a model generating method according to an embodiment of the present invention, where, as shown in fig. 1-2, the method may include:
Step 101, extracting a sample video frame from a sample video.
In the embodiment of the present invention, the sample video may be obtained by receiving a video manually input by a user or may be directly obtained from a network, which is not limited in the embodiment of the present invention.
Further, the specific number of sample video frames may be selected according to actual requirements, and the sample video frames may be all of the video frames contained in the sample video or only part of them; the embodiment of the present invention is not limited thereto. For example, the sample video may be divided into multiple video segments: 64 consecutive frames may be randomly extracted from the sample video, and 8 frames may then be extracted from those 64 frames at equal intervals.
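As an illustration of the sampling scheme just described, the following minimal Python sketch picks a random 64-frame clip and then takes 8 frame indices from it at equal intervals; the clip length and sample count are simply the example values above, and the helper name is hypothetical.

```python
import random

def sample_frame_indices(total_frames, clip_len=64, num_samples=8):
    # Randomly choose the start of a clip of `clip_len` consecutive frames,
    # then take `num_samples` indices from that clip at equal intervals.
    start = random.randint(0, max(0, total_frames - clip_len))
    step = clip_len // num_samples
    return [start + i * step for i in range(num_samples)]

indices = sample_frame_indices(300)   # e.g. for a 300-frame sample video
```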
Step 102, extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module so as to acquire motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance.
In the embodiment of the present invention, the operation of extracting the appearance information and the operation of obtaining the motion information may be performed simultaneously, or the operation of extracting the appearance information may be set to be performed first, or the operation of obtaining the motion information may be performed first.
Further, the motion information extraction module may be a pre-designed module capable of extracting a gradient feature map and a difference feature map from the input feature map, and thereby extracting the motion information. The input feature map may be obtained by processing the sample video frame with the layers before the motion information extraction module; for example, a feature map generation layer generates the input feature map from the sample video frame, where the feature map generation layer may be an original layer used to generate input feature maps in a common 2D convolutional neural network. By embedding at least one motion information extraction module in the 2D convolutional neural network in advance, the appearance-motion information 2D convolutional neural network is generated, so that an optical flow map of the sample video frame does not need to be input, and the appearance information and the motion information can be obtained by directly processing the sample video frame through the appearance-motion information 2D convolutional neural network, without using an additional 2D convolutional neural network.
The 2D convolutional neural network may be a residual convolutional neural network, for example, a 2D ResNet-50 convolutional neural network. The specific number and specific positions of the embedded motion information extraction modules may be determined according to actual requirements, which is not limited in the embodiment of the present invention.
Further, when appearance information of a sample video frame is extracted through an original appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, a convolution operation is performed on the sample video frame by using a convolution kernel set in the original appearance information extraction module, so that the appearance information is extracted.
Furthermore, with the sample video frame as input, abstract feature maps of different levels can be extracted through the internal processing of the preset appearance-motion information 2D convolutional neural network, and the pre-embedded motion information extraction module can then extract the motion information from these abstract feature maps.
Step 103, performing model training based on the appearance information and the motion information to obtain an information extraction model.
In the embodiment of the invention, when model training is performed, a classification model may be used to classify according to the appearance information and the motion information to obtain a predicted category. Specifically, the motion behavior of the subject contained in the video frames may be determined according to the spatial-dimension features and the temporal-dimension features, and behavior classification may be performed on that basis. The parameters in the preset appearance-motion information 2D convolutional neural network are then adjusted based on the degree of deviation between the predicted category and the real category, and after the adjustment is completed, training continues by repeating the above steps until the condition for stopping training is met. Therefore, by training the preset appearance-motion information 2D convolutional neural network, the accuracy of the information extracted by the network can be improved, finally ensuring that the obtained information extraction model can accurately extract appearance information and motion information at a low cost.
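The following sketch shows, under stated assumptions, what one such training iteration could look like in PyTorch: a cross-entropy loss measures the deviation between the predicted and real category, and backpropagation adjusts the network parameters. The stand-in model, the class count of 400 and the optimizer settings are illustrative only and are not taken from this document.

```python
import torch
import torch.nn as nn

num_classes = 400   # assumed number of behavior categories

# Stand-in for the preset appearance-motion information 2D CNN plus a
# classification head; the real network is the one described in this document.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, num_classes),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(frames, labels):
    logits = model(frames)            # predicted category from appearance + motion features
    loss = criterion(logits, labels)  # deviation between predicted and real category
    optimizer.zero_grad()
    loss.backward()                   # adjust parameters based on the deviation
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(4, 3, 224, 224), torch.randint(0, num_classes, (4,)))
```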
In summary, in the model generation method provided by the embodiment of the invention, a sample video frame is extracted from a sample video; appearance information of the sample video frame is then extracted through the appearance information extraction module in the preset appearance-motion information 2D convolutional neural network, and a gradient feature map and a difference feature map are extracted through the motion information extraction module to obtain motion information of the sample video frame, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. The motion information is obtained by embedding the motion information extraction module in the 2D convolutional neural network, directly taking the sample video frame as input, and combining the gradient feature map and the difference feature map through the embedded module. Therefore, an optical flow map of the sample video frame does not need to be computed in advance as input, and the motion information and the appearance information can be obtained without using an additional 2D convolutional neural network, so that the cost can be reduced to a certain extent.
Fig. 2 is a flowchart of steps of another method for generating a model according to an embodiment of the present invention, as shown in fig. 2, the method may include:
step 201, extracting a sample video frame from a sample video.
Specifically, this step may refer to the foregoing step 101, which is not limited in the embodiment of the present invention.
Step 202, extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network.
Step 203, extracting a gradient feature map and a difference feature map by a motion information extraction module to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance.
In this step, the motion information extraction module may be embedded alongside an original appearance information extraction module, and each original appearance information extraction module in the preset appearance-motion information 2D convolutional neural network may be connected with one motion information extraction module. Connecting a motion information extraction module to each original appearance information extraction module can, to a large extent, improve the ability of the resulting preset appearance-motion information 2D convolutional neural network to extract motion information. The appearance information extraction module may be a convolution module, and in the appearance-motion information 2D convolutional neural network the appearance information extraction module is parallel to the motion information extraction module. For example, assuming the selected 2D convolutional neural network is 2D ResNet-50, the motion information extraction module may be embedded as a branch into each convolution module (bottleneck) of 2D ResNet-50, the convolution module being the appearance information extraction module. Accordingly, the embedded convolution module forms an appearance-motion (AM) convolution module, and the embedded 2D convolutional neural network forms the preset appearance-motion information 2D convolutional neural network.
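A structural sketch of such an AM convolution module is given below. It assumes a simplified ResNet-style bottleneck as the appearance path and assumes that the motion branch output is combined with the appearance output by element-wise addition; the combination method and channel sizes are illustrative choices, and the motion branch passed in is only a placeholder for the module detailed in the sub-steps below.

```python
import torch
import torch.nn as nn

class AMConvModule(nn.Module):
    """Appearance-motion (AM) convolution module: a simplified bottleneck
    (appearance path) with a parallel motion information extraction branch."""
    def __init__(self, channels, motion_branch):
        super().__init__()
        mid = channels // 4
        self.appearance = nn.Sequential(   # appearance information extraction path
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.motion = motion_branch        # motion information extraction branch

    def forward(self, x):
        # x: input feature map from the preceding layers, shape (T, C, H, W)
        out = self.appearance(x) + self.motion(x) + x   # assumed additive combination
        return torch.relu(out)

# Placeholder motion branch that keeps the channel count unchanged.
block = AMConvModule(256, motion_branch=nn.Conv2d(256, 256, 3, padding=1))
y = block(torch.randn(8, 256, 56, 56))
```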
Specifically, the present step may include the following sub-steps:
substep (1): adjusting the number of channels of the input feature map to P; and P < Q, wherein Q is the original channel number of the input characteristic diagram.
In this step, the input feature map refers to the abstract feature map input to the motion information extraction module. The input feature map may be obtained by processing the sample video frame with the layers before the motion information extraction module in the 2D convolutional neural network. These preceding layers may be several convolution layers, and abstract feature maps of different levels can be extracted through their processing; the abstract feature map fed into the motion information extraction module is the input feature map. The number of channels of the abstract feature map output by each layer may be determined by the number of dimensions set in that layer, that is, the specific value of Q may be determined according to the number of dimensions set in the layer preceding the motion information extraction module. For example, the number of channels of the abstract feature map output by each layer may be equal to the number of dimensions set in that layer; assuming the set number of dimensions is 2048, Q may be 2048.
Further, the specific value of P may be set according to the actual situation, as long as P is smaller than the original channel number Q. When adjusting the number of channels, the input feature map may be convolved with a 1×1 convolution kernel to compress its number of channels, which is not limited in the embodiment of the present invention. In the embodiment of the invention, adjusting the number of channels of the input feature map to P, i.e., to a smaller number, can reduce the amount of data to be processed in the subsequent steps to a certain extent, which further reduces the required computation.
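A minimal sketch of this channel compression, assuming the 2048-channel example above and an arbitrarily chosen P of 256:

```python
import torch
import torch.nn as nn

Q, P = 2048, 256   # Q from the 2048-dimension example above; P = 256 is an arbitrary choice
reduce_channels = nn.Conv2d(Q, P, kernel_size=1)   # 1x1 convolution compresses the channels

x = torch.randn(8, Q, 7, 7)          # input feature map: (frames, channels, H, W)
x_small = reduce_channels(x)         # shape (8, 256, 7, 7); the later sub-steps work on this
```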
Substep (2): and extracting a gradient characteristic diagram and a difference characteristic diagram of the input characteristic diagram in a preset gradient direction according to the input characteristic diagram.
In this step, the preset gradient direction may be set according to actual requirements; for example, the preset gradient directions may be the horizontal direction and the vertical direction. When calculating the gradient feature map of the input feature map in the preset gradient direction, the input feature map may be convolved in the preset direction using a preset gradient operator. The preset gradient operator may be a Sobel operator: the input feature map is convolved from the horizontal direction, i.e., the X-axis direction, using the Sobel operator to obtain a first convolution result G_x, and convolved from the vertical direction, i.e., the Y-axis direction, to obtain a second convolution result G_y.
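A sketch of the gradient feature maps, under the assumption that the Sobel operator is applied channel-wise (depthwise) to the compressed feature map; the 3×3 kernels are the standard Sobel operators, everything else is illustrative.

```python
import torch
import torch.nn.functional as F

def gradient_maps(x):
    # x: (T, P, H, W) compressed input feature map; apply the Sobel operator per channel.
    p = x.shape[1]
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    kx = sobel_x.repeat(p, 1, 1, 1)        # (P, 1, 3, 3): one kernel per channel
    ky = sobel_x.t().repeat(p, 1, 1, 1)
    gx = F.conv2d(x, kx, padding=1, groups=p)   # first convolution result G_x (X-axis direction)
    gy = F.conv2d(x, ky, padding=1, groups=p)   # second convolution result G_y (Y-axis direction)
    return gx, gy

gx, gy = gradient_maps(torch.randn(8, 256, 7, 7))
```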
For example, a difference feature map may be extracted using adjacent feature maps in the time dimension. Specifically, the motion regions corresponding to moving objects in the scene occupy different positions in different video frames. Thus, the differences between adjacent feature maps may be calculated, and a difference feature map may then be generated based on these differences.
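A minimal sketch of the difference feature map, assuming the feature maps of the sampled frames are stacked along the time dimension; repeating the last frame so that the output keeps the same shape is an illustrative choice, not a requirement of this document.

```python
import torch

def difference_map(x):
    # x: (T, P, H, W) feature maps of the T sampled frames, in temporal order.
    nxt = torch.cat([x[1:], x[-1:]], dim=0)   # next-frame features; the last frame is repeated
    return nxt - x                            # large values where moving objects changed position

d = difference_map(torch.randn(8, 256, 7, 7))
```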
Substep (3): and splicing the gradient feature map and the difference feature map to obtain a target feature map.
In this step, the features of the same channel may be spliced according to the number of channels, for example, the features of the first channel in the gradient feature map and the features of the first channel in the difference feature map are spliced, the features of the X-th channel in the gradient feature map and the features of the X-th channel in the difference feature map are spliced, and finally, the feature map obtained after the splicing may be used as the target feature map. And splicing the gradient feature map and the difference feature map to serve as a target feature map, so that richer feature information can be collected in the target feature map.
Substep (4): and carrying out convolution operation on the target feature map to obtain the motion information.
In this step, since the target feature map collects the feature information of both the gradient feature map and the difference feature map, the motion information can be extracted by performing a convolution operation on the target feature map.
In an actual application scenario, combining the two feature maps can, to a certain extent, yield more accurate motion information.
Substep (5): and restoring the number of channels of the input characteristic diagram to Q.
In this step, the reverse of the operation performed in the foregoing sub-step (1) may be performed to adjust the number of channels back to Q, i.e., to restore the number of channels of the input feature map. For example, the channel recovery in this step may be achieved by convolving with a 1×1 convolution kernel, where the number of output dimensions set for the convolution operation is Q. In an actual application scenario, other operations may still need to be performed on the input feature map after its temporal-dimension features have been extracted; therefore, in the embodiment of the present invention, the input feature map is restored to its original state after the motion information is extracted, ensuring that subsequent processing of the input feature map is not affected. The specific type of the other operations may be set according to actual requirements; for example, the other operation may be outputting a feature map with Q channels.
Further, the motion information extraction module may include a number adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer, and a number recovery layer. The number adjustment layer is used to implement the operation in sub-step (1); the feature map generation layer may be used to implement the operation in sub-step (2); the splicing layer may be used to implement the operation in sub-step (3); the feature map convolution layer may be used to implement the operation in sub-step (4); and the number recovery layer may be used to implement the operation in sub-step (5).
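The following sketch assembles the five layers named above into one module. The reduction ratio, the kernel sizes, the use of Sobel kernels for the gradient maps, and the way G_x and G_y are merged into a single gradient map (here by summing their absolute values) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionInfoExtraction(nn.Module):
    """Sketch of a motion information extraction module with the five layers named
    above: number adjustment, feature map generation, splicing, feature map
    convolution, and number recovery."""
    def __init__(self, q_channels, reduction=8):
        super().__init__()
        p = q_channels // reduction
        self.adjust = nn.Conv2d(q_channels, p, 1)       # number adjustment layer: Q -> P
        self.fuse = nn.Conv2d(2 * p, p, 3, padding=1)   # feature map convolution layer
        self.recover = nn.Conv2d(p, q_channels, 1)      # number recovery layer: P -> Q
        sobel = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", sobel.repeat(p, 1, 1, 1))
        self.register_buffer("ky", sobel.t().repeat(p, 1, 1, 1))

    def forward(self, x):
        # x: (T, Q, H, W) input feature map of the T sampled frames
        f = self.adjust(x)                                   # sub-step (1)
        p = f.shape[1]
        gx = F.conv2d(f, self.kx, padding=1, groups=p)       # sub-step (2): gradient maps
        gy = F.conv2d(f, self.ky, padding=1, groups=p)
        grad = gx.abs() + gy.abs()                           # merge G_x and G_y (assumed)
        diff = torch.cat([f[1:], f[-1:]], dim=0) - f         # sub-step (2): difference map
        target = torch.cat([grad, diff], dim=1)              # sub-step (3): splicing -> target map
        motion = self.fuse(target)                           # sub-step (4): motion information
        return self.recover(motion)                          # sub-step (5): restore Q channels

module = MotionInfoExtraction(2048)
out = module(torch.randn(8, 2048, 7, 7))   # output keeps the original Q channels
```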
Further, each convolution layer in the above steps may be followed by a batch normalization (Batch Normalization, BN) layer and/or a rectified linear unit (Rectified Linear Unit, ReLU) layer. Accordingly, after these layers complete their processing, the processing results may be further processed by the connected BN layer and/or ReLU layer before being passed to the next layer. Because the parameters in the neural network are adjusted during training, the parameter changes alter the data distribution seen by other layers, while the essence of network learning is to learn the data distribution; if the data distribution of each batch is different, the network has to adapt to a different distribution in every iteration, which greatly reduces the training speed. The BN layer normalizes the data of each batch, which helps keep the distribution stable and speeds up training.
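A small sketch of the convolution → BN → ReLU pattern described above; the channel counts and kernel size are placeholders.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel_size=1):
    # Each convolution layer is followed by batch normalization and a ReLU layer,
    # which stabilizes the data distribution passed on to the next layer.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```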
Step 204, fusing the appearance information and the motion information.
In this step, the elements of the feature map corresponding to the appearance information may be added to the corresponding elements of the feature map corresponding to the motion information, and after all elements have been added, the resulting feature map group is used as the fused information. Alternatively, the two may be directly spliced and connected to achieve fusion. In the embodiment of the invention, fusing the appearance information and the motion information makes the different kinds of information complement each other, so that the subsequent training uses both kinds of information and, to a certain extent, the training accuracy is improved.
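Both fusion options mentioned above can be written in one line each; the shapes below are example values and assume the two feature maps have matching dimensions.

```python
import torch

appearance = torch.randn(8, 2048, 7, 7)   # appearance feature map (example shape)
motion = torch.randn(8, 2048, 7, 7)       # motion feature map of the same shape

fused_add = appearance + motion                      # element-wise addition
fused_cat = torch.cat([appearance, motion], dim=1)   # or direct splicing along the channels
```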
Step 205, performing model training based on the fused information to obtain an information extraction model.
Specifically, the training process in this step may refer to the foregoing step 103, and details are not repeated here in the embodiment of the present invention.
In summary, in the model generation method provided by the embodiment of the invention, a sample video frame is extracted from a sample video; appearance information of the sample video frame is then extracted through the appearance information extraction module in the preset appearance-motion information 2D convolutional neural network, and motion information of the sample video frame is extracted through the motion information extraction module, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. In the embodiment of the invention, the motion information is extracted by embedding the motion information extraction module in the 2D convolutional neural network and directly taking the sample video frame as input to the embedded motion information extraction module. Therefore, an optical flow map of the sample video frame does not need to be computed in advance as input, and the motion information and the appearance information can be obtained without using an additional 2D convolutional neural network, so that the cost can be reduced to a certain extent.
Fig. 3 is a flowchart of steps of an information extraction method according to an embodiment of the present invention, where, as shown in fig. 3, the method may include:
step 301, extracting a video frame to be extracted from a video to be extracted.
In the embodiment of the invention, the video to be extracted can be obtained by receiving the video manually input by a user, or can be directly obtained from a network, and the like. Further, the specific sampling method may refer to the foregoing steps, which is not limited in the embodiment of the present invention.
Step 302, inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model; wherein the information extraction model is generated according to any of the foregoing model generation method embodiments.
In the embodiment of the invention, since the information extraction model is generated according to the foregoing model generation method embodiments, that is, the model has the capability of accurately extracting appearance information and motion information at a low cost, the appearance information and the motion information can be extracted at a low cost by inputting the video frame to be extracted into the information extraction model.
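A minimal sketch of this extraction step, assuming the trained model is available as a PyTorch module; the stand-in module, frame shape and variable names are placeholders, not part of this document.

```python
import torch
import torch.nn as nn

# Stand-in for the trained information extraction model; in practice this is the
# appearance-motion network obtained by the model generation method above.
info_extraction_model = nn.Identity()
info_extraction_model.eval()

with torch.no_grad():
    frames = torch.randn(8, 3, 224, 224)        # frames sampled from the video to be extracted
    features = info_extraction_model(frames)    # appearance and motion information
```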
In summary, according to the information extraction method provided by the embodiment of the invention, a video frame to be extracted is extracted from the video to be extracted, the video frame is then input into the information extraction model, and appearance information and motion information are extracted from the video frame through the information extraction model. Because the information extraction model has the capability of accurately extracting appearance information and motion information at a low cost, the appearance information and the motion information can be extracted at a low cost by inputting the video frame to be extracted into the information extraction model.
Fig. 4 is a block diagram of a model generating apparatus according to an embodiment of the present invention, and as shown in fig. 4, the apparatus 40 may include:
a first extraction module 401 is configured to extract a sample video frame from a sample video.
A second extraction module 402, configured to extract appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extract a gradient feature map and a difference feature map through a motion information extraction module, so as to obtain motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance.
The training module 403 is configured to perform model training based on the appearance information and the motion information, and obtain an information extraction model.
Optionally, each appearance information extraction module in the preset appearance-motion information 2D convolutional neural network is connected with one motion information extraction module.
Optionally, the second extraction module 402 is specifically configured to:
extracting a gradient characteristic diagram and a difference characteristic diagram of the input characteristic diagram in a preset gradient direction according to the input characteristic diagram; the input feature map is obtained by processing the sample video frame by a layer before the motion information extraction module.
And splicing the gradient feature map and the difference feature map to obtain a target feature map.
And carrying out convolution operation on the target feature map to obtain the motion information.
Optionally, the second extraction module 402 is further specifically configured to:
before extracting the gradient feature map and the difference feature map of the input feature map in the preset gradient direction, the device further comprises: adjusting the number of channels of the input feature map to P; the P is less than Q, and the Q is the number of original channels of the input characteristic diagram;
After performing the convolution operation on the target feature map, the apparatus further includes: and restoring the number of channels of the input characteristic diagram to Q.
Optionally, the appearance information extraction module is a convolution module in a preset time-space dimension 2D convolution neural network; the motion information extraction module is connected with the convolution module;
the motion information extraction module comprises a quantity adjustment layer, a characteristic map generation layer, a splicing layer, a characteristic map convolution layer and a quantity recovery layer;
the quantity adjusting layer is used for realizing the operation of adjusting the quantity of the channels of the input feature map to P; the feature map generation layer is used for realizing the operation of extracting a gradient feature map and a difference feature map of the input feature map in a preset gradient direction according to the input feature map; the splicing layer is used for realizing the operation of splicing the gradient feature map and the difference feature map; the feature map convolution layer is used for realizing the operation of carrying out convolution operation on the target feature map; the number recovery layer is used for realizing the operation of recovering the number of channels of the input feature map to Q.
Optionally, the training module 403 is specifically configured to:
And fusing the appearance information and the motion information.
And performing model training based on the fused information to obtain an information extraction model.
In summary, in the model generating device provided by the embodiment of the invention, a sample video frame is extracted from a sample video; appearance information of the sample video frame is then extracted through the appearance information extraction module in the preset appearance-motion information 2D convolutional neural network, and a gradient feature map and a difference feature map are extracted through the motion information extraction module to obtain motion information of the sample video frame, where the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance. Finally, model training is performed based on the appearance information and the motion information to obtain an information extraction model. The motion information is obtained by embedding the motion information extraction module in the 2D convolutional neural network, directly taking the sample video frame as input, and combining the gradient feature map and the difference feature map through the embedded module. Therefore, an optical flow map of the sample video frame does not need to be computed in advance as input, and the motion information and the appearance information can be obtained without using an additional 2D convolutional neural network, so that the cost can be reduced to a certain extent.
Fig. 5 is a block diagram of an information extraction apparatus according to an embodiment of the present invention, and as shown in fig. 5, the apparatus 50 may include:
the extracting module 501 is configured to extract a video frame to be extracted from a video to be extracted.
The input module 502 is configured to input the video frame to be extracted into an information extraction model, and extract appearance information and motion information from the video frame to be extracted through the information extraction model.
Wherein the information extraction model is generated according to the aforementioned model generation means.
In summary, according to the information extraction device provided by the embodiment of the invention, a video frame to be extracted is extracted from the video to be extracted, the video frame is then input into the information extraction model, and appearance information and motion information are extracted from the video frame through the information extraction model. Because the information extraction model has the capability of accurately extracting appearance information and motion information at a low cost, the appearance information and the motion information can be extracted at a low cost by inputting the video frame to be extracted into the information extraction model.
For the above-described device embodiments, the description is relatively simple, as it is substantially similar to the method embodiments, with reference to the description of the method embodiments in part.
The embodiment of the invention also provides an electronic device, as shown in fig. 6, which comprises a processor 601, a communication interface 602, a memory 603 and a communication bus 604, wherein the processor 601, the communication interface 602 and the memory 603 communicate with each other through the communication bus 604.
a memory 603 for storing a computer program;
the processor 601 is configured to execute the program stored in the memory 603, and implement the following steps:
extracting a sample video frame from a sample video;
extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module so as to acquire motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
and performing model training based on the appearance information and the motion information to obtain an information extraction model.
The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but also digital signal processors (Digital Signal Processing, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, a computer readable storage medium is provided, in which instructions are stored, which when run on a computer, cause the computer to perform the model generation method or the information extraction method according to any of the above embodiments.
In a further embodiment of the present invention, a computer program product comprising instructions which, when run on a computer, cause the computer to perform the model generation method or the information extraction method according to any of the above embodiments is also provided.
In the above embodiments, the implementation may be wholly or partly in software, hardware, firmware, or any combination thereof. When implemented in software, it may be wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (10)
1. A method of generating a model, the method comprising:
extracting a sample video frame from a sample video;
extracting appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module so as to acquire motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
model training is carried out based on the appearance information and the motion information, and an information extraction model is obtained;
each appearance information extraction module in the preset appearance-motion information 2D convolutional neural network is connected with one motion information extraction module;
the appearance information extraction module is a convolution module in a preset time-space dimension 2D convolution neural network; the motion information extraction module is connected with the convolution module;
The motion information extraction module comprises a quantity adjustment layer, a characteristic map generation layer, a splicing layer, a characteristic map convolution layer and a quantity recovery layer;
the quantity adjusting layer is used for realizing the operation of adjusting the quantity of channels of the input feature map to P; the feature map generation layer is used for extracting gradient feature maps and difference feature maps of the input feature maps in a preset gradient direction according to the input feature maps; the splicing layer is used for realizing the operation of splicing the gradient feature map and the difference feature map to obtain a target feature map; the feature map convolution layer is used for realizing the operation of carrying out convolution operation on the target feature map; the quantity recovery layer is used for realizing the operation of recovering the quantity of channels of the input feature map to Q; the input feature map is obtained by processing the sample video frame by a layer before the motion information extraction module, P is smaller than Q, and Q is the original channel number of the input feature map.
2. The method of claim 1, wherein the extracting, by the pre-embedded motion information extraction module, the motion information of the sample video frame comprises:
Extracting a gradient characteristic diagram and a difference characteristic diagram of the input characteristic diagram in a preset gradient direction according to the input characteristic diagram;
splicing the gradient feature map and the difference feature map to obtain a target feature map;
and carrying out convolution operation on the target feature map to obtain the motion information.
3. The method according to any one of claims 1-2, wherein the performing model training based on the appearance information and the motion information to obtain an information extraction model includes:
fusing the appearance information and the motion information;
and performing model training based on the fused information to obtain an information extraction model.
4. An information extraction method, characterized in that the method comprises:
extracting a video frame to be extracted from the video to be extracted;
inputting the video frame to be extracted into an information extraction model, and extracting appearance information and motion information from the video frame to be extracted through the information extraction model;
wherein the information extraction model is generated according to the method of any one of claims 1 to 3.
5. A model generation apparatus, characterized in that the apparatus comprises:
a first extraction module for extracting sample video frames from the sample video;
The second extraction module is used for extracting the appearance information of the sample video frame through an appearance information extraction module in a preset appearance-motion information 2D convolutional neural network, and extracting a gradient feature map and a difference feature map through a motion information extraction module so as to acquire the motion information of the sample video frame; the preset appearance-motion information 2D convolutional neural network is obtained by embedding at least one motion information extraction module in the 2D convolutional neural network in advance;
the training module is used for carrying out model training based on the appearance information and the motion information to obtain an information extraction model;
each appearance information extraction module in the preset appearance-motion information 2D convolutional neural network is connected with one motion information extraction module;
the appearance information extraction module is a convolution module in a preset spatio-temporal-dimension 2D convolutional neural network; the motion information extraction module is connected to the convolution module;
the motion information extraction module comprises a quantity adjustment layer, a feature map generation layer, a splicing layer, a feature map convolution layer and a quantity recovery layer;
the quantity adjustment layer is used for adjusting the number of channels of the input feature map to P; the feature map generation layer is used for extracting, from the input feature map, a gradient feature map in a preset gradient direction and a difference feature map; the splicing layer is used for splicing the gradient feature map and the difference feature map to obtain a target feature map; the feature map convolution layer is used for performing a convolution operation on the target feature map; the quantity recovery layer is used for recovering the number of channels of the input feature map to Q; the input feature map is obtained by processing the sample video frame with the layer preceding the motion information extraction module, P is smaller than Q, and Q is the original number of channels of the input feature map.
6. The apparatus according to claim 5, wherein the second extraction module is specifically configured to:
extracting, from the input feature map, a gradient feature map in a preset gradient direction and a difference feature map;
splicing the gradient feature map and the difference feature map to obtain a target feature map;
and carrying out convolution operation on the target feature map to obtain the motion information.
7. The apparatus according to any one of claims 5-6, wherein the training module is specifically configured to:
fusing the appearance information and the motion information;
and performing model training based on the fused information to obtain an information extraction model.
8. An information extraction apparatus, characterized in that the apparatus comprises:
the extraction module is used for extracting video frames to be extracted from the video to be extracted;
the input module is used for inputting the video frame to be extracted into an information extraction model, and for extracting appearance information and motion information from the video frame to be extracted through the information extraction model;
wherein the information extraction model is generated according to the apparatus of any one of claims 5 to 7.
9. Electronic equipment, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing the method of any one of claims 1-4 when executing the program stored on the memory.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011357509.3A CN112330711B (en) | 2020-11-26 | 2020-11-26 | Model generation method, information extraction device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112330711A CN112330711A (en) | 2021-02-05 |
CN112330711B (en) | 2023-12-05
Family
ID=74308668
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011357509.3A Active CN112330711B (en) | 2020-11-26 | 2020-11-26 | Model generation method, information extraction device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112330711B (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8345984B2 (en) * | 2010-01-28 | 2013-01-01 | Nec Laboratories America, Inc. | 3D convolutional neural networks for automatic human action recognition |
US10049279B2 (en) * | 2016-03-11 | 2018-08-14 | Qualcomm Incorporated | Recurrent networks with motion-based attention for video understanding |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3077830A1 (en) * | 2016-12-05 | 2018-06-05 | Avigilon Corporation | System and method for appearance search |
WO2020010979A1 (en) * | 2018-07-10 | 2020-01-16 | 腾讯科技(深圳)有限公司 | Method and apparatus for training model for recognizing key points of hand, and method and apparatus for recognizing key points of hand |
WO2020019926A1 (en) * | 2018-07-27 | 2020-01-30 | 腾讯科技(深圳)有限公司 | Feature extraction model training method and apparatus, computer device, and computer readable storage medium |
CN109447246A (en) * | 2018-10-30 | 2019-03-08 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating model |
WO2020098158A1 (en) * | 2018-11-14 | 2020-05-22 | 平安科技(深圳)有限公司 | Pedestrian re-recognition method and apparatus, and computer readable storage medium |
WO2020108483A1 (en) * | 2018-11-28 | 2020-06-04 | 腾讯科技(深圳)有限公司 | Model training method, machine translation method, computer device and storage medium |
CN109919087A (en) * | 2019-03-06 | 2019-06-21 | 腾讯科技(深圳)有限公司 | A kind of method of visual classification, the method and device of model training |
WO2020177722A1 (en) * | 2019-03-06 | 2020-09-10 | 腾讯科技(深圳)有限公司 | Method for video classification, method and device for model training, and storage medium |
WO2020177582A1 (en) * | 2019-03-06 | 2020-09-10 | 腾讯科技(深圳)有限公司 | Video synthesis method, model training method, device and storage medium |
CN110070067A (en) * | 2019-04-29 | 2019-07-30 | 北京金山云网络技术有限公司 | The training method of video classification methods and its model, device and electronic equipment |
CN110324664A (en) * | 2019-07-11 | 2019-10-11 | 南开大学 | A kind of video neural network based mends the training method of frame method and its model |
CN111259786A (en) * | 2020-01-14 | 2020-06-09 | 浙江大学 | Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video |
Non-Patent Citations (1)
Title |
---|
CTM: Collaborative Temporal Modelling for Action Recognition; Qian Liu et al.; arXiv; pp. 1-7 *
Also Published As
Publication number | Publication date |
---|---|
CN112330711A (en) | 2021-02-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220230282A1 (en) | Image processing method, image processing apparatus, electronic device and computer-readable storage medium | |
CN110909663B (en) | Human body key point identification method and device and electronic equipment | |
CN110544214A (en) | Image restoration method and device and electronic equipment | |
CN112418345B (en) | Method and device for quickly identifying small targets with fine granularity | |
CN111428805B (en) | Method for detecting salient object, model, storage medium and electronic device | |
CN112381763A (en) | Surface defect detection method | |
CN110176024B (en) | Method, device, equipment and storage medium for detecting target in video | |
CN110457524B (en) | Model generation method, video classification method and device | |
CN114549913B (en) | Semantic segmentation method and device, computer equipment and storage medium | |
EP4432215A1 (en) | Image processing method and device | |
CN113989616A (en) | Target detection method, device, equipment and storage medium | |
CN111986265B (en) | Method, apparatus, electronic device and medium for calibrating camera | |
CN112183627A (en) | Method for generating predicted density map network and vehicle annual inspection mark number detection method | |
Park et al. | Pyramid attention upsampling module for object detection | |
CN115035347A (en) | Picture identification method and device and electronic equipment | |
CN114638304A (en) | Training method of image recognition model, image recognition method and device | |
CN117952941A (en) | Circuit board defect detection method and device | |
CN113393410A (en) | Image fusion method and device, electronic equipment and storage medium | |
CN112330711B (en) | Model generation method, information extraction device and electronic equipment | |
CN111914920A (en) | Sparse coding-based similarity image retrieval method and system | |
CN111414921A (en) | Sample image processing method and device, electronic equipment and computer storage medium | |
CN116258873A (en) | Position information determining method, training method and device of object recognition model | |
CN112329925B (en) | Model generation method, feature extraction method, device and electronic equipment | |
CN112990305A (en) | Method, device and equipment for determining occlusion relationship and storage medium | |
CN112825145A (en) | Human body orientation detection method and device, electronic equipment and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |