CN111046847A - Video processing method and device, electronic equipment and medium

Info

Publication number
CN111046847A
Authority
CN
China
Prior art keywords
feature
feature vector
attention
feature map
map
Prior art date
Legal status
Pending
Application number
CN201911395074.9A
Other languages
Chinese (zh)
Inventor
王智康
马原
Current Assignee
Beijing Pengsi Technology Co Ltd
Original Assignee
Beijing Pengsi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Pengsi Technology Co Ltd filed Critical Beijing Pengsi Technology Co Ltd
Priority to CN201911395074.9A priority Critical patent/CN111046847A/en
Publication of CN111046847A publication Critical patent/CN111046847A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the disclosure provide a video processing method and apparatus, an electronic device, and a medium, relating to the field of computer technology. The method includes: extracting a feature map corresponding to each video frame in a video containing a target; determining, according to the pixel point distribution of each feature map, a horizontal attention feature vector and a vertical attention feature vector corresponding to that feature map; performing feature splicing on the horizontal attention feature vector and the vertical attention feature vector corresponding to each feature map to obtain an attention feature vector corresponding to that feature map; and performing a vector aggregation operation on the attention feature vectors corresponding to the feature maps of all video frames in the video containing the target to obtain a representative feature vector corresponding to that video. With the method and apparatus, the representative feature vector can accurately reflect the important features in the video frames containing the target.

Description

Video processing method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a video processing method and apparatus, an electronic device, and a medium.
Background
With the development of artificial intelligence, computer vision, and hardware technology, video image processing is now widely applied in smart city systems. Based on person re-identification (Re-ID), an electronic device can use a segment of video containing a target pedestrian to determine all videos in a video query library that also contain that pedestrian.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a video processing method, an apparatus, an electronic device, and a medium, so that a representative feature vector can accurately represent important features in a video frame containing a target.
In a first aspect, a video processing method is provided, where the method is applied to an electronic device, and the method includes:
extracting a feature map corresponding to each video frame in a video containing a target;
determining, according to the pixel point distribution of each feature map, the weight of each row of pixel points in the feature map to which it belongs, and determining a horizontal attention feature vector corresponding to each feature map based on the weights of the rows of pixel points in that feature map;
determining, according to the pixel point distribution of each feature map, the weight of each column of pixel points in the feature map to which it belongs, and determining a vertical attention feature vector corresponding to each feature map based on the weights of the columns of pixel points in that feature map;
performing feature splicing on the horizontal attention feature vector and the vertical attention feature vector corresponding to each feature map to obtain an attention feature vector corresponding to each feature map;
and carrying out vector aggregation operation on attention feature vectors corresponding to feature maps of all video frames in the video containing the target to obtain a representative feature vector corresponding to the video containing the target.
Optionally, the determining, according to the distribution of the pixel points of each feature map, the weight of each row of pixel points in each feature map in the corresponding feature map, and determining, based on the weight of each row of pixel points in the corresponding feature map, the horizontal attention feature vector corresponding to each feature map includes:
performing global average pooling operation on each feature map to obtain a first feature vector corresponding to the feature map;
performing compression operation and decoding operation on the first feature vector to obtain a second feature vector;
respectively carrying out summation operation on each row of pixel points in the feature map to obtain a third feature vector corresponding to each row of pixel points in the feature map;
performing logistic regression operation on each third feature vector to obtain an attention value corresponding to each third feature vector, taking the attention value corresponding to each third feature vector as the weight of each third feature vector, and performing weighted summation operation on the third feature vectors corresponding to each row of pixel points in the feature map to obtain a fourth feature vector;
and taking the product of the second feature vector and the fourth feature vector as a horizontal attention feature vector corresponding to the feature map.
Optionally, the determining, according to the distribution of the pixel points of each feature map, the weight of each column of pixel points in each feature map in the corresponding feature map, and determining, based on the weight of each column of pixel points in the corresponding feature map, the vertical attention feature vector corresponding to each feature map includes:
performing global average pooling operation on each feature map to obtain a fifth feature vector corresponding to the feature map;
performing compression operation and decoding operation on the fifth feature vector to obtain a sixth feature vector;
summing each column of pixel points in the feature map respectively to obtain a seventh feature vector corresponding to each column of pixel points in the feature map;
performing a logistic regression operation on each seventh feature vector to obtain an attention value corresponding to each seventh feature vector, taking the attention value corresponding to each seventh feature vector as the weight of that seventh feature vector, and performing a weighted summation operation on the seventh feature vectors corresponding to the columns of pixel points in the feature map to obtain an eighth feature vector;
and taking the product of the sixth feature vector and the eighth feature vector as the vertical attention feature vector corresponding to the feature map.
Optionally, the performing a vector aggregation operation on the attention feature vectors corresponding to the feature maps of all the video frames in the video including the target to obtain the representative feature vector corresponding to the video including the target includes:
for the attention feature vector corresponding to each feature map, calculating the vector distance between that attention feature vector and the attention feature vector corresponding to every feature map;
calculating the average value of the vector distances corresponding to that attention feature vector, and taking the calculated average value as the weight of that attention feature vector;
and performing weighted average operation on each attention feature vector based on the weight of each attention feature vector to obtain a representative feature vector corresponding to the video containing the target.
In a second aspect, a video processing apparatus is provided, where the apparatus is applied to an electronic device, and the apparatus includes:
the extraction module is used for extracting a feature map corresponding to each video frame in a video containing a target;
the determining module is used for determining, according to the pixel point distribution of each feature map, the weight of each row of pixel points in the feature map to which it belongs, and determining the horizontal attention feature vector corresponding to each feature map based on the weights of the rows of pixel points in that feature map;
the determining module is further configured to determine, according to the pixel point distribution of each feature map, the weight of each column of pixel points in the feature map to which it belongs, and determine the vertical attention feature vector corresponding to each feature map based on the weights of the columns of pixel points in that feature map;
the feature splicing module is used for performing feature splicing on the horizontal attention feature vector and the vertical attention feature vector corresponding to each feature map to obtain an attention feature vector corresponding to each feature map;
and the vector aggregation module is used for carrying out vector aggregation operation on attention feature vectors corresponding to feature maps of all video frames in the video containing the target to obtain a representative feature vector corresponding to the video containing the target.
Optionally, the determining module is specifically configured to:
performing global average pooling operation on each feature map to obtain a first feature vector corresponding to the feature map;
performing compression operation and decoding operation on the first feature vector to obtain a second feature vector;
respectively carrying out summation operation on each row of pixel points in the feature map to obtain a third feature vector corresponding to each row of pixel points in the feature map;
performing logistic regression operation on each third feature vector to obtain an attention value corresponding to each third feature vector, taking the attention value corresponding to each third feature vector as the weight of each third feature vector, and performing weighted summation operation on the third feature vectors corresponding to each row of pixel points in the feature map to obtain a fourth feature vector;
and taking the product of the second feature vector and the fourth feature vector as a horizontal attention feature vector corresponding to the feature map.
Optionally, the determining module is specifically further configured to:
performing global average pooling operation on each feature map to obtain a fifth feature vector corresponding to the feature map;
performing compression operation and decoding operation on the fifth feature vector to obtain a sixth feature vector;
summing each column of pixel points in the feature map respectively to obtain a seventh feature vector corresponding to each column of pixel points in the feature map;
performing a logistic regression operation on each seventh feature vector to obtain an attention value corresponding to each seventh feature vector, taking the attention value corresponding to each seventh feature vector as the weight of that seventh feature vector, and performing a weighted summation operation on the seventh feature vectors corresponding to the columns of pixel points in the feature map to obtain an eighth feature vector;
and taking the product of the sixth feature vector and the eighth feature vector as the vertical attention feature vector corresponding to the feature map.
Optionally, the vector aggregation module is specifically configured to:
for the attention feature vector corresponding to each feature map, calculating the vector distance between that attention feature vector and the attention feature vector corresponding to every feature map;
calculating the average value of the vector distances corresponding to that attention feature vector, and taking the calculated average value as the weight of that attention feature vector;
and performing weighted average operation on each attention feature vector based on the weight of each attention feature vector to obtain a representative feature vector corresponding to the video containing the target.
In a third aspect, an electronic device is provided, including a processor and a memory;
a memory for storing a computer program;
a processor for implementing the method steps of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the method steps of the first aspect.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect described above.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a neural network provided in an embodiment of the present disclosure;
fig. 2 is a flow chart of a method for video processing according to an embodiment of the present disclosure;
fig. 3 is a flow chart of another method of video processing provided by the embodiments of the present disclosure;
fig. 4 is a flow chart of another method of video processing provided by an embodiment of the present disclosure;
fig. 5 is a schematic diagram illustrating an execution flow of a bidirectional graph convolution attention module according to an embodiment of the present disclosure;
fig. 6 is a flow chart of another method of video processing provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an affinity matrix provided in an embodiment of the present disclosure;
fig. 8 is a schematic diagram illustrating an execution flow of an affinity sequence fusion module according to an embodiment of the disclosure;
fig. 9 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments disclosed herein without creative effort shall fall within the protection scope of the present disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Also, the use of the terms "a," "an," or "the" and similar referents does not denote a limitation of quantity, but rather denotes the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
In the technology known to the inventors, in order to associate videos containing the target pedestrian for Re-ID, the electronic device may encode each frame image of the target pedestrian video into a feature vector and aggregate these feature vectors in the time domain into one representative feature vector, which summarizes the features of the objects in the target pedestrian video. The electronic device may then calculate the similarity between the representative feature vector of the target pedestrian video and the representative feature vectors of the videos in the video query library, so as to determine, according to this similarity, the videos in the query library that contain the target pedestrian.
The inventors found that interference factors such as occlusion, shadow, and background are present in the video frames of the target pedestrian video. In video frames where such interference factors affect the image of the target pedestrian, the features of the target pedestrian are not salient, so the representative feature vector cannot accurately represent the target pedestrian appearing in the video, and the videos containing the target pedestrian in the video query library cannot be accurately determined.
In the following embodiments, each technical term has the meaning known to those skilled in the art. The convolution operation may be performed by the convolutional layers of a neural network, for example a classical convolutional neural network (CNN) or any of its specific implementations, such as the fully convolutional network FCN, the segmentation network SegNet, the deep convolutional network VGGNet, ResNet50, ResNet101, and the like. The graph convolution operation may be performed by a graph convolutional network (GCN), for example a spectral convolution or a spatial-domain convolution, and in particular a spatial-domain convolution.
The components of a neural network and their functions are understood by those skilled in the art. For example, a convolutional layer performs convolution operations, extracting feature information from an input image (e.g., of size 227 × 227) to obtain a feature map (e.g., of size 13 × 13); a pooling layer performs a pooling operation on its input, such as max pooling or mean pooling; an activation layer introduces non-linearity through an activation function, such as a rectified linear unit (ReLU, Leaky-ReLU, P-ReLU, R-ReLU), an S-shaped (sigmoid) function, or a hyperbolic tangent (tanh) function; and a fully connected layer converts the feature map output by the convolutional layers into a one-dimensional vector. The loss function evaluates the degree of inconsistency between the predicted value f(x) and the true value Y during neural network training, and may be a log loss function, a squared loss function, an exponential loss function, a hinge loss function, or the like.
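As a concrete illustration of these components, the following minimal sketch is written in PyTorch (the patent does not name a framework, so the framework, layer sizes, and class name are assumptions for illustration only):

```python
import torch
import torch.nn as nn

# Minimal sketch of the building blocks named above; sizes are arbitrary examples.
class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # convolutional layer
        self.act = nn.ReLU()                                     # activation layer
        self.pool = nn.MaxPool2d(2)                              # pooling layer (max pooling)
        self.fc = nn.Linear(16 * 112 * 112, num_classes)         # fully connected layer

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.pool(self.act(self.conv(x)))
        return self.fc(x.flatten(1))         # flatten the feature map to a 1-D vector

loss_fn = nn.CrossEntropyLoss()              # one possible loss function for training
```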
It is well known to those skilled in the art that neural networks need to be trained to achieve desired performance before deployment.
The embodiments described below process a video containing a target, where the target may be, for example, a pedestrian, an animal, or a vehicle appearing in the video.
The embodiments of the disclosure provide a video processing method applied to an electronic device in which a neural network is deployed; specifically, the neural network is an affinity-guided neural network.
As shown in fig. 1, fig. 1 is a schematic diagram of a neural network according to an embodiment of the present disclosure, and the neural network includes: a spatial feature extractor, a bidirectional graph convolution attention module, and an affinity sequence fusion module.
The electronic device can input the video containing the target into the neural network and acquire the representative feature vector corresponding to the video containing the target output by the neural network.
The spatial feature extractor may be a convolutional neural network, which is configured to perform spatial image feature extraction on each video frame of the video containing the target, so as to extract a feature map for each video frame, and to input the feature map of each video frame to the bidirectional graph convolution attention module.
In practical applications, the spatial feature extractor may be a convolutional neural network pre-trained on a previously acquired data set. For example, the training set of the spatial feature extractor may be constructed from the ImageNet database, and the network may be a 50-layer deep residual network (ResNet50). The electronic device can perform feature extraction on each frame of the video containing the target through ResNet50 pre-trained on ImageNet, obtaining a feature map with 2048 channel dimensions for each video frame image.
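As an illustration of this step, the sketch below uses torchvision's ImageNet-pretrained ResNet50 with the global pooling and classification layers removed, so that each frame yields a 2048-channel feature map; the helper name, input resolution, and batch of frames are assumptions, not taken from the patent:

```python
import torch
import torchvision.models as models

# ResNet50 pretrained on ImageNet, truncated before global pooling and the
# classifier so that the output keeps its spatial dimensions (2048 channels).
resnet = models.resnet50(pretrained=True)
extractor = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

@torch.no_grad()
def extract_feature_maps(frames):
    """frames: (N, 3, H, W) tensor of N video frames -> (N, 2048, H/32, W/32)."""
    return extractor(frames)

feature_maps = extract_feature_maps(torch.randn(8, 3, 256, 128))  # e.g. 8 frames
print(feature_maps.shape)  # torch.Size([8, 2048, 8, 4])
```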
The bidirectional graph convolution attention module is used to convert the feature map extracted by the spatial feature extractor into an attention feature vector. The bidirectional graph convolution attention module is composed of a horizontal graph convolution attention submodule and a vertical graph convolution attention submodule.
Specifically, the horizontal graph convolution attention submodule is configured to determine the horizontal attention feature vector corresponding to the feature map, and the vertical graph convolution attention submodule is configured to determine the vertical attention feature vector corresponding to the feature map. The bidirectional graph convolution attention module can perform feature splicing on the horizontal attention feature vector and the vertical attention feature vector, and the spliced feature vector is output to the affinity sequence fusion module.
The affinity sequence fusion module is used to determine the representative feature vector of the video containing the target based on the affinities between the attention feature vectors of the feature maps of all video frames in that video.
The video processing method provided by the embodiments of the present disclosure is described in detail below with reference to a specific implementation based on the neural network shown in fig. 1. As shown in fig. 2, the specific steps are as follows:
step 201, extracting a feature map corresponding to each video frame in a video containing the target.
Step 202, determining, according to the pixel point distribution of each feature map, the weight of each row of pixel points in the feature map to which it belongs, and determining the horizontal attention feature vector corresponding to each feature map based on the weights of the rows of pixel points in that feature map.
Step 203, determining, according to the pixel point distribution of each feature map, the weight of each column of pixel points in the feature map to which it belongs, and determining the vertical attention feature vector corresponding to each feature map based on the weights of the columns of pixel points in that feature map.
Step 204, performing feature splicing on the horizontal attention feature vector and the vertical attention feature vector corresponding to each feature map to obtain the attention feature vector corresponding to each feature map.
Step 205, performing vector aggregation operation on the attention feature vectors corresponding to the feature maps of all the video frames in the video including the target to obtain a representative feature vector corresponding to the video including the target.
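Taken together, steps 201 to 205 amount to the following pipeline. This is a hypothetical Python sketch; the four callables are stand-ins for the spatial feature extractor, the two graph convolution attention submodules, and the affinity sequence fusion module detailed later in this description:

```python
import torch

def video_representative_vector(frames, extract_feature_maps,
                                horizontal_attention, vertical_attention,
                                aggregate):
    """Sketch of steps 201-205 for one video; frames: (N, 3, H, W) video frames."""
    feature_maps = extract_feature_maps(frames)      # step 201: one feature map per frame
    h = horizontal_attention(feature_maps)           # step 202: horizontal attention vectors
    v = vertical_attention(feature_maps)             # step 203: vertical attention vectors
    attention_vectors = torch.cat([h, v], dim=1)     # step 204: feature splicing
    return aggregate(attention_vectors)              # step 205: vector aggregation
```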
With the video processing method provided by the embodiments of the present disclosure, the weight of each row and each column of pixel points within its feature map can be determined, so that important features can be given larger weights and the features of interference factors can be given smaller weights; the horizontal attention feature vector and the vertical attention feature vector determined based on these weights therefore amplify the important features and suppress the interference features. The horizontal attention feature vector and the vertical attention feature vector are then spliced to obtain the attention feature vector corresponding to each feature map, and a vector aggregation operation is performed on the attention feature vectors corresponding to the feature maps to obtain the representative feature vector corresponding to the video containing the target. This representative feature vector accurately represents the important features (such as the features of the target pedestrian) in the video frames containing the target, so that the electronic device can accurately determine the videos containing the target in a video query library through the representative feature vector.
In the embodiments of the present disclosure, in step 201 the video containing the target may be, for example, a segment of video containing a target pedestrian. With the neural network shown in fig. 1, the spatial feature extractor may perform convolution operations on each video frame of the video containing the target to obtain the feature map of that video frame.
In step 201, each video frame may be input to the spatial feature extractor in sequence, or N video frames may be extracted from the video containing the target to form a video sequence of N frames corresponding to that video; after the spatial feature extractor performs feature extraction on each video frame of the sequence, N feature maps with 2048 channel dimensions corresponding to the video sequence are obtained.
In step 202, the electronic device may determine the horizontal attention feature vector corresponding to each feature map through the horizontal graph convolution attention submodule in fig. 1.
Because the horizontal graph convolution attention submodule determines the weight of each row of pixel points in the feature map when calculating the horizontal attention feature vector of each feature map, important rows of pixel points can be given larger weights and rows of pixel points belonging to interference factors can be given smaller weights; that is, the important features are amplified and the features of the interference factors are suppressed.
In step 203, the electronic device may determine the vertical attention feature vector corresponding to each feature map through the vertical graph convolution attention submodule in fig. 1.
Because the vertical graph convolution attention submodule determines the weight of each column of pixel points in the feature map when calculating the vertical attention feature vector of each feature map, important columns of pixel points can be given larger weights and columns of pixel points belonging to interference factors can be given smaller weights; that is, the important features are amplified and the features of the interference factors are suppressed.
In step 204, the electronic device may determine the attention feature vector corresponding to each feature map through the bidirectional graph convolution attention module in fig. 1.
Each feature map corresponds to two feature vectors (a horizontal attention feature vector and a vertical attention feature vector), and the bidirectional graph convolution attention module can splice these two feature vectors into the attention feature vector corresponding to the feature map through a feature splicing operation.
As described in steps 202 and 203, in the horizontal attention feature vector the important rows of pixel points are given large weights and the rows belonging to interference factors are given small weights, and in the vertical attention feature vector the important columns of pixel points are given large weights and the columns belonging to interference factors are given small weights. Therefore, in the attention feature vector obtained after the bidirectional graph convolution attention module splices the two feature vectors corresponding to a feature map, the important features are amplified and the features of the interference factors are suppressed.
In step 205, the representative feature vector is used to represent a main feature in the video containing the target, which may be, for example, a main feature of a target pedestrian in the embodiment of the present disclosure.
After the affinity sequence fusion module determines the representative feature vector corresponding to the video containing the target, the electronic device may determine the videos containing the target in the video search library based on that representative feature vector.
Optionally, the electronic device may determine the vector distance (cosine distance or Euclidean distance) between the representative feature vector corresponding to the video containing the target and the representative feature vector corresponding to each video in the video search library, and then determine, as the query result, the videos in the video search library whose representative feature vectors have a vector distance smaller than a preset threshold.
A smaller vector distance indicates a higher similarity between representative feature vectors. Therefore, when the vector distance between the representative feature vector of the video containing the target and the representative feature vector of a video in the video search library is smaller than the preset threshold, the two representative feature vectors are highly similar, and that video in the search library also contains the target.
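For example, the query could be resolved by comparing representative feature vectors as in the following sketch; the Euclidean distance, the threshold value, and the function name are placeholders chosen for illustration, not values specified in the patent:

```python
import torch

def query_videos(query_vector, library_vectors, threshold=0.5):
    """query_vector: (D,) representative vector of the query video;
    library_vectors: (M, D), one representative vector per library video.
    Returns indices of library videos whose Euclidean distance to the query
    is below the (placeholder) threshold."""
    distances = torch.cdist(query_vector.unsqueeze(0), library_vectors).squeeze(0)  # (M,)
    return torch.nonzero(distances < threshold).flatten().tolist()
```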
In an implementation manner of the embodiment of the present disclosure, as shown in fig. 3, for step 202, determining, according to the distribution of the pixel points of each feature map, a weight of each row of pixel points in the feature map to which each row of pixel points belongs, and determining, based on the weight of each row of pixel points in the feature map to which each row of pixel points belongs, a horizontal attention feature vector corresponding to each feature map specifically includes the following steps:
step 301, for each feature map, performing global average pooling operation on the feature map to obtain a first feature vector corresponding to the feature map.
The goal of the global average pooling (GAP) operation is to convert a multi-dimensional feature map into a one-dimensional feature vector.
Those skilled in the art will appreciate that the feature map may also be converted into a one-dimensional feature vector through a fully connected (FC) layer.
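For instance (a sketch, not the patent's code), global average pooling over the spatial dimensions of a 2048-channel feature map yields a 2048-dimensional vector:

```python
import torch

feature_map = torch.randn(2048, 8, 4)        # one 2048-channel feature map (C, H, W)
first_vector = feature_map.mean(dim=(1, 2))  # global average pooling -> shape (2048,)
```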
Step 302, performing compression operation and decoding operation on the first feature vector to obtain a second feature vector.
The horizontal graph convolution attention submodule can perform the compression operation on the first feature vector through one graph convolution layer and perform the decoding operation on the result of the compression through another graph convolution layer; the compression and decoding operations re-estimate the weight of each channel element in the first feature vector, further refining the effective features.
Step 303, respectively performing a summation operation on each row of pixel points in the feature map to obtain a third feature vector corresponding to each row of pixel points in the feature map.
And 304, performing logistic regression operation on each third feature vector to obtain an attention value corresponding to each third feature vector, taking the attention value corresponding to each third feature vector as the weight of each third feature vector, and performing weighted summation operation on the third feature vectors corresponding to each row of pixel points in the feature map to obtain a fourth feature vector.
The attention value corresponding to the third feature vector is determined by the horizontal graph convolution attention submodule through a softmax logistic regression operation and represents the degree of importance of that third feature vector within the feature map.
Because the horizontal graph convolution attention submodule introduces a weight for each third feature vector when calculating the fourth feature vector, in the fourth feature vector the important features are amplified and the features of the interference factors are suppressed.
In the embodiment of the present disclosure, steps 301 to 302 and steps 303 to 304 may be executed in parallel, and the embodiment of the present disclosure is not limited thereto.
Step 305, taking the product of the second feature vector and the fourth feature vector as the horizontal attention feature vector corresponding to the feature map.
It can be understood that the horizontal graph convolution attention submodule obtains the horizontal attention feature vector corresponding to each feature map by performing the operations of steps 301 to 305 on the feature map of each video frame in the video frame sequence.
In the embodiments of the present disclosure, when the horizontal graph convolution attention submodule calculates the second feature vector, important features in the feature map are amplified through the compression and decoding operations; and when it calculates the fourth feature vector, the weight of each row of pixel points is introduced and the important features in the feature map are amplified. Therefore, in the horizontal attention feature vector finally determined by the horizontal graph convolution attention submodule, the important features are amplified and the features of the interference factors are suppressed.
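A minimal sketch of steps 301 to 305 for a single unbatched feature map follows. Several details are assumptions: the patent's two graph convolution layers are approximated here by two linear layers with a bottleneck, and the scalar attention score per row (here the channel-wise sum of each third feature vector) is one plausible reading of the softmax logistic regression step, which the patent does not spell out:

```python
import torch
import torch.nn as nn

class HorizontalAttention(nn.Module):
    """Sketch of steps 301-305; the graph convolution layers are replaced by
    linear layers here, which is an assumption made for simplicity."""

    def __init__(self, channels=2048, reduction=16):
        super().__init__()
        self.compress = nn.Linear(channels, channels // reduction)  # step 302: compression
        self.decode = nn.Linear(channels // reduction, channels)    # step 302: decoding

    def forward(self, fmap):                  # fmap: (C, H, W)
        # Step 301: global average pooling -> first feature vector (C,)
        first = fmap.mean(dim=(1, 2))
        # Step 302: compression + decoding -> second feature vector (C,)
        second = self.decode(torch.relu(self.compress(first)))
        # Step 303: sum each row of pixel points -> one C-dim third vector per row
        third = fmap.sum(dim=2)               # (C, H); column h is the vector of row h
        # Step 304: softmax over rows yields one attention value per row (the scalar
        # score per row is assumed to be the channel-wise sum), then a weighted sum
        # of the row vectors gives the fourth feature vector (C,)
        attention = torch.softmax(third.sum(dim=0), dim=0)          # (H,)
        fourth = (third * attention.unsqueeze(0)).sum(dim=1)        # (C,)
        # Step 305: element-wise product of the second and fourth feature vectors
        return second * fourth
```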
In an implementation manner of the embodiment of the present disclosure, following the principle of steps 301 to 305, as shown in fig. 4, for step 203 of determining, according to the pixel point distribution of each feature map, the weight of each column of pixel points in the feature map to which it belongs, and determining the vertical attention feature vector corresponding to each feature map based on the weights of the columns of pixel points, the electronic device may specifically execute the following steps:
step 401, for each feature map, performing global average pooling operation on the feature map to obtain a fifth feature vector corresponding to the feature map.
Step 402, performing a compression operation and a decoding operation on the fifth feature vector to obtain a sixth feature vector.
The vertical graph convolution attention submodule can perform the compression operation on the fifth feature vector through one graph convolution layer and perform the decoding operation on the result of the compression through another graph convolution layer; the compression and decoding operations re-estimate the weight of each channel element in the fifth feature vector, further refining the effective features.
Step 403, respectively performing a summation operation on each column of pixel points in the feature map to obtain a seventh feature vector corresponding to each column of pixel points in the feature map.
Step 404, performing a logistic regression operation on each seventh feature vector to obtain an attention value corresponding to each seventh feature vector, taking the attention value corresponding to each seventh feature vector as the weight of that seventh feature vector, and performing a weighted summation operation on the seventh feature vectors corresponding to the columns of pixel points in the feature map to obtain an eighth feature vector.
The attention value corresponding to the seventh feature vector is determined by the vertical graph convolution attention submodule through a softmax logistic regression operation and represents the degree of importance of that seventh feature vector within the feature map.
Because the vertical graph convolution attention submodule introduces a weight for each seventh feature vector when calculating the eighth feature vector, in the eighth feature vector the important features are amplified and the features of the interference factors are suppressed.
In the embodiment of the present disclosure, steps 401 to 402 and steps 403 to 404 may be executed in parallel, which is not limited in the embodiment of the present disclosure.
Step 405, taking the product of the sixth feature vector and the eighth feature vector as the vertical attention feature vector corresponding to the feature map.
It can be understood that the vertical graph convolution attention submodule obtains the vertical attention feature vector corresponding to each feature map by performing the operations of steps 401 to 405 on the feature map of each video frame in the video frame sequence.
In the embodiments of the present disclosure, when the vertical graph convolution attention submodule calculates the sixth feature vector, important features in the feature map are amplified through the compression and decoding operations; and when it calculates the eighth feature vector, the weight of each column of pixel points is introduced and the important features in the feature map are amplified. Therefore, in the vertical attention feature vector finally determined by the vertical graph convolution attention submodule, the important features are amplified and the features of the interference factors are suppressed.
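The vertical submodule thus mirrors the horizontal sketch given after step 305 with rows and columns exchanged; one hypothetical way to reuse that sketch is to transpose the spatial axes of the feature map before applying it:

```python
# Reuse the HorizontalAttention sketch for the vertical direction by swapping the
# spatial axes of the (C, H, W) feature map, so its "rows" become the original columns.
vertical_attention = HorizontalAttention()
vertical_vector = vertical_attention(feature_map.transpose(1, 2))  # fmap becomes (C, W, H)
```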
As described with reference to fig. 3 and fig. 4, important features are amplified and features of the interference factors are suppressed in both the horizontal attention feature vector and the vertical attention feature vector corresponding to the feature map. Therefore, after the bidirectional graph convolution attention module splices the horizontal attention feature vector and the vertical attention feature vector corresponding to the feature map, the important features are likewise amplified, and the features of the interference factors are likewise suppressed, in the resulting attention feature vector corresponding to the feature map.
As shown in fig. 5, fig. 5 is a schematic diagram illustrating an execution flow of a bidirectional graph convolution attention module according to an embodiment of the present disclosure, where the bidirectional graph convolution attention module may be configured to execute the above step 202 to step 204.
The bidirectional graph convolution attention module comprises a horizontal graph convolution attention submodule and a vertical graph convolution attention submodule; the horizontal graph convolution attention submodule comprises a channel attention branch and a spatial attention branch, and the vertical graph convolution attention submodule likewise comprises a channel attention branch and a spatial attention branch.
The horizontal graph convolution attention submodule is configured to perform steps 301 to 305. Specifically, in the horizontal graph convolution attention submodule, the channel attention branch is configured to perform steps 301 to 302: it performs global average pooling on the input feature map to obtain the first feature vector corresponding to the feature map, and then performs the compression operation and the decoding operation on the first feature vector to obtain the second feature vector.
The spatial attention branch in the horizontal graph convolution attention submodule is configured to perform steps 303 to 304: it performs a summation operation on each row of pixel points in the input feature map to obtain the third feature vector corresponding to each row of pixel points, then performs a logistic regression operation on each third feature vector to obtain the attention value corresponding to each third feature vector, and, taking the attention value corresponding to each third feature vector as its weight, performs a weighted summation operation on the third feature vectors corresponding to the rows of pixel points to obtain the fourth feature vector.
The vertical graph convolution attention submodule is configured to perform steps 401 to 405. Specifically, in the vertical graph convolution attention submodule, the channel attention branch is configured to perform steps 401 to 402: it performs global average pooling on the input feature map to obtain the fifth feature vector corresponding to the feature map, and then performs the compression operation and the decoding operation on the fifth feature vector to obtain the sixth feature vector.
The spatial attention branch in the vertical graph convolution attention submodule is configured to perform steps 403 to 404: it performs a summation operation on each column of pixel points in the input feature map to obtain the seventh feature vector corresponding to each column of pixel points, then performs a logistic regression operation on each seventh feature vector to obtain the attention value corresponding to each seventh feature vector, and, taking the attention value corresponding to each seventh feature vector as its weight, performs a weighted summation operation on the seventh feature vectors corresponding to the columns of pixel points to obtain the eighth feature vector.
Optionally, as shown in fig. 6, as for step 205, performing vector aggregation operation on the attention feature vectors corresponding to the feature maps of all video frames in the video including the target to obtain a representative feature vector corresponding to the video including the target, specifically including the following steps:
step 601, calculating a vector distance between the attention feature vector and the attention feature vector corresponding to each feature map according to the attention feature vector corresponding to each feature map.
Wherein, the attention feature vector corresponding to each feature map is the attention feature vector obtained by the bidirectional convolution attention module through the step 204. Optionally, the vector distance may be a cosine distance or an euclidean distance, and further, step 601 may be specifically executed as:
and calculating the cosine distance or Euclidean distance between the attention feature vector and the cosine distance or Euclidean distance between the attention feature vector and the attention feature vectors except the attention feature vector aiming at the attention feature vector corresponding to each feature map.
Specifically, this step may be executed by the affinity sequence fusion module, and the vector distances calculated by the affinity sequence fusion module may form an affinity matrix. As shown in fig. 7, which is a schematic diagram of an affinity matrix provided in an embodiment of the present disclosure, each row of elements in the affinity matrix represents the vector distances between one attention feature vector and every attention feature vector.
For example, suppose the video containing the target includes video frame A, video frame B, video frame C, and video frame D, and the vector distance is the cosine distance. As shown in fig. 7, the first element of the first row of the affinity matrix indicates that the cosine distance between the attention feature vector corresponding to video frame A and itself is 1.0, and the second element of the first row indicates that the cosine distance between the attention feature vector corresponding to video frame A and the attention feature vector corresponding to video frame B is 0.95; by analogy, each row of elements in the affinity matrix represents the cosine distances between the attention feature vector corresponding to one video frame and the attention feature vectors corresponding to every video frame.
It should be noted that the cosine distance is a value greater than or equal to-1 and less than or equal to 1, and when the cosine distance between two vectors is 1, it indicates that the degree of correlation between the two vectors is the highest, that is, the vector distance between the two vectors is the shortest.
In the embodiments of the present disclosure, the cosine distance or Euclidean distance calculated by the affinity sequence fusion module represents the degree of correlation between two attention feature vectors. Therefore, the vector distances corresponding to one attention feature vector represent the degree of correlation between the video frame corresponding to that attention feature vector and each video frame in the video containing the target.
Step 602, calculating the average value of the vector distances corresponding to the attention feature vector, and taking the calculated average value as the weight of that attention feature vector.
As shown in fig. 7, after the affinity sequence fusion module calculates the affinity matrix, it may calculate an average value of each row in the affinity matrix, and use the calculated average value as the weight of the attention feature vector.
Because a row of elements in the affinity matrix represents the vector distances between one attention feature vector and every attention feature vector, the average value of a row in the affinity matrix is the average of the vector distances corresponding to that attention feature vector.
Because the vector distance represents the degree of correlation between video frames, the average of the vector distances corresponding to one attention feature vector represents the degree of correlation between the video frame corresponding to that attention feature vector and the video containing the target.
Step 603, performing weighted average operation on each attention feature vector based on the weight of each attention feature vector to obtain a representative feature vector corresponding to the video including the target.
In the embodiments of the disclosure, when the affinity sequence fusion module calculates the representative feature vector corresponding to the video containing the target, a weight corresponding to each video frame is introduced; therefore, when the affinity sequence fusion module performs the weighted average operation on the attention feature vectors, the weights of important video frames are increased and the weights of unimportant video frames are reduced, so that the important features are more prominent in the representative feature vector corresponding to the video containing the target.
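A minimal sketch of steps 601 to 603, assuming the cosine distance and a single unbatched stack of attention feature vectors (the function name and weight normalization are illustrative choices, not taken from the patent):

```python
import torch
import torch.nn.functional as F

def affinity_fusion(attention_vectors):
    """attention_vectors: (N, D), one attention feature vector per video frame.

    Steps 601-603: build the affinity matrix of pairwise cosine distances, use each
    row's mean as the frame weight, and return the weighted average of the attention
    feature vectors as the representative feature vector of the video.
    """
    normed = F.normalize(attention_vectors, dim=1)
    affinity = normed @ normed.t()                 # step 601: (N, N) affinity matrix
    weights = affinity.mean(dim=1)                 # step 602: row means as frame weights
    weights = weights / weights.sum()              # normalize so the weights average the vectors
    return (weights.unsqueeze(1) * attention_vectors).sum(dim=0)   # step 603
```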
As shown in fig. 8, fig. 8 is a schematic diagram illustrating an execution flow of an affinity sequence fusion module according to an embodiment of the present disclosure, where the affinity sequence fusion module is configured to execute the steps 601 to 603.
Specifically, after the affinity sequence fusion module obtains the attention feature vector corresponding to each video frame output by the bidirectional graph convolution attention module, the step 601 may be executed to obtain the affinity matrix by calculation, then the step 602 is executed to calculate the average value of each row element of the affinity matrix, and the calculated average value is used as the weight of the attention feature vector, and then the step 603 is executed to perform weighted average operation on each attention feature vector, and output the representative feature vector.
After the affinity sequence fusion module performs weighted average operation on the attention feature vector corresponding to each video frame, the representative feature vector output by the affinity sequence fusion module can be used as the representative feature vector corresponding to the video containing the target.
In practical application, after the electronic device receives a query request, it may acquire a video containing the target, input that video into the neural network shown in fig. 1, and obtain the representative feature vector corresponding to the video containing the target output by the neural network. The electronic device may then calculate the vector distance (cosine distance or Euclidean distance) between the representative feature vector corresponding to the video containing the target and the representative feature vector of each video in the video search library, and determine, as the query result, the videos in the video search library whose representative feature vectors have a vector distance smaller than a preset threshold.
If the query request received by the electronic device is for querying the videos in the video search library that contain the target pedestrian, the query result determined by the electronic device is the set of videos in the video search library that contain the target pedestrian.
Optionally, the electronic device may also train and test the neural network shown in fig. 1 based on a preset data set.
The embodiment of the present disclosure provides an example of a preset data set, which is specifically shown in the following table:
Table 1 (the table is provided as an image in the original filing and lists the ILIDS-VID, PRID2011, and MARS data sets described below)
Here, ILIDS-VID and PRID2011 are video data sets whose videos were legally captured by fixed cameras.
MARS is an image data set whose images are legally obtained single video frames.
The electronic device can randomly divide the data sets shown in Table 1 into two equal parts, one part used as a training set and the other as a test set.
During the training process, the electronic device may determine a label for each video in the training set, where the label is used to label whether each video in the training set includes the target pedestrian.
After inputting a video containing a target from the training set into the neural network shown in fig. 1, the electronic device inputs the representative feature vector output by the neural network for that video into a preset fully connected layer (whose dimension equals the number of videos containing targets in the training set), and obtains the target probability output by the fully connected layer.
Wherein the target probability represents a probability that a target pedestrian is included in the video containing the target.
After obtaining the target probability corresponding to the video containing the target, the electronic device may calculate a cross entropy loss function value based on the target probability output by the fully connected layer and the label of the video containing the target, and adjust the parameters of the neural network according to the back propagation algorithm and the cross entropy loss function value, so that the neural network outputs the representative feature vector of the video containing the target more accurately.
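A sketch of one such training step follows; the representative-vector dimension, the number of training videos, the optimizer, and the variable names are assumptions made for illustration (in practice the parameters of the backbone network in fig. 1 would be updated together with the classification head):

```python
import torch
import torch.nn as nn

num_videos_with_targets = 300                            # illustrative value
classifier = nn.Linear(4096, num_videos_with_targets)    # 4096 = assumed representative-vector size
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)

def train_step(representative_vector, label):
    """representative_vector: (4096,) output of the network; label: index of the video's target."""
    logits = classifier(representative_vector.unsqueeze(0))   # (1, num_videos_with_targets)
    loss = criterion(logits, torch.tensor([label]))           # cross entropy loss value
    optimizer.zero_grad()
    loss.backward()                                           # back propagation
    optimizer.step()                                          # adjust parameters
    return loss.item()
```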
In the testing process, the electronic device may take, from the test set, a segment of video of a target pedestrian captured by one camera as the video containing the target, and take the remaining videos in the test set as the video search library.
After the electronic device determines the video containing the target and the video search library, the electronic device may input the video containing the target into the neural network shown in fig. 1, and obtain the representative feature vector of the video containing the target output by the neural network.
If the electronic device can accurately determine, according to the representative feature vector of the video containing the target, the videos in the video search library that contain the target pedestrian, the training of the neural network is finished.
If the electronic device cannot accurately determine, according to the representative feature vector of the video containing the target, the videos in the video search library that contain the target pedestrian, the neural network needs to be trained further.
The training process is described above using videos containing pedestrians as an example; videos containing other types of targets are handled in the same way.
In the embodiment of the disclosure, the electronic device can make the neural network more accurately output the representative feature vector corresponding to the video containing the target through the training and testing processes.
Based on the same technical concept, an embodiment of the present disclosure further provides a video processing apparatus, as shown in fig. 9, the apparatus including an extraction module 901, a determining module 902, a feature splicing module 903, and a vector aggregation module 904, wherein:
an extracting module 901, configured to extract a feature map corresponding to each video frame in a video including a target;
a determining module 902, configured to determine, according to the pixel point distribution of each feature map, a weight of each row of pixel points in each feature map in the feature map to which the row belongs, and determine, based on the weight of each row of pixel points in the feature map to which the row belongs, a horizontal attention feature vector corresponding to each feature map;
the determining module 902 is further configured to determine, according to the distribution of the pixel points of each feature map, a weight of each column of pixel points in each feature map in the feature map to which the pixel points belong, and determine, based on the weight of each column of pixel points in the feature map to which the pixel points belong, a vertical attention feature vector corresponding to each feature map;
a feature splicing module 903, configured to perform feature splicing on the horizontal attention feature vector and the vertical attention feature vector corresponding to each feature map to obtain an attention feature vector corresponding to each feature map;
and a vector aggregation module 904, configured to perform vector aggregation operation on attention feature vectors corresponding to feature maps of all video frames in the video including the target, to obtain a representative feature vector corresponding to the video including the target.
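As a shape-level illustration of how the four modules fit together, the following sketch traces one video through the pipeline. The frame count, channel count and map size are assumptions, and the determining module and the vector aggregation module are replaced by random stand-ins here; their actual steps are sketched in more detail below.

    import torch

    T, C, H, W = 8, 64, 16, 8
    feature_maps = torch.randn(T, C, H, W)      # extraction module 901: one feature map per frame

    horizontal = torch.randn(T, C)              # stand-in for determining module 902 (horizontal)
    vertical = torch.randn(T, C)                # stand-in for determining module 902 (vertical)

    attention_vectors = torch.cat([horizontal, vertical], dim=1)   # feature splicing module 903: (T, 2C)
    representative = attention_vectors.mean(dim=0)                 # stand-in for vector aggregation module 904
    print(representative.shape)                                    # torch.Size([128])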
Optionally, the determining module 902 is specifically configured to:
performing global average pooling operation on each feature map to obtain a first feature vector corresponding to the feature map;
performing compression operation and decoding operation on the first feature vector to obtain a second feature vector;
respectively carrying out summation operation on each row of pixel points in the feature map to obtain a third feature vector corresponding to each row of pixel points in the feature map;
performing logistic regression operation on each third feature vector to obtain an attention value corresponding to each third feature vector, taking the attention value corresponding to each third feature vector as the weight of each third feature vector, and performing weighted summation operation on the third feature vectors corresponding to each row of pixel points in the feature map to obtain a fourth feature vector;
and taking the product of the second feature vector and the fourth feature vector as the horizontal attention feature vector corresponding to the feature map.
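A minimal sketch of the five steps above is given below, for one feature map of assumed size (C, H, W). Interpreting the compression and decoding operations as two linear layers, and the logistic regression operation as a learned linear score followed by a sigmoid, are assumptions of this sketch; the disclosure does not fix these forms here.

    import torch
    import torch.nn as nn

    C, H, W, r = 64, 16, 8, 4                    # assumed feature-map size and compression ratio
    fmap = torch.randn(C, H, W)                  # one feature map

    squeeze = nn.Linear(C, C // r)               # assumed form of the compression operation
    excite = nn.Linear(C // r, C)                # assumed form of the decoding operation
    score = nn.Linear(C, 1)                      # assumed form of the logistic regression operation

    first = fmap.mean(dim=(1, 2))                # global average pooling -> first feature vector, (C,)
    second = excite(torch.relu(squeeze(first)))  # second feature vector, (C,)

    third = fmap.sum(dim=2).t()                  # per-row sums -> one third feature vector per row, (H, C)
    attn = torch.sigmoid(score(third))           # attention value of each third feature vector, (H, 1)
    fourth = (attn * third).sum(dim=0)           # weighted sum of the third feature vectors, (C,)

    horizontal = second * fourth                 # horizontal attention feature vector, (C,)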
Optionally, the determining module 902 is further specifically configured to:
performing global average pooling operation on each feature map to obtain a fifth feature vector corresponding to the feature map;
performing compression operation and decoding operation on the fifth feature vector to obtain a sixth feature vector;
summing each column of pixel points in the feature map respectively to obtain a seventh feature vector corresponding to each column of pixel points in the feature map;
performing logistic regression operation on each seventh feature vector to obtain an attention value corresponding to each seventh feature vector, taking the attention value corresponding to each seventh feature vector as the weight of each seventh feature vector, and performing weighted summation operation on the seventh feature vectors corresponding to each column of pixel points in the feature map to obtain an eighth feature vector;
and taking the product of the sixth feature vector and the eighth feature vector as the vertical attention feature vector corresponding to the feature map.
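Continuing the sketch above (and reusing its fmap, squeeze, excite and score purely as a simplification; whether these are shared between the horizontal and vertical branches is not stated here), the vertical case mirrors the horizontal one with per-column sums:

    fifth = fmap.mean(dim=(1, 2))                   # global average pooling -> fifth feature vector, (C,)
    sixth = excite(torch.relu(squeeze(fifth)))      # sixth feature vector, (C,)

    seventh = fmap.sum(dim=1).t()                   # per-column sums -> one seventh feature vector per column, (W, C)
    attn_col = torch.sigmoid(score(seventh))        # attention value of each seventh feature vector, (W, 1)
    eighth = (attn_col * seventh).sum(dim=0)        # weighted sum of the seventh feature vectors, (C,)

    vertical = sixth * eighth                       # vertical attention feature vector, (C,)
    attention_vector = torch.cat([horizontal, vertical])   # feature splicing -> attention feature vector, (2C,)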
Optionally, the vector aggregation module 904 is specifically configured to:
calculating, for the attention feature vector corresponding to each feature map, a vector distance between the attention feature vector and the attention feature vector corresponding to each of the other feature maps;
calculating the average value of the vector distances corresponding to the attention feature vector, and taking the calculated average value as the weight of the attention feature vector;
and performing weighted average operation on each attention feature vector based on the weight of each attention feature vector to obtain a representative feature vector corresponding to the video containing the target.
Optionally, the vector aggregation module 904 is further specifically configured to:
calculating, for the attention feature vector corresponding to each feature map, the cosine distance or the Euclidean distance between the attention feature vector and each attention feature vector other than the attention feature vector itself.
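By way of illustration only, the aggregation steps above can be sketched as follows, following the wording literally: each attention feature vector is weighted by the average of its cosine distances to the other attention feature vectors, and the weighted average of the vectors is taken as the representative feature vector (the Euclidean distance could be substituted in the same way). The frame count and vector dimension are assumptions of this sketch.

    import torch
    import torch.nn.functional as F

    T, D = 8, 128                                  # assumed: 8 frames, attention vectors of dimension 2C = 128
    vectors = torch.randn(T, D)                    # attention feature vectors of all video frames

    normed = F.normalize(vectors, dim=1)
    cos_dist = 1.0 - normed @ normed.t()           # pairwise cosine distances, (T, T); the diagonal is 0
    weights = cos_dist.sum(dim=1) / (T - 1)        # average distance of each vector to the other vectors
    weights = weights / weights.sum()              # normalize so that a weighted average can be taken
    representative = (weights.unsqueeze(1) * vectors).sum(dim=0)   # representative feature vector, (D,)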
By adopting the video processing apparatus provided by the embodiment of the disclosure, the weight of each row (and each column) of pixel points in the feature map to which it belongs can be determined, so that important features can be given a larger weight and features of interference factors a smaller weight; the horizontal attention feature vector and the vertical attention feature vector determined from these weights therefore amplify the important features and suppress the interfering ones. Feature splicing is then performed on the horizontal attention feature vector and the vertical attention feature vector to obtain the attention feature vector corresponding to each feature map, and a vector aggregation operation is performed on the attention feature vectors corresponding to the feature maps to obtain the representative feature vector corresponding to the video containing the target. This representative feature vector accurately represents the important features (such as the features of the target pedestrian) in the video frames containing the target, so the electronic device can accurately determine, through the representative feature vector, the video containing the target in a video query library.
Those skilled in the art will appreciate that the modules described in the above embodiments are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of a processor executing corresponding functional software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
For example, a processor may be a general-purpose logical operation device having data processing capabilities and/or program execution capabilities, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Microprocessor (MCU), or the like, that execute computer instructions of corresponding functions to implement the corresponding functions. The computer instructions comprise one or more processor operations defined by an instruction set architecture corresponding to the processor, which may be logically embodied and represented by one or more computer programs.
For example, a processor may be a hardware entity with programmable functions, such as a field programmable gate array (FPGA) or an Application Specific Integrated Circuit (ASIC), that performs the corresponding functions.
For example, the processor may be a hardware circuit specifically designed to perform the corresponding function, such as a tensor processing unit (TPU) or a neural network processing unit (NPU), or the like.
An embodiment of the present disclosure also provides an electronic device, including:
a memory for non-transitory storage of a computer program; and
a processor for executing the computer program, which, when executed by the processor, performs the method of any of the preceding embodiments.
Referring to fig. 10, a specific implementation of the electronic device is provided. The electronic device comprises a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, wherein the processor 1001, the communication interface 1002 and the memory 1003 communicate with each other through the communication bus 1004;
a memory 1003 for storing a computer program;
the processor 1001 is configured to implement the following steps when executing the program stored in the memory 1003:
extracting a feature map corresponding to each video frame in a video containing a target;
determining, according to the pixel point distribution of each feature map, the weight of each row of pixel points in the feature map to which the row belongs, and determining a horizontal attention feature vector corresponding to each feature map based on the weight of each row of pixel points in the feature map to which the row belongs;
determining, according to the pixel point distribution of each feature map, the weight of each column of pixel points in the feature map to which the column belongs, and determining a vertical attention feature vector corresponding to each feature map based on the weight of each column of pixel points in the feature map to which the column belongs;
performing feature splicing on the horizontal attention feature vector and the vertical attention feature vector corresponding to each feature map to obtain an attention feature vector corresponding to each feature map;
and carrying out vector aggregation operation on attention feature vectors corresponding to feature maps of all video frames in the video containing the target to obtain a representative feature vector corresponding to the video containing the target.
It should be noted that, when the processor 1001 is configured to execute the program stored in the memory 1003, it is also configured to implement other steps described in the foregoing method embodiment, and reference may be made to relevant descriptions in the foregoing method embodiment, which is not described herein again.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
In the embodiments of the present disclosure, the memory, the storage medium, and the like may be a local physical storage device, and may also be a virtual storage device connected by remote communication, such as a VPS, cloud storage, and the like.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Based on the same technical concept, the embodiment of the present disclosure further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the video processing method.
Based on the same technical concept, embodiments of the present disclosure also provide a computer program product containing instructions, which when run on a computer, causes the computer to perform the above-mentioned video processing method steps.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the disclosure are, in whole or in part, generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure are included in the scope of protection of the present disclosure.

Claims (10)

1. A method of video processing, the method comprising:
extracting a feature map corresponding to each video frame of a video containing a target;
determining, according to the pixel point distribution of each feature map, the weight of each row of pixel points in the feature map to which the row belongs, and determining a horizontal attention feature vector corresponding to each feature map based on the weight of each row of pixel points in the feature map to which the row belongs;
determining, according to the pixel point distribution of each feature map, the weight of each column of pixel points in the feature map to which the column belongs, and determining a vertical attention feature vector corresponding to each feature map based on the weight of each column of pixel points in the feature map to which the column belongs;
performing feature splicing on the horizontal attention feature vector and the vertical attention feature vector corresponding to each feature map to obtain an attention feature vector corresponding to each feature map;
and carrying out vector aggregation operation on attention feature vectors corresponding to feature maps of all video frames in the video containing the target to obtain a representative feature vector corresponding to the video containing the target.
2. The method of claim 1, wherein determining the weight of each row of pixels in each feature map in the corresponding feature map according to the pixel point distribution of each feature map, and determining the horizontal attention feature vector corresponding to each feature map based on the weight of each row of pixels in the corresponding feature map comprises:
performing global average pooling operation on each feature map to obtain a first feature vector corresponding to the feature map;
performing compression operation and decoding operation on the first feature vector to obtain a second feature vector;
respectively carrying out summation operation on each row of pixel points in the feature map to obtain a third feature vector corresponding to each row of pixel points in the feature map;
performing logistic regression operation on each third feature vector to obtain an attention value corresponding to each third feature vector, taking the attention value corresponding to each third feature vector as the weight of each third feature vector, and performing weighted summation operation on the third feature vectors corresponding to each row of pixel points in the feature map to obtain a fourth feature vector;
and taking the product of the second feature vector and the fourth feature vector as a horizontal attention feature vector corresponding to the feature map.
3. The method of claim 1, wherein determining the weight of each column of pixels in each feature map in the corresponding feature map according to the pixel point distribution of each feature map, and determining the vertical attention feature vector corresponding to each feature map based on the weight of each column of pixels in the corresponding feature map comprises:
performing global average pooling operation on each feature map to obtain a fifth feature vector corresponding to the feature map;
performing compression operation and decoding operation on the fifth feature vector to obtain a sixth feature vector;
summing each column of pixel points in the feature map respectively to obtain a seventh feature vector corresponding to each column of pixel points in the feature map;
performing logistic regression operation on each seventh feature vector to obtain an attention value corresponding to each seventh feature vector, taking the attention value corresponding to each seventh feature vector as the weight of each seventh feature vector, and performing weighted summation operation on the seventh feature vectors corresponding to each column of pixel points in the feature map to obtain an eighth feature vector;
and taking the product of the sixth feature vector and the eighth feature vector as the vertical attention feature vector corresponding to the feature map.
4. The method according to any one of claims 1 to 3, wherein the performing a vector aggregation operation on the attention feature vectors corresponding to the feature maps of all video frames in the video including the target to obtain a representative feature vector corresponding to the video including the target comprises:
calculating, for the attention feature vector corresponding to each feature map, a vector distance between the attention feature vector and the attention feature vector corresponding to each of the other feature maps;
calculating an average value of the vector distances corresponding to the attention feature vector, and taking the calculated average value as the weight of the attention feature vector;
and performing weighted average operation on each attention feature vector based on the weight of each attention feature vector to obtain a representative feature vector corresponding to the video containing the target.
5. A video processing apparatus, characterized in that the apparatus comprises:
the extraction module is used for extracting a feature map corresponding to each video frame of the video containing the target;
the determining module is used for determining the weight of each line of pixel points in each feature map in the feature map to which the pixel points belong according to the pixel point distribution of each feature map, and determining the horizontal attention feature vector corresponding to each feature map based on the weight of each line of pixel points in the feature map to which the pixel points belong;
the determining module is further configured to determine, according to the distribution of the pixel points of each feature map, a weight of each column of pixel points in each feature map in the feature map to which the pixel points belong, and determine, based on the weight of each column of pixel points in the feature map to which the pixel points belong, a vertical attention feature vector corresponding to each feature map;
the feature splicing module is used for performing feature splicing on the horizontal attention feature vector and the vertical attention feature vector corresponding to each feature map to obtain an attention feature vector corresponding to each feature map;
and the vector aggregation module is used for carrying out vector aggregation operation on attention feature vectors corresponding to feature maps of all video frames in the video containing the target to obtain a representative feature vector corresponding to the video containing the target.
6. The apparatus of claim 5, wherein the determining module is specifically configured to:
performing global average pooling operation on each feature map to obtain a first feature vector corresponding to the feature map;
performing compression operation and decoding operation on the first feature vector to obtain a second feature vector;
respectively carrying out summation operation on each row of pixel points in the feature map to obtain a third feature vector corresponding to each row of pixel points in the feature map;
performing logistic regression operation on each third feature vector to obtain an attention value corresponding to each third feature vector, taking the attention value corresponding to each third feature vector as the weight of each third feature vector, and performing weighted summation operation on the third feature vectors corresponding to each row of pixel points in the feature map to obtain a fourth feature vector;
and taking the product of the second feature vector and the fourth feature vector as a horizontal attention feature vector corresponding to the feature map.
7. The apparatus according to claim 5, wherein the determining module is further configured to:
performing global average pooling operation on each feature map to obtain a fifth feature vector corresponding to the feature map;
performing compression operation and decoding operation on the fifth feature vector to obtain a sixth feature vector;
summing each column of pixel points in the feature map respectively to obtain a seventh feature vector corresponding to each column of pixel points in the feature map;
performing logistic regression operation on each seventh feature vector to obtain an attention value corresponding to each seventh feature vector, taking the attention value corresponding to each seventh feature vector as the weight of each seventh feature vector, and performing weighted summation operation on the seventh feature vectors corresponding to each column of pixel points in the feature map to obtain an eighth feature vector;
and taking the product of the sixth feature vector and the eighth feature vector as the vertical attention feature vector corresponding to the feature map.
8. The apparatus according to any of claims 5-7, wherein the vector aggregation module is specifically configured to:
calculating, for the attention feature vector corresponding to each feature map, a vector distance between the attention feature vector and the attention feature vector corresponding to each of the other feature maps;
calculating an average value of the vector distances corresponding to the attention feature vector, and taking the calculated average value as the weight of the attention feature vector;
and performing weighted average operation on each attention feature vector based on the weight of each attention feature vector to obtain a representative feature vector corresponding to the video containing the target.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 4 when executing the computer program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 4.
CN201911395074.9A 2019-12-30 2019-12-30 Video processing method and device, electronic equipment and medium Pending CN111046847A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911395074.9A CN111046847A (en) 2019-12-30 2019-12-30 Video processing method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911395074.9A CN111046847A (en) 2019-12-30 2019-12-30 Video processing method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN111046847A true CN111046847A (en) 2020-04-21

Family

ID=70241891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911395074.9A Pending CN111046847A (en) 2019-12-30 2019-12-30 Video processing method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN111046847A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040088726A1 (en) * 2002-11-01 2004-05-06 Yu-Fei Ma Systems and methods for generating a comprehensive user attention model
CN109635721A (en) * 2018-12-10 2019-04-16 山东大学 Video human fall detection method and system based on track weighting depth convolution sequence poolization description
CN109948658A (en) * 2019-02-25 2019-06-28 浙江工业大学 The confrontation attack defense method of Feature Oriented figure attention mechanism and application
CN110070073A (en) * 2019-05-07 2019-07-30 国家广播电视总局广播电视科学研究院 Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism
CN110298413A (en) * 2019-07-08 2019-10-01 北京字节跳动网络技术有限公司 Image characteristic extracting method, device, storage medium and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JUN FU ET AL.: "Dual Attention Network for Scene Segmentation" *
KAN WANG ET AL.: "CDPM: Convolutional Deformable Part Models for Semantically Aligned Person Re-identification" *
刘天亮; 谯庆伟; 万俊伟; 戴修斌; 罗杰波: "Human action recognition fusing spatial-temporal dual network streams and visual attention" *
谢林江; 季桂树; 彭清; 罗恩韬: "Application of improved convolutional neural networks in pedestrian detection" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111815606A (en) * 2020-07-09 2020-10-23 浙江大华技术股份有限公司 Image quality evaluation method, storage medium, and computing device
CN111815606B (en) * 2020-07-09 2023-09-01 浙江大华技术股份有限公司 Image quality evaluation method, storage medium, and computing device
CN112330682A (en) * 2020-11-09 2021-02-05 重庆邮电大学 Industrial CT image segmentation method based on deep convolutional neural network
CN113095370A (en) * 2021-03-18 2021-07-09 北京达佳互联信息技术有限公司 Image recognition method and device, electronic equipment and storage medium
CN113095370B (en) * 2021-03-18 2023-11-03 北京达佳互联信息技术有限公司 Image recognition method, device, electronic equipment and storage medium
WO2023226783A1 (en) * 2022-05-24 2023-11-30 华为技术有限公司 Data processing method and apparatus
CN115131409A (en) * 2022-08-26 2022-09-30 深圳深知未来智能有限公司 Intimacy matrix viewpoint synthesis method, application and system based on deep learning
CN115131409B (en) * 2022-08-26 2023-01-24 深圳深知未来智能有限公司 Intimacy matrix viewpoint synthesis method, application and system based on deep learning

Similar Documents

Publication Publication Date Title
CN111046847A (en) Video processing method and device, electronic equipment and medium
Ma et al. Salient object detection via multiple instance joint re-learning
CN110910422A (en) Target tracking method and device, electronic equipment and readable storage medium
CN107633023B (en) Image duplicate removal method and device
CN108875487B (en) Training of pedestrian re-recognition network and pedestrian re-recognition based on training
CN110728294A (en) Cross-domain image classification model construction method and device based on transfer learning
CN113095346A (en) Data labeling method and data labeling device
CN110555428B (en) Pedestrian re-identification method, device, server and storage medium
CN110162657B (en) Image retrieval method and system based on high-level semantic features and color features
CN110502659B (en) Image feature extraction and network training method, device and equipment
CN112200041A (en) Video motion recognition method and device, storage medium and electronic equipment
WO2021030899A1 (en) Automated image retrieval with graph neural network
CN111027412A (en) Human body key point identification method and device and electronic equipment
CN114998777A (en) Training method and device for cross-modal video retrieval model
CN112434744A (en) Training method and device for multi-modal feature fusion model
CN114996511A (en) Training method and device for cross-modal video retrieval model
CN114880513A (en) Target retrieval method and related device
Liao et al. Depthwise grouped convolution for object detection
CN112465737B (en) Image processing model training method, image processing method and image processing device
Kalakoti Key-Frame Detection and Video Retrieval Based on DC Coefficient-Based Cosine Orthogonality and Multivariate Statistical Tests.
CN116994332A (en) Cross-mode pedestrian re-identification method and system based on contour map guidance
CN111291611A (en) Pedestrian re-identification method and device based on Bayesian query expansion
CN114494349A (en) Video tracking system and method based on target feature space-time alignment
CN113705589A (en) Data processing method, device and equipment
CN113222016A (en) Change detection method and device based on cross enhancement of high-level and low-level features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20231229

AD01 Patent right deemed abandoned