CN112131978A - Video classification method and device, electronic equipment and storage medium

Info

Publication number
CN112131978A
Authority
CN
China
Prior art keywords
region
target video
video image
image
feature map
Prior art date
Legal status
Granted
Application number
CN202010941467.1A
Other languages
Chinese (zh)
Other versions
CN112131978B (en)
Inventor
赵教生
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010941467.1A
Publication of CN112131978A
Application granted
Publication of CN112131978B
Legal status: Active

Classifications

    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F18/23213: Non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253: Fusion techniques of extracted features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y02T10/40: Engine management systems

Abstract

The application discloses a video classification method and apparatus, an electronic device, and a storage medium. The method can acquire at least one target video image of a target video and extract a global feature map of the target video image; identify at least one salient region of the global feature map of the target video image; extract a region feature vector for each salient region; fuse the global feature map and the region feature vectors of the salient regions, based on the importance of each salient region to the classification result of the target video, to obtain an image feature vector of the target video image; fuse the image feature vectors of all target video images to obtain a video feature vector of the target video; and classify the target video based on the video feature vector to obtain at least one class label of the target video. By fusing the region feature vectors of the salient regions, the representational power of the video feature vector is enhanced, which helps improve video classification accuracy.

Description

Video classification method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of computers, in particular to a video classification method, a video classification device, electronic equipment and a storage medium.
Background
With the development of computer technology, multimedia applications have become increasingly widespread, video types have grown richer, and the number of videos has increased sharply. The videos available to viewers are also increasingly diverse. To help users quickly find the videos they want to watch among this massive volume, a video playback platform generally classifies the large number of videos it hosts. Video classification plays an important role in video management and interest-based recommendation. In addition, video classification technology is widely applied in fields such as surveillance, retrieval, and human-computer interaction.
In the related art, video frames are generally extracted from the video to be classified to obtain a plurality of target video images, image feature information of each target video image is extracted through a neural network, and the frame-level image feature information is then converted into video-level feature information; specifically, the image feature information of the target video images can be fused to obtain the video feature information of the video to be classified, and the video is finally classified based on that video feature information. However, because video feature extraction in this approach is insufficient and the representational power of the resulting video feature information is weak, the accuracy of the classification result is relatively low.
Disclosure of Invention
The embodiment of the application provides a video classification method and apparatus, an electronic device, and a storage medium, which can enhance the representational power of video feature vectors and thereby help improve the accuracy of video classification.
The embodiment of the application provides a video classification method, which comprises the following steps:
obtaining at least one target video image, and performing feature extraction on the target video image to obtain a global feature map corresponding to the target video image, wherein the target video image is derived from a target video;
carrying out salient region identification on the global feature map of the target video image, and determining at least one salient region of the global feature map of the target video image;
extracting features of each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image;
based on the importance of each salient region of the target video image to the classification result of the target video, fusing the feature vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image;
fusing the image feature vectors of all target video images to obtain a video feature vector of the target video;
classifying the target video based on the video feature vector to obtain at least one class label of the target video.
Correspondingly, an embodiment of the present application provides a video classification apparatus, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring at least one target video image and extracting the characteristics of the target video image to obtain a global characteristic map corresponding to the target video image, and the target video image is derived from a target video;
the identification unit is used for identifying the salient region of the global feature map of the target video image and determining at least one salient region of the global feature map of the target video image;
the extraction unit is used for extracting the features of each salient region in the global feature map of the target video image to obtain the region feature vector of each salient region of the target video image;
the first fusion unit is used for fusing the feature vector of the global feature map of the target video image and the region feature vectors of the salient regions, based on the importance of each salient region of the target video image to the classification result of the target video, to obtain the image feature vector of the target video image;
the second fusion unit is used for fusing the image feature vectors of all the target video images to obtain the video feature vector of the target video;
and the classification unit is used for classifying the target video based on the video feature vector to obtain at least one class label of the target video.
Optionally, in some embodiments of the present application, the identification unit may include a sliding subunit, a first identification subunit, and a first determination subunit, as follows:
the sliding subunit is configured to slide on the global feature map of the target video image through a preset window to obtain a plurality of candidate regions of the global feature map of the target video image;
the first identification subunit is used for identifying the significance of each candidate region based on the feature map information of each candidate region in the global feature map;
and the first determining subunit is used for determining at least one significance region from the candidate regions based on the identification result.
Optionally, in some embodiments of the present application, the identification unit may further include a frame regression subunit, a second identification subunit, and a screening subunit, as follows:
the frame regression subunit is configured to perform frame regression on the candidate saliency region by using the determined saliency region as a candidate saliency region, so as to obtain a frame-adjusted candidate saliency region;
the second identification subunit is configured to perform saliency identification on the frame-adjusted candidate saliency region based on feature map information of the frame-adjusted candidate saliency region in the global feature map;
and the screening subunit is used for screening the candidate saliency area after the frame adjustment based on the identification result to obtain the saliency area of the target video image.
Optionally, in some embodiments of the application, the extraction unit may be specifically configured to perform pooling processing on each salient region in the global feature map of the target video image, so as to obtain a region feature vector of each salient region of the target video image.
Optionally, in some embodiments of the present application, the first fusion unit may include a second determining subunit and a weighting subunit, as follows:
the second determining subunit is configured to determine, based on the importance of each saliency region of the target video image to the classification result of the target video, a weight corresponding to each saliency region of the target video image;
and the weighting subunit is used for weighting the feature vector of the global feature map of the target video image and the region feature vector of each salient region based on the weight to obtain the image feature vector of the target video image.
Optionally, in some embodiments of the present application, the second fusion unit may include a clustering subunit, a first calculating subunit, and a first fusion subunit, as follows:
the clustering subunit is used for clustering the image feature vectors of the target video images to obtain at least one clustering set, and determining a central feature vector serving as a clustering center in each clustering set;
the first calculating subunit is configured to calculate, for each cluster set, a difference value between a non-central feature vector and a central feature vector in the cluster set, to obtain a feature residual vector of the cluster set;
and the first fusion subunit is used for fusing the characteristic residual vectors of the cluster sets to obtain the video characteristic vector of the target video.
Optionally, in some embodiments of the present application, the clustering subunit may be specifically configured to determine a number K of cluster sets, where K is a positive integer not less than 1;
selecting K image feature vectors from the image feature vectors of the target video image as central feature vectors of K clustering sets respectively;
calculating the vector distance between the image characteristic vector of each target video image and each central characteristic vector;
adding each image feature vector to a cluster set to which a central feature vector closest to the vector of the image feature vector belongs to obtain K cluster sets;
and aiming at each cluster set, selecting image feature vectors meeting the cluster center condition from the cluster sets as new center feature vectors, returning to the step of calculating the vector distance between the image feature vectors of the target video images and the center feature vectors until the center feature vectors of the cluster sets meet the cluster end condition, obtaining K cluster sets, and obtaining the center feature vectors serving as cluster centers in the cluster sets.
Optionally, in some embodiments of the application, the obtaining unit may be specifically configured to perform feature extraction on the target video image through a classification model to obtain a global feature map corresponding to the target video image.
Optionally, in some embodiments of the application, the identification unit may be specifically configured to perform, through the classification model, salient region identification on the global feature map of the target video image, and determine at least one salient region of the global feature map of the target video image.
Optionally, in some embodiments of the present application, the classification unit may be specifically configured to classify, by the classification model, the target video based on the video feature vector to obtain at least one class label of the target video.
Optionally, in some embodiments of the present application, the video classification apparatus further includes a training unit, where the training unit is configured to train the classification model; the training unit may include a first obtaining subunit, a first extracting subunit, a second fusing subunit, a third determining subunit, a second calculating subunit, and an adjusting subunit, as follows:
the first obtaining subunit is configured to obtain training data, where the training data includes a sample video image of a sample video and real category information corresponding to the sample video;
the first extraction subunit is configured to perform feature extraction on the sample video image through a preset classification model to obtain a global feature map corresponding to the sample video image, perform saliency region identification on the global feature map of the sample video image, and determine at least one prediction saliency region of the global feature map of the sample video image;
the second extraction subunit is configured to perform feature extraction on each prediction saliency region in the global feature map of the sample video image to obtain a region feature vector of each prediction saliency region of the sample video image, and fuse the feature map vector of the global feature map of the sample video image and the region feature vector of each prediction saliency region based on the importance of each prediction saliency region of the sample video image to the classification result of the sample video to obtain an image feature vector of the sample video image;
the second fusion subunit is used for fusing the image feature vectors of the sample video images to obtain the video feature vectors of the sample videos;
the third determining subunit is used for determining prediction probability information of the sample video on each preset category based on the video feature vector;
a second calculating subunit, configured to calculate a first loss value between the prediction probability information and the real category information of the sample video;
and the adjusting subunit is used for adjusting the parameters of the preset classification model based on the first loss value to obtain the classification model meeting the preset conditions.
Optionally, in some embodiments of the application, the training unit may further include a third calculating subunit, a fourth determining subunit, a second obtaining subunit, and a third obtaining subunit, which operate before the adjusting subunit adjusts the parameters of the preset classification model based on the first loss value to obtain the classification model meeting the preset condition, as follows:
the third computing subunit is configured to compute a gradient of the first loss value to the video feature vector of the sample video, and draw a thermodynamic diagram corresponding to a global feature diagram of a sample video image of the sample video based on the gradient;
a fourth determining subunit, configured to determine category information of the sample video based on prediction probability information of the sample video;
a second obtaining subunit, configured to, when the category information of the sample video is consistent with the real category information, obtain a saliency area of a global feature map of the sample video image based on the thermodynamic diagram, and set the obtained saliency area as a real saliency area of the sample video image;
a third obtaining subunit, configured to, when the category information of the sample video is inconsistent with the real category information, obtain, based on the thermodynamic diagram, a non-significant region of a global feature map of the sample video image, and set the obtained non-significant region as a non-real significant region of the sample video image;
the adjusting subunit may be specifically configured to calculate a second loss value for a predicted salient region of the sample video image based on the true salient region and the non-true salient region; and adjusting parameters of a preset classification model based on the first loss value and the second loss value to obtain the classification model meeting preset conditions.
Optionally, in some embodiments of the present application, the step of "calculating a second loss value of the predicted saliency region of the sample video image based on the true saliency region and the non-true saliency region" may include:
determining a true saliency region probability for a predicted saliency region of the sample video image based on a region overlap degree of the predicted saliency region and the true saliency region;
determining a true saliency region probability for a predicted saliency region based on a region overlap degree of the predicted saliency region and the non-true saliency region of the sample video image;
determining the prediction probability of the prediction significance region as a real significance region based on the characteristic map information of the prediction significance region through a preset classification model;
calculating a classification loss of the predicted significance region based on a prediction probability of the predicted significance region and a corresponding true significance region probability;
calculating regression loss of the prediction significance region based on the prediction significance region with the real significance region probability not lower than a preset probability threshold, the position information in the global feature map of the sample video image and the position information of the real significance region in the global feature map of the sample video image;
and fusing the classification loss and the regression loss to obtain a second loss value of the prediction significance region of the sample video image.
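To make the combination of classification loss and regression loss in the preceding steps concrete, the following is a minimal sketch (assuming PyTorch; the predicted saliency probabilities, the labels assigned from region overlap, the box tensors, and the positive-region mask are taken as given, and the smooth-L1 regression loss and the plain sum are illustrative choices rather than the patent's prescribed formulas):

```python
import torch
import torch.nn.functional as F

def second_loss(pred_probs, assigned_labels, pred_boxes, true_boxes, pos_mask):
    """Classification loss between predicted saliency probabilities and the
    labels assigned from region overlap, plus a box-regression loss computed
    only over the predicted regions matched to a real salient region."""
    cls_loss = F.binary_cross_entropy(pred_probs, assigned_labels)
    if pos_mask.any():
        reg_loss = F.smooth_l1_loss(pred_boxes[pos_mask], true_boxes[pos_mask])
    else:
        reg_loss = torch.zeros((), device=pred_probs.device)
    return cls_loss + reg_loss  # fused second loss value
```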
The electronic device provided by the embodiment of the application comprises a processor and a memory, wherein the memory stores a plurality of instructions, and the processor loads the instructions to execute the steps in the video classification method provided by the embodiment of the application.
In addition, a storage medium is further provided, on which a computer program is stored, where the computer program is executed by a processor to implement the steps in the video classification method provided in the embodiments of the present application.
The embodiment of the application provides a video classification method and apparatus, an electronic device, and a storage medium. At least one target video image can be obtained, and feature extraction is performed on the target video image to obtain a global feature map corresponding to the target video image, wherein the target video image is derived from a target video; salient region identification is performed on the global feature map of the target video image, and at least one salient region of the global feature map is determined; feature extraction is performed on each salient region in the global feature map to obtain a region feature vector of each salient region; based on the importance of each salient region of the target video image to the classification result of the target video, the feature vector of the global feature map and the region feature vectors of the salient regions are fused to obtain the image feature vector of the target video image; the image feature vectors of all target video images are fused to obtain the video feature vector of the target video; and the target video is classified based on the video feature vector to obtain at least one class label of the target video. By fusing the region feature vectors of the salient regions, the embodiment of the application can enhance the representational power of the video feature vector, which helps improve video classification accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1a is a scene schematic diagram of a video classification method provided in an embodiment of the present application;
fig. 1b is a flowchart of a video classification method provided in an embodiment of the present application;
fig. 2a is another flowchart of a video classification method provided in an embodiment of the present application;
fig. 2b is another flowchart of a video classification method provided in an embodiment of the present application;
fig. 3a is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;
fig. 3b is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;
fig. 3c is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;
fig. 3d is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;
fig. 3e is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;
fig. 3f is a schematic structural diagram of another video classification apparatus provided in the embodiment of the present application;
fig. 3g is a schematic structural diagram of another video classification apparatus provided in the embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a video classification method and device, electronic equipment and a storage medium. The video classification apparatus may be specifically integrated in an electronic device, and the electronic device may be a terminal or a server.
It is understood that the video classification method of the present embodiment may be executed on a terminal, may also be executed on a server, and may also be executed by both the terminal and the server. The above examples should not be construed as limiting the present application.
As shown in fig. 1a, the video classification method is performed by the terminal and the server together. The video classification system provided by the embodiment of the application comprises a terminal 10, a server 11 and the like; the terminal 10 and the server 11 are connected via a network, for example, a wired or wireless network connection, etc., wherein the video classification apparatus may be integrated in the server.
The terminal 10 may extract a video frame of the target video to obtain at least one target video image of the target video, and send the target video image to the server 11, so that the server 11 classifies the target video based on the feature information of the target video image, and returns a category label of the target video to the terminal 10. The terminal 10 may include a mobile phone, a smart television, a tablet Computer, a notebook Computer, a Personal Computer (PC), or the like. A client, which may be an application client or a browser client or the like, may also be provided on the terminal 10.
The server 11 may be configured to: obtaining at least one target video image, and performing feature extraction on the target video image to obtain a global feature map corresponding to the target video image; carrying out salient region identification on the global feature map of the target video image, and determining at least one salient region of the global feature map of the target video image; extracting features of each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image; based on the importance of each salient region of the target video image to the classification result of the target video, fusing the feature vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image; fusing image feature vectors of all target video images to obtain video feature vectors of the target videos; classifying the target video based on the video feature vector to obtain at least one category label of the target video, and sending the category label to the terminal 10. The server 11 may be a single server, or may be a server cluster or a cloud server composed of a plurality of servers.
The video classification steps described for the server 11 may alternatively be executed by the terminal 10.
The embodiment of the application provides a video classification method, which relates to the computer vision technology in the field of artificial intelligence. By fusing the region feature vectors of the salient regions, the embodiment of the application can enhance the representational power of the video feature vectors and improve the accuracy of video classification.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making. Artificial intelligence is a comprehensive discipline involving a wide range of technologies at both the hardware level and the software level. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers, instead of human eyes, to identify, track, and measure targets, and to further perform graphics processing so that the processed images become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The present embodiment will be described from the perspective of a video classification apparatus, which may be specifically integrated in an electronic device, which may be a server or a terminal, or the like.
The video classification method provided by the embodiment of the application can be applied to various scenarios in which videos need to be classified; neither the video duration nor the video type is limited. For example, a video platform may need to classify millions of videos and assign at least one category label to each video. The video classification method provided by this embodiment can classify massive numbers of videos quickly, and, by fusing the region feature vectors of the salient regions, it enhances the representational power of the video feature vectors, so the video classification accuracy is high.
As shown in fig. 1b, the specific flow of the video classification method may be as follows:
101. the method comprises the steps of obtaining at least one target video image, and carrying out feature extraction on the target video image to obtain a global feature map corresponding to the target video image, wherein the target video image is derived from a target video.
The target video is a video to be classified, the video type of the target video is not limited, and the video duration is not limited. The target video may correspond to one category label or may correspond to a plurality of category labels. The category label may specifically be an element included in the video, such as "cat" and "dog", and may also be a feeling given to the user by the video scene, such as "surprise" and "fun".
In this embodiment, video frame extraction may be performed on the target video to obtain at least one target video image of the target video. Specifically, the target video image may be extracted from the target video at certain time intervals; a certain number of target video images may also be extracted from the target video, and it can be understood that a specific video frame extraction manner may be set according to an actual situation, which is not limited in this embodiment.
Before feature extraction is performed on each target video image, preprocessing may be performed on each target video image, where the preprocessing may include image size adjustment, image data enhancement, and the like on each target video image. Image data enhancement may include histogram equalization, sharpening, smoothing, and the like.
In this embodiment, convolution processing may be performed on each target video image to obtain the global feature map corresponding to each target video image. Specifically, the feature information of the target video image may be extracted through a neural network, which may be an Inception network, an EfficientNet, a Visual Geometry Group network (VGGNet), a Residual Network (ResNet), a Densely Connected Convolutional Network (DenseNet), or the like, but it should be understood that the neural network of this embodiment is not limited to the types listed above.
In each convolutional layer of the neural network, the data exists in three dimensions and can be viewed as a stack of two-dimensional images, each of which is called a feature map.
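For illustration, a minimal sketch of the frame sampling and global-feature-map extraction described above, assuming PyTorch and torchvision are available (the 30-frame interval, the ResNet-50 backbone, and all names here are illustrative assumptions, not the patent's implementation):

```python
import torch
import torchvision

def sample_frames(video_tensor, interval=30):
    """Take every `interval`-th frame from a decoded (T, C, H, W) video tensor."""
    return video_tensor[::interval]

# Backbone with the final pooling/classification layers removed, so the output
# keeps its spatial dimensions and can serve as the global feature map.
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2]
)
backbone.eval()

video = torch.rand(300, 3, 224, 224)            # stand-in for a decoded video
frames = sample_frames(video, interval=30)      # (10, 3, 224, 224)
with torch.no_grad():
    global_feature_maps = backbone(frames)      # (10, 2048, 7, 7)
```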
102. And identifying the salient region of the global feature map of the target video image, and determining at least one salient region of the global feature map of the target video image.
In this embodiment, a region that mainly affects the finally predicted class label of the target video is referred to as a salient region. In current video classification techniques, each frame of a target video image is treated as a whole when performing the convolution operation, that is, all areas of the frame are treated equally. However, each category label of a video corresponds to certain regions of certain frames, and these are the regions that should receive attention during video classification. For example, if the category labels of a video are "forest" and "zebra", the classification result is obtained based on the regions of the video frames that contain "forest" and/or "zebra". This embodiment adds attention to these regions (namely, the salient regions), and enhances the representational power of the video feature vector by extracting and fusing the region feature vectors of the salient regions, which helps improve the multi-label classification effect for the video.
Optionally, in some embodiments, the step of "performing saliency region identification on the global feature map of the target video image, and determining at least one saliency region of the global feature map of the target video image" may include:
sliding on the global feature map of the target video image through a preset window to obtain a plurality of candidate regions of the global feature map of the target video image;
performing significance recognition on each candidate region based on the feature map information of each candidate region in the global feature map;
determining at least one salient region from the candidate regions based on the recognition result.
In other embodiments, the global feature map of the target video image may also be subjected to salient region identification through image segmentation.
The aspect ratio, size, angle, etc. of the preset window may be preset. In some embodiments, the preset window may include a variety of aspect ratios and sizes. The aspect ratio and the size can be set according to practical situations, and the embodiment does not limit the aspect ratio and the size.
The step of "sliding a preset window over the global feature map of the target video image to obtain a plurality of candidate regions of the global feature map of the target video image" may specifically include: sliding the preset window over the global feature map, that is, traversing the global feature map of the target video image and marking out a plurality of candidate regions on it. In some embodiments, the preset window includes a plurality of sizes and aspect ratios, so the candidate regions obtained differ in size and aspect ratio according to the preset window used.
In the step "performing saliency recognition on each candidate region based on feature map information of each candidate region in the global feature map", for each candidate region, specifically, regarding parameters corresponding to the candidate region in the global feature map, regarding the parameters as feature map information of the candidate region, performing saliency recognition on the candidate region based on the feature map information of the candidate region, and determining whether the candidate region is a saliency region, specifically, performing saliency recognition on the candidate region by using an image contour detection method or other target detection methods, and identifying whether an element affecting a classification result of a target video exists in the candidate region, and if so, determining the candidate region as the saliency region.
For example, whether elements such as "cat", "pig", and "forest" exist in the candidate region may be detected, specifically, similarity between feature map information of the candidate region and feature information of the elements may be compared, and when the similarity is greater than a preset value, the candidate region is determined as a significant region.
In this embodiment, the sub-network that identifies salient regions may adopt a Region Proposal Network (RPN). Through the RPN, candidate regions with different sizes and aspect ratios, together with the position information of each candidate region, can be generated based on a sliding preset window S = (x, y, w, h), and the salient regions can then be determined from the candidate regions based on the feature map information corresponding to the candidate regions. Here, (x, y) represents the center point of the preset window, w and h represent the width and height of the preset window, and the parameters w and h can be set according to actual requirements.
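A rough sketch of the sliding-window candidate generation, enumerating windows S = (x, y, w, h) over the feature-map grid (the sizes and aspect ratios are hypothetical defaults; a real RPN would additionally score and regress these windows with learned layers):

```python
import itertools

def generate_candidate_regions(fmap_h, fmap_w, sizes=(2, 4), aspect_ratios=(0.5, 1.0, 2.0)):
    """Slide preset windows of several sizes and aspect ratios over an
    fmap_h x fmap_w feature map; each candidate is (x, y, w, h) with (x, y)
    the window centre in feature-map coordinates."""
    candidates = []
    for y, x in itertools.product(range(fmap_h), range(fmap_w)):
        for size, ratio in itertools.product(sizes, aspect_ratios):
            w = size * ratio ** 0.5      # ratio defined as w / h
            h = size / ratio ** 0.5
            candidates.append((x, y, w, h))
    return candidates

regions = generate_candidate_regions(7, 7)
print(len(regions))  # 7 * 7 * 2 * 3 = 294 candidate regions
```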
Optionally, in some embodiments, after the step "determining at least one significant region from the candidate regions based on the recognition result", the method may further include:
taking the determined saliency region as a candidate saliency region, and performing frame regression on the candidate saliency region to obtain a frame-adjusted candidate saliency region;
performing saliency recognition on the candidate saliency areas after frame adjustment based on the feature map information of the candidate saliency areas after frame adjustment in the global feature map;
and screening the candidate saliency areas after the frame adjustment based on the identification result to obtain the saliency areas of the target video image.
Bounding box regression is the process, in object detection, of refining a generated candidate box so that it approaches the labelled ground-truth box.
Performing bounding box regression on the candidate salient regions allows the detected salient regions to be positioned closer to the real regions and improves the positioning accuracy. Saliency recognition may then be performed again on the frame-adjusted candidate salient regions. Specifically, the similarity between the feature map information of a frame-adjusted candidate salient region and the feature information of a preset element (specifically, an element strongly related to the video classification result) may be calculated, and the frame-adjusted candidate salient regions are screened based on the magnitude of the similarity. For example, the frame-adjusted candidate salient regions whose similarity exceeds a preset similarity may be taken as the salient regions of the target video image; alternatively, the frame-adjusted candidate salient regions may be sorted in descending order of similarity, and the top N of them taken as the salient regions of the target video image.
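The border regression and re-screening could be sketched as follows (assuming the common centre-offset and log-scale parameterisation of box regression and a top-N selection by saliency score; the similarity-threshold rule mentioned above would replace `select_salient_regions`):

```python
import torch

def apply_box_regression(boxes, deltas):
    """boxes, deltas: (N, 4) tensors in (x_center, y_center, w, h) format.
    Shift centres by dx*w, dy*h and scale sizes by exp(dw), exp(dh)."""
    x, y, w, h = boxes.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    return torch.stack(
        [x + dx * w, y + dy * h, w * torch.exp(dw), h * torch.exp(dh)], dim=1
    )

def select_salient_regions(boxes, scores, top_n=5):
    """Keep the top_n frame-adjusted candidates with the highest saliency scores."""
    keep = torch.argsort(scores, descending=True)[:top_n]
    return boxes[keep], scores[keep]

boxes = torch.rand(20, 4)                 # candidate salient regions
deltas = torch.randn(20, 4) * 0.1         # predicted regression offsets
scores = torch.rand(20)                   # saliency scores after adjustment
salient_boxes, _ = select_salient_regions(apply_box_regression(boxes, deltas), scores)
```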
103. And performing feature extraction on each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image.
For each salient region, feature extraction can be performed again on feature map information of the salient region corresponding to the global feature map of the target video image, so as to obtain a region feature vector of the salient region.
Optionally, in some embodiments, the step of "performing feature extraction on each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image" may include:
and pooling each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image.
Here, pooling reduces the feature map information of each salient region. The pooling may include max pooling, average pooling, generalized mean (GeM) pooling, and the like. It should be understood that the pooling of this embodiment is not limited to the types listed above.
Optionally, in some embodiments, the global feature map of the target video image may also be subjected to pooling processing, so as to obtain a feature map vector of the pooled global feature map.
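A minimal sketch of pooling each salient region of the global feature map into a region feature vector, here using torchvision's roi_align as a stand-in for the pooling described above (the region coordinates are hypothetical and given directly in feature-map coordinates):

```python
import torch
from torchvision.ops import roi_align

fmap = torch.rand(1, 2048, 7, 7)                 # global feature map of one frame
# Salient regions of that frame as (x1, y1, x2, y2) boxes.
regions = [torch.tensor([[0.0, 0.0, 3.0, 3.0],
                         [2.0, 2.0, 6.0, 6.0]])]

# Pool each region to 1x1 and flatten: one feature vector per salient region.
region_vectors = roi_align(fmap, regions, output_size=(1, 1)).flatten(1)   # (2, 2048)

# Average-pool the whole map to obtain the feature vector of the global feature map.
global_vector = fmap.mean(dim=(2, 3))                                      # (1, 2048)
```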
104. And fusing the feature vector of the global feature map of the target video image and the region feature vector of each salient region based on the importance of each salient region of the target video image to the classification result of the target video to obtain the image feature vector of the target video image.
In some embodiments, the fusion mode may specifically be to splice a feature map vector of the global feature map and a region feature vector of each salient region to obtain an image feature vector of the target video image. For example, the feature vector of the global feature map and the region feature vector of each salient region may be spliced from large to small according to the scale of the feature vector to obtain the image feature vector of the target video image.
Optionally, in some embodiments, the step "based on the importance of each salient region of the target video image to the classification result of the target video, fusing the feature vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image" may include:
determining weights corresponding to the various salient regions of the target video image based on the importance of the various salient regions of the target video image to the classification result of the target video;
and based on the weight, carrying out weighting processing on the feature vector of the global feature map of the target video image and the regional feature vector of each salient region to obtain the image feature vector of the target video image.
The weight of the global feature map may be regarded as 1, or a weight may be set for the global feature map, which may be specifically set according to an actual situation, which is not limited in this embodiment.
In some embodiments, the weight corresponding to each salient region may be preset, and may be specifically set according to an actual situation, which is not limited in this embodiment. In other embodiments, the weight corresponding to each salient region may also be obtained through learning of a fully connected layer of the neural network.
Optionally, in a specific embodiment, the feature vector of the global feature map of the target video image and the region feature vectors of the salient regions may be fused through a keyless attention mechanism to obtain the image feature vector of the target video image.
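The importance-weighted fusion could be sketched as a generic attention-style module (a single learned scoring layer followed by softmax weights and a weighted sum; the module name, dimensions, and the decision to include the global vector in the softmax are assumptions for illustration):

```python
import torch
import torch.nn as nn

class WeightedRegionFusion(nn.Module):
    """Score the global feature-map vector and each region feature vector with
    a linear layer, normalise the scores into weights with softmax, and return
    the weighted sum as the image feature vector."""
    def __init__(self, dim=2048):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, global_vector, region_vectors):
        feats = torch.cat([global_vector, region_vectors], dim=0)   # (1 + R, dim)
        weights = torch.softmax(self.score(feats), dim=0)           # (1 + R, 1)
        return (weights * feats).sum(dim=0)                         # (dim,)

fusion = WeightedRegionFusion(dim=2048)
image_vector = fusion(torch.rand(1, 2048), torch.rand(3, 2048))
```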
105. And fusing the image feature vectors of all the target video images to obtain the video feature vector of the target video.
In some embodiments, the image feature vectors of the target video images may be fused through a next-generation Vector of Locally Aggregated Descriptors (NeXtVLAD) module to obtain the video feature vector of the target video.
Optionally, in some embodiments, the fusion mode may specifically splice image feature information of each target video image to obtain a video feature vector of the target video. Specifically, according to the size of the image feature vector, the image feature vectors of all target video images are spliced from large to small to obtain the video feature vector of the target video.
Optionally, in another embodiment, the step of "fusing the image feature vectors of the target video images to obtain the video feature vectors of the target video" may include:
clustering image feature vectors of all target video images to obtain at least one cluster set, and determining a central feature vector serving as a clustering center in each cluster set;
calculating the difference value between the non-central characteristic vector and the central characteristic vector in each cluster set to obtain the characteristic residual vector of the cluster set;
and fusing the characteristic residual vectors of the cluster sets to obtain the video characteristic vector of the target video.
There are many possible clustering methods, for example, the K-means clustering algorithm, the K-centers algorithm, DBSCAN (a density-based clustering algorithm), a hierarchical clustering algorithm, or a self-organizing map clustering algorithm. The above examples should not be construed as limiting the present application.
Optionally, in some embodiments, the step "performing clustering processing on the image feature vectors of each target video image to obtain at least one cluster set, and determining a center feature vector serving as a cluster center in each cluster set" may include:
determining the number K of the cluster sets, wherein K is a positive integer not less than 1;
selecting K image feature vectors from the image feature vectors of the target video image as central feature vectors of K clustering sets respectively;
calculating the vector distance between the image characteristic vector of each target video image and each central characteristic vector;
adding each image feature vector to a cluster set to which a central feature vector closest to the vector of the image feature vector belongs to obtain K cluster sets;
and aiming at each cluster set, selecting image feature vectors meeting the cluster center condition from the cluster sets as new center feature vectors, returning to the step of calculating the vector distance between the image feature vectors of the target video images and the center feature vectors until the center feature vectors of the cluster sets meet the cluster end condition, obtaining K cluster sets, and obtaining the center feature vectors serving as cluster centers in the cluster sets.
Wherein, the vector distance between the image feature vector and the central feature vector can represent the similarity between the two. The smaller the vector distance, the greater the similarity. There are many ways to calculate the vector distance between the image feature vector and the central feature vector, such as calculating by cosine distance or euclidean distance, which is not limited in this embodiment.
In the step "for each cluster set, selecting an image feature vector meeting a cluster center condition from the cluster sets as a new center feature vector", where the cluster center condition may be that the distance from the distribution center of gravity of the cluster set is minimum, specifically, the distribution center of gravity of the cluster set may be determined for the distribution information of the image feature vectors in each cluster set, and the image feature vector having the minimum distance from the distribution center of gravity is used as the new center feature vector.
For each cluster set, it is calculated whether the latest center feature vector of the cluster set is the same as the center feature vector used in the previous round of clustering, that is, whether the vector distance between the two is 0. If they are the same, the cluster center of that cluster set has not changed. If the cluster centers of all cluster sets have not changed, the clustering process ends, yielding the K cluster sets and the center feature vector serving as the cluster center of each cluster set; otherwise, the process returns to the step of calculating the vector distance between the image feature vector of each target video image and each center feature vector, until the cluster centers of the cluster sets no longer change.
It should be noted that requiring the latest center feature vector of each cluster set to be identical to the previously used cluster center is only one optional condition for ending the loop; another optional condition is that the difference between the two cluster centers is smaller than a preset value, which can be set according to the actual situation.
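Putting the clustering and residual steps together, a rough sketch of turning the frame-level image feature vectors into a video feature vector (a plain K-means with random initialisation and a fixed iteration count, followed by per-cluster residual sums concatenated into one vector; these simplifications are assumptions, not the patent's exact procedure):

```python
import torch

def kmeans(features, k=4, iters=10):
    """Simple K-means over (N, D) image feature vectors; returns the K centre
    feature vectors and the cluster assignment of each vector."""
    centers = features[torch.randperm(features.size(0))[:k]]
    for _ in range(iters):
        assign = torch.cdist(features, centers).argmin(dim=1)   # nearest centre
        for j in range(k):
            members = features[assign == j]
            if members.numel() > 0:
                centers[j] = members.mean(dim=0)                # new cluster centre
    return centers, assign

def video_feature_vector(features, k=4):
    """Per cluster, sum the residuals (member - centre), then concatenate the
    per-cluster residual vectors into the video feature vector."""
    centers, assign = kmeans(features, k)
    residuals = []
    for j in range(k):
        members = features[assign == j]
        res = (members - centers[j]).sum(dim=0) if members.numel() else torch.zeros_like(centers[j])
        residuals.append(res)
    return torch.cat(residuals)                                 # (k * D,)

video_vector = video_feature_vector(torch.rand(10, 2048), k=4)  # 10 frame vectors
```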
106. Classifying the target video based on the video feature vector to obtain at least one class label of the target video.
Wherein the category label of the target video can be predicted based on the video feature vector through a classifier. The classifier may be a Support Vector Machine (SVM), a fully connected Deep Neural Network (DNN), or the like, which is not limited in this embodiment.
The classification of the target video is specifically multi-label classification, that is, a classification method in which the target video may carry a plurality of category labels.
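A sketch of the multi-label classification step (a single fully connected layer with a per-label sigmoid and a hypothetical 0.5 threshold; as noted above, an SVM or a deeper fully connected network could be used instead):

```python
import torch
import torch.nn as nn

num_labels = 1000                              # number of preset categories (assumed)
classifier = nn.Linear(4 * 2048, num_labels)   # input size matches the video vector above

video_vector = torch.rand(4 * 2048)
probs = torch.sigmoid(classifier(video_vector))             # per-label probabilities
predicted_labels = (probs > 0.5).nonzero(as_tuple=True)[0]  # every label above threshold
```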
Optionally, in some embodiments, the step of "performing feature extraction on the target video image to obtain a global feature map corresponding to the target video image" may include:
performing feature extraction on the target video image through a classification model to obtain a global feature map corresponding to the target video image;
the identifying the salient regions of the global feature map of the target video image and determining at least one salient region of the global feature map of the target video image comprise:
performing salient region identification on the global feature map of the target video image through the classification model, and determining at least one salient region of the global feature map of the target video image;
the classifying the target video based on the video feature vector to obtain at least one category label of the target video includes:
and classifying the target video based on the video feature vector through the classification model to obtain at least one class label of the target video.
The classification model can be used to extract the global feature map of a target video image and to perform salient region identification on the global feature map to obtain at least one salient region of the global feature map of the target video image; then to fuse the region feature information of each salient region of the target video image with the feature map vector of the global feature map to obtain the image feature vector of the target video image; and to classify the target video based on the video feature vector to obtain at least one class label of the target video.
The classification model may be a Visual Geometry Group network (VGGNet), a Residual Network (ResNet), a Densely Connected Convolutional Network (DenseNet), or the like, but it should be understood that the classification model of this embodiment is not limited to the types listed above.
It should be noted that the classification model is trained by a plurality of training data with labels, the training data of this embodiment may include sample video images of a plurality of sample videos, and the labels refer to real category information corresponding to the sample videos; the classification model may be specifically provided to the video classification apparatus after being trained by other devices, or may be trained by the video classification apparatus itself.
If the video classification device performs training by itself, before the step of performing feature extraction on the target video image through a classification model to obtain a global feature map corresponding to the target video image, the method may further include:
acquiring training data, wherein the training data comprises a sample video image of a sample video and real category information corresponding to the sample video;
performing feature extraction on the sample video image through a preset classification model to obtain a global feature map corresponding to the sample video image, performing saliency region identification on the global feature map of the sample video image, and determining at least one prediction saliency region of the global feature map of the sample video image;
extracting features of each prediction significance region in the global feature map of the sample video image to obtain a region feature vector of each prediction significance region of the sample video image, and fusing the feature map vector of the global feature map of the sample video image and the region feature vector of each prediction significance region to obtain an image feature vector of the sample video image based on the importance of each prediction significance region of the sample video image to the classification result of the sample video;
fusing image feature vectors of all sample video images to obtain video feature vectors of the sample videos;
determining prediction probability information of the sample video on each preset category based on the video feature vector;
calculating a first loss value between the prediction probability information and real category information of the sample video;
and adjusting parameters of a preset classification model based on the first loss value to obtain the classification model meeting preset conditions.
In the training process, parameters of the preset classification model may be adjusted based on a back propagation algorithm, so that a first loss value between the predicted probability information obtained through the preset classification model and the real class information is smaller than a preset value, and the preset value may be set according to an actual situation, which is not limited in this embodiment. For example, the preset value may be set smaller in order to improve the classification accuracy of the classification model.
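A minimal Python training-loop sketch of the back-propagation adjustment described above; the stand-in model, the SGD optimizer, the BCE loss and the value of preset_value are assumptions made only for illustration.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 5))           # stand-in for the preset classification model
criterion = nn.BCEWithLogitsLoss()                  # measures prediction vs. real category information
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
preset_value = 0.05                                 # illustrative preset value for the first loss

video_features = torch.randn(8, 1024)               # video feature vectors of sample videos
real_labels = torch.randint(0, 2, (8, 5)).float()   # real probabilities are 0 or 1 per preset category

for step in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(video_features), real_labels)   # first loss value
    if loss.item() < preset_value:                  # stop once the loss falls below the preset value
        break
    loss.backward()                                 # back propagation
    optimizer.step()                                # adjust connection weights and biases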
The real category information of the sample video may specifically be a real probability of the sample video in each preset category, where the real probability in the real category is 1, and the real probability in other preset categories except the real category is 0.
The adjustment of the preset classification model parameters may include adjustment of the number of neurons in the preset classification model, adjustment of connection weights and biases between neurons in each layer, and the like.
Generally, if the prediction probability of the preset classification model on a certain preset category exceeds a threshold, the target video can be considered to belong to that preset category. In the training process of the preset classification model, if the category information predicted by the preset classification model is consistent with the real category information, that is, the preset classification model predicts the class label of the sample video correctly, a thermodynamic diagram can be obtained through analysis based on the parameters involved in the prediction process, and saliency region identification can be performed on the thermodynamic diagram to obtain the real saliency region of the sample video image. If the category information predicted by the preset classification model is inconsistent with the real category information, that is, the preset classification model predicts the class label of the sample video wrongly, a thermodynamic diagram can likewise be obtained through analysis based on the parameters involved in the prediction process, and the non-real saliency region of the sample video image is obtained from this thermodynamic diagram.
In some embodiments, the thermodynamic diagram may be obtained through gradient-weighted Class Activation Mapping (Grad-CAM): the gradients of the first loss value with respect to the video feature vector of the sample video are calculated, the global average of the gradients is used as the weight of each region of the global feature map of the sample video image, and the thermodynamic diagram corresponding to the global feature map is drawn from these weights. The video feature vector of the target video may specifically be obtained by stitching the image feature vectors corresponding to the target video images of the target video. The basic idea of Grad-CAM is that the weight of a feature map with respect to a certain class can be expressed through the back-propagated gradient.
Specifically, if the prediction of the class label of the sample video is correct, the thermodynamic diagram area of the Grad-CAM analysis can be used as a positive sample, and if the prediction of the class label of the sample video is wrong, the thermodynamic diagram area of the Grad-CAM analysis can be used as a negative sample.
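As an illustrative Python sketch of the Grad-CAM analysis, assuming the gradients of the first loss value have already been back-propagated to a (C, H, W) global feature map of one sample video image; the shapes, the ReLU and the normalization are assumptions of the sketch, not requirements of this embodiment.

import torch
import torch.nn.functional as F

def grad_cam_heatmap(feature_map: torch.Tensor, gradients: torch.Tensor) -> torch.Tensor:
    # feature_map, gradients: (C, H, W) tensors for one sample video image.
    # The global average of the gradient over each channel is used as that
    # channel's weight; the weighted combination gives the thermodynamic diagram.
    weights = gradients.mean(dim=(1, 2))                          # (C,) per-channel weights
    heatmap = (weights[:, None, None] * feature_map).sum(dim=0)   # (H, W)
    heatmap = F.relu(heatmap)                                     # keep positively contributing areas
    if heatmap.max() > 0:
        heatmap = heatmap / heatmap.max()                         # normalize to [0, 1]
    return heatmap

# High-valued areas can then be taken as real or non-real saliency regions,
# depending on whether the predicted class label matched the real category information.
heatmap = grad_cam_heatmap(torch.rand(256, 14, 14), torch.rand(256, 14, 14))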
Specifically, in some embodiments, before the step "based on the first loss value, adjust parameters of a preset classification model to obtain a classification model satisfying a preset condition", the method may further include:
calculating the gradient of the first loss value to the video feature vector of the sample video, and drawing a thermodynamic diagram corresponding to a global feature diagram of a sample video image of the sample video based on the gradient;
determining category information of the sample video based on the prediction probability information of the sample video;
when the category information of the sample video is consistent with the real category information, acquiring a saliency area of a global feature map of the sample video image based on the thermodynamic diagram, and setting the acquired saliency area as a real saliency area of the sample video image;
when the category information of the sample video is inconsistent with the real category information, acquiring a non-significant region of a global feature map of the sample video image based on the thermodynamic diagram, and setting the acquired non-significant region as a non-real significant region of the sample video image;
based on the first loss value, adjusting parameters of a preset classification model to obtain a classification model meeting preset conditions, including:
calculating a second loss value for a predicted salient region of the sample video image based on the true salient region and the non-true salient region;
and adjusting parameters of a preset classification model based on the first loss value and the second loss value to obtain the classification model meeting preset conditions.
The real salient region of the sample video image can be regarded as a positive sample in the process of carrying out supervision training on the salient region; the unreal salient region of the sample video image can be regarded as a negative sample in the process of carrying out supervised training on the salient region.
The step of adjusting parameters of a preset classification model based on the first loss value and the second loss value to obtain a classification model satisfying a preset condition may specifically include:
fusing the first loss value and the second loss value to obtain a total loss value;
and adjusting parameters of a preset classification model based on the total loss value to obtain the classification model meeting preset conditions.
The fusion mode of the first loss value and the second loss value may specifically be to perform weighted summation on the first loss value and the second loss value to obtain a total loss value.
Specifically, the first loss value is the loss function of the label classification, and its calculation can be as shown in equations (1) to (3):

Loss_T = -(1/T) * Σ_{t=1..T} [ y_t * log(ŷ_t) + (1 − y_t) * log(1 − ŷ_t) ]    (1)

ŷ_t = Sigmoid(w_t * x + b_t)    (2)

Sigmoid(z) = 1 / (1 + e^(−z))    (3)

wherein Loss_T is the first loss value, w and b are parameters of the classification model, T is the number of class labels of the sample video, t is a positive integer not greater than T, ŷ_t is the prediction probability of the sample video in the t-th preset category, y_t is the true probability (i.e., the real category information) of the sample video in the t-th preset category, which is specifically 0 or 1, and x is the video feature vector of the sample video. The Sigmoid function is the common S-shaped (sigmoid growth) curve in biology, often used as an activation function of neural networks; it maps its variable z to a value between 0 and 1.
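A direct Python rendering of equations (1) to (3) may look as follows; the clamp for numerical stability and the tensor shapes are implementation assumptions of this sketch.

import torch

def first_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (T,) values w_t * x + b_t; targets: (T,) real probabilities y_t in {0, 1}.
    probs = torch.sigmoid(logits).clamp(1e-7, 1 - 1e-7)    # equations (2) and (3)
    per_label = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs))
    return per_label.mean()                                # equation (1): average over the T labels

loss_t = first_loss(torch.tensor([2.0, -1.0, 0.5]), torch.tensor([1.0, 0.0, 1.0]))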
Optionally, in some embodiments, the step of "calculating a second loss value of the predicted saliency region of the sample video image based on the true saliency region and the non-true saliency region" may include:
determining a true saliency region probability for a predicted saliency region of the sample video image based on a region overlap degree of the predicted saliency region and the true saliency region;
determining a true saliency region probability for a predicted saliency region based on a region overlap degree of the predicted saliency region and the non-true saliency region of the sample video image;
determining the prediction probability of the prediction significance region as a real significance region based on the characteristic map information of the prediction significance region through a preset classification model;
calculating a classification loss of the predicted significance region based on a prediction probability of the predicted significance region and a corresponding true significance region probability;
calculating regression loss of the prediction significance region based on the prediction significance region with the real significance region probability not lower than a preset probability threshold, the position information in the global feature map of the sample video image and the position information of the real significance region in the global feature map of the sample video image;
and fusing the classification loss and the regression loss to obtain a second loss value of the prediction significance region of the sample video image.
The region overlapping degree may specifically be expressed by a region intersection-over-union ratio. In object detection, the Intersection over Union (IoU) is the ratio of the intersection to the union of two regions; its value lies in [0, 1] and can be used to indicate the degree of coincidence between two sets.
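A minimal Python helper for the intersection-over-union of two axis-aligned regions; the (x1, y1, x2, y2) corner-coordinate convention is an assumption of the example.

def iou(box_a, box_b):
    # Intersection over union of two regions given as (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # about 0.143, i.e. a value in [0, 1]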
Optionally, in some embodiments, the true saliency region probability of a predicted saliency region whose region intersection-over-union with a true saliency region is greater than a first preset value may be set to 1, that is, such a predicted saliency region is regarded as a true saliency region; the true saliency region probability of a predicted saliency region whose region intersection-over-union with a non-true saliency region is greater than a second preset value may be set to 0, that is, such a predicted saliency region is regarded as a non-saliency region. The first preset value and the second preset value can be set according to actual conditions.
Wherein the true saliency region probability can be regarded as a true label of each predicted saliency region.
For example, in a specific embodiment, the true saliency region probability of the predicted saliency region having the highest intersection-over-union with a true saliency region may be set to 1, and the true saliency region probability of any predicted saliency region whose intersection-over-union with a true saliency region is greater than 0.7 may also be set to 1.
The preset probability threshold can be set according to actual conditions. In some embodiments, the regression loss may be calculated only for the predicted saliency regions whose true saliency region probability is 1.
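Reusing the iou() helper sketched above, the true saliency region probability of each predicted region can be assigned as follows; the concrete first and second preset values (0.7 and 0.3) are illustrative assumptions only.

def assign_true_region_probability(pred_box, true_boxes, non_true_boxes,
                                   first_preset=0.7, second_preset=0.3):
    # Returns 1.0 (positive sample), 0.0 (negative sample) or None (ignored)
    # for one predicted saliency region, based on its region overlap degree.
    if any(iou(pred_box, t) > first_preset for t in true_boxes):
        return 1.0
    if any(iou(pred_box, n) > second_preset for n in non_true_boxes):
        return 0.0
    return None

p_star = assign_true_region_probability((0, 0, 10, 10), [(0, 0, 10, 11)], [(50, 50, 60, 60)])  # 1.0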
Specifically, the calculation of the second loss value and the total loss value may be as shown in equations (4) and (5):

L({p_i}, {t_i}) = (1/N_cls) * Σ_i L_cls(p_i, p_i*) + λ * (1/N_reg) * Σ_i p_i* * L_reg(t_i, t_i*)    (4)

L_sum = Loss_T + α * L({p_i}, {t_i})    (5)

wherein L_sum denotes the total loss value, α denotes the fusion weight of the first loss value and the second loss value, and L({p_i}, {t_i}) represents the second loss value. The second loss value consists of two parts, a classification loss and a regression loss, and λ is a fusion weight that balances the two. In equation (4), (1/N_cls) * Σ_i L_cls(p_i, p_i*) denotes the classification loss and (1/N_reg) * Σ_i p_i* * L_reg(t_i, t_i*) denotes the regression loss; i denotes the index of each predicted saliency region, p_i is the prediction probability that the i-th predicted saliency region is a true saliency region, p_i* is the true saliency region probability corresponding to the i-th predicted saliency region, t_i represents the position information of a predicted saliency region in the global feature map of the sample video image, t_i* represents the position information of the real saliency region corresponding to that predicted saliency region in the global feature map of the sample video image, N_cls represents the number of predicted saliency regions, and N_reg represents the number of predicted saliency regions whose true saliency region probability is not lower than the preset probability threshold.
For the regression loss of the bounding box (i.e. the prediction significance region), the position information t may be parameterized by 4 coordinates: x, y, w, and h; (x, y) represents the center coordinates of the frame, and w and h represent the width and height of the frame, respectively.
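The sketch below expresses equations (4) and (5) in Python; the concrete choices of binary cross entropy for the classification loss, smooth L1 for the regression loss, and the values of the fusion weights λ and α are assumptions, since this embodiment only requires a classification loss, a regression loss and a weighted fusion.

import torch
import torch.nn.functional as F

def second_loss(p, p_star, t, t_star, lam=1.0):
    # p, p_star: (N,) prediction probabilities and true saliency region probabilities.
    # t, t_star: (N, 4) box parameters (x, y, w, h) of predicted and real saliency regions.
    cls_loss = F.binary_cross_entropy(p, p_star)              # averaged over the predicted regions
    pos = p_star > 0                                          # regions whose p* is not below the threshold
    reg_loss = F.smooth_l1_loss(t[pos], t_star[pos]) if pos.any() else torch.tensor(0.0)
    return cls_loss + lam * reg_loss                          # equation (4)

def total_loss(loss_t, p, p_star, t, t_star, alpha=1.0, lam=1.0):
    return loss_t + alpha * second_loss(p, p_star, t, t_star, lam)   # equation (5)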
The video tag classification effect can be improved by fusing the characteristic information of the video saliency region, the method can be applied to a multi-tag classification scene of a video, and downstream can perform related recommendation and video retrieval according to predicted tags.
As can be seen from the above, the electronic device of this embodiment may obtain at least one target video image, and perform feature extraction on the target video image to obtain a global feature map corresponding to the target video image, where the target video image is derived from a target video; carrying out salient region identification on the global feature map of the target video image, and determining at least one salient region of the global feature map of the target video image; extracting features of each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image; based on the importance of each salient region of the target video image to the classification result of the target video, fusing the feature vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image; fusing image feature vectors of all target video images to obtain video feature vectors of the target videos; classifying the target video based on the video feature vector to obtain at least one class label of the target video. According to the video classification method and device, the regional characteristic vectors of all the salient regions are fused, the representation force of the video characteristic vectors can be enhanced, and the improvement of the video classification accuracy is facilitated.
The method described in the foregoing embodiment will be described in further detail below by way of example in which the video classification apparatus is specifically integrated in a server.
An embodiment of the present application provides a video classification method, and as shown in fig. 2a, a specific process of the video classification method may be as follows:
201. and the server extracts video frames of the target video to obtain at least one target video image.
202. And the server performs feature extraction on the target video image to obtain a global feature map corresponding to the target video image.
203. And the server identifies the salient region of the global feature map of the target video image through a sliding preset window, and determines at least one salient region of the global feature map of the target video image.
Optionally, in some embodiments, the step "the server performs salient region identification on the global feature map of the target video image through a sliding preset window, and determines at least one salient region of the global feature map of the target video image" may include:
sliding on the global feature map of the target video image through a preset window to obtain a plurality of candidate regions of the global feature map of the target video image;
performing significance recognition on each candidate region based on the feature map information of each candidate region in the global feature map;
determining at least one salient region from the candidate regions based on the recognition result.
The significance region of the global feature map may be identified by a region extraction network RPN, the RPN may be trained end to end by back propagation and random gradient descent, and the specific training process may refer to the description in step 106. The RPN is used for generating a candidate region, and then performing classification identification and secondary correction of a prediction frame on the candidate region through an identification network. Specifically, a plurality of preset windows of scales and aspect ratios may be set, so as to obtain candidate regions of a plurality of scales and aspect ratios, and then the candidate regions are classified and regressed.
For example, preset windows of 3 scales and 3 aspect ratios may be used to generate k = 9 candidate regions at each sliding position of the global feature map, that is, 9 candidate regions (3 aspect ratios × 3 scales) are set at each point of the global feature map through the sliding window. For a global feature map of size W × H (typically W·H ≈ 2,400), there are W × H × k candidate regions in total.
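A Python sketch of generating the k = 9 candidate regions per sliding position; the stride, scales and aspect ratios used here are illustrative assumptions and not values required by this embodiment.

import itertools

def candidate_regions(feature_w, feature_h, stride=16,
                      scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Yields (x1, y1, x2, y2) candidate regions: 3 scales x 3 aspect ratios = 9 per position.
    for fx, fy in itertools.product(range(feature_w), range(feature_h)):
        cx, cy = fx * stride, fy * stride            # sliding-window center mapped back to the image
        for scale, ratio in itertools.product(scales, ratios):
            w = scale * ratio ** 0.5
            h = scale / ratio ** 0.5
            yield (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

boxes = list(candidate_regions(40, 60))
print(len(boxes))    # 40 * 60 * 9 = W * H * k candidate regions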
204. And the server performs pooling treatment on each salient region in the global feature map of the target video image to obtain the region feature vector of each salient region of the target video image.
205. And the server performs weighted fusion on the feature map vector of the global feature map of the target video image and the region feature vector of each salient region based on the importance of each salient region of the target video image on the classification result of the target video image to obtain the image feature vector of the target video image.
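A possible Python sketch of this weighted fusion; normalizing the importance scores with a softmax and fusing by addition are assumptions of the sketch (concatenation or another fusion is equally admissible under this embodiment).

import torch

def fuse_image_feature(global_vec: torch.Tensor,
                       region_vecs: torch.Tensor,
                       importance: torch.Tensor) -> torch.Tensor:
    # global_vec: (D,) feature map vector of the global feature map.
    # region_vecs: (R, D) region feature vectors of the saliency regions.
    # importance: (R,) importance of each saliency region to the classification result.
    weights = torch.softmax(importance, dim=0)                 # turn importance into weights
    weighted_regions = (weights[:, None] * region_vecs).sum(dim=0)
    return global_vec + weighted_regions                       # image feature vector of the frame

image_vec = fuse_image_feature(torch.randn(512), torch.randn(4, 512), torch.rand(4))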
206. And the server fuses the image feature vectors of all the target video images to obtain the video feature vectors of the target videos.
The step of fusing the image feature vectors of the target video images to obtain the video feature vectors of the target videos may include:
clustering image feature vectors of all target video images to obtain at least one cluster set, and determining a central feature vector serving as a clustering center in each cluster set;
calculating the difference value between the non-central characteristic vector and the central characteristic vector in each cluster set to obtain the characteristic residual vector of the cluster set;
and fusing the characteristic residual vectors of the cluster sets to obtain the video characteristic vector of the target video.
Optionally, in some embodiments, the step "performing clustering processing on the image feature vectors of each target video image to obtain at least one cluster set, and determining a center feature vector serving as a cluster center in each cluster set" may include:
determining the number K of the cluster sets, wherein K is a positive integer not less than 1;
selecting K image feature vectors from the image feature vectors of the target video image as central feature vectors of K clustering sets respectively;
calculating the vector distance between the image characteristic vector of each target video image and each central characteristic vector;
adding each image feature vector to a cluster set to which a central feature vector closest to the vector of the image feature vector belongs to obtain K cluster sets;
and aiming at each cluster set, selecting image feature vectors meeting the cluster center condition from the cluster sets as new center feature vectors, returning to the step of calculating the vector distance between the image feature vectors of the target video images and the center feature vectors until the center feature vectors of the cluster sets meet the cluster end condition, obtaining K cluster sets, and obtaining the center feature vectors serving as cluster centers in the cluster sets.
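The clustering-and-residual fusion of the image feature vectors can be sketched in Python as below; using a fixed number of iterations as the clustering end condition and concatenating the per-cluster residuals as the final fusion are assumptions of this sketch.

import numpy as np

def video_feature_vector(image_vecs: np.ndarray, k: int = 4, iters: int = 10) -> np.ndarray:
    # image_vecs: (N, D) image feature vectors of the target video images.
    rng = np.random.default_rng(0)
    centers = image_vecs[rng.choice(len(image_vecs), size=k, replace=False)]  # initial center feature vectors
    for _ in range(iters):                                    # simple stand-in for the clustering end condition
        dists = np.linalg.norm(image_vecs[:, None] - centers[None], axis=2)   # vector distances to each center
        assign = dists.argmin(axis=1)                         # nearest center for each image feature vector
        centers = np.stack([image_vecs[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                            for j in range(k)])
    residuals = [np.sum(image_vecs[assign == j] - centers[j], axis=0) if np.any(assign == j)
                 else np.zeros(image_vecs.shape[1]) for j in range(k)]        # feature residual vectors
    return np.concatenate(residuals)                          # video feature vector of the target video

video_vec = video_feature_vector(np.random.randn(30, 512), k=4)   # shape (4 * 512,)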
207. And the server classifies the target video based on the video feature vector to obtain at least one class label of the target video.
After the category label of the target video is obtained, the target video may be labeled with the category label. The video playing platform can perform related video pushing according to the category labels, and a user can also retrieve videos according to the category labels.
In one embodiment, the target video may be multi-label classified by a classification model. Specifically, as shown in fig. 2b, video frame extraction may be performed on the target video to obtain N frames of target video images, and each frame of target video image is passed through a backbone network of the classification model to extract the global feature map, where the backbone network may be an Inception network. For each frame of target video image, on the basis of the global feature map, some candidate saliency regions can be selected through a region extraction network (RPN); pooling processing is then performed on each saliency region to obtain the region feature vector of each saliency region; the global feature map and the region feature vectors of the saliency regions are fused to obtain the image feature vector of each frame of target video image; the image feature vectors of the N frames are then fused to obtain the video feature vector of the target video, and the target video is classified based on the video feature vector.
In the training process of the classification model, the extracted saliency region of the RPN network may be supervised and trained according to the visualization result, and the specific process may refer to the description in step 106. Specifically, positive and negative samples of the saliency areas can be obtained through a thermodynamic diagram (namely a class activation diagram) of Grad-CAM analysis, and then an extraction network of the saliency areas is trained.
Current video classification related technologies include video classification based on an image convolutional neural network, video classification based on a video dual stream (two-stream), video classification based on three-dimensional convolution, and the like. However, these methods all perform the convolution operation on the video frame as a whole, that is, all areas of the frame are treated equally, and no optimization is made for salient regions.
In video classification based on an image convolutional neural network, N frames are extracted from the video; each frame is passed through the convolutional neural network to extract a feature map, which is converted into feature information through a fully connected layer or pooling, so that each frame obtains a feature information representation; the feature information of all frames is then averaged or concatenated to represent the feature information of the video, and a multi-label classification layer is attached to this final video feature representation for training.
In video classification based on a video dual stream (two-stream): N frames are extracted from the video, and each frame is passed through a convolutional neural network to obtain a feature map and feature information (embedding); optical flow information between frames is computed to form optical flow pictures, which are input into a convolutional neural network to likewise obtain optical flow feature maps and embeddings; the frame embeddings and the optical flow embeddings of the multiple frames are fused separately, the probability on each label is computed for each stream, and the probability score of the pictures on each label and the probability score of the optical flow on each label are fused to obtain the final probability score of the video on each label.
In video classification based on three-dimensional convolution: a three-dimensional convolution operation is introduced so that the spatio-temporal information of the video stream can be better captured; since the convolution operates over multiple frames, both the spatial-domain information of each frame and the time-domain information between frames can be extracted.
According to this embodiment of the application, the saliency regions of the video frames can be identified, the region feature information of each saliency region can be extracted, and the region feature information of the saliency regions of the video is fused, which enhances the representation power of the video feature information and improves the effect of multi-label classification of the video. Compared with video classification methods that do not pay attention to saliency regions, the video classification result is greatly improved in terms of mean average precision (mAP).
As can be seen from the above, in this embodiment, the server may perform video frame extraction on the target video to obtain at least one target video image; extracting the features of the target video image to obtain a global feature map corresponding to the target video image; identifying salient regions of the global feature map of the target video image through a sliding preset window, and determining at least one salient region of the global feature map of the target video image; pooling each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image; based on the importance of each salient region of the target video image to the classification result of the target video, carrying out weighted fusion on the feature map vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image; fusing image feature vectors of all target video images to obtain video feature vectors of the target videos; classifying the target video based on the video feature vector to obtain at least one class label of the target video. According to the video classification method and device, the regional characteristic vectors of all the salient regions are fused, the representation force of the video characteristic vectors can be enhanced, and the improvement of the video classification accuracy is facilitated.
In order to better implement the above method, an embodiment of the present application further provides a video classification apparatus, as shown in fig. 3a, the video classification apparatus may include an obtaining unit 301, an identifying unit 302, an extracting unit 303, a first fusing unit 304, a second fusing unit 305, and a classifying unit 306, as follows:
(1) an acquisition unit 301;
an obtaining unit 301, configured to obtain at least one target video image, and perform feature extraction on the target video image to obtain a global feature map corresponding to the target video image, where the target video image is derived from a target video.
Optionally, in some embodiments of the present application, the obtaining unit 301 may be specifically configured to perform feature extraction on the target video image through a classification model to obtain a global feature map corresponding to the target video image.
(2) An identification unit 302;
the identifying unit 302 is configured to perform salient region identification on the global feature map of the target video image, and determine at least one salient region of the global feature map of the target video image.
Optionally, in some embodiments of the present application, the identifying unit 302 may include a sliding subunit 3021, a first identifying subunit 3022, and a first determining subunit 3023, see fig. 3b, as follows:
the sliding subunit 3021 is configured to slide on the global feature map of the target video image through a preset window to obtain a plurality of candidate regions of the global feature map of the target video image;
a first identifying subunit 3022, configured to perform saliency identification on each candidate region based on feature map information of each candidate region in the global feature map;
a first determining subunit 3023, configured to determine at least one significant region from the candidate regions based on the recognition result.
Optionally, in some embodiments of the present application, the identifying unit 302 may further include a frame regression subunit 3024, a second identifying subunit 3025, and a screening subunit 3026, see fig. 3c, as follows:
the frame regression subunit 3024 is configured to perform frame regression on the candidate saliency region by using the determined saliency region as a candidate saliency region, to obtain a frame-adjusted candidate saliency region;
a second identifying subunit 3025, configured to perform saliency identification on the frame-adjusted candidate saliency region based on feature map information of the frame-adjusted candidate saliency region in the global feature map;
a screening subunit 3026, configured to screen the candidate saliency region after the frame adjustment based on the recognition result, so as to obtain the saliency region of the target video image.
Optionally, in some embodiments of the present application, the identifying unit 302 may be specifically configured to perform, through the classification model, salient region identification on the global feature map of the target video image, and determine at least one salient region of the global feature map of the target video image.
(3) An extraction unit 303;
the extracting unit 303 is configured to perform feature extraction on each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image.
Optionally, in some embodiments of the application, the extraction unit 303 may be specifically configured to perform pooling processing on each salient region in the global feature map of the target video image, so as to obtain a region feature vector of each salient region of the target video image.
(4) A first fusion unit 304;
a first fusion unit 304, configured to fuse, based on the importance of each salient region of the target video image to the classification result of the target video, a feature vector of a global feature map of the target video image and a region feature vector of each salient region, so as to obtain an image feature vector of the target video image.
Optionally, in some embodiments of the present application, the first fusion unit 304 may include a second determining subunit 3041 and a weighting subunit 3042, see fig. 3d, as follows:
the second determining subunit 3041 is configured to determine, based on the importance of each saliency region of the target video image to the classification result of the target video, a weight corresponding to each saliency region of the target video image;
a weighting subunit 3042, configured to perform weighting processing on the feature map vector of the global feature map of the target video image and the region feature vectors of each salient region based on the weights, to obtain an image feature vector of the target video image.
(5) A second fusion unit 305;
the second fusion unit 305 is configured to fuse the image feature vectors of the target video images to obtain the video feature vectors of the target video.
Optionally, in some embodiments of the present application, the second fusion unit 305 may include a clustering subunit 3051, a first calculating subunit 3052, and a first fusion subunit 3053, see fig. 3e, as follows:
the clustering subunit 3051 is configured to perform clustering processing on the image feature vectors of the target video images to obtain at least one cluster set, and determine a central feature vector serving as a clustering center in each cluster set;
the first calculating subunit 3052, configured to calculate, for each cluster set, a difference between a non-central feature vector and a central feature vector in the cluster set, to obtain a feature residual vector of the cluster set;
and the first fusion subunit 3053 is configured to fuse the feature residual vectors of the cluster sets to obtain a video feature vector of the target video.
Optionally, in some embodiments of the present application, the clustering subunit 3051 may be specifically configured to determine a number K of cluster sets, where K is a positive integer not less than 1;
selecting K image feature vectors from the image feature vectors of the target video image as central feature vectors of K clustering sets respectively;
calculating the vector distance between the image characteristic vector of each target video image and each central characteristic vector;
adding each image feature vector to a cluster set to which a central feature vector closest to the vector of the image feature vector belongs to obtain K cluster sets;
and aiming at each cluster set, selecting image feature vectors meeting the cluster center condition from the cluster sets as new center feature vectors, returning to the step of calculating the vector distance between the image feature vectors of the target video images and the center feature vectors until the center feature vectors of the cluster sets meet the cluster end condition, obtaining K cluster sets, and obtaining the center feature vectors serving as cluster centers in the cluster sets.
(6) A classification unit 306;
a classifying unit 306, configured to classify the target video based on the video feature vector to obtain at least one class label of the target video.
Optionally, in some embodiments of the present application, the classification unit 306 may be specifically configured to classify, by the classification model, the target video based on the video feature vector to obtain at least one class label of the target video.
Optionally, in some embodiments of the present application, the video classification apparatus further includes a training unit 307, where the training unit 307 is configured to train a classification model; the training unit 307 may comprise a first obtaining sub-unit 3071, a first extracting sub-unit 3072, a second extracting sub-unit 3073, a second fusing sub-unit 3074, a third determining sub-unit 3075, a second calculating sub-unit 3076 and an adjusting sub-unit 3077, see fig. 3f, as follows:
the first obtaining subunit 3071 is configured to obtain training data, where the training data includes a sample video image of a sample video and real category information corresponding to the sample video;
the first extraction subunit 3072 is configured to perform feature extraction on the sample video image through a preset classification model to obtain a global feature map corresponding to the sample video image, perform saliency region identification on the global feature map of the sample video image, and determine at least one predicted saliency region of the global feature map of the sample video image;
a second extraction subunit 3073, configured to perform feature extraction on each prediction saliency region in the global feature map of the sample video image to obtain a region feature vector of each prediction saliency region of the sample video image, and fuse the feature map vector of the global feature map of the sample video image and the region feature vector of each prediction saliency region based on the importance of each prediction saliency region of the sample video image to the classification result of the sample video to obtain an image feature vector of the sample video image;
the second fusion subunit 3074 is configured to fuse the image feature vectors of the sample video images to obtain video feature vectors of the sample videos;
a third determining subunit 3075, configured to determine, based on the video feature vector, prediction probability information of the sample video in each preset category;
a second calculating sub-unit 3076 for calculating a first loss value between the prediction probability information and the true category information of the sample video;
an adjusting subunit 3077, configured to adjust parameters of a preset classification model based on the first loss value, so as to obtain a classification model meeting a preset condition.
Optionally, in some embodiments of the present application, the training unit 307 may further include a third calculating subunit 3078, a fourth determining subunit 3079, a second obtaining subunit 307A and a third obtaining subunit 307B, which operate before the adjusting subunit 3077 adjusts the parameters of the preset classification model based on the first loss value to obtain the classification model meeting the preset conditions, see fig. 3g, as follows:
the third computing subunit 3078, configured to compute a gradient of the first loss value to the video feature vector of the sample video, and based on the gradient, draw a thermodynamic diagram corresponding to a global feature map of a sample video image of the sample video;
a fourth determining subunit 3079, configured to determine category information of the sample video based on the prediction probability information of the sample video;
a second obtaining subunit 307A, configured to, when the category information of the sample video is consistent with the real category information, obtain a saliency area of a global feature map of the sample video image based on the thermodynamic diagram, and set the obtained saliency area as a real saliency area of the sample video image;
a third obtaining subunit 307B, configured to, when the category information of the sample video is inconsistent with the real category information, obtain, based on the thermodynamic diagram, a non-significant region of a global feature map of the sample video image, and set the obtained non-significant region as a non-real significant region of the sample video image;
the adjusting subunit 3077 may be specifically configured to calculate a second loss value of the predicted saliency region of the sample video image based on the true saliency region and the non-true saliency region; and adjusting parameters of a preset classification model based on the first loss value and the second loss value to obtain the classification model meeting preset conditions.
Optionally, in some embodiments of the present application, the step of "calculating a second loss value of the predicted saliency region of the sample video image based on the true saliency region and the non-true saliency region" may include:
determining a true saliency region probability for a predicted saliency region of the sample video image based on a region overlap degree of the predicted saliency region and the true saliency region;
determining a true saliency region probability for a predicted saliency region based on a region overlap degree of the predicted saliency region and the non-true saliency region of the sample video image;
determining the prediction probability of the prediction significance region as a real significance region based on the characteristic map information of the prediction significance region through a preset classification model;
calculating a classification loss of the predicted significance region based on a prediction probability of the predicted significance region and a corresponding true significance region probability;
calculating regression loss of the prediction significance region based on the prediction significance region with the real significance region probability not lower than a preset probability threshold, the position information in the global feature map of the sample video image and the position information of the real significance region in the global feature map of the sample video image;
and fusing the classification loss and the regression loss to obtain a second loss value of the prediction significance region of the sample video image.
As can be seen from the above, in this embodiment, the obtaining unit 301 obtains at least one target video image, and performs feature extraction on the target video image to obtain a global feature map corresponding to the target video image, where the target video image is derived from a target video; performing salient region identification on the global feature map of the target video image through an identification unit 302, and determining at least one salient region of the global feature map of the target video image; extracting features of each salient region in the global feature map of the target video image through an extraction unit 303 to obtain a region feature vector of each salient region of the target video image; based on the importance of each salient region of the target video image to the classification result of the target video, the first fusion unit 304 fuses the feature map vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image; fusing the image feature vectors of the target video images through a second fusion unit 305 to obtain video feature vectors of the target videos; the target video is classified by the classification unit 306 based on the video feature vector, resulting in at least one class label of the target video. According to the video classification method and device, the regional characteristic vectors of all the salient regions are fused, the representation force of the video characteristic vectors can be enhanced, and the improvement of the video classification accuracy is facilitated.
An electronic device according to an embodiment of the present application is further provided, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to an embodiment of the present application, specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 4 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
obtaining at least one target video image, and performing feature extraction on the target video image to obtain a global feature map corresponding to the target video image, wherein the target video image is derived from a target video; carrying out salient region identification on the global feature map of the target video image, and determining at least one salient region of the global feature map of the target video image; extracting features of each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image; based on the importance of each salient region of the target video image to the classification result of the target video, fusing the feature vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image; fusing image feature vectors of all target video images to obtain video feature vectors of the target videos; classifying the target video based on the video feature vector to obtain at least one class label of the target video.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, in this embodiment, at least one target video image may be obtained, and feature extraction is performed on the target video image to obtain a global feature map corresponding to the target video image, where the target video image is derived from a target video; carrying out salient region identification on the global feature map of the target video image, and determining at least one salient region of the global feature map of the target video image; extracting features of each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image; based on the importance of each salient region of the target video image to the classification result of the target video, fusing the feature vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image; fusing image feature vectors of all target video images to obtain video feature vectors of the target videos; classifying the target video based on the video feature vector to obtain at least one class label of the target video. According to the video classification method and device, the regional characteristic vectors of all the salient regions are fused, the representation force of the video characteristic vectors can be enhanced, and the improvement of the video classification accuracy is facilitated.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the video classification methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:
obtaining at least one target video image, and performing feature extraction on the target video image to obtain a global feature map corresponding to the target video image, wherein the target video image is derived from a target video; carrying out salient region identification on the global feature map of the target video image, and determining at least one salient region of the global feature map of the target video image; extracting features of each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image; based on the importance of each salient region of the target video image to the classification result of the target video, fusing the feature vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image; fusing image feature vectors of all target video images to obtain video feature vectors of the target videos; classifying the target video based on the video feature vector to obtain at least one class label of the target video.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any video classification method provided in the embodiments of the present application, beneficial effects that can be achieved by any video classification method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the video classification aspect described above.
The foregoing detailed description is directed to a video classification method, apparatus, electronic device and storage medium provided in the embodiments of the present application, and specific examples are applied in the present application to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (14)

1. A method of video classification, comprising:
obtaining at least one target video image, and performing feature extraction on the target video image to obtain a global feature map corresponding to the target video image, wherein the target video image is derived from a target video;
carrying out salient region identification on the global feature map of the target video image, and determining at least one salient region of the global feature map of the target video image;
extracting features of each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image;
based on the importance of each salient region of the target video image to the classification result of the target video, fusing the feature vector of the global feature map of the target video image and the region feature vector of each salient region to obtain the image feature vector of the target video image;
fusing image feature vectors of all target video images to obtain video feature vectors of the target videos;
classifying the target video based on the video feature vector to obtain at least one class label of the target video.
2. The method according to claim 1, wherein the performing salient region identification on the global feature map of the target video image, and determining at least one salient region of the global feature map of the target video image comprises:
sliding on the global feature map of the target video image through a preset window to obtain a plurality of candidate regions of the global feature map of the target video image;
performing significance recognition on each candidate region based on the feature map information of each candidate region in the global feature map;
determining at least one salient region from the candidate regions based on the recognition result.
3. The method of claim 2, wherein after determining at least one salient region from the candidate regions based on the identification result, the method further comprises:
taking the determined saliency region as a candidate saliency region, and performing frame regression on the candidate saliency region to obtain a frame-adjusted candidate saliency region;
performing saliency recognition on the candidate saliency areas after frame adjustment based on the feature map information of the candidate saliency areas after frame adjustment in the global feature map;
and screening the candidate saliency areas after the frame adjustment based on the identification result to obtain the saliency areas of the target video image.
4. The method according to claim 1, wherein the extracting features of each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image comprises:
and pooling each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image.
5. The method according to claim 1, wherein the fusing the feature map vector of the global feature map of the target video image and the region feature vector of each salient region based on the importance of each salient region of the target video image to the classification result of the target video to obtain the image feature vector of the target video image comprises:
determining weights corresponding to the various salient regions of the target video image based on the importance of the various salient regions of the target video image to the classification result of the target video;
and based on the weight, carrying out weighting processing on the feature vector of the global feature map of the target video image and the regional feature vector of each salient region to obtain the image feature vector of the target video image.
6. The method according to claim 1, wherein the fusing the image feature vectors of the respective target video images to obtain the video feature vector of the target video comprises:
clustering the image feature vectors of the target video images to obtain at least one cluster set, and determining a center feature vector serving as a cluster center in each cluster set;
calculating a difference between each non-center feature vector and the center feature vector in each cluster set to obtain a feature residual vector of the cluster set;
and fusing the feature residual vectors of the cluster sets to obtain the video feature vector of the target video.
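Claim 6 aggregates per-frame image feature vectors through cluster-center residuals, in the spirit of VLAD-style pooling. The per-cluster residual sum and concatenation shown below are one possible fusion; the claim leaves the exact fusion operator open.

```python
# Sketch of claim 6: compute residuals of frame feature vectors against their
# cluster centers and fuse them into a single video feature vector.
import numpy as np

def residual_aggregate(frame_vecs, centers, assignments):
    K, D = centers.shape
    agg = np.zeros((K, D))
    for v, k in zip(frame_vecs, assignments):
        agg[k] += v - centers[k]                  # feature residual vector
    return agg.reshape(-1)                        # fused video feature vector

vecs = np.random.rand(10, 16)
centers = np.random.rand(2, 16)
assign = np.argmin(((vecs[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
print(residual_aggregate(vecs, centers, assign).shape)   # (2 * 16,)
```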
7. The method according to claim 6, wherein the clustering the image feature vectors of the target video images to obtain at least one cluster set and determining a center feature vector serving as a cluster center in each cluster set comprises:
determining a number K of cluster sets, wherein K is a positive integer not less than 1;
selecting K image feature vectors from the image feature vectors of the target video images as the center feature vectors of the K cluster sets, respectively;
calculating a vector distance between the image feature vector of each target video image and each center feature vector;
adding each image feature vector to the cluster set to which the center feature vector closest to the image feature vector belongs, to obtain K cluster sets;
and for each cluster set, selecting an image feature vector satisfying a cluster center condition from the cluster set as a new center feature vector, and returning to the step of calculating the vector distance between the image feature vector of each target video image and each center feature vector, until the center feature vectors of the cluster sets satisfy a clustering end condition, thereby obtaining the K cluster sets and the center feature vector serving as the cluster center in each cluster set.
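The loop in claim 7 is a K-means-style clustering in which the new center is itself selected from the image feature vectors of the set (closer to a medoid update than a mean update). A compact sketch under that reading:

```python
# Sketch of the claim-7 clustering loop: assign vectors to their nearest center,
# pick the member closest to the cluster mean as the new center, and repeat
# until the centers stop changing (the assumed "clustering end condition").
import numpy as np

def cluster(vecs, K=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = vecs[rng.choice(len(vecs), K, replace=False)]     # initial center feature vectors
    for _ in range(iters):
        dists = ((vecs[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)                           # nearest-center assignment
        new_centers = centers.copy()
        for k in range(K):
            members = vecs[assign == k]
            if len(members):
                mean = members.mean(axis=0)
                new_centers[k] = members[((members - mean) ** 2).sum(-1).argmin()]
        if np.allclose(new_centers, centers):                   # clustering end condition
            break
        centers = new_centers
    return centers, assign

vecs = np.random.rand(12, 16)
centers, assign = cluster(vecs)
print(centers.shape, assign)
```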
8. The method according to claim 1, wherein the performing feature extraction on the target video image to obtain a global feature map corresponding to the target video image comprises:
performing feature extraction on the target video image through a classification model to obtain a global feature map corresponding to the target video image;
the identifying the salient regions of the global feature map of the target video image and determining at least one salient region of the global feature map of the target video image comprise:
performing salient region identification on the global feature map of the target video image through the classification model, and determining at least one salient region of the global feature map of the target video image;
the classifying the target video based on the video feature vector to obtain at least one category label of the target video includes:
and classifying the target video based on the video feature vector through the classification model to obtain at least one class label of the target video.
9. The method according to claim 8, wherein before the feature extraction is performed on the target video image through the classification model to obtain the global feature map corresponding to the target video image, the method further comprises:
acquiring training data, wherein the training data comprises a sample video image of a sample video and real category information corresponding to the sample video;
performing feature extraction on the sample video image through a preset classification model to obtain a global feature map corresponding to the sample video image, performing salient region identification on the global feature map of the sample video image, and determining at least one predicted salient region of the global feature map of the sample video image;
extracting features of each predicted salient region in the global feature map of the sample video image to obtain a region feature vector of each predicted salient region of the sample video image, and fusing, based on the importance of each predicted salient region of the sample video image to the classification result of the sample video, the feature vector of the global feature map of the sample video image and the region feature vector of each predicted salient region to obtain an image feature vector of the sample video image;
fusing the image feature vectors of the sample video images to obtain a video feature vector of the sample video;
determining prediction probability information of the sample video on each preset category based on the video feature vector;
calculating a first loss value between the prediction probability information and the real category information of the sample video;
and adjusting parameters of the preset classification model based on the first loss value to obtain a classification model satisfying a preset condition.
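Claim 9 trains the classification model by comparing the predicted per-category probabilities with the real category information. A cross-entropy form of the first loss is sketched below; the claim does not fix the loss function, so this is an assumption (multi-label videos might instead use binary cross-entropy per category).

```python
# Sketch of the first loss in claim 9: cross-entropy between the prediction
# probability information and the real category of the sample video.
import numpy as np

def first_loss(pred_probs, true_index, eps=1e-9):
    return -np.log(pred_probs[true_index] + eps)

pred = np.array([0.1, 0.7, 0.2])   # prediction probability information over preset categories
print(first_loss(pred, true_index=1))
```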
10. The method according to claim 9, wherein before the adjusting parameters of the preset classification model based on the first loss value to obtain a classification model satisfying a preset condition, the method further comprises:
calculating a gradient of the first loss value with respect to the video feature vector of the sample video, and drawing a heat map corresponding to the global feature map of a sample video image of the sample video based on the gradient;
determining category information of the sample video based on the prediction probability information of the sample video;
when the category information of the sample video is consistent with the real category information, acquiring a salient region of the global feature map of the sample video image based on the heat map, and setting the acquired salient region as a true salient region of the sample video image;
when the category information of the sample video is inconsistent with the real category information, acquiring a non-salient region of the global feature map of the sample video image based on the heat map, and setting the acquired non-salient region as a non-true salient region of the sample video image;
the adjusting parameters of the preset classification model based on the first loss value to obtain a classification model satisfying a preset condition comprises:
calculating a second loss value for a predicted salient region of the sample video image based on the true salient region and the non-true salient region;
and adjusting parameters of the preset classification model based on the first loss value and the second loss value to obtain the classification model satisfying the preset condition.
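Claim 10 turns the gradient of the first loss into a Grad-CAM-style heat map over the global feature map and uses it to mint pseudo labels: high-activation regions become true salient regions when the video was classified correctly, low-activation regions become non-true salient regions otherwise. The channel-wise gradient weighting, the thresholds, and the randomly faked gradients below are illustrative assumptions only.

```python
# Sketch of claim 10: build a gradient-weighted heat map over the global feature
# map, then threshold it into pseudo true / non-true salient regions.
import numpy as np

def heat_map(fmap, grad_weights):
    hm = np.maximum((fmap * grad_weights).sum(axis=-1), 0)     # weighted sum + ReLU
    return hm / (hm.max() + 1e-9)                              # normalise to [0, 1]

fmap = np.random.rand(8, 8, 16)
grads = np.random.randn(16)                 # stand-in for the gradient of the first loss
hm = heat_map(fmap, grads)
correct = True                              # whether predicted category matched the real one
if correct:
    true_salient_mask = hm >= 0.7           # pseudo true salient region
    print(true_salient_mask.sum(), "cells marked as true salient region")
else:
    non_true_mask = hm <= 0.3               # pseudo non-true salient region
    print(non_true_mask.sum(), "cells marked as non-true salient region")
```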
11. The method of claim 10, wherein the calculating a second loss value for a predicted salient region of the sample video image based on the true salient region and the non-true salient region comprises:
determining a true salient region probability of the predicted salient region of the sample video image based on a degree of region overlap between the predicted salient region and the true salient region;
determining a true salient region probability of the predicted salient region based on a degree of region overlap between the predicted salient region and the non-true salient region of the sample video image;
determining, through the preset classification model, a prediction probability that the predicted salient region is a true salient region based on feature map information of the predicted salient region;
calculating a classification loss of the predicted salient region based on the prediction probability of the predicted salient region and the corresponding true salient region probability;
calculating a regression loss of the predicted salient region based on position information, in the global feature map of the sample video image, of a predicted salient region whose true salient region probability is not lower than a preset probability threshold, and position information of the true salient region in the global feature map of the sample video image;
and fusing the classification loss and the regression loss to obtain the second loss value of the predicted salient region of the sample video image.
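Claim 11 combines a classification loss on the predicted "is salient" probability with a box regression loss for regions whose true-salient-region probability passes a threshold. The IoU-based labelling, log loss, and smooth-L1 regression below are common choices used here purely as assumptions.

```python
# Sketch of claim 11: label a predicted salient region by IoU with the true
# salient region, compute a classification loss on its predicted probability,
# add a smooth-L1 box regression loss for positive regions, and sum the two.
import numpy as np

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5).sum()

pred_box, true_box = (1, 1, 5, 5), (1, 2, 5, 6)            # (x1, y1, x2, y2)
p_salient = 0.8                                            # predicted "is salient" probability
target = 1.0 if iou(pred_box, true_box) >= 0.5 else 0.0    # assumed IoU threshold for positives
cls_loss = -(target * np.log(p_salient) + (1 - target) * np.log(1 - p_salient))
reg_loss = smooth_l1(np.array(pred_box, float) - np.array(true_box, float)) if target else 0.0
second_loss = cls_loss + reg_loss                          # fused second loss value
print(round(second_loss, 3))
```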
12. A video classification apparatus, comprising:
an acquisition unit, configured to acquire at least one target video image and perform feature extraction on the target video image to obtain a global feature map corresponding to the target video image, wherein the target video image is derived from a target video;
an identification unit, configured to perform salient region identification on the global feature map of the target video image and determine at least one salient region of the global feature map of the target video image;
an extraction unit, configured to extract features of each salient region in the global feature map of the target video image to obtain a region feature vector of each salient region of the target video image;
a first fusion unit, configured to fuse the feature vector of the global feature map of the target video image and the region feature vector of each salient region based on the importance of each salient region of the target video image to the classification result of the target video, to obtain an image feature vector of the target video image;
a second fusion unit, configured to fuse the image feature vectors of the target video images to obtain a video feature vector of the target video;
and a classification unit, configured to classify the target video based on the video feature vector to obtain at least one class label of the target video.
13. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the operations of the video classification method according to any one of claims 1 to 11.
14. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the video classification method according to any one of claims 1 to 11.
CN202010941467.1A 2020-09-09 2020-09-09 Video classification method and device, electronic equipment and storage medium Active CN112131978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010941467.1A CN112131978B (en) 2020-09-09 2020-09-09 Video classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010941467.1A CN112131978B (en) 2020-09-09 2020-09-09 Video classification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112131978A true CN112131978A (en) 2020-12-25
CN112131978B CN112131978B (en) 2023-09-01

Family

ID=73845302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010941467.1A Active CN112131978B (en) 2020-09-09 2020-09-09 Video classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112131978B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902202A (en) * 2019-01-08 2019-06-18 国家计算机网络与信息安全管理中心 A kind of video classification methods and device
CN110796204A (en) * 2019-11-01 2020-02-14 腾讯科技(深圳)有限公司 Video tag determination method and device and server
CN111274995A (en) * 2020-02-13 2020-06-12 腾讯科技(深圳)有限公司 Video classification method, device, equipment and computer readable storage medium
CN111310676A (en) * 2020-02-21 2020-06-19 重庆邮电大学 Video motion recognition method based on CNN-LSTM and attention

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396571A (en) * 2021-01-20 2021-02-23 浙江鹏信信息科技股份有限公司 Attention mechanism-based EfficientNet sensitive image detection method and system
CN112749701B (en) * 2021-01-22 2024-02-09 北京百度网讯科技有限公司 License plate offset classification model generation method and license plate offset classification method
CN112749701A (en) * 2021-01-22 2021-05-04 北京百度网讯科技有限公司 Method for generating license plate contamination classification model and license plate contamination classification method
CN112818888A (en) * 2021-02-09 2021-05-18 广州市百果园信息技术有限公司 Video auditing model training method, video auditing method and related device
CN112560999A (en) * 2021-02-18 2021-03-26 成都睿沿科技有限公司 Target detection model training method and device, electronic equipment and storage medium
CN112784815A (en) * 2021-02-19 2021-05-11 苏州市大智无疆智能科技有限公司 Unmanned aerial vehicle cruising target identification method and device and cloud server
CN112784815B (en) * 2021-02-19 2024-05-03 苏州科知律信息科技有限公司 Unmanned aerial vehicle cruising target identification method and device and cloud server
CN112836676A (en) * 2021-03-01 2021-05-25 创新奇智(北京)科技有限公司 Abnormal behavior detection method and device, electronic equipment and storage medium
CN112836676B (en) * 2021-03-01 2022-11-01 创新奇智(北京)科技有限公司 Abnormal behavior detection method and device, electronic equipment and storage medium
CN113033518A (en) * 2021-05-25 2021-06-25 北京中科闻歌科技股份有限公司 Image detection method, image detection device, electronic equipment and storage medium
CN113033518B (en) * 2021-05-25 2021-08-31 北京中科闻歌科技股份有限公司 Image detection method, image detection device, electronic equipment and storage medium
CN114662595A (en) * 2022-03-25 2022-06-24 王登辉 Big data fusion processing method and system
CN115035566A (en) * 2022-05-07 2022-09-09 北京大学深圳医院 Expression recognition method and device, computer equipment and computer-readable storage medium
CN115457259B (en) * 2022-09-14 2023-10-31 华洋通信科技股份有限公司 Image rapid saliency detection method based on multichannel activation optimization
CN115457259A (en) * 2022-09-14 2022-12-09 华洋通信科技股份有限公司 Image rapid saliency detection method based on multi-channel activation optimization
CN115935008A (en) * 2023-02-16 2023-04-07 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment
CN115935008B (en) * 2023-02-16 2023-05-30 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment

Also Published As

Publication number Publication date
CN112131978B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
WO2020182121A1 (en) Expression recognition method and related device
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
Chanti et al. Improving bag-of-visual-words towards effective facial expressive image classification
Elguebaly et al. Simultaneous high-dimensional clustering and feature selection using asymmetric Gaussian mixture models
CN113052150B (en) Living body detection method, living body detection device, electronic apparatus, and computer-readable storage medium
CN113255354B (en) Search intention recognition method, device, server and storage medium
Shang et al. Image spam classification based on convolutional neural network
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN111242019A (en) Video content detection method and device, electronic equipment and storage medium
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN114973349A (en) Face image processing method and training method of face image processing model
CN112906730B (en) Information processing method, device and computer readable storage medium
CN112633425B (en) Image classification method and device
CN113824989B (en) Video processing method, device and computer readable storage medium
Guo et al. Design of a smart art classroom system based on Internet of Things
CN113569809A (en) Image processing method, device and computer readable storage medium
CN113704544A (en) Video classification method and device, electronic equipment and storage medium
CN113762041A (en) Video classification method and device, computer equipment and storage medium
CN113705307A (en) Image processing method, device, equipment and storage medium
Bhavana et al. Multimedia Content Mining Based on Web Categorization (MCMWC) Using AlexNet and Ensemble Net
Khadhraoui et al. Local generic representation for patch uLBP-based face recognition with single training sample per subject
Yao et al. Adaptive ensemble clustering for image segmentation in remote sensing
Kobets et al. Method of Recognition and Indexing of People’s Faces in Videos Using Model of Machine Learning
CN117575894B (en) Image generation method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40035415

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant