CN111291602A - Video detection method and device, electronic equipment and computer readable storage medium - Google Patents

Video detection method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN111291602A
Authority
CN
China
Prior art keywords
image
video
key frame
clustering
detected
Prior art date
Legal status
Pending
Application number
CN201811496505.6A
Other languages
Chinese (zh)
Inventor
黄君实 (Huang Junshi)
罗玄 (Luo Xuan)
陈强 (Chen Qiang)
Current Assignee
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd
Priority to CN201811496505.6A
Publication of CN111291602A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application provides a video detection method, a video detection device, electronic equipment and a computer readable storage medium, applied to the technical field of video detection. The method comprises: extracting image features of key frame images in a video to be detected through a feature extraction network; processing the extracted image features through a pooling network to obtain fixed-length image feature vectors of the key frame images; inputting the obtained fixed-length image feature vectors into a classification network to obtain vulgar detection results of the key frame images; and determining the vulgar detection result of the video to be detected based on the vulgar detection results of the key frame images. That is, the vulgar detection result of each key frame image is determined from its image features, and the vulgar detection result of the video to be detected is then determined from those frame-level results, so that automatic detection of whether the video to be detected is a vulgar video is achieved while the detection cost of the video to be detected is reduced.

Description

Video detection method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of video detection technologies, and in particular, to a video detection method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of video production technology, people can produce and share their own videos, and the number of videos on the network has grown rapidly. Among the large number of videos on the network, some are "vulgar" (i.e., mediocre, coarse and kitschy) videos whose information content is of very low quality, and the spread of such vulgar videos has adverse effects on society. Realizing vulgar detection of videos is therefore of great significance.
At present, vulgar detection of videos is performed manually: workers of the corresponding video platform browse and watch the videos on the platform one by one and then determine whether each video is a vulgar video. With this manual approach, workers need to inspect the videos one by one, and for some videos they can determine whether the video is vulgar only after watching its complete content, so the detection efficiency is very low. The existing manual approach to vulgar detection of videos therefore suffers from low efficiency and high cost.
Disclosure of Invention
The application provides a video detection method, a video detection device, electronic equipment and a computer-readable storage medium, which are used for improving the video vulgar detection efficiency and reducing the video vulgar detection cost, and the technical scheme adopted by the application is as follows:
in a first aspect, a video detection method is provided, the method comprising,
extracting image characteristics of key frame images in a video to be detected through a characteristic extraction network;
processing the extracted image features through a pooling network to obtain image feature vectors with fixed lengths of the key frame images;
inputting the obtained fixed-length image feature vectors into a classification network to obtain a vulgar detection result of the key frame image;
and determining the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image.
Further, the processing of the extracted image features through the pooling network to obtain the fixed-length image feature vectors of the key frame images includes at least one of the following:
inputting the extracted image features into a global average pooling network to obtain fixed-length image feature vectors of the key frame images;
and inputting the extracted image features into a VLAD pooling network to obtain fixed-length image feature vectors of the key frame images.
Further, inputting the extracted image features into a VLAD pooling network to obtain the fixed-length image feature vectors of the key frame images includes:
clustering the plurality of image features to obtain a plurality of clustering centers;
calculating residual values between the feature value of each image feature and the feature value of its corresponding clustering center, and, for any clustering center, summing the residual values between that clustering center and its corresponding image features to obtain a sum of residual values;
and determining the fixed-length image feature vector of the key frame image based on the sum of the obtained residual values respectively corresponding to the clustering centers.
Further, the method for determining the key frame image of the video to be detected comprises the following steps:
decoding a video to be detected to obtain a plurality of video frame images;
clustering the plurality of video frame images based on the image characteristics of each video frame image to obtain at least one clustering group;
and respectively determining one video frame image from each clustering group as a key frame image.
Further, determining the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image comprises the following steps:
determining the vulgar detection results of a plurality of key frame images respectively;
and performing weighted fusion processing on the vulgar detection results of the plurality of key frame images, and determining the vulgar detection result of the video to be detected based on the result of the weighted fusion processing.
In a second aspect, there is provided a video detection apparatus, the apparatus comprising,
the extraction module is used for extracting the image characteristics of the key frame image in the video to be detected through a characteristic extraction network;
the characteristic processing module is used for processing the image characteristics extracted by the extraction module through a pooling network to obtain image characteristic vectors with fixed lengths of the key frame images;
the detection module is used for inputting the fixed-length image feature vectors processed by the feature processing module into a classification network to obtain a vulgar detection result of the key frame image;
and the first determining module is used for determining the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image detected by the detecting module.
Further, the feature processing module is used for inputting the extracted image features into a global average pooling network to obtain the fixed-length image feature vectors of the key frame images; and/or for inputting the extracted image features into a VLAD pooling network to obtain the fixed-length image feature vectors of the key frame images.
Further, the feature processing module includes: the device comprises a clustering processing unit, a calculating unit and a first determining unit;
the clustering unit is used for clustering the image characteristics to obtain a plurality of clustering centers;
the calculating unit is used for calculating residual values between the feature value of each image feature and the feature value of its corresponding clustering center determined by the clustering processing unit, and, for any clustering center, summing the residual values between that clustering center and its corresponding image features to obtain a sum of residual values;
and the first determining unit is used for determining the fixed-length image feature vector of the key frame image based on the sums of residual values respectively corresponding to the clustering centers calculated by the calculating unit.
Further, the apparatus further comprises: the device comprises a decoding module, a clustering module and a second determining module;
the decoding module is used for decoding the video to be detected to obtain a plurality of video frame images;
the clustering module is used for clustering the plurality of video frame images based on the image characteristics of each video frame image decoded by the decoding module to obtain at least one clustering group;
and the second determining module is used for determining one video frame image as a key frame image from each clustering group obtained by clustering processing of the clustering module.
Further, the first determining module includes: a second determining unit and a third determining unit;
the second determining unit is used for respectively determining the vulgar detection results of the plurality of key frame images;
and the third determining unit is used for performing weighted fusion processing on the vulgar detection results of the plurality of key frame images determined by the second determining unit and determining the vulgar detection result of the video to be detected based on the result of the weighted fusion processing.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the video detection method shown in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, which is used for storing computer instructions, which when run on a computer, make the computer perform the video detection method shown in the first aspect.
Compared with the prior art of performing vulgar detection of videos manually, the video detection method, device, electronic equipment and computer-readable storage medium of the application extract the image features of a key frame image in a video to be detected through a feature extraction network, process the extracted image features through a pooling network to obtain a fixed-length image feature vector of the key frame image, input the obtained fixed-length image feature vector into a classification network to obtain a vulgar detection result of the key frame image, and determine the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image. That is, the vulgar detection result of the key frame image is determined from the extracted image features, and the vulgar detection result of the video to be detected is then determined from that result, so that automatic detection of whether the video to be detected is a vulgar video is achieved while the detection cost of the video to be detected is reduced.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a video detection method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a video detection apparatus according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of another video detection apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the present application provides a video detection method, as shown in fig. 1, the method includes,
Step S101, extracting image features of a key frame image in a video to be detected through a feature extraction network;
For this embodiment, a video is composed of a plurality of image frames; generally, one second of video corresponds to 24 image frames. Vulgar detection of the video to be detected can thus be converted into detection of its image frames. Since the video to be detected corresponds to a large number of image frames after decoding, a certain number of key frames can be determined from them, converting detection of the video into detection of the determined key frames and thereby reducing the amount of data to be processed.
Specifically, convolution layers of the feature extraction network may perform convolution operations on the key frame image extracted from the video to be detected, so as to obtain the image features of the key frame image.
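As a minimal illustration of this step, the sketch below uses a pre-trained ResNet-50 from torchvision as the feature extraction network, with its pooling and fully-connected head removed so that it outputs convolutional feature maps. The patent does not name a specific backbone, so the network choice, the weights API (torchvision ≥ 0.13 is assumed) and the tensor shapes are illustrative assumptions.

```python
import torch
import torchvision.models as models

# Pre-trained ResNet-50, used here only as an example feature extraction network.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
# Drop the final global pooling and fully-connected layers so the network
# outputs spatial convolutional feature maps of shape (N, C, H, W).
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

with torch.no_grad():
    key_frames = torch.randn(8, 3, 224, 224)  # 8 key frame images (dummy data)
    feature_maps = backbone(key_frames)       # -> (8, 2048, 7, 7)
```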
Step S102, processing the extracted image features through a pooling network to obtain fixed-length image feature vectors of the key frame images;
for this example, General Pooling mainly includes Mean Pooling (Mean Pooling), maximum Pooling (Max Pooling), and the effect of General Pooling is embodied in down-sampling. Unlike the general pooling operation, the pooling network of the present embodiment is used to perform corresponding processing on the extracted image features, so as to obtain fixed-length image feature vectors for representing corresponding key frame images.
Step S103, inputting the obtained fixed-length image feature vectors into a classification network to obtain a vulgar detection result of the key frame image;
for the embodiment, the obtained image feature vectors with fixed lengths are input into a classification network to obtain the low-colloquial detection result of the key frame image; specifically, the detection of whether the key frame image is vulgar is a problem of two classifications, and the image feature vector can be input into a Sigmoid classifier, so that a vulgar detection result of the key frame image is obtained, wherein the output values respectively correspond to the probability of the vulgar and the probability of the non-vulgar of the corresponding key frame image; the low-level detection result of the key frame image can also be obtained based on the image feature vector through a Softmax classifier or other methods, which are not limited herein.
Step S104, determining the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image.
For this embodiment, the vulgar detection result of the video to be detected is determined from the vulgar detection results of the key frame images. Specifically, the vulgar detection result of a key frame image may be represented by two probability values; thresholds may be set for the vulgar probability and the non-vulgar probability of a key frame image, and the vulgar detection result of the video to be detected is then determined based on the set thresholds and the two probability values of each key frame image's vulgar detection result.
Compared with the prior art of performing vulgar detection of videos manually, the video detection method provided by the embodiment of the application extracts the image features of the key frame image in the video to be detected through the feature extraction network, processes the extracted image features through the pooling network to obtain a fixed-length image feature vector of the key frame image, inputs the obtained fixed-length image feature vector into the classification network to obtain the vulgar detection result of the key frame image, and determines the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image. In this way, automatic detection of whether the video to be detected is a vulgar video is achieved, and the detection cost of the video to be detected is reduced.
The embodiment of the present application provides a possible implementation manner, and specifically, step S102 includes at least one of the following:
Step S1021 (not shown in the figure), inputting the extracted image features into a global average pooling network to obtain fixed-length image feature vectors of the key frame images;
for this embodiment, the fully-connected network is always a standard configuration structure of a CNN (Convolutional Neural Networks) classification network, and generally there is an activation function behind a fully-connected network layer for classification, the fully-connected network forms a feature map stretch obtained by convolution of a last layer into a vector, and then a score of each corresponding category can be obtained through a softmax layer, and the defect of the fully-connected network is that: too large a number of parameters is likely to cause overfitting. After the convolution layer is expanded into a vector by the full-connection layer, classification needs to be carried out on each feature map, and the idea of Global Average Pooling (GAP) is to combine the two processes into one, so that the parameter number can be reduced, overfitting is prevented, and in addition, the input of any image size can be realized by the Global average pooling.
Illustratively, the image features of the key frame image are C feature maps of size W × H. The global average pooling window is as large as the entire feature map, so an average is taken over each feature map: the W × H × C feature map input is converted into a 1 × 1 × C output, yielding a fixed-length feature vector.
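In code, this global average pooling step reduces to a single spatial mean; a one-line sketch with assumed sizes follows.

```python
import torch

feature_maps = torch.randn(2048, 7, 7)       # C=2048 feature maps of size 7x7
feature_vec = feature_maps.mean(dim=(1, 2))  # average each map -> (2048,)
```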
Step S1022 (not shown in the figure), inputting the extracted image features into a VLAD pooling network to obtain fixed-length image feature vectors of the key frame images.
For this embodiment, the VLAD (Vector of Locally Aggregated Descriptors) algorithm clusters image features, and a VLAD pooling network (e.g., NetVLAD) is a pooling method based on VLAD. The VLAD pooling network can process the image features of the input key frame image to obtain a fixed-length image feature vector.
For this embodiment, the input image features are processed through the global average pooling network and/or the VLAD pooling network to obtain a fixed-length image feature vector, providing a basis for the subsequent classification prediction of whether the key frame image is vulgar.
The embodiment of the present application provides a possible implementation manner, and specifically, step S1022 includes:
Step S10221 (not shown in the figure), clustering the plurality of image features to obtain a plurality of clustering centers;
For example, assuming that there are n d-dimensional image features, a K-means clustering algorithm may be used to cluster them to obtain k clustering centers.
Step S10222 (not shown in the figure), calculating residual values between the feature value of each image feature and the feature value of its corresponding clustering center, and, for any clustering center, summing the residual values between that clustering center and its corresponding image features to obtain a sum of the residual values;
Continuing the above example, each image feature corresponds to one feature value, which may be represented by a vector; for example, the feature value of an image feature is a d-dimensional vector. Residual values between the feature value of each image feature and the feature value of its corresponding cluster center are calculated, and the residual values corresponding to each cluster center are summed to obtain that center's sum of residual values, yielding k residual sums, i.e., k d-dimensional residual vectors.
Step S10223 (not shown in the figure), determining the fixed-length image feature vector of the key frame image based on the sums of residual values respectively corresponding to the respective cluster centers.
Continuing the above example, the k d-dimensional residual vectors are concatenated into a single k × d-dimensional vector, thereby obtaining a fixed-length image feature vector. The obtained fixed-length image feature vector may further be normalized, for example by power normalization and L2-norm normalization.
For this embodiment, a plurality of clustering centers are obtained by clustering the image features, and the residual sum of each clustering center is calculated, so that a fixed-length image feature vector can be obtained from the input image features, providing a basis for the subsequent classification prediction of whether the key frame image is vulgar.
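The sketch below reproduces the VLAD aggregation just described: n d-dimensional features are clustered into k centers, per-center residuals are summed, the k residual vectors are concatenated into a k × d vector, and the power and L2 normalizations mentioned above are applied. scikit-learn's KMeans and the parameter values are assumptions for illustration, not prescribed by the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_pooling(features: np.ndarray, k: int = 8) -> np.ndarray:
    """features: (n, d) local image features -> (k*d,) fixed-length vector."""
    kmeans = KMeans(n_clusters=k, n_init=10).fit(features)
    centers = kmeans.cluster_centers_                 # (k, d) clustering centers
    assignments = kmeans.labels_                      # (n,) center index per feature
    residual_sums = np.zeros_like(centers)
    for i, c in enumerate(assignments):
        # Residual of each feature to its own center, accumulated per center.
        residual_sums[c] += features[i] - centers[c]
    vlad = residual_sums.reshape(-1)                  # concatenate -> (k*d,)
    # Power normalization followed by L2-norm normalization, as noted above.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```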
The embodiment of the present application provides a possible implementation manner, and further, the method further includes:
step S105 (not shown in the figure), decoding the video to be detected to obtain a plurality of video frame images;
for this embodiment, a plurality of video frame images are obtained by decoding a video to be detected by a corresponding video decoding technique.
Step S106 (not shown in the figure), clustering a plurality of video frame images based on the image characteristics of each video frame image to obtain at least one clustering group;
for this embodiment, based on the image characteristics of each video frame image, a plurality of video frame images may be clustered by a hierarchical clustering algorithm and/or a K-means clustering method to obtain at least one cluster group.
Step S107 (not shown in the figure) determines one video frame image as a key frame image from each cluster group, respectively.
Specifically, for any cluster group, the video frame image closest to the cluster center may be determined as the key frame image, or several video frame images close to the cluster center may be determined as key frame images, so that key frame images are determined from each cluster group; the distance may be a Euclidean distance or be obtained by other methods.
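A minimal sketch of this key frame selection, assuming K-means clustering of per-frame features and Euclidean distance to the cluster center; the number of clusters and the feature shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(frame_features: np.ndarray, k: int = 5) -> list:
    """frame_features: (num_frames, d) -> index of one key frame per cluster group."""
    kmeans = KMeans(n_clusters=k, n_init=10).fit(frame_features)
    key_frame_ids = []
    for c in range(k):
        members = np.where(kmeans.labels_ == c)[0]    # frames in cluster group c
        dists = np.linalg.norm(
            frame_features[members] - kmeans.cluster_centers_[c], axis=1)
        # Pick the frame whose feature is closest to the cluster center.
        key_frame_ids.append(int(members[np.argmin(dists)]))
    return key_frame_ids
```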
For the embodiment of the application, the clustering processing is carried out on the plurality of video frame images of the video to be detected, and the key frame images are determined from the obtained clustering groups, so that the problem of determining the key frame images of the video to be detected is solved.
The embodiment of the present application provides a possible implementation manner, and specifically, step S104 includes:
step S1041 (not shown in the figure), determining respective vulgar detection results of the plurality of key frame images;
step S1042 (not shown in the figure), performs weighted fusion processing on the vulgar detection results of the plurality of key frame images, and determines the vulgar detection result of the video to be detected based on the result of the weighted fusion processing.
Specifically, a weighted average may be computed over the vulgar detection results of the key frame images, and the vulgar detection result of the video to be detected is then determined from that weighted average. For example, a judgment threshold for whether the video to be detected is vulgar may be preset, and when the weighted average exceeds the preset threshold, the video to be detected is determined to be a vulgar video.
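A minimal sketch of this weighted fusion, where the default weights and the judgment threshold are illustrative assumptions rather than values prescribed by the patent.

```python
import numpy as np

def is_vulgar_video(frame_probs, weights=None, threshold=0.5) -> bool:
    """Fuse per-key-frame vulgar probabilities into a video-level decision."""
    probs = np.asarray(frame_probs, dtype=float)
    if weights is None:
        weights = np.ones_like(probs)  # unweighted average by default
    weights = np.asarray(weights, dtype=float)
    fused = float((probs * weights).sum() / weights.sum())  # weighted average
    return fused > threshold  # vulgar video if the fused score exceeds the threshold
```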
For the embodiment of the application, the vulgar detection result of the video to be detected is determined according to the vulgar detection results of the plurality of key frame images, and compared with the determination of the vulgar detection result of the video to be detected only according to one key frame image, the accuracy rate of the vulgar detection of the video to be detected can be improved.
Fig. 2 shows a video detection apparatus according to an embodiment of the present application, where the apparatus 20 includes: an extraction module 201, a feature processing module 202, a detection module 203 and a first determination module 204;
the extraction module 201 is configured to extract image features of a key frame image in a video to be detected through a feature extraction network;
the feature processing module 202 is configured to process the image features extracted by the extraction module 201 through a pooling network to obtain image feature vectors with fixed lengths of the key frame images;
the detection module 203 is used for inputting the fixed-length image feature vectors processed by the feature processing module 202 into a classification network to obtain a vulgar detection result of the key frame image;
the first determining module 204 is configured to determine a vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image detected by the detecting module 203.
Compared with the prior art of performing vulgar detection of videos manually, the video detection device extracts the image features of the key frame image in the video to be detected through the feature extraction network, processes the extracted image features through the pooling network to obtain a fixed-length image feature vector of the key frame image, inputs the obtained fixed-length image feature vector into the classification network to obtain the vulgar detection result of the key frame image, and determines the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image. In this way, automatic detection of whether the video to be detected is a vulgar video is achieved, and the detection cost of the video to be detected is reduced.
The video detection apparatus of the present embodiment can perform the video detection method provided in the above embodiments of the present application, and the implementation principles thereof are similar, and are not described herein again.
An embodiment of the present application provides another video detection apparatus, as shown in fig. 3, an apparatus 30 of the present embodiment includes: an extraction module 301, a feature processing module 302, a detection module 303 and a first determination module 304;
the extraction module 301 is configured to extract image features of a key frame image in a video to be detected through a feature extraction network;
wherein, the extracting module 301 in fig. 3 has the same or similar function as the extracting module 201 in fig. 2.
The feature processing module 302 is configured to process the image features extracted by the extraction module 301 through a pooling network to obtain image feature vectors with fixed lengths of the key frame images;
where feature processing module 302 in fig. 3 is the same or similar in function to feature processing module 202 in fig. 2.
The detection module 303 is configured to input the fixed-length image feature vectors processed by the feature processing module 302 into a classification network to obtain a vulgar detection result of the key frame image;
wherein the detection module 303 in fig. 3 has the same or similar function as the detection module 203 in fig. 2.
A first determining module 304, configured to determine a vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image detected by the detecting module 303.
Wherein the first determining module 304 in fig. 3 has the same or similar function as the first determining module 204 in fig. 2.
The embodiment of the present application provides a possible implementation manner. Specifically, the feature processing module 302 is configured to input the extracted image features into a global average pooling network to obtain the fixed-length image feature vectors of the key frame images; and/or to input the extracted image features into a VLAD pooling network to obtain the fixed-length image feature vectors of the key frame images.
For this embodiment, the input image features are processed through the global average pooling network and/or the VLAD pooling network to obtain a fixed-length image feature vector, providing a basis for the subsequent classification prediction of whether the key frame image is vulgar.
The embodiment of the present application provides a possible implementation manner. Specifically, the feature processing module 302 includes: a clustering processing unit 3021, a calculating unit 3022, and a first determining unit 3023;
a clustering unit 3021, configured to perform clustering on the multiple image features to obtain multiple clustering centers;
a calculating unit 3022, configured to calculate residual values between the feature value of each image feature and the feature value of its corresponding clustering center determined by the clustering processing unit 3021, and, for any clustering center, to sum the residual values between that clustering center and its corresponding image features to obtain a sum of the residual values;
a first determining unit 3023, configured to determine the fixed-length image feature vector of the key frame image based on the sums of residual values respectively corresponding to the respective cluster centers calculated by the calculating unit 3022.
For this embodiment, a plurality of clustering centers are obtained by clustering the image features, and the residual sum of each clustering center is calculated, so that a fixed-length image feature vector can be obtained from the input image features, providing a basis for the subsequent classification prediction of whether the key frame image is vulgar.
The embodiment of the present application provides a possible implementation manner, and further, the apparatus further includes: a decoding module 305, a clustering module 306, a second determining module 307;
a decoding module 305, configured to decode a video to be detected to obtain a plurality of video frame images;
a clustering module 306, configured to perform clustering processing on the multiple video frame images based on the image characteristics of each video frame image decoded by the decoding module 305, so as to obtain at least one cluster group;
a second determining module 307, configured to determine a video frame image as a key frame image from each cluster group obtained through clustering processing by the clustering module 306.
For the embodiment of the application, the clustering processing is carried out on the plurality of video frame images of the video to be detected, and the key frame images are determined from the obtained clustering groups, so that the problem of determining the key frame images of the video to be detected is solved.
The embodiment of the present application provides a possible implementation manner, and specifically, the first determining module 304 includes: a second determination unit 3041 and a third determination unit 3042;
a second determining unit 3041 for determining vulgar detection results of the plurality of key frame images, respectively;
a third determining unit 3042, configured to perform weighted fusion processing on the vulgar detection results of the plurality of key frame images determined by the second determining unit 3041, and determine the vulgar detection result of the video to be detected based on the result of the weighted fusion processing.
For the embodiment of the application, the vulgar detection result of the video to be detected is determined according to the vulgar detection results of the plurality of key frame images, and compared with the determination of the vulgar detection result of the video to be detected only according to one key frame image, the accuracy rate of the vulgar detection of the video to be detected can be improved.
Compared with the prior art of performing vulgar detection of videos manually, the video detection device extracts image features of key frame images in the video to be detected through a feature extraction network, processes the extracted image features through a pooling network to obtain fixed-length image feature vectors of the key frame images, inputs the obtained fixed-length image feature vectors into a classification network to obtain vulgar detection results of the key frame images, and determines the vulgar detection result of the video to be detected based on the vulgar detection results of the key frame images. In this way, automatic detection of whether the video to be detected is a vulgar video is achieved, and the detection cost of the video to be detected is reduced.
The embodiment of the application provides a video detection device which is suitable for the above method embodiment and is not described in detail herein.
An embodiment of the present application provides an electronic device. As shown in fig. 4, the electronic device 40 includes: a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Further, the electronic device 40 may also include a transceiver 4004. Note that in practical applications the transceiver 4004 is not limited to one, and the structure of the electronic device 40 does not limit the embodiment of the present application.
The processor 4001 is applied in this embodiment of the application, and is configured to implement the functions of the extracting module, the feature processing module, the detecting module, and the first determining module shown in fig. 2 or fig. 3, and the functions of the decoding module, the clustering module, and the second determining module shown in fig. 3. The transceiver 4004 includes a receiver and a transmitter.
The processor 4001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that performs computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. Bus 4002 may be a PCI bus, EISA bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The memory 4003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. The processor 4001 is configured to execute application code stored in the memory 4003 to implement the functions of the video detection apparatus provided by the embodiment shown in fig. 2 or fig. 3.
Compared with the prior art of performing vulgar detection of videos manually, the electronic equipment extracts the image features of the key frame image in the video to be detected through the feature extraction network, processes the extracted image features through the pooling network to obtain a fixed-length image feature vector of the key frame image, inputs the obtained fixed-length image feature vector into the classification network to obtain the vulgar detection result of the key frame image, and determines the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image. In this way, automatic detection of whether the video to be detected is a vulgar video is achieved, and the detection cost of the video to be detected is reduced.
The embodiment of the application provides an electronic device suitable for the above method embodiment, which is not described in detail herein.
The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the video detection method shown in the above embodiment.
Compared with the prior art of performing vulgar detection of videos manually, the embodiment of the application provides a computer-readable storage medium with which the image features of the key frame image in the video to be detected are extracted through a feature extraction network, the extracted image features are processed through a pooling network to obtain a fixed-length image feature vector of the key frame image, the obtained fixed-length image feature vector is input into a classification network to obtain the vulgar detection result of the key frame image, and the vulgar detection result of the video to be detected is determined based on the vulgar detection result of the key frame image. In this way, automatic detection of whether the video to be detected is a vulgar video is achieved, and the detection cost of the video to be detected is reduced.
The embodiment of the application provides a computer-readable storage medium suitable for the above method embodiment, which is not described in detail herein.
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present application. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A video detection method, comprising:
extracting image characteristics of key frame images in a video to be detected through a characteristic extraction network;
processing the extracted image features through a pooling network to obtain fixed-length image feature vectors of the key frame images;
inputting the obtained fixed-length image feature vectors into a classification network to obtain a vulgar detection result of the key frame image;
and determining the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image.
2. The method according to claim 1, wherein the processing of the extracted image features through a pooling network to obtain a fixed-length image feature vector of the key frame image comprises at least one of:
inputting the extracted image features into a global average pooling network to obtain fixed-length image feature vectors of the key frame images;
and inputting the extracted image features into a VLAD pooling network to obtain fixed-length image feature vectors of the key frame images.
3. The method of claim 2, wherein inputting the extracted image features into a VLAD pooling network to obtain a fixed-length image feature vector of the key frame image comprises:
clustering the plurality of image features to obtain a plurality of clustering centers;
calculating residual values between the feature value of each image feature and the feature value of its corresponding clustering center, and, for any clustering center, summing the residual values between that clustering center and its corresponding image features to obtain a sum of residual values;
and determining the fixed-length image feature vector of the key frame image based on the sum of the obtained residual values respectively corresponding to the clustering centers.
4. The method according to claim 1, wherein the method for determining the key frame image of the video to be detected comprises:
decoding a video to be detected to obtain a plurality of video frame images;
clustering the plurality of video frame images based on the image characteristics of each video frame image to obtain at least one clustering group;
and respectively determining one video frame image from each clustering group as a key frame image.
5. The method according to claim 1, wherein determining the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image comprises:
determining the vulgar detection results of a plurality of key frame images respectively;
and performing weighted fusion processing on the vulgar detection results of the plurality of key frame images, and determining the vulgar detection result of the video to be detected based on the result of the weighted fusion processing.
6. A video detection apparatus, comprising:
the extraction module is used for extracting the image characteristics of the key frame image in the video to be detected through a characteristic extraction network;
the feature processing module is used for processing the image features extracted by the extraction module through a pooling network to obtain image feature vectors with fixed lengths of the key frame images;
the detection module is used for inputting the fixed-length image feature vectors processed by the feature processing module into a classification network to obtain a vulgar detection result of the key frame image;
and the first determining module is used for determining the vulgar detection result of the video to be detected based on the vulgar detection result of the key frame image detected by the detecting module.
7. The apparatus according to claim 6, wherein the feature processing module is configured to input the extracted image features into a global average pooling network to obtain a fixed-length image feature vector of the key frame image; and/or to input the extracted image features into a VLAD pooling network to obtain a fixed-length image feature vector of the key frame image.
8. The apparatus of claim 7, wherein the feature processing module comprises: the device comprises a clustering processing unit, a calculating unit and a first determining unit;
the clustering unit is used for clustering the image characteristics to obtain a plurality of clustering centers;
the calculating unit is used for calculating residual values between the feature value of each image feature and the feature value of its corresponding clustering center determined by the clustering processing unit, and, for any clustering center, summing the residual values between that clustering center and its corresponding image features to obtain a sum of the residual values;
the first determining unit is configured to determine the fixed-length image feature vector of the key frame image based on a sum of residual values respectively corresponding to the respective clustering centers calculated by the calculating unit.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the video detection method according to any one of claims 1 to 5.
10. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the video detection method of any of claims 1 to 5.
CN201811496505.6A 2018-12-07 2018-12-07 Video detection method and device, electronic equipment and computer readable storage medium Pending CN111291602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811496505.6A CN111291602A (en) 2018-12-07 2018-12-07 Video detection method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811496505.6A CN111291602A (en) 2018-12-07 2018-12-07 Video detection method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111291602A true CN111291602A (en) 2020-06-16

Family

ID=71024285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811496505.6A Pending CN111291602A (en) 2018-12-07 2018-12-07 Video detection method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111291602A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100329547A1 (en) * 2007-04-13 2010-12-30 Ipharro Media Gmbh Video detection system and methods
CN102073864A (en) * 2010-12-01 2011-05-25 北京邮电大学 Football item detecting system with four-layer structure in sports video and realization method thereof
CN105046197A (en) * 2015-06-11 2015-11-11 西安电子科技大学 Multi-template pedestrian detection method based on cluster
CN105844239A (en) * 2016-03-23 2016-08-10 北京邮电大学 Method for detecting riot and terror videos based on CNN and LSTM
US20170289624A1 (en) * 2016-04-01 2017-10-05 Samsung Electrônica da Amazônia Ltda. Multimodal and real-time method for filtering sensitive media
CN108681695A (en) * 2018-04-26 2018-10-19 北京市商汤科技开发有限公司 Video actions recognition methods and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RELJA ARANDJELOVIĆ et al.: "NetVLAD: CNN architecture for weakly supervised place recognition", Computer Vision and Pattern Recognition, pages 1-17 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926439A (en) * 2021-02-22 2021-06-08 深圳中科飞测科技股份有限公司 Detection method and device, detection equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination