CN111242019B - Video content detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111242019B
CN111242019B (application CN202010027419.1A)
Authority
CN
China
Prior art keywords
video
detected
content
frames
distance
Prior art date
Legal status
Active
Application number
CN202010027419.1A
Other languages
Chinese (zh)
Other versions
CN111242019A (en)
Inventor
彭健腾
王兴华
康斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010027419.1A
Publication of CN111242019A
Application granted
Publication of CN111242019B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroids


Abstract

The embodiments of the invention disclose a method, an apparatus, an electronic device, and a storage medium for detecting video content. The method comprises: obtaining a video to be detected; selecting, from the video to be detected, video frames for video content detection to obtain a plurality of video frames to be detected; extracting the image feature corresponding to each video frame to be detected; obtaining the clustering center to which each content category belongs; calculating the distance between each image feature and each of the plurality of clustering centers to obtain a distance set corresponding to each image feature; and determining the content category of the video to be detected according to the distance sets and the plurality of clustering centers. The scheme can improve the accuracy of video content detection.

Description

Video content detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for detecting video content, an electronic device, and a storage medium.
Background
In the era of rapid internet development, many websites allow users to upload videos and publish them to the public. As the threshold for content production drops, the volume of uploaded video grows exponentially. To ensure the safety of distributed content, video content must therefore be audited within a short time, for example to identify and handle sensitive information, content quality, and security issues.
Current video content detection schemes mainly count the distribution of a video's color features and divide videos into content categories based on that distribution. Because the semantics of the video are not considered, the detection results of such schemes are inaccurate.
Disclosure of Invention
The embodiment of the invention provides a method, a device, electronic equipment and a storage medium for detecting video content, which can improve the accuracy of detecting the video content.
The embodiment of the invention provides a method for detecting video content, which comprises the following steps:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of video frames;
selecting a video frame for video content detection from the video to be detected to obtain a plurality of video frames to be detected;
extracting image features corresponding to each video frame to be detected, and acquiring a clustering center to which each content category belongs;
respectively calculating the distance between each image feature and a plurality of clustering centers to obtain a distance set corresponding to each image feature;
and determining the content category of the video to be detected according to the distance set and the plurality of clustering centers.
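Read together, the steps above amount to a nearest-center classification over per-frame features. A minimal sketch follows, assuming Euclidean distance and a majority vote over frames; neither choice is fixed by the claims, and the feature extractor and clustering centers are taken as given inputs:

```python
import numpy as np

def detect_video_content(frame_features, cluster_centers, center_labels):
    """Assign each frame feature to its nearest clustering center and
    label the video by majority vote over the per-frame categories.
    frame_features: (n_frames, d); cluster_centers: (k, d);
    center_labels: list of k category names."""
    votes = {}
    for feat in frame_features:
        # distance set: distance from this image feature to every center
        dists = np.linalg.norm(cluster_centers - feat, axis=1)
        # the smallest distance picks the target clustering center
        label = center_labels[int(np.argmin(dists))]
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

For example, with two centers labeled "cartoon" and "non-cartoon", a video whose frame features mostly fall near the "cartoon" center is labeled cartoon.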
Correspondingly, an embodiment of the invention further provides an apparatus for detecting video content, which comprises:
The first acquisition module is used for acquiring a video to be detected, wherein the video to be detected comprises a plurality of video frames;
the selection module is used for selecting video frames for video content detection from the videos to be detected to obtain a plurality of video frames to be detected;
the extraction module is used for extracting image features corresponding to each video frame to be detected;
the second acquisition module is used for acquiring a clustering center to which each content category belongs;
the computing module is used for respectively computing the distance between each image feature and a plurality of clustering centers to obtain a distance set corresponding to each image feature;
and the determining module is used for determining the content category of the video to be detected according to the distance set and the plurality of clustering centers.
Optionally, in some embodiments of the present invention, the second obtaining module includes:
the system comprises an acquisition unit, a video content classification unit and a video content classification unit, wherein the acquisition unit is used for acquiring a trained video content classification model and a plurality of sample video frames marked with video content types, and the video content classification model is trained by the plurality of sample video frames;
and the construction unit is used for constructing a clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames.
Optionally, in some embodiments of the invention, the building unit includes:
The extraction subunit is used for extracting the characteristics of each sample video frame by utilizing the trained video content classification model;
and the construction subunit is used for constructing a clustering center to which each content category belongs based on the extracted characteristics.
Optionally, in some embodiments of the invention, the building subunit is specifically configured to:
acquiring a plurality of preset content labels;
determining the number of classification to be clustered according to a plurality of preset content labels;
and carrying out clustering processing on the extracted features based on a preset clustering algorithm and the number of classifications to obtain clustering centers to which each content category belongs.
Optionally, in some embodiments of the present invention, the method further includes a training module, where the training module is specifically configured to:
collecting a plurality of sample video frames marked with video content types;
determining a sample video frame which is required to be trained currently from a plurality of collected sample video frames to obtain a current processing object;
the current processing object is imported into a preset initial classification model for training, and a predicted value of video content corresponding to the current processing object is obtained;
converging the predicted value corresponding to the current processing object and the marked video content type of the current processing object so as to adjust the parameters of the preset initial classification model;
and returning to the step of determining, from the plurality of collected sample video frames, the sample video frame that currently needs to be trained, until all of the plurality of sample video frames have been trained.
Optionally, in some embodiments of the present invention, the determining module includes:
a selecting unit, configured to select a preset number of distances from the distance set, to obtain at least one target distance;
the first determining unit is used for determining a cluster center corresponding to the target distance to obtain at least one target cluster center;
and the second determining unit is used for determining the content category of the video to be detected according to at least one target clustering center.
Optionally, in some embodiments of the present invention, the second determining unit is specifically configured to:
respectively obtaining content categories corresponding to a plurality of target clustering centers;
based on the determined content category, the content category of the video to be detected is determined.
Optionally, in some embodiments of the present invention, the selecting unit is specifically configured to: and selecting the smallest distance from the distance set as the target distance.
Optionally, in some embodiments of the present invention, the selecting module is specifically configured to:
detecting the number of video frames in a video to be detected;
Judging whether the number is larger than a preset number or not;
when the number is larger than the preset number, removing corresponding video frames from the video to be detected based on a preset strategy to obtain a reserved video frame set;
and selecting a plurality of video frames from the reserved video frame set at intervals to obtain a video frame to be detected.
Optionally, in some embodiments of the present invention, the selecting module is specifically further configured to:
and when the number is smaller than or equal to the preset number, selecting a plurality of video frames from the video to be detected to obtain the video frames to be detected.
After a video to be detected, comprising a plurality of video frames, is obtained, video frames for video content detection are selected from it to obtain a plurality of video frames to be detected. The image feature corresponding to each video frame to be detected is extracted, and the clustering center to which each content category belongs is obtained. The distance between each image feature and each of the plurality of clustering centers is then calculated to obtain a distance set corresponding to each image feature, and finally the content category of the video to be detected is determined according to the distance sets and the plurality of clustering centers. The scheme can therefore effectively improve the accuracy of video content detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1a is a schematic view of a video content detection method according to an embodiment of the present invention;
fig. 1b is a flowchart illustrating a method for detecting video content according to an embodiment of the present invention;
FIG. 1c is a schematic diagram of the distribution of clustering centers in a method for detecting video content according to an embodiment of the present invention;
Fig. 2a is another flow chart of a method for detecting video content according to an embodiment of the present invention;
fig. 2b is another schematic view of a video content detection method according to an embodiment of the present invention;
FIG. 2c is a schematic diagram of sample video processing performed by a server in detecting video content according to an embodiment of the present invention;
fig. 3a is a schematic structural diagram of a device for detecting video content according to an embodiment of the present invention;
fig. 3b is another schematic structural diagram of a device for detecting video content according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive subject involving a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see": using cameras and computers in place of human eyes to recognize, locate, and measure targets, and to further process the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies the theory and technology needed to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition.
The embodiment of the invention provides a method and a device for detecting video content, electronic equipment and a storage medium.
The video content detection apparatus (hereinafter referred to as the detection apparatus) may be integrated in a server or a terminal. The server may be an independently operating server or a distributed server, or a server cluster composed of a plurality of servers; the terminal may be a mobile phone, a tablet computer, or a personal computer (PC).
Referring to fig. 1a, taking the case where the detection apparatus is integrated in a server as an example, the server may receive a video uploaded by a user through a network (i.e. a video to be detected); for convenience of description, suppose the user uploads a video A. After the server obtains video A, which may comprise a plurality of video frames, the server selects from video A the video frames on which video content detection is to be performed, obtaining a plurality of video frames to be detected. The server then extracts the image feature corresponding to each video frame to be detected and obtains the clustering center to which each content category belongs. Next, the server calculates the distance between each image feature and each of the plurality of clustering centers, obtaining a distance set corresponding to each image feature. Finally, the server determines the content category of video A according to the distance sets and the plurality of clustering centers, for example detecting that the content category of video A is cartoon.
According to the scheme, the distances between the image features and the clustering centers are calculated, and the content category of the video to be detected is then determined based on the distance sets and the clustering centers. Because the content category of every image frame is considered in the actual detection process, the accuracy of video content detection can be improved.
Detailed descriptions are given below. It should be noted that the order in which the following embodiments are described is not a limitation on the preferred order of the embodiments.
A method of detecting video content comprises: obtaining a video to be detected; selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected; extracting the image feature corresponding to each video frame to be detected; obtaining the clustering center to which each content category belongs; calculating the distance between each image feature and each of the plurality of clustering centers to obtain a distance set corresponding to each image feature; and determining the content category of the video to be detected according to the distance sets and the plurality of clustering centers.
Referring to fig. 1b, fig. 1b is a flowchart illustrating a method for detecting video content according to an embodiment of the present invention. The specific flow of the video content detection method can be as follows:
101. And acquiring a video to be detected.
The video to be detected may comprise a plurality of video frames. It may be obtained in various ways, for example from the internet and/or a designated database, as determined by the requirements of the practical application. The video to be detected may be a television program, a movie, a video recorded by a user, and so on.
102. And selecting video frames for video content detection from the videos to be detected, and obtaining a plurality of video frames to be detected.
For example, the video frames for video content detection may be selected according to the order of the video frames in the video to be detected. To improve computational efficiency, the number of video frames may first be compressed, that is, some video frames are deleted, and the frames for video content detection are then selected from the remaining frames. That is, optionally, in some embodiments, the step of "selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected" may specifically include:
(11) Detecting the number of video frames in a video to be detected;
(12) Judging whether the number is larger than a preset number;
(13) When the number is greater than the preset number, removing corresponding video frames from the video to be detected based on a preset strategy to obtain a reserved video frame set;
(14) And selecting a plurality of video frames from the reserved video frame set at intervals to obtain a video frame to be detected.
For example, if the number of video frames in the video to be detected is 200 and the preset number is 100, the corresponding video frames may be removed from the video to be detected based on a preset strategy to obtain a reserved video frame set, and a plurality of video frames may then be selected from the reserved video frame set at intervals to obtain the video frames to be detected.
In addition, when the number of video frames is less than or equal to the preset number, a plurality of video frames may be selected directly from the video to be detected to obtain the video frames to be detected. That is, optionally, in some embodiments, the method may further include: when the number is less than or equal to the preset number, selecting a plurality of video frames from the video to be detected to obtain the video frames to be detected.
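Steps (11)-(14) and the small-video branch can be sketched as follows; the thinning strategy (keep every other frame), the sampling interval, and the preset number are illustrative assumptions, not values fixed by the patent:

```python
def select_frames(frames, preset_number=100, step=2):
    """Select video frames for content detection.
    If the video has more frames than the preset number, thin it out
    first (assumed strategy: keep every other frame), then sample the
    reserved set at a fixed interval; otherwise sample directly."""
    if len(frames) > preset_number:
        reserved = frames[::2]   # assumed removal strategy
    else:
        reserved = frames
    # pick frames at intervals from the reserved set
    return reserved[::step]
```

So a 200-frame video is first thinned to a 100-frame reserved set and then sampled down to 50 frames to be detected, while a 10-frame video is sampled directly.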
It should further be noted that, to reduce the influence of low-information frames on subsequent video content detection, such frames may be deleted before the video frames for video content detection are selected. A low-information frame is a video frame whose color and texture features are too simple, a video header frame, and so on. Specifically, an image-frame complexity algorithm may be used to examine all video frames. That is, optionally, in some embodiments, before the step of "selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected", the method may further include:
(21) Respectively detecting all video frames in the video to be detected by adopting a preset algorithm;
(22) And processing the video to be detected based on the detection result to obtain the processed video to be detected.
For example, when 3 video frames are detected to be black-and-white frames and solid color blocks are detected at the image edges of 2 video frames, these 5 video frames may be deleted to obtain the processed video to be detected, and the video frames for video content detection may then be selected from the preprocessed video to obtain a plurality of video frames to be detected.
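The image-frame complexity algorithm is not specified further. A crude stand-in that flags near-solid (low-variance) frames might look like this; the variance test and its threshold are assumptions for illustration only:

```python
import numpy as np

def is_low_information(frame, var_threshold=10.0):
    """Flag frames whose color texture is too simple, e.g. near-solid
    filler or header frames. The pixel-variance test is an assumed
    stand-in for the patent's unspecified complexity algorithm.
    frame: (H, W, 3) uint8 array."""
    return float(frame.var()) < var_threshold

def drop_low_information(frames):
    """Remove low-information frames before frame selection."""
    return [f for f in frames if not is_low_information(f)]
```

A solid black frame has zero variance and is dropped, while a textured frame passes through unchanged.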
103. And extracting image characteristics corresponding to each video frame to be detected, and acquiring a clustering center to which each content category belongs.
For example, feature extraction may be performed on each video frame to be detected based on a trained video content classification model to obtain the image feature corresponding to each video frame to be detected, where the video content classification model is trained on a plurality of sample video frames labeled with video content types. The clustering center to which each content category belongs may further be constructed based on the trained video content classification model and the plurality of sample video frames. That is, optionally, in some embodiments, the step of "obtaining the clustering center to which each content category belongs" may specifically include:
(31) Acquiring a trained video content classification model;
(32) And constructing a clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames.
For example, each sample video frame may be imported into the trained video content classification model to obtain the sample image feature corresponding to each sample video frame, and the clustering center to which each content category belongs may then be constructed from the sample image features. That is, optionally, the step of "constructing the clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames" may specifically include:
(41) Respectively extracting the characteristics of each sample video frame by using the trained video content classification model;
(42) And constructing a clustering center to which each content category belongs based on the extracted features.
A plurality of content labels may be preset; for example, labels such as "comedy", "horror", and "mystery drama" may be set, or labels such as "cartoon" and "non-cartoon" may be set, in which case the number of classifications to be clustered is 2. The extracted features are then clustered based on a preset clustering algorithm and the number of classifications to obtain the clustering center to which each content category belongs. That is, optionally, the step of "constructing the clustering center to which each content category belongs based on the extracted features" may specifically include:
(51) Acquiring a plurality of preset content labels;
(52) Determining the number of classification to be clustered according to a plurality of preset content labels;
(53) Clustering the extracted features based on a preset clustering algorithm and the number of classifications to obtain the clustering center to which each content category belongs.
The number of content labels can be set according to actual demands. For example, the labels "cartoon" and "non-cartoon" may be set first, and subordinate labels may then be set under each of them: labels such as "slice-of-life", "hot-blooded", and "isekai" may be set under the "cartoon" label, and labels such as "talk show", "comedy", and "mystery" under the "non-cartoon" label; alternatively, titles such as "X-shadow X", "dog XX", and "X-god" may be set under the "cartoon" label while no subordinate labels are set under the "non-cartoon" label. The labels are set according to the actual situation and are not enumerated further herein.
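Steps (51)-(53) can be sketched as follows, with a bare-bones k-means standing in for the unspecified "preset clustering algorithm"; the only part the text pins down is that the number of preset content labels fixes the number of clusters:

```python
import numpy as np

def build_cluster_centers(features, content_labels, iters=20, seed=0):
    """Cluster sample-frame features into as many groups as there are
    preset content labels. A minimal k-means (assumed algorithm) is
    used for illustration. features: (n, d) array."""
    k = len(content_labels)  # number of classifications to cluster
    rng = np.random.default_rng(seed)
    # initialize centers from k distinct sample features
    centers = features[rng.choice(len(features), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each feature to its nearest center
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # move each center to the mean of its assigned features
        for j in range(k):
            if np.any(assign == j):
                centers[j] = features[assign == j].mean(axis=0)
    return centers
```

With two well-separated groups of sample features and the labels ["cartoon", "non-cartoon"], the two returned centers settle near the group means.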
It should be noted that, the video content classification model may be pre-established, that is, in some embodiments, before the step of "obtaining the trained video content classification model", the method may specifically further include:
(61) Collecting a plurality of sample video frames marked with video content types;
(62) Determining a sample video frame which is required to be trained currently from a plurality of collected sample video frames to obtain a current processing object;
(63) Importing the current processing object into a preset initial classification model for training to obtain a predicted value of video content corresponding to the current processing object;
(64) Converging the predicted value corresponding to the current processing object and the marked video content type of the current processing object so as to adjust the parameters of the preset initial classification model;
(65) And returning to the step of executing the sample video frame which is determined to be trained currently from the plurality of collected sample video frames until the plurality of sample video frames are trained.
Convolution layer: mainly used to extract features from the input image (such as a training sample or an image frame to be recognized). The size of each convolution kernel may be determined according to the practical application; for example, the kernel sizes of the first through fourth convolution layers may be (7, 7), (5, 5), (3, 3) in sequence. Optionally, to reduce computational complexity and improve efficiency, the kernels of all four convolution layers in this embodiment may be set to (3, 3), the activation functions all set to ReLU (Rectified Linear Unit), and the padding modes all set to "same"; "same" padding can be simply understood as padding the edges with zeros, where the number of zeros added on the left (top) is the same as, or less than, the number added on the right (bottom). Optionally, the convolution layers may be connected by direct connections to increase the network convergence speed. To further reduce the amount of computation, a downsampling (pooling) operation may be performed after all of the second through fourth convolution layers, or after any one or two of them; the downsampling operation is substantially the same as the convolution operation except that it takes the maximum (max pooling) or average (average pooling) of the corresponding positions. For convenience of description, the second and third convolution layers are used as examples.
It should be noted that, for convenience of description, in the embodiments of the present invention the layer containing the activation function and the downsampling layer (also called the pooling layer) are both counted as part of the convolution layer. It should be understood that the structure may equally be regarded as comprising the convolution layer, the activation layer, the downsampling (pooling) layer, and the fully connected layer, and of course may also include an input layer for inputting data and an output layer for outputting data, which are not described again here.
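The downsampling (pooling) operation described above, which keeps the maximum or average of each window, can be illustrated on a single 2D feature map (the window size of 2 is an example value):

```python
import numpy as np

def downsample(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: split the feature map into size x size
    windows and keep either the maximum (max pooling) or the average
    (average pooling) of each window.
    feature_map: (H, W) with H, W divisible by `size`."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))
```

A 4x4 map thus becomes a 2x2 map, quartering the amount of downstream computation, which is the stated purpose of the operation.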
Fully connected layer: maps the learned features to the sample label space and mainly plays the role of a "classifier" in the whole convolutional neural network. Each node of the fully connected layer is connected to all nodes output by the previous layer (such as a downsampling layer within the convolution layer); one node of the fully connected layer is called a neuron, and the number of neurons may be determined according to practical requirements, for example, in the text detection model the number of neurons of the fully connected layer may be set to 512, or alternatively to 128, and so on. Similar to the convolution layer, optionally, non-linear factors may also be introduced in the fully connected layer by adding an activation function, for example the sigmoid (S-shaped) function.
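The role of the fully connected layer can be sketched in NumPy as follows; the 512 neurons and sigmoid activation follow the text, while the input dimension of 2048 and the random weights are assumed illustrative values:

```python
import numpy as np

def fully_connected(x, weights, bias):
    """Dense layer: every output neuron is connected to all inputs,
    followed by a sigmoid activation."""
    z = x @ weights + bias
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

rng = np.random.default_rng(0)
flattened = rng.standard_normal(2048)        # pooled features, flattened (assumed size)
w = rng.standard_normal((2048, 512)) * 0.01  # 512 neurons, as in the text
b = np.zeros(512)
out = fully_connected(flattened, w, b)
print(out.shape)  # (512,)
```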
For example, a sample video frame set may be collected through multiple channels, the set including a plurality of sample video frames labelled with video content types. A sample video frame that currently needs to be trained is then determined from the collected sample video frames to obtain a current processing object. The current processing object is imported into a preset initial classification model for training, yielding a predicted value of the video content corresponding to the current processing object. The predicted value and the labelled video content type of the current processing object are then converged so as to adjust the parameters of the preset initial classification model, and the step of determining the sample video frame that currently needs to be trained is returned to and repeated until all of the sample video frames have been trained, finally obtaining the video content classification model.
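The train-predict-converge-adjust loop described above can be sketched with a toy stand-in: a simple logistic classifier on synthetic 2-D "features" rather than the patent's convolutional network, purely to illustrate the loop structure (all names and data here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in for "sample video frames labelled with video content types":
# 2-D features, binary content-type labels.
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b, lr = np.zeros(2), 0.0, 0.1

def predict(X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted value

losses = []
for epoch in range(100):                       # repeat until samples are trained
    p = predict(X)
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    losses.append(loss)
    grad_w = X.T @ (p - y) / len(y)            # converge prediction vs. label
    grad_b = np.mean(p - y)
    w -= lr * grad_w                           # adjust model parameters
    b -= lr * grad_b

print(losses[0] > losses[-1])  # True: the loss decreases over training
```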
104. And respectively calculating the distance between each image feature and a plurality of clustering centers to obtain a distance set corresponding to each image feature.
For example, suppose there are 10 image features and 6 cluster centers; each image feature then corresponds to one distance set containing its distances to the 6 cluster centers. The distances may be expressed as Euclidean distances or, of course, as Mahalanobis distances, specifically selected according to the practical situation, which is not described herein again.
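The distance-set computation for the example above (10 image features, 6 cluster centers, Euclidean distance) can be written in NumPy; the feature dimension of 128 and the random data are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((10, 128))  # 10 image features
centers = rng.standard_normal((6, 128))    # 6 cluster centers

# Distance sets: one row per image feature, one Euclidean distance per center.
distance_sets = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
print(distance_sets.shape)  # (10, 6)
```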
105. And determining the content category of the video to be detected according to the distance set and the plurality of clustering centers.
For example, 3 distances may be randomly selected from the distance set, and the content category of the video to be detected determined according to the cluster centers corresponding to these 3 distances. That is, optionally, in some embodiments, the step of "determining the content category of the video to be detected according to the distance set and the plurality of cluster centers" may specifically include:
(71) Selecting a preset number of distances from the distance set to obtain at least one target distance;
(72) Determining a cluster center corresponding to the target distance to obtain at least one target cluster center;
(73) And determining the content category of the video to be detected according to at least one target cluster center.
There are two ways of selecting a preset number of distances from the distance set so as to obtain at least one target distance.
The first way: taking one video feature A and 6 cluster centers as an example, the distance set B corresponding to the video feature A includes a first distance B1, a second distance B2, a third distance B3, a fourth distance B4, a fifth distance B5 and a sixth distance B6. Three distances may be randomly selected from the distance set B, for example the second distance B2, the fifth distance B5 and the sixth distance B6; the cluster centers corresponding to B2, B5 and B6 are then obtained respectively, and finally the content category of the video to be detected is determined based on the content categories corresponding to these cluster centers. That is, optionally, in some embodiments, the step of "determining the content category of the video to be detected according to at least one target cluster center" includes:
(81) Respectively obtaining content categories corresponding to a plurality of target clustering centers;
(82) Based on the determined content category, the content category of the video to be detected is determined.
For example, when the content categories corresponding to the 3 target cluster centers of the image frame D to be detected are obtained as "X-job X-person", "X-shadow X-person" and "pseudo X-person" respectively, whose distribution is shown in fig. 1c, the content category of the image frame D to be detected may be determined to be "cartoon". The same processing is performed for the other image frames, and finally the content category of the video to be detected is determined according to the content categories corresponding to all the image frames to be detected. For example, if the number of image frames to be detected is 100 and the majority of them have the content category "cartoon", the content category of the video to be detected may be determined to be "cartoon". That is, in some embodiments, the step of "determining the content category of the video to be detected based on the determined content category" may specifically include:
(91) Calculating the proportion of the image frames to be detected corresponding to each content category in all the image frames to be detected;
(92) And when the proportion is larger than the preset proportion, determining the content type of the image frame to be detected corresponding to the proportion larger than the preset proportion as the content type of the video to be detected.
The preset proportion can be set according to actual requirements, and will not be described herein.
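The proportion rule of steps (91)-(92) can be sketched as follows; the default of 0.3 mirrors the 30% cartoon threshold used in the detection-stage example later in this embodiment, and the function name is an illustrative assumption:

```python
from collections import Counter

def video_category(frame_categories, preset_proportion=0.3):
    """Return the content category whose proportion among all frames
    to be detected exceeds the preset proportion, or None if no
    category does."""
    total = len(frame_categories)
    for category, count in Counter(frame_categories).most_common():
        if count / total > preset_proportion:
            return category
    return None

frames = ["cartoon"] * 90 + ["non-cartoon"] * 10
print(video_category(frames))  # cartoon
```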
In order to further improve the accuracy of detecting the video content, the smallest three distances may instead be selected from the distance set B. Preferably, in some embodiments, N distances may be selected from the distance set to obtain N target distances, where N is a positive odd number (so that a majority vote between two categories cannot tie).
The second way: taking one video feature A and 6 cluster centers as an example, the distance set B corresponding to the video feature A includes a first distance B1, a second distance B2, a third distance B3, a fourth distance B4, a fifth distance B5 and a sixth distance B6. The smallest distance in the distance set B may be selected as the target distance; if the first distance B1 is the smallest, B1 is selected as the target distance. That is, optionally, in some embodiments, the step of "selecting a preset number of distances from the distance set to obtain at least one target distance" may specifically include: selecting the smallest distance from the distance set as the target distance.
After a video to be detected is obtained, video frames for video content detection are selected from it to obtain a plurality of video frames to be detected; the image features corresponding to each video frame to be detected are extracted; the cluster center to which each content category belongs is obtained; the distances between each image feature and the plurality of cluster centers are calculated respectively to obtain a distance set corresponding to each image feature; and finally the content category of the video to be detected is determined according to the distance sets and the plurality of cluster centers. Compared with existing video content detection schemes, this scheme calculates the distances between the image features and the cluster centers and then determines the content category of the video to be detected based on the distance sets and the cluster centers; that is, in the actual detection process, the content category of each image frame is considered before the content category of the video to be detected is determined, so the accuracy of detecting the video content can be improved.
The method according to the embodiment will be described in further detail by way of example.
In this embodiment, a description will be given of an example in which the detection device for video content is specifically integrated in a server.
Referring to fig. 2a, a method for detecting video content may include the following steps:
201. and the server acquires the video to be detected.
The video to be detected may include a plurality of video frames, and the server may acquire the video to be detected in various ways, for example, the server may acquire the video to be detected from a specified database, and may specifically depend on the requirements of practical applications, where the video to be detected may include a television play, a movie, a video recorded by a user, and so on.
202. The server selects video frames for video content detection from the videos to be detected, and a plurality of video frames to be detected are obtained.
For example, the server may select video frames for video content detection according to the arrangement sequence of the video frames in the video to be detected, obtaining a plurality of video frames to be detected. To improve computational efficiency, the number of video frames may first be compressed, that is, part of the video frames may be deleted, and the video frames for video content detection then selected from the remaining frames.
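Frame-number compression by evenly spaced selection, as described above, might look like this (the preset number of 32 and the function name are assumptions, not values from the patent):

```python
def sample_frames(frames, preset_number=32):
    """Keep at most `preset_number` frames, taken at evenly spaced
    intervals over the arrangement sequence; return all frames if
    there are few enough already."""
    if len(frames) <= preset_number:
        return list(frames)
    step = len(frames) / preset_number
    return [frames[int(i * step)] for i in range(preset_number)]

video = list(range(100))           # 100 frame indices
picked = sample_frames(video, 10)
print(picked)  # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```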
It should be noted that, in order to reduce the influence of low-information-content frames on subsequent video content detection, the server may also delete such frames before selecting the video frames for video content detection, where a low-information-content frame refers to a video frame with overly simple colour and texture features, a video title frame, and so on; specifically, all video frames may be screened using an image-complexity algorithm.
203. And the server extracts image features corresponding to the video frames to be detected and acquires a clustering center to which each content category belongs.
For example, the server may perform feature extraction on each video frame to be detected based on a trained video content classification model to obtain the image features corresponding to each video frame to be detected, where the video content classification model is trained from a plurality of sample video frames labelled with video content types. The server may import each sample video frame into the trained video content classification model to obtain the sample image features corresponding to each sample video frame, and then construct the cluster center to which each content category belongs according to the sample image features.
It should be noted that the video content classification model may be pre-established, and in particular, please refer to the previous embodiment, which is not described herein.
204. The server calculates the distance between each image feature and a plurality of clustering centers respectively to obtain a distance set corresponding to each image feature.
For example, there are 10 image features and 6 cluster centers, and each image feature corresponds to a distance set, where the distance set includes distances between the image feature and the 6 cluster centers, where the distances may be represented by euclidean distances, and of course, may also be represented by mahalanobis distances, which are specifically selected according to practical situations, and are not described herein again.
205. And the server determines the content category of the video to be detected according to the distance set and the plurality of clustering centers.
The server may determine the content category of the video to be detected according to the distance set and the plurality of cluster centers in either of two ways.
The first way: taking one video feature A and 6 cluster centers as an example, the distance set B corresponding to the video feature A includes a first distance B1, a second distance B2, a third distance B3, a fourth distance B4, a fifth distance B5 and a sixth distance B6. The server may randomly select three distances from the distance set B, for example the second distance B2, the fifth distance B5 and the sixth distance B6; the server then obtains the cluster centers corresponding to B2, B5 and B6 respectively, and finally determines the content category of the video to be detected based on the content categories corresponding to these cluster centers.
The second way: taking one video feature A and 6 cluster centers as an example, the distance set B corresponding to the video feature A includes a first distance B1, a second distance B2, a third distance B3, a fourth distance B4, a fifth distance B5 and a sixth distance B6. The server selects the smallest distance in the distance set B as the target distance; if the first distance B1 is the smallest, the server takes B1 as the target distance, then obtains the cluster center corresponding to B1, and finally determines the content category of the video to be detected based on the content category corresponding to that cluster center.
In order to facilitate understanding of the method for detecting video content provided by the embodiment of the present invention, detecting which cartoon a video belongs to is taken as an example. Referring to fig. 2b, the device for detecting video content is integrated in a server, and the server can detect a video X to be detected according to a trained video content classification model. The process comprises three stages: a training data acquisition stage, a video frame preprocessing stage and a detection stage.
In the training data acquisition stage, the server may acquire a sample video set through multiple channels, the set including a plurality of sample video frames marked with video content types; for example, 100 cartoon videos and 1 non-cartoon video can be acquired, where each training image frame has two labels: a first label i indicating whether the video frame is a cartoon, and a second label k indicating which cartoon the video frame belongs to. A convolutional neural network is then used as the base network and trained on the sample video frame set to obtain the trained video content classification model. It should be noted that, when training the convolutional neural network, three loss functions may be adopted, namely loss function L1, loss function L2 and loss function L3: the loss function L1 is used to determine whether the video is a cartoon, the loss function L2 is used to determine which cartoon the video belongs to, and the loss function L3 likewise constrains whether the video is a cartoon, as shown in fig. 2c. When the model is used, the video content classification model can be obtained by removing the loss functions.
In the video frame preprocessing stage, the server may compress the number of frames, delete frames with low information content and/or crop the video frames; for details please refer to the foregoing embodiments, which are not described herein. It should be noted that the video frames may be preprocessed or not, selected according to the actual situation. The server then extracts the features of each sample video frame using the trained video content classification model, and finally constructs the cluster center to which each content category belongs based on the extracted features.
In the detection stage, when the server acquires the image feature Q of an image, it calculates the Euclidean distances d between the image feature Q and all cluster centers. The cluster centers can be divided into a first cluster center set C1 and a second cluster center set C2, C1 = {xi | i = 1, 2, ..., c1} and C2 = {yi | i = 1, 2, ..., c2}, where i is a positive integer; the labels corresponding to the cluster centers in the first set may be cartoon labels such as "laugh", "hot-blooded" and "different world", while the labels corresponding to the cluster centers in the second set may be non-cartoon labels such as "talk show", "comedy" and "inference play". The clustering method may adopt the K-means Clustering Algorithm. Then, the Euclidean distance between the image feature Q and each cluster center is calculated, the shortest n Euclidean distances are selected (n being a positive odd number), and the cluster centers corresponding to the selected distances are obtained as target cluster centers. For example, the shortest 3 Euclidean distances are selected and the cluster centers corresponding to these 3 distances obtained as cluster center T1, cluster center T2 and cluster center T3; the content categories of T1, T2 and T3 are then determined respectively. If T1 is "laugh", T2 is "different world" and T3 is "comedy", it is determined that the image feature Q belongs to a cartoon; if T1 is "comedy", T2 is "different world" and T3 is "comedy", it is determined that the image feature Q belongs to a non-cartoon. In the invention, the image features corresponding to all the video frames to be detected may be subjected to these steps, giving a judgment of whether each image frame of a video is a cartoon. Preferably, if the cartoon frames of a video account for more than 30% of its frames, the video is considered to be a cartoon video; otherwise, it is judged to be a non-cartoon video.
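The per-frame detection-stage vote (shortest n Euclidean distances, n odd, majority label) can be sketched as follows; the 2-D centers and label strings are illustrative stand-ins for the real feature vectors and cluster labels:

```python
import numpy as np
from collections import Counter

def classify_frame(feature, centers, labels, n=3):
    """Pick the n nearest cluster centers (n a positive odd number)
    and majority-vote their labels, as in the detection stage."""
    d = np.linalg.norm(centers - feature, axis=1)  # Euclidean distances
    nearest = np.argsort(d)[:n]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

centers = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [6.0, 5.0]])
labels = ["cartoon", "cartoon", "cartoon", "non-cartoon", "non-cartoon"]
print(classify_frame(np.array([0.05, 0.05]), centers, labels))  # cartoon
```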
As can be seen from the above, after the server in the embodiment of the present invention obtains the video to be detected, it selects the video frames for video content detection from the video to obtain a plurality of video frames to be detected, extracts the image features corresponding to each video frame to be detected, obtains the cluster center to which each content category belongs, calculates the distances between each image feature and the plurality of cluster centers respectively to obtain the distance set corresponding to each image feature, and finally determines the content category of the video to be detected according to the distance sets and the plurality of cluster centers. Compared with existing video content detection schemes, this scheme calculates the distance between each image feature and the cluster centers and then determines the content category of the video to be detected based on the distance sets and the cluster centers; that is, in the actual detection process, the content category of each image frame is considered before the content category of the video to be detected is determined, so the accuracy of detecting the video content can be improved.
In order to facilitate better implementation of the method for detecting video content according to the embodiment of the present invention, the embodiment of the present invention further provides a device for detecting video content based on the above method (abbreviated as a detection device). The meaning of the noun is the same as that in the method for detecting video content, and specific implementation details can be referred to the description in the method embodiment.
Referring to fig. 3a, fig. 3a is a schematic structural diagram of a video content detection apparatus according to an embodiment of the present invention, where the detection apparatus may include a first acquisition module 301, a selection module 302, an extraction module 303, a second acquisition module 304, a calculation module 305, and a determination module 306, and may specifically be as follows:
the first obtaining module 301 is configured to obtain a video to be detected.
The video to be detected may include a plurality of video frames, and the ways of obtaining the video to be detected may be various, for example, the first obtaining module 301 may obtain the video from the internet and/or a designated database, and may specifically be determined according to the requirements of practical applications, where the video to be detected may include a television, a movie, a video recorded by a user, and so on.
The selecting module 302 is configured to select a video frame for video content detection from the videos to be detected, so as to obtain a plurality of video frames to be detected.
For example, the selection module 302 may select video frames for video content detection according to the arrangement sequence of the video frames in the video to be detected. To improve computational efficiency, the selection module 302 may compress the number of video frames, that is, delete part of the video frames, and then select the video frames for video content detection from the remaining frames. That is, optionally, in some embodiments, the selection module 302 may specifically be configured to: detect the number of video frames in the video to be detected; judge whether the number is larger than a preset number; when the number is larger than the preset number, remove the corresponding video frames from the video to be detected based on a preset strategy to obtain a retained video frame set; and select a plurality of video frames at intervals from the retained video frame set to obtain the video frames to be detected.
Optionally, in some embodiments, the selection module 302 may be further specifically configured to: and when the number is smaller than or equal to the preset number, selecting a plurality of video frames from the video to be detected, and obtaining the video frames to be detected.
The extracting module 303 is configured to extract image features corresponding to each video frame to be detected.
For example, the extracting module 303 may perform feature extraction on each video frame to be detected based on a trained video content classification model, to obtain image features corresponding to each video frame to be detected, where the video content classification model is trained by a plurality of sample video frames labeled with video content types.
The second obtaining module 304 is configured to obtain a cluster center to which each content category belongs.
Wherein the second obtaining module 304 may construct a cluster center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames, that is, optionally, in some embodiments, the second obtaining module 304 includes:
the acquisition unit is used for acquiring the trained video content classification model and a plurality of sample video frames marked with the video content types;
and the construction unit is used for constructing a clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames.
Optionally, in some embodiments, the building unit may specifically include:
the extraction subunit is used for extracting the characteristics of each sample video frame by utilizing the trained video content classification model;
and the construction subunit is used for constructing a clustering center to which each content category belongs based on the extracted characteristics.
Alternatively, in some embodiments, the building sub-unit may specifically be configured to: acquiring a plurality of preset content tags, determining the number of categories to be clustered according to the plurality of preset content tags, and carrying out clustering processing on the extracted features based on a preset clustering algorithm and the number of categories to obtain a clustering center to which each content category belongs.
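The construction subunit's clustering step can be sketched with a minimal K-means (Lloyd's algorithm) in NumPy, where the number of clusters equals the number of preset content labels; the labels, feature dimension and data below are illustrative assumptions:

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Minimal K-means (Lloyd's algorithm): one cluster center per
    preset content label."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest center, then recompute centers.
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = features[assign == j].mean(axis=0)
    return centers

preset_labels = ["laugh", "hot-blooded", "different world",
                 "talk show", "comedy", "inference play"]
feats = np.random.default_rng(1).standard_normal((60, 16))  # sample frame features
centers = kmeans(feats, k=len(preset_labels))
print(centers.shape)  # (6, 16)
```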
Optionally, in some embodiments, referring to fig. 3b, the detection apparatus further includes a training module 307, and the training module 307 may specifically be configured to: collect a plurality of sample video frames marked with video content types; determine a sample video frame that currently needs to be trained from the collected sample video frames to obtain a current processing object; import the current processing object into a preset initial classification model for training to obtain a predicted value of the video content corresponding to the current processing object; converge the predicted value and the marked video content type of the current processing object to adjust the parameters of the preset initial classification model; and return to the step of determining the sample video frame that currently needs to be trained from the collected sample video frames until all the sample video frames have been trained.
The calculating module 305 is configured to calculate distances between each image feature and a plurality of cluster centers, respectively, to obtain a distance set corresponding to each image feature.
For example, there are 10 image features and 6 cluster centers, and each image feature corresponds to a distance set, where the distance set includes distances between the image feature and the 6 cluster centers, where the distances may be represented by euclidean distances, and of course, may also be represented by mahalanobis distances, which are specifically selected according to practical situations, and are not described herein again.
The determining module 306 is configured to determine a content category of the video to be detected according to the distance set and the plurality of clustering centers.
For example, specifically, the determining module 306 may randomly select 3 distances from the distance set, and determine the content category of the video to be detected according to the cluster center corresponding to the 3 distances.
Optionally, in some embodiments, the determining module 306 may specifically include:
a selecting unit, configured to select a preset number of distances from the distance set, to obtain at least one target distance;
the first determining unit is used for determining a cluster center corresponding to the target distance to obtain at least one target cluster center;
And the second determining unit is used for determining the content category of the video to be detected according to at least one target clustering center.
Alternatively, in some embodiments, the selection unit may specifically be configured to: and selecting the smallest distance from the distance set as the target distance.
Alternatively, in some embodiments, the second determining unit may specifically be configured to: and respectively acquiring content categories corresponding to the target clustering centers, and determining the content category of the video to be detected based on the determined content categories.
It can be seen that, after the first obtaining module 301 obtains the video to be detected, the selection module 302 selects the video frames for video content detection from the video to obtain a plurality of video frames to be detected, the extraction module 303 extracts the image features corresponding to each video frame to be detected, the second obtaining module 304 obtains the cluster center to which each content category belongs, the calculation module 305 calculates the distances between each image feature and the plurality of cluster centers to obtain the distance set corresponding to each image feature, and the determination module 306 determines the content category of the video to be detected according to the distance sets and the plurality of cluster centers. Compared with existing video content detection schemes, this scheme considers the content category of each image frame in the actual detection process before determining the content category of the video to be detected, so the accuracy of detecting the video content can be improved.
In addition, the embodiment of the invention further provides an electronic device, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the embodiment of the invention, specifically:
the electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404 and other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device, etc. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which may be used for receiving input numeric or character information and generating keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
obtaining a video to be detected; selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected; extracting the image features corresponding to the video frames to be detected; obtaining the clustering centers to which each content category belongs; respectively calculating the distances between each image feature and the plurality of clustering centers to obtain a distance set corresponding to each image feature; and determining the content category of the video to be detected according to the distance sets and the plurality of clustering centers.
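The distance-computation step above can be sketched as follows. This is a minimal illustration, not part of the patent text: the function and variable names are hypothetical, and Euclidean distance is an assumption, since the embodiments do not fix a particular metric.

```python
import math

def distance_sets(frame_features, cluster_centers):
    """For each frame's image feature, compute its distance to every
    cluster center, yielding one distance set per feature."""
    sets = []
    for feat in frame_features:
        # One distance per cluster center; together these form the
        # "distance set corresponding to the image feature".
        dists = [math.dist(feat, center) for center in cluster_centers]
        sets.append(dists)
    return sets
```

Each returned inner list is one "distance set": the distances from a single frame's feature vector to every cluster center.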
For the specific implementation of each of the above operations, reference may be made to the previous embodiments; details are not repeated herein.
After obtaining a video to be detected, the server selects video frames for video content detection from the video to obtain a plurality of video frames to be detected; the server then extracts the image features corresponding to these video frames, obtains the clustering centers to which each content category belongs, and calculates the distances between each image feature and the clustering centers to obtain a distance set for each image feature; finally, the server determines the content category of the video according to the distance sets and the clustering centers. Compared with existing video content detection schemes, this scheme calculates the distance between each image feature and the plurality of clustering centers and then determines the content category of the video to be detected based on the distance sets and the clustering centers; that is, during actual detection, the content category of each individual video frame is considered before the content category of the whole video is determined, which improves the accuracy of video content detection.
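The per-frame classification and proportion-based aggregation described above can be sketched as follows. This is an illustrative example only: `classify_video`, the nearest-center rule, and the 0.5 default ratio are assumptions not fixed by the embodiments.

```python
def classify_video(distance_sets, center_labels, min_ratio=0.5):
    """Label each frame with the content category of its nearest cluster
    center, then keep only the categories whose share of frames exceeds
    min_ratio (the 'preset proportion')."""
    frame_labels = []
    for dists in distance_sets:
        # Index of the nearest cluster center for this frame.
        nearest = min(range(len(dists)), key=dists.__getitem__)
        frame_labels.append(center_labels[nearest])
    # Count how many frames fall under each category.
    counts = {}
    for label in frame_labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(frame_labels)
    return [label for label, n in counts.items() if n / total > min_ratio]
```

With three frames whose nearest centers are "cat", "cat", "dog", only "cat" (2/3 of the frames) clears a 0.5 threshold and becomes the video-level category.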
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be performed by instructions, or by related hardware controlled by instructions, and the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present invention provides a storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the video content detection methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
obtaining a video to be detected; selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected; extracting the image features corresponding to the video frames to be detected; obtaining the clustering centers to which each content category belongs; respectively calculating the distances between each image feature and the plurality of clustering centers to obtain a distance set corresponding to each image feature; and determining the content category of the video to be detected according to the distance sets and the plurality of clustering centers.
For the specific implementation of each of the above operations, reference may be made to the previous embodiments; details are not repeated herein.
Wherein the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, and the like.
The instructions stored in the storage medium can execute the steps of any video content detection method provided by the embodiments of the present invention, and can therefore achieve the beneficial effects attainable by any such method; for details, see the previous embodiments, which are not repeated herein.
The foregoing describes in detail the video content detection method, apparatus, electronic device, and storage medium provided by the embodiments of the present invention. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the above embodiments are only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and application scope in light of the ideas of the present invention. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (11)

1. A method for detecting video content, comprising:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of video frames;
selecting a video frame for video content detection from the video to be detected to obtain a plurality of video frames to be detected;
extracting image features corresponding to each video frame to be detected, and acquiring a clustering center to which each content category belongs;
respectively calculating the distance between each image feature and a plurality of clustering centers to obtain a distance set corresponding to each image feature;
determining the content category of the video to be detected according to the distance set and the plurality of clustering centers;
The determining the content category of the video to be detected according to the distance set and the plurality of clustering centers comprises the following steps: selecting a preset number of distances from the distance set to obtain at least one target distance; determining a cluster center corresponding to the target distance to obtain at least one target cluster center;
respectively acquiring the content categories corresponding to the at least one target cluster center, and determining a content category to which the content categories corresponding to the at least one target cluster center commonly belong as the content category of the corresponding image frame to be detected;
calculating the proportion of the image frames to be detected corresponding to each content category among all the image frames to be detected; and when the proportion is larger than a preset proportion, determining the content category of the image frames to be detected corresponding to the proportion as the content category of the video to be detected.
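A minimal sketch of the per-frame determination recited in claim 1, under the assumption that "a preset number of distances" means the k smallest and that each target cluster center carries a set of candidate content categories; all identifiers here are hypothetical, not part of the claim language.

```python
def frame_category(distances, center_categories, k=3):
    """Take the k smallest distances (the 'target distances'), map them to
    their cluster centers' category sets, and return the categories that
    all of those centers have in common."""
    # Indices of the k nearest cluster centers for this frame.
    order = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Intersect the category sets of the selected centers.
    common = set(center_categories[order[0]])
    for idx in order[1:]:
        common &= set(center_categories[idx])
    return common
```

If the three nearest centers carry the category sets {sport, outdoor}, {sport}, and {sport, news}, the frame is assigned the commonly shared category "sport".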
2. The method according to claim 1, wherein the obtaining a cluster center to which each content category belongs includes:
acquiring a trained video content classification model and a plurality of sample video frames marked with video content types, wherein the video content classification model is trained by the plurality of sample video frames;
And constructing a clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames.
3. The method of claim 2, wherein constructing a cluster center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames comprises:
respectively extracting the characteristics of each sample video frame by using the trained video content classification model;
and constructing a clustering center to which each content category belongs based on the extracted features.
4. A method according to claim 3, wherein said constructing a cluster center to which each content category belongs based on the extracted features comprises:
acquiring a plurality of preset content labels;
determining the number of classifications for clustering according to the plurality of preset content labels;
and carrying out clustering processing on the extracted features based on a preset clustering algorithm and the number of classifications to obtain clustering centers to which each content category belongs.
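Claim 4 derives the number of clusters from the preset content labels and then clusters the extracted features. A toy implementation in the spirit of that step might look as follows; the claim names no specific clustering algorithm, so plain k-means is an assumption, as are all identifiers.

```python
import random

def kmeans(features, k, iters=20, seed=0):
    """Plain k-means: k would be the number of preset content labels;
    the returned centers are the 'cluster centers' of the claim."""
    rng = random.Random(seed)
    centers = rng.sample(features, k)  # initialize from the data points
    for _ in range(iters):
        # Assign every feature to its nearest current center.
        groups = [[] for _ in range(k)]
        for f in features:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(f, centers[i])))
            groups[idx].append(f)
        # Recompute each center as the mean of its group (keep old center
        # if a group happens to be empty).
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers
```

On two well-separated groups of points, the two returned centers converge to the group means regardless of which points were drawn for initialization.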
5. The method of claim 2, wherein prior to the obtaining the trained video content classification model, further comprising:
collecting a plurality of sample video frames marked with video content types;
Determining a sample video frame which is required to be trained currently from a plurality of collected sample video frames to obtain a current processing object;
the current processing object is imported into a preset initial classification model for training, and a predicted value of video content corresponding to the current processing object is obtained;
converging the predicted value corresponding to the current processing object and the marked video content type of the current processing object so as to adjust the parameters of the preset initial classification model;
and returning to the step of determining, from the plurality of collected sample video frames, the sample video frame that currently needs to be trained, until the plurality of sample video frames are all trained.
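The training loop of claim 5 — take the current processing object, predict, converge the prediction toward the annotated label by adjusting parameters, then move to the next sample — can be illustrated with a deliberately simple stand-in model. A perceptron is used here only for brevity; the actual classification model of the embodiments would be far richer, and every name below is hypothetical.

```python
def train(samples, labels, lr=0.1, epochs=10):
    """Minimal stand-in for the claim's loop: for each sample (the
    'current processing object'), predict, compare with the annotated
    label, and adjust the parameters; repeat until all are trained."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Predicted value for the current processing object.
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            # 'Converge' the prediction toward the label by nudging
            # the model parameters.
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b
```

After training on linearly separable one-dimensional data, the learned weights reproduce the annotated labels.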
6. The method of claim 1, wherein selecting a predetermined number of distances from the set of distances to obtain at least one target distance comprises:
and selecting the smallest distance from the distance set as the target distance.
7. The method according to any one of claims 1 to 5, wherein selecting a video frame for video content detection from the video to be detected to obtain a plurality of video frames to be detected comprises:
detecting the number of video frames in a video to be detected;
Judging whether the number is larger than a preset number or not;
when the number is larger than the preset number, removing corresponding video frames from the video to be detected based on a preset strategy to obtain a reserved video frame set;
and selecting a plurality of video frames at intervals from the reserved video frame set to obtain the video frames to be detected.
8. The method as recited in claim 7, further comprising:
and when the number is smaller than or equal to the preset number, selecting a plurality of video frames from the video to be detected to obtain the video frames to be detected.
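Claims 7 and 8 together describe the frame-selection branch: count the frames, thin them out when the count exceeds a preset number, then sample the survivors at intervals. The sketch below uses assumed values: keeping every other frame as the "preset strategy", a preset count of 8, and an interval of 3 are arbitrary illustrative choices, as are the names.

```python
def select_frames(frames, preset_count=8, interval=3):
    """If the video has more frames than the preset number, first remove
    frames per a (hypothetical) preset strategy, then sample the reserved
    set at intervals; otherwise use the frames directly."""
    if len(frames) > preset_count:
        reserved = frames[::2]        # assumed removal strategy: keep every other frame
        return reserved[::interval]   # interval sampling from the reserved set
    return list(frames)               # few frames: select from them directly
```

For a 20-frame video this keeps frames 0, 6, 12, and 18; a 3-frame video is passed through unchanged.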
9. A video content detection apparatus, comprising:
the first acquisition module is used for acquiring a video to be detected, wherein the video to be detected comprises a plurality of video frames;
the selection module is used for selecting video frames for video content detection from the videos to be detected to obtain a plurality of video frames to be detected;
the extraction module is used for extracting image features corresponding to each video frame to be detected;
the second acquisition module is used for acquiring a clustering center to which each content category belongs;
the computing module is used for respectively computing the distance between each image feature and a plurality of clustering centers to obtain a distance set corresponding to each image feature;
The determining module is used for determining the content category of the video to be detected according to the distance set and the plurality of clustering centers;
the determining module is specifically configured to: select a preset number of distances from the distance set to obtain at least one target distance; determine the cluster center corresponding to the target distance to obtain at least one target cluster center; respectively acquire the content categories corresponding to the at least one target cluster center, and determine a content category to which the content categories corresponding to the at least one target cluster center commonly belong as the content category of the corresponding image frame to be detected; calculate the proportion of the image frames to be detected corresponding to each content category among all the image frames to be detected; and when the proportion is larger than a preset proportion, determine the content category of the image frames to be detected corresponding to the proportion as the content category of the video to be detected.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the video content detection method according to any one of claims 1-8 when executing the program.
11. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the video content detection method according to any one of claims 1-8.
CN202010027419.1A 2020-01-10 2020-01-10 Video content detection method and device, electronic equipment and storage medium Active CN111242019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010027419.1A CN111242019B (en) 2020-01-10 2020-01-10 Video content detection method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111242019A CN111242019A (en) 2020-06-05
CN111242019B true CN111242019B (en) 2023-11-14

Family

ID=70874441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010027419.1A Active CN111242019B (en) 2020-01-10 2020-01-10 Video content detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111242019B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723719B (en) * 2020-06-12 2021-08-13 中国科学院自动化研究所 Video target detection method, system and device based on category external memory
CN111738171B (en) * 2020-06-24 2023-12-08 北京奇艺世纪科技有限公司 Video clip detection method and device, electronic equipment and storage medium
CN113038142B (en) * 2021-03-25 2022-11-01 北京金山云网络技术有限公司 Video data screening method and device and electronic equipment
CN116137671A (en) * 2021-11-17 2023-05-19 北京字跳网络技术有限公司 Cover generation method, device, equipment and medium
CN114241367B (en) * 2021-12-02 2024-08-23 北京国瑞数智技术有限公司 Visual semantic detection method and system

Citations (6)

Publication number Priority date Publication date Assignee Title
CN104303193A (en) * 2011-12-28 2015-01-21 派尔高公司 Clustering-based object classification
CN105468781A (en) * 2015-12-21 2016-04-06 小米科技有限责任公司 Video query method and device
CN110119757A (en) * 2019-03-28 2019-08-13 北京奇艺世纪科技有限公司 Model training method, video category detection method, device, electronic equipment and computer-readable medium
CN110334753A (en) * 2019-06-26 2019-10-15 Oppo广东移动通信有限公司 Video classification methods, device, electronic equipment and storage medium
CN110399890A (en) * 2019-11-01 Image recognition method, device, electronic equipment and readable storage medium
CN110427825A (en) * 2019-11-08 Video flame recognition method based on fusion of key frames and fast support vector machines

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20130251340A1 (en) * 2012-03-21 2013-09-26 Wei Jiang Video concept classification using temporally-correlated grouplets


Also Published As

Publication number Publication date
CN111242019A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111242019B (en) Video content detection method and device, electronic equipment and storage medium
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
CN112232425B (en) Image processing method, device, storage medium and electronic equipment
CN110555481B (en) Portrait style recognition method, device and computer readable storage medium
CN104679818B Video key frame extraction method and system
CN111242844B (en) Image processing method, device, server and storage medium
CN111479130B (en) Video positioning method and device, electronic equipment and storage medium
CN111079833B (en) Image recognition method, image recognition device and computer-readable storage medium
CN111723784B (en) Risk video identification method and device and electronic equipment
CN112818251B (en) Video recommendation method and device, electronic equipment and storage medium
CN111741330A (en) Video content evaluation method and device, storage medium and computer equipment
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN112633425B (en) Image classification method and device
CN114445633B (en) Image processing method, apparatus and computer readable storage medium
CN112232258A (en) Information processing method and device and computer readable storage medium
CN113052150B (en) Living body detection method, living body detection device, electronic apparatus, and computer-readable storage medium
CN111046655B (en) Data processing method and device and computer readable storage medium
CN110674716A (en) Image recognition method, device and storage medium
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN113705307A (en) Image processing method, device, equipment and storage medium
CN115909336A (en) Text recognition method and device, computer equipment and computer-readable storage medium
CN113704534A (en) Image processing method and device and computer equipment
CN113824989B (en) Video processing method, device and computer readable storage medium
CN114708449A (en) Similar video determination method, and training method and device of example characterization model
Wang et al. Design of static human posture recognition algorithm based on CNN

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40024733

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant