CN112800919A

CN112800919A - Method, device and equipment for detecting target type video and storage medium

Info

Publication number: CN112800919A
Application number: CN202110084414.7A
Authority: CN
Inventors: 付志康
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd; Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-01-21
Filing date: 2021-01-21
Publication date: 2021-05-14

Abstract

The disclosure discloses a method, a device, equipment and a storage medium for detecting a target type video, and relates to the field of artificial intelligence such as deep learning and image processing. One specific implementation of the method for detecting the target type video includes: extracting at least one image frame from a video to be detected; inputting at least one image frame into a pre-trained target image classification model to obtain a classification result of a video to be detected; in response to the fact that the classification result of the video to be detected is not the target type video, extracting a digital audio file from the video to be detected; and inputting the digital audio file into the first target object classification model and/or inputting at least one image frame into the second target object classification model to obtain a second modal detection result of the video to be detected, so that the recall rate and the accuracy of the target type video detection technology are improved.

Description

Method, device and equipment for detecting target type video and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, in particular to artificial intelligence fields such as deep learning and image processing, and more specifically to a method, an apparatus, a device, and a storage medium for detecting a target type video.

Background

Short videos are becoming a popular form of entertainment, and users can freely upload and download short videos in applications. However, some of the short videos uploaded by users are prohibited videos (for example, pornographic videos, violent videos, and the like).

Disclosure of Invention

The disclosure provides a method, a device, equipment and a storage medium for detecting a target type video.

According to a first aspect of the present disclosure, there is provided a method for detecting a target type video, including: extracting at least one image frame from a video to be detected; inputting at least one image frame into a pre-trained target image classification model to obtain a classification result of a video to be detected; in response to the fact that the classification result of the video to be detected is not the target type video, extracting a digital audio file from the video to be detected; and inputting the digital audio file into a first target object classification model and/or inputting at least one image frame into a second target object classification model to obtain a second modal detection result of the video to be detected, wherein the first target object classification model is used for determining whether the digital audio file contains a first target object, the second target object classification model is used for determining whether the at least one image frame contains a second target object, and the first target object and the second target object are used for representing a target type video.

According to a second aspect of the present disclosure, there is provided an apparatus for detecting a target type video, including: a first extraction module configured to extract at least one image frame from a video to be detected; a first classification module configured to input at least one image frame into a pre-trained target image classification model to obtain a classification result of the video to be detected; a second extraction module configured to extract a digital audio file from the video to be detected in response to the classification result of the video to be detected not being the target type video; and a second classification module configured to input the digital audio file into a first target object classification model and/or input at least one image frame into a second target object classification model to obtain a second modal detection result of the video to be detected, wherein the first target object classification model is used for determining whether the digital audio file contains a first target object, the second target object classification model is used for determining whether the at least one image frame contains a second target object, and the first target object and the second target object are used for representing the target type video.

According to a third aspect of the present disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.

According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium is presented storing computer instructions for causing a computer to perform the method as described in any one of the implementations of the first aspect.

According to a fifth aspect of the present disclosure, a computer program product is presented, comprising a computer program which, when executed by a processor, performs the method as described in any of the implementations of the first aspect.

According to the method, the device, the equipment and the storage medium for detecting the target type video, at least one image frame is extracted from a video to be detected; then inputting at least one image frame into a pre-trained target image classification model to obtain a classification result of the video to be detected; then, in response to the fact that the classification result of the video to be detected is not the target type video, extracting a digital audio file from the video to be detected; and finally, inputting the digital audio file into the first target object classification model and/or inputting at least one image frame into the second target object classification model to obtain a second modal detection result of the video to be detected, so that the recall rate and the accuracy rate of the target type video detection technology are improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a method for detecting target type video in accordance with the present application;

FIG. 3 is a schematic flow chart diagram illustrating another embodiment of a method for detecting target type video in accordance with the present application;

FIG. 4 is a schematic diagram of an application scenario of an embodiment of a method for detecting a target type video according to the present application;

FIG. 5 is a block diagram illustrating an embodiment of a target type video apparatus of the present application;

FIG. 6 is a block diagram of an electronic device for implementing the method for detecting a target type video according to the embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the method for detecting a target type video or the apparatus for detecting a target type video of the present application may be applied.

As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

Terminal device 101 may interact with server 103 through network 102. Various types of videos to be detected can be uploaded in the terminal device 101, including but not limited to legal videos, prohibited videos, and the like.

The server 103 may provide various services, for example, the server 103 may perform processing such as analysis on various types of data of the video to be detected and the like acquired from the terminal device 101, and generate a processing result (for example, obtain a second-modality detection result of the video to be detected).

The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not specifically limited herein.

It should be noted that the method for detecting the target type video provided in the embodiment of the present application is generally executed by the server 103, and accordingly, the target type video detecting device is generally disposed in the server 103.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method of detecting target type video in accordance with the present application is shown. The method comprises the following steps:

Step 201, at least one image frame is extracted from a video to be detected.

In the present embodiment, an execution subject (for example, the server 103 shown in fig. 1) of the method of detecting the target type video may extract at least one image frame from the video to be detected.

Video frames of the video to be detected may be captured at a preset time interval. Illustratively, frame extraction may be performed every three seconds, or at another time interval. At least one image frame is thereby extracted from the video to be detected.
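The interval-based frame extraction described above can be sketched as follows. This is a minimal illustration only: the function name `sample_frame_indices` and its parameters are hypothetical, and the sketch computes only which frame indices to capture, leaving the actual decoding to whatever video library is in use.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, interval_s: float = 3.0) -> List[int]:
    """Return the indices of the frames to capture, one every `interval_s` seconds."""
    if total_frames <= 0 or fps <= 0 or interval_s <= 0:
        return []
    # Number of frames between two consecutive captures (at least 1).
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


# A 10-second clip at 25 frames per second, sampled every three seconds.
indices = sample_frame_indices(total_frames=250, fps=25.0, interval_s=3.0)
```

The three-second default mirrors the example interval given in this embodiment; any other interval can be passed instead.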

Step 202, inputting at least one image frame into a pre-trained target image classification model to obtain a classification result of the video to be detected.

In this embodiment, the executing body may input at least one image frame to a pre-trained target image classification model to obtain a classification result of the video to be detected.

The target image classification model can adopt a neural network classification model to judge whether each frame of image is a target image or not, and then a classification result of each frame of image is obtained.

The target image refers to a target type image, such as an image containing prohibited pictures (e.g., pornography, violence, etc.). The target image classification model can be obtained by training through the following steps: firstly, obtaining a plurality of training samples, wherein each training sample comprises a sample image and a label of whether the sample image belongs to a target type image; and then taking the sample image in each training sample as the input of the target image classification model, taking the label of the sample image as the expected output of the target image classification model, and training to obtain the required target image classification model.

The model to be trained may be an initialized target image classification model, which may be an untrained model or a model whose training has not been completed. Each layer of the initialized target image classification model may be provided with initial parameters, and these parameters are continuously adjusted during training. The initialized target image classification model may be any type of untrained or not fully trained artificial neural network, or a model obtained by combining several such networks; for example, it may be an untrained convolutional neural network, an untrained recurrent neural network, or a model obtained by combining an untrained convolutional neural network, an untrained recurrent neural network, and an untrained fully connected layer. Alternatively, the target image classification model may employ a support vector machine model. A support vector machine (SVM) is a generalized linear classifier that performs binary classification on data in a supervised learning manner; its decision boundary is the maximum-margin hyperplane solved for the learning samples.
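The supervised training described above (sample as input, label as expected output, initial parameters adjusted during training) can be illustrated with a deliberately small stand-in. The sketch below trains a single-layer perceptron on hypothetical two-dimensional feature vectors; it is not the neural network or SVM of the embodiment, and the names `predict`, `train_perceptron`, and the toy data are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> int:
    """Return 1 (target type) if the linear score is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(samples: Sequence[Sequence[float]], labels: Sequence[int],
                     epochs: int = 20, lr: float = 1.0) -> Tuple[List[float], float]:
    """Adjust initial parameters until predictions match the expected outputs."""
    weights = [0.0] * len(samples[0])  # initial parameters
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            if error:  # update only on a wrong prediction
                weights = [w + lr * error * x for w, x in zip(weights, features)]
                bias += lr * error
    return weights, bias


# Hypothetical toy features standing in for sample images; label 1 = target type image.
samples = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = [1, 1, 0, 0]
weights, bias = train_perceptron(samples, labels)
```

A real implementation would replace the feature vectors with image tensors and the perceptron with one of the network types listed above; only the training pattern carries over.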

The execution body can aggregate the classification result of each frame of image to obtain the classification result of the video to be detected. For example, if the classification result of any frame of image is that the frame is a target type image, the classification result of the video to be detected is that the video is the target type video. For another example, if at least half of all the frames are classified as target type images, the classification result of the video to be detected is that the video is the target type video.
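The two aggregation rules above can be sketched as follows. The function name `aggregate_frames` and the rule labels `"any"` and `"half"` are hypothetical; the per-frame results are assumed to be already available as booleans.

```python
from typing import Sequence


def aggregate_frames(frame_is_target: Sequence[bool], rule: str = "any") -> bool:
    """Aggregate per-frame classification results into a video-level result."""
    if not frame_is_target:
        return False
    if rule == "any":
        # A single target type frame marks the whole video.
        return any(frame_is_target)
    if rule == "half":
        # At least half of the frames (half included) must be target type frames.
        return 2 * sum(frame_is_target) >= len(frame_is_target)
    raise ValueError(f"unknown rule: {rule}")
```

The "any" rule favors recall, while the "half" rule tolerates isolated false positives at the frame level.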

Step 203, in response to the classification result indicating that the video to be detected is not the target type video, extracting a digital audio file from the video to be detected.

In this embodiment, the executing body may extract the digital audio file from the video to be detected in response to the classification result indicating that the video to be detected is not the target type video.

If the classification result of the video to be detected is that the video to be detected is not the target type video, step 203 is executed; if the classification result is that the video to be detected is the target type video, step 203 is not executed. The target type video may be a prohibited video, such as a pornographic video or a violent video.

The audio track may be extracted from the video to be detected and saved as a digital audio file (e.g., a wav file) by using an audio information extraction tool (e.g., Music Extractor).
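As one possible way to perform this extraction, the sketch below builds a command line for the widely used `ffmpeg` tool. Note that `ffmpeg` is swapped in here as an assumption; the embodiment only names an audio information extraction tool by example. The function names, the 16 kHz sample rate, and the mono channel setting are likewise illustrative choices.

```python
import subprocess
from typing import List


def build_ffmpeg_wav_command(video_path: str, wav_path: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that saves the audio track of a video as a wav file."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,        # input video
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM, the usual wav encoding
        "-ar", str(sample_rate), # audio sample rate in Hz
        "-ac", "1",              # mono
        wav_path,
    ]


def extract_audio(video_path: str, wav_path: str) -> None:
    """Run the command; requires ffmpeg to be installed and on the PATH."""
    subprocess.run(build_ffmpeg_wav_command(video_path, wav_path), check=True)
```

Separating command construction from execution keeps the extraction step easy to inspect without actually invoking the external tool.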

Step 204, inputting the digital audio file into the first target object classification model and/or inputting at least one image frame into the second target object classification model to obtain a second modal detection result of the video to be detected.

In this embodiment, the executing body may input a digital audio file to the first target object classification model and/or input at least one image frame to the second target object classification model, so as to obtain a second modal detection result of the video to be detected.

Wherein the first target object classification model is used to determine whether the digital audio file contains a first target object, the second target object classification model is used to determine whether the at least one image frame contains a second target object, and the first target object and the second target object are used to characterize a target type video.

Besides the target type image, the target type video may also include other target objects that can be used to characterize that the video belongs to the target type video. Taking a target type video as an example of a pornographic video, the pornographic video can comprise pornographic actions, pornographic organs, pornographic characters, pornographic voice and other objects which can represent that the video belongs to the pornographic video besides pornographic images. The target object can be divided into a first target object and a second target object according to the carrier form of the target object, the carrier form of the first target object can be a digital audio file, and the carrier form of the second target object can be an image. Illustratively, taking a target type video as a pornographic video as an example, the first target object may be pornographic voice, and the second target object may be pornographic motion, pornographic organs, pornographic characters, and the like.

The first target object classification model may adopt a neural network classification model to determine whether the digital audio file contains the first target object, so as to obtain a classification result of the digital audio file. The second target object classification model may also adopt a neural network classification model to determine whether each frame of image includes the second target object, so as to obtain a classification result of each frame of image.

The second modal detection result of the video to be detected may be determined according to either the classification result of the digital audio file or the classification result of each frame of image. For example, if the classification result of the digital audio file indicates that the digital audio file contains the first target object, the second modal detection result is that the video to be detected is the target type video; likewise, if the classification result of any frame of image is that the image contains the second target object, the second modal detection result is that the video to be detected is the target type video.

Alternatively, the second modal detection result may be determined according to both the classification result of the digital audio file and the classification result of each frame of image. For example, the two may be aggregated to obtain the second modal detection result of the video to be detected. Illustratively, the aggregation manner includes, but is not limited to: if at least one classification result indicates that a target object (the first target object or the second target object) is contained, the second modal detection result is that the video to be detected is the target type video.
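Steps 201 to 204 together form a two-stage cascade, which might be sketched as below. All names are hypothetical: `image_model`, `audio_model`, and `object_model` stand in for the target image classification model and the first and second target object classification models, and `load_audio` stands in for the audio extraction of step 203, so that it runs only when the first stage is negative.

```python
from typing import Any, Callable, Optional, Sequence


def detect_target_video(
    frames: Sequence[Any],
    load_audio: Callable[[], Any],
    image_model: Callable[[Any], bool],
    audio_model: Optional[Callable[[Any], bool]] = None,
    object_model: Optional[Callable[[Any], bool]] = None,
) -> bool:
    """Return True if the video is judged to be a target type video."""
    # Stage 1 (step 202): image classification, aggregated with the "any frame" rule.
    if any(image_model(frame) for frame in frames):
        return True
    # Step 203: the audio file is extracted only after a negative first stage.
    audio = load_audio()
    # Stage 2 (step 204): audio and/or image-frame object classification.
    audio_hit = audio_model is not None and bool(audio_model(audio))
    frame_hit = object_model is not None and any(object_model(frame) for frame in frames)
    # A hit from either model yields a positive second modal detection result.
    return audio_hit or frame_hit
```

Passing only one of `audio_model` or `object_model` corresponds to the "and/or" wording of step 204.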

The method for detecting the target type video, provided by the embodiment of the application, can effectively detect the target type video in a large amount of videos, has high recall rate and accuracy, replaces manual review, and saves manpower.

With further reference to fig. 3, shown is a flow diagram of another embodiment of a method of detecting target type video, the method comprising the steps of:

Step 301, at least one image frame is extracted from a video to be detected.

Step 301 is substantially the same as step 201, and therefore will not be described again.

Step 302, inputting at least one image frame into a pre-trained target image classification model to obtain a classification result of the video to be detected.

Step 302 is substantially the same as step 202, and therefore is not described in detail.

Step 303, in response to the classification result indicating that the video to be detected is not the target type video, recognizing characters in the at least one image frame by using an optical character recognition technology.

Optical character recognition (OCR) refers to a process in which an electronic device examines characters in an image, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer characters by a character recognition method. If the classification result of the video to be detected is that the video to be detected is not the target type video, the at least one image frame can be input into an OCR model to recognize the characters in the image.

Step 304, matching the recognized characters with a preset target character dictionary.

The target characters refer to characters that can be used to represent that a video belongs to the target type video; taking a pornographic video as an example of the target type video, the target characters can be pornographic characters. The target character dictionary refers to a set of target characters collected empirically. The characters recognized in each frame of image may be matched against the target character dictionary to determine whether the recognized characters belong to the target characters. For example, similarity matching may be performed between the characters recognized in the image and the target characters in the target character dictionary; when the similarity between the recognized characters and any target character in the dictionary reaches a preset similarity threshold (e.g., 90%), the recognized characters are determined to belong to the target characters.
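The similarity matching against the dictionary can be sketched as follows. The embodiment does not specify a similarity measure, so the ratio computed by `difflib.SequenceMatcher` from the Python standard library is used here as an assumption; the function name and the placeholder dictionary entries are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable


def matches_target_dictionary(recognized: str, dictionary: Iterable[str],
                              threshold: float = 0.9) -> bool:
    """Return True if the recognized text is similar enough to any dictionary entry."""
    text = recognized.strip().lower()
    for target in dictionary:
        # ratio() is 2 * matches / total length, a value between 0 and 1.
        if SequenceMatcher(None, text, target.lower()).ratio() >= threshold:
            return True
    return False


# Placeholder entries; a real dictionary would hold the collected target characters.
target_dictionary = ["forbidden", "bannedword"]
```

A threshold below 100% lets the match tolerate small OCR errors, such as a dropped or substituted character.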

Step 305, obtaining a second modal detection result of the video to be detected according to the matching result.

The characters recognized in each frame of image can be matched with the preset target character dictionary to obtain a matching result for each frame of image. The matching results of all frames are then aggregated to obtain the second modal detection result of the video to be detected. For example, if the matching result of any frame of image is that the frame contains target characters, the second modal detection result is that the video to be detected is the target type video. For another example, if at least half of all the frames are found to contain target characters, the second modal detection result is that the video to be detected is the target type video.

In some optional implementations of this embodiment, the second target object is a target organ, and step 204 includes: inputting at least one image frame into a pre-trained target organ detection model to obtain the second modal detection result of the video to be detected.

The target organ detection model can adopt a neural network classification model to judge whether each frame of image comprises a target organ or not, and then obtains the classification result of each frame of image.

The target organ may be an organ used for representing that a video belongs to the target type video; taking a pornographic video as an example of the target type video, the target organ may include male genitalia, female genitalia, and the like. The target organ detection model can be obtained by training through the following steps: first, obtaining a plurality of training samples, wherein each training sample comprises a sample image and a label indicating whether the sample image contains a target organ; and then taking the sample image in each training sample as the input of the target organ detection model, taking the label of the sample image as the expected output of the target organ detection model, and training to obtain the required target organ detection model.

The execution body can aggregate the classification result of each frame of image to obtain the second modal detection result of the video to be detected. For example, if the classification result of any frame of image is that the frame includes a target organ, the second modal detection result is that the video to be detected is the target type video. For another example, if at least half of all the frames are classified as including the target organ, the second modal detection result is that the video to be detected is the target type video.

In some optional implementations of this embodiment, the second target object is a target action, and step 204 includes: inputting at least one image frame into a pre-trained target action detection model to obtain the second modal detection result of the video to be detected.

The target action detection model can adopt a neural network classification model to judge whether each frame of image comprises a target action or not, and then a classification result of each frame of image is obtained.

The target action may be an action used for representing that a video belongs to the target type video; taking a pornographic video as an example of the target type video, the target action may include sexual acts, sexually suggestive actions, and the like. The target action detection model can be obtained by training through the following steps: first, obtaining a plurality of training samples, wherein each training sample comprises a sample image and a label indicating whether the sample image contains a target action; and then taking the sample image in each training sample as the input of the target action detection model, taking the label of the sample image as the expected output of the target action detection model, and training to obtain the required target action detection model.

The execution body can aggregate the classification result of each frame of image to obtain the second modal detection result of the video to be detected. For example, if the classification result of any frame of image is that the frame includes a target action, the second modal detection result is that the video to be detected is the target type video. For another example, if at least half of all the frames are classified as including the target action, the second modal detection result is that the video to be detected is the target type video.

In some optional implementations of this embodiment, the first target object is a target voice, and step 204 includes: inputting the digital audio file into a pre-trained target voice detection model to obtain the second modal detection result of the video to be detected.

The target voice detection model can adopt a neural network classification model to judge whether each segment of the digital audio file, cut at a preset time interval, includes the target voice, and thereby obtain a classification result for each segment of the digital audio file.

The target voice may be a voice used for representing that a video belongs to the target type video; taking a pornographic video as an example of the target type video, the target voice may include pornographic voice such as moaning. The target voice detection model can be obtained by training through the following steps: first, obtaining a plurality of training samples, wherein each training sample comprises a digital audio file segment of the preset time interval and a label indicating whether the segment contains the target voice; and then taking the digital audio file segment in each training sample as the input of the target voice detection model, taking the label of the segment as the expected output of the target voice detection model, and training to obtain the required target voice detection model.

The execution body can aggregate the classification results of the digital audio file segments of the preset time interval to obtain the second modal detection result of the video to be detected. For example, if the classification result of any segment is that the segment contains the target voice, the second modal detection result is that the video to be detected is the target type video. For another example, if at least half of all the segments are classified as containing the target voice, the second modal detection result is that the video to be detected is the target type video.
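Cutting the audio into segments of a preset duration and classifying each segment might look like the sketch below. The function names are hypothetical, the audio is assumed to be an in-memory sequence of samples, and `classify_segment` is a stand-in for the pre-trained target voice detection model; the "any segment" rule is used for aggregation.

```python
from typing import Callable, List, Sequence, Tuple


def segment_bounds(n_samples: int, sample_rate: int, segment_s: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive fixed-length segments."""
    seg_len = int(sample_rate * segment_s)
    if n_samples <= 0 or seg_len <= 0:
        return []
    # The last segment may be shorter than the others.
    return [(start, min(start + seg_len, n_samples))
            for start in range(0, n_samples, seg_len)]


def audio_has_target_voice(samples: Sequence[float], sample_rate: int, segment_s: float,
                           classify_segment: Callable[[Sequence[float]], bool]) -> bool:
    """Classify every segment and report whether any contains the target voice."""
    bounds = segment_bounds(len(samples), sample_rate, segment_s)
    return any(classify_segment(samples[start:end]) for start, end in bounds)
```

Swapping `any` for a count-based comparison would give the "at least half of the segments" rule instead.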

In some optional implementations of this embodiment, the second target object includes a target text, a target organ, and a target action, the first target object includes a target voice, and the step 204 includes:

Step 2041, inputting at least one image frame to the target character classification model, and obtaining a third modal detection result of the video to be detected.

The characters identified in each frame of image can be matched with a preset target character dictionary to obtain a matching result of each frame of image. And then aggregating the matching results of each frame of image to obtain a third mode detection result of the video to be detected. For example, if the matching result of any frame of image is that the frame of image contains the target characters, the third modality detection result is that the video to be detected is the target type video. For another example, if more than half of the frames (including half of the frames) of the matching results of all the frame images contain target characters, the third modality detection result is that the video to be detected is the target type video.

Step 2042, inputting at least one image frame to the pre-trained target organ detection model to obtain a fourth modal detection result of the video to be detected.

The execution main body can aggregate the classification result of each frame of image to obtain a fourth modal detection result of the video to be detected. For example, if the classification result of any frame of image is that the frame of image includes a target organ, the fourth modality detection result is that the video to be detected is a target type video. For another example, if more than half of the frames (including half) of the images in the classification results of all the frame images include the target organ, the fourth modality detection result is that the video to be detected is the target type video.

Step 2043, inputting at least one image frame to the pre-trained target motion detection model to obtain a fifth modal detection result of the video to be detected.

The execution body can aggregate the classification result of each frame of image to obtain the fifth modal detection result of the video to be detected. For example, if the classification result of any frame of image is that the frame includes a target action, the fifth modal detection result is that the video to be detected is the target type video. For another example, if at least half of all the frames are classified as including the target action, the fifth modal detection result is that the video to be detected is the target type video.

Step 2044, inputting the digital audio file to a pre-trained target voice detection model to obtain a sixth modal detection result of the video to be detected.

The execution main body can aggregate the classification results of a plurality of digital audio file segments divided at preset time intervals to obtain a sixth modal detection result of the video to be detected. For example, if the classification result of any segment is that the segment contains the target voice, the sixth modality detection result is that the video to be detected is the target type video. For another example, if half or more of the segments contain the target voice according to the classification results of all segments, the sixth modality detection result is that the video to be detected is the target type video.
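As a rough sketch of the segment-based voice detection described above, the following code splits an audio sample sequence at a fixed time interval and aggregates per-segment decisions. The segment classifier is a stand-in callable, and the names `split_into_segments`, `detect_target_voice` and `toy_classifier`, as well as the sample values, are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence


def split_into_segments(
    samples: Sequence[float], sample_rate: int, interval_seconds: float
) -> List[Sequence[float]]:
    """Cut the sample sequence into consecutive chunks of interval_seconds each."""
    step = int(sample_rate * interval_seconds)
    if step <= 0:
        raise ValueError("interval must cover at least one sample")
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def detect_target_voice(
    samples: Sequence[float],
    sample_rate: int,
    interval_seconds: float,
    classify_segment: Callable[[Sequence[float]], bool],
    require_half: bool = False,
) -> bool:
    """Aggregate per-segment decisions with the 'any' rule or the 'half or more' rule."""
    segments = split_into_segments(samples, sample_rate, interval_seconds)
    hits = [classify_segment(segment) for segment in segments]
    if not hits:
        return False
    if require_half:
        return 2 * sum(hits) >= len(hits)
    return any(hits)


def toy_classifier(segment: Sequence[float]) -> bool:
    """Placeholder for a trained voice model: flags segments with a loud sample."""
    return max(segment) > 0.5


# Three one-second segments at a toy rate of 4 samples per second (assumed values).
audio = [0.0] * 8 + [1.0] * 4
segments = split_into_segments(audio, sample_rate=4, interval_seconds=1.0)
sixth_result_any = detect_target_voice(audio, 4, 1.0, toy_classifier)
sixth_result_half = detect_target_voice(audio, 4, 1.0, toy_classifier, require_half=True)
```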

Step 2045, aggregating the third modality detection result, the fourth modality detection result, the fifth modality detection result and the sixth modality detection result of the video to be detected to obtain the second modality detection result of the video to be detected.

The third modality detection result, the fourth modality detection result, the fifth modality detection result, and the sixth modality detection result may be combined into the second modality detection result. The combination manner includes but is not limited to the following: if any one of the third, fourth, fifth and sixth modal detection results is that the video to be detected is the target type video, the second modal detection result is that the video to be detected is the target type video; or, if more than half of the third, fourth, fifth and sixth modal detection results are that the video to be detected is the target type video, the second modal detection result is that the video to be detected is the target type video.
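The two combination manners named above can be sketched as a small fusion function. The `Fusion` enum, the `fuse_modal_results` name and the dictionary keys are hypothetical; the text leaves other combination manners open, so this is one possible reading rather than the prescribed rule.

```python
from enum import Enum
from typing import Mapping


class Fusion(Enum):
    ANY = "any"            # one positive modality is enough
    MAJORITY = "majority"  # strictly more than half must be positive


def fuse_modal_results(results: Mapping[str, bool], strategy: Fusion = Fusion.ANY) -> bool:
    """Combine per-modality decisions into the second modal detection result."""
    votes = list(results.values())
    if strategy is Fusion.ANY:
        return any(votes)
    return 2 * sum(votes) > len(votes)


# Hypothetical third to sixth modal results for one video.
modal_results = {"text": False, "organ": True, "action": False, "voice": False}
second_result_any = fuse_modal_results(modal_results, Fusion.ANY)
second_result_majority = fuse_modal_results(modal_results, Fusion.MAJORITY)
```

With four modalities, the majority rule requires at least three positive results, since two of four is exactly half and not more than half.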

For ease of understanding, fig. 4 shows a schematic application scenario of an embodiment of a method of detecting a target type video according to the present application.

As shown in fig. 4, the process of judging whether the video is a pornographic video includes:

(1) Extracting the audio and cutting frames from the video to obtain a wav file and N images.

(2) Inputting the video frames into a pornographic image classification model, which uses a neural network classification model to judge whether each frame image is a pornographic image, to obtain N results; the N results are then aggregated to obtain a result A of the whole video.

(3) Inputting the video frame into a pornographic organ detection model, wherein the pornographic organs are as follows: male genitalia, female genitalia, etc. The pornographic organ model adopts a neural network detection model to judge whether pornographic organs exist in each frame of image or not to obtain N results, and then the N results are aggregated to obtain a result B of the whole video.

(4) Inputting the video frames into an OCR model, recognizing the characters in the images, matching the characters against a dictionary containing pornographic characters, and judging whether each frame image contains pornographic characters to obtain N results; the N results are then aggregated to obtain a result C of the whole video.

(5) Extracting voice features from the wav file and inputting them into a pornographic voice classification model, which uses a neural network voice classification model, to obtain a result D of the whole video.

(6) Inputting the video frames into a pornographic action classification model, which uses a neural network classification model to judge whether each frame image shows a pornographic action, to obtain a result E of the whole video. The pornographic actions include: sexual intercourse, sexual feeling, etc.

(7) The five results A, B, C, D and E are merged into a final result in a manner that includes, but is not limited to: if any one of the results is pornographic, the whole video is judged to be a pornographic video; if all five results are normal, the whole video is judged to be a non-pornographic video.
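The flow of fig. 4 can be illustrated with stand-in models. In the sketch below every model is a plain callable supplied by the caller, per-frame results are aggregated with the "any frame" rule, and the names `VideoModels` and `judge_video` are hypothetical; a real system would plug in trained networks and an actual frame and audio extractor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Frame = bytes  # placeholder type for a decoded image (assumption)


@dataclass
class VideoModels:
    image_classifier: Callable[[Frame], bool]   # produces result A
    organ_detector: Callable[[Frame], bool]     # produces result B
    ocr_matcher: Callable[[Frame], bool]        # produces result C
    voice_classifier: Callable[[bytes], bool]   # produces result D
    action_classifier: Callable[[Frame], bool]  # produces result E


def judge_video(frames: Sequence[Frame], wav: bytes, models: VideoModels) -> Dict[str, bool]:
    """Compute results A to E and merge them: any positive result flags the video."""
    results = {
        "A": any(models.image_classifier(f) for f in frames),
        "B": any(models.organ_detector(f) for f in frames),
        "C": any(models.ocr_matcher(f) for f in frames),
        "D": models.voice_classifier(wav),
        "E": any(models.action_classifier(f) for f in frames),
    }
    results["final"] = any(results[key] for key in "ABCDE")
    return results


# Stub models: only the OCR matcher fires, and only on the second frame.
stub_models = VideoModels(
    image_classifier=lambda frame: False,
    organ_detector=lambda frame: False,
    ocr_matcher=lambda frame: frame == b"frame-1",
    voice_classifier=lambda wav: False,
    action_classifier=lambda frame: False,
)
outcome = judge_video([b"frame-0", b"frame-1"], b"wav-bytes", stub_models)
```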

With further reference to fig. 5, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for detecting a target type video, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.

As shown in fig. 5, the detection target type video apparatus 500 of the present embodiment may include: a first extraction module 501, a first classification module 502, a second extraction module 503, and a second classification module 504. The first extraction module 501 is configured to extract at least one image frame from a video to be detected; a first classification module 502 configured to input at least one image frame into a pre-trained target image classification model to obtain a classification result of a video to be detected; a second extraction module 503 configured to extract a digital audio file from the video to be detected in response to the classification result of the video to be detected not being the target type video; a second classification module 504 configured to input the digital audio file to a first target object classification model and/or input the at least one image frame to a second target object classification model, to obtain a second modal detection result of the video to be detected, wherein the first target object classification model is used for determining whether the digital audio file contains a first target object, the second target object classification model is used for determining whether the at least one image frame contains a second target object, and the first target object and the second target object are used for representing a target type video.

In the present embodiment, for the detection target type video apparatus 500, the detailed processing of the first extraction module 501, the first classification module 502, the second extraction module 503 and the second classification module 504, and the technical effects thereof, can be found in the related descriptions of steps 201 to 204 in the embodiment corresponding to fig. 2, and are not repeated herein.
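The way the four modules cooperate can be sketched as a cascade in which audio is extracted only when the first classification does not flag the video. The class name `TargetVideoDetector` and the four injected callables are hypothetical, and returning early on a positive first classification is an assumption drawn from the conditional wording of the second extraction module.

```python
from typing import Any, Callable, List


class TargetVideoDetector:
    """Hypothetical wrapper mirroring modules 501 to 504 with injected callables."""

    def __init__(
        self,
        extract_frames: Callable[[Any], List[bytes]],
        classify_frames: Callable[[List[bytes]], bool],
        extract_audio: Callable[[Any], bytes],
        classify_second_modality: Callable[[bytes, List[bytes]], bool],
    ) -> None:
        self.extract_frames = extract_frames
        self.classify_frames = classify_frames
        self.extract_audio = extract_audio
        self.classify_second_modality = classify_second_modality

    def detect(self, video: Any) -> bool:
        frames = self.extract_frames(video)       # first extraction module 501
        if self.classify_frames(frames):          # first classification module 502
            return True                           # assumed: no further checks needed
        audio = self.extract_audio(video)         # second extraction module 503
        return self.classify_second_modality(audio, frames)  # second classification module 504


audio_extractions: List[Any] = []


def fake_extract_audio(video: Any) -> bytes:
    """Stub extractor that records each call so the cascade can be observed."""
    audio_extractions.append(video)
    return b"wav-bytes"


detector = TargetVideoDetector(
    extract_frames=lambda video: [b"frame-0", b"frame-1"],
    classify_frames=lambda frames: False,
    extract_audio=fake_extract_audio,
    classify_second_modality=lambda audio, frames: audio == b"wav-bytes",
)
result = detector.detect("clip.mp4")
```

Deferring audio extraction in this way avoids the cost of decoding audio for videos that the image classifier already flags.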

In some optional implementations of this embodiment, the second target object is a target text, and the second classification module 504 is further configured to: recognize characters of at least one image frame by adopting an optical character recognition technology; match the recognized characters with a preset target character dictionary; and obtain a second modal detection result of the video to be detected according to the matching result.

In some optional implementations of this embodiment, the second target object is a target organ, and the second classification module 504 is further configured to: input at least one image frame into a pre-trained target organ detection model to obtain a second modal detection result of the video to be detected.

In some optional implementations of this embodiment, the second target object is a target action, and the second classification module 504 is further configured to: input at least one image frame into a pre-trained target action detection model to obtain a second modal detection result of the video to be detected.

In some optional implementations of this embodiment, the first target object is a target voice, and the second classification module 504 is further configured to: input the digital audio file into a pre-trained target voice detection model to obtain a second modal detection result of the video to be detected.

In some optional implementations of this embodiment, the second target object includes a target text, a target organ, a target action, the first target object includes a target voice, and the second classification module 504 is further configured to: inputting at least one image frame into a target character classification model to obtain a third modal detection result of the video to be detected; inputting at least one image frame into a pre-trained target organ detection model to obtain a fourth modal detection result of the video to be detected; inputting at least one image frame into a pre-trained target action detection model to obtain a fifth modal detection result of the video to be detected; inputting the digital audio file into a pre-trained target voice detection model to obtain a sixth modal detection result of the video to be detected; and aggregating the third modal detection result, the fourth modal detection result, the fifth modal detection result and the sixth modal detection result of the video to be detected to obtain a second modal detection result of the video to be detected.

In some optional implementations of this embodiment, aggregating the third modal detection result, the fourth modal detection result, the fifth modal detection result, and the sixth modal detection result of the video to be detected to obtain the second modal detection result of the video to be detected includes: in response to the third modal detection result, the fourth modal detection result, the fifth modal detection result, or the sixth modal detection result of the video to be detected being the target type video, determining that the second modal detection result of the video to be detected is the target type video.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the method of detecting a target type video. For example, in some embodiments, the method of detecting the target type video may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method of detecting the target type video described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method of detecting the target type video in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method of detecting a target type video, comprising:

extracting at least one image frame from a video to be detected;

inputting the at least one image frame into a pre-trained target image classification model to obtain a classification result of the video to be detected;

in response to the fact that the classification result of the video to be detected is not the target type video, extracting a digital audio file from the video to be detected;

and inputting the digital audio file into a first target object classification model and/or inputting the at least one image frame into a second target object classification model to obtain a second modal detection result of the video to be detected, wherein the first target object classification model is used for determining whether the digital audio file contains a first target object, the second target object classification model is used for determining whether the at least one image frame contains a second target object, and the first target object and the second target object are used for representing a target type video.

2. The method of claim 1, wherein the second target object is a target text, and the inputting the at least one image frame into a second target object classification model to obtain a second modal detection result of the video to be detected comprises:

recognizing characters of the at least one image frame by adopting an optical character recognition technology;

matching the recognized characters with a preset target character dictionary;

and obtaining a second mode detection result of the video to be detected according to the matching result.

3. The method of claim 1, wherein the second target object is a target organ, and the inputting the at least one image frame into a second target object classification model to obtain a second modal detection result of the video to be detected comprises:

and inputting the at least one image frame into a pre-trained target organ detection model to obtain a second mode detection result of the video to be detected.

4. The method of claim 1, wherein the second target object is a target action, and the inputting the at least one image frame into a second target object classification model to obtain a second modal detection result of the video to be detected comprises:

and inputting the at least one image frame into a pre-trained target action detection model to obtain a second mode detection result of the video to be detected.

5. The method of claim 1, wherein the first target object is a target voice, and the inputting the digital audio file into the first target object classification model to obtain the second modal detection result of the video to be detected comprises:

and inputting the digital audio file into a pre-trained target voice detection model to obtain a second mode detection result of the video to be detected.

6. The method of claim 1, the second target object comprising a target text, a target organ, a target action, the first target object comprising a target voice, the inputting the digital audio file to a first target object classification model and the inputting the at least one image frame to a second target object classification model, the obtaining a second modal detection result of the video to be detected comprising:

inputting the at least one image frame into a target character classification model to obtain a third modal detection result of the video to be detected;

inputting the at least one image frame into a pre-trained target organ detection model to obtain a fourth modal detection result of the video to be detected;

inputting the at least one image frame into a pre-trained target action detection model to obtain a fifth modal detection result of the video to be detected;

inputting the digital audio file into a pre-trained target voice detection model to obtain a sixth modal detection result of the video to be detected;

and aggregating the third modal detection result, the fourth modal detection result, the fifth modal detection result and the sixth modal detection result of the video to be detected to obtain a second modal detection result of the video to be detected.

7. The method according to claim 6, wherein the aggregating the third modality detection result, the fourth modality detection result, the fifth modality detection result, and the sixth modality detection result of the video to be detected to obtain the second modality detection result of the video to be detected includes:

in response to the third modal detection result, the fourth modal detection result, the fifth modal detection result, or the sixth modal detection result of the video to be detected being the target type video, determining that the second modal detection result of the video to be detected is the target type video.

8. An apparatus for detecting a target type video, the apparatus comprising:

a first extraction module configured to extract at least one image frame from a video to be detected;

a first classification module configured to input the at least one image frame into a pre-trained target image classification model to obtain a classification result of the video to be detected;

a second extraction module configured to extract a digital audio file from the video to be detected in response to the classification result of the video to be detected not being a target type video;

a second classification module configured to input the digital audio file to a first target object classification model and/or input the at least one image frame to a second target object classification model, to obtain a second modal detection result of the video to be detected, wherein the first target object classification model is used for determining whether the digital audio file contains a first target object, the second target object classification model is used for determining whether the at least one image frame contains a second target object, and the first target object and the second target object are used for representing a target type video.

9. The apparatus of claim 8, wherein the second target object is a target word, the second classification module further configured to:

matching the recognized characters with a preset target character dictionary;

10. The apparatus of claim 8, wherein the second target object is a target organ, the second classification module further configured to:

11. The apparatus of claim 8, wherein the second target object is a target action, the second classification module further configured to:

12. The apparatus of claim 8, wherein the first target object is target speech, the second classification module further configured to:

13. The apparatus of claim 8, wherein the second target object comprises a target text, a target organ, a target action, the first target object comprises a target voice, the second classification module is further configured to:

14. The apparatus according to claim 13, wherein the aggregating the third modality detection result, the fourth modality detection result, the fifth modality detection result, and the sixth modality detection result of the video to be detected to obtain the second modality detection result of the video to be detected includes:

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.