CN110309795B - Video detection method, device, electronic equipment and storage medium

Video detection method, device, electronic equipment and storage medium

Info

Publication number
CN110309795B
Authority
CN
China
Prior art keywords
video
sample
sample frame
foreground object
image
Prior art date
Legal status
Active
Application number
CN201910600863.5A
Other languages
Chinese (zh)
Other versions
CN110309795A (en)
Inventor
吴韬
徐敘遠
龚国平
杨喻茸
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910600863.5A
Publication of CN110309795A
Application granted
Publication of CN110309795B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video detection method, a video detection device, an electronic device and a storage medium. The method comprises the following steps: extracting a plurality of sample frame images from a video to be detected; detecting foreground objects in each sample frame image and determining a foreground object area in each sample frame image; performing feature extraction on each sample frame image based on the determined foreground object areas to obtain foreground object features of each sample frame image; performing similarity matching between the foreground object features of each sample frame image and foreground object features in a feature library to obtain a matching result; and determining, based on the matching result, the video in the video library that is the same as the video to be detected. In this way, the efficiency and accuracy of video detection can be improved.

Description

Video detection method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a video detection method, a video detection device, an electronic device, and a storage medium.
Background
In the related art, video detection, such as video copyright detection, is mainly based on the complete picture features of the frame images of a video (i.e., both the foreground features and the background features of a frame image). The processing is complicated and the detection efficiency is low. Moreover, this detection approach depends heavily on the background features of an image; for example, false detection easily occurs when the backgrounds of images are similar but the foregrounds differ slightly.
Disclosure of Invention
The embodiment of the invention provides a video detection method, a video detection device, electronic equipment and a storage medium, which can improve the efficiency and accuracy of video detection.
The embodiment of the invention provides a video detection method, which comprises the following steps:
extracting a plurality of sample frame images from a video to be detected;
respectively detecting foreground objects in each sample frame image, and determining foreground object areas in each sample frame image;
respectively extracting features of each sample frame image based on the determined foreground object region to obtain foreground object features of each sample frame image;
respectively carrying out similarity matching on foreground object features of each sample frame image and foreground object features in a feature library to obtain a matching result;
and searching the video which is the same as the video to be detected in a video library based on the matching result.
The embodiment of the invention also provides a video detection device, which comprises:
the frame extraction unit is used for extracting a plurality of sample frame images from the video to be detected;
the detection unit is used for respectively detecting foreground objects in the sample frame images and determining foreground object areas in the sample frame images;
The extraction unit is used for respectively carrying out feature extraction on each sample frame image based on the determined foreground object area to obtain foreground object features of each sample frame image;
the matching unit is used for respectively matching the foreground object characteristics of each sample frame image with the foreground object characteristics in the characteristic library in a similarity manner to obtain a matching result;
and the determining unit is used for searching the video which is the same as the video to be detected in the video library based on the matching result.
In the above scheme, the detection unit is further configured to input each of the sample frame images to a Single Shot MultiBox Detector (SSD), and output the corresponding sample frame image carrying at least one foreground object box;
and determine the image area contained in a foreground object box in the output sample frame image as the foreground object area.
In the above scheme, the device further includes:
and the screening unit is used for screening the foreground object areas based on the integrity of the foreground objects in the foreground object area or the proportion of the foreground object area to the corresponding sample frame image to obtain at least one foreground object area with the integrity or the proportion meeting the preset condition.
In the above aspect, the extracting unit includes:
the intercepting unit is used for intercepting the foreground object area of each sample frame image based on the foreground object area to obtain an area image of each sample frame image;
and the characteristic unit is used for respectively carrying out characteristic extraction on the regional images of the sample frame images to obtain the foreground object characteristics of the sample frame images.
In the above scheme, the feature unit is further configured to perform shallow feature extraction on the area image of each sample frame image, so as to obtain shallow features representing structural information of the area image;
respectively carrying out deep feature extraction on the region images of each sample frame image to obtain deep features representing semantic information of the region images;
and respectively carrying out weighted fusion on the shallow layer features and the deep layer features of each sample frame image to obtain the foreground object features of each sample frame image.
In the above scheme, the feature unit is further configured to perform shallow feature extraction on the area image of the sample frame image through a shallow network model included in the feature extraction model, so as to obtain shallow features that characterize structural information of the area image;
Deep feature extraction is carried out on the regional images of each sample frame image through a deep network model included in the feature extraction model, so that deep features representing semantic information of the regional images are obtained;
and carrying out weighted fusion on the shallow layer features and the deep layer features through an attention model included in the feature extraction model to obtain foreground object features of the sample frame image.
In the above scheme, the feature unit is further configured to perform feature extraction on the first sample image, the second sample image, and the third sample image through the feature extraction model, so as to obtain foreground object features of the first sample image, the second sample image, and the third sample image;
wherein the first sample image and the second sample image are of the same class, and the first sample image and the third sample image are of different classes;
determining a value of a triplet loss function of the feature extraction model based on the foreground object features of the first sample image, the foreground object features of the second sample image, and the foreground object features of the third sample image;
and updating model parameters of the feature extraction model based on the values of the triplet loss function.
In the above scheme, the determining unit is further configured to determine, when the number of images satisfying the matching condition in the plurality of sample frame images represented by the matching result reaches a number threshold, a video corresponding to the matching result in the video library, where the video corresponding to the matching result is the same video as the video to be detected;
the sample frame image meeting the matching condition is a sample frame image with similarity between the foreground object features and the foreground object features in the feature library reaching a similarity threshold.
The embodiment of the invention also provides electronic equipment, which comprises:
a memory for storing executable instructions;
and the processor is used for realizing the video detection method provided by the embodiment of the invention when executing the executable instructions stored in the memory.
The embodiment of the invention also provides a storage medium which stores executable instructions that, when executed by a processor, implement the video detection method provided by the embodiments of the present invention.
The application of the embodiment of the invention has the following beneficial effects:
The foreground object features of each sample frame image are subjected to similarity matching with the foreground object features in the feature library to obtain a matching result, and the video in the video library that is the same as the video to be detected is determined based on the matching result. That is, the detection of videos in the video library is based on the features of the foreground objects in the sample frame images of the video to be detected, without relying on the background features of the images; therefore, on the basis of improving detection efficiency, false detection in the case of similar picture backgrounds but slightly different foregrounds is avoided, and the accuracy of video detection is improved.
Drawings
Fig. 1 is a schematic diagram of a video detection system 100 according to an embodiment of the present invention;
fig. 2 is a schematic hardware structure of a server according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an implementation scenario of video copyright detection according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a redundant video management implementation scenario provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of an implementation scenario of video recommendation according to an embodiment of the present invention;
fig. 6 is a schematic flow chart of a video detection method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative data structure of a video according to an embodiment of the present invention;
fig. 8 is a schematic diagram of human body detection by a single shot multibox detector (SSD) according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a feature extraction model according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of training targets corresponding to a triplet loss function according to an embodiment of the present invention;
fig. 11 is a schematic flow chart of a video detection method according to an embodiment of the present invention;
fig. 12 is a flowchart of a video detection method according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a video frame image with the same background but different foreground persons according to an embodiment of the present invention;
Fig. 14 is a schematic structural diagram of a video detection device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", and the like are merely used to distinguish between similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", or the like may be interchanged with one another, if permitted, to enable embodiments of the invention described herein to be practiced otherwise than as illustrated or described herein.
In this application, when the above embodiments are applied to specific products or technologies, the relevant data collection, use and processing should comply with national legal and regulatory requirements; before face information is collected, the information processing rules should be disclosed and the individual consent of the target object should be obtained; face information should be processed in strict compliance with legal and regulatory requirements and personal information processing rules, and technical measures should be taken to ensure the security of the relevant data.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before describing embodiments of the present invention in further detail, the terms and terminology involved in the embodiments of the present invention will be described, and the terms and terminology involved in the embodiments of the present invention will be used in the following explanation.
1) Foreground object: each image comprises a foreground and a background; the content in the image that is close to the camera is the foreground, and objects in the foreground, such as people or animals, are foreground objects.
Fig. 1 is a schematic diagram of an alternative architecture of a video detection system 100 according to an embodiment of the present invention, referring to fig. 1, in order to support an exemplary application, a terminal (including a terminal 400-1 and a terminal 400-2) is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of both, and a wireless or wired link is used to implement data transmission.
A terminal (e.g. terminal 400-1) configured to send a video detection request to the server 200, where the video detection request carries a video to be detected;
The server 200 is configured to extract a plurality of sample frame images from a video to be detected based on a video detection request, detect foreground objects in each sample frame image respectively, determine foreground object areas in each sample frame image, perform feature extraction on each sample frame image based on the determined foreground object areas to obtain foreground object features of each sample frame image, perform similarity matching on the foreground object features of each sample frame image and the foreground object features in the feature library respectively to obtain a matching result, search for a video identical to the video to be detected in the video library based on the matching result, and return the search result to the terminal;
here, in practical application, the server 200 may be one server supporting various services configured separately, or may be configured as a server cluster.
The terminal (terminal 400-1 and/or terminal 400-2) is further configured to display the search result.
In practical applications, the terminal may be a smart phone, a tablet computer, a notebook computer, a wearable computing device, a Personal Digital Assistant (PDA), a desktop computer, a cellular phone, a media player, a navigation device, a game console, a television, etc., or a combination of any two or more of these data processing devices or other data processing devices.
In some embodiments, a video playing client is provided on a terminal, a user can play a video online, upload and download the video, and the like through the video playing client, by way of example, the user uploads the video (i.e., the video to be detected) through the video playing client, the video playing client sends an uploading request carrying the video to be detected to a server, the server analyzes the uploading request sent by the video playing client to obtain the video to be detected, extracts a plurality of sample frame images from the video to be detected, respectively detects foreground objects in each sample frame image, determines a foreground object area in each sample frame image, respectively performs feature extraction on each sample frame image based on the determined foreground object area, obtains foreground object features of each sample frame image, respectively performs similarity matching on the foreground object features of each sample frame image and the foreground object features in a feature library, obtains a matching result, searches the video which is the same as the video to be detected in the video library based on the matching result, and returns a corresponding searching result.
An electronic device implementing the video detection method according to the embodiment of the present invention will be described below. In some embodiments, the electronic device may be a terminal, and may also be a server. In the embodiment of the invention, the electronic equipment is taken as a server as an example, and the hardware structure of the server is described in detail.
Fig. 2 is a schematic hardware structure of a server according to an embodiment of the present invention, and it may be understood that fig. 2 only illustrates an exemplary structure of the server, but not all the structures, and some or all of the structures illustrated in fig. 2 may be implemented as required. Referring to fig. 2, a server provided in an embodiment of the present invention includes: at least one processor 201, a memory 202, a user interface 203, and at least one network interface 204. The various components in the server are coupled together by a bus system 205. It is understood that the bus system 205 is used to enable connected communications between these components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.
It will be appreciated that the memory 202 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory.
The memory 202 in the embodiments of the present invention is used to store various types of data to support the operation of the server. Examples of such data include any executable instructions for operating on the server, and the executable instructions for implementing the methods of the embodiments of the present invention may be included among them.
The video detection method disclosed in the embodiment of the invention can be implemented by the processor 201. The processor 201 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the video detection method may be performed by integrated logic circuitry of hardware in the processor 201 or instructions in the form of software. The processor 201 may be a general purpose processor, a digital signal processor (DSP, digital Signal Processor), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 201 may implement or perform the methods, steps and logic blocks disclosed in embodiments of the present invention. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiment of the invention can be directly embodied in the hardware of the decoding processor or can be implemented by combining hardware and software modules in the decoding processor. The software module may be located in a storage medium, where the storage medium is located in the memory 202, and the processor 201 reads information in the memory 202, and in combination with its hardware, performs the steps of the video detection method provided by the embodiment of the invention.
Based on the above description of the video detection system and the electronic device according to the embodiments of the present invention, an application scenario or field of the video detection method provided by the embodiments of the present invention is described next, and it should be noted that the video detection method according to the embodiments of the present invention is not limited to the scenario or field mentioned below:
1. detecting video copyright;
fig. 3 is a schematic diagram of a scenario for implementing video copyright detection according to an embodiment of the present invention, and next, with reference to fig. 1 and fig. 3, a scenario in which the video copyright detection method according to an embodiment of the present invention is applied to video copyright detection will be described.
Taking the terminal as the terminal 400-1 in fig. 1 as an example, a video playing client is arranged on the terminal, a background server corresponding to the video playing client is the server 200 in fig. 1, a user uploads a video (such as an a film) through the video playing client, and the video playing client sends an uploading request carrying the a film to the background server;
the video library of the background server stores a plurality of videos with copyright attribution attributes (such as copyright attribution is a user of a video playing client, namely a video publisher, or copyright attribution is a playing platform), and the feature library stores foreground object features of each video in the corresponding video library; the background server extracts a plurality of sample frame images from the A film based on the uploading request, respectively detects foreground objects in each sample frame image, determines foreground object areas in each sample frame image, respectively performs feature extraction on each sample frame image based on the determined foreground object areas to obtain foreground object features of each sample frame image, respectively performs similarity matching on the foreground object features of each sample frame image and the foreground object features in the feature library, searches for videos identical to the A film in the video library based on the matching result, and returns corresponding information to the terminal based on the searching result, for example, if the videos identical to the A film are searched for in the video library, returns information for prohibiting uploading to the terminal; if the video which is the same as the A film is not found in the video library, returning the information of successful uploading to the terminal.
The video detection method of the embodiment of the invention can effectively provide copyright protection for the video and effectively maintain the rights and interests of video uploaders and play platforms.
2. Redundant video management
Fig. 4 is a schematic diagram of an implementation scenario of redundant video management according to an embodiment of the present invention, and next, a scenario in which the video detection method according to an embodiment of the present invention is applied to redundant video management will be described with reference to fig. 1 and fig. 4.
Taking the first terminal as the terminal 400-1 in fig. 1, the first terminal is provided with a video playing client, the second terminal is the terminal 400-2 in fig. 1, and the background server corresponding to the video playing client is the server 200 in fig. 1 as an example, where the first terminal faces the video viewer, and the second terminal faces the manager of the video playing client, and in some embodiments, the manager may also play video through the video playing client set on the second terminal.
In actual implementation, the second terminal is provided with management software (such as a management client), and a manager can manage resources of the corresponding video playing client stored on the background server through a user interface provided by a management tool.
In some embodiments, the second terminal sends a repetitive query request carrying a video to be detected to the background server, the background server extracts a plurality of sample frame images from the video to be detected, respectively detects foreground objects in each sample frame image, determines foreground object areas in each sample frame image, respectively performs feature extraction on each sample frame image based on the determined foreground object areas to obtain foreground object features of each sample frame image, respectively performs similarity matching on the foreground object features of each sample frame image and the foreground object features in the feature library, searches for the video identical to the video to be detected in the video library based on the matching result, and returns a corresponding search result, for example, if the video identical to the video to be detected is searched, returns corresponding video information (such as a video identifier and a video name) to the second terminal, and if the video identical to the video to be detected is not searched, returns information of the video not searched to the second terminal;
The manager can perform corresponding processing based on the search result returned by the background server, for example, if the second terminal receives the video information which is returned by the background server and is the same as the video to be detected, the manager can delete the video based on the video information, so that the occupation of the storage space of the background server can be reduced, and the stock video of the video playing platform can be purified.
3. Video recommendation
Fig. 5 is a schematic diagram of an implementation scenario of video recommendation provided by an embodiment of the present invention, and next, with reference to fig. 1 and fig. 5, a description is given of a scenario of application of a video detection method of an embodiment of the present invention to video recommendation.
Taking the terminal as the terminal 400-1 in fig. 1 as an example, a video playing client is provided on the terminal, and a background server corresponding to the video playing client is the server 200 in fig. 1, so that a user can watch video through the video playing client.
In some embodiments, the background server may perform video recommendation through the video playing client, where the background server is provided with a video library and a feature library, where the video library stores video recommended in a period of time, and the feature library stores foreground object features corresponding to each video in the video library.
Before video recommendation is carried out, a background server extracts a plurality of sample frame images from videos to be recommended, respectively detects foreground objects in each sample frame image, determines foreground object areas in each sample frame image, respectively carries out feature extraction on each sample frame image based on the determined foreground object areas to obtain foreground object features of each sample frame image, respectively carries out similarity matching on the foreground object features of each sample frame image and the foreground object features in a feature library, searches for videos identical to the videos to be recommended in the video library based on a matching result, and judges whether to recommend the videos to be recommended or not based on the search result, for example, if the videos identical to the videos to be recommended are found, recommendation is not carried out on the videos, so that the filtration of recommended videos can be realized, and repeated recommendation of the videos is avoided; if the video which is the same as the video to be recommended is not found, pushing the video to a video playing client to recommend the video.
Next, a video detection method provided by an embodiment of the present invention is described, and fig. 6 is a schematic flow chart of the video detection method provided by the embodiment of the present invention, in some embodiments, the video detection method may be implemented by a server or a terminal, or implemented by the server and the terminal cooperatively, for example, implemented by a server, for example, implemented by the server 200 in fig. 1, and in combination with fig. 1 and fig. 6, the video detection method provided by the embodiment of the present invention includes:
Step 601: the server extracts a plurality of sample frame images from the video to be detected.
In practical applications, the video to be detected may be a complete video, such as a complete movie file, or a video clip, such as a clip of a movie.
In practical implementation, the server may take a plurality of frames from the video to be detected, for example, randomly take a preset number of sample frame images from the video to be detected.
In some embodiments, sample frame images may be extracted from the video according to the data structure of the video. Fig. 7 is a schematic diagram of an alternative data structure of a video according to an embodiment of the present invention. Referring to fig. 7, video data can be structurally divided into four layers: movie, scene, shot, and frame. Visually continuous video is formed by continuously presenting still images on a screen or display, where each still image is a video frame. During video shooting, the footage captured continuously and without interruption by a camera is called a shot; the shot is the basic unit of video data. A plurality of shots with similar content, describing the same event from different angles, form a scene, and a movie consists of a plurality of scenes and describes a complete story.
Based on the data structure of the video to be detected, in some embodiments, the server may also implement the extraction of the sample image frames by: performing shot switching detection on a video frame of the video to be detected to obtain a plurality of shots corresponding to the video to be detected; and respectively extracting sample frames from the video frames corresponding to the shots to obtain a plurality of sample frame images.
Here, shot-cut detection will be described. In practical applications, shot-cut detection uses the characteristics exhibited when a shot is switched to find the switching positions, so as to divide the whole video into individual shots.
For example, shot cut detection of a video to be detected may be achieved by:
and calculating the difference degree of the pixel points at the same position in adjacent video frames of the video to be detected by adopting an inter-frame pixel point matching method, determining the number of the pixel points, the difference degree of which exceeds a first difference threshold value, in the adjacent two video frames, and determining that shot switching occurs between the two video frames when the number reaches a preset number threshold value.
In practical application, each shot corresponds to a plurality of video frames, and in practical implementation, sample frames can be extracted from the video frames corresponding to the shots in the following manner: uniformly extracting a preset number of sample frames from the plurality of video frames corresponding to the shot; for example, if a shot corresponds to 100 video frames, then starting from the 1st video frame, one sample frame is extracted every 4 video frames, that is, 25 sample frames are extracted, so as to obtain 25 sample frame images. A sketch of this shot-based sampling is given below.
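As an illustration only, the following Python sketch shows one way such shot-cut detection and per-shot uniform sampling could be implemented with OpenCV; the thresholds, the number of samples per shot, and the function name extract_sample_frames are illustrative assumptions rather than values fixed by the patent.

```python
# Sketch: shot-cut detection by inter-frame pixel matching, then uniform per-shot sampling.
import cv2
import numpy as np

PIXEL_DIFF_THRESH = 30   # assumed first difference threshold for a single pixel
PIXEL_COUNT_RATIO = 0.5  # assumed fraction of changed pixels that signals a shot cut
FRAMES_PER_SHOT = 25     # assumed number of sample frames taken uniformly per shot

def extract_sample_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames, cuts = [], [0]          # all frames are kept in memory for simplicity of the sketch
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # count pixels whose absolute difference to the previous frame exceeds the threshold
            changed = np.count_nonzero(cv2.absdiff(gray, prev_gray) > PIXEL_DIFF_THRESH)
            if changed > PIXEL_COUNT_RATIO * gray.size:
                cuts.append(len(frames))      # a new shot starts at this frame index
        frames.append(frame)
        prev_gray = gray
    cap.release()
    cuts.append(len(frames))

    samples = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        if end <= start:
            continue
        # uniformly pick up to FRAMES_PER_SHOT indices inside the shot
        idx = np.linspace(start, end - 1, num=min(FRAMES_PER_SHOT, end - start), dtype=int)
        samples.extend(frames[i] for i in idx)
    return samples
```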
Based on the data structure of the video to be detected, in some embodiments, the server may also implement the extraction of the sample image frames by: performing scene switching detection on video frames of the video to be detected to obtain a plurality of scenes corresponding to the video to be detected; and respectively extracting sample frames from the video frames corresponding to each scene to obtain a plurality of sample frame images.
Here, in practical application, the scene switching detection of the video to be detected may be implemented by:
and calculating the histogram difference degree of adjacent video frames of the video to be detected, and determining that scene switching occurs between two frames of video frames of which the histogram difference degree reaches a second difference threshold value.
In practical application, each scene corresponds to a plurality of video frames, and in practical implementation, sample frames can be extracted from the video frames corresponding to the scenes by the following manner: and uniformly extracting a preset number of sample frames from a plurality of video frames corresponding to the scene.
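A minimal sketch of the histogram-based scene-switch check described above is given below, again assuming OpenCV; the histogram configuration and the second difference threshold value are illustrative assumptions.

```python
# Sketch: scene-switch detection by comparing colour histograms of adjacent frames.
import cv2

HIST_DIFF_THRESH = 0.4   # assumed second difference threshold on (1 - correlation)

def is_scene_cut(frame_a, frame_b):
    hists = []
    for frame in (frame_a, frame_b):
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hists.append(cv2.normalize(hist, hist).flatten())
    # correlation close to 1 means similar frames; low correlation suggests a scene switch
    correlation = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return (1.0 - correlation) > HIST_DIFF_THRESH
```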
Step 602: and respectively detecting foreground objects in each sample frame image, and determining a foreground object area in each sample frame image.
In some embodiments, the foreground object region in each sample frame image may be determined by:
respectively inputting each sample frame image to a Single Shot MultiBox Detector (SSD), and outputting the corresponding sample frame image carrying at least one foreground object box; and determining the image area contained in a foreground object box in the output sample frame image as a foreground object area.
In practical applications, the foreground object may be a person or an animal, and the foreground object is described as a person.
The sample frame image is input to the SSD, and the SSD performs human body detection on the sample frame image, that is, identifies the human body in the sample frame image and determines the human body region. Fig. 8 is a schematic diagram of human body detection by the SSD according to an embodiment of the present invention; referring to fig. 8, the image regions contained in box 1 and box 2 are human body regions.
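For illustration, the following sketch uses torchvision's off-the-shelf SSD model as a stand-in for the human body detection network described here; the pretrained weights, the COCO label id 1 for "person", and the score threshold are assumptions, not specifics from the patent.

```python
# Sketch: person detection with an off-the-shelf SSD, returning bounding boxes.
import torch
import torchvision

detector = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT").eval()

def detect_person_boxes(image_tensor, score_thresh=0.5):
    """image_tensor: float tensor of shape (3, H, W) scaled to [0, 1]."""
    with torch.no_grad():
        output = detector([image_tensor])[0]
    boxes = []
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        if label.item() == 1 and score.item() >= score_thresh:  # keep person detections only
            boxes.append(box.tolist())                          # [x1, y1, x2, y2]
    return boxes
```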
In practical applications, a plurality of foreground objects may exist in the sample frame image, however, not all detected foreground objects have application values, so when a plurality of foreground object areas exist in the sample frame image through SSD detection, the detected foreground object areas need to be screened, and in some embodiments, the screening of the foreground object areas may be achieved by:
and screening the foreground object areas based on the integrity of the foreground objects in the foreground object areas or the proportion of the foreground object areas to the corresponding sample frame images to obtain at least one foreground object area with the integrity or the proportion meeting the preset condition.
Here, a human body is taken as an example of a foreground object. In practical applications, for a human body whose identity is unimportant in a video, the following situations may occur in the captured frame images: some detected human bodies are incomplete due to occlusion and the like, for example the head area is missing; or, because the person is far from the camera, the detected human body area occupies only a small proportion of the picture, as shown in fig. 8, where the human body area corresponding to box 1 occupies a small proportion of the sample frame image. In practical implementations, corresponding thresholds can be set for the integrity of the human body and for the proportion of the human body area in the sample frame image, so as to filter out human bodies whose integrity is smaller than the integrity threshold and human bodies whose area proportion is smaller than the proportion threshold.
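The area-ratio part of this screening can be sketched as follows; the MIN_AREA_RATIO value is an assumed example, and the completeness check is left as a placeholder predicate because it depends on additional cues (e.g., whether a head region is present).

```python
# Sketch: keep only foreground boxes that are large enough relative to the frame and complete.
MIN_AREA_RATIO = 0.05   # assumed proportion threshold

def screen_boxes(boxes, frame_width, frame_height, is_complete=lambda box: True):
    frame_area = float(frame_width * frame_height)
    kept = []
    for x1, y1, x2, y2 in boxes:
        area_ratio = ((x2 - x1) * (y2 - y1)) / frame_area
        if area_ratio >= MIN_AREA_RATIO and is_complete((x1, y1, x2, y2)):
            kept.append((x1, y1, x2, y2))
    return kept
```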
Step 603: and respectively extracting the characteristics of each sample frame image based on the determined foreground object area to obtain the foreground object characteristics of each sample frame image.
In some embodiments, the foreground object features of each sample frame image may be obtained by:
based on the determined foreground object areas, cropping the foreground object area from each sample frame image to obtain a region image of each sample frame image; and performing feature extraction on the region image of each sample frame image to obtain the foreground object features of each sample frame image.
In some embodiments, feature extraction may be performed on the cropped region image to obtain the foreground object features of the sample frame image by:
shallow feature extraction is carried out on the regional image of the sample frame image, so that shallow features of structural information representing the regional image are obtained; deep feature extraction is carried out on the regional image of the sample frame image, so that deep features representing semantic information of the regional image are obtained; and carrying out weighted fusion on the shallow layer features and the deep layer features of the sample frame image to obtain the foreground object features of the sample frame image.
Here, a weighted fusion of shallow features and deep features will be described. In practical implementation, the weighting processing of the shallow features and the deep features can be realized according to the following formula:
F = α · F_shallow + (1 − α) · F_deep    (1)
wherein F_shallow denotes the shallow feature, F_deep denotes the deep feature, F denotes the fused foreground object feature, and α is a weighting coefficient, a preset constant that can be set according to actual conditions.
In some embodiments, the feature extraction model obtained by training may be used to perform feature extraction on the region image obtained by capturing, so as to obtain the foreground object feature of the sample frame image. For example, fig. 9 is a schematic structural diagram of a feature extraction model provided in an embodiment of the present invention, referring to fig. 9, the feature extraction model may include a shallow network model (i.e. shallow layer), a deep network model (i.e. deep layer), and an attention model (i.e. attention layer); the shallow network model is used for extracting shallow features of the regional image of the sample frame image to obtain shallow features of structural information representing the regional image; the deep network model is used for extracting deep features of the regional images of the sample frame images to obtain deep features of semantic information representing the regional images; and the attention model is used for carrying out weighted fusion on the shallow layer characteristics and the deep layer characteristics to obtain foreground object characteristics of the sample frame image.
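A minimal PyTorch sketch of such a feature extraction model is shown below: a shallow branch producing structural features, a deep branch producing semantic features, and an attention layer that outputs the fusion weight. All layer sizes and the class name are illustrative assumptions, and here the attention layer learns the weight instead of using the preset constant of formula (1).

```python
# Sketch: shallow branch + deep branch + attention-weighted fusion of the two feature vectors.
import torch
import torch.nn as nn

class ForegroundFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.shallow = nn.Sequential(              # few conv layers -> structural (shallow) features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.deep = nn.Sequential(                 # deeper stack -> semantic (deep) features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, feat_dim),
        )
        self.attention = nn.Sequential(            # predicts the fusion weight alpha of formula (1)
            nn.Linear(feat_dim * 2, 1), nn.Sigmoid(),
        )

    def forward(self, region_image):
        f_shallow = self.shallow(region_image)
        f_deep = self.deep(region_image)
        alpha = self.attention(torch.cat([f_shallow, f_deep], dim=1))
        return alpha * f_shallow + (1 - alpha) * f_deep   # weighted fusion as in formula (1)
```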
The training of the feature extraction model is described. In some embodiments, training of the feature extraction model may be accomplished by:
respectively carrying out feature extraction on the first sample image, the second sample image and the third sample image through a feature extraction model to obtain foreground object features of the first sample image, the second sample image and the third sample image; wherein the categories of the first sample image and the second sample image are the same, and the categories of the first sample image and the third sample image are different; determining a value of a triplet loss function of the feature extraction model based on the foreground object feature of the first sample image, the foreground object feature of the second sample image, and the foreground object feature of the third sample image; based on the values of the triplet loss function, model parameters of the feature extraction model are updated.
In some embodiments, the model parameters of the feature extraction model may be updated by:
and judging whether the value of the triple loss function exceeds a preset threshold value or not based on the value of the triple loss function, determining an error signal of the feature extraction model based on the triple loss function when the value of the triple loss function exceeds the preset threshold value, reversely transmitting the error signal in the feature extraction model, and updating model parameters of each layer in the transmission process.
Back propagation is described here. Training sample data (i.e., a sample image) is input to the shallow layer (i.e., the shallow network model) of the feature extraction model, passes through the intermediate layers, the deep layer (i.e., the deep network model) and the attention layer (i.e., the attention model), and finally reaches the output layer, which outputs a result; this is the forward propagation process of the feature extraction model. Since the output result of the feature extraction model has an error with respect to the actual result, the error between the output result and the actual value is calculated and back propagated from the output layer through the attention layer until it reaches the shallow layer; during back propagation, the values of the model parameters are adjusted according to the error. The above process is iterated until convergence.
A triplet loss function is illustrated. Fig. 10 is a schematic diagram of a training target corresponding to a triple loss function provided in an embodiment of the present invention, referring to fig. 10, a sample image is randomly selected from a training sample set as a first sample image, denoted as a, then a sample image with the same class as the first sample image is randomly selected as a second sample image, denoted as p, and then a sample image with a different class as the first sample image is randomly selected as a third sample image, denoted as n, thereby forming an (a, p, n) triple.
For each element (sample) in the triplet, feature extraction is performed through the feature extraction model to obtain the feature expressions of the three elements, denoted f(a), f(p) and f(n) respectively. The goal of the triplet loss function is to make the distance between f(a) and f(p) as small as possible, to make the distance between f(a) and f(n) as large as possible, and to keep a minimum margin between the distance from f(a) to f(p) and the distance from f(a) to f(n). In some embodiments, the triplet loss function L is as follows:
L = max( d(f(a), f(p)) − d(f(a), f(n)) + margin, 0 )    (2)
wherein d(f(a), f(p)) denotes the Euclidean distance between f(a) and f(p), d(f(a), f(n)) denotes the Euclidean distance between f(a) and f(n), and margin is a constant that can be set according to actual needs.
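A minimal sketch of the triplet loss of formula (2), together with one illustrative parameter-update step, is shown below; the margin value and the optimizer settings are assumptions.

```python
# Sketch: triplet loss over anchor/positive/negative feature batches, plus one update step.
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    d_pos = F.pairwise_distance(f_a, f_p)   # Euclidean distance anchor-positive
    d_neg = F.pairwise_distance(f_a, f_n)   # Euclidean distance anchor-negative
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# One training step, assuming `model` is the feature extraction model and a/p/n are image batches:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = triplet_loss(model(a), model(p), model(n))
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```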
A training sample set of feature extraction models is described. In practical implementation, the training sample set may include at least two subsets, each subset including a plurality of sample images, wherein the similarity between the sample images in each subset reaches a preset first similarity threshold, and the similarity between the sample images in different subsets is smaller than a preset second similarity threshold, that is, the sample images in the same subset are the same in category, and the sample images in different subsets are different in category.
Next, a training sample set of the feature extraction model will be described using a foreground object as an example of a human body. In practical implementation, human body areas in a plurality of video frame images included in a video are detected, the human body similarity of adjacent frame images is judged, and human body area images with similarity reaching a first similarity threshold are classified as similar sample images.
Step 604: and respectively carrying out similarity matching on the foreground object features of each sample frame image and the foreground object features in the feature library to obtain a matching result.
In practical applications, a video library and a feature library are provided on the server. A plurality of videos are stored in the video library, and the foreground object features corresponding to the videos in the video library are stored in the feature library; for a specific video in the video library, the foreground object features corresponding to a plurality of sample frame images of that video are stored in the feature library. The sample frame extraction mode used for the video to be detected is the same as the sample frame extraction mode used for the videos in the library.
In some embodiments, matching of foreground object features of a sample frame image to foreground object features in a feature library may be accomplished by:
performing similarity calculation between the foreground object features of a sample frame image of the video to be detected and the foreground object features of the corresponding sample frame image in the feature library (for example, the 5th extracted sample frame of a video, or the video frame with the same playing time) to obtain a corresponding similarity result; when the similarity reaches a preset similarity threshold, the matching is successful, and when the similarity does not reach the preset similarity threshold, the matching fails.
Taking the foreground object feature as a human feature as an example, in practical application, the similarity of two human features is represented by the cosine distance of the two human features, and correspondingly, the similarity of the two human features can be calculated by the following formula:
cos(x, y) = (x · y) / (||x||₂ · ||y||₂)    (3)
wherein cos(x, y) denotes the cosine distance between human body feature x and human body feature y; ||x||₂ denotes the L2 norm of vector x, and ||y||₂ denotes the L2 norm of vector y.
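A minimal sketch of the formula (3) similarity check between two foreground (human body) feature vectors is given below; the similarity threshold value is an assumed example.

```python
# Sketch: cosine similarity of two feature vectors and a thresholded match decision.
import numpy as np

SIMILARITY_THRESH = 0.85   # assumed similarity threshold

def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def features_match(x, y):
    return cosine_similarity(x, y) >= SIMILARITY_THRESH
```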
Step 605: and searching the video which is the same as the video to be detected in the video library based on the matching result.
In some embodiments, the same video in the video library as the video to be detected may be looked up by:
when the matching result represents that the number of images meeting the matching condition reaches a number threshold value in a plurality of sample frame images of the video to be detected, determining the video corresponding to the matching result in a video library, wherein the video corresponding to the matching result is the same as the video to be detected; the sample frame images meeting the matching condition are sample frame images with the similarity between the foreground object features and the foreground object features in the feature library reaching a similarity threshold, namely sample frame images successfully matched with the foreground object features of the sample frame images in the feature library.
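The decision rule can be sketched as follows, reusing the features_match check from the sketch after formula (3); the count threshold and the data layout of the feature library are illustrative assumptions.

```python
# Sketch: a library video is reported as identical when enough corresponding sample frames match.
MATCHED_FRAME_THRESH = 10   # assumed number threshold

def find_same_video(query_features, feature_library):
    """query_features: list of per-sample-frame feature vectors of the video to be detected.
    feature_library: dict mapping video_id -> list of per-sample-frame feature vectors."""
    for video_id, lib_features in feature_library.items():
        matched = sum(
            1
            for q, lib in zip(query_features, lib_features)   # compare corresponding sample frames
            if features_match(q, lib)
        )
        if matched >= MATCHED_FRAME_THRESH:
            return video_id
    return None
```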
By applying the embodiments of the present invention, the detection of videos in the video library is based on the features of foreground objects in the sample frame images of the video to be detected, without relying on the background features of the images; false detection in the case of similar picture backgrounds but slightly different foregrounds is thus avoided on the basis of improving detection efficiency, thereby improving the accuracy of video detection.
The video detection method according to the embodiment of the present invention will be described below by taking a foreground object as an example of a human body. Fig. 11 and fig. 12 are flowcharts of a video detection method according to an embodiment of the present invention, in some embodiments, the video detection method may be implemented cooperatively by a terminal and a server, for example, by the terminal 400-1 and the server 200 in fig. 1, where the terminal 400-1 is provided with a video playing client, and in conjunction with fig. 1, fig. 11 and fig. 12, the video detection method according to an embodiment of the present invention includes:
step 701: the video playing client sends a video uploading request carrying the video to be detected to the server.
In practical applications, a user can play, download and upload videos through the video playing client installed on the terminal. When the user uploads a video, the video playing client receives a video uploading instruction triggered by the user through the user interface, the instruction indicating that the video to be detected is to be uploaded, and the video playing client sends a video uploading request carrying the video to be detected to the server; in practical applications, the video uploading request may also carry a sender identifier, such as a terminal identifier.
Step 702: the server analyzes the video uploading request to obtain a video to be detected, and a first frame extraction mode is adopted to extract a preset number of sample frame images from the video to be detected.
Here, in actual implementation, the server performs shot switching detection on a video frame of the video to be detected to obtain a plurality of shots corresponding to the video to be detected; and respectively extracting sample frames from the video frames corresponding to the shots to obtain a preset number of sample frame images.
Step 703: and sequentially inputting the sample frame images into a human body detection network, and determining the human body area in the sample frame images.
In practical application, the human body detection network adopts an SSD network structure, so that the human body in the sample frame image can be accurately identified.
Step 704: and performing human body screening based on the determined human body area to obtain a target human body image.
In practical implementation, the server may screen human bodies based on the integrity of the human body or the proportion of the human body area in the sample frame image. Specifically, the server selects, as the target human body image, a human body area image whose integrity reaches the preset integrity threshold or whose proportion in the sample frame image reaches the preset proportion threshold. It should be noted that one sample frame image may yield one or more target human body images after screening.
Step 705: and extracting the characteristics of the target human body image of the sample frame image through the characteristic extraction model to obtain corresponding human body characteristics.
In some embodiments, the extraction of human features may be achieved by:
and respectively carrying out shallow feature extraction and deep feature extraction on the target human body image to obtain shallow features and deep features of the target human body, and carrying out weighted fusion on the obtained shallow features and deep features to obtain human body features of the sample frame image.
Step 706: and matching the human body characteristics of the obtained sample frame image with the human body characteristics of the corresponding sample frame image in the characteristic library in a similarity manner to obtain a matching result.
Here, the feature library stores human body features corresponding to a plurality of videos in the video library, the videos in the video library are videos with copyright attribution attributes, the feature library is constructed in advance, and the construction of the feature library comprises:
extracting a preset number of sample frame images from a target video (or video fragment) in a second frame extraction mode; sequentially inputting the sample frame images into the human body detection network and determining the human body areas in the sample frame images; performing human body screening based on the determined human body areas to obtain target human body images; performing feature extraction on the target human body images through the feature extraction model to obtain the corresponding human body features; and storing the human body features of the plurality of sample frame images of the target video in the feature library.
Here, the first frame extraction method is the same as the second frame extraction method.
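The off-line library build can be pictured, under the same assumptions, by the sketch below, which reuses the helper sketches from the preceding steps; extract_human_feature is a hypothetical wrapper around the feature extraction model, and all names are illustrative.

    import cv2

    def build_feature_library(video_paths, extract_human_feature):
        """video_id -> list of per-sample-frame human feature lists."""
        library = {}
        for video_id, path in enumerate(video_paths):
            per_frame = []
            for frame in extract_sample_frames(path):     # second frame extraction mode
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                h, w = frame.shape[:2]
                boxes = screen_person_boxes(detect_person_boxes(rgb), w, h)
                per_frame.append([extract_human_feature(rgb, box) for box in boxes])
            library[video_id] = per_frame
        return library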
In practical implementation, when matching human body features, the sample frame images of the video to be detected need to be matched against the corresponding sample frame images in the feature library; for example, the fifth sample frame image of the video to be detected is matched against the features stored for the fifth sample frame image of a video in the feature library.
In practical application, the similarity of two human body features can be represented by the cosine distance between them: when the cosine distance of the two human body features reaches a set threshold, the matching is determined to be successful; when it does not reach the set threshold, the matching is determined to have failed.
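As a hedged sketch of this per-frame check (expressed here as a cosine similarity rather than the patent's cosine-distance wording), a sample frame of the video under test can be treated as matched when any of its human features is close enough to a feature stored for the corresponding library frame; the 0.8 threshold is an assumption.

    import numpy as np

    def frame_matches(query_feats, library_feats, sim_thresh=0.8):
        """True if any query human feature matches any library feature for this frame."""
        for q in query_feats:
            for r in library_feats:
                sim = np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r) + 1e-12)
                if sim >= sim_thresh:
                    return True
        return False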
Step 707: searching for the video identical to the video to be detected in the video library, and executing step 708 when the video identical to the video to be detected is searched for; when the same video as the video to be detected is not found, step 710 is performed.
Here, in practical application, the server determines that the video satisfying the following conditions in the video library is the same video as the video to be detected:
in the matching result of the sample frame images of the video to be detected, the number of the sample frame images successfully matched reaches a preset number threshold.
Step 708: and sending the message of prohibiting uploading to the video playing client.
Step 709: the video playing client displays the message of prohibiting uploading through the user interface.
Step 710: and sending the message of successful uploading to the video playing client.
By applying the embodiment of the invention, detection is based on the human body features in the sample frame images of the video to be detected and does not depend on the background features of the images. While improving detection efficiency, this avoids false detections in cases where the picture backgrounds are similar but the foregrounds differ slightly, and thus improves the accuracy of video detection. When the video uploaded by a user is detected to be the same as a video in the video library, uploading of the video is prohibited, which effectively provides copyright protection for videos and safeguards the rights and interests of video uploaders and the playing platform.
Continuing with the example of a human body as the foreground object, the video detection method according to the embodiment of the present invention may be executed by a terminal, by a server, or by a terminal in cooperation with a server; the following description takes execution by a server as an example, for example by the server 200 in fig. 1.
The video detection method of the embodiment of the invention is based on Content-Based Copy Detection (CBCD). Human body information, which includes hair, face, clothing, pose and the like, is key information for distinguishing differences in video content, so the human body features in a video frame image can be used to represent the features of that image. Abstract features (i.e. human body features) of the video frame images are extracted as an index and matched with the human body features of the videos in the video library, and the attribution of the video copyright can be judged from the matching result. In this way, not only can ordinary videos containing human bodies be distinguished, but videos with the same or similar backgrounds yet different foreground persons can also be distinguished accurately. Fig. 13 is a schematic diagram of video frame images with the same background but different foreground persons; if detection were performed based on the complete picture features of the video frames, false detection would easily occur because the backgrounds are the same.
Referring to fig. 12, in the feature library construction stage, frame extraction and sampling are performed on each video whose features are to be added to the library to obtain a plurality of sample frame images of the video; a trained human body detection network detects human bodies in each sampled frame image; the detected human bodies are screened, and after the human bodies meeting the library requirements are cropped out, features are extracted with a trained feature extraction network and added to the feature library.
In the feature matching stage, the video to be matched is subjected to the steps of frame extraction sampling, human body detection, human body screening and feature extraction which are the same as those in the feature library construction stage, and finally the extracted human body features are matched with the human body features in the feature library.
Next, frame extraction sampling, human body detection, human body screening and feature extraction, which are all involved in the feature library construction stage and the feature matching stage, are respectively described.
For frame sampling, a plurality of sampling modes may be used, such as uniformly sampling a preset number of video frames as sample frames, or sampling frames according to the different shots and scenes of the video. It should be noted that the sampling mode used for the video to be detected in the matching stage needs to be the same as the sampling mode used for the videos during the feature library construction stage.
For human body detection, in the embodiment of the invention a human body detection network can be used to determine the human body areas in the sample frame images. The human body detection network may adopt an SSD structure, which performs very well in both detection speed and detection precision: in testing, the network detects human bodies at more than 100 frames per second on a graphics processing unit (GPU) while maintaining a detection rate of over 85%.
For human body screening, in some embodiments the screening is based mainly on the integrity of the human body or on the proportion of the sample frame image occupied by the human body area, so that unimportant human bodies can be filtered out. In practical application, unimportant human bodies include those occupying only a small proportion of the image and those without heads, such as audience members in performance programs or passers-by in films and television dramas; such human bodies appear in most videos and have a certain repeatability, so they are screened out. The head contains important information such as hairstyle and face, which plays an important role in identifying a person, so human bodies without heads are also screened out. Tests show that filtering out human bodies that occupy a small proportion of the picture or do not contain a head improves accuracy by about 10%.
For feature extraction, in some embodiments, the feature extraction network (i.e. feature extraction model) obtained by training may be used to extract features of the human body from the screened human body image, which mainly includes the following steps:
and carrying out shallow feature extraction on the human body image to obtain shallow features of the human body image, carrying out deep feature extraction on the human body image to obtain deep features of the human body image, and carrying out weighting treatment on the obtained shallow features and deep features to obtain final human body features.
Next, training of the feature extraction network will be described.
First, the construction of the training data set will be explained.
The training sample set may include at least two subsets, each subset including a plurality of human body images (i.e. sample images), where the similarity between the human body images within a subset reaches a preset first similarity threshold and the similarity between human body images in different subsets is less than a preset second similarity threshold; that is, human body images in the same subset belong to the same category and human body images in different subsets belong to different categories. In practical applications, because videos have very strong temporal and spatial correlation, the embodiment of the present invention constructs a dataset containing a large number of similar/dissimilar human body pairs from unlabeled video sequences. For example, human body areas are detected in adjacent or nearby frames of a video; by judging whether a shot switch occurs and by using hand-crafted features (such as Scale-Invariant Feature Transform (SIFT) features), the similarity of the human bodies extracted from these frames is judged, and sample pairs whose similarity meets expectations are added to the training data set. In some embodiments, attacks including random cropping, blurring, rotation and the like may be applied to these sample pairs to improve the robustness of the network, as sketched below.
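A minimal sketch of such "attack" augmentations is shown here; the parameter values are assumptions rather than values from the patent.

    import torchvision.transforms as T

    attack = T.Compose([
        T.RandomResizedCrop(224, scale=(0.7, 1.0)),   # random cropping
        T.GaussianBlur(kernel_size=5),                # blurring
        T.RandomRotation(degrees=15),                 # rotation
    ])
    # augmented = attack(pil_image)  # apply to one image of a mined sample pair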
Next, the network architecture of the feature extraction network will be described. Referring to fig. 9, the feature extraction network may include a shallow layer, an intermediate layer, a deep layer and an attention layer. The shallow layer is used to extract shallow features of the input human body image; in practical application, shallow features characterize the structural information of the image. The deep layer is used to extract deep features of the human body image after processing by the intermediate layer; in practical application, deep features characterize the semantic information of the image. The attention layer is used to combine the shallow features and the deep features; specifically, the weighted fusion of the shallow features and the deep features can be realized based on formula (1). Compared with using the shallow features or the deep features alone, this improves accuracy by about 20%.
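As a sketch of this fusion (formula (1) itself is not reproduced here), the output feature can be written as a weighted sum of the shallow and deep features; the fixed constants below and the assumption that both features share one dimensionality after projection are illustrative only.

    import torch
    import torch.nn as nn

    class WeightedFusion(nn.Module):
        """Fuse shallow and deep features as a*f_shallow + b*f_deep."""
        def __init__(self, a=0.4, b=0.6):        # constants are assumptions
            super().__init__()
            self.a, self.b = a, b

        def forward(self, shallow_feat: torch.Tensor, deep_feat: torch.Tensor) -> torch.Tensor:
            return self.a * shallow_feat + self.b * deep_feat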
Feature matching is described next.
The cosine distance is adopted for feature distance calculation; specifically, the cosine distance between human body features can be calculated by formula (3). For a requested video, the human body features in the video are extracted and their feature distances to the human body features in the feature library are calculated; human bodies whose distance is smaller than a set threshold are judged to be similar. For the whole video, when the number of sample frames containing similar human bodies reaches a set number threshold, that is, when the proportion of similar human bodies is larger than a certain threshold, the videos are judged to be the same video fragment.
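The video-level decision can be pictured with the hedged sketch below, which reuses the per-frame frame_matches helper sketched earlier; the 0.6 ratio threshold is an assumption.

    def is_same_video(query_video_feats, library_video_feats, ratio_thresh=0.6):
        """Judge two videos the same when enough corresponding sample frames match."""
        n = min(len(query_video_feats), len(library_video_feats))
        if n == 0:
            return False
        matched = sum(frame_matches(query_video_feats[i], library_video_feats[i])
                      for i in range(n))
        return matched / n >= ratio_thresh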
By applying the embodiment of the invention, on one hand, copyright protection can be provided for videos such as movies, television dramas and variety programs, effectively safeguarding the rights and interests of video uploaders and platforms; on the other hand, the method can be applied to duplicate detection of videos, which not only purifies the stock videos of a video platform and improves the video quality of the platform, but can also be used to filter recommended videos during recommendation.
The description now turns to the video detection device provided by the embodiment of the invention. Fig. 14 is a schematic structural diagram of a video detection device according to an embodiment of the present invention. Referring to fig. 14, the video detection device 140 according to an embodiment of the present invention includes:
a frame extracting unit 141, configured to extract a plurality of sample frame images from a video to be detected;
a detection unit 142, configured to detect foreground objects in each of the sample frame images, and determine foreground object areas in each of the sample frame images;
an extracting unit 143, configured to perform feature extraction on each of the sample frame images based on the determined foreground object area, so as to obtain foreground object features of each of the sample frame images;
the matching unit 144 is configured to perform similarity matching on the foreground object features of each sample frame image and the foreground object features in the feature library, so as to obtain a matching result;
And the determining unit 145 is configured to search for a video that is the same as the video to be detected in the video library based on the matching result.
In some embodiments, the frame extraction unit is further configured to perform shot switching detection on a video frame of the video to be detected, so as to obtain a plurality of shots corresponding to the video to be detected;
and respectively extracting sample frames from the video frames corresponding to the shots to obtain a plurality of sample frame images.
In some embodiments, the frame extraction unit is further configured to perform scene switching detection on a video frame of the video to be detected, so as to obtain a plurality of scenes corresponding to the video to be detected;
and respectively extracting sample frames from the video frames corresponding to the scenes to obtain a plurality of sample frame images.
In some embodiments, the detection unit is further configured to input each of the sample frame images to a Single Shot MultiBox Detector (SSD), and output a corresponding sample frame image carrying at least one foreground object frame;
and determining an image area contained in the frame of the foreground object in the output sample frame image as the foreground object area.
In some embodiments, the apparatus further comprises:
and the screening unit is used for screening the foreground object areas based on the integrity of the foreground objects in the foreground object area or the proportion of the foreground object area to the corresponding sample frame image to obtain at least one foreground object area with the integrity or the proportion meeting the preset condition.
In some embodiments, the extraction unit comprises:
the intercepting unit is used for intercepting the foreground object area of each sample frame image based on the foreground object area to obtain an area image of each sample frame image;
and the characteristic unit is used for respectively carrying out characteristic extraction on the regional images of the sample frame images to obtain the foreground object characteristics of the sample frame images.
In some embodiments, the feature unit is further configured to perform shallow feature extraction on a region image of each sample frame image, to obtain shallow features that characterize structural information of the region image;
respectively carrying out deep feature extraction on the region images of each sample frame image to obtain deep features representing semantic information of the region images;
and respectively carrying out weighted fusion on the shallow layer features and the deep layer features of each sample frame image to obtain the foreground object features of each sample frame image.
In some embodiments, the feature unit is further configured to perform shallow feature extraction on a region image of the sample frame image through a shallow network model included in the feature extraction model, so as to obtain shallow features that characterize structural information of the region image;
Deep feature extraction is carried out on the regional images of each sample frame image through a deep network model included in the feature extraction model, so that deep features representing semantic information of the regional images are obtained;
and carrying out weighted fusion on the shallow layer features and the deep layer features through an attention model included in the feature extraction model to obtain foreground object features of the sample frame image.
In some embodiments, the feature unit is further configured to perform feature extraction on the first sample image, the second sample image, and the third sample image through the feature extraction model, so as to obtain foreground object features of the first sample image, the second sample image, and the third sample image;
wherein the first sample image and the second sample image are of the same class, and the first sample image and the third sample image are of different classes;
determining a value of a triplet loss function of the feature extraction model based on the foreground object features of the first sample image, the foreground object features of the second sample image, and the foreground object features of the third sample image;
and updating model parameters of the feature extraction model based on the values of the triplet loss function.
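A minimal sketch of one such triplet-loss training step is given below: the anchor and positive images come from the same subset (same category) and the negative from a different subset. The margin value and the surrounding training loop are assumptions.

    import torch.nn.functional as F

    def triplet_step(model, optimizer, anchor, positive, negative, margin=0.3):
        """One parameter update of the feature extraction model with a triplet loss."""
        f_a, f_p, f_n = model(anchor), model(positive), model(negative)
        loss = F.triplet_margin_loss(f_a, f_p, f_n, margin=margin)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()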
In some embodiments, the determining unit is further configured to determine, when the number of images satisfying the matching condition in the plurality of sample frame images represented by the matching result reaches a number threshold, a video corresponding to the matching result in the video library, where the video corresponding to the matching result is the same video as the video to be detected;
the sample frame image meeting the matching condition is a sample frame image with similarity between the foreground object features and the foreground object features in the feature library reaching a similarity threshold.
It should be noted here that: the description of the device is similar to the description of the method, and the description of the beneficial effects of the method is omitted herein for details of the device not disclosed in the embodiments of the present invention, please refer to the description of the embodiments of the method of the present invention.
The embodiment of the invention also provides electronic equipment, which comprises:
a memory for storing an executable program;
and the processor is used for realizing the video detection method provided by the embodiment of the invention when executing the executable program stored in the memory.
The embodiment of the invention also provides a storage medium storing executable instructions, wherein the executable instructions are stored, and when the executable instructions are executed by a processor, the processor is caused to execute the video detection method provided by the embodiment of the invention.
All or part of the steps of the embodiments may be performed by hardware associated with program instructions, and the foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the method embodiments described above. The aforementioned storage medium includes: a removable storage device, a random access memory (RAM), a read-only memory (ROM), a magnetic disk, an optical disk, or other media capable of storing program code.
Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solution of the embodiments of the present invention may be embodied essentially or in a part contributing to the related art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program code, such as a removable storage device, RAM, ROM, magnetic or optical disk.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (14)

1. A method of video detection, the method comprising:
extracting a plurality of sample frame images from a video to be detected;
respectively detecting foreground objects in each sample frame image, and determining foreground object areas in each sample frame image;
based on the determined foreground object areas, respectively carrying out foreground object area interception on each sample frame image to obtain area images of each sample frame image;
shallow feature extraction is carried out on the regional images of each sample frame image respectively, so that shallow features representing structural information of the regional images are obtained;
respectively carrying out deep feature extraction on the region images of each sample frame image to obtain deep features representing semantic information of the region images;
respectively carrying out weighted fusion on the shallow layer features and the deep layer features of each sample frame image to obtain foreground object features of each sample frame image, wherein the weighted fusion comprises the following steps: respectively determining the products of the shallow layer features and of the deep layer features with preset constants, and obtaining the sum of the products, wherein the sum is the foreground object feature;
Respectively carrying out similarity matching on foreground object features of each sample frame image and foreground object features in a feature library to obtain a matching result;
and searching the video which is the same as the video to be detected in a video library based on the matching result.
2. The method of claim 1, wherein extracting a plurality of sample frame images from the video to be detected comprises:
performing shot switching detection on the video frames of the video to be detected to obtain a plurality of shots corresponding to the video to be detected;
and respectively extracting sample frames from the video frames corresponding to the shots to obtain a plurality of sample frame images.
3. The method of claim 1, wherein extracting a plurality of sample frame images from the video to be detected comprises:
performing scene switching detection on the video frames of the video to be detected to obtain a plurality of scenes corresponding to the video to be detected;
and respectively extracting sample frames from the video frames corresponding to the scenes to obtain a plurality of sample frame images.
4. The method of claim 1, wherein the detecting foreground objects in each of the sample frame images, respectively, determining foreground object regions in each of the sample frame images, comprises:
Respectively inputting each sample frame image into a Single Shot MultiBox Detector (SSD), and outputting a corresponding sample frame image carrying at least one foreground object frame;
and determining an image area contained in the frame of the foreground object in the output sample frame image as the foreground object area.
5. The method of claim 1, wherein the method further comprises:
and screening a plurality of foreground object areas based on the integrity of the foreground objects in the foreground object area or the proportion of the foreground object area to the corresponding sample frame image to obtain at least one foreground object area with the integrity or the proportion meeting the preset condition.
6. The method of claim 1, wherein shallow feature extraction of the region image of the sample frame image is performed by a shallow network model comprised by a feature extraction model; performing deep feature extraction on the region images of each sample frame image through a deep network model included in the feature extraction model; and performing weighted fusion on the shallow layer features and the deep layer features through an attention model included by the feature extraction model.
7. The method of claim 6, wherein the method further comprises:
respectively extracting features of the first sample image, the second sample image and the third sample image through the feature extraction model to obtain foreground object features of the first sample image, the second sample image and the third sample image;
wherein the first sample image and the second sample image are of the same class, and the first sample image and the third sample image are of different classes;
determining a value of a triplet loss function of the feature extraction model based on the foreground object features of the first sample image, the foreground object features of the second sample image, and the foreground object features of the third sample image;
and updating model parameters of the feature extraction model based on the values of the triplet loss function.
8. The method of claim 1, wherein the searching for the same video in the video library as the video to be detected based on the matching result comprises:
when the matching result represents that the number of sample frame images meeting the matching condition in the plurality of sample frame images reaches a number threshold, determining videos corresponding to the matching result in the video library, wherein the videos corresponding to the matching result are the same as the videos to be detected;
The sample frame image meeting the matching condition is as follows: and the similarity between the foreground object features and the foreground object features in the feature library reaches a sample frame image with a similarity threshold.
9. A video detection apparatus, the apparatus comprising:
the frame extraction unit is used for extracting a plurality of sample frame images from the video to be detected;
the detection unit is used for respectively detecting foreground objects in the sample frame images and determining foreground object areas in the sample frame images;
the extraction unit is used for respectively carrying out foreground object area interception on each sample frame image based on the determined foreground object area to obtain an area image of each sample frame image; shallow feature extraction is carried out on the regional images of each sample frame image respectively, so that shallow features representing structural information of the regional images are obtained; respectively carrying out deep feature extraction on the region images of each sample frame image to obtain deep features representing semantic information of the region images; respectively carrying out weighted fusion on the shallow layer features and the deep layer features of each sample frame image to obtain foreground object features of each sample frame image, wherein the weighted fusion comprises the following steps: respectively determining the products of the shallow layer features and of the deep layer features with preset constants, and obtaining the sum of the products, wherein the sum is the foreground object feature;
The matching unit is used for respectively matching the foreground object characteristics of each sample frame image with the foreground object characteristics in the characteristic library in a similarity manner to obtain a matching result;
and the determining unit is used for searching the video which is the same as the video to be detected in the video library based on the matching result.
10. The apparatus of claim 9, wherein,
the frame extraction unit is further used for performing shot switching detection on the video frames of the video to be detected to obtain a plurality of shots corresponding to the video to be detected;
and respectively extracting sample frames from the video frames corresponding to the shots to obtain a plurality of sample frame images.
11. The apparatus of claim 9, wherein,
the frame extraction unit is further used for performing scene switching detection on the video frames of the video to be detected to obtain a plurality of scenes corresponding to the video to be detected;
and respectively extracting sample frames from the video frames corresponding to the scenes to obtain a plurality of sample frame images.
12. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
a processor for implementing the video detection method of any one of claims 1 to 8 when executing executable instructions stored in the memory.
13. A storage medium storing executable instructions for causing a processor to perform the video detection method of any one of claims 1 to 8.
14. A computer program product comprising computer-executable instructions or a computer program, which, when executed by a processor, implements the method of any one of claims 1 to 8.
CN201910600863.5A 2019-07-04 2019-07-04 Video detection method, device, electronic equipment and storage medium Active CN110309795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910600863.5A CN110309795B (en) 2019-07-04 2019-07-04 Video detection method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910600863.5A CN110309795B (en) 2019-07-04 2019-07-04 Video detection method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110309795A CN110309795A (en) 2019-10-08
CN110309795B true CN110309795B (en) 2024-03-12

Family

ID=68078922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910600863.5A Active CN110309795B (en) 2019-07-04 2019-07-04 Video detection method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110309795B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291724A (en) * 2020-03-11 2020-06-16 上海眼控科技股份有限公司 Monitoring method, monitoring device, computer equipment and storage medium
CN111354013A (en) * 2020-03-13 2020-06-30 北京字节跳动网络技术有限公司 Target detection method and device, equipment and storage medium
CN111695415B (en) * 2020-04-28 2024-04-12 平安科技(深圳)有限公司 Image recognition method and related equipment
CN111639653B (en) * 2020-05-08 2023-10-10 浙江大华技术股份有限公司 False detection image determining method, device, equipment and medium
CN111696105B (en) * 2020-06-24 2023-05-23 北京金山云网络技术有限公司 Video processing method and device and electronic equipment
CN112183289A (en) * 2020-09-22 2021-01-05 北京金山云网络技术有限公司 Method, device, equipment and medium for detecting patterned screen
CN113395584B (en) * 2020-10-10 2024-03-22 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium
CN112507875A (en) * 2020-12-10 2021-03-16 上海连尚网络科技有限公司 Method and equipment for detecting video repetition
CN113438507B (en) * 2021-06-11 2023-09-15 上海连尚网络科技有限公司 Method, equipment and medium for determining video infringement
CN113426101B (en) * 2021-06-22 2023-10-20 咪咕互动娱乐有限公司 Teaching method, device, equipment and computer readable storage medium
CN113608140A (en) * 2021-06-25 2021-11-05 国网山东省电力公司泗水县供电公司 Battery fault diagnosis method and system
CN115858855B (en) * 2023-02-28 2023-05-05 江西师范大学 Video data query method based on scene characteristics
CN117011553A (en) * 2023-07-19 2023-11-07 苏州旭智设计营造有限公司 Multi-scene conference free conversion control system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005176339A (en) * 2003-11-20 2005-06-30 Nippon Telegr & Teleph Corp <Ntt> Moving image processing method, moving image processing apparatus, moving image processing program and recording medium with the program recorded thereon
CN103336957A (en) * 2013-07-18 2013-10-02 中国科学院自动化研究所 Network coderivative video detection method based on spatial-temporal characteristics
CN106503112A (en) * 2016-10-18 2017-03-15 大唐软件技术股份有限公司 Video retrieval method and device
CN107483887A (en) * 2017-08-11 2017-12-15 中国地质大学(武汉) The early-warning detection method of emergency case in a kind of smart city video monitoring
CN107943837A (en) * 2017-10-27 2018-04-20 江苏理工学院 A kind of video abstraction generating method of foreground target key frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Robust video hashing algorithm based on short space-time variations; Yu Xiao et al.; Computer Science; 45(2); 84-89 *

Also Published As

Publication number Publication date
CN110309795A (en) 2019-10-08

Similar Documents

Publication Publication Date Title
CN110309795B (en) Video detection method, device, electronic equipment and storage medium
WO2020119350A1 (en) Video classification method and apparatus, and computer device and storage medium
CN111522996B (en) Video clip retrieval method and device
US11914639B2 (en) Multimedia resource matching method and apparatus, storage medium, and electronic apparatus
CN106326391B (en) Multimedia resource recommendation method and device
CN110688524B (en) Video retrieval method and device, electronic equipment and storage medium
Zhang et al. Efficient video frame insertion and deletion detection based on inconsistency of correlations between local binary pattern coded frames
CN111988638B (en) Method and device for acquiring spliced video, electronic equipment and storage medium
CN112597341B (en) Video retrieval method and video retrieval mapping relation generation method and device
US9098807B1 (en) Video content claiming classifier
Mussel Cirne et al. VISCOM: A robust video summarization approach using color co-occurrence matrices
CN109063611B (en) Face recognition result processing method and device based on video semantics
CN109871490B (en) Media resource matching method and device, storage medium and computer equipment
Mallmann et al. PPCensor: Architecture for real-time pornography detection in video streaming
CN114550070B (en) Video clip identification method, device, equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
EP3346396A1 (en) Multimedia resource quality assessment method and apparatus
US11908191B2 (en) System and method for merging asynchronous data sources
WO2018068664A1 (en) Network information identification method and device
CN112291634A (en) Video processing method and device
TW202201969A (en) Apparatus and method for filtering harmful video file
CN111354013A (en) Target detection method and device, equipment and storage medium
CN117459662B (en) Video playing method, video identifying method, video playing device, video playing equipment and storage medium
CN108024148B (en) Behavior feature-based multimedia file identification method, processing method and device
Ding et al. Video saliency detection by 3D convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant