CN112383824A - Video advertisement filtering method, device and storage medium


Info

Publication number: CN112383824A
Application number: CN202011077376.4A
Authority: CN (China)
Prior art keywords: value, video, frame image, mean square, current frame
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 刘安捷
Current Assignee: Wangsu Science and Technology Co Ltd
Original Assignee: Wangsu Science and Technology Co Ltd
Application filed by Wangsu Science and Technology Co Ltd

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/454 Content or additional data filtering, e.g. blocking advertisements
    • H04N21/4542 Blocking scenes or portions of the received content, e.g. censoring scenes
    • H04N21/45457 Input to filtering algorithms, e.g. filtering a region of the image, applied to a time segment
    • H04N21/812 Monomedia components thereof involving advertisement data
    • H04N21/8456 Structuring of content by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video advertisement filtering method, device and storage medium. In the application, a deep convolutional network model trained with a deep convolutional network algorithm analyzes each video segment split from the video to be processed, and the advertisement video is then deleted from the video to be processed according to the identified start time and end time of the advertisement video, so that advertisements embedded by hard coding are filtered out. At the same time, owing to the adaptivity of the deep convolutional network model, the model can learn and adjust itself from the videos it has processed, and therefore retains good recognition capability for newly added advertisements. In addition, the deep convolutional network model analyzes each video segment in both the time dimension and the space dimension, that is, it introduces temporal and spatial attention mechanisms, which greatly improves the recognition effect.

Description

Video advertisement filtering method, device and storage medium
Technical Field
The embodiments of the present application relate to the technical field of computer vision, and in particular to a video advertisement filtering method, device and storage medium.
Background
An advertisement is, as the name implies, a notice that informs the general public of something. At present, with the rapid development of multimedia and internet technologies, videos on the internet usually have tens or even hundreds of seconds of advertisements inserted at the beginning for commercial promotion. This seriously harms the user's viewing experience and also occupies a large amount of video storage space.
To filter advertisements out of a video, the current mainstream approach is to parse HyperText Transfer Protocol (HTTP) request and response messages to obtain video description information, identify the advertisement segments by comparing the description information of different segments, and then filter the identified advertisement segments out of the video. However, this approach only works when the advertisement video is dynamically inserted over the internet; it cannot handle the case where the advertisement segments and the original video have been hard-coded into a single video. For advertisement segments that have been hard-coded together with the original video, the current solution is to maintain an advertisement video library, identify and locate advertisements by comparing the feature similarity between segments of the video and the videos in the library, and then filter the identified advertisement segments out of the video. However, this method can only recognize advertisements that already exist in the pre-built advertisement video library, and adapts poorly to newly added advertisements.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus and a storage medium for filtering a video advertisement, which aim to solve the above technical problems.
In order to solve the above technical problem, an embodiment of the present application provides a video advertisement filtering method, including the following steps:
segmenting a video to be processed into a plurality of video segments;
analyzing each video segment in the time dimension and the space dimension by using a preset deep convolutional network model, to obtain the starting time and the ending time of the advertisement video contained in the video to be processed;
and deleting the advertisement video from the video to be processed according to the starting time and the ending time.
Embodiments of the present application further provide a video advertisement filtering device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video advertisement filtering method described above.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements a video advertisement filtering method as described above.
According to the video advertisement filtering method, device and storage medium provided by the embodiments of the present application, each video segment split from the video to be processed is analyzed by the deep convolutional network model trained with the deep convolutional network algorithm, and the advertisement video is then deleted from the video to be processed according to the identified start time and end time of the advertisement video. This filters out advertisements embedded by hard coding; moreover, owing to the adaptivity of the deep convolutional network model, the model can learn and adjust itself from the videos it has processed, and therefore retains good recognition capability for newly added advertisements.
In addition, the deep convolutional network model analyzes each video segment in both the time dimension and the space dimension, that is, it introduces temporal and spatial attention mechanisms, which greatly improves the recognition effect.
In addition, the segmenting of the video to be processed into a plurality of video segments includes: taking a shot as the segmentation unit, dividing the video to be processed into a plurality of continuous shot segments according to the time dimension, and taking each shot segment as one video segment. By using the shot segment, the smallest unit, as the video segment, every frame image of an advertisement video embedded by hard coding can be filtered out by identifying each shot segment, thereby filtering out hard-coded advertisements.
In addition, dividing the video to be processed into a plurality of continuous shot segments according to the time dimension, with a shot as the segmentation unit, includes: traversing the video to be processed and performing the following operations on each traversed frame image: mapping the color space of the current frame image to the HSV space to obtain the channel value of the current frame image in the HSV space; judging, according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space, whether the current frame image and the previous frame image belong to the same shot; if they do, storing the current frame image and the previous frame image in the same image set; otherwise, storing the current frame image in a new image set; and after the video to be processed has been traversed, merging, for each image set and with a shot as the segmentation unit, the images stored in the image set according to the time dimension to obtain shot segments. By traversing each frame image of the video to be processed and processing it in this way, the video to be processed is cut into a plurality of shot segments once the traversal finishes, and every frame image is guaranteed to have a corresponding shot.
In addition, mapping the color space of the current frame image to an HSV space to obtain a channel value of the current frame image in the HSV space, which includes: mapping the color space of the current frame image to a hue channel of the HSV space to obtain a hue value; mapping the color space of the current frame image to a saturation channel of the HSV space to obtain a saturation value; mapping the color space of the current frame image to a brightness channel of the HSV space to obtain a brightness value; and taking any one or more of the hue value, the saturation value and the brightness value as a channel value of the current frame image in the HSV space.
In addition, the judging whether the current frame image and the previous frame image belong to the same shot according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space includes: calculating an HSV mean value from the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space; judging whether the HSV mean value is smaller than a preset threshold; if it is, determining that the current frame image and the previous frame image belong to the same shot; otherwise, determining that they do not belong to the same shot.
In addition, the channel value may include the hue value, or the saturation value, or the brightness value. In these cases, calculating the HSV mean value from the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space includes: when the channel value includes the hue value, calculating the mean square error of the hue value of the current frame image and the hue value of the previous frame image to obtain a hue mean square difference value, and taking the hue mean square difference value as the HSV mean value; when the channel value includes the saturation value, calculating the mean square error of the saturation value of the current frame image and the saturation value of the previous frame image to obtain a saturation mean square difference value, and taking the saturation mean square difference value as the HSV mean value; and when the channel value includes the brightness value, calculating the mean square error of the brightness value of the current frame image and the brightness value of the previous frame image to obtain a brightness mean square difference value, and taking the brightness mean square difference value as the HSV mean value. This embodiment provides a specific way of obtaining the HSV mean value when the channel value is any one of the hue value, the saturation value and the brightness value.
In addition, the channel value may include the hue value and the saturation value, or the hue value and the brightness value, or the saturation value and the brightness value. In these cases, calculating the HSV mean value from the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space includes: when the channel value includes the hue value and the saturation value, calculating the mean square error of the hue values of the current and previous frame images to obtain a hue mean square difference value, calculating the mean square error of the saturation values of the current and previous frame images to obtain a saturation mean square difference value, and averaging the hue mean square difference value and the saturation mean square difference value to obtain the HSV mean value; when the channel value includes the hue value and the brightness value, calculating the hue mean square difference value and the brightness mean square difference value in the same way and averaging them to obtain the HSV mean value; and when the channel value includes the saturation value and the brightness value, calculating the saturation mean square difference value and the brightness mean square difference value and averaging them to obtain the HSV mean value. This embodiment provides a specific way of obtaining the HSV mean value when the channel value is any two of the hue value, the saturation value and the brightness value.
In addition, the channel value may include the hue value, the saturation value and the brightness value together. In this case, calculating the HSV mean value from the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space includes: calculating the mean square error of the hue values of the current and previous frame images to obtain a hue mean square difference value; calculating the mean square error of the saturation values to obtain a saturation mean square difference value; calculating the mean square error of the brightness values to obtain a brightness mean square difference value; and averaging the hue, saturation and brightness mean square difference values to obtain the HSV mean value. This embodiment provides a specific way of obtaining the HSV mean value when the channel value consists of the hue value, the saturation value and the brightness value.
In addition, the analyzing and processing of the time dimension and the space dimension on each video segment by using the preset deep convolutional network model to obtain the starting time and the ending time of the advertisement video contained in the video to be processed includes: analyzing and processing time dimension and space dimension of each video clip by using a preset deep convolutional network model, and screening out advertisement scenes; determining the starting time of an advertisement video contained in the video to be processed according to the starting frame of the advertisement scene; and determining the end time of the advertisement video contained in the video to be processed according to the end frame of the advertisement scene. Because a complete video is composed of a plurality of related scenes, advertisement scenes are screened out by analyzing and processing each video segment by using a preset deep convolutional network model, and the starting time and the ending time of the advertisement video are determined according to the starting frame and the ending frame of the advertisement scenes, so that the advertisement video composed of the related advertisement scenes in the video to be processed can be quickly positioned.
In addition, the deep convolutional network model comprises a feature extraction module, a time attention module, a space attention module and a loss function module; the method for screening the advertisement scenes by analyzing and processing the time dimension and the space dimension of each video clip by using the preset deep convolutional network model comprises the following steps: acquiring the video features of each video clip by using the feature extraction module, the time attention module and the space attention module; and analyzing and processing the video characteristics of each video clip by using the loss function module, and screening out the advertisement scenes.
In addition, acquiring the video feature of each video segment by using the feature extraction module, the time attention module and the space attention module includes: for each video segment, sampling N frames of images from the video segment and cropping the sampled N frames to a preset size to obtain an input object; inputting the input object into the feature extraction module, which analyzes it to obtain a four-dimensional feature map Fb whose image size is reduced by a preset proportion, the preset proportion being determined by the selected deep convolutional network algorithm; inputting the feature map Fb into the time attention module, which analyzes it to obtain a four-dimensional time weight Wt; multiplying the feature map Fb and the time weight Wt element by element and summing along the first-dimension time axis to obtain a temporally fused four-dimensional time feature Ft; inputting the time feature Ft into the space attention module, which analyzes it to obtain a four-dimensional space weight Ws; multiplying the time feature Ft and the space weight Ws element by element and summing along the second and third (spatial) axes respectively to obtain a four-dimensional space feature Fs; and taking the space feature Fs as the video feature of the video segment. Embedding the time attention module in the deep convolutional network model effectively weakens the influence of blurred frames, and embedding the space attention module greatly strengthens the attention paid to advertisements, so that the extracted video features reflect the characteristics of the video segment more accurately and the advertisement recognition effect is improved.
In addition, analyzing the video features of each video segment by using the loss function module and screening out advertisement scenes includes: analyzing the video features of each video segment by using the loss function module and, taking the scene as the unit, merging video segments of the same scene into one scene; analyzing each scene by using the loss function module to obtain a prediction result for each shot segment contained in the scene, where the prediction result is either advertisement or feature (non-advertisement) content; counting the number of shot segments in each scene whose prediction result is advertisement to obtain the number of advertisement shots; and if the number of advertisement shots meets a preset condition, determining that the corresponding scene is an advertisement scene. As can be seen from the above description, the deep convolutional network model provided in the embodiments of the present application can both identify whether a video segment is an advertisement and compare whether two video segments belong to the same scene, that is, multi-task recognition is achieved, so the video advertisement filtering method provided in the embodiments of the present application can be applied to various service scenarios. A minimal sketch of this scene-level decision is shown below.
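The following sketch illustrates the scene-level decision just described; the ratio-based condition is an assumption, since the text only states that the number of advertisement shots must meet a preset condition, and the function name and threshold are illustrative.
```python
def is_advertisement_scene(shot_predictions: list, min_ratio: float = 0.5) -> bool:
    # Hypothetical interpretation of the "preset condition": a scene is treated
    # as an advertisement scene when enough of its shot segments were predicted
    # as advertisements.
    ad_shots = sum(1 for p in shot_predictions if p == "advertisement")
    return ad_shots / max(len(shot_predictions), 1) >= min_ratio
```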
In addition, the analyzing and processing the video features of the video segments by using the loss function module, and merging the video segments of the same scene into one scene by using the scene as a unit, includes: traversing each video clip, and determining the Euclidean distance between two video clips according to the video characteristics of any two video clips by using the loss function module; and combining the video clips with Euclidean distances smaller than a preset threshold value into a scene.
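As an illustration of the Euclidean-distance merging, a minimal sketch is given below; it assumes the video features are fixed-length vectors, and the greedy grouping strategy and helper names are assumptions rather than the exact procedure of the application.
```python
import numpy as np

def merge_into_scenes(features, threshold: float):
    """Group shot segments whose feature vectors lie within Euclidean distance
    `threshold` of a segment already assigned to a scene."""
    scenes = []                                   # each scene is a list of segment indices
    for i, feat in enumerate(features):
        placed = False
        for scene in scenes:
            if any(np.linalg.norm(feat - features[j]) < threshold for j in scene):
                scene.append(i)
                placed = True
                break
        if not placed:
            scenes.append([i])                    # start a new scene
    return scenes
```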
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals denote similar elements; the figures are not drawn to scale unless otherwise specified.
FIG. 1 is a detailed flow chart of a video advertisement filtering method according to a first embodiment of the present application;
FIG. 2 is a detailed flow chart of a video advertisement filtering method according to a second embodiment of the present application;
FIG. 3 is a schematic block diagram of a video advertisement filtering apparatus according to a third embodiment of the present application;
fig. 4 is a schematic structural diagram of a video advertisement filtering apparatus according to a fourth embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments in order to provide a better understanding of the present application; however, the technical solutions claimed in the present application can be implemented without these technical details and with various changes and modifications based on the following embodiments. The division into embodiments below is for convenience of description only, should not constitute any limitation on the specific implementation of the present application, and the embodiments may be combined and cross-referenced with one another provided there is no contradiction.
The first embodiment relates to a video advertisement filtering method applied to a video advertisement filtering device. In practical applications, the video advertisement filtering device may be any client terminal, such as a tablet computer, a mobile phone or a personal computer, which are not listed one by one here; this embodiment is not limited in this respect.
The following describes implementation details of the video advertisement filtering method of the present embodiment, and the following description is provided only for easy understanding and is not necessary to implement the present embodiment.
The specific flow of this embodiment is shown in fig. 1, and specifically includes the following steps:
step 101, segmenting a video to be processed into a plurality of video segments.
To facilitate understanding of the video advertisement filtering method provided in the present embodiment, the following description will first describe "shots" and "scenes".
Specifically, a "shot" refers to a continuous sequence of pictures captured by one camera without switching, and a "scene" refers to the various scenes in film and drama works, composed of character activity, background and the like. Moreover, in practical applications, each scene can be considered to describe a video segment that is coherent in content and relatively independent.
Further, it should be understood that, in general, a video is composed of a plurality of scenes, and a scene is composed of a plurality of shots. That is, a plurality of related shot slices constitute a scene, and a plurality of related scene slices constitute a complete video.
Based on this, in order to filter inserted video advertisements out of videos in any coding form, whether the video advertisement is embedded by hard coding or dynamically inserted over the internet, this embodiment cuts the video to be processed, that is, the video whose advertisements need to be filtered out, into a plurality of video segments as follows: the shot is taken as the segmentation unit, the video to be processed is split along the time dimension into a plurality of continuous shot segments arranged in temporal order, and each shot segment is finally taken as one video segment.
That is, in this embodiment, the video segments fed into the deep convolutional network model in step 102 are essentially the shot segments obtained from this shot-based partition.
In order to facilitate understanding of the above operation of dividing the to-be-processed video into a plurality of consecutive shot segments according to a time dimension by using shots as the segmentation units, a specific shot division manner is provided in this embodiment, which is as follows:
(1) Traverse the video to be processed.
Specifically, the video to be processed is traversed frame by frame, that is, every frame image in the video to be processed needs to be traversed.
(2) Perform the following operations on each traversed frame image:
(2.1) Map the color space of the current frame image to the HSV space to obtain the channel value of the current frame image in the HSV space.
It should be understood that the HSV space mentioned above is a color model space composed of a hue (Hue) channel, a saturation (Saturation) channel and a brightness (Value) channel.
Correspondingly, the operation of mapping the color space of the current frame image, that is, of each traversed frame image, to the HSV space to obtain its channel value in the HSV space is specifically: mapping the color space of the current frame image to the hue channel of the HSV space to obtain a hue value; mapping the color space of the current frame image to the saturation channel of the HSV space to obtain a saturation value; mapping the color space of the current frame image to the brightness channel of the HSV space to obtain a brightness value; and finally taking any one or more of the hue value, the saturation value and the brightness value as the channel value of the current frame image in the HSV space.
That is, the channel value mentioned in this embodiment is formed by any combination of the values of the three channels.
In addition, it is worth mentioning that, in practical applications, in order to simplify the processing logic as much as possible, it can be decided in advance which HSV channels the channel value is formed from, so that when the color space of the current frame image is mapped to the HSV space, only the corresponding channels need to be computed. For ease of understanding, the several ways of obtaining the channel value are described below:
the first method is as follows:
firstly, mapping the color space of the current frame image to the hue channel of the HSV space to obtain a hue value.
And then, taking the hue value as a channel value of the current frame image in the HSV space.
The second way:
firstly, mapping the color space of the current frame image to a saturation channel of the HSV space to obtain a saturation value.
And then, taking the saturation value as a channel value of the current frame image in the HSV space.
The third way:
firstly, mapping the color space of the current frame image to a brightness channel of the HSV space to obtain a brightness value.
And then, taking the brightness value as a channel value of the current frame image in the HSV space.
The fourth way:
firstly, mapping the color space of the current frame image to a hue channel of the HSV space to obtain a hue value; and mapping the color space of the current frame image to a saturation channel of the HSV space to obtain a saturation value.
And then, taking the hue value and the saturation value as channel values of the current frame image in the HSV space.
The fifth way:
firstly, mapping the color space of the current frame image to a hue channel of the HSV space to obtain a hue value; and mapping the color space of the current frame image to a brightness channel of the HSV space to obtain a brightness value.
And then, taking the hue value and the brightness value as channel values of the current frame image in the HSV space.
The sixth way:
firstly, mapping the color space of the current frame image to a saturation channel of the HSV space to obtain a saturation value; and mapping the color space of the current frame image to a brightness channel of the HSV space to obtain a brightness value.
And then, taking the saturation value and the brightness value as channel values of the current frame image in the HSV space.
The seventh way:
firstly, mapping the color space of the current frame image to a hue channel of the HSV space to obtain a hue value; mapping the color space of the current frame image to a saturation channel of the HSV space to obtain a saturation value; and mapping the color space of the current frame image to a brightness channel of the HSV space to obtain a brightness value.
And then, taking the hue value, the saturation value and the brightness value as channel values of the current frame image in the HSV space.
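As an illustration, a minimal OpenCV-based sketch of this HSV mapping step is shown below; the function name and the choice to return all three planes are assumptions, and which planes are actually taken as the channel value remains a configuration choice, as the seven ways above describe.
```python
import cv2
import numpy as np

def hsv_channels(frame_bgr: np.ndarray):
    """Map a BGR frame to the HSV space and return the hue, saturation and
    brightness (value) planes as float arrays."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    return h.astype(np.float32), s.astype(np.float32), v.astype(np.float32)
```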
(2.2) Judge, according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space, whether the current frame image and the previous frame image belong to the same shot.
Specifically, if it is judged that the current frame image and the previous frame image belong to the same shot, the current frame image and the previous frame image are stored in the same image set; otherwise, the current frame image is stored in a new image set.
As for step (2.2), judging whether the current frame image and the previous frame image belong to the same shot according to their channel values in the HSV space is done as follows: first, an HSV mean value is calculated from the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space; then, it is judged whether the HSV mean value is smaller than a preset threshold; finally, if the HSV mean value is smaller than the preset threshold, it is determined that the current frame image and the previous frame image belong to the same shot; otherwise, when the HSV mean value is greater than or equal to the preset threshold, it is determined that they do not belong to the same shot.
That is, when the HSV mean value is greater than or equal to the preset threshold, the video has jumped to another shot at the current frame, so the current frame image already belongs to a different shot.
In addition, it should be understood that, regarding the preset threshold obtained above, in practical applications, the preset threshold may be specifically set by a person skilled in the art in advance as needed, and the embodiment is not limited to this.
That is, through the above operations, each frame of image belonging to the same shot can be stored in the same image set, and images not belonging to the same shot can be separately stored in one image set, thereby realizing the division of the shot corresponding to each frame of image in the video to be processed.
For ease of understanding, the following description is made in conjunction with the examples:
Assume the video to be processed is 10 minutes long with a frame rate of 30 frames per second. When judging whether the current frame image and the previous frame image belong to the same shot, if all 30 frames in the first second belong to the same shot, those 30 frames are stored in the same image set. If instead only the first 28 frames (up to and including the 28th frame) belong to the same shot, while the 28th and 29th frame images do not belong to the same shot, then the first 28 frame images are stored in one image set (for convenience of description, hereinafter the first image set) and the 29th frame image is stored in another image set (hereinafter the second image set); and if the 30th frame image and the 29th frame image also do not belong to the same shot, the 30th frame image is stored in a third image set.
That is, whenever the traversed current frame image does not belong to the same shot as the previous frame image, it is stored separately in a new image set; otherwise it is stored in the image set corresponding to that shot.
It should be understood that the above examples are only examples for better understanding of the technical solution of the present embodiment, and are not to be taken as the only limitation to the present embodiment.
Further, as can be seen from the description of step (2.1), the channel value in this embodiment is composed of any one or more of the hue value, the saturation value and the brightness value, that is, the 7 ways mentioned above. Therefore, the judgment of whether the current frame image and the previous frame image belong to the same shot according to their channel values in the HSV space is performed based on the specific parameters that make up the channel value.
For convenience of distinction and description, information related to the current frame image is denoted by "first" and information related to the previous frame image is denoted by "second". For example, the first image denotes the current frame image and the first channel value denotes its channel value in the HSV space; the second image denotes the previous frame image and the second channel value denotes its channel value in the HSV space.
Correspondingly, the three parameters under the first channel value can be expressed as a first hue value, a first saturation value and a first brightness value; the three parameters under the second channel value can be expressed as a second hue value, a second saturation value and a second brightness value.
Based on this, the operation of calculating the HSV mean value according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space is classified into the following three categories according to the 7 given ways of determining the channel value:
the first type: the channel value is formed by any one of the hue value, the saturation value and the brightness value, and the calculation of the HSV mean value is divided into the following three types:
under the condition that the channel value comprises the hue value, namely the channel value only consists of the hue value:
firstly, the mean square error of the first tone value and the second tone value is calculated to obtain a tone mean square error value.
And then, taking the hue mean square difference value as an HSV mean value.
When the channel value includes a saturation value, that is, the channel value is only composed of the saturation value:
firstly, the mean square error of the first saturation value and the second saturation value is calculated to obtain a saturation mean square error value.
Then, taking the saturation mean square deviation value as an HSV mean value;
and thirdly, when the channel value comprises a lightness value, namely the channel value only consists of the lightness value:
firstly, the mean square error of the first lightness value and the second lightness value is calculated to obtain a lightness mean square error value.
Then, the value mean variance value is taken as an HSV mean value.
The second type: the channel value is composed of any two of the hue value, the saturation value and the brightness value, and the calculation of the HSV mean value is divided into the following three cases:
When the channel value includes the hue value and the saturation value, that is, the channel value is formed by the hue value and the saturation value:
Firstly, the mean square error of the first hue value and the second hue value is calculated to obtain a hue mean square difference value; and the mean square error of the first saturation value and the second saturation value is calculated to obtain a saturation mean square difference value.
Then, the hue mean square difference value and the saturation mean square difference value are averaged to obtain the HSV mean value.
When the channel value includes the hue value and the brightness value, that is, the channel value is formed by the hue value and the brightness value:
Firstly, the mean square error of the first hue value and the second hue value is calculated to obtain a hue mean square difference value; and the mean square error of the first brightness value and the second brightness value is calculated to obtain a brightness mean square difference value.
Then, the hue mean square difference value and the brightness mean square difference value are averaged to obtain the HSV mean value.
When the channel value includes the saturation value and the brightness value, that is, the channel value is formed by the saturation value and the brightness value:
Firstly, the mean square error of the first saturation value and the second saturation value is calculated to obtain a saturation mean square difference value; and the mean square error of the first brightness value and the second brightness value is calculated to obtain a brightness mean square difference value.
Then, the saturation mean square difference value and the brightness mean square difference value are averaged to obtain the HSV mean value.
The third type: the channel value is formed by the hue value, the saturation value and the brightness value together, and the HSV mean value is calculated as follows:
Firstly, the mean square error of the first hue value and the second hue value is calculated to obtain a hue mean square difference value; the mean square error of the first saturation value and the second saturation value is calculated to obtain a saturation mean square difference value; and the mean square error of the first brightness value and the second brightness value is calculated to obtain a brightness mean square difference value.
Then, the hue mean square difference value, the saturation mean square difference value and the brightness mean square difference value are averaged to obtain the HSV mean value.
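A compact sketch of this HSV-mean computation follows; treating each "mean square error" as the mean squared pixel difference of a channel plane and averaging over whichever channels are configured is an interpretation of the cases above, and the helper names are illustrative.
```python
import numpy as np

def channel_msd(curr: np.ndarray, prev: np.ndarray) -> float:
    # Mean square difference of one HSV channel plane between the current
    # and previous frame images.
    return float(np.mean((curr - prev) ** 2))

def hsv_mean(curr_channels: dict, prev_channels: dict) -> float:
    """Average the per-channel mean square differences over whichever of the
    'h', 's', 'v' planes are configured as the channel value."""
    diffs = [channel_msd(curr_channels[k], prev_channels[k]) for k in curr_channels]
    return sum(diffs) / len(diffs)
```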
(3) After the video to be processed has been traversed, for each image set, with a shot as the segmentation unit, the images stored in the image set are merged according to the time dimension to obtain shot segments.
For ease of understanding, the following description is made in conjunction with the examples:
it is assumed that A, B, C three image sets are obtained after the video to be processed is processed through the operations given in step (1) and step (2) above. Among them, the image set a stores the image a1, the image a2, the image a3, and the image a4, the image set B stores the image B1, the image B2, and the image B3, and the image set C stores the image C1, the image C2, the image C3, the image C4, and the image C5.
If the order of the images stored in the image set a, the image set B and the image set C is sequentially arranged according to the time dimension, that is, the time axis, then for the image set a, the shots are taken as the segmentation unit, and the images stored therein are merged according to the time dimension, and the obtained shot segment can be represented as a1a2a3a 4.
Accordingly, the shot obtained from the images stored in the image set B can be represented as B1B2B 3; the shot from the images stored in the image set C may be denoted as C1C2C3C4C 5.
It should be understood that the above examples are only examples for better understanding of the technical solution of the present embodiment, and are not to be taken as the only limitation to the present embodiment.
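Putting the previous two sketches together, the traversal described in steps (1) to (3) could look roughly like the following; the function name, the cv2.VideoCapture usage and the in-memory grouping are illustrative assumptions rather than the implementation of this embodiment.
```python
import cv2

def split_into_shots(video_path: str, threshold: float):
    """Traverse the video frame by frame; a frame whose HSV mean value versus
    the previous frame is below `threshold` stays in the current image set,
    otherwise a new image set (a new shot) is started. Reuses hsv_channels and
    hsv_mean from the sketches above, here with all three channels (the
    seventh way)."""
    cap = cv2.VideoCapture(video_path)
    shots, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, s, v = hsv_channels(frame)
        curr = {"h": h, "s": s, "v": v}
        if prev is None or hsv_mean(curr, prev) >= threshold:
            shots.append([])                  # start a new image set / shot segment
        shots[-1].append(frame)
        prev = curr
    cap.release()
    return shots                              # each inner list is one shot segment
```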
Step 102, analyzing each video segment in the time dimension and the space dimension by using the preset deep convolutional network model, to obtain the starting time and the ending time of the advertisement video contained in the video to be processed.
As can be seen from the descriptions of "video", "scene" and "shot" in step 101, the video to be processed is composed of a plurality of related scene segments.
That is, both the advertisement video and the feature (main content) video that make up the video to be processed are each composed of a plurality of related scene segments.
Therefore, to determine the starting time and the ending time of the advertisement video contained in the video to be processed, it suffices to analyze each video segment in the time dimension and the space dimension with the preset deep convolutional network model so as to screen out the advertisement scenes that make up the advertisement video, then determine the starting time of the advertisement video according to the starting frame of the advertisement scene, and determine the ending time of the advertisement video according to the ending frame of the advertisement scene.
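For illustration, converting the advertisement scene's boundary frames into times could be as simple as the sketch below; dividing frame indices by the frame rate is an assumption, since the text does not spell out the conversion.
```python
def segment_times(start_frame: int, end_frame: int, fps: float):
    """Convert the advertisement scene's first and last frame indices into
    start and end times in seconds, assuming a constant frame rate."""
    return start_frame / fps, end_frame / fps
```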
In addition, it is worth mentioning that the preset deep convolutional network model is expected to extract features from the video segments, to have self-learning and self-adaptive characteristics, to focus on image frames (also called video frames) that carry richer information and weaken the influence of blurred frames on the recognition result, to focus on the spatial area where the advertisement appears and reduce background interference, and to support multi-task recognition, for example recognizing whether an input video segment is an advertisement and comparing whether two video segments belong to the same scene. To this end, the deep convolutional network model obtained by pre-training in this embodiment mainly includes a feature extraction module constructed based on a preset deep convolutional network algorithm, a time attention module constructed based on a time attention mechanism, a space attention module constructed based on a space attention mechanism, and a loss function module constructed based on a preset loss function.
It should be understood that, in practical applications, focusing on the spatial area of the advertisement may specifically rely on the advertisement's identifying information. For example, if a video as a whole depicts a certain scene but a picture or name of a product completely unrelated to that scene appears, such identifying information can be used as a basis for recognizing the advertisement.
It should be understood that the above examples are only examples for better understanding of the technical solution of the present embodiment, and are not to be taken as the only limitation to the present embodiment.
Correspondingly, the operation of performing analysis processing of time dimension and space dimension on each video clip by using the preset deep convolutional network model to screen out an advertisement scene specifically comprises: firstly, acquiring the video characteristics of each video clip by using the characteristic extraction module, the time attention module and the space attention module; and then, analyzing and processing the video characteristics of each video clip by using the loss function module, and screening out the advertisement scenes.
As can be seen from the above description, the temporal attention module is constructed based on the temporal attention mechanism, and the spatial attention module is constructed based on the spatial attention mechanism. Therefore, when the preset deep convolutional network model is used for analyzing and processing the time dimension and the space dimension of each video clip, the analysis and the processing of the time dimension can be realized through the time attention module, and the analysis and the processing of the space dimension can be realized through the space attention module.
In addition, it should be understood that, in practical applications, to ensure that step 102 can be executed normally, the corresponding modules need to be constructed in advance based on the preset deep convolutional network algorithm, the time attention mechanism, the space attention mechanism and the loss function; each module of the constructed training model is then iteratively trained with preset sample data until the training result meets a preset condition, and finally the training model that meets the preset condition is taken as the deep convolutional network model.
Regarding the training of the deep convolutional network model, those skilled in the art may refer to the implementation of the related algorithm by themselves, and details of this embodiment are not described again.
Step 103, deleting the advertisement video from the video to be processed according to the starting time and the ending time.
Specifically, deleting the advertisement video from the video to be processed according to the starting time and the ending time essentially means deleting the segment that the deep convolutional network model has identified as an advertisement scene.
In practical application, when the advertisement video is deleted from the video to be processed, the video can be cut by using the existing video editing tool, and then the advertisement video is deleted from the video to be processed.
For example, to delete the first 10 seconds of the video to be processed, ffmpeg (Fast Forward Mpeg, a set of open-source programs that can record and convert digital audio and video and turn them into streams) can be used to execute the command "ffmpeg -ss 10 -i in.mp4 out.mp4".
It should be understood that the above examples are only examples for better understanding of the technical solution of the present embodiment, and are not to be taken as the only limitation to the present embodiment.
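As a further illustration (not taken from the text of this application), an advertisement in the middle of the video can be removed by exporting the parts before and after it and concatenating them; the sketch below drives ffmpeg from Python, and the file names and the stream-copy strategy are assumptions.
```python
import subprocess

def cut_out_segment(src: str, dst: str, ad_start: float, ad_end: float) -> None:
    """Remove the advertisement spanning [ad_start, ad_end) seconds from src and
    write the result to dst, using ffmpeg's stream copy and concat demuxer."""
    subprocess.run(["ffmpeg", "-y", "-i", src, "-t", str(ad_start),
                    "-c", "copy", "part1.mp4"], check=True)      # part before the ad
    subprocess.run(["ffmpeg", "-y", "-ss", str(ad_end), "-i", src,
                    "-c", "copy", "part2.mp4"], check=True)      # part after the ad
    with open("list.txt", "w") as f:
        f.write("file 'part1.mp4'\nfile 'part2.mp4'\n")
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "list.txt", "-c", "copy", dst], check=True)
```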
From the above description it is easy to see that the video advertisement filtering method provided in this embodiment analyzes each video segment split from the video to be processed with the deep convolutional network model trained on the deep convolutional network algorithm, and then deletes the advertisement video from the video to be processed according to the identified start time and end time of the advertisement video. This not only filters out advertisements embedded by hard coding; thanks to the adaptivity of the deep convolutional network model, the model can also learn and adjust itself from the videos it has processed, and therefore has better recognition capability for newly added advertisements.
In addition, the deep convolutional network model analyzes each video segment in both the time dimension and the space dimension, that is, it introduces temporal and spatial attention mechanisms, which greatly improves the recognition effect.
A second embodiment of the present application relates to a video advertisement filtering method. The second embodiment provides a specific implementation in which the preset deep convolutional network model is used to extract features and identify advertisement scenes.
For convenience of explanation, the following description is made with reference to fig. 2:
Firstly, after the video to be processed has been cut into a plurality of continuous shot segments with the shot as the unit, the obtained shot segments are input one by one into the preset deep convolutional network model; that is, each shot segment is processed by the feature extraction, temporal attention, spatial attention, exponential normalization loss (softmax loss) and triplet loss stages shown in fig. 2.
Next, for each shot (i.e., the video clip referred to in the first embodiment), the following processing is performed:
(1) Sample N frames of images from the video segment, and crop the sampled N frames to a preset size to obtain the input object.
For ease of understanding, this embodiment takes an input object of 224 × 224 images as an example, that is, the sampled N frames are cropped to 224 × 224.
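A small sketch of this sampling-and-cropping step, under the assumption of uniform sampling and with a plain resize standing in for the cropping described, might look as follows; the helper name and the value of N are illustrative.
```python
import cv2
import numpy as np

def sample_and_crop(shot_frames: list, n: int = 8, size: int = 224) -> np.ndarray:
    """Uniformly sample up to n frames from one shot segment and bring them to
    size x size, producing the network's input object of shape [N, 224, 224, 3]."""
    idx = np.linspace(0, len(shot_frames) - 1, num=min(n, len(shot_frames))).astype(int)
    frames = [cv2.resize(shot_frames[i], (size, size)) for i in idx]
    return np.stack(frames)
```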
(2) Input the input object into the feature extraction module, which analyzes it to obtain a four-dimensional feature map Fb whose image size is reduced by a preset proportion.
Specifically, the preset proportion is determined according to a selected deep convolutional network algorithm, namely the deep convolutional network algorithm according to which the feature extraction module is constructed.
It should be noted that, in practical applications, the deep convolutional network algorithm on which the feature extraction module is built may come from the Residual Network (ResNet) series or the EfficientNet series (a newer family of convolutional neural networks that improves both accuracy and efficiency).
Different series of algorithms give feature extraction modules that support different convolution kernel sizes, so the reduction proportion also differs.
For example, ResNet50 in the ResNet series downscales by a factor of 32, so the obtained four-dimensional feature map Fb is 32 times smaller than the input object.
Taking 224 × 224 images as an example, the four-dimensional feature map Fb corresponding to the N frames of images has the pattern [N × 7 × 7 × 2048].
Wherein, N is on the first dimension axis, the first 7 is on the second dimension axis, the second 7 is on the third dimension axis, 2048 is on the fourth dimension axis.
Furthermore, it should be understood that the first dimension axis represents time, the second dimension axis represents the height of the image, the third dimension axis represents the width of the image, i.e. the second dimension axis and the third dimension axis represent space, and the fourth dimension axis represents the number of feature maps.
Further, 7 on the second dimensional axis and the third dimensional axis indicates that the input 224 × 224 size image is reduced to 7 × 7.
2048 on the fourth dimension axis is a fixed value, and is 2048 no matter how large the input image is and how large the reduction factor is.
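The shape bookkeeping above can be reproduced with a standard backbone; the snippet below only illustrates that truncating ResNet50 before global pooling yields such a [N, 7, 7, 2048] map, and the torchvision usage and channels-last permute are assumptions rather than the code of this embodiment.
```python
import torch
import torchvision

# Truncate ResNet50 before global average pooling and the classifier head.
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2]
)

frames = torch.randn(8, 3, 224, 224)   # N = 8 sampled frames, 224 x 224 each
feat = backbone(frames)                # [N, 2048, 7, 7] in PyTorch's NCHW layout
feat = feat.permute(0, 2, 3, 1)        # [N, 7, 7, 2048], matching the pattern in the text
print(feat.shape)
```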
(3) Input the feature map Fb into the time attention module, which analyzes it to obtain the four-dimensional time weight Wt.
It should be noted that, in this embodiment, the time attention module is mainly used to fuse the information of the N frames of images, considering that different frames contribute differently to the final result; for example, some frames contain motion blur that introduces interference. Therefore, the four-dimensional feature map Fb is input into the time attention module, which uses the time attention mechanism to weigh the importance of each frame image.
Specifically, in order to ensure the accuracy of the finally extracted video features, and considering that the features extracted by the feature extraction module form a feature map Fb of four-dimensional axes, the temporal attention module must be able to process all the features in Fb; therefore, the convolution kernel of the temporal attention module needs to be 1 × 1 in size with 2048 output channels, so that the feature map Fb of the four-dimensional axes is converted into the temporal weight Wt of the four-dimensional axes.
Further, in practical application, in order to reduce the calculation difficulty and complexity, after the feature map Fb of the four-dimensional axes is convolved with the 1 × 1 kernel and 2048 output channels, the output result can be transformed by using a Sigmoid function (an activation function commonly used for neural network activation and logistic regression), which maps the output to between 0 and 1.
Further, in order to ensure that the finally extracted video features are more stable, L1 regularization processing can be performed on the result of the Sigmoid function transformation, and finally the result after the Sigmoid function and L1 regularization processing is used as the temporal weight Wt of the four-dimensional axes.
Still taking the feature map Fb of four-dimensional axes with the pattern [N × 7 × 7 × 2048] as an example, the temporal weight Wt of the four-dimensional axes obtained through the 1 × 1 convolution with 2048 output channels, the Sigmoid function and the L1 regularization has the same pattern [N × 7 × 7 × 2048].
Furthermore, it should be understood from the above description that, in addition to converting the feature map Fb of the four-dimensional axes into the temporal weight Wt of the four-dimensional axes, the temporal attention module also performs the Sigmoid transformation and L1 regularization; therefore, when the temporal attention module is constructed, it needs to be constructed based on the Sigmoid function and L1 regularization in addition to the temporal attention mechanism.
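A minimal sketch of such a temporal attention module is given below; it is not part of the original disclosure, assumes PyTorch, and reads the "L1 regularization" step as normalizing the Sigmoid output along the time axis. The class name TemporalAttention is hypothetical.

```python
# Illustrative sketch: 1 x 1 convolution with 2048 output channels, Sigmoid,
# and L1 normalization along the time axis (one plausible reading of the
# "L1 regularization" step described above).
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):  # hypothetical name
    def __init__(self, channels=2048):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # 1 x 1 kernel, 2048 outputs

    def forward(self, fb):              # fb: [N, 7, 7, 2048]
        x = fb.permute(0, 3, 1, 2)      # [N, 2048, 7, 7] for Conv2d
        w = torch.sigmoid(self.conv(x))                 # map the output to (0, 1)
        w = w.permute(0, 2, 3, 1)                       # back to [N, 7, 7, 2048]
        w = w / (w.sum(dim=0, keepdim=True) + 1e-8)     # L1-normalize over the time axis
        return w                                        # temporal weight Wt, same shape as Fb
```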
(4) Multiplying the feature map Fb of the four-dimensional axes and the temporal weight Wt of the four-dimensional axes element by element, and adding along the time axis, i.e. the first dimension axis, to obtain the temporally fused temporal feature Ft of the four-dimensional axes.
Specifically, adding along the time axis of the first dimension axis means that the feature maps Fb of the four-dimensional axes of the N frames of images and the temporal weights Wt of the four-dimensional axes of the N frames of images are temporally fused into one frame, thereby obtaining the temporally fused temporal feature Ft of the four-dimensional axes.
Still taking the feature map Fb of the four-dimensional axes and the temporal weight Wt of the four-dimensional axes with the pattern [N × 7 × 7 × 2048] as an example, the resulting temporally fused temporal feature Ft of the four-dimensional axes has the pattern [1 × 7 × 7 × 2048].
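The element-wise weighting and summation along the time axis can be illustrated with the following sketch, again assuming PyTorch tensors in the [N × 7 × 7 × 2048] layout; the weight tensor here is random and only stands in for the output of a temporal attention module.

```python
# Illustrative sketch: element-wise multiplication of Fb and Wt, then summation
# over the first (time) axis to fuse the N frames into one.
import torch

fb = torch.randn(8, 7, 7, 2048)          # feature map Fb for N = 8 frames
wt = torch.rand(8, 7, 7, 2048)           # temporal weight Wt (placeholder values)
ft = (fb * wt).sum(dim=0, keepdim=True)  # temporally fused feature Ft
print(ft.shape)                          # torch.Size([1, 7, 7, 2048])
```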
(5) Inputting the temporal feature Ft of the four-dimensional axes into the spatial attention module, which analyzes and processes it to obtain the spatial weight Ws of the four-dimensional axes.
It should be noted that, in this embodiment, the spatial attention module is mainly used for fusing the information of spatial positions in the N frames of images; for example, a trademark in the N frames of images is more important than other objects, while the background often does not contain important information.
Specifically, in order to ensure the accuracy of the finally extracted video features, and considering that the temporal feature Ft of the four-dimensional axes is obtained through the processing of the temporal attention module, the spatial attention module must be able to process all the features in Ft; therefore, the convolution kernel of the spatial attention module also needs to be 1 × 1 in size with 2048 output channels, so that the temporal feature Ft of the four-dimensional axes is converted into the spatial weight Ws of the four-dimensional axes.
Further, in practical application, in order to reduce the calculation difficulty and complexity, after the temporal feature Ft of the four-dimensional axes is convolved with the 1 × 1 kernel and 2048 output channels, the output result can be transformed by using the Sigmoid function, which maps the output to between 0 and 1.
Further, in order to ensure that the finally extracted video features are more stable, L1 regularization processing can be performed on the result of the Sigmoid function transformation, and finally the result after the Sigmoid function and L1 regularization processing is used as the spatial weight Ws of the four-dimensional axes.
Still taking the temporal feature Ft of the four-dimensional axes with the pattern [1 × 7 × 7 × 2048] as an example, the spatial weight Ws of the four-dimensional axes obtained through the 1 × 1 convolution with 2048 output channels, the Sigmoid function and the L1 regularization has the same pattern [1 × 7 × 7 × 2048].
Furthermore, it should be understood from the above description that, in addition to converting the temporal feature Ft of the four-dimensional axes into the spatial weight Ws of the four-dimensional axes, the spatial attention module also performs the Sigmoid transformation and L1 regularization; therefore, when the spatial attention module is constructed, it needs to be constructed based on the Sigmoid function and L1 regularization in addition to the spatial attention mechanism.
(6) Multiplying the temporal feature Ft of the four-dimensional axes and the spatial weight Ws of the four-dimensional axes element by element, and adding along the spatial axes, i.e. the second dimension axis and the third dimension axis respectively, to obtain the spatial feature Fs of the four-dimensional axes.
Specifically, adding along the spatial axes of the second dimension axis and the third dimension axis means that the temporal feature Ft of the four-dimensional axes of the one fused frame obtained by the temporal attention module and the spatial weight Ws of the four-dimensional axes of that frame are spatially fused, thereby obtaining the spatially fused spatial feature Fs of the four-dimensional axes.
Still taking the temporal feature Ft of the four-dimensional axes and the spatial weight Ws of the four-dimensional axes with the pattern [1 × 7 × 7 × 2048] as an example, the resulting spatially fused spatial feature Fs has the pattern [1 × 2048].
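A corresponding sketch of the spatial attention and spatial fusion steps is given below; it is not part of the original disclosure, assumes PyTorch, mirrors the temporal sketch above, and takes the "L1 regularization" as normalization over the spatial axes. The class name SpatialAttention is hypothetical.

```python
# Illustrative sketch: spatial attention (1 x 1 conv, 2048 channels, Sigmoid,
# L1 normalization over space) followed by summation over the spatial axes.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):  # hypothetical name
    def __init__(self, channels=2048):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # 1 x 1 kernel, 2048 outputs

    def forward(self, ft):                                    # ft: [1, 7, 7, 2048]
        x = ft.permute(0, 3, 1, 2)                            # [1, 2048, 7, 7]
        w = torch.sigmoid(self.conv(x)).permute(0, 2, 3, 1)   # [1, 7, 7, 2048]
        w = w / (w.sum(dim=(1, 2), keepdim=True) + 1e-8)      # L1-normalize over height/width
        return w                                              # spatial weight Ws

ft = torch.randn(1, 7, 7, 2048)
ws = SpatialAttention()(ft)
fs = (ft * ws).sum(dim=(1, 2))       # sum over the second and third (spatial) axes
print(fs.shape)                      # torch.Size([1, 2048]): the video feature Fs
```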
(7) Taking the spatial feature Fs of the four-dimensional axes as the video feature of the video clip.
It is easy to find that the three stages of feature extraction, temporal attention and spatial attention in fig. 2 are completed by the processing of the above steps (1) to (7).
For the last stage in fig. 2, i.e. the operation performed by the loss function module in the deep convolutional network model, two loss functions are specifically used: one is the exponential normalized loss (softmax loss), which identifies whether the shot segment input to the deep convolutional network model is an advertisement or a feature film, and the other is the triplet loss, which guides the learning of the distance used to compare the features of two shots.
Specifically, when the video features of each video segment are obtained, the following processing is performed in the loss function module of the deep convolutional network model:
(1) Analyzing and processing the video features of the video clips by using the loss function module, and combining the video clips of the same scene into one scene by taking the scene as a unit.
Specifically, in the present embodiment, in the process of merging video segments of the same scene into one scene, it is determined whether any two video segments belong to the same scene based on the euclidean distance, and then the video segments belonging to the same scene are merged into one scene.
In order to implement the above operations, it is necessary to traverse each video segment, then determine the euclidean distance between two video segments according to the video features of any two video segments by using the loss function module, and finally merge the video segments with the euclidean distance smaller than the preset threshold into one scene.
The above-mentioned Euclidean distance is specifically obtained by calculation according to formula (1), i.e. as the Euclidean distance between the video features of two shot segments:

D = ||f1 − f2||2    (1)

wherein a, p, n ∈ [1, K], f1,K represents the video feature of the K-th shot segment of an advertisement, f2,K represents the video feature of the K-th shot segment of a feature film, and D is the Euclidean distance obtained by calculation.
In addition, it should be understood that, regarding the preset threshold mentioned above, in practical applications it may be set in advance by a person skilled in the art as needed, and this embodiment does not limit it.
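For illustration, the following sketch, not part of the original disclosure, merges shot segments into scenes when the Euclidean distance between their video features is below a preset threshold; it assumes NumPy, compares temporally adjacent segments as a simplification of the pairwise traversal described above, and uses an illustrative threshold value.

```python
# Illustrative sketch: group consecutive shot segments into scenes based on
# the Euclidean distance between their video features (formula (1)).
import numpy as np

def merge_into_scenes(features, threshold=1.0):
    """features: [K, D] array, one video feature per shot segment in temporal order."""
    scenes, current = [], [0]
    for k in range(1, len(features)):
        d = np.linalg.norm(features[k] - features[k - 1])  # Euclidean distance to previous shot
        if d < threshold:
            current.append(k)       # same scene as the previous shot segment
        else:
            scenes.append(current)  # close the current scene and start a new one
            current = [k]
    scenes.append(current)
    return scenes                   # list of lists of shot indices
```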
(2) Analyzing and processing each scene by using the loss function module to obtain a prediction result for the shot segments contained in each scene.
It should be understood that, in practical applications, the video segments contained in the video to be processed are either advertisement segments or feature film segments. Therefore, after the obtained scenes are analyzed and processed by the loss function module, the prediction result of each shot segment contained in each scene is either an advertisement or a feature film.
That is, the prediction result is either an advertisement or a feature film.
In addition, in practical application, besides using "advertisement" and "feature film" directly as the prediction results, preset information may, by convention, be used to represent an advertisement, and other information may be used to mark a feature film.
The operation of predicting whether the shot segments contained in each scene are advertisements or feature films is specifically realized by using the exponential normalized loss (softmax loss) function.
For convenience of implementation, in this embodiment, the prediction result of each shot is specifically predicted based on formula (2):
loss = −Σi pi·log(qi)    (2)

wherein pi represents the label of the i-th shot segment, and qi represents the prediction result of the i-th shot segment.
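A minimal sketch of such a softmax-loss classifier is shown below; it is not part of the original disclosure, assumes PyTorch, and uses an illustrative linear layer from the 2048-dimensional video feature to the two classes (advertisement and feature film).

```python
# Illustrative sketch: per-shot advertisement / feature-film prediction with a
# softmax classifier and the cross-entropy (softmax) loss of formula (2).
import torch
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Linear(2048, 2)          # video feature -> {feature film, advertisement}

features = torch.randn(5, 2048)          # video features of 5 shot segments
labels = torch.tensor([1, 0, 0, 1, 1])   # 1 = advertisement, 0 = feature film (illustrative)

logits = classifier(features)
q = F.softmax(logits, dim=1)             # per-shot prediction q_i
loss = F.cross_entropy(logits, labels)   # equals -sum_i p_i * log(q_i) with one-hot labels p_i
pred = q.argmax(dim=1)                   # predicted class per shot segment
```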
(3) Counting the number of shot segments whose prediction result is an advertisement in each scene to obtain the number of advertisement shots.
Correspondingly, if the number of advertisement shots meets a preset condition, the corresponding scene is determined to be an advertisement scene, i.e. content that needs to be deleted; otherwise, the scene is considered a feature film scene, i.e. content that does not need to be deleted.
Regarding meeting the preset condition, in this embodiment it means that more than half of the shot segments contained in a scene are predicted as advertisements; that is, for any scene, if more than half of its shots are predicted as advertisements, the scene is an advertisement scene, otherwise it is a feature film scene.
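The majority-vote rule described above can be illustrated with the following plain-Python sketch, which is not part of the original disclosure; the label strings are illustrative.

```python
# Illustrative sketch: a scene is an advertisement scene when more than half of
# its shot segments are predicted as advertisements.
def is_advertisement_scene(shot_predictions):
    """shot_predictions: list of per-shot results, e.g. ['ad', 'ad', 'feature']."""
    ad_count = sum(1 for p in shot_predictions if p == 'ad')
    return ad_count > len(shot_predictions) / 2

print(is_advertisement_scene(['ad', 'ad', 'feature']))  # True: 2 of 3 shots are advertisements
```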
It should be understood that the above examples are only examples for better understanding of the technical solution of the present embodiment, and are not to be taken as the only limitation to the present embodiment.
Therefore, according to the video advertisement filtering method provided by the embodiment, the influence of the fuzzy frame is effectively weakened by embedding the time attention module in the deep convolutional network model, and the attention to the advertisement is greatly strengthened by embedding the space attention module, so that the extracted video features can more accurately reflect the features of the video segments, and the advertisement identification effect is further improved.
In addition, as can be seen from the above description, the deep convolutional network model provided in the embodiment of the present application can not only identify whether a video segment is an advertisement, but also compare whether two video segments belong to the same scene, that is, multi-task identification is achieved, so that the video advertisement filtering method provided in the embodiment of the present application can be applied to various service scenes.
It should be understood that the above steps of the various methods are divided for clarity, and the implementation may be combined into one step or split into a plurality of steps, and all that includes the same logical relationship is within the protection scope of the present patent; it is within the scope of the patent to add insignificant modifications to the algorithms or processes or to introduce insignificant design changes to the core design without changing the algorithms or processes.
A third embodiment of the present application relates to a video advertisement filtering apparatus, as shown in fig. 3, including: a segmentation module 301, an analysis module 302, and a deletion module 303.
The segmentation module 301 is configured to segment a video to be processed into a plurality of video segments; an analysis module 302, configured to perform analysis processing on a time dimension and a space dimension on each video segment by using a preset deep convolutional network model, so as to obtain a start time and an end time of an advertisement video included in the video to be processed; a deleting module 303, configured to delete the advertisement video from the to-be-processed video according to the start time and the end time.
In addition, in another example, the segmentation module 301 is specifically configured to divide the video to be processed into a plurality of consecutive shot segments according to a time dimension by using a shot as a segmentation unit; and taking each shot segment as a video segment.
In addition, in another example, the operation of dividing the to-be-processed video into a plurality of consecutive shot segments according to a time dimension by using the shot as a splitting unit specifically includes:
traversing the video to be processed, and executing the following operations on each traversed frame image:
mapping the color space of the current frame image to an HSV space to obtain a channel value of the current frame image in the HSV space; judging whether the current frame image and the previous frame image belong to the same shot according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space; if so, storing the current frame image and the previous frame image in the same image set; otherwise, storing the current frame image in a new image set;
after traversing the video to be processed, taking a shot as a segmentation unit for each image set, and merging the images stored in the image sets according to the time dimension to obtain shot fragments.
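For illustration, the following sketch, not part of the original disclosure, traverses the frames of a video, compares consecutive frames in HSV space, and groups them into shot segments; it assumes OpenCV and NumPy, and the threshold value is illustrative.

```python
# Illustrative sketch: HSV-based shot segmentation by comparing consecutive frames.
import cv2
import numpy as np

def split_into_shots(frames, threshold=30.0):
    """frames: list of H x W x 3 BGR images in temporal order."""
    shots, prev_hsv = [], None
    for frame in frames:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.float32)
        if prev_hsv is None:
            shots.append([frame])                    # first frame starts the first shot
        else:
            mse = np.mean((hsv - prev_hsv) ** 2)     # mean square difference over H, S, V
            if mse < threshold:
                shots[-1].append(frame)              # same shot as the previous frame
            else:
                shots.append([frame])                # new shot (new image set)
        prev_hsv = hsv
    return shots
```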
In addition, in another example, the operation of mapping the color space of the current frame image to an HSV space to obtain a channel value of the current frame image in the HSV space specifically includes:
mapping the color space of the current frame image to a hue channel of the HSV space to obtain a hue value;
mapping the color space of the current frame image to a saturation channel of the HSV space to obtain a saturation value;
mapping the color space of the current frame image to a brightness channel of the HSV space to obtain a brightness value;
and taking any one or more of the hue value, the saturation value and the brightness value as a channel value of the current frame image in the HSV space.
In addition, in another example, the determining, according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image of the current frame image in the HSV space, whether the current frame image and the previous frame image belong to the same shot specifically includes:
calculating an HSV average value according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image of the current frame image in the HSV space;
judging whether the HSV average value is smaller than a preset threshold value or not;
if so, determining that the current frame image and the previous frame image belong to the same shot;
otherwise, determining that the current frame image and the previous frame image do not belong to the same shot.
Moreover, in another example, the channel values include the hue values; or the saturation value; or the brightness value.
Correspondingly, calculating an HSV mean value according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image of the current frame image in the HSV space, which specifically comprises the following steps:
when the channel value comprises the hue value, calculating the mean square error of the hue value of the current frame image and the hue value of the previous frame image to obtain a hue mean square error value, and taking the hue mean square error value as the HSV mean value;
when the channel value comprises a saturation value, calculating the mean square error of the saturation value of the current frame image and the saturation value of the previous frame image to obtain a saturation mean square error value, and taking the saturation mean square error value as an HSV mean value;
and when the channel value comprises a brightness value, calculating the mean square error of the brightness value of the current frame image and the brightness value of the previous frame image to obtain a brightness mean square error value, and taking the brightness mean square error value as an HSV mean value.
Moreover, in another example, the channel values include the hue value and the saturation value; or the hue value and the brightness value; or the saturation value and the brightness value.
Correspondingly, calculating an HSV mean value according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image of the current frame image in the HSV space, which specifically comprises the following steps:
when the channel value comprises the hue value and the saturation value, calculating the mean square error of the hue value of the current frame image and the hue value of the previous frame image to obtain a hue mean square error value; calculating the mean square error of the saturation value of the current frame image and the saturation value of the previous frame image to obtain a saturation mean square error value; and averaging the hue mean square error value and the saturation mean square error value to obtain the HSV mean value;
when the channel value comprises the hue value and the brightness value, calculating the mean square error of the hue value of the current frame image and the hue value of the previous frame image to obtain a hue mean square error value; calculating the mean square error of the brightness value of the current frame image and the brightness value of the previous frame image to obtain a brightness mean square error value; and averaging the hue mean square error value and the brightness mean square error value to obtain the HSV mean value;
when the channel value comprises the saturation value and the brightness value, calculating the mean square error of the saturation value of the current frame image and the saturation value of the previous frame image to obtain a saturation mean square error value; calculating the mean square error of the brightness value of the current frame image and the brightness value of the previous frame image to obtain a brightness mean square error value; and averaging the saturation mean square error value and the brightness mean square error value to obtain the HSV mean value.
Further, in another example, the channel values include the hue value, the saturation value, and the brightness value.
Correspondingly, calculating an HSV mean value according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image of the current frame image in the HSV space, which specifically comprises the following steps:
calculating the mean square error of the hue value of the current frame image and the hue value of the previous frame image to obtain a hue mean square error value;
calculating the mean square error of the saturation value of the current frame image and the saturation value of the previous frame image to obtain a saturation mean square error value;
calculating the mean square error of the brightness value of the current frame image and the brightness value of the previous frame image to obtain a brightness mean square error value;
and averaging the hue mean square error value, the saturation mean square error value and the brightness mean square error value to obtain the HSV mean value.
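The three-channel HSV mean value described above can be illustrated with the following NumPy sketch, which is not part of the original disclosure and assumes the frames have already been converted to HSV (channel order H, S, V as in OpenCV).

```python
# Illustrative sketch: per-channel mean square errors between two consecutive
# HSV frames, averaged into a single HSV mean value.
import numpy as np

def hsv_mean_value(hsv_cur, hsv_prev):
    """hsv_cur, hsv_prev: H x W x 3 float arrays for the current and previous frame."""
    h_mse = np.mean((hsv_cur[..., 0] - hsv_prev[..., 0]) ** 2)  # hue mean square error
    s_mse = np.mean((hsv_cur[..., 1] - hsv_prev[..., 1]) ** 2)  # saturation mean square error
    v_mse = np.mean((hsv_cur[..., 2] - hsv_prev[..., 2]) ** 2)  # brightness mean square error
    return (h_mse + s_mse + v_mse) / 3.0                        # HSV mean value
```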
In addition, in another example, the analysis module 302 is specifically configured to perform analysis processing on a time dimension and a space dimension on each video segment by using a preset deep convolutional network model, and screen out an advertisement scene; determining the starting time of an advertisement video contained in the video to be processed according to the starting frame of the advertisement scene; and determining the end time of the advertisement video contained in the video to be processed according to the end frame of the advertisement scene.
In addition, in another example, the deep convolutional network model specifically includes a feature extraction module, a temporal attention module, a spatial attention module, and a loss function module.
Correspondingly, the operation of performing analysis processing of time dimension and space dimension on each video clip by using the preset deep convolutional network model to screen out the advertisement scene specifically comprises the following steps:
for each video clip, sampling N frames of images from the video clip, and cutting the sampled N frames of images to a preset size to obtain an input object;
inputting the input object into the feature extraction module, which analyzes and processes it to obtain a feature map Fb of four-dimensional axes with the image size reduced by a preset proportion; the preset proportion is determined according to the selected preset deep convolutional network algorithm;
inputting the feature map Fb of the four-dimensional axes into the temporal attention module, which analyzes and processes it to obtain the temporal weight Wt of the four-dimensional axes;
multiplying the feature map Fb of the four-dimensional axes and the temporal weight Wt of the four-dimensional axes element by element, and adding along the time axis of the first dimension axis to obtain the temporally fused temporal feature Ft of the four-dimensional axes;
inputting the temporal feature Ft of the four-dimensional axes into the spatial attention module, which analyzes and processes it to obtain the spatial weight Ws of the four-dimensional axes;
multiplying the temporal feature Ft of the four-dimensional axes and the spatial weight Ws of the four-dimensional axes element by element, and adding along the spatial axes of the second dimension axis and the third dimension axis respectively to obtain the spatial feature Fs of the four-dimensional axes;
and taking the spatial feature Fs of the four-dimensional axes as the video feature of the video clip.
In another example, the operation of analyzing and processing the video features of each video segment by using the loss function module to screen out the advertisement scenes specifically includes:
analyzing and processing the video characteristics of each video clip by using the loss function module, and combining the video clips of the same scene into one scene by taking the scene as a unit;
analyzing and processing each scene by using the loss function module to obtain a prediction result of the shot segments contained in each scene; wherein the prediction result is an advertisement or a feature film;
counting the number of shot segments with the prediction results of the advertisements in each scene to obtain the number of the advertisement shots;
and if the number of the advertising shots meets a preset condition, determining that the corresponding scene is an advertising scene.
In another example, the analyzing and processing the video features of the video segments by using the loss function module, and the operation of merging the video segments of the same scene into one scene by using the scene as a unit includes:
traversing each video clip, and determining the Euclidean distance between two video clips according to the video characteristics of any two video clips by using the loss function module;
and combining the video clips with Euclidean distances smaller than a preset threshold value into a scene.
It should be understood that the present embodiment is a device embodiment corresponding to the first or second embodiment, and the present embodiment can be implemented in cooperation with the first or second embodiment. The related technical details mentioned in the first or second embodiment are still valid in this embodiment, and are not described herein again to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the first or second embodiment.
It should be noted that, all the modules involved in this embodiment are logic modules, and in practical application, one logic unit may be one physical unit, may also be a part of one physical unit, and may also be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, a unit that is not so closely related to solving the technical problem proposed by the present application is not introduced in the present embodiment, but this does not indicate that there is no other unit in the present embodiment.
A fourth embodiment of the present application is directed to a video advertisement filtering device, as shown in fig. 4, comprising at least one processor 401; and a memory 402 communicatively coupled to the at least one processor 401; wherein the memory 402 stores instructions executable by the at least one processor 401, the instructions being executable by the at least one processor 401 to enable the at least one processor 401 to perform the video advertisement filtering method described in the first or second embodiments above.
Where the memory 402 and the processor 401 are coupled by a bus, which may include any number of interconnected buses and bridges that couple one or more of the various circuits of the processor 401 and the memory 402 together. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 401 may be transmitted over a wireless medium via an antenna, which may receive the data and transmit the data to the processor 401.
The processor 401 is responsible for managing the bus and general processing and may provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 402 may be used to store data used by processor 401 in performing operations.
A fifth embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the video advertisement filtering method embodiments described above.
That is, as can be understood by those skilled in the art, all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the present application, and that various changes in form and details may be made therein without departing from the spirit and scope of the present application in practice.

Claims (15)

1. A method for filtering video advertisements, comprising:
segmenting a video to be processed into a plurality of video segments;
analyzing and processing time dimension and space dimension of each video clip by using a preset deep convolutional network model to obtain the starting time and the ending time of the advertisement video contained in the video to be processed;
and deleting the advertisement video from the video to be processed according to the starting time and the ending time.
2. The method of claim 1, wherein the segmenting the video to be processed into a plurality of video segments comprises:
taking a shot as a segmentation unit, and dividing the video to be processed into a plurality of continuous shot segments according to the time dimension;
and taking each shot segment as a video segment.
3. The method for filtering video advertisements according to claim 2, wherein the dividing the video to be processed into a plurality of consecutive shot segments according to the time dimension by taking the shot as the dividing unit comprises:
traversing the video to be processed, and executing the following operations on each traversed frame image:
mapping the color space of the current frame image to an HSV space to obtain a channel value of the current frame image in the HSV space; judging whether the current frame image and the previous frame image belong to the same shot according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space; if so, storing the current frame image and the previous frame image in the same image set; otherwise, storing the current frame image in a new image set;
after traversing the video to be processed, taking a shot as a segmentation unit for each image set, and merging the images stored in the image sets according to the time dimension to obtain shot fragments.
4. The method of claim 3, wherein the mapping the color space of the current frame image to an HSV space to obtain the channel value of the current frame image in the HSV space comprises:
mapping the color space of the current frame image to a hue channel of the HSV space to obtain a hue value;
mapping the color space of the current frame image to a saturation channel of the HSV space to obtain a saturation value;
mapping the color space of the current frame image to a brightness channel of the HSV space to obtain a brightness value;
and taking any one or more of the hue value, the saturation value and the brightness value as a channel value of the current frame image in the HSV space.
5. The method according to claim 4, wherein said determining whether the current frame image and the previous frame image belong to the same shot according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space comprises:
calculating an HSV average value according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image of the current frame image in the HSV space;
judging whether the HSV average value is smaller than a preset threshold value or not;
if so, determining that the current frame image and the previous frame image belong to the same shot;
otherwise, determining that the current frame image and the previous frame image do not belong to the same shot.
6. The video advertisement filtering method of claim 5, wherein the channel values comprise the hue value; or the saturation value; or the brightness value;
calculating an HSV average value according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space, wherein the calculation comprises the following steps:
when the channel value comprises the hue value, calculating the mean square error of the hue value of the current frame image and the hue value of the previous frame image to obtain a hue mean square error value, and taking the hue mean square error value as the HSV mean value;
when the channel value comprises a saturation value, calculating the mean square error of the saturation value of the current frame image and the saturation value of the previous frame image to obtain a saturation mean square error value, and taking the saturation mean square error value as an HSV mean value;
and when the channel value comprises a brightness value, calculating the mean square error of the brightness value of the current frame image and the brightness value of the previous frame image to obtain a brightness mean square error value, and taking the brightness mean square error value as an HSV mean value.
7. The video advertisement filtering method of claim 5, wherein the channel values comprise the hue value and the saturation value; or the hue value and the brightness value; or the saturation value and the brightness value;
calculating an HSV average value according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space, wherein the calculation comprises the following steps:
when the channel value comprises the hue value and the saturation value, calculating the mean square error of the hue value of the current frame image and the hue value of the previous frame image to obtain a hue mean square error value; calculating the mean square error of the saturation value of the current frame image and the saturation value of the previous frame image to obtain a saturation mean square error value; and averaging the hue mean square error value and the saturation mean square error value to obtain the HSV mean value;
when the channel value comprises the hue value and the brightness value, calculating the mean square error of the hue value of the current frame image and the hue value of the previous frame image to obtain a hue mean square error value; calculating the mean square error of the brightness value of the current frame image and the brightness value of the previous frame image to obtain a brightness mean square error value; and averaging the hue mean square error value and the brightness mean square error value to obtain the HSV mean value;
when the channel value comprises the saturation value and the brightness value, calculating the mean square error of the saturation value of the current frame image and the saturation value of the previous frame image to obtain a saturation mean square error value; calculating the mean square error of the brightness value of the current frame image and the brightness value of the previous frame image to obtain a brightness mean square error value; and averaging the saturation mean square error value and the brightness mean square error value to obtain the HSV mean value.
8. The video advertisement filtering method of claim 5, wherein the channel values comprise the hue value, the saturation value, and the brightness value;
calculating an HSV average value according to the channel value of the current frame image in the HSV space and the channel value of the previous frame image in the HSV space, wherein the calculation comprises the following steps:
calculating the mean square error of the hue value of the current frame image and the hue value of the previous frame image to obtain a hue mean square error value;
calculating the mean square error of the saturation value of the current frame image and the saturation value of the previous frame image to obtain a saturation mean square error value;
calculating the mean square error of the brightness value of the current frame image and the brightness value of the previous frame image to obtain a brightness mean square error value;
and averaging the hue mean square error value, the saturation mean square error value and the brightness mean square error value to obtain the HSV mean value.
9. The method according to any one of claims 1 to 8, wherein the analyzing and processing of time dimension and space dimension on each video segment by using a preset deep convolutional network model to obtain a start time and an end time of an advertisement video included in the video to be processed comprises:
analyzing and processing time dimension and space dimension of each video clip by using a preset deep convolutional network model, and screening out advertisement scenes;
determining the starting time of an advertisement video contained in the video to be processed according to the starting frame of the advertisement scene;
and determining the end time of the advertisement video contained in the video to be processed according to the end frame of the advertisement scene.
10. The video advertisement filtering method of claim 9, wherein the deep convolutional network model comprises a feature extraction module, a temporal attention module, a spatial attention module, and a loss function module;
the method for screening the advertisement scenes by analyzing and processing the time dimension and the space dimension of each video clip by using the preset deep convolutional network model comprises the following steps:
acquiring the video features of each video clip by using the feature extraction module, the time attention module and the space attention module;
and analyzing and processing the video characteristics of each video clip by using the loss function module, and screening out the advertisement scenes.
11. The method of claim 10, wherein said obtaining video features of each video segment using said feature extraction module, said temporal attention module, and said spatial attention module comprises:
for each video clip, sampling N frames of images from the video clip, and cutting the sampled N frames of images to a preset size to obtain an input object;
inputting the input object into the feature extraction module, which analyzes and processes it to obtain a feature map Fb of four-dimensional axes with the image size reduced by a preset proportion; the preset proportion is determined according to the selected preset deep convolutional network algorithm;
inputting the feature map Fb into the temporal attention module, which analyzes and processes it to obtain the temporal weight Wt of the four-dimensional axes;
multiplying the feature map Fb and the temporal weight Wt element by element, and adding along the time axis of the first dimension to obtain the temporally fused temporal feature Ft of the four-dimensional axes;
inputting the temporal feature Ft into the spatial attention module, which analyzes and processes it to obtain the spatial weight Ws of the four-dimensional axes;
multiplying the temporal feature Ft and the spatial weight Ws element by element, and adding along the second and third spatial axes respectively to obtain the spatial feature Fs of the four-dimensional axes;
and taking the spatial feature Fs as the video feature of the video clip.
12. The method according to claim 10, wherein the analyzing and processing the video characteristics of each video segment by using the loss function module to filter out the advertisement scenes comprises:
analyzing and processing the video characteristics of each video clip by using the loss function module, and combining the video clips of the same scene into one scene by taking the scene as a unit;
analyzing and processing each scene by using the loss function module to obtain a prediction result of the shot segments contained in each scene; wherein the prediction result is an advertisement or a feature film;
counting the number of shot segments with the prediction results of the advertisements in each scene to obtain the number of the advertisement shots;
and if the number of the advertising shots meets a preset condition, determining that the corresponding scene is an advertising scene.
13. The method of claim 12, wherein the analyzing the video characteristics of the video segments by the loss function module to merge the video segments of the same scene into one scene in units of scenes comprises:
traversing each video clip, and determining the Euclidean distance between two video clips according to the video characteristics of any two video clips by using the loss function module;
and combining the video clips with Euclidean distances smaller than a preset threshold value into a scene.
14. A video advertisement filtering device, comprising:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video advertisement filtering method of any of claims 1-13.
15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video advertisement filtering method of any of claims 1 to 13.
CN202011077376.4A 2020-10-10 2020-10-10 Video advertisement filtering method, device and storage medium Pending CN112383824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011077376.4A CN112383824A (en) 2020-10-10 2020-10-10 Video advertisement filtering method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011077376.4A CN112383824A (en) 2020-10-10 2020-10-10 Video advertisement filtering method, device and storage medium

Publications (1)

Publication Number Publication Date
CN112383824A true CN112383824A (en) 2021-02-19

Family

ID=74581180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011077376.4A Pending CN112383824A (en) 2020-10-10 2020-10-10 Video advertisement filtering method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112383824A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108235135A (en) * 2016-12-21 2018-06-29 天脉聚源(北京)科技有限公司 A kind of method and system of automatic fitration advertisement editor video
CN108540833A (en) * 2018-04-16 2018-09-14 北京交通大学 A kind of television advertising recognition methods based on camera lens
CN110348381A (en) * 2019-07-11 2019-10-18 电子科技大学 Video behavior identification method based on deep learning
CN111401177A (en) * 2020-03-09 2020-07-10 山东大学 End-to-end behavior recognition method and system based on adaptive space-time attention mechanism

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023279951A1 (en) * 2021-07-05 2023-01-12 北京有竹居网络技术有限公司 Screen recording video processing method and apparatus, and readable medium and electronic device
CN113627477A (en) * 2021-07-07 2021-11-09 武汉魅瞳科技有限公司 Vehicle multi-attribute identification method and system
CN113627363A (en) * 2021-08-13 2021-11-09 百度在线网络技术(北京)有限公司 Video file processing method, device, equipment and storage medium
CN113627363B (en) * 2021-08-13 2023-08-15 百度在线网络技术(北京)有限公司 Video file processing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210219