CN115035509A - Video detection method and device, electronic equipment and storage medium - Google Patents

Video detection method and device, electronic equipment and storage medium

Info

Publication number
CN115035509A
Authority
CN
China
Prior art keywords
video
segment
determining
audio
information
Prior art date
Legal status
Pending
Application number
CN202210753941.7A
Other languages
Chinese (zh)
Inventor
毕泊 (Bi Bo)
Current Assignee
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd filed Critical Beijing IQIYI Science and Technology Co Ltd
Priority to CN202210753941.7A
Publication of CN115035509A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiment of the invention provides a video detection method, a video detection apparatus, an electronic device and a storage medium. The video detection method comprises the following steps: acquiring a video file and determining a detection target for the video file; acquiring a plurality of consecutive video segments from the video file and determining the audio-video feature information of each of the video segments; inputting each piece of audio-video feature information into a pre-trained classification model to obtain a plurality of corresponding output results; determining candidate video segments from the plurality of video segments according to the plurality of output results; performing character recognition on the image frames of the candidate video segments to obtain character recognition results; and determining, according to the character recognition results, the target image frame in which the detection target is located. In the embodiment of the invention, the candidate video segment in which the detection target is located is determined by combining picture information and audio information, and character recognition is performed within that segment range, so that the image frame in which the detection target is located is accurately positioned.

Description

Video detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video detection method, a video detection apparatus, an electronic device, and a computer-readable storage medium.
Background
In video streaming media services, specific time points in a video are marked in order to improve the user's viewing experience. Marking a specific time point helps the user quickly locate the video content corresponding to that point, or allows a corresponding service function to be provided at that point in a targeted manner. For example, the end time point of the title (opening credits) may be marked: on one hand, a skip function can be provided to help the user quickly enter the feature content; on the other hand, additional content can be inserted at that point to enrich the user experience. Likewise, the start time point of the ending (closing credits) can be marked, so that a function recommending similar titles can be provided once the ending begins, extending the time the user stays on the platform. The streaming service can also mark the start time point of the post-credits Easter-egg segment in a movie and provide a skip function to help the user quickly reach the Easter-egg content.
The time points marked in a video are determined by how the video content producer edited and produced the material, and their positions differ from video to video. Traditionally, time points are either marked manually, which is inefficient, or marked by matching against a uniform image template, which cannot handle the facts that the points may be offset in different videos and that some points are placed flexibly, so the marking lacks flexibility.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a video detection method and a corresponding video detection apparatus, an electronic device, and a computer-readable storage medium that overcome or at least partially solve the above problems.
The embodiment of the invention discloses a video detection method, which comprises the following steps:
acquiring a video file and determining a detection target for the video file; the detection target comprises at least one of title-end marker information, ending-start marker information and ending-end marker information;
acquiring a plurality of continuous video clips from the video file, and respectively determining a plurality of audio and video characteristic information of the video clips;
respectively inputting the audio and video characteristic information into a pre-trained classification model to obtain a plurality of corresponding output results;
determining candidate video segments from the plurality of video segments according to the plurality of output results;
performing character recognition on the image frames of the candidate video clips to obtain character recognition results;
and determining a target image frame where the detection target is located according to the character recognition result.
The embodiment of the invention also discloses a video detection device, which comprises:
the first acquisition and determination module is used for acquiring a video file and determining a detection target for the video file; the detection target comprises at least one of title-end marker information, ending-start marker information and ending-end marker information;
the second obtaining and determining module is used for obtaining a plurality of continuous video clips from the video file and respectively determining a plurality of audio and video characteristic information of the video clips;
the input and output module is used for respectively inputting the audio and video characteristic information into a pre-trained classification model to obtain a plurality of corresponding output results;
a first determining module, configured to determine candidate video segments from the plurality of video segments according to the plurality of output results;
the character recognition module is used for carrying out character recognition on the image frames of the candidate video clips to obtain character recognition results;
and the second determining module is used for determining the target image frame where the detection target is located according to the character recognition result.
The embodiment of the invention also discloses an electronic device, which comprises: a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program when executed by the processor implementing the steps of a video detection method as described above.
The embodiment of the invention also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the steps of the video detection method are realized.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, candidate video segments that may contain the detection target can be determined from among the consecutive video segments of a video file through their audio-video feature information, and character recognition is performed on the image frames of the candidate video segments to determine the target image frame in which the detection target is located. With this method, the candidate video segment containing the detection target is determined by a deep learning model that combines the picture information and the audio information of the video file, and character recognition is performed within that segment range, so that the exact image frame containing the detection target is located. This improves the recognition accuracy of the title-end time point, the ending-start time point and the Easter-egg start time point of a video; the detection method requires neither image-template matching nor manual operation, the point recognition is flexible, and the recognition efficiency is high.
Drawings
FIG. 1 is a flow chart of the steps of a video detection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of steps of another video detection method according to an embodiment of the present invention;
FIGS. 2A-2G are flow diagrams of sub-steps of another video detection method according to embodiments of the present invention;
fig. 3 is a schematic diagram of a processing procedure of audio/video feature information according to an embodiment of the present invention;
FIG. 4 is a flow chart of a video detection method according to an embodiment of the present invention;
FIG. 5 is a flow chart of another video detection method according to an embodiment of the present invention;
FIG. 6 is a flow chart of another video detection method according to an embodiment of the present invention;
fig. 7 is a block diagram of a video detection apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
In order to improve the user's viewing experience in a video streaming media service, the title-end time point, the ending-start time point, the Easter-egg start time point and the like are marked.
Traditional time-point marking either relies on manual annotation, which cannot determine the points quickly and is unsuitable for massive numbers of videos, or relies on matching against a uniform image template, which cannot handle the facts that the points may be offset in different videos and that some points are placed flexibly.
One of the core ideas of the embodiment of the invention is that candidate video segments that may contain the detection target can be determined from a plurality of consecutive video segments of a video file through their audio-video feature information, and character recognition is performed on the image frames of the candidate video segments to determine the target image frame in which the detection target is located. With this method, the picture information and the audio information of the video file are combined, the candidate video segment containing the detection target is determined by a deep learning model, and character recognition is performed within that segment range, so that the exact image frame containing the detection target is located. This improves the recognition accuracy of the title-end time point, the ending-start time point and the Easter-egg start time point of a video; the detection method requires neither image-template matching nor manual operation, the point recognition is flexible, and the recognition efficiency is high.
Referring to fig. 1, a flowchart illustrating steps of a video detection method according to an embodiment of the present invention is shown, which may specifically include the following steps:
step 101, a video file is obtained, and a detection target for the video file is determined.
The detection target comprises at least one of title-end marker information, ending-start marker information and ending-end marker information.
In the embodiment of the present invention, the video file may be a multimedia video file such as a TV-series episode, a variety-show episode or a movie, and the video file may include one or more of title (opening credits) content, feature content, ending (closing credits) content and Easter-egg content.
The playing order of the title content and the feature content can be set flexibly. In one example, the title content is played first and then the feature content; in another example, part of the feature content may be played first, then the title content, and then the rest of the feature content. The embodiment of the present invention does not limit the playing order of the title content and the feature content.
The title-end marker information refers to marker information indicating that the title content has ended; the ending-start marker information refers to marker information indicating that the ending content has started; the ending-end marker information refers to marker information indicating that the ending content has ended. For a TV-series video file, the title-end marker information may be episode-number or issue-number information. For a movie video file, the ending-start marker information or the ending-end marker information may be text-box information.
By detecting the title-end marker information in the video file, the title end time point (i.e., the start of the feature) can be determined; by detecting the ending-start marker information, the ending start time point can be determined; and by detecting the ending-end marker information, the Easter-egg start time point can be determined. The user can then jump quickly to the corresponding time point, or the corresponding service function can be provided at that time point in a targeted manner.
Step 102, obtaining a plurality of continuous video clips from the video file, and respectively determining a plurality of audio/video characteristic information of the plurality of video clips.
In the embodiment of the invention, the video file can be divided into a plurality of consecutive video segments, the audio-video feature information of each video segment is determined, and the feature content (including Easter-egg content), the title content and the ending content are distinguished by combining the video features and the audio features.
And 103, respectively inputting the audio and video characteristic information into a pre-trained classification model to obtain a plurality of corresponding output results.
In the embodiment of the invention, a classification model is constructed in advance, and the classification model is used for determining, from the input audio-video feature information, whether the corresponding video segment is a title segment, a feature segment or an ending segment. In a movie video file, a feature-type segment that is played after an ending segment has finished playing is usually treated as an Easter-egg segment.
And 104, determining candidate video clips from the plurality of video clips according to the plurality of output results.
And respectively inputting the audio and video characteristic information into a pre-trained classification model, so that a plurality of corresponding output results can be obtained, and the output results can be analyzed to determine candidate video segments which are possible to contain the detection target in the video segments.
And 105, performing character recognition on the image frames of the candidate video clips to obtain character recognition results.
In the embodiment of the present invention, after determining the candidate video segment that may include the detection target, the image frames of the candidate video segment may be subjected to text recognition to obtain a corresponding text recognition result.
And 106, determining a target image frame where the detection target is located according to the character recognition result.
The target image frame in which the detection target is located is determined from the character recognition result, so that the title-end time point, the ending-start time point or the Easter-egg start time point in the video file can be located, with frame-level precision.
In summary, in the embodiment of the present invention, candidate video segments that may contain the detection target are determined from a plurality of consecutive video segments of a video file through their audio-video feature information, and character recognition is performed on the image frames of the candidate video segments to determine the target image frame in which the detection target is located. With this method, the candidate video segment containing the detection target is determined by a deep learning model that combines the picture information and the audio information of the video file, and character recognition is performed within that segment range, so that the exact image frame containing the detection target is located. This improves the recognition accuracy of the title-end time point, the ending-start time point and the Easter-egg start time point of a video; the detection method requires neither image-template matching nor manual operation, the point recognition is flexible, and the recognition efficiency is high.
Referring to fig. 2, a flowchart illustrating steps of another video detection method according to an embodiment of the present invention is shown, which may specifically include the following steps:
step 201, acquiring a video file, and determining a detection target for the video file.
The detection target comprises at least one of title-end marker information, ending-start marker information and ending-end marker information.
In the embodiment of the present invention, a video file to be detected and analyzed may be obtained, and a detection target for the video file may be determined, where the detection target may be one or more of the title-end marker information, the ending-start marker information and the ending-end marker information.
In one example, the detection target may be determined according to the video type of the video file. For a TV-series video file, the title-end marker information and the ending-start marker information may be used as the detection targets; for a movie video file, the title-end marker information, the ending-start marker information and the ending-end marker information may all be used as detection targets.
Step 202, obtaining a plurality of continuous video clips from the video file, and respectively determining a plurality of audio/video characteristic information of the plurality of video clips.
In the embodiment of the present invention, the detection target needs to be detected, and a plurality of consecutive video segments need to be acquired from the video file, where the consecutive video segments refer to that the video segments are consecutive in the playing time of the video file, and a next video segment is played immediately after one video segment is played. After acquiring the plurality of continuous video segments, a plurality of audio-video characteristic information of the plurality of video segments may be respectively determined.
In an optional embodiment of the present invention, the step of obtaining a plurality of consecutive video segments from the video file in step 202 may specifically include the following sub-steps:
and a substep S11 of determining a truncation time period according to the detection target.
And a substep S12, intercepting the video file according to the intercepting time period to obtain an intercepted video clip.
And a substep S13, performing average segmentation on the intercepted video segments to obtain a plurality of continuous video segments.
Different intercepting time periods can be determined according to different detection targets, the intercepted video clips are obtained from the complete video file according to the intercepting time periods, and then the intercepted video clips are divided into a plurality of continuous video clips.
For example, the first video segment in the time range t0–t1 may be intercepted as the intercepted video segment, and this segment is divided on average into T video segments, each video segment containing 64 frames of images and 12.8 s of audio. By intercepting only part of the video file for analysis, the analysis efficiency can be improved.
In one example, if the ending-start marker information or the ending-end marker information in a movie video file needs to be detected, the last 30 minutes of the movie video file can be intercepted as the intercepted video segment for analysis.
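As an illustrative sketch of this slicing step (not part of the claimed embodiment; the function name, the 16 kHz sample rate and the data layout are assumptions), the intercepted clip can be cut into consecutive segments of 64 frames and 12.8 s of audio; note that 64 frames per 12.8 s would correspond to sampling the video at 5 frames per second:

```python
def split_into_segments(frames, audio, sample_rate=16000,
                        frames_per_segment=64, audio_seconds=12.8):
    """Cut a decoded clip into consecutive segments of 64 frames / 12.8 s audio.

    frames: list of decoded image frames; audio: 1-D waveform samples.
    The wrapper name and the 16 kHz sample rate are illustrative assumptions.
    """
    samples_per_segment = int(sample_rate * audio_seconds)
    num_segments = min(len(frames) // frames_per_segment,
                       len(audio) // samples_per_segment)
    return [{
        "frames": frames[t * frames_per_segment:(t + 1) * frames_per_segment],
        "audio": audio[t * samples_per_segment:(t + 1) * samples_per_segment],
    } for t in range(num_segments)]
```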
After the plurality of consecutive video segments are obtained, the corresponding audio-video feature information can be determined for each video segment, and the feature content (including Easter-egg content), the title content and the ending content are distinguished by combining the video features and the audio features.
In an optional embodiment of the present invention, the step of determining the plurality of audio/video feature information of the plurality of video segments in step 202 may specifically include the following sub-steps:
and a substep S21, for each video segment, extracting corresponding audio characteristic information by adopting a pre-trained super-resolution test sequence VGG model, extracting corresponding video characteristic information by adopting a pre-trained double-flow expansion three-dimensional convolution network I3D model, and merging the audio characteristic information and the video characteristic information to obtain the audio and video characteristic information corresponding to the video segment.
Inputting the audio information in the video clip into an audio feature extraction model for feature extraction; the image information in the video clip is input into a video feature extraction model for feature extraction.
The audio feature extraction model may be a VGG (Visual Geometry Group) model trained on a public data set; the VGG model uses repeatedly stacked small 3 × 3 convolution kernels and increases the depth of the network. For example, the 12.8 s of audio of a video segment may be input into the VGG model, which outputs 8 × 128-dimensional audio features.
The video feature extraction model can be an Inflated 3D ConvNet (two-stream I3D) model trained on a public data set; the I3D model is built on a 2D convolutional classification network and inflates its convolution and pooling kernels into 3D. For example, the 64 frames of images of a video segment may be input into the I3D model, which outputs 6 × 1024-dimensional video features.
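The per-segment feature extraction described above can be sketched as follows; the two extractor functions merely stand in for the pre-trained VGG-style audio network and I3D video network (the output shapes follow the dimensions mentioned in the text, everything else is an assumption):

```python
import numpy as np

def audio_model(waveform):
    # Stand-in for the pre-trained VGG-style audio network: 8 x 128-dim features.
    return np.random.randn(8, 128)

def video_model(frames):
    # Stand-in for the pre-trained I3D video network: 6 x 1024-dim features.
    return np.random.randn(6, 1024)

def extract_av_features(segment):
    """Extract the audio and video features of one segment for later fusion."""
    return audio_model(segment["audio"]), video_model(segment["frames"])
```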
For a video clip, after the audio characteristic information and the video characteristic information corresponding to the video clip are obtained, the audio characteristic information and the video characteristic information can be combined to obtain the audio and video characteristic information corresponding to the video clip.
Fig. 3 is a schematic diagram of the processing of the audio-video feature information according to an embodiment of the present invention. A video file to be detected is acquired, and consecutive video frames are extracted from it to obtain video frame images, which may comprise T video frame image sequences (equivalent to T video segments): video frame image sequence 1, video frame image sequence 2, ..., video frame image sequence T. The audio corresponding to the video frames is extracted from the video file and may be divided into T audio file segments: audio file segment 1, audio file segment 2, ..., audio file segment T. Video features are extracted from the T video frame image sequences to obtain T video features: video feature 1, video feature 2, ..., video feature T. Audio features are extracted from the T audio file segments to obtain T audio features: audio feature 1, audio feature 2, ..., audio feature T. The corresponding audio features and video features are then merged to obtain T audio-video features: audio-video feature 1, audio-video feature 2, ..., audio-video feature T.
In an optional embodiment of the present invention, the step of combining the audio characteristic information and the video characteristic information to obtain the audio/video characteristic information corresponding to the video segment in sub-step S21 may specifically include the following sub-steps:
and a substep S211 of performing attention calculation on the audio feature information and the video feature information respectively based on a shift attention mechanism to obtain corresponding attention audio feature information and attention video feature information.
And a substep S212, splicing the attention audio characteristic information and the attention video characteristic information to obtain corresponding audio and video characteristic information.
The audio feature information and the video feature information are merged using a shift attention mechanism. Experiments show that simply splicing the audio features and the video features cannot train a satisfactory classification model: because video pictures and audio are different modalities, the meanings and value ranges of their feature vectors differ. To solve this problem, a shift attention mechanism is adopted; an attention unit is added for each modality and a shift operation is performed, which improves the feature expression of the audio and video. The attention calculation formula is:
V = (α · aX + β) / (√N · ||α · aX + β||₂)
wherein X is an input audio feature or video feature; α and β are both learnable parameters; a is an attention weighting vector; N is the number of attention units; V is the output attention audio feature or attention video feature.
As can be seen from the above formula, the input audio features or video features are first linearly transformed with the learnable parameters and then L2 normalisation is applied (i.e., division by ||α·aX + β||₂), giving the attention audio feature information and attention video feature information transformed by the shift attention mechanism, which are then spliced to obtain the audio-video feature information. The number of attention units is set through experiments; for both the audio features and the video features it can be set to 8.
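A minimal NumPy sketch of this shift attention fusion, assuming the formula above and N = 8 attention units per modality (the randomly initialised attention projections, α and β stand in for learned parameters):

```python
import numpy as np

def shift_attention(X, N=8, rng=np.random.default_rng(0)):
    """Shift attention over a (T, D) feature matrix X: per attention unit,
    compute attention weights a from X, apply the shift (alpha * aX + beta),
    L2-normalise, and concatenate the N unit outputs."""
    T, D = X.shape
    outputs = []
    for _ in range(N):
        logits = X @ rng.standard_normal(D)             # attention logits from X
        a = np.exp(logits - logits.max())
        a /= a.sum()                                     # attention weight vector a
        alpha, beta = rng.standard_normal(), rng.standard_normal()
        v = alpha * (a @ X) + beta                       # shift operation
        outputs.append(v / (np.sqrt(N) * np.linalg.norm(v) + 1e-8))  # normalise
    return np.concatenate(outputs)                       # shape (N * D,)

def fuse_av(audio_feat, video_feat, N=8):
    """Apply shift attention to each modality and splice the results."""
    return np.concatenate([shift_attention(audio_feat, N),
                           shift_attention(video_feat, N)])
```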
And 203, respectively inputting the audio and video characteristic information into the classification model to obtain a plurality of corresponding confidence results.
The confidence result is used for representing the confidence that the corresponding video segment belongs to a title segment, a feature segment or an ending segment.
In an alternative embodiment of the present invention, the classification model may be trained by:
acquiring a sample video segment set for training, the sample video segment set comprising a plurality of consecutive sample video segments, each sample video segment being labelled with a segment type of title segment, feature segment or ending segment; determining the sample audio-video feature information of each of the sample video segments; and performing model training with the sample audio-video feature information to obtain the classification model for identifying title, feature and ending segments.
The sample video segment set used for model training contains a plurality of consecutive sample video segments, each labelled as a title segment, a feature segment or an ending segment. The sample audio-video feature information of these sample video segments is determined and then input into the model training system, yielding a classification model capable of recognising title, feature and ending segments. In one example, the classification model may be a fully connected (FC) classification model.
In the embodiment of the invention, the audio-video feature information of a video segment is input into the classification model, and the output confidence result can be used to determine whether the video segment belongs to a title segment, a feature segment or an ending segment.
Illustratively, the confidence result output by the classification model may be a probability score. In one example, the audio-video feature information of a certain video segment is input into the trained classification model, and a probability score of the video segment belonging to a title segment / feature segment / ending segment is obtained.
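As an illustrative sketch (assumed architecture, random stand-in weights rather than trained parameters), a fully connected classifier over the fused feature vector could look like this, returning the probability scores for the three segment types:

```python
import numpy as np

class FCClassifier:
    """Minimal fully connected classifier sketch: one hidden layer plus a
    3-way softmax over {title, feature, ending}."""

    def __init__(self, in_dim, hidden=256, classes=3, rng=np.random.default_rng(0)):
        self.w1 = rng.standard_normal((in_dim, hidden)) * 0.01
        self.b1 = np.zeros(hidden)
        self.w2 = rng.standard_normal((hidden, classes)) * 0.01
        self.b2 = np.zeros(classes)

    def predict_proba(self, av_feature):
        h = np.maximum(av_feature @ self.w1 + self.b1, 0.0)   # ReLU hidden layer
        logits = h @ self.w2 + self.b2
        e = np.exp(logits - logits.max())
        return e / e.sum()   # [P(title), P(feature), P(ending)]
```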
Step 204, determining candidate video segments from the plurality of video segments according to the plurality of output results.
In the embodiment of the present invention, candidate video segments that may include the detection target may be determined from the plurality of video segments according to the confidence result output by the classification model.
In an alternative embodiment of the present invention, step 204 may specifically include the following sub-steps:
and a substep S31, comparing the confidence level results with preset confidence level thresholds respectively, and obtaining a plurality of corresponding comparison results.
And a sub-step S32 of determining the candidate video segment from the plurality of video segments according to the plurality of comparison results.
In one embodiment, the title segment has a corresponding first confidence threshold, and if the confidence result is greater than the first confidence threshold, the corresponding video segment may be determined to be a title segment; the feature segment has a corresponding second confidence threshold, and if the confidence result is greater than the second confidence threshold, the corresponding video segment may be determined to be a feature segment; the ending segment has a corresponding third confidence threshold, and if the confidence result is greater than the third confidence threshold, the corresponding video segment may be determined to be an ending segment.
In another embodiment, only the first confidence threshold corresponding to the title segment and the third confidence threshold corresponding to the ending segment may be set. If the confidence result is greater than the first confidence threshold, the corresponding video segment may be determined to be a title segment, otherwise a feature segment; if the confidence result is greater than the third confidence threshold, the corresponding video segment may be determined to be an ending segment, otherwise a feature segment.
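The thresholding in these two embodiments can be sketched as follows (the 0.8 default values are the illustrative thresholds used in the examples below; the function and label names are assumptions):

```python
def classify_segments(confidences, target, first_threshold=0.8, third_threshold=0.8):
    """Map per-segment confidence results to segment types by thresholding.

    For the title-end target, a confidence above first_threshold marks a title
    segment, otherwise a feature segment; for the ending-start and ending-end
    targets, a confidence above third_threshold marks an ending segment,
    otherwise a feature segment."""
    types = []
    for c in confidences:
        if target == "title_end":
            types.append("title" if c > first_threshold else "feature")
        else:
            types.append("ending" if c > third_threshold else "feature")
    return types
```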
In the embodiment of the present invention, the segment types of the corresponding video segments are determined according to the comparison results, and the candidate video segments that may include the detection target may be determined from the video segments according to the segment types of the video segments and the playing orders of the video segments.
In an optional embodiment of the present invention, the candidate video segments include a first candidate video segment used when the detection target is the title-end marker information, and the sub-step S32 may specifically include the following sub-steps:
In the sub-step S321, if the detection target is the title-end marker information, the plurality of video segments are respectively classified into title segments and feature segments according to the plurality of comparison results.
In the sub-step S322, if there are a title segment and a feature segment adjacent in playing order among the plurality of video segments, and the feature segment is played after the title segment has been played, the title segment and the feature segment are determined as the first candidate video segment.
That is, if the detection target is the title-end marker information, the plurality of video segments may be classified into title segments and feature segments according to the plurality of comparison results. If one title segment and one feature segment adjacent in playing order exist among the plurality of video segments, and the feature segment is played after the title segment has been played, the title segment and the feature segment are determined as the first candidate video segment, which may contain the title-end marker information.
For example, the first confidence threshold may be set to 0.8; when the confidence result of a certain video segment is greater than the first confidence threshold 0.8, the video segment may be considered a title segment, otherwise a feature segment. Assuming there are 10 consecutive video segments {T1, ..., T10} whose confidence results are [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.7, 0.4, 0.3, 0.1], comparing the confidence results with the first confidence threshold shows that T1–T6 all belong to title segments and T7–T10 all belong to feature segments, i.e., the title is T1–T6 and the feature is T7–T10. T6 and T7 are a title segment and a feature segment adjacent in playing order, and T7 is played after T6 has been played, so T6 and T7 can be selected as the first candidate video segment.
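A sketch of selecting the first candidate pair from the per-segment types (continuing the hypothetical helpers above):

```python
def find_first_candidate(segment_types):
    """Return the indices of a title segment immediately followed by a feature
    segment, i.e. the first candidate pair for the title-end marker."""
    for i in range(len(segment_types) - 1):
        if segment_types[i] == "title" and segment_types[i + 1] == "feature":
            return i, i + 1
    return None

# With the confidences of the example above, classify_segments(..., "title_end")
# yields six "title" types followed by four "feature" types, so the pair (5, 6)
# is returned, corresponding to segments T6 and T7.
```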
In an optional embodiment of the present invention, the candidate video segments include a second candidate video segment used when the detection target is the ending-start marker information, and the sub-step S32 may specifically include the following sub-steps:
In the sub-step S323, if the detection target is the ending-start marker information, the plurality of video segments are respectively classified into feature segments and ending segments according to the plurality of comparison results.
In the sub-step S324, if there are a feature segment and an ending segment adjacent in playing order among the plurality of video segments, and the ending segment is played after the feature segment has been played, the feature segment and the ending segment are determined as the second candidate video segment.
That is, if the detection target is the ending-start marker information, the plurality of video segments can be classified into feature segments and ending segments according to the plurality of comparison results. If one feature segment and one ending segment adjacent in playing order exist among the plurality of video segments, and the ending segment is played after the feature segment has been played, the feature segment and the ending segment are determined as the second candidate video segment, which may contain the ending-start marker information.
For example, the third confidence threshold may be set to 0.8; when the confidence result of a certain video segment is greater than the third confidence threshold 0.8, the video segment may be considered an ending segment, otherwise a feature segment. Assuming there are 10 consecutive video segments {t1, ..., t10} whose confidence results are [0.1, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.8, 0.8, 0.9], comparing the confidence results with the third confidence threshold shows that t2–t4 and t8–t10 all belong to ending segments while t1 and t5–t7 belong to feature segments; that is, the first feature is t1, the first ending is t2–t4, the second feature is t5–t7, and the second ending is t8–t10. t1 and t2 are a feature segment and an ending segment adjacent in playing order, t7 and t8 are likewise a feature segment and an ending segment adjacent in playing order, and t2 is played after t1 while t8 is played after t7, so t1 and t2, and t7 and t8, can be selected as the second candidate video segments.
In an optional embodiment of the present invention, the candidate video segments include a third candidate video segment used when the detection target is the ending-end marker information, and the sub-step S32 specifically includes the following sub-steps:
In the sub-step S325, if the detection target is the ending-end marker information, the plurality of video segments are respectively classified into feature segments and ending segments according to the plurality of comparison results.
In the sub-step S326, if there are an ending segment and a feature segment adjacent in playing order among the plurality of video segments, and the feature segment is played after the ending segment has been played, the ending segment and the feature segment are determined as the third candidate video segment.
The classification model can only identify title segments, feature segments and ending segments; it cannot identify an Easter-egg segment directly. If, among the plurality of video segments intercepted for analysis, a feature segment is found to continue playing after an ending segment has finished playing, that feature segment can be regarded as an Easter-egg segment.
In the embodiment of the present invention, if the detection target is the ending-end marker information, the plurality of video segments may be classified into feature segments and ending segments according to the plurality of comparison results. If one ending segment and one feature segment adjacent in playing order exist among the plurality of video segments, and the feature segment is played after the ending segment has been played, the ending segment and the feature segment are determined as the third candidate video segment, which may contain the ending-end marker information.
For example, the third confidence threshold may be set to 0.8; when the confidence result of a certain video segment is greater than the third confidence threshold 0.8, the video segment may be considered an ending segment, otherwise a feature segment. Assuming there are 10 consecutive video segments {T1, ..., T10} whose confidence results are [1.0, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.8, 0.8, 0.9], comparing the confidence results with the third confidence threshold shows that T1–T4 and T8–T10 belong to ending segments and T5–T7 belong to feature segments; that is, the first ending is T1–T4, the feature is T5–T7 and the second ending is T8–T10. T4 and T5 are an ending segment and a feature segment adjacent in playing order, and T5 is played after T4 has been played, so T4 and T5 can be selected as the third candidate video segment.
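The second and third candidate segments follow the same adjacency pattern as the first, only with different segment types on each side of the boundary, so a shared helper can be sketched (again continuing the hypothetical type labels above):

```python
def find_transitions(segment_types, before, after):
    """Return all adjacent index pairs (i, i + 1) where a `before` segment is
    immediately followed by an `after` segment."""
    return [(i, i + 1) for i in range(len(segment_types) - 1)
            if segment_types[i] == before and segment_types[i + 1] == after]

# find_transitions(types, "feature", "ending") gives the second candidate pairs
# (ending start); find_transitions(types, "ending", "feature") gives the third
# candidate pairs (Easter-egg start), e.g. (3, 4) for T4 and T5 in the example.
```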
It should be noted that sub-steps S321–S322, sub-steps S323–S324 and sub-steps S325–S326 are parallel alternatives: if the detection target is the title-end marker information, sub-steps S321–S322 are executed; if the detection target is the ending-start marker information, sub-steps S323–S324 are executed; if the detection target is the ending-end marker information, sub-steps S325–S326 are executed. There is no required execution order among the three pairs of sub-steps.
Step 205, performing character recognition on the image frames of the candidate video segments to obtain character recognition results.
In the embodiment of the present invention, the image frames in the candidate video segments may be input into a pre-trained character recognition model for character recognition. The character recognition model has the capability of detecting the text display position area in the image and distinguishing the text content, and combines the text display position area and the text content to perform character recognition, so that the recognition accuracy can be improved.
In a specific implementation, the image frames of the candidate video segments may be sampled, and the sampled image frames may be input into a character recognition model for character recognition. For example, the sampling may be performed 5 frames apart, and the sampled image frames are input to the character recognition model.
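A sketch of this sampling and recognition step, where ocr_model stands in for the pre-trained character recognition model (its return format is an assumption):

```python
def ocr_candidate_segment(frames, ocr_model, step=5):
    """Sample every `step` frames of a candidate segment and run OCR on each
    sampled frame; returns (frame_index, ocr_result) pairs."""
    return [(idx, ocr_model(frames[idx])) for idx in range(0, len(frames), step)]
```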
And step 206, determining a target image frame where the detection target is located according to the character recognition result.
In an optional embodiment of the present invention, the text recognition result includes a first text recognition result obtained by performing text recognition on an image frame of the first candidate video segment, and step 206 may specifically include the following sub-steps:
and a substep S41, matching the text content in the first character recognition result with a preset keyword, tracking the image frame containing the preset keyword after matching the image frame containing the preset keyword, and determining the last tracked image frame containing the preset keyword as the target image frame where the end-of-title mark information is located.
The character recognition result comprises a first character recognition result aiming at the first candidate video segment, the first character recognition result can comprise recognized text content, the text content can be matched with a preset keyword, after the text content containing the preset keyword is matched, an image frame corresponding to the text content is determined, the image frame is tracked, and the last tracked image frame containing the preset keyword is determined as a target image frame where the end-of-title mark information is located.
The text content in the recognised character recognition result is matched against a preset keyword. The preset keyword may be episode-number information or issue-number information, such as an episode number, a chapter number, an issue number, and the like.
The image frames containing the preset keyword are tracked; the image frame at which the pixel jitter exceeds a threshold can be determined as the last tracked image frame containing the preset keyword, i.e., the frame at which the rendered text of the preset keyword starts to disappear. For example, if the preset keyword is matched in the nth frame and the pixel jitter exceeds the threshold when the (n+3)th frame is tracked, the text of the preset keyword is fading out at that point, and the (n+3)th frame is taken as the target image frame.
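The keyword matching and tracking logic can be sketched as below; the keyword string, the jitter measure and its threshold are illustrative assumptions, not values from the embodiment:

```python
import numpy as np

def pixel_jitter(prev_frame, cur_frame):
    """Illustrative jitter measure: mean absolute pixel difference."""
    return float(np.mean(np.abs(cur_frame.astype(float) - prev_frame.astype(float))))

def locate_title_end_frame(frames, frame_texts, keyword="Episode",
                           jitter_threshold=30.0):
    """frame_texts[i] is the OCR text of frames[i]. Find the first frame whose
    text contains the preset keyword, then track forward until the pixel jitter
    exceeds the threshold (the keyword overlay starts to disappear) and return
    that frame index as the title-end marker frame."""
    start = next((i for i, t in enumerate(frame_texts) if keyword in t), None)
    if start is None:
        return None   # no keyword matched; the interception window may need adjusting
    for i in range(start + 1, len(frames)):
        if pixel_jitter(frames[i - 1], frames[i]) > jitter_threshold:
            return i  # last tracked frame: the keyword text fades out here
    return len(frames) - 1
```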
In an optional embodiment of the present invention, the text recognition result includes a second text recognition result obtained by performing text recognition on the image frame of the second candidate video segment, and step 206 may specifically include the following sub-steps:
and a substep S42, traversing second character recognition results corresponding to each image frame in the second candidate video segment in a time sequence, and if the number of text boxes in the second character recognition results corresponding to a plurality of consecutive image frames is greater than a preset number threshold, determining a first image frame in the second character recognition results whose number of text boxes is greater than the preset number threshold as the target image frame where the end-of-flight start flag information is located.
The character recognition result comprises a second character recognition result aiming at the second candidate video segment, if the detected target is the end-of-piece starting mark information, the second character recognition result can be analyzed, the second character recognition results corresponding to all the image frames in the second candidate video segment are traversed according to the time sequence, if the number of the text boxes in the second character recognition results corresponding to a plurality of continuous image frames is larger than a preset number threshold value, the end-of-piece can be considered to be started, and the first image frame of which the number of the text boxes is larger than the preset number threshold value can be determined as the target image frame of which the end-of-piece starting mark information is located.
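A sketch of this traversal, where frame_boxes holds the per-frame text-box counts taken from the OCR results; the threshold values are illustrative assumptions:

```python
def locate_ending_start_frame(frame_boxes, box_threshold=3, min_consecutive=5):
    """Traverse per-frame text-box counts in time order; once the count stays
    above box_threshold for min_consecutive consecutive frames, return the
    first frame of that run as the ending-start marker frame."""
    run_start, run_len = None, 0
    for i, n in enumerate(frame_boxes):
        if n > box_threshold:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= min_consecutive:
                return run_start
        else:
            run_start, run_len = None, 0
    return None
```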
In an optional embodiment of the present invention, the text recognition result includes a third text recognition result obtained by performing text recognition on the image frame of the third candidate video segment, and step 206 may specifically include the following sub-steps:
and a substep S43, if the third character recognition result includes a text box, tracking the image frame including the text box, and determining the last tracked image frame including the text box as the target image frame where the end-of-title flag information is located.
The character recognition result comprises a third character recognition result aiming at a third candidate video clip, if the detected target is end-of-title mark information, the third character recognition result can be analyzed to determine whether the third character recognition result contains a text box, if the third character recognition result contains the text box, the image frame containing the text box can be tracked, and the last tracked image frame containing the text box is determined as the target image frame where the end-of-title mark information is located.
Since the video file trailer usually has rolling production information, such as a production staff list, a production company, and the like, a plurality of rolling text boxes exist in the image frame of the trailer, the position coordinates of the text boxes can be detected, the image frame containing the text boxes is tracked, and the last tracked image frame containing the text boxes is determined as the target image frame where the trailer end mark information is located.
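A sketch of locating the ending-end frame from the per-frame text-box detections (frame_boxes is assumed to hold the OCR text boxes of each frame in the third candidate segment):

```python
def locate_ending_end_frame(frame_boxes):
    """Return the index of the last frame that still contains text boxes,
    treated as the ending-end marker frame; the next feature-type frame then
    belongs to the Easter-egg segment."""
    last = None
    for i, boxes in enumerate(frame_boxes):
        if boxes:           # at least one rolling-credits text box detected
            last = i
    return last
```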
It should be noted that sub-step S41, sub-step S42 and sub-step S43 are parallel alternatives: if the detection target is the title-end marker information, sub-step S41 is executed; if the detection target is the ending-start marker information, sub-step S42 is executed; if the detection target is the ending-end marker information, sub-step S43 is executed. There is no required execution order among the three sub-steps. In addition, sub-step S41 corresponds to sub-steps S321–S322, sub-step S42 corresponds to sub-steps S323–S324, and sub-step S43 corresponds to sub-steps S325–S326.
Referring to fig. 4, a flowchart of a video detection method according to an embodiment of the present invention is shown, in which the detection target is the title-end marker information. The specific process includes:
S4a, acquiring a video file of a TV series, intercepting the segment of the video file in the time period T0–T1 as the intercepted video segment for analysis, dividing the intercepted video segment into a plurality of video segments, and obtaining the image information and the audio information of each of the video segments.
And S4b, inputting the image information and the audio information corresponding to any video clip into the video feature extraction model and the audio feature extraction model respectively to extract features, and obtaining the video feature information and the audio feature information corresponding to the video clip.
And S4c, for any video clip, merging the video characteristic information and the audio characteristic information corresponding to the video clip to obtain the audio and video characteristic information corresponding to the video clip.
And S4d, for any video segment, inputting the audio-video feature information corresponding to the video segment into the trained classification model to obtain the confidence result that the video segment belongs to a title segment / feature segment / ending segment.
S4e, determining a first candidate video segment from the plurality of video segments according to the plurality of output confidence results: specifically, classifying the plurality of video segments into title segments and feature segments according to the output confidence results, and if there are a title segment and a feature segment adjacent in playing order among the plurality of video segments and the feature segment is played after the title segment has been played, determining the title segment and the feature segment as the first candidate video segment used to search for the title-end marker information.
And S4f, inputting the image frame of the first candidate video segment into the character recognition model to obtain a corresponding first character recognition result.
S4g, matching the text content in the first character recognition result against a preset keyword; after an image frame containing the preset keyword is matched, tracking the image frames containing the preset keyword, and determining the last tracked image frame containing the preset keyword as the target image frame in which the title-end marker information is located. If no image frame containing the preset keyword is matched, the interception time period in step S4a may be adjusted, for example to T1–T2; the video segment for the time period T1–T2 is then intercepted for analysis and steps S4b–S4g are repeated until an image frame containing the preset keyword is matched or all video segments contained in the video file have been analysed.
Referring to fig. 5, a flowchart of another video detection method according to an embodiment of the present invention is shown, in which the detection target is the ending-start marker information. The specific process includes:
S5a, a video file of an episode may be acquired, and the part of the video file that may contain the ending-start marker information is intercepted for analysis. Assuming the interception time period is T9–T10, the segment in that period is taken as the intercepted video segment for analysis; the intercepted video segment can be divided into a plurality of video segments, and the image information and the audio information of each of the video segments are obtained.
And S5b, inputting the image information and the audio information corresponding to any video clip into the video characteristic extraction model and the audio characteristic extraction model respectively to extract the characteristics, and obtaining the video characteristic information and the audio characteristic information corresponding to the video clip.
And S5c, for any video clip, merging the video characteristic information and the audio characteristic information corresponding to the video clip to obtain the audio and video characteristic information corresponding to the video clip.
And S5d, for any video segment, inputting the audio-video feature information corresponding to the video segment into the trained classification model to obtain the confidence result that the video segment belongs to a title segment / feature segment / ending segment.
S5e, determining a second candidate video segment from the plurality of video segments according to the plurality of output confidence results. Specifically, the plurality of video segments are classified into feature segments and trailer segments according to the confidence results; if a feature segment and a trailer segment that are adjacent in playing order exist among the plurality of video segments and the trailer segment is played after the feature segment, the feature segment and the trailer segment are determined as the second candidate video segment used for searching for the trailer start mark information.
And S5f, inputting the image frame of the second candidate video clip into the character recognition model to obtain a corresponding second character recognition result.
S5g, traversing the second character recognition results corresponding to each image frame in the second candidate video segment in time order; if the number of text boxes in the second character recognition results corresponding to a plurality of consecutive image frames is greater than a preset number threshold, determining the first image frame whose second character recognition result contains more text boxes than the preset number threshold as the target image frame where the trailer start mark information is located. If the number of text boxes in the second character recognition results corresponding to a plurality of consecutive image frames is not greater than the preset number threshold, the interception time period in step S5a may be adjusted, for example, to the T10-T11 time period; a video segment is then intercepted for the T10-T11 interception time period and steps S5b-S5g are repeated until the number of text boxes in the second character recognition results corresponding to consecutive image frames exceeds the preset number threshold or all video segments contained in the video file have been analyzed.
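A minimal sketch of the text-box counting rule of step S5g is given below; the threshold, the required run length and the function name are illustrative assumptions, and the character recognition result is assumed to have been reduced to one text-box count per image frame.

```python
# Sketch of step S5g: scan text-box counts in time order and return the first frame
# of a sufficiently long run of frames whose count exceeds the threshold (rolling
# credits). Threshold and run length are illustrative, not patent values.
from typing import List, Optional


def find_trailer_start_frame(box_counts: List[int],
                             count_threshold: int = 5,
                             min_run: int = 10) -> Optional[int]:
    run_start = None
    for idx, count in enumerate(box_counts):
        if count > count_threshold:
            if run_start is None:
                run_start = idx        # a candidate run of text-heavy frames begins
            if idx - run_start + 1 >= min_run:
                return run_start       # first frame of the run is the trailer start
        else:
            run_start = None           # run broken, start over
    return None                        # not found: adjust the interception time period
```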
Referring to fig. 6, a flowchart of another video detection method according to an embodiment of the present invention is shown, where a detection target is trailer end flag information, and a specific process includes:
S6a, a video file of an episode can be obtained, and the part of the video file that may contain the trailer end mark information is intercepted for analysis. Assuming the interception time period is T8-T9, the section in this period is used as the intercepted video section for analysis, and the intercepted video section can be divided into a plurality of video segments to obtain a plurality of pieces of image information and a plurality of pieces of audio information of the plurality of video segments respectively.
And S6b, inputting the image information and the audio information corresponding to any video clip into the video characteristic extraction model and the audio characteristic extraction model respectively to extract the characteristics, and obtaining the video characteristic information and the audio characteristic information corresponding to the video clip.
And S6c, for any video clip, merging the video characteristic information and the audio characteristic information corresponding to the video clip to obtain the audio and video characteristic information corresponding to the video clip.
S6d, for any video clip, inputting the audio and video characteristic information corresponding to the video clip into the trained classification model to obtain the confidence result that the video clip belongs to the leader clip/the positive clip/the trailer clip.
S6e, determining a third candidate video segment from the plurality of video segments according to the plurality of output confidence results. Specifically, the plurality of video segments are classified into feature segments and trailer segments according to the confidence results; if a trailer segment and a feature segment that are adjacent in playing order exist among the plurality of video segments and the feature segment is played after the trailer segment, the trailer segment and the feature segment are determined as the third candidate video segment used for searching for the trailer end mark information.
And S6f, inputting the image frame of the third candidate video clip into the character recognition model to obtain a corresponding third character recognition result.
S6g, if the third character recognition result contains a text box, tracking the image frames containing text boxes, and determining the last tracked image frame containing a text box as the target image frame where the trailer end mark information is located. If the third character recognition result does not contain a text box, the interception time period in step S6a may be adjusted, for example, to the T9-T10 time period; a video segment is then intercepted for the T9-T10 interception time period and steps S6b-S6g are repeated until a third character recognition result contains a text box or all video segments contained in the video file have been analyzed.
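The text-box tracking rule of step S6g can be sketched in the same style, again assuming one text-box count per image frame of the third candidate video segment; names and types are illustrative.

```python
# Sketch of step S6g: the last frame that still contains at least one text box is
# taken as the frame where the trailer end mark information is located.
from typing import List, Optional


def find_trailer_end_frame(box_counts: List[int]) -> Optional[int]:
    last_with_text = None
    for idx, count in enumerate(box_counts):
        if count > 0:
            last_with_text = idx       # keep tracking frames that contain text boxes
    return last_with_text              # None means: adjust the interception time period
```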
In the embodiment of the present invention, candidate video segments that may contain the detection target can be determined from the audio and video feature information of consecutive video segments in the video file, and character recognition is performed on the image frames of the candidate video segments to determine the target image frame where the detection target is located. With this method, a deep learning model combines the picture information and the audio information of the video file to determine the candidate video segment where the detection target is located, and character recognition is then performed within that segment range to locate the exact image frame containing the detection target. This improves the recognition accuracy of the title end time point, the trailer start time point and the post-credits (Easter egg) start time point of a video; the detection method requires neither image template matching nor manual operation, so the point recognition approach is flexible and the recognition efficiency is high.
The title detection method provided by the invention can accurately identify the title even in the case, common in self-produced dramas, where the title is played only after part of the feature film has already been played.
The trailer detection method provided by the invention first extracts 3D convolutional network features from the video file, trains a classification network to identify whether a video picture belongs to the feature film or the trailer, takes the feature-film end time point as an initial trailer start time point, performs character detection on the pictures near that point, and takes the time point at which a plurality of text boxes appear as the trailer start time point.
The post-credits (Easter egg) detection method provided by the invention can automatically detect the start time point of a post-credits scene: after the first feature-film end time point, it searches for a further video segment of the feature type, and if one exists, takes the start time point of that second feature segment as the start time point of the post-credits scene.
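Assuming the segment classification results are available as labeled time intervals in playing order, the post-credits rule above can be sketched as follows; the label strings and the function name are illustrative, not taken from the patent.

```python
# Sketch of the post-credits ("Easter egg") rule: after the first feature-film end
# point, look for a further segment classified as feature film and take its start
# as the post-credits start time point.
from typing import List, Optional, Tuple

# each segment: (start_second, end_second, label), label in {"head", "feature", "tail"}
Segment = Tuple[float, float, str]


def find_post_credits_start(segments: List[Segment],
                            first_feature_end: float) -> Optional[float]:
    for start, _end, label in segments:
        if label == "feature" and start >= first_feature_end:
            return start               # a second feature-type run begins here
    return None                        # no post-credits scene detected
```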
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 7, a block diagram of a structure of a video detection apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
a first obtaining and determining module 701, configured to obtain a video file and determine a detection target for the video file; the detection target comprises at least one of title end mark information, trailer start mark information and trailer end mark information;
a second obtaining and determining module 702, configured to obtain a plurality of continuous video clips from the video file, and determine a plurality of pieces of audio and video feature information of the plurality of video clips, respectively;
the input and output module 703 is configured to input the plurality of audio/video feature information into a pre-trained classification model respectively to obtain a plurality of corresponding output results;
a first determining module 704, configured to determine a candidate video segment from the plurality of video segments according to the plurality of output results;
the character recognition module 705 is configured to perform character recognition on the image frames of the candidate video segments to obtain a character recognition result;
and a second determining module 706, configured to determine, according to the character recognition result, a target image frame where the detection target is located.
In an embodiment of the present invention, the input and output module includes:
the input and output submodule is used for respectively inputting the audio and video characteristic information into the classification model to obtain a plurality of corresponding confidence results; the confidence result is used for representing the confidence that the corresponding video segment belongs to the head segment/the tail segment.
In an embodiment of the present invention, the classification model is trained by the following modules:
the acquisition module is used for acquiring a sample video clip set for training; the sample video clip set comprises a plurality of sample video clips in succession; the segment types respectively marked by the plurality of sample video segments are slice head segments, positive segments or slice tail segments;
the third determining module is used for respectively determining a plurality of sample audio and video characteristic information of the plurality of sample video clips;
and the model training module is used for performing model training by using the audio and video characteristic information of the plurality of samples to obtain the classification model for identifying the leader segment/the positive segment/the tail segment.
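A minimal sketch of such a classification model and one training step is shown below, assuming fused audio and video feature vectors have already been computed for each labeled sample video segment; the network shape, the feature dimension and the use of PyTorch are assumptions made for this sketch, not values given in the patent.

```python
# Sketch of the classification model trained by the modules above: a small
# classifier over fused audio-video feature vectors with three classes
# (head segment / feature segment / tail segment).
import torch
import torch.nn as nn

NUM_CLASSES = 3  # head segment, feature segment, tail segment


class SegmentClassifier(nn.Module):
    def __init__(self, feature_dim: int = 1536):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, NUM_CLASSES),
        )

    def forward(self, av_features: torch.Tensor) -> torch.Tensor:
        return self.net(av_features)   # logits; softmax gives the confidence results


def train_step(model, features, labels, optimizer):
    """One supervised step over a batch of labeled sample video segments."""
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```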
In an embodiment of the present invention, the second obtaining and determining module includes:
and the feature extraction and combination submodule is used for extracting corresponding audio feature information by adopting a pre-trained super-resolution test sequence VGG model aiming at each video segment, extracting corresponding video feature information by adopting a pre-trained double-flow expansion three-dimensional convolution network I3D model, and combining the audio feature information and the video feature information to obtain the audio and video feature information corresponding to the video segment.
In an embodiment of the present invention, the feature extraction and merging sub-module includes:
the attention calculating unit is used for respectively carrying out attention calculation on the audio characteristic information and the video characteristic information based on a shift attention mechanism to obtain corresponding attention audio characteristic information and attention video characteristic information;
and the splicing unit is used for splicing the attention audio characteristic information and the attention video characteristic information to obtain the corresponding audio and video characteristic information.
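The attention-then-splice merging can be sketched roughly as follows; plain scaled dot-product attention over the temporal dimension is used here only as a stand-in, since the shift attention mechanism itself is not detailed in this sketch.

```python
# Sketch of the attention calculating unit and splicing unit: each modality is
# pooled with attention weights over time, then the two pooled vectors are spliced
# (concatenated) into the audio-video feature vector.
import torch
import torch.nn.functional as F


def attend_and_splice(audio_seq: torch.Tensor,   # shape (T_a, D_a)
                      video_seq: torch.Tensor    # shape (T_v, D_v)
                      ) -> torch.Tensor:
    def attention_pool(seq: torch.Tensor) -> torch.Tensor:
        scores = seq @ seq.mean(dim=0)                     # similarity to the mean feature
        weights = F.softmax(scores / seq.shape[-1] ** 0.5, dim=0)
        return (weights.unsqueeze(-1) * seq).sum(dim=0)    # attention-weighted pooling

    attn_audio = attention_pool(audio_seq)                 # attention audio feature
    attn_video = attention_pool(video_seq)                 # attention video feature
    return torch.cat([attn_audio, attn_video], dim=-1)
```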
In an embodiment of the present invention, the first determining module includes:
the comparison submodule is used for comparing the confidence coefficient results with a preset confidence coefficient threshold respectively to obtain a plurality of corresponding comparison results;
a first determining sub-module, configured to determine the candidate video segment from the plurality of video segments according to the plurality of comparison results.
In this embodiment of the present invention, the candidate video segments include a first candidate video segment for searching for the detection target as the end-of-title information, and the first determining sub-module includes:
a first classification unit, configured to classify the plurality of video segments into a slice header segment and a feature segment according to the plurality of comparison results if the detection target is the slice header end flag information;
a first determining unit, configured to determine a head segment and a feature segment that are adjacent to each other in playing order as the first candidate video segment if the head segment and the feature segment exist in the plurality of video segments and the feature segment is played after the head segment is played.
In this embodiment of the present invention, the candidate video segments include a second candidate video segment for searching for the detection target as the end-of-piece start flag information, and the first determining sub-module includes:
a second classification unit, configured to classify the multiple video segments into a feature segment and a trailer segment according to the multiple comparison results if the detection target is the trailer start flag information;
a second determining unit, configured to determine a feature clip and an end clip that are adjacent to each other in playing order as the second candidate video clip if the feature clip and the end clip exist in the plurality of video clips and the end clip is played after the feature clip is played.
In this embodiment of the present invention, the candidate video segments include a third candidate video segment for searching for the detection target as the trailer end mark information, and the first determining sub-module includes:
a third classifying unit, configured to classify the video segments into a feature segment and a trailer segment according to the comparison results if the detection target is the trailer ending flag information;
a third determining unit, configured to determine, if there is an end-of-title segment and a feature segment that are adjacent in playing order in the plurality of video segments and the feature segment is played after the end-of-title segment is played, the end-of-title segment and the feature segment as the third candidate video segment.
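The three determining units above share one adjacency rule over the classified segment labels, which can be sketched as follows; the label strings, target keys and function name are illustrative.

```python
# Sketch of the first/second/third determining units: given per-segment labels in
# playing order, find the adjacent pair of segments that brackets the detection target.
from typing import List, Optional, Tuple

PAIR_FOR_TARGET = {
    "title_end":     ("head", "feature"),   # head segment followed by feature segment
    "trailer_start": ("feature", "tail"),   # feature segment followed by tail segment
    "trailer_end":   ("tail", "feature"),   # tail segment followed by feature segment
}


def find_candidate_pair(labels: List[str], target: str) -> Optional[Tuple[int, int]]:
    first, second = PAIR_FOR_TARGET[target]
    for i in range(len(labels) - 1):
        if labels[i] == first and labels[i + 1] == second:
            return i, i + 1               # indices of the candidate video segments
    return None                           # no candidate pair in this capture window
```

For example, for the trailer end mark information the rule returns a trailer segment together with the feature segment that follows it, matching the third determining unit above.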
In an embodiment of the present invention, the text recognition result includes a first text recognition result obtained by performing text recognition on an image frame of the first candidate video segment, and the second determining module includes:
and the second determining sub-module is used for matching the text content in the first character recognition result with a preset keyword, tracking the image frame containing the preset keyword after the image frame containing the preset keyword is matched, and determining the last tracked image frame containing the preset keyword as the target image frame where the end-of-title mark information is located.
In an embodiment of the present invention, the text recognition result includes a second text recognition result obtained by performing text recognition on the image frame of the second candidate video segment, and the second determining module includes:
and a third determining submodule, configured to traverse second character recognition results corresponding to each image frame in the second candidate video segment in a time sequence, and if the number of text boxes in the second character recognition results corresponding to a plurality of consecutive image frames is greater than a preset number threshold, determine a first image frame in the second character recognition results, in which the number of text boxes is greater than the preset number threshold, as the target image frame in which the end-of-piece start flag information is located.
In an embodiment of the present invention, the text recognition result includes a third text recognition result obtained by performing text recognition on the image frame of the third candidate video segment, and the second determining module includes:
and the fourth determining submodule is used for tracking the image frames containing text boxes if the third character recognition result contains a text box, and determining the last tracked image frame containing a text box as the target image frame where the trailer end mark information is located.
In an embodiment of the present invention, the second obtaining and determining module includes:
a fifth determining submodule, configured to determine an intercepting time period according to the detection target;
the intercepting submodule is used for intercepting the video file according to the intercepting time period to obtain an intercepted video clip;
and the segmentation submodule is used for carrying out average segmentation on the intercepted video segments to obtain the continuous video segments.
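A sketch of the interception and average-segmentation sub-modules is given below; the capture windows and the segment length are illustrative assumptions chosen only to make the example self-contained.

```python
# Sketch of the fifth determining / intercepting / segmentation sub-modules:
# choose a capture window from the detection target, then split it evenly.
from typing import Dict, List, Tuple

# assumed windows in seconds, measured from the start or from the end of the file
CAPTURE_WINDOWS: Dict[str, Tuple[str, float, float]] = {
    "title_end":     ("from_start", 0.0, 600.0),
    "trailer_start": ("from_end", 900.0, 300.0),
    "trailer_end":   ("from_end", 300.0, 0.0),
}


def capture_window(duration: float, target: str) -> Tuple[float, float]:
    mode, a, b = CAPTURE_WINDOWS[target]
    if mode == "from_start":
        return a, min(b, duration)
    return max(duration - a, 0.0), max(duration - b, 0.0)


def split_evenly(start: float, end: float, segment_len: float = 10.0) -> List[Tuple[float, float]]:
    """Average segmentation of the captured window into consecutive video segments."""
    segments, t = [], start
    while t < end:
        segments.append((t, min(t + segment_len, end)))
        t += segment_len
    return segments
```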
In summary, in the embodiment of the present invention, candidate video segments that may contain the detection target can be determined from the audio and video feature information of consecutive video segments in the video file, and character recognition is performed on the image frames of the candidate video segments to determine the target image frame where the detection target is located. With this method, a deep learning model combines the picture information and the audio information of the video file to determine the candidate video segment where the detection target is located, and character recognition is then performed within that segment range to locate the exact image frame containing the detection target. This improves the recognition accuracy of the title end time point, the trailer start time point and the post-credits (Easter egg) start time point of a video; the detection method requires neither image template matching nor manual operation, so the point recognition approach is flexible and the recognition efficiency is high.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
An embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein, when the computer program is executed by the processor, each process of the above video detection method embodiment is implemented and the same technical effects can be achieved; to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned video detection method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the detailed description is omitted here.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "include", "including" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.
The foregoing detailed description has been made of a video detection method, a video detection apparatus, an electronic device, and a computer-readable storage medium, and specific examples are used herein to explain the principles and embodiments of the present invention, where the descriptions of the foregoing examples are only used to help understand the method and its core ideas of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (16)

1. A method for video detection, the method comprising:
acquiring a video file and determining a detection target aiming at the video file; the detection target comprises at least one of title end mark information, trailer start mark information and trailer end mark information;
acquiring a plurality of continuous video clips from the video file, and respectively determining a plurality of audio and video characteristic information of the video clips;
respectively inputting the audio and video characteristic information into a pre-trained classification model to obtain a plurality of corresponding output results;
determining candidate video segments from the plurality of video segments according to the plurality of output results;
performing character recognition on the image frames of the candidate video clips to obtain character recognition results;
and determining a target image frame where the detection target is located according to the character recognition result.
2. The method according to claim 1, wherein the step of inputting the plurality of audio/video feature information into a pre-trained classification model respectively to obtain a plurality of corresponding output results comprises:
respectively inputting the audio and video characteristic information into the classification model to obtain a plurality of corresponding confidence coefficient results; and the confidence result is used for representing the confidence that the corresponding video segment belongs to the head segment/the tail segment.
3. The method of claim 1, wherein the classification model is trained by:
acquiring a sample video clip set for training; the sample video clip set comprises a plurality of sample video clips in succession; the segment types respectively marked on the plurality of sample video segments are a slice head segment, a positive segment or a slice tail segment;
respectively determining a plurality of sample audio and video characteristic information of the plurality of sample video clips;
and performing model training by using the audio and video characteristic information of the plurality of samples to obtain the classification model for identifying the head fragment/the positive fragment/the tail fragment.
4. The method of claim 1, wherein the determining the plurality of audio-video feature information of the plurality of video segments respectively comprises:
and aiming at each video segment, extracting corresponding audio characteristic information by adopting a pre-trained VGG model, extracting corresponding video characteristic information by adopting a pre-trained two-stream inflated three-dimensional convolutional network I3D model, and combining the audio characteristic information and the video characteristic information to obtain the audio and video characteristic information corresponding to the video segment.
5. The method according to claim 4, wherein the merging the audio feature information and the video feature information to obtain the audio and video feature information corresponding to the video clip includes:
respectively carrying out attention calculation on the audio characteristic information and the video characteristic information based on a shifting attention mechanism to obtain corresponding attention audio characteristic information and attention video characteristic information;
and splicing the attention audio characteristic information and the attention video characteristic information to obtain the corresponding audio and video characteristic information.
6. The method of claim 2, wherein determining candidate video segments from the plurality of video segments based on the plurality of output results comprises:
comparing the plurality of confidence level results with a preset confidence level threshold respectively to obtain a plurality of corresponding comparison results;
determining the candidate video segments from the plurality of video segments according to the plurality of comparison results.
7. The method according to claim 6, wherein the candidate video segments comprise a first candidate video segment for finding the detection target as the end-of-title information, and the determining the candidate video segment from the plurality of video segments according to the plurality of comparison results comprises:
if the detection target is the leader end mark information, respectively classifying the video clips into leader clips and positive clips according to the comparison results;
if one head segment and one feature segment adjacent to the playing order exist in the plurality of video segments, and the one feature segment is played after the one head segment is played, determining the one head segment and the one feature segment as the first candidate video segment.
8. The method according to claim 6, wherein the candidate video segments include a second candidate video segment for finding the detection target as the end-of-segment start flag information, and wherein determining the candidate video segment from the plurality of video segments according to the plurality of comparison results comprises:
if the detection target is the film ending start mark information, classifying the plurality of video clips into positive clips and film ending clips according to the plurality of comparison results;
if a feature clip and an end clip adjacent to each other in the playing order exist in the plurality of video clips, and the end clip is played after the feature clip is played, determining the feature clip and the end clip as the second candidate video clip.
9. The method according to claim 6, wherein the candidate video segments include a third candidate video segment for finding the detection target as the trailer end mark information, and the determining the candidate video segment from the plurality of video segments according to the comparison results comprises:
if the detection target is the end of the film trailer mark information, classifying the video clips into positive clips and film trailer clips according to the comparison results;
if one end-of-title segment and one feature segment adjacent to the playing order exist in the plurality of video segments, and the one feature segment is played after the one end-of-title segment is played, determining the one end-of-title segment and the one feature segment as the third candidate video segment.
10. The method of claim 7, wherein the text recognition result comprises a first text recognition result obtained by performing text recognition on the image frame of the first candidate video segment, and the determining a target image frame where the detection target is located according to the text recognition result comprises:
matching the text content in the first character recognition result with a preset keyword, tracking the image frame containing the preset keyword after the image frame containing the preset keyword is matched, and determining the last tracked image frame containing the preset keyword as the target image frame where the end-of-title mark information is located.
11. The method of claim 8, wherein the text recognition result comprises a second text recognition result obtained by performing text recognition on the image frame of the second candidate video segment, and the determining a target image frame where the detection target is located according to the text recognition result comprises:
and traversing second character recognition results corresponding to each image frame in the second candidate video segment according to a time sequence, and if the number of text boxes in the second character recognition results corresponding to a plurality of continuous image frames is greater than a preset number threshold, determining a first image frame in the second character recognition results, the number of which is greater than the preset number threshold, as the target image frame where the end-of-segment start mark information is located.
12. The method of claim 9, wherein the text recognition result comprises a third text recognition result obtained by performing text recognition on the image frame of the third candidate video segment, and the determining the target image frame where the detection target is located according to the text recognition result comprises:
if the third character recognition result contains a text box, tracking the image frame containing the text box, and determining the last tracked image frame containing the text box as the target image frame where the trailer end mark information is located.
13. The method of claim 1, wherein the obtaining a plurality of consecutive video segments from the video file comprises:
determining an interception time period according to the detection target;
intercepting the video file according to the intercepting time period to obtain an intercepted video clip;
and carrying out average segmentation on the intercepted video clips to obtain the continuous video clips.
14. A video detection apparatus, characterized in that the apparatus comprises:
the first acquisition and determination module is used for acquiring a video file and determining a detection target aiming at the video file; the detection target comprises at least one of title end mark information, trailer start mark information and trailer end mark information;
the second obtaining and determining module is used for obtaining a plurality of continuous video clips from the video file and respectively determining a plurality of audio and video characteristic information of the video clips;
the input and output module is used for respectively inputting the audio and video characteristic information into a pre-trained classification model to obtain a plurality of corresponding output results;
a first determining module, configured to determine candidate video segments from the plurality of video segments according to the plurality of output results;
the character recognition module is used for carrying out character recognition on the image frames of the candidate video clips to obtain character recognition results;
and the second determining module is used for determining the target image frame where the detection target is located according to the character recognition result.
15. An electronic device, comprising: processor, memory and computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of a video detection method as claimed in any one of claims 1 to 13.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of a video detection method according to any one of claims 1 to 13.
CN202210753941.7A 2022-06-29 2022-06-29 Video detection method and device, electronic equipment and storage medium Pending CN115035509A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210753941.7A CN115035509A (en) 2022-06-29 2022-06-29 Video detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210753941.7A CN115035509A (en) 2022-06-29 2022-06-29 Video detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115035509A true CN115035509A (en) 2022-09-09

Family

ID=83126799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210753941.7A Pending CN115035509A (en) 2022-06-29 2022-06-29 Video detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115035509A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116017036A (en) * 2022-12-27 2023-04-25 北京奇艺世纪科技有限公司 Audio and video analysis method and device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
RU2494566C2 (en) Display control device and method
CN111523566A (en) Target video clip positioning method and device
US8719288B2 (en) Universal lookup of video-related data
US20130148898A1 (en) Clustering objects detected in video
Saba et al. Analysis of vision based systems to detect real time goal events in soccer videos
Shekhar et al. Show and recall: Learning what makes videos memorable
EP2471025B1 (en) A method and system for preprocessing the region of video containing text
US20130243307A1 (en) Object identification in images or image sequences
CN111836118B (en) Video processing method, device, server and storage medium
CN112733660A (en) Method and device for splitting video strip
CN113010736B (en) Video classification method and device, electronic equipment and storage medium
CN115035509A (en) Video detection method and device, electronic equipment and storage medium
US20150051912A1 (en) Method for Segmenting Videos and Audios into Clips Using Speaker Recognition
CN112668561B (en) Teaching video segmentation determination method and device
CN110795597A (en) Video keyword determination method, video retrieval method, video keyword determination device, video retrieval device, storage medium and terminal
Li et al. Efficient video copy detection using multi-modality and dynamic path search
CN114332716B (en) Clustering method and device for scenes in video, electronic equipment and storage medium
CN116055816A (en) Video head and tail detection method and device
CN105847964A (en) Movie and television program processing method and movie and television program processing system
CN114299435A (en) Scene clustering method and device in video and related equipment
WO2017149447A1 (en) A system and method for providing real time media recommendations based on audio-visual analytics
Liu et al. Spatio-Temporal Residual Networks for Slide Transition Detection in Lecture Videos
KR100981125B1 (en) method of processing moving picture and apparatus thereof
KR102005034B1 (en) Method and apparatus for acquiring object information based on image
US9741345B2 (en) Method for segmenting videos and audios into clips using speaker recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination