CN113761589A - Video detection method and device and electronic equipment - Google Patents

Video detection method and device and electronic equipment

Info

Publication number
CN113761589A
CN113761589A
Authority
CN
China
Prior art keywords
video
detected
detection result
audio
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110430813.4A
Other languages
Chinese (zh)
Inventor
杨天舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd filed Critical Tencent Technology Beijing Co Ltd
Priority to CN202110430813.4A priority Critical patent/CN113761589A/en
Publication of CN113761589A publication Critical patent/CN113761589A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/64Protecting data integrity, e.g. using checksums, certificates or signatures

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a video detection method, a video detection device and electronic equipment, and relate to the technical field of artificial intelligence. The method decodes a video to be detected, separates it into an audio file and an image file, detects the integrity of the audio file, and, if the audio file is complete, detects the integrity of the image file. A video feature vector corresponding to the image file and a tag feature vector corresponding to the video tag are obtained respectively, and whether the video content is complete is determined based on the similarity between the video feature vector and the tag feature vector. By first checking whether the audio file is complete and then, on that basis, checking whether the video content is complete based on the similarity between the video image file and the video tag, the efficiency and accuracy of judging the integrity of the video content are improved.

Description

Video detection method and device and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a video detection method and device and electronic equipment.
Background
With the continuous development of the internet, mobile platforms such as smartphones have risen rapidly. Short videos carried on smartphones and tablets have become a new form of content distribution in recent years, and largely meet people's need to obtain more information more quickly and conveniently in daily life.
With the explosive growth of short-video data, quickly and accurately detecting whether video content is complete has become key for a video platform to ensure the quality of its content. Completeness can be judged by detecting the background sound of a video, but this approach is difficult to make accurate and its application scenarios are limited.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above-mentioned technical drawbacks, in particular the technical drawback of low video detection accuracy.
In a first aspect, a video detection method is provided, and the method includes:
extracting audio data and image data from a video to be detected, and acquiring a video label corresponding to the video to be detected;
acquiring a first detection result based on an audio frequency spectrum corresponding to the audio data; the first detection result is used for indicating whether the audio data of the video to be detected meets an audio integrity condition or not;
if the first detection result indicates that the audio data meets the audio integrity condition, acquiring a second detection result based on the image data and the video tag; the second detection result is used for indicating whether the video to be detected is complete.
In an optional embodiment of the first aspect, obtaining the first detection result based on an audio spectrum corresponding to the audio data includes:
determining a target audio frequency spectrum corresponding to the target frame signal from the audio frequency spectrum;
if the target audio frequency spectrum meets the condition that the amplitude variation range is larger than the amplitude fluctuation threshold value and the amplitude corresponding to the highest frequency in the target audio frequency spectrum is larger than the amplitude threshold value, the first detection result indicates that the audio data does not meet the audio integrity condition.
In an optional embodiment of the first aspect, before determining the target audio spectrum corresponding to the target frame signal from the audio spectrum, the method further includes:
performing framing processing on the audio data to obtain at least one frame signal;
and taking the frame signals of the last preset number in the sequence as target frame signals based on the sequence of the frame signals in the audio data.
In an optional embodiment of the first aspect, obtaining the second detection result based on the image data and the video tag comprises:
respectively acquiring a video feature vector corresponding to the image data and a label feature vector corresponding to the video label;
and acquiring a second detection result based on the similarity between the video feature vector and the label feature vector.
In an optional embodiment of the first aspect, the video feature vector comprises at least one sub-video feature vector; obtaining a second detection result based on the similarity between the video feature vector and the label feature vector, including:
respectively calculating the similarity between the label feature vector and each sub-video feature vector in the video feature vectors;
and if the similarity between the label feature vector and each sub-video feature vector in the video feature vectors meets a preset similarity condition, the second detection result indicates that the video to be detected is complete.
In an alternative embodiment of the first aspect, the preset similarity condition includes:
there is a first number of sub-video feature vectors whose similarity to the label feature vector is greater than a preset first similarity threshold.
In an alternative embodiment of the first aspect, the preset similarity condition further includes:
there is a second number of sub-video feature vectors whose similarity to the label feature vector is greater than a preset second similarity threshold, and the ratio of the second number to the total number of sub-video feature vectors is greater than a preset ratio; wherein the second similarity threshold is less than the first similarity threshold.
In an optional embodiment of the first aspect, further comprising:
and if the first detection result indicates that the audio data does not meet the preset audio integrity condition or the second detection result indicates that the video to be detected is incomplete, sending an abnormal prompt message to a user terminal corresponding to the video to be detected, wherein the abnormal prompt message is used for prompting the incomplete video to be detected.
In a second aspect, an apparatus for video detection is provided, the apparatus comprising:
the extraction module is used for extracting audio data and image data from the video to be detected and acquiring a video tag corresponding to the video to be detected;
the first detection module is used for acquiring a first detection result based on an audio frequency spectrum corresponding to the audio data; the first detection result is used for indicating whether the audio data of the video to be detected meets an audio integrity condition or not;
the second detection module is used for acquiring a second detection result based on the image data and the video label if the first detection result indicates that the audio data meets the audio integrity condition; the second detection result is used for indicating whether the video to be detected is complete.
In an optional embodiment of the second aspect, the apparatus further includes a processing module, specifically configured to:
performing framing processing on the audio data to obtain at least one frame signal;
and taking the frame signals of the last preset number in the sequence as target frame signals based on the sequence of the frame signals in the audio data.
In an optional embodiment of the second aspect, when the first detection module obtains the first detection result based on an audio frequency spectrum corresponding to the audio data, the first detection module is specifically configured to:
determining a target audio frequency spectrum corresponding to the target frame signal from the audio frequency spectrum;
if the target audio frequency spectrum meets the condition that the amplitude variation range is larger than the amplitude fluctuation threshold value and the amplitude corresponding to the highest frequency in the target audio frequency spectrum is larger than the amplitude threshold value, the first detection result indicates that the audio data does not meet the audio integrity condition.
In an optional embodiment of the second aspect, when the second detection module obtains the second detection result based on the image data and the video tag, the second detection module is specifically configured to:
respectively acquiring a video feature vector corresponding to the image data and a label feature vector corresponding to the video label;
and acquiring a second detection result based on the similarity between the video feature vector and the label feature vector.
In an optional embodiment of the second aspect, when the second detection module obtains the second detection result based on the similarity between the video feature vector and the tag feature vector, the second detection module is specifically configured to:
respectively calculating the similarity between the label feature vector and each sub-video feature vector in the video feature vectors;
and if the similarity between the label feature vector and each sub-video feature vector in the video feature vectors meets a preset similarity condition, the second detection result indicates that the video to be detected is complete.
In an alternative embodiment of the second aspect, the preset similarity condition includes:
there is a first number of sub-video feature vectors whose similarity to the label feature vector is greater than a preset first similarity threshold.
In an alternative embodiment of the second aspect, the preset similarity condition further includes:
there is a second number of sub-video feature vectors whose similarity to the label feature vector is greater than a preset second similarity threshold, and the ratio of the second number to the total number of sub-video feature vectors is greater than a preset ratio; wherein the second similarity threshold is less than the first similarity threshold.
In an optional embodiment of the second aspect, the apparatus further includes a sending module, specifically configured to:
and if the first detection result indicates that the audio data does not meet the preset audio integrity condition or the second detection result indicates that the video to be detected is incomplete, sending an abnormal prompt message to a user terminal corresponding to the video to be detected, wherein the abnormal prompt message is used for prompting the incomplete video to be detected.
In a third aspect, an electronic device is provided, which includes:
the video detection system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the video detection method of any one of the embodiments.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method for video detection in any of the above embodiments.
In a fifth aspect, embodiments of the present application provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the method provided in the embodiments of the first aspect or the second aspect.
The video detection method decodes the video to be detected, separates out audio data and image data, detects the integrity of the audio data, and determines whether the video content is complete based on the similarity between the video characteristic vector and the label characteristic vector if the audio data is complete. Whether the video is complete or not is judged by combining the audio data and the image data of the video, so that the accuracy of judging the integrity of the video content can be improved.
Furthermore, the integrity of the image data is judged on the basis of the integrity of the audio data, and the video can be preliminarily screened, so that the efficiency of judging the integrity of the video content is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a video detection method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating framing an audio file in a video detection method according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating a selected target frame signal in a video detection method according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a video detection method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a video detection method according to an embodiment of the present application;
fig. 6 is a schematic diagram of 3D convolution in a video detection method according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of a 3D convolution model structure in a video detection method according to an embodiment of the present application;
fig. 8 is a schematic flowchart of a 3D convolution in a video detection method according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a Transformer model encoder in a video detection method according to an embodiment of the present application;
fig. 10 is a schematic diagram of a Transformer model decoder in a video detection method according to an embodiment of the present application;
fig. 11 is a schematic flowchart of a video detection method according to an embodiment of the present application;
fig. 12 is a schematic flowchart of a video detection method according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a video detection apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of an electronic device for video detection according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
With the development of the internet, a fifth generation mobile communication technology (5G) is increasingly mentioned in various fields, and information transmission with high speed and low time delay is ubiquitous.
The improvement of the speed of the internet enables people to acquire information and share self more easily, for example, the casual inspiration of people in life can be uploaded to a network platform at any time, global users can read, comment and forward synchronously, and all ideas and viewpoints of individuals can be extended, stored, collided and exchanged in the global information network. The carrier for expressing self thought by people is not limited to characters and pictures, and the shot short video can be quickly uploaded to a network platform to interact with other users.
With more and more users participating in shooting and sharing of short videos, the video platform needs to quickly and accurately detect videos uploaded by the users so as to guarantee the content quality of the video platform. For example, a video platform needs to detect whether a video is complete, so as to prevent the video platform from having a situation that a title and video content are inconsistent and the video content is incomplete.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The method has the advantages that the characteristics of the video title and the video audio are respectively extracted through an artificial intelligence technology, and the correlation of the video title and the video audio is further judged, so that the effect of judging the integrity of the video is realized.
The present application relates to a MultiModal Machine Learning method (MMML). The source or form of each information may be referred to as a Modality (Modality), and the Modality may be defined widely, for example, two different languages may be regarded as two modalities, or a data set acquired under two different conditions may be regarded as two modalities. The multi-modal machine learning aims to realize the capability of processing and understanding multi-source modal information through a machine learning method, such as multi-modal learning among images, videos, audios and semantics.
The present application provides a video detection method, an apparatus, an electronic device, and a computer-readable storage medium, which are intended to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The video detection method provided by the embodiment of the application can be applied to a server and can also be applied to a terminal.
Those skilled in the art will understand that the "terminal" used herein may be a Mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an MID (Mobile Internet Device), etc.; a "server" may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
An embodiment of the present application provides a method for video detection, and as shown in fig. 1, the method includes:
step S101, extracting audio data and image data from a video to be detected, and acquiring a video label corresponding to the video to be detected.
In the embodiment of the present application, the audio data and the image data may refer to an audio file and an image file respectively, that is, an audio file and an image file separated from the video to be detected, or they may take other data forms. The video to be detected may be a video uploaded by a user that has not yet passed review: the user uploads a video to be published on the video platform and may fill in text information related to the video, after which the video review system performs video integrity detection.
In other embodiments, the video to be detected may also be selected from published videos, for example, at intervals, a fixed number of videos are randomly extracted from a published video library as the video to be detected, so as to implement an effect of reviewing the published video.
The video tag may refer to text information related to a video filled by a user when the video is uploaded, may refer to a video title, may refer to a video profile filled by the user, and may also refer to a video tag set by the user, such as a "sports" tag, a "food" tag, a dedicated tag for the user to participate in platform activities, and the like.
In the embodiment of the application, the video to be detected can be decoded to obtain the audio data and the image data of the video to be detected, so that the audio data and the image data can be conveniently analyzed respectively in the next step.
Wherein, the audio data and the image data in the video to be detected can be separated through the FFmpeg tool. FFmpeg is a set of open source computer programs that can be used to record, convert digital audio, video, and convert them into streams.
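As an illustration of this step, the audio data and image data could be separated by invoking the FFmpeg command line from Python. The following is a minimal sketch; the file names, sampling rate and frame rate are assumptions for illustration, not values prescribed by this application.

```python
import subprocess

def split_audio_and_frames(video_path: str, audio_path: str, frames_pattern: str) -> None:
    """Illustrative use of the FFmpeg CLI to separate audio data and image data."""
    # Extract the audio track as a WAV file (16 kHz mono is an assumed choice).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )
    # Dump the image data as individual frames (5 frames per second here).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=5", frames_pattern],
        check=True,
    )

# Example: split_audio_and_frames("video.mp4", "audio.wav", "frames/%06d.jpg")
```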
Step S102, acquiring a first detection result based on the audio frequency spectrum corresponding to the audio data, wherein the first detection result is used for indicating whether the audio data of the video to be detected meets the audio integrity condition.
In this embodiment of the application, the audio data may be processed to obtain an audio spectrum corresponding to the audio data, for example, the audio data is subjected to fourier transform to obtain an audio spectrum corresponding to the audio data.
The frequency spectrum is an abbreviation of frequency spectrum density and is a distribution curve of frequency. Complex oscillations can be decomposed into harmonic oscillations of different amplitudes and different frequencies, and the pattern of amplitude values of these harmonic oscillations arranged in frequency is called a frequency spectrum. The spectrum is widely applied to the aspects of acoustic, optical and radio technologies, and the research on signals is introduced from a time domain to a frequency domain, so that more intuitive knowledge is brought.
The audio spectrum can be analyzed to judge whether the audio data of the video to be detected is complete, so as to obtain the first detection result. Specifically, whether the video to be detected ends with clipped (swallowed) sound or terminates abnormally can be determined by checking whether the frequency range of the sound in the audio spectrum meets a preset audio integrity condition.
Meeting the preset audio integrity condition may mean that the frequency range of the audio spectrum corresponding to a short segment at the end of the audio is smaller than a preset frequency difference, or that the spectrum at the end of the audio data matches the frequency distribution of sound fading normally to silence.
If the first detection result indicates that the audio data does not meet the preset audio integrity condition, the video to be detected can be directly regarded as incomplete, the video review process is suspended, and the video to be detected is returned to the contributing user.
In the embodiment of the application, detecting the integrity of the audio data of the video to be detected takes very little time, and if the first detection result is that the audio is incomplete, no further detection is needed, which improves the efficiency of video integrity detection.
Step S103, if the first detection result indicates that the audio data meets the audio integrity condition, a second detection result is obtained based on the image data and the video label; and the second detection result is used for indicating whether the video to be detected is complete.
In this embodiment of the application, if the first detection result indicates that the audio data is complete, the integrity of the video to be detected may be further determined by the image data and the video tag. The image data may be image data separated from the video to be detected, and the content of the image data may be compared with the content of the video tag to determine whether the content in the video tag is completely displayed by the video to be detected. The second detection result can be used for indicating whether the video to be detected is complete or not, and if the video to be detected is incomplete, a prompt message can be sent to the user terminal.
In the above embodiment, the video to be detected is decoded, the audio data and the image data are separated, the integrity of the audio data is detected, and if the audio data is complete, whether the video content is complete is determined based on the similarity between the video feature vector and the tag feature vector. Whether the video is complete or not is judged by combining the audio data and the image data of the video, so that the accuracy of judging the integrity of the video content can be improved. Furthermore, the integrity of the image data is judged on the basis of the integrity of the audio data, and the video can be preliminarily screened, so that the efficiency of judging the integrity of the video content is improved.
In this embodiment of the present application, the processing of the audio data to obtain the audio spectrum corresponding to the audio data may include the following steps:
(1) and performing framing processing on the audio data to obtain at least one frame signal.
In the embodiment of the present application, the audio data may be firstly subjected to framing processing to obtain a frame signal, and then the frame signal is used as an input signal of fourier transform.
The Fourier transform can be used in the field of voice processing, can convert signals from a time domain to a frequency domain, better analyzes the characteristics of the signals, and generally has better analysis effect on stable signals. For more complex audio data, the audio data may be firstly subjected to framing processing, and a small segment of audio in a shorter time is taken as a frame signal, and the frame signal is approximately regarded as a stationary signal so as to be taken as an input signal of fourier transform.
In particular, the framing operation may be implemented by weighting with a movable finite-length window, i.e. multiplying the audio signal s (n) by the window function ω (n). The requirements for the window function are: the slopes at the two ends of the time window are small, so that the two ends of the edge of the window do not cause rapid change and smoothly transit to zero, the waveform of the intercepted frame signal is slowly reduced to zero, and the interception effect of the frame signal is reduced; there is a wide 3dB bandwidth in the frequency domain and a small side band maximum. The amplitude of the frame signal is gradually changed to 0 at two ends through the window function meeting the conditions, so that each peak on the frequency spectrum is thinner, and the frequency spectrum leakage is reduced.
Meanwhile, the two ends of the frame signal are weakened during windowing, so that adjacent frame signals can have mutually overlapped parts when the frame signal is intercepted, and the time difference between the starting positions of the two adjacent frames is called frame shift. The frame shift may be half of the frame length, or may be fixed to 10 ms, and the frame length may be 20-50 ms in general.
In the embodiment of the present application, the window functions that can be selected are: rectangular windows, hanning windows, hamming windows, blackman windows, and the like.
In one example, as shown in fig. 2, a section of audio data to be framed is obtained, the frame length is set to N and the frame shift to N/2; each signal frame is then determined as shown in the figure, where the overlap between the K-th frame and the (K+1)-th frame is N/2, the overlap between the (K+1)-th frame and the (K+2)-th frame is N/2, and so on.
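A minimal NumPy sketch of the framing and windowing described above, assuming a Hamming window and a frame shift of half the frame length; the function name and parameter values are illustrative only.

```python
import numpy as np

def frame_signal(audio: np.ndarray, frame_len: int, frame_shift: int) -> np.ndarray:
    """Split audio samples into overlapping, windowed frame signals."""
    window = np.hamming(frame_len)  # one of the window functions mentioned above
    num_frames = 1 + (len(audio) - frame_len) // frame_shift
    frames = np.empty((num_frames, frame_len))
    for k in range(num_frames):
        start = k * frame_shift
        frames[k] = audio[start:start + frame_len] * window
    return frames

# Example: 20 ms frames with 50% overlap at an assumed 16 kHz sampling rate.
# frames = frame_signal(samples, frame_len=320, frame_shift=160)
```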
(2) Performing Fourier transform on each frame signal in the at least one frame signal to obtain an audio frequency spectrum; the audio spectrum includes a short-time spectrum corresponding to each of the at least one frame signal.
In this embodiment of the present application, fourier transform may be performed on each frame signal in at least one frame signal, so as to obtain a short-time spectrum corresponding to each frame signal. The audio frequency spectrum corresponding to the audio data of the video to be detected may include a short-time frequency spectrum corresponding to each frame signal in the at least one frame signal.
Fourier transform can transform signals from time domain to frequency domain, so that the characteristics of the signals can be better analyzed, and generally, the analysis effect on stationary signals is better. The frame signal in this application may be approximately regarded as a stationary signal, and the stationary signal refers to a signal in which the distribution parameter or the distribution law does not change with time.
In the embodiment of the application, after the audio data and the image data are extracted from the video to be detected, the audio data can be framed to obtain at least one frame signal, and then at least one target frame signal can be selected from the frame signals. The target frame signals narrow the range of the audio spectrum to be analyzed, so that it can be judged more efficiently whether the audio ends with clipped sound or terminates abnormally. Specifically, the last preset number of frame signals in the ordering can be set as the target frame signals based on the ordering of the frame signals in the audio data. The ordering may refer to the playing order of the frame signals in the audio data; taking the last preset number of frame signals in this ordering as the target frame signals for focused detection makes it possible to effectively detect whether the audio ends with clipped sound or terminates abnormally.
In this embodiment, the target frame signal may refer to a frame signal corresponding to a last preset time of the audio data. For example, a frame signal corresponding to the last 5 seconds of audio data may be set as the target frame signal.
The number of the target frame signals may be a preset fixed value, for example, the last 100 frame signals with the playing time are set as the target frame signals. The number of the target frame signals may also be determined according to the total number of frame signals obtained after the audio data is framed, for example, 1% of the total number of the signal frames is taken, and then the whole number is set as the preset number of the target frame signals.
In one example, as shown in fig. 3, the frame length is set to 20 milliseconds and 10000 frame signals are obtained after the audio data is framed. Taking 1% of the total number of frame signals as the number of target frame signals, the preset number of target frame signals is 100, so the 100 frame signals whose playing time is latest in the audio data may be taken as the target frame signals. As shown in fig. 3, the target frame signals are the last 100 frame signals in the playing order of the frame signals in the audio data.
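A small sketch of selecting the target frame signals in this way, assuming the last 1% of frame signals (by playing order) are taken:

```python
import numpy as np

def select_target_frames(frames: np.ndarray, ratio: float = 0.01) -> np.ndarray:
    """Take the last `ratio` share of frame signals (by playing order) as target frames."""
    count = max(1, round(len(frames) * ratio))  # e.g. 100 target frames out of 10000 frames
    return frames[-count:]
```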
In this embodiment of the application, obtaining the first detection result based on the audio frequency spectrum corresponding to the audio data may include the following steps:
(1) and determining a target audio frequency spectrum corresponding to the target frame signal from the audio frequency spectrum.
The audio frequency spectrum may be obtained by performing fourier transform on each frame signal of the at least one frame signal, and the audio frequency spectrum may include a short-time frequency spectrum corresponding to the at least one frame signal. The target frame signal may be filtered from at least one frame signal, and the target audio spectrum of the audio spectrum may include a corresponding short-time spectrum of each target frame signal.
(2) If the target audio frequency spectrum meets the condition that the amplitude variation range is larger than the amplitude fluctuation threshold value and the amplitude corresponding to the highest frequency in the target audio frequency spectrum is larger than the amplitude threshold value, the first detection result indicates that the audio data does not meet the audio integrity condition.
Specifically, it may be detected whether each short-time frequency spectrum included in the target audio frequency spectrum satisfies that the amplitude variation range is greater than the amplitude fluctuation threshold and the amplitude corresponding to the highest frequency is greater than the amplitude threshold, and if so, the audio data may be considered not to satisfy the audio integrity condition. The maximum frequency in the target audio frequency spectrum may refer to the maximum frequency in the short-time frequency spectrums corresponding to all the target frame signals, or may be the maximum frequency of each short-time frequency spectrum in the short-time frequency spectrum corresponding to each target frame signal. The amplitude fluctuation threshold and the amplitude threshold of the target frame signal can be set according to the actual application scene.
The number of short-time spectra corresponding to the target frame signals that satisfy both conditions (the amplitude variation range is greater than the amplitude fluctuation threshold and the amplitude corresponding to the highest frequency is greater than the amplitude threshold) can then be counted, and it can be checked whether this number is greater than or equal to a preset spectrum-count threshold. For example, the spectrum-count threshold may be set to 4; if the number of short-time spectra in the target audio spectrum satisfying the condition is greater than or equal to 4, the first detection result may indicate that the audio data does not satisfy the audio integrity condition, that is, the audio data is incomplete.
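Under assumed threshold values, the check on the target audio spectrum could be sketched as follows: each target frame signal is Fourier-transformed, a short-time spectrum counts as abnormal when its amplitude range exceeds the fluctuation threshold and the amplitude at the highest frequency exceeds the amplitude threshold, and the first detection result is derived from the count of abnormal spectra.

```python
import numpy as np

def audio_is_complete(target_frames: np.ndarray,
                      amp_range_thresh: float = 0.5,   # assumed amplitude-fluctuation threshold
                      amp_thresh: float = 0.3,         # assumed amplitude threshold at the highest frequency
                      max_abnormal: int = 4) -> bool:
    """First detection result: False means the audio integrity condition is not met."""
    abnormal = 0
    for frame in target_frames:
        spectrum = np.abs(np.fft.rfft(frame))          # short-time spectrum of one target frame
        amp_range = spectrum.max() - spectrum.min()
        amp_at_highest_freq = spectrum[-1]             # amplitude of the highest frequency bin
        if amp_range > amp_range_thresh and amp_at_highest_freq > amp_thresh:
            abnormal += 1
    return abnormal < max_abnormal
```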
In some embodiments, the audio data may be firstly subjected to framing processing to obtain at least one frame signal, each frame signal is subjected to fourier transform to obtain an audio frequency spectrum, then a target frame signal may be selected from the at least one frame signal, and whether the audio data is complete or not may be determined according to a frequency spectrum corresponding to the target frame signal.
In other embodiments, after the audio data is subjected to framing processing to obtain the at least one frame signal, the target frame signal may be selected from the at least one frame signal, and then fourier transform may be performed on the target frame signal, and other frame signals except the target frame signal may not be subjected to fourier transform, so as to improve the efficiency of analyzing the audio data.
In the embodiment of the application, detecting the integrity of the audio data of the video to be detected takes very little time, and if the first detection result is that the audio is incomplete, no further detection is needed, which improves the efficiency of video integrity detection.
In an example, as shown in fig. 4, the process of obtaining the first detection result may be to frame the audio data into at least one frame signal, perform a Fourier transform on each frame signal to obtain its short-time spectrum, determine the target frame signals from the frame signals, and analyze the short-time spectra corresponding to the target frame signals to obtain the first detection result. If a short-time spectrum corresponding to a target frame signal fluctuates abnormally or ends at a high amplitude, the first detection result is that the audio data is incomplete. The short-time spectra of the frame signals other than the target frame signals may be used for other content detection; for example, they may be used for speech recognition of the audio data.
The extraction of MFCC (Mel-Frequency Cepstral Coefficient) features, the most basic and commonly used features in speech recognition, includes performing a Fourier transform on each frame signal to obtain its short-time spectrum, and then completing speech feature extraction through a Mel filter bank, a discrete cosine transform and other processes.
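If the remaining short-time spectra are reused for speech features, MFCC extraction is commonly delegated to a library; a brief sketch with librosa (an assumed choice of library, with an assumed file path and sampling rate):

```python
import librosa

# Load the separated audio file (path and sampling rate are assumptions).
samples, sr = librosa.load("audio.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13)  # (13, num_frames) MFCC matrix
```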
In an example, the process of obtaining the first detection result may be as shown in fig. 5, after the audio data is divided into at least one frame signal, first determining a preset number of target frame signals from the at least one frame signal. The preset number may be a preset fixed value, for example, the last 100 frame signals of the playing time are set as the target frame signals. The preset number may also be determined according to the total number of signal frames obtained after audio data is framed, for example, 1% of the total number of frame signals is taken, and the numerical value of the whole number is set as the preset number of target frame signals. After the target frame signals are determined, fourier transform is performed on each target frame signal to obtain a short-time spectrum corresponding to each target frame signal, and the short-time spectrum corresponding to each target frame signal is analyzed to obtain a first detection result. The frame signals are screened to obtain target frame signals, and then Fourier transform is carried out, so that the analysis efficiency of the integrity of audio data can be improved.
In this embodiment of the application, obtaining the second detection result based on the image data and the video tag may include: respectively acquiring a video feature vector corresponding to the image data and a label feature vector corresponding to the video label; and acquiring a second detection result based on the similarity between the video feature vector and the label feature vector.
Specifically, features can be extracted from image data to obtain a video feature vector; features can be extracted from the video tags to obtain tag feature vectors. The relevance between the image data of the video to be detected and the video label can be determined based on the video feature vector and the label feature vector, so that whether the video to be detected is complete or not can be determined.
In the embodiment of the application, the similarity between the video feature vector and the tag feature vector can be calculated to determine the correlation between the image data of the video to be detected and the video tag.
If the similarity is high, namely the image data of the video to be detected and the video tag have strong correlation, the second detection result can prompt the completeness of the video to be detected, and then other audits can be performed on the video to be detected, such as an audit process of sensitive content and the like; if the similarity is low, namely the correlation between the image data of the video to be detected and the video label is weak, the second detection result can prompt that the video to be detected is incomplete, and corresponding processing is carried out. For example, the video to be detected may be returned to the user terminal of the posting user, the user may be prompted to modify the video to be detected, the posting user may also be prompted to modify the video tag, so that the video tag is more conformable to the video content of the video to be detected, and the uploaded video content may not be modified.
In an embodiment of the present application, the video feature vector may include at least one sub-video feature vector; respectively obtaining a video feature vector corresponding to the image data and a tag feature vector corresponding to the video tag, which may include the following steps:
(1) performing frame extraction processing on the image data to obtain at least one video frame; grouping at least one video frame to obtain at least one group of sub-video frames;
(2) and respectively extracting the characteristics of each group of sub-video frames to obtain at least one sub-video characteristic vector.
In the embodiment of the application, after the image data and the audio data of the video to be detected are separated, frame extraction processing can be performed on the image data, so that the calculation amount is simplified, and the detection efficiency is improved.
Specifically, frame extraction may be performed on the image data with OpenCV or FFmpeg. OpenCV is an open-source, cross-platform computer vision and machine learning software library; FFmpeg is a set of open-source computer programs that can be used to record and convert digital audio and video and turn them into streams. Either tool can conveniently extract frames from the image data of the video to be detected to obtain at least one video frame. It is understood that video frames are essentially image data.
A fixed frame extraction interval can be set, for example, 5 frames are extracted on average per second, the frame extraction interval can be adjusted according to the duration of the video to be detected, and the frame extraction interval can be increased when the duration of the video to be detected is longer; when the duration of the video to be detected is short, the frame extraction interval can be reduced, and the setting can be flexibly carried out.
In the embodiment of the application, frame extraction may be performed on image data to obtain at least one video frame, then at least one video frame may be grouped to obtain at least one group of sub-video frames, and features of each group of sub-video frames in the at least one group of sub-video frames are extracted to obtain sub-video feature vectors corresponding to each group of sub-video frames.
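A sketch of frame extraction and grouping with OpenCV; the sampling rate of 5 frames per second and the group size of 7 frames are illustrative choices, not fixed by this application.

```python
import cv2

def extract_and_group_frames(video_path: str, fps_sample: int = 5, group_size: int = 7):
    """Sample frames from the video and split them into groups of sub-video frames."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or fps_sample
    step = max(1, int(round(native_fps / fps_sample)))  # keep roughly fps_sample frames per second
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    # Group consecutive sampled frames; any incomplete trailing group is dropped here.
    return [frames[i:i + group_size] for i in range(0, len(frames) - group_size + 1, group_size)]
```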
The features of each group of sub-video frames can be extracted through a trained deep learning model, and sub-video feature vectors corresponding to each group of sub-video frames are obtained. The deep learning model may primarily include 3D convolutional layers and 3D pooling layers.
The 3D pooling layer may use max pooling, which takes the maximum value of the feature points in a neighborhood, preserves texture features well, and records the index position of the maximum value to facilitate back propagation.
The 3D convolutional layer is formed by stacking a plurality of consecutive frames into a cube and then applying a 3D convolutional kernel in the cube. In this configuration, each feature map (map) in the convolutional layer is concatenated with a number of adjacent consecutive frames in the previous layer, thus capturing motion information. In one example, a convolution kernel of size 3 × 3 × 3 is convolved on a cube, as shown in fig. 6, resulting in an output.
In this embodiment of the application, the deep learning model may mainly include a 3D convolution layer and a 3D pooling layer, and an internal structure of the deep learning model may be formed by interleaving three 3D convolution layers and two pooling layers as shown in fig. 7, and feature extraction is performed on each group of sub-video frames, and the last layer is an embedding layer (embedding), and a large sparse vector is converted into a low-dimensional space with a feature relation retained, so as to obtain sub-video feature vectors corresponding to each group of sub-video frames.
In the embodiment of the present application, the trained deep learning model may be a 3D Convolutional Neural Network (CNN) architecture, as shown in fig. 8.
Each group of sub-video frames may be used as an input to the model; for example, each group may consist of 7 consecutive frames, which are then fed into the first layer of the model. The first layer of the model may be a hard-wired layer (H1) that processes the original frames to generate information for multiple channels, encoding prior knowledge of the features, which performs better than random initialization. Specifically, the hard-wired layer extracts five channels of information from each video frame in each group of sub-video frames: gray scale, gradients in the x and y directions, and optical flow in the x and y directions. The first three channels can be computed for each single frame, while the horizontal and vertical optical flow fields require two consecutive frames.
The second layer of the model may be a 3D convolution layer (C2) with the output of the hard-wired layer as the input to the layer, and performing convolution operations on the input five channels of information, respectively. At this level, if the number of feature maps is to be increased, convolution can be performed with different 3D convolution kernels.
The third layer of the model may be a downsampling layer (S3), which uses a maximum pooling operation, and the number of feature maps remains unchanged after downsampling but the resolution is reduced.
The fourth layer of the model may be a 3D convolution layer (C4), and similarly, to increase the number of feature maps, a variety of different convolution kernels may be used to convolve the feature maps.
The fifth layer of the model may be a downsampling layer (S5), and the downsampling operation is performed on each feature map, where the feature map of each channel is already small.
The sixth layer of the model may be a 3D convolution layer (C6). At this point the number of video frames in the time dimension is already small, so this layer may convolve only in the spatial dimensions, and each output feature map is reduced to a size of 1×1, that is, a single value; together these values form the sub-video feature vector for the group of sub-video frames. For example, if there are 128 feature maps in the C6 layer, the final feature vector has 128 dimensions.
After multi-layer convolution and down-sampling, each successive set of sub-video frames is converted into a multi-dimensional feature vector that captures the motion information of the input frame.
In the embodiment of the present application, the order between the 3D pooling and the 3D convolution layer in the deep learning model is not limited, and the setting of the convolution kernel size may also be adjusted according to the video resolution.
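A minimal PyTorch sketch in the spirit of the structure above, stacking 3D convolution and 3D pooling layers and ending with an embedding layer; the channel counts, kernel sizes and the 128-dimensional output are assumptions rather than the exact configuration described here.

```python
import torch
import torch.nn as nn

class SubVideoEncoder(nn.Module):
    """Maps a group of sub-video frames (N, C, T, H, W) to a sub-video feature vector."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # 3D max pooling over spatial dims
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool3d(1),                    # collapse time and space to 1x1x1
        )
        self.embedding = nn.Linear(64, feat_dim)        # embedding layer

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        x = self.backbone(clip).flatten(1)
        return self.embedding(x)

# Example: 7 RGB frames of size 112x112 -> one 128-dimensional sub-video feature vector.
# vec = SubVideoEncoder()(torch.randn(1, 3, 7, 112, 112))
```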
(3) And extracting the characteristics of the video label to obtain a label characteristic vector corresponding to the video label.
In the embodiment of the present application, the video tag may refer to text information related to a video filled by a user when the video is uploaded, may refer to a video title, may refer to a video profile filled by the user, and may also refer to a video tag set by the user, such as a "sports" tag, a "food" tag, a dedicated tag for the user to participate in platform activities, and the like.
In the embodiment of the application, the video tag may refer to a video title, word segmentation and vectorization may be realized through an Embedding (Embedding) layer in the deep learning field, a large sparse vector is converted into a low-dimensional space for retaining a semantic relationship, resource occupation is reduced, and an internal semantic relationship between words is retained.
In the embodiment of the application, after the video tag is input into the embedding layer, a word vector corresponding to each word in the video tag is obtained, and the features of the video tag can be further extracted through a Transformer model to obtain the feature vector of the video tag. The Transformer model can be roughly decomposed into an encoding component, a decoding component, and a connection layer therebetween, the encoding component can be composed of 6 encoders, and the decoding component can be composed of 6 decoders.
In the embodiment of the present application, the structure of the encoder may be as shown in fig. 9, X1 and X2 may be two word vectors input into the self-attention layer, Z1 and Z2 may be outputs corresponding to X1 and X2 after being processed by the self-attention layer, then Z1 and Z2 are transformed into a matrix form, summed with the residual block, and then normalized to obtain Z1 'and Z2', Z1 'and Z2' may be input into a feed-forward neural network, summed with the residual block and then normalized in the same way, and the process of encoding the word vectors by using a single-layer substructure is completed.
Wherein, the Self-Attention layer may be a Self-Attention mechanism (Self-Attention), and as each unit of the sequence to be processed is input, the Self-Attention focuses on all units of the whole input sequence, and the understanding of all relevant units is integrated into the unit being processed to assist the encoding process. The self-Attention mechanism can be a single-Head Attention mechanism or a Multi-Head Attention mechanism (Multi-Head Attention), and the Multi-Head Attention mechanism can increase the capability of a model for capturing different position information and can be associated with words at more positions; when mapping is carried out, the weight is not shared, the mapping is carried out on different subspaces, and the information covered by the finally spliced vector is wider. The number of heads of a multi-head attention mechanism is increased, and the long-distance information capturing capability of the model can be improved.
Feed-Forward neural Networks (FFN) may be unidirectional, multi-layer structures, where each layer contains a number of neurons, and each neuron may receive signals from a neuron in a previous layer and generate an output to a next layer. Specifically, the feedforward neural network can be realized by adopting a full connection layer, and the full connection layer can be formed by a two-layer neural network, and is subjected to linear transformation, then to ReLU nonlinear transformation, and finally to linear transformation.
The Normalization method includes various methods, such as Layer Normalization (LN), Batch Normalization (BN), and Weight Normalization (WN).
The Residual block can prevent the degradation in the deep neural Network training, the gradient disappearance problem caused by the depth increase in the deep neural Network is relieved, and the acquisition of the Residual block can be realized through a Residual Network (ResNet).
In the embodiment of the present application, the structure of a single decoder may be as shown in fig. 10 and may include: a masked multi-head attention mechanism (Masked Multi-Head Attention), a multi-head attention mechanism (Multi-Head Attention), and a feed-forward neural network (FFN, Feed-Forward Network).
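A sketch of encoding a video tag with an embedding layer followed by Transformer encoder layers, using PyTorch's built-in modules as stand-ins for the structure described above; the vocabulary size, number of attention heads and mean pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TagEncoder(nn.Module):
    """Turns a sequence of token ids for the video tag into a tag feature vector."""
    def __init__(self, vocab_size: int = 30000, dim: int = 128, num_layers: int = 6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)  # self-attention + FFN
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embedding(token_ids))  # (batch, seq_len, dim)
        return hidden.mean(dim=1)                         # pool to one tag feature vector per tag

# Example: tag_vec = TagEncoder()(torch.tensor([[12, 87, 5, 931]]))
```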
In the embodiment of the application, the video tag may refer to a video introduction; for such paragraph-style text, the video tag may be converted into a tag feature vector through word2vec. Specifically, a word2vec model can be pre-trained with a large number of video tags from the video library to obtain a word vector set, which contains the mapping between words and word vectors and can be used to convert words into their corresponding word vectors. The word vector corresponding to each word of the segmented video tag can be looked up in this word vector set, and the resulting word vectors are accumulated and normalized to obtain the tag feature vector of the video tag.
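For such paragraph-style tags, a sketch with gensim's Word2Vec (an assumed choice of library) that accumulates and normalizes the word vectors as described above:

```python
import numpy as np
from gensim.models import Word2Vec

def tag_vector(words: list[str], model: Word2Vec) -> np.ndarray:
    """Accumulate and normalize the word vectors of a segmented video tag."""
    vectors = [model.wv[w] for w in words if w in model.wv]
    summed = np.sum(vectors, axis=0)
    return summed / np.linalg.norm(summed)  # normalized tag feature vector

# Example (training corpus and segmentation are assumptions):
# model = Word2Vec(sentences=segmented_tags, vector_size=128, min_count=1)
# vec = tag_vector(["travel", "food", "street"], model)
```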
In an embodiment of the present application, the video feature vector comprises at least one sub-video feature vector; obtaining a second detection result based on the similarity between the video feature vector and the tag feature vector, which may include the following steps: respectively calculating the similarity between the label feature vector and each sub-video feature vector in the video feature vectors; and if the similarity between the label feature vector and each sub-video feature vector in the video feature vectors meets a preset similarity condition, the second detection result indicates that the video to be detected is complete.
In the embodiment of the present application, the similarity between the tag feature vector and each sub-video feature vector in the video feature vector may be calculated; for example, if there are 100 sub-video feature vectors, a similarity is calculated for each of the 100 sub-video feature vectors. Specifically, the similarity may be cosine similarity: the closer the included angle between two vectors is to 0, the closer the cosine value is to 1, which indicates that the two vectors are more similar, thereby reflecting the relevance between each group of sub-video frames and the video tag. The relevance between the video to be detected and the video tag can be determined based on the relevance between each group of sub-video frames and the video tag, so as to detect whether the video actually presents content related to the video tag.
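A small NumPy sketch of computing the cosine similarity between the tag feature vector and every sub-video feature vector:

```python
import numpy as np

def cosine_similarities(tag_vec: np.ndarray, sub_video_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between the tag vector and each sub-video feature vector."""
    tag = tag_vec / np.linalg.norm(tag_vec)
    subs = sub_video_vecs / np.linalg.norm(sub_video_vecs, axis=1, keepdims=True)
    return subs @ tag  # one similarity value per sub-video feature vector
```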
Specifically, if the image data of the video to be detected is strongly correlated with the video tag, the second detection result may be set to indicate that the video to be detected is complete, and the video may then go through other audits, such as an audit of sensitive content; if the correlation between the image data of the video to be detected and the video tag is weak, the second detection result may be set to indicate that the video to be detected is incomplete, and corresponding processing is performed.
In the embodiment of the present application, the preset similarity condition may include: the similarity between a first number of sub-video feature vectors and the tag feature vector is greater than a preset first similarity threshold.
Specifically, the first similarity threshold may be a relatively high value; when the similarity between a first number of sub-video feature vectors and the tag feature vector is greater than the first similarity threshold, the content of the video to be detected may be considered complete, so that the second detection result indicates that the content of the video to be detected is complete, and the video to be detected enters other detection stages or is published on the platform.
In the embodiment of the present application, the preset similarity condition may further include: determining a second number of sub-video feature vectors whose similarity to the tag feature vector is greater than a preset second similarity threshold, wherein the ratio of the second number to the total number of sub-video feature vectors is greater than a preset ratio; the second similarity threshold is less than the first similarity threshold.
Specifically, the second number is the number of sub-video feature vectors whose similarity to the tag feature vector is greater than the preset second similarity threshold. The ratio of this second number to the total number of sub-video feature vectors is then calculated; it represents the proportion of the video to be detected whose content is strongly correlated with the video tag, and if this ratio reaches the preset ratio, the video to be detected may be considered complete. Since the second similarity threshold only serves to count the proportion of tag-related content in the video to be detected, it may be smaller than the first similarity threshold.
In this embodiment of the application, for a video to be detected that satisfies both of the above conditions at the same time, the second detection result may be set to indicate that the content of the video to be detected is complete.
For example, the first similarity threshold may be set to 0.9, the second similarity threshold to 0.6, and the preset ratio to 0.6. Suppose the similarities between the 10 sub-video feature vectors of the video to be detected and the tag feature vector are: 0.22, 0.46, 0.61, 0.73, 0.84, 0.92, 0.83, 0.75, 0.69, 0.56. One sub-video feature vector has a similarity of 0.92 to the tag feature vector, which is greater than the first similarity threshold of 0.9; among the 10 sub-video feature vectors, 7 have a similarity to the tag feature vector greater than the second similarity threshold of 0.6, giving a ratio of 0.7 to the total number of sub-video feature vectors, which is greater than the preset ratio of 0.6. Both conditions are met, so the second detection result can be set to indicate that the content of the video to be detected is complete.
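Using the numbers of this example, the two conditions can be checked in a few lines; the code below is only illustrative and assumes the first number is 1 here.

```python
sims = [0.22, 0.46, 0.61, 0.73, 0.84, 0.92, 0.83, 0.75, 0.69, 0.56]
first_threshold, second_threshold, preset_ratio = 0.9, 0.6, 0.6

# condition (1): at least a first number (here 1) of similarities above the first threshold
cond1 = sum(s > first_threshold for s in sims) >= 1            # 0.92 > 0.9 -> True

# condition (2): the share of similarities above the second threshold exceeds the preset ratio
ratio = sum(s > second_threshold for s in sims) / len(sims)    # 7 / 10 = 0.7
cond2 = ratio > preset_ratio                                   # 0.7 > 0.6 -> True

content_complete = cond1 and cond2                             # second detection result: complete
```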
In this embodiment of the present application, the image data of the video to be detected may first be extracted and then grouped to obtain at least one group of sub-video frames, and the correlation-ratio condition may instead be stated as: determine the sub-video feature vectors whose similarity to the tag feature vector is greater than the preset second similarity threshold, calculate the proportion of the image data occupied by the sub-video frames corresponding to those qualifying sub-video feature vectors, and if this proportion is greater than the preset ratio, set the second detection result to indicate that the content of the video to be detected is complete.
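When the groups of sub-video frames are of unequal size, this alternative condition can be computed over frame counts rather than over the number of groups; a minimal sketch, assuming the number of frames in each group is known, is:

```python
def frame_ratio(sims, frames_per_group, second_threshold=0.6):
    """Share of the image data (by frame count) covered by groups of sub-video
    frames whose similarity to the tag feature vector exceeds the threshold."""
    qualified = sum(n for s, n in zip(sims, frames_per_group) if s > second_threshold)
    return qualified / sum(frames_per_group)
```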
In the embodiment of the application, if the first detection result indicates that the audio data does not meet the preset audio integrity condition, or the second detection result indicates that the video to be detected is incomplete, an abnormal prompt message is sent to the user terminal corresponding to the video to be detected, and the abnormal prompt message is used for prompting that the video to be detected is incomplete.
For example, if the first detection result indicates that the audio data does not meet the preset audio integrity condition, the first detection result may be sent to the user terminal corresponding to the video to be detected; and receiving the new video to be detected returned by the user terminal.
If the second detection result is that the content of the video to be detected is incomplete, the second detection result can be sent to the user terminal corresponding to the video to be detected; and receiving a new video label returned by the user terminal.
Specifically, the audio integrity detection process takes little time. If the first detection result indicates that the audio data does not meet the preset audio integrity condition, no further detection needs to be performed: the video to be detected is directly regarded as incomplete, the first detection result is sent to the user terminal corresponding to the video to be detected, and the server waits to receive a new video to be detected returned by the user terminal, which improves the efficiency of video content detection.
If the second detection result indicates that the content of the video to be detected is incomplete, the second detection result can be sent to the user terminal corresponding to the video to be detected. It prompts the posting user that the content of the video to be detected is incomplete and asks the user to enter a new video tag, after which the server receives the new video tag returned by the user terminal.
In this embodiment of the application, after the server generates the first detection result, it may generate first prompt information based on that result to indicate more clearly that the video to be detected is incomplete and to prompt the posting user corresponding to the video to be detected to upload a new video. The server then receives the new video to be detected returned by the user terminal and runs the video content integrity detection process on it again. Similarly, after the server generates the second detection result, second prompt information is generated based on it to prompt the user that the video content is incomplete because its relevance to the video tag is weak, and to ask the posting user to change the video tag. After receiving the new video tag sent by the user terminal, the server does not need to analyze the video content again; it only needs to extract features from the new video tag to obtain a new tag feature vector, calculate the similarity between the previously extracted video feature vector of the video to be detected and the new tag feature vector, and perform the video content integrity detection again.
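A minimal sketch of this re-check, assuming the sub-video feature vectors extracted in the first pass are cached on the server and the new tag feature vector has already been computed, is:

```python
import numpy as np

def recheck_with_new_tag(cached_sub_video_vecs, new_tag_vec):
    """Re-run only the tag branch: compare the feature vector of the newly
    submitted tag against the sub-video feature vectors cached from the
    first detection pass; no video decoding or feature extraction is repeated."""
    sims = []
    for v in cached_sub_video_vecs:
        denom = np.linalg.norm(v) * np.linalg.norm(new_tag_vec)
        sims.append(float(np.dot(v, new_tag_vec) / denom) if denom > 0 else 0.0)
    return sims  # evaluated against the same preset similarity condition as before
```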
In order to explain the video detection method of the present application more clearly, the video detection method will be further explained with reference to specific examples.
In one embodiment, the present application provides a video detection method, as shown in fig. 11, comprising the steps of:
step S1101, acquiring a video to be detected and a video tag corresponding to the video to be detected; the video tag may be a video title or a brief summary of the video content;
step S1102, extracting audio data and image data from a video to be detected;
step S1103, performing framing processing on the audio data to obtain at least one frame signal; wherein, the frame length can be 20-50 ms.
Step S1104, performing Fourier transform on each frame signal in at least one frame signal to obtain an audio frequency spectrum; the audio frequency spectrum comprises a short-time frequency spectrum corresponding to each frame signal in at least one frame signal;
step S1105, determining the playing time of each frame signal in the audio data, taking the frame signals with the latest playing times as target frame signals, and determining, for each target frame signal, its corresponding short-time spectrum in the audio frequency spectrum;
step S1106, detecting short-time frequency spectrums corresponding to at least one target frame signal, respectively, to obtain a first detection result, where the first detection result is used to indicate integrity of audio data;
step S1107, determining whether the first detection result indicates that the audio data meets the preset audio integrity condition; if yes, go to step S1109, otherwise go to step S1108;
step S1108, sending the first detection result to a user terminal corresponding to the video to be detected; receiving a new video to be detected returned by the user terminal, and entering step S1102;
step S1109, respectively obtaining a video feature vector corresponding to the image data and a label feature vector corresponding to the video label; wherein the video feature vector comprises at least one sub-video feature vector; the characteristics of the image data can be extracted through 3D convolution, and the characteristics of the video label can be extracted through a transformer model;
step S1110, respectively calculating the similarity between the label feature vector and each sub-video feature vector in the video feature vectors;
step S1111, acquiring a second detection result based on the similarity between the label feature vector and each sub-video feature vector in the video feature vectors;
step S1112, determining whether the second detection result is that the video content to be detected is complete, if so, entering step S1114, otherwise, entering step S1113;
step S1113, sending the second detection result to the user terminal corresponding to the video to be detected; after receiving a new video tag returned by the user terminal, the method goes to step S1109;
in step S1114, the content integrity check process ends and the process enters another check process.
In an example, as shown in fig. 12, a video to be detected posted by a user and the video title of that video may be obtained; the video to be detected is decoded, and audio data and image data are separated. The audio data may be checked first: the audio data is framed and Fourier-transformed to obtain the short-time spectrum corresponding to each frame signal, and the short-time spectra corresponding to the frame signals in the last 5 seconds of the audio data are analyzed. If any of these short-time spectra shows abnormal fluctuation at its end, or the spectrum ends at a high point, the audio data is considered incomplete and the detection result of the video to be detected is that the video content is incomplete. In that case the detection result is sent to the posting user, the user is prompted to upload a new video to be detected, and content integrity detection is carried out again.
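A minimal sketch of the framing and short-time spectrum step for the tail of the audio is given below; the 25 ms frame length, the Hamming window and the mono PCM input are assumptions, and the decision rule applied to these spectra is sketched later, after the description of the first detection module.

```python
import numpy as np

def tail_short_time_spectra(samples, sample_rate, frame_ms=25, tail_seconds=5):
    """Frame the last few seconds of the audio and return one magnitude
    spectrum (short-time spectrum) per frame signal."""
    samples = np.asarray(samples, dtype=np.float32)    # mono PCM samples
    frame_len = int(sample_rate * frame_ms / 1000)
    tail = samples[-int(sample_rate * tail_seconds):]
    spectra = []
    for i in range(len(tail) // frame_len):
        frame = tail[i * frame_len:(i + 1) * frame_len]
        windowed = frame * np.hamming(frame_len)       # reduce spectral leakage
        spectra.append(np.abs(np.fft.rfft(windowed)))  # magnitude spectrum of the frame
    return spectra
```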
If the detection result of the audio data is that the audio data is complete, the next detection process is entered. Frames are extracted from the image data and grouped to obtain at least one group of sub-video frames, and features are extracted from the at least one group of sub-video frames through a trained 3D convolution model to obtain a video feature vector. The video feature vector comprises a plurality of sub-video feature vectors, each obtained from one group of sub-video frames. Feature extraction is performed on the video title of the video to be detected through a transformer model to obtain a title feature vector corresponding to the video title. The correlation between the title content and the video content can be judged by calculating the cosine similarity between the title feature vector and each sub-video feature vector. Specifically, if the similarities between the title feature vector and the sub-video feature vectors satisfy at least one of the following items, the detection result of the video to be detected can be set to indicate that the content of the video to be detected is complete:
(1) the similarity between at least one sub-video feature vector and the title feature vector is larger than a preset first similarity threshold value;
(2) determining a second number of sub-video feature vectors with the similarity to the title feature vector larger than a preset second similarity threshold, wherein the ratio of the second number to the total number of the sub-video feature vectors is larger than a preset ratio; wherein the first similarity threshold is greater than the second similarity threshold.
Condition (2) may be replaced with: determine the sub-video feature vectors whose similarity to the title feature vector is greater than the preset second similarity threshold, and calculate the proportion of the image data occupied by the sub-video frames corresponding to those qualifying sub-video feature vectors; if this proportion is greater than the preset ratio, the detection result of the video to be detected is set to complete, the integrity detection process ends, and the video enters other detection processes or, having passed the audit, is published on the platform. Because the image data of a video carries rich information, determining its similarity to the video title allows a more accurate judgment of whether the video content is complete.
If the detection result of the video to be detected is that its content is incomplete, the detection result can be sent to the posting user to prompt the user to fill in a new video title. Since the video to be detected itself is not changed at this point, the already analyzed video data can be reused, which improves the efficiency of video content integrity detection.
In the video detection method in the embodiment of the application, a video to be detected is decoded, audio data and image data are separated, the integrity of the audio data is detected, and if the audio data is complete, whether video content is complete or not is determined based on the similarity between the video feature vector and the label feature vector. Whether the video is complete or not is judged by combining the audio data and the image data of the video, so that the accuracy of judging the integrity of the video content can be improved.
Furthermore, the integrity of the image data is judged on the basis of the integrity of the audio data, and the video can be preliminarily screened, so that the efficiency of judging the integrity of the video content is improved.
An embodiment of the present application provides a video detection apparatus, and as shown in fig. 13, the video detection apparatus 130 may include: an extraction module 1301, a first detection module 1302, and a second detection module 1303, wherein,
the extraction module 1301 is used for extracting audio data and image data from a video to be detected and acquiring a video tag corresponding to the video to be detected;
a first detection module 1302, configured to obtain a first detection result based on an audio frequency spectrum corresponding to the audio data; the first detection result is used for indicating whether the audio data of the video to be detected meets an audio integrity condition or not;
the second detection module 1303 is configured to, if the first detection result indicates that the audio data meets the audio integrity condition, obtain a second detection result based on the image data and the video tag; the second detection result is used for indicating whether the video to be detected is complete.
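By way of illustration, the three modules compose into a short pipeline such as the following sketch; extract, check_audio and check_content are placeholders standing in for the extraction module, the first detection module and the second detection module, and are not functions defined by this application.

```python
from typing import Callable, Tuple

def detect_video(video_path: str, video_tag: str,
                 extract: Callable[[str], Tuple[object, object]],
                 check_audio: Callable[[object], bool],
                 check_content: Callable[[object, str], bool]) -> str:
    """Extraction module -> first detection module -> second detection module."""
    audio, images = extract(video_path)               # extraction module 1301
    if not check_audio(audio):                        # first detection result
        return "incomplete: audio integrity condition not met"
    if not check_content(images, video_tag):          # second detection result
        return "incomplete: weak relevance to the video tag"
    return "complete"
```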
The video detection device decodes the video to be detected, separates out audio data and image data, detects the integrity of the audio data, and determines whether the video content is complete based on the similarity between the video characteristic vector and the label characteristic vector if the audio data is complete. Whether the video is complete or not is judged by combining the audio data and the image data of the video, so that the accuracy of judging the integrity of the video content can be improved.
Furthermore, the integrity of the image data is judged on the basis of the integrity of the audio data, and the video can be preliminarily screened, so that the efficiency of judging the integrity of the video content is improved.
In an embodiment of the present application, the apparatus further includes a processing module, specifically configured to:
performing framing processing on the audio data to obtain at least one frame signal;
and taking, based on the order of the frame signals in the audio data, the last preset number of frame signals in that order as target frame signals.
In this embodiment of the application, when the first detection module 1302 obtains the first detection result based on the audio frequency spectrum corresponding to the audio data, it is specifically configured to:
determining a target audio frequency spectrum corresponding to the target frame signal from the audio frequency spectrum;
if the target audio frequency spectrum meets the condition that the amplitude variation range is larger than the amplitude fluctuation threshold value and the amplitude corresponding to the highest frequency in the target audio frequency spectrum is larger than the amplitude threshold value, the first detection result indicates that the audio data does not meet the audio integrity condition.
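A minimal sketch of this check on a target short-time spectrum is given below; the concrete threshold values are assumptions chosen only for illustration.

```python
import numpy as np

def target_spectrum_abnormal(target_spectrum, fluctuation_threshold=0.5, amplitude_threshold=0.3):
    """The target frame is abnormal when the amplitude variation range exceeds
    the amplitude fluctuation threshold and the amplitude at the highest
    frequency is still above the amplitude threshold (an abrupt ending)."""
    s = np.asarray(target_spectrum, dtype=np.float32)
    s = s / (s.max() + 1e-8)                  # normalize so the thresholds are comparable
    variation_range = s.max() - s.min()
    ending_amplitude = s[-1]                  # amplitude at the highest frequency bin
    return variation_range > fluctuation_threshold and ending_amplitude > amplitude_threshold
```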
In this embodiment of the application, when the second detection module 1303 obtains the second detection result based on the image data and the video tag, it is specifically configured to:
respectively acquiring a video feature vector corresponding to the image data and a label feature vector corresponding to the video label;
and acquiring a second detection result based on the similarity between the video feature vector and the label feature vector.
In this embodiment of the application, when the second detection module 1303 obtains the second detection result based on the similarity between the video feature vector and the tag feature vector, it is specifically configured to:
respectively calculating the similarity between the label feature vector and each sub-video feature vector in the video feature vectors;
and if the similarity between the label feature vector and each sub-video feature vector in the video feature vectors meets a preset similarity condition, the second detection result indicates that the video to be detected is complete.
In the embodiment of the present application, the preset similarity condition includes:
the similarity between a first number of sub-video feature vectors and the tag feature vector is greater than a preset first similarity threshold.
In the embodiment of the present application, the preset similarity condition further includes:
determining a second number of sub-video feature vectors with the similarity to the label feature vector larger than a preset second similarity threshold, wherein the ratio of the second number to the total number of the sub-video feature vectors is larger than a preset ratio; wherein the second similarity threshold is less than the first similarity threshold.
In an embodiment of the present application, the apparatus further includes a sending module, specifically configured to:
and if the first detection result indicates that the audio data does not meet the preset audio integrity condition or the second detection result indicates that the video to be detected is incomplete, sending an abnormal prompt message to a user terminal corresponding to the video to be detected, wherein the abnormal prompt message is used for prompting the incomplete video to be detected.
In an alternative embodiment, there is provided an electronic device, as shown in fig. 14; the electronic device 4000 shown in fig. 14 includes: a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. In addition, in practical applications the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 14, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application program codes (computer programs) for executing the present scheme, and is controlled by the processor 4001 to execute. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
The electronic devices include, but are not limited to, mobile terminals such as mobile phones, notebook computers, PADs, etc., and fixed terminals such as digital TVs, desktop computers, etc.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments. Compared with the prior art, whether the video content is complete or not is detected by detecting whether the audio data is complete or not firstly and based on the similarity of the video image data and the video label on the basis of the completeness of the audio data, so that the efficiency and the accuracy of judging the completeness of the video content are improved.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the following:
extracting audio data and image data from a video to be detected, and acquiring a video label corresponding to the video to be detected; acquiring a first detection result based on an audio frequency spectrum corresponding to the audio data; the first detection result is used for indicating whether the audio data of the video to be detected meets an audio integrity condition or not; if the first detection result indicates that the audio data meets the audio integrity condition, acquiring a second detection result based on the image data and the video tag; the second detection result is used for indicating whether the video to be detected is complete.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on the order in which the steps are performed, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.

Claims (10)

1. A video detection method, comprising:
extracting audio data and image data from a video to be detected, and acquiring a video tag corresponding to the video to be detected;
acquiring a first detection result based on an audio frequency spectrum corresponding to the audio data; the first detection result is used for indicating whether the audio data of the video to be detected meets an audio integrity condition;
if the first detection result indicates that the audio data meets an audio integrity condition, acquiring a second detection result based on the image data and the video tag; and the second detection result is used for indicating whether the video to be detected is complete.
2. The video detection method according to claim 1, wherein the obtaining a first detection result based on an audio spectrum corresponding to the audio data comprises:
determining a target audio frequency spectrum corresponding to a target frame signal from the audio frequency spectrum;
if the target audio frequency spectrum meets the condition that the amplitude variation range is larger than the amplitude fluctuation threshold value and the amplitude corresponding to the highest frequency in the target audio frequency spectrum is larger than the amplitude threshold value, the first detection result indicates that the audio data does not meet the audio integrity condition.
3. The video detection method of claim 2, wherein before determining the target audio spectrum corresponding to the target frame signal from the audio spectrum, the method further comprises:
performing framing processing on the audio data to obtain at least one frame signal;
and taking the last frame signals of a preset number in the sequence as the target frame signals based on the sequence of the frame signals in the audio data.
4. The video detection method of claim 1, wherein the obtaining a second detection result based on the image data and the video tag comprises:
respectively acquiring a video feature vector corresponding to the image data and a label feature vector corresponding to the video label;
and acquiring a second detection result based on the similarity between the video feature vector and the label feature vector.
5. The video detection method of claim 4, wherein the video feature vector comprises at least one sub-video feature vector; the obtaining a second detection result based on the similarity between the video feature vector and the tag feature vector includes:
respectively calculating the similarity between the label feature vector and each sub-video feature vector in the video feature vectors;
and if the similarity between the label feature vector and each sub-video feature vector in the video feature vectors meets a preset similarity condition, the second detection result indicates that the video to be detected is complete.
6. The video detection method according to claim 5, wherein the preset similarity condition comprises:
and the similarity between a first number of the sub-video feature vectors and the label feature vectors is larger than a preset first similarity threshold.
7. The video detection method according to claim 5 or 6, wherein the preset similarity condition further comprises:
determining a second number of sub-video feature vectors with similarity to the label feature vector greater than a preset second similarity threshold, wherein a ratio of the second number to the total number of the sub-video feature vectors is greater than a preset ratio; wherein the second similarity threshold is less than the first similarity threshold.
8. The video detection method of claim 1, further comprising:
if the first detection result indicates that the audio data does not meet a preset audio integrity condition, or the second detection result indicates that the video to be detected is incomplete, sending an abnormal prompt message to a user terminal corresponding to the video to be detected, wherein the abnormal prompt message is used for prompting that the video to be detected is incomplete.
9. An apparatus for video inspection, comprising:
the extraction module is used for extracting audio data and image data from the video to be detected and acquiring a video tag corresponding to the video to be detected;
the first detection module is used for acquiring a first detection result based on the audio frequency spectrum corresponding to the audio data; the first detection result is used for indicating whether the audio data of the video to be detected meets an audio integrity condition;
the second detection module is used for acquiring a second detection result based on the image data and the video tag if the first detection result indicates that the audio data meets an audio integrity condition; and the second detection result is used for indicating whether the video to be detected is complete.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the video detection method of any of claims 1-7 when executing the program.
CN202110430813.4A 2021-04-21 2021-04-21 Video detection method and device and electronic equipment Pending CN113761589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110430813.4A CN113761589A (en) 2021-04-21 2021-04-21 Video detection method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110430813.4A CN113761589A (en) 2021-04-21 2021-04-21 Video detection method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113761589A true CN113761589A (en) 2021-12-07

Family

ID=78787027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110430813.4A Pending CN113761589A (en) 2021-04-21 2021-04-21 Video detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113761589A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120177249A1 (en) * 2011-01-11 2012-07-12 Avi Levy Method of detecting logos, titles, or sub-titles in video frames
CN106847307A (en) * 2016-12-21 2017-06-13 广州酷狗计算机科技有限公司 Signal detecting method and device
CN109446990A (en) * 2018-10-30 2019-03-08 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN109508406A (en) * 2018-12-12 2019-03-22 北京奇艺世纪科技有限公司 A kind of information processing method, device and computer readable storage medium
US20190303499A1 (en) * 2018-03-28 2019-10-03 Cbs Interactive Inc. Systems and methods for determining video content relevance
CN110737801A (en) * 2019-10-14 2020-01-31 腾讯科技(深圳)有限公司 Content classification method and device, computer equipment and storage medium
US20200065526A1 (en) * 2018-08-21 2020-02-27 Paypal, Inc. Systems and methods for detecting modifications in a video clip
CN111581437A (en) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 Video retrieval method and device
WO2020248308A1 (en) * 2019-06-12 2020-12-17 腾讯音乐娱乐科技(深圳)有限公司 Audio pop detection method and apparatus, and storage medium
CN112150457A (en) * 2020-10-09 2020-12-29 北京小米移动软件有限公司 Video detection method, device and computer readable storage medium
CN112203122A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Artificial intelligence-based similar video processing method and device and electronic equipment
CN112418011A (en) * 2020-11-09 2021-02-26 腾讯科技(深圳)有限公司 Method, device and equipment for identifying integrity of video content and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gu Xuehui: "Violent video content recognition based on an information fusion algorithm", Journal of University of Jinan (Natural Science Edition), no. 03, 15 April 2019 (2019-04-15) *
Zheng Jianli; Jin Jiawei; Mai Longhua: "Research on high-definition video content integrity technology", Information Technology, no. 10, 25 October 2017 (2017-10-25) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination