CN113836966A - Video detection method, device, equipment and storage medium

Video detection method, device, equipment and storage medium

Info

Publication number
CN113836966A
Authority
CN
China
Prior art keywords
video
information
model
video playing
detection model
Prior art date
Legal status
Pending
Application number
CN202010513757.6A
Other languages
Chinese (zh)
Inventor
朱艳宏
李琴
李唯源
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN202010513757.6A
Publication of CN113836966A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video detection method, a video detection device, video detection equipment and a storage medium. The video detection method comprises the following steps: obtaining video information of a video playing terminal, wherein the video information comprises at least one of the following: image information, audio information and interactive information used for indicating interactive behaviors in the video playing process; and identifying video abnormal forms in the video playing process based on the video information and a video detection model, wherein the identifying comprises at least one of the following: identifying whether a first abnormal form related to images exists in the video playing process based on the image information in the video information and a first video detection model; identifying whether a second abnormal form related to sound exists in the video playing process based on the audio information in the video information and a second video detection model; and identifying whether a third abnormal form related to interaction exists in the video playing process based on the interactive information in the video information and a third video detection model.

Description

Video detection method, device, equipment and storage medium
Technical Field
The present invention relates to the field of video detection, and in particular, to a video detection method, apparatus, device, and storage medium.
Background
In the related art, in order to detect video abnormal forms (such as image stuttering, audio stuttering, mosaic, black screen, screen artifacts, color abnormality, user-initiated pause, and the like) during video playing on a video playing terminal, detection is usually performed by embedding an SDK (Software Development Kit) package or by capturing the video stream. Embedding an SDK package requires obtaining development rights for the video APP (application); capturing the video stream does not work for encrypted video streams, and video-stream-based detection requires thresholds to be set manually from experience, so the detection results are inaccurate.
Disclosure of Invention
In view of this, embodiments of the present invention provide a video detection method, an apparatus, a device, and a storage medium, which aim to implement detection of video abnormal forms without acquiring development rights or analysis rights of a video APP.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a video detection method, which comprises the following steps:
the method comprises the steps of obtaining video information of a video playing terminal, wherein the video information comprises at least one of the following: image information, audio information and interactive information used for indicating interactive behaviors in the video playing process;
identifying video abnormal forms in the video playing process based on the video information and a video detection model, wherein the identifying comprises at least one of the following:
identifying whether a first abnormal form related to the image exists in the video playing process or not based on the image information in the video information and a first video detection model;
identifying whether a second abnormal form related to sound exists in the video playing process or not based on the audio information in the video information and a second video detection model;
identifying whether a third abnormal form related to interaction exists in the video playing process based on the interaction information in the video information and a third video detection model;
the first video detection model, the second video detection model and the third video detection model are generated through federated learning training based on the data sets of at least two video playing terminals.
An embodiment of the present invention further provides a video detection apparatus, including:
the acquisition module is used for acquiring video information of the video playing terminal, wherein the video information comprises at least one of the following: image information, audio information and interactive information used for indicating interactive behaviors in the video playing process;
the detection module is used for identifying video abnormal forms in the video playing process based on the video information and a video detection model, wherein the identifying comprises at least one of the following:
identifying whether a first abnormal form related to the image exists in the video playing process or not based on the image information in the video information and a first video detection model;
identifying whether a second abnormal form related to sound exists in the video playing process or not based on the audio information in the video information and a second video detection model;
and identifying whether a third abnormal form related to interaction exists in the video playing process based on the interaction information in the video information and a third video detection model.
An embodiment of the present invention further provides a video detection device, including: a processor and a memory for storing a computer program capable of running on the processor, wherein the processor, when running the computer program, is adapted to perform the steps of the method according to any of the embodiments of the present invention.
The embodiment of the invention also provides a storage medium, wherein a computer program is stored on the storage medium, and when the computer program is executed by a processor, the steps of the method of any embodiment of the invention are realized.
According to the technical scheme provided by the embodiment of the invention, video information of a video playing terminal is obtained, and video abnormal forms in the video playing process are identified based on the video information and a video detection model, wherein the video detection model comprises at least one of the following: a first video detection model for image anomaly detection, a second video detection model for sound anomaly detection, and a third video detection model for interaction anomaly detection. Video anomaly detection can therefore be performed based on at least one of the image information, the audio information and the interactive information in the video information of the video playing terminal, so that video abnormal forms in the video playing process can be detected without acquiring the development rights or analysis rights of the video APP.
Drawings
FIG. 1 is a flowchart illustrating a video detection method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a video morphology monitoring system based on multi-channel fusion in an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a self-evolution training process of a multi-channel intelligent fusion module according to an embodiment of the present invention;
FIG. 4 is a schematic view of a process of video morphology monitoring based on multi-channel intelligent fusion in an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating the training principle of the image monitoring module in an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating the training principle of the audio monitoring module in an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating the training principle of the interaction monitoring module in an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a video detection apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a video detection device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
An embodiment of the present invention provides a video detection method, as shown in fig. 1, the method includes:
step 101, obtaining video information of a video playing terminal, wherein the video information includes at least one of the following: image information, audio information and interactive information used for indicating interactive behaviors in the video playing process;
Step 102, identifying the video abnormal form in the video playing process based on the video information and the video detection model.
In the embodiment of the invention, the video detection method can be applied to the video playing terminal, and the video playing terminal detects the video abnormal form in the local video playing process; the video detection method can also be applied to a cloud or a server and is used for detecting video abnormity of the video playing terminal covered by the cloud or the server.
Here, the obtained video information of the video playing terminal is the video information cached on the video playing terminal side. For example, the video information cached in the graphics card and the sound card may be acquired through an Application Programming Interface (API) of the video playing terminal, which avoids the problem that an SDK cannot be embedded to acquire data for services the detector does not own, and also avoids the problem that encrypted traffic cannot be analyzed with a packet capture tool.
Here, the video information cached by the video playing terminal includes image information, audio information, and interactive information. The interactive information may be represented as icon information generated by the video playing terminal in response to the interactive behavior (such as pause, fast forward, etc.) of the user. The image information and the interaction information can be obtained by scheduling image cache information of a display card in the video playing terminal, and the audio information can be obtained by scheduling sound cache information of a sound card in the video playing terminal. Based on this, in some embodiments, the step 101 obtains the video information of the video playing terminal, including at least one of the following:
acquiring image information based on image cache information of a video playing terminal;
acquiring audio information based on audio cache information of a video playing terminal;
and acquiring the interactive information based on the image cache information of the video playing terminal.
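As a concrete illustration of how such cached information might be read, the sketch below grabs a display frame and a short audio window on the terminal side. The patent does not name a specific terminal API, so the `mss` and `sounddevice` libraries are used here purely as illustrative stand-ins for the graphics-card and sound-card cache interfaces.

```python
# Minimal sketch (not from the patent) of reading cached image and audio data
# on the playback terminal. `mss` and `sounddevice` are hypothetical stand-ins
# for the terminal's graphics-card / sound-card buffer API.
import numpy as np
import mss
import sounddevice as sd

def grab_image_buffer():
    """Return the current frame of the primary display as an RGB array."""
    with mss.mss() as sct:
        frame = np.array(sct.grab(sct.monitors[1]))  # BGRA screenshot of monitor 1
        return frame[:, :, :3][:, :, ::-1]           # drop alpha, convert BGR -> RGB

def grab_audio_buffer(seconds=1.0, sample_rate=16000):
    """Record a short window of audio as a mono float32 array."""
    audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="float32")
    sd.wait()                                        # block until the window is captured
    return audio.squeeze()

# The returned arrays would then be fed to the first/second video detection models.
```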
In the embodiment of the present invention, the video detection model includes: the system comprises a first video detection model for image anomaly detection, a second video detection model for sound anomaly detection and a third video detection model for interaction anomaly detection.
Here, the video abnormal forms include: a first abnormal form related to images (such as image stuttering, mosaic, black screen, screen artifacts, color abnormality and the like), a second abnormal form related to sound (such as audio stuttering, audio leading the video, audio lagging behind the video and the like), and a third abnormal form related to interaction (such as pause, fast forward, rewind and the like).
In this embodiment of the present invention, the step 102 identifies the video abnormal state in the video playing process based on the video information and the video detection model, and includes at least one of the following:
identifying whether a first abnormal form related to the image exists in the video playing process or not based on the image information in the video information and a first video detection model;
identifying whether a second abnormal form related to sound exists in the video playing process or not based on the audio information in the video information and a second video detection model;
and identifying whether a third abnormal form related to interaction exists in the video playing process based on the interaction information in the video information and a third video detection model.
According to the video detection method of the embodiment of the invention, video information of the video playing terminal is obtained, and abnormal forms in the video playing process are identified based on the video information and a video detection model, wherein the video detection model comprises at least one of the following: a first video detection model for image anomaly detection, a second video detection model for sound anomaly detection, and a third video detection model for interaction anomaly detection. Video anomaly detection can therefore be performed based on at least one of the image information, the audio information and the interactive information in the video information of the video playing terminal, so that video abnormal forms in the video playing process can be detected without acquiring the development rights or analysis rights of the video APP.
In practical applications, the first video detection model, the second video detection model and the third video detection model can be generated through federated learning training based on the data sets of at least two video playing terminals, so as to provide the generalization capability required for video anomaly detection; compared with setting a detection threshold based on experience, this improves the accuracy of video anomaly detection. Based on this, the video detection method of the embodiment of the present invention further includes: training to generate the video detection model.
Here, considering that the image information, the audio information, and the interactive information have different characteristics, the three kinds of information may be trained respectively to obtain a first video detection model, a second video detection model, and a third video detection model. The first video detection model, the second video detection model and the third video detection model generated by training can be respectively deployed in an image channel, an audio channel and an interaction channel of video detection equipment, so that the detection of video abnormal forms is realized.
In the embodiment of the invention, model training can be performed based on federated learning across a plurality of video playing terminals to obtain the video detection models. Federated learning has the following characteristics: all data are kept locally, so privacy is not disclosed and regulations are not violated; the participants jointly establish a virtual common model and a system from which all of them benefit; under a federated learning system, the identity and status of each participant are the same; and the modeling effect of federated learning is the same as, or comparable to, the effect of placing the entire data set in one place for modeling. Therefore, the generalization capability of the video detection model can be improved and, compared with setting a detection threshold based on experience, the accuracy of video anomaly detection can be improved.
Here, training to generate the video detection model includes at least one of:
generating the first video detection model through federated learning training based on the data sets of at least two video playing terminals;
generating the second video detection model through federated learning training based on the data sets of at least two video playing terminals;
and generating the third video detection model through federated learning training based on the data sets of at least two video playing terminals.
In some embodiments, generating the first video detection model through federated learning training based on the data sets of at least two video playing terminals includes:
for the at least two video playing terminals, acquiring first model parameters generated by each video playing terminal through training the first video detection model on its own first data set, wherein the first data set is a set of labeled image information;
performing federated learning based on the first model parameters of the video playing terminals to fuse the first model parameters of the video playing terminals;
and sending the fused first model parameters to each video playing terminal so that each video playing terminal starts the next iteration based on the fused first model parameters, until the federated first video detection model meets the iteration ending condition, thereby obtaining the trained first video detection model.
In some embodiments, the first video detection model employs a deep learning model of a three-dimensional Convolutional Neural Network (3D-CNN) combined with a Long Short-Term Memory Network (LSTM). The 3D-CNN can realize feature extraction of the image sequence, and the LSTM can realize anomaly detection of the time sequence feature sequence.
In some embodiments, generating the second video detection model through federated learning training based on the data sets of at least two video playing terminals includes:
for the at least two video playing terminals, acquiring second model parameters generated by each video playing terminal through training the second video detection model on its own second data set, wherein the second data set is a set of labeled audio information;
performing federated learning based on the second model parameters of the video playing terminals to fuse the second model parameters of the video playing terminals;
and sending the fused second model parameters to each video playing terminal so that each video playing terminal starts the next iteration based on the fused second model parameters, until the federated second video detection model meets the iteration ending condition, thereby obtaining the trained second video detection model.
In some embodiments, the second video detection model employs a deep learning model combining a Convolutional Neural Network (CNN) with an LSTM. The CNN performs feature extraction on the audio sequence, and the LSTM performs anomaly detection on the time-series feature sequence.
In some embodiments, generating the third video detection model through federated learning training based on the data sets of at least two video playing terminals includes:
for the at least two video playing terminals, acquiring third model parameters generated by each video playing terminal through training the third video detection model on its own third data set, wherein the third data set is a set of labeled interactive information;
performing federated learning based on the third model parameters of the video playing terminals to fuse the third model parameters of the video playing terminals;
and sending the fused third model parameters to each video playing terminal so that each video playing terminal starts the next iteration based on the fused third model parameters, until the federated third video detection model meets the iteration ending condition, thereby obtaining the trained third video detection model.
Here, the third video detection model may employ a CNN-based multi-classification model.
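The three federated training procedures described above share the same round structure. A minimal sketch of one such round is given below, assuming FedAvg-style parameter fusion weighted by sample counts; the client object, its methods and the parameter layout are illustrative placeholders rather than details given by the patent.

```python
# Sketch of one federated training round shared by the first, second and third
# video detection models. The client interface and parameter layout below are
# assumptions for illustration only.
def fuse_parameters(client_params, client_sample_counts):
    """Server side: average each layer's parameters, weighted by sample share."""
    total = sum(client_sample_counts)
    fused = []
    for layers in zip(*client_params):                     # one tuple of arrays per layer
        fused.append(sum((n / total) * layer
                         for n, layer in zip(client_sample_counts, layers)))
    return fused

def federated_round(global_params, clients):
    """One round: each terminal trains locally on its labeled data, the server fuses."""
    client_params, counts = [], []
    for client in clients:
        updated = client.train_locally(global_params)      # local training; data stays local
        client_params.append(updated)
        counts.append(client.num_samples)
    return fuse_parameters(client_params, counts)          # redistributed before the next round
```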
Therefore, the embodiment of the invention can train on the local data of a plurality of video playing terminals based on federated learning to obtain the trained first video detection model, second video detection model and third video detection model. Federated learning works within the constraint that, for security and privacy reasons, data cannot leave the local device, while also improving the generalization capability of the models.
In practical applications, the trained first video detection model, second video detection model and third video detection model may be deployed in a multi-channel intelligent fusion module of a video detection device (which may be the video playing terminal or a server). The multi-channel intelligent fusion module acquires the video information cached during video playing, for example acquiring the image information through an image channel, the audio information through an audio channel and the interactive information through an interaction channel, so that the first abnormal form related to images is detected based on the image information and the first video detection model, the second abnormal form related to sound is detected based on the audio information and the second video detection model, and the third abnormal form related to interaction is detected based on the interactive information and the third video detection model.
In some embodiments, the video detection method further comprises:
acquiring an abnormal sample data set for correcting the recognition result of the abnormal form based on the interactive behavior in the video playing process;
updating at least one of the first video detection model, the second video detection model, and the third video detection model based on the set of abnormal sample data.
In this way, abnormal samples occurring during video playing can be actively labeled to correct the recognition results, and the first video detection model, the second video detection model and the third video detection model are updated based on the corrected abnormal data set. This forms a self-evolution mechanism of the video detection models and continuously improves their generalization capability.
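A minimal sketch of this self-evolution step is shown below; the dataset, model and trainer objects are placeholders, since the patent only specifies that corrected samples are merged into the local data set and used for further training iterations.

```python
# Sketch of the self-evolution mechanism: user-corrected abnormal samples are
# merged into the terminal's local data set and the affected model is fine-tuned
# before the next federated round. All objects here are illustrative placeholders.
def self_evolve(local_dataset, corrected_samples, model, trainer):
    local_dataset.extend(corrected_samples)   # (sequence, corrected label) pairs
    local_dataset.shuffle()                   # mix new samples with the existing ones
    trainer.fit(model, local_dataset)         # continue local training on the updated set
    return model
```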
In some embodiments, the video detection method further comprises at least one of:
determining whether a network side fault exists based on the identification result of the video abnormal form in the video playing process;
and determining an evaluation result for evaluating the user experience based on the identification result of the video abnormal form in the video playing process.
The video detection method provided by the embodiment of the invention can be used for subsequent application based on the identification result of the video abnormal form. The first application is for determining whether there is a network side failure. The second application is for evaluating user experience.
For the first application, an operator wants to detect video abnormal forms in order to know whether there is a fault on the network side. However, a detected image stutter can have two causes: one is a problem on the network side, and the other is that the user has actively paused the picture. The operator is obviously concerned with the first case, and the embodiment of the present invention can determine whether there is a network side fault based on the recognition results of the first abnormal form and the second abnormal form.
For the second application, for a video manufacturer, the first video detection model and the second video detection model can assist a video APP party in evaluating the experience perception of a customer, and the third video detection model can assist in analyzing the user preference through interactive behaviors, so that some customized services are realized. Therefore, the embodiment of the invention can determine the evaluation result for evaluating the user experience based on the recognition results of the first abnormal form, the second abnormal form and the third abnormal form, and push the video service based on the evaluation result.
The present invention will be described in further detail with reference to the following application examples.
Fig. 2 shows a schematic structural diagram of a video morphology monitoring system based on multi-channel fusion according to an embodiment of the application. Fig. 3 shows a schematic flow chart of self-evolution training of the multi-channel intelligent fusion module in the embodiment of the present application. Fig. 4 shows a schematic flow chart of video morphology monitoring based on multi-channel intelligent fusion in the embodiment of the application.
As shown in fig. 2, the video morphology monitoring system based on multi-channel fusion of this application embodiment includes: a multi-channel intelligent fusion model training apparatus and a video morphology monitoring model inference apparatus based on multi-channel intelligent fusion.
The multi-channel intelligent fusion model training apparatus mainly comprises a cloud server and N video terminal devices (namely the video playing terminals described above), where N is a natural number greater than 1. The video information comprises three kinds of information: image information, audio information and interactive information. The interactive information is represented by icon information generated by the video terminal device in response to user interaction behaviors (such as pause, fast forward and other operations). The image information and the interactive information can be obtained by reading the image cache of the graphics card unit in the video terminal device, and the audio information can be obtained by reading the sound cache of the sound card unit in the video terminal device. Because the three kinds of information have different characteristics, this application embodiment trains on them separately to obtain different models, which are respectively deployed in the image channel, the audio channel and the interaction channel of the multi-channel intelligent fusion module. For the image recognition model, a neural network combining 3D-CNN and LSTM is designed; for the audio recognition model, a neural network combining CNN and LSTM is designed; for the interaction recognition model, a CNN neural network is designed. To address the constraint that data cannot leave the local device for security and privacy reasons while still improving model generalization, federated learning is adopted in the training process. A federated learning model is deployed on the cloud server, local models are deployed on the N video terminal devices, and the models of the multi-channel intelligent fusion module are trained in a federated manner, so that abnormal video forms in the video information can be accurately identified.
The video morphology monitoring model inference apparatus based on multi-channel intelligent fusion mainly comprises the video terminal device, a video information monitoring module, the multi-channel intelligent fusion module and a monitoring result display module. The main function of the video information monitoring module is to acquire the image information, audio information and interactive information from the video terminal device and feed each of them into the multi-channel intelligent fusion module. The multi-channel intelligent fusion module consists of the image channel, the audio channel and the interaction channel, in which the image monitoring model (namely the first video detection model), the audio monitoring model (namely the second video detection model) and the interaction monitoring model (namely the third video detection model) are respectively deployed. The image channel is used for image morphology monitoring, covering the various abnormal forms appearing in the image; the audio channel is used for audio morphology monitoring, covering the various abnormal forms appearing in the audio; the interaction channel has two functions: first, the interaction monitoring model monitors interaction behaviors, such as an active pause by the user; second, the user can actively label abnormal samples during video playing, which are fed back to the local model for iterative training, continuously improving the generalization capability of the model and forming a self-evolution mechanism of the model.
In this application embodiment, the video detection includes: the method comprises a multi-channel intelligent fusion model training process and a video morphology monitoring model reasoning process based on multi-channel intelligent fusion.
As shown in fig. 3, the multi-channel intelligent fusion model training process includes:
step 301, presetting an initialization model;
here, the initialization model may be preset at the cloud server, and may include initialization models of an image monitoring model, an audio monitoring model, and an interaction monitoring model.
Step 302, selecting terminal equipment participating in model training;
here, the cloud server may select K participants from N terminal devices, K ≦ N.
Step 303, the terminal devices participating in model training acquire the preset initialization model;
and the terminal equipment participating in the model training acquires a preset initialization model from the cloud server.
Step 304, training a model by using local data and calculating a model updating parameter by a participant;
Each participant trains the models using local data. For example, the image monitoring model is trained with a set of labeled image information to obtain the update parameters of the image monitoring model; the audio monitoring model is trained with a set of labeled audio information to obtain the update parameters of the audio monitoring model; and the interaction monitoring model is trained with a set of labeled interactive information to obtain the update parameters of the interaction monitoring model.
Step 305, uploading model updating parameters by a participant;
each participant uploads the updated parameters of the trained model, for example, the updated parameters of the image monitoring model, the updated parameters of the audio monitoring model and the updated parameters of the interaction monitoring model.
Step 306, the server aggregates the model parameters uploaded by each participant and updates the parameters of the federated learning model;
The cloud server aggregates the model update parameters uploaded by each participant and updates the parameters of the federated learning model based on the aggregated parameters. Here, aggregation refers to a weighted summation of the model parameters of the participants, weighted by each participant's share of the total number of samples. For example, the update parameters of the image monitoring models of the participants can be aggregated to obtain the federated learning parameters of the image monitoring model; the update parameters of the audio monitoring models of the participants can be aggregated to obtain the federated learning parameters of the audio monitoring model; and the update parameters of the interaction monitoring models of the participants can be aggregated to obtain the federated learning parameters of the interaction monitoring model.
Step 307, adding an abnormal labeling sample;
the interactive channel is maintained for a fixed duration of t0The buffer is used for collecting abnormal samples of images and audios corresponding to the abnormal labels (labels actively corrected), wherein the samples are sequences containing images or audios of abnormal forms, and the corresponding sequences and label texts are sent to the terminal equipment so as to update the local data set of the terminal equipment.
Step 308, continuing training based on the federated learning parameters of the models and the local data set of the terminal device until the models converge;
Here, the local model of the terminal device shuffles the newly added abnormal samples together with the existing local samples, performs model training starting from the federated learning parameters on the cloud server side, and uploads the updated parameters obtained from local training to the cloud server. The cloud server aggregates the model update parameters uploaded by each participant and updates the parameters of the federated learning model based on the aggregated parameters (that is, the foregoing steps 304 to 306 are repeated), and when the federated learning model is determined to have converged, the trained models are obtained.
Step 309, deploying the trained models to the multi-channel intelligent fusion module.
The cloud server deploys the trained models to the multi-channel intelligent fusion module of the video detection device, so that the video detection device can perform video abnormal form inference (namely, video abnormal form detection).
As shown in fig. 4, the video morphology monitoring model inference process based on multi-channel intelligent fusion includes:
step 401, a video information monitoring module initiates a data request to a video terminal device;
here, the video information monitoring module initiates a data request to the video terminal device, where the data request is used to obtain video related information in a video card and a sound card functional unit of the terminal device through a terminal device API, and the video information monitoring module may be integrated with the video terminal device, or may be deployed outside the terminal device in another server with storage and communication capabilities.
Step 402, the video terminal equipment returns images, audios and interactive information;
the video terminal equipment responds to the request, schedules images and interactive information cached in a display card functional unit in the terminal equipment, and simultaneously schedules audio information cached in a sound card functional unit in the terminal equipment and returns the images, the audio and the interactive information to the video information monitoring module;
step 403, respectively outputting the image, the audio and the interactive information to an image channel, an audio channel and an interactive channel;
the video information monitoring module correspondingly inputs the image, the audio and the interactive information acquired in the step 402 into an image channel, an audio channel and an interactive channel in the multi-channel intelligent fusion module respectively. Specifically, step 403 includes:
step 3.1, outputting image information;
step 3.2, outputting audio information;
and 3.3, outputting the interactive information.
The multi-channel intelligent fusion module identifies and monitors the acquired video information using the image monitoring model, the audio monitoring model and the interaction monitoring model deployed in the image channel, the audio channel and the interaction channel respectively; meanwhile, the abnormality labeling submodule in the interaction channel performs abnormality labeling on the image and audio information.
It should be noted that the video information monitoring module outputs the image information to the image channel, outputs the audio information to the audio channel, and outputs the interactive information to the interactive channel (i.e., steps 3.1 to 3.3) may be executed synchronously.
And 404, fusing and summarizing the monitoring results by the multi-channel intelligent fusion module and outputting the monitoring results to the monitoring result display module for visualization.
Here, step 404 includes:
step 4.1, outputting an image monitoring result;
step 4.2, outputting an audio monitoring result;
4.3, outputting an interaction monitoring result;
and step 4.4, performing abnormality labeling on the data from step 3.1 and step 3.2, and outputting the abnormality labeling result.
Here, the multi-channel intelligent fusion module outputs the recognition result of the first abnormal form related to the image, the recognition result of the second abnormal form related to the sound, the recognition result of the third abnormal form related to the interaction and the result of the abnormal label to the monitoring result display module for visualization. It should be noted that steps 4.1, 4.2, 4.3 and 4.4 may be performed synchronously.
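A compact sketch of this inference and fusion step is given below. The model objects, label strings and the fusion rule for attributing a fault to the network side are illustrative assumptions, not values fixed by the patent.

```python
# Illustrative sketch of steps 403-404: each channel runs its detector on the
# corresponding buffered data and the per-channel results are merged into one
# monitoring report. Labels and the fusion rule are placeholders.
def run_multichannel_inference(image_seq, audio_seq, interaction_seq,
                               image_model, audio_model, interaction_model):
    image_result = image_model.predict(image_seq)                    # e.g. "normal", "stutter", "black_screen"
    audio_result = audio_model.predict(audio_seq)                    # e.g. "normal", "audio_stutter"
    interaction_result = interaction_model.predict(interaction_seq)  # e.g. "none", "user_pause", "fast_forward"
    return {
        "image": image_result,
        "audio": audio_result,
        "interaction": interaction_result,
        # One possible fusion rule (an interpretation of the network-fault use case
        # described earlier): flag the network side only when an image or audio anomaly
        # is seen without a matching user-initiated interaction.
        "suspect_network_fault": (image_result != "normal" or audio_result != "normal")
                                 and interaction_result == "none",
    }
```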
The following describes the image monitoring model in this application embodiment in detail with reference to fig. 5:
Training data: labeled picture sequences from the video information, in which abnormal forms occurring during video playing, such as image stuttering, mosaic, black screen, screen artifacts and color abnormality, are labeled.
Model design: a multi-class deep learning model combining a 3D-CNN (three-dimensional convolutional neural network) and an LSTM (long short-term memory network) is adopted. The 3D-CNN performs feature extraction on the image sequence, and the LSTM performs anomaly detection on the time-series feature sequence.
The algorithm flow is as follows:
a) the input training data set D (image sequence marked with the presence or absence of an abnormality and the type of the abnormality) is subjected to data preprocessing (color conversion, size clipping, and scale conversion), and the processed data is input into a model.
b) The 3D-CNN module extracts features from the training data; the function is expressed as $X_{cnn} = f(W_{cnn} \cdot X_{in} + B_{cnn})$, where $X_{cnn}$ is the feature vector output by the CNN layer, $X_{in}$ is the input data, $W_{cnn}$ is a first weight parameter, $B_{cnn}$ is the bias, and $f$ is the activation function.
c) The feature vector output by the 3D-CNN is input into the LSTM layer; the function is expressed as $C(t) = f(X_{cnn}(t) \cdot W_{lstm} + C(t-1) \cdot V + B_{lstm})$, where $C(t)$ is the output of the LSTM, $V$ is the memory parameter, $B_{lstm}$ is an offset, and $W_{lstm}$ is a second weight parameter.
d) The output of the LSTM is connected to a fully-connected layer that employs the softmax function $\hat{y}_i = e^{z_i} / \sum_j e^{z_j}$, where $z_i$ is the i-th output of the fully-connected layer.
e) The loss function is the cross entropy $L = -\sum_i y_i \log \hat{y}_i$, which is minimized during training.
f) The whole process is solved with a stochastic gradient descent algorithm; when the loss function reaches its minimum value, the model converges.
g) The final output y represents the label type, such as image stuttering, mosaic, black screen or screen artifacts.
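For concreteness, a compact PyTorch-style sketch of such a 3D-CNN + LSTM classifier is shown below. Layer sizes, the number of classes and the input resolution are illustrative assumptions; the patent fixes only the overall structure (3D convolution, LSTM, fully-connected softmax layer, cross-entropy loss minimized with stochastic gradient descent).

```python
# A compact sketch of the 3D-CNN + LSTM multi-class image monitoring model.
# Layer sizes and class count are illustrative assumptions.
import torch
import torch.nn as nn

class ImageAnomalyNet(nn.Module):
    def __init__(self, num_classes=5, hidden=128):
        super().__init__()
        self.cnn3d = nn.Sequential(                      # X_cnn = f(W_cnn · X_in + B_cnn)
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),          # keep the time dimension
        )
        self.lstm = nn.LSTM(input_size=32 * 4 * 4, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)         # softmax is applied inside the loss

    def forward(self, x):                                # x: (batch, 3, T, H, W)
        feat = self.cnn3d(x)                             # (batch, 32, T, 4, 4)
        feat = feat.permute(0, 2, 1, 3, 4).flatten(2)    # (batch, T, 32*4*4)
        out, _ = self.lstm(feat)                         # C(t) = f(X_cnn(t)·W_lstm + C(t-1)·V + B_lstm)
        return self.fc(out[:, -1])                       # logits for the last time step

if __name__ == "__main__":
    clips = torch.randn(2, 3, 8, 64, 64)                 # batch of 2 clips, 8 frames, 64x64
    print(ImageAnomalyNet()(clips).shape)                # -> torch.Size([2, 5])

# Training would minimise nn.CrossEntropyLoss() with stochastic gradient descent,
# matching steps e)-f) above.
```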
The following describes the audio monitoring model in this application embodiment with reference to fig. 6:
Training data: labeled audio sequences from the video information (e.g., data in which audio stuttering is labeled).
Model design: a multi-class deep learning model combining a CNN (convolutional neural network) and an LSTM (long short-term memory network) is adopted. The CNN performs feature extraction on the audio sequence, and the LSTM performs anomaly detection on the time-series feature sequence.
The algorithm flow is as follows:
a) The input training data set D (audio sequences labeled with the presence or absence of an abnormality and the abnormality type) is subjected to data preprocessing (framing, windowing, short-time Fourier transform and Mel filtering of the audio sequence, then slicing the Mel spectrogram by a time length tau to obtain n input-layer slices), and the processed data is input into the model;
b) The CNN module extracts features from the training data; the function is expressed as $X_{cnn} = f(W_{cnn} \cdot X_{in} + B_{cnn})$, where $X_{cnn}$ is the feature vector output by the CNN layer, $X_{in}$ is the input data, $W_{cnn}$ is a first weight parameter, $B_{cnn}$ is the bias, and $f$ is the activation function.
c) The feature vectors output by the CNN are input into the LSTM layer; the function is expressed as $C(t) = f(X_{cnn}(t) \cdot W_{lstm} + C(t-1) \cdot V + B_{lstm})$, where $C(t)$ is the output of the LSTM, $V$ is the memory parameter, $B_{lstm}$ is an offset, and $W_{lstm}$ is a second weight parameter.
d) The output of the LSTM is connected to a fully-connected layer that employs the softmax function $\hat{y}_i = e^{z_i} / \sum_j e^{z_j}$, where $z_i$ is the i-th output of the fully-connected layer.
e) The loss function is the cross entropy $L = -\sum_i y_i \log \hat{y}_i$, which is minimized during training.
f) The whole process is solved with a stochastic gradient descent algorithm; when the loss function reaches its minimum value, the model converges.
g) The final output y represents the label type, such as audio stuttering or audio interruption.
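The preprocessing in step a) can be sketched as follows; the window, hop and slice lengths are illustrative choices, and `librosa` is used only as a convenient stand-in for the STFT and Mel filtering steps.

```python
# Sketch of the audio preprocessing in step a): framing/windowing + short-time
# Fourier transform + Mel filtering, then slicing the Mel spectrogram into
# fixed-length chunks (length tau) for the CNN input layer. Parameter values
# are illustrative.
import numpy as np
import librosa

def preprocess_audio(waveform, sample_rate=16000, n_mels=64, slice_frames=64):
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sample_rate,
        n_fft=1024, hop_length=256, n_mels=n_mels)        # STFT + Mel filter bank
    log_mel = librosa.power_to_db(mel)                    # shape: (n_mels, n_frames)
    n_slices = log_mel.shape[1] // slice_frames           # n input-layer slices of length tau
    slices = [log_mel[:, i * slice_frames:(i + 1) * slice_frames]
              for i in range(n_slices)]
    return np.stack(slices) if slices else np.empty((0, n_mels, slice_frames))
```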
The following describes the interaction monitoring model in this application embodiment in detail with reference to fig. 7:
Training data: labeled interactive information from the video information (e.g., image sequences containing the icons shown when the user actively pauses or fast-forwards).
Model design: a multi-class model based on a CNN (convolutional neural network) is adopted.
The algorithm flow is as follows:
a) The input training data set D (image sequences labeled with whether interaction icons are present) is subjected to data preprocessing (color transformation, size cropping and scale transformation), and the processed data is input into the model;
b) The CNN module extracts features from the training data; the function is expressed as $X_{cnn} = f(W_{cnn} \cdot X_{in} + B_{cnn})$, where $X_{cnn}$ is the feature vector output by the CNN layer, $X_{in}$ is the input data, $W_{cnn}$ is the weight parameter, $B_{cnn}$ is the bias, and $f$ is the activation function.
c) The output of the CNN module is connected to a fully-connected layer that employs the softmax function $\hat{y}_i = e^{z_i} / \sum_j e^{z_j}$, where $z_i$ is the i-th output of the fully-connected layer.
d) The loss function is the cross entropy $L = -\sum_i y_i \log \hat{y}_i$, which is minimized during training.
e) The whole process is solved with a stochastic gradient descent algorithm; when the loss function reaches its minimum value, the model converges.
f) The final output y represents the interaction label type, such as a user-initiated pause or fast forward.
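A small PyTorch-style sketch of such a CNN multi-class classifier over rendered frames is given below; layer sizes and the class list are assumptions, since the patent only specifies a CNN feature extractor followed by a fully-connected softmax layer trained with cross-entropy.

```python
# Sketch of the CNN-based multi-class interaction classifier: it looks at rendered
# frames and predicts which interaction icon (pause, fast-forward, rewind, none)
# is present. Layer sizes and classes are illustrative assumptions.
import torch
import torch.nn as nn

class InteractionNet(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):                    # x: (batch, 3, H, W) preprocessed frames
        feat = self.features(x).flatten(1)   # X_cnn = f(W_cnn · X_in + B_cnn)
        return self.classifier(feat)         # logits; softmax/cross-entropy at training time

if __name__ == "__main__":
    frames = torch.randn(4, 3, 96, 96)       # batch of 4 preprocessed frames
    print(InteractionNet()(frames).shape)    # -> torch.Size([4, 4])
```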
According to the video detection method of this application embodiment, the video information monitoring module acquires, through the API of the terminal device, the cached information in the graphics-card and sound-card functional units. This avoids the problem that data cannot be acquired by embedding an SDK for services the detector does not own, and the problem that encrypted traffic cannot be analyzed with a packet capture tool. Video abnormal forms in the playing process can then be identified based on the acquired cached video information and the video detection models: specifically, the first abnormal form can be detected based on the cached image information and the image monitoring model, the second abnormal form can be detected based on the cached audio information and the audio monitoring model, and the third abnormal form can be detected based on the cached interactive information and the interaction monitoring model. In addition, the image monitoring model, the audio monitoring model and the interaction monitoring model are iteratively trained through a self-evolution mechanism and therefore have strong generalization capability; the image channel, the audio channel and the interaction channel are established separately, and the recognition results of the three channels are fused, so that video faults or abnormalities can be judged more accurately.
In order to implement the method according to the embodiment of the present invention, an embodiment of the present invention further provides a video detection apparatus, where the video detection apparatus corresponds to the video detection method, and each step in the video detection method is also completely applicable to the embodiment of the video detection apparatus.
As shown in fig. 8, the video detection apparatus 800 includes: an acquisition module 801 and a detection module 802; the acquisition module 801 is configured to obtain video information of a video playing terminal, where the video information includes at least one of: image information, audio information and interactive information used for indicating interactive behaviors in the video playing process; the detection module 802 is configured to identify video abnormal forms in the video playing process based on the video information and a video detection model, where the identifying includes at least one of:
identifying whether a first abnormal form related to the image exists in the video playing process or not based on the image information in the video information and a first video detection model;
identifying whether a second abnormal form related to sound exists in the video playing process or not based on the audio information in the video information and a second video detection model;
and identifying whether a third abnormal form related to interaction exists in the video playing process based on the interaction information in the video information and a third video detection model.
In some embodiments, the acquisition module 801 is configured to at least one of:
acquiring image information based on image cache information of a video playing terminal;
acquiring audio information based on audio cache information of a video playing terminal;
and acquiring the interactive information based on the image cache information of the video playing terminal.
In some embodiments, the video detection apparatus 800 further comprises: a model training module 803, configured to generate the first video detection model through federated learning training based on the data sets of at least two video playing terminals.
In some embodiments, the model training module 803 generates the first video detection model through federated learning training based on the data sets of at least two video playing terminals, including:
for the at least two video playing terminals, acquiring first model parameters generated by each video playing terminal through training the first video detection model on its own first data set, wherein the first data set is a set of labeled image information;
performing federated learning based on the first model parameters of the video playing terminals to fuse the first model parameters of the video playing terminals;
and sending the fused first model parameters to each video playing terminal so that each video playing terminal starts the next iteration based on the fused first model parameters, until the federated first video detection model meets the iteration ending condition, thereby obtaining the trained first video detection model.
In some embodiments, the first video detection model employs a deep learning model of a three-dimensional convolutional neural network (3D-CNN) combined with a long short-term memory network (LSTM).
In some embodiments, the model training module 803 is further configured to generate the second video detection model through federated learning training based on the data sets of at least two video playing terminals.
In some embodiments, the model training module 803 generates the second video detection model through federated learning training based on the data sets of at least two video playing terminals, including:
for the at least two video playing terminals, acquiring second model parameters generated by each video playing terminal through training the second video detection model on its own second data set, wherein the second data set is a set of labeled audio information;
performing federated learning based on the second model parameters of the video playing terminals to fuse the second model parameters of the video playing terminals;
and sending the fused second model parameters to each video playing terminal so that each video playing terminal starts the next iteration based on the fused second model parameters, until the federated second video detection model meets the iteration ending condition, thereby obtaining the trained second video detection model.
In some embodiments, the second video detection model employs a deep learning model of a Convolutional Neural Network (CNN) in combination with LSTM.
In some embodiments, the model training module 803 is further configured to generate the third video detection model through federated learning training based on the data sets of at least two video playing terminals.
In some embodiments, the model training module 803 generates the third video detection model through federated learning training based on the data sets of at least two video playing terminals, including:
for the at least two video playing terminals, acquiring third model parameters generated by each video playing terminal through training the third video detection model on its own third data set, wherein the third data set is a set of labeled interactive information;
performing federated learning based on the third model parameters of the video playing terminals to fuse the third model parameters of the video playing terminals;
and sending the fused third model parameters to each video playing terminal so that each video playing terminal starts the next iteration based on the fused third model parameters, until the federated third video detection model meets the iteration ending condition, thereby obtaining the trained third video detection model.
In some embodiments, the obtaining module 801 is further configured to obtain an abnormal sample data set for correcting the recognition result of the abnormal form based on the interaction behavior in the video playing process;
the model training module 803 is also configured to update at least one of the first video detection model, the second video detection model, and the third video detection model based on the set of abnormal sample data.
In some embodiments, the video detection apparatus 800 further comprises: a determining module 804 configured to at least one of:
determining whether a network side fault exists based on the identification result of the abnormal form in the video playing process;
and determining an evaluation result for evaluating the user experience based on the identification result of the abnormal form in the video playing process.
In practical applications, the obtaining module 801, the detecting module 802, the model training module 803, and the determining module 804 may be implemented by a processor in the video detecting apparatus. Of course, the processor needs to run a computer program in memory to implement its functions.
It should be noted that: in the video detection apparatus provided in the above embodiment, only the division of the program modules is exemplified when performing video detection, and in practical applications, the processing distribution may be completed by different program modules according to needs, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the video detection apparatus and the video detection method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present invention, the embodiment of the present invention further provides a video detection device. Fig. 9 shows only an exemplary structure of the video detection apparatus, not the entire structure, and a part of or the entire structure shown in fig. 9 may be implemented as necessary.
As shown in Fig. 9, a video detection device 900 according to an embodiment of the present invention includes: at least one processor 901, a memory 902, a user interface 903, and at least one network interface 904. The various components in the video detection device 900 are coupled together by a bus system 905. It will be appreciated that the bus system 905 is used to enable connection and communication among these components. In addition to a data bus, the bus system 905 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are collectively labeled in Fig. 9 as the bus system 905.
The user interface 903 may include a display, a keyboard, a mouse, a trackball, a click wheel, keys, buttons, a touch pad, a touch screen, or the like.
The memory 902 in embodiments of the present invention is used to store various types of data to support the operation of the video detection device. Examples of such data include: any computer program for operating on a video detection device.
The video detection method disclosed in the embodiments of the present invention may be applied to the processor 901, or implemented by the processor 901. The processor 901 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the video detection method may be completed by integrated logic circuits of hardware in the processor 901 or by instructions in the form of software. The processor 901 may be a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 901 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, the storage medium being located in the memory 902; the processor 901 reads the information in the memory 902 and, in combination with its hardware, completes the steps of the video detection method provided by the embodiments of the present invention.
In an exemplary embodiment, the video detection device may be implemented by one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components for performing the aforementioned method.
It will be appreciated that the memory 902 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memories. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as a Static Random Access Memory (SRAM), a Synchronous Static Random Access Memory (SSRAM), a Dynamic Random Access Memory (DRAM), a Synchronous Dynamic Random Access Memory (SDRAM), a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), an Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), a SyncLink Dynamic Random Access Memory (SLDRAM), and a Direct Rambus Random Access Memory (DRRAM). The memory described in the embodiments of the present invention is intended to include, but is not limited to, these and any other suitable types of memory.
In an exemplary embodiment, an embodiment of the present invention further provides a storage medium, namely a computer storage medium, which may specifically be a computer-readable storage medium, for example, the memory 902 storing a computer program, where the computer program is executable by the processor 901 of the video detection device to perform the steps of the method according to the embodiments of the present invention. The computer-readable storage medium may be a ROM, a PROM, an EPROM, an EEPROM, a Flash Memory, a magnetic surface memory, an optical disc, or a CD-ROM.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In addition, the technical solutions described in the embodiments of the present invention may be arbitrarily combined without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (15)

1. A video detection method, comprising:
obtaining video information of a video playing terminal, wherein the video information comprises at least one of the following: image information, audio information, and interactive information used for indicating interactive behaviors in the video playing process;
identifying a video abnormal form in the video playing process based on the video information and a video detection model, wherein the identifying comprises at least one of the following:
identifying whether a first abnormal form related to the image exists in the video playing process or not based on the image information in the video information and a first video detection model;
identifying whether a second abnormal form related to sound exists in the video playing process or not based on the audio information in the video information and a second video detection model;
and identifying whether a third abnormal form related to interaction exists in the video playing process based on the interaction information in the video information and a third video detection model.
2. The method according to claim 1, wherein the obtaining the video information of the video playing terminal comprises at least one of:
acquiring image information based on image cache information of a video playing terminal;
acquiring audio information based on audio cache information of a video playing terminal;
and acquiring the interactive information based on the image cache information of the video playing terminal.
3. The method of claim 1, further comprising:
generating the first video detection model based on the data sets of at least two video playing terminals and federated learning training.
4. The method of claim 3, wherein generating the first video detection model based on the data sets of at least two video playing terminals and federated learning training comprises:
for each of at least two video playing terminals, acquiring first model parameters generated by each video playing terminal by training based on a first video detection model and a respective first data set, wherein the first data set is a set of labeled image information;
performing federated learning based on the first model parameters of the respective video playing terminals to fuse the first model parameters of the video playing terminals;
and sending the fused first model parameters to each video playing terminal, so that each video playing terminal starts the next iteration based on the fused first model parameters, until the federated first video detection model meets an iteration ending condition, thereby obtaining the trained first video detection model.
5. The method of claim 4, wherein the first video detection model employs a deep learning model combining a three-dimensional convolutional neural network (3D-CNN) with a long short-term memory (LSTM) network.
6. The method of claim 1, further comprising:
generating the second video detection model based on the data sets of at least two video playing terminals and federated learning training.
7. The method of claim 6, wherein generating the second video detection model based on the data sets of at least two video playing terminals and federated learning training comprises:
for each of at least two video playing terminals, acquiring second model parameters generated by each video playing terminal by training based on a second video detection model and a respective second data set, wherein the second data set is a set of labeled audio information;
performing federated learning based on the second model parameters of the respective video playing terminals to fuse the second model parameters of the video playing terminals;
and sending the fused second model parameters to each video playing terminal, so that each video playing terminal starts the next iteration based on the fused second model parameters, until the federated second video detection model meets an iteration ending condition, thereby obtaining the trained second video detection model.
8. The method of claim 7, wherein the second video detection model employs a deep learning model combining a Convolutional Neural Network (CNN) with LSTM.
9. The method of claim 1, further comprising:
generating the third video detection model based on the data sets of at least two video playing terminals and federated learning training.
10. The method of claim 9, wherein generating the third video detection model based on the data sets of at least two video playing terminals and federated learning training comprises:
for each of at least two video playing terminals, acquiring third model parameters generated by each video playing terminal by training based on a third video detection model and a respective third data set, wherein the third data set is a set of labeled interactive information;
performing federated learning based on the third model parameters of the respective video playing terminals to fuse the third model parameters of the video playing terminals;
and sending the fused third model parameters to each video playing terminal, so that each video playing terminal starts the next iteration based on the fused third model parameters, until the federated third video detection model meets an iteration ending condition, thereby obtaining the trained third video detection model.
11. The method of claim 1, further comprising:
acquiring an abnormal sample data set for correcting the recognition result of the abnormal form based on the interactive behavior in the video playing process;
updating at least one of the first video detection model, the second video detection model, and the third video detection model based on the set of abnormal sample data.
12. The method of claim 1, further comprising at least one of:
determining whether a network side fault exists based on the identification result of the abnormal form in the video playing process;
and determining an evaluation result for evaluating the user experience based on the identification result of the abnormal form in the video playing process.
13. A video detection apparatus, comprising:
the acquisition module is used for acquiring video information of the video playing terminal, wherein the video information comprises at least one of the following: image information, audio information and interactive information used for indicating interactive behaviors in the video playing process;
the detection module is used for identifying a video abnormal form in the video playing process based on the video information and a video detection model, wherein the identifying comprises at least one of the following:
identifying whether a first abnormal form related to the image exists in the video playing process or not based on the image information in the video information and a first video detection model;
identifying whether a second abnormal form related to sound exists in the video playing process or not based on the audio information in the video information and a second video detection model;
and identifying whether a third abnormal form related to interaction exists in the video playing process based on the interaction information in the video information and a third video detection model.
14. A video detection device, comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein,
the processor, when executing the computer program, is adapted to perform the steps of the method of any of claims 1 to 12.
15. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method of any one of claims 1 to 12.
CN202010513757.6A 2020-06-08 2020-06-08 Video detection method, device, equipment and storage medium Pending CN113836966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010513757.6A CN113836966A (en) 2020-06-08 2020-06-08 Video detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010513757.6A CN113836966A (en) 2020-06-08 2020-06-08 Video detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113836966A true CN113836966A (en) 2021-12-24

Family

ID=78963642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010513757.6A Pending CN113836966A (en) 2020-06-08 2020-06-08 Video detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113836966A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925238A (en) * 2022-07-20 2022-08-19 山东大学 Video clip retrieval method and system based on federal learning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060171453A1 (en) * 2005-01-04 2006-08-03 Rohlfing Thomas R Video surveillance system
CN105022835A (en) * 2015-08-14 2015-11-04 武汉大学 Public safety recognition method and system for crowd sensing big data
CN107801090A (en) * 2017-11-03 2018-03-13 北京奇虎科技有限公司 Utilize the method, apparatus and computing device of audio-frequency information detection anomalous video file
CN109348275A (en) * 2018-10-30 2019-02-15 百度在线网络技术(北京)有限公司 Method for processing video frequency and device
CN109922334A (en) * 2017-12-13 2019-06-21 优酷信息技术(北京)有限公司 A kind of recognition methods and system of video quality
CN109936769A (en) * 2019-04-23 2019-06-25 深圳大学 A kind of video cardton detection method, detection system, mobile terminal and storage device
CN110097037A (en) * 2019-05-22 2019-08-06 天津联图科技有限公司 Intelligent monitoring method, device, storage medium and electronic equipment
CN110166789A (en) * 2019-05-15 2019-08-23 上海哔哩哔哩科技有限公司 Monitor method, computer equipment and the readable storage medium storing program for executing of net cast sensitive information
CN110519637A (en) * 2019-08-27 2019-11-29 西北工业大学 The method for monitoring abnormality combined based on audio frequency and video monitoring

Similar Documents

Publication Publication Date Title
US10163042B2 (en) Finding missing persons by learning features for person attribute classification based on deep learning
US9894266B2 (en) Cognitive recording and sharing
WO2019242222A1 (en) Method and device for use in generating information
WO2017151241A2 (en) Video processing
US11044206B2 (en) Live video anomaly detection
CN108009477B (en) Image people flow number detection method and device, storage medium and electronic equipment
CN109919244B (en) Method and apparatus for generating a scene recognition model
CN112465049A (en) Method and device for generating anomaly detection model and method and device for detecting anomaly event
CN110209658B (en) Data cleaning method and device
Atrey et al. Effective multimedia surveillance using a human-centric approach
KR102511287B1 (en) Image-based pose estimation and action detection method and appratus
CN111144344B (en) Method, device, equipment and storage medium for determining person age
US10937428B2 (en) Pose-invariant visual speech recognition using a single view input
CN113836966A (en) Video detection method, device, equipment and storage medium
Venkatesvara Rao et al. Real-time video object detection and classification using hybrid texture feature extraction
CN109816023B (en) Method and device for generating picture label model
CN108038473B (en) Method and apparatus for outputting information
CN111126159A (en) Method, apparatus, electronic device, and medium for tracking pedestrian in real time
CN113591885A (en) Target detection model training method, device and computer storage medium
Siriwardhana et al. Classification of activities of daily living based on depth sequences and audio
US11157549B2 (en) Emotional experience metadata on recorded images
CN111898529B (en) Face detection method and device, electronic equipment and computer readable medium
US20200219412A1 (en) System and method for sensor fusion from a plurality of sensors and determination of a responsive action
CN108694388B (en) Campus monitoring method and device based on intelligent camera
CN113837986A (en) Method, apparatus, electronic device, and medium for recognizing tongue picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination