CN112150457A - Video detection method, device and computer readable storage medium

Info

Publication number
CN112150457A
CN112150457A (application CN202011074238.0A)
Authority
CN
China
Prior art keywords
video
information
detected
image
feature
Prior art date
Legal status
Pending
Application number
CN202011074238.0A
Other languages
Chinese (zh)
Inventor
王铭喜
李宁
高荣欣
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202011074238.0A
Publication of CN112150457A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Abstract

The present disclosure relates to a video detection method, apparatus and computer-readable storage medium. The method comprises: acquiring a plurality of pieces of feature information corresponding to a video to be detected, the feature information comprising text information, image information and audio information; determining, for each piece of feature information, a feature detection result corresponding to the video to be detected under the feature information; and determining a target detection result corresponding to the video to be detected according to the plurality of feature detection results. On one hand, the quality of the video to be detected can thus be assessed along the text, image and audio dimensions, which ensures the accuracy and reliability of the resulting target detection result. On the other hand, no comparison against a high-quality reference video is needed, which widens the range of application of the video detection method and reduces its data processing load, thereby improving video detection efficiency.

Description

Video detection method, device and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a video detection method and apparatus, and a computer-readable storage medium.
Background
UGC (User Generated Content) is content created by users themselves, which users can upload through an internet platform in order to display it to, or make it available to, other users. Because the shooting devices that users employ vary widely and shooting environments and techniques differ from user to user, the quality of the resulting short videos is uneven.
In the related art, videos uploaded by users can be checked by evaluating video quality, typically with metrics such as PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity). However, these evaluation methods presuppose a high-definition reference video against which the evaluation is performed.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a video detection method, apparatus, and computer-readable storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a video detection method, including:
acquiring a plurality of feature information corresponding to a video to be detected, wherein the feature information comprises text information, image information and audio information;
determining, for each piece of feature information, a feature detection result corresponding to the video to be detected under the feature information;
and determining a target detection result corresponding to the video to be detected according to a plurality of feature detection results corresponding to the video to be detected.
Optionally, the image information includes video frame image information and/or cover image information corresponding to the video to be detected; the determining the feature detection result corresponding to the video to be detected under the feature information includes:
inputting the video frame image information and/or the cover image information into an image detection model to obtain a feature detection result of the image information corresponding to the video to be detected, wherein a target sample image in a training sample of the image detection model is determined in the following way:
acquiring a candidate sample image with a resolution less than or equal to a target resolution;
and for a candidate sample image with a resolution smaller than the target resolution, adding target pixel points to the candidate sample image according to the resolution of the candidate sample image and the target resolution, to obtain an extended sample image at the target resolution, wherein the target sample image comprises the extended sample image and the candidate sample image whose resolution is the target resolution.
Optionally, the image detection model includes a residual sub-model, a pooling layer and a fully connected layer, where the residual sub-model is a pre-trained transfer learning model, and during the training of the image detection model, the parameters of the pooling layer and the fully connected layer are adjusted according to a loss value of the image detection model determined from the training samples, so as to obtain the image detection model.
Optionally, the feature information is audio information; the determining the feature detection result corresponding to the video to be detected under the feature information includes:
inputting the audio information into an audio detection model to obtain a feature detection result of the audio information corresponding to the video to be detected, wherein the audio detection model is obtained by training based on a multi-class classification task model, and the number of training samples corresponding to each class of the audio detection model is the same.
Optionally, the loss function of the audio detection model is the sum of a cross-entropy function and a maximum mean discrepancy (MMD) function corresponding to the multiple classes of the audio detection model.
Optionally, the determining, according to the plurality of feature detection results corresponding to the video to be detected, the target detection result corresponding to the video to be detected includes:
performing weighted summation on the plurality of feature detection results according to the weight corresponding to each piece of feature information, to obtain the target detection result of the video to be detected.
Optionally, the method further comprises:
in a case where the video to be detected is determined to be abnormal according to its target detection result, determining abnormal feature information among the plurality of pieces of feature information corresponding to the video to be detected;
and outputting prompt information, wherein the prompt information comprises the target detection result and the abnormal feature information.
According to a second aspect of the embodiments of the present disclosure, there is provided a video detection apparatus including:
the acquisition module is configured to acquire a plurality of feature information corresponding to a video to be detected, wherein the feature information comprises text information, image information and audio information;
the first determining module is configured to determine, for each piece of feature information, a feature detection result corresponding to the video to be detected under the feature information;
the second determining module is configured to determine a target detection result corresponding to the video to be detected according to the plurality of feature detection results corresponding to the video to be detected.
Optionally, the image information includes video frame image information and/or cover image information corresponding to the video to be detected; the first determining module includes:
the first input sub-module, configured to input the video frame image information and/or the cover image information into an image detection model to obtain a feature detection result of the image information corresponding to the video to be detected, wherein a target sample image in a training sample of the image detection model is determined in the following way:
acquiring a candidate sample image with a resolution less than or equal to a target resolution;
and for a candidate sample image with a resolution smaller than the target resolution, adding target pixel points to the candidate sample image according to the resolution of the candidate sample image and the target resolution, to obtain an extended sample image at the target resolution, wherein the target sample image comprises the extended sample image and the candidate sample image whose resolution is the target resolution.
Optionally, the image detection model includes a residual sub-model, a pooling layer and a fully connected layer, where the residual sub-model is a pre-trained transfer learning model, and during the training of the image detection model, the parameters of the pooling layer and the fully connected layer are adjusted according to a loss value of the image detection model determined from the training samples, so as to obtain the image detection model.
Optionally, the feature information is audio information; the first determining module includes:
the second input sub-module, configured to input the audio information into an audio detection model to obtain a feature detection result of the audio information corresponding to the video to be detected, wherein the audio detection model is obtained by training based on a multi-class classification task model, and the number of training samples corresponding to each class of the audio detection model is the same.
Optionally, the loss function of the audio detection model is the sum of a cross-entropy function and a maximum mean discrepancy (MMD) function corresponding to the multiple classes of the audio detection model.
Optionally, the second determining module is configured to:
perform weighted summation on the plurality of feature detection results according to the weight corresponding to each piece of feature information, to obtain the target detection result of the video to be detected.
Optionally, the apparatus further comprises:
the third determining module, configured to determine, in a case where the video to be detected is determined to be abnormal according to its target detection result, abnormal feature information among the plurality of pieces of feature information corresponding to the video to be detected;
and the output module, configured to output prompt information, wherein the prompt information comprises the target detection result and the abnormal feature information.
According to a third aspect of the embodiments of the present disclosure, there is provided a video detection apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a plurality of feature information corresponding to a video to be detected, wherein the feature information comprises text information, image information and audio information;
determining, for each piece of feature information, a feature detection result corresponding to the video to be detected under the feature information;
and determining a target detection result corresponding to the video to be detected according to a plurality of feature detection results corresponding to the video to be detected.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the video detection method provided by the first aspect of the present disclosure.
The technical scheme provided by the embodiments of the present disclosure can have the following beneficial effects:
In this technical scheme, text information, image information and audio information corresponding to a video to be detected are acquired, a feature detection result of the video to be detected in each feature dimension is determined from that information, and a target detection result of the video to be detected is determined from the plurality of feature detection results. Thus, on one hand, the quality of the video to be detected can be assessed along the text, image and audio dimensions, which ensures the accuracy and reliability of the resulting target detection result. On the other hand, no comparison against a high-quality reference video is needed, which widens the range of application of the video detection method and reduces its data processing load, thereby improving video detection efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart illustrating a video detection method according to an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating an expanded sample image according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating a video detection method according to an example embodiment.
Fig. 4 is a block diagram illustrating a video detection device according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a video detection device according to an example embodiment.
Fig. 6 is a block diagram illustrating a video detection device according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a video detection method according to an exemplary embodiment. As shown in Fig. 1, the method may include the following steps.
In step 11, a plurality of feature information corresponding to a video to be detected is obtained, where the feature information includes text information, image information, and audio information.
The video to be detected may be acquired from a content pool of short videos uploaded by users, so that an uploaded video can be detected before it is shared with and viewed by other users, thereby ensuring the quality of the videos other users watch.
When the feature information is text information, in one possible embodiment, obtaining the text information corresponding to the video to be detected may consist of extracting the title information of the video to be detected. For example, when a user uploads a video, title information is usually edited for the video, so the title information can be extracted as the text information. In another possible embodiment, the video to be detected may include subtitle information, and obtaining the text information may consist of extracting that subtitle information.
When the feature information is image information, in one possible embodiment, obtaining the image information corresponding to the video to be detected may consist of acquiring video frame image information, that is, extracting a sequence of video frame images from the video to be detected. For example, video frames may be extracted at a preset time interval, or a preset number of frames may be extracted evenly across the duration of the video to be detected, as sketched below. In another possible embodiment, since a video usually has a corresponding cover image, obtaining the image information may consist of extracting the cover image information of the video to be detected.
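For illustration only (this sketch is not part of the patent disclosure), interval-based frame extraction could look as follows, here using OpenCV; the interval value and the fallback frame rate are assumptions:

```python
import cv2

def extract_frames(video_path: str, interval_s: float = 1.0):
    """Extract one frame every interval_s seconds from the video to be detected."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if the FPS metadata is missing
    step = max(1, int(round(fps * interval_s)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)  # BGR ndarray of shape (H, W, 3)
        idx += 1
    cap.release()
    return frames
```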
When the feature information is audio information, in one possible embodiment, obtaining the audio information corresponding to the video to be detected may consist of extracting the corresponding audio file from the video to be detected; how to extract an audio file from a video is known in the prior art and is not described again here (a sketch is given below).
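As a hedged illustration of that prior-art step (not part of the disclosure), the audio track could be extracted with the ffmpeg command-line tool; the sample rate and mono downmix are assumptions:

```python
import subprocess

def extract_audio(video_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Extract the audio track of a video as a mono WAV file via the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                    # drop the video stream
         "-ac", "1",               # downmix to mono
         "-ar", str(sample_rate),  # resample
         wav_path],
        check=True,
    )
```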
In step 12, for each feature information, a feature detection result corresponding to the feature information of the video to be detected is determined.
In this step, the detection result of the video to be detected in the text dimension can be determined from the text information, the detection result in the image dimension from the image information, and the detection result in the audio dimension from the audio information. The video to be detected can thus be assessed along multiple analysis dimensions based on its multiple pieces of feature information, improving the comprehensiveness of the detection.
In step 13, a target detection result corresponding to the video to be detected is determined according to the plurality of feature detection results corresponding to the video to be detected.
In this technical scheme, text information, image information and audio information corresponding to a video to be detected are acquired, a feature detection result of the video to be detected in each feature dimension is determined from that information, and a target detection result of the video to be detected is determined from the plurality of feature detection results. Thus, on one hand, the quality of the video to be detected can be assessed along the text, image and audio dimensions, which ensures the accuracy and reliability of the resulting target detection result. On the other hand, no comparison against a high-quality reference video is needed, which widens the range of application of the video detection method and reduces its data processing load, thereby improving video detection efficiency.
In order to enable those skilled in the art to understand the technical solutions provided by the embodiments of the present disclosure, the above steps are described in detail below.
In a possible embodiment, the determining the feature detection result corresponding to the video to be detected under the feature information includes:
inputting the title information into a first text detection model to obtain a feature detection result of the title information corresponding to the video to be detected, where the first text detection model may be trained based on a model formed by connecting a BERT model to an output layer.
For example, to improve the accuracy and comprehensiveness of the video detection result, the training samples of the first text detection model may be labeled along dimensions such as how informative the text is, the symbols it contains, its semantic information, and its syntactic structure. A training sample corresponding to title information may, for instance, be labeled with a score: if the title is too terse, the score may be labeled low; if the title contains too many special symbols, the score may be labeled low; if the semantics of the title are ambiguous or the title reads awkwardly, the score may be labeled low; and if the syntactic structure is abnormal, for example because of grammatical errors, the score may likewise be labeled low. The first text detection model can then be trained on the labeled training samples, where training the model on such samples is similar to training a BERT model and is not repeated here. Training the first text detection model on such samples improves the accuracy and reliability of the feature detection results it produces. A minimal model sketch follows.
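A minimal sketch of such a BERT-plus-output-layer scorer, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (both are assumptions; the patent does not name a specific implementation):

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class TitleScorer(nn.Module):
    """BERT encoder followed by a single output layer regressing a quality score."""
    def __init__(self, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.head = nn.Linear(self.bert.config.hidden_size, 1)  # the output layer

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output).squeeze(-1)  # one score per title

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
batch = tokenizer(["a hypothetical video title"], return_tensors="pt",
                  padding=True, truncation=True)
score = TitleScorer()(batch["input_ids"], batch["attention_mask"])
```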
In a possible embodiment, the determining the feature detection result corresponding to the video to be detected under the feature information includes:
inputting the subtitle information into a second text detection model to obtain a feature detection result of the subtitle information corresponding to the video to be detected, where the second text detection model may likewise be trained based on a model formed by connecting a BERT model to an output layer. Illustratively, text samples may be labeled along dimensions such as inter-sentence coherence, semantic information and syntactic structure to obtain training samples for the second text detection model. During training, the subtitle information and its label value are input, and the BERT model encodes the subtitle information into a text vector. The BERT model comprises two training tasks, next-sentence prediction and masked-token prediction; the loss values and prediction probability matrices for both tasks are obtained, and the model parameters are learned by minimizing the sum of the two losses, yielding the second text detection model. Training a BERT model is a routine choice in the art and is not described here. With this technical scheme, the subtitle information in the video to be detected can also be detected, adding another dimension to the video detection.
In a possible embodiment, the image information includes video frame image information and/or cover image information corresponding to the video to be detected; in step 12, an exemplary implementation of determining the feature detection result corresponding to the video to be detected under the feature information may include:
inputting the video frame image information and/or the cover image information into an image detection model to obtain a feature detection result of the image information corresponding to the video to be detected, wherein a target sample image in a training sample of the image detection model is determined in the following way:
acquiring a candidate sample image with a resolution less than or equal to a target resolution, wherein the target resolution can be set according to the actual usage scenario, which this disclosure does not limit;
and for a candidate sample image with a resolution smaller than the target resolution, adding target pixel points to the candidate sample image according to the resolution of the candidate sample image and the target resolution, to obtain an extended sample image at the target resolution, wherein the target sample images comprise the extended sample images and the candidate sample images whose resolution is the target resolution.
In one possible embodiment, a candidate sample image already at the target resolution may be used directly as a target sample image. The input dimensions of a deep learning network are fixed, whereas the resolutions of the training images are neither fixed nor unique; in the related art, training images are therefore usually standardized by cropping or interpolation to obtain training images of uniform resolution for model training. Such operations, however, inevitably affect the quality of the training images: cropping may remove some features, and interpolation may alter others, so the trained image detection model lacks accuracy.
Based on this, in this embodiment, the target resolution may be chosen as the upper limit of the resolution of the target sample images, so that no image features are lost to cropping in the determined target sample images. For a candidate sample image whose resolution is smaller than the target resolution, the center pixel point of the candidate sample image is aligned with the center pixel point of the target-resolution canvas, and target pixel points are added at the positions that belong to the target-resolution canvas but not to the candidate sample image, thereby extending the candidate sample image. As shown in Fig. 2, A is an original candidate sample image and B is the extended sample image obtained by adding target pixel points to A, where the circles denote the added target pixel points and P is the center pixel point. The value of the target pixel points can be preset according to the actual usage scenario. A minimal padding sketch is shown below.
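A minimal sketch of this center-aligned extension, assuming NumPy arrays and a preset fill value (both are assumptions):

```python
import numpy as np

def pad_to_target(img: np.ndarray, target_h: int, target_w: int, fill: int = 0) -> np.ndarray:
    """Center the candidate sample image on a target-resolution canvas and fill the
    remaining positions with target pixel points, leaving the original pixel values
    and their relative positions unchanged."""
    h, w = img.shape[:2]
    assert h <= target_h and w <= target_w, "candidate must not exceed the target resolution"
    canvas = np.full((target_h, target_w) + img.shape[2:], fill, dtype=img.dtype)
    top, left = (target_h - h) // 2, (target_w - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas
```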
After the target sample image is determined, it may be labeled, for example with scores from angles such as image blur, black edges, stretching deformation, image brightness, image exposure and image sharpness, so as to obtain a training sample.
Therefore, with this technical scheme, the resolution of a candidate sample image can be raised to the target resolution without changing the positions or values of its pixel points, which ensures that the image features extracted from the extended sample image are consistent with those of the corresponding candidate sample image. This improves the accuracy of the image detection model trained on these training samples and also improves the training efficiency of the image detection model.
In a possible embodiment, the image detection model includes a residual sub-model, a pooling layer and a fully connected layer, where the residual sub-model is a pre-trained transfer learning model, and during the training of the image detection model, the parameters of the pooling layer and the fully connected layer are adjusted according to a loss value of the image detection model determined from the training samples, so as to obtain the image detection model.
The residual sub-model can be a ResNet-18 model trained on ImageNet, a large visual database used for visual object recognition research; the pre-training of the ResNet-18 model is a conventional choice in the field and is not described again here.
In this embodiment, a pooling layer and a fully connected layer are connected after the residual sub-model, so that transfer learning can be performed based on the residual sub-model to obtain the image detection model. With the residual sub-model as the base model for transfer learning, its model parameters can be frozen; that is, the parameters of the residual sub-model are not updated while training the image detection model, and only the parameters of the subsequent pooling layer and fully connected layer are adjusted. Training can thus build on the pre-trained model, improving both the training efficiency and the accuracy of the image detection model. For example, the loss function of the image detection model may be the MSE (mean squared error) loss. The number of samples selected per training step, the Adam optimizer parameters, the number of epochs and the learning rate may be set according to the actual usage scenario; for example, batch_size may be set to 120, the Adam parameters β1 and β2 to 0.9 and 0.99, the training samples may be iterated for 10 epochs, and the learning rate lr may be set to 3e-4. A training sketch follows.
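A sketch of this transfer-learning setup under the stated hyperparameters, assuming PyTorch and torchvision (the patent does not prescribe a framework):

```python
import torch
from torch import nn
from torchvision.models import resnet18

class ImageQualityModel(nn.Module):
    """Frozen ResNet-18 residual sub-model followed by a pooling layer and a
    trainable fully connected layer that regress an image quality score."""
    def __init__(self):
        super().__init__()
        base = resnet18(pretrained=True)                 # pre-trained on ImageNet
        self.features = nn.Sequential(*list(base.children())[:-2])  # conv stages only
        for p in self.features.parameters():
            p.requires_grad = False                      # freeze the residual sub-model
        self.pool = nn.AdaptiveMaxPool2d(1)              # fixed-size output for any input size
        self.fc = nn.Linear(512, 1)

    def forward(self, x):
        f = self.pool(self.features(x)).flatten(1)
        return self.fc(f).squeeze(-1)

model = ImageQualityModel()
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),  # only the non-frozen parameters
    lr=3e-4, betas=(0.9, 0.99))                          # values named in the description
criterion = nn.MSELoss()                                 # MSE loss; batch_size=120, 10 epochs
```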
For example, the pooling layer may be an ROI (region of interest) pooling layer, where an ROI characterizes a region of interest. When image processing is performed in the ROI layer, the ROI is mapped to the corresponding position in the feature map according to the input image information, the mapped region is divided into local regions of the same size, and a max-pooling operation is performed on each local region. Both the global features of the input and the features of local regions in the image information are thereby taken into account, and a fixed-size output feature map is obtained from feature inputs of non-uniform size. This improves the accuracy of the output feature map while speeding up image processing, ensuring the accuracy of the feature detection result determined from the output feature map. A short usage sketch follows.
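For illustration, torchvision ships an ROI max-pooling operator whose behavior matches this description (the shapes and the whole-image ROI here are assumptions):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 32, 24)        # backbone output of non-uniform size
# one ROI per row: (batch_index, x1, y1, x2, y2) in input-image coordinates
rois = torch.tensor([[0.0, 0.0, 0.0, 767.0, 1023.0]])
# map the ROI onto the feature map (stride 32), split it into a 7x7 grid,
# and max-pool each cell -> fixed-size output regardless of input size
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 32)
print(pooled.shape)                               # torch.Size([1, 512, 7, 7])
```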
In one possible embodiment, the characteristic information is audio information; in step 12, an exemplary implementation manner of determining a feature detection result corresponding to the video to be detected under the feature information may include:
inputting the audio information into an audio detection model to obtain a feature detection result of the audio information corresponding to the video to be detected, wherein the audio detection model is obtained by training based on a multi-class classification task model, and the number of training samples corresponding to each class of the audio detection model is the same.
In this embodiment, the audio detection model can be obtained by training based on a multi-class classification task model; that is, by classifying the audio information, normal audio can be distinguished from other kinds of audio, such as robotic dubbing, crowd noise, wind noise or traffic noise, so that the feature detection result of the audio information can be further determined. This disclosure does not limit the number of classes in the multi-class model. When training the audio detection model, the number of training samples in each class is the same, which effectively prevents unbalanced sample counts from skewing the output of the audio detection model. It also allows the distribution boundary of each class to be determined accurately later on, so that the audio detection model can distinguish the classes accurately, supporting a more accurate feature detection result for the audio information.
Illustratively, the loss function of the audio detection model is the sum of a cross-entropy function and a maximum mean discrepancy (MMD) function corresponding to the multiple classes of the audio detection model.
The loss function L_HYB of the audio detection model can be expressed as:
L_HYB = L_CCE + L_MMD
where L_CCE denotes the cross-entropy function, for which any cross-entropy function in the related art can be chosen, which this disclosure does not limit;
and L_MMD denotes the maximum mean discrepancy function.
Taking a two-class model as an example,
L_MMD = Σ_l MMD(F_l(X^d), F_l(X^d'))
where F_l(·) denotes the activation function of hidden layer l of the feedforward neural network F included in the audio detection model;
X^d and X^d' are the spectrograms of the audio information corresponding to the two classes d and d', respectively;
and m denotes the number of training samples corresponding to each class in the audio detection model.
Here, MMD (Maximum Mean Discrepancy) denotes the maximum mean discrepancy. Given samples from two distributions, consider a continuous function f on the sample space and compute the mean of f over the samples of each distribution; subtracting the two means gives the mean discrepancy of the two distributions with respect to f. The f that maximizes this discrepancy yields the maximum mean discrepancy. The MMD can then be used as a test statistic to determine whether the two distributions are the same.
Define the function MMD(P, Q), where P and Q are the sample sets of the two distributions: P is represented by a set X = {X_1, ..., X_m} containing m samples, Q by a set Y = {Y_1, ..., Y_m} containing m samples, and k is a Gaussian kernel function (RBF). Then
MMD^2(P, Q) = (1/m^2) Σ_{i=1..m} Σ_{j=1..m} [ k(X_i, X_j) - 2 k(X_i, Y_j) + k(Y_i, Y_j) ]
If P = Q, then MMD(P, Q) = 0. Thus, the boundary between the two classes can be clearly delineated by determining the MMD.
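A sketch of this hybrid loss, assuming PyTorch; the Gaussian kernel bandwidth sigma is an assumption (the patent does not state one):

```python
import torch

def gaussian_kernel(x, y, sigma: float = 1.0):
    """RBF kernel matrix k(x_i, y_j) between two sample sets of shape (m, d)."""
    dist2 = torch.cdist(x, y) ** 2
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd2(x, y, sigma: float = 1.0):
    """Biased empirical estimate of MMD^2(P, Q); it is zero when P = Q."""
    m = x.size(0)
    k_xx = gaussian_kernel(x, x, sigma).sum() / (m * m)
    k_yy = gaussian_kernel(y, y, sigma).sum() / (m * m)
    k_xy = gaussian_kernel(x, y, sigma).sum() / (m * m)
    return k_xx + k_yy - 2 * k_xy

def hybrid_loss(logits, targets, feats_d, feats_d_prime, sigma: float = 1.0):
    """L_HYB = L_CCE + L_MMD: cross entropy plus the MMD between hidden-layer
    features of the two classes d and d'."""
    l_cce = torch.nn.functional.cross_entropy(logits, targets)
    return l_cce + mmd2(feats_d, feats_d_prime, sigma)
```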
With this technical scheme, when the loss function of the audio detection model is determined, the classification cross-entropy term ensures the classification accuracy of the audio detection model, and hence the accuracy of the feature detection result under the correct class. At the same time, because the loss function includes the maximum mean discrepancy term, the boundary representation of each class in the audio detection model is ensured, making it easier to obtain the audio features in the audio information needed for accurate classification. This further improves the accuracy of the feature detection result of the audio information and provides data support for an accurate and reasonable evaluation of the video to be detected.
In a possible embodiment, in step 13, an exemplary implementation of determining the target detection result corresponding to the video to be detected according to the plurality of feature detection results may include:
performing weighted summation on the plurality of feature detection results according to the weight corresponding to each piece of feature information, to obtain the target detection result of the video to be detected.
The weight of each piece of feature information may be set according to the actual usage scenario, and the weights corresponding to the plurality of pieces of feature information sum to 1. For example, on a video retrieval platform, users may be required to provide more informative title information, so the weight corresponding to the text information may be set higher; in random video recommendation, videos with better image quality may need to be recommended to users, so the weight corresponding to the image information may be set higher. The foregoing is illustrative only and does not limit the present disclosure.
As an example, the target detection result may be represented by a score, and the weighted sum of the plurality of feature detection results may be used as the target detection result. As another example, the target detection result may be represented by a grade: the grade range into which the weighted sum falls determines the corresponding grade. Thus, with this technical scheme, the target detection result of the video to be detected can be determined comprehensively from the plurality of feature detection results, so that the video to be detected can be evaluated comprehensively and objectively. A fusion sketch follows.
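A minimal fusion sketch; the example scores and weights are hypothetical:

```python
def fuse_results(results: dict, weights: dict) -> float:
    """Weighted sum of the per-feature detection results; the weights sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6
    return sum(weights[name] * score for name, score in results.items())

target_result = fuse_results(
    {"text": 0.8, "image": 0.6, "audio": 0.9},
    {"text": 0.2, "image": 0.5, "audio": 0.3},  # scenario-specific weights (hypothetical)
)
print(target_result)  # 0.73
```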
Optionally, the determined target detection result may be associated with the video to be detected, so that the target detection result can be taken into account when subsequently recommending the video, providing data support for accurate and reliable video recommendation.
In a possible embodiment, as shown in fig. 3, on the basis of fig. 1, the method may further include:
in step 31, when it is determined that the video to be detected is abnormal according to the target detection result of the video to be detected, determining abnormal feature information in the plurality of feature information corresponding to the video to be detected.
As an example, the target detection result may be represented by a score, a video anomaly threshold may be set according to an actual usage scene, and when the target detection result of the video to be detected is smaller than the video anomaly threshold, it is determined that the video to be detected is anomalous. As another example, the target detection result may be represented by a grade, an abnormal grade may be preset, and in a case that the determined target detection result is the abnormal grade, it may be determined that the video to be detected has an abnormality. Accordingly, abnormal information can be output at the moment to prompt a user who uploads the video to be detected, and the video is abnormal.
In the embodiment of the present disclosure, the video to be detected is detected through the feature information of multiple dimensions of the video to be detected, and specific dimensions of the video to be detected, in which the video to be detected is abnormal, can be further determined when the video to be detected is determined to be abnormal. For example, an abnormal threshold corresponding to each feature information may be preset, where the abnormal threshold corresponding to each feature information may be the same or different, and this disclosure does not limit this. Correspondingly, determining abnormal feature information in a plurality of feature information corresponding to the video to be detected may include: and if the characteristic detection result corresponding to the characteristic information is smaller than the abnormal threshold corresponding to the characteristic information, determining the characteristic information as abnormal characteristic information, so that the specific reason of the abnormality of the video to be detected can be determined when the abnormality exists.
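A minimal sketch of this per-feature check; the scores and thresholds are hypothetical:

```python
def find_abnormal_features(results: dict, thresholds: dict) -> list:
    """Return the feature dimensions whose detection result falls below the
    per-feature anomaly threshold (thresholds may differ per feature)."""
    return [name for name, score in results.items() if score < thresholds[name]]

abnormal = find_abnormal_features(
    {"text": 0.35, "image": 0.72, "audio": 0.66},
    {"text": 0.50, "image": 0.50, "audio": 0.60},
)
print(abnormal)  # ['text'] -> the title/subtitle dimension caused the anomaly
```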
In step 32, prompt information is output, where the prompt information includes the target detection result and the abnormal feature information.
For example, the prompt information may notify the user who uploaded the video to be detected that the uploaded video is abnormal, and may indicate the specific cause of the anomaly, such as abnormal title information. The user can then modify the video to be detected based on the prompt information, and the prompt can also guide the user's subsequent shooting and uploading. In this way, the video to be detected is detected accurately and the accuracy and reliability of the videos used for sharing are guaranteed; at the same time, the user can promptly discover anomalies in uploaded videos, the user's shooting and uploading practices are standardized, the quality of uploaded videos is further improved, and the user experience is enhanced.
The present disclosure also provides a video detection apparatus, as shown in fig. 4, the apparatus 10 includes:
the acquisition module 100 is configured to acquire a plurality of feature information corresponding to a video to be detected, where the feature information includes text information, image information, and audio information;
a first determining module 200, configured to determine, for each piece of feature information, a feature detection result corresponding to the to-be-detected video under the feature information;
a second determining module 300, configured to determine, according to a plurality of feature detection results corresponding to the video to be detected, a target detection result corresponding to the video to be detected.
Optionally, the image information includes video frame image information and/or cover image information corresponding to the video to be detected; the first determining module includes:
the first input sub-module, configured to input the video frame image information and/or the cover image information into an image detection model to obtain a feature detection result of the image information corresponding to the video to be detected, wherein a target sample image in a training sample of the image detection model is determined in the following way:
acquiring a candidate sample image with a resolution less than or equal to a target resolution;
and for a candidate sample image with a resolution smaller than the target resolution, adding target pixel points to the candidate sample image according to the resolution of the candidate sample image and the target resolution, to obtain an extended sample image at the target resolution, wherein the target sample image comprises the extended sample image and the candidate sample image whose resolution is the target resolution.
Optionally, the image detection model includes a residual sub-model, a pooling layer and a fully connected layer, where the residual sub-model is a pre-trained transfer learning model, and during the training of the image detection model, the parameters of the pooling layer and the fully connected layer are adjusted according to a loss value of the image detection model determined from the training samples, so as to obtain the image detection model.
Optionally, the feature information is audio information; the first determining module includes:
the second input sub-module, configured to input the audio information into an audio detection model to obtain a feature detection result of the audio information corresponding to the video to be detected, wherein the audio detection model is obtained by training based on a multi-class classification task model, and the number of training samples corresponding to each class of the audio detection model is the same.
Optionally, the loss function of the audio detection model is the sum of a cross-entropy function and a maximum mean discrepancy (MMD) function corresponding to the multiple classes of the audio detection model.
Optionally, the second determining module is configured to:
perform weighted summation on the plurality of feature detection results according to the weight corresponding to each piece of feature information, to obtain the target detection result of the video to be detected.
Optionally, the apparatus further comprises:
the third determining module, configured to determine, in a case where the video to be detected is determined to be abnormal according to its target detection result, abnormal feature information among the plurality of pieces of feature information corresponding to the video to be detected;
and the output module, configured to output prompt information, wherein the prompt information comprises the target detection result and the abnormal feature information.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video detection method provided by the present disclosure.
Fig. 5 is a block diagram illustrating a video detection apparatus 800 according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the video detection method described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power component 806 provides power to the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; it may also detect a change in the position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the video detection methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the video detection method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the video detection method described above when executed by the programmable apparatus.
Fig. 6 is a block diagram illustrating a video detection apparatus 1900 according to an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to FIG. 6, the device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the video detection method described above.
The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A video detection method, comprising:
acquiring a plurality of feature information corresponding to a video to be detected, wherein the feature information comprises text information, image information and audio information;
determining, for each piece of feature information, a feature detection result corresponding to the video to be detected under the feature information;
and determining a target detection result corresponding to the video to be detected according to a plurality of feature detection results corresponding to the video to be detected.
2. The method according to claim 1, wherein the image information comprises video frame image information and/or cover image information corresponding to the video to be detected, and the determining of the feature detection result corresponding to the video to be detected under the feature information comprises:
inputting the video frame image information and/or the cover image information into an image detection model to obtain a feature detection result of the image information corresponding to the video to be detected, wherein a target sample image in a training sample of the image detection model is determined in the following way:
acquiring a candidate sample image with a resolution less than or equal to a target resolution; and
for a candidate sample image with a resolution smaller than the target resolution, adding target pixel points to the candidate sample image according to the resolution of the candidate sample image and the target resolution, to obtain an extended sample image whose resolution is the target resolution, wherein the target sample image comprises the extended sample image and the candidate sample image whose resolution is the target resolution.
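Not part of the claims: a minimal NumPy sketch of the sample-extension step of claim 2, under the assumption that the added target pixel points are zero-valued padding (the claim does not fix their values):

```python
import numpy as np

# Sketch of the sample-extension step of claim 2: a candidate sample image
# below the target resolution is padded with extra pixel points (zeros here)
# until it reaches the target resolution; images already at the target
# resolution pass through unchanged.
def extend_sample(image: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
    h, w = image.shape[:2]
    assert h <= target_h and w <= target_w, "candidates must not exceed the target"
    pad_h, pad_w = target_h - h, target_w - w
    # Add the target pixel points along the bottom and right edges.
    return np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")

candidate = np.zeros((180, 200, 3), dtype=np.uint8)  # below the 224x224 target
extended = extend_sample(candidate, 224, 224)        # shape (224, 224, 3)
```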
3. The method of claim 2, wherein the image detection model comprises a residual sub-model, a pooling layer, and a fully connected layer, the residual sub-model being a pre-trained transfer learning model, and wherein, during training of the image detection model, parameters of the pooling layer and the fully connected layer are adjusted according to a loss value of the image detection model determined from the training samples, to obtain the image detection model.
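Not part of the claims: one plausible PyTorch reading of the claim 3 architecture, with ResNet-18 assumed as the pre-trained residual sub-model and a binary output assumed for the fully connected layer (both are illustrative choices):

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained residual sub-model, kept frozen (transfer learning).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
residual = nn.Sequential(*list(backbone.children())[:-2])  # residual blocks only
for p in residual.parameters():
    p.requires_grad = False

# Pooling layer and fully connected layer, adjusted during training.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # pooling layer (no learnable parameters here)
    nn.Flatten(),
    nn.Linear(512, 2),        # fully connected layer; 2 classes assumed
)
model = nn.Sequential(residual, head)

# Only the head's parameters are handed to the optimizer, so the loss value
# computed on the training samples adjusts only the pooling/fully-connected stage.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```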
4. The method of claim 1, wherein the feature information is audio information, and the determining of the feature detection result corresponding to the video to be detected under the feature information comprises:
inputting the audio information into an audio detection model to obtain a feature detection result of the audio information corresponding to the video to be detected, wherein the audio detection model is trained based on a multi-classification task model, and the number of training samples corresponding to each class of the audio detection model is the same.
5. The method of claim 4, wherein the loss function of the audio detection model is the sum of a cross-entropy function and a maximum mean discrepancy (MMD) function corresponding to the multiple classes of the audio detection model.
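Not part of the claims: a minimal sketch of a loss of the claim 5 form, i.e. cross entropy plus a maximum mean discrepancy term; a linear-kernel MMD between two feature batches is assumed here, since the claim fixes neither the kernel nor the pairing of batches:

```python
import torch
import torch.nn.functional as F

def mmd_linear(f_src: torch.Tensor, f_tgt: torch.Tensor) -> torch.Tensor:
    # Linear-kernel maximum mean discrepancy between two feature batches.
    delta = f_src.mean(dim=0) - f_tgt.mean(dim=0)
    return delta.dot(delta)

def audio_loss(logits, labels, f_src, f_tgt):
    # Loss of the claim 5 form: cross entropy plus an MMD term.
    return F.cross_entropy(logits, labels) + mmd_linear(f_src, f_tgt)

logits = torch.randn(8, 4)                 # 8 clips, 4 audio classes (assumed)
labels = torch.randint(0, 4, (8,))
f_src, f_tgt = torch.randn(8, 64), torch.randn(8, 64)
loss = audio_loss(logits, labels, f_src, f_tgt)
```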
6. The method according to claim 1, wherein the determining of the target detection result corresponding to the video to be detected according to the plurality of feature detection results corresponding to the video to be detected comprises:
performing a weighted summation of the feature detection results according to the weight corresponding to each piece of feature information, to obtain the target detection result of the video to be detected.
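Not part of the claims: the weighted summation of claim 6 in a few lines, with illustrative weights (the claim leaves their values open):

```python
# Weighted summation of the per-feature detection results (claim 6).
weights = {"text": 0.2, "image": 0.5, "audio": 0.3}
feature_results = {"text": 0.90, "image": 0.65, "audio": 0.80}
target_result = sum(weights[k] * feature_results[k] for k in weights)
print(target_result)  # 0.745
```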
7. The method of claim 1, further comprising:
determining abnormal feature information among the plurality of pieces of feature information corresponding to the video to be detected, under the condition that the video to be detected is determined to be abnormal according to the target detection result of the video to be detected; and
outputting prompt information, wherein the prompt information comprises the target detection result and the abnormal feature information.
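Not part of the claims: a sketch of the claim 7 branch, with an assumed abnormality threshold of 0.6 (the claim fixes neither the threshold nor how the abnormal feature information is singled out):

```python
# Claim 7 branch: when the target detection result marks the video as
# abnormal, report which feature information was abnormal.
feature_results = {"text": 0.90, "image": 0.30, "audio": 0.80}
weights = {"text": 0.2, "image": 0.5, "audio": 0.3}
target_result = sum(weights[k] * feature_results[k] for k in weights)  # 0.57
threshold = 0.6  # assumed abnormality threshold, not fixed by the claim
if target_result < threshold:
    abnormal = [k for k, v in feature_results.items() if v < threshold]
    # Prompt information: the target detection result plus the abnormal features.
    print(f"result={target_result:.3f}, abnormal feature information={abnormal}")
```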
8. A video detection apparatus, comprising:
an acquisition module configured to acquire a plurality of pieces of feature information corresponding to a video to be detected, wherein the feature information comprises text information, image information, and audio information;
a first determining module configured to determine, for each piece of feature information, a feature detection result corresponding to the video to be detected under that feature information; and
a second determining module configured to determine a target detection result corresponding to the video to be detected according to the plurality of feature detection results corresponding to the video to be detected.
9. A video detection apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a plurality of pieces of feature information corresponding to a video to be detected, wherein the feature information comprises text information, image information, and audio information;
for each piece of feature information, determining a feature detection result corresponding to the video to be detected under that feature information; and
determining a target detection result corresponding to the video to be detected according to the plurality of feature detection results corresponding to the video to be detected.
10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 7.
CN202011074238.0A 2020-10-09 2020-10-09 Video detection method, device and computer readable storage medium Pending CN112150457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011074238.0A CN112150457A (en) 2020-10-09 2020-10-09 Video detection method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011074238.0A CN112150457A (en) 2020-10-09 2020-10-09 Video detection method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112150457A 2020-12-29

Family

ID=73952683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011074238.0A Pending CN112150457A (en) 2020-10-09 2020-10-09 Video detection method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112150457A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486853A (en) * 2021-07-29 2021-10-08 北京百度网讯科技有限公司 Video detection method and device, electronic equipment and medium
CN113486853B (en) * 2021-07-29 2024-02-27 北京百度网讯科技有限公司 Video detection method and device, electronic equipment and medium
CN113609824A (en) * 2021-08-10 2021-11-05 上海交通大学 Multi-turn dialog rewriting method and system based on text editing and grammar error correction
CN113676599A (en) * 2021-08-20 2021-11-19 上海明略人工智能(集团)有限公司 Network call quality detection method, system, computer device and storage medium
CN113676599B (en) * 2021-08-20 2024-03-22 上海明略人工智能(集团)有限公司 Network call quality detection method, system, computer equipment and storage medium
CN115953715A (en) * 2022-12-22 2023-04-11 北京字跳网络技术有限公司 Video detection method, device, equipment and storage medium
CN115953715B (en) * 2022-12-22 2024-04-19 北京字跳网络技术有限公司 Video detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109871896B (en) Data classification method and device, electronic equipment and storage medium
TWI766286B (en) Image processing method and image processing device, electronic device and computer-readable storage medium
CN112150457A (en) Video detection method, device and computer readable storage medium
CN110827253A (en) Training method and device of target detection model and electronic equipment
EP3855360A1 (en) Method and device for training image recognition model, and storage medium
CN110619350B (en) Image detection method, device and storage medium
CN106228556B (en) image quality analysis method and device
CN110602527A (en) Video processing method, device and storage medium
CN111259148B (en) Information processing method, device and storage medium
CN111931844B (en) Image processing method and device, electronic equipment and storage medium
CN111583907A (en) Information processing method, device and storage medium
CN109819288B (en) Method and device for determining advertisement delivery video, electronic equipment and storage medium
CN111753895A (en) Data processing method, device and storage medium
CN110889489A (en) Neural network training method, image recognition method and device
CN112148980B (en) Article recommending method, device, equipment and storage medium based on user click
CN111583919A (en) Information processing method, device and storage medium
CN111160448A (en) Training method and device for image classification model
CN111814538B (en) Method and device for identifying category of target object, electronic equipment and storage medium
CN111753917A (en) Data processing method, device and storage medium
CN112035651B (en) Sentence completion method, sentence completion device and computer readable storage medium
CN107135494B (en) Spam short message identification method and device
CN107480773B (en) Method and device for training convolutional neural network model and storage medium
CN111047049B (en) Method, device and medium for processing multimedia data based on machine learning model
CN112381091A (en) Video content identification method and device, electronic equipment and storage medium
CN111274389B (en) Information processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination