CN113920085A - Automatic auditing method and system for product display video

Automatic auditing method and system for product display video

Info

Publication number
CN113920085A
Authority
CN
China
Prior art keywords
video
information
picture
voice
infringement
Prior art date
Legal status
Pending
Application number
CN202111174478.2A
Other languages
Chinese (zh)
Inventor
吕晨
房鹏展
Current Assignee
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Focus Technology Co Ltd
Priority to CN202111174478.2A
Publication of CN113920085A

Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • G06N 3/045: Neural networks; architecture; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G10L 15/26: Speech recognition; speech-to-text systems
    • G10L 25/57: Speech or voice analysis specially adapted for comparison or discrimination, for processing of video signals
    • G06T 2207/10016: Image acquisition modality: video; image sequence
    • G06T 2207/20081: Special algorithmic details: training; learning
    • G06T 2207/20084: Special algorithmic details: artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automatic auditing method for product display videos. S1: receiving a video uploaded by a seller and acquiring the basic information of the video; S2: inputting the video into a video quality analysis module for video quality analysis, where a time-series multilayer neural network and a time-series voice noise analysis network score the video picture jitter and the video voice noise; S3: splitting the video into an image part and a voice part, the image part being split into a multi-frame picture set at one frame per second; S4: sequentially inputting the split picture set into a video picture infringement information auditing module for infringement information auditing, where a multilayer neural network in the module detects whether infringement information exists in each picture; S5: sequentially inputting the split picture set into a video picture pornographic information auditing module to audit pornographic information; S6: sequentially inputting the split picture set into a video picture text information auditing module for text information auditing.

Description

Automatic auditing method and system for product display video
Technical Field
The invention relates to the field of video auditing, in particular to an automatic auditing method and system for product display videos.
Background
Video is comprehensive and intuitive, with good visual and auditory display effects, so sellers on e-commerce platforms actively produce product introduction videos, and the publicity effect is remarkable. Video display has therefore become one of the hot spots of commodity display.
E-commerce platforms carry a huge number of products, and sellers upload a huge volume of display videos every day. These videos vary widely in quality: some shake severely or carry heavy background noise, giving users a poor experience, and some contain violation information such as infringing content, violent and terrorist slogans, and pornographic material. Auditing such a large volume of videos manually requires a large labor investment, audit efficiency is low, videos wait a long time before going on display, and timely display of products is seriously affected.
An existing automatic video auditing method is, for example, CN2018108250709, a log-identification video playback system based on video monitoring. The system includes a system processing module bidirectionally connected with a surveillance video extraction system; the surveillance video extraction system includes an extraction information processing module, a log retrieval identification module, an extraction video analysis module and an extraction video sending module, with the output end of the extraction information processing module connected to the input end of the log retrieval identification module. The method relates to the technical field of video monitoring. This system can greatly improve extraction efficiency and spares monitoring personnel from spending large amounts of time extracting video data, but it cannot audit image content and cannot guarantee the quality of monitoring safety checks.
CN2019800458824 provides a video processing method and apparatus in a video encoding or decoding system for processing a video image that is partitioned into blocks under one or more partitioning constraints. A video encoding or decoding system receives input data for a current block and checks whether the current block is allowed to be partitioned using a predefined partition type according to first and second constraints. The first constraint requires each sub-block partitioned from the current block to be entirely contained in one pipeline unit, and the second constraint requires each sub-block partitioned from the current block to contain one or more complete pipeline units. A pipeline unit is a non-overlapping unit in a video image designed for pipeline processing. If any sub-block produced by the predefined partition type violates either the first or the second constraint, the current block is not partitioned by that partition type.
Image object detection has advanced considerably and detection performance has improved markedly. However, in fields such as video surveillance and vehicle-assisted driving, there is a broader demand for video-based object detection. Owing to motion blur, occlusion and the diversity of form changes in video, it is hard to obtain complete detections with image object detection technology alone, so using information such as object timing and context in the video to improve video object detection performance is very important. For video object detection, a good detector must not only detect accurately on each frame of image, but also ensure the consistency/continuity of the detection results (i.e., for a specific object, a good detector should detect it continuously and not confuse it with other objects; this is called the timing consistency of video object detection).
Video object detection algorithms mainly follow this framework: each video frame is treated as an independent image and a detection result is obtained with an image object detection algorithm; the detection result is corrected using the timing information and context information of the video; and the detection result is further corrected based on the tracking tracks of high-quality detection windows. A deep learning method based on convolutional neural networks, however, can solve more efficiently and more reliably the problems that the prior art cannot.
In view of this situation, the invention adopts an automatic video auditing method and system: videos that violate regulations or are of poor quality are intercepted using deep learning technology and manually reviewed by website auditors; if the review finds problems, the video uploader is notified to make modifications, and qualified videos are displayed directly on the website. The method and system help improve auditing efficiency, save a large amount of manpower, accelerate the display of sellers' videos, and improve the user experience and the overall quality of the website.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method and a system for automatically auditing a product display video.
In order to solve the above technical problem, the present invention provides a multi-level fusion video auditing method, characterized by comprising the following steps:
Step one: receiving a video uploaded by a seller and acquiring the basic information of the video;
Step two: inputting the video into a video quality analysis module; a time-series multilayer neural network in the module analyzes the overall quality of the video and scores the video jitter, and a time-series voice noise analysis network in the module judges the voice noise level of the video and scores the noise;
Step three: splitting the video into an image part and a voice part, splitting the image part into a multi-frame picture set at one frame per second, and retaining the voice part after noise reduction;
Step four: inputting the video frame pictures into an infringement information auditing module, detecting with a multilayer neural network in the module whether infringement information exists in a picture, and if so, recording the infringement information and its position in the picture;
Step five: inputting the video frame pictures into a pornographic information auditing module, extracting picture feature information with a multilayer neural network in the module, judging whether an input picture belongs to the pornographic picture category, and recording the corresponding information if it does;
Step six: inputting the video frame pictures into a text information auditing module, detecting with the text detection and text recognition models in the module whether text exists in a picture and, if so, locating its position; cropping the text and inputting it into the text recognition model, extracting text picture convolution features with a convolutional neural network, and transcribing the text information; comparing against a violent and terrorist lexicon dictionary to judge whether violent and terrorist information such as slogans exists, and recording the corresponding information if it does; and setting a custom text information dictionary, matching the transcribed text information against it, and recording the corresponding matching information if the match succeeds.
Step seven: inputting the voice file from step three into a voice information auditing module, transcribing the voice into text with a voice transcription model in the module, comparing against the violent and terrorist lexicon dictionary to judge whether violent and terrorist information such as slogans exists, and recording the corresponding information if it does; and setting a custom text information dictionary, matching the transcribed text information against it, and recording the corresponding matching information if the match succeeds.
In step one, the video basic information includes the video's resolution, frame rate, duration, storage size, and the like.
In step two, the video quality analysis method is specifically as follows: for the video image, cases where the whole video picture switches (scene changes) are excluded. If a video shakes severely, the transitions between frames within a certain time window are not smooth and the frame differences are large, whereas in a non-shaking video the frame differences are small. A time-series multilayer neural network is therefore constructed; frames at picture switches are excluded, the change between frames is calculated over the time window, and the jitter is normalized and scored. Jitter scores range from 0 to 9 points, where a lower score means less jitter; a score below 5 is set as qualified.
For the video voice, a voice spectrogram is obtained by fast Fourier transform. The time-series voice noise analysis network computes the square of the magnitude spectrum of the time-series voice spectrogram and then the square of the clean voice magnitude spectrum; the difference between the two is the noise condition. The noise data are normalized and scored, with noise scores ranging from 0 to 9 points, where a lower score means less noise; a score below 5 is set as qualified.
In step three, the video image and voice parts are separated. The image part is split into a multi-frame picture set with an open source tool at one frame per second; the voice part is retained in full, part of the additive noise is removed by spectral subtraction using the squared noise magnitude spectrum from step two, and the denoised voice is stored as a voice file.
In step four, infringement information related to product display videos comprises brand trademark infringement and appearance infringement. In the infringement information auditing module, a multilayer neural network for detecting infringement is trained by collecting pictures of infringing brand trademarks and appearance infringement and labeling the corresponding infringement positions and categories. The multi-frame picture set from step three is input into the multilayer neural network in sequence to obtain the confidence, category and coordinate position of infringement information in each picture. The infringement judgment threshold is set to 0.45; when the confidence is greater than the threshold, infringement information is judged to exist in the picture, and the position and category of the infringement information in the picture are output.
In step five, in the pornographic information auditing module, a pornographic key point detection and judgment model is constructed using pornographic picture data sets published on the network. The multi-frame picture set from step three is input into the detection network in sequence, a multilayer convolutional neural network outputs the coordinates of key point positions, and the key point regions are sampled and input into a pornographic classification judgment model, which judges whether each key point region is pornographic.
In step six, the text detection and recognition models are divided into a convolutional neural network and a bidirectional recurrent neural network: the convolutional neural network extracts text picture convolution features from the multi-frame picture set, and the bidirectional recurrent neural network transcribes the text picture convolution features into text. A violent and terrorist lexicon dictionary and a custom text information dictionary are constructed to intercept violent and terrorist slogans and custom text. For violent and terrorist slogans or custom text deliberately written to evade detection, a twin (Siamese) semantic model is constructed: the text to be examined is input into the model, its semantic similarity to the violent and terrorist lexicon dictionary or the custom text information dictionary is computed, and when the similarity is greater than a threshold the text is considered suspected of violation.
In step seven, the video voice information auditing module comprises a voice transcription part and a violation text detection part: the voice file denoised in step three is transcribed into text with the voice transcription model, and the violation text detection part from step six detects whether violation text exists in the video voice.
An automatic auditing system for product display videos, characterized in that: the system comprises a video basic information module, a video quality analysis module, a video picture and voice noise-reduction splitting module, a video picture infringement information auditing module, a video picture pornographic information auditing module, a video picture text information auditing module and a video voice information auditing module, which are connected in sequence;
the video basic information module is used for detecting the basic information of the video and reading the resolution, the video frame rate, the video duration and the video storage capacity of the video;
the video quality analysis module is used for detecting the video jitter condition and the noise condition;
the video picture and voice noise-reduction splitting module is used for splitting the video into a picture set, and for denoising the voice and storing it as a voice file;
the video picture infringement information auditing module is used for detecting whether infringement information exists in the video picture;
the video picture pornographic information auditing module is used for detecting whether pornographic information exists in the video pictures;
the video picture character information auditing module is used for detecting whether the video picture has the violence and terrorist slogans or other illegal character information; the character information auditing module comprises four sub-modules, namely a character detection sub-module, a character recognition sub-module, an violence and terrorist slogan comparison module and a self-defined violation character information comparison module; the character detection submodule is used for detecting whether the picture contains characters or not; the character recognition submodule is used for recognizing the detected characters; the riot and terrorist slogan comparison module is used for detecting whether characters contain the riot and terrorist slogans; and the user-defined illegal character information comparison module is used for detecting whether the characters contain the user-defined illegal character information.
The video voice information auditing module is used for detecting whether the video voice has a sudden and terrorist mouth number or other illegal voice information; the voice information auditing module comprises three sub-modules, namely a voice transcription sub-module, a riot and terrorist slogan comparison module and a self-defined illegal character information comparison module; the voice transcription submodule is used for transcribing the voice file into characters; the riot and terrorist slogan comparison module is used for detecting whether the transcribed characters contain the riot and terrorist slogans; and the custom illegal character information comparison module is used for detecting whether the transcribed characters contain custom illegal character information.
The invention achieves the following beneficial effects: the video voice quality can be optimized; violations such as infringement, pornography and violent or terrorist content in a video can be identified quickly and accurately; a video quality score is provided; the labor expenditure of video auditing is greatly reduced; and auditing efficiency and accuracy are improved.
Drawings
FIG. 1 is a schematic flow diagram of a method of an exemplary embodiment of the present invention;
fig. 2 is a system configuration diagram of an exemplary embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and the exemplary embodiments:
As shown in fig. 1, the automatic auditing method for product display videos includes the following steps:
step S1: and inputting the seller to upload the video and acquiring the basic information of the video.
Step S2: and inputting the video into a video quality analysis module for video quality analysis, and grading the video picture shaking condition and the video voice noise condition by utilizing a time sequence multilayer neural network and a time sequence voice noise analysis network.
Step S3: and splitting the video into an image part and a voice part, splitting the image part into a multi-frame image set according to one frame/second, and reserving the whole voice part after noise reduction.
Step S4: and sequentially inputting the split picture set into a video picture infringement information auditing module for infringement information auditing, detecting whether the picture has infringement information or not by utilizing a multilayer neural network in the infringement information auditing module, and recording the infringement information and the position of the infringement information in the picture if the picture has the infringement information.
Step S5: and sequentially inputting the split picture set into a video picture yellow-related information auditing module for yellow-related information auditing, extracting picture characteristic information by using a multilayer neural network in the yellow-related information auditing module, judging whether the input picture belongs to a yellow picture category, and recording corresponding information if the input picture belongs to the yellow picture.
Step S6: the split picture set is sequentially input into a video picture character information auditing module for character information auditing, whether characters exist in the picture is detected firstly by utilizing a character detection and character recognition model in the character information auditing module, and if the characters exist in the picture, the character position is positioned; intercepting characters and inputting the characters into a character recognition model, extracting character picture convolution characteristics by utilizing a convolution neural network, and transcribing character information; comparing the violence terrorist thesaurus dictionary, judging whether the violence terrorist information exists, and recording corresponding information if the violence terrorist slogan exists and the like; and setting a self-defined character information dictionary, matching the transcribed character information, and recording corresponding matching information if the matching is successful.
Step S7: inputting the split voice file into a video voice information module for video voice information auditing, and transcribing the video voice into characters by using a voice transcription model in the video voice information auditing module; comparing the violence and terrorism word stock dictionary, judging whether the violence and terrorism information exists, and recording corresponding information if the violence and terrorism number exists; and setting a self-defined character information dictionary, matching the transcribed character information, and recording corresponding matching information if the matching is successful.
In step S1, the video basic information includes the video's resolution, frame rate, duration, storage size, and the like. The video resolution is the width and height of the video picture, the frame rate is the number of picture frames contained in each second of video, and the storage size is the space the video occupies in storage. It is recommended that the width and height be at least 640 × 360, the frame rate at least 24 frames per second, and the storage size below 150 MB.
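By way of illustration only (the patent does not disclose an implementation), a minimal Python sketch of such a basic-information check follows. It assumes OpenCV is used to read the video properties; the function name is hypothetical and the thresholds are the recommended values above.

```python
import os
import cv2  # OpenCV, assumed available for reading video properties


def check_basic_info(path, min_w=640, min_h=360, min_fps=24, max_mb=150):
    """Read resolution, frame rate, duration and file size, and flag videos
    that fall outside the recommended ranges."""
    cap = cv2.VideoCapture(path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    info = {
        "resolution": (width, height),
        "frame_rate": fps,
        "duration_s": frames / fps if fps else 0.0,
        "size_mb": os.path.getsize(path) / (1024 * 1024),
    }
    info["qualified"] = (width >= min_w and height >= min_h
                         and fps >= min_fps and info["size_mb"] <= max_mb)
    return info
```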
In step S2, the video quality analysis method is specifically as follows: for the video image, cases where the whole video picture switches (scene changes) are excluded. If a video shakes severely, the transitions between frames within a certain time window are not smooth and the frame differences are large, whereas in a non-shaking video the frame differences are small. A time-series multilayer neural network is therefore constructed; frames at picture switches are excluded, the change between frames is calculated over the time window, and the jitter is normalized and scored. Jitter scores range from 0 to 9 points, where a lower score means less jitter; a score below 5 is set as qualified. For the video voice, a voice spectrogram is obtained by fast Fourier transform. The time-series voice noise analysis network computes the square of the magnitude spectrum of the time-series voice spectrogram and then the square of the clean voice magnitude spectrum; the difference between the two is the noise condition. The noise data are normalized and scored, with noise scores ranging from 0 to 9 points, where a lower score means less noise; a score below 5 is set as qualified.
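The scoring networks themselves are not disclosed; the Python sketch below only illustrates the scoring convention described above (scene cuts excluded, a frame-difference statistic and a noisy-minus-clean power difference each normalized to a 0-9 score, with scores below 5 qualified). The cut threshold and normalization constants are assumptions.

```python
import numpy as np


def jitter_score(frames, cut_threshold=0.5):
    """Score inter-frame change on a 0-9 scale; below 5 is qualified.
    `frames` is a list of grayscale frames as float arrays in [0, 1];
    differences above `cut_threshold` are treated as whole-picture
    switches (scene cuts) and excluded, as prescribed above."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        d = float(np.mean(np.abs(cur - prev)))
        if d < cut_threshold:  # drop whole-picture switches
            diffs.append(d)
    raw = float(np.mean(diffs)) if diffs else 0.0
    score = min(9.0, raw / cut_threshold * 9.0)  # normalize to 0-9
    return score, score < 5


def noise_score(noisy_power, clean_power):
    """Noise = squared noisy magnitude spectrum minus squared clean
    magnitude spectrum, per the description; normalized to 0-9."""
    noise = np.maximum(noisy_power - clean_power, 0.0)
    raw = float(noise.mean()) / (float(noisy_power.mean()) + 1e-8)
    score = min(9.0, raw * 9.0)
    return score, score < 5
```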
In step S3, the video image and voice parts are separated. The image part is split into a multi-frame picture set with an open source tool at one frame per second; the voice part is retained in full, part of the additive noise is removed by spectral subtraction using the squared noise magnitude spectrum from step S2, and the denoised voice is stored as a voice file.
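A minimal sketch of this splitting and spectral-subtraction step follows, assuming ffmpeg as the open source tool (the text does not name one); the helper names and output paths are hypothetical.

```python
import subprocess

import numpy as np


def split_video(video_path, frame_pattern="frame_%05d.jpg", audio_path="audio.wav"):
    """Extract one picture per second and the full audio track with ffmpeg."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1",
                    frame_pattern], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-acodec", "pcm_s16le", audio_path], check=True)


def spectral_subtract(spec, noise_power, floor=0.01):
    """Classic spectral subtraction: subtract the estimated noise power
    (the squared noise magnitude spectrum from step S2) from each frame's
    power spectrum, clamp at a small floor, and keep the original phase."""
    power = np.abs(spec) ** 2
    cleaned = np.maximum(power - noise_power, floor * power)
    return np.sqrt(cleaned) * np.exp(1j * np.angle(spec))
```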
In step S4, infringement information related to product display videos includes brand trademark infringement and appearance infringement. In the infringement information auditing module, a multilayer neural network for detecting infringement is trained by collecting pictures of infringing brand trademarks and appearance infringement and labeling the corresponding infringement positions and categories. The multi-frame picture set from S3 is input into the multilayer neural network in sequence to obtain the confidence, category and coordinate position of infringement information in each picture. The infringement judgment threshold is set to 0.45; when the confidence is greater than the threshold, infringement information is judged to exist in the picture, and the position and category of the infringement information in the picture are output.
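The detection network itself is not specified, so the sketch below shows only the thresholding and recording logic applied to its outputs, using the 0.45 threshold stated above; the data layout of the detections is an assumption.

```python
def filter_infringements(detections, threshold=0.45):
    """Keep detector outputs whose confidence exceeds the threshold.
    `detections` are (confidence, category, box) triples as produced by
    the detection network; boxes are (x1, y1, x2, y2) coordinates."""
    hits = [(conf, cat, box) for conf, cat, box in detections
            if conf > threshold]
    return bool(hits), hits


# Example: two candidate detections, one above the threshold.
found, records = filter_infringements(
    [(0.62, "brand_logo", (10, 20, 80, 90)),
     (0.30, "appearance", (5, 5, 40, 40))])
```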
In step S5, in the pornographic information auditing module, a pornographic key point detection and judgment model is constructed using pornographic picture data sets published on the network. The multi-frame picture set from step S3 is input into the detection network in sequence, a multilayer convolutional neural network outputs the coordinates of key point positions, and the key point regions are sampled and input into the pornographic classification judgment model, which judges whether each key point region is pornographic.
In step S6, the text detection and recognition models are divided into a convolutional neural network and a bidirectional recurrent neural network: the convolutional neural network extracts text picture convolution features from the multi-frame picture set, and the bidirectional recurrent neural network transcribes the text picture convolution features into text. A violent and terrorist lexicon dictionary and a custom text information dictionary are constructed to intercept violent and terrorist slogans and custom text. For violent and terrorist slogans or custom text deliberately written to evade detection, a twin (Siamese) semantic model is constructed: the text to be examined is input into the model, its semantic similarity to the violent and terrorist lexicon dictionary or the custom text information dictionary is computed, and when the similarity is greater than a threshold the text is considered suspected of violation.
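A sketch of the twin-model idea under stated assumptions: the shared encoder here is a stand-in character-hash embedding, not the learned Siamese semantic model of the patent, and the 0.8 threshold is illustrative.

```python
import numpy as np


def embed(text, dim=256):
    """Placeholder shared encoder: a normalized character-hash vector.
    In the method above this is a learned twin (Siamese) semantic model;
    hashing is used here only to keep the sketch self-contained."""
    v = np.zeros(dim)
    for ch in text:
        v[hash(ch) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v


def is_suspect(text, lexicon, threshold=0.8):
    """Flag text whose similarity to any dictionary entry exceeds the
    threshold, catching variants written to evade exact matching."""
    e = embed(text)
    return any(float(np.dot(e, embed(w))) > threshold for w in lexicon)
```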
The input of a convolutional neural network (CNN) is usually a matrix, such as the pixel information of each frame picture; its input and output data are sometimes called feature maps. The convolutional layer, the first layer after the input layer, extracts local features of the input while maintaining the spatial continuity of the image, and consists of multiple filters that compute different feature maps. Convolutional layers at different depths extract features of different levels: the first convolutional layer extracts low-level features such as edges and lines, while deeper convolutional layers extract higher-level, more abstract feature information; as depth grows, the representations carry less information about the visual content of the image and more about its category, which is why models for complex feature-extraction tasks generally have more layers. The pooling layer immediately follows the convolutional layer; it reduces the dimensionality of the intermediate hidden layers, reduces the amount of subsequent computation, and contributes spatial invariance. Pooling can be computed as max pooling, average (mean) pooling, mixed pooling, stochastic pooling, and so on. The pooling layer has no parameters to learn; it simply takes the maximum (or average, etc.) over the target region and does not change the order of the input data.
For a CNN, its internal representations can be presented in visual form as representations of visual concepts; since the third wave of deep learning, various techniques have been developed to visualize and interpret these models. The network structure adopted here is a convolutional neural network using the AlexNet model, consisting of five convolutional layers, three pooling layers, two fully connected layers and a local response normalization layer. Its training realized GPU-accelerated computation for the first time, allowing the model to finish training within an acceptable time; dropout is adopted to avoid model overfitting, ReLU replaces sigmoid as the activation function, and max pooling replaces average pooling.
After a number of convolution and pooling operations, the CNN is typically connected to one or more fully connected layers at the end. A fully connected layer connects each neuron of the current layer with all neurons of the previous layer; it processes the extracted features through layer-by-layer computation and mapping, aggregates all local features, and passes the output values obtained from the neurons' activation functions to the output layer.
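The description matches the classic AlexNet layout, so a PyTorch sketch of such a network follows; the layer sizes are assumptions taken from the original AlexNet, not values disclosed by the patent.

```python
import torch.nn as nn


class AlexNetStyle(nn.Module):
    """Five convolutional layers, three max-pooling layers, local response
    normalization, ReLU activations, dropout, and two fully connected
    layers before the output layer, as described above."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),  # output layer
        )

    def forward(self, x):            # x: (batch, 3, 224, 224)
        x = self.features(x)         # convolution + pooling stages
        x = x.flatten(1)             # feature maps -> vector
        return self.classifier(x)    # fully connected layers -> logits
```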
In step S7, the video voice information auditing module includes two parts, voice transcription and violation text detection: the voice transcription part transcribes the voice file denoised in step S3 into text, and the violation text detection part from step S6 detects whether violation text exists in the transcribed video voice.
An automatic auditing system for product display videos, characterized in that: the system comprises a video basic information module, a video quality analysis module, a video picture and voice noise-reduction splitting module, a video picture infringement information auditing module, a video picture pornographic information auditing module, a video picture text information auditing module and a video voice information auditing module, which are connected in sequence;
the video basic information module is used for detecting the basic information of the video and reading the resolution, the video frame rate, the video duration and the video storage capacity of the video;
the video quality analysis module is used for detecting the video jitter condition and the noise condition;
the video picture and voice noise-reduction splitting module is used for splitting the video into a picture set, and for denoising the voice and storing it as a voice file;
the video picture infringement information auditing module is used for detecting whether infringement information exists in the video picture;
the video picture pornographic information auditing module is used for detecting whether pornographic information exists in the video pictures;
the video picture character information auditing module is used for detecting whether the video picture has the violence and terrorist slogans or other illegal character information; the character information auditing module comprises four sub-modules, namely a character detection sub-module, a character recognition sub-module, an violence and terrorist slogan comparison module and a self-defined violation character information comparison module; the character detection submodule is used for detecting whether the picture contains characters or not; the character recognition submodule is used for recognizing the detected characters; the riot and terrorist slogan comparison module is used for detecting whether characters contain the riot and terrorist slogans; and the user-defined illegal character information comparison module is used for detecting whether the characters contain the user-defined illegal character information.
The video voice information auditing module is used for detecting whether the video voice has a sudden and terrorist mouth number or other illegal voice information; the voice information auditing module comprises three sub-modules, namely a voice transcription sub-module, a riot and terrorist slogan comparison module and a self-defined illegal character information comparison module; the voice transcription submodule is used for transcribing the voice file into characters; the riot and terrorist slogan comparison module is used for detecting whether characters contain the riot and terrorist slogans; and the user-defined illegal character information comparison module is used for detecting whether the characters contain the user-defined illegal character information.
Fig. 2 is a schematic structural diagram of the automatic auditing system for product display videos.
Module 1 is the video basic information module, used for detecting the basic information of the video and reading the video's resolution, frame rate, duration, storage size, and the like.
Module 2 is the video quality analysis module, used for detecting video jitter and noise. For the video image, cases where the whole video picture switches (scene changes) are excluded. If a video shakes severely, the transitions between frames within a certain time window are not smooth and the frame differences are large, whereas in a non-shaking video the frame differences are small. A time-series multilayer neural network is therefore constructed; frames at picture switches are excluded, the change between frames is calculated over the time window, and the jitter is normalized and scored. Jitter scores range from 0 to 9 points, where a lower score means less jitter; a score below 5 is set as qualified. For the video voice, a voice spectrogram is obtained by fast Fourier transform. The time-series voice noise analysis network computes the square of the magnitude spectrum of the time-series voice spectrogram and then the square of the clean voice magnitude spectrum; the difference between the two is the noise condition. The noise data are normalized and scored, with noise scores ranging from 0 to 9 points, where a lower score means less noise; a score below 5 is set as qualified.
Module 3 is the video picture and voice noise-reduction splitting module, used for splitting the video into a picture set and for denoising the voice and storing it as a voice file. The video is split into a multi-frame picture set with an open source tool at one frame per second; the voice part is retained in full, part of the additive noise is removed by spectral subtraction using the squared noise magnitude spectrum obtained in module 2, and the denoised voice is stored as a voice file.
Module 4 is the video picture infringement information auditing module, used for detecting whether infringement information exists in the video pictures. Infringement information contains two broad categories: brand trademark infringement and appearance infringement. Infringing brand trademark and appearance infringement pictures are collected, the corresponding infringement positions and categories are labeled, and a multilayer neural network model is trained. The picture set to be audited is input into the infringement detection multilayer neural network in sequence to obtain the confidence, category and coordinate position of objects in each picture. The infringement judgment threshold is set to 0.45; when the confidence is greater than the threshold, infringement information is judged to exist in the picture, and the position and category of the infringement information in the picture are output.
Module 5 is the video picture pornographic information auditing module, used for detecting whether pornographic information exists in the video pictures. Pornographic picture data sets published on the network are used; since the judgment standard for a pornographic picture is whether certain sensitive key parts are exposed, a pornographic key point detection and judgment model is constructed in this module. A multilayer convolutional neural network outputs the key point position coordinates, the key point regions are sampled and input into a pornographic classification judgment model, and whether each key point region is pornographic is judged.
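A schematic sketch of this two-stage check, assuming trained key point and classification models are supplied as callables; the names and crop size are illustrative, not from the patent.

```python
def review_frame(image, keypoint_net, classifier, crop_size=64):
    """Two-stage pornographic-content check: a key point network proposes
    sensitive-region coordinates, each region is cropped and passed to a
    binary classifier, and any positive crop flags the frame."""
    for (x, y) in keypoint_net(image):  # predicted key point coordinates
        x0 = max(0, x - crop_size // 2)
        y0 = max(0, y - crop_size // 2)
        crop = image[y0:y0 + crop_size, x0:x0 + crop_size]
        if classifier(crop) == 1:  # 1 = pornographic region
            return True
    return False
```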
Module 6 is the video picture text information auditing module, used for detecting whether the video pictures contain violent and terrorist slogans or other violation text information. The module is divided into four sub-modules: a text detection sub-module 61, a text recognition sub-module 62, a violent and terrorist slogan comparison module 63 and a custom violation text information comparison module 64.
The text detection sub-module 61 is used for detecting whether the video pictures contain text. Collected pictures containing text are labeled, partial samples are generated, and a text detection model based on a multilayer convolutional neural network is trained. The video picture set is input into the text detection model in sequence; if a picture contains text, the model outputs the position coordinates of the text.
The text recognition sub-module 62 is used for recognizing the detected text. A text recognition model is trained with generated and labeled data collected from existing data sets. The model is divided into two parts, a convolutional neural network and a bidirectional recurrent neural network: the convolutional neural network extracts text picture convolution features, and the bidirectional recurrent neural network transcribes the text picture convolution features into text.
The violent and terrorist slogan comparison sub-module 63 is used for detecting whether the text contains violent and terrorist slogans: a violent and terrorist lexicon dictionary is constructed and matched against the recognized text, and the corresponding information is recorded if a violent and terrorist slogan exists.
The custom violation text information comparison sub-module 64 is used for detecting whether the text contains custom violation text information: a custom text information dictionary is set, template matching is performed on the transcribed text information, and the corresponding matching information is recorded if the match succeeds.
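A minimal sketch of the dictionary comparison performed by sub-modules 63 and 64, using exact substring matching; the example dictionary entries are hypothetical.

```python
def match_violations(text, terror_lexicon, custom_lexicon):
    """Match recognized text against the violent and terrorist slogan
    dictionary and the custom violation dictionary, recording which
    entries matched and from which dictionary."""
    records = []
    for source, lexicon in (("terror", terror_lexicon),
                            ("custom", custom_lexicon)):
        for word in lexicon:
            if word in text:
                records.append((source, word))
    return records


# Example with hypothetical dictionary entries.
hits = match_violations("limited time offer", ["forbidden slogan"], ["offer"])
```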
Module 7 is the video voice information auditing module, used for detecting whether the video voice contains violent and terrorist slogans or other violation voice information. The module is divided into three sub-modules: a voice transcription sub-module 71, a violent and terrorist slogan comparison module 72 and a custom violation text information comparison module 73.
The voice transcription sub-module 71 is used for transcribing the video voice information into text information. Collected voice files are labeled, some open voice samples are collected, and a voice transcription model based on a bidirectional LSTM with CTC loss is trained. The video voice is input into the voice transcription model to obtain the transcribed text.
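A sketch of a bidirectional-LSTM-with-CTC transcription model of the kind described, written in PyTorch; the feature dimension, vocabulary size and sequence lengths are placeholder assumptions.

```python
import torch
import torch.nn as nn


class SpeechTranscriber(nn.Module):
    """Bidirectional LSTM over acoustic feature frames with a CTC output
    layer, matching the training setup described above."""

    def __init__(self, n_feats=80, hidden=256, n_tokens=5000):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tokens + 1)  # +1 for CTC blank

    def forward(self, feats):  # feats: (batch, time, n_feats)
        h, _ = self.lstm(feats)
        return self.out(h).log_softmax(-1)


model = SpeechTranscriber()
ctc = nn.CTCLoss(blank=5000)  # blank index matches the extra output token
x = torch.randn(2, 100, 80)   # 2 utterances, 100 feature frames each
log_probs = model(x).transpose(0, 1)  # CTCLoss expects (time, batch, tokens)
targets = torch.randint(0, 5000, (2, 20))
loss = ctc(log_probs, targets,
           torch.full((2,), 100, dtype=torch.long),
           torch.full((2,), 20, dtype=torch.long))
```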
The violent and terrorist slogan comparison sub-module 72 is the same as sub-module 63 and belongs to a shared module; it is used for detecting whether the transcribed text contains violent and terrorist slogans: the violent and terrorist lexicon dictionary is matched against the recognized text, and the corresponding information is recorded if a violent and terrorist slogan exists.
The custom violation text information comparison sub-module 73 is the same as sub-module 64 and belongs to a shared module; it is used for detecting whether the transcribed text contains custom violation text information: the custom text information dictionary is set, template matching is performed on the transcribed text information, and the corresponding matching information is recorded if the match succeeds.
The invention mainly provides an automatic auditing method and system for product display videos that can optimize video voice quality, quickly and accurately identify whether a video contains violations such as infringement, pornography or violent and terrorist content, and report the video quality, greatly reducing the manpower expenditure of video auditing and improving auditing efficiency and accuracy.
The above embodiments do not limit the present invention in any way, and all other modifications and applications that can be made to the above embodiments in equivalent ways are within the scope of the present invention.

Claims (4)

1. An automatic auditing method for product display videos is characterized by comprising the following steps:
step S1: receiving a video uploaded by a seller and acquiring the basic information of the video;
step S2: inputting the video into a video quality analysis module for video quality analysis, and scoring the video picture jitter and the video voice noise with a time-series multilayer neural network and a time-series voice noise analysis network;
step S3: splitting the video into an image part and a voice part, splitting the image part into a multi-frame picture set at one frame per second, and retaining the voice part after noise reduction;
step S4: sequentially inputting the split picture set into a video picture infringement information auditing module for infringement information auditing, detecting with a multilayer neural network in the module whether infringement information exists in a picture, and if so, recording the infringement information and its position in the picture;
step S5: sequentially inputting the split picture set into a video picture pornographic information auditing module for pornographic information auditing, extracting picture feature information with a multilayer neural network in the module, judging whether an input picture belongs to the pornographic picture category, and recording the corresponding information if it does;
step S6: sequentially inputting the split picture set into a video picture text information auditing module for text information auditing; first detecting with the text detection and text recognition models in the module whether text exists in a picture and, if so, locating its position; cropping the text and inputting it into the text recognition model, extracting text picture convolution features with a convolutional neural network, and transcribing the text information; comparing against a violent and terrorist lexicon dictionary to judge whether violent and terrorist information such as slogans exists, and recording the corresponding information if it does; setting a custom text information dictionary, matching the transcribed text information against it, and recording the corresponding matching information if the match succeeds;
step S7: inputting the split voice file into a video voice information auditing module for video voice information auditing, and transcribing the video voice into text with a voice transcription model in the module; comparing against the violent and terrorist lexicon dictionary to judge whether violent and terrorist information exists, and recording the corresponding information if a violent and terrorist slogan exists; and setting a custom text information dictionary, matching the transcribed text information against it, and recording the corresponding matching information if the match succeeds.
2. The method for automatically auditing product display videos according to claim 1, wherein in S1, the video basic information includes video resolution, video frame rate, video duration, video storage size, and the like;
in S2, the video quality analysis method is specifically as follows: for the video image, cases where the whole video picture switches (scene changes) are excluded; if a video shakes severely, the transitions between frames within a certain time window are not smooth and the frame differences are large, whereas in a non-shaking video the frame differences are small; a time-series multilayer neural network is constructed, frames at picture switches are excluded, the change between frames is calculated over the time window, and the jitter is normalized and scored, with jitter scores ranging from 0 to 9 points, where a lower score means less jitter, and a score below 5 is set as qualified;
for the video voice, a voice spectrogram is obtained by fast Fourier transform, the time-series voice noise analysis network computes the square of the magnitude spectrum of the time-series voice spectrogram and then the square of the clean voice magnitude spectrum, the difference being the noise condition; the noise data are normalized and scored, with noise scores ranging from 0 to 9 points, where a lower score means less noise, and a score below 5 is set as qualified;
in step S3, the video image and voice parts are separated; the image part is split into a multi-frame picture set with an open source tool at one frame per second, the voice part is retained in full, part of the additive noise is removed by spectral subtraction using the squared noise magnitude spectrum from step S2, and the denoised voice is stored as a voice file;
in S4, infringement information related to product display videos comprises brand trademark infringement and appearance infringement; in the infringement information auditing module, a multilayer neural network for detecting infringement is trained by collecting pictures of infringing brand trademarks and appearance infringement and labeling the corresponding infringement positions and categories; the multi-frame picture set from step S3 is input into the multilayer neural network in sequence to obtain the confidence, category and coordinate position of infringement information in each picture; the infringement judgment threshold is set to 0.45, and when the confidence is greater than the threshold, infringement information is judged to exist in the picture, and the position and category of the infringement information in the picture are output;
in S5, in the pornographic information auditing module, a pornographic key point detection and judgment model is constructed using pornographic picture data sets published on the network; the multi-frame picture set from step S3 is input into the detection network in sequence, a multilayer convolutional neural network outputs the coordinates of key point positions, and the key point regions are sampled and input into the pornographic classification judgment model, which judges whether each key point region is pornographic;
in S6, the text detection and recognition models are divided into a convolutional neural network and a bidirectional recurrent neural network: the convolutional neural network extracts text picture convolution features from the multi-frame picture set, and the bidirectional recurrent neural network transcribes the text picture convolution features into text; a violent and terrorist lexicon dictionary and a custom text information dictionary are constructed to intercept violent and terrorist slogans and custom text; for violent and terrorist slogans or custom text deliberately written to evade detection, a twin (Siamese) semantic model is constructed, the text to be examined is input into the model, its semantic similarity to the violent and terrorist lexicon dictionary or the custom text information dictionary is computed, and when the similarity is greater than a threshold the text is considered suspected of violation;
in step S7, the video voice information auditing module includes two parts, voice transcription and violation text detection: the voice transcription part transcribes the voice file denoised in step S3 into text, and the violation text detection part from step S6 detects whether violation text exists in the video voice.
3. The method for automatically auditing product display videos according to any one of claims 1-2, wherein in step S1, the video basic information includes video resolution, video frame rate, video duration, video storage capacity, and the like; the video resolution is the width and height size of the video pictures, the video frame rate is the number of frames of pictures contained in each second of the video, the video storage capacity is the space occupied by video storage, the width and height size is recommended to be more than 640 × 360, the video frame rate is more than 24 frames, and the video storage space is recommended to be less than 150 MB;
in step S2, the method for analyzing the video quality specifically includes: for a video image, the situation of integral switching of video pictures is eliminated, if the video is seriously jittered, the transition between frames is not smooth and the frame difference is large in a certain time sequence time, but the frame difference is small in a non-jittered video, a time sequence multilayer neural network is constructed, the frames switched by the video pictures are eliminated, the change situation between the frames is calculated in the time sequence time, the video jitters are normalized and then scored, the video jittering ranges are respectively 0-9 points, wherein the lower the jittering score is, the lower the jittering degree is, and the jittering score is lower than 5 and is set as qualified; for video voice, a voice spectrogram is obtained by utilizing fast Fourier transform, the square of a magnitude spectrum is calculated for the time sequence voice spectrogram by utilizing a time sequence voice noise analysis network, then the square of a pure voice magnitude spectrum is calculated, the difference is a noise condition, noise data are normalized and scored, noise ranges are respectively 0-9 minutes, the lower the noise score is, the less the noise is, and the noise score is lower than 5, and the qualified result is set;
in step S3, the video pictures and the voice are separated; the picture part is split by an open-source tool into a multi-frame picture set at one frame per second, while the voice part is kept in full; part of the additive noise is removed by spectral subtraction using the squared noise magnitude spectrum from step S2, and the denoised voice is stored as an audio file;
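A hedged sketch of this step: frame extraction via the ffmpeg command line and a plain-NumPy spectral subtraction; the noise-power estimate is assumed to come from step S2, and all names are illustrative:

    import subprocess
    import numpy as np

    def split_frames(video: str, out_dir: str) -> None:
        # Resample the picture stream to one frame per second and dump JPEGs.
        subprocess.run(
            ["ffmpeg", "-i", video, "-r", "1", f"{out_dir}/%05d.jpg"],
            check=True)

    def spectral_subtract(frames: np.ndarray, noise_power: np.ndarray) -> np.ndarray:
        # frames: (n, win) windowed audio; noise_power: (win//2+1,) |N(f)|^2.
        spec = np.fft.rfft(frames, axis=1)
        power = np.abs(spec) ** 2
        clean_power = np.maximum(power - noise_power, 0.0)  # subtract noise energy
        phase = np.angle(spec)                              # reuse the noisy phase
        return np.fft.irfft(np.sqrt(clean_power) * np.exp(1j * phase), axis=1)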
in step S4, the infringement information relevant to product display videos covers brand-trademark infringement and appearance (design) infringement; in the infringement information auditing module, a multilayer neural network for infringement detection is trained on collected trademark- and appearance-infringement pictures annotated with infringement positions and categories; the frames of the multi-frame picture set from S3 are fed into the network in sequence to obtain the confidence, category, and coordinate position of infringement information in each picture; the decision threshold is set to 0.45, and when the confidence exceeds it the picture is judged to contain infringement information and the positions and categories are output;
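The thresholding in this step is straightforward; below is a minimal sketch in which detector stands in for the trained multilayer network (a hypothetical callable returning (confidence, category, box) triples), and only the 0.45 threshold is taken from the claim:

    from typing import Callable, List, Tuple

    Detection = Tuple[float, str, Tuple[int, int, int, int]]  # (conf, class, box)

    def audit_infringement(frames, detector: Callable,
                           threshold: float = 0.45) -> List[Detection]:
        flagged = []
        for frame in frames:                        # step-S3 picture set
            for conf, cls, box in detector(frame):  # trademark / appearance classes
                if conf > threshold:                # claim: conf > 0.45 => infringing
                    flagged.append((conf, cls, box))
        return flagged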
in step S5, in the yellow-related information auditing module, a yellow-related key-point detection model is built from a publicly available yellow-related picture data set; the frames of the multi-frame picture set from step S3 are fed into the detection network in sequence, a multi-layer convolutional neural network outputs the coordinates of the key-point positions, and each key-point region is sampled and passed to the yellow-related classification model to decide whether that region is yellow-related;
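A two-stage sketch of this step, with keypoint_net and classifier as hypothetical stand-ins for the claimed detection and classification models; the crop size is an assumption:

    import numpy as np

    def audit_frame(frame: np.ndarray, keypoint_net, classifier,
                    crop: int = 64) -> bool:
        h, w = frame.shape[:2]
        for (x, y) in keypoint_net(frame):          # predicted key-point coordinates
            x0, y0 = max(0, int(x) - crop), max(0, int(y) - crop)
            x1, y1 = min(w, int(x) + crop), min(h, int(y) + crop)
            region = frame[y0:y1, x0:x1]            # sample the key-point region
            if region.size and classifier(region) == "porn":
                return True                         # frame flagged as yellow-related
        return False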
in step S6, the text detection and recognition model consists of a convolutional neural network and a bidirectional recurrent neural network: the convolutional network extracts convolutional features of the text in the multi-frame picture set, and the bidirectional recurrent network transcribes those features into text; a violent-and-terrorist word bank dictionary and a user-defined text dictionary are built to intercept violent-and-terrorist slogans and user-defined illegal text; for slogans or user-defined text deliberately altered to evade detection, a twin (Siamese) semantic model is built: the text to be checked is fed into the model, its semantic similarity to the violent-and-terrorist dictionary or the user-defined dictionary is computed, and the text is deemed suspected illegal when the similarity exceeds a threshold; the convolutional neural network (CNN) adopts the AlexNet model, consisting of five convolutional layers, three pooling layers, two fully connected layers, and a local response normalization layer; dropout is used to avoid model overfitting, ReLU replaces sigmoid as the activation function, and max pooling replaces average pooling; after several convolution and pooling operations, the CNN typically ends in one or more fully connected layers, each neuron of which is connected to all neurons of the previous layer; these layers process the extracted features through layer-by-layer computation and mapping, aggregate all local features, and pass the neuron activations produced by the excitation function to the output layer;
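The twin (Siamese) comparison reduces to embedding both texts with shared weights and thresholding a similarity score; a minimal sketch, assuming a hypothetical sentence-embedding function encode and an illustrative threshold of 0.8 (the claim only says "greater than a threshold"):

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def is_suspect(text: str, dictionary: list, encode,
                   threshold: float = 0.8) -> bool:
        v = encode(text)  # the same encoder (shared weights) on both branches
        return any(cosine(v, encode(entry)) > threshold for entry in dictionary)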
in step S7, the video voice information auditing module comprises two parts, audio transcription and illegal word detection: the audio transcription part transcribes the audio file denoised in step S3 into text, and the illegal word detection part from step S6 checks whether illegal words occur in the text transcribed from the video's audio.
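Step S7 then chains a speech-to-text front end to the step-S6 text check; in this sketch transcribe stands in for any ASR engine, and is_suspect is reused from the sketch above:

    def audit_audio(audio_path: str, transcribe, terror_dict: list,
                    custom_dict: list, encode, threshold: float = 0.8) -> bool:
        text = transcribe(audio_path)   # transcription of the denoised audio file
        return (is_suspect(text, terror_dict, encode, threshold)
                or is_suspect(text, custom_dict, encode, threshold))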
4. A system implementing the automatic auditing method for product display videos according to any one of claims 1-3, characterized by comprising, connected in sequence: a video basic information module, a video quality analysis module, a video picture and voice noise-reduction and splitting module, a video picture infringement information auditing module, a video picture yellow-related information auditing module, a video picture text information auditing module, and a video voice information auditing module;
the video basic information module detects the basic information of the video, reading its resolution, frame rate, duration, and storage size;
the video quality analysis module detects the jitter and noise conditions of the video;
the video picture and voice noise-reduction and splitting module splits the video into a picture set and denoises the voice and stores it as an audio file;
the video picture infringement information auditing module detects whether infringement information exists in the video pictures;
the video picture yellow-related information auditing module detects whether yellow-related information exists in the video pictures;
the video picture text information auditing module detects whether the video pictures contain violent-and-terrorist slogans or other illegal text information; it comprises four sub-modules: a text detection sub-module, a text recognition sub-module, a violent-and-terrorist slogan comparison sub-module, and a user-defined illegal text comparison sub-module; the text detection sub-module detects whether a picture contains text, the text recognition sub-module recognizes the detected text, the violent-and-terrorist slogan comparison sub-module detects whether the text contains violent-and-terrorist slogans, and the user-defined illegal text comparison sub-module detects whether the text contains user-defined illegal text information;
the video voice information auditing module detects whether the video voice contains violent-and-terrorist slogans or other illegal voice information; it comprises three sub-modules: a voice transcription sub-module, a violent-and-terrorist slogan comparison sub-module, and a user-defined illegal text comparison sub-module; the voice transcription sub-module transcribes the audio file into text, the violent-and-terrorist slogan comparison sub-module detects whether the transcribed text contains violent-and-terrorist slogans, and the user-defined illegal text comparison sub-module detects whether the transcribed text contains user-defined illegal text information.
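As an illustration of the claimed module chain (not the patented implementation), the modules connected in sequence can be modelled as callables feeding a shared report; every name below is hypothetical:

    def run_pipeline(video_path: str, modules) -> dict:
        # modules: basic info -> quality -> split/denoise -> infringement ->
        # yellow-related -> picture text -> voice audits, in claim order.
        report = {"video": video_path}
        for module in modules:
            report.update(module(report))
            if report.get("reject"):   # stop once any module rejects the video
                break
        return report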
CN202111174478.2A 2021-10-09 2021-10-09 Automatic auditing method and system for product display video Pending CN113920085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111174478.2A CN113920085A (en) 2021-10-09 2021-10-09 Automatic auditing method and system for product display video


Publications (1)

Publication Number Publication Date
CN113920085A true CN113920085A (en) 2022-01-11

Family

ID=79238506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111174478.2A Pending CN113920085A (en) 2021-10-09 2021-10-09 Automatic auditing method and system for product display video

Country Status (1)

Country Link
CN (1) CN113920085A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114446331A (en) * 2022-04-07 2022-05-06 深圳爱卓软科技有限公司 Video editing software system capable of rapidly cutting video
CN114446331B (en) * 2022-04-07 2022-06-24 深圳爱卓软科技有限公司 Video editing software system capable of rapidly cutting video
CN114677650A (en) * 2022-05-25 2022-06-28 武汉卓鹰世纪科技有限公司 Intelligent analysis method and device for pedestrian illegal behaviors of subway passengers
CN115297360A (en) * 2022-09-14 2022-11-04 百鸣(北京)信息技术有限公司 Intelligent auditing system for multimedia software video uploading
CN115834935A (en) * 2022-12-21 2023-03-21 阿里云计算有限公司 Multimedia information auditing method, advertisement auditing method, equipment and storage medium
CN115994772A (en) * 2023-02-22 2023-04-21 中信联合云科技有限责任公司 Book data processing method and system, book rapid goods laying method and electronic equipment
CN115994772B (en) * 2023-02-22 2024-03-08 中信联合云科技有限责任公司 Book data processing method and system, book rapid goods laying method and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination