CN106503610B - Video identification method and device - Google Patents

Video identification method and device

Info

Publication number
CN106503610B
Authority
CN
China
Prior art keywords
pornographic
video
frame
probability value
identified
Prior art date
Legal status
Active
Application number
CN201510568027.5A
Other languages
Chinese (zh)
Other versions
CN106503610A (en)
Inventor
毛锋
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201510568027.5A priority Critical patent/CN106503610B/en
Publication of CN106503610A publication Critical patent/CN106503610A/en
Application granted granted Critical
Publication of CN106503610B publication Critical patent/CN106503610B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The application provides a video identification method and a video identification device. The method comprises the following steps: dividing a video to be identified into a plurality of intervals; performing multi-level detection on the plurality of intervals respectively, acquiring pornographic statistical indicators of the video to be identified during the multi-level detection of each interval, and judging whether the video to be identified is a pornographic video according to the pornographic statistical indicators. The video identification method reduces the time consumed by identification and improves identification speed and accuracy.

Description

Video identification method and device
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video identification method and apparatus.
Background
Pornographic content, as harmful information on the internet, needs to be strictly controlled, and video, as a major part of internet traffic, needs to be effectively identified and controlled. At present, the methods for identifying pornographic content mainly include the following:
1) Identification based on feature-library matching. Feature data corresponding to pictures of sensitive human body parts that match the image to be recognized are searched for in a pre-stored human body feature library, the confidence of the image to be recognized is then calculated from the matched feature data, and whether the image is a pornographic image is judged according to the confidence. The recall and accuracy of this method depend on the completeness of the human body feature library and on the matching rules; the method is sensitive to body pose, occlusion and illumination, and the recognition rate is particularly low for images that are not fully exposed.
2) Identification based on a skin color model or a sensitive-region model. A color model of human skin, or of sensitive human organs, is trained from the skin color regions or sensitive organs of a large number of sample images, and the skin regions or sensitive organs in the image to be identified are then detected with the model. The method is sensitive to illumination and to differences in skin tone across races, and its misjudgment rate is high for normally exposed body parts such as the face and arms. In addition, the sensitive-part model is sensitive to occlusion, and both feature extraction and recognition are slow.
3) Identification based on a convolutional neural network. A final detection model is obtained by iteratively training on a large number of samples, and the image frames in the video to be recognized are recognized by the detection model.
Although the recognition method based on the convolutional neural network can overcome problems such as diverse ambient lighting and diverse human races, this method and methods 1) and 2) are only image-level recognition methods. Currently, a video can be identified by extracting its key frames and then applying one or more image-level processing methods to each key frame. However, key frame extraction in this approach does not consider the temporal continuity of the images in the video, the decision model is simple, the discrimination of the classification feature values is low, and the precision is not ideal; in addition, the whole video needs to be decoded, which involves a large amount of redundant computation, so the identification speed is also not ideal.
Disclosure of Invention
The present application aims to address the above technical problem, at least to some extent.
Therefore, a first objective of the present application is to provide a video identification method, which reduces the time consumption for identification and improves the identification speed and accuracy.
A second object of the present application is to provide a video recognition apparatus.
To achieve the above object, according to a first aspect of the present application, a video recognition method is provided, including the following steps: dividing a video to be identified into a plurality of intervals; and respectively carrying out multi-level detection on the plurality of intervals, acquiring the pornographic statistical indexes of the video to be identified in the process of carrying out multi-level detection on each interval, and judging whether the video to be identified is the pornographic video or not according to the pornographic statistical indexes.
The video identification method of the embodiment of the present application divides the video to be identified into a plurality of intervals, performs multi-level detection, and judges whether the video to be identified is a pornographic video according to the pornographic statistical indicators obtained during detection. Through multi-level detection, the visual similarity and temporal continuity between adjacent image frames are exploited, so the obtained pornographic statistical indicators are robust. Moreover, the pornographic statistical indicators can be updated as detection progresses and used for judgment, so the judgment result can be returned early and the whole video to be identified does not need to be decoded and detected. This reduces the time consumed by identification and improves identification speed and accuracy.
The embodiment of the second aspect of the present application provides a video identification device, including: the dividing module is used for dividing the video to be identified into a plurality of intervals; the detection module is used for respectively carrying out multi-level detection on the plurality of intervals and acquiring the pornographic statistical index of the video to be identified in the process of carrying out multi-level detection on each interval; and the judging module is used for judging whether the video to be identified is the pornographic video or not according to the pornographic statistical indexes in the process of carrying out multi-level detection on each interval.
The video identification device of the embodiment of the present application divides the video to be identified into a plurality of intervals, performs multi-level detection, and judges whether the video to be identified is a pornographic video according to the pornographic statistical indicators obtained during detection. Through multi-level detection, the visual similarity and temporal continuity between adjacent image frames are exploited, so the obtained pornographic statistical indicators are robust. Moreover, the pornographic statistical indicators can be updated as detection progresses and used for judgment, so the judgment result can be returned early and the whole video to be identified does not need to be decoded and detected. This reduces the time consumed by identification and improves identification speed and accuracy.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow diagram of a video identification method according to one embodiment of the present application;
FIG. 2 is a schematic diagram of a multi-level structure according to one embodiment of the present application;
fig. 3 is a flowchart illustrating acquiring pornographic statistical indicators of a video to be identified in a process of performing multi-level detection on each interval according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a process of performing a shot-level pornography identification for a shot in which a frame to be identified is located according to an embodiment of the present application;
FIG. 5 is a flowchart of segment-level pornography identification of a segment to be detected, according to one embodiment of the present application;
FIG. 6 is a flow chart of acquiring a frame difference between two image frames according to one embodiment of the application;
fig. 7 is a flowchart of performing a shot-level pornography identification on a shot in which a frame to be identified is located according to another embodiment of the present application;
fig. 8 is a flowchart illustrating acquiring a pornographic statistical indicator of a video to be identified and determining whether the video to be identified is a pornographic video in a process of performing multi-level detection on each interval according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a video recognition device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
A video recognition method and apparatus according to an embodiment of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a video identification method according to an embodiment of the present application.
As shown in fig. 1, a video identification method according to an embodiment of the present application includes:
s101, dividing a video to be identified into a plurality of intervals.
In one embodiment of the present application, the video to be identified may be divided into a plurality of intervals according to its duration. The present application does not limit the division rule. For example, the video to be recognized may be divided evenly into intervals of equal length according to its duration, or into intervals whose duration equals a preset duration; when the duration of the video is not an integer multiple of the preset duration, the remaining part may either form an interval on its own or be merged into an adjacent interval.
In one embodiment of the present application, the intervals may each have their own processing priority, which may be set when the intervals are divided. The higher the processing priority of an interval, the earlier it is processed in the subsequent multi-level detection. The processing priority may be set according to historical experience. For example, analysis of a large number of pornographic videos shows that most of the sensitive content is located in the middle and rear part of the video, so the intervals in the middle and rear part of the video to be identified may be given a higher processing priority.
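As an illustration of the interval division and priority setting described above, the following is a minimal Python sketch; the function name split_into_intervals, the parameters preset_duration and min_tail, and the concrete priority values are assumptions made for illustration and are not specified by this application.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    start: float      # interval start time, in seconds
    end: float        # interval end time, in seconds
    priority: int     # larger value = processed earlier

def split_into_intervals(video_duration, preset_duration=60.0, min_tail=10.0):
    """Split a video into intervals of roughly preset_duration seconds.

    A short remainder (< min_tail seconds) is merged into the previous interval,
    otherwise it becomes an interval of its own.
    """
    intervals = []
    t = 0.0
    while t < video_duration:
        end = min(t + preset_duration, video_duration)
        intervals.append(Interval(start=t, end=end, priority=0))
        t = end
    # merge a too-short tail interval into its neighbour
    if len(intervals) >= 2 and intervals[-1].end - intervals[-1].start < min_tail:
        tail = intervals.pop()
        intervals[-1].end = tail.end
    # heuristic priority: intervals in the middle and rear part of the video first
    for iv in intervals:
        centre = (iv.start + iv.end) / 2.0
        iv.priority = 2 if 0.4 * video_duration <= centre <= 0.9 * video_duration else 1
    return intervals
```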
S102, respectively carrying out multi-level detection on a plurality of intervals, acquiring a pornographic statistic index of the video to be identified in the process of carrying out multi-level detection on each interval, and judging whether the video to be identified is the pornographic video or not according to the pornographic statistic index.
The multi-level detection refers to detecting the video to be identified according to the multi-level structure of the video. Specifically, as shown in fig. 2, the video may include, from top to bottom:
First level: interval level (the i-th interval is denoted range[i] in FIG. 2). That is, the video is divided into a plurality of intervals according to its duration, which may be implemented by S101.
Second level: shot level (the i-th shot is denoted shot[i] in FIG. 2). A shot is a video processing unit at the sub-semantic level: a collection of temporally successive and visually similar frames. When pornography identification is performed within a shot, pornographic probability values are counted and durations are accumulated with the shot as the unit.
Third level: segment level (the i-th segment is denoted segment[i] in FIG. 2). Since an overly long shot would lead to a long processing time, a long shot may be divided into segments of a certain duration (i.e., a third time, which may for example be 15 seconds). When pornography identification is performed within a segment, the statistical probability value of the segment is accumulated into that of the shot to which it belongs.
Fourth level: frame level (the i-th image frame is denoted frame[i] in FIG. 2). A complete image frame is obtained by decoding the video, the image frame is identified by a pornographic-image identification algorithm based on a deep convolutional neural network, and the obtained pornographic probability value is accumulated into the segment or shot statistics.
The division of shots within an interval and of segments within a shot is generated during the detection process, as the shot is identified frame by frame.
Specifically, shots are divided by calculating, when identifying an image frame, the frame difference between the frame and its adjacent image frame; if the frame difference is 1, the two image frames belong to two different shots.
Segments within a shot are divided as follows: during identification of the shot, the durations of the image frames are accumulated starting from the first image frame of the shot; when the accumulated duration reaches the set segment duration, one segment is produced and the next segment is started. The first segment of a shot begins at the initial frame of the shot, and the last segment ends at the end frame of the shot.
In an embodiment of the present application, when performing multi-level detection on the plurality of intervals, the intervals may be divided into N interval groups, and the N interval groups are detected by N parallel threads respectively, where each thread performs multi-level detection on the intervals of its interval group in order of processing priority, and N is a positive integer.
It should be understood that in another embodiment of the present application, the intervals need not be grouped; the plurality of intervals can be detected in parallel directly by multiple parallel threads according to their processing priorities.
In this way, the intervals can undergo multi-level detection through multiple parallel threads according to a certain strategy, hardware computing resources can be fully utilized, and the multi-threading technique converts hardware resources into a speed gain, thereby improving video identification efficiency.
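A minimal sketch of the grouping and multi-threaded detection strategy is given below, assuming a detect_interval routine that carries out the multi-level detection of one interval and a shared stats dictionary holding the pornographic statistical indicators; the names and the round-robin grouping are illustrative assumptions, not part of the application.

```python
import threading

def detect_interval_group(group, stats, stop_event, detect_interval):
    """One worker thread: process its interval group in priority order."""
    for interval in sorted(group, key=lambda iv: iv.priority, reverse=True):
        if stop_event.is_set():          # another thread already reached a verdict
            return
        detect_interval(interval, stats, stop_event)

def detect_video(intervals, stats, detect_interval, n_threads=4):
    # round-robin assignment of intervals to N groups
    groups = [intervals[i::n_threads] for i in range(n_threads)]
    stop_event = threading.Event()       # set as soon as the video is judged pornographic
    threads = [threading.Thread(target=detect_interval_group,
                                args=(g, stats, stop_event, detect_interval))
               for g in groups if g]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # note: a real implementation would guard updates to stats with a lock
    return stats
```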
In an embodiment of the present application, the pornographic statistical indicators may include one or more of the total pornographic duration, the cumulative pornographic probability value, the number of non-pornographic occurrences, the total shot processing number, the total pornographic frame number, the total frame processing number, the video duration and the frame rate of the video to be identified. These indicators can be analyzed by a pre-trained support vector machine discriminant model to judge whether the video to be identified is a pornographic video.
The cumulative pornographic probability value is the sum of the pornographic probability values of the identified shots whose shot duration is not less than the first time and whose pornographic probability value is greater than the second probability threshold; it is accumulated gradually as shots are detected.
The number of non-pornographic occurrences is the number of times an image frame with a pornographic probability value not greater than the first probability threshold is identified during detection.
The total pornographic frame number is the total number of pornographic image frames identified during detection, a pornographic image frame being an image frame whose pornographic probability value is greater than the fourth probability threshold. In addition, if the average pornographic probability value of the image frames in an image frame set is greater than the fourth probability threshold, the image frames in the set are judged to be pornographic image frames.
The total shot processing number is the total number of shots subjected to shot-level pornography identification during detection.
The total frame processing number is the total number of image frames subjected to pornography identification during detection.
The total pornographic duration is the sum of the shot durations of the identified shots whose shot duration is not less than the first time and whose pornographic probability value is greater than the second probability threshold.
It should be understood that, among the above pornographic statistical indicators, the video duration and the frame rate are inherent attributes of the video to be identified, whereas the total pornographic duration, the cumulative pornographic probability value, the number of non-pornographic occurrences, the total shot processing number, the total pornographic frame number and the total frame processing number change continuously as detection proceeds. Therefore, whenever an indicator is updated during detection, whether the video to be identified is a pornographic video can be judged accordingly.
In the embodiment of the present application, a large number of manually labeled sample videos (including pornographic videos and non-pornographic videos) are run through the detection process to obtain their statistical indicators, and a Support Vector Machine (SVM) is then trained on the obtained statistical indicators to establish the discriminant model.
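The following is a hedged sketch of how such a discriminant model could be trained on the eight indicators listed above, assuming scikit-learn's SVC as the SVM implementation (the application only specifies a support vector machine, not a particular library); the feature ordering and kernel choice are illustrative.

```python
import numpy as np
from sklearn.svm import SVC  # assumed implementation choice, not specified by the application

FEATURES = ["total_porn_duration", "cumulative_porn_probability", "non_porn_count",
            "total_shots_processed", "total_porn_frames", "total_frames_processed",
            "video_duration", "frame_rate"]

def to_vector(stats):
    """Turn one video's indicator dictionary into a fixed-order feature vector."""
    return np.array([stats[name] for name in FEATURES], dtype=float)

def train_discriminant_model(sample_stats, labels):
    """sample_stats: list of per-video indicator dicts from labelled sample videos;
    labels: 1 for pornographic, 0 for non-pornographic."""
    X = np.stack([to_vector(s) for s in sample_stats])
    model = SVC(kernel="rbf")
    model.fit(X, labels)
    return model

def judge_video(model, stats):
    """Apply the trained discriminant model to one video's current indicators."""
    return bool(model.predict(to_vector(stats).reshape(1, -1))[0])
```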
It should be understood that, in the embodiment of the present application, the multi-level detection may combine two or more of the above four levels, for example interval-level detection -> shot-level detection -> segment-level detection -> frame-level detection, or interval-level detection -> shot-level detection -> frame-level detection, or interval-level detection -> frame-level detection, and so on.
Specifically, interval-level detection -> shot-level detection -> segment-level detection -> frame-level detection may be as shown in figs. 3, 4 and 5; interval-level detection -> shot-level detection -> frame-level detection may be as shown in figs. 3 and 7; and interval-level detection -> frame-level detection may be as shown in fig. 8.
The process of performing multi-level probing on each interval according to the present application is described below with reference to fig. 3 to 8.
Fig. 3 is a flowchart illustrating acquiring pornographic statistical indicators of a video to be identified in a process of performing multi-level detection on each interval according to an embodiment of the present application. As shown in fig. 3, the acquiring of the pornographic statistical indicator of the video to be identified in the process of performing multi-level detection on each interval includes the following steps:
and S1, taking the first image frame of the current detection interval as a frame to be identified.
And S2, performing pornographic identification on the frame to be identified to obtain the pornographic probability value of the frame to be identified, and adding 1 to the total processing number of the current frame.
Specifically, the frame to be recognized may be decoded first, and pornographic recognition may be performed on the frame to be recognized after decoding. In the embodiment of the present application, the first image frame of the current detection interval is taken as the frame to be identified.
In the embodiment of the application, pornographic recognition can be carried out on the frame image to be recognized based on the deep convolutional neural network so as to obtain the pornographic probability value of the frame to be recognized. The recognition method based on the deep convolutional neural network can recognize the image frame according to the information of the color, the shape, the texture, the structure and the like of the image frame, avoids the complex processes of feature extraction and data reconstruction in the traditional recognition algorithm, can directly serve as the input of the deep convolutional neural network, and has high invariance to the translation, the scaling, the inclination or the deformation in other forms of the image. According to the experimental statistics result, in the embodiment of the application, the recall rate and the precision of the identification method based on the deep convolutional neural network can reach 90% and 95% respectively, so that the identification precision and the recall rate of the single-frame image are greatly improved by the method.
S3, if the pornographic probability value of the frame to be identified is greater than the first probability threshold, shot-level pornography identification is performed on the shot where the frame to be identified is located, so as to obtain the pornographic probability value and the pornographic duration of that shot.
In one embodiment of the present application, the first probability threshold is used to judge a single image frame: if the pornographic probability value of an image frame is greater than the first probability threshold, the image frame is a suspicious pornographic frame. Therefore, if the pornographic probability value of the frame to be recognized is greater than the first probability threshold, the embodiment shown in fig. 4 can be used to further perform shot-level pornography recognition on the shot where the frame to be recognized is located, so as to obtain the pornographic probability value and pornographic duration of that shot for subsequent judgment and identification.
S4, if the pornographic probability value of the shot where the frame to be recognized is located is greater than the second probability threshold, it is further judged whether the pornographic duration of that shot is less than the first time.
S5, if it is less than the first time, the shot where the frame to be identified is located is merged with an adjacent similar shot so that the duration of the merged shot is not less than the first time.
A shot whose duration is less than the first time is a short shot. Because short shots have lower confidence, such shots can be merged. The first time may be, for example, 2 seconds.
Specifically, the shot merging rule is that adjacent pornographic shots are merged and adjacent non-pornographic shots are merged. That is, if the current shot is a short pornographic shot and the preceding adjacent shot is also a pornographic shot, the current shot is merged into the preceding shot; if the current shot is a short pornographic shot and the preceding adjacent shot is not a pornographic shot, the shots are not merged. Similarly, if the current shot is a short non-pornographic shot and the preceding adjacent shot is also non-pornographic, the current shot is merged into the preceding shot. If the current shot is not a short shot but the preceding adjacent shot is a short shot, merging is also attempted. During merging, the shot durations are added and the pornographic probability value scores of the shots are added.
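The merging rule just described might be sketched as follows; the dictionary keys, the helper name maybe_merge and the FIRST_TIME value of 2 seconds (the example value given above) are illustrative assumptions.

```python
FIRST_TIME = 2.0  # seconds; example value of the "first time" given in the text

def is_porn_shot(shot, second_threshold):
    return shot["porn_prob"] > second_threshold

def maybe_merge(prev_shot, cur_shot, second_threshold):
    """Merge cur_shot into prev_shot when either of them is a short shot
    (duration < FIRST_TIME) and both are pornographic or both non-pornographic."""
    short = cur_shot["duration"] < FIRST_TIME or prev_shot["duration"] < FIRST_TIME
    same_kind = is_porn_shot(prev_shot, second_threshold) == is_porn_shot(cur_shot, second_threshold)
    if short and same_kind:
        prev_shot["duration"] += cur_shot["duration"]
        prev_shot["porn_prob"] += cur_shot["porn_prob"]   # scores are added, per the merging rule
        return True   # merged; cur_shot is absorbed
    return False      # kept as separate shots
```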
S6, if it is not less than the first time, the pornographic duration of the shot where the frame to be identified is located is accumulated into the current total pornographic duration to update the total pornographic duration.
In addition, the pornographic probability value of the shot where the frame to be identified is located can be accumulated into the current cumulative pornographic probability value to update the cumulative pornographic probability value.
S7, if the total pornographic duration is greater than the second time, the video to be identified is judged to be a pornographic video.
S8, if the total pornographic duration is not greater than the second time, the next image frame after the frame to be identified is taken as the new frame to be identified, and S2-S8 are repeated until detection of the current detection interval is completed.
Wherein the first time is less than the second time.
The total pornographic duration is the sum of the pornographic durations over all intervals detected so far, and during interval detection it changes in real time as detection proceeds. Specifically, each time a shot has been identified, its pornographic duration is accumulated into the current total pornographic duration, so that the total pornographic duration is updated.
If the total pornographic duration is greater than the second time, the video to be identified can be judged to be a pornographic video; otherwise, detection continues until the total pornographic duration exceeds the second time and the video is judged pornographic, or until all intervals have been detected and the video is judged non-pornographic.
S9, if the pornographic probability value of the frame to be identified is not greater than the first probability threshold, the number of non-pornographic occurrences is increased by 1 and it is judged whether this number is greater than the count threshold.
S10, if the number of non-pornographic occurrences is greater than the count threshold, the shot where the frame to be identified is located is judged to be a non-pornographic shot, the number of non-pornographic shots is increased by 1, detection of the current detection interval ends, and the interval following the current detection interval is detected.
If the pornographic probability value of the frame to be identified is not greater than the first probability threshold, the frame is a non-pornographic frame and no further shot-level identification is needed; once non-pornographic frames have occurred too many times, the next interval can be detected directly. In this way the number of shot-level identifications is controlled and identification efficiency is improved.
S11, if the number of non-pornographic occurrences is not greater than the count threshold, the next image frame after the frame to be identified is taken as the new frame to be identified, and S2-S11 are repeated until detection of the current detection interval is completed.
Thus, detection of one interval is completed.
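Putting S1-S11 together, the interval-level loop could look roughly like the following sketch. first_frame_of, decode, next_frame, frame_porn_prob and detect_shot stand for the decoding, single-frame CNN scoring and shot-level identification described elsewhere in this application, and the attribute names on params are placeholders, not terms from the application.

```python
def detect_interval(interval, stats, stop_event, params):
    """Interval-level loop of S1-S11 (sketch)."""
    frame = first_frame_of(interval)            # S1: assumed helper, None when the interval is empty
    non_porn_count = 0
    while frame is not None and not stop_event.is_set():
        p = frame_porn_prob(decode(frame))      # S2: single-frame score from the deep CNN
        stats["total_frames_processed"] += 1
        if p > params.first_threshold:          # S3: suspicious frame, go to shot level
            shot_prob, shot_dur = detect_shot(frame, params, stats)
            if shot_prob > params.second_threshold and shot_dur >= params.first_time:
                stats["total_porn_duration"] += shot_dur          # S6
                stats["cumulative_porn_probability"] += shot_prob
            # shots shorter than first_time are handled by the merging rule of S5 (omitted here)
            if stats["total_porn_duration"] > params.second_time:
                stop_event.set()                # S7: early verdict, the video is pornographic
                return True
        else:                                   # S9: non-pornographic frame
            non_porn_count += 1
            if non_porn_count > params.count_threshold:
                return False                    # S10: give up on this interval, move to the next
        frame = next_frame(frame)               # S8 / S11: assumed to return None past the interval end
    return False
```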
In the embodiment of the present application, since the intervals can be detected in parallel by multiple threads, with each thread processing one interval sequence, a thread that has finished detecting one interval can continue with the next interval in its sequence, so that the video identification process proceeds in an orderly and efficient manner.
Fig. 4 is a flowchart illustrating a shot-level pornography identification of a shot in which a frame to be identified is located according to an embodiment of the present application. As shown in fig. 4, the process of performing the lens-level pornography recognition on the lens where the frame to be recognized is located specifically includes the following steps:
S301, it is judged whether the frame to be identified is the last image frame of the video to be identified.
If so, the current shot pornographic probability value and shot pornographic duration are respectively taken as the pornographic probability value and pornographic duration of the shot where the frame to be identified is located.
S302, if not, it is further judged whether the number of identified image frames in the shot where the frame to be identified is located is greater than a preset number.
S303, if the number of identified image frames is greater than the preset number, the subsequent image frames are searched until the shot boundary is reached, and the image frame following the boundary is taken as the new frame to be identified.
Specifically, during pornography identification of one shot, each time an image frame is recognized the number of recognized image frames is increased by 1, so that it is updated in real time. When the number of recognized image frames in the shot exceeds the preset number, the subsequent image frames can be searched directly to find the shot boundary, and identification of the shot where the frame to be identified is located ends. At this point, the image frame following the found shot boundary is taken as the frame to be recognized.
S304, if the number of identified image frames is not greater than the preset number, the frame to be identified is buffered into the segment to be detected.
Wherein, the third time may be 15 seconds.
S305, whether the frame to be identified has reached a shot boundary is detected.
In a specific embodiment of the present application, S305 may specifically include: calculating the frame difference between the frame to be identified and its next image frame by the color statistical histogram method; if the frame difference is 1, judging that the frame to be identified has reached a shot boundary; and if the frame difference is 0, judging that the frame to be identified has not reached a shot boundary.
S306, if the frame to be recognized has not reached a shot boundary, it is judged whether the total duration of the image frames in the segment to be detected has reached the third time.
S307, if the total duration of the image frames in the segment to be detected has not reached the third time, the next frame after the frame to be identified is taken as the new frame to be identified, and S303-S307 are repeated until the total duration of the image frames in the segment to be detected reaches the third time.
The third time is used to control the length of a segment. When the total duration of the image frames in the segment to be detected reaches the third time, or the frame to be identified reaches a shot boundary, the image frames buffered in the segment to be detected form a segment. Segment-level pornography identification can then be performed on the segment to be detected to obtain its pornographic probability value and pornographic duration.
That is, an empty set may be created as the segment to be detected when shot-level recognition starts, and adjacent image frames are then added to it in sequence until the total duration of the image frames in the segment reaches the third time, at which point the segment to be detected is complete. In this way the length of a segment is controlled, and segments are generated dynamically during recognition.
S308, if the frame to be recognized has reached a shot boundary, or the total duration of the image frames in the segment to be detected has reached the third time, segment-level pornography recognition is performed on the segment to be detected to obtain its pornographic probability value and pornographic duration.
Specifically, segment-level pornography recognition can be performed on the segment to be detected through the embodiment shown in FIG. 5.
S309, the pornographic probability value of the segment to be detected is accumulated into the current shot pornographic probability value, and the pornographic duration of the segment to be detected is accumulated into the current shot pornographic duration, so as to update the current shot pornographic probability value and shot pornographic duration respectively.
S310, if the current shot pornographic probability value is greater than the third probability threshold, the shot pornographic probability value and shot pornographic duration are respectively taken as the pornographic probability value and pornographic duration of the shot where the frame to be identified is located.
The third probability threshold is used to judge a shot: if the pornographic probability value of the shot is greater than the third probability threshold, the shot can be judged to be a pornographic shot. Therefore, if the current shot pornographic probability value is greater than the third probability threshold, the shot pornographic probability value and shot pornographic duration can be respectively taken as the pornographic probability value and pornographic duration of the shot where the frame to be identified is located.
S311, if the current shot pornographic probability value is not greater than the third probability threshold, whether the frame to be identified has reached a shot boundary is detected.
S312, if the frame to be recognized has reached a shot boundary, or the frame to be recognized was judged in S301 to be the last image frame of the video to be recognized, or after the shot boundary has been searched in S303, the shot pornographic probability value and shot pornographic duration are respectively taken as the pornographic probability value and pornographic duration of the shot where the frame to be recognized is located.
Specifically, if it is determined in S311 that a shot boundary has been reached, or a shot boundary has been searched in S303, it is necessary to return to interval-level detection to trigger shot-level pornography identification of the next shot; therefore, the obtained shot pornographic probability value and shot pornographic duration can be respectively taken as the pornographic probability value and pornographic duration of the shot where the frame to be identified is located.
If the frame to be recognized was judged in S301 to be the last image frame of the video to be recognized, detection of the video to be recognized ends; the shot pornographic probability value and shot pornographic duration can be respectively taken as the pornographic probability value and pornographic duration of the shot where the frame to be recognized is located, so that the total pornographic duration and the cumulative pornographic probability value can be further obtained and input to the discriminant model to identify and judge the video to be recognized.
S313, if the frame to be recognized has not reached a shot boundary, the next image frame after the frame to be recognized is taken as the new frame to be recognized, and S301-S313 are repeated.
Thus, the process of shot-level pornography recognition is completed.
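A rough sketch of the shot-level loop of FIG. 4 follows (simplified: the shot merging of FIG. 3 is omitted). frame_difference, detect_segment, next_frame, decode and skip_to_shot_boundary are placeholders for the operations described above, and max_frames_per_shot corresponds to the preset number of S302; none of these names come from the application.

```python
def detect_shot(start_frame, params, stats):
    """Shot-level identification of FIG. 4 (sketch), returning (shot_porn_prob, shot_porn_dur)."""
    shot_prob, shot_dur = 0.0, 0.0
    segment = []                                 # buffered frames of the segment to be detected
    frame, frames_seen = start_frame, 0
    stats["total_shots_processed"] += 1
    while frame is not None:
        frames_seen += 1
        if frames_seen > params.max_frames_per_shot:          # S302: cap the work per shot
            skip_to_shot_boundary(frame)                      # S303: jump ahead to the boundary
            break
        segment.append(frame)                                 # S304: buffer into the segment
        nxt = next_frame(frame)
        at_boundary = nxt is None or frame_difference(decode(frame), decode(nxt)) == 1   # S305
        seg_full = sum(f.duration for f in segment) >= params.third_time                 # S306
        if at_boundary or seg_full:                           # S308: evaluate the buffered segment
            seg_prob, seg_dur = detect_segment(segment, params)
            shot_prob += seg_prob                             # S309: accumulate into shot statistics
            shot_dur += seg_dur
            segment = []
            if shot_prob > params.third_threshold or at_boundary:    # S310 / S311
                break
        frame = nxt                                           # S307 / S313
    return shot_prob, shot_dur
```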
Fig. 5 is a flowchart of segment-level pornography identification of a segment to be detected according to an embodiment of the present application. As shown in fig. 5, the segment-level pornography identification of the segment to be detected specifically includes the following steps:
S501, the image frames in the segment to be detected are sampled according to a preset rule, and the sampled image frames form a third initial evaluation set.
Specifically, the segment to be detected is a temporally continuous sequence of image frames, and the sequence may be sampled at equal intervals according to a preset sampling rate to obtain the third initial evaluation set. For example, if there are 100 image frames in the segment to be detected and the preset sampling rate is 50%, every other image frame may be sampled, so that 50 image frames are obtained for the third initial evaluation set.
S502, pornographic identification is carried out on the third initial evaluation set to obtain the pornographic probability value of each image frame in the third initial evaluation set.
In an embodiment of the present application, the image frames in the third initial evaluation set may be subjected to frame-level identification in parallel through multiple threads, that is, pornographic identification is performed on each image frame through the deep convolutional neural network, so as to obtain a pornographic probability value of each image frame in the third initial evaluation set.
S503, calculating the average pornographic probability value of the image frames in the third initial evaluation set according to the pornographic probability value of each image frame in the third initial evaluation set, and determining the image frame with the minimum pornographic probability value in the third initial evaluation set.
S504, whether the average pornographic probability value of the image frames in the third initial evaluation set is larger than a fourth probability threshold value is judged.
And S505, if the average pornographic probability value of the image frames in the third initial evaluation set is greater than a fourth probability threshold, taking the average pornographic probability value of the image frames in the third initial evaluation set as the pornographic probability value of the to-be-detected segment, wherein the fourth probability threshold is greater than the first probability threshold.
If the pornographic probability value of an image frame is greater than the fourth probability threshold, the image frame can be judged to be a pornographic frame. If the average pornographic probability value of the image frames in the third initial evaluation set is greater than the fourth probability threshold, this indicates that the image frames in the third initial evaluation set are pornographic frames. In this case, the segment-level pornography identification process can end, and the average pornographic probability value is taken as the pornographic probability value of the segment to be detected.
S506, if the average pornographic probability value of the image frames in the third initial evaluation set is not larger than the fourth probability threshold, respectively calculating the frame difference between each image frame which is not sampled in the segment to be detected and the image frame with the minimum pornographic probability value, and adding the image frame which is not sampled and has the frame difference of 1 to the third initial evaluation set to obtain a third target evaluation set.
S507, pornographic identification is performed on the third target evaluation set to obtain the pornographic probability value of each image frame in the third target evaluation set.
Likewise, pornography recognition may be performed on images in the third set of target evaluations based on a deep convolutional neural network.
S508, the average pornographic probability value of the image frames in the third target evaluation set is calculated from the pornographic probability value of each image frame in the third target evaluation set, and this average is taken as the pornographic probability value of the segment to be detected.
Here, frame-level identification can again be performed in parallel: the pornographic probability value of each image frame is obtained and then averaged to give the average pornographic probability value of the image frames in the third target evaluation set. The two-pass evaluation and the parallel processing ensure ideal processing speed and precision. The processing flow is shown in fig. 5.
The frame difference between an image frame added to the third target evaluation set and the image frame with the minimum pornographic probability value is 1, which means that such frames differ greatly from the frame least likely to be pornographic and therefore have a relatively high probability of being pornographic. The third target evaluation set thus largely compensates for the undersampling of the third initial evaluation set and improves the sensitivity to pornographic content.
Thus, segment-level pornography recognition is completed. In the segment-level pornography identification process, parallel frame-level identification and the second-pass evaluation further improve identification speed and precision.
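The sampling-plus-second-pass evaluation of S501-S508 could be sketched as follows; the same two-pass helper is reused for the shot-level (FIG. 7) and interval-level (FIG. 8) variants later in this description. frame_porn_prob, decode and frame_difference are placeholders for the single-frame CNN scoring and the histogram-based frame difference, and the attribute names on params are illustrative assumptions.

```python
def two_pass_average_score(frames, sample_rate, prob_threshold):
    """Two-pass evaluation: score a uniformly sampled subset first; if the average is
    not conclusive, add the unsampled frames that differ strongly (frame difference 1)
    from the least-pornographic sampled frame and average again."""
    step = max(1, round(1.0 / sample_rate))
    sampled = frames[::step]                                   # e.g. every other frame at 50%
    scores = {f: frame_porn_prob(decode(f)) for f in sampled}  # may be parallelised across threads
    avg = sum(scores.values()) / len(scores)
    if avg > prob_threshold:                                   # first pass already decisive
        return avg
    weakest = min(scores, key=scores.get)                      # frame with the minimum probability
    for f in frames:                                           # second-pass candidates
        if f not in scores and frame_difference(decode(f), decode(weakest)) == 1:
            scores[f] = frame_porn_prob(decode(f))
    return sum(scores.values()) / len(scores)

def detect_segment(segment_frames, params):
    """Segment-level identification (FIG. 5): average score of the segment plus its duration."""
    avg = two_pass_average_score(segment_frames, params.sample_rate, params.fourth_threshold)
    return avg, sum(f.duration for f in segment_frames)
```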
FIG. 6 is a flow chart of acquiring a frame difference between two image frames according to one embodiment of the application. Specifically, as shown in fig. 6, acquiring a frame difference between two image frames includes the steps of:
s601, dividing two image frames into M areas according to a preset rule, and numbering the M areas respectively, wherein M is a positive integer greater than 2.
For example, each image frame is divided into a 4 × 4 grid of regions, in which case M equals 16 and the regions are numbered 0 to 15.
S602, calculating the color value of each pixel point in the two image frames.
The color value may be the value in any color space such as the RGB (Red, Green, Blue) space, the HSV (Hue, Saturation, Value) space or the YUV (luminance, chrominance) space. For example, the color value corresponding to the YUV space is the gray value.
S603, local color statistical histograms of the two image frames are respectively calculated according to color values of pixel points in each region of the two image frames.
Taking the gray value as an example, calculating the local color statistical histogram of an image frame means counting, for each region of the frame, how many pixels of the region take each gray value, which can be expressed by formula (1):
Hi(y) = |{ x in region i : I(x) = y }|    (1)
wherein i (0 ≤ i ≤ M-1) denotes the region number, y (0 ≤ y ≤ 255) denotes the gray value, and I(x) denotes the gray value of pixel x.
And S604, calculating the frame difference of the two image frames according to the local color statistical histograms of the two image frames.
Specifically, the frame difference f of two image frames can be calculated by formula (2):
f = 1, if the number of regions i whose histogram difference di exceeds LTH is greater than GTH; f = 0, otherwise    (2)
wherein di denotes the difference of the color statistical histograms of the i-th region in the two image frames, LTH denotes a local threshold and GTH denotes a global threshold, LTH and GTH both being preset.
Thus, the frame difference between the frame to be recognized and its next image frame in the embodiment shown in fig. 4 can be calculated through the embodiment shown in fig. 6 to detect shot boundaries, and the frame difference between each unsampled image frame in the segment to be detected and the image frame with the minimum pornographic probability value in the embodiment shown in fig. 5 can likewise be calculated through the embodiment shown in fig. 6 for the second-pass evaluation.
Calculating the frame difference between the frame to be identified and its next image frame by the local color histogram method to detect shot boundaries allows different shots to be detected accurately even in the presence of large local changes; the detection precision is higher, and the method matches well with the single-frame identification of the deep convolutional neural network.
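Below is a NumPy sketch of the region-wise histogram computation and frame-difference decision of formulas (1) and (2). The 4 × 4 grid is the example value from the text, while the LTH/GTH values and the exact form of the decision rule (counting the regions whose histogram difference exceeds LTH and comparing the count with GTH) are assumptions for illustration.

```python
import numpy as np

def to_gray(img):
    """Accept a 2-D gray array or an H x W x 3 color array and return gray values."""
    return img if img.ndim == 2 else img.mean(axis=2)

def local_histograms(img, grid=4):
    """Per-region, 256-bin gray-level histograms, normalised to proportions (formula (1))."""
    gray = to_gray(np.asarray(img))
    h, w = gray.shape
    hists = []
    for r in range(grid):
        for c in range(grid):
            region = gray[r * h // grid:(r + 1) * h // grid,
                          c * w // grid:(c + 1) * w // grid]
            hist, _ = np.histogram(region, bins=256, range=(0, 256))
            hists.append(hist / region.size)
    return hists

def frame_difference(img_a, img_b, lth=0.4, gth=6, grid=4):
    """Binary frame difference of two image frames (formula (2), reconstructed form).

    A region counts as changed when the L1 distance between its two histograms exceeds
    the local threshold LTH; the frames are declared different (return 1, i.e. a shot
    boundary) when more than GTH of the M = grid*grid regions changed.
    The LTH and GTH values here are illustrative, not values from the application.
    """
    ha, hb = local_histograms(img_a, grid), local_histograms(img_b, grid)
    changed = sum(float(np.abs(a - b).sum()) > lth for a, b in zip(ha, hb))
    return 1 if changed > gth else 0
```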
Fig. 7 is a flowchart illustrating a shot-level pornography identification of a shot in which a frame to be identified is located according to another embodiment of the present application. As shown in fig. 7, the process of performing the shot-level pornography recognition on the shot where the frame to be recognized is located specifically includes the following steps:
S701, the image frames in the shot where the frame to be identified is located are sampled according to a preset rule, and the sampled image frames form a second initial evaluation set.
The shot boundaries can be searched for starting from the frame to be identified, so as to determine the shot where the frame to be identified is located.
S702, pornographic identification is carried out on the second initial evaluation set to obtain the pornographic probability value of each image frame in the second initial evaluation set.
And S703, calculating the average pornographic probability value of the image frames in the second initial evaluation set according to the pornographic probability value of each image frame in the second initial evaluation set.
Among them, S701-S703 refer to S501-S503.
S704, judging whether the average pornographic probability value of the image frames in the second initial evaluation set is larger than a sixth probability threshold value.
S705, if the average pornographic probability value of the image frames in the second initial evaluation set is greater than the sixth probability threshold, the average pornographic probability value of the image frames in the second initial evaluation set is taken as the pornographic probability value of the shot where the frame to be identified is located, and the accumulated duration of the image frames in the second initial evaluation set is taken as the pornographic duration of that shot.
And S706, if the average pornographic probability value of the image frames in the second initial evaluation set is not greater than the sixth probability threshold, determining the image frame with the minimum pornographic probability value in the second initial evaluation set.
S707, the frame difference between each unsampled image frame in the shot where the frame to be identified is located and the image frame with the minimum pornographic probability value is calculated, and the unsampled image frames whose frame difference is 1 are added to the second initial evaluation set to obtain a second target evaluation set.
S708, pornographic identification is performed on the second target evaluation set to obtain the pornographic probability value of each image frame in the second target evaluation set.
Likewise, pornography recognition may be performed on images in the second set of target evaluations based on a deep convolutional neural network.
And S709, calculating the average pornographic probability value of the image frames in the second target evaluation set according to the pornographic probability value of each image frame in the second target evaluation set.
S710, if the average pornographic probability value of the image frames in the second target evaluation set is greater than the sixth probability threshold, the average pornographic probability value of the image frames in the second target evaluation set is taken as the pornographic probability value of the shot where the frame to be identified is located, and the accumulated duration of the image frames in the second target evaluation set is taken as the pornographic duration of that shot.
Fig. 8 is a flowchart illustrating obtaining a pornographic statistical indicator of a video to be identified and determining whether the video to be identified is a pornographic video in a process of performing multi-level detection on each interval according to an embodiment of the present application. As shown in fig. 8, the method specifically includes the following steps:
s801, sampling image frames in the current detection interval according to a preset rule, wherein the sampled image frames form a first initial evaluation set.
S802, pornographic identification is carried out on the first initial evaluation set to obtain the pornographic probability value of each image frame in the first initial evaluation set.
And S803, calculating the average pornographic probability value of the image frames in the first initial evaluation set according to the pornographic probability value of each image frame in the first initial evaluation set.
Specifically, S801-S803 can refer to S501-S503.
S804, judging whether the average pornographic probability value of the image frames in the first initial evaluation set is larger than a fifth probability threshold value.
And S805, if the average pornographic probability value of the image frames in the first initial evaluation set is greater than a fifth probability threshold, judging that the video to be identified is the pornographic video.
S806, if the average pornographic probability value of the image frames in the first initial evaluation set is not greater than the fifth probability threshold, determining the image frame in the first initial evaluation set with the smallest pornographic probability value.
S807, respectively calculating a frame difference between each image frame not sampled in the current detection interval and the image frame with the minimum pornographic probability value, and adding the image frame not sampled with the frame difference of 1 to the first initial evaluation set to obtain a first target evaluation set.
S808, pornographic identification is performed on the first target evaluation set to obtain the pornographic probability value of each image frame in the first target evaluation set.
Likewise, pornography recognition may be performed on images in the first set of target evaluations based on a deep convolutional neural network.
And S809, calculating the average pornographic probability value of the image frames in the first target evaluation set according to the pornographic probability value of each image frame in the first target evaluation set.
And S810, if the average pornographic probability value of the image frames in the first target evaluation set is greater than a fifth probability threshold, judging that the video to be identified is the pornographic video, and otherwise, detecting the next interval.
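Under the same assumptions as the segment-level sketch above, the interval-level plus frame-level flow of FIG. 8 reduces to one call of the two-pass evaluation over the frames of the interval, compared against the fifth probability threshold, with an early verdict when the threshold is exceeded; frames_of and the params attribute names are illustrative placeholders.

```python
def detect_interval_frame_level(interval, stats, stop_event, params):
    """Interval-level + frame-level detection (FIG. 8): sample the whole interval,
    evaluate in two passes, and return a verdict early when the average is high enough."""
    frames = frames_of(interval)                 # assumed decoding/enumeration helper
    avg = two_pass_average_score(frames, params.sample_rate, params.fifth_threshold)
    if avg > params.fifth_threshold:
        stop_event.set()                         # the video is judged to be pornographic
        return True
    return False                                 # otherwise move on to the next interval
```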
The video identification method of the embodiment of the present application divides the video to be identified into a plurality of intervals, performs multi-level detection, and judges whether the video to be identified is a pornographic video according to the pornographic statistical indicators obtained during detection. Through multi-level detection, the visual similarity and temporal continuity between adjacent image frames are exploited, so the obtained pornographic statistical indicators are robust. Moreover, the pornographic statistical indicators can be updated as detection progresses and used for judgment, so the judgment result can be returned early and the whole video to be identified does not need to be decoded and detected. This reduces the time consumed by identification and improves identification speed and accuracy.
In addition, the video identification method of the embodiment of the application also has the following advantages:
1. The pornographic degree of a single video frame is judged based on a deep convolutional neural network, which improves the accuracy of single-frame identification;
2. The hierarchical-detection-based identification method takes the shot as the basic processing unit and exploits the visual similarity and temporal continuity of the images within a shot; sampling and the second-pass evaluation of differing frames reduce both the noise caused by false alarms and missed alarms of frame-level identification and the redundant computation;
3. Shot boundary detection is performed using local color histogram differences, so that images within the same shot have a high probability of having similar feature expressions under the deep-neural-network-based pornographic image recognition model;
4. Prior knowledge is used to divide the video to be identified into a plurality of intervals that have their own processing priorities and can be processed in parallel, and the frames buffered during segment-level pornography identification allow multiple frames to be processed in parallel, which improves identification efficiency;
5. Each image frame is decoded immediately before it is identified, rather than all image frames being decoded at the start of video identification, so when a result is returned early the number of decoded image frames can be greatly reduced and the processing speed improved;
6. In the identification process at each level, the amount of computation is bounded by the upper limit on the number of identified image frames within a shot, the pornographic duration threshold, the segment duration threshold, and the upper limit on the number of non-pornographic occurrences within an interval, so the overall identification speed is controllable and independent of the video duration;
7. A discriminant model is built with the total pornographic duration, cumulative pornographic probability value, number of non-pornographic occurrences, total shot processing number, total pornographic frame number, total frame processing number, video duration, frame rate of the video to be recognized and the like as variables, and is used to judge and identify pornography, which improves identification precision.
In order to implement the above embodiments, the present application further provides a video identification device.
Fig. 9 is a schematic structural diagram of a video recognition device according to an embodiment of the present application.
As shown in fig. 9, the video recognition apparatus according to the embodiment of the present application includes: a dividing module 10, a detecting module 20 and a judging module 30.
Specifically, the dividing module 10 is configured to divide the video to be identified into a plurality of intervals.
In an embodiment of the present application, the dividing module 10 may divide the video to be identified into a plurality of intervals according to the duration of the video to be identified. The present application does not limit the rule for dividing the intervals. For example, the dividing module 10 may divide the video to be identified evenly into a plurality of equal-length intervals according to its duration, or divide it into a plurality of intervals each having a preset duration; when the duration of the video to be identified is not evenly divisible by the preset duration, the remaining part may either be treated as a separate interval or merged into the adjacent interval.
In one embodiment of the present application, the plurality of intervals may each have a respective processing priority, which may be set when the intervals are divided. An interval with a higher processing priority is processed earlier in the subsequent multi-level detection. Specifically, the processing priorities may be set according to historical experience; for example, analysis of a large number of pornographic videos shows that most of the sensitive content is located in the middle-rear part of the video, so the intervals located in the middle-rear part of the video to be identified may be assigned a higher processing priority, as illustrated by the sketch below.
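A minimal sketch of how such interval division and priority assignment might look, assuming a fixed preset interval duration and a simple middle-rear heuristic; the concrete numbers and the helper name are illustrative assumptions, not values taken from this application.

def divide_into_intervals(video_duration, preset_duration=60.0):
    """Split [0, video_duration) into intervals of roughly preset_duration seconds.

    The remainder that cannot fill a whole interval is merged into the last interval.
    """
    intervals = []
    start = 0.0
    while video_duration - start >= 2 * preset_duration:
        intervals.append((start, start + preset_duration))
        start += preset_duration
    intervals.append((start, video_duration))  # remainder merged into the last interval

    # Prior knowledge: sensitive content tends to sit in the middle-rear part of the
    # video, so those intervals receive a higher processing priority.
    prioritized = []
    for start_s, end_s in intervals:
        centre = (start_s + end_s) / 2.0 / video_duration
        priority = 2 if 0.4 <= centre <= 0.9 else 1   # assumed heuristic
        prioritized.append({"start": start_s, "end": end_s, "priority": priority})
    return prioritized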
The detection module 20 is configured to perform multi-level detection on a plurality of intervals respectively, and acquire a pornographic statistic indicator of a video to be identified in a process of performing multi-level detection on each interval.
The multi-level detection refers to detecting a video to be identified according to a multi-level structure of the video, and may be specifically shown in fig. 2.
The division of an interval into shots and the division of a shot into segments are both generated during the detection process, as the shots are identified frame by frame.
Specifically, shots are divided by calculating, when an image frame is identified, the frame difference between that frame and its adjacent image frame; if the frame difference is 1, the two image frames belong to two different shots, i.e. a shot boundary lies between them.
Segments within a shot are divided as follows: during identification of the shot, the durations of the image frames are accumulated starting from the first image frame of the shot; whenever the accumulated duration reaches the set segment duration, one segment is generated and generation of the next segment begins. The first segment of a shot starts at the initial frame of the shot, and the last segment ends at the end frame of the shot.
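The two operations just described might be sketched as follows, assuming frames are RGB arrays. The grid size, histogram bins and binarisation cut-off in frame_difference(), and the default segment duration, are illustrative assumptions; only the overall scheme (local color histograms for the frame difference, accumulated frame durations for segment division) comes from the description and claim 5.

import numpy as np

def frame_difference(frame_a, frame_b, grid=(4, 4), bins=16, cut=0.5):
    """Return 1 if the two frames are judged to lie in different shots, else 0."""
    def local_histograms(frame):
        h, w, _ = frame.shape
        gh, gw = grid
        hists = []
        for i in range(gh):          # split the frame into M = gh * gw numbered regions
            for j in range(gw):
                region = frame[i * h // gh:(i + 1) * h // gh,
                               j * w // gw:(j + 1) * w // gw]
                hist, _ = np.histogram(region, bins=bins, range=(0, 256), density=True)
                hists.append(hist)
        return np.concatenate(hists)

    # L1 distance between the local color statistical histograms, binarised.
    distance = np.abs(local_histograms(frame_a) - local_histograms(frame_b)).sum()
    return 1 if distance > cut else 0

def split_into_segments(shot_frames, frame_duration, segment_duration=2.0):
    """Accumulate frame durations from the first frame of the shot and close a
    segment each time the accumulated duration reaches segment_duration."""
    segments, current, elapsed = [], [], 0.0
    for frame in shot_frames:
        current.append(frame)
        elapsed += frame_duration
        if elapsed >= segment_duration:
            segments.append(current)
            current, elapsed = [], 0.0
    if current:                      # the last segment ends at the end frame of the shot
        segments.append(current)
    return segments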
In an embodiment of the present application, when performing multi-level detection on the plurality of intervals, the detection module is specifically configured to: divide the plurality of intervals into N interval groups, where N is a positive integer; and detect the N interval groups respectively through N parallel threads, each thread performing multi-level detection on the intervals in its corresponding interval group in order of processing priority.
It should be understood that, in another embodiment of the present application, it is not necessary to group the intervals; the intervals can be detected in parallel directly by multiple parallel threads according to their processing priorities.
In this way, multi-level detection can be performed on multiple intervals by multiple parallel threads according to a chosen strategy, so that hardware computing resources are fully utilized and multithreading converts those resources into a speed gain, thereby improving video identification efficiency.
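One way this parallel, priority-ordered probing could be wired up is sketched below; evaluate_interval() stands for the per-interval routine sketched earlier, and the early-exit event is an assumption about how the result can be returned in advance, not a mechanism specified in this application.

from concurrent.futures import ThreadPoolExecutor
import threading

def detect_in_parallel(interval_groups, evaluate_interval):
    """Detect N interval groups with N parallel threads, each walking its group
    in descending processing-priority order."""
    found_porn = threading.Event()

    def worker(group):
        for interval in sorted(group, key=lambda iv: iv["priority"], reverse=True):
            if found_porn.is_set():
                return                   # another thread has already decided
            if evaluate_interval(interval):
                found_porn.set()         # allows the result to be returned in advance
                return

    with ThreadPoolExecutor(max_workers=len(interval_groups)) as pool:
        for group in interval_groups:
            pool.submit(worker, group)
    return found_porn.is_set()           # True means the video is judged pornographic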
Specifically, the detection module 20 is configured to perform the steps shown in fig. 3 to carry out multi-level detection on each interval and to acquire the pornographic statistical indicators of the video to be identified during the detection process. While detecting an interval, shot-level pornography identification may further be performed on the shot containing a suspicious pornographic image frame, as shown in fig. 4. Further, during shot-level identification the shot may be divided into segments and segment-level identification performed; during segment-level identification, a plurality of consecutive image frames within a segment are sampled, processed in parallel, and then subjected to a secondary evaluation. The specific implementations are the same as in the above embodiments and are not repeated here. In this way, the multi-level detection exploits the visual similarity and temporal continuity between adjacent image frames, and the resulting pornographic statistical indicators are robust.
It should be understood that, in the embodiments of the present application, the detection module 20 may perform hierarchical detection as interval-level -> shot-level -> segment-level -> frame-level detection through the embodiments shown in fig. 3, fig. 4 and fig. 5, as interval-level -> shot-level -> frame-level detection through the embodiments shown in fig. 3 and fig. 7, or as interval-level -> frame-level detection through the embodiment shown in fig. 8.
The judging module 30 is configured to judge whether the video to be identified is a pornographic video according to the pornographic statistical indicator in the process of performing multi-level detection on each interval.
In an embodiment of the present application, the pornographic statistical indicators may include one or more of the total pornographic duration, cumulative pornographic probability value, number of non-pornographic detections, total shot processing count, total pornographic frame count, total frame processing count, video duration and frame rate of the video to be identified. The judgment module 30 may analyze the cumulative pornographic probability value, number of non-pornographic detections, total pornographic frame count, total pornographic duration, total shot processing count, total frame processing count, video duration and frame rate of the video to be identified based on a pre-trained support vector machine discrimination model to judge whether the video to be identified is a pornographic video.
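As an illustration only, feeding these indicators into a pre-trained discrimination model could look like the sketch below; the dictionary keys, the feature order and the use of a generic predict() interface are assumptions made for the example, since the application specifies only the set of variables.

import numpy as np

def judge_video(indicators, svm_model):
    """indicators: dict of the eight statistics; svm_model: pre-trained discriminator."""
    features = np.array([[
        indicators["total_porn_duration"],
        indicators["cumulative_porn_probability"],
        indicators["non_porn_count"],
        indicators["total_shots_processed"],
        indicators["total_porn_frames"],
        indicators["total_frames_processed"],
        indicators["video_duration"],
        indicators["frame_rate"],
    ]])
    return bool(svm_model.predict(features)[0])  # 1 is taken to mean pornographic video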
The cumulative pornographic probability value is accumulated gradually as shot detection proceeds: each time a shot whose shot duration is not less than the first duration and whose pornographic probability value is greater than the second probability threshold is identified, its pornographic probability value is added to the cumulative value.
The number of non-pornographic detections is the number of times that an image frame whose pornographic probability value is not greater than the first probability threshold is identified during the detection process.
The total pornographic frame count is the total number of pornographic image frames identified during the detection process, where a pornographic image frame is an image frame whose pornographic probability value is greater than the fourth probability threshold. In addition, if the average pornographic probability value of the image frames in an image frame set is greater than the fourth probability threshold, all image frames in that set are judged to be pornographic image frames.
The total shot processing count is the total number of shots on which shot-level pornography identification is performed during the detection process.
The total frame processing count is the total number of image frames on which pornography identification is performed during the detection process.
The total pornographic duration is the sum of the durations of the identified shots whose shot duration is not less than the first duration and whose pornographic probability value is greater than the second probability threshold during the detection process.
It should be understood that, among the above pornographic statistical indicators, the video duration and frame rate are inherent attributes of the video to be identified, whereas the total pornographic duration, cumulative pornographic probability value, number of non-pornographic detections, total shot processing count, total pornographic frame count and total frame processing count change continuously as detection proceeds; therefore, each time one of these indicators is updated during detection, it can be judged anew whether the video to be identified is a pornographic video.
In the embodiments of the present application, a large number of manually labelled sample videos (including both pornographic and non-pornographic videos) are first run through the detection process to obtain their statistical indicators, and a Support Vector Machine (SVM) is then trained on the obtained indicators to establish the discrimination model.
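A minimal sketch of that training step, assuming scikit-learn as the SVM implementation and a hypothetical collect_indicators() helper that runs a sample video through the detection process and returns its eight statistics as a feature vector; neither the library choice nor the kernel is specified by this application.

import numpy as np
from sklearn.svm import SVC

def train_discriminator(sample_videos, labels, collect_indicators):
    """Fit an SVM discrimination model on the statistics of labelled sample videos."""
    X = np.array([collect_indicators(video) for video in sample_videos])
    y = np.array(labels)                 # 1 for pornographic, 0 for non-pornographic
    model = SVC(kernel="rbf")            # assumed kernel; the application does not say
    model.fit(X, y)
    return model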
The video identification device of the embodiment of the application divides the video to be identified into a plurality of intervals, performs multi-level detection on them, and judges whether the video to be identified is a pornographic video according to the pornographic statistical indicators obtained during detection. The multi-level detection exploits the visual similarity and temporal continuity between adjacent image frames, so the obtained pornographic statistical indicators are robust. Because the indicators are updated and the judgment is made as detection progresses, the judgment result can be returned in advance without decoding and detecting the entire video to be identified, which reduces the time consumed by identification and improves both recognition speed and accuracy.
In addition, the video identification device of the embodiment of the application also has the following advantages:
1. the pornographic degree of each single video frame is judged by a deep convolutional neural network, which improves the accuracy of single-frame identification;
2. the hierarchical identification method takes a shot as the basic processing unit and exploits the visual similarity and temporal continuity of the images within a shot; by sampling and performing a secondary evaluation of the differing frames, it reduces both the noise caused by frame-level false alarms and missed detections and the redundant computation;
3. shot boundary detection is performed using local color histogram differences, so that images within the same shot have a high probability of sharing similar feature expressions under the deep-neural-network-based pornographic image recognition model;
4. prior knowledge is used to divide the video to be identified into a plurality of intervals that have respective processing priorities and can be processed in parallel, and multiple frames can be processed in parallel by caching frames during segment-level pornography identification, which improves identification efficiency;
5. each image frame is decoded only immediately before it is identified, rather than decoding all image frames at the start of video identification, so that when a result is returned in advance the number of decoded image frames is greatly reduced and the processing speed is improved;
6. during identification at each level, the amount of computation is bounded by the upper limit on the number of identified image frames within a shot, the pornographic duration threshold, the segment duration threshold and the upper limit on the number of non-pornographic detections per interval, so that the overall identification speed is controllable and independent of the video duration;
7. a discrimination model is established with the total pornographic duration, cumulative pornographic probability value, number of non-pornographic detections, total shot processing count, total pornographic frame count, total frame processing count, video duration and frame rate of the video to be identified as variables, and pornographic content is judged with this model, which improves recognition precision.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the application, the scope of which is defined by the claims and their equivalents.

Claims (11)

1. A video recognition method, comprising the steps of:
dividing a video to be identified into a plurality of intervals;
respectively carrying out multi-level detection on the plurality of intervals, acquiring the pornographic statistical indexes of the video to be identified in the process of carrying out multi-level detection on each interval, and judging whether the video to be identified is the pornographic video or not according to the pornographic statistical indexes, wherein the method comprises the following steps of:
sampling image frames in the current detection interval according to a preset rule, wherein the sampled image frames form a first initial evaluation set;
performing pornographic identification on the first initial evaluation set to acquire a pornographic probability value of each image frame in the first initial evaluation set;
calculating the average pornographic probability value of the image frames in the first initial evaluation set according to the pornographic probability value of each image frame in the first initial evaluation set;
if the average pornographic probability value of the image frames in the first initial evaluation set is not greater than a fifth probability threshold, determining the image frame with the minimum pornographic probability value in the first initial evaluation set;
respectively calculating the frame difference between each image frame which is not sampled in the current detection interval and the image frame with the minimum pornographic probability value, and adding the image frame which is not sampled and has the frame difference of 1 to the first initial evaluation set to obtain a first target evaluation set;
performing pornographic identification on the first target evaluation set to acquire a pornographic probability value of each image frame in the first target evaluation set;
calculating the average pornographic probability value of the image frames in the first target evaluation set according to the pornographic probability value of each image frame in the first target evaluation set;
and if the average pornographic probability value of the image frames in the first target evaluation set is greater than a fifth probability threshold, judging that the video to be identified is the pornographic video, otherwise, detecting the next interval.
2. The video identification method according to claim 1, wherein the plurality of intervals have respective processing priorities, and the performing multi-level detection on the plurality of intervals specifically comprises:
dividing the plurality of intervals into N interval groups, wherein N is a positive integer;
and respectively detecting the N interval groups through N parallel threads, wherein each thread sequentially detects the intervals in the corresponding interval group in a multi-level mode according to the processing priority.
3. The video identification method according to any of claims 1-2, wherein the pornographic statistical indicator includes one or more of a total pornographic duration, a cumulative pornographic probability value, a number of non-pornographic times, a total shot processing count, a total pornographic frame number, a total frame processing count, a video duration and a frame rate of the video to be identified, and the determining whether the video to be identified is a pornographic video according to the pornographic statistical indicator specifically includes:
and analyzing the cumulative pornographic probability value, the number of non-pornographic times, the total pornographic frame number, the total pornographic duration, the total shot processing count, the total frame processing count, the video duration and the frame rate of the video to be identified based on a pre-trained support vector machine discrimination model so as to judge whether the video to be identified is a pornographic video.
4. The video recognition method of any one of claims 1-2, wherein the method further comprises:
and if the average pornographic probability value of the image frames in the first initial evaluation set is greater than a fifth probability threshold, judging that the video to be identified is the pornographic video.
5. The video recognition method of claim 1, wherein the frame difference between two image frames is obtained by:
dividing the two image frames into M areas according to a preset rule, and numbering the M areas respectively, wherein M is a positive integer greater than 2;
calculating the color value of each pixel point in the two image frames;
respectively calculating local color statistical histograms of the two image frames according to color values of pixel points in each region of the two image frames;
and calculating the frame difference of the two image frames according to the local color statistical histograms of the two image frames.
6. The video recognition method of claim 1, wherein pornographic recognition is performed on each frame of image based on a deep convolutional neural network.
7. A video recognition apparatus, comprising:
the dividing module is used for dividing the video to be identified into a plurality of intervals;
the detection module is used for respectively carrying out multi-level detection on the plurality of intervals and acquiring the pornographic statistical index of the video to be identified in the process of carrying out multi-level detection on each interval;
the judging module is used for judging whether the video to be identified is the pornographic video or not according to the pornographic statistical indexes in the process of carrying out multi-level detection on each interval;
wherein the detection module is specifically configured to:
sampling image frames in the current detection interval according to a preset rule, wherein the sampled image frames form a first initial evaluation set;
performing pornographic identification on the first initial evaluation set to acquire a pornographic probability value of each image frame in the first initial evaluation set;
calculating the average pornographic probability value of the image frames in the first initial evaluation set according to the pornographic probability value of each image frame in the first initial evaluation set;
if the average pornographic probability value of the image frames in the first initial evaluation set is not greater than a fifth probability threshold, determining the image frame with the minimum pornographic probability value in the first initial evaluation set;
respectively calculating the frame difference between each image frame which is not sampled in the current detection interval and the image frame with the minimum pornographic probability value, and adding the image frame which is not sampled and has the frame difference of 1 to the first initial evaluation set to obtain a first target evaluation set;
performing pornographic identification on the first target evaluation set to acquire a pornographic probability value of each image frame in the first target evaluation set;
calculating the average pornographic probability value of the image frames in the first target evaluation set according to the pornographic probability value of each image frame in the first target evaluation set;
the judgment module is specifically configured to:
and if the average pornographic probability value of the image frames in the first target evaluation set is greater than a fifth probability threshold, judging that the video to be identified is the pornographic video, otherwise, detecting the next interval.
8. The video identification device of claim 7, wherein the plurality of intervals each have a respective processing priority, and wherein the detection module is specifically configured to:
dividing the plurality of intervals into N interval groups, wherein N is a positive integer;
and respectively detecting the N interval groups through N parallel threads, wherein each thread sequentially detects the intervals in the corresponding interval group in a multi-level mode according to the processing priority.
9. The video identification device according to any of claims 7 to 8, wherein the pornographic statistical indicator includes one or more of a total pornographic duration, a cumulative pornographic probability value, a number of non-pornographic times, a total shot processing count, a total pornographic frame number, a total frame processing count, a video duration and a frame rate of the video to be identified, and the determining module is specifically configured to:
and analyzing the cumulative pornographic probability value, the number of non-pornographic times, the total pornographic frame number, the total pornographic duration, the total shot processing count, the total frame processing count, the video duration and the frame rate of the video to be identified based on a pre-trained support vector machine discrimination model so as to judge whether the video to be identified is a pornographic video.
10. The video recognition apparatus of any one of claims 7-8,
the judgment module is specifically configured to:
and if the average pornographic probability value of the image frames in the first initial evaluation set is greater than a fifth probability threshold, judging that the video to be identified is the pornographic video.
11. The video recognition apparatus of claim 7, wherein pornographic recognition is performed on each frame of image based on a deep convolutional neural network.
CN201510568027.5A 2015-09-08 2015-09-08 Video identification method and device Active CN106503610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510568027.5A CN106503610B (en) 2015-09-08 2015-09-08 Video identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510568027.5A CN106503610B (en) 2015-09-08 2015-09-08 Video identification method and device

Publications (2)

Publication Number Publication Date
CN106503610A CN106503610A (en) 2017-03-15
CN106503610B true CN106503610B (en) 2020-05-26

Family

ID=58286970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510568027.5A Active CN106503610B (en) 2015-09-08 2015-09-08 Video identification method and device

Country Status (1)

Country Link
CN (1) CN106503610B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229262B (en) * 2016-12-22 2021-10-15 腾讯科技(深圳)有限公司 Pornographic video detection method and device
CN107016356A (en) * 2017-03-21 2017-08-04 乐蜜科技有限公司 Certain content recognition methods, device and electronic equipment
JP2019082745A (en) * 2017-10-11 2019-05-30 ベイ ラブズ インク. Artificial intelligence ejection fraction determination method
US10470677B2 (en) * 2017-10-11 2019-11-12 Bay Labs, Inc. Artificially intelligent ejection fraction determination
WO2019136761A1 (en) * 2018-01-15 2019-07-18 深圳鲲云信息科技有限公司 Three-dimensional convolution device for recognizing human action
CN109831697B (en) * 2018-12-28 2021-05-21 广州华多网络科技有限公司 Method and system for detecting violation processing duration

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9465995B2 (en) * 2013-10-23 2016-10-11 Gracenote, Inc. Identifying video content via color-based fingerprint matching

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073676A (en) * 2010-11-30 2011-05-25 中国科学院计算技术研究所 Method and system for detecting network pornography videos in real time
CN104182735A (en) * 2014-08-18 2014-12-03 厦门美图之家科技有限公司 Training optimization pornographic picture or video detection method based on convolutional neural network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Video classification based on an active learning SVM classifier; Yuan Xun et al.; Journal of University of Science and Technology of China; 31 May 2009; Vol. 39, No. 5; pp. 473-478: abstract, sections 1-2 *
Pornographic video recognition algorithm based on graph-based semi-supervised learning; Yu Wei; China Master's Theses Full-text Database, Information Science and Technology Series; 15 Oct 2014, No. 10; p. I138-1024 *
Research on pornographic video detection based on skin and sensitive body parts; Gu Yunhao; China Master's Theses Full-text Database, Information Science and Technology Series; 15 Mar 2015, No. 03; p. I138-2445: abstract, sections 4.1 and 5.2.2 *
Research on a pornographic video detection algorithm based on intra-shot optical flow characteristics; Liu Ying; China Master's Theses Full-text Database, Information Science and Technology Series; 15 Dec 2010, No. 12; p. I138-189 *
Gu Yunhao. Research on pornographic video detection based on skin and sensitive body parts. China Master's Theses Full-text Database, Information Science and Technology Series. 2015, No. 03, p. I138-2445: abstract, sections 4.1 and 5.2.2. *

Also Published As

Publication number Publication date
CN106503610A (en) 2017-03-15

Legal Events

C06: Publication
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant