CN111027419B - Method, device, equipment and medium for detecting video irrelevant content - Google Patents

Method, device, equipment and medium for detecting video irrelevant content

Info

Publication number
CN111027419B
CN111027419B (application number CN201911159761.0A)
Authority
CN
China
Prior art keywords
video
historical
matching
audio
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911159761.0A
Other languages
Chinese (zh)
Other versions
CN111027419A (en)
Inventor
万明阳
马连洋
衡阵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN201911159761.0A
Publication of CN111027419A
Application granted
Publication of CN111027419B

Classifications

    • G06V20/40 Scenes; scene-specific elements in video content (GPHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING → G06V20/00 Scenes; scene-specific elements)
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F18/22 Matching criteria, e.g. proximity measures (GPHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F18/00 Pattern recognition → G06F18/20 Analysing)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application belongs to the technical field of data processing, relates mainly to computer vision and speech technology within artificial intelligence, and discloses a method, an apparatus, a device, and a medium for detecting video-irrelevant content. A sample multimedia feature set is generated in advance from the matching frame numbers and audio similarities among a plurality of historical videos of a target object; when the matching degree between the video multimedia feature set of a video to be detected and the sample multimedia feature set meets a set condition, video-irrelevant content is determined to exist in the video to be detected. In this way, the applicable range of video-irrelevant content detection is enlarged, and the robustness and accuracy of the detection result are improved.

Description

Method, device, equipment and medium for detecting video irrelevant content
Technical Field
The present application relates to the field of data processing technologies, and in particular to a method, an apparatus, a device, and a medium for detecting video-irrelevant content.
Background
With the development of internet and intelligent terminal technologies, and the popularization of video shooting devices and video editing tools, the number of published videos has grown rapidly. Before publishing, users often add extra video-irrelevant content at the beginning or end of the video body, for example a black screen, an advertisement, or a self-media number.
In the prior art, video-irrelevant content is generally detected as follows: for a plurality of video files of a television series, video features are extracted from the head time region or tail time region of each file, and video-irrelevant content is judged to exist when the video features of the files are similar.
However, this video-feature-based approach can only be applied to television-series videos, and factors such as video resolution and size strongly affect the detection result; the applicable range is therefore small, and the robustness and accuracy of detection are poor.
Thus, a detection scheme is needed that can enlarge the applicable range of video-irrelevant content detection and improve the robustness and accuracy of the detection results.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a device, and a medium for detecting video-irrelevant content, which are used to enlarge the applicable range of video-irrelevant content detection and to improve the robustness and accuracy of the detection result.
In one aspect, a method for detecting video-irrelevant content is provided, comprising:
obtaining a video frame to be detected from a video to be detected of a target object;
acquiring a video multimedia feature set formed by video multimedia features of each video frame to be detected;
when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition, determining that video irrelevant content exists in the video to be detected;
the sample multimedia feature set is obtained according to video irrelevant content contained in historical videos of the target object, and the video irrelevant content contained in the historical videos is determined according to the matching frame number of video frames among the historical videos and the audio similarity.
In one aspect, an apparatus for detecting video-irrelevant content is provided, comprising:
the acquisition unit is used for acquiring a video frame to be detected from the video to be detected of the target object;
the composition unit is used for acquiring a video multimedia feature set composed of video multimedia features of each video frame to be detected;
the determining unit is used for determining that video-irrelevant content exists in the video to be detected when the matching degree of the video multimedia feature set and the sample multimedia feature set meets a set condition;
the sample multimedia feature set is obtained according to video irrelevant content contained in historical videos of the target object, and the video irrelevant content contained in the historical videos is determined according to the matching frame number of video frames among the historical videos and the audio similarity.
Preferably, the acquiring unit is configured to:
and sampling video frames in a specified time period of the video to be detected according to the specified sampling time length to obtain each video frame to be detected.
Preferably, the determining unit is configured to:
sequentially determining the similarity of video frames between each video picture feature in the video multimedia feature set and each corresponding video picture feature in the sample multimedia feature set until the obtained similarity of the video frames is lower than a preset picture similarity threshold, wherein the video multimedia features at least comprise video picture features;
determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as the matching frame number;
and when the matching frame number is larger than a first preset frame number threshold value, determining that video irrelevant contents exist in the video to be detected.
Preferably, the determining unit is further configured to:
when there is one sample multimedia feature set of the target object, determining the irrelevant content duration of the video-irrelevant content according to the matching frame number and the specified sampling duration;
when there are a plurality of sample multimedia feature sets of the target object, determining the maximum matching frame number among the obtained matching frame numbers, and determining the irrelevant content duration of the video-irrelevant content according to the maximum matching frame number and the specified sampling duration.
Preferably, the determining unit is further configured to:
acquiring a historical video set formed by a plurality of historical videos uploaded by a target object;
respectively obtaining a video multimedia feature set of each historical video;
taking out a to-be-processed historical video from the historical video set, and determining the matching frame number between the to-be-processed historical video and each historical video in the updated historical video set according to the video picture characteristics contained in each video multimedia characteristic set;
screening historical videos meeting preset matching conditions according to the obtained matching frame numbers;
according to the matching frame numbers of the historical video to be processed and each screened historical video, a corresponding matching frame feature set is obtained;
and determining a sample multimedia feature set corresponding to the historical video to be processed according to the obtained multiple matching frame feature sets.
Preferably, the determining unit is further configured to:
screening historical videos with the matching frame number being larger than a first preset frame number threshold value and smaller than a second preset frame number threshold value;
determining a corresponding matching time period according to the matching frame number corresponding to each screened historical video and the appointed sampling duration;
extracting corresponding audio features according to the matching time periods matched between the historical video to be processed and each screened historical video;
determining the audio similarity between the historical video to be processed and the audio characteristics of each screened historical video respectively;
and screening historical videos corresponding to the audio similarity meeting the preset audio similarity condition.
Preferably, the determining unit is further configured to:
respectively extracting the first audio features of the video content within the matching time period, in the historical video to be processed and in each screened historical video;
and respectively extracting the second audio features of the video content within the specified time period but outside the matching time period, in the historical video to be processed and in each screened historical video.
Preferably, the determining unit is further configured to:
and determining, for each screened historical video, the first audio similarity between its first audio feature and that of the historical video to be processed, and the second audio similarity between their second audio features.
Preferably, the determining unit is further configured to:
historical videos with the first audio similarity higher than a first preset audio threshold and the second audio similarity lower than a second preset audio threshold are screened out.
Preferably, the determining unit is further configured to:
and when the number of historical videos contained in the updated historical video set is greater than one, executing again the step of taking out one historical video to be processed from the historical video set and determining the matching frame number between the historical video to be processed and each historical video in the updated historical video set.
Preferably, the video-irrelevant content is any one or a combination of the following: a video head and a video tail.
In one aspect, a control device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the above methods for detecting video-irrelevant content.
In one aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of any of the above methods for detecting video-irrelevant content.
In the method, apparatus, device, and medium for detecting video-irrelevant content provided by the embodiments of the present application, a sample multimedia feature set is generated in advance according to the matching frame numbers and audio similarities among a plurality of historical videos of the target object. When a video to be detected of the target object is received, video frames to be detected are obtained from it, and a video multimedia feature set composed of the video multimedia features of these frames is acquired; when the matching degree between the video multimedia feature set and the sample multimedia feature set meets a set condition, it is determined that video-irrelevant content exists in the video to be detected. In this way, the applicable range of detection is enlarged, and the robustness and accuracy of the detection result are improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a system architecture for detecting video-irrelevant content in an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating a sample multimedia feature set according to an embodiment of the present application;
FIG. 3 is a flowchart of detecting video-irrelevant content in an embodiment of the present application;
FIG. 4a is a diagram illustrating a video recommendation list according to an embodiment of the present application;
FIG. 4b is an exemplary flow overview of detecting video-irrelevant content according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for detecting video-irrelevant content according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a control device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
First, some terms related to the embodiments of the present application will be described so as to be easily understood by those skilled in the art.
Terminal device: an electronic device, mobile or fixed, on which various applications can be installed and which can display the objects provided by an installed application. Examples include mobile phones, tablet computers, various wearable devices, in-vehicle devices, personal digital assistants (PDA), point-of-sale (POS) terminals, and other electronic devices capable of realizing the above functions.
Application: an application program, i.e., a computer program that performs one or more tasks, typically with a visual display interface through which it interacts with the user; for example, electronic maps and WeChat may be referred to as applications.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see": replacing human eyes with cameras and computers to recognize, track, and measure targets, and further performing graphics processing so that the result becomes an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the principal modes of human-computer interaction.
The perceptual hash algorithm is a generic name for a class of hash algorithms used to generate a fingerprint string for each image; the similarity of images is judged by comparing their fingerprint information: the closer the fingerprints, the more similar the images. Perceptual hash algorithms include average hash (aHash), perceptual hash (pHash), and difference hash (dHash). aHash is faster but less accurate; pHash is the opposite, more accurate but slower; dHash balances the two, offering both good accuracy and good speed.
Hamming distance: named after Richard Wesley Hamming. In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ; in other words, it is the number of substitutions needed to transform one string into the other. For example: the Hamming distance between 1011101 and 1001001 is 2; the Hamming distance between 2143896 and 2233796 is 3; the Hamming distance between "toned" and "roses" is 3.
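As an illustrative aside, not part of the original patent text, a minimal Python sketch of the Hamming distance for equal-length strings and for 64-bit fingerprints:

```python
def hamming_str(a: str, b: str) -> int:
    # Number of positions at which two equal-length strings differ.
    assert len(a) == len(b)
    return sum(ca != cb for ca, cb in zip(a, b))

def hamming_u64(a: int, b: int) -> int:
    # For 64-bit image fingerprints: XOR, then count the set bits.
    return bin(a ^ b).count("1")

assert hamming_str("1011101", "1001001") == 2
assert hamming_str("2143896", "2233796") == 3
assert hamming_str("toned", "roses") == 3
```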
Video-irrelevant content: the part added to a video that is unrelated to the video content itself, including the video head and tail, usually a black screen, an advertisement, or a self-media number; it can be removed without affecting the integrity of the video content.
The following describes the design concept of the embodiment of the present application.
With the development of internet and intelligent terminal technologies, and the popularization of video shooting equipment and video editing technology, more and more users obtain videos or short videos by shooting with intelligent terminals or by downloading through the network, and then add extra video-irrelevant content to the video body.
The video-irrelevant content is usually a black screen, an advertisement, a self-media number, or the like, serving purposes such as advertising promotion.
Obviously, a video with video-independent content can greatly reduce the user experience. In order to improve user experience, a video publishing server generally performs video irrelevant content detection on videos uploaded by users, reduces recommendation weights of videos containing video irrelevant contents according to detection results, and displays a video recommendation list according to the recommendation weights. Furthermore, according to the detection result, the video irrelevant contents can be removed before the video is played.
In the conventional technology, video-irrelevant content is generally detected in the following manner:
only for a plurality of video files of a television-series video, video features are extracted from the head time region or tail time region of each video file; the obtained video features are matched against one another, and video-irrelevant content is judged to exist when the matching degree between them meets the matching condition.
However, with this method only television-series videos can be detected, so the applicable range is small; and because factors such as video resolution and size strongly affect the extracted video features, the robustness and accuracy of the detection result are poor.
Furthermore, when video-irrelevant content is determined to exist, its duration cannot be identified, and repeated videos and videos with similar beginning or ending portions cannot be filtered out; such videos are therefore easily misidentified as containing video-irrelevant content, which reduces the accuracy of the detection result.
Obviously, the conventional technology does not provide a technical scheme with wide application range and high robustness and accuracy of the detection result, so that a technical scheme for detecting the video irrelevant content is needed to expand the detection application range and improve the robustness and accuracy of the detection result when detecting the video irrelevant content.
In view of the above analysis, the embodiments of the present application provide a technical solution for detecting video-irrelevant content, mainly involving computer vision and speech technology in artificial intelligence. In this solution, video-irrelevant content is screened out in advance according to the matching frame numbers of video frames among a plurality of historical videos of the target object and the audio similarities between their audio features, and a corresponding sample multimedia feature set is generated from the video picture features of the screened video-irrelevant content.
When a video to be detected of the target object is received, video frames to be detected are obtained from it, and a video multimedia feature set composed of their video multimedia features is acquired; when the matching degree between this set and the sample multimedia feature set meets a set condition, it is determined that video-irrelevant content exists in the video to be detected.
To further explain the technical solution provided by the embodiments of the present application, details are given below with reference to the accompanying drawings and specific embodiments. Although the embodiments provide the method operation steps shown below or in the figures, the methods may include more or fewer steps based on routine or non-inventive labor. For steps with no logically necessary causal relationship, the execution order is not limited to that provided by the embodiments; the methods may be executed sequentially or in parallel according to the embodiments or the figures during actual processing or when executed by an apparatus.
Referring to fig. 1, a schematic diagram of a system architecture for video-independent content detection is shown, the system includes: a plurality of terminals 100 and a server 101. The terminal 100 and the server 101 are connected through a wired or wireless network.
The terminal 100 is installed with a video application, through which a video to be detected submitted by a user is uploaded to the server 101; the terminal is further configured to obtain a recommended video list from the server 101 and display it in response to a video viewing operation, and to play a video specified by the user in response to a video playing operation.
Server 101: the method comprises the steps of generating a corresponding sample multimedia feature set when video irrelevant content exists according to the matching frame number of video frames among a plurality of historical videos of a target object and the audio similarity among audio features; the method is also used for acquiring a video frame to be detected in the video to be detected when receiving the video to be detected of the target object uploaded by the terminal 100, acquiring a video multimedia feature set formed by video multimedia features of each video frame to be detected, and determining that video irrelevant content exists in the video to be detected when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition; the method is also used for setting the recommendation weight of the video to be detected according to the video irrelevant content detection result, and sequencing the video to be recommended according to the recommendation weight to obtain a sequenced recommended video list.
It should be noted that the video multimedia features at least include video picture features of video frames, and may also include audio features.
Considering that the videos uploaded by the same object generally reuse the same video-irrelevant content, in the embodiments of the present application a sample multimedia feature set is generated for the video-irrelevant content contained in the plurality of historical videos of each target object, and the video to be detected of that target object is checked against this sample multimedia feature set to obtain the detection result of video-irrelevant content detection.
Optionally, the target object may be a user or a group, and generally takes account information of an uploaded video or one type of account information as the target object, and in practical application, the target object may be set according to a practical application scenario, which is not described herein. The video may be in the form of a television series, a short video, etc.
The embodiments of the present application can be applied to scenarios in which video-irrelevant content detection is performed on one or more videos of any form (such as television series and short videos); for ease of understanding, only the detection of video-irrelevant content in short videos is taken as an example below, and other cases follow the same principle.
In the embodiment of the application, one or more sample multimedia feature sets of a target object are generated in advance before video irrelevant content detection is carried out on a video to be detected of the target object. Wherein the sample multimedia feature set is obtained from video-independent content contained in the historical video of the target object.
Referring to fig. 2, a flowchart of an implementation of a method for generating a sample multimedia feature set according to the present application is shown. The method comprises the following specific processes:
step 200: the server acquires a historical video set composed of a plurality of historical videos uploaded by the target object.
Specifically, since video-irrelevant content, such as a video head or video tail, is usually contained in the videos uploaded by a particular user, and those videos usually contain repeated video-irrelevant content, a sample multimedia feature set is generated per target object from that target object's own historical videos, not from all historical videos. In this way, the plurality of historical videos uploaded by the target object can serve as the sample videos for generating the sample multimedia feature set.
Step 201: the server obtains the video multimedia feature set of each historical video in the historical video set respectively.
Specifically, the server performs the following steps for each history video:
s2010: and sampling video frames in a specified time period in the historical videos in the historical video set according to the specified sampling time length to obtain a plurality of sample video frames.
In practical application, the designated sampling duration and the designated time period can be set correspondingly according to the practical application scene, and are not described herein.
For example, if the specified sampling duration is 1s and the specified time period is the first 10s of the beginning of the video, the server extracts 1 video frame every 1s within the first 10s of the video, and obtains 10 video frames.
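A minimal sketch of this sampling step, assuming the OpenCV (cv2) library; the function name and defaults are illustrative rather than prescribed by the patent:

```python
import cv2

def sample_frames(path: str, sample_s: float = 1.0, period_s: float = 10.0):
    # Take one frame every sample_s seconds within the first period_s
    # seconds of the video, e.g. 10 frames for 1 s spacing over 10 s.
    cap = cv2.VideoCapture(path)
    frames, t = [], 0.0
    while t < period_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        t += sample_s
    cap.release()
    return frames
```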
S2011: and respectively extracting the video picture characteristics of each sample video frame.
In one embodiment, a hash algorithm is used to extract video picture features for each sample video frame separately.
The video picture features are 64-bit hash values obtained after hash calculation is performed on the sample video frames.
Alternatively, the hash algorithm may be a perceptual hash algorithm, i.e., a generic name for a class of hash algorithms that generate a fingerprint string for each image and judge image similarity by comparing the fingerprint information of different images: the closer the fingerprints, the more similar the images.
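Since the patent does not prescribe a specific perceptual-hash variant, the following difference-hash (dHash) sketch, assuming the Pillow library, only illustrates one way such a 64-bit video picture feature could be computed:

```python
from PIL import Image

def dhash(image: Image.Image, hash_size: int = 8) -> int:
    # Shrink to (hash_size+1) x hash_size grayscale pixels, then set one
    # bit per horizontally adjacent pixel pair: 1 if left > right.
    gray = image.convert("L").resize((hash_size + 1, hash_size))
    pixels = list(gray.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits  # 64-bit integer fingerprint for hash_size = 8
```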
S2012: a video multimedia feature set of the plurality of sample video frames is obtained.
Thus, the video multimedia feature set of the historical video can be extracted, and the image features of the historical video can be represented through the video multimedia feature set of the historical video.
Step 202: the server takes out a history video to be processed from the history video set to obtain an updated history video set.
Specifically, the server extracts any one historical video from the historical video set to serve as a historical video to be processed, and obtains a historical video set composed of historical videos remaining after the historical video to be processed is removed.
Step 203: and the server determines the matching frame number between the to-be-processed historical video and each historical video in the updated historical video set according to the obtained video multimedia feature set.
Specifically, the server performs the following steps for each history video in the updated history video set, respectively:
s2031: and determining the similarity of video frames between each video picture feature in the video multimedia feature set of the historical video to be processed and each corresponding video picture feature in the video multimedia feature set of the historical video in sequence until the obtained similarity of the video frames is lower than a preset picture similarity threshold.
Wherein each video picture feature corresponds to one video frame. In determining video frame similarity, a similarity measure such as the Hamming distance or the Euclidean distance may be used, which is not limited herein.
When the similarity of two images is quantized by using the Hamming distance, the larger the Hamming distance is, the smaller the similarity of the images is, and the smaller the Hamming distance is, the larger the similarity of the images is.
In practical applications, the preset picture similarity threshold may be set according to the actual application scenario; optionally, it takes a value in [0,1].
When determining video frame similarity, if a video head is being detected, the video frames are compared in time order from front to back; if a video tail is being detected, they are compared from back to front.
S2032: determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as the matching frame number.
Thus, the number of matching frames, i.e., the degree of matching, between the video to be processed and each of the historical videos can be determined.
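Putting S2031 and S2032 together, a hedged Python sketch of computing the matching frame number between two lists of 64-bit frame hashes (the similarity threshold value is illustrative):

```python
def matching_frame_count(feats_a, feats_b, sim_threshold=0.9):
    # Count, from the start, consecutive frame pairs whose picture
    # similarity stays at or above the threshold (head detection);
    # pass reversed lists to detect a tail instead.
    count = 0
    for fa, fb in zip(feats_a, feats_b):
        # 64-bit hashes: similarity = 1 - Hamming distance / 64 (S2031).
        similarity = 1 - bin(fa ^ fb).count("1") / 64
        if similarity < sim_threshold:
            break  # stop at the first pair below the threshold
        count += 1
    return count  # the matching frame number (S2032)
```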
Step 204: and the server screens out the historical video which accords with the preset matching condition according to the obtained matching frame number.
Specifically, when executing step 204, the server may employ the following steps:
S2041: historical videos with the matching frame number being larger than a first preset frame number threshold and smaller than a second preset frame number threshold are screened out.
Wherein the first preset frame number threshold is determined based on a shortest length of the predicted video-independent content and the second preset frame number threshold is determined based on a maximum length of the predicted video-independent content. In practical application, the first preset frame number threshold and the second preset frame number threshold may be set according to a practical application scene, which is not limited herein.
S2042: and determining a corresponding matching time period according to the matching frame number corresponding to each screened historical video and the appointed sampling duration.
Specifically, for each screened historical video, a matching duration is determined as the product of the matching frame number and the specified sampling duration. If a video head is being detected, the matching time period is the first part of the video with that duration; if a video tail is being detected, the matching time period is counted backwards from the end of the video according to the matching duration.
S2043: and extracting corresponding audio features according to the matching time periods between the historical video to be processed and each screened historical video.
Specifically, the first audio features of the video content in the matching time period in the to-be-processed historical video and each screened historical video are respectively extracted, and the second audio features of the video content in the specified time period and outside the matching time period in the to-be-processed historical video and each screened historical video are respectively extracted.
That is, the following steps are performed for each screened historical video, respectively:
extracting first audio features of video contents in a matching time period from the screened historical video, and extracting first audio features of video contents in the matching time period from the historical video to be processed; and extracting second audio features of the video contents in the specified time period and matching outside the time period in the screened historical video, and extracting second audio features of the video contents in the specified time period and matching outside the time period in the historical video to be processed.
For example, if the matching time period is the first 8s of the video and the specified time period is the first 10s of the video, then the first audio feature of the video content within 0-8s of the video and the second audio feature of the video content within 8-10s of the video are extracted.
Alternatively, the audio features may be audio fingerprint features extracted by audio fingerprinting, etc., or may be extracted in other manners, which are not limited herein.
S2044: and determining the audio similarity between the to-be-processed historical video and the audio characteristics of each screened historical video.
Specifically, a first audio similarity between the to-be-processed historical video and the first audio feature of each screened historical video and a second audio similarity between the second audio features are respectively determined.
That is, the following steps are performed for each screened historical video, respectively:
determining a first audio similarity between the first audio feature of the historical video to be processed and the first audio feature of the screened historical video; and determining a second audio similarity between the second audio feature of the historical video to be processed and the second audio feature of the screened historical video.
S2045: and screening historical videos corresponding to the audio similarity meeting the preset audio similarity condition.
Specifically, historical videos whose first audio similarity is higher than the first preset audio threshold and whose second audio similarity is lower than the second preset audio threshold are screened out.
In practical applications, the first preset audio threshold and the second preset audio threshold may be set according to the actual application scenario, each taking a value in [0,1].
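A hedged sketch of this two-threshold audio screen; the candidate data shape and the threshold values are assumptions, since the patent leaves them configurable:

```python
def screen_by_audio(candidates, first_thresh=0.8, second_thresh=0.5):
    # Keep candidates whose matched segment sounds the same (first
    # similarity high) while the rest of the specified period differs
    # (second similarity low): a shared head/tail with a distinct body.
    return [c for c in candidates
            if c["first_audio_sim"] > first_thresh
            and c["second_audio_sim"] < second_thresh]
```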
Because videos such as resource videos and game videos generally have similar video pictures but different audio, repeated videos and videos with similar video-irrelevant content cannot be screened out by the matching degree of video picture features alone. In the embodiments of the present application, historical videos are therefore first coarsely screened by the matching degree of video picture features and then further precisely screened by the audio similarity of audio features, which filters out repeated videos and similar video-irrelevant content and improves the accuracy of video-irrelevant content detection.
It should be noted that if the matching degree of the video picture features and the audio similarity of the audio features are both higher than the corresponding specified repetition thresholds, the videos are judged to be repeated. If the matching degree of the video picture features is higher than the specified similar-picture threshold while the audio similarity of the audio features is lower than the specified similar-audio threshold, similar video-irrelevant content exists. If the picture matching degree and the audio similarity of the head or tail portion (i.e., the matching time period) are both high while those of the video body portion (within the specified time period and outside the matching time period) are both low, video-irrelevant content exists.
Step 205: and the server obtains a corresponding matched frame characteristic set according to the matching frame numbers of the historical video to be processed and each screened historical video.
Specifically, the server performs the following steps for each screened historical video:
extract, according to the corresponding matching frame number, the features of the matched sample video frames from the video multimedia feature set of the screened historical video, and compose a matching frame feature set from them.
Further, the maximum matching frame number among the matching frame numbers is determined, and the features of the corresponding sample video frames in the video multimedia feature set of the historical video to be processed are extracted according to it, so that the matching frame feature set of the historical video to be processed is obtained.
In this way, a video multimedia feature set of video irrelevant content, namely a matching frame feature set, of which the historical video to be processed is matched with each screened historical video respectively can be obtained.
Step 206: and the server determines a sample multimedia feature set corresponding to the historical video to be processed according to the obtained matching frame feature set.
Specifically, the value of each data bit of each video picture feature in the sample multimedia feature set is the mode of the corresponding data bit of the corresponding video picture feature in the plurality of matching frame feature sets.
Further, the sample multimedia feature set may further include audio features, that is, audio features corresponding to each video picture feature included in the sample multimedia feature set.
It should be noted that each video picture feature is a 64-bit hash value, i.e., it contains 64 data bits.
For example, the server obtains three matching frame feature sets, namely set 1, set 2 and set 3, respectively, where the first data bit of the video picture feature of the first sample video frame in set 1 is 0, the first data bit of the video picture feature of the first sample video frame in set 2 is 1, and the first data bit of the video picture feature of the first sample video frame in set 3 is 0, then the first data bit of the first video picture feature in the sample multimedia feature set is mode 0 in {0,1,0 }.
Thus, a sample multimedia feature set can be generated from each matching frame feature set.
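A Python sketch of this per-bit mode (majority vote) across several matching frame feature sets, consistent with the set 1/set 2/set 3 example above; alignment of the hashes across sets is assumed:

```python
def bitwise_mode(hashes):
    # Fuse aligned 64-bit frame hashes into one template hash: each
    # output bit is the most frequent value of that bit across inputs.
    out = 0
    for bit in range(63, -1, -1):
        ones = sum((h >> bit) & 1 for h in hashes)
        out = (out << 1) | (1 if ones * 2 > len(hashes) else 0)
    return out
```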
Step 207: the server determines whether the number of videos of the historical videos included in the updated historical video set is greater than one, if so, step 202 is executed, otherwise, step 208 is executed.
Step 208: the flow ends.
In the embodiment of the present application, only the sample multimedia feature set obtained from the target object is taken as an example for explanation, and according to a similar principle, the sample multimedia feature sets of a plurality of objects can be obtained, which is not described herein.
Thus, after the sample multimedia feature sets of the plurality of objects are obtained, the sample multimedia feature sets can be used for detecting the video to be detected uploaded by the target objects.
Referring to fig. 3, a flowchart of a method for detecting video-independent content according to the present application is shown. The method comprises the following specific processes:
step 300: the server acquires a plurality of video frames to be detected of the video to be detected of the target object.
Specifically, the server samples video frames of the video to be detected in a specified time period according to the specified sampling time length, and obtains each video frame to be detected.
In this way, a plurality of video frames to be detected in the video to be detected can be extracted.
Step 301: the server acquires a video multimedia feature set formed by video multimedia features of each plurality of video frames to be detected.
Specifically, the video multimedia features at least include video picture features, and then the server extracts the video picture features of each video frame to be detected respectively, and forms a video multimedia feature set from the acquired plurality of video picture features.
The video image features may be extracted by a hash algorithm, or may be obtained by other manners, which will not be described herein.
Further, the video multimedia features may further include audio features, and the server extracts audio features within a specified time period corresponding to the plurality of video frames to be detected.
Step 302: when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition, the server determines that video irrelevant content exists in the video to be detected.
Specifically, when executing step 302, the server may employ the following steps:
s3021: and determining the similarity of each video picture feature in the video multimedia feature set and each corresponding video picture feature in the sample multimedia feature set in sequence until the obtained similarity of the video frames is lower than a preset picture similarity threshold.
Specifically, the following steps may be adopted:
Step 1: take out one video picture feature from the video multimedia feature set in the specified time order.
Step 2: determine the video frame similarity between this video picture feature and the corresponding video picture feature in the sample multimedia feature set.
Step 3: judge whether the video frame similarity is not lower than the preset picture similarity threshold; if so, execute step 4, otherwise execute step 5.
Step 4: increment the count of matching video frame pairs by one and return to step 1.
Wherein the initial value of the count of matching video frame pairs is 0.
Step 5: stop the video frame similarity determination flow.
S3022: determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as the matching frame number.
In this way, the number of pairs of matched video frames, i.e. the number of matched frames, between the video multimedia feature set and the sample multimedia feature set can be determined according to the similarity of the video frames.
S3023: and when the matching frame number is larger than a first preset frame number threshold value, determining that video irrelevant contents exist in the video to be detected.
Therefore, whether the video to be detected contains video irrelevant content can be judged according to the matching frame number.
Further, after determining that the number of matching frames is greater than the first preset number of frames threshold, determining an audio similarity between an audio feature included in the video multimedia feature set and an audio feature included in the sample multimedia feature set, and determining that video irrelevant content exists in the video to be detected when the audio similarity is greater than the first preset audio threshold.
Therefore, whether the video to be detected contains video irrelevant content can be judged according to the matching frame number and the audio similarity.
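Combining S3021 to S3023 with the audio refinement above, a hedged decision sketch, reusing matching_frame_count from the earlier sketch; the threshold values are illustrative:

```python
def has_irrelevant_content(video_feats, sample_feats, audio_sim,
                           frame_thresh=3, audio_thresh=0.8):
    # S3021/S3022: count matching frame pairs against the sample set.
    matched = matching_frame_count(video_feats, sample_feats)
    # S3023 plus the optional audio check described above.
    return matched > frame_thresh and audio_sim > audio_thresh
```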
Further, the server may further determine the irrelevant content duration of the video irrelevant content included in the video to be detected by using the following ways:
The first way: when there is one sample multimedia feature set of the target object, the irrelevant content duration of the video-irrelevant content is determined according to the acquired matching frame number and the specified sampling duration.
In one embodiment, the product of the number of matching frames and the specified sampling duration is determined as the unrelated content duration of the video unrelated content.
The second mode is as follows: when the sample feature set of the target object is multiple, determining the maximum matching frame number in the obtained multiple matching frame numbers, and determining the irrelevant content duration of the video irrelevant content according to the maximum matching frame number and the appointed sampling duration.
In one embodiment, the product of the maximum number of matching frames and the specified sampling duration is determined as the irrelevant content duration of the video irrelevant content.
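For example, under the earlier settings (a specified sampling duration of 1 s), a maximum matching frame number of 8 yields an irrelevant content duration of 8 × 1 s = 8 s, i.e., the first 8 s of the video are treated as the video head.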
Further, the server can adjust corresponding recommendation weights according to detection results of video irrelevant content detection of the video to be detected, and display a video recommendation list to a user according to the recommendation weights of the videos.
For example, referring to fig. 4a, an illustration of a video recommendation list is shown, video 1 does not include video independent content, the corresponding recommendation weight is 1, video 2 includes video independent content, the corresponding recommendation weight is 0.5, and video 1 and video 2 are displayed in sequence according to the recommendation weights of video 1 and video 2.
It should be noted that fig. 4a is only used to show the ordering of the videos, and if the lines or text in fig. 4a are not clear, the clarity of the embodiment of the present application is not affected.
When determining the recommended weight of the video to be detected, the following modes can be adopted:
The first way: if the detection result is that no video-irrelevant content is contained, the recommendation weight of the video to be detected is set to a first preset weight; otherwise, it is set to a second preset weight.
The first preset weight and the second preset weight may be set according to the actual application scenario, which is not limited herein.
The second way: if the detection result is that no video-irrelevant content is contained, the recommendation weight of the video to be detected is set to the first preset weight; otherwise, the irrelevant content duration of the video to be detected is obtained, and the recommendation weight is set to a recommendation weight value corresponding to that duration.
In one embodiment, the server sets a corresponding recommendation weight value in advance for each irrelevant-content-duration interval.
In one embodiment, the server establishes a corresponding functional relationship between the irrelevant content duration and the recommended weight value in advance, and determines the corresponding recommended weight according to the functional relationship. For example, the functional relationship is a proportional relationship.
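As an illustration only, since the patent specifies neither the function nor its constants, a linear decay is one possible choice for such a relationship:

```python
def recommendation_weight(irrelevant_s: float, base: float = 1.0,
                          decay_per_s: float = 0.05) -> float:
    # A clean video keeps the base weight (cf. video 1 in FIG. 4a);
    # the weight falls with the irrelevant-content duration. The
    # linear form and constants are assumptions, not patent text.
    return max(0.0, base - decay_per_s * irrelevant_s)
```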
Further, the sample multimedia feature set is updated according to the obtained video multimedia feature set of the video independent content contained in the video to be detected.
Further, before playing a video, the server may also remove the video-irrelevant content from the video according to the irrelevant content duration interval, and then play the video with the irrelevant content removed.
Referring to FIG. 4b, an exemplary flow overview of video-irrelevant content detection is shown. In the embodiment of the present application, the combined flow of the sample multimedia feature set generation method and the video-irrelevant content detection method is as follows:
step 401: and the server determines the matching frame number between the historical videos according to the video multimedia feature set of the obtained historical videos.
Specifically, in order to facilitate understanding, two historical videos are taken as an example for illustration when executing steps 401-405, and a similar principle is adopted to process a plurality of historical videos, which is not described herein.
Step 402: the server determines whether the matching frame number meets the preset picture matching condition, if so, step 403 is executed, otherwise, step 407 is executed.
Specifically, the preset picture matching conditions are: the number of matching frames is greater than a first preset frame number threshold and less than a second preset frame number threshold.
Step 403: the server determines audio similarities between the audio features of the extracted historical video.
Specifically, step 403 is performed, and specific steps are described in the foregoing S2042 to S2044.
Step 404: the server judges whether the audio similarity meets the preset audio similarity condition.
Specifically, when step 404 is performed, specific steps are referred to above in S2045.
Step 405: the server determines a sample multimedia feature set from the historical video.
Specifically, when step 405 is performed, specific steps are referred to above in steps 205-206.
Step 406: and when receiving the video to be detected of the target object, detecting the video to be detected according to the sample multimedia feature set.
Step 407: the server obtains the detection result of the video irrelevant content.
In the embodiments of the present application, whether video-irrelevant content exists is judged according to the matching frame number, i.e., the matching degree, and the audio similarity between the historical video to be processed and each other historical video, and a sample multimedia feature set is generated from the video-irrelevant content thus obtained, so that newly added videos to be detected can be detected against this sample multimedia feature set.
Then, the sample multimedia feature set is acquired, and whether the video to be detected contains video-irrelevant content is judged according to the matching degree between the video multimedia feature set of the video to be detected and the sample multimedia feature set.
In the conventional manner, detection requires a plurality of videos of a television series and cannot be performed on a single short video, so the applicable range is small. In the embodiments of the present application, one or more videos of any form (such as television-series videos and short videos) can be detected, so the applicable range is wide and the detection efficiency is high. Determining the matching degree through the video picture features of video frames reduces the influence of factors such as video resolution and size on the detection result, and filtering out repeated videos and videos with similar video-irrelevant content through the audio similarity improves the robustness and accuracy of the detection result. Furthermore, the irrelevant content duration of the video-irrelevant content can be determined, so that the video-irrelevant content can be accurately located and, on that basis, the video recommendation order can be adjusted or the video-irrelevant content removed.
The actual detection effect of the embodiment of the application is as follows: the embodiment of the application is applied to the detection of the video irrelevant contents of a plurality of videos, the detection accuracy of the video irrelevant contents is 96%, the recall rate is 94.71%, and more than one million video irrelevant contents are identified and processed.
Based on the same inventive concept, an embodiment of the present application further provides an apparatus for detecting video-irrelevant content. Since the principle by which the apparatus and device solve the problem is similar to that of the method for detecting video-irrelevant content, the implementation of the apparatus may refer to the implementation of the method, and repeated description is omitted.
Fig. 5 is a schematic structural diagram of an apparatus for detecting video-irrelevant content according to an embodiment of the present application. An apparatus for detecting video-irrelevant content comprises:
an obtaining unit 501, configured to obtain a video frame to be detected from a video to be detected of a target object;
a composition unit 502, configured to obtain a video multimedia feature set composed of video multimedia features of each video frame to be detected;
a determining unit 503, configured to determine that video irrelevant content exists in the video to be detected when the matching degree of the video multimedia feature set and the sample multimedia feature set meets a set condition;
The sample multimedia feature set is obtained according to video irrelevant content contained in historical videos of the target object, and the video irrelevant content contained in the historical videos is determined according to the matching frame number of video frames among the historical videos and the audio similarity.
Preferably, the obtaining unit 501 is configured to:
sample video frames within a specified time period of the video to be detected according to a specified sampling duration, to obtain each video frame to be detected.
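For illustration only, the following is a minimal sketch of such sampling, assuming OpenCV is available; the period and interval defaults are placeholder assumptions, not values prescribed by the embodiment.

```python
# Illustrative sketch only: sample one frame every interval_s seconds within
# the first period_s seconds of a video. OpenCV (cv2) and the default values
# are assumptions for demonstration, not part of the claimed method.
import cv2

def sample_frames(video_path, period_s=90.0, interval_s=1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0        # fall back if FPS is unreadable
    step = max(1, int(round(fps * interval_s)))    # frame indices between samples
    last = int(fps * period_s)                     # last frame index to consider
    frames, idx = [], 0
    while idx <= last:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)      # jump to the next sample point
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        idx += step
    cap.release()
    return frames
```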
Preferably, the determining unit 503 is configured to:
sequentially determine the video frame similarity between each video picture feature in the video multimedia feature set and each corresponding video picture feature in the sample multimedia feature set, until an obtained video frame similarity is lower than a preset picture similarity threshold, where the video multimedia features at least include video picture features;
determine the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as the matching frame number;
and when the matching frame number is greater than a first preset frame number threshold, determine that video irrelevant content exists in the video to be detected.
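A hedged sketch of this matching loop follows; cosine similarity and both thresholds are illustrative stand-ins for whatever picture-feature metric and thresholds an implementation actually uses.

```python
# Illustrative sketch: compare aligned picture features pair by pair, stop at
# the first pair below the threshold, and flag irrelevant content when enough
# leading pairs match. The similarity metric and thresholds are assumptions.
import numpy as np

def count_matching_frames(video_feats, sample_feats, sim_threshold=0.85):
    matches = 0
    for v, s in zip(video_feats, sample_feats):
        sim = float(np.dot(v, s) /
                    (np.linalg.norm(v) * np.linalg.norm(s) + 1e-8))
        if sim < sim_threshold:        # first non-matching pair: stop comparing
            break
        matches += 1
    return matches

def has_irrelevant_content(video_feats, sample_feats, min_frames=5):
    return count_matching_frames(video_feats, sample_feats) > min_frames
```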
Preferably, the determining unit 503 is further configured to:
when the target object has one sample multimedia feature set, determine the irrelevant content duration of the video irrelevant content according to the matching frame number and the specified sampling duration;
when the target object has a plurality of sample multimedia feature sets, determine the maximum matching frame number among the obtained matching frame numbers, and determine the irrelevant content duration of the video irrelevant content according to the maximum matching frame number and the specified sampling duration.
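Since frames are sampled at a fixed interval, the duration is simply the (largest) matching frame count multiplied by that interval; a one-line sketch with assumed names follows.

```python
# Sketch: irrelevant-content duration from matching frame counts. With several
# sample feature sets, the largest count is used; all names are illustrative.
def irrelevant_duration(matching_frame_counts, interval_s=1.0):
    return max(matching_frame_counts) * interval_s

# e.g. matches of 12, 9 and 15 frames at 1 s sampling -> a 15 s head/tail
assert irrelevant_duration([12, 9, 15]) == 15.0
```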
Preferably, the determining unit 503 is further configured to:
acquiring a historical video set formed by a plurality of historical videos uploaded by the target object;
respectively obtaining a video multimedia feature set of each historical video;
taking out a historical video to be processed from the historical video set, and determining the matching frame number between the historical video to be processed and each historical video in the updated historical video set according to the video picture features contained in each video multimedia feature set;
screening out historical videos meeting a preset matching condition according to the obtained matching frame numbers;
obtaining a corresponding matching frame feature set according to the matching frame number between the historical video to be processed and each screened historical video;
and determining a sample multimedia feature set corresponding to the historical video to be processed according to the obtained matching frame feature sets, as sketched below.
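The loop just described might be sketched as follows. extract_features (per-frame picture features) is an assumed helper, count_matching_frames reuses the sketch above, and passes_audio_screen is sketched after the audio-screening steps below; none of this is the embodiment's literal code.

```python
# Loose sketch of building per-video sample feature sets from a user's uploads:
# pop one video, match it against the rest, screen by frame count and audio,
# and keep the features of the matched frames. Helpers are assumptions.
def build_sample_feature_sets(history_videos, interval_s=1.0, period_s=90.0,
                              min_frames=5, max_frames=60):
    feats = {v: extract_features(v) for v in history_videos}
    sample_sets = []
    pending = list(history_videos)          # the "historical video set"
    while len(pending) > 1:
        current = pending.pop(0)            # the historical video to be processed
        matched = []
        for other in pending:
            n = count_matching_frames(feats[current], feats[other])
            if min_frames < n < max_frames and \
                    passes_audio_screen(current, other, n * interval_s, period_s):
                matched.append(feats[current][:n])   # matching frame feature set
        if matched:
            sample_sets.append(max(matched, key=len))  # keep the longest match
    return sample_sets
```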
Preferably, the determining unit 503 is further configured to:
screening out historical videos whose matching frame number is greater than a first preset frame number threshold and smaller than a second preset frame number threshold;
determining a corresponding matching time period according to the matching frame number corresponding to each screened historical video and the specified sampling duration;
extracting corresponding audio features according to the matching time period between the historical video to be processed and each screened historical video;
determining the audio similarity between the audio features of the historical video to be processed and those of each screened historical video;
and screening out the historical videos whose audio similarity meets the preset audio similarity condition.
Preferably, the determining unit 503 is further configured to:
respectively extracting first audio features of the video within the matching time period from the historical video to be processed and from each screened historical video;
and respectively extracting second audio features of the video within the specified time period but outside the matching time period from the historical video to be processed and from each screened historical video.
Preferably, the determining unit 503 is further configured to:
and determining, between the historical video to be processed and each screened historical video, a first audio similarity of their first audio features and a second audio similarity of their second audio features.
Preferably, the determining unit 503 is further configured to:
screening out the historical videos whose first audio similarity is higher than a first preset audio threshold and whose second audio similarity is lower than a second preset audio threshold.
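The two-sided audio check can be sketched as below, assuming for simplicity that the matched period sits at the head of the specified time period; audio_embedding (e.g. a pooled MFCC vector for an audio span) and both thresholds are assumptions introduced for illustration.

```python
# Sketch of the two-sided audio screen: audio inside the matched time period
# should be near-identical (a shared head/tail), while audio outside it should
# differ (otherwise the two videos are mere duplicates). audio_embedding is an
# assumed helper returning a fixed-size vector for a [start, end) audio span.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def passes_audio_screen(video_a, video_b, match_end_s, period_end_s,
                        t_first=0.9, t_second=0.8):
    first_sim = cosine(audio_embedding(video_a, 0.0, match_end_s),
                       audio_embedding(video_b, 0.0, match_end_s))
    second_sim = cosine(audio_embedding(video_a, match_end_s, period_end_s),
                        audio_embedding(video_b, match_end_s, period_end_s))
    return first_sim > t_first and second_sim < t_second
```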
Preferably, the determining unit 503 is further configured to:
and when the number of historical videos contained in the updated historical video set is greater than one, performing again the step of taking out one historical video to be processed from the historical video set and determining the matching frame number between the historical video to be processed and each historical video in the updated historical video set.
Preferably, the video irrelevant content is any one or a combination of the following: a video head and a video tail.
In the method, device, equipment and medium for detecting video irrelevant content provided by the embodiment of the application, a sample multimedia feature set is generated in advance according to the matching frame numbers and the audio similarities among a plurality of historical videos of a target object. When a video to be detected of the target object is received, video frames to be detected are obtained from it, a video multimedia feature set formed by the video multimedia features of these frames is obtained, and when the matching degree between the video multimedia feature set and the sample multimedia feature set meets the set condition, it is determined that video irrelevant content exists in the video to be detected. Therefore, when the video irrelevant content is detected, the detection application range is enlarged, and the robustness and accuracy of the detection result are improved.
Fig. 6 shows a schematic structural diagram of a control device 6000. Referring to fig. 6, the control apparatus 6000 includes: a processor 6010, a memory 6020, a power supply 6030, a display unit 6040, and an input unit 6050.
The processor 6010 is a control center of the control apparatus 6000, connects respective components using various interfaces and lines, and performs various functions of the control apparatus 6000 by running or executing software programs and/or data stored in the memory 6020, thereby performing overall monitoring of the control apparatus 6000.
In an embodiment of the present application, the processor 6010 performs the method of video-independent content detection provided by the embodiment shown in fig. 3 when calling a computer program stored in the memory 6020.
Optionally, the processor 6010 may include one or more processing units; preferably, the processor 6010 may integrate an application processor, which mainly handles the operating system, user interface, applications, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor need not be integrated into the processor 6010. In some embodiments, the processor and the memory may be implemented on a single chip; in other embodiments, they may be implemented on separate chips.
The memory 6020 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, various applications, and the like, and the data storage area may store data created according to the use of the control device 6000, and the like. In addition, the memory 6020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The control device 6000 further includes a power supply 6030 (e.g., a battery) for supplying power to the respective components, which may be logically connected to the processor 6010 through a power management system, so that functions of managing charge, discharge, power consumption, etc. are performed through the power management system.
The display unit 6040 may be used to display information input by a user or provided to the user, as well as various menus of the control device 6000, and in the embodiment of the present application is mainly used to display the display interface of each application in the control device 6000 and objects such as text and pictures shown in that interface. The display unit 6040 may include a display panel 6041, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The input unit 6050 may be used to receive information such as numbers or characters input by a user. The input unit 6050 may include a touch panel 6051 and other input devices 6052. Wherein the touch panel 6051, also referred to as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on the touch panel 6051 or thereabout using any suitable object or accessory such as a finger, stylus, etc.).
Specifically, the touch panel 6051 may detect a touch operation by a user, detect a signal caused by the touch operation, convert the signal into a touch point coordinate, send the touch point coordinate to the processor 6010, and receive and execute a command sent from the processor 6010. In addition, the touch panel 6051 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. Other input devices 6052 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, on-off keys, etc.), a trackball, mouse, joystick, etc.
Of course, the touch panel 6051 may cover the display panel 6041, and when the touch panel 6051 detects a touch operation thereon or thereabout, the touch operation is transmitted to the processor 6010 to determine the type of the touch event, and then the processor 6010 provides a corresponding visual output on the display panel 6041 according to the type of the touch event. Although in fig. 6, the touch panel 6051 and the display panel 6041 are provided as two separate components to implement the input and output functions of the control device 6000, in some embodiments, the touch panel 6051 may be integrated with the display panel 6041 to implement the input and output functions of the control device 6000.
The control device 6000 may also include one or more sensors, such as pressure sensors, gravitational acceleration sensors, proximity sensors, and the like. Of course, the control device 6000 may also include other components such as cameras, as needed in a specific application, and these components are not shown in fig. 6 and will not be described in detail, since they are not the components that are important in the embodiments of the present application.
It will be appreciated by those skilled in the art that fig. 6 is merely an example of a control device and is not limiting of the control device, and may include more or fewer components than shown, or may combine certain components, or different components.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method of video-independent content detection in any of the above method embodiments.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a control device (which may be a personal computer, a server, or a network device, etc.) to execute the method of each embodiment or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A method of video-independent content detection, comprising:
obtaining a video frame to be detected from a video to be detected of a target object;
acquiring a video multimedia feature set formed by video multimedia features of each video frame to be detected;
when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition, determining that video irrelevant content exists in the video to be detected;
the sample multimedia feature set is obtained according to video irrelevant content contained in historical videos of the target object, and the video irrelevant content contained in the historical videos is determined according to matching frame numbers of video frames among the historical videos and audio similarity;
When the matching degree of the video multimedia feature set and the sample multimedia feature set meets a set condition, determining that video irrelevant content exists in the video to be detected includes: sequentially determining the similarity of video frames between each video picture feature in the video multimedia feature set and each corresponding video picture feature in the sample multimedia feature set until the obtained similarity of video frames is lower than a preset picture similarity threshold, wherein the video multimedia features at least comprise video picture features; determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as a matching frame number; and when the matching frame number is larger than a first preset frame number threshold value, determining that video irrelevant contents exist in the video to be detected.
2. The method of claim 1, wherein obtaining the video frame to be detected from the video to be detected of the target object comprises:
and sampling video frames in a specified time period of the video to be detected according to the specified sampling time length to obtain each video frame to be detected.
3. The method of claim 1, further comprising, after determining that video-independent content is present in the video to be detected:
When the sample multimedia feature set of the target object is one, determining the irrelevant content duration of the video irrelevant content according to the matching frame number and the appointed sampling duration;
and when the sample multimedia feature set of the target object is a plurality of, determining the maximum matching frame number in the obtained plurality of matching frame numbers, and determining the irrelevant content duration of the video irrelevant content according to the maximum matching frame number and the appointed sampling duration.
4. A method according to claim 1 or 3, wherein the sample set of multimedia features is determined according to the steps of:
acquiring a historical video set consisting of a plurality of historical videos uploaded by the target object;
respectively obtaining a video multimedia feature set of each historical video;
taking out a history video to be processed from a history video set, and determining the matching frame number between the history video to be processed and each history video in an updated history video set according to video picture characteristics contained in each video multimedia characteristic set;
screening historical videos meeting preset matching conditions according to the obtained matching frame numbers;
According to the matching frame numbers of the historical video to be processed and each screened historical video, a corresponding matching frame feature set is obtained;
and determining a sample multimedia feature set corresponding to the historical video to be processed according to the obtained multiple matching frame feature sets.
5. The method of claim 4, wherein screening historical video that meets a preset matching condition based on the obtained number of matching frames, further comprises:
screening historical videos with the matching frame number being larger than a first preset frame number threshold value and smaller than a second preset frame number threshold value;
determining a corresponding matching time period according to the matching frame number corresponding to each screened historical video and the appointed sampling duration;
extracting corresponding audio features according to the matching time periods matched between the historical video to be processed and each screened historical video;
determining the audio similarity between the to-be-processed historical video and the audio characteristics of each screened historical video respectively;
and screening historical videos corresponding to the audio similarity meeting the preset audio similarity condition.
6. The method of claim 5, wherein extracting the corresponding audio features based on matching time periods between the historical video to be processed and each filtered historical video, respectively, comprises:
Respectively extracting the to-be-processed historical video and first audio features of the video in the matching time period in each screened historical video;
and respectively extracting second audio features of the to-be-processed historical video and the video which is within a specified time period and is outside the matching time period of each screened historical video.
7. The method of claim 6, wherein determining the audio similarity between the historical video to be processed and the audio features of each filtered historical video, respectively, comprises:
and determining a first audio similarity between the to-be-processed historical video and the first audio feature and a second audio similarity between the to-be-processed historical video and the second audio feature of each screened historical video respectively.
8. The method of claim 7, wherein screening the historical video corresponding to the audio similarity that meets a predetermined audio similarity condition comprises:
historical videos with the first audio similarity higher than a first preset audio threshold and the second audio similarity lower than a second preset audio threshold are screened out.
9. The method of claim 4, further comprising, after determining a set of sample multimedia features corresponding to the historical video to be processed from the obtained plurality of sets of matching frame features:
And when the number of the videos of the historical videos contained in the updated historical video set is greater than one, executing the step of taking out one historical video to be processed from the historical video set and determining the matching frame number between the historical video to be processed and each historical video in the updated historical video set.
10. The method of claim 1 or 2, wherein the video-independent content is any one or a combination of the following: a head and a tail.
11. An apparatus for video-independent content detection, comprising:
the acquisition unit is used for acquiring a video frame to be detected from the video to be detected of the target object;
the composition unit is used for acquiring a video multimedia feature set composed of video multimedia features of each video frame to be detected;
the determining unit is used for determining that video irrelevant content exists in the video to be detected when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition;
the sample multimedia feature set is obtained according to video irrelevant content contained in historical videos of the target object, and the video irrelevant content contained in the historical videos is determined according to matching frame numbers of video frames among the historical videos and audio similarity;
Wherein the determining unit is configured to: sequentially determining the similarity of video frames between each video picture feature in the video multimedia feature set and each corresponding video picture feature in the sample multimedia feature set until the obtained similarity of video frames is lower than a preset picture similarity threshold, wherein the video multimedia features at least comprise video picture features; determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as a matching frame number; and when the matching frame number is larger than a first preset frame number threshold value, determining that video irrelevant contents exist in the video to be detected.
12. A control device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1-10 when the program is executed.
13. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any of claims 1-10.
CN201911159761.0A 2019-11-22 2019-11-22 Method, device, equipment and medium for detecting video irrelevant content Active CN111027419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911159761.0A CN111027419B (en) 2019-11-22 2019-11-22 Method, device, equipment and medium for detecting video irrelevant content


Publications (2)

Publication Number Publication Date
CN111027419A CN111027419A (en) 2020-04-17
CN111027419B true CN111027419B (en) 2023-10-20

Family

ID=70203354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911159761.0A Active CN111027419B (en) 2019-11-22 2019-11-22 Method, device, equipment and medium for detecting video irrelevant content

Country Status (1)

Country Link
CN (1) CN111027419B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101114A (en) * 2020-08-14 2020-12-18 中国科学院深圳先进技术研究院 Video target detection method, device, equipment and storage medium
CN112597321B (en) * 2021-03-05 2022-02-22 腾讯科技(深圳)有限公司 Multimedia processing method based on block chain and related equipment
CN114782879B (en) * 2022-06-20 2022-08-23 腾讯科技(深圳)有限公司 Video identification method and device, computer equipment and storage medium
CN115438223A (en) * 2022-09-01 2022-12-06 抖音视界有限公司 Video processing method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009150425A2 (en) * 2008-06-10 2009-12-17 Half Minute Media Ltd Automatic detection of repeating video sequences
CN102760169A (en) * 2012-06-13 2012-10-31 天脉聚源(北京)传媒科技有限公司 Method for detecting advertising slots in television direct transmission streams
CN102799605A (en) * 2012-05-02 2012-11-28 天脉聚源(北京)传媒科技有限公司 Method and system for monitoring advertisement broadcast
CN102890778A (en) * 2011-07-21 2013-01-23 北京新岸线网络技术有限公司 Content-based video detection method and device
CN104166685A (en) * 2014-07-24 2014-11-26 北京捷成世纪科技股份有限公司 Video clip detecting method and device
CN108924586A (en) * 2018-06-20 2018-11-30 北京奇艺世纪科技有限公司 A kind of detection method of video frame, device and electronic equipment
EP3477956A1 (en) * 2017-10-31 2019-05-01 Advanced Digital Broadcast S.A. System and method for automatic categorization of audio/video content

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10965965B2 (en) * 2015-03-06 2021-03-30 Arris Enterprises Llc Detecting of graphical objects to identify video demarcations


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Key Technologies of an Intelligent TV Advertisement Supervision System; Qin Guosun; Western Radio and Television (Xibu Guangbo Dianshi); full text *


Similar Documents

Publication Publication Date Title
CN111027419B (en) Method, device, equipment and medium for detecting video irrelevant content
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
WO2022022152A1 (en) Video clip positioning method and apparatus, and computer device and storage medium
CN111241340B (en) Video tag determining method, device, terminal and storage medium
JP2019212290A (en) Method and device for processing video
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
CN111209897B (en) Video processing method, device and storage medium
CN109726712A (en) Character recognition method, device and storage medium, server
CN112101329B (en) Video-based text recognition method, model training method and model training device
CN112149709A (en) Unsupervised classification of game play video using machine learning models
CN116484318B (en) Lecture training feedback method, lecture training feedback device and storage medium
CN111209970A (en) Video classification method and device, storage medium and server
CN109286848B (en) Terminal video information interaction method and device and storage medium
CN111491123A (en) Video background processing method and device and electronic equipment
CN113591530A (en) Video detection method and device, electronic equipment and storage medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN112995757B (en) Video clipping method and device
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN113992972A (en) Subtitle display method and device, electronic equipment and readable storage medium
CN110287912A (en) Method, apparatus and medium are determined based on the target object affective state of deep learning
CN115328303A (en) User interaction method and device, electronic equipment and computer-readable storage medium
CN110209880A (en) Video content retrieval method, Video content retrieval device and storage medium
CN112261321B (en) Subtitle processing method and device and electronic equipment
CN111026872B (en) Associated dictation method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40022460

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant