CN111027419A - Method, device, equipment and medium for detecting video irrelevant content


Info

Publication number
CN111027419A
CN111027419A (application number CN201911159761.0A)
Authority
CN
China
Prior art keywords
video
historical
matching
audio
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911159761.0A
Other languages
Chinese (zh)
Other versions
CN111027419B (en)
Inventor
万明阳
马连洋
衡阵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911159761.0A priority Critical patent/CN111027419B/en
Publication of CN111027419A publication Critical patent/CN111027419A/en
Application granted granted Critical
Publication of CN111027419B publication Critical patent/CN111027419B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; scene-specific elements
    • G06V20/40: Scenes; scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures

Abstract

The application belongs to the technical field of data processing, and mainly relates to computer vision and speech technologies in artificial intelligence. It discloses a method, device, equipment and medium for detecting video-irrelevant content. When video-irrelevant content is detected in this way, the applicable range of detection is expanded, and the robustness and accuracy of the detection result are improved.

Description

Method, device, equipment and medium for detecting video irrelevant content
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a medium for detecting video-independent content.
Background
With the development of internet technology and intelligent terminal technology, and the popularization of video shooting equipment and video editing technology, the number of released videos is increasing rapidly. Users usually publish a video after adding extra video-irrelevant content to the head or tail of the main body of the video; such content is typically a black screen, an advertisement, a self-media number, and the like.
In the prior art, video-irrelevant content is generally detected as follows: for a plurality of video files belonging to the same television series, video features are extracted from the head time region or the tail time region of each file, and video-irrelevant content is judged to exist when the video features of the files are similar.
However, this video-feature-based method only applies to episodes of a television series, and factors such as the resolution and size of the video strongly influence the detection result, so the method has a narrow application range, poor robustness, and poor detection accuracy.
Therefore, a detection scheme is urgently needed that can broaden the application range of video-irrelevant content detection and improve the robustness and accuracy of the detection result.
Disclosure of Invention
The embodiments of the present application provide a method, device, equipment and medium for detecting video-irrelevant content, which are used for expanding the application range of video-irrelevant content detection and improving the robustness and accuracy of the detection result.
In one aspect, a method for detecting video-independent content is provided, including:
acquiring a video frame to be detected from a video to be detected of a target object;
acquiring a video multimedia feature set consisting of video multimedia features of each video frame to be detected;
when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition, determining that video irrelevant content exists in the video to be detected;
the sample multimedia feature set is obtained according to video-irrelevant content contained in historical videos of the target object, and the video-irrelevant content contained in the historical videos is determined according to the number of matching video frames between the historical videos and the audio similarity between them.
In one aspect, an apparatus for video-independent content detection is provided, including:
the device comprises an acquisition unit, a composition unit and a determining unit, wherein the acquisition unit is used for acquiring video frames to be detected from a video to be detected of a target object;
the composition unit is used for acquiring a video multimedia feature set composed of video multimedia features of each video frame to be detected;
the determining unit is used for determining that video irrelevant content exists in the video to be detected when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition;
the sample multimedia feature set is obtained according to video-irrelevant content contained in historical videos of the target object, and the video-irrelevant content contained in the historical videos is determined according to the number of matching video frames between the historical videos and the audio similarity between them.
Preferably, the obtaining unit is configured to:
and according to the specified sampling duration, carrying out video frame sampling on the video in the specified time period in the video to be detected to obtain each video frame to be detected.
Preferably, the determination unit is configured to:
sequentially determining video frame similarity between each video picture feature in the video multimedia feature set and each corresponding video picture feature in the sample multimedia feature set until the obtained video frame similarity is lower than a preset picture similarity threshold, wherein the video multimedia features at least comprise video picture features;
determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as the matching frame number;
and when the matching frame number is larger than a first preset frame number threshold value, determining that video irrelevant content exists in the video to be detected.
Preferably, the determination unit is further configured to:
when the target object has a single sample multimedia feature set, determining the irrelevant-content duration of the video-irrelevant content according to the matching frame number and the specified sampling duration;
when the target object has multiple sample multimedia feature sets, determining the maximum matching frame number among the obtained matching frame numbers, and determining the irrelevant-content duration of the video-irrelevant content according to the maximum matching frame number and the specified sampling duration.
Preferably, the determination unit is further configured to:
acquiring a historical video set consisting of a plurality of historical videos uploaded by a target object;
respectively obtaining a video multimedia feature set of each historical video;
taking out a historical video to be processed from the historical video set, and determining the number of matching frames between the historical video to be processed and each historical video in the updated historical video set according to the video picture characteristics contained in each video multimedia characteristic set;
screening out historical videos meeting preset matching conditions according to the obtained matching frame number;
obtaining a corresponding matched frame feature set according to the matched frame number of the historical video to be processed and each filtered historical video;
and determining a sample multimedia feature set corresponding to the historical video to be processed according to the obtained multiple matched frame feature sets.
Preferably, the determination unit is further configured to:
screening out historical videos of which the matching frame number is greater than a first preset frame number threshold and less than a second preset frame number threshold;
determining a corresponding matching time period according to the matching frame number corresponding to each filtered historical video and the specified sampling duration;
extracting corresponding audio features according to the matching time periods matched between the historical video to be processed and each filtered historical video;
determining audio similarity between the historical video to be processed and the audio characteristics of each filtered historical video;
and screening out the historical videos corresponding to the audio similarity meeting the preset audio similarity condition.
Preferably, the determination unit is further configured to:
respectively extracting first audio features of videos in a matching time period in the historical videos to be processed and each filtered historical video;
and respectively extracting second audio features of the videos within a specified time period and outside the matched time period in the historical videos to be processed and each filtered historical video.
Preferably, the determination unit is further configured to:
and determining a first audio similarity between the historical video to be processed and the first audio feature of each filtered historical video and a second audio similarity between the second audio features.
Preferably, the determination unit is further configured to:
and screening out historical videos of which the first audio similarity is higher than a first preset audio threshold and the second audio similarity is lower than a second preset audio threshold.
Preferably, the determination unit is further configured to:
and when the number of the historical videos contained in the updated historical video set is more than one, a step of taking out a historical video to be processed from the historical video set and determining the number of matching frames between the historical video to be processed and each historical video in the updated historical video set is executed.
Preferably, the video-irrelevant content is any one or a combination of the following: a leader (opening segment) and a trailer (closing segment).
In one aspect, a control device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of any of the above-mentioned methods for detecting video-irrelevant content.
In one aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of any of the above-mentioned methods of video-independent content detection.
In the method, device, equipment and medium for detecting video-irrelevant content, a sample multimedia feature set is generated in advance according to the number of matching frames and the audio similarity among a plurality of historical videos of the target object. When a video to be detected of the target object is received, the video frames to be detected are obtained from it, a video multimedia feature set composed of the video multimedia features of those frames is obtained, and video-irrelevant content is determined to exist in the video to be detected when the matching degree between the video multimedia feature set and the sample multimedia feature set meets the set condition. Therefore, when detecting video-irrelevant content, the applicable range of detection is expanded, and the robustness and accuracy of the detection result are improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic diagram of a system architecture for video-independent content detection according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating an implementation of a method for generating a sample multimedia feature set according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart of video-independent content detection according to an embodiment of the present disclosure;
FIG. 4a is an illustration of a video recommendation list in an embodiment of the present application;
FIG. 4b is a flowchart illustrating an exemplary overview of a process for video-independent content detection according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for video-independent content detection according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a control device in an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and beneficial effects of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
First, some terms referred to in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.
Terminal equipment: an electronic device that can be mobile or fixed, on which various applications can be installed and which can display the objects provided in the installed applications. For example, a mobile phone, a tablet computer, various wearable devices, a vehicle-mounted device, a personal digital assistant (PDA), a point-of-sale (POS) terminal, or any other electronic device capable of implementing the above functions.
Application: an application program, i.e., a computer program that can perform one or more services; it typically has a visual display interface and can interact with a user. For example, electronic maps and WeChat are referred to as applications.
Artificial intelligence (AI) refers to theories, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to recognize, track and measure targets, and further performs image processing so that the processed image is better suited for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision research attempts to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of speech technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is regarded as one of the most promising human-computer interaction modes.
The perceptual hash algorithm is a generic name for a class of hash algorithms whose function is to generate a fingerprint string for each image; the similarity of two images is judged by comparing their fingerprints, and the closer the fingerprints, the more similar the images. Perceptual hash algorithms include the mean hash (aHash), the perceptual hash (pHash), and the difference hash (dHash). aHash is faster but less accurate; pHash is the opposite, more accurate but slower; dHash balances the two, offering both high accuracy and high speed.
Hamming distance: named after Richard Wesley Hamming. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ; in other words, it is the number of substitutions required to turn one string into the other. For example, the Hamming distance between 1011101 and 1001001 is 2, the Hamming distance between 2143896 and 2233796 is 3, and the Hamming distance between "toned" and "roses" is 3.
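By way of illustration only (not part of the claimed method), the two notions above combine as follows in a minimal Python sketch: a 64-bit difference hash (dHash) serves as the image fingerprint and the Hamming distance as the comparison. Pillow is assumed for image handling, and the 9x8 grayscale resize is the conventional dHash recipe rather than a parameter taken from this application.

from PIL import Image

def dhash(image: Image.Image) -> int:
    # 9x8 grayscale thumbnail: each of the 8 rows yields 8 left/right
    # brightness comparisons, giving a 64-bit fingerprint.
    gray = image.convert("L").resize((9, 8))
    pixels = list(gray.getdata())
    bits = 0
    for row in range(8):
        for col in range(8):
            left, right = pixels[row * 9 + col], pixels[row * 9 + col + 1]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    # Number of differing bits between two equal-length fingerprints.
    return bin(a ^ b).count("1")

For instance, hamming_distance(0b1011101, 0b1001001) returns 2, matching the first example above.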
Video-irrelevant content: extra parts added to a video that are irrelevant to its main content; removing them does not affect the integrity of the video content. It includes leaders and trailers, and is usually a black screen, advertisements, a self-media number, and the like.
The design concept of the embodiment of the present application is described below.
With the development of internet technology and intelligent terminal technology, and the popularization of video shooting equipment and video editing technology, more and more users shoot videos or short videos with intelligent terminals, or download various videos or short videos from the network, and then use video editing technology to add extra video-irrelevant content to the head, tail or other parts of the main body of the video.
The video irrelevant content is usually a black screen, an advertisement, a self-media number and the like, so as to achieve the purposes of advertising and promotion and the like.
Obviously, a video with video-independent content will greatly reduce the user experience. In order to improve user experience, the video publishing server usually performs video-independent content detection on a video uploaded by a user, reduces the recommendation weight of the video containing the video-independent content according to a detection result, and displays a video recommendation list according to the recommendation weight. Furthermore, according to the detection result, before the video is played, the video irrelevant content can be removed.
In the conventional technology, when detecting video-independent content, the following method is generally adopted:
the video characteristics of a head time region or a tail time region in each video file can be respectively extracted from a plurality of video files of the video of the television series, the obtained video characteristics of the plurality of video files are matched, and when the matching degree between the video characteristics of the video files meets the matching condition, the video irrelevant content is judged to exist.
However, in this way, only the video of the television series can be detected, the application range is small, and the influence of factors such as the resolution and the size of the video on the extracted video features is large, so that the robustness and the accuracy of the video detection result determined by the video features are poor.
Further, when it is determined that video-unrelated content exists, the duration of the video-unrelated content cannot be identified, and the repeated video, the video with similar beginning part, and the video with similar ending part cannot be filtered, so that the repeated video and the video with similar beginning or ending part of the video are easily identified as the video-unrelated content, and the accuracy of the video detection result is reduced.
Obviously, the conventional technology does not provide a technical solution with a wide application range and high robustness and accuracy of the detection result, and therefore, a technical solution for detecting the video-independent content is urgently needed to expand the detection application range and improve the robustness and accuracy of the detection result when detecting the video-independent content.
In view of the above analysis, the present application provides a technical solution for detecting video-irrelevant content, mainly involving computer vision and speech technologies in artificial intelligence. In this solution, video-irrelevant content is screened out in advance according to the number of matching video frames between a plurality of historical videos of the target object and the audio similarity between their audio features, and a corresponding sample multimedia feature set is generated from the video picture features of the screened-out video-irrelevant content.
When a video to be detected of the target object is received, the video frames to be detected are obtained from it, a video multimedia feature set composed of the video multimedia features of those frames is obtained, and video-irrelevant content is determined to exist in the video to be detected when the matching degree between the video multimedia feature set and the sample multimedia feature set meets the set condition.
To further illustrate the technical solutions provided by the embodiments of the present application, a detailed description is given below with reference to the accompanying drawings and specific embodiments. Although the embodiments provide method steps as shown in the following embodiments or figures, more or fewer steps may be included in the method based on conventional or non-inventive effort. In steps where no necessary causal relationship exists logically, the order of execution is not limited to that provided here; in an actual processing procedure or device, the steps may be executed sequentially or in parallel according to the methods shown in the embodiments or figures.
Referring to fig. 1, a schematic diagram of a system architecture for video-independent content detection is shown, the system including: a plurality of terminals 100 and a server 101. The terminal 100 and the server 101 are connected through a wired or wireless network.
The terminal 100 is installed with a video application, and is configured to upload a video to be detected submitted by a user to the server 101 through the video application, and is further configured to acquire a recommended video list through the server 101 in response to a video viewing operation, and display the recommended video list, and is further configured to play a video specified by the user in response to a video playing operation.
The server 101 is configured to: generate a corresponding sample multimedia feature set for existing video-irrelevant content according to the number of matching video frames among a plurality of historical videos of a target object and the audio similarity among their audio features; when a video to be detected of the target object uploaded by the terminal 100 is received, acquire the video frames to be detected from it, acquire a video multimedia feature set composed of the video multimedia features of each such frame, and determine that video-irrelevant content exists in the video to be detected when the matching degree between the video multimedia feature set and the sample multimedia feature set meets the set condition; and set the recommendation weight of each video to be recommended according to the detection result, and sort the videos to be recommended by recommendation weight to obtain a sorted recommended video list.
It should be noted that the video multimedia features at least include video picture features of video frames, and may also include audio features.
Considering that the videos uploaded by the same object generally use repeated video-independent contents, in the embodiment of the present application, a sample multimedia feature set is generated for the video-independent contents included in a plurality of historical videos of each target object, and a detection result of video-independent content detection is obtained by detecting the to-be-detected video of the target object through the sample multimedia feature set.
Optionally, the target object may be a user or a group, and usually one account information or a class of account information of the uploaded video is used as the target object, and in the actual application, the target object may be set according to an actual application scenario, which is not described herein again. The video can be in the form of a television show, a short video and the like.
The embodiment of the present application may be applied to scenarios in which video-irrelevant content detection is performed on one or more videos of various forms (e.g., television episodes and short videos); for ease of understanding, detection on short videos alone is taken as the example below.
In the embodiment of the application, one or more sample multimedia feature sets of a target object are generated in advance before video-independent content detection is performed on a video to be detected of the target object. Wherein the sample multimedia feature set is obtained from video-independent content contained in the historical video of the target object.
Referring to fig. 2, a flowchart of an implementation of a method for generating a sample multimedia feature set according to the present application is shown. The method comprises the following specific processes:
step 200: the server acquires a historical video set consisting of a plurality of historical videos uploaded by the target object.
Specifically, since video-irrelevant content is usually a leader or a trailer additionally added to the main body of a video, the videos uploaded by one user usually contain repeated video-irrelevant content. A sample multimedia feature set is therefore generated per target object from that object's historical videos, rather than from all historical videos, and the plurality of historical videos uploaded by the target object serve as the sample videos for generating the sample multimedia feature set.
Step 201: the server respectively obtains a video multimedia feature set of each historical video in the historical video set.
Specifically, the server executes the following steps for each historical video:
S2010: according to the specified sampling duration, sample video frames from the portion of the historical video that lies within the specified time period, obtaining a plurality of sample video frames.
In practical application, the specified sampling duration and the specified time period may be set correspondingly according to a practical application scenario, and are not described herein again.
For example, if the specified sampling time length is 1s and the specified time period is the first 10s of the beginning of the video, the server extracts 1 video frame every 1s in the first 10s of the video to obtain 10 video frames.
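As a minimal illustrative sketch of this sampling step (OpenCV is assumed here; the application does not prescribe a particular library), one frame is taken per sampling interval within the specified time period:

import cv2

def sample_frames(video_path: str, period_s: float = 10.0, interval_s: float = 1.0):
    # Extract one frame per interval_s seconds within the first period_s
    # seconds, e.g. 10 frames for the 1 s / 10 s example above.
    cap = cv2.VideoCapture(video_path)
    frames = []
    t = 0.0
    while t < period_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        t += interval_s
    cap.release()
    return frames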
S2011: and respectively extracting the video picture characteristics of each sample video frame.
In one embodiment, a hash algorithm is used to extract the video picture features of each sample video frame.
A video picture feature is a 64-bit hash value obtained by performing a hash calculation on the sample video frame.
Alternatively, the hash algorithm may be a perceptual hash algorithm, i.e., one of the class of algorithms described above that generate a fingerprint string for each image and judge image similarity by comparing fingerprints: the closer the fingerprints, the more similar the images.
S2012: a set of video multimedia features for the plurality of sample video frames is obtained.
Therefore, the video multimedia feature set of the historical video can be extracted, and the image features of the historical video can be represented through the video multimedia feature set of the historical video.
Step 202: and the server takes out a historical video to be processed from the historical video set to obtain an updated historical video set.
Specifically, the server extracts any one historical video from the historical video set as the historical video to be processed, and obtains the historical video set formed by the historical videos remaining after the historical video to be processed is removed.
Step 203: and the server determines the number of matching frames between the historical video to be processed and each historical video in the updated historical video set respectively according to the obtained video multimedia feature set.
Specifically, the server executes the following steps for each history video in the updated history video set respectively:
s2031: and sequentially determining the video frame similarity between each video picture feature in the video multimedia feature set of the historical video to be processed and each corresponding video picture feature in the video multimedia feature set of the historical video until the obtained video frame similarity is lower than a preset picture similarity threshold.
Wherein each video picture feature corresponds to a video frame. When determining the similarity of the video frames, a similarity algorithm such as a hamming distance or a euclidean distance may be used, which is not limited herein.
When the similarity of the two images is quantized by using the hamming distance, the similarity of the images is smaller when the hamming distance is larger, and the similarity of the images is larger when the hamming distance is smaller.
In practical application, the preset picture similarity threshold may be set according to the actual application scene; optionally, it lies in [0, 1].
When determining video frame similarity, the frames are compared in forward time order when detecting whether a video contains a leader, and in reverse time order when detecting whether it contains a trailer.
S2032: and determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as the matching frame number.
Thus, the matching frame number, namely the matching degree, between the video to be processed and each historical video can be determined.
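A minimal sketch of S2031 and S2032, under the assumptions of the hash example given earlier (64-bit fingerprints, Hamming distance mapped into a [0, 1] similarity); the mapping and the threshold value are illustrative choices, not values fixed by this application:

def count_matching_frames(features_a: list[int], features_b: list[int],
                          sim_threshold: float = 0.9) -> int:
    # Walk both fingerprint sequences in time order and count leading frame
    # pairs whose similarity stays at or above the threshold; the first pair
    # below the threshold ends the matched segment (S2031), and the count of
    # matched pairs is the matching frame number (S2032).
    matches = 0
    for fa, fb in zip(features_a, features_b):
        similarity = 1.0 - bin(fa ^ fb).count("1") / 64.0
        if similarity < sim_threshold:
            break
        matches += 1
    return matches

For trailer detection the same loop runs over both sequences reversed, consistent with the time-order note above.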
Step 204: and the server screens out the historical videos meeting the preset matching conditions according to the obtained matching frame number.
Specifically, when step 204 is executed, the server may adopt the following steps:
s2041: and screening out the historical videos of which the matching frame number is greater than a first preset frame number threshold and less than a second preset frame number threshold.
Wherein the first predetermined frame number threshold is determined according to a shortest length of the predicted video-independent content, and the second predetermined frame number threshold is determined according to a longest length of the predicted video-independent content. In practical application, the first preset frame number threshold and the second preset frame number threshold may be set according to a practical application scenario, which is not limited herein.
S2042: and determining a corresponding matching time period according to the matching frame number corresponding to each filtered historical video and the specified sampling duration.
Specifically, for each screened-out historical video, the matching duration is determined as the product of the matching frame number and the specified sampling duration; when detecting whether the video contains a leader, the matching duration itself is used as the matching time period, and when detecting whether the video contains a trailer, the matching time period is determined by counting the matching duration back from the end of the video.
S2043: and extracting corresponding audio features according to the matching time periods between the historical video to be processed and each filtered historical video.
Specifically, first audio features of video content within a matching time period in the historical video to be processed and each filtered historical video are respectively extracted, and second audio features of video content within a specified time period and outside the matching time period in the historical video to be processed and each filtered historical video are respectively extracted.
That is, the following steps are performed for each filtered historical video respectively:
extracting first audio features of video contents in a matching time period in the screened historical videos, and extracting first audio features of the video contents in the matching time period in the historical videos to be processed; and extracting second audio features of the video content within the specified time period and outside the matched time period in the screened historical video, and extracting second audio features of the video content within the specified time period and outside the matched time period in the historical video to be processed.
For example, if the matching time period is the first 8s of the video and the specified time period is the first 10s of the video, then the first audio features of the video content within 0-8s of the video and the second audio features of the video content within 8-10s of the video are extracted.
Optionally, the audio features may be audio fingerprint features extracted by an audio fingerprint technology, or may be extracted in other manners, which is not limited herein.
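The following hedged sketch of S2043, together with the similarity computation used in S2044 below, takes librosa with mean MFCC vectors purely for illustration; the application only requires some audio feature, such as an audio fingerprint, and the sketch assumes the audio track can be decoded from the media file:

import librosa
import numpy as np

def split_audio_features(media_path: str, match_end_s: float, period_s: float = 10.0):
    # First audio feature: the matched segment [0, match_end_s);
    # second audio feature: the remainder [match_end_s, period_s).
    audio, sr = librosa.load(media_path, sr=None, duration=period_s)
    cut = int(match_end_s * sr)
    first = librosa.feature.mfcc(y=audio[:cut], sr=sr).mean(axis=1)
    second = librosa.feature.mfcc(y=audio[cut:], sr=sr).mean(axis=1)
    return first, second

def audio_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity as one possible audio similarity measure.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))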
S2044: and determining the audio similarity between the historical video to be processed and the audio characteristics of each filtered historical video.
Specifically, a first audio similarity between the historical video to be processed and a first audio feature of each filtered historical video and a second audio similarity between second audio features are respectively determined.
That is, the following steps are performed for each filtered historical video respectively:
determining first audio similarity between first audio features of the historical video to be processed and first audio features of the filtered historical video; and determining second audio similarity between the second audio characteristics of the historical video to be processed and the second audio characteristics of the filtered historical video.
S2045: and screening out the historical videos corresponding to the audio similarity meeting the preset audio similarity condition.
Specifically, the historical videos are screened out for which the first audio similarity is higher than the first preset audio threshold and the second audio similarity is lower than the second preset audio threshold.
In practical application, both the first preset audio threshold and the second preset audio threshold may be set according to a practical application scenario, and the ranges may be [0, 1], which is not described herein again.
For resource videos, game videos and the like, the video pictures are often similar while the audio differs, so repeated videos and videos with similar video-irrelevant content cannot be screened out by the matching degree of the video picture features alone. In the embodiment of the present application, the historical videos are first screened by the matching degree of the video picture features, and then repeated videos and similar video-irrelevant content are further screened out accurately by the audio similarity of the audio features, improving the accuracy of the detected video-irrelevant content.
It should be noted that if both the matching degree of the video picture features and the audio similarity of the audio features are higher than the corresponding specified repetition thresholds, the videos are judged to be repeated. If the matching degree of the video picture features is higher than the specified similar-picture threshold while the audio similarity of the audio features is lower than the specified similar-audio threshold, similar video-irrelevant content exists. If the matching degree and the audio similarity are both high for the head or tail part (i.e., the matching time period) but both low for the main part (within the specified time period and outside the matching time period), video-irrelevant content exists.
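These three rules can be summarized in the following illustrative sketch, in which every threshold value is an assumption for illustration rather than a value specified by this application:

def classify(pic_match_head: float, audio_sim_head: float,
             pic_match_body: float, audio_sim_body: float,
             repeat_thr: float = 0.95, pic_thr: float = 0.9,
             audio_thr: float = 0.6) -> str:
    # Rule 1: pictures and audio both repeat throughout -> repeated video.
    if min(pic_match_head, audio_sim_head, pic_match_body, audio_sim_body) >= repeat_thr:
        return "repeated video"
    # Rule 2: pictures match while audio does not -> similar video-irrelevant content.
    if pic_match_head >= pic_thr and audio_sim_head < audio_thr:
        return "similar video-irrelevant content"
    # Rule 3: head/tail matches in picture and audio while the main body does not.
    if (pic_match_head >= pic_thr and audio_sim_head >= audio_thr
            and pic_match_body < pic_thr and audio_sim_body < audio_thr):
        return "video-irrelevant content exists"
    return "no shared content"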
Step 205: and the server obtains a corresponding matched frame feature set according to the matched frame number of the historical video to be processed and each filtered historical video.
Specifically, the server executes the following steps for each filtered historical video respectively:
and extracting a plurality of matched sample video frames from the video multimedia feature set corresponding to the screened historical video according to the corresponding matched frame number, and forming a matched frame feature set according to the plurality of sample video frames.
And further, determining the maximum matching frame number in all the matching frame numbers, and extracting sample video frames in the video multimedia feature set of the historical video to be processed according to the maximum matching frame number to obtain a corresponding matching frame feature set.
In this way, the matched frame feature sets, i.e., the video multimedia feature sets of the video-irrelevant content matched between the historical video to be processed and each screened historical video, can be obtained.
Step 206: and the server determines a sample multimedia feature set corresponding to the historical video to be processed according to the obtained matched frame feature set.
Specifically, the value of each data bit of each video picture feature in the sample multimedia feature set is a mode of a corresponding data bit of a corresponding video picture feature in the plurality of matching frame feature sets.
Further, the sample multimedia feature set may further include audio features, that is, audio features corresponding to the video picture features included in the sample multimedia feature set.
It should be noted that each video picture feature is a 64-bit hash value, that is, it contains 64 data bits.
For example, the server obtains three matched frame feature sets, set 1, set 2 and set 3. If the first data bit of the video picture feature of the first sample video frame is 0 in set 1, 1 in set 2, and 0 in set 3, then the first data bit of the first video picture feature in the sample multimedia feature set is the mode of {0, 1, 0}, namely 0.
Thus, a sample multimedia feature set can be generated according to each matched frame feature set.
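A minimal sketch of this per-bit mode aggregation (64-bit fingerprints assumed, as above); breaking an even-count tie toward 0, as the strict-majority test below does, is one possible convention:

def bitwise_mode(fingerprints: list[int], bits: int = 64) -> int:
    # Each output bit is the most common value of that bit position across
    # all input fingerprints (step 206).
    result = 0
    for pos in range(bits - 1, -1, -1):
        ones = sum((fp >> pos) & 1 for fp in fingerprints)
        result = (result << 1) | (1 if 2 * ones > len(fingerprints) else 0)
    return result

Applied per frame position across set 1, set 2 and set 3 of the example, the first-bit values {0, 1, 0} yield the mode 0.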
Step 207: the server determines whether the number of videos of the historical videos included in the updated historical video set is greater than one, if so, step 202 is executed, otherwise, step 208 is executed.
Step 208: the flow ends.
In the embodiment of the present application, only the sample multimedia feature set of the target object is taken as an example for description, and according to a similar principle, the sample multimedia feature sets of a plurality of objects can be obtained, which is not described herein again.
Therefore, after the sample multimedia feature sets of the multiple objects are obtained, the sample multimedia feature sets can be used for detecting the video to be detected uploaded by the target object.
Referring to fig. 3, a flowchart of an implementation of a method for detecting video-independent content according to the present application is shown. The method comprises the following specific processes:
step 300: the server acquires a plurality of video frames to be detected of a video to be detected of the target object.
Specifically, the server performs video frame sampling, according to the specified sampling duration, on the portion of the video to be detected that lies within the specified time period, to obtain each video frame to be detected.
Thus, a plurality of video frames to be detected in the video to be detected can be extracted.
Step 301: the server acquires a video multimedia feature set consisting of video multimedia features of a plurality of video frames to be detected.
Specifically, the video multimedia features at least include video picture features, and the server extracts the video picture features of each video frame to be detected respectively and forms a video multimedia feature set by using the obtained video picture features.
The video picture features may be extracted by using a hash algorithm, or may be obtained by using other methods, which are not described herein again.
Further, the video multimedia features can also include audio features, and the server extracts the audio features in the specified time period corresponding to the video frames to be detected.
Step 302: and when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition, the server determines that video irrelevant content exists in the video to be detected.
Specifically, when step 302 is executed, the server may adopt the following steps:
s3021: and sequentially determining the video frame similarity between each video picture feature in the video multimedia feature set and each corresponding video picture feature in the sample multimedia feature set until the obtained video frame similarity is lower than a preset picture similarity threshold.
Specifically, the following steps may be employed:
Step 1: take out one video picture feature from the video multimedia feature set in the specified time order.
Step 2: determine the video frame similarity between that video picture feature and the corresponding video picture feature in the sample multimedia feature set.
Step 3: judge whether the video frame similarity is not lower than the preset picture similarity threshold; if so, execute step 4, otherwise execute step 5.
Step 4: increment the count of matched video frame pairs by one and return to step 1, where the initial value of the count is 0.
Step 5: stop the video frame similarity determination process.
S3022: and determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as the matching frame number.
Therefore, the video frame number of the video frame pairs matched between the video multimedia feature set and the sample multimedia feature set, namely the matching frame number, can be determined according to the video frame similarity.
S3023: and when the matching frame number is larger than a first preset frame number threshold value, determining that video irrelevant content exists in the video to be detected.
Therefore, whether the video to be detected contains video irrelevant content can be judged according to the number of the matched frames.
Further, after the matching frame number is determined to be larger than the first preset frame number threshold, the audio similarity between the audio features contained in the video multimedia feature set and those contained in the sample multimedia feature set can also be determined, and when this audio similarity is higher than the first preset audio threshold, it is determined that video-irrelevant content exists in the video to be detected.
Therefore, whether the video to be detected contains video irrelevant content can be judged according to the matching frame number and the audio similarity.
Further, the server may also determine the duration of the irrelevant content of the video irrelevant content contained in the video to be detected by using the following methods:
the first mode is as follows: and when the sample multimedia feature of the target object is one, determining the irrelevant content time length of the video irrelevant content according to the acquired matching frame number and the specified sampling time length.
In one embodiment, the product of the number of matching frames and the specified sample duration is determined as the irrelevant content duration of the video irrelevant content.
The second way is: when the target object has multiple sample multimedia feature sets, the maximum matching frame number among the obtained matching frame numbers is determined, and the irrelevant-content duration of the video-irrelevant content is determined according to the maximum matching frame number and the specified sampling duration.
In one embodiment, the product of the maximum matching frame number and the specified sample duration is determined as the irrelevant content duration of the video irrelevant content.
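Both modes reduce to a single product, as in this one-line sketch (the sampling duration is assumed to be in seconds):

def irrelevant_duration(match_counts: list[int], sampling_interval_s: float = 1.0) -> float:
    # Single sample set: pass one count; multiple sets: the maximum count wins.
    return max(match_counts) * sampling_interval_s

For example, irrelevant_duration([6, 8]) returns 8.0, i.e., an 8-second leader or trailer.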
Furthermore, the server can adjust corresponding recommendation weights according to the detection result of the video irrelevant content detection of the video to be detected, and display a video recommendation list to the user according to the recommendation weights of the videos.
For example, referring to fig. 4a, which is an illustration of a video recommendation list, video 1 does not include video-independent content, the corresponding recommendation weight is 1, video 2 includes video-independent content, and the corresponding recommendation weight is 0.5, and video 1 and video 2 are displayed in order according to the recommendation weights of video 1 and video 2.
It should be noted that fig. 4a is only intended to show the ordering of the videos; any unclear lines or text in fig. 4a do not affect the clarity of the embodiment of the present application.
When determining the recommended weight of the video to be detected, the following modes can be adopted:
the first mode is as follows: and if the detection result is that the video irrelevant content is not contained, setting the recommendation weight of the video to be detected as a first preset weight, otherwise, setting the recommendation weight of the video to be detected as a second preset recommendation weight.
The first preset weight and the second preset weight may be set according to an actual application scenario, and are not limited herein.
The second way is: if the detection result is that no video-irrelevant content is contained, the recommendation weight of the video to be detected is set to the first preset weight; otherwise, the irrelevant-content duration of the video to be detected is acquired, and the recommendation weight is set to the weight value configured for that duration.
In one embodiment, the server sets a corresponding recommended weight value for each irrelevant content duration interval in advance.
In one embodiment, the server establishes a corresponding functional relationship between the irrelevant content duration and the recommended weight value in advance, and determines the corresponding recommended weight according to the functional relationship. For example, the functional relationship is a proportional relationship.
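One illustrative functional relationship of the kind just described, with all constants assumed for illustration: the longer the irrelevant content, the lower the recommendation weight, floored so that a video is never suppressed entirely:

def recommendation_weight(irrelevant_s: float, full_weight: float = 1.0,
                          max_penalized_s: float = 10.0, floor: float = 0.3) -> float:
    # Linear penalty proportional to the irrelevant-content duration.
    penalty = min(irrelevant_s / max_penalized_s, 1.0)
    return max(floor, full_weight * (1.0 - penalty))

With these assumed constants, recommendation_weight(0.0) returns 1.0 and recommendation_weight(5.0) returns 0.5, consistent with the weights of video 1 and video 2 in the example of fig. 4a.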
And further, updating the sample multimedia feature set according to the obtained video multimedia feature set of the video irrelevant content contained in the video to be detected.
Further, before the video is played, the server can also remove the video irrelevant content from the video according to the duration interval of the irrelevant content, and then play the video after the video irrelevant content is removed.
Fig. 4b is a schematic diagram illustrating an exemplary process of video-independent content detection. In the embodiment of the present application, a process of combining the sample multimedia feature set generation method and the video irrelevant content detection method is as follows:
step 401: and the server determines the number of matching frames among the historical videos according to the video multimedia feature set of the obtained historical videos.
Specifically, for convenience of understanding, when the steps 401 to 405 are performed, two historical videos are taken as an example for description, and a plurality of historical videos can be processed by using a similar principle, which is not described herein again.
Step 402: the server determines whether the matching frame number meets a preset picture matching condition, if so, executes step 403, otherwise, executes step 407.
Specifically, the preset picture matching conditions are as follows: the matching frame number is larger than a first preset frame number threshold and smaller than a second preset frame number threshold.
Step 403: the server determines an audio similarity between audio features of the extracted historical video.
Specifically, when step 403 is executed, the specific steps are as described in S2042 to S2044.
Step 404: the server judges whether the audio similarity accords with a preset audio similarity condition.
Specifically, when step 404 is executed, the specific steps are referred to as S2045.
Step 405: and the server determines a sample multimedia feature set according to the historical video.
Specifically, when step 405 is executed, the specific steps refer to step 205 to step 206.
Step 406: and when the video to be detected of the target object is received, detecting the video to be detected according to the sample multimedia feature set.
Step 407: the server obtains the detection result of the video irrelevant content.
In the embodiment of the application, whether video-irrelevant content exists in the historical video to be processed and each other historical video is judged according to the matching frame number (i.e., the matching degree) between them and the audio similarity, and a sample multimedia feature set is generated from the multiple pieces of video-irrelevant content thus obtained, so that a newly added video to be detected can be checked against the sample multimedia feature set.
And then, acquiring a sample multimedia feature set, and judging whether the video to be detected contains video irrelevant content according to the matching degree of the video multimedia feature set of the video to be detected and the sample multimedia feature set.
In the traditional mode, detection requires a plurality of videos of a television series and cannot be performed on a single short video, so the application range is small. In the embodiment of the present application, one or more videos of any form (e.g., television videos and short videos) can be detected, so the application range is wide and the detection efficiency is high. Because the matching degree is determined from the video picture features of the video frames, the influence of factors such as video resolution and size on the detection result is reduced; and because repeated videos and videos with similar video-irrelevant content are filtered out by the audio similarity, the robustness and accuracy of the detection result are improved. Further, the irrelevant-content duration of the video-irrelevant content can be determined so as to locate it accurately, allowing the video recommendation order to be adjusted or the video-irrelevant content to be removed accordingly.
The practical detection effect of the embodiment of the present application is as follows: applied to video-irrelevant content detection over a large number of videos, it achieves a detection accuracy of 96% and a recall of 94.71%, and more than one million instances of video-irrelevant content have been identified and processed.
Based on the same inventive concept, the embodiment of the present application further provides a device for detecting video-unrelated content, and because the principle of the device and the apparatus for solving the problem is similar to that of a method for detecting video-unrelated content, the implementation of the device can refer to the implementation of the method, and repeated details are omitted.
Fig. 5 is a schematic structural diagram of an apparatus for detecting video-unrelated content according to an embodiment of the present application. The apparatus for video-unrelated content detection comprises:
an obtaining unit 501, configured to obtain a video frame to be detected from a video to be detected of a target object;
a composition unit 502, configured to obtain a video multimedia feature set composed of video multimedia features of each to-be-detected video frame;
a determining unit 503, configured to determine that video-unrelated content exists in the video to be detected when a matching degree between the video multimedia feature set and the sample multimedia feature set meets a set condition;
the sample multimedia feature set is obtained according to video irrelevant content contained in historical videos of the target object, and the video irrelevant content contained in the historical videos is determined according to the number of matching frames of video frames among the historical videos and the audio similarity.
Preferably, the obtaining unit 501 is configured to:
and according to the specified sampling duration, carrying out video frame sampling on the video in the specified time period in the video to be detected to obtain each video frame to be detected.
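As a sketch of this sampling behavior (an assumption, since the application fixes neither the interval nor the period), the following uses OpenCV to take one frame every sample_interval_s seconds, but only within the specified time period, for example the first 60 seconds where a leader would sit:

    # Sketch of sampling video frames within a specified time period at a
    # specified sampling interval; the default values are illustrative.
    import cv2

    def sample_frames(video_path: str, period_s: float = 60.0,
                      sample_interval_s: float = 1.0):
        cap = cv2.VideoCapture(video_path)
        frames, t = [], 0.0
        while t < period_s:
            cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)  # seek to timestamp
            ok, frame = cap.read()
            if not ok:                 # video shorter than the period
                break
            frames.append(frame)
            t += sample_interval_s
        cap.release()
        return frames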
Preferably, the determining unit 503 is configured to:
sequentially determining video frame similarity between each video picture feature in the video multimedia feature set and each corresponding video picture feature in the sample multimedia feature set until the obtained video frame similarity is lower than a preset picture similarity threshold, wherein the video multimedia features at least comprise video picture features;
determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as the matching frame number;
and when the matching frame number is larger than a first preset frame number threshold value, determining that video irrelevant content exists in the video to be detected.
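The sequential matching just described can be sketched as follows; the similarity function and both thresholds are illustrative assumptions, and similarity could, for example, be 1 - hamming_distance(a, b) / 64 over 64-bit frame hashes:

    # Sketch of the determining unit's matching logic: compare frame features
    # pair by pair from the start, stop at the first dissimilar pair, and use
    # the count of matched pairs as the matching frame number.
    from typing import Callable, Sequence

    PIC_SIM_THRESHOLD = 0.9          # preset picture similarity threshold
    FIRST_FRAME_THRESHOLD = 3        # first preset frame number threshold

    def matching_frame_count(query: Sequence, sample: Sequence,
                             similarity: Callable[[object, object], float]) -> int:
        count = 0
        for q, s in zip(query, sample):
            if similarity(q, s) < PIC_SIM_THRESHOLD:
                break                # stop at the first pair below threshold
            count += 1
        return count

    def has_unrelated_content(query, sample, similarity) -> bool:
        return matching_frame_count(query, sample, similarity) > FIRST_FRAME_THRESHOLD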
Preferably, the determining unit 503 is further configured to:
when the sample multimedia feature set of the target object is one, determining the irrelevant content time length of the video irrelevant content according to the matching frame number and the specified sampling time length;
when the sample multimedia feature sets of the target object are multiple, determining the maximum matching frame number among the multiple obtained matching frame numbers, and determining the irrelevant content time length of the video irrelevant content according to the maximum matching frame number and the specified sampling time length.
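The duration computation itself is just the matching frame number (the maximum one, when there are several sample multimedia feature sets) multiplied by the sampling interval. A worked sketch with assumed values:

    # With a 1-second sampling interval and matching frame numbers of
    # [12, 15, 9] against three sample sets, the leader duration is 15 s.
    def unrelated_content_duration(match_counts: list, sample_interval_s: float) -> float:
        return max(match_counts) * sample_interval_s

    assert unrelated_content_duration([12, 15, 9], 1.0) == 15.0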
Preferably, the determining unit 503 is further configured to:
acquiring a historical video set consisting of a plurality of historical videos uploaded by a target object;
respectively obtaining a video multimedia feature set of each historical video;
taking out a historical video to be processed from the historical video set, and determining the number of matching frames between the historical video to be processed and each historical video in the updated historical video set according to the video picture characteristics contained in each video multimedia characteristic set;
screening out historical videos meeting preset matching conditions according to the obtained matching frame number;
obtaining a corresponding matched frame feature set according to the matched frame number of the historical video to be processed and each filtered historical video;
and determining a sample multimedia feature set corresponding to the historical video to be processed according to the obtained multiple matched frame feature sets.
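The overall construction loop can be sketched as below. This is an assumption-laden outline, not the claimed procedure: historical videos are represented by precomputed lists of frame features, count_matches stands for the picture matching sketched above, and passes_screen stands for the frame-count and audio screening described next.

    # Sketch of building sample feature sets from one uploader's historical
    # videos: repeatedly take out one video, match it against the remaining
    # (updated) set, and keep the longest matched-prefix feature set.
    def build_sample_sets(histories: dict, count_matches, passes_screen) -> list:
        pending = dict(histories)           # video id -> list of frame features
        sample_sets = []
        while len(pending) > 1:
            vid, feats = pending.popitem()  # the to-be-processed historical video
            matched = []
            for other_id, other_feats in pending.items():
                n = count_matches(feats, other_feats)
                if passes_screen(n, vid, other_id):
                    matched.append(feats[:n])   # matched-frame feature set
            if matched:
                sample_sets.append(max(matched, key=len))
        return sample_sets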
Preferably, the determining unit 503 is further configured to:
screening out historical videos of which the matching frame number is greater than a first preset frame number threshold and less than a second preset frame number threshold;
determining a corresponding matching time period according to the matching frame number corresponding to each filtered historical video and the specified sampling duration;
extracting corresponding audio features according to the matching time periods matched between the historical video to be processed and each filtered historical video;
determining audio similarity between the historical video to be processed and the audio characteristics of each filtered historical video;
and screening out the historical videos corresponding to the audio similarity meeting the preset audio similarity condition.
Preferably, the determining unit 503 is further configured to:
respectively extracting first audio features of videos in a matching time period in the historical videos to be processed and each filtered historical video;
and respectively extracting second audio features of the videos within a specified time period and outside the matched time period in the historical videos to be processed and each filtered historical video.
Preferably, the determining unit 503 is further configured to:
and determining a first audio similarity between the historical video to be processed and the first audio feature of each filtered historical video and a second audio similarity between the second audio features.
Preferably, the determining unit 503 is further configured to:
and screening out historical videos of which the first audio similarity is higher than a first preset audio threshold and the second audio similarity is lower than a second preset audio threshold.
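This two-sided screen encodes the idea that videos sharing a leader sound alike inside the matching time period but differ outside it, whereas a duplicate upload is similar in both, so it gets filtered out. A sketch of the rule, with assumed thresholds:

    # Sketch of the dual audio screen: keep a historical video only if the
    # matched segments sound alike (first similarity high) while the rest of
    # the specified period does not (second similarity low). Thresholds are
    # illustrative assumptions.
    FIRST_AUDIO_THRESHOLD = 0.9   # inside the matching time period
    SECOND_AUDIO_THRESHOLD = 0.7  # outside it, within the specified period

    def passes_audio_screen(first_sim: float, second_sim: float) -> bool:
        return first_sim > FIRST_AUDIO_THRESHOLD and second_sim < SECOND_AUDIO_THRESHOLD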
Preferably, the determining unit 503 is further configured to:
and when the number of historical videos contained in the updated historical video set is more than one, executing again the step of taking out a historical video to be processed from the historical video set and determining the number of matching frames between the historical video to be processed and each historical video in the updated historical video set.
Preferably, the video-unrelated content is any one or a combination of the following: a leader (opening segment) and a trailer (ending segment).
In the method, apparatus, device, and medium for detecting video-unrelated content, a sample multimedia feature set is generated in advance according to the number of matching frames and the audio similarity among multiple historical videos of the target object. When a video to be detected of the target object is received, video frames to be detected are obtained from it, and a video multimedia feature set composed of the video multimedia features of those frames is assembled; when the matching degree between the video multimedia feature set and the sample multimedia feature set meets the set condition, it is determined that video-unrelated content exists in the video to be detected. Therefore, when detecting video-unrelated content, the application range of detection is expanded, and the robustness and accuracy of the detection result are improved.
Fig. 6 shows a schematic configuration of a control device 6000. Referring to fig. 6, the control device 6000 includes: processor 6010, memory 6020, power supply 6030, display unit 6040, and input unit 6050.
The processor 6010 is a control center of the control apparatus 6000, connects various components using various interfaces and lines, and performs various functions of the control apparatus 6000 by running or executing software programs and/or data stored in the memory 6020, thereby performing overall monitoring of the control apparatus 6000.
In the embodiment of the present application, the processor 6010, when invoking the computer program stored in the memory 6020, executes the method of video irrelevant content detection as provided in the embodiment shown in fig. 3.
Alternatively, processor 6010 may include one or more processing units; preferably, processor 6010 may integrate an application processor, which mainly handles the operating system, user interfaces, applications, and the like, and a modem processor, which mainly handles wireless communication. It is to be appreciated that the modem processor may also not be integrated into processor 6010. In some embodiments, the processor and the memory may be implemented on a single chip, or they may be implemented separately on their own chips.
The memory 6020 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, various applications, and the like; the data storage area may store data created according to the use of the control device 6000, and the like. In addition, the memory 6020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The control device 6000 further includes a power supply 6030 (e.g., a battery) for powering the various components. The power supply may be logically coupled to the processor 6010 through a power management system, so that charging, discharging, and power consumption are managed through the power management system.
The display unit 6040 may be used to display information input by a user or information provided to the user, as well as the various menus of the control device 6000; in this embodiment of the present application, it is mainly used to display the display interface of each application in the control device 6000 and objects such as text and pictures shown in the display interface. The display unit 6040 may include a display panel 6041. The display panel 6041 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like.
The input unit 6050 may be used to receive information such as numbers or characters input by a user. The input unit 6050 may include a touch panel 6051 and other input devices 6052. Touch panel 6051, also referred to as a touch screen, may collect touch operations by a user on or near touch panel 6051 (e.g., operations by a user on or near touch panel 6051 using a finger, a stylus, or any other suitable object or attachment).
Specifically, the touch panel 6051 may detect a touch operation by the user, detect signals resulting from the touch operation, convert the signals into touch point coordinates, send the touch point coordinates to the processor 6010, receive a command sent from the processor 6010, and execute the command. In addition, the touch panel 6051 can be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. Other input devices 6052 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, power on and off keys, etc.), a trackball, a mouse, a joystick, and the like.
Of course, the touch panel 6051 may cover the display panel 6041, and when the touch panel 6051 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 6010 to determine the type of the touch event, and then the processor 6010 provides a corresponding visual output on the display panel 6041 according to the type of the touch event. Although in fig. 6, the touch panel 6051 and the display panel 6041 are two separate components to implement the input and output functions of the control device 6000, in some embodiments, the touch panel 6051 and the display panel 6041 may be integrated to implement the input and output functions of the control device 6000.
The control device 6000 may also include one or more sensors, such as pressure sensors, gravitational acceleration sensors, proximity light sensors, etc. Of course, the control device 6000 may also include other components such as a camera, which are not shown in fig. 6 and will not be described in detail since they are not components that are used in the embodiments of the present application.
Those skilled in the art will appreciate that fig. 6 is merely an example of a control device and is not intended to be limiting and may include more or less components than those shown, or some components in combination, or different components.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for detecting video-unrelated content in any of the above method embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the technical solutions above, in essence or in the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a control device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A method for video-independent content detection, comprising:
acquiring a video frame to be detected from a video to be detected of a target object;
acquiring a video multimedia feature set consisting of video multimedia features of each video frame to be detected;
when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition, determining that video irrelevant content exists in the video to be detected;
the sample multimedia feature set is obtained according to video irrelevant content contained in the historical videos of the target object, and the video irrelevant content contained in the historical videos is determined according to the number of matching frames of video frames among the historical videos and the audio similarity.
2. The method of claim 1, wherein obtaining a video frame to be detected from a video to be detected of a target object comprises:
and according to the specified sampling duration, carrying out video frame sampling on the video in the specified time period in the video to be detected to obtain each video frame to be detected.
3. The method according to claim 1 or 2, wherein when the matching degree between the video multimedia feature set and the sample multimedia feature set meets a set condition, determining that video-unrelated content exists in the video to be detected comprises:
sequentially determining video frame similarity between each video picture feature in the video multimedia feature set and each corresponding video picture feature in the sample multimedia feature set until the obtained video frame similarity is lower than a preset picture similarity threshold, wherein the video multimedia features at least comprise video picture features;
determining the number of video frame pairs whose video frame similarity is not lower than the preset picture similarity threshold as a matching frame number;
and when the matching frame number is larger than a first preset frame number threshold value, determining that video irrelevant content exists in the video to be detected.
4. The method of claim 3, after determining that video-independent content is present in the video to be detected, further comprising:
when the sample multimedia feature set of the target object is one, determining the irrelevant content time length of the video irrelevant content according to the matching frame number and the specified sampling time length;
and when the sample multimedia feature set of the target object is multiple, determining the maximum matching frame number in the multiple obtained matching frame numbers, and determining the irrelevant content time length of the video irrelevant content according to the maximum matching frame number and the specified sampling time length.
5. The method of claim 1 or 4, wherein the sample set of multimedia features is determined according to the following steps:
acquiring a historical video set consisting of a plurality of historical videos uploaded by the target object;
respectively obtaining a video multimedia feature set of each historical video;
taking out a historical video to be processed from the historical video set, and determining the number of matching frames between the historical video to be processed and each historical video in the updated historical video set according to the video picture characteristics contained in each video multimedia characteristic set;
screening out historical videos meeting preset matching conditions according to the obtained matching frame number;
obtaining a corresponding matched frame feature set according to the matched frame number of the historical video to be processed and each filtered historical video;
and determining a sample multimedia feature set corresponding to the historical video to be processed according to the obtained multiple matched frame feature sets.
6. The method of claim 5, wherein screening out a history video meeting a preset matching condition according to the obtained number of matching frames, further comprises:
screening out historical videos of which the matching frame number is greater than a first preset frame number threshold and less than a second preset frame number threshold;
determining a corresponding matching time period according to the matching frame number corresponding to each filtered historical video and the specified sampling duration;
extracting corresponding audio features according to the matching time periods matched between the historical video to be processed and each filtered historical video;
determining audio similarity between the historical video to be processed and the audio characteristics of each filtered historical video;
and screening out the historical videos corresponding to the audio similarity meeting the preset audio similarity condition.
7. The method of claim 6, wherein extracting corresponding audio features according to the matching time periods matched between the historical video to be processed and each filtered historical video respectively comprises:
respectively extracting first audio features of the videos in the matching time period in the historical videos to be processed and each filtered historical video;
and respectively extracting second audio features of the videos within the specified time period and outside the matched time period in the historical videos to be processed and each filtered historical video.
8. The method of claim 7, wherein determining the audio similarity between the audio features of the historical video to be processed and the filtered historical video comprises:
and determining first audio similarity between the historical video to be processed and the first audio feature of each filtered historical video and second audio similarity between the second audio features.
9. The method of claim 8, wherein screening out historical videos corresponding to audio similarities meeting a preset audio similarity condition comprises:
and screening out historical videos of which the first audio similarity is higher than a first preset audio threshold and the second audio similarity is lower than a second preset audio threshold.
10. The method of claim 5, wherein after determining the sample multimedia feature set corresponding to the historical video to be processed according to the obtained plurality of matching frame feature sets, further comprising:
and when the number of the historical videos contained in the updated historical video set is more than one, executing the step of taking out a historical video to be processed from the historical video set and determining the number of the matching frames between the historical video to be processed and each historical video in the updated historical video set.
11. A method according to claim 1 or 2, wherein the video independent content is any one or a combination of: leader and trailer.
12. An apparatus for video independent content detection, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a video frame to be detected from a video to be detected of a target object;
the composition unit is used for acquiring a video multimedia feature set composed of video multimedia features of each video frame to be detected;
the determining unit is used for determining that video irrelevant content exists in the video to be detected when the matching degree of the video multimedia feature set and the sample multimedia feature set accords with a set condition;
the sample multimedia feature set is obtained according to video irrelevant content contained in the historical videos of the target object, and the video irrelevant content contained in the historical videos is determined according to the number of matching frames of video frames among the historical videos and the audio similarity.
13. A control device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-11 are implemented when the program is executed by the processor.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 11.
CN201911159761.0A 2019-11-22 2019-11-22 Method, device, equipment and medium for detecting video irrelevant content Active CN111027419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911159761.0A CN111027419B (en) 2019-11-22 2019-11-22 Method, device, equipment and medium for detecting video irrelevant content

Publications (2)

Publication Number Publication Date
CN111027419A true CN111027419A (en) 2020-04-17
CN111027419B CN111027419B (en) 2023-10-20

Family

ID=70203354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911159761.0A Active CN111027419B (en) 2019-11-22 2019-11-22 Method, device, equipment and medium for detecting video irrelevant content

Country Status (1)

Country Link
CN (1) CN111027419B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009150425A2 (en) * 2008-06-10 2009-12-17 Half Minute Media Ltd Automatic detection of repeating video sequences
CN102890778A (en) * 2011-07-21 2013-01-23 北京新岸线网络技术有限公司 Content-based video detection method and device
CN102799605A (en) * 2012-05-02 2012-11-28 天脉聚源(北京)传媒科技有限公司 Method and system for monitoring advertisement broadcast
CN102760169A (en) * 2012-06-13 2012-10-31 天脉聚源(北京)传媒科技有限公司 Method for detecting advertising slots in television direct transmission streams
CN104166685A (en) * 2014-07-24 2014-11-26 北京捷成世纪科技股份有限公司 Video clip detecting method and device
US20160261894A1 (en) * 2015-03-06 2016-09-08 Arris Enterprises, Inc. Detecting of graphical objects to identify video demarcations
EP3477956A1 (en) * 2017-10-31 2019-05-01 Advanced Digital Broadcast S.A. System and method for automatic categorization of audio/video content
CN108924586A (en) * 2018-06-20 2018-11-30 北京奇艺世纪科技有限公司 A kind of detection method of video frame, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
覃国孙 (Qin Guosun): "Research on Key Technologies of an Intelligent TV Advertisement Supervision System", 西部广播电视 (Western Radio and TV) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101114A (en) * 2020-08-14 2020-12-18 中国科学院深圳先进技术研究院 Video target detection method, device, equipment and storage medium
CN112597321A (en) * 2021-03-05 2021-04-02 腾讯科技(深圳)有限公司 Multimedia processing method based on block chain and related equipment
CN114782879A (en) * 2022-06-20 2022-07-22 腾讯科技(深圳)有限公司 Video identification method and device, computer equipment and storage medium
CN114782879B (en) * 2022-06-20 2022-08-23 腾讯科技(深圳)有限公司 Video identification method and device, computer equipment and storage medium
CN115438223A (en) * 2022-09-01 2022-12-06 抖音视界有限公司 Video processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111027419B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN111027419B (en) Method, device, equipment and medium for detecting video irrelevant content
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN110446063B (en) Video cover generation method and device and electronic equipment
CN111241340B (en) Video tag determining method, device, terminal and storage medium
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN109271542A (en) Cover determines method, apparatus, equipment and readable storage medium storing program for executing
CN109726712A (en) Character recognition method, device and storage medium, server
CN111209897B (en) Video processing method, device and storage medium
CN112101329B (en) Video-based text recognition method, model training method and model training device
CN116484318B (en) Lecture training feedback method, lecture training feedback device and storage medium
CN113766299B (en) Video data playing method, device, equipment and medium
CN110502664A (en) Video tab indexes base establishing method, video tab generation method and device
CN111491123A (en) Video background processing method and device and electronic equipment
CN111723784A (en) Risk video identification method and device and electronic equipment
CN113704507B (en) Data processing method, computer device and readable storage medium
CN109286848B (en) Terminal video information interaction method and device and storage medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN112188306A (en) Label generation method, device, equipment and storage medium
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN113591530A (en) Video detection method and device, electronic equipment and storage medium
CN112995757B (en) Video clipping method and device
CN113132780A (en) Video synthesis method and device, electronic equipment and readable storage medium
CN111986259A (en) Training method of character and face detection model, auditing method of video data and related device

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40022460)
SE01 Entry into force of request for substantive examination
GR01 Patent grant