CN110677718B - Video identification method and device - Google Patents

Video identification method and device

Info

Publication number
CN110677718B
CN110677718B
Authority
CN
China
Prior art keywords
video
audio
similar
segment
information
Prior art date
Legal status
Active
Application number
CN201910926328.9A
Other languages
Chinese (zh)
Other versions
CN110677718A (en)
Inventor
张义飞
康斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910926328.9A
Publication of CN110677718A
Application granted
Publication of CN110677718B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Abstract

The embodiments of the present application disclose a video identification method and apparatus. A first video to be identified can be acquired; audio-video separation is performed on the first video to obtain first audio information and first image frames of the first video; a second video similar to the first video is acquired based on the first image frames; second audio information of the second video is acquired; an audio similarity parameter of the first audio information and the second audio information is acquired; and audio anomaly identification is performed on the first video based on the audio similarity parameter to obtain an identification result. By separating out the first image frames of the first video to be identified, finding a second video similar to the first video, and then comparing the audio of the first video with the audio of the second video, the method and apparatus can accurately and quickly identify whether the first video has abnormal sound.

Description

Video identification method and device
Technical Field
The present application relates to the field of communications technologies, and in particular, to a video identification method and apparatus.
Background
In recent years, with the growth of user-generated content, more and more users publish original or secondarily processed video content on data sharing platforms. However, some videos obtained by re-recording, dubbing, or speeding up or slowing down an original video have abnormal sound compared with the original and are not suitable to be recommended to users. At present, there is no method for automatically, accurately, and quickly identifying such abnormal-sound videos.
Disclosure of Invention
In view of this, embodiments of the present application provide a video identification method and apparatus, which can accurately and quickly identify a video with abnormal sound.
In a first aspect, an embodiment of the present application provides a video identification method, including:
acquiring a first video to be identified;
performing audio-video separation on the first video to obtain first audio information and a first image frame of the first video;
acquiring a second video similar to the first video based on the first image frame;
acquiring second audio information of the second video;
acquiring an audio similarity parameter of the first audio information and the second audio information;
and performing audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result.
In an embodiment, the obtaining a second video similar to the first video based on the first image frame includes:
acquiring picture characteristic information of the first image frame;
acquiring, based on the picture characteristic information, a candidate similar video clip set having image frames similar to those of the first video, wherein the candidate similar video clip set comprises a plurality of candidate similar video clips;
and selecting a second video similar to the first video based on the candidate similar video segments.
In one embodiment, the first video includes a plurality of first video segments, and the selecting a second video similar to the first video based on the candidate similar video segments includes:
acquiring the segment similarity of the candidate similar video segment and the first video segment;
selecting a similar video clip corresponding to the first video clip from the candidate similar video clip set based on the clip similarity;
and selecting a second video similar to the first video based on the similar video clips.
In an embodiment, the selecting, based on the similar video segments, a second video similar to the first video includes:
obtaining similar videos corresponding to the similar video clips to obtain a similar video set, wherein the similar video set comprises a plurality of similar videos;
counting similar video clips in the similar videos to obtain statistical parameters corresponding to the similar videos;
and selecting a second video similar to the first video from the similar video set based on the statistical parameters corresponding to the similar videos.
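The counting-and-selection steps above can be sketched as a simple vote: each matched segment of the first video names the similar video it came from, and the video named most often is taken as the second video. This is a minimal Python illustration; using the raw segment count as the statistical parameter is an assumption, since the text does not fully specify it.

```python
from collections import Counter

def select_second_video(matched_video_ids):
    """Choose the second video as the similar video that contributed the
    largest number of matched segments. `matched_video_ids` lists, for each
    first-video segment, the id of the similar video its matched clip
    came from."""
    counts = Counter(matched_video_ids)
    video_id, _count = counts.most_common(1)[0]
    return video_id
```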
In an embodiment, the video recognition method further includes:
counting time information of similar video clips in the second video;
segmenting a second audio information segment for audio comparison from the second audio information based on the time information;
segmenting a first audio information segment corresponding to the second audio information segment from the first audio information;
the obtaining of the audio similarity parameter of the first audio information and the second audio information includes:
and acquiring an audio similarity parameter of the first audio information segment and the second audio information segment.
In an embodiment, the obtaining the audio similarity parameter of the first audio information segment and the second audio information segment includes:
dividing the first audio information segment into a plurality of first audio sub-segments;
dividing the second audio information segment into a plurality of second audio sub-segments based on the first audio sub-segments;
acquiring sub-segment similarity parameters of the second audio sub-segment and the first audio sub-segment;
and acquiring the audio similarity parameter of the first audio information segment and the second audio information segment based on the sub-segment similarity parameters.
In an embodiment, the obtaining the audio similarity parameter of the first audio information segment and the second audio information segment based on the sub-segment similarity parameters includes:
obtaining a comparison result between the sub-segment similarity parameters and a first preset threshold;
and acquiring the audio similarity parameter of the first audio information segment and the second audio information segment based on the comparison result.
In an embodiment, said obtaining sub-segment similarity parameters of said second audio sub-segment and said first audio sub-segment comprises:
acquiring first audio characteristic information of the first audio sub-segment;
acquiring second audio characteristic information of the second audio sub-segment;
acquiring the feature similarity of the first audio feature information and the second audio feature information;
and acquiring sub-segment similarity parameters of the second audio sub-segment and the first audio sub-segment based on the feature similarity.
In an embodiment, the performing, based on the audio similarity parameter, audio anomaly identification on the first video to obtain an identification result includes:
and when the audio similarity parameter is greater than a second preset threshold, determining that the first video is a sound-abnormal video.
In one embodiment, the video identification method further comprises: storing the identification result in a blockchain.
In a second aspect, an embodiment of the present application provides a video recognition apparatus, including:
a first video acquisition unit configured to acquire a first video;
the separation unit is used for carrying out audio-video separation on the first video to obtain first audio information and a first image frame of the first video;
a second video acquisition unit configured to acquire a second video similar to the first video based on the first image frame;
the audio acquisition unit is used for acquiring second audio information of the second video;
the calculating unit is used for acquiring an audio similarity parameter of the first audio information and the second audio information;
and the identification unit is used for performing audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result.
In an embodiment, the calculation unit includes a first dividing subunit, a second dividing subunit, a first obtaining subunit, and a second obtaining subunit, as follows:
a first dividing subunit, configured to divide the first audio information segment into a plurality of first audio sub-segments;
a second dividing subunit, configured to divide the second audio information segment into a plurality of second audio sub-segments based on the first audio sub-segments;
a first obtaining subunit, configured to obtain sub-segment similarity parameters of the second audio sub-segments and the first audio sub-segments;
a second obtaining subunit, configured to obtain the audio similarity parameter of the first audio information segment and the second audio information segment based on the sub-segment similarity parameters.
In a third aspect, an embodiment of the present application provides a terminal, including: a processor and a memory; the memory stores a plurality of instructions, and the processor loads the instructions stored in the memory to execute the steps in the video identification method provided by any embodiment of the application.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, which, when run on a computer, causes the computer to perform a video identification method as provided in any of the embodiments of the present application.
With the video identification method and apparatus of the embodiments of the present application, a first video to be identified can be acquired; audio-video separation is performed on the first video to obtain first audio information and first image frames of the first video; a second video similar to the first video is acquired based on the first image frames; second audio information of the second video is acquired; an audio similarity parameter of the first audio information and the second audio information is acquired; and audio anomaly identification is performed on the first video based on the audio similarity parameter to obtain an identification result. By separating out the first image frames of the first video to be identified, finding a second video similar to the first video, and then comparing the audio of the first video with the audio of the second video, the method and apparatus can accurately and quickly identify whether the first video has abnormal sound.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a schematic view of a scene of a video recognition system according to an embodiment of the present invention;
FIG. 1b is a schematic view of another scene of a video recognition system according to an embodiment of the present invention;
fig. 2a is a first flowchart of a video recognition method according to an embodiment of the present invention;
fig. 2b is a second flowchart of a video recognition method according to an embodiment of the present invention;
fig. 3a is a schematic diagram of a first structure of a video recognition apparatus according to an embodiment of the present invention;
fig. 3b is a schematic diagram of a second structure of a video recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a network device according to an embodiment of the present invention;
fig. 5 is a schematic flow chart of a video recognition method according to an embodiment of the present invention;
fig. 6a is a schematic structural diagram of a data sharing system when a video recognition device is used as a node of the data sharing system according to an embodiment of the present invention;
FIG. 6b is a blockchain and block structure diagram of the data sharing system shown in fig. 6a;
fig. 6c is a block generation flow diagram of the blockchain shown in fig. 6b.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a video identification method and device.
An embodiment of the present invention provides a video identification system, including the video identification device provided in any one of the embodiments of the present invention, where the video identification device may be integrated in a terminal, and the terminal may include: a mobile phone, a tablet computer, a notebook computer, or a personal computer (PC).
In addition, the video recognition system may also include other devices, such as a server or the like.
For example, referring to fig. 1a, a video recognition system includes a terminal and a server, and the terminal and the server are connected through a network. The network includes network entities such as routers and gateways.
The terminal can acquire a first video to be identified; perform audio-video separation on the first video to obtain first audio information and first image frames of the first video; acquire a second video similar to the first video based on the first image frames; acquire second audio information of the second video; acquire an audio similarity parameter of the first audio information and the second audio information; and perform audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result. By separating out the first image frames of the first video to be identified, finding a second video similar to the first video, and then comparing the audio of the first video with the audio of the second video, whether the first video has abnormal sound can be accurately and quickly identified.
In one embodiment, the terminal may obtain a second video similar to the first video from a server or other terminals through a network link.
In an embodiment, the terminal may further send the identification result to a server in the video identification system through a network link, or forward the identification result to other terminals in the system.
In another embodiment, the terminal may be linked with a server or a terminal in a video push system through a network link and may send the identification result to the video push system. For example, the identification result may be sent to an audit terminal in the video push system, so that an auditor can give priority review to videos with abnormal sound according to the identification result; the terminal may also send the identification result to a server of the video push system, and the server may determine, according to the identification result, whether to push the video to other terminal users.
In another embodiment, referring to fig. 1b, the video recognition apparatus of the present invention may be specifically integrated in a server, and the video recognition system further includes a terminal, where the terminal is connected to the server through a network, and the server may obtain a first video from the terminal through the network, and when performing video recognition, may select a second video similar to the first video from a memory. Wherein the video stored in the memory may be from a first video obtained from the terminal before the current point in time.
In some embodiments, referring to fig. 6a, the terminal and the server may be nodes in a data sharing system. The data sharing system is a system for sharing data between nodes; it may include a plurality of nodes, which may be the respective network devices in the data sharing system. Each node stores an identical blockchain, and the video identification device can store the identification result in the blockchain so as to share data with other network devices.
The examples of fig. 1a and 1b are only examples of system architectures for implementing the embodiments of the present invention; the embodiments of the present invention are not limited to the system architectures shown in fig. 1a and 1b, and the various embodiments of the present invention are proposed based on these system architectures.
The following are detailed below. The numbers in the following examples are not intended to limit the order of preference of the examples.
The present embodiment will be described from the perspective of a video recognition apparatus, which may be integrated in a terminal, such as a mobile phone, a tablet computer, a notebook computer, or a personal computer (PC).
As shown in fig. 2a, a video recognition method is provided, which may be executed by a processor of a terminal, and the specific flow of the video recognition method may be as follows:
101. a first video to be identified is acquired.
In an embodiment, the terminal can trigger a video acquisition instruction to acquire the first video based on an operation of a user on a terminal page.
The first video may be obtained from a local storage of the terminal, or may be obtained from a server through a network link.
In an embodiment, the file identifier linked list may be obtained from a local storage or a server through a network link based on an operation of the user on a terminal page, and then the first video selected by the user may be obtained based on the operation of the user on the terminal device. For example, an attachment insertion control is arranged on a terminal page, a file identification linked list is acquired based on clicking operation of a user on the attachment insertion control, a file selection page is displayed according to the file identification linked list, a plurality of file identification options are arranged on the file selection page, and then the terminal triggers a video acquisition instruction based on selection operation of the user on the file identification options in the file selection page to acquire a first video corresponding to a file identification.
102. And carrying out audio-video separation on the first video to obtain first audio information and a first image frame of the first video.
After the terminal acquires the first video, the audio in the first video can be separated, and then the first video is subjected to frame processing to obtain a plurality of first image frames.
A video strictly plays single image frames one after another and relies on the persistence of vision of the human eye to create the illusion of continuous motion; that is, a video comprises a plurality of image frames. Frame extraction means extracting single image frames from the video. Image frames may be extracted from the video at a preset time interval, for example, one frame every 0.04 s.
In an embodiment, after the first image frame is acquired, the first image frame may be arranged according to a time sequence of the image frames in the video, so as to obtain a first image frame sequence corresponding to the first video.
103. Based on the first image frame, a second video similar to the first video is acquired.
The first video is the video to be identified, and the second video is a video whose motion picture is similar to that of the first video and whose sound is normal. Here, the motion picture refers to the video after the audio has been separated out.
In one embodiment, the first video is obtained by secondary processing of an original video, such as re-editing, re-recording, or speeding up or slowing down. The second video may include the original video and other videos whose motion pictures are similar to that of the first video.
In an embodiment, the obtaining a second video similar to the first video based on the first image frame may specifically include the following steps:
acquiring picture characteristic information of the first image frame;
acquiring, based on the picture characteristic information, a candidate similar video clip set having image frames similar to those of the first video, wherein the candidate similar video clip set comprises a plurality of candidate similar video clips;
and selecting a second video similar to the first video based on the candidate similar video segments.
The picture characteristic information distinguishes the current image from other images and may include an image fingerprint, which can be understood as a parameter representation of the image. The more similar two image fingerprints are, the more similar the corresponding images are.
In one embodiment, the first image frames may be computed using a perceptual hash algorithm to obtain an image fingerprint corresponding to each first image frame. The computed image fingerprint is a character string whose characters are combined in a fixed order. The perceptual hash algorithms may include a hash algorithm, an avhash algorithm, and a histhash algorithm.
In an embodiment, the calculating the first image frames by using the perceptual hash algorithm to obtain the image fingerprint corresponding to each first image frame may specifically include the following steps:
1. Reduce the size: remove the details of the image, retaining only basic information such as structure and brightness, and discard image differences caused by different sizes and proportions. For example, the first image frame may be reduced to 8 × 8, 64 pixels in total.
2. Simplify the color: convert the reduced first image frame into 64-level grayscale, i.e., all pixels have 64 gray levels in total.
3. Calculate the average value: calculate the average gray value of the 64 pixels.
4. Compare the gray level of each pixel with the average value: record 1 if it is greater than or equal to the average, and 0 if it is smaller.
5. Calculate the hash value: combine the comparison results (0 or 1) from the previous step in a predetermined order to form a 64-bit character string, which is the fingerprint of the image; the length of the image fingerprint is 64.
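The five steps above amount to the classic average-hash procedure. The following is a minimal Python sketch of steps 3 to 5; it assumes steps 1 and 2 (shrinking to 8 × 8 and converting to grayscale) have already produced a list of 64 pixel intensities.

```python
def average_hash(gray_8x8):
    """Steps 3-5 of the average-hash procedure: mean gray level, per-pixel
    comparison, and concatenation into a 64-bit fingerprint string.
    `gray_8x8` is a list of 64 grayscale pixel values, i.e. the output of
    steps 1-2 (shrink to 8x8, convert to grayscale), which are assumed done."""
    assert len(gray_8x8) == 64
    mean = sum(gray_8x8) / 64          # step 3: average gray value
    # steps 4-5: 1 if pixel >= mean, else 0, joined in a fixed order
    return "".join("1" if p >= mean else "0" for p in gray_8x8)
```

For example, a frame whose top half is black and bottom half is white yields a fingerprint of 32 zeros followed by 32 ones.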
In an embodiment, after the image fingerprints of all the first image frames in the first image frame sequence are calculated, the image fingerprint sequence corresponding to the first image frame sequence may be stored in a local memory or transmitted to a server via a network for storage. Of course, the first image frames in the first image frame sequence may also be calculated in their arrangement order, and the calculated image fingerprints stored in real time.
In an embodiment, the image fingerprints of the first image frames may be calculated by using the hash algorithm, the avhash algorithm, and the histhash algorithm respectively, so that each first image frame obtains three different image fingerprints, which may be denoted as a hash fingerprint, an avhash fingerprint, and a histhash fingerprint.
In an embodiment, the hash fingerprint, the avhash fingerprint, and the histhash fingerprint corresponding to each first image frame may each be arranged according to the arrangement order of the first image frame sequence, to obtain a hash fingerprint sequence, an avhash fingerprint sequence, and a histhash fingerprint sequence corresponding to the first video. These fingerprint sequences may be stored in local memory or sent to a server over a network for storage.
In an embodiment, a mapping relationship between the first video and the corresponding first image frame sequence, hash fingerprint sequence, avhash fingerprint sequence, and histhash fingerprint sequence may be established in a local memory or a server, and the mapping relationship may be stored in a mapping table.
In one embodiment, the image frame sequence and the image fingerprint sequence corresponding to the historical video may be obtained from a local storage or from a server via a network. The historical video may include a first video published by the user based on the terminal before the current time point.
In one embodiment, the historical video includes a plurality of historical video segments, the first video includes a plurality of first video segments, and the candidate similar segments having similar image frames to the first video segments can be selected by searching and comparing the historical video segments according to the first segment image frame sequence of the first video segments and the historical segment image frame sequence of the historical video segments.
The historical video can be divided according to a preset sliding window algorithm to obtain a plurality of historical video segments, for example, a video segment is extracted by sliding in 4s steps in a time window which is 8s wide, and a video with the duration of 12s can be divided into 2 video segments. Wherein the time length of each obtained video clip is the same.
In an embodiment, for a historical video with total length L, the number n of historical video segments generated by sliding-window interception can be calculated with the following formula:

n = max(⌊(L − w) / s⌋, 0) + 1

where L, w, and s are integers greater than 0, the division is rounded down, and the floor term is clamped to a minimum of 0; w is the window width (the number of image frames contained in each divided video segment may be taken as the window width), s is the step length, and n is the total number of video segments.
Similarly, the first video may be divided according to the sliding window algorithm to obtain a plurality of first video segments, which is not repeated here.
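As a sketch, the segment count and segment start offsets under the sliding-window scheme can be computed as follows, with the floor term clamped to 0 as in the formula above:

```python
def segment_count(L, w, s):
    """Total number n of sliding-window segments for a video of length L,
    window width w, and step s: n = max(floor((L - w) / s), 0) + 1."""
    return max((L - w) // s, 0) + 1

def segment_starts(L, w, s):
    """Start offset of each segment; each segment covers [start, start + w)."""
    return [i * s for i in range(segment_count(L, w, s))]
```

This reproduces the example in the text: a 12 s video with an 8 s window and a 4 s step yields 2 segments, starting at 0 s and 4 s.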
For example, the fingerprint similarity between the image fingerprint of the historical video and the image fingerprint of the current first video may be calculated, and if the fingerprint similarity is greater than a preset threshold, the image frames are determined to be similar.
In another embodiment, the historical video includes a plurality of historical video segments. The image fingerprint sequence corresponding to each historical video segment may be compressed into a historical video segment feature, and the image fingerprint sequence corresponding to the first video segment may be compressed into a first video segment feature. Then, according to the feature similarity between the first video segment feature and the historical video segment features, initial candidate similar segments similar to the first video segment are preliminarily screened out from the historical video segments. Next, fingerprint sequence similarity parameters between the fingerprint sequences of the initial candidate similar segments and the fingerprint sequence corresponding to the first video segment are respectively obtained, and the candidate similar segments are selected from the initial candidate similar segments based on the fingerprint sequence similarity parameters.
There are multiple methods for obtaining the fingerprint sequence similarity. For example, the hash fingerprint similarity between each image frame in the initial candidate similar segment and each image frame in the first video segment may be calculated; then the number of image frames whose hash fingerprint similarity is greater than a preset threshold is counted, and a hash ratio between this number and the total number of image frames in the first video segment is calculated. An avhash ratio and a histhash ratio are obtained in the same way, and the fingerprint sequence similarity is then obtained according to preset weight parameters and the hash, avhash, and histhash ratios.
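A minimal sketch of this ratio-and-weights combination; the similarity threshold and the equal weights below are assumptions, since the text leaves both as preset parameters:

```python
def ratio_above_threshold(similarities, threshold):
    """Fraction of frame pairs whose fingerprint similarity exceeds the threshold."""
    return sum(1 for s in similarities if s > threshold) / len(similarities)

def fingerprint_sequence_similarity(hash_sims, avhash_sims, histhash_sims,
                                    threshold=0.8, weights=(1/3, 1/3, 1/3)):
    """Weighted combination of the hash, avhash, and histhash ratios.
    The threshold of 0.8 and the equal weights are placeholder assumptions."""
    ratios = (ratio_above_threshold(s, threshold)
              for s in (hash_sims, avhash_sims, histhash_sims))
    return sum(w * r for w, r in zip(weights, ratios))
```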
There are various methods for calculating the fingerprint similarity; for example, the Hamming distance of the image fingerprints can be calculated as the fingerprint similarity. In an embodiment, an exclusive-OR operation may be performed on the two character strings, and the number of positions at which the result is 1, i.e., the number of positions at which the corresponding characters differ, is the Hamming distance between the two character strings.
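As an illustrative sketch (the function name is our own), the exclusive-OR counting described above can be written as:

```python
def hamming_distance(fp_a: str, fp_b: str) -> int:
    """Hamming distance of two equal-length bit strings.

    XOR of a character pair is 1 exactly when the bits differ, so the
    number of 1s in the XOR result is the Hamming distance.
    """
    assert len(fp_a) == len(fp_b)
    return sum(int(a) ^ int(b) for a, b in zip(fp_a, fp_b))
```

For 64-bit image fingerprints, a small distance (equivalently, a high fingerprint similarity) indicates similar frames.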
In one embodiment, there are various methods for compressing an image fingerprint sequence into a video segment feature. For example, suppose the width of the video segment corresponding to the image fingerprint sequence is w (i.e., the video segment contains w image frames) and each image fingerprint is a k-bit string. The sum of the w image fingerprints may then be calculated bit by bit over the k bits; if the sum at a bit is less than w/2, that bit of the video segment fingerprint is 0, and if it is greater than or equal to w/2, that bit is 1. The resulting new fingerprint of length k serves as the video segment feature.
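The bit-by-bit majority compression described above can be sketched as follows (a minimal illustration; representing fingerprints as '0'/'1' strings is our own assumption):

```python
def compress_fingerprints(fingerprints: list[str]) -> str:
    """Compress w frame fingerprints (each a k-bit string) into one
    k-bit clip fingerprint by bitwise majority vote: a bit is 1 when
    the bit-wise sum at that position is at least w/2, else 0."""
    w = len(fingerprints)
    k = len(fingerprints[0])
    bits = []
    for pos in range(k):
        ones = sum(int(fp[pos]) for fp in fingerprints)
        bits.append('1' if ones >= w / 2 else '0')
    return ''.join(bits)
```

The compressed feature lets a coarse clip-level comparison run first, before the per-frame fingerprint comparison.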
In an embodiment, the selecting a second video similar to the first video based on the candidate similar video segments may specifically include the following steps:
acquiring the segment similarity of the candidate similar video segment and the first video segment;
selecting a similar video clip corresponding to the first video clip from the candidate similar video clip set based on the clip similarity;
and selecting a second video similar to the first video based on the similar video clips.
The first video can be divided according to a sliding window algorithm to obtain a plurality of first video segments, as described above; details are not repeated here.
In an embodiment, obtaining the segment similarity between the candidate similar video segment and the first video segment may include the following steps: acquiring a similar image frame sequence corresponding to the candidate similar video segment, where the similar image frame sequence comprises a plurality of similar image frames; acquiring a first image frame sequence corresponding to the first video segment, where the first image frame sequence comprises a plurality of first image frames; acquiring the similarity between each first image frame and the corresponding similar image frame; counting the number of first image frames whose similarity is greater than a preset threshold; calculating the ratio of this number to the total number of frames in the first image frame sequence; and obtaining the segment similarity according to the ratio.
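A minimal sketch of this frame-counting computation, assuming each frame is represented by its bit-string fingerprint (the threshold value is illustrative, not taken from the patent):

```python
def frame_similarity(fp_a: str, fp_b: str) -> float:
    """Similarity of two k-bit frame fingerprints, normalised from
    the Hamming distance into [0, 1]."""
    diff = sum(int(a) ^ int(b) for a, b in zip(fp_a, fp_b))
    return 1.0 - diff / len(fp_a)


def segment_similarity(first_frames, similar_frames, threshold=0.9):
    """Ratio of first-segment frames whose similarity to the frame at
    the same position in the candidate segment exceeds `threshold`."""
    hits = sum(1 for a, b in zip(first_frames, similar_frames)
               if frame_similarity(a, b) > threshold)
    return hits / len(first_frames)
```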
The Hamming distance of the image fingerprints can be calculated as the similarity of the image frames, as described in the above embodiments; details are not repeated.
In another embodiment, obtaining the segment similarity between the candidate similar video segment and the first video segment may include the following steps: compressing the image fingerprint sequence corresponding to the candidate similar video clip into candidate similar video clip characteristics, compressing the image fingerprint sequence corresponding to the first video clip into first video clip characteristics, and then calculating the similarity of the first video clip characteristics and the candidate similar video clip characteristics as clip similarity.
Wherein, selecting a similar video segment corresponding to the first video segment from the candidate similar video segment set based on the segment similarity may include the following steps. If the first video segment is similar to a plurality of candidate similar video segments in the historical video, the candidate similar video segment with the maximum segment similarity is selected, according to the segment similarity, as the similar video segment corresponding to the first video segment; if the segment similarities are equal, the candidate similar video segment that appears first is selected, according to the time order in which the candidate similar video segments appear in the historical video. Conversely, if one candidate similar video segment is similar to a plurality of first video segments, the first video segment with the maximum segment similarity is determined according to the segment similarity and taken as the first video segment corresponding to that candidate similar video segment; if the segment similarities are equal, the first video segment that appears first in the first video is determined according to the time order and taken as the first video segment corresponding to that candidate similar video segment. In this way, the first video segments correspond one-to-one with similar video segments in a historical video, which avoids repeated counting in the subsequent step of selecting similar videos.
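The one-to-one matching rule above (highest segment similarity wins, ties broken by earlier appearance) might be sketched as follows; the tuple layout, and using segment indices as a proxy for time order in the first video, are our own assumptions:

```python
def match_segments(pairs):
    """Resolve many-to-many similarity matches into a one-to-one mapping.

    `pairs` holds (first_idx, cand_id, similarity, cand_time) tuples,
    where cand_time is the time at which the candidate segment appears
    in the historical video.
    """
    # Step 1: for each first segment keep the candidate with the highest
    # similarity; on ties, the earliest-appearing candidate wins.
    best_for_first = {}
    for f, c, sim, t in pairs:
        cur = best_for_first.get(f)
        if cur is None or (sim, -t) > (cur[1], -cur[2]):
            best_for_first[f] = (c, sim, t)
    # Step 2: if one candidate won several first segments, keep only the
    # first segment with the highest similarity (ties: earliest index).
    best_for_cand = {}
    for f, (c, sim, _t) in best_for_first.items():
        cur = best_for_cand.get(c)
        if cur is None or (sim, -f) > (cur[1], -cur[0]):
            best_for_cand[c] = (f, sim)
    return {f: c for c, (f, _sim) in best_for_cand.items()}
```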
In an embodiment, the selecting a second video similar to the first video based on the similar video segment may specifically include the following steps:
obtaining similar videos corresponding to the similar video clips to obtain a similar video set, wherein the similar video set comprises a plurality of similar videos;
counting similar video clips in the similar videos to obtain statistical parameters corresponding to the similar videos;
and selecting a second video similar to the first video from the similar video set based on the statistical parameters corresponding to the similar videos.
The statistical parameter corresponding to the similar video is a parameter representing the similarity degree of the similar video and the first video.
In an embodiment, the step of performing statistics on similar video segments in the similar videos to obtain the statistical parameters corresponding to the similar videos may include the following steps: counting the number of similar video segments in the similar video (this number is also the number of segments in the first video that are similar to the similar video), and then acquiring the ratio of this number to the total number of segments in the first video as the statistical parameter corresponding to the similar video.
In an embodiment, selecting a second video similar to the first video from the similar video set based on the statistical parameters corresponding to the similar videos includes: comparing the statistical parameter with a preset threshold, and if the statistical parameter is greater than the preset threshold, determining that the similar video corresponding to the statistical parameter is a second video similar to the first video.
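Putting the statistical parameter and the threshold check together (a sketch; the names and the 0.5 threshold are illustrative):

```python
def select_second_videos(matched_segments, total_first_segments,
                         threshold=0.5):
    """`matched_segments` maps a similar-video id to the list of its
    matched similar video segments. The statistical parameter is the
    segment count divided by the total number of first video segments;
    videos whose parameter exceeds the threshold are kept."""
    second_videos = []
    for video_id, segments in matched_segments.items():
        ratio = len(segments) / total_first_segments
        if ratio > threshold:
            second_videos.append(video_id)
    return second_videos
```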
104. And acquiring second audio information of the second video.
In an embodiment, audio and video separation may be performed on the second video to obtain second audio information of the second video.
In another embodiment, the server stores audio information corresponding to each historical video, and the audio information corresponding to the second video may be acquired from the server through network connection.
105. And acquiring audio similarity parameters of the first audio information and the second audio information.
Wherein the audio similarity parameter is a parameter indicating a degree of similarity between the first audio information and the second audio information.
In an embodiment, before the obtaining of the audio similarity parameters of the first audio information and the second audio information, the method may include the following steps:
counting time information of similar video clips in the second video;
segmenting a second audio information segment for audio comparison from the second audio information based on the time information;
segmenting a first audio information segment corresponding to the second audio information segment from the first audio information;
the obtaining of the audio similarity parameters of the first audio information and the second audio information includes:
and acquiring audio similarity parameters of the first audio information segment and the second audio information segment.
The time information may include information of time points, time lengths, and the like of similar video segments.
In an embodiment, the second audio information segment may be cut from the second audio information according to the time information.
In an embodiment, the similar time period of the second video may be determined according to a start time point and an end time point of the similar video segment appearing in the second video, and the second audio information segment may be cut from the second audio information according to the similar time period of the second video.
Similarly, a similar time period of the first video may be determined according to the start time point and the end time point at which the first video segment corresponding to the similar video segment appears in the first video, and the first audio information segment may be cut out from the first audio information according to the similar time period of the first video.
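Cutting an audio information segment out of the full audio by such a time period can be sketched as follows (assuming the audio is available as a 1-D sample sequence at `sr` samples per second; all names are illustrative):

```python
def cut_by_period(audio, sr, start_s, end_s):
    """Slice the samples falling inside [start_s, end_s) seconds out of
    a 1-D sample sequence sampled at `sr` Hz."""
    return audio[int(start_s * sr):int(end_s * sr)]
```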
In an embodiment, the obtaining of the audio similarity parameter of the first audio information segment and the second audio information segment may include the following steps:
dividing the first audio information segment into a plurality of first audio sub-segments;
dividing the second audio information segment into a plurality of second audio sub-segments based on the first audio sub-segments;
acquiring sub-segment similarity parameters of the second audio sub-segments and the first audio sub-segments;
and acquiring the audio similarity parameter of the first audio information segment and the second audio information segment based on the sub-segment similarity parameters.
In an embodiment, the first audio information segment may be divided into a plurality of first audio sub-segments according to a sliding window algorithm.
In an embodiment, the second audio information segment may be divided into second audio sub-segments according to a sliding window algorithm, where the number of second audio sub-segments is equal to the number of first audio sub-segments.
In an embodiment, the similar parts of the first audio information segment and the second audio information segment may not be exactly aligned in time. Partially overlapping second audio sub-segments and first audio sub-segments can be obtained by setting the time window and the sliding step, so as to reduce the influence of such time misalignment on the comparison. For example, the second audio sub-segments may be extracted with a time window of 8 s width sliding in 4 s steps, so that a 12 s second audio information segment is divided into 2 second audio sub-segments, which overlap in time.
In another embodiment, in order to reduce the amount of calculation and improve the calculation efficiency, the non-overlapping second audio sub-segment and the non-overlapping first audio sub-segment may be obtained by setting the time window and the sliding step. For example, the second audio sub-segment may be extracted in a 4s wide time window, sliding in 4s step, and the 12s long second audio information segment may be divided into 3 second audio sub-segments.
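Both windowing schemes reduce to choosing a window width and a sliding step; a small sketch that reproduces the two examples above:

```python
def split_sub_segments(duration_s, window_s, step_s):
    """Start/end times (in seconds) of the sub-segments a sliding
    window of width `window_s` produces when moved in `step_s` steps.

    window_s=8, step_s=4 gives the overlapping case in the text
    (12 s -> 2 sub-segments); window_s=step_s=4 gives the
    non-overlapping case (12 s -> 3 sub-segments)."""
    spans = []
    start = 0
    while start + window_s <= duration_s:
        spans.append((start, start + window_s))
        start += step_s
    return spans
```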
In an embodiment, the obtaining of the sub-segment similarity parameters of the second audio sub-segment and the first audio sub-segment may specifically include the following steps:
acquiring first audio characteristic information of the first audio sub-segment;
acquiring second audio characteristic information of the second audio sub-segment;
acquiring the feature similarity of the first audio feature information and the second audio feature information;
and acquiring sub-segment similarity parameters of the second audio sub-segment and the first audio sub-segment based on the feature similarity.
In one embodiment, the audio feature information is feature information that distinguishes the current audio from other audio, such as features of the frequency, time, and energy of an audio sub-segment. The audio feature information may be represented as an audio fingerprint, which may be expressed as a set of numbers with a time attribute.
There are various methods for obtaining the audio fingerprint; for example, the audio fingerprint may be extracted by the Chromaprint algorithm, the Echoprint algorithm, or a landmark algorithm.
The obtaining of the audio fingerprint of the first audio sub-segment using the landmark algorithm may specifically include the following steps:
1. Time-domain framing. For example, a 3 s audio sub-segment may be divided into overlapping frames of 0.37 s, each weighted by a Hanning window function. With an overlap factor of 31/32, one sub-fingerprint is extracted every 0.37 s / 32 ≈ 11.6 ms.
2. Fourier transform. A fast Fourier transform (FFT) is performed on each subframe.
3. Frequency-domain sub-band division. To extract one 32-bit sub-fingerprint per subframe, the frequency range from 300 Hz to 2000 Hz may be divided into 33 non-overlapping frequency-domain sub-bands.
4. Energy calculation. The energy of each frequency-domain sub-band (hereinafter, frequency band) is calculated.
5. Audio fingerprint acquisition. The bit of each frequency band in each subframe is calculated from the band energies and used as part of the sub-fingerprint; all bits, arranged in a fixed order, form the audio fingerprint.
In one embodiment, the bits may be calculated using the following equation:
F(n, m) = 1 if E(n, m) - E(n, m+1) - (E(n-1, m) - E(n-1, m+1)) > 0; otherwise F(n, m) = 0
wherein, F (n, m) is the bit of the nth subframe, mth frequency band; e (n, m) is the energy of the nth subframe, mth frequency band.
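A sketch of this bit computation (following the formula above; the layout of the energy matrix is our own assumption, and the first subframe is skipped because it has no predecessor):

```python
def sub_fingerprint_bits(energy):
    """Compute one 32-bit sub-fingerprint per subframe from a matrix
    energy[n][m] of band energies (33 bands per subframe).

    Bit m compares the energy difference of adjacent bands between the
    current subframe and the previous one, as in the formula above.
    """
    fingerprints = []
    for n in range(1, len(energy)):
        bits = 0
        for m in range(32):  # 33 bands yield 32 bits
            d = (energy[n][m] - energy[n][m + 1]) \
                - (energy[n - 1][m] - energy[n - 1][m + 1])
            bits = (bits << 1) | (1 if d > 0 else 0)
        fingerprints.append(bits)
    return fingerprints
```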
In an embodiment, the feature similarity between the first audio feature information and the second audio feature information may be obtained by calculating the Hamming distance between the feature fingerprint of the first audio sub-segment and the feature fingerprint of the second audio sub-segment.
In an embodiment, the obtaining of the audio similarity parameter of the first audio information segment and the second audio information segment based on the sub-segment similarity parameters may specifically include the following steps:
obtaining a comparison result between each sub-segment similarity parameter and a first preset threshold;
and acquiring the audio similarity parameter of the first audio information segment and the second audio information segment based on the comparison results.
In an embodiment, when the sub-segment similarity parameter is greater than the first preset threshold, the comparison result is marked as similar, otherwise, the comparison result is marked as dissimilar.
In an embodiment, obtaining the audio similarity parameter of the first audio information segment and the second audio information segment based on the comparison results may include the following steps: counting the number of first audio sub-segments whose comparison result is marked as similar, and acquiring the ratio of this similar number to the total number of first audio sub-segments as the audio similarity parameter of the first audio information segment and the second audio information segment.
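The counting step can be sketched as follows (the threshold value is illustrative):

```python
def audio_similarity(sub_pair_params, first_threshold=0.8):
    """Ratio of first audio sub-segments whose sub-segment similarity
    parameter exceeds the first preset threshold, used as the audio
    similarity parameter of the two audio information segments."""
    similar = sum(1 for p in sub_pair_params if p > first_threshold)
    return similar / len(sub_pair_params)
```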
106. And performing audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result.
The recognition result may take multiple forms. For example, it may include marking the sound-abnormal video to obtain a marked video; alternatively, a recognition result report may be generated, which may include the first video, the first audio information, and the identification conclusion, e.g., conclusion information such as "the video is a sound abnormal video".
In an embodiment, the performing audio anomaly identification on the first video based on the audio similar parameter to obtain an identification result may specifically include the following steps:
and when the audio similarity parameter is larger than a second preset threshold value, determining that the first video is a sound abnormal video.
The sound of most short videos produced by clipping an original video is consistent with that of the original; after dubbing, speeding up or slowing down, or introducing other noise during production, the sound becomes inconsistent with the original video.
In an embodiment, the video identification method further includes storing the identification result to a blockchain.
Referring to fig. 6a, the network device integrated with the video recognition apparatus is a node in a data sharing system, and each node in the data sharing system can receive input information during normal operation and maintain data in the data sharing system based on the received input information. In order to ensure information intercommunication in the data sharing system, information connection can exist between each node in the data sharing system, and information transmission can be carried out between the nodes through the information connection. For example, when an arbitrary node in the data sharing system receives input information, other nodes in the data sharing system acquire the input information according to a consensus algorithm, and store the input information as data in shared data, so that the data stored on all the nodes in the data sharing system are consistent.
Each node in the data sharing system has a corresponding node identifier, and each node may store the node identifiers of the other nodes in the data sharing system, so that a generated block can later be broadcast to the other nodes according to their node identifiers. Each node may maintain a node identifier list as shown in the following table, in which node names and node identifiers are stored correspondingly. The node identifier may be an IP (Internet Protocol) address or any other information that can be used to identify the node. For example, when a terminal or a server integrated with the video recognition apparatus performs video anomaly recognition on a video to be recognized and obtains a recognition result, the recognition result is broadcast to the nodes in the node identifier list, each of which corresponds to a network device in the data sharing system. The following table uses IP addresses for illustration only.
(Table: node names and their corresponding node identifiers, illustrated here with IP addresses)
Each node in the data sharing system stores one identical blockchain. As shown in fig. 6b, the blockchain is composed of a plurality of blocks. The starting block includes a block header and a block body; the block header stores the input information characteristic value, a version number, a timestamp and a difficulty value, and the block body stores the input information. The next block takes the starting block as its parent block and likewise comprises a block header and a block body; its block header stores the input information characteristic value of the current block, the block header characteristic value of the parent block, the version number, the timestamp and the difficulty value. In this way the block data stored in each block is associated with the block data stored in its parent block, which ensures the security of the input information in the blocks. In this embodiment, the recognition result may be stored in the block body.
When each block in the blockchain is generated, referring to fig. 6c, the node where the blockchain is located verifies the input information upon receiving it, stores the input information into the memory pool after the verification is completed, and updates the hash tree used for recording the input information. The update timestamp is then set to the time when the input information was received, and different random numbers are tried, computing the characteristic value repeatedly until the calculated characteristic value satisfies the following formula:
SHA256(SHA256(version+prev_hash+merkle_root+ntime+nbits+x))<TARGET
wherein SHA256 is the characteristic value algorithm used for calculating the characteristic value; version is the version information of the relevant block protocol in the blockchain; prev_hash is the block header characteristic value of the parent block of the current block; merkle_root is the characteristic value of the input information; ntime is the update time of the update timestamp; nbits is the current difficulty, which remains fixed for a period of time and is re-determined after that period; x is a random number; and TARGET is a characteristic value threshold, which can be determined from nbits.
Therefore, when a random number satisfying the above formula is obtained through calculation, the information can be stored correspondingly, and the block header and block body are generated to obtain the current block. The node where the blockchain is located then sends the newly generated block to the other nodes in its data sharing system according to their node identifiers; the other nodes verify the newly generated block and, after the verification is completed, add it to the blockchains they store.
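The check in the formula can be sketched as follows; joining the header fields as text is a simplification for illustration (a real blockchain serialises a fixed binary header layout instead):

```python
import hashlib


def meets_target(version, prev_hash, merkle_root, ntime, nbits, x, target):
    """SHA256(SHA256(version + prev_hash + merkle_root + ntime + nbits
    + x)) < TARGET, interpreting the double-SHA256 digest as an
    integer and comparing it against the target threshold."""
    payload = f"{version}{prev_hash}{merkle_root}{ntime}{nbits}{x}".encode()
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return int.from_bytes(digest, "big") < target
```

Mining then amounts to incrementing x until this function returns True for the current target.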
As can be seen from the above, the embodiment of the present invention may acquire a first video to be identified; perform audio-video separation on the first video to obtain first audio information and a first image frame of the first video; acquire a second video similar to the first video based on the first image frame; acquire second audio information of the second video; acquire audio similarity parameters of the first audio information and the second audio information; and perform audio anomaly identification on the first video based on the audio similarity parameters to obtain an identification result. By separating out the first image frame of the first video to be identified, finding a second video similar to the first video, and then comparing the audio of the first video with that of the second video, whether the first video is a sound-abnormal video can be identified accurately and quickly.
The method described in the foregoing embodiment will be described in further detail below by way of example with the video recognition apparatus being specifically integrated in a server.
Referring to fig. 2b and fig. 5, a specific flow of the video identification method according to the embodiment of the present invention is as follows:
201. the server acquires a first video to be identified.
In an embodiment, the terminal may be triggered to send the first video to be recognized to the server based on an operation of the user on the terminal.
202. And the server performs audio-video separation on the first video to obtain first audio information and a first image frame of the first video.
Referring to fig. 5, in an embodiment, audio in a first video may be separated to obtain first audio information, video frame extraction may be performed on the first video to obtain a first image frame, and then a picture fingerprint of the first image frame is obtained, where the method for obtaining the picture fingerprint refers to the above embodiments and is not described again.
203. The server acquires a second video similar to the first video based on the first image frame.
Referring to fig. 5, in an embodiment, the server may store the historical video in a video library, then perform video frame extraction and audio-video separation on the historical video in the video library to obtain image frames and audio information corresponding to the historical video, then acquire image fingerprints of the image frames and audio fingerprints of the audio information, and store the image fingerprints and the audio fingerprints in a video feature library.
The method for acquiring the image fingerprint and the audio fingerprint refers to the above embodiments, and is not described in detail.
Referring to fig. 5, in an embodiment, the server may obtain a similar video clip having similar image frames as the first video by comparing the image fingerprint of the first image frame with the image fingerprints in the video feature library. The method for judging similarity of image frames refers to the above embodiments, and is not repeated.
Referring to fig. 5, in an embodiment, the server may obtain, from the video library, a second video similar to the first video through the similar video segment, where a method for determining the second video refers to the above embodiment and is not described again.
204. And the server acquires second audio information of the second video.
In an embodiment, referring to fig. 5, the server may obtain second audio information corresponding to the second video from the video feature library. In this embodiment, the second audio information is an audio fingerprint of an audio corresponding to the second video.
205. And the server acquires audio similarity parameters of the first audio information and the second audio information.
Wherein the audio similarity parameter is a parameter indicating a degree of similarity of the first audio information and the second audio information.
In an embodiment, referring to fig. 5, the server may obtain a similar time period of the second video similar to the first video according to a similar video segment in the second video, and then divide the second video segment corresponding to the similar time period into a plurality of second video sub-segments in a certain time window in the second video according to a sliding window algorithm. And then acquiring a second sub-audio fingerprint corresponding to a second video sub-segment from the video feature library.
Similarly, referring to fig. 5, the server may obtain a similar time period similar to the second video in the first video according to the similar time period of the second video, and then divide the first video segment corresponding to the similar time period into the first video sub-segments equal to the number of the second video sub-segments in a certain time window in the first video according to a sliding window algorithm. And then acquiring a first sub-audio fingerprint corresponding to the first video sub-segment from the video feature library.
In an embodiment, referring to fig. 5, the server may perform audio feature comparison based on the first sub audio fingerprint and the second sub audio fingerprint; specifically, the Hamming distance between the second sub audio fingerprint and the first sub audio fingerprint may be calculated, and the audio similarity parameters of the first audio information and the second audio information are then obtained according to the Hamming distance.
The method for obtaining the audio similarity parameters of the first audio information segment and the second audio information segment based on the sub-segment similarity parameters is described in the above embodiments and is not repeated.
206. And the server performs audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result.
In an embodiment, the audio similarity parameter may be compared with a second preset threshold to determine a sound abnormal video, for example, when the audio similarity parameter is greater than the second preset threshold, the first video is determined to be a sound abnormal video.
As can be seen from the above, the embodiment of the application may acquire a first video to be identified; perform audio-video separation on the first video to obtain first audio information and a first image frame of the first video; acquire a second video similar to the first video based on the first image frame; acquire second audio information of the second video; acquire audio similarity parameters of the first audio information and the second audio information; and perform audio anomaly identification on the first video based on the audio similarity parameters to obtain an identification result. By separating out the first image frame of the first video to be identified, finding a second video similar to the first video, and then comparing the audio of the first video with that of the second video, whether the first video is a sound-abnormal video can be identified accurately and quickly.
In order to better implement the above method, an embodiment of the present invention further provides a video recognition apparatus, where the video recognition apparatus may be specifically integrated in an electronic device, and the electronic device may be a terminal device, a server, a personal computer, or the like.
For example, in this embodiment, a video recognition device is integrated in a terminal device as an example, and the method in the embodiment of the present invention is described in detail.
For example, as shown in fig. 3a, the video recognition apparatus may include a first video acquisition unit 301, a separation unit 302, a second video acquisition unit 303, an audio acquisition unit 304, a calculation unit 305, and a recognition unit 306, as follows:
(1) a first video acquiring unit 301 for acquiring a first video.
(2) The separation unit 302 is configured to perform audio and video separation on the first video to obtain first audio information and a first image frame of the first video.
(3) A second video obtaining unit 303, configured to obtain a second video similar to the first video based on the first image frame.
In some embodiments, the second video obtaining unit 303 may specifically include a feature obtaining subunit, a set obtaining subunit, and a selecting subunit, as follows:
the characteristic obtaining subunit is used for obtaining the picture characteristic information of the first image frame;
a set obtaining subunit, configured to obtain, based on the picture feature information, a candidate similar video segment set having image frames similar to those of the first video, where the candidate similar video segment set includes multiple candidate similar video segments;
and the selecting subunit is used for selecting a second video similar to the first video based on the candidate similar video segments.
In an embodiment, the selecting subunit may be specifically configured to:
acquiring the segment similarity of the candidate similar video segment and the first video segment;
selecting a similar video clip corresponding to the first video clip from the candidate similar video clip set based on the clip similarity;
obtaining similar videos corresponding to the similar video clips to obtain a similar video set, wherein the similar video set comprises a plurality of similar videos;
counting similar video clips in the similar videos to obtain statistical parameters corresponding to the similar videos;
and selecting a second video similar to the first video from the similar video set based on the statistical parameters corresponding to the similar videos.
(4) An audio obtaining unit 304, configured to obtain second audio information of the second video.
(5) A calculating unit 305, configured to acquire an audio similarity parameter of the first audio information and the second audio information.
In an embodiment, referring to fig. 3b, the calculating unit 305 may specifically include a first dividing subunit 3051, a second dividing subunit 3052, a first obtaining subunit 3053, and a second obtaining subunit 3054, as follows:
a first dividing subunit 3051, configured to divide the first audio information segment into multiple first audio sub-segments;
a second dividing subunit 3052, configured to divide the second audio information segment into multiple second audio sub-segments based on the first audio sub-segments;
a first obtaining subunit 3053, configured to obtain sub-segment similarity parameters of the second audio sub-segments and the first audio sub-segments;
a second obtaining subunit 3054, configured to obtain, based on the sub-segment similarity parameters, an audio similarity parameter of the first audio information segment and the second audio information segment.
In an embodiment, the second obtaining subunit 3054 may be specifically configured to:
obtaining a comparison result between the sub-segment similarity parameters and a first preset threshold;
and acquiring audio similarity parameters of the first audio information segment and the second audio information segment based on the comparison result.
In an embodiment, the first obtaining subunit 3053 may be specifically configured to:
acquiring first audio feature information of the first audio sub-segment;
acquiring second audio feature information of the second audio sub-segment;
acquiring the feature similarity of the first audio feature information and the second audio feature information;
and acquiring the sub-segment similarity parameters of the second audio sub-segment and the first audio sub-segment based on the feature similarity.
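The sub-segment comparison described above can be sketched as below; treating each audio information segment as a raw sample array and using a pooled magnitude spectrum with cosine similarity are assumptions for the example, as the patent leaves the audio feature unspecified.

```python
import numpy as np


def split_into_subsegments(samples, sub_len):
    """Divide an audio information segment into fixed-length sub-segments
    (any trailing partial sub-segment is dropped)."""
    n = len(samples) // sub_len
    return [samples[i * sub_len:(i + 1) * sub_len] for i in range(n)]


def audio_feature(samples, n_bins=16):
    """Illustrative audio feature: the magnitude spectrum of the sub-segment,
    pooled into a fixed number of bins so equal-length sub-segments yield
    comparable feature vectors."""
    spectrum = np.abs(np.fft.rfft(samples))
    return np.array([b.mean() for b in np.array_split(spectrum, n_bins)])


def subsegment_similarity(first_sub, second_sub):
    """Sub-segment similarity parameter: cosine similarity of the two features."""
    f1, f2 = audio_feature(first_sub), audio_feature(second_sub)
    denom = np.linalg.norm(f1) * np.linalg.norm(f2)
    return float(f1 @ f2 / denom) if denom else 0.0
```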
(6) An identifying unit 306, configured to perform audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result.
In an embodiment, the identifying unit 306 may specifically be configured to:
when the audio similarity parameter is greater than a second preset threshold, determining that the first video is a video with abnormal audio.
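One plausible reading of the aggregation and decision steps, in which the audio similarity parameter is the fraction of sub-segment pairs falling below the first preset threshold (so a large parameter means the audio differs from the matched video's audio), can be sketched as follows; this reading is an assumption, since the patent only says the parameter is derived from the comparison results.

```python
def audio_similarity_parameter(subsegment_sims, first_threshold=0.7):
    """Audio similarity parameter of the two audio information segments:
    fraction of sub-segment pairs judged dissimilar, i.e. whose sub-segment
    similarity parameter is below the first preset threshold."""
    if not subsegment_sims:
        return 0.0
    mismatches = sum(1 for s in subsegment_sims if s < first_threshold)
    return mismatches / len(subsegment_sims)


def identify_audio_anomaly(subsegment_sims, first_threshold=0.7, second_threshold=0.5):
    """Flag the first video as a video with abnormal audio when the audio
    similarity parameter exceeds the second preset threshold."""
    return audio_similarity_parameter(subsegment_sims, first_threshold) > second_threshold
```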
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, the video identification apparatus of the present invention can acquire the first video through the first video acquisition unit; perform audio-video separation on the first video through the separation unit to obtain the first audio information and the first image frames of the first video; acquire, through the second video acquisition unit, a second video similar to the first video based on the first image frames; acquire the second audio information of the second video through the audio acquisition unit; acquire the audio similarity parameter of the first audio information and the second audio information through the calculating unit; and perform audio anomaly identification on the first video through the identifying unit based on the audio similarity parameter to obtain an identification result. By separating out the first image frames of the first video to be identified, finding a second video similar to the first video, and then comparing the audio of the first video with the audio of the second video, this application can accurately and quickly identify whether the first video is a video with abnormal audio.
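Putting the units together, the overall flow summarized above might be sketched as the pipeline below; every entry in `stubs` is a placeholder name for the corresponding unit, and real demultiplexing, retrieval, and feature extraction would back these callables in practice.

```python
def recognize_video(first_video, stubs):
    """End-to-end sketch of the claimed method. `stubs` supplies one callable
    (or value) per apparatus unit, keeping the pipeline implementation-agnostic."""
    # (2) Separate the first video into audio information and image frames.
    first_audio, first_frames = stubs["separate"](first_video)
    # (3) Retrieve a second video with similar image frames.
    second_video = stubs["find_similar"](first_frames)
    if second_video is None:
        # No similar video: nothing to compare the audio against.
        return {"abnormal": False, "similarity_parameter": None}
    # (4) Obtain the second video's audio information.
    second_audio = stubs["extract_audio"](second_video)
    # (5) Compute the audio similarity parameter.
    param = stubs["compare_audio"](first_audio, second_audio)
    # (6) Compare against the second preset threshold.
    return {"abnormal": param > stubs["threshold"], "similarity_parameter": param}
```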
The embodiment of the present invention further provides a terminal device, which may integrate any of the video identification apparatuses provided in the embodiments of the present invention; the terminal device may be a mobile phone, a tablet computer, a micro-processing box, an unmanned aerial vehicle, an image acquisition device, or the like.
For example, fig. 4 shows a schematic structural diagram of a terminal device according to an embodiment of the present invention. Specifically:
the terminal device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and input module 404 components. Those skilled in the art will appreciate that the server architecture shown in FIG. 4 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the server. In some embodiments, processor 401 may include one or more processing cores; in some embodiments, processor 401 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the terminal device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The terminal device further includes a power supply 403 for supplying power to the various components. In some embodiments, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power-consumption management are implemented through the power management system. The power supply 403 may also include one or more DC or AC power sources, a recharging system, power failure detection circuitry, a power converter or inverter, a power status indicator, and other such components.
The terminal device may also include an input module 404, the input module 404 being operable to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the terminal device may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 401 in the terminal device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions as follows:
acquiring a first video to be identified;
performing audio-video separation on the first video to obtain first audio information and a first image frame of the first video;
acquiring a second video similar to the first video based on the first image frame;
acquiring second audio information of the second video;
acquiring an audio similarity parameter of the first audio information and the second audio information;
and performing audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result.
The above detailed implementation of each operation can refer to the foregoing embodiments, and is not described herein again.
As can be seen from the above, the terminal device of this embodiment may acquire a first video to be identified; perform audio-video separation on the first video to obtain first audio information and first image frames of the first video; acquire a second video similar to the first video based on the first image frames; acquire second audio information of the second video; acquire an audio similarity parameter of the first audio information and the second audio information; and perform audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result. By separating out the first image frames of the first video to be identified, finding a second video similar to the first video, and then comparing the audio of the two videos, this application can accurately and quickly identify whether the first video is a video with abnormal audio.
It will be understood by those of ordinary skill in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any of the video identification methods provided in the embodiments of the present application. For example, the instructions may perform the following steps:
acquiring a first video to be identified;
performing audio-video separation on the first video to obtain first audio information and a first image frame of the first video;
acquiring a second video similar to the first video based on the first image frame;
acquiring second audio information of the second video;
acquiring an audio similarity parameter of the first audio information and the second audio information;
and performing audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result.
The computer-readable storage medium may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, or the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any video identification method provided in the embodiments of the present application, the beneficial effects achievable by any video identification method provided in the embodiments of the present application can likewise be achieved; for details, see the foregoing embodiments, which are not described herein again.
The video identification method and apparatus provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, for those skilled in the art, there may be changes in the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (11)

1. A video recognition method, comprising:
acquiring a first video to be identified;
performing audio-video separation on the first video to obtain first audio information and a first image frame of the first video;
acquiring a second video similar to the first video based on the first image frame;
acquiring second audio information of the second video;
acquiring an audio similarity parameter of the first audio information and the second audio information;
and performing audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result.
2. The video identification method of claim 1, wherein said obtaining a second video similar to the first video based on the first image frame comprises:
acquiring picture feature information of the first image frame;
acquiring, based on the picture feature information, a candidate similar video segment set having image frames similar to those of the first video, wherein the candidate similar video segment set comprises a plurality of candidate similar video segments;
and selecting a second video similar to the first video based on the candidate similar video segments.
3. The video identification method of claim 2, wherein the first video comprises a plurality of first video segments, and the selecting a second video similar to the first video based on the candidate similar video segments comprises:
acquiring the segment similarity between the candidate similar video segments and the first video segments;
selecting, from the candidate similar video segment set, similar video segments corresponding to the first video segments based on the segment similarity;
and selecting a second video similar to the first video based on the similar video segments.
4. The video identification method of claim 3, wherein the selecting a second video similar to the first video based on the similar video segments comprises:
obtaining the similar videos to which the similar video segments belong, to obtain a similar video set, wherein the similar video set comprises a plurality of similar videos;
counting the similar video segments in each similar video to obtain a statistical parameter corresponding to that similar video;
and selecting, from the similar video set, a second video similar to the first video based on the statistical parameters corresponding to the similar videos.
5. The video identification method of claim 1, further comprising, before the acquiring an audio similarity parameter of the first audio information and the second audio information:
counting time information of similar video segments in the second video;
segmenting, from the second audio information, a second audio information segment for audio comparison based on the time information;
and segmenting, from the first audio information, a first audio information segment corresponding to the second audio information segment;
wherein the acquiring an audio similarity parameter of the first audio information and the second audio information comprises:
acquiring an audio similarity parameter of the first audio information segment and the second audio information segment.
6. The video identification method of claim 5, wherein the acquiring an audio similarity parameter of the first audio information segment and the second audio information segment comprises:
dividing the first audio information segment into a plurality of first audio sub-segments;
dividing the second audio information segment into a plurality of second audio sub-segments based on the first audio sub-segments;
acquiring sub-segment similarity parameters of the second audio sub-segments and the first audio sub-segments;
and acquiring the audio similarity parameter of the first audio information segment and the second audio information segment based on the sub-segment similarity parameters.
7. The video identification method of claim 6, wherein the acquiring the audio similarity parameter of the first audio information segment and the second audio information segment based on the sub-segment similarity parameters comprises:
obtaining a comparison result between the sub-segment similarity parameters and a first preset threshold;
and acquiring the audio similarity parameter of the first audio information segment and the second audio information segment based on the comparison result.
8. The video identification method of claim 6, wherein the acquiring sub-segment similarity parameters of the second audio sub-segments and the first audio sub-segments comprises:
acquiring first audio feature information of the first audio sub-segment;
acquiring second audio feature information of the second audio sub-segment;
acquiring the feature similarity of the first audio feature information and the second audio feature information;
and acquiring the sub-segment similarity parameters of the second audio sub-segment and the first audio sub-segment based on the feature similarity.
9. The video identification method of claim 1, wherein the performing audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result comprises:
when the audio similarity parameter is greater than a second preset threshold, determining that the first video is a video with abnormal audio.
10. The video recognition method of claim 1, further comprising:
storing the identification result to a blockchain.
11. A video recognition apparatus, comprising:
a first video acquisition unit configured to acquire a first video;
a separation unit, configured to perform audio-video separation on the first video to obtain first audio information and a first image frame of the first video;
a second video acquisition unit, configured to acquire a second video similar to the first video based on the first image frame;
an audio acquisition unit, configured to acquire second audio information of the second video;
a calculating unit, configured to acquire an audio similarity parameter of the first audio information and the second audio information;
and an identifying unit, configured to perform audio anomaly identification on the first video based on the audio similarity parameter to obtain an identification result.
CN201910926328.9A 2019-09-27 2019-09-27 Video identification method and device Active CN110677718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910926328.9A CN110677718B (en) 2019-09-27 2019-09-27 Video identification method and device

Publications (2)

Publication Number Publication Date
CN110677718A CN110677718A (en) 2020-01-10
CN110677718B true CN110677718B (en) 2021-07-23

Family

ID=69079740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910926328.9A Active CN110677718B (en) 2019-09-27 2019-09-27 Video identification method and device

Country Status (1)

Country Link
CN (1) CN110677718B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112040287A (en) * 2020-08-31 2020-12-04 聚好看科技股份有限公司 Display device and video playing method
CN113762040A (en) * 2021-04-29 2021-12-07 腾讯科技(深圳)有限公司 Video identification method and device, storage medium and computer equipment
CN114969428B (en) * 2022-07-27 2022-12-16 深圳市海美迪科技股份有限公司 Big data based audio and video intelligent supervision system and method
CN116567351B (en) * 2023-07-06 2023-09-12 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103067778A (en) * 2013-01-06 2013-04-24 北京华兴宏视技术发展有限公司 Data monitoring system and data monitoring method
CN104252480A (en) * 2013-06-27 2014-12-31 深圳市腾讯计算机系统有限公司 Method and device for audio information retrieval
CN105550257A (en) * 2015-12-10 2016-05-04 杭州当虹科技有限公司 Audio and video fingerprint identification method and tampering prevention system based on audio and video fingerprint streaming media
CN105810213A (en) * 2014-12-30 2016-07-27 浙江大华技术股份有限公司 Typical abnormal sound detection method and device
CN106484837A (en) * 2016-09-30 2017-03-08 腾讯科技(北京)有限公司 The detection method of similar video file and device
CN107124648A (en) * 2017-04-17 2017-09-01 浙江德塔森特数据技术有限公司 The method that advertisement video is originated is recognized by intelligent terminal
CN109087669A (en) * 2018-10-23 2018-12-25 腾讯科技(深圳)有限公司 Audio similarity detection method, device, storage medium and computer equipment
CN110047513A (en) * 2019-04-28 2019-07-23 秒针信息技术有限公司 A kind of video monitoring method, device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9313359B1 (en) * 2011-04-26 2016-04-12 Gracenote, Inc. Media content identification on mobile devices
US9432720B2 (en) * 2013-12-09 2016-08-30 Empire Technology Development Llc Localized audio source extraction from video recordings
US10867185B2 (en) * 2017-12-22 2020-12-15 Samuel Chenillo System and method for media segment identification



Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40018648

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant