CN106601243B

CN106601243B - Video file identification method and device

Info

Publication number: CN106601243B
Application number: CN201510683009.1A
Authority: CN
Inventors: 谷长信
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2015-10-20
Filing date: 2015-10-20
Publication date: 2020-11-06
Anticipated expiration: 2035-10-20
Also published as: CN106601243A; WO2017067400A1

Abstract

The invention discloses a video file identification method and a video file identification device, wherein the method comprises the steps of firstly obtaining audio information from a video file to be identified, extracting audio fingerprints by segmenting the audio information, and performing audio matching with a training sample to judge whether the video file is a target video; then, the suspicious video files which cannot be confirmed are continuously identified through image matching. The device comprises an audio preprocessing module, an audio fingerprint matching module, an audio judging module, an image preprocessing module and a comprehensive judging module. The method and the device have high processing efficiency and high recognition rate.

Description

Video file identification method and device

Technical Field

The invention belongs to the technical field of computer data processing, and particularly relates to a video file identification method and device.

Background

With the spread of the internet, more and more users store personal video files on cloud servers provided by internet service providers, some of which also allow users to upload video files and share them with other users on the network. However, the law imposes strict review requirements on video files transmitted over the internet, which must not contain pornographic or violent content. Internet service providers therefore have a responsibility and obligation to review and supervise, in accordance with national standards, the video files that users upload and that the provider serves.

In the prior art, video files are audited on the basis of the video images alone, by capturing picture frames from the video, which leads to the following problems:

Low processing efficiency: the range of the video from which frames should be captured cannot be located effectively, so a comprehensive audit requires capturing an extremely large number of frames;

A single means of identification and a low recognition rate: relying on picture identification alone gives a high probability of missed and false identifications.

Disclosure of Invention

The invention aims to provide a video file identification method and a video file identification device that first apply audio fingerprint identification and then use video frame capture for further picture identification, finally giving an identification result and effectively improving processing efficiency.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A video file identification method for auditing a video file to be identified comprises the following steps:

acquiring audio information from a video file to be identified;

segmenting the acquired audio information, and performing fingerprint extraction on the segmented audio segment to obtain an audio fingerprint of the audio segment;

carrying out audio matching on the audio fingerprints of the obtained audio segments and the trained training samples, and recording an audio matching result;

judging whether the video file to be identified is a target video according to the audio matching result, terminating identification when the video file is judged to be, or not to be, the target video, and entering the next step to continue identification when the video file is judged to be a suspicious video file;

according to the audio matching result, starting to capture frames of the video file from the starting time of the successfully matched audio segment, capturing video images, performing image matching on the captured video images, and recording the image matching result;

and judging whether the video file to be identified is the target video or not according to the image matching result or the image matching result and the audio matching result.

One implementation of segmenting the acquired audio information comprises the following steps:

finding out all volume peak points exceeding a specified threshold value from the audio information in a time domain;

and sampling the audio segments according to the fixed time length from each peak point in sequence to obtain each audio segment.

Another implementation manner of segmenting the acquired audio information according to the present invention includes:

and sampling the audio information according to a fixed time length to obtain each audio segment.

Further, the audio matching result includes: the number of times of successful matching, the starting time of the successfully matched audio segments and the labeling information of the training samples matched with the successfully matched audio segments; the labeling information includes: sample duration, content rating, and manual classification labels.

Further, the determining whether the video file to be identified is the target video according to the audio matching result includes:

when the number of successful matches is greater than a first threshold, judging that the video file to be identified is a target video;

when the number of successful matches is smaller than a second threshold, judging that the video file to be identified is not the target video;

and when the number of successful matches lies between the second threshold and the first threshold, calculating the audio matching probability corresponding to the current matching result; when the calculated matching probability is greater than a set third threshold, judging that the video file to be identified is a target video, and otherwise regarding the video file to be identified as a suspicious video file.

Wherein calculating the audio matching probability corresponding to the current matching result comprises:

calculating, from the number X of successful matches and the total number Z of all audio segments, the ratio P₁ of the two:

P₁ = X / Z

and calculating the audio matching probability R₁ corresponding to the current matching result by the formula:

R₁ = P₁ * P(Y)

wherein R₁ is the audio matching probability corresponding to the current matching result, and P(Y) is the sum of the weights corresponding to the content levels of all the training samples matching the audio fingerprints of the audio segments.

Further, the determining whether the video file to be identified is the target video according to the image matching result or the image matching result and the audio matching result includes:

calculating the video matching probability R₂ according to the image matching result, where R₂ is the ratio of the number of successfully matched captured video images to the total number of captured video images;

calculating the comprehensive matching probability R′ of the current matching according to the video matching probability R₂ and the audio matching probability R₁; if the comprehensive matching probability R′ exceeds a fourth threshold, the video file to be identified is judged to be a target video, and otherwise it is judged to be a normal video;

wherein, the calculation formula of the comprehensive matching probability R' is as follows:

R′＝R₁*α+R₂*β

where α and β are the weights of the audio matching probability and the video matching probability, respectively.

The invention also provides a video file identification device, which is used for checking the video file to be identified, and the device comprises:

the audio preprocessing module is used for acquiring audio information from a video file to be identified, segmenting the acquired audio information, and performing fingerprint extraction on the segmented audio segment to obtain an audio fingerprint of the audio segment;

the audio fingerprint matching module is used for performing audio matching on the obtained audio fingerprints of the audio segments and the trained training samples and recording audio matching results;

the audio judging module is used for judging whether the video file to be identified is a target video according to the audio matching result, terminating identification when the video file is judged to be, or not to be, the target video, and handing over to the image preprocessing module when the video file is judged to be a suspicious video file;

the image preprocessing module is used for capturing frames of the video file from the starting time of the successfully matched audio segment according to the audio matching result and capturing video images;

the image matching module is used for carrying out image matching on the captured video images and recording image matching results;

and the comprehensive judgment module is used for judging whether the video file to be identified is the target video according to the image matching result or the image matching result and the audio matching result.

The invention provides a video file identification method and device in which the audio of a video file is first identified quickly by means of audio fingerprint identification and the start time of each match is recorded; frames are then captured at intervals starting from those time points for further picture identification, and finally an identification result is given. The method offers high processing efficiency and a high recognition rate.

Drawings

FIG. 1 is a flow chart of a video file identification method of the present invention;

FIG. 2 is a schematic structural diagram of a video file identification device of the present invention.

Detailed Description

The technical solutions of the present invention are further described in detail below with reference to the drawings and examples, which should not be construed as limiting the present invention.

The current popular formats of video files include AVI format, MOV format, MPEG format, RM format, ASF format, etc., and a complete video file includes both video image and audio information. The general idea of the invention is to extract audio information from a video file, identify the extracted audio information, then capture frames of video images according to the identification result, and further identify the captured video images.

The following description takes the identification of videos involving pornography or violence as an example; the same approach applies to other types of video files. As shown in FIG. 1, a video file identification method includes the following steps:

Step S1: acquire audio information from the video file to be identified.

In this embodiment, the audio information can be obtained by decoding the video file directly and extracting the audio, or it can be extracted with third-party software. Audio extraction is a mature technology and is not described further here.
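
As an illustration of extraction with third-party software, the sketch below builds a command line for the ffmpeg tool. The choice of ffmpeg, the function name `build_audio_extract_command`, the mono downmix and the 16 kHz sample rate are all assumptions made for this example; the patent does not name any particular tool or format.

```python
from typing import List


def build_audio_extract_command(video_path: str, wav_path: str,
                                sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes the audio track as mono WAV.

    Hypothetical helper: the tool and all parameters are assumptions.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video file
        "-vn",                    # drop the video stream, keep audio only
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the chosen rate
        "-f", "wav",              # write a WAV container
        wav_path,
    ]
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.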

Step S2: segment the acquired audio information, and perform fingerprint extraction on each audio segment to obtain the audio fingerprint corresponding to that segment.

The audio identification of the invention is based on audio fingerprinting technology. An audio fingerprint is a compact, content-based digital signature that represents the important acoustic characteristics of a piece of sound. Its main purpose is to provide an effective mechanism for comparing the perceptual auditory quality of two audio files, and it can be used in applications such as audio identification and content integrity verification.

After the audio information is stripped from the video file, the total playing duration T (milliseconds) and the total length L (bytes) of the extracted audio information can be obtained. The audio information is then divided into a plurality of audio segments, a fingerprint is extracted from each audio segment, and the extracted audio fingerprints are compared with the training samples. The training samples are likewise obtained by segmenting audio according to the same method and training on the result.

The following two methods illustrate specific ways of segmenting the audio information:

Method one: segmentation by volume in the time domain.

In the time domain, the volume of the audio information varies along the time axis, so the audio information appears as a fluctuating waveform. A volume threshold is set, and all volume peak points exceeding the specified threshold are found in the audio information in the time domain. The peaks are denoted (k₁, k₂, k₃, ..., kₙ), and the coordinate on the time axis corresponding to each peak point is recorded; this coordinate is the time offset p of the peak point in the audio information.

Sampling is then performed from each peak point in sequence with a fixed duration w to obtain audio segments, and an audio fingerprint is extracted from each, giving n audio fingerprints to compare with the training samples.

The starting point of each audio segment is the time corresponding to its peak point, which can be calculated as T * (p / L).
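
Method one can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the audio is available as a list of per-sample volume values, measures the offset p and the total length L in samples rather than bytes, and treats a peak as a local maximum above the threshold. The function names are hypothetical.

```python
from typing import List, Tuple


def find_volume_peaks(volume: List[float], threshold: float) -> List[int]:
    """Return the offsets of local maxima whose volume exceeds the threshold."""
    peaks = []
    for i in range(1, len(volume) - 1):
        v = volume[i]
        if v > threshold and v > volume[i - 1] and v >= volume[i + 1]:
            peaks.append(i)
    return peaks


def segments_from_peaks(volume: List[float], threshold: float,
                        total_ms: int, window_ms: int) -> List[Tuple[float, float]]:
    """Cut one fixed-length segment starting at each peak.

    The start time follows T * (p / L), with p and L counted in samples
    here (an assumption; the text states L in bytes).
    """
    length = len(volume)
    segments = []
    for p in find_volume_peaks(volume, threshold):
        start_ms = total_ms * p / length
        end_ms = min(start_ms + window_ms, total_ms)  # clip at end of audio
        segments.append((start_ms, end_ms))
    return segments
```

For ten volume values spanning 10 seconds with peaks at offsets 2 and 6, a window w of 1000 ms gives segments starting at 2000 ms and 6000 ms.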

Method two: segmentation at fixed intervals.

The audio information is sampled with a fixed duration w to obtain the audio segments f₁, f₂, f₃, ..., fₘ, and an audio fingerprint is extracted from each for comparison with the training samples.

The start of each audio segment can be calculated from the fixed duration; the start times of the audio segments are T * (fᵢ - 1) / L, where i ranges from 1 to m.
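
Method two reduces to stepping through the audio in windows of length w. The sketch below works directly in milliseconds instead of byte offsets, and keeping a final segment shorter than w is an assumption of this example; the function name is hypothetical.

```python
from typing import List, Tuple


def fixed_segments(total_ms: int, window_ms: int) -> List[Tuple[int, int]]:
    """Split [0, total_ms) into consecutive windows of window_ms each."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    segments = []
    start = 0
    while start < total_ms:
        # The last window may be shorter than window_ms (assumption).
        segments.append((start, min(start + window_ms, total_ms)))
        start += window_ms
    return segments
```

With w set to 1000 ms, a 3.5 second audio track yields four segments, the last one 500 ms long.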

The fixed duration w matches the duration of the training samples in the training sample library, for example 1 second. In a video file involving pornography or violence, the video images that accompany higher volume are often the ones that most need attention. Method one is therefore preferred, since it identifies such video files more easily and quickly: the peak points are sorted by volume, and the audio around the highest peaks is segmented first.

Specifically, fingerprint extraction is performed on each audio segment; extraction algorithms, such as those based on the fast Fourier transform, are not described further here. The audio fingerprint corresponding to each audio segment is obtained so that it can be compared with the trained training samples in the subsequent steps.
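
Since the text leaves the extraction algorithm open, the following is only a toy illustration of a Fourier-transform-based fingerprint: it splits a segment into frames, runs a radix-2 FFT on each frame, and records the strongest non-DC frequency bin. The frame size, the use of dominant bins, and the function names are assumptions of this sketch; practical fingerprinting schemes are considerably more elaborate.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fingerprint(samples: Sequence[float], frame_size: int = 64) -> Tuple[int, ...]:
    """Return the dominant non-DC frequency bin of each full frame."""
    bins = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        spectrum = fft(samples[start:start + frame_size])
        # Only bins 1 .. frame_size/2 - 1 are examined (real input is symmetric).
        magnitudes = [abs(c) for c in spectrum[1:frame_size // 2]]
        bins.append(1 + magnitudes.index(max(magnitudes)))
    return tuple(bins)
```

Two segments with similar spectral content then produce fingerprints that agree in most positions, which is what a similarity comparison needs.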

Step S3: perform audio matching between the audio fingerprints of the audio segments and the trained training samples, and record the audio matching result.

In this embodiment, the training samples are obtained by training on a large amount of audio from various types of videos involving pornography or violence, and labeling information is added to each training sample. The labeling information mainly comprises the sample duration, the content level, manual classification labels and the like; in this embodiment, the content level is the level of pornography or violence.

The audio fingerprint of each audio segment is matched against the training samples; if the recognition similarity between them is greater than a set audio similarity threshold, the match is deemed successful. All audio segments are traversed and the audio matching result is recorded, comprising: the number of successful matches, the start times of the successfully matched audio segments, and the labeling information of the training samples matched by those segments.
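
The traversal and the recorded result can be sketched as below. The class and function names are hypothetical, the similarity measure (fraction of agreeing fingerprint positions) is a stand-in for whatever measure an implementation actually uses, and recording only the best-matching training sample per segment is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

Fingerprint = Tuple[int, ...]


@dataclass
class TrainingSample:
    fingerprint: Fingerprint
    duration_ms: int      # labeling information: sample duration
    content_level: int    # labeling information: content level
    label: str            # labeling information: manual classification label


@dataclass
class AudioMatchResult:
    match_count: int = 0
    start_times_ms: List[int] = field(default_factory=list)
    matched_samples: List[TrainingSample] = field(default_factory=list)


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Fraction of positions at which two equal-length fingerprints agree."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def match_segments(segments: Sequence[Tuple[int, Fingerprint]],
                   samples: Sequence[TrainingSample],
                   threshold: float) -> AudioMatchResult:
    """Traverse all (start_ms, fingerprint) segments and record the matches."""
    result = AudioMatchResult()
    for start_ms, fp in segments:
        best: Optional[TrainingSample] = max(
            samples, key=lambda s: similarity(fp, s.fingerprint), default=None)
        if best is not None and similarity(fp, best.fingerprint) > threshold:
            result.match_count += 1
            result.start_times_ms.append(start_ms)
            result.matched_samples.append(best)
    return result
```

The recorded start times are what the later frame capture step consumes, and the content levels of the matched samples feed the audio matching probability.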

Step S4: judge whether the video file to be identified is the target video according to the audio matching result; terminate identification when the video file is judged to be, or not to be, the target video, and proceed to the next step to continue identification when it is judged to be a suspicious video file.

Specifically, this embodiment judges whether the video file to be identified is the target video as follows:

when the number of successful matches is greater than a first threshold (for example, 20), the video file to be identified is judged to be the target video, and identification terminates;

when the number of successful matches is smaller than a second threshold (for example, 2), the video file to be identified is judged not to be the target video, and identification terminates;

when the number of successful matches lies between the second threshold and the first threshold, the audio matching probability corresponding to the current matching result is calculated; when the calculated matching probability is greater than a set third threshold (for example, a specific numerical value T), the video file to be identified is judged to be the target video, and otherwise it is regarded as a suspicious video file whose identification continues.

Assuming that the number of successful matches is X and the total number of audio segments to be matched is Z, the ratio P₁ of the number of successful matches to the total number of audio segments is:

P₁ = X / Z

This embodiment calculates the audio matching probability R₁ corresponding to the current matching result by the formula:

R₁ = P₁ * P(Y)

where R₁ is the audio matching probability corresponding to the current matching result, P₁ is the ratio of the number of successful matches to the total number of audio segments, and P(Y) is the sum of the weights corresponding to the content levels of all training samples that match the audio fingerprints of the audio segments.

Specifically, if the training sample matched by an audio segment has pornography or violence level Yᵢ, its corresponding weight is P(Yᵢ), and P(Y) = Σ P(Yᵢ).

After the audio matching probability R₁ corresponding to the current matching result is obtained, R₁ is compared with the set third threshold; if it is higher than the third threshold, the video file is judged to be a target video, and otherwise the video images are examined further.
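
The calculation of R₁ can be illustrated with a short function. The mapping from content level to weight is not specified in the text, so the `level_weights` dictionary and its values in the usage note are purely hypothetical, as is the function name.

```python
from typing import Dict, Sequence


def audio_match_probability(match_count: int, total_segments: int,
                            matched_levels: Sequence[int],
                            level_weights: Dict[int, float]) -> float:
    """Compute R1 = P1 * P(Y).

    P1 is X / Z, and P(Y) sums the weight of the content level of every
    matched training sample. The weights are assumed inputs.
    """
    if total_segments <= 0:
        raise ValueError("total_segments must be positive")
    p1 = match_count / total_segments
    p_y = sum(level_weights[level] for level in matched_levels)
    return p1 * p_y
```

For example, with X = 5 matches out of Z = 20 segments, matched content levels [1, 1, 2, 3, 3] and assumed weights {1: 0.05, 2: 0.1, 3: 0.2}, P₁ is 0.25, P(Y) is 0.6 and R₁ is approximately 0.15.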

The above-mentioned determination steps are only a specific embodiment, wherein the first threshold, the second threshold, and the third threshold may be adjusted to make the determination result more accurate. An intermediate threshold value can be further set between the first threshold value and the second threshold value, for example, 10 times, when the number of times of successful matching is greater than the intermediate threshold value, the audio matching probability corresponding to the matching result of this time is calculated, and the judgment is performed according to the calculated audio matching probability; if the matching success frequency is smaller than the intermediate threshold and larger than the second threshold, the audio matching probability corresponding to the matching result is not calculated, and the next step is directly carried out, so that the video image needs to be further judged. The present invention is not limited to the specific determination steps, and will not be described in detail below.
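
The judgment of step S4, including the optional intermediate threshold, might look like the sketch below. The default values 20 and 2 follow the examples in the text, while the default third threshold of 0.5, the handling of a count exactly equal to a threshold, and all names are assumptions of this example.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    TARGET = "target"
    NOT_TARGET = "not_target"
    SUSPICIOUS = "suspicious"


def judge_by_audio(match_count: int, r1: float,
                   first: int = 20, second: int = 2,
                   third: float = 0.5,
                   intermediate: Optional[int] = None) -> Verdict:
    """Apply the three-threshold rule to the audio matching result."""
    if match_count > first:
        return Verdict.TARGET
    if match_count < second:
        return Verdict.NOT_TARGET
    # Optional variant: below the intermediate threshold, skip the
    # probability check and go straight to image identification.
    if intermediate is not None and match_count <= intermediate:
        return Verdict.SUSPICIOUS
    return Verdict.TARGET if r1 > third else Verdict.SUSPICIOUS
```

Only the `SUSPICIOUS` verdict leads on to frame capture; the other two verdicts terminate identification.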

Step S5: according to the matching result of the audio segments, capture frames of the video file starting from the start time of each successfully matched audio segment, perform image matching on the captured video images, and record the image matching result.

The matching in step S3 shows which audio segments were successfully matched. The start time of each successfully matched audio segment recorded in the matching result is mapped to the corresponding time point in the video file, and frames are captured from that time point to obtain video images; the frame capture interval can be determined according to the actual situation.
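
The choice of capture times can be sketched as follows. The text fixes only the starting points; capturing over the length of one audio segment and the specific interval are assumptions of this example, and the function name is hypothetical.

```python
from typing import List, Sequence


def frame_timestamps(start_times_ms: Sequence[int], window_ms: int,
                     interval_ms: int) -> List[int]:
    """List the capture times for all successfully matched audio segments.

    Frames are taken every interval_ms within window_ms of each start
    time; duplicate times from overlapping segments are merged.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    stamps = set()
    for start in start_times_ms:
        t = start
        while t < start + window_ms:
            stamps.add(t)
            t += interval_ms
    return sorted(stamps)
```

Restricting capture to these times is what keeps the number of frames small compared with sampling the whole video.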

The captured video images are then identified; in this embodiment, this means determining whether they are images involving pornography or violence. The identification can be performed by a human reviewer or by a computer. For computer identification, training samples must likewise be obtained by training on a large number of video images involving pornography or violence. The captured video images are matched against the training samples to obtain the recognition similarity of each image; if the recognition similarity is greater than a set image similarity threshold, the match is deemed successful, and the image matching result, namely the number of successful image matches, is recorded.

Step S6: judge whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.

After image matching is finished, the video matching probability R₂ can be calculated from the number of successful matches; R₂ is the ratio of the number of successfully matched captured video images to the total number of captured video images.

The comprehensive matching probability R′ of the current matching is calculated from the video matching probability R₂ and the audio matching probability R₁. If the comprehensive matching probability exceeds a fourth threshold, the video file to be identified is judged to be the target video; otherwise it is judged to be a normal video.

The calculation formula of the comprehensive matching probability R' is as follows:

R′＝R₁*α+R₂*β

where α and β are the weights of the audio matching probability and the video matching probability, respectively. The judgment is made according to the resulting comprehensive matching probability: if it exceeds the identification threshold, the video file to be identified is judged to be a target video, and otherwise it is judged to be a normal video.
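
The final judgment can be written out in a few lines. The weights and the threshold used in the usage note are arbitrary illustrative values, and the function names are hypothetical.

```python
def video_match_probability(matched_images: int, total_images: int) -> float:
    """R2: share of captured images that matched a training sample."""
    if total_images <= 0:
        return 0.0  # no frames captured (assumption: treat as no evidence)
    return matched_images / total_images


def combined_probability(r1: float, r2: float,
                         alpha: float, beta: float) -> float:
    """R' = R1 * alpha + R2 * beta."""
    return r1 * alpha + r2 * beta


def is_target_video(r1: float, r2: float, alpha: float, beta: float,
                    fourth_threshold: float) -> bool:
    """Judge the file as a target video when R' exceeds the threshold."""
    return combined_probability(r1, r2, alpha, beta) > fourth_threshold
```

For instance, with R₁ = 0.15, six matches out of ten captured images (R₂ = 0.6), α = 0.4 and β = 0.6, R′ is approximately 0.42, so the file would be judged a target video under a threshold of 0.4 and a normal video under a threshold of 0.5.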

Alternatively, whether the video file to be identified involves pornography or violence can be judged directly from the number of successful image matches or from the video matching probability R₂; for example, if the number of successful image matches or the video matching probability R₂ is greater than a set threshold, the video file is judged to be a video file involving pornography or violence. The present invention is not limited to specific judgment conditions.

It should be noted that matching the audio fingerprints of the audio segments with the training samples to calculate their identification similarities, or matching the video images with the training samples to calculate their identification similarities are all currently mature technologies, and for example, the calculation may be performed by a maximum likelihood estimation method, which is not described herein again.

Fig. 2 shows a video file identification apparatus corresponding to the above method, comprising:

The audio preprocessing module segments the acquired audio information, and may segment the audio information according to the volume level in the time domain or according to a fixed interval, which corresponds to the specific audio segmentation method in the method and is not described herein again.

Similarly, the operations performed by the audio determining module and the comprehensive determining module in the specific determination correspond to the specific steps of step S4 and step S6, and are not described herein again.
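
One possible way to wire the modules together is sketched below. The class name, the callable signatures and the string verdicts are all assumptions of this example; the text specifies only the responsibilities of the modules and the order in which they run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class VideoFileIdentifier:
    """Hypothetical composition of the modules; each field is one module."""
    audio_preprocess: Callable[[str], list]               # path -> audio fingerprints
    audio_match: Callable[[list], Tuple[int, List[int], float]]  # -> (count, starts, R1)
    audio_judge: Callable[[int, float], str]              # -> "target"/"normal"/"suspicious"
    image_preprocess: Callable[[str, List[int]], list]    # -> captured frames
    image_match: Callable[[list], float]                  # -> R2
    comprehensive_judge: Callable[[float, float], bool]   # (R1, R2) -> is target

    def identify(self, video_path: str) -> str:
        fingerprints = self.audio_preprocess(video_path)
        match_count, start_times, r1 = self.audio_match(fingerprints)
        verdict = self.audio_judge(match_count, r1)
        if verdict != "suspicious":
            return verdict  # audio alone was conclusive; stop here
        frames = self.image_preprocess(video_path, start_times)
        r2 = self.image_match(frames)
        return "target" if self.comprehensive_judge(r1, r2) else "normal"
```

Passing the modules in as callables keeps the control flow of FIG. 1 in one place and lets each module be replaced or tested on its own.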

The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art can make various corresponding changes and modifications according to the present invention without departing from the spirit and the essence of the present invention, but these corresponding changes and modifications should fall within the protection scope of the appended claims.

Claims

1. A video file identification method is used for auditing a video file to be identified, and is characterized by comprising the following steps:

acquiring audio information from a video file to be identified;

judging whether the video file to be identified is a target video or not according to the image matching result or the image matching result and the audio matching result;

wherein the audio matching result comprises: the number of times of successful matching, the starting time of the successfully matched audio segments and the labeling information of the training samples matched with the successfully matched audio segments;

the labeling information includes: sample duration, content level and manual classification label;

the judging whether the video file to be identified is the target video according to the audio matching result comprises the following steps:

2. The method for identifying a video file according to claim 1, wherein said segmenting the acquired audio information comprises:

3. The method for identifying a video file according to claim 1, wherein said segmenting the acquired audio information comprises:

4. The method of claim 1, wherein the calculating the audio matching probability corresponding to the current matching result comprises:

calculating the ratio P₁ of the number X of successful matches to the total number Z of all audio segments:

P₁ = X / Z

calculating the audio matching probability R₁ corresponding to the current matching result by the formula:

R₁ = P₁ * P(Y)

wherein R₁ is the audio matching probability corresponding to the current matching result, P(Y) is the sum of the weights corresponding to the content levels of all the training samples matched with the audio fingerprints of the audio segments, and Y is the content level.

5. The video file identification method according to claim 4, wherein the determining whether the video file to be identified is the target video according to the image matching result or the image matching result and the audio matching result comprises:

calculating the video matching probability R₂ according to the image matching result, where R₂ is the ratio of the number of successfully matched captured video images to the total number of captured video images;

calculating the comprehensive matching probability R′ of the current matching according to the video matching probability R₂ and the audio matching probability R₁; if the comprehensive matching probability exceeds a fourth threshold, the video file to be identified is judged to be a target video, and otherwise it is judged to be a normal video;

R′＝R₁*α+R₂*β

6. A video file identification apparatus for auditing video files to be identified, the apparatus comprising:

the comprehensive judgment module is used for judging whether the video file to be identified is the target video or not according to the image matching result or the image matching result and the audio matching result;

wherein the audio matching result comprises: the number of times of successful matching, the starting time of the successfully matched audio segments and the labeling information of the training samples matched with the successfully matched audio segments; the labeling information includes: sample duration, content level and manual classification label;

the audio judging module judges whether the video file to be identified is a target video according to the audio matching result, and executes the following operations:

7. The apparatus according to claim 6, wherein the audio preprocessing module segments the acquired audio information, and specifically performs the following operations:

8. The apparatus according to claim 6, wherein the audio preprocessing module segments the acquired audio information, and specifically performs the following operations:

9. The apparatus according to claim 6, wherein said calculating the audio matching probability corresponding to the current matching result comprises:

R₁＝P₁*P(Y)

wherein R₁ is the audio matching probability corresponding to the current matching result, P(Y) is the sum of the weights corresponding to the content levels of all the training samples matched with the audio fingerprints of the audio segments, and Y is the content level.

10. The video file identification device of claim 9, wherein the comprehensive judgment module judges whether the video file to be identified is the target video according to the image matching result or the image matching result and the audio matching result, and executes the following operations:

R′＝R₁*α+R₂*β