WO2017067400A1 - Video file identification method and device - Google Patents

Video file identification method and device

Info

Publication number
WO2017067400A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
video file
video
matching
identified
Prior art date
Application number
PCT/CN2016/101733
Other languages
English (en)
French (fr)
Inventor
谷长信
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2017067400A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Definitions

  • the invention belongs to the technical field of computer data processing, and in particular relates to a video file identification method and device.
  • With the spread of the Internet, more and more users store personal video files on cloud servers provided by Internet service providers.
  • Some Internet service providers also allow users to upload video files to share with other users on the network.
  • However, the law imposes strict review requirements on video files distributed online, which must not contain pornographic or violent content. Internet service providers therefore have the responsibility and obligation to review and supervise, in accordance with national regulations, both the video files uploaded by users and those provided by the service providers themselves.
  • Existing techniques review video files on the basis of the video images alone, by capturing picture frames from the video, and suffer from the following problems:
  • Low processing efficiency: the range of frames to capture cannot be located effectively, so a comprehensive review requires capturing an extremely large number of frames.
  • A single means of identification and a low recognition rate: relying on image recognition alone gives a high probability of missed and false identifications.
  • The object of the present invention is to provide a video file identification method and device that combine audio fingerprint recognition with video frame capture for further image recognition and then give a final identification result, thereby effectively improving processing efficiency.
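  • A video file identification method for reviewing a video file to be identified, the method comprising:
  • obtaining audio information from the video file to be identified;
  • segmenting the obtained audio information and performing fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment;
  • matching the audio fingerprints of the audio segments against trained training samples and recording an audio matching result;
  • determining, according to the audio matching result, whether the video file to be identified is a target video; when it is determined to be the target video or determined not to be the target video, the identification is terminated; when it is determined to be a suspicious video file, the method proceeds to the next step to continue the identification;
  • according to the audio matching result, capturing frames from the video file starting at the start time of each successfully matched audio segment to obtain video images, performing image matching on the captured video images, and recording an image matching result;
  • determining whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.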
  • One implementation of segmenting the obtained audio information comprises:
  • finding, in the time domain, all volume peak points of the audio information that exceed a specified threshold;
  • sampling for a fixed duration starting from each peak point in turn to obtain the audio segments.
  • Another implementation of segmenting the obtained audio information comprises:
  • sampling the audio information at a fixed duration to obtain the audio segments.
  • Further, the audio matching result includes: the number of successful matches, the start times of the successfully matched audio segments, and the annotation information of the training samples that matched them; the annotation information includes: sample duration, content level, and manual classification label.
  • Further, determining according to the audio matching result whether the video file to be identified is the target video comprises:
  • when the number of successful matches is greater than a first threshold, determining that the video file to be identified is the target video;
  • when the number of successful matches is less than a second threshold, determining that the video file to be identified is not the target video;
  • when the number of successful matches lies between the first threshold and the second threshold, calculating the audio matching probability corresponding to the current matching result; when the calculated matching probability is greater than a set third threshold, determining that the video file to be identified is the target video, and otherwise treating it as a suspicious video file.
  • Calculating the audio matching probability corresponding to the current matching result comprises:
  • from the number of successful matches X and the total number of audio segments Z, calculating their ratio P1 as: P1 = X/Z
  • calculating the audio matching probability R1 corresponding to the current matching result as: R1 = P1*P(Y)
  • where R1 is the audio matching probability corresponding to the current matching result, and P(Y) is the sum of the weights corresponding to the content levels of all the training samples that matched the audio fingerprints of the audio segments.
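  • Further, determining whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result, comprises:
  • calculating an image matching probability R2 from the image matching result, where R2 is the ratio of the number of successfully matched captured video images to the total number of captured video images;
  • calculating the comprehensive matching probability R′ of the current matching from the video matching probability R2 and the audio matching probability R1; if the comprehensive matching probability exceeds a fourth threshold, the video file to be identified is determined to be the target video, and otherwise it is determined to be a normal video;
  • R′ = R1*α + R2*β
  • where α and β are the weights of the audio matching probability and the video matching probability, respectively.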
  • The present invention also provides a video file identification device for reviewing a video file to be identified, the device comprising:
  • an audio preprocessing module, configured to obtain audio information from the video file to be identified, segment the obtained audio information, and perform fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment;
  • an audio fingerprint matching module, configured to match the audio fingerprints of the audio segments against trained training samples and record an audio matching result;
  • an audio judgment module, configured to determine, according to the audio matching result, whether the video file to be identified is a target video; when it is determined to be the target video or determined not to be the target video, the identification is terminated, and when it is determined to be a suspicious video file, processing continues in the image preprocessing module;
  • an image preprocessing module, configured to capture frames from the video file, according to the audio matching result, starting at the start time of each successfully matched audio segment, to obtain video images;
  • an image matching module, configured to perform image matching on the captured video images and record an image matching result;
  • a comprehensive judgment module, configured to determine whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.
  • The video file identification method and device proposed by the invention use audio fingerprint recognition to quickly identify the audio of a video file and record the start times of the matches, then capture frames at intervals within the range of those start times for further image recognition, and finally give an identification result. They offer high processing efficiency and a high recognition rate.
  • FIG. 1 is a flowchart of a video file identification method according to the present invention.
  • FIG. 2 is a schematic structural diagram of a video file identification device according to the present invention.
  • Many video file formats are currently popular, including AVI, MOV, MPEG, RM, and ASF. A complete video file consists of two parts: video images and audio information.
  • the general idea of the present invention is to extract audio information from a video file, identify the extracted audio information, and then perform frame capture of the video image according to the recognition result to further identify the captured video image.
  • The following description takes the identification of pornographic or violent video as an example; the method applies equally to other types of video file. As shown in FIG. 1, a video file identification method includes the following steps:
  • Step S1: obtain audio information from the video file to be identified.
  • In this embodiment, the audio information may be obtained by decoding the video file directly, or it may be extracted with third-party software. Audio extraction is a mature technology and is not described further here.
  • Step S2: segment the obtained audio information and perform fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment.
  • The obtained audio information is segmented, and a fingerprint is extracted from each audio segment to obtain the audio fingerprint corresponding to that segment.
  • The invention recognizes audio information on the basis of audio fingerprinting. An audio fingerprint is a compact, content-based digital signature that represents the important acoustic features of a piece of sound; its main purpose is to provide an effective mechanism for comparing the perceived auditory quality of two audio files, and it can be used in applications such as audio recognition and content integrity verification.
  • After the audio information is separated from the video file, its total playback duration T (milliseconds) and the total length L (bytes) of the extracted audio information can be obtained.
  • The audio information is then divided into multiple audio segments, a fingerprint is extracted from each segment, and the extracted audio fingerprints are compared with the training samples.
  • The training samples are obtained through training, with their audio segmented in the same way.
  • Two embodiments of audio segmentation are described below:
  • Method 1: segment according to volume level in the time domain.
  • The volume of the audio information varies along the time axis, appearing as a rising and falling waveform. By setting a volume threshold, all volume peak points that exceed the threshold can be found in the time domain, denoted (k1, k2, k3, ..., kn). The coordinate of each peak point on the time axis is recorded; this coordinate is the time offset p of the peak point within the audio information.
  • Sampling for a fixed duration w then starts from each peak point in turn to obtain the audio segments, and an audio fingerprint is extracted from each, giving n audio fingerprints for comparison with the training samples.
  • The start of each audio segment is the time corresponding to its peak point, so the start time of the audio segment for a given peak point can be calculated as: T*(p/L).
  • Method 2: segment at fixed intervals.
  • The audio information is sampled at a fixed duration w to obtain m audio segments f1, f2, f3, ..., fm, and audio fingerprints are extracted for comparison with the training samples.
  • The start of each audio segment can be calculated from the fixed duration; the start time of an audio segment is: T*(fi-1)/L, where i ranges over (1~m).
  • The fixed duration w matches the duration of the training samples in the training sample library, for example 1 second.
  • For pornographic or violent video files, the video images that accompany higher volume are often the ones that need the most attention. Method 1 is therefore preferred, as it identifies video files more easily and quickly: the peak points are sorted by volume, and the audio segments at the highest peaks are compared first.
  • Specifically, a fingerprint is extracted from each audio segment using an algorithm such as the fast Fourier transform, which is not described further here. The audio fingerprint corresponding to the audio segment is thereby obtained for comparison with the trained training samples in the subsequent steps.
  • Step S3: match the audio fingerprints of the audio segments against the trained training samples, and record the audio matching result.
  • In this embodiment, training samples are obtained by training on a large amount of audio from various kinds of pornographic or violent video, and annotation information is added to each training sample.
  • The annotation information of a training sample mainly includes the sample duration, the content level, and a manual classification label; in this embodiment, the content level is the level of pornographic or violent content.
  • The audio fingerprint of each audio segment is matched against the training samples; if the recognition similarity between the audio fingerprint of the audio segment and a training sample is greater than the set audio similarity threshold, the match is considered successful.
  • All audio segments are traversed and the audio matching result is recorded. The audio matching result includes: the number of successful matches, the start times of the successfully matched audio segments, and the annotation information of the training samples that matched them.
  • Step S4: determine, according to the audio matching result, whether the video file to be identified is the target video. When it is determined to be the target video or determined not to be the target video, the identification is terminated; when it is determined to be a suspicious video file, the method proceeds to the next step to continue the identification.
  • Specifically, this embodiment determines whether the video file to be identified is the target video by the following steps:
  • when the number of successful matches is greater than the first threshold (for example, 20), the video file to be identified is determined to be the target video and the identification is terminated;
  • when the number of successful matches is less than the second threshold (for example, 2), the video file to be identified is determined not to be the target video and the identification is terminated;
  • when the number of successful matches lies between the first threshold and the second threshold, the audio matching probability corresponding to the current matching result is calculated; when the calculated matching probability is greater than the set third threshold (for example T, where T is a specific value), the video file to be identified is determined to be the target video; otherwise it is treated as a suspicious video file and the method proceeds to the next step to continue the identification.
  • Suppose the number of successful matches is X and the total number of audio segments matched is Z; the ratio P1 of the number of successful matches to the total number of audio segments is: P1 = X/Z
  • This embodiment calculates the audio matching probability R1 corresponding to the current matching result as: R1 = P1*P(Y)
  • where R1 is the audio matching probability corresponding to the current matching result, P1 is the ratio of the number of successful matches to the total number of audio segments, and P(Y) is the sum of the weights corresponding to the content levels of all the training samples that matched the audio fingerprints of the audio segments.
  • Specifically, for an audio segment whose matched training sample has content level Yi, the corresponding weight is P(Yi), and P(Y) = ∑P(Yi).
  • After the audio matching probability R1 is calculated, it is compared with the set third threshold; if it is higher than the third threshold, the file is determined to be the target video; otherwise further judgment on the video images is required.
  • The above judgment steps are only one specific embodiment; the first, second, and third thresholds may be adjusted to make the result more accurate.
  • An intermediate threshold, for example 10, may further be set between the first threshold and the second threshold.
  • The audio matching probability corresponding to the current matching result is then calculated, and the judgment made from it, only when the number of successful matches is greater than this intermediate threshold; if the number of successful matches is less than the intermediate threshold and greater than the second threshold, the audio matching probability is not calculated and the method proceeds directly to the next step, where further judgment on the video images is required.
  • The invention is not limited to these specific judgment steps, which are not described further below.
  • Step S5: according to the matching result of the audio segments, capture frames from the video file starting at the start time of each successfully matched audio segment to obtain video images, perform image matching on the captured video images, and record the image matching result.
  • The matching in step S3 has established which audio segments matched successfully. The start time of each successfully matched audio segment in the recorded matching result is used to locate the corresponding time point in the video file, and frames are captured from the video file starting at that time point.
  • The time interval between captured frames can be chosen according to the actual situation; video images are thereby captured.
  • The captured video images are then recognized; in this embodiment, this means identifying whether a captured video image is a pornographic or violent image, which can be done by human inspection or by computer.
  • If recognition is done by computer, training samples must likewise be obtained by training on a large number of pornographic or violent video images of various kinds. The captured video images are matched against the training samples to obtain a recognition similarity for each image; if the recognition similarity is greater than the set image similarity threshold, the match is considered successful, and the image matching result, that is, the number of successful image matches, is recorded.
  • Step S6: determine whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.
  • After image matching is complete, the video matching probability R2 can be calculated from the number of successful matches; R2 is the ratio of the number of successfully matched captured video images to the total number of captured video images.
  • The comprehensive matching probability R′ of the current matching is calculated from the video matching probability R2 and the audio matching probability R1. If the comprehensive matching probability exceeds the fourth threshold, the video file to be identified is determined to be the target video; otherwise it is determined to be a normal video.
  • The comprehensive matching probability R′ is calculated as: R′ = R1*α + R2*β
  • where α and β are the weights of the audio matching probability and the video matching probability, respectively.
  • The judgment is then made from the resulting comprehensive matching probability: if it exceeds the recognition threshold, the video file to be identified is determined to be the target video; otherwise it is determined to be a normal video.
  • Alternatively, whether the video file to be identified is a pornographic or violent video file can be determined directly from the number of successful image matches, or from the video matching probability R2; for example, if the number of successful image matches or the video matching probability R2 is greater than a set threshold, the file is determined to be a pornographic or violent video file.
  • The invention does not limit the specific judgment conditions.
  • It should be noted that matching the audio fingerprint of an audio segment against the training samples and calculating their recognition similarity, and matching a video image against the training samples and calculating their recognition similarity, are both mature technologies; the similarity can be calculated, for example, by maximum likelihood estimation, and is not described further here.
  • FIG. 2 shows a video file identification device corresponding to the above method, comprising:
  • an audio preprocessing module, configured to obtain audio information from the video file to be identified, segment the obtained audio information, and perform fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment;
  • an audio fingerprint matching module, configured to match the audio fingerprints of the audio segments against trained training samples and record an audio matching result;
  • an audio judgment module, configured to determine, according to the audio matching result, whether the video file to be identified is a target video; when it is determined to be the target video or determined not to be the target video, the identification is terminated, and when it is determined to be a suspicious video file, processing continues in the image preprocessing module;
  • an image preprocessing module, configured to capture frames from the video file, according to the audio matching result, starting at the start time of each successfully matched audio segment, to obtain video images;
  • an image matching module, configured to perform image matching on the captured video images and record an image matching result;
  • a comprehensive judgment module, configured to determine whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.
  • The audio preprocessing module may segment the obtained audio information according to volume level in the time domain or at fixed intervals, corresponding to the audio segmentation methods described for the method; details are not repeated here.
  • Likewise, the operations performed by the audio judgment module and the comprehensive judgment module when making their determinations correspond to the specific steps of step S4 and step S6, and are not described again here.

Abstract

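A video file identification method. The method first obtains audio information from a video file to be identified (S1); segments the obtained audio information and performs fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment (S2); matches the audio fingerprints of the audio segments against trained training samples and records an audio matching result (S3); and determines, according to the audio matching result, whether the video file to be identified is a target video, terminating the identification when the file is determined to be the target video or determined not to be the target video, and proceeding to the next step to continue the identification when the file is determined to be a suspicious video file (S4). A video file identification device comprises an audio preprocessing module, an audio fingerprint matching module, an audio judgment module, an image preprocessing module, an image matching module, and a comprehensive judgment module.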

Description

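Video file identification method and device
This application claims priority to Chinese patent application No. 201510683009.1, filed on October 20, 2015 and entitled "Video file identification method and device", the entire contents of which are incorporated herein by reference.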
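Technical Field
The invention belongs to the technical field of computer data processing, and in particular relates to a video file identification method and device.
Background
With the spread of the Internet, more and more users store personal video files on cloud servers provided by Internet service providers, and some Internet service providers also allow users to upload video files to share with other users on the network. However, the law imposes strict review requirements on video files distributed online, which must not contain pornographic or violent content. Internet service providers therefore have the responsibility and obligation to review and supervise, in accordance with national regulations, both the video files uploaded by users and those provided by the service providers themselves.
Existing techniques review video files on the basis of the video images alone, by capturing picture frames from the video, and suffer from the following problems:
Low processing efficiency: the range of frames to capture cannot be located effectively, so a comprehensive review requires capturing an extremely large number of frames.
A single means of identification and a low recognition rate: relying on image recognition alone gives a high probability of missed and false identifications.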
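Summary of the Invention
The object of the present invention is to provide a video file identification method and device that combine audio fingerprint recognition with video frame capture for further image recognition and then give a final identification result, thereby effectively improving processing efficiency.
To achieve the above object, the technical solution of the invention is as follows:
A video file identification method for reviewing a video file to be identified, the method comprising:
obtaining audio information from the video file to be identified;
segmenting the obtained audio information and performing fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment;
matching the audio fingerprints of the audio segments against trained training samples and recording an audio matching result;
determining, according to the audio matching result, whether the video file to be identified is a target video; when it is determined to be the target video or determined not to be the target video, terminating the identification; when it is determined to be a suspicious video file, proceeding to the next step to continue the identification;
according to the audio matching result, capturing frames from the video file starting at the start time of each successfully matched audio segment to obtain video images, performing image matching on the captured video images, and recording an image matching result;
determining whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.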
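One implementation of segmenting the obtained audio information comprises:
finding, in the time domain, all volume peak points of the audio information that exceed a specified threshold;
sampling for a fixed duration starting from each peak point in turn to obtain the audio segments.
Another implementation of segmenting the obtained audio information comprises:
sampling the audio information at a fixed duration to obtain the audio segments.
Further, the audio matching result includes: the number of successful matches, the start times of the successfully matched audio segments, and the annotation information of the training samples that matched them; the annotation information includes: sample duration, content level, and manual classification label.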
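Further, determining according to the audio matching result whether the video file to be identified is the target video comprises:
when the number of successful matches is greater than a first threshold, determining that the video file to be identified is the target video;
when the number of successful matches is less than a second threshold, determining that the video file to be identified is not the target video;
when the number of successful matches lies between the first threshold and the second threshold, calculating the audio matching probability corresponding to the current matching result; when the calculated matching probability is greater than a set third threshold, determining that the video file to be identified is the target video, and otherwise treating it as a suspicious video file.
Calculating the audio matching probability corresponding to the current matching result comprises:
from the number of successful matches X and the total number of audio segments Z, calculating their ratio P1 as:
P1=X/Z
calculating the audio matching probability R1 corresponding to the current matching result by the following formula:
R1=P1*P(Y)
where R1 is the audio matching probability corresponding to the current matching result, and P(Y) is the sum of the weights corresponding to the content levels of all the training samples that matched the audio fingerprints of the audio segments.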
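Further, determining whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result, comprises:
calculating an image matching probability R2 from the image matching result, where R2 is the ratio of the number of successfully matched captured video images to the total number of captured video images;
calculating the comprehensive matching probability R′ of the current matching from the video matching probability R2 and the audio matching probability R1; if the comprehensive matching probability exceeds a fourth threshold, the video file to be identified is determined to be the target video, and otherwise it is determined to be a normal video;
where the comprehensive matching probability R′ is calculated by the following formula:
R′=R1*α+R2*β
where α and β are the weights of the audio matching probability and the video matching probability, respectively.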
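The invention also provides a video file identification device for reviewing a video file to be identified, the device comprising:
an audio preprocessing module, configured to obtain audio information from the video file to be identified, segment the obtained audio information, and perform fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment;
an audio fingerprint matching module, configured to match the audio fingerprints of the audio segments against trained training samples and record an audio matching result;
an audio judgment module, configured to determine, according to the audio matching result, whether the video file to be identified is a target video; when it is determined to be the target video or determined not to be the target video, the identification is terminated, and when it is determined to be a suspicious video file, processing continues in the image preprocessing module;
an image preprocessing module, configured to capture frames from the video file, according to the audio matching result, starting at the start time of each successfully matched audio segment, to obtain video images;
an image matching module, configured to perform image matching on the captured video images and record an image matching result;
a comprehensive judgment module, configured to determine whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.
The video file identification method and device proposed by the invention use audio fingerprint recognition to quickly identify the audio of a video file and record the start times of the matches, then capture frames at intervals within the range of those start times for further image recognition, and finally give an identification result. They offer high processing efficiency and a high recognition rate.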
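Brief Description of the Drawings
FIG. 1 is a flowchart of the video file identification method of the invention;
FIG. 2 is a schematic structural diagram of the video file identification device of the invention.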
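Detailed Description
The technical solution of the invention is described in further detail below with reference to the drawings and embodiments; the following embodiments do not limit the invention.
Many video file formats are currently popular, including AVI, MOV, MPEG, RM, and ASF. A complete video file consists of two parts: video images and audio information. The general idea of the invention is to extract the audio information from a video file, recognize the extracted audio information, and then capture video frames according to the recognition result for further recognition of the captured video images.
The following description takes the identification of pornographic or violent video as an example; the method applies equally to other types of video file. As shown in FIG. 1, a video file identification method includes the following steps: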
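Step S1: obtain audio information from the video file to be identified.
In this embodiment, the audio information may be obtained by decoding the video file directly, or it may be extracted with third-party software. Audio extraction is a mature technology and is not described further here.
Step S2: segment the obtained audio information and perform fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment.
The obtained audio information is segmented, and a fingerprint is extracted from each audio segment to obtain the audio fingerprint corresponding to that segment.
The invention recognizes audio information on the basis of audio fingerprinting (audio fingerprinting technology). An audio fingerprint is a compact, content-based digital signature that represents the important acoustic features of a piece of sound; its main purpose is to provide an effective mechanism for comparing the perceived auditory quality of two audio files, and it can be used in applications such as audio recognition and content integrity verification.
After the audio information is separated from the video file, its total playback duration T (milliseconds) and the total length L (bytes) of the extracted audio information can be obtained. The audio information is then divided into multiple audio segments, a fingerprint is extracted from each segment, and the extracted audio fingerprints are compared with the training samples. The training samples are obtained through training, with their audio segmented in the same way.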
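Two embodiments of audio segmentation are described below:
Method 1: segment according to volume level in the time domain.
The volume of the audio information varies along the time axis, appearing as a rising and falling waveform. By setting a volume threshold, all volume peak points that exceed the threshold can be found in the time domain, denoted (k1, k2, k3, ..., kn). The coordinate of each peak point on the time axis is recorded; this coordinate is the time offset p of the peak point within the audio information.
Sampling for a fixed duration w then starts from each peak point in turn to obtain the audio segments, and an audio fingerprint is extracted from each, giving n audio fingerprints for comparison with the training samples.
The start of each audio segment is the time corresponding to its peak point, so the start time of the audio segment for a given peak point can be calculated as: T*(p/L).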
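Method 1 can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes the audio has already been decoded into a list of amplitude samples at a known sample rate, treats a "peak point" as a local maximum of absolute amplitude above the threshold, and measures offsets in samples rather than bytes. The names `find_peak_offsets` and `segment_from_peaks` are hypothetical.

```python
def find_peak_offsets(samples, threshold):
    """Return indices of local maxima of |amplitude| that exceed threshold."""
    peaks = []
    for i in range(1, len(samples) - 1):
        v = abs(samples[i])
        if v > threshold and v >= abs(samples[i - 1]) and v > abs(samples[i + 1]):
            peaks.append(i)
    return peaks


def segment_from_peaks(samples, sample_rate, threshold, w_seconds):
    """Cut one fixed-duration segment starting at each peak point.

    Returns (start_time_ms, segment_samples) pairs, loudest peak first,
    so that the highest peaks can be compared first.
    """
    total_ms = 1000.0 * len(samples) / sample_rate  # T, total duration
    length = len(samples)                           # L, counted in samples here (assumption)
    w = int(w_seconds * sample_rate)                # fixed duration w in samples
    peaks = find_peak_offsets(samples, threshold)
    peaks.sort(key=lambda p: abs(samples[p]), reverse=True)
    # Start time of each segment follows T * (p / L).
    return [(total_ms * p / length, samples[p:p + w]) for p in peaks]
```

Sorting the peaks by loudness mirrors the preference stated in the description for comparing the highest-volume segments first.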
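Method 2: segment at fixed intervals.
The audio information is sampled at a fixed duration w to obtain m audio segments f1, f2, f3, ..., fm, and audio fingerprints are extracted for comparison with the training samples.
The start of each audio segment can be calculated from the fixed duration; the start time of an audio segment is: T*(fi-1)/L, where i ranges over (1~m).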
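A minimal sketch of Method 2, under the same assumption that the audio is a list of decoded samples at a known sample rate. It reads the start of the i-th segment as (i-1) times the fixed duration w, which is one interpretation of the start-time expression above rather than something the application spells out; `segment_fixed` is a hypothetical name.

```python
def segment_fixed(samples, sample_rate, w_seconds):
    """Split samples into consecutive segments of fixed duration w.

    Returns (start_time_ms, segment_samples) pairs in time order.
    The final segment may be shorter than w.
    """
    w = int(w_seconds * sample_rate)
    if w <= 0:
        raise ValueError("w_seconds * sample_rate must be at least one sample")
    segments = []
    for start in range(0, len(samples), w):
        start_ms = 1000.0 * start / sample_rate
        segments.append((start_ms, samples[start:start + w]))
    return segments
```

A caller may want to discard or pad a short final segment so that every segment has the same duration as the training samples.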
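The fixed duration w matches the duration of the training samples in the training sample library, for example 1 second. For pornographic or violent video files, the video images that accompany higher volume are often the ones that need the most attention. Method 1 is therefore preferred, as it identifies video files more easily and quickly: the peak points are sorted by volume, and the audio segments at the highest peaks are compared first.
Specifically, a fingerprint is extracted from each audio segment using an algorithm such as the fast Fourier transform, which is not described further here. The audio fingerprint corresponding to the audio segment is thereby obtained for comparison with the trained training samples in the subsequent steps.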
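The application names the fast Fourier transform but leaves the fingerprint algorithm unspecified. The toy sketch below is therefore an assumption throughout: it uses a naive discrete Fourier transform (standard library only, so it is slow), sums spectral energy in a few equal-width bands, and emits one bit per pair of adjacent bands. The names `dft_magnitudes` and `fingerprint` are hypothetical, and a production system would use a real FFT and a more robust scheme.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the first N/2 DFT bins (naive O(N^2) transform)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def fingerprint(segment, bands=8):
    """Return a tuple of bits: bit b is 1 if band b holds more energy than band b+1."""
    mags = dft_magnitudes(segment)
    size = max(1, len(mags) // (bands + 1))  # bins per band
    energy = [
        sum(m * m for m in mags[b * size:(b + 1) * size])
        for b in range(bands + 1)
    ]
    return tuple(1 if energy[b] > energy[b + 1] else 0 for b in range(bands))
```

Because the result is a short tuple of bits, two fingerprints can be compared cheaply, for example by counting the positions where they agree.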
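Step S3: match the audio fingerprints of the audio segments against the trained training samples, and record the audio matching result.
In this embodiment, training samples are obtained by training on a large amount of audio from various kinds of pornographic or violent video, and annotation information is added to each training sample. The annotation information of a training sample mainly includes the sample duration, the content level, and a manual classification label; in this embodiment, the content level is the level of pornographic or violent content.
The audio fingerprint of each audio segment is matched against the training samples; if the recognition similarity between the audio fingerprint of the audio segment and a training sample is greater than the set audio similarity threshold, the match is considered successful. All audio segments are traversed and the audio matching result is recorded. The audio matching result includes: the number of successful matches, the start times of the successfully matched audio segments, and the annotation information of the training samples that matched them.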
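The matching loop and the recorded result can be sketched as below. The data layout is an assumption: `TrainingSample`, `AudioMatchResult`, `similarity`, and `match_segments` are hypothetical names, fingerprints are taken to be equal-length bit tuples, and similarity is the fraction of agreeing bits, which stands in for whatever similarity measure an implementation actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingSample:
    fingerprint: tuple
    duration_s: float    # sample duration
    content_level: int   # content level
    label: str           # manual classification label


@dataclass
class AudioMatchResult:
    total_segments: int = 0
    matches: list = field(default_factory=list)  # (start_ms, TrainingSample)

    @property
    def match_count(self):
        return len(self.matches)


def similarity(fp_a, fp_b):
    """Fraction of positions where two bit tuples agree (assumed measure)."""
    if not fp_a or len(fp_a) != len(fp_b):
        return 0.0
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)


def match_segments(segment_fps, training_samples, threshold):
    """Traverse all (start_ms, fingerprint) pairs and record successful matches."""
    result = AudioMatchResult(total_segments=len(segment_fps))
    for start_ms, fp in segment_fps:
        best = max(training_samples,
                   key=lambda s: similarity(fp, s.fingerprint),
                   default=None)
        if best is not None and similarity(fp, best.fingerprint) > threshold:
            result.matches.append((start_ms, best))
    return result
```

Keeping the start time and the matched sample together means the later steps can read both the capture positions and the content levels from one structure.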
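Step S4: determine, according to the audio matching result, whether the video file to be identified is the target video. When it is determined to be the target video or determined not to be the target video, the identification is terminated; when it is determined to be a suspicious video file, the method proceeds to the next step to continue the identification.
Specifically, this embodiment determines whether the video file to be identified is the target video by the following steps:
when the number of successful matches is greater than the first threshold (for example, 20), the video file to be identified is determined to be the target video and the identification is terminated;
when the number of successful matches is less than the second threshold (for example, 2), the video file to be identified is determined not to be the target video and the identification is terminated;
when the number of successful matches lies between the first threshold and the second threshold, the audio matching probability corresponding to the current matching result is calculated; when the calculated matching probability is greater than the set third threshold (for example T, where T is a specific value), the video file to be identified is determined to be the target video; otherwise it is treated as a suspicious video file and the method proceeds to the next step to continue the identification.
Suppose the number of successful matches is X and the total number of audio segments matched is Z; the ratio P1 of the number of successful matches to the total number of audio segments is:
P1=X/Z
This embodiment calculates the audio matching probability R1 corresponding to the current matching result by the following formula:
R1=P1*P(Y)
where R1 is the audio matching probability corresponding to the current matching result, P1 is the ratio of the number of successful matches to the total number of audio segments, and P(Y) is the sum of the weights corresponding to the content levels of all the training samples that matched the audio fingerprints of the audio segments.
Specifically, for an audio segment whose matched training sample has content level Yi, the corresponding weight is P(Yi), and P(Y)=∑P(Yi).
After the audio matching probability R1 is calculated, it is compared with the set third threshold; if it is higher than the third threshold, the file is determined to be the target video; otherwise further judgment on the video images is required.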
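The three-threshold decision of step S4 can be written out as a short sketch. The first and second thresholds use the example values 20 and 2 from the description; the third threshold value of 0.5 and the level weights passed in are assumptions, since the description gives no concrete numbers for them. The function names are hypothetical.

```python
def audio_match_probability(match_count, total_segments, level_weights):
    """R1 = P1 * P(Y), with P1 = X / Z and P(Y) the sum of the level weights P(Yi)."""
    if total_segments == 0:
        return 0.0
    p1 = match_count / total_segments
    p_y = sum(level_weights)
    return p1 * p_y


def judge_audio(match_count, total_segments, level_weights,
                first_threshold=20, second_threshold=2, third_threshold=0.5):
    """Return 'target', 'not_target', or 'suspicious' from the audio matching result."""
    if match_count > first_threshold:
        return "target"
    if match_count < second_threshold:
        return "not_target"
    r1 = audio_match_probability(match_count, total_segments, level_weights)
    return "target" if r1 > third_threshold else "suspicious"
```

Because P(Y) is a sum of weights, R1 computed this way can exceed 1 unless the weights are kept small, so the third threshold has to be chosen on the same scale as the weights.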
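The above judgment steps are only one specific embodiment; the first, second, and third thresholds may be adjusted to make the result more accurate. An intermediate threshold, for example 10, may further be set between the first threshold and the second threshold: the audio matching probability corresponding to the current matching result is calculated, and the judgment made from it, only when the number of successful matches is greater than this intermediate threshold; if the number of successful matches is less than the intermediate threshold and greater than the second threshold, the audio matching probability is not calculated and the method proceeds directly to the next step, where further judgment on the video images is required. The invention is not limited to these specific judgment steps, which are not described further below.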
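Step S5: according to the matching result of the audio segments, capture frames from the video file starting at the start time of each successfully matched audio segment to obtain video images, perform image matching on the captured video images, and record the image matching result.
The matching in step S3 has established which audio segments matched successfully. The start time of each successfully matched audio segment in the recorded matching result is used to locate the corresponding time point in the video file, and frames are captured from the video file starting at that time point; the time interval between captured frames can be chosen according to the actual situation. Video images are thereby captured.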
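One way to turn the recorded start times into frame-capture positions is sketched below. The description fixes only the starting point and leaves the interval open, so the capture window (taken here to be one segment duration) and the function name `capture_times_ms` are assumptions. Actual frame extraction would be done by a video decoder, which is outside this sketch.

```python
def capture_times_ms(match_start_times_ms, segment_ms, interval_ms):
    """Timestamps at which to grab frames.

    For each matched segment, frames are taken every interval_ms from the
    segment's start time until the end of the segment. Overlapping windows
    are de-duplicated and the result is sorted.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    times = set()
    for start in match_start_times_ms:
        t = start
        while t < start + segment_ms:
            times.add(t)
            t += interval_ms
    return sorted(times)
```

For example, two matches starting at 1000 ms and 1500 ms, with 1000 ms segments and a 500 ms interval, share the 1500 ms frame, so three frames are captured rather than four.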
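The captured video images are then recognized; in this embodiment, this means identifying whether a captured video image is a pornographic or violent image, which can be done by human inspection or by computer. If recognition is done by computer, training samples must likewise be obtained by training on a large number of pornographic or violent video images of various kinds. The captured video images are matched against the training samples to obtain a recognition similarity for each image; if the recognition similarity is greater than the set image similarity threshold, the match is considered successful, and the image matching result, that is, the number of successful image matches, is recorded.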
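Step S6: determine whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.
After image matching is complete, the video matching probability R2 can be calculated from the number of successful matches; R2 is the ratio of the number of successfully matched captured video images to the total number of captured video images.
The comprehensive matching probability R′ of the current matching is calculated from the video matching probability R2 and the audio matching probability R1. If the comprehensive matching probability exceeds the fourth threshold, the video file to be identified is determined to be the target video; otherwise it is determined to be a normal video.
The comprehensive matching probability R′ is calculated by the following formula:
R′=R1*α+R2*β
where α and β are the weights of the audio matching probability and the video matching probability, respectively.
The judgment is then made from the resulting comprehensive matching probability: if it exceeds the recognition threshold, the video file to be identified is determined to be the target video; otherwise it is determined to be a normal video.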
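The combined judgment can be sketched as a weighted sum, with β applied to R2 as the symbol definitions indicate. The weights of 0.5 each and the fourth threshold of 0.5 are placeholder assumptions, since the description does not give values; the function names are hypothetical.

```python
def combined_probability(r1, r2, alpha, beta):
    """R' = R1 * alpha + R2 * beta."""
    return r1 * alpha + r2 * beta


def judge_combined(r1, image_matches, total_frames,
                   alpha=0.5, beta=0.5, fourth_threshold=0.5):
    """Return (verdict, R') where verdict is 'target' or 'normal'.

    R2 is the share of captured frames whose image match succeeded.
    """
    r2 = image_matches / total_frames if total_frames else 0.0
    r = combined_probability(r1, r2, alpha, beta)
    verdict = "target" if r > fourth_threshold else "normal"
    return verdict, r
```

Choosing α and β so that they sum to 1 keeps R′ on the same scale as R1 and R2, which makes the fourth threshold easier to interpret.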
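Alternatively, whether the video file to be identified is a pornographic or violent video file can be determined directly from the number of successful image matches, or from the video matching probability R2; for example, if the number of successful image matches or the video matching probability R2 is greater than a set threshold, the file is determined to be a pornographic or violent video file. The invention does not limit the specific judgment conditions.
It should be noted that matching the audio fingerprint of an audio segment against the training samples and calculating their recognition similarity, and matching a video image against the training samples and calculating their recognition similarity, are both mature technologies; the similarity can be calculated, for example, by maximum likelihood estimation, and is not described further here.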
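FIG. 2 shows a video file identification device corresponding to the above method, comprising:
an audio preprocessing module, configured to obtain audio information from the video file to be identified, segment the obtained audio information, and perform fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment;
an audio fingerprint matching module, configured to match the audio fingerprints of the audio segments against trained training samples and record an audio matching result;
an audio judgment module, configured to determine, according to the audio matching result, whether the video file to be identified is a target video; when it is determined to be the target video or determined not to be the target video, the identification is terminated, and when it is determined to be a suspicious video file, processing continues in the image preprocessing module;
an image preprocessing module, configured to capture frames from the video file, according to the audio matching result, starting at the start time of each successfully matched audio segment, to obtain video images;
an image matching module, configured to perform image matching on the captured video images and record an image matching result;
a comprehensive judgment module, configured to determine whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.
The audio preprocessing module may segment the obtained audio information according to volume level in the time domain or at fixed intervals, corresponding to the audio segmentation methods described for the method; details are not repeated here.
Likewise, the operations performed by the audio judgment module and the comprehensive judgment module when making their determinations correspond to the specific steps of step S4 and step S6, and are not described again here.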
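The way the six modules hand work to one another can be sketched as a small orchestrating class. Each module is passed in as a callable, so the sketch shows only the control flow: audio first, and images only for files the audio stage marks as suspicious. The class name `VideoFileIdentifier`, the callable signatures, and the verdict strings are assumptions for illustration.

```python
class VideoFileIdentifier:
    """Wires the six modules together; each module is supplied as a callable."""

    def __init__(self, preprocess_audio, match_audio, judge_audio,
                 capture_frames, match_images, judge_overall):
        self.preprocess_audio = preprocess_audio  # audio preprocessing module
        self.match_audio = match_audio            # audio fingerprint matching module
        self.judge_audio = judge_audio            # audio judgment module
        self.capture_frames = capture_frames      # image preprocessing module
        self.match_images = match_images          # image matching module
        self.judge_overall = judge_overall        # comprehensive judgment module

    def identify(self, video_path):
        fingerprints = self.preprocess_audio(video_path)
        audio_result = self.match_audio(fingerprints)
        verdict = self.judge_audio(audio_result)
        if verdict != "suspicious":
            # Target or not target: identification ends after the audio stage.
            return verdict
        frames = self.capture_frames(video_path, audio_result)
        image_result = self.match_images(frames)
        return self.judge_overall(image_result, audio_result)
```

Passing the modules in as callables keeps the early-exit logic testable without a real decoder or trained models.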
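The above embodiments are intended only to illustrate the technical solution of the invention and not to limit it. Without departing from the spirit and essence of the invention, those skilled in the art may make various corresponding changes and modifications according to the invention, and all such changes and modifications shall fall within the scope of protection of the claims appended to the invention.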

Claims (14)

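  1. A video file identification method for reviewing a video file to be identified, characterized in that the method comprises:
    obtaining audio information from the video file to be identified;
    segmenting the obtained audio information and performing fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment;
    matching the audio fingerprints of the audio segments against trained training samples and recording an audio matching result;
    determining, according to the audio matching result, whether the video file to be identified is a target video; when it is determined to be the target video or determined not to be the target video, terminating the identification; when it is determined to be a suspicious video file, proceeding to the next step to continue the identification;
    according to the audio matching result, capturing frames from the video file starting at the start time of each successfully matched audio segment to obtain video images, performing image matching on the captured video images, and recording an image matching result;
    determining whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.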
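  2. The video file identification method according to claim 1, characterized in that segmenting the obtained audio information comprises:
    finding, in the time domain, all volume peak points of the audio information that exceed a specified threshold;
    sampling for a fixed duration starting from each peak point in turn to obtain the audio segments.
  3. The video file identification method according to claim 1, characterized in that segmenting the obtained audio information comprises:
    sampling the audio information at a fixed duration to obtain the audio segments.
  4. The video file identification method according to claim 1, characterized in that the audio matching result includes: the number of successful matches, the start times of the successfully matched audio segments, and the annotation information of the training samples that matched the successfully matched audio segments;
    the annotation information includes: sample duration, content level, and manual classification label.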
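  5. The video file identification method according to claim 4, characterized in that determining according to the audio matching result whether the video file to be identified is the target video comprises:
    when the number of successful matches is greater than a first threshold, determining that the video file to be identified is the target video;
    when the number of successful matches is less than a second threshold, determining that the video file to be identified is not the target video;
    when the number of successful matches lies between the first threshold and the second threshold, calculating the audio matching probability corresponding to the current matching result; when the calculated matching probability is greater than a set third threshold, determining that the video file to be identified is the target video, and otherwise treating the video file to be identified as a suspicious video file.
  6. The video file identification method according to claim 5, characterized in that calculating the audio matching probability corresponding to the current matching result comprises:
    from the number of successful matches X and the total number of audio segments Z, calculating their ratio P1 as:
    P1=X/Z
    calculating the audio matching probability R1 corresponding to the current matching result by the following formula:
    R1=P1*P(Y)
    where R1 is the audio matching probability corresponding to the current matching result, and P(Y) is the sum of the weights corresponding to the content levels of all the training samples that matched the audio fingerprints of the audio segments.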
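  7. The video file identification method according to claim 6, characterized in that determining whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result, comprises:
    calculating an image matching probability R2 from the image matching result, where R2 is the ratio of the number of successfully matched captured video images to the total number of captured video images;
    calculating the comprehensive matching probability R′ of the current matching from the video matching probability R2 and the audio matching probability R1; if the comprehensive matching probability exceeds a fourth threshold, determining that the video file to be identified is the target video, and otherwise determining that it is a normal video;
    where the comprehensive matching probability R′ is calculated by the following formula:
    R′=R1*α+R2*β
    where α and β are the weights of the audio matching probability and the video matching probability, respectively.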
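  8. A video file identification device for reviewing a video file to be identified, characterized in that the device comprises:
    an audio preprocessing module, configured to obtain audio information from the video file to be identified, segment the obtained audio information, and perform fingerprint extraction on each audio segment to obtain the audio fingerprint of the audio segment;
    an audio fingerprint matching module, configured to match the audio fingerprints of the audio segments against trained training samples and record an audio matching result;
    an audio judgment module, configured to determine, according to the audio matching result, whether the video file to be identified is a target video; when it is determined to be the target video or determined not to be the target video, the identification is terminated, and when it is determined to be a suspicious video file, processing continues in the image preprocessing module;
    an image preprocessing module, configured to capture frames from the video file, according to the audio matching result, starting at the start time of each successfully matched audio segment, to obtain video images;
    an image matching module, configured to perform image matching on the captured video images and record an image matching result;
    a comprehensive judgment module, configured to determine whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result.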
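  9. The video file identification device according to claim 8, characterized in that the audio preprocessing module segments the obtained audio information by performing the following operations:
    finding, in the time domain, all volume peak points of the audio information that exceed a specified threshold;
    sampling for a fixed duration starting from each peak point in turn to obtain the audio segments.
  10. The video file identification device according to claim 8, characterized in that the audio preprocessing module segments the obtained audio information by performing the following operation:
    sampling the audio information at a fixed duration to obtain the audio segments.
  11. The video file identification device according to claim 8, characterized in that the audio matching result includes: the number of successful matches, the start times of the successfully matched audio segments, and the annotation information of the training samples that matched the successfully matched audio segments; the annotation information includes: sample duration, content level, and manual classification label.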
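  12. The video file identification device according to claim 11, characterized in that the audio judgment module determines, according to the audio matching result, whether the video file to be identified is the target video by performing the following operations:
    when the number of successful matches is greater than a first threshold, determining that the video file to be identified is the target video;
    when the number of successful matches is less than a second threshold, determining that the video file to be identified is not the target video;
    when the number of successful matches lies between the first threshold and the second threshold, calculating the audio matching probability corresponding to the current matching result; when the calculated matching probability is greater than a set third threshold, determining that the video file to be identified is the target video, and otherwise treating the video file to be identified as a suspicious video file.
  13. The video file identification device according to claim 12, characterized in that calculating the audio matching probability corresponding to the current matching result comprises:
    from the number of successful matches X and the total number of audio segments Z, calculating their ratio P1 as:
    P1=X/Z
    calculating the audio matching probability R1 corresponding to the current matching result by the following formula:
    R1=P1*P(Y)
    where R1 is the audio matching probability corresponding to the current matching result, and P(Y) is the sum of the weights corresponding to the content levels of all the training samples that matched the audio fingerprints of the audio segments.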
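  14. The video file identification device according to claim 13, characterized in that the comprehensive judgment module determines whether the video file to be identified is the target video according to the image matching result, or according to the image matching result together with the audio matching result, by performing the following operations:
    calculating an image matching probability R2 from the image matching result, where R2 is the ratio of the number of successfully matched captured video images to the total number of captured video images;
    calculating the comprehensive matching probability R′ of the current matching from the video matching probability R2 and the audio matching probability R1; if the comprehensive matching probability exceeds a fourth threshold, determining that the video file to be identified is the target video, and otherwise determining that it is a normal video;
    where the comprehensive matching probability R′ is calculated by the following formula:
    R′=R1*α+R2*β
    where α and β are the weights of the audio matching probability and the video matching probability, respectively.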
PCT/CN2016/101733 2015-10-20 2016-10-11 Video file identification method and device WO2017067400A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510683009.1A CN106601243B (zh) 2015-10-20 2015-10-20 Video file identification method and device
CN201510683009.1 2015-10-20

Publications (1)

Publication Number Publication Date
WO2017067400A1 (zh) 2017-04-27

Family

ID=58554949

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/101733 WO2017067400A1 (zh) 2015-10-20 2016-10-11 Video file identification method and device

Country Status (2)

Country Link
CN (1) CN106601243B (zh)
WO (1) WO2017067400A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489757A (zh) * 2020-03-26 2020-08-04 北京达佳互联信息技术有限公司 Audio processing method and apparatus, electronic device, and readable storage medium

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
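CN107967922A (zh) * 2017-12-19 2018-04-27 成都嗨翻屋文化传播有限公司 A feature-based music copyright identification method
CN108419124B (zh) * 2018-05-08 2020-11-17 北京酷我科技有限公司 An audio processing method
CN108984665A (zh) * 2018-06-29 2018-12-11 杭州当虹科技股份有限公司 An efficient joint detection method for video content
CN109389794A (zh) * 2018-07-05 2019-02-26 北京中广通业信息科技股份有限公司 An intelligent video surveillance method and system
CN109271126A (zh) * 2018-08-02 2019-01-25 联想(北京)有限公司 A data processing method and device
CN109344289B (zh) * 2018-09-21 2020-12-11 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN109982137A (zh) * 2019-02-22 2019-07-05 北京奇艺世纪科技有限公司 Model generation method, video tagging method, apparatus, terminal, and storage medium
CN109887493B (zh) * 2019-03-13 2021-08-31 安徽声讯信息技术有限公司 A text-audio push method
CN113542820B (zh) * 2021-06-30 2023-12-22 北京中科模识科技有限公司 A video cataloging method, system, electronic device, and storage medium
CN114358643B (zh) * 2022-01-13 2023-09-12 南京讯思雅信息科技有限公司 A multimedia content risk-control management apparatus and management method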

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7027990B2 (en) * 2001-10-12 2006-04-11 Lester Sussman System and method for integrating the visual display of text menus for interactive voice response systems
CN101021854A (zh) * 2006-10-11 2007-08-22 鲍东山 Content-based audio analysis system
CN101640057A (zh) * 2009-05-31 2010-02-03 北京中星微电子有限公司 An audio-video matching method and device
CN102014295A (zh) * 2010-11-19 2011-04-13 嘉兴学院 A method for detecting sensitive network video
CN103533459A (zh) * 2013-10-09 2014-01-22 北京中科模识科技有限公司 A method and system for splitting news video items

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070288452A1 (en) * 2006-06-12 2007-12-13 D&S Consultants, Inc. System and Method for Rapidly Searching a Database
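CN101470897B (zh) * 2007-12-26 2011-04-20 中国科学院自动化研究所 Sensitive film detection method based on an audio-video fusion strategy
CN101819638B (zh) * 2010-04-12 2012-07-11 中国科学院计算技术研究所 Method for building a pornography detection model and pornography detection method
CN102222103B (zh) * 2011-06-22 2013-03-27 央视国际网络有限公司 Method and device for processing matching relationships of video content
CN102890778A (zh) * 2011-07-21 2013-01-23 北京新岸线网络技术有限公司 Content-based video detection method and device
CN102509084B (zh) * 2011-11-18 2014-05-07 中国科学院自动化研究所 A horror video scene recognition method based on multiple-instance learning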
EP2608062A1 (en) * 2011-12-23 2013-06-26 Thomson Licensing Method of automatic management of images in a collection of images and corresponding device
US8781154B1 (en) * 2012-01-21 2014-07-15 Google Inc. Systems and methods facilitating random number generation for hashes in video and audio applications
EP2648418A1 (en) * 2012-04-05 2013-10-09 Thomson Licensing Synchronization of multimedia streams
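CN102799605B (zh) * 2012-05-02 2016-03-23 天脉聚源(北京)传媒科技有限公司 An advertisement broadcast monitoring method and system
CN202602832U (zh) * 2012-05-10 2012-12-12 青岛海尔电子有限公司 System for identifying the program played on a television
CN102831537B (zh) * 2012-07-09 2016-03-23 北京酷云互动科技有限公司 A method and device for obtaining online advertisement information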
US8484017B1 (en) * 2012-09-10 2013-07-09 Google Inc. Identifying media content
US8805865B2 (en) * 2012-10-15 2014-08-12 Juked, Inc. Efficient matching of data
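CN103581705A (zh) * 2012-11-07 2014-02-12 深圳新感易搜网络科技有限公司 Video program identification method and system
CN103617263A (zh) * 2013-11-29 2014-03-05 安徽大学 An automatic detection method for television advertisement trailers based on multimodal features
CN104036280A (zh) * 2014-06-23 2014-09-10 国家广播电影电视总局广播科学研究院 Video fingerprinting method combining regions of interest and clustering
CN104866616B (zh) * 2015-06-07 2019-01-22 中科院成都信息技术股份有限公司 Surveillance video target search method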

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
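CN111489757A (zh) * 2020-03-26 2020-08-04 北京达佳互联信息技术有限公司 Audio processing method and apparatus, electronic device, and readable storage medium
CN111489757B (zh) * 2020-03-26 2023-08-18 北京达佳互联信息技术有限公司 Audio processing method and apparatus, electronic device, and readable storage medium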

Also Published As

Publication number Publication date
CN106601243A (zh) 2017-04-26
CN106601243B (zh) 2020-11-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16856833

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16856833

Country of ref document: EP

Kind code of ref document: A1