WO2017166483A1 - Dynamic picture processing method and system - Google Patents

Dynamic picture processing method and system

Info

Publication number
WO2017166483A1
Authority
WO
WIPO (PCT)
Prior art keywords
dynamic picture
voiceprint
module
dynamic
voiceprint feature
Prior art date
Application number
PCT/CN2016/088859
Other languages
English (en)
French (fr)
Inventor
姜天宇
Original Assignee
乐视控股(北京)有限公司
乐视移动智能信息技术(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司 and 乐视移动智能信息技术(北京)有限公司
Priority to US15/245,743 (published as US20170287524A1)
Publication of WO2017166483A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • G06F16/433Query formulation using audio data

Definitions

  • The present invention relates to the field of dynamic picture processing, and in particular to a dynamic picture processing method and system.
  • Since several mobile device manufacturers introduced new image media formats such as Zoe and LivePhoto, dynamic picture formats are very likely to replace the existing static picture formats and to become an important area of competition in mobile device innovation.
  • Existing dynamic pictures only record the image information within the shooting range and the raw digital media signal; they do not consider the content of the sound in the scene being shot. There is therefore still considerable room to improve the user experience in dynamic picture processing.
  • The invention provides a dynamic picture processing method and system that aim to solve this technical problem: existing dynamic pictures only record the image information within the shooting range and the raw digital media signal, without considering the content of the sound in the scene being shot.
  • To solve the problem stated above, an embodiment of the present invention adopts the following technical solution: a dynamic picture processing method, including:
  • shooting a dynamic picture, and recording audio while the dynamic picture is being shot;
  • extracting voiceprint features from the recorded audio;
  • writing the extracted voiceprint features into the dynamic picture, thereby annotating the dynamic picture with its voiceprint.
  • In the technical solution adopted by the embodiment of the present invention, shooting the dynamic picture and recording audio during shooting further includes storing the captured dynamic picture and the recorded audio. The dynamic picture is stored in thumbnail form, and the recorded audio includes speech, ambient sound, or noise.
  • In the technical solution adopted by the embodiment of the present invention, the voiceprint feature extraction method includes the following steps:
  • detecting whether valid audio source data has arrived;
  • applying differencing and filtering to the incoming audio source data;
  • discretizing the streaming audio source;
  • windowing the frame data with a Hamming window;
  • converting the time-domain audio source into frequency-domain energy by fast Fourier transform;
  • applying band-pass filtering to the audio source and extracting voiceprint features.
  • In the technical solution adopted by the embodiment of the present invention, the extracted voiceprint features are written into the dynamic picture as follows: the stored dynamic picture is read, and the extracted voiceprint features are written in serialized form into a specified file data node of the dynamic picture.
  • In the technical solution adopted by the embodiment of the present invention, after the extracted voiceprint features are written into the dynamic picture and the dynamic picture is annotated, the method further includes classifying and storing the annotated dynamic pictures according to their voiceprint features. Classification includes speech feature classification, ambient sound feature classification, or noise feature classification.
  • In the technical solution adopted by the embodiment of the present invention, after the extracted voiceprint features are written into the dynamic picture and the dynamic picture is annotated, the method further includes retrieving dynamic pictures having a specific voiceprint feature by voice input or by classification lookup.
  • Another technical solution adopted by the embodiment of the present invention is a dynamic picture processing system including a shooting module, a recording module, a voiceprint extraction module, and a voiceprint annotation module. The shooting module is configured to shoot a dynamic picture; the recording module is configured to record audio while the dynamic picture is being shot; the voiceprint extraction module is configured to extract voiceprint features from the recorded audio; the voiceprint annotation module is configured to write the extracted voiceprint features into the dynamic picture, thereby annotating the dynamic picture with its voiceprint.
  • In the technical solution adopted by the embodiment of the present invention, the voiceprint extraction module includes an endpoint detection unit, a pre-emphasis unit, an audio framing unit, a windowing unit, an audio source conversion unit, and a filtering unit;
  • the endpoint detection unit is configured to detect whether valid audio source data has arrived;
  • the pre-emphasis unit is configured to apply differencing and filtering to the incoming audio source data;
  • the audio framing unit is configured to discretize the streaming audio source;
  • the windowing unit is configured to window the frame data with a Hamming window;
  • the audio source conversion unit is configured to convert the time-domain audio source into frequency-domain energy by fast Fourier transform;
  • the filtering unit is configured to apply band-pass filtering to the audio source and extract voiceprint features.
  • The technical solution adopted by the embodiment of the present invention further includes a storage module, which is configured to store the captured dynamic picture and the recorded audio.
  • The technical solution adopted by the embodiment of the present invention further includes a classification module and a retrieval module. The classification module is configured to classify and store the annotated dynamic pictures according to their voiceprint features; its classification includes speech feature classification, ambient sound feature classification, or noise feature classification. The retrieval module is configured to retrieve dynamic pictures having a specific voiceprint feature by voice input or by classification lookup.
  • The beneficial effects of the invention are as follows. The dynamic picture processing method and system of the embodiment make full use of the sound information of the shooting scene: they compute and extract the voiceprint features of the scene in real time, write those features into the dynamic picture to annotate it, and classify dynamic pictures according to their voiceprint features. This enables classified retrieval of dynamic pictures and fast matching queries based on voiceprint features, making picture retrieval more efficient and intuitive for the user.
  • FIG. 1 is a flowchart of a dynamic picture processing method according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of voiceprint feature extraction according to an embodiment of the present invention;
  • FIG. 3 is a schematic structural diagram of a dynamic picture processing system according to an embodiment of the present invention.
  • Referring to FIG. 1, the dynamic picture processing method of the embodiment of the present invention includes the following steps:
  • Step 100: start the dynamic photo function and begin shooting a dynamic picture;
  • Step 200: start the recording function, record audio while the dynamic picture is being shot, and store the captured dynamic picture and the recorded audio;
  • In step 200, the embodiment stores the dynamic picture in the form Thumbnail (a still preview image) + MOV. The images come from the camera's Preview data: the MOV is generated by encoding multiple frames of image data, and the image at the middle of the time axis is cropped out as the Thumbnail. By default, the recorded MOV (the QuickTime movie format, an audio and video file format developed by Apple for storing common digital media types) is a 4-second video with an audio track. The recorded audio includes speech, ambient sound, noise, and so on.
  • Step 300: extract voiceprint features from the stored recorded audio with the voiceprint extraction module, and store the extracted voiceprint features;
  • In step 300, the embodiment uses a special section of the media information to store the voiceprint features. FIG. 2 is a schematic diagram of the voiceprint feature extraction; the extraction process includes the following steps:
  • Step 301: endpoint detection: detect whether valid audio source data has arrived;
  • Step 302: pre-emphasis: apply differencing and filtering to the incoming audio source data;
  • In step 302, the pre-emphasis filter formula is the one shown as Figure PCTCN2016088859-appb-000001 in the Description.
  • Step 303: audio framing: discretize the streaming audio source;
  • In step 303, to preserve some detailed features of the audio source, especially the distinctive sound quality of certain environmental scenes, while keeping the amount of data to process in check, the invention adopts a 1-channel (mono), 44100 Hz sampling standard. By the usual rule of audio processing, the duration of an audio frame is kept at about 20 to 30 ms, so the number of samples in a single audio frame can be set to 1024, which corresponds to an actual duration of 1024 ÷ 44100 × 1000 ≈ 23 ms.
  • Step 304: windowing: window the frame data with the common Hamming window;
  • In step 304, each frame of audio data S(n) produced by audio framing is processed with a Hamming window to obtain S′(n) = S(n) × W(n), where W(n) has the form shown as Figure PCTCN2016088859-appb-000002 in the Description.
  • Step 305: FFT (Fast Fourier Transformation): convert the time-domain audio source into frequency-domain energy;
  • In step 305, an atomic-operation-level fast Fourier transform converts the time-domain audio source into frequency-domain data; the conversion formula is the one shown as Figure PCTCN2016088859-appb-000003 in the Description.
  • Step 306: apply band-pass filtering to the audio source and extract voiceprint features.
  • In step 306, a specific filter and extraction algorithm are used for filtering and voiceprint feature extraction, depending on the sound source features the analysis requires. For example, for speech features, a triangular band-pass filter plus DCT can be used to collect MFCC coefficient features; for ambient sound, a logarithmic filter plus a wavelet transform can be used to collect Jaccard-coefficient bit features.
  • Step 400: read the stored dynamic picture, write the extracted voiceprint features in serialized form into the specified file data node of the dynamic picture, and thereby annotate the dynamic picture with its voiceprint;
  • Step 500: classify and store the annotated dynamic pictures according to their voiceprint features;
  • In step 500, the ways of classifying the annotated dynamic pictures according to their voiceprint features include speech feature classification, ambient sound feature classification, noise feature classification, and so on.
  • Step 600: search by voice input, classification lookup, or similar means, so as to quickly retrieve dynamic pictures having a specific voiceprint feature;
  • In step 600, speech features can be indexed quickly and directly by similarity recognition against the input speech, whereas more complex ambient sound features, noise features, and other sound features should be classified by characteristics such as the sounding object, the scene location, and the sound intensity, and then looked up by category.
  • Referring to FIG. 3, the dynamic picture processing system of the embodiment of the present invention includes a shooting module, a recording module, a storage module, a voiceprint extraction module, a voiceprint annotation module, a classification module, and a retrieval module;
  • the shooting module is configured to shoot a dynamic picture;
  • the recording module is configured to record audio while the dynamic picture is being shot;
  • the storage module is configured to store the captured dynamic picture and the recorded audio;
  • the voiceprint extraction module is configured to extract voiceprint features from the stored recorded audio and to store the extracted voiceprint features; specifically, the voiceprint extraction module further includes an endpoint detection unit, a pre-emphasis unit, an audio framing unit, a windowing unit, an audio source conversion unit, and a filtering unit:
  • the endpoint detection unit is configured to detect whether valid audio source data has arrived;
  • the pre-emphasis unit is configured to apply differencing and filtering to the incoming audio source data; the pre-emphasis filter formula is the one shown as Figure PCTCN2016088859-appb-000004 in the Description.
  • the audio framing unit is configured to discretize the streaming audio source. To preserve some detailed features of the audio source, especially the distinctive sound quality of certain environmental scenes, while keeping the amount of data to process in check, the invention adopts a 1-channel (mono), 44100 Hz sampling standard. By the usual rule of audio processing, the duration of an audio frame is kept at about 20 to 30 ms, so the number of samples in a single audio frame can be set to 1024, which corresponds to an actual duration of 1024 ÷ 44100 × 1000 ≈ 23 ms.
  • the windowing unit is configured to window the frame data with a Hamming window: each frame of audio data S(n) produced by audio framing is processed to obtain S′(n) = S(n) × W(n), where W(n) has the form shown as Figure PCTCN2016088859-appb-000005 in the Description.
  • the audio source conversion unit is configured to convert the time-domain audio source into frequency-domain energy by FFT; an atomic-operation-level fast Fourier transform converts the time-domain audio source into frequency-domain data, and the conversion formula is the one shown as Figure PCTCN2016088859-appb-000006 in the Description.
  • the filtering unit is configured to apply band-pass filtering to the audio source and extract voiceprint features. A specific filter and extraction algorithm are used depending on the sound source features the analysis requires. For example, for speech features, a triangular band-pass filter plus DCT can be used to collect MFCC coefficient features; for ambient sound, a logarithmic filter plus a wavelet transform can be used to collect Jaccard-coefficient bit features.
  • the voiceprint annotation module is configured to read the stored dynamic picture, write the extracted voiceprint features in serialized form into the specified file data node of the dynamic picture, and thereby annotate the dynamic picture with its voiceprint;
  • the classification module is configured to classify and store the annotated dynamic pictures according to their voiceprint features; the ways of classifying include speech feature classification, ambient sound feature classification, noise feature classification, and so on.
  • the retrieval module is configured to search by voice input, classification lookup, or similar means, so as to quickly retrieve dynamic pictures having a specific voiceprint feature. Speech features can be indexed quickly and directly by similarity recognition against the input speech, whereas more complex ambient sound features, noise features, and other sound features should be classified by characteristics such as the sounding object, the scene location, and the sound intensity, and then looked up by category.
  • The dynamic picture processing method and system of the embodiment make full use of the sound information of the shooting scene: they compute and extract the voiceprint features of the scene in real time, write those features into the dynamic picture to annotate it, and classify dynamic pictures according to their voiceprint features. This enables classified retrieval of dynamic pictures and fast matching queries based on voiceprint features, making picture retrieval more efficient and intuitive for the user.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

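A dynamic picture processing method and system. The method includes: shooting a dynamic picture and recording audio while the dynamic picture is being shot; extracting voiceprint features from the recorded audio; and writing the extracted voiceprint features into the dynamic picture so as to annotate the dynamic picture with its voiceprint. By computing and extracting, in real time, the voiceprint features of the scene in which the dynamic picture is shot and writing them into the dynamic picture, the method annotates dynamic pictures with voiceprints and classifies them according to those features. This enables classified retrieval of dynamic pictures and fast matching queries based on voiceprint features, making picture retrieval more efficient and intuitive for the user.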

Description

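Dynamic picture processing method and system
This application is based on, and claims priority to, Chinese patent application No. 2016101964910, filed on March 31, 2016, the entire contents of which are incorporated herein by reference.
[Technical Field]
The present invention relates to the field of dynamic picture processing, and in particular to a dynamic picture processing method and system.
[Background]
Since several mobile device manufacturers introduced new image media formats such as Zoe and LivePhoto, dynamic picture formats are very likely to replace the existing static picture formats and to become an important area of competition in mobile device innovation. Existing dynamic pictures only record the image information within the shooting range and the raw digital media signal; they do not consider the content of the sound in the scene being shot. There is therefore still considerable room to improve the user experience in dynamic picture processing.
[Summary of the Invention]
The invention provides a dynamic picture processing method and system that aim to solve this technical problem: existing dynamic pictures only record the image information within the shooting range and the raw digital media signal, without considering the content of the sound in the scene being shot.
To solve the problem stated above, an embodiment of the present invention adopts the following technical solution: a dynamic picture processing method, including:
shooting a dynamic picture, and recording audio while the dynamic picture is being shot;
extracting voiceprint features from the recorded audio;
writing the extracted voiceprint features into the dynamic picture, thereby annotating the dynamic picture with its voiceprint.
In the technical solution adopted by the embodiment of the present invention, shooting the dynamic picture and recording audio during shooting further includes storing the captured dynamic picture and the recorded audio. The dynamic picture is stored in thumbnail form, and the recorded audio includes speech, ambient sound, or noise.
In the technical solution adopted by the embodiment of the present invention, the voiceprint feature extraction method includes the following steps:
detecting whether valid audio source data has arrived;
applying differencing and filtering to the incoming audio source data;
discretizing the streaming audio source;
windowing the frame data with a Hamming window;
converting the time-domain audio source into frequency-domain energy by fast Fourier transform;
applying band-pass filtering to the audio source and extracting voiceprint features.
In the technical solution adopted by the embodiment of the present invention, the extracted voiceprint features are written into the dynamic picture as follows: the stored dynamic picture is read, and the extracted voiceprint features are written in serialized form into a specified file data node of the dynamic picture.
In the technical solution adopted by the embodiment of the present invention, after the extracted voiceprint features are written into the dynamic picture and the dynamic picture is annotated, the method further includes classifying and storing the annotated dynamic pictures according to their voiceprint features. Classification includes speech feature classification, ambient sound feature classification, or noise feature classification.
In the technical solution adopted by the embodiment of the present invention, after the extracted voiceprint features are written into the dynamic picture and the dynamic picture is annotated, the method further includes retrieving dynamic pictures having a specific voiceprint feature by voice input or by classification lookup.
Another technical solution adopted by the embodiment of the present invention is a dynamic picture processing system including a shooting module, a recording module, a voiceprint extraction module, and a voiceprint annotation module. The shooting module is configured to shoot a dynamic picture; the recording module is configured to record audio while the dynamic picture is being shot; the voiceprint extraction module is configured to extract voiceprint features from the recorded audio; the voiceprint annotation module is configured to write the extracted voiceprint features into the dynamic picture, thereby annotating the dynamic picture with its voiceprint.
In the technical solution adopted by the embodiment of the present invention, the voiceprint extraction module includes an endpoint detection unit, a pre-emphasis unit, an audio framing unit, a windowing unit, an audio source conversion unit, and a filtering unit;
the endpoint detection unit is configured to detect whether valid audio source data has arrived;
the pre-emphasis unit is configured to apply differencing and filtering to the incoming audio source data;
the audio framing unit is configured to discretize the streaming audio source;
the windowing unit is configured to window the frame data with a Hamming window;
the audio source conversion unit is configured to convert the time-domain audio source into frequency-domain energy by fast Fourier transform;
the filtering unit is configured to apply band-pass filtering to the audio source and extract voiceprint features.
The technical solution adopted by the embodiment of the present invention further includes a storage module, which is configured to store the captured dynamic picture and the recorded audio.
The technical solution adopted by the embodiment of the present invention further includes a classification module and a retrieval module. The classification module is configured to classify and store the annotated dynamic pictures according to their voiceprint features; its classification includes speech feature classification, ambient sound feature classification, or noise feature classification. The retrieval module is configured to retrieve dynamic pictures having a specific voiceprint feature by voice input or by classification lookup.
The beneficial effects of the invention are as follows. The dynamic picture processing method and system of the embodiment make full use of the sound information of the shooting scene: they compute and extract the voiceprint features of the scene in real time, write those features into the dynamic picture to annotate it, and classify dynamic pictures according to their voiceprint features. This enables classified retrieval of dynamic pictures and fast matching queries based on voiceprint features, making picture retrieval more efficient and intuitive for the user.
[Brief Description of the Drawings]
FIG. 1 is a flowchart of a dynamic picture processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of voiceprint feature extraction according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a dynamic picture processing system according to an embodiment of the present invention.
[Detailed Description]
To facilitate understanding of the present invention, the invention is described more fully below with reference to the accompanying drawings, which show preferred embodiments of the invention. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure of the invention will be understood more thoroughly and completely.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which the invention belongs. The terms used in the description of the invention are for the purpose of describing specific embodiments only and are not intended to limit the invention.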
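Referring to FIG. 1, which is a flowchart of the dynamic picture processing method of the embodiment of the present invention, the method includes the following steps:
Step 100: start the dynamic photo function and begin shooting a dynamic picture;
Step 200: start the recording function, record audio while the dynamic picture is being shot, and store the captured dynamic picture and the recorded audio;
In step 200, the embodiment stores the dynamic picture in the form Thumbnail (a still preview image) + MOV. The images come from the camera's Preview data: the MOV is generated by encoding multiple frames of image data, and the image at the middle of the time axis is cropped out as the Thumbnail. By default, the recorded MOV (the QuickTime movie format, an audio and video file format developed by Apple for storing common digital media types) is a 4-second video with an audio track. The recorded audio includes speech, ambient sound, noise, and so on.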
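The Thumbnail + MOV arrangement can be pictured with a small Python sketch. The class name `DynamicPicture` and its fields are hypothetical, and frames are stand-in values rather than encoded video; the only behavior taken from the text is that the thumbnail is the frame at the middle of the time axis.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicPicture:
    """Hypothetical container: preview frames plus the audio recorded with them."""
    frames: list                                # preview frames, in capture order
    audio: list = field(default_factory=list)   # mono samples recorded while shooting

    def thumbnail(self):
        """Return the frame at the middle of the time axis."""
        if not self.frames:
            raise ValueError("no frames captured")
        return self.frames[len(self.frames) // 2]


picture = DynamicPicture(frames=["f0", "f1", "f2", "f3", "f4"])
middle = picture.thumbnail()
```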
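Step 300: extract voiceprint features from the stored recorded audio with the voiceprint extraction module, and store the extracted voiceprint features;
In step 300, the embodiment uses a special section of the media information to store the voiceprint features. FIG. 2 is a schematic diagram of the voiceprint feature extraction of the embodiment. The extraction process includes the following steps:
Step 301: endpoint detection: detect whether valid audio source data has arrived;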
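The text does not say how endpoint detection decides that audio is "valid". One common approach, assumed here purely for illustration, is a short-time energy threshold; the function names and the threshold value below are hypothetical.

```python
def frame_energy(samples):
    """Mean squared amplitude of a block of samples (floats in [-1, 1])."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def has_valid_audio(samples, threshold=1e-4):
    """Return True when the block's energy exceeds the assumed threshold."""
    return frame_energy(samples) > threshold


silence = [0.0] * 100
tone = [0.5, -0.5] * 50
```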
Step 302: pre-emphasis: apply differencing and filtering to the incoming audio source data;
In step 302, the pre-emphasis filter formula is:
Figure PCTCN2016088859-appb-000001
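The pre-emphasis formula appears only as a figure in this text, so the Python sketch below assumes the commonly used first-order difference filter y[n] = x[n] - a·x[n-1] with a = 0.97. Both the form and the coefficient are assumptions, and `pre_emphasis` is a hypothetical name.

```python
def pre_emphasis(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample passes through unchanged.

    The filter form and alpha are assumed, not taken from the source figure.
    """
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out


emphasized = pre_emphasis([1.0, 1.0, 1.0], alpha=0.5)
```

A constant input is attenuated after the first sample, which is the expected effect of boosting high frequencies relative to low ones.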
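Step 303: audio framing: discretize the streaming audio source;
In step 303, to preserve some detailed features of the audio source, especially the distinctive sound quality of certain environmental scenes, while keeping the amount of data to process in check, the invention adopts a 1-channel (mono), 44100 Hz sampling standard. By the usual rule of audio processing, the duration of an audio frame is kept at about 20 to 30 ms, so the number of samples in a single audio frame can be set to 1024, which corresponds to an actual duration of 1024 ÷ 44100 × 1000 ≈ 23 ms.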
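The frame-length arithmetic and the framing step can be checked with a short Python sketch. The sample rate and frame size come from the text; the function names are hypothetical, and using non-overlapping frames that drop a trailing partial frame is an assumption, since the text does not mention overlap or padding.

```python
SAMPLE_RATE_HZ = 44100   # 1-channel sampling rate from the text
FRAME_SIZE = 1024        # samples per audio frame from the text


def frame_duration_ms(frame_size=FRAME_SIZE, sample_rate=SAMPLE_RATE_HZ):
    """Duration of one frame in milliseconds."""
    return frame_size / sample_rate * 1000.0


def split_frames(samples, frame_size=FRAME_SIZE):
    """Cut a sample stream into consecutive full frames (assumed: no overlap)."""
    count = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(count)]


duration = frame_duration_ms()
frames = split_frames([0.0] * 2500)
```

The computed duration falls inside the 20 to 30 ms range that the text gives as the usual target.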
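Step 304: windowing: window the frame data with the common Hamming window;
In step 304, each frame of audio data S(n) produced by audio framing is processed with a Hamming window to obtain S′(n) = S(n) × W(n), where W(n) has the following form: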
Figure PCTCN2016088859-appb-000002
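Because W(n) is given only as a figure here, the sketch below assumes the standard Hamming window W(n) = 0.54 - 0.46·cos(2πn / (N - 1)) for n = 0 … N - 1. The coefficients are the conventional ones, not values confirmed by the source, and the function names are hypothetical.

```python
import math


def hamming_window(size):
    """Standard Hamming window coefficients (assumed form) for a frame of `size` samples."""
    if size == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (size - 1))
            for n in range(size)]


def apply_window(frame):
    """Compute S'(n) = S(n) * W(n) for one frame."""
    window = hamming_window(len(frame))
    return [s * w for s, w in zip(frame, window)]


window = hamming_window(5)
windowed = apply_window([1.0] * 5)
```

The window tapers the ends of each frame toward 0.08 and leaves the center near 1, which reduces spectral leakage in the transform that follows.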
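Step 305: FFT (Fast Fourier Transformation): convert the time-domain audio source into frequency-domain energy;
In step 305, an atomic-operation-level fast Fourier transform converts the time-domain audio source into frequency-domain data. The conversion formula is: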
Figure PCTCN2016088859-appb-000003
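The conversion formula is shown only as a figure, so the following sketch assumes the usual discrete Fourier transform X(k) = Σ x(n)·e^(-j2πkn/N) and computes it with a textbook recursive radix-2 algorithm. This is an illustration, not the source's implementation; the function names are hypothetical, and the frame length must be a power of two (1024 qualifies).

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def power_spectrum(frame):
    """Energy |X(k)|^2 for the non-negative frequency bins 0..N/2."""
    spectrum = fft(frame)
    return [abs(c) ** 2 for c in spectrum[: len(frame) // 2 + 1]]


# A cosine that completes two cycles in eight samples concentrates its energy in bin 2.
spectrum = power_spectrum([math.cos(2 * math.pi * 2 * n / 8) for n in range(8)])
```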
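Step 306: apply band-pass filtering to the audio source and extract voiceprint features.
In step 306, a specific filter and extraction algorithm are used for filtering and voiceprint feature extraction, depending on the sound source features the analysis requires. For example, for speech features, a triangular band-pass filter plus DCT can be used to collect MFCC coefficient features; for ambient sound, a logarithmic filter plus a wavelet transform can be used to collect Jaccard-coefficient bit features.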
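For the speech case, the "triangular band-pass filter + DCT" route to MFCC features can be sketched as follows. The text names the technique but gives no parameters, so the mel-scale formula, the 26 filters, the 13 coefficients, and all function names here are conventional choices assumed for illustration. The input is a power spectrum over bins 0..N/2.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, fft_size, sample_rate):
    """Triangular filters spaced evenly on the mel scale, one row per filter."""
    num_bins = fft_size // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edges: each triangle spans three consecutive edges.
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [int(math.floor((fft_size + 1) * hz / sample_rate)) for hz in edges]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * num_bins
        for k in range(left, center):       # rising slope
            row[k] = (k - left) / (center - left)
        for k in range(center, right):      # falling slope
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]


def mfcc(power_spectrum, sample_rate=44100, num_filters=26, num_coeffs=13):
    """Filterbank energies -> log -> DCT, for one frame's power spectrum."""
    fft_size = (len(power_spectrum) - 1) * 2
    bank = mel_filterbank(num_filters, fft_size, sample_rate)
    energies = [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # offset avoids log(0)
    return dct_ii(log_energies, num_coeffs)


coefficients = mfcc([1.0] * 513)   # flat spectrum of a 1024-sample frame
```

The ambient-sound route (logarithmic filter plus wavelet transform) is not sketched here because the text does not give enough detail to illustrate it faithfully.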
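Step 400: read the stored dynamic picture, write the extracted voiceprint features in serialized form into the specified file data node of the dynamic picture, and thereby annotate the dynamic picture with its voiceprint;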
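The text does not specify the serialization format or which data node is used. As a rough illustration only, the sketch below packs a feature vector into a QuickTime-style atom (a 4-byte big-endian size, a 4-byte type code, then the payload). The type code `vprt`, the payload layout, and the function names are all hypothetical, and the sketch only builds and parses the bytes; it does not modify a real MOV file.

```python
import struct

ATOM_TYPE = b"vprt"  # hypothetical four-character code, not from the source


def pack_voiceprint_atom(features):
    """Serialize features as: size (u32), type (4 bytes), count (u32), float32 values."""
    payload = struct.pack(">I", len(features))
    payload += struct.pack(f">{len(features)}f", *features)
    return struct.pack(">I4s", 8 + len(payload), ATOM_TYPE) + payload


def unpack_voiceprint_atom(data):
    """Parse bytes produced by pack_voiceprint_atom back into a list of floats."""
    size, atom_type = struct.unpack_from(">I4s", data, 0)
    if atom_type != ATOM_TYPE or size > len(data):
        raise ValueError("not a voiceprint atom")
    (count,) = struct.unpack_from(">I", data, 8)
    return list(struct.unpack_from(f">{count}f", data, 12))


atom = pack_voiceprint_atom([1.0, -0.5, 0.25])
restored = unpack_voiceprint_atom(atom)
```

Storing values as 32-bit floats halves the size compared with Python's native doubles, at the cost of some precision for values that are not exactly representable.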
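Step 500: classify and store the annotated dynamic pictures according to their voiceprint features;
In step 500, the ways of classifying the annotated dynamic pictures according to their voiceprint features include speech feature classification, ambient sound feature classification, noise feature classification, and so on.
Step 600: search by voice input, classification lookup, or similar means, so as to quickly retrieve dynamic pictures having a specific voiceprint feature;
In step 600, speech features can be indexed quickly and directly by similarity recognition against the input speech, whereas more complex ambient sound features, noise features, and other sound features should be classified by characteristics such as the sounding object, the scene location, and the sound intensity, and then looked up by category.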
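The two retrieval paths, similarity matching for speech and category lookup for other sounds, can be sketched in Python. The similarity measure is not named in the text; cosine similarity, the 0.8 threshold, the category labels, and every function name below are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_by_voice(query, library, min_score=0.8):
    """Return picture ids whose stored features resemble the query, best match first.

    `library` is a list of (picture_id, feature_vector) pairs.
    """
    scored = [(cosine_similarity(query, features), picture_id)
              for picture_id, features in library]
    return [picture_id for score, picture_id in sorted(scored, reverse=True)
            if score >= min_score]


def build_category_index(labelled_pictures):
    """Group picture ids by category from (picture_id, category) pairs."""
    index = defaultdict(list)
    for picture_id, category in labelled_pictures:
        index[category].append(picture_id)
    return dict(index)


library = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
matches = search_by_voice([1.0, 0.0], library)
index = build_category_index([("a", "speech"), ("d", "ambient"), ("e", "ambient")])
```

In this sketch a speech query is matched against stored vectors directly, while ambient or noise pictures are found by looking up a category such as `index["ambient"]`.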
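Referring to FIG. 3, which is a schematic structural diagram of the dynamic picture processing system of the embodiment of the present invention, the system includes a shooting module, a recording module, a storage module, a voiceprint extraction module, a voiceprint annotation module, a classification module, and a retrieval module;
The shooting module is configured to shoot a dynamic picture;
The recording module is configured to record audio while the dynamic picture is being shot;
The storage module is configured to store the captured dynamic picture and the recorded audio;
The voiceprint extraction module is configured to extract voiceprint features from the stored recorded audio and to store the extracted voiceprint features. Specifically, the voiceprint extraction module further includes an endpoint detection unit, a pre-emphasis unit, an audio framing unit, a windowing unit, an audio source conversion unit, and a filtering unit:
The endpoint detection unit is configured to detect whether valid audio source data has arrived;
The pre-emphasis unit is configured to apply differencing and filtering to the incoming audio source data; the pre-emphasis filter formula is: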
Figure PCTCN2016088859-appb-000004
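The audio framing unit is configured to discretize the streaming audio source. To preserve some detailed features of the audio source, especially the distinctive sound quality of certain environmental scenes, while keeping the amount of data to process in check, the invention adopts a 1-channel (mono), 44100 Hz sampling standard. By the usual rule of audio processing, the duration of an audio frame is kept at about 20 to 30 ms, so the number of samples in a single audio frame can be set to 1024, which corresponds to an actual duration of 1024 ÷ 44100 × 1000 ≈ 23 ms.
The windowing unit is configured to window the frame data with a Hamming window: each frame of audio data S(n) produced by audio framing is processed to obtain S′(n) = S(n) × W(n), where W(n) has the following form: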
Figure PCTCN2016088859-appb-000005
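The audio source conversion unit is configured to convert the time-domain audio source into frequency-domain energy by FFT; an atomic-operation-level fast Fourier transform converts the time-domain audio source into frequency-domain data, and the conversion formula is: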
Figure PCTCN2016088859-appb-000006
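The filtering unit is configured to apply band-pass filtering to the audio source and extract voiceprint features. A specific filter and extraction algorithm are used depending on the sound source features the analysis requires. For example, for speech features, a triangular band-pass filter plus DCT can be used to collect MFCC coefficient features; for ambient sound, a logarithmic filter plus a wavelet transform can be used to collect Jaccard-coefficient bit features.
The voiceprint annotation module is configured to read the stored dynamic picture, write the extracted voiceprint features in serialized form into the specified file data node of the dynamic picture, and thereby annotate the dynamic picture with its voiceprint;
The classification module is configured to classify and store the annotated dynamic pictures according to their voiceprint features; the ways of classifying include speech feature classification, ambient sound feature classification, noise feature classification, and so on.
The retrieval module is configured to search by voice input, classification lookup, or similar means, so as to quickly retrieve dynamic pictures having a specific voiceprint feature. Speech features can be indexed quickly and directly by similarity recognition against the input speech, whereas more complex ambient sound features, noise features, and other sound features should be classified by characteristics such as the sounding object, the scene location, and the sound intensity, and then looked up by category.
The dynamic picture processing method and system of the embodiment make full use of the sound information of the shooting scene: they compute and extract the voiceprint features of the scene in real time, write those features into the dynamic picture to annotate it, and classify dynamic pictures according to their voiceprint features. This enables classified retrieval of dynamic pictures and fast matching queries based on voiceprint features, making picture retrieval more efficient and intuitive for the user.
The above embodiments are preferred embodiments of the present invention, but the embodiments of the invention are not limited by them. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the invention shall be regarded as an equivalent replacement and falls within the scope of protection of the invention.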

Claims (10)

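  1. A dynamic picture processing method, characterized by comprising:
    shooting a dynamic picture, and recording audio while the dynamic picture is being shot;
    extracting voiceprint features from the recorded audio;
    writing the extracted voiceprint features into the dynamic picture, thereby annotating the dynamic picture with its voiceprint.
  2. The dynamic picture processing method according to claim 1, characterized in that shooting the dynamic picture and recording audio while the dynamic picture is being shot further comprises: storing the captured dynamic picture and the recorded audio; the dynamic picture is stored in thumbnail form, and the recorded audio includes speech, ambient sound, or noise.
  3. The dynamic picture processing method according to claim 1, characterized in that the voiceprint feature extraction method comprises the following steps:
    detecting whether valid audio source data has arrived;
    applying differencing and filtering to the incoming audio source data;
    discretizing the streaming audio source;
    windowing the frame data with a Hamming window;
    converting the time-domain audio source into frequency-domain energy by fast Fourier transform;
    applying band-pass filtering to the audio source and extracting voiceprint features.
  4. The dynamic picture processing system according to claim 1 or 2, characterized in that the extracted voiceprint features are written into the dynamic picture as follows: the stored dynamic picture is read, and the extracted voiceprint features are written in serialized form into a specified file data node of the dynamic picture.
  5. The dynamic picture processing system according to claim 4, characterized in that, after the extracted voiceprint features are written into the dynamic picture and the dynamic picture is annotated with its voiceprint, the method further comprises: classifying and storing the annotated dynamic pictures according to their voiceprint features; the classification includes speech feature classification, ambient sound feature classification, or noise feature classification.
  6. The dynamic picture processing system according to claim 5, characterized in that, after the extracted voiceprint features are written into the dynamic picture and the dynamic picture is annotated with its voiceprint, the method further comprises: retrieving dynamic pictures having a specific voiceprint feature by voice input or by classification lookup.
  7. A dynamic picture processing system, characterized by comprising a shooting module, a recording module, a voiceprint extraction module, and a voiceprint annotation module; the shooting module is configured to shoot a dynamic picture; the recording module is configured to record audio while the dynamic picture is being shot; the voiceprint extraction module is configured to extract voiceprint features from the recorded audio; the voiceprint annotation module is configured to write the extracted voiceprint features into the dynamic picture, thereby annotating the dynamic picture with its voiceprint.
  8. The dynamic picture processing system according to claim 7, characterized in that the voiceprint extraction module comprises an endpoint detection unit, a pre-emphasis unit, an audio framing unit, a windowing unit, an audio source conversion unit, and a filtering unit;
    the endpoint detection unit is configured to detect whether valid audio source data has arrived;
    the pre-emphasis unit is configured to apply differencing and filtering to the incoming audio source data;
    the audio framing unit is configured to discretize the streaming audio source;
    the windowing unit is configured to window the frame data with a Hamming window;
    the audio source conversion unit is configured to convert the time-domain audio source into frequency-domain energy by fast Fourier transform;
    the filtering unit is configured to apply band-pass filtering to the audio source and extract voiceprint features.
  9. The dynamic picture processing system according to claim 8, characterized by further comprising a storage module, the storage module being configured to store the captured dynamic picture and the recorded audio.
  10. The dynamic picture processing system according to claim 9, characterized by further comprising a classification module and a retrieval module; the classification module is configured to classify and store the annotated dynamic pictures according to their voiceprint features; the classification performed by the classification module includes speech feature classification, ambient sound feature classification, or noise feature classification; the retrieval module is configured to retrieve dynamic pictures having a specific voiceprint feature by voice input or by classification lookup.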
PCT/CN2016/088859 2016-03-31 2016-07-06 Dynamic picture processing method and system WO2017166483A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/245,743 US20170287524A1 (en) 2016-03-31 2016-08-24 Method and electronic device for processing dynamic image

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610196491.0A CN106095764A (zh) 2016-03-31 2016-03-31 Dynamic picture processing method and system
CN201610196491.0 2016-03-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/245,743 Continuation US20170287524A1 (en) 2016-03-31 2016-08-24 Method and electronic device for processing dynamic image

Publications (1)

Publication Number Publication Date
WO2017166483A1 true WO2017166483A1 (zh) 2017-10-05

Family

ID=58702491

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/088859 WO2017166483A1 (zh) 2016-03-31 2016-07-06 Dynamic picture processing method and system

Country Status (2)

Country Link
CN (1) CN106095764A (zh)
WO (1) WO2017166483A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750773A (zh) * 2019-09-16 2020-02-04 康佳集团股份有限公司 Image recognition method based on voiceprint attributes, intelligent terminal, and storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
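JP6893606B2 (ja) 2017-03-20 2021-06-23 達闥机器人有限公司 Image tagging method, apparatus, and electronic device
WO2019127437A1 (zh) * 2017-12-29 2019-07-04 深圳前海达闼云端智能科技有限公司 Map annotation method, apparatus, cloud server, terminal, and application program
CN108281147A (zh) * 2018-03-31 2018-07-13 南京火零信息科技有限公司 Voiceprint recognition system based on LPCC and ADTW
CN109361858A (zh) * 2018-10-29 2019-02-19 北京小米移动软件有限公司 Method and apparatus for acquiring an image, electronic device, and storage medium
CN110647635A (zh) * 2019-09-29 2020-01-03 维沃移动通信有限公司 Image management method and electronic device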

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609968A (zh) * 2012-03-05 2012-07-25 信源通科技(深圳)有限公司 Method and system for implementing pictures with sound
CN103035020A (zh) * 2012-11-23 2013-04-10 惠州Tcl移动通信有限公司 Mobile terminal and picture remark method thereof
US20140122513A1 (en) * 2005-01-03 2014-05-01 Luc Julia System and method for enabling search and retrieval operations to be performed for data items and records using data obtained from associated voice files
CN104298694A (zh) * 2013-07-19 2015-01-21 深圳市康睿祥通讯有限公司 Photo information adding method and apparatus, and mobile terminal
TW201513095A (zh) * 2013-09-23 2015-04-01 Hon Hai Prec Ind Co Ltd Speech processing system, apparatus, and method
CN105677799A (zh) * 2015-12-31 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Photo retrieval method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
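CN100454296C (zh) * 2004-07-29 2009-01-21 鸿富锦精密工业(深圳)有限公司 Audio and video control apparatus and method
CN101102240A (zh) * 2006-07-04 2008-01-09 王建波 Method for collecting and method for retrieving audio and video content
WO2010087125A1 (ja) * 2009-01-29 2010-08-05 日本電気株式会社 Time-segment representative feature vector generation device
CN103035247B (zh) * 2012-12-05 2017-07-07 北京三星通信技术研究有限公司 Method and apparatus for operating on audio/video files based on voiceprint information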

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140122513A1 (en) * 2005-01-03 2014-05-01 Luc Julia System and method for enabling search and retrieval operations to be performed for data items and records using data obtained from associated voice files
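CN102609968A (zh) * 2012-03-05 2012-07-25 信源通科技(深圳)有限公司 Method and system for implementing pictures with sound
CN103035020A (zh) * 2012-11-23 2013-04-10 惠州Tcl移动通信有限公司 Mobile terminal and picture remark method thereof
CN104298694A (zh) * 2013-07-19 2015-01-21 深圳市康睿祥通讯有限公司 Photo information adding method and apparatus, and mobile terminal
TW201513095A (zh) * 2013-09-23 2015-04-01 Hon Hai Prec Ind Co Ltd Speech processing system, apparatus, and method
CN105677799A (zh) * 2015-12-31 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Photo retrieval method and system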

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
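CN110750773A (zh) * 2019-09-16 2020-02-04 康佳集团股份有限公司 Image recognition method based on voiceprint attributes, intelligent terminal, and storage medium
CN110750773B (zh) * 2019-09-16 2023-08-18 康佳集团股份有限公司 Image recognition method based on voiceprint attributes, intelligent terminal, and storage medium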

Also Published As

Publication number Publication date
CN106095764A (zh) 2016-11-09

Similar Documents

Publication Publication Date Title
WO2017166483A1 (zh) Dynamic picture processing method and system
US10108709B1 (en) Systems and methods for queryable graph representations of videos
US11482242B2 (en) Audio recognition method, device and server
EP1692629B1 (en) System & method for integrative analysis of intrinsic and extrinsic audio-visual data
JP6046393B2 (ja) Information processing apparatus, information processing system, information processing method, and recording medium
US20080187231A1 (en) Summarization of Audio and/or Visual Data
US20090150147A1 (en) Recording audio metadata for stored images
CN105957531A (zh) Cloud-platform-based speech content extraction method and apparatus
WO2021120818A1 (en) Methods and systems for managing image collection
KR20070118038A (ko) Information processing apparatus, information processing method, and computer program
EP2107477A3 (en) Summarizing reproduction device and summarizing reproduction method
US8959022B2 (en) System for media correlation based on latent evidences of audio
Douze et al. Circulant temporal encoding for video retrieval and temporal alignment
JP2014006680A5 (ja) Information processing apparatus, information processing system, information processing method, and recording medium
CN108831456B (zh) Method, apparatus, and system for tagging video through speech recognition
CN104298694A (zh) Photo information adding method and apparatus, and mobile terminal
TWM594323U (zh) Intelligent meeting recording system
Radha Video retrieval using speech and text in video
CN116708055A (zh) Intelligent multimedia audiovisual image processing method, system, and storage medium
US20120059855A1 (en) Method and computer program product for enabling organization of media objects
TW201435627A (zh) Search optimization system and method
JPWO2006009035A1 (ja) Signal detection method, signal detection system, signal detection processing program, and recording medium storing the program
Cricri et al. Multimodal event detection in user generated videos
US20170287524A1 (en) Method and electronic device for processing dynamic image
JP2008003972A (ja) Metadata generation device and metadata generation method

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16896256; Country of ref document: EP; Kind code of ref document: A1)
122 Ep: pct application non-entry in european phase (Ref document number: 16896256; Country of ref document: EP; Kind code of ref document: A1)