WO2017144007A1 - Audio recognition method and system based on empirical mode decomposition - Google Patents

Audio recognition method and system based on empirical mode decomposition

Info

Publication number
WO2017144007A1
WO2017144007A1 (PCT/CN2017/074706)
Authority
WO
WIPO (PCT)
Prior art keywords
time offset
audio signal
mode decomposition
empirical mode
time
Prior art date
Application number
PCT/CN2017/074706
Other languages
English (en)
French (fr)
Inventor
岳廷明
Original Assignee
深圳创维数字技术有限公司
深圳市创维软件有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳创维数字技术有限公司 and 深圳市创维软件有限公司
Publication of WO2017144007A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/54: Speech or voice analysis techniques specially adapted for retrieval

Definitions

  • the present invention relates to the field of audio recognition, and in particular, to an audio recognition method and system based on empirical mode decomposition.
  • Audio recognition refers to obtaining the spectrum of the audio signal by spectrum analysis of the audio signal, extracting the feature values of the audio signal, constructing a model or a constellation diagram, and performing target matching and recognition.
  • the main techniques include short-time Fourier transform, spectrogram feature extraction, feature template generation, and so on.
  • The specific processing of a piece of original audio or speech generally goes through the following steps: pre-emphasis, denoising, framing, windowing, fast Fourier transform (FFT), Mel filter bank processing, discrete cosine transform (DCT, computing the cepstral parameters), logarithmic energy, and delta cepstral parameters (in vector form, via the inverse Fourier transform, IFFT), yielding the MFCCs (Mel-frequency cepstral coefficients, the feature values of one frame of audio); the result is a series of feature values that can fully and completely characterize the audio segment.
  • Mainstream matching and recognition algorithms for audio signals mainly operate on the spectrogram (which describes how the intensity of specific frequencies changes over time), comparing times, frequency variations and differences, or finding peaks.
  • One of the main technical implementations converts frequencies into notes, each note corresponding to a pitch range, to form an N-dimensional feature vector; the vector is filtered and normalized to obtain a characteristic spectrogram, an audio voiceprint is obtained by sliding a subgraph over it, and recognition and matching are completed against the bit error rate of the voiceprint.
  • Another main technical solution obtains a series of maximum points of a spectrogram, records the time point and frequency of each maximum, and constructs a constellation map from these maxima; a hash value is generated at each time point from the time offset between two points within the constellation and their respective frequency intensities, and the target is finally identified by counting the number of hash values sharing the same time offset.
  • The object of the present invention is to provide an audio recognition method and system based on empirical mode decomposition, aiming to solve the problem that existing recognition methods cannot completely and fully characterize the audio signal.
  • An audio recognition method based on empirical mode decomposition, which comprises the steps of:
  • A. inputting the original audio signal, sampling the original audio signal, and sequentially performing denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, then connecting the spectrum data of each frame in turn to obtain a spectrogram;
  • the step D specifically includes:
  • D4 obtains N hash values through the N sets of eigenmode functions to form a set of eigenvalues.
  • the method further includes:
  • the step E specifically includes:
  • E2: computing the time offset difference between each time offset in the time offset group and the time offset of the feature value, and determining the target audio to be identified from the distribution and number of these time offset differences.
  • In step D3, the appended sampling sequence is processed with the SHA-1 hash algorithm or the MurmurHash algorithm to obtain a hash value.
  • An audio recognition system based on empirical mode decomposition, which includes:
  • a spectrogram acquisition module, configured to input the original audio signal, sample it, then sequentially perform denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connect the spectrum data of each frame in turn to obtain a spectrogram;
  • a time-frequency curve generation module, configured to obtain the point at which the energy maximum of each frequency segment of the spectrogram is located, and connect these points in turn to generate a time-frequency curve;
  • an empirical mode decomposition module, configured to perform empirical mode decomposition on the generated time-frequency curve to obtain a plurality of eigenmode functions;
  • a feature value output module, configured to generate, from the obtained eigenmode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and to output them.
  • the feature value output module specifically includes:
  • a sampling unit, configured to sample each eigenmode function at equal intervals to obtain a corresponding sampling sequence;
  • an appending unit, configured to append the sequence number of the frequency segment to the sampling sequence;
  • a hash processing unit, configured to process the appended sampling sequence to obtain a hash value;
  • a vector composition unit, configured to obtain N hash values from the N sets of eigenmode functions, which together form a set of feature values.
  • the audio recognition system further includes:
  • a distribution quantity obtaining module configured to acquire a distribution and a quantity of the time offset difference according to the feature value to represent the original audio signal.
  • the distribution quantity obtaining module specifically includes:
  • a time offset group acquisition unit, configured to search the database with the feature value and obtain a time offset group formed by the time offsets of several other feature values that match the feature value;
  • a time offset difference calculation unit, configured to compute the time offset difference between each time offset in the time offset group and the time offset of the feature value, and then determine the target audio to be identified from the distribution and number of these time offset differences.
  • In the hash processing unit, the appended sampling sequence is processed with the SHA-1 hash algorithm or the MurmurHash algorithm to obtain a hash value.
  • The present invention introduces the EMD (empirical mode decomposition) method into the generation of audio signal feature values, so that the trend information of the audio features is fully fused into feature value generation and the generated feature values characterize the audio signal more completely.
  • The invention can replace the construction of complex feature models and constellation diagrams, and can effectively fuse the change-process information of the features, so that the feature values represent the audio signal more fully, accurately and effectively.
  • FIG. 1 is a flow chart of a first embodiment of an audio recognition method based on empirical mode decomposition according to the present invention
  • FIG. 3 is a specific flowchart of step S104 in the method shown in FIG. 1;
  • Figure 4 shows the five IMF data curves generated by EMD decomposition in the present invention.
  • FIG. 5 is a flowchart of a second embodiment of an audio recognition method based on empirical mode decomposition according to the present invention.
  • FIG. 6 is a specific flow chart of step S105 in the method shown in Figure 5;
  • FIG. 7 is a structural block diagram of a first embodiment of an audio recognition system based on empirical mode decomposition according to the present invention.
  • Figure 8 is a block diagram showing the specific structure of the eigenvalue output module in the system shown in Figure 7;
  • FIG. 9 is a structural block diagram of a second embodiment of an audio recognition system based on empirical mode decomposition according to the present invention.
  • FIG. 10 is a block diagram showing the specific structure of the distribution quantity acquisition module in the system shown in Figure 9.
  • The invention provides an audio recognition method and system based on empirical mode decomposition; to make its objects, technical solutions and effects clearer, the present invention is described in further detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
  • FIG. 1 is a flowchart of a first embodiment of an audio recognition method based on empirical mode decomposition according to the present invention; as shown in the figure, it comprises the steps described below.
  • The method of the present invention introduces empirical mode decomposition (EMD, Empirical Mode Decomposition) into the generation of audio signal feature values, because the intrinsic mode function (IMF, Intrinsic Mode Function) items generated by EMD can sufficiently retain the characteristics of the original signal sequence and are easy to apply to non-stationary sequences.
  • Empirical mode decomposition is introduced into feature value generation by assembling the energy-maximum points (tn, fn) generated from the spectrogram into the original signal sequence for EMD decomposition; this sequence is EMD-decomposed to obtain N IMF items.
  • the method of the invention can fully retain the trend information of the signal characteristics in the frequency domain with time, so that the characterization of the audio signal is more sufficient, accurate and effective.
  • In step S101, the original audio signal (i.e., the analog audio signal) is input through the microphone and then undergoes A/D conversion and sampling (for example, at a sampling rate of 44100 Hz) to obtain a digital audio signal.
  • Denoising is then performed by spectral subtraction, which mainly exploits the short-time stationarity of the audio signal: the short-time spectrum of the noise is subtracted from the short-time spectrum of the noisy signal, eliminating the random environmental noise mixed into the signal and yielding the spectrum of the clean audio signal (i.e., the audio data, which is buffered) for speech-enhancement purposes.
  • The digital audio signal can be pre-emphasized prior to spectral subtraction denoising.
  • Pre-emphasis exploits the difference between the signal and noise characteristics to process the signal effectively: a pre-emphasis network is applied before the noise is introduced, reducing the high-frequency component of the noise and improving the output signal-to-noise ratio.
  • Each frame is N milliseconds long, and each piece of audio data after framing can be regarded as a steady-state signal.
  • The spectrum data of each frame is connected in turn, with time as the horizontal axis, frequency as the vertical axis and color representing the intensity of the spectral amplitude (energy), and is drawn as the spectrogram shown in FIG. 2.
  • In step S102, the energy value of each frequency point within each frequency segment of each frame of spectrum data in the spectrogram is calculated to obtain the energy maximum of each frequency segment; the time frame and frequency segment of each maximum are obtained in turn, taken as new points, and connected in sequence to generate the target curve, i.e., the time-frequency curve.
  • The spectrogram is divided into n consecutive frequency segments, numbered id1, id2, ..., idn.
  • Within a given frequency range (e.g., frequency segment idn, from 6 kHz to 9 kHz in Figure 2), the points of the spectrogram's energy maxima are connected, while points that do not reach the specified intensity threshold are treated as the lower limit of that frequency range, forming a continuous, dynamically changing curve with time as the horizontal axis and frequency as the vertical axis, i.e., the time-frequency curve.
  • In step S103, the generated time-frequency curve is subjected to empirical mode decomposition to obtain a number of eigenmode function items that can fully characterize the variation of the curve, e.g., N groups of IMF items (this curve yields up to 12 groups), each item being a time-domain curve.
  • the step S104 specifically includes:
  • Each IMF item is sampled at equal intervals (the sampling interval is kept the same for all IMF items and must not be too large, so as to preserve the dynamic-variation information of the curve), e.g., the curves IMF C1, IMF C2, IMF C3, IMF C4 and IMF C5 in Figure 4, to obtain a corresponding sampling sequence x1, x2, ..., xn, and the sequence number idn of the frequency segment of the corresponding IMF item is appended to the sequence.
  • The appended sampling sequence is processed with the SHA-1 hash algorithm or the MurmurHash algorithm to obtain a 32-bit or 64-bit hash value, so that the N hash values obtained from the N groups (i.e., N items) of IMFs form a set of feature values (also called a set of feature vectors).
  • At the same time, the time offset tm of this set of feature values (i.e., the position on the time axis of the audio signal's start frame) is saved.
  • The method of the present invention can fully fuse the trend information of the audio features into the generation of the feature values, so that the generated feature values characterize the audio signal more completely.
  • The invention combines the feature values generated per frame with those generated over local time segments, enriching the audio feature information; that is, EMD empirical mode decomposition is performed separately on each frame of audio and on the feature values extracted from several frames of audio.
  • The invention can replace the construction of complex feature models and constellation diagrams, and can effectively fuse the change-process information of the features, so that the feature values represent the audio signal more fully, accurately and effectively.
  • FIG. 5 shows a flowchart of a second embodiment of an audio recognition method based on empirical mode decomposition according to the present invention.
  • Compared with the first method embodiment, step S105 is added after step S104.
  • The main purpose of step S105 is to use the previously generated feature values to obtain the distribution and number of time offset differences, thereby characterizing the audio signal intuitively.
  • the step S105 specifically includes:
  • S302: computing the time offset difference between each time offset in the time offset group and the time offset of the feature value, and determining the target audio to be identified from the distribution and number of these time offset differences.
  • For each generated feature value (i.e., the target feature value), a search of the database yields the time offsets t1, t2, ..., tn of the other matching feature vectors.
  • The time offset differences td1, td2, ..., tdn between this group of time offsets and the time offset tm of the feature value (i.e., the target feature value) are computed in turn; each translation (with a step of n frames) yields another N groups of time offset differences.
  • The target is determined by counting the distribution and number of all time offset differences; the audio with the most concentrated distribution of time offset differences is the identified target audio.
  • To enrich the generated feature values, the signal is divided into blocks of several frames (for example, 50 frames), the energy maximum of each block is obtained, and steps S103 to S105 above are performed again, yielding more feature values and time offset differences. In this way, feature-variation information over a larger range can be fully captured to strengthen the characterization of the entire audio signal.
  • Based on the above method, the present invention also provides a first embodiment of an audio recognition system based on empirical mode decomposition, as shown in FIG. 7, which includes:
  • a spectrogram acquisition module 100, configured to input the original audio signal, sample it, then sequentially perform denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connect the spectrum data of each frame in turn to obtain a spectrogram;
  • a time-frequency curve generation module 200, configured to obtain the point at which the energy maximum of each frequency segment of the spectrogram is located, and connect these points in turn to generate a time-frequency curve;
  • an empirical mode decomposition module 300, configured to perform empirical mode decomposition on the generated time-frequency curve to obtain a plurality of eigenmode functions;
  • a feature value output module 400, configured to generate, from the obtained eigenmode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and to output them.
  • The system of the invention introduces empirical mode decomposition into the generation of audio signal feature values, because the eigenmode function items generated by EMD can sufficiently retain the characteristics of the original signal sequence and are easy to apply to non-stationary sequences.
  • Empirical mode decomposition is introduced into feature value generation by assembling the energy-maximum points (tn, fn) generated from the spectrogram into the original signal sequence for EMD decomposition; this sequence is EMD-decomposed to obtain N IMF items.
  • The system of the present invention can sufficiently retain the trend information of the signal features in the frequency domain over time, so that the feature values represent the audio signal more fully, accurately and effectively.
  • The original audio signal (i.e., the analog audio signal) is input through the microphone and then undergoes A/D conversion and sampling (for example, at a sampling rate of 44100 Hz) to obtain a digital audio signal.
  • Denoising is then performed by spectral subtraction, which mainly exploits the short-time stationarity of the audio signal: the short-time spectrum of the noise is subtracted from the short-time spectrum of the noisy signal, eliminating the random environmental noise mixed into the signal and yielding the spectrum of the clean audio signal (i.e., the audio data, which is buffered) for speech-enhancement purposes.
  • The digital audio signal can be pre-emphasized prior to spectral subtraction denoising.
  • Pre-emphasis exploits the difference between the signal and noise characteristics to process the signal effectively: a pre-emphasis network is applied before the noise is introduced, reducing the high-frequency component of the noise and improving the output signal-to-noise ratio.
  • each piece of audio data after framing can be viewed as a steady state signal.
  • The spectrum data of each frame is connected in turn, with time as the horizontal axis, frequency as the vertical axis and color representing the intensity of the spectral amplitude (energy), and is drawn as the spectrogram shown in FIG. 2.
  • In the time-frequency curve generation module 200, the energy value of each frequency point within each frequency segment of each frame of spectrum data in the spectrogram is calculated to obtain the energy maximum of each frequency segment; the time frame and frequency segment of each maximum are obtained in turn, taken as new points, and connected in sequence to generate the target curve, i.e., the time-frequency curve.
  • The spectrogram is divided into n consecutive frequency segments, numbered id1, id2, ..., idn.
  • Within a given frequency range (e.g., frequency segment idn, from 6 kHz to 9 kHz in Figure 2), the points of the spectrogram's energy maxima are connected, while points that do not reach the specified intensity threshold are treated as the lower limit of that frequency range, forming a continuous, dynamically changing curve with time as the horizontal axis and frequency as the vertical axis, i.e., the time-frequency curve.
  • In the empirical mode decomposition module 300, the generated time-frequency curve is subjected to empirical mode decomposition to obtain a number of eigenmode function items that can fully characterize the variation of the curve, e.g., N groups of IMF items (this curve yields up to 12 groups), each item being a time-domain curve.
  • the feature value output module 400 specifically includes:
  • a sampling unit 410, configured to sample each eigenmode function at equal intervals to obtain a corresponding sampling sequence;
  • an appending unit 420, configured to append the sequence number of the frequency segment to the sampling sequence;
  • a hash processing unit 430, configured to process the appended sampling sequence to obtain a hash value;
  • a vector composition unit 440, configured to obtain N hash values from the N sets of eigenmode functions, which together form a set of feature values.
  • Each IMF item is sampled at equal intervals (the sampling interval is kept the same for all IMF items and must not be too large, so as to preserve the dynamic-variation information of the curve), e.g., the curves IMF C1, IMF C2, IMF C3, IMF C4 and IMF C5 in Figure 4, to obtain a corresponding sampling sequence x1, x2, ..., xn, and the sequence number idn of the frequency segment of the corresponding IMF item is appended to the sequence.
  • The appended sampling sequence is processed with the SHA-1 hash algorithm or the MurmurHash algorithm to obtain a 32-bit or 64-bit hash value, so that the N hash values obtained from the N groups (i.e., N items) of IMFs form a set of feature values (also called a set of feature vectors).
  • At the same time, the time offset tm of this set of feature values (i.e., the position on the time axis of the audio signal's start frame) is saved.
  • The system of the present invention can fully fuse the trend information of the audio features into the generation of the feature values, so that the generated feature values characterize the audio signal more completely.
  • The invention combines the feature values generated per frame with those generated over local time segments, enriching the audio feature information; that is, EMD empirical mode decomposition is performed separately on each frame of audio and on the feature values extracted from several frames of audio.
  • The invention can replace the construction of complex feature models and constellation diagrams, and can effectively fuse the change-process information of the features, so that the feature values represent the audio signal more fully, accurately and effectively.
  • The present invention also provides a second embodiment of an audio recognition system based on empirical mode decomposition, as shown in FIG. 9, which includes:
  • a spectrogram acquisition module 100, configured to input the original audio signal, sample it, then sequentially perform denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connect the spectrum data of each frame in turn to obtain a spectrogram;
  • a time-frequency curve generation module 200, configured to obtain the point at which the energy maximum of each frequency segment of the spectrogram is located, and connect these points in turn to generate a time-frequency curve;
  • an empirical mode decomposition module 300, configured to perform empirical mode decomposition on the generated time-frequency curve to obtain a plurality of eigenmode functions;
  • a feature value output module 400, configured to generate, from the obtained eigenmode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and to output them;
  • a distribution quantity acquisition module 500, configured to obtain the distribution and number of time offset differences according to the feature values, so as to characterize the original audio signal.
  • This embodiment differs from the first system embodiment in that the distribution quantity acquisition module 500 is added; its main purpose is to use the previously generated feature values to obtain the distribution and number of time offset differences, thereby characterizing the audio signal intuitively.
  • the distribution quantity obtaining module 500 specifically includes:
  • a time offset group acquisition unit 510, configured to search the database with the feature value and obtain a time offset group formed by the time offsets of several other feature values that match the feature value;
  • a time offset difference calculation unit 520, configured to compute the time offset difference between each time offset in the time offset group and the time offset of the feature value, and then determine the target audio to be identified from the distribution and number of these time offset differences.
  • For each generated feature value (i.e., the target feature value), a search of the database yields the time offsets t1, t2, ..., tn of the other matching feature vectors.
  • The time offset differences td1, td2, ..., tdn between this group of time offsets and the time offset tm of the feature value (i.e., the target feature value) are computed in turn; each translation (with a step of n frames) yields another N groups of time offset differences.
  • The target is determined by counting the distribution and number of all time offset differences; the audio with the most concentrated distribution of time offset differences is the identified target audio.

Abstract

An audio recognition method and system based on empirical mode decomposition. The method includes the steps of: A. inputting an original audio signal, sampling the original audio signal, then sequentially performing denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connecting the spectrum data of each frame in turn to obtain a spectrogram (S101); B. obtaining the point at which the energy maximum of each frequency segment of the spectrogram is located, and connecting these points in turn to generate a time-frequency curve (S102); C. performing empirical mode decomposition on the generated time-frequency curve to obtain a plurality of intrinsic mode functions (S103); D. generating, from the obtained intrinsic mode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and outputting them (S104). The trend information of the audio features is fully fused into feature value generation, so that the generated feature values characterize the audio signal more completely.

Description

Audio recognition method and system based on empirical mode decomposition
Technical Field
The present invention relates to the field of audio recognition, and in particular to an audio recognition method and system based on empirical mode decomposition.
Background Art
Audio recognition refers to performing spectrum analysis on an audio signal to obtain its spectrum, extracting the feature values of the audio signal, constructing a model or a constellation map, and then performing target matching and recognition. The main techniques include the short-time Fourier transform, spectrogram feature extraction, feature template generation, and so on.
The specific processing of a piece of original audio or speech generally goes through the following steps: pre-emphasis, denoising, framing, windowing, fast Fourier transform (FFT), Mel filter bank processing, discrete cosine transform (DCT, computing the cepstral parameters), logarithmic energy, and delta cepstral parameters (in vector form, via the inverse Fourier transform, IFFT), yielding the MFCCs (Mel-frequency cepstral coefficients, the feature values of one frame of audio). The result is a series of feature values that can fully and completely characterize the audio segment.
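Purely as a background illustration (this classic front end is prior art, not the claimed method), the pipeline above can be sketched in Python with the third-party librosa package; the file name and all parameter values below are illustrative assumptions:

```python
# Sketch of the classic MFCC front end described above, using the
# third-party librosa package; file name and parameters are assumptions.
import librosa

y, sr = librosa.load("example.wav", sr=44100)       # hypothetical input file
y = librosa.effects.preemphasis(y)                  # pre-emphasis
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,  # FFT -> Mel filter bank -> log -> DCT
                            n_fft=2048, hop_length=1024)
delta = librosa.feature.delta(mfcc)                 # delta (difference) cepstral parameters
print(mfcc.shape, delta.shape)                      # (13, n_frames) for each
```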
At present, mainstream matching and recognition algorithms for audio signals mainly operate on the spectrogram (which describes how the intensity of particular frequencies changes over time), comparing times, frequency variations and differences, or finding peaks. One main technical implementation converts frequencies into notes, each note corresponding to a pitch range, to form an N-dimensional feature vector; after filtering and normalization, a characteristic spectrogram is obtained, an audio voiceprint is derived by sliding a subgraph over it, and recognition and matching are completed by computing the bit error rate of the voiceprint. Another main technical solution obtains a series of maximum points of a spectrogram, records the time point and frequency of each maximum, and constructs a constellation map from these maxima; a hash value is generated at each time point from the time offset between two points within the constellation and their respective frequency intensities, and the target is finally identified by counting the number of hash values sharing the same time offset.
The construction of feature models and constellation maps is relatively complex and cannot effectively and completely characterize how the features of the audio signal change; the process and trend of feature variation cannot be fused into the generation of feature values, i.e., the resulting feature templates cannot completely and fully characterize the audio signal.
Therefore, the prior art still awaits improvement and development.
Summary of the Invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide an audio recognition method and system based on empirical mode decomposition, aiming to solve the problem that existing recognition methods cannot completely and fully characterize the audio signal.
The technical solution of the present invention is as follows:
An audio recognition method based on empirical mode decomposition, comprising the steps of:
A. inputting an original audio signal, sampling the original audio signal, then sequentially performing denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connecting the spectrum data of each frame in turn to obtain a spectrogram;
B. obtaining the point at which the energy maximum of each frequency segment of the spectrogram is located, and connecting these points in turn to generate a time-frequency curve;
C. performing empirical mode decomposition on the generated time-frequency curve to obtain a plurality of intrinsic mode functions;
D. generating, from the obtained intrinsic mode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and outputting them.
Preferably, step D specifically includes:
D1. sampling each intrinsic mode function at equal intervals to obtain a corresponding sampling sequence;
D2. appending the sequence number of the frequency segment to the sampling sequence;
D3. processing the appended sampling sequence to obtain a hash value;
D4. obtaining N hash values from the N sets of intrinsic mode functions, which together form one set of feature values.
Preferably, after step D the method further includes:
E. obtaining the distribution and number of time offset differences according to the feature values, so as to characterize the original audio signal.
Preferably, step E specifically includes:
E1. searching a database with the feature value to obtain a time offset group formed by the time offsets of several other feature values that match the feature value;
E2. computing the time offset difference between each time offset in the time offset group and the time offset of the feature value, and then determining the target audio to be identified from the distribution and number of these time offset differences.
Preferably, in step D3, the appended sampling sequence is processed with the SHA-1 hash algorithm or the MurmurHash algorithm to obtain a hash value.
An audio recognition system based on empirical mode decomposition, comprising:
a spectrogram acquisition module, configured to input an original audio signal, sample the original audio signal, then sequentially perform denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connect the spectrum data of each frame in turn to obtain a spectrogram;
a time-frequency curve generation module, configured to obtain the point at which the energy maximum of each frequency segment of the spectrogram is located, and connect these points in turn to generate a time-frequency curve;
an empirical mode decomposition module, configured to perform empirical mode decomposition on the generated time-frequency curve to obtain a plurality of intrinsic mode functions;
a feature value output module, configured to generate, from the obtained intrinsic mode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and output them.
Preferably, the feature value output module specifically includes:
a sampling unit, configured to sample each intrinsic mode function at equal intervals to obtain a corresponding sampling sequence;
an appending unit, configured to append the sequence number of the frequency segment to the sampling sequence;
a hash processing unit, configured to process the appended sampling sequence to obtain a hash value;
a vector composition unit, configured to obtain N hash values from the N sets of intrinsic mode functions, which together form one set of feature values.
Preferably, the audio recognition system further includes:
a distribution quantity acquisition module, configured to obtain the distribution and number of time offset differences according to the feature values, so as to characterize the original audio signal.
Preferably, the distribution quantity acquisition module specifically includes:
a time offset group acquisition unit, configured to search a database with the feature value and obtain a time offset group formed by the time offsets of several other feature values that match the feature value;
a time offset difference calculation unit, configured to compute the time offset difference between each time offset in the time offset group and the time offset of the feature value, and then determine the target audio to be identified from the distribution and number of these time offset differences.
Preferably, in the hash processing unit, the appended sampling sequence is processed with the SHA-1 hash algorithm or the MurmurHash algorithm to obtain a hash value.
Beneficial effects: the present invention introduces the EMD (empirical mode decomposition) method into the generation of audio signal feature values, so that the trend information of the audio features is fully fused into feature value generation and the generated feature values characterize the audio signal more completely. The invention can replace the construction of complex feature models and constellation maps, and can effectively fuse the change-process information of the features, so that the feature values represent the audio signal more fully, accurately and effectively.
Brief Description of the Drawings
FIG. 1 is a flowchart of a first embodiment of an audio recognition method based on empirical mode decomposition according to the present invention;
FIG. 2 is a spectrogram generated by the short-time Fourier transform in the present invention;
FIG. 3 is a specific flowchart of step S104 in the method shown in FIG. 1;
FIG. 4 shows the five IMF data curves generated after EMD decomposition in the present invention;
FIG. 5 is a flowchart of a second embodiment of an audio recognition method based on empirical mode decomposition according to the present invention;
FIG. 6 is a specific flowchart of step S105 in the method shown in FIG. 5;
FIG. 7 is a structural block diagram of a first embodiment of an audio recognition system based on empirical mode decomposition according to the present invention;
FIG. 8 is a block diagram of the specific structure of the feature value output module in the system shown in FIG. 7;
FIG. 9 is a structural block diagram of a second embodiment of an audio recognition system based on empirical mode decomposition according to the present invention;
FIG. 10 is a block diagram of the specific structure of the distribution quantity acquisition module in the system shown in FIG. 9.
Detailed Description
The present invention provides an audio recognition method and system based on empirical mode decomposition. To make the objects, technical solutions and effects of the present invention clearer and more explicit, the present invention is described in further detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
Referring to FIG. 1, FIG. 1 is a flowchart of a first embodiment of an audio recognition method based on empirical mode decomposition according to the present invention. As shown in the figure, it comprises the steps of:
S101. inputting an original audio signal, sampling the original audio signal, then sequentially performing denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connecting the spectrum data of each frame in turn to obtain a spectrogram;
S102. obtaining the point at which the energy maximum of each frequency segment of the spectrogram is located, and connecting these points in turn to generate a time-frequency curve;
S103. performing empirical mode decomposition on the generated time-frequency curve to obtain a plurality of intrinsic mode functions;
S104. generating, from the obtained intrinsic mode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and outputting them.
The method of the present invention introduces empirical mode decomposition (EMD, Empirical Mode Decomposition) into the generation of audio signal feature values. Because the intrinsic mode function (IMF, Intrinsic Mode Function) items generated by EMD can sufficiently retain the characteristics of the original signal sequence and are easy to apply to non-stationary sequences, empirical mode decomposition is introduced into feature value generation: the energy-maximum points (tn, fn) generated from the spectrogram are assembled into the original signal sequence for EMD decomposition, and this sequence is EMD-decomposed to obtain N IMF items. The method of the present invention can fully retain the trend information of the signal features in the frequency domain over time, so that the feature values represent the audio signal more fully, accurately and effectively.
Specifically, in step S101, the original audio signal (i.e., an analog audio signal) is input through a microphone and then undergoes A/D conversion and sampling (for example, at a sampling rate of 44100 Hz) to obtain a digital audio signal.
Denoising is then performed by spectral subtraction, which mainly exploits the short-time stationarity of the audio signal: the short-time spectrum of the noise is subtracted from the short-time spectrum of the noisy signal, eliminating the random environmental noise mixed into the signal and yielding the spectrum of the clean audio signal (i.e., the audio data, which is buffered), for the purpose of speech enhancement. Before spectral subtraction denoising, the digital audio signal may be pre-emphasized. Pre-emphasis exploits the difference between the signal and noise characteristics to process the signal effectively: a pre-emphasis network is applied before the noise is introduced, reducing the high-frequency component of the noise and improving the output signal-to-noise ratio.
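A minimal sketch of these two preprocessing steps, assuming NumPy arrays and assuming (this is not specified by the patent) that the noise spectrum is estimated from the first few frames:

```python
# Sketch of pre-emphasis and magnitude spectral subtraction; the noise
# estimate from the first `noise_frames` frames is an assumption.
import numpy as np

def preemphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: boosts high frequencies before noise enters
    return np.append(x[0], x[1:] - alpha * x[:-1])

def spectral_subtract(frames_fft, noise_frames=10):
    # frames_fft: complex STFT, shape (n_frames, n_bins)
    mag, phase = np.abs(frames_fft), np.angle(frames_fft)
    noise_mag = mag[:noise_frames].mean(axis=0)   # short-time noise spectrum
    clean_mag = np.maximum(mag - noise_mag, 0.0)  # clip negative residuals
    return clean_mag * np.exp(1j * phase)         # keep the noisy phase
```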
The buffered audio data is then framed, each frame lasting N milliseconds; each piece of audio data after framing can be regarded as a steady-state signal.
A Hamming window is then generated and overlapped onto the audio data with an overlap ratio of 1/2, i.e., a frame shift of N/2 milliseconds. Since truncating the signal directly causes spectral leakage, a non-rectangular window such as the Hamming window is applied to mitigate it; the amplitude-frequency characteristic of the Hamming window features strong sidelobe attenuation, the attenuation from the mainlobe peak to the first sidelobe peak reaching about 40 dB.
Fourier transform processing (i.e., the FFT, fast Fourier transform) is then applied to each frame of audio data to obtain the spectrum data; for the specific technical details of the Fourier transform, reference may be made to the prior art, which is not described in detail here.
The spectrum data of each frame is connected in turn, with time as the horizontal axis, frequency as the vertical axis, and color representing the intensity of the spectral amplitude (energy), and drawn as the spectrogram shown in FIG. 2.
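The framing, windowing and FFT steps above can be sketched as follows; the 32 ms frame length is an illustrative assumption (the text only says N milliseconds):

```python
# Sketch of spectrogram construction: N ms frames, Hamming window,
# 1/2 overlap (frame shift N/2 ms), FFT magnitude per frame.
import numpy as np

def spectrogram(x, sr=44100, frame_ms=32):
    n = int(sr * frame_ms / 1000)      # samples per frame
    hop = n // 2                       # 1/2 overlap -> N/2 ms frame shift
    win = np.hamming(n)                # ~40 dB first-sidelobe attenuation
    frames = [x[i:i + n] * win for i in range(0, len(x) - n, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))   # rows: frames, cols: bins
```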
In step S102, the energy value of each frequency point within each frequency segment of each frame of spectrum data in the spectrogram is calculated to obtain the energy maximum of each frequency segment; the time frame and frequency segment where each segment's energy maximum is located are obtained in turn and taken as new points, and the points of the energy maxima are connected in sequence to generate the target curve, i.e., the time-frequency curve.
For example, the spectrogram is divided into n consecutive frequency segments numbered id1, id2, ..., idn. Within a given frequency range (e.g., frequency segment idn, from 6 kHz to 9 kHz in FIG. 2), the points of the spectrogram's energy maxima are connected, while points that do not reach the specified intensity threshold are treated as the lower limit of that frequency range, forming a continuous, dynamically changing curve with time as the horizontal axis and frequency as the vertical axis, i.e., the time-frequency curve.
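A sketch of extracting the time-frequency curve of one frequency segment from such a spectrogram, following the rule above; the bin range and intensity threshold are assumptions:

```python
# For each time frame, take the bin of maximum energy inside the segment;
# frames below the intensity threshold are clamped to the segment's lower limit.
import numpy as np

def band_max_curve(spec, lo_bin, hi_bin, threshold):
    seg = spec[:, lo_bin:hi_bin]           # one frequency segment, all frames
    curve = seg.argmax(axis=1) + lo_bin    # per-frame energy-maximum bin
    curve[seg.max(axis=1) < threshold] = lo_bin
    return curve                           # one frequency value per time frame
```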
In step S103, the generated time-frequency curve is subjected to empirical mode decomposition to obtain a number of intrinsic mode function items that can fully characterize the variation of the curve, e.g., N groups of IMF items (this curve yields up to 12 groups), each item being a time-domain curve.
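The patent does not prescribe a particular EMD implementation. The following is a minimal sifting sketch for the time-frequency curve; a production system might instead use a dedicated library, and a careful implementation would treat envelope boundary effects and stopping criteria more rigorously:

```python
# Minimal EMD sifting sketch: repeatedly subtract the mean of the upper and
# lower cubic-spline envelopes until an IMF is isolated, then peel it off.
import numpy as np
from scipy.signal import argrelextrema
from scipy.interpolate import CubicSpline

def emd(signal, max_imfs=12, sift_iters=8):
    t = np.arange(len(signal))
    residue, imfs = signal.astype(float), []
    for _ in range(max_imfs):
        h = residue.copy()
        for _ in range(sift_iters):
            maxima = argrelextrema(h, np.greater)[0]
            minima = argrelextrema(h, np.less)[0]
            if len(maxima) < 2 or len(minima) < 2:    # residue is monotonic: done
                return imfs + [residue]
            upper = CubicSpline(maxima, h[maxima])(t)  # upper envelope
            lower = CubicSpline(minima, h[minima])(t)  # lower envelope
            h -= (upper + lower) / 2                   # remove local mean
        imfs.append(h)
        residue = residue - h
    return imfs
```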
As shown in FIG. 3, step S104 specifically includes:
S201. sampling each intrinsic mode function at equal intervals to obtain a corresponding sampling sequence;
S202. appending the sequence number of the frequency segment to the sampling sequence;
S203. processing the appended sampling sequence to obtain a hash value;
S204. obtaining N hash values from the N sets of intrinsic mode functions, which together form one set of feature values.
Specifically, each IMF item is sampled at equal intervals (the sampling interval is kept the same for all IMF items and must not be too large, so as to preserve the dynamic-variation information of the curve), e.g., the curves IMF C1, IMF C2, IMF C3, IMF C4 and IMF C5 in FIG. 4, to obtain a corresponding sampling sequence x1, x2, ..., xn; the sequence number idn of the frequency segment of the corresponding IMF item is appended to the sequence, and the appended sequence is processed with the SHA-1 or MurmurHash algorithm to obtain a 32-bit or 64-bit hash value, so that the N hash values obtained from the N groups (i.e., N items) of IMFs form one set of feature values (also called a feature vector). At the same time, the time offset tm of this set of feature values (i.e., the position on the time axis of the audio signal's start frame) is saved.
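A sketch of this feature-value generation step; the sampling interval, the float32 serialization and the truncation of SHA-1 to 32 bits are illustrative choices rather than requirements of the text:

```python
# Sample each IMF at equal intervals, append the frequency-segment id,
# and hash the result (SHA-1 here; MurmurHash would work the same way).
import hashlib
import numpy as np

def imf_hash(imf, segment_id, step=4):
    samples = np.asarray(imf)[::step]                # equal-interval sampling
    payload = samples.astype(np.float32).tobytes() + str(segment_id).encode()
    return hashlib.sha1(payload).hexdigest()[:8]     # 8 hex chars = 32 bits

def feature_vector(imfs, segment_id):
    return [imf_hash(imf, segment_id) for imf in imfs]  # N hashes = one feature set
```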
The method of the present invention can fully fuse the trend information of the audio features into the generation of the feature values, so that the generated feature values characterize the audio signal more completely. The invention combines the feature values generated per frame with those generated over local time segments, enriching the audio feature information; that is, EMD empirical mode decomposition is performed separately on each frame of audio and on the feature values extracted from several frames of audio. The invention can replace the construction of complex feature models and constellation maps, and can effectively fuse the change-process information of the features, so that the feature values represent the audio signal more fully, accurately and effectively.
Referring to FIG. 5, FIG. 5 is a flowchart of a second embodiment of an audio recognition method based on empirical mode decomposition according to the present invention, which specifically includes:
S101. inputting an original audio signal, sampling the original audio signal, then sequentially performing denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connecting the spectrum data of each frame in turn to obtain a spectrogram;
S102. obtaining the point at which the energy maximum of each frequency segment of the spectrogram is located, and connecting these points in turn to generate a time-frequency curve;
S103. performing empirical mode decomposition on the generated time-frequency curve to obtain a plurality of intrinsic mode functions;
S104. generating, from the obtained intrinsic mode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and outputting them;
S105. obtaining the distribution and number of time offset differences according to the feature values, so as to characterize the original audio signal.
This embodiment differs from the first method embodiment in that step S105 is added after step S104. The main purpose of step S105 is to use the previously generated feature values to obtain the distribution and number of time offset differences, thereby characterizing the audio signal intuitively.
Specifically, as shown in FIG. 6, step S105 specifically includes:
S301. searching a database with the feature value to obtain a time offset group formed by the time offsets of several other feature values that match the feature value;
S302. computing the time offset difference between each time offset in the time offset group and the time offset of the feature value, and then determining the target audio to be identified from the distribution and number of these time offset differences.
The database is searched with the several generated feature values; for each feature value (i.e., the target feature value), the time offsets t1, t2, ..., tn of several other feature vectors matching this feature value are obtained, and the time offset differences td1, td2, ..., tdn between this group of time offsets and the time offset tm of the feature value (i.e., the target feature value) are computed; proceeding in this way, each translation (with a step of n frames) yields another N groups of time offset differences.
This continues until the entire original audio signal has been processed; finally, the target is determined by counting the distribution and number of all time offset differences, the audio with the most concentrated distribution of time offset differences being the identified target audio.
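A sketch of this counting step; the database layout, a mapping from hash value to (track id, time offset) pairs, is an assumed structure:

```python
# Histogram the offset differences td = t_db - t_query per candidate track;
# the candidate whose differences cluster most tightly wins.
from collections import Counter

def identify(query_hashes, db):
    votes = Counter()
    for h, t_query in query_hashes:                 # (hash, time offset) pairs
        for track_id, t_db in db.get(h, []):
            votes[(track_id, t_db - t_query)] += 1  # one offset-difference vote
    if not votes:
        return None, 0
    (track_id, _), count = votes.most_common(1)[0]  # most concentrated difference
    return track_id, count
```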
To suitably enrich the generated feature values, the signal is divided into blocks of several frames (for example, 50 frames), the energy maximum of each block is obtained, and steps S103 to S105 above are performed again; more feature values and time offset differences can thus be obtained. In this way, feature-variation information over a larger range can be fully captured, strengthening the characterization of the entire audio signal.
Based on the above method, the present invention also provides a first embodiment of an audio recognition system based on empirical mode decomposition, as shown in FIG. 7, which includes:
a spectrogram acquisition module 100, configured to input an original audio signal, sample the original audio signal, then sequentially perform denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connect the spectrum data of each frame in turn to obtain a spectrogram;
a time-frequency curve generation module 200, configured to obtain the point at which the energy maximum of each frequency segment of the spectrogram is located, and connect these points in turn to generate a time-frequency curve;
an empirical mode decomposition module 300, configured to perform empirical mode decomposition on the generated time-frequency curve to obtain a plurality of intrinsic mode functions;
a feature value output module 400, configured to generate, from the obtained intrinsic mode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and output them.
The system of the present invention introduces empirical mode decomposition into the generation of audio signal feature values. Because the intrinsic mode function items generated by EMD can sufficiently retain the characteristics of the original signal sequence and are easy to apply to non-stationary sequences, empirical mode decomposition is introduced into feature value generation: the energy-maximum points (tn, fn) generated from the spectrogram are assembled into the original signal sequence for EMD decomposition, and this sequence is EMD-decomposed to obtain N IMF items. The system of the present invention can fully retain the trend information of the signal features in the frequency domain over time, so that the feature values represent the audio signal more fully, accurately and effectively.
Specifically, in the spectrogram acquisition module 100, the original audio signal (i.e., an analog audio signal) is input through a microphone and then undergoes A/D conversion and sampling (for example, at a sampling rate of 44100 Hz) to obtain a digital audio signal.
Denoising is then performed by spectral subtraction, which mainly exploits the short-time stationarity of the audio signal: the short-time spectrum of the noise is subtracted from the short-time spectrum of the noisy signal, eliminating the random environmental noise mixed into the signal and yielding the spectrum of the clean audio signal (i.e., the audio data, which is buffered), for the purpose of speech enhancement. Before spectral subtraction denoising, the digital audio signal may be pre-emphasized. Pre-emphasis exploits the difference between the signal and noise characteristics to process the signal effectively: a pre-emphasis network is applied before the noise is introduced, reducing the high-frequency component of the noise and improving the output signal-to-noise ratio.
The buffered audio data is then framed, each frame lasting N milliseconds; each piece of audio data after framing can be regarded as a steady-state signal.
A Hamming window is then generated and overlapped onto the audio data with an overlap ratio of 1/2, i.e., a frame shift of N/2 milliseconds. Since truncating the signal directly causes spectral leakage, a non-rectangular window such as the Hamming window is applied to mitigate it; the amplitude-frequency characteristic of the Hamming window features strong sidelobe attenuation, the attenuation from the mainlobe peak to the first sidelobe peak reaching about 40 dB.
Fourier transform processing (i.e., the FFT, fast Fourier transform) is then applied to each frame of audio data to obtain the spectrum data; for the specific technical details of the Fourier transform, reference may be made to the prior art, which is not described in detail here.
The spectrum data of each frame is connected in turn, with time as the horizontal axis, frequency as the vertical axis, and color representing the intensity of the spectral amplitude (energy), and drawn as the spectrogram shown in FIG. 2.
In the time-frequency curve generation module 200, the energy value of each frequency point within each frequency segment of each frame of spectrum data in the spectrogram is calculated to obtain the energy maximum of each frequency segment; the time frame and frequency segment where each segment's energy maximum is located are obtained in turn and taken as new points, and the points of the energy maxima are connected in sequence to generate the target curve, i.e., the time-frequency curve.
For example, the spectrogram is divided into n consecutive frequency segments numbered id1, id2, ..., idn. Within a given frequency range (e.g., frequency segment idn, from 6 kHz to 9 kHz in FIG. 2), the points of the spectrogram's energy maxima are connected, while points that do not reach the specified intensity threshold are treated as the lower limit of that frequency range, forming a continuous, dynamically changing curve with time as the horizontal axis and frequency as the vertical axis, i.e., the time-frequency curve.
In the empirical mode decomposition module 300, the generated time-frequency curve is subjected to empirical mode decomposition to obtain a number of intrinsic mode function items that can fully characterize the variation of the curve, e.g., N groups of IMF items (this curve yields up to 12 groups), each item being a time-domain curve.
Further, as shown in FIG. 8, the feature value output module 400 specifically includes:
a sampling unit 410, configured to sample each intrinsic mode function at equal intervals to obtain a corresponding sampling sequence;
an appending unit 420, configured to append the sequence number of the frequency segment to the sampling sequence;
a hash processing unit 430, configured to process the appended sampling sequence to obtain a hash value;
a vector composition unit 440, configured to obtain N hash values from the N sets of intrinsic mode functions, which together form one set of feature values.
Specifically, each IMF item is sampled at equal intervals (the sampling interval is kept the same for all IMF items and must not be too large, so as to preserve the dynamic-variation information of the curve), e.g., the curves IMF C1, IMF C2, IMF C3, IMF C4 and IMF C5 in FIG. 4, to obtain a corresponding sampling sequence x1, x2, ..., xn; the sequence number idn of the frequency segment of the corresponding IMF item is appended to the sequence, and the appended sequence is processed with the SHA-1 or MurmurHash algorithm to obtain a 32-bit or 64-bit hash value, so that the N hash values obtained from the N groups (i.e., N items) of IMFs form one set of feature values (also called a feature vector). At the same time, the time offset tm of this set of feature values (i.e., the position on the time axis of the audio signal's start frame) is saved.
The system of the present invention can fully fuse the trend information of the audio features into the generation of the feature values, so that the generated feature values characterize the audio signal more completely. The invention combines the feature values generated per frame with those generated over local time segments, enriching the audio feature information; that is, EMD empirical mode decomposition is performed separately on each frame of audio and on the feature values extracted from several frames of audio. The invention can replace the construction of complex feature models and constellation maps, and can effectively fuse the change-process information of the features, so that the feature values represent the audio signal more fully, accurately and effectively.
The present invention also provides a second embodiment of an audio recognition system based on empirical mode decomposition, as shown in FIG. 9, which includes:
a spectrogram acquisition module 100, configured to input an original audio signal, sample the original audio signal, then sequentially perform denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connect the spectrum data of each frame in turn to obtain a spectrogram;
a time-frequency curve generation module 200, configured to obtain the point at which the energy maximum of each frequency segment of the spectrogram is located, and connect these points in turn to generate a time-frequency curve;
an empirical mode decomposition module 300, configured to perform empirical mode decomposition on the generated time-frequency curve to obtain a plurality of intrinsic mode functions;
a feature value output module 400, configured to generate, from the obtained intrinsic mode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and output them;
a distribution quantity acquisition module 500, configured to obtain the distribution and number of time offset differences according to the feature values, so as to characterize the original audio signal.
This embodiment differs from the first system embodiment in that the distribution quantity acquisition module 500 is added. The main purpose of the distribution quantity acquisition module 500 is to use the previously generated feature values to obtain the distribution and number of time offset differences, thereby characterizing the audio signal intuitively.
Further, as shown in FIG. 10, the distribution quantity acquisition module 500 specifically includes:
a time offset group acquisition unit 510, configured to search the database with the feature value and obtain a time offset group formed by the time offsets of several other feature values that match the feature value;
a time offset difference calculation unit 520, configured to compute the time offset difference between each time offset in the time offset group and the time offset of the feature value, and then determine the target audio to be identified from the distribution and number of these time offset differences.
The database is searched with the several generated feature values; for each feature value (i.e., the target feature value), the time offsets t1, t2, ..., tn of several other matching feature vectors are obtained, and the time offset differences td1, td2, ..., tdn between this group of time offsets and the time offset tm of the feature value (i.e., the target feature value) are computed; proceeding in this way, each translation (with a step of n frames) yields another N groups of time offset differences.
This continues until the entire original audio signal has been processed; finally, the target is determined by counting the distribution and number of all time offset differences, the audio with the most concentrated distribution of time offset differences being the identified target audio.
To suitably enrich the generated feature values, the signal is divided into blocks of several frames (for example, 50 frames), the energy maximum of each block is obtained, and the empirical mode decomposition module 300, the feature value output module 400 and the distribution quantity acquisition module 500 are executed again; more feature values and time offset differences can thus be obtained. In this way, feature-variation information over a larger range can be fully captured, strengthening the characterization of the entire audio signal.
It should be understood that the application of the present invention is not limited to the above examples; those of ordinary skill in the art may make improvements or modifications in light of the above description, and all such improvements and modifications shall fall within the scope of protection of the appended claims of the present invention.

Claims (14)

  1. An audio recognition method based on empirical mode decomposition, characterized by comprising the steps of:
    A. inputting an original audio signal, sampling the original audio signal, then sequentially performing denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connecting the spectrum data of each frame in turn to obtain a spectrogram;
    B. obtaining the point at which the energy maximum of each frequency segment of the spectrogram is located, and connecting these points in turn to generate a time-frequency curve;
    C. performing empirical mode decomposition on the generated time-frequency curve to obtain a plurality of intrinsic mode functions;
    D. generating, from the obtained intrinsic mode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and outputting them.
  2. The audio recognition method based on empirical mode decomposition according to claim 1, characterized in that step D specifically includes:
    D1. sampling each intrinsic mode function at equal intervals to obtain a corresponding sampling sequence;
    D2. appending the sequence number of the frequency segment to the sampling sequence;
    D3. processing the appended sampling sequence to obtain a hash value;
    D4. obtaining N hash values from the N sets of intrinsic mode functions, which together form one set of feature values.
  3. The audio recognition method based on empirical mode decomposition according to claim 2, characterized in that after step D the method further includes:
    E. obtaining the distribution and number of time offset differences according to the feature values, so as to characterize the original audio signal.
  4. The audio recognition method based on empirical mode decomposition according to claim 3, characterized in that step E specifically includes:
    E1. searching a database with the feature value to obtain a time offset group formed by the time offsets of several other feature values that match the feature value;
    E2. computing the time offset difference between each time offset in the time offset group and the time offset of the feature value, and then determining the target audio to be identified from the distribution and number of these time offset differences.
  5. The audio recognition method based on empirical mode decomposition according to claim 2, characterized in that in step D3 the appended sampling sequence is processed with the SHA-1 hash algorithm or the MurmurHash algorithm to obtain a hash value.
  6. The audio recognition method based on empirical mode decomposition according to claim 1, characterized in that in step A denoising is performed by spectral subtraction.
  7. The audio recognition method based on empirical mode decomposition according to claim 6, characterized in that before spectral subtraction denoising the audio signal is pre-emphasized.
  8. An audio recognition system based on empirical mode decomposition, characterized by comprising:
    a spectrogram acquisition module, configured to input an original audio signal, sample the original audio signal, then sequentially perform denoising preprocessing, Hamming windowing and Fourier transform to obtain spectrum data, and connect the spectrum data of each frame in turn to obtain a spectrogram;
    a time-frequency curve generation module, configured to obtain the point at which the energy maximum of each frequency segment of the spectrogram is located, and connect these points in turn to generate a time-frequency curve;
    an empirical mode decomposition module, configured to perform empirical mode decomposition on the generated time-frequency curve to obtain a plurality of intrinsic mode functions;
    a feature value output module, configured to generate, from the obtained intrinsic mode functions combined with the corresponding frequency segments and time frames, a plurality of feature values for characterizing the original audio signal, and output them.
  9. The audio recognition system based on empirical mode decomposition according to claim 8, characterized in that the feature value output module specifically includes:
    a sampling unit, configured to sample each intrinsic mode function at equal intervals to obtain a corresponding sampling sequence;
    an appending unit, configured to append the sequence number of the frequency segment to the sampling sequence;
    a hash processing unit, configured to process the appended sampling sequence to obtain a hash value;
    a vector composition unit, configured to obtain N hash values from the N sets of intrinsic mode functions, which together form one set of feature values.
  10. The audio recognition system based on empirical mode decomposition according to claim 9, characterized by further comprising:
    a distribution quantity acquisition module, configured to obtain the distribution and number of time offset differences according to the feature values, so as to characterize the original audio signal.
  11. The audio recognition system based on empirical mode decomposition according to claim 10, characterized in that the distribution quantity acquisition module specifically includes:
    a time offset group acquisition unit, configured to search a database with the feature value and obtain a time offset group formed by the time offsets of several other feature values that match the feature value;
    a time offset difference calculation unit, configured to compute the time offset difference between each time offset in the time offset group and the time offset of the feature value, and then determine the target audio to be identified from the distribution and number of these time offset differences.
  12. The audio recognition system based on empirical mode decomposition according to claim 9, characterized in that in the hash processing unit the appended sampling sequence is processed with the SHA-1 hash algorithm or the MurmurHash algorithm to obtain a hash value.
  13. The audio recognition system based on empirical mode decomposition according to claim 8, characterized in that in the spectrogram acquisition module denoising is performed by spectral subtraction.
  14. The audio recognition system based on empirical mode decomposition according to claim 13, characterized in that before spectral subtraction denoising the audio signal is pre-emphasized.
PCT/CN2017/074706 2016-02-25 2017-02-24 Audio recognition method and system based on empirical mode decomposition WO2017144007A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610103443.2A CN105788603B (zh) 2016-02-25 2016-02-25 一种基于经验模态分解的音频识别方法及系统
CN2016101034432 2016-02-25

Publications (1)

Publication Number Publication Date
WO2017144007A1 (zh)

Family

ID=56403668

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/074706 WO2017144007A1 (zh) 2016-02-25 2017-02-24 一种基于经验模态分解的音频识别方法及系统

Country Status (2)

Country Link
CN (1) CN105788603B (zh)
WO (1) WO2017144007A1 (zh)


Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788603B (zh) * 2016-02-25 2019-04-16 深圳创维数字技术有限公司 一种基于经验模态分解的音频识别方法及系统
CN107895571A (zh) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 无损音频文件识别方法及装置
CN106656882B (zh) * 2016-11-29 2019-05-10 中国科学院声学研究所 一种信号合成方法及系统
CN106601265B (zh) * 2016-12-15 2019-08-13 中国人民解放军第四军医大学 一种消除毫米波生物雷达语音中噪声的方法
GB201801875D0 (en) * 2017-11-14 2018-03-21 Cirrus Logic Int Semiconductor Ltd Audio processing
CN110070874B (zh) * 2018-01-23 2021-07-30 中国科学院声学研究所 一种针对声纹识别的语音降噪方法及装置
CN108986840A (zh) * 2018-04-03 2018-12-11 五邑大学 一种在检测验电笔过程中对蜂鸣器音频的识别方法
CN109102811B (zh) * 2018-07-27 2021-03-30 广州酷狗计算机科技有限公司 音频指纹的生成方法、装置及存储介质
CN109616143B (zh) * 2018-12-13 2019-09-10 山东省计算中心(国家超级计算济南中心) 基于变分模态分解和感知哈希的语音端点检测方法
CN111402926A (zh) * 2020-03-19 2020-07-10 中国电影科学技术研究所 影院放映内容的检测方法、装置、设备及智能网络传感器
CN111935044B (zh) * 2020-08-20 2021-03-09 金陵科技学院 基于emd分解的psk及qam类信号调制识别方法
CN112214635B (zh) * 2020-10-23 2022-09-13 昆明理工大学 一种基于倒频谱分析的快速音频检索方法
CN113628641A (zh) * 2021-06-08 2021-11-09 广东工业大学 一种基于深度学习的用于检查口鼻呼吸的方法
CN114023313B (zh) * 2022-01-04 2022-04-08 北京世纪好未来教育科技有限公司 语音处理模型的训练、语音处理方法、装置、设备及介质
CN117118536B (zh) * 2023-10-25 2023-12-19 南京派格测控科技有限公司 调频稳定性的确定方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030033094A1 (en) * 2001-02-14 2003-02-13 Huang Norden E. Empirical mode decomposition for analyzing acoustical signals
US20090116595A1 (en) * 2007-05-21 2009-05-07 Florida State University System and methods for determining masking signals for applying empirical mode decomposition (emd) and for demodulating intrinsic mode functions obtained from application of emd
CN101727905A (zh) * 2009-11-27 2010-06-09 江南大学 一种得到具有精细时频结构的声纹图的方法
CN104795064A (zh) * 2015-03-30 2015-07-22 福州大学 低信噪比声场景下声音事件的识别方法
CN105788603A (zh) * 2016-02-25 2016-07-20 深圳创维数字技术有限公司 一种基于经验模态分解的音频识别方法及系统

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3886372B2 (ja) * 2001-12-13 2007-02-28 松下電器産業株式会社 音響変節点抽出装置及びその方法、音響再生装置及びその方法、音響信号編集装置、音響変節点抽出方法プログラム記録媒体、音響再生方法プログラム記録媒体、音響信号編集方法プログラム記録媒体、音響変節点抽出方法プログラム、音響再生方法プログラム、音響信号編集方法プログラム
US8391615B2 (en) * 2008-12-02 2013-03-05 Intel Corporation Image recognition algorithm, method of identifying a target image using same, and method of selecting data for transmission to a portable electronic device
CN103209036B (zh) * 2013-04-22 2015-10-14 哈尔滨工程大学 基于Hilbert-黄变换双重降噪的瞬态信号检测方法
CN104299620A (zh) * 2014-09-22 2015-01-21 河海大学 一种基于emd算法的语音增强方法
CN104900229A (zh) * 2015-05-25 2015-09-09 桂林电子科技大学信息科技学院 一种语音信号混合特征参数的提取方法


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108682429A (zh) * 2018-05-29 2018-10-19 平安科技(深圳)有限公司 语音增强方法、装置、计算机设备及存储介质
CN109410977A (zh) * 2018-12-19 2019-03-01 东南大学 一种基于EMD-Wavelet的MFCC相似度的语音段检测方法
CN109948286A (zh) * 2019-03-29 2019-06-28 华北理工大学 基于改进经验小波分解的信号分解方法
CN109948286B (zh) * 2019-03-29 2023-10-03 华北理工大学 基于改进经验小波分解的信号分解方法
CN110556125A (zh) * 2019-10-15 2019-12-10 出门问问信息科技有限公司 基于语音信号的特征提取方法、设备及计算机存储介质
CN111046323A (zh) * 2019-12-24 2020-04-21 国网河北省电力有限公司信息通信分公司 一种基于emd的网络流量数据预处理方法
CN111276154A (zh) * 2020-02-26 2020-06-12 中国电子科技集团公司第三研究所 风噪声抑制方法与系统以及炮声检测方法与系统
CN111276154B (zh) * 2020-02-26 2022-12-09 中国电子科技集团公司第三研究所 风噪声抑制方法与系统以及炮声检测方法与系统
CN113314137B (zh) * 2020-02-27 2022-07-26 东北大学秦皇岛分校 一种基于动态进化粒子群屏蔽emd的混合信号分离方法
CN113314137A (zh) * 2020-02-27 2021-08-27 东北大学秦皇岛分校 一种基于动态进化粒子群屏蔽emd的混合信号分离方法
CN111524493A (zh) * 2020-05-27 2020-08-11 珠海格力智能装备有限公司 调试曲谱的方法及装置
CN115129923A (zh) * 2022-05-17 2022-09-30 荣耀终端有限公司 语音搜索方法、设备及存储介质
CN115129923B (zh) * 2022-05-17 2023-10-20 荣耀终端有限公司 语音搜索方法、设备及存储介质
CN116127277A (zh) * 2023-04-12 2023-05-16 武汉工程大学 激波流场动态压力测量不确定度评定方法及系统
CN116129926A (zh) * 2023-04-19 2023-05-16 北京北信源软件股份有限公司 智能设备自然语言交互信息处理方法
CN116129926B (zh) * 2023-04-19 2023-06-09 北京北信源软件股份有限公司 智能设备自然语言交互信息处理方法

Also Published As

Publication number Publication date
CN105788603B (zh) 2019-04-16
CN105788603A (zh) 2016-07-20

Similar Documents

Publication Publication Date Title
WO2017144007A1 (zh) 一种基于经验模态分解的音频识别方法及系统
WO2018190547A1 (ko) 심화신경망 기반의 잡음 및 에코의 통합 제거 방법 및 장치
WO2020034526A1 (zh) 保险录音的质检方法、装置、设备和计算机存储介质
WO2020207035A1 (zh) 骚扰电话拦截方法、装置、设备及存储介质
CN106875938B (zh) 一种改进的非线性自适应语音端点检测方法
WO2019004592A1 (ko) 생성적 대립 망 기반의 음성 대역폭 확장기 및 확장 방법
CN102543073B (zh) 一种沪语语音识别信息处理方法
WO2013183928A1 (ko) 오디오 부호화방법 및 장치, 오디오 복호화방법 및 장치, 및 이를 채용하는 멀티미디어 기기
WO2020153572A1 (ko) 사운드 이벤트 탐지 모델 학습 방법
WO2017071453A1 (zh) 一种语音识别的方法及装置
WO2018038381A1 (ko) 외부 기기를 제어하는 휴대 기기 및 이의 오디오 신호 처리 방법
WO2010067976A2 (ko) 신호 분리 방법, 상기 신호 분리 방법을 이용한 통신 시스템 및 음성인식시스템
WO2020253115A1 (zh) 基于语音识别的产品推荐方法、装置、设备和存储介质
WO2018217059A1 (en) Method and electronic device for managing loudness of audio signal
Liu Sound source seperation with distributed microphone arrays in the presence of clocks synchronization errors
Al-Kaltakchi et al. Comparison of I-vector and GMM-UBM approaches to speaker identification with TIMIT and NIST 2008 databases in challenging environments
Hou et al. Domain adversarial training for speech enhancement
EP4042725A1 (en) Position detection method, apparatus, electronic device and computer readable storage medium
CN110176243A (zh) 语音增强方法、模型训练方法、装置和计算机设备
Adibi A low overhead scaled equalized harmonic-based voice authentication system
WO2018199367A1 (ko) 스테레오 채널 잡음 제거 장치 및 방법
WO2012053809A2 (ko) 음성통신 기반 잡음 제거 시스템 및 그 방법
WO2022075702A1 (ko) 음성을 이용한 안면 검출 방법
WO2019156427A1 (ko) 발화된 단어에 기초하여 화자를 식별하기 위한 방법 및 그 장치, 문맥 기반 음성 모델 관리 장치 및 그 방법
WO2014157954A1 (ko) 뇌의 음성처리에 기반한 음성신호 프레임 가변 분할 방법

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17755836

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17755836

Country of ref document: EP

Kind code of ref document: A1