WO2014153800A1 - Speech recognition system - Google Patents

Speech recognition system

Info

Publication number
WO2014153800A1
WO2014153800A1 (PCT/CN2013/074831)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
signal
recognized
recognition system
Prior art date
Application number
PCT/CN2013/074831
Other languages
English (en)
French (fr)
Inventor
王健铭 (WANG, Jianming)
Original Assignee
京东方科技集团股份有限公司 (BOE Technology Group Co., Ltd.)
北京京东方显示技术有限公司 (Beijing BOE Display Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司 (BOE Technology Group Co., Ltd.) and 北京京东方显示技术有限公司 (Beijing BOE Display Technology Co., Ltd.)
Priority to US14/366,482 priority Critical patent/US20150340027A1/en
Publication of WO2014153800A1 publication Critical patent/WO2014153800A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • The present invention relates to the field of speech detection technologies and, in particular, to a speech recognition system.
  • Speech recognition includes both speaker recognition and semantic (speech-content) recognition.
  • The former exploits the speaker's individual characteristics in the speech signal and does not consider the meaning of the words contained in the speech, emphasizing the speaker's individuality;
  • the latter aims to identify the semantic content of the speech signal without considering the speaker's individuality, emphasizing the commonality of speech.
  • The technical problem to be solved by the technical solution of the present invention is how to provide a speech recognition system capable of improving the reliability of speaker detection, so that speech products can be more widely applied.
  • the speech recognition system includes:
  • a storage unit configured to store a voice model of at least one user
  • a voice collection and pre-processing unit configured to collect a speech signal to be recognized and to perform format conversion and encoding on it;
  • The voice collection and pre-processing unit is further configured to sequentially amplify, gain-control, filter, and sample the speech signal to be recognized, and then to perform format conversion and encoding, so that the signal is divided into short-time segments composed of multiple frames.
  • The voice collection and pre-processing unit is further configured to apply a window function to the format-converted and encoded speech signal for pre-emphasis processing.
  • The foregoing speech recognition system further includes:
  • An endpoint detection unit configured to compute the speech start point and speech end point of the format-converted and encoded speech signal and remove the silence in it, obtaining the time-domain range of the speech; and to perform fast Fourier transform (FFT) analysis on the speech spectrum and, from the analysis result, identify the vowel, voiced, and unvoiced-consonant segments of the signal.
  • The feature extraction unit obtains the speech feature parameters by extracting Mel-frequency cepstral coefficient (MFCC) features from the encoded speech signal to be recognized.
  • The speech recognition system further includes: a speech modeling unit configured to use the speech feature parameters, based on MFCCs, to build a text-independent Gaussian mixture model as the acoustic model of the speech.
  • The pattern matching unit uses the Gaussian mixture model and the maximum a posteriori (MAP) algorithm to match the extracted speech feature parameters against at least one of the voice models, computing the likelihood of the speech signal to be recognized with respect to each voice model.
  • The MAP algorithm matches the extracted speech feature parameters against at least one of the voice models and determines the user to whom the speech signal belongs; with the Gaussian mixture model, the feature parameters of the speech signal to be recognized are uniquely determined by a set of parameters {w_i, μ_i, Σ_i}, where w_i, μ_i, and Σ_i are respectively the mixture weights, mean vectors, and covariance matrices of the speaker's speech feature parameters.
  • The speech recognition system further includes a decision unit configured to compare the voice model having the highest likelihood with the speech signal to be recognized against a preset recognition threshold and determine the user to whom the signal belongs.
  • The characteristics of speech are analyzed and MFCC parameters are used to build the speaker's voice feature model and implement the speaker recognition algorithm, which can improve the reliability of speaker detection so that speaker recognition can finally be implemented on electronic products.
  • FIG. 1 is a block diagram showing the structure of a speech recognition system according to an exemplary embodiment of the present invention;
  • FIG. 2 is a schematic diagram showing the processing flow, in the voice collection and pre-processing stage, of a speech recognition system according to an exemplary embodiment of the present invention;
  • FIG. 3 is a schematic diagram showing the speech recognition principle of a speech recognition system according to an exemplary embodiment of the present invention;
  • FIG. 4 shows a schematic diagram of the speech output frequency using a Mel filter.
  • FIG. 1 is a schematic structural diagram of a speech recognition system according to an exemplary embodiment of the present invention. As shown in FIG. 1, the speech recognition system includes:
  • a storage unit 10 configured to store a voice model of at least one user;
  • a voice collection and pre-processing unit 20 configured to collect a speech signal to be recognized and to perform format conversion and encoding on the speech signal to be recognized;
  • a feature extraction unit 30 configured to extract speech feature parameters from the encoded speech signal to be recognized; and a pattern matching unit 40 configured to match the extracted speech feature parameters against at least one voice model and determine the user to whom the speech signal belongs.
  • FIG. 2 is a schematic diagram of the processing flow of the speech recognition system in the voice collection and pre-processing stage. As shown in FIG. 2, after collecting the speech signal to be recognized, the voice collection and pre-processing unit 20 sequentially amplifies, gain-controls, filters, and samples the signal, then performs format conversion and encoding so that the signal is divided into short-time segments composed of multiple frames. Optionally, pre-emphasis processing may be performed on the format-converted and encoded speech signal.
  • Speech acquisition is essentially the digitization of the speech signal: through amplification and gain control, anti-aliasing filtering, sampling, A/D (analog/digital) conversion, and encoding (generally pulse code modulation, PCM), the speech signal to be recognized is filtered and amplified, and the filtered and amplified analog signal is converted into a digital speech signal.
  • The voice collection and pre-processing unit 20 can also reverse this process on the encoded speech signal, reconstructing a speech waveform from the digitized speech, i.e., performing a D/A (digital/analog) conversion. Smoothing filtering is required after the D/A conversion to smooth the higher harmonics of the reconstructed speech waveform and remove higher-harmonic distortion.
  • At this point the speech signal has been divided into frame-by-frame short-time segments; each short-time speech frame is then treated as a stationary random signal, and digital signal processing techniques are used to extract the speech feature parameters.
  • Data are taken from the data area frame by frame; after one frame is processed the next is taken, and so on, finally yielding the time series of speech feature parameters composed of the per-frame parameters.
  • The voice collection and pre-processing unit 20 is further configured to apply a window function to the format-converted and encoded speech signal for pre-emphasis processing.
  • Pre-processing generally includes pre-emphasis, windowing, and framing. Because the average power spectrum of the speech signal is affected by glottal excitation and lip-nose radiation, the high-frequency end falls off at about 6 dB/octave (20 dB/decade) above roughly 800 Hz: generally, the higher the frequency, the smaller the amplitude, and when the power of the speech signal halves, the amplitude of its power spectrum drops by half an order of magnitude. Therefore, before the speech signal is analyzed, it is generally boosted (pre-emphasized).
  • The window functions commonly used in speech signal processing are the rectangular window and the Hamming window; the sampled speech signal is windowed and segmented into frame-by-frame short-time speech sequences. With N the frame length, the expressions are w(n) = 1 for 0 ≤ n ≤ N−1 (rectangular) and w(n) = 0.54 − 0.46·cos(2πn/(N−1)) for 0 ≤ n ≤ N−1 (Hamming), with w(n) = 0 otherwise.
  • The speech recognition system may further include: an endpoint detection unit 50 configured to compute the speech start point and speech end point of the format-converted and encoded speech signal and remove the silence in it, obtaining the time-domain range of the speech in the signal; and to perform fast Fourier transform (FFT) analysis on the speech spectrum and, from the analysis result, identify the vowel, voiced, and unvoiced-consonant segments of the signal.
  • Through the endpoint detection unit 50, the speech recognition system determines the start point and end point of the speech within a segment of signal containing speech; this minimizes processing time and eliminates the noise interference of the silent segments, so that the speech recognition system has good recognition performance.
  • The system uses a correlation-based speech endpoint detection algorithm: the speech signal is correlated while background noise is not, so this difference in correlation can be used to detect speech and, in particular, to detect unvoiced sounds in noise.
  • The first stage performs a simple real-time endpoint detection on the input speech signal based on changes in its energy and zero-crossing rate, removing silence and obtaining the time-domain range of the input speech, on which basis spectral feature extraction is performed.
  • The second stage computes, from the FFT analysis of the input speech spectrum, the energy-distribution characteristics of the high-, mid-, and low-frequency bands, used to discriminate unvoiced consonants, voiced consonants, and vowels; after the vowel and voiced segments are determined, the search for the frames containing the speech endpoints is extended toward both ends.
  • The feature extraction unit 30 extracts speech feature parameters from the speech signal to be recognized, including linear prediction coefficients and their derived parameters (LPCC), parameters derived directly from the speech spectrum, mixed parameters, and Mel-frequency cepstral coefficients (MFCC).
  • The short-time speech spectrum contains the characteristics of the excitation source and the vocal tract and can therefore reflect physiological differences between speakers.
  • The short-time spectrum varies with time and, to some extent, reflects the speaker's pronunciation habits; parameters derived from the short-time speech spectrum can therefore be used effectively in speaker recognition. Parameters that have been used include the power spectrum, pitch contour, formants and their bandwidths, and speech intensity and its variation.
  • Compared with LPCC parameters, MFCC parameters have the following advantages:
  • MFCC parameters convert the linear frequency scale into the Mel frequency scale, emphasizing the low-frequency information of speech; in addition to the advantages of LPCC, they thus highlight information useful for recognition and shield against noise interference. LPCC parameters are based on a linear frequency scale and lack this property.
  • MFCC parameters rest on no prior assumptions and can be used in all situations, whereas LPCC parameters assume the processed signal is an AR signal; for consonants with strong dynamic characteristics this assumption does not strictly hold, so MFCC parameters outperform LPCC parameters in speaker recognition.
  • An FFT is required in the MFCC extraction process, from which all the frequency-domain information of the speech signal can be obtained.
  • FIG. 3 shows the speech recognition principle of the speech recognition system of an exemplary embodiment of the present invention.
  • The feature extraction unit 30 obtains the speech feature parameters by extracting Mel-frequency cepstral coefficient (MFCC) features from the encoded speech signal to be recognized.
  • The speech recognition system may further include: a speech modeling unit 60 configured to use the speech feature parameters, based on MFCCs, to build a text-independent Gaussian mixture model as the acoustic model of the speech.
  • The pattern matching unit 40 uses the Gaussian mixture model and the maximum a posteriori (MAP) algorithm to match the extracted speech feature parameters against at least one voice model, so that the decision unit 70 determines, from the matching result, the user to whom the speech signal to be recognized belongs.
  • A specific way of performing speech modeling and pattern matching with a Gaussian mixture model can be as follows:
  • The model form of every speaker is identical; a speaker's individual characteristics are uniquely determined by a set of parameters λ_i = {w_i, μ_i, Σ_i}, where w_i, μ_i, and Σ_i are respectively the mixture weights, mean vectors, and covariance matrices of the speaker's speech feature parameters. Training a speaker therefore means obtaining from the known speaker's speech a set of parameters that maximizes the probability density of the training speech.
  • Speaker identification then relies on the maximum-probability principle to select the speaker represented by the parameter set giving the recognized speech the highest probability, i.e., formula (1): λ̂ = argmax_λ P(X|λ), where P(X|θ_i) is the likelihood estimate of the feature parameters of the speech signal to be recognized with respect to the i-th speaker, and P(θ_i) and P(X) are the corresponding prior probabilities.
  • The Expectation-Maximization (EM) algorithm is often used to estimate the parameters.
  • The EM computation starts from an initial value of the parameters and estimates a new parameter set λ̂ such that the likelihood under the new model parameters satisfies P(X|λ̂) ≥ P(X|λ); the new model then serves as the starting point for the next iteration, until the model converges. At each iteration, the re-estimation formulas guarantee a monotonic increase of the model likelihood.
  • The number M of Gaussian components of the GMM and the initial parameters of the model must be determined first. If M is too small, the trained GMM cannot effectively characterize the speaker, degrading overall system performance. If M is too large, the model has many parameters, convergent parameter estimates may not be obtainable from the available training data, and the errors of the trained model parameters will be large. Moreover, too many model parameters require more storage space, and the computational complexity of training and recognition increases greatly.
  • The appropriate size of the Gaussian component count M is hard to derive theoretically and can be determined experimentally for different recognition systems.
  • the value of M can be 4, 8, 16 or the like.
  • Two methods of initializing model parameters can be used: The first method automatically segments the training data using a speaker-independent HMM model.
  • The training speech frames are divided by their features into M classes (M being the number of mixtures), corresponding to the initial M Gaussian components.
  • the mean and variance of each class are used as initialization parameters for the model.
  • the variance matrix can be either a full matrix or a diagonal matrix.
  • The speech recognition system of the present invention uses a Gaussian mixture model (GMM) with the maximum a posteriori (MAP) algorithm to match the extracted speech feature parameters against at least one voice model and determine the user to whom the speech signal to be recognized belongs.
  • The Bayesian learning method is used to revise the parameters: starting from a given initial model, the occupation probability of each feature vector in the training corpus under each Gaussian component is computed; these probabilities are used to compute the expected value of each Gaussian component, and the expectations are in turn used to maximize the parameter values of the Gaussian mixture model, yielding the updated model. The steps are repeated until the likelihood P(X|θ_i) converges. When the training corpus is sufficient, the MAP algorithm is theoretically optimal.
  • The speech acoustic model determined by the MAP training criterion is formula (3) above; when the prior P(θ_i) is taken to be independent of the speaker, the criterion reduces to maximizing the likelihood P(X|θ_i).
  • In the progressive adaptive approach, training samples are entered one by one. With X^n = {X_1, X_2, ..., X_n} the training-sample sequence, the progressive MAP criterion is θ_i^(n+1) = argmax_{θ_i} P(X_{n+1}|θ_i)·P(θ_i|X^n).
  • The purpose of speaker identification is to determine to which of N speakers the speech signal to be recognized belongs. In a closed speaker set, it is only necessary to confirm which speaker in the voice library the speech belongs to.
  • The aim is to find the speaker whose model λ_i gives the feature-vector set X to be recognized the maximum posterior probability P(λ_i|X). By Bayes' theorem and formula (3) above, P(λ_i|X) = P(X|λ_i)·P(λ_i)/P(X), with logarithmic form log P(X|λ_i) = Σ_t log p(x_t|λ_i). Because the prior probability P(λ_i) is unknown, the speech signal to be recognized is assumed to be equally likely to come from each speaker in the closed set, i.e., P(λ_i) = 1/N for 1 ≤ i ≤ N.
  • P(X) is a deterministic constant, equal for all speakers, so the maximum of the posterior probability can be obtained by maximizing P(X|λ_i); identifying which speaker in the voice library the speech belongs to can thus be expressed as î = argmax_{1≤i≤N} P(X|λ_i).
  • The speech recognition system further includes a decision unit for comparing the voice model having the highest likelihood with the speech signal to be recognized against a preset recognition threshold, determining the user to whom the signal belongs.
  • FIG. 4 shows a schematic diagram of the speech output frequency using the Mel filter.
  • The pitch heard by the human ear is not linearly proportional to the frequency of the sound, while the Mel frequency scale is more in line with the auditory characteristics of the human ear.
  • The value of the so-called Mel frequency scale corresponds roughly to a logarithmic distribution of the actual frequency.
  • The critical bandwidth varies with frequency, consistent with the growth of the Mel frequency: below 1000 Hz it is roughly linear, with a bandwidth of about 100 Hz; above 1000 Hz it increases logarithmically.
  • The speech frequency range can be divided into a series of triangular filters, i.e., a Mel filter bank.
  • The speech recognition system of the exemplary embodiment of the present invention analyzes the characteristics of speech starting from the principles of speech production, uses MFCC parameters to build the speaker's voice feature model, and implements the speaker recognition algorithm, thereby improving the reliability of speaker detection so that speaker recognition can ultimately be implemented on electronic products.
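As a hedged, high-level illustration of how the units enumerated above (storage unit 10, voice collection and pre-processing unit 20, feature extraction unit 30, pattern matching unit 40, decision unit 70) could fit together, the following Python sketch outlines an enrollment-and-identification flow. It is not the patent's implementation: every function and variable name here is hypothetical, and per-speaker diagonal-covariance GMMs over short-time spectral features are assumed, standing in for the MFCC/GMM design the text describes.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical sketch of the unit layout described above; all names are
# illustrative only and do not come from the patent.

def preprocess(raw_audio):
    """Unit 20 stand-in: amplification/gain control reduced to peak
    normalization; a real system also filters, samples, and encodes (PCM)."""
    return raw_audio / (np.max(np.abs(raw_audio)) + 1e-12)

def extract_features(signal, frame_len=256, hop=128):
    """Unit 30 stand-in: per-frame windowed log-spectra (a real system
    would compute MFCCs, sketched later in this document)."""
    n = 1 + (len(signal) - frame_len) // hop   # assumes len(signal) >= frame_len
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n)])
    spec = np.abs(np.fft.rfft(frames * np.hamming(frame_len), axis=1))
    return np.log(spec + 1e-10)

def enroll(db, user_id, raw_audio):
    """Unit 10: train and store one voice model per enrolled user."""
    feats = extract_features(preprocess(raw_audio))
    db[user_id] = GaussianMixture(4, covariance_type="diag").fit(feats)

def identify(db, raw_audio, threshold=-100.0):
    """Units 40 + 70: the highest-likelihood model wins, subject to a
    preset recognition threshold (the value here is a placeholder)."""
    feats = extract_features(preprocess(raw_audio))
    scores = {uid: m.score(feats) for uid, m in db.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```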

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech recognition system, comprising: a storage unit (10) configured to store a voice model of at least one user; a voice collection and pre-processing unit (20) configured to collect a speech signal to be recognized and to perform format conversion and encoding on the speech signal to be recognized; a feature extraction unit (30) configured to extract speech feature parameters from the encoded speech signal to be recognized; and a pattern matching unit (40) configured to match the extracted speech feature parameters against at least one voice model and determine the user to whom the speech signal to be recognized belongs. The speech recognition system analyzes the characteristics of speech starting from the principles of speech production and uses Mel-frequency cepstral coefficient (MFCC) parameters to build the speaker's voice feature model and implement the speaker recognition algorithm, improving the reliability of speaker detection so that speaker recognition can ultimately be implemented on electronic products.

Description

Speech Recognition System

Technical Field

The present invention relates to the field of speech detection technologies and, in particular, to a speech recognition system.

Background

At present, in the development of electronic products for telecommunications, the service industry, and industrial production lines, speech recognition technology is used in many products, and a batch of novel speech products has been created, such as voice notepads, voice-controlled toys, voice remote controls, and home servers, greatly reducing labor intensity, improving work efficiency, and increasingly changing people's daily lives. Speech recognition technology is therefore currently regarded as one of the most challenging application technologies with the best market prospects of this century.

Speech recognition includes speaker recognition and semantic (speech-content) recognition. The former exploits the speaker's individual characteristics in the speech signal and does not consider the meaning of the words contained in the speech, emphasizing the speaker's individuality; the latter aims to identify the semantic content of the speech signal without considering the speaker's individuality, emphasizing the commonality of speech.

However, prior-art techniques for recognizing speakers are not sufficiently reliable, so speech products employing speaker detection have not been widely adopted.

Summary

Accordingly, the technical problem to be solved by the technical solutions of the present invention is how to provide a speech recognition system capable of improving the reliability of speaker detection, so that speech products can be more widely applied.

To solve the above technical problem, according to one aspect of the present invention, a speech recognition system is provided. The speech recognition system includes:
a storage unit configured to store a voice model of at least one user;

a voice collection and pre-processing unit configured to collect a speech signal to be recognized and to perform format conversion and encoding on the speech signal to be recognized;

a feature extraction unit configured to extract speech feature parameters from the encoded speech signal to be recognized; and a pattern matching unit configured to match the extracted speech feature parameters against at least one of the voice models and determine the user to whom the speech signal to be recognized belongs. Optionally, in the above speech recognition system, after collecting the speech signal to be recognized, the voice collection and pre-processing unit is further configured to sequentially amplify, gain-control, filter, and sample the speech signal, and then to perform format conversion and encoding so that the speech signal is divided into short-time segments composed of multiple frames.

Optionally, in the above speech recognition system, the voice collection and pre-processing unit is further configured to apply a window function to the format-converted and encoded speech signal for pre-emphasis processing.
Optionally, the above speech recognition system further includes:

an endpoint detection unit configured to compute the speech start point and speech end point of the format-converted and encoded speech signal and remove the silence in the signal, obtaining the time-domain range of the speech; and to perform fast Fourier transform (FFT) analysis on the speech spectrum of the signal and, from the analysis result, identify the vowel, voiced, and unvoiced-consonant segments of the signal.

Optionally, in the above speech recognition system, the feature extraction unit obtains the speech feature parameters by extracting Mel-frequency cepstral coefficient (MFCC) features from the encoded speech signal to be recognized.

Optionally, the above speech recognition system further includes: a speech modeling unit configured to use the speech feature parameters, based on MFCCs, to build a text-independent Gaussian mixture model as the acoustic model of the speech.

Optionally, in the above speech recognition system, the pattern matching unit uses the Gaussian mixture model and the maximum a posteriori (MAP) algorithm to match the extracted speech feature parameters against at least one of the voice models, computing the likelihood of the speech signal to be recognized with respect to each of the voice models.
Optionally, in the above speech recognition system, the MAP algorithm matches the extracted speech feature parameters against at least one of the voice models and determines the user to whom the speech signal to be recognized belongs using the following formula:

θ̂_i = argmax_{θ_i} P(θ_i | X) = argmax_{θ_i} P(X | θ_i) · P(θ_i) / P(X)

where θ_i denotes the model parameters of the i-th user's voice stored in the storage unit and X denotes the feature parameters of the speech signal to be recognized; P(θ_i) and P(X) are the prior probabilities of θ_i and X, respectively; and P(X|θ_i) is the likelihood estimate of the feature parameters of the speech signal to be recognized with respect to the i-th speaker.

Optionally, in the above speech recognition system, with the Gaussian mixture model, the feature parameters of the speech signal to be recognized are uniquely determined by a set of parameters {w_i, μ_i, Σ_i}, where w_i, μ_i, and Σ_i are respectively the mixture weights, mean vectors, and covariance matrices of the speaker's speech feature parameters.

Optionally, the above speech recognition system further includes a decision unit configured to compare the voice model having the highest likelihood with the speech signal to be recognized against a preset recognition threshold and determine the user to whom the speech signal to be recognized belongs.
The technical solutions of the exemplary embodiments of the present invention have at least the following beneficial effects:

The characteristics of speech are analyzed starting from the principles of speech production, and MFCC parameters are used to build the speaker's voice feature model and implement the speaker recognition algorithm, achieving the aim of improving the reliability of speaker detection so that speaker recognition can ultimately be implemented on electronic products.

Brief Description of the Drawings

FIG. 1 is a schematic structural diagram of a speech recognition system according to an exemplary embodiment of the present invention;

FIG. 2 is a schematic diagram of the processing flow, in the voice collection and pre-processing stage, of the speech recognition system according to an exemplary embodiment of the present invention;

FIG. 3 is a schematic diagram of the speech recognition principle of the speech recognition system according to an exemplary embodiment of the present invention;

FIG. 4 is a schematic diagram of the speech output frequency using a Mel filter.

Detailed Description

To make the technical problems to be solved, the technical solutions, and the advantages of the embodiments of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a schematic structural diagram of a speech recognition system according to an exemplary embodiment of the present invention. As shown in FIG. 1, the speech recognition system includes:

a storage unit 10 configured to store a voice model of at least one user;

a voice collection and pre-processing unit 20 configured to collect a speech signal to be recognized and to perform format conversion and encoding on the speech signal to be recognized;

a feature extraction unit 30 configured to extract speech feature parameters from the encoded speech signal to be recognized; and a pattern matching unit 40 configured to match the extracted speech feature parameters against at least one voice model and determine the user to whom the speech signal to be recognized belongs. FIG. 2 is a schematic diagram of the processing flow of this speech recognition system in the voice collection and pre-processing stage. As shown in FIG. 2, after collecting the speech signal to be recognized, the voice collection and pre-processing unit 20 sequentially amplifies, gain-controls, filters, and samples the signal, then performs format conversion and encoding so that the speech signal is divided into short-time segments composed of multiple frames. Optionally, a window function may also be applied to the format-converted and encoded speech signal for pre-emphasis processing.
In speaker recognition technology, voice collection is essentially the digitization of the speech signal: through amplification and gain control, anti-aliasing filtering, sampling, A/D (analog/digital) conversion, and encoding (generally pulse code modulation, PCM), the speech signal to be recognized is filtered and amplified, and the filtered and amplified analog speech signal is converted into a digital speech signal.

In the above process, the filtering suppresses all components of the input signal whose frequency exceeds fs/2 (fs being the sampling frequency) to prevent aliasing interference, and also suppresses 50 Hz mains-frequency interference.

In addition, as shown in FIG. 2, the voice collection and pre-processing unit 20 can also perform the inverse of digitization on the encoded speech signal to reconstruct a speech waveform from the digitized speech, i.e., perform a D/A (digital/analog) conversion. Smoothing filtering is also required after the D/A conversion to smooth the higher harmonics of the reconstructed speech waveform and remove higher-harmonic distortion.

Through the processing described above, the speech signal has been divided into frame-by-frame short-time segments; each short-time speech frame is then treated as a stationary random signal, and digital signal processing techniques are used to extract the speech feature parameters. During processing, data are taken from the data area frame by frame; after one frame is processed the next is taken, and so on, finally yielding the time series of speech feature parameters composed of the per-frame parameters.

In addition, the voice collection and pre-processing unit 20 can also apply a window function to the format-converted and encoded speech signal for pre-emphasis processing.

Pre-processing generally includes pre-emphasis, windowing, and framing. Because the average power spectrum of the speech signal is affected by glottal excitation and lip-nose radiation, the high-frequency end falls off at about 6 dB/octave (20 dB/decade) above roughly 800 Hz: generally, the higher the frequency, the smaller the amplitude, and when the power of the speech signal halves, the amplitude of its power spectrum drops by half an order of magnitude. Therefore, before the speech signal is analyzed, it is generally boosted (pre-emphasized).
The window functions commonly used in speech signal processing are the rectangular window and the Hamming window, used to window the sampled speech signal and segment it into frame-by-frame short-time speech sequences. With N the frame length, their expressions are, respectively:

w(n) = 1, 0 ≤ n ≤ N−1 (rectangular window)
w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1 (Hamming window)

with w(n) = 0 elsewhere.
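The pre-emphasis, framing, and windowing steps just described can be sketched minimally as follows, assuming the common first-order pre-emphasis filter y[n] = x[n] − a·x[n−1] with a = 0.95; the patent text does not fix the filter coefficient, so this value is an assumption:

```python
import numpy as np

def preemphasis(x, a=0.95):
    # First-order high-frequency boost compensating the roughly
    # 6 dB/octave roll-off discussed above; a = 0.95 is an assumed value.
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_and_window(x, frame_len=256, hop=128):
    # Split the signal into overlapping short-time frames and apply the
    # Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1.
    n_frames = 1 + (len(x) - frame_len) // hop   # assumes len(x) >= frame_len
    w = np.hamming(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * w
```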
In addition, referring to FIG. 1, the speech recognition system may further include: an endpoint detection unit 50 configured to compute the speech start point and speech end point of the format-converted and encoded speech signal and remove the silence in it, obtaining the time-domain range of the speech in the signal; and to perform fast Fourier transform (FFT) analysis on the speech spectrum of the signal and, from the analysis result, identify the vowel, voiced, and unvoiced-consonant segments of the signal.

Through the endpoint detection unit 50, the speech recognition system determines the start point and end point of the speech from a segment of signal containing speech; this minimizes the processing time and eliminates the noise interference of the silent segments, giving the speech recognition system good recognition performance.

The speech recognition system of the exemplary embodiment of the present invention uses a correlation-based speech endpoint detection algorithm: the speech signal is correlated while background noise is not, so this difference in correlation can be used to detect speech and, in particular, to detect unvoiced sounds in noise. The first stage performs a simple real-time endpoint detection on the input speech signal based on changes in its energy and zero-crossing rate, removing silence and obtaining the time-domain range of the input speech, on which basis spectral feature extraction is carried out. The second stage computes, from the FFT analysis of the input speech spectrum, the energy-distribution characteristics of the high-, mid-, and low-frequency bands, used to discriminate unvoiced consonants, voiced consonants, and vowels; after the vowel and voiced segments are determined, the search for the frames containing the speech endpoints is extended toward both ends.
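A hedged sketch of the first stage (short-time energy plus zero-crossing rate) follows; the specific thresholds are illustrative assumptions, since the text gives no numeric values:

```python
import numpy as np

def short_time_energy(frames):
    # Per-frame energy of the windowed short-time frames.
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    # Fraction of adjacent-sample sign changes per frame.
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def detect_speech_frames(frames, e_ratio=0.1, z_thresh=0.25):
    """First-stage endpoint detection: a frame is kept as speech if its
    energy is high relative to the peak, or if a moderate-energy frame
    has a high zero-crossing rate (a crude cue for unvoiced consonants).
    All thresholds are illustrative, not taken from the patent."""
    e = short_time_energy(frames)
    z = zero_crossing_rate(frames)
    loud = e > e_ratio * e.max()
    unvoiced = (e > 0.01 * e.max()) & (z > z_thresh)
    return loud | unvoiced
```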
The feature extraction unit 30 extracts speech feature parameters from the speech signal to be recognized, including linear prediction coefficients and their derived parameters (LPCC), parameters derived directly from the speech spectrum, mixed parameters, and Mel-frequency cepstral coefficients (MFCC). Regarding linear prediction coefficients and their derived parameters:

Among the parameters obtained by orthogonal transformation of the linear prediction coefficients, the higher-order ones have small variance, indicating that they are essentially weakly correlated with the content of the utterance and instead reflect speaker information. Moreover, because these parameters are averaged over the whole utterance, no temporal normalization is required, so they can be used for text-independent speaker recognition. Regarding parameters derived directly from the speech spectrum:

The short-time speech spectrum contains the characteristics of the excitation source and the vocal tract and can therefore reflect physiological differences between speakers. As the short-time spectrum varies with time, it also reflects, to some extent, the speaker's pronunciation habits; parameters derived from the short-time speech spectrum can therefore be used effectively in speaker recognition. Parameters that have been used include the power spectrum, pitch contour, formants and their bandwidths, and speech intensity and its variation. Regarding mixed parameters:

To raise the system recognition rate (partly, perhaps, because it is not well understood which parameters are the key ones), quite a few systems use vectors composed of mixed parameters, for example combining "dynamic" parameters (log-area ratios and the variation of fundamental frequency over time) with "statistical" components (derived from the long-time average spectrum), combining the inverse-filter spectrum with a band-pass filter spectrum, or combining linear prediction parameters with the pitch contour. If the parameters composing the vector are weakly correlated with one another, the effect is good, since these parameters then reflect different features of the speech signal. Regarding other robust parameters:

These include Mel-frequency cepstral coefficients, as well as de-noised cepstral coefficients obtained through noise spectral subtraction or channel spectral subtraction.
Compared with LPCC parameters, MFCC parameters have the following advantages:

Speech information is mostly concentrated in the low-frequency band, while the high-frequency band is easily disturbed by environmental noise. MFCC parameters convert the linear frequency scale into the Mel frequency scale and emphasize the low-frequency information of speech; in addition to the advantages of LPCC, they thus highlight information useful for recognition and shield against noise interference. LPCC parameters are based on a linear frequency scale and lack this property.

MFCC parameters rest on no prior assumptions and can be used in all situations, whereas LPCC parameters assume the processed signal is an AR signal; for consonants with strong dynamic characteristics this assumption does not strictly hold, so MFCC parameters outperform LPCC parameters in speaker recognition.

An FFT is required in the MFCC extraction process, from which all the frequency-domain information of the speech signal can be obtained.
FIG. 3 shows the speech recognition principle of the speech recognition system of an exemplary embodiment of the present invention. As shown in FIG. 3, the feature extraction unit 30 obtains the speech feature parameters by extracting Mel-frequency cepstral coefficient (MFCC) features from the encoded speech signal to be recognized.

In addition, the speech recognition system may further include: a speech modeling unit 60 configured to use the speech feature parameters, based on MFCCs, to build a text-independent Gaussian mixture model as the acoustic model of the speech.

The pattern matching unit 40 uses the Gaussian mixture model and the maximum a posteriori (MAP) algorithm to match the extracted speech feature parameters against at least one voice model, so that the decision unit 70 determines, from the matching result, the user to whom the speech signal to be recognized belongs. The recognition result is thus obtained by comparing the extracted speech feature parameters against the voice models stored in the storage unit 10.
A specific way of performing speech modeling and pattern matching with a Gaussian mixture model can be as follows: in a speaker set modeled by Gaussian mixture models, the model form of every speaker is identical; a speaker's individual characteristics are uniquely determined by a set of parameters λ_i = {w_i, μ_i, Σ_i}, where w_i, μ_i, and Σ_i are respectively the mixture weights, mean vectors, and covariance matrices of the speaker's speech feature parameters. Training a speaker therefore means obtaining from the known speaker's speech a set of parameters λ that maximizes the probability density of the training speech, while speaker identification relies on the maximum-probability principle to select the speaker represented by the parameter set giving the recognized speech the highest probability, i.e., formula (1):

λ̂ = argmax_λ P(X | λ)    (1)

where P(X|λ) denotes the likelihood of the training sequence of length T (T feature parameters) X = {x_1, x_2, ..., x_T} with respect to the Gaussian mixture model (GMM). Specifically:

P(X | λ) = ∏_{t=1}^{T} p(x_t | λ),  with p(x_t | λ) = Σ_{i=1}^{M} w_i · b_i(x_t)    (2)

where b_i is the density of the i-th Gaussian component and M is the number of components.
The MAP algorithm proceeds as follows:

In a speaker recognition system, let X be the training samples and θ_i the model parameters of the i-th speaker. Then, according to the maximum a posteriori principle and formula (1), the speech acoustic model determined by the MAP training criterion is given by formula (3):

θ̂_i = argmax_{θ_i} P(θ_i | X) = argmax_{θ_i} P(X | θ_i) · P(θ_i) / P(X)    (3)

In formula (3), P(θ_i) and P(X) are the prior probabilities of θ_i and X, respectively, and P(X|θ_i) is the likelihood estimate of the feature parameters of the speech signal to be recognized with respect to the i-th speaker.
For the GMM likelihood in formula (2) above: since expression (2) is a nonlinear function of the parameters λ, it is difficult to maximize directly. Therefore the Expectation-Maximization (EM) algorithm is commonly used to estimate the parameters λ. The EM computation starts from an initial value of the parameters λ and estimates a new parameter set λ̂ such that the likelihood under the new model parameters satisfies P(X|λ̂) ≥ P(X|λ); the new model then serves as the initial model for the next iteration, and the procedure repeats until the model converges. At each iteration, the following re-estimation formulas guarantee a monotonic increase of the model likelihood.
(1) Re-estimation of the mixture weights:

ŵ_i = (1/T) · Σ_{t=1}^{T} P(i | x_t, λ)

(2) Re-estimation of the means:

μ̂_i = Σ_{t=1}^{T} P(i | x_t, λ) · x_t / Σ_{t=1}^{T} P(i | x_t, λ)

(3) Re-estimation of the variances:

σ̂_i² = Σ_{t=1}^{T} P(i | x_t, λ) · x_t² / Σ_{t=1}^{T} P(i | x_t, λ) − μ̂_i²

where the posterior probability of component i is:

P(i | x_t, λ) = w_i · b_i(x_t) / Σ_{k=1}^{M} w_k · b_k(x_t)

When training a GMM with the EM algorithm, the number M of Gaussian components of the GMM and the initial parameters of the model must first be determined. If M is too small, the trained GMM cannot effectively characterize the speaker's features, degrading overall system performance. If M is too large, the model has many parameters, convergent parameter estimates may not be obtainable from the available training data, and the errors of the trained model parameters will be large. Moreover, too many model parameters require more storage space, and the computational complexity of training and recognition increases greatly. The appropriate size of the Gaussian component count M is difficult to derive theoretically and can be determined experimentally for a given recognition system.
In general, M can take values such as 4, 8, or 16. Two methods can be used to initialize the model parameters. The first method uses a speaker-independent HMM model to segment the training data automatically: the training speech frames are divided by their features into M classes (M being the number of mixtures), corresponding to the initial M Gaussian components, and the mean and variance of each class serve as the initialization parameters of the model. Although experiments show that the EM algorithm is not sensitive to the choice of initial parameters, the first method clearly trains better than the second. Alternatively, clustering can first be used to assign the feature vectors to as many classes as there are mixtures; the variance and mean of each class are then computed as the initial covariance matrices and mean vectors, and the weights are the percentage of all feature vectors contained in each class. In the resulting model, the covariance matrix can be either a full matrix or a diagonal matrix.
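As a minimal training sketch consistent with the choices discussed above (M = 16 diagonal-covariance components, clustering-based initialization), scikit-learn's GaussianMixture can stand in for the EM training; this is one reasonable realization rather than the patent's exact procedure:

```python
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(mfcc_frames, n_components=16):
    """mfcc_frames: (T, D) array of per-frame MFCC vectors of one speaker.
    Diagonal covariances and k-means initialization (the clustering-based
    initialization mentioned above), then EM until convergence."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          init_params="kmeans",
                          max_iter=100)
    gmm.fit(mfcc_frames)
    return gmm
```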
The speech recognition system of the present invention uses a Gaussian mixture model (GMM) with the maximum a posteriori (MAP) algorithm to match the extracted speech feature parameters against at least one voice model and determine the user to whom the speech signal to be recognized belongs.

Using the MAP algorithm means revising the parameters by the Bayesian learning method: starting from a given initial model θ_i, the occupation probability of each feature vector in the training corpus under each Gaussian component is computed; these probabilities are used to compute the expected value of each Gaussian component, and the expectations are in turn used to maximize the parameter values of the Gaussian mixture model, yielding θ̂_i. The above steps are repeated until the likelihood P(X|θ_i) converges. When the training corpus is sufficiently large, the MAP algorithm is theoretically optimal.

With X as the training samples and θ_i the model parameters of the i-th speaker, the speech acoustic model determined by the MAP training criterion according to the maximum a posteriori principle and formula (1) is formula (3) above. Considering the case where the prior P(θ_i) is independent of the model (with W the number of entries), the criterion reduces to the maximum-likelihood estimate:

θ̂_i = argmax_{θ_i} P(X | θ_i)
In the progressive adaptive approach, the training samples are entered one by one. Let X^n = {X_1, X_2, ..., X_n} be the training-sample sequence; the progressive MAP criterion is then:

θ_i^(n+1) = argmax_{θ_i} P(X_{n+1} | θ_i) · P(θ_i | X^n)

where θ_i^(1) is the model-parameter estimate from the first training sample. Based on the above computation process, a more simplified illustration is given below, following the adaptation sketch.
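One widely used concrete form of this Bayesian updating is mean-only MAP adaptation with a relevance factor, as in GMM-UBM speaker systems; the sketch below is that variant and is an assumption, not the patent's own formula set:

```python
import numpy as np

def map_adapt_means(gmm, X, relevance=16.0):
    """Mean-only MAP adaptation of a fitted sklearn-style GaussianMixture.
    X: (T, D) adaptation frames from the target speaker.
    New means: mu_hat_i = a_i * E_i(x) + (1 - a_i) * mu_i, with soft
    counts n_i = sum_t P(i | x_t) and a_i = n_i / (n_i + relevance)."""
    post = gmm.predict_proba(X)                       # (T, M): P(i | x_t)
    n = post.sum(axis=0)                              # soft counts n_i
    Ex = post.T @ X / np.maximum(n[:, None], 1e-10)   # E_i(x)
    alpha = (n / (n + relevance))[:, None]
    return alpha * Ex + (1.0 - alpha) * gmm.means_
```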
In the speech recognition system of the exemplary embodiment of the present invention, the purpose of speaker identification is to determine to which of N speakers the speech signal to be recognized belongs. In a closed speaker set, it is only necessary to confirm which speaker in the voice library the speech belongs to. In the identification task, the aim is to find the speaker whose corresponding model λ_i gives the feature-vector set X to be recognized the maximum posterior probability P(λ_i | X). According to Bayes' theorem and formula (3) above, the maximum posterior probability can be expressed as:

P(λ_i | X) = P(X | λ_i) · P(λ_i) / P(X)

Here, referring to formula (2) above:

P(X | λ_i) = ∏_{t=1}^{T} p(x_t | λ_i)
Its logarithmic form is log P(X | λ_i) = Σ_{t=1}^{T} log p(x_t | λ_i). Because the prior probability P(λ_i) is unknown, the speech signal to be recognized is assumed to be equally likely to come from each speaker in the closed set, that is:

P(λ_i) = 1/N,  1 ≤ i ≤ N

For a given observation vector sequence X, P(X) is a deterministic constant, equal for all speakers. The maximum of the posterior probability can therefore be obtained by maximizing P(X | λ_i). Identifying which speaker in the voice library the speech belongs to can thus be expressed as:

î = argmax_{1≤i≤N} P(X | λ_i)

which corresponds to formula (3); the speaker î is the identified speaker.
Further, the above procedure only identifies the closest user in the model library. After the likelihood of the speaker to be recognized against all speakers in the voice library has been computed during matching, the decision unit must additionally apply a recognition threshold to the voice model of the user with the highest likelihood, determining the user to whom the speech signal to be recognized belongs and thereby authenticating the speaker's identity.

The above speech recognition system thus further includes a decision unit configured to compare the voice model having the highest likelihood with the speech signal to be recognized against a preset recognition threshold and determine the user to whom the speech signal to be recognized belongs.
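A hedged sketch of this closed-set decision (maximum log-likelihood over the enrolled models, then the decision unit's threshold check; the threshold value and score scale are illustrative assumptions):

```python
import numpy as np

def log_likelihood(gmm, X):
    """log P(X | lambda_i) = sum_t log p(x_t | lambda_i): per-frame
    log-densities from the mixture, summed over the utterance."""
    return np.sum(gmm.score_samples(X))     # score_samples: per-frame logs

def closed_set_decision(gmms, X, threshold):
    """i_hat = argmax_i log P(X | lambda_i); the decision unit accepts
    the best speaker only if the score clears the preset threshold."""
    scores = {uid: log_likelihood(g, X) for uid, g in gmms.items()}
    best = max(scores, key=scores.get)
    accepted = scores[best] >= threshold
    return (best if accepted else None), scores[best]
```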
FIG. 4 is a schematic diagram of the speech output frequency using a Mel filter. The pitch perceived by the human ear is not linearly proportional to the frequency of the sound; the Mel frequency scale better matches the auditory characteristics of the human ear. The value of the so-called Mel frequency scale corresponds roughly to a logarithmic distribution of the actual frequency; the specific relation between Mel frequency and actual frequency can be expressed as Mel(f) = 2595·lg(1 + f/700), where the actual frequency f is in Hz. The critical bandwidth varies with frequency, consistent with the growth of the Mel frequency: below 1000 Hz it is roughly linear, with a bandwidth of about 100 Hz; above 1000 Hz it grows logarithmically. Analogously to the division into critical bands, the speech frequency range can be divided into a series of triangular filters, i.e., a Mel filter bank. The output of the i-th triangular filter, F_i (i = 1, 2, ..., P), is the weighted sum of the spectral energies falling within its passband.
The filter outputs are transformed to the cepstral domain with the discrete cosine transform (DCT):

C_k = Σ_{i=1}^{P} log(F_i) · cos[k · (i − 1/2) · π / P],  k = 1, 2, ..., P

where P is the order of the MFCC parameters (P = 12 is chosen in the actual software algorithm), and the coefficients C_k are the desired MFCC parameters.
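A numpy sketch of the Mel filter bank and DCT just described, directly implementing Mel(f) = 2595·lg(1 + f/700), P triangular filters, and C_k = Σ_i log(F_i)·cos[k(i − 1/2)π/P] with P = 12; the FFT size and sample rate are illustrative assumptions:

```python
import numpy as np

def mel(f):
    # Mel(f) = 2595 * lg(1 + f / 700), f in Hz.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(P=12, n_fft=512, sr=8000):
    """P triangular filters with centers spaced uniformly on the Mel scale."""
    edges = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), P + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((P, n_fft // 2 + 1))
    for i in range(P):
        lo, ce, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)   # rising edge
        fb[i, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)   # falling edge
    return fb

def mfcc_from_frame(frame, fb, P=12):
    """F_i: triangular-filter outputs; C_k = sum_i log(F_i) cos[k(i-1/2)pi/P]."""
    n_fft = 2 * (fb.shape[1] - 1)
    spec = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2   # power spectrum
    F = fb @ spec + 1e-10                             # filter outputs F_1..F_P
    i = np.arange(1, P + 1)                           # filter index
    k = np.arange(1, P + 1)[:, None]                  # cepstral index
    return (np.log(F) * np.cos(k * (i - 0.5) * np.pi / P)).sum(axis=1)
```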
The speech recognition system of the exemplary embodiments of the present invention analyzes the characteristics of speech starting from the principles of speech production and uses MFCC parameters to build the speaker's voice feature model and implement the speaker's feature recognition algorithm, achieving the aim of improving the reliability of speaker detection so that speaker recognition can ultimately be implemented on electronic products.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A speech recognition system, comprising:

a storage unit configured to store a voice model of at least one user;

a voice collection and pre-processing unit configured to collect a speech signal to be recognized and to perform format conversion and encoding on the speech signal to be recognized;

a feature extraction unit configured to extract speech feature parameters from the encoded speech signal to be recognized; and a pattern matching unit configured to match the extracted speech feature parameters against at least one of the voice models and determine the user to whom the speech signal to be recognized belongs.
2. The speech recognition system of claim 1, wherein, after collecting the speech signal to be recognized, the voice collection and pre-processing unit is further configured to sequentially amplify, gain-control, filter, and sample the speech signal to be recognized, and then to perform format conversion and encoding on it, so that the speech signal to be recognized is divided into short-time segments composed of multiple frames.

3. The speech recognition system of claim 2, wherein the voice collection and pre-processing unit is further configured to apply a window function to the format-converted and encoded speech signal for pre-emphasis processing.

4. The speech recognition system of claim 1, further comprising: an endpoint detection unit configured to compute the speech start point and speech end point of the format-converted and encoded speech signal and remove the silence in the signal, obtaining the time-domain range of the speech; and to perform fast Fourier transform (FFT) analysis on the speech spectrum of the signal and, from the analysis result, identify the vowel, voiced, and unvoiced-consonant segments of the signal.

5. The speech recognition system of claim 1, wherein the feature extraction unit obtains the speech feature parameters by extracting Mel-frequency cepstral coefficient (MFCC) features from the encoded speech signal to be recognized.

6. The speech recognition system of claim 5, further comprising: a speech modeling unit configured to use the speech feature parameters, based on MFCCs, to build a text-independent Gaussian mixture model as the acoustic model of the speech.

7. The speech recognition system of claim 1, wherein the pattern matching unit uses the Gaussian mixture model and the maximum a posteriori (MAP) algorithm to match the extracted speech feature parameters against at least one of the voice models and to compute the likelihood of the speech signal to be recognized with respect to each of the voice models.
8. The speech recognition system of claim 7, wherein the MAP algorithm matches the extracted speech feature parameters against at least one of the voice models and determines the user to whom the speech signal to be recognized belongs using the following formula:

θ̂_i = argmax_{θ_i} P(θ_i | X) = argmax_{θ_i} P(X | θ_i) · P(θ_i) / P(X)

where θ_i denotes the model parameters of the i-th user's voice stored in the storage unit and X denotes the feature parameters of the speech signal to be recognized; P(θ_i) and P(X) are the prior probabilities of θ_i and X, respectively; and P(X|θ_i) is the likelihood estimate of the feature parameters of the speech signal to be recognized with respect to the i-th speaker.

9. The speech recognition system of claim 8, wherein, with the Gaussian mixture model, the feature parameters of the speech signal to be recognized are uniquely determined by a set of parameters {w_i, μ_i, Σ_i}, where w_i, μ_i, and Σ_i are respectively the mixture weights, mean vectors, and covariance matrices of the speaker's speech feature parameters.

10. The speech recognition system of claim 7, wherein the speech recognition system further comprises a decision unit configured to compare the voice model having the highest likelihood with the speech signal to be recognized against a preset recognition threshold and determine the user to whom the speech signal to be recognized belongs.
PCT/CN2013/074831 2013-03-29 2013-04-26 Speech recognition system WO2014153800A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/366,482 US20150340027A1 (en) 2013-03-29 2013-04-26 Voice recognition system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310109044.3 2013-03-29
CN201310109044.3A CN103236260B (zh) 2013-03-29 2013-03-29 语音识别系统

Publications (1)

Publication Number Publication Date
WO2014153800A1 true WO2014153800A1 (zh) 2014-10-02

Family

ID=48884296

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/074831 WO2014153800A1 (zh) 2013-03-29 2013-04-26 语音识别系统

Country Status (3)

Country Link
US (1) US20150340027A1 (zh)
CN (1) CN103236260B (zh)
WO (1) WO2014153800A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754593B2 (en) 2015-11-04 2017-09-05 International Business Machines Corporation Sound envelope deconstruction to identify words and speakers in continuous speech
CN111027453A (zh) * 2019-12-06 2020-04-17 西北工业大学 基于高斯混合模型的非合作水中目标自动识别方法
CN113053398A (zh) * 2021-03-11 2021-06-29 东风汽车集团股份有限公司 基于mfcc和bp神经网络的说话人识别系统及方法
CN113223511A (zh) * 2020-01-21 2021-08-06 珠海市煊扬科技有限公司 用于语音识别的音频处理装置

Families Citing this family (127)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US20120311585A1 (en) 2011-06-03 2012-12-06 Apple Inc. Organizing task items that represent tasks to perform
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
DE212014000045U1 (de) 2013-02-07 2015-09-24 Apple Inc. Sprach-Trigger für einen digitalen Assistenten
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
WO2015026960A1 (en) * 2013-08-21 2015-02-26 Sanger Terence D Systems, methods, and uses of b a yes -optimal nonlinear filtering algorithm
JP6188831B2 (ja) * 2014-02-06 2017-08-30 三菱電機株式会社 音声検索装置および音声検索方法
CN103940190B (zh) * 2014-04-03 2016-08-24 合肥美的电冰箱有限公司 具有食品管理系统的冰箱及食品管理方法
CN103974143B (zh) * 2014-05-20 2017-11-07 北京速能数码网络技术有限公司 一种生成媒体数据的方法和设备
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
EP3149728B1 (en) 2014-05-30 2019-01-16 Apple Inc. Multi-command single utterance input method
US10186282B2 (en) * 2014-06-19 2019-01-22 Apple Inc. Robust end-pointing of speech signals using speaker recognition
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
CN104183245A (zh) * 2014-09-04 2014-12-03 福建星网视易信息系统有限公司 一种演唱者音色相似的歌星推荐方法与装置
KR101619262B1 (ko) * 2014-11-14 2016-05-18 현대자동차 주식회사 음성인식 장치 및 방법
CN105869641A (zh) * 2015-01-22 2016-08-17 佳能株式会社 语音识别装置及语音识别方法
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
CN106161755A (zh) * 2015-04-20 2016-11-23 钰太芯微电子科技(上海)有限公司 一种关键词语音唤醒系统及唤醒方法及移动终端
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
CN104900235B (zh) * 2015-05-25 2019-05-28 重庆大学 基于基音周期混合特征参数的声纹识别方法
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
CN104835496B (zh) * 2015-05-30 2018-08-03 宁波摩米创新工场电子科技有限公司 一种基于线性驱动的高清语音识别系统
CN104851425B (zh) * 2015-05-30 2018-11-30 宁波摩米创新工场电子科技有限公司 一种基于对称式三极管放大电路的高清语音识别系统
CN104900234B (zh) * 2015-05-30 2018-09-21 宁波摩米创新工场电子科技有限公司 一种高清语音识别系统
CN104835495B (zh) * 2015-05-30 2018-05-08 宁波摩米创新工场电子科技有限公司 一种基于低通滤波的高清语音识别系统
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
CN106328152B (zh) * 2015-06-30 2020-01-31 芋头科技(杭州)有限公司 一种室内噪声污染自动识别监测系统
CN105096551A (zh) * 2015-07-29 2015-11-25 努比亚技术有限公司 一种实现虚拟遥控器的装置和方法
CN105245497B (zh) * 2015-08-31 2019-01-04 刘申宁 一种身份认证方法及装置
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
CN105709291B (zh) * 2016-01-07 2018-12-04 王贵霞 一种智能血液透析过滤装置
CN105931635B (zh) * 2016-03-31 2019-09-17 北京奇艺世纪科技有限公司 一种音频分割方法及装置
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
CN105913840A (zh) * 2016-06-20 2016-08-31 西可通信技术设备(河源)有限公司 一种语音识别装置及移动终端
CN106328168B (zh) * 2016-08-30 2019-10-18 成都普创通信技术股份有限公司 一种语音信号相似度检测方法
CN106448654A (zh) * 2016-09-30 2017-02-22 安徽省云逸智能科技有限公司 一种机器人语音识别系统及其工作方法
CN106448655A (zh) * 2016-10-18 2017-02-22 江西博瑞彤芸科技有限公司 语音识别方法
CN106557164A (zh) * 2016-11-18 2017-04-05 北京光年无限科技有限公司 应用于智能机器人的多模态输出方法和装置
CN106782550A (zh) * 2016-11-28 2017-05-31 黑龙江八农垦大学 一种基于dsp芯片的自动语音识别系统
CN106653047A (zh) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 一种音频数据的自动增益控制方法与装置
CN106653043B (zh) * 2016-12-26 2019-09-27 云知声(上海)智能科技有限公司 降低语音失真的自适应波束形成方法
CN106782595B (zh) * 2016-12-26 2020-06-09 云知声(上海)智能科技有限公司 一种降低语音泄露的鲁棒阻塞矩阵方法
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
KR20180082033A (ko) * 2017-01-09 2018-07-18 삼성전자주식회사 음성을 인식하는 전자 장치
US10264410B2 (en) * 2017-01-10 2019-04-16 Sang-Rae PARK Wearable wireless communication device and communication group setting method using the same
CN106782521A (zh) * 2017-03-22 2017-05-31 海南职业技术学院 一种语音识别系统
DK201770383A1 (en) 2017-05-09 2018-12-14 Apple Inc. USER INTERFACE FOR CORRECTING RECOGNITION ERRORS
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK201770427A1 (en) 2017-05-12 2018-12-20 Apple Inc. LOW-LATENCY INTELLIGENT AUTOMATED ASSISTANT
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
EP3433854B1 (en) * 2017-06-13 2020-05-20 Beijing Didi Infinity Technology and Development Co., Ltd. Method and system for speaker verification
CN109146450A (zh) * 2017-06-16 2019-01-04 阿里巴巴集团控股有限公司 支付方法、客户端、电子设备、存储介质和服务器
CN107452403B (zh) * 2017-09-12 2020-07-07 清华大学 一种说话人标记方法
CN107564522A (zh) * 2017-09-18 2018-01-09 郑州云海信息技术有限公司 一种智能控制方法及装置
GB201719734D0 (en) * 2017-10-30 2018-01-10 Cirrus Logic Int Semiconductor Ltd Speaker identification
CN108022584A (zh) * 2017-11-29 2018-05-11 芜湖星途机器人科技有限公司 办公室语音识别优化方法
CN107808659A (zh) * 2017-12-02 2018-03-16 宫文峰 智能语音信号模式识别系统装置
CN108172229A (zh) * 2017-12-12 2018-06-15 天津津航计算技术研究所 一种基于语音识别的身份验证及可靠操控的方法
CN108022593A (zh) * 2018-01-16 2018-05-11 成都福兰特电子技术股份有限公司 一种高灵敏度语音识别系统及其控制方法
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
CN108538310B (zh) * 2018-03-28 2021-06-25 天津大学 一种基于长时信号功率谱变化的语音端点检测方法
CN108600898B (zh) * 2018-03-28 2020-03-31 深圳市冠旭电子股份有限公司 一种配置无线音箱的方法、无线音箱及终端设备
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
CN108922541B (zh) * 2018-05-25 2023-06-02 南京邮电大学 基于dtw和gmm模型的多维特征参数声纹识别方法
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT
DK179822B1 (da) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10460749B1 (en) * 2018-06-28 2019-10-29 Nuvoton Technology Corporation Voice activity detection using vocal tract area information
CN109036437A (zh) * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 口音识别方法、装置、计算机装置及计算机可读存储介质
CN109147796B (zh) * 2018-09-06 2024-02-09 平安科技(深圳)有限公司 语音识别方法、装置、计算机设备及计算机可读存储介质
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
CN109378002B (zh) * 2018-10-11 2024-05-07 平安科技(深圳)有限公司 声纹验证的方法、装置、计算机设备和存储介质
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
CN109545192B (zh) * 2018-12-18 2022-03-08 百度在线网络技术(北京)有限公司 用于生成模型的方法和装置
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
CN109920406B (zh) * 2019-03-28 2021-12-03 国家计算机网络与信息安全管理中心 一种基于可变起始位置的动态语音识别方法及系统
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
DK201970510A1 (en) 2019-05-31 2021-02-11 Apple Inc Voice identification in digital assistant systems
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. USER ACTIVITY SHORTCUT SUGGESTIONS
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11468890B2 (en) 2019-06-01 2022-10-11 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
WO2021056255A1 (en) 2019-09-25 2021-04-01 Apple Inc. Text detection using global geometry estimators
CN113112993B (zh) * 2020-01-10 2024-04-02 阿里巴巴集团控股有限公司 一种音频信息处理方法、装置、电子设备以及存储介质
CN111277341B (zh) * 2020-01-21 2021-02-19 北京清华亚迅电子信息研究所 无线电信号分析方法及装置
CN111429890B (zh) * 2020-03-10 2023-02-10 厦门快商通科技股份有限公司 一种微弱语音增强方法、语音识别方法及计算机可读存储介质
CN111581348A (zh) * 2020-04-28 2020-08-25 辽宁工程技术大学 一种基于知识图谱的查询分析系统
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11038934B1 (en) 2020-05-11 2021-06-15 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
CN111845751B (zh) * 2020-07-28 2021-02-09 盐城工业职业技术学院 一种可切换控制多个农用拖拉机的控制终端
CN112037792B (zh) * 2020-08-20 2022-06-17 北京字节跳动网络技术有限公司 一种语音识别方法、装置、电子设备及存储介质
CN112035696B (zh) * 2020-09-09 2024-05-28 兰州理工大学 一种基于音频指纹的语音检索方法及系统
CN112331231B (zh) * 2020-11-24 2024-04-19 南京农业大学 基于音频技术的肉鸡采食量检测系统
CN112242138A (zh) * 2020-11-26 2021-01-19 中国人民解放军陆军工程大学 一种无人平台语音控制方法
CN112820319A (zh) * 2020-12-30 2021-05-18 麒盛科技股份有限公司 一种人类鼾声识别方法及其装置
CN112954521A (zh) * 2021-01-26 2021-06-11 深圳市富天达电子有限公司 一种具有声控免按键调节系统的蓝牙耳机
CN113674766A (zh) * 2021-08-18 2021-11-19 上海复深蓝软件股份有限公司 语音评价方法、装置、计算机设备及存储介质
CN115950517A (zh) * 2023-03-02 2023-04-11 南京大学 一种可配置水声信号特征提取方法及装置


Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6195634B1 (en) * 1997-12-24 2001-02-27 Nortel Networks Corporation Selection of decoys for non-vocabulary utterances rejection
JP2001166789A (ja) * 1999-12-10 2001-06-22 Matsushita Electric Ind Co Ltd 初頭/末尾の音素類似度ベクトルによる中国語の音声認識方法及びその装置
CN1181466C (zh) * 2001-12-17 2004-12-22 中国科学院自动化研究所 基于子带能量和特征检测技术的语音信号端点检测方法
US7904295B2 (en) * 2004-09-02 2011-03-08 Coelho Rosangela Fernandes Method for automatic speaker recognition with hurst parameter based features and method for speaker classification based on fractional brownian motion classifiers
US8708702B2 (en) * 2004-09-16 2014-04-29 Lena Foundation Systems and methods for learning using contextual feedback
US20110035215A1 (en) * 2007-08-28 2011-02-10 Haim Sompolinsky Method, device and system for speech recognition
CN101206858B (zh) * 2007-12-12 2011-07-13 北京中星微电子有限公司 一种孤立词语音端点检测的方法及系统
CN101625857B (zh) * 2008-07-10 2012-05-09 新奥特(北京)视频技术有限公司 一种自适应的语音端点检测方法
CN101872616B (zh) * 2009-04-22 2013-02-06 索尼株式会社 端点检测方法以及使用该方法的系统
CN102332263B (zh) * 2011-09-23 2012-11-07 浙江大学 一种基于近邻原则合成情感模型的说话人识别方法
WO2013133768A1 (en) * 2012-03-06 2013-09-12 Agency For Science, Technology And Research Method and system for template-based personalized singing synthesis

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1268732A (zh) * 2000-03-31 2000-10-04 清华大学 基于语音识别专用芯片的特定人语音识别、语音回放方法
CN1787075A (zh) * 2005-12-13 2006-06-14 浙江大学 基于内嵌gmm核的支持向量机模型的说话人识别方法
CN101241699A (zh) * 2008-03-14 2008-08-13 北京交通大学 一种远程汉语教学中的说话人确认系统
CN102005070A (zh) * 2010-11-17 2011-04-06 广东中大讯通信息有限公司 一种语音识别门禁系统
CN102324232A (zh) * 2011-09-12 2012-01-18 辽宁工业大学 基于高斯混合模型的声纹识别方法及系统
CN102737629A (zh) * 2011-11-11 2012-10-17 东南大学 一种嵌入式语音情感识别方法及装置

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754593B2 (en) 2015-11-04 2017-09-05 International Business Machines Corporation Sound envelope deconstruction to identify words and speakers in continuous speech
CN111027453A (zh) * 2019-12-06 2020-04-17 西北工业大学 基于高斯混合模型的非合作水中目标自动识别方法
CN113223511A (zh) * 2020-01-21 2021-08-06 珠海市煊扬科技有限公司 用于语音识别的音频处理装置
CN113223511B (zh) * 2020-01-21 2024-04-16 珠海市煊扬科技有限公司 用于语音识别的音频处理装置
CN113053398A (zh) * 2021-03-11 2021-06-29 东风汽车集团股份有限公司 基于mfcc和bp神经网络的说话人识别系统及方法
CN113053398B (zh) * 2021-03-11 2022-09-27 东风汽车集团股份有限公司 基于mfcc和bp神经网络的说话人识别系统及方法

Also Published As

Publication number Publication date
US20150340027A1 (en) 2015-11-26
CN103236260B (zh) 2015-08-12
CN103236260A (zh) 2013-08-07

Similar Documents

Publication Publication Date Title
WO2014153800A1 (zh) Speech recognition system
US10504539B2 (en) Voice activity detection systems and methods
CN106486131B (zh) 一种语音去噪的方法及装置
Mak et al. A study of voice activity detection techniques for NIST speaker recognition evaluations
WO2021139425A1 (zh) 语音端点检测方法、装置、设备及存储介质
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
Chapaneri Spoken digits recognition using weighted MFCC and improved features for dynamic time warping
US8306817B2 (en) Speech recognition with non-linear noise reduction on Mel-frequency cepstra
CN110232933B (zh) 音频检测方法、装置、存储介质及电子设备
WO2002029782A1 (en) Perceptual harmonic cepstral coefficients as the front-end for speech recognition
CN108305639B (zh) 语音情感识别方法、计算机可读存储介质、终端
CN108682432B (zh) 语音情感识别装置
CN105679312A (zh) 一种噪声环境下声纹识别的语音特征处理方法
CN108091340B (zh) 声纹识别方法、声纹识别系统和计算机可读存储介质
Chauhan et al. Speech to text converter using Gaussian Mixture Model (GMM)
Korkmaz et al. Unsupervised and supervised VAD systems using combination of time and frequency domain features
Malode et al. Advanced speaker recognition
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
El-Henawy et al. Recognition of phonetic Arabic figures via wavelet based Mel Frequency Cepstrum using HMMs
CN116312561A (zh) 一种电力调度系统人员声纹识别鉴权降噪和语音增强方法、系统及装置
Abka et al. Speech recognition features: Comparison studies on robustness against environmental distortions
Kumar et al. Effective preprocessing of speech and acoustic features extraction for spoken language identification
CN114512133A (zh) 发声对象识别方法、装置、服务器及存储介质
Yue et al. Speaker age recognition based on isolated words by using SVM
JPH01255000A (ja) Apparatus and method for selectively adding noise to templates used in a speech recognition system

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 14366482

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13880076

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 21/01/2016)

122 Ep: pct application non-entry in european phase

Ref document number: 13880076

Country of ref document: EP

Kind code of ref document: A1