WO2014114048A1 - 一种语音识别的方法、装置 - Google Patents

一种语音识别的方法、装置 Download PDF

Info

Publication number
WO2014114048A1
WO2014114048A1 (PCT/CN2013/077498)
Authority
WO
WIPO (PCT)
Prior art keywords
noise
voice data
data
confidence
unit
Prior art date
Application number
PCT/CN2013/077498
Other languages
English (en)
French (fr)
Inventor
蒋洪睿
王细勇
梁俊斌
郑伟军
周均扬
Original Assignee
华为终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为终端有限公司 filed Critical 华为终端有限公司
Publication of WO2014114048A1 publication Critical patent/WO2014114048A1/zh

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/20Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/84Detection of presence or absence of voice signals for discriminating voice from noise

Definitions

  • the present invention relates to the field of voice processing technologies, and in particular, to a voice recognition method and apparatus.
  • a user generally uses voice assistant software for voice recognition on a terminal device such as a mobile phone.
  • the process of voice recognition using software such as a voice assistant is as follows: the user starts the voice assistant software, which obtains the voice data; the voice data is sent to the noise reduction module for noise reduction processing; the voice data after noise reduction is sent to the voice recognition engine; the voice recognition engine returns the recognition result to the voice assistant; the voice assistant judges the correctness of the recognition result against the confidence threshold, and then presents the result.
  • voice assistant software generally works well in quiet environments such as offices, but is not effective in noisy environments (such as an in-vehicle environment); software noise reduction is commonly used in the industry to improve the speech recognition rate, but the improvement is not obvious, and sometimes it even lowers the recognition rate.
  • the present technical solution provides a method and apparatus for voice recognition to improve voice recognition rate and enhance user experience.
  • the first aspect provides a method for voice recognition. The method includes: acquiring voice data; acquiring a confidence value according to the voice data; acquiring a noise scene according to the voice data; acquiring a confidence threshold corresponding to the noise scene; and, if the confidence value is greater than or equal to the confidence threshold, processing the voice data.
  • the noise scenario specifically includes: a noise type; a noise size.
  • the noise scenario includes a noise type
  • the acquiring a noise scene according to the voice data includes: acquiring, according to the voice data, a frequency cepstrum coefficient of noise in the voice data; and acquiring a noise type of the voice data according to the frequency cepstrum coefficient of the noise and a pre-established noise type model.
  • the method for establishing the noise type model includes: acquiring noise data; acquiring, according to the noise data, a frequency cepstrum coefficient of the noise data; processing the frequency cepstral coefficient according to an EM algorithm to establish the noise type model.
  • the noise type model is a Gaussian mixture model.
  • the noise scene includes a noise size
  • the acquiring a noise scene according to the voice data includes: acquiring feature parameters of the voice data according to the voice data; performing voice activation detection according to the feature parameters; and acquiring the noise size according to the result of the voice activation detection.
  • the noise size specifically includes: a signal to noise ratio; a noise energy level.
  • with reference to the fifth or sixth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, the acquiring a confidence threshold corresponding to the noise scene includes: acquiring the confidence threshold corresponding to the noise scene according to a correspondence between pre-stored confidence threshold empirical data and the noise scene.
  • the second aspect provides a voice recognition apparatus, where the apparatus includes: an acquiring unit, configured to acquire voice data; a confidence value unit, configured to receive the voice data acquired by the acquiring unit and obtain a confidence value according to the voice data; a noise scene unit, configured to receive the voice data acquired by the acquiring unit and obtain a noise scene according to the voice data; a confidence threshold unit, configured to receive the noise scene of the noise scene unit and obtain a confidence threshold corresponding to the noise scene; and a processing unit, configured to receive the confidence value obtained by the confidence value unit and the confidence threshold obtained by the confidence threshold unit, and, if the confidence value is greater than or equal to the confidence threshold, process the voice data.
  • the apparatus further includes: a modeling unit, configured to acquire noise data, obtain a frequency cepstrum coefficient of the noise data according to the noise data, and process the frequency cepstrum coefficient according to an EM algorithm to establish a noise type model.
  • the noise scene unit includes: a noise type unit, configured to acquire, according to the voice data of the acquiring unit, a frequency cepstrum coefficient of noise in the voice data, and acquire a noise type of the voice data according to the frequency cepstrum coefficient of the noise and the noise type model of the modeling unit.
  • the noise scene unit further includes a noise size unit, configured to acquire a feature parameter of the voice data according to the voice data of the acquiring unit, perform voice activation detection according to the feature parameter, and acquire the noise size according to the result of the voice activation detection.
  • the apparatus further includes: a storage unit, configured to store confidence threshold empirical data.
  • the confidence threshold unit is specifically configured to acquire the confidence threshold corresponding to the noise scene according to a correspondence between the confidence threshold empirical data pre-stored by the storage unit and the noise scene.
  • a mobile terminal including a processor and a microphone, wherein the microphone is configured to acquire voice data, and the processor is configured to acquire a confidence value and a noise scenario according to the voice data, according to the In the noise scenario, a confidence threshold corresponding to the noise scenario is obtained, and if the confidence value is greater than or equal to the confidence threshold, the voice data is processed.
  • the mobile terminal further includes: a memory, configured to store confidence threshold empirical data.
  • the processor is specifically configured to: obtain a confidence value and a noise scene according to the voice data; obtain a confidence threshold corresponding to the noise scene; and, if the confidence value is greater than or equal to the confidence threshold, process the voice data.
  • the technical solution of the present invention provides a method and apparatus for speech recognition.
  • the method and apparatus acquire a noise scene and obtain a confidence threshold according to pre-stored confidence threshold empirical data and the noise scene.
  • Flexibly adjusting the confidence threshold according to the noise scene in this way greatly improves the speech recognition rate in a noisy environment.
  • FIG. 1 is a flowchart of a method for voice recognition according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart of another implementation manner of the voice recognition method according to Embodiment 1 of the present invention
  • FIG. 3 is a flowchart of another implementation manner of a method for voice recognition according to Embodiment 2 of the present invention
  • FIG. 4 is a flowchart of another implementation manner of a method for voice recognition according to Embodiment 3 of the present invention
  • FIG. 5 is a schematic structural diagram of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • FIG. 6 is a schematic diagram showing another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • FIG. 7 is a schematic diagram showing another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • FIG. 8 is a schematic diagram of another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • FIG. 9 is a schematic structural diagram of a mobile terminal according to Embodiment 5 of the present invention.
  • FIG. 10 is a schematic diagram of another possible structure of a mobile terminal according to Embodiment 5 of the present invention
  • FIG. 11 is a schematic structural diagram of a mobile phone according to an embodiment of the present invention.
  • the device includes, but is not limited to, a mobile phone, a personal digital assistant (PDA), a tablet computer, a portable device (for example, a portable computer), an in-vehicle device, and an ATM (Automatic Teller Machine).
  • FIG. 1 is a flowchart of a method for voice recognition according to Embodiment 1 of the present invention. As shown in FIG. 1 , a method for providing voice recognition according to Embodiment 1 of the present invention may specifically include:
  • the user turns on the voice recognition software such as the voice assistant on the device, and obtains the voice data input by the user through the microphone.
  • the voice data may be input by the user or by a machine, and includes any data containing information.
  • the returned confidence values include: the sentence confidence N1 (the overall confidence of "To call Zhang San"), the pre-command word confidence N2 ("give" is the pre-command word, that is, the confidence value of "give" is N2), the person-name confidence N3 ("Zhang San" is the person name, that is, the confidence value of "Zhang San" is N3), and the post-command word confidence N4 ("call" is the post-command word, that is, the confidence value of "call" is N4).
  • the sentence confidence N1 is obtained by combining N2, N3, and N4.
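The text does not specify how N1 is combined from N2, N3 and N4; the sketch below assumes a simple arithmetic mean purely for illustration (the function name and example values are invented):

```python
def sentence_confidence(n2, n3, n4):
    """Combine the pre-command word, person-name and post-command word
    confidences into one sentence-level confidence N1.
    The combination rule (arithmetic mean) is an assumption."""
    return (n2 + n3 + n4) / 3

# e.g. for "To call Zhang San": give=52, Zhang San=40, call=55
print(sentence_confidence(52, 40, 55))  # 49.0
```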
  • step S102 may be performed before the step S103, the step S102 may be performed after the step S103, or the step S102 may be performed simultaneously with the step S103, which is not limited by the embodiment of the present invention.
  • the noise scene is the noise state in which the user inputs the voice data. That is, it describes whether the user inputs the voice data in the noise environment of a road, an office or a vehicle, and how large the noise is in the environment in which the user is located.
  • step S102 may be performed before the step S101, the step S102 may be performed after the step S101, or the step S102 may be performed simultaneously with the step S101, which is not limited by the embodiment of the present invention.
  • the confidence threshold corresponding to the noise scene may be obtained according to the noise scene.
  • the result of the speech data recognition is considered correct, that is, the corresponding speech data is processed.
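The acceptance step above reduces to a single comparison; this minimal sketch (names and the example threshold are assumptions, not taken from the text) shows the decision:

```python
def should_process(confidence_value, confidence_threshold):
    """Accept the recognition result only when the confidence value
    reaches the threshold chosen for the current noise scene."""
    return confidence_value >= confidence_threshold

# e.g. in a scene whose threshold is 40, a sentence confidence of 48
# is accepted while 35 is rejected
print(should_process(48, 40))  # True
print(should_process(35, 40))  # False
```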
  • the voice data is voice data including a command word, such as "call Zhang San", "send a text message to Zhang San", "open an application", etc.
  • the voice recognition belongs to command word recognition
  • the device performs the corresponding command, such as making a call, sending a text message, opening an application, etc.
  • the voice data recognition belongs to text dictation recognition
  • the recognition result text is displayed. That is, if the confidence value is greater than or equal to the confidence threshold, the speech data is processed.
  • the technical solution of the present invention provides a method for voice recognition, which acquires a noise scene and obtains a confidence threshold according to the pre-stored confidence threshold empirical data and the noise scene.
  • This method of flexibly adjusting the confidence threshold according to the noise scenario greatly improves the speech recognition rate in a noisy environment.
  • FIG. 2 is a flow chart of another implementation manner of a method for voice recognition according to Embodiment 1 of the present invention. As shown in FIG. 2, the method further includes:
  • the technical solution of the present invention provides a method for voice recognition, which acquires a noise scene and obtains a confidence threshold according to pre-stored confidence threshold empirical data and the noise scene. This method of flexibly adjusting the confidence threshold according to the noise scene greatly improves the speech recognition rate in a noisy environment.
  • FIG. 3 is a flow chart of another implementation manner of a method for voice recognition according to Embodiment 2 of the present invention.
  • Embodiment 2 of the present invention is described on the basis of Embodiment 1 of the present invention.
  • the noise scenario specifically includes: a noise type; a noise size.
  • the noise type refers to the noise environment in which the user inputs the voice data, which can be understood as whether the user is in the noise environment of a road, an office or a vehicle.
  • the noise level represents the amount of noise in the noise environment in which the user inputs the voice data.
  • the noise size includes: a signal to noise ratio and a noise energy level.
  • the signal-to-noise ratio is the ratio of the voice data power to the noise data power, often expressed in decibels. The higher the signal-to-noise ratio, the smaller the noise data power, and vice versa.
  • the noise energy level reflects the magnitude of the noise data energy in the user's voice data; the signal-to-noise ratio and the noise energy level together represent the noise size.
  • the noise scenario includes a noise type.
  • the acquiring a noise scenario according to the voice data includes:
  • voice activation detection (VAD) determines speech data frames and noise data frames; after the noise data frames are acquired, the frequency cepstrum coefficients of the noise data frames are acquired.
  • Mel is the unit of subjective pitch
  • Hz is the unit of objective pitch.
  • the Mel frequency is based on the auditory characteristics of the human ear, which is nonlinearly related to the Hz frequency.
  • Mel Frequency Cepstrum Coefficient MFCC is a cepstral coefficient on the Mel frequency. It has good recognition performance and is widely used in speech recognition, voiceprint recognition, and language recognition.
  • the frequency cepstrum coefficients are substituted into each pre-established noise type model for calculation. If the calculation result value of a certain noise type model is the largest, it is considered that the user input the voice data in an environment of that noise type, that is, the noise type of the voice data is obtained.
  • the pre-established noise type model in step S1022 is a Gaussian mixture model.
  • Gaussian density function estimation is a parametric model, which has two types: single Gaussian model (SGM) and Gaussian mixture model (GMM).
  • the Gaussian model is an effective clustering model. According to its Gaussian probability density function parameters, each established Gaussian model can be regarded as a category. Input a sample X, calculate its value through the Gaussian probability density function, and then use a threshold to determine whether the sample belongs to the established Gaussian model. Because a GMM contains multiple component models, its division is finer and it is suitable for dividing complex objects; it is therefore widely used in complex object modeling, for example using a GMM to classify and model different noise types in speech recognition.
  • the process of establishing a GMM of a certain noise type may be: inputting multiple groups of noise data of the same type, training repeatedly according to the noise data, and finally obtaining the GMM of that noise type.
  • the Gaussian mixture model can be expressed by the following formula: p(x) = Σ_{i=1}^{N} ω_i · N(x; μ_i, Σ_i)
  • N is the number of mixture components of the GMM, that is, the GMM is a combination of N Gaussian models; ω_i is the weight of the i-th Gaussian model, μ_i is its mean, and Σ_i is its covariance matrix.
  • Theoretically, any shape in space can be modeled using a GMM. Since the output of the Gaussian model is a decimal between 0 and 1, for ease of calculation the natural logarithm (ln) of the result is usually taken, which turns it into a floating point number less than 0.
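The mixture density and its natural logarithm can be sketched numerically. This one-dimensional toy (function names and all values invented for illustration; real noise models would be multi-dimensional over MFCC vectors) evaluates ln p(x) for a two-component mixture:

```python
import math

def gaussian_pdf(x, mean, var):
    """Single one-dimensional Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_log_likelihood(x, weights, means, vars_):
    """ln p(x) for a mixture: ln(sum_i w_i * N(x; mu_i, var_i)).
    As the text notes, the density is typically between 0 and 1,
    so the natural log is usually a negative float."""
    p = sum(w * gaussian_pdf(x, m, v)
            for w, m, v in zip(weights, means, vars_))
    return math.log(p)

# two-component mixture with weights summing to 1
print(gmm_log_likelihood(0.5, [0.6, 0.4], [0.0, 2.0], [1.0, 1.0]))
```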
  • the method of establishing the pre-established noise type model in step S1022 includes: acquiring noise data, that is, acquiring multiple sets of noise data of the same type, such as vehicle noise, street noise, office noise, etc.
  • the GMM established from this type of noise data is the noise type model of that noise data. It should be understood that the present invention can also acquire other kinds of noise data and establish a corresponding noise type model for each type of noise data, which is not limited by the embodiment of the present invention.
  • the noise type model is established by processing the frequency cepstral coefficients according to an EM algorithm.
  • the EM algorithm (Expectation-Maximization algorithm) is used in statistics to find the maximum likelihood estimate of a parameter in a probability model that depends on unobservable hidden variables.
  • the Maximum Expectation (EM) algorithm is an algorithm that looks for a parameter maximum likelihood estimate or a maximum a posteriori estimate in the GMM, where the GMM relies on an unobservable hidden variable.
  • the EM algorithm alternates between two steps: the first step computes the expectation (E), estimating the expected value of the unknown parameters given the current parameter estimate; the second step maximizes (M), re-estimating the distribution parameters to maximize the likelihood of the data given the expected estimate of the unknown variables.
  • the flow of the EM algorithm is as follows: 1. initialize the distribution parameters; 2. repeat the E and M steps until convergence. Simply put, suppose two parameters A and B are both unknown in the starting state; knowing A yields information about B, and knowing B in turn yields A. One can first give A some initial value to obtain an estimate of B, and then re-estimate A from the current value of B.
  • the EM algorithm can estimate the parameters with maximum possible probability from an incomplete data set and is a simple and practical learning algorithm. By alternately using the E and M steps, the EM algorithm gradually improves the model parameters, gradually increasing the likelihood of the parameters and the training samples, and finally ends at a maximum point. Intuitively, the EM algorithm can also be seen as a successive approximation algorithm: without knowing the model parameters in advance, one can randomly select a set of parameters, or roughly give an initial value, and iterate from there to determine the corresponding set of parameters.
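As a hedged illustration of the alternating E and M steps described above, the following minimal one-dimensional, two-component EM sketch (deterministic initialization and toy data invented for illustration; the patent's training would run over multi-dimensional MFCC vectors) re-estimates weights, means and variances:

```python
import math

def em_gmm_1d(data, iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustrative
    sketch only). E-step: posterior responsibility of each component for
    each sample. M-step: re-estimate weights, means, variances."""
    means = [min(data), max(data)]   # deterministic, well-separated start
    vars_ = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities resp[n][j]
        resp = []
        for x in data:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, vars_)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: update parameters from responsibilities
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            vars_[j] = max(sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, data)) / nj, 1e-6)
    return weights, means, vars_

# two obvious clusters around 0 and 5
w, m, v = em_gmm_1d([0.1, -0.2, 0.05, 5.0, 5.2, 4.9])
print(sorted(round(x, 1) for x in m))  # means near the two clusters
```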
  • the obtained frequency cepstrum coefficients are substituted into the EM algorithm for training, where X is a frequency cepstrum coefficient.
  • for example, the frequency cepstrum coefficients of the current voice data are substituted into the noise type model of vehicle noise (obtained by training on vehicle noise) and the noise type model of off-board noise (which can include office noise, street noise, supermarket noise, etc.). If the calculation result value of the vehicle-noise model is larger than that of the off-board-noise model (for example, -41.9 > -46.8), the noise type of the current voice data is vehicle noise.
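The model-selection step above can be sketched as follows; the dictionary of per-model log scores is a hypothetical stand-in for substituting the coefficients into each trained GMM:

```python
def classify_noise(score_by_model):
    """Pick the noise type whose model gives the largest calculation
    result value (log-likelihood); score_by_model maps a noise-type
    name to that model's score for the current voice data."""
    return max(score_by_model, key=score_by_model.get)

# using the figures from the example above: -41.9 > -46.8
scores = {"vehicle": -41.9, "off-board": -46.8}
print(classify_noise(scores))  # vehicle
```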
  • the technical solution of the present invention provides a method for improving the speech recognition rate in a noisy environment.
  • the method obtains a confidence threshold value by acquiring a noise scene and according to the pre-stored confidence threshold value empirical data and the noise scene.
  • This method of flexibly adjusting the confidence threshold according to the noise scenario greatly improves the speech recognition rate in a noisy environment.
  • the noise scenario includes a noise level.
  • the acquiring a noise scenario according to the voice data includes:
  • a feature parameter of the voice data, where the feature parameters include: subband energy, pitch, and a periodicity factor.
  • for the subband energy, the speech data is divided into N subbands according to the different useful components in different frequency bands, and the energy of each subband is calculated separately for each frame.
  • the subband energy calculation formula is: E = Σ_{i=0}^{L-1} x[i]², where L is the frame length and one frame of speech data is x[0], x[1], ..., x[L-1].
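A minimal sketch of the per-frame subband energy computation. Equal-width frequency bands over a naive DFT are an assumption for illustration; the text does not fix the band layout, and a real implementation would use an FFT:

```python
import math

def subband_energies(frame, n_subbands):
    """Energy of one frame split into n_subbands equal frequency bands.
    A naive DFT over the non-negative frequencies is used purely for
    illustration."""
    L = len(frame)
    half = L // 2
    mags = []
    for k in range(half):
        re = sum(frame[n] * math.cos(2 * math.pi * k * n / L) for n in range(L))
        im = -sum(frame[n] * math.sin(2 * math.pi * k * n / L) for n in range(L))
        mags.append(re * re + im * im)
    band = half // n_subbands
    return [sum(mags[b * band:(b + 1) * band]) for b in range(n_subbands)]

# a pure low-frequency tone concentrates its energy in the first subband
frame = [math.sin(2 * math.pi * 2 * n / 64) for n in range(64)]
e = subband_energies(frame, 4)
print(e.index(max(e)))  # 0
```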
  • the pitch and periodicity factors reflect the periodic components of speech.
  • in speech, the periodicity of the silent segments and the unvoiced segments is very poor, while the periodicity of the voiced segments is very good, so voice frame detection can be performed.
  • S1024, performing voice activation detection according to the feature parameters;
  • voice activity detection (VAD) is used to determine the voice data frames and the noise data frames; the pitch and the periodicity factor are combined with the subband energy to decide between speech frames and silence frames.
  • the VAD decision between speech frames and noise frames is based on the feature parameters; for example, a frame with strong periodicity is generally a speech frame.
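A hedged sketch of a per-frame VAD decision using the two cues the text names (periodicity and subband energy relative to a noise estimate); the specific thresholds and the decision rule are assumptions:

```python
def is_speech_frame(periodicity, frame_energy, noise_floor,
                    period_thresh=0.7, energy_ratio=3.0):
    """Judge a frame as speech when its periodicity is strong or its
    energy clearly exceeds the estimated noise floor (assumed rule)."""
    return periodicity > period_thresh or frame_energy > energy_ratio * noise_floor

print(is_speech_frame(0.9, 1.0, 1.0))  # True  (strongly periodic)
print(is_speech_frame(0.1, 0.8, 1.0))  # False (aperiodic, low energy)
```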
  • SNR = speechLev − noiseLev, where speechLev is the speech level and noiseLev is the noise level; Ln and Ls represent the total number of noise frames and speech frames respectively, ener[Ni] represents the energy of the i-th noise frame, and ener[Sj] represents the energy of the j-th speech frame.
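One plausible reading of the SNR computation above, assuming each level is the average frame energy expressed in dB (the exact level formula is not given in the text, so this is an illustrative assumption):

```python
import math

def snr_db(noise_frame_energies, speech_frame_energies):
    """SNR = speechLev - noiseLev, with each level taken as the average
    frame energy in dB, following the quantities named in the text
    (Ln, Ls, ener[Ni], ener[Sj])."""
    Ln = len(noise_frame_energies)
    Ls = len(speech_frame_energies)
    noise_lev = 10 * math.log10(sum(noise_frame_energies) / Ln)
    speech_lev = 10 * math.log10(sum(speech_frame_energies) / Ls)
    return speech_lev - noise_lev

# speech frames with ~100x the energy of noise frames -> about 20 dB
print(round(snr_db([1.0, 1.2, 0.8], [100.0, 120.0, 80.0])))  # 20
```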
  • the technical solution of the present invention provides a method for improving the speech recognition rate in a noisy environment.
  • the method obtains a confidence threshold value by acquiring a noise scene and according to the pre-stored confidence threshold value empirical data and the noise scene. This method of flexibly adjusting the confidence threshold according to the noise scenario greatly improves the speech recognition rate in a noisy environment.
  • FIG. 4 is a flowchart of another implementation manner of a method for voice recognition according to Embodiment 3 of the present invention.
  • the embodiment is described on the basis of the embodiment 1.
  • the method of step S103 of Embodiment 1 specifically includes: S1031, acquiring the confidence threshold corresponding to the noise scene according to the correspondence between the pre-stored confidence threshold empirical data and the noise scene.
  • the confidence threshold corresponding to the noise scene may be acquired according to the correspondence between the pre-stored confidence threshold empirical data and the noise scene. That is, the confidence threshold can be obtained according to the noise type and the noise size in the noise scene, together with the correspondence between the confidence threshold empirical data (obtained by a large number of simulation measurements) and the noise scene.
  • the noise type indicates the type of environment in which the user performs speech recognition, and the noise size indicates the amount of noise in that environment.
  • the principle for obtaining the confidence threshold is that, for a given noise type, when the noise is large, a lower confidence threshold is selected; when the noise is small, a higher confidence threshold is selected.
  • the specific confidence threshold empirical data is obtained by simulation measurement statistics. For example:
  • when the noise type is the vehicle environment and the noise is large (that is, the noise energy level is greater than -30 dB and the signal-to-noise ratio is less than 10 dB), the confidence threshold empirical data is 35-50; therefore, in this noise scene, a confidence threshold of 35 to 50 is obtained.
  • when the noise type is the vehicle environment and the noise is small (the noise energy level is between -40 dB and -30 dB, and the signal-to-noise ratio is between 10 dB and 20 dB), the confidence threshold empirical data is 40-55; therefore, in this noise scene, a confidence threshold of 40 to 55 is obtained.
  • when the noise type is the office environment and the noise is small (the noise energy level is less than -40 dB and the signal-to-noise ratio is greater than 20 dB), the confidence threshold empirical data is 45-60; therefore, in this noise scene, a confidence threshold of 45 to 60 is obtained.
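The example correspondences above can be sketched as a lookup table. Reading the sign convention as "louder noise = higher energy level", returning the midpoint of each empirical range, and the fallback value are all assumptions for illustration:

```python
def confidence_threshold(noise_type, noise_level_db, snr_db):
    """Hedged sketch of the pre-stored correspondence between confidence
    threshold empirical data and the noise scene. Ranges follow the
    worked examples in the text; the midpoint choice is assumed."""
    if noise_type == "vehicle":
        if noise_level_db > -30 and snr_db < 10:            # loud vehicle noise
            return (35 + 50) / 2
        if -40 < noise_level_db < -30 and 10 < snr_db < 20:  # quieter vehicle
            return (40 + 55) / 2
    if noise_type == "office" and noise_level_db < -40 and snr_db > 20:
        return (45 + 60) / 2
    return 50.0  # fallback for scenes outside the examples (assumed)

print(confidence_threshold("vehicle", -25, 5))   # 42.5
print(confidence_threshold("office", -45, 25))   # 52.5
```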
  • the technical solution of the present invention provides a method for improving the speech recognition rate in a noisy environment.
  • the method obtains a confidence threshold value by acquiring a noise scenario and according to the pre-stored confidence threshold value empirical data and the noise scenario.
  • This method of flexibly adjusting the confidence threshold according to the noise scene greatly improves the speech recognition rate in a noisy environment.
  • FIG. 5 is a schematic structural diagram of a voice recognition apparatus according to Embodiment 4 of the present invention. As shown in FIG. 5, the device includes: an acquiring unit 300, configured to acquire voice data;
  • a confidence value unit 301 configured to receive the voice data acquired by the acquiring unit 300, and obtain a confidence value according to the voice data;
  • the noise scene unit 302 is configured to receive the acquired voice data of the acquiring unit 300, and acquire a noise scene according to the voice data;
  • the confidence threshold unit 303 is configured to receive the noise scene of the noise scene unit 302, and obtain a confidence threshold corresponding to the noise scene.
  • the processing unit 304 is configured to receive the confidence value obtained by the confidence value unit 301 and the confidence threshold obtained by the confidence threshold unit 303, and process the voice data if the confidence value is greater than or equal to the confidence threshold.
  • the acquiring unit 300 acquires the voice data; the confidence value unit 301 receives the voice data acquired by the acquiring unit 300 and acquires a confidence value according to the voice data; the noise scene unit 302 receives the voice data acquired by the acquiring unit 300 and acquires a noise scene according to the voice data, where the noise scene includes: a noise type, a signal-to-noise ratio, and a noise energy level;
  • the confidence threshold unit 303 receives the noise scene of the noise scene unit 302 and obtains a confidence threshold corresponding to the noise scene; the processing unit 304 receives the confidence value obtained by the confidence value unit 301 and the confidence threshold obtained by the confidence threshold unit 303, and processes the voice data if the confidence value is greater than or equal to the confidence threshold.
  • the obtaining unit 300, the confidence value unit 301, the noise scene unit 302, the confidence threshold unit 303, and the processing unit 304 can be used to execute the methods described in steps S100, S101, S102, S103, and S104 in Embodiment 1.
  • the technical solution of the present invention provides a speech recognition apparatus, which acquires a noise scene and obtains a confidence threshold according to the pre-stored confidence threshold empirical data and the noise scene.
  • This apparatus, which flexibly adjusts the confidence threshold according to the noise scene, greatly improves the speech recognition rate in a noisy environment.
  • FIG. 6 is a schematic diagram showing another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • the apparatus further includes: a modeling unit 305, configured to acquire noise data, obtain a frequency cepstrum coefficient of the noise data according to the noise data, and process the frequency cepstrum coefficient according to an EM algorithm to establish a noise type model.
  • the modeling unit 305 can be used to perform the method of establishing the pre-established noise type model in step S1022 in Embodiment 2. For details, refer to the description of the method in Embodiment 2; details are not described herein again.
  • FIG. 7 is another schematic structural diagram of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • the noise scene unit specifically includes: a noise type unit 3021, configured to acquire, according to the voice data of the acquiring unit, a frequency cepstrum coefficient of noise in the voice data, and acquire a noise type of the voice data according to the frequency cepstrum coefficient of the noise and the noise type model of the modeling unit.
  • The noise type unit 3021 can be used to perform the methods described in steps S1021 and S1022 of Embodiment 2. For details, refer to the description of the method in Embodiment 2; it is not repeated here.
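A hedged sketch of the classification step performed by the noise type unit 3021: score the noise frames' cepstrum coefficients against each pre-built noise-type model and pick the type whose model yields the largest average log-likelihood, as in the vehicle vs. non-vehicle comparison of Embodiment 2. The diagonal-GMM parameterization and the function names are illustrative assumptions.

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Per-frame ln p(x) under a diagonal-covariance Gaussian mixture."""
    log_p = -0.5 * (((X[:, None, :] - means) ** 2 / variances
                     + np.log(2 * np.pi * variances)).sum(axis=2))
    return np.logaddexp.reduce(log_p + np.log(weights), axis=1)

def classify_noise_type(noise_frames, models):
    """Return the noise type whose model gives the highest average
    log-likelihood over the noise frames, plus all the scores."""
    scores = {name: float(diag_gmm_loglik(noise_frames, *m).mean())
              for name, m in models.items()}
    return max(scores, key=scores.get), scores
```

For example, with an average score of -41.9 for the in-car model and -46.8 for the non-in-car model, the noise type would be classified as in-car noise.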
  • The noise size unit 3022 is configured to acquire the feature parameters of the voice data from the voice data of the acquiring unit, perform voice activity detection according to the feature parameters, and acquire the noise magnitude according to the result of the voice activity detection.
  • The noise size unit 3022 can be used to perform the methods described in steps S1023, S1024, and S1025 of Embodiment 2. For details, refer to the description of the method in Embodiment 2; it is not repeated here.
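The noise size computation of unit 3022 can be sketched as an energy-based voice-activity split followed by the level formulas given in Embodiment 2: noiseLev = 10·log10(1 + mean noise-frame energy), speechLev likewise over speech frames, and SNR = speechLev − noiseLev. The fixed energy threshold used here for the VAD decision is a simplification; the described method also uses pitch and periodicity features.

```python
import numpy as np

def noise_size(frames, energy_threshold):
    """Split frames into speech/noise by per-frame energy (a crude VAD),
    then return (noiseLev, speechLev, SNR) in dB per Embodiment 2."""
    ener = (frames ** 2).sum(axis=1)        # ener = sum_i x[i]^2 per frame
    is_speech = ener > energy_threshold     # simplistic VAD decision
    noise_lev = 10 * np.log10(1 + ener[~is_speech].mean())
    speech_lev = 10 * np.log10(1 + ener[is_speech].mean())
    return noise_lev, speech_lev, speech_lev - noise_lev
```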
  • The technical solution of the present invention provides a speech recognition apparatus that acquires a noise scene and obtains a confidence threshold from pre-stored confidence threshold empirical data and that noise scene.
  • This device, which flexibly adjusts the confidence threshold according to the noise scene, greatly improves the speech recognition rate in a noisy environment.
  • FIG. 8 is a schematic diagram showing another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention. As shown in FIG. 8, the device further includes:
  • The storage unit 306 is configured to store the confidence threshold empirical data.
  • The confidence threshold unit 303 is configured to acquire the confidence threshold corresponding to the noise scene according to the correspondence, pre-stored by the storage unit 306, between the confidence threshold empirical data and the noise scene.
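A minimal sketch of that lookup, using the empirical ranges quoted in Embodiment 3 (in-car and loud: 35–50; in-car and quieter: 40–55; office-like and quiet: 45–60). The bucket boundaries follow the SNR figures in Embodiment 3, while the single value returned from each range is an illustrative choice, not prescribed by the source.

```python
def confidence_threshold(noise_type, snr_db):
    """Map a noise scene (type + SNR) to a confidence threshold drawn from
    pre-stored empirical ranges; lower thresholds for louder noise."""
    if noise_type == "vehicle" and snr_db < 10:   # loud in-car noise: 35-50
        return 40
    if noise_type == "vehicle":                   # quieter in-car noise: 40-55
        return 48
    return 52                                     # quiet office-like noise: 45-60
```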
  • The confidence threshold unit 303 can be used to perform the method described in step S1031 of Embodiment 3. For details, refer to the description of the method in Embodiment 3; it is not repeated here.
  • The technical solution of the present invention provides a speech recognition apparatus that acquires a noise scene and obtains a confidence threshold from pre-stored confidence threshold empirical data and that noise scene.
  • This device, which flexibly adjusts the confidence threshold according to the noise scene, greatly improves the speech recognition rate in a noisy environment.
  • FIG. 9 is a schematic structural diagram of a mobile terminal according to Embodiment 5 of the present invention. As shown in FIG. 9, the mobile terminal includes a processor and a microphone, where
  • the microphone 501 is configured to acquire voice data.
  • The processor 502 is configured to acquire a confidence value and a noise scene according to the voice data, obtain the confidence threshold corresponding to the noise scene, and process the voice data if the confidence value is greater than or equal to the confidence threshold.
  • The microphone 501 and the processor 502 may be used to perform the methods described in steps S100, S101, S102, S103, and S104 in Embodiment 1. For details, refer to the description of the method in Embodiment 1; they are not repeated here.
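The accept/prompt decision of steps S104 and S1041 reduces to comparing the confidence value against the scene-dependent threshold; the function name below is a hypothetical illustration. With the Embodiment 1 example (name confidence N3 = 48), a threshold of 40 leads to processing the result, while a threshold of 50 leads to prompting the user.

```python
def handle_result(confidence_value, confidence_threshold):
    """Process the voice data when the confidence value reaches the
    threshold (S104); otherwise prompt the user to retry (S1041)."""
    return "process" if confidence_value >= confidence_threshold else "prompt_user"
```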
  • The technical solution of the present invention provides a mobile terminal that acquires a noise scene and obtains a confidence threshold from pre-stored confidence threshold empirical data and that noise scene. This mobile terminal, which flexibly adjusts the confidence threshold according to the noise scene, greatly improves the speech recognition rate in a noisy environment.
  • The mobile terminal further includes a memory 503, configured to store the confidence threshold empirical data.
  • The processor is specifically configured to: acquire a confidence value and a noise scene according to the voice data; acquire the confidence threshold corresponding to the noise scene according to the correspondence, pre-stored in the memory 503, between the confidence threshold empirical data and the noise scene; and process the voice data if the confidence value is greater than or equal to the confidence threshold.
  • the above structure can be used to perform the methods in Embodiment 1, Embodiment 2, and Embodiment 3.
  • The technical solution of the present invention provides a mobile terminal that acquires a noise scene and obtains a confidence threshold from pre-stored confidence threshold empirical data and that noise scene.
  • This mobile terminal, which flexibly adjusts the confidence threshold according to the noise scene, greatly improves the speech recognition rate in a noisy environment.
  • Embodiment 6: As shown in FIG. 11, the embodiment of the present invention is specifically described by taking a mobile phone as an example. It should be understood that the illustrated mobile phone is merely one example; a mobile phone may have more or fewer components than those shown in the figure, may combine two or more components, or may have a different component configuration.
  • The various components shown in the figure can be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing and/or application-specific integrated circuits.
  • FIG. 11 is a schematic structural diagram of a mobile phone according to an embodiment of the present invention.
  • The mobile phone shown in FIG. 11 includes: a touch screen 41, a memory 42, a CPU 43, a power management chip 44, an RF circuit 45, a peripheral interface 46, an audio circuit 47, a microphone 48, and an I/O subsystem 49.
  • The touch screen 41 is the input and output interface between the mobile phone and the user. In addition to acquiring the user's touch information and control commands, it presents visual output to the user; the visual output may include graphics, text, icons, video, and the like.
  • The memory 42 can be used to store the confidence threshold empirical data for use by the CPU 43 during processing. The memory 42 may be accessed by the CPU 43, the peripheral interface 46, and so on, and may include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.
  • The CPU 43 is configured to process the voice data acquired by the audio circuit 47 and the microphone 48, obtain the noise scene according to the voice data, and obtain the confidence threshold according to the noise scene and the confidence threshold empirical data pre-stored in the memory 42.
  • The CPU 43 is the control center of the mobile phone; it connects the various parts of the entire phone using various interfaces and lines, and performs the phone's various functions and processes data by running or executing the software programs and/or modules stored in the memory 42 and calling the data stored in the memory 42, thereby monitoring the phone as a whole.
  • the CPU 43 may include one or more processing units.
  • The CPU 43 may integrate an application processor and a modem processor. The application processor mainly handles the operating system, the user interface, applications, and so on, while the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the CPU 43. It should also be understood that the foregoing is only one of the functions the CPU 43 can perform; other functions are not limited in the embodiments of the present invention.
  • the power management chip 44 can be used for power supply and power management of the hardware connected to the CPU 43, the I/O subsystem 49, and the peripheral interface 46.
  • The RF circuit 45 is mainly used to establish communication between the mobile phone and the wireless network (that is, the network side) and to implement data acquisition and transmission between the phone and the wireless network, for example, sending and receiving short messages, e-mails, and the like. Specifically, the RF circuit 45 acquires and sends RF signals, which are also called electromagnetic signals; it converts electrical signals into electromagnetic signals or electromagnetic signals into electrical signals, and communicates with the communication network and other devices through these electromagnetic signals.
  • The RF circuit 45 may include known circuitry for performing these functions, including but not limited to an antenna system, an RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a CODEC chipset, a subscriber identity module (SIM), and so on.
  • The peripheral interface 46 can connect the input and output peripherals of the device to the CPU 43 and the memory 42.
  • the audio circuit 47 is mainly used to acquire audio data from the peripheral interface 46 and convert the audio data into an electrical signal.
  • the microphone 48 can be used to acquire voice data.
  • The I/O subsystem 49 can control the input and output peripherals on the device and may include a display controller 491 and one or more input controllers 492 for controlling other input/control devices.
  • Optionally, the one or more input controllers 492 acquire electrical signals from, or send electrical signals to, other input/control devices, which may include physical buttons (push buttons, rocker buttons, and the like), a dial pad, a slide switch, a joystick, or a click wheel.
  • the input controller 492 can be connected to any of the following: a keyboard, an infrared port, a USB interface, and a pointing device such as a mouse.
  • the display controller 491 in the I/O subsystem 49 acquires an electrical signal from the touch screen 41 or transmits an electrical signal to the touch screen 41.
  • The touch screen 41 acquires contacts on the touch screen, and the display controller 491 converts the acquired contacts into interactions with the user interface objects presented on the touch screen 41, thereby realizing human-computer interaction.
  • The user interface objects presented on the touch screen 41 may be icons for running games, icons for connecting to the corresponding network, filtering modes, and the like.
  • the device may also include a light mouse, which is a touch sensitive surface that does not present a visual output, or an extension of a touch sensitive surface formed by the touch screen.
  • The microphone 48 acquires the voice data and sends it to the CPU 43 through the peripheral interface 46 and the audio circuit 47.
  • The CPU 43 can be used to process the voice data and obtain the noise scene according to the voice data, and to obtain the confidence threshold according to the noise scene and the confidence threshold empirical data pre-stored in the memory 42.
  • the above structure can be used to perform the methods in Embodiment 1, Embodiment 2, and Embodiment 3.
  • The technical solution of the present invention provides a speech-recognition-capable mobile phone that acquires a noise scene and obtains a confidence threshold from pre-stored confidence threshold empirical data and that noise scene.
  • This kind of phone, which flexibly adjusts the confidence threshold according to the noise scene, greatly improves the speech recognition rate in a noisy environment.
  • The device-readable medium includes device storage media and communication media; optionally, a communication medium includes any medium that facilitates transfer of a device program from one place to another.
  • A storage medium may be any available medium that the device can access.
  • By way of example and not limitation, the device-readable medium may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by the device.
  • In addition, any connection may suitably be termed a device-readable medium.
  • As used in the embodiments of the present invention, disks and discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)

Abstract

A speech recognition method, comprising: acquiring voice data (S100); acquiring a confidence value according to the voice data (S101); acquiring a noise scene according to the voice data (S102); acquiring a confidence threshold corresponding to the noise scene (S103); and, if the confidence value is greater than or equal to the confidence threshold, processing the voice data (S104). An apparatus is also provided. This method and apparatus, which flexibly adjust the confidence threshold according to the noise scene, greatly improve the speech recognition rate in a noisy environment.

Description

一种语音识别的方法、 装置
技术领域 本发明实施例涉及语音处理技术领域,尤其涉及一种语音识别的方法及装 置。
背景技术 用户在手机等终端设备上一般使用语音助手软件用来进行语音识别。用语 音助手等软件进行语音识别的过程为,用户开启语音助手软件,获取语音数据; 语音数据送到降噪模块进行降噪处理;降噪处理后的语音数据送给语音识别引 擎; 语音识别引擎返回识别结果给语音助手; 语音助手为减少误判, 根据置信 度阔值判断识别结果的正确性, 然后呈现。 目前, 语音助手类软件通常是在办公室等安静环境下使用效果相对较好, 但在噪声环境下 (如: 车载环境下) 的使用效果不佳; 业界普遍采用软件降噪 的方法来提升语音识别率, 但提升效果并不明显, 有时甚至会降低识别率。 发明内容 本技术方案提供一种语音识别的方法和装置, 用以提升语音识别率, 同时 提升用户感受。 第一方面, 提供一种语音识别的方法: 所述方法包括: 获取语音数据; 根 据所述语音数据, 获取置信度值; 根据所述语音数据, 获取噪声场景; 获取与 所述噪声场景对应的置信度阔值;如果所述置信度值大于或者等于所述置信度 阔值, 则处理所述语音数据。
结合第一方面, 在第一方面的第一种可能的实现方式中, 所述噪声场景具 体包括: 噪声类型; 噪声大小。 结合第一方面的第一种可能的实现方式,在第一方面的第二种可能的实现 方式中, 所述噪声场景包括噪声类型, 所述根据语音数据获取噪声场景, 具体 包括: 根据所述语音数据, 获取所述语音数据中的噪声的频率倒谱系数; 根据 所述噪声的频率倒谱系数和预先建立的噪声类型模型,获取所述语音数据的噪 声类型。
结合第一方面的第二种可能的实现方式,在第一方面的第三种可能的实现 方式中, 所述噪声类型模型的建立方法具体包括: 获取噪声数据; 根据所述噪 声数据, 获取所述噪声数据的频率倒谱系数; 根据 EM算法处理所述频率倒谱 系数, 建立所述噪声类型模型。
结合第一方面的第三种可能的实现方式或者第一方面的第二种可能的实 现方式, 在第一方面的第四种可能的实现方式中, 所述噪声类型模型是, 高斯 混合模型。
结合第一方面的第一种可能的实现方式,在第一方面的第五种可能的实现 方式中, 所述噪声场景包括噪声大小, 所述根据语音数据获取噪声场景, 具体 包括:根据所述语音数据,获取所述语音数据的特征参数;根据所述特征参数, 进行语音激活检测; 根据所述语音激活检测的结果, 获取所述噪声大小。 结合第一方面的第一种可能的实现方式或者第一方面的第二种可能的实 现方式或者第一方面的第三种可能的实现方式或者第一方面的第四种可能的 实现方式或者第一方面的第五种可能的实现方式或者,在第一方面的第六种可 能的实现方式中, 所述噪声大小具体包括: 信噪比; 噪声能量水平。 结合第一方面或者第一方面的第一种可能的实现方式或者第一方面的第 二种可能的实现方式或者第一方面的第三种可能的实现方式或者第一方面的 第四种可能的实现方式或者第一方面的第五种可能的实现方式或者第一方面 的第六种可能的实现方式或者, 在第一方面的第七种可能的实现方式中, 所述 获取与所述噪声场景对应的置信度阔值, 具体包括: 根据预先存储的置信度阔 值经验数据和所述噪声场景的对应关系,获取与所述噪声场景对应的置信度阔 值。 结合第一方面或者第一方面的第一种可能的实现方式或者第一方面的第 二种可能的实现方式或者第一方面的第三种可能的实现方式或者第一方面的 第四种可能的实现方式或者第一方面的第五种可能的实现方式或者第一方面 的第六种可能的实现方式或者第一方面的第七种可能的实现方式或者,在第一 方面的第八种可能的实现方式中,如果所述置信度值小于所述置信度阔值, 则 提示用户。
第二方面, 提供一种语音识别装置, 其特征在于, 所述装置包括: 获取单 元, 用于获取语音数据; 置信度值单元, 用于接收所述获取单元获取的所述语 音数据, 并根据所述语音数据获取置信度值; 噪声场景单元, 用于接收所述获 取单元获取的所述语音数据, 并根据所述语音数据获取噪声场景;置信度阔值 单元, 用于接收所述噪声场景单元的所述噪声场景, 并获取与所述噪声场景对 应的置信度阔值;处理单元, 用于接收所述置信度值单元获取的所述置信度值 和所述置信度阔值单元获取的所述置信度阔值,如果所述置信度值大于或者等 于所述置信度阈值, 则处理所述语音数据。
结合第二方面,在第二方面的第一种可能的实现方式中,所述装置还包括: 建模单元, 用于获取噪声数据, 根据所述噪声数据, 获取所述噪声数据的频率 倒谱系数, 根据 EM算法处理所述频率倒谱系数, 建立噪声类型模型。
结合第二方面的第一种可能的实现方式,在第二方面的第二种可能的实现 方式中, 所述噪声场景单元具体包括: 噪声类型单元, 用于根据所述获取单元 的所述语音数据, 获取所述语音数据中的噪声的频率倒谱系数, 根据所述噪声 的频率倒谱系数和所述建模单元的所述噪声类型模型,获取所述语音数据的噪 声类型。
结合第二方面或者第二方面的第一种可能的实现方式或者第二方面的第 二种可能的实现方式, 在第二方面的第三种可能的实现方式中, 所述噪声场景 单元还包括: 噪声大小单元, 用于根据所述获取单元的语音数据, 获取所述语 音数据的特征参数, 根据所述特征参数, 进行语音激活检测, 根据所述语音激 活检测的结果, 获取所述噪声大小。 结合第二方面或者第二方面的第一种可能的实现方式或者第二方面的第 二种可能的实现方式或者第二方面的第三种可能的实现方式,在第二方面的第 四种可能的实现方式中, 所述装置还包括: 存储单元, 用于存储的置信度阔值 经验数据。 结合者第二方面的第四种可能的实现方式,在第二方面的第五种可能的实 现方式中, 所述置信度阔值单元, 具体用于, 根据所述存储单元预先存储的置 信度阔值经验数据和所述噪声场景的对应关系,获取与所述噪声场景对应的置 信度阔值。 第三方面, 提供移动终端, 包括处理器、 麦克风, 其特征在于, 所述麦克 风, 用于获取语音数据; 所述处理器, 用于根据所述语音数据获取置信度值和 噪声场景, 根据所述噪声场景, 获取与所述噪声场景对应的置信度阔值, 如果 所述置信度值大于或者等于所述置信度阔值, 则处理所述语音数据。
结合第三方面,在第二方面的第一种可能的实现方式中所述移动终端还包 括: 存储器, 用于存储置信度阔值经验数据。
结合第三方面的第一种可能的实现方式,在第三方面的第二种可能的实现 方式中, 所述处理器具体用于, 根据所述语音数据获取置信度值和噪声场景; 根据所述存储器预先存储的置信度阔值经验数据和所述噪声场景的对应关系, 获取与所述噪声场景对应的置信度阔值;如果所述置信度值大于或者等于所述 置信度阔值, 则处理所述语音数据。
本发明技术方案提供了一种语音识别的方法以及装置,该方法和装置, 通 过获取噪声场景, 并根据预先存储的置信度阔值经验数据和所述噪声场景, 获 取置信度阔值值。 这种根据噪声场景, 灵活调整置信度阔值的方法和装置, 大 大提升了噪声环境下的语音识别率。
附图说明 为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施 例中所需要使用的附图作一简单地介绍, 显而易见地, 下面描述中的附图是本 发明的一些实施例, 对于本领域普通技术人员来讲, 在不付出创造性劳动性的 前提下, 还可以根据这些附图获取其他的附图。 图 1为本发明实施例 1提供的一种语音识别的方法流程图; 图 2为本发明实施例 1提供的一种语音识别的方法的另一种实现方式的流 程图; 图 3为本发明实施例 2提供的一种语音识别的方法的另一种实现方式的流 程图; 图 4为本发明实施例 2提供的一种语音识别的方法的另一种实现方式的流 程图;
图 5为本发明实施例 4提供的一种语音识别装置的结构示意图;
图 6为本发明实施例 4提供的一种语音识别装置的另一种可能的结构示意 图;
图 7为本发明实施例 4提供的一种语音识别装置的另一种可能的结构示意 图;
图 8为本发明实施例 4提供的一种语音识别装置的另一种可能的结构示意 图;
图 9为本发明实施例 5提供的一种移动终端的结构示意图;
图 10为本发明实施例 5提供的一种移动终端的另一种可能的结构示意图; 图 11为本发明实施例提供的手机的结构示意图。 具体实施方式 为使本发明实施例的目的、技术方案和优点更加清楚, 下面将结合本发明 实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然, 所描述的实施例是本发明一部分实施例, 而不是全部的实施例。基于本发明中 的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获取的所有其 他实施例, 都属于本发明实施例保护的范围。 在本发明实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨 在限制本发明。在本发明实施例和所附权利要求书中所使用的单数形式的 "一 种"、 "所述"和 "该"也旨在包括多数形式, 除非上下文清楚地表示其他含 义。 还应当理解, 本文中使用的术语 "和 /或"是指并包含一个或多个相关联 的列出项目的任何或所有可能组合。进一步应当理解, 本文中采用的术语"包 括"规定了所述的特征、 整体、 步骤、 操作、 元件和 /或部件的存在, 而不排 除一个或多个其他特征、 整体、 步骤、 操作、 元件、 部件和 /或它们的组的存 在或附加。 在本发明实施例中, 装置包括但不限于手机、 个人数字助理 (Personal Digital Assistant, PDA) 、 平板电脑、 便携设备 (例如, 便携式计算机) 车载 设备, ATM机 (Automatic Teller Machine, 自动拒员机) 等设备, 本发明实施 例并不限定。
实施例 1 图 1为本发明实施例 1提供的一种语音识别的方法流程图。 如图 1所示, 本发明实施例 1提供一种语音识别的方法具体可以包括:
5100, 获取语音数据;
用户开启装置上的语音助手等语音识别类软件,通过麦克风获取用户输入 的语音数据。 应当理解的是, 所述语音数据也可以不是用户输入的, 也可以是 机器输入的, 包括任何包含信息的数据。
5101, 根据所述语音数据, 获取置信度值。 该置信度是指特定个体对待特定命题真实性相信的程度。在本发明实施例 中, 是装置等对该语音数据识别结果的真实性相信的程度。 即, 该置信度值用 来表示语音识别结果的可信程度的数值。举例来说,用户输入的语音数据为"给 张三打电话", 则在该语音数据识别过程中, 返回的置信度值包含: 句置信度 N1 ( "给张三打电话" 的总体置信度) , 前置命令词置信度 N2 ( "给" 为 前置命令词, 即 "给"的置信度值为 N2) , 人名置信度 N3 ( "张三"为人名, 即 "张三"的置信度值为 N3) , 后置命令词置信度 N4( "打电话" 为后置命令 词, 即 "打电话" 的置信度为 N4)。 通常, 句置信度 N1是由 N2、 N3、 N4综合 得到的。 在某次实验中, 经测试得到, 用户输入 "给张三打电话"该语音数 据的置信度值分别为 Nl=62, N2=50, N3=48, N4=80。
应当理解的是, 所述步骤 S102可以在步骤 S103之前, 所述步骤 S102可以 在步骤 S103之后, 或者所述步骤 S102可以和步骤 S103同时执行, 本发明实施 例对此不做限制
S102, 根据所述语音数据, 获取噪声场景;
根据用户输入的语音数据, 获取噪声场景。所述噪声场景是用户输入语音 数据时所处的噪声状态。 即可以理解为, 用户是在马路上的噪声环境, 还是在 办公室的噪声环境或者是在车载的噪声环境中输入该语音数据,以及用户所处 的相应环境中噪声是大还是小。
应当理解的是, 所述步骤 S102可以在步骤 S101之前, 所述步骤 S102也可 以在步骤 S101之后, 或者所述步骤 S102可以和步骤 S101同时执行, 本发明实 施例对此不做限制
S103, 获取与所述噪声场景对应的置信度阔值。 该置信度阔值作为置信度值是否可接受的评价指标,如置信度值大于此置 信度阔值, 则认为识别结果正确, 如果置信度值小于此置信度阔值, 则认为识 别结果错误, 结果是不可相信的。在获取该语音数据所处的噪声场景之后, 可 以通过根据所述噪声场景, 获取所述噪声场景对应的置信度阔值。
S104,如果所述置信度值大于或者等于所述置信度阔值, 则处理所述语音 数据;
如果所述置信度值大于或者等于所述置信度阔值,则认为该语音数据识别 的结果是正确的, 即处理相应的语音数据。举例来说, 如步骤 S101中获取的置 信度值 N3=48, 步骤 S103中获取的置信度阔值 =40, 则所述置信度值大于所述 置信度阔值, 该语音数据识别结果是正确的。进一步举例说明, 当该语音数据 是 "打电话给张三" "发短信给张三" "打开应用程序"等包含命令词的语音 数据时, 该语音识别属于命令词识别, 则所述装置执行相应命令, 如打电话、 发短信、 打开应用程序等操作。如果该语音数据识别属于文本听写识别, 则显 示识别结果文本。即如果所述置信度值大于或者等于所述置信度阔值, 则处理 所述语音数据。
本发明技术方案提供了一种语音识别的方法,该方法通过获取噪声场景 并根据预先存储的置信度阔值经验数据和所述噪声场景, 获取置信度阔值值。 这种根据噪声场景, 灵活调整置信度阔值的方法, 大大提升了噪声环境下的语 音识别率。
可选的,
图 2为本发明实施例 1提供的一种语音识别的方法的另一种实现方式的流 程图。 如图 2所示, 所述方法还包括:
S1041 , 如果所述置信度值小于所述置信度阔值, 则提示用户。
如果所述置信度值小于所述置信度阔值,则认为该语音数据识别结果是错 误的,则提示用户。举例来说,如步骤 S101中获取的置信度值 N3=48,步骤 S103 中获取的置信度阈值 =50, 则所述置信度值小于所述置信度阈值, 所述语音数 据识别结果是错误的。进一步举例说明, 当该语音数据是"给张三打电话 "时, 则装置判断该语音数据的识别结果错误, 系统提示用户重新说一遍和 /或者告 知用户错误。 即, 如果所述置信度值小于所述置信度阔值, 则提示用户重新输 入或者纠正错误等。
本发明技术方案提供了一种语音识别的方法,该方法通过获取噪声场景, 并根据预先存储的置信度阔值经验数据和所述噪声场景, 获取置信度阔值值。 这种根据噪声场景, 灵活调整置信度阔值的方法, 大大提升了噪声环境下的语 音识别率。
实施例 2 图 3为本发明实施例 2提供的一种语音识别的方法的另一种实现方式的流 程图。 本发明实施例 2是在本发明实施例 1的基础之上进行描述的。 如图 3所示, 在实施例 1中的步骤 S102中, 所述噪声场景具体包括: 噪声类型; 噪声大小。
该噪声类型是指用户输入语音数据时所处的噪声环境,即可以理解为用户 是在马路上的噪声环境, 还是在办公室的噪声环境或者是在车载的噪声环境。 该噪声大小表示用户输入语音数据该时所处噪声环境中噪声的大小。可选 的, 该噪声大小包括: 信噪比和噪声能量水平。该信噪比是语音数据与噪声数 据功率的比值, 常常用分贝数表示, 一般信噪比越高表明噪声数据功率越小, 否则则相反。该噪声能量水平是用来反应用户语音数据中噪声数据能量的大小 信噪比和噪声能量水平结合起来, 表示该噪声大小。 所述噪声场景包括噪声类型, 在实施例 1中的步骤 S102, 所述根据语音数 据获取噪声场景, 具体包括:
51021 ,根据所述语音数据,获取所述语音数据中的噪声的频率倒谱系数; 根据用户输入的语音数据, 通过语音激活检测 (Voice activity detection ,
VAD) 判断语音数据帧和噪声数据帧, 在获取噪声数据帧之后, 获取该噪声 数据帧的频率倒谱系数。 Mel (美尔) 是主观音高的单位, 而 Hz (赫兹) 则是 客观音高的单位, Mel频率是基于人耳听觉特性提出的, 它与 Hz频率成非线性 对应关系。 频率倒谱系数 (Mel Frequency Cepstrum Coefficient, MFCC) 是 Mel频率上的倒谱系数, 具有良好的识别性能, 被广泛应用于语音识别、 声紋 识别、 语种识别等领域。
51022, 根据所述噪声的频率倒谱系数和预先建立的噪声类型模型, 获取 所述语音数据的噪声类型。
将该频率倒谱系数分别代入预先建立的每一个噪声类型模型中进行计算, 如果某一噪声类型模型的计算结果值最大,则认为用户输入该语音数据时处于 该噪声类型的环境中, 即获取该语音数据的噪声类型。
在步骤 S1022中的该预先建立的噪声类型模型是高斯混合模型。 高斯密度函数估计是一种参数化模型, 有单高斯模型 (Single GaussianModel, SGM) 和高斯混合模型 (Gaussian mixture model, GMM) 两 类。 高斯模型是一种有效的聚类模型, 它根据高斯概率密度函数参数的不同, 每一个已经建立的高斯模型可以看作一种类别, 输入一个样本 X , 即可通过高 斯概率密度函数计算其值,然后通过一个阔值来判断该样本是否属于已经建立 的该高斯模型。 由于 GMM具有多个模型, 划分更为精细, 适用于复杂对象的 划分, 广泛应用于复杂对象建模, 例如语音识别中利用 GMM对不同噪声类型 的分类和建模。
在本发明实施例中, 某一噪声类型的 GMM建立的过程可以是, 输入多组 同一类型噪声数据, 根据所述噪声数据反复训练 GMM模型, 并最终获得该噪 声类型的 GMM。
高斯混合模型可用下式表达:

$$p(x) = \sum_{i=1}^{N} \alpha_i\, N(x; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{N} \alpha_i = 1$$

其中, 高斯模型 $N(x; \mu, \Sigma)$ 可用下式表达:

$$N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \lvert \Sigma \rvert^{1/2}} \exp\!\left( -\frac{1}{2} (x-\mu)^{T} \Sigma^{-1} (x-\mu) \right)$$

其中, N 为 GMM 模型的混合度, 即由 N 个高斯模型组合而成, $\alpha_i$ 为第 i 个高斯模型的权值, $\mu_i$ 为均值, $\Sigma_i$ 为协方差矩阵, d 为特征向量的维数。理论上, 空间中的任意形状都可以使用一个 GMM 模型来建模。由于高斯模型的输出是一个 0~1 之间的小数, 为了便于计算, 一般会对结果取自然对数 (ln), 从而变成小于 0 的浮点数。
在步骤 S1022中的该预先建立的噪声类型模型的建立方法包括: 获取噪声数据。 获取多组同一类型噪声, 如, 车载噪声, 街道噪声, 办公 室噪声等, 的噪声数据。 用于建立该种类型噪声数据的 GMM, 即该种噪声数 据的噪声类型模型。 应当理解的是, 本发明还可以获得其他种类的噪声数据, 并针对每一种类型噪声数据建立相应的噪声类型模型,本发明实施例对此不做 限制。
根据所述噪声数据,获取所述噪声数据的频率倒谱系数。从该噪声数据中, 提取该噪声的频率倒谱系数。 Mel (美尔) 是主观音高的单位, 而 Hz (赫兹) 则是客观音高的单位, Mel频率是基于人耳听觉特性提出的, 它与 Hz频率成非 线性对应关系。 频率倒谱系数 (Mel Frequency Cepstrum Coefficient, MFCC) 是 Mel频率上的倒谱系数, 具有良好的识别性能, 被广泛应用于语音识别、 声 紋识别、 语种识别等领域。
根据 EM算法处理所述频率倒谱系数, 建立所述噪声类型模型。 EM算法 ( Expectation-maximization algorithm, 最大期望算法) 在统计中被用于寻找, 依赖于不可观察的隐性变量的概率模型中, 参数的最大似然估计。在统计计算 中, 最大期望 (EM)算法是在 GMM中寻找参数最大似然估计或者最大后验估 计的算法, 其中 GMM依赖于无法观测的隐藏变量 (Latent Variable) 。
EM算法经过两个步骤交替进行计算: 第一步是计算期望 (E), 估计未知参数的期望值, 给出当前的参数估计; 第二步是最大化 (M), 重新估计分布参数, 以使得数据的似然性最大, 给出未知变量的期望估计。总体来说, EM的算法流程如下: 1, 初始化分布参数; 2, 重复直到收敛。简单说来, EM算法就是: 假设我们要估计 A和 B两个参数, 在开始状态下二者都是未知的, 并且知道了 A的信息就可以得到 B的信息, 反过来知道了 B也就得到了 A。可以考虑首先赋予 A某种初值, 以此得到 B的估计值, 然后从 B的当前值出发, 重新估计 A的取值, 这个过程一直持续到收敛为止。EM算法可以从非完整数据集中对参数进行最大可能性估计, 是一种非常简单实用的学习算法。通过交替使用 E和 M这两个步骤, EM算法逐步改进模型的参数, 使参数和训练样本的似然概率逐渐增大, 最后终止于一个极大点。直观地理解, EM算法也可被看作一个逐次逼近算法: 事先并不知道模型的参数, 可以随机地选择一套参数或者事先粗略地给定某个初始参数, 确定出对应于这组参数的最可能的状态, 计算每个训练样本的可能结果的概率, 在当前的状态下再由样本对参数修正, 重新估计参数, 并在新的参数下重新确定模型的状态; 这样, 通过多次迭代, 循环直至某个收敛条件满足为止, 就可以使得模型的参数逐渐逼近真实参数。将获取的频率倒谱系数代入 EM算法进行训练, 通过训练过程, 获取高斯混合模型中的 N、$\alpha_i$、$\mu_i$、$\Sigma_i$ 等参数, 根据这些参数和 $p(x) = \sum_{i=1}^{N} \alpha_i N(x; \mu_i, \Sigma_i)$ (其中 $\sum_{i=1}^{N} \alpha_i = 1$) 建立高斯混合模型, 即建立该种噪声类型相应的噪声类型模型。同时, x 是频率倒谱系数。举例来说, 在实施例 1中的步骤 S102, 所述根据语音数据获取噪声场景, 具体为:
根据语音数据获取该语音数据噪声帧的频率倒谱系数, 该频率倒谱系数即为高斯混合模型 $p(x) = \sum_{i=1}^{N} \alpha_i N(x; \mu_i, \Sigma_i)$ 中的 x。假设有两个噪声类型模型, 一个是由车载噪声训练得到的车载噪声的噪声类型模型, 另一个是由非车载类噪声(可以包含办公室噪声、街道噪声、超市噪声等)训练得到的非车载噪声的噪声类型模型。假设当前用户输入的语音数据有 10 帧噪声帧, 将每个噪声帧的频率倒谱系数 x 分别代入两个噪声类型模型(其中 N、$\alpha_i$、$\mu_i$、$\Sigma_i$ 等参数为已知), 获取计算结果, 将计算结果取对数并累加平均, 最后结果如下表一所示:

表一:
噪声帧帧数              1     2     3     4     5     6     7     8     9     10    平均
非车载噪声的噪声类型模型  -46.8 -46.6 -45.3 -43.8 -47.8 -50.7 -46.5 -47.7 -46.7 -45.7 -46.8
车载噪声的噪声类型模型    -43.0 -41.9 -41.3 -39.7 -42.1 -47.7 -41.5 -39.6 -43.6 -38.7 -41.9
最终的结果显示,车载噪声的噪声类型模型的计算结果值大于非车载噪声的噪 声类型模型的计算结果值 (即, -41.9>-46.8) , 所以当前语音数据的噪声类型 为车载噪声。
本发明技术方案提供了一种噪声环境下提升语音识别率的方法,该方法通 过获取噪声场景, 并根据预先存储的置信度阔值经验数据和所述噪声场景, 获 取置信度阔值值。 这种根据噪声场景, 灵活调整置信度阔值的方法, 大大提升 了噪声环境下的语音识别率。 可选的,
如图 3所示, 所述噪声场景包括包括噪声大小, 在实施例 1中的步骤 S102, 所述根据语音数据获取噪声场景, 具体包括:
S1023, 根据所述语音数据, 获取所述语音数据的特征参数;
根据该语音数据, 提取该语音数据的特征参数, 所述特征参数包括: 子带 能量、 基音、 周期性因子。
子带能量, 根据语音数据不同频带中有用成分的不同,将 0~8K频带分成 N 个子带, 并分别计算各子带每帧语音的能量。 子带能量计算公式为:
$$Ener = \sum_{i=0}^{L-1} x[i]^{2}$$

其中, L 为帧长, 一帧语音数据为 x[0], x[1], …, x[L-1]。
基音及周期性因子, 反映了语音中的周期性成分。在语音中, 静音段及轻 声段周期性成分很差, 在浊音段, 周期性很好, 基于此点可进行语音帧检测。 51024, 根据所述特征参数, 进行语音激活检测;
根据用户输入的语音数据, 通过语音激活检测 (Voice activity detection , VAD) 判断语音数据帧和噪声数据帧, 将基音及周期性因子与子带能量相结 合, 进行语音帧、 静音帧的判决。
VAD判断主要基于以下两个因素进行语音帧、 噪声帧的判决:
1)语音帧的能量高于噪声帧的能量;
2)周期性强的一般是语音帧。
51025, 根据所述语音激活检测的结果, 获取所述噪声大小。
根据 VAD 判断结果, 对语音帧、噪声帧分别求平均能量, 得到语音能量水平 (speechLev)、噪声能量水平 (noiseLev), 然后计算得到信噪比 (SNR), 其公式为:

$$noiseLev = 10 \log_{10}\!\left( 1 + \frac{1}{L_n} \sum_{i=1}^{L_n} ener[N_i] \right)$$

$$speechLev = 10 \log_{10}\!\left( 1 + \frac{1}{L_s} \sum_{j=1}^{L_s} ener[S_j] \right)$$

$$SNR = speechLev - noiseLev$$

其中, $L_n$、$L_s$ 分别表示噪声帧、语音帧总帧数, $ener[N_i]$ 表示第 i 个噪声帧的能量, $ener[S_j]$ 表示第 j 个语音帧的能量。本发明技术方案提供了一种噪声环境下提升语音识别率的方法, 该方法通过获取噪声场景, 并根据预先存储的置信度阈值经验数据和所述噪声场景, 获取置信度阈值。这种根据噪声场景, 灵活调整置信度阈值的方法, 大大提升了噪声环境下的语音识别率。
实施例 3 图 4为本发明实施例 3提供的一种语音识别的方法的另一种实现方式的流 程图。 本实施例是在实施例 1的基础上描述的, 如图 4所示, 实施例 1的步骤 S103 方法具体包括: S1031 ,根据预先存储的置信度阔值经验数据和所述噪声场景的对应关系, 获取与所述噪声场景对应的置信度阔值。 在获取该语音数据所处的噪声场景之后,可以根据预先存储的置信度阔值 经验数据和所述噪声场景的对应关系,获取与所述噪声场景对应的置信度阔值。 即, 可以根据噪声场景中的噪声类型, 噪声大小以及经大量仿真测量得到的置 信度阔值经验数据的对应关系, 获取置信度阔值。该噪声类型表明用户进行语 音识别时所处的环境类型, 该噪声大小表明用户所处的环境类型的噪声大小。 其中, 置信度阔值的获取原则是, 结合噪声类型, 当噪声偏大时, 置信度阔值 选择偏低; 结合噪声类型, 噪声偏小时, 置信度阔值设置偏大。 具体的置信度 阔值经验数据通过仿真测量统计得到。 举例来说,
在噪声类型为车载环境,噪声偏大时 (即,噪声水平小于 -30dB, 信噪比小 于 10dB), 通过仿真测量统计得到此种噪声场景中, 置信度阔值经验数据为 35~50。 因此, 该噪声场景中, 获取置信度阔值为 35至 50中的某一值。
在噪声类型为车载环境, 噪声偏小时 (噪声水平大于 -30dB小于 -40dB, 信 噪比大于 10dB小于 20dB), 通过仿真测量统计得到此种噪声场景中, 置信度阔 值经验数据为 40~55。 因此, 该噪声场景中, 获取置信度阔值为 40至 55中的某 一值。
在噪声类型为办公室环境, 噪声偏小时 (噪声水平小于 -40dB, 信噪比大于 20dB),通过仿真测量统计得到此种噪声场景中,置信度阔值经验数据为 45~60。 因此, 该噪声场景中, 获取置信度阔值为 45至 60中的某一值。
本发明技术方案提供了一种噪声环境下提升语音识别率的方法,该方法通 过获取噪声场景, 并根据预先存储的置信度阔值经验数据和所述噪声场景, 获 取置信度阔值值。 这种根据噪声场景, 灵活调整置信度阔值的方法, 大大提升 了噪声环境下的语音识别率。
实施例 4 图 5为本发明实施例 4提供的一种语音识别装置的结构示意图。 如图 5所示, 所述装置包括: 获取单元 300, 用于获取语音数据;
置信度值单元 301, 用于接收所述获取单元 300获取的所述语音数据, 并根 据所述语音数据获取置信度值;
噪声场景单元 302, 用于接收所述获取单元 300的获取的所述语音数据, 并 根据所述语音数据获取噪声场景;
置信度阔值单元 303, 用于接收所述噪声场景单元 302的所述噪声场景, 并 获取与所述噪声场景对应的置信度阔值; 处理单元 304,用于接收所述置信度值单元 301获取的所述置信度值和所述 置信度阔值单元 303获取的所述置信度阔值, 如果所述置信度值大于或者等于 所述置信度阈值, 则处理所述语音数据。
该获取单元 300获取语音数据; 置信度值单元 301接收所述获取单元 300获 取的所述语音数据, 并根据所述语音数据获取置信度值; 噪声场景单元 302接 收所述获取单元 300的获取的所述语音数据, 并根据所述语音数据获取噪声场 景, 所述噪声场景包括, 噪声类型、 信噪比和噪声能量水平; 置信度阔值单元 303接收所述噪声场景单元 302的所述噪声场景,并获取与所述噪声场景对应的 置信度阈值;处理单元 304接收所述置信度值单元 301获取的所述置信度值和所 述置信度阔值单元 303获取的所述置信度阔值, 如果所述置信度值大于或者等 于所述置信度阈值, 则处理所述语音数据。
其中, 获取单元 300、 置信度值单元 301、 噪声场景单元 302、 置信度阔值 单元 303、处理单元 304,可以用于执行实施例 1中步骤 S100、 S101、 S102、 S103、 S104所述的方法, 具体描述详见实施例 1对所述方法的描述, 在此不再赘述。
本发明技术方案提供了一种语音识别装置,该装置通过获取噪声场景, 并 根据预先存储的置信度阈值经验数据和所述噪声场景, 获取置信度阈值值。这 种根据噪声场景, 灵活调整置信度阔值的装置, 大大提升了噪声环境下的语音 识别率。 可选的,
图 6为本发明实施例 4提供的一种语音识别装置的另一种可能的结构示意 图。 如图 6所示, 所述装置还包括: 建模单元 305, 用于获取噪声数据, 根据所述噪声数据, 获取所述噪声数 据的频率倒谱系数,根据 EM算法处理所述频率倒谱系数,建立噪声类型模型。 其中, 建模单元 305, 可以用于执行实施例 2中在步骤 S1022中的预先建立 的噪声类型模型的方法, 具体描述详见实施例 2对所述方法的描述, 在此不再 赘述。
本发明技术方案提供了一种语音识别装置,该装置通过获取噪声场景, 并 根据预先存储的置信度阈值经验数据和所述噪声场景, 获取置信度阈值值。这 种根据噪声场景, 灵活调整置信度阔值的装置, 大大提升了噪声环境下的语音 识别率。 可选的, 图 7为本发明实施例 4提供的一种语音识别装置的另一种可能的结构示意 图。 如图 7所示, 噪声场景单元具体包括: 噪声类型单元 3021, 用于根据所述获取单元的所述语音数据, 获取所述语 音数据中的噪声的频率倒谱系数,根据所述噪声的频率倒谱系数和所述建模单 元的所述噪声类型模型, 获取所述语音数据的噪声类型。
其中, 噪声类型单元 3021, 可以用于执行实施例 2中在步骤 S1021、 S1022 中所述的方法, 具体描述详见实施例 2对所述方法的描述, 在此不再赘述。 噪声大小单元 3022, 用于根据所述获取单元的语音数据, 获取所述语音数 据的特征参数, 根据所述特征参数, 进行语音激活检测; 根据所述语音激活检 测的结果, 获取所述噪声大小。
其中,噪声大小单元 3022,可以用于执行实施例 2中在步骤 S1023、 S1024、 S1025中所述的方法,具体描述详见实施例 2对所述方法的描述,在此不再赘述。
本发明技术方案提供了一种语音识别装置,该装置通过获取噪声场景, 并 根据预先存储的置信度阈值经验数据和所述噪声场景, 获取置信度阈值值。这 种根据噪声场景, 灵活调整置信度阔值的装置, 大大提升了噪声环境下的语音 识别率。 可选的,
图 8为本发明实施例 4提供的一种语音识别装置的另一种可能的结构示意 图。 如图 8所示, 所述装置还包括:
存储单元 306, 用于存储的置信度阔值经验数据。
所述置信度阔值单元 303, 具体用于, 根据所述存储单元 306预先存储的置 信度阔值经验数据和所述噪声场景的对应关系,获取与所述噪声场景对应的置 信度阔值。
其中, 置信度阔值单元 303, 可以用于执行实施例 3中在步骤 S1031中所述 的方法, 具体描述详见实施例 3对所述方法的描述, 在此不再赘述。
本发明技术方案提供了一种语音识别装置,该装置通过获取噪声场景, 并 根据预先存储的置信度阈值经验数据和所述噪声场景, 获取置信度阈值值。这 种根据噪声场景, 灵活调整置信度阔值的装置, 大大提升了噪声环境下的语音 识别率。
实施例 5 图 9为本发明实施例 5提供的一种移动终端的结构示意图。 如图 9所示, 该移动终端, 包括处理器、 麦克风, 其特征在于,
所述麦克风 501, 用于获取语音数据;
所述处理器 502, 用于根据所述语音数据获取置信度值和噪声场景, 根据 所述噪声场景, 获取与所述噪声场景对应的置信度阔值,如果所述置信度值大 于或者等于所述置信度阔值, 则处理所述语音数据。
其中,所述麦克风 501、所述处理器 502,可以用于执行实施例 1中步骤 S100、 S101、 S102、 S103、 S104所述的方法, 具体描述详见实施例 1对所述方法的描 述, 在此不再赘述。
本发明技术方案提供了一种移动终端,该移动终端通过获取噪声场景, 并 根据预先存储的置信度阈值经验数据和所述噪声场景, 获取置信度阈值值。这 种根据噪声场景, 灵活调整置信度阔值的移动终端, 大大提升了噪声环境下的 语音识别率。
可选的, 如图 10所示, 所述所述移动终端还包括: 存储器 503, 用于存储置信度阔 值经验数据。
所述处理器具体用于, 根据所述语音数据获取置信度值和噪声场景;根据 所述存储器 503预先存储的置信度阔值经验数据和所述噪声场景的对应关系, 获取与所述噪声场景对应的置信度阔值;如果所述置信度值大于或者等于所述 置信度阈值, 则处理所述语音数据。
上述结构可用于执行实施例 1、 实施例 2、 实施例 3中的方法, 具体方法详 见实施例 1、 实施例 2、 实施例 3中所述的方法, 在此不再赘述。 本发明技术方案提供了一种移动终端,该装置通过获取噪声场景, 并根据 预先存储的置信度阔值经验数据和所述噪声场景, 获取置信度阔值值。这种根 据噪声场景, 灵活调整置信度阔值的移动终端, 大大提升了噪声环境下的语音 识别率。
实施例 6 如图 11所示, 本实施例以手机为例对本发明实施例进行具体说明。应该理 解的是, 图示手机仅仅是手机的一个范例, 并且手机可以具有比图中所示出的 更过的或者更少的部件, 可以组合两个或更多的部件, 或者可以具有不同的部 件配置。 图中所示出的各种部件可以在包括一个或多个信号处理和 /或专用集 成电路在内的硬件、 软件、 或硬件和软件的组合中实现。
图 11为本发明实施例提供的手机的结构示意图。如图 11所示手机包括: 触控 屏 41, 存储器 42, CPU43, 电源管理芯片 44, RF电路 45, 外设接口 46, 音频电路 47, 麦克风 48, I/O子系统 49。
所述触控屏 41是手机与用户之间的输入接口和输出接口,除具有获取用户 触摸信息和控制指令的功能外, 还将可视输出呈现给用户, 可视输出可以包括 图形、 文本、 图标、 视频等。 所述存储器 42, 可以用于存储置信度阔值经验数据, 以供 CPU43处理时使 用。 存储器 42可以被 CPU43、 外设接口 46等访问, 所述存储器 42可以包括高速 随机存取存储器,还可以包括非易失性存储器,例如一个或多个磁盘存储器件、 闪存器件、 或其他易失性固态存储器件。 所述 CPU43, 可用于处理音频电路 47和麦克风 48获取的语音数据, 并根据 该语音数据获取噪声场景;根据所述噪声场景和存储器 42预先存储的置信度阔 值经验数据, 获取所述置信度阔值。 CPU43是手机的控制中心, 利用各种接口 和线路连接整个手机的各个部分,通过运行或执行存储在存储器 42内的软件程 序和 /或模块, 以及调用存储在存储器 42内的数据, 执行手机的各种功能和处 理数据, 从而对手机进行整体监控。 可选的, CPU43可包括一个或多个处理单 元; 优选的, CPU43可集成应用处理器和调制解调处理器, 可选的, 应用处理 器主要处理操作系统、 用户界面和应用程序等, 调制解调处理器主要处理无线 通信。 可以理解的是, 上述调制解调处理器也可以不集成到 CPU43中。 还应当 理解, 上述功能只是 CPU43能够执行功能中的一种, 对于其他功能本发明实施 例不做限制。
所述电源管理芯片 44, 可用于为 CPU43、 I/O子系统 49及外设接口 46所连 接的硬件进行供电及电源管理。 所述 RF电路 45, 主要用于建立手机与无线网络 (即网络侧) 的通信, 实 现手机与无线网络的数据获取和发送。例如收发短信息、电子邮件等。具体地, RF电路 45获取并发送 RF信号, RF信号也称为电磁信号, RF电路 45将电信号转 换为电磁信号或^1电磁信号转换为电信号,并且通过该电磁信号与通信网络以 及其他设备进行通信。 RF电路 45可以包括用于执行这些功能的已知电路, 其 包括但不限于天线系统、 RF收发机、 一个或多个放大器、 调谐器、 一个或多 个振荡器、数字信号处理器、 CODEC芯片组、用户标识模块 (Subscriber Identity Module, SIM)等等。 所述外设接口 46,所述外设接口可以将设备的输入和输出外设连接到 CPU 43和存储器 42。
所述音频电路 47, 主要可用于从外设接口 46获取音频数据,将该音频数据 转换为电信号。 所述麦克风 48, 可用于获取语音数据. 所述 I/O子系统 49:所述 I/O子系统 49可以控制设备上的输入输出外设, I/O 子系统 49可以包括显示控制器 491和用于控制其他输入 /控制设备的一个或多 个输入控制器 492。 可选的, 一个或多个输入控制器 492从其他输入 /控制设备 获取电信号或者向其他输入 /控制设备发送电信号, 其他输入 /控制设备可以包 括物理按钮 (按压按钮、 摇臂按钮等) 、 拨号盘、 滑动开关、 操纵杆、 点击滚 轮。值得说明的是,输入控制器 492可以与以下任一个连接:键盘、红外端口、 USB接口以及诸如鼠标的指示设备。所述 I/O子系统 49中的显示控制器 491从触 控屏 41获取电信号或者向触控屏 41发送电信号。触控屏 41获取触控屏上的接触 显示控制器 491将获取到的接触转换为与呈现在触控屏 41上的用户界面对象的 交互, 即实现人机交互, 呈现在触控屏 41上的用户界面对象可以是运行游戏的 图标、 联网到相应网络的图标、 筛选模式等。 值得说明的是, 设备还可以包括 光鼠, 光鼠是不呈现可视输出的触摸敏感表面, 或者是由触控屏形成的触摸敏 感表面的延伸。
麦克风 48获取语音数据, 通过所述外设接口 46和所述音频电路 47将所述语音数据送入 CPU43。CPU43可用于处理所述语音数据, 并根据该语音数据获取噪声场景; 根据所述噪声场景和存储器 42预先存储的置信度阈值经验数据, 获取所述置信度阈值。
上述结构可用于执行实施例 1、 实施例 2、 实施例 3中的方法, 具体方法详 见实施例 1、 实施例 2、 实施例 3中所述的方法, 在此不再赘述。
本发明技术方案提供了一种语音识别的手机,该手机通过获取噪声场景, 并根据预先存储的置信度阔值经验数据和所述噪声场景, 获取置信度阔值值。 这种根据噪声场景, 灵活调整置信度阔值的手机, 大大提升了噪声环境下的语 音识别率。 通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本发 明实施例可以用硬件实现, 或固件实现, 或它们的组合方式来实现。 当使用软 件实现时,可以将上述功能存储在装置可读介质中或作为装置可读介质上的一 个或多个指令或代码进行传输。 装置可读介质包括装置存储介质和通信介质, 可选的通信介质包括便于从一个地方向另一个地方传送装置程序的任何介质。 存储介质可以是装置能够存取的任何可用介质。 以此为例但不限于: 装置可读 介质可以包括 RAM、 ROM, EEPROM、 CD-ROM或其他光盘存储、 磁盘存储 介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式 的期望的程序代码并能够由装置存取的任何其他介质。此外。任何连接可以适 当的成为装置可读介质。例如,如果软件是使用同轴电缆、光纤光缆、双绞线、 数字用户线 (DSL) 或者诸如红外线、 无线电和微波之类的无线技术从网站、 服务器或者其他远程源传输的, 那么同轴电缆、 光纤光缆、 双绞线、 DSL 或 者诸如红外线、无线和微波之类的无线技术包括在所属介质的定影中。如本发 明实施例所使用的, 盘 (Disk) 和碟 (disc) 包括压缩光碟 (CD)、 激光碟、 光碟、 数字通用光碟 (DVD)、 软盘和蓝光光碟, 可选的盘通常磁性的复制数 据, 而碟则用激光来光学的复制数据。上面的组合也应当包括在装置可读介质 的保护范围之内。 总之, 以上所述仅为本发明技术方案的实施例而已, 并非用于限定本发明 的保护范围。 凡在本发明的精神和原则之内, 所作的任何修改、 等同替换、 改 进等, 均应包含在本发明的保护范围之内。

Claims

权 利 要 求 书
1, 一种语音识别方法, 其特征在于, 所述方法包括:
获取语音数据; 根据所述语音数据, 获取置信度值;
根据所述语音数据, 获取噪声场景;
获取与所述噪声场景对应的置信度阔值;
如果所述置信度值大于或者等于所述置信度阔值, 则处理所述语音数据。
2, 根据权利要求 1所述的方法, 其特征在于, 所述噪声场景具体包括: 噪声类型;
噪声大小。
3,根据权利要求 2所述的方法,其特征在于,所述噪声场景包括噪声类型, 所述根据语音数据获取噪声场景, 具体包括:
根据所述语音数据, 获取所述语音数据中的噪声的频率倒谱系数; 根据所述噪声的频率倒谱系数和预先建立的噪声类型模型,获取所述语音 数据的噪声类型。
4, 根据权利要求 3所述的方法, 其特征在于, 所述噪声类型模型的建立方 法具体包括:
获取噪声数据;
根据所述噪声数据, 获取所述噪声数据的频率倒谱系数;
根据 EM算法处理所述频率倒谱系数, 建立所述噪声类型模型。
5,根据权利要求 3或者 4所述的方法,其特征在于,所述噪声类型模型是, 高斯混合模型。
6,根据权利要求 2所述的方法,其特征在于,所述噪声场景包括噪声大小, 所述根据语音数据获取噪声场景, 具体包括:
根据所述语音数据, 获取所述语音数据的特征参数;
根据所述特征参数, 进行语音激活检测; 根据所述语音激活检测的结果, 获取所述噪声大小。
7,根据权利要求 2或 6所述的方法,其特征在于,所述噪声大小具体包括: 信噪比;
噪声能量水平。
8, 根据权利要求 1至 7任一项所述的方法, 其特征在于, 所述获取与所述 噪声场景对应的置信度阔值, 具体包括:
根据预先存储的置信度阔值经验数据和所述噪声场景的对应关系,获取与 所述噪声场景对应的置信度阔值。
9,根据权利要求 1至 8任一项所述的方法,其特征在于,所述方法还包括: 如果所述置信度值小于所述置信度阔值, 则提示用户。
10, 一种语音识别装置, 其特征在于, 所述装置包括: 获取单元, 用于获取语音数据;
置信度值单元, 用于接收所述获取单元获取的所述语音数据, 并根据所述 语音数据获取置信度值;
噪声场景单元, 用于接收所述获取单元获取的所述语音数据, 并根据所述 语音数据获取噪声场景;
置信度阔值单元, 用于接收所述噪声场景单元的所述噪声场景, 并获取与 所述噪声场景对应的置信度阔值;
处理单元,用于接收所述置信度值单元获取的所述置信度值和所述置信度 阔值单元获取的所述置信度阔值,如果所述置信度值大于或者等于所述置信度 阔值, 则处理所述语音数据。
11, 根据权利要求 10所述的装置, 其特征在于, 所述装置还包括: 建模单元, 用于获取噪声数据, 根据所述噪声数据, 获取所述噪声数据的 频率倒谱系数, 根据 EM算法处理所述频率倒谱系数, 建立噪声类型模型。
12, 根据权利要求 11所述的装置, 其特征在于, 所述噪声场景单元具体包 括:
噪声类型单元, 用于根据所述获取单元的所述语音数据, 获取所述语音数 据中的噪声的频率倒谱系数,根据所述噪声的频率倒谱系数和所述建模单元的 所述噪声类型模型, 获取所述语音数据的噪声类型。
13, 根据权利要求 10至 12任一项所述的装置, 其特征在于, 所述噪声场景单元还包括:
噪声大小单元, 用于根据所述获取单元的语音数据, 获取所述语音数据的 特征参数, 根据所述特征参数, 进行语音激活检测, 根据所述语音激活检测的 结果, 获取所述噪声大小。
14. The apparatus according to any one of claims 10 to 13, characterized in that the apparatus further comprises:
a storage unit, configured to store empirical confidence-threshold data.
15. The apparatus according to claim 14, characterized in that the confidence threshold unit is specifically configured to obtain the confidence threshold corresponding to the noise scene according to the correspondence between the empirical confidence-threshold data pre-stored in the storage unit and the noise scene.
16. A mobile terminal, comprising a processor and a microphone, characterized in that:
the microphone is configured to obtain voice data; and
the processor is configured to obtain a confidence value and a noise scene according to the voice data, obtain, according to the noise scene, a confidence threshold corresponding to the noise scene, and process the voice data if the confidence value is greater than or equal to the confidence threshold.
17. The mobile terminal according to claim 16, characterized in that the mobile terminal further comprises:
a memory, configured to store empirical confidence-threshold data.
18. The mobile terminal according to claim 17, characterized in that the processor is specifically configured to:
obtain a confidence value and a noise scene according to the voice data;
obtain the confidence threshold corresponding to the noise scene according to the correspondence between the empirical confidence-threshold data pre-stored in the memory and the noise scene; and
process the voice data if the confidence value is greater than or equal to the confidence threshold.
PCT/CN2013/077498 2013-01-24 2013-06-19 Speech recognition method and apparatus WO2014114048A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310027559.9A CN103971680B (zh) 2013-01-24 2013-01-24 Speech recognition method and apparatus
CN201310027559.9 2013-01-24

Publications (1)

Publication Number Publication Date
WO2014114048A1 true WO2014114048A1 (zh) 2014-07-31

Family

ID=49766854

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/077498 WO2014114048A1 (zh) 2013-01-24 2013-06-19 Speech recognition method and apparatus

Country Status (5)

Country Link
US (1) US9666186B2 (zh)
EP (1) EP2763134B1 (zh)
JP (2) JP6101196B2 (zh)
CN (1) CN103971680B (zh)
WO (1) WO2014114048A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020140376A1 (zh) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Drunk driving detection method, apparatus, device and storage medium based on voiceprint recognition

Families Citing this family (88)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
KR20150104615A (ko) 2013-02-07 2015-09-15 애플 인크. Voice trigger for a digital assistant
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
CN105453026A (zh) 2013-08-06 2016-03-30 苹果公司 Automatically activating intelligent responses based on activity from remote devices
TWI566107B (zh) 2014-05-30 2017-01-11 蘋果公司 Method, non-transitory computer-readable storage medium and electronic device for processing multi-part voice commands
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
KR101619260B1 (ko) * 2014-11-10 2016-05-10 현대자동차 주식회사 In-vehicle speech recognition apparatus and method
CN104952449A (zh) * 2015-01-09 2015-09-30 珠海高凌技术有限公司 Method and apparatus for identifying environmental noise sources
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
CN105161110B (zh) * 2015-08-19 2017-11-17 百度在线网络技术(北京)有限公司 Bluetooth-connection-based speech recognition method, apparatus and system
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
KR102420450B1 (ko) * 2015-09-23 2022-07-14 삼성전자주식회사 Speech recognition apparatus, speech recognition method and computer-readable recording medium
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
CN106971717A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Speech recognition method and apparatus with cooperative processing between a robot and a network server
CN106971715A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Speech recognition apparatus applied to a robot
CN106971714A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 Speech denoising and recognition method and apparatus applied to a robot
WO2017147870A1 (zh) * 2016-03-03 2017-09-08 邱琦 Sound-pickup-based recognition method
WO2017154282A1 (ja) * 2016-03-10 2017-09-14 ソニー株式会社 Speech processing device and speech processing method
US10446143B2 (en) * 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
JP7249094B2 (ja) * 2016-04-04 2023-03-30 住友化学株式会社 Resin, resist composition, and method for producing a resist pattern
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
CN106161795B (zh) * 2016-07-19 2019-03-29 西北工业大学 Keyboard input sensing method based on a mobile phone microphone
CN106384594A (zh) * 2016-11-04 2017-02-08 湖南海翼电子商务股份有限公司 Vehicle-mounted terminal for speech recognition and method therefor
WO2018090252A1 (zh) * 2016-11-16 2018-05-24 深圳达闼科技控股有限公司 Method for recognizing robot voice commands and related robot apparatus
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK201770429A1 (en) 2017-05-12 2018-12-14 Apple Inc. LOW-LATENCY INTELLIGENT AUTOMATED ASSISTANT
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK201770411A1 (en) 2017-05-15 2018-12-20 Apple Inc. MULTI-MODAL INTERFACES
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
US20180336275A1 (en) 2017-05-16 2018-11-22 Apple Inc. Intelligent automated assistant for media exploration
CN109243431A (zh) * 2017-07-04 2019-01-18 阿里巴巴集团控股有限公司 Processing method, control method, recognition method, apparatus therefor, and electronic device
US10706868B2 (en) * 2017-09-06 2020-07-07 Realwear, Inc. Multi-mode noise cancellation for voice detection
CN109672775B (zh) * 2017-10-16 2021-10-29 腾讯科技(北京)有限公司 Method, apparatus and terminal for adjusting wake-up sensitivity
CN108064007A (zh) * 2017-11-07 2018-05-22 苏宁云商集团股份有限公司 Method for enhanced voice recognition for a smart speaker, microcontroller and smart speaker
CN108022596A (zh) * 2017-11-28 2018-05-11 湖南海翼电子商务股份有限公司 Speech signal processing method and vehicle-mounted electronic device
CN108242234B (zh) * 2018-01-10 2020-08-25 腾讯科技(深圳)有限公司 Speech recognition model generation method and device, storage medium, and electronic device
CN108416096B (zh) * 2018-02-01 2022-02-25 北京百度网讯科技有限公司 Artificial-intelligence-based method and apparatus for estimating the signal-to-noise ratio of far-field voice data
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
DK179822B1 (da) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT
CN108766459B (zh) * 2018-06-13 2020-07-17 北京联合大学 Method and system for estimating a target speaker in a multi-speaker mixture
CN108924343A (zh) * 2018-06-19 2018-11-30 Oppo广东移动通信有限公司 Electronic device control method and apparatus, storage medium, and electronic device
CN109003607B (zh) * 2018-07-12 2021-06-01 Oppo广东移动通信有限公司 Speech recognition method and apparatus, storage medium, and electronic device
CN109346071A (zh) * 2018-09-26 2019-02-15 出门问问信息科技有限公司 Wake-up processing method, apparatus and electronic device
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
CN109346099B (zh) * 2018-12-11 2022-02-08 珠海一微半导体股份有限公司 Iterative denoising method and chip based on speech recognition
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
DK201970510A1 (en) 2019-05-31 2021-02-11 Apple Inc Voice identification in digital assistant systems
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. USER ACTIVITY SHORTCUT SUGGESTIONS
US11227599B2 (en) 2019-06-01 2022-01-18 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
CN110930987B (zh) * 2019-12-11 2021-01-08 腾讯科技(深圳)有限公司 Audio processing method, apparatus and storage medium
WO2021147018A1 (en) * 2020-01-22 2021-07-29 Qualcomm Incorporated Electronic device activation based on ambient noise
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11043220B1 (en) 2020-05-11 2021-06-22 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
CN114187904A (zh) * 2020-08-25 2022-03-15 广州华凌制冷设备有限公司 Similarity threshold obtaining method, voice-controlled home appliance, and computer-readable storage medium
US11620999B2 (en) 2020-09-18 2023-04-04 Apple Inc. Reducing device processing of unintended audio
CN114743571A (zh) * 2022-04-08 2022-07-12 北京字节跳动网络技术有限公司 Audio processing method and apparatus, storage medium and electronic device
CN115050366B (zh) * 2022-07-08 2024-05-17 合众新能源汽车股份有限公司 Speech recognition method and apparatus, and computer storage medium
CN115472152B (zh) * 2022-11-01 2023-03-03 北京探境科技有限公司 Voice endpoint detection method and apparatus, computer device, and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1708782A (zh) * 2002-11-02 2005-12-14 皇家飞利浦电子股份有限公司 Method for operating a speech recognition system
CN101051461A (zh) * 2006-04-06 2007-10-10 株式会社东芝 Feature vector compensation apparatus and feature vector compensation method
CN101197130A (zh) * 2006-12-07 2008-06-11 华为技术有限公司 Voice activity detection method and voice activity detector
CN101320559A (zh) * 2007-06-07 2008-12-10 华为技术有限公司 Voice activity detection apparatus and method
US7536301B2 (en) * 2005-01-03 2009-05-19 Aai Corporation System and method for implementing real-time adaptive threshold triggering in acoustic detection systems
CN102693724A (zh) * 2011-03-22 2012-09-26 张燕 Noise classification method based on a neural-network Gaussian mixture model

Family Cites Families (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5970446A (en) * 1997-11-25 1999-10-19 At&T Corp Selective noise/channel/coding models and recognizers for automatic speech recognition
US6434520B1 (en) 1999-04-16 2002-08-13 International Business Machines Corporation System and method for indexing and querying audio archives
JP2001075595A (ja) 1999-09-02 2001-03-23 Honda Motor Co Ltd In-vehicle speech recognition device
US6735562B1 (en) * 2000-06-05 2004-05-11 Motorola, Inc. Method for estimating a confidence measure for a speech recognition system
JP4244514B2 (ja) * 2000-10-23 2009-03-25 セイコーエプソン株式会社 Speech recognition method and speech recognition device
US8023622B2 (en) * 2000-12-21 2011-09-20 Grape Technology Group, Inc. Technique for call context based advertising through an information assistance service
JP2003177781A (ja) 2001-12-12 2003-06-27 Advanced Telecommunication Research Institute International Acoustic model generation device and speech recognition device
JP3826032B2 (ja) 2001-12-28 2006-09-27 株式会社東芝 Speech recognition device, speech recognition method and speech recognition program
JP2003241788A (ja) 2002-02-20 2003-08-29 Ntt Docomo Inc Speech recognition device and speech recognition system
US7502737B2 (en) * 2002-06-24 2009-03-10 Intel Corporation Multi-pass recognition of spoken dialogue
US7103541B2 (en) * 2002-06-27 2006-09-05 Microsoft Corporation Microphone array signal enhancement using mixture models
JP4109063B2 (ja) * 2002-09-18 2008-06-25 パイオニア株式会社 Speech recognition device and speech recognition method
JP2004325897A (ja) * 2003-04-25 2004-11-18 Pioneer Electronic Corp Speech recognition device and speech recognition method
JP4357867B2 (ja) * 2003-04-25 2009-11-04 パイオニア株式会社 Speech recognition device, speech recognition method, speech recognition program, and recording medium on which it is recorded
US8005668B2 (en) * 2004-09-22 2011-08-23 General Motors Llc Adaptive confidence thresholds in telematics system speech recognition
KR100745976B1 (ko) * 2005-01-12 2007-08-06 삼성전자주식회사 Method and apparatus for distinguishing speech from non-speech using an acoustic model
US7949533B2 (en) * 2005-02-04 2011-05-24 Vococollect, Inc. Methods and systems for assessing and improving the performance of a speech recognition system
US8311819B2 (en) * 2005-06-15 2012-11-13 Qnx Software Systems Limited System for detecting speech with background voice estimates and noise estimates
US20070055519A1 (en) * 2005-09-02 2007-03-08 Microsoft Corporation Robust bandwith extension of narrowband signals
JP2008009153A (ja) 2006-06-29 2008-01-17 Xanavi Informatics Corp Spoken dialogue system
US8560316B2 (en) 2006-12-19 2013-10-15 Robert Vogt Confidence levels for speaker recognition
US7881929B2 (en) * 2007-07-25 2011-02-01 General Motors Llc Ambient noise injection for use in speech recognition
US7856353B2 (en) * 2007-08-07 2010-12-21 Nuance Communications, Inc. Method for processing speech signal data with reverberation filtering
US8306817B2 (en) * 2008-01-08 2012-11-06 Microsoft Corporation Speech recognition with non-linear noise reduction on Mel-frequency cepstra
JPWO2010128560A1 (ja) * 2009-05-08 2012-11-01 パイオニア株式会社 Speech recognition device, speech recognition method, and speech recognition program
CN101593522B (zh) 2009-07-08 2011-09-14 清华大学 Full-frequency-domain digital hearing aid method and device
CA2778343A1 (en) * 2009-10-19 2011-04-28 Martin Sehlstedt Method and voice activity detector for a speech encoder
US8632465B1 (en) * 2009-11-03 2014-01-21 Vivaquant Llc Physiological signal denoising
DK2352312T3 (da) * 2009-12-03 2013-10-21 Oticon As Method for dynamic suppression of surrounding acoustic noise when listening to electrical inputs
JP5621783B2 (ja) 2009-12-10 2014-11-12 日本電気株式会社 Speech recognition system, speech recognition method and speech recognition program
US8626498B2 (en) * 2010-02-24 2014-01-07 Qualcomm Incorporated Voice activity detection based on plural voice activity detectors
JPWO2011122522A1 (ja) 2010-03-30 2013-07-08 日本電気株式会社 Sensibility expression word selection system, sensibility expression word selection method, and program
US8234111B2 (en) * 2010-06-14 2012-07-31 Google Inc. Speech and noise models for speech recognition
JP5200080B2 (ja) * 2010-09-29 2013-05-15 日本電信電話株式会社 Speech recognition device, speech recognition method, and program therefor
US8886532B2 (en) 2010-10-27 2014-11-11 Microsoft Corporation Leveraging interaction context to improve recognition confidence scores
EP2678861B1 (en) * 2011-02-22 2018-07-11 Speak With Me, Inc. Hybridized client-server speech recognition
US10418047B2 (en) * 2011-03-14 2019-09-17 Cochlear Limited Sound processing with increased noise suppression
US8731936B2 (en) * 2011-05-26 2014-05-20 Microsoft Corporation Energy-efficient unobtrusive identification of a speaker
JP2013114518A (ja) * 2011-11-29 2013-06-10 Sony Corp Image processing device, image processing method, and program
US20130144618A1 (en) * 2011-12-02 2013-06-06 Liang-Che Sun Methods and electronic devices for speech recognition
US8930187B2 (en) * 2012-01-03 2015-01-06 Nokia Corporation Methods, apparatuses and computer program products for implementing automatic speech recognition and sentiment detection on a device
US20130211832A1 (en) * 2012-02-09 2013-08-15 General Motors Llc Speech signal processing responsive to low noise levels
CN103578468B (zh) 2012-08-01 2017-06-27 联想(北京)有限公司 Method for adjusting a confidence threshold in speech recognition, and electronic device
US8571871B1 (en) * 2012-10-02 2013-10-29 Google Inc. Methods and systems for adaptation of synthetic speech in an environment
CN103065631B (zh) 2013-01-24 2015-07-29 华为终端有限公司 Speech recognition method and apparatus



Also Published As

Publication number Publication date
JP2017058691A (ja) 2017-03-23
CN103971680A (zh) 2014-08-06
US9666186B2 (en) 2017-05-30
CN103971680B (zh) 2018-06-05
JP2014142626A (ja) 2014-08-07
JP6393730B2 (ja) 2018-09-19
EP2763134B1 (en) 2017-01-04
US20140207447A1 (en) 2014-07-24
JP6101196B2 (ja) 2017-03-22
EP2763134A1 (en) 2014-08-06

Similar Documents

Publication Publication Date Title
WO2014114048A1 (zh) Speech recognition method and apparatus
WO2014114049A1 (zh) Speech recognition method and apparatus
CN110310623B (zh) Sample generation method, model training method, apparatus, medium and electronic device
CN112259106B (zh) Voiceprint recognition method, apparatus, storage medium and computer device
JP6350148B2 (ja) Speaker indexing device, speaker indexing method, and computer program for speaker indexing
JP6303971B2 (ja) Speaker change detection device, speaker change detection method, and computer program for speaker change detection
KR101323061B1 (ko) Speaker authentication method and computer-readable medium having computer-executable instructions for performing the method
WO2020181824A1 (zh) Voiceprint recognition method, apparatus, device and computer-readable storage medium
US20160358610A1 (en) Method and device for voiceprint recognition
US8719019B2 (en) Speaker identification
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN113643693A (zh) Acoustic models conditioned on sound features
CN110992940B (zh) Voice interaction method, apparatus, device and computer-readable storage medium
WO2017177629A1 (zh) Far-talking speech recognition method and apparatus
JP6268916B2 (ja) Abnormal conversation detection device, abnormal conversation detection method, and computer program for abnormal conversation detection
WO2019041871A1 (zh) Voice object recognition method and apparatus
CN112509556B (zh) Voice wake-up method and apparatus
CN116741193B (zh) Training method and apparatus for a speech enhancement network, storage medium and computer device
CN115410586A (zh) Audio processing method, apparatus, electronic device and storage medium
JP5626558B2 (ja) Speaker selection device, speaker adaptation model creation device, speaker selection method, and speaker selection program
CN116403567A (zh) Scoring method and apparatus for speech recognition text, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13872307

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13872307

Country of ref document: EP

Kind code of ref document: A1