WO2014114049A1 - Method and apparatus for voice recognition - Google Patents

Method and apparatus for voice recognition

Info

Publication number
WO2014114049A1
WO2014114049A1 (PCT/CN2013/077529, CN2013077529W)
Authority
WO
WIPO (PCT)
Prior art keywords
noise
confidence value
voice data
confidence
data
Prior art date
Application number
PCT/CN2013/077529
Other languages
English (en)
French (fr)
Inventor
蒋洪睿
王细勇
梁俊斌
郑伟军
周均扬
Original Assignee
华为终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为终端有限公司
Publication of WO2014114049A1 publication Critical patent/WO2014114049A1/zh

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/20 — Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/12 — Score normalisation

Definitions

  • the present invention relates to the field of voice processing technologies, and in particular, to a voice recognition method and apparatus.
  • a user generally uses voice assistant software for voice recognition on a terminal device such as a mobile phone.
  • The process of voice recognition using software such as a voice assistant is as follows: the user opens the voice assistant software, which acquires the voice data; the voice data is sent to a noise reduction module for noise reduction processing; and the noise-reduced voice data is sent to the voice recognition engine.
  • The recognition engine returns the recognition result to the voice assistant; the voice assistant checks the correctness of the recognition result against a confidence threshold to reduce false positives, and then presents it.
  • The voice assistant software usually works well in a quiet environment such as an office, but performs poorly in a noisy environment (such as inside a vehicle); the industry generally uses software noise reduction to enhance voice recognition.
  • the present technical solution provides a method and apparatus for voice recognition to improve voice recognition rate and enhance user experience.
  • The first aspect provides a method for voice recognition. The method includes: acquiring voice data; acquiring a first confidence value according to the voice data; acquiring a noise scenario according to the voice data; acquiring, according to the first confidence value, a second confidence value corresponding to the noise scenario; and, if the second confidence value is greater than or equal to a pre-stored confidence threshold, processing the voice data.
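The five claimed steps can be sketched as a minimal decision flow. The threshold, the adjustment table, and all names below are illustrative assumptions, not values from the patent:

```python
# Hypothetical sketch of the claimed decision flow; the threshold and the
# per-scenario adjustment values are made-up examples.

CONFIDENCE_THRESHOLD = 40  # pre-stored confidence threshold (illustrative)

# Assumed empirical adjustment values per noise scenario (noise_type, noise_size).
ADJUSTMENTS = {
    ("vehicle", "large"): 10,
    ("vehicle", "small"): 5,
    ("office", "small"): 0,
}

def recognize(first_confidence: float, noise_type: str, noise_size: str) -> str:
    """Adjust the first confidence value by the noise scenario, then decide."""
    adjustment = ADJUSTMENTS.get((noise_type, noise_size), 0)
    second_confidence = first_confidence + adjustment
    if second_confidence >= CONFIDENCE_THRESHOLD:
        return "process"   # result accepted: execute the command / show the text
    return "prompt_user"   # result rejected: ask the user to repeat

print(recognize(38, "vehicle", "large"))  # 38 + 10 = 48 >= 40 -> "process"
print(recognize(38, "office", "small"))   # 38 + 0 = 38 < 40 -> "prompt_user"
```

Note how the same raw (first) confidence value of 38 is accepted in a loud vehicle but rejected in a quiet office, which is the point of the scenario-dependent adjustment.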
  • the noise scenario specifically includes: a noise type; a noise size.
  • the noise scenario includes a noise type
  • The acquiring a noise scenario according to the voice data includes: acquiring, according to the voice data, frequency cepstral coefficients of the noise in the voice data; and acquiring the noise type of the voice data according to the frequency cepstral coefficients of the noise and a pre-established noise type model.
  • the method for establishing the noise type model includes: acquiring noise data; acquiring, according to the noise data, a frequency cepstrum coefficient of the noise data; processing the frequency cepstral coefficient according to an EM algorithm to establish the noise type model.
  • the noise type model is a Gaussian mixture model.
  • The noise scenario includes a noise size.
  • The acquiring a noise scenario according to the voice data includes: acquiring, according to the voice data, feature parameters of the voice data; performing voice activation detection according to the feature parameters; and acquiring the noise size according to the result of the voice activation detection.
  • the noise size specifically includes: a signal to noise ratio; a noise energy level.
  • With reference to the fifth or sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, the acquiring, according to the first confidence value, a second confidence value corresponding to the noise scenario specifically includes: acquiring a confidence value adjustment value corresponding to the noise scenario according to a correspondence between the noise scenario and pre-stored empirical data of confidence value adjustment values; and adjusting the first confidence value according to the confidence value adjustment value to acquire the second confidence value, where the adjusting includes: increasing, decreasing, and remaining unchanged.
  • the user is prompted if the second confidence value is less than the confidence threshold.
  • The second aspect provides a voice recognition apparatus, where the apparatus includes: an acquiring unit, configured to acquire voice data; a first confidence value unit, configured to receive the voice data acquired by the acquiring unit and acquire a first confidence value according to the voice data; a noise scene unit, configured to receive the voice data acquired by the acquiring unit and acquire a noise scenario according to the voice data; a second confidence value unit, configured to receive the noise scenario of the noise scene unit and the first confidence value of the first confidence value unit, and acquire, according to the first confidence value, a second confidence value corresponding to the noise scenario; and a processing unit, configured to receive the second confidence value acquired by the second confidence value unit and, if the second confidence value is greater than or equal to a pre-stored confidence threshold, process the voice data.
  • The device further includes: a modeling unit, configured to acquire noise data, obtain frequency cepstral coefficients of the noise data according to the noise data, process the frequency cepstral coefficients according to the EM algorithm, and establish a noise type model.
  • The noise scene unit includes: a noise type unit, configured to acquire, according to the voice data of the acquiring unit, frequency cepstral coefficients of the noise in the voice data, and acquire the noise type of the voice data according to the frequency cepstral coefficients of the noise and the noise type model of the modeling unit.
  • The noise scene unit further includes a noise size unit, configured to acquire feature parameters of the voice data according to the voice data of the acquiring unit, perform voice activation detection according to the feature parameters, and acquire the noise size according to the result of the voice activation detection.
  • The apparatus further includes: a storage unit, configured to store a confidence threshold and empirical data of confidence value adjustment values.
  • The second confidence value unit is specifically configured to acquire the second confidence value corresponding to the noise scenario according to the first confidence value, where the adjustment includes: increasing, decreasing, and remaining unchanged.
  • The third aspect provides a mobile terminal, including a processor and a microphone, where the microphone is configured to acquire voice data, and the processor is configured to acquire a first confidence value according to the voice data, acquire a noise scenario according to the voice data, acquire, according to the first confidence value, a second confidence value corresponding to the noise scenario, and, if the second confidence value is greater than or equal to a pre-stored confidence threshold, process the voice data.
  • the mobile terminal further includes: a memory, configured to store empirical data of the confidence value adjustment value and the confidence threshold.
  • The processor is specifically configured to: acquire the first confidence value according to the voice data; acquire the noise scenario according to the voice data; acquire, according to the correspondence between the noise scenario and the empirical data, a confidence value adjustment value corresponding to the noise scenario; adjust the first confidence value according to the confidence value adjustment value to acquire the second confidence value; and, if the second confidence value is greater than or equal to the confidence threshold, process the voice data.
  • The technical solution of the present invention provides a method and apparatus for speech recognition, which acquire a second confidence value by acquiring a noise scene and adjusting the first confidence value according to the noise scene and pre-stored empirical data of confidence value adjustment values.
  • the method and device for flexibly adjusting the confidence value according to the noise scenario greatly improve the speech recognition rate in a noisy environment.
  • FIG. 1 is a flowchart of a method for voice recognition according to Embodiment 1 of the present invention.
  • FIG. 2 is a flow chart of another implementation manner of a method for voice recognition according to Embodiment 1 of the present invention.
  • FIG. 3 is a flow chart of another implementation manner of a method for voice recognition according to Embodiment 2 of the present invention.
  • FIG. 4 is a flow chart of another implementation manner of a method for voice recognition according to Embodiment 2 of the present invention.
  • FIG. 5 is a schematic structural diagram of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • FIG. 6 is a schematic diagram showing another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • FIG. 7 is a schematic diagram showing another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • FIG. 8 is a schematic diagram of another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • FIG. 9 is a schematic structural diagram of a mobile terminal according to Embodiment 5 of the present invention.
  • FIG. 10 is a schematic diagram of another possible structure of a mobile terminal according to Embodiment 5 of the present invention
  • FIG. 11 is a schematic structural diagram of a mobile phone according to an embodiment of the present invention.
  • the device includes, but is not limited to, a mobile phone, a personal digital assistant (PDA), a tablet computer, a portable device (for example, a portable computer), an in-vehicle device, an ATM (Automatic Teller Machine), and the like.
  • FIG. 1 is a flowchart of a method for voice recognition according to Embodiment 1 of the present invention.
  • The method for voice recognition may specifically include: S100: Acquire voice data. The user opens voice recognition software such as a voice assistant on the device, and the device acquires the voice data input by the user through a microphone. It should be understood that the voice data may be input by the user or by a machine, and includes any data containing information. S101: Acquire a first confidence value according to the voice data. A confidence value refers to the extent to which a particular individual believes in the truth of a particular proposition; in the embodiment of the present invention, it is the degree to which the device believes in the correctness of the voice data recognition result.
  • The first confidence value represents the degree of confidence in the speech recognition result. For example, if the voice data input by the user is "calling Zhang San", the recognition may return a sentence confidence N1 (the overall confidence of "calling Zhang San"), a pre-command-word confidence N2 ("Give" is the pre-command word, that is, the confidence value of "Give" is N2), a person-name confidence N3 ("Zhang San" is the person's name, that is, the confidence value of "Zhang San" is N3), and a post-command-word confidence N4 ("call" is the post-command word, that is, the confidence value of "call" is N4).
  • the sentence confidence N1 is obtained by combining N2, N3, and N4.
  • For example, the user inputs "calling Zhang San".
  • first, second, etc. may be used to describe various confidence values in embodiments of the invention, these confidence values should not be limited to these terms. These terms are only used to distinguish confidence values from each other.
  • the first confidence value may also be referred to as a second confidence value without departing from the scope of embodiments of the invention.
  • the second confidence value may also be referred to as a first confidence value.
  • the first confidence value and the second confidence value are both confidence values.
  • The noise scene is the noise state in which the user inputs the voice data, that is, whether the user inputs the voice data in the noise environment of a road, an office, or a vehicle, and whether the noise in that environment is large or small.
  • step S102 may be performed before the step S101, the step S102 may be performed after the step S101, or the step S102 may be performed simultaneously with the step S101, which is not limited by the embodiment of the present invention.
  • a second confidence value corresponding to the noise scenario is obtained according to the acquired first confidence value.
  • the second confidence value is not directly obtained from the voice data input by the user, but is obtained based on the first confidence value.
  • a second confidence value corresponding to the noise scenario may be acquired according to the first confidence value.
  • The pre-stored confidence threshold is used as the criterion for evaluating whether the second confidence value is acceptable. If the second confidence value is greater than or equal to the confidence threshold, the recognition result is considered correct; if the second confidence value is less than the confidence threshold, the recognition result is considered wrong and is discarded.
  • the result of the speech data recognition is considered correct, i.e., the corresponding speech data is processed.
  • For example, if the confidence threshold is 40 and the second confidence value is greater than the confidence threshold, the voice data recognition result is considered correct.
  • If the voice data is voice data including a command word, such as "calling Zhang San", "sending a text message to Zhang San", or "opening an application", the voice recognition belongs to command word recognition, and the device performs the corresponding command, such as making a call, sending a text message, or opening an application. If the voice data recognition belongs to text dictation recognition, the recognized text is displayed. That is, if the second confidence value is greater than or equal to a pre-stored confidence threshold, the voice data is processed.
  • The technical solution of the present invention provides a method for voice recognition, which acquires a second confidence value by acquiring a noise scene and adjusting the first confidence value according to the noise scene and pre-stored empirical data of confidence value adjustment values.
  • This method of flexibly adjusting the confidence value according to the noise scene greatly improves the speech recognition rate in a noisy environment.
  • FIG. 2 is a flow chart of another implementation manner of a method for voice recognition according to Embodiment 1 of the present invention. As shown in FIG. 2, the method further includes:
  • If the second confidence value is less than the confidence threshold, the voice data recognition result is considered erroneous, and the user is prompted.
  • For example, when the voice data is "calling Zhang San" and the device determines that the recognition result of the voice data is incorrect, the system prompts the user to repeat the input and/or informs the user of the error. That is, if the second confidence value is less than the confidence threshold, the user is prompted to re-enter or is informed of the error.
  • The technical solution of the present invention provides a method for voice recognition, which acquires a second confidence value by acquiring a noise scene and adjusting the first confidence value according to the noise scene and pre-stored empirical data of confidence value adjustment values. This method of flexibly adjusting the confidence value according to the noise scene greatly improves the speech recognition rate in a noisy environment.
  • FIG. 3 is a flow chart of another implementation manner of a method for voice recognition according to Embodiment 2 of the present invention.
  • Embodiment 2 of the present invention is described on the basis of Embodiment 1 of the present invention.
  • the noise scenario specifically includes: a noise type; a noise size.
  • The noise type refers to the noise environment in which the user inputs the voice data, that is, whether the user is in the noise environment of a road, an office, or a vehicle.
  • the noise level represents the amount of noise in the noise environment in which the user inputs the voice data.
  • the noise magnitude includes: a signal to noise ratio and a noise energy level.
  • The signal-to-noise ratio is the ratio of the voice signal power to the noise power, often expressed in decibels. The higher the signal-to-noise ratio, the smaller the noise power, and vice versa.
  • the noise energy level is used to reflect the amount of noise data energy in the user's speech data. The signal-to-noise ratio is combined with the noise energy level to represent the noise level.
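As a concrete illustration of the signal-to-noise ratio described above, a minimal sketch, assuming power is taken as the mean of squared samples:

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_speech / p_noise)

# A signal with 10x the noise amplitude has 100x the power -> 20 dB.
speech = np.ones(1000)
noise = 0.1 * np.ones(1000)
print(round(snr_db(speech, noise)))  # 20
```

A higher value means the noise power is smaller relative to the speech, matching the statement above.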
  • the noise scenario includes a noise type.
  • the acquiring a noise scenario according to the voice data includes:
  • S1021: Acquire, according to the voice data, frequency cepstral coefficients of the noise in the voice data. According to the voice data input by the user, voice activity detection (VAD) is used to distinguish voice data frames from noise data frames; after the noise data frames are acquired, their frequency cepstral coefficients are computed.
  • Mel is the unit of subjective pitch, while Hz is the unit of objective pitch.
  • the Mel frequency is based on the auditory characteristics of the human ear, which is nonlinearly related to the Hz frequency.
  • The Mel-Frequency Cepstral Coefficient (MFCC) is the cepstral coefficient on the Mel frequency scale; it has good recognition performance and is widely used in speech recognition, voiceprint recognition, and language recognition.
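A highly simplified single-frame MFCC computation can be sketched as follows. The frame length, filterbank size, and coefficient count are illustrative defaults, not values from the patent, and real implementations add pre-emphasis, liftering, and other refinements:

```python
import numpy as np

def hz_to_mel(f):
    """Map Hz to the (nonlinear) Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=8000, n_filters=26, n_coeffs=13, n_fft=256):
    """Very simplified MFCC of a single frame: window, power spectrum,
    triangular Mel filterbank, log, then DCT-II."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular Mel filterbank between 0 Hz and sr/2, equally spaced in Mel.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fbank[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[i, k] = (hi - k) / max(hi - mid, 1)
    log_energies = np.log(fbank @ spectrum + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstral coefficients.
    n = np.arange(n_filters)
    coeffs = [np.sum(log_energies * np.cos(np.pi * q * (2 * n + 1) / (2 * n_filters)))
              for q in range(n_coeffs)]
    return np.array(coeffs)

frame = np.sin(2 * np.pi * 440 * np.arange(200) / 8000)  # a 440 Hz tone frame
print(mfcc(frame).shape)  # (13,)
```

In the patent's scheme, such coefficients would be computed for each noise frame and then scored against the noise type models.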
  • S1022 Acquire a noise type of the voice data according to a frequency cepstrum coefficient of the noise and a pre-established noise type model.
  • The frequency cepstral coefficients are substituted into each of the pre-established noise type models for calculation. If the calculation result of a certain noise type model is the largest, it is considered that the user input the voice data in an environment of that noise type; that is, the noise type of the voice data is obtained.
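The model-selection step above (score each noise type's GMM and keep the best) might be sketched like this, with made-up one-dimensional model parameters standing in for trained models:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of a scalar sample x under a 1-D Gaussian mixture."""
    densities = weights * np.exp(-0.5 * (x - means) ** 2 / variances) \
                / np.sqrt(2 * np.pi * variances)
    return np.log(np.sum(densities))

# Illustrative (made-up) 1-D models for two noise types.
models = {
    "vehicle":     dict(weights=np.array([0.6, 0.4]), means=np.array([0.0, 2.0]),
                        variances=np.array([1.0, 1.0])),
    "non_vehicle": dict(weights=np.array([0.5, 0.5]), means=np.array([5.0, 8.0]),
                        variances=np.array([1.0, 2.0])),
}

def classify(frames):
    """Sum per-frame log-likelihoods and pick the best-scoring noise type."""
    scores = {name: sum(gmm_log_likelihood(x, **m) for x in frames)
              for name, m in models.items()}
    return max(scores, key=scores.get)

noise_frames = [0.3, 1.8, 0.1, 2.2]  # toy stand-ins for per-frame cepstral values
print(classify(noise_frames))  # "vehicle"
```

The real system would use multidimensional MFCC vectors and trained GMM parameters, but the decision rule (largest total log-likelihood wins) is the same.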
  • the pre-established noise type model in step S1022 is a Gaussian mixture model.
  • Gaussian density function estimation is a parametric model, which has two types: Single Gaussian Model (SGM) and Gaussian mixture model (GMM).
  • SGM Single Gaussian Model
  • GMM Gaussian mixture model
  • A Gaussian model is an effective clustering model. According to the Gaussian probability density function parameters, each established Gaussian model can be regarded as a category: input a sample x, calculate its value under the Gaussian probability density function, and use a threshold to determine whether the sample belongs to the established model. Because a GMM contains multiple component models, its partitioning is finer and suitable for modeling complex objects; GMMs are therefore widely used in complex-object modeling, for example to classify and model different noise types in speech recognition.
  • the process of establishing a GMM of a certain noise type may be: inputting multiple sets of the same type of noise data, repeatedly training the GMM model according to the noise data, and finally obtaining the GMM of the noise type.
  • The Gaussian mixture model can be expressed by the following formula: p(x) = Σ_{i=1}^{N} ω_i · N(x; μ_i, Σ_i), where N is the mixture order of the GMM (that is, a combination of N Gaussian models), ω_i is the weight of the i-th Gaussian model, μ_i is its mean, and Σ_i is its covariance matrix.
  • Any shape in the feature space can be modeled with a GMM. Since the output of a Gaussian model is a decimal between 0 and 1, the result is usually converted to its natural logarithm (ln) to facilitate calculation, which makes it a floating-point number less than 0.
  • The method of establishing the pre-established noise type model in step S1022 includes: acquiring noise data, that is, multiple sets of noise of the same type, such as vehicle noise, street noise, or office noise; the data is used to establish a GMM for that type of noise, that is, the noise type model of that noise. It should be understood that other types of noise data can also be acquired, and a corresponding noise type model established for each type, which is not limited by the embodiment of the present invention.
  • MFCC: Mel Frequency Cepstrum Coefficient
  • the noise type model is established by processing the frequency cepstral coefficients according to an EM algorithm.
  • The EM (Expectation-Maximization) algorithm is used in statistics to find maximum likelihood estimates of parameters in a probability model that depends on unobservable latent variables.
  • the Maximum Expectation (EM) algorithm is an algorithm that looks for a parameter maximum likelihood estimate or a maximum a posteriori estimate in the GMM, where the GMM relies on an unobservable hidden variable.
  • The EM algorithm alternates between two steps. The first step computes the expectation (E): it estimates the expected value of the unknown latent variables given the current parameter estimates. The second step is maximization (M): it re-estimates the distribution parameters to maximize the likelihood of the data, given the expected estimates of the latent variables.
  • E expectation
  • M maximum
  • The algorithm flow of EM is: 1. initialize the distribution parameters; 2. repeat the E and M steps until convergence. Simply put, suppose two parameters A and B are both unknown in the starting state: knowing A yields information about B, and knowing B in turn yields A. One can first give A some initial value to obtain an estimate of B, and then re-estimate the value of A from the current value of B, repeating until convergence.
  • The EM algorithm can estimate parameters with maximum possible likelihood from an incomplete data set, and it is a very simple and practical learning algorithm. By alternately applying the E and M steps, the EM algorithm gradually improves the parameters of the model, gradually increasing the likelihood of the parameters given the training samples, and finally converging at a maximum point. Intuitively, the EM algorithm can also be seen as a successive approximation algorithm: without knowing the parameters of the model in advance, one can randomly select a set of parameters, or roughly give an initial value, and iterate from there.
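A minimal numpy sketch of the alternating E and M steps for a two-component, one-dimensional GMM; the initialization (data extremes) and iteration count are arbitrary choices, and a production trainer would use multidimensional data and convergence checks:

```python
import numpy as np

def em_gmm_1d(data, n_iter=50):
    """EM for a two-component 1-D GMM: alternate E (responsibilities)
    and M (parameter re-estimation) steps."""
    w = np.array([0.5, 0.5])                      # rough initial parameters
    mu = np.array([data.min(), data.max()])
    var = np.array([np.var(data)] * 2)
    for _ in range(n_iter):
        # E step: expected component memberships under the current parameters.
        dens = w * np.exp(-0.5 * (data[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters to maximize the data likelihood.
        nk = resp.sum(axis=0)
        w = nk / len(data)
        mu = (resp * data[:, None]).sum(axis=0) / nk
        var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-4, 1, 500), rng.normal(4, 1, 500)])
w, mu, var = em_gmm_1d(data)
print(sorted(np.round(mu)))  # means recovered near -4 and 4
```

Starting from crude initial parameters, the estimated means approach the true cluster centers, illustrating the successive-approximation behavior described above.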
  • The frequency cepstral coefficients serve as the observation x in the Gaussian mixture model p(x).
  • For example, assume two pre-established noise type models: the noise type model of vehicle noise obtained by training on vehicle noise, and the noise type model of non-vehicle noise (which can include office noise, street noise, supermarket noise, etc.). Assume that the voice data input by the current user contains 10 noise frames; the frequency cepstral coefficients of each noise frame, that is, x, are substituted into the two noise type models p(x) respectively. If the final result shows that the calculated value of the vehicle noise model is larger than that of the non-vehicle noise model (for example, -41.9 > -46.8), the noise type of the current voice data is vehicle noise.
  • The technical solution of the present invention provides a method for improving the speech recognition rate in a noisy environment, which acquires a second confidence value by acquiring a noise scenario and adjusting the first confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustment values.
  • This method of flexibly adjusting the confidence value according to the noise scene greatly improves the speech recognition rate in a noisy environment.
  • the noise scenario includes a noise level.
  • the acquiring a noise scenario according to the voice data includes:
  • Acquire feature parameters of the voice data, where the feature parameters include: subband energy, pitch, and a periodicity factor.
  • For the subband energy, the speech band is divided into N subbands according to the different distribution of useful components across frequency bands, and the energy of each subband is calculated separately for each frame (the energy of a subband being the sum of the squared magnitudes of its frequency components).
  • The pitch and periodicity factor reflect the periodic components of speech. The periodic components of silent segments and unvoiced segments are very weak, while the periodicity of voiced segments is very good; voice frame detection can be performed on this basis.
  • Voice activity detection (VAD) is used to determine the voice data frames and the noise data frames: the pitch and periodicity factor are combined with the subband energy to decide between voice frames and silence frames. The VAD decision between speech frames and noise frames is based on these two factors: a frame with high subband energy and good periodicity is generally a speech frame, while a frame with low energy and poor periodicity is generally a noise frame.
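A toy VAD decision combining frame energy with an autocorrelation-based periodicity measure, in the spirit of the two factors above; the thresholds and minimum lag are illustrative assumptions, not the patent's values:

```python
import numpy as np

def vad(frames, energy_thresh=0.01, period_thresh=0.5):
    """Label each frame as speech or noise from energy and periodicity."""
    labels = []
    for frame in frames:
        energy = np.mean(frame ** 2)
        # Periodicity: peak of the normalized autocorrelation beyond small lags.
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        periodicity = np.max(ac[20:]) / (ac[0] + 1e-10)
        labels.append("speech" if energy > energy_thresh
                      and periodicity > period_thresh else "noise")
    return labels

sr = 8000
t = np.arange(240) / sr
voiced = 0.5 * np.sin(2 * np.pi * 200 * t)                     # periodic, high energy
noise = 0.01 * np.random.default_rng(0).standard_normal(240)   # aperiodic, low energy
print(vad([voiced, noise]))  # ['speech', 'noise']
```

The voiced tone repeats every 40 samples, so its autocorrelation peaks strongly at that lag, while the low-energy noise frame fails both tests.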
  • The technical solution of the present invention provides a method for improving the speech recognition rate in a noisy environment, which acquires a second confidence value by acquiring a noise scenario and adjusting the first confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustment values. This method of flexibly adjusting the confidence value according to the noise scene greatly improves the speech recognition rate in a noisy environment.
  • FIG. 4 is a flow chart of another implementation manner of a method for voice recognition according to Embodiment 3 of the present invention.
  • the embodiment is described on the basis of the embodiment 1.
  • The method of step S103 of Embodiment 1 specifically includes: S1031: Acquire a confidence value adjustment value corresponding to the noise scenario according to the correspondence between the noise scenario and pre-stored empirical data of confidence value adjustment values.
  • The confidence value adjustment value corresponding to the noise scenario is obtained according to the noise type and noise size in the noise scenario and the empirical data of confidence value adjustment values obtained through a large number of simulation measurements.
  • This type of noise indicates the type of environment in which the user is performing speech recognition, and the magnitude of the noise indicates the amount of noise of the type of environment in which the user is located.
  • In combination with the noise type, if the noise is large, the confidence value is correspondingly increased; if the noise is small, the confidence value is correspondingly reduced.
  • the empirical data of the specific confidence value adjustment value is obtained by simulation measurement statistics.
  • For example, when the noise type is the in-vehicle environment and the noise is large, the confidence value adjustment value obtained by simulation measurement and statistics is +15 to +5; therefore, in this noise scenario, a confidence value adjustment value of 15 to 5 is acquired.
  • For example, when the noise type is the in-vehicle environment and the noise is small (the noise energy level is between -40 dB and -30 dB, and the signal-to-noise ratio is between 10 dB and 20 dB), the confidence value adjustment value obtained by simulation measurement and statistics is +10 to +3; therefore, in this noise scenario, a confidence value adjustment value of 10 to 3 is acquired.
  • For example, when the noise type is the office environment and the noise is very small (the noise energy level is below -40 dB, and the signal-to-noise ratio is greater than 20 dB), the confidence value adjustment value obtained by simulation measurement and statistics is +5 to 0; therefore, in this noise scenario, a confidence value adjustment value of 5 to 0 is acquired.
  • Step S1032: Adjust the first confidence value according to the confidence value adjustment value, and acquire the second confidence value.
  • The adjustment includes: increasing, decreasing, and remaining unchanged.
  • That is, the first confidence value acquired in step S101 is adjusted according to the confidence value adjustment value to obtain the second confidence value; the first confidence value may be adjusted up or down, or left unchanged.
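Steps S1031 and S1032 might be sketched as a table lookup followed by an adjustment. The ranges follow the examples in the text; picking the midpoint of a range is purely an assumption, as are all names below:

```python
# Hypothetical sketch of S1031 (look up the empirical adjustment range for the
# noise scenario) and S1032 (adjust the first confidence value).
# The ranges mirror the worked examples in the text; midpoint selection is
# an illustrative choice, not the patent's rule.

EMPIRICAL_RANGES = {
    ("vehicle", "large"):      (5, 15),  # adjustment +15 ~ +5
    ("vehicle", "small"):      (3, 10),  # adjustment +10 ~ +3
    ("office",  "very_small"): (0, 5),   # adjustment  +5 ~  0
}

def second_confidence(first_confidence, noise_type, noise_size):
    """S1031: look up the adjustment range; S1032: apply an adjustment."""
    lo, hi = EMPIRICAL_RANGES[(noise_type, noise_size)]
    adjustment = (lo + hi) / 2          # one possible choice within the range
    return first_confidence + adjustment

print(second_confidence(35, "vehicle", "large"))  # 35 + 10.0 = 45.0
```

Louder scenarios map to larger upward adjustments, so a raw confidence value depressed by heavy noise can still clear the fixed threshold.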
  • The technical solution of the present invention provides a method for improving the speech recognition rate in a noisy environment, which acquires a second confidence value by acquiring a noise scenario and adjusting the first confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustment values.
  • This method of flexibly adjusting the confidence value according to the noise scene greatly improves the speech recognition rate in a noisy environment.
  • FIG. 5 is a schematic structural diagram of a voice recognition apparatus according to Embodiment 4 of the present invention. As shown in FIG. 5, the device includes: an acquiring unit 300, configured to acquire voice data;
  • the first confidence value unit 301 is configured to receive the voice data acquired by the acquiring unit 300, and obtain a first confidence value according to the voice data;
  • the noise scene unit 302 is configured to receive the voice data acquired by the acquiring unit 300 and acquire a noise scene according to the voice data;
  • a second confidence value unit 303, configured to receive the noise scene from the noise scene unit 302 and the first confidence value from the first confidence value unit 301, and acquire, according to the first confidence value, a second confidence value corresponding to the noise scenario;
  • a processing unit 304, configured to receive the second confidence value acquired by the second confidence value unit 303 and, if the second confidence value is greater than or equal to a pre-stored confidence threshold, process the voice data.
  • the acquiring unit 300 acquires the voice data; the first confidence value unit 301 receives the voice data acquired by the acquiring unit 300 and acquires a first confidence value according to the voice data; the noise scene unit 302 receives the voice data acquired by the acquiring unit 300 and acquires a noise scene according to the voice data, where the noise scene includes a noise type and a noise magnitude; the second confidence value unit 303 receives the noise scene from the noise scene unit 302 and the first confidence value from the first confidence value unit 301, and acquires, according to the first confidence value, a second confidence value corresponding to the noise scenario; and the processing unit 304 receives the second confidence value acquired by the second confidence value unit 303 and processes the voice data if the second confidence value is greater than or equal to a pre-stored confidence threshold.
  • the acquiring unit 300, the first confidence value unit 301, the noise scene unit 302, the second confidence value unit 303, and the processing unit 304 can be used to perform the methods described in steps S100, S101, S102, S103, and S104 in Embodiment 1; for details, see the description of the methods in Embodiment 1, which is not repeated here.
  • the technical solution of the present invention provides a speech recognition apparatus, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and the pre-stored empirical data of confidence value adjustments. This apparatus, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in a noisy environment.
  • FIG. 6 is a schematic diagram of another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • the apparatus further includes: a modeling unit 305, configured to acquire noise data, acquire frequency cepstrum coefficients of the noise data according to the noise data, and process the frequency cepstrum coefficients according to the EM algorithm to establish a noise type model.
  • the modeling unit 305 can be used to perform the method of establishing the pre-established noise type model in step S1022 in Embodiment 2; for details, see the description of the method in Embodiment 2, which is not repeated here.
  • FIG. 7 is another schematic structural diagram of a voice recognition apparatus according to Embodiment 4 of the present invention.
  • the noise scene unit specifically includes: a noise type unit 3021, configured to acquire, according to the voice data from the acquiring unit, frequency cepstrum coefficients of the noise in the voice data, and acquire the noise type of the voice data according to the frequency cepstrum coefficients of the noise and the noise type model of the modeling unit.
  • the noise type unit 3021 may be used to perform the methods described in steps S1021 and S1022 in Embodiment 2; for details, see the description of the methods in Embodiment 2, which is not repeated here.
  • the noise size unit 3022 is configured to acquire feature parameters of the voice data according to the voice data from the acquiring unit, perform voice activation detection according to the feature parameters, and acquire the noise magnitude according to the result of the voice activation detection.
  • the noise size unit 3022 can be used to perform the methods described in steps S1023, S1024, and S1025 in Embodiment 2; for details, see the description of the methods in Embodiment 2, which is not repeated here.
  • FIG. 8 is a schematic diagram of another possible structure of a voice recognition apparatus according to Embodiment 4 of the present invention. As shown in FIG. 8, the apparatus further includes: a storage unit 306, configured to store the confidence threshold and the empirical data of confidence value adjustments.
  • the second confidence value unit 303 is specifically configured to acquire a confidence value adjustment corresponding to the noise scenario according to the correspondence, stored in advance by the storage unit 306, between the empirical data and the noise scenario; and adjust the first confidence value according to the confidence value adjustment to acquire the second confidence value, where the adjusting includes: increasing, decreasing, and keeping unchanged.
  • the second confidence value unit 303 can be used to perform the methods described in steps S1031 and S1032 in Embodiment 3; for details, see the description of the methods in Embodiment 3, which is not repeated here.
  • the technical solution of the present invention provides a speech recognition apparatus, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and the pre-stored empirical data of confidence value adjustments. This apparatus, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in a noisy environment.
  • FIG. 9 is a schematic structural diagram of a mobile terminal according to Embodiment 5 of the present invention.
  • the mobile terminal includes a processor and a microphone, where the microphone 501 is configured to acquire voice data, and the processor 502 is configured to acquire a first confidence value according to the voice data, acquire a noise scenario according to the voice data, acquire, according to the first confidence value, a second confidence value corresponding to the noise scenario, and process the voice data if the second confidence value is greater than or equal to a pre-stored confidence threshold.
  • the microphone 501 and the processor 502 may be used to perform the methods described in steps S100, S101, S102, S103, and S104 in Embodiment 1; for details, see the description of the methods in Embodiment 1, which is not repeated here.
  • the technical solution of the present invention provides a mobile terminal, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and the pre-stored empirical data of confidence value adjustments. This mobile terminal, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in a noisy environment.
  • the mobile terminal further includes: a memory 503, configured to store empirical data of the confidence value adjustment value and the confidence threshold.
  • the processor 502 is specifically configured to: acquire a noise scenario according to the voice data; acquire a confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and the empirical data; adjust the first confidence value according to the confidence value adjustment to acquire the second confidence value; and process the voice data if the second confidence value is greater than or equal to the confidence threshold.
  • the above structure can be used to perform the methods in Embodiment 1, Embodiment 2, and Embodiment 3.
  • the technical solution of the present invention provides a mobile terminal, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and the pre-stored empirical data of confidence value adjustments. This mobile terminal, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in a noisy environment.
  • Embodiment 6: As shown in FIG. 11, this embodiment uses a mobile phone as an example to specifically describe an embodiment of the present invention. It should be understood that the illustrated mobile phone is merely one example, and that a mobile phone may have more or fewer components than those shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in the figure can be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits.
  • FIG. 11 is a schematic structural diagram of a mobile phone according to an embodiment of the present invention. The mobile phone shown in FIG. 11 includes: a touch screen 41, a memory 42, a CPU 43, a power management chip 44, an RF circuit 45, a peripheral interface 46, an audio circuit 47, a microphone 48, and an I/O subsystem 49.
  • the touch screen 41 is the input interface and output interface between the mobile phone and the user. In addition to acquiring the user's touch information and control commands, it also presents visual output to the user, which may include graphics, text, icons, video, and the like.
  • the memory 42 can be used to store empirical data of the confidence value adjustment value and the confidence threshold for use by the CPU 43 for processing.
  • the memory 42 may be accessed by the CPU 43, the peripheral interface 46, and so on, and may include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other volatile solid-state storage devices.
  • the CPU 43 is configured to process the voice data acquired by the audio circuit 47 and the microphone 48, and acquire a noise scenario and a first confidence value according to the voice data; and adjust the first confidence value according to the noise scenario and the empirical data of confidence value adjustments pre-stored in the memory 42, to acquire a second confidence value.
  • the CPU 43 is the control center of the mobile phone; it connects the various parts of the entire phone using various interfaces and lines, and performs the phone's functions and processes data by running or executing the software programs and/or modules stored in the memory 42 and calling the data stored in the memory 42.
  • the CPU 43 may include one or more processing units.
  • the CPU 43 may integrate an application processor and a modem processor.
  • the application processor mainly handles the operating system, user interface, applications, and so on, while the modem processor mainly handles wireless communications. It can be understood that the modem processor may also not be integrated into the CPU 43. It should be understood that the foregoing function is only one of the functions that the CPU 43 can perform, and other functions are not limited in the embodiments of the present invention.
  • the power management chip 44 can be used for power supply and power management of the hardware connected to the CPU 43, the I/O subsystem 49, and the peripheral interface 46.
  • the RF circuit 45 is mainly used to establish communication between the mobile phone and the wireless network (that is, the network side), and to implement data acquisition and transmission between the mobile phone and the wireless network, for example, sending and receiving short messages and e-mails. Specifically, the RF circuit 45 acquires and transmits RF signals, which are also called electromagnetic signals; it converts electrical signals into electromagnetic signals or electromagnetic signals into electrical signals, and communicates with communication networks and other devices through the electromagnetic signals.
  • the RF circuit 45 may include known circuits for performing these functions, These include, but are not limited to, an antenna system, an RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a CODEC chipset, a Subscriber Identity Module (SIM), and the like.
  • the peripheral interface 46 which can interface the input and output peripherals of the device to the CPU 43 and the memory 42.
  • the audio circuit 47 is primarily operable to acquire audio data from the peripheral interface 46 and convert the audio data into electrical signals.
  • the microphone 48 can be used to acquire voice data.
  • the I/O subsystem 49 can control the input and output peripherals on the device, and may include a display controller 491 and one or more input controllers 492 for controlling other input/control devices.
  • the one or more input controllers 492 acquire electrical signals from, or send electrical signals to, other input/control devices, which may include physical buttons (push buttons, rocker buttons, and so on), dials, slide switches, joysticks, and click wheels.
  • the input controller 492 can be connected to any of the following: a keyboard, an infrared port, a USB interface, and a pointing device such as a mouse.
  • the display controller 491 in the I/O subsystem 49 acquires an electrical signal from the touch screen 41 or transmits an electrical signal to the touch screen 41.
  • the touch screen 41 acquires the user's contact, and the display controller 491 converts the acquired contact into interaction with the user interface objects presented on the touch screen 41, thereby implementing human-computer interaction. The user interface objects presented on the touch screen 41 may include icons for running games, icons for connecting to corresponding networks, filtering modes, and the like. It is worth noting that the device may also include a light mouse, which is a touch-sensitive surface that does not present visual output, or an extension of the touch-sensitive surface formed by the touch screen.
  • the microphone 48 acquires voice data and sends the voice data to the CPU 43 through the peripheral interface 46 and the audio circuit 47. The CPU 43 can be configured to process the voice data, and acquire a noise scenario and a first confidence value according to the voice data; adjust the first confidence value according to the noise scenario and the empirical data of confidence value adjustments pre-stored in the memory 42, to acquire a second confidence value; and process the voice data if the second confidence value is greater than or equal to the pre-stored confidence threshold.
  • the above structure can be used to perform the methods in Embodiment 1, Embodiment 2, and Embodiment 3.
  • the technical solution of the present invention provides a speech recognition mobile phone, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and the pre-stored empirical data of confidence value adjustments. This mobile phone, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in a noisy environment.
  • the device-readable media include device storage media and communication media, where the communication media include any medium that facilitates the transfer of a device program from one place to another.
  • the storage medium can be any available medium that the device can access.
  • the device-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a device.
  • Any connection may suitably be a device readable medium.
  • for example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of the medium.
  • disks and discs, as used here, include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of device-readable media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephone Function (AREA)

Abstract

A speech recognition method, comprising: acquiring voice data (S100); acquiring a first confidence value according to the voice data (S101); acquiring a noise scenario according to the voice data (S102); acquiring, according to the first confidence value, a second confidence value corresponding to the noise scenario (S103); and processing the voice data if the second confidence value is greater than or equal to a pre-stored confidence threshold (S104). An apparatus is also provided. By flexibly adjusting the confidence value according to the noise scenario, the method and apparatus greatly improve the speech recognition rate in noisy environments.

Description

Speech Recognition Method and Apparatus. Technical Field: Embodiments of the present invention relate to the field of voice processing technologies, and in particular, to a speech recognition method and apparatus.
Background: On terminal devices such as mobile phones, users generally use voice assistant software for speech recognition. The process of speech recognition with software such as a voice assistant is as follows: the user starts the voice assistant software, which acquires voice data; the voice data is sent to a noise reduction module for noise reduction; the processed voice data is sent to a speech recognition engine; the recognition engine returns the recognition result to the voice assistant; and, to reduce misjudgment, the voice assistant judges the correctness of the recognition result according to a confidence threshold and then presents it. At present, voice assistant software generally works relatively well in quiet environments such as offices, but performs poorly in noisy environments (for example, in vehicles). The industry commonly uses software noise reduction to improve the recognition rate, but the improvement is not obvious and sometimes the recognition rate is even reduced. Summary: The technical solution provides a speech recognition method and apparatus to improve the speech recognition rate and the user experience. According to a first aspect, a speech recognition method is provided, the method comprising: acquiring voice data; acquiring a first confidence value according to the voice data; acquiring a noise scenario according to the voice data; acquiring, according to the first confidence value, a second confidence value corresponding to the noise scenario; and processing the voice data if the second confidence value is greater than or equal to a pre-stored confidence threshold.
With reference to the first aspect, in a first possible implementation of the first aspect, the noise scenario specifically includes: a noise type and a noise magnitude. With reference to the first possible implementation of the first aspect, in a second possible implementation, the noise scenario includes the noise type, and acquiring the noise scenario according to the voice data specifically includes: acquiring, according to the voice data, frequency cepstrum coefficients of the noise in the voice data; and acquiring the noise type of the voice data according to the frequency cepstrum coefficients of the noise and a pre-established noise type model. With reference to the second possible implementation of the first aspect, in a third possible implementation, the method for establishing the noise type model specifically includes: acquiring noise data; acquiring frequency cepstrum coefficients of the noise data according to the noise data; and processing the frequency cepstrum coefficients according to the EM algorithm to establish the noise type model.
With reference to the third or the second possible implementation of the first aspect, in a fourth possible implementation, the noise type model is a Gaussian mixture model.
With reference to the first possible implementation of the first aspect, in a fifth possible implementation, the noise scenario includes the noise magnitude, and acquiring the noise scenario according to the voice data specifically includes: acquiring feature parameters of the voice data according to the voice data; performing voice activity detection according to the feature parameters; and acquiring the noise magnitude according to the result of the voice activity detection. With reference to any one of the first to fifth possible implementations of the first aspect, in a sixth possible implementation, the noise magnitude specifically includes: a signal-to-noise ratio and a noise energy level.
With reference to the first aspect or any one of the first to sixth possible implementations of the first aspect, in a seventh possible implementation, acquiring, according to the first confidence value, a second confidence value corresponding to the noise scenario specifically includes: acquiring a confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and pre-stored empirical data of confidence value adjustments; and adjusting the first confidence value according to the confidence value adjustment to acquire the second confidence value, where the adjusting includes: increasing, decreasing, and keeping unchanged.
With reference to the first aspect or any one of the first to seventh possible implementations of the first aspect, in an eighth possible implementation, the user is prompted if the second confidence value is less than the confidence threshold. According to a second aspect, a speech recognition apparatus is provided, the apparatus including: an acquiring unit, configured to acquire voice data; a first confidence value unit, configured to receive the voice data acquired by the acquiring unit and acquire a first confidence value according to the voice data; a noise scenario unit, configured to receive the voice data acquired by the acquiring unit and acquire a noise scenario according to the voice data; a second confidence value unit, configured to receive the noise scenario from the noise scenario unit and the first confidence value from the first confidence value unit, and acquire, according to the first confidence value, a second confidence value corresponding to the noise scenario; and a processing unit, configured to receive the second confidence value acquired by the second confidence value unit and process the voice data if the second confidence value is greater than or equal to a pre-stored confidence threshold.
With reference to the second aspect, in a first possible implementation of the second aspect, the apparatus further includes: a modeling unit, configured to acquire noise data, acquire frequency cepstrum coefficients of the noise data according to the noise data, and process the frequency cepstrum coefficients according to the EM algorithm to establish a noise type model.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the noise scenario unit specifically includes: a noise type unit, configured to acquire, according to the voice data from the acquiring unit, frequency cepstrum coefficients of the noise in the voice data, and acquire the noise type of the voice data according to the frequency cepstrum coefficients of the noise and the noise type model of the modeling unit.
With reference to the second aspect or the first or second possible implementation of the second aspect, in a third possible implementation, the noise scenario unit further includes: a noise magnitude unit, configured to acquire feature parameters of the voice data according to the voice data from the acquiring unit, perform voice activity detection according to the feature parameters, and acquire the noise magnitude according to the result of the voice activity detection.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation, the apparatus further includes: a storage unit, configured to store the confidence threshold and empirical data of confidence value adjustments. With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, the second confidence value unit is specifically configured to:
acquire a confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and the empirical data;
adjust the first confidence value according to the confidence value adjustment to acquire the second confidence value;
where the adjusting includes: increasing, decreasing, and keeping unchanged. According to a third aspect, a mobile terminal is provided, including a processor and a microphone, where the microphone is configured to acquire voice data, and the processor is configured to acquire a first confidence value according to the voice data, acquire a noise scenario according to the voice data, acquire, according to the first confidence value, a second confidence value corresponding to the noise scenario, and process the voice data if the second confidence value is greater than or equal to a pre-stored confidence threshold.
With reference to the third aspect, in a first possible implementation of the third aspect, the mobile terminal further includes: a memory, configured to store empirical data of confidence value adjustments and the confidence threshold.
With reference to the first possible implementation of the third aspect, in a second possible implementation, the processor is specifically configured to: acquire a first confidence value according to the voice data; acquire a noise scenario according to the voice data; acquire a confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and the empirical data; adjust the first confidence value according to the confidence value adjustment to acquire the second confidence value; and process the voice data if the second confidence value is greater than or equal to the confidence threshold.
The technical solution of the present invention provides a speech recognition method and apparatus, which acquire a noise scenario and acquire a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This method and apparatus, which flexibly adjust the confidence value according to the noise scenario, greatly improve the speech recognition rate in noisy environments.
Brief Description of the Drawings: To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required in the embodiments are briefly introduced below. Apparently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a speech recognition method according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart of another implementation of the speech recognition method according to Embodiment 1 of the present invention;
FIG. 3 is a flowchart of another implementation of a speech recognition method according to Embodiment 2 of the present invention;
FIG. 4 is a flowchart of another implementation of a speech recognition method according to Embodiment 3 of the present invention;
FIG. 5 is a schematic structural diagram of a speech recognition apparatus according to Embodiment 4 of the present invention;
FIG. 6 is a schematic diagram of another possible structure of a speech recognition apparatus according to Embodiment 4 of the present invention;
FIG. 7 is a schematic diagram of another possible structure of a speech recognition apparatus according to Embodiment 4 of the present invention;
FIG. 8 is a schematic diagram of another possible structure of a speech recognition apparatus according to Embodiment 4 of the present invention;
FIG. 9 is a schematic structural diagram of a mobile terminal according to Embodiment 5 of the present invention;
FIG. 10 is a schematic diagram of another possible structure of a mobile terminal according to Embodiment 5 of the present invention; FIG. 11 is a schematic structural diagram of a mobile phone according to an embodiment of the present invention. Detailed Description: To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the embodiments of the present invention. The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said", and "the" used in the embodiments of the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used herein refers to and includes any or all possible combinations of one or more associated listed items. It should further be understood that the term "include" used herein specifies the presence of the stated features, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. In the embodiments of the present invention, the apparatus includes but is not limited to a mobile phone, a personal digital assistant (PDA), a tablet computer, a portable device (for example, a portable computer), a vehicle-mounted device, an automatic teller machine (ATM), and the like, which is not limited in the embodiments of the present invention. Embodiment 1
FIG. 1 is a flowchart of a speech recognition method according to Embodiment 1 of the present invention.
As shown in FIG. 1, Embodiment 1 of the present invention provides a speech recognition method, which may specifically include: S100: Acquire voice data. The user starts speech recognition software such as a voice assistant on the apparatus, and the voice data input by the user is acquired through a microphone. It should be understood that the voice data may also be input by a machine rather than by the user, and includes any data containing information. S101: Acquire a first confidence value according to the voice data. A confidence value indicates the degree to which a particular individual believes in the truth of a particular proposition; in this embodiment of the present invention, it is the degree to which the apparatus believes in the truth of the recognition result of the voice data. That is, the first confidence value is a number representing the credibility of the speech recognition result. For example, if the voice data input by the user is "给张三打电话" ("Call Zhang San"), the first confidence values returned during recognition of the voice data include: a sentence confidence N1 (the overall confidence of the whole utterance), a preposed command word confidence N2 ("给" is the preposed command word, with confidence value N2), a person name confidence N3 ("张三" is the person name, with confidence value N3), and a postposed command word confidence N4 ("打电话" is the postposed command word, with confidence value N4). Generally, the sentence confidence N1 is synthesized from N2, N3, and N4. In one experiment, the first confidence values obtained by testing for the user input "给张三打电话" were N1=62, N2=50, N3=48, and N4=80.
It should be understood that although the terms "first", "second", and so on may be used in the embodiments of the present invention to describe various confidence values, these confidence values should not be limited by these terms; the terms are only used to distinguish the confidence values from each other. For example, without departing from the scope of the embodiments of the present invention, the first confidence value may also be called a second confidence value and, similarly, the second confidence value may also be called a first confidence value; both the first confidence value and the second confidence value are confidence values.
S102: Acquire a noise scenario according to the voice data. The noise scenario is the noise state in which the user inputs the voice data; that is, whether the user inputs the voice data in the noise environment of a street, an office, or a vehicle, and whether the noise in the corresponding environment is large or small.
It should be understood that step S102 may be performed before step S101, after step S101, or simultaneously with step S101, which is not limited in this embodiment of the present invention.
S103: Acquire, according to the first confidence value, a second confidence value corresponding to the noise scenario. The second confidence value is acquired according to the acquired first confidence value; it is not obtained directly from the voice data input by the user, but is obtained from the first confidence value. After the noise scenario of the voice data is acquired, the second confidence value corresponding to the noise scenario may be acquired according to the first confidence value.
S104: Process the voice data if the second confidence value is greater than or equal to a pre-stored confidence threshold.
The pre-stored confidence threshold serves as a criterion for whether the second confidence value is acceptable: if the second confidence value is greater than this confidence threshold, the recognition result is considered correct; if the second confidence value is less than this confidence threshold, the recognition result is considered wrong and the result is untrustworthy.
If the second confidence value is greater than or equal to the pre-stored confidence threshold, the recognition result of the voice data is considered correct, and the corresponding voice data is processed. For example, if the second confidence value acquired in step S103 is N3=48 and the confidence threshold pre-stored in step S104 is 40, the second confidence value is greater than the confidence threshold, and the recognition result of the voice data is correct. As a further example, when the voice data contains command words, such as "Call Zhang San", "Send a message to Zhang San", or "Open an application", the speech recognition is command word recognition, and the apparatus executes the corresponding command, such as making a call, sending a message, or opening an application. If the recognition of the voice data is text dictation recognition, the recognized text is displayed. That is, the voice data is processed if the second confidence value is greater than or equal to the pre-stored confidence threshold.
The technical solution of the present invention provides a speech recognition method, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This method of flexibly adjusting the confidence value according to the noise scenario greatly improves the speech recognition rate in noisy environments. Optionally,
FIG. 2 is a flowchart of another implementation of the speech recognition method according to Embodiment 1 of the present invention. As shown in FIG. 2, the method further includes:
S1041: Prompt the user if the second confidence value is less than the confidence threshold.
If the second confidence value is less than the confidence threshold, the recognition result of the voice data is considered wrong, and the user is prompted. For example, if the second confidence value acquired in step S103 is N3=48 and the confidence threshold pre-stored in step S104 is 50, the second confidence value is less than the confidence threshold, and the recognition result of the voice data is wrong. As a further example, when the voice data is "Call Zhang San", the apparatus judges that the recognition result of the voice data is wrong, and the system prompts the user to speak again and/or informs the user of the error. That is, if the second confidence value is less than the confidence threshold, the user is prompted to re-input, to correct the error, or the like.
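The overall control flow of steps S100 through S104, together with the prompt branch S1041, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper functions `recognize`, `detect_noise_scenario`, and `adjust_confidence` are hypothetical stand-ins for the recognizer, the noise scenario detection of Embodiment 2, and the empirical-data lookup of Embodiment 3, and the threshold and adjustment numbers are example values only.

```python
# Hypothetical end-to-end sketch of steps S100-S104 and S1041.
# All helpers are stand-ins (assumptions), not the patent's implementation.

CONFIDENCE_THRESHOLD = 40  # pre-stored confidence threshold (example value)

def recognize(voice_data):
    # Stand-in recognizer: returns (recognition result, first confidence value).
    return "call zhang san", 38

def detect_noise_scenario(voice_data):
    # Stand-in detector: returns (noise type, noise magnitude).
    return "vehicle", "large"

def adjust_confidence(first_conf, scenario):
    # Stand-in for the empirical-data lookup of Embodiment 3.
    return first_conf + 10 if scenario == ("vehicle", "large") else first_conf

def handle(voice_data):
    result, conf1 = recognize(voice_data)          # S101
    scenario = detect_noise_scenario(voice_data)   # S102
    conf2 = adjust_confidence(conf1, scenario)     # S103
    if conf2 >= CONFIDENCE_THRESHOLD:              # S104: process the result
        return ("process", result)
    return ("prompt_user", None)                   # S1041: prompt the user

print(handle(b"..."))  # ('process', 'call zhang san')
```

Note how the scenario-dependent increase rescues a borderline first confidence value (38) that would otherwise have fallen below the threshold and triggered the S1041 prompt.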
The technical solution of the present invention provides a speech recognition method, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This method of flexibly adjusting the confidence value according to the noise scenario greatly improves the speech recognition rate in noisy environments.
Embodiment 2: FIG. 3 is a flowchart of another implementation of a speech recognition method according to Embodiment 2 of the present invention. Embodiment 2 of the present invention is described on the basis of Embodiment 1. As shown in FIG. 3, in step S102 of Embodiment 1, the noise scenario specifically includes: a noise type and a noise magnitude. The noise type refers to the noise environment in which the user inputs the voice data; that is, whether the user is in the noise environment of a street, an office, or a vehicle. The noise magnitude represents how loud the noise is in the environment in which the user inputs the voice data. Optionally, the noise magnitude includes: a signal-to-noise ratio and a noise energy level. The signal-to-noise ratio is the ratio of the power of the voice data to that of the noise data, usually expressed in decibels; generally, a higher signal-to-noise ratio indicates lower noise data power, and vice versa. The noise energy level reflects the energy of the noise data in the user's voice data. Together, the signal-to-noise ratio and the noise energy level represent the noise magnitude.
The noise scenario includes the noise type. In step S102 of Embodiment 1, acquiring the noise scenario according to the voice data specifically includes:
S1021: Acquire, according to the voice data, frequency cepstrum coefficients of the noise in the voice data. According to the voice data input by the user, voice data frames and noise data frames are distinguished through voice activity detection (VAD); after the noise data frames are acquired, the frequency cepstrum coefficients of the noise data frames are acquired. Mel is a unit of subjective pitch, while Hz is a unit of objective pitch; the Mel frequency is based on the auditory characteristics of the human ear and has a nonlinear correspondence with the Hz frequency. Mel frequency cepstrum coefficients (MFCC) are cepstrum coefficients on the Mel frequency scale; they have good recognition performance and are widely used in speech recognition, voiceprint recognition, language identification, and the like.
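The nonlinear Mel-Hz correspondence mentioned above can be made concrete. The patent does not give the conversion formula, so the sketch below uses one commonly cited form of the Mel scale as an assumption:

```python
import math

# One commonly used Mel <-> Hz conversion (an assumption; the text only
# states that the correspondence is nonlinear and does not give a formula):
#   mel = 2595 * log10(1 + f / 700)

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the (subjective-pitch) Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

print(round(hz_to_mel(700.0), 1))               # ≈ 781.2 Mel
print(round(mel_to_hz(hz_to_mel(4000.0)), 1))   # ≈ 4000.0 (round trip)
```

MFCC extraction itself additionally involves framing, a Mel filterbank built from this warping, a log, and a discrete cosine transform; only the frequency warping step is shown here.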
S1022: Acquire the noise type of the voice data according to the frequency cepstrum coefficients of the noise and a pre-established noise type model.
The frequency cepstrum coefficients are substituted into each pre-established noise type model for calculation. If the calculation result of a certain noise type model is the largest, it is considered that the user was in the environment of that noise type when inputting the voice data; that is, the noise type of the voice data is acquired.
The pre-established noise type model in step S1022 is a Gaussian mixture model. Gaussian density function estimation is a parameterized model, of which there are two kinds: the single Gaussian model (SGM) and the Gaussian mixture model (GMM). The Gaussian model is an effective clustering model; depending on the parameters of the Gaussian probability density function, each established Gaussian model can be regarded as a category. For an input sample x, its value can be calculated through the Gaussian probability density function, and a threshold is then used to judge whether the sample belongs to the established Gaussian model. Because a GMM has multiple component models and partitions more finely, it is suitable for partitioning complex objects and is widely used in modeling complex objects, for example, classifying and modeling different noise types with GMMs in speech recognition.
In this embodiment of the present invention, the GMM of a certain noise type may be established by inputting multiple sets of noise data of the same type, repeatedly training the GMM with the noise data, and finally obtaining the GMM of that noise type.
The Gaussian mixture model can be expressed as:

p(x) = Σ_{i=1}^{N} α_i · N(x; μ_i, Σ_i), where Σ_{i=1}^{N} α_i = 1,

and the Gaussian model N(x; μ, Σ) can be expressed as:

N(x; μ, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp[ -(1/2) (x - μ)^T Σ^{-1} (x - μ) ],

where N is the mixture order of the GMM, that is, the GMM is composed of N Gaussian models; α_i is the weight of the i-th Gaussian model; μ is the mean; and Σ is the covariance matrix. In theory, any shape in space can be modeled with a GMM. Since the output of a Gaussian model is a decimal between 0 and 1, for ease of calculation the natural logarithm (ln) of the result is usually taken, yielding a floating-point number less than 0.
The method for establishing the pre-established noise type model in step S1022 includes: acquiring noise data. Multiple sets of noise data of the same type, such as vehicle noise, street noise, or office noise, are acquired and used to establish the GMM of that type of noise data, that is, the noise type model of that noise data. It should be understood that other kinds of noise data may also be obtained, and a corresponding noise type model may be established for each type of noise data, which is not limited in this embodiment of the present invention.
Acquiring the frequency cepstrum coefficients of the noise data according to the noise data: the frequency cepstrum coefficients of the noise are extracted from the noise data. Mel is a unit of subjective pitch, while Hz is a unit of objective pitch; the Mel frequency is based on the auditory characteristics of the human ear and has a nonlinear correspondence with the Hz frequency. Mel frequency cepstrum coefficients (MFCC) are cepstrum coefficients on the Mel frequency scale; they have good recognition performance and are widely used in speech recognition, voiceprint recognition, language identification, and the like.
Processing the frequency cepstrum coefficients according to the EM algorithm to establish the noise type model. The expectation-maximization (EM) algorithm is used in statistics to find maximum likelihood estimates of parameters in probability models that depend on unobservable latent variables. In statistical computing, the EM algorithm finds maximum likelihood or maximum a posteriori estimates of the parameters of a GMM, where the GMM depends on unobservable latent variables.
The EM algorithm alternates between two steps: the first step computes the expectation (E), estimating the expected values of the unknown parameters given the current parameter estimates; the second step is maximization (M), re-estimating the distribution parameters to maximize the likelihood of the data, giving the expected estimates of the unknown variables. In general, the EM algorithm flow is: 1. initialize the distribution parameters; 2. repeat until convergence. Simply put, suppose we want to estimate two parameters A and B, both unknown at the start, where knowing A yields B and, in turn, knowing B yields A. One may first assign A some initial value to obtain an estimate of B, then re-estimate A from the current value of B, and continue this process until convergence. The EM algorithm can perform maximum likelihood estimation of parameters from incomplete data sets and is a very simple and practical learning algorithm. By alternately applying the E and M steps, the EM algorithm gradually improves the model parameters, steadily increasing the likelihood of the parameters and the training samples, and finally terminates at a maximum point. Intuitively, the EM algorithm can also be regarded as a successive approximation algorithm: without knowing the model parameters in advance, a set of parameters can be chosen randomly, or a rough initial parameter can be given in advance; the most likely state corresponding to this set of parameters is determined, the probability of the possible results of each training sample is calculated, the parameters are corrected by the samples in the current state to re-estimate the parameters, and the state of the model is re-determined under the new parameters. Through multiple iterations, the cycle continues until a convergence condition is satisfied, so that the model parameters gradually approach the true parameters. The acquired frequency cepstrum coefficients are substituted into the EM algorithm for training; through training, the parameters N, α_i, μ, and Σ of the Gaussian mixture model are obtained, and from these parameters the Gaussian mixture model p(x) = Σ_{i=1}^{N} α_i · N(x; μ_i, Σ_i), where Σ_{i=1}^{N} α_i = 1, is established; that is, the noise type model corresponding to that noise type is established. Here, x is the frequency cepstrum coefficient vector. For example, in step S102 of Embodiment 1, acquiring the noise scenario according to the voice data is specifically:
The frequency cepstrum coefficients of the noise frames of the voice data are acquired according to the voice data; these coefficients are the x in the Gaussian mixture model p(x) = Σ_{i=1}^{N} α_i · N(x; μ_i, Σ_i). Suppose there are two noise type models: one is a noise type model of vehicle noise trained on vehicle noise, and the other is a noise type model of non-vehicle noise trained on non-vehicle noise (which may include office noise, street noise, supermarket noise, and so on). Suppose the voice data currently input by the user has 10 noise frames. The frequency cepstrum coefficients of each noise frame, that is, x, are substituted into the two noise type models p(x) (where the parameters N, α_i, μ, and Σ are known), the calculation results are obtained, their logarithms are taken and averaged over the frames, and the final results are as shown in Table 1:

Table 1: vehicle noise model: -41.9; non-vehicle noise model: -46.8

The final results show that the calculation result of the noise type model of vehicle noise is greater than that of the noise type model of non-vehicle noise (that is, -41.9 > -46.8), so the noise type of the current voice data is vehicle noise.
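The decision procedure just described, averaging the per-frame log-likelihood ln p(x) under each pre-trained GMM and picking the model with the larger value, can be sketched as below. The model parameters are toy assumptions (single-component, diagonal-covariance models in 2-D), not the patent's trained vehicle/non-vehicle models, and the scores will not reproduce the -41.9 / -46.8 values of Table 1.

```python
import numpy as np

# Sketch of the noise-type decision: average over the noise frames of
# ln p(x) = ln sum_i a_i N(x; mu_i, Sigma_i), evaluated under each GMM
# (diagonal covariances assumed), then pick the highest-scoring model.

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of `frames` under a diagonal GMM."""
    frames = np.atleast_2d(frames)                 # shape (T, d)
    d = frames.shape[1]
    logliks = []
    for x in frames:
        comp = []
        for a, mu, var in zip(weights, means, variances):
            logdet = np.sum(np.log(var))           # ln |Sigma| (diagonal)
            quad = np.sum((x - mu) ** 2 / var)     # (x-mu)^T Sigma^-1 (x-mu)
            log_n = -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
            comp.append(np.log(a) + log_n)
        logliks.append(np.logaddexp.reduce(comp))  # ln sum_i exp(comp_i)
    return float(np.mean(logliks))

# Two toy 2-D "noise type models" (illustrative parameters only)
vehicle = dict(weights=[1.0], means=[np.zeros(2)], variances=[np.ones(2)])
other = dict(weights=[1.0], means=[np.full(2, 5.0)], variances=[np.ones(2)])

frames = np.zeros((10, 2))  # 10 noise-frame MFCC vectors near the first model
scores = {name: gmm_avg_loglik(frames, **m)
          for name, m in [("vehicle", vehicle), ("non-vehicle", other)]}
print(max(scores, key=scores.get))  # vehicle
```

In practice the parameters of each model would come from EM training on labeled noise recordings, as described above, and x would be the MFCC vector of each VAD-labeled noise frame.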
The technical solution of the present invention provides a method for improving the speech recognition rate in a noisy environment, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This method of flexibly adjusting the confidence value according to the noise scenario greatly improves the speech recognition rate in noisy environments. Optionally,
As shown in FIG. 3, the noise scenario includes the noise magnitude. In step S102 of Embodiment 1, acquiring the noise scenario according to the voice data specifically includes:
S1023: Acquire feature parameters of the voice data according to the voice data.
According to the voice data, the feature parameters of the voice data are extracted, the feature parameters including: subband energy, pitch, and periodicity factor.
Subband energy: according to the different useful components in different frequency bands of the voice data, the 0-8 kHz band is divided into N subbands, and the energy of each frame of speech in each subband is calculated separately. The subband energy is calculated as:

ener = (1/L) · Σ_{i=0}^{L-1} x[i]²

where L is the frame length and one frame of voice data is x[0], x[1], ..., x[L-1].
Pitch and periodicity factor: these reflect the periodic components in speech. In speech, the periodicity of silence segments and unvoiced segments is poor, while in voiced segments the periodicity is good; speech frame detection can be performed based on this.
S1024: Perform voice activity detection according to the feature parameters.
According to the voice data input by the user, voice data frames and noise data frames are judged through voice activity detection (VAD); the pitch and periodicity factor are combined with the subband energy to decide between speech frames and silence frames.
The VAD judgment decides between speech frames and noise frames mainly based on the following two factors: 1) the energy of speech frames is higher than the energy of noise frames; 2) frames with strong periodicity are generally speech frames.
S1025: Acquire the noise magnitude according to the result of the voice activity detection.
According to the VAD judgment result, the average energies of the speech frames and the noise frames are computed separately to obtain the speech energy level (speechLev) and the noise energy level (noiseLev), and the signal-to-noise ratio (SNR) is then calculated as:

noiseLev = 10 · log10(1 + (1/Ln) · Σ_i ener[Ni])
speechLev = 10 · log10(1 + (1/Ls) · Σ_j ener[Sj])
SNR = speechLev − noiseLev

where Ln and Ls are the total numbers of noise frames and speech frames respectively, ener[Ni] is the energy of the i-th noise frame, and ener[Sj] is the energy of the j-th speech frame. The technical solution of the present invention provides a method for improving the speech recognition rate in a noisy environment, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This method of flexibly adjusting the confidence value according to the noise scenario greatly improves the speech recognition rate in noisy environments.
Embodiment 3: FIG. 4 is a flowchart of another implementation of a speech recognition method according to Embodiment 3 of the present invention. This embodiment is described on the basis of Embodiment 1. As shown in FIG. 4, the method of step S103 in Embodiment 1 specifically includes: S1031: Acquire a confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and pre-stored empirical data of confidence value adjustments.
According to the noise type and noise magnitude in the noise scenario and the empirical data of confidence value adjustments obtained from a large number of simulation measurements, the confidence value adjustment corresponding to the noise scenario is acquired. The noise type indicates the type of environment in which the user performs speech recognition, and the noise magnitude indicates how loud that type of environment is. In combination with the noise type, when the noise is large, the confidence value is increased accordingly; when the noise is small, the confidence value is decreased accordingly. The specific empirical data of confidence value adjustments is obtained statistically through simulation measurements. For example: when the noise type is the vehicle environment and the noise is large (that is, the noise level is above -30 dB and the signal-to-noise ratio is below 10 dB), the confidence value adjustment obtained statistically through simulation measurements for this noise scenario is +5 to +15; therefore, in this noise scenario, the acquired confidence value adjustment is an increase by a value from 5 to 15. When the noise type is the vehicle environment and the noise is moderate (the noise level is between -40 dB and -30 dB, and the signal-to-noise ratio is between 10 dB and 20 dB), the confidence value adjustment obtained statistically through simulation measurements for this noise scenario is +3 to +10; therefore, in this noise scenario, the acquired confidence value adjustment is an increase by a value from 3 to 10.
When the noise type is the office environment and the noise is small (the noise level is below -40 dB, and the signal-to-noise ratio is greater than 20 dB), the confidence value adjustment obtained statistically through simulation measurements for this noise scenario is 0 to +5; therefore, in this noise scenario, the acquired confidence value adjustment is an increase by a value from 0 to 5.
S1032: Adjust the first confidence value according to the confidence value adjustment to acquire the second confidence value, where the adjusting includes: increasing, decreasing, and keeping unchanged. According to the confidence value adjustment, the first confidence value acquired in step S101 is adjusted; by adjusting the first confidence value according to the confidence adjustment, the second confidence value is obtained, and the first confidence value may be increased, decreased, or kept unchanged.
The technical solution of the present invention provides a method for improving the speech recognition rate in a noisy environment, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This method of flexibly adjusting the confidence value according to the noise scenario greatly improves the speech recognition rate in noisy environments.
Embodiment 4: FIG. 5 is a schematic structural diagram of a speech recognition apparatus according to Embodiment 4 of the present invention. As shown in FIG. 5, the apparatus includes: an acquiring unit 300, configured to acquire voice data;
a first confidence value unit 301, configured to receive the voice data acquired by the acquiring unit 300 and acquire a first confidence value according to the voice data;
a noise scenario unit 302, configured to receive the voice data acquired by the acquiring unit 300 and acquire a noise scenario according to the voice data;
a second confidence value unit 303, configured to receive the noise scenario from the noise scenario unit 302 and the first confidence value from the first confidence value unit 301, and acquire, according to the first confidence value, a second confidence value corresponding to the noise scenario; and a processing unit 304, configured to receive the second confidence value acquired by the second confidence value unit 303 and process the voice data if the second confidence value is greater than or equal to a pre-stored confidence threshold.
The acquiring unit 300 acquires the voice data; the first confidence value unit 301 receives the voice data acquired by the acquiring unit 300 and acquires a first confidence value according to the voice data; the noise scenario unit 302 receives the voice data acquired by the acquiring unit 300 and acquires a noise scenario according to the voice data, where the noise scenario includes a noise type and a noise magnitude; the second confidence value unit 303 receives the noise scenario from the noise scenario unit 302 and the first confidence value from the first confidence value unit 301, and acquires, according to the first confidence value, a second confidence value corresponding to the noise scenario; and the processing unit 304 receives the second confidence value acquired by the second confidence value unit 303 and processes the voice data if the second confidence value is greater than or equal to a pre-stored confidence threshold.
The acquiring unit 300, the first confidence value unit 301, the noise scenario unit 302, the second confidence value unit 303, and the processing unit 304 can be used to perform the methods described in steps S100, S101, S102, S103, and S104 in Embodiment 1; for details, see the description of the methods in Embodiment 1, which is not repeated here.
The technical solution of the present invention provides a speech recognition apparatus, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This apparatus, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in noisy environments. Optionally,
FIG. 6 is a schematic diagram of another possible structure of a speech recognition apparatus according to Embodiment 4 of the present invention. As shown in FIG. 6, the apparatus further includes: a modeling unit 305, configured to acquire noise data, acquire frequency cepstrum coefficients of the noise data according to the noise data, and process the frequency cepstrum coefficients according to the EM algorithm to establish a noise type model. The modeling unit 305 can be used to perform the method of establishing the pre-established noise type model in step S1022 in Embodiment 2; for details, see the description of the method in Embodiment 2, which is not repeated here.
The technical solution of the present invention provides a speech recognition apparatus, which acquires a noise scenario and acquires a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This apparatus, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in noisy environments. Optionally, FIG. 7 is a schematic diagram of another possible structure of a speech recognition apparatus according to Embodiment 4 of the present invention. As shown in FIG. 7, the noise scenario unit specifically includes: a noise type unit 3021, configured to acquire, according to the voice data from the acquiring unit, frequency cepstrum coefficients of the noise in the voice data, and acquire the noise type of the voice data according to the frequency cepstrum coefficients of the noise and the noise type model of the modeling unit. The noise type unit 3021 can be used to perform the methods described in steps S1021 and S1022 in Embodiment 2; for details, see the description of the methods in Embodiment 2, which is not repeated here. The apparatus also includes a noise magnitude unit 3022, configured to acquire feature parameters of the voice data according to the voice data from the acquiring unit, perform voice activity detection according to the feature parameters, and acquire the noise magnitude according to the result of the voice activity detection.
The noise magnitude unit 3022 can be used to perform the methods described in steps S1023, S1024, and S1025 in Embodiment 2; for details, see the description of the methods in Embodiment 2, which is not repeated here.
The technical solution of the present invention provides a speech recognition apparatus that obtains a noise scenario and obtains a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This apparatus, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in noisy environments. Optionally, Fig. 8 is a schematic diagram of another possible structure of the speech recognition apparatus provided in Embodiment 4 of the present invention. As shown in Fig. 8, the apparatus further includes: a storage unit 306, configured to store the confidence threshold and the empirical data of confidence value adjustments. The second confidence value unit 303 is specifically configured to obtain the confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and the empirical data pre-stored in the storage unit 306, and to adjust the first confidence value according to the confidence value adjustment to obtain the second confidence value, where the adjustment includes: increasing, decreasing, or keeping unchanged.
The second confidence value unit 303 may be configured to perform the methods described in steps S1031 and S1032 of Embodiment 3; for details, see the description of those methods in Embodiment 3, which is not repeated here.
The technical solution of the present invention provides a speech recognition apparatus that obtains a noise scenario and obtains a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This apparatus, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in noisy environments.
Embodiment 5
Fig. 9 is a schematic structural diagram of a mobile terminal provided in Embodiment 5 of the present invention. As shown in Fig. 9, the mobile terminal includes a processor and a microphone, wherein: the microphone 501 is configured to obtain voice data; the processor 502 is configured to obtain a first confidence value according to the voice data, obtain a noise scenario according to the voice data, obtain, according to the first confidence value, a second confidence value corresponding to the noise scenario, and, if the second confidence value is greater than or equal to a pre-stored confidence threshold, process the voice data. The microphone 501 and the processor 502 may be configured to perform the methods described in steps S100, S101, S102, S103, and S104 of Embodiment 1; for details, see the description of those methods in Embodiment 1, which is not repeated here. The technical solution of the present invention provides a mobile terminal that obtains a noise scenario and obtains a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This mobile terminal, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in noisy environments.
Optionally, as shown in Fig. 10, the mobile terminal further includes: a memory 503, configured to store the empirical data of confidence value adjustments and the confidence threshold.
The processor 502 is specifically configured to: obtain the noise scenario according to the voice data; obtain the confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and the empirical data; adjust the first confidence value according to the confidence value adjustment to obtain the second confidence value; and, if the second confidence value is greater than or equal to the confidence threshold, process the voice data.
The above structure may be used to perform the methods of Embodiment 1, Embodiment 2, and Embodiment 3; for the specific methods, see the descriptions in Embodiment 1, Embodiment 2, and Embodiment 3, which are not repeated here.
The technical solution of the present invention provides a mobile terminal that obtains a noise scenario and obtains a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This mobile terminal, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in noisy environments.
Embodiment 6: As shown in Fig. 11, this embodiment describes the embodiments of the present invention in detail by taking a mobile phone as an example. It should be understood that the illustrated phone is merely one example; a phone may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits. Fig. 11 is a schematic structural diagram of the phone provided in this embodiment of the present invention. As shown in Fig. 11, the phone includes: a touchscreen 41, a memory 42, a CPU 43, a power management chip 44, an RF circuit 45, a peripheral interface 46, an audio circuit 47, a microphone 48, and an I/O subsystem 49. The touchscreen 41 is the input interface and output interface between the phone and the user; in addition to obtaining the user's touch information and control instructions, it presents visual output to the user, which may include graphics, text, icons, video, and so on. The memory 42 may be configured to store the empirical data of confidence value adjustments and the confidence threshold for use by the CPU 43 during processing. The memory 42 may be accessed by the CPU 43, the peripheral interface 46, and so on, and may include high-speed random access memory as well as non-volatile memory, for example one or more magnetic disk storage devices, flash memory devices, or other volatile solid-state storage devices. The CPU 43 may be configured to process the voice data obtained by the audio circuit 47 and the microphone 48, obtain a noise scenario and a first confidence value according to the voice data, and adjust the first confidence value according to the noise scenario and the empirical data of confidence value adjustments pre-stored in the memory 42 to obtain a second confidence value.
The CPU 43 is the control center of the phone, connecting all parts of the phone through various interfaces and lines. By running or executing software programs and/or modules stored in the memory 42 and invoking data stored in the memory 42, it performs the phone's functions and processes data, thereby monitoring the phone as a whole. Optionally, the CPU 43 may include one or more processing units; preferably, the CPU 43 may integrate an application processor and a modem processor, where, optionally, the application processor mainly handles the operating system, user interface, application programs, and so on, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the CPU 43. It should also be understood that the above function is only one of the functions the CPU 43 can perform; the embodiments of the present invention place no limitation on its other functions.
The power management chip 44 may be configured to supply power to, and manage the power of, the hardware connected to the CPU 43, the I/O subsystem 49, and the peripheral interface 46. The RF circuit 45 is mainly used to establish communication between the phone and the wireless network (that is, the network side), implementing data reception and transmission between the phone and the wireless network, for example sending and receiving short messages, e-mail, and so on. Specifically, the RF circuit 45 obtains and sends RF signals, which are also called electromagnetic signals; the RF circuit 45 converts electrical signals into electromagnetic signals or electromagnetic signals into electrical signals, and communicates with communication networks and other devices through the electromagnetic signals. The RF circuit 45 may include known circuits for performing these functions, including but not limited to an antenna system, an RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a CODEC chipset, a Subscriber Identity Module (SIM), and so on.
The peripheral interface 46 may connect the input and output peripherals of the device to the CPU 43 and the memory 42.
The audio circuit 47 may mainly be used to obtain audio data from the peripheral interface 46 and convert the audio data into an electrical signal.
The microphone 48 may be configured to obtain voice data.
The I/O subsystem 49 may control the input and output peripherals on the device; the I/O subsystem 49 may include a display controller 491 and one or more input controllers 492 for controlling other input/control devices. Optionally, the one or more input controllers 492 obtain electrical signals from, or send electrical signals to, other input/control devices, which may include physical buttons (press buttons, rocker buttons, etc.), dials, slide switches, joysticks, and click wheels. It is worth noting that an input controller 492 may be connected to any of the following: a keyboard, an infrared port, a USB interface, or a pointing device such as a mouse. The display controller 491 in the I/O subsystem 49 obtains electrical signals from, or sends electrical signals to, the touchscreen 41. The touchscreen 41 obtains contacts on the touchscreen, and the display controller 491 converts the obtained contacts into interaction with the user interface objects presented on the touchscreen 41, thereby implementing human-machine interaction; the user interface objects presented on the touchscreen 41 may be icons for running games, icons for connecting to corresponding networks, filter modes, and so on. It is worth noting that the device may also include an optical mouse, which is a touch-sensitive surface that does not present visual output, or an extension of the touch-sensitive surface formed by the touchscreen.
The microphone 48 obtains voice data and sends the voice data to the CPU 43 through the peripheral interface 46 and the audio circuit 47. The CPU 43 may be configured to process the voice data, obtain a noise scenario and a first confidence value according to the voice data, adjust the first confidence value according to the noise scenario and the empirical data of confidence value adjustments pre-stored in the memory 42 to obtain a second confidence value, and, if the second confidence value is greater than or equal to a pre-stored confidence threshold, process the voice data.
The above structure may be used to perform the methods of Embodiment 1, Embodiment 2, and Embodiment 3; for the specific methods, see the descriptions in Embodiment 1, Embodiment 2, and Embodiment 3, which are not repeated here.
The technical solution of the present invention provides a speech recognition phone that obtains a noise scenario and obtains a second confidence value according to the noise scenario and pre-stored empirical data of confidence value adjustments. This phone, which flexibly adjusts the confidence value according to the noise scenario, greatly improves the speech recognition rate in noisy environments. From the description of the above implementations, persons skilled in the art can clearly understand that the embodiments of the present invention may be implemented in hardware, in firmware, or in a combination thereof. When implemented in software, the above functions may be stored in a device-readable medium or transmitted as one or more instructions or code on a device-readable medium. Device-readable media include device storage media and communication media, where, optionally, communication media include any medium that facilitates transfer of a device program from one place to another. A storage medium may be any available medium that a device can access. By way of example and not limitation, device-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a device. In addition, any connection may appropriately be termed a device-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of the medium to which they belong. As used in the embodiments of the present invention, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where, optionally, disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of device-readable media. In summary, the above descriptions are merely embodiments of the technical solutions of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A speech recognition method, comprising: obtaining voice data;
obtaining a first confidence value according to the voice data;
obtaining a noise scenario according to the voice data; obtaining, according to the first confidence value, a second confidence value corresponding to the noise scenario; and, if the second confidence value is greater than or equal to a pre-stored confidence threshold, processing the voice data.
2. The method according to claim 1, wherein the noise scenario specifically comprises: a noise type; and
a noise level.
3. The method according to claim 2, wherein the noise scenario comprises a noise type, and obtaining the noise scenario according to the voice data specifically comprises:
obtaining, according to the voice data, frequency cepstral coefficients of the noise in the voice data; and obtaining the noise type of the voice data according to the frequency cepstral coefficients of the noise and a pre-built noise type model.
4. The method according to claim 3, wherein building the noise type model specifically comprises:
obtaining noise data;
obtaining frequency cepstral coefficients of the noise data according to the noise data; and
processing the frequency cepstral coefficients according to the EM algorithm to build the noise type model.
5. The method according to claim 3 or 4, wherein the noise type model is a Gaussian mixture model.
6. The method according to claim 2, wherein the noise scenario comprises a noise level, and obtaining the noise scenario according to the voice data specifically comprises:
obtaining characteristic parameters of the voice data according to the voice data; performing voice activity detection according to the characteristic parameters; and
obtaining the noise level according to the result of the voice activity detection.
7. The method according to claim 2 or 6, wherein the noise level specifically comprises:
a signal-to-noise ratio; and a noise energy level.
8. The method according to any one of claims 1 to 7, wherein obtaining, according to the first confidence value, the second confidence value corresponding to the noise scenario specifically comprises:
obtaining the confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and pre-stored empirical data of confidence value adjustments; and
adjusting the first confidence value according to the confidence value adjustment to obtain the second confidence value;
wherein the adjustment comprises: increasing, decreasing, or keeping unchanged.
9. The method according to any one of claims 1 to 8, further comprising: prompting the user if the second confidence value is less than the confidence threshold.
10. A speech recognition apparatus, comprising: an obtaining unit, configured to obtain voice data;
a first confidence value unit, configured to receive the voice data obtained by the obtaining unit and obtain a first confidence value according to the voice data;
a noise scenario unit, configured to receive the voice data obtained by the obtaining unit and obtain a noise scenario according to the voice data;
a second confidence value unit, configured to receive the noise scenario from the noise scenario unit and the first confidence value from the first confidence value unit, and obtain, according to the first confidence value, a second confidence value corresponding to the noise scenario; and
a processing unit, configured to receive the second confidence value obtained by the second confidence value unit and, if the second confidence value is greater than or equal to a pre-stored confidence threshold, process the voice data.
11. The apparatus according to claim 10, further comprising: a modeling unit, configured to obtain noise data, obtain frequency cepstral coefficients of the noise data according to the noise data, and process the frequency cepstral coefficients according to the EM algorithm to build a noise type model.
12. The apparatus according to claim 11, wherein the noise scenario unit specifically comprises:
a noise type unit, configured to obtain, according to the voice data from the obtaining unit, frequency cepstral coefficients of the noise in the voice data, and obtain the noise type of the voice data according to the frequency cepstral coefficients of the noise and the noise type model of the modeling unit.
13. The apparatus according to any one of claims 10 to 12, wherein the noise scenario unit further comprises:
a noise level unit, configured to obtain, according to the voice data from the obtaining unit, characteristic parameters of the voice data, perform voice activity detection according to the characteristic parameters, and obtain the noise level according to the result of the voice activity detection.
14. The apparatus according to any one of claims 10 to 13, further comprising:
a storage unit, configured to store the confidence threshold and the empirical data of confidence value adjustments.
15. The apparatus according to claim 14, wherein the second confidence value unit is specifically configured to:
obtain the confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and the empirical data; and
adjust the first confidence value according to the confidence value adjustment to obtain the second confidence value;
wherein the adjustment comprises: increasing, decreasing, or keeping unchanged.
16. A mobile terminal, comprising a processor and a microphone, wherein the microphone is configured to obtain voice data; and
the processor is configured to obtain a first confidence value according to the voice data, obtain a noise scenario according to the voice data, obtain, according to the first confidence value, a second confidence value corresponding to the noise scenario, and, if the second confidence value is greater than or equal to a pre-stored confidence threshold, process the voice data.
17. The mobile terminal according to claim 16, further comprising: a memory, configured to store the empirical data of confidence value adjustments and the confidence threshold.
18. The mobile terminal according to claim 17, wherein the processor is specifically configured to: obtain the first confidence value according to the voice data;
obtain the noise scenario according to the voice data;
obtain the confidence value adjustment corresponding to the noise scenario according to the correspondence between the noise scenario and the empirical data;
adjust the first confidence value according to the confidence value adjustment to obtain the second confidence value; and
if the second confidence value is greater than or equal to the confidence threshold, process the voice data.
PCT/CN2013/077529 2013-01-24 2013-06-20 一种语音识别的方法、装置 WO2014114049A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310027326.9A CN103065631B (zh) 2013-01-24 2013-01-24 一种语音识别的方法、装置
CN201310027326.9 2013-01-24

Publications (1)

Publication Number Publication Date
WO2014114049A1 true WO2014114049A1 (zh) 2014-07-31

Family

ID=48108231

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/077529 WO2014114049A1 (zh) 2013-01-24 2013-06-20 一种语音识别的方法、装置

Country Status (5)

Country Link
US (1) US9607619B2 (zh)
EP (1) EP2760018B1 (zh)
JP (1) JP6099556B2 (zh)
CN (1) CN103065631B (zh)
WO (1) WO2014114049A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022222045A1 (zh) * 2021-04-20 2022-10-27 华为技术有限公司 语音信息处理方法及设备

Families Citing this family (42)

Publication number Priority date Publication date Assignee Title
CN103065631B (zh) * 2013-01-24 2015-07-29 华为终端有限公司 一种语音识别的方法、装置
CN103971680B (zh) 2013-01-24 2018-06-05 华为终端(东莞)有限公司 一种语音识别的方法、装置
US9240182B2 (en) * 2013-09-17 2016-01-19 Qualcomm Incorporated Method and apparatus for adjusting detection threshold for activating voice assistant function
CN104637495B (zh) * 2013-11-08 2019-03-26 宏达国际电子股份有限公司 电子装置以及音频信号处理方法
GB2523984B (en) * 2013-12-18 2017-07-26 Cirrus Logic Int Semiconductor Ltd Processing received speech data
CN103680493A (zh) * 2013-12-19 2014-03-26 百度在线网络技术(北京)有限公司 区分地域性口音的语音数据识别方法和装置
US9384738B2 (en) 2014-06-24 2016-07-05 Google Inc. Dynamic threshold for speaker verification
CN104078040A (zh) * 2014-06-26 2014-10-01 美的集团股份有限公司 语音识别方法及系统
CN104078041B (zh) * 2014-06-26 2018-03-13 美的集团股份有限公司 语音识别方法及系统
CN105224844B (zh) * 2014-07-01 2020-01-24 腾讯科技(深圳)有限公司 验证方法、系统和装置
CN105989838B (zh) * 2015-01-30 2019-09-06 展讯通信(上海)有限公司 语音识别方法及装置
US9769564B2 (en) 2015-02-11 2017-09-19 Google Inc. Methods, systems, and media for ambient background noise modification based on mood and/or behavior information
US10223459B2 (en) 2015-02-11 2019-03-05 Google Llc Methods, systems, and media for personalizing computerized services based on mood and/or behavior information from multiple data sources
US10284537B2 (en) 2015-02-11 2019-05-07 Google Llc Methods, systems, and media for presenting information related to an event based on metadata
US11392580B2 (en) 2015-02-11 2022-07-19 Google Llc Methods, systems, and media for recommending computerized services based on an animate object in the user's environment
US11048855B2 (en) 2015-02-11 2021-06-29 Google Llc Methods, systems, and media for modifying the presentation of contextually relevant documents in browser windows of a browsing application
US9685156B2 (en) * 2015-03-12 2017-06-20 Sony Mobile Communications Inc. Low-power voice command detector
CN105405298B (zh) * 2015-12-24 2018-01-16 浙江宇视科技有限公司 一种车牌标识的识别方法和装置
CN106971717A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 机器人与网络服务器协作处理的语音识别方法、装置
CN106971715A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 一种应用于机器人的语音识别装置
US9972322B2 (en) * 2016-03-29 2018-05-15 Intel Corporation Speaker recognition using adaptive thresholding
CN107437412B (zh) * 2016-05-25 2021-06-29 北京搜狗科技发展有限公司 一种声学模型处理方法、语音合成方法、装置及相关设备
CN106384594A (zh) * 2016-11-04 2017-02-08 湖南海翼电子商务股份有限公司 语音识别的车载终端及其方法
WO2018090252A1 (zh) * 2016-11-16 2018-05-24 深圳达闼科技控股有限公司 机器人语音指令识别的方法及相关机器人装置
CN107945793A (zh) * 2017-12-25 2018-04-20 广州势必可赢网络科技有限公司 一种语音激活检测方法及装置
CN108831487B (zh) * 2018-06-28 2020-08-18 深圳大学 声纹识别方法、电子装置及计算机可读存储介质
CN108986791B (zh) * 2018-08-10 2021-01-05 南京航空航天大学 针对民航陆空通话领域的中英文语种语音识别方法及系统
CN109065046A (zh) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 语音唤醒的方法、装置、电子设备及计算机可读存储介质
CN109346071A (zh) * 2018-09-26 2019-02-15 出门问问信息科技有限公司 唤醒处理方法、装置及电子设备
CN109545238B (zh) * 2018-12-11 2022-05-10 珠海一微半导体股份有限公司 一种基于清洁机器人的语音去噪装置
CN109658943B (zh) * 2019-01-23 2023-04-14 平安科技(深圳)有限公司 一种音频噪声的检测方法、装置、存储介质和移动终端
CN110602391B (zh) * 2019-08-30 2021-08-24 Oppo广东移动通信有限公司 拍照控制方法、装置、存储介质及电子设备
CN112687298A (zh) * 2019-10-18 2021-04-20 Oppo广东移动通信有限公司 语音唤醒优化方法及装置、系统、存储介质和电子设备
CN112767965B (zh) * 2019-11-01 2023-01-17 博泰车联网科技(上海)股份有限公司 噪声识别模型的生成/应用方法、系统、介质及服务/终端
CN112868061A (zh) * 2019-11-29 2021-05-28 深圳市大疆创新科技有限公司 环境检测方法、电子设备及计算机可读存储介质
CN111326148B (zh) * 2020-01-19 2021-02-23 北京世纪好未来教育科技有限公司 置信度校正及其模型训练方法、装置、设备及存储介质
CN111462737B (zh) * 2020-03-26 2023-08-08 中国科学院计算技术研究所 一种训练用于语音分组的分组模型的方法和语音降噪方法
CN112201270B (zh) * 2020-10-26 2023-05-23 平安科技(深圳)有限公司 语音噪声的处理方法、装置、计算机设备及存储介质
CN112466280B (zh) * 2020-12-01 2021-12-24 北京百度网讯科技有限公司 语音交互方法、装置、电子设备和可读存储介质
CN113380253A (zh) * 2021-06-21 2021-09-10 紫优科技(深圳)有限公司 一种基于云计算和边缘计算的语音识别系统、设备及介质
CN113380254B (zh) * 2021-06-21 2024-05-24 枣庄福缘网络科技有限公司 一种基于云计算和边缘计算的语音识别方法、设备及介质
CN115132197B (zh) * 2022-05-27 2024-04-09 腾讯科技(深圳)有限公司 数据处理方法、装置、电子设备、程序产品及介质

Citations (7)

Publication number Priority date Publication date Assignee Title
CN1708782A (zh) * 2002-11-02 2005-12-14 皇家飞利浦电子股份有限公司 用于操作语音识别系统的方法
CN101051461A (zh) * 2006-04-06 2007-10-10 株式会社东芝 特征向量补偿装置和特征向量补偿方法
CN101197130A (zh) * 2006-12-07 2008-06-11 华为技术有限公司 声音活动检测方法和声音活动检测器
US7536301B2 (en) * 2005-01-03 2009-05-19 Aai Corporation System and method for implementing real-time adaptive threshold triggering in acoustic detection systems
CN101593522A (zh) * 2009-07-08 2009-12-02 清华大学 一种全频域数字助听方法和设备
CN102693724A (zh) * 2011-03-22 2012-09-26 张燕 一种基于神经网络的高斯混合模型的噪声分类方法
CN103065631A (zh) * 2013-01-24 2013-04-24 华为终端有限公司 一种语音识别的方法、装置

Family Cites Families (47)

Publication number Priority date Publication date Assignee Title
US5970446A (en) * 1997-11-25 1999-10-19 At&T Corp Selective noise/channel/coding models and recognizers for automatic speech recognition
US6466906B2 (en) * 1999-01-06 2002-10-15 Dspc Technologies Ltd. Noise padding and normalization in dynamic time warping
US6434520B1 (en) * 1999-04-16 2002-08-13 International Business Machines Corporation System and method for indexing and querying audio archives
JP2001075595A (ja) 1999-09-02 2001-03-23 Honda Motor Co Ltd 車載用音声認識装置
US6735562B1 (en) 2000-06-05 2004-05-11 Motorola, Inc. Method for estimating a confidence measure for a speech recognition system
JP4244514B2 (ja) * 2000-10-23 2009-03-25 セイコーエプソン株式会社 音声認識方法および音声認識装置
US8023622B2 (en) * 2000-12-21 2011-09-20 Grape Technology Group, Inc. Technique for call context based advertising through an information assistance service
JP2003177781A (ja) 2001-12-12 2003-06-27 Advanced Telecommunication Research Institute International 音響モデル生成装置及び音声認識装置
JP3826032B2 (ja) 2001-12-28 2006-09-27 株式会社東芝 音声認識装置、音声認識方法及び音声認識プログラム
JP2003241788A (ja) 2002-02-20 2003-08-29 Ntt Docomo Inc 音声認識装置及び音声認識システム
US7502737B2 (en) * 2002-06-24 2009-03-10 Intel Corporation Multi-pass recognition of spoken dialogue
US7103541B2 (en) * 2002-06-27 2006-09-05 Microsoft Corporation Microphone array signal enhancement using mixture models
JP4109063B2 (ja) * 2002-09-18 2008-06-25 パイオニア株式会社 音声認識装置及び音声認識方法
JP4357867B2 (ja) * 2003-04-25 2009-11-04 パイオニア株式会社 音声認識装置、音声認識方法、並びに、音声認識プログラムおよびそれを記録した記録媒体
JP2004325897A (ja) * 2003-04-25 2004-11-18 Pioneer Electronic Corp 音声認識装置及び音声認識方法
WO2005041170A1 (en) * 2003-10-24 2005-05-06 Nokia Corpration Noise-dependent postfiltering
US8005668B2 (en) * 2004-09-22 2011-08-23 General Motors Llc Adaptive confidence thresholds in telematics system speech recognition
KR100745976B1 (ko) * 2005-01-12 2007-08-06 삼성전자주식회사 음향 모델을 이용한 음성과 비음성의 구분 방법 및 장치
US7949533B2 (en) * 2005-02-04 2011-05-24 Vococollect, Inc. Methods and systems for assessing and improving the performance of a speech recognition system
US8311819B2 (en) * 2005-06-15 2012-11-13 Qnx Software Systems Limited System for detecting speech with background voice estimates and noise estimates
US20070055519A1 (en) * 2005-09-02 2007-03-08 Microsoft Corporation Robust bandwith extension of narrowband signals
JP2008009153A (ja) 2006-06-29 2008-01-17 Xanavi Informatics Corp 音声対話システム
US8560316B2 (en) * 2006-12-19 2013-10-15 Robert Vogt Confidence levels for speaker recognition
US8140325B2 (en) * 2007-01-04 2012-03-20 International Business Machines Corporation Systems and methods for intelligent control of microphones for speech recognition applications
CN101320559B (zh) * 2007-06-07 2011-05-18 华为技术有限公司 一种声音激活检测装置及方法
US7881929B2 (en) * 2007-07-25 2011-02-01 General Motors Llc Ambient noise injection for use in speech recognition
US7856353B2 (en) * 2007-08-07 2010-12-21 Nuance Communications, Inc. Method for processing speech signal data with reverberation filtering
US8306817B2 (en) * 2008-01-08 2012-11-06 Microsoft Corporation Speech recognition with non-linear noise reduction on Mel-frequency cepstra
JPWO2010128560A1 (ja) * 2009-05-08 2012-11-01 パイオニア株式会社 音声認識装置、音声認識方法、及び音声認識プログラム
CA2778343A1 (en) * 2009-10-19 2011-04-28 Martin Sehlstedt Method and voice activity detector for a speech encoder
US8632465B1 (en) * 2009-11-03 2014-01-21 Vivaquant Llc Physiological signal denoising
DK2352312T3 (da) * 2009-12-03 2013-10-21 Oticon As Fremgangsmåde til dynamisk undertrykkelse af omgivende akustisk støj, når der lyttes til elektriske input
JP5621783B2 (ja) 2009-12-10 2014-11-12 日本電気株式会社 音声認識システム、音声認識方法および音声認識プログラム
US8626498B2 (en) * 2010-02-24 2014-01-07 Qualcomm Incorporated Voice activity detection based on plural voice activity detectors
JPWO2011122522A1 (ja) 2010-03-30 2013-07-08 日本電気株式会社 感性表現語選択システム、感性表現語選択方法及びプログラム
US8234111B2 (en) * 2010-06-14 2012-07-31 Google Inc. Speech and noise models for speech recognition
JP5200080B2 (ja) 2010-09-29 2013-05-15 日本電信電話株式会社 音声認識装置、音声認識方法、およびそのプログラム
US8886532B2 (en) * 2010-10-27 2014-11-11 Microsoft Corporation Leveraging interaction context to improve recognition confidence scores
EP2678861B1 (en) * 2011-02-22 2018-07-11 Speak With Me, Inc. Hybridized client-server speech recognition
US10418047B2 (en) * 2011-03-14 2019-09-17 Cochlear Limited Sound processing with increased noise suppression
US8731936B2 (en) * 2011-05-26 2014-05-20 Microsoft Corporation Energy-efficient unobtrusive identification of a speaker
JP2013114518A (ja) * 2011-11-29 2013-06-10 Sony Corp 画像処理装置、および画像処理方法、並びにプログラム
US20130144618A1 (en) * 2011-12-02 2013-06-06 Liang-Che Sun Methods and electronic devices for speech recognition
US8930187B2 (en) * 2012-01-03 2015-01-06 Nokia Corporation Methods, apparatuses and computer program products for implementing automatic speech recognition and sentiment detection on a device
US20130211832A1 (en) * 2012-02-09 2013-08-15 General Motors Llc Speech signal processing responsive to low noise levels
CN103578468B (zh) * 2012-08-01 2017-06-27 联想(北京)有限公司 一种语音识别中置信度阈值的调整方法及电子设备
US8571871B1 (en) * 2012-10-02 2013-10-29 Google Inc. Methods and systems for adaptation of synthetic speech in an environment


Also Published As

Publication number Publication date
JP6099556B2 (ja) 2017-03-22
CN103065631B (zh) 2015-07-29
CN103065631A (zh) 2013-04-24
US9607619B2 (en) 2017-03-28
EP2760018A1 (en) 2014-07-30
JP2014142627A (ja) 2014-08-07
EP2760018B1 (en) 2017-10-25
US20140207460A1 (en) 2014-07-24

Similar Documents

Publication Publication Date Title
WO2014114049A1 (zh) 一种语音识别的方法、装置
JP6393730B2 (ja) 音声識別方法および装置
CN110310623B (zh) 样本生成方法、模型训练方法、装置、介质及电子设备
US9940935B2 (en) Method and device for voiceprint recognition
WO2020073694A1 (zh) 一种声纹识别的方法、模型训练的方法以及服务器
JP6350148B2 (ja) 話者インデキシング装置、話者インデキシング方法及び話者インデキシング用コンピュータプログラム
WO2020181824A1 (zh) 声纹识别方法、装置、设备以及计算机可读存储介质
WO2014114116A1 (en) Method and system for voiceprint recognition
US20120143608A1 (en) Audio signal source verification system
WO2014153800A1 (zh) 语音识别系统
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN113077798B (zh) 一种居家老人呼救设备
CN113643693A (zh) 以声音特征为条件的声学模型
CN110992940B (zh) 语音交互的方法、装置、设备和计算机可读存储介质
WO2017177629A1 (zh) 远讲语音识别方法及装置
JP6268916B2 (ja) 異常会話検出装置、異常会話検出方法及び異常会話検出用コンピュータプログラム
WO2019041871A1 (zh) 语音对象识别方法及装置
CN112509556B (zh) 一种语音唤醒方法及装置
CN112382296A (zh) 一种声纹遥控无线音频设备的方法和装置
CN116741193B (zh) 语音增强网络的训练方法、装置、存储介质及计算机设备
CN115410586A (zh) 音频处理方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13872870

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13872870

Country of ref document: EP

Kind code of ref document: A1