WO2017094121A1 - Voice recognition device, voice emphasis device, voice recognition method, voice emphasis method, and navigation system - Google Patents

Voice recognition device, voice emphasis device, voice recognition method, voice emphasis method, and navigation system

Info

Publication number
WO2017094121A1
WO2017094121A1 (PCT/JP2015/083768)
Authority
WO
WIPO (PCT)
Prior art keywords
noise
noise suppression
unit
speech recognition
speech
Prior art date
Application number
PCT/JP2015/083768
Other languages
French (fr)
Japanese (ja)
Inventor
Yuuki Tachioka
Original Assignee
Mitsubishi Electric Corporation
Application filed by Mitsubishi Electric Corporation
Priority to JP2017553538A priority Critical patent/JP6289774B2/en
Priority to PCT/JP2015/083768 priority patent/WO2017094121A1/en
Priority to US15/779,315 priority patent/US20180350358A1/en
Priority to CN201580084845.6A priority patent/CN108292501A/en
Priority to KR1020187014775A priority patent/KR102015742B1/en
Priority to DE112015007163.6T priority patent/DE112015007163B4/en
Priority to TW105110250A priority patent/TW201721631A/en
Publication of WO2017094121A1 publication Critical patent/WO2017094121A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques in which the extracted parameters are prediction coefficients
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques for measuring the quality of voice signals

Definitions

The present invention relates to speech recognition and speech enhancement technology, and particularly to techniques intended for use in a variety of noise environments.
When performing speech recognition on speech with superimposed noise, it is common to apply processing that suppresses the superimposed noise (hereinafter, noise suppression processing) before the speech recognition processing. Because of the characteristics of each noise suppression process, there are noises against which a given process is effective and noises against which it is not. For example, if the noise suppression process is a spectral subtraction process that is strong against stationary noise, it is weak against non-stationary noise; conversely, a process that tracks non-stationary noise well tracks stationary noise poorly. Integration of speech recognition results, or selection among speech recognition results, has conventionally been used to address this problem.

In the conventional method, when noise-superimposed speech is input, two noise suppression units, for example one whose suppression tracks stationary noise well and one whose suppression tracks non-stationary noise well, each suppress the noise to yield two speech signals, and two speech recognition units recognize the two signals. The two speech recognition results are then either integrated using a hypothesis combination method such as ROVER (Recognizer Output Voting Error Reduction), or the result with the higher likelihood is selected, and the integrated or selected speech recognition result is output. Although this method improves recognition accuracy considerably, it increases the amount of processing required for speech recognition.
As techniques for addressing this problem, Patent Document 1 discloses a speech recognition device that calculates the likelihood of the acoustic feature parameters of the input noise with respect to each of several probabilistic speech models and selects a probabilistic acoustic model based on those likelihoods. Patent Document 2 discloses a signal identification device that removes noise from an input target signal, performs preprocessing to extract feature data representing the characteristics of the target signal, then classifies the target signal into a plurality of categories according to the shape of the clustering map of a competitive neural network, and automatically selects the processing content.
However, the technique disclosed in Patent Document 1 uses the likelihoods of the input noise's acoustic feature parameters with respect to each probabilistic speech model, so it may fail to select the noise suppression process that yields a good speech recognition rate or acoustic index. In the technique disclosed in Patent Document 2, the target signal is clustered, but the clustering is not tied to the speech recognition rate or to an acoustic index, so here too the noise suppression process that yields good results may not be selected. Furthermore, both methods require noise-suppressed speech in order to predict performance, so every candidate noise suppression process must be executed once, both during learning and during use.
The present invention has been made to solve the above problems. Its object is to select, with high accuracy and from the noisy speech data alone, a noise suppression process that yields a good speech recognition rate or acoustic index, without performing noise suppression processing at use time merely to select the noise suppression method.
A speech recognition apparatus according to the present invention includes: a plurality of noise suppression units that apply mutually different noise suppression processes to input noisy speech data; a speech recognition unit that performs speech recognition on the speech data whose noise signal has been suppressed by a noise suppression unit; a prediction unit that predicts, from the acoustic features of the input noisy speech data, the speech recognition rate that would be obtained if each of the plurality of noise suppression units processed the data; and a suppression method selection unit that, based on the speech recognition rates predicted by the prediction unit, selects from the plurality of noise suppression units the unit that will perform noise suppression on the noisy speech data.

According to the present invention, a noise suppression process that yields a good speech recognition rate or acoustic index can be selected without performing noise suppression processing merely to select the noise suppression method.
FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 1.
FIGS. 2A and 2B are diagrams showing the hardware configuration of the speech recognition apparatus according to Embodiment 1.
FIG. 3 is a flowchart showing the operation of the speech recognition apparatus according to Embodiment 1.
FIG. 4 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 2.
FIG. 5 is a flowchart showing the operation of the speech recognition apparatus according to Embodiment 2.
FIG. 6 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 3.
FIG. 7 is a diagram showing a configuration example of the recognition rate database of the speech recognition apparatus according to Embodiment 3.
FIG. 8 is a flowchart showing the operation of the speech recognition apparatus according to Embodiment 3.
FIG. 9 is a block diagram showing the configuration of the speech enhancement apparatus according to Embodiment 4.
FIG. 10 is a flowchart showing the operation of the speech enhancement apparatus according to Embodiment 4.
FIG. 11 is a functional block diagram showing the configuration of the navigation system according to Embodiment 5.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus 100 according to Embodiment 1. The speech recognition apparatus 100 includes a first prediction unit 1, a suppression method selection unit 2, a noise suppression unit 3, and a speech recognition unit 4.

The first prediction unit 1 is composed of a regressor. As the regressor, for example, a neural network (hereinafter, NN) is constructed and applied. In constructing the NN, commonly used acoustic features such as Mel-frequency cepstral coefficients (MFCC) or filter bank features are used, and an NN that, as a regressor, directly computes a speech recognition rate between 0 and 1 is built using, for example, error backpropagation. Error backpropagation is a learning method in which, given training data, the connection weights and biases between the layers are adjusted so that the error between the training data and the NN output becomes small. The first prediction unit 1 thus predicts the speech recognition rate from the input acoustic features using, for example, an NN whose input is the acoustic features and whose output is the speech recognition rate.
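As a concrete illustration, the following is a minimal sketch of such a regressor in Python (PyTorch), assuming a small feed-forward network; the class name, layer sizes, and training loop are illustrative assumptions, not taken from the patent.

```python
# A minimal sketch of the regressor in the first prediction unit: a
# feed-forward NN mapping a frame's acoustic features (e.g. MFCCs) to a
# predicted recognition rate in [0, 1] for each candidate suppression
# method. All names and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class RecognitionRatePredictor(nn.Module):
    def __init__(self, feat_dim: int = 39, num_methods: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_methods),
            nn.Sigmoid(),  # constrains each predicted rate to [0, 1]
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, feat_dim) -> (num_frames, num_methods)
        return self.net(frames)

# Training by error backpropagation against measured recognition rates:
model = RecognitionRatePredictor()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(frames: torch.Tensor, measured_rates: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(frames), measured_rates)
    loss.backward()   # backpropagate the error
    optimizer.step()  # adjust connection weights and biases
    return loss.item()
```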
The suppression method selection unit 2 refers to the speech recognition rates predicted by the first prediction unit 1, selects the noise suppression unit 3 that will perform noise suppression from the plurality of noise suppression units 3a, 3b, and 3c, and outputs a control instruction to the selected noise suppression unit 3 to perform noise suppression processing.

The noise suppression unit 3 comprises a plurality of noise suppression units 3a, 3b, and 3c, each of which applies a different noise suppression process to the input noisy speech data. As the differing noise suppression processes, for example, spectral subtraction (SS), adaptive filtering based on a learning identification method (the normalized least mean square, NLMS, algorithm), or NN-based methods such as a denoising autoencoder can be applied. Which of the noise suppression units 3a, 3b, and 3c performs the noise suppression processing is determined by the control instruction input from the suppression method selection unit 2. Although the example of FIG. 1 shows three noise suppression units 3a, 3b, and 3c, the number of units is not limited to three and can be changed as appropriate.
The speech recognition unit 4 performs speech recognition on the speech data whose noise signal has been suppressed by the noise suppression unit 3 and outputs a speech recognition result. The speech recognition processing uses, for example, an acoustic model based on a Gaussian mixture model or a deep neural network together with an n-gram language model. Since the speech recognition processing can be implemented with known techniques, its detailed description is omitted.
The first prediction unit 1, suppression method selection unit 2, noise suppression unit 3, and speech recognition unit 4 of the speech recognition apparatus 100 are realized by a processing circuit. The processing circuit may be dedicated hardware, or it may be a CPU (Central Processing Unit), processing device, or processor that executes a program stored in a memory.

FIG. 2A shows the hardware configuration of the speech recognition apparatus 100 according to Embodiment 1 when the processing circuit is realized by dedicated hardware. In this case, each function of the first prediction unit 1, suppression method selection unit 2, noise suppression unit 3, and speech recognition unit 4 is realized by the processing circuit; the functions of the units may also be realized collectively by a single processing circuit.

FIG. 2B shows the hardware configuration of the speech recognition apparatus 100 according to Embodiment 1 when the processing circuit is realized by software. In this case, each function of the first prediction unit 1, suppression method selection unit 2, noise suppression unit 3, and speech recognition unit 4 is realized by software, firmware, or a combination of software and firmware. The software and firmware are written as programs and stored in the memory 103. The processor 102 reads and executes the programs stored in the memory 103, thereby carrying out the function of each unit. The memory 103 corresponds to, for example, a nonvolatile or volatile semiconductor memory such as a RAM, a ROM, or a flash memory, a magnetic disk, or an optical disc. In this way, the processing circuit can realize the above functions by hardware, software, firmware, or a combination thereof.
The first prediction unit 1, to which the regressor is applied, is configured as an NN whose input is the acoustic features and whose output is the speech recognition rate. Using this NN, the first prediction unit 1 predicts the speech recognition rate that would result from each of the noise suppression units 3a, 3b, and 3c; that is, for each frame of acoustic features it computes the speech recognition rate expected when each of the different noise suppression processes is applied. The suppression method selection unit 2 refers to these per-unit speech recognition rates, selects the noise suppression unit 3 that leads to the speech recognition result with the highest rate, and outputs a control instruction to the selected noise suppression unit 3.
FIG. 3 is a flowchart showing the operation of the speech recognition apparatus 100 according to Embodiment 1. It is assumed that noisy speech data and its acoustic features are input to the speech recognition apparatus 100 via an external microphone or the like (step ST1), and that the acoustic features of the noisy speech data are computed by an external feature calculation unit.

The first prediction unit 1 uses the NN to predict, for each short-time Fourier transform frame of the input acoustic features, the speech recognition rate that would result if each of the noise suppression units 3a, 3b, and 3c performed noise suppression (step ST2). Step ST2 is repeated for a set number of frames. The first prediction unit 1 then takes the average, maximum, or minimum of the per-frame predictions over those frames to compute a predicted recognition rate for each of the noise suppression units 3a, 3b, and 3c (step ST3), associates each predicted recognition rate with its noise suppression unit, and outputs them to the suppression method selection unit 2 (step ST4).

The suppression method selection unit 2 refers to the predicted recognition rates output in step ST4, selects the noise suppression unit 3 showing the highest predicted recognition rate, and outputs a control instruction to the selected noise suppression unit 3 to perform noise suppression processing (step ST5). The noise suppression unit 3 that received the control instruction in step ST5 suppresses the noise signal in the actual noisy speech data input in step ST1 (step ST6). The speech recognition unit 4 performs speech recognition on the noise-suppressed speech data, obtains a speech recognition result, and outputs it (step ST7). The process then returns to step ST1 and repeats.
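A minimal sketch of steps ST2 to ST5 under the same assumptions (the regressor sketched above, illustrative helper names):

```python
# Predict per-frame rates (ST2), aggregate them over the set frames by
# mean, max, or min (ST3), and pick the unit with the best prediction (ST5).
import numpy as np
import torch

def select_suppression_unit(model, frames: np.ndarray, reduce: str = "mean") -> int:
    with torch.no_grad():
        rates = model(torch.as_tensor(frames, dtype=torch.float32)).numpy()
    agg = {"mean": np.mean, "max": np.max, "min": np.min}[reduce]
    predicted = agg(rates, axis=0)    # one predicted rate per unit (ST3)
    return int(np.argmax(predicted))  # index of unit 3a/3b/3c to run (ST5)
```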
As described above, according to Embodiment 1, the apparatus comprises: the first prediction unit 1, a regressor configured as an NN whose input is the acoustic features and whose output is the speech recognition rate; the suppression method selection unit 2, which selects from the plurality of noise suppression units 3 the unit leading to the speech recognition result with the highest rate and outputs a control instruction to it; the noise suppression unit 3, a plurality of processing units applying different noise suppression methods, which suppresses the noise in the noisy speech data based on the control instruction of the suppression method selection unit 2; and the speech recognition unit 4, which performs speech recognition on the noise-suppressed speech data. An effective noise suppression method can therefore be selected without increasing the speech recognition workload and without running noise suppression processing merely to select a method.

Conventionally, noise suppression would be performed by all three methods and the best processing chosen from the results; here, the method expected to perform best is predicted in advance, so noise suppression is performed only by the selected method, which has the advantage of reducing the amount of computation for noise suppression processing.
Embodiment 2.
FIG. 4 is a block diagram showing the configuration of the speech recognition apparatus 100a according to Embodiment 2. The speech recognition apparatus 100a of Embodiment 2 includes a second prediction unit 1a and a suppression method selection unit 2a in place of the first prediction unit 1 and suppression method selection unit 2 of the speech recognition apparatus 100 described in Embodiment 1. Parts that are the same as or correspond to the components of the speech recognition apparatus 100 of Embodiment 1 are given the same reference signs as in Embodiment 1, and their description is omitted or simplified.
The second prediction unit 1a is composed of a discriminator. As the discriminator, for example, an NN is constructed and applied. Commonly used acoustic features such as MFCCs or filter bank features are used, and an NN discriminator that performs a classification task, such as two-class or multi-class classification, and selects the ID of the suppression method with the highest recognition rate is constructed using error backpropagation. The second prediction unit 1a is configured, for example, as an NN that takes the acoustic features as input, performs two-class or multi-class classification with a softmax final output layer, and outputs (identifies) the suppression method ID that leads to the speech recognition result with the highest speech recognition rate. As teacher data for the NN, one can set only the suppression method leading to the highest speech recognition rate to 1 and the other methods to 0, or use data weighted by passing the recognition rates through a sigmoid, that is, Sigmoid((recognition rate of the method - (max(recognition rate) - min(recognition rate))/2) / σ), where σ is a scaling factor. A discriminator other than an NN, such as a support vector machine (SVM), may also be used.
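The following is a minimal sketch of such a discriminator and of the sigmoid-weighted teacher data, under the same illustrative assumptions as the earlier regressor sketch:

```python
# A softmax classifier over suppression-method IDs. The final softmax is
# left to the loss function, as is conventional; layer sizes are assumed.
import torch
import torch.nn as nn

class SuppressionMethodClassifier(nn.Module):
    def __init__(self, feat_dim: int = 39, num_methods: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_methods),  # logits over method IDs
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)

def soft_targets(rates: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Sigmoid-weighted teacher data as described in the text:
    # sigmoid((r_k - (max(r) - min(r)) / 2) / sigma), sigma a scaling factor.
    center = (rates.max(dim=-1, keepdim=True).values
              - rates.min(dim=-1, keepdim=True).values) / 2
    return torch.sigmoid((rates - center) / sigma)
```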
The suppression method selection unit 2a refers to the suppression method ID predicted by the second prediction unit 1a and selects, from the plurality of noise suppression units 3a, 3b, and 3c, the noise suppression unit 3 that will perform noise suppression. As in Embodiment 1, the noise suppression unit 3 can employ spectral subtraction (SS), adaptive filtering, NN-based methods, and the like. The suppression method selection unit 2a outputs a control instruction to the selected noise suppression unit 3 to perform noise suppression processing.
FIG. 5 is a flowchart showing the operation of the speech recognition apparatus 100a according to Embodiment 2. Steps identical to those of the speech recognition apparatus 100 of Embodiment 1 are given the same reference signs as in FIG. 3, and their description is omitted or simplified. It is assumed that noisy speech data and its acoustic features are input to the speech recognition apparatus 100a via an external microphone or the like.

The second prediction unit 1a uses the NN to predict, for each short-time Fourier transform frame of the input acoustic features, the suppression method ID of the noise suppression method leading to the speech recognition result with the highest recognition rate (step ST11). The second prediction unit 1a then takes the mode or average of the suppression method IDs predicted per frame in step ST11 and adopts it as the predicted suppression method ID (step ST12). The suppression method selection unit 2a refers to the predicted suppression method ID obtained in step ST12, selects the corresponding noise suppression unit 3, and outputs a control instruction to the selected noise suppression unit 3 to perform noise suppression processing (step ST13). Thereafter, the same processing as in steps ST6 and ST7 described in Embodiment 1 is performed.
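A minimal sketch of steps ST11 and ST12, reducing the per-frame method IDs to one utterance-level prediction by taking the mode (illustrative names):

```python
import torch

def predicted_method_id(classifier, frames: torch.Tensor) -> int:
    with torch.no_grad():
        frame_ids = classifier(frames).argmax(dim=-1)  # ID per frame (ST11)
    return int(torch.mode(frame_ids).values)           # most frequent ID (ST12)
```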
As described above, according to Embodiment 2, the apparatus comprises: the second prediction unit 1a, a discriminator configured as an NN whose input is the acoustic features and whose output is the ID of the suppression method leading to the speech recognition result with the highest speech recognition rate; the suppression method selection unit 2a, which refers to the suppression method ID predicted by the second prediction unit 1a, selects from the plurality of noise suppression units 3 the unit leading to the highest-rate result, and outputs a control instruction to it; the noise suppression unit 3, a processing unit for each of the plurality of noise suppression processes, which suppresses the noise in the speech data based on the control instruction of the suppression method selection unit 2a; and the speech recognition unit 4, which performs speech recognition on the noise-suppressed speech data. An effective noise suppression method can therefore be selected without increasing the speech recognition workload and without running noise suppression processing merely to select a method.
Embodiment 3.
In Embodiments 1 and 2, the acoustic features were input to the first prediction unit 1 or the second prediction unit 1a for each short-time Fourier transform frame, and the speech recognition rate or the suppression method ID was estimated per input frame. Embodiment 3 shows a configuration in which, using utterance-level acoustic features, the utterance closest to the acoustic features of the noisy speech data actually input to the speech recognition apparatus is selected from previously learned data, and a noise suppression unit is selected based on the speech recognition rates of the selected utterance.
FIG. 6 is a block diagram showing the configuration of the speech recognition apparatus 100b according to Embodiment 3. The speech recognition apparatus 100b of Embodiment 3 includes, in place of the first prediction unit 1 and suppression method selection unit 2 of the speech recognition apparatus 100 described in Embodiment 1, a third prediction unit 1c comprising a feature calculation unit 5, a similarity calculation unit 6, and a recognition rate database 7, together with a suppression method selection unit 2b. Parts that are the same as or correspond to the components of the speech recognition apparatus 100 of Embodiment 1 are given the same reference signs as in Embodiment 1, and their description is omitted or simplified.
The feature calculation unit 5 constituting the third prediction unit 1c calculates an utterance-level acoustic feature from the input noisy speech data; the calculation method is described in detail later. The similarity calculation unit 6 refers to the recognition rate database 7, compares the utterance-level acoustic feature calculated by the feature calculation unit 5 with the acoustic features stored in the recognition rate database 7, and calculates their similarities. The similarity calculation unit 6 then obtains the set of speech recognition rates, for noise suppression by each of the noise suppression units 3a, 3b, and 3c, associated with the acoustic feature showing the highest of the calculated similarities. The suppression method selection unit 2b refers to the set of speech recognition rates input from the similarity calculation unit 6 and selects, from the plurality of noise suppression units 3a, 3b, and 3c, the noise suppression unit 3 that will perform noise suppression. The recognition rate database 7 is a storage area in which the acoustic features of multiple items of training data are stored in association with the speech recognition rates obtained when each item is noise-suppressed by the noise suppression units 3a, 3b, and 3c.
FIG. 7 is a diagram showing a configuration example of the recognition rate database 7 of the speech recognition apparatus 100b according to Embodiment 3. The recognition rate database 7 stores the acoustic features of the training data in association with the speech recognition rate of the speech data obtained when each training item is noise-suppressed by each noise suppression unit (the first, second, and third noise suppression units in the example of FIG. 7). In the example of FIG. 7, for the training data with the first acoustic feature V(r1), the speech recognition rate of the speech data noise-suppressed by the first noise suppression unit is 80%, that of the speech data noise-suppressed by the second noise suppression unit is 75%, and that of the speech data noise-suppressed by the third noise suppression unit is 78%. The recognition rate database 7 may also be configured to cluster the training data and store the recognition rates and acoustic features of the clustered training data in association with each other, keeping the amount of stored data small.
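A hypothetical in-memory layout for such a database is sketched below; the feature vectors are dummy values, and the first entry's rates follow the 80%/75%/78% example above.

```python
import numpy as np

# Each entry pairs a training utterance's utterance-level feature V(r)
# with the recognition rates measured after each suppression unit.
recognition_rate_db = [
    # (feature vector, [unit 3a rate, unit 3b rate, unit 3c rate])
    (np.array([0.12, -0.53, 0.88]), [0.80, 0.75, 0.78]),
    (np.array([0.47, 0.09, -0.31]), [0.62, 0.81, 0.70]),
]
```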
As the utterance-level acoustic feature, an average vector of frame-level acoustic features, an average likelihood vector from a universal background model (UBM), an i-vector, or the like can be applied. The feature calculation unit 5 calculates such an utterance-level acoustic feature for each item of noisy speech data to be recognized. For example, when the i-vector is applied as the acoustic feature, a Gaussian mixture model (GMM) is adapted to the utterance r, and the obtained supervector V(r) is related to the UBM supervector obtained in advance (in the standard i-vector model, V(r) = m + Tw(r), where m is the UBM supervector, T is the total variability matrix, and w(r) is the i-vector). The similarity between utterance-level acoustic features is measured using the Euclidean distance or the cosine similarity, as in equation (2), and the utterance r't closest to the current evaluation data re is selected from among the training data rt:

sim(V(re), V(rt)) = V(re) · V(rt) / (||V(re)|| ||V(rt)||)   (2)

Writing the similarity as sim, the selected utterance is the one given by equation (3):

r't = argmax over rt of sim(V(re), V(rt))   (3)
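A sketch of this selection, corresponding to equations (2) and (3) and reusing the database layout above (illustrative names):

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    # Equation (2): cosine similarity between utterance-level features.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def select_unit_by_similarity(v_eval: np.ndarray, database) -> int:
    # Equation (3): pick the most similar training utterance, then the
    # suppression unit with the best stored score for that utterance.
    feature, scores = max(database, key=lambda e: cosine_similarity(v_eval, e[0]))
    return int(np.argmax(scores))
```

For example, select_unit_by_similarity(v_eval, recognition_rate_db) returns the index of the noise suppression unit showing the highest stored recognition rate for the closest training utterance.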
FIG. 8 is a flowchart showing the operation of the speech recognition apparatus 100b according to Embodiment 3. Steps identical to those of the speech recognition apparatus 100 of Embodiment 1 are given the same reference signs as in FIG. 3, and their description is omitted or simplified. It is assumed that noisy speech data is input to the speech recognition apparatus 100b via an external microphone or the like (step ST21).

The feature calculation unit 5 calculates an acoustic feature from the input noisy speech data (step ST22). The similarity calculation unit 6 compares the acoustic feature calculated in step ST22 with the acoustic features of the training data stored in the recognition rate database 7 and calculates their similarities (step ST23). The similarity calculation unit 6 selects the acoustic feature showing the highest similarity among those calculated in step ST23 and, referring to the recognition rate database 7, obtains the set of recognition rates associated with the selected acoustic feature (step ST24). When the Euclidean distance is used as the similarity in step ST24, the set of recognition rates with the shortest distance is obtained. The suppression method selection unit 2b selects the noise suppression unit 3 showing the highest recognition rate in the set obtained in step ST24 and outputs a control instruction to the selected noise suppression unit 3 to perform noise suppression processing (step ST25). Thereafter, the same processing as in steps ST6 and ST7 described above is performed.
As described above, according to Embodiment 3, the apparatus comprises: the feature calculation unit 5, which calculates an acoustic feature from the noisy speech data; the similarity calculation unit 6, which refers to the recognition rate database 7, calculates the similarity between the calculated acoustic feature and the acoustic features of the training data, and obtains the set of speech recognition rates associated with the acoustic feature showing the highest similarity; and the suppression method selection unit 2b, which selects the noise suppression unit 3 showing the highest speech recognition rate in the obtained set. Speech recognition performance can therefore be predicted per utterance with high accuracy, and the use of fixed-dimension features makes the similarity easy to calculate.
The similarity calculation unit 6 may also be configured to refer to an external database when calculating the similarity with the acoustic features and obtaining the recognition rates. In addition, performing the prediction per utterance introduces a delay; if the delay cannot be tolerated, the acoustic feature may be calculated from only the first few seconds of the utterance and the database consulted with it. Further, when the acoustic environment has not changed since the utterance preceding the one subject to speech recognition, the selection result of the noise suppression unit 3 for the previous utterance may be reused for the speech recognition.
Embodiment 4.
FIG. 9 is a block diagram showing the configuration of the speech enhancement apparatus 200 according to Embodiment 4. In place of the third prediction unit 1c (comprising the feature calculation unit 5, the similarity calculation unit 6, and the recognition rate database 7) and the suppression method selection unit 2b of the speech recognition apparatus 100b shown in Embodiment 3, the speech enhancement apparatus 200 of Embodiment 4 includes a fourth prediction unit 1d (comprising the feature calculation unit 5, a similarity calculation unit 6a, and an acoustic index database 8) and a suppression method selection unit 2c. The speech recognition unit 4 is not provided. Parts that are the same as or correspond to the components of the speech recognition apparatus 100b of Embodiment 3 are given the same reference signs as in Embodiment 3, and their description is omitted or simplified.
The acoustic index database 8 is a storage area in which the acoustic features of multiple items of training data are stored in association with the acoustic indices obtained when each item is noise-suppressed by the noise suppression units 3a, 3b, and 3c. The acoustic index is, for example, PESQ, or an SNR/SDR calculated from the enhanced speech after noise suppression and the noisy speech before noise suppression. The acoustic index database 8 may also be configured to cluster the training data and store the acoustic indices and acoustic features of the clustered training data in association with each other, keeping the amount of stored data small.
The similarity calculation unit 6a refers to the acoustic index database 8, compares the utterance-level acoustic feature calculated by the feature calculation unit 5 with the acoustic features stored in the acoustic index database 8, and calculates their similarities. The similarity calculation unit 6a obtains the set of acoustic indices associated with the acoustic feature showing the highest of the calculated similarities and outputs it to the suppression method selection unit 2c. The suppression method selection unit 2c refers to the set of acoustic indices input from the similarity calculation unit 6a and selects, from the plurality of noise suppression units 3a, 3b, and 3c, the noise suppression unit 3 that will perform noise suppression.
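The same lookup sketch carries over to the enhancement device: the database stores acoustic indices instead of recognition rates (the PESQ-like values below are dummies), and select_unit_by_similarity from the sketch in Embodiment 3 then returns the unit with the best stored index.

```python
import numpy as np

acoustic_index_db = [
    # (feature vector, [unit 3a index, unit 3b index, unit 3c index])
    (np.array([0.12, -0.53, 0.88]), [3.1, 2.7, 2.9]),
    (np.array([0.47, 0.09, -0.31]), [2.4, 3.0, 2.8]),
]
```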
FIG. 10 is a flowchart showing the operation of the speech enhancement apparatus 200 according to Embodiment 4. It is assumed that noisy speech data is input to the speech enhancement apparatus 200 via an external microphone or the like (step ST31).

The feature calculation unit 5 calculates an acoustic feature from the input noisy speech data (step ST32). The similarity calculation unit 6a compares the acoustic feature calculated in step ST32 with the acoustic features of the training data stored in the acoustic index database 8 and calculates their similarities (step ST33). The similarity calculation unit 6a selects the acoustic feature showing the highest similarity among those calculated in step ST33 and obtains the set of acoustic indices associated with the selected acoustic feature (step ST34). The suppression method selection unit 2c selects the noise suppression unit 3 showing the highest acoustic index in the set obtained in step ST34 and outputs a control instruction to the selected noise suppression unit 3 to perform noise suppression processing (step ST35). The noise suppression unit 3 that received the control instruction in step ST35 suppresses the noise signal in the actual noisy speech data input in step ST31, and obtains and outputs the enhanced speech (step ST36). The process then returns to step ST31 and repeats.
As described above, according to Embodiment 4, the apparatus comprises: the feature calculation unit 5, which calculates an acoustic feature from the noisy speech data; the similarity calculation unit 6a, which refers to the acoustic index database 8, calculates the similarity between the calculated acoustic feature and the acoustic features of the training data, and obtains the set of acoustic indices associated with the acoustic feature showing the highest similarity; and the suppression method selection unit 2c, which selects the noise suppression unit 3 showing the highest acoustic index in the obtained set. Enhancement performance can therefore be predicted per utterance with high accuracy, and the use of fixed-dimension features makes the similarity easy to calculate.
The similarity calculation unit 6a may also be configured to refer to an external database when calculating the similarity with the acoustic features and obtaining the acoustic indices. In addition, performing the prediction per utterance introduces a delay; if the delay cannot be tolerated, the acoustic feature may be calculated from only the first few seconds of the utterance and the database consulted with it. Further, when the acoustic environment has not changed since the utterance preceding the one for which enhanced speech is to be obtained, the enhanced speech may be obtained by reusing the selection result of the noise suppression unit 3 for the previous utterance.
Embodiment 5.
The speech recognition apparatuses 100, 100a, and 100b of Embodiments 1 to 3 and the speech enhancement apparatus 200 of Embodiment 4 described above can be applied to, for example, a navigation system, a telephone response system, or an elevator having a voice call function.
FIG. 11 is a functional block diagram showing the configuration of the navigation system 300 according to Embodiment 5.
The navigation system 300 is a device that is mounted on a vehicle, for example, and provides route guidance to a destination; it comprises an information acquisition device 301, a control device 302, an output device 303, an input device 304, the speech recognition device 100, a map database 305, a route calculation device 306, and a route guidance device 307. The operation of each device of the navigation system 300 is centrally controlled by the control device 302.

The information acquisition device 301 comprises, for example, current position detection means, wireless communication means, and surrounding information detection means, and acquires the current position of the host vehicle and information detected around the host vehicle and from other vehicles. The output device 303 comprises, for example, a display unit, a display control unit, a voice output unit, and a voice control unit, and notifies the user of information. The input device 304 is realized by voice input means such as a microphone and operation input means such as buttons and a touch panel, and receives information input from the user. The speech recognition device 100 is the speech recognition device having the configuration and functions described in Embodiment 1; it performs speech recognition on the noisy speech data input via the input device 304, obtains a speech recognition result, and outputs it to the control device 302. The map database 305 is a storage area for map data and is realized as a storage device such as an HDD (Hard Disk Drive) or a RAM (Random Access Memory). The route calculation device 306 takes the current position of the vehicle acquired by the information acquisition device 301 as the departure point and the speech recognition result of the speech recognition device 100 as the destination, and calculates the route from the departure point to the destination based on the map data stored in the map database 305. The route guidance device 307 guides the host vehicle along the route calculated by the route calculation device 306.

When noisy speech data containing the user's utterance is input from the microphone constituting the input device 304, the speech recognition device 100 applies the processing shown in the flowchart of FIG. 3 to the noisy speech data and obtains a speech recognition result. Based on the information input from the control device 302 and the information acquisition device 301, the route calculation device 306 takes the current position of the vehicle acquired by the information acquisition device 301 as the departure point and the information indicated by the speech recognition result as the destination, and calculates the route from the departure point to the destination based on the map data. The route guidance device 307 outputs guidance information for the calculated route via the output device 303 and performs route guidance for the user.

In this way, for the noisy speech data containing the user's utterance input to the input device 304, the speech recognition apparatus 100 performs speech recognition using the noise suppression unit 3 predicted to yield a speech recognition result with a favorable speech recognition rate, so the route can be calculated from a speech recognition result with a good recognition rate, and route guidance can be performed based on it.
In Embodiment 5, the configuration in which the speech recognition device 100 described in Embodiment 1 is applied to the navigation system 300 has been described; however, the speech recognition device 100a described in Embodiment 2, the speech recognition device 100b described in Embodiment 3, or the speech enhancement device 200 described in Embodiment 4 may be applied instead. When the speech enhancement device 200 is applied to the navigation system 300, the navigation system 300 side is provided with a function for recognizing the enhanced speech.
Within the scope of the invention, the embodiments may be freely combined, and any component of any embodiment may be modified or omitted.

Since the speech recognition apparatus and speech enhancement apparatus according to the present invention can select a noise suppression method that yields a favorable speech recognition rate or acoustic index, they can be applied to devices having a call function, such as navigation systems, telephone response systems, and elevators.

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Navigation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

This invention comprises: a plurality of noise suppression units (3) that perform mutually different noise suppression processes on input noisy speech data; a speech recognition unit (4) that performs speech recognition on speech data in which the noise signal has been suppressed; a prediction unit (1) that predicts, from the acoustic features of the input noisy speech data, the speech recognition rate obtained when each of the plurality of noise suppression units (3) performs its noise suppression process on the noisy speech data; and a suppression method selection unit (2) that selects, on the basis of the predicted speech recognition rates, the noise suppression unit (3) that will perform the noise suppression process on the noisy speech data from the plurality of noise suppression units.

Description

Speech recognition device, speech enhancement device, speech recognition method, speech enhancement method, and navigation system
The present invention relates to speech recognition and speech enhancement technology, and particularly to techniques intended for use in a variety of noise environments.
When performing speech recognition on speech with superimposed noise, it is common to apply processing that suppresses the superimposed noise (hereinafter, noise suppression processing) before the speech recognition processing. Because of the characteristics of each noise suppression process, there are noises against which a given process is effective and noises against which it is not. For example, if the noise suppression process is a spectral subtraction process that is strong against stationary noise, it is weak against non-stationary noise; conversely, a process that tracks non-stationary noise well tracks stationary noise poorly. Integration of speech recognition results, or selection among speech recognition results, has conventionally been used to address this problem.
In the conventional method, when noise-superimposed speech is input, two noise suppression units, for example one whose suppression tracks stationary noise well and one whose suppression tracks non-stationary noise well, each suppress the noise to yield two speech signals, and two speech recognition units recognize the two signals. The two speech recognition results are then either integrated using a hypothesis combination method such as ROVER (Recognizer Output Voting Error Reduction), or the result with the higher likelihood is selected, and the integrated or selected speech recognition result is output. However, although this conventional method improves recognition accuracy considerably, it increases the amount of processing required for speech recognition.
As techniques for solving this problem, Patent Document 1, for example, discloses a speech recognition device that calculates the likelihood of the acoustic feature parameters of the input noise with respect to each of several probabilistic speech models and selects a probabilistic acoustic model based on those likelihoods. Patent Document 2 discloses a signal identification device that removes noise from an input target signal, performs preprocessing to extract feature data representing the characteristics of the target signal, then classifies the target signal into a plurality of categories according to the shape of the clustering map of a competitive neural network, and automatically selects the processing content.
Patent Document 1: JP 2000-194392 A. Patent Document 2: JP 2005-115569 A.
However, since the technique disclosed in Patent Document 1 uses the likelihoods of the input noise's acoustic feature parameters with respect to each probabilistic speech model, it may fail to select the noise suppression process that yields a good speech recognition rate or acoustic index. In the technique disclosed in Patent Document 2, the target signal is clustered, but the clustering is not tied to the speech recognition rate or to an acoustic index, so here too the noise suppression process that yields good results may not be selected. Furthermore, both methods require noise-suppressed speech in order to predict performance, so every candidate noise suppression process must be executed once, both during learning and during use.
The present invention has been made to solve the above problems, and its object is to select, with high accuracy and from the noisy speech data alone, a noise suppression process that yields a good speech recognition rate or acoustic index, without performing noise suppression processing at use time merely to select the noise suppression method.
A speech recognition apparatus according to the present invention includes: a plurality of noise suppression units that apply mutually different noise suppression processes to input noisy speech data; a speech recognition unit that performs speech recognition on the speech data whose noise signal has been suppressed by a noise suppression unit; a prediction unit that predicts, from the acoustic features of the input noisy speech data, the speech recognition rate that would be obtained if each of the plurality of noise suppression units processed the data; and a suppression method selection unit that, based on the speech recognition rates predicted by the prediction unit, selects from the plurality of noise suppression units the unit that will perform noise suppression on the noisy speech data.
According to the present invention, a noise suppression process that yields a good speech recognition rate or acoustic index can be selected without performing noise suppression processing merely to select the noise suppression method.
FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 1.
FIGS. 2A and 2B are diagrams showing the hardware configuration of the speech recognition apparatus according to Embodiment 1.
FIG. 3 is a flowchart showing the operation of the speech recognition apparatus according to Embodiment 1.
FIG. 4 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 2.
FIG. 5 is a flowchart showing the operation of the speech recognition apparatus according to Embodiment 2.
FIG. 6 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 3.
FIG. 7 is a diagram showing a configuration example of the recognition rate database of the speech recognition apparatus according to Embodiment 3.
FIG. 8 is a flowchart showing the operation of the speech recognition apparatus according to Embodiment 3.
FIG. 9 is a block diagram showing the configuration of the speech enhancement apparatus according to Embodiment 4.
FIG. 10 is a flowchart showing the operation of the speech enhancement apparatus according to Embodiment 4.
FIG. 11 is a functional block diagram showing the configuration of the navigation system according to Embodiment 5.
Hereinafter, in order to describe the present invention in more detail, modes for carrying out the invention will be described with reference to the accompanying drawings.

Embodiment 1.

First, FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus 100 according to Embodiment 1.

The speech recognition apparatus 100 includes a first prediction unit 1, a suppression method selection unit 2, a noise suppression unit 3, and a speech recognition unit 4.

The first prediction unit 1 is implemented as a regressor. As the regressor, a neural network (hereinafter, NN) is, for example, constructed and applied. The NN takes commonly used acoustic features, such as Mel-frequency cepstral coefficients (MFCC) or filter bank features, and directly outputs a speech recognition rate between 0 and 1; it is trained, for example, by error back-propagation. Error back-propagation is a learning method that, given a piece of training data, adjusts the connection weights and biases between the layers so that the error between the training data and the NN output becomes small. Using such an NN, with acoustic features as the input and a speech recognition rate as the output, the first prediction unit 1 predicts the speech recognition rate for the input acoustic features.
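The following is a minimal sketch of such a regression network, assuming a single hidden layer and pre-trained weights W1, b1, W2, b2 obtained by error back-propagation; all function and variable names are illustrative and do not appear in the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_recognition_rates(features, W1, b1, W2, b2):
    """features: acoustic feature vector (e.g. MFCCs) for one frame.
    Returns one predicted recognition rate in [0, 1] per suppression method."""
    hidden = np.tanh(W1 @ features + b1)   # hidden layer activations
    return sigmoid(W2 @ hidden + b2)       # sigmoid keeps each output in [0, 1]
```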
The suppression method selection unit 2 refers to the speech recognition rates predicted by the first prediction unit 1 and selects, from the plurality of noise suppression units 3a, 3b, and 3c, the noise suppression unit 3 that is to perform noise suppression. The suppression method selection unit 2 outputs a control instruction to the selected noise suppression unit 3 so that it performs the noise suppression process. The noise suppression unit 3 is composed of the plurality of noise suppression units 3a, 3b, and 3c, each of which applies a different noise suppression process to the input noisy speech data. As the mutually different noise suppression processes, for example, spectral subtraction (SS), adaptive filtering using the normalized least mean square (NLMS) algorithm, and NN-based methods such as a denoising autoencoder are applicable. Which of the noise suppression units 3a, 3b, and 3c performs the noise suppression process is determined by the control instruction input from the suppression method selection unit 2. Although the example of FIG. 1 shows three noise suppression units 3a, 3b, and 3c, the number of units is not limited to three and may be changed as appropriate.
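As an illustration of this selection step, the sketch below picks the noise suppression unit with the highest predicted recognition rate; `suppressors` is a hypothetical list of callables, one per method (e.g. spectral subtraction, NLMS adaptive filter, denoising autoencoder).

```python
import numpy as np

def select_suppressor(predicted_rates, suppressors):
    # index of the method with the highest predicted recognition rate
    best = int(np.argmax(predicted_rates))
    return suppressors[best]
```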
The speech recognition unit 4 performs speech recognition on the speech data whose noise signal has been suppressed by the noise suppression unit 3, and outputs the speech recognition result. The recognition uses, for example, an acoustic model based on a Gaussian mixture model or a deep neural network together with an n-gram language model. Since the speech recognition process itself can be implemented with known techniques, its detailed description is omitted.
The first prediction unit 1, the suppression method selection unit 2, the noise suppression unit 3, and the speech recognition unit 4 of the speech recognition apparatus 100 are realized by a processing circuit. The processing circuit may be dedicated hardware, or a CPU (Central Processing Unit), processing device, or processor that executes a program stored in a memory.

FIG. 2A shows a hardware configuration of the speech recognition apparatus 100 according to Embodiment 1 in the case where the processing circuit is realized by dedicated hardware. As shown in FIG. 2A, when the processing circuit 101 is dedicated hardware, the functions of the first prediction unit 1, the suppression method selection unit 2, the noise suppression unit 3, and the speech recognition unit 4 may each be realized by a separate processing circuit, or the functions of the units may be realized collectively by a single processing circuit.
FIG. 2B shows a hardware configuration of the speech recognition apparatus 100 according to Embodiment 1 in the case where the processing circuit executes software.

As shown in FIG. 2B, when the processing circuit is a processor 102, the functions of the first prediction unit 1, the suppression method selection unit 2, the noise suppression unit 3, and the speech recognition unit 4 are realized by software, firmware, or a combination of software and firmware. The software and firmware are written as programs and stored in the memory 103. The processor 102 reads out and executes the programs stored in the memory 103, thereby carrying out the function of each unit. Here, the memory 103 corresponds to, for example, a nonvolatile or volatile semiconductor memory such as a RAM, a ROM, or a flash memory, or to a magnetic disk or an optical disk.
In this way, the processing circuit can realize each of the above functions by hardware, software, firmware, or a combination thereof.
Next, the detailed configurations of the first prediction unit 1 and the suppression method selection unit 2 will be described.

The first prediction unit 1, to which a regressor is applied, is composed of an NN that takes acoustic features as its input and outputs a speech recognition rate. Each time an acoustic feature vector is input for a frame of the short-time Fourier transform, the first prediction unit 1 uses the NN to predict the speech recognition rate for each of the noise suppression units 3a, 3b, and 3c. That is, the first prediction unit 1 computes, for every frame of acoustic features, the speech recognition rate that each of the different noise suppression processes would yield. The suppression method selection unit 2 refers to the recognition rates computed by the first prediction unit 1 for the noise suppression units 3a, 3b, and 3c, selects the noise suppression unit 3 expected to lead to the speech recognition result with the highest recognition rate, and outputs a control instruction to the selected noise suppression unit 3.
FIG. 3 is a flowchart showing the operation of the speech recognition apparatus 100 according to Embodiment 1.

It is assumed that noisy speech data and the acoustic features of that data are input to the speech recognition apparatus 100 via, for example, an external microphone, and that the acoustic features of the noisy speech data are computed by an external feature calculation means.

When the noisy speech data and their acoustic features are input (step ST1), the first prediction unit 1 uses the NN to predict, for each short-time Fourier transform frame of the input acoustic features, the speech recognition rate that would result from noise suppression by each of the noise suppression units 3a, 3b, and 3c (step ST2). The processing of step ST2 is repeated over a configured number of frames. The first prediction unit 1 then takes the average, maximum, or minimum of the per-frame recognition rates predicted in step ST2 over those frames, obtaining one predicted recognition rate for each of the noise suppression units 3a, 3b, and 3c (step ST3). The first prediction unit 1 associates the predicted recognition rates with the corresponding noise suppression units 3a, 3b, and 3c and outputs them to the suppression method selection unit 2 (step ST4).
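A minimal sketch of steps ST2 to ST3, under the assumption that a hypothetical `predict_fn` returns the per-method rates for one frame: predictions are made frame by frame over the configured frames and then aggregated (the mean here; the text also allows the maximum or minimum).

```python
import numpy as np

def aggregate_predicted_rates(frames, predict_fn, reducer=np.mean):
    # frames: (n_frames, n_dims); predict_fn maps one frame to per-method rates
    per_frame = np.stack([predict_fn(f) for f in frames])  # (n_frames, n_methods)
    return reducer(per_frame, axis=0)                      # one rate per method
```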
The suppression method selection unit 2 refers to the predicted recognition rates output in step ST4, selects the noise suppression unit 3 with the highest predicted recognition rate, and outputs a control instruction to the selected noise suppression unit 3 so that it performs the noise suppression process (step ST5). The noise suppression unit 3 that received the control instruction in step ST5 suppresses the noise signal in the actual noisy speech data input in step ST1 (step ST6). The speech recognition unit 4 performs speech recognition on the speech data whose noise signal was suppressed in step ST6, and acquires and outputs the speech recognition result (step ST7). The flow then returns to step ST1 and the above processing is repeated.
As described above, according to Embodiment 1, the apparatus includes: the first prediction unit 1, a regressor composed of an NN whose input is acoustic features and whose output is a speech recognition rate; the suppression method selection unit 2, which refers to the recognition rates predicted by the first prediction unit 1, selects from the plurality of noise suppression units 3 the one expected to lead to the speech recognition result with the highest recognition rate, and outputs a control instruction to it; the noise suppression unit 3, comprising processing units that implement the respective noise suppression methods and that suppress the noise in the noisy speech data in accordance with the control instruction of the suppression method selection unit 2; and the speech recognition unit 4, which performs speech recognition on the noise-suppressed speech data. An effective noise suppression method can therefore be selected without increasing the processing load of speech recognition and without running any noise suppression process merely to select a method.

For example, with the conventional technique, if there were three candidate noise suppression methods, the noise suppression process was carried out with all three and the best one was then chosen based on the results. According to Embodiment 1, even with three candidates, the method likely to perform best can be predicted in advance, so the noise suppression process is run only with the selected method, which reduces the amount of computation spent on noise suppression.
Embodiment 2.

Embodiment 1 described a configuration in which a regressor is used to select the noise suppression unit 3 leading to a speech recognition result with a high recognition rate. Embodiment 2 shows a configuration in which a discriminator is used instead to select such a noise suppression unit 3.

FIG. 4 is a block diagram showing the configuration of a speech recognition apparatus 100a according to Embodiment 2.

The speech recognition apparatus 100a of Embodiment 2 replaces the first prediction unit 1 and the suppression method selection unit 2 of the speech recognition apparatus 100 of Embodiment 1 with a second prediction unit 1a and a suppression method selection unit 2a. In the following, parts identical or equivalent to the components of the speech recognition apparatus 100 according to Embodiment 1 are given the same reference signs as in Embodiment 1, and their description is omitted or simplified.
The second prediction unit 1a is implemented as a discriminator. As the discriminator, for example, an NN is constructed and applied. The NN uses commonly employed acoustic features such as MFCC or filter bank features, performs a classification task (two-class or multi-class) as the discriminator, and selects the identifier of the suppression method with the highest recognition rate; it is trained by error back-propagation. The second prediction unit 1a is composed of, for example, an NN whose input is acoustic features, whose final output layer is a softmax layer performing two-class or multi-class classification, and whose output is the suppression method ID (identification) of the method leading to the speech recognition result with the highest recognition rate. As teacher data for the NN, one can use a vector in which only the suppression method leading to the highest recognition rate is set to 1 and the other methods to 0, or data weighted by applying a sigmoid to the recognition rates, namely Sigmoid((recognition rate of the system in question − (max(recognition rate) − min(recognition rate))/2)/σ), where σ is a scaling factor.

Of course, other classifiers such as an SVM (support vector machine) may also be used.
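The sketch below illustrates the softmax output layer and one reading of the sigmoid-weighted teacher data described above; the exact centring term and all names are assumptions, not the patent's definitive formulation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over suppression-method scores
    e = np.exp(z - z.max())
    return e / e.sum()

def soft_targets(recognition_rates, sigma=0.05):
    # Sigmoid((rate - (max(rate) - min(rate)) / 2) / sigma), as written in the text
    r = np.asarray(recognition_rates, dtype=float)
    shift = (r.max() - r.min()) / 2.0
    return 1.0 / (1.0 + np.exp(-(r - shift) / sigma))
```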
The suppression method selection unit 2a refers to the suppression method ID predicted by the second prediction unit 1a and selects, from the plurality of noise suppression units 3a, 3b, and 3c, the noise suppression unit 3 that is to perform noise suppression. As in Embodiment 1, spectral subtraction (SS), adaptive filtering, NN-based methods, and the like are applicable to the noise suppression unit 3. The suppression method selection unit 2a outputs a control instruction to the selected noise suppression unit 3 so that it performs the noise suppression process.
Next, the operation of the speech recognition apparatus 100a will be described.

FIG. 5 is a flowchart showing the operation of the speech recognition apparatus 100a according to Embodiment 2. In the following, steps identical to those of the speech recognition apparatus 100 according to Embodiment 1 are given the same reference signs as in FIG. 3, and their description is omitted or simplified.

It is assumed that noisy speech data and the acoustic features of that data are input to the speech recognition apparatus 100a via, for example, an external microphone.

When the noisy speech data and their acoustic features are input (step ST1), the second prediction unit 1a uses the NN to predict, for each short-time Fourier transform frame of the input acoustic features, the suppression method ID of the noise suppression method leading to the speech recognition result with the highest recognition rate (step ST11).
The second prediction unit 1a takes the mode or the average of the suppression method IDs predicted frame by frame in step ST11, and adopts that value as the predicted suppression method ID (step ST12). The suppression method selection unit 2a refers to the predicted suppression method ID obtained in step ST12, selects the noise suppression unit 3 corresponding to it, and outputs a control instruction to the selected noise suppression unit 3 so that it performs the noise suppression process (step ST13). Thereafter, the same processing as in steps ST6 and ST7 of Embodiment 1 is performed.
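A minimal sketch of step ST12, taking the mode of the per-frame predicted IDs (the function name is illustrative):

```python
import numpy as np

def predicted_method_id(per_frame_ids):
    # mode of the per-frame suppression method IDs
    ids, counts = np.unique(np.asarray(per_frame_ids), return_counts=True)
    return int(ids[np.argmax(counts)])
```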
As described above, according to Embodiment 2, the apparatus includes: the second prediction unit 1a, a discriminator composed of an NN whose input is acoustic features and whose output is the ID of the suppression method leading to the speech recognition result with the highest recognition rate; the suppression method selection unit 2a, which refers to the suppression method ID predicted by the second prediction unit 1a, selects from the plurality of noise suppression units 3 the one leading to the speech recognition result with the highest recognition rate, and outputs a control instruction to it; the noise suppression unit 3, comprising processing units corresponding to the respective noise suppression processes and suppressing the noise in the noisy speech data in accordance with the control instruction of the suppression method selection unit 2a; and the speech recognition unit 4, which performs speech recognition on the noise-suppressed speech data. An effective noise suppression method can therefore be selected without increasing the processing load of speech recognition and without running any noise suppression process merely to select a method.
Embodiment 3.

In Embodiments 1 and 2 described above, acoustic features are input to the first prediction unit 1 or the second prediction unit 1a for each short-time Fourier transform frame, and a speech recognition rate or a suppression method ID is predicted per frame. In Embodiment 3, by contrast, utterance-level acoustic features are used: from previously learned data, the utterance whose acoustic features are closest to those of the noisy speech data actually input to the speech recognition apparatus is selected, and the noise suppression unit is chosen based on the speech recognition rates of the selected utterance.
FIG. 6 is a block diagram showing the configuration of a speech recognition apparatus 100b according to Embodiment 3.

The speech recognition apparatus 100b of Embodiment 3 replaces the first prediction unit 1 and the suppression method selection unit 2 of the speech recognition apparatus 100 of Embodiment 1 with a third prediction unit 1c, which comprises a feature calculation unit 5, a similarity calculation unit 6, and a recognition rate database 7, and a suppression method selection unit 2b.

In the following, parts identical or equivalent to the components of the speech recognition apparatus 100 according to Embodiment 1 are given the same reference signs as in Embodiment 1, and their description is omitted or simplified.
The feature calculation unit 5 of the third prediction unit 1c calculates acoustic features from the input noisy speech data on an utterance-by-utterance basis. The details of this utterance-level feature calculation are described later. The similarity calculation unit 6 refers to the recognition rate database 7 and matches the utterance-level acoustic features calculated by the feature calculation unit 5 against the acoustic features stored in the recognition rate database 7, computing their similarity. The similarity calculation unit 6 then retrieves the set of speech recognition rates associated with the acoustic features of highest similarity, namely the rates obtained when the corresponding data were noise-suppressed by each of the noise suppression units 3a, 3b, and 3c, and outputs that set to the suppression method selection unit 2b. A set of speech recognition rates is, for example, (recognition rate 1-1, recognition rate 1-2, recognition rate 1-3) or (recognition rate 2-1, recognition rate 2-2, recognition rate 2-3). The suppression method selection unit 2b refers to the set of recognition rates input from the similarity calculation unit 6 and selects, from the plurality of noise suppression units 3a, 3b, and 3c, the noise suppression unit 3 that is to perform noise suppression.
The recognition rate database 7 is a storage area that stores the acoustic features of a plurality of pieces of training data in association with the speech recognition rates obtained when each piece of training data is noise-suppressed by each of the noise suppression units 3a, 3b, and 3c.

FIG. 7 is a diagram showing a configuration example of the recognition rate database 7 of the speech recognition apparatus 100b according to Embodiment 3.

The recognition rate database 7 stores the acoustic features of the training data together with the speech recognition rates of the speech data obtained by applying each noise suppression unit (in the example of FIG. 7, the first, second, and third noise suppression units) to each piece of training data. In FIG. 7, for example, for the training data with the first acoustic feature V(r1), the recognition rate of the speech data processed by the first noise suppression unit is 80%, that of the speech data processed by the second noise suppression unit is 75%, and that of the speech data processed by the third noise suppression unit is 78%. The recognition rate database 7 may also be configured to cluster the training data and store the recognition rates of the clustered training data in association with their acoustic features, thereby reducing the amount of stored data.
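An illustrative in-memory form of the database of FIG. 7 is sketched below; the feature values are placeholders, while the rates follow the example in the text.

```python
import numpy as np

# One entry per training utterance (or per cluster of utterances):
# an utterance-level feature vector paired with the recognition rates
# measured after each noise suppression unit.
recognition_rate_db = [
    {"features": np.zeros(64),      # stand-in for V(r1)
     "rates": [0.80, 0.75, 0.78]},  # first / second / third noise suppression unit
]
```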
Next, the calculation of utterance-level acoustic features by the feature calculation unit 5 will be described in detail.

As the utterance-level acoustic features, the average vector of the acoustic features, an average likelihood vector based on a universal background model (UBM), an i-vector, and the like are applicable. The feature calculation unit 5 calculates such features, utterance by utterance, for each piece of noisy speech data to be recognized. When an i-vector is used as the acoustic feature, for example, a Gaussian mixture model (GMM) is adapted to the utterance r, and the resulting supervector V(r) is factorized according to the following equation (1), using the UBM supervector v obtained in advance and a matrix T whose basis vectors span a low-rank total variability subspace:

V(r) = v + Tw(r)   (1)

The vector w(r) obtained from equation (1) is the i-vector.
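A minimal sketch of extracting w(r) from equation (1): here the factorization is solved in the least-squares sense with a pseudo-inverse, which only illustrates the relationship; practical i-vector extractors use a posterior (MAP) estimate instead.

```python
import numpy as np

def ivector(V, v, T):
    # solve V = v + T w for w in the least-squares sense
    return np.linalg.pinv(T) @ (V - v)  # w(r)
```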
The similarity between utterance-level acoustic features is measured using the Euclidean distance or the cosine similarity, as in equation (2), and the utterance r't closest to the current evaluation data re is selected from among the training data rt. With the cosine similarity, for example,

sim(re, rt) = w(re) · w(rt) / (||w(re)|| ||w(rt)||)   (2)

Writing the similarity as sim, the selected utterance is the one given by the following equation (3):

r't = argmax over rt of sim(re, rt)   (3)
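A sketch of equations (2) and (3): each training utterance is scored against the evaluation utterance with the cosine similarity (with the Euclidean distance the ordering is simply reversed) and the closest one is selected.

```python
import numpy as np

def cosine_sim(x, y):
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def closest_training_utterance(w_eval, training_ws):
    # index of r't, the training utterance most similar to the evaluation data
    sims = [cosine_sim(w_eval, w_t) for w_t in training_ws]
    return int(np.argmax(sims))
```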
If the word error rate Wtr(i, rt) obtained by using the i-th noise suppression unit 3 and the speech recognition unit 4 is computed in advance for the training data rt, then the system i' optimal for re is selected on the basis of recognition performance, as in the following equation (4):

i' = argmin over i of Wtr(i, r't)   (4)

Although the above description uses the case of two noise suppression methods as an example, it is equally applicable when there are three or more noise suppression methods.
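A short sketch of equation (4), assuming a hypothetical table `wer_table[i][t]` holding the pre-computed word error rate of method i on training utterance t:

```python
def best_method(wer_table, t_nearest):
    # the method with the lowest word error rate on the nearest utterance r't
    return min(range(len(wer_table)), key=lambda i: wer_table[i][t_nearest])
```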
Next, the operation of the speech recognition apparatus 100b will be described.

FIG. 8 is a flowchart showing the operation of the speech recognition apparatus 100b according to Embodiment 3. In the following, steps identical to those of the speech recognition apparatus 100 according to Embodiment 1 are given the same reference signs as in FIG. 3, and their description is omitted or simplified.

It is assumed that noisy speech data are input to the speech recognition apparatus 100b via, for example, an external microphone.

When noisy speech data are input (step ST21), the feature calculation unit 5 calculates acoustic features from the input data (step ST22). The similarity calculation unit 6 compares the acoustic features calculated in step ST22 with the acoustic features of the training data stored in the recognition rate database 7 and computes their similarities (step ST23). The similarity calculation unit 6 selects the stored acoustic features showing the highest similarity and, referring to the recognition rate database 7, retrieves the set of recognition rates associated with them (step ST24). When the Euclidean distance is used as the similarity measure in step ST24, the set of recognition rates with the shortest distance is retrieved.
The suppression method selection unit 2b selects the noise suppression unit 3 showing the highest recognition rate within the set retrieved in step ST24, and outputs a control instruction to the selected noise suppression unit 3 so that it performs the noise suppression process (step ST25). Thereafter, the same processing as in steps ST6 and ST7 described above is performed.
As described above, according to Embodiment 3, the apparatus includes: the feature calculation unit 5, which calculates acoustic features from the noisy speech data; the similarity calculation unit 6, which refers to the recognition rate database 7, computes the similarity between the calculated acoustic features and the acoustic features of the training data, and retrieves the set of speech recognition rates associated with the features of highest similarity; and the suppression method selection unit 2b, which selects the noise suppression unit 3 showing the highest recognition rate within the retrieved set. Speech recognition performance can therefore be predicted on an utterance-by-utterance basis and with high accuracy, and the use of fixed-dimensional features makes the similarity computation straightforward.
Although Embodiment 3 above shows a configuration in which the speech recognition apparatus 100b includes the recognition rate database 7, the similarity calculation unit 6 may instead compute the feature similarities and retrieve the recognition rates by referring to an external database.
In Embodiment 3 above, performing speech recognition on an utterance-by-utterance basis introduces a delay. If this delay cannot be tolerated, the acoustic features may be computed from only the first few seconds of speech after the start of the utterance. Furthermore, when the environment has not changed since the utterance preceding the one to be recognized, speech recognition may be performed using the noise suppression unit 3 selected for the previous utterance.
Embodiment 4.

Embodiment 3 described a configuration in which the noise suppression method is selected by referring to the recognition rate database 7, which associates the acoustic features of the training data with speech recognition rates. Embodiment 4 shows a configuration in which the noise suppression method is selected by referring to an acoustic index database that associates the acoustic features of the training data with acoustic indices.

FIG. 9 is a block diagram showing the configuration of a speech enhancement apparatus 200 according to Embodiment 4.

The speech enhancement apparatus 200 of Embodiment 4 replaces the third prediction unit 1c, comprising the feature calculation unit 5, the similarity calculation unit 6, and the recognition rate database 7, and the suppression method selection unit 2b of the speech recognition apparatus 100b of Embodiment 3 with a fourth prediction unit 1d, comprising the feature calculation unit 5, a similarity calculation unit 6a, and an acoustic index database 8, and a suppression method selection unit 2c. The speech recognition unit 4 is not provided.

In the following, parts identical or equivalent to the components of the speech recognition apparatus 100b according to Embodiment 3 are given the same reference signs as in Embodiment 3, and their description is omitted or simplified.
The acoustic index database 8 is a storage area that stores the acoustic features of a plurality of pieces of training data in association with the acoustic indices obtained when each piece of training data is noise-suppressed by each of the noise suppression units 3a, 3b, and 3c. Here, an acoustic index is, for example, the PESQ or the SNR/SDR computed from the enhanced speech after noise suppression and the noisy speech before noise suppression. The acoustic index database 8 may also be configured to cluster the training data and store the acoustic indices of the clustered training data in association with their acoustic features, thereby reducing the amount of stored data.
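As an illustration of one such index, the sketch below computes a simple signal-to-residual ratio in dB from the enhanced speech and the noisy input; it is only a stand-in for SNR/SDR-style indices, and PESQ would require a dedicated ITU-T P.862 implementation not reproduced here.

```python
import numpy as np

def snr_db(enhanced, noisy):
    # ratio between the enhanced signal energy and the suppressed residual
    residual = noisy - enhanced
    return 10.0 * np.log10(np.sum(enhanced ** 2) / np.sum(residual ** 2))
```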
The similarity calculation unit 6a refers to the acoustic index database 8 and matches the utterance-level acoustic features calculated by the feature calculation unit 5 against the acoustic features stored in the acoustic index database 8, computing their similarity. The similarity calculation unit 6a retrieves the set of acoustic indices associated with the acoustic features of highest similarity and outputs it to the suppression method selection unit 2c. A set of acoustic indices is, for example, (PESQ 1-1, PESQ 1-2, PESQ 1-3) or (PESQ 2-1, PESQ 2-2, PESQ 2-3).

The suppression method selection unit 2c refers to the set of acoustic indices input from the similarity calculation unit 6a and selects, from the plurality of noise suppression units 3a, 3b, and 3c, the noise suppression unit 3 that is to perform noise suppression.
Next, the operation of the speech enhancement apparatus 200 will be described.

FIG. 10 is a flowchart showing the operation of the speech enhancement apparatus 200 according to Embodiment 4. It is assumed that noisy speech data are input to the speech enhancement apparatus 200 via, for example, an external microphone.

When noisy speech data are input (step ST31), the feature calculation unit 5 calculates acoustic features from the input data (step ST32). The similarity calculation unit 6a compares the acoustic features calculated in step ST32 with the acoustic features of the training data stored in the acoustic index database 8 and computes their similarities (step ST33). The similarity calculation unit 6a selects the stored acoustic features showing the highest similarity and retrieves the set of acoustic indices associated with them (step ST34).
The suppression method selection unit 2c selects the noise suppression unit 3 showing the highest acoustic index within the set retrieved in step ST34, and outputs a control instruction to the selected noise suppression unit 3 so that it performs the noise suppression process (step ST35). The noise suppression unit 3 that received the control instruction in step ST35 suppresses the noise signal in the actual noisy speech data input in step ST31, and acquires and outputs the enhanced speech (step ST36). The flow then returns to step ST31 and the above processing is repeated.
As described above, according to Embodiment 4, the apparatus includes: the feature calculation unit 5, which calculates acoustic features from the noisy speech data; the similarity calculation unit 6a, which refers to the acoustic index database 8, computes the similarity between the calculated acoustic features and the acoustic features of the training data, and retrieves the set of acoustic indices associated with the features of highest similarity; and the suppression method selection unit 2c, which selects the noise suppression unit 3 showing the highest acoustic index within the retrieved set. Performance can therefore be predicted on an utterance-by-utterance basis and with high accuracy, and the use of fixed-dimensional features makes the similarity computation straightforward.
Although Embodiment 4 above shows a configuration in which the speech enhancement apparatus 200 includes the acoustic index database 8, the similarity calculation unit 6a may instead compute the feature similarities and retrieve the acoustic indices by referring to an external database.
In Embodiment 4 above, processing on an utterance-by-utterance basis introduces a delay. If this delay cannot be tolerated, the acoustic features may be computed from only the first few seconds of speech after the start of the utterance. Furthermore, when the environment has not changed since the utterance preceding the one for which enhanced speech is to be obtained, the enhanced speech may be obtained using the noise suppression unit 3 selected for the previous utterance.
Embodiment 5.

The speech recognition apparatuses 100, 100a, and 100b of Embodiments 1 to 3 and the speech enhancement apparatus 200 of Embodiment 4 described above can be applied to, for example, a navigation system, a telephone response system, an elevator, or any other device equipped with a voice call function. Embodiment 5 shows the case where the speech recognition apparatus of Embodiment 1 is applied to a navigation system.

FIG. 11 is a functional block diagram showing the configuration of a navigation system 300 according to Embodiment 5.

The navigation system 300 is a device that is mounted on, for example, a vehicle and provides route guidance to a destination; it comprises an information acquisition device 301, a control device 302, an output device 303, an input device 304, the speech recognition apparatus 100, a map database 305, a route calculation device 306, and a route guidance device 307. The operation of each device of the navigation system 300 is centrally controlled by the control device 302.
The information acquisition device 301 comprises, for example, current position detection means, wireless communication means, and surrounding information detection means, and acquires the current position of the host vehicle, information about its surroundings, and information detected by other vehicles. The output device 303 comprises, for example, display means, display control means, audio output means, and audio control means, and notifies the user of information. The input device 304 is realized by voice input means such as a microphone and operation input means such as buttons and a touch panel, and accepts information input from the user. The speech recognition apparatus 100 is the apparatus with the configuration and functions described in Embodiment 1; it performs speech recognition on the noisy speech data input via the input device 304, acquires the speech recognition result, and outputs it to the control device 302.
The map database 305 is a storage area for map data and is realized as a storage device such as an HDD (Hard Disk Drive) or a RAM (Random Access Memory). The route calculation device 306 takes the current position of the host vehicle acquired by the information acquisition device 301 as the departure point and the speech recognition result of the speech recognition apparatus 100 as the destination, and calculates the route from the departure point to the destination based on the map data stored in the map database 305. The route guidance device 307 guides the host vehicle along the route calculated by the route calculation device 306.
In the navigation system 300, when noisy speech data containing the user's utterance are input from the microphone of the input device 304, the speech recognition apparatus 100 applies to the data the processing shown in the flowchart of FIG. 3 described above and acquires the speech recognition result. Based on the information input from the control device 302 and the information acquisition device 301, the route calculation device 306 takes the current position of the host vehicle acquired by the information acquisition device 301 as the departure point and the information indicated by the speech recognition result as the destination, and calculates the route from the departure point to the destination based on the map data. The route guidance device 307 outputs the route guidance information derived from the route calculated by the route calculation device 306 via the output device 303, thereby providing route guidance to the user.
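A rough sketch of this flow is given below; every object and method name is a hypothetical stand-in for the devices in FIG. 11, not an API defined by the patent.

```python
def navigate(noisy_speech, recognizer, gps, route_calculator, route_guide):
    destination = recognizer.recognize(noisy_speech)  # speech recognition apparatus 100
    origin = gps.current_position()                   # information acquisition device 301
    route = route_calculator.calculate(origin, destination)  # uses map database 305
    route_guide.guide(route)                          # route guidance device 307
```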
As described above, according to Embodiment 5, the noisy speech data containing the user's utterance input to the input device 304 are noise-suppressed by the noise suppression unit 3 predicted to lead to a speech recognition result with a good recognition rate, and speech recognition is then performed. The route can therefore be calculated based on a speech recognition result with a good recognition rate, and route guidance matching the user's wishes can be provided.
Although Embodiment 5 above shows a configuration in which the speech recognition apparatus 100 of Embodiment 1 is applied to the navigation system 300, the system may instead be configured with the speech recognition apparatus 100a of Embodiment 2, the speech recognition apparatus 100b of Embodiment 3, or the speech enhancement apparatus 200 of Embodiment 4. When the speech enhancement apparatus 200 is applied to the navigation system 300, the navigation system 300 itself is assumed to provide the function of recognizing the enhanced speech.
Beyond the above, within the scope of the invention, the embodiments may be freely combined, any component of each embodiment may be modified, and any component of each embodiment may be omitted.
Since the speech recognition apparatus and the speech enhancement apparatus according to the present invention can select a noise suppression method that yields a good speech recognition rate or acoustic index, they can be applied to devices equipped with a call function, such as navigation systems, telephone response systems, and elevators.
Reference signs: 1 first prediction unit; 1a second prediction unit; 2, 2a, 2b suppression method selection unit; 3, 3a, 3b, 3c noise suppression unit; 4 speech recognition unit; 5 feature calculation unit; 6, 6a similarity calculation unit; 7 recognition rate database; 8 acoustic index database; 100, 100a, 100b speech recognition apparatus; 200 speech enhancement apparatus; 300 navigation system; 301 information acquisition device; 302 control device; 303 output device; 304 input device; 305 map database; 306 route calculation device; 307 route guidance device.

Claims (9)

1. A speech recognition apparatus comprising:
a plurality of noise suppression units that apply noise suppression processes of mutually different methods to input noisy speech data;
a speech recognition unit that performs speech recognition on speech data whose noise signal has been suppressed by a noise suppression unit;
a prediction unit that predicts, from acoustic features of the input noisy speech data, the speech recognition rates obtained when the noisy speech data are processed by each of the plurality of noise suppression units; and
a suppression method selection unit that selects, based on the speech recognition rates predicted by the prediction unit, the noise suppression unit that is to apply the noise suppression process to the noisy speech data from among the plurality of noise suppression units.
2. The speech recognition apparatus according to claim 1, wherein the prediction unit predicts the speech recognition rate for each short-time Fourier transform frame of the acoustic features.
3. The speech recognition apparatus according to claim 1, wherein the prediction unit is composed of a neural network whose input is the acoustic features and whose output is the speech recognition rate of the acoustic features.
4. The speech recognition apparatus according to claim 1, wherein the prediction unit is composed of a neural network that performs classification with the acoustic features as input and outputs information indicating the noise suppression unit with a high speech recognition rate.
5. The speech recognition apparatus according to claim 1, wherein the prediction unit comprises: a feature calculation unit that calculates acoustic features from the noisy speech data on an utterance-by-utterance basis; and a similarity calculation unit that retrieves previously stored speech recognition rates based on the similarity between the acoustic features calculated by the feature calculation unit and previously stored acoustic features.
6. A speech enhancement apparatus comprising:
a plurality of noise suppression units that apply noise suppression processes of mutually different methods to input noisy speech data;
a prediction unit comprising a feature calculation unit that calculates acoustic features from the input noisy speech data on an utterance-by-utterance basis, and a similarity calculation unit that retrieves previously stored acoustic indices based on the similarity between the acoustic features calculated by the feature calculation unit and previously stored acoustic features; and
a suppression method selection unit that selects, based on the acoustic indices retrieved by the similarity calculation unit, the noise suppression unit that is to apply the noise suppression process to the noisy speech data from among the plurality of noise suppression units.
  7.  A speech recognition method comprising the steps of:
     a prediction unit predicting, from an acoustic feature of input noisy speech data, the speech recognition rate obtained when noise suppression processing is performed on the noisy speech data by each of a plurality of noise suppression methods;
     a suppression method selection unit selecting, based on the predicted speech recognition rates, a noise suppression unit to perform noise suppression processing on the noisy speech data;
     the selected noise suppression unit performing noise suppression processing on the input noisy speech data; and
     a voice recognition unit performing voice recognition of the voice data in which the noise signal has been suppressed by the noise suppression processing.
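
The four steps of the method read as a single procedure; in the sketch below, predictor, suppressors, and recognizer are assumed callables standing in for components whose internals the claim leaves open.

def recognize_with_selection(noisy_audio, features, predictor, suppressors, recognizer):
    # Step 1: predict a recognition rate for each candidate suppression method.
    predicted = {name: predictor(features, name) for name in suppressors}
    # Step 2: select the method with the highest predicted recognition rate.
    best = max(predicted, key=predicted.get)
    # Step 3: run the selected noise suppression on the input noisy speech.
    enhanced = suppressors[best](noisy_audio)
    # Step 4: recognize the speech in which the noise signal was suppressed.
    return recognizer(enhanced), best
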
  8.  A speech enhancement method comprising the steps of:
     a feature calculation unit of a prediction unit calculating an acoustic feature for each utterance from input noisy speech data;
     a similarity calculation unit of the prediction unit acquiring a pre-stored acoustic index based on the similarity between the calculated acoustic feature and pre-stored acoustic features;
     a suppression method selection unit selecting, based on the acquired acoustic index, a noise suppression unit to perform noise suppression processing on the noisy speech data; and
     the selected noise suppression unit performing noise suppression processing on the input noisy speech data.
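
The enhancement-only counterpart can reuse the select_by_acoustic_index sketch given under claim 6: the same lookup-and-select steps, but the output is the enhanced signal rather than a recognition result.

def enhance(noisy_audio, utt_feature, stored_features, stored_index, suppressors):
    # Steps 1-2: per-utterance feature lookup against the pre-stored corpus.
    best = select_by_acoustic_index(utt_feature, stored_features,
                                    stored_index, list(suppressors))
    # Steps 3-4: apply the selected suppression and return the enhanced audio.
    return suppressors[best](noisy_audio)
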
  9.  A navigation device comprising:
     the speech recognition apparatus according to claim 1;
     a route calculation device that takes the current position of a moving body as its departure point, takes the speech recognition result output by the speech recognition apparatus as its destination, and calculates a route from the departure point to the destination with reference to map data; and
     a route guidance device that guides the movement of the moving body along the route calculated by the route calculation device.
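
The route calculation step is concrete enough to sketch as a shortest-path search over map data from the current position to the recognized destination; the toy graph and place names below are invented for illustration.

import heapq

def shortest_route(graph, start, goal):
    # Dijkstra over an adjacency dict: {node: [(neighbour, cost), ...]}.
    queue, visited = [(0.0, start, [start])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, step in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + step, nxt, path + [nxt]))
    return float("inf"), []

map_data = {"current_position": [("junction_a", 2.0), ("junction_b", 5.0)],
            "junction_a": [("station", 4.0)],
            "junction_b": [("station", 1.0)]}
destination = "station"   # e.g. the output of the speech recognition apparatus
print(shortest_route(map_data, "current_position", destination))
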
PCT/JP2015/083768 2015-12-01 2015-12-01 Voice recognition device, voice emphasis device, voice recognition method, voice emphasis method, and navigation system WO2017094121A1 (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
JP2017553538A JP6289774B2 (en) 2015-12-01 2015-12-01 Speech recognition device, speech enhancement device, speech recognition method, speech enhancement method, and navigation system
PCT/JP2015/083768 WO2017094121A1 (en) 2015-12-01 2015-12-01 Voice recognition device, voice emphasis device, voice recognition method, voice emphasis method, and navigation system
US15/779,315 US20180350358A1 (en) 2015-12-01 2015-12-01 Voice recognition device, voice emphasis device, voice recognition method, voice emphasis method, and navigation system
CN201580084845.6A CN108292501A (en) 2015-12-01 2015-12-01 Voice recognition device, voice enhancement device, voice recognition method, voice enhancement method, and navigation system
KR1020187014775A KR102015742B1 (en) 2015-12-01 2015-12-01 Speech recognition device, speech emphasis device, speech recognition method, speech emphasis method and navigation system
DE112015007163.6T DE112015007163B4 (en) 2015-12-01 2015-12-01 Speech recognition device, speech enhancement device, speech recognition method, speech enhancement method, and navigation system
TW105110250A TW201721631A (en) 2015-12-01 2016-03-31 Voice recognition device, voice emphasis device, voice recognition method, voice emphasis method, and navigation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/083768 WO2017094121A1 (en) 2015-12-01 2015-12-01 Voice recognition device, voice emphasis device, voice recognition method, voice emphasis method, and navigation system

Publications (1)

Publication Number Publication Date
WO2017094121A1

Family

ID=58796545

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/083768 WO2017094121A1 (en) 2015-12-01 2015-12-01 Voice recognition device, voice emphasis device, voice recognition method, voice emphasis method, and navigation system

Country Status (7)

Country Link
US (1) US20180350358A1 (en)
JP (1) JP6289774B2 (en)
KR (1) KR102015742B1 (en)
CN (1) CN108292501A (en)
DE (1) DE112015007163B4 (en)
TW (1) TW201721631A (en)
WO (1) WO2017094121A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109920434B (en) * 2019-03-11 2020-12-15 南京邮电大学 Noise classification removal method based on conference scene
CN109817219A (en) * 2019-03-19 2019-05-28 四川长虹电器股份有限公司 Voice wake-up test method and system

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6173255B1 (en) * 1998-08-18 2001-01-09 Lockheed Martin Corporation Synchronized overlap add voice processing using windows and one bit correlators
JP2000194392A (en) 1998-12-25 2000-07-14 Sharp Corp Noise adaptive type voice recognition device and recording medium recording noise adaptive type voice recognition program
KR101434071B1 (en) * 2002-03-27 2014-08-26 앨리프컴 Microphone and voice activity detection (vad) configurations for use with communication systems
JP4352790B2 (en) * 2002-10-31 2009-10-28 セイコーエプソン株式会社 Acoustic model creation method, speech recognition device, and vehicle having speech recognition device
JP2005115569A (en) 2003-10-06 2005-04-28 Matsushita Electric Works Ltd Signal identification device and method
CA2454296A1 (en) * 2003-12-29 2005-06-29 Nokia Corporation Method and device for speech enhancement in the presence of background noise
US20060206320A1 (en) * 2005-03-14 2006-09-14 Li Qi P Apparatus and method for noise reduction and speech enhancement with microphones and loudspeakers
US20070041589A1 (en) * 2005-08-17 2007-02-22 Gennum Corporation System and method for providing environmental specific noise reduction algorithms
JP2007206501A (en) * 2006-02-03 2007-08-16 Advanced Telecommunication Research Institute International Device for determining optimum speech recognition system, speech recognition device, parameter calculation device, information terminal device and computer program
US7676363B2 (en) * 2006-06-29 2010-03-09 General Motors Llc Automated speech recognition using normalized in-vehicle speech
JP4730369B2 (en) * 2007-10-30 2011-07-20 株式会社デンソー Navigation system
US8606573B2 (en) * 2008-03-28 2013-12-10 Alon Konchitsky Voice recognition improved accuracy in mobile environments
WO2010052749A1 (en) * 2008-11-04 2010-05-14 三菱電機株式会社 Noise suppression device
TWI404049B (en) * 2010-08-18 2013-08-01 Hon Hai Prec Ind Co Ltd Voice navigation device and voice navigation method
JP5949553B2 (en) * 2010-11-11 2016-07-06 日本電気株式会社 Speech recognition apparatus, speech recognition method, and speech recognition program
JP5916054B2 (en) * 2011-06-22 2016-05-11 クラリオン株式会社 Voice data relay device, terminal device, voice data relay method, and voice recognition system
WO2013149123A1 (en) * 2012-03-30 2013-10-03 The Ohio State University Monaural speech filter
JP6169849B2 (en) * 2013-01-15 2017-07-26 本田技研工業株式会社 Sound processor
US9830925B2 (en) * 2014-10-22 2017-11-28 GM Global Technology Operations LLC Selective noise suppression during automatic speech recognition
CN104575510B (en) * 2015-02-04 2018-08-24 深圳酷派技术有限公司 Noise-reduction method, denoising device and terminal
US20160284349A1 (en) * 2015-03-26 2016-09-29 Binuraj Ravindran Method and system of environment sensitive automatic speech recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010160246A (en) * 2009-01-07 2010-07-22 Nara Institute Of Science & Technology Noise suppressing device and program
JP2013183346A (en) * 2012-03-02 2013-09-12 Canon Inc Imaging apparatus and sound processing apparatus
JP2015057630A (en) * 2013-08-13 2015-03-26 日本電信電話株式会社 Acoustic event identification model learning device, acoustic event detection device, acoustic event identification model learning method, acoustic event detection method, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SOTA HAMAGUCHI: "Robust speech recognition under noisy environments based on selection of multiple noise suppression methods using GMMs", IEICE TECHNICAL REPORT, vol. 104, no. 542, December 2004 (2004-12-01) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020034683A (en) * 2018-08-29 2020-03-05 富士通株式会社 Voice recognition device, voice recognition program and voice recognition method
US11183180B2 (en) 2018-08-29 2021-11-23 Fujitsu Limited Speech recognition apparatus, speech recognition method, and a recording medium performing a suppression process for categories of noise
JP7167554B2 (en) 2018-08-29 2022-11-09 富士通株式会社 Speech recognition device, speech recognition program and speech recognition method
JP2022034035A (en) * 2018-11-22 2022-03-02 株式会社Jvcケンウッド Voice processing condition setting device, wireless communication device, and voice processing condition setting method
JP7196993B2 (en) 2018-11-22 2022-12-27 株式会社Jvcケンウッド Voice processing condition setting device, wireless communication device, and voice processing condition setting method

Also Published As

Publication number Publication date
JP6289774B2 (en) 2018-03-07
CN108292501A (en) 2018-07-17
JPWO2017094121A1 (en) 2018-02-08
KR20180063341A (en) 2018-06-11
US20180350358A1 (en) 2018-12-06
TW201721631A (en) 2017-06-16
DE112015007163T5 (en) 2018-08-16
KR102015742B1 (en) 2019-08-28
DE112015007163B4 (en) 2019-09-05

Similar Documents

Publication Publication Date Title
US10468032B2 (en) Method and system of speaker recognition using context aware confidence modeling
US20210375272A1 (en) Sentiment aware voice user interface
US10878807B2 (en) System and method for implementing a vocal user interface by combining a speech to text system and a speech to intent system
EP2216775B1 (en) Speaker recognition
JP4355322B2 (en) Speech recognition method based on reliability of keyword model weighted for each frame, and apparatus using the method
US20160111084A1 (en) Speech recognition device and speech recognition method
US20090119103A1 (en) Speaker recognition system
CN109196583B (en) Dynamic speech recognition data evaluation
KR20160010961A (en) Method and device for performing voice recognition using context information
US11887596B2 (en) Multiple skills processing
EP2189976A1 (en) Method for adapting a codebook for speech recognition
JP6289774B2 (en) Speech recognition device, speech enhancement device, speech recognition method, speech enhancement method, and navigation system
US9786295B2 (en) Voice processing apparatus and voice processing method
Vafeiadis et al. Two-dimensional convolutional recurrent neural networks for speech activity detection
JPWO2010128560A1 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
Feng et al. On using heterogeneous data for vehicle-based speech recognition: A DNN-based approach
Herbig et al. Self-learning speaker identification for enhanced speech recognition
Ivanko et al. DAVIS: Driver's Audio-Visual Speech recognition.
Loh et al. Speech recognition interactive system for vehicle
JP4860962B2 (en) Speech recognition apparatus, speech recognition method, and program
Gamage et al. An i-vector gplda system for speech based emotion recognition
JP7511374B2 (en) Speech activity detection device, voice recognition device, speech activity detection system, speech activity detection method, and speech activity detection program
CN110875034A (en) Template training method for voice recognition, voice recognition method and system thereof
Lee et al. Semi-Supervised Speaker Adaptation for In-Vehicle Speech Recognition with Deep Neural Networks.
Delcroix et al. Discriminative feature transforms using differenced maximum mutual information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15909751; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2017553538; Country of ref document: JP; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 20187014775; Country of ref document: KR; Kind code of ref document: A)
WWE Wipo information: entry into national phase (Ref document number: 112015007163; Country of ref document: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15909751; Country of ref document: EP; Kind code of ref document: A1)