WO2009133719A1 - Acoustic model learning device and speech recognition device - Google Patents
Acoustic model learning device and speech recognition device
- Publication number
- WO2009133719A1 (PCT/JP2009/052193)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- variation
- sound
- model
- acoustic model
- data
- Prior art date
Links
- 238000006243 chemical reaction Methods 0.000 claims description 60
- 238000000034 method Methods 0.000 claims description 45
- 230000009466 transformation Effects 0.000 claims description 17
- 230000006978 adaptation Effects 0.000 claims description 8
- 230000005540 biological transmission Effects 0.000 claims description 6
- 238000007476 Maximum Likelihood Methods 0.000 claims description 3
- 239000000203 mixture Substances 0.000 claims description 3
- 238000003860 storage Methods 0.000 description 59
- 238000013500 data storage Methods 0.000 description 22
- 230000014509 gene expression Effects 0.000 description 13
- 238000012545 processing Methods 0.000 description 7
- 238000010586 diagram Methods 0.000 description 5
- 239000013598 vector Substances 0.000 description 5
- 238000009826 distribution Methods 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 3
- 239000000284 extract Substances 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 238000005259 measurement Methods 0.000 description 2
- 238000012795 verification Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 230000001413 cellular effect Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000012549 training Methods 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Definitions
- the present invention relates to a learning technique for constructing an acoustic model.
- voice recognition technology is used in a wide range of areas such as speaker recognition, voice personal authentication, sound quality measurement, and environment measurement.
- In order to improve the accuracy of speech recognition, attempts have been made to reduce the influence of variation factors caused by transmission channels, noise, and the like by learning an acoustic model.
- FIG. 10 shows an example of a model of an acoustic model learning device that realizes the acoustic model learning technique disclosed in Non-Patent Document 1 and Non-Patent Document 2.
- the acoustic model learning device 1 includes an audio data storage means 11, a channel label storage means 12, an unspecified speaker model learning means 13, a channel model learning means 14, an unspecified speaker model storage means 15, and a channel model storage means 16.
- the audio data storage means 11 stores sample audio data acquired via various transmission channels.
- a transmission channel refers to the kind of physical path the voice passes through from the voice source, such as a speaker, until it is recorded; examples include a fixed telephone (including a fixed telephone terminal and a fixed telephone line), a mobile phone (including a mobile phone terminal and a mobile phone line), and a recording microphone.
- the transmission channel is also simply referred to as a channel.
- the voice as data differs depending on whether the speaker is a woman or a man.
- the voice as data differs depending on whether recording is performed via a fixed telephone or a cellular phone.
- A voice source, a transmission channel, and the like that have a plurality of types and cause variation in the voice when the types differ are called sound environments.
- the channel label storage unit 12 of the acoustic model learning device 1 stores label data indicating the channel through which the sample audio data passes, corresponding to the sample audio data stored in the audio data storage unit 11.
- the unspecified speaker model learning means 13 receives the sample voice data and the label data from the voice data storage means 11 and the channel label storage means 12, and learns an unspecified speaker acoustic model by removing the variation components that depend on the sound environment of the channel from the sample voice data and extracting only the variation components that depend on the sound environment of the speaker.
- the “unspecified speaker acoustic model” is also referred to as “unspecified speaker model”.
- the channel model learning means 14 receives the sample audio data and the label data from the audio data storage means 11 and the channel label storage means 12, and learns, for each channel, the affine transformation parameters corresponding to the acoustic model of that channel. That is, the parameters are learned on the assumption that the channel acoustic model is obtained by applying an affine transformation to the unspecified speaker model.
- channel acoustic model is also referred to as “channel model”.
- the unspecified speaker model learning means 13 and the channel model learning means 14 jointly perform the iterative solution described in Non-Patent Document 3 to estimate the unspecified speaker acoustic model and the affine transformation parameters (channel acoustic models), and output the final unspecified speaker acoustic model and affine transformation parameters after the iterative solution has converged.
- the unspecified speaker model storage means 15 receives and stores the unspecified speaker model from the unspecified speaker model learning means 13, and the channel model storage means 16 receives and stores the channel models from the channel model learning means 14.
- With this configuration, affine transformation parameters specific to each channel can be acquired for every channel. Therefore, by applying the affine transformation to the acoustic model for audio data input from any known channel, or by applying the inverse affine transformation to the audio data itself, the variation factor due to the channel can be reduced, and it is considered that the recognition target can be recognized correctly.
- D. A. Reynolds, "Channel robust speaker verification via feature mapping," Proc. ICASSP 2003, Vol. II, pp. 53-56, 2003
- D. Zhu et al., "A generalized feature transformation approach for channel robust speaker verification," Proc. ICASSP 2007, Vol. IV, pp. 61-64, 2007
- T. Anastasakos et al., "A compact model for speaker-adaptive training," Proc. ICSLP 96, 1996
- In order for the channel model learning means 14 to accurately estimate the affine transformation parameters for each channel, it is assumed that the variation components caused by the sound environment of the speaker can be ignored in the unspecified speaker model learning means 13; however, this assumption does not always hold.
- This assumption holds when voice data acquired through all of the channels is available for every type of speaker.
- If voice data uttered by the same type of speaker through all of the channels is available, then even without knowing which speaker uttered it, it is possible to see how the voice changes purely as a result of the channel. The same is true when the sets of voice data collected for each channel are compared across channels.
- However, the sample data that can usually be collected is not complete in this way, as shown in FIG. 12.
- In FIG. 12, consider a case where some speakers have not uttered speech through some of the channels.
- a speaker who is “female” has voice data via two channels, “fixed phone” and “mobile phone”, but does not have voice data via a “microphone” channel.
- a speaker who is an “elderly person” has voice data via two channels, “microphone” and “landline telephone”, but no voice data via a “mobile phone” channel.
- a speaker who is “male” has only voice data via the “mobile phone” channel, and no voice data via the two channels “microphone” and “fixed phone”.
- For the "female" speaker, it is possible to know how the voice differs between the "fixed telephone" channel and the "mobile phone" channel, but not how it changes on the "microphone" channel.
- For example, the set of audio data for the "microphone" channel contains only the audio data of the "elderly" speaker and therefore reflects the voice characteristics of the elderly, whereas the set of audio data for the "mobile phone" channel does not contain the voice characteristics of the elderly at all. In such a situation, the variation factor due to the channel difference and the variation factor due to the speaker type are mixed, and it is difficult to isolate the variation factor due to the channel difference.
- the present invention has been made in view of the above circumstances, and provides a technique capable of learning an accurate acoustic model even in the case of non-perfect sample data, and thus capable of highly accurate speech recognition.
- the acoustic model learning device includes a first variation model learning unit, a second variation model learning unit, and an unspecified acoustic model learning unit.
- the first variation model learning unit uses sample speech data acquired through any one type of a first sound environment, which has a plurality of types and causes variation in the speech when the types differ, and through any one type of a second sound environment, which likewise has a plurality of types and causes variation in the speech when the types differ. Using this sample speech data, it estimates, for each type of the first sound environment, a parameter defining a first variation model that indicates the variation the first sound environment of that type causes in the speech.
- the second variation model learning unit uses the plurality of sample speech data to estimate, for each type of the second sound environment, the parameters defining a second variation model that indicates the variation the second sound environment of that type causes in the speech.
- the unspecified acoustic model learning unit uses the plurality of sample speech data to estimate the parameters defining an acoustic model (unspecified acoustic model) that is specific to neither the first sound environment type nor the second sound environment type.
- These three learning units estimate their respective parameters so that the integrated fitness, obtained by integrating the fitness of the first variation model to the sample speech data, the fitness of the second variation model to the sample speech data, and the fitness of the unspecified acoustic model to the sample speech data, becomes the highest.
- One speech recognition apparatus according to an aspect of the present invention includes a speech conversion unit that applies, to speech data to be recognized acquired through a predetermined type of the first sound environment, a conversion inverse to the variation indicated by the first variation model corresponding to that predetermined type, among the first variation models obtained by the acoustic model learning device according to the above aspect of the present invention, and performs speech recognition on the speech data obtained by the speech conversion unit.
- Another speech recognition apparatus includes a speech conversion unit that applies, to speech data to be recognized acquired through a predetermined type of the second sound environment, a conversion inverse to the variation indicated by the second variation model corresponding to that predetermined type, among the second variation models obtained by the acoustic model learning device according to the above aspect of the present invention, and performs speech recognition on the speech data obtained by the speech conversion unit.
- the sound environment recognition device includes a second sound conversion unit, a first sound conversion unit, and an identification unit.
- the second speech conversion unit applies, to speech data to be recognized acquired through a predetermined type of the second sound environment, a conversion inverse to the variation indicated by the second variation model corresponding to that predetermined type, among the second variation models obtained by the acoustic model learning device according to the above aspect of the present invention.
- the first speech conversion unit applies, to the speech data obtained by the second speech conversion unit, conversions inverse to the variations indicated by the respective first variation models obtained by the acoustic model learning device according to the above aspect of the present invention, thereby obtaining a plurality of speech data.
- the identification unit identifies the type of the first sound environment through which the speech data to be recognized has passed, using the plurality of speech data obtained by the first speech conversion unit and the unspecified acoustic model obtained by the acoustic model learning device according to the above aspect of the invention.
- the technique according to the present invention it is possible to learn an accurate acoustic model even in the case of sample data that is not perfect, and as a result, the accuracy of speech recognition can be improved.
- FIG. 1 is a schematic diagram of an acoustic model learning device for explaining the technique according to the present invention. FIG. 2 is a diagram showing a configuration example of the data stored in the sample data storage unit.
- FIG. 8 is a diagram showing the speech recognition device according to the third embodiment of the present invention. FIG. 9 is a flowchart showing the flow of processing in the speech recognition device shown in FIG. 8. FIG. 10 is a schematic diagram of an acoustic model learning device for explaining a conventional acoustic model learning method. FIGS. 11 and 12 are diagrams showing examples of sample audio data.
- each element described as a functional block for performing various processes can be configured, in terms of hardware, by a processor, a memory, and other circuits, and, in terms of software, is realized by a program recorded or loaded in a memory. Therefore, it will be understood by those skilled in the art that these functional blocks can be realized in various forms by hardware only, by software only, or by a combination thereof, and are not limited to any one of these. Also, for the sake of clarity, only the elements necessary for explaining the technique of the present invention are shown in these drawings.
- FIG. 1 is an example of a schematic diagram of an acoustic model learning device 100 based on the technique according to the present invention.
- the acoustic model learning device 100 includes a sample data storage unit 110, a first variation model learning unit 120, a second variation model learning unit 130, and an unspecified acoustic model learning unit 140.
- the sample data storage unit 110 stores various sample audio data (hereinafter simply referred to as sample data) in association with the type of the first sound environment and the type of the second sound environment in which each piece of sample data was acquired.
- the first sound environment has a plurality of types, and fluctuations occur in the sound when these types are different.
- the second sound environment also has a plurality of types, and fluctuations occur in the sound when these types are different.
- FIG. 2 shows an example of data stored in the sample data storage unit 110.
- Each piece of sample data is stored in association with a first sound environment label A indicating in which type of first sound environment the sample data was acquired, and with a second sound environment label B indicating in which type of second sound environment it was acquired.
- Each first sound environment label corresponds to a plurality of types of the first sound environment
- each second sound environment label corresponds to a plurality of types of the second sound environment.
- the sample data 1 is the sound data of the speaker A2 acquired through the channel B3
- the sample data 2 is voice data of the speaker A1 acquired via the channel B2.
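- As a minimal sketch (not code from the patent; the names below are illustrative assumptions), such labeled samples could be held in memory as a feature-vector time series paired with the two sound environment labels:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LabeledSample:
    features: np.ndarray   # (T, D) time series of feature vectors for one utterance
    label_a: str           # first sound environment label A, e.g. speaker "A2"
    label_b: str           # second sound environment label B, e.g. channel "B3"

# e.g. sample data 1 of FIG. 2: speech of speaker A2 acquired through channel B3
sample_1 = LabeledSample(features=np.zeros((120, 13)), label_a="A2", label_b="B3")
# e.g. sample data 2 of FIG. 2: speech of speaker A1 acquired through channel B2
sample_2 = LabeledSample(features=np.zeros((200, 13)), label_a="A1", label_b="B2")
```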
- the first variation model learning unit 120 estimates, for each type of the first sound environment, a parameter that defines a first variation model indicating the variation that the first sound environment of the type causes in the speech. For example, when the first sound environment is a speaker, each first variation model is a speaker variation model.
- the second variation model learning unit 130 estimates, for each type of the second sound environment, a parameter that defines a second variation model indicating the variation that the second sound environment of the type causes in the speech. For example, when the second sound environment is a channel, each second variation model is a channel variation model.
- the unspecified acoustic model learning unit 140 learns an acoustic model that does not depend on either the first sound environment or the second sound environment.
- this acoustic model is referred to as an unspecified acoustic model.
- the unspecified acoustic model learning unit 140 initializes the unspecified acoustic model, reads each sample data and two kinds of sound environment labels stored in the sample data storage unit 110, and updates the parameters of the unspecified acoustic model.
- a conventionally known Gaussian mixture model (GMM), hidden Markov model (HMM), or the like can be used as this unspecified acoustic model.
- the GMM is taken as an example, but the same operation can be derived when other models are used.
- μ_K and Σ_K are the mean and variance of the k-th Gaussian distribution, respectively, and C_K is the mixture (weight) coefficient of the k-th Gaussian distribution.
- Initialization of these parameters is performed by setting appropriate values for each parameter.
- Since the audio data is provided in the form of a time series of feature vectors, C_K may be set to 1/M, and μ_K and Σ_K may be set to the mean and the variance of the feature vectors, respectively.
- a parameter defining the model is referred to as a model parameter.
- T_{i,j} is the number of frames (the number of feature vectors).
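- The initialization described above can be sketched as follows (a rough illustration assuming M Gaussian components with diagonal covariances; it is not code from the patent):

```python
import numpy as np

def init_unspecified_gmm(sample_features, M):
    """Initialize the unspecified acoustic model parameters C_k, mu_k, Sigma_k:
    equal weights 1/M, and the global mean/variance of the pooled feature vectors.

    sample_features : (T, D) array pooling all sample feature vectors
    M               : number of Gaussian components
    """
    C = np.full(M, 1.0 / M)                               # mixture weights C_k = 1/M
    mu = np.tile(sample_features.mean(axis=0), (M, 1))    # mu_k = global mean of the features
    var = np.tile(sample_features.var(axis=0), (M, 1))    # Sigma_k = global (diagonal) variance
    return C, mu, var
```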
- the first variation model learning unit 120 initializes each first variation model, reads the sample data and the sound environment label A stored in the sample data storage unit 110, and updates the model parameters.
- As the model parameters of the first variation model, for example, the affine transformation parameter set {V_i, λ_i | i = 1, 2, ..., N} shown in Expression (3) can be used (N: the number of types of the first sound environment).
- Similarly, the second variation model learning unit 130 initializes each second variation model, reads the sample data and the sound environment label B stored in the sample data storage unit 110, and updates the model parameters.
- As the model parameters of the second variation model, for example, the affine transformation parameter set {W_j, μ_j | j = 1, 2, ..., C} shown in Expression (4) can be used (C: the number of types of the second sound environment).
- the first variation model learning unit 120, the second variation model learning unit 130, and the unspecified acoustic model learning unit 140 estimate their respective parameters so that the integrated fitness, obtained by integrating the fitness of the first variation model to the sample sound data, the fitness of the second variation model to the sample sound data, and the fitness of the unspecified acoustic model to the sample sound data, becomes the highest.
- the probability of sample audio data being observed represented by the parameters of these three models can be used as the integrated fitness. This probability will be described with reference to the generation process of sample audio data.
- FIG. 3 is a conceptual diagram of a sample sound data generation model expressing a phenomenon in which sound data that has changed due to passing through the two sound environments is observed in the order of the first sound environment and the second sound environment.
- the speech before any variation occurs is generated as a feature vector series "z_1, z_2, ..., z_T" according to the probability distribution of the unspecified acoustic model.
- this speech passes through the first sound environment of type i (1 ≤ i ≤ N) and undergoes the conversion shown in Expression (5).
- the speech that has passed through the first sound environment then passes through the second sound environment of type j (1 ≤ j ≤ C), undergoes the conversion shown in Expression (6), and becomes the speech "x_1, x_2, ..., x_T".
- only the speech "x_1, x_2, ..., x_T" is observable; "z_1, z_2, ..., z_T" and "y_1, y_2, ..., y_T" are not observable.
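- The generation process can be sketched as follows (the affine forms y_t = V_i z_t + λ_i for Expression (5) and x_t = W_j y_t + μ_j for Expression (6) are assumptions consistent with the affine-transformation parameter sets introduced above, not a reproduction of the patent's equations):

```python
import numpy as np

def generate_observed_features(C, mu, var, V_i, lam_i, W_j, mu_j, T, rng=np.random):
    """Sample T frames from the assumed generation model:
    z_t ~ GMM(C, mu, var)  ->  y_t = V_i z_t + lam_i  ->  x_t = W_j y_t + mu_j.
    Only x_1..x_T would be observable; z and y are latent."""
    K, D = mu.shape
    ks = rng.choice(K, size=T, p=C)                         # pick a Gaussian component per frame
    z = mu[ks] + rng.standard_normal((T, D)) * np.sqrt(var[ks])
    y = z @ V_i.T + lam_i                                   # first sound environment (e.g. speaker)
    x = y @ W_j.T + mu_j                                    # second sound environment (e.g. channel)
    return x
```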
- θ represents any one of the parameters of the unspecified acoustic model, the first variation model, and the second variation model, that is, any of C_K, μ_K, Σ_K, V_i, λ_i, W_j, and μ_j.
- N(·; μ, Σ) represents a Gaussian distribution with mean μ and variance Σ.
- the most accurate acoustic model can be estimated by estimating each parameter so that the integrated fitness, obtained by integrating the fitness of each model to the sample speech data, becomes the highest.
- as this integrated fitness, the probability represented by Equation (7) can be used.
- that is, the most accurate acoustic model can be obtained by estimating the parameters θ of the first variation model, the second variation model, and the unspecified acoustic model so that the probability represented by Equation (7) is maximized.
- each learning unit updates each parameter θ according to the following Equation (8).
- argmax means that the value of the variable (here, θ) is determined so that the value of the given function is maximized.
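- Although Equations (7) and (8) are not reproduced here, they presumably take the standard maximum-likelihood form, i.e. the probability of observing all sample speech data and its maximization over the model parameters:

$$
\hat{\Theta} \;=\; \operatorname*{argmax}_{\Theta}\; \prod_{i,j}\prod_{t=1}^{T_{i,j}} P\!\left(x_t^{(i,j)} \,\middle|\, \Theta\right),
\qquad
\Theta = \{\,C_k,\ \mu_k,\ \Sigma_k,\ V_i,\ \lambda_i,\ W_j,\ \mu_j\,\}
$$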
- the calculation shown in Expression (8) is well known as maximum likelihood estimation, and a numerical solution using an iterative calculation algorithm known as the expectation-maximization (EM) method can be applied.
- Besides maximum likelihood estimation, the parameter θ can also be updated by well-known methods such as maximum a posteriori (MAP) estimation or Bayesian estimation.
- each learning unit reads sample data, a first sound environment label, and a second sound environment label from the sample data storage unit 110 (S10, S12, S14). Note that the execution order of steps S10, S12, and S14 is not limited to the illustration and is arbitrary.
- each learning unit initializes its model parameters (S16). Specifically, the unspecified acoustic model learning unit 140 initializes the parameters C_K, μ_K, and Σ_K, the first variation model learning unit 120 initializes the parameters V_i and λ_i, and the second variation model learning unit 130 initializes the parameters W_j and μ_j. Examples of the values set for each parameter by the initialization are as described above, and details are omitted here.
- step S16 may be executed before steps S10 to S14.
- Alternatively, the unspecified acoustic model learning unit 140 may use a method such as initializing μ_K and Σ_K with random numbers.
- the unspecified acoustic model learning unit 140 updates the parameters C_K, μ_K, and Σ_K of the unspecified acoustic model according to Expressions (9), (10), and (11) (S18).
- γ_ijkt in Equations (9), (10), and (11) is calculated in advance according to Equation (12) as the probability of belonging to the k-th Gaussian distribution of the unspecified acoustic model.
- the parameter update by the unspecified acoustic model learning unit 140 in step S18 may be performed only once or may be repeated a predetermined number of times. Furthermore, a convergence determination, for example one using the log probability on the right-hand side of Equation (8) as an index, may be introduced, and the update may be repeated until convergence.
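- Expressions (9)–(12) themselves are not reproduced here, but one plausible realization of the posterior γ_ijkt (assuming the affine generation model sketched earlier, so that the composed mean is W_j (V_i μ_k + λ_i) + μ_j and the composed covariance is W_j V_i Σ_k V_iᵀ W_jᵀ) is the following; all function and variable names are illustrative:

```python
import numpy as np

def responsibilities(x, C, mu, var, V_i, lam_i, W_j, mu_j):
    """Posterior gamma_{ijkt}: probability that frame x_t of an utterance with
    labels (i, j) belongs to the k-th Gaussian of the unspecified acoustic model,
    under the composed model x_t ~ N(W_j (V_i mu_k + lam_i) + mu_j, A diag(var_k) A^T)
    with A = W_j V_i.  x : (T, D) observed frames; returns (T, K) responsibilities."""
    T, D = x.shape
    K = len(C)
    log_p = np.empty((T, K))
    A = W_j @ V_i                                     # composed affine matrix
    for k in range(K):
        m = A @ mu[k] + W_j @ lam_i + mu_j            # composed mean of component k
        S = A @ np.diag(var[k]) @ A.T                 # composed covariance of component k
        diff = x - m
        _, logdet = np.linalg.slogdet(S)
        mahal = np.einsum('td,td->t', diff @ np.linalg.inv(S), diff)
        log_p[:, k] = np.log(C[k]) - 0.5 * (logdet + mahal + D * np.log(2 * np.pi))
    log_p -= log_p.max(axis=1, keepdims=True)         # numerical stabilization
    gamma = np.exp(log_p)
    return gamma / gamma.sum(axis=1, keepdims=True)
```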
- the first variation model learning unit 120 updates the parameters V_i and λ_i of the first variation model according to Equations (13) and (14) (S20).
- γ_ijkt in Equations (13) and (14) is also calculated in advance according to Equation (12), as in the case of the unspecified acoustic model learning unit 140. The number of parameter updates may also be determined in the same manner as in the case of the unspecified acoustic model learning unit 140.
- the second variation model learning unit 130 updates the parameters μ_j and W_j of the second variation model according to Equations (15) and (16) (S22).
- γ_ijkt in Equations (15) and (16) is also calculated in advance according to Equation (12), as in the case of the unspecified acoustic model learning unit 140. The number of parameter updates may also be determined in the same manner as in the case of the unspecified acoustic model learning unit 140.
- steps S18 to S22 are repeated until convergence (S24: No, then S18).
- Upon convergence, the first variation model learning unit 120, the second variation model learning unit 130, and the unspecified acoustic model learning unit 140 output the parameters of the first variation model, the second variation model, and the unspecified acoustic model, respectively, and the learning process by the acoustic model learning device 100 ends.
- In this way, the first variation model learning unit 120 can extract only the variation factors due to the first sound environment, and the second variation model learning unit 130 can extract only the variation factors due to the second sound environment, so that a highly accurate acoustic model can be constructed even from sample data that is not complete. As a result, speech recognition using these acoustic models can also be performed with high accuracy.
- FIG. 5 shows an acoustic model learning apparatus 200 according to the first embodiment of the present invention.
- the acoustic model learning device 200 includes a sample data storage unit 212, a speaker label storage unit 214, a channel label storage unit 216, a speaker variation model learning unit 220, a channel variation model learning unit 230, an unspecified acoustic model learning unit 240, a speaker variation model storage unit 252, a channel variation model storage unit 254, and an unspecified acoustic model storage unit 256.
- the sample data storage unit 212 stores sample voice data of a plurality of speakers recorded through various channels.
- the speaker label storage unit 214 stores label data (speaker label) indicating each speaker of each sample data stored in the sample data storage unit 212.
- the channel label storage unit 216 stores data of a label (channel label) indicating each channel of each sample data stored in the sample data storage unit 212.
- the sample data storage unit 212, the speaker label storage unit 214, and the channel label storage unit 216 store the sample data, the speaker labels, and the channel labels so that they can be associated with each other.
- the speaker variation model learning unit 220 corresponds to the first variation model learning unit 120 of the acoustic model learning apparatus 100 shown in FIG.
- the speaker is the first sound environment
- the speaker variation model learning unit 220 obtains a first variation model for each speaker.
- This first variation model is hereinafter referred to as a speaker variation model.
- the channel variation model learning unit 230 corresponds to the second variation model learning unit 130 of the acoustic model learning apparatus 100.
- the channel is the second sound environment, and the channel variation model learning unit 230 obtains a second variation model for each channel.
- This second variation model is hereinafter referred to as a channel variation model.
- the unspecified acoustic model learning unit 240 corresponds to the unspecified acoustic model learning unit 140 of the acoustic model learning device 100 and learns an unspecified acoustic model that does not depend on either the speaker or the channel.
- These three learning units estimate their respective parameters so that the integrated fitness, obtained by integrating the fitness of the speaker variation model to the sample speech data, the fitness of the channel variation model to the sample speech data, and the fitness of the unspecified acoustic model to the sample speech data, becomes the highest. Since the specific processing of each learning unit is the same as that of the corresponding learning unit in the acoustic model learning device 100, detailed description thereof is omitted here.
- the speaker variation model storage unit 252, the channel variation model storage unit 254, and the unspecified acoustic model storage unit 256 store the speaker variation model, the channel variation model, and the unspecified acoustic model obtained by the speaker variation model learning unit 220, the channel variation model learning unit 230, and the unspecified acoustic model learning unit 240, respectively.
- the acoustic model learning device 200 of the present embodiment embodies the principle of the present invention, and can exhibit the same effects as the acoustic model learning device 100.
- FIG. 6 shows a speech recognition apparatus 300 according to the second embodiment of the present invention.
- the speech recognition apparatus 300 includes a channel input unit 312, a speech input unit 314, a channel variation model storage unit 324, an unspecified acoustic model storage unit 326, a speech conversion unit 330, and a speech recognition unit 340.
- the voice input unit 314 inputs voice data that is a target of voice recognition to the voice conversion unit 330.
- the channel input unit 312 inputs a label indicating the channel through which the audio data input by the audio input unit 314 passes.
- the label input by the channel input unit 312 is data indicating the type of the channel; as long as the model for each channel stored in the channel variation model storage unit 324 can be specified by it, the data is not limited to a label and may be any name or number.
- the channel variation model storage unit 324 corresponds to the channel variation model storage unit 254 in the acoustic model learning apparatus 200 illustrated in FIG. 5, and stores the channel variation models obtained by the channel variation model learning unit 230. Specifically, for each of the C types of channels, the parameters μ_j and W_j are stored in association with the label indicating the type of the channel.
- the unspecified acoustic model storage unit 326 corresponds to the unspecified acoustic model storage unit 256 in the acoustic model learning device 200 illustrated in FIG. 5 and stores the unspecified acoustic model obtained by the unspecified acoustic model learning unit 240.
- the voice conversion unit 330 performs a conversion that removes the influence of the channel from the voice data input by the voice input unit 314. Specifically, the parameters μ_j and W_j corresponding to the label input by the channel input unit 312 are read from the channel variation model storage unit 324, and the input voice data "x_1, x_2, ..., x_T" is converted into "y_1, y_2, ..., y_T".
- the voice data changes as shown in the following equation (6) by passing through the channel of type j.
- the conversion performed by the voice conversion unit 330 corresponds to the inverse of the conversion, shown in Expression (6), that the channel of type j applies to the voice. That is, by this conversion, the influence of the channel of type j through which the voice data passed, as indicated by the channel input unit 312, is removed from the voice data input by the voice input unit 314.
- the voice conversion unit 330 outputs the voice data "y_1, y_2, ..., y_T", from which the influence of the channel has been removed, to the voice recognition unit 340.
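- A minimal sketch of this conversion, assuming Expression (6) has the affine form x_t = W_j y_t + μ_j (so that its inverse is y_t = W_j⁻¹(x_t − μ_j)), is:

```python
import numpy as np

def voice_conversion_remove_channel(x, W_j, mu_j):
    """Sketch of the conversion in the voice conversion unit 330, assuming the
    channel distortion of Expression (6) is x_t = W_j y_t + mu_j: recover
    y_t = W_j^{-1} (x_t - mu_j) for every frame of the (T, D) input features."""
    return (x - mu_j) @ np.linalg.inv(W_j).T
```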
- the speech recognition unit 340 reads the unspecified acoustic model from the unspecified acoustic model storage unit 326 and, for the speech data "y_1, y_2, ..., y_T" from the speech conversion unit 330, performs speech recognition by a conventionally known speech recognition method using a dictionary (not shown), a language model, grammatical rules, and the like, and outputs the resulting character string.
- FIG. 7 is a flowchart showing the flow of processing of the speech recognition apparatus 300 shown in FIG.
- the speech recognition unit 340 reads the unspecified acoustic model from the unspecified acoustic model storage unit 326 (S50). Note that step S50 may be executed at any time before the speech recognition unit 340 starts speech recognition.
- the voice conversion unit 330 reads the voice data from the voice input unit 314 and also reads, from the channel input unit 312, the channel label indicating the channel through which the voice data passed (S52, S54). Then, the voice conversion unit 330 reads, from the channel variation model storage unit 324, the parameters of the channel variation model corresponding to the channel label read from the channel input unit 312, and performs, on the voice data read from the voice input unit 314, the voice conversion that removes the influence of the channel (S58).
- the voice recognition unit 340 performs voice recognition on the voice data from which the influence of the channel is removed by the voice conversion unit 330 to obtain a character string (S60).
- Since the channel variation model extracts only the variation component due to the sound environment of the channel, speech recognition can be performed after the influence of the channel has been removed from the speech data to be recognized, and the accuracy of speech recognition can therefore be improved.
- In the present embodiment, the influence of the channel is removed by applying an affine transformation to the speech data in the speech conversion unit 330; however, the same effect can be obtained by applying an equivalent conversion to the unspecified acoustic model instead of converting the speech data.
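- A sketch of such an equivalent model-side compensation, again assuming the affine channel form x_t = W_j y_t + μ_j, adapts the Gaussians of the unspecified acoustic model instead of the speech (the means shift by the channel offset and the covariances pick up the channel matrix); the names are illustrative, not from the patent:

```python
import numpy as np

def adapt_gmm_to_channel(mu, var, W_j, mu_j):
    """Model-side compensation: transform the unspecified-model Gaussians so they
    match channel-distorted speech, giving mu'_k = W_j mu_k + mu_j and
    Sigma'_k = W_j diag(var_k) W_j^T (full covariances even if var_k is diagonal)."""
    mu_adapted = mu @ W_j.T + mu_j                               # (K, D) adapted means
    sigma_adapted = np.array([W_j @ np.diag(v) @ W_j.T for v in var])  # (K, D, D) covariances
    return mu_adapted, sigma_adapted
```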
- the speech recognition apparatus 300 is an example in which a channel variation model obtained by the acoustic model learning technique according to the present invention is applied to speech recognition.
- Similarly, the speaker variation model obtained by the acoustic model learning technique according to the present invention may be applied to a speech input device or the like; in that case, speech recognition is performed after the influence of the speaker has been removed from the speech data to be recognized, so the recognition can likewise be made accurate.
- FIG. 8 shows a speech recognition apparatus 400 according to the third embodiment of the present invention.
- the speech recognition apparatus 400 identifies the speaker of the input speech, and includes a channel input unit 412, a speech input unit 414, a channel variation model storage unit 422, a speaker variation model storage unit 424, an unspecified acoustic model storage unit 426, a second speech conversion unit 430, a first speech conversion unit 440, and a speaker identification unit 450.
- the channel input unit 412, the speech input unit 414, the channel variation model storage unit 422, the unspecified acoustic model storage unit 426, and the second speech conversion unit 430 have the same functions and configurations as the channel input unit 312, the speech input unit 314, the channel variation model storage unit 324, the unspecified acoustic model storage unit 326, and the speech conversion unit 330 in the speech recognition apparatus 300 shown in FIG. 6, and description thereof is omitted here.
- the speaker variation model storage unit 424 corresponds to the speaker variation model storage unit 252 in the acoustic model learning device 200 illustrated in FIG. 5, and stores the speaker variation models obtained by the speaker variation model learning unit 220. Specifically, a parameter set "V_i, λ_i" is stored for each of the N speakers.
- the voice data from which the influence of the channel is removed by the second voice conversion unit 430 is output to the first voice conversion unit 440.
- the first speech conversion unit 440 reads out the parameter sets "V_i, λ_i" for the N speakers from the speaker variation model storage unit 424 and, using each parameter set in turn, converts the speech data according to the following formulas to obtain the speech data "z_{1,1}, z_{1,2}, ..., z_{1,T}", "z_{2,1}, z_{2,2}, ..., z_{2,T}", ..., "z_{N,1}, z_{N,2}, ..., z_{N,T}".
- the voice data changes according to the following equation (5) by the utterance by the speaker of type i.
- the conversion performed by the first speech conversion unit 440 corresponds to the inverse of the conversion, shown in Equation (5), that the speaker of type i applies to the speech. That is, if the speech data input by the speech input unit 414 was uttered by speaker i, this conversion removes the influence of speaker i from the speech data.
- the calculation of the similarity S_i by the speaker identification unit 450 can be performed, for example, according to the following Equation (19), or the alternative formula shown below may be used instead.
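- Equation (19) is not reproduced here; one plausible realization of the similarity S_i is the likelihood, under the unspecified acoustic model, of the speech after undoing each candidate speaker's transform (assuming the affine speaker form y_t = V_i z_t + λ_i; function and variable names are illustrative):

```python
import numpy as np

def gmm_log_likelihood(x, C, mu, var):
    """Average log-likelihood of frames x (T, D) under a diagonal-covariance GMM."""
    log_comp = (np.log(C)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum((x[:, None, :] - mu) ** 2 / var, axis=2))
    return np.mean(np.logaddexp.reduce(log_comp, axis=1))

def identify_speaker(x_channel_removed, speaker_params, C, mu, var):
    """Assumed similarity S_i: score the channel-compensated speech after undoing
    each candidate speaker transform, and return the index of the best speaker.
    speaker_params : list of (V_i, lam_i) pairs for the N speaker variation models."""
    scores = []
    for V_i, lam_i in speaker_params:
        z = (x_channel_removed - lam_i) @ np.linalg.inv(V_i).T   # undo speaker i's transform
        scores.append(gmm_log_likelihood(z, C, mu, var))
    return int(np.argmax(scores))
```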
- FIG. 9 is a flowchart showing the flow of processing of the speech recognition apparatus 400 shown in FIG.
- the processing from steps S80 to S88 is the same as the processing from steps S50 to S58 of the speech recognition apparatus 300 shown in FIG. 7, and detailed description thereof is omitted here.
- Next, the first speech conversion unit 440 reads all the parameters of the speaker variation models stored in the speaker variation model storage unit 424 and, assuming in turn that the speaker is each of speaker 1 to speaker N, performs on the speech data from the second speech conversion unit 430 the first speech conversion that removes the influence of that speaker, thereby obtaining N pieces of speech data (S92).
- the speech recognition apparatus 400 of the present embodiment since the speaker is recognized after removing the influence of the channel on the speech data by the second speech conversion unit 430, the recognition accuracy can be improved.
- A program describing the procedure of the acoustic model learning process or the speech recognition process according to each of the above-described embodiments may be installed on a computer so that the computer operates as the acoustic model learning device or the speech recognition apparatus of that embodiment.
- a computer storage device such as a hard disk may be used as the storage unit for storing each model.
- the present invention is used in a learning technique for constructing an acoustic model, for example.
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
- Telephone Function (AREA)
Abstract
Description
D. A. Reynolds, "Channel robust speaker verification via feature mapping," Proc. ICASSP2003, Vol.II, pp.53-56, 2003 D. Zhu et al., "A generalized feature transformation approach for channel robust speaker verification," Proc. ICASSP2007, Vol.IV, pp.61-64, 2007 T. Anastasakos et al., "A compact model for speaker-adaptive training," Proc. ICSLP96, 1996
12 Channel label storage means 13 Unspecified speaker model learning means
14 Channel model learning means 15 Unspecified speaker model storage means
16 Channel model storage means 100 Acoustic model learning device
110 Sample data storage unit 120 First variation model learning unit
130 Second variation model learning unit 140 Unspecified acoustic model learning unit
200 Acoustic model learning device 212 Sample data storage unit
214 Speaker label storage unit 216 Channel label storage unit
220 Speaker variation model learning unit 230 Channel variation model learning unit
240 Unspecified acoustic model learning unit 252 Speaker variation model storage unit
254 Channel variation model storage unit 256 Unspecified acoustic model storage unit
300 Speech recognition device 312 Channel input unit
314 Speech input unit 324 Channel variation model storage unit
326 Unspecified acoustic model storage unit 330 Speech conversion unit
340 Speech recognition unit 400 Speech recognition device
412 Channel input unit 414 Speech input unit
422 Channel variation model storage unit 424 Speaker variation model storage unit
426 Unspecified acoustic model storage unit 430 Second speech conversion unit
440 First speech conversion unit 450 Speaker identification unit
Claims (13)
- 1. An acoustic model learning device comprising: a first variation model learning unit that, using sample speech data acquired through any one type of a first sound environment, which has a plurality of types and in which speech varies when the types differ, and any one type of a second sound environment, which has a plurality of types and in which speech varies when the types differ, estimates, for each type of the first sound environment, parameters defining a first variation model indicating the variation that the first sound environment of that type causes in the speech; a second variation model learning unit that, using the plurality of sample speech data, estimates, for each type of the second sound environment, parameters defining a second variation model indicating the variation that the second sound environment of that type causes in the speech; and an unspecified acoustic model learning unit that, using the plurality of sample speech data, estimates parameters defining an unspecified acoustic model that is specific to neither the type of the first sound environment nor the type of the second sound environment, wherein each of the learning units estimates its respective parameters so that an integrated fitness obtained by integrating the fitness of the first variation model to the sample speech data, the fitness of the second variation model to the sample speech data, and the fitness of the unspecified acoustic model to the sample speech data becomes the highest.
- 2. The acoustic model learning device according to claim 1, wherein each of the learning units uses, as the integrated fitness, the probability that the sample speech data is observed, represented by the parameters of the first variation model, the second variation model, and the unspecified acoustic model.
- 3. The acoustic model learning device according to claim 1 or 2, wherein each of the learning units estimates the parameters using an iterative solution based on any of maximum likelihood estimation, maximum a posteriori estimation, and Bayesian estimation.
- 4. The acoustic model learning device according to claim 3, wherein the first variation model and the second variation model are defined by affine transformations.
- 5. The acoustic model learning device according to claim 3 or 4, wherein the unspecified acoustic model is a Gaussian mixture model or a hidden Markov model.
- 6. A speech recognition device comprising a speech conversion unit that applies, to speech data to be recognized acquired through a predetermined type of the first sound environment, a conversion inverse to the variation indicated by the first variation model corresponding to the predetermined type, among the first variation models obtained by the acoustic model learning device according to any one of claims 1 to 5, wherein speech recognition is performed on the speech data obtained by the speech conversion unit.
- 7. A speech recognition device comprising a speech conversion unit that applies, to speech data to be recognized acquired through a predetermined type of the second sound environment, a conversion inverse to the variation indicated by the second variation model corresponding to the predetermined type, among the second variation models obtained by the acoustic model learning device according to any one of claims 1 to 5, wherein speech recognition is performed on the speech data obtained by the speech conversion unit.
- 8. A sound environment recognition device comprising: a second speech conversion unit that applies, to speech data to be recognized acquired through a predetermined type of the second sound environment, a conversion inverse to the variation indicated by the second variation model corresponding to the predetermined type, among the second variation models obtained by the acoustic model learning device according to any one of claims 1 to 5; a first speech conversion unit that applies, to the speech data obtained by the second speech conversion unit, conversions inverse to the variations indicated by the respective first variation models obtained by the acoustic model learning device according to any one of claims 1 to 5, thereby obtaining a plurality of pieces of speech data; and an identification unit that identifies the type of the first sound environment through which the speech data to be recognized passed, using the plurality of pieces of speech data obtained by the first speech conversion unit and the unspecified acoustic model obtained by the acoustic model learning device according to any one of claims 1 to 5.
- 9. The sound environment recognition device according to claim 8, wherein the first sound environment is a speaker and the second sound environment is a transmission channel.
- 10. An acoustic model learning method comprising: a first variation model learning step of estimating, for each type of a first sound environment, which has a plurality of types and in which speech varies when the types differ, parameters defining a first variation model indicating the variation that the first sound environment of that type causes in the speech, using sample speech data acquired through any one type of the first sound environment and any one type of a second sound environment, which has a plurality of types and in which speech varies when the types differ; a second variation model learning step of estimating, for each type of the second sound environment, parameters defining a second variation model indicating the variation that the second sound environment of that type causes in the speech, using the plurality of sample speech data; and an unspecified acoustic model learning step of estimating, using the plurality of sample speech data, parameters defining an unspecified acoustic model that is specific to neither the type of the first sound environment nor the type of the second sound environment, wherein in each of the acoustic model learning steps the respective parameters are estimated so that an integrated fitness obtained by integrating the fitness of the first variation model to the sample speech data, the fitness of the second variation model to the sample speech data, and the fitness of the unspecified acoustic model to the sample speech data becomes the highest.
- 11. The acoustic model learning method according to claim 10, wherein in each of the acoustic model learning steps the probability that the sample speech data is observed, represented by the parameters of the first variation model, the second variation model, and the unspecified acoustic model, is used as the integrated fitness.
- 12. A computer-readable recording medium recording a program that causes a computer to execute: a first variation model learning step of estimating, for each type of a first sound environment, which has a plurality of types and in which speech varies when the types differ, parameters defining a first variation model indicating the variation that the first sound environment of that type causes in the speech, using sample speech data acquired through any one type of the first sound environment and any one type of a second sound environment, which has a plurality of types and in which speech varies when the types differ; a second variation model learning step of estimating, for each type of the second sound environment, parameters defining a second variation model indicating the variation that the second sound environment of that type causes in the speech, using the plurality of sample speech data; and an unspecified acoustic model learning step of estimating, using the plurality of sample speech data, parameters defining an unspecified acoustic model that is specific to neither the type of the first sound environment nor the type of the second sound environment, wherein in each of the acoustic model learning steps the respective parameters are estimated so that an integrated fitness obtained by integrating the fitness of the first variation model to the sample speech data, the fitness of the second variation model to the sample speech data, and the fitness of the unspecified acoustic model to the sample speech data becomes the highest.
- 13. The recording medium according to claim 12, wherein in each of the acoustic model learning steps the probability that the sample speech data is observed, represented by the parameters of the first variation model, the second variation model, and the unspecified acoustic model, is used as the integrated fitness.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2010510052A JP5423670B2 (ja) | 2008-04-30 | 2009-02-10 | Acoustic model learning device and speech recognition device |
US12/921,062 US8751227B2 (en) | 2008-04-30 | 2009-02-10 | Acoustic model learning device and speech recognition device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008118662 | 2008-04-30 | ||
JP2008-118662 | 2008-04-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009133719A1 true WO2009133719A1 (ja) | 2009-11-05 |
Family
ID=41254942
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/052193 WO2009133719A1 (ja) | 2009-02-10 | Acoustic model learning device and speech recognition device |
Country Status (3)
Country | Link |
---|---|
US (1) | US8751227B2 (ja) |
JP (1) | JP5423670B2 (ja) |
WO (1) | WO2009133719A1 (ja) |
Families Citing this family (64)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8819554B2 (en) * | 2008-12-23 | 2014-08-26 | At&T Intellectual Property I, L.P. | System and method for playing media |
US9098467B1 (en) * | 2012-12-19 | 2015-08-04 | Rawles Llc | Accepting voice commands based on user identity |
US9818427B2 (en) * | 2015-12-22 | 2017-11-14 | Intel Corporation | Automatic self-utterance removal from multimedia files |
US9965247B2 (en) | 2016-02-22 | 2018-05-08 | Sonos, Inc. | Voice controlled media playback system based on user profile |
US9947316B2 (en) | 2016-02-22 | 2018-04-17 | Sonos, Inc. | Voice control of a media playback system |
US9772817B2 (en) | 2016-02-22 | 2017-09-26 | Sonos, Inc. | Room-corrected voice detection |
US10264030B2 (en) | 2016-02-22 | 2019-04-16 | Sonos, Inc. | Networked microphone device control |
US10509626B2 (en) | 2016-02-22 | 2019-12-17 | Sonos, Inc | Handling of loss of pairing between networked devices |
US10095470B2 (en) | 2016-02-22 | 2018-10-09 | Sonos, Inc. | Audio response playback |
US9978390B2 (en) | 2016-06-09 | 2018-05-22 | Sonos, Inc. | Dynamic player selection for audio signal processing |
US10134399B2 (en) | 2016-07-15 | 2018-11-20 | Sonos, Inc. | Contextualization of voice inputs |
US10115400B2 (en) | 2016-08-05 | 2018-10-30 | Sonos, Inc. | Multiple voice services |
US9942678B1 (en) | 2016-09-27 | 2018-04-10 | Sonos, Inc. | Audio playback settings for voice interaction |
US10181323B2 (en) | 2016-10-19 | 2019-01-15 | Sonos, Inc. | Arbitration-based voice recognition |
US10475449B2 (en) | 2017-08-07 | 2019-11-12 | Sonos, Inc. | Wake-word detection suppression |
US10048930B1 (en) | 2017-09-08 | 2018-08-14 | Sonos, Inc. | Dynamic computation of system response volume |
US10531157B1 (en) * | 2017-09-21 | 2020-01-07 | Amazon Technologies, Inc. | Presentation and management of audio and visual content across devices |
US10446165B2 (en) | 2017-09-27 | 2019-10-15 | Sonos, Inc. | Robust short-time fourier transform acoustic echo cancellation during audio playback |
US10621981B2 (en) | 2017-09-28 | 2020-04-14 | Sonos, Inc. | Tone interference cancellation |
US10051366B1 (en) | 2017-09-28 | 2018-08-14 | Sonos, Inc. | Three-dimensional beam forming with a microphone array |
US10482868B2 (en) | 2017-09-28 | 2019-11-19 | Sonos, Inc. | Multi-channel acoustic echo cancellation |
US10466962B2 (en) | 2017-09-29 | 2019-11-05 | Sonos, Inc. | Media playback system with voice assistance |
US11343614B2 (en) | 2018-01-31 | 2022-05-24 | Sonos, Inc. | Device designation of playback and network microphone device arrangements |
US10600408B1 (en) * | 2018-03-23 | 2020-03-24 | Amazon Technologies, Inc. | Content output management based on speech quality |
US11175880B2 (en) | 2018-05-10 | 2021-11-16 | Sonos, Inc. | Systems and methods for voice-assisted media content selection |
US10959029B2 (en) | 2018-05-25 | 2021-03-23 | Sonos, Inc. | Determining and adapting to changes in microphone performance of playback devices |
US10681460B2 (en) | 2018-06-28 | 2020-06-09 | Sonos, Inc. | Systems and methods for associating playback devices with voice assistant services |
US11741398B2 (en) | 2018-08-03 | 2023-08-29 | Samsung Electronics Co., Ltd. | Multi-layered machine learning system to support ensemble learning |
US11076035B2 (en) | 2018-08-28 | 2021-07-27 | Sonos, Inc. | Do not disturb feature for audio notifications |
US10461710B1 (en) | 2018-08-28 | 2019-10-29 | Sonos, Inc. | Media playback system with maximum volume setting |
US10587430B1 (en) | 2018-09-14 | 2020-03-10 | Sonos, Inc. | Networked devices, systems, and methods for associating playback devices based on sound codes |
US11315553B2 (en) * | 2018-09-20 | 2022-04-26 | Samsung Electronics Co., Ltd. | Electronic device and method for providing or obtaining data for training thereof |
US11024331B2 (en) | 2018-09-21 | 2021-06-01 | Sonos, Inc. | Voice detection optimization using sound metadata |
US11100923B2 (en) | 2018-09-28 | 2021-08-24 | Sonos, Inc. | Systems and methods for selective wake word detection using neural network models |
US10692518B2 (en) | 2018-09-29 | 2020-06-23 | Sonos, Inc. | Linear filtering for noise-suppressed speech detection via multiple network microphone devices |
US11899519B2 (en) | 2018-10-23 | 2024-02-13 | Sonos, Inc. | Multiple stage network microphone device with reduced power consumption and processing load |
EP3654249A1 (en) | 2018-11-15 | 2020-05-20 | Snips | Dilated convolutions and gating for efficient keyword spotting |
US11183183B2 (en) | 2018-12-07 | 2021-11-23 | Sonos, Inc. | Systems and methods of operating media playback systems having multiple voice assistant services |
US11132989B2 (en) | 2018-12-13 | 2021-09-28 | Sonos, Inc. | Networked microphone devices, systems, and methods of localized arbitration |
US10602268B1 (en) | 2018-12-20 | 2020-03-24 | Sonos, Inc. | Optimization of network microphone devices using noise classification |
US10867604B2 (en) | 2019-02-08 | 2020-12-15 | Sonos, Inc. | Devices, systems, and methods for distributed voice processing |
US11315556B2 (en) | 2019-02-08 | 2022-04-26 | Sonos, Inc. | Devices, systems, and methods for distributed voice processing by transmitting sound data associated with a wake word to an appropriate device for identification |
EP3709194A1 (en) | 2019-03-15 | 2020-09-16 | Spotify AB | Ensemble-based data comparison |
US11120794B2 (en) | 2019-05-03 | 2021-09-14 | Sonos, Inc. | Voice assistant persistence across multiple network microphone devices |
US11361756B2 (en) | 2019-06-12 | 2022-06-14 | Sonos, Inc. | Conditional wake word eventing based on environment |
US10586540B1 (en) | 2019-06-12 | 2020-03-10 | Sonos, Inc. | Network microphone device with command keyword conditioning |
US11200894B2 (en) | 2019-06-12 | 2021-12-14 | Sonos, Inc. | Network microphone device with command keyword eventing |
US11138969B2 (en) | 2019-07-31 | 2021-10-05 | Sonos, Inc. | Locally distributed keyword detection |
US11138975B2 (en) | 2019-07-31 | 2021-10-05 | Sonos, Inc. | Locally distributed keyword detection |
US10871943B1 (en) | 2019-07-31 | 2020-12-22 | Sonos, Inc. | Noise classification for event detection |
US11094319B2 (en) | 2019-08-30 | 2021-08-17 | Spotify Ab | Systems and methods for generating a cleaned version of ambient sound |
US11189286B2 (en) | 2019-10-22 | 2021-11-30 | Sonos, Inc. | VAS toggle based on device orientation |
US11200900B2 (en) | 2019-12-20 | 2021-12-14 | Sonos, Inc. | Offline voice control |
US11562740B2 (en) | 2020-01-07 | 2023-01-24 | Sonos, Inc. | Voice verification for media playback |
US11556307B2 (en) | 2020-01-31 | 2023-01-17 | Sonos, Inc. | Local voice data processing |
US11308958B2 (en) | 2020-02-07 | 2022-04-19 | Sonos, Inc. | Localized wakeword verification |
US11308959B2 (en) | 2020-02-11 | 2022-04-19 | Spotify Ab | Dynamic adjustment of wake word acceptance tolerance thresholds in voice-controlled devices |
US11328722B2 (en) * | 2020-02-11 | 2022-05-10 | Spotify Ab | Systems and methods for generating a singular voice audio stream |
US11308962B2 (en) * | 2020-05-20 | 2022-04-19 | Sonos, Inc. | Input detection windowing |
US11727919B2 (en) | 2020-05-20 | 2023-08-15 | Sonos, Inc. | Memory allocation for keyword spotting engines |
US11482224B2 (en) | 2020-05-20 | 2022-10-25 | Sonos, Inc. | Command keywords with input detection windowing |
US11698771B2 (en) | 2020-08-25 | 2023-07-11 | Sonos, Inc. | Vocal guidance engines for playback devices |
US11984123B2 (en) | 2020-11-12 | 2024-05-14 | Sonos, Inc. | Network device interaction by range |
CN115171654B (zh) * | 2022-06-24 | 2024-07-19 | 中国电子科技集团公司第二十九研究所 | An improved language identification method and system based on the total variability factor |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06175678A (ja) * | 1992-07-30 | 1994-06-24 | Nec Corp | 音声認識装置 |
JP2002091485A (ja) * | 2000-09-18 | 2002-03-27 | Pioneer Electronic Corp | 音声認識システム |
JP2003099082A (ja) * | 2001-09-21 | 2003-04-04 | Nec Corp | 音声標準パタン学習装置、方法および音声標準パタン学習プログラムを記録した記録媒体 |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB9706174D0 (en) * | 1997-03-25 | 1997-11-19 | Secr Defence | Recognition system |
US6230122B1 (en) * | 1998-09-09 | 2001-05-08 | Sony Corporation | Speech detection with noise suppression based on principal components analysis |
US6134524A (en) * | 1997-10-24 | 2000-10-17 | Nortel Networks Corporation | Method and apparatus to detect and delimit foreground speech |
US6980952B1 (en) * | 1998-08-15 | 2005-12-27 | Texas Instruments Incorporated | Source normalization training for HMM modeling of speech |
US6173258B1 (en) * | 1998-09-09 | 2001-01-09 | Sony Corporation | Method for reducing noise distortions in a speech recognition system |
US6826528B1 (en) * | 1998-09-09 | 2004-11-30 | Sony Corporation | Weighted frequency-channel background noise suppressor |
US6233556B1 (en) * | 1998-12-16 | 2001-05-15 | Nuance Communications | Voice processing and verification system |
US6766295B1 (en) * | 1999-05-10 | 2004-07-20 | Nuance Communications | Adaptation of a speech recognition system across multiple remote sessions with a speaker |
US7451085B2 (en) * | 2000-10-13 | 2008-11-11 | At&T Intellectual Property Ii, L.P. | System and method for providing a compensated speech recognition model for speech recognition |
US6999926B2 (en) * | 2000-11-16 | 2006-02-14 | International Business Machines Corporation | Unsupervised incremental adaptation using maximum likelihood spectral transformation |
US6915259B2 (en) * | 2001-05-24 | 2005-07-05 | Matsushita Electric Industrial Co., Ltd. | Speaker and environment adaptation based on linear separation of variability sources |
US6778957B2 (en) * | 2001-08-21 | 2004-08-17 | International Business Machines Corporation | Method and apparatus for handset detection |
US6934364B1 (en) * | 2002-02-28 | 2005-08-23 | Hewlett-Packard Development Company, L.P. | Handset identifier using support vector machines |
-
2009
- 2009-02-10 US US12/921,062 patent/US8751227B2/en active Active
- 2009-02-10 JP JP2010510052A patent/JP5423670B2/ja active Active
- 2009-02-10 WO PCT/JP2009/052193 patent/WO2009133719A1/ja active Application Filing
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06175678A (ja) * | 1992-07-30 | 1994-06-24 | Nec Corp | 音声認識装置 |
JP2002091485A (ja) * | 2000-09-18 | 2002-03-27 | Pioneer Electronic Corp | 音声認識システム |
JP2003099082A (ja) * | 2001-09-21 | 2003-04-04 | Nec Corp | 音声標準パタン学習装置、方法および音声標準パタン学習プログラムを記録した記録媒体 |
Non-Patent Citations (2)
Title |
---|
YOSHIKAZU YAMAGUCHI ET AL.: "Taylor Tenkai ni yoru Onkyo Model no Tekio", IEICE TECHNICAL REPORT, vol. 96, no. 422, 13 December 1996 (1996-12-13), pages 1 - 8 * |
YUYA AKITA ET AL.: "Hanashi Kotoba Onsei Ninshiki no Tameno Han'yoteki na Tokeiteki Hatsuon Hendo Model", THE TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS D-II, vol. J88-D-II, no. 9, 1 September 2005 (2005-09-01), pages 1780 - 1789 * |
Also Published As
Publication number | Publication date |
---|---|
US20110046952A1 (en) | 2011-02-24 |
US8751227B2 (en) | 2014-06-10 |
JPWO2009133719A1 (ja) | 2011-08-25 |
JP5423670B2 (ja) | 2014-02-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5423670B2 (ja) | Acoustic model learning device and speech recognition device | |
US8566093B2 (en) | Intersession variability compensation for automatic extraction of information from voice | |
US11264044B2 (en) | Acoustic model training method, speech recognition method, acoustic model training apparatus, speech recognition apparatus, acoustic model training program, and speech recognition program | |
Li et al. | An overview of noise-robust automatic speech recognition | |
EP2189976B1 (en) | Method for adapting a codebook for speech recognition | |
JP2005062866A (ja) | Bubble splitting method for creating a compact acoustic model | |
US20070260455A1 (en) | Feature-vector compensating apparatus, feature-vector compensating method, and computer program product | |
JPH0850499A (ja) | 信号識別方法 | |
US20110257976A1 (en) | Robust Speech Recognition | |
KR102406512B1 (ko) | Speech recognition method and apparatus therefor | |
CN111696522B (zh) | Tibetan speech recognition method based on HMM and DNN | |
JP6499095B2 (ja) | Signal processing method, signal processing device, and signal processing program | |
Lu et al. | Probabilistic linear discriminant analysis for acoustic modeling | |
JP5881454B2 (ja) | Apparatus and method for estimating spectral shape features of a signal for each sound source, and apparatus, method, and program for estimating spectral features of a target signal | |
JP2020060757A (ja) | Speaker recognition device, speaker recognition method, and program | |
KR19990083632A (ko) | Speaker and environment adaptation method based on eigenvoices including the maximum likelihood method | |
Das et al. | Deep Auto-Encoder Based Multi-Task Learning Using Probabilistic Transcriptions. | |
CN102237082A (zh) | Adaptive method for a speech recognition system | |
Wu et al. | An environment-compensated minimum classification error training approach based on stochastic vector mapping | |
Harvianto et al. | Analysis and voice recognition In Indonesian language using MFCC and SVM method | |
Yuliani et al. | Feature transformations for robust speech recognition in reverberant conditions | |
JP2000259198A (ja) | Pattern recognition device and method, and providing medium | |
JP4004368B2 (ja) | Speech recognition system | |
Long et al. | Offline to online speaker adaptation for real-time deep neural network based LVCSR systems | |
Kumar | Feature normalisation for robust speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09738656 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 12921062 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2010510052 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09738656 Country of ref document: EP Kind code of ref document: A1 |