CN105580071B - Method and apparatus for training a voice recognition model database - Google Patents

Method and apparatus for training a voice recognition model database

Info

Publication number
CN105580071B
Authority
CN
China
Prior art keywords
noise
data
user
specific
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201480025758.9A
Other languages
Chinese (zh)
Other versions
CN105580071A (en)
Inventor
约翰·R·梅洛尼
约耳·A·克拉克
约瑟夫·C·德怀尔
阿德里安·舒斯特
斯内海特哈·辛加拉朱
罗伯特·A·茹雷克
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google Technology Holdings LLC
Original Assignee
Google Technology Holdings LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US 61/819,985 (US201361819985P)
Priority to US 14/094,875 (US9275638B2)
Application filed by Google Technology Holdings LLC
Priority to PCT/US2014/035117 (WO2014182453A2)
Publication of CN105580071A
Application granted
Publication of CN105580071B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Abstract

The electronic device (102) digitally combines a single voice input with each of a series of noise samples. Each noise sample is taken from a different audio environment (e.g., street noise, crowd babble, in-vehicle noise). The voice input/noise sample combinations are used to train the VR model database without the user (104) having to repeat the voice input in each of the different environments. In one variation, the electronic device (102) transmits the user's voice input to a server (301) that maintains and trains the VR model database (308).

Description

Method and apparatus for training a voice recognition model database
Technical Field
The present disclosure relates to speech recognition, and more particularly, to methods and apparatus for training a voice recognition database.
Background
Although speech recognition has existed for decades, the quality of speech recognition software and hardware has only recently reached a sufficiently high level to be attractive to a large number of consumers. One area in which speech recognition has become very popular in recent years is the smart phone and tablet computer industries. With a speech recognition enabled device, a consumer can perform tasks such as making phone calls, writing mail, and using GPS navigation using voice commands only.
However, speech recognition in such devices is far from perfect. Speech recognition engines typically rely on a database of phonemes or commands against which spoken utterances are recognized. The user may need to "train" the phoneme or command database to recognize his or her speech characteristics: accent, frequently mispronounced words and syllables, tonal characteristics, cadence, and so forth. Even after training, the phoneme or command database may not be accurate in all audio environments. For example, the presence of background noise can reduce speech recognition accuracy.
Drawings
While the appended claims set forth the features of the present technology with particularity, the technology may be better understood from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 shows a user speaking into an electronic device, which is depicted in the drawing as a mobile device.
FIG. 2 illustrates example components of the electronic device of FIG. 1.
FIG. 3 illustrates an architecture upon which various embodiments may be implemented.
FIGS. 4-6 illustrate steps that may be performed in accordance with implementations of the present disclosure.
Detailed Description
The present disclosure sets forth methods and apparatus for training a noise-based voice recognition model database. The term "noise-based voice recognition model database" (simply "VR model database") as used herein refers to a database that functions as a noise-based phoneme database, as a command database, or as both.
Various embodiments of the present disclosure include manual and automated methods of training VR model databases. Manual embodiments of the present disclosure include a direct training method in which an electronic device (also referred to as a "device") guides a user to perform an operation, in response to which the device updates a VR model database. The device may perform the manual training method during initial setup of the device or at any time the user initiates the process. For example, when the user is in a new type of noise environment, the user may initiate a manual method to train the VR model database for this type of noise, and the device may store the new noise in the noise database.
An automatic embodiment includes a method that the device initiates without the user's knowledge. The device may initiate an automatic method based on environmental characteristics, such as when the device senses a new type of noise, or in response to a user's action. Examples of user actions that can initiate the automatic training method include the user launching a speech recognition session via a button press, gesture trigger, or voice trigger. In these cases, the device uses the user's utterance and any other noise it detects to further train the VR model database. The device may also use the user's utterance and the detected noise for the speech recognition process itself. In that case, if the device reacts positively to the speech recognition result (i.e., performs the action initiated by the speech recognition process rather than canceling it), the device initiates an automatic training process using the user's utterance from the speech recognition event and the result of the event as the training target.
According to various embodiments, in addition to live utterances and live noise, the device trains the VR model database using previously recorded noise and previously recorded utterances (retrieved from the noise database and the utterance database, respectively). Like live utterances and noise, previously recorded utterances can be captured in different noise environments and during different use cases of the device. In addition, the device may store live utterances and live noise in the utterance database and the noise database, respectively, for future use.
According to embodiments, the device may train the VR model database in various ways, any of which may be used for both manual and automatic training depending on the circumstances. For example, three approaches differ in how the composite speech-and-noise signal used to train the VR model database is captured. The first is based on a composite signal of speech and natural noise captured by the device. The second is based on capturing a composite signal of live speech and noise generated by the acoustic output transducer of the device. The third is based on a composite signal that the device produces by mixing speech with noise, each of which it either captures live or retrieves from memory. This last approach may use captured speech mixed with a previously stored noise file, or captured noise mixed with a previously stored utterance that was recorded in a quiet environment.
In one embodiment, the electronic device digitally combines a single voice input with each of a series of noise samples. Each noise sample is taken from a different audio environment (e.g., street noise, crowd babble, in-vehicle noise). The voice input/noise sample combinations are used to train the VR model database without the user having to repeat the voice input in each of the different environments. In one variation, the electronic device transmits the user's voice input to a server that maintains and trains the VR model database.
According to an embodiment, the method is carried out by recording an utterance, digitally combining the recorded utterance with a previously recorded noise sample, and training a noise-based VR model database based on the digital combination. Using the same single utterance, these steps may be repeated for each previously recorded noise sample in a set of noise samples (e.g., the noise samples of a noise database), and may thus be repeated before a different utterance is recorded. Over time, this process may be repeated to continually improve speech recognition.
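The digital combination described above can be illustrated with a minimal Python sketch. The disclosure does not say how the utterance and noise are scaled relative to each other; mixing at a fixed signal-to-noise ratio (the `snr_db` parameter), looping short noise samples, and the names `mix_at_snr` and `build_training_mixes` are all assumptions made for illustration only.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(utterance, noise, snr_db):
    """Return utterance plus scaled noise at the requested SNR (assumed policy).

    The noise is looped if it is shorter than the utterance.
    """
    if not utterance or not noise:
        raise ValueError("utterance and noise must be non-empty")
    looped = [noise[i % len(noise)] for i in range(len(utterance))]
    noise_rms = rms(looped)
    if noise_rms == 0.0:
        return list(utterance)  # silent noise: nothing to add
    # Choose the gain so that 20*log10(rms(utterance) / rms(gain*noise)) == snr_db.
    gain = rms(utterance) / (noise_rms * 10 ** (snr_db / 20))
    return [u + gain * n for u, n in zip(utterance, looped)]


def build_training_mixes(utterance, noise_db, snr_db=10.0):
    """Combine one recorded utterance with every stored noise sample."""
    return {name: mix_at_snr(utterance, noise, snr_db)
            for name, noise in noise_db.items()}


# Synthetic stand-ins for a recorded utterance and stored noise samples.
rng = random.Random(0)
utterance = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(800)]
noise_db = {
    "street": [rng.uniform(-1, 1) for _ in range(500)],
    "car": [rng.gauss(0, 0.3) for _ in range(1200)],
}
mixes = build_training_mixes(utterance, noise_db, snr_db=10.0)
```

Each entry of `mixes` would then serve as one training input, so a single recording yields one training example per stored noise environment.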
Alternatively, the electronic device may use its speaker to play back predetermined noise (jingle, car, crowd babble) to create a simulated noise environment, or to play nothing (silence). The user speaks both during playback and without playback. This allows the device to identify changes in the user's speech characteristics between quiet and noisy audio environments. The VR model database may be trained based on this information.
One embodiment involves receiving an utterance via a microphone of an electronic device and, while the utterance is received, reproducing a previously recorded noise sample through a speaker of the electronic device. The microphone picks up both the utterance and the previously recorded noise.
Yet another embodiment involves recording an utterance during a speech-to-text ("STT") command mode and determining whether the recorded utterance is an STT command. The determination may be based on whether a word recognition confidence value exceeds a threshold.
If the recorded utterance is recognized as an STT command, the electronic device performs a function based on the STT command. If the electronic device performs the correct function (i.e., the function associated with the command), the device trains the noise-based VR model database to associate the utterance with the command.
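The threshold test can be sketched as follows. The `Candidate` type, the function name `pick_stt_command`, the confidence range of 0 to 1, and the default threshold of 0.7 are hypothetical; the disclosure only states that a confidence value is compared against a threshold.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical recognizer hypothesis."""
    command: str
    confidence: float  # assumed to lie in [0, 1]


def pick_stt_command(candidates, threshold=0.7):
    """Return the best-scoring command if its confidence exceeds the threshold.

    Returns None when the utterance should not be treated as an STT command.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.confidence)
    return best.command if best.confidence > threshold else None


accepted = pick_stt_command([Candidate("call home", 0.9), Candidate("open mail", 0.4)])
rejected = pick_stt_command([Candidate("call home", 0.5)])
```

Only an accepted command would go on to be executed and, if the executed function turns out to be correct, used as a training label.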
The method can also be performed repeatedly during the STT command mode on the same voice phrase, recorded from the same person, combined with different noise environments. Examples of noise environments include homes, cars, streets, offices, and restaurants.
Where the present disclosure refers to modules and other elements "providing" information (data) to one another, it should be understood that there are many possible ways in which such actions may be carried out, including electrical signals transmitted along conductive paths (e.g., wires) and inter-object method calls.
The embodiments described herein are usable in an always-on-audio (AOA) environment. When using AOA, the electronic device can wake up from sleep mode upon receiving a trigger command from a user. AOA places additional demands on the device (especially mobile devices). AOA is most effective when the electronic device is able to accurately and quickly recognize the user's voice command.
Referring to FIG. 1, a user 104 provides a voice input (or voiced information or speech) 106 that is received by a speech recognition enabled electronic device ("device") 102 through a microphone (or other sound receiver) 108. The device 102, which in this example is a mobile device, includes a touch screen display 110 that is capable of displaying visual images and of receiving or sensing touch-type input provided by a user's finger or by another touch input device such as a stylus. In the embodiment shown in FIG. 1, the device 102 also has a plurality of discrete keys or buttons 112 that serve as input devices, even though the touch screen display 110 is present. However, such keys or buttons (or any particular number of such keys or buttons) need not be present in other embodiments, and the touch screen display 110 may serve as the primary or sole user input device.
Although fig. 1 specifically illustrates device 102 as including touch screen display 110 and keys or buttons 112, these features are intended merely as examples of components/features on device 102, and in other embodiments device 102 need not include one or more of these features and/or may include other features in addition to or in place of these features.
Device 102 is intended to represent a variety of devices including, for example, a cellular telephone, a Personal Digital Assistant (PDA), a smart phone, or other handheld or portable electronic devices. In alternative embodiments, the device may also be a headset (e.g., a bluetooth headset), an MP3 player, a battery-powered device, a watch device (e.g., a wristwatch) or other wearable device, a radio, a navigation device, a laptop or notebook computer, a netbook, a pager, a PMP (personal media player), a DVR (digital video recorder), a gaming device, a camera, an e-reader, an e-book, a tablet computer device, a navigation device with a video-enabled screen, a multimedia docking station, or other device.
Embodiments of the present disclosure are intended to be applicable to any of a variety of electronic devices that are capable of or configured to receive voice input or other sound input indicative or representative of voiced information.
Fig. 2 illustrates internal components of the device 102 of fig. 1, in accordance with an embodiment of the present disclosure. As shown in fig. 2, the device 102 includes one or more wireless transceivers 202, a computing processor 204 (e.g., a microprocessor, microcomputer, application-specific integrated circuit, digital signal processor, etc.), a memory 206, one or more output devices 208, and one or more input devices 210. The device 102 may further include a component interface 212 to provide a direct connection to auxiliary components or accessories for additional or enhanced functionality. The device 102 may also include a power source 214, such as a battery, for providing power to the other internal components while enabling the device to be portable. In addition, the device 102 includes one or more sensors 228. All components of the device 102 may be coupled to, and communicate with, one another via one or more internal communication links 232 (e.g., an internal bus).
Additionally, in the embodiment of fig. 2, the wireless transceivers 202 specifically include a cellular transceiver 203 and a Wireless Local Area Network (WLAN) transceiver 205. More specifically, the cellular transceiver 203 is configured to conduct cellular communications, such as 3G and 4G-LTE, with cell towers (not shown), although in other embodiments the cellular transceiver 203 may be configured to use any of a variety of other cellular-based communication technologies such as analog communications (using AMPS), digital communications (using CDMA, TDMA, GSM, iDEN, GPRS, EDGE, etc.), and/or next generation communications (using UMTS, WCDMA, LTE, IEEE 802.16, etc.) or variations thereof.
By contrast, the WLAN transceiver 205 is configured to communicate with access points in accordance with the IEEE 802.11 (a, b, g, or n) standard. In other embodiments, the WLAN transceiver 205 may instead (or in addition) engage in other types of communications that are generally understood to fall within WLAN communications, such as certain types of peer-to-peer (e.g., Wi-Fi peer-to-peer) communications. Further, in other embodiments, the WLAN transceiver 205 may be replaced or supplemented with one or more other wireless transceivers configured for non-cellular wireless communication, including, for example, wireless transceivers employing ad-hoc network communication technologies such as HomeRF (radio frequency), Home Node B (3G femtocell), Bluetooth, and/or other wireless communication technologies such as infrared technology.
Although in the present embodiment device 102 has two wireless transceivers 202 (i.e., transceivers 203 and 205), the present disclosure is intended to encompass many embodiments in which there are any number of wireless transceivers employing any number of communication technologies. By using wireless transceiver 202, device 102 is able to communicate with any of a variety of other devices or systems (not shown) including, for example, other mobile devices, Web servers, cell towers, access points, other remote devices, and so forth. Depending on the embodiment or the environment, wireless communication between device 102 and any number of other devices or systems may be implemented.
Operation of wireless transceiver 202 in conjunction with other internal components of device 102 may take various forms. For example, operation of the wireless transceiver 202 may be such that, upon receipt of a wireless signal, internal components of the device 102 detect the communication signal and the transceiver 202 demodulates the communication signal to recover incoming information, such as voice and/or data, transmitted via the wireless signal. After receiving the incoming information from the transceiver 202, the computing processor 204 formats the incoming information for one or more output devices 208. Likewise, for transmission of wireless signals, the computing processor 204 formats outgoing information, which may be, but is not required to be, activated by the input device 210, and conveys the outgoing information to one or more of the wireless transceivers 202 for modulation to provide a modulated communication signal to be transmitted.
According to embodiments, the output and input devices 208 and 210 of the device 102 may include various visual, audio, and/or mechanical outputs and inputs. For example, the output devices 208 may include one or more visual output devices 216 such as a liquid crystal display and/or a light emitting diode indicator, one or more audio output devices 218 such as a speaker, an alarm, and/or a buzzer, and/or one or more mechanical output devices 220 such as a vibrating mechanism. The visual output devices 216 may include a video screen, among other things. Likewise, by way of example, the input devices 210 may include one or more visual input devices 222 such as optical sensors (e.g., camera lenses and photoelectric sensors), one or more audio input devices 224 such as the microphone 108 of fig. 1 (or, further, a microphone of a Bluetooth headset), and/or one or more mechanical input devices 226 such as flip sensors, keyboards, keypads, selection buttons, navigation clusters, touch pads, capacitive sensors, motion sensors, and/or switches. Operations that can actuate one or more of the input devices 210 include not only physically pressing or actuating a button or other actuator, but also, for example, opening the mobile device, unlocking the device, moving the device to actuate a motion, moving the device to actuate a location positioning system, and operating the device.
As described above, the device 102 may also include one or more of various types of sensors 228 and a sensor hub for managing one or more functions of the sensors. The sensors 228 may include, for example, proximity sensors (e.g., light detection sensors, ultrasonic transceivers, or infrared transceivers), touch sensors, altitude sensors, and one or more location circuits/components that may include, for example, Global Positioning System (GPS) receivers, triangulation receivers, accelerometers, tilt sensors, gyroscopes, or any other information gathering device that can identify a current location or user-device interface (carry mode) of the device 102. While the sensors 228 are considered to be distinct from the input devices 210 for purposes of fig. 2, it is possible in other embodiments that one or more of the input devices may also be considered to constitute one or more of the sensors (and vice versa). Further, while the input devices 210 are shown in the present embodiment as being distinct from the output devices 208, it should be appreciated that in some embodiments one or more devices function as both an input device and an output device. In particular, in the present embodiment where the device 102 includes a touch screen display 110, the touch screen display may be considered to constitute both a visual output device and a mechanical input device (by contrast, the keys or buttons 112 are merely mechanical input devices).
The memory 206 may include one or more memory devices in any of various forms (e.g., read-only memory, random access memory, static random access memory, dynamic random access memory, etc.) and may be used by the computing processor 204 to store and retrieve data. In some embodiments, the memory 206 and the computing processor 204 may be integrated in a single device (e.g., a processing device including memory, or a processor-in-memory (PIM)), although such a single device typically still has distinct portions that perform the different processing and memory functions and that may be considered separate devices. In some alternative embodiments, the memory 206 of the device 102 may be supplemented with or replaced by other memory located remotely from the device 102, and in such embodiments the device 102 may communicate with or access such other memory through any of a variety of communication techniques (e.g., wireless communication provided by the wireless transceivers 202 or a connection via the component interface 212).
The data stored by the memory 206 may include, but is not limited to, operating systems, programs (applications), modules, and informational data. Each operating system includes executable code that controls basic functions of the device 102, such as interactions between various components included within the internal components of the device 102, communications with external devices via the wireless transceiver 202 and/or the component interface 212, and storing and retrieving programs and data from the memory 206. In terms of programs, each program includes executable code that uses the operating system to provide more specific functions such as file system services and processing of protected and unprotected data stored in memory 206. The program may include, among other things, programming that enables device 102 to perform a process such as the speech recognition process shown in fig. 3 and discussed further below. Finally, with respect to informational data, it is non-executable code or information that may be referenced and/or manipulated by an operating system or program for performing functions of device 102.
With reference to fig. 3, a configuration of the electronic device 102 according to an embodiment will now be described. The VR model database 308, the utterance database 309, and the noise database 310 are stored in the memory 206 of the electronic device 102, and all of them are accessible to the computing processor 204, the audio input device 224 (e.g., a microphone), and the audio output device 218 (e.g., a speaker). The VR model database 308 contains data for associating sounds with voice phonemes, commands, or both. The utterance database 309 contains samples of voice utterances recorded by the user or users. The noise database 310 contains samples of noise recorded from different environments, generated digitally, or both.
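One way to picture the three stores is the in-memory Python sketch below. The class name `DeviceStores`, its fields, and its methods are hypothetical stand-ins chosen for illustration; the disclosure does not specify a schema or storage format for databases 308, 309, and 310.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Samples = List[float]


@dataclass
class DeviceStores:
    """Hypothetical in-memory stand-ins for databases 308, 309 and 310."""
    # 308: phoneme or command label -> list of (environment, audio) examples
    vr_models: Dict[str, List[Tuple[str, Samples]]] = field(default_factory=dict)
    # 309: utterance id -> recorded utterance samples
    utterances: Dict[str, Samples] = field(default_factory=dict)
    # 310: environment name -> recorded or generated noise samples
    noises: Dict[str, Samples] = field(default_factory=dict)

    def store_capture(self, utterance_id, utterance, environment, noise):
        """Persist a live capture so later training passes can reuse it."""
        self.utterances[utterance_id] = list(utterance)
        self.noises[environment] = list(noise)

    def add_training_example(self, label, environment, audio):
        """Associate audio with a phoneme or command label."""
        self.vr_models.setdefault(label, []).append((environment, list(audio)))


stores = DeviceStores()
stores.store_capture("u1", [0.1, -0.2], "restaurant", [0.01, 0.02])
stores.add_training_example("call home", "restaurant", [0.11, -0.18])
```

Keeping utterances and noise in separate stores is what lets a single recorded utterance be paired with many noise samples later.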
The device 102 is capable of accessing a network such as the internet. Although components such as the audio input device 224 and the audio output device 218 are shown as directly coupled, they may be connected to the computing processor 204 through other components or circuitry in the device. In addition, the utterances and noise captured by the device 102 may be stored temporarily in the memory 206 or more persistently in the utterance database 309 and the noise database 310, respectively. The utterances and noise, whether stored temporarily or persistently, may then be accessed by the computing processor 204. The computing processor 204 may reside external to the electronic device 102, such as on a server on the internet.
The computing processor 204 executes a speech recognition engine 305, which may reside in the memory 206 and has access to the noise database 310, the utterance database 309, and the VR model database 308. In one embodiment, one or more of the noise database 310, the utterance database 309, the VR model database 308, and the speech recognition engine 305 are stored on and executed by a remotely located server 301.
With reference to fig. 4, a process performed by the electronic device 102 (fig. 3) according to an embodiment will now be described. The process 400 shown in fig. 4 is a passive training system that updates and improves the VR model database 308 in a manner that is transparent to the user, since it requires no conscious interaction by the user to augment the model. The process 400 begins with the electronic device 102 in an STT command session, during which the speech recognition engine 305 is in a mode that interprets utterances as commands rather than as words to be converted to text.
At step 402, the electronic device 102 records the user's utterance, which includes natural background noise. The recorded utterance and noise may be stored in the utterance database 309 and the noise database 310 for future use. At step 404, the speech recognition engine 305 determines whether the utterance is an STT command. In doing so, the speech recognition engine 305 determines the most likely candidate STT command given the utterance. The speech recognition engine 305 assigns a confidence score to the candidate and, if the confidence score is above a predetermined threshold, treats the utterance as an STT command. Among the factors that influence the confidence score is the method used in performing the training. If the utterance is determined not to be an STT command, the process returns to step 402. If it is determined to be an STT command, the electronic device 102 performs a function based on the STT command at step 406.
At step 408, the electronic device 102 determines whether the function performed is a valid operation. If so, at step 410, the electronic device 102 trains the VR model database 308 by, for example, associating the user's utterance with a command. The processes performed during normal operation allow the electronic device 102 to update the original VR model database 308 to reflect actual use in a variety of environments that naturally include noise inherent to those environments. Device 102 may also use previously recorded utterances from utterance database 309 and previously recorded noise from noise database 310 during the training process.
In an alternative embodiment, a "no" response at step 408 results in the device 102 asking the user, at step 411, to type the text of the command they wish to execute. The entered text and the utterance captured at step 402 are then used to train and update the VR model database 308.
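The control flow of process 400, including the alternative step 411, can be sketched in Python as follows. The recognizer, executor, validity check, trainer, and text prompt are passed in as hypothetical callables because the disclosure does not define their interfaces; the threshold value and the returned status strings are likewise assumptions.

```python
def passive_training_step(utterance, recognize, execute, is_valid, train,
                          ask_for_text=None, threshold=0.7):
    """Run one pass of a process-400-style loop on an already recorded utterance.

    recognize(utterance) -> (command, confidence); all callables are hypothetical.
    """
    command, confidence = recognize(utterance)       # step 404
    if confidence <= threshold:
        return "not_a_command"                       # caller goes back to step 402
    execute(command)                                 # step 406
    if is_valid(command):                            # step 408
        train(utterance, command)                    # step 410
        return "trained"
    if ask_for_text is not None:                     # step 411 (alternative embodiment)
        train(utterance, ask_for_text())
        return "trained_from_text"
    return "not_trained"


trained_pairs = []
outcome = passive_training_step(
    utterance=[0.1, 0.2],
    recognize=lambda audio: ("call home", 0.92),
    execute=lambda command: None,
    is_valid=lambda command: True,
    train=lambda audio, command: trained_pairs.append((tuple(audio), command)),
)
```

In this sketch the validity check might, for example, report whether the user let the action proceed rather than canceling it; that interpretation is an assumption.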
Referring to fig. 5, another process performed by the electronic device 102 will now be described. The process 500 is one in which the user deliberately interacts with the electronic device 102. The process 500 begins at step 502, where the electronic device 102 records an utterance, for example by converting the utterance to digital data and storing it as a digital file. The storage location may be volatile memory or more persistent memory (e.g., the utterance database 309). At step 504, the electronic device 102 retrieves the data of a noise sample (e.g., restaurant noise) from the noise database 310. The electronic device 102 may select the noise sample (e.g., by looping through some or all of the previously recorded noise samples), or the user may select it. At step 506, the electronic device 102 digitally combines the noise sample with the utterance. At step 508, the electronic device 102 trains the VR model database 308 using the combined noise sample and utterance. At step 510, the electronic device 102 updates the VR model database 308. At step 512, the electronic device 102 determines whether there is any more noise with which to train the VR model database 308. If not, the process ends. If so, the process returns to step 504, where the electronic device 102 retrieves another noise sample from the noise database 310.
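The loop over steps 504 to 512 can be sketched as below, assuming the utterance has already been recorded at step 502. The function name `train_with_each_noise`, the `train` callback, the optional `selected` list standing in for a user's choice of noise samples, and the plain sample-wise sum used for step 506 are all illustrative assumptions.

```python
def train_with_each_noise(utterance, noise_db, train, selected=None):
    """Train once per noise sample, reusing the same recorded utterance.

    noise_db maps environment names to sample lists; train(name, audio) is a
    hypothetical callback that trains and updates the VR model store.
    """
    # Either the device loops through all stored samples or the user picks some.
    names = list(noise_db) if selected is None else [n for n in selected if n in noise_db]
    used = []
    for name in names:                               # steps 504 and 512
        noise = noise_db[name]
        if not noise:
            continue                                 # nothing to combine with
        # Step 506: sample-wise sum, looping the noise if it is shorter.
        combined = [u + noise[i % len(noise)] for i, u in enumerate(utterance)]
        train(name, combined)                        # steps 508 and 510
        used.append(name)
    return used


training_log = {}
used = train_with_each_noise(
    [1.0, 2.0, 3.0],
    {"restaurant": [0.5], "car": [0.25, 0.5]},
    lambda name, audio: training_log.__setitem__(name, audio),
)
```

The user speaks once, and the loop ends when no stored noise sample remains.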
With reference to fig. 6, a further process performed by the electronic device 102 according to an embodiment will now be described. The process 600 begins at step 602, where the electronic device 102 prompts the user for an utterance. At step 604, the electronic device 102 plays a noise sample from the noise database 310 via the speaker 306.
The electronic device performs step 606 while performing step 604. At step 606, the electronic device 102 records the user's utterance along with the played noise sample. At step 608, the electronic device 102 stores the acoustically combined noise sample and utterance in volatile memory or in the noise database 310 and the utterance database 309. At step 610, the electronic device 102 trains the VR model database 308 using the combined noise samples and utterances. At step 612, the electronic device 102 updates the VR model database 308.
From the above it can be seen that a method and apparatus for training a voice recognition model database have been provided. In view of the many possible embodiments to which the principles of this discussion may be applied, it should be recognized that the embodiments described herein with respect to the drawing figures are meant to be illustrative only and should not be taken as limiting the scope of the claims. Accordingly, the techniques described herein contemplate all such embodiments as may fall within the scope of the following claims and equivalents thereof.

Claims (16)

1. A computer-implemented method, comprising:
receiving voice data corresponding to an utterance spoken in a particular noise environment, wherein receiving voice data comprises: prompting a user for a vocalization, playing a noise sample of a noise database, and recording the played noise sample with the vocalization;
for each of a plurality of noise environments different from the particular noise environment:
combining the voice data with stored noise data to generate noise-specific training audio data, the stored noise data being associated with that one of the plurality of noise environments; and
training a noise-specific speech recognition model based at least on the noise-specific training audio data; and
providing, for output, a respective noise-specific speech recognition model associated with each of the plurality of noise environments.
2. The method of claim 1, comprising:
receiving, from a user, data indicating a selection of the stored noise data, wherein the voice data is received from the user.
3. The method of claim 1, wherein the plurality of noise environments comprises:
noise associated with a home,
noise associated with an automobile,
noise associated with an office, or
noise associated with a restaurant.
4. The method of claim 1, comprising:
detecting a new noise type; and
storing new noise data associated with the new noise type.
5. The method of claim 1, comprising:
detecting a new noise type; and
in response to detecting the new noise type:
prompting the user to provide additional voice data; and
training a noise-specific speech recognition model based at least on the additional voice data.
6. The method of claim 1, comprising:
receiving additional voice data;
combining the additional speech data with the stored noise data to produce additional noise-specific training audio data; and
updating the noise-specific speech recognition model based on the additional noise-specific training audio data.
7. The method of claim 1, comprising:
receiving additional voice data from a user providing the voice data; and
after combining the speech data with stored noise data to produce noise-specific training audio data:
combining the additional speech data with the stored noise data to produce additional noise-specific training audio data; and
updating the noise-specific speech recognition model based on the additional noise-specific training audio data.
8. The method of claim 1, comprising:
storing the voice data in a database of voice data.
9. A computer-implemented system, comprising:
means for receiving voice data corresponding to an utterance spoken in a particular noise environment, wherein receiving voice data comprises: prompting a user for a vocalization, playing a noise sample of a noise database, and recording the played noise sample with the vocalization;
for each of a plurality of noise environments different from the particular noise environment:
means for combining the speech data with stored noise data to generate noise-specific training audio data, the stored noise data associated with the one of the plurality of noise environments; and
means for training a noise-specific speech recognition model based at least on the noise-specific training audio data; and
means for providing, for output, a respective noise-specific speech recognition model associated with each of the plurality of noise environments.
10. The computer-implemented system of claim 9, further comprising:
means for receiving data from a user indicating a selection of stored noise data, wherein the voice data is received from the user.
11. The computer-implemented system of claim 9, wherein the plurality of noise environments comprises:
noise associated with a home,
noise associated with an automobile,
noise associated with an office, or
noise associated with a restaurant.
12. The computer-implemented system of claim 9, further comprising:
means for detecting a new noise type; and
means for storing new noise data associated with the new noise type.
13. The computer-implemented system of claim 9, further comprising:
means for detecting a new noise type; and
in response to detecting the new noise type:
means for prompting a user to provide additional voice data; and
means for training a noise-specific speech recognition model based at least on the additional speech data.
14. The computer-implemented system of claim 9, further comprising:
means for receiving additional voice data;
means for combining the additional speech data with the stored noise data to produce additional noise-specific training audio data; and
means for updating the noise-specific speech recognition model based on the additional noise-specific training audio data.
15. The computer-implemented system of claim 9, further comprising:
means for receiving additional voice data from a user providing the voice data; and
after combining the speech data with stored noise data to produce noise-specific training audio data:
means for combining the additional speech data with the stored noise data to produce additional noise-specific training audio data; and
means for updating the noise-specific speech recognition model based on the additional noise-specific training audio data.
16. The computer-implemented system of claim 9, further comprising:
means for storing the voice data in a database of voice data.
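The per-environment loop recited in claims 1 and 9 can be illustrated with a minimal sketch. It is offered only as an aid to reading the claims and is not part of them. The assumptions are: audio is a list of float samples, combining is a sample-wise sum with the noise looped to the length of the speech, and "training" is replaced by a placeholder that records the training audio. All function and key names are hypothetical.

```python
def combine(speech, noise):
    """Sample-wise sum of speech and stored noise, looping the noise as needed."""
    if not noise:
        raise ValueError("noise must not be empty")
    return [s + noise[i % len(noise)] for i, s in enumerate(speech)]


def train_noise_specific_models(speech, stored_noise, recorded_environment):
    """Build one placeholder model per stored noise environment other than
    the environment in which the speech was recorded."""
    models = {}
    for environment, noise in stored_noise.items():
        if environment == recorded_environment:
            # The claims address environments different from the recorded one.
            continue
        training_audio = combine(speech, noise)
        # Placeholder for actual model training: keep the data the model
        # would be trained on, labeled with its noise environment.
        models[environment] = {
            "environment": environment,
            "training_audio": training_audio,
        }
    return models
```

The returned dictionary corresponds to providing a respective noise-specific model for each environment; a real system would replace the placeholder with an actual training routine.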
CN201480025758.9A 2013-03-12 2014-04-23 Method and apparatus for training a voice recognition model database Active CN105580071B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US201361819985P 2013-05-06 2013-05-06
US61/819,985 2013-05-06
US14/094,875 2013-12-03
US14/094,875 US9275638B2 (en) 2013-03-12 2013-12-03 Method and apparatus for training a voice recognition model database
PCT/US2014/035117 WO2014182453A2 (en) 2013-05-06 2014-04-23 Method and apparatus for training a voice recognition model database

Publications (2)

Publication Number Publication Date
CN105580071A CN105580071A (en) 2016-05-11
CN105580071B CN105580071B (en) 2020-08-21

Family

ID=51867838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480025758.9A Active CN105580071B (en) 2013-03-12 2014-04-23 Method and apparatus for training a voice recognition model database

Country Status (3)

Country Link
EP (1) EP2994907A2 (en)
CN (1) CN105580071B (en)
WO (1) WO2014182453A2 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109192216A (en) * 2018-08-08 2019-01-11 联智科技(天津)有限责任公司 A kind of Application on Voiceprint Recognition training dataset emulation acquisition methods and its acquisition device
CN109545195A (en) * 2018-12-29 2019-03-29 深圳市科迈爱康科技有限公司 Accompany robot and its control method
CN109545196A (en) * 2018-12-29 2019-03-29 深圳市科迈爱康科技有限公司 Audio recognition method, device and computer readable storage medium
CN110544469B (en) * 2019-09-04 2022-04-19 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
CN110808030B (en) * 2019-11-22 2021-01-22 珠海格力电器股份有限公司 Voice awakening method, system, storage medium and electronic equipment
CN111128141B (en) * 2019-12-31 2022-04-19 思必驰科技股份有限公司 Audio identification decoding method and device
CN113099353A (en) * 2021-04-21 2021-07-09 浙江吉利控股集团有限公司 Integrated microphone, safety belt, steering wheel and vehicle for vehicle

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1331467A (en) * 2000-06-28 2002-01-16 松下电器产业株式会社 Method and device for producing acoustics model
CN1451152A (en) * 2000-09-01 2003-10-22 捷装技术公司 Computur-implemented speech recognition system training
US20050071159A1 (en) * 2003-09-26 2005-03-31 Robert Boman Speech recognizer performance in car and home applications utilizing novel multiple microphone configurations
CN101023467A (en) * 2005-01-04 2007-08-22 三菱电机株式会社 Method for refining training data set for audio classifiers and method for classifying data
US20080300871A1 (en) * 2007-05-29 2008-12-04 At&T Corp. Method and apparatus for identifying acoustic background environments to enhance automatic speech recognition
CN102426837A (en) * 2011-12-30 2012-04-25 中国农业科学院农业信息研究所 Robustness method used for voice recognition on mobile equipment during agricultural field data acquisition
CN102903360A (en) * 2011-07-26 2013-01-30 财团法人工业技术研究院 Microphone-array-based speech recognition system and method
CN103069480A (en) * 2010-06-14 2013-04-24 谷歌公司 Speech and noise models for speech recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6876966B1 (en) * 2000-10-16 2005-04-05 Microsoft Corporation Pattern recognition training method and apparatus using inserted noise followed by noise reduction

Also Published As

Publication number Publication date
CN105580071A (en) 2016-05-11
WO2014182453A2 (en) 2014-11-13
EP2994907A2 (en) 2016-03-16
WO2014182453A3 (en) 2014-12-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant