CN105580071A - Method and apparatus for training a voice recognition model database - Google Patents



Publication number
CN105580071A
CN105580071A (application CN201480025758.9A)
Authority
CN
China
Prior art keywords
utterance, noise, recorded, voice, device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201480025758.9A
Other languages
Chinese (zh)
Other versions
CN105580071B (en)
Inventor
John R. Meloney
Joel A. Clark
Joseph C. Dwyer
Adrian Schuster
Snehitha Singaraju
Robert A. Zurek
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Technology Holding Co
Google Technology Holdings LLC
Original Assignee
Technology Holding Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US 61/819,985 (provisional US201361819985P), critical
Priority to US 14/094,875, granted as US9275638B2
Application filed by Technology Holding Co
Priority to PCT/US2014/035117, published as WO2014182453A2
Publication of CN105580071A publication Critical patent/CN105580071A/en
Application granted granted Critical
Publication of CN105580071B publication Critical patent/CN105580071B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training

Abstract

An electronic device (102) digitally combines a single voice input with each of a series of noise samples. Each noise sample is taken from a different audio environment (e.g., street noise, babble, interior car noise). The voice input / noise sample combinations are used to train a voice recognition model database (308) without the user (104) having to repeat the voice input in each of the different environments. In one variation, the electronic device (102) transmits the user's voice input to a server (301) that maintains and trains the voice recognition model database (308).

Description

Method and apparatus for training a voice recognition model database
Technical field
The disclosure relates to speech recognition and, more particularly, to methods and apparatus for training a voice recognition model database.
Background
Although speech recognition has been around for decades, the quality of speech recognition software and hardware has only recently reached a level high enough to attract a large number of consumers. One area in which speech recognition has become popular in recent years is the smartphone and tablet computer industry. Using a speech-recognition-enabled device, a consumer can perform tasks such as making calls, writing emails, and navigating with GPS using voice commands alone.
Speech recognition in such devices is far from perfect, however. A speech recognition engine typically relies on a phoneme or command database to be able to recognize voiced utterances. A user may need to "train" the phoneme or command database to recognize his or her speech characteristics: accent, frequently mispronounced words and syllables, tonal characteristics, cadence, and so on. Even after training, however, the phoneme or command database may not be accurate in all audio environments. For example, the presence of background noise can decrease speech recognition accuracy.
Brief description of the drawings
While the appended claims set forth the features of the present techniques with particularity, these techniques may be best understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a user speaking to an electronic device, which is depicted in the drawings as a mobile device.
FIG. 2 shows example components of the electronic device of FIG. 1.
FIG. 3 shows an architecture on which various embodiments may be implemented.
FIGS. 4-6 show steps that may be carried out according to embodiments of the disclosure.
Detailed description
The present disclosure sets forth methods and an apparatus for training a noise-based voice recognition model database. The term "noise-based voice recognition model database" ("VR model database" for short), as used herein, refers to a noise-based database that is used as a phoneme database, as a command database, or as both.
Various embodiments of the disclosure include manual and automated methods of training the VR model database. Manual embodiments of the disclosure include a direct training method, in which the electronic device (also referred to as the "device") coaches the user to perform operations, in response to which the device updates the VR model database. The device can carry out the manual training method during the initial setup of the device or at any time the user launches the process. For example, when the user is in a new type of noise environment, the user can initiate the manual method to train the VR model database for that noise, and the device can store the new noise in a noise database.
Automated embodiments include methods launched by the device without the user's knowledge. The device can launch an automated method based on environmental characteristics, such as when it senses a new type of noise, or in response to a user action. Examples of user actions that can launch an automated training method include the user initiating a speech recognition session via a button press, a gesture trigger, or a sound trigger. In such cases, the device uses the user's voice, along with the other noises it detects, to further train the VR model database. The device may also use the user's voice and the detected noise for the voice recognition process itself. In that case, if the device reacts positively to the voice recognition result (i.e., performs the action initiated by the voice recognition process, as opposed to the action being cancelled), the device launches the automated training process using the user's utterance from the voice recognition event and the result of that event as the training target.
According to various embodiments, in addition to live utterances and live noise, the device also uses previously recorded noise and previously recorded utterances (retrieved from a noise database and an utterance database, respectively) to train the VR model database. Like live noise and utterances, previously recorded utterances and noise can be obtained in different noise environments during different usage conditions of the device. Previously recorded utterances and noise can be stored in, and retrieved from, the utterance database and the noise database, respectively. Furthermore, the device can store live utterances and live noise in the utterance database and the noise database, respectively, for future use.
According to embodiments, the device can train the VR model database in various ways, any of which may be used for both the manual and the automated training methods depending on the circumstances. For example, three of these methods concern how combined speech and noise signals are captured to train the VR model database. The first of these methods is based on a combined signal of speech and natural noise captured by the device. The second is based on a combined signal of live speech captured in the field and noise produced by the device's acoustic output transducer. The third is based on a combined signal produced by the device digitally mixing speech and noise, either of which may be captured in the field or retrieved from memory. This last embodiment can use speech captured in a quiet environment mixed with a previously stored noise file, or captured noise mixed with a previously stored speech utterance.
In one embodiment, the electronic device digitally combines a single voice input with each of a series of noise samples. Each noise sample is taken from a different audio environment (e.g., street noise, babble, interior car noise). The voice input/noise sample combinations are used to train the VR model database without the user having to repeat the voice input in each of the different environments. In one variation, the electronic device transmits the user's voice input to a server that maintains and trains the VR model database.
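The "digitally combines" step can be pictured as waveform addition at a chosen signal-to-noise ratio. The sketch below is an illustration only: the 16 kHz mono NumPy arrays, the SNR-based gain, and the name `mix_at_snr` are all assumptions, since the patent does not specify a mixing formula.

```python
import numpy as np

def mix_at_snr(voice, noise, snr_db=10.0):
    """Digitally combine an utterance with a noise sample at a target SNR (hypothetical scheme)."""
    noise = np.resize(noise, voice.shape)          # loop or trim the noise to the utterance length
    voice_power = np.mean(voice ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(voice_power / (noise_power * 10 ** (snr_db / 10.0)))
    return voice + gain * noise

# One utterance combined with each of a series of noise environments:
rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)             # stand-in for a 1-second recording
noise_samples = {
    "street": rng.standard_normal(16000),
    "babble": rng.standard_normal(16000),
    "car_interior": rng.standard_normal(8000),     # shorter sample is looped by np.resize
}
training_inputs = {env: mix_at_snr(utterance, n) for env, n in noise_samples.items()}
```

Each entry of `training_inputs` is one voice-plus-noise combination; the user speaks once, and the loop supplies the different environments.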
According to an embodiment, the method is carried out by recording an utterance, digitally combining the recorded utterance with a previously recorded noise sample, and training the noise-based VR model database based on the digital combination. Using the same single utterance, these steps can be repeated for each previously recorded noise sample in a set of noise samples (e.g., the noise samples of the noise database), and can be repeated in this way before a different utterance is recorded. The process can be repeated in the future to continually improve speech recognition.
Alternatively, the electronic device can use a loudspeaker on the device to play back predetermined noises (clanging, automobile, babble) to produce a simulated noise environment, or to produce no playback at all (quiet). The user speaks both during playback and without playback. This allows the device to recognize how the user's speech characteristics change between quiet and noisy audio environments. The VR model database can be trained based on this information.
One embodiment involves receiving an utterance via a microphone of the electronic device and, while the utterance is being received, reproducing a previously recorded noise sample through a loudspeaker of the electronic device. The microphone picks up both the utterance and the previously recorded noise.
Another embodiment involves recording an utterance during a speech-to-text ("STT") command mode and determining whether the recorded utterance is an STT command. This determination can be based on whether a word recognition confidence value exceeds a threshold.
If the recorded utterance is recognized as an STT command, the electronic device performs a function based on the STT command. If the electronic device performs the correct function (i.e., the function associated with the command), the device trains the noise-based VR model database so that the utterance is associated with the command.
The method can also be repeated during the STT command mode with the same voice phrase recorded from the same person in combination with different noise environments. Examples of noise environments include home, automobile, street, office, and restaurant.
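The command-versus-dictation decision can be sketched as a confidence threshold applied to candidate commands. The threshold value, the score format, and the name `classify_utterance` below are hypothetical placeholders; the patent only says a "predetermined threshold" is used.

```python
CONFIDENCE_THRESHOLD = 0.75   # illustrative value only

def classify_utterance(candidate_scores):
    """Return the best candidate STT command, or None if the utterance is not a command.

    candidate_scores: dict mapping candidate command -> recognition confidence in [0, 1].
    """
    if not candidate_scores:
        return None
    best_command, best_score = max(candidate_scores.items(), key=lambda kv: kv[1])
    return best_command if best_score >= CONFIDENCE_THRESHOLD else None

# An utterance whose best candidate clears the threshold is treated as a command:
assert classify_utterance({"call home": 0.91, "play music": 0.40}) == "call home"
# A low-confidence utterance falls through to ordinary speech-to-text handling:
assert classify_utterance({"call home": 0.52}) is None
```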
Where this disclosure refers to modules and other elements "providing" information (data) to one another, it is to be understood that there are multiple possible ways in which such an action may be carried out, including electrical signals transmitted along conductive paths (e.g., wires) and method calls between objects.
The embodiments described herein are usable in the context of always-on audio (AOA). When using AOA, the electronic device is capable of waking up from a sleep mode upon receiving a trigger command from a user. AOA places additional demands on devices, especially mobile devices. AOA is most effective when the electronic device can recognize the user's voice commands accurately and quickly.
Referring to FIG. 1, a user 104 provides voice input (or vocalized information, or speech) 106, which is received by a speech-recognition-enabled electronic device ("device") 102 via a microphone (or other sound receiver) 108. The device 102, which in this example is a mobile device, includes a touch-screen display 110 that can display visual images and receive or sense touch-type inputs provided by a user's finger or another touch input device such as a stylus. Notwithstanding the presence of the touch-screen display 110, in the embodiment shown in FIG. 1 the device 102 also has a number of discrete keys or buttons 112 that serve as input devices. In other embodiments, however, such keys or buttons (or any particular number of them) need not be present, and the touch-screen display 110 can serve as the primary or only user input device.
Although FIG. 1 specifically shows the device 102 as including the touch-screen display 110 and the keys or buttons 112, these features are only intended as examples of components/features of the device 102; in other embodiments the device 102 need not include one or more of these features and/or can include other features in addition to or instead of them.
The device 102 is intended to be representative of a variety of hand-held or portable electronic devices including, for example, cellular telephones, personal digital assistants (PDAs), and smartphones. In alternative embodiments, the device can also be a headset (e.g., a Bluetooth headset), an MP3 player, a battery-powered device, a watch device (e.g., a wristwatch) or other wearable device, a radio, a navigation device, a laptop or notebook computer, a netbook, a pager, a PMP (personal media player), a DVR (digital video recorder), a gaming device, a camera, an e-reader, an e-book, a tablet device, a navigation device with a video-capable screen, a multimedia docking station, or another device.
Embodiments of the present disclosure are intended to be applicable to any of a variety of electronic devices that are capable of or configured to receive voice input or other sound input that is indicative or representative of vocalized information.
FIG. 2 shows internal components of the device 102 of FIG. 1, in accordance with an embodiment of the disclosure. As shown in FIG. 2, the device 102 includes one or more wireless transceivers 202, a processor 204 (e.g., a microprocessor, microcomputer, application-specific integrated circuit, digital signal processor, etc.), a memory 206, one or more output devices 208, and one or more input devices 210. The device 102 can further include a component interface 212 to provide a direct connection to auxiliary components or accessories for additional or enhanced functionality. The device 102 can also include a power supply 214, such as a battery, for providing power to the other internal components while enabling the mobile device to be portable. Further, the device 102 additionally includes one or more sensors 228. All of the components of the device 102 can be coupled to one another, and in communication with one another, by way of one or more internal communication links 232 (e.g., an internal bus).
Further, in the embodiment of FIG. 2, the wireless transceivers 202 particularly include a cellular transceiver 203 and a wireless local area network (WLAN) transceiver 205. More particularly, the cellular transceiver 203 is configured to conduct cellular communications, such as 3G, 4G, or 4G-LTE, vis-à-vis cell towers (not shown), although in other embodiments the cellular transceiver 203 can be configured to utilize any of a variety of other cellular-based communication technologies such as analog communications (using AMPS), digital communications (using CDMA, TDMA, GSM, iDEN, GPRS, EDGE, etc.), and/or next-generation communications (using UMTS, WCDMA, LTE, IEEE 802.16, etc.), or variants thereof.
By contrast, the WLAN transceiver 205 is configured to conduct communications with access points in accordance with the IEEE 802.11 (a, b, g, or n) standard. In other embodiments, the WLAN transceiver 205 can instead (or additionally) conduct other types of communications commonly understood as being encompassed within WLAN communications, such as some types of peer-to-peer (e.g., Wi-Fi peer-to-peer) communications. Further, in other embodiments, the Wi-Fi transceiver 205 can be replaced or supplemented with one or more other wireless transceivers configured for non-cellular wireless communications, including wireless transceivers employing other wireless communication technologies such as HomeRF (radio frequency), Home Node B (3G femtocell), Bluetooth, and/or ad hoc networking communication technologies such as infrared.
Although in the present embodiment the device 102 has two of the wireless transceivers 202 (that is, the transceivers 203 and 205), the present disclosure is intended to encompass numerous embodiments in which any arbitrary number of wireless transceivers employing any arbitrary number of communication technologies is present. By virtue of the wireless transceivers 202, the device 102 is capable of communicating with any of a variety of other devices or systems (not shown) including, for example, other mobile devices, web servers, cell towers, access points, and other remote devices. Depending upon the embodiment or circumstance, wireless communication between the device 102 and any arbitrary number of other devices or systems can be achieved.
Operation of the wireless transceivers 202 in conjunction with other internal components of the device 102 can take a variety of forms. For example, operation of the wireless transceivers 202 can proceed in a manner in which, upon reception of wireless signals, the internal components of the device 102 detect communication signals and the transceivers 202 demodulate the communication signals to recover incoming information, such as voice and/or data, transmitted by the wireless signals. After receiving the incoming information from the transceivers 202, the processor 204 formats the incoming information for the one or more output devices 208. Likewise, for transmission of wireless signals, the processor 204 formats outgoing information, which can but need not be activated by the input devices 210, and conveys the outgoing information to one or more of the wireless transceivers 202 for modulation into communication signals to be transmitted.
Depending upon the embodiment, the input and output devices 208 and 210 of the device 102 can include a variety of visual, audio, and/or mechanical outputs. For example, the output devices 208 can include one or more visual output devices 216 such as a liquid crystal display and/or light-emitting diode indicator, one or more audio output devices 218 such as a speaker, alarm, and/or buzzer, and/or one or more mechanical output devices 220 such as a vibrating mechanism. The visual output devices 216 can also include, among other things, a video screen. Likewise, by example, the input devices 210 can include one or more visual input devices 222 such as an optical sensor (for example, a camera lens and photosensor), one or more audio input devices 224 such as the microphone 108 of FIG. 1 (or, further for example, a microphone of a Bluetooth headset), and/or one or more mechanical input devices 226 such as a flip sensor, keyboard, keypad, selection button, navigation cluster, touch pad, capacitive sensor, motion sensor, and/or switch. Operations that can actuate one or more of the input devices 210 can include not only the physical pressing/actuation of buttons or other actuators, but can also include, for example, opening the mobile device, unlocking the device, moving the device to actuate a motion, moving the device to actuate a location positioning system, and operating the device.
As mentioned above, the device 102 can also include one or more of various types of sensors 228, as well as a sensor hub to manage one or more functions of the sensors. The sensors 228 can include, for example, proximity sensors (e.g., a light-detecting sensor, an ultrasound transceiver, or an infrared transceiver), touch sensors, altitude sensors, and one or more location circuits/components that can include, for example, a Global Positioning System (GPS) receiver, a triangulation receiver, an accelerometer, a tilt sensor, a gyroscope, or any other information-collecting device that can identify a current location or user-device interface (carry mode) of the device 102. Although the sensors 228 are considered distinct from the input devices 210 for the purposes of FIG. 2, in other embodiments one or more of the input devices can also be considered to constitute one or more of the sensors (and vice versa). Additionally, although the input devices 210 are shown as distinct from the output devices 208 in the present embodiment, it should be recognized that in some embodiments one or more devices serve as both input device(s) and output device(s). In particular, in the present embodiment in which the device 102 includes the touch-screen display 110, the touch-screen display can be considered to constitute both a visual output device and a mechanical input device (by contrast, the keys or buttons 112 are merely mechanical input devices).
The memory 206 can encompass one or more memory devices of any of a variety of forms (e.g., read-only memory, random access memory, static random access memory, dynamic random access memory, etc.) and can be used by the processor 204 to store and retrieve data. In some embodiments, the memory 206 can be integrated with the processor 204 in a single device (e.g., a processing device including memory, or a processor-in-memory (PIM)), albeit such a single device will still typically have distinct portions/sections that perform the different processing and memory functions and can be considered separate devices. In some alternative embodiments, the memory 206 of the device 102 can be supplemented or replaced by other memory located elsewhere, apart from the device 102, and in such embodiments the device 102 can communicate with or access such other memory devices by way of any of various communication techniques (e.g., wireless communications afforded by the wireless transceivers 202, or connections via the component interface 212).
The data stored by the memory 206 can include, but need not be limited to, operating systems, programs (applications), modules, and informational data. Each operating system includes executable code that controls basic functions of the device 102, such as interaction among the various components included among the internal components of the device 102, communication with external devices via the wireless transceivers 202 and/or the component interface 212, and storage and retrieval of programs and data to and from the memory 206. As for programs, each program includes executable code that utilizes an operating system to provide more specific functionality, such as file system service and handling of protected and unprotected data stored in the memory 206. Among other things, such programs can include programming that enables the device 102 to perform processes such as the speech recognition process shown in FIG. 3 and discussed further below. Finally, with respect to informational data, this is non-executable code or information that can be referenced and/or manipulated by an operating system or program for performing functions of the device 102.
Referring to FIG. 3, a configuration of the electronic device 102 according to an embodiment is now described. Stored in the memory 206 of the electronic device 102 are a VR model database 308, an utterance database 309, and a noise database 310, all of which are accessible to the processor 204, the audio input device 224 (e.g., a microphone), and the audio output device 218 (e.g., a speaker). The VR model database 308 includes data for associating sounds with phonemes or commands or both. The utterance database 309 includes samples of the user's speech, or speech utterances recorded by the user. The noise database 310 includes digital samples of noise recorded from different environments, generated artificially, or both.
The device 102 can have access to a network such as the Internet. Although the figure shows direct couplings of components such as the audio input device 224 and the audio output device 218, these may be connected to the processor 204 through other components or circuitry in the device. Additionally, utterances and noise captured by the device 102 can be stored temporarily in the memory 206, or more persistently in the utterance database 309 and the noise database 310, respectively. Whether stored temporarily or not, the utterances and noise can subsequently be accessed by the processor 204. The processor 204 may also reside external to the electronic device 102, such as on a server on the Internet.
The processor 204 executes a speech recognition engine 305, which may reside in the memory 206 and which has access to the noise database 310, the utterance database 309, and the VR model database 308. In one embodiment, one or more of the noise database 310, the utterance database 309, the VR model database 308, and the speech recognition engine 305 are stored on and executed by a remotely located server 301.
Referring to FIG. 4, a process carried out by the electronic device 102 (FIG. 3) according to an embodiment is now described. The process 400 shown in FIG. 4 is a passive training system, which updates and improves the VR model database 308 in a manner that is transparent to the user, since it requires no deliberate interaction with the user in order to grow the models. The process 400 begins with the electronic device 102 in an STT command session, during which the speech recognition engine 305 is in a mode in which it interprets utterances as commands rather than as words to be converted into text.
At step 402, the electronic device 102 records an utterance that includes the user's speech together with natural background noise. The recorded utterance and noise can be stored in the utterance database 309 and the noise database 310 for future use. At step 404, the speech recognition engine determines whether the utterance is an STT command. In doing so, the speech recognition engine 305 determines the most likely candidate STT command given the utterance. The speech recognition engine 305 assigns a confidence score to the candidate and, if the confidence score is above a predetermined threshold, deems the utterance to be an STT command. Among the factors influencing the confidence score is the method that was used when the training was performed. If the utterance is determined not to be an STT command, the process returns to step 402. If it is determined to be an STT command, the electronic device 102 performs a function based on the STT command at step 406.
At step 408, the electronic device 102 determines whether the performed function was a valid operation. If so, then at step 410 the electronic device 102 trains the VR model database 308, for example by associating the user's utterance with the command. This process, performed during normal operation, allows the electronic device 102 to update the original VR model database 308 to reflect actual use in multiple environments, which naturally include the noise inherent to those environments. The device 102 can also use previously recorded utterances from the utterance database 309 and previously recorded noise from the noise database 310 during this training process.
In an alternative embodiment, a "no" response from the user during step 408 causes the device 102 to ask the user to type in, at step 411, the text of the command they intended to be performed. The text and the utterance captured at step 402 are thereafter used to train and update the VR model database 308.
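Steps 402 through 411 of process 400 form a single decision pass. A minimal sketch of that control flow follows, using injected hypothetical stand-in functions (none of these names come from the patent); only the branching structure mirrors FIG. 4.

```python
def passive_training_pass(record, recognize, perform, user_confirms_valid, train, ask_user_for_text):
    """One pass of the passive training loop of FIG. 4 (all callables are illustrative stand-ins)."""
    utterance = record()                                   # step 402
    command, confident = recognize(utterance)              # step 404: candidate + threshold check
    if not confident:
        return "not_a_command"
    perform(command)                                       # step 406
    if user_confirms_valid():                              # step 408
        train(utterance, command)                          # step 410
        return "trained"
    intended_text = ask_user_for_text()                    # step 411 (alternative embodiment)
    train(utterance, intended_text)
    return "corrected"

# Simulated run in which recognition succeeds and the user confirms the function was valid:
log = []
result = passive_training_pass(
    record=lambda: "audio",
    recognize=lambda u: ("call home", True),
    perform=lambda c: log.append(("perform", c)),
    user_confirms_valid=lambda: True,
    train=lambda u, c: log.append(("train", u, c)),
    ask_user_for_text=lambda: "call home",
)
```

Injecting the callables keeps the sketch self-contained while making clear that recording, recognition, and training are device-specific details the patent leaves open.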
Referring to FIG. 5, another process carried out by the electronic device 102 is now described. The process 500 is one in which the user deliberately interacts with the electronic device 102. The process 500 begins at step 502, at which the electronic device 102 records an utterance, for example by converting the utterance into digital data and storing it as a digital file. The storage location can be volatile memory or more persistent memory (such as the utterance database 309). At step 504, the electronic device 102 retrieves the data of a noise sample (e.g., restaurant noise) from the noise database 310. The electronic device 102 may select the noise sample (e.g., by cycling through some or all of the previously recorded noise samples), or the user may select it. At step 506, the electronic device 102 digitally combines the noise sample and the utterance. At step 508, the electronic device 102 trains the VR model database 308 using the combined noise sample and utterance. At step 510, the electronic device 102 updates the VR model database 308. At step 512, the electronic device 102 determines whether there are any more noise samples with which to train the VR model database 308. If not, the process ends. If there are, the process returns to step 504, at which the electronic device 102 retrieves another noise sample from the noise database 310.
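The loop of steps 502 through 512 amounts to iterating one recorded utterance over every sample in the noise database. A minimal sketch, assuming hypothetical `combine` and `train` callables and a dict-shaped noise database:

```python
def deliberate_training(utterance, noise_database, combine, train):
    """Process 500: combine one recorded utterance with every stored noise sample (illustrative)."""
    updates = 0
    for env_name, noise in noise_database.items():     # steps 504/512: next sample, or stop
        combined = combine(utterance, noise)           # step 506: digital combination
        train(env_name, combined)                      # steps 508/510: train and update the models
        updates += 1
    return updates

# The user speaks once; the loop supplies restaurant, street, and car noise:
trained_envs = []
n = deliberate_training(
    utterance="hello-device",
    noise_database={"restaurant": "n1", "street": "n2", "car": "n3"},
    combine=lambda u, noise: (u, noise),
    train=lambda env, c: trained_envs.append(env),
)
```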
Referring to FIG. 6, yet another process carried out by the electronic device 102 according to an embodiment is now described. The process 600 begins at step 602, at which the electronic device 102 prompts the user for an utterance. At step 604, the electronic device 102 plays a noise sample from the noise database 310 via the loudspeaker 306.
The electronic device performs step 606 while performing step 604. At step 606, the electronic device 102 records the user's utterance together with the played-back noise sample. At step 608, the electronic device 102 stores the acoustically combined noise sample and utterance in volatile memory or in the noise database 310 and the utterance database 309. At step 610, the electronic device 102 trains the VR model database 308 using the combined noise sample and utterance. At step 612, the electronic device 102 updates the VR model database 308.
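Unlike process 500, process 600 combines the signals acoustically: the loudspeaker plays noise while the microphone records, so a single capture contains both. A real implementation would need full-duplex audio I/O; the stub below only illustrates the ordering of steps 602 through 612, with every callable a hypothetical injected stand-in.

```python
def acoustic_training(prompt, play_and_record, store, train, noise_sample):
    """Process 600: record the user while reproducing stored noise (illustrative stub)."""
    prompt()                                        # step 602: ask the user to speak
    capture = play_and_record(noise_sample)         # steps 604/606 happen simultaneously
    store(capture)                                  # step 608: persist the combined capture
    train(capture)                                  # steps 610/612: train and update the models
    return capture

# Simulated run recording the event order:
events = []
capture = acoustic_training(
    prompt=lambda: events.append("prompt"),
    play_and_record=lambda noise: events.append("playrec") or f"utterance+{noise}",
    store=lambda c: events.append("store"),
    train=lambda c: events.append("train"),
    noise_sample="restaurant-noise",
)
```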
From the foregoing it can be seen that a method and apparatus for training a voice recognition model database have been provided. In view of the many possible embodiments to which the principles of this discussion may be applied, it should be recognized that the embodiments described herein with respect to the drawings are meant to be illustrative only and should not be taken as limiting the scope of the claims. Therefore, the techniques described herein contemplate all such embodiments as may come within the scope of the following claims and equivalents thereof.

Claims (18)

1. A method comprising:
recording an utterance;
digitally combining the recorded utterance with one of a plurality of previously recorded noise samples;
training a voice recognition model database (308) based on the digital combination; and
repeating the combining and training steps for each remaining one of the plurality of previously recorded noise samples, using the same recorded utterance.
2. The method according to claim 1, further comprising:
recording a second utterance;
digitally combining the recorded second utterance with one of the plurality of previously recorded noise samples; and
updating the voice recognition model database (308) based on the digital combination of the recorded second utterance and the previously recorded noise sample.
3. The method according to claim 1,
wherein the utterance is received from a user (104), and
wherein the combining and training steps are repeated for each of the plurality of previously recorded noise samples before a different utterance from the user (104) is recorded.
4. A method comprising:
receiving an utterance via one or more microphones of an electronic device (102);
during the receiving step, reproducing a previously recorded noise sample through a loudspeaker of the electronic device (102),
wherein the reproduction is audible to the one or more microphones of the electronic device (102); and
training a voice recognition model database (308) based on the acoustically combined noise sample and utterance.
5. The method according to claim 1, further comprising: storing the recorded utterance in an utterance database.
6. The method according to claim 1, further comprising:
detecting a new noise; and
storing the new noise in a noise database.
7. The method according to claim 1, further comprising:
detecting a new noise; and
in response to detecting the new noise:
prompting the user (104) for further training; and
repeating the training step using the new noise as the noise sample.
8. A method comprising:
recording an utterance during a voice-to-text command mode;
determining whether the recorded utterance is recognized as a voice-to-text command;
if the recorded utterance is determined to be a voice-to-text command, performing a function based on the voice-to-text command;
determining whether the performed function is a valid function; and
if the voice-to-text command results in a valid function, training a voice recognition model database (308) based on the recorded utterance and the voice-to-text command.
9. The method according to claim 8, wherein the method is performed multiple times during the voice-to-text command mode for the same voice phrase recorded from the same person combined with different noise environments.
10. The method according to claim 8, wherein the different noise environments are chosen from the group comprising: home, automobile, street, office, and restaurant.
11. The method according to claim 8, wherein the recorded utterance is determined to be a voice-to-text command when a confidence value of word recognition exceeds a threshold.
12. The method according to claim 8, further comprising: digitally combining the recorded utterance with a selected previously recorded noise sample, wherein the training is based on the digitally combined recorded utterance and selected previously recorded noise sample.
13. The method according to claim 8, wherein the training step comprises: associating a score based on the method used in performing the training.
14. An electronic device (102) comprising:
a memory (206); and
a processor (204) electrically coupled with the memory (206), wherein the processor (204):
records an utterance during a voice-to-text command mode;
determines whether the recorded utterance is recognized as a voice-to-text command;
if the recorded utterance is determined to be a voice-to-text command, performs a function based on the voice-to-text command;
determines whether the performed function is a valid function; and
if the performed function is determined to be a valid function, trains a voice recognition model database (308) based on the recorded utterance and the voice-to-text command.
15. The electronic device (102) according to claim 14, wherein, for the same voice phrase recorded from the same person combined with different noise environments, the processor (204) trains the voice recognition model database (308) multiple times based on utterances recorded at different times during the voice-to-text command mode.
16. The electronic device (102) according to claim 14, wherein the recorded utterance is recognized as a voice-to-text command when a confidence value of word recognition exceeds a threshold.
17. The electronic device (102) according to claim 14, wherein the processor (204) digitally combines the recorded utterance with a selected previously recorded noise sample, and trains the voice recognition model database (308) based on the digitally combined recorded utterance and selected previously recorded noise sample.
18. The electronic device (102) according to claim 14, wherein the processor (204) plays a selected previously recorded noise sample through a loudspeaker audible to a microphone via which the utterance is recorded, records the utterance together with the played noise sample, and trains the voice recognition model database (308) based on the recorded utterance and the played noise sample.
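Claims 8, 11, and 14 describe training gated on recognized commands: an utterance captured in command mode updates the model only when recognition confidence exceeds a threshold and the resulting function proves valid. A minimal sketch of that gating logic, in which the recognizer, the threshold value, and the training store are hypothetical stand-ins for the device's actual VR engine:

```python
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.8  # assumed value; the claims only require "a threshold"

@dataclass
class CommandModeTrainer:
    """Sketch of the voice-to-text gated training of claims 8 and 14."""
    training_examples: list = field(default_factory=list)

    def on_utterance(self, utterance, recognize, execute):
        command, confidence = recognize(utterance)
        # Claim 11: treat the utterance as a command only above the threshold.
        if confidence < CONFIDENCE_THRESHOLD:
            return False
        # Claim 8: perform the function; keep the example only if it was valid.
        valid = execute(command)
        if valid:
            self.training_examples.append((utterance, command))
        return valid

trainer = CommandModeTrainer()
# Hypothetical recognizer/executor standing in for the device's VR engine.
recognize = lambda u: ("call home", 0.93) if u == "call home" else ("?", 0.2)
execute = lambda cmd: cmd == "call home"

trainer.on_utterance("call home", recognize, execute)  # accepted and trained
trainer.on_utterance("mumble", recognize, execute)     # rejected: low confidence
```

The double gate (confidence, then a valid executed function) means only utterances the user implicitly confirmed by accepting the result ever reach the training set.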
CN201480025758.9A 2013-03-12 2014-04-23 Method and apparatus for training a voice recognition model database Active CN105580071B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US201361819985P true 2013-05-06 2013-05-06
US61/819,985 2013-05-06
US14/094,875 US9275638B2 (en) 2013-03-12 2013-12-03 Method and apparatus for training a voice recognition model database
US14/094,875 2013-12-03
PCT/US2014/035117 WO2014182453A2 (en) 2013-05-06 2014-04-23 Method and apparatus for training a voice recognition model database

Publications (2)

Publication Number Publication Date
CN105580071A true CN105580071A (en) 2016-05-11
CN105580071B CN105580071B (en) 2020-08-21

Family

ID=51867838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480025758.9A Active CN105580071B (en) 2013-03-12 2014-04-23 Method and apparatus for training a voice recognition model database

Country Status (3)

Country Link
EP (1) EP2994907A2 (en)
CN (1) CN105580071B (en)
WO (1) WO2014182453A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113099353A (en) * 2021-04-21 2021-07-09 浙江吉利控股集团有限公司 Integrated microphone, safety belt, steering wheel and vehicle for vehicle

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1331467A * 2000-06-28 2002-01-16 松下电器产业株式会社 Method and device for producing acoustic model
CN1451152A * 2000-09-01 2003-10-22 捷装技术公司 Computer-implemented speech recognition system training
US20050071159A1 (en) * 2003-09-26 2005-03-31 Robert Boman Speech recognizer performance in car and home applications utilizing novel multiple microphone configurations
CN101023467A (en) * 2005-01-04 2007-08-22 三菱电机株式会社 Method for refining training data set for audio classifiers and method for classifying data
US20080300871A1 (en) * 2007-05-29 2008-12-04 At&T Corp. Method and apparatus for identifying acoustic background environments to enhance automatic speech recognition
CN102426837A (en) * 2011-12-30 2012-04-25 中国农业科学院农业信息研究所 Robustness method used for voice recognition on mobile equipment during agricultural field data acquisition
CN102903360A (en) * 2011-07-26 2013-01-30 财团法人工业技术研究院 Microphone-array-based speech recognition system and method
CN103069480A (en) * 2010-06-14 2013-04-24 谷歌公司 Speech and noise models for speech recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6876966B1 (en) * 2000-10-16 2005-04-05 Microsoft Corporation Pattern recognition training method and apparatus using inserted noise followed by noise reduction

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109192216A * 2018-08-08 2019-01-11 联智科技(天津)有限责任公司 Voiceprint recognition training dataset simulation acquisition method and acquisition device
CN109545196A (en) * 2018-12-29 2019-03-29 深圳市科迈爱康科技有限公司 Audio recognition method, device and computer readable storage medium
CN109545195A * 2018-12-29 2019-03-29 深圳市科迈爱康科技有限公司 Companion robot and control method thereof
CN110544469A (en) * 2019-09-04 2019-12-06 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
CN110544469B (en) * 2019-09-04 2022-04-19 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
CN110808030A (en) * 2019-11-22 2020-02-18 珠海格力电器股份有限公司 Voice awakening method, system, storage medium and electronic equipment
CN111128141A (en) * 2019-12-31 2020-05-08 苏州思必驰信息科技有限公司 Audio identification decoding method and device
CN111128141B (en) * 2019-12-31 2022-04-19 思必驰科技股份有限公司 Audio identification decoding method and device

Also Published As

Publication number Publication date
EP2994907A2 (en) 2016-03-16
WO2014182453A3 (en) 2014-12-31
CN105580071B (en) 2020-08-21
WO2014182453A2 (en) 2014-11-13

Similar Documents

Publication Publication Date Title
US9275638B2 (en) Method and apparatus for training a voice recognition model database
CN105580071A (en) Method and apparatus for training a voice recognition model database
US20200380961A1 (en) Method and Apparatus for Evaluating Trigger Phrase Enrollment
CN106652996B (en) Prompt tone generation method and device and mobile terminal
CN104969289B (en) Voice trigger of digital assistant
CN107948801B (en) Earphone control method and earphone
CN103959201B (en) Ultrasound-based mobile receivers in idle mode
CN106575230A (en) Semantic framework for variable haptic output
CN105874732B (en) Method and apparatus for identifying a piece of music in an audio stream
CN106971723A (en) Speech processing method and device, and device for speech processing
CN103440862A (en) Method, device and equipment for synthesizing voice and music
US20140278392A1 (en) Method and Apparatus for Pre-Processing Audio Signals
CN104536978A (en) Voice data identifying method and device
CN106210266B (en) Audio signal processing method and audio signal processing device
CN108538320A (en) Recording control method and device, readable storage medium, and terminal
CN110740262A (en) Background music adding method and device and electronic equipment
CN102664005A (en) Voice recognition prompter
US20140044307A1 (en) Sensor input recording and translation into human linguistic form
CN110830368A (en) Instant messaging message sending method and electronic equipment
CN105808716B (en) Alarm clock prompting method, device and terminal
CN110415703A (en) Voice memos information processing method and device
CN107229629A (en) Audio recognition method and device
CN111739530A (en) Interaction method and device, earphone and earphone storage device
CN111739529A (en) Interaction method and device, earphone and server
CN105632489A (en) Voice playing method and voice playing device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant