CN109923608A - System and method for ranking hybrid speech recognition results using a neural network - Google Patents


Info

Publication number
CN109923608A
CN109923608A (application CN201780070915.1A)
Authority
CN
China
Prior art keywords
speech recognition
recognition result
neural network
feature vector
controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201780070915.1A
Other languages
Chinese (zh)
Other versions
CN109923608B (en)
Inventor
Z. Zhou
R. Botros
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Robert Bosch GmbH
Original Assignee
Robert Bosch GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch GmbH filed Critical Robert Bosch GmbH
Publication of CN109923608A publication Critical patent/CN109923608A/en
Application granted granted Critical
Publication of CN109923608B publication Critical patent/CN109923608B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING › G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 Training (under G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16 Speech classification or search using artificial neural networks (under G10L 15/08 Speech classification or search)
    • G10L 15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems (under G10L 15/28 Constructional details of speech recognition systems)
    • G10L 2015/088 Word spotting (under G10L 15/08 Speech classification or search)
    • G10L 2015/223 Execution procedure of a spoken command (under G10L 15/22)


Abstract

A method for ranking candidate speech recognition results includes generating, with a controller, a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result and including one or more of trigger-pair features, confidence-score features, and word-level features. The method further includes providing the plurality of feature vectors as inputs to a neural network, generating a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network, and operating an automated system using, as input, the candidate speech recognition result that corresponds to the highest ranking score among the plurality of ranking scores.

Description

System and method for ranking hybrid speech recognition results using a neural network
Technical field
This disclosure relates generally to the field of automated speech recognition and, more specifically, to systems and methods that improve the operation of speech recognition systems that employ multiple speech recognition engines.
Background
Automated speech recognition is an important technology for implementing human-machine interfaces (HMIs) in a wide range of applications. In particular, speech recognition is useful in situations in which a human user needs to concentrate on performing a task and the use of traditional input devices such as a mouse and keyboard would be inconvenient or impractical. For example, in-vehicle "infotainment" systems, home automation systems, and many applications on small electronic mobile devices such as smartphones, tablets, and wearable computers can use speech recognition to receive voice commands and other input from users.
Most prior-art speech recognition systems use a trained speech recognition engine to convert recorded spoken input from a user into digital data suitable for processing in a computerized system. Various speech engines known to the art perform natural language understanding techniques to recognize the words spoken by the user and to extract semantic meaning from those words in order to control the operation of the computerized system.
In some situations, a single speech recognition engine is not necessarily optimal for recognizing speech from a user while the user performs different tasks. Prior-art solutions attempt to combine multiple speech recognition systems to improve recognition accuracy, including selecting low-level outputs from the acoustic models of different speech recognition models, or selecting entire output sets from different speech recognition engines based on a predetermined ranking process. However, low-level combinations of the outputs of multiple speech recognition systems do not preserve high-level linguistic information. In other embodiments, multiple speech recognition engines each produce a complete speech recognition result, but the process of determining which speech recognition result to select from the outputs of the multiple engines remains challenging. Consequently, improvements to speech recognition systems that increase the accuracy of selecting a speech recognition result from a set of candidate results produced by multiple speech recognition engines would be beneficial.
Summary of the invention
In one embodiment, a method for performing speech recognition in an automated system has been developed. The method includes generating, with a controller, a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in a plurality of candidate speech recognition results. The generation of a first feature vector for a first candidate speech recognition result in the plurality of candidate speech recognition results further includes identifying, with the controller and with reference to a plurality of predetermined trigger pairs stored in a memory, at least one trigger pair in the first candidate speech recognition result, the at least one trigger pair including two predetermined trigger words, and generating, with the controller, the first feature vector including an element for the at least one trigger pair. The method further includes providing, with the controller, the plurality of feature vectors as inputs to a neural network, generating, with the controller, a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network, and operating, with the controller, the automated system using as input the candidate speech recognition result that corresponds to the highest ranking score among the plurality of ranking scores.
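The scoring-and-selection step of this embodiment can be sketched as follows. The single-hidden-layer network shape, ReLU activation, and function name are illustrative assumptions, not the patent's specified architecture:

```python
import numpy as np

def rank_candidates(feature_vectors, weights_hidden, bias_hidden, weights_out, bias_out):
    """Score each candidate's feature vector with a small feed-forward
    network and return the index of the highest-scoring candidate
    together with all ranking scores."""
    scores = []
    for x in feature_vectors:
        hidden = np.maximum(0.0, weights_hidden @ x + bias_hidden)  # ReLU hidden layer
        scores.append(float(weights_out @ hidden + bias_out))       # scalar ranking score
    return int(np.argmax(scores)), scores
```

The automated system would then use the candidate at the returned index as its input, as the method describes.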
In another embodiment, a method for training a neural network ranker that generates ranking scores for different candidate speech recognition results in an automated speech recognition system has been developed. The method includes generating, with a processor, a plurality of feature vectors, each feature vector corresponding to one training speech recognition result in a plurality of training speech recognition results stored in a memory. The generation of a first feature vector for a first training speech recognition result in the plurality of training speech recognition results further includes identifying, with the processor and with reference to a plurality of predetermined trigger pairs stored in the memory, at least one trigger pair in the first training speech recognition result, the at least one trigger pair including two predetermined trigger words, and generating, with the processor, the first feature vector including an element for the at least one trigger pair. The method further includes performing, with the processor, a training process for the neural network ranker using the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker, a plurality of output scores generated by the neural network ranker during the training process, and a plurality of target results based on predetermined edit distances between each training speech recognition result and the predetermined correct input for that training speech recognition result. After completion of the training process, the method includes storing, with the processor, the trained neural network ranker in the memory for use in generating ranking scores for additional feature vectors corresponding to speech recognition results that are not present in the plurality of training speech recognition results.
In another embodiment, an automated speech recognition system has been developed. The system includes a memory and a controller operatively connected to the memory. The memory is configured to store a plurality of predetermined trigger pairs, each trigger pair including two words, and a neural network configured to generate ranking scores corresponding to a plurality of candidate speech recognition results. The controller is configured to generate a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in the plurality of candidate speech recognition results, including a first feature vector for a first candidate speech recognition result. The controller is further configured to identify, with reference to the plurality of predetermined trigger pairs stored in the memory, at least one trigger pair in the first candidate speech recognition result, the at least one trigger pair including two predetermined trigger words, and to generate the first feature vector including an element for the at least one trigger pair. The controller is further configured to provide the plurality of feature vectors as inputs to the neural network, to generate a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network, and to operate the automated system using as input the candidate speech recognition result that corresponds to the highest ranking score among the plurality of ranking scores.
Description of the drawings
Fig. 1 is a schematic view of the components of an automated system that receives spoken input commands from a user, as embodied in an in-vehicle information system in the passenger compartment of a vehicle.
Fig. 2 is a block diagram of a process for generating ranking scores for multiple candidate speech recognition results using a neural network ranker during a speech recognition process.
Fig. 3 is a schematic view of a computing system that performs a training process to generate the trained neural network ranker of Fig. 1 and Fig. 2.
Fig. 4 is a block diagram of a process for generating a trained neural network ranker.
Fig. 5 is a diagram depicting the structure of a feature vector generated from a speech recognition result and the structure of the neural network ranker.
Detailed description
For the purpose of promoting an understanding of the principles of the embodiments described herein, reference is now made to the drawings and to the descriptions in the following written specification. No limitation to the scope of the subject matter is intended by the references. This disclosure also includes any alterations and modifications to the illustrated embodiments and includes further applications of the principles of the disclosed embodiments, as would normally occur to one skilled in the art to which this disclosure pertains.
As used herein, the term "speech recognition engine" refers to a data model and executable program code that enable a computerized system to identify spoken words from an operator based on recorded audio input data of the spoken words received via a microphone or other audio input device. Speech recognition systems typically include a lower-level acoustic model that recognizes the individual sounds of human speech in a sound recording and a higher-level language model that recognizes words and sentences based on the sequences of sounds from the acoustic model for a predetermined language. Speech recognition engines known to the art typically implement one or more statistical models, such as a hidden Markov model (HMM), a support vector machine (SVM), a trained neural network, or another statistical model that generates statistical predictions for recorded human speech using a plurality of trained parameters applied to feature vectors that correspond to the human speech. The speech recognition engine generates the feature vectors using, for example, various signal processing techniques known to the art that extract properties ("features") of the recorded speech signal and organize the features into one-dimensional or multi-dimensional vectors that can be processed with the statistical model to identify various parts of speech, including individual words and sentences. Speech recognition engines can produce results for speech inputs corresponding to individual spoken phonemes as well as more complex patterns of sound, including spoken words and sentences that include sequences of related words.
As used herein, the term "speech recognition result" refers to a machine-readable output that the speech recognition engine generates for a given input. The result can be, for example, text encoded in a machine-readable format or another set of encoded data that serves as input to control the operation of an automated system. Due to the statistical nature of speech recognition engines, in some configurations the speech engine generates multiple potential speech recognition results for a single input. The speech engine also generates a "confidence score" for each speech recognition result, where the confidence score is a statistical estimate of the likelihood that the speech recognition result is accurate, based on the trained statistical model of the speech recognition engine. As described in more detail below, a hybrid speech recognition system uses the speech recognition results produced by multiple speech recognition engines, generates additional hybrid speech recognition results, and ultimately generates at least one output speech recognition result based on the multiple previously generated speech recognition results. As used herein, the term "candidate speech recognition result," or more simply "candidate result," refers to a speech recognition result that is a candidate to be the final speech recognition result from the hybrid speech recognition system, which generates multiple candidate results and selects only a subset (typically one) of the results as the final speech recognition result. In various embodiments, the candidate speech recognition results include speech recognition results from both general-purpose and domain-specific speech recognition engines, as well as hybrid speech recognition results that the system 100 generates using words from multiple candidate speech recognition results.
As used herein, the term "general-purpose speech recognition engine" refers to a type of speech recognition engine that is trained to recognize a broad range of speech from a natural human language such as English or Chinese. General-purpose speech recognition engines generate speech recognition results based on a broad vocabulary of words and a language model that is trained to cover language patterns in the natural language broadly. As used herein, the term "domain-specific speech recognition engine" refers to a type of speech recognition engine that is trained to recognize speech inputs in a particular field of use, or "domain," which often includes a somewhat different vocabulary and potentially different expected grammatical structures than the broader natural language. The vocabulary for a specific domain typically includes certain terms from the broader natural language but may include a narrower overall vocabulary, and in some instances includes specialized terms that are not officially recognized as words in the natural language but are well known in the particular domain. For example, in a navigation application, a domain-specific speech recognition engine may recognize terms for roads, towns, or other geographic designations that are not typically recognized as proper names in the broader language. In other configurations, a particular domain uses a specific set of jargon that is useful within the domain but may not be well recognized in the broader language. For example, aviators officially use English as a language of communication, but also use numerous domain-specific jargon words and other abbreviations that are not part of standard English.
As used herein, the term "trigger pair" refers to two items, each of which can be a word (e.g., "play") or a predetermined class (e.g., <song name>) that represents a sequence of words falling within the predetermined class (e.g., "Poker Face"), such as the proper name of a song, a person, a location, and the like. The words of a trigger pair, when they appear in a specific order within the text content of a speech recognition result, form a trigger pair A → B in which there is a high level of correlation between the occurrence of the earlier word A and the occurrence of the later word B in the audio input data. As described in more detail below, after a set of trigger pairs has been identified via a training process, the trigger word pairs found in the text of the candidate speech recognition results form one portion of the feature vector for each candidate result, and the ranking process uses that portion to rank the different candidate speech recognition results.
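A minimal sketch of detecting the trigger pairs just defined in a candidate result follows. The in-order matching rule (A anywhere before B, with any number of intervening words) is taken from this paragraph; the function name and binary encoding are illustrative assumptions:

```python
def trigger_pair_features(result_words, trigger_pairs):
    """Return one binary feature per predetermined trigger pair A -> B:
    1 if word A occurs anywhere before word B in the recognition result
    (any number of words may separate them), else 0."""
    features = []
    for first, second in trigger_pairs:
        fired = 0
        seen_first = False
        for word in result_words:
            if word == first:
                seen_first = True
            elif word == second and seen_first:
                fired = 1
                break
        features.append(fired)
    return features
```

For example, the result "play me poker face" fires the pair ("play", "face") even though two words separate the trigger words.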
Inference system and ranking process using a trained neural network ranker
Fig. 1 depicts an in-vehicle information system 100 that includes a head-up display (HUD) 120, one or more console LCD panels 124, one or more input microphones 128, and one or more output speakers 132. The LCD display 124 and the HUD 120 generate visual output responses from the system 100 based, at least in part, on speech input commands that the system 100 receives from an operator or other vehicle occupant. A controller 148 is operatively connected to each of the components in the in-vehicle information system 100. In some embodiments, the controller 148 is connected to or incorporates additional components, such as a global positioning system (GPS) receiver 152 and a wireless communication device 154, to provide navigation and communication using external data networks and computing devices.
In some operating modes, the in-vehicle information system 100 operates independently, while in other operating modes the in-vehicle information system 100 interacts with a mobile electronic device, such as a smartphone 170, tablet, notebook computer, or other electronic device. The in-vehicle information system communicates with the smartphone 170 using a wired interface, such as USB, or a wireless interface, such as Bluetooth. The in-vehicle information system 100 provides a speech recognition user interface that enables the operator to control the smartphone 170 or another mobile electronic communication device using spoken commands that reduce distraction while operating the vehicle. For example, the in-vehicle information system 100 provides a speech interface to enable the vehicle operator to make phone calls or send text messages with the smartphone 170 without requiring the operator to hold or look at the smartphone 170. In some embodiments, the smartphone 170 includes various devices, such as GPS and wireless networking devices, that complement or replace the functionality of the devices housed in the vehicle.
The microphone 128 generates audio data from spoken input received from the vehicle operator or another vehicle passenger. The controller 148 includes hardware, such as a DSP, that processes the audio data, and software components that convert the input signals from the microphone 128 into audio input data. As explained below, the controller 148 uses at least one general-purpose and at least one domain-specific speech recognition engine to generate candidate speech recognition results based on the audio input data, and the controller 148 further uses a ranker to improve the accuracy of the final speech recognition result output. Additionally, the controller 148 includes hardware and software components that enable the generation of synthesized speech or other audio output through the speakers 132.
The in-vehicle information system 100 provides visual feedback to the vehicle operator using the LCD panel 124, the HUD 120 projected onto the windshield 102, and gauges, indicator lights, or additional LCD panels located in the dashboard 108. When the vehicle is in motion, the controller 148 optionally deactivates the LCD panel 124, or displays only simplified output through the LCD panel 124, to reduce distraction to the vehicle operator. The controller 148 displays visual feedback using the HUD 120 to enable the operator to view the environment around the vehicle while receiving the visual feedback. The controller 148 typically displays simplified data on the HUD 120 in a region corresponding to the peripheral vision of the vehicle operator to ensure that the vehicle operator has an unobstructed view of the road and the environment around the vehicle.
As described above, the HUD 120 displays visual information on a portion of the windshield 102. As used herein, the term "HUD" refers generically to a wide range of head-up display devices including, but not limited to, combined head-up displays (CHUDs) that include a separate combiner element, and the like. In some embodiments, the HUD 120 displays monochromatic text and graphics, while other HUD embodiments include multi-color displays. While the HUD 120 is depicted as displaying on the windshield 102, in alternative embodiments a head-up unit is integrated with glasses, a helmet visor, or a reticle that the operator wears during operation.
The controller 148 includes one or more integrated circuits configured as one, or a combination, of the following: a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a digital signal processor (DSP), or any other suitable digital logic device. The controller 148 also includes a memory, such as a solid-state or magnetic data storage device, that stores programmed instructions for the operation of the in-vehicle information system 100.
During operation, the in-vehicle information system 100 receives input requests from multiple input devices, including speech input commands received through the microphone 128. In particular, the controller 148 receives audio input data corresponding to speech from the user via the microphone 128.
The controller 148 includes one or more integrated circuits configured as a central processing unit (CPU), microcontroller, field programmable gate array (FPGA), application specific integrated circuit (ASIC), digital signal processor (DSP), or any other suitable digital logic device. The controller 148 is also operatively connected to a memory 160, which includes a non-volatile solid-state or magnetic data storage device and a volatile data storage device, such as random access memory (RAM), that stores programmed instructions for the operation of the in-vehicle information system 100. The memory 160 stores model data and executable program instruction code and data that implement a plurality of speech recognition engines 162, a feature extractor 164, and a deep neural network ranker 166. The speech recognition engines 162 are trained using a predetermined training process and are otherwise known to the art. While the embodiment of Fig. 1 includes elements of the system 100 stored within the memory 160 of a motor vehicle, in some embodiments an external computing device, such as a network-connected server, implements some or all of the features depicted in the system 100. Thus, those skilled in the art will recognize that any reference to the operation of the system 100, including the controller 148 and the memory 160, should further include the operation of server computing devices and other distributed computing components in alternative embodiments of the system 100.
In the embodiment of Fig. 1, the feature extractor 164 is configured to generate a feature vector with a plurality of numeric elements corresponding to the contents of each candidate speech recognition result, including speech recognition results generated by one of the speech recognition engines 162 and hybrid speech recognition results that combine words from two or more of the speech recognition engines 162. The feature extractor 164 generates a feature vector that includes elements for any one, or a combination, of the following features: (a) trigger pairs, (b) confidence scores, and (c) individual word-level features, including a bag-of-words with decay.
In the system 100, the trigger pairs stored in the feature extractor 164 each include a predetermined set of two words that have previously been identified as having a strong correlation in spoken input sequences from a training corpus that represents the structure of expected speech inputs. The first trigger word has a strong statistical likelihood of being followed by the second trigger word in a speech input, although the two trigger words may be separated by an unknown number of intermediate words in different speech inputs. Thus, if a speech recognition result includes the trigger words, the likelihood that those trigger words are accurate in the speech recognition result is comparatively high due to the statistical correlation between the first and second trigger words. In the system 100, the trigger words are generated based on mutual information scores using statistical methods known to the art. The memory 160 stores elements for a predetermined set of N trigger pairs in the feature vector, where the trigger pair elements correspond to the set of trigger pairs that have the highest levels of correlation between the first word and the second word based on high mutual information scores. As described below, the trigger word pairs provide the neural network ranker 166 with additional features of the speech recognition results that enable the neural network ranker 166 to rank the speech recognition results using features that go beyond the individual words present in the speech recognition results.
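The mutual-information selection of trigger pairs described above can be illustrated with a small sketch. The patent only states that pairs are chosen by mutual information scores using known statistical methods; the utterance-level counting and the pointwise-mutual-information variant below are assumptions for illustration:

```python
import math
from collections import Counter

def trigger_pair_pmi(corpus):
    """Score ordered word pairs (A precedes B) by pointwise mutual
    information over a corpus of tokenized utterances; a higher score
    suggests a stronger A -> B trigger pair candidate."""
    n = len(corpus)
    word_counts = Counter()
    pair_counts = Counter()
    for words in corpus:
        word_counts.update(set(words))             # count each word once per utterance
        ordered = {(words[i], words[j])            # each ordered pair once per utterance
                   for i in range(len(words))
                   for j in range(i + 1, len(words))}
        pair_counts.update(ordered)
    pmi = {}
    for (a, b), c_ab in pair_counts.items():
        p_ab = c_ab / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi
```

The N highest-scoring pairs from such a computation would populate the trigger-pair portion of the feature vector.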
The confidence score features correspond to the numeric confidence score values that the speech recognition engines 162 generate in conjunction with each candidate speech recognition result. For example, in one configuration, a numeric value in the range (0.0, 1.0) indicates the probabilistic confidence level that the speech recognition engine places in the accuracy of a particular candidate speech recognition result, from lowest confidence (0.0) to highest confidence (1.0). Each hybrid candidate speech recognition result that includes words from two or more speech recognition engines is assigned a confidence score that is the normalized average of the confidence scores of the candidate speech recognition results that the controller 148 used to generate the hybrid speech recognition result in question.
In the system 100, the controller 148 also normalizes and whitens the confidence score values for the speech recognition results generated by the different speech recognition engines to generate final feature vector elements that include uniform normalized and whitened confidence scores across the outputs of the multiple speech recognition engines 162. The controller 148 uses a normalization process to normalize the confidence scores from the different speech recognition engines and then whitens the normalized confidence score values using prior-art whitening techniques with the mean and variance estimated from the training data. In one embodiment, the controller 148 normalizes the confidence scores between the different speech recognition engines using a linear regression process. The controller 148 first subdivides the confidence score range into a predetermined number of subdivisions, or "bins," such as 20 unique bins for two speech recognition engines A and B. The controller 148 then identifies the actual accuracy rates for the various speech recognition results corresponding to each scoring bin, based on the speech recognition results observed during a prior training process and the actual underlying inputs used in the process 200. The controller 148 performs a clustering operation on the confidence scores within a predetermined value window around each "edge" separating the bins for each set of results from the different speech recognition engines, and identifies an average accuracy score corresponding to each edge confidence score value. The "edge" confidence scores are distributed evenly along the confidence score range of each speech recognition engine and provide a predetermined number of comparison points for performing a linear regression that maps the confidence scores of a first speech recognition engine to the confidence scores of another speech recognition engine that have similar accuracy rates.
The controller 148 uses the accuracy data identified for each edge score to perform a linear regression mapping that enables the controller 148 to convert a confidence score from the first speech recognition engine into another confidence score value corresponding to the equivalent confidence score from the second speech recognition engine. The mapping from one confidence score of the first speech recognition engine to another confidence score of another speech recognition engine is also referred to as a score alignment process, and in some embodiments the controller 148 determines the alignment of the confidence scores from the first speech recognition engine with the second speech recognition engine using the following equation:
x′ = e′_i + (x − e_i)(e′_{i+1} − e′_i)/(e_{i+1} − e_i), where x is the score from the first speech recognition engine, x′ is the equivalent value of x in the confidence score range of the second speech recognition engine, the values e_i and e_{i+1} correspond to the estimated accuracy scores for the edge values closest to the value x of the first speech recognition engine (such as the estimated accuracy scores for the edge values 20 and 25 around a confidence score of 22), and the values e′_i and e′_{i+1} correspond to the estimated accuracy scores at the same relative edge values of the second speech recognition engine.
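The alignment equation above can be read as piecewise linear interpolation between the bracketing edge values. The sketch below is a minimal illustration under the assumption that the edges for both engines are supplied as equal-length, sorted lists with matching accuracy levels at each index; the function name and calling convention are illustrative, not taken from the patent.

```python
def align_score(x, edges_a, edges_b):
    """Map a confidence score x from engine A onto engine B's scale by
    linear interpolation between the pair of engine-A edges that brackets x.

    edges_a / edges_b: equal-length, sorted lists of edge scores whose
    estimated accuracy matches index-for-index (an assumption of this sketch).
    """
    for i in range(len(edges_a) - 1):
        if edges_a[i] <= x <= edges_a[i + 1]:
            e_i, e_next = edges_a[i], edges_a[i + 1]
            ep_i, ep_next = edges_b[i], edges_b[i + 1]
            # x' = e'_i + (x - e_i)(e'_{i+1} - e'_i)/(e_{i+1} - e_i)
            return ep_i + (x - e_i) * (ep_next - ep_i) / (e_next - e_i)
    raise ValueError("score outside the calibrated edge range")
```

For the example in the text, a score of 22 from engine A with edges at 20 and 25 mapping to engine-B edges at, say, 30 and 40 would align to 34.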
In some embodiments, the controller 148 stores the results of the linear regression in the feature extractor 164 in the memory 160, as a lookup table or another suitable data structure, to enable efficient normalization of the confidence scores between the different speech recognition engines 162 without regenerating the linear regression for each comparison.
The controller 148 also uses the feature extractor 164 to identify word-level features in the candidate speech recognition results. The word-level features correspond to data that the controller 148 places in elements of the feature vector, where each element corresponds to a characteristic of an individual word in the candidate speech recognition result. In one embodiment, the controller 148 identifies only the presence or absence of words in a predetermined vocabulary, where each word corresponds to a separate element of the predetermined feature vector for each candidate speech recognition result. For example, if the word "street" occurs at least once in the candidate speech recognition result, the controller 148 sets the value of the corresponding element in the feature vector to 1 during the feature extraction process. In another embodiment, the controller 148 identifies the frequency of each word, where "frequency" as used herein refers to the number of times the word occurs in the candidate speech recognition result. The controller 148 places the number of occurrences of the word in the corresponding element of the feature vector.
In still another embodiment, the feature extractor 164 generates a "bag-of-words with decay" feature for the element in the feature vector corresponding to each word in the predetermined vocabulary. As used herein, the term "bag-of-words with decay" refers to a numeric score that the controller 148 assigns to each word in the predetermined vocabulary based on the number of occurrences and the positions of the word in the candidate speech recognition result. The controller 148 generates a bag-of-words-with-decay score for each word of the candidate speech recognition result that is in the predetermined vocabulary, and assigns a decay score of zero to those words in the vocabulary that do not appear in the candidate result. In some embodiments, the predetermined vocabulary includes a special entry to represent any out-of-vocabulary word, and the controller 148 also generates a single bag-of-words-with-decay score for the special entry based on all of the out-of-vocabulary words in the candidate result. For a given word w_i in the predetermined dictionary, the bag-of-words-with-decay score is BOW(w_i) = Σ_{k ∈ P(w_i)} d^k, where P(w_i) is the set of positions at which the word w_i appears in the candidate speech recognition result, and the term d is a predetermined decay factor in the range (0, 1.0), which is set, for example, to 0.9 in the illustrative embodiment of the system 100.
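The decayed bag-of-words score can be sketched directly from the formula above. This is a minimal illustration, assuming zero-indexed word positions (the text does not specify the indexing) and using an assumed "&lt;OOV&gt;" key for the special out-of-vocabulary entry.

```python
def bag_of_words_with_decay(result_words, vocabulary, decay=0.9):
    """Score each vocabulary word as the sum of decay**position over the
    positions where it appears in the candidate result; absent words keep
    a score of zero. Out-of-vocabulary words pool into one '<OOV>' entry."""
    scores = {w: 0.0 for w in vocabulary}
    scores["<OOV>"] = 0.0
    for position, word in enumerate(result_words):
        key = word if word in vocabulary else "<OOV>"
        scores[key] += decay ** position
    return scores
```

For the candidate result "play the song the" with vocabulary {play, the, song}, "the" appears at positions 1 and 3 and scores 0.9 + 0.9³, rewarding earlier and repeated occurrences.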
Fig. 5 depicts an example of the structure of a feature vector 500 in more detail. The feature vector 500 includes multiple elements corresponding to the trigger-pair features 504, a confidence score element 508, and multiple other elements corresponding to the word-level features 512, which are depicted in Fig. 5 as bag-of-words-with-decay features. In the feature vector 500, the trigger-word-pair features 504 include an element for each predetermined trigger pair, where a value of "0" indicates that the trigger pair is not present in the candidate speech recognition result and a value of "1" indicates that the trigger pair is present in the candidate speech recognition result. The confidence score element 508 is a single element that includes the numeric confidence score value generated by the corresponding speech recognition engine 162, or by the combination of speech recognition engines for a hybrid speech recognition result. The word-level feature elements 512 include an array of elements that each correspond to a particular word in the predetermined vocabulary. For example, in one embodiment the predetermined dictionary for a language (such as English or Chinese) includes words that are each mapped to one of the word-level elements 512. In another embodiment that is described in more detail below, a training process generates the vocabulary of words based on the frequency of occurrence of words in a large training data set, where the words that appear in the training data set with the highest frequency (e.g., the 90% of words with the highest frequency) are mapped to the word-level elements 512 in the structure of the feature vector 500.
The precise order of the feature vector elements depicted in the feature vector 500 is not a requirement for representing the trigger-pair, confidence score, and word-level features. Instead, any ordering of the elements in the feature vector 500 is valid as long as the feature vectors for all candidate speech recognition results are generated with a consistent structure, in which each element represents the same trigger pair, confidence score, or word-level feature across all of the candidate speech recognition results.
Referring again to Fig. 1, in the embodiment of Fig. 1 the neural network ranker 166 is a trained neural network that includes an input layer of neurons that receives multiple feature vectors corresponding to a predetermined number of candidate speech recognition results, and an output layer of neurons that generates a ranking score corresponding to each of the input feature vectors. In general, a neural network includes multiple nodes that are referred to as "neurons". Each neuron receives at least one input value, applies a predetermined weighting factor to each input value, where different input values typically receive different weighting factors, and generates an output as a sum of the weighted inputs, in some embodiments with an optional bias factor added to the sum. The precise weighting factors for each input, and the optional bias value in each neuron, are generated during a training process that is described in more detail below. The output layer of the neural network includes another set of neurons that are specifically configured with an "activation function" during the training process. The activation function is, for example, a sigmoid function or another threshold function that generates an output value based on the inputs from the final hidden layer of neurons in the neural network, where the precise parameters of the sigmoid function or the thresholds are generated during the training process of the neural network.
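The neuron described above reduces to a weighted sum plus bias, optionally followed by an activation such as the sigmoid. The sketch below is only a schematic of that arithmetic, with illustrative names; the actual weights, biases, and activation parameters are produced by the training process.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """One neuron: the sum of each input times its weighting factor,
    plus an optional bias term."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def sigmoid(z):
    """Example activation function for an output-layer neuron."""
    return 1.0 / (1.0 + math.exp(-z))
```

For example, a neuron with weights (0.5, 0.25) and bias 0.1 maps inputs (1, 2) to 1.1 before the activation is applied.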
In the specific configuration of Fig. 1, the neural network ranker 166 is a feedforward deep neural network, and Fig. 5 includes an illustrative depiction of a feedforward deep neural network 550. As is known in the art, a feedforward neural network includes layers of neurons connected in a single direction of travel from the input layer (layer 554) to the output layer (layer 566), without any recurrent or "feedback" loops that connect neurons in one layer of the neural network to neurons in a previous layer of the neural network. A deep neural network includes at least one "hidden layer" (and typically more than one hidden layer) of neurons that is not exposed as the input layer or the output layer. In the neural network 550, k hidden layers 562 of neurons connect the input layer 554 to the output layer 566.
In the embodiment of the neural network 550, the input layer further includes a projection layer 558 that applies a predetermined matrix transformation to selected sets of the input feature vector elements, including two different projection matrices for the trigger-pair elements 504 and the word-level feature elements 512, respectively. The projection layer 558 generates a simplified representation of the outputs of the input neurons in the input layer 554, because in most practical inputs the feature vector elements for the trigger pairs 504 and the word-level features 512 are "sparse", meaning that each candidate speech recognition result includes only a small number (if any) of the trigger pairs and only a small number of the words in the large overall set of words (e.g., 10,000 words) that are encoded in the structure of the feature vector 500. The transformation in the projection layer 558 enables the remaining layers of the neural network 550 to include far fewer neurons while still generating useful ranking scores for the feature vector inputs of the candidate speech recognition results. In one illustrative embodiment, the two projection matrices P_f and P_w, for the trigger-word-pair and word-level features respectively, each project the corresponding input neurons into smaller vector spaces of 200 elements each, which produces a projected layer of 401 neurons for each of the n input feature vectors in the neural network ranker 166 (one neuron is reserved for the confidence score feature).
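Because the trigger-pair and word-level inputs are sparse, the projection can touch only the matrix rows for nonzero elements rather than performing a full dense multiply. The sketch below illustrates that idea under assumed data shapes (dict-of-rows matrices, a reduced dimension of 2 instead of 200 for the test); none of these names or shapes come from the patent.

```python
def project_sparse(nonzero, matrix_rows, out_dim):
    """Multiply a sparse feature vector by a projection matrix, visiting
    only the rows for nonzero elements.

    nonzero: dict {element_index: value}
    matrix_rows: mapping element_index -> row (list of length out_dim)
    """
    out = [0.0] * out_dim
    for idx, value in nonzero.items():
        row = matrix_rows[idx]
        for j in range(out_dim):
            out[j] += value * row[j]
    return out

def projected_input(trigger_nz, word_nz, P_f, P_w, confidence, dim=200):
    # dim trigger features + 1 confidence neuron + dim word features,
    # i.e. 200 + 1 + 200 = 401 projected neurons per result in the text.
    return (project_sparse(trigger_nz, P_f, dim)
            + [confidence]
            + project_sparse(word_nz, P_w, dim))
```

With the illustrative 200-element projections, each candidate's projected slice has length 401 regardless of how large the raw trigger-pair and vocabulary spaces are.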
Although Fig. 5 depicts the neural network 550 with a total of n input slots for the feature vectors corresponding to n different candidate speech recognition results, the number of input neurons in the input layer 554 includes one neuron for each element in the feature vector of each candidate speech recognition result, or n(T + 0.9V + 2) neurons in total, where T is the number of predetermined trigger pairs that are identified in the candidate speech recognition results and V is the number of words that occur in the vocabulary of identified words, where the coefficient 0.9 represents the filtering of the training set, described above, to include only the 90% of words that occur with the highest frequency. The fixed value 2 represents one input neuron for the confidence score value and another input neuron that serves as a catch-all input for any word-level features that do not correspond to the predetermined word-level elements of the input feature vector, such as any out-of-vocabulary words that are not explicitly modeled in the neural network ranker 166. For example, the controller 148 uses the feature extractor 164 to generate, in the feature vector, a bag-of-words-with-decay score for any word in the candidate speech recognition result that is not aligned with an element in the predetermined structure of the feature vector. The element in the feature vector corresponding to out-of-vocabulary words enables the neural network ranker 166 to incorporate the presence of any words that are not included in the predetermined vocabulary into the generation of the ranking score for any candidate speech recognition result that includes out-of-vocabulary words.
The output layer 566 includes fewer output neurons than the input layer 554. In particular, the output layer 566 includes n output neurons, where each output neuron generates a numeric ranking score corresponding to one of the n input feature vectors during an inference process, which in the specific configuration of the system 100 is a ranking process that generates ranking scores for the feature vectors corresponding to the multiple candidate speech recognition results. Some hardware embodiments of the controller 148 include one or more compute units in a GPU, or other dedicated hardware acceleration components, that perform the inference process in a time- and power-efficient manner. In other embodiments, the system 100 further includes additional digital logic processing hardware that is incorporated into a remote server, which the controller 148 accesses using the wireless communication device 154 and a data network. In some embodiments, the hardware in the remote server also implements a portion of the functionality of the multiple speech recognition engines 162. The server includes additional processing hardware to perform all or part of the feature extraction and neural network inference processing to generate the feature vectors and ranking scores for the multiple candidate speech recognition results.
During operation, the system 100 receives audio input data using the microphone 128 and generates multiple candidate speech recognition results using the multiple speech recognition engines 162, including a hybrid speech recognition result that, in some embodiments, includes words selected from two or more of the candidate speech recognition results. The controller 148 extracts features from the candidate speech recognition results using the feature extractor 164 to generate feature vectors from the candidate speech recognition results, and provides the feature vectors to the neural network ranker 166 to generate an output score for each feature vector. The controller 148 then identifies the feature vector and candidate speech recognition result corresponding to the highest ranking score, and the controller 148 operates the automated system using, as input, the candidate speech recognition result among the multiple candidate speech recognition results that corresponds to the highest ranking score among the multiple ranking scores.
Fig. 2 depicts a process 200 for performing speech recognition by selecting candidate speech recognition results using multiple speech recognition engines and a neural network ranker. In the description below, a reference to the process 200 performing a function or action refers to the operation of a controller executing stored program instructions to perform the function or action in association with other components in the automated system. The process 200 is described in conjunction with the system 100 of Fig. 1 for illustrative purposes.
The process 200 begins as the system 100 generates multiple candidate speech recognition results using the multiple speech recognition engines 162 (block 204). In the system 100, a user provides spoken audio input to an audio input device, such as the microphone 128. The controller 148 uses the multiple speech recognition engines 162 to generate multiple candidate speech recognition results. As described above, in some embodiments the controller 148 generates a hybrid candidate speech recognition result by using selected words from the candidate speech recognition result of a domain-specific speech recognition engine to replace selected words in the candidate speech recognition result of a general-purpose speech recognition engine. The speech recognition engines 162 also generate the confidence score data that the system 100 uses during the feature vector generation in the process 200.
The process 200 continues as the system 100 performs feature extraction to generate multiple feature vectors that each correspond to one of the candidate speech recognition results (block 208). In the system 100, the controller 148 generates the feature vectors using the feature extractor 164, where each feature vector includes one or more of the trigger-pair, confidence score, and word-level features described above, to produce feature vectors with the structure of the feature vector 500 in Fig. 5 or another similar structure for one or more of the trigger-pair, confidence score, and word-level features. In the embodiment of Fig. 2, the controller 148 generates the word-level features for the word-level feature elements of the feature vector using the bag-of-words-with-decay metric.
The process 200 continues as the controller 148 provides the feature vectors for the multiple candidate speech recognition results to the neural network ranker 166 as inputs to an inference process that generates multiple ranking scores corresponding to the multiple candidate speech recognition results (block 212). In one embodiment, the controller 148 uses the trained feedforward deep neural network ranker 166 to generate the multiple ranking scores at the output layer neurons of the neural network using the inference process. As described above, in another embodiment the controller 148 uses the wireless communication device 154 to transmit the feature vector data, an encoded version of the candidate speech recognition results, or the recorded audio speech recognition data to an external server, where a processor in the server performs a portion of the process 200 to generate the ranking scores for the candidate speech recognition results.
In most instances, the controller 148 generates a number n of candidate speech recognition results and corresponding feature vectors that matches the predetermined number n of feature vector inputs that the neural network ranker 166 is configured to receive during the training process. However, in some instances, if the number of feature vectors for the candidate speech recognition results is less than the maximum number n, the controller 148 generates "empty" all-zero feature vector inputs to ensure that all of the neurons in the input layer of the neural network ranker 166 receive an input. The controller 148 ignores the scores of the corresponding output layer neurons for each empty input, and the neural network in the ranker 166 generates scores for the non-empty feature vectors of the candidate speech recognition results.
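The padding and selection step above can be sketched as follows. This is an illustrative outline only: `score_fn` is a hypothetical stand-in for the trained ranker's per-slot inference, and the function names are not from the patent.

```python
def rank_candidates(feature_vectors, n_slots, vector_len, score_fn):
    """Pad the candidate feature vectors up to the ranker's fixed number of
    input slots with all-zero 'empty' vectors, score every slot, and return
    the index of the best real candidate (empty-slot scores are ignored)."""
    real = len(feature_vectors)
    padded = list(feature_vectors) + [[0.0] * vector_len] * (n_slots - real)
    scores = [score_fn(v) for v in padded]
    # Only the first `real` output neurons correspond to actual candidates.
    return max(range(real), key=lambda i: scores[i])
```

With three candidates and a four-slot ranker, one zero vector is appended and the winner is chosen among the first three output scores only.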
The process 200 continues as the controller 148 identifies the candidate speech recognition result that corresponds to the highest ranking score in the output layer of the neural network ranker 166 (block 216). As described above in Fig. 5, each output neuron in the output layer 566 of the neural network 550 generates an output value corresponding to the ranking score of one of the input feature vectors that the system 100 supplies to the predetermined set of input neurons in the input layer 554. The controller 148 identifies the candidate speech recognition result with the highest ranking score based on the index of the output neuron that generates the highest ranking score in the neural network 550.
Referring again to Fig. 2, the process 200 continues as the controller 148 uses the selected highest-ranked speech recognition result as the input from the user to operate the automated system (block 220). In the in-vehicle information system 100 of Fig. 1, the controller 148 operates various systems, including, for example, a vehicle navigation system that uses the GPS 152, the wireless communication device 154, and the LCD display 124 or HUD 120 to perform vehicle navigation operations in response to the voice input from the user. In another configuration, the controller 148 plays music through the audio output device 132 in response to a voice command. In still another configuration, the system 100 uses the smartphone 170 or another network-connected device to place a hands-free phone call or transmit a text message based on the voice input from the user. Although Fig. 1 depicts an in-vehicle information system embodiment, other embodiments use automated systems that control the operation of various hardware components and software applications using audio input data.
Although Fig. 1 depicts the in-vehicle information system 100 as an illustrative example of an automated system that performs speech recognition to receive and execute commands from a user, a similar speech recognition process can be implemented in other contexts. For example, a mobile electronic device, such as the smartphone 170 or another suitable device, typically includes one or more microphones and a processor that can implement the speech recognition engines, the ranker, the stored trigger pairs, and the other components that implement the speech recognition and control systems. In another embodiment, a home automation system controls the HVAC and appliances in a house using at least one computing device that receives voice input from the user and performs speech recognition using multiple speech recognition engines to control the operation of the various automated systems in the house. In each instance, the system is optionally configured to use different sets of domain-specific speech recognition engines that are tailored to the specific applications and operations of the different automated systems.
Training system and process for training the neural network ranker
In the system 100 of Fig. 1 and the speech recognition process of Fig. 2, the neural network ranker 166 is a trained feedforward deep neural network. The neural network ranker 166 is trained prior to the operation of the system 100 for performing the speech recognition process described above. Fig. 3 depicts an illustrative embodiment of a computerized system 300 that is configured to train the neural network ranker 166, and Fig. 4 depicts a training process 400 for generating the trained neural network ranker 166.
The system 300 includes a processor 304 and a memory 320. The processor 304 includes, for example, one or more CPU cores that are optionally coupled to parallelized hardware accelerators designed to train neural networks in a time- and power-efficient manner. Examples of such accelerators include GPUs with compute shader units configured for neural network training, and specially programmed FPGA chips or ASIC hardware dedicated to training neural networks. In some embodiments, the processor 304 further includes a cluster of computing devices that operate in parallel to perform the neural network training process.
The memory 320 includes, for example, a non-volatile solid-state or magnetic data storage device and a volatile data storage device, such as random access memory (RAM), that store programmed instructions for the operation of the system 300. In the configuration of Fig. 3, the memory 320 stores data corresponding to training input data 324, a stochastic gradient descent trainer 328 for the neural network, the neural network ranker 332, and the feature extractor 164.
The training data 324 includes, for example, a large set of speech recognition results produced for a large predetermined set of inputs by the same speech recognition engines 162 that are in the system 100, optionally including hybrid speech recognition results. The training speech recognition result data also includes the confidence scores for the training speech recognition results. For each speech recognition result, the training data further includes a Levenshtein distance metric that quantifies the differences between the speech recognition result and predetermined ground-truth voice input training data, which represents the canonically "correct" result in the training process. The Levenshtein distance metric is one example of an "edit distance" metric, since the metric quantifies the amount of changes (edits) needed to transform the speech recognition result from a speech recognition engine into the actual input of the training data. Both the speech recognition result and the ground-truth voice input training data are referred to as "strings" of text in the comparison metric. For example, the edit distance quantifies the amount of change required to convert the speech recognition result string "Sally shells sea sells by the seashore" into the corresponding correct ground-truth training data string "Sally sells sea shells by the seashore".
The Levenshtein distance metric is known to the art in other contexts and has several properties, including: (1) the Levenshtein distance is always at least the difference of the sizes of the two strings; (2) the Levenshtein distance is at most the length of the longer string; (3) the Levenshtein distance is zero if and only if the strings are equal; (4) if the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance; and (5) the Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (the triangle inequality). The Hamming distance, in turn, refers to a metric of the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. Although the system 300 includes training data encoded with the Levenshtein distance for illustrative purposes, in alternative embodiments another edit distance metric is used to describe the differences between the training speech recognition results and the corresponding ground-truth training inputs.
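The Levenshtein distance described above has a standard two-row dynamic-programming implementation, sketched below. Passing token lists instead of strings gives a word-level edit distance, which may be the more natural granularity for speech recognition results, though the text does not specify which granularity the training data uses.

```python
def levenshtein(a, b):
    """Minimum number of single-element insertions, deletions, and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]
```

For instance, levenshtein("kitten", "sitting") is 3, and the result is always at least the difference in the two lengths, consistent with property (1) above.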
In the embodiment of Fig. 3, the feature extractor 164 in the memory 320 is the same feature extractor 164 that is used in the system 100 described above. In particular, the processor 304 uses the feature extractor 164 to generate a feature vector from each of the training speech recognition results using one or more of the trigger-word-pair, confidence score, and word-level features described above.
The stochastic gradient descent trainer 328 includes the stored program instructions and parameter data for a neural network training process that the processor 304 performs to train the neural network ranker 332 based on the feature vectors that the feature extractor 164 generates from the training data 324. As is known in the art, a stochastic gradient descent trainer includes a class of related training processes that train a neural network iteratively by adjusting the parameters in the neural network to minimize the distance (error) between the output of the neural network and a predetermined target function, which is also referred to as an "objective" function. Although stochastic gradient descent training is generally known to the art and is not discussed in greater detail herein, the system 300 modifies the standard prior-art training process. In particular, the training process seeks to minimize the error between the output that the neural network generates from the training data inputs and the target result from the predetermined training data. In prior-art training processes, the target value typically specifies that a given output is binary "correct" or "incorrect"; such a target output from the neural network ranker would provide a score indicating that the feature vector input for a training speech recognition result is either 100% correct or incorrect to some degree when compared with the ground-truth input in the training data. In the system 300, however, the stochastic gradient descent trainer 328 uses the edit distance target data in the training data 324 as "soft" targets that more accurately reflect the correctness level of the different training speech recognition results, where the correctness level can include a continuous range of error values that influence the ranking score, rather than being only completely correct or incorrect.
The processor 304 uses the "soft" target data in the objective function to perform the training process using the stochastic gradient descent trainer 328. For example, the configuration of Fig. 3 uses a "softmax" target function of the following form: target_i = e^(−d_i) / Σ_j e^(−d_j), where d_i is the edit distance for a given training speech recognition result i. During the training process, the gradient descent trainer 328 performs a cost minimization process, where the "cost" refers to the cross entropy between the output values of the neural network ranker 332 and the target values generated by the target function during each iteration of the training process. The processor 304 provides batches of samples to the gradient descent trainer 328 during the training process, for example batches of 180 training inputs that each include different training speech recognition results generated by the multiple speech recognition engines. The iterative process continues until the cross entropy of the training set has not improved for ten iterations, and the trained neural network parameters that produce the lowest overall cross entropy from all of the training data form the final trained neural network.
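The soft-target construction and the cross-entropy cost above can be sketched directly. This is a minimal illustration of the formulas only, with assumed function names; it omits the batching, early stopping, and gradient updates that the trainer 328 performs.

```python
import math

def soft_targets(edit_distances):
    """Softmax over negated edit distances: results closer to the ground
    truth receive a higher target value, and the targets sum to 1."""
    weights = [math.exp(-d) for d in edit_distances]
    total = sum(weights)
    return [w / total for w in weights]

def cross_entropy(targets, outputs, eps=1e-12):
    """Training cost between the soft targets and the ranker's outputs."""
    return -sum(t * math.log(o + eps) for t, o in zip(targets, outputs))
```

For two results with edit distances 0 and 1, the targets are roughly 0.73 and 0.27: the closer result dominates, but the imperfect result still contributes, unlike a binary correct/incorrect target.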
During the training process, the processor 304 shuffles the same input feature vectors between the different sets of input neurons in the neural network ranker 332 during different iterations of the training process, to ensure that the position of a particular feature vector in the input layer of the neural network does not produce an incorrect bias in the trained neural network. As described above for the inference process, if a particular set of training data does not include a sufficient number of candidate speech recognition results to provide inputs to all of the neurons in the input layer of the neural network ranker 332, the processor 304 generates "empty" input feature vectors with zero inputs. As is known in the art, the stochastic gradient descent training process includes numeric training parameters, and in one configuration of the system 300 the hyperparameters of the stochastic gradient descent trainer 328 are α = 0.001, β1 = 0.9, and β2 = 0.999.
The neural network ranker 332 is, in one embodiment, a feedforward deep neural network with the structure of the neural network 550 depicted in Fig. 5. During operation, the processor 304 generates the structure of the untrained neural network ranker 332 with a predetermined number of neurons, based on the number of neurons in the input layer 554 of the neural network 550 in Fig. 5 and the number of output neurons in the output layer 566 for the total of n candidate speech recognition results that are each provided as an input to the neural network for the inference process. The processor 304 also generates a suitable number of neurons in the k hidden layers 562 of the neural network 550. In one embodiment, the processor 304 initializes the neural network structure with randomized weight values for each of the inputs to the neurons. As described above, during the training process the processor 304 adjusts the various weights and bias values for the neurons in the input layer 554 and the hidden layers 562 of the neural network, along with the parameters of the activation functions in the neurons of the output layer 566, to minimize the cross entropy of the output from the neural network ranker 332 compared to the target function for a given set of inputs.
Although Fig. 3 depicts a specific configuration of a computerized device 300 that generates a trained neural network ranker, in some embodiments the same system that uses the trained neural network ranker in a speech recognition process is further configured to train the neural network ranker. For example, in some embodiments the controller 148 in system 100 is an example of a processor that can be configured to perform the neural network training process.
Fig. 4 depicts a process 400 for performing speech recognition by using multiple speech recognition engines and a neural network ranker to select a candidate speech recognition result. In the description below, a reference to process 400 performing a function or action refers to the operation of a processor executing stored program instructions, in association with the other components of an automated system, to perform that function or action. Process 400 is described in conjunction with the system 300 of Fig. 3 for illustrative purposes.
Process 400 begins as system 300 generates a plurality of feature vectors, each corresponding to one of a plurality of training speech recognition results stored in the training data 324 (block 404). In system 300, processor 304 generates the plurality of feature vectors using the feature extractor 164, where each feature vector corresponds to one training speech recognition result in the training data 324. As described above, in at least one embodiment of process 400, processor 304 generates each feature vector to include one or more of the following: trigger-pair features, a confidence score, and word-level features, where the word-level features include bag-of-words features with decay.
As part of the feature extraction and feature generation process, in some embodiments processor 304 generates a feature vector structure that includes specific elements mapped to the trigger-pair features and the word-level features. For example, as described above in connection with system 100, in some embodiments processor 304 generates the feature vectors with a structure corresponding to only a portion of the words observed in the training data 324, such as the 90% most commonly observed words, while the remaining 10% of words that occur with the lowest frequencies are not encoded into the structure of the feature vector. Processor 304 optionally identifies the most common trigger-pair features and generates the structure for the most commonly observed trigger word pairs present in the training data 324. In embodiments in which system 300 generates the structure of the feature vectors during process 400, processor 304 stores the structure of the feature vectors with the feature extractor data 164, and after the training process is complete, the structure of the feature vectors is provided to the automated system together with the neural network ranker 332; the automated system uses feature vectors with the specified structure as inputs to the trained neural network to generate ranking scores for candidate speech recognition results. In other embodiments, the structure of the feature vectors is determined a priori based on a natural language, such as English or Chinese, rather than on the specific content of the training data 324.
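A minimal sketch of the word-level feature extraction described above (the function and parameter names are hypothetical, and the exact positional decay formula used by the patent may differ):

```python
import numpy as np

def extract_features(words, vocab, trigger_pairs, decay=0.9):
    """Build a feature vector from word-level features.

    vocab         : words kept in the structure (e.g. the most frequent 90%)
    trigger_pairs : sequence of (word_a, word_b) predetermined trigger pairs
    decay         : attenuation applied by word position (bag-of-words with decay)
    """
    bow = np.zeros(len(vocab))
    index = {w: i for i, w in enumerate(vocab)}
    for pos, w in enumerate(words):
        if w in index:                # lowest-frequency words are not encoded
            bow[index[w]] += decay ** pos
    trig = np.array([1.0 if (a in words and b in words) else 0.0
                     for (a, b) in trigger_pairs])
    return np.concatenate([bow, trig])
```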
Process 400 continues as system 300 trains the neural network ranker 332 using the stochastic gradient descent trainer 328, based on the feature vectors of the training speech recognition results and the soft-target edit distance data from the training data 324 (block 408). During the training process, processor 304 uses the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker, and trains the neural network ranker 332 through a cost minimization process between the plurality of output scores generated by the neural network ranker during the training process and a target function having the soft scores described above, where the soft scores are based on the predetermined edit distances between each of the plurality of training speech recognition results and the predetermined correct input for that training speech recognition result. During process 400, processor 304 iteratively modifies the input weighting coefficients and neuron bias values in the input and hidden layers of the neural network ranker 332, and adjusts the parameters of the activation functions in the output-layer neurons, using the stochastic gradient descent trainer 328.
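The soft-target construction can be illustrated as follows. The word-level Levenshtein distance is standard; the particular mapping from edit distance to a soft score and the normalization are illustrative assumptions, since the text only states that the targets are based on predetermined edit distances:

```python
import numpy as np

def levenshtein(a, b):
    """Word-level Levenshtein (edit) distance via dynamic programming."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # substitution
    return int(d[m, n])

def soft_targets(candidates, reference):
    """Map each candidate's edit distance to the correct transcript into a
    target distribution: a smaller distance yields a larger target score."""
    dists = np.array([levenshtein(c.split(), reference.split())
                      for c in candidates], dtype=float)
    scores = 1.0 / (1.0 + dists)   # illustrative distance-to-score mapping
    return scores / scores.sum()
```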
After the training process is complete, processor 304 stores the structure of the trained neural network ranker 332 and, in embodiments in which the structure of the feature vectors is generated based on the training data in memory 320, optionally stores the structure of the feature vectors as well (block 412). The stored structure of the neural network ranker 332 and the feature vector structure are then transferred to other automated systems, such as the system 100 of Fig. 1, which use the trained neural network ranker 332 and the feature extractor 164 to rank multiple candidate speech recognition results during speech recognition operations.
It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may desirably be combined into many other different systems, applications, or methods. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements may be subsequently made by those skilled in the art, which are also intended to be encompassed by the appended claims.

Claims (17)

1. A method for speech recognition in an automated system, comprising:
generating, with a controller, a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in a plurality of candidate speech recognition results, wherein generating a first feature vector in the plurality of feature vectors for a first candidate speech recognition result in the plurality of candidate speech recognition results further comprises:
identifying, with the controller, at least one trigger pair in the first candidate speech recognition result with reference to a plurality of predetermined trigger pairs stored in a memory, the at least one trigger pair including two predetermined trigger words; and
generating, with the controller, the first feature vector including an element for the at least one trigger pair;
providing, with the controller, the plurality of feature vectors as inputs to a neural network;
generating, with the controller, a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network; and
operating, with the controller, the automated system using, as an input, the candidate speech recognition result in the plurality of candidate speech recognition results that corresponds to the highest ranking score in the plurality of ranking scores.
2. The method of claim 1, wherein generating each feature vector in the plurality of feature vectors further comprises:
generating, with the controller, each feature vector including an element for one confidence score in a plurality of confidence scores, each confidence score being associated with the candidate speech recognition result corresponding to each feature vector.
3. The method of claim 2, further comprising:
performing, with the controller, a linear regression process based on the plurality of confidence scores to generate a plurality of normalized confidence scores for the plurality of feature vectors, the plurality of normalized confidence scores being based on the confidence score of one predetermined candidate speech recognition result in the plurality of speech recognition results.
4. The method of claim 1, wherein generating the first feature vector further comprises:
identifying, with the controller, a plurality of unique words in the first candidate speech recognition result, including a frequency of occurrence of each unique word in the plurality of words and at least one position of each unique word within the first candidate speech recognition result;
generating, with the controller, a plurality of bag-of-words with decay parameters, each bag-of-words with decay parameter corresponding to one unique word in the plurality of unique words based on the frequency and the at least one position of the one unique word and a predetermined decay parameter; and
generating, with the controller, the first feature vector including an element for each bag-of-words with decay parameter in the plurality of bag-of-words with decay parameters.
5. The method of claim 1, wherein providing the plurality of feature vectors to the neural network further comprises:
providing, with the controller, the plurality of feature vectors as inputs to a feedforward deep neural network.
6. The method of claim 1, further comprising:
generating, with an audio input device, audio input data corresponding to speech input from a user; and
generating, with the controller, the plurality of candidate speech recognition results corresponding to the audio input data using a plurality of speech recognition engines.
7. A method for training a neural network ranker, comprising:
generating, with a processor, a plurality of feature vectors, each feature vector corresponding to one training speech recognition result in a plurality of training speech recognition results stored in a memory, wherein generating a first feature vector in the plurality of feature vectors for a first training speech recognition result in the plurality of training speech recognition results further comprises:
identifying, with the processor, at least one trigger pair in the first training speech recognition result with reference to a plurality of predetermined trigger pairs stored in the memory, the at least one trigger pair including two predetermined trigger words; and
generating, with the processor, the first feature vector including an element for the at least one trigger pair;
performing, with the processor, a training process for the neural network ranker using the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker, a plurality of output scores generated by the neural network ranker during the training process, and a plurality of target results based on predetermined edit distances between the plurality of training speech recognition results and the predetermined correct input for each training speech recognition result in the plurality of speech recognition results; and
storing, with the processor, the neural network ranker in the memory after completion of the training process, for use in generating ranking scores corresponding to additional feature vectors for speech recognition results that are not present in the plurality of training speech recognition results.
8. The method of claim 7, wherein generating the first feature vector further comprises:
generating, with the processor, the feature vector including an element for a confidence score associated with the first training speech recognition result.
9. The method of claim 7, wherein generating the first feature vector further comprises:
identifying, with the processor, a plurality of unique words in the first training speech recognition result, including a frequency of occurrence of each unique word in the plurality of words and at least one position of each unique word within the first training speech recognition result;
generating, with the processor, a plurality of bag-of-words with decay parameters, each bag-of-words with decay parameter corresponding to one unique word in the plurality of unique words based on the frequency and the at least one position of the one unique word and a predetermined decay parameter; and
generating, with the processor, the first feature vector including an element for each bag-of-words with decay parameter in the plurality of bag-of-words with decay parameters.
10. The method of claim 7, wherein the training process further comprises:
generating, with the processor, the trained neural network using a stochastic gradient descent training process.
11. The method of claim 7, wherein the training further comprises:
performing, with the processor, the training process for the neural network ranker, the plurality of target results being based on Levenshtein distances between the plurality of training speech recognition results and the predetermined correct input for each training speech recognition result in the plurality of speech recognition results.
12. A system for automated speech recognition, comprising:
a memory configured to store:
a plurality of predetermined trigger pairs, each trigger pair including two words; and
a neural network configured to generate ranking scores corresponding to a plurality of candidate speech recognition results; and
a controller operatively connected to the memory, the controller being configured to:
generate a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in the plurality of candidate speech recognition results, wherein for the generation of a first feature vector in the plurality of feature vectors for a first candidate speech recognition result in the plurality of candidate speech recognition results the controller is further configured to:
identify at least one trigger pair in the first candidate speech recognition result with reference to the plurality of predetermined trigger pairs stored in the memory, the at least one trigger pair including two predetermined trigger words; and
generate the first feature vector including an element for the at least one trigger pair;
provide the plurality of feature vectors as inputs to the neural network;
generate a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network; and
operate the automated system using, as an input, the candidate speech recognition result in the plurality of candidate speech recognition results that corresponds to the highest ranking score in the plurality of ranking scores.
13. The system of claim 12, wherein the controller is further configured to:
generate each feature vector including an element for one confidence score in a plurality of confidence scores, each confidence score being associated with the candidate speech recognition result corresponding to each feature vector.
14. The system of claim 13, wherein the controller is further configured to:
perform a linear regression process based on the plurality of confidence scores to generate a plurality of normalized confidence scores for the plurality of feature vectors, the normalized confidence scores being based on the confidence score of one predetermined candidate speech recognition result in the plurality of speech recognition results.
15. The system of claim 12, wherein the controller is further configured to:
identify a plurality of unique words in the first candidate speech recognition result, including a frequency of occurrence of each unique word in the plurality of words and at least one position of each unique word within the first candidate speech recognition result;
generate a plurality of bag-of-words with decay parameters, each bag-of-words with decay parameter corresponding to one unique word in the plurality of unique words based on the frequency and the at least one position of the one unique word and a predetermined decay parameter; and
generate the first feature vector including an element for each bag-of-words with decay parameter in the plurality of bag-of-words with decay parameters.
16. The system of claim 12, wherein the neural network in the memory is a feedforward deep neural network, and the controller is further configured to:
provide the plurality of feature vectors as inputs to the feedforward deep neural network.
17. The system of claim 12, further comprising:
an audio input device; and
wherein the controller is operatively connected to the audio input device and is further configured to:
generate, with the audio input device, audio input data corresponding to speech input from a user; and
generate the plurality of candidate speech recognition results corresponding to the audio input data using a plurality of speech recognition engines.
CN201780070915.1A 2016-11-17 2017-11-15 System and method for ranking mixed speech recognition results using neural networks Active CN109923608B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/353767 2016-11-17
US15/353,767 US10170110B2 (en) 2016-11-17 2016-11-17 System and method for ranking of hybrid speech recognition results with neural networks
PCT/EP2017/079272 WO2018091501A1 (en) 2016-11-17 2017-11-15 System and method for ranking of hybrid speech recognition results with neural networks

Publications (2)

Publication Number Publication Date
CN109923608A true CN109923608A (en) 2019-06-21
CN109923608B CN109923608B (en) 2023-08-01

Family

ID=60327326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780070915.1A Active CN109923608B (en) 2016-11-17 2017-11-15 System and method for ranking mixed speech recognition results using neural networks

Country Status (5)

Country Link
US (1) US10170110B2 (en)
JP (1) JP6743300B2 (en)
CN (1) CN109923608B (en)
DE (1) DE112017004397B4 (en)
WO (1) WO2018091501A1 (en)

US11200900B2 (en) 2019-12-20 2021-12-14 Sonos, Inc. Offline voice control
US11562740B2 (en) 2020-01-07 2023-01-24 Sonos, Inc. Voice verification for media playback
US11308958B2 (en) 2020-02-07 2022-04-19 Sonos, Inc. Localized wakeword verification
US11494593B2 (en) * 2020-03-18 2022-11-08 Walmart Apollo, Llc Methods and apparatus for machine learning model hyperparameter optimization
US11688219B2 (en) * 2020-04-17 2023-06-27 Johnson Controls Tyco IP Holdings LLP Systems and methods for access control using multi-factor validation
KR20210136463A (en) * 2020-05-07 2021-11-17 삼성전자주식회사 Electronic apparatus and controlling method thereof
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11038934B1 (en) 2020-05-11 2021-06-15 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11482224B2 (en) 2020-05-20 2022-10-25 Sonos, Inc. Command keywords with input detection windowing
CN113486924A (en) * 2020-06-03 2021-10-08 谷歌有限责任公司 Object-centric learning with slot attention
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
US11829720B2 (en) 2020-09-01 2023-11-28 Apple Inc. Analysis and validation of language models
CN112466280B (en) * 2020-12-01 2021-12-24 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and readable storage medium
WO2022203701A1 (en) * 2021-03-23 2022-09-29 Google Llc Recurrent neural network-transducer model for performing speech recognition
CN113948085B (en) * 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020091522A1 (en) * 2001-01-09 2002-07-11 Ning Bi System and method for hybrid voice recognition
CN1454381A (en) * 2000-09-08 2003-11-05 高通股份有限公司 Combining DTW and HMM in speaker dependent and independent modes for speech recognition
CN102138175A (en) * 2008-07-02 2011-07-27 谷歌公司 Speech recognition with parallel recognition tasks
JP2011237621A (en) * 2010-05-11 2011-11-24 Honda Motor Co Ltd Robot
CN104143330A (en) * 2013-05-07 2014-11-12 佳能株式会社 Voice recognizing method and voice recognizing system
US20150112685A1 (en) * 2013-10-18 2015-04-23 Via Technologies, Inc. Speech recognition method and electronic apparatus using the method
CN104795069A (en) * 2014-01-21 2015-07-22 腾讯科技(深圳)有限公司 Speech recognition method and server
US9153231B1 (en) * 2013-03-15 2015-10-06 Amazon Technologies, Inc. Adaptive neural network speech recognition models

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004272134A (en) * 2003-03-12 2004-09-30 Advanced Telecommunication Research Institute International Speech recognition device and computer program
US8812321B2 (en) 2010-09-30 2014-08-19 At&T Intellectual Property I, L.P. System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning
US9916538B2 (en) * 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
JP6155592B2 (en) 2012-10-02 2017-07-05 株式会社デンソー Speech recognition system
JP6047364B2 (en) * 2012-10-10 2016-12-21 日本放送協会 Speech recognition apparatus, error correction model learning method, and program
US9519858B2 (en) * 2013-02-10 2016-12-13 Microsoft Technology Licensing, Llc Feature-augmented neural networks and applications of same
US9484023B2 (en) 2013-02-22 2016-11-01 International Business Machines Corporation Conversion of non-back-off language models for efficient speech decoding
US9058805B2 (en) * 2013-05-13 2015-06-16 Google Inc. Multiple recognizer speech recognition
JP5777178B2 (en) * 2013-11-27 2015-09-09 国立研究開発法人情報通信研究機構 Statistical acoustic model adaptation method, acoustic model learning method suitable for statistical acoustic model adaptation, storage medium storing parameters for constructing a deep neural network, and computer program for statistical acoustic model adaptation
US9520127B2 (en) * 2014-04-29 2016-12-13 Microsoft Technology Licensing, Llc Shared hidden layer combination for speech recognition systems
US9679558B2 (en) * 2014-05-15 2017-06-13 Microsoft Technology Licensing, Llc Language modeling for conversational understanding domains using semantic web resources
WO2016167779A1 (en) * 2015-04-16 2016-10-20 Mitsubishi Electric Corporation Speech recognition device and rescoring device
EP3284084A4 (en) * 2015-04-17 2018-09-05 Microsoft Technology Licensing, LLC Deep neural support vector machines
US10127220B2 (en) * 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956621A (en) * 2019-11-27 2020-04-03 北京航空航天大学合肥创新研究院 Method and system for detecting tissue canceration based on neural network
CN110956621B (en) * 2019-11-27 2022-09-13 北京航空航天大学合肥创新研究院 Method and system for detecting tissue canceration based on neural network
CN113112827A (en) * 2021-04-14 2021-07-13 深圳市旗扬特种装备技术工程有限公司 Intelligent traffic control method and intelligent traffic control system
CN113112827B (en) * 2021-04-14 2022-03-25 深圳市旗扬特种装备技术工程有限公司 Intelligent traffic control method and intelligent traffic control system

Also Published As

Publication number Publication date
US20180137857A1 (en) 2018-05-17
US10170110B2 (en) 2019-01-01
DE112017004397T5 (en) 2019-05-23
WO2018091501A1 (en) 2018-05-24
DE112017004397B4 (en) 2022-10-20
JP6743300B2 (en) 2020-08-19
CN109923608B (en) 2023-08-01
JP2019537749A (en) 2019-12-26

Similar Documents

Publication Publication Date Title
CN109923608A (en) System and method for ranking hybrid speech recognition results using a neural network
US9959861B2 (en) System and method for speech recognition
US10977452B2 (en) Multi-lingual virtual personal assistant
CN111897964B (en) Text classification model training method, apparatus, device, and storage medium
JP7022062B2 (en) VPA with integrated object recognition and facial expression recognition
US11615785B2 (en) Speech recognition using natural language understanding related knowledge via deep feedforward neural networks
CN110136693A (en) System and method for neural voice cloning using a small number of samples
US8195459B1 (en) Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments
CN108846063A (en) Method, apparatus, device, and computer-readable medium for determining answers to questions
CN102142253B (en) Speech emotion recognition apparatus and method
CN110473523A (en) Speech recognition method, apparatus, storage medium, and terminal
Williams Multi-domain learning and generalization in dialog state tracking
CN108846077A (en) Semantic matching method, apparatus, medium, and electronic device for question-and-answer text
CN109887484A (en) Speech recognition and speech synthesis method and apparatus based on dual learning
CN110431626A (en) Hyperarticulation detection in repeated voice queries using pairwise comparison to improve speech recognition
CN110223714A (en) Voice-based emotion recognition method
CN113505591A (en) Slot identification method and electronic device
CN114830139A (en) Training models using model-provided candidate actions
CN106529525A (en) Chinese and Japanese handwritten character recognition method
Lippmann et al. LNKnet: neural network, machine-learning, and statistical software for pattern classification
Chai et al. Communication tool for the hard of hearings: A large vocabulary sign language recognition system
US11600263B1 (en) Natural language configuration and operation for tangible games
US11645947B1 (en) Natural language configuration and operation for tangible games
CN114333832A (en) Data processing method and device and readable storage medium
Schuller et al. Speech communication and multimodal interfaces
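The patent at the head of this page describes grading (ranking) hybrid speech recognition results — candidate transcriptions from multiple recognizers, each encoded as a feature vector — by scoring them with a neural network. The following minimal sketch illustrates that general idea only; the network shape, the random placeholder weights, and the example features (recognizer confidence, word count, language-model score) are assumptions for illustration, not the patented implementation:

```python
# Sketch: score candidate speech-recognition results with a tiny
# feed-forward network and return the highest-ranked candidate.
# All weights are random placeholders; a real system would train them.
import math
import random

random.seed(0)

def mlp_score(features, w1, b1, w2, b2):
    # One hidden layer with tanh activation, producing a scalar score.
    hidden = [math.tanh(sum(f * w for f, w in zip(features, row)) + b)
              for row, b in zip(w1, b1)]
    return sum(h * w for h, w in zip(hidden, w2)) + b2

def rank_candidates(candidates):
    # candidates: list of (text, feature_vector) pairs; the features here
    # are hypothetical (confidence, word count, language-model score).
    dim = len(candidates[0][1])
    hidden_size = 4
    w1 = [[random.uniform(-1, 1) for _ in range(dim)]
          for _ in range(hidden_size)]
    b1 = [0.0] * hidden_size
    w2 = [random.uniform(-1, 1) for _ in range(hidden_size)]
    b2 = 0.0
    scored = sorted(((mlp_score(f, w1, b1, w2, b2), text)
                     for text, f in candidates), reverse=True)
    return scored[0][1], scored

best, ranking = rank_candidates([
    ("play some jazz", [0.92, 3.0, 0.8]),  # e.g. on-device recognizer
    ("play some jags", [0.55, 3.0, 0.2]),  # e.g. cloud recognizer
])
print(best)
```

In a hybrid setup of this kind, the same scoring pass runs over every candidate regardless of which recognizer produced it, so the trained network (rather than a fixed rule) decides which recognizer's output is used.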

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant