CN109923608A - System and method for ranking hybrid speech recognition results using a neural network - Google Patents
System and method for ranking hybrid speech recognition results using a neural network
- Publication number
- CN109923608A (application CN201780070915.1A)
- Authority
- CN
- China
- Prior art keywords
- speech recognition
- recognition result
- neural network
- feature vector
- controller
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
A method for ranking candidate speech recognition results includes generating, with a controller, a plurality of feature vectors for the candidate speech recognition results, each feature vector including one or more of trigger-pair features, confidence-score features, and word-level features. The method further includes providing the plurality of feature vectors as inputs to a neural network, generating a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network, and operating an automated system using, as input, the candidate speech recognition result in the plurality of candidate speech recognition results that corresponds to the highest ranking score among the plurality of ranking scores.
Description
Technical field
The present disclosure relates generally to the field of automated speech recognition and, more specifically, to systems and methods that improve the operation of speech recognition systems that use multiple speech recognition engines.
Background
Automated speech recognition is an important technology for implementing human-machine interfaces (HMIs) in a wide range of applications. In particular, speech recognition is useful in situations where a human user needs to focus on performing a task for which using traditional input devices such as a mouse and keyboard would be inconvenient or impractical. For example, in-vehicle "infotainment" systems, home automation systems, and many applications on small electronic mobile devices, such as smartphones, tablets, and wearable computers, can use speech recognition to receive voice commands and other input from users.
Most prior-art speech recognition systems use a trained speech recognition engine to convert recorded spoken input from a user into digital data suitable for processing in a computerized system. Various speech engines known to the art perform natural language understanding techniques to recognize the words that the user speaks and to extract semantic meaning from those words in order to control the operation of the computerized system.
In some situations, a single speech recognition engine is not necessarily optimal for recognizing speech from a user while the user performs different tasks. Prior-art solutions attempt to combine multiple speech recognition systems to improve the accuracy of speech recognition, including selecting low-level outputs from the acoustic models of different speech recognition models, or selecting the entire output set from one of several different speech recognition engines based on a predetermined ranking process. However, low-order combinations of the outputs from multiple speech recognition systems do not preserve high-level linguistic information. In other embodiments, multiple speech recognition engines generate complete speech recognition results, but the decision process for selecting which speech recognition result to use from the outputs of the multiple speech recognition engines remains challenging. Consequently, improvements to speech recognition systems that increase the accuracy of selecting a speech recognition result from a set of candidate speech recognition results produced by multiple speech recognition engines would be beneficial.
Summary of the invention
In one embodiment, a method for performing speech recognition in an automated system has been developed. The method includes generating, with a controller, a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in a plurality of candidate speech recognition results. Generating a first feature vector in the plurality of feature vectors, for a first candidate speech recognition result in the plurality of candidate speech recognition results, further includes: identifying, with the controller and with reference to a plurality of predetermined trigger pairs stored in a memory, at least one trigger pair in the first candidate speech recognition result, the at least one trigger pair including two predetermined trigger words; and generating, with the controller, the first feature vector including an element for the at least one trigger pair. The method further includes providing, with the controller, the plurality of feature vectors as inputs to a neural network; generating, with the controller and based on an output layer of the neural network, a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results; and operating the automated system, with the controller, using as input the candidate speech recognition result in the plurality of candidate speech recognition results that corresponds to the highest ranking score in the plurality of ranking scores.
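The claimed inference method can be sketched in code. The following is a minimal illustration, not the patent's implementation: the trigger pairs, weights, and candidate texts are invented for the example, and a single linear output layer stands in for the deeper neural network described later.

```python
import numpy as np

# Hypothetical trigger pairs; in the described system these are stored in
# memory and mined beforehand from a training corpus.
TRIGGER_PAIRS = [("play", "song"), ("call", "home")]

def feature_vector(result_text, confidence):
    """Build one feature vector: one element per trigger pair plus the
    engine's confidence score for the candidate result."""
    words = result_text.lower().split()
    trig = []
    for a, b in TRIGGER_PAIRS:
        # The element is 1.0 when trigger word a occurs before trigger word b.
        in_order = (a in words and b in words
                    and words.index(a) < words.index(b))
        trig.append(1.0 if in_order else 0.0)
    return np.array(trig + [confidence])

def rank_candidates(candidates, weights, bias):
    """Score every candidate feature vector with a linear output layer and
    return the top-ranked candidate text along with all ranking scores."""
    feats = np.stack([feature_vector(t, c) for t, c in candidates])
    scores = feats @ weights + bias
    return candidates[int(np.argmax(scores))][0], scores

# Two candidates for the same utterance: the acoustically less confident
# one contains a trigger pair, which lifts its ranking score.
candidates = [("play the song daydream", 0.6), ("pray the son day dream", 0.9)]
weights = np.array([1.0, 1.0, 0.5])  # invented weights for illustration
best, scores = rank_candidates(candidates, weights, bias=0.0)
```

Here the trigger-pair evidence outweighs the raw engine confidence, so the first candidate is selected even though its confidence score is lower.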
In another embodiment, a method for training a neural network ranker that generates ranking scores for different candidate speech recognition results in an automated speech recognition system has been developed. The method includes generating, with a processor, a plurality of feature vectors, each feature vector corresponding to one training speech recognition result in a plurality of training speech recognition results stored in a memory. Generating a first feature vector in the plurality of feature vectors, for a first training speech recognition result in the plurality of training speech recognition results, further includes: identifying, with the processor and with reference to a plurality of predetermined trigger pairs stored in the memory, at least one trigger pair in the first training speech recognition result, the at least one trigger pair including two predetermined trigger words; and generating, with the processor, the first feature vector including an element for the at least one trigger pair. The method further includes performing, with the processor, a training process for the neural network ranker using the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker, a plurality of output scores generated by the neural network ranker during the training process, and a plurality of target results based on predetermined edit distances between the plurality of training speech recognition results and the predetermined correct input for each training speech recognition result; and storing, with the processor, the trained neural network ranker in the memory after completion of the training process, for use in generating ranking scores for additional feature vectors corresponding to speech recognition results that are not present in the plurality of training speech recognition results.
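The training embodiment pairs each training result's feature vector with a target derived from the edit distance to the predetermined correct transcript. The sketch below is an assumption-laden illustration: `train_ranker` fits a one-layer linear model by gradient descent as a stand-in for the patent's neural network training, and the mapping from edit distance to a target value is invented.

```python
import numpy as np

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between a hypothesis and the
    predetermined correct input (reference transcript)."""
    A, B = hyp.split(), ref.split()
    d = np.zeros((len(A) + 1, len(B) + 1), dtype=int)
    d[:, 0] = np.arange(len(A) + 1)
    d[0, :] = np.arange(len(B) + 1)
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + (A[i - 1] != B[j - 1]))  # substitution
    return int(d[len(A), len(B)])

def train_ranker(feats, hyps, refs, lr=0.1, epochs=200):
    """Fit a one-layer ranker by least squares so its output score tracks a
    target derived from edit distance (lower distance -> higher target)."""
    targets = np.array([1.0 / (1.0 + edit_distance(h, r))
                        for h, r in zip(hyps, refs)])
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(epochs):
        err = feats @ w + b - targets
        w -= lr * feats.T @ err / len(err)
        b -= lr * err.mean()
    return w, b
```

After training, the ranker should score a feature vector from an accurate hypothesis above one from a garbled hypothesis of the same utterance.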
In another embodiment, an automated speech recognition system has been developed. The system includes a memory and a controller operatively connected to the memory. The memory is configured to store a plurality of predetermined trigger pairs, each trigger pair including two words, and a neural network configured to generate ranking scores corresponding to a plurality of candidate speech recognition results. The controller is configured to generate a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in the plurality of candidate speech recognition results, including a first feature vector for a first candidate speech recognition result in the plurality of candidate speech recognition results. The controller is further configured to: identify, with reference to the plurality of predetermined trigger pairs stored in the memory, at least one trigger pair in the first candidate speech recognition result, the at least one trigger pair including two predetermined trigger words; and generate the first feature vector including an element for the at least one trigger pair. The controller is further configured to provide the plurality of feature vectors as inputs to the neural network, to generate, based on an output layer of the neural network, a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results, and to operate the automated system using as input the candidate speech recognition result in the plurality of candidate speech recognition results that corresponds to the highest ranking score in the plurality of ranking scores.
Brief description of the drawings
Fig. 1 is a schematic view of the components of an automated system that receives speech input commands from a user, embodied as an in-vehicle information system in the passenger compartment of a vehicle.
Fig. 2 is a block diagram of a process for generating ranking scores for multiple candidate speech recognition results using a neural network ranker during a speech recognition process.
Fig. 3 is a schematic view of a computing system that performs a training process to generate the trained neural network ranker of Fig. 1 and Fig. 2.
Fig. 4 is a block diagram of a process for generating a trained neural network ranker.
Fig. 5 is a diagram depicting the structure of a feature vector generated from a speech recognition result and the structure of a neural network ranker.
Detailed description
For the purpose of promoting an understanding of the principles of the embodiments described herein, reference is now made to the drawings and to the descriptions in the following written specification. No limitation to the scope of the subject matter is intended by these references. The present disclosure also includes any alterations and modifications to the illustrated embodiments, and includes further applications of the principles of the disclosed embodiments as would normally occur to one skilled in the art to which this disclosure pertains.
As used herein, the term "speech recognition engine" refers to a data model and executable program code that enable a computerized system to identify spoken words from an operator based on recorded audio input data of the spoken words received via a microphone or other audio input device. Speech recognition systems typically include a lower-level acoustic model that recognizes the individual sounds of human speech in a sound recording and a higher-level language model that recognizes words and sentences based on sequences of the sounds from the acoustic model for a predetermined language. Speech recognition engines known to the art typically implement one or more statistical models, such as a hidden Markov model (HMM), a support vector machine (SVM), a trained neural network, or another statistical model that generates statistical predictions for recorded human speech using a plurality of trained parameters applied to feature vectors that correspond to the input data for the human speech. The speech recognition engine generates the feature vectors using, for example, various signal processing techniques known to the art that extract properties ("features") of the recorded speech signal and organize the features into one-dimensional or multi-dimensional vectors that can be processed using the statistical model to identify various parts of speech, including individual words and sentences. Speech recognition engines can produce results for speech inputs corresponding to individually spoken phonemes and to more complex patterns of sound, including spoken words and sentences that include sequences of related words.
As used herein, the term "speech recognition result" refers to a machine-readable output that a speech recognition engine generates for a given input. The result can be, for example, text encoded in a machine-readable format, or another set of encoded data that serves as input to control the operation of an automated system. Because of the statistical nature of speech recognition engines, in some configurations a speech engine generates multiple potential speech recognition results for a single input. The speech engine also generates a "confidence score" for each speech recognition result, where the confidence score is a statistical estimate of the likelihood that the speech recognition result is accurate, based on the trained statistical model of the speech recognition engine. As described in more detail below, a hybrid speech recognition system uses the speech recognition results produced by multiple speech recognition engines to generate additional hybrid speech recognition results, and ultimately generates at least one output speech recognition result based on the multiple previously generated speech recognition results. As used herein, the term "candidate speech recognition result," or more simply "candidate result," refers to a speech recognition result that is a candidate to be the final speech recognition result from the hybrid speech recognition system, which generates a plurality of candidate results and selects only a subset (typically one) as the final speech recognition result. In various embodiments, the candidate speech recognition results include speech recognition results from both general-purpose and domain-specific speech recognition engines, as well as hybrid speech recognition results that the system 100 generates using words from multiple candidate speech recognition results.
As used herein, the term "general-purpose speech recognition engine" refers to a type of speech recognition engine that is trained to recognize a broad range of speech from a natural human language such as English or Chinese. General-purpose speech recognition engines generate speech recognition results based on a broad vocabulary of words and a language model that is trained to cover language patterns in the natural language broadly. As used herein, the term "domain-specific speech recognition engine" refers to a type of speech recognition engine that is trained to recognize speech inputs in a specific field of use, or "domain," which often includes a somewhat different vocabulary and potentially different expected grammatical structures than the broader natural language. The vocabulary for a specific domain typically includes certain terms from the broader natural language but may include a narrower overall vocabulary, and in some instances includes specialized terms that are not officially recognized as official words in the natural language but that are well known within the particular domain. For example, in a navigation application, a domain-specific speech recognition engine recognizes terms for roads, towns, or other geographic designations that are not typically recognized as proper names in the more general language. In other configurations, a particular domain uses a particular set of jargon terms that is useful for the specific domain but may not be well recognized in the broader language. For example, aviators officially use English as a language of communication, but also employ numerous domain-specific jargon words and other abbreviations that are not part of standard English.
As used herein, the term "trigger pair" refers to two words, each of which can be either a word (e.g., "play") or a predetermined class (e.g., <song name>) that represents a sequence of words falling within the predetermined class (e.g., "Poker Face"), such as the proper name of a song, a person, a location name, or the like. When the words of a trigger pair A → B appear in a specific order within the spoken-text content of a speech recognition result, there is a high level of correlation between observing word A in the audio input data and the subsequent occurrence of word B later in the same input. As described in further detail below, after a set of trigger pairs has been identified via a training process, the trigger word pairs that occur in the text of a candidate speech recognition result form a portion of the feature vector for each candidate result, and the ranking process uses that portion to rank the different candidate speech recognition results.
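The A → B ordering, including class members such as <song name>, can be illustrated as follows. This is a hedged sketch: the song titles, trigger pairs, and the simplification of matching only first occurrences are all invented for the example.

```python
# Hypothetical class members and trigger pairs; the second member of the
# first pair is a class token standing for any title in SONG_TITLES.
SONG_TITLES = {"poker face", "daydream"}
TRIGGER_PAIRS = [("play", "<song_name>"), ("navigate", "street")]

def tokenize_with_classes(text):
    """Lowercase the text and replace any known song title with the
    class token <song_name>."""
    text = text.lower()
    for title in SONG_TITLES:
        text = text.replace(title, "<song_name>")
    return text.split()

def trigger_pairs_in(text):
    """Return the trigger pairs whose first member occurs before the
    second member in the recognition result (first occurrences only)."""
    words = tokenize_with_classes(text)
    return [(a, b) for a, b in TRIGGER_PAIRS
            if a in words and b in words
            and words.index(a) < words.index(b)]
```

Note that intermediate words between the two members do not matter, matching the definition above that the pair may be separated by any number of words.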
Inference System and Ranking Process Using a Trained Neural Network Ranker
Fig. 1 depicts an in-vehicle information system 100 that includes a head-up display (HUD) 120, one or more console LCD panels 124, one or more input microphones 128, and one or more output speakers 132. The LCD display 124 and the HUD 120 generate visual output responses from the system 100 based, at least in part, on speech input commands that the system 100 receives from an operator or other occupant of the vehicle. A controller 148 is operatively connected to each of the components in the in-vehicle information system 100. In some embodiments, the controller 148 connects to or incorporates additional components, such as a global positioning system (GPS) receiver 152 and a wireless network device 154, to provide navigation and communication with external data networks and computing devices.

In some operating modes, the in-vehicle information system 100 operates independently, while in other operating modes the in-vehicle information system 100 interacts with a mobile electronic device, such as a smartphone 170, tablet, notebook computer, or other electronic device. The in-vehicle information system communicates with the smartphone 170 using a wired interface, such as USB, or a wireless interface, such as Bluetooth. The in-vehicle information system 100 provides a speech recognition user interface that enables the operator to control the smartphone 170 or another mobile electronic communication device using spoken commands that reduce distraction while operating the vehicle. For example, the in-vehicle information system 100 provides a speech interface that enables the vehicle operator to make phone calls or send text messages with the smartphone 170 without requiring the operator to hold or look at the smartphone 170. In some embodiments, the smartphone 170 includes various devices, such as GPS and wireless networking devices, that complement or replace the functionality of devices housed in the vehicle.
The microphone 128 generates audio data from spoken input received from the vehicle operator or another vehicle passenger. The controller 148 includes hardware, such as a DSP, that processes the audio data, and software components that convert the input signals from the microphone 128 into audio input data. As set forth below, the controller 148 uses at least one general-purpose and at least one domain-specific speech recognition engine to generate candidate speech recognition results based on the audio input data, and the controller 148 further uses a ranker to improve the accuracy of the final speech recognition result output. Additionally, the controller 148 includes hardware and software components that enable the generation of synthesized speech or other audio output through the speakers 132.
The in-vehicle information system 100 provides visual feedback to the vehicle operator using the LCD panel 124, the HUD 120 that is projected onto the windshield 102, and gauges, indicator lights, or additional LCD panels located in the dashboard 108. When the vehicle is in motion, the controller 148 optionally deactivates the LCD panel 124 or only displays a simplified output through the LCD panel 124 to reduce distraction to the vehicle operator. The controller 148 displays visual feedback using the HUD 120 to enable the operator to view the environment around the vehicle while receiving the visual feedback. The controller 148 typically displays simplified data on the HUD 120 in a region corresponding to the vehicle operator's peripheral vision to ensure that the vehicle operator has an unobstructed view of the road and the environment around the vehicle.
As described above, the HUD 120 displays visual information on a portion of the windshield 102. As used herein, the term "HUD" refers generically to a wide range of head-up display devices, including, but not limited to, combined head-up displays (CHUDs) that include a separate combiner element, and the like. In some embodiments, the HUD 120 displays monochromatic text and graphics, while other HUD embodiments include multi-color displays. While the HUD 120 is depicted as displaying on the windshield 102, in alternative embodiments a head-up unit is integrated with glasses, a helmet visor, or a reticle that the operator wears during operation.
The controller 148 includes one or more integrated circuits configured as one or a combination of a central processing unit (CPU), graphics processing unit (GPU), microcontroller, field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), digital signal processor (DSP), or any other suitable digital logic device. The controller 148 also includes a memory, such as a solid-state or magnetic data storage device, that stores programmed instructions for the operation of the in-vehicle information system 100.
During operation, the in-vehicle information system 100 receives input requests from multiple input devices, including speech input commands received through the microphone 128. In particular, the controller 148 receives audio input data corresponding to speech from a user via the microphone 128.
The controller 148 includes one or more integrated circuits configured as a central processing unit (CPU), microcontroller, field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), digital signal processor (DSP), or any other suitable digital logic device. The controller 148 is also operatively connected to a memory 160, which includes a non-volatile solid-state or magnetic data storage device and a volatile data storage device, such as random access memory (RAM), that stores programmed instructions for the operation of the in-vehicle information system 100. The memory 160 stores model data as well as executable program instruction code and data to implement multiple speech recognition engines 162, a feature extractor 164, and a deep neural network ranker 166. The speech recognition engines 162 are trained using a predetermined training process and are otherwise known to the art. While the embodiment of Fig. 1 includes elements of the system 100 stored within the memory 160 in a motor vehicle, in some embodiments an external computing device, such as a network-connected server, implements some or all of the features that are depicted in the system 100. Thus, those having skill in the art will recognize that any reference to the operation of the system 100 that includes the controller 148 and the memory 160 should further include the operations of server computing devices and other distributed computing components in alternative embodiments of the system 100.
In the embodiment of Fig. 1, the feature extractor 164 is configured to generate a feature vector with a plurality of numeric elements that correspond to the contents of each candidate speech recognition result, including speech recognition results generated by one of the speech recognition engines 162 and hybrid speech recognition results that combine words from two or more of the speech recognition engines 162. The feature extractor 164 generates a feature vector that includes elements for any one or a combination of the following features: (a) trigger pairs, (b) a confidence score, and (c) individual word-level features, including a bag-of-words with decay.
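A sketch of how such a feature vector might be assembled from the three listed feature families. The vocabulary, trigger pairs, and decay constant are invented for illustration, and this simple layout is an assumption rather than the patent's exact encoding.

```python
import numpy as np

VOCAB = ["play", "song", "call", "home", "street"]
TRIGGER_PAIRS = [("play", "song"), ("call", "home")]
DECAY = 0.9  # hypothetical per-position decay for the bag-of-words

def extract_features(result_text, confidence):
    """Concatenate (a) trigger-pair elements, (b) the confidence score,
    and (c) a bag-of-words with positional decay into one vector."""
    words = result_text.lower().split()
    # (a) trigger-pair elements: 1.0 when the pair occurs in order
    trig = [1.0 if a in words and b in words
                   and words.index(a) < words.index(b) else 0.0
            for a, b in TRIGGER_PAIRS]
    # (b) the engine confidence score as a single element
    conf = [confidence]
    # (c) bag-of-words with decay: later occurrences weigh less
    bag = np.zeros(len(VOCAB))
    for pos, w in enumerate(words):
        if w in VOCAB:
            bag[VOCAB.index(w)] += DECAY ** pos
    return np.concatenate([trig, conf, bag])
```

With this layout every candidate result maps to a fixed-length vector, which is what the neural network ranker requires as input.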
In the system 100, the trigger pairs stored in the feature extractor 164 each include a predetermined set of two words that have previously been identified as having a strong correlation in speech input sequences from a training corpus representing the structure of expected speech inputs. The first trigger word has a strong statistical likelihood of being followed by the second trigger word in a speech input, although the two words may be separated by an unknown number of intermediate words in different speech inputs. Thus, if a speech recognition result includes the trigger words, the likelihood that those trigger words are accurate in the speech recognition result is comparatively high due to the statistical correlation between the first and second trigger words. In the system 100, the trigger words are generated based on mutual information scores using statistical methods known to the art. The memory 160 stores a predetermined set of N trigger pairs, each corresponding to a feature vector element, where the trigger pairs are selected from the set of trigger words with the highest mutual information scores, that is, the highest levels of correlation between the first word and the second word. As described below, the trigger word pairs provide the neural network ranker 166 with additional features of the speech recognition results, which enable the neural network ranker 166 to rank the speech recognition results using features that go beyond the individual words present in each speech recognition result.
The confidence score features correspond to the numeric confidence score values that the speech recognition engines 162 generate in conjunction with each candidate speech recognition result. For example, in one configuration, a numeric value in the range (0.0, 1.0) indicates the probabilistic confidence level that the speech recognition engine places in the accuracy of a particular candidate speech recognition result, from lowest confidence (0.0) to highest confidence (1.0). Each hybrid candidate speech recognition result that includes words from two or more of the speech recognition engines is assigned a confidence score that the controller 148 generates as the normalized average of the confidence scores of the candidate speech recognition results that are combined to produce the hybrid speech recognition result.
In the system 100, the controller 148 also normalizes and whitens the confidence score values for speech recognition results generated by different speech recognition engines to generate final feature vector elements that include normalized and whitened confidence scores that are uniform across the outputs of the multiple speech recognition engines 162. The controller 148 uses a normalization process to normalize the confidence scores from the different speech recognition engines, and then whitens the normalized confidence score values using prior-art whitening techniques based on the mean and variance estimated from the training data. In one embodiment, the controller 148 normalizes the confidence scores between different speech recognition engines using a linear regression process. The controller 148 first divides the range of confidence scores into a predetermined number of subdivisions, or "bins," such as twenty unique bins for two speech recognition engines A and B. The controller 148 then identifies, during a prior training process, the actual accuracy rates of the various speech recognition results corresponding to each score bin, based on the observed speech recognition results and the actual underlying inputs used in the process 200. The controller 148 performs a clustering operation on the confidence scores within a predetermined value window around the "edges" that separate the bins of each group of results from the different speech recognition engines, and identifies an average accuracy score corresponding to each edge confidence score value. The "edge" confidence scores are evenly distributed along the confidence score range of each speech recognition engine and provide a predetermined number of comparison points for performing a linear regression that maps the confidence scores of the first speech recognition engine to the confidence scores of another speech recognition engine with similar accuracy rates.
The controller 148 uses the accuracy data identified for each edge score to perform a linear regression mapping that enables the controller 148 to convert a confidence score from a first speech recognition engine into another confidence score value corresponding to an equivalent confidence score from a second speech recognition engine. The mapping of one confidence score from a first speech recognition engine to another confidence score from another speech recognition engine is also referred to as a score alignment process, and in some embodiments the controller 148 determines the alignment between the confidence scores of the first speech recognition engine and the second speech recognition engine by using the following equation:

x' = e'_i + ((x − e_i) / (e_{i+1} − e_i)) · (e'_{i+1} − e'_i)

where x is a score from the first speech recognition engine, x' is the equivalent value within the confidence score range of the second speech recognition engine, the values e_i and e_{i+1} correspond to the estimated accuracy scores at the different edge values closest to the value x of the first speech recognition engine (e.g., the estimated accuracy scores at the edge values 20 and 25 around a confidence score of 22), and the values e'_i and e'_{i+1} correspond to the estimated accuracy scores at the same relative edge values of the second speech recognition engine.

In some embodiments, the controller 148 stores the results of the linear regression in the feature extractor 164 in the memory 160, as a lookup table or other suitable data structure, to enable efficient normalization of the confidence scores between the different speech recognition engines 162 without regenerating the linear regression for each comparison.
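The alignment described above can be sketched as a small piecewise-linear mapping. This is a hedged illustration rather than the patented implementation: the specific edge values, the accuracy arrays, and the assumption that each engine's estimated accuracy is monotone in its confidence score are all hypothetical.

```python
from bisect import bisect_right

def interp(x, xs, ys):
    """Piecewise-linear interpolation of x over the points (xs, ys)."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

def align_confidence(x, edges_a, acc_a, edges_b, acc_b):
    """Map a confidence score x from engine A onto engine B's scale by
    matching estimated accuracy: confidence -> accuracy on A's edge curve,
    then accuracy -> confidence on B's curve (acc_b assumed increasing)."""
    accuracy = interp(x, edges_a, acc_a)
    return interp(accuracy, acc_b, edges_b)
```

For example, a score of 50 from an engine whose accuracy reaches 0.5 at that point maps to the score at which the second engine also reaches 0.5 accuracy.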
The controller 148 also uses the feature extractor 164 to identify word-level features in the candidate speech recognition results. The word-level features correspond to data that the controller 148 places in elements of the feature vector, where each element corresponds to a characteristic of an individual word in the candidate speech recognition result. In one embodiment, the controller 148 identifies only the presence or absence of words from a predetermined vocabulary, where each word corresponds to a separate element of the predetermined feature vector for each candidate speech recognition result. For example, if the word "street" occurs at least once in the candidate speech recognition result, the controller 148 sets the value of the corresponding element in the feature vector to 1 during the feature extraction process. In another embodiment, the controller 148 identifies the frequency of each word, where "frequency" as used herein refers to the number of times the word occurs in the candidate speech recognition result. The controller 148 places the number of occurrences of the word in the corresponding element of the feature vector.
In still another embodiment, the feature extractor 164 generates a "bag-of-words with decay" feature for the feature vector element corresponding to each word in the predetermined vocabulary. As used herein, the term "bag-of-words with decay" refers to a numeric score that the controller 148 assigns to each word in the predetermined vocabulary, considering the candidate speech recognition result and based on the frequency and positions of the word's occurrences in the result. The controller 148 generates a bag-of-words with decay score for each word of the candidate speech recognition result that is in the predetermined vocabulary, and assigns a bag-of-words with decay score of zero to those words in the vocabulary that do not appear in the candidate result. In some embodiments, the predetermined vocabulary includes a special entry that represents any out-of-vocabulary word, and the controller 148 also generates a single bag-of-words with decay score for the special entry based on all of the out-of-vocabulary words in the candidate result. For a given word w_i in the predetermined vocabulary, the bag-of-words with decay score is:

BOW(w_i) = Σ_{p ∈ P_i} d^p

where P_i is the set of positions at which the word w_i occurs in the candidate speech recognition result, and the term d is a predetermined decay factor in the range (0, 1.0), which is set to 0.9 in one illustrative embodiment of the system 100.
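The score above can be computed in a few lines. In this sketch the position index starting at 0 and the "<OOV>" entry name are assumptions made for illustration; the patent does not specify either.

```python
def decayed_bow(words, vocab, d=0.9):
    """Bag-of-words with decay: for each vocabulary word, sum d**p over the
    positions p at which the word occurs in the recognition result (absent
    words keep a score of zero). A single '<OOV>' entry accumulates the
    score of all out-of-vocabulary words."""
    scores = {w: 0.0 for w in vocab}
    scores["<OOV>"] = 0.0
    for p, w in enumerate(words):
        key = w if w in scores else "<OOV>"
        scores[key] += d ** p
    return scores
```

For the result "the cat sat on the mat" with vocabulary {"the", "cat"}, the word "the" scores 0.9^0 + 0.9^4 and all out-of-vocabulary words pool into the "<OOV>" entry.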
FIG. 5 depicts an example of the structure of the feature vector 500 in more detail. The feature vector 500 includes multiple elements corresponding to the trigger-pair features 504, a confidence score element 508, and multiple additional elements corresponding to the word-level features 512, which are depicted in FIG. 5 as a bag-of-words with decay. In the feature vector 500, the trigger-pair features 504 include one element for each predetermined trigger pair, where a value of "0" indicates that the trigger pair is not present in the candidate speech recognition result and a value of "1" indicates that the trigger pair is present in the candidate speech recognition result. The confidence score element 508 is a single element that includes the numeric confidence score value generated by the corresponding speech recognition engine 162, or by the combination of speech recognition engines for a hybrid speech recognition result. The word-level feature elements 512 include an array of elements that each correspond to a particular word in the predetermined vocabulary. For example, in one embodiment the predetermined vocabulary for a language (such as English or Chinese) includes words that are each mapped to one of the word-level elements 512. In another embodiment that is described in more detail below, a training process generates the vocabulary of words based on the frequency of occurrence of words in a large set of training data, where the words that appear in the training data with the highest frequency (e.g., the 90% of words with the highest frequency) are mapped to the word-level elements 512 in the structure of the feature vector 500.
The precise ordering of the feature vector elements depicted in the feature vector 500 is not a requirement for representing the trigger-pair, confidence score, and word-level features. Instead, any ordering of the elements in the feature vector 500 is valid as long as the feature vectors for all of the candidate speech recognition results are generated with a consistent structure, in which each element represents the same trigger pair, confidence score, or word-level feature across all of the candidate speech recognition results.
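The consistent-layout requirement can be sketched by assembling one candidate's vector in a fixed order; treating a trigger pair as present whenever both of its words occur in the result is a simplifying assumption for illustration, and the layout shown is just one valid ordering.

```python
def build_feature_vector(words, confidence, trigger_pairs, vocab, d=0.9):
    """Assemble one candidate's feature vector in a fixed layout:
    [trigger-pair indicators | confidence | decayed bag-of-words | OOV].
    The same layout must be used for every candidate so that position i
    always encodes the same feature."""
    triggers = [1.0 if a in words and b in words else 0.0
                for (a, b) in trigger_pairs]
    bow = {w: 0.0 for w in vocab}
    oov = 0.0
    for p, w in enumerate(words):
        if w in bow:
            bow[w] += d ** p
        else:
            oov += d ** p
    return triggers + [confidence] + [bow[w] for w in vocab] + [oov]
```

Every candidate result for the same utterance is passed through this one function, which is what guarantees that element i has the same meaning in every vector.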
Referring again to FIG. 1, in the embodiment of FIG. 1 the neural network ranker 166 is a trained neural network that includes an input layer of neurons that receives multiple feature vectors corresponding to a predetermined number of candidate speech recognition results, and an output layer of neurons that generates a ranking score corresponding to each of the input feature vectors. In general, a neural network includes multiple nodes that are referred to as "neurons". Each neuron receives at least one input value; applies predetermined weighting factors to the input values, where different input values typically receive different weighting factors; and generates an output as the sum of the weighted inputs, in some embodiments with an optional bias factor that is added to the sum. A training process, described in more detail below, generates the precise weighting factor for each input and the optional bias in each neuron. The output layer of the neural network includes another set of neurons that are specifically configured with an "activation function" during the training process. The activation function is, for example, a sigmoid function or another threshold function that generates an output value based on the inputs from the final hidden layer of neurons in the neural network, where the precise parameters of the sigmoid function or threshold are generated during the training process of the neural network.
In the specific configuration of FIG. 1, the neural network ranker 166 is a feedforward deep neural network, and FIG. 5 includes an illustrative depiction of a feedforward deep neural network 550. As is known in the art, a feedforward neural network includes layers of neurons that are connected in a single direction from the input layer (layer 554) to the output layer (layer 566), without any recurrence or "feedback" loops that connect neurons in one layer of the neural network to neurons in a preceding layer of the neural network. A deep neural network includes at least one "hidden layer" (and typically more than one hidden layer) of neurons that is not exposed as the input layer or the output layer. In the neural network 550, k hidden layers 562 connect the input layer 554 to the output layer 566.
In the embodiment of the neural network 550, the input layer further includes a projection layer 558 that applies predetermined matrix transformations to selected sets of the input feature vector elements, including two different projection matrices for the trigger-pair elements 504 and the word-level feature elements 512, respectively. The projection layer 558 generates a reduced representation of the output of the input neurons in the input layer 554, because in most practical inputs the feature vector elements for the trigger pairs 504 and the word-level features 512 are "sparse", which means that each candidate speech recognition result includes only a small number (if any) of the trigger pairs and only a small number of the words in the large overall vocabulary (e.g., 10,000 words) that are encoded in the structure of the feature vector 500. The transformations in the projection layer 558 enable the remaining layers of the neural network 550 to include fewer neurons while still generating useful ranking scores for the feature vector inputs of the candidate speech recognition results. In one illustrative embodiment, the two projection matrices P_f for the trigger-pair features and P_w for the word-level features each project the corresponding input neurons into smaller vector spaces of 200 elements each, which produces a projected layer of 401 neurons for each of the n input feature vectors in the neural network ranker 166 (one neuron is reserved for the confidence score feature).
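The dimensionality reduction can be sketched with two projection matrices. In this sketch the matrix values are random placeholders (in the system they would be fixed during training), and the sizes T and V are illustrative assumptions; only the 200 + 1 + 200 = 401 output layout follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 500, 10_000                     # trigger pairs, vocabulary size (illustrative)
P_f = rng.normal(size=(T, 200))        # trigger-pair projection matrix
P_w = rng.normal(size=(V + 1, 200))    # word-level projection (incl. OOV slot)

def project(trigger_feats, confidence, word_feats):
    """Compress one sparse feature vector to the 401-unit projected layer:
    200 projected trigger dims, 1 raw confidence score, 200 projected
    word-level dims."""
    t = trigger_feats @ P_f            # (T,)   -> (200,)
    w = word_feats @ P_w               # (V+1,) -> (200,)
    return np.concatenate([t, [confidence], w])   # (401,)
```

Because the inputs are sparse, the matrix products touch only a handful of rows of P_f and P_w per candidate, which is what makes the reduced representation cheap.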
Although FIG. 5 depicts the neural network 550 with a total of n input slots for the feature vectors corresponding to n different candidate speech recognition results, the number of input neurons in the input layer 554 includes one neuron for each element in the feature vector of a candidate speech recognition result, or (2 + T + 0.9V) neurons in total, where T is the number of predetermined trigger pairs that are identified in the candidate speech recognition results and V is the number of words that occur in the identified vocabulary of words, with the coefficient 0.9 representing the filtering of the training set, as described above, to include only the 90% of words that occur with the highest frequency. The fixed value 2 represents one input neuron for the confidence score value and another input neuron that serves as a catch-all input for any word-level features that do not correspond to the predetermined word-level elements of the input feature vector, such as any out-of-vocabulary words that are not explicitly modeled in the neural network ranker 166. For example, the controller 148 uses the feature extractor 164 to generate a feature vector in which a bag-of-words with decay score is generated for any words in the candidate speech recognition result that are not aligned with elements in the predetermined structure of the feature vector. The element in the feature vector corresponding to the out-of-vocabulary words enables the neural network ranker 166 to incorporate the presence of any words that are not included in the predetermined vocabulary into the generation of the ranking score for any candidate speech recognition result that includes out-of-vocabulary words.
The output layer 566 includes fewer output neurons than the input layer 554. In particular, the output layer 566 includes n output neurons, where each output neuron generates a numeric ranking score corresponding to one of the n input feature vectors during an inference process, which in the specific configuration of the system 100 is a ranking process that generates ranking scores for the feature vectors corresponding to the multiple candidate speech recognition results. Some hardware embodiments of the controller 148 include one or more compute units in a GPU or other specialized hardware acceleration components to perform the inference process in a time- and power-efficient manner. In other embodiments, the system 100 further includes additional digital logic processing hardware that is incorporated into a remote server, which the controller 148 accesses by using the wireless communication device 154 and a data network. In some embodiments, the hardware in the remote server also implements a portion of the functionality of the multiple speech recognition engines 162. The server includes additional processing hardware to perform all or part of the feature extraction and neural network inference processing to generate the feature vectors and ranking scores for the multiple candidate speech recognition results.
During operation, the system 100 receives audio input data by using the microphone 128 and uses the multiple speech recognition engines 162 to generate multiple candidate speech recognition results, including hybrid speech recognition results that in some embodiments include words selected from two or more of the candidate speech recognition results. The controller 148 uses the feature extractor 164 to extract features from the candidate speech recognition results in order to generate a feature vector for each candidate speech recognition result, and provides the feature vectors to the neural network ranker 166 to generate an output score for each feature vector. The controller 148 then identifies the feature vector and candidate speech recognition result corresponding to the highest ranking score, and the controller 148 operates the automated system by using, as input, the candidate speech recognition result from the multiple candidate speech recognition results that corresponds to the highest ranking score among the multiple ranking scores.
FIG. 2 depicts a process 200 for performing speech recognition in which a candidate speech recognition result is selected by using multiple speech recognition engines and a neural network ranker. In the description below, a reference to the process 200 performing a function or action refers to the operation of a controller executing stored program instructions to perform the function or action in association with other components in the automated system. The process 200 is described in conjunction with the system 100 of FIG. 1 for illustrative purposes.
The process 200 begins as the system 100 generates multiple candidate speech recognition results by using the multiple speech recognition engines 162 (block 204). In the system 100, a user provides spoken audio input to an audio input device, such as the microphone 128. The controller 148 uses the multiple speech recognition engines 162 to generate multiple candidate speech recognition results. As described above, in some embodiments the controller 148 generates hybrid candidate speech recognition results by using selected words from the candidate speech recognition results of domain-specific speech recognition engines to replace selected words in the candidate speech recognition results of general-purpose speech recognition engines. The speech recognition engines 162 also generate the confidence score data that the system 100 uses during the feature vector generation in the process 200.
The process 200 continues as the system 100 performs feature extraction to generate multiple feature vectors that each correspond to one of the candidate speech recognition results (block 208). In the system 100, the controller 148 uses the feature extractor 164 to generate feature vectors that include one or more of the trigger-pair, confidence score, and word-level features described above, to produce feature vectors with the structure of the feature vector 500 of FIG. 5 or another similar structure for one or more of the trigger-pair, confidence score, and word-level features. In the embodiment of FIG. 2, the controller 148 generates the word-level features for the word-level feature elements of the feature vector by using the bag-of-words with decay metric.
The process 200 continues as the controller 148 provides the feature vectors for the multiple candidate speech recognition results to the neural network ranker 166 as inputs to an inference process that generates multiple ranking scores corresponding to the multiple candidate speech recognition results (block 212). In one embodiment, the controller 148 uses the trained feedforward deep neural network ranker 166 to generate the multiple ranking scores at the output layer neurons of the neural network by using an inference process. In another embodiment described above, the controller 148 uses the wireless communication device 154 to transmit the feature vector data, the candidate speech recognition results, or an encoded version of the recorded audio speech recognition data to an external server, where a processor in the server performs a portion of the process 200 to generate the ranking scores for the candidate speech recognition results.
In most instances, the controller 148 generates a number n of candidate speech recognition results and corresponding feature vectors that matches the predetermined number n of feature vector inputs that the neural network ranker 166 was configured to receive during the training process. In some instances, however, if the number of feature vectors for the candidate speech recognition results is less than the maximum number n, the controller 148 generates "empty" feature vector inputs with values of all zeros to ensure that all of the neurons in the input layer of the neural network ranker 166 receive an input. The controller 148 ignores the scores of the output layer neurons corresponding to each empty input, and the neural network in the ranker 166 generates scores for the non-empty feature vectors of the candidate speech recognition results.
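The padding step can be sketched as follows; the mask-based bookkeeping for which output scores to keep is an illustrative assumption about how the ignored slots might be tracked.

```python
def pad_candidates(feature_vectors, n, dim):
    """Pad the candidate list with all-zero 'empty' feature vectors so the
    ranker always receives exactly n inputs, and return a mask marking
    which slots hold real candidates (scores for the rest are ignored)."""
    real = len(feature_vectors)
    padded = feature_vectors + [[0.0] * dim for _ in range(n - real)]
    mask = [True] * real + [False] * (n - real)
    return padded, mask
```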
The process 200 continues as the controller 148 identifies the candidate speech recognition result corresponding to the highest ranking score in the output layer of the neural network ranker 166 (block 216). As described above in FIG. 5, each output neuron in the output layer 566 of the neural network 550 generates an output value corresponding to the ranking score of one of the input feature vectors that the system 100 supplies to the predetermined sets of input neurons in the input layer 554. The controller 148 identifies the candidate speech recognition result with the highest ranking score based on the index of the output neuron that generates the highest ranking score in the neural network 550.
Referring again to FIG. 2, the process 200 continues as the controller 148 uses the selected highest-ranked speech recognition result as input from the user to operate the automated system (block 220). In the in-vehicle information system 100 of FIG. 1, the controller 148 operates various systems, including, for example, a vehicle navigation system that uses the GPS 152, the wireless communication device 154, and the LCD display 124 or HUD 120 to perform vehicle navigation operations in response to the voice input from the user. In another configuration, the controller 148 plays music through the audio output device 132 in response to a voice command. In still another configuration, the system 100 uses the smartphone 170 or another network-connected device to place a hands-free phone call or transmit a text message based on the voice input from the user. Although FIG. 1 depicts an in-vehicle information system embodiment, other embodiments use automated systems that control the operation of various hardware components and software applications by using the audio input data.
Although FIG. 1 depicts the in-vehicle information system 100 as an illustrative example of an automated system that performs speech recognition to receive and execute commands from a user, a similar speech recognition process may be implemented in other contexts. For example, a mobile electronic device such as the smartphone 170 or another suitable device typically includes one or more microphones and a processor that can implement the speech recognition engines, the ranker, the stored trigger pairs, and the other components that implement the speech recognition and control systems. In another embodiment, a home automation system controls HVAC and appliances in a house by using at least one computing device that receives voice input from the user and performs speech recognition by using multiple speech recognition engines to control the operation of the various automated systems in the house. In each instance, the system is optionally configured to use different sets of domain-specific speech recognition engines that are tailored to the specific applications and operations of the different automated systems.
Training System and Training Process for the Neural Network Ranker
In the system 100 of FIG. 1 and the speech recognition process of FIG. 2, the neural network ranker 166 is a trained feedforward deep neural network. The neural network ranker 166 is trained prior to the operation of the system 100 for performing the speech recognition process described above. FIG. 3 depicts an illustrative embodiment of a computerized system 300 that is configured to train the neural network ranker 166, and FIG. 4 depicts a training process 400 for generating the trained neural network ranker 166.
The system 300 includes a processor 304 and a memory 320. The processor 304 includes, for example, one or more CPU cores that are optionally coupled to parallelized hardware accelerators that are designed to train neural networks in a time- and power-efficient manner. Examples of such accelerators include GPUs with compute shader units that are configured for neural network training, and specially programmed FPGA chips or ASIC hardware that is dedicated to training neural networks. In some embodiments, the processor 304 further includes a cluster of computing devices that operate in parallel to perform the neural network training process.
The memory 320 includes, for example, a non-volatile solid-state or magnetic data storage device and a volatile data storage device, such as random access memory (RAM), which stores programmed instructions for the operation of the system 300. In the configuration of FIG. 3, the memory 320 stores data corresponding to training input data 324, a stochastic gradient descent trainer 328 for the neural network, a neural network ranker 332, and the feature extractor 164.
The training data 324 includes, for example, a large set of speech recognition results that are produced by the same speech recognition engines 162 of the system 100 for a large predetermined set of inputs, optionally including hybrid speech recognition results. The training speech recognition result data also include the confidence scores for the training speech recognition results. For each speech recognition result, the training data also include a Levenshtein distance metric that quantifies the differences between the speech recognition result and predetermined ground-truth speech input training data, where the predetermined ground-truth speech input training data represents the canonically "correct" result in the training process. The Levenshtein distance metric is one example of an "edit distance" metric, since the metric quantifies the amount of changes (edits) that are necessary to transform the speech recognition result from the speech recognition engine into the actual input that is used in the training data. Both the speech recognition results and the ground-truth speech input training data are referred to as text "strings" in the comparison metric. For example, the edit distance quantifies the amount of changes that are required to convert the speech recognition result string "Sally shells sea sells by the seashore" into the corresponding correct ground-truth training data string "Sally sells sea shells by the seashore".
The Levenshtein distance metric is known to the art in other contexts, and has several properties, including: (1) the Levenshtein distance is always at least the difference of the sizes of the two strings; (2) the Levenshtein distance is at most the length of the longer string; (3) the Levenshtein distance is zero if and only if the strings are equal; (4) if the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance; and (5) the Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality). The Hamming distance, in turn, refers to a metric of the minimum number of substitutions that are required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. Although the system 300 includes training data that are encoded with the Levenshtein distance for illustrative purposes, in alternative embodiments another edit distance metric is used to describe the differences between the training speech recognition results and the corresponding ground-truth training inputs.
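The Levenshtein distance used as the edit-distance target can be computed with the standard dynamic program; a minimal sketch that works on word tokens (or on characters) is:

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b, counting insertions,
    deletions, and substitutions at a cost of 1 each, via the standard
    row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]
```

Applied to the document's example at the word level, converting "Sally shells sea sells by the seashore" into "Sally sells sea shells by the seashore" requires two substitutions.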
In the embodiment of FIG. 3, the feature extractor 164 in the memory 320 is the same feature extractor 164 that is used in the system 100 described above. In particular, the processor 304 uses the feature extractor 164 to generate a feature vector from each of the training speech recognition results by using one or more of the trigger-pair, confidence score, and word-level features described above.
The stochastic gradient descent trainer 328 includes the stored program instructions and parameter data for a neural network training process that the processor 304 executes to train the neural network ranker 332 based on the feature vectors that the feature extractor 164 generates from the training data 324. As is known in the art, a stochastic gradient descent trainer includes a class of related training processes that train a neural network in an iterative process by adjusting the parameters in the neural network to minimize the differences (errors) between the outputs of the neural network and a predetermined objective function, which is also referred to as a "target" function. Although stochastic gradient descent training is generally known to the art and is not discussed in greater detail herein, the system 300 modifies the standard prior-art training process. In particular, the training process seeks to use the training data as inputs to produce outputs from the neural network that minimize the errors between the neural network outputs and the target results from the predetermined training data. In prior-art training processes, the target value typically specifies that a given output is binary "correct" or "incorrect", and such a target output from a neural network ranker would provide a score indicating that the feature vector input for a training speech recognition result, when compared with the ground-truth input in the training data, is either 100% correct or is incorrect to some degree. In the system 300, however, the stochastic gradient descent trainer 328 uses the edit distance target data in the training data 324 as "soft" targets that more accurately reflect the level of correctness of the different training speech recognition results, where the level of correctness can include a continuous range of error levels that affect the ranking score, rather than being only completely correct or incorrect.
The processor 304 performs the training process by using the "soft" target data in the objective function with the stochastic gradient descent trainer 328. For example, the configuration of FIG. 3 uses a "softmax" objective function of the following form:

y_i = e^(−d_i) / Σ_{j=1}^{n} e^(−d_j)

where d_i is the edit distance for a given training speech recognition result i. During the training process, the gradient descent trainer 328 performs a cost minimization process, where the "cost" refers to the cross-entropy between the output values of the neural network ranker 332 and the target values generated by the objective function during each iteration of the training process. The processor 304 provides batches of samples to the gradient descent trainer 328 during the training process, such as batches of 180 training inputs that each include different training speech recognition results generated by the multiple speech recognition engines. The iterative process continues until the cross-entropy of the training set shows no improvement over ten iterations, and the trained neural network parameters that produce the minimum overall cross-entropy from all of the training data form the final trained neural network.
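The "soft" targets above follow directly from the softmax of the negative edit distances; a minimal illustration:

```python
import math

def soft_targets(distances):
    """Soft training targets from per-candidate edit distances d_i:
    y_i = exp(-d_i) / sum_j exp(-d_j), so an exact match (d = 0) receives
    the largest target and worse candidates decay smoothly toward zero."""
    weights = [math.exp(-d) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]
```

Unlike a binary correct/incorrect label, three candidates with distances 0, 1, and 2 all receive nonzero targets, graded by how close each is to the ground truth.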
During the training process, the processor 304 shuffles the same input feature vectors between different sets of the input neurons in the neural network ranker 332 during different iterations of the training process to ensure that the positions of particular feature vectors in the input layer of the neural network do not produce an incorrect bias in the trained neural network. As described above for the inference process, if a particular set of training data does not include a sufficient number of candidate speech recognition results to provide inputs to all of the neurons in the input layer of the neural network ranker 332, the processor 304 generates "empty" input feature vectors with zero inputs. As is known in the art, the stochastic gradient descent training process includes numeric training parameters, and in one configuration of the system 300 the hyperparameters of the stochastic gradient descent trainer 328 are α = 0.001, β1 = 0.9, and β2 = 0.999.
The neural network ranker 332 is, in one embodiment, a feedforward deep neural network having the structure of the neural network 550 depicted in FIG. 5. During operation, the processor 304 generates the structure of the untrained neural network ranker 332 with a predetermined number of neurons, based on the number of neurons in the input layer 554 of the neural network 550 of FIG. 5 and on the number of output neurons in the output layer 566 for the total of n candidate speech recognition results that are each provided as inputs to the neural network for the inference process. The processor 304 also generates a suitable number of neurons in the k hidden layers 562 of the neural network 550. In one embodiment, the processor 304 initializes the neural network structure with randomized weight values for each of the inputs to the neurons. As described above, the processor 304 adjusts, during the training process, the various weight and bias values for the neurons in the input layer 554 and hidden layers 562 of the neural network, along with the parameters of the activation functions in the neurons of the output layer 566, to minimize the cross-entropy of the output of the neural network ranker 332 compared to the target results for a given set of inputs.
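The construction of the untrained ranker described above, with an input layer, k hidden layers, an output layer producing one score per candidate result, and randomized initial weights, might look like the sketch below; the layer sizes, the ReLU choice for hidden layers, and the uniform initialization range are illustrative assumptions, not values from the patent.

```python
# Sketch of building an untrained feedforward ranker: an input layer,
# k hidden layers, and an output layer with one score per candidate
# result, initialized with randomized weights. All sizes illustrative.
import random

def build_network(n_inputs, hidden_sizes, n_candidates, seed=0):
    rng = random.Random(seed)
    sizes = [n_inputs] + list(hidden_sizes) + [n_candidates]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.05, 0.05) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        layers.append((weights, biases))
    return layers

def forward(layers, x):
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:             # hidden layers use ReLU here
            x = [max(0.0, v) for v in x]
    return x                                # one raw score per candidate
```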
While FIG. 3 depicts a specific configuration of a computerized device 300 that generates a trained neural network ranker, in some embodiments the same system that uses the trained neural network ranker in a speech recognition process is further configured to train the neural network ranker. For example, in some embodiments the controller 148 in the system 100 is an example of a processor that can be configured to perform the neural network training process.
FIG. 4 depicts a process 400 for generating a trained neural network ranker that is used to perform speech recognition by selecting among candidate speech recognition results produced by multiple speech recognition engines. In the description below, references to the process 400 performing a function or action refer to the operation of a processor executing stored program instructions to perform that function or action in association with other components of an automated system. The process 400 is described in conjunction with the system 300 of FIG. 3 for illustrative purposes.
The process 400 begins as the system 300 generates a plurality of feature vectors corresponding to the plurality of training speech recognition results that are stored in the training data 324 (block 404). In the system 300, the processor 304 generates the plurality of feature vectors using the feature extractor 164, where each feature vector corresponds to one training speech recognition result in the training data 324. As described above, in at least one embodiment of the process 400, the processor 304 generates each feature vector including one or more of: trigger pair features, a confidence score, and word-level features including bag-of-words with decay features.
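A minimal sketch combining the three feature families named above (trigger-pair features, a confidence score, and bag-of-words with decay) follows; the decay formula, a sum of γ raised to each occurrence position, is one common reading of "bag-of-words with decay" and, like all names here, is an illustrative assumption rather than the patent's exact definition.

```python
# Illustrative feature vector: trigger-pair indicator elements, then the
# engine's confidence score, then one bag-of-words-with-decay element per
# vocabulary word. The decay formula is an assumed common formulation.
def make_feature_vector(words, confidence, vocab, trigger_pairs, gamma=0.9):
    present = set(words)
    # 1.0 when both words of a predetermined trigger pair occur in the result
    trigger_feats = [1.0 if (a in present and b in present) else 0.0
                     for (a, b) in trigger_pairs]
    # sum of gamma**position over each occurrence of the vocabulary word
    bow_decay = [sum(gamma ** pos for pos, w in enumerate(words)
                     if w == target)
                 for target in vocab]
    return trigger_feats + [confidence] + bow_decay
```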
As part of the feature extraction and feature generation process, in some embodiments the processor 304 generates a structure for the feature vectors that includes specific elements that are mapped to the trigger pair features and the word-level features. For example, as described above in the system 100, in some embodiments the processor 304 generates feature vectors with a structure that corresponds to only a portion of the words observed in the training data 324, such as the 90% most commonly observed words, while the remaining 10% of words that occur with the lowest frequencies are not encoded into the structure of the feature vectors. The processor 304 optionally identifies the most common trigger pair features and generates the structure for the most commonly observed trigger word pairs that are present in the training data 324. In embodiments in which the system 300 generates the structure for the feature vectors during the process 400, the processor 304 stores the structure of the feature vectors with the feature extractor data 164, and after the training process is complete the structure of the feature vectors is provided, together with the neural network ranker 332, to the automated system, which uses feature vectors with the specified structure as inputs to the trained neural network to generate ranking scores for candidate speech recognition results. In other embodiments, the structure of the feature vectors is determined a priori based on a natural language, such as English or Chinese, rather than being based specifically on the contents of the training data 324.
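The data-driven structure selection described above, encoding only the most commonly observed words and leaving the low-frequency tail out of the feature vector, might look like the following sketch; the counting and cutoff details are assumptions for illustration.

```python
# Sketch of deriving the feature-vector word structure from training
# data: keep only the most commonly observed fraction of the vocabulary
# (e.g. the top 90% by frequency) and leave the rarest words unencoded.
# The 90% figure comes from the text; the mechanics are illustrative.
from collections import Counter

def build_vocab(training_results, keep_fraction=0.9):
    counts = Counter(w for result in training_results for w in result)
    ranked = [w for w, _ in counts.most_common()]
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]   # lowest-frequency tail is left unencoded
```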
The process 400 continues as the system 300 trains the neural network ranker 332 using the stochastic gradient descent trainer 328, based on the feature vectors of the training speech recognition results and the soft target edit distance data from the training data 324 (block 408). During the training process, the processor 304 uses the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker, and trains the neural network ranker 332 based on a cost minimization process between the plurality of output scores generated by the neural network ranker during the training process and the target results, where the target results have the soft scores described above that are based on the predetermined edit distances between the plurality of training speech recognition results and the predetermined correct input for each of the plurality of training speech recognition results. During the process 400, the processor 304 modifies the input weight coefficients and neuron bias values in the input and hidden layers of the neural network ranker 332, and adjusts the parameters of the activation functions in the neurons of the output layer, in an iterative manner using the stochastic gradient descent trainer 328.
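The soft targets and cross-entropy cost described above can be sketched as follows. The Levenshtein distance is the edit distance named in the claims; mapping the distances to a target distribution with a softmax over negative distances is an illustrative assumption, not a formula quoted from the patent.

```python
# Sketch of the soft-target construction: each training result's target
# weight is derived from its edit (Levenshtein) distance to the
# predetermined correct input, and the ranker's outputs are scored with
# cross-entropy against those targets. The softmax mapping is assumed.
import math

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def soft_targets(candidates, correct, temperature=1.0):
    dists = [levenshtein(c, correct) for c in candidates]
    exps = [math.exp(-d / temperature) for d in dists]
    total = sum(exps)
    return [e / total for e in exps]   # closer candidates get more mass

def cross_entropy(targets, outputs, eps=1e-12):
    return -sum(t * math.log(o + eps) for t, o in zip(targets, outputs))
```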
After the training process is complete, the processor 304 stores the structure of the trained neural network ranker 332 in the memory 320, and optionally stores the structure of the feature vectors in embodiments in which the feature vectors are generated based on the training data (block 412). The stored structure of the neural network ranker 332 and the feature vector structure are subsequently transmitted to other automated systems, such as the system 100 of FIG. 1, which use the trained neural network ranker 332 and the feature extractor 164 to rank multiple candidate speech recognition results during speech recognition operations.
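Once deployed, the ranking step in the receiving automated system reduces to scoring each candidate's feature vector with the trained ranker and acting on the highest-scoring candidate; `extract_features` and `ranker` below are hypothetical stand-ins for the trained components transferred from the system 300.

```python
# Sketch of the deployed ranking step: extract a feature vector per
# candidate result, score all candidates with the trained ranker, and
# return the highest-scoring one. Both callables are hypothetical
# stand-ins for the feature extractor 164 and the trained ranker 332.
def pick_best_result(candidates, extract_features, ranker):
    scores = [ranker(extract_features(c)) for c in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]
```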
It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems, applications, or methods. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.
Claims (17)
1. A method for speech recognition in an automated system, comprising:
generating, with a controller, a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in a plurality of candidate speech recognition results, the generating of a first feature vector in the plurality of feature vectors for a first candidate speech recognition result in the plurality of candidate speech recognition results further comprising:
identifying, with the controller, at least one trigger pair in the first candidate speech recognition result with reference to a plurality of predetermined trigger pairs stored in a memory, the at least one trigger pair including two predetermined trigger words; and
generating, with the controller, the first feature vector including an element for the at least one trigger pair;
providing, with the controller, the plurality of feature vectors as inputs to a neural network;
generating, with the controller, a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network; and
operating, with the controller, the automated system using the candidate speech recognition result in the plurality of candidate speech recognition results that corresponds to a highest ranking score in the plurality of ranking scores as an input.
2. The method of claim 1, the generating of each feature vector in the plurality of feature vectors further comprising:
generating, with the controller, each feature vector including an element for one confidence score in a plurality of confidence scores, each confidence score being associated with the candidate speech recognition result corresponding to each feature vector.
3. The method of claim 2, further comprising:
performing, with the controller, a linear regression process based on the plurality of confidence scores to generate a plurality of normalized confidence scores for the plurality of feature vectors, the plurality of normalized confidence scores being based on the confidence score of one predetermined candidate speech recognition result in the plurality of speech recognition results.
4. The method of claim 1, the generating of the first feature vector further comprising:
identifying, with the controller, a plurality of unique words in the first candidate speech recognition result, including a frequency of occurrence of each unique word in the plurality of words and at least one position of each unique word within the first candidate speech recognition result;
generating, with the controller, a plurality of bag-of-words with decay parameters, each bag-of-words with decay parameter corresponding to one unique word in the plurality of unique words based on the frequency and the at least one position of the one unique word and a predetermined decay parameter; and
generating, with the controller, the first feature vector including an element for each bag-of-words with decay parameter in the plurality of bag-of-words with decay parameters.
5. The method of claim 1, the providing of the plurality of feature vectors to the neural network further comprising:
providing, with the controller, the plurality of feature vectors as inputs to a feedforward deep neural network.
6. The method of claim 1, further comprising:
generating, with an audio input device, audio input data corresponding to a speech input from a user; and
generating, with the controller, the plurality of candidate speech recognition results corresponding to the audio input data using a plurality of speech recognition engines.
7. A method for training a neural network ranker, comprising:
generating, with a processor, a plurality of feature vectors, each feature vector corresponding to one training speech recognition result in a plurality of training speech recognition results stored in a memory, the generating of a first feature vector in the plurality of feature vectors for a first training speech recognition result in the plurality of training speech recognition results further comprising:
identifying, with the processor, at least one trigger pair in the first training speech recognition result with reference to a plurality of predetermined trigger pairs stored in the memory, the at least one trigger pair including two predetermined trigger words; and
generating, with the processor, the first feature vector including an element for the at least one trigger pair;
performing, with the processor, a training process for the neural network ranker using the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker, a plurality of output scores generated by the neural network ranker during the training process, and a plurality of target results based on predetermined edit distances between the plurality of training speech recognition results and a predetermined correct input for each training speech recognition result in the plurality of speech recognition results; and
storing, with the processor, the neural network ranker in the memory after the training process is complete, for use in generating ranking scores for additional feature vectors corresponding to speech recognition results that are not present in the plurality of training speech recognition results.
8. The method of claim 7, the generating of the first feature vector further comprising:
generating, with the processor, the feature vector including an element for a confidence score associated with the first training speech recognition result.
9. The method of claim 7, the generating of the first feature vector further comprising:
identifying, with the processor, a plurality of unique words in the first training speech recognition result, including a frequency of occurrence of each unique word in the plurality of words and at least one position of each unique word within the first training speech recognition result;
generating, with the processor, a plurality of bag-of-words with decay parameters, each bag-of-words with decay parameter corresponding to one unique word in the plurality of unique words based on the frequency and the at least one position of the one unique word and a predetermined decay parameter; and
generating, with the processor, the first feature vector including an element for each bag-of-words with decay parameter in the plurality of bag-of-words with decay parameters.
10. The method of claim 7, the training process further comprising:
generating, with the processor, the trained neural network using a stochastic gradient descent training process.
11. The method of claim 7, the training further comprising:
performing, with the processor, the training process for the neural network ranker, the plurality of target results being based on Levenshtein distances between the plurality of training speech recognition results and the predetermined correct input for each training speech recognition result in the plurality of speech recognition results.
12. A system for automated speech recognition, comprising:
a memory configured to store:
a plurality of predetermined trigger pairs, each trigger pair including two words; and
a neural network configured to generate ranking scores corresponding to a plurality of candidate speech recognition results; and
a controller operatively connected to the memory, the controller being configured to:
generate a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in the plurality of candidate speech recognition results, the controller being further configured, in the generation of a first feature vector in the plurality of feature vectors for a first candidate speech recognition result in the plurality of candidate speech recognition results, to:
identify at least one trigger pair in the first candidate speech recognition result with reference to the plurality of predetermined trigger pairs stored in the memory, the at least one trigger pair including two predetermined trigger words; and
generate the first feature vector including an element for the at least one trigger pair;
provide the plurality of feature vectors as inputs to the neural network;
generate a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network; and
operate the automated system using the candidate speech recognition result in the plurality of candidate speech recognition results that corresponds to a highest ranking score in the plurality of ranking scores as an input.
13. The system of claim 12, the controller being further configured to:
generate each feature vector including an element for one confidence score in a plurality of confidence scores, each confidence score being associated with the candidate speech recognition result corresponding to each feature vector.
14. The system of claim 13, the controller being further configured to:
perform a linear regression process based on the plurality of confidence scores to generate a plurality of normalized confidence scores for the plurality of feature vectors, the plurality of normalized confidence scores being based on the confidence score of one predetermined candidate speech recognition result in the plurality of speech recognition results.
15. The system of claim 12, the controller being further configured to:
identify a plurality of unique words in the first candidate speech recognition result, including a frequency of occurrence of each unique word in the plurality of words and at least one position of each unique word within the first candidate speech recognition result;
generate a plurality of bag-of-words with decay parameters, each bag-of-words with decay parameter corresponding to one unique word in the plurality of unique words based on the frequency and the at least one position of the one unique word and a predetermined decay parameter; and
generate the first feature vector including an element for each bag-of-words with decay parameter in the plurality of bag-of-words with decay parameters.
16. The system of claim 12, wherein the neural network in the memory is a feedforward deep neural network, the controller being further configured to:
provide the plurality of feature vectors as inputs to the feedforward deep neural network.
17. The system of claim 12, further comprising:
an audio input device; and
wherein the controller is operatively connected to the audio input device and is further configured to:
generate, with the audio input device, audio input data corresponding to a speech input from a user; and
generate the plurality of candidate speech recognition results corresponding to the audio input data using a plurality of speech recognition engines.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/353767 | 2016-11-17 | ||
US15/353,767 US10170110B2 (en) | 2016-11-17 | 2016-11-17 | System and method for ranking of hybrid speech recognition results with neural networks |
PCT/EP2017/079272 WO2018091501A1 (en) | 2016-11-17 | 2017-11-15 | System and method for ranking of hybrid speech recognition results with neural networks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109923608A true CN109923608A (en) | 2019-06-21 |
CN109923608B CN109923608B (en) | 2023-08-01 |
Family
ID=60327326
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201780070915.1A Active CN109923608B (en) | 2016-11-17 | 2017-11-15 | System and method for ranking mixed speech recognition results using neural networks |
Country Status (5)
Country | Link |
---|---|
US (1) | US10170110B2 (en) |
JP (1) | JP6743300B2 (en) |
CN (1) | CN109923608B (en) |
DE (1) | DE112017004397B4 (en) |
WO (1) | WO2018091501A1 (en) |
CN114341979A (en) * | 2019-05-14 | 2022-04-12 | Dolby Laboratories Licensing Corporation | Method and apparatus for voice source separation based on convolutional neural network |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
DK180129B1 (en) | 2019-05-31 | 2020-06-02 | Apple Inc. | User activity shortcut suggestions |
DK201970510A1 (en) | 2019-05-31 | 2021-02-11 | Apple Inc | Voice identification in digital assistant systems |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11227599B2 (en) | 2019-06-01 | 2022-01-18 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11397742B2 (en) | 2019-06-21 | 2022-07-26 | Microsoft Technology Licensing, Llc | Rescaling layer in neural network |
US11163845B2 (en) | 2019-06-21 | 2021-11-02 | Microsoft Technology Licensing, Llc | Position debiasing using inverse propensity weight in machine-learned model |
US11204973B2 (en) | 2019-06-21 | 2021-12-21 | Microsoft Technology Licensing, Llc | Two-stage training with non-randomized and randomized data |
US11204968B2 (en) * | 2019-06-21 | 2021-12-21 | Microsoft Technology Licensing, Llc | Embedding layer in neural network for ranking candidates |
KR20210010133A (en) * | 2019-07-19 | 2021-01-27 | Samsung Electronics Co., Ltd. | Speech recognition method, learning method for speech recognition and apparatus thereof |
KR20210030160A (en) * | 2019-09-09 | 2021-03-17 | Samsung Electronics Co., Ltd. | Electronic apparatus and control method thereof |
WO2021056255A1 (en) | 2019-09-25 | 2021-04-01 | Apple Inc. | Text detection using global geometry estimators |
DE102019214713A1 (en) * | 2019-09-26 | 2021-04-01 | Zf Friedrichshafen Ag | System for the automated actuation of a vehicle door, vehicle and method |
US11189286B2 (en) | 2019-10-22 | 2021-11-30 | Sonos, Inc. | VAS toggle based on device orientation |
US11437026B1 (en) * | 2019-11-04 | 2022-09-06 | Amazon Technologies, Inc. | Personalized alternate utterance generation |
US11200900B2 (en) | 2019-12-20 | 2021-12-14 | Sonos, Inc. | Offline voice control |
US11562740B2 (en) | 2020-01-07 | 2023-01-24 | Sonos, Inc. | Voice verification for media playback |
US11308958B2 (en) | 2020-02-07 | 2022-04-19 | Sonos, Inc. | Localized wakeword verification |
US11494593B2 (en) * | 2020-03-18 | 2022-11-08 | Walmart Apollo, Llc | Methods and apparatus for machine learning model hyperparameter optimization |
US11688219B2 (en) * | 2020-04-17 | 2023-06-27 | Johnson Controls Tyco IP Holdings LLP | Systems and methods for access control using multi-factor validation |
KR20210136463A (en) * | 2020-05-07 | 2021-11-17 | Samsung Electronics Co., Ltd. | Electronic apparatus and controlling method thereof |
US11061543B1 (en) | 2020-05-11 | 2021-07-13 | Apple Inc. | Providing relevant data items based on context |
US11038934B1 (en) | 2020-05-11 | 2021-06-15 | Apple Inc. | Digital assistant hardware abstraction |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11482224B2 (en) | 2020-05-20 | 2022-10-25 | Sonos, Inc. | Command keywords with input detection windowing |
CN113486924A (en) * | 2020-06-03 | 2021-10-08 | Google LLC | Object-centric learning with slot attention |
US11490204B2 (en) | 2020-07-20 | 2022-11-01 | Apple Inc. | Multi-device audio adjustment coordination |
US11438683B2 (en) | 2020-07-21 | 2022-09-06 | Apple Inc. | User identification using headphones |
US11829720B2 (en) | 2020-09-01 | 2023-11-28 | Apple Inc. | Analysis and validation of language models |
CN112466280B (en) * | 2020-12-01 | 2021-12-24 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Voice interaction method and device, electronic equipment and readable storage medium |
WO2022203701A1 (en) * | 2021-03-23 | 2022-09-29 | Google Llc | Recurrent neural network-transducer model for performing speech recognition |
CN113948085B (en) * | 2021-12-22 | 2022-03-25 | Institute of Automation, Chinese Academy of Sciences | Speech recognition method, system, electronic device and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020091522A1 (en) * | 2001-01-09 | 2002-07-11 | Ning Bi | System and method for hybrid voice recognition |
CN1454381A (en) * | 2000-09-08 | 2003-11-05 | Qualcomm Incorporated | Combining DTW and HMM in speaker dependent and independent modes for speech recognition |
CN102138175A (en) * | 2008-07-02 | 2011-07-27 | Google Inc. | Speech recognition with parallel recognition tasks |
JP2011237621A (en) * | 2010-05-11 | 2011-11-24 | Honda Motor Co Ltd | Robot |
CN104143330A (en) * | 2013-05-07 | 2014-11-12 | Canon Inc. | Voice recognizing method and voice recognizing system |
US20150112685A1 (en) * | 2013-10-18 | 2015-04-23 | Via Technologies, Inc. | Speech recognition method and electronic apparatus using the method |
CN104795069A (en) * | 2014-01-21 | 2015-07-22 | Tencent Technology (Shenzhen) Company Limited | Speech recognition method and server |
US9153231B1 (en) * | 2013-03-15 | 2015-10-06 | Amazon Technologies, Inc. | Adaptive neural network speech recognition models |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004272134A (en) * | 2003-03-12 | 2004-09-30 | Advanced Telecommunication Research Institute International | Speech recognition device and computer program |
US8812321B2 (en) | 2010-09-30 | 2014-08-19 | At&T Intellectual Property I, L.P. | System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning |
US9916538B2 (en) * | 2012-09-15 | 2018-03-13 | Z Advanced Computing, Inc. | Method and system for feature detection |
JP6155592B2 (en) | 2012-10-02 | 2017-07-05 | 株式会社デンソー | Speech recognition system |
JP6047364B2 (en) * | 2012-10-10 | 2016-12-21 | Japan Broadcasting Corporation (NHK) | Speech recognition apparatus, error correction model learning method, and program |
US9519858B2 (en) * | 2013-02-10 | 2016-12-13 | Microsoft Technology Licensing, Llc | Feature-augmented neural networks and applications of same |
US9484023B2 (en) | 2013-02-22 | 2016-11-01 | International Business Machines Corporation | Conversion of non-back-off language models for efficient speech decoding |
US9058805B2 (en) * | 2013-05-13 | 2015-06-16 | Google Inc. | Multiple recognizer speech recognition |
JP5777178B2 (en) * | 2013-11-27 | 2015-09-09 | National Institute of Information and Communications Technology | Statistical acoustic model adaptation method, acoustic model learning method suitable for statistical acoustic model adaptation, storage medium storing parameters for constructing a deep neural network, and computer program for statistical acoustic model adaptation |
US9520127B2 (en) * | 2014-04-29 | 2016-12-13 | Microsoft Technology Licensing, Llc | Shared hidden layer combination for speech recognition systems |
US9679558B2 (en) * | 2014-05-15 | 2017-06-13 | Microsoft Technology Licensing, Llc | Language modeling for conversational understanding domains using semantic web resources |
WO2016167779A1 (en) * | 2015-04-16 | 2016-10-20 | Mitsubishi Electric Corporation | Speech recognition device and rescoring device |
EP3284084A4 (en) * | 2015-04-17 | 2018-09-05 | Microsoft Technology Licensing, LLC | Deep neural support vector machines |
US10127220B2 (en) * | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
2016
- 2016-11-17 US US15/353,767 patent/US10170110B2/en active Active
2017
- 2017-11-15 CN CN201780070915.1A patent/CN109923608B/en active Active
- 2017-11-15 WO PCT/EP2017/079272 patent/WO2018091501A1/en active Application Filing
- 2017-11-15 DE DE112017004397.2T patent/DE112017004397B4/en active Active
- 2017-11-15 JP JP2019526240A patent/JP6743300B2/en active Active
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110956621A (en) * | 2019-11-27 | 2020-04-03 | Hefei Innovation Research Institute of Beihang University | Method and system for detecting tissue canceration based on neural network |
CN110956621B (en) * | 2019-11-27 | 2022-09-13 | Hefei Innovation Research Institute of Beihang University | Method and system for detecting tissue canceration based on neural network |
CN113112827A (en) * | 2021-04-14 | 2021-07-13 | Shenzhen Qiyang Special Equipment Technology Engineering Co., Ltd. | Intelligent traffic control method and intelligent traffic control system |
CN113112827B (en) * | 2021-04-14 | 2022-03-25 | Shenzhen Qiyang Special Equipment Technology Engineering Co., Ltd. | Intelligent traffic control method and intelligent traffic control system |
Also Published As
Publication number | Publication date |
---|---|
US20180137857A1 (en) | 2018-05-17 |
US10170110B2 (en) | 2019-01-01 |
DE112017004397T5 (en) | 2019-05-23 |
WO2018091501A1 (en) | 2018-05-24 |
DE112017004397B4 (en) | 2022-10-20 |
JP6743300B2 (en) | 2020-08-19 |
CN109923608B (en) | 2023-08-01 |
JP2019537749A (en) | 2019-12-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109923608A (en) | System and method for ranking of hybrid speech recognition results with neural networks | |
US9959861B2 (en) | System and method for speech recognition | |
US10977452B2 (en) | Multi-lingual virtual personal assistant | |
CN111897964B (en) | Text classification model training method, device, equipment and storage medium | |
JP7022062B2 (en) | VPA with integrated object recognition and facial expression recognition | |
US11615785B2 (en) | Speech recognition using natural language understanding related knowledge via deep feedforward neural networks | |
CN110136693A (en) | System and method for neural voice cloning with a few samples | |
US8195459B1 (en) | Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments | |
CN108846063A (en) | Method, apparatus, device and computer-readable medium for determining answers to questions | |
CN102142253B (en) | Voice emotion identification equipment and method | |
CN110473523A (en) | Speech recognition method, apparatus, storage medium and terminal | |
Williams | Multi-domain learning and generalization in dialog state tracking | |
CN108846077A (en) | Semantic matching method, apparatus, medium and electronic device for question-and-answer text | |
CN109887484A (en) | Speech recognition and speech synthesis method and apparatus based on dual learning | |
CN110431626A (en) | Hyperarticulation detection in repetitive voice queries using pairwise comparison to improve speech recognition | |
CN110223714A (en) | Speech-based emotion recognition method | |
CN113505591A (en) | Slot recognition method and electronic device | |
CN114830139A (en) | Training models using model-provided candidate actions | |
CN106529525A (en) | Chinese and Japanese handwritten character recognition method | |
Lippmann et al. | LNKnet: neural network, machine-learning, and statistical software for pattern classification | |
Chai et al. | Communication tool for the hard of hearings: A large vocabulary sign language recognition system | |
US11600263B1 (en) | Natural language configuration and operation for tangible games | |
US11645947B1 (en) | Natural language configuration and operation for tangible games | |
CN114333832A (en) | Data processing method and device and readable storage medium | |
Schuller et al. | Speech communication and multimodal interfaces |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||