CN109923608A - System and method for ranking hybrid speech recognition results using a neural network - Google Patents


Info

Publication number
CN109923608A
CN109923608A (application CN201780070915.1A)
Authority
CN
China
Prior art keywords
speech recognition
recognition result
neural network
feature vector
controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201780070915.1A
Other languages
Chinese (zh)
Other versions
CN109923608B (en)
Inventor
Z. Zhou
R. Botros
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Robert Bosch GmbH
Original Assignee
Robert Bosch GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch GmbH filed Critical Robert Bosch GmbH
Publication of CN109923608A publication Critical patent/CN109923608A/en
Application granted granted Critical
Publication of CN109923608B publication Critical patent/CN109923608B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING › G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 Training (under G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16 Speech classification or search using artificial neural networks (under G10L 15/08 Speech classification or search)
    • G10L 15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems (under G10L 15/28 Constructional details of speech recognition systems)
    • G10L 2015/088 Word spotting (under G10L 15/08 Speech classification or search)
    • G10L 2015/223 Execution procedure of a spoken command (under G10L 15/22)


Abstract

A method for ranking candidate speech recognition results includes generating, with a controller, a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result and including one or more of trigger-pair features, confidence-score features, and word-level features. The method further includes providing the plurality of feature vectors as inputs to a neural network, generating a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network, and operating an automated system using, as input, the candidate speech recognition result that corresponds to the highest ranking score among the plurality of ranking scores.

Description

System and method for ranking hybrid speech recognition results using a neural network
Technical field
This disclosure relates generally to the field of automated speech recognition and, more specifically, to systems and methods that improve the operation of speech recognition systems that employ multiple speech recognition engines.
Background
Automated speech recognition is an important technology for implementing human-machine interfaces (HMIs) in a wide range of applications. In particular, speech recognition is useful in situations in which a human user needs to concentrate on performing a task and the use of traditional input devices such as a mouse and keyboard would be inconvenient or impractical. For example, in-vehicle "infotainment" systems, home automation systems, and many applications on small electronic mobile devices such as smartphones, tablets, and wearable computers can use speech recognition to receive voice commands and other input from users.
Most prior-art speech recognition systems use a trained speech recognition engine to convert recorded spoken input from a user into digital data suitable for processing in a computerized system. Various speech engines known to the art perform natural language understanding techniques to recognize the words spoken by the user and to extract semantic meaning from those words in order to control the operation of the computerized system.
In some situations, a single speech recognition engine is not necessarily optimal for recognizing speech from a user while the user performs different tasks. Prior-art solutions attempt to combine multiple speech recognition systems to improve recognition accuracy, including selecting low-level outputs from the acoustic models of different speech recognition models, or selecting entire output sets from different speech recognition engines based on a predetermined ranking process. However, low-level combinations of the outputs of multiple speech recognition systems do not preserve high-level linguistic information. In other embodiments, multiple speech recognition engines each produce a complete speech recognition result, but the process of determining which speech recognition result to select from the outputs of the multiple engines remains challenging. Consequently, improvements to speech recognition systems that increase the accuracy of selecting a speech recognition result from a set of candidate results produced by multiple speech recognition engines would be beneficial.
Summary of the invention
In one embodiment, a method for performing speech recognition in an automated system has been developed. The method includes generating, with a controller, a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in a plurality of candidate speech recognition results. The generation of a first feature vector for a first candidate speech recognition result in the plurality of candidate speech recognition results further includes identifying, with the controller and with reference to a plurality of predetermined trigger pairs stored in a memory, at least one trigger pair in the first candidate speech recognition result, the at least one trigger pair including two predetermined trigger words, and generating, with the controller, the first feature vector including an element for the at least one trigger pair. The method further includes providing, with the controller, the plurality of feature vectors as inputs to a neural network, generating, with the controller, a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network, and operating, with the controller, the automated system using as input the candidate speech recognition result that corresponds to the highest ranking score among the plurality of ranking scores.
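The scoring-and-selection step of this embodiment can be sketched as follows. The single-hidden-layer network shape, ReLU activation, and function name are illustrative assumptions, not the patent's specified architecture:

```python
import numpy as np

def rank_candidates(feature_vectors, weights_hidden, bias_hidden, weights_out, bias_out):
    """Score each candidate's feature vector with a small feed-forward
    network and return the index of the highest-scoring candidate
    together with all ranking scores."""
    scores = []
    for x in feature_vectors:
        hidden = np.maximum(0.0, weights_hidden @ x + bias_hidden)  # ReLU hidden layer
        scores.append(float(weights_out @ hidden + bias_out))       # scalar ranking score
    return int(np.argmax(scores)), scores
```

The automated system would then use the candidate at the returned index as its input, as the method describes.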
In another embodiment, a method for training a neural network ranker that generates ranking scores for different candidate speech recognition results in an automated speech recognition system has been developed. The method includes generating, with a processor, a plurality of feature vectors, each feature vector corresponding to one training speech recognition result in a plurality of training speech recognition results stored in a memory. The generation of a first feature vector for a first training speech recognition result in the plurality of training speech recognition results further includes identifying, with the processor and with reference to a plurality of predetermined trigger pairs stored in the memory, at least one trigger pair in the first training speech recognition result, the at least one trigger pair including two predetermined trigger words, and generating, with the processor, the first feature vector including an element for the at least one trigger pair. The method further includes performing, with the processor, a training process for the neural network ranker using the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker, a plurality of output scores generated by the neural network ranker during the training process, and a plurality of target results based on predetermined edit distances between each training speech recognition result and the predetermined correct input for that training speech recognition result. After completion of the training process, the method includes storing, with the processor, the trained neural network ranker in the memory for use in generating ranking scores for additional feature vectors corresponding to speech recognition results that are not present in the plurality of training speech recognition results.
In another embodiment, an automated speech recognition system has been developed. The system includes a memory and a controller operatively connected to the memory. The memory is configured to store a plurality of predetermined trigger pairs, each trigger pair including two words, and a neural network configured to generate ranking scores corresponding to a plurality of candidate speech recognition results. The controller is configured to generate a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in the plurality of candidate speech recognition results, including a first feature vector for a first candidate speech recognition result. The controller is further configured to identify, with reference to the plurality of predetermined trigger pairs stored in the memory, at least one trigger pair in the first candidate speech recognition result, the at least one trigger pair including two predetermined trigger words, and to generate the first feature vector including an element for the at least one trigger pair. The controller is further configured to provide the plurality of feature vectors as inputs to the neural network, to generate a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network, and to operate the automated system using as input the candidate speech recognition result that corresponds to the highest ranking score among the plurality of ranking scores.
Description of the drawings
Fig. 1 is a schematic view of the components of an automated system that receives spoken input commands from a user, as embodied in an in-vehicle information system in the passenger compartment of a vehicle.
Fig. 2 is a block diagram of a process for generating ranking scores for multiple candidate speech recognition results using a neural network ranker during a speech recognition process.
Fig. 3 is a schematic view of a computing system that performs a training process to generate the trained neural network ranker of Fig. 1 and Fig. 2.
Fig. 4 is a block diagram of a process for generating a trained neural network ranker.
Fig. 5 is a diagram depicting the structure of a feature vector generated from a speech recognition result and the structure of the neural network ranker.
Detailed description
For the purpose of promoting an understanding of the principles of the embodiments described herein, reference is now made to the drawings and to the descriptions in the following written specification. No limitation to the scope of the subject matter is intended by the references. This disclosure also includes any alterations and modifications to the illustrated embodiments and includes further applications of the principles of the disclosed embodiments, as would normally occur to one skilled in the art to which this disclosure pertains.
As used herein, the term "speech recognition engine" refers to a data model and executable program code that enable a computerized system to identify spoken words from an operator based on recorded audio input data of the spoken words received via a microphone or other audio input device. Speech recognition systems typically include a lower-level acoustic model that recognizes the individual sounds of human speech in a sound recording and a higher-level language model that recognizes words and sentences based on the sequences of sounds from the acoustic model for a predetermined language. Speech recognition engines known to the art typically implement one or more statistical models, such as a hidden Markov model (HMM), a support vector machine (SVM), a trained neural network, or another statistical model that generates statistical predictions for recorded human speech using a plurality of trained parameters applied to feature vectors that correspond to the human speech. The speech recognition engine generates the feature vectors using, for example, various signal processing techniques known to the art that extract properties ("features") of the recorded speech signal and organize the features into one-dimensional or multi-dimensional vectors that can be processed with the statistical model to identify various parts of speech, including individual words and sentences. Speech recognition engines can produce results for speech inputs corresponding to individual spoken phonemes as well as more complex patterns of sound, including spoken words and sentences that include sequences of related words.
As used herein, the term "speech recognition result" refers to a machine-readable output that the speech recognition engine generates for a given input. The result can be, for example, text encoded in a machine-readable format or another set of encoded data that serves as input to control the operation of an automated system. Due to the statistical nature of speech recognition engines, in some configurations the speech engine generates multiple potential speech recognition results for a single input. The speech engine also generates a "confidence score" for each speech recognition result, where the confidence score is a statistical estimate of the likelihood that the speech recognition result is accurate, based on the trained statistical model of the speech recognition engine. As described in more detail below, a hybrid speech recognition system uses the speech recognition results produced by multiple speech recognition engines, generates additional hybrid speech recognition results, and ultimately generates at least one output speech recognition result based on the multiple previously generated speech recognition results. As used herein, the term "candidate speech recognition result," or more simply "candidate result," refers to a speech recognition result that is a candidate to be the final speech recognition result from the hybrid speech recognition system, which generates multiple candidate results and selects only a subset (typically one) of the results as the final speech recognition result. In various embodiments, the candidate speech recognition results include speech recognition results from both general-purpose and domain-specific speech recognition engines, as well as hybrid speech recognition results that the system 100 generates using words from multiple candidate speech recognition results.
As used herein, the term "general-purpose speech recognition engine" refers to a type of speech recognition engine that is trained to recognize a broad range of speech from a natural human language such as English or Chinese. General-purpose speech recognition engines generate speech recognition results based on a broad vocabulary of words and a language model that is trained to cover language patterns in the natural language broadly. As used herein, the term "domain-specific speech recognition engine" refers to a type of speech recognition engine that is trained to recognize speech inputs in a particular field of use, or "domain," which often includes a somewhat different vocabulary and potentially different expected grammatical structures than the broader natural language. The vocabulary for a specific domain typically includes certain terms from the broader natural language but may include a narrower overall vocabulary, and in some instances includes specialized terms that are not officially recognized as words in the natural language but are well known in the particular domain. For example, in a navigation application, a domain-specific speech recognition engine may recognize terms for roads, towns, or other geographic designations that are not typically recognized as proper names in the broader language. In other configurations, a particular domain uses a specific set of jargon that is useful within the domain but may not be well recognized in the broader language. For example, aviators officially use English as a language of communication, but also use numerous domain-specific jargon words and other abbreviations that are not part of standard English.
As used herein, the term "trigger pair" refers to two items, each of which can be a word (e.g., "play") or a predetermined class (e.g., <song name>) that represents a sequence of words falling within the predetermined class (e.g., "Poker Face"), such as the proper name of a song, a person, a location, and the like. The words of a trigger pair, when they appear in a specific order within the text content of a speech recognition result, form a trigger pair A → B in which there is a high level of correlation between the occurrence of the earlier word A and the occurrence of the later word B in the audio input data. As described in more detail below, after a set of trigger pairs has been identified via a training process, the trigger word pairs found in the text of the candidate speech recognition results form one portion of the feature vector for each candidate result, and the ranking process uses that portion to rank the different candidate speech recognition results.
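A minimal sketch of detecting the trigger pairs just defined in a candidate result follows. The in-order matching rule (A anywhere before B, with any number of intervening words) is taken from this paragraph; the function name and binary encoding are illustrative assumptions:

```python
def trigger_pair_features(result_words, trigger_pairs):
    """Return one binary feature per predetermined trigger pair A -> B:
    1 if word A occurs anywhere before word B in the recognition result
    (any number of words may separate them), else 0."""
    features = []
    for first, second in trigger_pairs:
        fired = 0
        seen_first = False
        for word in result_words:
            if word == first:
                seen_first = True
            elif word == second and seen_first:
                fired = 1
                break
        features.append(fired)
    return features
```

For example, the result "play me poker face" fires the pair ("play", "face") even though two words separate the trigger words.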
Inference system and ranking process using a trained neural network ranker
Fig. 1 depicts an in-vehicle information system 100 that includes a head-up display (HUD) 120, one or more console LCD panels 124, one or more input microphones 128, and one or more output speakers 132. The LCD display 124 and the HUD 120 generate visual output responses from the system 100 based, at least in part, on speech input commands that the system 100 receives from an operator or other vehicle occupant. A controller 148 is operatively connected to each of the components in the in-vehicle information system 100. In some embodiments, the controller 148 is connected to or incorporates additional components, such as a global positioning system (GPS) receiver 152 and a wireless communication device 154, to provide navigation and communication using external data networks and computing devices.
In some operating modes, the in-vehicle information system 100 operates independently, while in other operating modes the in-vehicle information system 100 interacts with a mobile electronic device, such as a smartphone 170, tablet, notebook computer, or other electronic device. The in-vehicle information system communicates with the smartphone 170 using a wired interface, such as USB, or a wireless interface, such as Bluetooth. The in-vehicle information system 100 provides a speech recognition user interface that enables the operator to control the smartphone 170 or another mobile electronic communication device using spoken commands that reduce distraction while operating the vehicle. For example, the in-vehicle information system 100 provides a speech interface to enable the vehicle operator to make phone calls or send text messages with the smartphone 170 without requiring the operator to hold or look at the smartphone 170. In some embodiments, the smartphone 170 includes various devices, such as GPS and wireless networking devices, that complement or replace the functionality of the devices housed in the vehicle.
The microphone 128 generates audio data from spoken input received from the vehicle operator or another vehicle passenger. The controller 148 includes hardware, such as a DSP, that processes the audio data, and software components that convert the input signals from the microphone 128 into audio input data. As explained below, the controller 148 uses at least one general-purpose and at least one domain-specific speech recognition engine to generate candidate speech recognition results based on the audio input data, and the controller 148 further uses a ranker to improve the accuracy of the final speech recognition result output. Additionally, the controller 148 includes hardware and software components that enable the generation of synthesized speech or other audio output through the speakers 132.
The in-vehicle information system 100 provides visual feedback to the vehicle operator using the LCD panel 124, the HUD 120 projected onto the windshield 102, and gauges, indicator lights, or additional LCD panels located in the dashboard 108. When the vehicle is in motion, the controller 148 optionally deactivates the LCD panel 124, or displays only simplified output through the LCD panel 124, to reduce distraction to the vehicle operator. The controller 148 displays visual feedback using the HUD 120 to enable the operator to view the environment around the vehicle while receiving the visual feedback. The controller 148 typically displays simplified data on the HUD 120 in a region corresponding to the peripheral vision of the vehicle operator to ensure that the vehicle operator has an unobstructed view of the road and the environment around the vehicle.
As described above, the HUD 120 displays visual information on a portion of the windshield 102. As used herein, the term "HUD" refers generically to a wide range of head-up display devices including, but not limited to, combined head-up displays (CHUDs) that include a separate combiner element, and the like. In some embodiments, the HUD 120 displays monochromatic text and graphics, while other HUD embodiments include multi-color displays. While the HUD 120 is depicted as displaying on the windshield 102, in alternative embodiments a head-up unit is integrated with glasses, a helmet visor, or a reticle that the operator wears during operation.
The controller 148 includes one or more integrated circuits configured as one, or a combination, of the following: a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a digital signal processor (DSP), or any other suitable digital logic device. The controller 148 also includes a memory, such as a solid-state or magnetic data storage device, that stores programmed instructions for the operation of the in-vehicle information system 100.
During operation, the in-vehicle information system 100 receives input requests from multiple input devices, including speech input commands received through the microphone 128. In particular, the controller 148 receives audio input data corresponding to speech from the user via the microphone 128.
The controller 148 includes one or more integrated circuits configured as a central processing unit (CPU), microcontroller, field programmable gate array (FPGA), application specific integrated circuit (ASIC), digital signal processor (DSP), or any other suitable digital logic device. The controller 148 is also operatively connected to a memory 160, which includes a non-volatile solid-state or magnetic data storage device and a volatile data storage device, such as random access memory (RAM), that stores programmed instructions for the operation of the in-vehicle information system 100. The memory 160 stores model data and executable program instruction code and data that implement a plurality of speech recognition engines 162, a feature extractor 164, and a deep neural network ranker 166. The speech recognition engines 162 are trained using a predetermined training process and are otherwise known to the art. While the embodiment of Fig. 1 includes elements of the system 100 stored within the memory 160 of a motor vehicle, in some embodiments an external computing device, such as a network-connected server, implements some or all of the features depicted in the system 100. Thus, those skilled in the art will recognize that any reference to the operation of the system 100, including the controller 148 and the memory 160, should further include the operation of server computing devices and other distributed computing components in alternative embodiments of the system 100.
In the embodiment of Fig. 1, the feature extractor 164 is configured to generate a feature vector with a plurality of numeric elements corresponding to the contents of each candidate speech recognition result, including speech recognition results generated by one of the speech recognition engines 162 and hybrid speech recognition results that combine words from two or more of the speech recognition engines 162. The feature extractor 164 generates a feature vector that includes elements for any one, or a combination, of the following features: (a) trigger pairs, (b) confidence scores, and (c) individual word-level features, including a bag-of-words with decay.
In the system 100, the trigger pairs stored in the feature extractor 164 each include a predetermined set of two words that have previously been identified as having a strong correlation in spoken input sequences from a training corpus that represents the structure of expected speech inputs. The first trigger word has a strong statistical likelihood of being followed by the second trigger word in a speech input, although the two trigger words may be separated by an unknown number of intermediate words in different speech inputs. Thus, if a speech recognition result includes the trigger words, the likelihood that those trigger words are accurate in the speech recognition result is comparatively high due to the statistical correlation between the first and second trigger words. In the system 100, the trigger words are generated based on mutual information scores using statistical methods known to the art. The memory 160 stores elements for a predetermined set of N trigger pairs in the feature vector, where the trigger pair elements correspond to the set of trigger pairs that have the highest levels of correlation between the first word and the second word based on high mutual information scores. As described below, the trigger word pairs provide the neural network ranker 166 with additional features of the speech recognition results that enable the neural network ranker 166 to rank the speech recognition results using features that go beyond the individual words present in the speech recognition results.
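The mutual-information selection of trigger pairs described above can be illustrated with a small sketch. The patent only states that pairs are chosen by mutual information scores using known statistical methods; the utterance-level counting and the pointwise-mutual-information variant below are assumptions for illustration:

```python
import math
from collections import Counter

def trigger_pair_pmi(corpus):
    """Score ordered word pairs (A precedes B) by pointwise mutual
    information over a corpus of tokenized utterances; a higher score
    suggests a stronger A -> B trigger pair candidate."""
    n = len(corpus)
    word_counts = Counter()
    pair_counts = Counter()
    for words in corpus:
        word_counts.update(set(words))             # count each word once per utterance
        ordered = {(words[i], words[j])            # each ordered pair once per utterance
                   for i in range(len(words))
                   for j in range(i + 1, len(words))}
        pair_counts.update(ordered)
    pmi = {}
    for (a, b), c_ab in pair_counts.items():
        p_ab = c_ab / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi
```

The N highest-scoring pairs from such a computation would populate the trigger-pair portion of the feature vector.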
The confidence score features correspond to the numeric confidence score values that the speech recognition engines 162 generate in conjunction with each candidate speech recognition result. For example, in one configuration, a numeric value in the range (0.0, 1.0) indicates the probabilistic confidence level that the speech recognition engine places in the accuracy of a particular candidate speech recognition result, from lowest confidence (0.0) to highest confidence (1.0). Each hybrid candidate speech recognition result that includes words from two or more speech recognition engines is assigned a confidence score that is the normalized average of the confidence scores of the candidate speech recognition results that the controller 148 used to generate the hybrid speech recognition result in question.
In the system 100, the controller 148 also normalizes and whitens the confidence score values for the speech recognition results generated by the different speech recognition engines to generate final feature vector elements that include uniform normalized and whitened confidence scores across the outputs of the multiple speech recognition engines 162. The controller 148 uses a normalization process to normalize the confidence scores from the different speech recognition engines and then whitens the normalized confidence score values using prior-art whitening techniques with the mean and variance estimated from the training data. In one embodiment, the controller 148 normalizes the confidence scores between the different speech recognition engines using a linear regression process. The controller 148 first subdivides the confidence score range into a predetermined number of subdivisions, or "bins," such as 20 unique bins for two speech recognition engines A and B. The controller 148 then identifies the actual accuracy rates for the various speech recognition results corresponding to each scoring bin, based on the speech recognition results observed during a prior training process and the actual underlying inputs used in the process 200. The controller 148 performs a clustering operation on the confidence scores within a predetermined value window around each "edge" separating the bins for each set of results from the different speech recognition engines, and identifies an average accuracy score corresponding to each edge confidence score value. The "edge" confidence scores are distributed evenly along the confidence score range of each speech recognition engine and provide a predetermined number of comparison points for performing a linear regression that maps the confidence scores of a first speech recognition engine to the confidence scores of another speech recognition engine that have similar accuracy rates.
The controller 148 uses the accuracy data identified for each edge score to perform a linear regression mapping that enables the controller 148 to convert a confidence score from the first speech recognition engine into another confidence score value corresponding to the equivalent confidence score from the second speech recognition engine. The mapping from one confidence score of the first speech recognition engine to another confidence score of another speech recognition engine is also referred to as a score alignment process, and in some embodiments the controller 148 determines the alignment of the confidence scores from the first speech recognition engine with the second speech recognition engine using the following equation:
x′ = e′_i + (x − e_i)(e′_{i+1} − e′_i)/(e_{i+1} − e_i), where x is the score from the first speech recognition engine, x′ is the equivalent value of x in the confidence score range of the second speech recognition engine, the values e_i and e_{i+1} correspond to the estimated accuracy scores for the edge values closest to the value x of the first speech recognition engine (such as the estimated accuracy scores for the edge values 20 and 25 around a confidence score of 22), and the values e′_i and e′_{i+1} correspond to the estimated accuracy scores at the same relative edge values of the second speech recognition engine.
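The alignment equation above can be read as piecewise linear interpolation between the bracketing edge values. The sketch below is a minimal illustration under the assumption that the edges for both engines are supplied as equal-length, sorted lists with matching accuracy levels at each index; the function name and calling convention are illustrative, not taken from the patent.

```python
def align_score(x, edges_a, edges_b):
    """Map a confidence score x from engine A onto engine B's scale by
    linear interpolation between the pair of engine-A edges that brackets x.

    edges_a / edges_b: equal-length, sorted lists of edge scores whose
    estimated accuracy matches index-for-index (an assumption of this sketch).
    """
    for i in range(len(edges_a) - 1):
        if edges_a[i] <= x <= edges_a[i + 1]:
            e_i, e_next = edges_a[i], edges_a[i + 1]
            ep_i, ep_next = edges_b[i], edges_b[i + 1]
            # x' = e'_i + (x - e_i)(e'_{i+1} - e'_i)/(e_{i+1} - e_i)
            return ep_i + (x - e_i) * (ep_next - ep_i) / (e_next - e_i)
    raise ValueError("score outside the calibrated edge range")
```

For the example in the text, a score of 22 from engine A with edges at 20 and 25 mapping to engine-B edges at, say, 30 and 40 would align to 34.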
In some embodiments, the controller 148 stores the results of the linear regression in the feature extractor 164 in the memory 160, as a lookup table or another suitable data structure, to enable efficient normalization of the confidence scores between the different speech recognition engines 162 without regenerating the linear regression for each comparison.
The controller 148 also uses the feature extractor 164 to identify word-level features in the candidate speech recognition results. The word-level features correspond to data that the controller 148 places in elements of the feature vector, where each element corresponds to a characteristic of an individual word in the candidate speech recognition result. In one embodiment, the controller 148 identifies only the presence or absence of words in a predetermined vocabulary, where each word corresponds to a separate element of the predetermined feature vector for each candidate speech recognition result. For example, if the word "street" occurs at least once in the candidate speech recognition result, the controller 148 sets the value of the corresponding element in the feature vector to 1 during the feature extraction process. In another embodiment, the controller 148 identifies the frequency of each word, where "frequency" as used herein refers to the number of times the word occurs in the candidate speech recognition result. The controller 148 places the number of occurrences of the word in the corresponding element of the feature vector.
In still another embodiment, the feature extractor 164 generates a "bag-of-words with decay" feature for the element in the feature vector corresponding to each word in the predetermined vocabulary. As used herein, the term "bag-of-words with decay" refers to a numeric score that the controller 148 assigns to each word in the predetermined vocabulary based on the number of occurrences and the positions of the word in the candidate speech recognition result. The controller 148 generates a bag-of-words-with-decay score for each word of the candidate speech recognition result that is in the predetermined vocabulary, and assigns a decay score of zero to those words in the vocabulary that do not appear in the candidate result. In some embodiments, the predetermined vocabulary includes a special entry to represent any out-of-vocabulary word, and the controller 148 also generates a single bag-of-words-with-decay score for the special entry based on all of the out-of-vocabulary words in the candidate result. For a given word w_i in the predetermined dictionary, the bag-of-words-with-decay score is BOW(w_i) = Σ_{k ∈ P(w_i)} d^k, where P(w_i) is the set of positions at which the word w_i appears in the candidate speech recognition result, and the term d is a predetermined decay factor in the range (0, 1.0), which is set, for example, to 0.9 in the illustrative embodiment of the system 100.
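The decayed bag-of-words score can be sketched directly from the formula above. This is a minimal illustration, assuming zero-indexed word positions (the text does not specify the indexing) and using an assumed "&lt;OOV&gt;" key for the special out-of-vocabulary entry.

```python
def bag_of_words_with_decay(result_words, vocabulary, decay=0.9):
    """Score each vocabulary word as the sum of decay**position over the
    positions where it appears in the candidate result; absent words keep
    a score of zero. Out-of-vocabulary words pool into one '<OOV>' entry."""
    scores = {w: 0.0 for w in vocabulary}
    scores["<OOV>"] = 0.0
    for position, word in enumerate(result_words):
        key = word if word in vocabulary else "<OOV>"
        scores[key] += decay ** position
    return scores
```

For the candidate result "play the song the" with vocabulary {play, the, song}, "the" appears at positions 1 and 3 and scores 0.9 + 0.9³, rewarding earlier and repeated occurrences.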
Fig. 5 depicts an example of the structure of a feature vector 500 in more detail. The feature vector 500 includes multiple elements corresponding to the trigger-pair features 504, a confidence score element 508, and multiple other elements corresponding to the word-level features 512, which are depicted in Fig. 5 as bag-of-words-with-decay features. In the feature vector 500, the trigger-word-pair features 504 include an element for each predetermined trigger pair, where a value of "0" indicates that the trigger pair is not present in the candidate speech recognition result and a value of "1" indicates that the trigger pair is present in the candidate speech recognition result. The confidence score element 508 is a single element that includes the numeric confidence score value generated by the corresponding speech recognition engine 162, or by the combination of speech recognition engines for a hybrid speech recognition result. The word-level feature elements 512 include an array of elements that each correspond to a particular word in the predetermined vocabulary. For example, in one embodiment the predetermined dictionary for a language (such as English or Chinese) includes words that are each mapped to one of the word-level elements 512. In another embodiment that is described in more detail below, a training process generates the vocabulary of words based on the frequency of occurrence of words in a large training data set, where the words that appear in the training data set with the highest frequency (e.g., the 90% of words with the highest frequency) are mapped to the word-level elements 512 in the structure of the feature vector 500.
The precise order of the feature vector elements depicted in the feature vector 500 is not a requirement for representing the trigger-pair, confidence score, and word-level features. Instead, any ordering of the elements in the feature vector 500 is valid as long as the feature vectors for all candidate speech recognition results are generated with a consistent structure, in which each element represents the same trigger pair, confidence score, or word-level feature across all of the candidate speech recognition results.
Referring again to Fig. 1, in the embodiment of Fig. 1 the neural network ranker 166 is a trained neural network that includes an input layer of neurons that receives multiple feature vectors corresponding to a predetermined number of candidate speech recognition results, and an output layer of neurons that generates a ranking score corresponding to each of the input feature vectors. In general, a neural network includes multiple nodes that are referred to as "neurons". Each neuron receives at least one input value, applies a predetermined weighting factor to each input value, where different input values typically receive different weighting factors, and generates an output as a sum of the weighted inputs, in some embodiments with an optional bias factor added to the sum. The precise weighting factors for each input, and the optional bias value in each neuron, are generated during a training process that is described in more detail below. The output layer of the neural network includes another set of neurons that are specifically configured with an "activation function" during the training process. The activation function is, for example, a sigmoid function or another threshold function that generates an output value based on the inputs from the final hidden layer of neurons in the neural network, where the precise parameters of the sigmoid function or the thresholds are generated during the training process of the neural network.
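The neuron described above reduces to a weighted sum plus bias, optionally followed by an activation such as the sigmoid. The sketch below is only a schematic of that arithmetic, with illustrative names; the actual weights, biases, and activation parameters are produced by the training process.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """One neuron: the sum of each input times its weighting factor,
    plus an optional bias term."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def sigmoid(z):
    """Example activation function for an output-layer neuron."""
    return 1.0 / (1.0 + math.exp(-z))
```

For example, a neuron with weights (0.5, 0.25) and bias 0.1 maps inputs (1, 2) to 1.1 before the activation is applied.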
In the specific configuration of Fig. 1, the neural network ranker 166 is a feedforward deep neural network, and Fig. 5 includes an illustrative depiction of a feedforward deep neural network 550. As is known in the art, a feedforward neural network includes layers of neurons connected in a single direction of travel from the input layer (layer 554) to the output layer (layer 566), without any recurrent or "feedback" loops that connect neurons in one layer of the neural network to neurons in a previous layer of the neural network. A deep neural network includes at least one "hidden layer" (and typically more than one hidden layer) of neurons that is not exposed as the input layer or the output layer. In the neural network 550, k hidden layers 562 of neurons connect the input layer 554 to the output layer 566.
In the embodiment of the neural network 550, the input layer further includes a projection layer 558 that applies a predetermined matrix transformation to selected sets of the input feature vector elements, including two different projection matrices for the trigger-pair elements 504 and the word-level feature elements 512, respectively. The projection layer 558 generates a simplified representation of the outputs of the input neurons in the input layer 554, because in most practical inputs the feature vector elements for the trigger pairs 504 and the word-level features 512 are "sparse", meaning that each candidate speech recognition result includes only a small number (if any) of the trigger pairs and only a small number of the words in the large overall set of words (e.g., 10,000 words) that are encoded in the structure of the feature vector 500. The transformation in the projection layer 558 enables the remaining layers of the neural network 550 to include far fewer neurons while still generating useful ranking scores for the feature vector inputs of the candidate speech recognition results. In one illustrative embodiment, the two projection matrices P_f and P_w, for the trigger-word-pair and word-level features respectively, each project the corresponding input neurons into smaller vector spaces of 200 elements each, which produces a projected layer of 401 neurons for each of the n input feature vectors in the neural network ranker 166 (one neuron is reserved for the confidence score feature).
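Because the trigger-pair and word-level inputs are sparse, the projection can touch only the matrix rows for nonzero elements rather than performing a full dense multiply. The sketch below illustrates that idea under assumed data shapes (dict-of-rows matrices, a reduced dimension of 2 instead of 200 for the test); none of these names or shapes come from the patent.

```python
def project_sparse(nonzero, matrix_rows, out_dim):
    """Multiply a sparse feature vector by a projection matrix, visiting
    only the rows for nonzero elements.

    nonzero: dict {element_index: value}
    matrix_rows: mapping element_index -> row (list of length out_dim)
    """
    out = [0.0] * out_dim
    for idx, value in nonzero.items():
        row = matrix_rows[idx]
        for j in range(out_dim):
            out[j] += value * row[j]
    return out

def projected_input(trigger_nz, word_nz, P_f, P_w, confidence, dim=200):
    # dim trigger features + 1 confidence neuron + dim word features,
    # i.e. 200 + 1 + 200 = 401 projected neurons per result in the text.
    return (project_sparse(trigger_nz, P_f, dim)
            + [confidence]
            + project_sparse(word_nz, P_w, dim))
```

With the illustrative 200-element projections, each candidate's projected slice has length 401 regardless of how large the raw trigger-pair and vocabulary spaces are.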
Although Fig. 5 depicts the neural network 550 with a total of n input slots for the feature vectors corresponding to n different candidate speech recognition results, the number of input neurons in the input layer 554 includes one neuron for each element in the feature vector of each candidate speech recognition result, or n(T + 0.9V + 2) neurons in total, where T is the number of predetermined trigger pairs that are identified in the candidate speech recognition results and V is the number of words that occur in the vocabulary of identified words, where the coefficient 0.9 represents the filtering of the training set, described above, to include only the 90% of words that occur with the highest frequency. The fixed value 2 represents one input neuron for the confidence score value and another input neuron that serves as a catch-all input for any word-level features that do not correspond to the predetermined word-level elements of the input feature vector, such as any out-of-vocabulary words that are not explicitly modeled in the neural network ranker 166. For example, the controller 148 uses the feature extractor 164 to generate, in the feature vector, a bag-of-words-with-decay score for any word in the candidate speech recognition result that is not aligned with an element in the predetermined structure of the feature vector. The element in the feature vector corresponding to out-of-vocabulary words enables the neural network ranker 166 to incorporate the presence of any words that are not included in the predetermined vocabulary into the generation of the ranking score for any candidate speech recognition result that includes out-of-vocabulary words.
The output layer 566 includes fewer output neurons than the input layer 554. In particular, the output layer 566 includes n output neurons, where each output neuron generates a numeric ranking score corresponding to one of the n input feature vectors during an inference process, which in the specific configuration of the system 100 is a ranking process that generates ranking scores for the feature vectors corresponding to the multiple candidate speech recognition results. Some hardware embodiments of the controller 148 include one or more compute units in a GPU, or other dedicated hardware acceleration components, that perform the inference process in a time- and power-efficient manner. In other embodiments, the system 100 further includes additional digital logic processing hardware that is incorporated into a remote server, which the controller 148 accesses using the wireless communication device 154 and a data network. In some embodiments, the hardware in the remote server also implements a portion of the functionality of the multiple speech recognition engines 162. The server includes additional processing hardware to perform all or part of the feature extraction and neural network inference processing to generate the feature vectors and ranking scores for the multiple candidate speech recognition results.
During operation, the system 100 receives audio input data using the microphone 128 and generates multiple candidate speech recognition results using the multiple speech recognition engines 162, including a hybrid speech recognition result that, in some embodiments, includes words selected from two or more of the candidate speech recognition results. The controller 148 extracts features from the candidate speech recognition results using the feature extractor 164 to generate feature vectors from the candidate speech recognition results, and provides the feature vectors to the neural network ranker 166 to generate an output score for each feature vector. The controller 148 then identifies the feature vector and candidate speech recognition result corresponding to the highest ranking score, and the controller 148 operates the automated system using, as input, the candidate speech recognition result among the multiple candidate speech recognition results that corresponds to the highest ranking score among the multiple ranking scores.
Fig. 2 depicts a process 200 for performing speech recognition by selecting candidate speech recognition results using multiple speech recognition engines and a neural network ranker. In the description below, a reference to the process 200 performing a function or action refers to the operation of a controller executing stored program instructions to perform the function or action in association with other components in the automated system. The process 200 is described in conjunction with the system 100 of Fig. 1 for illustrative purposes.
The process 200 begins as the system 100 generates multiple candidate speech recognition results using the multiple speech recognition engines 162 (block 204). In the system 100, a user provides spoken audio input to an audio input device, such as the microphone 128. The controller 148 uses the multiple speech recognition engines 162 to generate multiple candidate speech recognition results. As described above, in some embodiments the controller 148 generates a hybrid candidate speech recognition result by using selected words from the candidate speech recognition result of a domain-specific speech recognition engine to replace selected words in the candidate speech recognition result of a general-purpose speech recognition engine. The speech recognition engines 162 also generate the confidence score data that the system 100 uses during the feature vector generation in the process 200.
The process 200 continues as the system 100 performs feature extraction to generate multiple feature vectors that each correspond to one of the candidate speech recognition results (block 208). In the system 100, the controller 148 generates the feature vectors using the feature extractor 164, where each feature vector includes one or more of the trigger-pair, confidence score, and word-level features described above, to produce feature vectors with the structure of the feature vector 500 in Fig. 5 or another similar structure for one or more of the trigger-pair, confidence score, and word-level features. In the embodiment of Fig. 2, the controller 148 generates the word-level features for the word-level feature elements of the feature vector using the bag-of-words-with-decay metric.
The process 200 continues as the controller 148 provides the feature vectors for the multiple candidate speech recognition results to the neural network ranker 166 as inputs to an inference process that generates multiple ranking scores corresponding to the multiple candidate speech recognition results (block 212). In one embodiment, the controller 148 uses the trained feedforward deep neural network ranker 166 to generate the multiple ranking scores at the output layer neurons of the neural network using the inference process. As described above, in another embodiment the controller 148 uses the wireless communication device 154 to transmit the feature vector data, an encoded version of the candidate speech recognition results, or the recorded audio speech recognition data to an external server, where a processor in the server performs a portion of the process 200 to generate the ranking scores for the candidate speech recognition results.
In most instances, the controller 148 generates a number n of candidate speech recognition results and corresponding feature vectors that matches the predetermined number n of feature vector inputs that the neural network ranker 166 is configured to receive during the training process. However, in some instances, if the number of feature vectors for the candidate speech recognition results is less than the maximum number n, the controller 148 generates "empty" all-zero feature vector inputs to ensure that all of the neurons in the input layer of the neural network ranker 166 receive an input. The controller 148 ignores the scores of the corresponding output layer neurons for each empty input, and the neural network in the ranker 166 generates scores for the non-empty feature vectors of the candidate speech recognition results.
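The padding and selection step above can be sketched as follows. This is an illustrative outline only: `score_fn` is a hypothetical stand-in for the trained ranker's per-slot inference, and the function names are not from the patent.

```python
def rank_candidates(feature_vectors, n_slots, vector_len, score_fn):
    """Pad the candidate feature vectors up to the ranker's fixed number of
    input slots with all-zero 'empty' vectors, score every slot, and return
    the index of the best real candidate (empty-slot scores are ignored)."""
    real = len(feature_vectors)
    padded = list(feature_vectors) + [[0.0] * vector_len] * (n_slots - real)
    scores = [score_fn(v) for v in padded]
    # Only the first `real` output neurons correspond to actual candidates.
    return max(range(real), key=lambda i: scores[i])
```

With three candidates and a four-slot ranker, one zero vector is appended and the winner is chosen among the first three output scores only.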
The process 200 continues as the controller 148 identifies the candidate speech recognition result that corresponds to the highest ranking score in the output layer of the neural network ranker 166 (block 216). As described above in Fig. 5, each output neuron in the output layer 566 of the neural network 550 generates an output value corresponding to the ranking score of one of the input feature vectors that the system 100 supplies to the predetermined set of input neurons in the input layer 554. The controller 148 identifies the candidate speech recognition result with the highest ranking score based on the index of the output neuron that generates the highest ranking score in the neural network 550.
Referring again to Fig. 2, the process 200 continues as the controller 148 uses the selected highest-ranked speech recognition result as the input from the user to operate the automated system (block 220). In the in-vehicle information system 100 of Fig. 1, the controller 148 operates various systems, including, for example, a vehicle navigation system that uses the GPS 152, the wireless communication device 154, and the LCD display 124 or HUD 120 to perform vehicle navigation operations in response to the voice input from the user. In another configuration, the controller 148 plays music through the audio output device 132 in response to a voice command. In still another configuration, the system 100 uses the smartphone 170 or another network-connected device to place a hands-free phone call or transmit a text message based on the voice input from the user. Although Fig. 1 depicts an in-vehicle information system embodiment, other embodiments use automated systems that control the operation of various hardware components and software applications using audio input data.
Although Fig. 1 depicts the in-vehicle information system 100 as an illustrative example of an automated system that performs speech recognition to receive and execute commands from a user, a similar speech recognition process can be implemented in other contexts. For example, a mobile electronic device, such as the smartphone 170 or another suitable device, typically includes one or more microphones and a processor that can implement the speech recognition engines, the ranker, the stored trigger pairs, and the other components that implement the speech recognition and control systems. In another embodiment, a home automation system controls the HVAC and appliances in a house using at least one computing device that receives voice input from the user and performs speech recognition using multiple speech recognition engines to control the operation of the various automated systems in the house. In each instance, the system is optionally configured to use different sets of domain-specific speech recognition engines that are tailored to the specific applications and operations of the different automated systems.
Training system and process for training the neural network ranker
In the system 100 of Fig. 1 and the speech recognition process of Fig. 2, the neural network ranker 166 is a trained feedforward deep neural network. The neural network ranker 166 is trained prior to the operation of the system 100 for performing the speech recognition process described above. Fig. 3 depicts an illustrative embodiment of a computerized system 300 that is configured to train the neural network ranker 166, and Fig. 4 depicts a training process 400 for generating the trained neural network ranker 166.
The system 300 includes a processor 304 and a memory 320. The processor 304 includes, for example, one or more CPU cores that are optionally coupled to parallelized hardware accelerators designed to train neural networks in a time- and power-efficient manner. Examples of such accelerators include GPUs with compute shader units configured for neural network training, and specially programmed FPGA chips or ASIC hardware dedicated to training neural networks. In some embodiments, the processor 304 further includes a cluster of computing devices that operate in parallel to perform the neural network training process.
The memory 320 includes, for example, a non-volatile solid-state or magnetic data storage device and a volatile data storage device, such as random access memory (RAM), that store programmed instructions for the operation of the system 300. In the configuration of Fig. 3, the memory 320 stores data corresponding to training input data 324, a stochastic gradient descent trainer 328 for the neural network, the neural network ranker 332, and the feature extractor 164.
The training data 324 includes, for example, a large set of speech recognition results produced for a large predetermined set of inputs by the same speech recognition engines 162 that are in the system 100, optionally including hybrid speech recognition results. The training speech recognition result data also includes the confidence scores for the training speech recognition results. For each speech recognition result, the training data further includes a Levenshtein distance metric that quantifies the differences between the speech recognition result and predetermined ground-truth voice input training data, which represents the canonically "correct" result in the training process. The Levenshtein distance metric is one example of an "edit distance" metric, since the metric quantifies the amount of changes (edits) needed to transform the speech recognition result from a speech recognition engine into the actual input of the training data. Both the speech recognition result and the ground-truth voice input training data are referred to as "strings" of text in the comparison metric. For example, the edit distance quantifies the amount of change required to convert the speech recognition result string "Sally shells sea sells by the seashore" into the corresponding correct ground-truth training data string "Sally sells sea shells by the seashore".
The Levenshtein distance metric is known to the art in other contexts and has several properties, including: (1) the Levenshtein distance is always at least the difference of the sizes of the two strings; (2) the Levenshtein distance is at most the length of the longer string; (3) the Levenshtein distance is zero if and only if the strings are equal; (4) if the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance; and (5) the Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (the triangle inequality). The Hamming distance, in turn, refers to a metric of the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. Although the system 300 includes training data encoded with the Levenshtein distance for illustrative purposes, in alternative embodiments another edit distance metric is used to describe the differences between the training speech recognition results and the corresponding ground-truth training inputs.
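The Levenshtein distance described above has a standard two-row dynamic-programming implementation, sketched below. Passing token lists instead of strings gives a word-level edit distance, which may be the more natural granularity for speech recognition results, though the text does not specify which granularity the training data uses.

```python
def levenshtein(a, b):
    """Minimum number of single-element insertions, deletions, and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]
```

For instance, levenshtein("kitten", "sitting") is 3, and the result is always at least the difference in the two lengths, consistent with property (1) above.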
In the embodiment of Fig. 3, the feature extractor 164 in the memory 320 is the same feature extractor 164 that is used in the system 100 described above. In particular, the processor 304 uses the feature extractor 164 to generate a feature vector from each of the training speech recognition results using one or more of the trigger-word-pair, confidence score, and word-level features described above.
The stochastic gradient descent trainer 328 includes the stored program instructions and parameter data for a neural network training process that the processor 304 performs to train the neural network ranker 332 based on the feature vectors that the feature extractor 164 generates from the training data 324. As is known in the art, a stochastic gradient descent trainer includes a class of related training processes that train a neural network iteratively by adjusting the parameters in the neural network to minimize the distance (error) between the output of the neural network and a predetermined target function, which is also referred to as an "objective" function. Although stochastic gradient descent training is generally known to the art and is not discussed in greater detail herein, the system 300 modifies the standard prior-art training process. In particular, the training process seeks to minimize the error between the output that the neural network generates from the training data inputs and the target result from the predetermined training data. In prior-art training processes, the target value typically specifies that a given output is binary "correct" or "incorrect"; such a target output from the neural network ranker would provide a score indicating that the feature vector input for a training speech recognition result is either 100% correct or incorrect to some degree when compared with the ground-truth input in the training data. In the system 300, however, the stochastic gradient descent trainer 328 uses the edit distance target data in the training data 324 as "soft" targets that more accurately reflect the correctness level of the different training speech recognition results, where the correctness level can include a continuous range of error values that influence the ranking score, rather than being only completely correct or incorrect.
The processor 304 uses the "soft" target data in the objective function to perform the training process using the stochastic gradient descent trainer 328. For example, the configuration of Fig. 3 uses a "softmax" target function of the following form: target_i = e^(−d_i) / Σ_j e^(−d_j), where d_i is the edit distance for a given training speech recognition result i. During the training process, the gradient descent trainer 328 performs a cost minimization process, where the "cost" refers to the cross entropy between the output values of the neural network ranker 332 and the target values generated by the target function during each iteration of the training process. The processor 304 provides batches of samples to the gradient descent trainer 328 during the training process, for example batches of 180 training inputs that each include different training speech recognition results generated by the multiple speech recognition engines. The iterative process continues until the cross entropy of the training set has not improved for ten iterations, and the trained neural network parameters that produce the lowest overall cross entropy from all of the training data form the final trained neural network.
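The soft-target construction and the cross-entropy cost above can be sketched directly. This is a minimal illustration of the formulas only, with assumed function names; it omits the batching, early stopping, and gradient updates that the trainer 328 performs.

```python
import math

def soft_targets(edit_distances):
    """Softmax over negated edit distances: results closer to the ground
    truth receive a higher target value, and the targets sum to 1."""
    weights = [math.exp(-d) for d in edit_distances]
    total = sum(weights)
    return [w / total for w in weights]

def cross_entropy(targets, outputs, eps=1e-12):
    """Training cost between the soft targets and the ranker's outputs."""
    return -sum(t * math.log(o + eps) for t, o in zip(targets, outputs))
```

For two results with edit distances 0 and 1, the targets are roughly 0.73 and 0.27: the closer result dominates, but the imperfect result still contributes, unlike a binary correct/incorrect target.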
During the training process, the processor 304 shuffles the same input feature vectors between the different sets of input neurons in the neural network ranker 332 during different iterations of the training process, to ensure that the position of a particular feature vector in the input layer of the neural network does not produce an incorrect bias in the trained neural network. As described above for the inference process, if a particular set of training data does not include a sufficient number of candidate speech recognition results to provide inputs to all of the neurons in the input layer of the neural network ranker 332, the processor 304 generates "empty" input feature vectors with zero inputs. As is known in the art, the stochastic gradient descent training process includes numeric training parameters, and in one configuration of the system 300 the hyperparameters of the stochastic gradient descent trainer 328 are α = 0.001, β1 = 0.9, and β2 = 0.999.
The neural network ranker 332 is, in one embodiment, a feedforward deep neural network with the structure of the neural network 550 depicted in Fig. 5. During operation, the processor 304 generates the structure of the untrained neural network ranker 332 with a predetermined number of neurons, based on the number of neurons in the input layer 554 of the neural network 550 in Fig. 5 and the number of output neurons in the output layer 566 for the total of n candidate speech recognition results that are each provided as an input to the neural network for the inference process. The processor 304 also generates a suitable number of neurons in the k hidden layers 562 of the neural network 550. In one embodiment, the processor 304 initializes the neural network structure with randomized weight values for each of the inputs to the neurons. As described above, during the training process the processor 304 adjusts the various weights and bias values for the neurons in the input layer 554 and the hidden layers 562 of the neural network, along with the parameters of the activation functions in the neurons of the output layer 566, to minimize the cross entropy of the output from the neural network ranker 332 compared to the target function for a given set of inputs.
Although Fig. 3 depicts a specific configuration of a computerized device 300 that generates a trained neural network ranker, in some embodiments the same system that uses the trained neural network ranker in a speech recognition process is further configured to train the neural network ranker. For example, in some embodiments the controller 148 in system 100 is an example of a processor that can be configured to perform the neural network training process.
Fig. 4 depicts a process 400 for performing speech recognition by using multiple speech recognition engines and a neural network ranker to select a candidate speech recognition result. In the description below, a reference to process 400 performing a function or action refers to the operation of a processor executing stored program instructions, in association with the other components of an automated system, to perform that function or action. Process 400 is described in conjunction with the system 300 of Fig. 3 for illustrative purposes.
Process 400 begins as system 300 generates a plurality of feature vectors, each corresponding to one of a plurality of training speech recognition results stored in the training data 324 (block 404). In system 300, processor 304 generates the plurality of feature vectors using the feature extractor 164, where each feature vector corresponds to one training speech recognition result in the training data 324. As described above, in at least one embodiment of process 400, processor 304 generates each feature vector to include one or more of the following: trigger-pair features, a confidence score, and word-level features, where the word-level features include bag-of-words features with decay.
As part of the feature extraction and feature generation process, in some embodiments processor 304 generates a feature vector structure that includes specific elements mapped to the trigger-pair features and the word-level features. For example, as described above in connection with system 100, in some embodiments processor 304 generates the feature vectors with a structure corresponding to only a portion of the words observed in the training data 324, such as the 90% most commonly observed words, while the remaining 10% of words that occur with the lowest frequencies are not encoded into the structure of the feature vector. Processor 304 optionally identifies the most common trigger-pair features and generates the structure for the most commonly observed trigger word pairs present in the training data 324. In embodiments in which system 300 generates the structure of the feature vectors during process 400, processor 304 stores the structure of the feature vectors with the feature extractor data 164, and after the training process is complete, the structure of the feature vectors is provided to the automated system together with the neural network ranker 332; the automated system uses feature vectors with the specified structure as inputs to the trained neural network to generate ranking scores for candidate speech recognition results. In other embodiments, the structure of the feature vectors is determined a priori based on a natural language, such as English or Chinese, rather than on the specific content of the training data 324.
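A minimal sketch of the word-level feature extraction described above (the function and parameter names are hypothetical, and the exact positional decay formula used by the patent may differ):

```python
import numpy as np

def extract_features(words, vocab, trigger_pairs, decay=0.9):
    """Build a feature vector from word-level features.

    vocab         : words kept in the structure (e.g. the most frequent 90%)
    trigger_pairs : sequence of (word_a, word_b) predetermined trigger pairs
    decay         : attenuation applied by word position (bag-of-words with decay)
    """
    bow = np.zeros(len(vocab))
    index = {w: i for i, w in enumerate(vocab)}
    for pos, w in enumerate(words):
        if w in index:                # lowest-frequency words are not encoded
            bow[index[w]] += decay ** pos
    trig = np.array([1.0 if (a in words and b in words) else 0.0
                     for (a, b) in trigger_pairs])
    return np.concatenate([bow, trig])
```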
Process 400 continues as system 300 trains the neural network ranker 332 using the stochastic gradient descent trainer 328, based on the feature vectors of the training speech recognition results and the soft-target edit distance data from the training data 324 (block 408). During the training process, processor 304 uses the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker, and trains the neural network ranker 332 through a cost minimization process between the plurality of output scores generated by the neural network ranker during the training process and a target function having the soft scores described above, where the soft scores are based on the predetermined edit distances between each of the plurality of training speech recognition results and the predetermined correct input for that training speech recognition result. During process 400, processor 304 iteratively modifies the input weighting coefficients and neuron bias values in the input and hidden layers of the neural network ranker 332, and adjusts the parameters of the activation functions in the output-layer neurons, using the stochastic gradient descent trainer 328.
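The soft-target construction can be illustrated as follows. The word-level Levenshtein distance is standard; the particular mapping from edit distance to a soft score and the normalization are illustrative assumptions, since the text only states that the targets are based on predetermined edit distances:

```python
import numpy as np

def levenshtein(a, b):
    """Word-level Levenshtein (edit) distance via dynamic programming."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # substitution
    return int(d[m, n])

def soft_targets(candidates, reference):
    """Map each candidate's edit distance to the correct transcript into a
    target distribution: a smaller distance yields a larger target score."""
    dists = np.array([levenshtein(c.split(), reference.split())
                      for c in candidates], dtype=float)
    scores = 1.0 / (1.0 + dists)   # illustrative distance-to-score mapping
    return scores / scores.sum()
```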
After the training process is complete, processor 304 stores the structure of the trained neural network ranker 332 and, in embodiments in which the structure of the feature vectors is generated based on the training data in memory 320, optionally stores the structure of the feature vectors as well (block 412). The stored structure of the neural network ranker 332 and the feature vector structure are then transferred to other automated systems, such as the system 100 of Fig. 1, which use the trained neural network ranker 332 and the feature extractor 164 to rank multiple candidate speech recognition results during speech recognition operations.
It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may desirably be combined into many other different systems, applications, or methods. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements may be subsequently made by those skilled in the art, which are also intended to be encompassed by the appended claims.

Claims (17)

1. A method for speech recognition in an automated system, comprising:
generating, with a controller, a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in a plurality of candidate speech recognition results, wherein generating a first feature vector in the plurality of feature vectors for a first candidate speech recognition result in the plurality of candidate speech recognition results further comprises:
identifying, with the controller, at least one trigger pair in the first candidate speech recognition result with reference to a plurality of predetermined trigger pairs stored in a memory, the at least one trigger pair including two predetermined trigger words; and
generating, with the controller, the first feature vector including an element for the at least one trigger pair;
providing, with the controller, the plurality of feature vectors as inputs to a neural network;
generating, with the controller, a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network; and
operating, with the controller, the automated system using, as an input, the candidate speech recognition result in the plurality of candidate speech recognition results that corresponds to the highest ranking score in the plurality of ranking scores.
2. The method of claim 1, wherein generating each feature vector in the plurality of feature vectors further comprises:
generating, with the controller, each feature vector including an element for one confidence score in a plurality of confidence scores, each confidence score being associated with the candidate speech recognition result corresponding to each feature vector.
3. The method of claim 2, further comprising:
performing, with the controller, a linear regression process based on the plurality of confidence scores to generate a plurality of normalized confidence scores for the plurality of feature vectors, the plurality of normalized confidence scores being based on the confidence score of one predetermined candidate speech recognition result in the plurality of speech recognition results.
4. The method of claim 1, wherein generating the first feature vector further comprises:
identifying, with the controller, a plurality of unique words in the first candidate speech recognition result, including a frequency of occurrence of each unique word in the plurality of words and at least one position of each unique word within the first candidate speech recognition result;
generating, with the controller, a plurality of bag-of-words with decay parameters, each bag-of-words with decay parameter corresponding to one unique word in the plurality of unique words based on the frequency and the at least one position of the one unique word and a predetermined decay parameter; and
generating, with the controller, the first feature vector including an element for each bag-of-words with decay parameter in the plurality of bag-of-words with decay parameters.
5. The method of claim 1, wherein providing the plurality of feature vectors to the neural network further comprises:
providing, with the controller, the plurality of feature vectors as inputs to a feedforward deep neural network.
6. The method of claim 1, further comprising:
generating, with an audio input device, audio input data corresponding to speech input from a user; and
generating, with the controller, the plurality of candidate speech recognition results corresponding to the audio input data using a plurality of speech recognition engines.
7. A method for training a neural network ranker, comprising:
generating, with a processor, a plurality of feature vectors, each feature vector corresponding to one training speech recognition result in a plurality of training speech recognition results stored in a memory, wherein generating a first feature vector in the plurality of feature vectors for a first training speech recognition result in the plurality of training speech recognition results further comprises:
identifying, with the processor, at least one trigger pair in the first training speech recognition result with reference to a plurality of predetermined trigger pairs stored in the memory, the at least one trigger pair including two predetermined trigger words; and
generating, with the processor, the first feature vector including an element for the at least one trigger pair;
performing, with the processor, a training process for the neural network ranker using the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker, a plurality of output scores generated by the neural network ranker during the training process, and a plurality of target results based on predetermined edit distances between the plurality of training speech recognition results and the predetermined correct input for each training speech recognition result in the plurality of speech recognition results; and
storing, with the processor, the neural network ranker in the memory after completion of the training process, for use in generating ranking scores corresponding to additional feature vectors for speech recognition results that are not present in the plurality of training speech recognition results.
8. The method of claim 7, wherein generating the first feature vector further comprises:
generating, with the processor, the feature vector including an element for a confidence score associated with the first training speech recognition result.
9. The method of claim 7, wherein generating the first feature vector further comprises:
identifying, with the processor, a plurality of unique words in the first training speech recognition result, including a frequency of occurrence of each unique word in the plurality of words and at least one position of each unique word within the first training speech recognition result;
generating, with the processor, a plurality of bag-of-words with decay parameters, each bag-of-words with decay parameter corresponding to one unique word in the plurality of unique words based on the frequency and the at least one position of the one unique word and a predetermined decay parameter; and
generating, with the processor, the first feature vector including an element for each bag-of-words with decay parameter in the plurality of bag-of-words with decay parameters.
10. The method of claim 7, wherein the training process further comprises:
generating, with the processor, the trained neural network using a stochastic gradient descent training process.
11. The method of claim 7, wherein the training further comprises:
performing, with the processor, the training process for the neural network ranker, the plurality of target results being based on Levenshtein distances between the plurality of training speech recognition results and the predetermined correct input for each training speech recognition result in the plurality of speech recognition results.
12. A system for automated speech recognition, comprising:
a memory configured to store:
a plurality of predetermined trigger pairs, each trigger pair including two words; and
a neural network configured to generate ranking scores corresponding to a plurality of candidate speech recognition results; and
a controller operatively connected to the memory, the controller being configured to:
generate a plurality of feature vectors, each feature vector corresponding to one candidate speech recognition result in the plurality of candidate speech recognition results, wherein for the generation of a first feature vector in the plurality of feature vectors for a first candidate speech recognition result in the plurality of candidate speech recognition results the controller is further configured to:
identify at least one trigger pair in the first candidate speech recognition result with reference to the plurality of predetermined trigger pairs stored in the memory, the at least one trigger pair including two predetermined trigger words; and
generate the first feature vector including an element for the at least one trigger pair;
provide the plurality of feature vectors as inputs to the neural network;
generate a plurality of ranking scores corresponding to the plurality of feature vectors for the plurality of candidate speech recognition results based on an output layer of the neural network; and
operate the automated system using, as an input, the candidate speech recognition result in the plurality of candidate speech recognition results that corresponds to the highest ranking score in the plurality of ranking scores.
13. The system of claim 12, wherein the controller is further configured to:
generate each feature vector including an element for one confidence score in a plurality of confidence scores, each confidence score being associated with the candidate speech recognition result corresponding to each feature vector.
14. The system of claim 13, wherein the controller is further configured to:
perform a linear regression process based on the plurality of confidence scores to generate a plurality of normalized confidence scores for the plurality of feature vectors, the normalized confidence scores being based on the confidence score of one predetermined candidate speech recognition result in the plurality of speech recognition results.
15. The system of claim 12, wherein the controller is further configured to:
identify a plurality of unique words in the first candidate speech recognition result, including a frequency of occurrence of each unique word in the plurality of words and at least one position of each unique word within the first candidate speech recognition result;
generate a plurality of bag-of-words with decay parameters, each bag-of-words with decay parameter corresponding to one unique word in the plurality of unique words based on the frequency and the at least one position of the one unique word and a predetermined decay parameter; and
generate the first feature vector including an element for each bag-of-words with decay parameter in the plurality of bag-of-words with decay parameters.
16. The system of claim 12, wherein the neural network in the memory is a feedforward deep neural network, and the controller is further configured to:
provide the plurality of feature vectors as inputs to the feedforward deep neural network.
17. The system of claim 12, further comprising:
an audio input device; and
wherein the controller is operatively connected to the audio input device and is further configured to:
generate, with the audio input device, audio input data corresponding to speech input from a user; and
generate the plurality of candidate speech recognition results corresponding to the audio input data using a plurality of speech recognition engines.
CN201780070915.1A 2016-11-17 2017-11-15 System and method for ranking mixed speech recognition results using neural networks Active CN109923608B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/353767 2016-11-17
US15/353,767 US10170110B2 (en) 2016-11-17 2016-11-17 System and method for ranking of hybrid speech recognition results with neural networks
PCT/EP2017/079272 WO2018091501A1 (en) 2016-11-17 2017-11-15 System and method for ranking of hybrid speech recognition results with neural networks

Publications (2)

Publication Number Publication Date
CN109923608A true CN109923608A (en) 2019-06-21
CN109923608B CN109923608B (en) 2023-08-01

Family

ID=60327326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780070915.1A Active CN109923608B (en) 2016-11-17 2017-11-15 System and method for ranking mixed speech recognition results using neural networks

Country Status (5)

Country Link
US (1) US10170110B2 (en)
JP (1) JP6743300B2 (en)
CN (1) CN109923608B (en)
DE (1) DE112017004397B4 (en)
WO (1) WO2018091501A1 (en)

US11200900B2 (en) 2019-12-20 2021-12-14 Sonos, Inc. Offline voice control
US11562740B2 (en) 2020-01-07 2023-01-24 Sonos, Inc. Voice verification for media playback
US11308958B2 (en) 2020-02-07 2022-04-19 Sonos, Inc. Localized wakeword verification
US11494593B2 (en) * 2020-03-18 2022-11-08 Walmart Apollo, Llc Methods and apparatus for machine learning model hyperparameter optimization
US11688219B2 (en) * 2020-04-17 2023-06-27 Johnson Controls Tyco IP Holdings LLP Systems and methods for access control using multi-factor validation
KR20210136463A (en) * 2020-05-07 2021-11-17 삼성전자주식회사 Electronic apparatus and controlling method thereof
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11038934B1 (en) 2020-05-11 2021-06-15 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11482224B2 (en) 2020-05-20 2022-10-25 Sonos, Inc. Command keywords with input detection windowing
CN113486924A (en) * 2020-06-03 2021-10-08 谷歌有限责任公司 Object-centric learning with slot attention
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
US11829720B2 (en) 2020-09-01 2023-11-28 Apple Inc. Analysis and validation of language models
CN112466280B (en) * 2020-12-01 2021-12-24 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and readable storage medium
WO2022203701A1 (en) * 2021-03-23 2022-09-29 Google Llc Recurrent neural network-transducer model for performing speech recognition
CN113948085B (en) * 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020091522A1 (en) * 2001-01-09 2002-07-11 Ning Bi System and method for hybrid voice recognition
CN1454381A (en) * 2000-09-08 2003-11-05 高通股份有限公司 Combining DTW and HMM in speaker dependent and independent modes for speech recognition
CN102138175A (en) * 2008-07-02 2011-07-27 谷歌公司 Speech recognition with parallel recognition tasks
JP2011237621A (en) * 2010-05-11 2011-11-24 Honda Motor Co Ltd Robot
CN104143330A (en) * 2013-05-07 2014-11-12 佳能株式会社 Voice recognizing method and voice recognizing system
US20150112685A1 (en) * 2013-10-18 2015-04-23 Via Technologies, Inc. Speech recognition method and electronic apparatus using the method
CN104795069A (en) * 2014-01-21 2015-07-22 腾讯科技(深圳)有限公司 Speech recognition method and server
US9153231B1 (en) * 2013-03-15 2015-10-06 Amazon Technologies, Inc. Adaptive neural network speech recognition models

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004272134A (en) * 2003-03-12 2004-09-30 Advanced Telecommunication Research Institute International Speech recognition device and computer program
US8812321B2 (en) 2010-09-30 2014-08-19 At&T Intellectual Property I, L.P. System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning
US9916538B2 (en) * 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
JP6155592B2 (en) 2012-10-02 2017-07-05 株式会社デンソー Speech recognition system
JP6047364B2 (en) * 2012-10-10 2016-12-21 日本放送協会 Speech recognition apparatus, error correction model learning method, and program
US9519858B2 (en) * 2013-02-10 2016-12-13 Microsoft Technology Licensing, Llc Feature-augmented neural networks and applications of same
US9484023B2 (en) 2013-02-22 2016-11-01 International Business Machines Corporation Conversion of non-back-off language models for efficient speech decoding
US9058805B2 (en) * 2013-05-13 2015-06-16 Google Inc. Multiple recognizer speech recognition
JP5777178B2 (en) * 2013-11-27 2015-09-09 国立研究開発法人情報通信研究機構 Statistical acoustic model adaptation method, acoustic model learning method suitable for statistical acoustic model adaptation, storage medium storing parameters for constructing a deep neural network, and computer program for statistical acoustic model adaptation
US9520127B2 (en) * 2014-04-29 2016-12-13 Microsoft Technology Licensing, Llc Shared hidden layer combination for speech recognition systems
US9679558B2 (en) * 2014-05-15 2017-06-13 Microsoft Technology Licensing, Llc Language modeling for conversational understanding domains using semantic web resources
WO2016167779A1 (en) * 2015-04-16 2016-10-20 Mitsubishi Electric Corporation Speech recognition device and rescoring device
EP3284084A4 (en) * 2015-04-17 2018-09-05 Microsoft Technology Licensing, LLC Deep neural support vector machines
US10127220B2 (en) * 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956621A (en) * 2019-11-27 2020-04-03 北京航空航天大学合肥创新研究院 Method and system for detecting tissue canceration based on neural network
CN110956621B (en) * 2019-11-27 2022-09-13 北京航空航天大学合肥创新研究院 Method and system for detecting tissue canceration based on neural network
CN113112827A (en) * 2021-04-14 2021-07-13 深圳市旗扬特种装备技术工程有限公司 Intelligent traffic control method and intelligent traffic control system
CN113112827B (en) * 2021-04-14 2022-03-25 深圳市旗扬特种装备技术工程有限公司 Intelligent traffic control method and intelligent traffic control system

Also Published As

Publication number Publication date
US20180137857A1 (en) 2018-05-17
US10170110B2 (en) 2019-01-01
DE112017004397T5 (en) 2019-05-23
WO2018091501A1 (en) 2018-05-24
DE112017004397B4 (en) 2022-10-20
JP6743300B2 (en) 2020-08-19
CN109923608B (en) 2023-08-01
JP2019537749A (en) 2019-12-26

Similar Documents

Publication Publication Date Title
CN109923608A (en) System and method for ranking hybrid speech recognition results using a neural network
US9959861B2 (en) System and method for speech recognition
US10977452B2 (en) Multi-lingual virtual personal assistant
CN111897964B (en) Text classification model training method, apparatus, device, and storage medium
JP7022062B2 (en) VPA with integrated object recognition and facial expression recognition
US11615785B2 (en) Speech recognition using natural language understanding related knowledge via deep feedforward neural networks
CN110136693A (en) System and method for neural voice cloning using a small number of samples
US8195459B1 (en) Augmentation and calibration of output from non-deterministic text generators by modeling its characteristics in specific environments
CN108846063A (en) Method, apparatus, device, and computer-readable medium for determining answers to questions
CN102142253B (en) Speech emotion recognition apparatus and method
CN110473523A (en) Speech recognition method, apparatus, storage medium, and terminal
Williams Multi-domain learning and generalization in dialog state tracking
CN108846077A (en) Semantic matching method, apparatus, medium, and electronic device for question-and-answer text
CN109887484A (en) Speech recognition and speech synthesis method and apparatus based on dual learning
CN110431626A (en) Hyperarticulation detection in repeated voice queries using pairwise comparison to improve speech recognition
CN110223714A (en) Voice-based emotion recognition method
CN113505591A (en) Slot identification method and electronic device
CN114830139A (en) Training models using model-provided candidate actions
CN106529525A (en) Chinese and Japanese handwritten character recognition method
Lippmann et al. LNKnet: neural network, machine-learning, and statistical software for pattern classification
Chai et al. Communication tool for the hard of hearings: A large vocabulary sign language recognition system
US11600263B1 (en) Natural language configuration and operation for tangible games
US11645947B1 (en) Natural language configuration and operation for tangible games
CN114333832A (en) Data processing method and device and readable storage medium
Schuller et al. Speech communication and multimodal interfaces
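The patent at the head of this page describes grading (ranking) hybrid speech recognition results — candidate transcriptions from multiple recognizers, each encoded as a feature vector — by scoring them with a neural network. The following minimal sketch illustrates that general idea only; the network shape, the random placeholder weights, and the example features (recognizer confidence, word count, language-model score) are assumptions for illustration, not the patented implementation:

```python
# Sketch: score candidate speech-recognition results with a tiny
# feed-forward network and return the highest-ranked candidate.
# All weights are random placeholders; a real system would train them.
import math
import random

random.seed(0)

def mlp_score(features, w1, b1, w2, b2):
    # One hidden layer with tanh activation, producing a scalar score.
    hidden = [math.tanh(sum(f * w for f, w in zip(features, row)) + b)
              for row, b in zip(w1, b1)]
    return sum(h * w for h, w in zip(hidden, w2)) + b2

def rank_candidates(candidates):
    # candidates: list of (text, feature_vector) pairs; the features here
    # are hypothetical (confidence, word count, language-model score).
    dim = len(candidates[0][1])
    hidden_size = 4
    w1 = [[random.uniform(-1, 1) for _ in range(dim)]
          for _ in range(hidden_size)]
    b1 = [0.0] * hidden_size
    w2 = [random.uniform(-1, 1) for _ in range(hidden_size)]
    b2 = 0.0
    scored = sorted(((mlp_score(f, w1, b1, w2, b2), text)
                     for text, f in candidates), reverse=True)
    return scored[0][1], scored

best, ranking = rank_candidates([
    ("play some jazz", [0.92, 3.0, 0.8]),  # e.g. on-device recognizer
    ("play some jags", [0.55, 3.0, 0.2]),  # e.g. cloud recognizer
])
print(best)
```

In a hybrid setup of this kind, the same scoring pass runs over every candidate regardless of which recognizer produced it, so the trained network (rather than a fixed rule) decides which recognizer's output is used.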

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant