CN106558306A - Method for voice recognition, device and equipment - Google Patents
- Publication number
- CN106558306A CN106558306A CN201610052812.XA CN201610052812A CN106558306A CN 106558306 A CN106558306 A CN 106558306A CN 201610052812 A CN201610052812 A CN 201610052812A CN 106558306 A CN106558306 A CN 106558306A
- Authority
- CN
- China
- Prior art keywords
- voice
- sound
- characteristics information
- unit
- equipment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING; G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
Abstract
A method for voice recognition includes: receiving voice information; extracting the voice feature information from the voice information; matching the voice feature information against the sound templates in a sound library; and, after a successful match, re-training the sound templates in the sound library using the voice feature information. Because the sound templates in the library are re-trained after every successful recognition, the templates become progressively richer, which greatly improves the recognition success rate. Some embodiments also disclose a device for speech recognition and equipment with a speech recognition function.
Description
Technical field
The invention belongs to the field of pattern recognition technology, and in particular relates to a method and device for voice recognition, and to equipment with a speech recognition function.
Background
Currently, smart devices such as tablet computers, smartphones and smart-home products are becoming increasingly popular and are gradually becoming standard equipment for households and individuals. Smart devices based on voice interaction are practical and have been widely applied to household appliances, in-car systems, mobile phones and the like. Many of these devices have a voice wake-up function, used to unlock the screen or as an auxiliary means of launching applications. Voice wake-up works as follows: while the device is in standby, a detector runs continuously in the background under very low power consumption, listening for a predefined wake-up word; when it detects the user saying this word, the device is woken up and put into its normal working state. However, the success rate of current speech recognition technology is still unsatisfactory and needs further improvement.
Summary of the invention
In view of this, an object of the present invention is to propose a method for voice recognition that improves the recognition success rate. To provide a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delimit the scope of protection of these embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the detailed description that follows.
In some optional embodiments, the method for voice recognition includes: receiving voice information; extracting the voice feature information from the voice information; matching the voice feature information against the sound templates in a sound library; and, after a successful match, re-training the sound templates in the sound library using the voice feature information. Because the sound templates are re-trained after every successful recognition, they become progressively richer, which greatly improves the recognition success rate.
Another object of the present invention is to propose a device for speech recognition.
In some optional embodiments, the device for speech recognition includes: a voice collecting unit that receives voice information; a feature extraction unit that extracts the voice feature information from the voice information; a voice recognition unit that matches the voice feature information against the sound templates in a sound library; and a retraining unit that, after the voice recognition unit reports a successful match, re-trains the sound templates in the sound library using the voice feature information.
A further object of the present invention is to propose equipment with a speech recognition function.
In some optional embodiments, the equipment with a speech recognition function includes a speech input device and the above device for speech recognition.
To achieve the foregoing and related ends, one or more embodiments include the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings describe certain illustrative aspects in detail, and these indicate only some of the various ways in which the principles of the embodiments may be employed. Other benefits and novel features will become apparent from the following detailed description considered in conjunction with the drawings, and the disclosed embodiments are intended to include all such aspects and their equivalents.
Description of the drawings
Fig. 1 shows an embodiment of the method for voice recognition;
Fig. 2 shows an embodiment of the device for speech recognition;
Fig. 3 shows another embodiment of the device for speech recognition.
Detailed description of the embodiments
The following description and drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process and other changes; the examples merely represent possible variations. Unless explicitly required, individual components and functions are optional, and the order of operations may vary. Portions and features of some embodiments may be included in, or substituted for, those of other embodiments. The scope of the embodiments of the present invention encompasses the entire scope of the claims and all available equivalents of the claims. Herein, these embodiments of the invention may be referred to, individually or collectively, by the term "invention" merely for convenience; if more than one invention is in fact disclosed, this is not intended to automatically limit the scope of the application to any single invention or inventive concept.
Fig. 1 shows an embodiment of the method for voice recognition:
Step 11: receive voice information;
Step 12: extract the voice feature information from the voice information;
Step 13: match the voice feature information against the sound templates in the sound library;
Step 14: after a successful match, re-train the sound templates in the sound library using the voice feature information.
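The four steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the feature vectors and the matching threshold are assumed inputs, and re-training is shown as a simple weighted average of the old template and the newly matched features (merging old and new templates by weighting is one of the update strategies this kind of system may use).

```python
import numpy as np

def recognize_and_retrain(features, templates, threshold, alpha=0.1):
    """Match a feature vector against templates; on success, update the winner.

    features  : 1-D feature vector extracted from the incoming voice signal
    templates : dict mapping word -> 1-D template vector (same length)
    threshold : maximum Euclidean distance accepted as a match
    alpha     : weight given to the new observation when re-training
    """
    # Step 13: find the closest template (smallest distortion measure).
    distances = {w: np.linalg.norm(features - t) for w, t in templates.items()}
    best = min(distances, key=distances.get)
    if distances[best] > threshold:
        return None  # no match: leave the sound library unchanged
    # Step 14: re-train the matched template toward the new observation,
    # so the library gradually adapts to the user's actual voice.
    templates[best] = (1 - alpha) * templates[best] + alpha * features
    return best
```

Each successful call nudges the winning template toward the latest utterance, which is how the library "becomes progressively richer" over repeated use.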
For speech recognition technology, the basic method of recognizing meaningful, substantive voice information is currently as follows. Voice feature information is analyzed in advance and stored by the machine as required; the voice feature information in this speech parameter library is called a "template" (template-based approach), and the process of producing it is called "training". The unknown voice to be recognized (the test voice) is converted into an electrical signal and, after preprocessing, acoustic modeling and feature extraction, yields voice feature information. This is compared one by one with the sound templates in the sound library, and a matching method is used to find the template closest to the voice features, giving the recognition result; this process is called "recognition". Of course, the comparison needs a criterion, namely a "distortion measure" between speech parameter vectors: the content represented by the template with the smallest distortion is the recognition result.
The speech recognition process is generally divided into two stages: a training stage and a recognition stage. The task of the former is to establish the speech model of the basic recognition unit and the language model; the latter compares the speech feature parameters of the target voice with the sound templates to obtain the recognition result.
Acoustic model
The acoustic model is the underlying model of the recognition system and the most critical part of a speech recognition system. Its goal is to provide an effective way to compute the distance between the feature vector sequence of the voice and each sound template. The design of the acoustic model is closely related to the pronunciation characteristics of the language. The size of the modeling unit (word model, syllable model, semi-syllable model or phoneme model) has a large influence on the amount of training data required, the recognition rate and the flexibility of the system. For recognition systems with medium or larger vocabularies, a small recognition unit means a small amount of computation, small model storage and a small training-data requirement, but it brings difficulties in locating and segmenting the corresponding speech segments and more complex recognition model rules. A large recognition unit, in contrast, easily captures coarticulation within the model, which helps improve the recognition rate but requires correspondingly more training data.
Language model
The language model (Language Model, LM) generally refers to the linguistic rules that constrain words and search paths during matching. It is knowledge that effectively combines syntax and semantics during speech recognition, improving the recognition rate and reducing the search space. Because it is difficult to determine word boundaries accurately, and because the acoustic model has limited ability to describe pronunciation variation, recognition produces many word sequences with similar probability scores. Practical speech recognition systems therefore usually use a language model to select the most likely word sequence from the many candidate results, compensating for the weakness of the acoustic model.
Language models can be divided into rule-based language models and statistical language models. A rule-based language model summarizes grammatical or even semantic rules and then uses them to exclude acoustic recognition results that violate those rules. A statistical language model describes the dependencies between words through statistical probabilities, indirectly encoding grammatical or semantic rules. Rule-based language models work well in task-specific systems and can substantially improve the recognition rate, but since everyday spoken dialogue cannot be described by rigid rules, large-vocabulary speech recognition systems mainly use statistical language models.
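As an illustration of the statistical approach, a bigram model estimated from counts can score candidate word sequences so that the more plausible one wins. The add-one smoothing and the tiny training corpus below are assumptions of this sketch, not details from the text:

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized training sentences."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words          # sentence-start marker
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
    return uni, bi

def score(words, uni, bi, vocab_size):
    """Product of add-one-smoothed bigram probabilities P(w_i | w_{i-1})."""
    padded = ["<s>"] + words
    p = 1.0
    for a, b in zip(padded, padded[1:]):
        p *= (bi[(a, b)] + 1) / (uni[a] + vocab_size)
    return p
```

In a recognizer, this score would be combined with the acoustic score to pick the best word sequence among the similarly scored acoustic candidates.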
Feature extraction
Feature extraction aims to extract from the speech waveform the important information that reflects the characteristics of the speech, and to remove relatively irrelevant information. It is both a process of significant information compression and a process of signal deconvolution. Because of the time-varying nature of the speech signal, feature extraction must be carried out on short segments of the signal, i.e. as short-time analysis. The two most widely used feature extraction techniques at present are linear prediction cepstral coefficients (LPCC), based on a vocal-tract model, and Mel-frequency cepstral coefficients (MFCC), based on the auditory mechanism. The basic idea of the former is that adjacent samples of the speech signal are strongly correlated, so each sample can be approximated by a weighted linear combination of several preceding samples. The latter fully takes into account the auditory properties of the human ear, characterizing with an objective measure people's subjective perception of loudness. By comparison, MFCC has certain advantages: 1. the information of speech is concentrated mainly in the low-frequency part, while the high-frequency part is easily disturbed by ambient noise; MFCC emphasizes the low-frequency information of speech, thereby highlighting information useful for recognition and shielding noise interference; 2. MFCC makes no model assumptions and can be used in all situations; its recognition performance and noise robustness (i.e. insensitivity to noise characteristics or parameters) are better than those of LPCC.
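The linear-prediction idea behind LPCC (each sample approximated as a weighted combination of the preceding samples) can be sketched by solving the autocorrelation normal equations over one analysis frame. This is a minimal illustration only; a real LPCC front-end would add pre-emphasis, windowing and a cepstral transform:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Return LPC coefficients a[1..order] minimizing the prediction error
    e[n] = x[n] - sum_k a[k] * x[n-k] over one short-time analysis frame."""
    n = len(frame)
    # Autocorrelation of the frame at lags 0..order.
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:], solved directly here; the
    # Levinson-Durbin recursion would be the efficient classical choice.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])
```

On a frame generated by a first-order recursion x[n] = 0.9 x[n-1], an order-1 fit recovers a coefficient close to 0.9, which is exactly the "weighted sum of previous samples" property the text describes.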
As a rule, the test voice is preprocessed before feature extraction, partially removing the influence of noise and of speaker differences so that the processed signal better reflects the essential characteristics of the speech. The most common preprocessing steps are endpoint detection and speech enhancement. Endpoint detection means distinguishing the speech and non-speech periods in the signal and accurately determining the starting point of the speech. After endpoint detection, subsequent processing can be applied to the speech signal only, which plays an important role in improving model accuracy and the recognition correct rate. The main task of speech enhancement is to eliminate the influence of ambient noise on the speech. The common method at present is Wiener filtering, which performs better than other filters when the noise is strong.
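A classic way to realize endpoint detection is short-time energy thresholding. The sketch below is an assumption of this edit, not the patent's method: it treats the first few frames as noise-only and marks speech as starting at the first frame whose energy exceeds a multiple of that noise floor:

```python
import numpy as np

def detect_start(signal, frame_len=160, noise_frames=5, factor=4.0):
    """Return the sample index where speech is deemed to start, or None.

    The first `noise_frames` frames are assumed to contain background
    noise only; the energy threshold is `factor` times their mean energy.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    threshold = factor * energy[:noise_frames].mean()
    for i, e in enumerate(energy):
        if e > threshold:
            return i * frame_len
    return None
```

Everything before the returned index can then be discarded, so the matcher only ever sees the speech portion of the signal.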
Pattern matching
Pattern matching, also called similarity measurement, means finding, according to a certain criterion, the best match between the unknown voice and a sound template in the sound library. Specifically, pattern matching compares, by a similarity measure, the feature vectors of the voice to be recognized with the sound templates in the library, and outputs the class of the most similar sound template as the intermediate candidate recognition result.
The speech recognition process is essentially one of collecting voice information, comparing and matching it with the sound templates in the sound library, and outputting the closest result. To accomplish correct recognition, however, the concrete operation must be supported by a suitable algorithm.
One optional speech recognition algorithm is the pattern-matching method of dynamic time warping (DTW, Dynamic Time Warping). In this method, several recordings of the same wake-up word are made in advance, and training yields several sound templates of the wake-up word together with the sound library. During recognition, the collected voice is dynamically matched against each sound template, and the matching distance is compared with a preset threshold; when the distance is below the threshold, the match succeeds.
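DTW aligns two feature sequences of different lengths by dynamic programming. The following sketch computes the classic cumulative alignment distance, a generic textbook DTW offered as an illustration rather than the patented variant:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Cumulative DTW distance between two sequences of feature vectors."""
    a = np.asarray(seq_a, dtype=float)
    b = np.asarray(seq_b, dtype=float)
    if a.ndim == 1:                      # allow sequences of scalars
        a, b = a[:, None], b[:, None]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def is_wake_word(features, templates, threshold):
    """Match succeeds when the closest template is within the threshold."""
    return min(dtw_distance(features, t) for t in templates) < threshold
```

Because the warping path may stay on one axis, a template spoken slightly slower or faster than the test utterance still aligns with low cost, which is exactly why DTW suits small-template wake-word matching.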
Another optional speech recognition algorithm is the method based on the log-likelihood ratio (LLR, log likelihood ratio), which is a model-based method. First, a hidden Markov model (HMM, Hidden Markov Model) of the wake-up word is trained from recordings of many people saying the same wake-up word, and several background templates are trained as well. During matching, the Viterbi algorithm is used to force-align the voice with the model states, yielding a log-likelihood; at the same time, the voice is scored with the background model, yielding a maximum reference likelihood value. The ratio of the log-likelihood to the maximum reference likelihood value is compared with a preset threshold; when the ratio exceeds the threshold, the match succeeds.
A further optional speech recognition algorithm is based on the log-likelihood alone. It is similar to the LLR method above, except that no background model is needed: the wake-up-word model is force-aligned with the voice directly to obtain the log-likelihood score of the optimal path, and when the score exceeds a preset threshold, the match succeeds.
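Viterbi forced alignment against a left-to-right model can be sketched as follows. The spherical-Gaussian state emissions and the stay-or-advance transition scheme are assumptions of this illustration, not details given in the text:

```python
import numpy as np

def forced_alignment_score(features, state_means, state_var=1.0):
    """Best-path log-likelihood of aligning `features` (T x d) to a
    left-to-right sequence of Gaussian states (N x d), Viterbi-style.
    Allowed moves per frame: stay in the current state or advance by one."""
    feats = np.atleast_2d(np.asarray(features, dtype=float))
    means = np.atleast_2d(np.asarray(state_means, dtype=float))
    T, N = len(feats), len(means)
    # Log-density of each frame under each state (spherical Gaussian;
    # constant terms dropped, since only score comparisons matter).
    ll = -0.5 * ((feats[:, None, :] - means[None, :, :]) ** 2).sum(-1) / state_var
    score = np.full(N, -np.inf)
    score[0] = ll[0, 0]                  # alignment must start in state 0
    for t in range(1, T):
        stay = score
        advance = np.concatenate(([-np.inf], score[:-1]))
        score = np.maximum(stay, advance) + ll[t]
    return score[-1]                     # ...and end in the last state
```

The threshold test of the log-likelihood method then reduces to `forced_alignment_score(...) > threshold`; the LLR variant would additionally subtract a background-model score before comparing.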
When the method for above-described embodiment is used for terminal, after user says wake-up word, mobile terminal match cognization goes out to wake up
After word, mobile terminal will be waken up, i.e., mode of operation will be switched to from battery saving mode.
Fig. 2 shows an embodiment of the device for speech recognition. The device includes a voice collecting unit S21 that receives voice information, a feature extraction unit S22 that extracts the voice feature information from the voice information, a voice recognition unit S23 that matches the voice feature information against the sound templates in the sound library, and a retraining unit S24. The retraining unit S24 re-trains the sound templates in the sound library using the voice feature information after the voice recognition unit S23 reports a successful match.
Fig. 3 shows another embodiment of the device for speech recognition. The device includes a voice collecting unit S21 that receives voice information, a preprocessing unit S31 that preprocesses the voice information, a feature extraction unit S22 that extracts the voice feature information from the voice information, a voice recognition unit S23 that matches the voice feature information against the sound templates in the sound library, and a retraining unit S24.
In some optional embodiments, the voice recognition unit S23 contains one of the following computing units: a dynamic time warping algorithm unit, a log-likelihood ratio algorithm unit, or a log-likelihood algorithm unit.
Equipment with a speech recognition function is also proposed herein. In one embodiment, the equipment includes a speech input device and the device for speech recognition disclosed in the previous embodiments. In another embodiment, the equipment further includes a mode switching unit that switches the equipment from a power-saving mode to a working mode after the voice recognition unit reports a successful match.
The equipment includes, but is not limited to, electronic equipment and electrical appliances. The electronic equipment includes, but is not limited to, mobile phones, tablet computers and in-car computing devices. The electrical appliances include, but are not limited to, televisions, loudspeakers, electric lights, water heaters and refrigerators.
Those skilled in the art will further appreciate that the various illustrative blocks, modules, circuits and algorithm steps described in connection with the embodiments herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly demonstrate this interchangeability of hardware and software, the various illustrative components, blocks, modules, circuits and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present disclosure.
Claims (10)
1. A method for voice recognition, characterized by including:
receiving voice information;
extracting the voice feature information in the voice information;
matching the voice feature information against the sound templates in a sound library;
after a successful match, re-training the sound templates in the sound library using the voice feature information.
2. The method of claim 1, characterized in that the method is used in a mobile terminal and further includes: after a successful match, switching the mobile terminal from a first mode to a second mode.
3. The method of claim 1 or 2, characterized in that the voice feature information is matched against the sound templates in the sound library using dynamic time warping, a log-likelihood ratio method, or a log-likelihood method.
4. The method of claim 1 or 2, characterized by further including preprocessing the voice information before extracting the voice feature information.
5. A device for speech recognition, characterized by including:
a voice collecting unit that receives voice information;
a feature extraction unit that extracts the voice feature information in the voice information;
a voice recognition unit that matches the voice feature information against the sound templates in a sound library; and
a retraining unit that, after the voice recognition unit reports a successful match, re-trains the sound templates in the sound library using the voice feature information.
6. The device of claim 5, characterized in that the voice recognition unit contains one of the following computing units: a dynamic time warping algorithm unit, a log-likelihood ratio algorithm unit, or a log-likelihood algorithm unit.
7. The device of claim 5 or 6, characterized by further including a preprocessing unit that preprocesses the voice information.
8. Equipment with a speech recognition function, including a speech input device, characterized by further including the device for speech recognition of claim 5, 6 or 7.
9. The equipment of claim 8, characterized by further including a mode switching unit that switches the equipment from a first mode to a second mode after the voice recognition unit reports a successful match.
10. The equipment of claim 8 or 9, characterized in that the equipment is electronic equipment or an electrical appliance.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2015106317753 | 2015-09-28 | ||
CN201510631775 | 2015-09-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106558306A true CN106558306A (en) | 2017-04-05 |
Family
ID=58418180
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610052812.XA Pending CN106558306A (en) | 2015-09-28 | 2016-01-25 | Method for voice recognition, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106558306A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108600898A (en) * | 2018-03-28 | 2018-09-28 | 深圳市冠旭电子股份有限公司 | A kind of method, wireless sound box and the terminal device of configuration wireless sound box |
CN108831441A (en) * | 2018-05-08 | 2018-11-16 | 上海依图网络科技有限公司 | A kind of training method and device of speech recognition modeling |
CN109785825A (en) * | 2018-12-29 | 2019-05-21 | 广东长虹日电科技有限公司 | A kind of algorithm and storage medium, the electric appliance using it of speech recognition |
CN110471410A (en) * | 2019-07-17 | 2019-11-19 | 武汉理工大学 | Intelligent vehicle voice assisting navigation and safety prompting system and method based on ROS |
CN110782886A (en) * | 2018-07-30 | 2020-02-11 | 阿里巴巴集团控股有限公司 | System, method, television, device and medium for speech processing |
CN111292753A (en) * | 2020-02-28 | 2020-06-16 | 广州国音智能科技有限公司 | Offline voice recognition method, device and equipment |
CN111599363A (en) * | 2019-02-01 | 2020-08-28 | 浙江大学 | Voice recognition method and device |
CN112951274A (en) * | 2021-02-07 | 2021-06-11 | 脸萌有限公司 | Voice similarity determination method and device, and program product |
CN113709545A (en) * | 2021-04-13 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Video processing method and device, computer equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1741131A (en) * | 2004-08-27 | 2006-03-01 | 中国科学院自动化研究所 | A kind of unspecified person alone word audio recognition method and device |
CN102074231A (en) * | 2010-12-30 | 2011-05-25 | 万音达有限公司 | Voice recognition method and system |
CN102693723A (en) * | 2012-04-01 | 2012-09-26 | 北京安慧音通科技有限责任公司 | Method and device for recognizing speaker-independent isolated word based on subspace |
CN102723078A (en) * | 2012-07-03 | 2012-10-10 | 武汉科技大学 | Emotion speech recognition method based on natural language comprehension |
CN102760434A (en) * | 2012-07-09 | 2012-10-31 | 华为终端有限公司 | Method for updating voiceprint feature model and terminal |
CN103903612A (en) * | 2014-03-26 | 2014-07-02 | 浙江工业大学 | Method for performing real-time digital speech recognition |
- 2016-01-25: application CN201610052812.XA filed in China; CN106558306A pending
Non-Patent Citations (1)
Title |
---|
李梅 (Li Mei): "物联网科技导论" [Introduction to Internet of Things Science and Technology], Beijing: Beijing University of Posts and Telecommunications Press, 31 August 2015 *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108600898A (en) * | 2018-03-28 | 2018-09-28 | 深圳市冠旭电子股份有限公司 | A kind of method, wireless sound box and the terminal device of configuration wireless sound box |
CN108600898B (en) * | 2018-03-28 | 2020-03-31 | 深圳市冠旭电子股份有限公司 | Method for configuring wireless sound box, wireless sound box and terminal equipment |
CN108831441B (en) * | 2018-05-08 | 2019-08-13 | 上海依图网络科技有限公司 | A kind of training method and device of speech recognition modeling |
CN108831441A (en) * | 2018-05-08 | 2018-11-16 | 上海依图网络科技有限公司 | A kind of training method and device of speech recognition modeling |
CN110782886A (en) * | 2018-07-30 | 2020-02-11 | 阿里巴巴集团控股有限公司 | System, method, television, device and medium for speech processing |
CN109785825A (en) * | 2018-12-29 | 2019-05-21 | 广东长虹日电科技有限公司 | A kind of algorithm and storage medium, the electric appliance using it of speech recognition |
CN109785825B (en) * | 2018-12-29 | 2021-07-30 | 长虹美菱日电科技有限公司 | Speech recognition algorithm, storage medium and electric appliance applying speech recognition algorithm |
CN111599363A (en) * | 2019-02-01 | 2020-08-28 | 浙江大学 | Voice recognition method and device |
CN111599363B (en) * | 2019-02-01 | 2023-03-31 | 浙江大学 | Voice recognition method and device |
CN110471410A (en) * | 2019-07-17 | 2019-11-19 | 武汉理工大学 | Intelligent vehicle voice assisting navigation and safety prompting system and method based on ROS |
CN111292753A (en) * | 2020-02-28 | 2020-06-16 | 广州国音智能科技有限公司 | Offline voice recognition method, device and equipment |
CN112951274A (en) * | 2021-02-07 | 2021-06-11 | 脸萌有限公司 | Voice similarity determination method and device, and program product |
CN113709545A (en) * | 2021-04-13 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Video processing method and device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106558306A (en) | Method for voice recognition, device and equipment | |
US10699699B2 (en) | Constructing speech decoding network for numeric speech recognition | |
KR102339594B1 (en) | Object recognition method, computer device, and computer-readable storage medium | |
CN102982811B (en) | Voice endpoint detection method based on real-time decoding | |
US20170140750A1 (en) | Method and device for speech recognition | |
US8140330B2 (en) | System and method for detecting repeated patterns in dialog systems | |
Mantena et al. | Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping | |
CN105206271A (en) | Intelligent equipment voice wake-up method and system for realizing method | |
CN104575504A (en) | Method for personalized television voice wake-up by voiceprint and voice identification | |
Raj et al. | Phoneme-dependent NMF for speech enhancement in monaural mixtures | |
JP6284462B2 (en) | Speech recognition method and speech recognition apparatus | |
CN106782521A (en) | A kind of speech recognition system | |
CN107093422B (en) | Voice recognition method and voice recognition system | |
CN108091340B (en) | Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium | |
CN102945673A (en) | Continuous speech recognition method with speech command range changed dynamically | |
CN105210147B (en) | Method, apparatus and computer-readable recording medium for improving at least one semantic unit set | |
US20220076683A1 (en) | Data mining apparatus, method and system for speech recognition using the same | |
CN103943111A (en) | Method and device for identity recognition | |
Verma et al. | Indian language identification using k-means clustering and support vector machine (SVM) | |
CN116343797A (en) | Voice awakening method and corresponding device | |
Ravinder | Comparison of hmm and dtw for isolated word recognition system of punjabi language | |
Zheng et al. | Acoustic texttiling for story segmentation of spoken documents | |
Desplanques et al. | Adaptive speaker diarization of broadcast news based on factor analysis | |
Li et al. | Automatic segmentation of Chinese Mandarin speech into syllable-like | |
Prukkanon et al. | F0 contour approximation model for a one-stream tonal word recognition system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20180829 Address after: 100000 Beijing Chaoyang District Jinsong seven district 717 Building 5 door 102. Applicant after: Cui Zheng Address before: 511402 three, Dongxing Road, Dalong street, Guangzhou, Guangdong, three. Applicant before: Guangdong Xinxintong Information System Services Co., Ltd. |
|
TA01 | Transfer of patent application right | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170405 |
|
RJ01 | Rejection of invention patent application after publication |