CN106157956A - Method and device for speech recognition - Google Patents
Method and device for speech recognition
- Publication number
- CN106157956A CN106157956A CN201510130636.2A CN201510130636A CN106157956A CN 106157956 A CN106157956 A CN 106157956A CN 201510130636 A CN201510130636 A CN 201510130636A CN 106157956 A CN106157956 A CN 106157956A
- Authority
- CN
- China
- Prior art keywords
- vocabulary
- user
- information
- candidate
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
Abstract
The invention discloses a method and device for speech recognition. The method obtains speech recognition information for a user's current speech, and obtains auxiliary recognition information for that speech recognition information based on a user current state corresponding to the user's current speech; a final recognition result for the user's current speech is then determined according to the speech recognition information together with the auxiliary recognition information. The invention thereby solves the problem in the related art that obtaining the user's speech content only through the user's voice leads to low speech recognition accuracy, and accordingly improves the accuracy of speech recognition.
Description
Technical field
The present invention relates to the field of communications, and in particular to a method and device for speech recognition.
Background art
With the development of computers and related software and hardware, speech recognition technology is being applied in more and more fields, and its recognition rate is continually improving. Under specific conditions such as a quiet environment and standard pronunciation, dictation input systems currently used in speech recognition have reached recognition rates above 95%. Conventional speech recognition technology is relatively mature; for speech recognition on mobile terminals, however, voice quality is comparatively poor relative to normal speech recognition scenarios, so recognition performance is limited. Poor voice quality here has several causes: background noise on the client side, noise from the client's voice capture and speaking equipment, noise and interference on the communication line, and the speaker having an accent, using a dialect, or speaking indistinctly or unclearly. All of these factors may degrade speech recognition performance.
The recognition rate is affected by many factors, and for the problem in the related art that a low speech recognition rate leads to a poor user experience, no effective solution has yet been proposed. In a vehicle, in heavy noise, or with nonstandard pronunciation, the recognition rate drops sharply, to the point where genuinely practical use cannot be achieved; the low correct-recognition rate impairs accurate control, and the result is unsatisfactory. If other methods could be used for auxiliary judgment to improve the accuracy of speech recognition, its practicality would improve significantly.
Human speech understanding is a multichannel perception process. In everyday person-to-person communication, the content of another person's speech is perceived through sound; when the environment is noisy or the other party's pronunciation is indistinct, one also needs to observe the mouth shape and changes of expression with the eyes in order to understand exactly what the other party said. Existing speech recognition systems ignore this visual aspect of speech perception and rely solely on auditory features, so that under noisy conditions or with multiple speakers their recognition rate drops markedly, reducing the practicality of speech recognition and restricting its range of application.
For the problem in the related art that obtaining the user's speech content only through the user's voice leads to low speech recognition accuracy, no effective solution has yet been proposed.
Summary of the invention
The invention provides a method and device for speech recognition, at least to solve the problem in the related art that obtaining the user's speech content only through the user's voice leads to low speech recognition accuracy.
According to one aspect of the invention, a method of speech recognition is provided, including: obtaining speech recognition information for a user's current speech, and obtaining auxiliary recognition information for the speech recognition information based on a user current state corresponding to the user's current speech; and determining a final recognition result for the user's current speech according to the speech recognition information and the auxiliary recognition information.
Further, determining the final recognition result for the user's current speech according to the speech recognition information and the auxiliary recognition information includes: obtaining, according to the speech recognition information, one or more first candidate words corresponding to the user's current speech; obtaining, according to the auxiliary recognition information, a vocabulary category corresponding to the user's current speech, or one or more second candidate words; and determining the final recognition result for the user's current speech according to the one or more first candidate words and the vocabulary category, or according to the one or more first candidate words and the one or more second candidate words.
Further, determining the final recognition result for the user's current speech according to the one or more first candidate words and the vocabulary category includes: selecting, from the one or more first candidate words, a first specific word matching the vocabulary category, and taking the first specific word as the final recognition result for the user's current speech.
Further, determining the final recognition result for the user's current speech according to the one or more first candidate words and the one or more second candidate words includes: selecting, from the one or more second candidate words, a second specific word with high similarity to the one or more first candidate words, and taking the second specific word as the final recognition result for the user's current speech.
Further, obtaining the auxiliary recognition information for the speech recognition information based on the user current state corresponding to the user's current speech includes: obtaining an image indicating the user current state; obtaining image feature information from the image; and obtaining, according to the image feature information, a vocabulary category and/or one or more candidate words corresponding to the image feature information, and taking the vocabulary category and/or the one or more candidate words as the auxiliary recognition information.
Further, obtaining the vocabulary category and/or the one or more candidate words corresponding to the image feature information according to the image feature information includes: searching a predetermined image library for the specific image with the highest similarity to the image feature information; and obtaining, according to a preset correspondence between images and vocabulary categories or one or more candidate words, the vocabulary category or the one or more candidate words corresponding to the specific image.
Further, the user current state includes at least one of the following: the lip movement state of the user, the laryngeal vibration state of the user, the facial movement state of the user, and the gesture movement state of the user.
Further, before obtaining the speech recognition information for the user's current speech and obtaining the auxiliary recognition information for the speech recognition information based on the user current state corresponding to the user's current speech, the method includes: judging that the accuracy of the final recognition result for the user's current speech determined from the speech recognition information alone is below a predetermined threshold.
According to another aspect of the invention, a device for speech recognition is provided, the device including: an acquisition module, configured to obtain speech recognition information for a user's current speech and to obtain auxiliary recognition information for the speech recognition information based on a user current state corresponding to the user's current speech; and a determination module, configured to determine a final recognition result for the user's current speech according to the speech recognition information and the auxiliary recognition information.
Further, the determination module includes: a first acquiring unit, configured to obtain, according to the speech recognition information, one or more first candidate words corresponding to the user's current speech; a second acquiring unit, configured to obtain, according to the auxiliary recognition information, a vocabulary category corresponding to the user's current speech or one or more second candidate words; and a determining unit, configured to determine the final recognition result for the user's current speech according to the one or more first candidate words and the vocabulary category, or according to the one or more first candidate words and the one or more second candidate words.
Further, the determining unit is further configured to select, from the one or more first candidate words, a first specific word matching the vocabulary category, and to take the first specific word as the final recognition result for the user's current speech.
Further, the determining unit is further configured to select, from the one or more second candidate words, a second specific word with high similarity to the one or more first candidate words, and to take the second specific word as the final recognition result for the user's current speech.
Further, the acquisition module also includes: a third acquiring unit, configured to obtain an image indicating the user current state; a fourth acquiring unit, configured to obtain image feature information from the image; and a fifth acquiring unit, configured to obtain, according to the image feature information, a vocabulary category and/or one or more candidate words corresponding to the image feature information, and to take the vocabulary category and/or the one or more candidate words as the auxiliary recognition information.
Further, the fifth acquiring unit also includes: a searching subunit, configured to search a predetermined image library for the specific image with the highest similarity to the image feature information; and an obtaining subunit, configured to obtain, according to a preset correspondence between images and vocabulary categories or one or more candidate words, the vocabulary category or the one or more candidate words corresponding to the specific image.
Further, the user current state includes at least one of the following: the lip movement state of the user, the laryngeal vibration state of the user, the facial movement state of the user, and the gesture movement state of the user.
Further, the device also includes: a judgment module, configured to judge that the accuracy of the final recognition result for the user's current speech determined from the speech recognition information is below a predetermined threshold.
According to another aspect of the invention, a terminal is also provided, including a processor, the processor being configured to obtain speech recognition information for a user's current speech, obtain auxiliary recognition information for the speech recognition information based on a user current state corresponding to the user's current speech, and determine a final recognition result for the user's current speech according to the speech recognition information and the auxiliary recognition information.
Through the present invention, speech recognition information for the user's current speech is obtained, auxiliary recognition information for that speech recognition information is obtained based on the user current state corresponding to the user's current speech, and the final recognition result for the user's current speech is determined according to the speech recognition information and the auxiliary recognition information. This solves the problem in the related art that obtaining the user's speech content only through the user's voice leads to low speech recognition accuracy, and thereby improves the accuracy of speech recognition.
Brief description of the drawings
The accompanying drawings described here are provided for further understanding of the present invention and constitute a part of the application; the schematic embodiments of the present invention and their description are used to explain the present invention and do not constitute an undue limitation of it. In the drawings:
Fig. 1 is a flowchart of the speech recognition method according to an embodiment of the present invention;
Fig. 2 is a structural block diagram of the speech recognition device according to an embodiment of the present invention;
Fig. 3 is a structural block diagram (1) of the speech recognition device according to an embodiment of the present invention;
Fig. 4 is a structural block diagram (2) of the speech recognition device according to an embodiment of the present invention;
Fig. 5 is a structural block diagram (3) of the speech recognition device according to an embodiment of the present invention;
Fig. 6 is a structural block diagram (4) of the speech recognition device according to an embodiment of the present invention;
Fig. 7 is a flowchart of the speech recognition processing method according to an embodiment of the present invention;
Fig. 8 is a structural block diagram of the speech recognition processing device according to an embodiment of the present invention;
Fig. 9 is a speech recognition processing flowchart according to an embodiment of the present invention.
Detailed description of the invention
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments in the application and the features in the embodiments can be combined with each other.
This embodiment provides a method of speech recognition. Fig. 1 is a flowchart of the speech recognition method according to an embodiment of the present invention; as shown in Fig. 1, the flow comprises the following steps:
Step S102: obtain speech recognition information for the user's current speech, and obtain auxiliary recognition information for the speech recognition information based on the user current state corresponding to the user's current speech;
Step S104: determine the final recognition result for the user's current speech according to the speech recognition information and the auxiliary recognition information.
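As a rough illustration of steps S102 and S104, the sketch below combines toy acoustic candidates with a vocabulary category inferred from the user's state. All function names, words, scores, and category mappings here are invented for illustration; a real system would use acoustic and visual recognition models.

```python
def get_voice_recognition_info(audio):
    """Toy stand-in for the acoustic recognizer: candidate words with scores."""
    return [("mail", 0.41), ("nail", 0.39), ("rail", 0.20)]

def get_auxiliary_info(user_state):
    """Toy stand-in for the auxiliary channel: map an observed state to a category."""
    categories = {"typing_gesture": "communication", "hammering_gesture": "tools"}
    return categories.get(user_state)

# Invented word-to-category mapping used to filter acoustic candidates.
WORD_CATEGORIES = {"mail": "communication", "nail": "tools", "rail": "transport"}

def final_recognition(audio, user_state):
    candidates = get_voice_recognition_info(audio)   # step S102: acoustic channel
    category = get_auxiliary_info(user_state)        # step S102: auxiliary channel
    if category is not None:                         # step S104: combine the two
        matching = [(w, s) for w, s in candidates
                    if WORD_CATEGORIES.get(w) == category]
        if matching:
            candidates = matching
    return max(candidates, key=lambda c: c[1])[0]
```

With these invented inputs, a hammering gesture steers the near-tie between the acoustic candidates toward "nail" rather than "mail".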
Through the above steps, speech recognition information for the user's current speech is obtained together with characteristic information about the user's state while the speech is being produced, and the state characteristic information serves as auxiliary information for recognizing the current speech. Compared with the prior art, in which recognition accuracy from the user's current speech alone is relatively low, the above steps solve the problem in the related art that obtaining the user's speech content only through the user's voice leads to low speech recognition accuracy, and thereby improve the accuracy of speech recognition.
Step S104 above involves determining the final recognition result for the user's current speech according to the speech recognition information and the auxiliary recognition information. In an alternative embodiment, one or more first candidate words corresponding to the user's current speech are obtained according to the speech recognition information; a vocabulary category corresponding to the user's current speech, or one or more second candidate words, are obtained according to the auxiliary recognition information; and the final recognition result for the user's current speech is determined according to the one or more first candidate words and the vocabulary category, or according to the one or more first candidate words and the one or more second candidate words.
The final recognition result for the user's current speech can be determined from the one or more first candidate words and the vocabulary category in a variety of ways. In one alternative embodiment, a first specific word matching the vocabulary category is selected from the one or more first candidate words, and the first specific word is taken as the final recognition result for the user's current speech. In another alternative embodiment, a second specific word with high similarity to the one or more first candidate words is selected from the one or more second candidate words, and the second specific word is taken as the final recognition result for the user's current speech.
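The similarity-based variant above can be sketched with a plain string-similarity measure. The patent does not specify which similarity is used, so Python's difflib stands in for it here, and both candidate lists are invented:

```python
from difflib import SequenceMatcher

def pick_by_similarity(first_candidates, second_candidates):
    """Pick the second-channel word most similar to any first-channel word."""
    def best_score(word):
        return max(SequenceMatcher(None, word, c).ratio()
                   for c in first_candidates)
    return max(second_candidates, key=best_score)

# Acoustic candidates vs. (e.g.) lip-reading candidates -- invented data.
final = pick_by_similarity(["bat", "pat"], ["bath", "cat"])  # "bath"
```

Any other similarity (phonetic distance, embedding distance) could be substituted without changing the selection structure.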
In the process of determining the final recognition result for the user's current speech according to the one or more first candidate words and the one or more second candidate words, in one alternative embodiment, an image indicating the user current state is first obtained, image feature information is then obtained from the image, and a vocabulary category and/or one or more candidate words corresponding to the image feature information are then obtained according to the image feature information, with the vocabulary category and/or the one or more candidate words serving as the auxiliary recognition information.
In one alternative embodiment, a predetermined image library is searched for the specific image with the highest similarity to the image feature information, and the vocabulary category or one or more candidate words corresponding to the specific image are obtained according to a preset correspondence between images and vocabulary categories or one or more candidate words. In this way, the vocabulary category and/or one or more candidate words corresponding to the image feature information are obtained from the image feature information.
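A minimal sketch of this image-library lookup follows. The feature vectors and the image-to-words mapping are invented stand-ins for a real template library, and cosine similarity stands in for whatever similarity measure an implementation would choose:

```python
import math

# Invented feature vectors standing in for extracted lip/face image features,
# and an invented mapping standing in for the preset image-to-words correspondence.
IMAGE_LIBRARY = {
    "open_mouth":  [0.9, 0.1, 0.0],
    "closed_lips": [0.1, 0.8, 0.2],
}
IMAGE_TO_WORDS = {"open_mouth": ["ah", "car"], "closed_lips": ["mm", "bomb"]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar_image(features):
    """Search the predetermined library for the image closest to the observed features."""
    return max(IMAGE_LIBRARY, key=lambda name: cosine(features, IMAGE_LIBRARY[name]))

def auxiliary_candidates(features):
    """Map the best-matching library image to its candidate words."""
    return IMAGE_TO_WORDS[most_similar_image(features)]
```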
The user current state may include several aspects, illustrated below. In one alternative embodiment, it includes the lip movement state of the user, the laryngeal vibration state of the user, the facial movement state of the user, and the gesture movement state of the user. The information included in the user's current-state features above is only illustrative and is not limiting. For example, in real life the content of a speaker's words can be recognized through lip reading alone; lip reading is therefore an important auxiliary factor for recognizing speech.
In one alternative embodiment, before the speech recognition information for the user's current speech is obtained and the auxiliary recognition information for the speech recognition information is obtained based on the user current state corresponding to the user's current speech, it is judged that the accuracy of the final recognition result for the user's current speech determined from the speech recognition information alone is below a predetermined threshold.
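This confidence-gated triggering of the auxiliary channel can be sketched as follows; the threshold value and the callable interface are assumptions for illustration, not specified by the patent:

```python
def recognize_with_fallback(asr_result, confidence, auxiliary_recognizer,
                            threshold=0.8):
    """Invoke the auxiliary channel only when acoustic confidence is below threshold."""
    if confidence >= threshold:
        return asr_result                      # acoustic result trusted as-is
    return auxiliary_recognizer(asr_result)    # low confidence: refine with auxiliary cues
```

This keeps the cost of the visual pipeline off the common path: camera capture and image matching run only when the acoustic recognizer is uncertain.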
This embodiment also provides a device for speech recognition; the device is used to realize the above embodiments and preferred implementations, and what has already been explained is not repeated. As used below, the term "module" can realize a combination of software and/or hardware with predetermined functions. Although the device described in the following embodiments is preferably realized in software, realization in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 2 is a structural block diagram of the speech recognition device according to an embodiment of the present invention. As shown in Fig. 2, the device includes: an acquisition module 22, configured to obtain speech recognition information for the user's current speech and to obtain auxiliary recognition information for the speech recognition information based on the user current state corresponding to the user's current speech; and a determination module 24, configured to determine the final recognition result for the user's current speech according to the speech recognition information and the auxiliary recognition information.
Fig. 3 is a structural block diagram (1) of the speech recognition device according to an embodiment of the present invention. As shown in Fig. 3, the determination module 24 includes: a first acquiring unit 242, configured to obtain, according to the speech recognition information, one or more first candidate words corresponding to the user's current speech; a second acquiring unit 244, configured to obtain, according to the auxiliary recognition information, a vocabulary category corresponding to the user's current speech or one or more second candidate words; and a determining unit 246, configured to determine the final recognition result for the user's current speech according to the one or more first candidate words and the vocabulary category, or according to the one or more first candidate words and the one or more second candidate words.
Optionally, the determining unit 246 is further configured to select, from the one or more first candidate words, a first specific word matching the vocabulary category, and to take the first specific word as the final recognition result for the user's current speech.
Optionally, the determining unit 246 is further configured to select, from the one or more second candidate words, a second specific word with high similarity to the one or more first candidate words, and to take the second specific word as the final recognition result for the user's current speech.
Fig. 4 is a structural block diagram (2) of the speech recognition device according to an embodiment of the present invention. As shown in Fig. 4, the acquisition module 22 also includes: a third acquiring unit 222, configured to obtain an image indicating the user current state; a fourth acquiring unit 224, configured to obtain image feature information from the image; and a fifth acquiring unit 226, configured to obtain, according to the image feature information, a vocabulary category and/or one or more candidate words corresponding to the image feature information, and to take the vocabulary category and/or the one or more candidate words as the auxiliary recognition information.
Fig. 5 is a structural block diagram (3) of the speech recognition device according to an embodiment of the present invention. As shown in Fig. 5, the fifth acquiring unit 226 also includes: a searching subunit 2262, configured to search a predetermined image library for the specific image with the highest similarity to the image feature information; and an obtaining subunit 2264, configured to obtain, according to a preset correspondence between images and vocabulary categories or one or more candidate words, the vocabulary category or the one or more candidate words corresponding to the specific image.
Optionally, the user current state includes at least one of the following: the lip movement state of the user, the laryngeal vibration state of the user, the facial movement state of the user, and the gesture movement state of the user.
Fig. 6 is a structural block diagram (4) of the speech recognition device according to an embodiment of the present invention. As shown in Fig. 6, the device also includes: a judgment module 26, configured to judge that the accuracy of the final recognition result for the user's current speech determined from the speech recognition information is below a predetermined threshold.
According to another aspect of the present invention, a terminal is also provided, including a processor, the processor being configured to obtain speech recognition information for the user's current speech, obtain auxiliary recognition information for the speech recognition information based on the user current state corresponding to the user's current speech, and determine the final recognition result for the user's current speech according to the speech recognition information and the auxiliary recognition information.
It should be noted that the above modules can be realized in software or in hardware. In the latter case, realization is possible in the following ways, but is not limited to them: the above modules are all located in the same processor; or the above modules are respectively located in a first processor, a second processor, a third processor, and so on.
With regard to the above problems in the related art, an explanation is given below in conjunction with concrete alternative embodiments; the following alternative embodiments combine the above alternative embodiments and their optional implementations.
This alternative embodiment provides a speech recognition processing method and device to solve the problem in the related art that a low speech recognition rate leads to a poor user experience. To overcome the above drawbacks and deficiencies of the prior art, the purpose of this alternative embodiment is to provide an intelligent speech recognition method and device based on an auxiliary interactive mode: on the basis of speech recognition, which serves as the baseband signal, lip reading recognition, face recognition, gesture recognition, laryngeal vibration recognition, and the like are used as auxiliary signals. Each technique is used where its application excels, so that strengths offset weaknesses; the technique modules are relatively independent yet mutually fused, greatly improving the speech processing recognition rate. Preferably, whether auxiliary signal recognition is added can be determined by the speech recognition result: when the probability of the speech recognition result falls below a threshold, auxiliary data are added. This matches the human process of understanding speech as a multichannel perception: the terminal perceives the spoken content through sound, and coordinates recognition of the mouth shape, facial changes, and the like, in order to understand exactly what was said.
According to one aspect of this alternative embodiment, a speech recognition processing method is provided. On the basis of speech recognition performed, as the baseband signal, on voice data obtained by an audio sensor, moving images of the human body are collected by the terminal device's camera or an external sensor, including gesture movement, facial movement, laryngeal vibration, and lip reading, and are resolved by an integrated image algorithm and action processing chip to serve as the auxiliary signal for speech recognition. The recognition results of the baseband signal and the auxiliary signal are comprehensively processed by the terminal, which performs the corresponding operation. The auxiliary signal recognition result and the baseband speech recognition result are accumulated to form a unified recognition result, so that the auxiliary signal assists speech recognition and improves the speech recognition rate.
Gesture movement, facial movement, laryngeal vibration and lip reading are integrated; each modality organically combines feature extraction, template training, template classification and decision. The logical order is: the speech signal is first analyzed and confirmed as the base signal, and the auxiliary signals are then used for auxiliary judgment, which effectively reduces the probability of recognition errors caused by noise and external sound interference. During auxiliary signal recognition, feature data is collected by sensors and cameras, features are extracted and subjected to a series of matching judgments against a preset template library, the matches are compared with the corresponding recognition results, and the possible candidate words in the speech recognition model dictionary are identified.
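The template-matching step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the feature vectors, the template library contents, and the choice of cosine similarity are all assumptions made for the example.

```python
import math

# Hypothetical template library: each entry maps a preset feature
# template (an illustrative vector) to the vocabulary categories
# associated with that feature pattern.
TEMPLATE_LIBRARY = [
    ([0.9, 0.1, 0.3], ["greeting"]),
    ([0.2, 0.8, 0.5], ["contact", "call"]),
    ([0.1, 0.2, 0.9], ["browse"]),
]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_auxiliary_features(feature_vec):
    """Compare an extracted feature vector against every preset template
    and return the vocabulary categories of the most similar template."""
    best_categories, best_sim = [], -1.0
    for template, categories in TEMPLATE_LIBRARY:
        sim = cosine_similarity(feature_vec, template)
        if sim > best_sim:
            best_sim, best_categories = sim, categories
    return best_categories, best_sim
```

A feature vector extracted from, say, a lip image would be passed to `match_auxiliary_features`, and the returned categories would then constrain the speech candidates.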
Optionally, the above lip reading collects an image of the speaker's lips through a camera, performs image processing on the lip image, extracts lip features dynamically in real time, and then determines the spoken content with a lip pattern recognition algorithm. A judgment method combining lip shape and lip color is used to locate the lip position accurately, and a suitable lip matching algorithm is used for recognition.
Optionally, the above lip reading extracts lip image features from the preprocessed video data and uses them to recognize the current user's mouth-shape changes; recognizing lip movement by detecting the user's mouth motion improves recognition efficiency and accuracy. The mouth motion feature maps are classified to obtain category information; each feature type of mouth motion feature map corresponds to a number of vocabulary categories. After a series of processing steps such as denoising and analog-to-digital (A/D) conversion, the information obtained by lip reading is compared with the template library preset in the image/speech recognition processing module: the similarity between the lip-reading information and every pre-sampled mouth motion feature map is computed, and the vocabulary categories corresponding to the feature map with the highest similarity are read out.
Optionally, the above laryngeal vibration recognition collects the speaker's laryngeal vibration pattern through an external sensor, processes the vibration pattern, extracts vibration features dynamically in real time, and then determines the spoken content with a vibration pattern recognition algorithm.
Optionally, before laryngeal vibration recognition is performed for a user, the user's laryngeal vibration motion feature maps must first be sampled, and separate laryngeal vibration feature profiles are established for different users. When sampling a user's laryngeal vibration feature maps in advance, the feature map produced when the user utters a single syllable may be sampled, or the feature map produced when the user utters a whole word. Different pronunciation events produce different laryngeal vibration motions; and because successive speech events uttered by a user are correlated, contextual error correction is applied after laryngeal vibration recognition to verify the recognized vibrations. This reduces recognition errors between similar laryngeal vibration feature maps and further improves the accuracy of laryngeal vibration recognition.
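The contextual error correction can be sketched as a lexicon cross-check over per-syllable candidates: easily confused vibration patterns are resolved by preferring a combination that forms a known word. The lexicon entries and the fallback strategy below are illustrative assumptions, not details from the text.

```python
from itertools import product

# Hypothetical lexicon of valid syllable sequences (words), used to
# cross-check easily confused vibration patterns against their context.
LEXICON = {("ka", "pian"), ("hu", "jiao")}

def correct_with_context(syllable_candidates):
    """syllable_candidates: one candidate list per recognized syllable,
    ordered by vibration-matching score. Return the first combination
    that forms a word in the lexicon; otherwise fall back to the
    top-ranked candidate for each syllable."""
    for combo in product(*syllable_candidates):
        if combo in LEXICON:
            return list(combo)
    return [cands[0] for cands in syllable_candidates]
```

Here a second-ranked syllable candidate can displace the first-ranked one when only the former yields a word the user could plausibly have said.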
Optionally, the above laryngeal vibration recognition extracts laryngeal vibration image features from the preprocessed vibration data and uses them to recognize the current user's laryngeal vibration changes; detecting the user's laryngeal vibration motion improves recognition efficiency and accuracy. The laryngeal vibration motion feature maps are classified to obtain category information; each feature type of laryngeal vibration feature map corresponds to a number of vocabulary categories. The information obtained by laryngeal vibration recognition is compared with the template library preset in the image/speech recognition processing module: the similarity between the laryngeal vibration information and every pre-sampled laryngeal vibration feature map is computed, and the vocabulary categories corresponding to the feature map with the highest similarity are read out.
The above face recognition extracts the user's facial features from the video data and determines the user's identity and position. When a person speaks, the facial muscles also follow corresponding movement patterns; by capturing facial muscle movements, the corresponding muscle movement pattern can be recognized from the signal features and used to assist in recognizing the speech information.
According to an aspect of this optional embodiment, a speech recognition processing device is also provided, including: a base signal module, an auxiliary signal module, and a signal processing module.
The base signal module is a conventional speech recognition module that recognizes, through the audio sensor, the preprocessed audio data. Its recognition objects include isolated-vocabulary speech recognition and continuous large-vocabulary speech recognition; the former is mainly used to determine control instructions, while the latter is mainly used for text input. This description mainly uses isolated-vocabulary recognition as an example; continuous large-vocabulary recognition is handled in the same way.
Optionally, the audio sensor is a microphone array or a directional microphone. Various forms of noise interference exist in the environment, and audio capture based on an ordinary microphone is equally sensitive to the user's speech and to environmental noise, with no ability to distinguish speech from noise, which easily degrades the accuracy of speech command operation. A microphone array or directional microphone overcomes this problem: sound source localization and speech enhancement algorithms track the operating user's voice and enhance its acoustic signal while suppressing environmental noise and interfering voices, improving the signal-to-noise ratio of the audio input and ensuring reliable data quality for the back-end algorithms.
The auxiliary signal module includes a front-end camera, an audio sensor, and a laryngeal vibration sensor, and is used to obtain video data, audio data, and motion data.
Optionally, the laryngeal vibration sensor is integrated in a wearable device positioned in contact with the user's throat and detects the speech vibrations the user produces. One temperature sensor is placed inside the wearable device and another outside it; by comparing the temperatures detected by the two sensors, a microprocessor judges whether the wearable device is being worn. When the device is not worn, it automatically enters sleep mode to reduce overall power consumption. The microprocessor judges and recognizes the voice instruction issued by the user from the state of the vibration sensor, and sends the instruction via Bluetooth to the device to be controlled, which executes the voice recognition instruction.
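The two-sensor wearing judgment can be sketched as a simple temperature comparison: when worn, the inner sensor sits against warm skin while the outer one reads ambient temperature. The temperature-difference threshold below is an illustrative assumption; the text does not specify one.

```python
def is_worn(inner_temp_c, outer_temp_c, threshold_c=2.0):
    """When the device is worn, the inner sensor (against the throat)
    reads noticeably warmer than the outer one. The 2 degree Celsius
    threshold is an illustrative assumption, not a value from the text."""
    return (inner_temp_c - outer_temp_c) > threshold_c

def power_mode(inner_temp_c, outer_temp_c):
    # Enter sleep mode when not worn, reducing overall power consumption.
    return "active" if is_worn(inner_temp_c, outer_temp_c) else "sleep"
```

In practice the microprocessor would poll both sensors periodically and switch modes only after the reading is stable, to avoid toggling on transient differences.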
The signal processing unit includes a lip reading module, a face recognition module, a vibration recognition module, a gesture recognition module, a speech recognition module, and a score adjusting module. It recognizes the base signal (the speech signal) and the auxiliary signals, selecting the base signal as the primary speech information and the auxiliary signals as assistant speech information. The logical order is: the speech signal is first analyzed and confirmed as the base signal, and the auxiliary signals then perform auxiliary judgment. In the concrete recognition process, the several words with the highest probability scores from speech recognition are selected as candidate words, and for each candidate word a multi-level related-term set is generated according to a predetermined vocabulary. The assistant information produced by the auxiliary signals is used to raise the scores, in the speech recognition model dictionary, of the candidate words and of the related terms in their related-term sets. After the base signal and the auxiliary signals have all been processed, the candidate word or related term with the highest score is selected as the recognition result.
The above lip reading module extracts lip image features from the preprocessed video data and uses the lip movement information to recognize the current user's mouth-shape changes.
The above face recognition module extracts the user's facial features from the video data and determines the user's identity and position. Recognizing the identity of different registered users mainly serves the personalization of the whole device and differentiated control authorization. The user's position information can assist gesture recognition in determining the operating area of the user's hands, and can determine the user's bearing during voice operation so as to raise the microphone's audio input gain in that direction. When several possible users are present, this module can recognize the positions of all faces, judge every user's identity, and handle each separately; the user within the camera's field of view is granted control.
The above gesture recognition module extracts gesture information from the preprocessed video data, determines the hand shape, the hand's motion trajectory, and the hand's coordinates in the image, then tracks any hand shape and analyzes the hand contour in the image; the user obtains startup and control of the whole terminal through specific gestures or actions.
Through this optional embodiment, existing forms of human-computer interaction technology, including gesture recognition, laryngeal vibration recognition, speech recognition, face recognition, and lip reading, are fused: speech recognition serves as the base signal, while lip reading, face recognition, gesture recognition, laryngeal vibration recognition and the like serve as auxiliary signals that perform score adjustment on the speech recognition candidate words. The logical order of first analyzing and confirming the speech signal as the base signal and then performing auxiliary judgment with the auxiliary signals uses each technique where it is strongest so that they complement one another, with the modules relatively independent yet fused. On this basis, the mouth-shape changes of the current user recognized from lip movement information reduce the error rate of voice operation, ensuring that voice operation still works in noisy environments; the position information recognized by the face recognition module can assist gesture recognition in determining the operating area of the user's hands and determine the user's bearing during voice operation so as to raise the microphone's input gain in that direction. The impact of noise is thereby overcome, the speech recognition rate is significantly improved, and the result is then converted into the corresponding instruction, making terminal speech recognition stable and comfortable to operate.
The steps shown in the flowcharts of the accompanying drawings may be executed on a user terminal such as a smartphone or tablet computer; and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from that given here.
This embodiment provides a speech recognition processing method. Fig. 7 is a flowchart of the speech recognition processing method according to an embodiment of the present invention. As shown in Fig. 7, the flow includes:
Step S702: the speech information obtained by the audio sensor is recognized as the base signal;
Step S704: lip reading, face recognition, vibration recognition, and gesture recognition are performed as auxiliary signals, and the recognition result of the base signal is score-adjusted.
The objects of speech recognition include isolated-vocabulary recognition and continuous large-vocabulary recognition; the former is mainly used to determine control instructions, the latter for text input. This embodiment uses isolated-vocabulary recognition as an example; continuous large-vocabulary recognition is handled in the same way. Through the above steps, the speech signal is first analyzed and confirmed as the base signal and the auxiliary signals then perform auxiliary judgment: the several words with the highest probability scores from speech recognition are selected as candidate words, and for each candidate word a multi-level related-term set is generated according to a predetermined vocabulary. The candidate word category with the highest probability score produced by auxiliary signal recognition serves as assistant information; the candidate words recognized from the base signal are judged in turn, and if a candidate word matches the category recognized from the auxiliary signal, the scores of that candidate word and of the related terms in its related-term set are raised in the speech recognition model dictionary. After the base signal and the auxiliary signals have all been processed, the candidate word or related term with the highest score is selected as the recognition result.
In a specific implementation, lip reading, face recognition, vibration recognition, and gesture recognition are performed as auxiliary signals. The recognition modalities are mutually independent, and one or more of them can serve as auxiliary signal inputs at the same time.
An embodiment also provides a device corresponding to the method in the above embodiment; what has already been explained is not repeated here. The modules or units of this device may be code stored in a memory of the user terminal and executable by a processor, or may be implemented in other ways, which are not enumerated here one by one.
According to an aspect of the present invention, a speech recognition processing device is also provided. Fig. 8 is a structural block diagram of the speech recognition processing device according to an embodiment of the present invention. As shown in Fig. 8, the device includes:
a base signal module, including an audio sensor, which is a conventional speech recognition module for recognizing the preprocessed audio data obtained through the audio sensor;
an auxiliary signal module, including a front-end camera and a laryngeal vibration sensor, for obtaining video data, audio data, and motion data used for lip reading, face recognition, laryngeal vibration recognition, gesture recognition, and the like;
a signal processing module, including a lip reading module, a face recognition module, a vibration recognition module, a gesture recognition module, a speech recognition module, and a score adjusting module, which recognizes the base signal (the speech signal) and the auxiliary signals, selects the base signal as the primary speech information, and uses the auxiliary signals as assistant information for score adjustment.
The above lip reading module extracts lip image features from the preprocessed video data and uses the lip movement information to recognize the current user's mouth-shape changes.
The above face recognition module extracts the user's facial features from the video data and determines the user's identity and position; recognizing the identity of different registered users mainly serves the personalization of the whole device and differentiated control authorization.
The above gesture recognition module extracts gesture information from the preprocessed video data, determines the hand shape, the hand's motion trajectory, and the hand's coordinates in the image, then tracks any hand shape and analyzes the hand contour in the image; the user obtains startup and control of the whole terminal through specific gestures or actions.
Fig. 9 is a flowchart of the speech recognition processing method according to the present invention. As shown in Fig. 9, the speech recognition method of this embodiment is as follows:
Step S902: speech information is obtained from the audio sensor, and video data and motion data, including information for lip reading, face recognition, laryngeal vibration recognition, and gesture identification, are obtained from the front-end camera and the laryngeal vibration sensor;
Step S904: taking isolated-vocabulary speech recognition as an example, the speech signal is recognized and confirmed as the base signal, and the several words with the highest probability of matching the isolated vocabulary item are obtained as candidate words;
Step S906: moving images of the human body collected by the terminal device's camera or external sensors, including gesture movement, facial movement, laryngeal vibration, and lip movement, are analyzed and confirmed as auxiliary signals to obtain the candidate word category with the highest probability score;
Step S908: the candidate words recognized from the base signal are judged in turn; if a candidate word matches the category recognized from the auxiliary signal, its score in the speech recognition model dictionary is raised;
Step S910: after the base signal and the auxiliary signals have all been processed, the candidate word with the highest score is selected as the recognition result.
This optional embodiment is illustrated below with a concrete example. Suppose recognizing the owner's speech yields the following result:
"please (0.6) call (0.9) browser (0.7) card holder (0.9)", where each number in parentheses is a probability score: the higher the score, the higher the probability. The words with the highest probability scores are selected as candidate words, for example: card holder (0.9) and call (0.9) as the speech recognition result.
At the same time, gesture movement, facial movement, laryngeal vibration, lip reading and the like, in combination or individually, are recognized as auxiliary signals to obtain the candidate word category with the highest probability score.
The card holder (0.9) and call (0.9) recognized from the speech signal are judged in turn to see whether they match the candidate word category recognized from the auxiliary signal. Suppose "card holder" matches the category; its probability score is then raised, for example updated to card holder (1.0) and call (0.9).
After the speech base signal and the auxiliary signals have all been processed, the candidate word with the highest score, card holder (1.0), is selected as the recognition result.
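The worked example above can be reproduced in a short sketch; the fixed boost of 0.1 matches the 0.9 to 1.0 update in the example but is otherwise an assumption.

```python
def pick_result(candidates, matches_aux, boost=0.1):
    """Raise the score of each candidate confirmed by the auxiliary
    signal, then return the top-scoring word and the adjusted scores."""
    scores = {w: s + (boost if matches_aux(w) else 0.0)
              for w, s in candidates.items()}
    return max(scores, key=scores.get), scores

# The worked example: "card holder" and "call" both score 0.9 from the
# speech signal; the auxiliary signal confirms only "card holder".
result, scores = pick_result(
    {"card holder": 0.9, "call": 0.9},
    matches_aux=lambda w: w == "card holder",
)
# "card holder" rises from 0.9 to 1.0 and is selected as the result.
```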
As an alternative to this embodiment, the reverse logical order may be used: auxiliary signal recognition first determines the candidate word category, and the speech signal is then analyzed and confirmed as the base signal. Gesture movement, facial movement, laryngeal vibration, lip reading and the like, in combination or individually, are first recognized as auxiliary signals; when several modalities are used, their recognition results are accumulated to obtain the candidate word category with the highest probability score. The speech recognition result is combined on this basis, and the word with the highest probability score is selected from it as the final recognition result. This scheme is illustrated below with a concrete example. Suppose recognizing the owner's speech yields the following result:
"please (0.6) call (0.9) browser (0.7) card holder (0.9)", where each number in parentheses is a probability score. The words with the highest probability scores are selected as candidate words, for example: card holder (0.9) and call (0.9) as the speech recognition result.
At the same time, laryngeal vibration and lip reading are combined as auxiliary signals. Suppose laryngeal vibration recognition comes first: the card holder (0.9) and call (0.9) recognized from the base signal are judged in turn to see whether they match the candidate word category recognized by laryngeal vibration. Suppose "card holder" matches that category; its probability score is raised, for example updated to card holder (1.0) and call (0.9). Lip-reading judgment then proceeds on the basis of the previous recognition result: card holder (1.0) and call (0.9) are judged in turn against the candidate word category from lip reading. Suppose "card holder" matches again; its probability score is raised once more, for example updated to card holder (1.1) and call (0.9). The recognition results of the two modalities have thus been accumulated.
After the speech base signal and the auxiliary signals have all been processed, the candidate word with the highest score, card holder (1.1), is selected as the recognition result.
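The cumulative two-modality adjustment can be sketched as repeated boosting, one pass per auxiliary modality; the 0.1 boost mirrors the 0.9 to 1.0 to 1.1 updates in the example above and is otherwise an assumption.

```python
def accumulate_boosts(candidates, modality_checks, boost=0.1):
    """Apply each auxiliary modality in turn (e.g. laryngeal vibration,
    then lip reading). Every modality that confirms a candidate adds a
    further boost, so confirmations accumulate across modalities."""
    scores = dict(candidates)
    for matches in modality_checks:
        for word in scores:
            if matches(word):
                scores[word] += boost
    return max(scores, key=scores.get), scores
```

Each entry in `modality_checks` stands for one auxiliary modality's category judgment; a candidate confirmed by both laryngeal vibration and lip reading is boosted twice.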
As an alternative to this embodiment, the screening is further completed by score adjustment: the scores of candidate words that match the auxiliary signal recognition may be raised, or the scores of candidate words that do not match it may be lowered. After the base signal and the auxiliary signals have all been processed, the candidate word with the highest score is selected as the recognition result.
As an alternative to this embodiment, using the added assistant information to confirm the recognition result and improve speech recognition accuracy is optional for the user. The speech recognizer determines a recognition result from the input speech and computes a likelihood measure for it. If the likelihood measure is below a threshold, the user is prompted whether to input auxiliary data, or auxiliary data recognition is turned on automatically. If the likelihood measure is above the threshold, the user is prompted whether to turn off auxiliary data, or auxiliary data recognition is turned off automatically. The concrete threshold value is not limited here; it may be derived empirically or from user experience.
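The threshold-based switching of auxiliary data can be sketched as a simple hysteresis rule; the concrete threshold values are illustrative, since the text deliberately leaves them to empirical tuning or user experience.

```python
def auxiliary_enabled(confidence, aux_on, low=0.6, high=0.9):
    """Enable auxiliary-data recognition when speech-only confidence
    falls below a lower threshold, disable it when confidence is
    comfortably above an upper threshold, and otherwise keep the
    current setting. The threshold values are illustrative assumptions."""
    if confidence < low:
        return True   # turn on (or prompt the user to turn on) aux data
    if confidence > high:
        return False  # turn off aux data to save processing
    return aux_on
```

Using two thresholds rather than one avoids rapid toggling when the likelihood measure hovers near a single cut-off value.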
The improved speech recognition method of the above embodiment fuses existing forms of human-computer interaction technology, including gesture recognition, laryngeal vibration recognition, speech recognition, face recognition, and lip reading: speech recognition serves as the base signal, while lip reading, face recognition, gesture recognition, laryngeal vibration recognition and the like serve as auxiliary signals that perform score adjustment on the speech recognition candidate words. The logical order of first analyzing and confirming the speech signal as the base signal and then performing auxiliary judgment with the auxiliary signals makes terminal speech recognition stable and comfortable to operate.
In summary, in the speech recognition processing method and device provided by the present invention, on the basis of speech recognition the speech signal serves as the base signal and lip reading, face recognition, gesture recognition, laryngeal vibration recognition and the like serve as auxiliary signals. This solves the problem in the related art that a low speech recognition rate leads to a poor user experience. Each technique is used where it is strongest so that the techniques complement one another; the modules are relatively independent yet fused, greatly improving the speech recognition rate.
In another embodiment, software is also provided for executing the technical solutions described in the above embodiments and preferred implementations.
In another embodiment, a storage medium is also provided in which the above software is stored; the storage medium includes, but is not limited to, an optical disc, a floppy disk, a hard disk, a scratch-pad memory, and the like.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention may be implemented with a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be executed in an order different from that given here; or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only the preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (17)
1. A speech recognition method, characterized by comprising:
obtaining speech recognition information of a user's current speech, and obtaining auxiliary recognition information for the speech recognition information based on a current state of the user corresponding to the user's current speech;
determining a final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
2. The method according to claim 1, characterized in that determining the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information comprises:
obtaining one or more first candidate vocabulary items corresponding to the user's current speech according to the speech recognition information;
obtaining, according to the auxiliary recognition information, a vocabulary category corresponding to the user's current speech, or one or more second candidate vocabulary items;
determining the final recognition result of the user's current speech according to the one or more first candidate vocabulary items and the vocabulary category; or determining the final recognition result of the user's current speech according to the one or more first candidate vocabulary items and the one or more second candidate vocabulary items.
3. The method according to claim 2, characterized in that determining the final recognition result of the user's current speech according to the one or more first candidate vocabulary items and the vocabulary category comprises:
selecting, from the one or more first candidate vocabulary items, a first specific vocabulary item that matches the vocabulary category, and taking the first specific vocabulary item as the final recognition result of the user's current speech.
4. The method according to claim 2, characterized in that determining the final recognition result of the user's current speech according to the one or more first candidate vocabulary items and the one or more second candidate vocabulary items comprises:
selecting, from the one or more second candidate vocabulary items, a second specific vocabulary item with high similarity to the one or more first candidate vocabulary items, and taking the second specific vocabulary item as the final recognition result of the user's current speech.
5. The method according to claim 1, characterized in that obtaining the auxiliary recognition information for the speech recognition information based on the current state of the user corresponding to the user's current speech comprises:
obtaining an image used to indicate the user's current state;
obtaining image feature information according to the image;
obtaining, according to the image feature information, a vocabulary category and/or one or more candidate vocabulary items corresponding to the image feature information, and taking the vocabulary category and/or the one or more candidate vocabulary items as the auxiliary recognition information.
6. The method according to claim 5, characterized in that obtaining, according to the image feature information, the vocabulary category and/or the one or more candidate vocabulary items corresponding to the image feature information comprises:
searching a predetermined image library for a specific image with the highest similarity to the image feature information;
obtaining, according to a preset correspondence between images and vocabulary categories or one or more candidate vocabulary items, the vocabulary category or the one or more candidate vocabulary items corresponding to the specific image.
7. The method according to any one of claims 1 to 6, characterized in that the user's current state includes at least one of the following: a lip movement state of the user, a laryngeal vibration state of the user, a facial movement state of the user, and a gesture movement state of the user.
8. The method according to any one of claims 1 to 7, wherein, before obtaining the speech recognition information of the user's current speech and obtaining the auxiliary recognition information of the speech recognition information based on the current state of the user corresponding to the user's current speech, the method comprises:
determining that the accuracy of the final recognition result of the user's current speech determined based on the speech recognition information is lower than a predetermined threshold.
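The pre-condition in claim 8 gates the auxiliary pipeline on a confidence estimate: image information is only consulted when the acoustic result alone falls below a threshold. A hedged sketch, with an arbitrary 0.8 threshold and stub callbacks standing in for the unspecified acquisition and fusion steps:

```python
def recognize(speech_result, confidence, acquire_auxiliary, fuse,
              threshold=0.8):
    """Fall back to auxiliary (image-based) recognition only when the
    acoustic result's accuracy estimate is below a predetermined
    threshold, as claim 8 describes. Threshold and hooks are illustrative.
    """
    if confidence >= threshold:
        return speech_result          # acoustic result is trusted as-is
    auxiliary = acquire_auxiliary()   # e.g. capture lip/gesture image info
    return fuse(speech_result, auxiliary)

# Usage with stub callbacks: a confident result skips the camera entirely.
print(recognize("write", 0.95, lambda: ["rite"], lambda s, a: a[0]))  # -> write
print(recognize("write", 0.30, lambda: ["rite"], lambda s, a: a[0]))  # -> rite
```

Gating on confidence keeps the cheap acoustic path as the common case and reserves image capture and processing for utterances the recognizer is unsure about.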
9. A speech recognition device, wherein the device comprises:
an acquisition module, configured to obtain the speech recognition information of a user's current speech, and to obtain auxiliary recognition information of the speech recognition information based on the current state of the user corresponding to the user's current speech;
a determination module, configured to determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
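As a rough illustration of how the two claimed modules might cooperate, here is a minimal Python sketch. The module internals (a stub recognizer and a stub auxiliary source, plus an intersection-based combination rule) are hypothetical; the claim does not prescribe any particular implementation.

```python
class SpeechRecognitionDevice:
    """Sketch of claim 9: an acquisition module gathers acoustic and
    auxiliary (user-state) information; a determination module combines
    them into a final result. Both hooks are illustrative stand-ins."""

    def __init__(self, recognizer, auxiliary_source):
        # Acquisition module: two information sources.
        self.recognizer = recognizer              # audio -> candidate words
        self.auxiliary_source = auxiliary_source  # user state -> candidate words

    def final_result(self, audio, user_state):
        # Determination module: keep an acoustic candidate that the
        # auxiliary information also supports; otherwise take the top one.
        first = self.recognizer(audio)
        second = self.auxiliary_source(user_state)
        supported = [w for w in first if w in second]
        return supported[0] if supported else first[0]

device = SpeechRecognitionDevice(
    recognizer=lambda audio: ["write", "right"],
    auxiliary_source=lambda state: ["right", "rite"],
)
print(device.final_result(b"...", "lip-motion"))  # -> right
```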
10. The device according to claim 9, wherein the determination module comprises:
a first acquisition unit, configured to obtain one or more first candidate vocabularies corresponding to the user's current speech according to the speech recognition information;
a second acquisition unit, configured to obtain a vocabulary category or one or more second candidate vocabularies corresponding to the user's current speech according to the auxiliary recognition information;
a determination unit, configured to determine the final recognition result of the user's current speech according to the one or more first candidate vocabularies and the vocabulary category, or according to the one or more first candidate vocabularies and the one or more second candidate vocabularies.
11. The device according to claim 10, wherein the determination unit is further configured to select, from the one or more first candidate vocabularies, a first specific vocabulary that fits the vocabulary category, and to use the first specific vocabulary as the final recognition result of the user's current speech.
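Claim 11's category filter can be sketched as a simple lookup: keep the first acoustic candidate whose category assignment matches the image-derived vocabulary category. The category table below is invented purely for illustration.

```python
# Hypothetical category assignments for illustration only.
VOCAB_CATEGORY = {"play": "media", "pray": "other", "pay": "commerce"}

def pick_by_category(first_candidates, category):
    """Select from the acoustic candidates the first vocabulary that
    fits the image-derived category (claim 11); None if nothing fits."""
    for word in first_candidates:
        if VOCAB_CATEGORY.get(word) == category:
            return word
    return None

print(pick_by_category(["pray", "play"], "media"))  # -> play
```

The auxiliary channel here only narrows the acoustic hypothesis space rather than proposing words of its own, which is the cheaper of the two combination modes in claim 10.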
12. The device according to claim 10, wherein the determination unit is further configured to select, from the one or more second candidate vocabularies, a second specific vocabulary with the highest similarity to the one or more first candidate vocabularies, and to use the second specific vocabulary as the final recognition result of the user's current speech.
13. The device according to claim 9, wherein the acquisition module further comprises:
a third acquisition unit, configured to obtain an image for indicating the current state of the user;
a fourth acquisition unit, configured to acquire image feature information according to the image;
a fifth acquisition unit, configured to obtain, according to the image feature information, the vocabulary category and/or one or more candidate vocabularies corresponding to the image feature information, and to use the vocabulary category and/or the one or more candidate vocabularies as the auxiliary recognition information.
14. The device according to claim 13, wherein the fifth acquisition unit further comprises:
a searching subunit, configured to search a predetermined image library for the specific image with the highest similarity to the image feature information;
an obtaining subunit, configured to obtain, according to a preset correspondence between images and vocabulary categories or candidate vocabularies, the vocabulary category or the one or more candidate vocabularies corresponding to the specific image.
15. The device according to any one of claims 9 to 14, wherein the current state of the user comprises at least one of: a lip movement state of the user, a laryngeal vibration state of the user, a facial movement state of the user, and a gesture movement state of the user.
16. The device according to any one of claims 9 to 15, wherein the device further comprises:
a judgment module, configured to determine that the accuracy of the final recognition result of the user's current speech determined based on the speech recognition information is lower than a predetermined threshold.
17. A terminal, comprising a processor, wherein the processor is configured to obtain the speech recognition information of a user's current speech, to obtain auxiliary recognition information of the speech recognition information based on the current state of the user corresponding to the user's current speech, and to determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510130636.2A CN106157956A (en) | 2015-03-24 | 2015-03-24 | The method and device of speech recognition |
PCT/CN2015/079317 WO2016150001A1 (en) | 2015-03-24 | 2015-05-19 | Speech recognition method, device and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510130636.2A CN106157956A (en) | 2015-03-24 | 2015-03-24 | The method and device of speech recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106157956A true CN106157956A (en) | 2016-11-23 |
Family
ID=56976870
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510130636.2A Pending CN106157956A (en) | 2015-03-24 | 2015-03-24 | The method and device of speech recognition |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106157956A (en) |
WO (1) | WO2016150001A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107800860A (en) * | 2016-09-07 | 2018-03-13 | 中兴通讯股份有限公司 | Method of speech processing, device and terminal device |
EP3618457A1 (en) * | 2018-09-02 | 2020-03-04 | Oticon A/s | A hearing device configured to utilize non-audio information to process audio signals |
CN110865705B (en) * | 2019-10-24 | 2023-09-19 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-mode fusion communication method and device, head-mounted equipment and storage medium |
CN112672021B (en) * | 2020-12-25 | 2022-05-17 | 维沃移动通信有限公司 | Language identification method and device and electronic equipment |
CN116434027A (en) * | 2023-06-12 | 2023-07-14 | 深圳星寻科技有限公司 | Artificial intelligent interaction system based on image recognition |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4769845A (en) * | 1986-04-10 | 1988-09-06 | Kabushiki Kaisha Carrylab | Method of recognizing speech using a lip image |
JPS6419399A (en) * | 1987-07-15 | 1989-01-23 | Mitsubishi Electric Corp | Voice recognition equipment |
CN102023703A (en) * | 2009-09-22 | 2011-04-20 | 现代自动车株式会社 | Combined lip reading and voice recognition multimodal interface system |
CN102298443A (en) * | 2011-06-24 | 2011-12-28 | 华南理工大学 | Smart home voice control system combined with video channel and control method thereof |
CN104409075A (en) * | 2014-11-28 | 2015-03-11 | 深圳创维-Rgb电子有限公司 | Voice identification method and system |
CN104423543A (en) * | 2013-08-26 | 2015-03-18 | 联想(北京)有限公司 | Information processing method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002304194A (en) * | 2001-02-05 | 2002-10-18 | Masanobu Kujirada | System, method and program for inputting voice and/or mouth shape information |
US7587318B2 (en) * | 2002-09-12 | 2009-09-08 | Broadcom Corporation | Correlating video images of lip movements with audio signals to improve speech recognition |
CN101472066A (en) * | 2007-12-27 | 2009-07-01 | 华晶科技股份有限公司 | Near-end control method of image viewfinding device and image viewfinding device applying the method |
CN102324035A (en) * | 2011-08-19 | 2012-01-18 | 广东好帮手电子科技股份有限公司 | Method and system of applying lip posture assisted speech recognition technique to vehicle navigation |
CN105096935B (en) * | 2014-05-06 | 2019-08-09 | 阿里巴巴集团控股有限公司 | A kind of pronunciation inputting method, device and system |
- 2015-03-24: CN application CN201510130636.2A filed (published as CN106157956A); status: Pending
- 2015-05-19: PCT application PCT/CN2015/079317 filed (published as WO2016150001A1); status: Application Filing
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106875941B (en) * | 2017-04-01 | 2020-02-18 | 彭楚奥 | Voice semantic recognition method of service robot |
CN106875941A (en) * | 2017-04-01 | 2017-06-20 | 彭楚奥 | A kind of voice method for recognizing semantics of service robot |
CN109213970A (en) * | 2017-06-30 | 2019-01-15 | 北京国双科技有限公司 | Put down generation method and device |
CN109213970B (en) * | 2017-06-30 | 2022-07-29 | 北京国双科技有限公司 | Method and device for generating notes |
CN108010526A (en) * | 2017-12-08 | 2018-05-08 | 北京奇虎科技有限公司 | Method of speech processing and device |
CN108074561A (en) * | 2017-12-08 | 2018-05-25 | 北京奇虎科技有限公司 | Method of speech processing and device |
CN107945789A (en) * | 2017-12-28 | 2018-04-20 | 努比亚技术有限公司 | Audio recognition method, device and computer-readable recording medium |
CN108346427A (en) * | 2018-02-05 | 2018-07-31 | 广东小天才科技有限公司 | Voice recognition method, device, equipment and storage medium |
CN108449323A (en) * | 2018-02-14 | 2018-08-24 | 深圳市声扬科技有限公司 | Login authentication method, device, computer equipment and storage medium |
CN108449323B (en) * | 2018-02-14 | 2021-05-25 | 深圳市声扬科技有限公司 | Login authentication method and device, computer equipment and storage medium |
CN108510988A (en) * | 2018-03-22 | 2018-09-07 | 深圳市迪比科电子科技有限公司 | Language identification system and method for deaf-mutes |
CN108446641A (en) * | 2018-03-22 | 2018-08-24 | 深圳市迪比科电子科技有限公司 | Mouth shape image recognition system based on machine learning and method for recognizing and sounding through facial texture |
CN110415689B (en) * | 2018-04-26 | 2022-02-15 | 富泰华工业(深圳)有限公司 | Speech recognition device and method |
CN110415689A (en) * | 2018-04-26 | 2019-11-05 | 富泰华工业(深圳)有限公司 | Speech recognition equipment and method |
CN110473570A (en) * | 2018-05-09 | 2019-11-19 | 广达电脑股份有限公司 | Integrated voice identification system and method |
CN108986818A (en) * | 2018-07-04 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Video calling hangs up method, apparatus, equipment, server-side and storage medium |
CN110830708A (en) * | 2018-08-13 | 2020-02-21 | 深圳市冠旭电子股份有限公司 | Tracking camera shooting method and device and terminal equipment |
CN108965621A (en) * | 2018-10-09 | 2018-12-07 | 北京智合大方科技有限公司 | Self study smart phone sells the assistant that attends a banquet |
CN109448711A (en) * | 2018-10-23 | 2019-03-08 | 珠海格力电器股份有限公司 | Voice recognition method and device and computer storage medium |
CN109462694A (en) * | 2018-11-19 | 2019-03-12 | 维沃移动通信有限公司 | A kind of control method and mobile terminal of voice assistant |
CN109583359B (en) * | 2018-11-26 | 2023-10-24 | 北京小米移动软件有限公司 | Method, apparatus, electronic device, and machine-readable storage medium for recognizing expression content |
CN109583359A (en) * | 2018-11-26 | 2019-04-05 | 北京小米移动软件有限公司 | Presentation content recognition methods, device, electronic equipment, machine readable storage medium |
CN109697976A (en) * | 2018-12-14 | 2019-04-30 | 北京葡萄智学科技有限公司 | A kind of pronunciation recognition methods and device |
CN109872714A (en) * | 2019-01-25 | 2019-06-11 | 广州富港万嘉智能科技有限公司 | A kind of method, electronic equipment and storage medium improving accuracy of speech recognition |
CN111951629A (en) * | 2019-05-16 | 2020-11-17 | 上海流利说信息技术有限公司 | Pronunciation correction system, method, medium and computing device |
CN111447325A (en) * | 2020-04-03 | 2020-07-24 | 上海闻泰电子科技有限公司 | Call auxiliary method, device, terminal and storage medium |
CN111445912A (en) * | 2020-04-03 | 2020-07-24 | 深圳市阿尔垎智能科技有限公司 | Voice processing method and system |
CN113823278A (en) * | 2021-09-13 | 2021-12-21 | 北京声智科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN113823278B (en) * | 2021-09-13 | 2023-12-08 | 北京声智科技有限公司 | Speech recognition method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2016150001A1 (en) | 2016-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106157956A (en) | The method and device of speech recognition | |
CN112088402B (en) | Federated neural network for speaker recognition | |
CN108000526B (en) | Dialogue interaction method and system for intelligent robot | |
CN107799126B (en) | Voice endpoint detection method and device based on supervised machine learning | |
WO2017112813A1 (en) | Multi-lingual virtual personal assistant | |
CN110310623A (en) | Sample generating method, model training method, device, medium and electronic equipment | |
KR102167760B1 (en) | Sign language analysis Algorithm System using Recognition of Sign Language Motion process and motion tracking pre-trained model | |
JP2002182680A (en) | Operation indication device | |
CN112016367A (en) | Emotion recognition system and method and electronic equipment | |
WO2016173132A1 (en) | Method and device for voice recognition, and user equipment | |
KR20100001928A (en) | Service apparatus and method based on emotional recognition | |
CN109101663A (en) | A kind of robot conversational system Internet-based | |
CN111126280B (en) | Gesture recognition fusion-based aphasia patient auxiliary rehabilitation training system and method | |
CN110570873A (en) | voiceprint wake-up method and device, computer equipment and storage medium | |
CN108074571A (en) | Sound control method, system and the storage medium of augmented reality equipment | |
CN113129867B (en) | Training method of voice recognition model, voice recognition method, device and equipment | |
CN111341350A (en) | Man-machine interaction control method and system, intelligent robot and storage medium | |
CN110931018A (en) | Intelligent voice interaction method and device and computer readable storage medium | |
WO2022072752A1 (en) | Voice user interface using non-linguistic input | |
CN118197315A (en) | Cabin voice interaction method, system and computer readable medium | |
CN111158490A (en) | Auxiliary semantic recognition system based on gesture recognition | |
CN114239610A (en) | Multi-language speech recognition and translation method and related system | |
CN113873297A (en) | Method and related device for generating digital character video | |
CN113822187A (en) | Sign language translation, customer service, communication method, device and readable medium | |
CN114466179A (en) | Method and device for measuring synchronism of voice and image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | |
Application publication date: 20161123 |