CN102324035A

CN102324035A - Method and system of applying lip posture assisted speech recognition technique to vehicle navigation

Info

Publication number: CN102324035A
Application number: CN201110239403A
Authority: CN
Inventors: 伍栋杨; 王冰
Original assignee: Guangdong Coagent Electronics S&T Co Ltd
Current assignee: Guangdong Coagent Electronics S&T Co Ltd
Priority date: 2011-08-19
Filing date: 2011-08-19
Publication date: 2012-01-18

Abstract

The invention relates to a method and a system of applying a lip posture assisted speech recognition technique to vehicle navigation. The technical scheme is that a camera and a microphone are arranged at proper positions to acquire the lip posture image signals and voice signals of a user, the signals are input into an image/speech recognition processing module, a logic judgment sequence of speech recognition judgment first and lip-rounding recognition confirmation later is conducted through a speech recognition and lip-rounding recognition combined method to form uniform judgment results, and the recognized accurate information corresponds to the control commands of vehicle navigation equipment to realize a speech recognition control function. Therefore, the probability of recognition errors caused by noise interference during speech recognition is effectively reduced, the speech recognition rate in a running state and an idling state of a vehicle with car windows being closed is improved to more than 90 percent from the original approximate 80 percent, the recognition rate of the speech recognition technique applied in the field of the vehicle navigation is improved, the speech navigation is enabled to have a higher practical value, a driver can use the navigation equipment more conveniently and the driving safety factor is improved.

Description

Method and system of applying a lip posture-assisted speech recognition technique to vehicle navigation

Technical field

The present invention relates to the field of in-vehicle voice navigation, and in particular to a method and a system of applying a lip posture-assisted speech recognition technique to in-vehicle audio-video navigation.

Background technology

With the development of computers and related software and hardware, speech recognition technology is being applied in more and more fields, and its recognition rate keeps improving. Under specific conditions such as a quiet environment and standard pronunciation, the recognition rate of current speech-recognition text input systems exceeds 95%. In a vehicle, however, or where external noise is strong or pronunciation is non-standard, the recognition rate drops sharply, to the point that the technology is no longer practical. If another method could assist the decision and improve recognition accuracy, the practicality of speech recognition would improve significantly.

Human language perception is a multichannel process. In everyday communication, people perceive what others say through sound; in a noisy environment, or when the speaker's pronunciation is unclear, they also need to watch the speaker's lip posture and changes of expression in order to understand accurately what is said. Existing speech recognition systems ignore this visual aspect of language perception and use only the auditory channel, so their recognition rate falls significantly in noisy environments or when several people are speaking. This reduces the practicality of speech recognition and restricts its range of application.

With the spread of vehicle navigation systems, drivers operate the various navigation functions while driving. Operating them only by buttons and touch is not convenient enough, and the resulting distraction can easily cause accidents. Voice control can solve this problem, but current voice-controlled navigation systems are used in vehicles with considerable ambient noise, where their correct recognition rate is low; this impairs accurate control, and the results are unsatisfactory.

Summary of the invention

The object of the invention is to solve the problem of the low speech recognition rate of vehicle navigation systems in the noisy environment of a vehicle that is driving normally or idling.

To solve the above problem, the invention proposes the following scheme: exploit the multichannel nature of human language perception, use sensors to simulate "hearing" and "vision", and use lip posture recognition to assist speech recognition, thereby improving the speech recognition rate of the vehicle navigation system in noisy environments. The process is as follows. Sensors capture the sound and the sequence of lip image changes to obtain the "auditory" and "visual" information. After processing such as denoising and A/D conversion, speech recognition and lip posture recognition are carried out separately by comparison with the template library data preset in the image/voice recognition processing module. The lip posture recognition result is then compared with the speech recognition result, and if their similarity reaches a certain level, the speech recognition result is confirmed. This overcomes the effect of noise and significantly improves the speech recognition rate. Finally, the result is converted into the corresponding instruction and output to the vehicle navigation system to navigate or obtain information.

The scheme is implemented as shown in Fig. 1. The system first preprocesses the voice input and the lip image input separately and extracts features, then "trains" the "template blocks" used for recognition and matching. In use, the voice input and the lip image input are again preprocessed separately and features are extracted to obtain the "test" signal; "measurement estimation" is performed against the trained template blocks to determine the valid information for speech recognition; after a recognition decision is made with the preset "expert knowledge" system, the "result" is output, completing the speech recognition process.

Specifically, when the template blocks are trained, template training is carried out by audio recording and video capture to build the template libraries for speech recognition and lip posture recognition, and each recording is evaluated and stored in one-to-one correspondence with its lip posture video image.

The speech recognition of the invention uses template matching, which has four steps in total: feature extraction, template training, template classification, and decision.

Taking speech recognition as an example:

The first step is feature extraction. The various analog speech signals captured are A/D converted, and the resulting digital signals are processed and stored. That is, the signal is digitized and digitally denoised to remove spurious data and retain the feature data. The denoising method is based on the characteristics of in-cabin noise: the typical noise of the car when cruising or idling is analyzed, for example the noise characteristics of the engine, the air conditioning, and driving with the windows closed or open. These noise characteristics are removed from the raw speech data by a correlation operation, yielding speech feature data close to the true speech.

The second step is template training. A speech template library is built from the voice commands and related information commonly used to control the on-board equipment, such as "start", "navigation", "destination", and "Shanghai". People of different ages, sexes, and accents read these items aloud, the recordings are processed accordingly, and an in-vehicle control speech template database is established.

The third step is template classification. According to the application, templates are divided into a control command class and an address information class, and the information is further classified by scope, in order to narrow the range of the matching decision and improve matching efficiency and accuracy. The control command class includes, for example, navigation commands and voice control commands; the address information class is subdivided into, for example, province-level place names, city-level place names, and smaller place names.

The fourth step is decision. A matching algorithm matches the speech features against the models in the speech template library, and the result of the decision is compared with the lip posture recognition result to further confirm the accuracy of the speech recognition result.
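The denoising in the first step can be illustrated with spectral subtraction. This is a common technique swapped in here for illustration; the source only states that pre-analyzed noise characteristics are removed from the raw speech by a correlation operation and does not give the algorithm. The following is a minimal sketch under that assumption: the function names, the naive DFT, and the synthetic sine-wave signals are all hypothetical.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform (O(n^2)); adequate for one short frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def noise_profile(noise_frames):
    """Average magnitude spectrum of speech-free frames (engine, air conditioning, ...)."""
    spectra = [[abs(c) for c in dft(frame)] for frame in noise_frames]
    return [sum(s[k] for s in spectra) / len(spectra) for k in range(len(spectra[0]))]

def spectral_subtract(frame, profile):
    """Subtract the noise magnitude in each frequency bin, keep the phase, clamp at zero."""
    cleaned = []
    for c, noise_mag in zip(dft(frame), profile):
        magnitude = max(abs(c) - noise_mag, 0.0)
        cleaned.append(cmath.rect(magnitude, cmath.phase(c)))
    return idft(cleaned)

# Synthetic demonstration: a "speech" tone plus a lower-frequency "cabin noise" tone.
N = 64
speech = [math.sin(2 * math.pi * 12 * t / N) for t in range(N)]
cabin_noise = [0.5 * math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
noisy = [s + n for s, n in zip(speech, cabin_noise)]

profile = noise_profile([cabin_noise])
cleaned = spectral_subtract(noisy, profile)
residual = max(abs(c - s) for c, s in zip(cleaned, speech))
```

The demonstration is idealized because the noise profile is measured from exactly the noise that was added. With real cabin recordings the profile is only an average, so some residual noise would remain.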

The lip posture recognition of the invention uses a decision method that combines lip color and lip shape to locate the lips accurately. Specifically, it uses a lip-movement feature extraction and recognition method based on chroma filtering: chroma filtering of the lips yields an enhanced lip-movement image, a deformable template is then used to describe the lip contour and extract feature parameters, and a hidden Markov model (HMM) is used to recognize the sequence of lip-movement images. This method is not affected by scaling, deformation, or rotation of the lip shape, is robust to different lip types, has no special lighting requirements, and is speaker-independent. It is suitable for describing lip posture under natural conditions and satisfies the deformable template's requirement for high-resolution object edges. The lips are thereby located accurately, and a suitable lip matching algorithm is used for recognition. The lip recognition result is compared with the speech recognition result to form a unified recognition result, and the recognized information is finally mapped to the control instructions of the on-board equipment to carry out voice control. Lip posture recognition thus assists speech recognition and improves the speech recognition rate.
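The HMM recognition step can be illustrated with the forward algorithm, which scores an observation sequence against each candidate model and picks the most likely one. This is a minimal sketch, not the patent's implementation: it assumes the lip contour parameters have already been quantized into three hypothetical symbols (0 = closed, 1 = half open, 2 = wide open), and the two models and all of their probabilities are invented for illustration.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of a discrete observation sequence under one HMM (forward algorithm)."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
                 for s in range(n_states)]
    return sum(alpha)

def recognize(obs, models):
    """Return the name of the model that gives the sequence the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))

# Two hypothetical left-to-right models, each given as (start, transition, emission).
LEFT_TO_RIGHT = [[0.6, 0.4], [0.0, 1.0]]
MOSTLY_OPEN = [0.1, 0.2, 0.7]     # emission probabilities for symbols 0, 1, 2
MOSTLY_CLOSED = [0.7, 0.2, 0.1]

models = {
    "open_then_close": ([1.0, 0.0], LEFT_TO_RIGHT, [MOSTLY_OPEN, MOSTLY_CLOSED]),
    "close_then_open": ([1.0, 0.0], LEFT_TO_RIGHT, [MOSTLY_CLOSED, MOSTLY_OPEN]),
}

first = recognize([2, 2, 1, 0, 0], models)    # mouth starts open and ends closed
second = recognize([0, 0, 1, 2, 2], models)   # mouth starts closed and ends open
```

For long sequences the unscaled probabilities underflow, so a practical version would work with scaled or logarithmic values.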

The beneficial effect of the above technical solution is as follows. Speech recognition and lip posture recognition are combined through the process of feature extraction, template training, template classification, and decision. By using a decision sequence of speech recognition first and lip posture confirmation second, the probability of recognition errors caused by noise and external sound interference is effectively reduced. Experiments show that, with the vehicle driving or idling (windows closed), the speech recognition rate rises from about 80% to more than 90%. The higher recognition rate overcomes the weakness of voice-only navigation, makes voice navigation equipment more convenient to use, and makes using the navigation device while driving safer.

Description of drawings

The invention and its beneficial technical effects are described in further detail below with reference to the accompanying drawings and embodiments, in which:

Fig. 1 is a schematic diagram of the main processing flow for lip posture information and voice information according to the invention.

Fig. 2 is a diagram of the lip posture-assisted speech recognition system of the invention.

Description of reference numerals: 21, driver's face; 22, camera; 23, microphone; 24, image/voice recognition processing module; 25, vehicle navigation audio-video system.

Embodiment

The main processing flow for lip posture information and voice information disclosed by the invention is shown in Fig. 1. The system first preprocesses the voice input and the lip image input separately and extracts features, then "trains" the "template blocks" and stores them for recognition and matching. In use, the voice input and the lip image input are again preprocessed separately and features are extracted to obtain the "test" signal; "measurement estimation" is performed against the "trained" "template blocks" to determine the valid information for speech recognition; after a recognition decision is made with the preset "expert knowledge" system, the "result" is output, completing the speech recognition process.

In general, the method of applying the lip posture-assisted speech recognition technique to vehicle navigation disclosed by the invention mainly comprises the following steps:

a. Obtain voice information through a voice recording device, process it through feature extraction, template training, template classification, and decision, and then perform speech recognition;

b. Obtain image information through a lip posture camera device, process it through feature extraction, template training, template classification, and decision, and then perform lip posture recognition, the lip posture image information corresponding one-to-one with the voice information in step a;

c. Compare the speech recognition result with the lip posture recognition result; when the similarity of the two results reaches a certain level, the speech recognition result is confirmed as valid and is output;

d. Convert the speech recognition result into the corresponding instruction and output it to the vehicle navigation device to navigate or obtain information.
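Steps a to d can be sketched as a small orchestration function. This is only an illustrative outline under stated assumptions: the recognizers, the similarity measure, and the command table are passed in as hypothetical callables, the stub implementations below stand in for real recognizers, and the default threshold of 0.7 follows the preferred 70% value given later in the description.

```python
def assisted_recognition(audio, frames, recognize_speech, recognize_lips,
                         similarity, to_command, threshold=0.7):
    """Speech recognition first, lip posture confirmation second.

    Returns the control command, or None if the lip result does not confirm
    the speech result.
    """
    speech_result = recognize_speech(audio)                  # step a
    lip_result = recognize_lips(frames)                      # step b
    if similarity(speech_result, lip_result) < threshold:    # step c
        return None
    return to_command(speech_result)                         # step d

# Hypothetical stand-ins for the real recognizers and command mapping.
def fake_speech_recognizer(audio):
    return audio["heard"]

def fake_lip_recognizer(frames):
    return frames["seen"]

def exact_match(a, b):
    return 1.0 if a == b else 0.0

COMMAND_TABLE = {"open navigation": "NAV_OPEN"}   # hypothetical command code

confirmed = assisted_recognition(
    {"heard": "open navigation"}, {"seen": "open navigation"},
    fake_speech_recognizer, fake_lip_recognizer, exact_match, COMMAND_TABLE.get)

rejected = assisted_recognition(
    {"heard": "open navigation"}, {"seen": "answer"},
    fake_speech_recognizer, fake_lip_recognizer, exact_match, COMMAND_TABLE.get)
```

Passing the recognizers in as parameters keeps the decision order (speech first, lip confirmation second) separate from how each recognizer is implemented.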

Further, the method for the template matches of speech recognition employing of the present invention is divided into four steps: feature extraction, template training, template classification, judgement.

With the voice recognition is example:

(a) feature extraction is carried out the A/D conversion with the various analog signal of voice of gathering, and processes and stores after converting digital signal to.Be about to this signal digital and carry out digital denoising processing, remove pseudo-data, the keeping characteristics data.The denoising method that adopts is the characteristics according to the environment inside car noise; Analyze the normal mode noise of car when cruising or idling; As close or the engine when opening vehicle window, air-conditioning and driving noise characteristic data; The primary voice data of gathering through related operation, is formed near real voice feature data after removing these noise characteristic data.

(b) template training; Control voice command commonly used and relevant information is set up the sound template storehouse according to mobile unit; Like voice such as " beginning ", " navigation ", " destination ", " Shanghai "; Look for the people of all ages and classes, sex, accent to read, and do corresponding processing, set up the automobile-used sound template database of controlling.

(c) template classification is divided into control command class, address information class according to application characteristic, and range of information is type classification by size, to dwindle coupling judgement scope, improves matching efficiency and accuracy rate.The control command class is specifically just like navigation command class, voice control class; The big group of address information is specifically just like provincial place name, city-level place name or littler place name etc.

(d) judge, utilize matching algorithm to carry out phonetic feature and sound template storehouse Model Matching,, further confirm the accuracy of voice identification result result who judges and Mouth-Shape Recognition comparison.

Preferably, the speech recognition algorithm uses the hidden Markov model (HMM) method. On the basis of this general algorithm, the invention optimizes and adapts the related algorithms for the particular environment of in-vehicle voice applications. Specifically, the template library is classified sensibly and arranged in order from small to large; when speech features are matched, the search starts with the small groups and proceeds to the large classes in turn, which effectively improves matching efficiency. The small groups contain only the specific commands for controlling the on-board equipment and the commonly used, key speech template data.
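The small-group-first search order can be sketched as follows. This is a hedged illustration rather than the patent's algorithm: templates are reduced to hypothetical two-dimensional feature vectors, the distance-based `score` function stands in for HMM scoring, and the acceptance threshold `accept` is an assumed stopping rule that the source does not specify. Apart from "start", "navigate", and "Shanghai", which echo the source's examples, the entries are made up.

```python
import math

def tiered_match(features, groups, score, accept=0.8):
    """Search template groups from smallest to largest; stop at the first good match."""
    for group in sorted(groups, key=len):
        if not group:
            continue
        best = max(group, key=lambda name: score(features, group[name]))
        if score(features, group[best]) >= accept:
            return best
    return None

def score(a, b):
    """Hypothetical similarity in (0, 1]: 1.0 for identical feature vectors."""
    return 1.0 / (1.0 + math.dist(a, b))

# Small group: control commands. Larger group: address information.
commands = {"start": (0.0, 1.0), "navigate": (1.0, 0.0)}
places = {"Guangdong": (3.0, 3.0), "Shanghai": (4.0, 1.0), "Beijing": (1.0, 4.0)}

hit_command = tiered_match((0.9, 0.1), [places, commands], score)
hit_place = tiered_match((4.1, 1.0), [places, commands], score)
no_hit = tiered_match((10.0, 10.0), [places, commands], score)
```

Because the command group is searched first, a frequently used command is found without touching the larger address group.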

For lip posture recognition, the invention preferably uses a lip-movement feature extraction and recognition method based on chroma filtering. Chroma filtering of the lips yields an enhanced lip-movement image; a deformable template is then used to extract and track the lip contour and extract feature parameters; the results (curve parameters) are fed to the recognizer, and an HMM recognizes the sequence of lip-movement images.
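The chroma filtering stage can be illustrated with a simple threshold on red-difference chroma (Cr), since lips are typically redder than the surrounding skin. The source does not give its filter, so everything here is an assumption: the BT.601-style Cr formula is a standard conversion chosen for illustration, the threshold of 165 and the pixel colors are made up, and a bounding box replaces the deformable template that the patent actually uses to describe the contour.

```python
def cr_chroma(r, g, b):
    """Red-difference chroma (Cr) of an 8-bit RGB pixel, BT.601 full-range form."""
    return 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

def lip_mask(image, threshold=165):
    """Boolean mask of pixels whose Cr exceeds the (assumed) threshold."""
    return [[cr_chroma(*pixel) > threshold for pixel in row] for row in image]

def mouth_box(mask):
    """Bounding box (left, top, right, bottom) of the masked pixels, or None."""
    points = [(x, y) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if not points:
        return None
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

# Tiny synthetic frame: a 4-by-2 block of lip-colored pixels on a skin-colored background.
SKIN = (200, 150, 130)   # Cr is about 154.6
LIP = (190, 80, 90)      # Cr is about 182.2
image = [[LIP if 1 <= y <= 2 and 1 <= x <= 4 else SKIN for x in range(6)]
         for y in range(4)]

box = mouth_box(lip_mask(image))
width = box[2] - box[0] + 1    # crude mouth-width feature, in pixels
height = box[3] - box[1] + 1   # crude mouth-opening feature, in pixels
```

Tracking how `width` and `height` change from frame to frame gives a simple observation sequence of the kind an HMM recognizer could consume.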

The structure of the lip posture-assisted speech recognition system of the invention is shown in Fig. 2. It comprises a vehicle navigation audio-video system (25), an image/voice recognition processing module (24) connected to it, and a microphone (23) and a camera (22) connected to the input of the image/voice recognition processing module (24). When the driver, with face 21 toward the microphone 23 and the camera 22, speaks, the microphone 23 and the camera 22 capture the voice signal and the lip posture image signal respectively and input them to the image/voice recognition processing module 24 for processing (such as denoising, preprocessing, feature extraction, decision, and recognition). The recognition result is converted into the corresponding control instruction and input to the vehicle navigation audio-video system 25 to carry out the voice control operation.

Preferably, the microphone 23 is a high-fidelity, high-sensitivity directional electret condenser microphone capsule, installed on the upper part of the instrument panel directly in front of the driver's seat with its pickup opening facing the driver's face 21, so as to capture the best possible voice signal and reduce the effect of noise from inside and outside the car as far as possible.

Preferably, the camera 22 is a CCD video image sensor with night vision, a video resolution of 640 × 480 at 25 frames per second, and 32-bit true color. It is installed at the upper edge of the windshield directly in front of the driver, with the lens facing the driver's face 21, so that clear lip images can be obtained even in dim light and the system can analyze the images more accurately.

Preferably, the processor used in the image/voice recognition processing module 24 is a high-performance DSP, which ensures that the system has good real-time performance.
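For a sense of scale, the raw data rate implied by the preferred camera parameters can be worked out directly. The calculation below assumes uncompressed frames with all 32 bits stored per pixel, which the source does not state.

```python
WIDTH, HEIGHT = 640, 480      # preferred video resolution
BITS_PER_PIXEL = 32           # 32-bit true color
FRAMES_PER_SECOND = 25

bytes_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL // 8
bytes_per_second = bytes_per_frame * FRAMES_PER_SECOND

print(bytes_per_frame)    # 1228800
print(bytes_per_second)   # 30720000
```

That is roughly 30.7 MB of image data per second before any cropping to the lip region.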

In software, control commands use fixed forms such as "open navigation", "locate target", "plan route", "make a call", and "answer", which greatly reduces the amount of computation needed for template matching and also improves recognition efficiency. Map addresses and voice information are recognized by fuzzy matching on key words, which widens the range of what can be recognized and also improves the information recognition rate. These methods provide a sound basis for correct voice command control.
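The split between fixed-form commands and fuzzy address matching can be sketched as below. This is a text-level illustration under stated assumptions: the real system matches acoustic and lip templates rather than strings, the command codes and the place list are hypothetical, and `difflib.get_close_matches` from the Python standard library stands in for the unspecified key-word fuzzy matching method.

```python
import difflib

# Fixed-form commands: exact lookup only. The command codes are hypothetical.
COMMANDS = {
    "open navigation": "NAV_OPEN",
    "locate target": "NAV_LOCATE",
    "plan route": "NAV_ROUTE",
    "make a call": "PHONE_DIAL",
    "answer": "PHONE_ANSWER",
}

# Address key words: matched approximately. The list is a made-up sample.
PLACES = ["Shanghai", "Guangdong", "Guangzhou"]

def interpret(text, cutoff=0.6):
    """Return ("command", code), ("address", place), or None if nothing matches."""
    if text in COMMANDS:
        return ("command", COMMANDS[text])
    hits = difflib.get_close_matches(text, PLACES, n=1, cutoff=cutoff)
    return ("address", hits[0]) if hits else None

exact = interpret("plan route")
fuzzy = interpret("Shanghi")     # slightly garbled place name
unknown = interpret("qqq")
```

Exact lookup keeps the command path cheap, while the fuzzy path tolerates small recognition errors in the much larger set of place names.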

Preferably, the template library is built as follows: 20 men and 20 women aged 16 to 70 are selected, and recordings are made of navigation voice commands, map information speech, audio playback commands and audio program names, and device control commands, each together with the corresponding lip posture images. A basic template library is built after voice/lip posture comparison and feature extraction. Once the speech recognition template library has been built, its contents are classified and stored in the corresponding template class libraries for later use.

During lip posture-assisted speech recognition, feature data are captured by the microphone 23 and the camera 22. For speech, the image/voice recognition processing module 24 first denoises the raw sound captured and then extracts features. After the corresponding lip posture features have been extracted, a series of matching and recognition decisions is made against the preset template library data. The feature result of the speech recognition decision is then compared with the corresponding lip posture recognition feature result. Preferably, the voice content is confirmed when the similarity of the two recognition results reaches 70% or more; the voice content is then converted into a control instruction and sent to the vehicle navigation audio-video system for processing.
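The 70% confirmation rule needs a similarity measure, which the source does not define. One possible measure, assumed here purely for illustration, compares the two results as sequences of recognized units using `difflib.SequenceMatcher` from the Python standard library. The syllable labels are hypothetical.

```python
from difflib import SequenceMatcher

def result_similarity(speech_units, lip_units):
    """Similarity in [0, 1]: twice the number of matching units over the total number of units."""
    return SequenceMatcher(None, speech_units, lip_units, autojunk=False).ratio()

def confirm(speech_units, lip_units, threshold=0.70):
    """Confirm the speech result when the similarity reaches the threshold."""
    return result_similarity(speech_units, lip_units) >= threshold

heard = ["da", "kai", "dao", "hang"]        # hypothetical units from speech recognition
seen_close = ["da", "kai", "bao", "hang"]   # lip result with one unit different
seen_far = ["da", "gai", "bao", "hang"]     # lip result with two units different

print(result_similarity(heard, seen_close))   # 0.75
print(result_similarity(heard, seen_far))     # 0.5
print(confirm(heard, seen_close))             # True
print(confirm(heard, seen_far))               # False
```

With this measure, one differing unit out of four still confirms the speech result, while two differing units do not.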

When the lip posture-assisted speech recognition technique is applied in a vehicle navigation system, the improved speech recognition rate means that, while the vehicle is moving, the voice navigation device can recognize and respond to the driver's voice commands in real time even in a noisy environment and navigate accordingly, which helps as far as possible to avoid accidents caused by the driver operating the navigation device.

Based on the disclosure and teaching of the above description and specific embodiments, those skilled in the art to which the invention pertains may also change and modify the above embodiments. The invention is therefore not limited to the specific embodiments disclosed and described above, and certain modifications and changes to the invention should also fall within the scope of protection of its claims. In addition, although certain specific terms and concepts are used in this description, they are used only for convenience of explanation and do not limit the invention in any way.

Claims

1. A method of applying a lip posture-assisted speech recognition technique to vehicle navigation, characterized by comprising the following steps:

a. obtaining voice information through a voice recording device, processing it through feature extraction, template training, template classification, and decision, and then performing speech recognition;

b. obtaining image information through a lip posture camera device, processing it through feature extraction, template training, template classification, and decision, and then performing lip posture recognition, the lip posture image information corresponding one-to-one with the voice information in step a;

c. comparing the speech recognition result with the lip posture recognition result, and, when the similarity of the two results reaches a certain level, confirming the speech recognition result as valid and outputting it;

d. converting the speech recognition result into the corresponding instruction and outputting it to the vehicle navigation device to navigate or obtain information.

2. The method of applying a lip posture-assisted speech recognition technique to vehicle navigation according to claim 1, characterized in that step a specifically comprises:

(a) feature extraction: the various analog speech signals captured are A/D converted, and the resulting digital signals are processed and stored; that is, the signal is digitized and digitally denoised to remove spurious data and retain the feature data;

(b) template training: a speech template library is built from the voice commands and related information commonly used to control the on-board equipment; people of different ages, sexes, and accents read them aloud, the recordings are processed accordingly, and an in-vehicle control speech template database is established;

(c) template classification: templates are classified according to the application, namely into a control command class and an address information class, and the information is further classified by scope, in order to narrow the range of the matching decision and improve matching efficiency and accuracy;

(d) decision: a matching algorithm matches the speech features against the models in the speech template library, and the result of the decision is output.

3. The method of applying a lip posture-assisted speech recognition technique to vehicle navigation according to claim 1, characterized in that step b further comprises: using a lip-movement feature extraction and recognition method based on chroma filtering, in which chroma filtering of the lips yields an enhanced lip-movement image, a deformable template is then used to describe the lip contour and extract feature parameters, and an HMM is used to recognize the sequence of lip-movement images.

4. The method of applying a lip posture-assisted speech recognition technique to vehicle navigation according to claim 1, characterized in that the similarity reaching a certain level in step c means the similarity reaching 70% or more.

5. A system of applying a lip posture-assisted speech recognition technique to vehicle navigation, characterized by comprising: a vehicle navigation audio-video system (25), an image/voice recognition processing module (24) connected to it, and a microphone (23) and a camera (22) connected to the input of the image/voice recognition processing module (24); the microphone (23) and the camera (22) respectively capture the voice signal and the lip posture image signal and input them to the image/voice recognition processing module (24) for processing and recognition; the recognition result is converted into the corresponding control instruction and input to the vehicle navigation audio-video system (25) to carry out the voice control operation.

6. The system of applying a lip posture-assisted speech recognition technique to vehicle navigation according to claim 5, characterized in that the microphone (23) is a high-fidelity, high-sensitivity directional electret condenser microphone capsule.

7. The system of applying a lip posture-assisted speech recognition technique to vehicle navigation according to claim 5, characterized in that the camera (22) is a CCD video image sensor with night vision, a video resolution of 640 × 480 at 25 frames per second, and 32-bit true color.

8. The system of applying a lip posture-assisted speech recognition technique to vehicle navigation according to claim 5 or 6, characterized in that the microphone (23) is installed on the upper part of the instrument panel directly in front of the driver's seat, with its pickup opening facing the driver's face (21).

9. The system of applying a lip posture-assisted speech recognition technique to vehicle navigation according to claim 5 or 7, characterized in that the camera (22) is installed at the upper edge of the windshield directly in front of the driver's seat, with its lens facing the driver's face (21).

10. The system of applying a lip posture-assisted speech recognition technique to vehicle navigation according to claim 5, characterized in that the processor used in the image/voice recognition processing module (24) is a high-performance DSP.