CN101038742B - Apparatus and method for assistant voice remote control using image feature

Info

Publication number: CN101038742B
Application number: CN2006100585631A
Authority: CN
Inventors: 洪进福
Original assignee: Hongfujin Precision Industry Shenzhen Co Ltd; Hon Hai Precision Industry Co Ltd
Current assignee: Fujin Precision Industry Shenzhen Co Ltd; Hon Hai Precision Industry Co Ltd
Priority date: 2006-03-16
Filing date: 2006-03-16
Publication date: 2011-06-22
Anticipated expiration: 2026-03-16
Also published as: CN101038742A

Abstract

A device and a method for assisting voice remote control by using image features, suitable for a remote-controlled apparatus having image and voice capture functions. A voice feature library, an image feature library, and a command library for voice remote control are provided in the device. When voice remote control is performed, the method comprises: inputting a voice through a sound-receiving device, analyzing its voice features, and searching the voice feature library to find, in the corresponding command library, a command set whose commands approximate the voice features; capturing a real-time image with an image capture element and searching the image feature library with the real-time image; using the search result of the image feature library to check which command in the command set is the one needed under the operator's actual operating conditions; and executing that command. Checking voice commands against image features in this way increases the accuracy of voice control and effectively reduces operating errors.

Description

Device and method for assisting voice remote control by using image features

Technical field

The present invention relates to a device and method for assisting voice remote control by using image features, and in particular to a device and method that use image features to check the plausibility of a voice command and thereby increase the accuracy of voice command recognition.

Background technology

In the past, audio-visual devices such as digital cameras, stereos, and video recorders could be controlled with a remote control in addition to the buttons on the machine. The operator only needs to press the buttons on the remote control and need not touch the audio-visual device at all. The precondition, however, is that the operator has the remote control in hand; if the remote control is lost or out of reach, the convenience of remote operation is no longer available.

Newer voice remote control technology lets the operator control a device remotely without holding any remote control. The principle is to receive the operator's voice with a sound-receiving device (such as a microphone), analyze its voice features, look up the corresponding operating command in a command library, and then execute that command. Speech recognition technology has been under development for many years, and new approaches and related patents are continually being proposed at home and abroad.

Take the content of US patent publication US2005/0071169A1 as an example. Its inventor considered that different operators speak at different speeds, and the countermeasure is to automatically add a delay between the time reception finishes and the time execution begins, so as to determine whether the voice command has been completely issued. That publication introduces the idea of a time axis, but it still works only with sound information.

Take the content of US patent publication US2005/0105575A1 as another example. The problem it considers is that a single voice command may cause several devices in the same room to react at once, leading to unexpected errors and confusion. Its inventor's countermeasure is to equip every remote-controlled device in the room with a camera and a microphone, but the camera serves only to detect whether the operator is issuing the command to that particular device, so as to avoid such confusion. Because the camera in that invention is used only to decide whether to accept a voice command, not to improve the accuracy of speech recognition, it differs from the present invention.

In addition, US Patent No. 6,452,625B1 discloses a compact video microscope. Although it also contains a microphone and image capture equipment, the image capture equipment serves mainly for video recording, and the microphone merely provides simple sound recording or voice control. The patent does not mention using image information to assist voice control, nor does it explain how the video microscope is operated by voice.

US Patent No. 6,289,140B1 also discloses a voice control technique applicable to image capture devices, providing a method for recognizing voice commands and the hardware architecture needed to carry it out. The later US Patent No. 6,762,692B1 further proposes displaying a voice command tree on a screen to help the user read out predetermined vocabulary to operate the device. However, neither of these two patents considers using image information to assist the recognition of voice control commands.

The above patent documents and typical speech recognition systems simply collect the voice, analyze its features, and then find the corresponding command in a command library according to those features. But the conditions for speech recognition vary with the operator's accent, speaking speed, and the ambient background at the time; the comparison conditions and influencing factors may differ from person to person and from place to place, and are rather complicated. How to improve the recognition rate of voice commands is a major challenge in current research and development, and a focus on which companies are competing.

Summary of the invention

The purpose of the present invention is to add an image feature check to the recognition process for voice commands, thereby improving the accuracy of voice control.

To achieve the above purpose, the present invention proposes a device and method for assisting voice remote control by using image features. The device contains a voice feature library, an image feature library, and a command library for voice remote control. Voice remote control operation comprises the following steps: (a) inputting a voice through a sound-receiving device and comparing the features of the voice against the voice feature library, thereby choosing, from the command library corresponding to the voice feature library, all commands that can correspond to the voice features, and collecting those commands into a command set; (b) capturing a real-time image with an image capture element and using the real-time image to search the image feature library; (c) using the search result of the image feature library to filter out, from the command set, the command that meets the needs of the user's actual operating conditions; and (d) executing that command.

The device of the present invention is applicable to remote-controlled apparatus having image and voice capture functions, such as digital cameras, digital video recorders, studio video cameras, and camera phones.

By checking voice commands against image features, the method of the present invention increases the accuracy of voice control and effectively reduces operating errors.

Description of drawings

Fig. 1 is a block diagram of an embodiment of the device of the present invention for assisting voice remote control by using image features.

Fig. 2 is a block diagram of an embodiment of the method of the present invention for assisting voice remote control by using image features.

Fig. 3 is a block diagram of an alternative embodiment of the method of the present invention for assisting voice remote control by using image features.

Description of main reference symbols:

a, b, c, c1, c2, d, d1: steps
10: image capture device
11: lens module
12: image sensing module
13: image processor
14: display screen
15: data storage module
16: memory
17: processing unit
18: transmission interface
19: button
20: microphone
21: voice recognition device
21A: voice feature library
21B: image feature library
21C: command library

Embodiment

To make the above purposes, features, and advantages of the present invention more readily understood, preferred embodiments are described in detail below with reference to the accompanying drawings.

Referring to Fig. 1, an embodiment of an image capture device 10 implemented according to the present invention comprises a lens module 11, an image sensing module 12, an image processor 13, a display screen 14, a data storage module 15, a memory 16, a processing unit 17, a transmission interface 18, a button 19, a microphone 20, and a voice recognition device 21. The microphone 20 is used to input sound. The lens module 11 takes in an optical image, which the image sensing module 12 converts into an array image; the image is processed by the image processor 13, shown on the display screen 14, and stored in the data storage module 15 and the memory 16. During operation, commands are input to the processing unit 17 through the button 19 and the voice recognition device 21 to perform operations such as shooting, video recording, browsing, adding or deleting files, or transmitting. The transmission interface 18 can connect to a computer, a mobile phone, or other audio-visual devices through a general radio-frequency transmitter module, Bluetooth communication, a USB port, a 1394 port, or an optical fiber communication port. Because, apart from the voice recognition device 21, the image capture device 10 consists entirely of known components, those components are not described further here.

The voice recognition device 21 comprises a voice feature library 21A, an image feature library 21B, and a command library 21C. The commands in the command library 21C correspond to the voice feature library 21A and the image feature library 21B respectively. The features of a voice input through the microphone 20 can be compared against the voice feature library 21A to find, in the command library 21C, the commands that can correspond to those voice features. Because people differ in speaking speed and pitch, and inevitably also in volume and accent, accepting only a single best match is very likely to produce a misjudgment. The comparison condition can therefore be relaxed so that all commands close to the voice features are chosen together to form a command set.
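As an illustration only, the relaxed comparison described above might be sketched in Python as follows. The specification does not say how voice features are represented or compared, so the three-element feature vectors, the Euclidean distance, the threshold values, and the names VOICE_FEATURE_LIBRARY and candidate_words are all assumptions made for this sketch.

```python
import math

# Hypothetical voice feature library: spoken word -> reference feature vector.
# Real features would come from an acoustic front end; these numbers are made up.
VOICE_FEATURE_LIBRARY = {
    "up": (1.0, 0.0, 0.0),
    "on": (0.9, 0.2, 0.0),
    "down": (0.0, 1.0, 0.0),
    "send": (0.0, 0.0, 1.0),
}


def candidate_words(features, library, max_distance):
    """Return every library word whose reference vector lies within max_distance."""
    matches = []
    for word, reference in library.items():
        distance = math.dist(features, reference)
        if distance <= max_distance:
            matches.append((distance, word))
    # Closest match first, but every sufficiently close match is kept.
    return [word for _, word in sorted(matches)]


heard = (0.98, 0.05, 0.0)
# A strict threshold keeps a single word; a relaxed one keeps the near misses too.
print(candidate_words(heard, VOICE_FEATURE_LIBRARY, max_distance=0.1))
print(candidate_words(heard, VOICE_FEATURE_LIBRARY, max_distance=0.3))
```

Widening max_distance trades precision for recall: more candidates survive the voice comparison, and the later image comparison is relied on to narrow them down.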

When the lens module 11 takes in an optical image and the image sensing module 12 and image processor 13 produce a real-time image that reflects the scene as it currently is, the voice recognition device 21 captures this real-time image and compares it against the image feature library 21B. The comparison result is used to check or filter the command set chosen from the command library 21C and to determine the command that meets the needs of the user's actual operating conditions, which is then executed by the processing unit 17.

The image feature library 21B stores a plurality of image features corresponding to the commands of the command library 21C. Image features may include, but are not limited to, measured levels of brightness and the shapes of object contours. For example, the voice recognition device 21 can use the brightness of the scene reflected in the image to judge whether the operator's voice command to change the ISO value means an increase or a decrease; or it can determine the position of a human figure by contour analysis and adjust the position in the frame on which focus should be locked. The correspondence between the image feature library 21B and the command library 21C can be recorded in advance during assembly to define the command sets corresponding to different image features. The contents of the image feature library 21B and their correspondence with the command library 21C can of course also be changed by the operator after purchase, being edited, added to, or deleted according to professional use or a specific application.
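The brightness example above can be illustrated with a minimal Python sketch. The grayscale frame layout (rows of 0 to 255 integers), the two thresholds, and the function names are assumptions chosen for illustration; the specification does not state how brightness is measured.

```python
def mean_brightness(frame):
    """Average pixel value of a grayscale frame given as rows of 0-255 integers."""
    pixels = [value for row in frame for value in row]
    return sum(pixels) / len(pixels)


def lighting_feature(frame, dark_below=60, bright_above=190):
    """Classify a frame as 'dark', 'normal', or 'bright' (thresholds are assumed)."""
    level = mean_brightness(frame)
    if level < dark_below:
        return "dark"
    if level > bright_above:
        return "bright"
    return "normal"


def iso_direction(lighting):
    """Decide how an ISO command should be read for the given lighting feature."""
    return {"dark": "raise", "bright": "lower"}.get(lighting, "unchanged")


dim_frame = [[10, 20], [30, 40]]
print(lighting_feature(dim_frame), iso_direction(lighting_feature(dim_frame)))
```

In this sketch a dark scene turns an ISO command into an increase and a bright scene into a decrease, which mirrors the judgment described above.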

The contents of the command library 21C can be recorded in advance during assembly. For example, it can contain a first command set of browsing commands, namely voice remote control commands such as "save", "delete", "zoom in", "zoom out", "left", "right", "up", "down", "send", and "send all".

The command library 21C of this embodiment can also include a second command set needed for moving the focus point when shooting, for example voice remote control commands such as "on face", which locks focus on the face of a human figure in the frame, "left", which moves the focus point to the left, "up", which moves the focus point up, and "down", which moves the focus point down.

In addition, the command library 21C of this embodiment can also include a third command set needed when the light is insufficient, for example voice remote control commands such as "up", which raises the ISO value when the light is too dark, "down", which lowers the ISO value when the light is too bright, "on", which turns on the flash, and, once the flash is on, "up", which increases the flash brightness, and "down", which reduces it.
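One possible way to picture the three command sets is as a nested mapping, sketched below in Python. The spoken words are taken from the description; the dictionary layout, the set names, and the action names on the right-hand side are assumptions made for illustration, and the flash brightness commands are left out to keep the sketch short.

```python
# Command sets paraphrased from the description. The layout and the action
# names are illustrative assumptions, not part of the specification.
COMMAND_LIBRARY = {
    "browse": {
        "save": "save_file", "delete": "delete_file",
        "zoom in": "zoom_in", "zoom out": "zoom_out",
        "left": "pan_left", "right": "pan_right",
        "up": "pan_up", "down": "pan_down",
        "send": "send_file", "send all": "send_all_files",
    },
    "focus": {
        "on face": "lock_focus_on_face", "left": "move_focus_left",
        "up": "move_focus_up", "down": "move_focus_down",
    },
    "exposure": {
        "up": "raise_iso", "down": "lower_iso", "on": "flash_on",
    },
}


def meanings_of(word, library=COMMAND_LIBRARY):
    """Return every (command set, action) pair that a spoken word could denote."""
    return [(name, actions[word]) for name, actions in library.items() if word in actions]


print(meanings_of("up"))
```

Looking up a word such as "up" returns one entry per command set, which shows why the voice alone cannot settle which command is meant.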

Referring to Fig. 1 and Fig. 2, an embodiment of the method for assisting voice remote control by using image features according to the present invention comprises the following steps:

(a) inputting a voice through a sound-receiving device and comparing the features of the voice against a voice feature library, thereby choosing, from a command library corresponding to the voice feature library, all commands that can correspond to the voice features, and collecting those commands into a command set;

(b) capturing a real-time image with an image capture element and comparing the real-time image against an image feature library;

(c) using the search result of the image feature library to filter out, from the command set, the command that meets the needs of the user's actual operating conditions; and

(d) executing the command that meets the needs of the user's actual operating conditions.

In this way, assisting voice remote control with image features increases the accuracy of speech recognition and effectively reduces operating errors.

The sound-receiving device in step (a) can be realized by the microphone 20 in Fig. 1. After a voice is input, the matching voice features are found in the voice feature library 21A and used to find all commands in the command library 21C that can correspond to those features. For example, when the voice is "up", the same voice command might mean "browse toward the top of the picture" in the first command set, "move the focus point up" in the second command set, or "raise the ISO value" in the third command set. This step picks out the relevant commands and assembles them into a command set.

In steps (b) and (c), the image capture element can be realized by the image processor 13 in Fig. 1. The image processor 13 produces a real-time image, and the voice recognition device 21 compares the image features of this real-time image against the image feature library 21B in order to filter the command set and choose from it the command that meets the needs of the user's actual operating conditions. For example, when the comparison against the image feature library 21B indicates that no new image is currently being input, the user is inferred to be browsing, so the voice command "up" should be "browse toward the top of the picture" in the first command set. When the comparison indicates that a new picture is being input but the light is insufficient, the voice command can be inferred to be "raise the ISO value" in the third command set. And if the comparison indicates that the light is normal and a new picture is being input, the voice command can be inferred to be "move the focus point up" in the second command set.
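The three inferences in this example can be written down as a short Python sketch. The Scene fields, the set names, and the action names are assumptions for illustration; the specification only states which command set should win in each situation.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    """Result of comparing the real-time image with the image feature library."""
    new_image_input: bool  # True while the lens is feeding new pictures
    low_light: bool        # True when the scene is judged too dark


def infer_command_set(scene):
    """Apply the three rules from the example: browse, exposure, or focus."""
    if not scene.new_image_input:
        return "browse"    # no new image: the user is assumed to be browsing
    if scene.low_light:
        return "exposure"  # new image but insufficient light: ISO is meant
    return "focus"         # new image and normal light: focus point is meant


# Candidate meanings of the spoken word "up", one per command set (assumed names).
UP_CANDIDATES = {
    "browse": "pan_up",
    "focus": "move_focus_up",
    "exposure": "raise_iso",
}


def resolve(candidates, scene):
    """Pick the candidate whose command set matches the scene (step (c))."""
    return candidates[infer_command_set(scene)]


print(resolve(UP_CANDIDATES, Scene(new_image_input=True, low_light=True)))
```

The order of the checks matters in this sketch: the absence of new image input is tested first, so lighting is only considered while shooting.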

Those skilled in the art can readily infer other feasible variations from the above description and embodiment, and adjust them to the preferences or needs of different consumer groups. For example, offering too many voice commands is an unacceptable drawback for a forgetful user, so the designer must try to keep the number of voice commands as small as possible. Reducing the number of voice commands, however, inevitably leads to situations for which no judgment criterion can be defined in advance. An alternative embodiment, shown in Fig. 3, can therefore be adopted, with the following steps:

(b) capturing a real-time image with an image capture element and comparing the image features of the real-time image against an image feature library;

(c1) using the search result of the image feature library to filter out, from the command set, a plurality of commands that meet the needs of the user's actual operating conditions;

(c2) using a display to show the plurality of commands that meet the needs of the user's actual operating conditions, so that the operator can select one of them; and

(d1) executing the command selected by the operator.

Although the embodiment of Fig. 3 still requires the user to speak the command they finally want to select, step (c1) has already used image features to filter out the set of commands that meet the needs of the user's actual operating conditions, and these filtered commands are displayed on a screen (for example the display screen 14 in Fig. 1). For a forgetful operator, being able to look at the screen and read out a command that fits the actual operating conditions is a great relief.
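Steps (c1), (c2), and (d1) of this alternative embodiment might be sketched as follows. The tuple layout of the candidates, the text menu, and selection by a 1-based number are assumptions for illustration; in the embodiment the operator reads the chosen command aloud from the display.

```python
def filter_candidates(candidates, allowed_sets):
    """Step (c1): keep every candidate whose command set fits the image comparison."""
    return [(name, action) for name, action in candidates if name in allowed_sets]


def render_menu(options):
    """Step (c2): build the lines that a display could show to the operator."""
    return [f"{index}. {action}" for index, (_, action) in enumerate(options, start=1)]


def execute_choice(options, choice):
    """Step (d1): return the action the operator selected (1-based index)."""
    if not 1 <= choice <= len(options):
        raise ValueError("choice out of range")
    return options[choice - 1][1]


candidates = [("browse", "pan_up"), ("focus", "move_focus_up"), ("exposure", "raise_iso")]
# Assume the image comparison could not decide between focus and exposure.
options = filter_candidates(candidates, {"focus", "exposure"})
for line in render_menu(options):
    print(line)
print(execute_choice(options, 2))
```

Unlike the embodiment of Fig. 2, the filter here deliberately leaves more than one command, and the final decision is handed to the operator.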

This voice remote control method, assisted by image information analysis, helps increase the accuracy of voice remote control and effectively reduces operating errors.

Compared with known methods, the advantages of the present invention include:

1. The method of the present invention compares image features obtained in real time against an image feature library, so that the chosen voice commands can be filtered or checked according to the shooting conditions and the voice command that meets the needs of the user's actual operating conditions can be selected, which helps improve the accuracy of voice remote control.

2. Because a digital camera already has image capture and processing functions, implementing the method of the present invention requires no additional hardware component cost. In other words, applying the present invention to a digital camera only requires adding the voice feature library, the image feature library, the corresponding command library, and firmware to the camera's existing storage module in order to improve the accuracy of voice remote control.

The above detailed description is a specific explanation of preferred embodiments of the present invention. These embodiments are not intended to limit the scope of protection of the present invention; all equivalent implementations or modifications that do not depart from the technical spirit of the present invention shall fall within the scope of protection of this case.

Claims

1. A device for assisting voice remote control by using image features, in which a sound-receiving device and a voice recognition device are installed in an image capture device, characterized in that the image capture device comprises a lens module, an image sensing module, an image processor, and a processing unit, the image sensing module being used to convert the light taken in by the lens module into an image, and the image processor being used to provide a real-time image; the sound-receiving device is used to receive external voice commands; and the voice recognition device contains:

a command library, storing a plurality of commands for operating the image capture device;

a voice feature library, storing a plurality of voice features corresponding to the commands of the command library, used to pick out the commands that match the voice features to form a command set; and

an image feature library, storing a plurality of image features corresponding to the commands of the command library;

wherein the voice recognition device compares the real-time image with the image features in the image feature library to produce a comparison result, and the processing unit then filters a command out of the command set according to the comparison result and executes it.

2. The device for assisting voice remote control by using image features as claimed in claim 1, characterized in that the sound-receiving device is a microphone.

3. The device for assisting voice remote control by using image features as claimed in claim 1, characterized in that the image capture device is a digital camera, a digital video camera, or a camera phone.

4. The device for assisting voice remote control by using image features as claimed in claim 1, characterized in that the command library includes a command for moving the focus point when shooting.

5. The device for assisting voice remote control by using image features as claimed in claim 1, characterized in that the image feature library includes features for checking whether the brightness is sufficient for shooting, and the command library includes at least one command corresponding to the image.

6. The device for assisting voice remote control by using image features as claimed in claim 1, characterized in that the command library includes "up", which raises the ISO value when the light is too dark, and "down", which lowers the ISO value when the light is too bright.

7. The device for assisting voice remote control by using image features as claimed in claim 1, characterized in that the command library includes the voice remote control commands "on", for turning on the flash, and, after the flash is turned on, "up", for increasing the flash brightness, and "down", for reducing the flash brightness.

8. A method for assisting voice remote control by using image features, in which a voice recognition device is installed in an image capture device and a sound-receiving device receives voice commands issued by an operator for remote control, the voice recognition device containing a voice feature library, an image feature library, and a command library, characterized in that the method comprises the following steps:

(a) inputting a voice through the sound-receiving device and comparing the features of the voice against the voice feature library, thereby choosing from the command library all commands that can correspond to the voice features, and collecting those commands into a command set;

(b) capturing a real-time image with the image capture device and comparing the image features of the real-time image against the image feature library;

(c) using the comparison result of the image feature library to filter out, from the command set, the command that meets the needs of the user's actual operating conditions; and

(d) executing the command that meets the needs of the user's actual operating conditions.

9. A method for assisting voice remote control by using image features, in which a voice recognition device is installed in an image capture device and a sound-receiving device receives voice commands issued by an operator for remote control, the voice recognition device containing a voice feature library, an image feature library, and a command library, characterized in that the method comprises the following steps:

(c1) using the search result of the image feature library to filter out, from the command set, a plurality of commands that meet the needs of the user's actual operating conditions;

(c2) using a display to show the plurality of commands that meet the needs of the user's actual operating conditions, so that the operator can select one of them; and

(d1) executing the command selected by the operator.