Background technology
In the past, audio-visual devices such as digital cameras, stereo systems, and video recorders could be operated not only with the buttons on the device itself but also with a remote control. The operator only needs to press buttons on the remote control and need not touch the device at all. The precondition, however, is that the operator has the remote control in hand; if it is lost or out of reach, the convenience of remote operation is unavailable.
Newer voice remote control technology lets the operator control a device without holding any remote control. An audio receiving device (such as a microphone) receives the operator's voice, the voice features are analyzed, a corresponding operating instruction is looked up in an instruction database, and that instruction is executed. Speech recognition technology has been developed for many years, and new improvements and related patents continue to appear both domestically and abroad.
Take the published US patent application US2005/0071169A1 as an example. Its inventors considered that different operators speak at different speeds, and their countermeasure is to automatically insert a delay between the time reception ends and the time execution begins, in order to confirm that the voice instruction has been given completely. This publication introduces the notion of a time axis, but its processing still revolves solely around audio information.
Take the published US patent application US2005/0105575A1 as another example. The problem it addresses is that the same voice instruction may cause several devices in the same room to react at once, leading to unexpected errors and confusion. The countermeasure proposed is to equip every remotely controlled device in the room with a camera and a microphone, where the camera serves only to detect whether the operator is addressing the instruction to that device, thereby avoiding such confusion. Because the camera there is used only to decide whether to accept a voice instruction, and not to improve the accuracy of speech recognition, that invention differs from the present invention.
In addition, the compact video microscope disclosed in US Patent No. 6,452,625 B1 also includes a microphone and an image capture device, but the image capture device mainly serves for video recording, and the microphone merely provides simple sound recording or voice control. The patent does not discuss how image information could assist voice control, nor how the video microscope would be operated by voice.
US Patent No. 6,289,140 B1 also discloses a voice control technique applicable to an image capture device, providing a method for recognizing voice instructions and the hardware architecture required to carry it out. The later US Patent No. 6,762,692 B1 further proposes displaying a voice instruction tree on the screen to help the user read out predetermined words to operate the device. Neither of these two patents, however, contemplates using image information to assist the recognition of voice control instructions.
The above patent documents, and voice recognition systems in general, simply collect the voice, analyze its features, and look up the corresponding instruction in an instruction database according to those features. The conditions for speech recognition, however, vary with the operator's accent, speaking speed, and the ambient background at the moment; the matching conditions and influencing factors differ from person to person and from place to place, and are rather complex. How to improve the recognition rate of voice instructions is therefore a major challenge in current research and development, and a focus of competition among companies.
Embodiment
To make the above objects, features, and advantages of the present invention more readily understood, preferred embodiments are described in detail below with reference to the accompanying drawings:
Referring to Fig. 1, an embodiment of an image capture device 10 according to the present invention comprises a lens module 11, an image sensing module 12, an image processor 13, a display screen 14, a data storage module 15, a memory 16, a processor unit 17, a transmission interface 18, a button 19, a microphone 20, and a voice recognition device 21. The microphone 20 is used for sound input. The lens module 11 takes in an optical image, which the image sensing module 12 converts into an array image; the image is processed by the image processor 13, shown on the display screen 14, and stored in the data storage module 15 and the memory 16. During operation, instructions are input to the processor unit 17 through the button 19 and the voice recognition device 21 to perform operations such as shooting, video recording, browsing, adding or deleting files, or transmission. The transmission interface 18 may connect to a computer, a mobile phone, or another audio-visual device through a general radio-frequency transmitter module, Bluetooth, a USB port, a 1394 port, an optical fiber communication port, or the like. Apart from the voice recognition device 21, all components of the image capture device 10 are applications of known components, which are not described further here.
The voice recognition device 21 comprises a voice feature library 21A, an image feature library 21B, and an instruction database 21C. The instructions in the instruction database 21C correspond to the voice feature library 21A and the image feature library 21B respectively. The voice features input through the microphone 20 are compared against the voice feature library 21A in order to find, in the instruction database 21C, the instructions to which those voice features can correspond. Because people speak at different speeds and pitches, and volume and accent inevitably vary as well, accepting only a single best match is very likely to produce misjudgments. The matching condition can therefore be relaxed so that all instructions close to the voice features are selected together to form an instruction set.
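The relaxed matching described above can be illustrated with a minimal Python sketch. It is not the implementation of the voice recognition device 21; the three-element feature vectors, the Euclidean distance measure, the `tolerance` value, and all instruction names are assumptions made purely for illustration.

```python
import math

# Hypothetical voice feature library (21A): spoken word -> reference feature vector.
VOICE_FEATURES = {
    "up": (0.9, 0.1, 0.3),
    "down": (0.1, 0.9, 0.4),
    "on": (0.5, 0.5, 0.9),
}

# Hypothetical instruction database (21C): spoken word -> every instruction it may mean.
INSTRUCTIONS = {
    "up": ["browse_up", "focus_up", "iso_up"],
    "down": ["browse_down", "focus_down", "iso_down"],
    "on": ["flash_on"],
}


def candidate_instructions(features, tolerance=0.5):
    """Return the instructions of every word whose reference vector lies
    within `tolerance` of the input features (relaxed matching)."""
    candidates = []
    for word, reference in VOICE_FEATURES.items():
        if math.dist(features, reference) <= tolerance:
            candidates.extend(INSTRUCTIONS[word])
    return candidates
```

Raising `tolerance` admits more near matches into the instruction set, which is then left to the image-based screening to narrow down; lowering it approaches the single-best-match behavior that the text warns against.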
After the lens module 11 takes in an optical image and the image sensing module 12 and the image processor 13 produce a real-time image reflecting the on-site environment, the voice recognition device 21 captures this real-time image and compares it against the image feature library 21B. The comparison result is used to check or screen the instruction set selected from the instruction database 21C and to determine the instruction that fits the actual situation at the time of operation, which is then executed by the processor unit 17.
The image feature library 21B stores a plurality of image features corresponding to the instructions in the instruction database 21C. Image features may include, without limitation, measured brightness levels and the shapes of object contours. For example, the voice recognition device 21 can use the brightness of the on-site environment reflected in the image to judge whether the operator's voice instruction for changing the ISO value means increasing or decreasing it, or can use contour analysis to determine the position of a human figure and adjust the position in the frame on which focus should be locked. The correspondence between the image feature library 21B and the instruction database 21C may be pre-recorded during assembly to define the instruction sets corresponding to different image features. The content of the image feature library 21B and its correspondence to the instruction database 21C may also be changed, edited, extended, or deleted by the operator after purchase according to professional needs or specific application purposes.
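As a rough illustration of a brightness-type image feature, the following Python sketch classifies a grayscale frame by its mean pixel level. The frame representation (rows of 0 to 255 values), the function name, and both thresholds are hypothetical; the text does not specify how brightness is measured.

```python
def brightness_feature(frame, dark_below=60, bright_above=190):
    """Classify a grayscale frame (rows of 0-255 values) by mean pixel level.

    The thresholds are illustrative assumptions, not values from the text.
    """
    pixels = [value for row in frame for value in row]
    mean = sum(pixels) / len(pixels)
    if mean < dark_below:
        return "dark"
    if mean > bright_above:
        return "bright"
    return "normal"
```

A label such as "dark" could then be looked up in the image feature library to favor raising the ISO value over lowering it.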
The instruction content of the instruction database 21C may be pre-recorded during assembly. For example, it may contain a first instruction set of browsing instructions, namely voice remote control instructions such as "save", "delete", "zoom in", "zoom out", "left", "right", "up", "down", "send", and "send all".
The instruction database 21C of this embodiment may also contain a second instruction set needed for moving the focus during shooting, for example voice remote control instructions such as "on face", which locks focus on the face of a human figure in the frame; "left", which moves the focus to the left; "up", which moves the focus up; and "down", which moves the focus down.
In addition, the instruction database 21C of this embodiment may also contain a third instruction set needed when the lighting is unsuitable, for example voice remote control instructions such as "up", which raises the ISO value when the light is too dim; "down", which lowers the ISO value when the light is too bright; "on", which turns on the flash; and, after the flash has been turned on, "up", which increases the flash brightness, and "down", which reduces it.
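One possible way to organize the three instruction sets is sketched below in Python. The spoken words come from the text; the set names and the instruction identifiers on the right-hand side are hypothetical, and the flash-brightness meanings of "up" and "down" (which apply only after the flash is on) are left out for brevity.

```python
# Spoken word -> instruction identifier, grouped by instruction set.
# Set names and identifiers are assumptions made for this sketch.
INSTRUCTION_SETS = {
    "browse": {
        "save": "save_image", "delete": "delete_image",
        "zoom in": "zoom_in", "zoom out": "zoom_out",
        "left": "pan_left", "right": "pan_right",
        "up": "pan_up", "down": "pan_down",
        "send": "send_image", "send all": "send_all_images",
    },
    "focus": {
        "on face": "focus_on_face", "left": "focus_left",
        "up": "focus_up", "down": "focus_down",
    },
    "exposure": {
        "up": "iso_up", "down": "iso_down", "on": "flash_on",
    },
}


def meanings(word):
    """Return every instruction a spoken word can map to, keyed by set name."""
    return {
        set_name: commands[word]
        for set_name, commands in INSTRUCTION_SETS.items()
        if word in commands
    }
```

Calling `meanings("up")` returns one entry per instruction set, which makes the ambiguity of a short spoken word explicit, whereas `meanings("save")` returns a single entry.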
Referring to Fig. 1 and Fig. 2, an embodiment of the method of using image features to assist voice remote control according to the present invention comprises the following steps:
(a) inputting a voice through an audio receiving device, comparing the features of the voice against a voice feature library, and selecting, from an instruction database corresponding to the voice feature library, all instructions that can correspond to the voice features, the selected instructions forming an instruction set;
(b) capturing a real-time image with an image capture element and comparing the real-time image against an image feature library;
(c) using the comparison result from the image feature library to screen out, from the instruction set, the instruction that fits the actual situation at the time of operation; and
(d) executing the instruction that fits the actual situation at the time of operation.
In this way, using image features to assist voice remote control increases the accuracy of speech recognition and effectively reduces operating errors.
The audio receiving device in step (a) can be realized by the microphone 20 in Fig. 1. After the voice is input, the matching voice features are found in the voice feature library 21A, and all instructions that can correspond to those features are found in the instruction database 21C. For example, when the voice is "up", the same voice instruction could mean "browse toward the top of the picture" in the first instruction set, "move the focus up" in the second instruction set, or "raise the ISO value" in the third instruction set. This step gathers the relevant instructions into an instruction set.
In steps (b) and (c), the image capture element can be realized by the image processor 13 in Fig. 1. The image processor 13 produces a real-time image, and the voice recognition device 21 compares this real-time image with the image features in the image feature library 21B to screen the instruction set and select from it the instruction that fits the actual situation at the time of operation. For example, when the comparison with the image feature library 21B indicates that no new image is currently being input, the user is inferred to be browsing, so the voice instruction "up" should mean "browse toward the top of the picture" in the first instruction set. When the comparison indicates that a new image is being input but the light is insufficient, the instruction is inferred to mean "raise the ISO value" in the third instruction set. When the comparison indicates that a new image is being input and the light is normal, the instruction is inferred to mean "move the focus up" in the second instruction set.
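The three cases in this example amount to a small decision rule, sketched here in Python. The two boolean inputs stand in for the result of comparing the real-time image with the image feature library 21B; the function names, context labels, and instruction identifiers are hypothetical.

```python
def classify_scene(has_live_image, light_sufficient):
    """Map the image comparison result to the instruction set it implies."""
    if not has_live_image:
        return "browse"    # no new image input: the user is browsing
    if not light_sufficient:
        return "exposure"  # live image but too dark: adjust the ISO value
    return "focus"         # live image with normal light: move the focus


def resolve(candidates, has_live_image, light_sufficient):
    """Pick, from the candidate instruction set, the instruction that fits
    the scene. Returns None if no candidate belongs to that context."""
    return candidates.get(classify_scene(has_live_image, light_sufficient))


# Candidate instruction set for the spoken word "up" (identifiers assumed).
CANDIDATES_FOR_UP = {
    "browse": "browse_up",
    "focus": "focus_up",
    "exposure": "iso_up",
}
```

For instance, `resolve(CANDIDATES_FOR_UP, has_live_image=True, light_sufficient=False)` selects `"iso_up"`, matching the second case in the paragraph.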
Those skilled in the art can readily derive other feasible variations from the above embodiment and adapt them to the preferences or needs of different consumer groups. For example, too many voice instructions are an unacceptable drawback for forgetful users, so the designer must reduce the number of voice instructions as far as possible. Reducing the number of voice instructions, however, inevitably leads to situations in which a judgment criterion cannot be defined in advance. An alternative embodiment, shown in Fig. 3, can therefore be adopted, with the following steps:
(a) inputting a voice through an audio receiving device, comparing the features of the voice against a voice feature library, and selecting, from an instruction database corresponding to the voice feature library, all instructions that can correspond to the voice features, the selected instructions forming an instruction set;
(b) capturing a real-time image with an image capture element and comparing the features of the real-time image against an image feature library;
(c1) using the comparison result from the image feature library to screen out, from the instruction set, a plurality of instructions that fit the actual situation at the time of operation;
(c2) showing the plurality of instructions on a display so that the operator can select one of them; and
(d1) executing the instruction selected by the operator.
Although the embodiment of Fig. 3 still requires the user to state by voice, as a final step, which instruction is to be selected, step (c1) has already used image features to screen out the plurality of instructions that fit the actual situation at the time of operation, and these screened instructions are shown on the screen (for example, on the display screen 14 in Fig. 1). For a forgetful operator, being able to read the applicable instructions from the screen is a considerable relief.
This voice remote control method assisted by image information analysis helps increase the accuracy of voice remote control and effectively reduces operating errors.
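Steps (c1), (c2), and (d1) of the Fig. 3 embodiment can be sketched in Python as follows. This is only an illustration under stated assumptions: candidates are represented as (context, instruction) pairs, the menu is rendered as numbered text lines, and the operator's final spoken choice is assumed to have been recognized as a menu number. None of these details come from the text.

```python
def screen(candidates, plausible_contexts):
    """Step (c1): keep every candidate whose context fits the image result."""
    return [
        instruction
        for context, instruction in candidates
        if context in plausible_contexts
    ]


def render_menu(instructions):
    """Step (c2): build the numbered lines to show on the display."""
    return [f"{number}. {name}" for number, name in enumerate(instructions, start=1)]


def select(instructions, choice_number):
    """Step (d1): return the chosen instruction, or None if out of range."""
    if 1 <= choice_number <= len(instructions):
        return instructions[choice_number - 1]
    return None
```

With candidates for "up" drawn from all three instruction sets and an image result that rules out browsing, the shortlist shrinks to two entries, so the operator chooses between two displayed options instead of recalling a longer, more specific command.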
Compared with known methods, the present invention has the following advantages:
1. In the method of the present invention, image features obtained in real time are compared against the image feature library, so that the candidate voice instructions can be screened or checked according to the shooting situation and the voice instruction that fits the actual situation at the time of operation can be selected, which helps improve the accuracy of voice remote control.
2. Because a digital camera already has image capture and processing functions, implementing the method of the present invention requires no additional hardware component cost. In other words, applying the present invention to a digital camera only requires adding the voice feature library, the image feature library, the corresponding instruction database, and firmware to the camera's existing storage module in order to improve the accuracy of voice remote control.
The above detailed description is a specific explanation of preferred embodiments of the present invention. These embodiments are not intended to limit the scope of protection of the present invention; all equivalent implementations or modifications that do not depart from the technical spirit of the present invention shall fall within the scope of protection of this case.