CN110223690A - Human-machine interaction method and device based on the fusion of image and voice - Google Patents
- Publication number
- CN110223690A (application number CN201910497613.3A)
- Authority
- CN
- China
- Prior art keywords
- image data
- voice
- monitoring area
- interactive instruction
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
The invention discloses a human-machine interaction method and device based on the fusion of image and voice. The method comprises the following steps: acquiring video image data of a monitoring area; analyzing the video image data to obtain the azimuth information of a sound-emitting target within the monitoring area; invoking a microphone, according to the azimuth information, to acquire the sound signal from the corresponding direction; and recognizing the sound signal and converting it into corresponding text information. By effectively utilizing image recognition to assist in locating the interacting user, this scheme improves the accuracy and intelligence of voice interaction. Because the scheme fuses image processing with speech recognition, the interaction process and subsequent control instructions are started only when a user is detected, avoiding the drawback of purely speech-based interactive systems, which can be falsely triggered even when no one is present; the scheme is therefore safer.
Description
Technical field
The present invention relates to the field of intelligent interaction, and in particular to a human-machine interaction method and device based on the fusion of image and voice.
Background technique
Human-computer interaction (English: Human-Computer Interaction or Human-Machine Interaction, abbreviated HCI or HMI) is the discipline concerned with the interactive relationship between a system and its users.
Existing human-machine interaction technology that relies purely on a microphone array and speech recognition cannot effectively locate the target speaker in scenarios with loud noise or multiple simultaneous speakers, which degrades subsequent noise suppression and speech recognition accuracy and harms the interactive experience.
Existing image processing methods can accurately capture and track information about target persons within a certain area, but cannot interact with the user. Current voice interaction systems cannot guarantee privacy, since anyone can interact with them, and schemes that authenticate users by voiceprint have low accuracy.
Summary of the invention
To remedy the above defects of the prior art, the object of the present invention is to provide a human-machine interaction method and device based on the fusion of image and voice.
To achieve the above object, the technical scheme of the invention is as follows:
A human-machine interaction method based on the fusion of image and voice, comprising the following steps:
acquiring video image data of a monitoring area;
analyzing the video image data to obtain the azimuth information of a sound-emitting target within the monitoring area;
invoking a microphone, according to the azimuth information, to acquire the sound signal from the corresponding direction;
recognizing the sound signal and converting it into corresponding text information.
Further, the step of invoking a microphone to acquire the sound signal from the corresponding direction according to the azimuth information comprises:
performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain a processed sound signal.
Further, the step of performing noise suppression and reverberation suppression on the sound source according to the azimuth information comprises:
acquiring the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones;
applying a short-time Fourier transform to the time-domain signals to obtain the corresponding sub-band signals x_1(t, ω), ..., x_M(t, ω);
filtering noise with a superdirective beamformer, the sound signal output after noise and reverberation suppression being
z(t, ω) = Σ_{m=1}^{M} G_m*(ω) x_m(t, ω),
where z(t, ω) is the output signal after noise reduction, G_1(ω), ..., G_M(ω) are the superdirective noise-reduction filters, ω is the angular frequency of a sub-band, t is the time-frame index, and * denotes complex conjugation.
The superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω),
where the superscript H denotes conjugate transposition, Ψ(ω) is the isotropic (diffuse) noise-field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω d_{i,j}/c) / (ω d_{i,j}/c), d_{i,j} is the distance between the i-th and j-th microphones, c is the propagation speed of sound in air, and v(ω) is the steering vector of the target direction.
Further, the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target within the monitoring area comprises:
analyzing the video image data to obtain all facial images in the monitoring area;
matching the facial images against pre-stored facial image data;
taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information within the monitoring area.
Further, after the step of recognizing the sound signal and converting it into corresponding text information, the method comprises:
matching the text information obtained from the conversion against pre-stored interactive instructions;
obtaining the matched interactive instruction, and executing it.
Further, before the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target within the monitoring area, the method comprises:
recording the facial images of controlling users in advance, to construct the pre-stored facial image data;
presetting interactive instructions with different functions, and saving each interactive instruction in association with at least one piece of text information.
The invention also provides a human-machine interaction device based on the fusion of image and voice, comprising:
a video acquisition unit, for acquiring video image data of a monitoring area;
an azimuth analysis unit, for analyzing the video image data to obtain the azimuth information of a sound-emitting target within the monitoring area;
a voice pickup unit, for invoking a microphone to acquire the sound signal from the corresponding direction according to the azimuth information;
a text conversion unit, for recognizing the sound signal and converting it into corresponding text information.
Further, the voice pickup unit includes a sound source processing module, for performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain a processed sound signal.
Further, the azimuth analysis unit includes:
a face analysis module, for analyzing the video image data to obtain all facial images in the monitoring area;
a face matching module, for matching the facial images against the pre-stored facial image data;
an azimuth acquisition module, for taking the facial image that matches the pre-stored facial image data as the sound-emitting target and obtaining its azimuth information within the monitoring area.
Further, the device also includes:
a text matching unit, for matching the text information obtained from the conversion against pre-stored interactive instructions;
an instruction acquisition unit, for obtaining the matched interactive instruction and executing it;
a face pre-storage unit, for recording the facial images of controlling users in advance to construct the pre-stored facial image data;
an instruction pre-storage unit, for presetting interactive instructions with different functions and saving each interactive instruction in association with at least one piece of text information.
The beneficial effects of the present invention are as follows: by effectively utilizing image recognition to assist in locating the interacting user, this scheme improves the accuracy and intelligence of the user's voice interaction. Because the scheme fuses image processing with speech recognition, the interaction process and subsequent control instructions are started only when a user is detected, avoiding the drawback of purely speech-based interactive systems, which can be falsely triggered even when no one is present; the scheme is therefore safer.
Brief description of the drawings
Fig. 1 is a flowchart of a human-machine interaction method based on the fusion of image and voice according to the present invention;
Fig. 2 is a detailed flowchart of the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target within the monitoring area;
Fig. 3 is a flowchart of a human-machine interaction method based on the fusion of image and voice according to another embodiment of the present invention;
Fig. 4 is a flowchart of the steps preceding the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target within the monitoring area;
Fig. 5 is a structural block diagram of a human-machine interaction device based on the fusion of image and voice according to the present invention;
Fig. 6 is a structural block diagram of the azimuth analysis unit of the present invention;
Fig. 7 is a structural block diagram of the voice pickup unit of the present invention;
Fig. 8 is a structural block diagram of a human-machine interaction device based on the fusion of image and voice according to another embodiment of the present invention;
Fig. 9 is a system diagram of a human-machine interaction system based on the fusion of image and voice according to the present invention.
Detailed description of the embodiments
To illustrate the ideas and purposes of the invention, the present invention is further explained below in conjunction with the drawings and specific embodiments.
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, shall fall within the protection scope of the present invention.
It should be understood that the directional indications in the embodiments of the present invention (up, down, left, right, front, rear, etc.) are only used to explain the relative positional relationships and motion of the components under a particular pose (as shown in the drawings); if that pose changes, the directional indications change accordingly. A connection may be a direct connection or an indirect connection.
In addition, descriptions such as "first" and "second" in the present invention are for descriptive purposes only and shall not be understood as indicating or implying relative importance or the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. The technical solutions of the various embodiments may be combined with each other, provided the combination can be realized by those of ordinary skill in the art; when a combination is contradictory or cannot be realized, it shall be deemed not to exist and not to fall within the protection scope claimed by the present invention.
Unless otherwise stated, "/" herein means "or".
Referring to Figs. 1-4, a human-machine interaction method based on the fusion of image and voice is proposed, comprising the following steps:
S10, acquiring video image data of a monitoring area;
S20, analyzing the video image data to obtain the azimuth information of a sound-emitting target within the monitoring area;
S30, invoking a microphone, according to the azimuth information, to acquire the sound signal from the corresponding direction;
S40, recognizing the sound signal and converting it into corresponding text information.
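The four steps above can be sketched as a minimal pipeline. The camera, localizer, pickup, and recognizer below are hypothetical stand-ins injected as callables, since the patent does not fix any concrete API; the key behavior is that nothing happens when no authorized speaker is located.

```python
# Minimal sketch of the S10-S40 pipeline; all hardware/engine calls are
# hypothetical stand-ins passed in as plain functions.
from typing import Callable, Optional

def interact_once(
    capture_frame: Callable[[], object],                  # S10: monitoring-area video
    locate_speaker: Callable[[object], Optional[float]],  # S20: azimuth, or None
    pick_up_sound: Callable[[float], bytes],              # S30: audio from that azimuth
    speech_to_text: Callable[[bytes], str],               # S40: sound -> text
) -> Optional[str]:
    frame = capture_frame()                # S10
    azimuth = locate_speaker(frame)        # S20
    if azimuth is None:
        # No authorized speaker in view: do nothing. This is the
        # anti-false-trigger property the scheme claims over voice-only systems.
        return None
    audio = pick_up_sound(azimuth)         # S30
    return speech_to_text(audio)           # S40

# Toy stand-ins to exercise the flow
text = interact_once(
    capture_frame=lambda: "frame",
    locate_speaker=lambda f: 0.5,
    pick_up_sound=lambda az: b"...",
    speech_to_text=lambda a: "turn on the living-room light",
)
```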
Regarding steps S10 and S20: owing to the constraints of the usage scenario, human-machine interaction takes place in a specific space, such as the living room, bedroom, or kitchen of a house, a company meeting room, or a school classroom. A camera is installed in the space where interaction may occur (the monitoring area) to capture video image data of the space in real time, and the data is sent to a video image analysis engine, which may be integrated in a cloud server or in the local terminal. The video image data of the monitored space is acquired and analyzed, and can be used to determine the position of the target user who issues a control sound.
Specifically, the video image analysis engine includes a camera that captures video image information in real time and a processing engine for image tracking. The video image analysis engine tracks facial images in the monitoring area; for example, when lip movement of a user in the monitoring area is recognized, it is determined that the user is emitting a voice signal, and the facial image is identified. If the facial image is a pre-stored facial image possessing control authority, the user's azimuth information is located in real time.
In other embodiments of the present invention, to suit actual usage needs, when only one user is detected in the monitoring area, that user is assumed by default to have control authority, and it is unnecessary to match the facial image against the pre-stored facial image data: by analyzing the video image data, once lip movement of a face in the monitoring area is recognized, all sound signals in the monitoring area are accepted by default and subsequently processed.
With reference to Fig. 2, step S20 comprises:
S21, analyzing the video image data to obtain all facial images in the monitoring area;
S22, matching the facial images against the pre-stored facial image data;
S23, taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information within the monitoring area.
Regarding steps S21-S23: there is not necessarily only one user in the monitoring area; there may be multiple users, and multiple users may speak at the same time. In a multi-speaker scenario, when analysis of the video image data detects several people speaking simultaneously, all facial images in the monitoring area are obtained and compared against the pre-stored facial image data; the face that best matches a pre-stored facial image, i.e., the one with the highest similarity, is the target user's face, and its direction is the target direction. Note that during matching, a target face is determined to match a pre-stored facial image only after their similarity exceeds a specified threshold.
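The "best match above a threshold" rule of S21-S23 can be sketched as follows. The embeddings, the cosine-similarity metric, and the threshold value are all illustrative assumptions; the patent specifies only that similarity must exceed a threshold and the most similar face wins.

```python
# Sketch of the S21-S23 matching rule: among all faces detected in the
# monitoring area, pick the one most similar to an enrolled face, and
# accept it only above a threshold. Embeddings are hypothetical feature
# vectors; a real system would obtain them from a face-recognition model.
import math

THRESHOLD = 0.8  # assumed similarity threshold

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def match_target(detected, enrolled):
    """detected: {face_id: (embedding, azimuth)}; enrolled: {user: embedding}.
    Returns (face_id, azimuth) of the best match above THRESHOLD, else None."""
    best = None
    for face_id, (emb, azimuth) in detected.items():
        score = max(cosine(emb, e) for e in enrolled.values())
        if score >= THRESHOLD and (best is None or score > best[0]):
            best = (score, face_id, azimuth)
    return None if best is None else (best[1], best[2])

enrolled = {"owner": [1.0, 0.0, 0.2]}
detected = {
    "face_a": ([0.9, 0.1, 0.25], 30.0),   # close to the enrolled owner
    "face_b": ([0.0, 1.0, 0.0], -45.0),   # a stranger speaking at the same time
}
target = match_target(detected, enrolled)
```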
With reference to Fig. 4, before step S20 the method further comprises:
S11, recording the facial images of controlling users in advance, to construct the pre-stored facial image data;
S12, presetting interactive instructions with different functions, and saving each interactive instruction in association with at least one piece of text information.
Regarding step S11: before using the system, the user needs to register facial images in front of the camera to construct the pre-stored facial image data, so as to guarantee the accuracy of subsequent localization and user control.
Regarding step S12: besides the preset basic interactive instructions, the interactive instructions also include user-defined ones. They may cover conventional functions such as controlling various smart-home devices, for example the common "turn on the TV", "turn on the living-room light", "turn off the bedroom light", etc.; with the interactive instructions preset, the corresponding interactive operation can be carried out when the user utters the corresponding voice.
Saving an interactive instruction in association with at least one piece of text information means that one interactive instruction can be invoked by several different pieces of text. For example, the interactive instruction "turn on the living-room light" may be invoked by text such as "turn on the light in the living room", "switch on the living-room light", "living-room light on", or "light up the living room".
Regarding step S30: microphones are installed in the monitoring area to collect sound information from all directions within the region. When the azimuth information of the target user within the monitoring area is obtained, the microphones can be invoked to acquire the sound of the target user in that direction as the interactive instruction. During acquisition, the sound source needs certain processing, such as noise suppression and reverberation suppression, and the processed sound is used as the interactive instruction.
Specifically, step S30 comprises: S31, performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain a processed sound signal.
The concrete processing of step S31 is as follows:
acquire the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones;
apply a short-time Fourier transform to the time-domain signals to obtain the corresponding sub-band signals x_1(t, ω), ..., x_M(t, ω);
filter noise with a superdirective beamformer; the sound signal output after noise and reverberation suppression is
z(t, ω) = Σ_{m=1}^{M} G_m*(ω) x_m(t, ω),
where z(t, ω) is the output signal after noise reduction, G_1(ω), ..., G_M(ω) are the superdirective noise-reduction filters, ω is the angular frequency of a sub-band, t is the time-frame index, and * denotes complex conjugation.
Specifically, the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω),
where the superscript H denotes conjugate transposition, Ψ(ω) is the isotropic (diffuse) noise-field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω d_{i,j}/c) / (ω d_{i,j}/c), d_{i,j} is the distance between the i-th and j-th microphones, c is the propagation speed of sound in air, and v(ω) is the steering vector of the target direction.
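As a numerical illustration of the superdirective formula above, the following dependency-free sketch computes the filters for a hypothetical two-microphone array (M = 2, 5 cm spacing — both assumptions, not values from the patent) and verifies the distortionless property that the formula implies in the target direction.

```python
# Superdirective filter [G1, G2]^H = [v^H Psi^-1 v]^-1 v^H Psi^-1 for M = 2,
# with the 2x2 inverse written out explicitly to keep the example dependency-free.
import cmath
import math

C = 343.0   # speed of sound in air, m/s
D = 0.05    # assumed microphone spacing, m

def diffuse_coherence(omega):
    """Diffuse-field matrix Psi(omega): Psi[i][j] = sinc(omega * d_ij / c).
    Small diagonal loading keeps it invertible as omega -> 0."""
    x = omega * D / C
    s = math.sin(x) / x if x != 0.0 else 1.0
    eps = 1e-3
    return [[1.0 + eps, s], [s, 1.0 + eps]]

def steering_vector(omega, theta):
    """Free-field steering vector v(omega) for a plane wave from angle theta."""
    tau = D * math.cos(theta) / C      # inter-microphone delay
    return [1.0 + 0.0j, cmath.exp(-1j * omega * tau)]

def superdirective_filters(omega, theta):
    """Return the filters G_1, G_2 defined by the formula above."""
    p = diffuse_coherence(omega)
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    inv = [[p[1][1] / det, -p[0][1] / det],
           [-p[1][0] / det, p[0][0] / det]]                 # Psi^-1
    v = steering_vector(omega, theta)
    vh_inv = [sum(v[i].conjugate() * inv[i][j] for i in range(2))
              for j in range(2)]                            # v^H Psi^-1
    denom = sum(vh_inv[j] * v[j] for j in range(2))         # v^H Psi^-1 v
    gh = [a / denom for a in vh_inv]                        # entries of [G1, G2]^H
    return [a.conjugate() for a in gh]                      # the filters G_m

omega = 2.0 * math.pi * 1000.0         # a 1 kHz sub-band
g = superdirective_filters(omega, theta=0.0)
v = steering_vector(omega, 0.0)
# Distortionless property: sum_m G_m*(omega) v_m(omega) = 1 in the target direction
target_gain = sum(g[m].conjugate() * v[m] for m in range(2))
```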
Regarding step S40: after the target voice interaction instruction is captured, the sound signal is sent to a speech recognition engine and translated in real time into text information; the text information is then compared with the pre-stored interactive instructions to determine which specific interactive instruction to invoke for human-machine interaction. Specifically, speech recognition can be realized through the natural language processing interface of the Baidu AI open platform.
With reference to Fig. 3, after step S40 the method comprises:
S50, matching the text information obtained from the conversion against the pre-stored interactive instructions;
S60, obtaining the matched interactive instruction, and executing it.
Regarding steps S50-S60: an interactive instruction is associated with at least one piece of text information (voice information), so recognizing different pieces of text can invoke the same or different interactive instructions for execution. For example, when the user says "turn on the living-room light", the corresponding text information "turn on the living-room light" is recognized, the corresponding interactive instruction is "turn on the light in the living room", and executing that instruction turns on the living-room light, realizing human-machine interaction.
By effectively utilizing image recognition to assist in locating the interacting user, this scheme improves the accuracy and intelligence of voice interaction. Because the scheme fuses image processing with speech recognition, the interaction process and subsequent control instructions are started only when a user is detected, avoiding the drawback of purely speech-based interactive systems, which can be falsely triggered even when no one is present; the scheme is therefore safer.
With reference to Figs. 5-8, the invention also provides a human-machine interaction device based on the fusion of image and voice, comprising:
a video acquisition unit 10, for acquiring video image data of a monitoring area;
an azimuth analysis unit 20, for analyzing the video image data to obtain the azimuth information of a sound-emitting target within the monitoring area;
a voice pickup unit 30, for invoking a microphone to acquire the sound signal from the corresponding direction according to the azimuth information;
a text conversion unit 40, for recognizing the sound signal and converting it into corresponding text information.
Regarding the video acquisition unit 10 and the azimuth analysis unit 20: owing to the constraints of the usage scenario, human-machine interaction takes place in a specific space, such as the living room, bedroom, or kitchen of a house, a company meeting room, or a school classroom. A camera is installed in the space where interaction may occur (the monitoring area) to capture video image data of the space in real time, and the data is sent to a video image analysis engine, which may be integrated in a cloud server or in the local terminal. The video image data of the monitored space is acquired and analyzed, and can be used to determine the position of the target user who issues a control sound.
Specifically, the video image analysis engine includes a camera that captures video image information in real time and a processing engine for image tracking. The video image analysis engine tracks facial images in the monitoring area; for example, when lip movement of a user in the monitoring area is recognized, it is determined that the user is emitting a voice signal, and the facial image is identified. If the facial image is a pre-stored facial image possessing control authority, the user's azimuth information is located in real time.
In other embodiments of the present invention, to suit actual usage needs, when only one user is detected in the monitoring area, that user is assumed by default to have control authority, and it is unnecessary to match the facial image against the pre-stored facial image data: by analyzing the video image data, once lip movement of a face in the monitoring area is recognized, all sound signals in the monitoring area are accepted by default and subsequently processed.
With reference to Fig. 6, the azimuth analysis unit 20 includes:
a face analysis module 21, for analyzing the video image data to obtain all facial images in the monitoring area;
a face matching module 22, for matching the facial images against the pre-stored facial image data;
an azimuth acquisition module 23, for taking the facial image that matches the pre-stored facial image data as the sound-emitting target and obtaining its azimuth information within the monitoring area.
Regarding the face analysis module 21, the face matching module 22, and the azimuth acquisition module 23: there is not necessarily only one user in the monitoring area; there may be multiple users, and multiple users may speak at the same time. In a multi-speaker scenario, when analysis of the video image data detects several people speaking simultaneously, all facial images in the monitoring area are obtained and compared against the pre-stored facial image data; the face that best matches a pre-stored facial image, i.e., the one with the highest similarity, is the target user's face, and its direction is the target direction. Note that during matching, a target face is determined to match a pre-stored facial image only after their similarity exceeds a specified threshold.
Regarding the voice pickup unit 30: microphones are installed in the monitoring area to collect sound information from all directions within the region. When the azimuth information of the target user within the monitoring area is obtained, the microphones can be invoked to acquire the sound of the target user in that direction as the interactive instruction. During acquisition, the sound source needs certain processing, such as noise suppression and reverberation suppression, and the processed sound is used as the interactive instruction.
With reference to Fig. 7, the voice pickup unit 30 includes a sound source processing module 31, for performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain a processed sound signal.
The concrete operation of the sound source processing module 31 is as follows:
acquire the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones;
apply a short-time Fourier transform to the time-domain signals to obtain the corresponding sub-band signals x_1(t, ω), ..., x_M(t, ω);
filter noise with a superdirective beamformer; the sound signal output after noise and reverberation suppression is
z(t, ω) = Σ_{m=1}^{M} G_m*(ω) x_m(t, ω),
where z(t, ω) is the output signal after noise reduction, G_1(ω), ..., G_M(ω) are the superdirective noise-reduction filters, ω is the angular frequency of a sub-band, t is the time-frame index, and * denotes complex conjugation.
Specifically, the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω),
where the superscript H denotes conjugate transposition, Ψ(ω) is the isotropic (diffuse) noise-field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω d_{i,j}/c) / (ω d_{i,j}/c), d_{i,j} is the distance between the i-th and j-th microphones, c is the propagation speed of sound in air, and v(ω) is the steering vector of the target direction.
Regarding the text conversion unit 40: after the target voice interaction instruction is captured, the sound signal is sent to a speech recognition engine and translated in real time into text information; the text information is then compared with the pre-stored interactive instructions to determine which specific interactive instruction to invoke for human-machine interaction. Specifically, speech recognition can be realized through the natural language processing interface of the Baidu AI open platform.
With reference to Fig. 8, the human-machine interaction device based on the fusion of image and voice of the present invention further includes:
a text matching unit 50, for matching the text information obtained from the conversion against pre-stored interactive instructions;
an instruction acquisition unit 60, for obtaining the matched interactive instruction and executing it;
a face pre-storage unit 70, for recording the facial images of controlling users in advance to construct the pre-stored facial image data;
an instruction pre-storage unit 80, for presetting interactive instructions with different functions and saving each interactive instruction in association with at least one piece of text information.
Regarding the text matching unit 50 and the instruction acquisition unit 60: an interactive instruction is associated with at least one piece of text information (voice information), so recognizing different pieces of text can invoke the same or different interactive instructions for execution. For example, when the user says "turn on the living-room light", the corresponding text information "turn on the living-room light" is recognized, the corresponding interactive instruction is "turn on the light in the living room", and executing that instruction turns on the living-room light, realizing human-machine interaction.
For the face pre-storage unit 70 and the instruction pre-storage unit 80: before using this system, the user needs to register his or her facial image in front of the camera to construct the pre-stored facial image data, so as to guarantee the accuracy of subsequent positioning and user control.
Besides the preset basic interactive instructions, the interactive instructions may also include user-defined ones. They may cover conventional functions such as controlling various smart-home devices, for example the common "turn on the TV", "turn on the living room light", "turn off the bedroom light", and so on; when the user utters the corresponding voice, the corresponding interactive operation can then be realized through the preset interactive instruction.
Since an interactive instruction is associated with at least one piece of text information for storage, one interactive instruction can be invoked by several different pieces of text. For example, the interactive instruction "turn on the light in the living room" may be invoked by the texts "turn on the living room light", "open the living room lamp", "switch on the living room light", or "light up the living room", etc.
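As a sketch, the one-to-many association between an instruction and its text variants described above can be held in a simple lookup table. The instruction IDs and phrasings below are hypothetical examples, not the patent's actual data:

```python
# Hypothetical instruction table: each instruction ID is associated with
# several text variants, mirroring the one-to-many association above.
INSTRUCTION_TABLE = {
    "living_room_light_on": [
        "turn on the living room light",
        "open the living room lamp",
        "light up the living room",
    ],
    "tv_on": ["turn on the tv", "open the tv"],
}

def match_instruction(text, table=INSTRUCTION_TABLE):
    """Return the instruction ID whose stored text variants contain the
    recognized text (case-insensitive), or None when nothing matches."""
    normalized = text.strip().lower()
    for instruction_id, variants in table.items():
        if normalized in (v.lower() for v in variants):
            return instruction_id
    return None
```

A production system would likely use fuzzy or semantic matching rather than exact string comparison; exact lookup is used here only to keep the sketch self-contained.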
This scheme makes effective use of image recognition to assist in locating the interacting user, improving the accuracy and intelligence of voice interaction. By fusing image processing with speech recognition, the interaction process and subsequent control instructions are started only when a user is detected in the scene, which avoids the drawback of purely speech-based interactive systems that can be falsely triggered in unmanned scenes, and is therefore safer.
With reference to Fig. 9, the invention also provides a human-computer interaction system based on the fusion of image and voice, including a hardware terminal arranged in the monitoring area and a processing engine for data processing. The hardware terminal includes a camera device for acquiring video image data of the monitoring area, a microphone array for receiving the user's voice signal, and a loudspeaker for playing voice.
The processing engine includes a video image analysis engine for analyzing the video images, a speech recognition engine for recognizing the voice signal and converting it into text information, a text analysis engine for matching the text information to the corresponding control instruction, and a speech synthesis engine for synthesizing the control instruction into a voice announcement played through the loudspeaker.
Specifically, the processing engine may be integrated in the local terminal or in a cloud server.
In operation, the camera device acquires video image data of the monitoring area, which is analyzed and processed in the video image analysis engine to locate the target user; the voice signal from the target user is received through the microphone array and converted into text information by the speech recognition engine; the text analysis engine matches the text against the pre-stored interactive instructions to find the corresponding one; the speech synthesis engine synthesizes the corresponding voice, which is played through the loudspeaker, while the interactive instruction is sent through the communication module to the target household appliance for execution.
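One cycle of this workflow can be sketched as the following orchestration. Every callable is an injected stand-in (all names hypothetical) for the engines of Fig. 9:

```python
def interaction_step(capture_frame, locate_user, pick_up_audio,
                     transcribe, match_instruction, execute, announce):
    """Run one image+voice interaction cycle; returns the executed
    instruction, or None when nothing is triggered."""
    frame = capture_frame()                # camera device
    azimuth = locate_user(frame)           # video image analysis engine
    if azimuth is None:
        return None                        # unmanned scene: never triggered
    audio = pick_up_audio(azimuth)         # beamformed microphone array
    text = transcribe(audio)               # speech recognition engine
    instruction = match_instruction(text)  # text analysis engine
    if instruction is None:
        return None
    execute(instruction)                   # send to the target appliance
    announce(instruction)                  # speech synthesis -> loudspeaker
    return instruction
```

The early return when no user is located is the point of the fusion: the speech path never runs in an unmanned scene, which is how false triggering is avoided.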
The above is only a preferred embodiment of the present invention and does not limit the patent scope of the invention. Any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.
Claims (10)
1. A human-computer interaction method based on the fusion of image and voice, characterized by comprising the following steps:
acquiring video image data of the monitoring area;
analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area;
according to the azimuth information, calling the microphones to acquire the acoustic signal from the corresponding azimuth; and
recognizing the acoustic signal and converting it into corresponding text information.
2. The human-computer interaction method based on the fusion of image and voice according to claim 1, characterized in that the step of calling the microphones to acquire the acoustic signal from the corresponding azimuth according to the azimuth information includes:
according to the azimuth information, performing noise suppression and reverberation suppression on the sound source to obtain the processed acoustic signal.
3. The human-computer interaction method based on the fusion of image and voice according to claim 2, characterized in that the step of performing noise suppression and reverberation suppression on the sound source according to the azimuth information includes:
acquiring the time-domain signals x1(t), ..., xM(t) collected by the M microphones;
performing a short-time Fourier transform on the time-domain signals to obtain the corresponding frequency-band signals x1(t, ω), ..., xM(t, ω);
performing noise suppression based on super-directive filtering, the acoustic signal output after noise and reverberation suppression being z(t, ω) = Σ_(m=1)^M G_m(ω) x_m(t, ω);
wherein z(t, ω) represents the noise-reduced output signal, G1(ω), ..., GM(ω) represent the super-directive noise-reduction filters used, ω is the angular frequency of a frequency band, and t represents the time frame;
the super-directive filters are: [G1(ω), ..., GM(ω)]^H = [v^H(ω)Ψ^(-1)(ω)v(ω)]^(-1) v^H(ω)Ψ^(-1)(ω);
wherein the superscript H represents the conjugate-transpose operation, Ψ(ω) represents the isotropic noise-field correlation matrix, whose elements are Ψ_(i,j)(ω) = sin(ω·d_(i,j)/c) / (ω·d_(i,j)/c); d_(i,j) represents the distance between the i-th and j-th microphones, c represents the propagation speed of sound in air, and v(ω) represents the steering vector of the target direction.
4. The human-computer interaction method based on the fusion of image and voice according to claim 1, characterized in that the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area includes:
analyzing the video image data to obtain all facial images in the monitoring area;
matching all facial images against the pre-stored facial image data; and
taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
5. The human-computer interaction method based on the fusion of image and voice according to claim 4, characterized in that, after the step of recognizing the acoustic signal and converting it into corresponding text information, the method includes:
matching the text information obtained by conversion against the pre-stored interactive instructions; and
acquiring the matched interactive instruction and executing it.
6. The human-computer interaction method based on the fusion of image and voice according to claim 5, characterized in that, before the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area, the method includes:
entering the facial image of the controlling user in advance to construct the pre-stored facial image data; and
presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
7. A human-computer interaction device based on the fusion of image and voice, characterized by including:
a video acquisition unit, for acquiring video image data of the monitoring area;
an azimuth analysis unit, for analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area;
a voice pickup unit, for calling the microphones to acquire the acoustic signal from the corresponding azimuth according to the azimuth information; and
a text conversion unit, for recognizing the acoustic signal and converting it into corresponding text information.
8. The human-computer interaction device based on the fusion of image and voice according to claim 7, characterized in that the voice pickup unit includes a sound source processing module, for performing noise suppression and reverberation suppression on the sound source according to the azimuth information to obtain the processed acoustic signal.
9. The human-computer interaction device based on the fusion of image and voice according to claim 7, characterized in that the azimuth analysis unit includes:
a face analysis module, for analyzing the video image data to obtain all facial images in the monitoring area;
a face matching module, for matching all facial images against the pre-stored facial image data; and
an azimuth acquisition module, for taking the facial image that matches the pre-stored facial image data as the sound-emitting target and obtaining its azimuth information in the monitoring area.
10. The human-computer interaction device based on the fusion of image and voice according to claim 9, characterized by further including:
a text matching unit, for matching the text information obtained by conversion against the pre-stored interactive instructions;
an instruction acquisition unit, for acquiring the matched interactive instruction and executing it;
a face pre-storage unit, for entering the facial image of the controlling user in advance and constructing the pre-stored facial image data; and
an instruction pre-storage unit, for presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910497613.3A CN110223690A (en) | 2019-06-10 | 2019-06-10 | The man-machine interaction method and device merged based on image with voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110223690A true CN110223690A (en) | 2019-09-10 |
Family
ID=67816067
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910497613.3A Pending CN110223690A (en) | 2019-06-10 | 2019-06-10 | The man-machine interaction method and device merged based on image with voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110223690A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070127736A1 (en) * | 2003-06-30 | 2007-06-07 | Markus Christoph | Handsfree system for use in a vehicle |
WO2008041878A2 (en) * | 2006-10-04 | 2008-04-10 | Micronas Nit | System and procedure of hands free speech communication using a microphone array |
JP2013172411A (en) * | 2012-02-22 | 2013-09-02 | Nec Corp | Voice recognition system, voice recognition method, and voice recognition program |
CN104049721A (en) * | 2013-03-11 | 2014-09-17 | 联想(北京)有限公司 | Information processing method and electronic equipment |
CN106056710A (en) * | 2016-06-02 | 2016-10-26 | 北京云知声信息技术有限公司 | Method and device for controlling intelligent electronic locks |
CN106440192A (en) * | 2016-09-19 | 2017-02-22 | 珠海格力电器股份有限公司 | Household appliance control method, device and system and intelligent air conditioner |
Non-Patent Citations (2)
Title |
---|
SEON MAN KIM ET AL.: "《Multi-channel audio recording based on superdirective beamforming for portable multimedia recording devices》", 《IEEE TRANSACTIONS ON CONSUMER ELECTRONICS》 * |
W.TAGER ET AL.: "《NEAR FIELD SUPERDIRECTIVITY (NFSD)》", 《ICASSP98》 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112578338B (en) * | 2019-09-27 | 2024-05-14 | 阿里巴巴集团控股有限公司 | Sound source positioning method, device, equipment and storage medium |
CN112578338A (en) * | 2019-09-27 | 2021-03-30 | 阿里巴巴集团控股有限公司 | Sound source positioning method, device, equipment and storage medium |
CN111554269A (en) * | 2019-10-12 | 2020-08-18 | 南京奥拓软件技术有限公司 | Voice number taking method, system and storage medium |
CN110992971A (en) * | 2019-12-24 | 2020-04-10 | 达闼科技成都有限公司 | Method for determining voice enhancement direction, electronic equipment and storage medium |
CN113362849A (en) * | 2020-03-02 | 2021-09-07 | 阿里巴巴集团控股有限公司 | Voice data processing method and device |
CN111354353A (en) * | 2020-03-09 | 2020-06-30 | 联想(北京)有限公司 | Voice data processing method and device |
CN111354353B (en) * | 2020-03-09 | 2023-09-19 | 联想(北京)有限公司 | Voice data processing method and device |
CN111405122A (en) * | 2020-03-18 | 2020-07-10 | 苏州科达科技股份有限公司 | Audio call testing method, device and storage medium |
CN112655000A (en) * | 2020-04-30 | 2021-04-13 | 华为技术有限公司 | In-vehicle user positioning method, vehicle-mounted interaction method, vehicle-mounted device and vehicle |
CN111767785A (en) * | 2020-05-11 | 2020-10-13 | 南京奥拓电子科技有限公司 | Man-machine interaction control method and device, intelligent robot and storage medium |
CN111681673A (en) * | 2020-05-27 | 2020-09-18 | 北京华夏电通科技有限公司 | Method and system for identifying knocking hammer in court trial process |
CN111681673B (en) * | 2020-05-27 | 2023-06-20 | 北京华夏电通科技股份有限公司 | Method and system for identifying judicial mallet knocked in court trial process |
CN112397065A (en) * | 2020-11-04 | 2021-02-23 | 深圳地平线机器人科技有限公司 | Voice interaction method and device, computer readable storage medium and electronic equipment |
CN112634911A (en) * | 2020-12-21 | 2021-04-09 | 苏州思必驰信息科技有限公司 | Man-machine conversation method, electronic device and computer readable storage medium |
CN113014983A (en) * | 2021-03-08 | 2021-06-22 | Oppo广东移动通信有限公司 | Video playing method and device, storage medium and electronic equipment |
CN113327286A (en) * | 2021-05-10 | 2021-08-31 | 中国地质大学(武汉) | 360-degree omnibearing speaker visual space positioning method |
CN113327286B (en) * | 2021-05-10 | 2023-05-19 | 中国地质大学(武汉) | 360-degree omnibearing speaker vision space positioning method |
CN113407758A (en) * | 2021-07-13 | 2021-09-17 | 中国第一汽车股份有限公司 | Data processing method and device, electronic equipment and storage medium |
CN114530151A (en) * | 2022-02-10 | 2022-05-24 | 山东企联信息技术股份有限公司 | Artificial intelligence AI voice control system and experience device thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110223690A (en) | The man-machine interaction method and device merged based on image with voice | |
CN106653008B (en) | Voice control method, device and system | |
CN104049721B (en) | Information processing method and electronic equipment | |
CN111833899B (en) | Voice detection method based on polyphonic regions, related device and storage medium | |
CN106440192A (en) | Household appliance control method, device and system and intelligent air conditioner | |
CN108470568B (en) | Intelligent device control method and device, storage medium and electronic device | |
CN103124165A (en) | Automatic gain control | |
CN104269172A (en) | Voice control method and system based on video positioning | |
CN107124647A (en) | A kind of panoramic video automatically generates the method and device of subtitle file when recording | |
CN110956965A (en) | Personalized intelligent home safety control system and method based on voiceprint recognition | |
CN110767225A (en) | Voice interaction method, device and system | |
CN113676668A (en) | Video shooting method and device, electronic equipment and readable storage medium | |
CN110970020A (en) | Method for extracting effective voice signal by using voiceprint | |
CN112700773A (en) | Method and control system for controlling exhibition hall based on voice | |
CN115482830B (en) | Voice enhancement method and related equipment | |
CN117941343A (en) | Multi-source audio processing system and method | |
CN109343481B (en) | Method and device for controlling device | |
CN110992971A (en) | Method for determining voice enhancement direction, electronic equipment and storage medium | |
CN107247923A (en) | A kind of instruction identification method, device, storage device, mobile terminal and electrical equipment | |
JP7400364B2 (en) | Speech recognition system and information processing method | |
CN104202694A (en) | Method and system of orientation of voice pick-up device | |
CN112712818A (en) | Voice enhancement method, device and equipment | |
CN116386623A (en) | Voice interaction method of intelligent equipment, storage medium and electronic device | |
CN113763942A (en) | Interaction method and interaction system of voice household appliances and computer equipment | |
CN103856740B (en) | Information processing method and video conference system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190910 |