CN110223690A - Human-computer interaction method and device based on the fusion of image and voice - Google Patents

Human-computer interaction method and device based on the fusion of image and voice

Info

Publication number
CN110223690A
Authority
CN
China
Prior art keywords
image data
voice
monitoring area
interactive instruction
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910497613.3A
Other languages
Chinese (zh)
Inventor
蔡育俊
姚家妙
李果
肖君诺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yongshun Shenzhen Wisdom Mdt Infotech Ltd
Original Assignee
Yongshun Shenzhen Wisdom Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yongshun Shenzhen Wisdom Mdt Infotech Ltd filed Critical Yongshun Shenzhen Wisdom Mdt Infotech Ltd
Priority to CN201910497613.3A priority Critical patent/CN110223690A/en
Publication of CN110223690A publication Critical patent/CN110223690A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a human-computer interaction method and device based on the fusion of image and voice. The method comprises the following steps: obtaining video image data of a monitoring area; analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area; calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth; and recognizing the acoustic signal and converting it into corresponding text information. This scheme makes effective use of image recognition to assist in locating the interacting user, improving the accuracy and intelligence of voice interaction. Because image processing is fused with speech recognition, the interactive process and subsequent control instructions are started only in scenes where a user is actually observed, avoiding the drawback of purely speech-recognition-based interactive systems, which can be falsely triggered even in unoccupied scenes; it is therefore safer.

Description

Human-computer interaction method and device based on the fusion of image and voice
Technical field
The present invention relates to the field of intelligent interaction, and in particular to a human-computer interaction method and device based on the fusion of image and voice.
Background technique
Human-computer interaction (Human-Computer Interaction or Human-Machine Interaction, abbreviated HCI or HMI) is the discipline concerned with the interactive relationship between a system and its users.
Existing human-computer interaction technology that relies purely on a microphone array and speech recognition cannot effectively locate the target speaker in scenes with heavy noise or several people speaking, which in turn degrades subsequent noise suppression and speech recognition accuracy and harms the interactive experience.
Existing image processing methods can accurately capture and track information about target persons within a certain area, but they cannot interact with the user. Current voice interaction systems cannot guarantee privacy: anyone can interact with them, and schemes that authenticate by voiceprint have relatively low accuracy.
Summary of the invention
In order to overcome the above defects of the prior art, the object of the present invention is to provide a human-computer interaction method and device based on the fusion of image and voice.
In order to achieve the above object, the technical scheme of the present invention is as follows.
A human-computer interaction method based on the fusion of image and voice, comprising the following steps:
obtaining video image data of a monitoring area;
analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
recognizing the acoustic signal and converting it into corresponding text information.
Further, the step of calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth comprises:
performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
Further, the step of performing noise suppression and reverberation suppression on the sound source according to the azimuth information comprises:
obtaining the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones;
applying a short-time Fourier transform to the time-domain signals to obtain the corresponding frequency-band signals x_1(t, ω), ..., x_M(t, ω);
filtering noise with a superdirective filter, the acoustic signal output after noise and reverberation suppression being z(t, ω) = Σ_{m=1}^{M} G_m*(ω) · x_m(t, ω), where the asterisk denotes complex conjugation;
wherein z(t, ω) denotes the output signal after noise reduction, G_1(ω), ..., G_M(ω) denote the superdirective noise-reduction filter used, ω is the angular frequency of a frequency band, and t denotes the time frame;
the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω);
wherein the superscript H denotes the conjugate transpose, Ψ(ω) denotes the isotropic noise field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω·d_{i,j}/c) / (ω·d_{i,j}/c), d_{i,j} denotes the distance between the i-th and j-th microphones, c denotes the propagation speed of sound in air, and v(ω) denotes the steering vector of the target direction.
Further, the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area comprises:
analyzing the video image data to obtain all facial images in the monitoring area;
matching all facial images against the pre-stored facial image data;
taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
Further, after the step of recognizing the acoustic signal and converting it into corresponding text information, the method comprises:
matching the text information obtained from the conversion against the pre-stored interactive instructions;
obtaining the matched interactive instruction, and executing that interactive instruction.
Further, before the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area, the method comprises:
recording in advance the facial images of controlling users, to construct the pre-stored facial image data;
presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
The invention also provides a human-computer interaction device based on the fusion of image and voice, comprising:
a video acquisition unit, for obtaining video image data of a monitoring area;
an azimuth analysis unit, for analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
a voice pickup unit, for calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
a text conversion unit, for recognizing the acoustic signal and converting it into corresponding text information.
Further, the voice pickup unit comprises a sound source processing module, for performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
Further, the azimuth analysis unit comprises:
a face analysis module, for analyzing the video image data to obtain all facial images in the monitoring area;
a face matching module, for matching all facial images against the pre-stored facial image data;
an azimuth obtaining module, for taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
Further, the device also comprises:
a text matching unit, for matching the text information obtained from the conversion against the pre-stored interactive instructions;
an instruction obtaining unit, for obtaining the matched interactive instruction and executing that interactive instruction;
a face pre-storage unit, for recording in advance the facial images of controlling users to construct the pre-stored facial image data;
an instruction pre-storage unit, for presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
The beneficial effects of the present invention are as follows. This scheme makes effective use of image recognition to assist in locating the interacting user, improving the accuracy and intelligence of the user's voice interaction. Because image processing is fused with speech recognition, the interactive process and subsequent control instructions are started only in scenes where a user is actually observed, avoiding the drawback of purely speech-recognition-based interactive systems, which can be falsely triggered even in unoccupied scenes; it is therefore safer.
Detailed description of the invention
Fig. 1 is a flow chart of a human-computer interaction method based on the fusion of image and voice according to the present invention;
Fig. 2 is a detailed flow chart of the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area according to the present invention;
Fig. 3 is a flow chart of a human-computer interaction method based on the fusion of image and voice according to another embodiment of the present invention;
Fig. 4 is a flow chart of the steps performed before the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area according to the present invention;
Fig. 5 is a structural block diagram of a human-computer interaction device based on the fusion of image and voice according to the present invention;
Fig. 6 is a structural block diagram of the azimuth analysis unit of the present invention;
Fig. 7 is a structural block diagram of the voice pickup unit of the present invention;
Fig. 8 is a structural block diagram of a human-computer interaction device based on the fusion of image and voice according to another embodiment of the present invention;
Fig. 9 is a schematic diagram of a human-computer interaction system based on the fusion of image and voice according to the present invention.
Specific embodiment
To illustrate the ideas and purpose of the present invention, the invention is further described below in conjunction with the drawings and specific embodiments.
The technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
It should be understood that the directional indications used in the embodiments of the present invention (up, down, left, right, front, rear, etc.) only explain the relative positional relationships and motion of components under a particular posture (as shown in the drawings); if that posture changes, the directional indications change accordingly. A connection may be a direct connection or an indirect connection.
In addition, descriptions such as "first" and "second" in the present invention are for descriptive purposes only and should not be understood as indicating or implying relative importance or the number of the indicated technical features; a feature defined by "first" or "second" may thus explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may be combined with one another, provided the combination can be realized by those of ordinary skill in the art; when a combination is contradictory or cannot be realized, it shall be deemed not to exist and to fall outside the protection scope claimed by the present invention.
Unless otherwise indicated, "/" herein means "or".
Referring to Figs. 1-4, a human-computer interaction method based on the fusion of image and voice is proposed, comprising the following steps:
S10: obtaining video image data of a monitoring area;
S20: analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
S30: calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
S40: recognizing the acoustic signal and converting it into corresponding text information.
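Purely as an editorial illustration (the patent itself contains no source code), the four steps can be read as the following loop. This is a minimal sketch under assumed interfaces: camera, mic_array, recognizer, locate_speaker and handle_text are all hypothetical objects or callables standing in for the engines described below.

    # Hypothetical skeleton of steps S10-S40; every dependency is passed in,
    # since the patent does not specify concrete interfaces.
    def interaction_loop(camera, mic_array, recognizer, locate_speaker, handle_text):
        while True:
            frame = camera.read()                 # S10: video image data of the monitoring area
            azimuth = locate_speaker(frame)       # S20: azimuth of the sound-emitting target
            if azimuth is None:
                continue                          # no authorized speaker: nothing is triggered
            signal = mic_array.capture(azimuth)   # S30: acoustic signal from that azimuth
            text = recognizer.transcribe(signal)  # S40: convert the signal to text
            handle_text(text)                     # matched against stored instructions (S50-S60)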
For steps S10 and S20: because of the constraints of the usage scenario, human-computer interaction takes place in a specific space, for example the living room, bedroom or kitchen of a house, a company meeting room, a school classroom, and so on. A camera is set up in the space where human-computer interaction can take place (the monitoring area) to obtain video image data of that space in real time, and the data is sent to a video image analysis engine, which can be integrated either in a cloud server or in a local terminal. The video image data of the monitored space is obtained and analyzed, and can be used to determine the position of the target user who issues the control sound.
Specifically, the video image analysis engine comprises a camera for capturing video image information in real time and a processing engine for image tracking. The video image analysis engine tracks facial images in the monitoring area; for example, when it recognizes that a user's lips in the monitoring area are moving, it determines that this user is emitting a voice signal and identifies the facial image. If the facial image is a pre-stored facial image possessing control authority, the user's azimuth information is located in real time.
In other embodiments of the present invention, to meet practical needs, when only one user is detected in the monitoring area, that user is deemed by default to have control authority, and there is no need to match the facial image against the pre-stored facial image data: by analyzing the video image data, once lip motion of a face in the monitoring area is recognized, all sound signals in the monitoring area are accepted by default and subsequent processing is carried out.
With reference to Fig. 2, step S20 comprises:
S21: analyzing the video image data to obtain all facial images in the monitoring area;
S22: matching all facial images against the pre-stored facial image data;
S23: taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
For steps S21-S23: there is not necessarily only one user in the monitoring area; there may be several users, and several of them may be present at the same time. In a scene where several people speak, if the analysis of the video image data finds several people speaking simultaneously, all facial images in the monitoring area are obtained and compared against the pre-stored facial image data; the face that best matches a pre-stored facial image, i.e. the one with the highest similarity, is the target user's face, and its azimuth is the target azimuth. It should be noted that during matching, only after the similarity between the target face and one of the pre-stored facial images exceeds a specified threshold can the target face be determined to match that pre-stored facial image.
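As an illustrative sketch of steps S21-S23 (not code from the patent), the logic of matching all detected faces against pre-stored data and keeping the most similar one above a threshold can be written with the open-source face_recognition library; the 0.6 distance threshold is that library's customary default, standing in here for the unspecified "specified threshold".

    # Sketch of S21-S23 with the face_recognition library (an assumption,
    # not the patent's implementation): detect all faces, compare each with
    # the pre-stored encodings, keep the closest match under the threshold.
    import face_recognition
    import numpy as np

    DISTANCE_THRESHOLD = 0.6  # the library's customary default

    def find_target_face(frame, stored_encodings):
        """frame: RGB image of the monitoring area;
        stored_encodings: encodings of pre-registered controlling users.
        Returns the bounding box of the best-matching face, or None."""
        boxes = face_recognition.face_locations(frame)             # S21: all faces in the area
        encodings = face_recognition.face_encodings(frame, boxes)
        best_box, best_dist = None, DISTANCE_THRESHOLD
        for box, enc in zip(boxes, encodings):
            # distance to the closest pre-stored face (S22)
            dist = np.min(face_recognition.face_distance(stored_encodings, enc))
            if dist < best_dist:                                   # S23: most similar face wins
                best_box, best_dist = box, dist
        return best_box  # the box's position in the frame yields the azimuth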
With reference to Fig. 4, before step S20 the method further comprises:
S11: recording in advance the facial images of controlling users, to construct the pre-stored facial image data;
S12: presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
For step S11: before using this system, the user needs to register facial images in front of the camera to construct the pre-stored facial image data, so as to guarantee the accuracy of subsequent positioning and user control.
For step S12: besides the preset basic interactive instructions, interactive instructions also include user-defined ones. They may cover conventional functions such as controlling various smart-home devices, for example the common "turn on the TV", "turn on the living room light", "turn off the bedroom light", and so on; when the user utters the corresponding voice, the corresponding interactive operation can be realized through the preset interactive instruction.
Associating an interactive instruction with at least one piece of text information for storage means that one interactive instruction can be invoked by several different pieces of text information. For example, the interactive instruction "turn on the living room light" may be invoked by text such as "turn on the light in the living room", "open the living room lamp", "switch on the living room light", "put the living room light on" or "light up the living room".
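For illustration only, the association "one interactive instruction, several pieces of text information" can be stored as a simple phrase-to-instruction table; the phrases and instruction names below are hypothetical examples mirroring the ones in the text.

    # Hypothetical phrase table: several different pieces of text information
    # all invoke the same interactive instruction.
    COMMAND_TABLE = {
        "turn on the light in the living room": "LIVING_ROOM_LIGHT_ON",
        "open the living room lamp":            "LIVING_ROOM_LIGHT_ON",
        "switch on the living room light":      "LIVING_ROOM_LIGHT_ON",
        "light up the living room":             "LIVING_ROOM_LIGHT_ON",
        "turn on the tv":                       "TV_ON",
        "turn off the bedroom light":           "BEDROOM_LIGHT_OFF",
    }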
For step S30: microphones are set in the monitoring area to collect sound information from all directions within the area. Once the azimuth information of the target user in the monitoring area is obtained, the microphones can be called to capture the sound of the target user at that azimuth as the interactive instruction. During acquisition, the sound source requires certain processing, such as noise suppression and reverberation suppression, and the processed sound is taken as the interactive instruction.
Specifically, step S30 comprises: S31, performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
The concrete processing procedure of step S31 is as follows.
Obtain the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones.
Apply a short-time Fourier transform to the time-domain signals to obtain the corresponding frequency-band signals x_1(t, ω), ..., x_M(t, ω).
Filter noise with a superdirective filter; the acoustic signal output after noise and reverberation suppression is z(t, ω) = Σ_{m=1}^{M} G_m*(ω) · x_m(t, ω), where the asterisk denotes complex conjugation.
Here z(t, ω) denotes the output signal after noise reduction, G_1(ω), ..., G_M(ω) denote the superdirective noise-reduction filter used, ω is the angular frequency of a frequency band, and t denotes the time frame.
Specifically, the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω).
Here the superscript H denotes the conjugate transpose, Ψ(ω) denotes the isotropic noise field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω·d_{i,j}/c) / (ω·d_{i,j}/c), d_{i,j} denotes the distance between the i-th and j-th microphones, c denotes the propagation speed of sound in air, and v(ω) denotes the steering vector of the target direction.
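To make the formulas above concrete, here is a short NumPy/SciPy sketch of the superdirective beamformer. It is an illustration under stated assumptions (a linear microphone array, far-field steering, diagonal loading of Ψ(ω) for numerical stability), not the patent's reference implementation; the sampling rate, window length and array geometry are example values.

    # Illustrative superdirective beamformer (assumptions noted above).
    import numpy as np
    from scipy.signal import stft, istft

    C = 343.0      # propagation speed of sound in air, m/s
    FS = 16000     # sampling rate, Hz (example value)
    NPERSEG = 512  # STFT window length (example value)

    def superdirective_beamform(x, mic_positions, theta):
        """x: (M, T) time-domain microphone signals x_1(t), ..., x_M(t);
        mic_positions: (M,) microphone coordinates along a linear array, in metres;
        theta: target azimuth in radians, measured from the array axis.
        Returns the enhanced single-channel time-domain signal."""
        M = x.shape[0]
        f, _, X = stft(x, fs=FS, nperseg=NPERSEG)    # x_m(t, w), shape (M, F, T)
        d = np.abs(mic_positions[:, None] - mic_positions[None, :])  # distances d_ij
        tau = mic_positions * np.cos(theta) / C      # far-field delays toward the target
        Z = np.zeros(X.shape[1:], dtype=complex)
        for k, w in enumerate(2 * np.pi * f):        # w: angular frequency of the band
            # Isotropic noise field correlation matrix: Psi_ij = sinc(w d_ij / c),
            # with diagonal loading so that the inverse is well conditioned.
            Psi = np.sinc(w * d / (np.pi * C)) + 1e-3 * np.eye(M)
            v = np.exp(-1j * w * tau)                # steering vector of the target direction
            Psi_inv_v = np.linalg.solve(Psi, v)
            g = Psi_inv_v / (v.conj() @ Psi_inv_v)   # [G_1..G_M]^H = (v^H Psi^-1 v)^-1 v^H Psi^-1
            Z[k, :] = g.conj() @ X[:, k, :]          # z(t, w) = sum_m G_m^*(w) x_m(t, w)
        _, z = istft(Z, fs=FS, nperseg=NPERSEG)
        return z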
For step S40: after the target voice interactive instruction is captured, the acoustic signal is sent to a speech recognition engine, translated in real time and converted into text information; the text information is compared with the pre-stored interactive instructions to determine which specific interactive instruction to invoke for human-computer interaction. Specifically, speech recognition can be realized through the natural language processing interface of the Baidu AI open platform.
With reference to Fig. 3, after step S40 the method comprises:
S50: matching the text information obtained from the conversion against the pre-stored interactive instructions;
S60: obtaining the matched interactive instruction, and executing that interactive instruction.
For steps S50-S60: an interactive instruction is associated with at least one piece of text information (voice information), so recognizing different pieces of text information can invoke the same or different interactive instructions for execution. For example, when the user says "turn on the living room light", the recognized text information is "turn on the living room light", and the corresponding interactive instruction is: turn on the light of the living room; that interactive instruction is executed, the living room light is turned on, and human-computer interaction is realized.
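A hedged sketch of steps S50-S60 (again not from the patent): the recognized text is matched against the stored phrases, here with Python's standard difflib as a stand-in similarity measure, and the associated instruction is executed. COMMAND_TABLE is the hypothetical phrase table sketched earlier; executor is an assumed callable that forwards the instruction to the appliance.

    # Sketch of S50-S60: fuzzy-match the recognized text against stored
    # phrases and execute the associated interactive instruction.
    import difflib

    def match_and_execute(text, command_table, executor):
        matches = difflib.get_close_matches(text.lower(), command_table.keys(),
                                            n=1, cutoff=0.8)
        if not matches:
            return None                # no interactive instruction matched
        instruction = command_table[matches[0]]
        executor(instruction)          # e.g. forward LIVING_ROOM_LIGHT_ON to the appliance
        return instruction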
This scheme makes effective use of image recognition to assist in locating the interacting user, improving the accuracy and intelligence of voice interaction. Because image processing is fused with speech recognition, the interactive process and subsequent control instructions are started only in scenes where a user is actually observed, avoiding the drawback of purely speech-recognition-based interactive systems, which can be falsely triggered even in unoccupied scenes; it is therefore safer.
With reference to Figs. 5-8, the invention also provides a human-computer interaction device based on the fusion of image and voice, comprising:
a video acquisition unit 10, for obtaining video image data of the monitoring area;
an azimuth analysis unit 20, for analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area;
a voice pickup unit 30, for calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
a text conversion unit 40, for recognizing the acoustic signal and converting it into corresponding text information.
For the video acquisition unit 10 and the azimuth analysis unit 20: because of the constraints of the usage scenario, human-computer interaction takes place in a specific space, for example the living room, bedroom or kitchen of a house, a company meeting room, a school classroom, and so on. A camera is set up in the space where human-computer interaction can take place (the monitoring area) to obtain video image data of that space in real time, and the data is sent to a video image analysis engine, which can be integrated either in a cloud server or in a local terminal. The video image data of the monitored space is obtained and analyzed, and can be used to determine the position of the target user who issues the control sound.
Specifically, the video image analysis engine comprises a camera for capturing video image information in real time and a processing engine for image tracking. The video image analysis engine tracks facial images in the monitoring area; for example, when it recognizes that a user's lips in the monitoring area are moving, it determines that this user is emitting a voice signal and identifies the facial image. If the facial image is a pre-stored facial image possessing control authority, the user's azimuth information is located in real time.
In other embodiments of the present invention, to meet practical needs, when only one user is detected in the monitoring area, that user is deemed by default to have control authority, and there is no need to match the facial image against the pre-stored facial image data: by analyzing the video image data, once lip motion of a face in the monitoring area is recognized, all sound signals in the monitoring area are accepted by default and subsequent processing is carried out.
With reference to Fig. 6, the azimuth analysis unit 20 comprises:
a face analysis module 21, for analyzing the video image data to obtain all facial images in the monitoring area;
a face matching module 22, for matching all facial images against the pre-stored facial image data;
an azimuth obtaining module 23, for taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
For the face analysis module 21, the face matching module 22 and the azimuth obtaining module 23: there is not necessarily only one user in the monitoring area; there may be several users, and several of them may be present at the same time. In a scene where several people speak, if the analysis of the video image data finds several people speaking simultaneously, all facial images in the monitoring area are obtained and compared against the pre-stored facial image data; the face that best matches a pre-stored facial image, i.e. the one with the highest similarity, is the target user's face, and its azimuth is the target azimuth. It should be noted that during matching, only after the similarity between the target face and one of the pre-stored facial images exceeds a specified threshold can the target face be determined to match that pre-stored facial image.
For the voice pickup unit 30: microphones are set in the monitoring area to collect sound information from all directions within the area. Once the azimuth information of the target user in the monitoring area is obtained, the microphones can be called to capture the sound of the target user at that azimuth as the interactive instruction. During acquisition, the sound source requires certain processing, such as noise suppression and reverberation suppression, and the processed sound is taken as the interactive instruction.
With reference to Fig. 7, the voice pickup unit 30 comprises a sound source processing module 31, for performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
The concrete operating procedure of the sound source processing module 31 is as follows.
Obtain the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones.
Apply a short-time Fourier transform to the time-domain signals to obtain the corresponding frequency-band signals x_1(t, ω), ..., x_M(t, ω).
Filter noise with a superdirective filter; the acoustic signal output after noise and reverberation suppression is z(t, ω) = Σ_{m=1}^{M} G_m*(ω) · x_m(t, ω), where the asterisk denotes complex conjugation.
Here z(t, ω) denotes the output signal after noise reduction, G_1(ω), ..., G_M(ω) denote the superdirective noise-reduction filter used, ω is the angular frequency of a frequency band, and t denotes the time frame.
Specifically, the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω).
Here the superscript H denotes the conjugate transpose, Ψ(ω) denotes the isotropic noise field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω·d_{i,j}/c) / (ω·d_{i,j}/c), d_{i,j} denotes the distance between the i-th and j-th microphones, c denotes the propagation speed of sound in air, and v(ω) denotes the steering vector of the target direction.
For the text conversion unit 40: after the target voice interactive instruction is captured, the acoustic signal is sent to a speech recognition engine, translated in real time and converted into text information; the text information is compared with the pre-stored interactive instructions to determine which specific interactive instruction to invoke for human-computer interaction. Specifically, speech recognition can be realized through the natural language processing interface of the Baidu AI open platform.
With reference to Fig. 8, the human-computer interaction device based on the fusion of image and voice of the present invention further comprises:
a text matching unit 50, for matching the text information obtained from the conversion against the pre-stored interactive instructions;
an instruction obtaining unit 60, for obtaining the matched interactive instruction and executing that interactive instruction;
a face pre-storage unit 70, for recording in advance the facial images of controlling users to construct the pre-stored facial image data;
an instruction pre-storage unit 80, for presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
For the text matching unit 50 and the instruction obtaining unit 60: an interactive instruction is associated with at least one piece of text information (voice information), so recognizing different pieces of text information can invoke the same or different interactive instructions for execution. For example, when the user says "turn on the living room light", the recognized text information is "turn on the living room light", and the corresponding interactive instruction is: turn on the light of the living room; that interactive instruction is executed, the living room light is turned on, and human-computer interaction is realized.
For the face pre-storage unit 70 and the instruction pre-storage unit 80: before using this system, the user needs to register facial images in front of the camera to construct the pre-stored facial image data, so as to guarantee the accuracy of subsequent positioning and user control.
Besides the preset basic interactive instructions, interactive instructions also include user-defined ones. They may cover conventional functions such as controlling various smart-home devices, for example the common "turn on the TV", "turn on the living room light", "turn off the bedroom light", and so on; when the user utters the corresponding voice, the corresponding interactive operation can be realized through the preset interactive instruction.
Associating an interactive instruction with at least one piece of text information for storage means that one interactive instruction can be invoked by several different pieces of text information. For example, the interactive instruction "turn on the living room light" may be invoked by text such as "turn on the light in the living room", "open the living room lamp", "switch on the living room light", "put the living room light on" or "light up the living room".
This scheme makes effective use of image recognition to assist in locating the interacting user, improving the accuracy and intelligence of voice interaction. Because image processing is fused with speech recognition, the interactive process and subsequent control instructions are started only in scenes where a user is actually observed, avoiding the drawback of purely speech-recognition-based interactive systems, which can be falsely triggered even in unoccupied scenes; it is therefore safer.
With reference to Fig. 9, the invention also provides a human-computer interaction system based on the fusion of image and voice, comprising hardware terminals set in the monitoring area and processing engines for data processing. The hardware terminals include a camera device for obtaining video image data of the monitoring area, a microphone array for receiving the user's voice signal, and a loudspeaker for playing voice.
The processing engines include a video image analysis engine for analyzing the video images, a speech recognition engine for recognizing the voice signal and converting it into text information, a text analysis engine for matching the text information to the corresponding control instruction, and a speech synthesis engine for synthesizing the control instruction into a voice announcement played through the loudspeaker.
Specifically, the processing engines can be integrated either in a cloud server or in a local terminal.
In operation, the camera device obtains the video image data of the monitoring area, which the video image analysis engine analyzes to locate the target user; the voice signal from the target user is received through the microphone array and converted into text information by the speech recognition engine; the text analysis engine matches it against the pre-stored interactive instructions to find the corresponding interactive instruction; the speech synthesis engine synthesizes the corresponding voice and plays it through the loudspeaker, while the interactive instruction is sent through a communication module to the target home appliance for execution.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of the invention. Any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, shall likewise fall within the protection scope of the present invention.

Claims (10)

1. A human-computer interaction method based on the fusion of image and voice, characterized by comprising the following steps:
obtaining video image data of a monitoring area;
analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
recognizing the acoustic signal and converting it into corresponding text information.
2. The human-computer interaction method based on the fusion of image and voice according to claim 1, characterized in that the step of calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth comprises:
performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
3. The human-computer interaction method based on the fusion of image and voice according to claim 2, characterized in that the step of performing noise suppression and reverberation suppression on the sound source according to the azimuth information comprises:
obtaining the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones;
applying a short-time Fourier transform to the time-domain signals to obtain the corresponding frequency-band signals x_1(t, ω), ..., x_M(t, ω);
filtering noise with a superdirective filter, the acoustic signal output after noise and reverberation suppression being z(t, ω) = Σ_{m=1}^{M} G_m*(ω) · x_m(t, ω), where the asterisk denotes complex conjugation;
wherein z(t, ω) denotes the output signal after noise reduction, G_1(ω), ..., G_M(ω) denote the superdirective noise-reduction filter used, ω is the angular frequency of a frequency band, and t denotes the time frame;
the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω);
wherein the superscript H denotes the conjugate transpose, Ψ(ω) denotes the isotropic noise field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω·d_{i,j}/c) / (ω·d_{i,j}/c), d_{i,j} denotes the distance between the i-th and j-th microphones, c denotes the propagation speed of sound in air, and v(ω) denotes the steering vector of the target direction.
4. The human-computer interaction method based on the fusion of image and voice according to claim 1, characterized in that the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area comprises:
analyzing the video image data to obtain all facial images in the monitoring area;
matching all facial images against the pre-stored facial image data;
taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
5. The human-computer interaction method based on the fusion of image and voice according to claim 4, characterized in that after the step of recognizing the acoustic signal and converting it into corresponding text information, the method comprises:
matching the text information obtained from the conversion against the pre-stored interactive instructions;
obtaining the matched interactive instruction, and executing that interactive instruction.
6. The human-computer interaction method based on the fusion of image and voice according to claim 5, characterized in that before the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area, the method comprises:
recording in advance the facial images of controlling users, to construct the pre-stored facial image data;
presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
7. A human-computer interaction device based on the fusion of image and voice, characterized by comprising:
a video acquisition unit, for obtaining video image data of a monitoring area;
an azimuth analysis unit, for analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
a voice pickup unit, for calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
a text conversion unit, for recognizing the acoustic signal and converting it into corresponding text information.
8. The human-computer interaction device based on the fusion of image and voice according to claim 7, characterized in that the voice pickup unit comprises a sound source processing module, for performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
9. The human-computer interaction device based on the fusion of image and voice according to claim 7, characterized in that the azimuth analysis unit comprises:
a face analysis module, for analyzing the video image data to obtain all facial images in the monitoring area;
a face matching module, for matching all facial images against the pre-stored facial image data;
an azimuth obtaining module, for taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
10. The human-computer interaction device based on the fusion of image and voice according to claim 9, characterized by further comprising:
a text matching unit, for matching the text information obtained from the conversion against the pre-stored interactive instructions;
an instruction obtaining unit, for obtaining the matched interactive instruction and executing that interactive instruction;
a face pre-storage unit, for recording in advance the facial images of controlling users to construct the pre-stored facial image data;
an instruction pre-storage unit, for presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
CN201910497613.3A 2019-06-10 2019-06-10 Human-computer interaction method and device based on the fusion of image and voice Pending CN110223690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910497613.3A CN110223690A (en) 2019-06-10 2019-06-10 Human-computer interaction method and device based on the fusion of image and voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910497613.3A CN110223690A (en) 2019-06-10 2019-06-10 Human-computer interaction method and device based on the fusion of image and voice

Publications (1)

Publication Number Publication Date
CN110223690A true CN110223690A (en) 2019-09-10

Family

ID=67816067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910497613.3A Pending CN110223690A (en) Human-computer interaction method and device based on the fusion of image and voice

Country Status (1)

Country Link
CN (1) CN110223690A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992971A (en) * 2019-12-24 2020-04-10 达闼科技成都有限公司 Method for determining voice enhancement direction, electronic equipment and storage medium
CN111354353A (en) * 2020-03-09 2020-06-30 联想(北京)有限公司 Voice data processing method and device
CN111405122A (en) * 2020-03-18 2020-07-10 苏州科达科技股份有限公司 Audio call testing method, device and storage medium
CN111554269A (en) * 2019-10-12 2020-08-18 南京奥拓软件技术有限公司 Voice number taking method, system and storage medium
CN111681673A (en) * 2020-05-27 2020-09-18 北京华夏电通科技有限公司 Method and system for identifying knocking hammer in court trial process
CN111767785A (en) * 2020-05-11 2020-10-13 南京奥拓电子科技有限公司 Man-machine interaction control method and device, intelligent robot and storage medium
CN112397065A (en) * 2020-11-04 2021-02-23 深圳地平线机器人科技有限公司 Voice interaction method and device, computer readable storage medium and electronic equipment
CN112578338A (en) * 2019-09-27 2021-03-30 阿里巴巴集团控股有限公司 Sound source positioning method, device, equipment and storage medium
CN112634911A (en) * 2020-12-21 2021-04-09 苏州思必驰信息科技有限公司 Man-machine conversation method, electronic device and computer readable storage medium
CN112655000A (en) * 2020-04-30 2021-04-13 华为技术有限公司 In-vehicle user positioning method, vehicle-mounted interaction method, vehicle-mounted device and vehicle
CN113014983A (en) * 2021-03-08 2021-06-22 Oppo广东移动通信有限公司 Video playing method and device, storage medium and electronic equipment
CN113327286A (en) * 2021-05-10 2021-08-31 中国地质大学(武汉) 360-degree omnibearing speaker visual space positioning method
CN113362849A (en) * 2020-03-02 2021-09-07 阿里巴巴集团控股有限公司 Voice data processing method and device
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114530151A (en) * 2022-02-10 2022-05-24 山东企联信息技术股份有限公司 Artificial intelligence AI voice control system and experience device thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070127736A1 (en) * 2003-06-30 2007-06-07 Markus Christoph Handsfree system for use in a vehicle
WO2008041878A2 (en) * 2006-10-04 2008-04-10 Micronas Nit System and procedure of hands free speech communication using a microphone array
JP2013172411A (en) * 2012-02-22 2013-09-02 Nec Corp Voice recognition system, voice recognition method, and voice recognition program
CN104049721A (en) * 2013-03-11 2014-09-17 联想(北京)有限公司 Information processing method and electronic equipment
CN106056710A (en) * 2016-06-02 2016-10-26 北京云知声信息技术有限公司 Method and device for controlling intelligent electronic locks
CN106440192A (en) * 2016-09-19 2017-02-22 珠海格力电器股份有限公司 Household appliance control method, device and system and intelligent air conditioner

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070127736A1 (en) * 2003-06-30 2007-06-07 Markus Christoph Handsfree system for use in a vehicle
WO2008041878A2 (en) * 2006-10-04 2008-04-10 Micronas Nit System and procedure of hands free speech communication using a microphone array
JP2013172411A (en) * 2012-02-22 2013-09-02 Nec Corp Voice recognition system, voice recognition method, and voice recognition program
CN104049721A (en) * 2013-03-11 2014-09-17 联想(北京)有限公司 Information processing method and electronic equipment
CN106056710A (en) * 2016-06-02 2016-10-26 北京云知声信息技术有限公司 Method and device for controlling intelligent electronic locks
CN106440192A (en) * 2016-09-19 2017-02-22 珠海格力电器股份有限公司 Household appliance control method, device and system and intelligent air conditioner

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Seon Man Kim et al.: "Multi-channel audio recording based on superdirective beamforming for portable multimedia recording devices", IEEE Transactions on Consumer Electronics *
W. Tager et al.: "Near field superdirectivity (NFSD)", ICASSP 1998 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112578338B (en) * 2019-09-27 2024-05-14 阿里巴巴集团控股有限公司 Sound source positioning method, device, equipment and storage medium
CN112578338A (en) * 2019-09-27 2021-03-30 阿里巴巴集团控股有限公司 Sound source positioning method, device, equipment and storage medium
CN111554269A (en) * 2019-10-12 2020-08-18 南京奥拓软件技术有限公司 Voice number taking method, system and storage medium
CN110992971A (en) * 2019-12-24 2020-04-10 达闼科技成都有限公司 Method for determining voice enhancement direction, electronic equipment and storage medium
CN113362849A (en) * 2020-03-02 2021-09-07 阿里巴巴集团控股有限公司 Voice data processing method and device
CN111354353A (en) * 2020-03-09 2020-06-30 联想(北京)有限公司 Voice data processing method and device
CN111354353B (en) * 2020-03-09 2023-09-19 联想(北京)有限公司 Voice data processing method and device
CN111405122A (en) * 2020-03-18 2020-07-10 苏州科达科技股份有限公司 Audio call testing method, device and storage medium
CN112655000A (en) * 2020-04-30 2021-04-13 华为技术有限公司 In-vehicle user positioning method, vehicle-mounted interaction method, vehicle-mounted device and vehicle
CN111767785A (en) * 2020-05-11 2020-10-13 南京奥拓电子科技有限公司 Man-machine interaction control method and device, intelligent robot and storage medium
CN111681673A (en) * 2020-05-27 2020-09-18 北京华夏电通科技有限公司 Method and system for identifying knocking hammer in court trial process
CN111681673B (en) * 2020-05-27 2023-06-20 北京华夏电通科技股份有限公司 Method and system for identifying judicial mallet knocked in court trial process
CN112397065A (en) * 2020-11-04 2021-02-23 深圳地平线机器人科技有限公司 Voice interaction method and device, computer readable storage medium and electronic equipment
CN112634911A (en) * 2020-12-21 2021-04-09 苏州思必驰信息科技有限公司 Man-machine conversation method, electronic device and computer readable storage medium
CN113014983A (en) * 2021-03-08 2021-06-22 Oppo广东移动通信有限公司 Video playing method and device, storage medium and electronic equipment
CN113327286A (en) * 2021-05-10 2021-08-31 中国地质大学(武汉) 360-degree omnibearing speaker visual space positioning method
CN113327286B (en) * 2021-05-10 2023-05-19 中国地质大学(武汉) 360-degree omnibearing speaker vision space positioning method
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114530151A (en) * 2022-02-10 2022-05-24 山东企联信息技术股份有限公司 Artificial intelligence AI voice control system and experience device thereof

Similar Documents

Publication Publication Date Title
CN110223690A (en) Human-computer interaction method and device based on the fusion of image and voice
CN106653008B (en) Voice control method, device and system
CN104049721B (en) Information processing method and electronic equipment
CN111833899B (en) Voice detection method based on polyphonic regions, related device and storage medium
CN106440192A (en) Household appliance control method, device and system and intelligent air conditioner
CN108470568B (en) Intelligent device control method and device, storage medium and electronic device
CN103124165A (en) Automatic gain control
CN104269172A (en) Voice control method and system based on video positioning
CN107124647A (en) A kind of panoramic video automatically generates the method and device of subtitle file when recording
CN110956965A (en) Personalized intelligent home safety control system and method based on voiceprint recognition
CN110767225A (en) Voice interaction method, device and system
CN113676668A (en) Video shooting method and device, electronic equipment and readable storage medium
CN110970020A (en) Method for extracting effective voice signal by using voiceprint
CN112700773A (en) Method and control system for controlling exhibition hall based on voice
CN115482830B (en) Voice enhancement method and related equipment
CN117941343A (en) Multi-source audio processing system and method
CN109343481B (en) Method and device for controlling device
CN110992971A (en) Method for determining voice enhancement direction, electronic equipment and storage medium
CN107247923A (en) A kind of instruction identification method, device, storage device, mobile terminal and electrical equipment
JP7400364B2 (en) Speech recognition system and information processing method
CN104202694A (en) Method and system of orientation of voice pick-up device
CN112712818A (en) Voice enhancement method, device and equipment
CN116386623A (en) Voice interaction method of intelligent equipment, storage medium and electronic device
CN113763942A (en) Interaction method and interaction system of voice household appliances and computer equipment
CN103856740B (en) Information processing method and video conference system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190910