CN110223690A - Human-computer interaction method and device based on the fusion of image and voice - Google Patents

Human-computer interaction method and device based on the fusion of image and voice

Info

Publication number
CN110223690A
Authority
CN
China
Prior art keywords
image data
voice
monitoring area
interactive instruction
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910497613.3A
Other languages
Chinese (zh)
Inventor
蔡育俊
姚家妙
李果
肖君诺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yongshun Shenzhen Wisdom Mdt Infotech Ltd
Original Assignee
Yongshun Shenzhen Wisdom Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yongshun Shenzhen Wisdom Mdt Infotech Ltd filed Critical Yongshun Shenzhen Wisdom Mdt Infotech Ltd
Priority to CN201910497613.3A priority Critical patent/CN110223690A/en
Publication of CN110223690A publication Critical patent/CN110223690A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a human-computer interaction method and device based on the fusion of image and voice. The method comprises the following steps: obtaining video image data of a monitoring area; analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area; calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth; and recognizing the acoustic signal and converting it into corresponding text information. This scheme makes effective use of image recognition to assist in locating the interacting user, improving the accuracy and intelligence of voice interaction. Because image processing is fused with speech recognition, the interactive process and subsequent control instructions are started only in scenes where a user is actually observed, avoiding the drawback of purely speech-recognition-based interactive systems, which can be falsely triggered even in unoccupied scenes; it is therefore safer.

Description

Human-computer interaction method and device based on the fusion of image and voice
Technical field
The present invention relates to the field of intelligent interaction, and in particular to a human-computer interaction method and device based on the fusion of image and voice.
Background technique
Human-computer interaction (Human-Computer Interaction or Human-Machine Interaction, abbreviated HCI or HMI) is the discipline concerned with the interactive relationship between a system and its users.
Existing human-computer interaction technology that relies purely on a microphone array and speech recognition cannot effectively locate the target speaker in scenes with heavy noise or several people speaking, which in turn degrades subsequent noise suppression and speech recognition accuracy and harms the interactive experience.
Existing image processing methods can accurately capture and track information about target persons within a certain area, but they cannot interact with the user. Current voice interaction systems cannot guarantee privacy: anyone can interact with them, and schemes that authenticate by voiceprint have relatively low accuracy.
Summary of the invention
In order to overcome the above defects of the prior art, the object of the present invention is to provide a human-computer interaction method and device based on the fusion of image and voice.
In order to achieve the above object, the technical scheme of the present invention is as follows.
A human-computer interaction method based on the fusion of image and voice, comprising the following steps:
obtaining video image data of a monitoring area;
analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
recognizing the acoustic signal and converting it into corresponding text information.
Further, the step of calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth comprises:
performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
Further, the step of performing noise suppression and reverberation suppression on the sound source according to the azimuth information comprises:
obtaining the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones;
applying a short-time Fourier transform to the time-domain signals to obtain the corresponding frequency-band signals x_1(t, ω), ..., x_M(t, ω);
filtering noise with a superdirective filter, the acoustic signal output after noise and reverberation suppression being z(t, ω) = Σ_{m=1}^{M} G_m*(ω) · x_m(t, ω), where the asterisk denotes complex conjugation;
wherein z(t, ω) denotes the output signal after noise reduction, G_1(ω), ..., G_M(ω) denote the superdirective noise-reduction filter used, ω is the angular frequency of a frequency band, and t denotes the time frame;
the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω);
wherein the superscript H denotes the conjugate transpose, Ψ(ω) denotes the isotropic noise field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω·d_{i,j}/c) / (ω·d_{i,j}/c), d_{i,j} denotes the distance between the i-th and j-th microphones, c denotes the propagation speed of sound in air, and v(ω) denotes the steering vector of the target direction.
Further, the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area comprises:
analyzing the video image data to obtain all facial images in the monitoring area;
matching all facial images against the pre-stored facial image data;
taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
Further, after the step of recognizing the acoustic signal and converting it into corresponding text information, the method comprises:
matching the text information obtained from the conversion against the pre-stored interactive instructions;
obtaining the matched interactive instruction, and executing that interactive instruction.
Further, before the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area, the method comprises:
recording in advance the facial images of controlling users, to construct the pre-stored facial image data;
presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
The invention also provides a human-computer interaction device based on the fusion of image and voice, comprising:
a video acquisition unit, for obtaining video image data of a monitoring area;
an azimuth analysis unit, for analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
a voice pickup unit, for calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
a text conversion unit, for recognizing the acoustic signal and converting it into corresponding text information.
Further, the voice pickup unit comprises a sound source processing module, for performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
Further, the azimuth analysis unit comprises:
a face analysis module, for analyzing the video image data to obtain all facial images in the monitoring area;
a face matching module, for matching all facial images against the pre-stored facial image data;
an azimuth obtaining module, for taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
Further, the device also comprises:
a text matching unit, for matching the text information obtained from the conversion against the pre-stored interactive instructions;
an instruction obtaining unit, for obtaining the matched interactive instruction and executing that interactive instruction;
a face pre-storage unit, for recording in advance the facial images of controlling users to construct the pre-stored facial image data;
an instruction pre-storage unit, for presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
The beneficial effects of the present invention are as follows. This scheme makes effective use of image recognition to assist in locating the interacting user, improving the accuracy and intelligence of the user's voice interaction. Because image processing is fused with speech recognition, the interactive process and subsequent control instructions are started only in scenes where a user is actually observed, avoiding the drawback of purely speech-recognition-based interactive systems, which can be falsely triggered even in unoccupied scenes; it is therefore safer.
Detailed description of the invention
Fig. 1 is a flow chart of a human-computer interaction method based on the fusion of image and voice according to the present invention;
Fig. 2 is a detailed flow chart of the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area according to the present invention;
Fig. 3 is a flow chart of a human-computer interaction method based on the fusion of image and voice according to another embodiment of the present invention;
Fig. 4 is a flow chart of the steps performed before the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area according to the present invention;
Fig. 5 is a structural block diagram of a human-computer interaction device based on the fusion of image and voice according to the present invention;
Fig. 6 is a structural block diagram of the azimuth analysis unit of the present invention;
Fig. 7 is a structural block diagram of the voice pickup unit of the present invention;
Fig. 8 is a structural block diagram of a human-computer interaction device based on the fusion of image and voice according to another embodiment of the present invention;
Fig. 9 is a schematic diagram of a human-computer interaction system based on the fusion of image and voice according to the present invention.
Specific embodiment
To illustrate the ideas and purpose of the present invention, the invention is further described below in conjunction with the drawings and specific embodiments.
The technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
It should be understood that the directional indications used in the embodiments of the present invention (up, down, left, right, front, rear, etc.) only explain the relative positional relationships and motion of components under a particular posture (as shown in the drawings); if that posture changes, the directional indications change accordingly. A connection may be a direct connection or an indirect connection.
In addition, descriptions such as "first" and "second" in the present invention are for descriptive purposes only and should not be understood as indicating or implying relative importance or the number of the indicated technical features; a feature defined by "first" or "second" may thus explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may be combined with one another, provided the combination can be realized by those of ordinary skill in the art; when a combination is contradictory or cannot be realized, it shall be deemed not to exist and to fall outside the protection scope claimed by the present invention.
Unless otherwise indicated, "/" herein means "or".
Referring to Figs. 1-4, a human-computer interaction method based on the fusion of image and voice is proposed, comprising the following steps:
S10: obtaining video image data of a monitoring area;
S20: analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
S30: calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
S40: recognizing the acoustic signal and converting it into corresponding text information.
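Purely as an editorial illustration (the patent itself contains no source code), the four steps can be read as the following loop. This is a minimal sketch under assumed interfaces: camera, mic_array, recognizer, locate_speaker and handle_text are all hypothetical objects or callables standing in for the engines described below.

    # Hypothetical skeleton of steps S10-S40; every dependency is passed in,
    # since the patent does not specify concrete interfaces.
    def interaction_loop(camera, mic_array, recognizer, locate_speaker, handle_text):
        while True:
            frame = camera.read()                 # S10: video image data of the monitoring area
            azimuth = locate_speaker(frame)       # S20: azimuth of the sound-emitting target
            if azimuth is None:
                continue                          # no authorized speaker: nothing is triggered
            signal = mic_array.capture(azimuth)   # S30: acoustic signal from that azimuth
            text = recognizer.transcribe(signal)  # S40: convert the signal to text
            handle_text(text)                     # matched against stored instructions (S50-S60)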
For steps S10 and S20: because of the constraints of the usage scenario, human-computer interaction takes place in a specific space, for example the living room, bedroom or kitchen of a house, a company meeting room, a school classroom, and so on. A camera is set up in the space where human-computer interaction can take place (the monitoring area) to obtain video image data of that space in real time, and the data is sent to a video image analysis engine, which can be integrated either in a cloud server or in a local terminal. The video image data of the monitored space is obtained and analyzed, and can be used to determine the position of the target user who issues the control sound.
Specifically, the video image analysis engine comprises a camera for capturing video image information in real time and a processing engine for image tracking. The video image analysis engine tracks facial images in the monitoring area; for example, when it recognizes that a user's lips in the monitoring area are moving, it determines that this user is emitting a voice signal and identifies the facial image. If the facial image is a pre-stored facial image possessing control authority, the user's azimuth information is located in real time.
In other embodiments of the present invention, to meet practical needs, when only one user is detected in the monitoring area, that user is deemed by default to have control authority, and there is no need to match the facial image against the pre-stored facial image data: by analyzing the video image data, once lip motion of a face in the monitoring area is recognized, all sound signals in the monitoring area are accepted by default and subsequent processing is carried out.
With reference to Fig. 2, step S20 comprises:
S21: analyzing the video image data to obtain all facial images in the monitoring area;
S22: matching all facial images against the pre-stored facial image data;
S23: taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
For steps S21-S23: there is not necessarily only one user in the monitoring area; there may be several users, and several of them may be present at the same time. In a scene where several people speak, if the analysis of the video image data finds several people speaking simultaneously, all facial images in the monitoring area are obtained and compared against the pre-stored facial image data; the face that best matches a pre-stored facial image, i.e. the one with the highest similarity, is the target user's face, and its azimuth is the target azimuth. It should be noted that during matching, only after the similarity between the target face and one of the pre-stored facial images exceeds a specified threshold can the target face be determined to match that pre-stored facial image.
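As an illustrative sketch of steps S21-S23 (not code from the patent), the logic of matching all detected faces against pre-stored data and keeping the most similar one above a threshold can be written with the open-source face_recognition library; the 0.6 distance threshold is that library's customary default, standing in here for the unspecified "specified threshold".

    # Sketch of S21-S23 with the face_recognition library (an assumption,
    # not the patent's implementation): detect all faces, compare each with
    # the pre-stored encodings, keep the closest match under the threshold.
    import face_recognition
    import numpy as np

    DISTANCE_THRESHOLD = 0.6  # the library's customary default

    def find_target_face(frame, stored_encodings):
        """frame: RGB image of the monitoring area;
        stored_encodings: encodings of pre-registered controlling users.
        Returns the bounding box of the best-matching face, or None."""
        boxes = face_recognition.face_locations(frame)             # S21: all faces in the area
        encodings = face_recognition.face_encodings(frame, boxes)
        best_box, best_dist = None, DISTANCE_THRESHOLD
        for box, enc in zip(boxes, encodings):
            # distance to the closest pre-stored face (S22)
            dist = np.min(face_recognition.face_distance(stored_encodings, enc))
            if dist < best_dist:                                   # S23: most similar face wins
                best_box, best_dist = box, dist
        return best_box  # the box's position in the frame yields the azimuth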
With reference to Fig. 4, before step S20 the method further comprises:
S11: recording in advance the facial images of controlling users, to construct the pre-stored facial image data;
S12: presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
For step S11: before using this system, the user needs to register facial images in front of the camera to construct the pre-stored facial image data, so as to guarantee the accuracy of subsequent positioning and user control.
For step S12: besides the preset basic interactive instructions, interactive instructions also include user-defined ones. They may cover conventional functions such as controlling various smart-home devices, for example the common "turn on the TV", "turn on the living room light", "turn off the bedroom light", and so on; when the user utters the corresponding voice, the corresponding interactive operation can be realized through the preset interactive instruction.
Associating an interactive instruction with at least one piece of text information for storage means that one interactive instruction can be invoked by several different pieces of text information. For example, the interactive instruction "turn on the living room light" may be invoked by text such as "turn on the light in the living room", "open the living room lamp", "switch on the living room light", "put the living room light on" or "light up the living room".
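For illustration only, the association "one interactive instruction, several pieces of text information" can be stored as a simple phrase-to-instruction table; the phrases and instruction names below are hypothetical examples mirroring the ones in the text.

    # Hypothetical phrase table: several different pieces of text information
    # all invoke the same interactive instruction.
    COMMAND_TABLE = {
        "turn on the light in the living room": "LIVING_ROOM_LIGHT_ON",
        "open the living room lamp":            "LIVING_ROOM_LIGHT_ON",
        "switch on the living room light":      "LIVING_ROOM_LIGHT_ON",
        "light up the living room":             "LIVING_ROOM_LIGHT_ON",
        "turn on the tv":                       "TV_ON",
        "turn off the bedroom light":           "BEDROOM_LIGHT_OFF",
    }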
For step S30: microphones are set in the monitoring area to collect sound information from all directions within the area. Once the azimuth information of the target user in the monitoring area is obtained, the microphones can be called to capture the sound of the target user at that azimuth as the interactive instruction. During acquisition, the sound source requires certain processing, such as noise suppression and reverberation suppression, and the processed sound is taken as the interactive instruction.
Specifically, step S30 comprises: S31, performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
The concrete processing procedure of step S31 is as follows.
Obtain the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones.
Apply a short-time Fourier transform to the time-domain signals to obtain the corresponding frequency-band signals x_1(t, ω), ..., x_M(t, ω).
Filter noise with a superdirective filter; the acoustic signal output after noise and reverberation suppression is z(t, ω) = Σ_{m=1}^{M} G_m*(ω) · x_m(t, ω), where the asterisk denotes complex conjugation.
Here z(t, ω) denotes the output signal after noise reduction, G_1(ω), ..., G_M(ω) denote the superdirective noise-reduction filter used, ω is the angular frequency of a frequency band, and t denotes the time frame.
Specifically, the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω).
Here the superscript H denotes the conjugate transpose, Ψ(ω) denotes the isotropic noise field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω·d_{i,j}/c) / (ω·d_{i,j}/c), d_{i,j} denotes the distance between the i-th and j-th microphones, c denotes the propagation speed of sound in air, and v(ω) denotes the steering vector of the target direction.
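To make the formulas above concrete, here is a short NumPy/SciPy sketch of the superdirective beamformer. It is an illustration under stated assumptions (a linear microphone array, far-field steering, diagonal loading of Ψ(ω) for numerical stability), not the patent's reference implementation; the sampling rate, window length and array geometry are example values.

    # Illustrative superdirective beamformer (assumptions noted above).
    import numpy as np
    from scipy.signal import stft, istft

    C = 343.0      # propagation speed of sound in air, m/s
    FS = 16000     # sampling rate, Hz (example value)
    NPERSEG = 512  # STFT window length (example value)

    def superdirective_beamform(x, mic_positions, theta):
        """x: (M, T) time-domain microphone signals x_1(t), ..., x_M(t);
        mic_positions: (M,) microphone coordinates along a linear array, in metres;
        theta: target azimuth in radians, measured from the array axis.
        Returns the enhanced single-channel time-domain signal."""
        M = x.shape[0]
        f, _, X = stft(x, fs=FS, nperseg=NPERSEG)    # x_m(t, w), shape (M, F, T)
        d = np.abs(mic_positions[:, None] - mic_positions[None, :])  # distances d_ij
        tau = mic_positions * np.cos(theta) / C      # far-field delays toward the target
        Z = np.zeros(X.shape[1:], dtype=complex)
        for k, w in enumerate(2 * np.pi * f):        # w: angular frequency of the band
            # Isotropic noise field correlation matrix: Psi_ij = sinc(w d_ij / c),
            # with diagonal loading so that the inverse is well conditioned.
            Psi = np.sinc(w * d / (np.pi * C)) + 1e-3 * np.eye(M)
            v = np.exp(-1j * w * tau)                # steering vector of the target direction
            Psi_inv_v = np.linalg.solve(Psi, v)
            g = Psi_inv_v / (v.conj() @ Psi_inv_v)   # [G_1..G_M]^H = (v^H Psi^-1 v)^-1 v^H Psi^-1
            Z[k, :] = g.conj() @ X[:, k, :]          # z(t, w) = sum_m G_m^*(w) x_m(t, w)
        _, z = istft(Z, fs=FS, nperseg=NPERSEG)
        return z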
For step S40: after the target voice interactive instruction is captured, the acoustic signal is sent to a speech recognition engine, translated in real time and converted into text information; the text information is compared with the pre-stored interactive instructions to determine which specific interactive instruction to invoke for human-computer interaction. Specifically, speech recognition can be realized through the natural language processing interface of the Baidu AI open platform.
With reference to Fig. 3, after step S40 the method comprises:
S50: matching the text information obtained from the conversion against the pre-stored interactive instructions;
S60: obtaining the matched interactive instruction, and executing that interactive instruction.
For steps S50-S60: an interactive instruction is associated with at least one piece of text information (voice information), so recognizing different pieces of text information can invoke the same or different interactive instructions for execution. For example, when the user says "turn on the living room light", the recognized text information is "turn on the living room light", and the corresponding interactive instruction is: turn on the light of the living room; that interactive instruction is executed, the living room light is turned on, and human-computer interaction is realized.
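A hedged sketch of steps S50-S60 (again not from the patent): the recognized text is matched against the stored phrases, here with Python's standard difflib as a stand-in similarity measure, and the associated instruction is executed. COMMAND_TABLE is the hypothetical phrase table sketched earlier; executor is an assumed callable that forwards the instruction to the appliance.

    # Sketch of S50-S60: fuzzy-match the recognized text against stored
    # phrases and execute the associated interactive instruction.
    import difflib

    def match_and_execute(text, command_table, executor):
        matches = difflib.get_close_matches(text.lower(), command_table.keys(),
                                            n=1, cutoff=0.8)
        if not matches:
            return None                # no interactive instruction matched
        instruction = command_table[matches[0]]
        executor(instruction)          # e.g. forward LIVING_ROOM_LIGHT_ON to the appliance
        return instruction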
This scheme makes effective use of image recognition to assist in locating the interacting user, improving the accuracy and intelligence of voice interaction. Because image processing is fused with speech recognition, the interactive process and subsequent control instructions are started only in scenes where a user is actually observed, avoiding the drawback of purely speech-recognition-based interactive systems, which can be falsely triggered even in unoccupied scenes; it is therefore safer.
With reference to Figs. 5-8, the invention also provides a human-computer interaction device based on the fusion of image and voice, comprising:
a video acquisition unit 10, for obtaining video image data of the monitoring area;
an azimuth analysis unit 20, for analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area;
a voice pickup unit 30, for calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
a text conversion unit 40, for recognizing the acoustic signal and converting it into corresponding text information.
For the video acquisition unit 10 and the azimuth analysis unit 20: because of the constraints of the usage scenario, human-computer interaction takes place in a specific space, for example the living room, bedroom or kitchen of a house, a company meeting room, a school classroom, and so on. A camera is set up in the space where human-computer interaction can take place (the monitoring area) to obtain video image data of that space in real time, and the data is sent to a video image analysis engine, which can be integrated either in a cloud server or in a local terminal. The video image data of the monitored space is obtained and analyzed, and can be used to determine the position of the target user who issues the control sound.
Specifically, the video image analysis engine comprises a camera for capturing video image information in real time and a processing engine for image tracking. The video image analysis engine tracks facial images in the monitoring area; for example, when it recognizes that a user's lips in the monitoring area are moving, it determines that this user is emitting a voice signal and identifies the facial image. If the facial image is a pre-stored facial image possessing control authority, the user's azimuth information is located in real time.
In other embodiments of the present invention, to meet practical needs, when only one user is detected in the monitoring area, that user is deemed by default to have control authority, and there is no need to match the facial image against the pre-stored facial image data: by analyzing the video image data, once lip motion of a face in the monitoring area is recognized, all sound signals in the monitoring area are accepted by default and subsequent processing is carried out.
With reference to Fig. 6, the azimuth analysis unit 20 comprises:
a face analysis module 21, for analyzing the video image data to obtain all facial images in the monitoring area;
a face matching module 22, for matching all facial images against the pre-stored facial image data;
an azimuth obtaining module 23, for taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
For the face analysis module 21, the face matching module 22 and the azimuth obtaining module 23: there is not necessarily only one user in the monitoring area; there may be several users, and several of them may be present at the same time. In a scene where several people speak, if the analysis of the video image data finds several people speaking simultaneously, all facial images in the monitoring area are obtained and compared against the pre-stored facial image data; the face that best matches a pre-stored facial image, i.e. the one with the highest similarity, is the target user's face, and its azimuth is the target azimuth. It should be noted that during matching, only after the similarity between the target face and one of the pre-stored facial images exceeds a specified threshold can the target face be determined to match that pre-stored facial image.
For the voice pickup unit 30: microphones are set in the monitoring area to collect sound information from all directions within the area. Once the azimuth information of the target user in the monitoring area is obtained, the microphones can be called to capture the sound of the target user at that azimuth as the interactive instruction. During acquisition, the sound source requires certain processing, such as noise suppression and reverberation suppression, and the processed sound is taken as the interactive instruction.
With reference to Fig. 7, the voice pickup unit 30 comprises a sound source processing module 31, for performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
The concrete operating procedure of the sound source processing module 31 is as follows.
Obtain the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones.
Apply a short-time Fourier transform to the time-domain signals to obtain the corresponding frequency-band signals x_1(t, ω), ..., x_M(t, ω).
Filter noise with a superdirective filter; the acoustic signal output after noise and reverberation suppression is z(t, ω) = Σ_{m=1}^{M} G_m*(ω) · x_m(t, ω), where the asterisk denotes complex conjugation.
Here z(t, ω) denotes the output signal after noise reduction, G_1(ω), ..., G_M(ω) denote the superdirective noise-reduction filter used, ω is the angular frequency of a frequency band, and t denotes the time frame.
Specifically, the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω).
Here the superscript H denotes the conjugate transpose, Ψ(ω) denotes the isotropic noise field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω·d_{i,j}/c) / (ω·d_{i,j}/c), d_{i,j} denotes the distance between the i-th and j-th microphones, c denotes the propagation speed of sound in air, and v(ω) denotes the steering vector of the target direction.
For the text conversion unit 40: after the target voice interactive instruction is captured, the acoustic signal is sent to a speech recognition engine, translated in real time and converted into text information; the text information is compared with the pre-stored interactive instructions to determine which specific interactive instruction to invoke for human-computer interaction. Specifically, speech recognition can be realized through the natural language processing interface of the Baidu AI open platform.
With reference to Fig. 8, the human-computer interaction device based on the fusion of image and voice of the present invention further comprises:
a text matching unit 50, for matching the text information obtained from the conversion against the pre-stored interactive instructions;
an instruction obtaining unit 60, for obtaining the matched interactive instruction and executing that interactive instruction;
a face pre-storage unit 70, for recording in advance the facial images of controlling users to construct the pre-stored facial image data;
an instruction pre-storage unit 80, for presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
For the text matching unit 50 and the instruction obtaining unit 60: an interactive instruction is associated with at least one piece of text information (voice information), so recognizing different pieces of text information can invoke the same or different interactive instructions for execution. For example, when the user says "turn on the living room light", the recognized text information is "turn on the living room light", and the corresponding interactive instruction is: turn on the light of the living room; that interactive instruction is executed, the living room light is turned on, and human-computer interaction is realized.
For the face pre-storage unit 70 and the instruction pre-storage unit 80: before using this system, the user needs to register facial images in front of the camera to construct the pre-stored facial image data, so as to guarantee the accuracy of subsequent positioning and user control.
Besides the preset basic interactive instructions, interactive instructions also include user-defined ones. They may cover conventional functions such as controlling various smart-home devices, for example the common "turn on the TV", "turn on the living room light", "turn off the bedroom light", and so on; when the user utters the corresponding voice, the corresponding interactive operation can be realized through the preset interactive instruction.
Associating an interactive instruction with at least one piece of text information for storage means that one interactive instruction can be invoked by several different pieces of text information. For example, the interactive instruction "turn on the living room light" may be invoked by text such as "turn on the light in the living room", "open the living room lamp", "switch on the living room light", "put the living room light on" or "light up the living room".
This scheme makes effective use of image recognition to assist in locating the interacting user, improving the accuracy and intelligence of voice interaction. Because image processing is fused with speech recognition, the interactive process and subsequent control instructions are started only in scenes where a user is actually observed, avoiding the drawback of purely speech-recognition-based interactive systems, which can be falsely triggered even in unoccupied scenes; it is therefore safer.
With reference to Fig. 9, the invention also provides a human-computer interaction system based on the fusion of image and voice, comprising hardware terminals set in the monitoring area and processing engines for data processing. The hardware terminals include a camera device for obtaining video image data of the monitoring area, a microphone array for receiving the user's voice signal, and a loudspeaker for playing voice.
The processing engines include a video image analysis engine for analyzing the video images, a speech recognition engine for recognizing the voice signal and converting it into text information, a text analysis engine for matching the text information to the corresponding control instruction, and a speech synthesis engine for synthesizing the control instruction into a voice announcement played through the loudspeaker.
Specifically, the processing engines can be integrated either in a cloud server or in a local terminal.
In operation, the camera device obtains the video image data of the monitoring area, which the video image analysis engine analyzes to locate the target user; the voice signal from the target user is received through the microphone array and converted into text information by the speech recognition engine; the text analysis engine matches it against the pre-stored interactive instructions to find the corresponding interactive instruction; the speech synthesis engine synthesizes the corresponding voice and plays it through the loudspeaker, while the interactive instruction is sent through a communication module to the target home appliance for execution.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of the invention. Any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, shall likewise fall within the protection scope of the present invention.

Claims (10)

1. A human-computer interaction method based on the fusion of image and voice, characterized by comprising the following steps:
obtaining video image data of a monitoring area;
analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
recognizing the acoustic signal and converting it into corresponding text information.
2. The human-computer interaction method based on the fusion of image and voice according to claim 1, characterized in that the step of calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth comprises:
performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
3. The human-computer interaction method based on the fusion of image and voice according to claim 2, characterized in that the step of performing noise suppression and reverberation suppression on the sound source according to the azimuth information comprises:
obtaining the time-domain signals x_1(t), ..., x_M(t) collected by the M microphones;
applying a short-time Fourier transform to the time-domain signals to obtain the corresponding frequency-band signals x_1(t, ω), ..., x_M(t, ω);
filtering noise with a superdirective filter, the acoustic signal output after noise and reverberation suppression being z(t, ω) = Σ_{m=1}^{M} G_m*(ω) · x_m(t, ω), where the asterisk denotes complex conjugation;
wherein z(t, ω) denotes the output signal after noise reduction, G_1(ω), ..., G_M(ω) denote the superdirective noise-reduction filter used, ω is the angular frequency of a frequency band, and t denotes the time frame;
the superdirective filter is: [G_1(ω), ..., G_M(ω)]^H = [v^H(ω) Ψ^{-1}(ω) v(ω)]^{-1} v^H(ω) Ψ^{-1}(ω);
wherein the superscript H denotes the conjugate transpose, Ψ(ω) denotes the isotropic noise field correlation matrix whose elements are Ψ_{i,j}(ω) = sin(ω·d_{i,j}/c) / (ω·d_{i,j}/c), d_{i,j} denotes the distance between the i-th and j-th microphones, c denotes the propagation speed of sound in air, and v(ω) denotes the steering vector of the target direction.
4. The human-computer interaction method based on the fusion of image and voice according to claim 1, characterized in that the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area comprises:
analyzing the video image data to obtain all facial images in the monitoring area;
matching all facial images against the pre-stored facial image data;
taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
5. The human-computer interaction method based on the fusion of image and voice according to claim 4, characterized in that after the step of recognizing the acoustic signal and converting it into corresponding text information, the method comprises:
matching the text information obtained from the conversion against the pre-stored interactive instructions;
obtaining the matched interactive instruction, and executing that interactive instruction.
6. The human-computer interaction method based on the fusion of image and voice according to claim 5, characterized in that before the step of analyzing the video image data to obtain the azimuth information of the sound-emitting target in the monitoring area, the method comprises:
recording in advance the facial images of controlling users, to construct the pre-stored facial image data;
presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
7. A human-computer interaction device based on the fusion of image and voice, characterized by comprising:
a video acquisition unit, for obtaining video image data of a monitoring area;
an azimuth analysis unit, for analyzing the video image data to obtain the azimuth information of a sound-emitting target in the monitoring area;
a voice pickup unit, for calling a microphone according to the azimuth information to obtain the acoustic signal at the corresponding azimuth;
a text conversion unit, for recognizing the acoustic signal and converting it into corresponding text information.
8. The human-computer interaction device based on the fusion of image and voice according to claim 7, characterized in that the voice pickup unit comprises a sound source processing module, for performing noise suppression and reverberation suppression on the sound source according to the azimuth information, to obtain the processed acoustic signal.
9. The human-computer interaction device based on the fusion of image and voice according to claim 7, characterized in that the azimuth analysis unit comprises:
a face analysis module, for analyzing the video image data to obtain all facial images in the monitoring area;
a face matching module, for matching all facial images against the pre-stored facial image data;
an azimuth obtaining module, for taking the facial image that matches the pre-stored facial image data as the sound-emitting target, and obtaining its azimuth information in the monitoring area.
10. The human-computer interaction device based on the fusion of image and voice according to claim 9, characterized by further comprising:
a text matching unit, for matching the text information obtained from the conversion against the pre-stored interactive instructions;
an instruction obtaining unit, for obtaining the matched interactive instruction and executing that interactive instruction;
a face pre-storage unit, for recording in advance the facial images of controlling users to construct the pre-stored facial image data;
an instruction pre-storage unit, for presetting interactive instructions with different functions, and associating each interactive instruction with at least one piece of text information for storage.
CN201910497613.3A 2019-06-10 2019-06-10 Human-computer interaction method and device based on the fusion of image and voice Pending CN110223690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910497613.3A CN110223690A (en) 2019-06-10 2019-06-10 Human-computer interaction method and device based on the fusion of image and voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910497613.3A CN110223690A (en) 2019-06-10 2019-06-10 Human-computer interaction method and device based on the fusion of image and voice

Publications (1)

Publication Number Publication Date
CN110223690A true CN110223690A (en) 2019-09-10

Family

ID=67816067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910497613.3A Pending CN110223690A (en) Human-computer interaction method and device based on the fusion of image and voice

Country Status (1)

Country Link
CN (1) CN110223690A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992971A (en) * 2019-12-24 2020-04-10 达闼科技成都有限公司 Method for determining voice enhancement direction, electronic equipment and storage medium
CN111354353A (en) * 2020-03-09 2020-06-30 联想(北京)有限公司 Voice data processing method and device
CN111405122A (en) * 2020-03-18 2020-07-10 苏州科达科技股份有限公司 Audio call testing method, device and storage medium
CN111554269A (en) * 2019-10-12 2020-08-18 南京奥拓软件技术有限公司 Voice number taking method, system and storage medium
CN111681673A (en) * 2020-05-27 2020-09-18 北京华夏电通科技有限公司 Method and system for identifying knocking hammer in court trial process
CN111767785A (en) * 2020-05-11 2020-10-13 南京奥拓电子科技有限公司 Man-machine interaction control method and device, intelligent robot and storage medium
CN112397065A (en) * 2020-11-04 2021-02-23 深圳地平线机器人科技有限公司 Voice interaction method and device, computer readable storage medium and electronic equipment
CN112578338A (en) * 2019-09-27 2021-03-30 阿里巴巴集团控股有限公司 Sound source positioning method, device, equipment and storage medium
CN112634911A (en) * 2020-12-21 2021-04-09 苏州思必驰信息科技有限公司 Man-machine conversation method, electronic device and computer readable storage medium
CN112655000A (en) * 2020-04-30 2021-04-13 华为技术有限公司 In-vehicle user positioning method, vehicle-mounted interaction method, vehicle-mounted device and vehicle
CN113014983A (en) * 2021-03-08 2021-06-22 Oppo广东移动通信有限公司 Video playing method and device, storage medium and electronic equipment
CN113327286A (en) * 2021-05-10 2021-08-31 中国地质大学(武汉) 360-degree omnibearing speaker visual space positioning method
CN113362849A (en) * 2020-03-02 2021-09-07 阿里巴巴集团控股有限公司 Voice data processing method and device
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114530151A (en) * 2022-02-10 2022-05-24 山东企联信息技术股份有限公司 Artificial intelligence AI voice control system and experience device thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070127736A1 (en) * 2003-06-30 2007-06-07 Markus Christoph Handsfree system for use in a vehicle
WO2008041878A2 (en) * 2006-10-04 2008-04-10 Micronas Nit System and procedure of hands free speech communication using a microphone array
JP2013172411A (en) * 2012-02-22 2013-09-02 Nec Corp Voice recognition system, voice recognition method, and voice recognition program
CN104049721A (en) * 2013-03-11 2014-09-17 联想(北京)有限公司 Information processing method and electronic equipment
CN106056710A (en) * 2016-06-02 2016-10-26 北京云知声信息技术有限公司 Method and device for controlling intelligent electronic locks
CN106440192A (en) * 2016-09-19 2017-02-22 珠海格力电器股份有限公司 Household appliance control method, device and system and intelligent air conditioner

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070127736A1 (en) * 2003-06-30 2007-06-07 Markus Christoph Handsfree system for use in a vehicle
WO2008041878A2 (en) * 2006-10-04 2008-04-10 Micronas Nit System and procedure of hands free speech communication using a microphone array
JP2013172411A (en) * 2012-02-22 2013-09-02 Nec Corp Voice recognition system, voice recognition method, and voice recognition program
CN104049721A (en) * 2013-03-11 2014-09-17 联想(北京)有限公司 Information processing method and electronic equipment
CN106056710A (en) * 2016-06-02 2016-10-26 北京云知声信息技术有限公司 Method and device for controlling intelligent electronic locks
CN106440192A (en) * 2016-09-19 2017-02-22 珠海格力电器股份有限公司 Household appliance control method, device and system and intelligent air conditioner

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Seon Man Kim et al.: "Multi-channel audio recording based on superdirective beamforming for portable multimedia recording devices", IEEE Transactions on Consumer Electronics *
W. Tager et al.: "Near field superdirectivity (NFSD)", ICASSP 1998 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112578338B (en) * 2019-09-27 2024-05-14 阿里巴巴集团控股有限公司 Sound source positioning method, device, equipment and storage medium
CN112578338A (en) * 2019-09-27 2021-03-30 阿里巴巴集团控股有限公司 Sound source positioning method, device, equipment and storage medium
CN111554269A (en) * 2019-10-12 2020-08-18 南京奥拓软件技术有限公司 Voice number taking method, system and storage medium
CN110992971A (en) * 2019-12-24 2020-04-10 达闼科技成都有限公司 Method for determining voice enhancement direction, electronic equipment and storage medium
CN113362849A (en) * 2020-03-02 2021-09-07 阿里巴巴集团控股有限公司 Voice data processing method and device
CN111354353A (en) * 2020-03-09 2020-06-30 联想(北京)有限公司 Voice data processing method and device
CN111354353B (en) * 2020-03-09 2023-09-19 联想(北京)有限公司 Voice data processing method and device
CN111405122A (en) * 2020-03-18 2020-07-10 苏州科达科技股份有限公司 Audio call testing method, device and storage medium
CN112655000A (en) * 2020-04-30 2021-04-13 华为技术有限公司 In-vehicle user positioning method, vehicle-mounted interaction method, vehicle-mounted device and vehicle
CN111767785A (en) * 2020-05-11 2020-10-13 南京奥拓电子科技有限公司 Man-machine interaction control method and device, intelligent robot and storage medium
CN111681673A (en) * 2020-05-27 2020-09-18 北京华夏电通科技有限公司 Method and system for identifying knocking hammer in court trial process
CN111681673B (en) * 2020-05-27 2023-06-20 北京华夏电通科技股份有限公司 Method and system for identifying judicial mallet knocked in court trial process
CN112397065A (en) * 2020-11-04 2021-02-23 深圳地平线机器人科技有限公司 Voice interaction method and device, computer readable storage medium and electronic equipment
CN112634911A (en) * 2020-12-21 2021-04-09 苏州思必驰信息科技有限公司 Man-machine conversation method, electronic device and computer readable storage medium
CN113014983A (en) * 2021-03-08 2021-06-22 Oppo广东移动通信有限公司 Video playing method and device, storage medium and electronic equipment
CN113327286A (en) * 2021-05-10 2021-08-31 中国地质大学(武汉) 360-degree omnibearing speaker visual space positioning method
CN113327286B (en) * 2021-05-10 2023-05-19 中国地质大学(武汉) 360-degree omnibearing speaker vision space positioning method
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114530151A (en) * 2022-02-10 2022-05-24 山东企联信息技术股份有限公司 Artificial intelligence AI voice control system and experience device thereof

Similar Documents

Publication Publication Date Title
CN110223690A (en) Human-computer interaction method and device based on the fusion of image and voice
CN106653008B (en) Voice control method, device and system
CN104049721B (en) Information processing method and electronic equipment
CN111833899B (en) Voice detection method based on polyphonic regions, related device and storage medium
CN106440192A (en) Household appliance control method, device and system and intelligent air conditioner
CN108470568B (en) Intelligent device control method and device, storage medium and electronic device
CN103124165A (en) Automatic gain control
CN104269172A (en) Voice control method and system based on video positioning
CN107124647A (en) A kind of panoramic video automatically generates the method and device of subtitle file when recording
CN110956965A (en) Personalized intelligent home safety control system and method based on voiceprint recognition
CN110767225A (en) Voice interaction method, device and system
CN113676668A (en) Video shooting method and device, electronic equipment and readable storage medium
CN110970020A (en) Method for extracting effective voice signal by using voiceprint
CN112700773A (en) Method and control system for controlling exhibition hall based on voice
CN115482830B (en) Voice enhancement method and related equipment
CN117941343A (en) Multi-source audio processing system and method
CN109343481B (en) Method and device for controlling device
CN110992971A (en) Method for determining voice enhancement direction, electronic equipment and storage medium
CN107247923A (en) A kind of instruction identification method, device, storage device, mobile terminal and electrical equipment
JP7400364B2 (en) Speech recognition system and information processing method
CN104202694A (en) Method and system of orientation of voice pick-up device
CN112712818A (en) Voice enhancement method, device and equipment
CN116386623A (en) Voice interaction method of intelligent equipment, storage medium and electronic device
CN113763942A (en) Interaction method and interaction system of voice household appliances and computer equipment
CN103856740B (en) Information processing method and video conference system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190910