WO2015154419A1 - Human-machine interaction device and method - Google Patents

Human-machine interaction device and method

Info

Publication number
WO2015154419A1
Authority
WO
WIPO (PCT)
Prior art keywords
lip
human
microphone
camera
voice
Prior art date
Application number
PCT/CN2014/089020
Other languages
French (fr)
Chinese (zh)
Inventor
陈军 (Chen Jun)
姚立哲 (Yao Lizhe)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2015154419A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/24 — Speech recognition using non-acoustical features
    • G10L15/25 — Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Definitions

  • The present invention relates to the field of human-computer interaction technology, and more particularly to a human-machine interaction device and method.
  • The technical problem to be solved by the present invention is to provide a human-machine interaction device and method, so as to solve the problem of low reliability of speech recognition in noisy environments.
  • A human-computer interaction method includes:
  • while the microphone in the human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, the camera in the human-machine interaction device is started to capture lip-reading images in real time;
  • the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
  • Optionally, detecting a valid voice input includes:
  • the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds a set threshold, a valid voice input is determined to exist, where the electrical signal includes a voltage signal or a current signal.
  • Optionally, after the step of starting the camera in the human-machine interaction device to capture lip-reading images in real time, the method further includes:
  • while the microphone is acquiring a voice signal, if invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the human-machine interaction device controls the microphone to enter a listening state and controls the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
  • A human-computer interaction method includes:
  • the microphone in the human-machine interaction device acquires a voice signal, and the camera captures lip-reading images in real time;
  • the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech, where, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
  • Optionally, after the microphone is controlled to enter the listening state and the camera is controlled to stop working, the method further includes:
  • when the microphone is in the listening state, if a valid voice input is detected, the microphone enters the working state and starts the camera to capture lip-reading images in real time.
  • A human-machine interaction device includes a microphone, a camera, a lip-reading image processing module, and a fusion recognition module, where:
  • the microphone is configured to acquire a voice signal and, when a valid voice input is detected, start the camera;
  • the camera is configured to capture lip-reading images in real time under the control of the microphone;
  • the lip-reading image processing module is configured to process the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the fusion recognition module is configured to fuse the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
  • Optionally, the microphone is configured to detect a valid voice input as follows:
  • the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds a set threshold, a valid voice input is determined to exist, where the electrical signal includes a voltage signal or a current signal.
  • Optionally, the apparatus further includes a control module, where:
  • the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images, control the microphone to enter a listening state and control the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
  • Optionally, the device is assembled in any one of the following equipment:
  • wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • A human-machine interaction device includes a microphone and a camera, and further includes a lip-reading image processing module, a fusion recognition module, and a control module, where:
  • the lip-reading image processing module is configured to process the sequence formed by the lip-reading images captured by the camera to obtain lip-motion feature data;
  • the fusion recognition module is configured to fuse the lip-motion feature data with voice feature data extracted from the voice signal acquired by the microphone to recognize the input speech; and
  • the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images, control the microphone to enter a listening state and control the camera to stop working.
  • Optionally, the microphone is further configured to, after entering the listening state under the control of the control module, enter the working state if a valid voice input is detected, and start the camera to capture lip-reading images in real time.
  • Optionally, the device is assembled in any one of the following equipment:
  • wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • A computer program includes program instructions that, when executed by a human-machine interaction device, cause the device to perform the corresponding human-computer interaction method.
  • A carrier carries the computer program.
  • A computer program includes program instructions that, when executed by a human-machine interaction device, cause the device to perform the corresponding human-computer interaction method.
  • A carrier carries the computer program.
  • In the technical solution of the present application, lip reading and speech are fused in a noisy environment; compared with the conventional technique of recognition using speech feature data alone, this effectively improves speech recognition and raises the machine recognition rate, and the camera is started only when a valid voice input is confirmed, which greatly reduces device power consumption.
  • FIG. 1 is a structural diagram of an interaction apparatus implemented according to an embodiment of the present invention.
  • This embodiment provides a human-computer interaction method that fuses lip reading and speech to perform speech recognition in a noisy environment.
  • The method mainly includes the following operations:
  • while the microphone in the human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, the camera in the human-machine interaction device is started to capture lip-reading images in real time;
  • the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
  • The lip-reading image, which may also be called a lip-motion image, refers to an image of the changing movement of the speaker's lips while speaking.
  • Over a period of time, the lip-reading images constitute an image sequence, also called a lip-reading video.
  • The sequence formed by the lip-reading images refers to the lip-reading video over a period of time.
  • The feature parameters obtained by applying specific operations to the lip-motion image sequence, i.e., the lip-motion feature data, are common knowledge to those skilled in the art and are not described further here.
  • The speech feature data is obtained by processing the speech signal and can be represented in many ways; for example, the spectral parameters of the speech can serve as one kind of feature data.
  • Speech feature data processing can be performed once the speech signal is acquired, and is executed by the speech processing module; speech feature data processing and lip-reading image processing are performed independently of each other.
  • During the acquisition of the voice signal by the microphone, the process of detecting a valid voice input is as follows:
  • the microphone detects the sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds the set threshold, a valid voice input is determined to exist.
  • The electrical signal involved includes a current signal or a voltage signal.
  • A feedback mechanism for lip-reading processing is also proposed: when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera (that is, the user's lips show no movement, and the user may not be speaking),
  • the human-machine interaction device controls the microphone to enter the listening state and controls the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
  • This mechanism targets situations of heavy noise: by combining the user's lip-motion features, it accurately distinguishes user speech from noise and, when noise is identified, stops the camera to improve equipment utilization.
  • Correspondingly, the human-machine interaction device may also, according to a user instruction, keep the microphone acquiring the voice signal while notifying the camera to cancel the capture of lip-reading images, thereby accommodating the user's choice of recognition mode in special scenarios and improving the user experience.
  • For example, a user interacts by voice with a smart device through a headset. Since machine recognition of human speech degrades noticeably in noisy environments or when the user's intonation is problematic, recognition of lip-reading images can be used to further improve the accuracy of speech recognition, helping the machine better understand the user's spoken expression and execute the user's voice commands.
  • Optionally, the human-computer interaction process is as follows:
  • Step 1: the microphone acquires a voice signal and, when there is a valid voice input, starts the camera.
  • The microphone mainly uses a sound pressure sensor to detect the sound source and convert natural voice into an electrical signal.
  • To distinguish background sound, a threshold for the sound pressure sensor's electrical signal can be set to decide whether there is a valid voice input.
  • When the converted sound pressure sensor signal is greater than (or not less than) the set threshold, a valid voice input is determined, and the camera is notified to start and begin normal operation.
  • Only when the microphone detects a valid voice input is the camera notified to work and capture lip-reading images, which reduces device power consumption.
  • Step 2: the camera captures lip-reading images.
  • Lip-reading images are usually captured by first performing face recognition on the image sequence to determine the position of the lips, and then acquiring the lip-motion data.
  • In practice, a directional microphone may be chosen with the camera built into the microphone (or the microphone built into the camera); in a headset, for example, the camera sits at the microphone and points directly at the user's lips during use, making it easy to capture lip images.
  • Step 3: the sequence formed by the captured lip-reading images is processed to obtain lip-motion feature data.
  • Through user configuration, a feedback mechanism for lip-reading processing can be set. For example, in a noisy environment or a cross-talker scenario, the microphone may pick up other sound signals while the user is not speaking, causing the camera to start capturing lip images; processing the lip-reading images then extracts no lip-motion features. In this case, the human-machine interaction device can notify the camera, the voice processing module, the lip-reading processing module, and the fusion recognition module to stop working, leaving only the microphone in the listening state.
  • In certain special scenarios, the feedback mechanism can also be cancelled; for example, when the camera cannot effectively capture lip-reading data, human-computer interaction proceeds by voice alone, so that lip-reading recognition results do not interfere with speech recognition. Alternatively, for special scenarios or special groups of users, human-computer interaction through lip reading alone can also be configured.
  • Step 4: the acquired voice is processed to obtain voice feature data.
  • Note that, since the processing of lip-reading images and the processing of voice are carried out by two mutually independent parts, the order of steps 3 and 4 above can be adjusted, and the two steps can also run at the same time.
  • Step 5: the fusion recognition module performs fusion recognition on the voice feature data and the lip-motion feature data.
  • Lip reading and speech are two complementary channels: for example, the unit sounds /m/ and /n/, hard to distinguish in the speech signal channel, are visually distinguishable, while /b/, /p/, and /m/, hard to distinguish visually, are distinguishable in the speech signal.
  • Especially under noisy, multi-talker conditions, the auxiliary information of lip-reading images can markedly improve the machine's speech recognition rate.
  • Fusion recognition processing of lip reading and speech is used to correct inconsistencies between lip-reading recognition and speech recognition results.
  • When the information from the two channels disagrees, the trained recognition library can decide which channel's information is more reliable, thereby improving the speech recognition rate.
  • The human-machine interaction device involved in the above method can also be assembled in equipment such as wearable devices (e.g., smart glasses, smart helmets), portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • This embodiment provides a human-computer interaction method, and the method includes the following steps:
  • the microphone in the human-machine interaction device acquires a voice signal, and the camera captures lip-reading images in real time;
  • the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech, where, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
  • After the microphone is controlled to enter the listening state and the camera is controlled to stop working, the microphone continues to detect whether there is a valid voice input; if a valid voice input is detected, it enters the working state and starts the camera.
  • This embodiment provides a human-machine interaction device. As shown in FIG. 1, the interaction device includes the following parts.
  • The microphone 11 acquires a voice signal and starts the camera when a valid voice input is detected.
  • The microphone 11 detects the sound source and converts natural voice into a voltage or current signal; when the voltage or current signal is greater than (or not less than) the set threshold, a valid voice input is considered detected.
  • The camera 12 captures lip-reading images in real time under the control of the microphone 11.
  • The lip-reading image processing module 13 processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data.
  • The voice processing module 14 processes the voice signal to obtain voice feature data.
  • The fusion recognition module 15 fuses the lip-motion feature data and the voice feature data to recognize the input speech.
  • Optionally, a trained model library is used to perform fusion recognition on the lip-motion feature data and the voice feature data.
  • The above device may also adopt the lip-reading feedback mechanism, in which case a control module needs to be added. When the microphone has acquired a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images,
  • this module controls the microphone 11 to enter the listening state and controls the camera 12 to stop working.
  • The lip-reading image processing module 13, the voice processing module 14, and the fusion recognition module 15 are also controlled to stop working, thereby reducing the power consumption of the device.
  • After the microphone 11 enters the listening state, it can detect whether there is a valid voice input; if a valid voice input is detected, it enters the working state and starts the camera 12, the lip-reading image processing module 13, the voice processing module 14, and the fusion recognition module 15 to work normally. Such a scheme not only improves the reliability of speech recognition in a noisy environment but also reduces device power consumption.
  • The control module may also, according to a user instruction, keep the microphone acquiring the voice signal and notify the camera 12 to cancel the capture of lip-reading images. That is, the control module can select the recognition mode according to the user instruction: for example, recognition using the microphone 11 alone, recognition using the camera 12 alone, or both channels at the same time.
  • The above device can be built into any of the following equipment:
  • wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • The microphone 11 and the camera 12 are optionally arranged on the same side of the equipment; for example, the camera 12 is assembled at the microphone of a headset, while the other parts can be assembled on the smart machine.
  • This embodiment provides a human-machine interaction device, including the following parts.
  • A lip-reading image processing module processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data.
  • A voice processing module processes the voice signal to obtain voice feature data.
  • A fusion recognition module fuses the lip-motion feature data and the voice feature data to recognize the input speech.
  • Optionally, a trained model library is used to perform fusion recognition on the lip-motion feature data and the voice feature data.
  • A control module, when the microphone has acquired a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the captured lip-reading images (i.e., no recognizable lip-motion feature data can be obtained), controls the microphone to enter the listening state and controls the camera to stop working.
  • The control module may also, according to a user instruction, keep the microphone acquiring the voice signal and notify the camera to cancel the capture of lip-reading images. That is, the control module can select the recognition mode according to the user instruction: for example, recognition using the microphone alone, recognition using the camera alone, or both channels at the same time.
  • The above microphone can start the camera only when there is a valid voice input, to reduce device power consumption.
  • The microphone detects the sound source and converts natural voice into an electrical signal; when the electrical signal is greater than (or not less than) the set threshold, a valid voice input is considered detected.
  • The above device can be built into any of the following equipment:
  • wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • The microphone and the camera are optionally arranged on the same side of the equipment; for example, the camera is assembled at the microphone of a headset, while the other parts can be assembled on the smart machine.
  • In the technical solution of the present application, lip reading and speech are fused in a noisy environment; compared with the conventional technique of recognition using speech feature data alone, this effectively improves speech recognition and raises the machine recognition rate, and the camera is started only when a valid voice input is confirmed, which greatly reduces device power consumption.

Abstract

A human-machine interaction device and method, a corresponding computer program, and a carrier for the computer program. The method comprises: while a microphone in a human-machine interaction device is acquiring a speech signal, if a valid speech input is detected, a camera in the human-machine interaction device is activated to acquire lip-reading images in real time; the human-machine interaction device processes a sequence formed by the acquired lip-reading images to obtain lip-motion feature data; and the human-machine interaction device fuses the lip-motion feature data with speech feature data extracted from the speech signal to recognize the input speech. The technical solution of the present application effectively improves speech recognition and increases the machine recognition rate.

Description

Human-machine interaction device and method
Technical field
The present invention relates to the field of human-computer interaction technology, and more particularly to a human-machine interaction device and method.
Background art
With the diversification and increasing intelligence of mobile terminal devices, human-computer interaction has also diversified: from traditional key input to touch input, and on to multiple forms of biometric features such as fingerprints, voice, and gestures that intelligent terminals can effectively recognize. Human-computer interaction technology has accordingly been widely studied and applied.
However, related human-machine interaction devices have no very effective solution to noise interference.
Summary of the invention
The technical problem to be solved by the present invention is to provide a human-machine interaction device and method, so as to solve the problem of low reliability of speech recognition in noisy environments.
To solve the above technical problem, the following technical solutions are adopted:
A human-computer interaction method includes:
while the microphone in the human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, the camera in the human-machine interaction device is started to capture lip-reading images in real time;
the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
Optionally, detecting a valid voice input includes:
the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds a set threshold, a valid voice input is determined to exist, where the electrical signal includes a voltage signal or a current signal.
Optionally, after the step of starting the camera in the human-machine interaction device to capture lip-reading images in real time, the method further includes:
while the microphone is acquiring the voice signal, if invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the human-machine interaction device controls the microphone to enter a listening state and controls the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
A human-computer interaction method includes:
the microphone in the human-machine interaction device acquires a voice signal, and the camera captures lip-reading images in real time;
the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech, where, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
Optionally, after the step of controlling the microphone to enter the listening state and controlling the camera to stop working, the method further includes:
when the microphone is in the listening state, if a valid voice input is detected, it enters the working state and starts the camera to capture lip-reading images in real time.
A human-machine interaction device includes a microphone, a camera, a lip-reading image processing module, and a fusion recognition module, where:
the microphone is configured to acquire a voice signal and, when a valid voice input is detected, start the camera;
the camera is configured to capture lip-reading images in real time under the control of the microphone;
the lip-reading image processing module is configured to process the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the fusion recognition module is configured to fuse the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
Optionally, the microphone is configured to detect a valid voice input as follows:
the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds a set threshold, a valid voice input is determined to exist, where the electrical signal includes a voltage signal or a current signal.
Optionally, the device further includes a control module, where:
the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images, control the microphone to enter a listening state and control the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
Optionally, the device is assembled in any one of the following equipment:
wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
A human-machine interaction device includes a microphone and a camera, and further includes a lip-reading image processing module, a fusion recognition module, and a control module, where:
the lip-reading image processing module is configured to process the sequence formed by the lip-reading images captured by the camera to obtain lip-motion feature data;
the fusion recognition module is configured to fuse the lip-motion feature data with voice feature data extracted from the voice signal acquired by the microphone to recognize the input speech; and
the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images, control the microphone to enter a listening state and control the camera to stop working.
Optionally, the microphone is further configured to, after entering the listening state under the control of the control module, enter the working state if a valid voice input is detected, and start the camera to capture lip-reading images in real time.
Optionally, the device is assembled in any one of the following equipment:
wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
A computer program includes program instructions that, when executed by a human-machine interaction device, cause the device to perform the corresponding human-computer interaction method. A carrier carries the computer program.
A computer program includes program instructions that, when executed by a human-machine interaction device, cause the device to perform the corresponding human-computer interaction method. A carrier carries the computer program.
In the technical solution of the present application, lip reading and speech are fused in a noisy environment; compared with the conventional technique of recognition using speech feature data alone, this effectively improves speech recognition and raises the machine recognition rate, and the camera is started only when a valid voice input is confirmed, which greatly reduces device power consumption. An optional scheme further proposes applying this solution in wearable smart devices to enhance the machine's ability to recognize user input, making it convenient for the user and improving the user experience.
Brief description of the drawings
FIG. 1 is a structural diagram of an interaction device implemented according to an embodiment of the present invention.
Preferred embodiments of the invention
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
Embodiment 1
This embodiment provides a human-computer interaction method that fuses lip reading and speech for speech recognition in a noisy environment. The method mainly includes the following operations:
while the microphone in the human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, the camera in the human-machine interaction device is started to capture lip-reading images in real time;
the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
Here, a lip-reading image, which may also be called a lip-motion image, is an image of the changing movement of the speaker's lips while speaking. Over a period of time, the lip-reading images constitute an image sequence, also called a lip-reading video. The sequence formed by the lip-reading images refers to the lip-reading video over a period of time.
The feature parameters obtained by applying specific operations to the lip-motion image sequence, i.e., the lip-motion feature data, are common knowledge to those skilled in the art and are not described further here.
The voice feature data is obtained by processing the voice signal, and there are many ways to represent it; for example, the spectral parameters of the speech can serve as one kind of feature data. Voice feature data processing can be performed once the voice signal is acquired, and is executed by the voice processing module. Voice feature data processing and lip-reading image processing are performed independently of each other.
During the acquisition of the voice signal by the microphone, a valid voice input is detected as follows:
the microphone detects the sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds the set threshold, a valid voice input is determined to exist. In this embodiment, the electrical signal involved includes a current signal or a voltage signal.
In addition, some optional schemes also propose a feedback mechanism for lip-reading processing: when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera (that is, the user's lips show no movement, and the user may not be speaking), the human-machine interaction device controls the microphone to enter a listening state and controls the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally. This mechanism targets situations of heavy noise: by combining the user's lip-motion features, it accurately distinguishes user speech from noise and, when noise is identified, stops the camera to improve equipment utilization.
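To make the control flow concrete, here is a minimal sketch of this listen/work cycle as a two-state machine in Python. The `mic`, `camera`, and `lip_module` objects and their method names are invented for illustration; the patent defines no such interface, and treating `None` as "invalid lip-motion feature data" is likewise an assumption.

```python
from enum import Enum, auto

class DeviceState(Enum):
    LISTENING = auto()  # only the microphone listens; the camera is stopped
    WORKING = auto()    # microphone and camera both active

def control_loop(mic, camera, lip_module):
    """Feedback mechanism: fall back to listening when lip data is invalid."""
    state = DeviceState.LISTENING
    while True:
        if state is DeviceState.LISTENING:
            if mic.detect_valid_voice():       # valid voice input detected again
                camera.start()                 # restart the camera
                state = DeviceState.WORKING
        else:
            frames = camera.capture_sequence()  # lip-reading image sequence
            if lip_module.extract_features(frames) is None:  # invalid lip data
                camera.stop()                  # likely noise, not the user
                state = DeviceState.LISTENING
```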
Correspondingly, the above human-machine interaction device may also, according to a user instruction, keep the microphone acquiring the voice signal while notifying the camera to cancel the capture of lip-reading images, thereby accommodating the user's choice of recognition mode in special scenarios and improving the user experience.
The implementation of the above method is described below with reference to a specific application scenario.
For example, a user interacts by voice with a smart device through a headset. Since machine recognition of human speech degrades noticeably in noisy environments or when the user's intonation is problematic, recognition of lip-reading images can be used to further improve the accuracy of speech recognition, helping the machine better understand the user's spoken expression and execute the user's voice commands. Optionally, the human-computer interaction process is as follows:
Step 1: the microphone acquires a voice signal and, when there is a valid voice input, starts the camera.
The microphone mainly uses a sound pressure sensor to detect the sound source and convert natural voice into an electrical signal. To distinguish background sound, a threshold for the sound pressure sensor's electrical signal can be set to decide whether there is a valid voice input. When the converted sound pressure sensor signal is greater than (or not less than) the set threshold, a valid voice input is determined, and the camera is notified to start and begin normal operation.
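As a rough sketch of this threshold test: assuming the sensor's output has been digitized into frames of samples normalized to [-1, 1], the frame's RMS energy can stand in for the "converted electrical signal". The threshold value below is illustrative; the patent only requires some set threshold.

```python
import numpy as np

VOICE_THRESHOLD = 0.02  # illustrative; the patent leaves the value to the implementer

def has_valid_voice_input(samples: np.ndarray) -> bool:
    """Compare the frame's RMS energy against the set threshold."""
    rms = float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))
    return rms > VOICE_THRESHOLD
```

A `True` result here would be the event that notifies the camera to start.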
Moreover, only when the microphone detects a valid voice input is the camera notified to work and capture lip-reading images, which reduces device power consumption.
Step 2: the camera captures lip-reading images.
Lip-reading images are usually captured by first performing face recognition on the image sequence to determine the position of the lips, and then acquiring the lip-motion data. In practice, a directional microphone may be chosen with the camera built into the microphone (or the microphone built into the camera); in a headset, for example, the camera sits at the microphone and points directly at the user's lips during use, making it easy to capture lip images.
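As a sketch of "face recognition first, then lip position", the snippet below crops a lip region with OpenCV's stock Haar face detector plus the heuristic that the mouth occupies roughly the lower third of the face box; the detector choice and crop ratios are illustrative assumptions, not part of the patent.

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_roi(frame):
    """Return the lip region of the first detected face, or None if no face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Heuristic: the mouth sits in roughly the lower third of the face box.
    return gray[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
```

In the headset configuration described above, where the camera already points at the lips, this detection step could be skipped and the whole frame treated as the lip region.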
Step 3: the sequence formed by the captured lip-reading images is processed to obtain lip-motion feature data.
This mainly involves lip localization and tracking on the sequence formed by the lip-reading images, followed by lip-motion feature extraction; the lip-motion feature data is finally output to the fusion recognition module.
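One simple stand-in for lip-motion feature extraction is the inter-frame difference energy of the lip region; a sequence whose difference energy never rises above a floor corresponds to the "invalid lip-motion feature data" case used by the feedback mechanism. Real systems use richer descriptors (lip contours, appearance models), so this is purely illustrative, and `motion_floor` is an assumed parameter.

```python
import numpy as np

def lip_motion_features(roi_sequence, motion_floor=1.0):
    """Mean absolute inter-frame difference per step; None if the lips barely move.

    `roi_sequence` is a list of equally sized grayscale lip-region images;
    `motion_floor` is an illustrative threshold below which the sequence is
    treated as invalid (no lip movement, so the user is probably not speaking).
    """
    diffs = [np.mean(np.abs(a.astype(np.int16) - b.astype(np.int16)))
             for a, b in zip(roi_sequence, roi_sequence[1:])]
    features = np.asarray(diffs, dtype=np.float64)
    if features.size == 0 or features.max() < motion_floor:
        return None  # invalid lip-motion feature data
    return features
```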
In addition, through user configuration, a feedback mechanism for lip-reading processing can be set. For example, in a noisy environment or a cross-talker scenario, the microphone may pick up other sound signals while the user is not speaking, causing the camera to start capturing lip images; processing the lip-reading images then extracts no lip-motion features. In this case, the human-machine interaction device can notify the camera, the voice processing module, the lip-reading processing module, and the fusion recognition module to stop working, leaving only the microphone in the listening state.
In certain special scenarios, the feedback mechanism for lip-reading processing can also be cancelled; for example, when the camera cannot effectively capture lip-reading data, human-computer interaction proceeds by voice alone, so that lip-reading recognition results do not interfere with speech recognition. Alternatively, for special scenarios or special groups of users, human-computer interaction through lip reading alone can also be configured.
Step 4: the acquired voice is processed to obtain voice feature data.
It should be noted that, since the processing of lip-reading images and the processing of voice in the human-machine interaction device are carried out by two mutually independent parts, the order of steps 3 and 4 above can be adjusted, and the two steps can also run at the same time.
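On the speech side, the text only notes that spectral parameters can serve as feature data. A numpy-only sketch of short-time log-magnitude spectra follows; the 16 kHz sampling rate and the 25 ms / 10 ms framing implied by the constants are assumptions, not values from the patent.

```python
import numpy as np

def log_spectral_features(samples, frame_len=400, hop=160):
    """Short-time log-magnitude spectrum: one feature vector per frame.

    With 16 kHz audio (an assumption), 400/160 samples correspond to a
    25 ms window with a 10 ms hop.
    """
    window = np.hanning(frame_len)
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    return np.array([np.log1p(np.abs(np.fft.rfft(f * window))) for f in frames])
```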
Step 5: the fusion recognition module performs fusion recognition on the voice feature data and the lip-motion feature data.
Lip reading and speech are two complementary channels: for example, the unit sounds /m/ and /n/, hard to distinguish in the speech signal channel, are visually distinguishable, while /b/, /p/, and /m/, hard to distinguish visually, are distinguishable in the speech signal. Especially under noisy, multi-talker conditions, the auxiliary information of lip-reading images can markedly improve the machine's speech recognition rate. Fusion recognition processing of lip reading and speech is used to correct inconsistencies between lip-reading recognition and speech recognition results: when the information from the two channels disagrees, a trained recognition library can decide which channel's information is more reliable, thereby improving the speech recognition rate.
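The "which channel is more reliable" decision can be sketched as weighted late fusion of per-class scores from the two recognizers. Everything here — the shared class inventory, log-probability scores, and the fixed weight — is an illustrative assumption; the patent does not prescribe a fusion algorithm, and in practice the weight would come from the trained recognition library (e.g., lowered as measured noise rises so that the lip-reading channel dominates).

```python
import numpy as np

def fuse_scores(audio_logp: np.ndarray, visual_logp: np.ndarray,
                audio_weight: float = 0.7) -> int:
    """Late fusion: weighted sum of per-class log-probabilities.

    `audio_logp` and `visual_logp` score the same class inventory
    (e.g., phonemes or voice commands); the argmax of the weighted
    sum is the fused decision.
    """
    fused = audio_weight * audio_logp + (1.0 - audio_weight) * visual_logp
    return int(np.argmax(fused))
```

This captures the complementarity noted above: where the audio scores confuse /m/ and /n/, the visual scores tip the fused decision, and conversely for /b/, /p/, and /m/.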
The human-machine interaction device involved in the above method can also be assembled in equipment such as wearable devices (e.g., smart glasses, smart helmets), portable devices, smart terminals, smart home appliances, and security monitoring equipment.
Embodiment 2
This embodiment provides a human-computer interaction method, which includes the following steps:
the microphone in the human-machine interaction device acquires a voice signal, and the camera captures lip-reading images in real time;
the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech, where, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
In an optional scheme, after the microphone is controlled to enter the listening state and the camera is controlled to stop working, the microphone continues to detect whether there is a valid voice input; if a valid voice input is detected, it enters the working state and starts the camera.
Embodiment 3
This embodiment provides a human-machine interaction device which, as shown in FIG. 1, includes the following parts.
The microphone 11 acquires a voice signal and starts the camera when a valid voice input is detected.
Optionally, the microphone 11 detects the sound source and converts natural voice into a voltage or current signal; when the voltage or current signal is greater than (or not less than) the set threshold, a valid voice input is considered detected.
The camera 12 captures lip-reading images in real time under the control of the microphone 11.
Optionally, the camera receives a control signal from the microphone 11 and images the lips synchronously when the microphone 11 detects a valid sound source.
The lip-reading image processing module 13 processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data.
Optionally, it performs lip localization and tracking on the lip-reading images and extracts the lip-motion feature data.
The voice processing module 14 processes the voice signal to obtain voice feature data.
The fusion recognition module 15 fuses the lip-motion feature data and the voice feature data to recognize the input speech.
Optionally, a trained model library is used to perform fusion recognition on the lip-motion feature data and the voice feature data.
In addition, the above device may adopt the lip-reading feedback mechanism, in which case a control module needs to be added. When the microphone has acquired a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images (which can also be understood as being unable to extract lip-motion feature data from the sequence), this module controls the microphone 11 to enter the listening state and controls the camera 12 to stop working. At the same time, the lip-reading image processing module 13, the voice processing module 14, and the fusion recognition module 15 are also controlled to stop working, thereby reducing the power consumption of the device.
Optionally, after the microphone 11 enters the listening state, it can detect whether there is a valid voice input; if a valid voice input is detected, it enters the working state and starts the camera 12, the lip-reading image processing module 13, the voice processing module 14, and the fusion recognition module 15 to work normally. This scheme not only improves the reliability of speech recognition in a noisy environment but also reduces device power consumption.
In addition, the above control module may also, according to a user instruction, keep the microphone acquiring the voice signal and notify the camera 12 to cancel the capture of lip-reading images. That is, the control module can select the recognition mode according to the user instruction: for example, recognition using the microphone 11 alone, recognition using the camera 12 alone, or both channels at the same time.
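A structural sketch of how the parts of FIG. 1 and the control module's mode selection could fit together; the class and method names are invented for illustration and are not defined by the patent.

```python
from enum import Enum, auto

class RecognitionMode(Enum):
    VOICE_ONLY = auto()  # microphone 11 alone
    LIP_ONLY = auto()    # camera 12 alone (lip reading)
    FUSED = auto()       # both channels, merged by fusion module 15

class InteractionDevice:
    def __init__(self, mic, camera, lip_module, voice_module, fusion_module):
        self.mic, self.camera = mic, camera                             # parts 11, 12
        self.lip_module, self.voice_module = lip_module, voice_module  # parts 13, 14
        self.fusion_module = fusion_module                             # part 15
        self.mode = RecognitionMode.FUSED

    def set_mode(self, mode: RecognitionMode) -> None:
        """Control module: select the recognition mode per user instruction."""
        self.mode = mode
        if mode is RecognitionMode.VOICE_ONLY:
            self.camera.stop()  # cancel lip-reading image capture
        else:
            self.camera.start()
```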
In actual use, the above device can be built into any of the following equipment:
wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
Here, the microphone 11 and the camera 12 are optionally arranged on the same side of the equipment; for example, the camera 12 is assembled at the microphone of a headset, while the other parts can be assembled on the smart machine.
Embodiment 4
This embodiment provides a human-machine interaction device, including the following parts.
A microphone acquires a voice signal.
A camera captures lip-reading images in real time.
A lip-reading image processing module processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data.
Optionally, it performs lip localization and tracking on the lip-reading images and extracts the lip-motion feature data.
A voice processing module processes the voice signal to obtain voice feature data.
A fusion recognition module performs fusion recognition on the lip-motion feature data and the voice feature data to recognize the input speech.
Optionally, a trained model library is used to perform fusion recognition on the lip-motion feature data and the voice feature data.
A control module, when the microphone has acquired a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the captured lip-reading images (i.e., no recognizable lip-motion feature data can be obtained), controls the microphone to enter the listening state and controls the camera to stop working.
In addition, the above control module may also, according to a user instruction, keep the microphone acquiring the voice signal and notify the camera to cancel the capture of lip-reading images. That is, the control module can select the recognition mode according to the user instruction: for example, recognition using the microphone alone, recognition using the camera alone, or both channels at the same time.
Preferably, the above microphone can start the camera only when there is a valid voice input, to reduce device power consumption. Optionally, the microphone detects the sound source and converts natural voice into an electrical signal; when the electrical signal is greater than (or not less than) the set threshold, a valid voice input is considered detected.
In actual use, the above device can be built into any of the following equipment:
wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
Here, the microphone and the camera are optionally arranged on the same side of the equipment; for example, the camera is assembled at the microphone of a headset, while the other parts can be assembled on the smart machine.
Those of ordinary skill in the art will understand that all or some of the steps of the above method can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments can also be implemented using one or more integrated circuits. Correspondingly, each module/unit in the above embodiments can be implemented in the form of hardware or in the form of a software functional module. The present application is not limited to any specific combination of hardware and software.
The above are only preferred examples of the present invention and are not intended to limit the protection scope of the present invention. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Industrial applicability
In the technical solution of the present application, lip reading and speech are fused in a noisy environment; compared with the conventional technique of recognition using speech feature data alone, this effectively improves speech recognition and raises the machine recognition rate, and the camera is started only when a valid voice input is confirmed, which greatly reduces device power consumption. An optional scheme further proposes applying this solution in wearable smart devices to enhance the machine's ability to recognize user input, making it convenient for the user and improving the user experience. The present invention therefore has strong industrial applicability.

Claims (16)

  1. A human-computer interaction method, comprising:
    while a microphone in a human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, activating a camera in the human-machine interaction device to capture lip-reading images in real time;
    processing, by the human-machine interaction device, a sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
    fusing, by the human-machine interaction device, the lip-motion feature data with voice feature data extracted from the voice signal to recognize an input speech.
  2. The human-computer interaction method according to claim 1, wherein the step of detecting a valid voice input comprises:
    the microphone detecting a sound source and converting the natural voice of the detected sound source into an electrical signal, and, when the converted electrical signal exceeds a set threshold, determining that there is a valid voice input, wherein the electrical signal comprises a voltage signal or a current signal.
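A minimal sketch of the threshold test in claim 2, assuming the converted electrical signal is available as a digitized, normalized sample buffer and using peak amplitude as the detection statistic; the threshold value and the amplitude measure are illustrative assumptions, not specified by the claims.

```python
import numpy as np

VALID_INPUT_THRESHOLD = 0.05  # assumed threshold on the normalized amplitude

def has_valid_voice_input(samples: np.ndarray,
                          threshold: float = VALID_INPUT_THRESHOLD) -> bool:
    """Return True when the converted signal exceeds the set threshold.

    `samples` stands in for the voltage or current signal produced by the
    microphone, digitized and normalized to [-1, 1].
    """
    return float(np.max(np.abs(samples))) > threshold
```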
  3. The human-computer interaction method according to claim 1 or 2, wherein, after the step of activating the camera in the human-machine interaction device to acquire lip-reading images in real time, the method further comprises:
    while the microphone is acquiring the voice signal, if invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images acquired by the camera, the human-machine interaction device controlling the microphone to enter a listening state and controlling the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
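The control behaviour of claims 3 to 5 can be read as a two-state machine that alternates between a low-power listening state and a working state. The sketch below is one possible reading; the state names, polling structure, and the `mic`/`camera` device interfaces are all assumed for illustration.

```python
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()   # microphone listens; camera stopped
    WORKING = auto()     # microphone and camera both active

def control_step(state, mic, camera, lip_features_valid):
    """One iteration of the assumed control loop.

    `mic` and `camera` are hypothetical device handles; `lip_features_valid`
    reports whether the lip-reading pipeline produced usable feature data.
    """
    if state is State.LISTENING:
        if mic.valid_voice_input():       # valid speech detected again
            camera.start()
            return State.WORKING
    elif state is State.WORKING and not lip_features_valid:
        camera.stop()                     # invalid lip-motion features
        return State.LISTENING
    return state
```

Running `control_step` once per captured frame reproduces the behaviour described in claims 3 and 5: the camera stops as soon as the lip-motion features become invalid, and restarts only after the microphone again detects a valid voice input.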
  4. A human-computer interaction method, the method comprising:
    a microphone in a human-machine interaction device acquiring a voice signal, and a camera acquiring lip-reading images in real time;
    the human-machine interaction device processing the sequence formed by the acquired lip-reading images to obtain lip-motion feature data;
    the human-machine interaction device fusing the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input voice, wherein, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images acquired by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
  5. The human-computer interaction method according to claim 4, wherein, after the step of controlling the microphone to enter a listening state and controlling the camera to stop working, the method further comprises:
    when the microphone is in the listening state, if a valid voice input is detected, the microphone entering a working state and activating the camera to acquire lip-reading images in real time.
  6. A human-machine interaction device, comprising a microphone, a camera, a lip-reading image processing module, and a fusion recognition module, wherein:
    the microphone is configured to acquire a voice signal and, when a valid voice input is detected, activate the camera;
    the camera is configured to acquire lip-reading images in real time under the control of the microphone;
    the lip-reading image processing module is configured to process the sequence formed by the acquired lip-reading images to obtain lip-motion feature data;
    the fusion recognition module is configured to fuse the lip-motion feature data with the voice feature data extracted from the voice signal to recognize the input voice.
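One way to picture the module layout of claim 6 is the composition below; the class name, constructor arguments, and method signatures are hypothetical, chosen only to mirror the configured responsibility of each module.

```python
class HumanMachineInteractionDevice:
    """Assumed composition of the four modules named in claim 6."""

    def __init__(self, microphone, camera, lip_processor, fusion_recognizer):
        self.microphone = microphone                  # acquires voice, triggers the camera
        self.camera = camera                          # captures lip-reading images
        self.lip_processor = lip_processor            # image sequence -> lip-motion features
        self.fusion_recognizer = fusion_recognizer    # fuses both feature streams

    def recognize(self):
        voice_signal = self.microphone.acquire()
        if not self.microphone.valid_voice_input():
            return None                               # no valid input; camera stays off
        images = self.camera.capture_sequence()       # started under microphone control
        lip_features = self.lip_processor.process(images)
        return self.fusion_recognizer.recognize(voice_signal, lip_features)
```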
  7. The human-machine interaction device according to claim 6, wherein the microphone is configured to detect a valid voice input in the following manner:
    the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal, and, when the converted electrical signal exceeds a set threshold, determines that there is a valid voice input, wherein the electrical signal comprises a voltage signal or a current signal.
  8. The human-machine interaction device according to claim 6 or 7, further comprising a control module, wherein:
    the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the acquired lip-reading images, control the microphone to enter a listening state and control the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
  9. The human-machine interaction device according to claim 8, wherein the device is fitted in any one of the following kinds of equipment:
    wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  10. A human-machine interaction device, comprising a microphone and a camera, and further comprising a lip-reading image processing module, a fusion recognition module, and a control module, wherein:
    the lip-reading image processing module is configured to process the sequence formed by the lip-reading images acquired by the camera to obtain lip-motion feature data;
    the fusion recognition module is configured to fuse the lip-motion feature data with the voice feature data extracted from the voice signal acquired by the microphone to recognize the input voice;
    the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the acquired lip-reading images, control the microphone to enter a listening state and control the camera to stop working.
  11. The device according to claim 10, wherein
    the microphone is further configured to, after entering the listening state under the control of the control module, enter a working state if a valid voice input is detected, and activate the camera to acquire lip-reading images in real time.
  12. The device according to claim 10 or 11, wherein the device is fitted in any one of the following kinds of equipment:
    wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  13. A computer program, comprising program instructions that, when executed by a human-machine interaction device, cause the human-machine interaction device to perform the human-computer interaction method according to any one of claims 1 to 3.
  14. A carrier carrying the computer program of claim 13.
  15. A computer program, comprising program instructions that, when executed by a human-machine interaction device, cause the human-machine interaction device to perform the human-computer interaction method according to claim 4 or 5.
  16. A carrier carrying the computer program of claim 15.
PCT/CN2014/089020 2014-09-03 2014-10-21 Human-machine interaction device and method WO2015154419A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410446967.2 2014-09-03
CN201410446967.2A CN105389097A (en) 2014-09-03 2014-09-03 Man-machine interaction device and method

Publications (1)

Publication Number Publication Date
WO2015154419A1 (en)

Family

ID=54287187

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/089020 WO2015154419A1 (en) 2014-09-03 2014-10-21 Human-machine interaction device and method

Country Status (2)

Country Link
CN (1) CN105389097A (en)
WO (1) WO2015154419A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452381B (en) * 2016-05-30 2020-12-29 中国移动通信有限公司研究院 Multimedia voice recognition device and method
CN108227903B (en) * 2016-12-21 2020-01-10 深圳市掌网科技股份有限公司 Virtual reality language interaction system and method
CN107293300A (en) * 2017-08-01 2017-10-24 珠海市魅族科技有限公司 Audio recognition method and device, computer installation and readable storage medium storing program for executing
CN107679449B (en) * 2017-08-17 2018-08-03 平安科技(深圳)有限公司 Lip motion method for catching, device and storage medium
US11836592B2 (en) 2017-12-15 2023-12-05 International Business Machines Corporation Communication model for cognitive systems
CN108154140A (en) 2018-01-22 2018-06-12 北京百度网讯科技有限公司 Voice awakening method, device, equipment and computer-readable medium based on lip reading
CN111326152A (en) * 2018-12-17 2020-06-23 南京人工智能高等研究院有限公司 Voice control method and device
CN111868823A (en) * 2019-02-27 2020-10-30 华为技术有限公司 Sound source separation method, device and equipment
CN110111783A (en) * 2019-04-10 2019-08-09 天津大学 A kind of multi-modal audio recognition method based on deep neural network
CN110335600A (en) * 2019-07-09 2019-10-15 四川长虹电器股份有限公司 The multi-modal exchange method and system of household appliance
CN110765868A (en) * 2019-09-18 2020-02-07 平安科技(深圳)有限公司 Lip reading model generation method, device, equipment and storage medium
CN111063354B (en) * 2019-10-30 2022-03-25 云知声智能科技股份有限公司 Man-machine interaction method and device
CN111190484B (en) * 2019-12-25 2023-07-21 中国人民解放军军事科学院国防科技创新研究院 Multi-mode interaction system and method
CN111312217A (en) * 2020-02-28 2020-06-19 科大讯飞股份有限公司 Voice recognition method, device, equipment and storage medium
CN111539270A (en) * 2020-04-10 2020-08-14 贵州合谷信息科技有限公司 High-recognition-rate micro-expression recognition method for voice input method
CN112908334A (en) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 Hearing aid method, device and equipment based on directional pickup
CN114708642B (en) * 2022-05-24 2022-11-18 成都锦城学院 Business English simulation training device, system, method and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100189305A1 (en) * 2009-01-23 2010-07-29 Eldon Technology Limited Systems and methods for lip reading control of a media device
CN101937268A (en) * 2009-06-30 2011-01-05 索尼公司 Device control based on the identification of vision lip
CN102023703A (en) * 2009-09-22 2011-04-20 现代自动车株式会社 Combined lip reading and voice recognition multimodal interface system
CN103456303A (en) * 2013-08-08 2013-12-18 四川长虹电器股份有限公司 Method for controlling voice and intelligent air-conditionier system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7860718B2 (en) * 2005-12-08 2010-12-28 Electronics And Telecommunications Research Institute Apparatus and method for speech segment detection and system for speech recognition
CN102324035A (en) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 Method and system of applying lip posture assisted speech recognition technique to vehicle navigation
CN103745723A (en) * 2014-01-13 2014-04-23 苏州思必驰信息科技有限公司 Method and device for identifying audio signal

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319912A (en) * 2018-01-30 2018-07-24 歌尔科技有限公司 A kind of lip reading recognition methods, device, system and intelligent glasses
CN112053690A (en) * 2020-09-22 2020-12-08 湖南大学 Cross-modal multi-feature fusion audio and video voice recognition method and system
CN112053690B (en) * 2020-09-22 2023-12-29 湖南大学 Cross-mode multi-feature fusion audio/video voice recognition method and system

Also Published As

Publication number Publication date
CN105389097A (en) 2016-03-09

Similar Documents

Publication Publication Date Title
WO2015154419A1 (en) Human-machine interaction device and method
US9779725B2 (en) Voice wakeup detecting device and method
JP6230726B2 (en) Speech recognition apparatus and speech recognition method
US10109300B2 (en) System and method for enhancing speech activity detection using facial feature detection
KR102216048B1 (en) Apparatus and method for recognizing voice commend
JP6504808B2 (en) Imaging device, setting method of voice command function, computer program, and storage medium
WO2018049782A1 (en) Household appliance control method, device and system, and intelligent air conditioner
US11699442B2 (en) Methods and systems for speech detection
US20150279369A1 (en) Display apparatus and user interaction method thereof
US11423896B2 (en) Gaze-initiated voice control
US20150088515A1 (en) Primary speaker identification from audio and video data
WO2021184549A1 (en) Monaural earphone, intelligent electronic device, method and computer readable medium
US20180009118A1 (en) Robot control device, robot, robot control method, and program recording medium
CN110730115B (en) Voice control method and device, terminal and storage medium
US10991372B2 (en) Method and apparatus for activating device in response to detecting change in user head feature, and computer readable storage medium
KR20130091278A (en) Two mode agc for single and multiple speakers
CN111131601B (en) Audio control method, electronic equipment, chip and computer storage medium
US9516429B2 (en) Hearing aid and method for controlling hearing aid
WO2017219450A1 (en) Information processing method and device, and mobile terminal
JP5797009B2 (en) Voice recognition apparatus, robot, and voice recognition method
KR20210011146A (en) Apparatus for providing a service based on a non-voice wake-up signal and method thereof
WO2022199405A1 (en) Voice control method and apparatus
CN113643707A (en) Identity verification method and device and electronic equipment
CN104423992A (en) Speech recognition startup method for display
KR102265874B1 (en) Method and Apparatus for Distinguishing User based on Multimodal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14888851

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14888851

Country of ref document: EP

Kind code of ref document: A1