WO2015154419A1 - Human-machine interaction device and method - Google Patents

Human-machine interaction device and method

Info

Publication number
WO2015154419A1
Authority
WO
WIPO (PCT)
Prior art keywords
lip
human
microphone
camera
voice
Prior art date
Application number
PCT/CN2014/089020
Other languages
French (fr)
Chinese (zh)
Inventor
陈军 (Chen Jun)
姚立哲 (Yao Lizhe)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2015154419A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/24 — Speech recognition using non-acoustical features
    • G10L15/25 — Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Definitions

  • The present invention relates to the field of human-computer interaction technology, and more particularly to a human-machine interaction device and method.
  • The technical problem to be solved by the present invention is to provide a human-machine interaction device and method, so as to solve the problem of low reliability of speech recognition in noisy environments.
  • A human-computer interaction method includes:
  • while the microphone in the human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, the camera in the human-machine interaction device is started to capture lip-reading images in real time;
  • the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
  • Optionally, detecting a valid voice input includes:
  • the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds a set threshold, a valid voice input is determined to exist, where the electrical signal includes a voltage signal or a current signal.
  • Optionally, after the step of starting the camera in the human-machine interaction device to capture lip-reading images in real time, the method further includes:
  • while the microphone is acquiring a voice signal, if invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the human-machine interaction device controls the microphone to enter a listening state and controls the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
  • A human-computer interaction method includes:
  • the microphone in the human-machine interaction device acquires a voice signal, and the camera captures lip-reading images in real time;
  • the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech, where, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
  • Optionally, after the microphone is controlled to enter the listening state and the camera is controlled to stop working, the method further includes:
  • when the microphone is in the listening state, if a valid voice input is detected, the microphone enters the working state and starts the camera to capture lip-reading images in real time.
  • A human-machine interaction device includes a microphone, a camera, a lip-reading image processing module, and a fusion recognition module, where:
  • the microphone is configured to acquire a voice signal and, when a valid voice input is detected, start the camera;
  • the camera is configured to capture lip-reading images in real time under the control of the microphone;
  • the lip-reading image processing module is configured to process the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the fusion recognition module is configured to fuse the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
  • Optionally, the microphone is configured to detect a valid voice input as follows:
  • the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds a set threshold, a valid voice input is determined to exist, where the electrical signal includes a voltage signal or a current signal.
  • Optionally, the apparatus further includes a control module, where:
  • the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images, control the microphone to enter a listening state and control the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
  • Optionally, the device is assembled in any one of the following equipment:
  • wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • A human-machine interaction device includes a microphone and a camera, and further includes a lip-reading image processing module, a fusion recognition module, and a control module, where:
  • the lip-reading image processing module is configured to process the sequence formed by the lip-reading images captured by the camera to obtain lip-motion feature data;
  • the fusion recognition module is configured to fuse the lip-motion feature data with voice feature data extracted from the voice signal acquired by the microphone to recognize the input speech; and
  • the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images, control the microphone to enter a listening state and control the camera to stop working.
  • Optionally, the microphone is further configured to, after entering the listening state under the control of the control module, enter the working state if a valid voice input is detected, and start the camera to capture lip-reading images in real time.
  • Optionally, the device is assembled in any one of the following equipment:
  • wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • A computer program includes program instructions that, when executed by a human-machine interaction device, cause the device to perform the corresponding human-computer interaction method.
  • A carrier carries the computer program.
  • A computer program includes program instructions that, when executed by a human-machine interaction device, cause the device to perform the corresponding human-computer interaction method.
  • A carrier carries the computer program.
  • In the technical solution of the present application, lip reading and speech are fused in a noisy environment; compared with the conventional technique of recognition using speech feature data alone, this effectively improves speech recognition and raises the machine recognition rate, and the camera is started only when a valid voice input is confirmed, which greatly reduces device power consumption.
  • FIG. 1 is a structural diagram of an interaction apparatus implemented according to an embodiment of the present invention.
  • This embodiment provides a human-computer interaction method that fuses lip reading and speech to perform speech recognition in a noisy environment.
  • The method mainly includes the following operations:
  • while the microphone in the human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, the camera in the human-machine interaction device is started to capture lip-reading images in real time;
  • the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
  • The lip-reading image, which may also be called a lip-motion image, refers to an image of the changing movement of the speaker's lips while speaking.
  • Over a period of time, the lip-reading images constitute an image sequence, also called a lip-reading video.
  • The sequence formed by the lip-reading images refers to the lip-reading video over a period of time.
  • The feature parameters obtained by applying specific operations to the lip-motion image sequence, i.e., the lip-motion feature data, are common knowledge to those skilled in the art and are not described further here.
  • The speech feature data is obtained by processing the speech signal and can be represented in many ways; for example, the spectral parameters of the speech can serve as one kind of feature data.
  • Speech feature data processing can be performed once the speech signal is acquired, and is executed by the speech processing module; speech feature data processing and lip-reading image processing are performed independently of each other.
  • During the acquisition of the voice signal by the microphone, the process of detecting a valid voice input is as follows:
  • the microphone detects the sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds the set threshold, a valid voice input is determined to exist.
  • The electrical signal involved includes a current signal or a voltage signal.
  • A feedback mechanism for lip-reading processing is also proposed: when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera (that is, the user's lips show no movement, and the user may not be speaking),
  • the human-machine interaction device controls the microphone to enter the listening state and controls the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
  • This mechanism targets situations of heavy noise: by combining the user's lip-motion features, it accurately distinguishes user speech from noise and, when noise is identified, stops the camera to improve equipment utilization.
  • Correspondingly, the human-machine interaction device may also, according to a user instruction, keep the microphone acquiring the voice signal while notifying the camera to cancel the capture of lip-reading images, thereby accommodating the user's choice of recognition mode in special scenarios and improving the user experience.
  • For example, a user interacts by voice with a smart device through a headset. Since machine recognition of human speech degrades noticeably in noisy environments or when the user's intonation is problematic, recognition of lip-reading images can be used to further improve the accuracy of speech recognition, helping the machine better understand the user's spoken expression and execute the user's voice commands.
  • Optionally, the human-computer interaction process is as follows:
  • Step 1: the microphone acquires a voice signal and, when there is a valid voice input, starts the camera.
  • The microphone mainly uses a sound pressure sensor to detect the sound source and convert natural voice into an electrical signal.
  • To distinguish background sound, a threshold for the sound pressure sensor's electrical signal can be set to decide whether there is a valid voice input.
  • When the converted sound pressure sensor signal is greater than (or not less than) the set threshold, a valid voice input is determined, and the camera is notified to start and begin normal operation.
  • Only when the microphone detects a valid voice input is the camera notified to work and capture lip-reading images, which reduces device power consumption.
  • Step 2: the camera captures lip-reading images.
  • Lip-reading images are usually captured by first performing face recognition on the image sequence to determine the position of the lips, and then acquiring the lip-motion data.
  • In practice, a directional microphone may be chosen with the camera built into the microphone (or the microphone built into the camera); in a headset, for example, the camera sits at the microphone and points directly at the user's lips during use, making it easy to capture lip images.
  • Step 3: the sequence formed by the captured lip-reading images is processed to obtain lip-motion feature data.
  • Through user configuration, a feedback mechanism for lip-reading processing can be set. For example, in a noisy environment or a cross-talker scenario, the microphone may pick up other sound signals while the user is not speaking, causing the camera to start capturing lip images; processing the lip-reading images then extracts no lip-motion features. In this case, the human-machine interaction device can notify the camera, the voice processing module, the lip-reading processing module, and the fusion recognition module to stop working, leaving only the microphone in the listening state.
  • In certain special scenarios, the feedback mechanism can also be cancelled; for example, when the camera cannot effectively capture lip-reading data, human-computer interaction proceeds by voice alone, so that lip-reading recognition results do not interfere with speech recognition. Alternatively, for special scenarios or special groups of users, human-computer interaction through lip reading alone can also be configured.
  • Step 4: the acquired voice is processed to obtain voice feature data.
  • Note that, since the processing of lip-reading images and the processing of voice are carried out by two mutually independent parts, the order of steps 3 and 4 above can be adjusted, and the two steps can also run at the same time.
  • Step 5: the fusion recognition module performs fusion recognition on the voice feature data and the lip-motion feature data.
  • Lip reading and speech are two complementary channels: for example, the unit sounds /m/ and /n/, hard to distinguish in the speech signal channel, are visually distinguishable, while /b/, /p/, and /m/, hard to distinguish visually, are distinguishable in the speech signal.
  • Especially under noisy, multi-talker conditions, the auxiliary information of lip-reading images can markedly improve the machine's speech recognition rate.
  • Fusion recognition processing of lip reading and speech is used to correct inconsistencies between lip-reading recognition and speech recognition results.
  • When the information from the two channels disagrees, the trained recognition library can decide which channel's information is more reliable, thereby improving the speech recognition rate.
  • The human-machine interaction device involved in the above method can also be assembled in equipment such as wearable devices (e.g., smart glasses, smart helmets), portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • This embodiment provides a human-computer interaction method, and the method includes the following steps:
  • the microphone in the human-machine interaction device acquires a voice signal, and the camera captures lip-reading images in real time;
  • the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
  • the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech, where, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
  • After the microphone is controlled to enter the listening state and the camera is controlled to stop working, the microphone continues to detect whether there is a valid voice input; if a valid voice input is detected, it enters the working state and starts the camera.
  • This embodiment provides a human-machine interaction device. As shown in FIG. 1, the interaction device includes the following parts.
  • The microphone 11 acquires a voice signal and starts the camera when a valid voice input is detected.
  • The microphone 11 detects the sound source and converts natural voice into a voltage or current signal; when the voltage or current signal is greater than (or not less than) the set threshold, a valid voice input is considered detected.
  • The camera 12 captures lip-reading images in real time under the control of the microphone 11.
  • The lip-reading image processing module 13 processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data.
  • The voice processing module 14 processes the voice signal to obtain voice feature data.
  • The fusion recognition module 15 fuses the lip-motion feature data and the voice feature data to recognize the input speech.
  • Optionally, a trained model library is used to perform fusion recognition on the lip-motion feature data and the voice feature data.
  • The above device may also adopt the lip-reading feedback mechanism, in which case a control module needs to be added. When the microphone has acquired a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images,
  • this module controls the microphone 11 to enter the listening state and controls the camera 12 to stop working.
  • The lip-reading image processing module 13, the voice processing module 14, and the fusion recognition module 15 are also controlled to stop working, thereby reducing the power consumption of the device.
  • After the microphone 11 enters the listening state, it can detect whether there is a valid voice input; if a valid voice input is detected, it enters the working state and starts the camera 12, the lip-reading image processing module 13, the voice processing module 14, and the fusion recognition module 15 to work normally. Such a scheme not only improves the reliability of speech recognition in a noisy environment but also reduces device power consumption.
  • The control module may also, according to a user instruction, keep the microphone acquiring the voice signal and notify the camera 12 to cancel the capture of lip-reading images. That is, the control module can select the recognition mode according to the user instruction: for example, recognition using the microphone 11 alone, recognition using the camera 12 alone, or both channels at the same time.
  • The above device can be built into any of the following equipment:
  • wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • The microphone 11 and the camera 12 are optionally arranged on the same side of the equipment; for example, the camera 12 is assembled at the microphone of a headset, while the other parts can be assembled on the smart machine.
  • This embodiment provides a human-machine interaction device, including the following parts.
  • A lip-reading image processing module processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data.
  • A voice processing module processes the voice signal to obtain voice feature data.
  • A fusion recognition module fuses the lip-motion feature data and the voice feature data to recognize the input speech.
  • Optionally, a trained model library is used to perform fusion recognition on the lip-motion feature data and the voice feature data.
  • A control module, when the microphone has acquired a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the captured lip-reading images (i.e., no recognizable lip-motion feature data can be obtained), controls the microphone to enter the listening state and controls the camera to stop working.
  • The control module may also, according to a user instruction, keep the microphone acquiring the voice signal and notify the camera to cancel the capture of lip-reading images. That is, the control module can select the recognition mode according to the user instruction: for example, recognition using the microphone alone, recognition using the camera alone, or both channels at the same time.
  • The above microphone can start the camera only when there is a valid voice input, to reduce device power consumption.
  • The microphone detects the sound source and converts natural voice into an electrical signal; when the electrical signal is greater than (or not less than) the set threshold, a valid voice input is considered detected.
  • The above device can be built into any of the following equipment:
  • wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  • The microphone and the camera are optionally arranged on the same side of the equipment; for example, the camera is assembled at the microphone of a headset, while the other parts can be assembled on the smart machine.
  • In the technical solution of the present application, lip reading and speech are fused in a noisy environment; compared with the conventional technique of recognition using speech feature data alone, this effectively improves speech recognition and raises the machine recognition rate, and the camera is started only when a valid voice input is confirmed, which greatly reduces device power consumption.

Abstract

A human-machine interaction device and method, a corresponding computer program, and a carrier for the computer program. The method comprises: while a microphone in a human-machine interaction device is acquiring a speech signal, if a valid speech input is detected, a camera in the human-machine interaction device is activated to acquire lip-reading images in real time; the human-machine interaction device processes a sequence formed by the acquired lip-reading images to obtain lip-motion feature data; and the human-machine interaction device fuses the lip-motion feature data with speech feature data extracted from the speech signal to recognize the input speech. The technical solution of the present application effectively improves speech recognition and increases the machine recognition rate.

Description

Human-machine interaction device and method
Technical field
The present invention relates to the field of human-computer interaction technology, and more particularly to a human-machine interaction device and method.
Background art
With the diversification and increasing intelligence of mobile terminal devices, human-computer interaction has also diversified: from traditional key input to touch input, and on to multiple forms of biometric features such as fingerprints, voice, and gestures that intelligent terminals can effectively recognize. Human-computer interaction technology has accordingly been widely studied and applied.
However, related human-machine interaction devices have no very effective solution to noise interference.
Summary of the invention
The technical problem to be solved by the present invention is to provide a human-machine interaction device and method, so as to solve the problem of low reliability of speech recognition in noisy environments.
To solve the above technical problem, the following technical solutions are adopted:
A human-computer interaction method includes:
while the microphone in the human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, the camera in the human-machine interaction device is started to capture lip-reading images in real time;
the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
Optionally, detecting a valid voice input includes:
the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds a set threshold, a valid voice input is determined to exist, where the electrical signal includes a voltage signal or a current signal.
Optionally, after the step of starting the camera in the human-machine interaction device to capture lip-reading images in real time, the method further includes:
while the microphone is acquiring the voice signal, if invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the human-machine interaction device controls the microphone to enter a listening state and controls the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
A human-computer interaction method includes:
the microphone in the human-machine interaction device acquires a voice signal, and the camera captures lip-reading images in real time;
the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech, where, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
Optionally, after the step of controlling the microphone to enter the listening state and controlling the camera to stop working, the method further includes:
when the microphone is in the listening state, if a valid voice input is detected, it enters the working state and starts the camera to capture lip-reading images in real time.
A human-machine interaction device includes a microphone, a camera, a lip-reading image processing module, and a fusion recognition module, where:
the microphone is configured to acquire a voice signal and, when a valid voice input is detected, start the camera;
the camera is configured to capture lip-reading images in real time under the control of the microphone;
the lip-reading image processing module is configured to process the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the fusion recognition module is configured to fuse the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
Optionally, the microphone is configured to detect a valid voice input as follows:
the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds a set threshold, a valid voice input is determined to exist, where the electrical signal includes a voltage signal or a current signal.
Optionally, the device further includes a control module, where:
the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images, control the microphone to enter a listening state and control the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
Optionally, the device is assembled in any one of the following equipment:
wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
A human-machine interaction device includes a microphone and a camera, and further includes a lip-reading image processing module, a fusion recognition module, and a control module, where:
the lip-reading image processing module is configured to process the sequence formed by the lip-reading images captured by the camera to obtain lip-motion feature data;
the fusion recognition module is configured to fuse the lip-motion feature data with voice feature data extracted from the voice signal acquired by the microphone to recognize the input speech; and
the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images, control the microphone to enter a listening state and control the camera to stop working.
Optionally, the microphone is further configured to, after entering the listening state under the control of the control module, enter the working state if a valid voice input is detected, and start the camera to capture lip-reading images in real time.
Optionally, the device is assembled in any one of the following equipment:
wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
A computer program includes program instructions that, when executed by a human-machine interaction device, cause the device to perform the corresponding human-computer interaction method. A carrier carries the computer program.
A computer program includes program instructions that, when executed by a human-machine interaction device, cause the device to perform the corresponding human-computer interaction method. A carrier carries the computer program.
In the technical solution of the present application, lip reading and speech are fused in a noisy environment; compared with the conventional technique of recognition using speech feature data alone, this effectively improves speech recognition and raises the machine recognition rate, and the camera is started only when a valid voice input is confirmed, which greatly reduces device power consumption. An optional scheme further proposes applying this solution in wearable smart devices to enhance the machine's ability to recognize user input, making it convenient for the user and improving the user experience.
Brief description of the drawings
FIG. 1 is a structural diagram of an interaction device implemented according to an embodiment of the present invention.
Preferred embodiments of the invention
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
Embodiment 1
This embodiment provides a human-computer interaction method that fuses lip reading and speech for speech recognition in a noisy environment. The method mainly includes the following operations:
while the microphone in the human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, the camera in the human-machine interaction device is started to capture lip-reading images in real time;
the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech.
Here, a lip-reading image, which may also be called a lip-motion image, is an image of the changing movement of the speaker's lips while speaking. Over a period of time, the lip-reading images constitute an image sequence, also called a lip-reading video. The sequence formed by the lip-reading images refers to the lip-reading video over a period of time.
The feature parameters obtained by applying specific operations to the lip-motion image sequence, i.e., the lip-motion feature data, are common knowledge to those skilled in the art and are not described further here.
The voice feature data is obtained by processing the voice signal, and there are many ways to represent it; for example, the spectral parameters of the speech can serve as one kind of feature data. Voice feature data processing can be performed once the voice signal is acquired, and is executed by the voice processing module. Voice feature data processing and lip-reading image processing are performed independently of each other.
During the acquisition of the voice signal by the microphone, a valid voice input is detected as follows:
the microphone detects the sound source and converts the natural voice of the detected sound source into an electrical signal; when the converted electrical signal exceeds the set threshold, a valid voice input is determined to exist. In this embodiment, the electrical signal involved includes a current signal or a voltage signal.
In addition, some optional schemes also propose a feedback mechanism for lip-reading processing: when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera (that is, the user's lips show no movement, and the user may not be speaking), the human-machine interaction device controls the microphone to enter a listening state and controls the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally. This mechanism targets situations of heavy noise: by combining the user's lip-motion features, it accurately distinguishes user speech from noise and, when noise is identified, stops the camera to improve equipment utilization.
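To make the control flow concrete, here is a minimal sketch of this listen/work cycle as a two-state machine in Python. The `mic`, `camera`, and `lip_module` objects and their method names are invented for illustration; the patent defines no such interface, and treating `None` as "invalid lip-motion feature data" is likewise an assumption.

```python
from enum import Enum, auto

class DeviceState(Enum):
    LISTENING = auto()  # only the microphone listens; the camera is stopped
    WORKING = auto()    # microphone and camera both active

def control_loop(mic, camera, lip_module):
    """Feedback mechanism: fall back to listening when lip data is invalid."""
    state = DeviceState.LISTENING
    while True:
        if state is DeviceState.LISTENING:
            if mic.detect_valid_voice():       # valid voice input detected again
                camera.start()                 # restart the camera
                state = DeviceState.WORKING
        else:
            frames = camera.capture_sequence()  # lip-reading image sequence
            if lip_module.extract_features(frames) is None:  # invalid lip data
                camera.stop()                  # likely noise, not the user
                state = DeviceState.LISTENING
```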
Correspondingly, the above human-machine interaction device may also, according to a user instruction, keep the microphone acquiring the voice signal while notifying the camera to cancel the capture of lip-reading images, thereby accommodating the user's choice of recognition mode in special scenarios and improving the user experience.
The implementation of the above method is described below with reference to a specific application scenario.
For example, a user interacts by voice with a smart device through a headset. Since machine recognition of human speech degrades noticeably in noisy environments or when the user's intonation is problematic, recognition of lip-reading images can be used to further improve the accuracy of speech recognition, helping the machine better understand the user's spoken expression and execute the user's voice commands. Optionally, the human-computer interaction process is as follows:
Step 1: the microphone acquires a voice signal and, when there is a valid voice input, starts the camera.
The microphone mainly uses a sound pressure sensor to detect the sound source and convert natural voice into an electrical signal. To distinguish background sound, a threshold for the sound pressure sensor's electrical signal can be set to decide whether there is a valid voice input. When the converted sound pressure sensor signal is greater than (or not less than) the set threshold, a valid voice input is determined, and the camera is notified to start and begin normal operation.
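As a rough sketch of this threshold test: assuming the sensor's output has been digitized into frames of samples normalized to [-1, 1], the frame's RMS energy can stand in for the "converted electrical signal". The threshold value below is illustrative; the patent only requires some set threshold.

```python
import numpy as np

VOICE_THRESHOLD = 0.02  # illustrative; the patent leaves the value to the implementer

def has_valid_voice_input(samples: np.ndarray) -> bool:
    """Compare the frame's RMS energy against the set threshold."""
    rms = float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))
    return rms > VOICE_THRESHOLD
```

A `True` result here would be the event that notifies the camera to start.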
Moreover, only when the microphone detects a valid voice input is the camera notified to work and capture lip-reading images, which reduces device power consumption.
Step 2: the camera captures lip-reading images.
Lip-reading images are usually captured by first performing face recognition on the image sequence to determine the position of the lips, and then acquiring the lip-motion data. In practice, a directional microphone may be chosen with the camera built into the microphone (or the microphone built into the camera); in a headset, for example, the camera sits at the microphone and points directly at the user's lips during use, making it easy to capture lip images.
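As a sketch of "face recognition first, then lip position", the snippet below crops a lip region with OpenCV's stock Haar face detector plus the heuristic that the mouth occupies roughly the lower third of the face box; the detector choice and crop ratios are illustrative assumptions, not part of the patent.

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_roi(frame):
    """Return the lip region of the first detected face, or None if no face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Heuristic: the mouth sits in roughly the lower third of the face box.
    return gray[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
```

In the headset configuration described above, where the camera already points at the lips, this detection step could be skipped and the whole frame treated as the lip region.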
Step 3: the sequence formed by the captured lip-reading images is processed to obtain lip-motion feature data.
This mainly involves lip localization and tracking on the sequence formed by the lip-reading images, followed by lip-motion feature extraction; the lip-motion feature data is finally output to the fusion recognition module.
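One simple stand-in for lip-motion feature extraction is the inter-frame difference energy of the lip region; a sequence whose difference energy never rises above a floor corresponds to the "invalid lip-motion feature data" case used by the feedback mechanism. Real systems use richer descriptors (lip contours, appearance models), so this is purely illustrative, and `motion_floor` is an assumed parameter.

```python
import numpy as np

def lip_motion_features(roi_sequence, motion_floor=1.0):
    """Mean absolute inter-frame difference per step; None if the lips barely move.

    `roi_sequence` is a list of equally sized grayscale lip-region images;
    `motion_floor` is an illustrative threshold below which the sequence is
    treated as invalid (no lip movement, so the user is probably not speaking).
    """
    diffs = [np.mean(np.abs(a.astype(np.int16) - b.astype(np.int16)))
             for a, b in zip(roi_sequence, roi_sequence[1:])]
    features = np.asarray(diffs, dtype=np.float64)
    if features.size == 0 or features.max() < motion_floor:
        return None  # invalid lip-motion feature data
    return features
```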
In addition, through user configuration, a feedback mechanism for lip-reading processing can be set. For example, in a noisy environment or a cross-talker scenario, the microphone may pick up other sound signals while the user is not speaking, causing the camera to start capturing lip images; processing the lip-reading images then extracts no lip-motion features. In this case, the human-machine interaction device can notify the camera, the voice processing module, the lip-reading processing module, and the fusion recognition module to stop working, leaving only the microphone in the listening state.
In certain special scenarios, the feedback mechanism for lip-reading processing can also be cancelled; for example, when the camera cannot effectively capture lip-reading data, human-computer interaction proceeds by voice alone, so that lip-reading recognition results do not interfere with speech recognition. Alternatively, for special scenarios or special groups of users, human-computer interaction through lip reading alone can also be configured.
Step 4: the acquired voice is processed to obtain voice feature data.
It should be noted that, since the processing of lip-reading images and the processing of voice in the human-machine interaction device are carried out by two mutually independent parts, the order of steps 3 and 4 above can be adjusted, and the two steps can also run at the same time.
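On the speech side, the text only notes that spectral parameters can serve as feature data. A numpy-only sketch of short-time log-magnitude spectra follows; the 16 kHz sampling rate and the 25 ms / 10 ms framing implied by the constants are assumptions, not values from the patent.

```python
import numpy as np

def log_spectral_features(samples, frame_len=400, hop=160):
    """Short-time log-magnitude spectrum: one feature vector per frame.

    With 16 kHz audio (an assumption), 400/160 samples correspond to a
    25 ms window with a 10 ms hop.
    """
    window = np.hanning(frame_len)
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    return np.array([np.log1p(np.abs(np.fft.rfft(f * window))) for f in frames])
```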
Step 5: the fusion recognition module performs fusion recognition on the voice feature data and the lip-motion feature data.
Lip reading and speech are two complementary channels: for example, the unit sounds /m/ and /n/, hard to distinguish in the speech signal channel, are visually distinguishable, while /b/, /p/, and /m/, hard to distinguish visually, are distinguishable in the speech signal. Especially under noisy, multi-talker conditions, the auxiliary information of lip-reading images can markedly improve the machine's speech recognition rate. Fusion recognition processing of lip reading and speech is used to correct inconsistencies between lip-reading recognition and speech recognition results: when the information from the two channels disagrees, a trained recognition library can decide which channel's information is more reliable, thereby improving the speech recognition rate.
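The "which channel is more reliable" decision can be sketched as weighted late fusion of per-class scores from the two recognizers. Everything here — the shared class inventory, log-probability scores, and the fixed weight — is an illustrative assumption; the patent does not prescribe a fusion algorithm, and in practice the weight would come from the trained recognition library (e.g., lowered as measured noise rises so that the lip-reading channel dominates).

```python
import numpy as np

def fuse_scores(audio_logp: np.ndarray, visual_logp: np.ndarray,
                audio_weight: float = 0.7) -> int:
    """Late fusion: weighted sum of per-class log-probabilities.

    `audio_logp` and `visual_logp` score the same class inventory
    (e.g., phonemes or voice commands); the argmax of the weighted
    sum is the fused decision.
    """
    fused = audio_weight * audio_logp + (1.0 - audio_weight) * visual_logp
    return int(np.argmax(fused))
```

This captures the complementarity noted above: where the audio scores confuse /m/ and /n/, the visual scores tip the fused decision, and conversely for /b/, /p/, and /m/.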
The human-machine interaction device involved in the above method can also be assembled in equipment such as wearable devices (e.g., smart glasses, smart helmets), portable devices, smart terminals, smart home appliances, and security monitoring equipment.
Embodiment 2
This embodiment provides a human-computer interaction method, which includes the following steps:
the microphone in the human-machine interaction device acquires a voice signal, and the camera captures lip-reading images in real time;
the human-machine interaction device processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
the human-machine interaction device fuses the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input speech, where, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images captured by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
In an optional scheme, after the microphone is controlled to enter the listening state and the camera is controlled to stop working, the microphone continues to detect whether there is a valid voice input; if a valid voice input is detected, it enters the working state and starts the camera.
Embodiment 3
This embodiment provides a human-machine interaction device which, as shown in FIG. 1, includes the following parts.
The microphone 11 acquires a voice signal and starts the camera when a valid voice input is detected.
Optionally, the microphone 11 detects the sound source and converts natural voice into a voltage or current signal; when the voltage or current signal is greater than (or not less than) the set threshold, a valid voice input is considered detected.
The camera 12 captures lip-reading images in real time under the control of the microphone 11.
Optionally, the camera receives a control signal from the microphone 11 and images the lips synchronously when the microphone 11 detects a valid sound source.
The lip-reading image processing module 13 processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data.
Optionally, it performs lip localization and tracking on the lip-reading images and extracts the lip-motion feature data.
The voice processing module 14 processes the voice signal to obtain voice feature data.
The fusion recognition module 15 fuses the lip-motion feature data and the voice feature data to recognize the input speech.
Optionally, a trained model library is used to perform fusion recognition on the lip-motion feature data and the voice feature data.
In addition, the above device may adopt the lip-reading feedback mechanism, in which case a control module needs to be added. When the microphone has acquired a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the captured lip-reading images (which can also be understood as being unable to extract lip-motion feature data from the sequence), this module controls the microphone 11 to enter the listening state and controls the camera 12 to stop working. At the same time, the lip-reading image processing module 13, the voice processing module 14, and the fusion recognition module 15 are also controlled to stop working, thereby reducing the power consumption of the device.
Optionally, after the microphone 11 enters the listening state, it can detect whether there is a valid voice input; if a valid voice input is detected, it enters the working state and starts the camera 12, the lip-reading image processing module 13, the voice processing module 14, and the fusion recognition module 15 to work normally. This scheme not only improves the reliability of speech recognition in a noisy environment but also reduces device power consumption.
In addition, the above control module may also, according to a user instruction, keep the microphone acquiring the voice signal and notify the camera 12 to cancel the capture of lip-reading images. That is, the control module can select the recognition mode according to the user instruction: for example, recognition using the microphone 11 alone, recognition using the camera 12 alone, or both channels at the same time.
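A structural sketch of how the parts of FIG. 1 and the control module's mode selection could fit together; the class and method names are invented for illustration and are not defined by the patent.

```python
from enum import Enum, auto

class RecognitionMode(Enum):
    VOICE_ONLY = auto()  # microphone 11 alone
    LIP_ONLY = auto()    # camera 12 alone (lip reading)
    FUSED = auto()       # both channels, merged by fusion module 15

class InteractionDevice:
    def __init__(self, mic, camera, lip_module, voice_module, fusion_module):
        self.mic, self.camera = mic, camera                             # parts 11, 12
        self.lip_module, self.voice_module = lip_module, voice_module  # parts 13, 14
        self.fusion_module = fusion_module                             # part 15
        self.mode = RecognitionMode.FUSED

    def set_mode(self, mode: RecognitionMode) -> None:
        """Control module: select the recognition mode per user instruction."""
        self.mode = mode
        if mode is RecognitionMode.VOICE_ONLY:
            self.camera.stop()  # cancel lip-reading image capture
        else:
            self.camera.start()
```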
In actual use, the above device can be built into any of the following equipment:
wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
Here, the microphone 11 and the camera 12 are optionally arranged on the same side of the equipment; for example, the camera 12 is assembled at the microphone of a headset, while the other parts can be assembled on the smart machine.
Embodiment 4
This embodiment provides a human-machine interaction device, including the following parts.
A microphone acquires a voice signal.
A camera captures lip-reading images in real time.
A lip-reading image processing module processes the sequence formed by the captured lip-reading images to obtain lip-motion feature data.
Optionally, it performs lip localization and tracking on the lip-reading images and extracts the lip-motion feature data.
A voice processing module processes the voice signal to obtain voice feature data.
A fusion recognition module performs fusion recognition on the lip-motion feature data and the voice feature data to recognize the input speech.
Optionally, a trained model library is used to perform fusion recognition on the lip-motion feature data and the voice feature data.
A control module, when the microphone has acquired a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the captured lip-reading images (i.e., no recognizable lip-motion feature data can be obtained), controls the microphone to enter the listening state and controls the camera to stop working.
In addition, the above control module may also, according to a user instruction, keep the microphone acquiring the voice signal and notify the camera to cancel the capture of lip-reading images. That is, the control module can select the recognition mode according to the user instruction: for example, recognition using the microphone alone, recognition using the camera alone, or both channels at the same time.
Preferably, the above microphone can start the camera only when there is a valid voice input, to reduce device power consumption. Optionally, the microphone detects the sound source and converts natural voice into an electrical signal; when the electrical signal is greater than (or not less than) the set threshold, a valid voice input is considered detected.
In actual use, the above device can be built into any of the following equipment:
wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
Here, the microphone and the camera are optionally arranged on the same side of the equipment; for example, the camera is assembled at the microphone of a headset, while the other parts can be assembled on the smart machine.
Those of ordinary skill in the art will understand that all or some of the steps of the above method can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments can also be implemented using one or more integrated circuits. Correspondingly, each module/unit in the above embodiments can be implemented in the form of hardware or in the form of a software functional module. The present application is not limited to any specific combination of hardware and software.
The above are only preferred examples of the present invention and are not intended to limit the protection scope of the present invention. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Industrial applicability
In the technical solution of the present application, lip reading and speech are fused in a noisy environment; compared with the conventional technique of recognition using speech feature data alone, this effectively improves speech recognition and raises the machine recognition rate, and the camera is started only when a valid voice input is confirmed, which greatly reduces device power consumption. An optional scheme further proposes applying this solution in wearable smart devices to enhance the machine's ability to recognize user input, making it convenient for the user and improving the user experience. The present invention therefore has strong industrial applicability.

Claims (16)

  1. A human-computer interaction method, comprising:
    while a microphone in a human-machine interaction device is acquiring a voice signal, if a valid voice input is detected, activating a camera in the human-machine interaction device to capture lip-reading images in real time;
    processing, by the human-machine interaction device, a sequence formed by the captured lip-reading images to obtain lip-motion feature data; and
    fusing, by the human-machine interaction device, the lip-motion feature data with voice feature data extracted from the voice signal to recognize an input speech.
  2. The human-computer interaction method according to claim 1, wherein the step of detecting a valid voice input comprises:
    the microphone detecting a sound source and converting the natural voice of the detected sound source into an electrical signal, and, when the converted electrical signal exceeds a set threshold, determining that there is a valid voice input, wherein the electrical signal comprises a voltage signal or a current signal.
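A minimal sketch of the threshold test in claim 2, assuming the converted electrical signal is available as a digitized, normalized sample buffer and using peak amplitude as the detection statistic; the threshold value and the amplitude measure are illustrative assumptions, not specified by the claims.

```python
import numpy as np

VALID_INPUT_THRESHOLD = 0.05  # assumed threshold on the normalized amplitude

def has_valid_voice_input(samples: np.ndarray,
                          threshold: float = VALID_INPUT_THRESHOLD) -> bool:
    """Return True when the converted signal exceeds the set threshold.

    `samples` stands in for the voltage or current signal produced by the
    microphone, digitized and normalized to [-1, 1].
    """
    return float(np.max(np.abs(samples))) > threshold
```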
  3. The human-computer interaction method according to claim 1 or 2, wherein, after the step of activating the camera in the human-machine interaction device to acquire lip-reading images in real time, the method further comprises:
    while the microphone is acquiring the voice signal, if invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images acquired by the camera, the human-machine interaction device controlling the microphone to enter a listening state and controlling the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
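The control behaviour of claims 3 to 5 can be read as a two-state machine that alternates between a low-power listening state and a working state. The sketch below is one possible reading; the state names, polling structure, and the `mic`/`camera` device interfaces are all assumed for illustration.

```python
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()   # microphone listens; camera stopped
    WORKING = auto()     # microphone and camera both active

def control_step(state, mic, camera, lip_features_valid):
    """One iteration of the assumed control loop.

    `mic` and `camera` are hypothetical device handles; `lip_features_valid`
    reports whether the lip-reading pipeline produced usable feature data.
    """
    if state is State.LISTENING:
        if mic.valid_voice_input():       # valid speech detected again
            camera.start()
            return State.WORKING
    elif state is State.WORKING and not lip_features_valid:
        camera.stop()                     # invalid lip-motion features
        return State.LISTENING
    return state
```

Running `control_step` once per captured frame reproduces the behaviour described in claims 3 and 5: the camera stops as soon as the lip-motion features become invalid, and restarts only after the microphone again detects a valid voice input.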
  4. A human-computer interaction method, the method comprising:
    a microphone in a human-machine interaction device acquiring a voice signal, and a camera acquiring lip-reading images in real time;
    the human-machine interaction device processing the sequence formed by the acquired lip-reading images to obtain lip-motion feature data;
    the human-machine interaction device fusing the lip-motion feature data with voice feature data extracted from the voice signal to recognize the input voice, wherein, when the microphone acquires a voice signal but invalid lip-motion feature data is obtained from the sequence formed by the lip-reading images acquired by the camera, the microphone is controlled to enter a listening state and the camera is controlled to stop working.
  5. The human-computer interaction method according to claim 4, wherein, after the step of controlling the microphone to enter a listening state and controlling the camera to stop working, the method further comprises:
    when the microphone is in the listening state, if a valid voice input is detected, the microphone entering a working state and activating the camera to acquire lip-reading images in real time.
  6. A human-machine interaction device, comprising a microphone, a camera, a lip-reading image processing module, and a fusion recognition module, wherein:
    the microphone is configured to acquire a voice signal and, when a valid voice input is detected, activate the camera;
    the camera is configured to acquire lip-reading images in real time under the control of the microphone;
    the lip-reading image processing module is configured to process the sequence formed by the acquired lip-reading images to obtain lip-motion feature data;
    the fusion recognition module is configured to fuse the lip-motion feature data with the voice feature data extracted from the voice signal to recognize the input voice.
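One way to picture the module layout of claim 6 is the composition below; the class name, constructor arguments, and method signatures are hypothetical, chosen only to mirror the configured responsibility of each module.

```python
class HumanMachineInteractionDevice:
    """Assumed composition of the four modules named in claim 6."""

    def __init__(self, microphone, camera, lip_processor, fusion_recognizer):
        self.microphone = microphone                  # acquires voice, triggers the camera
        self.camera = camera                          # captures lip-reading images
        self.lip_processor = lip_processor            # image sequence -> lip-motion features
        self.fusion_recognizer = fusion_recognizer    # fuses both feature streams

    def recognize(self):
        voice_signal = self.microphone.acquire()
        if not self.microphone.valid_voice_input():
            return None                               # no valid input; camera stays off
        images = self.camera.capture_sequence()       # started under microphone control
        lip_features = self.lip_processor.process(images)
        return self.fusion_recognizer.recognize(voice_signal, lip_features)
```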
  7. The human-machine interaction device according to claim 6, wherein the microphone is configured to detect a valid voice input in the following manner:
    the microphone detects a sound source and converts the natural voice of the detected sound source into an electrical signal, and, when the converted electrical signal exceeds a set threshold, determines that there is a valid voice input, wherein the electrical signal comprises a voltage signal or a current signal.
  8. The human-machine interaction device according to claim 6 or 7, further comprising a control module, wherein:
    the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the acquired lip-reading images, control the microphone to enter a listening state and control the camera to stop working, until the microphone again detects a valid voice input, whereupon the camera is restarted to operate normally.
  9. The human-machine interaction device according to claim 8, wherein the device is fitted in any one of the following kinds of equipment:
    wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  10. A human-machine interaction device, comprising a microphone and a camera, and further comprising a lip-reading image processing module, a fusion recognition module, and a control module, wherein:
    the lip-reading image processing module is configured to process the sequence formed by the lip-reading images acquired by the camera to obtain lip-motion feature data;
    the fusion recognition module is configured to fuse the lip-motion feature data with the voice feature data extracted from the voice signal acquired by the microphone to recognize the input voice;
    the control module is configured to, when the microphone acquires a voice signal but the lip-reading image processing module obtains invalid lip-motion feature data from the sequence formed by the acquired lip-reading images, control the microphone to enter a listening state and control the camera to stop working.
  11. The device according to claim 10, wherein
    the microphone is further configured to, after entering the listening state under the control of the control module, enter a working state if a valid voice input is detected, and activate the camera to acquire lip-reading images in real time.
  12. The device according to claim 10 or 11, wherein the device is fitted in any one of the following kinds of equipment:
    wearable devices, portable devices, smart terminals, smart home appliances, and security monitoring equipment.
  13. A computer program, comprising program instructions that, when executed by a human-machine interaction device, cause the human-machine interaction device to perform the human-computer interaction method according to any one of claims 1 to 3.
  14. A carrier carrying the computer program of claim 13.
  15. A computer program, comprising program instructions that, when executed by a human-machine interaction device, cause the human-machine interaction device to perform the human-computer interaction method according to claim 4 or 5.
  16. A carrier carrying the computer program of claim 15.
PCT/CN2014/089020 2014-09-03 2014-10-21 Human-machine interaction device and method WO2015154419A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410446967.2 2014-09-03
CN201410446967.2A CN105389097A (en) 2014-09-03 2014-09-03 Man-machine interaction device and method

Publications (1)

Publication Number Publication Date
WO2015154419A1 (en)

Family

ID=54287187

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/089020 WO2015154419A1 (en) 2014-09-03 2014-10-21 Human-machine interaction device and method

Country Status (2)

Country Link
CN (1) CN105389097A (en)
WO (1) WO2015154419A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452381B (en) * 2016-05-30 2020-12-29 中国移动通信有限公司研究院 Multimedia voice recognition device and method
CN108227903B (en) * 2016-12-21 2020-01-10 深圳市掌网科技股份有限公司 Virtual reality language interaction system and method
CN107293300A (en) * 2017-08-01 2017-10-24 珠海市魅族科技有限公司 Audio recognition method and device, computer installation and readable storage medium storing program for executing
CN107679449B (en) * 2017-08-17 2018-08-03 平安科技(深圳)有限公司 Lip motion method for catching, device and storage medium
US11836592B2 (en) 2017-12-15 2023-12-05 International Business Machines Corporation Communication model for cognitive systems
CN108154140A (en) 2018-01-22 2018-06-12 北京百度网讯科技有限公司 Voice awakening method, device, equipment and computer-readable medium based on lip reading
CN111326152A (en) * 2018-12-17 2020-06-23 南京人工智能高等研究院有限公司 Voice control method and device
CN111868823A (en) * 2019-02-27 2020-10-30 华为技术有限公司 Sound source separation method, device and equipment
CN110111783A (en) * 2019-04-10 2019-08-09 天津大学 A kind of multi-modal audio recognition method based on deep neural network
CN110335600A (en) * 2019-07-09 2019-10-15 四川长虹电器股份有限公司 The multi-modal exchange method and system of household appliance
CN110765868A (en) * 2019-09-18 2020-02-07 平安科技(深圳)有限公司 Lip reading model generation method, device, equipment and storage medium
CN111063354B (en) * 2019-10-30 2022-03-25 云知声智能科技股份有限公司 Man-machine interaction method and device
CN111190484B (en) * 2019-12-25 2023-07-21 中国人民解放军军事科学院国防科技创新研究院 Multi-mode interaction system and method
CN111312217A (en) * 2020-02-28 2020-06-19 科大讯飞股份有限公司 Voice recognition method, device, equipment and storage medium
CN111539270A (en) * 2020-04-10 2020-08-14 贵州合谷信息科技有限公司 High-recognition-rate micro-expression recognition method for voice input method
CN112908334A (en) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 Hearing aid method, device and equipment based on directional pickup
CN114708642B (en) * 2022-05-24 2022-11-18 成都锦城学院 Business English simulation training device, system, method and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100189305A1 (en) * 2009-01-23 2010-07-29 Eldon Technology Limited Systems and methods for lip reading control of a media device
CN101937268A (en) * 2009-06-30 2011-01-05 索尼公司 Device control based on the identification of vision lip
CN102023703A (en) * 2009-09-22 2011-04-20 现代自动车株式会社 Combined lip reading and voice recognition multimodal interface system
CN103456303A (en) * 2013-08-08 2013-12-18 四川长虹电器股份有限公司 Method for controlling voice and intelligent air-conditionier system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7860718B2 (en) * 2005-12-08 2010-12-28 Electronics And Telecommunications Research Institute Apparatus and method for speech segment detection and system for speech recognition
CN102324035A (en) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 Method and system of applying lip posture assisted speech recognition technique to vehicle navigation
CN103745723A (en) * 2014-01-13 2014-04-23 苏州思必驰信息科技有限公司 Method and device for identifying audio signal

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319912A (en) * 2018-01-30 2018-07-24 歌尔科技有限公司 A kind of lip reading recognition methods, device, system and intelligent glasses
CN112053690A (en) * 2020-09-22 2020-12-08 湖南大学 Cross-modal multi-feature fusion audio and video voice recognition method and system
CN112053690B (en) * 2020-09-22 2023-12-29 湖南大学 Cross-mode multi-feature fusion audio/video voice recognition method and system

Also Published As

Publication number Publication date
CN105389097A (en) 2016-03-09

Similar Documents

Publication Publication Date Title
WO2015154419A1 (en) Human-machine interaction device and method
US9779725B2 (en) Voice wakeup detecting device and method
JP6230726B2 (en) Speech recognition apparatus and speech recognition method
US10109300B2 (en) System and method for enhancing speech activity detection using facial feature detection
KR102216048B1 (en) Apparatus and method for recognizing voice commend
JP6504808B2 (en) Imaging device, setting method of voice command function, computer program, and storage medium
WO2018049782A1 (en) Household appliance control method, device and system, and intelligent air conditioner
US11699442B2 (en) Methods and systems for speech detection
US20150279369A1 (en) Display apparatus and user interaction method thereof
US11423896B2 (en) Gaze-initiated voice control
US20150088515A1 (en) Primary speaker identification from audio and video data
WO2021184549A1 (en) Monaural earphone, intelligent electronic device, method and computer readable medium
US20180009118A1 (en) Robot control device, robot, robot control method, and program recording medium
CN110730115B (en) Voice control method and device, terminal and storage medium
US10991372B2 (en) Method and apparatus for activating device in response to detecting change in user head feature, and computer readable storage medium
KR20130091278A (en) Two mode agc for single and multiple speakers
CN111131601B (en) Audio control method, electronic equipment, chip and computer storage medium
US9516429B2 (en) Hearing aid and method for controlling hearing aid
WO2017219450A1 (en) Information processing method and device, and mobile terminal
JP5797009B2 (en) Voice recognition apparatus, robot, and voice recognition method
KR20210011146A (en) Apparatus for providing a service based on a non-voice wake-up signal and method thereof
WO2022199405A1 (en) Voice control method and apparatus
CN113643707A (en) Identity verification method and device and electronic equipment
CN104423992A (en) Speech recognition startup method for display
KR102265874B1 (en) Method and Apparatus for Distinguishing User based on Multimodal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14888851

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14888851

Country of ref document: EP

Kind code of ref document: A1