WO2020073403A1 - Silent voice input identification method, computing apparatus, and computer-readable medium - Google Patents

Silent voice input identification method, computing apparatus, and computer-readable medium

Info

Publication number
WO2020073403A1
WO2020073403A1 (PCT/CN2018/114608)
Authority
WO
WIPO (PCT)
Prior art keywords
mouth
user
silent
input
feature
Prior art date
Application number
PCT/CN2018/114608
Other languages
French (fr)
Chinese (zh)
Inventor
喻纯
孙科
史元春
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学 (Tsinghua University)
Publication of WO2020073403A1 publication Critical patent/WO2020073403A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Definitions

  • The present invention relates generally to lip-reading ("lip language") input technology, and in particular to silent speech input recognition methods, devices, and computer-readable media.
  • Silent Speech Input
  • Silent speech input refers to an input interaction with a computing device in which the user communicates through speech but does not actually vocalize, only forming the mouth shapes corresponding to the spoken content.
  • Silent speech input is well suited to settings such as meetings, where speaking aloud is inappropriate and extended finger-based input is inconvenient, and it offers very good privacy.
  • A device that supports silent speech input recognizes what the user is saying by capturing the signals (or images) produced by the user's mouth movements through one or more specific sensors (such as electromyography sensors or cameras).
  • The devices addressed here capture and recognize image sequences of the user's moving mouth through a camera (the patent is not tied to a specific capture method; any method may be used, the camera being one important option).
  • When using a smartphone, computer, or head-mounted device, the user issues a voice command or content in the form of silent speech; the camera on the device captures it, the command or content is recognized, and the computing device then responds with corresponding feedback.
  • A key issue is how the computing device determines whether the user is actually performing silent speech input, rather than the user's mouth performing other natural movements or producing voiced speech.
  • A device that supports silent speech input captures a signal produced by the user's mouth movements through one or more specific sensors and analyzes that signal to recognize what the user is saying.
  • The prior art focuses mainly on how to process the mouth motion signal to recognize the content spoken by the user; there is as yet no technique by which a computing device judges whether the user is actually performing silent speech input.
  • The inventors observe that humans make many mouth movements, such as chewing, yawning, and unconscious movements such as pursing the lips. Recognizing speech input directly from such movements would cause very large errors, so distinguishing them from speech input is a prerequisite for accurate recognition of speech input.
  • This document therefore proposes a technique by which a computing device determines whether the user is actually performing silent speech input, rather than the user's mouth performing other natural movements or producing voiced speech.
  • A silent speech input recognition method comprising: obtaining a feature sequence of the user's moving mouth; using a pre-trained mouth motion discriminator to determine whether the feature sequence represents language input or other mouth movement; when the feature sequence is determined to represent language input, determining whether the user is performing silent language input; and when the user is determined to be performing silent language input, recognizing the content of the silent language input.
  • The moving-mouth feature sequence may be extracted from a moving-mouth image sequence captured by an electromyography sensor.
  • The moving-mouth feature sequence may be extracted from a moving-mouth image sequence captured by a camera.
  • The moving-mouth image data may be one or a combination of RGB data, structured light, infrared point cloud data, and depth point cloud data.
  • The moving-mouth image sequence may be obtained as follows: recognizing the position of the user's face based on machine learning, extracting the user's facial feature points, and acquiring real-time images of the user's mouth through the feature points.
  • The user moving-mouth feature sequence input to the mouth motion discriminator includes at least three feature data segments identifying three states: the first segment characterizes the mouth beginning to move, the second characterizes the mouth moving continuously, and the third characterizes the mouth stopping.
  • The discriminator is a binary classifier, trained on collected user data using machine learning methods.
  • When the user moving-mouth feature sequence is determined to represent language input, determining whether the user is performing silent language input includes: determining the degree of matching between the mouth feature sequence and the sound signal sequence according to a predetermined matching model between mouth features and sound signals under silent language input, and determining that the user is performing silent language input if the degree of matching exceeds a predetermined threshold.
  • The silent speech input recognition method may further include: after recognizing the content of the silent language input, responding with the recognized instruction or content.
  • A computing device comprising: a sensor capable of capturing mouth motion signals; and a controller and a memory, the memory storing computer-executable instructions that, when executed by the controller, are operable to perform the aforementioned silent speech input recognition method.
  • A computer-readable storage medium having computer-executable instructions stored thereon that, when executed by a computer, are operable to perform the aforementioned silent speech input recognition method.
  • The computing device first determines whether the user is actually performing silent speech input, rather than the user's mouth performing other natural movements or producing voiced speech; by filtering out irrelevant input, the recognition accuracy of the silent speech input content can be improved.
  • FIG. 1 shows an overall flowchart of a computer-implemented silent speech input recognition method 1000 according to an embodiment of the present invention.
  • FIG. 2 shows a schematic diagram of the operation and signal flow of hardware and/or software modules according to an embodiment of the present invention.
  • Silent speech input refers to the input behavior of making speaking movements with the mouth without vocalizing; it is sometimes called "lip language".
  • In step S1100, a feature sequence of the user's moving mouth is obtained.
  • The user moving-mouth feature sequence here may be any feature sequence describing the movement of the user's mouth.
  • For example, it may be a feature sequence extracted from a moving-mouth image sequence captured by a camera.
  • The image data can be one or a combination of RGB data, structured-light data, infrared point cloud data, and depth point cloud data.
  • A moving-mouth image sequence may be obtained, for example, by recognizing the position of the user's face based on machine learning, extracting the user's facial feature points, and acquiring real-time images of the user's mouth through the feature points.
  • In step S1200, a pre-trained mouth motion discriminator is used to determine whether the user moving-mouth feature sequence represents language input or other mouth movement.
  • In one example, the user moving-mouth feature sequence input to the mouth motion discriminator includes at least three feature data segments identifying three states: the first segment characterizes the mouth beginning to move, the second characterizes the mouth moving continuously, and the third characterizes the mouth stopping.
  • The mouth motion discriminator extracts the user's mouth motion sequence from the input mouth image sequence; specifically, based on the mouth feature points and image information, it determines which of the following four states applies: (1) the mouth starts moving, (2) the mouth keeps moving, (3) the mouth stops moving, or (4) other.
  • The result of extracting the user's mouth motion sequence is the mouth image sequence from state (1) to state (3).
  • The discriminator requires collecting user data and using machine learning to train and apply the model.
  • The discriminator is a binary classifier trained on collected user data using machine learning methods. It determines whether the mouth movement corresponds to speaking a natural language, rather than to confusable mouth movements arising in other situations, including but not limited to the user eating, yawning, or making unconscious movements.
  • In step S1300, when the user moving-mouth feature sequence is determined to represent language input, it is determined whether the user is performing silent language input.
  • Determining whether the user is performing silent language input includes: determining the degree of matching between the mouth feature sequence and the sound signal sequence according to a predetermined matching model between mouth features and sound signals under silent language input, and determining that the user is performing silent language input if the degree of matching exceeds a predetermined threshold.
  • In one example, the input is the mouth motion image sequence and the human voice signal collected by the microphone over the same interval, and the output is the degree of matching p between the two signals. If p is greater than a certain threshold, the mouth motion image sequence is determined to be a voiced sequence, i.e., the user is performing voiced speech input; otherwise, it is determined that the user is indeed performing silent speech input.
  • This determiner requires collecting user data and using machine learning to train and apply the model.
  • In step S1400, when it is determined that the user is performing silent language input, the content of the silent language input is recognized.
  • A device that supports silent speech input (such as a mobile phone or tablet) recognizes what the user says by capturing, through one or more specific sensors (such as electromyography sensors or cameras), the signals (or images) produced by the user's mouth movements.
  • In one example, the computing device captures and recognizes image sequences of the user's moving mouth through a camera.
  • The user issues a voice command or content in the form of silent speech; the camera on the device captures it, the command or content is recognized, and the computing device then responds with corresponding feedback.
  • The camera 104 acquires image sequences of the user 102 in real time; the image information may include, but is not limited to, RGB data, structured-light or infrared point cloud data, and depth point cloud data.
  • Face recognition module: uses machine learning and computer vision to recognize the position of the user's face and extract the user's facial feature points, and obtains real-time images of the user's mouth through the feature points; the image information may still include, but is not limited to, RGB and point cloud data.
  • Mouth motion sequence extraction module: one example is a discriminator that, based on the mouth feature points and image information, determines which of the following four states applies: (1) the mouth starts moving, (2) the mouth keeps moving, (3) the mouth stops moving, or (4) other.
  • The output of this module is the mouth image sequence from state (1) to state (3).
  • The discriminator requires collecting user data and using machine learning to train and apply the model.
  • Language input detection module: one example is a binary classifier that, based on the mouth motion image sequence output by the mouth motion sequence extraction module 108, determines whether the mouth movement corresponds to speaking a natural language rather than to confusable mouth movements arising in other situations, including but not limited to the user eating, yawning, or making unconscious movements.
  • This classifier requires collecting user data and using machine learning to train and apply the model.
  • The input is the mouth motion image sequence output by the mouth motion sequence extraction module 108 and the human voice signal collected by the microphone over the same interval; the output is the degree of matching p of these two signals. If p is greater than a certain threshold, the mouth motion image sequence is determined to be a voiced sequence, i.e., the user is performing voiced speech input; otherwise, it is determined that the user is indeed performing silent speech input.
  • This module requires collecting user data and using machine learning to train and apply the model.
  • The final recognition model recognizes the instruction or content issued by the user.
  • A silent speech input recognition apparatus comprising: a user moving-mouth feature sequence obtaining component that obtains the user moving-mouth feature sequence; a language input detection module that uses a pre-trained mouth motion discriminator to determine whether the feature sequence represents language input or other mouth movement; a silent language input judgment module that, when the feature sequence is determined to represent language input, determines whether the user is performing silent language input; and a silent language input content recognition module that, when the user is determined to be performing silent language input, recognizes the content of the silent language input.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

A silent voice input identification method, a computing apparatus, and a computer-readable medium. The silent voice input identification method comprises: obtaining a feature sequence of the user's moving mouth; using a pre-trained mouth movement discriminator to determine whether the feature sequence represents language input or other mouth movements; when the feature sequence is determined to represent language input, determining whether the user is performing silent language input; and when the user is determined to be performing silent language input, identifying the content of the silent language input. It is first determined whether the user is truly performing silent voice input, rather than the user's mouth performing other natural movements or the user speaking aloud; by filtering out irrelevant input, the accuracy of identifying the content of silent voice input can be improved.

Description

Silent speech input recognition method, computing device, and computer-readable medium
Technical Field
The present invention relates generally to lip-reading ("lip language") input technology, and in particular to silent speech input recognition methods, devices, and computer-readable media.
Background Art
With the development of machine learning technology and the improvement of computing device performance, silent speech input has become a promising mode of user input interaction.
Silent speech input refers to an input interaction with a computing device in which the user communicates through speech but does not actually vocalize, only forming the mouth shapes corresponding to the spoken content.
Silent speech input is well suited to settings such as meetings, where speaking aloud is inappropriate and extended finger-based input is inconvenient, and it offers very good privacy.
A device that supports silent speech input recognizes what the user is saying by capturing the signals (or images) produced by the user's mouth movements through one or more specific sensors (such as electromyography sensors or cameras).
The devices addressed here capture and recognize image sequences of the user's moving mouth through a camera (the patent is not tied to a specific capture method; any method may be used, the camera being one important option). For example, when using a smartphone, computer, or head-mounted device, the user issues a voice command or content in the form of silent speech; the camera on the device captures it, the command or content is recognized, and the computing device then responds with corresponding feedback.
A key issue is how the computing device determines whether the user is actually performing silent speech input, rather than the user's mouth performing other natural movements or producing voiced speech.
Summary of the Invention
A device that supports silent speech input captures a signal produced by the user's mouth movements through one or more specific sensors and analyzes that signal to recognize what the user is saying.
In the prior art, the main focus is on how to process the mouth motion signal to recognize the content spoken by the user; there is as yet no technique by which a computing device judges whether the user is actually performing silent speech input.
The inventors of the present invention observe that humans make many mouth movements, such as chewing, yawning, and unconscious movements such as pursing the lips. Recognizing speech input directly from such movements would cause very large errors, so distinguishing them from speech input is a prerequisite for accurate recognition of speech input.
To this end, this document proposes a technique by which a computing device determines whether the user is actually performing silent speech input, rather than the user's mouth performing other natural movements or producing voiced speech.
The devices addressed here capture and recognize image sequences of the user's moving mouth through a camera (the patent is not tied to a specific capture method; any method may be used, the camera being one important option). For example, when using a smartphone, computer, or head-mounted device, the user issues a voice command or content in the form of silent speech; the camera on the device captures it, the command or content is recognized, and the computing device then responds with corresponding feedback.
The present invention has been made in view of the above circumstances.
According to one aspect of the present invention, a silent speech input recognition method is provided, comprising: obtaining a feature sequence of the user's moving mouth; using a pre-trained mouth motion discriminator to determine whether the feature sequence represents language input or other mouth movement; when the feature sequence is determined to represent language input, determining whether the user is performing silent language input; and when the user is determined to be performing silent language input, recognizing the content of the silent language input.
Optionally, the moving-mouth feature sequence is extracted from a moving-mouth image sequence captured by an electromyography sensor.
Optionally, the moving-mouth feature sequence is extracted from a moving-mouth image sequence captured by a camera.
Optionally, the moving-mouth image data is one or a combination of RGB data, structured light, infrared point cloud data, and depth point cloud data.
Optionally, the moving-mouth image sequence is obtained as follows: recognizing the position of the user's face based on machine learning, extracting the user's facial feature points, and acquiring real-time images of the user's mouth through the feature points.
Optionally, the user moving-mouth feature sequence input to the mouth motion discriminator includes at least three feature data segments identifying three states: the first segment characterizes the mouth beginning to move, the second characterizes the mouth moving continuously, and the third characterizes the mouth stopping.
Optionally, the discriminator is a binary classifier, trained on collected user data using machine learning methods.
Optionally, when the user moving-mouth feature sequence is determined to represent language input, determining whether the user is performing silent language input includes: determining the degree of matching between the mouth feature sequence and the sound signal sequence according to a predetermined matching model between mouth features and sound signals under silent language input, and determining that the user is performing silent language input if the degree of matching exceeds a predetermined threshold.
Optionally, the silent speech input recognition method further includes: after recognizing the content of the silent language input, responding with the recognized instruction or content.
According to another aspect of the present invention, a computing device is provided, comprising: a sensor capable of capturing mouth motion signals; and a controller and a memory, the memory storing computer-executable instructions that, when executed by the controller, are operable to perform the aforementioned silent speech input recognition method.
According to yet another aspect of the present invention, a computer-readable storage medium is provided, having computer-executable instructions stored thereon that, when executed by a computer, are operable to perform the aforementioned silent speech input recognition method.
With the silent speech input recognition method of the present invention, the computing device first determines whether the user is actually performing silent speech input, rather than the user's mouth performing other natural movements or producing voiced speech; by filtering out irrelevant input, the recognition accuracy of the silent speech input content can be improved.
Brief Description of the Drawings
These and/or other aspects and advantages of the present invention will become clearer and easier to understand from the following detailed description of embodiments of the present invention with reference to the drawings, in which:
FIG. 1 shows an overall flowchart of a computer-implemented silent speech input recognition method 1000 according to an embodiment of the present invention.
FIG. 2 shows a schematic diagram of the operation and signal flow of hardware and/or software modules according to an embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Before the description, the meanings of relevant terms used herein are explained.
Silent speech input refers to the input behavior of making speaking movements with the mouth without vocalizing; it is sometimes called "lip language".
FIG. 1 shows an overall flowchart of a computer-implemented silent speech input recognition method 1000 according to an embodiment of the present invention.
In step S1100, a feature sequence of the user's moving mouth is obtained.
The user moving-mouth feature sequence here may be any feature sequence describing the movement of the user's mouth. For example, it may be a feature sequence extracted from a moving-mouth image sequence captured by a camera. Depending on the light source and/or camera employed (ordinary camera, structured light source, infrared imaging device, stereo camera), the obtained image data can be one or a combination of RGB data, structured-light data, infrared point cloud data, and depth point cloud data.
When a camera is used to obtain images of the moving mouth, the moving-mouth image sequence may be obtained, for example, as follows: the position of the user's face is recognized based on machine learning, the user's facial feature points are extracted, and real-time images of the user's mouth are acquired through the feature points.
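By way of illustration only (the following sketch is not part of the original disclosure), this face-and-mouth localization step could be realized with an off-the-shelf landmark detector. The sketch assumes the dlib library and its separately downloaded 68-point landmark model; the model path and crop margin are assumptions.

```python
# Sketch: mouth-region extraction via facial landmarks. Assumes dlib with
# its 68-point landmark model (points 48-67 cover the mouth region).
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumed model path; the file must be downloaded separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_roi(frame_bgr, margin=10):
    """Return (mouth crop, 20x2 landmark array) for the first detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    crop = frame_bgr[max(y - margin, 0):y + h + margin,
                     max(x - margin, 0):x + w + margin]
    return crop, pts
```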
In step S1200, a pre-trained mouth motion discriminator is used to determine whether the user moving-mouth feature sequence represents language input or other mouth movement.
In one example, the user moving-mouth feature sequence input to the mouth motion discriminator includes at least three feature data segments identifying three states: the first segment characterizes the mouth beginning to move, the second characterizes the mouth moving continuously, and the third characterizes the mouth stopping.
For example, when the moving-mouth feature sequence is extracted from images of the user's mouth, the mouth motion discriminator extracts the user's mouth motion sequence from the input mouth image sequence. Specifically, based on the mouth feature points and image information, it determines which of the following four states applies: (1) the mouth starts moving, (2) the mouth keeps moving, (3) the mouth stops moving, or (4) other. The result of extracting the mouth motion sequence is the mouth image sequence from state (1) to state (3). This discriminator requires collecting user data and using machine learning to train and apply the model.
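The patent does not prescribe the segmentation logic. Assuming a trained per-frame model already labels each frame with one of the four states (that model is not shown), extracting the subsequences from state (1) to state (3) could look like this sketch:

```python
# Sketch: cut a labeled frame stream into motion segments running from
# state START (1) through CONTINUE (2) to STOP (3); frames labeled
# OTHER (4) reset the current segment. The per-frame labels are assumed
# to come from a trained classifier not shown here.
START, CONTINUE, STOP, OTHER = 1, 2, 3, 4

def extract_motion_segments(frames, states):
    segments, current = [], None
    for frame, state in zip(frames, states):
        if state == START:
            current = [frame]                 # open a new segment
        elif state == CONTINUE and current is not None:
            current.append(frame)
        elif state == STOP and current is not None:
            current.append(frame)
            segments.append(current)          # complete segment: (1) .. (3)
            current = None
        else:
            current = None                    # OTHER or stray label: reset
    return segments
```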
The discriminator is a binary classifier, trained on collected user data using machine learning methods. It determines whether the mouth movement corresponds to speaking a natural language, rather than to confusable mouth movements arising in other situations, including but not limited to the user eating, yawning, or making unconscious movements. This discriminator requires collecting user data and using machine learning to train and apply the model.
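As a hedged illustration of one possible discriminator (the patent does not fix a model), pooled geometric features per segment could feed an off-the-shelf binary classifier; the feature design below, mouth-opening and mouth-width statistics, is an assumption of this sketch:

```python
# Sketch: a speech-vs-other binary classifier over pooled mouth features.
# The feature design (mean/std of mouth opening and width per segment) is
# an illustrative assumption, not something the patent specifies.
import numpy as np
from sklearn.svm import SVC

def segment_features(landmark_seq):
    """landmark_seq: array of shape (T, 20, 2), mouth landmarks per frame."""
    opening = np.linalg.norm(landmark_seq[:, 14] - landmark_seq[:, 18], axis=1)
    width = np.linalg.norm(landmark_seq[:, 0] - landmark_seq[:, 6], axis=1)
    return np.array([opening.mean(), opening.std(), width.mean(), width.std()])

def train_discriminator(segments, labels):
    """segments: list of landmark sequences; labels: 1 = speaking, 0 = other."""
    feats = np.stack([segment_features(seq) for seq in segments])
    return SVC(probability=True).fit(feats, labels)
```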
In step S1300, when the user moving-mouth feature sequence is determined to represent language input, it is determined whether the user is performing silent language input.
In one example, determining whether the user is performing silent language input includes: determining the degree of matching between the mouth feature sequence and the sound signal sequence according to a predetermined matching model between mouth features and sound signals under silent language input, and determining that the user is performing silent language input if the degree of matching exceeds a predetermined threshold.
Specifically, in one example, the input is the mouth motion image sequence and the human voice signal collected by the microphone over the same interval, and the output is the degree of matching p between the two signals. If p is greater than a certain threshold, the mouth motion image sequence is determined to be a voiced sequence, i.e., the user is performing voiced speech input; otherwise, it is determined that the user is indeed performing silent speech input. This determiner requires collecting user data and using machine learning to train and apply the model.
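The matching model itself is learned from user data. As a crude, illustrative stand-in only, the degree of matching p could be approximated by correlating the per-frame mouth-opening envelope with the short-time audio energy over the same interval:

```python
# Sketch: a naive audio-visual matching degree p, computed as the
# correlation between per-frame mouth opening and short-time audio energy
# over the same interval. A real system would learn this matching model
# from user data; this proxy is for illustration only.
import numpy as np

def matching_degree(mouth_opening, audio, sample_rate, fps):
    hop = int(sample_rate / fps)             # audio samples per video frame
    n = min(len(mouth_opening), len(audio) // hop)
    if n < 2:
        return 0.0
    energy = np.array([np.sum(audio[i * hop:(i + 1) * hop] ** 2) for i in range(n)])
    p = np.corrcoef(mouth_opening[:n], energy)[0, 1]
    return 0.0 if np.isnan(p) else float(p)

def is_silent(mouth_opening, audio, sample_rate, fps, threshold=0.5):
    # p above the threshold suggests voiced speech; below it, the segment
    # is treated as silent speech input (threshold value is an assumption).
    return matching_degree(mouth_opening, audio, sample_rate, fps) < threshold
```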
In step S1400, when it is determined that the user is performing silent language input, the content of the silent language input is recognized.
There is no restriction here on the technique used to recognize the content of the silent language input; any technique capable of doing so may be employed, whether existing or developed in the future.
Regarding application scenarios of the silent speech input technique of the present invention, in one example a device that supports silent speech input (such as a mobile phone or tablet) recognizes what the user says by capturing, through one or more specific sensors (such as electromyography sensors or cameras), the signals (or images) produced by the user's mouth movements.
In a more specific example, the computing device captures and recognizes image sequences of the user's moving mouth through a camera. For example, when using a smartphone, computer, or head-mounted device, the user issues a voice command or content in the form of silent speech; the camera on the device captures it, the command or content is recognized, and the computing device responds with corresponding feedback. For example, if the user mouths "打开微信" ("Open WeChat"), the phone recognizes the command and launches the WeChat application.
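Responding to a recognized instruction can be as simple as a lookup from recognized text to a device action; the command table and the launch_app helper below are hypothetical, not a real platform API:

```python
# Sketch: map recognized silent-speech commands to device actions.
# launch_app is a hypothetical platform helper, not a real API.
def launch_app(package_name):
    raise NotImplementedError("platform-specific app launcher")

COMMANDS = {
    "打开微信": lambda: launch_app("com.tencent.mm"),  # "Open WeChat"
}

def respond(recognized_text):
    action = COMMANDS.get(recognized_text)
    if action is not None:
        action()
```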
FIG. 2 shows a schematic diagram of the operation and signal flow of hardware and/or software modules according to an embodiment of the present invention.
102, 104: The camera 104 acquires image sequences of the user 102 in real time; the image information may include, but is not limited to, RGB data, structured-light or infrared point cloud data, and depth point cloud data.
106 Face recognition module: uses machine learning and computer vision methods to recognize the position of the user's face and extract the user's facial feature points, and obtains real-time images of the user's mouth through the feature points; the image information may still include, but is not limited to, RGB and point cloud data.
108 Mouth motion sequence extraction module: one example is a discriminator that, based on the mouth feature points and image information, determines which of the following four states applies: (1) the mouth starts moving, (2) the mouth keeps moving, (3) the mouth stops moving, or (4) other. The output of this module is the mouth image sequence from state (1) to state (3). This discriminator requires collecting user data and using machine learning to train and apply the model.
110 Language input detection module: one example is a binary classifier that, based on the mouth motion image sequence output by the mouth motion sequence extraction module 108, determines whether the mouth movement corresponds to speaking a natural language rather than to confusable mouth movements arising in other situations, including but not limited to the user eating, yawning, or making unconscious movements. This classifier requires collecting user data and using machine learning to train and apply the model.
112 Sound signal detection module: the input is the mouth motion image sequence output by the mouth motion sequence extraction module 108 and the human voice signal collected by the microphone over the same interval; the output is the degree of matching p of these two signals. If p is greater than a certain threshold, the mouth motion image sequence is determined to be a voiced sequence, i.e., the user is performing voiced speech input; otherwise, it is determined that the user is indeed performing silent speech input. This module requires collecting user data and using machine learning to train and apply the model.
114: The final recognition model recognizes the instruction or content issued by the user.
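Read end to end, modules 104 through 114 form a pipeline. The sketch below wires the earlier illustrative helpers into that flow; label_states (the per-frame state model), discriminator, and recognize are assumed stand-ins, and audio/video alignment is deliberately simplified:

```python
# Sketch: end-to-end flow of FIG. 2 built from the illustrative helpers
# above. label_states, discriminator, and recognize are assumed stand-ins;
# audio/video synchronization is deliberately simplified.
import numpy as np

def silent_speech_pipeline(frames, audio, sample_rate, fps,
                           label_states, discriminator, recognize):
    rois = [r for r in (mouth_roi(f) for f in frames) if r is not None]   # module 106
    landmarks = np.stack([pts for _, pts in rois])
    states = label_states(landmarks)                                      # module 108 (assumed model)
    for segment in extract_motion_segments(list(range(len(states))), states):
        seg_lm = landmarks[segment]
        feats = segment_features(seg_lm)
        if discriminator.predict(feats[None])[0] != 1:                    # module 110: not speech
            continue
        opening = np.linalg.norm(seg_lm[:, 14] - seg_lm[:, 18], axis=1)
        if is_silent(opening, audio, sample_rate, fps):                   # module 112
            return recognize(seg_lm)                                      # module 114 (assumed)
    return None
```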
According to another aspect of the present invention, a computing device is provided, comprising: a sensor capable of capturing mouth motion signals; and a controller and a memory, the memory storing computer-executable instructions that, when executed by the controller, are operable to perform the aforementioned silent speech input recognition method.
According to yet another aspect of the present invention, a computer-readable storage medium is provided, having computer-executable instructions stored thereon that, when executed by a computer, are operable to perform the aforementioned silent speech input recognition method.
According to another aspect of the present invention, a silent speech input recognition apparatus is provided, comprising: a user moving-mouth feature sequence obtaining component that obtains the user moving-mouth feature sequence; a language input detection module that uses a pre-trained mouth motion discriminator to determine whether the feature sequence represents language input or other mouth movement; a silent language input judgment module that, when the feature sequence is determined to represent language input, determines whether the user is performing silent language input; and a silent language input content recognition module that, when the user is determined to be performing silent language input, recognizes the content of the silent language input.
With the silent speech input recognition method of the present invention, the computing device first determines whether the user is actually performing silent speech input, rather than the user's mouth performing other natural movements or producing voiced speech; by filtering out irrelevant input, the recognition accuracy of the silent speech input content can be improved.
Embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. Therefore, the protection scope of the present invention should be determined by the scope of the claims.

Claims (11)

  1. A silent speech input recognition method, comprising:
    obtaining a feature sequence of the user's moving mouth;
    using a pre-trained mouth motion discriminator to determine whether the user moving-mouth feature sequence represents language input or other mouth movement;
    when the user moving-mouth feature sequence is determined to represent language input, determining whether the user is performing silent language input; and
    when the user is determined to be performing silent language input, recognizing the content of the silent language input.
  2. The silent speech input recognition method according to claim 1, wherein the moving-mouth feature sequence is extracted from a moving-mouth image sequence captured by an electromyography sensor.
  3. The silent speech input recognition method according to claim 1, wherein the moving-mouth feature sequence is extracted from a moving-mouth image sequence captured by a camera.
  4. The silent speech input recognition method according to claim 3, wherein the moving-mouth image data is one or a combination of RGB data, structured light, infrared point cloud data, and depth point cloud data.
  5. The silent speech input recognition method according to claim 3, wherein the moving-mouth image sequence is obtained as follows:
    recognizing the position of the user's face based on machine learning and extracting the user's facial feature points, and acquiring real-time images of the user's mouth through the feature points.
  6. The silent speech input recognition method according to claim 1, wherein the user moving-mouth feature sequence input to the mouth motion discriminator includes at least three feature data segments identifying three states: the first feature data segment characterizes the mouth beginning to move, the second feature data segment characterizes the mouth moving continuously, and the third feature data segment characterizes the mouth stopping.
  7. The silent speech input recognition method according to claim 1, wherein the discriminator is a binary classifier, trained on collected user data using machine learning methods.
  8. The silent speech input recognition method according to claim 1, wherein, when the user moving-mouth feature sequence is determined to represent language input, determining whether the user is performing silent language input comprises:
    determining the degree of matching between the mouth feature sequence and the sound signal sequence according to a predetermined matching model between mouth features and sound signals under silent language input, and determining that the user is performing silent language input if the degree of matching exceeds a predetermined threshold.
  9. The silent speech input recognition method according to claim 8, further comprising:
    after recognizing the content of the silent language input, responding with the recognized instruction or content.
  10. A computing device, comprising:
    a sensor capable of capturing mouth motion signals;
    a controller and a memory, the memory storing computer-executable instructions that, when executed by the controller, are operable to perform the silent speech input recognition method according to any one of claims 1 to 8.
  11. A computer-readable storage medium having computer-executable instructions stored thereon that, when executed by a computer, are operable to perform the silent speech input recognition method according to any one of claims 1 to 8.
PCT/CN2018/114608 2018-10-08 2018-11-08 Silent voice input identification method, computing apparatus, and computer-readable medium WO2020073403A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811168994.2A CN109558788B (en) 2018-10-08 2018-10-08 Silence voice input identification method, computing device and computer readable medium
CN201811168994.2 2018-10-08

Publications (1)

Publication Number Publication Date
WO2020073403A1 true WO2020073403A1 (en) 2020-04-16

Family

ID=65864802

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114608 WO2020073403A1 (en) 2018-10-08 2018-11-08 Silent voice input identification method, computing apparatus, and computer-readable medium

Country Status (2)

Country Link
CN (1) CN109558788B (en)
WO (1) WO2020073403A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220051676A1 (en) * 2020-08-14 2022-02-17 Lenovo (Singapore) Pte. Ltd. Headset boom with infrared lamp(s) and/or sensor(s)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223711B (en) * 2019-06-03 2021-06-01 清华大学 Microphone signal based voice interaction wake-up electronic device, method, and medium
CN110865705B (en) * 2019-10-24 2023-09-19 中国人民解放军军事科学院国防科技创新研究院 Multi-mode fusion communication method and device, head-mounted equipment and storage medium
CN113160813B (en) * 2021-02-24 2022-12-27 北京三快在线科技有限公司 Method and device for outputting response information, electronic equipment and storage medium
CN113810819B (en) * 2021-09-23 2022-06-28 中国科学院软件研究所 Method and equipment for acquiring and processing silent voice based on ear cavity vibration
CN115857706B (en) * 2023-03-03 2023-06-06 浙江强脑科技有限公司 Character input method and device based on facial muscle state and terminal equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023703A (en) * 2009-09-22 2011-04-20 现代自动车株式会社 Combined lip reading and voice recognition multimodal interface system
CN104808794A (en) * 2015-04-24 2015-07-29 北京旷视科技有限公司 Method and system for inputting lip language
CN105335755A (en) * 2015-10-29 2016-02-17 武汉大学 Media segment-based speaking detection method and system
WO2016148322A1 (en) * 2015-03-19 2016-09-22 삼성전자 주식회사 Method and device for detecting voice activity based on image information
CN106250829A (en) * 2016-07-22 2016-12-21 中国科学院自动化研究所 Digit recognition method based on lip texture structure
CN107358167A (en) * 2017-06-19 2017-11-17 西南科技大学 A kind of method of discrimination of yawning based on active infrared video

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101752B (en) * 2007-07-19 2010-12-01 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character
CN101950249B (en) * 2010-07-14 2012-05-23 北京理工大学 Input method and device for code characters of silent voice notes
US20140379351A1 (en) * 2013-06-24 2014-12-25 Sundeep Raniwala Speech detection based upon facial movements
DE112014007265T5 (en) * 2014-12-18 2017-09-07 Mitsubishi Electric Corporation Speech recognition device and speech recognition method
CN105912092B (en) * 2016-04-06 2019-08-13 北京地平线机器人技术研发有限公司 Voice awakening method and speech recognition equipment in human-computer interaction
CN108154140A (en) * 2018-01-22 2018-06-12 北京百度网讯科技有限公司 Voice awakening method, device, equipment and computer-readable medium based on lip reading
CN108537207B (en) * 2018-04-24 2021-01-22 Oppo广东移动通信有限公司 Lip language identification method, device, storage medium and mobile terminal

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023703A (en) * 2009-09-22 2011-04-20 现代自动车株式会社 Combined lip reading and voice recognition multimodal interface system
WO2016148322A1 (en) * 2015-03-19 2016-09-22 삼성전자 주식회사 Method and device for detecting voice activity based on image information
CN104808794A (en) * 2015-04-24 2015-07-29 北京旷视科技有限公司 Method and system for inputting lip language
CN105335755A (en) * 2015-10-29 2016-02-17 武汉大学 Media segment-based speaking detection method and system
CN106250829A (en) * 2016-07-22 2016-12-21 中国科学院自动化研究所 Digit recognition method based on lip texture structure
CN107358167A (en) * 2017-06-19 2017-11-17 西南科技大学 A kind of method of discrimination of yawning based on active infrared video

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220051676A1 (en) * 2020-08-14 2022-02-17 Lenovo (Singapore) Pte. Ltd. Headset boom with infrared lamp(s) and/or sensor(s)
US11935538B2 (en) * 2020-08-14 2024-03-19 Lenovo (Singapore) Pte. Ltd. Headset boom with infrared lamp(s) and/or sensor(s)

Also Published As

Publication number Publication date
CN109558788A (en) 2019-04-02
CN109558788B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
WO2020073403A1 (en) Silent voice input identification method, computing apparatus, and computer-readable medium
CN109192204B (en) Voice control method based on intelligent equipment camera and intelligent equipment
CN104361276B (en) A kind of multi-modal biological characteristic identity identifying method and system
TWI661363B (en) Smart robot and human-computer interaction method
CN103824481B (en) Method and device for detecting user recitation
JP5323770B2 (en) User instruction acquisition device, user instruction acquisition program, and television receiver
US20130054240A1 (en) Apparatus and method for recognizing voice by using lip image
TW201937344A (en) Smart robot and man-machine interaction method
WO2017219450A1 (en) Information processing method and device, and mobile terminal
JP2010256391A (en) Voice information processing device
US11062126B1 (en) Human face detection method
WO2020140840A1 (en) Method and apparatus for awakening wearable device
Patil et al. LSTM Based Lip Reading Approach for Devanagiri Script
Shinde et al. Real time two way communication approach for hearing impaired and dumb person based on image processing
JP2007199552A (en) Device and method for speech recognition
KR101187600B1 (en) Speech Recognition Device and Speech Recognition Method using 3D Real-time Lip Feature Point based on Stereo Camera
KR20220041891A (en) How to enter and install facial information into the database
JP6147198B2 (en) robot
US20140037150A1 (en) Information processing device
KR101950721B1 (en) Safety speaker with multiple AI module
KR20210066774A (en) Method and Apparatus for Distinguishing User based on Multimodal
JP7032284B2 (en) A device, program and method for estimating the activation timing based on the image of the user's face.
CN110653812B (en) Interaction method of robot, robot and device with storage function
JP6396813B2 (en) Program, apparatus and method for estimating learning items spent on learning from learning video
CN112567455A (en) Method and system for cleansing sound using depth information and computer readable medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18936269

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18936269

Country of ref document: EP

Kind code of ref document: A1