WO2017035768A1 - Voice control method based on visual wake-up

Info

Publication number: WO2017035768A1
Application number: PCT/CN2015/088723
Authority: WO
Inventors: 涂悦
Original assignee: 涂悦
Priority date: 2015-09-01
Filing date: 2015-09-01
Publication date: 2017-03-09

Abstract

Provided is a voice control method based on visual wake-up, which is used for waking up a voice controlled device so as to allow the voice controlled device to respond to a voice signal received by said device. The voice control method of the present invention comprises: a voice controlled device initiating, upon receiving at least a part of voice signals, an image receiving unit mounted thereon; the image receiving unit acquiring an image and transmitting same to an image recognition unit; the image recognition unit recognizing the image, when a human face whose sightline is directed to the voice controlled device is detected in the image, the voice controlled device being woken up to recognize the voice signals. The present invention wakes up a voice recognition unit by means of a visual wake-up function of searching for a human face whose sightline is directed to the voice controlled device, better conforming with the daily voice interaction habits of the user, being more convenient to use and smarter.

Description

A voice control method based on visual wake-up

Technical field

The present invention relates to the field of intelligent control technologies, and in particular, to a voice control method based on visual wake-up.

Background art

As technology has developed from manual control to voice control, intelligent voice technology is gradually making its way into TVs, homes, vehicles, wearable devices and other fields. More and more devices support voice control. Future smart homes are likely to be based entirely or largely on voice control.

Figure 1 shows the structure of a typical voice control device, which comprises a voice receiving unit 1 (typically a microphone), a voice recognition unit 2, and a processing unit 3. The voice recognition unit 2 acquires the voice signal from the voice receiving unit 1, recognizes it, and transmits the recognition result to the processing unit 3. The processing unit 3 then instructs the voice control device to execute the command corresponding to the voice signal.

When a plurality of voice control devices such as the one shown in FIG. 1 are present, an important feature of voice interaction with these devices is voice wake-up. To address the devices individually, a command must reach exactly one intended device without affecting the others, so only the device that is to receive the command should be woken up. Currently, voice control devices are generally woken up by wake-up words, such as the name or code name of the device.

However, the current voice wake-up method has inherent defects. For example, when the user says a word that is the same as or similar to the wake-up word, the device is woken up even though the user did not intend to wake it. In addition, the user has to say the wake-up word every time the device is to be woken up, which makes for a poor user experience.

A common habit in voice interaction is to look at the party one is speaking to. When using a voice control device, the user is likewise accustomed to looking at the device. Therefore, compared with the current voice wake-up, determining the target device to wake up by detecting the user's gaze is more in line with the user's daily experience.

Accordingly, those skilled in the art are directed to developing a voice control method based on visual wake-up, so as to wake up a target device more intelligently.

Summary of the invention

To achieve the above object, the present invention provides a voice control method based on visual wake-up, for waking up a voice control device so that the voice control device replies to a voice signal it receives, characterized in that the voice control method includes:

Step 1: After receiving at least part of the voice signal, the voice control device starts an image receiving unit mounted thereon;

Step 2: The image receiving unit acquires an image and transmits the image to an image recognition unit;

Step 3: The image recognition unit recognizes the image, and when a face whose line of sight is toward the voice control device is detected in the image, the voice control device is woken up to recognize the voice signal.

Optionally, the image receiving unit is a camera.

Further, the camera is a wide-angle camera.

Optionally, the image receiving unit is a rotatable camera, the rotatable camera comprises a pan/tilt, and the pan/tilt is mounted on the voice control device.

Further, the pan/tilt is 2-axis driven.

Further, step 1 includes: the voice control device distinguishes the source direction of the voice signal according to the received at least part of the voice signal; when the voice control device can determine the source direction of the voice signal, the voice control device instructs the camera to turn to the source direction of the voice signal and acquire an image; and when the voice control device cannot determine the source direction of the voice signal, the voice control device instructs the camera to rotate within its maximum range of rotation angles and acquire images.

Further, step 3 includes:

For the case where the voice control device can determine the source direction of the voice signal: when the image recognition unit detects in the image a face whose line of sight is toward the voice control device, the voice control device, after receiving the voice signal, recognizes the voice signal and makes a reply;

For the case where the voice control device cannot determine the source direction of the voice signal: when the image recognition unit detects in the image a face whose line of sight is toward the voice control device, the face is speaking, and the voice signal has not yet been fully received, the voice control device recognizes the voice signal after receiving it and makes a reply; when the image recognition unit detects in the image a face whose line of sight is toward the voice control device, the face is not speaking, and the voice signal has been received, the voice control device recognizes the voice signal and makes a reply, and if the voice control device cannot recognize the voice signal, it makes no reply.

Further, when no face whose line of sight is toward the voice control device is detected in the image, the voice control device is not woken up.

Further, the voice control device receives the voice signal through a voice receiving unit, and identifies the voice signal through a voice recognition unit.

Further, the voice receiving unit is a microphone.

With the voice control method based on visual wake-up of the present invention, the voice control device activates a visual wake-up function when it starts to receive a voice signal originating from a user: an image receiving unit and an image recognition unit search for a face whose line of sight is toward the voice control device, either in the source direction of the voice signal or across the entire area, to determine whether to wake up the voice control device. The awakened voice control device recognizes the received voice signal through the voice recognition unit and responds accordingly. By waking up the voice recognition unit through this visual wake-up function, the invention better matches the user's daily voice interaction habits and is more convenient and intelligent to use.

The concept, specific structure and technical effects of the present invention are further described below with reference to the accompanying drawings, so that the objects, features and effects of the invention can be fully understood.

Drawings

Fig. 1 is a structural block diagram of a prior art voice control device.

Fig. 2 is a block diagram showing a form of a voice control device to which the visual wake-up based voice control method of the present invention is applied.

Fig. 3 is a block diagram showing another form of the voice control device to which the visual wake-up based voice control method of the present invention is applied.

Fig. 4 is a flow chart of the visual wake-up based voice control method of the present invention as applied to the voice control device shown in Fig. 3.

Detailed description

As shown in FIG. 2, in a preferred embodiment of the present invention, the voice control device to which the visual wake-up based voice control method of the present invention is applied includes a voice receiving unit 1, an image receiving unit 11, a voice recognition unit 2, an image recognition unit 12, and a processing unit 13. The voice receiving unit 1 is a microphone, and the image receiving unit 11 is a camera, preferably a wide-angle camera; the voice receiving unit 1 and the image receiving unit 11 are mounted on the casing of the voice control device. The voice recognition unit 2 acquires the voice signal from the voice receiving unit 1, recognizes it, and sends the recognition result to the processing unit 13. The voice recognition unit 2 used in this example can be any prior art software (and hardware) with a speech recognition function. The image recognition unit 12 acquires an image from the image receiving unit 11, performs image recognition, and transmits the recognition result to the processing unit 13. The image recognition unit 12 employed in this example may be any prior art software capable of recognizing a face and its line-of-sight direction, for example that of the Chinese patent application 'A human-computer interaction method and system based on line of sight judgment' (Application No.: CN201210261378.8) or the Chinese patent application 'Fast and accurate human eye positioning method and line of sight estimation method based on human eye positioning' (Application No.: CN201510152613.1). In addition, the processing unit 13 can issue instructions to the voice recognition unit 2 and the image recognition unit 12 to direct their operation.

The visual wake-up based voice control method of the present invention, applied to the voice control device shown in FIG. 2, includes:

Step 1: After the voice receiving unit 1 of the voice control device receives at least part of the voice signal, for example the first 1-2 syllables, the image receiving unit 11 is activated.

Step 2: The image receiving unit 11 acquires an image and transmits it to the image recognition unit 12; that is, the camera serving as the image receiving unit 11 acquires an image within its field of view and transmits the image to the image recognition unit 12.

Step 3: The image recognition unit 12 recognizes the image. When a face whose line of sight is toward the voice control device is detected in the image, the image recognition unit 12 sends this recognition result to the processing unit 13, which causes the voice control device to be woken up. The processing unit 13 then puts the voice recognition unit 2 into operation; the voice recognition unit 2 receives the complete voice signal, recognizes it, and transmits the recognition result to the processing unit 13, which causes the voice control device to reply to the voice signal.
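The three steps above can be sketched in code. The following Python sketch is only an illustration under stated assumptions, not the implementation of the invention: the names Face, visual_wake_up, detect_faces and recognize_speech are hypothetical, detect_faces stands in for the image recognition unit 12, and recognize_speech stands in for the voice recognition unit 2.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Face:
    """Hypothetical result of the image recognition unit for one detected face."""
    looking_at_device: bool


def visual_wake_up(
    frames: Iterable[object],
    detect_faces: Callable[[object], List[Face]],
    recognize_speech: Callable[[], str],
) -> Optional[str]:
    """Return the recognized speech if a face looking at the device is found, else None."""
    # Step 1 is assumed to have happened already: the first syllables were
    # heard and the camera (image receiving unit) was started.
    for frame in frames:                  # Step 2: acquire images
        faces = detect_faces(frame)       # Step 3: recognize faces and gaze
        if any(face.looking_at_device for face in faces):
            return recognize_speech()     # device is woken up
    return None                           # no gaze toward the device: not woken up
```

In this sketch the speech recognizer is only called once a gaze toward the device has been found, which mirrors the idea that the voice recognition unit is put into operation by the wake-up rather than running on every sound.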

More preferably, as shown in FIG. 3, in another preferred embodiment of the present invention, the voice control device to which the visual wake-up based voice control method of the present invention is applied includes a voice receiving unit 1, an image receiving unit 21, a voice recognition unit 2, an image recognition unit 22, and a processing unit 23. The voice receiving unit 1 is a microphone; the image receiving unit 21 is a rotatable camera, such as a rotatable camera having a 2-axis driven pan/tilt that can rotate about a horizontal axis and a vertical axis. The voice receiving unit 1 and the image receiving unit 21 are mounted on the housing of the voice control device, with the pan/tilt of the rotatable camera mounted on the housing. The voice recognition unit 2 acquires the voice signal from the voice receiving unit 1, recognizes it, and sends the recognition result to the processing unit 23. The voice recognition unit 2 used in this example can be any prior art software (and hardware) that has a speech recognition function and is capable of discerning the source direction of the speech. The image recognition unit 22 acquires an image from the image receiving unit 21, performs image recognition, and transmits the recognition result to the processing unit 23. As in the previous example, the image recognition unit 22 may be any prior art software capable of recognizing a face and its line-of-sight direction. In addition, the processing unit 23 can issue instructions to the voice recognition unit 2 and the image recognition unit 22 to direct their operation; the processing unit 23 can also control the pan/tilt of the rotatable camera serving as the image receiving unit 21, thereby controlling the direction and angle of rotation of the rotatable camera.
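As a rough illustration of how the processing unit 23 might command the 2-axis pan/tilt, the following Python sketch models a head that clamps requested angles to its rotation range. The class name PanTilt, the method point_to, and the limit values are assumptions made for the example; the description does not specify them.

```python
from dataclasses import dataclass


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class PanTilt:
    """Hypothetical 2-axis head: pan about the vertical axis, tilt about the horizontal axis."""
    pan_deg: float = 0.0
    tilt_deg: float = 0.0
    pan_limit_deg: float = 170.0   # assumed range, not from the description
    tilt_limit_deg: float = 45.0   # assumed range, not from the description

    def point_to(self, pan_deg: float, tilt_deg: float) -> None:
        """Turn toward the requested direction, staying inside the rotation range."""
        self.pan_deg = clamp(pan_deg, -self.pan_limit_deg, self.pan_limit_deg)
        self.tilt_deg = clamp(tilt_deg, -self.tilt_limit_deg, self.tilt_limit_deg)
```

Clamping keeps a request outside the maximum range of rotation angles from being passed on to the motors; a real controller would also have to handle motion speed and feedback, which are left out here.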

A flowchart of the visual wake-up based voice control method of the present invention, applied to the voice control device shown in FIG. 3, is shown in FIG. 4. The method includes:

Step 1: After the voice receiving unit 1 of the voice control device receives at least part of the voice signal, for example the first 1-2 syllables, the image receiving unit 21 is activated. At the same time, the voice recognition unit 2 attempts to determine the source direction of the voice signal from the partial voice signal received.

Step 2: The image receiving unit 21 acquires an image and transmits it to the image recognition unit 22. If the source direction of the voice signal was determined in step 1, the processing unit 23 turns the rotatable camera toward the source direction of the voice signal, and an image is acquired in the region corresponding to that direction. If the source direction of the voice signal could not be determined in step 1, the processing unit 23 rotates the rotatable camera within its maximum range of rotation angles, i.e., images are acquired throughout the entire area, until a face whose line of sight is toward the voice control device is detected in the image.

For the former case, step 2 can be divided into two sub-steps:

Step 1.1: The processing unit 23 turns the rotatable camera toward the source direction of the voice signal and acquires an image in the corresponding area;

Step 1.2: The image recognition unit 22 analyzes the acquired image, determines whether there is a face in the image and, if so, whether the line of sight of the face is toward the voice control device. If yes, proceed to step 3 below. If not, return to step 1.1.

For the latter case, step 2 can be divided into two sub-steps:

Step 2.1: The processing unit 23 rotates the rotatable camera within its maximum range of rotation angles, acquires images containing a face across the entire area, and stops rotating the rotatable camera once the search is complete;

Step 2.2: The image recognition unit 22 analyzes the acquired images containing a face and determines whether the line of sight of the face is toward the voice control device. If yes, proceed to step 3 below. If not, return to step 2.1.
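The two search strategies above can be sketched as follows. This Python sketch is a simplification under stated assumptions: the names sweep_angles and find_gaze, the single pan axis, the default angle range, the step size, and the limit on repeated looks toward the source direction (max_attempts) are all hypothetical. The callback gaze_at stands in for "point the camera at this angle, acquire an image, and ask the image recognition unit 22 whether a face is looking at the device". The fallback to a full sweep follows the description's remark that an unsuccessful look in the source direction is treated like an undetermined direction.

```python
from typing import Callable, Iterator, Optional


def sweep_angles(min_deg: float, max_deg: float, step_deg: float) -> Iterator[float]:
    """Yield pan angles covering the camera's rotation range (step 2.1)."""
    angle = min_deg
    while angle <= max_deg:
        yield angle
        angle += step_deg


def find_gaze(
    gaze_at: Callable[[float], bool],
    source_deg: Optional[float],
    min_deg: float = -90.0,
    max_deg: float = 90.0,
    step_deg: float = 30.0,
    max_attempts: int = 3,
) -> Optional[float]:
    """Return the pan angle at which a face looking at the device was seen, else None."""
    if source_deg is not None:
        # Steps 1.1 and 1.2: look toward the sound source, re-checking a few times.
        for _ in range(max_attempts):
            if gaze_at(source_deg):
                return source_deg
    # Steps 2.1 and 2.2: direction unknown (or nothing found there), scan the whole area.
    for angle in sweep_angles(min_deg, max_deg, step_deg):
        if gaze_at(angle):
            return angle
    return None
```

Bounding the retries is a design choice of the sketch so that the loop between steps 1.1 and 1.2 always terminates; the description itself does not state a limit.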

Step 3: The image recognition unit 22 recognizes the image. In the case where the source direction of the voice signal was determined in step 1 and a face whose line of sight is toward the voice control device is detected in the image acquired in the region corresponding to that direction, the image recognition unit 22 sends this recognition result to the processing unit 23, which causes the voice control device to be woken up. The processing unit 23 then puts the voice recognition unit 2 into operation; the voice recognition unit 2 receives the complete voice signal, recognizes it, and sends the recognition result to the processing unit 23, which causes the voice control device to reply to the voice signal. Preferably, the response in this case can be analyzed in more detail, as shown in FIG. 4, by also considering the point in time at which the voice recognition unit 2 receives the complete voice signal. Specifically:

1. When the image recognition unit 22 confirms that an image of a face whose line of sight is toward the voice control device has been acquired and the voice recognition unit 2 has already received the complete voice signal, i.e., the speech has stopped by this time, the voice recognition unit 2 recognizes the received voice signal;

2. When the image recognition unit 22 confirms that an image of a face whose line of sight is toward the voice control device has been acquired but the voice recognition unit 2 has not yet received the complete voice signal, i.e., the speech has not stopped at this time, the processing unit 23 causes the image recognition unit 22 to judge whether the face in the acquired image is speaking.

If so, it can be judged that the voice signal being received is uttered by that person, so the processing unit 23 keeps the camera aimed at the face until the voice signal has been completely received. The voice recognition unit 2 then receives the complete voice signal and recognizes it, and sends the recognition result to the processing unit 23, which causes the voice control device to reply to the voice signal;

If not, it can be judged that the voice signal being received is not uttered by that person, so the search must be performed again, i.e., the method returns to step 2.

It can be seen that such refined analysis can adapt to more complex environments, for example correctly finding the person who uttered the voice signal when multiple people are present in the scene.
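The refined analysis of cases 1 and 2 amounts to a small decision function. The following Python sketch restates it under the assumption that a face looking at the device has already been found; the names Action and next_action are hypothetical and chosen only for this illustration.

```python
from enum import Enum


class Action(Enum):
    RECOGNIZE_NOW = "recognize the voice signal already received"
    TRACK_THEN_RECOGNIZE = "keep the camera on the face, recognize when the speech ends"
    SEARCH_AGAIN = "return to step 2 and search for another face"


def next_action(speech_complete: bool, face_is_speaking: bool) -> Action:
    """Decide what to do once a face looking at the device has been found."""
    if speech_complete:
        # Case 1: the speech has already stopped.
        return Action.RECOGNIZE_NOW
    if face_is_speaking:
        # Case 2, "if so": the person looking at the device is the speaker.
        return Action.TRACK_THEN_RECOGNIZE
    # Case 2, "if not": the voice comes from someone else.
    return Action.SEARCH_AGAIN
```

For example, next_action(False, False) corresponds to a scene in which one person looks at the device while another person is talking, so the search continues.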

For the case where the source direction of the voice signal was determined in step 1, but no face whose line of sight is toward the voice control device is detected in the image acquired in the region corresponding to that direction, the source direction is treated as undeterminable and the method returns to step 2: the processing unit 23 rotates the rotatable camera within its maximum range of rotation angles to acquire images throughout the entire area.

For the case where the source direction of the voice signal could not be determined in step 1, a face whose line of sight is toward the voice control device is detected in the images acquired throughout the entire area, the face is speaking, and the voice receiving unit 1 has not yet finished receiving the voice signal: the image recognition unit 22 sends the recognition result to the processing unit 23, which causes the voice control device to be woken up. The processing unit 23 keeps the camera aimed at the face until the voice signal has been completely received, and then puts the voice recognition unit 2 into operation; the voice recognition unit 2 receives the complete voice signal, recognizes it, and sends the recognition result to the processing unit 23, which causes the voice control device to reply to the voice signal.

For the case where the source direction of the voice signal could not be determined in step 1, a face whose line of sight is toward the voice control device is detected in the images acquired throughout the entire area, the face is not speaking, and the voice receiving unit 1 has already received the voice signal: the image recognition unit 22 sends the recognition result to the processing unit 23, which causes the voice control device to be woken up. The processing unit 23 then puts the voice recognition unit 2 into operation, and the voice recognition unit 2 receives the complete voice signal and attempts to recognize it. If the voice recognition unit 2 can recognize the voice signal, it sends the recognition result to the processing unit 23, and if the processing unit 23 can correctly understand the recognition result (for example, it matches an instruction in the built-in operation instruction set), the processing unit 23 causes the voice control device to reply to the voice signal; if the voice recognition unit 2 cannot recognize the voice signal, the processing unit 23 causes the voice control device not to reply to the voice signal.
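The final check in the last case, replying only when the recognition result can be understood, can be illustrated with a lookup against a built-in instruction set. In this Python sketch the function name reply_to, the dictionary-based instruction set, and the normalization of case and whitespace are assumptions; the description only says that the result should match one of the built-in operation instructions.

```python
from typing import Mapping, Optional


def reply_to(recognized_text: Optional[str], instructions: Mapping[str, str]) -> Optional[str]:
    """Return the operation for a recognized utterance, or None to make no reply."""
    if recognized_text is None:
        # The voice recognition unit could not recognize the voice signal.
        return None
    # Normalizing case and surrounding whitespace is an assumption of this sketch.
    return instructions.get(recognized_text.strip().lower())
```

Returning None both for unrecognized speech and for text that matches no instruction keeps the device silent when a person merely happens to look at it while someone else's speech is picked up.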

The preferred embodiments of the invention have been described in detail above. It will be appreciated that many modifications and variations can be made to the present invention without departing from its scope. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation should fall within the scope of protection determined by the claims.

Claims

1. A voice control method based on visual wake-up, used for waking up a voice control device so that the voice control device replies to a voice signal it receives, wherein the voice control method includes:

Step 1: After receiving at least part of the voice signal, the voice control device starts an image receiving unit mounted thereon;

Step 2: The image receiving unit acquires an image and transmits the image to an image recognition unit;

Step 3: The image recognition unit recognizes the image, and when a face whose line of sight is toward the voice control device is detected in the image, the voice control device is woken up to recognize the voice signal.
2. The visual wake-up based voice control method according to claim 1, wherein said image receiving unit is a camera.
3. The visual wake-up based voice control method according to claim 2, wherein said camera is a wide-angle camera.
4. The visual wake-up based voice control method according to claim 1, wherein said image receiving unit is a rotatable camera, said rotatable camera comprises a pan/tilt, and said pan/tilt is mounted on a casing of said voice control device.
5. The visual wake-up based voice control method according to claim 4, wherein said pan/tilt is 2-axis driven.
6. The visual wake-up based voice control method according to claim 4 or 5, wherein said step 1 comprises: said voice control device distinguishing a source direction of said voice signal based on said received at least part of said voice signal; when the voice control device is capable of determining the source direction of the voice signal, the voice control device instructs the camera to turn to the source direction of the voice signal and acquire an image; and when the voice control device cannot determine the source direction of the voice signal, the voice control device instructs the camera to rotate within its maximum range of rotation angles and acquire images.
7. The visual wake-up based voice control method according to claim 6, wherein said step 3 comprises:

For the case where the voice control device can determine the source direction of the voice signal: when the image recognition unit detects in the image a face whose line of sight is toward the voice control device, the voice control device, after receiving the voice signal, recognizes the voice signal and makes a reply;

For the case where the voice control device cannot determine the source direction of the voice signal: when the image recognition unit detects in the image a face whose line of sight is toward the voice control device, the face is speaking, and the voice signal has not yet been fully received, the voice control device recognizes the voice signal after receiving it and makes a reply; when the image recognition unit detects in the image a face whose line of sight is toward the voice control device, the face is not speaking, and the voice signal has been received, the voice control device recognizes the voice signal and makes a reply, and if the voice control device cannot recognize the voice signal, it makes no reply.
8. The visual wake-up based voice control method according to claim 7, wherein when no face whose line of sight is toward the voice control device is detected in the image, the voice control device is not woken up.
9. The visual wake-up based voice control method according to claim 1, wherein said voice control device receives said voice signal through a voice receiving unit, and recognizes said voice signal through a voice recognition unit.
10. The visual wake-up based voice control method according to claim 9, wherein the voice receiving unit is a microphone.