CN114220420A - Multi-modal voice wake-up method, device and computer-readable storage medium


Info

Publication number: CN114220420A
Authority: CN (China)
Prior art keywords: voice, user, preset, facial image, modal
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN202210098130.8A
Other languages: Chinese (zh)
Inventor
俞瑞华
陈铖彬
郭永利
柳文斌
Current Assignee (the listed assignees may be inaccurate): GAC Toyota Motor Co Ltd
Original Assignee: GAC Toyota Motor Co Ltd
Priority date (an assumption, not a legal conclusion): 2022-01-26
Filing date: 2022-01-26
Publication date: 2022-03-22
Application filed by GAC Toyota Motor Co Ltd
Priority to CN202210098130.8A
Publication of CN114220420A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a multi-modal voice wake-up method, a device, and a computer-readable storage medium. The multi-modal voice wake-up method comprises the following steps: acquiring facial image features of a user and acquiring voice information from the user; determining, based on the facial image features or the voice information, whether the user has a voice interaction intention; and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant. By recognizing the user's facial image features in combination with the voice information uttered by the user, the invention can determine whether the user intends to interact even when the sound-pickup environment is noisy, and can decide accordingly whether to wake the voice assistant. Interference from the external environment is thus reduced during human-machine interaction, the human-machine interaction experience is improved, and a high wake-up rate during voice interaction is ensured.

Description

Multi-modal voice wake-up method, device and computer-readable storage medium
Technical Field
The invention relates to the technical field of intelligent connected-vehicle interaction, and in particular to a multi-modal voice wake-up method, a multi-modal voice wake-up device, and a computer-readable storage medium.
Background
Existing wake-up methods for voice robots on the market rely mainly on a wake-up word: when the user's speech is recognized as a preset keyword, the robot wakes up and begins interacting with the user. Taking the voice assistant of existing smart devices as an example, when the user utters the corresponding keyword, the voice assistant appears on the user interface.
In the speech recognition process, a common way to reduce recognition interference caused by external noise is to denoise the audio with a microphone array: the multi-channel audio captured by the array is fed into a noise reduction pipeline that performs echo cancellation, dereverberation, beamforming, and similar processing to produce clean single-channel audio, which is then sent to the speech recognition engine for recognition.
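For orientation only, the beamforming stage of such a pipeline can be pictured as a delay-and-sum combiner over the array channels. The following is a minimal sketch, not an implementation from the patent; the sample rate, integer-sample alignment, and steering delays are all simplifying assumptions.

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Combine multi-channel array audio into one channel.

    channels: (n_mics, n_samples) audio from the microphone array.
    delays:   per-microphone steering delays in seconds, assumed known
              from the array geometry and the target direction.
    """
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, delay in zip(channels, delays):
        shift = int(round(delay * sr))   # crude integer-sample alignment
        out += np.roll(ch, shift)        # align, then accumulate
    return out / n_mics                  # average the aligned channels

# Toy usage: 4 microphones, 1 s of synthetic audio, broadside steering.
mono = delay_and_sum(np.random.randn(4, 16000), np.zeros(4))
```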
However, microphone arrays and their noise reduction algorithms are very sensitive to external noise, especially non-stationary noise. When the signal-to-noise ratio falls below 5 dB (decibels), algorithm performance degrades rapidly, so a single-dimensional, audio-only noise reduction algorithm struggles to meet speech recognition requirements. Moreover, when several people converse after the voice robot has been woken up, the voice data stream may contain everyone's speech, so a clear instruction from the intended user cannot be extracted.
Disclosure of Invention
The main object of the invention is to provide a multi-modal voice wake-up method, a multi-modal voice wake-up device, and a computer-readable storage medium, so as to solve the technical problem of improving the voice wake-up rate in noisy environments and thereby improving the recognition effect.
In order to achieve the above object, the invention provides a multi-modal voice wake-up method comprising the following steps:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
Optionally, the facial image features include lip contour features, and the step of acquiring facial image features of the user includes:
acquiring whole-face frame image data of the user through a preset camera, and obtaining the lip contour features based on the whole-face frame image data.
Optionally, the step of obtaining the lip contour features based on the whole-face frame image data is followed by:
comparing the lip contour features at different moments, and determining whether a preset number of key-point coordinates have changed position;
and if a preset number of key-point coordinates have changed position, obtaining a continuous lip action sequence based on those position changes.
Optionally, the step of determining whether the user has a voice interaction intention based on the facial image features includes:
recording a dedicated lip action sequence for the wake-up word, based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions;
comparing the continuous lip action sequence with the dedicated lip action sequence;
and if the comparison result falls within a preset threshold interval, determining that the user has a voice interaction intention.
Optionally, the facial image features further include eyeball image features, and the step of acquiring facial image features of the user includes:
acquiring eye information of the user through a preset camera, and obtaining the eyeball image features based on the eye information.
Optionally, the step of obtaining the eyeball image features based on the eye information includes:
processing the eye information through a preset application program, and extracting the eyeball image features from the eye information.
Optionally, the step of determining whether the user has a voice interaction intention based on the facial image features includes:
dynamically tracking the user's gaze position according to changes in the eyeball image features;
determining whether the gaze position falls within a preset control area;
and if the gaze position falls within the preset control area, determining that the user has a voice interaction intention.
Optionally, the step of determining whether the user has a voice interaction intention based on the voice information includes:
recognizing the voice information to obtain a voice data stream, and determining whether the voice data stream contains a preset wake-up word;
and if the voice data stream contains the preset wake-up word, determining that the user has a voice interaction intention.
In addition, to achieve the above object, the invention further provides a multi-modal voice wake-up apparatus comprising: a memory, a processor, and a multi-modal voice wake-up program stored on the memory and executable on the processor, the multi-modal voice wake-up program implementing the steps of the multi-modal voice wake-up method described above when executed by the processor.
In addition, to achieve the above object, the invention further provides a computer-readable storage medium having a multi-modal voice wake-up program stored thereon, the multi-modal voice wake-up program implementing the steps of the multi-modal voice wake-up method described above when executed by a processor.
The invention thus provides a multi-modal voice wake-up method, device, and computer-readable storage medium that address the technical problem of how to improve the voice wake-up rate in a noisy environment and thereby improve the recognition effect. Facial image features of the user are acquired along with voice information from the user; whether the user has a voice interaction intention is determined based on the facial image features or the voice information; and if either one satisfies a preset interaction condition, the user is determined to have a voice interaction intention and a preset voice assistant is woken up. By recognizing the user's facial image features and combining them with the voice information uttered by the user, the invention adds modalities that help wake the voice assistant and can determine whether the user intends to interact even when the sound-pickup environment is noisy. Whether to wake the voice assistant can therefore be decided reliably, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
Drawings
FIG. 1 is a schematic diagram of the terminal structure of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a multi-modal voice wake-up method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a multi-modal voice wake-up method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a multi-modal voice wake-up method according to a third embodiment of the present invention;
FIG. 5 is a flowchart illustrating a multi-modal voice wake-up method according to a fourth embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The main solution of the embodiments of the invention is a multi-modal voice wake-up method comprising the following steps:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
Existing wake-up methods for voice robots on the market rely mainly on a wake-up word: when the user's speech is recognized as a preset keyword, the robot wakes up and begins interacting with the user. Taking the voice assistant of existing smart devices as an example, when the user utters the corresponding keyword, the voice assistant appears on the user interface. However, when several people converse after the voice robot has been woken up, the voice data stream may contain everyone's speech, so the specific instruction of the intended user cannot be extracted.
The invention provides a multi-modal voice wake-up method that solves the technical problem of how to improve the voice wake-up rate in a noisy environment and thereby improve the recognition effect. In the method, facial image features of the user are acquired along with voice information from the user; whether the user has a voice interaction intention is determined based on the facial image features or the voice information; and if either one satisfies a preset interaction condition, the user is determined to have a voice interaction intention and a preset voice assistant is woken up. By recognizing the user's facial image features and combining them with the voice information uttered by the user, the method adds modalities that help wake the voice assistant and can determine whether the user intends to interact even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
As shown in FIG. 1, FIG. 1 is a schematic diagram of the terminal structure of the hardware operating environment involved in the embodiments of the invention.
The terminal provided by the embodiments of the invention is a multi-modal voice wake-up device.
As shown in FIG. 1, the terminal may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 provides the connections for communication among these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory), and may alternatively be a storage device separate from the processor 1001.
Optionally, the terminal may further include a camera, a radio frequency (RF) circuit, sensors, an audio circuit, a Wi-Fi module, and the like. The sensors may include light sensors and motion sensors, among others. Specifically, the light sensors may include an ambient light sensor, which adjusts the brightness of the display screen according to the ambient light level, and a proximity sensor, which turns off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration along each axis (generally three axes) and the magnitude and direction of gravity when the terminal is stationary, and can be used for applications that recognize the terminal's attitude (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as pedometers and tap detection). The mobile terminal may of course also be configured with other sensors such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, which are not described again here.
Those skilled in the art will appreciate that the terminal structure shown in FIG. 1 is not limiting; the terminal may include more or fewer components than shown, may combine certain components, or may arrange the components differently.
As shown in FIG. 1, the memory 1005, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a multi-modal voice wake-up program.
In the terminal shown in FIG. 1, the network interface 1004 is mainly used for connecting to and exchanging data with a back-end server; the user interface 1003 is mainly used for connecting to and exchanging data with a client (user side); and the processor 1001 may be configured to invoke the multi-modal voice wake-up program stored in the memory 1005 and perform the following operations:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the facial image features include lip contour features, and the step of obtaining facial image features of the user includes:
the method comprises the steps of obtaining whole face frame image data of a user through a preset camera, and obtaining lip outline characteristics based on the whole face frame image data.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of obtaining lip contour features based on the face whole frame image data then comprises:
comparing the lip contour characteristics at different moments, and judging whether position changes of key point coordinates of a preset number exist or not;
and if the position change of the key point coordinates of the preset number exists, obtaining a continuous lip action sequence based on the position change of the key point coordinates.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of judging whether the user has the voice interaction intention or not based on the facial image features comprises the following steps:
recording a special lip action sequence when a user uses an awakening word based on the lip feature key point change when the user uses the awakening word in a normal state;
comparing the continuous lip movement sequence with the dedicated lip movement sequence;
and if the comparison result is within a preset threshold value interval, judging that the user has the voice interaction intention.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the facial image features further include eyeball image features, and the step of acquiring the facial image features of the user includes:
the method comprises the steps of obtaining eye information of a user through a preset camera, and obtaining eyeball image characteristics based on the eye information.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of obtaining eyeball image features based on the eye information comprises the following steps:
and processing the eye information through a preset application program, and extracting eyeball image features in the eye information.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of judging whether the user has the voice interaction intention or not based on the facial image features comprises the following steps:
according to the change of the eyeball image characteristics, dynamically tracking the sight position of the user;
judging whether the sight line position falls into a preset control area or not;
and if the sight line position falls into a preset control area, judging that the user has a voice interaction intention.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of judging whether the user has the voice interaction intention or not based on the voice information comprises the following steps:
recognizing the voice information to obtain a voice data stream, and judging whether the voice data stream contains a preset awakening word;
and if the voice data stream contains a preset awakening word, judging that the user has a voice interaction intention.
Referring to FIG. 2, a first embodiment of the invention provides a multi-modal voice wake-up method comprising:
Step S10: acquiring facial image features of a user and acquiring voice information from the user.
It should be noted that in this embodiment the execution subject is a multi-modal voice wake-up device, which includes an information acquisition module and decides, based on feature information from the user, whether to wake a built-in voice assistant to interact with the user.
In this embodiment, the facial image features include lip contour features and eyeball image features, and step S10 includes:
Step A10: acquiring whole-face frame image data of the user through a preset camera, and obtaining the lip contour features based on the whole-face frame image data.
Step B10: acquiring eye information of the user through a preset camera, and obtaining the eyeball image features based on the eye information.
Step C10: acquiring voice information from the user through a preset microphone.
It should be noted that in this embodiment the multi-modal voice wake-up device includes a preset camera and a preset microphone for acquiring user information. Because the facial image features include lip contour features and eyeball image features, the device can decide whether to wake the voice assistant according to the user's lip contour features (that is, the user's lip movements), according to the user's eyeball image features (that is, the user's gaze direction), or in combination with the conventional cue of the voice information uttered by the user (that is, the spoken wake-up word).
It will be appreciated that this embodiment provides more options than the traditional approach of waking the voice assistant solely by capturing the wake-up word in the voice information.
Step S20: determining, based on the facial image features or the voice information, whether the user has a voice interaction intention.
It can be understood that the precondition for waking the voice assistant is that the user needs the voice assistant and wants to interact with it, that is, the user has a voice interaction intention; whether this intention exists is determined from the user's facial image features or voice information.
In this embodiment, step S20 includes:
Step A20: when the lip contour features change, determining whether the user has a voice interaction intention according to the changing state of the lip contour features.
Step B20: when the eyeball image features change, determining whether the user has a voice interaction intention according to the changing state of the eyeball image features.
Step C20: when voice information is acquired, determining whether the user has a voice interaction intention according to whether the voice information contains a wake-up word.
It can be understood that if none of the above three conditions is satisfied, the user is considered to have no voice interaction intention.
In a specific implementation, this embodiment provides three approaches: in one, whether the user has a voice interaction intention is determined from the user's lip movements; in another, from changes in the user's gaze; and in the third, from whether the voice information uttered by the user contains the wake-up word.
Step S30: if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
It can be understood that the three approaches above operate in parallel: as long as any one of them satisfies its preset interaction condition, the user is considered to have a voice interaction intention and the preset voice assistant of the multi-modal voice wake-up device is woken up, as the sketch below illustrates.
In addition, an optional refinement is provided before the voice assistant is woken: after either of the first two (visual) approaches triggers, a secondary judgment is made against the user's wake-up word, reducing the possibility of false triggering and avoiding a poor user experience.
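For illustration only, the parallel OR-style decision, together with the optional wake-word confirmation of visually triggered wake-ups, might be sketched as follows. The predicate names are hypothetical stand-ins for the lip, gaze, and wake-word checks detailed in the second to fourth embodiments; this is not the patent's own code.

```python
def should_wake(lip_intent: bool, gaze_intent: bool, heard_wake_word: bool,
                confirm_visual_triggers: bool = False) -> bool:
    """Wake the assistant if any modality signals interaction intent.

    If confirm_visual_triggers is set, a lip- or gaze-only trigger also
    requires the wake word (the 'secondary judgment' described above).
    """
    if heard_wake_word:                       # wake word alone suffices
        return True
    if lip_intent or gaze_intent:             # visual trigger
        return not confirm_visual_triggers    # optionally demand the wake word too
    return False
```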
In the multi-modal voice wake-up method, facial image features of the user are acquired along with voice information from the user; whether the user has a voice interaction intention is determined based on the facial image features or the voice information; and if either one satisfies a preset interaction condition, the user is determined to have a voice interaction intention and the preset voice assistant is woken up. This embodiment recognizes the user's facial image features and combines them with the voice information uttered by the user, adding modalities that help wake the voice assistant. It can determine whether the user intends to interact even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
Further, referring to FIG. 3, a second embodiment of the multi-modal voice wake-up method is proposed. Based on the embodiment shown in FIG. 2, step A10 includes:
Step A11: comparing the lip contour features at different moments, and determining whether a preset number of key-point coordinates have changed position.
Step A12: if a preset number of key-point coordinates have changed position, obtaining a continuous lip action sequence based on those position changes.
It can be understood that if no preset number of key-point coordinates have changed position, no subsequent determination is made from the current lip contour features.
In this embodiment, step A20 includes:
Step A21: recording a dedicated lip action sequence for the wake-up word, based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions.
Step A22: comparing the continuous lip action sequence with the dedicated lip action sequence.
Step A23: if the comparison result falls within a preset threshold interval, determining that the user has a voice interaction intention.
It can be understood that if the comparison result does not fall within the preset threshold interval, the determination based on the lip contour features is that the user has no voice interaction intention, and the other determination approaches may then be considered.
With reference to the steps of the first embodiment, this embodiment provides a voice wake-up method based on the user's lip movements. First, the camera captures the user's face at high speed (30 images per second); a whole-face frame image is confirmed with reference to the user's face data; lip features are extracted from each single face image; at least one key-point coordinate is extracted from the position information of the lip contour; at least two key-point coordinates are connected to obtain the semantic features of the single-frame lip image; and lip movement detection is performed on these semantic features to obtain the movement amplitude of the single frame.
Then the lip movement detection result for the video is determined from at least two lip images and the per-frame detection results: the semantic features of the single-frame lip images are spliced to obtain a lip feature sequence, the per-frame movement amplitudes are spliced to obtain a movement amplitude sequence, and the continuous lip action sequence is determined from the lip feature sequence and the movement amplitude sequence, as sketched below.
Finally, a dedicated lip action sequence for the wake-up word is recorded based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions, and the captured lip action sequence is compared with it. If the comparison result exceeds a preset threshold, for example greater than 80% or greater than 75% (the specific threshold can be adjusted to actual requirements and is not limited by this embodiment), the user is determined to have an interaction intention and the voice assistant is woken up.
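As a loose illustration of the sequence construction and the threshold comparison just described (with assumed data shapes and an assumed similarity metric, since the patent leaves the comparison method open):

```python
import numpy as np

def lip_action_sequence(keypoints_per_frame: list) -> np.ndarray:
    """Splice per-frame lip key points, each an (n_points, 2) array of
    x/y coordinates, into one continuous feature sequence."""
    return np.stack([np.asarray(kp).ravel() for kp in keypoints_per_frame])

def matches_wake_word(seq: np.ndarray, dedicated: np.ndarray,
                      threshold: float = 0.80) -> bool:
    """Compare the captured sequence with the recorded wake-word sequence
    using mean frame-wise cosine similarity (an assumed metric)."""
    n = min(len(seq), len(dedicated))        # naive length alignment
    a, b = seq[:n], dedicated[:n]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9)
    return float(cos.mean()) > threshold     # e.g. > 0.80, as in the text
```

In practice the two sequences would more likely be aligned first (for example by dynamic time warping) rather than truncated to a common length.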
This embodiment thus provides a voice wake-up method based on the user's lip movements as one component of the multi-modal voice wake-up method. It can recognize the user's lip contour features and combine them with the voice information uttered by the user, adding a modality that helps wake the voice assistant. Whether the user intends to interact can be determined even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
Further, referring to FIG. 4, a third embodiment of the multi-modal voice wake-up method is proposed. Based on the embodiment shown in FIG. 2, step B10 includes:
Step B11: processing the eye information through a preset application program, and extracting the eyeball image features from the eye information.
In this embodiment, step B20 includes:
Step B21: dynamically tracking the user's gaze position according to changes in the eyeball image features.
Step B22: determining whether the gaze position falls within a preset control area.
Step B23: if the gaze position falls within the preset control area, determining that the user has a voice interaction intention.
It can be understood that if the gaze position does not fall within the preset control area, the determination based on the eyeball image features is that the user has no voice interaction intention, and the other determination approaches may then be considered.
With reference to the steps of the first embodiment, this embodiment provides a voice wake-up method based on the user's eye gaze point. Two frames of facial image information are acquired through the camera; the left-eye and right-eye information in the facial images is recognized through a face recognition function; the left-eye and right-eye information is processed by an AI (Artificial Intelligence) vision application (that is, the preset application program) to extract the eyeball image features; and the gaze position of the person in the vehicle is dynamically tracked from the changing eyeball image features. Then the area information corresponding to the preset control area is acquired, and whether the in-vehicle person's gaze position falls within the preset control area is determined from that area information; if it does, the user is determined to have an interaction intention.
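A minimal sketch of the gaze-region check, assuming a rectangular control area, 2D gaze coordinates, and a small debounce window (none of which the patent fixes):

```python
from dataclasses import dataclass

@dataclass
class ControlArea:
    """Assumed axis-aligned rectangular control area."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

def gaze_intent(gaze_points, area: ControlArea, min_hits: int = 5) -> bool:
    """Treat a sustained fixation inside the control area as interaction
    intent; requiring min_hits consecutive samples is an assumed debounce."""
    hits = 0
    for x, y in gaze_points:
        hits = hits + 1 if area.contains(x, y) else 0
        if hits >= min_hits:
            return True
    return False
```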
This embodiment thus provides a voice wake-up method based on the user's eye gaze as one component of the multi-modal voice wake-up method. It can recognize the user's eyeball image features and combine them with the voice information uttered by the user, adding a modality that helps wake the voice assistant. Whether the user intends to interact can be determined even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
Further, referring to FIG. 5, a fourth embodiment of the multi-modal voice wake-up method is proposed. Based on the embodiment shown in FIG. 2, step C20 includes:
Step C21: recognizing the voice information to obtain a voice data stream, and determining whether the voice data stream contains a preset wake-up word.
Step C22: if the voice data stream contains the preset wake-up word, determining that the user has a voice interaction intention.
It can be understood that if the voice data stream does not contain the preset wake-up word, the determination based on the voice information is that the user has no voice interaction intention, and the other determination approaches may then be considered. This embodiment provides the conventional approach of waking the voice assistant by recognizing the wake-up word in the voice information uttered by the user, and it can be combined with the embodiments above; a sketch follows.
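A simple sketch of the wake-word check on a recognized transcript; the recognizer itself is out of scope here, so recognize() is a hypothetical stub and the wake word shown is an assumption:

```python
WAKE_WORDS = ("hello assistant",)   # assumed preset wake word(s)

def recognize(audio_stream: bytes) -> str:
    """Hypothetical stand-in for the speech recognition engine that
    turns the voice data stream into text."""
    raise NotImplementedError

def voice_intent(audio_stream: bytes) -> bool:
    """True if the recognized voice data stream contains a wake word."""
    text = recognize(audio_stream).lower()
    return any(word in text for word in WAKE_WORDS)
```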
This embodiment thus provides a voice wake-up method based on the user's voice information as one component of the multi-modal voice wake-up method. It can recognize the wake-up word in the user's voice information, adding a modality that helps wake the voice assistant. Whether the user intends to interact can be determined even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
The above embodiments may be implemented independently or in combination with each other.
In addition, an embodiment of the invention further provides a computer-readable storage medium on which a multi-modal voice wake-up program is stored; when executed by a processor, the multi-modal voice wake-up program performs the following operations:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The facial image features include lip contour features, and the step of acquiring facial image features of the user includes:
acquiring whole-face frame image data of the user through a preset camera, and obtaining the lip contour features based on the whole-face frame image data.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of obtaining the lip contour features based on the whole-face frame image data is followed by:
comparing the lip contour features at different moments, and determining whether a preset number of key-point coordinates have changed position;
and if a preset number of key-point coordinates have changed position, obtaining a continuous lip action sequence based on those position changes.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of determining whether the user has a voice interaction intention based on the facial image features includes:
recording a dedicated lip action sequence for the wake-up word, based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions;
comparing the continuous lip action sequence with the dedicated lip action sequence;
and if the comparison result falls within a preset threshold interval, determining that the user has a voice interaction intention.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The facial image features further include eyeball image features, and the step of acquiring facial image features of the user includes:
acquiring eye information of the user through a preset camera, and obtaining the eyeball image features based on the eye information.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of obtaining the eyeball image features based on the eye information includes:
processing the eye information through a preset application program, and extracting the eyeball image features from the eye information.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of determining whether the user has a voice interaction intention based on the facial image features includes:
dynamically tracking the user's gaze position according to changes in the eyeball image features;
determining whether the gaze position falls within a preset control area;
and if the gaze position falls within the preset control area, determining that the user has a voice interaction intention.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of determining whether the user has a voice interaction intention based on the voice information includes:
recognizing the voice information to obtain a voice data stream, and determining whether the voice data stream contains a preset wake-up word;
and if the voice data stream contains the preset wake-up word, determining that the user has a voice interaction intention.
It should be noted that, as used herein, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or system. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises that element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the invention may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) as described above, including instructions for enabling a terminal device (e.g., a mobile phone, computer, server, air conditioner, or network device) to execute the methods of the embodiments of the invention.
The above description is only a preferred embodiment of the invention and is not intended to limit its scope; any equivalent structural or process modification made using the contents of the specification and the accompanying drawings, or any direct or indirect application in other related technical fields, falls within the scope of the invention.

Claims (10)

1. A multi-modal voice wake-up method, comprising:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
2. The multi-modal voice wake-up method of claim 1, wherein the facial image features include lip contour features, and the step of acquiring facial image features of the user comprises:
acquiring whole-face frame image data of the user through a preset camera, and obtaining the lip contour features based on the whole-face frame image data.
3. The multi-modal voice wake-up method of claim 2, wherein the step of obtaining the lip contour features based on the whole-face frame image data is followed by:
comparing the lip contour features at different moments, and determining whether a preset number of key-point coordinates have changed position;
and if a preset number of key-point coordinates have changed position, obtaining a continuous lip action sequence based on those position changes.
4. The multi-modal voice wake-up method of claim 3, wherein the step of determining whether the user has a voice interaction intention based on the facial image features comprises:
recording a dedicated lip action sequence for the wake-up word, based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions;
comparing the continuous lip action sequence with the dedicated lip action sequence;
and if the comparison result falls within a preset threshold interval, determining that the user has a voice interaction intention.
5. The multi-modal voice wake-up method of claim 1, wherein the facial image features further include eyeball image features, and the step of acquiring facial image features of the user comprises:
acquiring eye information of the user through a preset camera, and obtaining the eyeball image features based on the eye information.
6. The multi-modal voice wake-up method of claim 5, wherein the step of obtaining the eyeball image features based on the eye information comprises:
processing the eye information through a preset application program, and extracting the eyeball image features from the eye information.
7. The multi-modal voice wake-up method of claim 6, wherein the step of determining whether the user has a voice interaction intention based on the facial image features comprises:
dynamically tracking the user's gaze position according to changes in the eyeball image features;
determining whether the gaze position falls within a preset control area;
and if the gaze position falls within the preset control area, determining that the user has a voice interaction intention.
8. The multi-modal voice wake-up method according to any one of claims 1 to 7, wherein the step of determining whether the user has a voice interaction intention based on the voice information comprises:
recognizing the voice information to obtain a voice data stream, and determining whether the voice data stream contains a preset wake-up word;
and if the voice data stream contains the preset wake-up word, determining that the user has a voice interaction intention.
9. A multi-modal voice wake-up apparatus, comprising: a memory, a processor, and a multi-modal voice wake-up program stored on the memory and executable on the processor, the multi-modal voice wake-up program implementing the steps of the multi-modal voice wake-up method of any one of claims 1 to 8 when executed by the processor.
10. A computer-readable storage medium having a multi-modal voice wake-up program stored thereon, the multi-modal voice wake-up program implementing the steps of the multi-modal voice wake-up method of any one of claims 1 to 8 when executed by a processor.
CN202210098130.8A | Priority date 2022-01-26 | Filing date 2022-01-26 | Multi-modal voice wake-up method, device and computer-readable storage medium | Status: Pending | Publication: CN114220420A

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210098130.8A | 2022-01-26 | 2022-01-26 | Multi-modal voice wake-up method, device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210098130.8A | 2022-01-26 | 2022-01-26 | Multi-modal voice wake-up method, device and computer-readable storage medium

Publications (1)

Publication Number | Publication Date
CN114220420A | 2022-03-22

Family

ID=80708757

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202210098130.8A | Multi-modal voice wake-up method, device and computer-readable storage medium | 2022-01-26 | 2022-01-26 | Pending (CN114220420A)

Country Status (1)

Country | Link
CN | CN114220420A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN116189680A * | 2023-05-04 | 2023-05-30 | 北京水晶石数字科技股份有限公司 | Voice wake-up method of exhibition intelligent equipment
CN116189680B * | 2023-05-04 | 2023-09-26 | 北京水晶石数字科技股份有限公司 | Voice wake-up method of exhibition intelligent equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination