CN114220420A - Multi-modal voice wake-up method, device and computer-readable storage medium


Info

Publication number: CN114220420A
Authority: CN (China)
Prior art keywords: voice, user, preset, facial image, modal
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN202210098130.8A
Other languages: Chinese (zh)
Inventor
俞瑞华
陈铖彬
郭永利
柳文斌
Current Assignee (the listed assignees may be inaccurate): GAC Toyota Motor Co Ltd
Original Assignee: GAC Toyota Motor Co Ltd
Priority date (an assumption, not a legal conclusion): 2022-01-26
Filing date: 2022-01-26
Publication date: 2022-03-22
Application filed by GAC Toyota Motor Co Ltd
Priority to CN202210098130.8A
Publication of CN114220420A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a multi-modal voice wake-up method, a device, and a computer-readable storage medium. The multi-modal voice wake-up method comprises the following steps: acquiring facial image features of a user and acquiring voice information from the user; determining, based on the facial image features or the voice information, whether the user has a voice interaction intention; and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant. By recognizing the user's facial image features in combination with the voice information uttered by the user, the invention can determine whether the user intends to interact even when the sound-pickup environment is noisy, and can decide accordingly whether to wake the voice assistant. Interference from the external environment is thus reduced during human-machine interaction, the human-machine interaction experience is improved, and a high wake-up rate during voice interaction is ensured.

Description

Multi-modal voice wake-up method, device and computer-readable storage medium
Technical Field
The invention relates to the technical field of intelligent connected-vehicle interaction, and in particular to a multi-modal voice wake-up method, a multi-modal voice wake-up device, and a computer-readable storage medium.
Background
Existing wake-up methods for voice robots on the market rely mainly on a wake-up word: when the user's speech is recognized as a preset keyword, the robot wakes up and begins interacting with the user. Taking the voice assistant of existing smart devices as an example, when the user utters the corresponding keyword, the voice assistant appears on the user interface.
In the speech recognition process, a common way to reduce recognition interference caused by external noise is to denoise the audio with a microphone array: the multi-channel audio captured by the array is fed into a noise reduction pipeline that performs echo cancellation, dereverberation, beamforming, and similar processing to produce clean single-channel audio, which is then sent to the speech recognition engine for recognition.
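For orientation only, the beamforming stage of such a pipeline can be pictured as a delay-and-sum combiner over the array channels. The following is a minimal sketch, not an implementation from the patent; the sample rate, integer-sample alignment, and steering delays are all simplifying assumptions.

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Combine multi-channel array audio into one channel.

    channels: (n_mics, n_samples) audio from the microphone array.
    delays:   per-microphone steering delays in seconds, assumed known
              from the array geometry and the target direction.
    """
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, delay in zip(channels, delays):
        shift = int(round(delay * sr))   # crude integer-sample alignment
        out += np.roll(ch, shift)        # align, then accumulate
    return out / n_mics                  # average the aligned channels

# Toy usage: 4 microphones, 1 s of synthetic audio, broadside steering.
mono = delay_and_sum(np.random.randn(4, 16000), np.zeros(4))
```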
However, microphone arrays and their noise reduction algorithms are very sensitive to external noise, especially non-stationary noise. When the signal-to-noise ratio falls below 5 dB (decibels), algorithm performance degrades rapidly, so a single-dimensional, audio-only noise reduction algorithm struggles to meet speech recognition requirements. Moreover, when several people converse after the voice robot has been woken up, the voice data stream may contain everyone's speech, so a clear instruction from the intended user cannot be extracted.
Disclosure of Invention
The main object of the invention is to provide a multi-modal voice wake-up method, a multi-modal voice wake-up device, and a computer-readable storage medium, so as to solve the technical problem of improving the voice wake-up rate in noisy environments and thereby improving the recognition effect.
In order to achieve the above object, the invention provides a multi-modal voice wake-up method comprising the following steps:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
Optionally, the facial image features include lip contour features, and the step of acquiring facial image features of the user includes:
acquiring whole-face frame image data of the user through a preset camera, and obtaining the lip contour features based on the whole-face frame image data.
Optionally, the step of obtaining the lip contour features based on the whole-face frame image data is followed by:
comparing the lip contour features at different moments, and determining whether a preset number of key-point coordinates have changed position;
and if a preset number of key-point coordinates have changed position, obtaining a continuous lip action sequence based on those position changes.
Optionally, the step of determining whether the user has a voice interaction intention based on the facial image features includes:
recording a dedicated lip action sequence for the wake-up word, based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions;
comparing the continuous lip action sequence with the dedicated lip action sequence;
and if the comparison result falls within a preset threshold interval, determining that the user has a voice interaction intention.
Optionally, the facial image features further include eyeball image features, and the step of acquiring facial image features of the user includes:
acquiring eye information of the user through a preset camera, and obtaining the eyeball image features based on the eye information.
Optionally, the step of obtaining the eyeball image features based on the eye information includes:
processing the eye information through a preset application program, and extracting the eyeball image features from the eye information.
Optionally, the step of determining whether the user has a voice interaction intention based on the facial image features includes:
dynamically tracking the user's gaze position according to changes in the eyeball image features;
determining whether the gaze position falls within a preset control area;
and if the gaze position falls within the preset control area, determining that the user has a voice interaction intention.
Optionally, the step of determining whether the user has a voice interaction intention based on the voice information includes:
recognizing the voice information to obtain a voice data stream, and determining whether the voice data stream contains a preset wake-up word;
and if the voice data stream contains the preset wake-up word, determining that the user has a voice interaction intention.
In addition, to achieve the above object, the invention further provides a multi-modal voice wake-up apparatus comprising: a memory, a processor, and a multi-modal voice wake-up program stored on the memory and executable on the processor, the multi-modal voice wake-up program implementing the steps of the multi-modal voice wake-up method described above when executed by the processor.
In addition, to achieve the above object, the invention further provides a computer-readable storage medium having a multi-modal voice wake-up program stored thereon, the multi-modal voice wake-up program implementing the steps of the multi-modal voice wake-up method described above when executed by a processor.
The invention thus provides a multi-modal voice wake-up method, device, and computer-readable storage medium that address the technical problem of how to improve the voice wake-up rate in a noisy environment and thereby improve the recognition effect. Facial image features of the user are acquired along with voice information from the user; whether the user has a voice interaction intention is determined based on the facial image features or the voice information; and if either one satisfies a preset interaction condition, the user is determined to have a voice interaction intention and a preset voice assistant is woken up. By recognizing the user's facial image features and combining them with the voice information uttered by the user, the invention adds modalities that help wake the voice assistant and can determine whether the user intends to interact even when the sound-pickup environment is noisy. Whether to wake the voice assistant can therefore be decided reliably, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
Drawings
FIG. 1 is a schematic diagram of the terminal structure of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a multi-modal voice wake-up method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a multi-modal voice wake-up method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a multi-modal voice wake-up method according to a third embodiment of the present invention;
FIG. 5 is a flowchart illustrating a multi-modal voice wake-up method according to a fourth embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The main solution of the embodiments of the invention is a multi-modal voice wake-up method comprising the following steps:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
Existing wake-up methods for voice robots on the market rely mainly on a wake-up word: when the user's speech is recognized as a preset keyword, the robot wakes up and begins interacting with the user. Taking the voice assistant of existing smart devices as an example, when the user utters the corresponding keyword, the voice assistant appears on the user interface. However, when several people converse after the voice robot has been woken up, the voice data stream may contain everyone's speech, so the specific instruction of the intended user cannot be extracted.
The invention provides a multi-modal voice wake-up method that solves the technical problem of how to improve the voice wake-up rate in a noisy environment and thereby improve the recognition effect. In the method, facial image features of the user are acquired along with voice information from the user; whether the user has a voice interaction intention is determined based on the facial image features or the voice information; and if either one satisfies a preset interaction condition, the user is determined to have a voice interaction intention and a preset voice assistant is woken up. By recognizing the user's facial image features and combining them with the voice information uttered by the user, the method adds modalities that help wake the voice assistant and can determine whether the user intends to interact even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
As shown in FIG. 1, FIG. 1 is a schematic diagram of the terminal structure of the hardware operating environment involved in the embodiments of the invention.
The terminal provided by the embodiments of the invention is a multi-modal voice wake-up device.
As shown in FIG. 1, the terminal may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 provides the connections for communication among these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory), and may alternatively be a storage device separate from the processor 1001.
Optionally, the terminal may further include a camera, a radio frequency (RF) circuit, sensors, an audio circuit, a Wi-Fi module, and the like. The sensors may include light sensors and motion sensors, among others. Specifically, the light sensors may include an ambient light sensor, which adjusts the brightness of the display screen according to the ambient light level, and a proximity sensor, which turns off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration along each axis (generally three axes) and the magnitude and direction of gravity when the terminal is stationary, and can be used for applications that recognize the terminal's attitude (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as pedometers and tap detection). The mobile terminal may of course also be configured with other sensors such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, which are not described again here.
Those skilled in the art will appreciate that the terminal structure shown in FIG. 1 is not limiting; the terminal may include more or fewer components than shown, may combine certain components, or may arrange the components differently.
As shown in FIG. 1, the memory 1005, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a multi-modal voice wake-up program.
In the terminal shown in FIG. 1, the network interface 1004 is mainly used for connecting to and exchanging data with a back-end server; the user interface 1003 is mainly used for connecting to and exchanging data with a client (user side); and the processor 1001 may be configured to invoke the multi-modal voice wake-up program stored in the memory 1005 and perform the following operations:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the facial image features include lip contour features, and the step of obtaining facial image features of the user includes:
the method comprises the steps of obtaining whole face frame image data of a user through a preset camera, and obtaining lip outline characteristics based on the whole face frame image data.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of obtaining lip contour features based on the face whole frame image data then comprises:
comparing the lip contour characteristics at different moments, and judging whether position changes of key point coordinates of a preset number exist or not;
and if the position change of the key point coordinates of the preset number exists, obtaining a continuous lip action sequence based on the position change of the key point coordinates.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of judging whether the user has the voice interaction intention or not based on the facial image features comprises the following steps:
recording a special lip action sequence when a user uses an awakening word based on the lip feature key point change when the user uses the awakening word in a normal state;
comparing the continuous lip movement sequence with the dedicated lip movement sequence;
and if the comparison result is within a preset threshold value interval, judging that the user has the voice interaction intention.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the facial image features further include eyeball image features, and the step of acquiring the facial image features of the user includes:
the method comprises the steps of obtaining eye information of a user through a preset camera, and obtaining eyeball image characteristics based on the eye information.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of obtaining eyeball image features based on the eye information comprises the following steps:
and processing the eye information through a preset application program, and extracting eyeball image features in the eye information.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of judging whether the user has the voice interaction intention or not based on the facial image features comprises the following steps:
according to the change of the eyeball image characteristics, dynamically tracking the sight position of the user;
judging whether the sight line position falls into a preset control area or not;
and if the sight line position falls into a preset control area, judging that the user has a voice interaction intention.
Further, the processor 1001 may call the multimodal voice wake-up program stored in the memory 1005, and also perform the following operations:
the step of judging whether the user has the voice interaction intention or not based on the voice information comprises the following steps:
recognizing the voice information to obtain a voice data stream, and judging whether the voice data stream contains a preset awakening word;
and if the voice data stream contains a preset awakening word, judging that the user has a voice interaction intention.
Referring to FIG. 2, a first embodiment of the invention provides a multi-modal voice wake-up method comprising:
Step S10: acquiring facial image features of a user and acquiring voice information from the user.
It should be noted that in this embodiment the execution subject is a multi-modal voice wake-up device, which includes an information acquisition module and decides, based on feature information from the user, whether to wake a built-in voice assistant to interact with the user.
In this embodiment, the facial image features include lip contour features and eyeball image features, and step S10 includes:
Step A10: acquiring whole-face frame image data of the user through a preset camera, and obtaining the lip contour features based on the whole-face frame image data.
Step B10: acquiring eye information of the user through a preset camera, and obtaining the eyeball image features based on the eye information.
Step C10: acquiring voice information from the user through a preset microphone.
It should be noted that in this embodiment the multi-modal voice wake-up device includes a preset camera and a preset microphone for acquiring user information. Because the facial image features include lip contour features and eyeball image features, the device can decide whether to wake the voice assistant according to the user's lip contour features (that is, the user's lip movements), according to the user's eyeball image features (that is, the user's gaze direction), or in combination with the conventional cue of the voice information uttered by the user (that is, the spoken wake-up word).
It will be appreciated that this embodiment provides more options than the traditional approach of waking the voice assistant solely by capturing the wake-up word in the voice information.
Step S20: determining, based on the facial image features or the voice information, whether the user has a voice interaction intention.
It can be understood that the precondition for waking the voice assistant is that the user needs the voice assistant and wants to interact with it, that is, the user has a voice interaction intention; whether this intention exists is determined from the user's facial image features or voice information.
In this embodiment, step S20 includes:
Step A20: when the lip contour features change, determining whether the user has a voice interaction intention according to the changing state of the lip contour features.
Step B20: when the eyeball image features change, determining whether the user has a voice interaction intention according to the changing state of the eyeball image features.
Step C20: when voice information is acquired, determining whether the user has a voice interaction intention according to whether the voice information contains a wake-up word.
It can be understood that if none of the above three conditions is satisfied, the user is considered to have no voice interaction intention.
In a specific implementation, this embodiment provides three approaches: in one, whether the user has a voice interaction intention is determined from the user's lip movements; in another, from changes in the user's gaze; and in the third, from whether the voice information uttered by the user contains the wake-up word.
Step S30: if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
It can be understood that the three approaches above operate in parallel: as long as any one of them satisfies its preset interaction condition, the user is considered to have a voice interaction intention and the preset voice assistant of the multi-modal voice wake-up device is woken up, as the sketch below illustrates.
In addition, an optional refinement is provided before the voice assistant is woken: after either of the first two (visual) approaches triggers, a secondary judgment is made against the user's wake-up word, reducing the possibility of false triggering and avoiding a poor user experience.
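For illustration only, the parallel OR-style decision, together with the optional wake-word confirmation of visually triggered wake-ups, might be sketched as follows. The predicate names are hypothetical stand-ins for the lip, gaze, and wake-word checks detailed in the second to fourth embodiments; this is not the patent's own code.

```python
def should_wake(lip_intent: bool, gaze_intent: bool, heard_wake_word: bool,
                confirm_visual_triggers: bool = False) -> bool:
    """Wake the assistant if any modality signals interaction intent.

    If confirm_visual_triggers is set, a lip- or gaze-only trigger also
    requires the wake word (the 'secondary judgment' described above).
    """
    if heard_wake_word:                       # wake word alone suffices
        return True
    if lip_intent or gaze_intent:             # visual trigger
        return not confirm_visual_triggers    # optionally demand the wake word too
    return False
```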
In the multi-modal voice wake-up method, facial image features of the user are acquired along with voice information from the user; whether the user has a voice interaction intention is determined based on the facial image features or the voice information; and if either one satisfies a preset interaction condition, the user is determined to have a voice interaction intention and the preset voice assistant is woken up. This embodiment recognizes the user's facial image features and combines them with the voice information uttered by the user, adding modalities that help wake the voice assistant. It can determine whether the user intends to interact even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
Further, referring to FIG. 3, a second embodiment of the multi-modal voice wake-up method is proposed. Based on the embodiment shown in FIG. 2, step A10 includes:
Step A11: comparing the lip contour features at different moments, and determining whether a preset number of key-point coordinates have changed position.
Step A12: if a preset number of key-point coordinates have changed position, obtaining a continuous lip action sequence based on those position changes.
It can be understood that if no preset number of key-point coordinates have changed position, no subsequent determination is made from the current lip contour features.
In this embodiment, step A20 includes:
Step A21: recording a dedicated lip action sequence for the wake-up word, based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions.
Step A22: comparing the continuous lip action sequence with the dedicated lip action sequence.
Step A23: if the comparison result falls within a preset threshold interval, determining that the user has a voice interaction intention.
It can be understood that if the comparison result does not fall within the preset threshold interval, the determination based on the lip contour features is that the user has no voice interaction intention, and the other determination approaches may then be considered.
With reference to the steps of the first embodiment, this embodiment provides a voice wake-up method based on the user's lip movements. First, the camera captures the user's face at high speed (30 images per second); a whole-face frame image is confirmed with reference to the user's face data; lip features are extracted from each single face image; at least one key-point coordinate is extracted from the position information of the lip contour; at least two key-point coordinates are connected to obtain the semantic features of the single-frame lip image; and lip movement detection is performed on these semantic features to obtain the movement amplitude of the single frame.
Then the lip movement detection result for the video is determined from at least two lip images and the per-frame detection results: the semantic features of the single-frame lip images are spliced to obtain a lip feature sequence, the per-frame movement amplitudes are spliced to obtain a movement amplitude sequence, and the continuous lip action sequence is determined from the lip feature sequence and the movement amplitude sequence, as sketched below.
Finally, a dedicated lip action sequence for the wake-up word is recorded based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions, and the captured lip action sequence is compared with it. If the comparison result exceeds a preset threshold, for example greater than 80% or greater than 75% (the specific threshold can be adjusted to actual requirements and is not limited by this embodiment), the user is determined to have an interaction intention and the voice assistant is woken up.
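As a loose illustration of the sequence construction and the threshold comparison just described (with assumed data shapes and an assumed similarity metric, since the patent leaves the comparison method open):

```python
import numpy as np

def lip_action_sequence(keypoints_per_frame: list) -> np.ndarray:
    """Splice per-frame lip key points, each an (n_points, 2) array of
    x/y coordinates, into one continuous feature sequence."""
    return np.stack([np.asarray(kp).ravel() for kp in keypoints_per_frame])

def matches_wake_word(seq: np.ndarray, dedicated: np.ndarray,
                      threshold: float = 0.80) -> bool:
    """Compare the captured sequence with the recorded wake-word sequence
    using mean frame-wise cosine similarity (an assumed metric)."""
    n = min(len(seq), len(dedicated))        # naive length alignment
    a, b = seq[:n], dedicated[:n]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9)
    return float(cos.mean()) > threshold     # e.g. > 0.80, as in the text
```

In practice the two sequences would more likely be aligned first (for example by dynamic time warping) rather than truncated to a common length.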
This embodiment thus provides a voice wake-up method based on the user's lip movements as one component of the multi-modal voice wake-up method. It can recognize the user's lip contour features and combine them with the voice information uttered by the user, adding a modality that helps wake the voice assistant. Whether the user intends to interact can be determined even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
Further, referring to FIG. 4, a third embodiment of the multi-modal voice wake-up method is proposed. Based on the embodiment shown in FIG. 2, step B10 includes:
Step B11: processing the eye information through a preset application program, and extracting the eyeball image features from the eye information.
In this embodiment, step B20 includes:
Step B21: dynamically tracking the user's gaze position according to changes in the eyeball image features.
Step B22: determining whether the gaze position falls within a preset control area.
Step B23: if the gaze position falls within the preset control area, determining that the user has a voice interaction intention.
It can be understood that if the gaze position does not fall within the preset control area, the determination based on the eyeball image features is that the user has no voice interaction intention, and the other determination approaches may then be considered.
With reference to the steps of the first embodiment, this embodiment provides a voice wake-up method based on the user's eye gaze point. Two frames of facial image information are acquired through the camera; the left-eye and right-eye information in the facial images is recognized through a face recognition function; the left-eye and right-eye information is processed by an AI (Artificial Intelligence) vision application (that is, the preset application program) to extract the eyeball image features; and the gaze position of the person in the vehicle is dynamically tracked from the changing eyeball image features. Then the area information corresponding to the preset control area is acquired, and whether the in-vehicle person's gaze position falls within the preset control area is determined from that area information; if it does, the user is determined to have an interaction intention.
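A minimal sketch of the gaze-region check, assuming a rectangular control area, 2D gaze coordinates, and a small debounce window (none of which the patent fixes):

```python
from dataclasses import dataclass

@dataclass
class ControlArea:
    """Assumed axis-aligned rectangular control area."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

def gaze_intent(gaze_points, area: ControlArea, min_hits: int = 5) -> bool:
    """Treat a sustained fixation inside the control area as interaction
    intent; requiring min_hits consecutive samples is an assumed debounce."""
    hits = 0
    for x, y in gaze_points:
        hits = hits + 1 if area.contains(x, y) else 0
        if hits >= min_hits:
            return True
    return False
```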
This embodiment thus provides a voice wake-up method based on the user's eye gaze as one component of the multi-modal voice wake-up method. It can recognize the user's eyeball image features and combine them with the voice information uttered by the user, adding a modality that helps wake the voice assistant. Whether the user intends to interact can be determined even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
Further, referring to FIG. 5, a fourth embodiment of the multi-modal voice wake-up method is proposed. Based on the embodiment shown in FIG. 2, step C20 includes:
Step C21: recognizing the voice information to obtain a voice data stream, and determining whether the voice data stream contains a preset wake-up word.
Step C22: if the voice data stream contains the preset wake-up word, determining that the user has a voice interaction intention.
It can be understood that if the voice data stream does not contain the preset wake-up word, the determination based on the voice information is that the user has no voice interaction intention, and the other determination approaches may then be considered. This embodiment provides the conventional approach of waking the voice assistant by recognizing the wake-up word in the voice information uttered by the user, and it can be combined with the embodiments above; a sketch follows.
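A simple sketch of the wake-word check on a recognized transcript; the recognizer itself is out of scope here, so recognize() is a hypothetical stub and the wake word shown is an assumption:

```python
WAKE_WORDS = ("hello assistant",)   # assumed preset wake word(s)

def recognize(audio_stream: bytes) -> str:
    """Hypothetical stand-in for the speech recognition engine that
    turns the voice data stream into text."""
    raise NotImplementedError

def voice_intent(audio_stream: bytes) -> bool:
    """True if the recognized voice data stream contains a wake word."""
    text = recognize(audio_stream).lower()
    return any(word in text for word in WAKE_WORDS)
```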
This embodiment thus provides a voice wake-up method based on the user's voice information as one component of the multi-modal voice wake-up method. It can recognize the wake-up word in the user's voice information, adding a modality that helps wake the voice assistant. Whether the user intends to interact can be determined even when the sound-pickup environment is noisy, so that whether to wake the voice assistant can be decided, interference from the external environment is reduced during human-machine interaction, the interaction experience is enhanced, and a high wake-up rate during voice interaction is ensured.
The above embodiments may be implemented independently or in combination with each other.
In addition, an embodiment of the invention further provides a computer-readable storage medium on which a multi-modal voice wake-up program is stored; when executed by a processor, the multi-modal voice wake-up program performs the following operations:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The facial image features include lip contour features, and the step of acquiring facial image features of the user includes:
acquiring whole-face frame image data of the user through a preset camera, and obtaining the lip contour features based on the whole-face frame image data.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of obtaining the lip contour features based on the whole-face frame image data is followed by:
comparing the lip contour features at different moments, and determining whether a preset number of key-point coordinates have changed position;
and if a preset number of key-point coordinates have changed position, obtaining a continuous lip action sequence based on those position changes.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of determining whether the user has a voice interaction intention based on the facial image features includes:
recording a dedicated lip action sequence for the wake-up word, based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions;
comparing the continuous lip action sequence with the dedicated lip action sequence;
and if the comparison result falls within a preset threshold interval, determining that the user has a voice interaction intention.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The facial image features further include eyeball image features, and the step of acquiring facial image features of the user includes:
acquiring eye information of the user through a preset camera, and obtaining the eyeball image features based on the eye information.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of obtaining the eyeball image features based on the eye information includes:
processing the eye information through a preset application program, and extracting the eyeball image features from the eye information.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of determining whether the user has a voice interaction intention based on the facial image features includes:
dynamically tracking the user's gaze position according to changes in the eyeball image features;
determining whether the gaze position falls within a preset control area;
and if the gaze position falls within the preset control area, determining that the user has a voice interaction intention.
Further, when executed by the processor, the multi-modal voice wake-up program also performs the following operations:
The step of determining whether the user has a voice interaction intention based on the voice information includes:
recognizing the voice information to obtain a voice data stream, and determining whether the voice data stream contains a preset wake-up word;
and if the voice data stream contains the preset wake-up word, determining that the user has a voice interaction intention.
It should be noted that, as used herein, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or system. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises that element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the invention may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) as described above, including instructions for enabling a terminal device (e.g., a mobile phone, computer, server, air conditioner, or network device) to execute the methods of the embodiments of the invention.
The above description is only a preferred embodiment of the invention and is not intended to limit its scope; any equivalent structural or process modification made using the contents of the specification and the accompanying drawings, or any direct or indirect application in other related technical fields, falls within the scope of the invention.

Claims (10)

1. A multi-modal voice wake-up method, comprising:
acquiring facial image features of a user and acquiring voice information from the user;
determining, based on the facial image features or the voice information, whether the user has a voice interaction intention;
and if either the facial image features or the voice information satisfies a preset interaction condition, determining that the user has a voice interaction intention and waking up a preset voice assistant.
2. The multi-modal voice wake-up method of claim 1, wherein the facial image features include lip contour features, and the step of acquiring facial image features of the user comprises:
acquiring whole-face frame image data of the user through a preset camera, and obtaining the lip contour features based on the whole-face frame image data.
3. The multi-modal voice wake-up method of claim 2, wherein the step of obtaining the lip contour features based on the whole-face frame image data is followed by:
comparing the lip contour features at different moments, and determining whether a preset number of key-point coordinates have changed position;
and if a preset number of key-point coordinates have changed position, obtaining a continuous lip action sequence based on those position changes.
4. The multi-modal voice wake-up method of claim 3, wherein the step of determining whether the user has a voice interaction intention based on the facial image features comprises:
recording a dedicated lip action sequence for the wake-up word, based on the changes of the lip feature key points when the user utters the wake-up word under normal conditions;
comparing the continuous lip action sequence with the dedicated lip action sequence;
and if the comparison result falls within a preset threshold interval, determining that the user has a voice interaction intention.
5. The multi-modal voice wake-up method of claim 1, wherein the facial image features further include eyeball image features, and the step of acquiring facial image features of the user comprises:
acquiring eye information of the user through a preset camera, and obtaining the eyeball image features based on the eye information.
6. The multi-modal voice wake-up method of claim 5, wherein the step of obtaining the eyeball image features based on the eye information comprises:
processing the eye information through a preset application program, and extracting the eyeball image features from the eye information.
7. The multi-modal voice wake-up method of claim 6, wherein the step of determining whether the user has a voice interaction intention based on the facial image features comprises:
dynamically tracking the user's gaze position according to changes in the eyeball image features;
determining whether the gaze position falls within a preset control area;
and if the gaze position falls within the preset control area, determining that the user has a voice interaction intention.
8. The multi-modal voice wake-up method according to any one of claims 1 to 7, wherein the step of determining whether the user has a voice interaction intention based on the voice information comprises:
recognizing the voice information to obtain a voice data stream, and determining whether the voice data stream contains a preset wake-up word;
and if the voice data stream contains the preset wake-up word, determining that the user has a voice interaction intention.
9. A multi-modal voice wake-up apparatus, comprising: a memory, a processor, and a multi-modal voice wake-up program stored on the memory and executable on the processor, the multi-modal voice wake-up program implementing the steps of the multi-modal voice wake-up method of any one of claims 1 to 8 when executed by the processor.
10. A computer-readable storage medium having a multi-modal voice wake-up program stored thereon, the multi-modal voice wake-up program implementing the steps of the multi-modal voice wake-up method of any one of claims 1 to 8 when executed by a processor.
CN202210098130.8A | Priority date 2022-01-26 | Filing date 2022-01-26 | Multi-modal voice wake-up method, device and computer-readable storage medium | Status: Pending | Publication: CN114220420A

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210098130.8A | 2022-01-26 | 2022-01-26 | Multi-modal voice wake-up method, device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210098130.8A | 2022-01-26 | 2022-01-26 | Multi-modal voice wake-up method, device and computer-readable storage medium

Publications (1)

Publication Number | Publication Date
CN114220420A | 2022-03-22

Family

ID=80708757

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202210098130.8A | Multi-modal voice wake-up method, device and computer-readable storage medium | 2022-01-26 | 2022-01-26 | Pending (CN114220420A)

Country Status (1)

Country | Link
CN | CN114220420A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN116189680A * | 2023-05-04 | 2023-05-30 | 北京水晶石数字科技股份有限公司 | Voice wake-up method of exhibition intelligent equipment
CN116189680B * | 2023-05-04 | 2023-09-26 | 北京水晶石数字科技股份有限公司 | Voice wake-up method of exhibition intelligent equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination