CN111230891B - Robot and voice interaction system thereof - Google Patents


Info

Publication number
CN111230891B
CN111230891B (application CN201811441703.2A)
Authority
CN
China
Prior art keywords
face
voice
lip movement
detected
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811441703.2A
Other languages
Chinese (zh)
Other versions
CN111230891A (en)
Inventor
熊友军
胡佳文
张木森
黄高波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ubtech Robotics Corp
Original Assignee
Ubtech Robotics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ubtech Robotics Corp filed Critical Ubtech Robotics Corp
Priority to CN201811441703.2A
Publication of CN111230891A
Application granted
Publication of CN111230891B
Legal status: Active
Anticipated expiration

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00Manipulators not otherwise provided for
    • B25J11/0005Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means

Abstract

The invention is applicable to the technical field of robots, and provides a robot and a voice interaction system thereof, wherein the system comprises a face detection module, a lip movement detection module and a voice pickup extraction module which are connected in sequence; the face detection module is used for detecting a face and, when a face is detected, sending a notification that a face is detected to the lip movement detection module; the lip movement detection module is used for detecting lip movement upon receiving the notification that a face is detected and, when lip movement is detected, sending a notification that lip movement is detected to the voice pickup extraction module; the voice pickup extraction module is used for entering a working state upon receiving the notification that lip movement is detected, and picking up sound from the current environment so as to extract the voice signal therein. Since a voice signal is extracted from the current environment only when someone is speaking, false recognition is avoided, user experience is improved, and power consumption can be effectively reduced.

Description

Robot and voice interaction system thereof
Technical Field
The invention belongs to the technical field of robots, and particularly relates to a robot and a voice interaction system thereof.
Background
With the continuous development of robot technology, various intelligent robots continue to emerge and are widely applied in fields such as finance, home services, manufacturing, construction and medical treatment, bringing great convenience to people's daily production and life. At present, robots with an intelligent voice interaction function are popular for their practicality and appeal.
However, in order to make the robot's voice communication more natural, smooth and intelligent, many prior-art robots adopt a continuous-listening working mode. This mode makes the robot prone to misrecognizing speech when environmental noise is relatively loud, so that the robot may start speaking by itself when no one is talking to it, resulting in a poor user experience.
Disclosure of Invention
In view of this, embodiments of the present invention provide a robot and a voice interaction system thereof, so as to solve the problem that many existing robots employ a continuous-listening working mode, which makes the robot prone to misrecognizing speech when environmental noise is relatively loud, so that the robot may speak by itself when no one is talking to it, resulting in a poor user experience.
The embodiment of the invention provides a voice interaction system of a robot, which comprises a face detection module, a lip movement detection module and a voice pickup extraction module which are sequentially connected;
the face detection module is used for detecting a face and sending a notice of detecting the face to the lip movement detection module when the face is detected;
the lip movement detection module is used for detecting lip movement when receiving the notification of the detected face, and sending the notification of the detected lip movement to the voice pickup extraction module when detecting the lip movement;
the voice pickup extraction module is used for entering a working state upon receiving the notification that lip movement is detected, and picking up sound from the current environment so as to extract the voice signal therein.
In one embodiment, the face detection module is further configured to send a notification that a face is not detected to the voice pickup extraction module when the face is not detected, and continue to detect the face;
the voice pickup extraction module is further configured to enter a sleep state when receiving the notification that the face is not detected or when not receiving the notification that the face is detected.
In one embodiment, the face detection module is further configured to send a notification of face detection to the voice pickup extraction module and continue to detect a face when the face is detected;
the voice pickup extraction module is further configured to enter a preparation state upon receiving the notification of the detected face.
In one embodiment, the lip movement detection module is further configured to:
when no lip movement is detected, accumulating the duration for which no lip movement has been detected;
and entering a dormant state when the duration for which no lip movement has been detected exceeds a preset duration.
In one embodiment, the lip movement detection module is further configured to zero the accumulated duration of undetected lip movement when lip movement is detected.
In one embodiment, the lip movement detection module is further configured to send a notification that no lip movement is detected to the voice pickup extraction module when the duration of no lip movement is longer than a preset duration;
the voice pickup extraction module is further configured to enter a sleep state when receiving the notification that lip movement is not detected or when not receiving the notification that lip movement is detected.
In one embodiment, the voice interaction system of the robot further includes:
the natural semantic analysis module is connected with the voice pickup extraction module and is used for performing natural semantic analysis on the voice signals and identifying the meanings of the voice signals; and
and the voice playing module is connected with the natural voice analyzing module and used for searching and playing corresponding voice data according to the meaning of the voice signal.
In one embodiment, the voice interaction system of the robot further includes a sound box connected to the voice playing module.
In one embodiment, the voice interactive system for robot further comprises:
the camera is connected with the face detection module and is used for shooting an image of a preset area in the current environment; and
the microphone is connected with the voice pickup extraction module;
the face detection module is specifically configured to detect whether a face exists in a preset area in the current environment according to the image.
A second aspect of the embodiments of the present invention provides a robot, which includes the above-mentioned voice interaction system of the robot.
The embodiment of the invention provides a robot voice interaction system comprising a face detection module, a lip movement detection module and a voice pickup extraction module which are connected in sequence. The face detection module detects a face and, when a face is detected, sends a notification to the lip movement detection module; upon receiving that notification, the lip movement detection module detects lip movement and, when lip movement is detected, sends a notification to the voice pickup extraction module; upon receiving that notification, the voice pickup extraction module enters a working state and picks up sound from the current environment so as to extract the voice signal therein. In this way, a voice signal is extracted from the current environment only when someone is speaking, which avoids false recognition, improves user experience and effectively reduces power consumption.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic structural diagram of a voice interaction system of a robot according to an embodiment of the present invention;
fig. 2 is a schematic view of a working flow of a face detection module according to a second embodiment of the present invention;
FIG. 3 is a schematic view of a working flow of a lip movement detection module according to a second embodiment of the present invention;
fig. 4 is a schematic flowchart of a voice pickup extraction module according to a second embodiment of the present invention;
fig. 5 is a schematic structural diagram of a voice interaction system of a robot according to a third embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood by those skilled in the art, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover non-exclusive inclusions. For example, a process, method, or system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. Furthermore, the terms "first," "second," and "third," etc. are used to distinguish between different objects and are not used to describe a particular order.
Example one
As shown in fig. 1, the present embodiment provides a robot voice interaction system 10, which includes a face detection module 1, a lip movement detection module 2, and a voice pickup extraction module 3, which are connected in sequence.
In a specific application, the robot may be any type of robot including the voice interaction system and having a voice interaction function, for example, a service robot, an underwater robot, an entertainment robot, a military robot, an agricultural robot, a robotized machine, and the like.
It should be understood that in practical applications, the robot may further include components such as a power supply device, a mechanical motion mechanism, a wireless network communication module, etc. according to the specific application and application site of the robot, and the embodiments of the present invention and the corresponding drawings only show the parts that are closely related to the present invention by way of example, and do not constitute a limitation on the specific structure and function of the robot.
In a specific application, the face detection module, the lip movement detection module and the voice pickup extraction module may be connected by wire, through physical connections such as a serial data bus, a cable or an optical fiber, or wirelessly, through wireless communication modules such as a Bluetooth, WiFi or ZigBee module. Each of the three modules may be a physical sub-function partition within a processor of the robot, a software program module run by a processor of the robot, or implemented by its own independent processor. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In this embodiment, the face detection module 1 is configured to detect a face, and send a notification of detecting the face to the lip movement detection module 2 when the face is detected;
the lip movement detection module 2 is used for detecting lip movement when receiving the notification of detecting the face, and sending the notification of detecting lip movement to the voice pickup extraction module 3 when detecting lip movement;
the voice pickup extraction module 3 is used for entering a working state upon receiving the notification that lip movement is detected, and picking up sound from the current environment to extract the voice signal therein.
In a specific application, the face detection module is specifically configured to acquire image or video data of the current environment captured by a camera of the robot, or by a camera connected to the robot, and then detect whether the acquired data contains a face through face recognition technology. After the face detection module detects a face, the lip movement detection module further detects, through image recognition or facial-feature recognition technology, whether the lips of that face are moving; when a person in the current environment speaks, the lips move, and the lip movement detection module therefore detects lip movement.
In a particular application, the notification of the detection of a human face and the notification of the detection of lip movements may be sent in the form of heartbeat messages or pulse signals.
In this embodiment, the face detection module detects a face and, when a face is detected, sends a notification to the lip movement detection module; upon receiving that notification, the lip movement detection module detects lip movement and, when lip movement is detected, sends a notification to the voice pickup extraction module; upon receiving that notification, the voice pickup extraction module enters a working state and picks up sound from the current environment so as to extract the voice signal therein. In this way, a voice signal is extracted only when someone is speaking, which avoids false recognition, improves user experience and effectively reduces power consumption.
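The gated pipeline described in this embodiment can be sketched in a few lines of Python. All class, method and state names below are illustrative assumptions, not taken from the patent; the point is only that each stage wakes the next one when its own detector fires, so the pickup module stays asleep until a face with moving lips is actually present:

```python
# Hypothetical sketch of the face -> lip movement -> voice pickup chain.
class VoicePickupModule:
    def __init__(self):
        self.state = "sleeping"       # sleeping -> ready -> working

    def on_face(self):
        if self.state == "sleeping":  # prepare, but do not record yet
            self.state = "ready"

    def on_lip_movement(self):
        self.state = "working"        # now pick up and extract speech

    def on_no_face(self):
        self.state = "sleeping"

class LipMovementModule:
    def __init__(self, pickup):
        self.pickup = pickup

    def on_face(self, lips_moving):
        self.pickup.on_face()
        if lips_moving:               # lip movement detected in the frame
            self.pickup.on_lip_movement()

class FaceDetectionModule:
    def __init__(self, lip_module, pickup):
        self.lip_module = lip_module
        self.pickup = pickup

    def process_frame(self, face_found, lips_moving):
        if face_found:
            self.lip_module.on_face(lips_moving)
        else:
            self.pickup.on_no_face()  # no face: let the pickup sleep
```

Feeding a frame with a talking face drives the pickup module to "working"; a face with still lips leaves it merely "ready", and an empty frame puts it back to sleep.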
Example two
As shown in fig. 2, in this embodiment, the face detection module 1 is further configured to send a notification that a face is not detected to the voice pickup extraction module 3 when the face is not detected, and continue to detect the face;
the voice pickup extraction module 3 is further configured to enter a sleep state when receiving a notification that a face is not detected or when receiving no notification that a face is detected.
In a specific application, the notification that no face is detected is sent to the voice pickup extraction module, and the voice pickup extraction module enters a dormant state upon receiving it, or when no notification that a face is detected is received. This prevents the voice pickup extraction module from remaining for a long time in the working state of listening to sound in the current environment, which reduces power consumption and prolongs the service life of the voice pickup extraction module.
As shown in fig. 2, in this embodiment, the face detection module 1 is further configured to send a notification of detecting a face to the voice pickup extraction module 3 when the face is detected, and continue to detect the face;
the voice pickup extraction module 3 is also configured to enter a preparation state when receiving a notification that a face is detected.
In a specific application, putting the voice pickup extraction module into a preparation state when a face is detected allows the module to prepare for sound pickup in advance, so that when the lip movement detection module detects lip movement, the voice pickup extraction module can respond in time, improving the sensitivity of the entire system.
As shown in fig. 3, in the present embodiment, the lip movement detection module 2 is further configured to:
when no lip movement is detected, accumulating the duration for which no lip movement has been detected;
and entering a dormant state when the duration for which no lip movement has been detected exceeds a preset duration.
In a specific application, the preset duration can be set to any reasonable value according to actual needs, for example any value within 5-30 minutes. Putting the lip movement detection module into a dormant state when the duration for which no lip movement is detected exceeds the preset duration prevents the module from remaining in the working state for a long time, which effectively reduces its power consumption and prolongs its service life.
In this embodiment, the lip movement detection module 2 is further configured to zero the accumulated duration of undetected lip movement when the lip movement is detected.
As shown in fig. 3 or fig. 4, in this embodiment, the lip movement detection module 2 is further configured to send a notification that no lip movement is detected to the voice pickup extraction module 3 when the duration of no lip movement is longer than a preset duration;
the voice pickup extraction module 3 is further configured to enter a sleep state when receiving a notification that lip movement is not detected or when receiving no notification that lip movement is detected.
In a specific application, putting the voice pickup extraction module into a dormant state when the duration for which no lip movement is detected exceeds the preset duration, or when no notification that lip movement is detected is received, prevents the module from remaining in the working state for a long time, which effectively reduces its power consumption and prolongs its service life.
As shown in fig. 2, a schematic workflow diagram of the face detection module 1 is exemplarily shown, which includes:
step S201, starting; entering step S202;
step S202, judging whether a human face is detected; if yes, go to step S203; if not, go to step S204;
step S203, respectively sending a notification of face detection to the lip movement detection module 2 and the voice pickup extraction module 3; entering step S202;
step S204, sending a notification that a face is not detected to the voice pickup extraction module 3; the process advances to step S202.
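One possible reading of steps S201-S204 is a single detection step that dispatches notifications and then loops back to S202. The `notify` callback and the target names below are assumptions added for illustration:

```python
# Sketch of the Fig. 2 face-detection workflow (S201-S204). One call
# corresponds to one pass through S202; the module then loops back.
def face_detection_step(frame_has_face, notify):
    if frame_has_face:                             # S202: face detected?
        notify("lip_module", "face_detected")      # S203
        notify("pickup_module", "face_detected")   # S203
    else:
        notify("pickup_module", "no_face")         # S204

# Record the notifications produced by one detection pass.
sent = []
face_detection_step(True, lambda dst, msg: sent.append((dst, msg)))
```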
As shown in fig. 3, a schematic workflow diagram of the lip movement detection module 2 is exemplarily shown, which includes:
step S301, starting; entering step S302;
step S302, entering a dormant state; entering step S303;
step S303, judging whether a notification of detecting a human face is received; if yes, go to step S304; if not, go to step S302;
step S304, judging whether lip movement is detected; if yes, go to step S305; if not, go to step S306;
step S305, sending a notification of detection of lip movement to the voice pickup extraction module 3; entering step S303;
step S306, accumulating the duration time of undetected lip movement; the flow advances to step S307;
step S307, judging whether the duration time of the undetected lip movement is longer than a preset time length; if yes, go to step S308; if not, go to step S306;
step S308, sending a notification that lip movement is not detected to the voice pickup extraction module 3; the process advances to step S302.
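The flow of steps S301-S308 can likewise be sketched as a small state machine. Here the timeout is counted in discrete observation ticks rather than wall-clock time, and all names are illustrative assumptions rather than anything specified by the patent:

```python
# Sketch of the Fig. 3 lip-movement workflow: sleep until a face
# notification arrives, then watch for lip movement, accumulating the
# time without movement and sleeping again once it exceeds the limit.
class LipMovementDetector:
    def __init__(self, timeout_ticks, notify):
        self.timeout = timeout_ticks
        self.notify = notify
        self.sleeping = True              # S302: dormant state
        self.no_lip_ticks = 0

    def on_face_notification(self):       # S303: face notice received
        self.sleeping = False

    def tick(self, lips_moving):          # S304: one observation per tick
        if self.sleeping:
            return
        if lips_moving:
            self.no_lip_ticks = 0         # zero the accumulated duration
            self.notify("lip_movement_detected")      # S305
        else:
            self.no_lip_ticks += 1        # S306: accumulate
            if self.no_lip_ticks > self.timeout:      # S307
                self.notify("no_lip_movement")        # S308
                self.sleeping = True                  # back to S302
```

Counting in ticks keeps the sketch deterministic; a real module would compare timestamps against the preset duration instead.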
As shown in fig. 4, a schematic workflow diagram of the voice pickup extraction module 3 is exemplarily shown, which includes:
step S401, starting; entering step S402;
step S402, entering a dormant state; the flow advances to step S403;
step S403, judging whether a notification of detecting a human face is received; if yes, go to step S405; if not, go to step S402;
step S404, judging whether a notice that the human face is not detected is received; if yes, go to step S402;
step S405, entering a preparation state; the flow advances to step S406;
step S406, judging whether a notification of detecting lip movement is received; if yes, go to step S407; if not, go to step S402;
step S407, entering a working state;
step S408, judging whether a notification of detecting lip movement is received; if not, the process proceeds to step S402.
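Steps S401-S408 amount to a three-state machine for the pickup module. A minimal sketch, with state and notification names assumed for illustration:

```python
# Sketch of the Fig. 4 workflow: sleeping (S402) -> ready (S405) ->
# working (S407), falling back to sleeping on a "not detected" notice.
class VoicePickup:
    def __init__(self):
        self.state = "sleeping"                  # S402

    def handle(self, notification):
        if notification == "face_detected":      # S403
            if self.state == "sleeping":
                self.state = "ready"             # S405: prepare
        elif notification == "no_face":          # S404
            self.state = "sleeping"              # back to S402
        elif notification == "lip_movement":     # S406 / S408
            self.state = "working"               # S407: pick up audio
        elif notification == "no_lip_movement":
            self.state = "sleeping"
```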
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
EXAMPLE III
As shown in fig. 5, in this embodiment, the voice interaction system 10 in the first embodiment or the second embodiment further includes:
the natural semantic analysis module 4 is connected with the voice pickup extraction module 3 and used for performing natural semantic analysis on the voice signals and identifying the meanings of the voice signals; and
and the voice playing module 5 is connected with the natural semantic analysis module 4 and used for searching for and playing the corresponding voice data according to the meaning of the voice signal.
In a specific application, the voice pickup extraction module, the natural semantic analysis module and the voice playing module may be connected by wire, through physical connections such as a serial data bus, a cable or an optical fiber, or wirelessly, through wireless communication modules such as a Bluetooth, WiFi or ZigBee module. The natural semantic analysis module and the voice playing module may be physical sub-function partitions within a processor of the robot, software program modules run by a processor of the robot, or implemented by their own independent processors.
In a specific application, the natural semantic analysis module may be implemented by Natural Language Processing (NLP) technology, and the voice playing module may be implemented by text-to-speech (TTS) technology.
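A toy back end can illustrate the division of labour between the two modules. The intent table and reply lookup below are invented stand-ins for a real NLP/TTS stack, not anything specified by the patent:

```python
# Hypothetical stand-ins for the natural semantic analysis module
# (parse_meaning) and the voice playing module's lookup (find_reply).
def parse_meaning(text):
    """Toy 'natural semantic analysis': map an utterance to an intent."""
    if "hello" in text.lower():
        return "greeting"
    return "unknown"

def find_reply(meaning):
    """Look up the voice data to play for a recognized meaning."""
    replies = {"greeting": "Hello! How can I help you?",
               "unknown": "Sorry, I did not understand."}
    return replies[meaning]

reply = find_reply(parse_meaning("Hello robot"))
```

In a real system `parse_meaning` would be an NLP intent parser and `find_reply` would hand its result to a TTS engine rather than return a string.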
As shown in fig. 5, in the present embodiment, the voice interaction system 10 further includes:
a sound box 6 connected with the voice playing module 5;
the camera 7 is connected with the face detection module 1 and is used for shooting an image of a preset area in the current environment; and
a microphone 8 connected to the voice pickup extraction module 3;
the face detection module 1 is specifically configured to detect whether a face exists in a preset area in the current environment according to the image.
In a specific application, the sound box may be set as a loudspeaker or any device capable of amplifying and playing a voice signal, according to actual needs. The camera may be of any type according to actual needs, for example a camera with a movable or rotatable pan-tilt head, an infrared camera or a wide-angle camera. The microphone may likewise be of any type, for example a microphone array.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in source-code form, object-code form, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A voice interaction system of a robot, characterized by comprising a face detection module, a lip movement detection module and a voice pickup extraction module which are connected in sequence;
the face detection module is configured to detect a face and, when a face is detected, send a notification that a face has been detected to the lip movement detection module;
the lip movement detection module is configured to detect lip movement upon receiving the notification that a face has been detected, and to send a notification that lip movement has been detected to the voice pickup extraction module when lip movement is detected;
the voice pickup extraction module is configured to enter a working state upon receiving the notification that lip movement has been detected, and to pick up sound from the current environment to extract the voice signal therein;
the voice pickup extraction module is further configured to enter a dormant state when a notification that no face has been detected is received, or when no notification that a face has been detected is received;
wherein the notification that a face has been detected and the notification that lip movement has been detected are sent in the form of heartbeat messages or pulse signals.
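The cascade of claim 1 can be sketched as a small state machine for the voice pickup extraction module, driven by the upstream heartbeat notifications. This is a minimal illustration of the claimed architecture, not the patent's implementation; the class and state names are assumptions.

```python
from enum import Enum, auto

class PickupState(Enum):
    DORMANT = auto()   # no face in view: microphone pipeline idle
    PREPARED = auto()  # face detected: ready, but not yet recording
    WORKING = auto()   # lip movement detected: actively extracting speech

class VoicePickupExtractor:
    """Sketch of the voice pickup extraction module of claims 1-3.

    It only enters the working state after the lip movement detector
    reports motion, so the pickup pipeline stays dormant when nobody
    is actually speaking to the robot.
    """
    def __init__(self):
        self.state = PickupState.DORMANT

    def on_face_notification(self, face_detected: bool):
        # A "face detected" heartbeat moves the module to the prepared
        # state; a "no face" notification sends it dormant.
        self.state = PickupState.PREPARED if face_detected else PickupState.DORMANT

    def on_lip_notification(self, lip_moving: bool):
        # A "lip movement" heartbeat starts pickup; its absence returns
        # the module to the dormant state.
        self.state = PickupState.WORKING if lip_moving else PickupState.DORMANT
```

The design point is that the expensive step (continuous sound pickup and voice extraction) is gated behind two cheap visual checks, so the system needs no spoken wake word.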
2. The voice interaction system of a robot according to claim 1, wherein the face detection module is further configured to send a notification that no face has been detected to the voice pickup extraction module and to continue detecting a face, when no face is detected.
3. The voice interaction system of a robot according to claim 1 or 2, wherein the face detection module is further configured to send a notification that a face has been detected to the voice pickup extraction module and to continue detecting the face, when a face is detected;
the voice pickup extraction module is further configured to enter a preparation state upon receiving the notification that a face has been detected.
4. The voice interaction system of a robot according to claim 1, wherein the lip movement detection module is further configured to:
accumulate the duration for which no lip movement is detected, when no lip movement is detected; and
enter a dormant state when the duration for which no lip movement is detected exceeds a preset duration.
5. The voice interaction system of a robot according to claim 4, wherein the lip movement detection module is further configured to zero the accumulated duration for which no lip movement is detected, when lip movement is detected.
6. The voice interaction system of a robot according to claim 4 or 5, wherein the lip movement detection module is further configured to send a notification that no lip movement has been detected to the voice pickup extraction module when the duration for which no lip movement is detected exceeds the preset duration;
the voice pickup extraction module is further configured to enter a dormant state upon receiving the notification that no lip movement has been detected, or when no notification that lip movement has been detected is received.
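The timing behavior of claims 4 to 6 can be illustrated as an idle-time accumulator with a reset. The `timeout_s` name and the 5-second default are assumptions for illustration; the patent only speaks of a "preset duration" without fixing a value.

```python
class LipMovementDetector:
    """Sketch of the lip movement detector timing of claims 4-6."""

    def __init__(self, timeout_s: float = 5.0):
        self.timeout_s = timeout_s  # the "preset duration" (assumed value)
        self.idle_s = 0.0           # accumulated time without lip movement
        self.dormant = False

    def update(self, lip_moving: bool, dt_s: float) -> bool:
        """Feed one detection result plus the elapsed time since the last
        frame; return True once the module has gone dormant."""
        if lip_moving:
            self.idle_s = 0.0       # claim 5: zero the accumulated duration
            self.dormant = False
        else:
            self.idle_s += dt_s     # claim 4: accumulate the idle duration
            if self.idle_s > self.timeout_s:
                self.dormant = True # claims 4/6: sleep past the preset duration
        return self.dormant
```

A dormant detector would also emit the "no lip movement" notification of claim 6, so the downstream pickup module can sleep in step with it.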
7. The voice interaction system of a robot according to claim 1, further comprising:
a natural semantic analysis module, connected to the voice pickup extraction module and configured to perform natural semantic analysis on the voice signal and identify the meaning of the voice signal; and
a voice playing module, connected to the natural semantic analysis module and configured to search for and play the corresponding voice data according to the meaning of the voice signal.
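The semantic-analysis-to-playback chain of claim 7 amounts to mapping a recognized meaning to stored voice data. The toy keyword table below is purely illustrative (the patent does not specify how meanings are identified); it only shows the lookup step the voice playing module would key on.

```python
def identify_meaning(utterance: str) -> str:
    """Toy stand-in for the natural semantic analysis module of claim 7:
    map recognized text to a meaning label that the voice playing module
    can use to look up a reply. The intent table is an assumption."""
    intents = {
        "greeting": ("hello", "hi there"),
        "farewell": ("bye", "goodbye"),
    }
    text = utterance.lower()
    for intent, keywords in intents.items():
        if any(k in text for k in keywords):
            return intent        # meaning the playback module searches on
    return "unknown"             # no stored voice data matches
```

A real system would replace the keyword table with a trained language-understanding component, but the module boundary (meaning in, voice data out) stays the same.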
8. The voice interaction system of a robot according to claim 7, further comprising a speaker connected to the voice playing module.
9. The voice interaction system of a robot according to claim 1, further comprising:
a camera, connected to the face detection module and configured to capture an image of a preset area in the current environment; and
a microphone, connected to the voice pickup extraction module;
wherein the face detection module is specifically configured to detect, according to the image, whether a face exists in the preset area in the current environment.
10. A robot, characterized by comprising the voice interaction system of a robot according to any one of claims 1 to 9.
CN201811441703.2A 2018-11-29 2018-11-29 Robot and voice interaction system thereof Active CN111230891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811441703.2A CN111230891B (en) 2018-11-29 2018-11-29 Robot and voice interaction system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811441703.2A CN111230891B (en) 2018-11-29 2018-11-29 Robot and voice interaction system thereof

Publications (2)

Publication Number Publication Date
CN111230891A CN111230891A (en) 2020-06-05
CN111230891B true CN111230891B (en) 2021-07-27

Family

ID=70861644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811441703.2A Active CN111230891B (en) 2018-11-29 2018-11-29 Robot and voice interaction system thereof

Country Status (1)

Country Link
CN (1) CN111230891B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159111A (en) * 2015-08-24 2015-12-16 百度在线网络技术(北京)有限公司 Artificial intelligence-based control method and control system for intelligent interaction equipment
CN106339219A (en) * 2016-08-19 2017-01-18 北京光年无限科技有限公司 Robot service awakening method and device
CN107102540A (en) * 2016-02-23 2017-08-29 芋头科技(杭州)有限公司 A kind of method and intelligent robot for waking up intelligent robot
CN108098767A (en) * 2016-11-25 2018-06-01 北京智能管家科技有限公司 A kind of robot awakening method and device
CN108733420A (en) * 2018-03-21 2018-11-02 北京猎户星空科技有限公司 Awakening method, device, smart machine and the storage medium of smart machine



Similar Documents

Publication Publication Date Title
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN102843543B (en) Video conferencing reminding method, device and video conferencing system
CN103190139B (en) For providing the system and method for conferencing information
CN108831448A (en) The method, apparatus and storage medium of voice control smart machine
CN105376515B (en) Rendering method, the apparatus and system of communication information for video communication
CN109040641B (en) Video data synthesis method and device
CN105224601B (en) A kind of method and apparatus of extracting time information
US11587560B2 (en) Voice interaction method, device, apparatus and server
CN109257498B (en) Sound processing method and mobile terminal
CN106067996A (en) Voice reproduction method, voice dialogue device
CN105872205B (en) A kind of information processing method and device
CN110738994A (en) Control method, device, robot and system for smart homes
WO2021212388A1 (en) Interactive communication implementation method and device, and storage medium
CN111343410A (en) Mute prompt method and device, electronic equipment and storage medium
CN109542389B (en) Sound effect control method and system for multi-mode story content output
CN110825164A (en) Interaction method and system based on wearable intelligent equipment special for children
CN110232909A (en) A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN114466283A (en) Audio acquisition method and device, electronic equipment and peripheral component method
CN108388399B (en) Virtual idol state management method and system
CN107454265B (en) Method and device for recording call information based on call mode change
CN107888468B (en) Information acquisition system, method and device
CN111230891B (en) Robot and voice interaction system thereof
CN111370004A (en) Man-machine interaction method, voice processing method and equipment
CN111339881A (en) Baby growth monitoring method and system based on emotion recognition
CN110459239A (en) Role analysis method, apparatus and computer readable storage medium based on voice data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant