CN112634895A - Voice interaction wake-up-free method and device - Google Patents

Voice interaction wake-up-free method and device

Info

Publication number
CN112634895A
Authority
CN
China
Prior art keywords
user
voice interaction
preset range
wake
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011573239.XA
Other languages
Chinese (zh)
Inventor
樊帅
林永楷
甘津瑞
宋洪博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd
Priority to CN202011573239.XA
Publication of CN112634895A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a voice interaction wake-up-free method and device. The method comprises the following steps: in response to acquiring sensing data detected by a space sensing sensor, judging whether a user appears in a first preset range; if a user appears in the first preset range, starting a camera to acquire related information of the user, and judging whether the user appears in a second preset range; if the user appears in the second preset range, judging whether the user has an intention of voice interaction based on the related information of the user; and if the user has an intention of voice interaction, entering a wake-up-free voice interaction mode. Because the space sensing sensor performs a preliminary judgment, the power consumption of the system can be greatly reduced. Furthermore, the camera is started only when the space sensing sensor detects that the user is approaching, and the related information of the user is judged comprehensively, so that whether to enter the wake-up-free voice interaction mode can be judged accurately.

Description

Voice interaction wake-up-free method and device
Technical Field
The invention belongs to the technical field of voice interaction, and particularly relates to a voice interaction wake-up-free method and device.
Background
Traditional voice wake-up relies on a wake-up word: the user must say a fixed, pre-defined wake-up word, such as 'small X and small X' or 'XX sprite', to wake up the device. Only after the device has been woken up can subsequent interaction take place. This approach causes the following problems: in a noisy acoustic environment, false wake-ups and missed wake-ups occur easily; and every interaction must begin with the wake-up word, which breaks the continuity of the interaction and degrades the user experience. The starting point of the invention is to design an intelligent wake-up-free voice interaction method that improves the user experience.
Similar techniques:
Method A: a wake-up-free method based on touching or clicking a key.
Method B: a wake-up-free method based on a space sensing sensor.
Method C: a camera-based wake-up-free method.
Method D: voice endpoint detection and voice enhancement based on multi-modal information (various sensors such as vision, infrared and ultrasound).
Method A: the device is woken up by touching or clicking a button or key, e.g. a button on the device or on a remote control.
Method B: this method uses information obtained by a space sensing sensor (one kind of multi-modal information) to avoid the wake-up step. Space sensing sensors include infrared or ultrasonic distance measurement sensors, infrared or ultrasonic proximity detection sensors, pressure sensors and the like. They can sense the distance from a user to the device, whether a user appears in a certain area, or the number of people in the area. One implementation enters the wake-up-free interaction mode if it detects that a user appears in a set area or that the number of users in the area is one.
Method C: this method uses a picture or a video captured by a camera (one kind of multi-modal information), obtains the related information of the user with an image processing algorithm, and avoids the wake-up step when certain conditions are met. In one implementation, information such as the face orientation, gaze direction, lip motion, posture and position of the user is extracted from the visual information captured by the camera, and the device is woken up when the user is at a specific position, has a specific posture, faces the device, and looks at the device while the lips are moving.
Method D: the user must first wake up the device before voice interaction can take place. In the subsequent voice interaction, the start point and end point of the voice input need to be determined and invalid audio filtered out; this process is voice endpoint detection. The conventional approach computes the energy of the speech signal over a period of time, which is disturbed by background noise or other irrelevant human voices. Method D uses multi-modal information as an aid to achieve more accurate voice endpoint detection and, at the same time, directional enhancement of the voice signal. Specifically, facial information such as the mouth shape of the speaker can be determined through the visual modality to assist in locating the start point and end point of the voice; for example, voice input is considered to be present when a user faces the device and opens the mouth. The number, position and so on of the speakers are determined by vision, infrared, ultrasound and the like. Whether a user is speaking can be judged from these factors, and once the position information is determined, the microphone array can be adjusted to directionally enhance the signal from that position.
In the process of implementing the application, the inventors found that the similar techniques have the following defects:
Method A: the user needs to walk up to the interactive device, or keep a remote control device such as a remote controller or a mobile phone within reach, which adds extra operations and degrades the user experience. In some special scenes, such as the COVID-19 pandemic, the user does not want to touch the device directly by hand, and the method is not suitable.
Method B: the information obtained is limited, which reduces the accuracy of judging the wake-up-free condition; false wake-ups are frequent, and the user experience suffers.
Method C: the camera has high power consumption and is not suitable for power-sensitive devices; at the same time, users worry about privacy disclosure.
Method D: the wake-up-free problem is not addressed.
Disclosure of Invention
An embodiment of the present invention provides a voice interaction wake-up-free method and apparatus, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a voice interaction wake-up exempting method, including: responding to the acquired sensing data detected by the space sensing sensor, and judging whether a user appears in a first preset range; if a user appears in the first preset range, starting a camera to acquire related information of the user, and judging whether the user appears in a second preset range; if the user appears in the second preset range, judging whether the user has the voice interaction intention or not based on the related information of the user; and if the user has the intention of voice interaction, entering a wake-up-free voice interaction mode.
In a second aspect, an embodiment of the present invention provides a voice interaction wake-up exempting apparatus, including: the acquisition judging program module is configured to respond to the acquired sensing data detected by the space sensing type sensor and judge whether a user appears in a first preset range; the starting judgment program module is configured to start a camera to acquire the related information of the user if the user appears in the first preset range, and judge whether the user appears in a second preset range; the judging program module is configured to judge whether the user has the voice interaction intention or not based on the relevant information of the user if the user appears in the second preset range; and the program entering module is configured to enter a wake-free voice interaction mode if the user has the voice interaction intention.
In a third aspect, an electronic device is provided, comprising: the apparatus includes at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the voice interaction wake-free method of any of the embodiments of the present invention.
In a fourth aspect, the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute the steps of the voice interaction wake-free method according to any embodiment of the present invention.
According to the method and the device, the space sensing sensor performs a preliminary judgment, so the power consumption of the system can be greatly reduced. Furthermore, the camera is started only when the space sensing sensor detects that the user is approaching, and the related information of the user is judged comprehensively, so that whether to enter the wake-up-free voice interaction mode can be judged accurately and the privacy of the user can be protected to a certain degree.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a voice interaction wake-up-free method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another voice interaction wake-up-free method according to an embodiment of the present invention;
Fig. 3 is a main flowchart of a specific example of a voice interaction wake-up-free method according to an embodiment of the present invention;
Fig. 4 is a sub-flowchart of the first-stage judgment of a specific example of the voice interaction wake-up-free method according to an embodiment of the present invention;
Fig. 5 is a sub-flowchart of the second-stage judgment of a specific example of the voice interaction wake-up-free method according to an embodiment of the present invention;
Fig. 6 is a block diagram of a voice interaction wake-up-free apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Please refer to fig. 1, which shows a flowchart of an embodiment of a voice interaction wake-up exempt method according to the present application.
As shown in fig. 1, in step 101, in response to acquiring sensing data detected by a space sensing sensor, it is determined whether a user is present within a first preset range;
in step 102, if a user appears in the first preset range, starting a camera to acquire related information of the user, and judging whether the user appears in a second preset range;
in step 103, if the user appears in the second preset range, determining whether the user has an intention of voice interaction based on the related information of the user;
in step 104, if the user has an intention of voice interaction, enter a wake-up-free voice interaction mode.
In this embodiment, for step 101, the voice interaction wake-up-free apparatus judges, in response to acquiring sensing data detected by the space sensing sensor, whether a user is present within a first preset range. For example, after the device is started, the space sensing sensor continuously detects the surroundings of the device, and the sensing data it produces is analyzed by an algorithm to judge whether a user is present within the first preset range, for example whether the user is within a preset area or within a preset distance of the device. A space sensing sensor here means any sensor capable of detecting spatial distance and position, such as an infrared or ultrasonic distance measuring sensor, an infrared or ultrasonic proximity detecting sensor, a pressure sensor, or a sensor with this capability developed in the future; the present application is not limited in this respect.
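The first-range judgment of step 101 can be illustrated with a minimal sketch. It assumes the space sensing sensor reports a distance in metres (or nothing when no object is seen); the function name, the 3 m radius and the requirement of several consecutive in-range readings are illustrative assumptions, not details taken from the application.

```python
from typing import Optional, Sequence

FIRST_RANGE_M = 3.0  # hypothetical radius of the first preset range, in metres


def user_in_first_range(
    distances_m: Sequence[Optional[float]],
    max_distance_m: float = FIRST_RANGE_M,
    min_hits: int = 3,
) -> bool:
    """Return True if the latest `min_hits` readings all fall inside the range.

    A reading of None means the sensor detected nothing. Requiring several
    consecutive hits is an assumed debounce step to suppress spurious readings.
    """
    if min_hits <= 0:
        raise ValueError("min_hits must be positive")
    recent = list(distances_m)[-min_hits:]
    if len(recent) < min_hits:
        return False  # not enough evidence yet
    return all(d is not None and 0.0 <= d <= max_distance_m for d in recent)
```

A pressure or proximity sensor that only reports presence could be handled the same way by mapping "present" to a distance of 0.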
Then, for step 102, if a user appears in the first preset range, the camera is turned on to acquire the related information of the user, and it is judged whether the user appears in the second preset range. For example, the camera captures the surroundings of the device, and the captured pictures or video are analyzed by a vision algorithm to obtain the related information of the user, such as the face orientation, lip movement, gaze direction, posture and distance of the user. While the related information is being acquired, it is also judged whether the user appears in the second preset range; this judgment may use the sensing data collected by the space sensing sensor and/or the image information and/or video information captured by the camera.
Then, in step 103, if the user appears in the second preset range, it is judged whether the user has an intention of voice interaction based on the related information of the user. For example, the judgment may be based on whether the face of the user is towards the device, whether the lips of the user are moving, or whether the posture of the user matches a preset posture, or on a combination of these criteria.
Finally, in step 104, if the user has an intention of voice interaction, the device enters the wake-up-free voice interaction mode. After entering this mode, the user can interact with the device by voice directly, without a wake-up operation such as saying a wake-up word.
According to the method, the space sensing sensor performs a preliminary judgment, so the power consumption of the system can be greatly reduced. Furthermore, the camera is started only when the space sensing sensor detects that the user is approaching, and the related information of the user is judged comprehensively, so that whether to enter the wake-up-free voice interaction mode can be judged accurately and the privacy of the user can be protected to a certain degree.
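Steps 101 to 104 form a cascade in which each check runs only if the previous one passed. The sketch below shows one pass through that cascade. The names `run_once`, `Outcome` and `UserInfo`, the range values, and the simple intent rule (facing the device with moving lips) are assumptions made for illustration; the camera is modelled as a callable so that it is visibly not invoked until the first-range check succeeds.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    KEEP_SENSING = "keep_sensing"          # nobody in the first preset range
    OUT_OF_SECOND_RANGE = "out_of_second"  # camera ran, user not close enough
    NO_INTENT = "no_intent"                # user close, no interaction intent
    WAKE_FREE = "wake_free"                # enter wake-up-free interaction mode


@dataclass
class UserInfo:
    """Hypothetical result of the vision algorithm for one capture."""
    distance_m: float
    facing_device: bool
    lips_moving: bool


def run_once(
    sensor_distance_m: Optional[float],
    capture: Callable[[], Optional[UserInfo]],
    first_range_m: float = 3.0,   # assumed value
    second_range_m: float = 1.5,  # assumed value
) -> Outcome:
    # Step 101: judge the first preset range from sensor data alone.
    if sensor_distance_m is None or sensor_distance_m > first_range_m:
        return Outcome.KEEP_SENSING
    # Step 102: only now is the camera used; judge the second preset range.
    info = capture()
    if info is None or info.distance_m > second_range_m:
        return Outcome.OUT_OF_SECOND_RANGE
    # Step 103: judge the intention of voice interaction (assumed rule).
    if not (info.facing_device and info.lips_moving):
        return Outcome.NO_INTENT
    # Step 104: enter the wake-up-free voice interaction mode.
    return Outcome.WAKE_FREE
```

Passing the camera as a callable keeps the expensive and privacy-sensitive capture lazy, which mirrors the power and privacy argument made above.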
In the method according to the foregoing embodiment, after the determining whether the user is present within the first preset range, the method further includes:
if no user appears in the first preset range, continuing to acquire sensing data detected by the space sensing sensor.
In the method according to the foregoing embodiment, the determining whether the user appears within a second preset range further includes:
if the user does not appear in the second preset range, turning off the camera and continuing to acquire the sensing data detected by the space sensing sensor.
According to the method, the camera is turned off when the user does not appear in the second preset range, so that the power consumption of the device can be further reduced.
Further referring to Fig. 2, a flowchart of another voice interaction wake-up-free method provided in an embodiment of the present application is shown. The flowchart mainly shows steps that further define the flow of "judging whether the user has an intention of voice interaction" in Fig. 1.
As shown in Fig. 2, in step 201, if the user does not have an intention of voice interaction, the camera is kept on to acquire related information of the user, and it is judged whether the user appears within the second preset range;
in step 202, if the user appears in the second preset range, it is judged again whether the user has an intention of voice interaction.
In this embodiment, for step 201, if the user does not have an intention of voice interaction, the camera is kept on to acquire the related information of the user, and it is judged whether the user appears in the second preset range. Then, for step 202, if the user appears in the second preset range, it is judged again whether the user has an intention of voice interaction. For example, when the first judgment finds no intention of voice interaction, the related information of the user continues to be acquired until the information shows that the user has an intention of voice interaction or the user moves out of the second preset range. If the user is judged to have an intention of voice interaction, the device enters the wake-up-free voice interaction mode; if the user moves out of the second preset range, the camera is turned off and detection continues with the space sensing sensor.
By judging repeatedly whether the user has an intention of voice interaction, the method of this embodiment avoids failing to enter the wake-up-free mode merely because the first judgment was negative.
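The repeated judgment of Fig. 2 amounts to a loop over successive camera observations that ends in one of two ways. The following sketch assumes each observation has already been reduced to two booleans by upstream processing; the function name and the string results are hypothetical.

```python
from typing import Iterable, Tuple

# (in_second_range, has_intent) for one camera observation; both values are
# assumed to come from upstream sensor and vision processing.
Frame = Tuple[bool, bool]


def watch_until_decided(frames: Iterable[Frame]) -> str:
    """Re-judge intent on every observation while the user stays in range."""
    for in_second_range, has_intent in frames:
        if not in_second_range:
            # User left the second preset range: turn the camera off and
            # fall back to detection with the space sensing sensor.
            return "camera_off"
        if has_intent:
            return "wake_free"
        # Steps 201/202: no intent yet, keep the camera on and judge again.
    return "undecided"  # the observation stream ended without a decision
```

For example, `watch_until_decided([(True, False), (True, True)])` returns `"wake_free"`, because the second observation shows intent while the user is still within range.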
In the method according to the foregoing embodiment, the determining whether the user appears within a second preset range includes:
judging by using the sensing data collected by the space sensing sensor; and/or
judging by using the image information and/or the video information captured by the camera.
According to the method, the judgment uses the sensing data collected by the space sensing sensor and/or the image information and/or video information captured by the camera, so that the intention of the user can be judged accurately.
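The "and/or" combination of the two sources can be sketched as follows, assuming both the space sensing sensor and the vision algorithm yield a distance estimate in metres. The function name, the `mode` parameter and the 1.5 m limit are illustrative assumptions.

```python
from typing import Optional


def in_second_range(
    sensor_distance_m: Optional[float],
    camera_distance_m: Optional[float],
    limit_m: float = 1.5,  # hypothetical extent of the second preset range
    mode: str = "or",
) -> bool:
    """Judge the second preset range from the sensor and/or the camera.

    mode="or":  either source placing the user in range is sufficient.
    mode="and": both sources must report the user in range.
    A missing estimate (None) counts as "not in range" for that source.
    """
    sensor_ok = sensor_distance_m is not None and sensor_distance_m <= limit_m
    camera_ok = camera_distance_m is not None and camera_distance_m <= limit_m
    if mode == "and":
        return sensor_ok and camera_ok
    if mode == "or":
        return sensor_ok or camera_ok
    raise ValueError(f"unknown mode: {mode!r}")
```

Requiring both sources trades some missed detections for fewer false positives; which trade-off suits a given device is a design decision outside the scope of this sketch.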
In the method according to the above embodiment, the space sensing sensor is always on and continuously detects the surroundings.
Because continuous detection is done by the space sensing sensor rather than the camera, the power consumption of the device is lower and the privacy of the user is better protected.
In the method according to the above embodiment, the information related to the user includes: face orientation, lip motion, gaze direction, pose, and distance, e.g., the orientation of the user's face, the active state of the user's lips, whether the user is looking at the device or whether the user's pose is consistent with a preset pose, etc.
According to the method, the state of the user can be evaluated more accurately by comprehensively judging factors such as the face orientation, the lip movement, the sight line direction, the posture and the distance.
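One way to combine these factors is a simple cue count, sketched below. The application does not specify how the factors are weighted, so the rule used here (the user must be close enough and at least three of the four visual cues must be present) is purely an assumption, as are all names and default values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserInfo:
    """Hypothetical per-frame output of the vision algorithm."""
    facing_device: bool
    lips_moving: bool
    gazing_at_device: bool
    posture_matches: bool  # posture consistent with a preset posture
    distance_m: float


def has_voice_intent(
    info: UserInfo,
    max_distance_m: float = 1.5,  # assumed distance limit
    min_cues: int = 3,            # assumed number of cues required
) -> bool:
    """Judge interaction intent by counting the visual cues that are present."""
    if info.distance_m > max_distance_m:
        return False
    cues = (
        info.facing_device,
        info.lips_moving,
        info.gazing_at_device,
        info.posture_matches,
    )
    return sum(cues) >= min_cues
```

Setting `min_cues=4` would reproduce a strict "all conditions must hold" rule, while a lower value tolerates one unreliable cue.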
It should be noted that the above method steps are not intended to limit the execution order of the steps, and in fact, some steps may be executed simultaneously or in the reverse order of the steps, which is not limited herein.
The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and by describing one particular embodiment of the finally identified solution.
The inventor finds that the defects in the prior art are mainly caused by the following reasons in the process of implementing the application:
Method A: the user has to be near the device, because the device can only be woken up by actively clicking a button or a certain position on the device. This is inconvenient in some scenes, for example when the user is in bed preparing to sleep and the device is placed on a windowsill or a cabinet far away from the bed. If a remote controller, a mobile phone or a similar device is used for clicking and it is not within reach, the user first has to find it, which degrades the user experience. In some special scenes, such as the COVID-19 pandemic, the user does not want to touch the device directly by hand, and the method is not suitable.
Method B: the information obtained by space sensing sensors such as infrared or ultrasonic distance measurement sensors, infrared or ultrasonic proximity detection sensors and pressure sensors is limited. Only the distance from the user to the device, or whether the user appears in a certain area, can be determined, which is not sufficient to judge whether the user meets the wake-up-free condition and easily leads to wrong judgments.
Method C: images and videos are captured by the camera, and very detailed information such as the position, face orientation, lip movement, gaze direction, posture, gender and age of a person can be obtained by running algorithms on them, so whether the conditions for wake-up-free interaction are met can be judged accurately. The drawbacks are that the camera must be in a working state all the time, so the power consumption is high and the method is not suitable for power-sensitive devices. Meanwhile, because the camera is always recording, users may worry about privacy disclosure in private settings such as the home.
Method D: this method uses multi-modal information to assist endpoint detection and speech enhancement during the interaction phase after the device has been woken up. It solves a different problem and does not address wake-up or wake-up-free operation; it belongs to a different stage of the voice interaction process from the method provided by the invention.
In the course of implementing the invention, the inventors also found why the present solution is not easy to conceive:
The commonly conceived approaches are methods A, B and C. Method A: with the development of artificial intelligence technology in recent years, voice interaction systems have only gradually been deployed in products. The technology is still at the leading edge, and many organizations and companies tend to focus on the voice modality itself, which is easy to implement. As a result, most voice interaction products on the market today wake up the device with a wake-up word. Waking up by clicking a key is also a natural idea, because clicking a key is easy to implement.
Method B: multi-modal technology has only been applied in voice interaction systems in the last two years. Multi-modal voice interaction is newer still, and few enterprises and practitioners work in this field. Meanwhile, combining technologies from different fields of artificial intelligence greatly increases the difficulty, so most practitioners concentrate on how to realize the basic functions. The method provided by the invention is an optimization in terms of power consumption and privacy after the basic functions are realized.
Method C: the multi-modal wake-up-free schemes proposed by practitioners in the industry often solve only the single problem of greatest concern and ignore the others. For example, using a camera as the modality gives the wake-up-free scheme high accuracy, but power consumption and privacy cannot be guaranteed; using a space sensing sensor keeps power consumption low and protects privacy, but the accuracy is insufficient. Possible reasons are that hardware limitations prevent both types of sensors from being installed on the device at the same time; that the target scenario does not require accuracy, power consumption and privacy protection to be ensured together; or that practitioners do not pay enough attention to privacy protection.
The biggest difficulty is as follows: multi-modal voice interaction is a leading-edge technique, and combining technologies from several fields is difficult. Meanwhile, multi-modal voice interaction is not yet widely applied, so an application scenario that requires accuracy, power consumption and privacy protection to be ensured at the same time in multi-modal wake-up-free interaction is itself hard to find, because such scenarios are closely tied to the popularization and practical deployment of the technology.
The inventors also found that traditional voice wake-up relies on a wake-up word: the user must say a fixed, pre-defined wake-up word, such as 'smallness' or 'Tmall Genie', to wake up the device. Only after the device has been woken up can subsequent interaction take place. This approach causes the following problems: in a noisy acoustic environment, false wake-ups and missed wake-ups occur easily; and every interaction must begin with the wake-up word, which breaks the continuity of the interaction and degrades the user experience. The starting point of the invention is to design an intelligent wake-up-free voice interaction method that improves the user experience.
Method A: the user must perform an extra operation by pressing a key on the device or on a remote controller, which is inconvenient when the user is far from the device. There are also situations in which directly touching a shared key is undesirable, such as the COVID-19 pandemic.
Method B: the information acquired by the space sensing sensor is limited to a single kind, which reduces the accuracy of judging the wake-up-free condition, and false wake-ups are frequent.
Method C: the camera is required to be in a working state all the time, the power consumption is high, and users may worry about privacy disclosure.
Method D: this method solves the problems of endpoint detection and speech enhancement, not the wake-up-free problem.
In the process of implementing the invention, the inventors identified the following differences between the invention and the prior art:
The invention uses both the camera and the space sensing sensor, but the camera is not always on. It is started only when a certain condition is met, and that condition is judged from the information acquired by the space sensing sensor. In the prior art, one class of techniques uses only one type of sensor and therefore cannot satisfy accuracy, power consumption and privacy protection at the same time; another class uses both types of sensors, but keeps both on all the time and makes only a single judgment on whether to wake up, so the requirements on power consumption and privacy protection cannot be met.
The advantage is that the two types of sensors cooperate and the wake-up-free judgment is divided into two stages, so that the requirements on wake-up-free accuracy, power consumption and privacy protection can be met at the same time.
The scheme of the application is mainly designed and optimized from the following aspects:
With the development of voice interaction technology and the popularization of voice interaction devices, device configuration and performance keep improving, and many devices are equipped with multi-modal input devices such as cameras, ultrasonic devices, and infrared devices. Multi-modal data provides more user information and can help the device make more intelligent decisions, so it is natural to consider using multi-modal information to help the device achieve wake-up-free interaction.
The implementations of methods A, B, and C are known from experience with existing products and from a search of relevant papers and patents. However, the pre-research project currently under way requires accuracy, power consumption, and user privacy to all be ensured, which led to the idea of a two-stage wake-up-free method that combines information from two types of sensors.
Design concept and principle: the device is equipped with both a space-sensing sensor and a camera; the space-sensing sensor is always on, while the camera is started only under a certain condition. Because the space-sensing sensor is inexpensive, consumes little power, and raises no user privacy concern, it can stay on at all times; the camera is started only when the space-sensing sensor reports that the condition is met. The scheme thus combines the advantages of methods B and C while avoiding their drawbacks.
The whole wake-up-free process comprises two stages:
In the first stage, the space-sensing sensor is turned on, and its data is acquired and analyzed to judge whether a user is present within a preset range. If a user is present, the camera is turned on and the second stage begins; otherwise, the first stage continues, and the device keeps detecting whether a user is present within the preset range.
In the second stage, the camera is started, the images and video it captures are acquired, and an image processing algorithm extracts user-related information such as face orientation, lip motion, gaze direction, posture, and user distance. Whether the user has a voice interaction intention is judged from this information, and if so, the wake-up-free interaction mode is entered.
Referring to fig. 3, a main flow chart of a specific example of the voice interaction wake-up-free method according to an embodiment of the present invention is shown.
As shown in fig. 3, step 1: enter the first-stage judgment sub-process, which judges whether the first-stage wake-up-free condition is met. If the condition is met, enter the second-stage judgment sub-process.
Step 2: enter the second-stage judgment sub-process, which judges whether the second-stage wake-up-free condition is met. If the condition is met, enter the wake-up-free interaction mode.
Step 3: in the wake-up-free interaction mode, the user can interact with the device by voice directly, without a wake-up step.
Step 4: the user and the device interact by voice continuously until the interaction task ends, after which the device enters the first-stage judgment sub-process again. The condition for determining the end of the task is not limited here: the user may actively say that the interaction should end, the user may provide no voice input within a preset time range, or the end of the interaction may be determined from multi-modal information.
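The main flow of fig. 3 can be read as a small state machine. The following is a minimal sketch under stated assumptions: the names `State` and `next_state` are hypothetical, and the three boolean inputs stand in for the judgments that the text leaves to the sub-processes (user within range, voice interaction intention, interaction task finished).

```python
from enum import Enum, auto


class State(Enum):
    STAGE_ONE = auto()    # only the space-sensing sensor is on
    STAGE_TWO = auto()    # camera is on, checking for interaction intention
    INTERACTING = auto()  # wake-up-free interaction mode


def next_state(state, user_in_range, has_intent, task_finished):
    """Return the next state of a hypothetical wake-up-free controller."""
    if state is State.STAGE_ONE:
        # Step 1: stay in the first stage until a user is detected.
        return State.STAGE_TWO if user_in_range else State.STAGE_ONE
    if state is State.STAGE_TWO:
        # Step 2: fall back to the first stage if the user has left.
        if not user_in_range:
            return State.STAGE_ONE
        return State.INTERACTING if has_intent else State.STAGE_TWO
    # Steps 3 and 4: keep interacting until the task is judged finished.
    return State.STAGE_ONE if task_finished else State.INTERACTING
```

How `task_finished` is computed (an exit phrase, a silence timeout, or multi-modal cues) is left open here, as it is in the text.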
Referring to fig. 4, a sub-flowchart of the first-stage judgment of a specific example of the voice interaction wake-up-free method according to an embodiment of the present invention is shown.
As shown in fig. 4, step 1: start the space-sensing sensor to detect the environment around the device.
Step 2: acquire the data of the space-sensing sensor and analyze it with an algorithm to obtain user-related information, such as whether a user appears in a certain area and the distance between the user and the device.
Step 3: judge whether a user appears within the area range. If no user appears, repeat step 2; if a user appears, enter the second-stage judgment sub-process.
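As an illustration of the first-stage judgment, the sketch below decides presence from a short history of distance readings. It is only one possible reading of the step: the function name, the 1.5 m threshold, and the requirement of three consecutive in-range readings (a simple debounce) are assumptions, not values from the text.

```python
def user_in_first_range(distances_m, max_distance_m=1.5, min_hits=3):
    """Judge presence from recent distance readings (most recent last).

    Returns True only when the last `min_hits` readings are all valid and
    no farther than `max_distance_m`. A reading of None means the sensor
    detected no target. All thresholds are hypothetical.
    """
    recent = distances_m[-min_hits:]
    if len(recent) < min_hits:
        return False  # not enough data yet to decide
    return all(d is not None and 0.0 < d <= max_distance_m for d in recent)
```

Requiring several consecutive hits trades a little latency for fewer spurious camera activations.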
Referring to fig. 5, a sub-flowchart of the second-stage judgment of a specific example of the voice interaction wake-up-free method according to an embodiment of the present invention is shown.
As shown in fig. 5, step 1: start the camera to capture the environment around the device.
Step 2: acquire the picture or video captured by the camera and analyze it with a vision algorithm to obtain user-related information, such as face orientation, lip motion, gaze direction, posture, and distance.
Step 3: judge whether the user appears within a certain area range. If no user appears, return to the first-stage judgment sub-process (fig. 4); if the user appears, proceed to the next judgment. This judgment may be the same as step 3 of the first-stage judgment sub-process (fig. 4), using the information of the space-sensing sensor; alternatively, it may use the image or video captured by the camera, analyzing whether a face appears in the captured picture and whether the face is within the preset area range. Neither way is limited here.
Step 4: judge whether the user has a voice interaction intention. If not, repeat step 2; otherwise, enter the wake-up-free interaction mode. The implementation of this judgment is not particularly limited. One way is to judge whether the distance and angle between the face and the device are within a preset range; one way is to judge whether the face orientation is within a preset range (facing the device); one way is to judge whether the gaze direction is within a preset range (gazing at the device); one way is to judge whether the lips are active (speaking); one way is to judge whether the user's posture matches a preset posture; combinations of the above are also possible.
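The intention judgment of step 4 can be sketched as a combination of simple threshold checks. This is an illustrative sketch only: the `UserObservation` fields, the angle conventions, and every threshold are assumptions, and the text explicitly allows other cues and other combinations.

```python
from dataclasses import dataclass


@dataclass
class UserObservation:
    """Hypothetical output of the vision algorithm for one user."""
    distance_m: float       # estimated distance between face and device
    face_yaw_deg: float     # 0 means the face points straight at the device
    gaze_offset_deg: float  # 0 means the user looks straight at the device
    lips_moving: bool       # True if lip motion suggests speaking


def has_interaction_intent(obs, max_distance_m=1.5, max_yaw_deg=20.0,
                           max_gaze_deg=15.0):
    """Return True when every cue is within its (assumed) preset range."""
    close_enough = obs.distance_m <= max_distance_m
    facing_device = abs(obs.face_yaw_deg) <= max_yaw_deg
    gazing_at_device = abs(obs.gaze_offset_deg) <= max_gaze_deg
    return close_enough and facing_device and gazing_at_device and obs.lips_moving
```

Requiring all cues at once is the strictest combination; a deployment could equally use a subset or a weighted score.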
In the process of realizing the invention, the inventors found that the invention achieves the following effects:
Technical effect: compared with the image-only wake-up-free scheme, power consumption is lower and user privacy is protected. The wake-up-free judgment is divided into two stages. The first stage uses the lower-power space-sensing sensor for a preliminary judgment, and in practice the device spends most of its time in the first stage, which greatly reduces system power consumption. The camera is started only in the second stage, which greatly reduces the time during which the user is filmed and protects user privacy to a certain degree (see the first-stage judgment sub-process, fig. 4).
Technical effect: compared with the wake-up-free scheme that uses only a space-sensing sensor, wake-up-free accuracy is higher and the probability of false wake-up is lower. The scheme retains the camera and, by jointly considering factors such as face, lip motion, gaze, posture, and distance, can more accurately evaluate the current user's state and judge whether there is an interaction need, and hence whether to enter the wake-up-free interaction mode (see the second-stage judgment sub-process, fig. 5).
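The power argument in the first technical effect can be made concrete with a simple duty-cycle estimate. The sketch below assumes the sensor draws constant power and the camera draws power only for the fraction of time it is on; the function name and all numeric figures are hypothetical and are not measurements from the text.

```python
def average_power_mw(sensor_mw, camera_mw, camera_duty):
    """Average power with an always-on sensor and a duty-cycled camera.

    `camera_duty` is the fraction of time (0.0 to 1.0) the camera is on.
    """
    if not 0.0 <= camera_duty <= 1.0:
        raise ValueError("camera_duty must be between 0 and 1")
    return sensor_mw + camera_duty * camera_mw


# Hypothetical figures, for illustration only.
always_on_camera = average_power_mw(0.0, 500.0, 1.0)  # camera-only scheme
two_stage = average_power_mw(5.0, 500.0, 0.1)         # camera on 10% of the time
```

Under these assumed figures the two-stage scheme draws far less on average, and the saving grows as the share of time spent in the first stage increases.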
Alternative (beta) versions formed by the inventors in the process of implementing the invention:
During brainstorming, methods B and C of the existing schemes came up.
Beta version 1: images and video are captured by the camera, the distance between a face and the device is analyzed, and the wake-up-free interaction mode is entered when the distance is within a preset range; when several faces appear at the same time, the largest (nearest) face is taken as the interaction object. This scheme requires the camera to be always on and suits settings with low requirements on power consumption and privacy, such as public places like stations.
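A minimal sketch of the face selection in beta version 1 follows. It assumes a face detector that returns bounding boxes as `(x, y, width, height)` tuples and uses box area as a stand-in for nearness; the function names and the minimum-area threshold are hypothetical.

```python
def pick_interaction_face(faces):
    """Return the largest face box, or None if no face was detected.

    `faces` is a list of (x, y, width, height) tuples in pixels.
    """
    if not faces:
        return None
    return max(faces, key=lambda box: box[2] * box[3])


def should_enter_wake_free(faces, min_area_px=10000):
    """Enter wake-up-free mode if the largest face is big (near) enough."""
    face = pick_interaction_face(faces)
    return face is not None and face[2] * face[3] >= min_area_px
```

Box area is only a rough proxy for distance; a real system would more likely estimate distance from camera calibration or a depth sensor.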
Key innovation point 1: the space-sensing sensor and the camera are used together to achieve wake-up-free interaction.
Key innovation point 2: the space-sensing sensor and the camera are used in different judgment stages, making full use of the strengths of both sensor types (the space-sensing sensor has low power consumption and no privacy leakage problem but low accuracy, while the camera has high judgment accuracy but high power consumption and a privacy leakage problem). The first stage makes a preliminary judgment, and the second stage makes a finer judgment.
In the process of implementing the invention, the inventors found that a further effect is achieved:
The method for judging whether the user has an interaction intention can also be used during interaction to identify valid audio and to filter out noise and invalid background speech, that is, it can serve the functions of voice endpoint detection and speech enhancement.
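One way to picture this further effect is to treat the per-frame intention judgment as a gate on the audio stream. The sketch below is an assumption-laden illustration: it supposes one boolean intention flag per audio frame and returns the frame ranges to keep; the function name is hypothetical, and this is not presented as the patent's endpoint-detection algorithm.

```python
def speech_segments(intent_flags):
    """Return (start, end) frame index pairs, end exclusive.

    Each pair covers a run of consecutive frames whose intention flag is
    True; frames outside these runs would be treated as noise or
    background speech and dropped.
    """
    segments = []
    start = None
    for i, flag in enumerate(intent_flags):
        if flag and start is None:
            start = i                      # segment begins
        elif not flag and start is not None:
            segments.append((start, i))    # segment ends
            start = None
    if start is not None:                  # segment runs to the last frame
        segments.append((start, len(intent_flags)))
    return segments
```

The segment boundaries play the role of voice endpoints, and discarding frames outside them is a crude form of filtering invalid audio.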
Referring to fig. 6, a block diagram of a voice interaction wake-up-free apparatus according to an embodiment of the present invention is shown.
As shown in fig. 6, the voice interaction wake-up-free apparatus 600 includes an acquisition determining program module 610, a start determining program module 620, a determining program module 630, and an entering program module 640.
The acquisition determining program module 610 is configured to determine, in response to acquiring sensing data detected by the space-sensing sensor, whether a user is present within a first preset range; the start determining program module 620 is configured to, if a user is present within the first preset range, start a camera to acquire related information of the user and determine whether the user is present within a second preset range; the determining program module 630 is configured to, if the user is present within the second preset range, determine whether the user has a voice interaction intention based on the related information of the user; and the entering program module 640 is configured to enter a wake-up-free voice interaction mode if the user has a voice interaction intention.
It should be understood that the modules recited in fig. 6 correspond to the steps of the methods described with reference to figs. 1 and 2. Thus, the operations, features, and corresponding technical effects described above for the method also apply to the modules in fig. 6 and are not described again here.
It should be noted that the modules in the embodiments of the present disclosure do not limit the scheme of the present disclosure; for example, the acquisition determining program module may be described as a module that determines, in response to acquiring sensing data detected by the space-sensing sensor, whether a user is present within a first preset range. In addition, the related functional modules may also be implemented by a hardware processor; for example, the acquisition determining program module may be implemented by a processor, which is not described again here.
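The chaining of the four modules can be sketched as a single function that runs the checks in order and touches the camera only after the first check passes. Everything here is hypothetical scaffolding: the function name and the caller-supplied hooks (`in_first_range`, `read_camera`, `in_second_range`, `has_intent`) merely stand in for modules 610 to 640.

```python
def wake_free_decision(sensing_data, in_first_range, read_camera,
                       in_second_range, has_intent):
    """Return True when the wake-up-free voice interaction mode should start.

    All arguments except `sensing_data` are callables supplied by the
    caller. `read_camera` is invoked only after the first-range check
    passes, so the camera stays off otherwise.
    """
    if not in_first_range(sensing_data):   # role of module 610
        return False
    user_info = read_camera()              # role of module 620: camera starts
    if not in_second_range(user_info):     # role of module 620: range check
        return False
    return bool(has_intent(user_info))     # roles of modules 630 and 640
```

Passing the camera as a callable keeps the activation lazy, which is the property the two-stage design relies on for power and privacy.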
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the voice interaction wake-up-free method in any of the above method embodiments.
As one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
in response to acquiring sensing data detected by the space-sensing sensor, judge whether a user appears in a first preset range;
if a user appears in the first preset range, start a camera to acquire related information of the user, and judge whether the user appears in a second preset range;
if the user appears in the second preset range, judge whether the user has a voice interaction intention based on the related information of the user;
and if the user has a voice interaction intention, enter a wake-up-free voice interaction mode.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created through use of the voice interaction wake-up-free device, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, and such remote memory may be connected to the voice interaction wake-up-free device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention further provide a computer program product, where the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, and the computer program includes program instructions which, when executed by a computer, cause the computer to execute any one of the above voice interaction wake-up-free methods.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 7, the electronic device includes one or more processors 710 and a memory 720; one processor 710 is taken as an example in fig. 7. The device for the voice interaction wake-up-free method may further include an input device 730 and an output device 740. The processor 710, the memory 720, the input device 730, and the output device 740 may be connected by a bus or by other means; a bus connection is taken as an example in fig. 7. The memory 720 is the non-volatile computer-readable storage medium described above. The processor 710 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 720, thereby implementing the voice interaction wake-up-free method of the above method embodiments. The input device 730 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the voice interaction wake-up-free device. The output device 740 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device is applied to a voice interaction wake-up-free apparatus, is used for a client, and includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
in response to acquiring sensing data detected by the space-sensing sensor, judge whether a user appears in a first preset range;
if a user appears in the first preset range, start a camera to acquire related information of the user, and judge whether the user appears in a second preset range;
if the user appears in the second preset range, judge whether the user has a voice interaction intention based on the related information of the user;
and if the user has a voice interaction intention, enter a wake-up-free voice interaction mode.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smartphones (e.g., iPhone), multimedia phones, feature phones, and low-end phones, among others.
(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) A portable entertainment device: such devices can display and play multimedia content. They include audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) A server: its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements on processing capability, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice interaction wake-free method comprises the following steps:
responding to the acquired sensing data detected by the space sensing sensor, and judging whether a user appears in a first preset range;
if a user appears in the first preset range, starting a camera to acquire related information of the user, and judging whether the user appears in a second preset range;
if the user appears in the second preset range, judging whether the user has the voice interaction intention or not based on the related information of the user;
and if the user has the intention of voice interaction, entering a wake-up-free voice interaction mode.
2. The method of claim 1, wherein after said determining whether a user is present within a first preset range, the method further comprises:
and if no user appears in the second preset range, continuously acquiring sensing data detected by the space sensing type sensor.
3. The method of claim 1, wherein the determining whether the user is present within a second preset range further comprises:
and if the user does not appear in the second preset range, the camera is not started, and the sensing data detected by the space sensing type sensor is continuously acquired.
4. The method of claim 1, wherein the determining whether the user has an intent to interact with speech further comprises:
if the user does not have the intention of voice interaction, continuously starting a camera to acquire the related information of the user, and judging whether the user appears in a second preset range;
and if the user appears in the second preset range, judging whether the user has the voice interaction intention again.
5. The method of claim 1, wherein the determining whether the user is present within a second preset range comprises:
judging by using sensing data acquired by the space sensing sensor; and/or
judging by using the image information and/or the video information captured by the camera.
6. The method of any one of claims 1-5, wherein the space-sensing type sensor is always on, continuously sensing the surrounding environment.
7. The method of claim 6, wherein the user's relevant information comprises: face orientation, lip motion, gaze direction, pose, and distance.
8. A voice interaction wake-free apparatus, comprising:
the acquisition judging program module is configured to respond to the acquired sensing data detected by the space sensing type sensor and judge whether a user appears in a first preset range;
the starting judgment program module is configured to start a camera to acquire the related information of the user if the user appears in the first preset range, and judge whether the user appears in a second preset range;
the judging program module is configured to judge whether the user has the voice interaction intention or not based on the relevant information of the user if the user appears in the second preset range;
and the program entering module is configured to enter a wake-free voice interaction mode if the user has the voice interaction intention.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 7.
10. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, is adapted to carry out the steps of the method of any one of claims 1 to 7.
CN202011573239.XA 2020-12-25 2020-12-25 Voice interaction wake-up-free method and device Withdrawn CN112634895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011573239.XA CN112634895A (en) 2020-12-25 2020-12-25 Voice interaction wake-up-free method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011573239.XA CN112634895A (en) 2020-12-25 2020-12-25 Voice interaction wake-up-free method and device

Publications (1)

Publication Number Publication Date
CN112634895A true CN112634895A (en) 2021-04-09

Family

ID=75325498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011573239.XA Withdrawn CN112634895A (en) 2020-12-25 2020-12-25 Voice interaction wake-up-free method and device

Country Status (1)

Country Link
CN (1) CN112634895A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593544A (en) * 2021-06-11 2021-11-02 青岛海尔科技有限公司 Device control method and apparatus, storage medium, and electronic apparatus
CN114007168A (en) * 2021-11-03 2022-02-01 长沙楚风数码科技有限公司 Intelligent audio control system and method
CN117119102A (en) * 2023-03-21 2023-11-24 荣耀终端有限公司 Awakening method of voice interaction function and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109955257A (en) * 2017-12-22 2019-07-02 深圳市优必选科技有限公司 A kind of awakening method of robot, device, terminal device and storage medium
CN111179927A (en) * 2019-12-20 2020-05-19 恒银金融科技股份有限公司 Financial equipment voice interaction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

WW01 Invention patent application withdrawn after publication

Application publication date: 20210409
