CN112863508A - Wake-up-free interaction method and device - Google Patents

Wake-up-free interaction method and device

Info

Publication number
CN112863508A
Authority
CN
China
Prior art keywords
effective
pointing
valid
interval
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011625969.XA
Other languages
Chinese (zh)
Inventor
林永楷
樊帅
李春
石韡斯
宋洪博
朱成亚
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd
Priority to CN202011625969.XA
Publication of CN112863508A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225: Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a wake-up-free interaction method and device. The method comprises: in response to receiving a valid voice signal from a user, determining a detection interval based on the time period in which the valid voice signal is located; if a valid pointing feature of the user is detected within the detection interval, judging whether the valid voice signal and the valid pointing feature correspond to a valid instruction; and if a valid instruction is identified, processing the instruction and feeding back the result. By combining pointing features with voice interaction to achieve wake-up-free interaction, the scheme improves the interaction experience between the user and an intelligent voice device while maintaining a low false-wake-up rate, greatly improves interaction efficiency especially in frequent-interaction scenarios, and, by using the wake-up feature as a multimodal input to the dialog system, enriches the application scenarios of voice dialog systems.

Description

Wake-up-free interaction method and device
Technical Field
The invention belongs to the field of voice recognition, and particularly relates to a wake-up-free interaction method and device.
Background
To improve interaction accuracy and avoid misoperation, current intelligent devices generally must first be woken up before voice interaction can take place. Existing wake-up technology is still mainly voice-based. To limit misoperation, some products allow a small number of commands to be registered as shortcut wake-up words, such as "previous" and "next", but introducing too many shortcut wake-up words increases the probability of false wake-up, so their use is relatively restrained. Some technologies support interaction without a wake-up word after a face is recognized in a specific scene, but this constrains the user's posture: the user can operate the system only after face recognition succeeds, and because wake-up depends on face orientation, false wake-ups at a distance are common. In general, the need to wake a smart device before operating it is a long-standing problem in the field.
Existing wake-up-free solutions on the market have several drawbacks. Some require an additional wireless headset for calculating distance, but even a close distance does not imply that the user is talking to the speaker. Some simply register a few wake-up-free words, such as "previous" and "next"; these mainly target the currently running application, and too many such words again raise the false-wake-up rate. Some rely on eye-gaze detection whose precision is low; for example, when the eyes look at a keyboard, the system cannot tell whether they are looking at H or G. Some are used only to keep a conversation alive rather than to wake the device, and some cannot avoid wake-up when multiple people are present. The prior art therefore does not provide a user-friendly wake-up-free scheme.
Disclosure of Invention
An embodiment of the present invention provides a wake-up free interaction method and device, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a wake-up-free interaction method, including: in response to receiving a valid voice signal from a user, determining a detection interval based on the time period in which the valid voice signal is located; if a valid pointing feature of the user is detected within the detection interval, judging whether the valid voice signal and the valid pointing feature correspond to a valid instruction; and if a valid instruction is identified, processing the instruction and feeding back the result.
In a second aspect, an embodiment of the present invention provides a multimodal input feature processing method for a dialog system, including: in response to a device being woken by a multimodal input feature, receiving the multimodal input feature and a user voice control instruction; forming an actual control instruction based on the multimodal input feature and the user voice control instruction; and responding to the actual control instruction.
In a third aspect, an embodiment of the present invention provides a wake-up-free interaction apparatus, including: a signal receiving program module configured to, in response to receiving a valid voice signal, judge whether an image acquired within the valid voice signal interval contains a valid pointing feature, where a valid pointing feature is a pointing action made by the user and an interval containing a valid pointing feature is a pointing interval; a signal judgment program module configured to, if the valid voice signal interval is judged to contain a valid pointing feature, input the multimodal information of the valid voice signal interval into a dialog system to judge whether the instruction is valid, where the multimodal information includes the audio and an image containing the pointing action; and an instruction response program module configured to, if the multimodal information is judged to be a valid instruction, respond to the instruction and feed back the response result.
In a fourth aspect, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of the first aspect.
In a fifth aspect, an embodiment of the present invention further provides a storage medium storing a program which, when executed by a processor, performs the steps of the method of the first aspect.
The embodiment of the present application provides a method that combines pointing information with voice interaction to achieve wake-up-free interaction. It improves the interaction experience between the user and an intelligent voice device while maintaining a low false-wake-up rate, greatly improves interaction efficiency especially in frequent-interaction scenarios, and, by using the wake-up feature as a multimodal input to the dialog system, enriches the application scenarios of voice dialog systems.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a wake-up-free interaction method according to an embodiment of the present invention;
fig. 2 is a flowchart of another wake-up-free interaction method according to an embodiment of the present invention;
fig. 3 is a flowchart of a multimodal input feature processing method for a dialog system according to an embodiment of the present invention;
fig. 4 is a wake-up-free flowchart of a specific embodiment of the wake-up-free interaction scheme according to an embodiment of the present invention;
fig. 5 is another flowchart of the wake-up-free interaction scheme according to an embodiment of the present invention;
fig. 6 is another flowchart of the wake-up-free interaction scheme according to an embodiment of the present invention;
fig. 7 is a block diagram of a wake-up-free interaction apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of an embodiment of a wake-up-free interaction method according to the invention is shown.
As shown in fig. 1, in step 101, in response to receiving a valid voice signal of a user, determining a detection interval based on a time period in which the valid voice signal is located;
in step 102, if the valid pointing feature of the user is detected in the detection interval, determining whether the valid voice signal and the valid pointing feature correspond to a valid command;
in step 103, if the valid command is determined to correspond to, the valid command is processed and fed back.
In this embodiment, in step 101, in response to receiving a valid voice signal from a user, the smart device determines a detection interval based on the time period in which the valid voice signal is located. A valid voice signal is the user's speech, excluding background noise: for example, when the smart device detects the user's voice, the time period of that speech determines the detection interval; when the smart device detects only background noise containing no human voice, it does not respond.
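The "valid voice signal" check above can be sketched as a minimal energy-threshold voice-activity test. The frame representation, threshold value, and function name are illustrative assumptions, not the patent's actual detector:

```python
# Minimal sketch of the "valid voice signal" check: speech is accepted only
# when some frame's energy rises above an assumed background-noise floor.
def is_valid_speech(frame_energies, noise_floor=0.1):
    """frame_energies: per-frame energy values. Returns True if any frame
    exceeds the noise floor, i.e. a human voice is plausibly present."""
    return any(e > noise_floor for e in frame_energies)

assert is_valid_speech([0.02, 0.5, 0.6])        # user speech present
assert not is_valid_speech([0.02, 0.03, 0.05])  # background noise only
```

A production system would use a trained voice-activity detector rather than a single threshold; the sketch only shows where the accept/reject decision sits in the flow.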
Then, in step 102, if a valid pointing feature of the user is detected within the detection interval, it is judged whether the valid voice signal and the valid pointing feature correspond to a valid instruction. A valid pointing feature is the user deliberately pointing at a device, object, text, or picture with a finger or a held object (such as a pen or a remote controller); for example, the user points the remote controller at the air conditioner while saying "set the temperature to 20 degrees".
Finally, in step 103, if a valid instruction is identified, the instruction is processed and fed back. A valid instruction requires that, during pointing-based wake-up-free interaction, the smart device detects matching visual and audio input. For example, when the user points the remote controller at the air conditioner and says "set the temperature to 20 degrees", the visual input is "the user pointing the remote controller at the air conditioner", and the audio input "set the temperature to 20 degrees" hits a preset semantic, so the instruction is judged valid and the smart device operates the air conditioner. When the user points the remote controller at the air conditioner and says "this air conditioner works well", the audio input misses the preset semantics, so the smart device does not respond.
In this embodiment, combining gestures with voice interaction achieves wake-up-free interaction, which improves the interaction experience between the user and the intelligent voice device while ensuring a low false-wake-up rate, greatly improves interaction efficiency especially in frequent-interaction scenarios, and enriches the application scenarios of the voice dialog system by using the wake-up feature as a multimodal input to the dialog system.
In some optional embodiments, before judging, when a valid pointing feature of the user is detected in the pointing interval, whether the valid voice signal and the valid pointing feature together correspond to a valid instruction, the method further includes: continuously detecting the visual signal, and when a valid pointing feature appears during a certain time period of the detected visual signal, marking that time period as a pointing interval; and judging whether the detection interval contains a pointing interval. For example, the smart device keeps its visual sensor on while no interaction is taking place, uses an image analysis algorithm to detect whether the user shows a valid pointing feature, and when one appears, marks the interval containing it as a pointing interval. Suppose the user says "turn on the television" during seconds 6 to 20 and points at the television during seconds 7 to 12 (the application is not limited to these values): seconds 6 to 12 are marked as the detection interval and seconds 7 to 12 as the pointing interval, and the detection interval is judged to contain the pointing interval.
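The bookkeeping in this embodiment, marking pointing intervals and asking whether the detection interval contains one, reduces to an interval-overlap test. The names and example timings below are illustrative, not mandated by the patent:

```python
# Sketch of the "detection interval contains a pointing interval" check.
def overlaps(a, b):
    """True when closed intervals a=[a0, a1] and b=[b0, b1] share any time."""
    return a[0] <= b[1] and b[0] <= a[1]

def detection_contains_pointing(detection, pointing_intervals):
    """True if any marked pointing interval overlaps the detection interval."""
    return any(overlaps(detection, p) for p in pointing_intervals)

# Example from the text: detection interval seconds 6-12, pointing at 7-12.
assert detection_contains_pointing((6, 12), [(7, 12)])
assert not detection_contains_pointing((6, 12), [(30, 32)])
```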
In some optional embodiments, the method further includes: if no valid pointing feature of the user is detected in the detection interval, entering a wake-up judgment on the valid voice signal, where the wake-up judgment detects whether a valid pointing feature is contained in a time interval before and a time interval after the valid voice signal interval. For example, suppose the user says "turn on the television" during seconds 10 to 20 and points at the television during seconds 21 to 23. If no valid pointing feature was detected, the device first checks whether a valid pointing feature appears within a period of M seconds before the valid voice signal interval; assuming M is 3 (the application is not limited to this value), the result is none. It then checks whether a valid pointing feature appears within a period of N seconds after the valid voice signal interval; assuming N is 2, the result is yes, so processing of the user instruction begins.
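The wake-up judgment above, checking the M-second window before and the N-second window after the speech interval, can be sketched as follows; the function name and defaults mirror the example values and are assumptions:

```python
# Sketch of the wake-up judgment: does a pointing interval overlap the
# M seconds before, or the N seconds after, the valid speech interval?
def pointing_near_speech(speech, pointing_intervals, m=3, n=2):
    """speech: (start, end) of the valid voice signal, in seconds."""
    start, end = speech
    before = (start - m, start)
    after = (end, end + n)

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    return any(overlaps(before, p) or overlaps(after, p)
               for p in pointing_intervals)

# Example from the text: speech 10-20 s, pointing 21-23 s, M=3, N=2.
assert pointing_near_speech((10, 20), [(21, 23)], m=3, n=2)
assert not pointing_near_speech((10, 20), [(30, 33)], m=3, n=2)
```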
In some optional embodiments, after judging whether the valid voice signal and the valid pointing feature correspond to a valid instruction, the method further includes: if they do not correspond to a valid instruction, entering the wake-up judgment on the valid voice signal. For example, when the user holds the remote controller and says "this air conditioner works well", the utterance misses the preset semantics; the smart device does not respond, and it continues acquiring images through the visual sensor to detect valid pointing features.
In some optional embodiments, judging whether the valid voice signal and the valid pointing feature correspond to a valid instruction includes: acquiring the content pointed at by the valid pointing feature; judging whether the valid voice signal is related to that content; if so, determining that the valid voice signal and the valid pointing feature correspond to a valid instruction; and if not, determining that they correspond to an invalid instruction. For example, if the user points at the air conditioner while saying "turn on the television", the valid pointing feature points at the air conditioner while the valid voice signal concerns the television; the two are judged unrelated, so the instruction is invalid.
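The relevance judgment described above can be sketched as a lookup from the pointed-at object to the vocabulary its skills accept. The keyword table, object labels, and function name are illustrative assumptions, not the patent's actual semantic-understanding model:

```python
# Assumed toy relevance check: an instruction is valid only when the speech
# refers to the pointed-at object. Real systems would use semantic parsing.
POINTABLE_OBJECTS = {
    "air conditioner": {"temperature", "cool", "heat"},
    "television": {"turn on", "turn off", "channel"},
}

def is_valid_instruction(pointed_object, speech_text):
    """True if the speech mentions any keyword associated with the object."""
    keywords = POINTABLE_OBJECTS.get(pointed_object, set())
    return any(k in speech_text for k in keywords)

assert is_valid_instruction("air conditioner", "set the temperature to 20 degrees")
# Pointing at the air conditioner while talking about the television: invalid.
assert not is_valid_instruction("air conditioner", "turn on the television")
```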
In some optional embodiments, judging whether the valid voice signal is related to the content includes: performing speech recognition and semantic understanding on the valid voice signal, and judging whether it is related to the content based on the semantic-understanding result. For example, when the valid voice signal is "turn on the television", it can be parsed into a "turn on the television" intent, and the device then judges whether the parsing result hits the corresponding semantics.
In some optional embodiments, the content comprises a visual signal stream or a picture, and the semantic-understanding intent corresponding to the content includes operating a smart home device or learning about an object. For example, when the user points at the television and says "turn on the television", the semantic understanding may be to operate the television's power switch; when the user points at the television and asks "how do you say this in English", the semantic understanding may be learning the English word for "television".
In some optional embodiments, determining the detection interval based on the time period in which the valid voice signal is located includes: backtracking a first preset time from the starting time point of the time period to form a backtracking interval; holding a second preset time from the ending time point of the time period to form a holding interval; and forming the detection interval from the backtracking interval, the time period of the valid voice signal, and the holding interval. For example, if the valid voice signal spans seconds 15 to 20, the first preset time is 2 seconds, and the second preset time is 3 seconds (the application is not limited to these values), then the backtracking interval is seconds 13 to 15, the holding interval is seconds 20 to 23, and seconds 13 to 23 form the detection interval.
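The interval arithmetic of this embodiment is small enough to state directly; the function name and defaults below follow the example values and are assumptions:

```python
# Sketch of forming the detection interval: backtracking interval + speech
# period + holding interval, expressed as one (start, end) pair in seconds.
def detection_interval(speech_start, speech_end, backtrack=2, hold=3):
    """Return the detection interval covering the backtracked start through
    the held end of the valid voice signal."""
    return (speech_start - backtrack, speech_end + hold)

# Example from the text: speech 15-20 s, backtrack 2 s, hold 3 s -> 13-23 s.
assert detection_interval(15, 20) == (13, 23)
```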
Referring to fig. 2, a flowchart of another wake-up-free interaction method according to an embodiment of the invention is shown.
As shown in fig. 2, when the smart device receives a valid voice signal from a user, it judges whether the marked detection interval containing the valid voice signal contains a pointing interval, i.e., an interval containing a valid pointing feature. If the detection interval contains no pointing interval, the smart device does not respond. If it does, the device transmits the user's voice audio together with the picture or video clip containing the gesture to the dialog system. The dialog system interprets this combined input and judges whether it constitutes a valid instruction: if invalid, no response is made; if valid, the smart system feeds back a response through TTS, a screen, or other devices.
Referring to fig. 3, a flowchart of a multimodal input feature processing method for a dialog system according to an embodiment of the present application is shown. This method and the foregoing method both target the wake-up-free scenario and belong to the same inventive concept: the foregoing embodiment covers the wake-up stage of the wake-up-free scenario, while the present embodiment covers the dialog stage after wake-up.
In step 301, in response to a device being awakened by a multimodal input feature, receiving the multimodal input feature and a user voice control instruction;
in step 302, forming an actual control instruction based on the multi-modal input features and the user voice control instruction;
in step 303, the actual control command is responded to.
In this embodiment, when the preceding wake-up system is woken by multimodal features, the dialog system cannot discard the multimodal wake-up features that system acquired; it must combine the multimodal input features with the subsequent user voice control instruction to form the actual control instruction. For example, in one application scenario, the user points at an apple (an object or a picture) and asks "how do you say this in English", which triggers wake-up-free interaction. The subsequent dialog system needs not only the user's voice "how do you say this in English" but also the user's other multimodal features, such as "pointing at the apple", to form a complete actual control instruction; it can then parse the instruction into a "learn what you see" intent and continue the dialog.
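The merge described here, keeping the multimodal wake-up feature alongside the speech instead of discarding it, can be sketched as follows. The slot names (`pointing_target`, `utterance`) are illustrative assumptions, not the patent's actual message format:

```python
# Sketch of forming the actual control instruction from the multimodal
# wake-up feature plus the follow-up speech, instead of discarding the
# feature the way a wake-up word would be discarded.
def form_actual_command(multimodal_feature, speech_text):
    """Combine e.g. {'pointing_target': 'apple'} with the user's utterance
    into one dict the dialog system can resolve ('this' -> the apple)."""
    command = {"utterance": speech_text}
    command.update(multimodal_feature)  # keep the pointing information
    return command

cmd = form_actual_command({"pointing_target": "apple"},
                          "how do you say this in English")
assert cmd["pointing_target"] == "apple"
assert "English" in cmd["utterance"]
```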
The reason the "wake-up word" of the wake-up stage must be taken into account is that, in the prior art, the wake-up word is deliberately discarded during dialog-system processing. For example, if the user's voice is "hello, turn on the TV", only "turn on the TV" is passed to the dialog system after wake-up, and the wake-up word is discarded. This causes no problem in normal scenarios, but in a wake-up-free scenario, simply discarding the "wake-up word" (here, the pointing feature) may leave the subsequent instruction incomplete, and the dialog system cannot handle it normally. Without the pointing feature, the instruction is merely "how do you say this in English", and a dialog system built on the previous logic might ask the user what "this" refers to. Clearly, ignoring multimodal features in this way is very unfriendly to the user.
It should be noted that, although the above embodiments use numbers with a definite order, such as step 101 and step 102, to describe the steps, in actual application scenarios some steps may be executed in parallel, and the order of some steps is not limited by the numbers.
The following description presents some of the problems the inventors encountered in implementing the present invention, together with one specific embodiment of the finally determined solution, so that those skilled in the art can better understand the present disclosure.
A difficulty the inventors found in the course of implementing the invention: the commonly adopted approach is to provide a modest number of shortcut operations while keeping the false-wake-up rate under control, and these operations are mostly limited to media playback control, such as "previous", "next", or "a little louder".
A few solutions use a camera to perform real-time face recognition on people present in the current environment, or to calculate the distance between a person and the device. When the face-recognition matching degree or the distance probability reaches a certain threshold and the user begins to interact, no wake-up is needed and the interaction audio is sent directly to the cloud.
This scheme combines gestures with voice interaction to achieve wake-up-free interaction, which is not obvious because no product or patent on the market currently wakes a device intelligently through gesture control. Furthermore, the gesture is not only a wake-up trigger: the position pointed at by the finger is also provided as an input to the speech dialog system. This belongs to the category of multimodal interaction, and conceiving a similar solution requires familiarity with the development of interaction modalities beyond speech.
The scheme is novel, and at the same time it processes finger pointing and voice instructions together while avoiding wake-up.
The overall design concept and principle of the invention are as follows:
since AMAZON ECHO came to market in 2014, smart voice interaction devices always need to use a wakeup word as an initial judgment for voice interaction. At present, some products try to avoid requiring a user to still use a wake-up word in a specific scene, such as eye wake-up of a small-sized sound box and natural wake-up of a tianmao spirit, and can achieve certain degree of wake-up avoidance when the user is closer to the sound box. For example, it is captured that the eyes of the user are watching the sound box, and the user does not need to use a wake-up word when the user initiates a voice operation instruction. However, this type of waking method cannot avoid the possibility of using human-to-human communication as speech as the input of the smart device after waking.
Accordingly, we investigated how to avoid forcing the user to use a wake-up word in certain scenarios while also preventing person-to-person conversation from being wrongly taken as device input. We know that smart reading lamps and tablet computers can currently activate specific voice skills or apps and then, combined with a pointing finger, be asked "how do you read this word" or "how do you write this word". The query can also be made directly with a wake-up word plus voice content, for example asking "how do you read this word" after the wake-up word.
The scheme is as follows: a camera detects and records in real time whether a pointing gesture appears in the picture. When the user interacts with the device by voice, if a pointing gesture appears while the user is speaking, or within a certain time before speaking begins or after it ends, the user's voice and the picture containing the pointing are both used as input to the dialog. This achieves the goal of wake-up-free operation while avoiding the problem of person-to-person conversation being mistakenly taken as device input after wake-up.
The idea started from skills in which users need to point: could the voice wake-up word be omitted in certain scenarios, such as the above case of pointing a finger at a word to ask the device how it is read? We then broadened the idea: the pointing gesture is a particularly common and natural gesture, and if the voice that accompanies the gesture is taken as input, a new and more natural voice interaction mode can be created for specific occasions. For example, when a finger points at an electric lamp and the user's utterance hits a smart-home skill, the lamp can be operated without wake-up; when a child's finger points at an object, such as a Christmas tree, and the utterance hits a "learn what you see" skill, the child can be told that the object is a Christmas tree.
Technical details of the present invention: the invention provides a method for directly executing voice instructions while an intelligent voice device is in the non-woken state. When a valid voice signal from the user is received, if a pointing feature appears while the voice signal is being collected, or within a certain time range before or after it, the voice is used as input to the dialog system, and the dialog system responds to the user's voice request. Because the pointed-at content is information the skill will use, the whole wake-up-free process is natural and has a low learning cost. The method improves the interaction experience between the user and the intelligent voice device while ensuring a low false-wake-up rate, greatly improves interaction efficiency especially in frequent-interaction scenarios, and enriches the application scenarios of the voice dialog system by using the wake-up feature as a multimodal input to the dialog system.
Please refer to fig. 4, which shows a wake-up-free flowchart of an embodiment of the wake-up-free interaction scheme of the present invention. The figure mainly shows the steps of the method for marking valid pointing intervals.
As shown in fig. 4, while no interaction is taking place, the smart device acquires images in real time with a visual sensor and analyzes them with an image-analysis algorithm (including but not limited to neural network algorithms such as a convolutional neural network (CNN), a support vector machine (SVM), etc.) to determine whether a person in the frame exhibits a qualifying pointing feature. A qualifying pointing feature means a finger or an object (such as a pen or a remote control) intentionally pointed at a device, an article, text, or a picture. If no pointing feature is present, the visual signal continues to be monitored until a qualifying pointing feature appears. If a pointing feature is present, the time interval in which it appears is marked as a pointing interval.
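The interval-marking step above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: `detect_pointing` is a hypothetical stand-in for the image-analysis model (CNN, SVM, etc.), and frames are assumed to carry timestamps.

```python
def mark_pointing_intervals(frames, detect_pointing):
    """Mark pointing intervals in a stream of (timestamp, image) frames.

    Returns a list of (start, end) timestamps during which a qualifying
    pointing feature was continuously present.
    """
    intervals = []
    start = None   # opening timestamp of the current pointing interval
    last_t = None
    for t, image in frames:
        if detect_pointing(image):       # qualifying pointing feature present
            if start is None:
                start = t                # a pointing interval opens
        elif start is not None:
            intervals.append((start, last_t))  # the interval closes
            start = None
        last_t = t
    if start is not None:                # the stream ended mid-interval
        intervals.append((start, last_t))
    return intervals
```

A real system would run this incrementally per frame rather than over a finished list, but the marking logic is the same.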
Please refer to fig. 5, which shows another wake-free flowchart of an embodiment of the wake-free interaction scheme of the present invention. The figure shows the steps of the wake-up determination method for a valid voice signal.
As shown in fig. 5, after the smart device detects a valid voice signal, it checks whether a marked pointing interval falls within a window around the signal, specifically:
When the valid speech signal starts, the system looks back over a preceding period, for example 1 second; if a pointing interval occurred within that period (pointing interval B in fig. 5), it begins continuously passing data to the dialogue system.
If no pointing interval occurred in the period before the valid speech signal started, the system watches for a pointing interval during collection of the valid speech signal; if one appears (pointing intervals C, D, and E in fig. 5), it begins continuously passing data (the user's voice audio stream, and the visual signal stream or picture containing the gesture) to the dialogue system as soon as the pointing interval appears.
If no pointing interval has appeared by the end of the valid speech signal, the system buffers the speech signal for a certain duration, for example 2 seconds; if a pointing interval appears within this region (pointing interval F in fig. 5), the data is passed to the dialogue system.
Here, a valid speech signal generally means the user's speech, as opposed to background noise and the like.
Pointing intervals A and G in fig. 5 are not passed to the dialogue system for processing because no valid voice signal accompanies them.
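The fig. 5 decision reduces to an overlap test: a valid speech span triggers data transfer if any marked pointing interval overlaps the window formed by extending the speech span backwards and forwards. A minimal sketch, assuming the 1-second backtrack and 2-second hold used in the examples above:

```python
def should_pass_to_dialog(speech_start, speech_end, pointing_intervals,
                          backtrack=1.0, hold=2.0):
    """Return True if any pointing interval overlaps the detection window."""
    win_start = speech_start - backtrack  # look back before the speech begins
    win_end = speech_end + hold           # keep buffering after the speech ends
    return any(p_start <= win_end and p_end >= win_start
               for p_start, p_end in pointing_intervals)
```

For a speech span of 10 s to 12 s, a pointing interval at 9.5 s (case B), during the speech (cases C to E), or at 13.5 s (case F) all pass, while an interval at 5 s to 6 s with no accompanying speech window (case A) does not.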
Usually, when the dialogue system receives multimodal data (the user's voice audio stream together with the visual signal stream or picture containing the pointing), it produces a response result after passing through the speech recognition module, the semantic processing module, and the dialogue processing module; the result is either no response or a response. When the input content is an invalid instruction (for example, pointing and valid audio occur simultaneously but the speech is chat with a friend), the system does not respond. When the input content is a valid instruction, the dialogue system executes the user's instruction and gives feedback. A valid instruction means that the dialogue system used both the visual and the audio input when interacting via pointing-based wake-free operation; this prevents speech directed at other people from being mistaken for device input. Feedback includes, but is not limited to, playing information through text-to-speech output (TTS), displaying multimedia information on a screen, and performing smart-home operations; for a smart device with limbs or mobility, feedback also includes, but is not limited to, moving and making gestures.
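The no-response/response gate described above can be sketched as below. This is a hedged illustration, not the patent's code: `recognize`, `understand`, and `relate` are hypothetical stand-ins for the speech recognition, semantic processing, and dialogue processing modules.

```python
def decide_response(audio, pointed_content, recognize, understand, relate):
    """Respond only when both modalities contribute to the dialogue result."""
    text = recognize(audio)                  # speech recognition module
    intent = understand(text)                # semantic processing module
    if intent is None:                       # no device-directed intent found
        return None                          # state: no response
    if not relate(intent, pointed_content):  # speech unrelated to the target,
        return None                          # e.g. chatting with a friend
    # state: respond; execute the instruction against the pointed-at target
    return {"intent": intent, "target": pointed_content}
```

The key design point is that speech alone never produces a response: the recognized intent must also relate to the pointed-at content.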
Please refer to fig. 6, which shows another wake-up-free flowchart of an embodiment of the wake-up-free interaction scheme of the present invention. The figure is mainly directed to a pipeline model diagram of the multi-modal dialog system.
As shown in fig. 6, a typical multimodal dialogue system pipeline model (PIPELINE) is presented. Unlike conventional dialogue systems that take only audio or text as input, a multimodal dialogue system takes many kinds of information as input, including but not limited to speech, text, visual information, gyroscope information, touch-screen gestures and trajectories, and ultrasound information.
The dialogue master-control module of the dialogue system uses automatic speech recognition (ASR) for speech recognition, natural language understanding (NLU) for semantic understanding, a multimodal information understanding module, and other modules to generate a reply; the reply usually comprises TTS speech synthesis or an audio address. Depending on the user's instruction, multimedia information, control-mode information, and the like may also be returned selectively.
The scheduling of these modules is controlled by the dialogue state management module and the dialogue policy management module within the dialogue master control.
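A minimal sketch of the fig. 6 pipeline follows. The class and method names are illustrative assumptions, not the patent's API: the master control routes multimodal input through ASR, NLU, and multimodal understanding, while state and policy management schedule the reply.

```python
class DialogMasterControl:
    """Illustrative dialogue master control for a multimodal pipeline."""

    def __init__(self, asr, nlu, multimodal, policy):
        self.asr = asr                # automatic speech recognition (ASR)
        self.nlu = nlu                # natural language understanding (NLU)
        self.multimodal = multimodal  # visual/gesture information understanding
        self.policy = policy          # dialogue policy management
        self.state = {}               # dialogue state management

    def handle(self, audio, visual):
        text = self.asr(audio)
        intent = self.nlu(text)
        context = self.multimodal(visual)        # e.g. the pointed-at object
        reply = self.policy(intent, context, self.state)
        self.state["last_intent"] = intent       # carry state across turns
        return reply   # e.g. TTS text, an audio address, or multimedia info
```

Injecting the modules as callables keeps the master control independent of any particular recognizer or policy implementation, matching the pipeline's modular structure.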
In the course of implementing the invention, the inventors found the following deeper effects:
First, the method achieves wake-free interaction in specific scenarios. With the wake-free feature added, a product no longer needs to keep a multi-turn conversation open for a long time just so the user can avoid waking the device again for each interaction. If a multi-turn conversation stays open too long, many utterances not directed at the smart device are mistaken for instructions, harming the user experience. The wake-free method processes the pointing interval and the valid audio input jointly, improving the user's interaction experience with the smart voice device while keeping the false wake-up rate low. In frequent-interaction scenarios in particular, interaction efficiency can be greatly improved.
Second, since the pointed-at content is information that the skill itself uses, and pointing is not deliberately used as a substitute for the voice wake-up word, the whole wake-free process is natural and has a low learning cost. Introducing images as multimodal input also enriches the application scenarios of the voice dialogue system. For example, when a finger points at an air conditioner or a lamp, the user only needs to say "turn it on for me" and the skill can switch on the corresponding device. As another example, location information that is hard to describe in speech can easily be made explicit by pointing, such as telling a machine to go "here"; a child in the cognitive stage can point at an item and ask "what is this" to be told its name.
In short, multimodal interaction enriches the usage scenarios of voice skills and lets people communicate with smart devices more efficiently. Pointing serves as the wake-free feature: it is not deliberately used to replace the voice wake-up word, and while giving the skill the pointed-at content, it avoids the situation where the pointed-at content can only be obtained after a voice wake-up.
The invention has at least the following technical innovation points:
A voice wake-free method comprises the following steps: when a valid voice signal from a user starts to be received, determine whether a qualifying pointing feature appears during collection of the voice signal; if so, directly begin processing the user's voice control instruction without a wake-up word. If not, determine whether a pointing feature appeared within a certain time range before the valid voice signal; if so, directly begin processing the user's voice control instruction without a wake-up word. If not, determine whether a pointing feature appears within a certain time range after the valid voice signal ends; if so, directly begin processing the user's voice control instruction without a wake-up word. If not, no response is required.
This method covers products that achieve wake-free operation through pointing.
A method for using a wake-free feature as a multimodal input feature of a dialogue system comprises: directly taking the modal feature required by the dialogue system as the dialogue system's wake-up trigger; when processing the user's voice control instruction under this wake-up method, the dialogue system responds only when the modal information (for example, pointing visual information) and the user's speech are used together in producing the dialogue result.
In this way, no voice wake-up word is needed when the user interacts with the smart device, and false responses are reduced. For scenarios that are not sensitive to false responses, a response may also be given based on the user's speech alone.
This method covers wake-up based on non-speech features, whether or not the feature is pointing, and the feature is used not only for waking up but also as an input to the dialogue system.
A multimodal dialogue system architecture comprises: multimodal information as input and multimodal information as output; a dialogue master control that includes a speech recognition module, a semantic parsing module, a multimodal information understanding module, a speech synthesis module, and a multimedia information generation module, and that covers dialogue state management and dialogue policy management.
Points that third parties may reference: the pointing information is not limited to the direction of a finger; it may come from a pen, a remote control, or any object registered in advance.
When judging the pointing interval, the present scheme checks whether a pointing interval appears during, before, and after the valid audio. It should be emphasized that any of these checks (before, after, or even during the valid audio) may be omitted; although the effect may be reduced, such variants still fall within the protection scope of the invention.
Referring to fig. 7, a block diagram of a wake-free interaction apparatus according to an embodiment of the present invention is shown.
As shown in fig. 7, the wake-free interaction apparatus 700 includes a signal receiving program module 710, a signal determination program module 720, and an instruction response program module 730.
The signal receiving program module 710 is configured to, in response to receiving a valid voice signal, determine whether an image acquired within the valid-voice-signal interval contains a valid pointing feature, where a valid pointing feature is a pointing action made by the user, and an interval containing a valid pointing feature is a pointing interval. The signal determination program module 720 is configured to, if the valid-voice-signal interval is determined to contain a valid pointing feature, input the multimodal information of the valid-voice-signal interval into a dialogue system for valid-instruction judgment, where the multimodal information includes audio and an image containing the pointing action. The instruction response program module 730 is configured to, if the multimodal information is judged to be a valid instruction, respond to the instruction and feed back the response result.
It should be understood that the modules shown in fig. 7 correspond to the respective steps of the method described with reference to figs. 1 and 2. Thus, the operations and features described above for the method, and the corresponding technical effects, also apply to the modules in fig. 7 and are not repeated here.
It should be noted that the modules in the embodiments of the present application are not intended to limit the scheme of the present application; for example, the signal receiving program module may be described as a module configured to, in response to receiving a valid voice signal, determine whether an image acquired within the valid-voice-signal interval contains a valid pointing feature, where a valid pointing feature is a pointing action made by the user, and an interval containing a valid pointing feature is a pointing interval.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can execute the wake-free interaction method of any of the above method embodiments.
As one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
in response to receiving a valid voice signal of a user, determine a detection interval based on the time period in which the valid voice signal is located;
if a valid pointing feature of the user is detected within the detection interval, determine whether the valid voice signal and the valid pointing feature correspond to a valid instruction;
and if they are determined to correspond to a valid instruction, process the valid instruction and feed back the result.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area: the program storage area may store an operating system and an application required for at least one function; the data storage area may store data created through use of the wake-free interaction apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory and non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor; these remote memories may be connected to the wake-free interaction apparatus over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention further provide a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium; the computer program includes program instructions which, when executed by a computer, cause the computer to perform any of the above wake-free interaction methods.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 8, the electronic device includes one or more processors 810 and a memory 820, with one processor 810 taken as an example in fig. 8. The device for the wake-free interaction method may further include an input device 830 and an output device 840. The processor 810, the memory 820, the input device 830, and the output device 840 may be connected by a bus or by other means; a bus connection is taken as the example in fig. 8. The memory 820 is a non-volatile computer-readable storage medium as described above. The processor 810 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 820, thereby implementing the wake-free interaction method of the above method embodiments. The input device 830 may receive input numeric or character information and generate key-signal inputs related to user settings and function control of the wake-free interaction apparatus. The output device 840 may include a display device such as a display screen.
The above product can execute the method provided by the embodiments of the present invention and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present invention.
As one embodiment, the electronic device is applied to a wake-free interaction apparatus and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
in response to receiving a valid voice signal of a user, determine a detection interval based on the time period in which the valid voice signal is located;
if a valid pointing feature of the user is detected within the detection interval, determine whether the valid voice signal and the valid pointing feature correspond to a valid instruction;
and if they are determined to correspond to a valid instruction, process the valid instruction and feed back the result.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication capability and are aimed primarily at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing capabilities, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices.
(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Servers: similar in architecture to a general-purpose computer, but with higher requirements for processing capability, stability, reliability, security, scalability, manageability, and the like, because they must provide highly reliable services.
(5) Other electronic devices with data interaction capability.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A wake-free interaction method, comprising:
in response to receiving a valid voice signal of a user, determining a detection interval based on the time period in which the valid voice signal is located;
if a valid pointing feature of the user is detected within the detection interval, determining whether the valid voice signal and the valid pointing feature correspond to a valid instruction;
and if they are determined to correspond to a valid instruction, processing the valid instruction and feeding back the result.
2. The method of claim 1, wherein, before the determining, if a valid pointing feature of the user is detected within the detection interval, whether the valid voice signal and the valid pointing feature correspond to a valid instruction, the method further comprises:
continuously detecting a visual signal, and when a valid pointing feature appears within a certain time period of the detected visual signal, marking that time period as a pointing interval;
and determining whether the detection interval contains a pointing interval.
3. The method of claim 1, wherein the method further comprises:
if no valid pointing feature of the user is detected within the detection interval, entering wake-up determination of the valid voice signal.
4. The method of claim 1, wherein, after the determining whether the valid voice signal and the valid pointing feature correspond to a valid instruction, the method further comprises:
if they do not correspond to a valid instruction, entering wake-up determination of the valid voice signal.
5. The method of claim 4, wherein the determining whether the valid voice signal and the valid pointing feature correspond to a valid instruction comprises:
acquiring the content pointed at by the valid pointing feature;
determining whether the valid voice signal is related to the content;
if so, determining that the valid voice signal and the valid pointing feature correspond to a valid instruction;
and if not, determining that the valid voice signal and the valid pointing feature correspond to an invalid instruction.
6. The method of claim 5, wherein the determining whether the valid voice signal is related to the content comprises:
performing speech recognition and semantic understanding on the valid voice signal, and determining whether the valid voice signal is related to the content based on the semantic understanding result.
7. The method of claim 5, wherein the content comprises a visual signal stream or a picture, and the semantic understanding intent corresponding to the content comprises operating a smart home device or learning about an object.
8. The method according to any one of claims 1-7, wherein the determining a detection interval based on the time period in which the valid voice signal is located comprises:
backtracking a first preset duration from the start time point of the time period in which the valid voice signal is located, to form a backtracking interval;
holding for a second preset duration from the end time point of the time period in which the valid voice signal is located, to form a holding interval;
and forming the detection interval from the backtracking interval, the time period in which the valid voice signal is located, and the holding interval.
9. A multimodal input feature processing method for a dialogue system, comprising:
in response to a device being awakened by a multimodal input feature, receiving the multimodal input feature and a user voice control instruction;
forming an actual control instruction based on the multimodal input feature and the user voice control instruction;
and responding to the actual control instruction.
10. A wake-free interaction apparatus for a device, comprising:
a signal receiving program module configured to, in response to receiving a valid voice signal, determine whether an image acquired within the valid-voice-signal interval contains a valid pointing feature, where a valid pointing feature is a pointing action made by the user, and an interval containing a valid pointing feature is a pointing interval;
a signal determination program module configured to, if the valid-voice-signal interval is determined to contain a valid pointing feature, input the multimodal information of the valid-voice-signal interval into a dialogue system for valid-instruction judgment, where the multimodal information includes audio and an image containing the pointing action;
and an instruction response program module configured to, if the multimodal information is judged to be a valid instruction, respond to the instruction and feed back the response result.
CN202011625969.XA 2020-12-31 2020-12-31 Wake-up-free interaction method and device Pending CN112863508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011625969.XA CN112863508A (en) 2020-12-31 2020-12-31 Wake-up-free interaction method and device


Publications (1)

Publication Number Publication Date
CN112863508A true CN112863508A (en) 2021-05-28

Family

ID=75999512


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936233A (en) * 2021-12-16 2022-01-14 北京亮亮视野科技有限公司 Method and device for identifying finger-designated target
CN114356275A (en) * 2021-12-06 2022-04-15 上海小度技术有限公司 Interaction control method and device, intelligent voice equipment and storage medium
CN115881118A (en) * 2022-11-04 2023-03-31 荣耀终端有限公司 Voice interaction method and related electronic equipment
WO2023246036A1 (en) * 2022-06-21 2023-12-28 珠海格力电器股份有限公司 Control method and apparatus for speech recognition device, and electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933198A (en) * 2019-03-13 2019-06-25 广东小天才科技有限公司 A kind of method for recognizing semantics and device
CN111968631A (en) * 2020-06-29 2020-11-20 百度在线网络技术(北京)有限公司 Interaction method, device, equipment and storage medium of intelligent equipment
CN111985272A (en) * 2019-05-22 2020-11-24 广东小天才科技有限公司 Reading tutoring method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933198A (en) * 2019-03-13 2019-06-25 广东小天才科技有限公司 A kind of method for recognizing semantics and device
CN111985272A (en) * 2019-05-22 2020-11-24 广东小天才科技有限公司 Reading tutoring method and device
CN111968631A (en) * 2020-06-29 2020-11-20 百度在线网络技术(北京)有限公司 Interaction method, device, equipment and storage medium of intelligent equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356275A (en) * 2021-12-06 2022-04-15 上海小度技术有限公司 Interaction control method and device, intelligent voice equipment and storage medium
CN114356275B (en) * 2021-12-06 2023-12-29 上海小度技术有限公司 Interactive control method and device, intelligent voice equipment and storage medium
CN113936233A (en) * 2021-12-16 2022-01-14 北京亮亮视野科技有限公司 Method and device for identifying finger-designated target
WO2023246036A1 (en) * 2022-06-21 2023-12-28 珠海格力电器股份有限公司 Control method and apparatus for speech recognition device, and electronic device and storage medium
CN115881118A (en) * 2022-11-04 2023-03-31 荣耀终端有限公司 Voice interaction method and related electronic equipment
CN115881118B (en) * 2022-11-04 2023-12-22 荣耀终端有限公司 Voice interaction method and related electronic equipment
WO2024093515A1 (en) * 2022-11-04 2024-05-10 荣耀终端有限公司 Voice interaction method and related electronic device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination