CN113948076A - Voice interaction method, device and system - Google Patents

Voice interaction method, device and system

Info

Publication number
CN113948076A
Authority
CN
China
Prior art keywords
interaction
model
information
voice interaction
recording
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010690864.6A
Other languages
Chinese (zh)
Inventor
吴纲律
王加芳
王全占
古鉴
聂再清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010690864.6A
Publication of CN113948076A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1822 - Parsing for meaning understanding
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using non-acoustical features, using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice interaction method, device and system are disclosed. The method comprises the following steps: starting a camera to acquire image information while starting a microphone to acquire sound information; inputting the acquired image information and sound information into an interaction decision model; and acquiring sound information for voice interaction using the microphone based on an output of the interaction decision model. The invention can thus use artificial intelligence to judge the user's interaction intention from the collected image information, or preferably from combined sound and image information, and intelligently record the interaction information accordingly, so that the user does not need to speak a wake-up word or manually start the voice interaction function.

Description

Voice interaction method, device and system
Technical Field
The present disclosure relates to the field of human-computer interaction, and in particular, to a voice interaction method, device, and system.
Background
With the development of voice recognition technology and wireless networks, devices with a voice interaction function, such as smart speakers, have become popular. Among these devices, a dedicated voice interaction device (such as a smart speaker) usually needs to be brought into an on state by a wake-up word before audio is recorded for recognition and feedback. For non-dedicated voice interaction devices (such as smart phones, in-vehicle systems, and smart appliances), the user is usually required to perform a dedicated operation (e.g., clicking a physical or virtual button) to enable the on state.
With either waking manner, each use requires a repeated wake-up operation (e.g., speaking a wake-up word or clicking an interaction button), which the user finds cumbersome and tedious. Further, because the microphone is turned off after a fixed period, or only after the user has waited through a relatively long silent period, the device may record irrelevant sounds such as other people's voices, or needlessly increase the user's waiting time.
For this reason, a scheme for more intelligently recording voice interaction information is required.
Disclosure of Invention
One technical problem to be solved by the present disclosure is to provide a voice interaction scheme that can determine a user's interaction intention through artificial intelligence, based on collected image information or, preferably, combined audio-visual information, and thereby record the interaction information intelligently.
According to a first aspect of the present disclosure, there is provided a voice interaction method, including: starting a camera to acquire image information, and simultaneously starting a microphone to acquire sound information; inputting the acquired image information and the acquired sound information into an interactive judgment model; and acquiring sound information for voice interaction using a microphone based on an output of the interaction determination model.
According to a second aspect of the present disclosure, there is provided a voice interaction method, including: judging that a person approaches and acquiring image information; inputting the image information into an interaction judgment model; and acquiring sound information for voice interaction based on the output of the interaction judgment model.
According to a third aspect of the present disclosure, there is provided a voice interaction method, including: acquiring image information; inputting the image information into an interaction judgment model; based on the output of the interaction decision model, sound information is obtained for voice interaction.
According to a fourth aspect of the present disclosure, there is provided a voice interaction method, including: obtaining multi-modal information, wherein the multi-modal information comprises at least two channels of information obtained simultaneously; inputting the multi-modal information into an interaction decision model; and acquiring sound information for voice interaction based on the output of the interaction decision model.
According to a fifth aspect of the present disclosure, there is provided a voice interaction device, comprising: a camera for obtaining image information, a microphone for obtaining sound information, a processor for: inputting the image information acquired by the camera into an interactive judgment model; based on an output of the interaction determination model, sound information for voice interaction is acquired via the microphone.
According to a sixth aspect of the present disclosure, there is provided a voice interaction system, comprising: the voice interaction device according to the fifth aspect above; and a computing node that communicates with the voice interaction device, stores the model, and provides model output to the voice interaction device.
According to a seventh aspect of the present disclosure, there is provided a voice interaction model training method, including: training an interaction decision model using images of a person who is speaking as positive labels and images of a person who is not speaking as negative labels, so that the interaction decision model determines, from the input image information, the recording start and recording end times of sound information for voice interaction based on a recording start threshold and a recording end threshold.
According to an eighth aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described in the first to fourth aspects above.
According to a ninth aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first to fourth aspects above.
Therefore, the voice interaction scheme of the invention can judge the user's interaction intention through artificial intelligence, based on image information or, preferably, combined sound and image information, and directly record the interaction information intelligently according to that intention. Specifically, the interaction decision model may be used to determine the recording start and end times; an intention recognition model may further be used to dynamically adjust the recording start and end decision thresholds of the interaction decision model; and a dynamic threshold model may also be used to adjust the thresholds based on dynamic learning. Natural and direct voice interaction can thus be realized without speaking a wake-up word or manually turning on the voice function.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows a schematic flow diagram of a voice interaction method according to one embodiment of the present invention.
Fig. 2 shows an example of determination of an interaction determination model according to the present invention.
Fig. 3 shows an identification example of an intention recognition model according to the invention.
Fig. 4 shows an example of the output of the dynamic threshold model according to the invention.
FIG. 5 illustrates a three-model interaction example of a multi-modal self-adjusting entry system according to the present invention.
FIG. 6 shows a schematic flow chart of a voice interaction method according to another embodiment of the present invention.
FIG. 7 shows a schematic flow chart of a voice interaction method according to another embodiment of the present invention.
FIG. 8 shows a block diagram of the components of a voice interaction device, according to one embodiment of the present invention.
FIG. 9 is a schematic structural diagram of a computing device that can be used for implementing the voice interaction method according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
With the development of voice recognition technology and wireless networks, devices with a voice interaction function, such as smart speakers, have become popular. Among these devices, a dedicated voice interaction device (such as a smart speaker) usually needs to be brought into an on state by a wake-up word before audio is recorded for recognition and feedback. For non-dedicated voice interaction devices (such as smart phones, in-vehicle systems, and smart appliances), the user is usually required to perform a dedicated operation (e.g., clicking a physical or virtual button) to enable the on state.
With either waking manner, each use requires a repeated wake-up operation (e.g., speaking a wake-up word or clicking an interaction button), which the user finds cumbersome and tedious. Further, because the microphone is turned off after a fixed period, or only after the user has waited through a relatively long silent period, the device may record irrelevant sounds such as other people's voices, or needlessly increase the user's waiting time.
The present disclosure therefore provides a voice interaction scheme that judges the user's interaction intention through artificial intelligence, based on image information or, preferably, combined sound and image information, and directly records the interaction information intelligently according to that intention. Specifically, the interaction decision model may be used to determine the recording start and end times; an intention recognition model may further be used to dynamically adjust the recording start and end decision thresholds of the interaction decision model; and a dynamic threshold model may also be used to adjust the thresholds based on dynamic learning. Natural and direct voice interaction can thus be realized without speaking a wake-up word or manually turning on the voice function.
FIG. 1 shows a schematic flow diagram of a voice interaction method according to one embodiment of the present invention. The method may be performed by a voice interaction device, in particular a voice interaction device comprising an image acquisition function (e.g. equipped with a camera).
In step S110, the camera is turned on to obtain image information, and the microphone is turned on to obtain sound information. In step S120, the acquired image information and the acquired sound information are input to an interaction determination model. Subsequently, in step S130, sound information for voice interaction is acquired using a microphone based on the output of the interaction determination model.
In this way, the multi-modal data including image and sound information is processed by the trained model, and the collection of user interaction information can be started directly based on the model's output, so that the user does not need to additionally speak a wake-up word or manually start the voice interaction function.
For example, the voice interaction device may turn on the camera and the microphone at predetermined intervals to acquire sound and image information, or may turn them on when a proximity sensor or another mechanism reports that a person is approaching. The sound and image information is fed into the interaction determination model in real time, and the model makes a processing decision based on this input; for example, when it determines that a face is oriented toward the device and about to speak (for example, the input image frames indicate that a person is speaking while the input audio frames also indicate that a person is speaking or about to speak), it starts acquiring sound information for the interaction. The sound and image information captured by the camera and microphone may then continue to be fed into the interaction determination model, and the end time of the sound acquisition is determined based on the model's output. The piece of audio information thus obtained (e.g., recorded from the start to the end of the interactive sound acquisition) may be parsed semantically, locally or by a server or an edge computing node, and feedback given based on the processing result.
The interaction decision model may include, or be implemented as, a supervised-learning deep neural network model. Specifically, the positive labels for training the deep neural network model include images of a person who is speaking, and the negative labels include images of a person who is not speaking. Accordingly, during later inference, the trained interaction decision model can start the recording for interaction when the acquired image indicates that a person is speaking.
It will be understood that, in order to dynamically determine the end time of sound capture, the camera may continue to acquire image information while the microphone acquires sound information for the voice interaction. This makes it possible to determine, based on the output of the interaction decision model, a recording start time and/or a recording end time for recording the sound information for voice interaction with the microphone. For example, at time t0 (e.g., second 0), the camera and microphone begin to capture image and sound information and continuously feed it into the interaction decision model. At time t1 (e.g., second 1), the interaction decision model determines that the user has an interaction intention (e.g., based on an image frame acquired at time t1 showing that the user intends to speak, together with an audio frame containing human voice), and the acquisition of sound information for the user's voice interaction can be started based on the model's output. At time t2 (e.g., second 12), the interaction decision model determines that the user's interaction intention has ended (e.g., based on an image frame acquired at time t2 showing the user turning their head away, together with an audio frame containing no voice), and the acquisition of sound information for the voice interaction can be ended.
In other words, during the 12 seconds from time t0 to time t2, the camera and microphone continuously collect image and sound information, which is continuously fed into the interaction decision model. During the period in which the user is judged to have an interaction intention (i.e., the 11 seconds from time t1 to time t2), the sound information collected by the microphone can also be reused as the sound information for the voice interaction. For example, the voice interaction device may process those 11 seconds of sound locally or upload them, perform natural language processing (NLP) on the speech, extract the semantics, analyze the intention, and give corresponding feedback. The 11 seconds of sound information may be transmitted and processed in segments, or transmitted and processed once after the recording ends.
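As a non-authoritative illustration of the segment-wise handling just described, the following Python sketch buffers microphone frames captured between the recording start and end times and hands them off in segments; the frame rate, segment length and the send_segment callback are assumptions rather than details taken from this disclosure.

from typing import Callable, Iterable

SEGMENT_SECONDS = 2       # assumed segment length
FRAMES_PER_SECOND = 50    # assumed audio frame rate

def stream_interaction_audio(frames: Iterable[bytes],
                             send_segment: Callable[[bytes], None]) -> None:
    """Buffer audio frames captured between the recording start and end times
    and hand them to the backend segment by segment; the description also
    allows sending everything once after the recording ends."""
    buffer = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) >= SEGMENT_SECONDS * FRAMES_PER_SECOND:
            send_segment(b"".join(buffer))   # e.g. upload for NLP parsing
            buffer.clear()
    if buffer:                               # flush the tail when recording ends
        send_segment(b"".join(buffer))

A caller would pass the microphone frame iterator and, for instance, an upload function as send_segment; both are placeholders here.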
Specifically, the decision of the interaction determination model may be made against thresholds: based on the current image and sound input, the model may start recording when its output meets a recording start threshold and/or end recording when its output meets a recording end threshold.
Fig. 2 shows an example of the decision made by the interaction determination model according to the present invention. As illustrated in Fig. 2, audio-visual data captured by, for example, the camera and microphone of the voice interaction device, e.g., video frames with audio information (such as images including a human face as illustrated), may be input into the interaction decision model of the present invention as an image sequence (e.g., image frames) and audio (e.g., audio frames), respectively. The interaction decision model may calculate a recording start score from the input image and audio, and start audio-video recording for voice interaction when the start score exceeds threshold 1 (the recording start threshold). After recording begins, the model may continue to receive images and audio and calculate a recording end score. When the calculated recording end score exceeds threshold 2 (the recording end threshold), the audio-video recording for voice interaction is stopped.
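The start/end decision flow of Fig. 2 can be sketched as follows; the model object with start_score and end_score methods and the concrete threshold values are assumed stand-ins for the interaction decision model, not an implementation defined by this disclosure.

def run_recording_controller(av_frames, model,
                             start_threshold=0.8, end_threshold=0.7):
    """Drive recording from the interaction decision model's scores.

    av_frames yields (image, audio) pairs; model.start_score(image, audio)
    and model.end_score(image, audio) are assumed interfaces returning the
    scores described in the text. Returns the frames recorded between the
    start and end decisions (empty if recording never started)."""
    recording = False
    recorded = []
    for image, audio in av_frames:
        if not recording:
            if model.start_score(image, audio) > start_threshold:  # threshold 1
                recording = True
                recorded.append((image, audio))
        else:
            recorded.append((image, audio))
            if model.end_score(image, audio) > end_threshold:      # threshold 2
                break
    return recorded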
It should be understood that although a single interaction decision model is shown that calculates both the start score and the end score from the audio-visual input, in other embodiments different models, or different sub-models within one model, may be used to calculate the start score and the end score, respectively. In addition, in different embodiments, each threshold may be fixed, adjusted empirically, or dynamically self-adjusted based on the output of other models as described below.
Because both image and sound modalities are used, the recording function is enabled only when the sound (speech being heard) and the image (the mouth starting to move) trigger simultaneously and roughly agree in speech timing and semantics. When either modality is not satisfied, sound capture is automatically turned off. The user's interaction intention can thus be judged accurately, the user can interact by voice without any additional operation (such as speaking a wake-up word or manually starting the voice interaction function), and recording large amounts of meaningless video and sound is avoided.
In addition, in order to improve the accurate judgment of the user interaction intention, other models can be further introduced into the voice interaction scheme provided by the invention to enhance the capability of the scheme in responding to different environments and user states.
To this end, in one embodiment, the voice interaction method may further include inputting the acquired sound information into an intention recognition model, and adjusting the value of the recording start threshold and/or the recording end threshold based on the output of the intention recognition model. The intention recognition model may recognize the user's intention from the input sound information, and for example lower the recording start threshold when it judges that the user intends to interact with the voice device, or change the recording end threshold when, for example, the background is noisy. It should be noted that, in different embodiments, a higher threshold may correspond to stricter start/end conditions, or a lower threshold may correspond to stricter conditions; the invention does not limit the direction of the thresholds. For example, when the intention recognition model judges from the sound that the environment is noisy, the recording start threshold may be raised, so that the interaction decision model starts recording only on clearer evidence of interaction in the audio-visual data and ends recording only on clearer evidence that the interaction is over. As another example, the intention recognition model may recognize the semantics of the input sound information and raise the recording start threshold when the semantics are unrelated to interaction.
In a preferred embodiment, the intention recognition model may further take an image input, so that the intention decision is made from joint multi-modal sound and image information. For example, the semantics may suggest that the user has an interaction intention, while the image shows that the user is making a phone call; in that case the recording start threshold should still be raised.
Fig. 3 shows a recognition example of the intention recognition model according to the invention. As shown, the intention recognition model may also take the image sequence and audio as inputs and output a corresponding feature vector (Embedding1), which can be compared with an existing feature vector in the database (Embedding2); for example, a context correlation function may determine the relevance of the two vectors, e.g., based on the cosine similarity Cosine(Embedding1, Embedding2), and output "yes" or "no" according to the result. Here, the intention recognition model may be trained using, for example, noise intensity, noise frequency, and human behavior or posture as labels. In the subsequent inference phase, the input images and audio are processed by the intention recognition model into a multi-dimensional feature vector (Embedding1), which is compared with one or more contextual feature vectors (Embedding2) stored in the database to determine the relevance that represents intention.
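The relevance comparison between Embedding1 and a stored Embedding2 described above can be illustrated with a plain cosine-similarity check; the 0.5 cut-off below is an assumed value for illustration only.

import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def has_interaction_intent(embedding1, context_embeddings, cutoff=0.5):
    """Return True ("yes") if Embedding1 is close enough to any stored
    context vector Embedding2, otherwise False ("no")."""
    return any(cosine_similarity(embedding1, e2) > cutoff
               for e2 in context_embeddings)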
Here, if the output is "yes", the user may be considered to have an interaction intention, and the threshold of the interaction decision model is lowered accordingly, making it easier for the interaction decision model to conclude that the user wants to interact. If the output is "no", the user may be considered to have no interaction intention (even if the interaction decision model alone might suggest otherwise), and the threshold of the interaction decision model is raised accordingly.
While the threshold adjustment can be made based directly on the output of the intent recognition model, in a preferred embodiment, a third model, a dynamic threshold model, can also be introduced. The model can obtain the output of the intention recognition model, and dynamically adjust the value of the recording start threshold and/or the recording end threshold.
Fig. 4 shows an example of the output of the dynamic threshold model according to the invention. As shown, the dynamic threshold model in the present invention may be a reinforcement learning model.
Machine learning is an important research field of artificial intelligence and, according to whether feedback is obtained from the system, can be divided into three categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning, also known as learning with a teacher, requires that for a given set of inputs the corresponding outputs are also given, so the system learns in an environment with a known input-output dataset. Both the interaction decision model and the intention recognition model of the present invention can be implemented by supervised learning.
In contrast to supervised learning, unsupervised learning (also called learning without a teacher) requires only a set of inputs, without corresponding outputs; the system learns from the internal structure of the given inputs by itself. Supervised and unsupervised machine learning can solve most machine learning problems, but both differ considerably from how humans learn and how organisms evolve. Biological evolution is a learning process of actively probing the environment, evaluating and summarizing the feedback the environment returns after each probe so as to improve and adjust one's own behavior, after which the environment gives new feedback on the new behavior and the behavior is adjusted again. In machine learning, the learning mode embodying this idea is called Reinforcement Learning (RL). Reinforcement learning is therefore a machine learning mode parallel to supervised learning and unsupervised learning.
The whole reinforcement learning system consists of five parts, namely an Agent (Agent), a State (State), a Reward (Reward), an Action (Action) and an Environment (Environment).
The Agent is the core of the whole reinforcement learning system. It can perceive the State of the environment and, by learning to select an appropriate Action based on the reinforcement signal (Reward) provided by the environment, maximize the long-term reward value. In short, the Agent learns a mapping from environment States to Actions according to the Reward provided by the environment as feedback, and the principle of action selection is to maximize the probability of accumulating future Reward. A selected action affects not only the Reward at the current moment but also the Reward at the next moment and beyond, so the basic rule of the Agent in the learning process is: if an Action brings a positive Reward from the environment, the tendency to take that Action is strengthened; otherwise it is gradually weakened, similar to the conditioned reflex principle in physiology.
In the present invention, the reinforcement learning model serving as the dynamic threshold model takes the image information, the sound information, and the recognition result of the speaking intention as its input (State), and adjusts the values of the recording start threshold and/or the recording end threshold as its Action in real time, with the correctness of the sound information acquired for voice interaction serving as the Reward. As shown in Fig. 4, the dynamic threshold model may have n behaviors corresponding to different recording start and end threshold values. Based on the currently input image and audio data and on the intention output of the intention recognition model (e.g., "yes" or "no" for intentional and unintentional), the dynamic threshold model gives a corresponding set of threshold values, which are passed to the interaction decision model, and speech data for interaction is obtained accordingly. Whether the interaction over that speech data goes smoothly is then used to evaluate the correctness of the chosen behavior, and the threshold selection, or even the set of behaviors itself, is revised in real time.
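Purely as a simplified illustration of such a dynamic threshold model, the sketch below keeps a running reward estimate per threshold behavior and selects thresholds epsilon-greedily; the discrete action set, the epsilon value and the reward scheme are assumptions, and a real implementation would also condition on the image/sound/intent state rather than on action statistics alone.

import random

# Assumed discrete behaviors: each action is a (start_threshold, end_threshold) pair.
ACTIONS = [(0.6, 0.5), (0.7, 0.6), (0.8, 0.7), (0.9, 0.8)]

class DynamicThresholdModel:
    """Simplified stand-in for the reinforcement-learning dynamic threshold
    model: tracks an average reward per action and mostly picks the best one."""

    def __init__(self, epsilon=0.1):
        self.values = [0.0] * len(ACTIONS)   # running mean reward per action
        self.counts = [0] * len(ACTIONS)
        self.epsilon = epsilon

    def select_thresholds(self):
        """Return (action index, (start_threshold, end_threshold))."""
        if random.random() < self.epsilon:
            idx = random.randrange(len(ACTIONS))
        else:
            idx = max(range(len(ACTIONS)), key=lambda i: self.values[i])
        return idx, ACTIONS[idx]

    def update(self, idx, interaction_went_smoothly):
        """Reward the chosen behavior according to whether the resulting
        voice interaction went smoothly (the correctness signal in the text)."""
        reward = 1.0 if interaction_went_smoothly else -1.0
        self.counts[idx] += 1
        self.values[idx] += (reward - self.values[idx]) / self.counts[idx]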
FIG. 5 illustrates a three-model interaction example of a multi-modal self-adjusting entry system according to the present invention.
The system consists of three main modules, i.e., three models: the interaction decision model, the dynamic threshold model and the intention recognition model, all of which preferably use image and sound data acquired by the camera and the microphone as input.
The interaction decision model is the main module of the whole system. By extracting and fusing features of the images and sounds simultaneously, it determines the start time and the end time and thereby outputs an audio clip (preferably an audio-video clip). Here, the start time and the end time are the points at which the user's speech starts and stops within a piece of audio-video, and the result may be stored as a video segment. Whether the start time and end time are output is determined by two threshold parameters, the recording start threshold and the recording end threshold. When the start-time score output by the interaction decision model exceeds the recording start threshold, the model starts outputting from that moment. Similarly, when the end-time score output by the model exceeds the recording end threshold, the output ends at that moment.
The dynamic threshold model is a parameter adjustment module. Taking the images and sounds together with the outputs of the intention recognition model (such as noise intensity, noise frequency, and the user's state) as input, it outputs the two threshold parameters: the recording start threshold and the recording end threshold. These two parameters govern whether the interaction decision model outputs the start time and end time, and thus whether the audio-video clip needs to be saved.
The intention recognition model is a content/intention relevance module. By taking the image and the sound as simultaneous inputs (such as the semantics contained in the sound, or whether the user is looking at the screen for a long time in the image), it judges whether the user is consciously operating the device, outputting "yes" if so and "no" if not.
Because both image and sound modality information is used, the recording function is enabled only when the sound (speech being heard) and the image (the mouth starting to move) trigger simultaneously and roughly agree in speech timing and semantics. When either modality is not satisfied, sound capture is automatically turned off, so the user avoids recording large amounts of meaningless video and sound without any extra operation (such as speaking a wake-up word or clicking a voice interaction button).
In addition, by adding an intention recognition model and a dynamic threshold adjustment model, it is made possible to dynamically adjust model parameters according to changes in the environment and the state of the user at that time.
For example, when the environmental noise is loud, the device is easily woken by mistake due to the environment; in that case the dynamic threshold model automatically raises the activation threshold of the interaction decision model, making it harder for the interaction decision model to be woken into recording video. When the user is making a phone call, the intention recognition model judges from the speech semantics of the user holding the phone and talking that the user is not operating the device, and passes this irrelevance result to the dynamic threshold model, which raises the threshold of the interaction decision model and thereby increases its wake-up difficulty. This self-adjusting design ensures that the device can be activated reliably when the user really wants to use it, while minimizing interference from the environment (background sounds, etc.) and thus reducing the probability of the recording being falsely triggered.
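The adjustment behavior described in this paragraph can be summarized, for illustration only, by the rule-of-thumb function below; the offsets and the noise measure are assumptions, and in the actual scheme the dynamic threshold model learns this adjustment rather than hard-coding it.

def adjust_start_threshold(base_start, noise_level, intent_is_relevant):
    """Raise the wake-up difficulty when the environment is noisy or when the
    intention recognition model reports that the user is not addressing the
    device (e.g. is holding a phone call); the values are illustrative only."""
    start = base_start
    if noise_level > 0.5:        # noisy environment: harder to trigger by mistake
        start += 0.1
    if not intent_is_relevant:   # user is not operating the device
        start += 0.1
    return min(start, 1.0)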
The voice interaction method of the present invention may further include at least partially blurring the image information. Complete face information then cannot be extracted from the blurred image, which protects personal privacy. For example, parts that are meaningless for judging the user's interaction intention, such as the nose and ears, may be blurred, or the image may be blurred as a whole by an algorithm while retaining the information needed to determine the user's intention.
In one embodiment, the blurring may be applied directly as the camera captures the image. In another embodiment, the blurring may be performed before the image information is fed into the model. In other embodiments, the information sent to the cloud may be blurred and the local image information deleted if necessary.
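A minimal sketch of the partial blurring step, assuming OpenCV is available and that the regions to hide have already been located by some detector (region detection itself is outside this sketch):

import cv2  # OpenCV, assumed available

def blur_regions(image, regions, kernel=(51, 51)):
    """Blur the listed (x, y, w, h) regions in place so that complete face
    details cannot be recovered, while the rest of the frame stays usable
    for intention estimation."""
    for (x, y, w, h) in regions:
        roi = image[y:y + h, x:x + w]
        image[y:y + h, x:x + w] = cv2.GaussianBlur(roi, kernel, 0)
    return image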
A voice interaction method and a preferred implementation thereof according to the present invention is described above in connection with fig. 1-5. FIG. 6 shows a schematic flow chart of a voice interaction method according to another embodiment of the present invention.
In step S610, it is determined that a person approaches and image information is acquired. In step S620, the image information is input to an interaction determination model. In step S630, based on the output of the interaction determination model, sound information is acquired for voice interaction.
Here, data may be fed to the interaction decision model when the user approaches the voice interaction device. In different embodiments, the determination that a person is approaching may be based on different mechanisms. For example, the camera may be turned on at irregular intervals to acquire image information and recognize faces using relatively cheap computations such as key-point extraction. In other embodiments, proximity information may be received via the networking unit; for example, in a home Internet of Things, the determination may be based on proximity information from other devices. A proximity sensor may also be used to sense that a person is approaching; it may be mounted on the voice interaction device itself, or be an Internet of Things device that communicates with the voice interaction device.
After the face is recognized, the screen may be lit and interactive content displayed, and the image information to be fed to the interaction decision model is acquired while the interactive content is displayed. For example, the voice interaction device may determine that a person is close based on key-point extraction; the device's display (e.g., a touch screen) is then lit automatically, and the acquired image information is input into the interaction decision model. From then on, as soon as the interaction decision model determines from the image that the user intends to open their mouth to speak, or is looking toward the screen, the recording for voice interaction can be started. To this end, the interaction decision model may be trained using images of a person speaking as positive labels, and/or using images of a person looking in the shooting direction as positive labels.
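The proximity-triggered flow described above might be orchestrated as in the following sketch; device, frame_source and the helper methods (face_detected, light_screen_with_prompt, wants_to_speak, start_voice_recording) are hypothetical interfaces, not APIs defined by this disclosure.

def on_possible_proximity(device, frame_source, interaction_model, max_polls=100):
    """Cheap face check -> light the screen -> let the interaction decision
    model decide whether to start recording. Returns True if recording started."""
    frame = frame_source.capture()
    if not device.face_detected(frame):          # key-point based, low cost
        return False
    device.light_screen_with_prompt()            # show interactive content
    for _ in range(max_polls):                   # bounded polling, illustrative
        frame = frame_source.capture()
        if interaction_model.wants_to_speak(frame):   # mouth opening / gaze at screen
            device.start_voice_recording()
            return True
    return False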
Here, the interaction determination model may perform determination using only image information. In a preferred embodiment, the decision can also be made using multimodal information (e.g., also including audio information) as described above. In addition, the embodiment may also utilize an intent recognition model and/or a dynamic threshold model, thereby making a more accurate determination of the user intent.
FIG. 7 shows a schematic flow chart of a voice interaction method according to another embodiment of the present invention. Compared with the preceding voice interaction methods, this method covers broader application scenarios.
In step S710, image information is acquired. In step S720, the image information is input into an interaction determination model. In step S730, based on the output of the interaction determination model, sound information is acquired for voice interaction. Thus, using the trained deep learning model, the recording is determined based on the input images. Further, the method further comprises: acquiring sound information while acquiring image information; and inputting the sound information into the interaction determination model. Thus, the depth model may determine the start and end of the recording based on the image and sound combination.
In one embodiment, obtaining the acoustic information for the voice interaction based on the output of the interaction decision model may include: and recording sound information for voice interaction under the condition that the output of the interaction judgment model is greater than the recording start threshold value. Further, recording of the sound information for voice interaction may be ended in a case where the output of the interaction determination model is greater than a recording end threshold. Thus, by introducing a threshold value, the conditions required to be met for recording start and end are adjusted.
In one embodiment, the value of the recording start threshold and/or the recording end threshold may also be adjusted based on recognition of the intent to speak. This intent recognition may be implemented by a machine learning model; to this end, recognizing the speaking intent includes inputting the acquired sound information into an intention recognition model and obtaining the model's output. Correspondingly, adjusting the value of the recording start threshold and/or the recording end threshold based on the recognition of the speaking intent includes lowering the recording start threshold and/or the recording end threshold when a conscious interaction is recognized. Further, a third model, a dynamic threshold model, may be introduced; it obtains the recognition result for the speaking intent and dynamically adjusts the value of the recording start threshold and/or the recording end threshold.
In addition, in addition to acquiring sound information for voice interaction based on the output of the interaction determination model, image information may be acquired to help improve interaction accuracy or to determine a recording end time.
The voice interaction scheme of the present invention can also be implemented as a voice interaction device. FIG. 8 shows a block diagram of the components of a voice interaction device, according to one embodiment of the present invention. Device 800 may include a camera 810, a microphone 820, and a processor 830.
The camera 810 is used for acquiring image information, the microphone 820 is used for acquiring sound information, and the processor 830 is used for inputting the image information acquired by the camera 810 into the interaction judgment model; and acquires sound information for voice interaction via the microphone 820 based on the output of the interaction determination model.
The processor 830 may control the turning on and off of the camera and the microphone, for example, turn on the microphone and input the sound information acquired by the microphone into the interaction determination model.
In various embodiments, the models may be stored locally, remotely over a network, or both. To this end, the device may further include a networking unit for sending the acquired image information and/or sound information and receiving processing results for that information. Decisions such as the start and end of recording can thus be made using a model stored, for example, on a cloud server, an edge computing device, or another central computing node. Alternatively or additionally, the device may further include a storage unit for storing the model that processes the acquired image information and/or sound information.
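Where the model runs on a remote computing node, the networking unit's round trip could look like the following sketch; the endpoint URL, payload shape and response format are assumptions for illustration only.

import json
import urllib.request

def remote_decision(audio_b64, image_b64,
                    endpoint="http://compute-node.local/decide"):  # hypothetical URL
    """Send one audio/image sample to a compute node that holds the models and
    return its decision (e.g. start/end scores) as a parsed JSON object."""
    payload = json.dumps({"audio": audio_b64, "image": image_b64}).encode("utf-8")
    request = urllib.request.Request(endpoint, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=2) as response:
        return json.loads(response.read().decode("utf-8"))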
Further, the device may include a screen for interacting with the user, for example a touch screen that is lit when the user is determined to be nearby. For example, an animated character may be displayed to greet the user, and recording starts once the interaction decision model subsequently determines that the user wants to interact via the screen.
Further, the apparatus may further include: and the voice output unit is used for outputting voice feedback of the voice interaction. The voice output unit may include a speaker, or a wired or wirelessly connected earphone or a sound box, etc.
The device may further include a proximity determination unit for determining that a person is approaching. In various implementations, the proximity determination unit may include: the camera, turned on to acquire image information and recognize a face based on key-point extraction; a networking unit for receiving proximity information; and/or a proximity sensor for sensing the approach of a person.
The voice interaction device of the present invention may in particular be implemented as a smart speaker, e.g., a smart speaker with a screen and a camera. Such a smart speaker can realize the voice interaction method described above, carrying out voice interaction without wake-up-word activation: when the user stands in front of it and opens their mouth to speak, it judges the user's interaction intention and records the corresponding information.
In combination with the above voice interaction device, the present invention may further include a voice interaction system, which may include the voice interaction device as described above; and the computing node is communicated with the voice interaction equipment, stores the model and provides model output for the voice interaction equipment.
In different implementations, the computing nodes may take different forms. For example, when the voice interaction system is a locally implemented Internet of Things system, the computing node may be a local computing node, e.g., a smart speaker acting as the central node of the Internet of Things, or a higher-performance computing device in a commercial setting. In such an Internet of Things system, other IoT devices may be connected besides the voice interaction device and the central computing node, and information can be shared among the devices so that one or more of them can execute the voice interaction method.
In a more extensive implementation, the compute node may be an edge compute device. The edge computing device may support a wider network, such as an industrial campus network, a campus network, etc., and act as a storage and processing server for the models described above in the voice interaction devices in the network.
In a wider implementation, the computing node may be a server located in the cloud. The server may provide voice interaction services, with user-intent determination as described above, for a vast number of voice interaction devices.
The computing node can subsequently acquire sound information for voice interaction; and generating and issuing feedback of the voice interaction.
The invention can also be realized as a voice interaction model training method, comprising: training an interaction decision model using images of a person who is speaking as positive labels and images of a person who is not speaking as negative labels, so that the interaction decision model determines, from the input image information, the recording start and recording end times of sound information for voice interaction based on a recording start threshold and a recording end threshold.
Further, an intent recognition model may also be trained using sound information (preferably including image information), wherein the recording start threshold and recording end threshold may be dynamically adjusted based on the output of the intent recognition model.
Further, a dynamic threshold model may be constructed as a reinforcement learning model, which takes image information, voice information, and the recognition result of the speaking intent as inputs, and adjusts, in real time, the value of the recording start threshold and/or the recording end threshold as a behavior based on whether the voice information acquired for voice interaction is correct or not.
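A minimal supervised training sketch for the interaction decision model described two paragraphs above, using PyTorch as an assumed framework; the data loader, network architecture and hyperparameters are not specified by this disclosure and are illustrative only.

import torch
from torch import nn

def train_interaction_model(model, loader, epochs=5, lr=1e-4):
    """Binary supervised training: images of speaking persons carry label 1
    (positive) and images of non-speaking persons label 0 (negative).
    The model is assumed to output one logit per image."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:          # labels: 1.0 speaking, 0.0 not speaking
            optimizer.zero_grad()
            logits = model(images).squeeze(1)  # shape (N, 1) -> (N,)
            loss = criterion(logits, labels.float())
            loss.backward()
            optimizer.step()
    return model

At inference time, the sigmoid of the logit can then be compared against the recording start and end thresholds discussed above.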
FIG. 9 is a schematic structural diagram of a computing device that can be used for implementing the voice interaction method according to an embodiment of the present invention.
Referring to fig. 9, computing device 900 includes memory 910 and processor 920.
The processor 920 may be a multi-core processor or may include multiple processors. In some embodiments, processor 920 may include a general-purpose main processor and one or more special purpose coprocessors such as a Graphics Processor (GPU), Digital Signal Processor (DSP), or the like. In some embodiments, processor 920 may be implemented using custom circuits, such as Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs).
The memory 910 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions for the processor 920 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be non-volatile so that stored instructions and data are not lost even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the permanent storage; in other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable memory device, or a volatile readable and writable memory device such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. In addition, the memory 910 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 910 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted wirelessly or by wire.
The memory 910 has stored thereon executable code, which when processed by the processor 920, causes the processor 920 to perform the voice interaction methods described above.
The voice interaction scheme according to the present invention has been described in detail above with reference to the accompanying drawings. The scheme can judge the user's interaction intention through artificial intelligence, based on image information or, preferably, combined sound and image information, and can directly record the interaction information intelligently according to that intention. Specifically, the interaction decision model may be used to determine the recording start and end times; an intention recognition model may further be used to dynamically adjust the recording start and end decision thresholds of the interaction decision model; and a dynamic threshold model may also be used to adjust the thresholds based on dynamic learning. Natural and direct voice interaction can thus be realized without speaking a wake-up word or manually turning on the voice function. Because the recording process is completed using the two modalities of image and sound, and a dynamic threshold model and an intention recognition model are designed, the recording process can be more personalized and intelligent.
In a broader implementation, the invention may use combinations of information other than sound and picture information, i.e., other multi-modal information, to determine the user's interaction intention. Thus, the present invention may be implemented as a voice interaction method comprising: obtaining multi-modal information, the multi-modal information comprising at least two channels of information obtained simultaneously; inputting the multi-modal information into an interaction decision model; and acquiring sound information for voice interaction based on the output of the interaction decision model.
In one embodiment, one channel of the information includes captured sound information. In another embodiment, one channel of the multi-modal information includes user status information obtained via a sensor. For example, the model's intention determination may be made using the sound and picture information as described above, or using, for example, position information acquired by a device worn by the user (e.g., a smart watch) in combination with sound information. In other embodiments, other sensors capable of acquiring information that reflects the user's communication intention may be used for the model's determination.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (43)

1. A voice interaction method, comprising:
starting a camera to acquire image information, and simultaneously starting a microphone to acquire sound information;
inputting the acquired image information and the acquired sound information into an interactive judgment model; and
based on an output of the interaction determination model, sound information for voice interaction is acquired using a microphone.
2. The method of claim 1, further comprising:
carrying out voice recognition on the acquired voice information for voice interaction; and
based on the result of the speech recognition, outputting feedback of the speech interaction.
3. The method of claim 1, wherein the interaction decision model comprises:
there is a deep neural network model for supervised learning.
4. The method of claim 1, wherein positive labels for training the deep neural network model comprise images of a person who is speaking, and negative labels comprise images of a person who is not speaking.
5. The method of claim 1, further comprising:
while sound information for voice interaction is acquired using the microphone, image information continues to be acquired using the camera.
6. The method of claim 5, wherein acquiring sound information for voice interaction using the microphone based on an output of the interaction decision model comprises:
determining, based on the output of the interaction decision model, a recording start time and/or a recording end time for recording the sound information for voice interaction using the microphone.
7. The method of claim 6, wherein determining a recording start time and/or a recording end time for recording the sound information for voice interaction using the microphone based on the output of the interaction decision model comprises:
starting recording when the output of the interaction decision model, based on the currently input image information and sound information, meets a recording start threshold, and/or stopping recording when the output meets a recording end threshold.
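Purely as an illustration of one reading of claims 6-7 (not part of the claims), the sketch below gates recording on a single model output compared against a start threshold and an end threshold; the threshold values are assumptions.

class RecordingGate:
    """Minimal start/stop gate driven by the interaction decision model output."""

    def __init__(self, start_threshold: float = 0.8, end_threshold: float = 0.6):
        self.start_threshold = start_threshold  # output at or above this starts recording
        self.end_threshold = end_threshold      # output at or above this (while recording) ends it
        self.recording = False

    def update(self, model_output: float) -> str:
        """Return 'start', 'stop' or 'hold' for the current model output."""
        if not self.recording and model_output >= self.start_threshold:
            self.recording = True
            return "start"
        if self.recording and model_output >= self.end_threshold:
            self.recording = False
            return "stop"
        return "hold"

Under this reading the same output is compared against both thresholds, as the claim wording suggests; a deployment could equally use separate start and end scores.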
8. The method of claim 7, wherein,
inputting the acquired sound information into an intention recognition model, and adjusting the value of the recording start threshold and/or the recording end threshold based on the output of the intention recognition model.
9. The method of claim 8, wherein adjusting values of the recording start threshold and/or the recording end threshold comprises:
acquiring, by a dynamic threshold model, the output of the intention recognition model, and dynamically adjusting the value of the recording start threshold and/or the recording end threshold.
10. The method according to claim 9, wherein the dynamic threshold model is a reinforcement learning model that takes the image information, the sound information, and the recognition result of the speaking intention as input, and adjusts, as its action, the value of the recording start threshold and/or the recording end threshold in real time based on whether the sound information acquired for voice interaction is correct.
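As an illustrative sketch only (not part of the claims), the toy adjuster below captures the spirit of claims 9-10: thresholds are nudged in real time, and a reward reflecting whether the captured utterance was correct drives the adjustment. It is a context-free epsilon-greedy bandit rather than a full reinforcement learning agent conditioned on image, sound and intent features; all names, numbers and the update rule are assumptions.

import random

class DynamicThresholdAgent:
    """Toy reward-driven threshold adjuster (epsilon-greedy bandit)."""

    ACTIONS = (-0.05, 0.0, +0.05)   # lower, keep, or raise the recording start threshold

    def __init__(self, start_threshold: float = 0.8, epsilon: float = 0.1, lr: float = 0.1):
        self.threshold = start_threshold
        self.epsilon = epsilon
        self.lr = lr
        self.value = {a: 0.0 for a in self.ACTIONS}   # running value estimate per action
        self._last_action = 0.0

    def act(self) -> float:
        """Pick an adjustment (explore with probability epsilon) and apply it."""
        if random.random() < self.epsilon:
            action = random.choice(self.ACTIONS)
        else:
            action = max(self.value, key=self.value.get)
        self._last_action = action
        self.threshold = min(1.0, max(0.0, self.threshold + action))
        return self.threshold

    def feedback(self, recording_was_correct: bool) -> None:
        """Reward +1 if the captured utterance was usable, -1 otherwise."""
        reward = 1.0 if recording_was_correct else -1.0
        a = self._last_action
        self.value[a] += self.lr * (reward - self.value[a])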
11. The method of claim 1, further comprising:
performing at least partial blurring processing on the image information.
12. A voice interaction method, comprising:
determining that a person is approaching, and acquiring image information;
inputting the image information into an interaction decision model; and
acquiring sound information for voice interaction based on the output of the interaction decision model.
13. The method of claim 12, wherein determining that a person is approaching comprises:
starting a camera to acquire image information, and recognizing a human face based on a key-point extraction technique.
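For illustration only (not part of the claims): claim 13 recites key-point based face recognition, and the sketch below substitutes OpenCV's bundled Haar-cascade face detector as a stand-in, simply to show where a "person is approaching" decision could plug in. The size-based proximity heuristic is an assumption.

import cv2

def person_approaching(frame) -> bool:
    """Return True if at least one sufficiently large face is detected in the frame."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    frame_area = frame.shape[0] * frame.shape[1]
    # a large bounding box is used as a rough proxy for "close to the device"
    return any(w * h > 0.05 * frame_area for (_, _, w, h) in faces)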
14. The method of claim 12, further comprising:
after the human face is recognized, lighting up a screen and displaying interactive content; and
acquiring image information to be input into the interaction decision model while the interactive content is displayed.
15. The method of claim 12, wherein,
the interaction decision model is trained using images of a speaking person as positive labels; and/or
the interaction decision model is trained using images of a person looking in the shooting direction as positive labels.
16. A voice interaction method, comprising:
acquiring image information;
inputting the image information into an interaction decision model; and
acquiring sound information for voice interaction based on the output of the interaction decision model.
17. The method of claim 16, further comprising:
acquiring sound information while acquiring image information;
inputting the sound information into the interaction decision model.
18. The method of claim 16, wherein acquiring sound information for voice interaction based on the output of the interaction decision model comprises:
recording sound information for voice interaction in a case where the output of the interaction decision model is greater than a recording start threshold.
19. The method of claim 18, wherein acquiring sound information for voice interaction based on the output of the interaction decision model comprises:
ending the recording of the sound information for voice interaction in a case where the output of the interaction decision model is greater than a recording end threshold.
20. The method of claim 19, further comprising:
adjusting the value of the recording start threshold and/or the recording end threshold based on the recognition of the speaking intention.
21. The method of claim 20, wherein the recognition of the speaking intention comprises:
inputting the acquired sound information into an intention recognition model; and
obtaining an output of the intention recognition model.
22. The method of claim 20, wherein adjusting the value of the recording start threshold and/or the recording end threshold based on the recognition of the speaking intention comprises:
reducing the recording start threshold and/or the recording end threshold upon recognition of an intention to interact.
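A minimal illustration (not part of the claims) of the adjustment recited in claims 20-22: when the intention recognition model reports a high probability of an intention to interact, both thresholds are lowered so that recording starts more readily. The cut-off and scaling factors are assumptions.

def adjust_thresholds(start_threshold: float, end_threshold: float,
                      intent_score: float, intent_cutoff: float = 0.7):
    """Lower both thresholds when an intention to interact is recognized."""
    if intent_score >= intent_cutoff:
        return start_threshold * 0.8, end_threshold * 0.8
    return start_threshold, end_threshold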
23. The method of claim 20, wherein adjusting values of the recording start threshold and/or the recording end threshold comprises:
acquiring, by a dynamic threshold model, the recognition result for the speaking intention, and dynamically adjusting the value of the recording start threshold and/or the recording end threshold.
24. The method of claim 16, further comprising:
based on the output of the interaction decision model, acquiring image information while the sound information for voice interaction is acquired.
25. A voice interaction method, comprising:
obtaining multi-modal information, wherein the multi-modal information comprises at least two channels of information acquired simultaneously;
inputting the multi-modal information into an interaction decision model; and
acquiring sound information for voice interaction based on the output of the interaction decision model.
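Illustrative only (not part of the claims): a minimal fusion head for the multi-modal decision of claim 25, assuming each channel has already been encoded into a fixed-size feature vector. The layer sizes are arbitrary assumptions.

import torch
import torch.nn as nn

class FusionDecisionModel(nn.Module):
    """Concatenates per-modality features and outputs an interaction probability."""

    def __init__(self, image_dim: int = 128, sound_dim: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(image_dim + sound_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),   # probability that the user intends to interact
        )

    def forward(self, image_feat: torch.Tensor, sound_feat: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([image_feat, sound_feat], dim=-1))

# usage sketch: score = FusionDecisionModel()(torch.randn(1, 128), torch.randn(1, 64))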
26. The method of claim 25, wherein one channel of the multi-modal information comprises captured sound information.
27. The method of claim 25, wherein one channel of the multi-modal information comprises user status information obtained via a sensor.
28. A voice interaction device, comprising:
a camera for acquiring image information;
a microphone for acquiring sound information; and
a processor configured to:
input the image information acquired by the camera into an interaction decision model; and
acquire, via the microphone, sound information for voice interaction based on an output of the interaction decision model.
29. The device of claim 28, wherein the processor is further configured to:
start the microphone and input the sound information acquired by the microphone into the interaction decision model.
30. The apparatus of claim 28, further comprising:
a networking unit configured to send the acquired image information and/or sound information, and to receive a processing result for the image information and/or sound information.
31. The apparatus of claim 28, further comprising:
a storage unit configured to store a model for processing the acquired image information and/or sound information.
32. The apparatus of claim 28, further comprising:
a screen for interacting with a user.
33. The apparatus of claim 28, further comprising:
a voice output unit configured to output voice feedback of the voice interaction.
34. The apparatus of claim 28, further comprising:
a proximity determination unit configured to determine that a person is approaching.
35. The apparatus of claim 34, wherein the proximity determination unit comprises:
the camera, configured to acquire image information for recognizing a human face based on a key-point extraction technique;
a networking unit configured to receive proximity information; and/or
a proximity sensor configured to sense the approach of a person.
36. A voice interaction system, comprising:
the voice interaction device of any of claims 28-35; and
a computing node in communication with the voice interaction device, the computing node storing a model and providing model output to the voice interaction device.
37. The system of claim 36, wherein the computing node comprises:
a local computing node;
an edge computing device; and/or
a server.
38. The system of claim 36, wherein the computing node is further configured to:
acquire the sound information for voice interaction; and
generate and issue feedback of the voice interaction.
39. A voice interaction model training method, comprising:
training an interaction decision model using images of a speaking person as positive labels and images of a non-speaking person as negative labels, so that the interaction decision model determines, based on a recording start threshold and a recording end threshold, the recording start and recording end moments of sound information for voice interaction according to the input image information.
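For illustration only (not part of the claims), a minimal supervised training loop in the spirit of claim 39, with frames of speaking persons labelled 1 and frames of non-speaking persons labelled 0. The tiny CNN backbone, the random stand-in data and all hyper-parameters are assumptions.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# stand-in data: 64x64 RGB frames with binary "is speaking" labels
frames = torch.randn(32, 3, 64, 64)
labels = torch.randint(0, 2, (32, 1)).float()
loader = DataLoader(TensorDataset(frames, labels), batch_size=8, shuffle=True)

model = nn.Sequential(                      # tiny CNN stand-in for the interaction decision model
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 16 * 16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()            # binary labels: speaking (1) vs. not speaking (0)

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()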
40. The method of claim 39, further comprising:
training an intention recognition model using the sound information,
wherein the recording start threshold and the recording end threshold are dynamically adjusted based on an output of the intention recognition model.
41. The method of claim 40, further comprising:
constructing a dynamic threshold model as a reinforcement learning model, wherein the reinforcement learning model takes image information, sound information, and the recognition result of the speaking intention as input, and adjusts, as its action, the value of the recording start threshold and/or the recording end threshold in real time based on whether the sound information acquired for voice interaction is correct.
42. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-27.
43. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-27.
CN202010690864.6A 2020-07-17 2020-07-17 Voice interaction method, device and system Pending CN113948076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010690864.6A CN113948076A (en) 2020-07-17 2020-07-17 Voice interaction method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010690864.6A CN113948076A (en) 2020-07-17 2020-07-17 Voice interaction method, device and system

Publications (1)

Publication Number Publication Date
CN113948076A true CN113948076A (en) 2022-01-18

Family

ID=79326984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010690864.6A Pending CN113948076A (en) 2020-07-17 2020-07-17 Voice interaction method, device and system

Country Status (1)

Country Link
CN (1) CN113948076A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679506A (en) * 2017-10-12 2018-02-09 Tcl通力电子(惠州)有限公司 Awakening method, intelligent artifact and the computer-readable recording medium of intelligent artifact
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN110335603A (en) * 2019-07-12 2019-10-15 四川长虹电器股份有限公司 Multi-modal exchange method applied to tv scene
CN111063354A (en) * 2019-10-30 2020-04-24 云知声智能科技股份有限公司 Man-machine interaction method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117133282A (en) * 2023-03-27 2023-11-28 荣耀终端有限公司 Voice interaction method and electronic equipment
CN117198287A (en) * 2023-08-30 2023-12-08 南京汇智互娱网络科技有限公司 A voice recognition system for human-computer interaction of agent
CN117198287B (en) * 2023-08-30 2024-07-05 南京汇智互娱网络科技有限公司 A voice recognition system for human-computer interaction of agent

Similar Documents

Publication Publication Date Title
US12039995B2 (en) Audio signal processing method and apparatus, electronic device, and storage medium
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
US20220335941A1 (en) Dynamic and/or context-specific hot words to invoke automated assistant
CN110291760B (en) Parser for deriving user intent
CN112889108B (en) Speech classification using audiovisual data
US11810557B2 (en) Dynamic and/or context-specific hot words to invoke automated assistant
CN108337362A (en) Voice interactive method, device, equipment and storage medium
CN114041283A (en) Automated assistant engaged with pre-event and post-event input streams
KR102304701B1 (en) Method and apparatus for providng response to user's voice input
CN112185389A (en) Voice generation method and device, storage medium and electronic equipment
CN109961787A (en) Determine the method and device of acquisition end time
CN113392687A (en) Video title generation method and device, computer equipment and storage medium
CN115312068B (en) Voice control method, equipment and storage medium
CN113948076A (en) Voice interaction method, device and system
CN110442867A (en) Image processing method, device, terminal and computer storage medium
CN115206306A (en) Voice interaction method, device, equipment and system
CN111462732B (en) Speech recognition method and device
WO2019221894A1 (en) Intelligent device user interactions
CN109784537A (en) Predictor method, device and the server and storage medium of ad click rate
WO2023006001A1 (en) Video processing method and electronic device
CN113301352A (en) Automatic chat during video playback
CN111971670B (en) Generating a response in a dialog
CN115841814A (en) Voice interaction method and electronic equipment
CN112233674A (en) Multimode interaction method and system
CN117119266B (en) Video score processing method, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination