CN111694433B - Voice interaction method and device, electronic equipment and storage medium - Google Patents

Voice interaction method and device, electronic equipment and storage medium

Info

Publication number
CN111694433B
CN111694433B
Authority
CN
China
Prior art keywords
user
voice
voice interaction
probability
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010530888.5A
Other languages
Chinese (zh)
Other versions
CN111694433A (en)
Inventor
陈世伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Apollo Zhilian Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apollo Zhilian Beijing Technology Co Ltd
Priority to CN202010530888.5A
Publication of CN111694433A
Application granted
Publication of CN111694433B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Abstract

The application discloses a voice interaction method, a voice interaction device, an electronic device and a storage medium, relating to voice, natural language processing and image processing technologies. The specific implementation scheme is as follows: when a voice signal is detected to contain interaction information, a plurality of voice interaction users who uttered the voice signal are determined according to the sound source position of the voice signal and auxiliary information detected by a sensor; a label is set for the interaction information in the voice signal, the label corresponding to the user who uttered the voice signal; feedback information for the interaction information is generated; and the feedback information is played to the voice interaction user corresponding to the label. This solves the problem that multiple people cannot perform voice interaction at the same time, improves voice interaction efficiency when multiple people are present, and makes the voice interaction more intelligent.

Description

Voice interaction method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of signal processing technologies, and in particular, to a method and apparatus for voice interaction, an electronic device, and a storage medium.
Background
Current vehicle-mounted voice systems on the market can only handle voice interaction with one passenger at a time. When another passenger in the vehicle wants to interact by voice, that passenger must wait for the previous voice interaction to end, or wake the system up again by voice, before a new voice interaction flow can start.
Disclosure of Invention
The application provides a voice interaction method, a voice interaction device, electronic equipment and a storage medium, and relates to the fields of voice technology, natural language processing, image processing and the like.
According to an aspect of the present application, there is provided a method of voice interaction, including the steps of:
under the condition that the voice signal is detected to contain interaction information, determining a plurality of voice interaction users sending out the voice signal according to the sound source position of the voice signal and auxiliary information detected by a sensor;
setting a label for interaction information in the voice signal, wherein the label corresponds to a voice interaction user sending the voice signal;
generating feedback information for the interaction information;
and playing the feedback information to the voice interaction user corresponding to the label.
According to another aspect of the present application, there is provided a device for voice interaction, comprising:
the voice interaction user determining module is used for determining a plurality of voice interaction users sending out voice signals according to the sound source position of the voice signals and the auxiliary information detected by the sensor under the condition that the voice signals contain interaction information;
the tag setting module is used for setting tags for interaction information in the voice signals, and the tags correspond to voice interaction users sending out the voice signals;
the feedback information generation module is used for generating feedback information of the interaction information;
and the feedback information playing module is used for playing the feedback information to the voice interaction user corresponding to the tag.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods provided by any one of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
According to the technology of the application, the problem that multiple persons cannot conduct voice interaction at the same time is solved, the voice interaction efficiency under the condition of multiple persons is improved, and the intelligence of voice interaction is also improved.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1 is a flow chart of a method of voice interaction according to a first embodiment of the present application;
fig. 2 is a flowchart of auxiliary information determination according to a first embodiment of the present application;
FIG. 3 is a flow chart of voice interaction user determination according to a first embodiment of the present application;
fig. 4 is a flowchart of playing feedback information according to a first embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for voice interaction according to a second embodiment of the present application;
FIG. 6 is a schematic diagram of a voice interactive user determination module according to a second embodiment of the present application;
FIG. 7 is a schematic diagram of a voice interactive user determination module according to a second embodiment of the present application;
fig. 8 is a schematic diagram of a feedback information playing module according to a second embodiment of the present application;
fig. 9 is a block diagram of an electronic device for implementing a method of voice interaction of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in fig. 1, the present application provides a method for voice interaction, which includes the following steps:
s101: when the interactive information is contained in the voice signal, a plurality of voice interactive users emitting the voice signal are determined according to the sound source position of the voice signal and the auxiliary information detected by the sensor.
S102: and setting a label for the interaction information in the voice signal, wherein the label corresponds to the voice interaction user sending out the voice signal.
S103: feedback information for the interaction information is generated.
S104: and playing the feedback information to the voice interaction user corresponding to the label.
The method flow can be applied to riding scenes, conference scenes, home scenes and the like. Taking a riding scene as an example, the execution subject of the method may be a vehicle-mounted computer. Suppose the vehicle carries four occupants: the driver on the left of the front row, a first passenger on the right of the front row, a second passenger on the left of the rear row, and a third passenger on the right of the rear row.
The interaction information may be a wake-up word that initiates a voice interaction, or a sentence with an explicit interaction intent. For example, a sentence with explicit intent to interact may include: "lower the air conditioner temperature a little", "open the window", "where the current position is", etc.
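As a rough illustration only (the patent gives example utterances but no concrete wake words or matching rules, so everything below is an assumption), detecting interaction information in a recognized utterance might look like:

```python
import re

# Assumed wake words and intent patterns -- illustrative, not from the patent.
WAKE_WORDS = {"hello assistant"}
INTENT_PATTERNS = [
    re.compile(r"lower the air conditioner"),
    re.compile(r"open the window"),
    re.compile(r"where .* position"),
]

def contains_interaction_info(text: str) -> bool:
    """True if the recognized utterance is a wake-up word or an explicit intent."""
    lowered = text.lower()
    if any(word in lowered for word in WAKE_WORDS):
        return True
    return any(pattern.search(lowered) for pattern in INTENT_PATTERNS)

print(contains_interaction_info("Where is the current position?"))  # True
```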
In case it is detected that the interactive information is contained in the speech signal, a confirmation procedure for the speech interactive user may be initiated. For example, sound source localization may be performed based on sound waves detected by a microphone array provided in a vehicle, and the approximate position of the sound source of the voice signal including the interactive information may be obtained.
The auxiliary information detected by the sensor may be information that the seat detected by the in-vehicle seat sensor is occupied, or the like.
In addition, the auxiliary information detected by the sensor may also include the person who is speaking detected from the (dynamic) image acquired by the image sensor using the image recognition technique.
At least one voice interaction user who uttered a voice signal containing interaction information is then determined by jointly considering the sound source position of the voice signal and the auxiliary information detected by the sensor.
For example, suppose two voice interaction users are determined: the first passenger on the right of the front row and the third passenger on the right of the rear row. A tag can be set for each of them; the tag may be information related to the seat or the in-vehicle position, for instance. Once the tags are set, each piece of a user's interaction information carries the corresponding tag.
For each piece of interaction information, feedback information can be obtained locally or through cloud communication. For example, if the interaction information contained in the voice signal of the first passenger on the right of the front row is "where the current position is", the current position of the vehicle can be determined by the vehicle-mounted GPS or by communicating with a satellite positioning server, and that position is used as the feedback information. If the interaction information in the voice signal of the third passenger on the right of the rear row is "lower the air conditioner temperature a little", a temperature control instruction can be generated directly to adjust the vehicle's air-conditioning temperature.
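A minimal sketch of this tag-plus-feedback flow is given below; the data structure, helper names and the stubbed GPS and air-conditioner interfaces are assumptions for illustration, not the patent's API:

```python
from dataclasses import dataclass

@dataclass
class TaggedInteraction:
    tag: str    # seat / in-vehicle position of the speaker, e.g. "front-right"
    text: str   # recognized interaction information

def query_vehicle_gps() -> str:
    return "39.91 N, 116.40 E"   # stub standing in for GPS / positioning server

def send_ac_command(delta_celsius: float) -> None:
    print(f"AC setpoint change: {delta_celsius:+.1f} C")  # stub for the vehicle bus

def generate_feedback(item: TaggedInteraction) -> str | None:
    """Generate feedback locally or via a (stubbed) cloud / vehicle interface."""
    if "where" in item.text and "position" in item.text:
        return f"Current position: {query_vehicle_gps()}"
    if "air conditioner" in item.text and "lower" in item.text:
        send_ac_command(-1.0)
        return "Air conditioner temperature lowered a little."
    return None

print(generate_feedback(TaggedInteraction("front-right", "where the current position is")))
```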
When the feedback information is information that can be played aloud, it can be broadcast to the voice interaction user corresponding to the tag. For example, if the interaction information in the voice signal of the first passenger on the right of the front row is "where the current position is", the tag identifies that passenger as the one who asked the question, so the speaker closest to that passenger can be selected to play the feedback information.
Through the scheme, simultaneous interaction of a plurality of voice interaction users can be realized, voice interaction efficiency under the condition of multiple people is improved, and intelligence of voice interaction is also improved.
As shown in fig. 2, in one embodiment, the sensor includes an image sensor, and the determination method of the auxiliary information includes:
s201: and identifying the image detected by the image sensor, and confirming each user in the image.
S202: and obtaining the probability of each user sending out the voice signal according to the facial features of each user.
S203: the probability of each user uttering a speech signal is determined as auxiliary information.
An image detected by the image sensor is acquired; the image may be a dynamic image (video). Each user in the image can be identified using face recognition; in a riding scenario, the users are the vehicle occupants. The probability that each user uttered the voice signal is then obtained from that user's facial features, for example the frequency of mouth movements or the facial expression.
In addition, a probability threshold may be set in advance: the probability that a user uttered the voice signal is determined as auxiliary information only when it exceeds the threshold. This saves computation and time in the subsequent steps.
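A minimal sketch of this thresholding step, assuming a threshold of 0.5 and illustrative per-seat probabilities in the spirit of the occupant example:

```python
PROB_THRESHOLD = 0.5   # assumed value; the patent only says a threshold is preset

def filter_speaking_probabilities(face_probs: dict[str, float],
                                  threshold: float = PROB_THRESHOLD) -> dict[str, float]:
    """Keep a user's speaking probability as auxiliary information only if it
    clears the preset threshold, sparing the later fusion step some work."""
    return {seat: p for seat, p in face_probs.items() if p > threshold}

aux_info = filter_speaking_probabilities(
    {"front-right": 0.85, "rear-right": 0.88, "rear-middle": 0.10, "driver": 0.02}
)
print(aux_info)   # -> {'front-right': 0.85, 'rear-right': 0.88}
```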
By the aid of the scheme, the voice interaction user is identified in an auxiliary mode by using the image detected by the image sensor, and accuracy of voice interaction user identification can be improved.
In one embodiment, the sensor includes a seat sensor, and the determining of the auxiliary information further includes:
the information that the seat sensor detects that the seat is occupied is determined as the assist information.
The seat sensor is a film-type contact sensor whose contacts are evenly distributed over the load-bearing surface of the vehicle seat. When the seat bears a sufficiently large external weight, a trigger signal is generated; this trigger signal serves as the information that the seat is occupied, i.e., it is determined as the auxiliary information.
By the aid of the scheme, the voice interaction user is identified in an auxiliary mode by utilizing the information that the seat detected by the seat sensor is occupied, and accuracy of voice interaction user identification can be improved.
As shown in fig. 3, in one embodiment, step S101 includes:
s1011: and determining a first probability that the user at each position is a voice interaction user according to the sound source position of the voice signal.
S1012: and determining a second probability that the user at each position is a voice interaction user according to the auxiliary information.
S1013: a weighted sum of the first probability and the second probability for each location is calculated using the pre-assigned weights.
S1014: and determining that the user at the corresponding position is a voice interaction user emitting a voice signal under the condition that the weighted sum is larger than a preset threshold value.
Sound source localization is performed on the sound waves detected by the microphone array, yielding the approximate position of the source of the voice signal that contains the interaction information. For example, suppose the microphone array is installed at the vehicle's center console, and the first passenger on the right of the front row and the third passenger on the right of the rear row speak simultaneously. Used as a directional array, the microphone array first divides the localization area into grids; the relative sound pressure of each grid is obtained from the time delays of the received sound, and a hologram for sound source localization is determined from these relative sound pressures. The hologram is then fed to a pre-trained localization probability model, which outputs the probabilities of a sound source being at different positions. The localization probability model can be trained on hologram samples paired with sound source localization labels, so that the trained model can derive the per-position sound source probabilities from a hologram. This gives the first probability that the user at each position is a voice interaction user.
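A heavily simplified sketch of the grid-based step is shown below; the patent does not give the array geometry, sample rate, grid, or model internals, so the classic delay-and-sum formulation and all constants here are assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
FS = 16_000              # sample rate in Hz (assumed)

def relative_pressure_map(signals: np.ndarray,
                          mic_positions: np.ndarray,
                          grid_points: np.ndarray) -> np.ndarray:
    """Delay-and-sum power for each grid point.

    signals:       (num_mics, num_samples) time-domain recordings
    mic_positions: (num_mics, 3) microphone coordinates in metres
    grid_points:   (num_grids, 3) candidate source positions in metres
    """
    power = np.zeros(len(grid_points))
    for g, point in enumerate(grid_points):
        aligned = np.zeros(signals.shape[1])
        for m, mic in enumerate(mic_positions):
            # samples of propagation delay from this grid point to this mic
            delay = int(round(np.linalg.norm(point - mic) / SPEED_OF_SOUND * FS))
            aligned += np.roll(signals[m], -delay)   # undo the delay, then sum
        power[g] = np.mean(aligned ** 2)             # coherent power at this point
    return power / power.max()                       # relative sound-pressure map

# The resulting map (the "hologram") would then be fed to the pre-trained
# localization probability model to obtain per-seat source probabilities.
```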
Continuing the occupant example above, the probability that a sound source is on the right of the front row might be obtained as 95%, on the right of the rear row as 90%, and in the middle of the rear row as 40%; the probability for the other seats is obtained as 0.
In addition, a second probability that the user at a given position is a voice interaction user may be determined from the image detected by the image sensor. In the occupant example, the dynamic image might indicate an 85% probability that a user speaking is on the right of the front row and an 88% probability that a user speaking is on the right of the rear row; these become the second probabilities that the users in those positions are voice interaction users. Since there is no occupant in the middle of the rear row, the probability derived from the dynamic image that a voice interaction user is located there is 0; and since the occupants of the other seats are not speaking, their second probabilities from the dynamic image are negligible.
Further, the seat occupancy information detected by the seat sensor can also serve as a second probability that the user at a given position is a voice interaction user. In the occupant example, if the front-right and rear-right seats are detected as occupied, the second probability that a voice interaction user is located in each of those positions is 100%; and since there is no occupant in the middle of the rear row, the probability that a voice interaction user is located there is 0.
Weights are assigned to the first probability and the second probability in advance. For example, the first probability may have a greater weight than the second probability. For another example, in the second probability, the weight of the second probability obtained from the moving image is larger than the weight of the second probability obtained from the seat sensor.
The weighted sum E of the first and second probabilities for the first position is expressed as E = q1*W1 + q2*W2 + q3*W3, where W1 denotes the first probability that the voice interaction user is located at the first position, determined from the sound source position of the voice; W2 denotes the second probability determined from the auxiliary information detected in the dynamic image; W3 denotes the second probability determined from the auxiliary information detected by the seat sensor; and q1, q2 and q3 denote the weights of W1, W2 and W3, respectively.
If the weighted sum E is greater than the predetermined threshold, the user at that position is determined to be a voice interaction user who uttered the voice signal. For example, in the case above, when the calculated weighted sums for the front right and rear right positions exceed the predetermined threshold, the users at those positions can be determined to be the voice interaction users who spoke.
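A minimal sketch of this weighted fusion, with assumed weight values and threshold (the patent only states that they are pre-assigned):

```python
def is_voice_interaction_user(w1: float, w2: float, w3: float,
                              q=(0.5, 0.3, 0.2),       # pre-assigned weights (assumed)
                              threshold: float = 0.7   # predetermined threshold (assumed)
                              ) -> bool:
    """E = q1*W1 + q2*W2 + q3*W3, compared against the preset threshold."""
    e = q[0] * w1 + q[1] * w2 + q[2] * w3
    return e > threshold

# Front-right occupant from the running example: 0.5*0.95 + 0.3*0.85 + 0.2*1.0 = 0.93
print(is_voice_interaction_user(0.95, 0.85, 1.00))   # True
# Rear-middle seat: sound 40%, image 0, seat 0 -> 0.5*0.40 = 0.20
print(is_voice_interaction_user(0.40, 0.0, 0.0))     # False
```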
With this scheme, the voice interaction users who are speaking can be determined by combining the sound source position of the voice with the auxiliary information detected by the sensors, which improves the accuracy of the result.
As shown in fig. 4, in one embodiment, step S104 includes:
s1041: the location of each voice interaction user is determined.
S1042: and determining the speaker closest to each voice interaction user according to the position distribution of each speaker.
S1043: and respectively sending the feedback information to a speaker closest to each voice interaction user for playing according to the labels.
Once the voice interaction users are determined, each user's position can be determined from the sound source position and the auxiliary information, for example the front right and rear right positions in the previous occupant example.
According to a pre-acquired position distribution of the speakers, the speaker nearest to each voice interaction user can be determined. The speaker position distribution may come from position information entered by a user, from downloaded vehicle configuration information, or the like.
According to the labels, each voice interaction user and the position thereof can be confirmed. Therefore, the feedback information can be sent to the speaker closest to each voice interaction user for playing so as to realize voice interaction.
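A minimal sketch of this nearest-speaker routing, with an assumed cabin speaker layout and a print call standing in for the actual audio output interface:

```python
import math

SPEAKERS = {   # speaker id -> (x, y) cabin coordinates in metres (assumed layout)
    "front-left": (-0.6, 1.2), "front-right": (0.6, 1.2),
    "rear-left": (-0.6, -0.4), "rear-right": (0.6, -0.4),
}

def nearest_speaker(user_pos: tuple[float, float]) -> str:
    """Pick the loudspeaker with the smallest Euclidean distance to the user."""
    return min(SPEAKERS, key=lambda spk: math.dist(SPEAKERS[spk], user_pos))

def play_feedback(tag: str, user_pos: tuple[float, float], feedback: str) -> None:
    spk = nearest_speaker(user_pos)
    print(f"[{tag}] -> {spk}: {feedback}")   # stand-in for the audio output call

play_feedback("front-right", (0.5, 1.1), "Current position: 39.91 N, 116.40 E")
```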
Through the scheme, a plurality of voice interaction users can be supported to interact simultaneously, and interference among the voice interaction users is reduced.
As shown in fig. 5, the present application provides a device for voice interaction, including:
the voice interaction user determining module 501 is configured to determine a plurality of voice interaction users sending out voice signals according to the sound source position of the voice signals and the auxiliary information detected by the sensor when detecting that the voice signals contain interaction information.
The tag setting module 502 is configured to set a tag for the interaction information in the voice signal, where the tag corresponds to a voice interaction user who sends out the voice signal.
The feedback information generating module 503 is configured to generate feedback information for the interaction information.
And the feedback information playing module 504 is configured to play the feedback information to the voice interaction user corresponding to the tag.
In one embodiment, the sensor comprises an image sensor;
as shown in fig. 6, the voice interaction user determination module 501 includes:
the user identification submodule 5011 is used for identifying the image detected by the image sensor and confirming each user in the image.
The auxiliary information confirming submodule 5012 is used for obtaining the probability of each user uttering the voice signal according to the facial features of each user, and for determining the probability of each user uttering the voice signal as the auxiliary information.
In one embodiment, the sensor comprises a seat sensor;
the auxiliary information confirmation sub-module 5012 is also for: the information that the seat sensor detects that the seat is occupied is determined as the assist information.
As shown in fig. 7, in one embodiment, the voice interaction user determination module 501 further includes:
a first probability determination submodule 5013 is configured to determine a first probability that the user located at each position is a voice interaction user according to the sound source position of the voice signal.
A second probability determination submodule 5014 is configured to determine, according to the auxiliary information, a second probability that the user located at each position is a voice interaction user.
A weighted sum computation submodule 5015 for computing a weighted sum of the first probability and the second probability for each position using the weight assigned in advance.
The voice interaction user determination execution submodule 5016 is used for determining that the user located at the corresponding position is the voice interaction user sending out the voice signal if the weighted sum is larger than the preset threshold value.
As shown in fig. 8, in one embodiment, the feedback information playing module 504 includes:
a location determination sub-module 5041 for determining the location of each voice interaction user.
Speaker determination submodule 5042 is configured to determine a speaker closest to each voice interaction user based on a position distribution of each speaker.
The feedback information playing execution submodule 5043 is used for respectively sending the feedback information to the speaker closest to each voice interaction user for playing according to the labels.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 9, a block diagram of an electronic device is provided for a method of voice interaction according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 9, the electronic device includes: one or more processors 910, a memory 920, and interfaces for connecting components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 910 is illustrated in fig. 9.
Memory 920 is a non-transitory computer-readable storage medium provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of voice interaction provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of voice interaction provided herein.
The memory 920 is used as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method of voice interaction in the embodiments of the present application (e.g., the voice interaction user determination module 501, the tag setting module 502, the feedback information generation module 503, and the feedback information playing module 504 shown in fig. 5). The processor 910 executes various functional applications of the server and data processing, i.e., a method of implementing voice interaction in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 920.
Memory 920 may include a storage program area that may store an operating system, at least one application required for functionality, and a storage data area; the storage data area may store data created according to the use of the electronic device of the method of voice interaction, etc. In addition, memory 920 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 920 may optionally include memory located remotely from processor 910 that may be connected to the electronic device of the method of voice interaction through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of voice interaction may further include: an input device 930, and an output device 940. The processor 910, memory 920, input device 930, and output device 940 may be connected by a bus or other means, for example in fig. 9.
The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for the voice interaction method, such as a touch screen, a keypad, a mouse, a trackpad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, a joystick, and the like. The output device 940 may include a display apparatus, an auxiliary lighting device (e.g., LED), a haptic feedback device (e.g., vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A method of voice interaction, comprising:
under the condition that interaction information is contained in the voice signal, determining a plurality of voice interaction users sending out the voice signal according to the sound source position of the voice signal and auxiliary information detected by a sensor;
setting a label for interaction information in the voice signal, wherein the label corresponds to a voice interaction user sending the voice signal;
generating feedback information for the interaction information;
playing the feedback information to a voice interaction user corresponding to the tag;
determining a first probability that a user at each position is the voice interaction user according to the sound source position of the voice signal;
determining a second probability that the user at each position is the voice interaction user according to the auxiliary information;
calculating a weighted sum of the first probability and the second probability of each position by using a pre-allocated weight;
under the condition that the weighted sum is larger than a preset threshold value, determining that the user positioned at the corresponding position is a voice interaction user sending out the voice signal;
the determining, according to the sound source position of the voice signal, a first probability that the user located at each position is the voice interaction user, including:
and carrying out sound source localization by using a microphone array, wherein the microphone array is used as a direction array, the direction array firstly carries out grid division on a localization area, the relative sound pressure of each grid is obtained through the time delay of a received sound source, a hologram for sound source localization is determined based on the relative sound pressure, then the hologram is sent to a pre-trained localization probability model, and finally, the first probability that the user at each position is a voice interaction user is determined.
2. The method of claim 1, wherein the sensor comprises an image sensor;
the determination mode of the auxiliary information comprises the following steps:
identifying the image detected by the image sensor, and confirming each user in the image;
obtaining the probability of each user sending out the voice signal according to the facial features of each user;
and determining the probability of each user sending out the voice signal as the auxiliary information.
3. The method of claim 2, wherein the sensor comprises a seat sensor;
the determination mode of the auxiliary information further comprises the following steps:
and determining the information of the occupied seat detected by the seat sensor as the auxiliary information.
4. The method of claim 1, wherein playing the feedback information to a voice interaction user corresponding to the tag comprises:
determining the position of each voice interaction user;
determining the speaker closest to each voice interaction user according to the position distribution of each speaker;
and respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing according to the labels.
5. An apparatus for voice interaction, comprising:
the voice interaction user determining module is used for determining a plurality of voice interaction users sending out the voice signals according to the sound source positions of the voice signals and the auxiliary information detected by the sensors under the condition that the voice signals contain interaction information;
the tag setting module is used for setting a tag for interaction information in the voice signal, and the tag corresponds to a voice interaction user sending the voice signal;
the feedback information generation module is used for generating feedback information of the interaction information;
the feedback information playing module is used for playing the feedback information to the voice interaction user corresponding to the tag;
the first probability determination submodule is used for determining the first probability that the user at each position is the voice interaction user according to the sound source position of the voice signal;
a second probability determining sub-module, configured to determine, according to the auxiliary information, a second probability that the user located at each position is the voice interaction user;
a weighted sum calculation sub-module for calculating a weighted sum of the first probability and the second probability for each location using a pre-assigned weight;
the voice interaction user determining and executing sub-module is used for determining that the user positioned at the corresponding position is the voice interaction user sending the voice signal under the condition that the weighted sum is larger than a preset threshold value;
the first probability determination submodule includes:
and carrying out sound source localization by using a microphone array, wherein the microphone array is used as a direction array, the direction array firstly carries out grid division on a localization area, the relative sound pressure of each grid is obtained through the time delay of a received sound source, a hologram for sound source localization is determined based on the relative sound pressure, then the hologram is sent to a pre-trained localization probability model, and finally, the first probability that the user at each position is a voice interaction user is determined.
6. The apparatus of claim 5, wherein the sensor comprises an image sensor;
the voice interaction user determination module comprises:
the user identification sub-module is used for identifying the image detected by the image sensor and confirming each user in the image;
the auxiliary information confirming sub-module is used for obtaining the probability of each user sending out the voice signal according to the facial features of each user;
and determining the probability of each user sending out the voice signal as the auxiliary information.
7. The apparatus of claim 6, wherein the sensor comprises a seat sensor;
the auxiliary information confirmation sub-module is further configured to: and determining the information of the occupied seat detected by the seat sensor as the auxiliary information.
8. The apparatus of claim 5, wherein the feedback information playing module comprises:
the position determining sub-module is used for determining the position of each voice interaction user;
the speaker determining submodule is used for determining speakers closest to each voice interaction user according to the position distribution of each speaker;
and the feedback information playing execution sub-module is used for respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing according to the labels.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 4.
10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 4.
CN202010530888.5A 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium Active CN111694433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010530888.5A CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010530888.5A CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111694433A CN111694433A (en) 2020-09-22
CN111694433B true CN111694433B (en) 2023-06-20

Family

ID=72480499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010530888.5A Active CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111694433B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581981B (en) * 2020-11-04 2023-11-03 北京百度网讯科技有限公司 Man-machine interaction method, device, computer equipment and storage medium
CN112562664A (en) * 2020-11-27 2021-03-26 上海仙塔智能科技有限公司 Sound adjusting method, system, vehicle and computer storage medium
CN114664295A (en) * 2020-12-07 2022-06-24 北京小米移动软件有限公司 Robot and voice recognition method and device for same
CN113362823A (en) * 2021-06-08 2021-09-07 深圳市同行者科技有限公司 Multi-terminal response method, device, equipment and storage medium of household appliance
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114564265B (en) * 2021-12-22 2023-07-25 上海小度技术有限公司 Interaction method and device of intelligent equipment with screen and electronic equipment
CN113971954B (en) * 2021-12-23 2022-07-12 广州小鹏汽车科技有限公司 Voice interaction method and device, vehicle and storage medium
CN116978372A (en) * 2022-04-22 2023-10-31 华为技术有限公司 Voice interaction method, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7126583B1 (en) * 1999-12-15 2006-10-24 Automotive Technologies International, Inc. Interactive vehicle display system
US8560236B1 (en) * 2008-06-20 2013-10-15 Google Inc. Showing uncertainty of location
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN109147787A (en) * 2018-09-30 2019-01-04 深圳北极鸥半导体有限公司 A kind of smart television acoustic control identifying system and its recognition methods
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10540957B2 (en) * 2014-12-15 2020-01-21 Baidu Usa Llc Systems and methods for speech transcription
US10134386B2 (en) * 2015-07-21 2018-11-20 Rovi Guides, Inc. Systems and methods for identifying content corresponding to a language spoken in a household
CN105159111B (en) * 2015-08-24 2019-01-25 百度在线网络技术(北京)有限公司 Intelligent interaction device control method and system based on artificial intelligence
CN109493871A (en) * 2017-09-11 2019-03-19 上海博泰悦臻网络技术服务有限公司 The multi-screen voice interactive method and device of onboard system, storage medium and vehicle device
CN108399916A (en) * 2018-01-08 2018-08-14 蔚来汽车有限公司 Vehicle intelligent voice interactive system and method, processing unit and storage device
CN108877795B (en) * 2018-06-08 2020-03-10 百度在线网络技术(北京)有限公司 Method and apparatus for presenting information
CN109490834A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of sound localization method, sound source locating device and vehicle
CN109493876A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of microphone array control method, device and vehicle
CN109545219A (en) * 2019-01-09 2019-03-29 北京新能源汽车股份有限公司 Vehicle-mounted voice exchange method, system, equipment and computer readable storage medium
CN110070868B (en) * 2019-04-28 2021-10-08 广州小鹏汽车科技有限公司 Voice interaction method and device for vehicle-mounted system, automobile and machine readable medium
CN110082723B (en) * 2019-05-16 2022-03-15 浙江大华技术股份有限公司 Sound source positioning method, device, equipment and storage medium
CN110047487B (en) * 2019-06-05 2022-03-18 广州小鹏汽车科技有限公司 Wake-up method and device for vehicle-mounted voice equipment, vehicle and machine-readable medium
CN110211585A (en) * 2019-06-05 2019-09-06 广州小鹏汽车科技有限公司 In-car entertainment interactive approach, device, vehicle and machine readable media
CN110364153A (en) * 2019-07-30 2019-10-22 恒大智慧科技有限公司 A kind of distributed sound control method, system, computer equipment and storage medium
CN112965033A (en) * 2021-02-03 2021-06-15 深圳市轻生活科技有限公司 Sound source positioning system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7126583B1 (en) * 1999-12-15 2006-10-24 Automotive Technologies International, Inc. Interactive vehicle display system
US8560236B1 (en) * 2008-06-20 2013-10-15 Google Inc. Showing uncertainty of location
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN109147787A (en) * 2018-09-30 2019-01-04 深圳北极鸥半导体有限公司 A kind of smart television acoustic control identifying system and its recognition methods
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning

Also Published As

Publication number Publication date
CN111694433A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111694433B (en) Voice interaction method and device, electronic equipment and storage medium
CN111179961B (en) Audio signal processing method and device, electronic equipment and storage medium
US20220172737A1 (en) Speech signal processing method and speech separation method
CN105793921A (en) Initiating actions based on partial hotwords
KR20230018534A (en) Speaker diarization using speaker embedding(s) and trained generative model
CN112970059B (en) Electronic device for processing user utterance and control method thereof
EP3923272B1 (en) Method and apparatus for adapting a wake-up model
CN110263131B (en) Reply information generation method, device and storage medium
EP3201770A1 (en) Methods and apparatus for module arbitration
US20220254369A1 (en) Electronic device supporting improved voice activity detection
EP4310838A1 (en) Speech wakeup method and apparatus, and storage medium and system
JP2022095768A (en) Method, device, apparatus, and medium for dialogues for intelligent cabin
US20230274740A1 (en) Arbitrating between multiple potentially-responsive electronic devices
CN111383661B (en) Sound zone judgment method, device, equipment and medium based on vehicle-mounted multi-sound zone
CN113823313A (en) Voice processing method, device, equipment and storage medium
CN112466327B (en) Voice processing method and device and electronic equipment
CN111276127B (en) Voice awakening method and device, storage medium and electronic equipment
CN113539265B (en) Control method, device, equipment and storage medium
CN116888665A (en) Electronic apparatus and control method thereof
US20240075944A1 (en) Localized voice recognition assistant
CN111640429B (en) Method for providing voice recognition service and electronic device for the same
EP4350693A2 (en) Voice processing method and apparatus, computer device, and storage medium
KR20190074344A (en) Dialogue processing apparatus and dialogue processing method
CN111640429A (en) Method of providing voice recognition service and electronic device for the same
KR20220169242A (en) Electronic devcie and method for personalized audio processing of the electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211013

Address after: 100176 Room 101, 1st floor, building 1, yard 7, Ruihe West 2nd Road, economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Apollo Zhilian (Beijing) Technology Co.,Ltd.

Address before: 2 / F, baidu building, 10 Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

GR01 Patent grant