CN111694433B - Voice interaction method and device, electronic equipment and storage medium - Google Patents

Voice interaction method and device, electronic equipment and storage medium

Info

Publication number
CN111694433B
CN111694433B
Authority
CN
China
Prior art keywords
user
voice
voice interaction
probability
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010530888.5A
Other languages
Chinese (zh)
Other versions
CN111694433A (en)
Inventor
陈世伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Apollo Zhilian Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apollo Zhilian Beijing Technology Co Ltd
Priority to CN202010530888.5A
Publication of CN111694433A
Application granted
Publication of CN111694433B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Abstract

The application discloses a voice interaction method, a voice interaction device, an electronic device and a storage medium, relating to voice, natural language processing and image processing technologies. The specific implementation scheme is as follows: when a voice signal is detected to contain interaction information, a plurality of voice interaction users who uttered the voice signal are determined according to the sound source position of the voice signal and auxiliary information detected by a sensor; a label is set for the interaction information in the voice signal, the label corresponding to the user who uttered the voice signal; feedback information for the interaction information is generated; and the feedback information is played to the voice interaction user corresponding to the label. This solves the problem that multiple people cannot perform voice interaction at the same time, improves voice interaction efficiency when multiple people are present, and makes the voice interaction more intelligent.

Description

Voice interaction method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of signal processing technologies, and in particular, to a method and apparatus for voice interaction, an electronic device, and a storage medium.
Background
Current vehicle-mounted voice systems on the market can only handle voice interaction with one passenger at a time. When another passenger in the vehicle wants to interact by voice, that passenger must wait for the previous voice interaction to end, or wake the system up again by voice, before a new voice interaction flow can start.
Disclosure of Invention
The application provides a voice interaction method, a voice interaction device, electronic equipment and a storage medium, and relates to the fields of voice technology, natural language processing, image processing and the like.
According to an aspect of the present application, there is provided a method of voice interaction, including the steps of:
under the condition that the voice signal is detected to contain interaction information, determining a plurality of voice interaction users sending out the voice signal according to the sound source position of the voice signal and auxiliary information detected by a sensor;
setting a label for interaction information in the voice signal, wherein the label corresponds to a voice interaction user sending the voice signal;
generating feedback information for the interaction information;
and playing the feedback information to the voice interaction user corresponding to the label.
According to another aspect of the present application, there is provided a device for voice interaction, comprising:
the voice interaction user determining module is used for determining a plurality of voice interaction users sending out voice signals according to the sound source position of the voice signals and the auxiliary information detected by the sensor under the condition that the voice signals contain interaction information;
the tag setting module is used for setting tags for interaction information in the voice signals, and the tags correspond to voice interaction users sending out the voice signals;
the feedback information generation module is used for generating feedback information of the interaction information;
and the feedback information playing module is used for playing the feedback information to the voice interaction user corresponding to the tag.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods provided by any one of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
According to the technology of the application, the problem that multiple persons cannot conduct voice interaction at the same time is solved, the voice interaction efficiency under the condition of multiple persons is improved, and the intelligence of voice interaction is also improved.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1 is a flow chart of a method of voice interaction according to a first embodiment of the present application;
fig. 2 is a flowchart of auxiliary information determination according to a first embodiment of the present application;
FIG. 3 is a flow chart of voice interaction user determination according to a first embodiment of the present application;
fig. 4 is a flowchart of playing feedback information according to a first embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for voice interaction according to a second embodiment of the present application;
FIG. 6 is a schematic diagram of a voice interactive user determination module according to a second embodiment of the present application;
FIG. 7 is a schematic diagram of a voice interactive user determination module according to a second embodiment of the present application;
fig. 8 is a schematic diagram of a feedback information playing module according to a second embodiment of the present application;
fig. 9 is a block diagram of an electronic device for implementing a method of voice interaction of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in fig. 1, the present application provides a method for voice interaction, which includes the following steps:
s101: when the interactive information is contained in the voice signal, a plurality of voice interactive users emitting the voice signal are determined according to the sound source position of the voice signal and the auxiliary information detected by the sensor.
S102: and setting a label for the interaction information in the voice signal, wherein the label corresponds to the voice interaction user sending out the voice signal.
S103: feedback information for the interaction information is generated.
S104: and playing the feedback information to the voice interaction user corresponding to the label.
The method flow can be applied to riding scenes, conference scenes, home scenes and the like. Taking a riding scene as an example, the execution subject of the method may be a vehicle-mounted computer. Suppose the vehicle carries four occupants: the driver on the left of the front row, a first passenger on the right of the front row, a second passenger on the left of the rear row, and a third passenger on the right of the rear row.
The interaction information may be a wake-up word that initiates a voice interaction, or a sentence with an explicit interaction intent. For example, a sentence with explicit intent to interact may include: "lower the air conditioner temperature a little", "open the window", "where the current position is", etc.
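As a rough illustration only (the patent gives example utterances but no concrete wake words or matching rules, so everything below is an assumption), detecting interaction information in a recognized utterance might look like:

```python
import re

# Assumed wake words and intent patterns -- illustrative, not from the patent.
WAKE_WORDS = {"hello assistant"}
INTENT_PATTERNS = [
    re.compile(r"lower the air conditioner"),
    re.compile(r"open the window"),
    re.compile(r"where .* position"),
]

def contains_interaction_info(text: str) -> bool:
    """True if the recognized utterance is a wake-up word or an explicit intent."""
    lowered = text.lower()
    if any(word in lowered for word in WAKE_WORDS):
        return True
    return any(pattern.search(lowered) for pattern in INTENT_PATTERNS)

print(contains_interaction_info("Where is the current position?"))  # True
```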
In case it is detected that the interactive information is contained in the speech signal, a confirmation procedure for the speech interactive user may be initiated. For example, sound source localization may be performed based on sound waves detected by a microphone array provided in a vehicle, and the approximate position of the sound source of the voice signal including the interactive information may be obtained.
The auxiliary information detected by the sensor may be information that the seat detected by the in-vehicle seat sensor is occupied, or the like.
In addition, the auxiliary information detected by the sensor may also include the person who is speaking detected from the (dynamic) image acquired by the image sensor using the image recognition technique.
At least one voice interaction user who uttered a voice signal containing interaction information is then determined by jointly considering the sound source position of the voice signal and the auxiliary information detected by the sensor.
For example, suppose two voice interaction users are determined: the first passenger on the right of the front row and the third passenger on the right of the rear row. A tag can be set for each of them; the tag may be information related to the seat or the in-vehicle position, for instance. Once the tags are set, each piece of a user's interaction information carries the corresponding tag.
For each piece of interaction information, feedback information can be obtained locally or through cloud communication. For example, if the interaction information contained in the voice signal of the first passenger on the right of the front row is "where the current position is", the current position of the vehicle can be determined by the vehicle-mounted GPS or by communicating with a satellite positioning server, and that position is used as the feedback information. If the interaction information in the voice signal of the third passenger on the right of the rear row is "lower the air conditioner temperature a little", a temperature control instruction can be generated directly to adjust the vehicle's air-conditioning temperature.
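A minimal sketch of this tag-plus-feedback flow is given below; the data structure, helper names and the stubbed GPS and air-conditioner interfaces are assumptions for illustration, not the patent's API:

```python
from dataclasses import dataclass

@dataclass
class TaggedInteraction:
    tag: str    # seat / in-vehicle position of the speaker, e.g. "front-right"
    text: str   # recognized interaction information

def query_vehicle_gps() -> str:
    return "39.91 N, 116.40 E"   # stub standing in for GPS / positioning server

def send_ac_command(delta_celsius: float) -> None:
    print(f"AC setpoint change: {delta_celsius:+.1f} C")  # stub for the vehicle bus

def generate_feedback(item: TaggedInteraction) -> str | None:
    """Generate feedback locally or via a (stubbed) cloud / vehicle interface."""
    if "where" in item.text and "position" in item.text:
        return f"Current position: {query_vehicle_gps()}"
    if "air conditioner" in item.text and "lower" in item.text:
        send_ac_command(-1.0)
        return "Air conditioner temperature lowered a little."
    return None

print(generate_feedback(TaggedInteraction("front-right", "where the current position is")))
```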
When the feedback information is information that can be played aloud, it can be broadcast to the voice interaction user corresponding to the tag. For example, if the interaction information in the voice signal of the first passenger on the right of the front row is "where the current position is", the tag identifies that passenger as the one who asked the question, so the speaker closest to that passenger can be selected to play the feedback information.
Through the scheme, simultaneous interaction of a plurality of voice interaction users can be realized, voice interaction efficiency under the condition of multiple people is improved, and intelligence of voice interaction is also improved.
As shown in fig. 2, in one embodiment, the sensor includes an image sensor, and the determination method of the auxiliary information includes:
s201: and identifying the image detected by the image sensor, and confirming each user in the image.
S202: and obtaining the probability of each user sending out the voice signal according to the facial features of each user.
S203: the probability of each user uttering a speech signal is determined as auxiliary information.
An image detected by the image sensor is acquired; the image may be a dynamic image (video). Each user in the image can be identified using face recognition; in a riding scenario, the users are the vehicle occupants. The probability that each user uttered the voice signal is then obtained from that user's facial features, for example the frequency of mouth movements or the facial expression.
In addition, a probability threshold may be set in advance: the probability that a user uttered the voice signal is determined as auxiliary information only when it exceeds the threshold. This saves computation and time in the subsequent steps.
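A minimal sketch of this thresholding step, assuming a threshold of 0.5 and illustrative per-seat probabilities in the spirit of the occupant example:

```python
PROB_THRESHOLD = 0.5   # assumed value; the patent only says a threshold is preset

def filter_speaking_probabilities(face_probs: dict[str, float],
                                  threshold: float = PROB_THRESHOLD) -> dict[str, float]:
    """Keep a user's speaking probability as auxiliary information only if it
    clears the preset threshold, sparing the later fusion step some work."""
    return {seat: p for seat, p in face_probs.items() if p > threshold}

aux_info = filter_speaking_probabilities(
    {"front-right": 0.85, "rear-right": 0.88, "rear-middle": 0.10, "driver": 0.02}
)
print(aux_info)   # -> {'front-right': 0.85, 'rear-right': 0.88}
```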
By the aid of the scheme, the voice interaction user is identified in an auxiliary mode by using the image detected by the image sensor, and accuracy of voice interaction user identification can be improved.
In one embodiment, the sensor includes a seat sensor, and the determining of the auxiliary information further includes:
the information that the seat sensor detects that the seat is occupied is determined as the assist information.
The seat sensor is a film-type contact sensor whose contacts are evenly distributed over the load-bearing surface of the vehicle seat. When the seat bears a sufficiently large external weight, a trigger signal is generated; this trigger signal serves as the information that the seat is occupied, i.e., it is determined as the auxiliary information.
By the aid of the scheme, the voice interaction user is identified in an auxiliary mode by utilizing the information that the seat detected by the seat sensor is occupied, and accuracy of voice interaction user identification can be improved.
As shown in fig. 3, in one embodiment, step S101 includes:
s1011: and determining a first probability that the user at each position is a voice interaction user according to the sound source position of the voice signal.
S1012: and determining a second probability that the user at each position is a voice interaction user according to the auxiliary information.
S1013: a weighted sum of the first probability and the second probability for each location is calculated using the pre-assigned weights.
S1014: and determining that the user at the corresponding position is a voice interaction user emitting a voice signal under the condition that the weighted sum is larger than a preset threshold value.
Sound source localization is performed on the sound waves detected by the microphone array, yielding the approximate position of the source of the voice signal that contains the interaction information. For example, suppose the microphone array is installed at the vehicle's center console, and the first passenger on the right of the front row and the third passenger on the right of the rear row speak simultaneously. Used as a directional array, the microphone array first divides the localization area into grids; the relative sound pressure of each grid is obtained from the time delays of the received sound, and a hologram for sound source localization is determined from these relative sound pressures. The hologram is then fed to a pre-trained localization probability model, which outputs the probabilities of a sound source being at different positions. The localization probability model can be trained on hologram samples paired with sound source localization labels, so that the trained model can derive the per-position sound source probabilities from a hologram. This gives the first probability that the user at each position is a voice interaction user.
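A heavily simplified sketch of the grid-based step is shown below; the patent does not give the array geometry, sample rate, grid, or model internals, so the classic delay-and-sum formulation and all constants here are assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
FS = 16_000              # sample rate in Hz (assumed)

def relative_pressure_map(signals: np.ndarray,
                          mic_positions: np.ndarray,
                          grid_points: np.ndarray) -> np.ndarray:
    """Delay-and-sum power for each grid point.

    signals:       (num_mics, num_samples) time-domain recordings
    mic_positions: (num_mics, 3) microphone coordinates in metres
    grid_points:   (num_grids, 3) candidate source positions in metres
    """
    power = np.zeros(len(grid_points))
    for g, point in enumerate(grid_points):
        aligned = np.zeros(signals.shape[1])
        for m, mic in enumerate(mic_positions):
            # samples of propagation delay from this grid point to this mic
            delay = int(round(np.linalg.norm(point - mic) / SPEED_OF_SOUND * FS))
            aligned += np.roll(signals[m], -delay)   # undo the delay, then sum
        power[g] = np.mean(aligned ** 2)             # coherent power at this point
    return power / power.max()                       # relative sound-pressure map

# The resulting map (the "hologram") would then be fed to the pre-trained
# localization probability model to obtain per-seat source probabilities.
```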
Continuing the occupant example above, the probability that a sound source is on the right of the front row might be obtained as 95%, on the right of the rear row as 90%, and in the middle of the rear row as 40%; the probability for the other seats is obtained as 0.
In addition, a second probability that the user at a given position is a voice interaction user may be determined from the image detected by the image sensor. In the occupant example, the dynamic image might indicate an 85% probability that a user speaking is on the right of the front row and an 88% probability that a user speaking is on the right of the rear row; these become the second probabilities that the users in those positions are voice interaction users. Since there is no occupant in the middle of the rear row, the probability derived from the dynamic image that a voice interaction user is located there is 0; and since the occupants of the other seats are not speaking, their second probabilities from the dynamic image are negligible.
Further, the seat occupancy information detected by the seat sensor can also serve as a second probability that the user at a given position is a voice interaction user. In the occupant example, if the front-right and rear-right seats are detected as occupied, the second probability that a voice interaction user is located in each of those positions is 100%; and since there is no occupant in the middle of the rear row, the probability that a voice interaction user is located there is 0.
Weights are assigned to the first probability and the second probability in advance. For example, the first probability may have a greater weight than the second probability. For another example, in the second probability, the weight of the second probability obtained from the moving image is larger than the weight of the second probability obtained from the seat sensor.
The weighted sum E of the first and second probabilities for the first position is expressed as E = q1*W1 + q2*W2 + q3*W3, where W1 denotes the first probability that the voice interaction user is located at the first position, determined from the sound source position of the voice; W2 denotes the second probability determined from the auxiliary information detected in the dynamic image; W3 denotes the second probability determined from the auxiliary information detected by the seat sensor; and q1, q2 and q3 denote the weights of W1, W2 and W3, respectively.
If the weighted sum E is greater than the predetermined threshold, the user at that position is determined to be a voice interaction user who uttered the voice signal. For example, in the case above, when the calculated weighted sums for the front right and rear right positions exceed the predetermined threshold, the users at those positions can be determined to be the voice interaction users who spoke.
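A minimal sketch of this weighted fusion, with assumed weight values and threshold (the patent only states that they are pre-assigned):

```python
def is_voice_interaction_user(w1: float, w2: float, w3: float,
                              q=(0.5, 0.3, 0.2),       # pre-assigned weights (assumed)
                              threshold: float = 0.7   # predetermined threshold (assumed)
                              ) -> bool:
    """E = q1*W1 + q2*W2 + q3*W3, compared against the preset threshold."""
    e = q[0] * w1 + q[1] * w2 + q[2] * w3
    return e > threshold

# Front-right occupant from the running example: 0.5*0.95 + 0.3*0.85 + 0.2*1.0 = 0.93
print(is_voice_interaction_user(0.95, 0.85, 1.00))   # True
# Rear-middle seat: sound 40%, image 0, seat 0 -> 0.5*0.40 = 0.20
print(is_voice_interaction_user(0.40, 0.0, 0.0))     # False
```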
With this scheme, the voice interaction users who are speaking can be determined by combining the sound source position of the voice with the auxiliary information detected by the sensors, which improves the accuracy of the result.
As shown in fig. 4, in one embodiment, step S104 includes:
s1041: the location of each voice interaction user is determined.
S1042: and determining the speaker closest to each voice interaction user according to the position distribution of each speaker.
S1043: and respectively sending the feedback information to a speaker closest to each voice interaction user for playing according to the labels.
Once the voice interaction users are determined, each user's position can be determined from the sound source position and the auxiliary information, for example the front right and rear right positions in the previous occupant example.
According to a pre-acquired position distribution of the speakers, the speaker nearest to each voice interaction user can be determined. The speaker position distribution may come from position information entered by a user, from downloaded vehicle configuration information, or the like.
According to the labels, each voice interaction user and the position thereof can be confirmed. Therefore, the feedback information can be sent to the speaker closest to each voice interaction user for playing so as to realize voice interaction.
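A minimal sketch of this nearest-speaker routing, with an assumed cabin speaker layout and a print call standing in for the actual audio output interface:

```python
import math

SPEAKERS = {   # speaker id -> (x, y) cabin coordinates in metres (assumed layout)
    "front-left": (-0.6, 1.2), "front-right": (0.6, 1.2),
    "rear-left": (-0.6, -0.4), "rear-right": (0.6, -0.4),
}

def nearest_speaker(user_pos: tuple[float, float]) -> str:
    """Pick the loudspeaker with the smallest Euclidean distance to the user."""
    return min(SPEAKERS, key=lambda spk: math.dist(SPEAKERS[spk], user_pos))

def play_feedback(tag: str, user_pos: tuple[float, float], feedback: str) -> None:
    spk = nearest_speaker(user_pos)
    print(f"[{tag}] -> {spk}: {feedback}")   # stand-in for the audio output call

play_feedback("front-right", (0.5, 1.1), "Current position: 39.91 N, 116.40 E")
```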
Through the scheme, a plurality of voice interaction users can be supported to interact simultaneously, and interference among the voice interaction users is reduced.
As shown in fig. 5, the present application provides a device for voice interaction, including:
the voice interaction user determining module 501 is configured to determine a plurality of voice interaction users sending out voice signals according to the sound source position of the voice signals and the auxiliary information detected by the sensor when detecting that the voice signals contain interaction information.
The tag setting module 502 is configured to set a tag for the interaction information in the voice signal, where the tag corresponds to a voice interaction user who sends out the voice signal.
The feedback information generating module 503 is configured to generate feedback information for the interaction information.
And the feedback information playing module 504 is configured to play the feedback information to the voice interaction user corresponding to the tag.
In one embodiment, the sensor comprises an image sensor;
as shown in fig. 6, the voice interaction user determination module 501 includes:
the user identification submodule 5011 is used for identifying the image detected by the image sensor and confirming each user in the image.
The auxiliary information confirming submodule 5012 is used for obtaining the probability of each user uttering the voice signal according to the facial features of each user, and for determining the probability of each user uttering the voice signal as the auxiliary information.
In one embodiment, the sensor comprises a seat sensor;
the auxiliary information confirmation sub-module 5012 is also for: the information that the seat sensor detects that the seat is occupied is determined as the assist information.
As shown in fig. 7, in one embodiment, the voice interaction user determination module 501 further includes:
a first probability determination submodule 5013 is configured to determine a first probability that the user located at each position is a voice interaction user according to the sound source position of the voice signal.
A second probability determination submodule 5014 is configured to determine, according to the auxiliary information, a second probability that the user located at each position is a voice interaction user.
A weighted sum computation submodule 5015 for computing a weighted sum of the first probability and the second probability for each position using the weight assigned in advance.
The voice interaction user determination execution submodule 5016 is used for determining that the user located at the corresponding position is the voice interaction user sending out the voice signal if the weighted sum is larger than the preset threshold value.
As shown in fig. 8, in one embodiment, the feedback information playing module 504 includes:
a location determination sub-module 5041 for determining the location of each voice interaction user.
Speaker determination submodule 5042 is configured to determine a speaker closest to each voice interaction user based on a position distribution of each speaker.
The feedback information playing execution submodule 5043 is used for respectively sending the feedback information to the speaker closest to each voice interaction user for playing according to the labels.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 9, a block diagram of an electronic device is provided for a method of voice interaction according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 9, the electronic device includes: one or more processors 910, a memory 920, and interfaces for connecting components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 910 is illustrated in fig. 9.
Memory 920 is a non-transitory computer-readable storage medium provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of voice interaction provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of voice interaction provided herein.
The memory 920 is used as a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method of voice interaction in the embodiments of the present application (e.g., the voice interaction user determination module 501, the tag setting module 502, the feedback information generation module 503, and the feedback information playing module 504 shown in fig. 5). The processor 910 executes various functional applications of the server and data processing, i.e., a method of implementing voice interaction in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 920.
Memory 920 may include a storage program area that may store an operating system, at least one application required for functionality, and a storage data area; the storage data area may store data created according to the use of the electronic device of the method of voice interaction, etc. In addition, memory 920 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 920 may optionally include memory located remotely from processor 910 that may be connected to the electronic device of the method of voice interaction through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of voice interaction may further include: an input device 930, and an output device 940. The processor 910, memory 920, input device 930, and output device 940 may be connected by a bus or other means, for example in fig. 9.
The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for the voice interaction method, such as a touch screen, a keypad, a mouse, a trackpad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, a joystick, and the like. The output device 940 may include a display apparatus, an auxiliary lighting device (e.g., LED), a haptic feedback device (e.g., vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A method of voice interaction, comprising:
under the condition that interaction information is contained in the voice signal, determining a plurality of voice interaction users sending out the voice signal according to the sound source position of the voice signal and auxiliary information detected by a sensor;
setting a label for interaction information in the voice signal, wherein the label corresponds to a voice interaction user sending the voice signal;
generating feedback information for the interaction information;
playing the feedback information to a voice interaction user corresponding to the tag;
determining a first probability that a user at each position is the voice interaction user according to the sound source position of the voice signal;
determining a second probability that the user at each position is the voice interaction user according to the auxiliary information;
calculating a weighted sum of the first probability and the second probability of each position by using a pre-allocated weight;
under the condition that the weighted sum is larger than a preset threshold value, determining that the user positioned at the corresponding position is a voice interaction user sending out the voice signal;
the determining, according to the sound source position of the voice signal, a first probability that the user located at each position is the voice interaction user, including:
and carrying out sound source localization by using a microphone array, wherein the microphone array is used as a direction array, the direction array firstly carries out grid division on a localization area, the relative sound pressure of each grid is obtained through the time delay of a received sound source, a hologram for sound source localization is determined based on the relative sound pressure, then the hologram is sent to a pre-trained localization probability model, and finally, the first probability that the user at each position is a voice interaction user is determined.
2. The method of claim 1, wherein the sensor comprises an image sensor;
the determination mode of the auxiliary information comprises the following steps:
identifying the image detected by the image sensor, and confirming each user in the image;
obtaining the probability of each user sending out the voice signal according to the facial features of each user;
and determining the probability of each user sending out the voice signal as the auxiliary information.
3. The method of claim 2, wherein the sensor comprises a seat sensor;
the determination mode of the auxiliary information further comprises the following steps:
and determining the information of the occupied seat detected by the seat sensor as the auxiliary information.
4. The method of claim 1, wherein playing the feedback information to a voice interaction user corresponding to the tag comprises:
determining the position of each voice interaction user;
determining the speaker closest to each voice interaction user according to the position distribution of each speaker;
and respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing according to the labels.
5. An apparatus for voice interaction, comprising:
the voice interaction user determining module is used for determining a plurality of voice interaction users sending out the voice signals according to the sound source positions of the voice signals and the auxiliary information detected by the sensors under the condition that the voice signals contain interaction information;
the tag setting module is used for setting a tag for interaction information in the voice signal, and the tag corresponds to a voice interaction user sending the voice signal;
the feedback information generation module is used for generating feedback information of the interaction information;
the feedback information playing module is used for playing the feedback information to the voice interaction user corresponding to the tag;
the first probability determination submodule is used for determining the first probability that the user at each position is the voice interaction user according to the sound source position of the voice signal;
a second probability determining sub-module, configured to determine, according to the auxiliary information, a second probability that the user located at each position is the voice interaction user;
a weighted sum calculation sub-module for calculating a weighted sum of the first probability and the second probability for each location using a pre-assigned weight;
the voice interaction user determining and executing sub-module is used for determining that the user positioned at the corresponding position is the voice interaction user sending the voice signal under the condition that the weighted sum is larger than a preset threshold value;
the first probability determination submodule includes:
and carrying out sound source localization by using a microphone array, wherein the microphone array is used as a direction array, the direction array firstly carries out grid division on a localization area, the relative sound pressure of each grid is obtained through the time delay of a received sound source, a hologram for sound source localization is determined based on the relative sound pressure, then the hologram is sent to a pre-trained localization probability model, and finally, the first probability that the user at each position is a voice interaction user is determined.
6. The apparatus of claim 5, wherein the sensor comprises an image sensor;
the voice interaction user determination module comprises:
the user identification sub-module is used for identifying the image detected by the image sensor and confirming each user in the image;
the auxiliary information confirming sub-module is used for obtaining the probability of each user sending out the voice signal according to the facial features of each user;
and determining the probability of each user sending out the voice signal as the auxiliary information.
7. The apparatus of claim 6, wherein the sensor comprises a seat sensor;
the auxiliary information confirmation sub-module is further configured to: and determining the information of the occupied seat detected by the seat sensor as the auxiliary information.
8. The apparatus of claim 5, wherein the feedback information playing module comprises:
the position determining sub-module is used for determining the position of each voice interaction user;
the speaker determining submodule is used for determining speakers closest to each voice interaction user according to the position distribution of each speaker;
and the feedback information playing execution sub-module is used for respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing according to the labels.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 4.
10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 4.
CN202010530888.5A 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium Active CN111694433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010530888.5A CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010530888.5A CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111694433A CN111694433A (en) 2020-09-22
CN111694433B true CN111694433B (en) 2023-06-20

Family

ID=72480499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010530888.5A Active CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111694433B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581981B (en) * 2020-11-04 2023-11-03 北京百度网讯科技有限公司 Man-machine interaction method, device, computer equipment and storage medium
CN112562664A (en) * 2020-11-27 2021-03-26 上海仙塔智能科技有限公司 Sound adjusting method, system, vehicle and computer storage medium
CN114664295A (en) * 2020-12-07 2022-06-24 北京小米移动软件有限公司 Robot and voice recognition method and device for same
CN113362823A (en) * 2021-06-08 2021-09-07 深圳市同行者科技有限公司 Multi-terminal response method, device, equipment and storage medium of household appliance
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114564265B (en) * 2021-12-22 2023-07-25 上海小度技术有限公司 Interaction method and device of intelligent equipment with screen and electronic equipment
CN113971954B (en) * 2021-12-23 2022-07-12 广州小鹏汽车科技有限公司 Voice interaction method and device, vehicle and storage medium
CN116978372A (en) * 2022-04-22 2023-10-31 华为技术有限公司 Voice interaction method, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7126583B1 (en) * 1999-12-15 2006-10-24 Automotive Technologies International, Inc. Interactive vehicle display system
US8560236B1 (en) * 2008-06-20 2013-10-15 Google Inc. Showing uncertainty of location
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN109147787A (en) * 2018-09-30 2019-01-04 深圳北极鸥半导体有限公司 A kind of smart television acoustic control identifying system and its recognition methods
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10540957B2 (en) * 2014-12-15 2020-01-21 Baidu Usa Llc Systems and methods for speech transcription
US10134386B2 (en) * 2015-07-21 2018-11-20 Rovi Guides, Inc. Systems and methods for identifying content corresponding to a language spoken in a household
CN105159111B (en) * 2015-08-24 2019-01-25 百度在线网络技术(北京)有限公司 Intelligent interaction device control method and system based on artificial intelligence
CN109493871A (en) * 2017-09-11 2019-03-19 上海博泰悦臻网络技术服务有限公司 The multi-screen voice interactive method and device of onboard system, storage medium and vehicle device
CN108399916A (en) * 2018-01-08 2018-08-14 蔚来汽车有限公司 Vehicle intelligent voice interactive system and method, processing unit and storage device
CN108877795B (en) * 2018-06-08 2020-03-10 百度在线网络技术(北京)有限公司 Method and apparatus for presenting information
CN109490834A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of sound localization method, sound source locating device and vehicle
CN109493876A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of microphone array control method, device and vehicle
CN109545219A (en) * 2019-01-09 2019-03-29 北京新能源汽车股份有限公司 Vehicle-mounted voice exchange method, system, equipment and computer readable storage medium
CN110070868B (en) * 2019-04-28 2021-10-08 广州小鹏汽车科技有限公司 Voice interaction method and device for vehicle-mounted system, automobile and machine readable medium
CN110082723B (en) * 2019-05-16 2022-03-15 浙江大华技术股份有限公司 Sound source positioning method, device, equipment and storage medium
CN110047487B (en) * 2019-06-05 2022-03-18 广州小鹏汽车科技有限公司 Wake-up method and device for vehicle-mounted voice equipment, vehicle and machine-readable medium
CN110211585A (en) * 2019-06-05 2019-09-06 广州小鹏汽车科技有限公司 In-car entertainment interactive approach, device, vehicle and machine readable media
CN110364153A (en) * 2019-07-30 2019-10-22 恒大智慧科技有限公司 A kind of distributed sound control method, system, computer equipment and storage medium
CN112965033A (en) * 2021-02-03 2021-06-15 深圳市轻生活科技有限公司 Sound source positioning system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7126583B1 (en) * 1999-12-15 2006-10-24 Automotive Technologies International, Inc. Interactive vehicle display system
US8560236B1 (en) * 2008-06-20 2013-10-15 Google Inc. Showing uncertainty of location
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN109147787A (en) * 2018-09-30 2019-01-04 深圳北极鸥半导体有限公司 A kind of smart television acoustic control identifying system and its recognition methods
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning

Also Published As

Publication number Publication date
CN111694433A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111694433B (en) Voice interaction method and device, electronic equipment and storage medium
CN111179961B (en) Audio signal processing method and device, electronic equipment and storage medium
US20220172737A1 (en) Speech signal processing method and speech separation method
CN105793921A (en) Initiating actions based on partial hotwords
KR20230018534A (en) Speaker diarization using speaker embedding(s) and trained generative model
CN112970059B (en) Electronic device for processing user utterance and control method thereof
EP3923272B1 (en) Method and apparatus for adapting a wake-up model
CN110263131B (en) Reply information generation method, device and storage medium
EP3201770A1 (en) Methods and apparatus for module arbitration
US20220254369A1 (en) Electronic device supporting improved voice activity detection
EP4310838A1 (en) Speech wakeup method and apparatus, and storage medium and system
JP2022095768A (en) Method, device, apparatus, and medium for dialogues for intelligent cabin
US20230274740A1 (en) Arbitrating between multiple potentially-responsive electronic devices
CN111383661B (en) Sound zone judgment method, device, equipment and medium based on vehicle-mounted multi-sound zone
CN113823313A (en) Voice processing method, device, equipment and storage medium
CN112466327B (en) Voice processing method and device and electronic equipment
CN111276127B (en) Voice awakening method and device, storage medium and electronic equipment
CN113539265B (en) Control method, device, equipment and storage medium
CN116888665A (en) Electronic apparatus and control method thereof
US20240075944A1 (en) Localized voice recognition assistant
CN111640429B (en) Method for providing voice recognition service and electronic device for the same
EP4350693A2 (en) Voice processing method and apparatus, computer device, and storage medium
KR20190074344A (en) Dialogue processing apparatus and dialogue processing method
CN111640429A (en) Method of providing voice recognition service and electronic device for the same
KR20220169242A (en) Electronic devcie and method for personalized audio processing of the electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211013

Address after: 100176 Room 101, 1st floor, building 1, yard 7, Ruihe West 2nd Road, economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Apollo Zhilian (Beijing) Technology Co.,Ltd.

Address before: 2 / F, baidu building, 10 Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

GR01 Patent grant