Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
By utilizing an audio directional propagation technology, the method and the device enable the audio signal emitted by the voice interaction device to be heard only at a certain angle and within a certain area. This prevents other users from being disturbed by the sound while a voice interaction service is provided to the target user, and at the same time enhances the user's private voice interaction experience. Audio directional propagation refers to a sound source technology in which sound propagates through the air in a beam along a certain direction; it may be, but is not limited to, an ultrasonic-based audio directional propagation technology.
Fig. 1 shows a schematic diagram of a voice interaction method according to one embodiment of the present disclosure. The voice interaction method of the present disclosure may be performed by, for example but not limited to, a smart speaker or a smart television.
Referring to fig. 1, taking the voice interaction device as a smart speaker and taking an ultrasonic-based audio directional propagation technology as an example, the smart speaker may transmit ultrasonic waves f1 and f2, whose frequencies exceed the audible frequency range, toward the audio receiving area where the target user is located. The ultrasonic waves f1 and f2 may be configured such that the difference between their frequencies (i.e., f1 − f2) falls within the audible frequency range.
Under the nonlinear action of the air medium, the signals interact and self-demodulate, generating new sound waves whose frequencies are the sum (sum frequency) and the difference (difference frequency) of the original ultrasonic frequencies. The new sound wave at the difference frequency is an audible sound wave; it falls within the audio receiving area and can be heard by the target user, while other users outside the audio receiving area cannot hear it.
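The carrier-pair relationship described above can be illustrated with a minimal sketch. This is not part of the disclosure; the 40 kHz base carrier and the helper name `choose_carrier_pair` are illustrative assumptions.

```python
# Illustrative sketch: choosing a pair of ultrasonic frequencies whose
# difference frequency equals the desired audible tone. The 40 kHz base
# carrier is an assumed value, not taken from the disclosure.

AUDIBLE_MIN_HZ = 20
AUDIBLE_MAX_HZ = 20_000

def choose_carrier_pair(audible_hz, carrier_hz=40_000):
    """Return ultrasonic frequencies (f1, f2) such that f1 - f2 equals
    the desired audible frequency produced by air nonlinearity."""
    if not (AUDIBLE_MIN_HZ <= audible_hz <= AUDIBLE_MAX_HZ):
        raise ValueError("target frequency is not audible")
    f2 = carrier_hz                 # base ultrasonic carrier, above 20 kHz
    f1 = carrier_hz + audible_hz    # offset carrier
    return f1, f2

f1, f2 = choose_carrier_pair(440)   # a 440 Hz audible tone
assert f1 - f2 == 440               # difference frequency is audible
assert f1 > AUDIBLE_MAX_HZ and f2 > AUDIBLE_MAX_HZ  # both carriers ultrasonic
```

Both emitted waves are inaudible on their own; only where the beam's nonlinear self-demodulation occurs does the 440 Hz difference tone appear.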
Fig. 2 shows a schematic diagram of a structure of a voice interaction device according to one embodiment of the present disclosure. The voice interaction device 200 may be, but is not limited to, a smart speaker, a smart television, a smart phone, a vehicle, a traffic light, or another device with an audio output function.
Referring to fig. 2, the voice interaction device 200 may include a microphone 210 and/or a camera 220, a processor 230, and a speaker 240.
The microphone 210, i.e., a pickup, is used to collect audio input, which may include a speaker's voice input as well as ambient noise. The camera 220 is used to capture the surrounding environment to obtain image information, and the image information captured by the camera 220 may include one or more users.
The processor 230 may determine the audio receiving area based on the audio input captured by the microphone 210, or may determine the audio receiving area based on the image information captured by the camera 220.
1. Determining an audio receiving area based on audio input
The audio receiving area is the area where the target user is located. The target user refers to a user who needs to listen to the audio output by the voice interaction device.
As one example of the present disclosure, the processor 230 may calculate the sound source position from the audio input (e.g., a voice input) captured by the microphone 210. For example, the microphone 210 may be a microphone array composed of a plurality of microphones, and the sound source position may be calculated from the differences in signal strength of the voice input captured by the plurality of microphones. An audio receiving area is then determined based on the sound source position; for example, the sound source position may be taken as the audio receiving area. In this case, the target user may be referred to as the speaker.
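The signal-strength-based localization step can be sketched as follows. This is a hedged illustration, not the disclosure's algorithm: it assumes a circular microphone array and estimates the bearing as a strength-weighted circular mean of the microphones' pointing angles.

```python
# Assumed sketch of sound-source bearing estimation from per-microphone
# signal strengths in a circular array (weighted circular mean).
import math

def estimate_bearing(mic_angles_deg, strengths):
    """Weighted circular mean of microphone angles, weighted by the
    signal strength each microphone observed."""
    x = sum(s * math.cos(math.radians(a)) for a, s in zip(mic_angles_deg, strengths))
    y = sum(s * math.sin(math.radians(a)) for a, s in zip(mic_angles_deg, strengths))
    return math.degrees(math.atan2(y, x)) % 360

# Four-microphone array at 0/90/180/270 degrees; the strongest signal
# arrives at the 90-degree microphone, so the speaker is near 90 degrees.
bearing = estimate_bearing([0, 90, 180, 270], [0.2, 1.0, 0.2, 0.1])
```

A production device would more likely use time-difference-of-arrival beamforming, but the strength-weighted variant matches the "difference in signal strength" wording above.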
As another example of the present disclosure, the user may issue a voice input indicating the audio receiving area. In this case, the processor 230 may perform semantic recognition on the voice input collected by the microphone 210 and determine the audio receiving area according to the semantic recognition result. The audio receiving area can thus be specified by the user, that is, the target user can be designated by the speaker, so that the speaker can use the voice interaction device to direct sound toward a chosen listener. For example, if user A wants to share the song "Desert Camel" with user B in the same scene, user A may issue a voice command such as "play the song 'Desert Camel' to the left" according to the relative positional relationship between user B and the voice interaction device. After receiving the voice command, the voice interaction device may determine from the semantic recognition result that the audio receiving area is to the front left of the device, and may then transmit the corresponding sound wave signal to the front left using the audio directional propagation technology, so that the song "Desert Camel" can be heard only by user B.
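The mapping from a recognized command to a beam direction can be sketched minimally. The keyword table, angle convention, and function name below are illustrative assumptions; a real system would consume the output of a full semantic parser rather than substring matching.

```python
# Hypothetical sketch: mapping the semantic recognition result of a
# command such as "play the song to the left" onto a beam direction.
DIRECTION_KEYWORDS = {
    "left": 270.0,    # degrees, relative to the device's forward axis
    "right": 90.0,
    "front": 0.0,
    "behind": 180.0,
}

def direction_from_command(text):
    """Return the beam angle for the first direction keyword found,
    or None when the command names no direction."""
    lowered = text.lower()
    for keyword, angle in DIRECTION_KEYWORDS.items():
        if keyword in lowered:
            return angle
    return None

assert direction_from_command("play the song to the left") == 270.0
assert direction_from_command("turn up the volume") is None
```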
Optionally, the processor 230 may also determine the audio output based on audio input (e.g., voice input) captured by the microphone. For example, the processor 230 may recognize a voice input and determine an audio output suitable for feedback to the user based on the recognition result. For another example, the processor 230 may also upload the audio input collected by the microphone to a server, and receive the audio output fed back by the server.
To raise the bar for the private voice interaction experience and enhance the user's voice interaction experience, the processor 230 may further determine whether the current user has voice interaction permission based on the voice input in the audio input. For example, it may identify the current user from the voice input and determine, according to the identification result, whether the current user is a registered user. If the user is registered, the voice interaction method is executed and a private voice interaction experience is provided for the user; if the user is not registered, no processing is performed, or audio that can be heard in all directions is output according to the normal voice interaction mode.
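The permission gate just described can be sketched as a simple branch. The voiceprint-identification step is stubbed out as an assumption; the registry contents and the mode names are illustrative, not from the disclosure.

```python
# Hedged sketch of the permission gate: only a registered user triggers
# the private (directional) interaction mode. Identification itself is
# assumed to happen upstream (e.g. via voiceprint recognition).
from typing import Optional

REGISTERED_USERS = {"alice", "bob"}   # illustrative registry

def interaction_mode(identified_user: Optional[str]) -> str:
    """Private directional audio for registered users; the normal
    omnidirectional mode (or no processing) otherwise."""
    if identified_user in REGISTERED_USERS:
        return "directional"
    return "omnidirectional"

assert interaction_mode("alice") == "directional"
assert interaction_mode("mallory") == "omnidirectional"
```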
2. Determining an audio receiving area based on image information
The image information collected by the camera may include the target user. The processor 230 may determine the relative positional relationship between the target user and the device based on the image information. For example, the processor 230 may determine the relative positional relationship based on the shooting parameters of the camera and the position and size of the target user in the image information. After the relative positional relationship between the target user and the device is obtained, the audio receiving area, i.e., the area where the target user is located, can be determined.
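One common way to derive a bearing from "the position of the target user in the image" is a linear mapping through the camera's horizontal field of view. The following is a minimal sketch under that assumption; the parameter names and the small-angle linearization are illustrative, not taken from the disclosure.

```python
# Illustrative sketch: deriving the horizontal bearing of the target
# user from their pixel position and the camera's horizontal FOV.

def bearing_from_image(center_x_px, image_width_px, hfov_deg):
    """Map the user's horizontal pixel position to an angle off the
    camera's optical axis (negative = left, positive = right)."""
    offset = (center_x_px - image_width_px / 2) / (image_width_px / 2)
    return offset * (hfov_deg / 2)

# A user centered at pixel 1440 in a 1920-px-wide frame, with a
# 60-degree horizontal FOV, sits about 15 degrees to the right.
angle = bearing_from_image(1440, 1920, 60)
```

The user's apparent size in the image could similarly be used to estimate distance, completing the relative positional relationship.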
One or more users may be present in the image information acquired by the camera. The processor 230 may identify the identity information of each user in the image information using biometric techniques (e.g., face recognition); based on the identified identity information, take the users meeting a preset condition as target users; then determine the relative positional relationship between the target users and the device based on the image information; and finally determine the audio receiving area based on that relative positional relationship. The preset condition may be, but is not limited to, a judgment condition related to the user's age or to the audio output. For example, the preset condition may be whether the user is an adult, or whether the user belongs to the audience for which the current audio output is intended.
Taking the voice interaction device as a smart speaker as an example, the image information collected by the camera may include family members such as parents, elderly people, and children. After identifying the identity information of each user in the image information, the processor 230 may take the adults (such as the parents and the elderly) as target users and exclude the children, then calculate the relative positional relationship between the target users and the smart speaker, and determine the audio receiving area based on the calculated relationship. In this way, under the action of the audio directional propagation technology, the smart speaker outputs audio only to the area where the adult family members are located.
For another example, after identifying the identity information of each user in the image information, the processor 230 may further determine, according to the content of the audio output, the audience for which the audio output is intended (such as adult males, adult females, or students), take the users whose identities correspond to that audience as target users, exclude the users whose identities do not, calculate the relative positional relationship between the target users and the smart speaker, and determine the audio receiving area based on the calculated relationship. Thus, under the action of the audio directional propagation technology, the smart speaker outputs audio only to users suited to listening to that audio content.
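The "preset condition" filter in the two examples above can be sketched as a simple selection over recognized users. The record fields, the age threshold, and the helper names are illustrative assumptions.

```python
# A minimal sketch, assuming a simple recognized-user record, of the
# preset-condition filter: select adults as target users, exclude children.
from dataclasses import dataclass

@dataclass
class RecognizedUser:
    name: str
    age: int
    bearing_deg: float   # position relative to the device

def select_targets(users, min_age=18):
    """Users meeting the preset condition (here: adults) become targets."""
    return [u for u in users if u.age >= min_age]

family = [
    RecognizedUser("parent", 42, 10.0),
    RecognizedUser("child", 7, -20.0),
    RecognizedUser("grandparent", 70, 30.0),
]
targets = select_targets(family)   # the child is excluded
```

An audience-based condition would swap the age test for a predicate comparing each user's identity against the audience inferred from the audio content.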
The processor 230 may include a DSP chip (digital signal processing chip). The processor 230 may modulate the audio output into a first sound wave and a second sound wave based on nonlinear ultrasonic modulation, the first sound wave and the second sound wave each having a frequency that exceeds the audible frequency range, and the difference between the frequency of the first sound wave and the frequency of the second sound wave falling within the audible frequency range. The audible frequency range refers to the range of sound frequencies that the human ear can hear, i.e., approximately 20 Hz to 20,000 Hz.
The audio output is the audio content directed to the user and may be, but is not limited to, music or voice prompts fed back to the user. The audio output may be an audible sound signal, and the processor 230 may select an appropriate ultrasonic frequency band using a particular algorithm to modulate the audio output onto the first sound wave and the second sound wave, where the first sound wave and the second sound wave are ultrasonic carrier signals.
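The modulation step can be illustrated with plain double-sideband amplitude modulation, a common choice for parametric (directional) loudspeakers. This is a hedged sketch: the disclosure does not specify its modulation scheme, and the sample rate, carrier frequency, and modulation depth below are assumed values.

```python
# Hedged sketch of amplitude-modulating an audible signal onto an
# ultrasonic carrier (simple DSB-AM). All numeric values are assumptions.
import math

def am_modulate(audio_samples, carrier_hz, sample_rate_hz, depth=0.8):
    """Per sample n: (1 + depth * audio[n]) * cos(2*pi*fc*n/fs).
    Audio samples are assumed normalized to [-1, 1]."""
    out = []
    for n, a in enumerate(audio_samples):
        carrier = math.cos(2 * math.pi * carrier_hz * n / sample_rate_hz)
        out.append((1 + depth * a) * carrier)
    return out

# One millisecond of a 1 kHz tone modulated onto a 40 kHz carrier,
# sampled at 192 kHz (high enough to represent the ultrasonic carrier).
fs = 192_000
tone = [math.sin(2 * math.pi * 1_000 * n / fs) for n in range(192)]
modulated = am_modulate(tone, 40_000, fs)
```

In air, the beam's nonlinearity demodulates the envelope, recovering (a distorted copy of) the original audible tone; practical systems add pre-distortion to compensate.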
The speaker 240 is an audio output device of the voice interaction device 200. The processor 230 may control the speaker 240 to emit the first sound wave and the second sound wave in the direction of the audio receiving area, i.e., the area where the target user who is to listen to the audio is located. As an example, the speaker 240 may be a loudspeaker on which an ultrasonic transducer is mounted; after the audible sound signal is modulated onto the ultrasonic carrier signal, the loudspeaker may be algorithmically controlled to form an ultrasonic beam, which is emitted into the air by the ultrasonic transducer.
The voice interaction device 200 may comprise a speaker array of a plurality of speakers 240 with different playback directions. That is, the voice interaction device 200 may include a plurality of audio output means having different playback directions. After modulation is complete, the processor 230 may select a speaker from the array of speakers with a playback direction directed to the audio receiving area and control the speaker to emit the first sound wave and the second sound wave.
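The selection step over a fixed speaker array can be sketched as picking the speaker whose playback direction lies closest to the target bearing. The array geometry and helper names are illustrative assumptions.

```python
# Illustrative sketch: from speakers with known playback directions,
# choose the one closest (in angle) to the audio receiving area.

def select_speaker(speaker_angles_deg, target_angle_deg):
    """Index of the speaker whose playback direction best matches the
    target bearing, using shortest angular distance on the circle."""
    def angular_dist(a, b):
        d = abs(a - b) % 360
        return min(d, 360 - d)
    return min(range(len(speaker_angles_deg)),
               key=lambda i: angular_dist(speaker_angles_deg[i], target_angle_deg))

# Eight speakers spaced 45 degrees apart; a target user at 100 degrees
# is served by the speaker pointing at 90 degrees (index 2).
angles = [i * 45 for i in range(8)]
chosen = select_speaker(angles, 100)
```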
As shown in fig. 3, the speakers 240 may be designed as a matrix speaker module. The matrix speaker module may carry a DSP chip for executing the algorithmic part of the voice interaction method.
The voice interaction device 200 may alternatively comprise only one speaker 240, where the speaker 240 may be provided on a movable structure. After the modulation is completed, the processor 230 may control the speaker 240 to move according to the audio receiving area such that the playback direction of the moved speaker 240 is directed to the audio receiving area.
As shown in fig. 4, taking a voice interaction device that outputs a music signal as an example, the music signal may be modulated onto ultrasonic carrier signals (corresponding to the first sound wave and the second sound wave), and an algorithmically controlled matrix loudspeaker forms an ultrasonic beam that is emitted into the air by the ultrasonic transducer. As ultrasonic waves of different frequencies propagate through the air, the nonlinear acoustic effect of the air causes the signals to interact and self-demodulate, generating new sound waves whose frequencies are the sum (sum frequency) and the difference (difference frequency) of the original ultrasonic frequencies. Through algorithmic correction, the system automatically selects an appropriate ultrasonic frequency band so that the generated difference-frequency sound wave falls within the audible range. These difference-frequency music signals can be heard only at a certain angle and within a certain area, thereby enhancing the user's privacy experience.
Fig. 5 shows a schematic diagram of a voice interaction method according to another embodiment of the present disclosure. The voice interaction method of the present disclosure may be executed by a voice interaction device such as a smart speaker or a smart television.
In this embodiment, an audio signal output by the voice interaction device that is of interest to the target user (such as music) may be sent to the area where the target user is located using the audio directional propagation technology, so that the audio signal is heard only by the target user. An audio signal output by the voice interaction device that is not of interest to the target user (such as an advertisement) may be transmitted to an area away from the target user, so that the audio signal is not heard by the target user.
As an example, the area where the target user is located may be taken as a first audio receiving area; the audio output for the first content is modulated into a first sound wave and a second sound wave, each having a frequency exceeding the audible frequency range, with the difference between their frequencies falling within the audible frequency range; and the first sound wave and the second sound wave are emitted in the direction of the first audio receiving area. The first content is content of interest to the user, such as, but not limited to, audio works including music, novels, and other audio programs.
In response to the output audio switching to the audio output for the second content, an area away from the target user is taken as a second audio receiving area; the audio output for the second content is modulated into a third sound wave and a fourth sound wave, each having a frequency exceeding the audible frequency range, with the difference between their frequencies falling within the audible frequency range; and the third sound wave and the fourth sound wave are emitted in the direction of the second audio receiving area. The second content is content that is not of interest to the user, such as an advertisement.
In response to the output audio switching back to the audio output for the first content, the area where the target user is located may again be taken as the first audio receiving area, and the first sound wave and the second sound wave for the first content may be emitted in the direction of the first audio receiving area in the manner described above.
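The content-aware steering just described can be sketched as a small dispatch function. The content labels and the 180-degree "away" offset are illustrative assumptions; the disclosure only requires that unwanted content be beamed away from the target user.

```python
# A hedged sketch of the content-aware steering logic: content of
# interest is beamed at the target user; other content (e.g. an
# advertisement) is steered away from the user.

def receiving_area(content_type, user_bearing_deg):
    """Beam angle for the current content: toward the user for content
    of interest, directly away from the user otherwise."""
    if content_type == "of_interest":          # e.g. music
        return user_bearing_deg % 360
    return (user_bearing_deg + 180) % 360      # e.g. advertisement

assert receiving_area("of_interest", 30) == 30
assert receiving_area("advertisement", 30) == 210
```

When the advertisement ends and the music resumes, the same function simply steers the beam back to the user's bearing.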
As shown in fig. 5, when the smart speaker outputs a music signal, the music signal may be modulated onto ultrasonic waves f1 and f2 whose frequencies exceed the audible frequency range, and the ultrasonic waves f1 and f2 may be emitted toward the area where the target user is located. The ultrasonic waves f1 and f2 may be configured such that the difference between their frequencies (i.e., f1 − f2) falls within the audible frequency range. Under the nonlinear effect of the air medium, the new sound wave at the difference frequency is an audible sound wave that falls within the area where the target user is located and is heard by the target user.
When the smart speaker outputs an advertisement signal, the advertisement signal may be modulated onto ultrasonic waves f3 and f4 whose frequencies exceed the audible frequency range, and the ultrasonic waves f3 and f4 may be emitted toward an area away from the target user. The ultrasonic waves f3 and f4 may be configured such that the difference between their frequencies (i.e., f3 − f4) falls within the audible frequency range. Under the nonlinear effect of the air medium, the new sound wave at the difference frequency is an audible sound wave that falls in an area away from the target user and cannot be heard by the target user.
Therefore, while listening to audio works such as music and other audio programs with the voice interaction device, the user can enjoy a private listening experience without wearing headphones, and can also be shielded from advertisements inserted during playback of the audio works.
The voice interaction method of the present disclosure can also be realized as a voice interaction apparatus. Fig. 6 shows a block diagram of a voice interaction apparatus according to an exemplary embodiment of the present disclosure. The functional units of the voice interaction apparatus may be realized by hardware, software, or a combination of hardware and software implementing the principles of the present disclosure. Those skilled in the art will appreciate that the functional units depicted in fig. 6 may be combined or divided into sub-units to implement the principles described above. Accordingly, the description herein supports any possible combination, division, or further definition of the functional units described.
The following briefly describes the functional units that the voice interaction apparatus may include and the operations they may perform; for details, reference may be made to the related description above, which is not repeated here.
Referring to fig. 6, the voice interaction apparatus 600 includes a determination module 610, a modulation module 620, and a transmission module 630.
The determining module 610 is configured to determine an audio receiving area. The modulation module 620 is configured to modulate the audio output to a first sound wave and a second sound wave, each having a frequency that exceeds an audible frequency range, and the difference between the frequency of the first sound wave and the frequency of the second sound wave being within the audible frequency range. The transmitting module 630 is configured to transmit the first sound wave and the second sound wave in a direction in which the audio receiving area is located.
The voice interaction device 600 may also include an audio receiving module. The audio receiving module is configured to receive a voice input, and the determining module 610 may determine an audio receiving area according to the voice input. For example, the determination module 610 may calculate a sound source location based on the speech input and determine the audio receiving area based on the sound source location. Alternatively, the determining module 610 may recognize the voice input using a voice recognition technique and determine the audio receiving area based on the semantic recognition result of the voice input. Alternatively, the determination module 610 may also be used to determine an audio output from a speech input.
The voice interaction device 600 may also include an image acquisition module. The image acquisition module may be used to acquire image information of the surrounding environment. The determining module 610 may identify identity information of each user in the image information; based on the identity information of the identified user, taking the user meeting the preset condition as a target user; determining a relative positional relationship between the target user and the device based on the image information; an audio receiving area is determined based on the relative positional relationship.
The voice interaction device 600 may further include a movable audio output device and a controller for moving the audio output device according to the audio receiving area such that the playback direction of the moved audio output device is directed to the audio receiving area.
The voice interaction device 600 may further include a plurality of audio output devices with different playback directions, and a switching device. The switching means may select the audio output means from the plurality of audio output means to emit the first sound wave and the second sound wave with the playback direction directed to the audio receiving area.
As an example, the determining module 610 may take the area where the target user is located as the first audio receiving area; the modulation module 620 may modulate the audio output for the first content into a first sound wave and a second sound wave, each having a frequency that exceeds the audible frequency range, with the difference between their frequencies falling within the audible frequency range; and the transmitting module 630 may transmit the first sound wave and the second sound wave in the direction of the first audio receiving area. In response to the output audio switching to the audio output for the second content, the determining module 610 takes an area away from the target user as a second audio receiving area; the modulation module 620 modulates the audio output for the second content into a third sound wave and a fourth sound wave, each having a frequency that exceeds the audible frequency range, with the difference between their frequencies falling within the audible frequency range; and the transmitting module 630 transmits the third sound wave and the fourth sound wave in the direction of the second audio receiving area.
FIG. 7 illustrates a schematic diagram of a computing device that may be used to implement the voice interaction method described above, according to one embodiment of the present disclosure.
Referring to fig. 7, a computing device 700 includes a memory 710 and a processor 720.
The processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, the processor 720 may include a general-purpose host processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, the processor 720 may be implemented using custom circuitry, for example an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 720 or other modules of the computer. The persistent storage may be a readable and writable storage device, i.e., a non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data required by some or all of the processors at runtime. Furthermore, the memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some embodiments, the memory 710 may include a readable and/or writable removable storage device, such as a compact disc (CD), a digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 710 has stored thereon executable code that, when processed by the processor 720, causes the processor 720 to perform the voice interaction method described above.
The voice interaction method, apparatus and device according to the present disclosure have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above-described method of the present disclosure.
Alternatively, the present disclosure may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) that, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described methods according to the present disclosure.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.