Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
By utilizing an audio directional propagation technology, the method and the device enable the audio signal emitted by the voice interaction device to be heard only at a certain angle and within a certain area. This prevents other users from being disturbed by the sound while a voice interaction service is provided to the target user, and at the same time enhances the user's private voice interaction experience. Audio directional propagation refers to a sound source technology in which sound propagates through the air in a beam along a certain direction; it may be, but is not limited to, an ultrasonic-based audio directional propagation technology.
Fig. 1 shows a schematic diagram of a voice interaction method according to one embodiment of the present disclosure. The voice interaction method of the present disclosure may be performed by, for example but not limited to, a smart speaker or a smart television.
Referring to fig. 1, taking the voice interaction device as a smart speaker and taking an ultrasonic-based audio directional propagation technology as an example, the smart speaker may transmit ultrasonic waves f1 and f2, whose frequencies exceed the audible frequency range, toward the audio receiving area where the target user is located. The ultrasonic waves f1 and f2 may be configured such that the difference between their frequencies (i.e., f1 − f2) falls within the audible frequency range.
Under the nonlinear action of the air medium, the signals interact and self-demodulate, generating new sound waves whose frequencies are the sum (sum frequency) and the difference (difference frequency) of the original ultrasonic frequencies. The new sound wave at the difference frequency is an audible sound wave; it falls within the audio receiving area and can be heard by the target user, while other users outside the audio receiving area cannot hear it.
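The carrier-pair relationship described above can be illustrated with a minimal sketch. This is not part of the disclosure; the 40 kHz base carrier and the helper name `choose_carrier_pair` are illustrative assumptions.

```python
# Illustrative sketch: choosing a pair of ultrasonic frequencies whose
# difference frequency equals the desired audible tone. The 40 kHz base
# carrier is an assumed value, not taken from the disclosure.

AUDIBLE_MIN_HZ = 20
AUDIBLE_MAX_HZ = 20_000

def choose_carrier_pair(audible_hz, carrier_hz=40_000):
    """Return ultrasonic frequencies (f1, f2) such that f1 - f2 equals
    the desired audible frequency produced by air nonlinearity."""
    if not (AUDIBLE_MIN_HZ <= audible_hz <= AUDIBLE_MAX_HZ):
        raise ValueError("target frequency is not audible")
    f2 = carrier_hz                 # base ultrasonic carrier, above 20 kHz
    f1 = carrier_hz + audible_hz    # offset carrier
    return f1, f2

f1, f2 = choose_carrier_pair(440)   # a 440 Hz audible tone
assert f1 - f2 == 440               # difference frequency is audible
assert f1 > AUDIBLE_MAX_HZ and f2 > AUDIBLE_MAX_HZ  # both carriers ultrasonic
```

Both emitted waves are inaudible on their own; only where the beam's nonlinear self-demodulation occurs does the 440 Hz difference tone appear.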
Fig. 2 shows a schematic diagram of a structure of a voice interaction device according to one embodiment of the present disclosure. The voice interaction device 200 may be, but is not limited to, a smart speaker, a smart television, a smart phone, a vehicle, a traffic light, or another device with an audio output function.
Referring to fig. 2, the voice interaction device 200 may include a microphone 210 and/or a camera 220, a processor 230, and a speaker 240.
The microphone 210, i.e., a pickup, is used to collect audio input, which may include a speaker's voice input as well as ambient noise. The camera 220 is used to capture the surrounding environment to obtain image information, and the image information captured by the camera 220 may include one or more users.
The processor 230 may determine the audio receiving area based on the audio input captured by the microphone 210, or may determine the audio receiving area based on the image information captured by the camera 220.
1. Determining an audio receiving area based on audio input
The audio receiving area is the area where the target user is located. The target user refers to a user who needs to listen to the audio output by the voice interaction device.
As one example of the present disclosure, the processor 230 may calculate the sound source position from the audio input (e.g., a voice input) captured by the microphone 210. For example, the microphone 210 may be a microphone array composed of a plurality of microphones, and the sound source position may be calculated from the differences in signal strength of the voice input captured by the plurality of microphones. An audio receiving area is then determined based on the sound source position; for example, the sound source position may be taken as the audio receiving area. In this case, the target user may be referred to as the speaker.
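The signal-strength-based localization step can be sketched as follows. This is a hedged illustration, not the disclosure's algorithm: it assumes a circular microphone array and estimates the bearing as a strength-weighted circular mean of the microphones' pointing angles.

```python
# Assumed sketch of sound-source bearing estimation from per-microphone
# signal strengths in a circular array (weighted circular mean).
import math

def estimate_bearing(mic_angles_deg, strengths):
    """Weighted circular mean of microphone angles, weighted by the
    signal strength each microphone observed."""
    x = sum(s * math.cos(math.radians(a)) for a, s in zip(mic_angles_deg, strengths))
    y = sum(s * math.sin(math.radians(a)) for a, s in zip(mic_angles_deg, strengths))
    return math.degrees(math.atan2(y, x)) % 360

# Four-microphone array at 0/90/180/270 degrees; the strongest signal
# arrives at the 90-degree microphone, so the speaker is near 90 degrees.
bearing = estimate_bearing([0, 90, 180, 270], [0.2, 1.0, 0.2, 0.1])
```

A production device would more likely use time-difference-of-arrival beamforming, but the strength-weighted variant matches the "difference in signal strength" wording above.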
As another example of the present disclosure, the user may issue a voice input indicating the audio receiving area. In this case, the processor 230 may perform semantic recognition on the voice input collected by the microphone 210 and determine the audio receiving area according to the semantic recognition result. The audio receiving area can thus be specified by the user, that is, the target user can be designated by the speaker, so that the speaker can use the voice interaction device to direct sound toward a chosen listener. For example, if user A wants to share the song "Desert Camel" with user B in the same scene, user A may issue a voice command such as "play the song 'Desert Camel' to the left" according to the relative positional relationship between user B and the voice interaction device. After receiving the voice command, the voice interaction device may determine from the semantic recognition result that the audio receiving area is to the front left of the device, and may then transmit the corresponding sound wave signal to the front left using the audio directional propagation technology, so that the song "Desert Camel" can be heard only by user B.
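The mapping from a recognized command to a beam direction can be sketched minimally. The keyword table, angle convention, and function name below are illustrative assumptions; a real system would consume the output of a full semantic parser rather than substring matching.

```python
# Hypothetical sketch: mapping the semantic recognition result of a
# command such as "play the song to the left" onto a beam direction.
DIRECTION_KEYWORDS = {
    "left": 270.0,    # degrees, relative to the device's forward axis
    "right": 90.0,
    "front": 0.0,
    "behind": 180.0,
}

def direction_from_command(text):
    """Return the beam angle for the first direction keyword found,
    or None when the command names no direction."""
    lowered = text.lower()
    for keyword, angle in DIRECTION_KEYWORDS.items():
        if keyword in lowered:
            return angle
    return None

assert direction_from_command("play the song to the left") == 270.0
assert direction_from_command("turn up the volume") is None
```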
Optionally, the processor 230 may also determine the audio output based on audio input (e.g., voice input) captured by the microphone. For example, the processor 230 may recognize a voice input and determine an audio output suitable for feedback to the user based on the recognition result. For another example, the processor 230 may also upload the audio input collected by the microphone to a server, and receive the audio output fed back by the server.
To raise the bar for the private voice interaction experience and enhance the user's voice interaction experience, the processor 230 may further determine whether the current user has voice interaction permission based on the voice input in the audio input. For example, it may identify the current user from the voice input and determine, according to the identification result, whether the current user is a registered user. If the user is registered, the voice interaction method is executed and a private voice interaction experience is provided for the user; if the user is not registered, no processing is performed, or audio that can be heard in all directions is output according to the normal voice interaction mode.
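The permission gate just described can be sketched as a simple branch. The voiceprint-identification step is stubbed out as an assumption; the registry contents and the mode names are illustrative, not from the disclosure.

```python
# Hedged sketch of the permission gate: only a registered user triggers
# the private (directional) interaction mode. Identification itself is
# assumed to happen upstream (e.g. via voiceprint recognition).
from typing import Optional

REGISTERED_USERS = {"alice", "bob"}   # illustrative registry

def interaction_mode(identified_user: Optional[str]) -> str:
    """Private directional audio for registered users; the normal
    omnidirectional mode (or no processing) otherwise."""
    if identified_user in REGISTERED_USERS:
        return "directional"
    return "omnidirectional"

assert interaction_mode("alice") == "directional"
assert interaction_mode("mallory") == "omnidirectional"
```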
2. Determining an audio receiving area based on image information
The image information collected by the camera may include the target user. The processor 230 may determine the relative positional relationship between the target user and the device based on the image information. For example, the processor 230 may determine the relative positional relationship based on the shooting parameters of the camera and the position and size of the target user in the image information. After the relative positional relationship between the target user and the device is obtained, the audio receiving area, i.e., the area where the target user is located, can be determined.
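One common way to derive a bearing from "the position of the target user in the image" is a linear mapping through the camera's horizontal field of view. The following is a minimal sketch under that assumption; the parameter names and the small-angle linearization are illustrative, not taken from the disclosure.

```python
# Illustrative sketch: deriving the horizontal bearing of the target
# user from their pixel position and the camera's horizontal FOV.

def bearing_from_image(center_x_px, image_width_px, hfov_deg):
    """Map the user's horizontal pixel position to an angle off the
    camera's optical axis (negative = left, positive = right)."""
    offset = (center_x_px - image_width_px / 2) / (image_width_px / 2)
    return offset * (hfov_deg / 2)

# A user centered at pixel 1440 in a 1920-px-wide frame, with a
# 60-degree horizontal FOV, sits about 15 degrees to the right.
angle = bearing_from_image(1440, 1920, 60)
```

The user's apparent size in the image could similarly be used to estimate distance, completing the relative positional relationship.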
One or more users may be present in the image information acquired by the camera. The processor 230 may identify the identity information of each user in the image information using biometric techniques (e.g., face recognition); based on the identified identity information, take the users meeting a preset condition as target users; then determine the relative positional relationship between the target users and the device based on the image information; and finally determine the audio receiving area based on that relative positional relationship. The preset condition may be, but is not limited to, a judgment condition related to the user's age or to the audio output. For example, the preset condition may be whether the user is an adult, or whether the user belongs to the audience for which the current audio output is intended.
Taking the voice interaction device as a smart speaker as an example, the image information collected by the camera may include family members such as parents, elderly people, and children. After identifying the identity information of each user in the image information, the processor 230 may take the adults (such as the parents and the elderly) as target users and exclude the children, then calculate the relative positional relationship between the target users and the smart speaker, and determine the audio receiving area based on the calculated relationship. In this way, under the action of the audio directional propagation technology, the smart speaker outputs audio only to the area where the adult family members are located.
For another example, after identifying the identity information of each user in the image information, the processor 230 may further determine, according to the content of the audio output, the audience for which the audio output is intended (such as adult males, adult females, or students), take the users whose identities correspond to that audience as target users, exclude the users whose identities do not, calculate the relative positional relationship between the target users and the smart speaker, and determine the audio receiving area based on the calculated relationship. Thus, under the action of the audio directional propagation technology, the smart speaker outputs audio only to users suited to listening to that audio content.
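The "preset condition" filter in the two examples above can be sketched as a simple selection over recognized users. The record fields, the age threshold, and the helper names are illustrative assumptions.

```python
# A minimal sketch, assuming a simple recognized-user record, of the
# preset-condition filter: select adults as target users, exclude children.
from dataclasses import dataclass

@dataclass
class RecognizedUser:
    name: str
    age: int
    bearing_deg: float   # position relative to the device

def select_targets(users, min_age=18):
    """Users meeting the preset condition (here: adults) become targets."""
    return [u for u in users if u.age >= min_age]

family = [
    RecognizedUser("parent", 42, 10.0),
    RecognizedUser("child", 7, -20.0),
    RecognizedUser("grandparent", 70, 30.0),
]
targets = select_targets(family)   # the child is excluded
```

An audience-based condition would swap the age test for a predicate comparing each user's identity against the audience inferred from the audio content.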
The processor 230 may include a DSP chip (digital signal processing chip). The processor 230 may modulate the audio output into a first sound wave and a second sound wave based on nonlinear ultrasonic modulation, the first sound wave and the second sound wave each having a frequency that exceeds the audible frequency range, and the difference between the frequency of the first sound wave and the frequency of the second sound wave falling within the audible frequency range. The audible frequency range refers to the range of sound frequencies that the human ear can hear, i.e., approximately 20 Hz to 20,000 Hz.
The audio output is the audio content directed to the user and may be, but is not limited to, music or voice prompts fed back to the user. The audio output may be an audible sound signal, and the processor 230 may select an appropriate ultrasonic frequency band using a particular algorithm to modulate the audio output onto the first sound wave and the second sound wave, where the first sound wave and the second sound wave are ultrasonic carrier signals.
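The modulation step can be illustrated with plain double-sideband amplitude modulation, a common choice for parametric (directional) loudspeakers. This is a hedged sketch: the disclosure does not specify its modulation scheme, and the sample rate, carrier frequency, and modulation depth below are assumed values.

```python
# Hedged sketch of amplitude-modulating an audible signal onto an
# ultrasonic carrier (simple DSB-AM). All numeric values are assumptions.
import math

def am_modulate(audio_samples, carrier_hz, sample_rate_hz, depth=0.8):
    """Per sample n: (1 + depth * audio[n]) * cos(2*pi*fc*n/fs).
    Audio samples are assumed normalized to [-1, 1]."""
    out = []
    for n, a in enumerate(audio_samples):
        carrier = math.cos(2 * math.pi * carrier_hz * n / sample_rate_hz)
        out.append((1 + depth * a) * carrier)
    return out

# One millisecond of a 1 kHz tone modulated onto a 40 kHz carrier,
# sampled at 192 kHz (high enough to represent the ultrasonic carrier).
fs = 192_000
tone = [math.sin(2 * math.pi * 1_000 * n / fs) for n in range(192)]
modulated = am_modulate(tone, 40_000, fs)
```

In air, the beam's nonlinearity demodulates the envelope, recovering (a distorted copy of) the original audible tone; practical systems add pre-distortion to compensate.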
The speaker 240 is an audio output device of the voice interaction device 200. The processor 230 may control the speaker 240 to emit the first sound wave and the second sound wave in the direction of the audio receiving area, i.e., the area where the target user who is to listen to the audio is located. As an example, the speaker 240 may be a loudspeaker on which an ultrasonic transducer is mounted; after the audible sound signal is modulated onto the ultrasonic carrier signal, the loudspeaker may be algorithmically controlled to form an ultrasonic beam, which is emitted into the air by the ultrasonic transducer.
The voice interaction device 200 may comprise a speaker array of a plurality of speakers 240 with different playback directions. That is, the voice interaction device 200 may include a plurality of audio output means having different playback directions. After modulation is complete, the processor 230 may select a speaker from the array of speakers with a playback direction directed to the audio receiving area and control the speaker to emit the first sound wave and the second sound wave.
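The selection step over a fixed speaker array can be sketched as picking the speaker whose playback direction lies closest to the target bearing. The array geometry and helper names are illustrative assumptions.

```python
# Illustrative sketch: from speakers with known playback directions,
# choose the one closest (in angle) to the audio receiving area.

def select_speaker(speaker_angles_deg, target_angle_deg):
    """Index of the speaker whose playback direction best matches the
    target bearing, using shortest angular distance on the circle."""
    def angular_dist(a, b):
        d = abs(a - b) % 360
        return min(d, 360 - d)
    return min(range(len(speaker_angles_deg)),
               key=lambda i: angular_dist(speaker_angles_deg[i], target_angle_deg))

# Eight speakers spaced 45 degrees apart; a target user at 100 degrees
# is served by the speaker pointing at 90 degrees (index 2).
angles = [i * 45 for i in range(8)]
chosen = select_speaker(angles, 100)
```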
As shown in fig. 3, the speakers 240 may be designed as a matrix speaker module. The matrix speaker module may carry a DSP chip for executing the algorithmic part of the voice interaction method.
The voice interaction device 200 may alternatively comprise only one speaker 240, where the speaker 240 may be provided on a movable structure. After the modulation is completed, the processor 230 may control the speaker 240 to move according to the audio receiving area such that the playback direction of the moved speaker 240 is directed to the audio receiving area.
As shown in fig. 4, taking a voice interaction device that outputs a music signal as an example, the music signal may be modulated onto ultrasonic carrier signals (corresponding to the first sound wave and the second sound wave), and an algorithmically controlled matrix loudspeaker forms an ultrasonic beam that is emitted into the air by the ultrasonic transducer. As ultrasonic waves of different frequencies propagate through the air, the nonlinear acoustic effect of the air causes the signals to interact and self-demodulate, generating new sound waves whose frequencies are the sum (sum frequency) and the difference (difference frequency) of the original ultrasonic frequencies. Through algorithmic correction, the system automatically selects an appropriate ultrasonic frequency band so that the generated difference-frequency sound wave falls within the audible range. These difference-frequency music signals can be heard only at a certain angle and within a certain area, thereby enhancing the user's privacy experience.
Fig. 5 shows a schematic diagram of a voice interaction method according to another embodiment of the present disclosure. The voice interaction method of the present disclosure may be executed by a voice interaction device such as a smart speaker or a smart television.
In this embodiment, an audio signal output by the voice interaction device that is of interest to the target user (such as music) may be sent to the area where the target user is located using the audio directional propagation technology, so that the audio signal is heard only by the target user. An audio signal output by the voice interaction device that is not of interest to the target user (such as an advertisement) may be transmitted to an area away from the target user, so that the audio signal is not heard by the target user.
As an example, the area where the target user is located may be taken as a first audio receiving area; the audio output for the first content is modulated into a first sound wave and a second sound wave, each having a frequency exceeding the audible frequency range, with the difference between their frequencies falling within the audible frequency range; and the first sound wave and the second sound wave are emitted in the direction of the first audio receiving area. The first content is content of interest to the user, such as, but not limited to, audio works including music, novels, and other audio programs.
In response to the output audio switching to the audio output for the second content, an area away from the target user is taken as a second audio receiving area; the audio output for the second content is modulated into a third sound wave and a fourth sound wave, each having a frequency exceeding the audible frequency range, with the difference between their frequencies falling within the audible frequency range; and the third sound wave and the fourth sound wave are emitted in the direction of the second audio receiving area. The second content is content that is not of interest to the user, such as an advertisement.
In response to the output audio switching back to the audio output for the first content, the area where the target user is located may again be taken as the first audio receiving area, and the first sound wave and the second sound wave for the first content may be emitted in the direction of the first audio receiving area in the manner described above.
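The content-aware steering just described can be sketched as a small dispatch function. The content labels and the 180-degree "away" offset are illustrative assumptions; the disclosure only requires that unwanted content be beamed away from the target user.

```python
# A hedged sketch of the content-aware steering logic: content of
# interest is beamed at the target user; other content (e.g. an
# advertisement) is steered away from the user.

def receiving_area(content_type, user_bearing_deg):
    """Beam angle for the current content: toward the user for content
    of interest, directly away from the user otherwise."""
    if content_type == "of_interest":          # e.g. music
        return user_bearing_deg % 360
    return (user_bearing_deg + 180) % 360      # e.g. advertisement

assert receiving_area("of_interest", 30) == 30
assert receiving_area("advertisement", 30) == 210
```

When the advertisement ends and the music resumes, the same function simply steers the beam back to the user's bearing.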
As shown in fig. 5, when the smart speaker outputs a music signal, the music signal may be modulated onto ultrasonic waves f1 and f2 whose frequencies exceed the audible frequency range, and the ultrasonic waves f1 and f2 may be emitted toward the area where the target user is located. The ultrasonic waves f1 and f2 may be configured such that the difference between their frequencies (i.e., f1 − f2) falls within the audible frequency range. Under the nonlinear effect of the air medium, the new sound wave at the difference frequency is an audible sound wave that falls within the area where the target user is located and is heard by the target user.
When the smart speaker outputs an advertisement signal, the advertisement signal may be modulated onto ultrasonic waves f3 and f4 whose frequencies exceed the audible frequency range, and the ultrasonic waves f3 and f4 may be emitted toward an area away from the target user. The ultrasonic waves f3 and f4 may be configured such that the difference between their frequencies (i.e., f3 − f4) falls within the audible frequency range. Under the nonlinear effect of the air medium, the new sound wave at the difference frequency is an audible sound wave that falls in an area away from the target user and cannot be heard by the target user.
Therefore, while listening to audio works such as music and other audio programs with the voice interaction device, the user can enjoy a private listening experience without wearing headphones, and can also be shielded from advertisements inserted during playback of the audio works.
The voice interaction method of the present disclosure can also be realized as a voice interaction apparatus. Fig. 6 shows a block diagram of a voice interaction apparatus according to an exemplary embodiment of the present disclosure. The functional units of the voice interaction apparatus may be realized by hardware, software, or a combination of hardware and software implementing the principles of the present disclosure. Those skilled in the art will appreciate that the functional units depicted in fig. 6 may be combined or divided into sub-units to implement the principles described above. Accordingly, the description herein supports any possible combination, division, or further definition of the functional units described.
The following briefly describes the functional units that the voice interaction apparatus may include and the operations they may perform; for details, reference may be made to the related description above, which is not repeated here.
Referring to fig. 6, the voice interaction apparatus 600 includes a determination module 610, a modulation module 620, and a transmission module 630.
The determining module 610 is configured to determine an audio receiving area. The modulation module 620 is configured to modulate the audio output to a first sound wave and a second sound wave, each having a frequency that exceeds an audible frequency range, and the difference between the frequency of the first sound wave and the frequency of the second sound wave being within the audible frequency range. The transmitting module 630 is configured to transmit the first sound wave and the second sound wave in a direction in which the audio receiving area is located.
The voice interaction device 600 may also include an audio receiving module. The audio receiving module is configured to receive a voice input, and the determining module 610 may determine an audio receiving area according to the voice input. For example, the determination module 610 may calculate a sound source location based on the speech input and determine the audio receiving area based on the sound source location. Alternatively, the determining module 610 may recognize the voice input using a voice recognition technique and determine the audio receiving area based on the semantic recognition result of the voice input. Alternatively, the determination module 610 may also be used to determine an audio output from a speech input.
The voice interaction device 600 may also include an image acquisition module. The image acquisition module may be used to acquire image information of the surrounding environment. The determining module 610 may identify identity information of each user in the image information; based on the identity information of the identified user, taking the user meeting the preset condition as a target user; determining a relative positional relationship between the target user and the device based on the image information; an audio receiving area is determined based on the relative positional relationship.
The voice interaction device 600 may further include a movable audio output device and a controller for moving the audio output device according to the audio receiving area such that the playback direction of the moved audio output device is directed to the audio receiving area.
The voice interaction device 600 may further include a plurality of audio output devices with different playback directions, and a switching device. The switching means may select the audio output means from the plurality of audio output means to emit the first sound wave and the second sound wave with the playback direction directed to the audio receiving area.
As an example, the determining module 610 may take the area where the target user is located as the first audio receiving area; the modulation module 620 may modulate the audio output for the first content into a first sound wave and a second sound wave, each having a frequency that exceeds the audible frequency range, with the difference between their frequencies falling within the audible frequency range; and the transmitting module 630 may transmit the first sound wave and the second sound wave in the direction of the first audio receiving area. In response to the output audio switching to the audio output for the second content, the determining module 610 takes an area away from the target user as a second audio receiving area; the modulation module 620 modulates the audio output for the second content into a third sound wave and a fourth sound wave, each having a frequency that exceeds the audible frequency range, with the difference between their frequencies falling within the audible frequency range; and the transmitting module 630 transmits the third sound wave and the fourth sound wave in the direction of the second audio receiving area.
FIG. 7 illustrates a schematic diagram of a computing device that may be used to implement the voice interaction method described above, according to one embodiment of the present disclosure.
Referring to fig. 7, a computing device 700 includes a memory 710 and a processor 720.
The processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, the processor 720 may include a general-purpose host processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, the processor 720 may be implemented using custom circuitry, for example an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 720 or other modules of the computer. The persistent storage may be a readable and writable storage device, i.e., a non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data required by some or all of the processors at runtime. Furthermore, the memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some embodiments, the memory 710 may include a readable and/or writable removable storage device, such as a compact disc (CD), a digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 710 has stored thereon executable code that, when processed by the processor 720, causes the processor 720 to perform the voice interaction method described above.
The voice interaction method, apparatus and device according to the present disclosure have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above-described method of the present disclosure.
Alternatively, the present disclosure may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) that, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described methods according to the present disclosure.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.