Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
With the audio directional propagation technology, the audio signal emitted by the voice interaction device can be heard by the user only at a certain angle and within a certain area. This prevents other users from being disturbed by sound while the voice interaction service is provided for the target user, and at the same time enhances the privacy of the user's voice interaction experience. Audio directional propagation refers to a new sound source technology in which sound is transmitted through the air as a beam in a given direction; it may be, but is not limited to, an ultrasound-based audio directional propagation technology.
FIG. 1 shows a schematic diagram of a method of voice interaction, according to one embodiment of the present disclosure. The voice interaction method of the present disclosure may be executed by, but is not limited to, a smart speaker or a smart television.
Referring to FIG. 1, taking the voice interaction device being a smart speaker and an ultrasound-based audio directional propagation technology as an example, the smart speaker can transmit, toward the audio receiving area where the target user is located, an ultrasonic wave f1 and an ultrasonic wave f2 whose frequencies both exceed the audible frequency range. The ultrasonic wave f1 and the ultrasonic wave f2 can be configured such that the difference of their frequencies (i.e., f1 - f2) is within the audible frequency range.
Under the nonlinear action of the air medium, the signals interact and self-demodulate, generating new sound waves whose frequencies are the sum (sum frequency) and the difference (difference frequency) of the original ultrasonic frequencies. The new sound wave at the difference frequency is an audible sound wave; it falls within the audio receiving area and can be heard by the target user, while other users outside the audio receiving area cannot hear it.
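The frequency arithmetic behind this difference-frequency principle can be sketched as follows; the carrier values (40 kHz and 41 kHz) are illustrative assumptions, not taken from the disclosure.

```python
# Toy model of the difference-frequency (parametric array) principle:
# two inaudible ultrasonic tones whose difference lands in the audible band.

AUDIBLE_MIN_HZ = 20.0
AUDIBLE_MAX_HZ = 20_000.0

def difference_tone(f1_hz: float, f2_hz: float) -> float:
    """Frequency of the new sound wave produced by air nonlinearity."""
    return abs(f1_hz - f2_hz)

def is_audible(freq_hz: float) -> bool:
    return AUDIBLE_MIN_HZ <= freq_hz <= AUDIBLE_MAX_HZ

f1, f2 = 41_000.0, 40_000.0          # both carriers above the audible range
assert not is_audible(f1) and not is_audible(f2)
assert is_audible(difference_tone(f1, f2))   # 1 kHz difference tone
```

The sum-frequency component (here 81 kHz) remains ultrasonic and inaudible, which is why only the difference tone is perceived.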
FIG. 2 shows a schematic diagram of the structure of a voice interaction device, according to one embodiment of the present disclosure. The voice interaction device 200 may be, but is not limited to, a smart speaker, a smart television, a smart phone, a vehicle, a traffic signal lamp, or another device having a sound output function.
Referring to fig. 2, the voice interaction device 200 may include a microphone 210 and/or a camera 220, a processor 230, and a speaker 240.
The microphone 210, i.e., a sound pick-up, is used to collect audio input, which may include voice input from a speaker and ambient noise. The camera 220 is used to capture images of the surrounding environment to obtain image information, and the image information captured by the camera 220 may include one or more users.
The processor 230 may determine the audio receiving area according to the audio input collected by the microphone 210, or may determine the audio receiving area according to the image information collected by the camera 220.
1. Determining an audio receiving area based on audio input
The audio receiving area is the area where the target user is located. The target user refers to the user who needs to listen to the audio output emitted by the voice interaction device.
As an example of the present disclosure, the processor 230 may calculate the sound source position according to the audio input (e.g., voice input) collected by the microphone 210. For example, the microphone 210 may be a microphone array composed of a plurality of microphones, and the sound source position may be calculated from the differences in signal strength of the voice input collected by the plurality of microphones. The audio receiving area is then determined based on the sound source position; for example, the area surrounding the sound source position may be taken as the audio receiving area. In this case, the target user may refer to the speaker.
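The signal-strength comparison described above can be sketched as an amplitude-weighted direction estimate; the microphone layout and level readings below are hypothetical, and production systems typically also use time-difference-of-arrival methods.

```python
import math

def estimate_direction(mic_angles_deg, rms_levels):
    """Estimate the source bearing as the amplitude-weighted mean of the
    microphone directions (a crude proxy for comparing signal strength
    across the array)."""
    x = sum(w * math.cos(math.radians(a))
            for a, w in zip(mic_angles_deg, rms_levels))
    y = sum(w * math.sin(math.radians(a))
            for a, w in zip(mic_angles_deg, rms_levels))
    return math.degrees(math.atan2(y, x)) % 360

# Four mics facing the cardinal directions; the loudest reading (0.9)
# comes from the microphone facing 90 degrees.
bearing = estimate_direction([0, 90, 180, 270], [0.2, 0.9, 0.2, 0.1])
```

The estimated bearing can then be used directly as the direction of the audio receiving area.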
As another example of the present disclosure, the user may issue a voice input indicating the audio receiving area. The processor 230 may perform semantic recognition on the voice input collected by the microphone 210 and determine the audio receiving area according to the semantic recognition result. The audio receiving area can thus be specified by the user, i.e., the target user can be designated by the speaker, so that the user can achieve a "whisper in the ear" voice interaction effect with the voice interaction device. For example, if user A wishes to share the song "Desert Camel" only with user B in the same scene, user A may issue a voice instruction such as "play the song 'Desert Camel' to the front left" according to the relative positional relationship between user B and the voice interaction device. After receiving the voice instruction, the voice interaction device may determine from the semantic recognition result that the audio receiving area is to the front left of the device, and may then transmit the corresponding sound wave signal to the front left using the audio directional propagation technology, so that the song "Desert Camel" can be heard only by user B.
Optionally, the processor 230 may also determine an audio output based on an audio input (e.g., a voice input) captured by the microphone. For example, the processor 230 may recognize a speech input and determine an audio output suitable for feedback to the user based on the recognition result. For another example, the processor 230 may also upload audio input collected by the microphone to the server, and receive audio output fed back by the server.
To raise the bar for the private voice interaction experience and further enhance the user's voice interaction experience, the processor 230 may also determine whether the current user has voice interaction authority based on the voice input in the audio input: it may identify the current user's identity from the voice input and determine, according to the identification result, whether the current user is a registered user. If the user is a registered user, the voice interaction method of the present disclosure is executed to provide a private voice interaction experience; if not, no processing is performed, or voice that can be heard in all directions is output according to the normal voice interaction mode.
2. Determining an audio receiving area based on image information
The image information collected by the camera may include the target user. The processor 230 may determine the relative positional relationship between the target user and the device based on the image information; for example, it may be determined from the shooting parameters of the camera together with the position and size of the target user in the image. Once the relative positional relationship between the target user and the device is obtained, the audio receiving area, i.e., the area where the target user is located, can be determined.
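As a rough sketch of how a relative position might be derived from shooting parameters and the user's position and size in the frame (the field-of-view values and the 1.7 m average-height constant are assumptions for illustration, not from the disclosure):

```python
import math

def bearing_from_image(cx_px: float, image_w_px: int, hfov_deg: float) -> float:
    """Horizontal bearing of a detected user relative to the camera axis,
    from the bounding-box centre and the camera's horizontal field of
    view (linear pinhole approximation)."""
    offset = (cx_px - image_w_px / 2) / (image_w_px / 2)   # -1 .. 1
    return offset * (hfov_deg / 2)

def distance_from_height(box_h_px: float, image_h_px: int, vfov_deg: float,
                         person_h_m: float = 1.7) -> float:
    """Rough range from apparent height; the 1.7 m average standing
    height is an assumed constant."""
    angular_h = (box_h_px / image_h_px) * math.radians(vfov_deg)
    return person_h_m / (2 * math.tan(angular_h / 2))

# A user centred at pixel 960 of a 1280-px-wide frame with a 60-degree
# HFOV sits 15 degrees to the right of the camera axis.
bearing = bearing_from_image(960, 1280, 60)
```

Bearing and range together give the relative positional relationship from which the audio receiving area is determined.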
One or more users may be present in the image information collected by the camera. The processor 230 may identify the identity information of each user in the image information using biometric identification techniques (e.g., face recognition); based on the identified identity information, users meeting a preset condition are taken as target users; the relative positional relationship between the target users and the device is then determined based on the image information; and the audio receiving area is determined from that relative positional relationship. The preset condition may be, but is not limited to, a condition related to the user's age or a condition related to the audio output. For example, the preset condition may be whether the user is an adult, whether the user belongs to the audience for which the current audio output is intended, and so on.
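The filtering step above can be sketched as follows; the `User` record, the age threshold, and the names are illustrative assumptions, not from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int

def select_targets(users, min_age=18):
    """Apply the preset condition (here: an age check) to pick the
    target users; the threshold is illustrative."""
    return [u for u in users if u.age >= min_age]

household = [User("parent", 42), User("grandparent", 70), User("child", 8)]
targets = select_targets(household)   # the child is excluded
```

An audience-matching condition would follow the same shape, with the predicate comparing each user's identity against the audience for the current audio output.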
Taking the voice interaction device being a smart speaker as an example, the image information collected by the camera may include family members such as parents, elderly people, and children. After identifying the identity information of each user in the image information, the processor 230 may take the adults (such as the parents and elderly people) as target users and exclude the children, then calculate the relative positional relationship between the target users and the smart speaker, and determine the audio receiving area based on the calculated relative positional relationship. Thus, under the action of the audio directional propagation technology, the smart speaker outputs audio only toward the area where the adult family members are located.
For another example, after identifying the identity information of each user in the image information, the processor 230 may further determine, according to the content of the audio output, the audience for which the audio output is intended (such as adult males, adult females, students, and the like), take users whose identities match the audience as target users and exclude users whose identities do not, then calculate the relative positional relationship between the target users and the smart speaker, and determine the audio receiving area based on the calculated relative positional relationship. Thus, under the action of the audio directional propagation technology, the smart speaker outputs audio only toward users suitable for listening to that audio content.
The processor 230 may include a DSP chip (digital signal processing chip). Based on nonlinear ultrasonic modulation, the processor 230 may modulate the audio output into a first sound wave and a second sound wave, the frequencies of the first sound wave and the second sound wave each exceeding the audible frequency range, and the difference of their frequencies being within the audible frequency range. The audible frequency range refers to the frequency range of sound that can be heard by the human ear, i.e., approximately 20 Hz to 20,000 Hz.
The audio output is the audio content output to the user, and may be, but is not limited to, music or a voice reply fed back to the user. The audio output may be an audible sound signal, and the processor 230 may use a particular algorithm to select a suitable ultrasonic band and modulate the audio output onto the first sound wave and the second sound wave, which serve as ultrasonic carrier signals.
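A toy single-tone model of this modulation step — representing an audible frequency as a pair of ultrasonic tones — might look like the following. The sample rate and the 40 kHz carrier are assumptions, and a real DSP implementation would operate on the full audio spectrum rather than one tone.

```python
import math

FS_HZ = 192_000        # assumed sample rate, high enough for ultrasound
CARRIER_HZ = 40_000.0  # assumed ultrasonic carrier

def modulate_tone(audio_hz: float, duration_s: float):
    """Represent one audible tone as a pair of ultrasonic tones (the
    'first sound wave' and 'second sound wave') whose difference
    frequency equals the tone."""
    n = int(FS_HZ * duration_s)
    f1 = CARRIER_HZ
    f2 = CARRIER_HZ + audio_hz          # f2 - f1 lands in the audible band
    wave1 = [math.cos(2 * math.pi * f1 * t / FS_HZ) for t in range(n)]
    wave2 = [math.cos(2 * math.pi * f2 * t / FS_HZ) for t in range(n)]
    return f1, f2, wave1, wave2

f1, f2, w1, w2 = modulate_tone(1_000.0, 0.001)
# f2 - f1 is 1 kHz, inside the 20 Hz - 20 kHz audible range.
```

Air nonlinearity then demodulates the pair into the 1 kHz difference tone in the target area, as described above.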
The speaker 240 is the audio output device of the voice interaction device 200. The processor 230 may control the speaker 240 to emit the first sound wave and the second sound wave toward the audio receiving area, i.e., the area where the target user expected to listen to the audio is located. As an example, the speaker 240 may be a loudspeaker carrying an ultrasonic transducer; after the audible sound signal is modulated onto the ultrasonic carrier signals, the loudspeaker, controlled by an algorithm, forms an ultrasonic beam that is emitted into the air by the ultrasonic transducer.
The voice interaction device 200 may include a speaker array consisting of a plurality of speakers 240 with different playback directions. That is, the voice interaction device 200 may include a plurality of audio output devices having different playback directions. After the modulation is completed, the processor 230 may select, from the speaker array, a speaker whose playback direction points to the audio receiving area, and control that speaker to emit the first sound wave and the second sound wave.
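The selection step can be sketched as picking the speaker with the smallest angular distance to the target bearing; the array layout and bearings below are hypothetical.

```python
def pick_speaker(speaker_bearings_deg, target_bearing_deg):
    """Index of the speaker whose playback direction is angularly
    closest to the audio receiving area."""
    def ang_dist(a, b):
        # Shortest angular distance on a circle, in degrees.
        d = abs(a - b) % 360
        return min(d, 360 - d)
    return min(range(len(speaker_bearings_deg)),
               key=lambda i: ang_dist(speaker_bearings_deg[i],
                                      target_bearing_deg))

# Four speakers facing 0, 90, 180 and 270 degrees; a target area at
# 100 degrees selects the speaker facing 90 degrees (index 1).
idx = pick_speaker([0, 90, 180, 270], 100)
```

The same angular-distance test also covers the movable single-speaker variant, where the result is a rotation target rather than an array index.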
As shown in FIG. 3, the speakers 240 may be designed as a matrix speaker module. The matrix speaker module may carry a DSP chip for executing the algorithmic part of the voice interaction method.
The voice interaction device 200 may also include only one speaker 240, which may be provided as a movable structure. After the modulation is completed, the processor 230 may control the speaker 240 to move according to the audio receiving area, so that the playback direction of the moved speaker 240 points to the audio receiving area.
As shown in FIG. 4, taking the voice interaction device outputting a music signal as an example, the music signal may be modulated onto ultrasonic carrier signals (corresponding to the first sound wave and the second sound wave mentioned above). An ultrasonic beam is formed by the matrix speaker under algorithmic control and emitted into the air by the ultrasonic transducer. As ultrasonic waves of different frequencies propagate through the air, the nonlinear acoustic effect of the air causes the signals to interact and self-demodulate, generating new sound waves whose frequencies are the sum (sum frequency) and the difference (difference frequency) of the original ultrasonic frequencies. Through algorithmic correction, the system automatically selects a suitable ultrasonic frequency band so that the generated difference-frequency sound waves fall within the audible range. These difference-frequency music signals can be heard only at a certain angle and within a certain area, thereby enhancing the privacy of the user's experience.
FIG. 5 shows a schematic diagram of a voice interaction method according to another embodiment of the present disclosure. The voice interaction method of the present disclosure may be executed by a voice interaction device such as a smart speaker or a smart television.
In this embodiment, an audio signal output by the voice interaction device in which the target user is interested (e.g., music) may be transmitted, using the audio directional propagation technology, to the area where the target user is located, so that it is heard only by the target user. An audio signal in which the target user is not interested (e.g., an advertisement) may instead be transmitted, using the same technology, to an area away from the target user, so that it is not heard by the target user.
As an example, the area where the target user is located may be taken as a first audio receiving area; the audio output for first content is modulated into a first sound wave and a second sound wave, the frequencies of which both exceed the audible frequency range while their difference is within the audible frequency range; and the first sound wave and the second sound wave are transmitted toward the first audio receiving area. The first content is content of interest to the user, such as audio works which may be, but are not limited to, music, novels, storytelling, and the like.
In response to the output audio being switched to an audio output for second content, an area away from the target user is taken as a second audio receiving area; the audio output for the second content is modulated into a third sound wave and a fourth sound wave, the frequencies of which both exceed the audible frequency range while their difference is within the audible frequency range; and the third sound wave and the fourth sound wave are transmitted toward the second audio receiving area. The second content is content that is not of interest to the user, such as an advertisement.
In response to the output audio being switched to the audio output for the first content again, the area where the target user is located may be set as the first audio receiving area again, and the first sound wave and the second sound wave for the first content may be transmitted in the above-described manner to the direction where the first audio receiving area is located.
As shown in FIG. 5, when the smart speaker outputs a music signal, it can modulate the music signal onto an ultrasonic wave f1 and an ultrasonic wave f2 whose frequencies exceed the audible frequency range, and transmit the ultrasonic waves f1 and f2 toward the area where the target user is located. The ultrasonic waves f1 and f2 can be configured such that the difference of their frequencies (e.g., f1 - f2) is within the audible frequency range. Under the nonlinear action of the air medium, the new sound wave at the difference frequency is an audible sound wave; it falls within the area where the target user is located and can be heard by the target user.
When the smart speaker outputs an advertisement signal, it can modulate the advertisement signal onto an ultrasonic wave f3 and an ultrasonic wave f4 whose frequencies exceed the audible frequency range, and transmit the ultrasonic waves f3 and f4 toward an area away from the target user. The ultrasonic waves f3 and f4 can be configured such that the difference of their frequencies (e.g., f3 - f4) is within the audible frequency range. Under the nonlinear action of the air medium, the new sound wave at the difference frequency is an audible sound wave; it falls in the area away from the target user and cannot be heard by the target user.
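The content-dependent routing described above can be sketched as a simple bearing selection; the content labels and the 180-degree offset for the second audio receiving area are illustrative choices, not taken from the disclosure.

```python
def receiving_area(content_type: str, user_bearing_deg: float) -> float:
    """Aim content of interest toward the target user's bearing, and
    other content (e.g. advertisements) toward the opposite bearing."""
    if content_type == "advertisement":
        return (user_bearing_deg + 180) % 360   # second audio receiving area
    return user_bearing_deg                     # first audio receiving area

# Music goes to the user at 30 degrees; an advertisement is steered
# to the opposite side.
music_area = receiving_area("music", 30)
ad_area = receiving_area("advertisement", 30)
```

On each switch of the output audio, the selected bearing is fed to the modulation and transmission steps described above.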
Therefore, when a user listens to audio works such as music, novels, and commentary with the voice interaction device, the user can enjoy a private listening experience without wearing earphones, and can also be shielded from advertisements interspersed in the playback of the audio works.
The voice interaction method can also be implemented as a voice interaction apparatus. FIG. 6 shows a block diagram of a voice interaction apparatus according to an exemplary embodiment of the present disclosure. The functional units of the voice interaction apparatus can be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present disclosure. It will be appreciated by those skilled in the art that the functional units described in FIG. 6 may be combined or divided into sub-units to implement those principles. Thus, the description herein may support any possible combination, division, or further definition of the functional units described herein.
In the following, functional units that the voice interaction apparatus can have and operations that each functional unit can perform are briefly described, and for details related thereto, reference may be made to the above-mentioned related description, which is not described herein again.
Referring to fig. 6, the voice interacting device 600 includes a determining module 610, a modulating module 620, and a transmitting module 630.
The determining module 610 is used for determining an audio receiving area. The modulation module 620 is configured to modulate the audio output to a first sound wave and a second sound wave, the first sound wave and the second sound wave each having a frequency exceeding an audible frequency range, and a difference between the frequency of the first sound wave and the frequency of the second sound wave being within the audible frequency range. The transmitting module 630 is configured to transmit the first sound wave and the second sound wave to a direction in which the audio receiving area is located.
The voice interaction device 600 may also include an audio receiving module. The audio receiving module is configured to receive a voice input, and the determining module 610 may determine an audio receiving area according to the voice input. For example, the determining module 610 may calculate a sound source position from the voice input, and determine an audio receiving area based on the sound source position. Alternatively, the determining module 610 may recognize the voice input by using a voice recognition technology, and determine the audio receiving area according to a semantic recognition result of the voice input. Optionally, the determining module 610 may also be configured to determine an audio output based on the speech input.
The voice interaction device 600 may also include an image capture module. The image acquisition module may be configured to acquire image information of the surrounding environment. The determining module 610 may identify identity information of each user in the image information; based on the identified identity information of the user, taking the user meeting the preset conditions as a target user; determining a relative positional relationship between the target user and the device based on the image information; the audio receiving area is determined based on the relative positional relationship.
The voice interaction apparatus 600 may further include a movable audio output device, as well as a controller for moving the audio output device according to the audio receiving area, such that the playback direction of the moved audio output device points to the audio receiving area.
The voice interaction apparatus 600 may further include a plurality of audio output apparatuses with different playback directions, and a switching apparatus. The switching means may select an audio output device, of which a sound reproduction direction is directed to the audio receiving area, from among the plurality of audio output devices, to emit the first sound wave and the second sound wave.
As an example, the determining module 610 may take the area where the target user is located as a first audio receiving area; the modulation module 620 may modulate the audio output for first content into a first sound wave and a second sound wave, the frequencies of which both exceed the audible frequency range while their difference is within the audible frequency range; and the transmitting module 630 may transmit the first sound wave and the second sound wave toward the first audio receiving area. In response to the output audio switching to an audio output for second content, the determining module 610 takes an area away from the target user as a second audio receiving area; the modulation module 620 modulates the audio output for the second content into a third sound wave and a fourth sound wave, the frequencies of which both exceed the audible frequency range while their difference is within the audible frequency range; and the transmitting module 630 transmits the third sound wave and the fourth sound wave toward the second audio receiving area.
Fig. 7 shows a schematic structural diagram of a computing device that can be used to implement the voice interaction method according to an embodiment of the present disclosure.
Referring to fig. 7, computing device 700 includes memory 710 and processor 720.
Processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, processor 720 may include a general-purpose host processor and one or more special purpose coprocessors such as a Graphics Processor (GPU), Digital Signal Processor (DSP), or the like. In some embodiments, processor 720 may be implemented using custom circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 720 or other modules of the computer. The permanent storage may be a read-write storage device, and may be a non-volatile storage device that does not lose its stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the permanent storage. In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data needed by some or all of the processors at runtime. In addition, the memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some embodiments, the memory 710 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, and the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 710 has stored thereon executable code that, when processed by the processor 720, causes the processor 720 to perform the voice interaction methods described above.
The voice interaction method, apparatus and device according to the present disclosure have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the above-mentioned steps defined in the above-mentioned method of the present disclosure.
Alternatively, the present disclosure may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the various steps of the above-described method according to the present disclosure.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.