WO2020063752A1 - Far-field sound pickup device, and method for collecting human voice signals in a far-field sound pickup device


Info

Publication number
WO2020063752A1
WO2020063752A1 · PCT/CN2019/108166 · CN2019108166W
Authority
WO
WIPO (PCT)
Prior art keywords
signal
unit
microphone
played
voice
Prior art date
Application number
PCT/CN2019/108166
Other languages
English (en)
French (fr)
Inventor
郑脊萌
于蒙
苏丹
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP19866344.5A priority Critical patent/EP3860144A4/en
Publication of WO2020063752A1 publication Critical patent/WO2020063752A1/zh
Priority to US17/032,278 priority patent/US11871176B2/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 9/00 Arrangements for interconnection not involving centralised switching
    • H04M 9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M 9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic, using echo cancellers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 Arrangements for obtaining desired directional characteristic only
    • H04R 1/40 Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R 1/406 Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers, microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/005 Circuits for combining the signals of two or more microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/12 Circuits for distributing signals to two or more loudspeakers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2430/00 Signal processing covered by H04R, not provided for in its groups

Definitions

  • the present application relates to the field of electronic equipment, and in particular, to a far-field pickup device, a method and device for collecting a human voice signal in a far-field pickup device, an electronic device, and a storage medium.
  • Artificial intelligence is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, both at the hardware level and at the software level.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation / interaction systems, and mechatronics.
  • Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning / deep learning.
  • Key speech technologies include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a future direction of human-computer interaction, and speech has become one of the most promising modes of human-computer interaction.
  • ASR: automatic speech recognition
  • TTS: text-to-speech synthesis
  • The smart speaker can recognize voice instructions issued by the user during playback, such as "turn up the volume", and then act on them, for example by increasing the playback volume.
  • the microphone can pick up sound signals in the environment, including user voice, interference noise in the environment, and echoes of sound signals played by the smart speaker.
  • the sound signal collected by the microphone is converted into digitized information and sent to the voice signal pre-processing module.
  • The voice signal pre-processing module mainly has two tasks. First, it uses an echo cancellation algorithm to remove or reduce the echo, picked up by the microphone, of the sound signal played by the smart speaker itself: when the playback signal source generates a speech signal to be played, the pre-processing module extracts an echo reference signal from the generated signal and uses it to cancel the echo component in the mixed microphone signal.
  • Second, it reduces the noise in the environment by using noise-suppression algorithms such as beamforming.
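As a concrete illustration of the first task, the sketch below cancels an echo with a normalized-LMS (NLMS) adaptive filter. The patent does not name a specific echo-cancellation algorithm, so NLMS, the filter length, and the step size `mu` here are illustrative assumptions, not the claimed method.

```python
import numpy as np

def nlms_echo_cancel(mic, ref, filter_len=64, mu=0.5, eps=1e-8):
    """Remove the echo of the played signal `ref` from the microphone
    signal `mic` with a normalized-LMS adaptive filter, returning the
    residual (user voice plus ambient noise)."""
    w = np.zeros(filter_len)                # adaptive filter taps
    out = np.zeros(len(mic))
    for n in range(filter_len, len(mic)):
        x = ref[n - filter_len:n][::-1]     # most recent reference samples
        e = mic[n] - w @ x                  # mic minus estimated echo
        w += mu * e * x / (x @ x + eps)     # NLMS tap update
        out[n] = e
    return out
```

With the echo reference time-aligned to the microphone signal, the filter converges to the acoustic echo path and the residual approximates the user's voice plus ambient noise.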
  • speakers and microphones are often integrated on a main body.
  • Because the distance between the microphone and the speaker is very small, the echo of the sound signal played by the smart speaker itself, as received by the microphone, easily clips and saturates, which significantly reduces the performance of the echo cancellation algorithm.
  • Moreover, because the speaker and microphone are integrated on one main body, the user is usually far away from the microphone.
  • When the interference noise in the environment or the reverberation in the room is large, the signal-to-noise ratio at the microphone input is lower, so noise-reduction algorithms such as beamforming provide less gain and speech recognition performance degrades. Even when the reverberation in the room is small, the high-frequency components of the user's speech are lost over the long propagation path, which also degrades speech recognition performance.
  • To address this, a separated far-field pickup scheme has emerged: the microphone is arranged separately from the main body where the speaker is located, and the sound signal picked up by the microphone is sent wirelessly or by other means to the main body for voice signal pre-processing. Because the distance between the microphone and the speaker becomes larger, the echo of the sound signal played by the smart speaker itself, as received by the microphone, no longer clips or saturates. Moreover, the user is closer to the microphone, which can improve speech recognition performance. However, since the distance between the microphone and the host body is variable, the sound signal of the microphone and the echo reference signal received by the host body cannot be synchronized, which degrades the performance of the echo cancellation algorithm. In addition, existing far-field pickup equipment consumes a large amount of power.
  • The embodiments of the present application provide a far-field pickup device, a method and an apparatus for collecting human voice signals in a far-field pickup device, an electronic device, and a storage medium, to solve the problem that the sound signal of the microphone and the echo reference signal cannot be synchronized, thereby improving speech recognition performance.
  • A far-field pickup device is provided, which includes a device main body and a microphone pickup unit that are physically separated.
  • The microphone pickup unit collects the user voice and the echo, after spatial propagation, of the sound signal played by the device main body, digitally converts the collected user voice and echo, and sends the result back to the device main body.
  • the device body includes:
  • a playback signal source, for generating a sound signal to be played;
  • a synchronization signal generator, for generating a synchronization signal that is synchronized with the sound signal to be played and occupies a second frequency band different from the first frequency band where the to-be-played sound signal is located;
  • a delay determining unit, configured to determine the time delay between the second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal;
  • an echo cancellation unit, configured to perform echo cancellation on the signal sent back by the microphone pickup unit, using the to-be-played sound signal delayed by the determined time delay as the reference, to obtain the collected human voice signal.
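The delay determining unit can be sketched as follows. The patent does not specify the filtering or matching method, so this non-authoritative illustration isolates the second frequency band with a windowed-sinc bandpass FIR and takes the lag of the peak cross-correlation with the generated synchronization signal; the function name, taps count, and band values are assumptions.

```python
import numpy as np

def estimate_delay(returned, sync, fs, band, numtaps=101):
    """Estimate the delay (in samples) between the generated
    synchronization signal `sync` and its copy inside the signal
    `returned` by the microphone pickup unit.

    The second frequency band `band = (lo, hi)`, in Hz, is isolated
    with a linear-phase windowed-sinc bandpass FIR; the delay is the
    lag of the peak cross-correlation with `sync`."""
    lo, hi = band
    t = np.arange(numtaps) - (numtaps - 1) / 2
    def lowpass(fc):
        # windowed-sinc lowpass prototype with cutoff fc
        return 2 * fc / fs * np.sinc(2 * fc / fs * t) * np.hamming(numtaps)
    bp = lowpass(hi) - lowpass(lo)             # bandpass = difference of lowpasses
    filtered = np.convolve(returned, bp, mode="same")
    corr = np.correlate(filtered, sync, mode="full")
    return int(np.argmax(corr)) - (len(sync) - 1)
```

For example, with audible playback kept below 16 kHz, a near-ultrasonic chirp in an 18-20 kHz second band could serve as the synchronization signal; the specific band and waveform are assumptions, since the patent only requires that the two bands differ.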
  • the far-field pickup device includes a device main body and a microphone pickup unit with separated components.
  • The device main body generates a synchronization signal that is synchronized with the sound signal to be played and occupies a second frequency band different from the first frequency band where the to-be-played sound signal is located.
  • Based on the signal sent back by the microphone pickup unit, the device main body performs echo cancellation using the to-be-played sound signal delayed by the determined time delay, to obtain the collected human voice signal.
  • a device for collecting human voice signals in a far-field pickup device including:
  • a synchronization signal generating module configured to generate a synchronization signal synchronized with a sound signal to be played and occupying a second frequency band different from a first frequency band where the sound signal to be played is located;
  • a playing module configured to play the synchronization signal together with the sound signal to be played
  • a receiving module, configured to receive the user voice collected by the microphone pickup unit and the echo, after spatial propagation, of the sound signal played by the device main body, and to digitally convert the collected user voice and echo;
  • a determining module, configured to determine the time delay between the second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal;
  • a signal acquisition module, configured to perform echo cancellation on the signal sent back by the microphone pickup unit, using the to-be-played sound signal delayed by the determined time delay, to obtain the collected human voice signal.
  • An electronic device is provided, including at least one processing unit and at least one storage unit, wherein the storage unit stores computer-readable program instructions which, when executed by the processing unit, cause the processing unit to perform the steps of the foregoing method.
  • A computer-readable storage medium is provided, which stores computer-readable program instructions executable by an electronic device; when the computer-readable program instructions are run on the electronic device, the electronic device performs the steps of the foregoing method.
  • the synchronization signal generator generates a synchronization signal synchronized with the sound signal to be played and occupying a second frequency band different from the first frequency band where the sound signal to be played is located.
  • The synchronization signal is played together with the sound signal to be played. Because it occupies a frequency band different from that of the to-be-played sound signal, after the microphone pickup unit collects the user's voice and the sound played by the device main body, digitally converts them, and sends them back, the device main body can easily filter out the second-frequency-band component and compare its timing with the generated synchronization signal to determine the time delay.
  • This time delay is also the delay from the moment the sound signal is played until the device main body receives, from the microphone pickup unit, the echo of that played signal.
  • The to-be-played sound signal is then delayed by the determined time delay and used as the reference for echo cancellation, which solves the problem that the microphone signal and the echo reference signal cannot be synchronized and improves speech recognition performance.
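To see why the determined time delay matters, the toy example below (all signals synthetic and hypothetical; a one-tap least-squares canceller stands in for the unspecified echo-cancellation algorithm) compares cancellation with a reference delayed by the determined delay against an unsynchronized reference.

```python
import numpy as np

def cancel_with_reference(mic, ref):
    """One-tap least-squares echo canceller: subtract the best scaled
    copy of `ref` from `mic` and return the residual."""
    gain = (mic @ ref) / (ref @ ref + 1e-12)
    return mic - gain * ref

rng = np.random.default_rng(1)
playback = rng.standard_normal(2000)    # to-be-played sound signal
delay = 50                              # round-trip delay (assumed known here)

mic = np.zeros(2000)
mic[delay:] = 0.7 * playback[:-delay]   # echo picked up by the microphone
mic += 0.1 * rng.standard_normal(2000)  # user voice, modeled as a weak signal

aligned = np.zeros(2000)
aligned[delay:] = playback[:-delay]     # reference delayed by the determined delay

resid_aligned = cancel_with_reference(mic, aligned)
resid_raw = cancel_with_reference(mic, playback)  # unsynchronized reference
# the aligned reference cancels the echo; the raw one barely helps
```

The residual with the delayed reference is dominated by the weak voice component, while the unsynchronized reference is nearly uncorrelated with the echo and leaves it almost untouched, which is exactly the degradation the synchronization signal is designed to prevent.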
  • FIG. 1A is a schematic diagram of a scenario where a far-field pickup device is applied to a smart speaker according to an embodiment of the present application.
  • FIG. 1B is a schematic diagram of a scenario where a far-field pickup device is applied to a smart TV according to an embodiment of the present application.
  • FIG. 1C is a schematic diagram of a scenario where a far-field pickup device is applied to voice-controlled intelligent navigation according to an embodiment of the present application.
  • FIG. 1D is a schematic diagram of a scenario where a far-field pickup device is applied to a KTV music playback system according to an embodiment of the present application.
  • FIG. 2A is a schematic diagram of a deployment layout of a far-field sound pickup device according to an embodiment of the present application.
  • FIG. 2B is a schematic diagram of a deployment layout of a far-field sound pickup device according to another embodiment of the present application.
  • FIG. 2C is a schematic diagram of a deployment layout of a far-field pickup device according to another embodiment of the present application.
  • FIG. 2D is a schematic diagram of a deployment layout of a far-field pickup device according to another embodiment of the present application.
  • FIG. 3A is a schematic architecture diagram of a system in which a far-field pickup device is applied to a smart speaker according to an embodiment of the present application.
  • FIG. 3B is a schematic architecture diagram of a system in which a far-field pickup device is applied to a smart TV according to an embodiment of the present application.
  • FIG. 3C is a schematic architecture diagram of a system in which a far-field pickup device is applied to voice-controlled intelligent navigation according to an embodiment of the present application.
  • FIG. 3D is a schematic architecture diagram of a system in which a far-field pickup device is applied to a KTV music playback system according to an embodiment of the present application.
  • FIG. 4A is a schematic diagram of a system architecture corresponding to the deployment layout of FIG. 2A according to an embodiment of the present application.
  • FIG. 4B is a schematic diagram of a system architecture corresponding to the deployment layout of FIG. 2A according to another embodiment of the present application.
  • FIG. 4C is a schematic diagram of a system architecture corresponding to the deployment layout of FIG. 2B according to an embodiment of the present application.
  • FIG. 4D is a schematic diagram of a system architecture corresponding to the deployment layout of FIG. 2C according to an embodiment of the present application.
  • FIG. 4E is a schematic diagram of a system architecture corresponding to the deployment layout of FIG. 2D according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a far-field sound pickup device according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a far-field pickup device according to another embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a far-field pickup device according to another embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a far-field pickup device according to another embodiment of the present application.
  • FIG. 9 is a flowchart of a method for collecting a human voice signal in a far-field pickup device according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of an apparatus for collecting a human voice signal in a far-field pickup device according to an embodiment of the present application.
  • FIG. 11 is a schematic diagram of an electronic device for implementing a method for collecting a human voice signal in a far-field sound pickup device according to an embodiment of the present application.
  • FIGS. 1A-1D are schematic diagrams of four application scenarios of a far-field pickup device according to an embodiment of the present application.
  • Although FIGS. 1A-1D are used as examples to illustrate four application scenarios of the far-field pickup device according to the embodiments of the present application, the embodiments of the present application are obviously not limited to these four scenarios.
  • Those skilled in the art can benefit from the teachings of this application, and apply the far-field pickup device according to the embodiments of this application to various other scenarios.
  • the far-field pickup device refers to a device in which the microphone is separately arranged from the main body where the speaker is located.
  • the sound signal picked up by the microphone is sent to the host by wireless or other methods for pre-processing of the voice signal.
  • The advantage of the far-field pickup device is that the distance between the microphone and the speaker is relatively large, so the echo of the sound signal played by the smart speaker itself, as received by the microphone, does not clip or saturate. Moreover, the user is closer to the microphone, which improves speech recognition performance.
  • FIG. 1A illustrates a scenario in which a far-field sound pickup device is applied to a smart speaker according to an embodiment of the present application.
  • The smart speaker can recognize voice instructions issued by the user during playback, such as "turn up the volume", and then act on them, for example by increasing the playback volume.
  • That is, while playing sound signals, the smart speaker can recognize human voices within the picked-up sound and act based on them.
  • The far-field pickup device 1 includes a device main body 11 and a microphone pickup unit 12 that are physically separated, where the device main body 11 has a speaker 1101 for playing the sound signal to be played.
  • The microphone pickup unit 12 may include a plurality of microphone units 1201. In FIG. 1A, each microphone unit 1201 is shown as a dot, and the microphone pickup unit 12 is a dot matrix composed of these dots.
  • the microphone unit 1201 picks up sound signals in the environment, including user voice, interference noise in the environment, and echoes of sound signals played by the smart speaker.
  • the sound signal picked up by each microphone unit 1201 is converted into a digital signal and sent to a processing device (not shown in the figure) in the device body 11.
  • The processing device removes, from the received sound signal, the echo of the sound played by the smart speaker and the interference noise in the environment, obtains the user's voice, and generates control instructions based on it, such as increasing or decreasing the speaker volume, playing a certain song, and so on.
  • FIG. 1B illustrates a scenario in which a far-field sound pickup device is applied to a smart TV according to an embodiment of the present application.
  • the smart TV can recognize voice commands issued by the user during playback, such as "switch to channel XX", and then follow the user's voice commands to perform actions, such as switching to channel XX.
  • That is, while playing a TV program, the smart TV can recognize human voices in the picked-up sound signals and act according to them.
  • The far-field pickup device 1 includes a device main body 11 and a microphone pickup unit 12 that are physically separated, where the device main body 11 is a television main body having a display screen 1102 and a speaker 1101, used respectively for displaying the video of a television program and playing its sound.
  • The microphone pickup unit 12 may include a plurality of microphone units 1201. In FIG. 1B, each microphone unit 1201 is shown as a dot, and the microphone pickup unit 12 is a dot matrix composed of these dots.
  • the microphone unit 1201 picks up sound signals in the environment, including user voice, interference noise in the environment, and echo of the sound of a TV program played by a smart TV.
  • the sound signal picked up by each microphone unit 1201 is converted into a digital signal and sent to a processing device (not shown in the figure) in the device body 11.
  • The processing device removes, from the received sound signal, the echo of the TV program sound played by the smart TV and the interference noise in the environment, obtains the user's voice, and generates control instructions based on it, such as switching to a certain channel, increasing or decreasing the volume, and so on.
  • FIG. 1C is a schematic diagram of a scenario where a far-field pickup device is applied to voice-controlled intelligent navigation according to an embodiment of the present application.
  • Voice-controlled intelligent navigation plans a route according to the starting point and destination entered by the user and issues voice announcements while driving, so that the user always drives along the announced route. While driving (possibly at the same time as an announcement is playing), the user can issue voice instructions such as "change the destination to XX", "I want to park, help me find a nearby parking space", or "change the navigation display from north-up to heads-up".
  • The voice-controlled intelligent navigation device recognizes the user's voice command and acts according to it, such as restarting navigation with the new destination XX, helping the user find a nearby parking space, or changing the navigation display from north-up to heads-up.
  • The far-field pickup device 1 includes a device main body 11 and a microphone pickup unit 12 that are physically separated, where the device main body 11 is the main body of a voice-controlled intelligent navigation device and has a display screen 1106 and a loudspeaker 1101.
  • The display screen 1106 is used to input the desired destination and to display the navigation route.
  • the loudspeaker 1101 is used to play a voice broadcast during the navigation of a voice-controlled intelligent navigation device.
  • the microphone pickup unit 12 is located on the side close to the driver to more clearly collect the voice instructions issued by the driver.
  • The display screen 1106 is located in the middle, and the speaker 1101 is located on the side of the front console away from the driver, to avoid clipping and saturation and thereby improve speech recognition performance.
  • The microphone pickup unit 12 includes a plurality of microphone units 1201, each shown as a dot; the microphone pickup unit 12 is a dot matrix composed of these dots.
  • the microphone unit 1201 picks up sound signals in the environment, including user's voice, interference noise in the environment, and echo of the voice broadcast by the voice-controlled intelligent navigation device.
  • the sound signal picked up by each microphone unit 1201 is converted into a digital signal and sent to a processing device (not shown in the figure) in the device body 11.
  • The processing device removes, from the received sound signal, the echo of the voice broadcast played by the voice-controlled intelligent navigation device and the interference noise in the environment, obtains the user's voice, and generates control instructions based on it, such as restarting navigation with the new destination XX, helping the user find a nearby parking space, or changing the navigation display from north-up to heads-up.
  • FIG. 1D is a schematic diagram of a scenario where a far-field pickup device is applied to a KTV music playback system according to an embodiment of the present application.
  • the KTV music playback system can play the accompaniment of the song through the speaker according to the song selected by the user.
  • When the user sings into the microphone during the accompaniment, the user's voice is picked up and played together with the accompaniment through the speaker.
  • The far-field pickup device 1 includes a device main body 11, a physically separated microphone pickup unit 12, a display screen 1102, and a song selection station 1103.
  • the song selection station 1103 displays a list of songs for accompaniment, and the user can select songs to sing from the list.
  • the device main body 11 is the main body of the KTV music playback system, and has a speaker 1101 thereon for playing the accompaniment of the song selected by the user.
  • the display screen 1102 is used to display the lyrics and pictures of the song selected by the user.
  • The microphone pickup unit 12 includes a plurality of microphone units (not shown in the figure), each shown as a dot; the microphone pickup unit 12 is a dot matrix composed of these dots.
  • The user sings into a microphone containing the microphone pickup unit 12. The user's singing voice, the interference noise in the environment, and the echo of the accompaniment played by the speaker 1101 are converted into digital signals by the microphone units and sent to the processing device in the device main body 11.
  • If the processing device did not eliminate the echo of the accompaniment, the echo, which is delayed relative to the accompaniment being played, would overlap with it and produce a blurred sound. Therefore, the processing device should eliminate the echo of the accompaniment and the interference noise in the environment, and then play the user's singing voice.
  • the far-field sound pickup device 1 includes a device main body 11 and a microphone sound pickup unit 12 in which components are separated. There may be one or more microphone pickup units 12.
  • the device main body 11 can be all built locally, that is, located at the same location as the microphone pickup unit 12, or a part (speaker and receiver) can be built locally, and the core part for sound processing, that is, the processing device, is placed at the far end.
  • the far end means an area that can be connected to the Internet or a telecommunication network, but is not in the same location as the microphone pickup unit 12.
  • the processing equipment can be connected to the speakers and receivers via the Internet, or can be connected in a wired or wireless manner through a telecommunications network.
  • Figures 2A-2D use the smart speaker shown in Figure 1A as an example to illustrate different deployment layouts of the far-field pickup device. Those skilled in the art should understand that, with slight changes, the layouts of Figures 2A-2D also apply to the far-field pickup devices of Figures 1B-1D, and such changes can easily be made based on the teachings of the embodiments of the present application.
  • The device main body 11 and the microphone pickup unit 12 can all be built locally. If there were only one microphone pickup unit 12 in the room, the distance between it and the user might still be too large, causing low speech recognition performance. Therefore, in the deployment layout shown in FIG. 2A, the far-field pickup device 1 includes a device main body 11 and a plurality of microphone pickup units 12, all arranged locally.
  • the device main body 11 includes a speaker 1101.
  • a microphone pickup unit 12 is arranged in each corner of the room.
  • The microphone units 1201 of the multiple microphone pickup units 12, at different distances from the user, receive the user's voice, and send the user's voice, the echo of the sound played by the speaker 1101, and the interference noise in the environment together to the device main body 11, which processes these signals to obtain the user's voice.
  • The device main body 11 may include different receivers corresponding to different microphone pickup units 12, or may include only one receiver for receiving the signals sent by all microphone pickup units 12, as described in detail later in conjunction with FIGS. 4A and 4B.
  • In another layout, the microphone pickup unit 12, as well as the speaker 1101 and the receiver (not shown) in the device main body 11, are arranged locally, while the processing device 1103, the core part for sound processing, is arranged at the far end 1104.
  • This layout is possible because the processing of sound signals is independent of the on-site collection and playback of sound, so it need not be arranged locally; placing it at the far end 1104 helps reduce the size of the locally arranged components.
  • the microphone unit 1201 of the microphone pickup unit 12 receives the user's voice and sends the user's voice, the echo of the sound played by the speaker, and the interference noise in the environment together to the receiver of the device body 11. The receiver of the device main body 11 then sends the received signal to the processing device 1103 at the far end 1104 through the Internet or a telecommunication connection.
  • the processing device 1103 removes the echo of the sound played by the speaker and the interference noise in the environment from the received sound signal to obtain the user's voice, generates control instructions based on the user's voice, such as increasing or decreasing the speaker volume, and sends them to the speaker 1101 so as to control the playback volume.
  • the processing device 1103 located at the far end 1104 communicates with the speakers 1101 and receivers (not shown) located at multiple places 2 (e.g., multiple rooms) and processes the signals sent to the receivers by the microphone pickup units 12 at each place 2. That is, one processing device 1103 at the far end 1104 may form a device main body 11 with the speaker 1101 and receiver at each of multiple places.
  • the speaker 1101 and the receiver, as well as the microphone pickup unit 12, are arranged locally.
  • the microphone unit 1201 of the local microphone pickup unit 12 receives the user's voice and sends the user's voice, the echo of the sound played by the speaker, and the interference noise in the environment together to the receiver of the device main body 11. The receiver of the device main body 11 then sends the received signal to the processing device 1103 at the remote end 1104 through the Internet or a telecommunication connection.
  • the processing device 1103 removes the echo of the sound broadcast from the speaker and the interference noise in the environment from the received sound signal, obtains the user's voice, and generates control instructions based on the user's voice, such as increasing or decreasing the speaker volume.
  • a processing device 1103 that is not related to the collection and playback of live sound is arranged at the far end 1104, and local devices located at multiple locations are shared, which is conducive to the efficient use of resources.
  • the receiver communicates with a plurality of processing devices 1103 located at the remote end 1104 through the scheduling module 1105.
  • the scheduling module 1105 specifies the processing device 1103 that processes the signal. That is, which processing device 1103 at the far end 1104 forms the device main body 11 with the speaker 1101 and receiver of which place 2 is not fixed.
  • the advantage of this layout is that, because the combination of the processing device 1103, the speaker 1101, and the receiver is not fixed, when a sound signal sent by a local receiver is received, the scheduling module 1105 decides which processing device 1103 to assign according to the current load of each processing device 1103. In this way, the processing load of the processing devices 1103 can be balanced and network resources effectively allocated, avoiding the overload caused when one processing device simultaneously processes the sound signals sent by multiple local receivers.
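A minimal sketch of this load-balancing decision (the class and attribute names are our own illustration, not from the patent): the scheduling module assigns each incoming mixed signal to the processing device with the lowest current load.

```python
# Hypothetical sketch: the scheduling module 1105 picks, for each incoming
# mixed signal, the processing device 1103 with the lowest current load.

class ProcessingDevice:
    def __init__(self, name):
        self.name = name
        self.load = 0          # mixed signals currently being processed

class SchedulingModule:
    def __init__(self, devices):
        self.devices = devices

    def assign(self, mixed_signal):
        # Pick the least-loaded processing device and hand the signal to it.
        device = min(self.devices, key=lambda d: d.load)
        device.load += 1
        return device

scheduler = SchedulingModule([ProcessingDevice("pd-1"), ProcessingDevice("pd-2")])
first = scheduler.assign("mixed signal from receiver A")
second = scheduler.assign("mixed signal from receiver B")
```

With two idle devices, consecutive signals are spread across both, so no single device accumulates all the work.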
  • the microphone unit 1201 of the local microphone pickup unit 12 receives the user's voice and transmits the user's voice, the echo of the sound played by the speaker, and the interference noise in the environment together to the receiver of the device main body 11. The receiver of the device main body 11 then sends the received signal to the scheduling module 1105 at the remote end 1104 through the Internet or a telecommunication connection.
  • the scheduling module 1105 allocates a processing device 1103 to the received signal, and sends the received signal to the processing device 1103.
  • the processing device 1103 removes the echo of the sound played by the speaker and the interference noise in the environment from the received signal, obtains the user's voice, generates control instructions based on the user's voice, such as increasing or decreasing the speaker volume, and sends them to the local speaker 1101 via the Internet or a telecommunication connection to control the played sound.
  • FIG. 3A is a schematic diagram of the architecture of a far-field pickup device according to an embodiment of the present application, applied to a smart speaker.
  • the far-field pickup device 1 includes a device main body 11 and a microphone pickup unit 12 with separated components, where the device main body 11 has a speaker 1101 for playing the to-be-played sound signal.
  • the microphone pickup unit 12 may include a plurality of microphone units 1201. In FIG. 3A, each of the plurality of microphone units 1201 is a point, and the microphone pickup unit 12 includes a dot matrix composed of these points.
  • the microphone unit 1201 picks up sound signals in the environment, including user voice 24 of the user 21, interference noise 23 (referred to as environmental noise) in the environment, and echo 22 (referred to as speaker echo) of the sound signal played by the smart speaker.
  • Each microphone unit 1201 converts the collected sound signal into a digital signal (ie, a mixed signal 25) and sends it to a processing device in the device main body 11.
  • the processing device removes the echo of the sound signal played by the smart speaker and the interference noise in the environment from the received mixed signal 25, obtains the user voice, and generates a control instruction according to the user voice.
  • FIG. 3B is a schematic diagram showing the architecture of a far-field pickup device applied to a smart TV according to an embodiment of the present application.
  • the far-field pickup device 1 includes a device main body 11 and a microphone pickup unit 12 with separated components.
  • the device main body 11 is a television main body, and has a display screen 1102 and a speaker 1101, which are respectively used to display a video of a television program and play a sound of the television program.
  • the microphone pickup unit 12 may include a plurality of microphone units 1201. The user watches the picture of the TV program displayed on the display screen 1102, and at the same time listens to the sound of the TV program played by the speaker 1101. While playing the sound of a TV program, the user can issue a voice command to control the TV, such as "change to XX channel".
  • the microphone unit 1201 picks up sound signals in the environment, including user voice 24, interference noise 23 in the environment (referred to as environmental noise), and echo 22 (referred to as speaker echo) of the sound of a television program played by the speaker 1101.
  • Each microphone unit 1201 converts the collected sound signal into a digital signal (ie, a mixed signal 25) and sends it to a processing device in the device main body 11.
  • the processing device removes the echo of the sound of the TV program played by the smart TV and the interference noise in the environment from the received mixed signal 25, obtains the user voice, and generates a control instruction according to the user voice, such as switching to a certain channel.
  • FIG. 3C is a schematic diagram showing the architecture of a far-field pickup device applied to voice-controlled intelligent navigation according to an embodiment of the present application.
  • the far-field sound pickup device 1 includes a device main body 11 and a microphone sound pickup unit 12 with separated components.
  • the microphone pickup unit 12 includes a plurality of microphone units 1201. Each of the microphone units 1201 is a dot, and the microphone pickup unit 12 includes a dot matrix composed of these dots.
  • the device main body 11 is a voice-controlled intelligent navigation device main body, and has a display screen 1106 and a speaker 1101.
  • the display screen 1106 is also referred to as a navigation display device 1106, and is used to input a destination and display navigation lines.
  • the device broadcasts voice prompts during navigation. While the voice is being broadcast, the user may issue a voice command to control the smart navigation device, for example, "I want to change to XX".
  • the microphone unit 1201 picks up sound signals in the environment, including user voice 24, interference noise 23 in the environment (referred to as environmental noise), and echo 22 (referred to as horn echo) of the voice broadcast by the voice-controlled intelligent navigation device.
  • Each microphone unit 1201 converts the collected sound signal into a digital signal (ie, a mixed signal 25) and sends it to a processing device in the device main body 11.
  • the processing device removes the echo of the voice broadcast by the voice-controlled intelligent navigation device and the interference noise in the environment from the received mixed signal 25, obtains the user's voice, and generates control instructions based on the user's voice, such as restarting navigation according to the new destination XX, and the like.
  • FIG. 3D illustrates a system architecture diagram of a far-field pickup device applied to a KTV music playback system according to an embodiment of the present application.
  • the far-field pickup device 1 includes a device main body 11 and a microphone pickup unit 12 with separated components.
  • the user selects a song to sing at the song selection station.
  • the device main body 11 is the main body of the KTV music playback system, and has a speaker 1101 and a KTV display screen 1102 thereon.
  • the speaker 1101 is used to play the accompaniment of the song selected by the user
  • the KTV display 1102 is used to display the lyrics and picture of the song selected by the user.
  • the user sings into the microphone pickup unit 12; the singing sound (referred to as the user's voice) 24, the interference noise 23 in the environment (referred to as environmental noise), and the echo 22 (referred to as the speaker echo) of the accompaniment played by the speaker are converted by a microphone unit (not shown) of the microphone pickup unit 12 into a digital signal (ie, a mixed signal 25) and sent to a processing device in the device body 11.
  • the processing device eliminates the echo of the accompaniment and the interference noise in the environment from the mixed signal 25 to obtain the voice sung by the user, which is then played.
  • FIG. 4A-4B are schematic diagrams of a system structure corresponding to the construction layout of the far-field pickup device of FIG. 2A.
  • FIG. 4A corresponds to the case where the device main body 11 in FIG. 2A has multiple receivers and receives mixed signals 25 reported by multiple microphone pickup units 12 respectively;
  • FIG. 4B corresponds to the case where the device main body 11 in FIG. 2A has one receiver that receives the mixed signals 25 reported by the plurality of microphone pickup units 12.
  • each microphone pickup unit 12 includes three microphone units 1201. Although FIG. 4A illustrates three microphone pickup units 12 in the same room, those skilled in the art should understand that the number of microphone pickup units 12 may be smaller or larger. Likewise, although FIG. 4A illustrates three microphone units 1201 in the same microphone pickup unit 12, the number of microphone units 1201 may be smaller or larger.
  • each microphone pickup unit 12 converts the received user voice 24, the echo 22 (referred to as the speaker echo) of the sound played by the speaker, and the interference noise 23 (referred to as the environmental noise) in the environment into a mixed signal 25, which is transmitted to a receiver 1107 of the device main body 11.
  • the microphone pickup unit 12 may be connected to the receiver 1107 through a wire so as to form a one-to-one correspondence relationship.
  • alternatively, the microphone pickup units 12 may send the mixed signals 25 as a group, with each mixed signal 25 carrying the specific identifier of its microphone pickup unit 12; each receiver 1107 then extracts, from the received mixed signals 25, the mixed signal 25 sent by its corresponding microphone pickup unit 12 according to the identifier. After each receiver 1107 extracts the mixed signal 25 from its respective microphone pickup unit 12, the processing equipment removes from each mixed signal 25 the echo 22 (referred to as the speaker echo) of the sound played by the speaker and the interference noise 23 (referred to as the environmental noise) in the environment to obtain multiple channels of extracted user voice, and then combines the user voices extracted from the multiple channels to obtain an enhanced user voice.
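The identifier-based extraction described above can be sketched as follows (the tuple layout and the identifiers are illustrative assumptions, not the patent's actual message format):

```python
# Hypothetical sketch: mixed signals arrive as a group, each tagged with the
# identifier of the microphone pickup unit 12 that produced it; a receiver
# 1107 keeps only the signals whose identifier matches its own pickup unit.

def extract_for(receiver_unit_id, mixed_signals):
    """mixed_signals: list of (unit_id, samples) pairs sent as one group."""
    return [samples for unit_id, samples in mixed_signals
            if unit_id == receiver_unit_id]

group = [("unit-1", "frame a"), ("unit-2", "frame b"), ("unit-1", "frame c")]
mine = extract_for("unit-1", group)
```

Each receiver thus recovers only its own pickup unit's stream from the shared group of signals.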
  • by placing the microphone pickup units 12 at different positions in the room, processing the sound signals sent by each microphone pickup unit 12 separately, and combining the processing results, the system avoids the problem that would arise if the microphone pickup unit 12 were placed at only a single location: the distance between the user and the microphone pickup unit 12 might be too great, resulting in low accuracy of the speech recognition result.
  • the system architecture shown in FIG. 4B differs from FIG. 4A in that there is only one receiver 1107 in the device body 11 of FIG. 4B; it receives the mixed signals sent by the microphone pickup units 12, selects only one of them for processing according to a predetermined criterion, and discards the other mixed signals.
  • the predetermined criterion is, for example, selecting the mixed signal from the microphone pickup unit 12 closest to the user for processing. The closer the microphone pickup unit 12 is to the user, the louder the collected user's voice, which improves the voice recognition effect.
  • the selection of the mixed signal can be implemented by adding a time stamp to the mixed signal 25 when the microphone pickup unit 12 sends the mixed signal 25, which indicates the time when the microphone pickup unit 12 receives the user's voice.
  • the receiver 1107 can select the mixed signal 25 with the earliest reception time indicated by the timestamp for processing according to the sequence of the reception times indicated by the time stamp.
  • the user's voice in the mixed signal 25 with the earliest reception time indicated by the timestamp has the highest quality, which helps improve the voice recognition effect.
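The timestamp-based selection can be sketched as follows (the data layout and names are our own illustration, not the patent's format):

```python
# Hypothetical sketch: each microphone pickup unit 12 stamps its mixed signal
# with the time it received the user's voice; the receiver 1107 keeps only
# the signal with the earliest timestamp, which came from the unit closest
# to the user, and discards the rest.

def select_earliest(mixed_signals):
    """mixed_signals: list of (timestamp_seconds, samples) tuples."""
    return min(mixed_signals, key=lambda sig: sig[0])

signals = [
    (0.0150, "samples from unit A"),   # farther away: voice arrived later
    (0.0121, "samples from unit B"),   # closest unit: earliest arrival
    (0.0180, "samples from unit C"),
]
chosen = select_earliest(signals)
```

The earliest arrival corresponds to the shortest acoustic path, i.e. the pickup unit nearest the user.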
  • FIG. 4C is a schematic diagram of a system architecture of a remote construction layout corresponding to FIG. 2B.
  • the microphone pickup unit 12 is arranged locally.
  • the speaker 1101 and the receiver 1107 in the device main body 11 are also arranged locally.
  • a processing device 1103 as a core part of the processing is arranged at the remote end 1104.
  • the processing device 1103 communicates with the local speakers 1101 and the receiver 1107 through an Internet or telecommunications connection.
  • the microphone unit 1201 of the microphone pickup unit 12 receives the user's voice 24, converts the user's voice 24, the echo 22 (referred to as the speaker echo) of the sound played by the speaker, and the interference noise 23 (referred to as the environmental noise) in the environment into a mixed signal 25, and sends it to the local receiver 1107.
  • the local receiver 1107 sends the received mixed signal 25 to the processing device 1103 of the remote 1104 through the Internet or a telecommunication connection.
  • the processing device 1103 removes the echo of the sound played by the speaker and the interference noise in the environment from the received mixed signal 25 to obtain the user voice, generates control instructions based on the user voice, such as increasing or decreasing the speaker volume, and sends the control instructions to the speaker 1101 via the Internet or a telecommunication connection to control the playback volume.
  • FIG. 4D is a schematic diagram of a system architecture of a remote construction layout corresponding to FIG. 2C.
  • the processing device 1103 located at the far end 1104 communicates with the speakers 1101 and receivers 1107 located at multiple places 2 (for example, rooms) and processes the signals sent to the receivers 1107 by the microphone pickup units 12 at the multiple places 2. That is, one processing device 1103 at the remote end 1104 may form a respective device main body 11 with the speaker 1101 and receiver 1107 at each of multiple places.
  • the local microphone pickup unit 12 receives the user's voice, converts the user's voice, the echo of the sound played by the speaker, and the interference noise in the environment into a mixed signal 25, and sends it to the local receiver 1107.
  • the local receiver 1107 sends the received mixed signal to the processing device 1103 of the remote 1104 through the Internet or a telecommunication connection.
  • the processing device 1103 removes the echo of the sound played by the speaker and the interference noise in the environment from the received mixed signal, obtains the user's voice, generates a control instruction according to the user's voice, and sends the control instruction to the speaker 1101 through the Internet or a telecommunication connection, thereby controlling the playback volume.
  • the mixed signal reported by each receiver 1107 is uniformly processed by the same processing device 1103, which is beneficial to the efficient use of resources.
  • FIG. 4E is a schematic diagram of a system architecture of a remote construction layout corresponding to FIG. 2D.
  • the remote end 1104 may include multiple processing devices 1103 and a scheduling module 1105.
  • the scheduling module 1105 is connected to a plurality of processing devices 1103.
  • the speaker 1101 and the receiver 1107 located at any local place 2 and any processing equipment 1103 of the remote 1104 may form a pair.
  • when the receiver 1107 receives the mixed signal 25 sent by the microphone unit 1201 of the local microphone pickup unit 12, it sends it to the scheduling module 1105 via the Internet or a telecommunication connection, and the scheduling module 1105 specifies a processing device 1103 to process the mixed signal.
  • that is, which processing device 1103 at the remote end 1104 forms the device main body 11 with the speaker 1101 and receiver 1107 of which place 2 is not fixed.
  • the advantage of this layout is that, because the combination of the processing device 1103 with the speakers 1101 and receivers 1107 is not fixed, when a mixed signal sent by a local receiver 1107 is received, the scheduling module 1105 decides which processing device 1103 to allocate according to the current load of each processing device 1103. In this way, the processing load of the processing devices 1103 can be balanced and network resources effectively allocated.
  • a far-field sound pickup device 1 refers to a device in which a microphone is separately arranged from a main body where a speaker is located.
  • the advantage of the far-field pickup device is that the distance between the microphone and the speaker is relatively large, so the echo of the sound signal played by the smart speaker itself, as received by the microphone, does not clip or saturate. Moreover, the user is closer to the microphone, which can improve speech recognition performance.
  • FIG. 5 is a schematic structural diagram of a far-field pickup device according to an embodiment of the present application.
  • the far-field pickup device includes a device main body 11 and a microphone pickup unit 12 with separated components.
  • the device main body includes a playback signal source 1108, a synchronization signal generator 1109, a speaker 1101, a delay determination unit 1110, and an echo cancellation unit 1111.
  • the playback signal source 1108 is a component that generates a sound signal to be played, and can be implemented by an audio signal generating circuit.
  • the audio signal generating circuit generates a corresponding audio signal according to a stored audio file or an audio file received by an antenna.
  • the audio signal may be a digital signal sampled at 48 kHz or 96 kHz, and the generation of each sample point is strictly synchronized by a clock signal.
  • the playback signal source 1108 is an audio signal generating circuit that generates a corresponding sound signal according to the stored audio file.
  • the corresponding sound signal is the corresponding music.
  • the audio file can be stored in the smart speaker, or it can be stored in other terminal devices, such as the user's mobile phone, and transmitted by other terminal devices to the smart speaker through Bluetooth transmission.
  • the playback signal source 1108 is a frequency dividing circuit that can separate the audio signal received by the TV antenna.
  • the frequency division circuit separates the video signal and audio signal received by the TV antenna.
  • the playback signal source 1108 is a voice output circuit that converts the prompt information generated by the navigation device into a voice output.
  • the voice output circuit stores the basic waveforms of the voices of different people, such as the basic waveforms of voices of Lin Zhiling and Guo Degang, and converts the prompt information generated by the navigation device into the voice of the person set by the user according to the settings of the user.
  • the playback signal source 1108 is an audio signal generating circuit that converts the accompaniment of a song ordered by the user in the KTV system into a sound signal. The user clicks the song at the song station, and the accompaniment file of the song is converted into an audio signal by the audio signal generating circuit.
  • the synchronization signal generator 1109 is a signal generation circuit that can generate an output signal synchronized with an input signal. Specifically, in the embodiment of the present application, the synchronization signal generator 1109 generates a synchronization signal that is synchronized with the to-be-played sound signal generated by the playback signal source 1108 and occupies a second frequency band different from the first frequency band occupied by the to-be-played sound signal.
  • the second frequency band refers to a frequency band that does not overlap with the first frequency band.
  • a frequency band is a range of frequencies. For example, if the first frequency band is [48kHz, 96kHz] and the second frequency band is [22kHz, 24kHz], the two are different frequency bands.
  • a clock circuit may be included in the synchronization signal generator 1109, which can extract a clock from a sound signal to be played generated by the playback signal source 1108, and the extracted clock is used to generate a clock synchronized with the sound signal to be played.
  • the clock circuit 1112 may also be provided outside the synchronization signal generator 1109. As shown in FIG. 6, the clock circuit 1112 is provided between the playback signal source 1108 and the synchronization signal generator 1109. The clock circuit 1112 extracts a clock from the to-be-played sound signal generated by the playback signal source 1108, and then uses the clock as a basis to generate a second frequency band signal that is synchronized with the to-be-played sound signal.
  • the second frequency band must be in a range inaudible to the human ear. If the synchronization signal in the second frequency band were a sound signal audible to the human ear, then after being superimposed with the to-be-played sound signal from the playback signal source 1108, it would be broadcast by the speaker 1101 together with the to-be-played sound signal and heard by the user, thereby causing auditory interference to the user.
  • the second frequency band is an ultrasound frequency band.
  • the synchronization signal generator 1109 is an ultrasonic synchronization signal generator. Because the ultrasonic frequency band is inaudible to the human ear, it will not cause auditory interference to the sound signal being played. The sampling frequency of the synchronization signal is completely consistent with the sound signal to be played, but its energy is concentrated in the ultrasonic frequency band. If the sampling frequency of the sound signal to be played is 48KHz, it is recommended to select the ultrasonic frequency band from 21KHz to 23KHz. If the sampling frequency of the sound signal to be played is 96KHz, you can choose an ultrasonic frequency band higher than 23KHz (but less than 48KHz).
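The band-selection guidance above can be captured in a small helper (a sketch; the function name is our own, while the band edges come from the recommendations in the text):

```python
# Sketch: map the sampling rate of the to-be-played sound signal to the
# suggested ultrasonic band for the synchronization signal. The function
# name is hypothetical; the band edges follow the text's recommendations.

def ultrasonic_band(sample_rate_hz):
    """Return the suggested (low, high) ultrasonic band in Hz."""
    if sample_rate_hz == 48_000:
        return (21_000, 23_000)       # 21-23 kHz recommended for 48 kHz material
    if sample_rate_hz == 96_000:
        return (23_000, 48_000)       # above 23 kHz but below Nyquist (48 kHz)
    raise ValueError("unsupported sampling rate")
```

In both cases the band sits above human hearing but below the Nyquist frequency of the playback signal.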
  • the synchronization signal is a pseudo-random sequence after carrier modulation, which provides higher anti-interference ability and good auto-correlation characteristics. Since the synchronization signal is used to judge the time delay between the to-be-played sound signal generated by the playback signal source 1108 and the mixed signal 25 received by the device main body 11, and this time delay is used to perform echo cancellation, good anti-interference ability and good auto-correlation characteristics are required. The pseudo-random sequence after carrier modulation satisfies Formula 1:
  • s(t) = n(t) · cos(2π f_s t)    (Formula 1)
  • where n(t) is a pseudo-random sequence,
  • f_s is the carrier frequency in the ultrasonic frequency range, and
  • s(t) is the pseudo-random sequence after carrier modulation.
  • the selection of the pseudo-random sequence is based on at least one of the following parameters:
  • the frequency of the carrier can be selected based on its distance from 20KHz.
  • the pseudo-random sequence after the carrier modulation satisfying Formula 1 can bring higher anti-interference ability and good auto-correlation characteristics. If the frequency of the pseudo-random sequence and the carrier is selected as described above, the anti-interference ability can be further improved and good auto-correlation characteristics can be obtained.
  • if the autocorrelation function of the pseudo-random sequence n(t) can be approximated by an impulse function, the anti-interference ability and autocorrelation characteristics of the pseudo-random sequence n(t) are considered sufficiently good.
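A small numeric check of this property, under assumed parameters (a ±1 pseudo-random sequence and a 22 kHz carrier, inside the 21-23 kHz band suggested for 48 kHz material): the autocorrelation of n(t) should look like an impulse, with a large peak at lag 0 and small values elsewhere.

```python
import numpy as np

# Sketch with assumed parameters: build n(t), modulate it onto an ultrasonic
# carrier per Formula 1, and verify the impulse-like autocorrelation of n(t).

fs_sample = 48_000                 # sampling rate of the to-be-played signal
f_carrier = 22_000                 # carrier frequency f_s in the ultrasonic range
rng = np.random.default_rng(0)

n = rng.choice([-1.0, 1.0], size=4800)        # pseudo-random sequence n(t)
t = np.arange(n.size) / fs_sample
s = n * np.cos(2 * np.pi * f_carrier * t)     # Formula 1: s(t) = n(t)·cos(2π f_s t)

# Normalized autocorrelation of n(t): exactly 1 at lag 0, small sidelobes.
acf = np.correlate(n, n, mode="full") / n.size
peak = acf[n.size - 1]                        # lag-0 value
sidelobe = np.max(np.abs(np.delete(acf, n.size - 1)))
```

The sharp lag-0 peak is what later makes the delayed synchronization component easy to locate by correlation.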
  • when the delay determination unit 1110 determines the time delay between the second-frequency-band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal, the determined time delay may be inaccurate, since the time delay may span multiple periods of the synchronization signal.
  • to avoid this ambiguity, the period T of the pseudo-random sequence n(t) should satisfy T > c · t₁, where:
  • T is the period of the pseudo-random sequence n(t);
  • t₁ is the average of multiple measurements of the time delay from when the playback signal source 1108 generates the to-be-played sound signal to when the mixed signal 25 is received by the device main body 11;
  • the time delay measurement method may be: let the playback signal source 1108 generate a test sound signal, record the time when the test sound signal is generated, record the time when the device body 11 receives the mixed signal 25 from the microphone pickup unit 12, and subtract the two recorded times to obtain the time delay;
  • the time delay can be measured multiple times and the measurements averaged;
  • c is a constant that can be set in advance, such as 3 or 4.
  • the spectrum width of the autocorrelation function of the pseudo-random sequence n (t) should be greater than a predetermined spectrum width threshold, so that the anti-interference ability and autocorrelation characteristics of the pseudo-random sequence n (t) are good enough.
  • the predetermined spectrum width threshold may be preset by a user.
  • the speaker 1101 is a circuit device that converts sound signals into sound. In the embodiment of the present application, it is used to play the superposition of the to-be-played sound signal and the synchronization signal as the sound signal played by the device main body 11.
  • the synchronization signal output by the synchronization signal generator 1109 and the to-be-played sound signal output by the playback signal source 1108 can be input to the two input terminals of the speaker 1101 respectively, and a single output is produced at the output terminal of the speaker 1101.
  • in the case where the speaker 1101 has only a single input terminal, the to-be-played sound signal output by the playback signal source 1108 and the synchronization signal output by the synchronization signal generator 1109 first enter the adder 1113 for superposition, and the superimposed sound signal then enters the speaker 1101.
  • the speaker 1101 converts the sound signal into sound and plays it.
  • the sound played by the loudspeaker 1101 includes not only the sound to be played in the first frequency band that can be heard by the human ear, but also components converted by the synchronization signal in the second frequency band that cannot be heard by the human ear, such as ultrasound.
  • the component converted from the synchronization signal does not interfere with the user's hearing, but when the microphone pickup unit 12 sends the collected mixed signal 25 back to the device main body 11, the second-frequency-band component can be compared with the synchronization signal generated by the synchronization signal generator 1109, so as to obtain the time delay from the generation of the synchronization signal to its reception by the device main body 11.
  • this time delay is also the time delay between the playback signal source 1108 generating the to-be-played sound signal and the mixed signal 25 being received by the device main body 11. Therefore, the to-be-played sound signal can be delayed by this amount, and the delayed to-be-played sound signal can be used for echo cancellation to obtain the collected human voice signal, thereby solving the problem that the microphone signal and the echo reference signal cannot be synchronized and improving speech recognition performance.
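A toy sketch of this delayed-reference echo cancellation (our own construction: the echo gain of 0.7 is assumed known here for illustration, whereas a real system would estimate it, e.g. with an adaptive filter):

```python
import numpy as np

# Sketch: once the time delay is known, the to-be-played signal is shifted
# by that delay to form the echo reference, which is subtracted from the
# microphone signal to recover the user's voice.

rng = np.random.default_rng(3)
r = rng.standard_normal(1000)               # to-be-played sound signal
voice = np.zeros(1000)
voice[400:500] = rng.standard_normal(100)   # the user's voice burst

delay = 25                                  # measured time delay, in samples
echo = np.zeros(1000)
echo[delay:] = 0.7 * r[:-delay]             # echo: delayed, attenuated playback

mic = voice + echo                          # signal picked up by the microphone

ref = np.zeros(1000)                        # echo reference: r delayed by the
ref[delay:] = 0.7 * r[:-delay]              # measured time delay
voice_est = mic - ref                       # echo-cancelled voice estimate
```

With a correctly aligned reference the echo cancels and only the voice remains; a misaligned (unsynchronized) reference would leave large residual echo, which is exactly the problem the synchronization signal solves.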
  • the sound signal p(t) played by the speaker 1101 is actually the sum of the synchronization signal s(t) and the to-be-played sound signal r(t), that is:
  • p(t) = s(t) + r(t)
  • after propagating in space, the sound signal received by the microphone pickup unit 12 becomes:
  • m(t) = s(t) ∗ g(t) + r(t) ∗ k(t)
  • where m(t) represents the sound signal received by the microphone pickup unit 12,
  • g(t) is the transfer function with which s(t) is transmitted between the speaker 1101 and the microphone pickup unit 12, and
  • k(t) is the transfer function with which r(t) is transmitted between the speaker 1101 and the microphone pickup unit 12; ∗ denotes convolution.
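The signal model above can be checked numerically with a toy example (the FIR impulse responses standing in for g(t) and k(t) are assumed for illustration: a two-sample delay with attenuation plus a weak reflection tap):

```python
import numpy as np

# Toy illustration of the model: the mic signal m is s and r each convolved
# with its own transfer function, modeled as short FIR impulse responses.

rng = np.random.default_rng(1)
s = rng.standard_normal(256)          # synchronization signal s(t)
r = rng.standard_normal(256)          # to-be-played sound signal r(t)

g = np.array([0.0, 0.0, 0.8, 0.1])    # assumed path for s: delay + echo tap
k = np.array([0.0, 0.0, 0.6, 0.2])    # assumed path for r

m = np.convolve(s, g) + np.convolve(r, k)    # m(t) = s(t) * g(t) + r(t) * k(t)
```

Because both impulse responses start with two zero taps, the first two samples of m are zero, reflecting the acoustic propagation delay.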
  • the microphone pickup unit 12 collects the user's voice 24, the echo 22 (referred to as the speaker echo) formed after the sound signal played by the speaker 1101 propagates in space, and the interference noise 23 (referred to as the environmental noise) in the environment.
  • the user voice 24 is a sound issued by the user, which may include a voice control instruction that the user wants to operate the smart speaker, such as "turn the sound up or down".
  • the echo 22 of the sound signal played by the speaker 1101 in the space refers to the sound played by the speaker 1101 and reaching the microphone pickup unit 12 after propagating in the space.
  • the space refers to a space in an environment (for example, a room) in which the speaker 1101 and the microphone pickup unit 12 are located.
  • Propagation in space includes rectilinear propagation, as well as propagation that reaches the microphone pickup unit 12 again through reflections, diffractions, etc. of walls, windows, and the like.
• the interference noise 23 in the environment refers to background noises such as environmental noises and wind noise.
  • the microphone pickup unit 12 shown in FIG. 5 includes a microphone unit 1201.
  • the microphone unit 1201 is a sound-electric signal conversion unit, which can convert sound into an electric signal representing the sound, and in particular can be converted into a digital signal.
  • the digital signal is actually a mixed signal 25 converted from user voice, interference noise in the environment, and echo of a sound signal played by a smart speaker.
  • the microphone unit 1201 of the microphone pickup unit 12 sends the mixed signal 25 to the device main body 11.
  • the delay determining unit 1110 in the device main body 11 may determine a component of the second frequency band in the mixed signal 25 and a time delay with the synchronization signal.
• the delay determining unit 1110 may have a filtering unit. Since the first frequency band of the sound signal to be played is different from the second frequency band of the synchronization signal, the filtering unit can filter the component of the second frequency band out of the mixed signal 25.
• This component is actually a delayed signal produced after the synchronization signal has gone through a series of processes from being generated, to being received by the microphone pickup unit 12, to being sent back to the device main body 11 by the microphone pickup unit 12. By comparing it with the initially generated synchronization signal, the time delay between the generation of the synchronization signal and its re-reception by the device main body 11 can be determined.
  • the delay determining unit 1110 may not have a filtering unit itself, but a filter 1116 may be placed in front of it, as shown in FIG. 6.
  • the filter 1116 can filter the components of the second frequency band from the received mixed signal 25 for the delay determining unit 1110 to determine the delay time.
  • the sampling frequency of the filter 1116 is consistent with the sampling frequency of the synchronization signal and the sampling frequency of the mixed signal received by the device body 11.
• the filtering delay introduced by the filter 1116 is denoted r_f.
• the delay determining unit 1110 performs a time comparison between the component of the second frequency band and the synchronization signal generated by the synchronization signal generator 1109, thereby determining the time delay. Therefore, the synchronization signal historically generated by the synchronization signal generator 1109 needs to be recorded or buffered. In one embodiment, the time when the synchronization signal generator 1109 generates the synchronization signal may be recorded by means of a time stamp: when the synchronization signal generator 1109 generates a synchronization signal, a time stamp is added to the synchronization signal, for example, a part of the synchronization signal is occupied by the time stamp. After the mixed signal 25 is received by the device main body 11, the filtered component of the second frequency band still contains the time stamp. At this time, the delay determining unit 1110 can determine the time delay from the difference between the time of receiving the component of the second frequency band output by the filter 1116 and the time stamp.
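The time-stamp variant above can be sketched as follows. This is a minimal illustration under the assumption that a sample counter serves as the common clock; the frame layout and all names are hypothetical.

```python
# Hypothetical sketch of the time-stamp scheme: the generator stamps each
# synchronization frame, and the delay is the difference between the time
# the filtered second-band component is received and the embedded stamp.

class SyncGenerator:
    def __init__(self):
        self.counter = 0  # sample counter acting as a clock

    def make_frame(self, payload):
        frame = {"stamp": self.counter, "payload": payload}
        self.counter += len(payload)
        return frame

def delay_from_stamp(frame, receive_time):
    """Time delay = reception time minus the embedded time stamp."""
    return receive_time - frame["stamp"]
```

For example, a frame stamped at sample 0 and received at sample 480 yields a delay of 480 samples.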
  • a second buffer 1115 is provided for buffering the synchronization signal generated by the synchronization signal generator 1109, so that the synchronization signal generated in history will not disappear.
  • the synchronization signal generated by the synchronization signal generator 1109 may be stored with the generation time as an axis. After receiving the component of the second frequency band output by the filter 1116, the component is matched with the synchronization signal generated in the history of the synchronization signal generator 1109 buffered in the second buffer.
  • this embodiment does not need to add a timestamp and does not need to occupy a part of the synchronization signal, which can reduce the occupation of network resources.
• the duration for which the second buffer 1115 buffers the synchronization signal is at least the sum of the transmission duration of the sound signal played by the speaker 1101 to the microphone pickup unit 12 and the transmission duration of the sound signal output from the microphone pickup unit 12 back to the device main body 11.
• a test sound signal can be played by the playback signal source 1108; by recording the time when the test sound signal is played by the speaker 1101 and the time when the test sound signal is received by the microphone pickup unit 12, the transmission duration of the sound signal played by the speaker 1101 to the microphone pickup unit 12 can be found.
• the significance of setting the buffering duration to this sum is as follows: if the duration for which the second buffer 1115 buffers the synchronization signal is set to be shorter than the sum of the transmission duration of the sound signal played by the speaker 1101 to the microphone pickup unit 12 and the transmission duration of the sound signal output from the microphone pickup unit 12 back to the device main body 11, then when the delay determining unit 1110 searches the second buffer for a segment of the synchronization signal matching the component of the second frequency band filtered by the filter 1116, the matching segment may already have been squeezed out of the second buffer, because by the time the synchronization signal has been broadcast through the speaker 1101, received by the microphone pickup unit 12, sent back to the device main body 11, and filtered by the filter 1116 of the device main body 11, the capacity of the second buffer 1115 has been exceeded.
• in that case, when the delay determining unit 1110 searches the second buffer 1115 for the synchronization signal segment corresponding to the filtered second frequency band component, the synchronization signal segment has disappeared or partially disappeared from the second buffer.
  • the time delay between the synchronization signal and the component of the second frequency band may be determined by a cross-correlation method.
  • the delay determining unit 1110 may determine the time delay between the second frequency band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal by:
• the time corresponding to the maximum value of the cross-correlation function between the second frequency band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal is determined as the time delay.
  • the data stream output from the filter 1116 is set to h (t), and the synchronization signal generated by the synchronization signal generator 1109 buffered in the second buffer 1115 is s (t).
• a method for determining the time delay τ(t) between h(t) and s(t) is to calculate the cross-correlation between the two:

c(τ) = Σ_t h(t) s(t − τ)

• the time corresponding to the maximum value of the function c(τ) is used as an estimated value τ′₁(t₁) of the time delay between the two signals h(t) and s(t).
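The argmax-of-cross-correlation estimate above can be sketched in a few lines. This is a generic textbook implementation, not the patent's specific circuit; the signal values are made up for illustration.

```python
# Sketch of the cross-correlation delay estimate: slide the buffered
# synchronization signal s over the filtered component h and pick the lag
# that maximizes the correlation sum, as the delay estimate tau'.

def estimate_delay(h, s, max_lag):
    best_lag, best_corr = 0, float("-inf")
    for lag in range(max_lag + 1):
        corr = sum(h[t] * s[t - lag] for t in range(lag, len(h))
                   if t - lag < len(s))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# Example: h is s delayed by 5 samples.
s = [0, 1, 2, 3, 2, 1, 0, 0, 0, 0]
h = [0] * 5 + s  # h(t) = s(t - 5)
```

With these example signals the estimator recovers the 5-sample lag.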
  • the first correlation signal is determined by calculating the cross-correlation function between the synchronization signal and the second frequency band component in the mixed signal 25 sent back by the microphone pickup unit 12.
  • the synchronization signal is demodulated.
  • the cross-correlation function is a cross-correlation function between a demodulated synchronization signal and a second frequency band component.
• the time τ′₁(t₁) corresponding to the maximum value of the cross-correlation function between the two signals h(t) and s(t) is taken as the time delay between the second frequency band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal.
  • the delay determining unit 1110 may determine the time delay between the second frequency band component and the synchronization signal in the mixed signal 25 sent back by the microphone pickup unit 12 by:
  • the sum of the determined time and the filtering delay introduced by the filter 1116 is taken as the time delay.
  • the advantage of this embodiment is that the influence of the filtering of the filter on the delay is fully taken into account, thereby improving the accuracy of determining the time delay and performing echo cancellation.
  • the echo cancellation unit 1111 performs echo cancellation based on the mixed signal 25 sent back by the microphone pickup unit 12 and uses the to-be-played sound signal delayed by the determined time delay to obtain the collected human voice signal.
  • the echo cancellation unit 1111 needs to delay the sound signal to be played generated by the playback signal source 1108 based on the determined time delay.
  • it can be implemented by a delay circuit built in the echo cancellation unit 1111.
  • it may be implemented by a delay unit 1118 provided between the delay determination unit 1110 and the echo cancellation unit 1111, as shown in FIG.
  • the delay unit 1118 is configured to delay the sound signal to be played by using the time delay determined according to the delay determination unit 1110, so that the echo cancellation unit 1111 performs echo cancellation.
  • the device main body 11 further includes: a first buffer 1114, configured to buffer a sound signal to be played generated by the playback signal source 1108.
  • the first buffer 1114 may be connected between the playback signal source 1108 and the delay unit 1118. In this way, after the playback signal source 1108 generates a sound signal to be played, it can be buffered in the first buffer 1114.
  • the delay unit 1118 may use the time delay to delay the sound signal to be played in the first buffer 1114.
• the duration for which the first buffer 1114 buffers the sound signal to be played is at least the sum of the transmission duration of the sound signal played by the speaker 1101 to the microphone pickup unit 12 and the transmission duration of the sound signal output from the microphone pickup unit 12 back to the device main body 11.
• a test sound signal can be played by the playback signal source 1108; by recording the time when the test sound signal is played by the speaker 1101 and the time when the test sound signal is received by the microphone pickup unit 12, the transmission duration of the sound signal played by the speaker 1101 to the microphone pickup unit 12 can be found.
• the significance of setting the duration for which the first buffer 1114 buffers the sound signal to be played to be at least the sum of the transmission duration of the sound signal played by the speaker 1101 to the microphone pickup unit 12 and the transmission duration of the sound signal output from the microphone pickup unit 12 back to the device main body 11 is as follows: if the buffering duration is set to be shorter than this sum, then when the delay unit 1118 delays the sound signal to be played according to the time delay determined by the delay determining unit 1110, it may find that the corresponding signal segment has already been removed from the first buffer 1114.
  • the echo cancellation unit 1111 may directly perform echo cancellation on the mixed signal 25 sent back by the microphone pickup unit 12 by using the to-be-played sound signal delayed by the determined time delay.
• since the mixed signal 25 sent back by the microphone pickup unit 12 also includes a component of the second frequency band, and because the component of the second frequency band is inaudible to the human ear, the received mixed signal 25 sent back by the microphone pickup unit 12 can also be sent directly to the echo cancellation unit 1111 for echo cancellation.
• In another embodiment, the mixed signal 25 sent back by the microphone pickup unit 12 needs to be sent to the down-sampler 1117 first, processed by the down-sampler 1117, and then sent to the echo cancellation unit 1111 for echo cancellation.
• the down-sampler 1117 is used to convert the sampling frequency of the mixed signal 25 sent back from the microphone pickup unit 12 from the sampling frequency used for playback (such as 48 kHz or 96 kHz) to the sampling frequency used for human voice recognition (such as 16 kHz).
• After processing by the down-sampler 1117, the components in the second frequency band are naturally eliminated and will not enter the echo cancellation unit 1111. In this way, the echo cancellation unit 1111 can use the sound signal to be played, delayed by the determined time delay, to perform echo cancellation on the received signal in the first frequency band (the frequency band consistent with the sound signal to be played), thereby improving the quality of echo cancellation.
  • the echo cancellation unit 1111 may be implemented by using an echo cancellation circuit or the like.
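For illustration, one conventional realization of such an echo cancellation circuit is a normalized least-mean-squares (NLMS) adaptive filter, where the delayed to-be-played signal serves as the reference and the filter's echo estimate is subtracted from the microphone signal. This is a textbook sketch under that assumption, not the specific circuit of this disclosure; all names and parameters are hypothetical.

```python
# Minimal NLMS adaptive echo canceller sketch: ref is the delayed
# to-be-played signal, mic is the microphone mixture; the residual e
# approximates the near-end (user) voice once the filter converges.

def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        x = [ref[n - j] if n - j >= 0 else 0.0 for j in range(taps)]
        y = sum(wj * xj for wj, xj in zip(w, x))   # echo estimate
        e = mic[n] - y                             # residual signal
        norm = sum(xj * xj for xj in x) + eps
        w = [wj + mu * e * xj / norm for wj, xj in zip(w, x)]
        out.append(e)
    return out
```

When the microphone signal is purely a scaled, delayed copy of the reference, the residual decays toward zero as the filter adapts.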
  • the to-be-played sound signal delayed by the determined time delay output from the delay unit 1118 is also down-sampled by the down-sampler 1119.
  • the down-sampler 1119 converts the sampling frequency from the sampling frequency used for playback to the sampling frequency used for human voice recognition, for the echo cancellation unit 1111 to perform echo cancellation.
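The down-sampling step can be sketched crudely as follows. This is only an illustration of the rate conversion (48 kHz to 16 kHz is a factor of 3); a real down-sampler such as 1117 or 1119 would use a proper anti-aliasing FIR filter rather than the moving average assumed here.

```python
# Simplified down-sampler sketch: a crude 3-point moving-average low-pass
# followed by keeping every third sample (48 kHz -> 16 kHz).

def downsample_48k_to_16k(x):
    smoothed = [(x[n] + x[n - 1] + x[n - 2]) / 3.0 if n >= 2 else x[n]
                for n in range(len(x))]
    return smoothed[::3]
```

A 48-sample input yields 16 output samples, and a constant signal passes through unchanged.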
• a receiver 1122 is further provided in the device main body 11 for receiving the mixed signal 25 sent back by the microphone pickup unit 12. After the receiver 1122 receives the mixed signal 25 sent back by the microphone pickup unit 12, the mixed signal is sent on the one hand to the filter 1116 to filter out the components of the second frequency band for determining the time delay, and on the other hand through the down-sampler 1117 to the echo cancellation unit 1111 for echo cancellation.
• the device main body 11 further includes a voice enhancement unit 1120 connected to the output of the echo cancellation unit 1111 and used to remove the interference noise 23 in the environment from the output of the echo cancellation unit 1111 by signal separation algorithms such as beamforming and voice noise reduction.
• the voice enhancement unit 1120 is also used to combine the user's voice collected by the multi-channel microphone units 1201 on which the echo cancellation unit 1111 has completed echo cancellation, so as to obtain an enhanced human voice signal.
  • the device main body 11 further includes a voice recognition unit 1121 configured to perform voice recognition on the collected human voice signals.
  • the speech recognition unit 1121 is connected to the output of the speech enhancement unit 1120, performs speech recognition on the enhanced human voice signal, and recognizes it as text.
  • the device main body 11 further includes a control unit 1124 configured to perform a control action based on a control instruction in a voice recognition result.
  • it is connected to the output of the speech recognition unit 1121, and performs a control action based on the recognized text output by the speech recognition unit 1121.
• the control unit 1124 may store language patterns corresponding to various actions; for example, corresponding to the action of "turning up the volume", the corresponding language patterns may include "turn up the volume", "a louder voice", "make the sound louder", and many other language patterns.
• the actions and the corresponding language patterns are stored in advance in a language-pattern/action lookup table in the control unit 1124, so that when the control unit 1124 receives the speech recognition result, it matches the speech recognition result against the language patterns in the lookup table, finds the action corresponding to the matching language pattern, and executes that action.
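The lookup-table matching above can be sketched as a simple dictionary search. The table contents and function names here are hypothetical examples, not the actual table of the control unit 1124.

```python
# Hypothetical sketch of the language-pattern/action lookup table: each
# action is stored with several language patterns, and the recognized text
# is matched against them by substring search.

LOOKUP = {
    "turn_up_volume": ["turn up the volume", "a louder voice", "louder"],
    "turn_down_volume": ["turn down the volume", "quieter"],
}

def find_action(recognized_text):
    text = recognized_text.lower().strip()
    for action, patterns in LOOKUP.items():
        if any(p in text for p in patterns):
            return action
    return None  # no matching language pattern
```

A recognized phrase containing a stored pattern maps to its action; unrelated text maps to no action.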
• the control action performed may be a control action on the playback of the speaker, for example, "play XXX", "turn up the volume", and the like.
• the control action performed may be a control action on the playback of the TV, for example, "change to XX channel", "turn up the volume", and the like.
• the control action performed may be a control action on the navigation, for example, "change destination to XXX", "turn up the navigation volume", "change the navigation display from north-up to head-up", and so on.
  • the voice recognition unit 1121 and the control unit 1124 are not needed.
  • the specific structure of the microphone pickup unit 12 is described in detail below with reference to FIG. 6.
• the microphone pickup unit 12 has a transmitter 1204 for transmitting to the device main body 11 the collected user voice 24, the echo 22 of the sound signal played by the device main body 11 in space (referred to simply as the horn echo), and the interference noise 23 in the environment (referred to simply as the ambient noise).
• the transmitter 1204 has a standby mode and a transmission mode. In the standby mode, the transmitter 1204 does not work. When a human voice signal is recognized from the signal sent from the microphone unit 1201, the transmitter enters the transmission mode from the standby mode. In the microphone pickup unit 12, the power consumption of the transmitter 1204 is the most significant. Since the standby mode consumes essentially no power, the transmitter 1204 is put into the transmission mode only when a human voice is recognized from the microphone unit 1201, and is put into the standby mode when no human voice is recognized. The standby power consumption of the microphone pickup unit 12 is thus greatly reduced, which solves the problem of the large power consumption of the microphone pickup unit 12.
  • the microphone pickup unit 12 includes a microphone unit 1201, a human voice recognition unit 1203, and a buffer unit 1202.
  • the microphone unit 1201 receives a user voice 24, an echo 22 of a sound signal played by the device main body 11 in space, and an interference noise 23 in the environment, and converts the received sound signal into a digital signal.
• the sampling frequency of the microphone unit 1201 is consistent with the sampling frequency of the sound signal to be played and the synchronization signal, and is greater than or equal to twice the highest frequency of the synchronization signal. The advantage of selecting the sampling frequency in this way is that it can prevent frequency aliasing and improve sampling quality, thereby improving the quality of the human voice signals collected in the embodiments of the present application.
• the buffer unit 1202 is configured to buffer the signals collected by the microphone unit 1201, that is, the digital signal converted from the received user voice, the echo 22 of the sound signal played by the device main body 11 in space, and the interference noise 23 in the environment.
• the buffer unit 1202 is a circular buffer unit, that is, when the amount of data entering the buffer unit 1202 exceeds the buffer capacity Tc of the buffer unit 1202, the data that entered the buffer unit 1202 earliest is removed first according to the first-in-first-out (FIFO) principle, so that the amount of data buffered by the buffer unit 1202 never exceeds the buffer capacity Tc.
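The circular FIFO behavior described above can be sketched with a fixed-capacity deque. This is a generic illustration; the class and method names are hypothetical.

```python
# Sketch of a circular buffer: once capacity Tc is exceeded, the oldest
# samples fall out automatically (FIFO), so the stored amount never
# exceeds Tc.

from collections import deque

class CircularBuffer:
    def __init__(self, capacity_tc):
        self.buf = deque(maxlen=capacity_tc)

    def push(self, samples):
        self.buf.extend(samples)  # oldest entries evicted automatically

    def release(self):
        """Hand the buffered samples onward (e.g., to a transmitter) and clear."""
        out = list(self.buf)
        self.buf.clear()
        return out
```

Pushing five samples into a capacity-4 buffer leaves only the newest four.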
  • the buffer capacity Tc is set to be not less than the detection delay caused by the human voice recognition of the human voice recognition unit 1203.
• the human voice recognition unit 1203 is configured to recognize a human voice from the output of the microphone unit 1201 and, when a human voice is recognized, trigger the transmitter 1204 to enter the transmission mode and transmit the digital signal, buffered by the buffer unit 1202, that was converted from the user voice, the echo 22 of the sound signal played by the device main body 11 in space, and the interference noise 23 in the environment.
  • the human voice recognition unit 1203 may be implemented by using a human voice recognition module.
• the function of providing the buffer unit 1202 is as follows: with existing human voice recognition technology, detecting a human voice from an input voice signal requires a certain amount of arithmetic processing, so there is a detection delay.
• when the human voice recognition unit 1203 recognizes a human voice and triggers the transmitter 1204 to enter the transmission mode, if the human voice were not buffered, then in the time period from when the human voice signal is received until it is recognized by the human voice recognition unit 1203, the transmitter 1204 would not be working and the human voice signal would not be transmitted; transmission would only begin after recognition, so one time period of the human voice signal would be lost.
• with the buffer, once the transmitter 1204 enters the transmission mode, the human voice signal buffered by the buffer unit 1202 is released, thereby avoiding signal loss during human voice recognition.
• the buffer capacity Tc is set to be not less than the detection delay caused by the human voice recognition performed by the human voice recognition unit 1203. This ensures that the human voice signal is not lost at least during the detection delay period, and improves the integrity of the final human voice signal.
• if, when no human voice signal has been recognized, the human voice recognition unit 1203 recognizes a human voice signal from the mixed signal 25 sent from the microphone unit 1201, it sends a start trigger to the transmitter 1204, and in response the transmitter 1204 enters the transmission mode from the standby mode.
• if, when a human voice signal is being recognized, the human voice recognition unit 1203 no longer recognizes a human voice signal from the mixed signal 25, it sends a stop trigger to the transmitter 1204, and in response the transmitter 1204 enters the standby mode from the transmission mode.
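The start/stop trigger behavior above amounts to a small two-state machine, which can be sketched as follows. Class and method names are hypothetical illustrations, not part of the disclosure.

```python
# Sketch of the transmitter mode switching driven by start/stop triggers
# from the voice recognition stage: standby consumes no power and sends
# nothing; transmission mode forwards the buffered signal.

class Transmitter:
    STANDBY, TRANSMIT = "standby", "transmit"

    def __init__(self):
        self.mode = Transmitter.STANDBY

    def on_start_trigger(self):
        if self.mode == Transmitter.STANDBY:
            self.mode = Transmitter.TRANSMIT

    def on_stop_trigger(self):
        if self.mode == Transmitter.TRANSMIT:
            self.mode = Transmitter.STANDBY

    def send(self, buffered):
        # only transmits in transmission mode; otherwise nothing is sent
        return list(buffered) if self.mode == Transmitter.TRANSMIT else []
```

Nothing is sent until a start trigger arrives, and sending stops again after a stop trigger.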
  • the microphone pickup unit 12 may include a plurality of microphone units 1201.
• one of the microphone units is connected to the human voice recognition unit 1203. For example, a first microphone unit may be connected to the human voice recognition unit 1203, as shown in FIG. The second or third microphone unit may also be designated to be connected to the human voice recognition unit 1203.
• when the human voice recognition unit 1203 detects a human voice from the output of the connected microphone unit 1201, the human voice recognition unit 1203 triggers the transmitter 1204 to enter the transmission mode, and the mixed signals collected by each microphone unit 1201 and buffered by the buffer unit 1202 are sent.
  • the plurality of microphone units 1201 simultaneously collect user voice 24, and also simultaneously collect echoes 22 of sound signals played by the speaker 1101, and interference noises 23 in the environment.
• since the sound signal p(t) played by the speaker 1101 is actually the sum of the synchronization signal s(t) and the sound signal to be played r(t), that is, formula 4 is satisfied, the sound signal received by the i-th microphone unit 1201 (1 ≤ i ≤ n, where n is the total number of microphone units) is:

m_i(t) = g_i(t) ∗ s(t) + k_i(t) ∗ r(t) + (user voice 24) + (interference noise 23)

• where m_i(t) represents the sound signal received by the i-th microphone unit 1201, g_i(t) is the transfer function of s(t) transmitted between the speaker 1101 and the i-th microphone unit 1201, and k_i(t) is the transfer function of r(t) transmitted between the speaker 1101 and the i-th microphone unit 1201, with ∗ denoting convolution.
• because the multiple microphone units 1201 collect the user voice 24, the speaker echo 22, and the interference noise 23 in the environment at the same time, multiple collected signals are obtained separately. These multiple collected signals are processed separately in the device main body 11, and echo cancellation is performed on each separately to generate individual human voice signals; finally, these human voice signals are combined to enhance the resulting human voice signal, thereby overcoming the disadvantages that the user voice signal collected by a single microphone unit 1201 is prone to be weak and is easily affected by the deterioration of the performance of the microphone unit 1201 or other components in the path.
  • the microphone units 1201 are located in the same microphone pickup unit 12, they can be considered to receive human voice signals at almost the same time. Therefore, any one of the microphone units can be connected to the human voice recognition unit 1203, so that the human voice recognition unit 1203 recognizes human voice.
  • the human voice recognition unit 1203 may be connected to each microphone unit 1201.
• when the human voice recognition unit 1203 detects a human voice from the output of any of the connected microphone units 1201, the transmitter 1204 is triggered to enter the transmission mode, and the digital signals converted from the human voice signals, the interference noise in the environment, and the horn echo collected by each microphone unit 1201 and buffered by the buffer unit 1202 are transmitted.
  • the earliest microphone unit 1201 among the plurality of microphone units 1201 that receives the human voice signal will send the generated digital signal to the human voice recognition unit 1203 at the earliest, so that the human voice recognition unit 1203 starts to recognize.
  • the transmitter 1204 is triggered to start sending digital signals.
  • the human voice recognition unit 1203 can recognize in response to the earliest user voice received, and trigger the transmitter 1204 to enter a transmission mode, thereby transmitting a digital signal as early as possible.
• since there are multiple microphone units 1201, the buffer unit 1202 also buffers multiple digital signals obtained by converting the collected human voice signals, the interference noise in the environment, and the speaker echo.
• when the human voice recognition unit 1203 recognizes the human voice and triggers the transmitter 1204 to enter the transmission mode, the transmitter 1204 can respectively send the multiple digital signals through multiple channels (such as wireless communication channels) to the receiver 1122 of the device main body 11.
• the transmitter 1204 may multiplex the multiple digital signals before transmitting them, multiplexing them into one communication channel, and send the multiplexed signal to the receiver 1122.
• the multiplexing method includes, for example, encapsulation: the digital packet of each channel is given the identification (ID) of that channel and encapsulated into a sub-package; the digital signals of all channels are encapsulated into multiple sub-packages, and the multiple sub-packages are re-encapsulated into one large packet, which is sent to the receiver 1122 through a single channel.
  • Other multiplexing methods such as time division multiplexing and frequency division multiplexing can also be used. This embodiment is beneficial for reducing the occupation of channels and the rational use of resources.
• the receiver 1122 in the device main body 11 may demultiplex the multiplexed signal sent by the transmitter 1204, that is, parse from the multiplexed signal each digital signal, namely the digital signals obtained by each microphone unit 1201 by converting the collected human voice signals, the interference noise in the environment, and the speaker echo.
• the demultiplexing method includes, for example, decapsulation: the large packet is decapsulated to take out each sub-package, each sub-package is then decapsulated, and according to the sub-package identification (ID), the content extracted from each sub-package is attributed to the corresponding channel, so that each digital signal is sent along its own path to the down-sampler 1117 for down-sampling.
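The encapsulation/decapsulation scheme above can be sketched as follows. The wire format chosen here (a 1-byte channel ID followed by a 2-byte length) is purely illustrative; the disclosure does not specify a particular packet layout.

```python
# Sketch of the multiplexing scheme: each channel's bytes become a
# sub-package tagged with its channel ID, sub-packages are joined into one
# large packet, and the receiver splits them back out.

import struct

def multiplex(channels):
    """channels: dict {channel_id: bytes} -> one large packet."""
    packet = b""
    for cid, data in channels.items():
        packet += struct.pack(">BH", cid, len(data)) + data
    return packet

def demultiplex(packet):
    """Inverse of multiplex: recover {channel_id: bytes} from the packet."""
    channels, pos = {}, 0
    while pos < len(packet):
        cid, length = struct.unpack_from(">BH", packet, pos)
        pos += 3
        channels[cid] = packet[pos:pos + length]
        pos += length
    return channels
```

A multiplex/demultiplex round trip returns the original per-channel data unchanged.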
  • the receiver 1122 of the device main body 11 also has a standby mode and a reception mode.
  • the receiver 1122 does not work in the standby mode.
• the receiver 1122 enters the receiving mode from the standby mode. Since the receiver 1122 does not work in the standby mode, it does not consume power; only in the receiving mode does the receiver 1122 consume power. Therefore, the power consumption of the device main body 11 is greatly reduced.
  • the receiver 1122 has a wireless signal sensing unit 1123.
• when the wireless signal sensing unit 1123 senses a wireless signal, the receiver 1122 enters the receiving mode from the standby mode.
  • the wireless signal sensing unit 1123 may be implemented by a wireless signal sensor or the like.
• the wireless signal sensing unit 1123 generates a start trigger after sensing the wireless signal. In response to the start trigger, the receiver 1122 enters the receiving mode from the standby mode.
• when the wireless signal disappears, a stop trigger is generated. In response to the stop trigger, the receiver 1122 enters the standby mode from the receiving mode. Since the wireless signal sensing unit uses a wireless signal sensor, its power consumption is far less than the power consumption of the receiver during operation. Therefore, in this way, the power consumption of the receiver 1122 of the device main body 11 can be greatly reduced.
  • the receiver 1122 sends the demultiplexed digital signals to the down-sampler 1117 according to different paths, and the down-sampler 1117 performs down-sampling.
  • the down-sampler 1117 sends the signals of each channel obtained after the down-sampling to the echo cancellation unit 1111, respectively.
  • the echo cancellation unit 1111 performs echo cancellation on each of the demultiplexed signals, obtains the human voice signals of each channel after the echo cancellation, and inputs the human voice signals to the voice enhancement unit 1120.
  • the voice enhancement unit 1120 combines the human voice signals obtained after the echo cancellation unit 1111 completes the echo cancellation to obtain an enhanced human voice signal.
• the delay determining unit 1110 may determine the second frequency band component, collected by one of the multiple microphone units 1201, in the mixed signal 25 sent by the microphone pickup unit 12 back to the device main body 11, and its time delay with the synchronization signal.
• the echo cancellation unit 1111 performs echo cancellation on the mixed signals 25 collected by all the microphone units 1201 and sent by the microphone pickup unit 12 back to the device main body 11, using the to-be-played sound signal delayed by the determined time delay.
  • the filter 1116 may target multiple channels (each channel corresponds to a microphone unit 1201), may filter out a plurality of different second frequency band components, and send the components to the delay determining unit 1110 to determine the time delay.
• the delay determining unit 1110 determines, for the second frequency band component it receives first, the time delay between that component and the synchronization signal, uses the determined time delay as the time delay between all the second frequency band components and the synchronization signal, and sends it to the delay unit 1118 to delay the sound signal to be played, obtaining the time-delayed sound signal to be played.
  • the time-delayed sound signal to be played is input to the echo cancellation unit 1111, so that the echo cancellation unit 1111 performs echo cancellation on all channels.
  • This embodiment is based on the theory that the time delays of all channels are basically consistent. In this way, the second frequency band component received first can be used to quickly obtain a time delay and immediately apply it to echo cancellation on all channels, which improves echo cancellation efficiency.
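The delay determination described above can be sketched in code. The following Python fragment is an illustrative assumption, not part of the patent: it estimates the lag of the filtered second-frequency-band component against the reference synchronization signal by maximizing a sliding correlation, and the single resulting lag would then be applied to all channels.

```python
# Hedged sketch: delay estimation by sliding cross-correlation between the
# stored synchronization signal and the second-band component filtered from
# the first channel received. All names and sizes are illustrative.
import random

def estimate_delay(sync_ref, received, max_lag):
    """Return the lag (in samples) at which `received` best matches `sync_ref`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(r * s for r, s in zip(received[lag:], sync_ref))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

random.seed(0)
sync = [random.choice((-1.0, 1.0)) for _ in range(256)]   # pseudo-random sequence
true_delay = 37
# Simulated received channel: attenuated sync component arriving 37 samples late.
channel = [0.0] * true_delay + [0.6 * s for s in sync] + [0.0] * 50

print(estimate_delay(sync, channel, max_lag=100))  # → 37
```

A pseudo-random sequence is well suited to this because its autocorrelation is sharply peaked, so the correlation score is large only at the true lag.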
  • the delay determining unit 1110 may respectively determine, for each second frequency band component in the mixed signals 25 collected by the multiple microphone units 1201 and sent back to the device main body 11 by the microphone pickup unit 12, its own time delay from the synchronization signal, that is, the time delay of each channel, where each channel corresponds to a microphone unit 1201.
  • the delay unit 1118 also delays the to-be-played sound signals buffered in the first buffer 1114 according to the time delay determined for each channel, to obtain multiple delayed to-be-played sound signals.
  • the echo cancellation unit 1111 uses the multiple delayed sound signals to be played to perform echo cancellation on the output of the down-sampler 1117 for each channel. That is, the echo cancellation unit 1111 uses the sound signals to be played, delayed according to the respectively determined time delays, to perform echo cancellation on the mixed signals 25 collected by the corresponding microphone units 1201 and sent back to the device main body 11 by the microphone pickup unit 12.
  • This embodiment is based on the fact that, although the time delays of all the channels are generally the same, in theory there are slight differences among them. Determining the time delay of each channel separately improves the accuracy of echo cancellation.
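As an illustrative sketch of this per-channel variant (the sample-shift model and all names are assumptions, not from the patent), each channel's own estimated delay is applied separately to the buffered to-be-played signal before echo cancellation:

```python
# Hypothetical sketch: one delayed copy of the playback reference per channel,
# produced by zero-padding the front by that channel's estimated delay.
def delay_reference(to_be_played, delay_samples):
    """Shift the reference signal right by `delay_samples`, zero-padding the front."""
    return [0.0] * delay_samples + list(to_be_played)

per_channel_delays = [37, 38, 36]          # e.g. one estimate per microphone unit
reference = [0.1, 0.2, 0.3, 0.4]
delayed_refs = [delay_reference(reference, d) for d in per_channel_delays]
print([len(r) for r in delayed_refs])      # → [41, 42, 40]
```

Each delayed copy would then be fed to the echo cancellation of the matching channel only.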
  • In another embodiment, there are multiple microphone pickup units 12 while there is only one device main body 11, and there are multiple receivers 1107 in the device main body 11 that correspond one to one to the multiple microphone pickup units 12. Corresponding means that each receiver only receives signals from its corresponding microphone pickup unit 12, and either does not receive signals from other microphone pickup units 12 or discards them after receiving them.
  • the receiver 1107 and the corresponding microphone pickup unit 12 are connected by wire, so that the signal sent by the microphone pickup unit 12 can only be received by the corresponding receiver 1107 connected to it.
  • In another embodiment, the receiver 1107 and its corresponding microphone pickup unit 12 communicate wirelessly.
  • the microphone pickup unit 12 may broadcast its signal, but the broadcast signal contains the unique identifier of that microphone pickup unit 12. Different microphone pickup units 12 have different pickup unit identifiers, and the signals broadcast by each microphone pickup unit 12 carry the corresponding pickup unit identifier.
  • the receiver 1107 determines whether the pickup unit identifier carried in the signal is the identifier of the microphone pickup unit 12 corresponding to itself. If so, it keeps the signal; if not, it discards the signal. This embodiment avoids the cluttered wiring in a room caused by wired connections.
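The identifier check can be sketched minimally. The frame format and field names below are assumptions for illustration only; the patent does not specify a wire format.

```python
# Minimal sketch (assumed message format): the receiver keeps only broadcast
# frames whose pickup-unit identifier matches its paired unit and drops the rest.
def filter_frames(frames, paired_unit_id):
    """Keep payloads from the paired microphone pickup unit; discard others."""
    return [f["payload"] for f in frames if f["unit_id"] == paired_unit_id]

frames = [
    {"unit_id": "MIC-A", "payload": b"\x01\x02"},
    {"unit_id": "MIC-B", "payload": b"\x03\x04"},
    {"unit_id": "MIC-A", "payload": b"\x05\x06"},
]
print(filter_frames(frames, "MIC-A"))  # → [b'\x01\x02', b'\x05\x06']
```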
  • When multiple receivers 1107 each receive a signal and generate a corresponding signal flow, while there is only one of each other component in the device main body 11, those other components can process the input signals and generate output signals in the order in which the input signals are received.
  • multiple receivers 1107 send the received mixed signal 25 to the filter 1116.
  • the order in which the filter 1116 receives the mixed signals 25 is necessarily sequential, so it can filter out the second frequency band components of the input mixed signals 25 in that order and output them to the delay determining unit 1110. Since the delay determining unit 1110 receives the second frequency band components at different times, it likewise processes them in the order in which they are input.
  • the multiple receivers 1107 also send the received mixed signal 25 to the down-sampler 1117.
  • the order in which the down-sampler 1117 receives the mixed signals 25 sent from the multiple receivers 1107 is necessarily in order. In this way, the down-sampler 1117 performs down-sampling in accordance with the order in which the mixed signals 25 are received.
  • the signals that the down-sampler 1117 receives from the receivers are also multiple; the down-sampler 1117 receives these multiple signals in parallel and, after performing down-sampling, sends the down-sampled signals to the echo cancellation unit 1111 in parallel.
  • the delay determining unit 1110 determines, for each receiver 1122, the second frequency band component in the mixed signal 25 received by that receiver and its time delay from the synchronization signal (in the order of the second frequency band components filtered out by the filter 1116).
  • the delay unit 1118 delays the sound signal to be played in the first buffer 1114 according to the respectively determined time delays.
  • based on the mixed signal 25 received by each receiver, the echo cancellation unit 1111 performs echo cancellation using the sound signal to be played delayed by the correspondingly determined time delay, to obtain the respectively collected human voice signals. The echo cancellation unit 1111 therefore still obtains multiple human voice signals.
  • the voice enhancement unit 1120 can superimpose the multiple human voice signals output by the echo cancellation unit 1111, thereby enhancing the human voice signal.
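The superposition step can be illustrated as follows. The patent does not prescribe a specific combining method; a sample-wise average is used here purely as an assumed, minimal example.

```python
# Illustrative sketch: voice enhancement modeled as a sample-wise average of
# the per-channel human voice signals obtained after echo cancellation.
def combine_channels(channels):
    """Average equal-length channel signals sample by sample."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]

ch1 = [0.9, 1.1, 1.0]   # channel outputs with uncorrelated residual noise
ch2 = [1.1, 0.9, 1.0]
print(combine_channels([ch1, ch2]))  # → [1.0, 1.0, 1.0]
```

Averaging reinforces the common voice component while uncorrelated residual noise partially cancels, which is the intuition behind the enhancement.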
  • After the user utters a voice, since the distance between each microphone pickup unit 12 and the user differs, the strength of the voice signal each receives also differs. By placing microphone pickup units 12 at different positions in the room, processing the signals sent by each microphone pickup unit 12, and combining the processing results, it helps overcome the problem that a microphone pickup unit 12 at a single position may be far from the user, making the accuracy of human voice recognition low.
  • the receiver 1107 is connected to all the microphone pickup units 12 by wire. In another embodiment, the receiver 1107 is in wireless communication with all the microphone pickup units 12. The receiver 1107 receives signals sent by all the microphone pickup units 12, but selects only one of them for processing according to a predetermined criterion, discarding other mixed signals.
  • the predetermined criterion is to retain, for processing, the signal from the microphone pickup unit 12 that is received first, and discard the signals transmitted from the other microphone pickup units 12.
  • the microphone pickup units 12 have different pickup unit identifiers, and the signals sent by each microphone pickup unit 12 carry the corresponding pickup unit identifier. If the receiver 1107, after receiving a signal with one pickup unit identifier, then receives a signal with another pickup unit identifier, it discards the signal with the other pickup unit identifier. In other words, there is only one receiver 1107, and it cannot process signals from multiple microphone pickup units 12 at the same time; therefore, it retains only the signal from the microphone pickup unit 12 that was received earliest, and discards any signal received afterwards with another pickup unit identifier.
  • This embodiment is based on the consideration that the user voice picked up by each microphone pickup unit 12 will not differ much, so only the earliest received signal from a microphone pickup unit 12 need be retained. Processing only the earliest received signal is beneficial to improving the speed of echo cancellation.
  • the predetermined criterion may be selecting signals from the microphone pickup unit 12 closest to the user, and discarding signals from other microphone pickup units 12. This selection is because the closer the microphone pickup unit 12 is to the user, the greater the amount of human voice collected, which is more conducive to improving the human voice recognition effect.
  • This can be achieved by adding a time stamp to the mixed signal 25 when the microphone pickup unit 12 sends it, indicating the time at which the microphone pickup unit 12 received the user voice. In this way, the receiver 1107 can select, according to the time sequence indicated by the time stamps, the mixed signal 25 with the earliest time stamp for processing. Because the user's voice reaches the nearest pickup unit first, the mixed signal 25 with the earliest time stamp has the highest user voice quality, which is conducive to improving the effect of voice recognition.
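The timestamp-based selection reduces to choosing the minimum. The field names in this Python sketch are illustrative assumptions; the patent only states that a time stamp accompanies the signal.

```python
# Hedged sketch: among signals received from several pickup units, keep the
# one whose embedded time stamp is earliest and discard the rest.
def pick_earliest(signals):
    """Return the mixed signal whose time stamp is smallest."""
    return min(signals, key=lambda s: s["timestamp"])

signals = [
    {"unit_id": "MIC-B", "timestamp": 1002.5, "samples": [2]},
    {"unit_id": "MIC-A", "timestamp": 1002.1, "samples": [1]},
]
print(pick_earliest(signals)["unit_id"])  # → MIC-A
```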
  • the delay determining unit 1110 determines only the second frequency band component of the mixed signal 25 sent by the one microphone pickup unit 12 and the time delay from the synchronization signal.
  • the delay unit 1118 delays the sound signal to be played in the first buffer 1114 according to the determined time delay.
  • the echo cancellation unit 1111 performs echo cancellation, based on the received mixed signal 25 sent by the one microphone pickup unit 12, using the sound signal to be played delayed by the determined time delay, to obtain a collected human voice signal.
  • the device main body 11 includes a local speaker 1101 and a receiver 1107, and a processing device 1103 located at a remote end 1104.
  • the delay unit 1118, the downsampler 1119, the echo cancellation unit 1111, the downsampler 1117, the speech enhancement unit 1120, the speech recognition unit 1121, and the control unit 1124 are all located in the processing device 1103 of the far end 1104.
  • the processing device 1103 communicates with the local speakers 1101 and the receiver 1107 through an Internet or telecommunications connection. After the user speaks, the microphone pickup unit 12 receives the user's voice 24, and sends the user's voice 24, the echo 22 of the sound played by the speaker, and the environmental noise 23 as a mixed signal 25 to the local receiver 1107.
  • the local receiver 1107 sends the received mixed signal to the processing device 1103 of the remote 1104 through the Internet or a telecommunication connection.
  • the processing device 1103, according to the process described in the previous embodiments in conjunction with FIGS. 5-8, removes the echo of the sound played by the speaker and the interference noise in the environment from the received mixed signal 25 to obtain a human voice signal, generates a control instruction according to the human voice signal, such as increasing or decreasing the volume, and sends the control instruction to the speaker 1101 through the Internet or a telecommunication connection, so as to control the volume played by the speaker 1101.
  • the advantage of this embodiment is that the processing device 1103, which does not need to be located where the signals are collected, is moved from the local site to the remote end, reducing the occupation of local space and at the same time making centralized network processing possible.
  • In another embodiment, a processing device 1103 located at the far end 1104 communicates with the speakers 1101 and receivers 1107 at multiple locations 2, and processes the signals sent by the microphone pickup units 12 at the multiple locations 2.
  • After a user at each location utters a voice, the local microphone pickup unit 12 receives the user voice 24, and sends the user voice 24, the echo 22 of the sound played by the speaker, and the interference noise 23 in the environment as a mixed signal 25 to the receiver 1107.
  • the receiver 1107 sends the received mixed signal 25 to the processing device 1103 of the far end 1104 through the Internet or a telecommunication connection.
  • the processing device 1103, according to the process described in the previous embodiments in conjunction with FIGS. 5-8, removes the echo of the sound played by the speaker and the interference noise in the environment from the received mixed signal 25 to obtain a human voice signal, generates a control instruction according to the human voice signal, and sends the control instruction to the speaker 1101 through the Internet or a telecommunication connection, so as to control the volume played by the speaker 1101.
  • the advantage of this embodiment is that the mixed signals 25 reported by the receivers 1107 are uniformly processed by the same processing device 1103, which realizes centralized processing, is beneficial to the efficient use of resources, and reduces the occupation of local resources.
  • the far end may have multiple processing devices 1103 and a scheduling module 1105.
  • the scheduling module 1105 is connected to a plurality of processing devices 1103.
  • Each local receiver 1107 is connected to multiple remote processing devices 1103 through a scheduling module 1105.
  • After the receiver 1107 receives the mixed signal 25 sent by the local microphone pickup unit 12, it sends the mixed signal 25 to the scheduling module 1105 through the Internet or a telecommunication connection, and the scheduling module 1105 designates a processing device 1103 to process the mixed signal 25.
  • the specific processing procedure is as described in the previous embodiment with reference to Figs. 5-8.
  • the advantage of this is that the pairing of processing devices 1103 with speakers 1101 and receivers 1107 is not fixed. When a mixed signal 25 sent by a local receiver 1107 is received, which processing device 1103 it is allocated to is determined by the scheduling module 1105 according to the current load of each processing device 1103, so that the processing load of the processing devices 1103 can be balanced and network resources can be allocated effectively.
  • the scheduling module 1105 designates the processing device 1103 that processes the mixed signal 25 received by the receiver 1107 based on the number of tasks currently being processed by each processing device 1103, where a task is the process of processing a signal sent by a receiver 1107 assigned by the scheduling module 1105, to obtain a collected human voice signal.
  • After a user at each location utters a voice, the local microphone pickup unit 12 receives the user voice 24, and sends the user voice 24, the echo 22 of the sound played by the speaker, and the interference noise 23 in the environment as a mixed signal 25 to the receiver 1107.
  • the receiver 1107 sends the received mixed signal 25 to the scheduling module 1105 of the far end 1104 through the Internet or a telecommunication connection.
  • the scheduling module 1105 allocates a processing device 1103 to the mixed signal 25 according to the number of tasks currently being processed by each processing device 1103.
  • the assigned processing device 1103 processes the mixed signal 25 according to the process described in the previous embodiments in conjunction with FIGS. 5-8, removes the echo of the sound played by the speaker and the interference noise in the environment from the received mixed signal 25 to obtain a human voice signal, generates a control instruction according to the human voice signal, and then sends the control instruction to the speaker 1101 via the scheduling module 1105 through the Internet or a telecommunication connection, so as to control the volume played by the speaker 1101.
  • When the scheduling module 1105 allocates a processing device 1103 and assigns the mixed signal 25 to it for processing, this can be regarded as the start of a task.
  • the control instruction generated according to the human voice signal is sent to the speaker 1101 via the scheduling module 1105 through the Internet or a telecommunication connection.
  • When the scheduling module 1105 receives a control instruction, this means the completion of the corresponding task. If a task has been started, that is, it has been assigned to a processing device 1103, but the control instruction returned by that processing device 1103 has not yet been received, the processing device 1103 is considered to be currently processing the task.
  • the scheduling module 1105 can allocate a processing device 1103 to the mixed signal according to the number of tasks currently being processed by each processing device 1103; for example, the received mixed signal 25 is allocated to the processing device 1103 with the fewest tasks currently being processed.
  • To associate control instructions with tasks, a task identifier may be added to the mixed signal 25, and the processing device 1103 is required to add the same task identifier to the corresponding control instruction it returns.
  • The advantage of the above embodiment, in which the processing device 1103 that processes the mixed signal 25 is designated according to the number of tasks currently being processed by each processing device 1103, is that processing devices 1103 can be allocated flexibly based on their current processing load, which is beneficial to load balancing among the processing devices and improves the efficiency of comprehensive resource utilization.
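The scheduling policy above can be sketched as follows. The class and method names are assumptions for illustration; the patent only specifies counting tasks in flight (a task starts when a mixed signal is assigned and completes when the control instruction comes back) and choosing the least-loaded device.

```python
# Illustrative sketch of least-loaded scheduling of mixed signals across
# processing devices, with task start/completion tracked per device.
class Scheduler:
    def __init__(self, device_ids):
        self.in_flight = {d: 0 for d in device_ids}  # tasks currently processing

    def assign(self):
        """Start a task on the processing device with the fewest tasks in flight."""
        device = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[device] += 1
        return device

    def complete(self, device):
        """A control instruction came back: the corresponding task is done."""
        self.in_flight[device] -= 1

sched = Scheduler(["proc-1", "proc-2"])
a = sched.assign()   # both idle; min() returns the first minimum: proc-1
b = sched.assign()   # proc-2 now has the smaller load
sched.complete(a)    # proc-1 becomes idle again
c = sched.assign()   # goes back to proc-1
print(a, b, c)  # → proc-1 proc-2 proc-1
```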
  • the far-field pickup device 1 includes a device main body 11 and a microphone pickup unit 12 with separated components.
  • the method includes:
  • Step 310: The device main body 11 generates a synchronization signal that is synchronized with a sound signal to be played and occupies a second frequency band different from a first frequency band where the sound signal to be played is located;
  • Step 320: The device main body 11 plays the synchronization signal together with the sound signal to be played;
  • Step 330: The device main body 11 receives the user voice collected by the microphone pickup unit 12 and the echo of the sound signal played by the device main body 11 after spatial propagation, and digitally converts the collected user voice and echo;
  • Step 340: The device main body 11 determines a second frequency band component in the signal sent back by the microphone pickup unit 12 and a time delay between it and the synchronization signal;
  • Step 350: Based on the signal sent back by the microphone pickup unit 12, the device main body 11 performs echo cancellation using the to-be-played sound signal delayed by the determined time delay to obtain the collected human voice signal.
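Step 350 culminates in echo cancellation with a delay-aligned reference. The patent does not name a specific cancellation algorithm; the Python sketch below uses a normalized-LMS (NLMS) adaptive filter, a common echo-cancellation technique, purely as an assumed illustration of how the delayed to-be-played signal can be subtracted from the mixed signal.

```python
# Hedged sketch: NLMS adaptive echo cancellation. `mic` is the mixed signal,
# `ref` is the to-be-played signal already aligned by the determined delay.
import random

def nlms_cancel(mic, ref, taps=4, mu=0.5, eps=1e-6):
    """Return the residual (estimated human voice) after echo cancellation."""
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))      # echo estimate
        e = mic[n] - y                                # residual signal
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out

random.seed(1)
ref = [random.uniform(-1, 1) for _ in range(2000)]
mic = [0.8 * r for r in ref]          # pure echo (attenuated playback), no voice
residual = nlms_cancel(mic, ref)
print(abs(residual[-1]) < 1e-3)       # echo largely cancelled → True
```

If the reference were misaligned (the synchronization problem the patent addresses), the filter could not track the echo path, which is why the delay determination of step 340 precedes this step.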
  • Before step 350, the method further includes:
  • the device main body 11 uses the time delay determined in step 340 to delay the sound signal to be played for echo cancellation.
  • Before step 340, the method further includes:
  • the device main body 11 filters out a second frequency band component from a signal sent back by the microphone pickup unit 12 for determining a time delay.
  • Before step 350, the method further includes:
  • the device main body 11 converts the sampling frequency of the signal sent back by the microphone pickup unit 12 from the sampling frequency used for playing the sound signal to be played to the sampling frequency used for human voice recognition, for use in echo cancellation.
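This rate conversion can be illustrated with plain decimation. The 48 kHz and 16 kHz figures are assumptions (typical playback and speech recognition rates, not values stated by the patent), and a real down-sampler would low-pass filter before decimating to avoid aliasing.

```python
# Illustrative sketch: convert from an assumed playback rate (48 kHz) to an
# assumed recognition rate (16 kHz) by keeping every third sample.
def downsample(signal, in_rate=48000, out_rate=16000):
    """Keep every (in_rate // out_rate)-th sample."""
    if in_rate % out_rate != 0:
        raise ValueError("rates must be integer-related for plain decimation")
    step = in_rate // out_rate
    return signal[::step]

samples = list(range(12))
print(downsample(samples))  # → [0, 3, 6, 9]
```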
  • Before step 340, the method further includes:
  • the device main body 11 receives a signal sent back by the microphone pickup unit 12.
  • the microphone pickup unit 12 has a transmitter 1204 for sending, to the device main body 11, the collected user voice and the echo of the sound signal played by the device main body after spatial propagation.
  • the transmitter 1204 has a standby mode and a transmission mode. In the standby mode, the transmitter 1204 does not work. When the user voice is recognized from the collected sound signal, the transmitter 1204 enters the transmission mode from the standby mode.
  • the microphone pickup unit 12 includes a microphone unit 1201, a human voice recognition unit 1203, and a buffer unit 1202.
  • the microphone unit 1201 is configured to collect a user voice and an echo of a sound signal played by the main body of the device after spatial propagation;
  • the buffer unit 1202 is configured to buffer a user voice and an echo collected by the microphone unit 1201;
  • the human voice recognition unit 1203 is configured to recognize a human voice from the output of the microphone unit 1201 and, when a human voice is recognized, trigger the transmitter 1204 to enter the transmission mode and send the user voice and echo buffered by the buffer unit 1202.
  • In one embodiment, one microphone unit 1201 is connected to the human voice recognition unit 1203, and the human voice recognition unit 1203 detects a human voice from the output of the connected microphone unit 1201. When a human voice is detected, the transmitter 1204 is triggered to enter the transmission mode and sends the user voices and echoes collected by the microphone units 1201 and buffered by the buffer unit 1202.
  • the user voices and echoes collected by each microphone unit 1201 and buffered by the buffer unit 1202 are multiplexed before sending.
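The multiplex/demultiplex pair can be sketched as simple sample interleaving. This framing is an assumption for illustration; the patent does not fix a concrete multiplexing scheme.

```python
# Assumed framing: samples from the microphone units are interleaved
# (multiplexed) before transmission and de-interleaved in the device main body.
def multiplex(channels):
    """Interleave equal-length channel sample lists into one stream."""
    return [s for frame in zip(*channels) for s in frame]

def demultiplex(stream, n_channels):
    """Split an interleaved stream back into per-channel lists."""
    return [stream[i::n_channels] for i in range(n_channels)]

chans = [[1, 2, 3], [10, 20, 30]]
stream = multiplex(chans)
print(stream)                  # → [1, 10, 2, 20, 3, 30]
print(demultiplex(stream, 2))  # → [[1, 2, 3], [10, 20, 30]]
```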
  • Before step 340, the method includes:
  • the device main body 11 demultiplexes the user voice and echo collected by the multiplexed microphone units 1201.
  • step 350 includes: the device main body 11 performs echo cancellation on the user voice collected by each of the demultiplexed microphone units 1201.
  • the method further includes:
  • the device main body 11 combines human voice signals obtained by performing echo cancellation on user voices collected by the demultiplexed microphone units 1201 to obtain enhanced human voice signals.
  • the method further includes:
  • the device main body 11 performs voice recognition on the collected human voice signals.
  • the method further includes:
  • the device main body 11 has a receiver 1122, and the receiver 1122 has a standby mode and a receiving mode.
  • The receiver 1122 does not work in the standby mode. In one embodiment, the receiver 1122 has a wireless signal sensing unit 1123; when the wireless signal sensing unit 1123 senses a wireless signal, the receiver 1122 enters the receiving mode from the standby mode.
  • the time for the buffer unit 1202 to buffer the user's voice and echo collected by the microphone unit 1201 is not less than the recognition delay of the human voice recognition unit 1203.
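The buffering constraint above can be demonstrated with a ring buffer. The sample rate, detector latency, and buffer length below are illustrative assumptions; the patent only requires that the buffer time be no less than the recognition delay, so that the audio preceding a detection is still available when the transmitter wakes up.

```python
# Sketch: a ring buffer whose capacity covers the human voice recognition
# delay, so samples that arrived before detection are not lost.
from collections import deque

SAMPLE_RATE = 16000
RECOGNITION_DELAY_S = 0.5     # assumed detector latency
BUFFER_S = 0.8                # buffer time >= recognition delay

buf = deque(maxlen=int(BUFFER_S * SAMPLE_RATE))
for sample in range(20000):   # 1.25 s of incoming samples; oldest are dropped
    buf.append(sample)

# The buffer still covers the full recognition-delay window:
print(len(buf) >= int(RECOGNITION_DELAY_S * SAMPLE_RATE))  # → True
```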
  • In another embodiment, the human voice recognition unit 1203 is connected to each microphone unit 1201, and when the human voice recognition unit 1203 recognizes a human voice from the output of any one of the connected microphone units 1201, the transmitter 1204 is triggered to enter the transmission mode and sends the user voices and echoes collected by the microphone units 1201 and buffered by the buffer unit 1202.
  • the second frequency band is an ultrasonic frequency band.
  • the sampling frequency of the microphone unit 1201 is consistent with the sampling frequency of the sound signal to be played and the synchronization signal, and is greater than or equal to 2 times the highest frequency of the synchronization signal.
  • the synchronization signal is a pseudo-random sequence after carrier modulation.
  • the selection of the pseudo-random sequence is based on at least one of the following:
  • the frequency of the carrier is selected based on its distance from 20 kHz.
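Generating such a synchronization signal can be sketched as follows. The 22 kHz carrier, 48 kHz sampling rate, and chip length are illustrative assumptions; the patent only requires that the second frequency band be distinct from the audible playback band (e.g. ultrasonic) and, per the earlier constraint, that the sampling rate be at least twice the highest synchronization frequency.

```python
# Hedged sketch: a pseudo-random ±1 chip sequence modulated onto an
# ultrasonic carrier. Parameter values are assumptions, not patent values.
import math
import random

def sync_signal(n_chips=64, samples_per_chip=8, fs=48000, fc=22000):
    """Carrier-modulated pseudo-random sequence occupying the second band."""
    random.seed(42)
    chips = [random.choice((-1.0, 1.0)) for _ in range(n_chips)]
    out = []
    for i in range(n_chips * samples_per_chip):
        carrier = math.cos(2 * math.pi * fc * i / fs)   # ultrasonic carrier
        out.append(chips[i // samples_per_chip] * carrier)
    return out

sig = sync_signal()
print(len(sig), max(abs(s) for s in sig) <= 1.0)  # → 512 True
```

Note that fs = 48 kHz satisfies the Nyquist requirement for the 22 kHz carrier (48 ≥ 2 × 22 would fail for 24 kHz, so the carrier must stay below fs/2, as here).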
  • Before step 320, the method further includes:
  • the device main body 11 buffers the sound signal to be played for echo cancellation, and buffers the synchronization signal for determining a time delay.
  • the synchronization signal or the sound signal to be played is buffered for at least the total of the transmission time of the played sound signal to the microphone pickup unit 12 and the transmission time of the signal sent back from the microphone pickup unit 12 to the device main body.
  • step 340 includes: taking the sum of the determined time and the delay introduced by the filtering as the time delay.
  • the step 340 includes: respectively determining each second frequency band component in the signals separately collected by the multiple microphone units 1201 and sent back to the device main body 11 by the microphone pickup unit 12, and a respective time delay from the synchronization signal; the step 350 includes: using the sound signals to be played, delayed according to the respectively determined time delays, to perform echo cancellation on the signals collected by the corresponding microphone units 1201 and sent back to the device main body 11 by the microphone pickup unit 12.
  • the step 340 includes: determining the second frequency band component of the signal collected by one of the plurality of microphone units 1201 and sent back by the microphone pickup unit 12, and its time delay from the synchronization signal; the step 350 includes: using the sound signal to be played, delayed by the determined time delay, to perform echo cancellation for all microphone units 1201.
  • the step 340 includes: separately determining a second frequency band component in the signal received by each receiver 1122 and its time delay from the synchronization signal; the step 350 includes: based on the signal received by each receiver 1122, performing echo cancellation using the sound signal to be played delayed by the correspondingly determined time delay, to obtain the respectively collected human voice signals.
  • each microphone pickup unit 12 has a different pickup unit identifier, and the signal sent by a microphone pickup unit 12 carries the corresponding pickup unit identifier. After receiving a signal, the receiver 1122 retains it if it carries the pickup unit identifier corresponding to the receiver 1122, and discards it if it does not.
  • step 340 includes: determining the second frequency band component of the received signal sent by one microphone pickup unit 12 and its time delay from the synchronization signal; step 350 includes: based on the received signal sent by that microphone pickup unit 12, performing echo cancellation using the sound signal to be played delayed by the determined time delay, to obtain a collected human voice signal.
  • each microphone pickup unit 12 has a different pickup unit identifier, and the signal sent by a microphone pickup unit 12 carries the corresponding pickup unit identifier. If, after receiving a signal with one pickup unit identifier, the receiver 1122 receives a signal with another pickup unit identifier, the signal with the other pickup unit identifier is discarded.
  • the device body 11 includes a local speaker 1101 and a receiver 1107, and a remote processing device 1103. Steps 310, 340, and 350 are all performed by the processing device 1103.
  • the far-end processing device 1103 communicates with the speakers 1101 and receivers 1107 at multiple locations, and processes the signals sent to the receiver 1122 by the microphone pickup units 12 at the multiple locations.
  • the receiver 1122 communicates with a plurality of remote processing devices 1103 through the scheduling module 1105.
  • the scheduling module 1105 designates a processing device 1103 that processes a signal received by the receiver 1107.
  • the scheduling module 1105 designates the processing device 1103 that processes signals received by the receiver 1122 based on the number of tasks currently being processed by each processing device 1103, where a task is the process of processing a signal sent by a receiver 1122 assigned by the scheduling module 1105, to obtain a collected human voice signal.
  • FIG. 10 illustrates a device for collecting human voice signals in a far-field pickup device according to an embodiment of the present application. As shown in FIG. 10, the device includes:
  • a synchronization signal generating module 400 is configured to generate a synchronization signal that is synchronized with a sound signal to be played and occupies a second frequency band different from a first frequency band where the sound signal to be played is located;
  • a playing module 401, configured to play the synchronization signal together with the sound signal to be played;
  • a receiving module 402, configured to receive a user voice collected by the microphone pickup unit and an echo of a sound signal played by the device main body after spatial propagation, and digitally convert the collected user voice and echo;
  • a determining module 403, configured to determine a second frequency band component in a signal sent back by the microphone pickup unit and a time delay between it and the synchronization signal;
  • the signal acquisition module 404 is configured to perform echo cancellation based on a signal sent back by the microphone pickup unit, using a sound signal to be played delayed by a determined time delay, to obtain a collected human voice signal.
  • FIG. 11 illustrates an electronic device for implementing a method for collecting a human voice signal in a far-field sound pickup device according to an embodiment of the present application.
  • the electronic device includes at least one processor 501 and at least one memory 502, where the memory 502 stores computer-readable program instructions, and when the computer-readable program instructions are executed by the processor 501, the processor 501 is caused to execute the steps of the method for collecting a human voice signal in the far-field pickup device according to the foregoing embodiments.
  • the memory 502 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created during execution of the method, and the like.
  • the processor 501 may be a central processing unit (CPU) or a digital processing unit.
  • the memory 502 and the processor 501 are connected by a bus 503, and the bus 503 is indicated by a thick line in FIG. 11. The bus 503 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in FIG. 11, but this does not mean that there is only one bus or one type of bus.
  • the memory 502 may be a volatile memory, such as a random-access memory (RAM); the memory 502 may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 502 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 502 may also be a combination of the above memories.
  • An embodiment of this application further provides a computer-readable storage medium storing computer-readable program instructions executable by an electronic device; when the computer-readable program instructions are run on the electronic device, the electronic device is caused to execute the steps of the method for collecting human voice signals in the far-field pickup device according to the foregoing embodiments.


Abstract

Embodiments of this application relate to artificial-intelligence technologies such as smart speakers, and specifically provide a far-field sound pickup device and a method for collecting human voice signals in a far-field sound pickup device. The far-field sound pickup device includes a device main body and a microphone pickup unit whose components are separated. The microphone pickup unit collects a user voice and the echo, after spatial propagation, of the sound signal played by the device main body, and sends them back to the device main body. The device main body includes: a playback signal source, a synchronization signal generator, a speaker, a delay determining unit, and an echo cancellation unit. Embodiments of this application can solve the problem that the microphone signal and the echo reference signal cannot be synchronized, thereby improving speech recognition performance.

Description

Far-field sound pickup device, and method for collecting human voice signals in a far-field sound pickup device
This application claims priority to Chinese Patent Application No. 201811150947.5, entitled "Far-field sound pickup device, and method for collecting human voice signals in a far-field sound pickup device", filed with the China National Intellectual Property Administration on September 29, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of electronic devices, and in particular to a far-field sound pickup device, a method and an apparatus for collecting human voice signals in a far-field sound pickup device, an electronic device, and a storage medium.
Background
Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic AI technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include several major directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech technology include automatic speech recognition (ASR), text-to-speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, among which speech is regarded as one of the most promising modes of human-computer interaction in the future.
While playing, a smart speaker can recognize a voice command issued by a user, such as "turn up the volume", and then act according to the user's voice command, for example by increasing the playback volume. In a smart speaker, a microphone can pick up sound signals in the environment, including the user's voice, interfering noise in the environment, and the echo of the sound signal played by the smart speaker itself. The sound signal collected by the microphone is converted into digital information and sent to a speech signal pre-processing module. The speech signal pre-processing module has two main tasks. First, it uses an echo cancellation algorithm to remove or reduce the echo, picked up by the microphone, of the sound signal played by the smart speaker itself; that is, when the playback signal source generates the voice signal to be played, the speech signal pre-processing module extracts an echo reference signal from the generated voice signal, and after receiving the mixed signal picked up by the microphone — which includes the user's voice, interfering noise in the environment, and the echo of the sound signal played by the smart speaker — uses the echo reference signal to cancel the echo in the mixed signal. Second, it uses noise cancellation algorithms such as beamforming to reduce the interfering noise in the environment.
In a smart speaker, the loudspeaker and the microphone are often integrated on one main body. Because the distance between the microphone and the loudspeaker is very small, the echo, received by the microphone, of the sound signal played by the smart speaker itself is prone to clipping and saturation, which significantly degrades the performance of the echo cancellation algorithm. In addition, because the loudspeaker and the microphone are integrated on one main body, the user is relatively far from the microphone. In this case, if the interfering noise in the environment is large, or the reverberation of the room is large, the input signal-to-noise ratio of the microphone will be relatively low, which in turn reduces the gain provided by noise cancellation algorithms such as beamforming and degrades speech recognition performance. If the reverberation of the room is small, the high-frequency components of the user's voice suffer large losses during propagation, which also degrades speech recognition performance.
To solve the above problems, separated far-field sound pickup solutions have emerged, in which the microphone is arranged separately from the main body where the loudspeaker is located, and the sound signal picked up by the microphone is sent to the main body wirelessly or by other means for speech signal pre-processing. Because the distance between the microphone and the loudspeaker becomes larger, the echo, received by the microphone, of the sound signal played by the smart speaker itself does not exhibit clipping or saturation. Moreover, the user is closer to the microphone, which can improve speech recognition performance. However, because the distance between the microphone and the main body is variable, the microphone sound signal received by the main body and the echo reference signal cannot be synchronized, which degrades the performance of the echo cancellation algorithm. In addition, existing far-field sound pickup devices consume a large amount of power.
Summary
Embodiments of this application provide a far-field sound pickup device, a method and an apparatus for collecting human voice signals in a far-field sound pickup device, an electronic device, and a storage medium, so as to solve the problem that the microphone sound signal and the echo reference signal cannot be synchronized, thereby improving speech recognition performance.
According to an embodiment of this application, a far-field sound pickup device is provided, including a device body and a microphone pickup unit that are physically separate components. The microphone pickup unit collects the user's voice and the echo, after spatial propagation, of the sound signal played by the device body, digitally converts the collected user voice and echo, and sends the resulting signal back to the device body.
The device body includes:
a playback signal source, configured to generate a sound signal to be played;
a synchronization signal generator, configured to generate a synchronization signal that is synchronized with the sound signal to be played and occupies a second frequency band different from a first frequency band in which the sound signal to be played is located;
a loudspeaker, configured to play the superposition of the sound signal to be played and the synchronization signal, as the sound signal played by the device body;
a delay determining unit, configured to determine a time delay between the second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal;
an echo cancellation unit, configured to perform, based on the signal sent back by the microphone pickup unit, echo cancellation using the sound signal to be played delayed by the determined time delay, to obtain the collected human voice signal.
According to an embodiment of this application, a method for collecting human voice signals in a far-field sound pickup device is provided, the far-field sound pickup device including a device body and a microphone pickup unit that are physically separate components, and the method including:
generating, by the device body, a synchronization signal that is synchronized with a sound signal to be played and occupies a second frequency band different from a first frequency band in which the sound signal to be played is located;
playing, by the device body, the synchronization signal together with the sound signal to be played;
receiving, by the device body, a signal obtained by the microphone pickup unit collecting the user's voice and the echo, after spatial propagation, of the sound signal played by the device body and digitally converting the collected user voice and echo;
determining, by the device body, a time delay between the second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal;
performing, by the device body, based on the signal sent back by the microphone pickup unit, echo cancellation using the sound signal to be played delayed by the determined time delay, to obtain the collected human voice signal.
According to an embodiment of this application, an apparatus for collecting human voice signals in a far-field sound pickup device is provided, including:
a synchronization signal generating module, configured to generate a synchronization signal that is synchronized with a sound signal to be played and occupies a second frequency band different from a first frequency band in which the sound signal to be played is located;
a playing module, configured to play the synchronization signal together with the sound signal to be played;
a receiving module, configured to receive a signal obtained by the microphone pickup unit collecting the user's voice and the echo, after spatial propagation, of the sound signal played by the device body and digitally converting the collected user voice and echo;
a determining module, configured to determine a time delay between the second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal;
a signal acquiring module, configured to perform, based on the signal sent back by the microphone pickup unit, echo cancellation using the sound signal to be played delayed by the determined time delay, to obtain the collected human voice signal.
According to an embodiment of this application, an electronic device is provided, including at least one processing unit and at least one storage unit, where the storage unit stores computer-readable program instructions that, when executed by the processing unit, cause the processing unit to perform the steps of the above method.
According to an embodiment of this application, a computer-readable storage medium is provided, which stores computer-readable program instructions executable by an electronic device. When the computer-readable program instructions are run on an electronic device, the electronic device is caused to perform the steps of the above method.
In the embodiments of this application, the synchronization signal generator generates a synchronization signal that is synchronized with the sound signal to be played and occupies a second frequency band different from the first frequency band in which the sound signal to be played is located. The synchronization signal is played out together with the sound signal to be played. Because its frequency band differs from that of the sound signal to be played, when the microphone pickup unit collects the user's voice and the sound signal played by the device body, digitally converts them, and sends them back to the device body, the device body can easily filter out the second-frequency-band component and compare it in time with the generated synchronization signal to determine the time delay. Because the synchronization signal is played together with the sound signal to be played, this time delay is also the time delay from the generation of the sound signal to be played until the device body receives the echo of the sound signal to be played sent back by the microphone pickup unit. By delaying the sound signal to be played by this time delay and performing echo cancellation with the delayed sound signal, the problem that the microphone signal and the echo reference signal cannot be synchronized is solved, and speech recognition performance is improved.
Other features and advantages of this application will become apparent from the following detailed description, or may be learned in part through the practice of this application.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit this application.
Brief Description of the Drawings
The above and other features and advantages of this application will become more apparent from the detailed description of its exemplary embodiments with reference to the accompanying drawings.
FIG. 1A is a schematic diagram of a scenario in which a far-field sound pickup device according to an embodiment of this application is applied to a smart speaker;
FIG. 1B is a schematic diagram of a scenario in which a far-field sound pickup device according to an embodiment of this application is applied to a smart TV;
FIG. 1C is a schematic diagram of a scenario in which a far-field sound pickup device according to an embodiment of this application is applied to voice-controlled smart navigation;
FIG. 1D is a schematic diagram of a scenario in which a far-field sound pickup device according to an embodiment of this application is applied to a KTV music playback system;
FIG. 2A is a schematic diagram of a deployment layout of a far-field sound pickup device according to an embodiment of this application;
FIG. 2B is a schematic diagram of a deployment layout of a far-field sound pickup device according to another embodiment of this application;
FIG. 2C is a schematic diagram of a deployment layout of a far-field sound pickup device according to another embodiment of this application;
FIG. 2D is a schematic diagram of a deployment layout of a far-field sound pickup device according to another embodiment of this application;
FIG. 3A is a schematic architecture diagram of a far-field sound pickup device according to an embodiment of this application applied to a smart speaker;
FIG. 3B is a schematic architecture diagram of a far-field sound pickup device according to an embodiment of this application applied to a smart TV;
FIG. 3C is a schematic architecture diagram of a far-field sound pickup device according to an embodiment of this application applied to voice-controlled smart navigation;
FIG. 3D is a schematic architecture diagram of a far-field sound pickup device according to an embodiment of this application applied to a KTV music playback system;
FIG. 4A is a schematic architecture diagram corresponding to the deployment layout of FIG. 2A according to an embodiment of this application;
FIG. 4B is a schematic architecture diagram corresponding to the deployment layout of FIG. 2A according to another embodiment of this application;
FIG. 4C is a schematic architecture diagram corresponding to the deployment layout of FIG. 2B according to an embodiment of this application;
FIG. 4D is a schematic architecture diagram corresponding to the deployment layout of FIG. 2C according to an embodiment of this application;
FIG. 4E is a schematic architecture diagram corresponding to the deployment layout of FIG. 2D according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of a far-field sound pickup device according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a far-field sound pickup device according to another embodiment of this application;
FIG. 7 is a schematic structural diagram of a far-field sound pickup device according to another embodiment of this application;
FIG. 8 is a schematic structural diagram of a far-field sound pickup device according to another embodiment of this application;
FIG. 9 is a flowchart of a method for collecting human voice signals in a far-field sound pickup device according to an embodiment of this application;
FIG. 10 shows an apparatus for collecting human voice signals in a far-field sound pickup device according to an embodiment of this application;
FIG. 11 shows an electronic device for implementing the method for collecting human voice signals in a far-field sound pickup device according to an embodiment of this application.
Detailed Description
Exemplary embodiments of this application will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these exemplary embodiments are provided so that the technical solutions of this application will be more thorough and complete, and will fully convey the concepts of the exemplary embodiments to those skilled in the art. The drawings are merely schematic illustrations of this application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and repeated descriptions thereof will be omitted.
In addition, the described features, structures, or characteristics may be combined in any suitable manner in one or more exemplary embodiments. In the following description, many specific details are provided to give a full understanding of the exemplary embodiments of this application. However, those skilled in the art will appreciate that the technical solutions of this application may be practiced while omitting one or more of the specific details, or that other methods, components, steps, and the like may be employed. In other cases, well-known structures, methods, implementations, or operations are not shown or described in detail to avoid obscuring aspects of this application.
Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software executed by a computer, in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
With the research and progress of artificial intelligence technology, AI has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, and smart customer service. With the development of technology, AI will be applied in more fields and play an increasingly important role. The solutions provided in the embodiments of this application relate to AI technologies such as smart speakers, and are specifically described through the following embodiments.
FIGS. 1A-1D are schematic diagrams of four application scenarios of the far-field sound pickup device according to embodiments of this application. Those skilled in the art should understand that although FIGS. 1A-1D illustrate four application scenarios of the far-field sound pickup device according to embodiments of this application, the embodiments of this application are obviously not limited to these four scenarios. Benefiting from the teachings of this application, those skilled in the art can apply the far-field sound pickup device according to the embodiments of this application to various other scenarios.
A far-field sound pickup device is a device in which the microphone is arranged separately from the main body where the loudspeaker is located. The sound signal picked up by the microphone is sent wirelessly or by other means to the main body for speech signal pre-processing. The advantage of a far-field sound pickup device is that the distance between the microphone and the loudspeaker is relatively large, so the echo, received by the microphone, of the sound signal played by the smart speaker itself does not exhibit clipping or saturation. Moreover, the user is closer to the microphone, which can improve speech recognition performance.
FIG. 1A is a schematic diagram of a scenario in which a far-field sound pickup device according to an embodiment of this application is applied to a smart speaker. While playing, the smart speaker can recognize a voice command issued by the user, such as "turn up the volume", and then act according to the user's voice command, for example by increasing the playback volume. That is, while playing sound signals, the smart speaker can recognize the voice uttered by a person amid those sound signals and act according to the person's voice.
As shown in FIG. 1A, a far-field sound pickup device 1 according to an exemplary embodiment of this application includes a device body 11 and a microphone pickup unit 12 that are physically separate components. The device body 11 has a loudspeaker 1101 for playing the sound signal to be played. The microphone pickup unit 12 may include a plurality of microphone units 1201. In FIG. 1A, each microphone unit 1201 is a dot, and the microphone pickup unit 12 includes an array of these dots.
The microphone units 1201 pick up sound signals in the environment, including the user's voice, interfering noise in the environment, and the echo of the sound signal played by the smart speaker. The sound signals picked up by the microphone units 1201 are converted into digital signals and sent to a processing device (not shown) in the device body 11. The processing device removes the echo of the sound signal played by the smart speaker and the interfering noise in the environment from the received sound signals to obtain the user's voice, and generates a control instruction according to the user's voice, such as turning the speaker volume up or down, or playing a certain song.
FIG. 1B is a schematic diagram of a scenario in which a far-field sound pickup device according to an embodiment of this application is applied to a smart TV. While playing, the smart TV can recognize a voice command issued by the user during playback, such as "switch to channel XX", and then act according to the user's voice command, for example by switching to channel XX. That is, while the TV is playing, the smart TV can recognize the voice uttered by a person amid the TV's sound signals and act according to the person's voice.
As shown in FIG. 1B, a far-field sound pickup device 1 according to an exemplary embodiment of this application includes a device body 11 and a microphone pickup unit 12 that are physically separate components. The device body 11 is the TV set body, which has a display screen 1102 and a loudspeaker 1101 for displaying the video of TV programs and playing the sound of TV programs, respectively. The microphone pickup unit 12 may include a plurality of microphone units 1201. In FIG. 1B, each microphone unit 1201 is a dot, and the microphone pickup unit 12 includes an array of these dots.
The microphone units 1201 pick up sound signals in the environment, including the user's voice, interfering noise in the environment, and the echo of the sound of the TV program played by the smart TV. The sound signals picked up by the microphone units 1201 are converted into digital signals and sent to a processing device (not shown) in the device body 11. The processing device removes the echo of the TV program sound played by the smart TV and the interfering noise in the environment from the received sound signals to obtain the user's voice, and generates a control instruction according to the user's voice, such as switching to a certain channel or turning the volume up or down.
FIG. 1C is a schematic diagram of a scenario in which a far-field sound pickup device according to an embodiment of this application is applied to voice-controlled smart navigation. Voice-controlled smart navigation means that, while driving, a navigation route can be planned according to the origin and destination entered by the user, and voice announcements are made during driving so that the user always drives along the announced route. At the same time, while driving (possibly during a voice announcement), the user may issue voice commands such as "change destination to XX", "I want to park, find me a nearby parking space", or "change the navigation display from north-up to heading-up". The voice-controlled smart navigation device recognizes the user's voice command and acts according to it, for example restarting navigation to the new destination XX, finding the user a nearby parking space, or changing the navigation display from north-up to heading-up.
As shown in FIG. 1C, a far-field sound pickup device 1 according to an exemplary embodiment of this application includes a device body 11 and a microphone pickup unit 12 that are physically separate components. The device body 11 is the body of the voice-controlled smart navigation device, which has a display screen 1106 and a loudspeaker 1101. The display screen 1106 is used to enter the destination and display the navigation route, and the loudspeaker 1101 is used to play the voice announced by the voice-controlled smart navigation device during navigation. In FIG. 1C, the microphone pickup unit 12 is located on the side close to the driver so as to pick up the driver's voice commands more clearly, the display screen 1106 is located in the middle, and the loudspeaker 1101 is located on the side of the front console away from the driver to avoid clipping and saturation and improve speech recognition performance. The microphone pickup unit 12 includes a plurality of microphone units 1201. Each microphone unit 1201 is a dot, and the microphone pickup unit 12 includes an array of these dots.
The microphone units 1201 pick up sound signals in the environment, including the user's voice, interfering noise in the environment, and the echo of the voice announced by the voice-controlled smart navigation device. The sound signals picked up by the microphone units 1201 are converted into digital signals and sent to a processing device (not shown) in the device body 11. The processing device removes the echo of the announced voice and the interfering noise in the environment from the received sound signals to obtain the user's voice, and generates a control instruction according to the user's voice, such as restarting navigation to the new destination XX, finding the user a nearby parking space, or changing the navigation display from north-up to heading-up.
FIG. 1D is a schematic diagram of a scenario in which a far-field sound pickup device according to an embodiment of this application is applied to a KTV music playback system. The KTV music playback system can play the accompaniment of a song selected by the user through the loudspeaker, and when the user sings into the microphone over the accompaniment, it recognizes the user's voice and plays it through the loudspeaker together with the accompaniment.
As shown in FIG. 1D, a far-field sound pickup device 1 according to an exemplary embodiment of this application includes a device body 11, a microphone pickup unit 12, a display screen 1102, and a song selection station 1103, which are physically separate components. The song selection station 1103 displays a list of songs for accompaniment, from which the user can select a song to sing. The device body 11 is the main body of the KTV music playback system and has a loudspeaker 1101 for playing the accompaniment of the song selected by the user. The display screen 1102 displays the lyrics and visuals of the selected song. The microphone pickup unit 12 includes a plurality of microphone units (not shown); each microphone unit is a dot, and the microphone pickup unit 12 includes an array of these dots. The user can sing into the microphone that includes the microphone pickup unit 12. The user's singing voice, interfering noise in the environment, and the echo of the accompaniment played by the loudspeaker 1101 are converted by the microphone units into digital signals and sent together to a processing device (not shown) in the device body 11. If the processing device does not cancel the echo of the accompaniment, that echo, which has a time delay relative to the accompaniment being played, will overlap with it and blur the sound. Therefore, the processing device must cancel the echo of the accompaniment and the interfering noise in the environment before playing out the user's singing voice.
FIGS. 2A-2D are schematic diagrams of four deployment layouts of the far-field sound pickup device according to embodiments of this application. As described above, the far-field sound pickup device 1 includes a device body 11 and a microphone pickup unit 12 that are physically separate components. There may be one or more microphone pickup units 12. The device body 11 may be deployed entirely locally, i.e., at the same location as the microphone pickup unit 12, or partially locally (the loudspeaker and the receiver), with the core part used for sound processing, i.e., the processing device, placed at a remote end. "Remote end" means an area that can be reached through the Internet or a telecommunication network but is not at the same location as the microphone pickup unit 12. The processing device may be connected to the loudspeaker and the receiver through the Internet, or through a telecommunication network in wired or wireless form. FIGS. 2A-2D illustrate the different deployment layouts of the far-field sound pickup device by taking the smart speaker shown in FIG. 1A as an example, but those skilled in the art should understand that, with slight changes, FIGS. 2A-2D can also serve as deployment layouts for the far-field sound pickup devices of FIGS. 1B-1D; such changes are easy for those skilled in the art to make based on the teachings of the embodiments of this application.
Based on the embodiment of the far-field sound pickup device shown in FIG. 1A, the device body 11 and the microphone pickup unit 12 may be deployed entirely locally, with only one device body 11 and one microphone pickup unit 12. Because there is only one microphone pickup unit 12 in the room, the microphone pickup unit 12 may also be far from the user, leading to low speech recognition performance. Therefore, in the deployment layout shown in FIG. 2A, the far-field sound pickup device 1 includes a plurality of microphone pickup units 12 and one device body 11 deployed locally. The device body 11 has a loudspeaker 1101. For example, a microphone pickup unit 12 is arranged in each corner of the room, so that there is always a microphone pickup unit 12 relatively close to the user, avoiding the problem of low speech recognition performance caused by the user being too far from the microphone pickup unit 12. After the user utters a voice, the microphone units 1201 of the multiple microphone pickup units 12, at different distances from the user, all receive the user's voice and send the user's voice, the echo of the sound played by the loudspeaker 1101, and the interfering noise in the environment together to the device body 11. The device body 11 processes these signals according to different principles to obtain the user's voice. The device body 11 may include different receivers corresponding to the respective microphone pickup units 12, or may include only one receiver for receiving the signals sent by all the microphone pickup units 12, which will be described in detail later with reference to FIGS. 4A and 4B.
In the remote deployment layout shown in FIG. 2B, the microphone pickup unit 12 is deployed locally, the loudspeaker 1101 and the receiver (not shown) in the device body 11 are also deployed locally, and the processing device 1103, as the core part of sound processing, is deployed at the remote end 1104. This layout is used because the processing of sound signals is unrelated to the on-site collection and playback of sound and need not be deployed locally; placing it at the remote end 1104 helps reduce the size of the locally deployed components. After the user utters a voice, the microphone units 1201 of the microphone pickup unit 12 receive the user's voice and send the user's voice, the echo of the sound played by the loudspeaker, and the interfering noise in the environment together to the receiver of the device body 11, which sends the received signal to the processing device 1103 at the remote end 1104 through the Internet or a telecommunication connection. The processing device 1103 removes the echo of the sound played by the loudspeaker and the interfering noise in the environment from the received sound signal to obtain the user's voice, generates a control instruction according to the user's voice, such as turning the speaker volume up or down, and sends it to the loudspeaker 1101, thereby controlling the playback volume.
In the remote deployment layout shown in FIG. 2C, the processing device 1103 at the remote end 1104 communicates with loudspeakers 1101 and receivers (not shown) at a plurality of locations 2 (for example, a plurality of rooms), and processes the signals sent by the microphone pickup units 12 at the plurality of locations 2 to the receivers; that is, one processing device 1103 at the remote end 1104 can form a device body 11 with the loudspeaker 1101 and the receiver at each of the plurality of locations. The loudspeaker 1101, the receiver, and the microphone pickup unit 12 are deployed locally. After a local user utters a voice, the microphone units 1201 of the local microphone pickup unit 12 receive the user's voice and send the user's voice, the echo of the sound played by the loudspeaker, and the interfering noise in the environment together, through the Internet or a telecommunication connection, to the receiver of the device body 11, which then sends the received signal through the Internet or a telecommunication connection to the processing device 1103 at the remote end 1104. The processing device 1103 removes the echo of the sound played by the loudspeaker and the interfering noise in the environment from the received sound signal to obtain the user's voice, generates a control instruction according to the user's voice, such as turning the speaker volume up or down, and sends it through the Internet or a telecommunication connection to the loudspeaker 1101, thereby controlling the playback volume. In FIG. 2C, the processing device 1103, which is unrelated to the on-site collection and playback of sound, is deployed at the remote end 1104 and shared by the local devices at the plurality of locations, which facilitates effective utilization of resources.
In the remote deployment layout shown in FIG. 2D, the receiver communicates with a plurality of processing devices 1103 at the remote end 1104 through a scheduling module 1105. When the receiver receives a signal sent by a locally deployed microphone pickup unit 12, the scheduling module 1105 designates the processing device 1103 that will process the signal. That is, which processing device 1103 at the remote end 1104 forms a device body 11 with the loudspeaker 1101 and the receiver at which location 2 is not fixed. The benefit of this layout is that, because the combination of a processing device 1103 with a loudspeaker 1101 and a receiver is not fixed, when a sound signal sent by a local receiver is received, which processing device 1103 it is assigned to is decided by the scheduling module 1105 according to the current load of the processing devices 1103. This balances the processing load of each processing device 1103 and allocates network resources effectively, avoiding the overload caused by one processing device simultaneously processing sound signals sent by multiple local receivers. After a local user utters a voice, the microphone units 1201 of the local microphone pickup unit 12 receive the user's voice and send the user's voice, the echo of the sound played by the loudspeaker, and the interfering noise in the environment together, through the Internet or a telecommunication connection, to the receiver of the device body 11, which sends the received signal through the Internet or a telecommunication connection to the scheduling module 1105 at the remote end 1104. The scheduling module 1105 assigns a processing device 1103 to the received signal and sends the signal to that processing device 1103. The processing device 1103 removes the echo of the sound played by the loudspeaker and the interfering noise in the environment from the received signal to obtain the user's voice, generates a control instruction according to the user's voice, such as turning the speaker volume up or down, and sends it through the Internet or a telecommunication connection to the local loudspeaker 1101, thereby controlling the played sound.
FIG. 3A is a schematic architecture diagram of a far-field sound pickup device according to an embodiment of this application applied to a smart speaker.
As shown in FIG. 3A, a far-field sound pickup device 1 according to an exemplary embodiment of this application includes a device body 11 and a microphone pickup unit 12 that are physically separate components. The device body 11 has a loudspeaker 1101 for playing the sound signal to be played. The microphone pickup unit 12 may include a plurality of microphone units 1201. In FIG. 3A, each of the plurality of microphone units 1201 is a dot, and the microphone pickup unit 12 includes an array of these dots. The microphone units 1201 pick up sound signals in the environment, including the user voice 24 of the user 21, interfering noise 23 in the environment (environmental noise for short), and the echo 22 of the sound signal played by the smart speaker (loudspeaker echo for short). Each microphone unit 1201 converts the collected sound signal into a digital signal (i.e., a mixed signal 25) and sends it to the processing device in the device body 11. The processing device removes the echo of the sound signal played by the smart speaker and the interfering noise in the environment from the received mixed signal 25 to obtain the user voice, and generates a control instruction according to the user voice.
FIG. 3B is a schematic architecture diagram of a far-field sound pickup device according to an embodiment of this application applied to a smart TV.
As shown in FIG. 3B, a far-field sound pickup device 1 according to an exemplary embodiment of this application includes a device body 11 and a microphone pickup unit 12 that are physically separate components. The device body 11 is the TV set body, which has a display screen 1102 and a loudspeaker 1101 for displaying the video of TV programs and playing the sound of TV programs, respectively. The microphone pickup unit 12 may include a plurality of microphone units 1201. The user watches the TV program on the display screen 1102 while listening to its sound played by the loudspeaker 1101. While the TV program sound is playing, the user may issue a voice command to control the TV, such as "switch to channel XX". The microphone units 1201 pick up sound signals in the environment, including the user voice 24, interfering noise 23 in the environment (environmental noise for short), and the echo 22 of the TV program sound played by the loudspeaker 1101 (loudspeaker echo for short). Each microphone unit 1201 converts the collected sound signal into a digital signal (i.e., a mixed signal 25) and sends it to the processing device in the device body 11. The processing device removes the echo of the TV program sound played by the smart TV and the interfering noise in the environment from the received mixed signal 25 to obtain the user voice, and generates a control instruction according to the user voice, such as switching to a certain channel.
FIG. 3C is a schematic architecture diagram of a far-field sound pickup device according to an embodiment of this application applied to voice-controlled smart navigation.
As shown in FIG. 3C, a far-field sound pickup device 1 according to an exemplary embodiment of this application includes a device body 11 and a microphone pickup unit 12 that are physically separate components. The microphone pickup unit 12 includes a plurality of microphone units 1201. Each microphone unit 1201 is a dot, and the microphone pickup unit 12 includes an array of these dots. The device body 11 is the body of the voice-controlled smart navigation device, which has a display screen 1106 and a loudspeaker 1101. The display screen 1106, also called the navigation display apparatus 1106, is used to enter the destination and display the navigation route; the loudspeaker 1101 is used to play the voice announced by the voice-controlled smart navigation device during navigation. While the voice is being announced, the user may issue a voice command to control the smart navigation device, for example, "I want to change my destination to XX". The microphone units 1201 pick up sound signals in the environment, including the user voice 24, interfering noise 23 in the environment (environmental noise for short), and the echo 22 of the voice announced by the voice-controlled smart navigation device (loudspeaker echo for short). Each microphone unit 1201 converts the collected sound signal into a digital signal (i.e., a mixed signal 25) and sends it to the processing device in the device body 11. The processing device removes the echo of the announced voice and the interfering noise in the environment from the received mixed signal 25 to obtain the user voice, and generates a control instruction according to the user voice, such as restarting navigation to the new destination XX.
FIG. 3D is a schematic architecture diagram of a far-field sound pickup device according to an embodiment of this application applied to a KTV music playback system.
As shown in FIG. 3D, a far-field sound pickup device 1 according to an exemplary embodiment of this application includes a device body 11 and a microphone pickup unit 12 that are physically separate components. The user selects a song to sing at the song selection station. The device body 11 is the main body of the KTV music playback system and has a loudspeaker 1101 and a KTV display screen 1102. The loudspeaker 1101 plays the accompaniment of the song selected by the user, and the KTV display screen 1102 displays the lyrics and visuals of the selected song. The user sings into the microphone pickup unit 12. The singing voice (called the user voice) 24, the interfering noise 23 in the environment (environmental noise for short), and the echo 22 of the accompaniment played by the loudspeaker (loudspeaker echo for short) are converted by the microphone units (not shown) of the microphone pickup unit 12 into digital signals (i.e., a mixed signal 25) and sent together to the processing device in the device body 11. The processing device cancels the echo of the accompaniment and the interfering noise in the environment in the mixed signal 25, and then plays out the user's singing voice.
FIGS. 4A-4B are schematic architecture diagrams corresponding to the deployment layout of the far-field sound pickup device of FIG. 2A. FIG. 4A corresponds to the case in FIG. 2A where the device body 11 has a plurality of receivers, each receiving the mixed signal 25 reported by one of the plurality of microphone pickup units 12; FIG. 4B corresponds to the case in FIG. 2A where the device body 11 has one receiver that receives the mixed signals 25 reported by the plurality of microphone pickup units 12.
In the schematic architecture diagram shown in FIG. 4A, three microphone pickup units 12 are placed in three corners of the room. Each microphone pickup unit 12 includes three microphone units 1201. Although FIG. 4A illustrates three microphone pickup units 12 in the same room, those skilled in the art should understand that the number of microphone pickup units 12 may be smaller or larger. Although FIG. 4A illustrates three microphone units 1201 in the same microphone pickup unit 12, those skilled in the art should understand that the number of microphone units 1201 may be smaller or larger.
After the user utters a voice, because the microphone pickup units 12 are at different distances from the user, the signal strengths of the user voice they receive differ. In one embodiment, each microphone pickup unit 12 converts the received user voice 24, the echo 22 of the sound played by the loudspeaker (loudspeaker echo for short), and the interfering noise 23 in the environment (environmental noise for short) into a mixed signal 25 and sends it to one receiver 1107 of the device body 11. The microphone pickup unit 12 may be connected to that receiver 1107 by wire so as to form a one-to-one correspondence. In another embodiment, the microphone pickup unit 12 may broadcast the mixed signal 25, but the mixed signal 25 carries an identifier specific to the microphone pickup unit 12, and each receiver 1107 extracts, according to the identifier, the mixed signal 25 sent by its corresponding microphone pickup unit 12 from the received mixed signals 25. After each receiver 1107 extracts the mixed signal 25 sent by its own microphone pickup unit 12, the processing device removes the echo 22 of the sound played by the loudspeaker and the interfering noise 23 in the environment from each mixed signal 25 to obtain multiple channels of extracted user voice, and then merges the multiple channels of extracted user voice to obtain an enhanced user voice. Placing microphone pickup units 12 at different positions in the room, processing the sound signals sent by each microphone pickup unit 12 separately, and merging the processing results helps overcome the problem of low speech recognition accuracy that may arise if a microphone pickup unit 12 is placed at only a single position and the user is too far from it.
The system architecture diagram shown in FIG. 4B differs from FIG. 4A in that the device body 11 of FIG. 4B has only one receiver 1107, which receives the mixed signals sent by the microphone pickup units 12 and, according to a predetermined criterion, selects only one of the mixed signals for processing and discards the others. The predetermined criterion is, for example, selecting the mixed signal sent by the microphone pickup unit 12 closest to the user for processing. This selection is made because the closer the microphone pickup unit 12 is to the user, the louder the collected user voice and the better the user speech recognition effect. The selection of the mixed signal can be implemented by the microphone pickup unit 12 adding a timestamp to the mixed signal 25 when sending it, the timestamp indicating the time at which the microphone pickup unit 12 received the user voice. In this way, the receiver 1107 can select, according to the order of the reception times indicated by the timestamps, the mixed signal 25 with the earliest reception time for processing. The user voice in the mixed signal 25 with the earliest reception time has the highest quality, which helps improve the speech recognition effect.
FIG. 4C is a schematic system architecture diagram of the remote deployment layout corresponding to FIG. 2B. In FIG. 4C, the microphone pickup unit 12 is deployed locally. The loudspeaker 1101 and the receiver 1107 in the device body 11 are also deployed locally. The processing device 1103, as the core part of the processing, is deployed at the remote end 1104. The processing device 1103 communicates with the local loudspeaker 1101 and receiver 1107 through the Internet or a telecommunication connection. After the user utters a voice, the microphone units 1201 of the microphone pickup unit 12 receive the user voice 24, convert the user voice 24, the echo 22 of the sound played by the loudspeaker (loudspeaker echo for short), and the interfering noise 23 in the environment (environmental noise for short) into a mixed signal 25, and send it to the local receiver 1107. The local receiver 1107 sends the received mixed signal 25 through the Internet or a telecommunication connection to the processing device 1103 at the remote end 1104. The processing device 1103 removes the echo of the sound played by the loudspeaker and the interfering noise in the environment from the received mixed signal 25 to obtain the user voice, generates a control instruction according to the user voice, such as turning the speaker volume up or down, and sends the control instruction through the Internet or a telecommunication connection to the loudspeaker 1101, thereby controlling the playback volume.
FIG. 4D is a schematic system architecture diagram of the remote deployment layout corresponding to FIG. 2C. The processing device 1103 at the remote end 1104 communicates with loudspeakers 1101 and receivers 1107 at a plurality of locations 2 (for example, rooms), and processes the signals sent by the microphone pickup units 12 at the plurality of locations 2 to the receivers 1107; that is, one processing device 1103 at the remote end 1104 can form a respective device body 11 with the loudspeaker 1101 and the receiver 1107 at each of the plurality of locations. After a local user utters a voice, the local microphone pickup unit 12 receives the user voice, converts the user voice, the echo of the sound played by the loudspeaker, and the interfering noise in the environment into a mixed signal 25, and sends it to the local receiver 1107. The local receiver 1107 sends the received mixed signal through the Internet or a telecommunication connection to the processing device 1103 at the remote end 1104. The processing device 1103 removes the echo of the sound played by the loudspeaker and the interfering noise in the environment from the received mixed signal to obtain the user voice, generates a control instruction according to the user voice, and sends the control instruction through the Internet or a telecommunication connection to the loudspeaker 1101, thereby controlling the playback volume. The mixed signals reported by all the receivers 1107 are processed uniformly by the same processing device 1103, which facilitates effective utilization of resources.
FIG. 4E is a schematic system architecture diagram of the remote deployment layout corresponding to FIG. 2D. In the remote deployment layout shown in FIG. 4E, the remote end 1104 may include a plurality of processing devices 1103 and one scheduling module 1105. The scheduling module 1105 is connected to the plurality of processing devices 1103. The loudspeaker 1101 and receiver 1107 at any local location 2 can be paired with any processing device 1103 at the remote end 1104. When the receiver 1107 receives the mixed signal 25 sent by the microphone units 1201 of the local microphone pickup unit 12, it sends the signal through the Internet or a telecommunication connection to the scheduling module 1105, and the scheduling module 1105 designates a processing device 1103 to process the mixed signal. Thus, which processing device 1103 at the remote end 1104 forms a device body 11 with the loudspeaker 1101 and the receiver 1107 at which location 2 is not fixed. The benefit of this layout is that, because the combination of a processing device 1103 with a loudspeaker 1101 and a receiver 1107 is not fixed, when a mixed signal sent by a local receiver 1107 is received, which processing device 1103 it is assigned to is decided by the scheduling module 1105 according to the current load of the processing devices 1103, which balances the processing load of each processing device 1103 and allocates network resources effectively.
According to an embodiment of this application, a far-field sound pickup device 1 is provided. A far-field sound pickup device 1 is a device in which the microphone is arranged separately from the main body where the loudspeaker is located. The advantage of a far-field sound pickup device is that the distance between the microphone and the loudspeaker is relatively large, so the echo, received by the microphone, of the sound signal played by the smart speaker itself does not exhibit clipping or saturation. Moreover, the user is closer to the microphone, which can improve speech recognition performance.
FIG. 5 is a schematic structural diagram of a far-field sound pickup device according to an embodiment of this application. As shown in FIG. 5, the far-field sound pickup device includes a device body 11 and a microphone pickup unit 12 that are physically separate components. The device body includes a playback signal source 1108, a synchronization signal generator 1109, a loudspeaker 1101, a delay determining unit 1110, and an echo cancellation unit 1111.
The playback signal source 1108 is the component that generates the sound signal to be played and may be implemented by an audio signal generating circuit. The audio signal generating circuit generates the corresponding audio signal from a stored audio file or from an audio file received by an antenna. Generally, this audio signal is sampled at 48 KHz or 96 KHz; it may be a digital signal, and the generation of each of its sample points is strictly synchronized to a clock signal.
In the smart speaker application scenario shown in FIG. 1A, the playback signal source 1108 is an audio signal generating circuit that generates the corresponding sound signal from a stored audio file. For example, when the audio file is a music file, the corresponding sound signal is the corresponding music. The audio file may be stored in the smart speaker, or stored in another terminal device such as the user's mobile phone and transmitted to the smart speaker from the other terminal device by Bluetooth transmission or the like.
In the smart TV application scenario shown in FIG. 1B, the playback signal source 1108 is a frequency-splitting circuit that can separate out the audio signal received by the TV antenna. The frequency-splitting circuit separates the video signal and the audio signal received by the TV antenna.
In the voice-controlled smart navigation application scenario shown in FIG. 1C, the playback signal source 1108 is a voice output circuit that converts the prompt information generated by the navigation device into voice output. The voice output circuit stores the basic voice waveforms of different people, such as the basic voice waveforms of Lin Chi-ling and Guo Degang, and, according to the user's setting, converts the prompt information generated by the navigation device into the voice of the person set by the user.
In the KTV playback system application scenario shown in FIG. 1D, the playback signal source 1108 is an audio signal generating circuit that converts the accompaniment of the song that the user has selected in the KTV system into a sound signal. The user selects a song at the song selection station, and the accompaniment file of the selected song is converted into an audio signal by the audio signal generating circuit.
The synchronization signal generator 1109 is a signal generating circuit that can generate an output signal synchronized with an input signal. Specifically, in the embodiments of this application, the synchronization signal generator 1109 generates a synchronization signal that is synchronized with the sound signal to be played generated by the playback signal source 1108 and occupies a second frequency band different from the first frequency band in which the sound signal to be played is located. The second frequency band is a band whose frequencies do not overlap with those of the first frequency band. A frequency band is an interval of frequencies. For example, if the first frequency band is [48 kHz, 96 kHz] and the second frequency band is [22 kHz, 24 kHz], the two are different frequency bands.
In a specific implementation, the synchronization signal generator 1109 may contain a clock circuit inside, which can extract the clock from the sound signal to be played generated by the playback signal source 1108; the extracted clock serves as the basis for generating a signal synchronized with the sound signal to be played. The clock circuit 1112 may also be arranged outside the synchronization signal generator 1109; as shown in FIG. 6, the clock circuit 1112 is arranged between the playback signal source 1108 and the synchronization signal generator 1109. After the clock circuit 1112 extracts the clock from the sound signal to be played generated by the playback signal source 1108, a second-frequency-band signal synchronized with the sound signal to be played is generated on the basis of that clock.
In one embodiment, the second frequency band must lie in a range inaudible to the human ear. If the synchronization signal in the second frequency band were an audible sound signal, then when the synchronization signal is superimposed on the sound signal to be played generated by the playback signal source 1108 and played out by the loudspeaker 1101, the synchronization signal would be played out together with the sound signal to be played and heard by the user, causing auditory interference to the user.
In one embodiment, the second frequency band is an ultrasonic band, and the synchronization signal generator 1109 is an ultrasonic synchronization signal generator. Because the ultrasonic band is inaudible to the human ear, it causes no auditory interference with the sound signal to be played. The sampling rate of the synchronization signal is exactly the same as that of the sound signal to be played, but all its energy is concentrated in the ultrasonic band. If the sampling rate of the sound signal to be played is 48 KHz, an ultrasonic band from 21 KHz to 23 KHz is recommended. If the sampling rate of the sound signal to be played is 96 KHz, an ultrasonic band above 23 KHz (but below 48 KHz) may be selected.
In one embodiment, the synchronization signal is a carrier-modulated pseudo-random sequence, so as to obtain high anti-interference capability and good autocorrelation properties. Because the synchronization signal is used to determine the time delay between the sound signal to be played generated by the playback signal source 1108 and the mixed signal 25 received by the device body 11, and this time delay is used for echo cancellation, good anti-interference capability and good autocorrelation properties are needed, and a carrier-modulated pseudo-random sequence has these properties, namely:
s(t) = n(t)·sin(f_s·t)      (Formula 1)
where n(t) is a pseudo-random sequence, f_s is a carrier frequency within the ultrasonic band, and s(t) is the carrier-modulated pseudo-random sequence.
In one embodiment, the pseudo-random sequence is selected based on at least one of the following parameters:
the autocorrelation function of the pseudo-random sequence;
the period of the pseudo-random sequence;
the spectral width of the autocorrelation function of the pseudo-random sequence.
In addition, the carrier frequency may be selected based on its distance from 20 KHz.
It should be understood that a carrier-modulated pseudo-random sequence satisfying Formula 1 can provide high anti-interference capability and good autocorrelation properties. If the pseudo-random sequence and the carrier frequency are selected as described above, the anti-interference capability can be further improved and good autocorrelation properties obtained.
Regarding the autocorrelation function of the pseudo-random sequence n(t), the closer it is to an impulse function, the better. In one embodiment, if the autocorrelation function of the pseudo-random sequence n(t) can be approximated by an impulse function, the anti-interference capability and autocorrelation properties of the pseudo-random sequence n(t) are considered good enough.
The period of the pseudo-random sequence n(t) needs to be significantly larger than the time delay from when the playback signal source 1108 generates the sound signal to be played to when the mixed signal 25 is received by the device body 11. This is because, if the period of the pseudo-random sequence n(t) is smaller than this time delay, then when the delay determining unit 1110 determines the time delay between the second-frequency-band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal, the time delay may span multiple periods of the synchronization signal, making the determined time delay inaccurate.
In one embodiment, the following must be satisfied:
T ≥ μ·t_1       (Formula 2)
where T is the period of the pseudo-random sequence n(t), and t_1 is the average, over multiple measurements, of the time delay from when the playback signal source 1108 generates the sound signal to be played to when the mixed signal 25 is received by the device body 11. The time delay may be measured as follows: have the playback signal source 1108 generate a test sound signal, record the time at which the test sound signal is generated, record the time at which the device body 11 receives the mixed signal 25 sent by the microphone pickup unit 12, and subtract the two recorded times to obtain the time delay. The time delay may be measured multiple times and the average of the measured delays taken. μ is a constant that may be preset, for example to 3 or 4.
The spectral width of the autocorrelation function of the pseudo-random sequence n(t) should be larger than a predetermined spectral width threshold; only then are the anti-interference capability and autocorrelation properties of the pseudo-random sequence n(t) good enough. The predetermined spectral width threshold may be preset by the user.
Regarding selecting the carrier frequency based on its distance from 20 KHz, in one embodiment the following must be satisfied:
|f_s − 20 KHz| ≥ Q     (Formula 3)
where Q is a distance threshold in KHz. Only when Formula 3 is satisfied does the addition of the synchronization signal not interfere with the sound quality of the sound signal to be played by the playback signal source 1108.
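As an illustration of Formulas 1-3, the sketch below generates a carrier-modulated pseudo-random synchronization signal. The 48 KHz sampling rate and 22 KHz carrier follow the bands suggested above, but the ±1 chip values, chip length, and the threshold Q = 1 KHz are illustrative assumptions; Formula 1 writes sin(f_s·t), which for discrete samples at rate f_sample is taken here as sin(2π·f_s·i/f_sample).

```python
import math
import random

def make_sync_signal(num_samples, sample_rate_hz=48000.0, carrier_hz=22000.0,
                     chip_len=4, seed=7):
    """Sketch of Formula 1: s(t) = n(t) * sin(f_s * t).

    n(t) is a +-1 pseudo-random chip sequence (an illustrative choice of
    pseudo-random sequence), held for `chip_len` samples per chip, and
    f_s is the ultrasonic carrier frequency.
    """
    rng = random.Random(seed)
    chips = [rng.choice((-1.0, 1.0)) for _ in range(num_samples // chip_len + 1)]
    signal = []
    for i in range(num_samples):
        n_t = chips[i // chip_len]                                  # pseudo-random n(t)
        carrier = math.sin(2.0 * math.pi * carrier_hz * i / sample_rate_hz)
        signal.append(n_t * carrier)                                # modulated sample
    return signal

def carrier_ok(carrier_hz, q_hz=1000.0):
    """Formula 3 check: carrier at least Q away from 20 KHz (Q assumed 1 KHz)."""
    return abs(carrier_hz - 20000.0) >= q_hz

sync = make_sync_signal(480)
```

All energy of the generated samples sits on the 22 KHz carrier, inaudible to the human ear, while `carrier_ok` rejects carriers too close to 20 KHz.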
The loudspeaker 1101, i.e., the speaker, is a circuit device that converts sound signals into sound. In the embodiments of this application, it plays the superposition of the sound signal to be played and the synchronization signal as the sound signal played by the device body 11. As shown in FIG. 5, in one embodiment, the synchronization signal output by the synchronization signal generator 1109 and the sound signal to be played output by the playback signal source 1108 may be fed into the two inputs of a dual-input loudspeaker 1101, and a sound signal representing the sum of the two signals is output at the output of the loudspeaker 1101. In another embodiment, as shown in FIG. 6, the loudspeaker 1101 has only a single input. The sound signal to be played output by the playback signal source 1108 and the synchronization signal output by the synchronization signal generator 1109 first enter an adder 1113 for superposition, and the superimposed sound signal then enters the loudspeaker 1101, which converts the sound signal into sound and plays it.
The sound played out by the loudspeaker 1101 contains both the audible sound to be played in the first frequency band and a component, such as ultrasound, converted from the synchronization signal in the second frequency band that the human ear cannot hear. The component converted from the synchronization signal does not disturb the user's hearing, but when the microphone pickup unit 12 sends the collected mixed signal 25 back to the device body 11, its second-frequency-band component can be compared in time with the synchronization signal generated by the synchronization signal generator 1109 to obtain the time delay from the generation of the synchronization signal until its reception by the device body 11. This time delay is also the time delay from when the playback signal source 1108 generates the sound signal to be played to when the mixed signal 25 is received by the device body 11. Therefore, the sound signal to be played can be delayed by this time delay, and echo cancellation can be performed using the delayed sound signal to obtain the collected human voice signal, thereby solving the problem that the microphone signal and the echo reference signal cannot be synchronized and improving speech recognition performance.
The sound signal p(t) played by the loudspeaker 1101 is actually the sum of the synchronization signal s(t) and the sound signal to be played r(t), i.e.
p(t) = s(t) + r(t)       (Formula 4)
Due to transmission losses, when the sound signal played by the loudspeaker 1101 is received by the microphone pickup unit 12, the received sound signal becomes:
m(t) = g(t)*s(t) + k(t)*r(t)   (Formula 5)
where m(t) denotes the sound signal received by the microphone pickup unit 12, g(t) is the transfer function of s(t) between the loudspeaker 1101 and the microphone pickup unit 12, and k(t) is the transfer function of r(t) between the loudspeaker 1101 and the microphone pickup unit 12.
The microphone pickup unit 12 collects the user voice 24, the echo 22, after spatial propagation, of the sound signal played by the loudspeaker 1101 (loudspeaker echo for short), and the interfering noise 23 in the environment (environmental noise for short). The user voice 24 is the sound uttered by the user, which may contain a voice control instruction by which the user wants to operate the smart speaker, such as "turn the volume up or down". The echo 22, after spatial propagation, of the sound signal played by the loudspeaker 1101 is the sound played by the loudspeaker 1101 as it reaches the microphone pickup unit 12 after propagating through space. The space is the space in the environment (for example, a room) where the loudspeaker 1101 and the microphone pickup unit 12 are located. Spatial propagation includes straight-line propagation as well as propagation that reaches the microphone pickup unit 12 after reflection and diffraction from walls, windows, and so on. The interfering noise 23 in the environment refers to background sounds such as ambient noise and the whistling of air. The microphone pickup unit 12 shown in FIG. 5 includes a microphone unit 1201. The microphone unit 1201 is a sound-to-electrical-signal conversion unit, which can convert sound into an electrical signal representing the sound, in particular a digital signal. The digital signal is in fact the mixed signal 25 obtained by converting the superposition of the user voice, the interfering noise in the environment, and the echo of the sound signal played by the smart speaker.
The microphone unit 1201 of the microphone pickup unit 12 sends the mixed signal 25 to the device body 11. The delay determining unit 1110 in the device body 11 can determine the time delay between the second-frequency-band component in the mixed signal 25 and the synchronization signal. In one embodiment, the delay determining unit 1110 may itself contain a filtering unit. Because the first frequency band of the sound signal to be played differs from the second frequency band of the synchronization signal, the filtering unit can filter out the second-frequency-band component from the mixed signal 25. This component is in fact the delayed signal resulting from the synchronization signal going through the sequence of being generated, received by the microphone pickup unit 12, and sent back by the microphone pickup unit 12 to the device body 11. Comparing it with the originally generated synchronization signal determines the time delay between the generation of the synchronization signal and its re-reception by the device body 11.
In another embodiment, the delay determining unit 1110 may itself have no filtering unit; instead, a filter 1116 is placed in front of it, as shown in FIG. 6. The filter 1116 can filter out the second-frequency-band component from the received mixed signal 25 for the delay determining unit 1110 to determine the delay time. In one embodiment, the sampling rate of the filter 1116 is the same as the sampling rate of the synchronization signal and the sampling rate of the mixed signal received by the device body 11. The filtering delay introduced by the filter 1116 is τ_f.
The delay determining unit 1110 compares the second-frequency-band component in time with the synchronization signal generated by the synchronization signal generator 1109 to determine the time delay. Therefore, the synchronization signals historically generated by the synchronization signal generator 1109 need to be recorded or buffered. In one embodiment, the time at which the synchronization signal generator 1109 generates the synchronization signal may be recorded by adding a timestamp. When the synchronization signal generator 1109 generates the synchronization signal, a timestamp is stamped into the synchronization signal, for example with part of the synchronization signal occupied by the timestamp. After the mixed signal 25 is received by the device body 11, the filtered-out second-frequency-band component still contains the timestamp, and the delay determining unit 1110 can then determine the time delay from the difference between the time at which it receives the second-frequency-band component output by the filter 1116 and the timestamp.
In another embodiment, as shown in FIG. 6, a second buffer 1115 is provided for buffering the synchronization signal generated by the synchronization signal generator 1109, so that historically generated synchronization signals do not disappear. In the second buffer 1115, the synchronization signal generated by the synchronization signal generator 1109 may be stored along an axis of generation time. After the second-frequency-band component output by the filter 1116 is received, the component is matched against the historically generated synchronization signals buffered in the second buffer. If the component is found to match a certain segment of the historically generated synchronization signal, the generation time of that segment is read out and subtracted from the time at which the second-frequency-band component was received, yielding the time delay. Compared with the timestamp approach, this embodiment does not need to add a timestamp or occupy part of the synchronization signal, and can reduce the occupation of network resources.
In one embodiment, the duration of synchronization signal buffered by the second buffer 1115 is at least the sum of the transmission duration of the sound signal played by the loudspeaker 1101 to the microphone pickup unit 12 and the transmission duration of the sound signal output by the microphone pickup unit 12 back to the device body 11. In practice, the playback signal source 1108 may play a test sound signal; the time at which the loudspeaker 1101 plays the test sound signal and the time at which the microphone pickup unit 12 receives it are recorded to obtain the transmission duration from the loudspeaker 1101 to the microphone pickup unit 12. Then, the time at which the microphone pickup unit 12 outputs the mixed signal 25 containing the sound signal and the time at which the mixed signal 25 is received by the receiver 1122 are recorded to obtain the transmission duration of the sound signal output by the microphone pickup unit 12 back to the device body 11. For that test sound signal, the sum of the two transmission durations is then computed. After computing the sums for multiple test sound signals, the average of the sums is determined, and the capacity of the second buffer 1115 is set greater than or equal to the average.
The point of setting the buffered duration of the second buffer 1115 to at least this sum is that, if it were set smaller than the sum of the transmission duration of the sound signal played by the loudspeaker 1101 to the microphone pickup unit 12 and the transmission duration of the sound signal output by the microphone pickup unit 12 back to the device body 11, then when the delay determining unit 1110 looks up, in the second buffer, the segment of the synchronization signal matching the second-frequency-band component filtered out by the filter 1116, the matching segment might already have been pushed out of the second buffer, because the time taken for the synchronization signal to be played by the loudspeaker 1101, received by the microphone pickup unit 12, sent back to the device body 11, and filtered by the filter 1116 of the device body 11 exceeds the capacity of the second buffer 1115. When the delay determining unit 1110 looks up the synchronization signal segment in the second buffer 1115 corresponding to the filtered-out second-frequency-band component, that segment would already have disappeared, wholly or partly, from the second buffer.
In another embodiment, the time delay between the synchronization signal and the second-frequency-band component may be determined by cross-correlation.
In this embodiment, the delay determining unit 1110 may determine the time delay between the second-frequency-band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal as follows:
determining, as the time delay, the time corresponding to the maximum of the cross-correlation function between the second-frequency-band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal.
Let the data stream output from the filter 1116 be h(t), and let the synchronization signal generated by the synchronization signal generator 1109 and buffered in the second buffer 1115 be s(t). After the delay determining unit 1110 receives h(t), one way to determine the time delay τ(t) between h(t) and s(t) is to compute the cross-correlation between the two:
R_hs(τ) = ∫ h(t)·s(t − τ) dt      (Formula 6)
where R_hs(τ) is the cross-correlation function of h(t) and s(t). The time corresponding to the maximum of the function R_hs(τ) can be used as the estimate τ′_1(t_1) of the time delay between the two signals h(t) and s(t).
In one embodiment, where the synchronization signal is a carrier-modulated pseudo-random sequence, the synchronization signal is demodulated before the time delay between the second-frequency-band component and the synchronization signal is determined by computing the cross-correlation function between the synchronization signal and the second-frequency-band component in the mixed signal 25 sent back by the microphone pickup unit 12. The cross-correlation function is then the cross-correlation function between the demodulated synchronization signal and the second-frequency-band component. The benefit of demodulating first is that the time delay determined through the cross-correlation function becomes more accurate.
In this embodiment, the time τ′_1(t_1) corresponding to the maximum of the cross-correlation function between the two signals h(t) and s(t) is taken as the time delay between the second-frequency-band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal.
In another embodiment, the delay determining unit 1110 may determine the time delay between the second-frequency-band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal as follows:
determining the time corresponding to the maximum of the cross-correlation function between the second-frequency-band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal;
taking the sum of the determined time and the filtering delay introduced by the filter 1116 as the time delay.
That is, the sum τ_1(t) of the time τ′_1(t_1) corresponding to the maximum of the cross-correlation function between the two signals h(t) and s(t) and the filtering delay τ_f introduced by the filter 1116 is taken as the time delay between the second-frequency-band component in the mixed signal 25 sent back by the microphone pickup unit 12 and the synchronization signal, i.e.:
τ_1(t) = τ′_1(t_1) + τ_f      (Formula 7)
The benefit of this embodiment is that it fully accounts for the effect of the filter's filtering on the delay, thereby improving the accuracy of determining the time delay and performing echo cancellation.
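The cross-correlation search of Formulas 6-7 can be sketched in discrete time as follows: the estimated delay is the lag at which the cross-correlation between the filtered component h and the buffered synchronization signal s is maximal. The signal lengths, the ±1 sample values, and the synthetic 37-sample delay are illustrative assumptions.

```python
import random

def estimate_delay(h, s, max_lag):
    """Return the lag (in samples) maximizing the discrete cross-correlation
    R_hs(tau) = sum_t h[t] * s[t - tau], searched over lags 0..max_lag."""
    best_lag, best_corr = 0, float("-inf")
    for lag in range(max_lag + 1):
        corr = sum(h[t] * s[t - lag]
                   for t in range(lag, len(h)) if t - lag < len(s))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# Synthetic check: h is s delayed by 37 samples (illustrative numbers);
# s plays the role of the buffered pseudo-random synchronization signal.
rng = random.Random(0)
s = [rng.choice((-1.0, 1.0)) for _ in range(512)]
true_delay = 37
h = [0.0] * true_delay + s
tau = estimate_delay(h, s, max_lag=100)
```

Because a pseudo-random ±1 sequence has a near-impulse autocorrelation, the correlation peaks sharply at the true lag; in a full system the filter delay τ_f would then be added per Formula 7.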
Next, the echo cancellation unit 1111 performs, based on the mixed signal 25 sent back by the microphone pickup unit 12, echo cancellation using the sound signal to be played delayed by the determined time delay, to obtain the collected human voice signal. First, the echo cancellation unit 1111 needs to delay the sound signal to be played generated by the playback signal source 1108 based on the determined time delay. In the embodiment shown in FIG. 5, this may be implemented by a delay circuit built into the echo cancellation unit 1111. In another embodiment, it may be implemented by a delay unit 1118 arranged between the delay determining unit 1110 and the echo cancellation unit 1111, as shown in FIG. 6. The delay unit 1118 delays the sound signal to be played by the time delay determined by the delay determining unit 1110, for the echo cancellation unit 1111 to perform echo cancellation.
Whether a built-in delay circuit or a separate delay unit 1118 is used, when delaying the sound signal to be played by the determined time delay, the sound signal to be played generated at that time by the playback signal source 1108 must be found. In this embodiment, the device body 11 further includes a first buffer 1114 for buffering the sound signal to be played generated by the playback signal source 1108. As shown in FIG. 6, the first buffer 1114 may be connected between the playback signal source 1108 and the delay unit 1118. Thus, after the playback signal source 1108 generates the sound signal to be played, it can be buffered in the first buffer 1114. After the delay determining unit 1110 determines the time delay, the delay unit 1118 can delay the sound signal to be played in the first buffer 1114 by that time delay.
In one embodiment, the duration of the sound signal to be played buffered by the first buffer 1114 is at least the sum of the transmission duration of the sound signal played by the loudspeaker 1101 to the microphone pickup unit 12 and the transmission duration of the sound signal output by the microphone pickup unit 12 back to the device body 11. In practice, the playback signal source 1108 may play a test sound signal; the time at which the loudspeaker 1101 plays the test sound signal and the time at which the microphone pickup unit 12 receives it are recorded to obtain the transmission duration from the loudspeaker 1101 to the microphone pickup unit 12. Then, the time at which the microphone pickup unit 12 outputs the mixed signal 25 containing the sound signal and the time at which the mixed signal 25 is received by the receiver 1122 are recorded to obtain the transmission duration of the sound signal output by the microphone pickup unit 12 back to the device body 11. For that test sound signal, the sum of the two transmission durations is then computed. After computing the sums for multiple test sound signals, the average of the sums is determined, and the capacity of the first buffer 1114 is set greater than or equal to the average.
The point of making the buffered duration of the first buffer 1114 at least this sum is that, if it were set smaller than the sum of the transmission duration of the sound signal played by the loudspeaker 1101 to the microphone pickup unit 12 and the transmission duration of the sound signal output by the microphone pickup unit 12 back to the device body 11, then when the delay unit 1118 delays the corresponding signal segment according to the time delay determined by the delay determining unit 1110, it would find that the corresponding segment had already been moved out of the first buffer 1114.
In one embodiment, the echo cancellation unit 1111 may perform echo cancellation directly on the mixed signal 25 sent back by the microphone pickup unit 12, using the sound signal to be played delayed by the determined time delay. Although the mixed signal 25 sent back by the microphone pickup unit 12 also includes the second-frequency-band component, since that component is inaudible to the human ear, the received mixed signal 25 sent back by the microphone pickup unit 12 may also be sent directly to the echo cancellation unit 1111 for echo cancellation. In another embodiment, however, as shown in FIG. 6, the mixed signal 25 sent back by the microphone pickup unit 12 is first sent to a downsampler 1117 and, after processing by the downsampler 1117, sent to the echo cancellation unit 1111 for echo cancellation. The downsampler 1117 converts the sampling rate of the mixed signal 25 sent back from the microphone pickup unit 12 from the sampling rate used for playback (for example, 48 KHz or 96 KHz) to the sampling rate used for human voice recognition (for example, 16 KHz), for the echo cancellation unit 1111 to perform echo cancellation. After processing by the downsampler 1117, the second-frequency-band component is naturally eliminated and does not enter the echo cancellation unit 1111. In this way, the echo cancellation unit 1111 can use the sound signal to be played delayed by the determined time delay to perform echo cancellation on the received first-frequency-band signal (the band matching the sound signal to be played), thereby improving the quality of echo cancellation. The echo cancellation unit 1111 may be implemented by an echo cancellation circuit or the like.
In one embodiment, as shown in FIG. 6, the sound signal to be played delayed by the determined time delay and output from the delay unit 1118 also passes through a post-delay downsampler 1119. The post-delay downsampler 1119 converts the sampling rate from the sampling rate used for playback to the sampling rate used for human voice recognition, for the echo cancellation unit 1111 to perform echo cancellation.
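The embodiments do not prescribe a particular echo cancellation algorithm; as one common realization (an assumption here, not the prescribed implementation), an adaptive filter can subtract the contribution of the delay-aligned playback reference from the microphone signal. The sketch below uses a normalized-LMS (NLMS) adaptive filter on synthetic signals; the echo path, filter length, and step size are illustrative.

```python
import random

def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-6):
    """Cancel the echo in `mic` using the (already delay-aligned) playback
    reference `ref` with an NLMS adaptive filter; returns the residual signal."""
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]  # reference window
        y = sum(wi * xi for wi, xi in zip(w, x))                      # echo estimate
        e = mic[n] - y                                                # residual
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]         # NLMS update
        out.append(e)
    return out

# Synthetic check: the "mic" signal is the reference passed through a short
# illustrative echo path, so a well-adapted filter should drive it near zero.
rng = random.Random(1)
ref = [rng.choice((-1.0, 1.0)) for _ in range(4000)]
path = [0.5, -0.3, 0.1]
mic = [sum(path[k] * (ref[n - k] if n - k >= 0 else 0.0) for k in range(len(path)))
       for n in range(len(ref))]
residual = nlms_echo_cancel(mic, ref)
```

After the filter converges, the residual energy in the later samples is a small fraction of the echo energy; in the device this residual would carry the user's voice plus environmental noise on to the speech enhancement unit.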
In addition, in one embodiment, as shown in FIG. 6, a receiver 1122 is also provided in the device body 11 for receiving the mixed signal 25 sent back by the microphone pickup unit 12. After receiving the mixed signal 25 sent back by the microphone pickup unit 12, the receiver 1122 sends it on the one hand to the filter 1116 to filter out the second-frequency-band component for determining the time delay, and on the other hand through the downsampler 1117 to the echo cancellation unit 1111 for echo cancellation.
In addition, as shown in FIG. 6, in one embodiment, the device body 11 further includes a speech enhancement unit 1120 connected to the output of the echo cancellation unit 1111, which uses algorithms such as beamforming-based signal separation and speech noise reduction to remove the environmental interfering noise 23 from the output of the echo cancellation unit 1111. Furthermore, as described in detail later, when the microphone pickup unit 12 has a plurality of microphone units 1201, so that the mixed signal 25 sent by the microphone pickup unit 12 to the receiver 1122 is a multi-channel mixed signal, the speech enhancement unit 1120 is also used to merge the multiple channels of user voice, collected by the microphone units 1201, for which the echo cancellation unit 1111 has completed echo cancellation, to obtain an enhanced human voice signal.
In addition, as shown in FIG. 6, in one embodiment, the device body 11 further includes a speech recognition unit 1121 for performing speech recognition on the collected human voice signal. In FIG. 6, the speech recognition unit 1121 is connected to the output of the speech enhancement unit 1120, performs speech recognition on the enhanced human voice signal, and recognizes it as text.
In addition, as shown in FIG. 6, in one embodiment, the device body 11 further includes a control unit 1124 for executing a control action based on a control instruction in the speech recognition result. In FIG. 6, it is connected to the output of the speech recognition unit 1121 and executes a control action according to the recognized text output by the speech recognition unit 1121. For example, the control unit 1124 may internally store language patterns corresponding to various actions. For example, for an action such as "turn up the volume", the corresponding language patterns may include many patterns such as "turn the volume up", "louder", and "even louder". The actions and the corresponding language patterns are stored in advance in a language-pattern-to-action lookup table in the control unit 1124. When the control unit 1124 receives a speech recognition result, it matches the result against the language patterns in the table, finds the action corresponding to the matched language pattern, and executes that action.
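A minimal sketch of such a language-pattern-to-action lookup table follows; the phrases and action names are illustrative assumptions, not values taken from the embodiments.

```python
# Language patterns mapped to actions, as a control unit might store them.
PATTERN_TO_ACTION = {
    "turn the volume up": "VOLUME_UP",
    "louder": "VOLUME_UP",
    "even louder": "VOLUME_UP",
    "turn the volume down": "VOLUME_DOWN",
}

def dispatch(recognized_text):
    """Match the recognized text against the stored language patterns and
    return the corresponding action, or None when nothing matches."""
    return PATTERN_TO_ACTION.get(recognized_text.strip().lower())

action = dispatch("Louder")
```

Several patterns can map to the same action, so the table captures the many phrasings a user might use for one command.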
In the smart speaker application scenario shown in FIG. 1A, the executed control actions control the speaker's playback, for example, "play XXX" or "turn up the volume".
In the smart TV application scenario shown in FIG. 1B, the executed control actions control the TV's playback, for example, "switch to channel XX" or "turn up the volume".
In the voice-controlled smart navigation application scenario shown in FIG. 1C, the executed control actions control the navigation, for example, "change destination to XXX", "turn up the navigation voice", or "change the navigation display from north-up to heading-up".
In the KTV playback system application scenario shown in FIG. 1D, the speech recognition unit 1121 and the control unit 1124 are not required.
The specific structure of the microphone pickup unit 12 is described in detail below with reference to FIG. 6.
In one embodiment, the microphone pickup unit 12 has a transmitter 1204 for sending to the device body 11 the collected user voice 24, the echo 22, after spatial propagation, of the sound signal played by the device body 11 (loudspeaker echo for short), and the interfering noise 23 in the environment (environmental noise for short). The transmitter 1204 has a standby mode and a transmitting mode. In the standby mode, the transmitter 1204 does not work; when a human voice signal is recognized in the signal sent from the microphone unit 1201, the transmitter switches from the standby mode to the transmitting mode. In the microphone pickup unit 12, the power consumption of the transmitter 1204 is the largest single item of power consumption. Because the standby mode consumes almost no power, letting the transmitter 1204 enter the transmitting mode only when a human voice is recognized in what the microphone unit 1201 collects, and keeping the transmitter 1204 in the standby mode when no human voice is recognized, greatly reduces the standby power consumption of the microphone pickup unit 12 and solves the problem of its high power consumption.
In one embodiment, as shown in FIG. 6, the microphone pickup unit 12 has a microphone unit 1201, a human voice recognition unit 1203, and a buffer unit 1202.
The microphone unit 1201 receives the user voice 24, the echo 22, after spatial propagation, of the sound signal played by the device body 11, and the interfering noise 23 in the environment, and converts the received sound signal into a digital signal. In one embodiment, the sampling rate of the microphone unit 1201 is the same as that of the sound signal to be played and the synchronization signal, and is greater than or equal to twice the highest frequency of the synchronization signal. The benefit of choosing the sampling rate in this way is that frequency aliasing is prevented and sampling quality improved, thereby improving the quality of the human voice signal collected in the embodiments of this application.
The buffer unit 1202 buffers the signal collected by the microphone unit 1201, i.e., the digital signal converted from the received sound signal, including the echo 22, after spatial propagation, of the sound signal played by the device body 11 and the interfering noise 23 in the environment. In one embodiment, the buffer unit 1202 is a circular buffer unit; that is, when the amount of data entering the buffer unit 1202 exceeds its buffer capacity Tc, the data that entered the buffer unit 1202 earliest is moved out on a first-in-first-out basis, so that the amount of data buffered by the buffer unit 1202 never exceeds the buffer capacity Tc. In one embodiment, the buffer capacity Tc is set to be no smaller than the detection delay caused by the human voice recognition of the human voice recognition unit 1203.
The human voice recognition unit 1203 recognizes a human voice from the output of the microphone unit 1201 and, upon recognizing a human voice, triggers the transmitter 1204 to enter the transmitting mode and transmit the digital signals buffered in the buffer unit 1202, converted from the echo 22, after spatial propagation, of the sound signal played by the device body 11 and the interfering noise 23 in the environment. The human voice recognition unit 1203 may be implemented by a human voice recognition module.
The purpose of providing the buffer unit 1202 is as follows. In existing human voice recognition technology, detecting a human voice from an input sound signal requires some computation, which introduces a detection delay. When the human voice recognition unit 1203 recognizes a human voice and triggers the transmitter 1204 to enter the transmitting mode, if the human voice were not buffered, then during the period from when the human voice signal is received until it is recognized by the human voice recognition unit 1203, the transmitter 1204 would not be working and the human voice signal would not be sent; the human voice signal would only be sent after the human voice recognition unit 1203 recognizes it, and a period of the human voice signal would thus be lost. By first saving the human voice signal in the buffer unit 1202 for a period and releasing the buffered human voice signal when the human voice recognition unit 1203 recognizes a human voice, signal loss during human voice recognition is avoided. Setting the buffer capacity Tc to be no smaller than the detection delay caused by the human voice recognition of the human voice recognition unit 1203 ensures that at least the human voice signal during the detection delay is not lost, improving the completeness of the finally obtained human voice signal.
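A minimal sketch of such a circular (first-in-first-out) buffer follows; the capacity of 4 samples and the integer sample values are illustrative, whereas a real unit would size Tc from the recognizer's detection delay.

```python
from collections import deque

class CircularBuffer:
    """FIFO buffer holding at most `capacity` samples; when full, the oldest
    sample is dropped to make room for the newest, so the buffered amount
    never exceeds the capacity Tc."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def push(self, sample):
        self.data.append(sample)          # deque with maxlen evicts the oldest entry

    def drain(self):
        """Release everything buffered so far (e.g. once a voice is detected)."""
        out = list(self.data)
        self.data.clear()
        return out

buf = CircularBuffer(capacity=4)
for sample in [1, 2, 3, 4, 5, 6]:
    buf.push(sample)
held = buf.drain()
```

When the recognizer fires, `drain` hands the transmitter the most recent Tc worth of samples, so the speech captured during the detection delay is not lost.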
In one embodiment, when the human voice recognition unit 1203, in the state of not having recognized a human voice signal, recognizes a human voice signal in the mixed signal 25 sent from the microphone unit 1201, it sends a start trigger to the transmitter 1204. Upon receiving the start trigger, the transmitter 1204 switches from the standby mode to the transmitting mode.
In one embodiment, when the human voice recognition unit 1203, in the state of having recognized a human voice signal, ceases to recognize a human voice signal in the mixed signal 25 sent from the microphone unit 1201, it sends a stop trigger to the transmitter 1204. Upon receiving the stop trigger, the transmitter 1204 switches from the transmitting mode to the standby mode.
In one embodiment, to improve the accuracy of the collected human voice signal, the microphone pickup unit 12 may include a plurality of microphone units 1201, one of which is connected to the human voice recognition unit 1203. For example, the first microphone unit may be designated to be connected to the human voice recognition unit 1203, as shown in FIG. 7. Of course, those skilled in the art should understand that the second or third microphone unit may also be designated to be connected to the human voice recognition unit 1203. When the human voice recognition unit 1203 detects a human voice in the output of the connected microphone unit 1201, it triggers the transmitter 1204 to enter the transmitting mode and transmit the mixed signals collected by the microphone units 1201 and buffered in the buffer unit 1202.
The plurality of microphone units 1201 simultaneously collect the user voice 24, and also simultaneously collect the echo 22 of the sound signal played by the loudspeaker 1101 and the interfering noise 23 in the environment. When the sound signal p(t) played by the loudspeaker 1101 is actually the sum of the synchronization signal s(t) and the sound signal to be played r(t), i.e., when Formula 4 is satisfied, the sound signal received by the i-th microphone unit 1201 (1 ≤ i ≤ n, where n is the total number of microphone units) is:
m_i(t) = g_i(t)*s(t) + k_i(t)*r(t)   (Formula 8)
where m_i(t) denotes the sound signal received by the i-th microphone unit 1201, g_i(t) is the transfer function of s(t) between the loudspeaker 1101 and the i-th microphone unit 1201, and k_i(t) is the transfer function of r(t) between the loudspeaker 1101 and the i-th microphone unit 1201.
Because the plurality of microphone units 1201 simultaneously collect the user voice 24, the loudspeaker echo 22, and the interfering noise 23 in the environment, multiple channels of collected signals are obtained. The multiple channels of collected signals are processed separately in the device body 11; after each channel's own echo cancellation, each channel yields its own human voice signal, and finally these human voice signals are merged so that the resulting human voice signal is enhanced. This overcomes the drawbacks of collecting the user voice signal with a single microphone unit 1201, where the collected signal tends to be weak and is susceptible to performance degradation of that microphone unit 1201 or of other components in that channel. Since the microphone units 1201 are all located in the same microphone pickup unit 12, they can be considered to receive the human voice signal almost simultaneously; therefore, any one of the microphone units can be connected to the human voice recognition unit 1203 for it to recognize the human voice.
Although the microphone units 1201 in the same microphone pickup unit 12 can receive the human voice signal almost simultaneously, higher accuracy requires accounting for the slight time differences between their receptions of the human voice signal. Therefore, in another embodiment, as shown in FIG. 8, the human voice recognition unit 1203 may be connected to every microphone unit 1201. When the human voice recognition unit 1203 detects a human voice in the output of any one of the connected microphone units 1201, it triggers the transmitter 1204 to enter the transmitting mode, and the digital signals, buffered in the buffer unit 1202, converted from the human voice signals collected by the microphone units 1201, the interfering noise in the environment, and the loudspeaker echo, are transmitted. In this way, among the plurality of microphone units 1201, the one that first receives the human voice signal sends its generated digital signal earliest to the human voice recognition unit 1203 so that recognition can begin. When the human voice recognition unit 1203 recognizes the human voice signal, it triggers the transmitter 1204 to start transmitting the digital signals. Thus, the human voice recognition unit 1203 can perform recognition in response to the earliest received user voice and trigger the transmitter 1204 to enter the transmitting mode, sending the digital signals as early as possible.
In FIGS. 7-8, because there are multiple microphone units 1201, what the buffer unit 1202 buffers is also the multiple channels of digital signals obtained by the multiple microphone units 1201 converting the collected human voice signals, the interfering noise in the environment, and the loudspeaker echo. When the human voice recognition unit 1203 recognizes a human voice and triggers the transmitter 1204 to enter the transmitting mode, the transmitter 1204 may send the multiple channels of digital signals to the receiver 1122 of the device body 11 through multiple channels (for example, wireless communication channels), respectively.
In one embodiment, the transmitter 1204 may multiplex the multiple channels of digital signals before transmission, multiplexing them into one communication channel for transmission to the receiver 1122. Multiplexing approaches include, for example, packet encapsulation: the digital signal of each channel is tagged with that channel's identifier (ID) and encapsulated into a sub-packet, the digital signals of all channels are encapsulated into multiple sub-packets, the sub-packets are then encapsulated into a large packet, and the large packet is sent to the receiver 1122 through one channel. Other multiplexing approaches such as time-division multiplexing and frequency-division multiplexing may also be used. This embodiment helps reduce channel occupation and rationalize resource usage.
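A minimal sketch of the packet-encapsulation style of multiplexing and the matching de-encapsulation follows; the framing format (1-byte channel ID, 2-byte sample count, 16-bit samples) and the channel IDs are illustrative assumptions, not a format prescribed by the embodiments.

```python
import struct

def mux(channels):
    """Pack per-channel sample frames into one large packet.
    Each sub-packet: 1-byte channel ID, 2-byte sample count, then 16-bit samples."""
    packet = b""
    for ch_id, samples in channels.items():
        packet += struct.pack("<BH", ch_id, len(samples))
        packet += struct.pack("<%dh" % len(samples), *samples)
    return packet

def demux(packet):
    """Recover the per-channel frames from a large packet by walking the
    sub-packets and attributing each payload to its channel ID."""
    channels, offset = {}, 0
    while offset < len(packet):
        ch_id, count = struct.unpack_from("<BH", packet, offset)
        offset += 3
        samples = list(struct.unpack_from("<%dh" % count, packet, offset))
        offset += 2 * count
        channels[ch_id] = samples
    return channels

frames = {0: [10, -20, 30], 1: [5, 6], 2: [-1]}
restored = demux(mux(frames))
```

The transmitter side would call `mux` before sending over the single channel, and the receiver side `demux` before handing each channel to the downsampler.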
In one embodiment, the receiver 1122 in the device body 11 may demultiplex the multiplexed signal sent by the transmitter 1204, that is, parse out from the multiplexed signal the individual channels of digital signals, i.e., the digital signals obtained by each microphone unit 1201 converting the collected human voice signal, the interfering noise in the environment, and the loudspeaker echo. Demultiplexing methods include, for example, packet de-encapsulation: the large packet is de-encapsulated to extract each sub-packet, each sub-packet is then de-encapsulated, and the content extracted from each sub-packet is attributed, according to the sub-packet's identifier (ID), to the channel corresponding to that identifier, so that the digital signals can be sent channel by channel to the downsampler 1117 for downsampling.
To reduce the power consumption of the device body 11, in one embodiment the receiver 1122 of the device body 11 also has a standby mode and a receiving mode. In the standby mode, the receiver 1122 does not work; when a wireless signal is sensed, the receiver 1122 switches from the standby mode to the receiving mode. Because the receiver 1122 does not work and therefore consumes no power in the standby mode, and consumes power only in the receiving mode, the power consumption of the device body 11 is greatly reduced.
In one embodiment, as shown in FIG. 6, the receiver 1122 has a wireless signal sensing unit 1123; when the wireless signal sensing unit 1123 senses a wireless signal, the receiver 1122 switches from the standby mode to the receiving mode. The wireless signal sensing unit 1123 may be implemented by a wireless signal sensor or the like. In one embodiment, after sensing a wireless signal, the wireless signal sensing unit 1123 generates a start trigger, in response to which the receiver 1122 switches from the standby mode to the receiving mode. After the wireless signal sensing unit 1123 senses that the wireless signal has disappeared, it generates a stop trigger, in response to which the receiver 1122 switches from the receiving mode to the standby mode. Because the wireless signal sensing unit uses a wireless signal sensor, whose power consumption is far smaller than that of the working receiver, the power consumption of the receiver 1122 of the device body 11 can be greatly reduced in this way.
In one embodiment, as shown in FIGS. 7-8, the receiver 1122 sends the demultiplexed digital signals of the individual channels, channel by channel, to the downsampler 1117, which performs the downsampling. The downsampler 1117 sends the downsampled signals of the individual channels to the echo cancellation unit 1111, which performs echo cancellation on each demultiplexed channel separately to obtain the echo-cancelled human voice signal of each channel, which is input to the speech enhancement unit 1120. The speech enhancement unit 1120 merges the human voice signals obtained after the echo cancellation unit 1111 completes echo cancellation, to obtain an enhanced human voice signal. Performing echo cancellation on each channel separately and then merging the echo-cancelled human voice signals of the channels enhances the human voice signal and improves the quality of the collected human voice signal.
As shown in FIGS. 7-8, in one embodiment, when the microphone pickup unit 12 has a plurality of microphone units 1201, the delay determining unit 1110 may determine the time delay between the synchronization signal and the second-frequency-band component in the mixed signal 25 collected by one of the plurality of microphone units 1201 and sent back to the device body 11 by the microphone pickup unit 12. The echo cancellation unit 1111 uses the sound signal to be played, delayed by the determined time delay, to perform echo cancellation on the mixed signals 25 collected by all the microphone units 1201 and sent back to the device body 11 by the microphone pickup unit 12. In this embodiment, the filter 1116 may filter out multiple different second-frequency-band components for the different channels (each channel corresponding to one microphone unit 1201) and send them to the delay determining unit 1110 to determine the time delay. In this case, whichever second-frequency-band component the delay determining unit 1110 receives first, it determines the time delay between that component and the synchronization signal, takes the determined time delay as the time delay between all the second-frequency-band components and the synchronization signal, and sends it to the delay unit 1118 to delay the sound signal to be played, obtaining the delayed sound signal to be played. The delayed sound signal to be played is then input to the echo cancellation unit 1111, which performs echo cancellation for all the channels. This embodiment is built on the theory that the time delays of all the channels are essentially the same, so a time delay can be quickly obtained from the first-received second-frequency-band component and immediately used for echo cancellation on all channels, with the advantage of improving the efficiency of the echo cancellation.
As shown in FIGS. 7-8, in one embodiment, when the microphone pickup unit 12 has a plurality of microphone units 1201, the delay determining unit 1110 may separately determine the respective time delays between the synchronization signal and each second-frequency-band component in the mixed signals 25 respectively collected by the plurality of microphone units 1201 and sent back to the device body 11 by the microphone pickup unit 12, i.e., the time delay of each channel, where each channel corresponds to one microphone unit 1201. That is, the delay determining unit 1110 determines a time delay for each channel separately. The delay unit 1118 likewise delays the sound signal to be played buffered in the first buffer 1114 separately according to the time delay determined for each channel, obtaining multiple delayed sound signals to be played. The echo cancellation unit 1111 then uses these multiple delayed sound signals to perform echo cancellation on the output of the downsampler 1117 of each channel separately. That is, the echo cancellation unit 1111 uses the sound signals to be played, delayed by the respectively determined time delays, to perform echo cancellation separately on the mixed signals 25 collected by the corresponding microphone units 1201 and sent back to the device body 11 by the microphone pickup unit 12. This embodiment is built on the theory that the time delays of all the channels, although roughly the same, nevertheless differ slightly, so the time delay of each channel can be determined separately and used only for echo cancellation on the corresponding channel, thereby improving the accuracy of the echo cancellation.
In the schematic architecture diagram shown in FIG. 4A, there are multiple microphone pickup units 12 but only one device body 11, and the device body 11 has multiple receivers 1107 corresponding respectively to the multiple microphone pickup units 12. "Corresponding" means that each receiver receives only the signal sent by its corresponding microphone pickup unit 12 and does not receive signals sent by other microphone pickup units 12, or discards them after reception.
In one embodiment, a receiver 1107 and its corresponding microphone pickup unit 12 are connected by wire, so that a signal sent by the microphone pickup unit 12 can only be received by the corresponding receiver 1107 connected to it.
In another embodiment, a receiver 1107 and its corresponding microphone pickup unit 12 communicate wirelessly. Specifically, the microphone pickup unit 12 may broadcast, but the broadcast signal contains the microphone pickup unit 12's unique identifier. Different microphone pickup units 12 have different pickup unit identifiers, and the signal broadcast by a microphone pickup unit 12 contains the corresponding pickup unit identifier. After receiving a signal sent by a microphone pickup unit 12, the receiver 1107 determines whether the pickup unit identifier carried in the signal is the identifier of its own corresponding microphone pickup unit 12. If so, it keeps the signal; if not, it discards the signal. This embodiment avoids the tangle of cabling in the room that wired connections would cause.
Although there are multiple receivers 1107 in the device body 11, there is only one of every other component in the device body 11. Of the components in the device body 11 shown in FIGS. 6-8 — the playback signal source 1108, the clock circuit 1112, the synchronization signal generator 1109, the adder 1113, the loudspeaker 1101, the first buffer 1114, the second buffer 1115, the filter 1116, the delay determining unit 1110, the delay unit 1118, the post-delay downsampler 1119, the echo cancellation unit 1111, the downsampler 1117, the speech enhancement unit 1120, the speech recognition unit 1121, and the control unit 1124 — there is only one of each. Because the multiple receivers 1107 each receive signals and generate corresponding signal streams, while there is only one of each of the other components in the device body 11, the other components in the device body 11 can process inputs and produce outputs in the order in which they receive their input signals. For example, the multiple receivers 1107 send the received mixed signals 25 to the filter 1116; the mixed signals 25 necessarily arrive at the filter 1116 in some order, and it can filter out their second-frequency-band components in the order in which the mixed signals 25 arrive and output them to the delay determining unit 1110. Because the delay determining unit 1110 receives the second-frequency-band components in a certain order, it likewise processes them in the order of arrival. The multiple receivers 1107 also send the received mixed signals 25 to the downsampler 1117. The mixed signals 25 sent by the multiple receivers 1107 also necessarily arrive at the downsampler 1117 in some order, so the downsampler 1117 downsamples them one by one in the order of reception. In addition, it should be noted that, as shown in FIGS. 7-8, when multiple microphone units 1201 generate multi-channel signals, the signal that the downsampler 1117 receives from one receiver may itself contain multiple channels; the downsampler 1117 receives the multi-channel signals in parallel and, after downsampling, sends the downsampled signals in parallel to the echo cancellation unit 1111.
In this embodiment, the delay determining unit 1110 separately determines the time delay between the synchronization signal and the second-frequency-band component in the mixed signal 25 received by each receiver 1107 (in the order in which the second-frequency-band components filtered out by the filter 1116 arrive). The delay unit 1118 delays the sound signal to be played in the first buffer 1114 according to the separately determined time delays. The echo cancellation unit 1111 performs, based on the mixed signal 25 received by each receiver, echo cancellation using the sound signal to be played delayed by the corresponding determined time delay, to obtain the respective collected human voice signals. The echo cancellation unit 1111 thus still yields multiple human voice signals. The speech enhancement unit 1120 can superimpose the multiple human voice signals output by the echo cancellation unit 1111, thereby enhancing the human voice signal.
After the user utters a voice, because the microphone pickup units 12 are at different distances from the user, the strengths of the user voice signal they receive differ. Placing microphone pickup units 12 at different positions in the room, processing the signals sent by each microphone pickup unit 12 separately, and merging the processing results helps overcome the problem of low human voice recognition accuracy caused by a microphone pickup unit 12 at a single position possibly being far from the user.
In the schematic architecture diagram shown in FIG. 4B, there are multiple microphone pickup units 12, but only one device body 11 and only one receiver 1107 in the device body 11; therefore, the receiver 1107 needs to correspond to all the microphone pickup units 12.
In one embodiment, the receiver 1107 is connected to all the microphone pickup units 12 by wire. In another embodiment, the receiver 1107 communicates wirelessly with all the microphone pickup units 12. The receiver 1107 receives the signals sent by all the microphone pickup units 12, but selects only one of them for processing according to a predetermined criterion and discards the other mixed signals.
In one embodiment, the predetermined criterion is to keep the signal from the first-received microphone pickup unit 12 for processing and discard the signals sent by the other microphone pickup units 12. Specifically, in one embodiment, the microphone pickup units 12 have different pickup unit identifiers, and the signals sent by a microphone pickup unit 12 contain the corresponding pickup unit identifier. If the receiver 1107, after receiving a signal bearing one pickup unit identifier, then receives signals bearing other pickup unit identifiers, it discards the signals bearing the other pickup unit identifiers. That is, there is only one receiver 1107, and it cannot simultaneously receive the signals sent by multiple microphone pickup units 12; it therefore keeps only the signal from the earliest-received microphone pickup unit 12, and discards any signals bearing other pickup unit identifiers received during the reception. This embodiment is built on the consideration that the user voice picked up by the various microphone pickup units 12 will not differ much, so only the signal from the one earliest-received microphone pickup unit 12 need be kept. Processing only the signal from the earliest-received microphone pickup unit 12 helps increase the speed of echo cancellation.
In another embodiment, the predetermined criterion may be to select the signal sent by the microphone pickup unit 12 closest to the user and discard the signals sent by the other microphone pickup units 12. This selection is made because the closer the microphone pickup unit 12 is to the user, the louder the collected human voice and the better the human voice recognition effect. This can be implemented by the microphone pickup unit 12 adding a timestamp to the signal 25 when sending it, the timestamp indicating the time at which the microphone pickup unit 12 received the user voice. In this way, the receiver 1107 can, according to the order of the times indicated by the timestamps, select the mixed signal 25 with the earliest timestamp for processing. The user voice in the mixed signal 25 with the earliest timestamp has the highest quality, which helps improve the speech recognition effect.
In this embodiment, the delay determining unit 1110 determines only the time delay between the second-frequency-band component in the mixed signal 25 sent by that one received microphone pickup unit 12 and the synchronization signal. The delay unit 1118 delays the to-be-played sound signal in the first buffer 1114 by the determined time delay. Based on the mixed signal 25 sent by that one microphone pickup unit 12, the echo cancellation unit 1111 performs echo cancellation using the to-be-played sound signal delayed by the determined time delay, obtaining the collected voice signal.
In the architecture diagram shown in FIG. 4C, the device body 11 includes a loudspeaker 1101 and a receiver 1107 located locally, and a processing device 1103 located at the remote end 1104. The playback signal source 1108, clock circuit 1112, synchronization signal generator 1109, adder 1113, first buffer 1114, second buffer 1115, filter 1116, delay determining unit 1110, delay unit 1118, post-delay downsampler 1119, echo cancellation unit 1111, downsampler 1117, voice enhancement unit 1120, voice recognition unit 1121, and control unit 1124 of the device body 11 shown in FIGS. 6-8 are all located in the processing device 1103 at the remote end 1104.
The processing device 1103 communicates with the local loudspeaker 1101 and receiver 1107 over the Internet or a telecommunications connection. After the user speaks, the microphone pickup unit 12 receives the user's voice 24 and sends the user's voice 24, the echo 22 of the sound played by the loudspeaker, and the interfering ambient noise 23 together to the local receiver 1107 as a mixed signal 25. The local receiver 1107 forwards the received mixed signal over the Internet or telecommunications connection to the processing device 1103 at the remote end 1104. Following the procedure described in the preceding embodiments with reference to FIGS. 5-8, the processing device 1103 removes the loudspeaker echo and the ambient noise from the received mixed signal 25 to obtain the voice signal, generates a control instruction from the voice signal (such as turning the volume up or down), and sends the control instruction over the Internet or telecommunications connection to the loudspeaker 1101, thereby controlling the playback volume of the loudspeaker 1101. The advantage of this embodiment is that the processing device 1103, which does not need to collect signals, is moved from the local site to the remote end, reducing the space occupied locally while enabling centralized network processing.
In the system architecture diagram shown in FIG. 4D, the processing device 1103 at the remote end 1104 communicates with the loudspeakers 1101 and receivers 1107 at multiple sites 2 and processes the mixed signals 25 sent by the microphone pickup units 12 at those sites to the receivers 1107. That is, a single remote processing device 1103 can form a separate device body 11 with the loudspeaker 1101 and receiver 1107 at each of multiple sites.
After a user at a local site speaks, the local microphone pickup unit 12 receives the user's voice 24 and sends the user's voice 24, the loudspeaker echo 22, and the ambient noise 23 together to the receiver 1107 as a mixed signal 25. The receiver 1107 forwards the received mixed signal 25 over the Internet or a telecommunications connection to the processing device 1103 at the remote end 1104. Following the procedure described in the preceding embodiments with reference to FIGS. 5-8, the processing device 1103 removes the loudspeaker echo and ambient noise from the received mixed signal 25 to obtain the voice signal, generates a control instruction from the voice signal, and sends the control instruction to the loudspeaker 1101 over the Internet or telecommunications connection, thereby controlling the playback volume of the loudspeaker 1101. The advantage of this embodiment is that the mixed signals 25 reported by all receivers 1107 are handled uniformly by the same processing device 1103, achieving centralized processing, which promotes efficient use of resources and reduces the consumption of local resources.
In the system architecture diagram shown in FIG. 4E, the remote end may have multiple processing devices 1103 and a scheduling module 1105. The scheduling module 1105 is connected to the multiple processing devices 1103. Each local receiver 1107 communicates with the multiple remote processing devices 1103 through the scheduling module 1105. When a receiver 1107 receives the mixed signal 25 sent by the local microphone pickup unit 12, it sends the mixed signal 25 over the Internet or a telecommunications connection to the scheduling module 1105, and the scheduling module 1105 designates the processing device 1103 that will process the mixed signal 25. The specific processing procedure is as described in the preceding embodiments with reference to FIGS. 5-8.
The benefit of this arrangement is that the pairing of a processing device 1103 with a loudspeaker 1101 and receiver 1107 is not fixed: when a mixed signal 25 arrives from a local receiver 1107, the scheduling module 1105 decides which processing device 1103 to assign it to based on each processing device's current load. This balances the processing load across the processing devices 1103 and allocates network resources effectively.
In one embodiment, the scheduling module 1105 designates the processing device 1103 that will process the mixed signal 25 received by the receiver 1107 based on the number of tasks each processing device 1103 is currently handling, where a task is the process of handling the signal sent by a receiver 1107, as assigned by the scheduling module 1105, to obtain the collected voice signal.
After a user at a local site speaks, the local microphone pickup unit 12 receives the user's voice 24 and sends the user's voice 24, the loudspeaker echo 22, and the ambient noise 23 together to the receiver 1107 as a mixed signal 25. The receiver 1107 forwards the received mixed signal 25 over the Internet or a telecommunications connection to the scheduling module 1105 at the remote end 1104. Based on the number of tasks each processing device 1103 is currently handling, the scheduling module 1105 assigns a processing device 1103 to the mixed signal 25. The assigned processing device 1103 processes the mixed signal 25 following the procedure described in the preceding embodiments with reference to FIGS. 5-8, removing the loudspeaker echo and ambient noise from the received mixed signal 25 to obtain the voice signal, generating a control instruction from the voice signal, and then sending the control instruction via the scheduling module 1105, over the Internet or telecommunications connection, to the loudspeaker 1101, thereby controlling the playback volume of the loudspeaker 1101.
When the scheduling module 1105 assigns a processing device 1103 and hands the mixed signal 25 to it for processing, a task can be considered to have started. The control instruction generated from the voice signal is sent to the loudspeaker 1101 via the scheduling module 1105 over the Internet or telecommunications connection, so the arrival of a control instruction at the scheduling module 1105 signals the completion of the corresponding task. If a task has started, i.e. has been assigned to a processing device 1103, but that processing device 1103 has not yet returned a control instruction, the processing device 1103 is considered to be currently processing the task. The scheduling module 1105 can thus assign a processing device 1103 to a new mixed signal based on each processing device's current task count, for example assigning the received mixed signal 25 to the processing device 1103 with the fewest tasks currently in progress.
More specifically, in one embodiment, when assigning a processing device 1103 and sending it the mixed signal 25, a task identifier (ID) may be added to the mixed signal 25, and the processing device 1103 is required to include that task ID in the corresponding control instruction it returns. In this way, after a mixed signal 25 carrying a task ID has been sent to a processing device 1103, whether that processing device 1103 is currently processing the task can be determined by whether a control instruction containing the task ID has been received from it.
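The least-loaded scheduling with task IDs described above can be sketched as follows. The class and method names are assumptions for illustration; the disclosure specifies the behavior, not this API.

```python
import itertools

class Scheduler:
    """Track tasks in flight per processing device; assign to the least loaded."""

    def __init__(self, device_ids):
        self.in_flight = {d: set() for d in device_ids}  # open task IDs per device
        self._ids = itertools.count(1)

    def assign(self, mixed_signal):
        # Pick the device with the fewest tasks currently in progress,
        # start a new task, and tag it with a fresh task ID.
        device = min(self.in_flight, key=lambda d: len(self.in_flight[d]))
        task_id = next(self._ids)
        self.in_flight[device].add(task_id)
        return device, task_id

    def on_control_instruction(self, device, task_id):
        # A returned control instruction carrying the task ID marks the
        # corresponding task as complete.
        self.in_flight[device].discard(task_id)

sched = Scheduler(["dev-A", "dev-B"])
d1, t1 = sched.assign(b"signal-1")      # goes to an idle device
d2, t2 = sched.assign(b"signal-2")      # goes to the other (now least loaded)
sched.on_control_instruction(d1, t1)    # control instruction returns: t1 done
d3, t3 = sched.assign(b"signal-3")      # d1 is idle again, so it gets the task
```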
The benefit of the above embodiment, in which the processing device 1103 for a mixed signal 25 is designated according to each processing device's current task count, is that processing devices 1103 can be assigned flexibly based on their current processing load, which balances load across the processing devices and improves the overall efficiency of resource utilization.
As shown in FIG. 9, an embodiment of this application further provides a method for collecting a voice signal in a far-field pickup device. As described above, the far-field pickup device 1 includes a device body 11 and a microphone pickup unit 12 that are physically separate components. The method includes:
Step 310: the device body 11 generates a synchronization signal that is synchronized with the to-be-played sound signal and occupies a second frequency band different from the first frequency band occupied by the to-be-played sound signal;
Step 320: the device body 11 plays the synchronization signal together with the to-be-played sound signal;
Step 330: the device body 11 receives a signal obtained by the microphone pickup unit 12 collecting the user's voice and the echo, after spatial propagation, of the sound signal played by the device body 11, and digitally converting the collected voice and echo;
Step 340: the device body 11 determines the time delay between the second-frequency-band component in the signal sent back by the microphone pickup unit 12 and the synchronization signal;
Step 350: based on the signal sent back by the microphone pickup unit 12, the device body 11 performs echo cancellation using the to-be-played sound signal delayed by the determined time delay, obtaining the collected voice signal.
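The flow of steps 310-350 can be sketched end to end with a toy simulation: play the superposition, model the return path as a pure delay, estimate that delay from the synchronization component, then subtract the correspondingly delayed reference. The helper names, signal lengths, and the idealized delay-only channel are assumptions for illustration; user voice and noise are omitted.

```python
import numpy as np

def run_pickup_cycle(play_signal, sync_signal, channel_delay):
    # Step 320: play the superposition of content and sync signal.
    played = play_signal + sync_signal

    # Step 330 (idealized): the signal sent back is the played sound
    # delayed by propagation/transmission.
    returned = np.concatenate([np.zeros(channel_delay), played])

    # Step 340: estimate the delay from the sync component via the
    # cross-correlation peak.
    corr = np.correlate(returned, sync_signal, mode="full")
    delay = int(np.argmax(corr)) - (sync_signal.size - 1)

    # Step 350: delay the reference by the estimate and subtract the echo.
    reference = np.concatenate([np.zeros(delay), play_signal + sync_signal])
    voice = returned - reference[: returned.size]
    return delay, voice

rng = np.random.default_rng(2)
sync = rng.standard_normal(512)   # stand-in second-band sync component
play = rng.standard_normal(512)   # stand-in to-be-played content
delay, voice = run_pickup_cycle(play, sync, channel_delay=40)
```

In this idealized setting the residual `voice` is zero once the delay is estimated correctly; in practice the residual is the user's voice plus noise.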
In one embodiment, before step 350, the method further includes:
the device body 11 delaying the to-be-played sound signal by the time delay determined in step 340, for use in echo cancellation.
In one embodiment, before step 340, the method further includes:
the device body 11 filtering the second-frequency-band component out of the signal sent back by the microphone pickup unit 12, for use in determining the time delay.
In one embodiment, before step 350, the method further includes:
the device body 11 converting the sampling frequency of the signal sent back by the microphone pickup unit 12 from the sampling frequency used for playing the to-be-played sound signal to the sampling frequency used for voice recognition, for use in echo cancellation.
In one embodiment, before step 340, the method further includes:
the device body 11 receiving the signal sent back by the microphone pickup unit 12.
In one embodiment, the microphone pickup unit 12 has a transmitter 1204 for sending to the device body 11 the collected user voice and the echo, after spatial propagation, of the sound signal played by the device body. The transmitter 1204 has a standby mode and a transmit mode; in the standby mode the transmitter 1204 does not operate, and when user voice is recognized in the collected sound signal, the transmitter 1204 switches from the standby mode to the transmit mode.
In one embodiment, the microphone pickup unit 12 has a microphone unit 1201, a voice recognition unit 1203, and a buffer unit 1202. The microphone unit 1201 is configured to collect the user's voice and the echo, after spatial propagation, of the sound signal played by the device body; the buffer unit 1202 is configured to buffer the user voice and echo collected by the microphone unit 1201; the voice recognition unit 1203 is configured to recognize human voice in the output of the microphone unit 1201 and, upon recognizing human voice, to trigger the transmitter 1204 to enter the transmit mode and send the user voice and echo buffered in the buffer unit 1202.
In one embodiment, there are multiple microphone units 1201, one of which is connected to the voice recognition unit 1203. When the voice recognition unit 1203 detects human voice in the output of that connected microphone unit 1201, it triggers the transmitter 1204 to enter the transmit mode and send the user voice and echo collected by each microphone unit 1201 and buffered in the buffer unit 1202.
In one embodiment, the user voice and echo collected by each microphone unit 1201 and buffered in the buffer unit 1202 are multiplexed before transmission.
In one embodiment, before step 340, the method includes:
the device body 11 demultiplexing the multiplexed user voice and echo collected by the microphone units 1201.
In one embodiment, step 350 includes: the device body 11 performing echo cancellation separately on the demultiplexed user voice collected by each microphone unit 1201.
In one embodiment, after step 350, the method further includes:
the device body 11 merging the voice signals obtained by separately performing echo cancellation on the demultiplexed user voice collected by each microphone unit 1201, so as to obtain an enhanced voice signal.
In one embodiment, after step 350, the method further includes:
the device body 11 performing voice recognition on the collected voice signal.
In one embodiment, after voice recognition is performed on the collected voice signal, the method further includes:
executing a control action based on a control instruction in the voice recognition result.
In one embodiment, the device body 11 has a receiver 1122 with a standby mode and a receive mode. In the standby mode the receiver 1122 does not operate; when a wireless signal is sensed, the receiver 1122 switches from the standby mode to the receive mode.
In one embodiment, the receiver 1122 has a wireless signal sensing unit 1123; when the wireless signal sensing unit 1123 senses a wireless signal, the receiver 1122 switches from the standby mode to the receive mode.
In one embodiment, the length of time for which the buffer unit 1202 buffers the user voice and echo collected by the microphone unit 1201 is no less than the recognition latency of the voice recognition unit 1203.
In one embodiment, there are multiple microphone units 1201 and the voice recognition unit 1203 is connected to each microphone unit 1201; when the voice recognition unit 1203 detects human voice in the output of any of the connected microphone units 1201, it triggers the transmitter 1204 to enter the transmit mode and send the user voice and echo collected by each microphone unit 1201 and buffered in the buffer unit 1202.
In one embodiment, the second frequency band is an ultrasonic band.
In one embodiment, the sampling frequency of the microphone unit 1201 is the same as the sampling frequency of the to-be-played sound signal and the synchronization signal, and is at least twice the highest frequency of the synchronization signal.
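This is the Nyquist criterion applied to the synchronization band: to capture the sync component without aliasing, the common sampling rate must be at least twice its highest frequency. A minimal check, with illustrative example rates:

```python
def sampling_rate_ok(mic_fs: int, playback_fs: int, sync_max_freq: float) -> bool:
    """True iff the microphone samples at the playback rate and that rate
    satisfies the Nyquist criterion for the synchronization signal."""
    return mic_fs == playback_fs and mic_fs >= 2 * sync_max_freq

# 48 kHz covers a sync band topping out at 23 kHz (needs >= 46 kHz) ...
ok = sampling_rate_ok(mic_fs=48_000, playback_fs=48_000, sync_max_freq=23_000)
# ... but 44.1 kHz does not.
too_slow = sampling_rate_ok(mic_fs=44_100, playback_fs=44_100, sync_max_freq=23_000)
```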
In one embodiment, the synchronization signal is a carrier-modulated pseudo-random sequence.
In one embodiment, the pseudo-random sequence is selected based on at least one of the following:
the autocorrelation function of the pseudo-random sequence;
the period of the pseudo-random sequence;
the spectral width of the autocorrelation function of the pseudo-random sequence;
and the carrier frequency is selected based on its distance from 20 kHz.
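One common way to obtain such a sequence is a maximal-length sequence from a linear-feedback shift register, whose sharp periodic autocorrelation suits delay estimation, modulated onto an ultrasonic carrier (BPSK). The register taps, chip rate, carrier frequency, and sampling rate below are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

def lfsr_msequence(taps, nbits):
    """One period of a maximal-length sequence from a Fibonacci LFSR (bits 0/1)."""
    state = [1] * nbits
    seq = []
    for _ in range(2 ** nbits - 1):
        seq.append(state[-1])
        fb = 0
        for t in taps:
            fb ^= state[t - 1]
        state = [fb] + state[:-1]
    return np.array(seq)

fs = 48_000        # assumed sampling rate
fc = 22_000        # assumed ultrasonic carrier, chosen above 20 kHz
chip_rate = 2_000  # assumed chips per second

# x^7 + x^6 + 1 is primitive over GF(2), giving period 2^7 - 1 = 127.
chips = 2 * lfsr_msequence(taps=(7, 6), nbits=7) - 1   # map bits to +/-1 chips

samples_per_chip = fs // chip_rate
baseband = np.repeat(chips, samples_per_chip).astype(float)
t = np.arange(baseband.size) / fs
sync_signal = baseband * np.cos(2 * np.pi * fc * t)    # BPSK onto the carrier
```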
In one embodiment, before step 320, the method further includes:
the device body 11 buffering the to-be-played sound signal for use in echo cancellation, and buffering the synchronization signal for use in determining the time delay.
In one embodiment, the duration for which the synchronization signal or the to-be-played sound signal is buffered is at least the sum of the propagation time of the played sound signal to the microphone pickup unit 12 and the transmission time of the sound signal output by the microphone pickup unit 12 back to the device body.
In one embodiment, step 340 includes:
determining the time corresponding to the maximum of the cross-correlation function between the second-frequency-band component in the signal sent back by the microphone pickup unit 12 and the synchronization signal; and
taking the sum of the determined time and the delay introduced by filtering as the time delay.
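The cross-correlation peak search, plus the correction for the filter's own delay, can be sketched numerically. The sampling rate, the white-noise stand-in for the band-pass sync component, and the filter group delay below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sync = rng.standard_normal(1024)   # stand-in for the buffered sync signal

# Simulate the returned second-band component: the sync signal delayed by
# the acoustic + transmission path, plus a little noise.
true_delay = 300
received = np.concatenate([np.zeros(true_delay), sync])
received = received + 0.01 * rng.standard_normal(received.size)

# The lag of the cross-correlation peak estimates the round-trip delay.
corr = np.correlate(received, sync, mode="full")
lag = int(np.argmax(corr)) - (sync.size - 1)

# Add the known delay introduced by the band-pass filter (assumed value).
filter_group_delay = 16
total_delay = lag + filter_group_delay
```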
In one embodiment, when the microphone pickup unit 12 has multiple microphone units 1201, step 340 includes: determining the respective time delays between the synchronization signal and each second-frequency-band component in the signals collected by the multiple microphone units 1201 and sent back to the device body 11 by the microphone pickup unit 12; and step 350 includes: performing echo cancellation on the signal collected by each corresponding microphone unit 1201 and sent back to the device body 11 by the microphone pickup unit 12, using the to-be-played sound signal delayed by the respective determined time delay.
In one embodiment, when the microphone pickup unit 12 has multiple microphone units 1201, step 340 includes: determining the time delay between the synchronization signal and the second-frequency-band component in the signal collected by one of the multiple microphone units 1201 and sent back to the device body 11 by the microphone pickup unit 12; and step 350 includes: performing echo cancellation on the signals collected by all microphone units 1201 and sent back to the device body 11 by the microphone pickup unit 12, using the to-be-played sound signal delayed by that determined time delay.
In one embodiment, there are multiple microphone pickup units 12 and only one device body 11, and the device body 11 has multiple receivers 1122, each corresponding to one of the microphone pickup units 12. Step 340 includes: determining, for each receiver 1122, the time delay between the second-frequency-band component in the signal received by that receiver 1122 and the synchronization signal; step 350 includes: based on the signal received by each receiver 1122, performing echo cancellation using the to-be-played sound signal delayed by the corresponding determined time delay, to obtain a collected voice signal for each receiver.
In one embodiment, the microphone pickup units 12 have distinct pickup unit identifiers, and the signal sent by each microphone pickup unit 12 carries its identifier. After receiving signals, the receiver 1122 retains those received signals carrying the pickup unit identifier corresponding to the receiver 1122, and discards those that do not carry the identifier corresponding to the receiver 1122.
In one embodiment, there are multiple microphone pickup units 12, only one device body 11, and only one receiver 1122 in the device body 11. If, after receiving the signal sent by one microphone pickup unit 12, the receiver 1122 then receives signals sent by other microphone pickup units 12, it discards the signals sent by the other microphone pickup units 12. Step 340 includes: determining the time delay between the second-frequency-band component in the signal sent by the one received microphone pickup unit 12 and the synchronization signal; step 350 includes: based on the signal sent by the one received microphone pickup unit 12, performing echo cancellation using the to-be-played sound signal delayed by the determined time delay, to obtain the collected voice signal.
In one embodiment, the microphone pickup units 12 have distinct pickup unit identifiers, and the signal sent by each microphone pickup unit 12 carries its identifier; if, after receiving a signal bearing one pickup unit identifier, the receiver 1122 then receives signals bearing other pickup unit identifiers, it discards the signals bearing the other identifiers.
In one embodiment, the device body 11 includes a loudspeaker 1101 and a receiver 1107 located locally and a processing device 1103 located remotely, and steps 310, 340, and 350 are all performed by the processing device 1103.
In one embodiment, the remote processing device 1103 communicates with the loudspeakers 1101 and receivers 1107 at multiple sites, and processes the signals sent by the microphone pickup units 12 at those sites to the receivers 1107.
In one embodiment, the receiver 1122 communicates with multiple remote processing devices 1103 through a scheduling module 1105; when the receiver 1122 receives a signal sent by the local microphone pickup unit 12, the scheduling module 1105 designates the processing device 1103 that will process the signal received by the receiver 1122.
In one embodiment, the scheduling module 1105 designates the processing device 1103 that will process the signal received by the receiver 1122 based on the number of tasks each processing device 1103 is currently processing, where a task is the process of handling the signal sent by a receiver 1122, as assigned by the scheduling module 1105, to obtain the collected voice signal.
Since the details of the method embodiment of FIG. 9 have already been covered in the detailed description of the far-field pickup device 1 above, that description applies fully to this part and is not repeated here.
Moreover, although the steps of the methods in this application are depicted in the accompanying drawings in a particular order, this neither requires nor implies that the steps must be performed in that particular order, or that all of the illustrated steps must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
FIG. 10 shows an apparatus for collecting a voice signal in a far-field pickup device provided by an embodiment of this application. As shown in FIG. 10, the apparatus includes:
a synchronization signal generating module 400, configured to generate a synchronization signal synchronized with the to-be-played sound signal and occupying a second frequency band different from the first frequency band occupied by the to-be-played sound signal;
a playback module 401, configured to play the synchronization signal together with the to-be-played sound signal;
a receiving module 402, configured to receive a signal obtained by the microphone pickup unit collecting the user's voice and the echo, after spatial propagation, of the sound signal played by the device body, and digitally converting the collected voice and echo;
a determining module 403, configured to determine the time delay between the second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal;
a signal acquisition module 404, configured to perform, based on the signal sent back by the microphone pickup unit, echo cancellation using the to-be-played sound signal delayed by the determined time delay, to obtain the collected voice signal.
FIG. 11 shows an electronic device for implementing the method for collecting a voice signal in a far-field pickup device provided by an embodiment of this application. As shown in FIG. 11, the electronic device includes at least one processor 501 and at least one memory 502, where the memory 502 stores computer-readable program instructions that, when executed by the processor 501, cause the processor 501 to perform the steps of the method for collecting a voice signal in a far-field pickup device described in the foregoing embodiments.
In this embodiment of the application, the memory 502 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, the applications required by at least one function, and so on, and the data storage area may store data created during execution of the method. The processor 501 may be a central processing unit (CPU), a digital processing unit, or the like.
In FIG. 11, the memory 502 and the processor 501 are connected by a bus 503, represented by a thick line in FIG. 11. The bus 503 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in FIG. 11, but this does not mean that there is only one bus or only one type of bus.
The memory 502 may be a volatile memory, such as random-access memory (RAM); the memory 502 may also be a non-volatile memory, such as read-only memory, flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 502 may be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 502 may also be a combination of the above memories.
An embodiment of this application further provides a computer-readable storage medium storing computer-readable program instructions executable by an electronic device; when the computer-readable program instructions run on the electronic device, they cause the electronic device to perform the steps of the method for collecting a voice signal in a far-field pickup device described in the foregoing embodiments.
Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow its general principles and include common general knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this application being indicated by the appended claims.

Claims (22)

  1. A far-field pickup device, comprising a device body and a microphone pickup unit that are physically separate components, the microphone pickup unit collecting a user's voice and an echo, after spatial propagation, of a sound signal played by the device body, and sending back to the device body a signal obtained by digitally converting the collected user voice and echo,
    the device body comprising:
    a playback signal source, configured to generate a to-be-played sound signal;
    a synchronization signal generator, configured to generate a synchronization signal synchronized with the to-be-played sound signal and occupying a second frequency band different from a first frequency band occupied by the to-be-played sound signal;
    a loudspeaker, configured to play a superposition of the to-be-played sound signal and the synchronization signal as the sound signal played by the device body;
    a delay determining unit, configured to determine a time delay between a second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal; and
    an echo cancellation unit, configured to perform, based on the signal sent back by the microphone pickup unit, echo cancellation using the to-be-played sound signal delayed by the determined time delay, to obtain a collected voice signal.
  2. The far-field pickup device according to claim 1, wherein the device body further comprises:
    a delay unit, configured to delay the to-be-played sound signal by the time delay determined by the delay determining unit, for the echo cancellation unit to perform echo cancellation.
  3. The far-field pickup device according to claim 1, wherein the device body further comprises:
    a filter, configured to filter the second-frequency-band component out of the signal sent back by the microphone pickup unit, for the delay determining unit to determine the time delay.
  4. The far-field pickup device according to claim 1, wherein the device body further comprises:
    a downsampler, configured to convert the sampling frequency of the signal sent back by the microphone pickup unit to a sampling frequency used for voice recognition, and to output the result for the echo cancellation unit to perform echo cancellation.
  5. The far-field pickup device according to claim 1, wherein the device body further comprises:
    a receiver, configured to receive the signal sent back by the microphone pickup unit.
  6. The far-field pickup device according to claim 5, wherein the microphone pickup unit has a transmitter for sending to the device body the collected user voice and the echo, after spatial propagation, of the sound signal played by the device body, the transmitter having a standby mode and a transmit mode, wherein in the standby mode the transmitter does not operate, and when user voice is recognized in the collected signal, the transmitter switches from the standby mode to the transmit mode.
  7. The far-field pickup device according to claim 6, wherein the microphone pickup unit has a microphone unit, a voice recognition unit, and a buffer unit, wherein:
    the microphone unit is configured to collect the user's voice and the echo, after spatial propagation, of the sound signal played by the device body;
    the buffer unit is configured to buffer the user voice and the echo collected by the microphone unit; and
    the voice recognition unit is configured to recognize human voice in the output of the microphone unit and, upon recognizing human voice, to trigger the transmitter to enter the transmit mode, whereupon the transmitter sends the user voice and the echo buffered in the buffer unit to the device body.
  8. The far-field pickup device according to claim 7, wherein there are a plurality of microphone units, one of which is connected to the voice recognition unit; when the voice recognition unit detects human voice in the output of the connected microphone unit, it triggers the transmitter to enter the transmit mode, and the transmitter sends to the device body the user voice and the echo collected by each microphone unit and buffered in the buffer unit.
  9. The far-field pickup device according to claim 8, wherein the echo cancellation unit performs echo cancellation separately on the user voice collected by each microphone unit and received by the receiver, and
    the device body further comprises: a voice enhancement unit, configured to merge the voice signals obtained by the echo cancellation unit separately performing echo cancellation on the user voice collected by each microphone unit and received by the receiver, to obtain an enhanced voice signal.
  10. The far-field pickup device according to claim 5, wherein the receiver has a standby mode and a receive mode; in the standby mode the receiver does not operate, and when a wireless signal is sensed, the receiver switches from the standby mode to the receive mode.
  11. The far-field pickup device according to claim 1, wherein the second frequency band is an ultrasonic band, and the synchronization signal is a carrier-modulated pseudo-random sequence.
  12. The far-field pickup device according to claim 1, wherein the device body further comprises:
    a first buffer, configured to buffer the to-be-played sound signal generated by the playback signal source; and
    a second buffer, configured to buffer the synchronization signal generated by the synchronization signal generator;
    wherein the synchronization signal used by the delay determining unit in determining the time delay is taken from the second buffer; and
    the to-be-played sound signal used by the echo cancellation unit in performing echo cancellation is taken from the first buffer.
  13. The far-field pickup device according to claim 3, wherein the delay determining unit is further configured to:
    determine the time corresponding to the maximum of the cross-correlation function between the second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal; and
    take the sum of the determined time and the delay introduced by the filter as the time delay.
  14. The far-field pickup device according to claim 5, wherein the device body comprises a loudspeaker and a receiver located locally and a processing device located remotely, and the playback signal source, the synchronization signal generator, the delay determining unit, and the echo cancellation unit are all in the remote processing device.
  15. A method for collecting a voice signal in a far-field pickup device, wherein the far-field pickup device comprises a device body and a microphone pickup unit that are physically separate components, the method comprising:
    generating, by the device body, a synchronization signal synchronized with a to-be-played sound signal and occupying a second frequency band different from a first frequency band occupied by the to-be-played sound signal;
    playing, by the device body, the synchronization signal together with the to-be-played sound signal;
    receiving, by the device body, a signal obtained by the microphone pickup unit collecting the user's voice and the echo, after spatial propagation, of the sound signal played by the device body, and digitally converting the collected voice and echo;
    determining, by the device body, a time delay between a second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal; and
    performing, by the device body based on the signal sent back by the microphone pickup unit, echo cancellation using the to-be-played sound signal delayed by the determined time delay, to obtain a collected voice signal.
  16. The method according to claim 15, further comprising, before the echo cancellation:
    filtering, by the device body, the second-frequency-band component out of the signal sent back by the microphone pickup unit, for use in determining the time delay.
  17. The method according to claim 15, further comprising, before the echo cancellation:
    converting, by the device body, the sampling frequency of the signal sent back by the microphone pickup unit from the sampling frequency used for playing the to-be-played sound signal to the sampling frequency used for voice recognition, for use in the echo cancellation.
  18. The method according to claim 15, wherein determining, by the device body, the time delay between the second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal comprises:
    determining, by the device body, the time corresponding to the maximum of the cross-correlation function between the second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal; and
    taking the sum of the determined time and the delay introduced by filtering as the time delay.
  19. The method according to claim 15, wherein the microphone pickup unit comprises a plurality of microphone units for collecting user voice, and performing the echo cancellation to obtain the collected voice signal comprises:
    performing, by the device body, echo cancellation separately on the received user voice collected by each microphone unit, and merging the voice signals obtained by the separate echo cancellation to obtain an enhanced voice signal.
  20. An apparatus for collecting a voice signal in a far-field pickup device, comprising:
    a synchronization signal generating module, configured to generate a synchronization signal synchronized with a to-be-played sound signal and occupying a second frequency band different from a first frequency band occupied by the to-be-played sound signal;
    a playback module, configured to play the synchronization signal together with the to-be-played sound signal;
    a receiving module, configured to receive a signal obtained by the microphone pickup unit collecting the user's voice and the echo, after spatial propagation, of the sound signal played by the device body, and digitally converting the collected voice and echo;
    a determining module, configured to determine a time delay between a second-frequency-band component in the signal sent back by the microphone pickup unit and the synchronization signal; and
    a signal acquisition module, configured to perform, based on the signal sent back by the microphone pickup unit, echo cancellation using the to-be-played sound signal delayed by the determined time delay, to obtain a collected voice signal.
  21. An electronic device, comprising at least one processing unit and at least one storage unit, wherein the storage unit stores computer-readable program instructions that, when executed by the processing unit, cause the processing unit to perform the steps of the method according to any one of claims 15 to 19.
  22. A computer-readable storage medium storing computer-readable program instructions executable by an electronic device, wherein when the computer-readable program instructions run on the electronic device, they cause the electronic device to perform the steps of the method according to any one of claims 15 to 19.
PCT/CN2019/108166 2018-09-29 2019-09-26 Far-field pickup device and method for collecting voice signal in far-field pickup device WO2020063752A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19866344.5A EP3860144A4 (en) 2018-09-29 2019-09-26 FAR FIELD SOUND RECORDING DEVICE AND VOICE SIGNAL DETECTION METHOD IMPLEMENTED THEREIN
US17/032,278 US11871176B2 (en) 2018-09-29 2020-09-25 Far-field pickup device and method for collecting voice signal in far-field pickup device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811150947.5 2018-09-29
CN201811150947.5A CN110166882B (zh) 2018-09-29 2019-08-23 Far-field pickup device and method for collecting voice signal in far-field pickup device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/032,278 Continuation US11871176B2 (en) 2018-09-29 2020-09-25 Far-field pickup device and method for collecting voice signal in far-field pickup device

Publications (1)

Publication Number Publication Date
WO2020063752A1 true WO2020063752A1 (zh) 2020-04-02

Family

ID=67645094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108166 WO2020063752A1 (zh) 2019-09-26 Far-field pickup device and method for collecting voice signal in far-field pickup device

Country Status (4)

Country Link
US (1) US11871176B2 (zh)
EP (1) EP3860144A4 (zh)
CN (1) CN110166882B (zh)
WO (1) WO2020063752A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112203180A (zh) * 2020-09-24 2021-01-08 安徽文香信息技术有限公司 Adaptive volume adjustment system and method for smart-classroom loudspeaker headsets
CN113709607A (zh) * 2021-08-20 2021-11-26 深圳市京华信息技术有限公司 Acoustic signal propagation method, system, and computer-readable storage medium
CN113965801A (zh) * 2021-10-11 2022-01-21 Oppo广东移动通信有限公司 Playback control method and apparatus, and electronic device
WO2024098279A1 (en) * 2022-11-09 2024-05-16 Qualcomm Incorporated Automated echo control

Families Citing this family (11)

Publication number Priority date Publication date Assignee Title
CN110166882B (zh) 2018-09-29 2021-05-25 腾讯科技(深圳)有限公司 Far-field pickup device and method for collecting voice signal in far-field pickup device
CN110691301A (zh) * 2019-09-25 2020-01-14 晶晨半导体(深圳)有限公司 Method for testing the delay time between a far-field voice device and an external loudspeaker
CN111048096B (zh) * 2019-12-24 2022-07-26 大众问问(北京)信息科技有限公司 Voice signal processing method, apparatus, and terminal
CN111048118B (zh) * 2019-12-24 2022-07-26 大众问问(北京)信息科技有限公司 Voice signal processing method, apparatus, and terminal
CN111402868B (zh) * 2020-03-17 2023-10-24 阿波罗智联(北京)科技有限公司 Speech recognition method and apparatus, electronic device, and computer-readable storage medium
CN111464868A (zh) * 2020-03-31 2020-07-28 深圳Tcl数字技术有限公司 Display terminal shutdown control method, apparatus, device, and readable storage medium
CN111736797B (zh) * 2020-05-21 2024-04-05 阿波罗智联(北京)科技有限公司 Method and apparatus for detecting negative delay time, electronic device, and storage medium
CN112073872B (zh) * 2020-07-31 2022-03-11 深圳市沃特沃德信息有限公司 Long-distance sound amplification method, apparatus, system, storage medium, and smart device
CN111883168B (zh) * 2020-08-04 2023-12-22 上海明略人工智能(集团)有限公司 Voice processing method and apparatus
WO2023155607A1 (zh) * 2022-02-17 2023-08-24 海信视像科技股份有限公司 Terminal device and voice wake-up method
US11971845B2 (en) * 2022-06-16 2024-04-30 Bae Systems Information And Electronic Systems Integration Inc. DSP encapsulation

Citations (5)

Publication number Priority date Publication date Assignee Title
US20100107856A1 (en) * 2008-11-03 2010-05-06 Qnx Software Systems (Wavemakers), Inc. Karaoke system
CN104157293A (zh) * 2014-08-28 2014-11-19 福建师范大学福清分校 Signal processing method for enhancing target speech signal pickup in an acoustic environment
CN105304093A (zh) * 2015-11-10 2016-02-03 百度在线网络技术(北京)有限公司 Signal front-end processing method and apparatus for speech recognition
CN107454508A (zh) * 2017-08-23 2017-12-08 深圳创维-Rgb电子有限公司 Television with microphone array and television system
CN110166882A (zh) * 2018-09-29 2019-08-23 腾讯科技(深圳)有限公司 Far-field pickup device and method for collecting voice signal in far-field pickup device

Family Cites Families (13)

Publication number Priority date Publication date Assignee Title
US7366659B2 (en) * 2002-06-07 2008-04-29 Lucent Technologies Inc. Methods and devices for selectively generating time-scaled sound signals
US20080317241A1 (en) * 2006-06-14 2008-12-25 Derek Wang Code-based echo cancellation
JP2008259032A (ja) * 2007-04-06 2008-10-23 Toshiba Corp Information processing apparatus and program
CN103516921A (zh) * 2012-06-28 2014-01-15 杜比实验室特许公司 Echo control through hidden audio signals
CN104412618B (zh) * 2012-07-03 2017-11-21 索诺瓦公司 Method for a hearing aid
US9668053B1 (en) * 2013-03-12 2017-05-30 Chien Luen Industries Co., Ltd., Inc. Bluetooth landscape/pathway lights
US9762742B2 (en) * 2014-07-24 2017-09-12 Conexant Systems, Llc Robust acoustic echo cancellation for loosely paired devices based on semi-blind multichannel demixing
GB201518004D0 (en) * 2015-10-12 2015-11-25 Microsoft Technology Licensing Llc Audio signal processing
CN106210371B (zh) * 2016-08-31 2018-09-18 广州视源电子科技股份有限公司 Echo delay determination method and apparatus, and intelligent conference device
US10546581B1 (en) * 2017-09-08 2020-01-28 Amazon Technologies, Inc. Synchronization of inbound and outbound audio in a heterogeneous echo cancellation system
CN107610713B (zh) * 2017-10-23 2022-02-01 科大讯飞股份有限公司 Echo cancellation method and apparatus based on delay estimation
TWI665661B (zh) * 2018-02-14 2019-07-11 美律實業股份有限公司 Audio processing apparatus and audio processing method
US10943599B2 (en) * 2018-10-26 2021-03-09 Spotify Ab Audio cancellation for voice recognition


Non-Patent Citations (1)

Title
See also references of EP3860144A4


Also Published As

Publication number Publication date
US20210021925A1 (en) 2021-01-21
EP3860144A4 (en) 2022-04-13
CN110166882A (zh) 2019-08-23
EP3860144A1 (en) 2021-08-04
CN110166882B (zh) 2021-05-25
US11871176B2 (en) 2024-01-09

Similar Documents

Publication Publication Date Title
WO2020063752A1 (zh) Far-field pickup device and method for collecting voice signal in far-field pickup device
CN109192203B (zh) Multi-sound-zone speech recognition method, apparatus, and storage medium
KR102015745B1 (ko) Personalized real-time audio processing
JP2019518985A (ja) Processing speech from distributed microphones
JP5003531B2 (ja) Audio conference system
CN108962240A (zh) Earphone-based voice control method and system
CN101751918B (zh) Silencing device and silencing method
CN114727212B (zh) Audio processing method and electronic device
CN113411726A (zh) Audio processing method, apparatus, and system
JPWO2009075085A1 (ja) Sound pickup device, sound pickup method, sound pickup program, and integrated circuit
US20230171337A1 (en) Recording method of true wireless stereo earbuds and recording system
CN110187859A (zh) Denoising method and electronic device
CN110024418A (zh) Sound enhancement device, sound enhancement method, and sound processing program
CN111741394A (zh) Data processing method, apparatus, and readable medium
CN109600677A (zh) Data transmission method and apparatus, storage medium, and electronic device
US11749293B2 (en) Audio signal processing device
CN116795753A (zh) Method for transmission processing of audio data, and electronic device
CN113223544B (zh) Audio direction localization detection apparatus and method, and audio processing system
CN114220454B (zh) Audio noise reduction method, medium, and electronic device
CN113270082A (zh) Vehicle-mounted KTV control method and apparatus, and vehicle-mounted intelligent connected terminal
CN207977110U (zh) Electronic device mainboard and electronic device
WO2024058147A1 (ja) Processing device, output device, and processing system
JP2010221945A (ja) Signal processing method, apparatus, and program
CN215818639U (zh) Delay compensation system
CN110121744A (zh) Processing speech from distributed microphones

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19866344; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
WWE Wipo information: entry into national phase (Ref document number: 2019866344; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2019866344; Country of ref document: EP; Effective date: 20210429)