TW202314684A - Processing of audio signals from multiple microphones - Google Patents

Processing of audio signals from multiple microphones

Info

Publication number: TW202314684A
Authority: TW (Taiwan)
Prior art keywords: audio, processors, sound, audio signals, event
Application number: TW111127948A
Other languages: Chinese (zh)
Inventor
艾里克 維瑟
法特梅 薩基
郭銀怡
金來勳
羅吉歐古迪斯 艾維斯
漢內斯 派森泰納
Original Assignee
美商高通公司 (Qualcomm Incorporated)
Priority claimed from US Application No. 17/814,660 (published as US20230036986A1)
Application filed by Qualcomm Incorporated (美商高通公司)
Publication of TW202314684A

Classifications

    All of the following fall under H (Electricity), H04 (Electric communication technique), H04R (Loudspeakers, microphones, gramophone pick-ups or like acoustic electromechanical transducers; deaf-aid sets; public address systems):
    • H04R 3/005: Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H04R 1/1041: Earpieces, earphones, monophonic headphones; mechanical or electronic switches, or control elements
    • H04R 1/1083: Earpieces, earphones, monophonic headphones; reduction of ambient noise
    • H04R 1/406: Arrangements for obtaining a desired directional characteristic only, by combining a number of identical transducers; microphones
    • H04R 2430/20: Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic
    • H04R 2499/11: Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDAs, cameras
    • H04R 2499/13: Acoustic transducers and sound field adaptation in vehicles

Abstract

A first device includes a memory configured to store instructions and one or more processors configured to receive audio signals from multiple microphones. The one or more processors are configured to process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals. The one or more processors are also configured to send, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.

Description

Processing of audio signals from multiple microphones

Cross-Reference to Related Applications

This patent application claims priority to Provisional Patent Application No. 63/203,562, entitled "DIRECTIONAL AUDIO SIGNAL PROCESSING," filed July 27, 2021, the contents of which are incorporated herein by reference in their entirety.

The present disclosure generally relates to audio signal processing.

Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless telephones (such as mobile phones and smartphones), tablet computers, and laptop computers, that are small, lightweight, and easily carried by users. These devices can communicate voice and data packets over wireless networks. In addition, many such devices incorporate additional functionality, such as a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such devices can process executable instructions, including software applications (such as a web browser application) that can be used to access the Internet. As such, these devices can include significant computing capabilities.

Devices such as mobile phones and smartphones can be paired with a headset to allow a user to listen to audio without holding the phone to an ear. One disadvantage of wearing a headset is that the user may not be aware of the surrounding environment. As a non-limiting example, if the user is walking across an intersection, the user may not be able to hear an approaching vehicle. In scenarios in which the user's focus is elsewhere (for example, on the user's mobile phone, or facing away from an approaching vehicle), the user may be unable to determine that a vehicle is approaching or the direction from which the vehicle is approaching.

According to one implementation of the present disclosure, a first device includes a memory configured to store instructions and one or more processors. The one or more processors are configured to receive audio signals from multiple microphones. The one or more processors are also configured to process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals. The one or more processors are further configured to send, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.

According to another implementation of the present disclosure, a method of processing audio includes receiving, at one or more processors of a first device, audio signals from multiple microphones. The method also includes processing the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals. The method further includes sending, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.

According to another implementation of the present disclosure, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors of a first device, cause the one or more processors to receive audio signals from multiple microphones. The instructions, when executed by the one or more processors, also cause the one or more processors to process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals. The instructions, when executed by the one or more processors, further cause the one or more processors to send, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.

According to another implementation of the present disclosure, a first device includes means for receiving audio signals from multiple microphones. The first device also includes means for processing the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals. The first device further includes means for sending, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.

Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and Claims.

Systems and methods for performing directional audio signal processing are disclosed. A first device, such as a headset, may include multiple microphones configured to capture sounds in the surrounding environment. Each microphone may have a different orientation and position on the first device, such as to capture sounds from different directions. In response to capturing a sound, each microphone may generate a corresponding audio signal that is provided to a directional audio signal processing unit. The directional audio signal processing unit may process the audio signals from the microphones to identify distinct audio events associated with the sound and the location of each audio event. In some implementations, the audio signals associated with an audio event are processed via one or more classifiers at the first device to identify an audio class of the audio event. As a non-limiting example, if at least one of the microphones captures a car sound, the directional audio signal processing unit may identify the car sound based on characteristics (e.g., pitch, frequency, etc.) associated with the corresponding audio signal and may identify the relative direction of the car sound based on which microphones captured the sound.
In response to identifying the car sound and the corresponding relative direction, the first device may generate data indicative of the sound and the direction and may provide that data to a second device, such as a mobile phone. In some examples, the data indicative of the sound may include an audio class or embedding associated with the sound source and direction-of-arrival information. The second device may use the data (e.g., the direction information) to perform additional operations. As a non-limiting example, the second device may determine whether to generate a visual alert or a physical alert to warn the user of the headset about the nearby vehicle.

According to some aspects, distributed audio processing is performed using a first device, such as a headset device, to capture sound using multiple microphones and to perform preliminary processing of the audio corresponding to the captured sound. For example, as illustrative, non-limiting examples, the first device may perform direction-of-arrival processing to localize one or more sound sources, acoustic environment processing to detect, based on ambient sounds, the environment of the first device or changes in that environment, audio event processing to identify a sound corresponding to an audio event, or a combination thereof.

Because the first device may be relatively constrained in terms of processing resources, memory capacity, battery life, etc., the first device may send information regarding the audio processing to a second device (such as a mobile phone) that has greater computing, memory, and power resources. For example, in some implementations, the first device sends, to the second device, a representation of the audio data and a classification of an audio event detected in the audio data, and the second device performs additional processing to verify the classification of the audio event. According to some aspects, the second device uses the information provided by the first device, such as the direction information and the classification associated with a sound event, as additional inputs to a classifier that processes the audio data. Performing classification of the audio data in conjunction with the direction information, the classification from the first device, or both, may improve the accuracy, the speed, or one or more other aspects of the classifier at the second device.
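As a concrete illustration of how the second device might fold the first device's outputs (direction-of-arrival information and a provisional class) into the input of its own classifier, the sketch below concatenates an audio embedding with direction features and a one-hot provisional class. The function name, feature layout, and dimensions are illustrative assumptions, not details taken from the disclosure.

```python
# Hypothetical sketch: combine an audio embedding with DOA and a
# provisional class from the first device into one classifier input.
import math

def build_classifier_input(audio_embedding: list[float],
                           doa_azimuth_deg: float,
                           provisional_class_id: int,
                           num_classes: int) -> list[float]:
    """Concatenate audio features with DOA and a one-hot provisional class."""
    az = math.radians(doa_azimuth_deg)
    doa_feats = [math.cos(az), math.sin(az)]  # continuous, wrap-around safe
    one_hot = [1.0 if i == provisional_class_id else 0.0
               for i in range(num_classes)]
    return audio_embedding + doa_feats + one_hot

# Example: a 3-dim embedding, a source at 90 degrees, provisional class 2 of 4.
x = build_classifier_input([0.2, -0.5, 0.8], doa_azimuth_deg=90.0,
                           provisional_class_id=2, num_classes=4)
print(len(x))  # prints 9 (3 embedding + 2 DOA + 4 one-hot)
```

Encoding the azimuth as (cos, sin) rather than a raw angle avoids a discontinuity at the 0/360 degree boundary, which tends to help downstream classifiers.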

Such distributed audio processing enables the user of the first device to benefit from the enhanced processing capabilities of the second device, such as by providing accurate detection of sound events occurring in the vicinity of the user and enabling the first device to alert the user to detected events. For example, the first device may automatically transition from a playback mode (e.g., playing music or other audio to the user) to a transparency mode in which sounds corresponding to a detected audio event are played to the user. Other benefits and examples of applications in which the disclosed techniques may be used are described in more detail below with reference to the accompanying figures.

Particular aspects of the present disclosure are described below with reference to the drawings. In the description, common features are designated by common reference numbers. As used herein, various terminology is used for the purpose of describing particular implementations only and is not intended to be limiting of implementations. For example, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Further, some features described herein are singular in some implementations and plural in other implementations. To illustrate, FIG. 1 depicts a device 110 including one or more processors ("processor(s)" 116 of FIG. 1), which indicates that, in some implementations, the device 110 includes a single processor 116 and, in other implementations, the device 110 includes multiple processors 116. For ease of reference herein, such features are generally introduced as "one or more" features and are subsequently referred to in the singular unless aspects related to multiple of the features are being described.

It may also be understood that the terms "comprise," "comprises," and "comprising" may be used interchangeably with "include," "includes," or "including." Additionally, it will be understood that the term "wherein" may be used interchangeably with "where." As used herein, "exemplary" may indicate an example, an implementation, and/or an aspect, and should not be construed as limiting or as indicating a preference or a preferred implementation. As used herein, an ordinal term (e.g., "first," "second," "third," etc.) used to modify an element, such as a structure, a component, an operation, etc., does not by itself indicate any priority or order of the element with respect to another element, but rather merely distinguishes the element from another element having a same name (but for use of the ordinal term). As used herein, the term "set" refers to one or more of a particular element, and the term "plurality" refers to multiple (e.g., two or more) of a particular element.

As used herein, "coupled" may include "communicatively coupled," "electrically coupled," or "physically coupled," and may also (or alternatively) include any combinations thereof. Two devices (or components) may be coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) directly or indirectly via one or more other devices, components, wires, buses, networks (e.g., a wired network, a wireless network, or a combination thereof), etc. Two devices (or components) that are electrically coupled may be included in the same device or in different devices and may be connected via electronics, one or more connectors, or inductive coupling, as illustrative, non-limiting examples. In some implementations, two devices (or components) that are communicatively coupled, such as in electrical communication, may send and receive signals (e.g., digital signals or analog signals) directly or indirectly, via one or more wires, buses, networks, etc. As used herein, "directly coupled" may include two devices that are coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) without intervening components.

In the present disclosure, terms such as "determining," "calculating," "estimating," "shifting," "adjusting," etc. may be used to describe how one or more operations are performed. It should be noted that such terms are not to be construed as limiting, and other techniques may be utilized to perform similar operations. Additionally, as referred to herein, "generating," "calculating," "estimating," "using," "selecting," "accessing," and "determining" may be used interchangeably. For example, "generating," "calculating," "estimating," or "determining" a parameter (or a signal) may refer to actively generating, estimating, calculating, or determining the parameter (or the signal), or may refer to using, selecting, or accessing the parameter (or signal) that is already generated, such as by another component or device.

Referring to FIG. 1, a particular illustrative aspect of a system configured to perform directional processing on multiple audio signals received from multiple microphones is disclosed and generally designated 100. The system 100 includes a first microphone 102 and a second microphone 104, each coupled to or integrated in a device 110. The system 100 also includes a third microphone 106 and a fourth microphone 108 coupled to or integrated in a device 120. Although two microphones 102, 104 are shown coupled to or integrated in the device 110, and two microphones 106, 108 are shown coupled to or integrated in the device 120, in other implementations, the device 110, the device 120, or both, may be coupled to any number of additional microphones. As a non-limiting example, four (4) microphones may be coupled to the device 110, and another four (4) microphones may be coupled to the device 120. In some implementations, the microphones 102, 104, 106, and 108 are implemented as directional microphones. In other implementations, one or more (or all) of the microphones 102, 104, 106, and 108 are implemented as omnidirectional microphones.

According to one implementation, the device 110 corresponds to a headset, and the device 120 corresponds to a mobile phone. In some scenarios, the device 110 may be paired with the device 120 using a wireless connection, such as a Bluetooth® (a registered trademark of Bluetooth SIG, Inc., Washington) connection. For example, the device 110 may communicate with the device 120 using a low-energy protocol, such as the Bluetooth® Low Energy (BLE) protocol. In other examples, the wireless connection corresponds to signal transmission and reception in accordance with an IEEE 802.11-type (e.g., WiFi) wireless local area network or one or more other wireless radio frequency (RF) communication protocols.

The first microphone 102 is configured to capture sound 182 from one or more sources 180. In the illustrative example of FIG. 1, the source 180 corresponds to a vehicle, such as a car. Thus, if the device 110 corresponds to a headset, the microphones 102, 104 may be used to capture the sound 182 of a nearby car. It should be understood, however, that a vehicle is merely a non-limiting example of a sound source and that the techniques described herein may be implemented with other sound sources. Upon capturing the sound 182 from the source 180, the first microphone 102 is configured to generate an audio signal 170 representative of the captured sound 182. In a similar manner, the second microphone 104 is configured to capture the sound 182 from the one or more sources 180. Upon capturing the sound 182 from the source 180, the second microphone 104 is configured to generate an audio signal 172 representative of the captured sound 182.

The first microphone 102 and the second microphone 104 may have different locations, different orientations, or both. As a result, the microphones 102, 104 may capture the sound 182 at different times, with different phases, or both. To illustrate, if the first microphone 102 is closer to the source 180 than the second microphone 104 is, the first microphone 102 may capture the sound 182 before the second microphone 104 captures the sound 182. As described below, if the locations and orientations of the microphones 102, 104 are known, the audio signals 170, 172 generated by the microphones 102, 104, respectively, may be used to perform directional processing at the device 110, the device 120, or both. That is, the device 110 may use the audio signals 170, 172 to determine the location of the source 180, to determine the direction of arrival of the sound 182, to spatially filter audio corresponding to the sound 182, etc. As further described below, the device 110 may provide results of the directional processing (e.g., data associated with the directional processing) to the device 120 for high-complexity processing, and vice versa.

The device 110 includes a first input interface 111, a second input interface 112, a memory 114, one or more processors 116, and a modem 118. The first input interface 111 is coupled to the one or more processors 116 and is configured to be coupled to the first microphone 102. The first input interface 111 is configured to receive an audio signal 170 (e.g., a first microphone output) from the first microphone 102 and to provide the audio signal 170 to the processor 116 as audio frames 174. The second input interface 112 is coupled to the one or more processors 116 and is configured to be coupled to the second microphone 104. The second input interface 112 is configured to receive an audio signal 172 (e.g., a second microphone output) from the second microphone 104 and to provide the audio signal 172 to the processor 116 as audio frames 176. The audio frames 174, 176 may also be referred to herein as audio data 178.

The one or more processors 116 optionally include a direction-of-arrival processing unit 132, an audio event processing unit 134, an acoustic environment processing unit 136, a beamforming unit 138, or a combination thereof. According to one implementation, one or more components of the one or more processors 116 may be implemented using dedicated circuitry. As non-limiting examples, one or more components of the one or more processors 116 may be implemented using a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), etc. According to another implementation, one or more components of the one or more processors 116 may be implemented by executing instructions 115 stored in the memory 114. For example, the memory 114 may be a non-transitory computer-readable medium that stores instructions 115 executable by the one or more processors 116 to perform the operations described herein.

The direction-of-arrival processing unit 132 may be configured to process the multiple audio signals 170, 172 to generate direction-of-arrival information 142 corresponding to the source 180 of the sound 182 represented in the audio signals 170, 172. To illustrate, the direction-of-arrival processing unit 132 may select audio frames 174, 176, generated from the audio signals 170, 172 of each microphone 102, 104, that represent a similar sound, such as the sound 182 from the source 180. For example, the direction-of-arrival processing unit 132 may process the audio frames 174, 176 to compare sound characteristics and ensure that the audio frames 174, 176 represent the same instance of the sound 182. In an illustrative, non-limiting example of direction-of-arrival processing, in response to determining that the audio frames 174, 176 represent the same instance of the sound 182, the direction-of-arrival processing unit 132 may compare the timestamp of each audio frame 174, 176 to determine which microphone 102, 104 first captured the corresponding instance of the sound 182. If the audio frame 174 has an earlier timestamp than the audio frame 176, the direction-of-arrival processing unit 132 may generate direction-of-arrival information 142 indicating that the source 180 is closer to the first microphone 102. If the audio frame 176 has an earlier timestamp than the audio frame 174, the direction-of-arrival processing unit 132 may generate direction-of-arrival information 142 indicating that the source 180 is closer to the second microphone 104. Thus, based on the timestamps of similar audio frames 174, 176, the direction-of-arrival processing unit 132 may localize the sound 182 and the corresponding source 180. Timestamps of audio frames from additional microphones may be used to refine the localization in a similar manner as described above.

In some implementations, one or more other techniques may be used to determine the direction-of-arrival information 142, instead of or in addition to the time differences described above, such as measuring the phase difference of the sound 182 as received at each microphone of a microphone array of the device 110 (e.g., the microphones 102 and 104). In some implementations, the microphones 102, 104, 106, and 108 may operate, in conjunction with the device 120, as a distributed microphone array, and the direction-of-arrival information 142 is generated based on characteristics (such as time of arrival or phase) of the sound from each of the microphones 102, 104, 106, and 108 and based on the relative locations and orientations of the microphones 102, 104, 106, and 108. In such implementations, information regarding the sound characteristics (e.g., phase information, timing information, or both), captured audio data (e.g., at least a portion of the audio signals 170, 172), or a combination thereof, may be transmitted between the device 110 and the device 120 to enable direction-of-arrival detection using the distributed microphone array.
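A minimal sketch of the phase-difference alternative, assuming a standard two-microphone far-field model with known spacing; the relation delta_phi = 2*pi*f*d*sin(theta)/c is a textbook narrowband formula, not a detail taken from the disclosure, and the function names are illustrative.

```python
# Illustrative sketch: estimate a direction of arrival from the phase
# difference measured between two microphones spaced d metres apart,
# using the far-field relation
#   delta_phi = 2*pi*f*d*sin(theta)/c  =>  theta = asin(c*delta_phi/(2*pi*f*d))
import math

def doa_from_phase(delta_phi: float, freq_hz: float, spacing_m: float,
                   c: float = 343.0) -> float:
    """Return the arrival angle in degrees relative to the array broadside."""
    x = c * delta_phi / (2 * math.pi * freq_hz * spacing_m)
    x = max(-1.0, min(1.0, x))  # clamp against measurement noise
    return math.degrees(math.asin(x))

# A 1 kHz tone arriving 30 degrees off broadside at a 2 cm mic spacing:
delta_phi = 2 * math.pi * 1000 * 0.02 * math.sin(math.radians(30)) / 343.0
print(round(doa_from_phase(delta_phi, 1000.0, 0.02), 1))  # prints 30.0
```

Note that this narrowband estimate is ambiguous above the spatial aliasing frequency (when d exceeds half a wavelength), which is one reason combining phase with the time-of-arrival cues described above can help.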

The direction of arrival information 142 may be sent to the device 120. For example, the modem 118 may send data based on the direction of arrival information 142 to the device 120. In some examples, generating the direction of arrival information 142 at the device 110 corresponds to performing low-complexity processing operations, and the device 120 may use the direction of arrival information 142 to perform high-complexity processing operations. For example, in some implementations, the device 110 may be a resource-constrained device, such as a device with limited battery life, limited memory capacity, or limited processing capability relative to the device 120. Performing the high-complexity processing operations at the device 120 can offload resource-intensive operations from the device 110.

To illustrate, the device 120 may optionally include one or more sensors 129. As non-limiting examples, the sensors 129 may include non-audio sensors, such as a 360-degree camera, a lidar sensor, and the like. Based on the direction of arrival information 142, the device 120 may command the 360-degree camera to focus on the source 180, command the lidar sensor to measure a distance between a user of the devices 110, 120 and the source 180, and so on.

The audio event processing unit 134 may be configured to process the multiple audio signals 170, 172 to perform audio event detection. To illustrate, the audio event processing unit 134 may process sound characteristics of the audio frames 174, 176 and compare the sound characteristics to a plurality of audio event models to determine whether an audio event has occurred. For example, the audio event processing unit 134 may access a database (not shown) that includes models for different audio events, such as car horns, train horns, pedestrians talking, and the like. In response to the sound characteristics matching (or substantially matching) a particular model, the audio event processing unit 134 may generate audio event information 144 indicating that the sound 182 represents the audio event associated with the particular model. As used herein, the sound characteristics of an audio frame may "match" a particular sound model if the pitch and frequency components of the audio frame are within thresholds of the pitch and frequency components of the particular sound model.
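The threshold-based "matching" just described can be sketched as follows. The event models, pitch values, and tolerance below are invented placeholders; an actual audio event processing unit would use trained models rather than this toy table.

```python
# Hypothetical event models: a nominal pitch (Hz), dominant frequency
# components (Hz), and a tolerance defining the matching thresholds.
EVENT_MODELS = {
    "car_horn":   {"pitch": 440.0, "components": [440.0, 880.0], "tol": 25.0},
    "train_horn": {"pitch": 311.0, "components": [311.0, 622.0], "tol": 25.0},
}

def match_event(frame_pitch, frame_components):
    """Return the name of the first model whose pitch and frequency
    components are within the tolerance of the frame's characteristics,
    or None when no audio event is detected."""
    for name, model in EVENT_MODELS.items():
        tol = model["tol"]
        pitch_ok = abs(frame_pitch - model["pitch"]) <= tol
        comps_ok = all(
            any(abs(fc - mc) <= tol for fc in frame_components)
            for mc in model["components"]
        )
        if pitch_ok and comps_ok:
            return name  # audio event detected
    return None

print(match_event(445.0, [442.0, 878.0]))  # car_horn
print(match_event(100.0, [100.0]))         # None
```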

In some implementations, the audio event processing unit 134 includes one or more classifiers configured to process audio signal data (such as the audio signals 170, 172, sound characteristics of the audio frames 174, 176, beamformed data based on the audio signals 170, 172, or a combination thereof) to determine an associated class from among multiple classes supported by the one or more classifiers. In one example, the one or more classifiers operate in conjunction with the plurality of audio event models described above to determine a class (e.g., a classification, such as "dog barking", "glass breaking", "baby crying", etc.) of a sound that is represented in one or more of the audio signals and that is associated with an audio event. For example, the one or more classifiers may include a neural network that has been trained using labeled sound data to distinguish among sounds corresponding to the various classes and that is configured to process the audio signal data to determine a particular class of the sound represented by the audio signal data (or to determine, for each class, a probability that the sound belongs to that class). The class may correspond to, or be included in, the audio event information 144. An example of the device 110 including one or more classifiers is described in further detail with reference to FIG. 6.

In some implementations, the audio event processing unit 134 includes one or more encoders configured to process audio signal data (such as the audio signals 170, 172, sound characteristics of the audio frames 174, 176, beamformed data based on the audio signals 170, 172, or a combination thereof) to generate a signature of a sound represented in the audio signal data. For example, the one or more encoders may include one or more neural networks configured to process the audio signal data to generate an embedding that corresponds to a particular sound in the audio signal data and that is associated with an audio event. An "embedding" may refer to a relatively low-dimensional space, represented by a vector (e.g., an ordered sequence of values or a set of indexed values), into which higher-dimensional vectors can be transformed while preserving semantic relationships. To illustrate, an audio signal may be represented using a sequence of relatively large vectors (e.g., representing spectral data and other audio features) that can be processed to generate an embedding represented by a smaller vector. The embedding may include sufficient information to enable detection of the particular sound in an audio signal. The signature (e.g., the embedding) may correspond to, or be included in, the audio event information 144. An example of the device 110 including one or more encoders is described in further detail with reference to FIG. 7.
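The use of an embedding as a sound signature can be illustrated with a minimal sketch that compares a query embedding against stored signatures by cosine similarity. The four-dimensional vectors and the similarity threshold below are hypothetical; real embeddings would be produced by the encoder neural networks described above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical stored signatures (low-dimensional embeddings) of known events.
SIGNATURES = {
    "dog_bark":    [0.9, 0.1, 0.0, 0.2],
    "glass_break": [0.0, 0.8, 0.6, 0.1],
}

def nearest_signature(embedding, threshold=0.9):
    """Return the best-matching stored signature above the threshold,
    or None when the embedding matches no known event."""
    best_name, best_sim = None, threshold
    for name, sig in SIGNATURES.items():
        sim = cosine_similarity(embedding, sig)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name

print(nearest_signature([0.88, 0.12, 0.01, 0.19]))  # dog_bark
```

Because the comparison is a vector similarity rather than a full model evaluation, the same mechanism supports either keeping or removing sounds that correspond to a given signature, as in the audio scene editing described below.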

In a non-limiting example, the audio event may correspond to the sound of an approaching vehicle (e.g., the source 180). Based on the audio event, the audio event processing unit 134 may generate the audio event information 144, and the audio event information 144 may be sent to the device 120. For example, the modem 118 may send data corresponding to the detected event to the device 120. In some examples, generating the audio event information 144 at the device 110 corresponds to performing low-complexity processing operations, and the device 120 may use the audio event information 144 to perform high-complexity processing operations. To illustrate, based on the audio event information 144, the device 120 may perform one or more operations, such as processing audio data at a larger, more accurate classifier to verify the audio event, editing an audio scene based on a sound signature (e.g., to remove a sound corresponding to an embedding included in the audio event information 144, or to remove sounds that do not correspond to the embedding), commanding a 360-degree camera to focus on the source 180, commanding a lidar sensor to measure a distance between a user of the devices 110, 120 and the source 180, and so on.

The acoustic environment processing unit 136 may be configured to process the multiple audio signals 170, 172 to perform acoustic environment detection. To illustrate, the acoustic environment processing unit 136 may process sound characteristics of the audio frames 174, 176 to determine acoustic characteristics of the surrounding environment. As a non-limiting example, the acoustic characteristics may include a direct-to-reverberant ratio (DRR) estimate of the surrounding environment. The acoustic environment processing unit 136 may generate environment information 146 based on the acoustic characteristics of the surrounding environment. For example, if the DRR estimate is relatively high, the environment information 146 may indicate that the device 110 is in an indoor environment. However, if the DRR estimate is relatively low, the environment information 146 may indicate that the device 110 is in an outdoor environment. In some implementations, the acoustic environment processing unit 136 may include, or be implemented as, one or more classifiers configured to generate an output indicating an audio environment class that may correspond to, or be included in, the environment information 146.
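The DRR-based decision above can be sketched as follows, assuming an estimated impulse response is available. The direct/reverberant split point and the decision threshold are placeholder values, and the high-DRR-implies-indoor rule simply mirrors the rule stated in the passage above.

```python
import math

def drr_db(impulse_response, sample_rate_hz, direct_window_ms=2.5):
    """Direct-to-reverberant ratio (dB): energy of the direct path
    (the first few milliseconds) versus energy of the reverberant tail."""
    split = int(sample_rate_hz * direct_window_ms / 1000.0)
    direct = sum(s * s for s in impulse_response[:split])
    reverb = sum(s * s for s in impulse_response[split:])
    return 10.0 * math.log10(direct / reverb)

def classify_environment(drr, threshold_db=10.0):
    # Rule from the passage above: relatively high DRR -> indoor,
    # relatively low DRR -> outdoor (the threshold is a placeholder).
    return "indoor" if drr >= threshold_db else "outdoor"

# Toy impulse response at 8 kHz: a strong direct spike then a weak tail.
ir = [1.0] + [0.0] * 19 + [0.01] * 80
print(classify_environment(drr_db(ir, 8000)))  # indoor
```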

The environment information 146 may be sent to the device 120. For example, the modem 118 may send data corresponding to (e.g., identifying) the detected environment to the device 120. In some examples, generating the environment information 146 at the device 110 corresponds to performing low-complexity processing operations, and the device 120 may use the environment information 146 to perform high-complexity processing operations. To illustrate, as illustrative, non-limiting examples, based on the environment information 146, the device 120 may perform one or more operations, such as removing ambient or background noise from one or more audio signals, editing an audio scene based on the environment information 146, or changing settings of a 360-degree camera to capture outdoor images rather than indoor images.

The beamforming unit 138 may be configured to process the multiple audio signals 170, 172 to perform beamforming. In some examples, the beamforming unit 138 performs beamforming based on the direction of arrival information 142. Alternatively or additionally, in some examples, the beamforming unit 138 performs adaptive beamforming that uses multi-channel signal processing algorithms to spatially filter the audio signals 170, 172 and to determine the location of the source 180. The beamforming unit 138 may direct a beam of increased sensitivity toward the location of the source 180 and suppress audio signals from other locations. In some examples, the beamforming unit 138 is configured to adjust processing of the audio signal 170 relative to the audio signal 172 (e.g., by introducing time or phase delays, adjusting signal amplitudes, or both, based on the different sound propagation paths from the source 180 to each of the different microphones 102, 104) to emphasize (e.g., via constructive interference) sound arriving from the direction of the source 180 and to attenuate sound arriving from one or more other directions. In some examples, if the beamforming unit 138 determines that the location of the source 180 is closer to the first microphone 102, the beamforming unit 138 may send a command to change the orientation or direction of the first microphone 102 to capture the sound 182 and to null sound from other directions (e.g., a direction associated with the second microphone 104).
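The relative-delay adjustment described above can be illustrated with a minimal delay-and-sum sketch using integer-sample delays. This is a simplified stand-in for the adaptive, multi-channel beamforming of the beamforming unit 138: a channel delayed toward the source adds constructively, while sound from other directions adds incoherently.

```python
def delay_and_sum(signals, delays):
    """Minimal delay-and-sum beamformer: delay each channel by an
    integer number of samples so that sound from the steered direction
    is time-aligned across channels, then average the channels."""
    length = len(signals[0])
    out = [0.0] * length
    for sig, d in zip(signals, delays):
        for n in range(length):
            if 0 <= n - d < length:
                out[n] += sig[n - d]
    return [v / len(signals) for v in out]

# Two channels carrying the same pulse; channel 1 receives it one
# sample later than channel 0, so delaying channel 0 by one sample
# aligns the pulse.
ch0 = [0.0, 1.0, 0.0, 0.0]
ch1 = [0.0, 0.0, 1.0, 0.0]
beam = delay_and_sum([ch0, ch1], [1, 0])
print(beam)  # [0.0, 0.0, 1.0, 0.0] -- the aligned pulse adds coherently
```

Steering a null instead of a beam, as in the last example above, amounts to subtracting the aligned channels rather than adding them.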

The resulting one or more beamformed audio signals 148 (e.g., representations of the audio signals 170, 172) may be sent to the device 120. For example, the modem 118 may send the one or more beamformed audio signals 148 to the device 120. In a particular implementation, a single beamformed audio signal 148 is provided to the device 120 for each audio source of interest. In some examples, generating the beamformed audio signals 148 at the device 110 corresponds to performing low-complexity processing operations, and the device 120 may use the beamformed audio signals 148 to perform high-complexity processing operations. In an illustrative example, based on the beamformed audio signals 148, the device 120 may command a 360-degree camera to focus on the source 180, command a radar sensor to measure a distance between a user of the devices 110, 120 and the source 180, and so on.

Optionally, the device 110 may send at least a portion of the audio data (e.g., the audio signals 170, 172) captured by the microphones 102, 104 to the device 120 for distributed audio processing, in which a portion of the processing described as being performed by the device 110 is offloaded to the device 120, or for additional processing using the larger processing, memory, and power resources available at the device 120. As an example, in some implementations, the device 110 may send at least a portion of the audio signals 170, 172 (e.g., the audio data 178) to the device 120 for higher-accuracy direction of arrival processing, higher-accuracy audio event detection, higher-accuracy environment detection, or a combination thereof. In some implementations, the device 110 may send at least a portion of the audio signals 170, 172 (e.g., the audio data 178) to the device 120 instead of, or in addition to, the beamformed audio signals 148.

Optionally, the device 110 may include, or be coupled to, a user peripheral device, such as a visual user peripheral (e.g., a display such as shown in FIG. 25 or a holographic projection unit such as shown in FIG. 26, as non-limiting examples), an audio user peripheral (e.g., a speaker such as described with reference to FIG. 3 or a voice user interface such as described with reference to FIG. 5, as non-limiting examples), or a haptic user peripheral (e.g., as described with reference to FIG. 22, as a non-limiting example). The one or more processors 116 may be configured to provide, to the user peripheral device, a user interface output indicating at least one of an environmental event or an acoustic event. To illustrate, the user interface output may cause the user interface device to provide a notification of a detected audio event or environmental condition, such as based on the audio event information 144, audio event information 145 received from the device 120, the environment information 146, environment information 147 received from the device 120, or a combination thereof.

The various techniques described above illustrate the device 110 (e.g., a low-power device) performing directional context-aware processing. That is, the device 110 processes the audio signals 170, 172 from the multiple microphones 102, 104 to determine the direction from which the sound 182 originates. In a particular implementation, the device 110 corresponds to a headset and the device 120 corresponds to a mobile phone. In this implementation, the headset performs the directional context-aware processing and can send the resulting data to the mobile phone to perform additional high-complexity processing. In other implementations, the device 110 corresponds to one or more other devices having less computing power than the device 120 (e.g., a mobile phone, a tablet device, a personal computer, a server, a vehicle, etc.), such as a head-mounted device (e.g., a virtual reality headset, a mixed reality headset, or an augmented reality headset), glasses (e.g., augmented reality glasses or mixed reality glasses), a "smart watch" device, a virtual assistant device, or an internet-of-things device.

As described below, the device 120 (e.g., a mobile phone) may also perform directional context-aware processing based on the audio signals 170, 172 received from the device 110, based on audio signals 190, 192 from the microphones 106, 108, or a combination thereof. The device 120 may provide results of the directional context-aware processing to the device 110 (e.g., the headset) so that the device 110 can perform additional operations, such as an audio zoom operation described in further detail with reference to FIG. 3.

The device 120 includes a memory 124, one or more processors 126, and a modem 128. Optionally, the device 120 also includes one or more of a first input interface 121, a second input interface 122, and one or more sensors 129.

In some implementations, the first input interface 121 and the second input interface 122 are each coupled to the one or more processors 126 and are configured to be coupled to the third microphone 106 and the fourth microphone 108, respectively. The first input interface 121 is configured to receive an audio signal 190 from the third microphone 106 and to provide the audio signal 190, such as an audio frame 194, to the one or more processors 126. The second input interface 122 is configured to receive an audio signal 192 from the fourth microphone 108 and to provide the audio signal 192, such as an audio frame 196, to the one or more processors 126. The audio signals 190, 192 (e.g., the audio frames 194, 196) may be referred to as audio data 198 provided to the one or more processors 126.

The one or more processors 126 optionally include a direction of arrival processing unit 152, an audio event processing unit 154, an acoustic environment processing unit 156, a beamforming unit 158, or a combination thereof. According to some implementations, one or more components of the one or more processors 126 may be implemented using dedicated circuitry. As non-limiting examples, one or more components of the one or more processors 126 may be implemented using an FPGA, an ASIC, or the like. According to another implementation, one or more components of the one or more processors 126 may be implemented by executing instructions 125 stored in the memory 124. For example, the memory 124 may be a non-transitory computer-readable medium that stores the instructions 125, and the instructions 125 are executable by the one or more processors 126 to perform the operations described herein.

The direction of arrival processing unit 152 may be configured to process multiple audio signals (e.g., two or more of the audio signals 170, 172, 190, or 192) to generate direction of arrival information 143 corresponding to the source 180 of the sound 182 represented in the multiple audio signals. To illustrate, the direction of arrival processing unit 152 may be configured to process the multiple audio signals using one or more of the techniques described with reference to the direction of arrival processing unit 132 (e.g., time of arrival, phase differences, etc.). The direction of arrival processing unit 152 may have more powerful processing capability than the direction of arrival processing unit 132 and may therefore generate more accurate results.

In some implementations, the audio signals 170, 172 are received from the device 110, and the direction of arrival processing unit 152 may process the audio signals 170, 172 to determine the direction of arrival information 143 without processing the audio signals 190, 192 at the direction of arrival processing unit 152. For example, one or more of the microphones 106, 108 may be occluded or otherwise unable to produce a useful representation of the sound 182, such as when the device 120 is a mobile device being carried in a user's pocket or bag.

In other implementations, the audio signals 190, 192 are received from the microphones 106, 108 and are processed at the direction of arrival processing unit 152 to determine the direction of arrival information 143 without processing the audio signals 170, 172 at the direction of arrival processing unit 152. For example, the audio signals 170, 172 may not be sent by the device 110 or may not be received by the device 120. In another example, the audio signals 170, 172 may be of low quality, such as due to a large amount of noise (e.g., wind noise) at the microphones 102, 104, and the device 120 may choose to use the audio signals 190, 192 and disregard the audio signals 170, 172.

In some implementations, the audio signals 170, 172 are received from the device 110 and are used at the direction of arrival processing unit 152 in conjunction with the audio signals 190, 192 to generate the direction of arrival information 143. To illustrate, the device 110 may correspond to a headset that has one or more sensors, such as a position or location sensor (e.g., a Global Positioning System (GPS) receiver), an inertial measurement unit (IMU) that tracks one or more of the orientation, movement, or acceleration of the device 110 (e.g., head tracker data), or a combination thereof. The device 120 may also include one or more position or location sensors (e.g., a GPS receiver) and an IMU to enable the device 120, in conjunction with head tracker data received from the device 110, to determine the absolute or relative positions and orientations of the microphones 102, 104, 106, and 108 operating as a distributed microphone array. The direction of arrival information 142, the direction of arrival information 143, or both may be relative to a frame of reference of the device 110, relative to a frame of reference of the device 120, relative to an absolute frame of reference, or a combination thereof, and may be converted between the various frames of reference, as appropriate, by the device 110, the device 120, or both.
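The frame-of-reference conversion mentioned above can be sketched, in two dimensions, as adding the device yaw reported by an IMU or head tracker to the device-relative azimuth. A full implementation would use 3-D rotations; this simplified, hypothetical version is for illustration only.

```python
def doa_device_to_world(azimuth_deg: float, device_yaw_deg: float) -> float:
    """Convert a DOA azimuth expressed in a device's own frame of
    reference into an absolute (world) frame, given the device's yaw
    (e.g., from an IMU or head tracker). 2-D sketch only."""
    return (azimuth_deg + device_yaw_deg) % 360.0

# A headset reports a source 30 degrees to its right while the wearer
# faces 90 degrees in the world frame:
print(doa_device_to_world(30.0, 90.0))  # 120.0
```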

The direction of arrival information 143 may be sent to the device 110. For example, the modem 128 may send data based on the direction of arrival information 143 to the device 110. The device 110 may use the direction of arrival information 143 to perform an audio operation, such as an audio zoom operation. For example, the one or more processors 116 may send a command to capture (or focus on) audio from the direction of the source 180 and the sound 182.

The audio event processing unit 154 may be configured to process multiple audio signals to perform audio event detection and to generate audio event information 145 corresponding to one or more detected audio events. To illustrate, in implementations in which the audio signals 170, 172 are received at the device 120, the audio event processing unit 154 may process sound characteristics of the audio signals 170, 172 (e.g., the audio frames 174, 176) and compare the sound characteristics to a plurality of audio event models to determine whether an audio event has occurred. In some implementations in which the audio signals 190, 192 are received at the device 120, the audio event processing unit 154 may process sound characteristics of the audio signals 190, 192 (e.g., the audio frames 194, 196) and compare the sound characteristics to the plurality of audio event models to detect an audio event. In some implementations in which the beamformed audio signals 148 are received, the audio event processing unit 154 may process sound characteristics of the beamformed audio signals 148 to detect an audio event. In some implementations in which the beamforming unit 158 generates beamformed audio signals 149, the audio event processing unit 154 may process sound characteristics of the beamformed audio signals 149 to detect an audio event.

The audio event processing unit 154 may access a database (not shown) that includes models for different audio events, such as car horns, train horns, pedestrians talking, and the like. In response to the sound characteristics matching (or substantially matching) a particular model, the audio event processing unit 154 may generate the audio event information 145 indicating that the sound 182 represents the audio event associated with the particular model. In some implementations, the audio event processing unit 154 includes one or more classifiers configured to determine a class of an audio event in a manner similar to that described for the audio event processing unit 134. However, as compared to the audio event processing unit 134, the audio event processing unit 154 may perform more complex operations, may support a much larger set of models or audio classes than the audio event processing unit 134, and may generate more accurate audio event determinations (or classifications) than the audio event processing unit 134.

In some examples, the audio event processing unit 134 is a relatively low-power detector configured to have relatively high sensitivity, which reduces the probability that an audio event goes undetected but may also result in an increased number of false alarms (e.g., determining that an audio event is detected when no audio event has actually occurred). The audio event processing unit 154 may use the information received from the device 110 to provide higher audio event detection accuracy and may verify the audio events (e.g., classifications) received from the audio event processing unit 134 by processing the corresponding audio signals (e.g., one or more of the audio signals 170, 172, 190, 192, one or more of the beamformed audio signals 148, 149, or a combination thereof).
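The two-stage arrangement just described (a sensitive low-power detector whose candidate events are verified by a more accurate second stage) can be sketched as a cascade of thresholds. The scores and threshold values below are hypothetical placeholders.

```python
# Hypothetical cascade: the first stage (e.g., on the wearable device)
# uses a low threshold so that few events are missed; the second stage
# (e.g., on the phone) re-scores only the candidates, rejecting the
# first stage's false alarms.
FIRST_STAGE_THRESHOLD = 0.3   # sensitive: few misses, more false alarms
SECOND_STAGE_THRESHOLD = 0.8  # accurate: verifies the candidates

def cascade(first_score, second_score_fn):
    """Return True only when both stages agree an event occurred.
    second_score_fn is called lazily, mirroring how the high-complexity
    verifier runs only for candidate events."""
    if first_score < FIRST_STAGE_THRESHOLD:
        return False  # first stage reports nothing; second stage never runs
    return second_score_fn() >= SECOND_STAGE_THRESHOLD

print(cascade(0.1, lambda: 0.95))  # False -- never reaches the second stage
print(cascade(0.4, lambda: 0.95))  # True  -- candidate confirmed
print(cascade(0.4, lambda: 0.5))   # False -- false alarm rejected
```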

The audio event information 145 may be sent to the device 110. For example, the modem 128 may send data corresponding to the detected event to the device 110. The device 110 may use the audio event information 145 to perform an audio operation, such as an audio zoom operation. For example, the one or more processors 116 may send a command to capture (or focus on) sound from the audio event. In another example, the audio event information 145 may cause the one or more processors 116 to disregard (e.g., not focus on), or to attenuate or remove, sound from the audio event. For example, the audio event processing unit 154 may determine that the audio event corresponds to the buzzing of a fly near the device 110, and the audio event information 145 may indicate that the device 110 should disregard the buzzing sound or direct a null beam in the direction of the source of the buzzing sound. In implementations in which the device 110 selects whether to play back ambient sound to a user of the device 110, such as when the device 110 is a headset configured to enter a "transparency" mode to enable the user to hear external sounds in particular circumstances, the audio event information 145 may indicate to the device 110 whether the sound 182 should trigger the device 110 to transition to the transparency mode.

The acoustic environment processing unit 156 may be configured to process the multiple audio signals 170, 172, the multiple audio signals 190, 192, or a combination thereof, to perform acoustic environment detection. To illustrate, the acoustic environment processing unit 156 may process sound characteristics of the audio frames 174, 176, the audio frames 194, 196, or both, to determine acoustic characteristics of the surrounding environment. In some implementations, the acoustic environment processing unit 156 functions in a manner similar to the acoustic environment processing unit 136. However, as compared to the acoustic environment processing unit 136, the acoustic environment processing unit 156 may perform more complex operations, may support a much larger set of models or audio environment classes than the acoustic environment processing unit 136, and may generate more accurate acoustic environment determinations (or classifications) than the acoustic environment processing unit 136.

In some examples, as compared to acoustic environment processing unit 156, acoustic environment processing unit 136 is a relatively low-power detector configured to have relatively high sensitivity to environmental changes (e.g., detecting a change in background sound characteristics when device 110 moves from an indoor environment to an outdoor environment, or from an outdoor environment into a vehicle, as non-limiting examples) but possibly having relatively low accuracy in determining the environment itself. Acoustic environment processing unit 156 may use information received from device 110 to provide higher acoustic environment detection accuracy, and may verify the environment information 146 (e.g., a classification) received from acoustic environment processing unit 136 by processing the corresponding audio signals (e.g., one or more of audio signals 170, 172, 190, 192, one or more of beamformed audio signals 148, 149, or a combination thereof).

Acoustic environment processing unit 156 may generate environment information 147 based on the acoustic characteristics of the surrounding environment. Environment information 147 may be sent to device 110. For example, modem 128 may send data corresponding to the detected environment to device 110. Device 110 may use environment information 147 to perform additional audio operations.

Beamforming unit 158 may be configured to process the multiple audio signals 170, 172 to perform adaptive beamforming. To illustrate, in some examples, beamforming unit 158 spatially filters audio signals 170, 172 using multi-channel signal processing algorithms to direct a beam of increased sensitivity toward the location of source 180 and to suppress audio signals from other locations in a manner similar to that described for beamforming unit 138. In another example, beamforming unit 158 spatially filters audio signals 190, 192 using multi-channel signal processing algorithms to direct a beam of increased sensitivity toward the location of source 180. In another example, in which device 120 receives audio signals 170, 172 from device 110 and also receives audio signals 190, 192, beamforming unit 158 may perform spatial filtering based on all of audio signals 170, 172, 190, and 192. In some implementations, beamforming unit 158 generates a single beamformed audio signal for each sound source detected in the audio signals. For example, if a single sound source is detected, a single beamformed audio signal 149 directed toward that sound source is generated. In another example, if multiple sound sources are detected, multiple beamformed audio signals 149 may be generated, with each of the multiple beamformed audio signals 149 directed toward a respective one of the sound sources.
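The patent does not tie beamforming unit 158 to any particular algorithm. As a non-authoritative sketch, a frequency-domain delay-and-sum beamformer illustrates how multi-channel spatial filtering can steer a beam of increased sensitivity toward a source direction; all function names, the linear-array geometry, and the parameters below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, doa_deg, fs, c=343.0):
    """Steer a microphone array toward doa_deg using delay-and-sum.

    signals: (n_mics, n_samples) time-domain audio, one row per microphone
    mic_positions: (n_mics,) microphone positions along one axis, in meters
    doa_deg: assumed direction of arrival in degrees (0 = broadside)
    fs: sample rate in Hz; c: speed of sound in m/s
    """
    n_mics, n_samples = signals.shape
    # Per-microphone delays (seconds) for a plane wave arriving from doa_deg
    delays = mic_positions * np.sin(np.deg2rad(doa_deg)) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spec = np.fft.rfft(signals[m])
        # Phase-shift each channel so the target direction adds coherently;
        # sounds from other directions add incoherently and are attenuated.
        spec *= np.exp(2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spec, n=n_samples)
    return out / n_mics
```

For a source exactly at the steered direction, the channels align and the output reproduces the source; off-beam sources partially cancel, which is the "increased sensitivity" behavior described above.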

The resulting beamformed audio signal 149 may be sent to device 110. For example, modem 128 may send one or more beamformed audio signals 149 to device 110. Device 110 may use beamformed audio signal 149 to play back improved audio.

Although various components of device 110 and device 120 are shown and described above, it should be understood that in other implementations one or more components may be omitted or bypassed. It should also be understood that various combinations of the components of device 110, device 120, or both can enable interoperability that enhances the performance of device 110, device 120, or both, such as described in the non-limiting examples listed below.

In a particular implementation, device 110 includes audio event processing unit 134 and omits direction-of-arrival processing unit 132, acoustic environment processing unit 136, and beamforming unit 138 (or disables them or bypasses their operation). In this implementation, audio event information 144 may be provided to device 120, as described above, and used in conjunction with processing at device 120 that uses audio signals 170, 172, audio signals 190, 192, or a combination of audio signals 170, 172, 190, 192.

In another particular implementation, device 110 includes audio event processing unit 134 and direction-of-arrival processing unit 132, and omits acoustic environment processing unit 136 and beamforming unit 138 (or disables them or bypasses their operation). In this implementation, direction-of-arrival information 142 and audio event information 144 are generated at device 110 and may be provided to device 120 for the uses described above. Direction-of-arrival information 142 may be used to enhance (e.g., via increased accuracy, reduced latency, or both) the audio event detection that may be performed at audio event processing unit 134, audio event processing unit 154, or both. For example, direction-of-arrival information 142 may be provided as an input to audio event processing unit 134, and audio event processing unit 134 may compare direction-of-arrival information 142 to directions associated with one or more previously detected audio events or sound sources. In another example, audio event processing unit 134 may use direction-of-arrival information 142 to increase or decrease the likelihood of detecting a particular audio event. To illustrate, because a sound originating from above the user is more likely to come from a bird or an airplane than from a car, a weighting factor may be applied, as an illustrative non-limiting example, to reduce the probability that an overhead sound is determined to match a car-based audio event. Additionally or alternatively, direction-of-arrival information 142 may be used to enhance the performance of audio event processing unit 154 in a manner similar to that described for audio event processing unit 134.
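The overhead-sound example above can be sketched as a direction-dependent reweighting of classifier scores. The class names, prior values, and the 45-degree elevation threshold below are illustrative assumptions only, not values from the patent:

```python
# Hypothetical per-class priors: probability that a sound of this class is
# heard from overhead (elevation > 45 degrees). Values are illustrative.
ELEVATION_PRIORS = {
    "car": 0.05,
    "bird": 0.60,
    "airplane": 0.70,
    "speech": 0.20,
}

def reweight_scores(class_scores, elevation_deg):
    """Scale raw classifier scores by a direction-of-arrival prior.

    class_scores: dict mapping audio class -> raw score in [0, 1]
    elevation_deg: estimated elevation of the sound source in degrees
    """
    overhead = elevation_deg > 45.0
    weighted = {}
    for cls, score in class_scores.items():
        prior = ELEVATION_PRIORS.get(cls, 0.5)
        # Overhead sounds favor high-elevation classes; below the threshold,
        # the complementary prior applies instead.
        factor = prior if overhead else 1.0 - prior
        weighted[cls] = score * factor
    total = sum(weighted.values()) or 1.0
    return {cls: w / total for cls, w in weighted.items()}
```

With equal raw scores, an overhead direction of arrival shifts the decision away from "car" and toward "bird" or "airplane", matching the weighting-factor example in the text.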

As further explained with reference to FIG. 9, the performance of audio event processing unit 154 may be enhanced by providing audio event information 144 (e.g., the audio class detected by audio event processing unit 134) as an input to audio event processing unit 154. For example, audio event information 144 may be used as a starting point for an event model database search, or as an input that can influence a classification operation performed by a neural-network-based audio event classifier. Thus, by using direction-of-arrival information 142 at audio event processing unit 134 to improve the accuracy of audio event information 144, the improved accuracy of audio event information 144 may also improve the performance of audio event processing unit 154.

In some implementations in which device 110 also includes acoustic environment processing unit 136, environment information 146 may be used to improve the performance of audio event processing unit 134, audio event processing unit 154, or both. For example, because some audio events (e.g., a car horn) are more likely to occur in some environments (e.g., on a busy street or in a vehicle) than in others (e.g., in an office), audio event processing unit 134 may adjust its operation based on the environment. For example, audio event processing unit 134 may prioritize searching sound event models that are more likely to occur in the particular environment, which may result in increased accuracy, reduced latency, or both. As another example, audio event processing unit 134 may adjust weighting factors for one or more sound event models based on the environment, to increase or decrease the likelihood that sound 182 is determined to match those sound event models. In some implementations, environment information 146 may be sent to device 120 and used to improve the performance of audio event processing unit 154 in a similar manner.
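One way to realize the environment-based prioritization described above is to reorder the sound event model search by environment class. This is a minimal sketch under assumed mappings; the environment names and model names are purely illustrative, not from the patent:

```python
# Hypothetical mapping from environment class to the sound event models most
# likely to occur there; names are illustrative only.
ENV_MODEL_PRIORITY = {
    "street": ["car_horn", "siren", "engine", "speech"],
    "office": ["speech", "keyboard", "phone_ring"],
    "vehicle": ["engine", "car_horn", "music"],
}

def ordered_models(environment, all_models):
    """Return sound event models with environment-likely ones searched first.

    Searching likely models first can reduce average search latency and, with
    early-exit matching, improve accuracy, as described in the text.
    """
    likely = [m for m in ENV_MODEL_PRIORITY.get(environment, []) if m in all_models]
    rest = [m for m in all_models if m not in likely]
    return likely + rest
```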

In some implementations in which device 110 includes beamforming unit 138, beamformed audio signal 148 may be used to improve the operation of audio event processing unit 134, audio event processing unit 154, or both. For example, beamformed audio signal 148 may be directed toward source 180 of sound 182, and may therefore enhance sound 182, attenuate or remove sounds from other sources or ambient noise, or a combination thereof. As a result, in implementations in which audio event processing unit 134 operates on beamformed audio signal 148, beamformed audio signal 148 may provide an improved representation of sound 182 as compared to audio signals 170, 172, which enables audio event processing unit 134 to determine audio event information 144 more accurately (e.g., by reducing the likelihood of misclassifying sound 182). Similarly, in implementations in which beamformed audio signal 148 is sent to device 120 and audio event processing unit 154 operates on beamformed audio signal 148, beamformed audio signal 148 enables improved performance of audio event processing unit 154.

In a particular implementation, device 120 includes audio event processing unit 154 and omits direction-of-arrival processing unit 152, acoustic environment processing unit 156, and beamforming unit 158 (or disables them or bypasses their operation). In this implementation, audio event processing unit 154 may operate using audio signals 170, 172, using beamformed audio signal 148, using audio signals 190, 192, or a combination thereof, as described above.

In another particular implementation, device 120 includes audio event processing unit 154 and direction-of-arrival processing unit 152, and omits acoustic environment processing unit 156 and beamforming unit 158 (or disables them or bypasses their operation). In this implementation, direction-of-arrival information 143 and audio event information 145 are generated at device 120 and may be provided to device 110 for the uses described above. Direction-of-arrival information 143 may be used to enhance (e.g., via increased accuracy, reduced latency, or both) the audio event detection that may be performed at audio event processing unit 154, in a manner similar to that described for direction-of-arrival information 142.

In some implementations in which device 120 also includes acoustic environment processing unit 156, environment information 147 may be used to improve the performance of audio event processing unit 134, audio event processing unit 154, or both, in a manner similar to that described for environment information 146. In some implementations in which device 120 includes beamforming unit 158, the beamformed audio signals generated by beamforming unit 158 may be used to improve the operation of audio event processing unit 154 in a manner similar to that described for beamformed audio signal 148.

The techniques described with reference to FIG. 1 enable each device 110, 120 to perform directional context-aware processing based on the audio signals 170, 172 generated by microphones 102, 104, the audio signals 190, 192 generated by microphones 106, 108, or a combination thereof. As a result, each device 110, 120 can detect context for different use cases and can determine characteristics associated with the surrounding environment. As non-limiting examples, these techniques enable each device 110, 120 to distinguish one or more moving sound sources (e.g., a siren, a bird, etc.), one or more stationary sound sources (e.g., a television, a loudspeaker, etc.), or a combination thereof.

It should be understood that the techniques described with respect to FIG. 1 can enable multi-channel or single-channel audio context detection to distinguish different sounds based on direction of arrival. According to one implementation, microphones 102, 104, 106, and 108 may be included in a microphone array having microphones located at different positions in a building, such as a house. In a scenario in which a person falls to the floor, if the microphones of the microphone array are connected to a mobile device (such as device 120), then, using the techniques described herein, the mobile device can use the direction-of-arrival information to determine where the sound came from, determine the context of the sound, and perform an appropriate action (e.g., notify a caregiver).
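Under illustrative assumptions about the home layout, the fall-detection scenario above could map direction-of-arrival estimates to rooms and select an action. The sector boundaries, class names, and action strings below are hypothetical, not from the patent:

```python
# Hypothetical room layout: azimuth sectors (degrees, relative to the
# mobile device) mapped to rooms. Boundaries are illustrative only.
ROOM_SECTORS = {
    (0, 90): "kitchen",
    (90, 180): "bedroom",
    (180, 270): "living_room",
    (270, 360): "bathroom",
}

def handle_audio_event(audio_class, doa_deg):
    """Locate a detected event by its direction of arrival and pick an action."""
    azimuth = doa_deg % 360
    room = next(
        (name for (lo, hi), name in ROOM_SECTORS.items() if lo <= azimuth < hi),
        "unknown",
    )
    if audio_class == "fall":
        # The appropriate action from the text: notify a caregiver
        return f"notify_caregiver: fall detected in {room}"
    return f"log: {audio_class} in {room}"
```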

Referring to FIG. 2, another particular illustrative aspect of a system configured to perform directional processing of multiple audio signals received from multiple microphones is disclosed and generally designated 200. System 200 includes one or more processors 202. The one or more processors 202 may be integrated into device 110 or device 120. For example, the one or more processors 202 may correspond to the one or more processors 116 or the one or more processors 126.

The one or more processors 202 optionally include an audio input 204 configured to receive audio data 278 (such as audio data 178 of FIG. 1) and to output audio frames 274, 276. The one or more processors 202 include a first processing domain 210 and a second processing domain 220. The first processing domain 210 may correspond to a low-power domain operating in a low-power state, such as an "always-on" power domain. The first processing domain 210 may remain active to process audio frame 274 and audio frame 276. In some implementations, audio frames 274 and 276 correspond to audio frames 174 and 176, respectively. In another implementation, audio frames 274 and 276 correspond to audio frames 194 and 196, respectively. The second processing domain 220 may correspond to a high-power domain that transitions between an idle state and a high-power state.

The first processing domain 210 includes an audio preprocessing unit 230. The audio preprocessing unit 230 may consume relatively little power as compared to one or more components in the second processing domain 220. The audio preprocessing unit 230 may process audio frames 274, 276 to determine whether any audio activity is present. According to some implementations, the audio preprocessing unit 230 may receive and process audio frames from a single microphone to save additional power. For example, in some implementations, audio frame 276 may not be provided to the first processing domain 210, and the audio preprocessing unit 230 may determine whether audio activity is present in audio frame 274.

If the audio preprocessing unit 230 determines that audio activity is present in audio frame 274 or in both audio frames 274, 276, the audio preprocessing unit 230 may generate an activation signal 252 to transition the second processing domain 220 from the idle state to the high-power state. According to some implementations, the audio preprocessing unit 230 may determine preliminary direction information 250 about the audio activity and provide the preliminary direction information 250 to the second processing domain 220. For example, if audio activity is present in audio frame 274 and little or no audio activity is present in audio frame 276, the preliminary direction information 250 may indicate that sound 182 originates near the microphone that captured the audio signal corresponding to audio frame 274.

The second processing domain 220 includes a direction-of-arrival processing unit 232, an audio event processing unit 234, an acoustic environment processing unit 236, a beamforming unit 238, or a combination thereof. The direction-of-arrival processing unit 232 may correspond to the direction-of-arrival processing unit 132 of FIG. 1 or the direction-of-arrival processing unit 152 of FIG. 1 and may operate in a substantially similar manner. The audio event processing unit 234 may correspond to the audio event processing unit 134 of FIG. 1 or the audio event processing unit 154 of FIG. 1 and may operate in a substantially similar manner. The acoustic environment processing unit 236 may correspond to the acoustic environment processing unit 136 of FIG. 1 or the acoustic environment processing unit 156 of FIG. 1 and may operate in a substantially similar manner. The beamforming unit 238 may correspond to the beamforming unit 138 of FIG. 1 or the beamforming unit 158 of FIG. 1 and may operate in a substantially similar manner.

Accordingly, the second processing domain 220 may operate in different modes. For example, the second processing domain 220 may be used to activate different sensors, such as sensor 129 of FIG. 1. In addition, the second processing domain 220 may be used to perform direction-of-arrival processing and computation, beamforming, DRR operations, indoor/outdoor detection, source distance determination, and the like.

System 200 enables the first processing domain 210 to selectively activate the second processing domain 220 in response to detecting the presence of audio activity. As a result, by transitioning the second processing domain 220 (e.g., the high-power processing domain) to the idle state when no audio activity is detected using low-power processing, battery power can be conserved at a device such as a headset or a mobile phone.

Referring to FIG. 3, another particular illustrative aspect of a system configured to perform directional processing of multiple audio signals received from multiple microphones is disclosed and generally designated 300. System 300 includes a headset 310 and a mobile phone 320. The headset 310 may correspond to device 110, and the mobile phone 320 may correspond to device 120.

The headset 310 includes an audio processing unit 330, an audio zoom unit 332, an optional user prompt generation unit 334, or a combination thereof. The audio processing unit 330 includes the direction-of-arrival processing unit 132 and the audio event processing unit 134. As described with reference to FIG. 1, the direction-of-arrival processing unit 132 may generate direction-of-arrival information 142 indicating the location (e.g., the direction) of the source 180 of the sound 182. The direction-of-arrival information 142 is provided to the audio zoom unit 332 and the user prompt generation unit 334. As described with reference to FIG. 1, the audio event processing unit 134 may generate audio event information 144 indicating that the sound 182 is associated with a vehicle sound. The audio event information 144 is provided to the user prompt generation unit 334.

The audio zoom unit 332 may also receive direction-of-arrival information 143 from the mobile phone 320. The audio zoom unit 332 may be configured to adjust a beamforming algorithm of the beamforming unit 138 based on the direction-of-arrival information 142 or the direction-of-arrival information 143. As a result, the audio zoom unit 332 may adjust the focus of the microphones 102, 104 to a sound of interest (e.g., the sound 182) and attenuate sounds from other directions. The headset 310 may thus generate a beamformed audio signal 148 focused on the sound 182 from the source 180 and provide the beamformed audio signal 148 to a speaker 336 for playback. In some implementations, playback of the beamformed audio signal 148 is performed at multiple speakers 336 (e.g., a left speaker for the user's left ear and a right speaker for the user's right ear) in a manner that preserves the directionality of the source 180 of the sound 182, so that the user perceives the focused sound 182 as originating from the direction of the source 180 (or, if distance information has been determined, from the location of the source 180).
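The patent does not state how playback at multiple speakers 336 preserves directionality. One simple possibility, shown here as a non-authoritative sketch, is constant-power panning of the mono beamformed signal based on the estimated azimuth; the azimuth convention and mapping are assumptions for illustration:

```python
import numpy as np

def pan_to_stereo(mono, azimuth_deg):
    """Constant-power pan of a mono signal so stereo playback preserves the
    perceived source direction.

    azimuth_deg: -90 (full left) .. 0 (center) .. +90 (full right)
    Returns a (2, n_samples) array: [left channel, right channel].
    """
    # Map azimuth to a pan angle in [0, pi/2]
    theta = (np.clip(azimuth_deg, -90.0, 90.0) + 90.0) / 180.0 * (np.pi / 2)
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    # cos^2 + sin^2 = 1, so total playback power is constant across directions
    return np.stack([left, right])
```

A fuller implementation might instead apply head-related transfer functions, but the constant-power gains already convey the source direction at the left and right speakers.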

The user prompt generation unit 334 may generate a user alert 350 that is provided to the speaker 336 for playback. For example, the user alert 350 may be audio indicating that a vehicle (e.g., the source 180) is approaching. The user prompt generation unit 334 may also generate one or more user alerts 352 that are provided to the mobile phone 320. The user alerts 352 may include text indicating that a vehicle is approaching, a vibration programmed to indicate that a vehicle is approaching, and the like.

Thus, the system 300 of FIG. 3 enables the headset 310 to focus (e.g., audio zoom) on the sound of interest 182 and can generate the user alerts 350, 352. To illustrate, in a scenario in which a user is wearing the headset 310, the system 300 can alert the user to surrounding events (such as an approaching vehicle) of which the user might otherwise be unaware.

Referring to FIG. 4, another particular illustrative aspect of a system configured to perform directional processing of multiple audio signals received from multiple microphones is disclosed and generally designated 400. System 400 includes a headset 410 and a mobile phone 420. The headset 410 may correspond to device 110, and the mobile phone 420 may correspond to device 120.

The headset 410 includes an audio processing unit 430 and optionally includes an audio zoom unit 432, a noise cancellation unit 434, one or more speakers 436, or a combination thereof. The audio processing unit 430 includes the direction-of-arrival processing unit 132 and the audio event processing unit 134. As described with reference to FIG. 1, the direction-of-arrival processing unit 132 may generate direction-of-arrival information indicating an approximate location of the source 180 of the sound 182. The direction-of-arrival processing unit 132 may also generate direction-of-arrival information indicating an approximate location of the source 184 of the sound 186. As described with reference to FIG. 1, the audio event processing unit 134 may generate audio event information indicating that the sound 182 is associated with a vehicle sound. The audio event processing unit 134 may also generate audio event information indicating that the sound 186 is associated with human speech.

The audio processing unit 430 may be configured to generate first sound information 440 that indicates the direction-of-arrival information associated with the sound 182 (e.g., a first output of the direction-of-arrival processing unit 132) and indicates that the sound 182 is associated with a vehicle (e.g., a first output of the audio event processing unit 134). The audio processing unit 430 may also be configured to generate second sound information 442 that indicates the direction-of-arrival information associated with the sound 186 (e.g., a second output of the direction-of-arrival processing unit 132) and indicates that the sound 186 is associated with human speech (e.g., a second output of the audio event processing unit 134). Optionally, the headset 410 may send audio signal data, such as one or more portions of the audio signals 170, 172 corresponding to the sounds 182, 186, to the mobile phone 420. The audio signal data may be included in the sound information 440, 442 or may be separate from the sound information 440, 442.

The mobile phone 420 includes a single-microphone audio context detection unit 450, an audio adjustment unit 452, and a mode controller 454. The first sound information 440 and the second sound information 442 are provided to the audio adjustment unit 452. According to some implementations, the single-microphone audio context detection unit 450 may provide additional context information 496 to the audio adjustment unit 452, such as the direction-of-arrival information 143 generated by the direction-of-arrival processing unit 152 of FIG. 1, the audio event information 145 generated by the audio event processing unit 154, the environment information 147 generated by the acoustic environment processing unit 156, or a combination thereof. For example, the single-microphone audio context detection unit 450 may process audio signal data received from the headset 410 (e.g., one or more portions of the audio signals 170, 172), audio signal data received from one or more microphones of the mobile phone 420 (e.g., the audio signals 190, 192), or a combination thereof.

The audio adjustment unit 452 may be configured to generate an audio zoom angle 460 and noise reduction parameters 462 based on the sound information 440, 442 from the audio processing unit 430. That is, based on the context information 496 from the single-microphone audio context detection unit 450, the audio adjustment unit 452 may determine the audio zoom angle 460 on which to focus for beamforming purposes, and may determine the noise reduction parameters 462 to reduce noise from other directions. Thus, based on the context information 496, if the audio adjustment unit 452 determines that focusing on the sound 182 should be prioritized, the audio zoom angle 460 may indicate an angle associated with the source 180, and the noise reduction parameters 462 may include parameters for reducing noise from the source 184. The audio zoom angle 460 is provided to the audio zoom unit 432, and the noise reduction parameters 462 are provided to the noise cancellation unit 434.
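The prioritization logic above can be sketched as ranking the reported sounds by class and steering the zoom angle at the winner while marking the other directions for noise reduction. The priority table and data shapes below are illustrative assumptions, not the patent's method:

```python
from dataclasses import dataclass

@dataclass
class SoundInfo:
    doa_deg: float      # direction of arrival, e.g. from sound information 440/442
    audio_class: str    # e.g. "vehicle", "speech"

# Hypothetical priority order; a higher value means more urgent for the listener.
CLASS_PRIORITY = {"vehicle": 2, "speech": 1}

def choose_zoom_and_noise(sounds):
    """Pick the zoom angle for the highest-priority sound and collect the
    directions of the remaining sounds as noise-reduction targets."""
    ranked = sorted(
        sounds,
        key=lambda s: CLASS_PRIORITY.get(s.audio_class, 0),
        reverse=True,
    )
    zoom_angle = ranked[0].doa_deg
    noise_reduction_angles = [s.doa_deg for s in ranked[1:]]
    return zoom_angle, noise_reduction_angles
```

In the street-crossing example, a vehicle sound outranks speech, so the zoom angle points at the vehicle and the speech direction is handed to the noise cancellation path.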

The audio adjustment unit 452 may also be configured to generate a mode signal 464 that is provided to the mode controller 454. The mode signal 464 may indicate whether a vibration alert, a text alert, a voice alert, or the like should be generated for the user of the mobile phone 420.

The audio zoom unit 432 may be configured to adjust the beamforming algorithm of a beamforming unit (such as the beamforming unit 138 of FIG. 1) based on the audio zoom angle 460. As a result, the audio zoom unit 432 can adjust the focus of the microphones 102, 104 toward the sound of interest (e.g., the sound 182). Based on the noise reduction parameters 462, the noise cancellation unit 434 may be configured to generate a noise reduction signal 490 that attenuates the sounds 186 arriving from other directions. The beamformed audio signal 148 and the noise reduction signal 490 may be provided to one or more speakers 436 for playback.
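The patent does not specify the beamforming algorithm used by the audio zoom unit. As a minimal illustrative sketch only, one common approach is a delay-and-sum beamformer steered toward the zoom angle: the inter-microphone delay for the chosen angle is compensated before summing, so sound from that direction adds coherently. All function and parameter names below are hypothetical, and the coarse whole-sample delay is a simplification of a practical fractional-delay implementation.

```python
import numpy as np

def audio_zoom(frames, mic_spacing_m, zoom_angle_deg, fs, c=343.0):
    """Steer a two-microphone delay-and-sum beamformer toward zoom_angle_deg.

    frames: (2, N) array of time-aligned samples from the two microphones.
    Returns the beamformed mono signal focused on the zoom angle.
    """
    # Far-field time difference of arrival for the chosen zoom angle
    # (90 degrees = broadside, i.e., zero delay between the microphones).
    tdoa = mic_spacing_m * np.cos(np.deg2rad(zoom_angle_deg)) / c
    shift = int(round(tdoa * fs))      # delay in whole samples (coarse)
    aligned = np.roll(frames[1], -shift)  # align mic 2 to mic 1
    return 0.5 * (frames[0] + aligned)    # sum reinforces the target direction
```

For a source arriving broadside (90 degrees), the two channels are already aligned and the output is simply their average; off-axis sounds add incoherently and are attenuated.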

The system 400 of FIG. 4 can analyze detected sound events and their corresponding directions of arrival to improve the listening experience. Based on the context information 496, the system 400 can determine which sound is of particular interest to the user. For example, if the user is crossing a street, the system 400 may determine that the sound 182 of vehicles is more important than the sound 186 of people talking. As a result, the system 400 can focus on the important sound 182 and suppress the other sounds.

Although the headset 410 is described as providing both focusing on the sound 182 and suppression of other sounds, it should be noted that each of the focusing on the sound 182 provided by the audio zoom unit 432 and the suppression of other sounds provided by the noise cancellation unit 434 provides the user of the headset 410 with an enhanced perception of the sound 182. For example, in an implementation in which the headset 410 includes the audio zoom unit 432 but omits the noise cancellation unit 434 (or bypasses operation of the noise cancellation unit 434), the sound 182 is enhanced via the audio zoom operation even in the absence of the noise reduction signal 490. As another example, in an implementation in which the headset 410 includes the noise cancellation unit 434 but omits the audio zoom unit 432 (or bypasses operation of the audio zoom unit 432), the sound 182 is enhanced relative to the other sounds via the noise reduction applied to those other sounds.

Referring to FIG. 5, another particular illustrative aspect of a system configured to perform directional processing on multiple audio signals received from multiple microphones is disclosed and generally designated 500. The system 500 includes a spatial filtering processing unit 502, an audio event processing unit 504, an application programming interface 506, and a voice user interface 508. According to one implementation, the system 500 may be integrated into the device 110 or the device 120.

The spatial filtering processing unit 502 may be configured to perform one or more spatial filtering operations on audio frames (shown as audio frames 574 and 576) associated with the received audio signals. In some implementations, the audio frames 574 and 576 correspond to the audio frames 174 and 176, respectively. In another implementation, the audio frames 574 and 576 correspond to the audio frames 194 and 196, respectively. As non-limiting examples, the spatial filtering processing unit 502 may perform adaptive beamforming on the audio frames 574, 576, perform an audio zoom operation on the audio frames 574, 576, perform a beamforming operation on the audio frames 574, 576, perform a null beamforming operation on the audio frames 574, 576, or a combination thereof.
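Among the spatial filtering operations listed above, null beamforming is the complement of audio zoom: instead of reinforcing a direction, it cancels one. A minimal two-microphone sketch (not the patent's implementation; names and the whole-sample delay approximation are illustrative assumptions) aligns the interfering direction across the channels and subtracts, placing a spatial null toward that direction while passing sound from others.

```python
import numpy as np

def null_beamform(frames, mic_spacing_m, null_angle_deg, fs, c=343.0):
    """Place a spatial null toward null_angle_deg with two microphones.

    frames: (2, N) array of time-aligned samples from the two microphones.
    Delaying channel 2 so the interferer is time-aligned with channel 1 and
    subtracting cancels sound from that direction.
    """
    tdoa = mic_spacing_m * np.cos(np.deg2rad(null_angle_deg)) / c
    shift = int(round(tdoa * fs))      # interferer delay in whole samples
    aligned = np.roll(frames[1], -shift)
    return 0.5 * (frames[0] - aligned)  # difference nulls the target direction
```

An interferer arriving exactly from the null direction is cancelled; a desired source from another direction survives with a direction-dependent gain.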

Based on the spatial filtering operations, the spatial filtering processing unit 502 may generate a plurality of outputs 510, 512, 514 and corresponding direction-of-arrival information 542 for each of the outputs 510, 512, 514. In the illustrative example of FIG. 5, the spatial filtering processing unit 502 may generate a speech content output 510 from the audio frames 574, 576, and two other outputs 512, 514 (e.g., audio from two other detected audio sources). The outputs 510, 512, 514 are provided to the audio event processing unit 504, and the direction-of-arrival information 542 for each of the outputs 510, 512, 514 is provided to the application programming interface 506.

The audio event processing unit 504 is configured to process each of the outputs 510, 512, 514 to determine audio event information 544 associated with the outputs 510, 512, 514. For example, the audio event processing unit 504 may indicate that the output 510 is associated with speech content, that the output 512 is associated with non-speech content, and that the output 514 is associated with non-speech content. The audio event processing unit 504 provides the speech content output 510 to the voice user interface 508 for playback to the user, and provides the audio event information 544 to the application programming interface 506.

The application programming interface 506 may be configured to provide the direction-of-arrival information 542 and the audio event information 544 to other applications or devices for further specialized processing, as described with reference to FIGS. 1-4.

FIG. 6 illustrates an implementation 600 of the device 110. The one or more processors 116 are configured to receive audio signals (shown as audio signals 170, 172) from multiple microphones. The one or more processors 116 are also configured to send, to a second device, data based on a class 612 of a sound that is represented in one or more of the audio signals 170, 172 and that is associated with an audio event. For example, the one or more processors 116 send an indication 616 of the class 612 to the second device (e.g., the device 120). In an illustrative example, the one or more processors 116 are integrated into a headset device, and the second device corresponds to a mobile phone. In another illustrative example, the one or more processors 116 are integrated in a vehicle.

The one or more processors 116 are configured to process signal data 602 at one or more classifiers 610 to determine the class 612 from among multiple supported classes 614 supported by the one or more classifiers 610. The signal data 602 corresponds to the audio signals 170, 172. For example, in some implementations, the one or more processors 116 are configured to perform a beamforming operation on the audio signals 170, 172 (e.g., at the beamforming unit 138) to generate the signal data 602, which may correspond to the beamformed audio signal 148. Alternatively or additionally, the one or more processors 116 are configured to determine one or more features of the audio signals 170, 172 for inclusion in the signal data 602. Alternatively or additionally, the signal data 602 includes the audio signals 170, 172.

According to some aspects, the one or more classifiers 610 include one or more neural networks configured to process the signal data 602 and generate an output (e.g., a one-shot output) indicating that the class 612 is more closely associated with the audio event than the remaining classes of the multiple supported classes 614. The class 612 is sent to the second device via the indication 616. In some examples, the indication 616 includes a bit configuration, a number, or another indicator of the class 612. In other examples, the indication 616 includes a textual name, label, or other descriptor that enables the class 612 to be identified by the second device. In some implementations, the one or more classifiers 610 correspond to (or are included in) the audio event processing unit 134 of FIG. 1, and the indication 616 corresponds to (or is included in) the audio event information 144.

Optionally, the one or more processors 116 are further configured to process image data at the one or more classifiers 610 to determine the class 612. For example, the device 110 may optionally include one or more cameras configured to generate the image data, or may receive the image data from another device (e.g., via a modem). The class 612 may correspond to an object (e.g., a sound source) that is represented in the image data and associated with the audio event. For example, in some implementations, the one or more processors 116 may generate the direction-of-arrival information 142 based on the audio signals 170, 172 (or receive the direction-of-arrival information 143 from the second device) and use the direction-of-arrival information 142 or 143 to locate the object corresponding to the sound source in the image data. In implementations in which the one or more classifiers 610 process image data in addition to audio data, the image data may be included in the signal data 602 or provided to the one or more classifiers 610 as a separate input.

In some implementations, the multiple supported classes 614 include an "unknown" class indicating that the audio event fails to correspond, within a confidence threshold, to any of the other supported classes 614. In one example, the one or more classifiers 610 compute, for each of the multiple supported classes 614, a probability that the audio event corresponds to that particular class. If none of the computed probabilities exceeds a threshold amount, the one or more classifiers 610 designate the class 612 as the "unknown" class.
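The thresholding logic described above can be sketched compactly. This is an illustrative assumption about one way to implement it (the function name, threshold value, and dictionary representation of per-class probabilities are not from the patent):

```python
def classify_with_unknown(probabilities, threshold=0.6):
    """Return the most probable class, or 'unknown' when no class's
    probability clears the confidence threshold.

    probabilities: dict mapping class name -> probability for that class.
    """
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= threshold else "unknown"
```

For example, a confident siren detection returns "siren", while a near-uniform distribution over classes falls back to "unknown".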

In some implementations, the one or more processors 116 are configured to process the audio signals 170, 172 to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals, and the class 612 is associated with the direction-of-arrival information. For example, the direction-of-arrival information and the class 612 correspond to the same sound in the audio signals 170, 172. To illustrate, the one or more processors 116 may optionally include the direction-of-arrival processing unit 132 of FIG. 1. The one or more processors 116 may be configured to send data based on the direction-of-arrival information to the second device. In one example, the data based on the direction-of-arrival information includes a report indicating at least one detected event and the direction of the detected event.

According to various implementations, the device 110 may optionally include one or more additional elements or aspects previously described with reference to FIG. 1. For example, the one or more processors may be configured to perform spatial processing on the audio signals based on the direction-of-arrival information to generate one or more beamformed audio signals, and may send the one or more beamformed audio signals to the second device. To illustrate, the one or more processors 116 may optionally include the beamforming unit 138 of FIG. 1. In another example, the one or more processors 116 may be configured to generate, based on an acoustic environment detection operation, environment data corresponding to the detected environment. To illustrate, the one or more processors 116 may optionally include the acoustic environment processing unit 136 of FIG. 1.

In another example, the one or more processors 116 may be configured to send a representation of the audio signals 170, 172 to the second device. In some implementations, the representation of the audio signals 170, 172 corresponds to one or more beamformed audio signals, such as the beamformed audio signal 148. In another example, the one or more processors 116 may be configured to receive, from the second device, direction information associated with the audio signals and to perform an audio zoom operation based on the direction information, as described with reference to FIGS. 3 and 4.

By sending the indication 616 of the class 612 corresponding to the sound represented in the audio signals 170, 172, the device 110 provides information that can be used by the second device to improve the accuracy of audio event processing at the second device, as further described with reference to FIG. 9.

FIG. 7 illustrates an implementation 700 of the device 110. As compared to the implementation 600, the implementation 700 includes one or more encoders 710 and omits the one or more classifiers 610. The signal data 602 is processed by the one or more encoders 710 to generate an embedding 712 corresponding to a sound that is represented in one or more of the audio signals 170, 172 and that is associated with an audio event. The one or more processors 116 are also configured to send data based on the embedding 712 to the second device. In one example, the one or more processors 116 send an indication 716 of the embedding 712 to the second device.

According to some aspects, the one or more encoders 710 include one or more neural networks configured to process the signal data 602 to generate the embedding 712 of the sound. The embedding 712 represents a "signature" of the sound that includes enough information about the sound's various characteristics to enable the sound to be detected in other audio signals, but may not include enough information to enable the sound to be reproduced from the embedding 712 alone. According to some aspects, the embedding 712 may correspond to a user's voice or to a particular sound from the environment (such as a dog's bark), and the embedding 712 may be used to detect, and amplify or extract, other instances of the sound that may occur in other audio data, as further described with reference to FIG. 11. In some implementations, the one or more encoders 710 correspond to (or are included in) the audio event processing unit 134 of FIG. 1, and the indication 716 corresponds to (or is included in) the audio event information 144.
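The patent does not specify how a signature embedding is matched against new audio. As a minimal illustrative sketch (an assumption, not the patent's method), a common approach is to embed each incoming frame with the same encoder and compare it to the stored signature by cosine similarity, declaring a detection when the similarity clears a threshold:

```python
import numpy as np

def matches_signature(frame_embedding, signature, threshold=0.8):
    """Detect whether a frame contains the sound whose signature embedding
    is `signature`, via cosine similarity between the two embeddings.

    Both arguments are fixed-length embedding vectors from the same encoder.
    """
    a = np.asarray(frame_embedding, dtype=float)
    b = np.asarray(signature, dtype=float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return bool(cos >= threshold)
```

This captures the asymmetry noted above: the embedding suffices to recognize the sound elsewhere, but the waveform cannot be reconstructed from the vector alone.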

In some implementations, the one or more processors 116 are configured to process the audio signals 170, 172 to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals, and the embedding 712 is associated with the direction-of-arrival information. In one example, the direction-of-arrival information and the embedding 712 correspond to the same sound in the audio signals 170, 172. To illustrate, the one or more processors 116 may optionally include the direction-of-arrival processing unit 132 of FIG. 1. The one or more processors 116 may be configured to send data based on the direction-of-arrival information to the second device.

Optionally, the one or more processors 116 are further configured to process image data at the one or more encoders 710 to generate the embedding 712. For example, the device 110 may optionally include one or more cameras configured to generate the image data, or may receive the image data from another device (e.g., via a modem). The embedding 712 may correspond to an object (e.g., a sound source) that is represented in the image data and associated with the audio event. For example, in some implementations, the one or more processors 116 may generate the direction-of-arrival information 142 based on the audio signals 170, 172 (or receive the direction-of-arrival information 143 from the second device) and use the direction-of-arrival information 142 or 143 to locate the object corresponding to the sound source in the image data. In implementations in which the one or more encoders 710 process image data in addition to audio data, the image data may be included in the signal data 602 or provided to the one or more encoders 710 as a separate input.

FIG. 8 illustrates an implementation 800 of the device 110 that includes the one or more classifiers 610 of FIG. 6 and also includes the one or more encoders 710 of FIG. 7. The signal data 602 (or one or more portions of the signal data 602) is processed by the one or more classifiers 610 to determine the class 612, and the signal data 602 (or one or more portions of the signal data 602) is processed by the one or more encoders 710 to generate the embedding 712. The one or more processors 116 are also configured to send data based on the class 612, the embedding 712, or both, to the second device. For example, the indication 616 of the class 612, the indication 716 of the embedding 712, or both, may correspond to data that is sent to, or included in data that is sent to, the audio event processing unit 134 of the device 120 of FIG. 1.

FIG. 9 illustrates an implementation 900 of the device 120 (e.g., a second device) that includes the one or more processors 126. The one or more processors 126 include the audio event processing unit 154 and are configured to receive, from a first device (e.g., the device 110), an indication 902 of an audio class corresponding to an audio event. In some examples, the indication 902 corresponds to the indication 616 of FIG. 6 or FIG. 8, which indicates the class 612 detected at the one or more classifiers 610 of the device 110. In some implementations, the one or more processors 126 are coupled to a memory (e.g., the memory 124) and integrated into a mobile phone, and the first device corresponds to a headset device. In another implementation, the memory and the one or more processors 126 are integrated into a vehicle.

Optionally, the one or more processors 126 include one or more classifiers 920, which may correspond to (or be included in) the audio event processing unit 154. According to one aspect, the one or more classifiers 920 are more powerful and more accurate than the classifier(s) of the first device that generated the indication 902, such as described with reference to the audio event processing unit 154 of FIG. 1. The one or more processors 126 may be configured to also receive audio data 904 representing the sound associated with the audio event. In some implementations, as illustrative, non-limiting examples, the audio data 904 may correspond to the audio signals 170, 172 from the first device, the beamformed audio signal 148 from the first device, the audio signals 190, 192, or a combination thereof. The one or more processors 126 may be configured to process the audio data 904 at the one or more classifiers 920 to verify that the indication 902 is correct (such as by comparing the indication 902 to a classification 922 determined by the one or more classifiers 920). The classification 922 may be selected, from among multiple supported classes 924, as the audio class that best corresponds to the audio event detected in the audio data 904.

In some implementations, verifying the indication 902, or verifying the class indicated by the indication 902, includes determining whether the class indicated by the indication 902 matches the class determined by the one or more classifiers 920 (e.g., the classification 922). Alternatively or additionally, verifying the indication 902, or verifying the class indicated by the indication 902, includes determining that the class determined by the one or more classifiers 920 is a specific instance or subclass of the class indicated by the indication 902. For example, an indication 902 corresponding to the class "vehicle event" may be verified by the one or more classifiers 920 determining that the classification 922 corresponds to "car engine," "motorcycle engine," "brake sound," "car horn," "motorcycle horn," "train horn," "vehicle collision," or the like (which may be classified as different types of vehicle events).
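The verification rule described above (exact match, or a refinement into a subclass) can be sketched with a small class taxonomy. The taxonomy table and function names below are illustrative assumptions; only the "vehicle event" subclasses come from the example in the text:

```python
# Illustrative subclass taxonomy; the "vehicle event" entries mirror the
# examples given in the description.
SUBCLASSES = {
    "vehicle event": {
        "car engine", "motorcycle engine", "brake sound", "car horn",
        "motorcycle horn", "train horn", "vehicle collision",
    },
}

def verify_indication(indicated_class, determined_class):
    """Accept the first device's indication when the second classifier
    agrees exactly or refines it into a known subclass."""
    if determined_class == indicated_class:
        return True
    return determined_class in SUBCLASSES.get(indicated_class, set())
```

So a coarse "vehicle event" indication is verified by a finer "car horn" classification, but contradicted by, say, "dog bark".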

According to some aspects, the accuracy of the one or more classifiers 920 is improved by providing the one or more classifiers 920 with other information related to the audio event in addition to the audio data 904. For example, the one or more processors 126 may optionally be configured to provide the audio data 904 and the indication 902 of the audio class as inputs to the one or more classifiers 920 to determine the classification 922 associated with the audio data 904. In the implementation 900, the audio data 904 includes one or more beamformed signals 910 (e.g., the beamformed audio signal 148) that are input to the one or more classifiers 920. In another example, the one or more processors 126 may optionally be configured to receive, from the first device, direction data 912 (e.g., the direction-of-arrival information 142) corresponding to the sound source, and to provide the audio data 904, the direction data 912, and the indication 902 of the audio class as inputs to the one or more classifiers 920 to determine the classification 922 associated with the audio data 904.

Optionally, the one or more processors 126 are configured to generate one or more outputs in addition to the audio event information 145, in place of the audio event information 145, or for inclusion in the audio event information 145, such as a notification 930, a control signal 932, a classifier output 934, or a combination thereof. For example, in an implementation in which the audio class (e.g., the classification 922) corresponds to a vehicle event (e.g., a collision), the one or more processors 126 may send a notification 930 of the vehicle event to one or more third devices based on the location of the first device (e.g., the device 110) and the locations of the one or more third devices, such as further described with reference to FIGS. 14 and 15. In another example, a user of the device 120 may be participating in an outdoor activity (such as hiking along a trail), and the audio class (e.g., the classification 922) corresponds to a safety-related event (such as an animal growl). In this example, the one or more processors 126 may send a notification 930 of the safety-related event to one or more third devices (such as other hikers' phones or headsets) determined to be nearby based on location data associated with the one or more third devices.

In another example, the control signal 932 is sent to the first device based on the classifier output 934. To illustrate, the classifier output 934 may include a bit pattern, a numeric indicator, or a textual label or description indicating the classification 922 determined by the one or more classifiers 920. In an illustrative example, the control signal 932 instructs the first device to perform an audio zoom operation. In another example, the control signal 932 instructs the first device to perform spatial processing based on the direction of the sound source. In another example, the control signal 932 instructs the first device to change its operating mode, such as transitioning from a media playback mode (e.g., playing streaming audio to the user of the first device) to a transparency mode (e.g., enabling the user of the first device to hear ambient sounds).

Optionally, the one or more processors 126 are configured to perform one or more operations associated with tracking sources of directional audio sounds in an audio scene, as further explained with reference to FIG. 16. In one example, the one or more processors 126 may receive the direction data 912 corresponding to a sound source detected by the first device. Based on the audio event, the one or more processors 126 may update a map of directional sound sources in the audio scene to generate an updated map. The one or more processors 126 may send data corresponding to the updated map to one or more third devices that are geographically remote from the first device. As illustrative, non-limiting examples, the one or more third devices may use the updated map to notify their users of sound sources detected near the first device, or to provide a shared audio experience to users participating in a shared virtual environment (e.g., in a virtual conference room).
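The map update step above can be sketched as a simple upsert keyed by source, associating each detected source with its direction and verified class. The data structure and names are hypothetical; the patent does not define the map's representation:

```python
def update_scene_map(scene_map, source_id, direction_deg, audio_class):
    """Upsert one detected source into the shared audio-scene map.

    scene_map: dict mapping source_id -> {"direction_deg", "class"}.
    Returns the updated map, which can then be serialized and sent to
    geographically remote devices.
    """
    scene_map[source_id] = {
        "direction_deg": direction_deg,
        "class": audio_class,
    }
    return scene_map
```

Each new audio event either adds a source to the map or refreshes an existing source's direction and class before the updated map is distributed.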

FIG. 10 illustrates another implementation 1000 of the device 120. As compared to the implementation 900 of FIG. 9, the audio event processing unit 154 (e.g., the one or more classifiers 920) receives a multichannel audio signal 1002 as input rather than the beamformed signals 910. For example, the multichannel audio signal 1002 may include the audio signals 170, 172 received in the audio data 904, the audio signals 190, 192 received from the microphones 106, 108, or a combination thereof. The multichannel audio signal 1002 may be provided as input to the one or more classifiers 920 in combination with the indication 902, the direction data 912, or both.

For example, in some circumstances beamformed data is unavailable, such as when an audio event is detected but the directionality of the audio event cannot be determined with sufficient accuracy (e.g., the sound is largely diffuse or non-directional, or is masked by other sounds that interfere with beamforming). Examples of processing based on whether audio signals or beamformed signals are transmitted between the devices are described with reference to FIGS. 12 and 13.

FIG. 11 illustrates an implementation 1100 of the device 120 and a diagram 1150 representing audio processing that may be performed at the device 120. The one or more processors 126 include a content separator 1120 that is configured to separate a foreground signal from a background signal in audio content based on an embedding corresponding to an audio signal.

The content separator 1120 may include an audio generation network 1122 that is configured to receive one or more embeddings 1104 corresponding to one or more signatures of particular sounds. For example, the one or more embeddings 1104 may correspond to or include the embedding 712 of FIG. 7. In some examples, the one or more embeddings 1104 may include signatures of one or more audio events, a signature of a particular person's voice, or the like. The audio generation network 1122 is also configured to receive audio data (shown as an input mixture waveform 1102) that may include background and foreground sounds from various sound sources. The audio generation network 1122 is configured to determine whether the input mixture waveform 1102 includes any sounds corresponding to the one or more embeddings 1104, and to extract, isolate, or remove those particular sounds.

A target output 1106 is generated by the content separator 1120. The target output 1106 may include an audio signal corresponding to particular sounds. For example, the particular sounds corresponding to the one or more embeddings 1104 may be isolated from the remaining sounds in the input mixed waveform 1102 to generate the target output 1106. In one example, the particular sounds may correspond to foreground sounds in the input mixed waveform 1102, and the target output 1106 may include the foreground sounds with the background removed or attenuated.

In another example, the target output 1106 corresponds to a modified version of the input mixed waveform 1102 and may include the sounds that are represented in the input mixed waveform 1102 and that remain after the particular sounds are removed (or attenuated). For example, the particular sounds may correspond to foreground sounds in the input mixed waveform 1102, and the target output 1106 may include the background sounds that remain in the input mixed waveform 1102 after the foreground sounds are removed (or attenuated).

In another example, the target output 1106 may include an audio signal containing a particular sound as a foreground sound that has been removed from the background sounds of the input mixed waveform 1102 and added to a different set of background sounds.

In the diagram 1150, a first foreground sound (FG1) 1154, a second foreground sound (FG2) 1156, and a third foreground sound (FG3) 1158 are illustrated in an audio scene 1151 that includes a first ambience 1152 (e.g., a background). A foreground extraction operation 1160 is performed by the content separator 1120, using a first embedding of the one or more embeddings 1104 for the first foreground sound 1154, a second embedding of the one or more embeddings 1104 for the second foreground sound 1156, and a third embedding of the one or more embeddings 1104 for the third foreground sound 1158, to isolate the foreground sounds 1154, 1156, 1158 from the first ambience 1152 (illustrated as isolated foreground sounds 1162). A scene generation operation 1164 adds the foreground sounds 1154, 1156, 1158 to an audio scene 1171 (e.g., an updated audio scene) having a second ambience 1172. The scene generation operation 1164 may be performed by the audio generation network 1122, the content separator 1120, the one or more processors 126, or a combination thereof.
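A minimal numeric sketch of the foreground extraction operation 1160 and the scene generation operation 1164, treating signals as lists of samples and assuming the ambience can be estimated and subtracted directly (a real separator would rely on the audio generation network 1122 rather than plain subtraction):

```python
def extract_foreground(mixture, ambience_estimate):
    # Foreground extraction: remove the estimated ambience from the mixture,
    # leaving the isolated foreground samples.
    return [m - a for m, a in zip(mixture, ambience_estimate)]

def generate_scene(foregrounds, new_ambience):
    # Scene generation: sum the isolated foreground signals onto a new
    # ambience bed to produce the updated scene.
    scene = list(new_ambience)
    for fg in foregrounds:
        scene = [s + f for s, f in zip(scene, fg)]
    return scene
```

The same pattern extends to the three foreground sounds 1154, 1156, 1158 by passing three extracted lists to `generate_scene`.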

In one example, the input mixed waveform 1102 represents audio data corresponding to the audio scene 1151, which is processed by the one or more processors 126 to generate adjusted audio data (e.g., the target output 1106 including the isolated foreground sounds 1162), and the adjusted data is adjusted again by the one or more processors 126 (e.g., via the scene generation operation 1164) to generate an updated audio scene (e.g., the audio scene 1171). The audio scene 1171 may include direction information associated with various objects and audio events (e.g., audio and events associated with other participants in a shared audio scene), as further described with reference to FIGS. 16-18.

The content separator 1120, including the audio generation network 1122, can isolate any target sound from a background and is not limited to separating speech from noise. In some implementations, the content separator 1120 using the audio generation network 1122 achieves single-microphone target separation of particular audio events, speech, etc., and can overcome limitations associated with conventional techniques that are unable to distinguish between audio sources.

FIG. 12 illustrates a flowchart corresponding to a method 1200 that may be performed by a first device, such as the device 110 (e.g., the one or more processors 116), in connection with sending information to a second device, such as the device 120.

The method 1200 includes processing one or more frames of audio signals, at block 1202. For example, the audio data 178 (e.g., frames of the audio signals 170, 172) may be processed at the direction-of-arrival processing unit 132, the audio event processing unit 134, the acoustic environment processing unit 136, the unit 138, or a combination thereof, as described with reference to FIG. 1.

The method 1200 includes determining, at block 1204, whether the processing of the one or more frames of the audio signals results in an environment detection. In some examples, an environment detection may include determining that an environmental change has been detected. In response to determining that an environment detection has occurred, the method 1200 includes sending environment information to the second device, at block 1206. For example, the device 110 sends the environment information 146 to the device 120.

In response to determining, at block 1204, that no environment detection has occurred, or after sending the environment information at block 1206, the method 1200 includes determining, at block 1208, whether the processing of the one or more frames of the audio signals results in detection of an audio event. In response to determining that an audio event is detected, the method 1200 includes sending audio event information to the second device, at block 1210. For example, the device 110 sends the audio event information 144 to the device 120.

In addition, in response to determining that an audio event is detected, the method 1200 includes determining, at block 1212, whether valid direction-of-arrival information is available. For example, valid direction-of-arrival information may correspond to detection of a sound source having a direction of arrival determined with a confidence level that is above a confidence threshold, to distinguish discrete sound sources from diffuse sounds that have no distinguishable source. In a particular implementation, valid direction-of-arrival information being available for a sound represented in the one or more audio signals indicates that the sound originates from an identifiable direction (e.g., from a discrete sound source), and valid direction-of-arrival information being unavailable for the sound indicates that the sound does not originate from an identifiable direction. In response to determining, at block 1212, that valid direction-of-arrival information is available, the method 1200 includes sending the direction-of-arrival information to the second device, at block 1214. For example, the device 110 sends the direction-of-arrival information 142 to the device 120.
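The validity test at block 1212 can be sketched as a simple confidence gate; the tuple layout and the 0.7 threshold below are assumptions for illustration, since the disclosure does not fix a specific value:

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed value; the disclosure does not fix one

def doa_is_valid(doa_estimate):
    # doa_estimate: (azimuth_degrees, confidence), or None when the sound is
    # diffuse and no estimate could be formed at all.
    if doa_estimate is None:
        return False
    _, confidence = doa_estimate
    return confidence > CONFIDENCE_THRESHOLD
```

Only estimates that pass this gate would be sent to the second device as direction-of-arrival information.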

In response to determining, at block 1208, that no audio event is detected, determining, at block 1212, that no valid direction-of-arrival information is available, or after sending the direction-of-arrival information to the second device at block 1214, the method 1200 proceeds to determining, at block 1222, whether to send one or more audio signals (e.g., the audio signals 170, 172), one or more beamformed signals (e.g., the beamformed audio signal 148), or no audio signals to the second device.

FIG. 12 illustrates several optional decision operations that, in some implementations, may be used at block 1220 to determine whether to send one or more audio signals, one or more beamformed signals, or no audio signals to the second device.

At block 1230, a determination is made as to whether at least one of an environment detection or an audio event detection has occurred. In response to determining that no environment detection has occurred and that no audio event has been detected, the method 1200 determines, at block 1240, that no audio is to be sent to the second device. Thus, in this example, when there is no environment detection and no audio event, the first device (e.g., the device 110) does not transmit audio information to the second device (e.g., the device 120) for additional processing.

Otherwise, in response to determining that at least one of an environment detection or an audio event detection has occurred, the method 1200 includes determining, at block 1232, whether an amount of power or bandwidth available for transmission to the second device is limited. For example, if the first device has an available battery level that is below a power threshold, or if an amount of transmission bandwidth available for sending audio data to the second device is below a transmission threshold, the first device may determine to conserve resources associated with transmitting audio data to the second device. Otherwise, the first device may operate in a default (e.g., non-conserving) mode.

In response to determining, at block 1232, that neither power nor transmission bandwidth is limited, the method 1200 includes sending the audio signals to the second device, at block 1248. For example, the device 110 may send the audio signals 170, 172 to the device 120.

Otherwise, in response to determining, at block 1232, that at least one of power or transmission bandwidth is limited, the method 1200 includes determining, at block 1234, whether microphones at the second device are available to capture audio data. For example, the microphones at the second device may be deemed unavailable when the microphones (e.g., the microphones 106, 108) are covered or obstructed (such as in a user's pocket or bag), or are located too far away to capture substantially the same audio information as the microphones at the first device.

In response to determining, at block 1234, that the microphones at the second device are available for use, the method 1200 includes determining, at block 1236, whether a beamformed audio signal is available. For example, when an environment detection occurs based on diffuse ambient sound rather than sound from a particular source whose direction can be localized, no beamforming operation is performed at the first device. As another example, when an audio event is detected but the direction of the sound source corresponding to the audio event cannot be determined with a confidence that is greater than a threshold confidence, no valid beamformed signal is generated at the first device.

In response to determining, at block 1236, that no beamformed audio signal is available, the method 1200 determines, at block 1240, that no audio data is to be sent to the second device. Otherwise, when it is determined at block 1236 that a beamformed audio signal is available, the method 1200 proceeds to block 1242, at which either a beamformed signal or no signal is sent to the second device. For example, because power resources or transmission resources are limited but microphones are available at the second device for audio capture and analysis, the first device may determine that no audio is to be sent to the second device, and the second device may capture audio for analysis at the second device. Alternatively, although power resources or transmission resources are limited and microphones are available for audio capture at the second device, the first device may determine to send a beamformed audio signal to the second device. In particular implementations, the decision at block 1242 of whether to send a beamformed signal or no signal may be based, at least in part, on the amount of power or bandwidth available for transmission of the beamformed signal (e.g., a comparison to one or more bandwidth thresholds or power thresholds may be performed to determine whether to send one or more beamformed audio signals).

Returning to block 1234, in response to determining that the microphones of the second device are unavailable, the method 1200 determines, at block 1238, whether one or more beamformed audio signals are available. In response to one or more beamformed audio signals being available, the method 1200 includes sending the one or more beamformed audio signals, at block 1244. Otherwise, in response to determining, at block 1238, that no beamformed audio signals are available, the method 1200 includes sending a reduced signal to the second device, at block 1246. For example, sending a reduced signal may include sending audio corresponding to a reduced number of microphone channels (e.g., sending a single one of the audio signals 170 or 172), sending a reduced-resolution version of one or more microphone channels (e.g., lower-resolution versions of one or more of the audio signals 170, 172), or sending extracted audio feature data (e.g., feature data, such as spectral information, extracted from one or both of the audio signals 170, 172), which can provide useful information to the second device with reduced power and bandwidth usage as compared to sending the full audio signals 170, 172.
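Taken together, the decision operations of blocks 1230-1248 can be sketched as a single function; the string return values are illustrative labels, not terms from the disclosure:

```python
def select_transmission(event_detected, env_detected, resources_limited,
                        remote_mic_available, beamformed_available):
    # Sketch of blocks 1230-1248: what the first device sends to the second.
    if not (event_detected or env_detected):          # block 1230
        return "none"                                 # block 1240
    if not resources_limited:                         # block 1232
        return "audio_signals"                        # block 1248
    if remote_mic_available:                          # block 1234
        if not beamformed_available:                  # block 1236
            return "none"                             # block 1240
        return "beamformed_or_none"                   # block 1242
    if beamformed_available:                          # block 1238
        return "beamformed"                           # block 1244
    return "reduced"                                  # block 1246
```

The "beamformed_or_none" branch reflects that block 1242 leaves the final choice to the power/bandwidth comparison described above.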

FIG. 13 illustrates a flowchart corresponding to a method 1300 that may be performed by a second device, such as the device 120 (e.g., the one or more processors 126), in connection with receiving information from a first device, such as the device 110.

The method 1300 includes receiving a data transmission from the first device, at block 1302. The method 1300 includes determining, at block 1304, whether the transmission includes audio signal data. As an example, the second device may parse the received data to determine whether one or more audio signals (e.g., the audio signals 170, 172, the one or more beamformed signals 148, or a combination thereof) have been received.

If the transmission does not include audio signal data, the method 1300 optionally includes determining, at block 1304, whether one or more microphones of the second device are available for audio capture. For example, the microphones at the second device may be deemed unavailable when the microphones of the second device (e.g., the microphones 106, 108) are covered or obstructed (such as in a user's pocket or bag), or are located too far away to capture substantially the same audio information as the microphones at the first device.

In response to determining, at block 1304, that the one or more microphones are unavailable, the method 1300 optionally includes sending a signal to the first device indicating that the microphones are unavailable, at 1306, and the method ends, at 1308. Otherwise, when the one or more microphones are available, the method 1300 optionally includes performing a data capture operation at the second device to capture audio signals, at block 1310.

The method 1300 optionally includes determining, at block 1312, whether the transmission includes environment data. As an example, the device 120 may parse the received data to determine whether the environment information 146 has been received. In response to the transmission including environment data, the method 1300 optionally includes performing environment processing, at 1314. For example, the device 120 may process the audio signals 170, 172, 190, 192, or a combination thereof, at the acoustic environment processing unit 156 to generate the environment information 147.

The method 1300 includes determining, at block 1320, whether the transmission includes audio event data. As an example, the device 120 may parse the received data to determine whether the audio event information 144 has been received. If the transmission does not include audio event data, processing of the data received in the transmission ends, at 1322. In response to the transmission including audio event data, the method 1300 optionally includes determining, at block 1330, whether the transmission includes direction-of-arrival data. As an example, the device 120 may parse the received data to determine whether the direction-of-arrival information 142 has been received. In response to the transmission not including direction-of-arrival data, the method 1300 optionally includes performing direction-of-arrival processing to generate direction-of-arrival data, at 1332. For example, the device 120 may process the audio signals 170, 172, 190, 192, or a combination thereof, at the direction-of-arrival processing unit 152 to generate the direction-of-arrival information 143. However, if the transmission includes direction-of-arrival data, the direction-of-arrival processing of block 1332 is bypassed. Thus, the second device may selectively bypass direction-of-arrival processing of received audio data corresponding to an audio event based on whether direction-of-arrival information is received from the first device.

When the transmission includes direction-of-arrival information at block 1330, or after the direction-of-arrival information is generated at block 1332, the method 1300 optionally includes determining, at block 1340, whether the transmission includes beamforming data. As an example, the device 120 may parse the received data to determine whether the beamformed audio signal 148 has been received. In response to the transmission not including beamforming data, the method 1300 optionally includes performing a beamforming operation to generate beamforming data, at 1342. For example, the device 120 may process the audio signals 170, 172, 190, 192, or a combination thereof, at the beamforming unit 158 to generate the beamformed audio signal 149. However, if the transmission includes beamforming data, performance of the beamforming operation of block 1342 is bypassed. Thus, the second device may selectively bypass the beamforming operation based on whether the received audio data corresponds to multi-channel microphone signals from the first device or to a beamformed signal from the first device.
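The receiver-side bypass logic of blocks 1330-1350 can be sketched as follows, with the transmission modeled as a dictionary whose keys are illustrative rather than a format defined by the disclosure:

```python
def process_audio_event_transmission(transmission):
    # Return the list of processing steps the second device actually runs;
    # steps are skipped when the first device already supplied their output.
    steps = []
    if not transmission.get("audio_event"):
        return steps                        # block 1322: nothing to process
    if "doa" not in transmission:
        steps.append("doa_processing")      # block 1332 (otherwise bypassed)
    if "beamformed" not in transmission:
        steps.append("beamforming")         # block 1342 (otherwise bypassed)
    steps.append("audio_event_processing")  # block 1350 always runs
    return steps
```

Each bypassed step corresponds to work the first device already performed, which is the source of the power and latency savings noted below.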

When the transmission includes beamforming data at block 1340, or after the beamforming data is generated at block 1342, the method 1300 includes performing audio event processing, at block 1350. For example, the device 120 may process the audio signals 170, 172, 190, 192, or a combination thereof, at the audio event processing unit 154 to generate the audio event information 145.

By selectively bypassing one or more operations (such as direction-of-arrival processing or beamforming operations), the method 1300 achieves reduced power consumption, reduced latency, or both, associated with processing the audio event data received from the first device.

Referring to FIG. 14, a particular illustrative aspect of a system configured to perform directional processing on multiple audio signals received from multiple microphones is disclosed and generally designated 1400. The system 1400 includes a vehicle 1410 coupled to a first microphone 1402 and a second microphone 1404. Although two microphones 1402, 1404 are illustrated, in other implementations additional microphones may be coupled to the vehicle 1410. As a non-limiting example, eight (8) microphones may be coupled to the vehicle 1410. In some implementations, the microphones 1402, 1404 are directional microphones. In other implementations, one or both of the microphones 1402, 1404 are omnidirectional microphones.

According to some implementations, the vehicle 1410 may be an autonomous vehicle. That is, the vehicle 1410 may navigate without user interaction. According to other implementations, the vehicle 1410 may include one or more user-assistance modes (e.g., obstacle detection, obstacle avoidance, lane keeping, speed control, etc.) and, in some examples, may switch between a user-assistance mode and an autonomous mode. The system 1400 also includes a device 1420. According to one implementation, the device 1420 includes a second vehicle. According to another implementation, the device 1420 includes a server. As described below, the vehicle 1410 may communicate wirelessly with the device 1420 to perform one or more operations (such as autonomous navigation) based on sounds detected at the vehicle 1410. In a particular implementation, the vehicle 1410 corresponds to the device 110, and the device 1420 corresponds to the device 120.

The first microphone 1402 is configured to capture sound 1482 from one or more sources 1480. In the illustrative example of FIG. 14, the source 1480 corresponds to another vehicle, such as an automobile. It should be understood, however, that a vehicle is merely a non-limiting example of a sound source, and the techniques described herein may be implemented with other sound sources. Upon capturing the sound 1482 from the source 1480, the first microphone 1402 is configured to generate an audio signal 1470 representing the captured sound 1482. In a similar manner, the second microphone 1404 is configured to capture the sound 1482 from the one or more sources 1480. Upon capturing the sound 1482 from the source 1480, the second microphone 1404 is configured to generate an audio signal 1472 representing the captured sound 1482.

The first microphone 1402 and the second microphone 1404 may have different positions on the vehicle 1410, different orientations, or both. As a result, the microphones 1402, 1404 may capture the sound 1482 at different times, with different received phases, or both. To illustrate, if the first microphone 1402 is closer to the source 1480 than the second microphone 1404, the first microphone 1402 may capture the sound 1482 before the second microphone 1404 captures the sound 1482. As described below, if the positions and orientations of the microphones 1402, 1404 are known, the audio signals 1470, 1472 generated by the microphones 1402, 1404, respectively, may be used to perform directional processing. That is, the vehicle 1410 may use the audio signals 1470, 1472 to determine the relative position of the source 1480, to determine the direction of arrival of the sound 1482, and so on.

The vehicle 1410 includes a first input interface 1411, a second input interface 1412, a memory 1414, and one or more processors 1416. The first input interface 1411 is coupled to the one or more processors 1416 and is configured to be coupled to the first microphone 1402. The first input interface 1411 is configured to receive the audio signal 1470 from the first microphone 1402 (e.g., a first microphone output) and may provide the audio signal 1470 to the processors 1416 as audio frames 1474. The second input interface 1412 is coupled to the one or more processors 1416 and is configured to be coupled to the second microphone 1404. The second input interface 1412 is configured to receive the audio signal 1472 from the second microphone 1404 (e.g., a second microphone output) and may provide the audio signal 1472 to the processors 1416 as audio frames 1476. The audio signals 1470, 1472, the audio frames 1474, 1476, or both, may also be referred to herein as audio data 1478.

The one or more processors 1416 include a direction-of-arrival processing unit 1432 and optionally include an audio event processing unit 1434, a report generator 1436, a navigation instruction generator 1438, or a combination thereof. According to one implementation, one or more components of the one or more processors 1416 may be implemented using dedicated circuitry. As non-limiting examples, one or more components of the one or more processors 1416 may be implemented using an FPGA, an ASIC, etc. According to another implementation, one or more components of the one or more processors 1416 may be implemented by executing instructions 1415 stored in the memory 1414. For example, the memory 1414 may be a non-transitory computer-readable medium that stores the instructions 1415, which are executable by the one or more processors 1416 to perform the operations described herein.

The direction-of-arrival processing unit 1432 may be configured to process the multiple audio signals 1470, 1472 to generate direction-of-arrival information 1442 corresponding to the source 1480 of the sound 1482 represented in the audio signals 1470, 1472. In some implementations, the direction-of-arrival processing unit 1432 is configured to operate in a manner similar to the direction-of-arrival processing unit 132 of FIG. 1. In an illustrative, non-limiting example, the direction-of-arrival processing unit 1432 may select audio frames 1474, 1476 generated from each microphone 1402, 1404 that represent similar sounds, such as the sound 1482 from the source 1480. For example, the direction-of-arrival processing unit 1432 may process the audio frames 1474, 1476 to compare sound characteristics and to ensure that the audio frames 1474, 1476 represent the same instance of the sound 1482. In response to determining that the audio frames 1474, 1476 represent the same instance of the sound 1482, the direction-of-arrival processing unit 1432 may compare the timestamps of the audio frames 1474, 1476 to determine which microphone 1402, 1404 captured the corresponding instance of the sound 1482 first. If the audio frames 1474 have an earlier timestamp than the audio frames 1476, the direction-of-arrival processing unit 1432 may generate direction-of-arrival information 1442 indicating that the source 1480 is closer to the first microphone 1402. If the audio frames 1476 have an earlier timestamp than the audio frames 1474, the direction-of-arrival processing unit 1432 may generate direction-of-arrival information 1442 indicating that the source 1480 is closer to the second microphone 1404. Thus, based on the timestamps of similar audio frames 1474, 1476, the direction-of-arrival processing unit 1432 can localize the sound 1482 and the corresponding source 1480. Timestamps of audio frames from additional microphones may be used to refine the localization in a similar manner.
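The timestamp comparison described above can be sketched as follows, where each frame is modeled as a dictionary with illustrative `mic` and `timestamp` keys (the disclosure does not prescribe a frame format):

```python
def closer_microphone(frame_a, frame_b):
    # Given two frames known to represent the same instance of a sound,
    # the earlier capture timestamp identifies the microphone that heard
    # the sound first, i.e., the microphone nearer the source.
    if frame_a["timestamp"] < frame_b["timestamp"]:
        return frame_a["mic"]
    return frame_b["mic"]
```

With more than two microphones, the full ordering of timestamps narrows the source location further.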

In some implementations, one or more other techniques may be used to determine the direction-of-arrival information 1442 instead of, or in addition to, the time difference described above, such as measuring the phase difference of the sound 1482 received at each microphone of a microphone array of the vehicle 1410 (e.g., the microphones 1402 and 1404). In some implementations, the microphones 1402, 1404 may operate as, or be included in, a microphone array, and the direction-of-arrival information 1442 is generated based on characteristics of the sound from each microphone of the microphone array (such as time of arrival or phase) and based on the relative positions and orientations of the microphones in the microphone array. In such implementations, information about the sound characteristics or the captured audio data may be transmitted between the vehicle 1410 and the device 1420 for direction-of-arrival detection.
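As a hedged sketch of the time-difference/phase-difference alternative described above: for a two-microphone array with known spacing, a measured inter-microphone delay maps to an arrival angle through the path-length difference between the microphones. The speed-of-sound constant, the far-field assumption, and the function name below are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate in air at room temperature

def doa_from_delay(delay_s, mic_spacing_m):
    """Convert an inter-microphone time difference of arrival to an angle.

    Returns the arrival angle in degrees relative to the array broadside
    (0 degrees = source equidistant from both microphones), assuming a
    far-field source. The delay is positive when the sound reaches the
    first microphone earlier.
    """
    # The path-length difference between the microphones is c * delay,
    # and sin(angle) = path difference / microphone spacing.
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))
```

The same relationship applies when the delay is derived from a measured phase difference at a known frequency rather than from timestamps.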

The audio event processing unit 1434 may be configured to process the multiple audio signals 1470, 1472 to perform audio event detection in a similar manner as the audio event processing unit 134. To illustrate, the audio event processing unit 1434 may process the sound characteristics of the audio frames 1474, 1476 and compare the sound characteristics to a plurality of audio event models to determine whether an audio event has occurred. For example, the audio event processing unit 1434 may access a database (not shown) that includes models of different audio events (such as car horns, train horns, pedestrians talking, etc.). In response to the sound characteristics matching (or substantially matching) a particular model, the audio event processing unit 1434 may generate audio event information 1444 indicating that the sound 1482 represents the audio event associated with the particular model. As a non-limiting example, the audio event may correspond to the sound of an approaching vehicle (e.g., the source 1480).
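The model-matching step described above can be sketched as a nearest-model lookup over stored reference feature vectors. The feature representation, the similarity measure, and the threshold below are all assumptions for illustration; the actual event models could equally be trained classifiers.

```python
def _similarity(u, v):
    """Cosine similarity between a frame's features and a reference vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def detect_audio_event(frame_features, event_models, match_threshold=0.85):
    """Compare a frame's sound characteristics against stored event models.

    event_models maps an event label (e.g. 'car_horn') to a reference
    feature vector. Returns the best-matching label when the similarity
    meets the threshold (a "substantial match"), otherwise None.
    """
    best_label, best_score = None, 0.0
    for label, reference in event_models.items():
        score = _similarity(frame_features, reference)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= match_threshold else None
```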

報告產生器1436可以被配置為基於到達方向資訊1442和音訊事件資訊1444來產生報告1446。因此,報告1446可以指示至少一個偵測到的事件和偵測到的事件的方向。在麥克風1402、1404從各個方向擷取到多個聲音的場景中,報告1446可以指示在一段時間內偵測到的事件和偵測到的事件的方向資訊的清單。Report generator 1436 may be configured to generate report 1446 based on direction of arrival information 1442 and audio event information 1444 . Accordingly, report 1446 may indicate at least one detected event and a direction of the detected event. In scenarios where the microphones 1402, 1404 pick up multiple sounds from various directions, the report 1446 may indicate a list of detected events and direction information for the detected events over a period of time.
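A minimal sketch of such a report, assuming each detected event is paired with its direction-of-arrival estimate and a timestamp (the tuple layout and field names are illustrative assumptions):

```python
def generate_report(events):
    """Combine detected events with their direction-of-arrival information.

    events is a list of (event_label, direction_degrees, timestamp) tuples
    collected over a period of time; the report lists them in time order,
    pairing each detected event with its direction.
    """
    return sorted(
        ({"event": e, "direction_deg": d, "time": t} for e, d, t in events),
        key=lambda entry: entry["time"],
    )
```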

The processor 1416 may be configured to send the report 1446 to the device 1420. According to one implementation, based on the report 1446, the device 1420 may send navigation instructions 1458 to the vehicle 1410. Upon receiving the navigation instructions 1458 from the device 1420, the processor 1416 may navigate (e.g., autonomously navigate) the vehicle 1410 based on the navigation instructions 1458. Alternatively or additionally, the navigation instructions 1458 may be provided to an operator of the vehicle 1410, such as visual or audible alerts or instructions to adjust operation of the vehicle 1410. In some examples, the navigation instructions 1458 indicate a path for the vehicle 1410 to take (e.g., pull over to one side when it is safe to do so to allow an emergency vehicle to pass). In some examples, the navigation instructions 1458 inform the vehicle 1410 of the path of one or more other vehicles (e.g., a vehicle ahead has detected an accident and is about to slow down). The processor 1416 may autonomously navigate the vehicle 1410 to change its path (e.g., change route or change speed) to account for the paths of the one or more other vehicles.

According to another implementation, based on the report 1446 or independently of the report 1446, the device 1420 may send a second report 1456 to the vehicle 1410. In response to receiving the second report 1456, according to one implementation, the processor 1416 may navigate (e.g., autonomously navigate) the vehicle 1410 based on the report 1446 and the second report 1456. According to another implementation, in response to receiving the second report 1456, the navigation instruction generator 1438 may be configured to generate navigation instructions 1448 to be used by the processor 1416 to navigate the vehicle 1410. In some examples, the second report 1456 indicates an event detected by another vehicle (e.g., a vehicle ahead has detected a sound indicating an accident). The navigation instruction generator 1438 may generate the navigation instructions 1448 to autonomously navigate the vehicle 1410 to change its travel path to avoid the location of the event or to change speed (e.g., slow down). The processor 1416 may also send the navigation instructions 1448 to the device 1420 to inform the device 1420 of the path of the vehicle 1410. In some examples, the navigation instructions 1448 indicate a path (e.g., a route or speed) suggested for one or more other vehicles to take. For example, the navigation instructions 1448 may indicate that the vehicle 1410 is slowing down and suggest that any vehicle within 20 feet of the vehicle 1410 slow down or change route.

Optionally, the device 1420 may send a notification 1492 of an audio event (e.g., a vehicle collision) to one or more other devices 1490 based on the location of the vehicle 1410 and the locations of the one or more other devices 1490. In one example, the notification 1492 corresponds to the notification 930 of FIG. 9. As an illustrative, non-limiting example, the one or more devices 1490 may include, or be incorporated in, one or more other vehicles that are determined to be near, or approaching, the location of the vehicle 1410, to notify those vehicles of one or more detected audio events (e.g., a siren, a collision, etc.) in the vicinity of the vehicle 1410.

The system 1400 of FIG. 14 enables the vehicle 1410 to detect external sounds, such as sirens, and to navigate accordingly. It should be appreciated that using multiple microphones enables the location and relative distance of a siren (e.g., the source 1480) to be determined, and the location and relative distance can be displayed as the detected siren approaches or recedes.

FIG. 15 illustrates a particular illustrative aspect of a system 1500 that includes a vehicle 1510 (e.g., a first device) in communication with a device 1520 (e.g., a second device). The vehicle 1510 includes the input interfaces 1412, 1411, the memory 1414, and the one or more processors 1416 of FIG. 14. In a particular implementation, the vehicle 1510 corresponds to the device 110, and the device 1520 corresponds to the device 120.

The one or more processors 1416 include an implementation of the audio event processing unit 1434 in which the generated audio event information 1444 indicates that the detected audio event corresponds to a vehicle event 1502 and an audio category 1504 associated with the vehicle event 1502. For example, the audio event processing unit 1434 may include one or more classifiers (such as the one or more classifiers 610 of FIG. 6) configured to process the audio data 1478 to determine the audio category 1504 corresponding to the sound 1482 that is represented by the audio data 1478 and associated with the vehicle event 1502.

The one or more processors 1416 are configured to send, to the device 1520, audio data 1550 representing the sound associated with the vehicle event 1502. For example, the audio data 1550 may include the audio data 1478, the audio signals 1470, 1472, one or more beamformed audio signals directed at the source 1480 of the sound 1482, or a combination thereof. The one or more processors 1416 are also configured to send, to the device 1520, an indication 1552 that the audio data 1550 corresponds to the audio category 1504 associated with the vehicle event 1502. For example, the indication 1552 may correspond to the indication 616 of FIG. 6 or FIG. 8.

The device 1520 includes a memory 1514 configured to store instructions 1515 and also includes one or more processors 1516 coupled to the memory 1514. The one or more processors 1516 are configured to receive, from the vehicle 1510 (e.g., the first device), the audio data 1550 representing the sound 1482 and the indication 1552 that the audio data 1550 corresponds to the audio category 1504 associated with the vehicle event 1502. In particular implementations, the device 1520 corresponds to another vehicle, a server, or a distributed computing (e.g., cloud-based) system, as non-limiting examples.

The one or more processors 1516 are also configured to process the audio data 1550 at one or more classifiers 1530 to verify that the sound 1482 represented by the audio data 1550 corresponds to the vehicle event 1502. For example, in a particular implementation, the one or more classifiers 1530 correspond to the one or more classifiers 920 of FIG. 9. The one or more processors 1516 are configured to send a notification 1492 of the vehicle event 1502 to one or more devices 1490 (e.g., one or more third devices) based on the location of the vehicle 1510 (e.g., the first device) and the locations of the one or more devices 1490.

FIG. 16 illustrates a particular implementation of the device 120 (e.g., a second device) in which the one or more processors 126 are configured to update a map 1614 of directional sound sources based on an audio event detected by a first device (e.g., the device 110).

The one or more processors 126 include the audio event processing unit 154, a map updater 1612, and an audio scene renderer 1618. The one or more processors 126 are configured to perform one or more operations associated with tracking directional audio sound sources in an audio scene. In one example, the one or more processors 126 may receive, from the first device, an indication 1602 of an audio category corresponding to an audio event (such as the indication 616 of FIG. 6) and direction data 1604 corresponding to a sound source associated with the audio event (such as the direction-of-arrival information 142).

The one or more processors 126 may update the map 1614 of directional sound sources in the audio scene based on the audio event to generate an updated map 1616. For example, when the audio event corresponds to a newly detected audio event, the map updater 1612 is configured to insert information corresponding to the audio event into the map 1614 to generate the updated map 1616. The inserted information may include information such as the location of the sound source associated with the audio event, an indication of the type of the audio event (e.g., the audio category corresponding to the audio event), and audio associated with the audio event (e.g., a link to audio signal data representing the sound).
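The map update described above can be sketched as inserting an entry carrying the named fields into a collection of directional sources. The map and entry structures are illustrative assumptions, not the patented data format.

```python
def update_sound_source_map(sound_map, event):
    """Insert a newly detected audio event into a map of directional sources.

    sound_map is a list of entries; event is a dict carrying the fields the
    text above names: the sound-source location, the audio category, and a
    link to the audio signal data. Returns an updated copy, leaving the
    original map unchanged so both versions remain available.
    """
    entry = {
        "location": event["location"],          # position of the sound source
        "category": event["category"],          # type (audio class) of the event
        "audio_link": event.get("audio_link"),  # link to audio signal data
    }
    return sound_map + [entry]
```

Returning a new list rather than mutating in place mirrors the distinction the text draws between the map 1614 and the updated map 1616.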

Optionally, the one or more processors 126 may send data 1660 corresponding to the updated map 1616 to one or more third devices (shown as devices 1670, 1672, and 1674) that are geographically remote from the first device. The data 1660 enables each of the devices 1670, 1672, and 1674 to update its local copy of the map 1614, so that a user of the device 1670, 1672, or 1674 can be notified of, access, or experience the sounds associated with the audio event.

In some implementations, the map 1614 (and the updated map 1616) corresponds to a database (such as a "crowdsourced" database) of audio events and locations distributed over a geographic area, such as to notify a vehicle when a collision has been detected nearby or to update vehicle navigation instructions to avoid particular audio events, as described with reference to FIGS. 14 and 15. In other implementations, the map 1614 (and the updated map 1616) may be used in other applications, such as providing a map of sound events detected in a neighborhood, town, city, or the like. For example, a map of audio events associated with crime (e.g., gunshots, shouting, sirens, breaking glass, etc.) may be used by law enforcement to plan resource allocation or to detect incidents worth investigating. As another example, a map of audio events may be associated with nature. For example, bird enthusiasts may use a map of various types of birds that have been localized based on detection and classification of their particular calls.

In some implementations, the audio scene renderer 1618 is configured to generate, based on the updated map 1616, sound data corresponding to a three-dimensional sound scene for playback to a user of the first device. For example, the first device may correspond to an audio headset worn by the user (such as described with reference to FIG. 21) or a virtual reality, augmented reality, or mixed reality headset (such as described with reference to FIG. 25).

FIG. 17 illustrates a graphical example of a 3D audio map 1700 of an audio scene around a user 1702 wearing a headset. The 3D audio map 1700 may correspond to the map 1614 (or the updated map 1616) of FIG. 16. The 3D audio map 1700 includes a first vehicle 1710 moving in a direction generally toward the user 1702 and a second vehicle 1712 also moving in a direction generally toward the user. (The direction of movement of a moving audio source is indicated by an arrow.) Other sound sources include a barking dog 1714, a person talking 1716, a crosswalk timer 1718 counting down the time remaining to cross the street, and an artificial sound 1720 that has been edited into the 3D audio map 1700. For example, the sound sources 1710-1718 may be real-world sound sources detected via microphones of the headset worn by the user 1702, and the artificial sound 1720 may be added by an augmented reality engine (or a game engine) at a particular location in the sound scene, such as a sound effect associated with a store or restaurant at that location (e.g., a commercial jingle).

FIG. 18 illustrates an example of a directional audio scene 1802 captured with sound event and environment class detection (such as based on the map 1614 (or the updated map 1616) of FIG. 16). A user 1804 is at the center of the directional audio scene 1802, and multiple sets of virtual (or actual) speakers associated with the sound field of the directional audio scene 1802 are illustrated, including a first representative speaker 1810 of a first speaker set located generally above and below the user 1804, a second representative speaker 1812 of a second speaker set placed along the upper and lower perimeters of the directional audio scene 1802, and a third representative speaker 1814 of a third speaker set located around the user 1804 at approximately head height.

In a particular implementation, an operation 1820 (e.g., updating the map 1614 to add or remove sound events based on type, direction, etc.) results in an updated directional audio scene 1830 that includes multiple virtual participants 1832, 1834 in addition to the user 1804. For example, the virtual participants 1832, 1834 may correspond to remote users sharing information about their respective local sound fields, which can be combined with the directional audio scene 1802 to produce an immersive shared virtual experience for the user 1804 and the various participants 1832, 1834. Such a shared virtual experience may be used in applications such as live travel channel guides or live meeting, party, or event immersion for people who are unable to attend in person due to social, health, or other constraints.

FIG. 19 illustrates an implementation 1900 of at least one of the devices 110, 120 as an integrated circuit 1902 that includes directional audio signal processing circuitry. For example, the integrated circuit 1902 includes one or more processors 1916. The one or more processors 1916 may correspond to the one or more processors 116, the one or more processors 126, the one or more processors 202 of FIG. 2, the processing circuitry described with reference to FIGS. 3-5, the one or more processors 1416, the one or more processors 1516, or a combination thereof. The one or more processors 1916 include a directional audio signal processing unit 1990. The directional audio signal processing unit 1990 may include at least one component of the processor 116, at least one component of the processor 126, at least one component of the processor 202, at least one component of the headset 310, at least one component of the headset 410, at least one component of the mobile phone 420, at least one component of the system 500, at least one component of the processor 1416, at least one component of the processor 1516, or a combination thereof.

The integrated circuit 1902 also includes an audio input 1904 (such as one or more bus interfaces) to enable audio data 178 to be received for processing. The integrated circuit 1902 also includes a signal output 1906 (such as a bus interface) to enable sending directional audio signal data 1992. The directional audio signal data 1992 may correspond to at least one of: the direction-of-arrival information 142, 143, the audio event information 144, 145, the environment information 146, 147, the beamformed audio signals 148, 149, the direction information 250, the first sound information 440, the second sound information 442, the context information 496, the audio zoom angle 460, the noise reduction parameters 462, the direction-of-arrival information 542, the audio event information 544, the indication 616, the indication 716, the notification 930, the control signal 932, the classifier output 934, the target output 1106, the reports 1446, 1456, the navigation instructions 1448, 1458, the notification 1492, the indication 1552, the audio data 1550, the data 1660, or a combination thereof.

The integrated circuit 1902 enables implementation of directional audio signal processing as a component in a system that includes microphones, such as a mobile phone or tablet as shown in FIG. 20, a headset as shown in FIG. 21, a wearable electronic device as shown in FIG. 22, a voice-controlled speaker system as shown in FIG. 23, a camera as shown in FIG. 24, a virtual reality headset, mixed reality headset, or augmented reality headset as shown in FIG. 25, augmented reality glasses or mixed reality glasses as shown in FIG. 26, a pair of in-ear devices as shown in FIG. 27, or a vehicle as shown in FIG. 28 or FIG. 29.

As an illustrative, non-limiting example, FIG. 20 illustrates an implementation 2000 in which the device 120 is a mobile device 2002, such as a phone or tablet. The mobile device 2002 includes the third microphone 106 positioned to primarily capture a user's voice, the one or more fourth microphones 108 positioned to primarily capture ambient sounds, and a display screen 2004. The directional audio signal processing unit 1990 is integrated in the mobile device 2002 and is shown using dashed lines to indicate internal components that are not generally visible to a user of the mobile device 2002. In a particular example, the directional audio signal processing unit 1990 may operate to generate directional audio signal data 1992 that is then processed to perform one or more operations at the mobile device 2002, such as launching a graphical user interface or otherwise displaying other information associated with a detected audio event at the display screen 2004 (e.g., via an integrated "smart assistant" application).

FIG. 21 illustrates an implementation 2100 in which the device 110 is a headset device 2102. The headset device 2102 includes the first microphone 102 positioned to primarily capture a user's voice and the one or more second microphones 104 positioned to primarily capture ambient sounds. The directional audio signal processing unit 1990 is integrated in the headset device 2102. In a particular example, the directional audio signal processing unit 1990 may operate to generate directional audio signal data 1992, which may cause the headset device 2102 to perform one or more operations at the headset device 2102, to transmit the directional audio signal data 1992 to a second device (not shown) for further processing, or a combination thereof. The headset device 2102 may be configured to provide an audible notification of a detected audio event or environment to a wearer of the headset device 2102 (such as based on the audio event information 144, the audio event information 145, the environment information 146, the environment information 147, or a combination thereof).

FIG. 22 illustrates an implementation 2200 in which at least one of the devices 110, 120 is a wearable electronic device 2202 (shown as a "smart watch"). The directional audio signal processing unit 1990, the first microphone 102, and the one or more second microphones 104 are integrated into the wearable electronic device 2202. In a particular example, the directional audio signal processing unit 1990 may operate to generate directional audio signal data 1992 that is then processed to perform one or more operations at the wearable electronic device 2202, such as launching a graphical user interface or otherwise displaying other information associated with a detected audio event at a display screen 2204 of the wearable electronic device 2202. To illustrate, the display screen 2204 of the wearable electronic device 2202 may be configured to display a notification based on speech detected by the wearable electronic device 2202. In a particular example, the wearable electronic device 2202 includes a haptic device that provides a haptic notification (e.g., a vibration) in response to detection of an audio event. For example, the haptic notification may prompt the user to look at the wearable electronic device 2202 to see a displayed notification of the detected audio event or environment (such as based on the audio event information 144, the audio event information 145, the environment information 146, the environment information 147, or a combination thereof). The wearable electronic device 2202 can thus alert a user with a hearing impairment, or a user wearing a headset, that particular audio activity has been detected.

FIG. 23 illustrates an implementation 2300 in which at least one of the devices 110, 120 is a wireless speaker and voice-activated device 2302. The wireless speaker and voice-activated device 2302 can have a wireless network connection and is configured to perform assistant operations. The directional audio signal processing unit 1990, the first microphone 102, the one or more second microphones 104, the third microphone 106, the fourth microphone 108, or a combination thereof are included in the wireless speaker and voice-activated device 2302. The wireless speaker and voice-activated device 2302 also includes a speaker 2304. In a particular aspect, the speaker 2304 corresponds to the speaker 336 of FIG. 3, the speaker 436 of FIG. 4, or both. During operation, the directional audio signal processing unit 1990 may operate to generate directional audio signal data 1992 and to determine whether a keyword was spoken. In response to determining that a keyword was spoken, the wireless speaker and voice-activated device 2302 can execute assistant operations (such as via execution of an integrated assistant application). The assistant operations can include adjusting a temperature, playing music, turning on lights, etc. For example, the assistant operations are performed in response to receiving a command after a keyword or key phrase (e.g., "hello assistant").

FIG. 24 illustrates an implementation 2400 in which at least one of the devices 110, 120 is a portable electronic device corresponding to a camera device 2402. The directional audio signal processing unit 1990, the first microphone 102, the one or more second microphones 104, or a combination thereof are included in the camera device 2402. During operation, the directional audio signal processing unit 1990 may operate to generate directional audio signal data 1992 and to determine whether a keyword was spoken. As an illustrative example, in response to determining that a keyword was spoken, the camera device 2402 can execute operations responsive to spoken user commands, such as adjusting image or video capture settings, image or video playback settings, or image or video capture instructions.

FIG. 25 illustrates an implementation 2500 in which the device 110 includes a portable electronic device corresponding to an extended reality ("XR") headset 2502, such as a virtual reality ("VR"), augmented reality ("AR"), or mixed reality ("MR") headset device. The directional audio signal processing unit 1990, the first microphone 102, the one or more second microphones 104, or a combination thereof are integrated into the headset 2502. In a particular aspect, the headset 2502 includes the first microphone 102 positioned to primarily capture a user's voice and the second microphone 104 positioned to primarily capture ambient sounds. The directional audio signal processing unit 1990 may operate to generate directional audio signal data 1992 based on audio signals received from the first microphone 102 and the second microphone 104 of the headset 2502. A visual peripheral device is positioned in front of the user's eyes to enable display of augmented reality or virtual reality images or scenes to the user while the headset 2502 is worn. In a particular example, the visual peripheral device is configured to display a notification indicating user speech detected in the audio signals. In a particular example, the visual peripheral device is configured to display a notification, indicating a detected audio event, superimposed on displayed content (e.g., in a virtual reality application) or on the user's field of view (e.g., in an augmented reality application) to visually indicate to the user the location of the sound source associated with the audio event. To illustrate, the visual peripheral device may be configured to display a notification of a detected audio event or environment (such as based on the audio event information 144, the audio event information 145, the environment information 146, the environment information 147, or a combination thereof).

FIG. 26 illustrates an implementation 2600 in which the device 110 includes a portable electronic device corresponding to augmented reality or mixed reality glasses 2602. The glasses 2602 include a holographic projection unit 2604 configured to project visual data onto a surface of a lens 2606, or to reflect the visual data off the surface of the lens 2606 onto the wearer's retina. The directional audio signal processing unit 1990, the first microphone 102, the one or more second microphones 104, or a combination thereof are integrated into the glasses 2602. The directional audio signal processing unit 1990 may be used to generate the directional audio signal data 1992 based on audio signals received from the first microphone 102 and the second microphone 104. In a particular example, the holographic projection unit 2604 is configured to display a notification indicating user speech detected in the audio signal. In a particular example, the holographic projection unit 2604 is configured to display a notification indicating a detected audio event. For example, the notification may be superimposed on the user's field of view at a particular position that coincides with the location of the sound source associated with the audio event. To illustrate, the sound may be perceived by the user as emanating from the direction of the notification. In an illustrative implementation, the holographic projection unit 2604 is configured to display a notification of a detected audio event or environment, such as based on the audio event information 144, the audio event information 145, the environment information 146, the environment information 147, or a combination thereof.

FIG. 27 illustrates an implementation 2700 in which the device 110 includes a portable electronic device corresponding to a pair of earbuds 2706 including a first earbud 2702 and a second earbud 2704. Although earbuds are described, it should be understood that the present techniques may be applied to other in-ear or on-ear playback devices.

The first earbud 2702 includes a first microphone 2720, such as a high signal-to-noise-ratio microphone positioned to capture the speech of a wearer of the first earbud 2702; an array of one or more other microphones configured to detect ambient sound and spatially distributed to support beamforming (illustrated as microphones 2722A, 2722B, and 2722C); an "inner" microphone 2724 proximate to the wearer's ear canal (e.g., to assist with active noise cancellation); and a self-speech microphone 2726, such as a bone conduction microphone configured to convert sound vibrations of the wearer's ear bone or skull into an audio signal.

In a particular implementation, the first microphone 2720 corresponds to the microphone 102, the microphones 2722A, 2722B, and 2722C correspond to multiple instances of the microphone 104, and audio signals generated by the microphones 2720, 2722A, 2722B, and 2722C are provided to the directional audio signal processing unit 1990. The directional audio signal processing unit 1990 may be used to generate the directional audio signal data 1992 based on the audio signals. In some implementations, the directional audio signal processing unit 1990 may also be configured to process audio signals from one or more other microphones of the first earbud 2702, such as the inner microphone 2724, the self-speech microphone 2726, or both.

The second earbud 2704 may be configured in a substantially similar manner as the first earbud 2702. In some implementations, the directional audio signal processing unit 1990 of the first earbud 2702 is also configured to receive one or more audio signals generated by one or more microphones of the second earbud 2704, such as via wireless transmission between the earbuds 2702, 2704 or, in implementations in which the earbuds 2702, 2704 are coupled via a transmission line, via wired transmission. In other implementations, the second earbud 2704 also includes a directional audio signal processing unit 1990, enabling a user wearing a single one of the earbuds 2702, 2704 to perform the techniques described herein.

In some implementations, the earbuds 2702, 2704 are configured to automatically switch between various operating modes, such as a passthrough mode in which ambient sound is played via a speaker 2730, a playback mode in which non-ambient sound (e.g., streaming audio corresponding to a phone conversation, media playback, a video game, etc.) is played via the speaker 2730, and an audio zoom mode or beamforming mode in which one or more ambient sounds are emphasized and/or other ambient sounds are suppressed for playback at the speaker 2730. In other implementations, the earbuds 2702, 2704 may support fewer modes, or may support one or more other modes in place of, or in addition to, the described modes.

In an illustrative example, the earbuds 2702, 2704 may automatically transition from the playback mode to the passthrough mode in response to detecting the wearer's speech, and may automatically transition back to the playback mode after the wearer stops speaking. In some examples, the earbuds 2702, 2704 may operate in two or more modes concurrently, such as by performing an audio zoom on a particular ambient sound (e.g., a dog barking) and, while the wearer is listening to music, playing the audio-zoomed sound superimposed on the sound being played (the volume of the music may be reduced while the audio-zoomed sound is played). In this example, the wearer may be alerted to an ambient sound associated with an audio event without stopping playback of the music.
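The concurrent-mode behavior described above, in which an audio-zoomed ambient sound is superimposed on ducked music playback, can be sketched as a simple per-frame mix. This is an illustrative sketch only, not the implementation disclosed herein; the function name, the `duck_gain` value, and the activity test are assumptions chosen for illustration.

```python
def mix_with_ducking(music, zoomed_alert, duck_gain=0.3, alert_gain=1.0):
    """Superimpose an audio-zoomed alert sound on a frame of music.

    While the alert is active (any nonzero samples), the music is
    ducked (attenuated) so the alert remains audible without stopping
    music playback entirely. Inputs are plain sample lists.
    """
    alert_active = any(abs(s) > 1e-6 for s in zoomed_alert)
    g = duck_gain if alert_active else 1.0
    n = max(len(music), len(zoomed_alert))
    out = []
    for i in range(n):
        m = music[i] if i < len(music) else 0.0
        a = zoomed_alert[i] if i < len(zoomed_alert) else 0.0
        out.append(g * m + alert_gain * a)
    return out
```

For example, mixing a music frame with a dog-bark alert frame attenuates the music by `duck_gain` while passing the alert at full level; when the alert frame is silent, the music passes through unchanged.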

FIG. 28 illustrates an implementation 2800 in which the disclosed techniques are implemented in a vehicle 2802, illustrated as a manned or unmanned aerial device (e.g., a package delivery drone). A directional audio signal processing unit 2850 is integrated into the vehicle 2802. The directional audio signal processing unit 2850 includes or corresponds to the directional audio signal processing unit 1990 and may further be configured for autonomous navigation of the vehicle 2802. The directional audio signal processing unit 2850 may include, for example, the one or more processors 1416 of FIG. 14, and the vehicle 2802 may correspond to the vehicle 1410. The directional audio signal processing unit 2850 may generate and execute navigation instructions based on audio signals received from the first microphone 102 and the second microphone 104 of the vehicle 2802, such as delivery instructions from an authorized user of the vehicle 2802.

FIG. 29 illustrates another implementation 2900 in which the vehicle 1410 or the vehicle 1510 corresponds to a vehicle 2902, illustrated as an automobile. The vehicle 2902 includes a directional audio signal processing unit 2950. The directional audio signal processing unit 2950 includes or corresponds to the directional audio signal processing unit 1990 and may further be configured for autonomous navigation of the vehicle 2902. The vehicle 2902 also includes the first microphone 102 and the second microphone 104. In some examples, one or more of the first microphone 102 and the second microphone 104 are located on the exterior of the vehicle 2902 to capture surrounding sounds, such as sirens and sounds of other vehicles. In some implementations, tasks such as detection of environment information and audio sound events, autonomous navigation of the vehicle 2902, and the like may be performed based on audio signals received from the external microphones (e.g., the first microphone 102 and the second microphone 104).

In some examples, one or more of the first microphone 102 and the second microphone 104 are located in the interior of the vehicle 2902 to capture sounds within the vehicle, such as a voice command or a sound indicating a medical emergency. In some implementations, tasks such as autonomous navigation of the vehicle 2902 may be performed based on audio signals received from the internal microphones (e.g., the first microphone 102 and the second microphone 104). One or more operations of the vehicle 2902 may be initiated based on one or more detected keywords (e.g., "unlock," "start engine," "play music," "display weather forecast," or another voice command), such as by providing feedback or information via a display 2920 or one or more speakers (e.g., a speaker 2910).

Referring to FIG. 30, a particular implementation of a method 3000 of processing audio is illustrated. In a particular aspect, one or more operations of the method 3000 are performed by the device 110, the system 200, the headset 310, the headset 410, the system 500, the vehicle 1410, the vehicle 1510, or a combination thereof.

The method 3000 includes, at block 3002, receiving, at one or more processors of a first device, audio signals from multiple microphones. For example, referring to FIG. 1, the processor 130 may receive the audio frames 174, 176 of the audio signals 170, 172 from the microphones 102, 104, respectively.

The method 3000 also includes, at block 3004, processing the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals. For example, referring to FIG. 1, the direction-of-arrival processing unit 132 may process the audio frames 174, 176 to generate the direction-of-arrival information 142 corresponding to the source 180 of the sound 182 represented in the audio signals 170, 172.
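Direction-of-arrival processing of this kind can be illustrated with a minimal two-microphone sketch that searches for the inter-microphone lag maximizing the time-domain cross-correlation of a pair of frames and converts that lag to a broadside angle. This is an assumption-laden illustration rather than the operation of the direction-of-arrival processing unit 132; a practical system would typically use frequency-domain techniques (e.g., GCC-PHAT) and more than two microphones, and the sampling rate and microphone spacing shown are arbitrary.

```python
import math

def estimate_doa(frame1, frame2, fs=16000, mic_spacing=0.08, c=343.0):
    """Estimate a direction of arrival from two microphone frames.

    Finds the inter-microphone sample lag that maximizes the
    cross-correlation, then converts the corresponding time delay to
    an angle in degrees (0 = broadside, +/-90 = endfire).
    """
    # Physically possible lags are bounded by the mic spacing.
    max_lag = int(mic_spacing / c * fs) + 1
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(frame1)):
            j = i + lag
            if 0 <= j < len(frame2):
                score += frame1[i] * frame2[j]
        if score > best_score:
            best_score, best_lag = score, lag
    tau = best_lag / fs
    s = max(-1.0, min(1.0, tau * c / mic_spacing))
    return math.degrees(math.asin(s))
```

A sound arriving from broadside reaches both microphones simultaneously (zero lag, zero degrees), while a delayed copy at one microphone yields a nonzero angle toward that side.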

The method 3000 also includes, at block 3006, sending, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information. For example, the modem 118 may send, to the device 120, the direction-of-arrival information 142 and one or both of the indication 616 or the indication 716. The class may correspond to a class of a particular sound that is represented in the audio signals and that is associated with a particular audio event, and the embedding may include a signature or information corresponding to the particular sound or the particular audio event and may be configured to enable detection of the particular sound or the particular audio event in other audio signals via processing of the other audio signals. In some implementations, the method 3000 also includes sending a representation of the audio signals to the second device. For example, the representation of the audio signals may include one or more portions of the audio signals 170, 172, one or more portions of the beamformed audio signal 148, or a combination thereof. According to one implementation of the method 3000, sending the data to the device 120 may trigger activation of the one or more sensors 129.

In some implementations, the method 3000 includes processing signal data corresponding to the audio signals to determine the class or the embedding. In one example, the method 3000 includes performing a beamforming operation on the audio signals (e.g., at the beamforming unit 138) to generate the signal data. In one example, the signal data is processed at one or more classifiers (such as the one or more classifiers 610) to determine, from among multiple classes supported by the one or more classifiers, a class for a sound that is represented in one or more of the audio signals and that is associated with an audio event. The class is sent to the second device (e.g., the device 120), such as via the indication 616.
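Selecting a class from among the multiple classes supported by a classifier can be sketched as a confidence-thresholded argmax over the classifier's output scores. The sketch below is illustrative only; the threshold convention and the `None` return for "no detection" are assumptions, not details of the one or more classifiers 610.

```python
def classify_sound(scores, supported_classes, threshold=0.5):
    """Select an audio-event class from classifier output scores.

    `scores` maps each supported class label to a confidence value
    (e.g., a softmax output). Returns the top-scoring supported class
    if its confidence meets the threshold, otherwise None to indicate
    that no audio event was detected.
    """
    best = max(supported_classes, key=lambda c: scores.get(c, 0.0))
    return best if scores.get(best, 0.0) >= threshold else None
```

The returned class label is what would be carried to the second device in an indication such as the indication 616.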

In some implementations, the signal data is processed at one or more encoders (such as the one or more encoders 710) to generate the embedding. The embedding corresponds to a sound that is represented in one or more of the audio signals and that is associated with an audio event. The embedding is sent to the second device (e.g., the device 120), such as via the indication 716.
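One way an embedding could enable detection of the same sound in other audio signals is by comparing the stored embedding against an embedding computed from the new audio, for example with cosine similarity. The following is a hedged sketch under that assumption; the similarity measure and threshold are illustrative and are not specified by the present description.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def matches_embedding(stored, candidate, threshold=0.8):
    """Detect a previously characterized sound by comparing the stored
    embedding (signature) with an embedding of the candidate audio."""
    return cosine_similarity(stored, candidate) >= threshold
```

Under this sketch, the second device would run its own encoder on incoming audio and flag a match whenever the resulting embedding is sufficiently close to the received signature.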

In some implementations, the method 3000 includes receiving, at one or more processors of the second device, the data based on the direction-of-arrival information and the class. For example, the modem 128 of the device 120 may receive the data and provide the direction-of-arrival information 142 and the indication 616 to the one or more processors 126. The method 3000 may include obtaining, at the one or more processors of the second device, audio data representing the sound associated with the direction-of-arrival information and the class. For example, the one or more processors 126 obtain one or more of the audio signals 170, 172 from the first device, obtain one or more of the audio signals 190, 192 from local microphones (e.g., the microphones 106, 108), obtain the beamformed audio signal 148 from the first device, or a combination thereof. The method 3000 may also include verifying the class, at the one or more processors of the second device, based at least on the audio data and the direction-of-arrival information, such as at the audio event processing unit 154 or as described with reference to the one or more classifiers 610.

In some implementations, the method 3000 includes receiving, at one or more processors of the second device, the data based on the direction-of-arrival information and the embedding. For example, the modem 128 of the device 120 may receive the data and provide the direction-of-arrival information 142 and the indication 716 to the one or more processors 126. The method 3000 may also include processing, at the one or more processors of the second device, audio data representing a sound scene based on the direction-of-arrival information and the embedding to generate modified audio data corresponding to an updated sound scene. For example, the one or more processors 126 may process the input mixture waveform 1102 representing the audio scene 1151 in conjunction with the one or more embeddings 1104 and the direction information 912 to generate the updated audio scene 1171.

The method 3000 enables directional context-aware processing to be performed based on audio signals generated by multiple microphones. As a result, context detection and determination of characteristics associated with the surrounding environment are enabled for a variety of use cases.

Referring to FIG. 31, a particular implementation of a method 3100 of processing audio is illustrated. In a particular aspect, one or more operations of the method 3100 are performed by the vehicle 1410 of FIG. 14.

The method 3100 includes, at block 3102, receiving, at one or more processors of a vehicle, multiple audio signals from multiple microphones. For example, referring to FIG. 14, the processor 1416 may receive the audio frames 1474, 1476 of the audio signals 1470, 1472 from the microphones 1402, 1404, respectively.

The method 3100 also includes, at block 3104, processing the multiple audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals. For example, referring to FIG. 14, the direction-of-arrival processing unit 1432 may process the audio frames 1474, 1476 to generate the direction-of-arrival information 1442 corresponding to the source 1480 of the sound 1482 represented in the audio signals 1470, 1472.

The method 3100 also includes, at block 3106, generating, based on the direction-of-arrival information, a report indicating at least one detected event and a direction of the detected event. For example, referring to FIG. 14, the report generator 1436 may generate the report 1446 indicating at least one detected event (from the audio event information 1444) and the direction of the detected event (from the direction-of-arrival information 1442).

According to one implementation, the method 3100 may include sending the report to a second device (e.g., a second vehicle or a server) and receiving a navigation instruction or a second report from the second device. Based on the second report, the processor may generate navigation instructions to autonomously navigate the vehicle. If the second device sends a navigation instruction, the processor may use the sent navigation instruction to autonomously navigate the vehicle.

The method 3100 enables the vehicle 1410 to detect an external sound, such as a siren, and to navigate accordingly. It should be understood that using multiple microphones enables the location and relative distance of the siren sound (e.g., the source 1480) to be determined, and the location and relative distance may be displayed as the detected siren sound is approaching or moving away.
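As a rough illustration of inferring whether a detected siren is approaching or moving away, one could track the trend of the siren's per-frame level: a rising level suggests an approaching source. The helper below is purely illustrative (the 10% hysteresis margins are arbitrary assumptions), and a real system would more likely combine direction-of-arrival tracking with level or Doppler cues.

```python
def siren_trend(frame_rms):
    """Classify whether a detected siren source is approaching or
    receding based on the trend of its per-frame RMS level.

    Returns "approaching" if the level is rising, "receding" if it is
    falling, and "steady" otherwise.
    """
    if len(frame_rms) < 2:
        return "steady"
    # Compare the mean level of the later half against the earlier half.
    mid = len(frame_rms) // 2
    early = sum(frame_rms[:mid]) / mid
    late = sum(frame_rms[mid:]) / (len(frame_rms) - mid)
    if late > early * 1.1:
        return "approaching"
    if late < early * 0.9:
        return "receding"
    return "steady"
```

The resulting label is the kind of per-source annotation that could accompany a displayed location and relative distance.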

Referring to FIG. 32, a particular implementation of a method 3200 of processing audio is illustrated. In a particular aspect, one or more operations of the method 3200 are performed by the device 120, such as at the one or more processors 126.

The method 3200 includes, at block 3202, receiving, at one or more processors of a second device, an indication of an audio class, the indication received from a first device and corresponding to an audio event. For example, the one or more processors 126 of the device 120 of FIG. 9 receive the indication 902 (e.g., the indication 616) from the device 110 of FIG. 6.

The method 3200 includes, at block 3204, processing audio data, at the one or more processors of the second device, to verify that a sound represented in the audio data corresponds to the audio event. For example, the one or more processors 126 of the device 120 of FIG. 2 process the audio data 904 to generate the classification 922 to verify that the sound represented in the audio data 904 corresponds to the audio event. In one example, the one or more processors 126 compare the classification 922 to the audio class indicated by the indication 902.
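The second-stage verification at block 3204 can be sketched as re-classifying the audio data at the second device and confirming the event only when the second device's own top class matches the class indicated by the first device with sufficient confidence. The function below is an illustrative sketch; the matching rule and threshold are assumptions, not the actual behavior used to produce the classification 922.

```python
def verify_audio_event(received_class, second_stage_scores, threshold=0.6):
    """Second-device verification of a first-device audio-event class.

    The first device reports a tentative class (high sensitivity, lower
    accuracy); the second device runs its own, more accurate classifier
    on the audio data and confirms the event only when its top class
    matches the reported class with sufficient confidence.
    """
    top = max(second_stage_scores, key=second_stage_scores.get)
    return top == received_class and second_stage_scores[top] >= threshold
```

A confirmed event could then drive downstream actions such as notifications or control signals; a mismatch or a low-confidence result would suppress them.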

Optionally, the method 3200 includes receiving the audio data from the first device (e.g., the device 110), and the processing of the audio data optionally includes providing the audio data as an input to one or more classifiers to determine a classification associated with the audio data. For example, in some implementations, the audio data 904 includes one or more portions of the audio signals 170, 172, one or more portions of the beamformed audio signal 148, or a combination thereof, and the audio data 904 is input to the one or more classifiers 920. In some implementations, the processing of the audio data further includes providing the indication of the audio class (e.g., the indication 902) as a second input to the one or more classifiers to determine the classification associated with the audio data.

Optionally, the method 3200 includes sending a control signal (such as the control signal 932) to the first device (e.g., the device 110) based on an output of the one or more classifiers. In some implementations, the control signal includes an audio zoom instruction. In some implementations, the control signal includes an instruction to perform spatial processing based on a direction of the sound source.

In some implementations, the audio class corresponds to a vehicle event, and the method 3200 optionally includes sending a notification of the vehicle event to one or more third devices based on a location of the first device and locations of the one or more third devices. For example, the notification 1492 is sent to the one or more devices 1490, as described with reference to FIG. 14 and FIG. 15.

Optionally, the method 3200 includes receiving, from the first device (e.g., the device 110), direction data (such as the direction data 912) corresponding to a sound source associated with the audio event. The method 3200 may include updating a map of directional sound sources in an audio scene based on the audio event to generate an updated map, such as described with reference to the map updater 1612, and sending data corresponding to the updated map to one or more third devices that are geographically remote from the first device. For example, the device 120 sends the data 1660 to one or more of the devices 1670, 1672, and 1674.

Optionally, the method 3200 includes selectively bypassing direction-of-arrival processing of received audio data corresponding to the audio event based on whether direction-of-arrival information is received from the first device (e.g., the device 110). For example, the one or more processors 126 may selectively bypass performing the direction-of-arrival processing illustrated at block 1332 of FIG. 13 based on determining, at block 1330 of FIG. 13, that direction-of-arrival information was received in the transmission from the first device.

Optionally, the method 3200 includes selectively bypassing a beamforming operation based on whether the received audio data corresponds to multi-channel microphone signals from the first device (e.g., the device 110) or to a beamformed signal from the first device. For example, the one or more processors 126 may selectively bypass performing the beamforming operation illustrated at block 1342 of FIG. 13 based on determining, at block 1340 of FIG. 13, that the transmission includes beamformed data (such as the beamformed audio signal 148).
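The two optional bypass decisions above amount to routing logic on the received payload: direction-of-arrival processing is skipped when direction-of-arrival information was already received, and beamforming is skipped when the payload already carries a beamformed signal. A minimal sketch, with an assumed payload layout (the dictionary keys are illustrative, not part of the present description):

```python
def process_received_audio(payload):
    """Route received audio data through optional processing stages.

    Direction-of-arrival processing is bypassed when the sending device
    already included direction-of-arrival information, and beamforming
    is bypassed when the payload already carries a beamformed signal
    rather than raw multi-channel microphone signals. Returns the list
    of stages that would actually run, in order.
    """
    steps = []
    if payload.get("doa") is None:
        steps.append("doa_processing")
    if not payload.get("is_beamformed", False):
        steps.append("beamforming")
    steps.append("event_detection")
    return steps
```

When the first device has already done the heavy lifting, the second device's pipeline collapses to event detection alone, which is the power saving this selective bypass is intended to capture.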

By receiving the indication of the audio class corresponding to the audio event and processing the audio data to verify that the sound represented in the audio data corresponds to the audio event, the method 3200 enables distributed audio event detection in which a first stage (such as at a headset) can identify audio events with relatively high sensitivity and relatively low accuracy (e.g., due to power, memory, or computation constraints) as compared to a second stage (such as at a mobile phone). The second stage can verify the audio event using higher-power, more accurate audio event detection, and can communicate detection results, control signals, and the like based on the detected audio event. As a result, accurate audio event detection can be provided to a user of a wearable electronic device (such as a headset) without requiring the wearable electronic device to support the computational load, memory footprint, and power consumption associated with full-power audio event detection.

Referring to FIG. 33, a particular implementation of a method 3300 of processing audio is illustrated. In a particular aspect, one or more operations of the method 3300 are performed by the device 120, such as at the one or more processors 126. In another particular aspect, one or more operations of the method 3300 are performed by the device 1520, such as at the one or more processors 1526.

The method 3300 includes, at block 3302, receiving, at one or more processors of a second device, audio data from a first device and an indication, from the first device, that the audio data corresponds to an audio class associated with a vehicle event. For example, the device 1520 receives the audio data 1550 and the indication 1552 from the vehicle 1510.

The method 3300 includes, at block 3304, processing the audio data at one or more classifiers of the second device (e.g., the device 1520) to verify that a sound represented in the audio data corresponds to the vehicle event. For example, the audio data 1550 is processed at the one or more classifiers 1530 to determine the classification 1522.

The method 3300 includes, at block 3306, sending a notification of the vehicle event to one or more third devices based on a location of the first device (e.g., the vehicle 1510) and locations of the one or more third devices. For example, the device 1520 sends the notification 1592 to the one or more devices 1490 based on the location of the vehicle 1510 and the locations of the one or more devices 1490.

Referring to FIG. 34, a particular implementation of a method 3400 of processing audio is illustrated. In a particular aspect, one or more operations of the method 3400 are performed by the device 110, such as at the one or more processors 116.

The method 3400 includes, at block 3402, receiving, at one or more processors of a first device, one or more audio signals from one or more microphones. For example, the device 110 receives the audio signals 170, 172 from the microphones 102, 104, respectively.

The method 3400 includes, at block 3404, processing the one or more audio signals at the one or more processors to determine whether a sound represented in one or more of the audio signals comes from an identifiable direction. For example, at block 1212 of FIG. 12, the device 110 determines whether the processing of the audio signals at block 1202 of FIG. 12 produced valid direction-of-arrival information regarding a source of an audio event.

The method 3400 includes, at block 3406, selectively sending direction-of-arrival information of the sound source to a second device based on the determination. For example, the device 110 selects whether to send the direction-of-arrival information to the second device based on determining whether valid direction-of-arrival information is available, such as described in conjunction with blocks 1212 and 1214 of FIG. 12.

By selectively sending direction-of-arrival information based on whether a sound represented in one or more of the audio signals comes from an identifiable direction, the method 3400 can conserve the power and transmission resources that would otherwise be consumed by sending invalid or unreliable direction-of-arrival information to the second device.

Referring to FIG. 35, a particular implementation of a method 3500 of processing audio is illustrated. In a particular aspect, one or more operations of the method 3500 are performed by the device 110, such as at the one or more processors 116.

方法3500包括在方塊3502處在第一設備的一或多個處理器處接收來自一或多個麥克風的一或多個音訊信號。例如,設備110分別從麥克風102、104接收音訊信號170、172。Method 3500 includes receiving, at block 3502, one or more audio signals from one or more microphones at one or more processors of a first device. For example, device 110 receives audio signals 170, 172 from microphones 102, 104, respectively.

方法3500包括在方塊3504處在一或多個處理器處基於一或多個標準來決定是向第二設備發送一或多個音訊信號還是向第二設備發送基於一或多個音訊信號而產生的波束成形音訊信號。例如,如果波束成形音訊信號在設備110處可用,則設備110可以基於諸如可用電量和頻寬資源量的標準來決定是否發送一或多個音訊信號或者是否發送波束成形音訊信號,如參考圖12的方塊1220所述。在第二設備處沒有麥克風可用的說明性的非限制性示例中,如果用於向第二設備傳輸的可用電力或頻寬超過閾值,則如結合圖12的方塊1232所述,決定發送音訊信號(例如,經由來自方塊1232的「否」路徑);否則,決定發送波束成形信號(例如,經由來自方塊1232的「是」路徑、來自方塊1234的「否」路徑和來自方塊1238的「是」路徑)。Method 3500 includes, at block 3504, determining, at the one or more processors based on one or more criteria, whether to send one or more audio signals to the second device or to send a beamformed audio signal generated based on the one or more audio signals to the second device. For example, if a beamformed audio signal is available at device 110, device 110 may decide whether to send the one or more audio signals or the beamformed audio signal based on criteria such as the amount of available power and bandwidth resources, as described with reference to block 1220 of FIG. 12. In an illustrative, non-limiting example in which no microphone is available at the second device, if the power or bandwidth available for transmission to the second device exceeds a threshold, as described in connection with block 1232 of FIG. 12, a decision is made to send the audio signals (e.g., via the "No" path from block 1232); otherwise, a decision is made to send the beamformed signal (e.g., via the "Yes" path from block 1232, the "No" path from block 1234, and the "Yes" path from block 1238).

方法3500包括在方塊3506處基於該決定向第二設備發送與一或多個音訊信號相對應或者與波束成形音訊信號相對應的音訊資料。繼續上面的示例,設備110可以在圖12的方塊1248處向設備120發送音訊信號,或者在圖12的方塊1244處向設備120發送波束成形信號。Method 3500 includes, at block 3506, sending, to the second device based on the determination, audio data corresponding to the one or more audio signals or corresponding to the beamformed audio signal. Continuing the above example, device 110 may send the audio signals to device 120 at block 1248 of FIG. 12, or send the beamformed signal to device 120 at block 1244 of FIG. 12.

藉由基於諸如電力可用性或傳輸資源的一或多個標準來選擇是發送音訊信號還是波束成形信號,方法3500使發送設備能夠根據情況適當地決定是否向接收設備提供全音訊解析度(例如,藉由發送與包括感興趣的聲音的完全麥克風通道集合相對應的資料)或者是否提供更精細的目標音訊(例如,藉由發送與瞄準感興趣的聲源的單個波束成形通道相對應的資料)。By selecting whether to send the audio signals or the beamformed signal based on one or more criteria, such as power availability or transmission resources, method 3500 enables the sending device to decide, as appropriate to the situation, whether to provide the receiving device with full audio resolution (e.g., by sending data corresponding to the full set of microphone channels that includes the sound of interest) or with more finely targeted audio (e.g., by sending data corresponding to a single beamformed channel aimed at the sound source of interest).
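A minimal sketch of the block-3504 decision follows. The threshold test and its default values are illustrative assumptions rather than the disclosed implementation, which additionally accounts for factors such as whether microphones are available at the second device.

```python
def choose_audio_payload(power_available: float,
                         bandwidth_available: float,
                         power_threshold: float = 1.0,
                         bandwidth_threshold: float = 1.0) -> str:
    """Pick the payload to send to the second device.

    Following the block-1232 style test: when the power or bandwidth
    available for transmission exceeds its threshold, full audio
    resolution is affordable and the raw microphone channels are sent;
    otherwise the leaner single beamformed channel is sent. The
    threshold values are illustrative, not taken from the disclosure.
    """
    if power_available > power_threshold or bandwidth_available > bandwidth_threshold:
        return "audio_signals"      # full microphone channel set
    return "beamformed_signal"      # single channel aimed at the source
```

For instance, a device with ample power but a constrained link would still send the full channel set under this sketch, because either resource exceeding its threshold suffices.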

參考圖36,圖示處理音訊的方法3600的特定實施方式。在特定態樣中,方法3600的一或多個操作由設備120(諸如在一或多個處理器126處)執行。Referring to FIG. 36 , a particular embodiment of a method 3600 of processing audio is illustrated. In a particular aspect, one or more operations of method 3600 are performed by device 120 , such as at one or more processors 126 .

方法3600包括在方塊3602處在第二設備的一或多個處理器處接收表示聲音的音訊資料、與聲源相對應的方向資料、以及與音訊事件相對應的聲音的分類,其中音訊資料、方向資料和分類是從第一設備接收的。例如,設備120的一或多個處理器126可以從設備110接收圖9或圖10的音訊資料904、圖16的指示1602和方向資料1604。Method 3600 includes, at block 3602, receiving, at one or more processors of a second device, audio data representing a sound, direction data corresponding to a sound source, and a classification of the sound corresponding to an audio event, where the audio data, the direction data, and the classification are received from a first device. For example, one or more processors 126 of device 120 may receive, from device 110, audio data 904 of FIG. 9 or FIG. 10 and indication 1602 and direction data 1604 of FIG. 16.

方法3600包括在方塊3604處在一或多個處理器處處理音訊資料以驗證聲音對應於音訊事件。例如,音訊事件處理單元154處理音訊資料以驗證由指示1602指示的音訊類別。Method 3600 includes, at block 3604, processing the audio data at one or more processors to verify that the sound corresponds to an audio event. For example, the audio event processing unit 154 processes the audio data to verify the audio type indicated by the indication 1602 .

方法3600包括在方塊3606處在一或多個處理器處基於音訊事件更新音訊場景中的定向聲源的地圖以產生更新後的地圖。例如,地圖更新器1612更新地圖1614以產生更新後的地圖1616。Method 3600 includes updating, at one or more processors, a map of directional sound sources in an audio scene based on the audio events at block 3606 to generate an updated map. For example, map updater 1612 updates map 1614 to generate updated map 1616 .

方法3600包括在方塊3608處向地理上遠離第一設備的一或多個第三設備發送與更新後的地圖相對應的資料。例如,更新後的地圖資料1660被發送到在地理上遠離設備110的設備1670、1672和1674。Method 3600 includes, at block 3608, sending data corresponding to the updated map to one or more third devices that are geographically remote from the first device. For example, updated map data 1660 is sent to devices 1670, 1672, and 1674, which are geographically remote from device 110.

藉由更新音訊場景中的定向聲源的地圖並將更新後的地圖資料發送到地理上遠端的設備,方法3600實現了諸如其中多個參與者沉浸在共享聲音場景中的虛擬環境的應用,諸如參考圖18所述。By updating the map of directional sound sources in the audio scene and sending the updated map data to geographically remote devices, method 3600 enables applications such as virtual environments in which multiple participants are immersed in a shared sound scene, such as described with reference to FIG. 18.
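The receive-verify-update-broadcast flow of method 3600 can be sketched as follows. The classifier callable, the map keyed by direction, and the device objects are illustrative stand-ins for the audio event processing unit 154, the map 1614, and the remote devices 1670-1674; they are assumptions of this sketch, not the disclosed implementation.

```python
def handle_audio_event(audio_data, reported_class, direction,
                       classify, sound_map, remote_devices):
    """Sketch of the second device's flow in method 3600: re-classify
    the received audio to verify the reported class, update the map of
    directional sound sources, and push the updated map to remote
    devices. Returns True when the event is verified and shared."""
    verified_class = classify(audio_data)
    if verified_class != reported_class:
        return False  # verification failed; leave the map unchanged
    sound_map[direction] = verified_class     # update the audio-scene map
    for device in remote_devices:
        device.send(dict(sound_map))          # share the updated map
    return True
```

A failed verification leaves the map untouched and sends nothing, so remote participants only ever see verified events.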

圖12、圖13和圖30-圖36的方法可以由現場可程式設計閘陣列(FPGA)設備、專用積體電路(ASIC)、處理單元(諸如中央處理單元(CPU))、數位信號處理單元(DSP)、控制器、另一硬體設備、韌體設備或其任意組合來實施。作為示例,圖12、圖13和圖30-圖36的方法可以由執行指令的處理器來執行,諸如參考圖37所述。The methods of FIGS. 12, 13, and 30-36 may be implemented by a field-programmable gate array (FPGA) device, an application-specific integrated circuit (ASIC), a processing unit (such as a central processing unit (CPU)), a digital signal processing unit (DSP), a controller, another hardware device, a firmware device, or any combination thereof. As an example, the methods of FIGS. 12, 13, and 30-36 may be performed by a processor that executes instructions, such as described with reference to FIG. 37.

參考圖37,圖示了設備的特定說明性實施方式的方塊圖,並且將其整體指定為3700。在各種實施方式中,設備3700可以具有比圖37所示更多或更少的組件。在說明性實施方式中,設備3700可以對應於設備110、設備120、交通工具1410、設備1420、交通工具1510或設備1520。在說明性實施方式中,設備3700可以執行參考圖1-圖36描述的一或多個操作。Referring to FIG. 37 , a block diagram of a particular illustrative embodiment of an apparatus is shown and generally designated 3700 . In various implementations, device 3700 may have more or fewer components than shown in FIG. 37 . In an illustrative embodiment, device 3700 may correspond to device 110 , device 120 , vehicle 1410 , device 1420 , vehicle 1510 , or device 1520 . In an illustrative embodiment, device 3700 may perform one or more operations described with reference to FIGS. 1-36 .

在特定實施方式中,設備3700包括處理器3706(例如,CPU)。設備3700可以包括一或多個附加處理器3710(例如,一或多個DSP)。在特定態樣中,圖1的(多個)處理器116、126或圖14的(多個)處理器1416對應於處理器3706、處理器3710或其組合。處理器3710可以包括語音和音樂轉碼器-解碼器(轉碼器,CODEC)3708,語音和音樂轉碼器3708包括語音轉碼器(「聲碼器」)編碼器3736、聲碼器解碼器3738、定向音訊信號處理單元1990或其組合。In a particular implementation, device 3700 includes a processor 3706 (e.g., a CPU). Device 3700 may include one or more additional processors 3710 (e.g., one or more DSPs). In particular aspects, the processor(s) 116, 126 of FIG. 1 or the processor(s) 1416 of FIG. 14 correspond to processor 3706, processor 3710, or a combination thereof. Processor 3710 may include a speech and music coder-decoder (CODEC) 3708 that includes a voice coder ("vocoder") encoder 3736, a vocoder decoder 3738, the directional audio signal processing unit 1990, or a combination thereof.

設備3700可以包括記憶體3786和CODEC 3734。記憶體3786可以包括指令3756,指令3756可由一或多個附加處理器3710(或處理器3706)執行,以實施參考定向音訊信號處理單元1990描述的功能。在特定態樣中,記憶體3786對應於記憶體114、圖1的記憶體124、圖14的記憶體1414或其組合。在特定態樣中,指令3756包括圖1的指令115、指令125,圖14的指令1415或其組合。設備3700可以包括經由收發器3750耦合到天線3752的數據機3770。數據機3770可以被配置為向第二設備(未示出)發送信號。根據特定實施方式,數據機3770可以對應於圖1的數據機128。Device 3700 may include memory 3786 and CODEC 3734 . Memory 3786 may include instructions 3756 executable by one or more additional processors 3710 (or processor 3706 ) to implement the functions described with reference to directional audio signal processing unit 1990 . In certain aspects, memory 3786 corresponds to memory 114 , memory 124 of FIG. 1 , memory 1414 of FIG. 14 , or combinations thereof. In certain aspects, instructions 3756 include instruction 115, instruction 125 of FIG. 1, instruction 1415 of FIG. 14, or a combination thereof. Device 3700 may include modem 3770 coupled to antenna 3752 via transceiver 3750 . Modem 3770 may be configured to send signals to a second device (not shown). According to particular implementations, modem 3770 may correspond to modem 128 of FIG. 1 .

設備3700可以包括耦合到顯示控制器3726的顯示器3728。揚聲器3792、第一麥克風102和第二麥克風104可以耦合到CODEC 3734。CODEC 3734可以包括數位類比轉換器(DAC)3702、類比數位轉換器(ADC)3704或兩者。在特定實施方式中,CODEC 3734可以從第一麥克風102和第二麥克風104接收類比信號,使用類比數位轉換器3704將類比信號轉換成數位信號,並且將數位信號提供給語音和音樂轉碼器3708。語音和音樂轉碼器3708可以處理數位信號,並且數位信號可以由定向音訊信號處理單元1990進一步處理。在特定實施方式中,語音和音樂轉碼器3708可以向CODEC 3734提供數位信號。CODEC 3734可以使用數位類比轉換器3702將數位信號轉換成類比信號,並且可以將類比信號提供給揚聲器3792。Device 3700 can include a display 3728 coupled to a display controller 3726 . Speaker 3792 , first microphone 102 and second microphone 104 may be coupled to CODEC 3734 . The CODEC 3734 may include a digital-to-analog converter (DAC) 3702, an analog-to-digital converter (ADC) 3704, or both. In particular embodiments, CODEC 3734 may receive analog signals from first microphone 102 and second microphone 104, convert the analog signals to digital signals using analog-to-digital converter 3704, and provide the digital signals to voice and music transcoder 3708 . The speech and music transcoder 3708 can process the digital signal, and the digital signal can be further processed by the directional audio signal processing unit 1990 . In particular embodiments, speech and music transcoder 3708 may provide digital signals to CODEC 3734 . The CODEC 3734 may convert the digital signal into an analog signal using the digital-to-analog converter 3702 and may provide the analog signal to a speaker 3792 .

在特定實施方式中,設備3700可以被包括在系統級封裝或片上系統設備3722中。在特定實施方式中,記憶體3786、處理器3706、處理器3710、顯示控制器3726、CODEC 3734和數據機3770被包括在系統級封裝或片上系統設備3722中。在特定實施方式中,輸入裝置3730和電源3744耦合到片上系統設備3722。此外,在特定實施方式中,如圖37所示,顯示器3728、輸入裝置3730、揚聲器3792、第一麥克風102、第二麥克風104、天線3752和電源3744在片上系統設備3722的外部。在特定實施方式中,顯示器3728、輸入裝置3730、揚聲器3792、第一麥克風102、第二麥克風104、天線3752和電源3744中的每一個可以耦合到片上系統設備3722的元件,諸如介面(例如,輸入介面121或輸入介面122)或控制器。In particular embodiments, device 3700 may be included in a system-in-package or system-on-chip device 3722 . In a particular embodiment, memory 3786 , processor 3706 , processor 3710 , display controller 3726 , CODEC 3734 , and modem 3770 are included in a system-in-package or system-on-chip device 3722 . In a particular embodiment, an input device 3730 and a power supply 3744 are coupled to the system-on-chip device 3722 . Additionally, in certain embodiments, as shown in FIG. 37 , display 3728 , input device 3730 , speaker 3792 , first microphone 102 , second microphone 104 , antenna 3752 , and power supply 3744 are external to system-on-chip device 3722 . In particular embodiments, each of display 3728, input device 3730, speaker 3792, first microphone 102, second microphone 104, antenna 3752, and power supply 3744 may be coupled to an element of system-on-chip device 3722, such as an interface (e.g., input interface 121 or input interface 122) or a controller.

設備3700可以包括智慧揚聲器、揚聲器條、行動通訊設備、智慧型電話、蜂巢式電話、膝上型電腦、電腦、平板電腦、個人數位助理、顯示裝置、電視、遊戲控制台、音樂播放機、收音機、數位視訊播放機、數位視訊光碟(DVD)播放機、調諧器、相機、導航設備、交通工具、耳機、增強現實耳機、混合現實耳機、虛擬實境耳機、飛行器、家庭自動化系統、語音啟動設備、無線揚聲器和語音啟動設備、可攜式電子設備、汽車、交通工具、計算設備、通訊設備、物聯網路(IoT)設備、虛擬實境(VR)設備、基地台、行動設備或其任意組合。Device 3700 may include a smart speaker, speaker bar, mobile communication device, smartphone, cellular phone, laptop, computer, tablet, personal digital assistant, display device, television, game console, music player, radio , Digital Video Players, Digital Video Disc (DVD) Players, Tuners, Cameras, Navigation Devices, Vehicles, Headphones, Augmented Reality Headsets, Mixed Reality Headsets, Virtual Reality Headsets, Aircraft, Home Automation Systems, Voice Activated Devices , wireless speakers and voice-activated devices, portable electronic devices, automobiles, vehicles, computing devices, communication devices, Internet of Things (IoT) devices, virtual reality (VR) devices, base stations, mobile devices, or any combination thereof .

結合所描述的實施方式,一種裝置包括用於從多個麥克風接收音訊信號的構件。例如,用於接收音訊信號的構件可以對應於輸入介面112、輸入介面111、處理器116或其元件、輸入介面121、輸入介面122、處理器126或其元件、第一處理域210或其元件、第二處理域220或其元件、耳機310或其元件、耳機410或其元件、空間濾波處理單元502、音訊輸入1904、一或多個處理器1916、定向音訊信號處理單元1990、一或多個處理器3710、被配置為從多個麥克風接收音訊信號的一或多個其他電路或元件、或其任意組合。In conjunction with the described implementations, an apparatus includes means for receiving audio signals from a plurality of microphones. For example, the means for receiving audio signals may correspond to the input interface 112, the input interface 111, the processor 116 or a component thereof, the input interface 121, the input interface 122, the processor 126 or a component thereof, the first processing domain 210 or a component thereof, the second processing domain 220 or a component thereof, the headset 310 or a component thereof, the headset 410 or a component thereof, the spatial filtering processing unit 502, the audio input 1904, the one or more processors 1916, the directional audio signal processing unit 1990, the one or more processors 3710, one or more other circuits or components configured to receive audio signals from a plurality of microphones, or any combination thereof.

該裝置還包括用於處理音訊信號以產生與以音訊信號中的一或多個表示的聲音的一或多個源相對應的到達方向資訊的構件。例如,用於處理的構件可以對應於(多個)處理器116或其組件、(多個)處理器126或其元件、第一處理域210或其元件、第二處理域220或其元件、耳機310或其元件、耳機410或其元件、空間濾波處理單元502、音訊事件處理單元504、定向音訊信號處理單元1990、一或多個處理器1916、一或多個處理器3710、被配置為處理音訊信號的一或多個其他電路或元件、或其任意組合。The apparatus also includes means for processing the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented by one or more of the audio signals. For example, the means for processing may correspond to the processor(s) 116 or a component thereof, the processor(s) 126 or a component thereof, the first processing domain 210 or a component thereof, the second processing domain 220 or a component thereof, the headset 310 or a component thereof, the headset 410 or a component thereof, the spatial filtering processing unit 502, the audio event processing unit 504, the directional audio signal processing unit 1990, the one or more processors 1916, the one or more processors 3710, one or more other circuits or components configured to process audio signals, or any combination thereof.

該裝置還包括用於向第二設備發送基於到達方向資訊和與到達方向資訊相關聯的類別或嵌入的資料的構件。例如,用於發送的構件可以對應於數據機118、數據機128、信號輸出1906、定向音訊信號處理單元1990、一或多個處理器1916、數據機3770、收發器3750、天線3752、被配置為發送資料和類別或嵌入的一或多個其他電路或元件、或其任意組合。The apparatus also includes means for sending, to the second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information. For example, the means for sending may correspond to the modem 118, the modem 128, the signal output 1906, the directional audio signal processing unit 1990, the one or more processors 1916, the modem 3770, the transceiver 3750, the antenna 3752, one or more other circuits or components configured to send the data and the class or embedding, or any combination thereof.

結合所描述的實施方式,一種裝置包括用於從多個麥克風接收多個音訊信號的構件。例如,用於接收多個音訊信號的構件可以對應於輸入介面1412、輸入介面1411、一或多個處理器1416或其元件、定向音訊信號處理單元2850、定向音訊信號處理單元2950、一或多個處理器3710、被配置為從多個麥克風接收多個音訊信號的一或多個其他電路或元件、或其任意組合。In conjunction with the described implementations, an apparatus includes means for receiving a plurality of audio signals from a plurality of microphones. For example, the means for receiving a plurality of audio signals may correspond to the input interface 1412, the input interface 1411, the one or more processors 1416 or a component thereof, the directional audio signal processing unit 2850, the directional audio signal processing unit 2950, the one or more processors 3710, one or more other circuits or components configured to receive a plurality of audio signals from a plurality of microphones, or any combination thereof.

該裝置還包括用於處理多個音訊信號以產生與以音訊信號中的一或多個表示的聲音的一或多個源相對應的到達方向資訊的構件。例如,用於處理的構件包括一或多個處理器1416或其元件、定向音訊信號處理單元2850、定向音訊信號處理單元2950、一或多個處理器3710、被配置為處理多個音訊信號的一或多個其他電路或元件、或其任意組合。The apparatus also includes means for processing the plurality of audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented by one or more of the audio signals. For example, the means for processing includes the one or more processors 1416 or a component thereof, the directional audio signal processing unit 2850, the directional audio signal processing unit 2950, the one or more processors 3710, one or more other circuits or components configured to process a plurality of audio signals, or any combination thereof.

該裝置還包括用於基於到達方向資訊產生指示至少一個偵測到的事件和偵測到的事件的方向的報告的構件。例如,用於產生的構件包括一或多個處理器1416或其元件、定向音訊信號處理單元2850、定向音訊信號處理單元2950、一或多個處理器3710、被配置為產生報告的一或多個其他電路或元件、或其任意組合。The apparatus also includes means for generating, based on the direction-of-arrival information, a report indicating at least one detected event and a direction of the detected event. For example, the means for generating includes the one or more processors 1416 or a component thereof, the directional audio signal processing unit 2850, the directional audio signal processing unit 2950, the one or more processors 3710, one or more other circuits or components configured to generate the report, or any combination thereof.

結合所描述的實施方式,一種裝置包括用於接收音訊類別的指示的構件,該指示從遠端設備接收並且對應於音訊事件。例如,用於接收指示的構件可以對應於數據機128、一或多個處理器126、一或多個處理器1516、音訊輸入1904、一或多個處理器1916、天線3752、收發器3750、數據機3770、處理器3706、一或多個處理器3710、被配置為接收指示的一或多個其他電路或元件、或其任意組合。In connection with the described embodiments, an apparatus includes means for receiving an indication of an audio category, the indication being received from a remote device and corresponding to an audio event. For example, means for receiving an indication may correspond to modem 128, one or more processors 126, one or more processors 1516, audio input 1904, one or more processors 1916, antenna 3752, transceiver 3750, A modem 3770, a processor 3706, one or more processors 3710, one or more other circuits or elements configured to receive an indication, or any combination thereof.

該裝置還包括用於處理音訊資料以驗證以該音訊資料表示的聲音對應於該音訊事件的構件。例如,用於處理音訊資料的構件可以對應於一或多個處理器126、一或多個處理器1516、一或多個處理器1916、處理器3706、一或多個處理器3710、被配置為處理音訊資料以驗證以該音訊資料表示的聲音對應於該音訊事件的一或多個其他電路或元件、或其任意組合。The apparatus also includes means for processing audio data to verify that a sound represented by the audio data corresponds to the audio event. For example, the means for processing audio data may correspond to the one or more processors 126, the one or more processors 1516, the one or more processors 1916, the processor 3706, the one or more processors 3710, one or more other circuits or components configured to process audio data to verify that the sound represented by the audio data corresponds to the audio event, or any combination thereof.

在一些實施方式中,一種非暫時性電腦可讀取媒體(例如,電腦可讀儲存裝置,諸如記憶體114或記憶體3786)包括指令(例如,指令115或指令3756),當由一或多個處理器(例如,一或多個處理器116、一或多個處理器3710或處理器3706)執行時,該些指令使得一或多個處理器從多個麥克風(例如,麥克風102、104)接收音訊信號(例如,音訊信號170、172)。當由一或多個處理器執行時,該些指令還使得一或多個處理器處理音訊信號以產生與以音訊信號中的一或多個表示的聲音(例如,聲音182)的一或多個源(例如,一或多個源180)相對應的到達方向資訊(例如,到達方向資訊142)。當由一或多個處理器執行時,該些指令還使得一或多個處理器向第二設備(例如,設備120)發送基於到達方向資訊和與到達方向資訊相關聯的類別或嵌入的資料。In some implementations, a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as memory 114 or memory 3786) includes instructions (e.g., instructions 115 or instructions 3756) that, when executed by one or more processors (e.g., one or more processors 116, one or more processors 3710, or processor 3706), cause the one or more processors to receive audio signals (e.g., audio signals 170, 172) from a plurality of microphones (e.g., microphones 102, 104). The instructions, when executed by the one or more processors, also cause the one or more processors to process the audio signals to generate direction-of-arrival information (e.g., direction-of-arrival information 142) corresponding to one or more sources (e.g., one or more sources 180) of sound (e.g., sound 182) represented by one or more of the audio signals. The instructions, when executed by the one or more processors, also cause the one or more processors to send, to a second device (e.g., device 120), data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.

在一些實施方式中,一種非暫時性電腦可讀取媒體(例如,電腦可讀儲存裝置,諸如記憶體3786)包括指令(例如,指令3756),當由交通工具(例如,交通工具1410)的一或多個處理器(例如,一或多個處理器3710或處理器3706)執行時,該些指令使得一或多個處理器從多個麥克風(例如,麥克風1402、1404)接收多個音訊信號(例如,音訊信號1470、1472)。當由一或多個處理器執行時,該些指令還使得一或多個處理器處理多個音訊信號以產生與以音訊信號中的一或多個表示的聲音(例如,聲音1482)的一或多個源(例如,一或多個源1480)相對應的到達方向資訊(例如,到達方向資訊1442)。當由一或多個處理器執行時,該些指令還使得一或多個處理器基於到達方向資訊產生指示至少一個偵測到的事件和偵測到的事件的方向的報告(例如,報告1446)。In some implementations, a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as memory 3786) includes instructions (e.g., instructions 3756) that, when executed by one or more processors (e.g., one or more processors 3710 or processor 3706) of a vehicle (e.g., vehicle 1410), cause the one or more processors to receive a plurality of audio signals (e.g., audio signals 1470, 1472) from a plurality of microphones (e.g., microphones 1402, 1404). The instructions, when executed by the one or more processors, also cause the one or more processors to process the plurality of audio signals to generate direction-of-arrival information (e.g., direction-of-arrival information 1442) corresponding to one or more sources (e.g., one or more sources 1480) of sound (e.g., sound 1482) represented by one or more of the audio signals. The instructions, when executed by the one or more processors, also cause the one or more processors to generate, based on the direction-of-arrival information, a report (e.g., report 1446) indicating at least one detected event and a direction of the detected event.

在一些實施方式中,一種非暫時性電腦可讀取媒體(例如,電腦可讀儲存裝置,諸如記憶體124、記憶體1514或記憶體3786)包括指令(例如,指令125、指令1515或指令3756),當由一或多個處理器(例如,一或多個處理器126、一或多個處理器1516、一或多個處理器3710或處理器3706)執行時,該些指令使得一或多個處理器從第一設備接收與音訊事件相對應的音訊類別的指示(例如,指示902、指示1552或指示1602)。In some implementations, a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as memory 124, memory 1514, or memory 3786) includes instructions (e.g., instructions 125, instructions 1515, or instructions 3756) that, when executed by one or more processors (e.g., one or more processors 126, one or more processors 1516, one or more processors 3710, or processor 3706), cause the one or more processors to receive, from a first device, an indication (e.g., indication 902, indication 1552, or indication 1602) of an audio class corresponding to an audio event.

本案包括以下第一組示例。The present disclosure includes the following first set of examples.

示例1包括一種第一設備,該第一設備包括:記憶體,被配置為儲存指令;及一或多個處理器,被配置為:從多個麥克風接收多個音訊信號;處理多個音訊信號以產生與以音訊信號中的一或多個表示的聲音的一或多個源相對應的到達方向資訊;及向第二設備發送基於到達方向資訊的資料。Example 1 includes a first device comprising: a memory configured to store instructions; and one or more processors configured to: receive a plurality of audio signals from a plurality of microphones; process the plurality of audio signals to generate direction of arrival information corresponding to one or more sources of sound represented by one or more of the audio signals; and sending data based on the direction of arrival information to a second device.

示例2包括根據示例1之第一設備,其中記憶體和一或多個處理器被整合到耳機設備中,並且其中第二設備對應於行動電話。Example 2 includes the first device according to example 1, wherein the memory and the one or more processors are integrated into the headset device, and wherein the second device corresponds to a mobile phone.

示例3包括根據示例1之第一設備,其中記憶體和一或多個處理器被整合到行動電話中,並且其中第二設備對應於耳機設備。Example 3 includes the first device according to example 1, wherein the memory and the one or more processors are integrated into the mobile phone, and wherein the second device corresponds to a headset device.

示例4包括根據示例1至3中任一項所述的第一設備,其中被發送到第二設備的資料觸發第二設備處的一或多個感測器的啟動。Example 4 includes the first device of any one of Examples 1-3, wherein the data sent to the second device triggers activation of one or more sensors at the second device.

示例5包括根據示例1至4中任一項所述的第一設備,其中一或多個感測器中的至少一個包括非音訊感測器。Example 5 includes the first device of any of Examples 1-4, wherein at least one of the one or more sensors comprises a non-audio sensor.

示例6包括根據示例1至5中任一項所述的第一設備,其中非音訊感測器包括360度相機。Example 6 includes the first device of any one of Examples 1-5, wherein the non-audio sensor includes a 360-degree camera.

示例7包括根據示例1至6中任一項所述的第一設備,其中非音訊感測器包括雷射雷達感測器。Example 7 includes the first device of any one of Examples 1-6, wherein the non-audio sensor comprises a lidar sensor.

示例8包括根據示例1至7中任一項所述的第一設備,其中一或多個處理器包括在低功率狀態下操作的第一處理域。Example 8 includes the first device of any one of Examples 1-7, wherein the one or more processors include a first processing domain operating in a low power state.

示例9包括根據示例1至8中任一項所述的第一設備,其中一或多個處理器還包括在高功率狀態下操作的第二處理域,第二處理域被配置為處理多個音訊信號以產生到達方向資訊。Example 9 includes the first device of any one of Examples 1 to 8, wherein the one or more processors further include a second processing domain operating in a high power state, the second processing domain being configured to process the plurality of audio signals to generate the direction-of-arrival information.
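Examples 8 and 9 describe a low-power first processing domain and a high-power second processing domain. A minimal sketch of such a handoff is shown below; the energy-based activity test, its threshold, and the placeholder direction-of-arrival result are illustrative assumptions, not the disclosed implementation.

```python
class DualDomainProcessor:
    """Sketch of a low-power domain that watches for sound activity and
    wakes a high-power domain to compute the direction of arrival."""

    def __init__(self, activity_threshold: float = 0.1):
        self.activity_threshold = activity_threshold
        self.high_power_active = False

    def process(self, samples):
        # First (low-power) domain: cheap energy-based activity check.
        energy = sum(s * s for s in samples) / max(len(samples), 1)
        if energy < self.activity_threshold:
            self.high_power_active = False
            return None  # stay in the low power state
        # Second (high-power) domain: placeholder DOA computation.
        self.high_power_active = True
        return {"direction_of_arrival_deg": 0.0}
```

The point of the split is that the expensive direction-of-arrival computation runs only when the cheap always-on check sees activity.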

示例10包括根據示例1至9中任一項所述的第一設備,其中一或多個處理器還被配置為:處理多個音訊信號以執行音訊事件偵測;及向第二設備發送與偵測到的音訊事件相對應的資料。Example 10 includes the first device of any one of Examples 1 to 9, wherein the one or more processors are further configured to: process the plurality of audio signals to perform audio event detection; and send data corresponding to a detected audio event to the second device.

示例11包括根據示例1至9中任一項所述的第一設備,其中一或多個處理器還被配置為:基於音訊事件偵測操作產生與偵測到的音訊事件相對應的事件資料;及將事件資料發送到第二設備。Example 11 includes the first device of any one of Examples 1 to 9, wherein the one or more processors are further configured to: generate, based on an audio event detection operation, event data corresponding to a detected audio event; and send the event data to the second device.

示例12包括根據示例1至11中任一項所述的第一設備,其中一或多個處理器還被配置為:處理多個音訊信號以執行聲學環境偵測;及向第二設備發送與偵測到的環境相對應的資料。Example 12 includes the first device of any one of Examples 1 to 11, wherein the one or more processors are further configured to: process the plurality of audio signals to perform acoustic environment detection; and send data corresponding to the detected environment to the second device.

示例13包括根據示例1至11中任一項所述的第一設備,其中一或多個處理器還被配置為基於聲學環境偵測操作產生與偵測到的環境相對應的環境資料。Example 13 includes the first device of any one of Examples 1 to 11, wherein the one or more processors are further configured to generate, based on an acoustic environment detection operation, environment data corresponding to a detected environment.

示例14包括根據示例1至13中任一項所述的第一設備,其中一或多個處理器還被配置為:基於到達方向資訊對多個音訊信號執行空間處理,以產生波束成形音訊信號;及將波束成形音訊信號發送到第二設備。Example 14 includes the first device of any one of Examples 1 to 13, wherein the one or more processors are further configured to: perform spatial processing on the plurality of audio signals based on the direction-of-arrival information to generate a beamformed audio signal; and send the beamformed audio signal to the second device.

示例15包括根據示例1至14中任一項所述的第一設備,其中一或多個處理器還被配置為基於到達方向資訊調整多個麥克風中的至少一個麥克風的焦點。Example 15 includes the first device of any one of Examples 1-14, wherein the one or more processors are further configured to adjust the focus of at least one of the plurality of microphones based on the direction of arrival information.

示例16包括根據示例1至15中任一項所述的第一設備,還包括數據機,其中資料經由數據機被發送到第二設備。Example 16 includes the first device of any one of Examples 1 to 15, further including a modem, wherein the data is sent to the second device via the modem.

示例17包括根據示例1至16中任一項所述的第一設備,其中一或多個處理器還被配置為向第二設備發送多個音訊信號的表示。Example 17 includes the first device of any one of Examples 1-16, wherein the one or more processors are further configured to send the plurality of representations of the audio signal to the second device.

示例18包括根據示例17之第一設備,其中多個音訊信號的表示對應於一或多個波束成形音訊信號。Example 18 includes the first apparatus according to example 17, wherein the representations of the plurality of audio signals correspond to one or more beamformed audio signals.

示例19包括根據示例1至18中任一項所述的第一設備,其中一或多個處理器還被配置為產生指示環境事件或聲學事件中的至少一個的使用者介面輸出。Example 19 includes the first device of any one of Examples 1-18, wherein the one or more processors are further configured to generate a user interface output indicative of at least one of an environmental event or an acoustic event.

示例20包括根據示例1至19中任一項所述的第一設備,其中一或多個處理器還被配置為從第二設備接收指示聲學事件的資料。Example 20 includes the first device of any one of Examples 1-19, wherein the one or more processors are further configured to receive material indicative of the acoustic event from the second device.

示例21包括根據示例1至20中任一項所述的第一設備,其中一或多個處理器還被配置為從第二設備接收指示環境事件的資料。Example 21 includes the first device of any one of Examples 1-20, wherein the one or more processors are further configured to receive material indicative of an environmental event from the second device.

示例22包括根據示例1至21中任一項所述的第一設備,其中一或多個處理器還被配置為從第二設備接收指示波束成形音訊信號的資料。Example 22 includes the first device of any one of Examples 1-21, wherein the one or more processors are further configured to receive data indicative of beamformed audio signals from the second device.

示例23包括根據示例1至22中任一項所述的第一設備,其中一或多個處理器還被配置為:從第二設備接收與多個音訊信號相關聯的方向資訊;及基於方向資訊執行音訊縮放操作。Example 23 includes the first device of any one of Examples 1 to 22, wherein the one or more processors are further configured to: receive, from the second device, direction information associated with the plurality of audio signals; and perform an audio zoom operation based on the direction information.

示例24包括根據示例1至23中任一項所述的第一設備,其中一或多個處理器還被配置為:從第二設備接收與多個音訊信號相關聯的方向資訊;及基於方向資訊執行除噪操作。Example 24 includes the first device of any one of Examples 1 to 23, wherein the one or more processors are further configured to: receive, from the second device, direction information associated with the plurality of audio signals; and perform a denoising operation based on the direction information.

示例25包括根據示例1至24中任一項所述的第一設備,還包括多個麥克風。Example 25 includes the first device of any one of Examples 1-24, further comprising a plurality of microphones.

示例26包括根據示例1至25中任一項所述的第一設備,還包括被配置為輸出與多個音訊信號中的至少一個音訊信號相關聯的聲音的至少一個揚聲器。Example 26 includes the first device of any one of Examples 1-25, further comprising at least one speaker configured to output a sound associated with at least one audio signal of the plurality of audio signals.

示例27包括根據示例1至26中任一項所述的第一設備,其中一或多個處理器被整合在交通工具中。Example 27 includes the first device of any one of Examples 1-26, wherein the one or more processors are integrated in a vehicle.

示例28包括根據示例1至27中任一項所述的第一設備,其中基於到達方向資訊的資料包括指示至少一個偵測到的事件和偵測到的事件的方向的報告。Example 28 includes the first device of any one of Examples 1 to 27, wherein the data based on the direction-of-arrival information includes a report indicating at least one detected event and a direction of the detected event.

示例29包括一種處理音訊的方法,該方法包括:在第一設備的一或多個處理器處接收來自多個麥克風的多個音訊信號;處理多個音訊信號以產生與以音訊信號中的一或多個表示的聲音的一或多個源相對應的到達方向資訊;及向第二設備發送基於到達方向資訊的資料。Example 29 includes a method of processing audio, the method comprising: receiving, at one or more processors of a first device, a plurality of audio signals from a plurality of microphones; processing the plurality of audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented by one or more of the audio signals; and sending data based on the direction-of-arrival information to a second device.

示例30包括根據示例29之方法,還包括:處理多個音訊信號以執行音訊事件偵測;及向第二設備發送與偵測到的音訊事件相對應的資料。Example 30 includes the method of example 29, further comprising: processing the plurality of audio signals to perform audio event detection; and sending data corresponding to the detected audio events to the second device.

示例31包括根據示例30之方法,其中音訊事件偵測包括在一或多個分類器處處理多個音訊信號中的一或多個,以從由一或多個分類器支持的多個類別當中為以音訊信號中的一或多個表示的聲音決定類別,其中與偵測到的音訊事件相對應的資料包括類別的指示。Example 31 includes the method of example 30, wherein the audio event detection includes processing one or more of the plurality of audio signals at one or more classifiers to determine, from among a plurality of classes supported by the one or more classifiers, a class for the sound represented by one or more of the audio signals, wherein the data corresponding to the detected audio event includes an indication of the class.
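The classifier-based audio event detection of example 31 can be sketched as follows. The classifiers here are stand-in callables returning a (class, score) pair, an assumption of this sketch; the disclosure contemplates trained classifiers operating on the audio signals.

```python
def detect_audio_event(frames, classifiers):
    """Run one or more classifiers over audio frames and report the
    highest-scoring class as the indication of the category. Returns a
    dict with whether an event was detected and the winning class."""
    best_class, best_score = None, float("-inf")
    for classifier in classifiers:
        name, score = classifier(frames)
        if score > best_score:
            best_class, best_score = name, score
    return {"event_detected": best_class is not None, "category": best_class}
```

The returned "category" field corresponds to the indication of the class that example 31 says is included in the data sent for a detected audio event.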

示例32包括根據示例29至31中任一項所述的方法,還包括:處理多個音訊信號以執行聲學環境偵測;及向第二設備發送與偵測到的環境相對應的資料。Example 32 includes the method of any one of Examples 29-31, further comprising: processing the plurality of audio signals to perform acoustic environment detection; and sending data corresponding to the detected environment to the second device.

示例33包括根據示例29至32中任一項所述的方法,其中資料經由數據機被發送到第二設備。Example 33 includes the method of any one of Examples 29 to 32, wherein the profile is sent to the second device via a modem.

示例34包括根據示例29至33中任一項所述的方法,還包括向第二設備發送多個音訊信號的表示。Example 34 includes the method of any one of Examples 29-33, further comprising sending a plurality of representations of the audio signal to the second device.

示例35包括根據示例29至34中任一項所述的方法,其中被發送到第二設備的基於到達方向資訊的資料觸發第二設備處的一或多個感測器的啟動。Example 35 includes the method of any one of Examples 29 to 34, wherein the data sent to the second device based on the direction of arrival information triggers activation of one or more sensors at the second device.

示例36包括根據示例29至35中任一項所述的方法,其中一或多個感測器中的至少一個包括非音訊感測器。Example 36 includes the method of any of Examples 29-35, wherein at least one of the one or more sensors includes a non-audio sensor.

Example 37 includes a device comprising: a memory configured to store instructions; and a processor configured to execute the instructions to perform the method of any one of Examples 29 to 36.

Example 38 includes a device comprising a non-transitory computer-readable medium containing instructions that, when executed by one or more processors of a first device, cause the one or more processors to perform the method of any one of Examples 29 to 36.

Example 39 includes an apparatus comprising means for performing the method of any one of Examples 29 to 36.

Example 40 includes a non-transitory computer-readable medium containing instructions that, when executed by one or more processors of a first device, cause the one or more processors to: receive a plurality of audio signals from a plurality of microphones; process the plurality of audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and send data based on the direction-of-arrival information to a second device.

Example 41 includes the non-transitory computer-readable medium of Example 40, wherein the data sent to the second device triggers activation of one or more sensors at the second device.

Example 42 includes the non-transitory computer-readable medium of Example 40 or Example 41, wherein at least one of the one or more sensors includes a non-audio sensor.

Example 43 includes the non-transitory computer-readable medium of any one of Examples 40 to 42, wherein the instructions are executable to further cause the one or more processors to send a representation of the plurality of audio signals to the second device.

Example 44 includes the non-transitory computer-readable medium of Example 43, wherein the representation of the plurality of audio signals corresponds to one or more beamformed audio signals.

Example 45 includes a first device comprising: means for receiving a plurality of audio signals from a plurality of microphones; means for processing the plurality of audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and means for sending data based on the direction-of-arrival information to a second device.

Example 46 includes a vehicle comprising: a memory configured to store instructions; and one or more processors configured to execute the instructions to: receive a plurality of audio signals from a plurality of microphones; process the plurality of audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and generate, based on the direction-of-arrival information, a report indicating at least one detected event and a direction of the detected event.

Example 47 includes the vehicle of Example 46, wherein the one or more processors are further configured to send the report to a second device.

Example 48 includes the vehicle of any one of Examples 46 to 47, wherein the second device includes a second vehicle.

Example 49 includes the vehicle of any one of Examples 46 to 48, wherein the second device includes a server.

Example 50 includes the vehicle of any one of Examples 46 to 49, wherein the one or more processors are further configured to: receive navigation instructions from the second device; and navigate based on the navigation instructions.

Example 51 includes the vehicle of any one of Examples 46 to 50, wherein the one or more processors are further configured to: receive a second report from the second device; and navigate based on the report and the second report.

Example 52 includes the vehicle of any one of Examples 46 to 51, wherein the one or more processors are further configured to: receive a second report from the second device; generate navigation instructions based on the second report; and send the navigation instructions to the second device.

Example 53 includes the vehicle of any one of Examples 46 to 52, wherein the report indicates a list of events detected over a period of time and direction information for the detected events.

Example 54 includes a method of processing audio, the method comprising: receiving, at one or more processors of a vehicle, a plurality of audio signals from a plurality of microphones; processing the plurality of audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and generating, based on the direction-of-arrival information, a report indicating at least one detected event and a direction of the detected event.
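Examples 53 and 54 describe a report listing events detected over a period of time together with their directions. As a purely illustrative data shape (the disclosure does not prescribe a format), such a report might be accumulated like this:

```python
import time
from dataclasses import dataclass, field

@dataclass
class DetectedEvent:
    label: str            # e.g. "siren" or "horn"; labels are illustrative
    direction_deg: float  # direction of arrival of the event
    timestamp: float

@dataclass
class EventReport:
    """Accumulates detected events over a reporting window."""
    events: list = field(default_factory=list)

    def add(self, label, direction_deg, timestamp=None):
        self.events.append(
            DetectedEvent(label, direction_deg,
                          time.time() if timestamp is None else timestamp))

    def summary(self):
        # The (event, direction) list that a vehicle could send to a
        # second device such as another vehicle or a server
        return [(e.label, e.direction_deg) for e in self.events]
```

In the scenarios of Examples 55 to 60, the serialized form of such a report would be exchanged between vehicles or with a server to inform navigation.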

Example 55 includes the method of Example 54, further comprising sending the report to a second device.

Example 56 includes the method of any one of Examples 54 to 55, wherein the second device includes a second vehicle.

Example 57 includes the method of any one of Examples 54 to 56, wherein the second device includes a server.

Example 58 includes the method of any one of Examples 54 to 57, further comprising: receiving navigation instructions from the second device; and navigating based on the navigation instructions.

Example 59 includes the method of any one of Examples 54 to 58, further comprising: receiving a second report from the second device; and navigating based on the report and the second report.

Example 60 includes the method of any one of Examples 54 to 59, further comprising: receiving a second report from the second device; generating navigation instructions based on the second report; and sending the navigation instructions to the second device.

Example 61 includes the method of any one of Examples 54 to 60, wherein the report indicates a list of events detected over a period of time and direction information for the detected events.

Example 62 includes a non-transitory computer-readable medium containing instructions that, when executed by one or more processors of a vehicle, cause the one or more processors to: receive a plurality of audio signals from a plurality of microphones; process the plurality of audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and generate, based on the direction-of-arrival information, a report indicating at least one detected event and a direction of the detected event.

Example 63 includes the non-transitory computer-readable medium of Example 62, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to send the report to a second device.

Example 64 includes the non-transitory computer-readable medium of any one of Examples 62 to 63, wherein the second device includes a second vehicle.

Example 65 includes the non-transitory computer-readable medium of any one of Examples 62 to 64, wherein the second device includes a server.

Example 66 includes the non-transitory computer-readable medium of any one of Examples 62 to 65, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to: receive navigation instructions from the second device; and navigate based on the navigation instructions.

Example 67 includes the non-transitory computer-readable medium of any one of Examples 62 to 66, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to: receive a second report from the second device; and navigate based on the report and the second report.

Example 68 includes the non-transitory computer-readable medium of any one of Examples 62 to 67, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to: receive a second report from the second device; generate navigation instructions based on the second report; and send the navigation instructions to the second device.

Example 69 includes the non-transitory computer-readable medium of any one of Examples 62 to 68, wherein the report indicates a list of events detected over a period of time and direction information for the detected events.

Example 70 includes a vehicle comprising: means for receiving a plurality of audio signals from a plurality of microphones; means for processing the plurality of audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and means for generating, based on the direction-of-arrival information, a report indicating at least one detected event and a direction of the detected event.

Example 71 includes the vehicle of Example 70, further comprising means for sending the report to a second device.

Example 72 includes the vehicle of any one of Examples 70 to 71, wherein the second device includes a second vehicle.

Example 73 includes the vehicle of any one of Examples 70 to 72, wherein the second device includes a server.

Example 74 includes the vehicle of any one of Examples 70 to 73, wherein the report indicates a list of events detected over a period of time and direction information for the detected events.

Example 75 includes the vehicle of any one of Examples 70 to 74, further comprising a device that performs autonomous navigation based on the report.

The present application includes the following second set of examples.

According to Example 1, a first device includes: a memory configured to store instructions; and one or more processors configured to: receive audio signals from a plurality of microphones; process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and send, to a second device, data based on the direction-of-arrival information and a class or an embedding associated with the direction-of-arrival information.

Example 2 includes the first device of Example 1, wherein the one or more processors are further configured to process signal data corresponding to the audio signals to determine the class or the embedding.

Example 3 includes the first device of Example 2, wherein the one or more processors are further configured to perform a beamforming operation on the audio signals to generate the signal data.

Example 4 includes the first device of Example 2 or Example 3, wherein the one or more processors are further configured to process the signal data at one or more classifiers to determine, from among multiple classes supported by the one or more classifiers, the class for a sound that is represented in one or more of the audio signals and that is associated with an audio event, and wherein the class is sent to the second device.

Example 5 includes the first device of any one of Examples 2 to 4, wherein the one or more processors are further configured to process the signal data at one or more encoders to generate the embedding, the embedding corresponding to a sound that is represented in one or more of the audio signals and that is associated with an audio event, and wherein the embedding is sent to the second device.

Example 6 includes the first device of any one of Examples 1 to 5, wherein the one or more processors are further configured to process image data at one or more encoders to generate the embedding, the embedding corresponding to an object that is represented in the image data and that is associated with an audio event, and wherein the embedding is sent to the second device.

Example 7 includes the first device of Example 6, further comprising one or more cameras configured to generate the image data.

Example 8 includes the first device of any one of Examples 1 to 7, wherein the one or more processors are further configured to generate, based on an acoustic environment detection operation, environment data corresponding to a detected environment.

Example 9 includes the first device of any one of Examples 1 to 8, wherein the one or more processors are further configured to: perform spatial processing on the audio signals based on the direction-of-arrival information to generate one or more beamformed audio signals; and send the one or more beamformed audio signals to the second device.
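The spatial processing of Example 9 is stated generically; one common realization is frequency-domain delay-and-sum beamforming toward the estimated direction of arrival. The sketch below assumes a linear array and uses an illustrative sign convention (positions in meters along the array axis, broadside at 0 degrees); it is not the disclosed implementation.

```python
import numpy as np

def delay_and_sum(channels, fs, mic_positions_m, angle_deg, c=343.0):
    """Steer a linear microphone array toward `angle_deg` (0 = broadside).

    `channels` is a (num_mics, num_samples) array; each channel is advanced
    by its per-microphone arrival delay and the results are averaged.
    """
    delays = np.asarray(mic_positions_m) * np.sin(np.radians(angle_deg)) / c
    n = channels.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for ch, tau in zip(channels, delays):
        # Multiplying by e^{+j 2*pi*f*tau} advances the channel by tau seconds
        out += np.fft.irfft(np.fft.rfft(ch) * np.exp(2j * np.pi * freqs * tau), n=n)
    return out / channels.shape[0]
```

The beamformed output of such an operation is one candidate for the "representation of the audio signals" that Examples 13 and 14 send to the second device.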

Example 10 includes the first device of any one of Examples 1 to 9, wherein the memory and the one or more processors are integrated into a headset device, and wherein the second device corresponds to a mobile phone.

Example 11 includes the first device of any one of Examples 1 to 9, wherein the one or more processors are integrated in a vehicle.

Example 12 includes the first device of any one of Examples 1 to 11, further comprising a modem, wherein the data is sent to the second device via the modem.

Example 13 includes the first device of any one of Examples 1 to 12, wherein the one or more processors are further configured to send a representation of the audio signals to the second device.

Example 14 includes the first device of Example 13, wherein the representation of the audio signals corresponds to one or more beamformed audio signals.

Example 15 includes the first device of any one of Examples 1 to 14, wherein the one or more processors are further configured to generate a user interface output indicating at least one of an environmental event or an acoustic event.

Example 16 includes the first device of any one of Examples 1 to 15, wherein the one or more processors are further configured to receive, from the second device, data indicating an acoustic event.

Example 17 includes the first device of any one of Examples 1 to 16, wherein the one or more processors are further configured to: receive, from the second device, direction information associated with the audio signals; and perform an audio zoom operation based on the direction information.

Example 18 includes the first device of any one of Examples 1 to 17, wherein the data based on the direction-of-arrival information includes a report indicating at least one detected event and a direction of the detected event.

Example 19 includes the first device of any one of Examples 1 to 18, further comprising the plurality of microphones.

Example 20 includes the first device of any one of Examples 1 to 19, further comprising at least one speaker configured to output a sound associated with at least one of the audio signals.

Example 21 includes the first device of any one of Examples 1 to 20, wherein: the class corresponds to a classification of a particular sound that is represented in the audio signals and that is associated with a particular audio event; and the embedding includes a signature or information corresponding to the particular sound or the particular audio event and is configured to enable detection of the particular sound or the particular audio event in other audio signals via processing of the other audio signals.
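The class/embedding distinction of Example 21 can be made concrete numerically: a class is a discrete label, while an embedding is a vector signature that lets another device re-detect the same sound in other audio signals. The toy "encoder" below is a fixed log-spectral band summary standing in for the trained encoders of Examples 5 and 6; the function names, band count, and threshold are all illustrative assumptions.

```python
import numpy as np

def spectral_embedding(audio, n_bands=16):
    """Toy encoder: a normalized log-energy signature over coarse bands."""
    spec = np.abs(np.fft.rfft(audio * np.hanning(audio.size)))
    emb = np.log1p(np.array([b.mean() for b in np.array_split(spec, n_bands)]))
    return emb / (np.linalg.norm(emb) + 1e-12)

def same_sound(emb_a, emb_b, threshold=0.9):
    """Cosine-similarity test: does a new clip carry the same signature?"""
    return float(np.dot(emb_a, emb_b)) >= threshold
```

A second device holding such an embedding could scan its own audio for segments whose embeddings match, which is the detection capability Example 21 attributes to the embedding.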

According to Example 22, a system includes: the first device of any one of Examples 1 to 21; and a second device including one or more processors configured to: receive the data; and process the data to verify the class, to modify audio data representing a sound scene based on the direction-of-arrival information and the embedding to generate modified audio data corresponding to an updated sound scene, or both.

According to Example 23, a system includes: a first device including: a memory configured to store instructions; and one or more processors configured to: receive audio signals from a plurality of microphones; process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and send data based on the direction-of-arrival information and a class associated with the direction-of-arrival information; and a second device including one or more processors configured to: receive the data based on the direction-of-arrival information and the class; obtain audio data representing a sound associated with the direction-of-arrival information and the class; and verify the class based at least on the audio data and the direction-of-arrival information.

According to Example 24, a system includes: a first device including: a memory configured to store instructions; and one or more processors configured to: receive audio signals from a plurality of microphones; process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and send data based on the direction-of-arrival information and an embedding associated with the direction-of-arrival information; and a second device including one or more processors configured to: receive the data based on the direction-of-arrival information and the embedding; and process audio data representing a sound scene based on the direction-of-arrival information and the embedding to generate modified audio data corresponding to an updated sound scene.

According to Example 25, a method of processing audio includes: receiving, at one or more processors of a first device, audio signals from a plurality of microphones; processing the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and sending, to a second device, data based on the direction-of-arrival information and a class or an embedding associated with the direction-of-arrival information.

Example 26 includes the method of Example 25, further comprising processing signal data corresponding to the audio signals to determine the class or the embedding.

Example 27 includes the method of Example 26, further comprising performing a beamforming operation on the audio signals to generate the signal data.

Example 28 includes the method of Example 26 or Example 27, wherein the signal data is processed at one or more classifiers to determine, from among multiple classes supported by the one or more classifiers, the class for a sound that is represented in one or more of the audio signals and that is associated with an audio event, and wherein the class is sent to the second device.

Example 29 includes the method of any one of Examples 26 to 28, wherein the signal data is processed at one or more encoders to generate the embedding, the embedding corresponding to a sound that is represented in one or more of the audio signals and that is associated with an audio event, and wherein the embedding is sent to the second device.

Example 30 includes the method of any one of Examples 25 to 29, further comprising sending a representation of the audio signals to the second device.

Example 31 includes the method of any one of Examples 25 to 30, further comprising: receiving, at one or more processors of the second device, the data based on the direction-of-arrival information and the class; obtaining, at the one or more processors of the second device, audio data representing a sound associated with the direction-of-arrival information and the class; and verifying, at the one or more processors of the second device, the class based at least on the audio data and the direction-of-arrival information.

Example 32 includes the method of any one of Examples 25 to 31, further comprising: receiving, at one or more processors of the second device, the data based on the direction-of-arrival information and the embedding; and processing, at the one or more processors of the second device, audio data representing a sound scene based on the direction-of-arrival information and the embedding to generate modified audio data corresponding to an updated sound scene.

According to Example 33, a device includes: a memory configured to store instructions; and a processor configured to execute the instructions to perform the method of any one of Examples 25 to 30.

According to Example 34, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of Examples 25 to 30.

According to Example 35, an apparatus includes means for implementing the method of any one of Examples 25 to 30.

According to Example 36, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors of a first device, cause the one or more processors to: receive audio signals from a plurality of microphones; process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and send, to a second device, data based on the direction-of-arrival information and a class or an embedding associated with the direction-of-arrival information.

Example 37 includes the non-transitory computer-readable medium of Example 36, wherein the instructions are executable to further cause the one or more processors to send a representation of the audio signals to the second device.

Example 38 includes the non-transitory computer-readable medium of Example 37, wherein the representation of the audio signals corresponds to one or more beamformed audio signals.

According to Example 39, a first device includes: means for receiving audio signals from a plurality of microphones; means for processing the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and means for sending, to a second device, data based on the direction-of-arrival information and a class or an embedding associated with the direction-of-arrival information.

The present application includes the following third set of examples.

According to Example 1, a second device includes: a memory configured to store instructions; and one or more processors configured to receive, from a first device, an indication of an audio class corresponding to an audio event.

Example 2 includes the second device of Example 1, wherein the one or more processors are further configured to: receive, from the first device, audio data representing a sound associated with the audio event; and process the audio data at one or more classifiers to verify that the sound corresponds to the audio event.

Example 3 includes the second device of Example 2, wherein the one or more processors are configured to provide the audio data and the indication of the audio class as inputs to the one or more classifiers to determine a classification associated with the audio data.
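The verification flow of Examples 2 and 3 — the second device re-classifying the received audio and checking the result against the first device's reported class — reduces to a small wrapper. The classifier interface assumed here (a callable returning label probabilities) and the confidence threshold are illustrative, not the disclosed API:

```python
def verify_audio_event(classifier, audio_data, reported_class, min_confidence=0.5):
    """Return True if the second device's classifier agrees with the
    audio class reported by the first device for this audio data."""
    probabilities = classifier(audio_data)  # e.g. {"siren": 0.8, "horn": 0.1}
    return probabilities.get(reported_class, 0.0) >= min_confidence
```

Per Example 8, the direction data received from the first device could be passed to the classifier alongside the audio data and the reported class as an additional input feature.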

示例4包括根據示例2或示例3之第二設備,其中音訊類別對應於交通工具事件,並且其中一或多個處理器還被配置為基於第一設備的位置和一或多個第三設備的位置向一或多個第三設備發送交通工具事件的通知。Example 4 includes the second device according to example 2 or example 3, wherein the audio category corresponds to a vehicle event, and wherein the one or more processors are further configured to The location sends a notification of the vehicle event to one or more third devices.

示例5包括根據示例2至4中任一項所述的第二設備,其中一或多個處理器還被配置為基於一或多個分類器的輸出向第一設備發送控制信號。Example 5 includes the second device of any one of Examples 2-4, wherein the one or more processors are further configured to send a control signal to the first device based on the output of the one or more classifiers.

示例6包括根據示例5之第二設備,其中控制信號指示第一設備執行音訊縮放操作。Example 6 includes the second device of example 5, wherein the control signal instructs the first device to perform an audio zoom operation.

示例7包括根據示例5或示例6之第二設備,其中控制信號指示第一設備基於聲源的方向執行空間處理。Example 7 includes the second device according to example 5 or example 6, wherein the control signal instructs the first device to perform spatial processing based on the direction of the sound source.

示例8包括根據示例2至7中任一項所述的第二設備,其中一或多個處理器還被配置為:從第一設備接收與聲源相對應的方向資料;及將音訊資料、方向資料和音訊類別的指示作為輸入提供給一或多個分類器,以決定與音訊資料相關聯的分類。Example 8 includes the second device of any one of Examples 2-7, wherein the one or more processors are further configured to: receive direction data corresponding to sound sources from the first device; and convert the audio data, The directional data and an indication of the audio class are provided as input to one or more classifiers to determine a class associated with the audio data.

示例9包括根據示例2至8中任一項所述的第二設備,其中音訊資料包括一或多個波束成形信號。Example 9 includes the second device of any one of Examples 2-8, wherein the audio data includes one or more beamforming signals.

示例10包括根據示例1至9中任一項所述的第二設備,其中一或多個處理器還被配置為:從第一設備接收對應於與音訊事件相關聯的聲源的方向資料;基於音訊事件更新音訊場景中的定向聲源的地圖,以產生更新後的地圖;及向地理上遠離第一設備的一或多個第三設備發送與更新後的地圖相對應的資料。Example 10 includes the second device of any one of Examples 1-9, wherein the one or more processors are further configured to: receive, from the first device, directional data corresponding to a sound source associated with the audio event; updating a map of directional sound sources in the audio scene based on the audio event to generate an updated map; and sending data corresponding to the updated map to one or more third devices geographically remote from the first device.

示例11包括根據示例1至10中任一項所述的第二設備,其中記憶體和一或多個處理器被整合到行動電話中,並且其中第一設備對應於耳機設備。Example 11 includes the second device of any one of Examples 1-10, wherein the memory and the one or more processors are integrated into a mobile phone, and wherein the first device corresponds to a headset device.

示例12包括根據示例1至10中任一項所述的第二設備,其中記憶體和一或多個處理器被整合到交通工具中。Example 12 includes the second device of any one of Examples 1-10, wherein the memory and the one or more processors are integrated into a vehicle.

示例13包括根據示例1至12中任一項所述的第二設備,還包括數據機,其中音訊類別的指示是經由數據機接收的。Example 13 includes the second device of any one of Examples 1-12, further comprising a modem, wherein the indication of the audio category is received via the modem.

示例14包括根據示例1至13中任一項所述的第二設備，其中一或多個處理器被配置為基於是否從第一設備接收到到達方向資訊而選擇性地繞過對接收到的與音訊事件相對應的音訊資料的到達方向處理。Example 14 includes the second device of any one of Examples 1-13, wherein the one or more processors are configured to selectively bypass direction-of-arrival processing of received audio data corresponding to the audio event based on whether direction-of-arrival information is received from the first device.

示例15包括根據示例1至14中任一項所述的第二設備，其中一或多個處理器被配置為基於接收到的音訊資料是對應於來自第一設備的多通道麥克風信號還是對應於來自第一設備的波束成形信號而選擇性地繞過波束成形操作。Example 15 includes the second device of any one of Examples 1-14, wherein the one or more processors are configured to selectively bypass a beamforming operation based on whether received audio data corresponds to multi-channel microphone signals from the first device or to a beamformed signal from the first device.
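The selective-bypass behavior of Examples 14 and 15 amounts to skipping redundant stages on the second device. A minimal sketch, assuming a simple pipeline planner (the stage names and the always-run classification stage are illustrative assumptions):

```python
# Illustrative sketch of Examples 14 and 15: the second device skips its own
# direction-of-arrival (DOA) processing when the first device already supplied
# DOA information, and skips beamforming when the received audio data is
# already a beamformed signal.

def plan_processing(doa_info_received: bool, data_is_beamformed: bool):
    """Return the processing stages the second device still has to run."""
    stages = []
    if not doa_info_received:        # Example 14: bypass DOA otherwise
        stages.append("doa_processing")
    if not data_is_beamformed:       # Example 15: bypass beamforming otherwise
        stages.append("beamforming")
    stages.append("classification")  # verification of the audio event always runs
    return stages
```

When the first device has done the heavy lifting, the second device's plan collapses to classification alone, which matches the power-saving motivation behind splitting work across the two devices.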

示例16包括根據示例1至15中任一項所述的第二設備,其中:音訊類別對應於以音訊信號表示的並且與音訊事件相關聯的特定聲音的分類。Example 16 includes the second device of any one of Examples 1-15, wherein the audio category corresponds to a classification of a particular sound represented by the audio signal and associated with the audio event.

根據示例17，一種系統包括：根據示例1至16中任一項所述的第二設備；及第一設備，該第一設備包括：一或多個處理器，被配置為：從一或多個麥克風接收音訊信號；處理音訊信號以決定音訊類別；及向第二設備發送音訊類別的指示。According to Example 17, a system includes: the second device of any one of Examples 1-16; and a first device including one or more processors configured to: receive audio signals from one or more microphones; process the audio signals to determine an audio class; and send an indication of the audio class to the second device.

根據示例18，一種系統包括：第一設備，包括：一或多個處理器，被配置為：從一或多個麥克風接收音訊信號；處理音訊信號以決定與音訊事件相對應的音訊類別；及發送音訊類別的指示；及第二設備，包括一或多個處理器，該一或多個處理器被配置為：接收與音訊事件相對應的音訊類別的指示。According to Example 18, a system includes: a first device including one or more processors configured to: receive audio signals from one or more microphones; process the audio signals to determine an audio class corresponding to an audio event; and send an indication of the audio class; and a second device including one or more processors configured to receive the indication of the audio class corresponding to the audio event.

根據示例19，一種方法包括：在第二設備的一或多個處理器處接收音訊類別的指示，該指示從第一設備接收並且對應於音訊事件；及在第二設備的一或多個處理器處處理音訊資料，以驗證以該音訊資料表示的聲音對應於該音訊事件。According to Example 19, a method includes: receiving, at one or more processors of a second device, an indication of an audio class, the indication received from a first device and corresponding to an audio event; and processing, at the one or more processors of the second device, audio data to verify that a sound represented by the audio data corresponds to the audio event.

示例20包括根據示例19之方法,還包括從第一設備接收音訊資料,並且其中音訊資料的處理包括將音訊資料作為輸入提供給一或多個分類器,以決定與音訊資料相關聯的分類。Example 20 includes the method of example 19, further comprising receiving audio data from the first device, and wherein processing the audio data includes providing the audio data as input to one or more classifiers to determine a classification associated with the audio data.

示例21包括根據示例20之方法,其中音訊資料的處理還包括將音訊類別的指示作為第二輸入提供給一或多個分類器,以決定與音訊資料相關聯的分類。Example 21 includes the method of example 20, wherein the processing of the audio data further includes providing an indication of the audio category as a second input to one or more classifiers to determine a classification associated with the audio material.

示例22包括根據示例20或示例21之方法,還包括基於一或多個分類器的輸出向第一設備發送控制信號。Example 22 includes the method of example 20 or example 21, further comprising sending a control signal to the first device based on the output of the one or more classifiers.

示例23包括根據示例22之方法,其中控制信號包括音訊縮放指令。Example 23 includes the method of example 22, wherein the control signal includes an audio zoom instruction.

示例24包括根據示例22或示例23之方法,其中控制信號包括用於基於聲源的方向執行空間處理的指令。Example 24 includes the method according to example 22 or example 23, wherein the control signal includes instructions for performing spatial processing based on the direction of the sound source.

示例25包括根據示例19至24中任一項所述的方法，其中音訊類別對應於交通工具事件，並且該方法還包括基於第一設備的位置和一或多個第三設備的位置向一或多個第三設備發送交通工具事件的通知。Example 25 includes the method of any one of Examples 19-24, wherein the audio class corresponds to a vehicle event, and the method further includes sending a notification of the vehicle event to one or more third devices based on a location of the first device and locations of the one or more third devices.
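The location-based notification of Example 25 could be sketched as a simple proximity filter. This is a hedged illustration only: the 500 m radius, the equirectangular distance approximation, and the function name are assumptions not found in the patent.

```python
# Minimal sketch of Example 25: after the vehicle-event class is verified,
# third devices are selected for notification by their distance from the
# first device's reported location.

import math

def nearby_third_devices(first_loc, third_locs, radius_m=500.0):
    """Return indices of third devices close enough to be notified.

    first_loc:  (latitude, longitude) of the first device, in degrees.
    third_locs: list of (latitude, longitude) pairs for candidate devices.
    """
    lat1, lon1 = first_loc
    selected = []
    for i, (lat2, lon2) in enumerate(third_locs):
        # Equirectangular approximation, adequate for short distances.
        x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
        y = math.radians(lat2 - lat1)
        if 6371000.0 * math.hypot(x, y) <= radius_m:
            selected.append(i)
    return selected
```

Each selected device would then receive the vehicle-event notification described in the claim; devices outside the radius are left undisturbed.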

示例26包括根據示例19至25中任一項所述的方法，還包括：從第一設備接收對應於與音訊事件相關聯的聲源的方向資料；基於音訊事件更新音訊場景中的定向聲源的地圖，以產生更新後的地圖；及向地理上遠離第一設備的一或多個第三設備發送與更新後的地圖相對應的資料。Example 26 includes the method of any one of Examples 19-25, further including: receiving, from the first device, direction data corresponding to a sound source associated with the audio event; updating a map of directional sound sources in an audio scene based on the audio event to generate an updated map; and sending data corresponding to the updated map to one or more third devices that are geographically remote from the first device.

示例27包括根據示例19至26中任一項所述的方法，還包括基於是否從第一設備接收到到達方向資訊而選擇性地繞過對接收到的與音訊事件相對應的音訊資料的到達方向處理。Example 27 includes the method of any one of Examples 19-26, further including selectively bypassing direction-of-arrival processing of received audio data corresponding to the audio event based on whether direction-of-arrival information is received from the first device.

示例28包括根據示例19至27中任一項所述的方法，還包括基於接收到的音訊資料是對應於來自第一設備的多通道麥克風信號還是對應於來自第一設備的波束成形信號而選擇性地繞過波束成形操作。Example 28 includes the method of any one of Examples 19-27, further including selectively bypassing a beamforming operation based on whether received audio data corresponds to multi-channel microphone signals from the first device or to a beamformed signal from the first device.

示例29包括根據示例19至28中任一項所述的方法，還包括：在第一設備的一或多個處理器處接收來自一或多個麥克風的音訊信號；在第一設備的一或多個處理器處處理音訊信號以決定音訊類別；及從第一設備向第二設備發送音訊類別的指示。Example 29 includes the method of any one of Examples 19-28, further including: receiving, at one or more processors of the first device, audio signals from one or more microphones; processing, at the one or more processors of the first device, the audio signals to determine the audio class; and sending the indication of the audio class from the first device to the second device.

根據示例30,一種設備包括:記憶體,被配置為儲存指令;及處理器,被配置為執行指令以執行根據示例16至28中任一項所述的方法。According to example 30, an apparatus comprising: a memory configured to store instructions; and a processor configured to execute the instructions to perform the method according to any one of examples 16-28.

根據示例31，一種非暫時性電腦可讀取媒體包括指令，當由一或多個處理器執行時，該些指令使得一或多個處理器執行根據示例16至29中任一項所述的方法。According to Example 31, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of Examples 16-29.

根據示例32,一種裝置包括用於執行根據示例16至28中任一項所述的方法的構件。According to example 32, an apparatus comprises means for performing the method according to any one of examples 16-28.

根據示例33，一種非暫時性電腦可讀取媒體包括指令，當由第二設備的一或多個處理器執行時，該些指令使得一或多個處理器從第一設備接收與音訊事件相對應的音訊類別的指示。According to Example 33, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors of a second device, cause the one or more processors to receive, from a first device, an indication of an audio class corresponding to an audio event.

示例34包括根據示例33之非暫時性電腦可讀取媒體，其中該些指令可執行以進一步使得一或多個處理器：從第一設備接收表示與音訊事件相關聯的聲音的音訊資料；及在一或多個分類器處處理音訊資料以驗證聲音對應於音訊事件。Example 34 includes the non-transitory computer-readable medium of Example 33, wherein the instructions are executable to further cause the one or more processors to: receive, from the first device, audio data representing a sound associated with the audio event; and process the audio data at one or more classifiers to verify that the sound corresponds to the audio event.

示例35包括根據示例34之非暫時性電腦可讀取媒體,其中該些指令可執行以進一步使得一或多個處理器將音訊資料和音訊類別的指示作為輸入提供給一或多個分類器,以決定與音訊資料相關聯的分類。Example 35 includes the non-transitory computer-readable medium of Example 34, wherein the instructions are executable to further cause the one or more processors to provide as input the audio data and the indication of the audio category to the one or more classifiers, to determine the category associated with the audio data.

示例36包括根據示例34或示例35之非暫時性電腦可讀取媒體，其中該些指令可執行以進一步使得一或多個處理器：從第一設備接收與聲源相對應的方向資料；及將音訊資料、方向資料和音訊類別的指示作為輸入提供給一或多個分類器，以決定與音訊資料相關聯的分類。Example 36 includes the non-transitory computer-readable medium of Example 34 or Example 35, wherein the instructions are executable to further cause the one or more processors to: receive, from the first device, direction data corresponding to the sound source; and provide the audio data, the direction data, and the indication of the audio class as input to one or more classifiers to determine a classification associated with the audio data.

根據示例37，一種裝置包括：用於接收音訊類別的指示的構件，該指示從遠端設備接收並且對應於音訊事件；及用於處理音訊資料以驗證以該音訊資料表示的聲音對應於該音訊事件的構件。According to Example 37, an apparatus includes: means for receiving an indication of an audio class, the indication received from a remote device and corresponding to an audio event; and means for processing audio data to verify that a sound represented by the audio data corresponds to the audio event.

根據示例38，一種第二設備包括：記憶體，被配置為儲存指令；及一或多個處理器，被配置為：從第一設備接收：表示聲音的音訊資料；及音訊資料對應於與交通工具事件相關聯的音訊類別的指示；在一或多個分類器處處理音訊資料，以驗證以音訊資料表示的聲音對應於交通工具事件；及基於第一設備的位置和一或多個第三設備的位置向一或多個第三設備發送交通工具事件的通知。According to Example 38, a second device includes: a memory configured to store instructions; and one or more processors configured to: receive, from a first device, audio data representing a sound and an indication that the audio data corresponds to an audio class associated with a vehicle event; process the audio data at one or more classifiers to verify that the sound represented by the audio data corresponds to the vehicle event; and send a notification of the vehicle event to one or more third devices based on a location of the first device and locations of the one or more third devices.

根據示例39，一種方法包括：在第二設備的一或多個處理器處接收來自第一設備的音訊資料以及來自第一設備的音訊資料對應於與交通工具事件相關聯的音訊類別的指示；在第二設備的一或多個分類器處處理音訊資料，以驗證以音訊資料表示的聲音對應於交通工具事件；及基於第一設備的位置和一或多個第三設備的位置向一或多個第三設備發送交通工具事件的通知。According to Example 39, a method includes: receiving, at one or more processors of a second device, audio data from a first device and an indication that the audio data corresponds to an audio class associated with a vehicle event; processing the audio data at one or more classifiers of the second device to verify that a sound represented by the audio data corresponds to the vehicle event; and sending a notification of the vehicle event to one or more third devices based on a location of the first device and locations of the one or more third devices.

根據示例40,一種設備包括:記憶體,被配置為儲存指令;及處理器,被配置為執行指令以執行根據示例39之方法。According to example 40, an apparatus comprising: a memory configured to store instructions; and a processor configured to execute the instructions to perform the method according to example 39.

根據示例41,一種非暫時性電腦可讀取媒體包括指令,當由第二設備的一或多個處理器執行時,該些指令使得一或多個處理器執行根據示例39之方法。According to Example 41, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors of a second device, cause the one or more processors to perform the method according to Example 39.

根據示例42,一種裝置包括用於執行根據示例39之方法的構件。According to example 42, an apparatus comprises means for performing the method according to example 39.

根據示例43，一種第一設備包括：記憶體，被配置為儲存指令；及一或多個處理器，被配置為：從一或多個麥克風接收一或多個音訊信號；處理一或多個音訊信號，以決定以音訊信號中的一或多個表示的聲音是否來自可辨識的方向；及基於該決定選擇性地向第二設備發送聲源的到達方向資訊。According to Example 43, a first device includes: a memory configured to store instructions; and one or more processors configured to: receive one or more audio signals from one or more microphones; process the one or more audio signals to determine whether a sound represented by one or more of the audio signals comes from an identifiable direction; and selectively send direction-of-arrival information for the sound source to a second device based on the determination.
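The "identifiable direction" test of Example 43 could be gated on the confidence of the direction estimate. A hedged sketch, in which the confidence measure, the 0.8 threshold, and the payload format are all illustrative assumptions:

```python
# Hypothetical sketch of Example 43: the first device only sends
# direction-of-arrival (DOA) information when the sound comes from an
# identifiable direction; diffuse or ambiguous sounds produce no DOA payload.

def maybe_send_doa(angle_estimates, confidence_threshold=0.8):
    """Return the DOA payload to send to the second device, or None.

    angle_estimates: list of (angle_deg, confidence) pairs produced by
    whatever DOA stage the first device runs.
    """
    if not angle_estimates:
        return None                       # no sound source detected at all
    angle, confidence = max(angle_estimates, key=lambda ac: ac[1])
    if confidence < confidence_threshold:
        return None                       # diffuse sound: no clear direction
    return {"direction_deg": angle, "confidence": confidence}
```

Returning `None` corresponds to the "selectively" in the claim: the transmission is simply withheld, saving link bandwidth when the information would be meaningless.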

根據示例44，一種方法包括：在第一設備的一或多個處理器處接收來自一或多個麥克風的一或多個音訊信號；在一或多個處理器處處理一或多個音訊信號，以決定以音訊信號中的一或多個表示的聲音是否來自可辨識的方向；及基於該決定選擇性地向第二設備發送聲源的到達方向資訊。According to Example 44, a method includes: receiving, at one or more processors of a first device, one or more audio signals from one or more microphones; processing, at the one or more processors, the one or more audio signals to determine whether a sound represented by one or more of the audio signals comes from an identifiable direction; and selectively sending direction-of-arrival information for the sound source to a second device based on the determination.

根據示例45,一種設備包括:記憶體,被配置為儲存指令;及處理器,被配置為執行指令以執行根據示例44之方法。According to example 45, an apparatus comprising: a memory configured to store instructions; and a processor configured to execute the instructions to perform the method according to example 44.

根據示例46,一種非暫時性電腦可讀取媒體包括指令,當由第一設備的一或多個處理器執行時,該些指令使得一或多個處理器執行根據示例44之方法。According to Example 46, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors of a first device, cause the one or more processors to perform the method according to Example 44.

根據示例47,一種裝置包括用於執行根據示例44之方法的構件。According to example 47, an apparatus comprises means for performing the method according to example 44.

根據示例48，一種第一設備包括：記憶體，被配置為儲存指令；及一或多個處理器，被配置為：從一或多個麥克風接收一或多個音訊信號；基於一或多個標準決定是向第二設備發送一或多個音訊信號還是向第二設備發送基於一或多個音訊信號而產生的波束成形音訊信號；及基於該決定向第二設備發送與一或多個音訊信號相對應或者與波束成形音訊信號相對應的音訊資料。According to Example 48, a first device includes: a memory configured to store instructions; and one or more processors configured to: receive one or more audio signals from one or more microphones; determine, based on one or more criteria, whether to send the one or more audio signals to a second device or to send a beamformed audio signal generated based on the one or more audio signals to the second device; and based on the determination, send audio data corresponding to the one or more audio signals or corresponding to the beamformed audio signal to the second device.
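Example 48 leaves the "one or more criteria" open. One plausible reading is a trade-off between link bandwidth and leaving the richer multi-channel data for the second device to process; the sketch below encodes that reading, with the microphone count, per-channel bitrate, and return labels all being illustrative assumptions:

```python
# Illustrative sketch of Example 48: the first device chooses between sending
# the raw multi-channel microphone signals or a single beamformed signal.

def choose_payload(num_mics, link_kbps, per_channel_kbps=64):
    """Return which audio data to transmit to the second device."""
    if num_mics < 2:
        return "microphone_signals"      # cannot beamform with a single mic
    if num_mics * per_channel_kbps > link_kbps:
        return "beamformed_signal"       # link too slow for all raw channels
    return "microphone_signals"          # let the second device beamform
```

Sending the raw channels preserves spatial information for the second device's own beamformer (see Example 15's bypass logic); sending one beamformed channel spends first-device compute to save bandwidth.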

根據示例49，一種方法包括：在第一設備的一或多個處理器處接收來自一或多個麥克風的一或多個音訊信號；在一或多個處理器處基於一或多個標準來決定是向第二設備發送一或多個音訊信號還是向第二設備發送基於一或多個音訊信號而產生的波束成形音訊信號；及基於該決定向第二設備發送與一或多個音訊信號相對應或者與波束成形音訊信號相對應的音訊資料。According to Example 49, a method includes: receiving, at one or more processors of a first device, one or more audio signals from one or more microphones; determining, at the one or more processors and based on one or more criteria, whether to send the one or more audio signals to a second device or to send a beamformed audio signal generated based on the one or more audio signals to the second device; and based on the determination, sending audio data corresponding to the one or more audio signals or corresponding to the beamformed audio signal to the second device.

根據示例50,一種設備包括:記憶體,被配置為儲存指令;及處理器,被配置為執行指令以執行根據示例49之方法。According to example 50, an apparatus comprising: a memory configured to store instructions; and a processor configured to execute the instructions to perform the method according to example 49.

根據示例51,一種非暫時性電腦可讀取媒體包括指令,當由第一設備的一或多個處理器執行時,該些指令使得一或多個處理器執行根據示例49之方法。According to example 51 , a non-transitory computer-readable medium comprising instructions which, when executed by one or more processors of a first device, cause the one or more processors to perform the method according to example 49.

根據示例52,一種裝置包括用於執行根據示例49之方法的構件。According to example 52, an apparatus comprises means for performing the method according to example 49.

根據示例53，一種第二設備包括：記憶體，被配置為儲存指令；及一或多個處理器，被配置為：從第一設備接收：表示聲音的音訊資料；與聲源相對應的方向資料；及與音訊事件相對應的聲音的分類；處理音訊資料以驗證聲音對應於音訊事件；基於音訊事件更新音訊場景中的定向聲源的地圖，以產生更新後的地圖；及向地理上遠離第一設備的一或多個第三設備發送與更新後的地圖相對應的資料。According to Example 53, a second device includes: a memory configured to store instructions; and one or more processors configured to: receive, from a first device, audio data representing a sound, direction data corresponding to a sound source, and a classification of the sound corresponding to an audio event; process the audio data to verify that the sound corresponds to the audio event; update a map of directional sound sources in an audio scene based on the audio event to generate an updated map; and send data corresponding to the updated map to one or more third devices that are geographically remote from the first device.
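The map update of Example 53 can be sketched as a small data structure keyed by source direction. This is a minimal assumption-laden illustration: the 45-degree sector quantization and the dictionary representation are not specified by the patent, which only requires that the map of directional sound sources be updated and the result shared.

```python
# Minimal sketch of Example 53's map update: the audio scene is kept as a
# dictionary from direction sector to the latest verified audio event.

def update_sound_map(sound_map, direction_deg, event, sector_deg=45):
    """Return a new map with the verified event recorded in its sector."""
    sector = int(direction_deg % 360) // sector_deg
    updated = dict(sound_map)        # leave the previous map unchanged
    updated[sector] = event
    return updated

scene = update_sound_map({}, 90.0, "siren")
scene = update_sound_map(scene, 100.0, "car_horn")  # same sector, overwrites
```

Serialized, the updated map would be the "data corresponding to the updated map" that is sent on to the geographically remote third devices, which could render it as the 3D audio map described elsewhere in the disclosure.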

根據示例54，一種方法包括：在第二設備的一或多個處理器處接收表示聲音的音訊資料、與聲源相對應的方向資料、以及與音訊事件相對應的聲音的分類，音訊資料、方向資料和分類是從第一設備接收的；在一或多個處理器處處理音訊資料，以驗證聲音對應於音訊事件；在一或多個處理器處基於音訊事件來更新音訊場景中的定向聲源的地圖，以產生更新後的地圖；及向地理上遠離第一設備的一或多個第三設備發送與更新後的地圖相對應的資料。According to Example 54, a method includes: receiving, at one or more processors of a second device, audio data representing a sound, direction data corresponding to a sound source, and a classification of the sound corresponding to an audio event, the audio data, direction data, and classification received from a first device; processing the audio data at the one or more processors to verify that the sound corresponds to the audio event; updating, at the one or more processors, a map of directional sound sources in an audio scene based on the audio event to generate an updated map; and sending data corresponding to the updated map to one or more third devices that are geographically remote from the first device.

根據示例55,一種設備包括:記憶體,被配置為儲存指令;及處理器,被配置為執行指令以執行根據示例54之方法。According to example 55, an apparatus comprising: a memory configured to store instructions; and a processor configured to execute the instructions to perform the method according to example 54.

根據示例56,一種非暫時性電腦可讀取媒體包括指令,當由第二設備的一或多個處理器執行時,該些指令使得一或多個處理器執行根據示例54之方法。According to example 56, a non-transitory computer-readable medium includes instructions that, when executed by one or more processors of a second device, cause the one or more processors to perform the method according to example 54.

根據示例57,一種裝置包括用於執行根據示例54之方法的構件。According to example 57, an apparatus comprises means for performing the method according to example 54.

熟習此項技術者將進一步瞭解，結合本文揭露的實施方式描述的各種說明性邏輯區塊、配置、模組、電路和演算法步驟可以被實施為電子硬體、由處理器執行的電腦軟體或兩者的組合。各種說明性的元件、方塊、配置、模組、電路和步驟已經在上面根據它們的功能進行了整體描述。這種功能是被實施為硬體還是處理器可執行指令取決於特定的應用和對整個系統施加的設計約束。熟習此項技術者可以針對每個特定的應用以不同的方式實施所描述的功能，此類實施決策不應被解釋為導致脫離本案的範圍。Those skilled in the art will further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or a combination of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or processor-executable instructions depends upon the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, and such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

結合本文揭露的實施方式描述的方法或演算法的步驟可以直接體現在硬體、由處理器執行的軟體模組或兩者的組合中。軟體模組可以常駐在隨機存取記憶體(RAM)、快閃記憶體、唯讀記憶體(ROM)、可程式設計唯讀記憶體(PROM)、可抹除可程式設計唯讀記憶體(EPROM)、電子可抹除可程式設計唯讀記憶體(EEPROM)、暫存器、硬碟、抽取式磁碟、光碟唯讀記憶體(CD-ROM)、或本領域已知的任何其他形式的非瞬態儲存媒體中。示例性儲存媒體耦合到處理器，使得處理器可以從儲存媒體讀取資訊以及向儲存媒體寫入資訊。在替代方案中，儲存媒體可以整合到處理器中。處理器和儲存媒體可以常駐在專用積體電路(ASIC)中。ASIC可以常駐在計算設備或使用者終端中。在替代方案中，處理器和儲存媒體可以作為個別元件常駐在計算設備或使用者終端中。The steps of a method or algorithm described in connection with the implementations disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transitory storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.

提供對所揭示的態樣的先前描述是為了使熟習此項技術者能夠製造或使用所揭示的態樣。熟習此項技術者將容易明白對這些態樣的各種修改,並且本文限定的原理可以在不脫離本案的範圍的情況下應用於其他態樣。因此,本案不旨在限於本文所示的態樣,而是要符合與由所附請求項限定的原理和新穎特徵相一致的儘可能最寬的範圍。The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the disclosed aspects. Various modifications to these aspects will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Thus, the present application is not intended to be limited to the aspects shown herein, but is to be accorded the widest possible scope consistent with the principles and novel features defined by the appended claims.

100:系統 102:麥克風 104:麥克風 106:麥克風 108:麥克風 110:設備 111:第一輸入介面 112:第二輸入介面 114:記憶體 115:指令 116:處理器 118:數據機 120:設備 121:第一輸入介面 122:第二輸入介面 124:記憶體 125:指令 126:處理器 128:數據機 129:感測器 132:處理單元 134:音訊事件處理單元 136:聲學環境處理單元 138:波束成形單元 142:到達方向資訊 143:到達方向資訊 144:音訊事件資訊 145:音訊事件資訊 146:環境資訊 147:環境資訊 148:波束成形音訊信號 149:波束成形音訊信號 152:到達方向處理單元 154:音訊事件處理單元 156:聲學環境處理單元 158:波束成形單元 170:音訊信號 172:音訊信號 174:音訊訊框 176:音訊訊框 178:音訊資料 180:源 182:聲音 190:音訊信號 192:音訊信號 194:音訊訊框 196:音訊訊框 198:音訊資料 200:系統 202:處理器 210:第一處理域 220:第二處理域 230:音訊預處理單元 232:到達方向處理單元 234:音訊事件處理單元 236:聲學環境處理單元 238:波束成形單元 250:方向資訊 252:啟動信號 274:音訊訊框 276:音訊訊框 278:音訊資料 300:系統 310:耳機 320:行動電話 330:音訊處理單元 332:音訊縮放單元 334:使用者提示產生單元 336:揚聲器 350:使用者警報 352:使用者警報 400:系統 410:耳機 420:行動電話 430:音訊處理單元 432:音訊縮放單元 434:除噪單元 436:揚聲器 440:第一聲音資訊 442:第二聲音資訊 450:單麥克風音訊上下文偵測單元 452:音訊調整單元 454:模式控制器 460:音訊縮放角度 462:降噪參數 464:模式信號 490:降噪信號 496:上下文資訊 500:系統 502:空間濾波處理單元 504:音訊事件處理單元 506:應用程式設計介面 508:語音使用者介面 510:輸出 512:輸出 514:輸出 542:到達方向資訊 544:音訊事件資訊 574:音訊訊框 576:音訊訊框 600:實施方式 602:信號資料 610:分類器 612:類別 614:類別 616:指示 700:實施方式 710:編碼器 712:嵌入 716:指示 900:實施方式 902:指示 904:音訊資料 910:波束成形信號 912:方向資料 920:分類器 922:分類 924:類別 930:通知 932:控制信號 934:分類器輸出 1000:另一實施方式 1002:多通道音訊信號 1100:實施方式 1102:輸入混合波形 1104:嵌入 1106:目標輸出 1120:內容分離器 1122:音訊產生網路 1150:圖示 1151:音訊場景 1152:第一氛圍 1154:前景聲音 1156:前景聲音 1158:前景聲音 1160:前景提取操作 1162:前景聲音 1164:場景產生操作 1171:音訊場景 1172:第二氛圍 1200:方法 1202:方塊 1204:方塊 1206:方塊 1208:方塊 1210:方塊 1212:方塊 1214:方塊 1220:方塊 1230:方塊 1232:方塊 1234:方塊 1236:方塊 1238:方塊 1240:方塊 1242:方塊 1244:方塊 1246:方塊 1248:方塊 1300:方法 1302:方塊 1304:方塊 1306:方塊 1308:方塊 1310:方塊 1312:方塊 1314:方塊 1320:方塊 1322:方塊 1330:方塊 1332:方塊 1340:方塊 1342:方塊 1350:方塊 1400:系統 1402:麥克風 1404:麥克風 1410:交通工具 1411:第一輸入介面 1412:第二輸入介面 1414:記憶體 1415:指令 1416:處理器 1420:設備 1432:到達方向處理單元 1434:音訊事件處理單元 1436:報告產生器 1438:導航指令產生器 1442:到達方向資訊 1444:音訊事件資訊 1446:報告 1448:導航指令 1456:第二報告 1458:導航指令 1470:音訊信號 1472:音訊信號 1474:音訊訊框 1476:音訊訊框 1478:音訊資料 1480:源 1482:聲音 1490:其他設備 1492:通知 1500:系統 1502:交通工具事件 1504:音訊類別 1510:交通工具 1514:記憶體 1515:指令 1516:處理器 1520:設備 1522:分類 1530:分類器 
1550:音訊資料 1552:指示 1602:指示 1604:方向資料 1612:地圖更新器 1614:地圖 1616:地圖 1618:音訊場景渲染器 1660:資料 1670:設備 1672:設備 1674:設備 1700:3D音訊地圖 1702:使用者 1710:第一交通工具 1712:第二交通工具 1714:犬吠聲 1716:說話的人 1718:人行橫道計時器 1720:人造聲音 1802:定向音訊場景 1804:使用者 1810:第一代表性揚聲器 1812:第二代表性揚聲器 1814:第三代表性揚聲器 1820:操作 1830:定向音訊場景 1832:虛擬參與者 1834:虛擬參與者 1900:實施方式 1902:積體電路 1904:音訊輸入 1906:信號輸出 1916:處理器 1990:定向音訊信號處理單元 1992:定向音訊信號資料 2000:實施方式 2002:行動設備 2004:顯示螢幕 2100:實施方式 2102:耳機設備 2200:實施方式 2202:可佩戴電子設備 2204:顯示螢幕 2300:實施方式 2302:語音啟動設備 2304:揚聲器 2400:實施方式 2402:相機設備 2500:實施方式 2502:耳機 2600:實施方式 2602:增強現實或混合現實眼鏡 2604:全息投影單元 2606:鏡片 2700:實施方式 2702:第一耳塞 2704:第二耳塞 2706:一對耳塞 2720:第一麥克風 2722A:麥克風 2722B:麥克風 2722C:麥克風 2724:「內部」麥克風 2726:語音麥克風 2730:揚聲器 2800:實施方式 2802:交通工具 2850:定向音訊信號處理單元 2900:另一實施方式 2910:揚聲器 2920:顯示器 2950:定向音訊信號處理單元 3000:方法 3002:方塊 3004:方塊 3006:方塊 3100:方法 3102:方塊 3104:方塊 3106:方塊 3200:方法 3202:方塊 3204:方塊 3300:方法 3302:方塊 3304:方塊 3306:方塊 3400:方法 3402:方塊 3404:方塊 3406:方塊 3500:方法 3502:方塊 3504:方塊 3506:方塊 3600:方法 3602:方塊 3604:方塊 3606:方塊 3608:方塊 3700:設備 3702:數位類比轉換器(DAC) 3704:類比數位轉換器(ADC) 3706:處理器 3708:語音和音樂轉碼器-解碼器(轉碼器,CODEC) 3710:處理器 3722:系統級封裝或片上系統設備 3726:顯示控制器 3728:顯示器 3730:輸入裝置 3734:CODEC 3736:語音轉碼器(「聲碼器」)編碼器 3738:聲碼器解碼器 3744:電源 3750:收發器 3752:天線 3756:指令 3770:數據機 3786:記憶體 3792:揚聲器 FG1:第一前景聲音 FG2:第二前景聲音 FG3:第三前景聲音 100: system 102: Microphone 104: Microphone 106: Microphone 108: Microphone 110: Equipment 111: the first input interface 112: Second input interface 114: memory 115: instruction 116: Processor 118: modem 120: Equipment 121: the first input interface 122: Second input interface 124: memory 125: instruction 126: Processor 128: modem 129: sensor 132: Processing unit 134: Audio event processing unit 136: Acoustic environment processing unit 138: Beamforming unit 142: Arrival direction information 143: Arrival direction information 144:Audio event information 145:Audio event information 146:Environmental information 147: Environmental information 148: Beamforming audio signal 149: Beamforming audio signal 152: Arrival 
direction processing unit 154: Audio event processing unit 156: Acoustic environment processing unit 158: Beamforming unit 170: audio signal 172:Audio signal 174: Audio frame 176: Audio frame 178: Audio data 180: source 182: sound 190: audio signal 192: audio signal 194: Audio frame 196: Audio frame 198: Audio data 200: system 202: Processor 210: First processing domain 220: second processing domain 230: audio preprocessing unit 232: Arrival direction processing unit 234: Audio event processing unit 236: Acoustic environment processing unit 238: Beamforming unit 250: Direction information 252: start signal 274: Audio frame 276: Audio frame 278: Audio data 300: system 310: Headphones 320: mobile phone 330:Audio processing unit 332: audio scaling unit 334: user prompt generating unit 336:Speaker 350: User Alert 352: User Alert 400: system 410: Headphones 420: mobile phone 430:Audio processing unit 432: Audio scaling unit 434: Noise removal unit 436:Speaker 440: First Voice Information 442:Second Voice Information 450:Single microphone audio context detection unit 452:Audio adjustment unit 454: Mode Controller 460: audio zoom angle 462: Noise reduction parameters 464: mode signal 490: Noise reduction signal 496:Context information 500: system 502: Spatial filtering processing unit 504: Audio event processing unit 506: Application Programming Interface 508:Voice user interface 510: output 512: output 514: output 542: Arrival direction information 544:Audio event information 574:Audio frame 576:Audio frame 600: Implementation 602: signal data 610: Classifier 612: Category 614: Category 616: instruction 700: Implementation 710: Encoder 712:Embed 716: instruction 900: Implementation 902: instruction 904: audio data 910: Beamforming signal 912: Direction data 920: Classifier 922: classification 924: category 930: Notification 932: control signal 934: Classifier output 1000: another implementation 1002: Multi-channel audio signal 1100: Implementation 1102: input mixed 
waveform 1104: Embed 1106: target output 1120: content separator 1122:Audio generation network 1150: icon 1151: Audio scene 1152: The first atmosphere 1154:Foreground sound 1156:Foreground sound 1158:Foreground sound 1160: Foreground extraction operation 1162: Foreground sound 1164: Scene generation operation 1171: Audio scene 1172:Second Atmosphere 1200: method 1202: block 1204: block 1206: block 1208: block 1210: block 1212: block 1214: block 1220: block 1230: block 1232: square 1234: block 1236: block 1238: square 1240: block 1242: block 1244: block 1246: block 1248: block 1300: method 1302: block 1304: block 1306: cube 1308: cube 1310: block 1312: block 1314: block 1320: block 1322: block 1330: block 1332: block 1340: block 1342: block 1350: block 1400: system 1402: Microphone 1404:Microphone 1410: Transportation 1411: the first input interface 1412: Second input interface 1414: memory 1415: instruction 1416: Processor 1420: Equipment 1432: Arrival direction processing unit 1434: Audio event processing unit 1436: Report Generator 1438: Navigation instruction generator 1442: Arrival direction information 1444:Audio event information 1446: report 1448:Navigation command 1456: Second report 1458:Navigation command 1470: audio signal 1472:Audio signal 1474:Audio frame 1476: Audio frame 1478: Audio data 1480: source 1482: sound 1490:Other equipment 1492: Notice 1500: system 1502: Vehicle event 1504: Audio category 1510: Transportation 1514: memory 1515: instruction 1516: Processor 1520: Equipment 1522: classification 1530: classifier 1550: audio data 1552: instruction 1602: instruction 1604: Direction information 1612: Map Updater 1614: map 1616: map 1618: Audio scene renderer 1660: data 1670: Equipment 1672: Equipment 1674: equipment 1700: 3D audio map 1702: user 1710: The first means of transportation 1712: Second means of transportation 1714: dog bark 1716: Talker 1718: Crosswalk timer 1720: Artificial Sound 1802: Directed audio scene 1804: user 1810: First 
representative speaker 1812: Second representative speaker 1814: Third representative speaker 1820: Operation 1830: Directed Audio Scenarios 1832: Virtual Participants 1834: Virtual Participants 1900: Implementation 1902: Integrated circuits 1904: Audio input 1906: Signal output 1916: Processor 1990: Directional audio signal processing unit 1992: Directional audio signal data 2000: Implementation 2002: Mobile devices 2004: display screen 2100: Implementation 2102: Headphone equipment 2200: Implementation 2202: Wearable Electronic Devices 2204: display screen 2300: Implementation 2302:Voice activated device 2304:Speaker 2400: Implementation method 2402: Camera equipment 2500: Implementation 2502: Headphones 2600: Implementation method 2602: Augmented or mixed reality glasses 2604: Holographic projection unit 2606: Lens 2700: Implementation 2702: The first earplug 2704:Second earplug 2706: A pair of earplugs 2720: the first microphone 2722A: Microphone 2722B: Microphone 2722C: Microphone 2724: "Internal" microphone 2726: voice microphone 2730:Speaker 2800: Implementation 2802: Transportation 2850: Directional Audio Signal Processing Unit 2900: another implementation 2910: Loudspeaker 2920:Display 2950: Directional Audio Signal Processing Unit 3000: method 3002: block 3004: block 3006: block 3100: method 3102: block 3104: block 3106: block 3200: method 3202: block 3204: block 3300: method 3302: block 3304: block 3306: block 3400: method 3402: block 3404: block 3406: block 3500: method 3502: block 3504: block 3506: block 3600: method 3602: block 3604: block 3606: block 3608: block 3700: Equipment 3702: Digital to Analog Converter (DAC) 3704: Analog to Digital Converter (ADC) 3706: Processor 3708: Speech and music transcoder-decoder (transcoder, CODEC) 3710: Processor 3722: System-in-Package or System-on-Chip Devices 3726: display controller 3728:Display 3730: input device 3734:CODEC 3736: Speech Transcoder ("Vocoder") Encoder 3738: Vocoder decoder 3744:Power 3750: 
Transceiver 3752: Antenna 3756: command 3770: modem 3786:Memory 3792:Speaker FG1: first foreground sound FG2: second foreground voice FG3: third foreground voice

FIG. 1 is a block diagram of a particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 2 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 3 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 4 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 5 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 6 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 7 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 8 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 9 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 10 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 11 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, and includes a graphical illustration of audio content separation, according to some examples of the present disclosure.

FIG. 12 is a diagram of a particular implementation of operations that may be performed in an audio processing device, according to some examples of the present disclosure.

FIG. 13 is a diagram of another particular implementation of operations that may be performed in an audio processing device, according to some examples of the present disclosure.

FIG. 14 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 15 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 16 is a block diagram of another particular illustrative aspect of a system configured to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 17 illustrates an example of an audio scene including multiple directional sound sources that may be determined via directional processing of one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

FIG. 18 illustrates an example of a shared audio scene including multiple directional sound sources, according to some examples of the present disclosure.

FIG. 19 illustrates an example of an integrated circuit including a directional audio signal processing unit for generating directional audio signal data, according to some examples of the present disclosure.

FIG. 20 is a diagram of a mobile device including a directional audio signal processing unit for generating directional audio signal data, according to some examples of the present disclosure.

FIG. 21 is a diagram of a headset including a directional audio signal processing unit for generating directional audio signal data, according to some examples of the present disclosure.

FIG. 22 is a diagram of a wearable electronic device including a directional audio signal processing unit for generating directional audio signal data, according to some examples of the present disclosure.

FIG. 23 is a diagram of a voice-activated speaker system including a directional audio signal processing unit for generating directional audio signal data, according to some examples of the present disclosure.

FIG. 24 is a diagram of a camera including a directional audio signal processing unit for generating directional audio signal data, according to some examples of the present disclosure.

FIG. 25 is a diagram of a headset, such as a virtual reality, mixed reality, or augmented reality headset, including a directional audio signal processing unit for generating directional audio signal data, according to some examples of the present disclosure.

FIG. 26 is a diagram of a mixed reality or augmented reality glasses device including a directional audio signal processing unit for generating directional audio signal data, according to some examples of the present disclosure.

FIG. 27 is a diagram of earbuds including a directional audio signal processing unit for generating directional audio signal data, according to some examples of the present disclosure.

FIG. 28 is a diagram of a first example of a vehicle including a directional audio signal processing unit for navigating the vehicle, according to some examples of the present disclosure.

FIG. 29 is a diagram of a second example of a vehicle including a directional audio signal processing unit for navigating the vehicle, according to some examples of the present disclosure.

FIG. 30 is a diagram of a particular implementation of a method of processing audio, according to some examples of the present disclosure.

FIG. 31 is a diagram of another particular implementation of a method of processing audio, according to some examples of the present disclosure.

FIG. 32 is a diagram of another particular implementation of a method of processing audio, according to some examples of the present disclosure.

FIG. 33 is a diagram of another particular implementation of a method of processing audio, according to some examples of the present disclosure.

FIG. 34 is a diagram of another particular implementation of a method of processing audio, according to some examples of the present disclosure.

FIG. 35 is a diagram of another particular implementation of a method of processing audio, according to some examples of the present disclosure.

FIG. 36 is a diagram of another particular implementation of a method of processing audio, according to some examples of the present disclosure.

FIG. 37 is a block diagram of a particular illustrative example of a device operable to perform directional processing on one or more audio signals received from one or more microphones, according to some examples of the present disclosure.

Domestic deposit information (listed by depositary institution, date, and number): None
Foreign deposit information (listed by deposit country, institution, date, and number): None

100: System

102: Microphone

104: Microphone

106: Microphone

108: Microphone

110: Device

111: First input interface

112: Second input interface

114: Memory

115: Instructions

116: Processor

118: Modem

120: Device

121: First input interface

122: Second input interface

124: Memory

125: Instructions

126: Processor

128: Modem

129: Sensor

132: Processing unit

134: Audio event processing unit

136: Acoustic environment processing unit

138: Beamforming unit

142: Direction-of-arrival information

143: Direction-of-arrival information

144: Audio event information

145: Audio event information

146: Environment information

147: Environment information

148: Beamformed audio signal

149: Beamformed audio signal

152: Direction-of-arrival processing unit

154: Audio event processing unit

156: Acoustic environment processing unit

158: Beamforming unit

170: Audio signal

172: Audio signal

174: Audio frame

176: Audio frame

178: Audio data

180: Source

182: Sound

190: Audio signal

192: Audio signal

194: Audio frame

196: Audio frame

198: Audio data

Claims (30)

1. A first device comprising:
a memory configured to store instructions; and
one or more processors configured to:
receive audio signals from multiple microphones;
process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and
send, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.

2. The first device of claim 1, wherein the one or more processors are further configured to process signal data corresponding to the audio signals to determine the class or embedding.

3. The first device of claim 2, wherein the one or more processors are further configured to perform a beamforming operation on the audio signals to generate the signal data.

4. The first device of claim 2, wherein the one or more processors are further configured to process the signal data at one or more classifiers to determine, from among multiple classes supported by the one or more classifiers, the class for a sound that is represented in one or more of the audio signals and that is associated with an audio event, and wherein the class is sent to the second device.
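Claims 1-4 recite generating direction-of-arrival information from multi-microphone audio signals. Purely as a non-limiting illustration of one conventional way such information can be derived (the claims are not tied to any particular technique), the following sketch applies GCC-PHAT time-delay estimation to a two-microphone pair; the sample rate, microphone spacing, and simulated signal are invented for this example:

```python
import numpy as np

def estimate_doa_gcc_phat(sig_a, sig_b, fs, mic_distance, c=343.0):
    """Estimate a source angle (degrees from broadside, toward mic A)
    for a two-microphone pair via GCC-PHAT time-delay estimation."""
    n = sig_a.size + sig_b.size
    cross = np.fft.rfft(sig_a, n=n) * np.conj(np.fft.rfft(sig_b, n=n))
    cross /= np.abs(cross) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * mic_distance / c)         # largest physically possible lag
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    lag = np.argmax(cc) - max_shift                # peak sits at -d when B lags A by d
    tdoa = -lag / fs                               # seconds by which B lags A
    return np.degrees(np.arcsin(np.clip(tdoa * c / mic_distance, -1.0, 1.0)))

# Simulate a broadband source whose wavefront reaches mic A three samples
# before mic B (i.e. the source sits off to mic A's side).
fs = 16000
src = np.random.default_rng(0).standard_normal(4096)
sig_a = src
sig_b = np.concatenate((np.zeros(3), src[:-3]))
angle = estimate_doa_gcc_phat(sig_a, sig_b, fs, mic_distance=0.1)
```

With a 10 cm spacing at 16 kHz, a three-sample lag corresponds to roughly a 40-degree source angle; a real device would smooth such estimates over frames and microphone pairs.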
5. The first device of claim 2, wherein the one or more processors are further configured to process the signal data at one or more encoders to generate the embedding, the embedding corresponding to a sound that is represented in one or more of the audio signals and that is associated with an audio event, and wherein the embedding is sent to the second device.

6. The first device of claim 1, wherein the one or more processors are further configured to process image data at one or more encoders to generate the embedding, the embedding corresponding to an object that is represented in the image data and that is associated with an audio event, and wherein the embedding is sent to the second device.

7. The first device of claim 6, further comprising one or more cameras configured to generate the image data.

8. The first device of claim 1, wherein:
the class corresponds to a classification of a particular sound that is represented in the audio signals and that is associated with a particular audio event; and
the embedding includes a signature or information corresponding to the particular sound or the particular audio event and is configured to enable detection of the particular sound or the particular audio event in other audio signals via processing of the other audio signals.
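Claims 5-8 describe an embedding that acts as a signature enabling detection of a particular sound in other audio signals. As a toy, non-limiting illustration of that idea only — a practical system would use a learned encoder, whereas the band-energy signature, similarity threshold, and test signals below are all invented for the example:

```python
import numpy as np

def sound_embedding(signal, n_bands=16):
    """Toy 'embedding': an L2-normalised log-energy signature over coarse
    frequency bands. A stand-in for the output of a learned encoder."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    feat = np.log1p(np.array([band.sum() for band in np.array_split(power, n_bands)]))
    return feat / (np.linalg.norm(feat) + 1e-12)

def matches(embedding, other_signal, threshold=0.9):
    """Detect the sound summarised by `embedding` in another signal via
    cosine similarity of the two signatures."""
    return float(embedding @ sound_embedding(other_signal)) >= threshold

fs = 16000
t = np.arange(fs) / fs
siren = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)  # tonal event
rumble = np.random.default_rng(1).standard_normal(fs)                    # broadband noise
emb = sound_embedding(siren)
hit = matches(emb, siren + 1e-4 * rumble)   # same event, slightly noisy
miss = matches(emb, rumble)                 # unrelated sound
```

The point of the sketch is the workflow in claim 8: the first device ships a compact signature rather than raw audio, and any device holding the signature can test later signals against it.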
9. The first device of claim 1, wherein the one or more processors are further configured to:
perform spatial processing on the audio signals based on the direction-of-arrival information to generate one or more beamformed audio signals; and
send the one or more beamformed audio signals to the second device.

10. The first device of claim 1, wherein the memory and the one or more processors are integrated into a headset device, and wherein the second device corresponds to a mobile phone.

11. The first device of claim 1, further comprising a modem, wherein the data is sent to the second device via the modem.

12. The first device of claim 1, wherein the one or more processors are further configured to send a representation of the audio signals to the second device.

13. The first device of claim 12, wherein the representation of the audio signals corresponds to one or more beamformed audio signals.

14. The first device of claim 1, wherein the one or more processors are further configured to generate a user interface output indicating at least one of an environmental event or an acoustic event.

15. The first device of claim 1, wherein the one or more processors are further configured to receive, from the second device, data indicating an acoustic event.
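Claim 9 recites spatial processing based on direction-of-arrival information to produce beamformed audio signals. One textbook, non-limiting realisation is a frequency-domain delay-and-sum beamformer; the four-element linear array, spacing, and test tone below are invented for illustration:

```python
import numpy as np

def delay_and_sum(signals, fs, mic_positions, look_direction_deg, c=343.0):
    """Frequency-domain delay-and-sum beamformer for a linear array:
    advance each channel by its geometric delay toward the look
    direction (degrees from broadside), then average."""
    theta = np.radians(look_direction_deg)
    delays = np.asarray(mic_positions) * np.sin(theta) / c  # seconds per mic
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    out = np.zeros(n)
    for sig, d in zip(signals, delays):
        out += np.fft.irfft(np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * d), n=n)
    return out / len(signals)

# Simulate a 1 kHz plane wave arriving from 30 degrees at a 4-mic linear array.
fs = 16000
t = np.arange(2048) / fs
tone = np.sin(2 * np.pi * 1000 * t)
positions = [0.0, 0.05, 0.10, 0.15]  # metres along the array
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
mics = np.stack([
    np.fft.irfft(np.fft.rfft(tone)
                 * np.exp(-2j * np.pi * freqs * x * np.sin(np.radians(30.0)) / 343.0),
                 n=t.size)
    for x in positions
])
on_target = delay_and_sum(mics, fs, positions, 30.0)    # steered at the source
off_target = delay_and_sum(mics, fs, positions, -60.0)  # steered away
```

Steering at the source's direction-of-arrival realigns the channels and recovers the tone coherently, while steering elsewhere leaves the channels out of phase and attenuates the output.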
16. The first device of claim 1, wherein the one or more processors are further configured to:
receive, from the second device, direction information associated with the audio signals; and
perform an audio zoom operation based on the direction information.

17. The first device of claim 1, wherein the one or more processors are integrated in a vehicle.

18. The first device of claim 1, wherein the data based on the direction-of-arrival information includes a report indicating at least one detected event and a direction of the detected event.

19. The first device of claim 1, further comprising the multiple microphones.

20. The first device of claim 1, further comprising at least one speaker configured to output sound associated with at least one of the audio signals.

21. A method of processing audio, the method comprising:
receiving, at one or more processors of a first device, audio signals from multiple microphones;
processing the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and
sending, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.

22. The method of claim 21, further comprising processing signal data corresponding to the audio signals to determine the class or embedding.
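Claim 16 recites an audio zoom operation driven by received direction information. A deliberately simple, non-limiting sketch: once a beamformed target signal for the indicated direction exists, zooming can be approximated as a crossfade from the wide ambient mixture toward that target. All signals, the residual-noise factor, and the 0-to-1 zoom parameter here are invented:

```python
import numpy as np

def audio_zoom(target_beam, ambient_mix, zoom):
    """Crossfade from the wide ambient mixture (zoom=0) toward the
    beamformed target (zoom=1), emphasising the looked-at direction."""
    zoom = float(np.clip(zoom, 0.0, 1.0))
    return (1.0 - zoom) * ambient_mix + zoom * target_beam

fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 500 * t)                         # source being zoomed toward
interference = np.random.default_rng(2).standard_normal(fs)  # off-axis sound
ambient = target + interference                              # what the mics hear overall
beam = target + 0.1 * interference                           # beamformer leaves residual noise
zoomed = audio_zoom(beam, ambient, 0.8)

# The interference coefficient drops from 1.0 in the ambient mix to
# (1 - 0.8) * 1.0 + 0.8 * 0.1 = 0.28 in the zoomed output.
residual = zoomed - target
```

A continuous zoom parameter gives a perceptually gradual "moving closer" effect rather than a hard on/off switch between wide and beamformed audio.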
23. The method of claim 22, further comprising performing a beamforming operation on the audio signals to generate the signal data.

24. The method of claim 22, wherein the signal data is processed at one or more classifiers to determine, from among multiple classes supported by the one or more classifiers, the class for a sound that is represented in one or more of the audio signals and that is associated with an audio event, and wherein the class is sent to the second device.

25. The method of claim 22, wherein the signal data is processed at one or more encoders to generate the embedding, the embedding corresponding to a sound that is represented in one or more of the audio signals and that is associated with an audio event, and wherein the embedding is sent to the second device.

26. The method of claim 21, further comprising sending a representation of the audio signals to the second device.
27. The method of claim 21, further comprising:
receiving, at one or more processors of the second device, the data based on the direction-of-arrival information and the class;
obtaining, at the one or more processors of the second device, audio data representing a sound associated with the direction-of-arrival information and the class; and
verifying, at the one or more processors of the second device, the class based at least on the audio data and the direction-of-arrival information.

28. The method of claim 21, further comprising:
receiving, at one or more processors of the second device, the data based on the direction-of-arrival information and the embedding; and
processing, at the one or more processors of the second device, audio data representing a sound scene based on the direction-of-arrival information and the embedding to generate modified audio data corresponding to an updated sound scene.

29. A non-transitory computer-readable medium comprising instructions that, when executed by one or more processors of a first device, cause the one or more processors to:
receive audio signals from multiple microphones;
process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and
send, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.

30. A first device comprising:
means for receiving audio signals from multiple microphones;
means for processing the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals; and
means for sending, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.
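Claim 27 has the second device verify the reported class using its own audio data and the direction-of-arrival information. A minimal, non-limiting sketch of such a cross-check follows; the toy classifier, angular tolerance, and confidence threshold are invented stand-ins for whatever a real second device would use:

```python
import numpy as np

def verify_class(reported_class, reported_doa_deg, classify_fn, audio_data,
                 measured_doa_deg, doa_tolerance_deg=15.0, min_confidence=0.5):
    """Accept the first device's report only if a local classifier agrees
    with enough confidence AND the reported direction is consistent with
    this device's own direction-of-arrival estimate."""
    label, confidence = classify_fn(audio_data)
    doa_consistent = abs(reported_doa_deg - measured_doa_deg) <= doa_tolerance_deg
    return label == reported_class and confidence >= min_confidence and doa_consistent

def toy_classifier(audio):
    """Toy stand-in for a real model: calls anything dominated by
    low-frequency energy 'engine', otherwise 'speech'."""
    spectrum = np.abs(np.fft.rfft(audio))
    ratio = spectrum[: spectrum.size // 8].sum() / (spectrum.sum() + 1e-12)
    return ("engine", float(ratio)) if ratio > 0.5 else ("speech", float(1 - ratio))

fs = 16000
t = np.arange(fs) / fs
engine = np.sin(2 * np.pi * 60 * t)  # low-frequency hum
ok = verify_class("engine", 42.0, toy_classifier, engine, measured_doa_deg=45.0)
bad_doa = verify_class("engine", 42.0, toy_classifier, engine, measured_doa_deg=90.0)
```

Requiring agreement on both the class and the direction lets the second device reject reports whose spatial context does not match what it independently observes.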
TW111127948A 2021-07-27 2022-07-26 Processing of audio signals from multiple microphones TW202314684A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163203562P 2021-07-27 2021-07-27
US63/203,562 2021-07-27
US17/814,660 2022-07-25
US17/814,660 US20230036986A1 (en) 2021-07-27 2022-07-25 Processing of audio signals from multiple microphones

Publications (1)

Publication Number Publication Date
TW202314684A true TW202314684A (en) 2023-04-01

Family

ID=82932611

Family Applications (1)

Application Number Title Priority Date Filing Date
TW111127948A TW202314684A (en) 2021-07-27 2022-07-26 Processing of audio signals from multiple microphones

Country Status (3)

Country Link
KR (1) KR20240040737A (en)
TW (1) TW202314684A (en)
WO (1) WO2023010011A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10575117B2 (en) * 2014-12-08 2020-02-25 Harman International Industries, Incorporated Directional sound modification
WO2016118398A1 (en) * 2015-01-20 2016-07-28 3M Innovative Properties Company Mountable sound capture and reproduction device for determining acoustic signal origin
US10976999B1 (en) * 2018-06-15 2021-04-13 Chosen Realities, LLC Mixed reality sensor suite and interface for physical region enhancement
US10638248B1 (en) * 2019-01-29 2020-04-28 Facebook Technologies, Llc Generating a modified audio experience for an audio system

Also Published As

Publication number Publication date
WO2023010011A1 (en) 2023-02-02
KR20240040737A (en) 2024-03-28

Similar Documents

Publication Publication Date Title
JP6747538B2 (en) Information processing equipment
US11095985B2 (en) Binaural recording for processing audio signals to enable alerts
US9271077B2 (en) Method and system for directional enhancement of sound using small microphone arrays
US9277178B2 (en) Information processing system and storage medium
KR20160069475A (en) Directional sound modification
US10636405B1 (en) Automatic active noise reduction (ANR) control
WO2014161309A1 (en) Method and apparatus for mobile terminal to implement voice source tracking
KR102638672B1 (en) Directional sound modification
US11467666B2 (en) Hearing augmentation and wearable system with localized feedback
US20220174395A1 (en) Auditory augmented reality using selective noise cancellation
WO2022242405A1 (en) Voice call method and apparatus, electronic device, and computer readable storage medium
CN106302974B (en) information processing method and electronic equipment
TW202314684A (en) Processing of audio signals from multiple microphones
TW202314478A (en) Audio event data processing
US20230036986A1 (en) Processing of audio signals from multiple microphones
US20230229383A1 (en) Hearing augmentation and wearable system with localized feedback
US11689878B2 (en) Audio adjustment based on user electrical signals
US11812243B2 (en) Headset capable of compensating for wind noise
WO2023058515A1 (en) Information processing method, information processing system, and program
US20240087597A1 (en) Source speech modification based on an input speech characteristic
KR20240049565A (en) Audio adjustments based on user electrical signals
CN117499837A (en) Audio processing method and device and audio playing equipment