TW202028929A - Spatial repositioning of multiple audio streams - Google Patents


Info

Publication number
TW202028929A
Authority
TW
Taiwan
Prior art keywords
audio
brir
spatial audio
data set
processing device
Application number
TW108142945A
Other languages
Chinese (zh)
Other versions
TWI808277B (en)
Inventor
望傅 沈
迪篪 李
Original Assignee
新加坡商創新科技有限公司
Application filed by 新加坡商創新科技有限公司
Publication of TW202028929A
Application granted
Publication of TWI808277B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303 Tracking of listener position or orientation
    • H04S7/304 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S3/002 Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • H04S3/004 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R5/00 Stereophonic arrangements
    • H04R5/04 Circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/04 Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/12 Circuits for transducers, loudspeakers or microphones for distributing signals to two or more loudspeakers
    • H04R3/14 Cross-over networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R5/00 Stereophonic arrangements
    • H04R5/033 Headphones for stereophonic communication
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S1/00 Two-channel systems
    • H04S1/002 Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • H04S1/005 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/301 Automatic calibration of stereophonic sound system, e.g. with test microphone
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08 Feature extraction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2460/00 Details of hearing devices, i.e. of ear- or headphones covered by H04R1/10 or H04R5/033 but not provided for in any of their subgroups, or of hearing aids covered by H04R25/00 but not provided for in any of its subgroups
    • H04R2460/07 Use of position data from wide-area or local-area positioning systems in hearing devices, e.g. program or information selection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R2499/10 General applications
    • H04R2499/11 Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/13 Aspects of volume control, not necessarily automatic, in stereophonic sound systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Stereophonic System (AREA)

Abstract

An audio rendering system includes a processor that combines audio input signals with personalized spatial audio transfer functions, preferably including room responses. The personalized spatial audio transfer functions are selected from a database holding a plurality of candidate transfer function datasets derived from in-ear microphone measurements of a plurality of individuals. Alternatively, the personalized transfer function datasets are derived from actual in-ear measurements of the listener. Foreground and background positions are designated and matched with transfer function pairs from the selected dataset for the foreground and background directions and distances. Two channels of input audio, such as voice and music, are processed. When a voice communication such as a phone call is accepted, the music being rendered is moved from a foreground channel to a background channel corresponding to a background spatial audio position using the personalized transfer functions. The voice call is simultaneously transferred to the foreground channel.

Description

Spatial repositioning of multiple audio streams

The present invention relates to methods and systems for generating audio for rendering through headphones. More particularly, the present invention relates to using a data set of personalized spatial audio transfer functions, and using those personalized spatial audio transfer functions to generate spatial audio positions, so as to create a more realistic audio rendering through headphones, the spatial audio transfer functions having room impulse response information associated with the spatial audio positions together with the audio streams.

Cross-Reference to Related Applications

This application incorporates by reference the entire disclosures of the following prior applications: U.S. Patent Application Serial No. 62/614,482, filed January 7, 2018 and entitled "Method of Generating Customized Spatial Audio by Head Tracking"; and International Application No. PCT/SG2016/050621, filed December 28, 2016 and entitled "Method for Generating Customized/Personalized Head Related Transfer Functions", which claims priority benefit from Singapore Patent Application No. 10201510822Y, filed December 31, 2015 and entitled "Method for Generating Customized/Personalized Head Related Transfer Functions", the entire contents of which are incorporated by reference for all purposes. This application further incorporates by reference the entire disclosures of the following prior applications: U.S. Patent Application Serial No. 15/969,767, filed May 2, 2018 and entitled "System and Processing Method for Customized Audio Experience"; and U.S. Patent Application Serial No. 16/136,211, filed September 19, 2018 and entitled "Method of Generating Customized Spatial Audio by Head Tracking".

Often, a user who is listening to music on a mobile phone may wish for the music to continue uninterrupted when a telephone call comes in. Unfortunately, most mobile phones are configured to mute the music when a call is answered. What is needed is an improved system that allows music or other audio to continue uninterrupted when a call is answered, while allowing the user to distinguish between the two different audio sources.

To achieve the foregoing, the present invention provides, in various embodiments, processors and systems configured to provide binaural signals to headphones, the system including a mechanism for placing the audio in a first input audio channel at a first position (such as a foreground position), and a mechanism for placing the audio in a second input audio channel at a second position (such as a background position).

In some embodiments of the present invention, the system includes a data set of personalized spatial audio transfer functions (such as HRTFs or BRIRs) having room impulse response information associated with spatial audio positions together with at least two audio streams. Personalized BRIRs for at least two positions are used together with the two input audio streams to create a foreground spatial audio source and a background spatial audio source, providing an immersive experience for the listener through headphones.

Preferred embodiments of the present invention will now be discussed in detail. Examples of the preferred embodiments are illustrated in the accompanying drawings. While the invention will be described in connection with these preferred embodiments, it will be understood that there is no intention to limit the invention to those embodiments. On the contrary, the intention is to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known mechanisms have not been described in detail in order to avoid unnecessarily obscuring the present invention.

It should be noted that like reference numerals refer to like parts throughout the various figures. The various figures shown and described herein are used to illustrate the various features of the invention. To the extent that a particular feature is illustrated in one figure and not in another, except where otherwise indicated or where the structure inherently prohibits its incorporation, it is to be understood that those features may be adapted to be included in the embodiments represented in the other figures, as if they were fully illustrated in those figures. Unless otherwise indicated, the drawings are not necessarily to scale. Any dimensions provided in the drawings are not intended to limit the scope of the invention and are merely illustrative.

Binaural technology, which broadly refers to technology relating to or using two ears, enables a user to perceive audio in a three-dimensional field. In some embodiments, this is achieved through the determination and use of a Binaural Room Impulse Response (BRIR) and its related Binaural Room Transfer Function (BRTF). A BRIR models the interaction of sound waves from a loudspeaker with the listener's ears, head, and torso, as well as with the walls and other objects in the room. Alternatively, Head Related Transfer Functions (HRTFs) are used in some embodiments. An HRTF is a transfer function in the frequency domain, corresponding to an impulse response that represents these interactions in an anechoic environment; that is, the impulse response in this case represents the acoustic interaction with the listener's ears, head, and torso only.

According to known methods for determining HRTFs or BRTFs, a real or dummy head fitted with binaural microphones is used to record a stereo impulse response (IR) for each of several loudspeaker positions in a real room. That is, a pair of impulse responses, one for each ear, is generated for each position; this pair is referred to as a BRIR. Music tracks or other audio streams can then be convolved (filtered) with these BRIRs, and the results mixed together and played back through headphones. If the correct equalization is applied, the music channels will sound as though they are being played from the loudspeaker positions in the room where the BRIRs were recorded.
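As a concrete illustration of the convolution-and-mix step just described, the following is a minimal sketch (not the patent's implementation) of rendering mono streams through BRIR pairs with NumPy/SciPy; the function and array names are assumptions made for the example.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_through_brir(mono, brir_left, brir_right):
    """Convolve a mono stream with a left/right BRIR pair -> 2-channel binaural output."""
    left = fftconvolve(mono, brir_left, mode="full")
    right = fftconvolve(mono, brir_right, mode="full")
    return np.stack([left, right])              # shape: (2, len(mono) + len(brir) - 1)

def mix_virtual_speakers(channels, brir_pairs):
    """Each channel is convolved with its own BRIR pair and the results are summed,
    truncating to the shortest result for simplicity."""
    out = None
    for mono, (bl, br) in zip(channels, brir_pairs):
        y = render_through_brir(mono, bl, br)
        out = y if out is None else out[:, :y.shape[1]] + y[:, :out.shape[1]]
    return out
```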

Often, a user who is listening to music on a mobile phone may wish, when a call comes in, for the music to continue uninterrupted while the call is answered. Rather than invoking a mute function, two separate audio signals (i.e., the telephone call and the music) could be fed into the same channel. In general, however, people have difficulty distinguishing sound sources that come from the same direction. To address this, and according to one embodiment, when an incoming call arrives the music is directed from a first position to a loudspeaker or channel at a second position (such as a background position); that is, the music and the voice communication are placed at different positions. Unfortunately, although these methods of positioning the rendered audio streams allow the sources to be separated when a multi-speaker setup is used, most voice communication today takes place over mobile phones, which are typically not connected to a multi-channel loudspeaker setup. Moreover, even with the multi-channel approach described above, when an audio source is assigned a position by panning and that position is not exactly aligned with an actual loudspeaker position, the approach sometimes gives less than optimal results. This is partly because listeners have difficulty precisely localizing a spatial audio position when that position is only approximated by conventional panning methods that move the perceived audio position to somewhere between the multi-channel loudspeaker positions.
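For context, here is a sketch of the conventional amplitude-panning approach referred to above, which only approximates positions between two loudspeakers; the constant-power law shown is a common choice and is not something specified by this document.

```python
import numpy as np

def constant_power_pan(mono, position):
    """position: 0.0 = fully left speaker, 1.0 = fully right speaker.
    Returns a 2 x N stereo signal; the perceived source sits somewhere between
    the two physical speakers, which is the approximation criticized above."""
    theta = position * np.pi / 2.0
    gain_left, gain_right = np.cos(theta), np.sin(theta)
    return np.stack([gain_left * mono, gain_right * mono])
```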

The present invention addresses these problems of voice communication through headphones by automatically positioning the voice call and the music at different spatial audio positions, using positions that are virtualized with transfer functions (for example, HRTFs) that model at least the effects of the individual's head, torso, and ears on the audio. More preferably, the effect of the room on the audio is also taken into account by processing the audio streams with BRIRs. However, non-individualized commercial BRIR data sets give most users a poor sense of direction and an even worse sense of distance to the perceived sound source. This can make it difficult to distinguish the sound sources.

To address these additional problems, in some embodiments the present invention uses individualized BRIRs. In one embodiment, an individualized HRTF or BRIR data set is generated by inserting microphones into the listener's ears and recording impulse responses during a recording session. This is a time-consuming process and may be impractical to include in the sale of a mobile phone or other audio unit. In a further embodiment, the voice and music sound sources are positioned at separate first (e.g., foreground) and second (e.g., background) positions using individualized BRIRs (or associated BRTFs) derived from image-based properties captured for each individual listener; those properties are used to determine a suitable individualized BRIR from a database holding a candidate library of individualized spatial audio transfer functions for a plurality of measured individuals. Individualized BRIRs corresponding to each of at least two separate spatial audio positions are preferably used to direct the first and second audio streams to the two different spatial audio positions.

Furthermore, since it is known that people can better distinguish two sound sources when one is judged by the listener to be closer and the other to be farther away, in some embodiments, using individualized BRIRs derived from captured image-based properties, the music is automatically placed at some distance in the background spatial position and the voice is placed at a closer distance.

In yet another embodiment, the captured image-based properties are generated by a mobile phone. In another embodiment, when the voice call is determined to be of lower priority and a control signal from the listener is received, such as by actuating a switch, the voice call is directed from the foreground to the background and the music is directed to the foreground. In a further embodiment, when the voice call is determined to be of lower priority and a control signal from the listener is received, the apparent distance of the voice call is increased and the apparent distance of the music is decreased, using individualized BRIRs corresponding to different distances in the same direction.

Although most of the embodiments herein describe personalized BRIRs used with headphones, it should be appreciated that the described techniques for positioning a media stream in combination with voice communication can also be extended, in accordance with the steps described with reference to FIG. 3, to any suitable transfer function customized for the user.

It should be appreciated that the scope of the invention is intended to cover placing the respective first audio source and the voice communication at any positions around the user. Furthermore, the terms foreground and background as used herein are not intended to be limited to regions in front of or behind the listener, respectively. Rather, foreground is to be read in its most general sense as referring to the more prominent or important of two separate positions, while background refers to the less prominent of the separate positions. Moreover, it should be noted that the scope of the invention extends, in the most general sense, to using HRTFs or BRIRs according to the techniques described herein to direct a first audio stream to a first position and a second audio stream to a second spatial audio position. It should further be noted that some embodiments of the invention may, by simultaneously applying signal attenuation, extend to selecting positions in any direction around the user as the foreground or background position, instead of designating a closer distance as the foreground position and a farther position as the background position. In its simplest form, filter circuitry applying two pairs of BRIRs to represent the foreground and background positions is shown first, in accordance with embodiments of the invention.

FIG. 1 is a schematic diagram illustrating spatial audio positions for audio processed according to some embodiments of the present invention. Initially, a listener 105 may be listening to a first audio signal, such as music, through headphones 103. Using BRIRs applied to the first audio stream, the listener perceives the first audio stream as coming from a first audio position 102. In some embodiments this is a foreground position. In one embodiment, one technique places this foreground position at the zero-degree position relative to the listener 105. When a triggering event occurs, such as receiving a telephone call in one embodiment, the second stream (e.g., the voice communication or telephone call) is routed to the first position 102 and the first audio signal is routed to a second position 104. In the illustrated example embodiment, this second position is placed at the 200-degree position, which in some embodiments is described as a less prominent or background position. The 200-degree position is chosen only as a non-limiting example. Placing the audio stream at this second position is preferably achieved by using the BRIR (or BRTF) corresponding to the azimuth, elevation, and distance of this second position for the listener concerned.
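A minimal sketch of the routing behaviour described for FIG. 1 is shown below. It assumes a BRIR data set that can be looked up by (azimuth, elevation, distance) tuples and reuses render_through_brir() from the convolution sketch above; the positions and the 1.2 m distance are example values, not values taken from the patent.

```python
# Hypothetical positions inspired by FIG. 1: foreground at 0 degrees, background at 200 degrees.
FOREGROUND = (0.0, 0.0, 1.2)     # (azimuth_deg, elevation_deg, distance_m) -- example values
BACKGROUND = (200.0, 0.0, 1.2)

def route_streams(music, voice, brir_dataset, call_active):
    """On a trigger event (incoming call), the voice takes the foreground
    position and the music is moved to the background position."""
    if call_active:
        assignments = [(voice, FOREGROUND), (music, BACKGROUND)]
    else:
        assignments = [(music, FOREGROUND)]
    rendered = []
    for stream, pos in assignments:
        brir_left, brir_right = brir_dataset[pos]       # lookup by (az, el, dist)
        rendered.append(render_through_brir(stream, brir_left, brir_right))
    n = min(r.shape[1] for r in rendered)               # truncate to common length
    return sum(r[:, :n] for r in rendered)
```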

In one embodiment, the transition of the first audio stream to the second position (e.g., the background) occurs abruptly, without providing any sense that the first audio stream is moving through intermediate spatial positions. This is depicted graphically by path 110, which shows no intermediate spatial positions. In another embodiment, the audio is positioned at intermediate points 112 and 114 during a brief transition period, to provide a sense of direct movement from the foreground position 102 to the background position 104, or alternatively an arc-shaped movement. In preferred embodiments, BRIRs for the intermediate positions 112 and 114 are used to spatially position the audio stream. In an alternative embodiment, the sense of movement is achieved by using the BRIRs for the foreground and background positions and panning between the virtual loudspeakers corresponding to those foreground and background positions. In some embodiments, the user may determine that the voice communication (e.g., the telephone call) should not be given priority status and may choose to move the telephone call to the second position (e.g., the background position), or even to a third position selected by the user, and return the music to the first (e.g., foreground) position. In one embodiment, this is done by routing the audio stream corresponding to the music back to the foreground (first) position 102 and routing the voice communication to the background position 104. In another embodiment, this reordering of priority is carried out by making the voice call more distant and the music closer to the listener's head 105. This is preferably done by specifying new HRTFs or BRTFs for the listener that were captured at different distances, computed from captured measurements, or interpolated to represent the new distances. For example, to increase the priority of the music coming from the background position 104, its apparent distance may be reduced to spatial audio position 118 or 116. This reduced distance, preferably achieved by processing the music audio stream with the new HRTF or BRTF, increases the volume of the music relative to the voice communication signal. In some embodiments, again by selecting captured HRTF/BRTF values or by interpolation, the distance of the voice signal from the listener's head 105 can be increased at the same time. Interpolation or computation may use more than two points. For example, to derive a point that is the intersection of two lines (AB and CD), the interpolation or computation may require points A, B, C, and D.
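One simple way to realize the short transition described above is a crossfade between the renderings at the old and new positions. This is only an illustrative sketch of that idea (the patent also describes using BRIRs for intermediate positions instead), and the fade length is an arbitrary assumption.

```python
import numpy as np

def crossfade_transition(rendered_old, rendered_new, fade_samples=4800):
    """Blend two binaural renderings (2 x N arrays) over `fade_samples`
    (e.g. ~100 ms at 48 kHz), then continue with the new position only."""
    n = min(rendered_old.shape[1], rendered_new.shape[1])
    old, new = rendered_old[:, :n], rendered_new[:, :n]
    fade = np.linspace(0.0, 1.0, min(fade_samples, n))
    out = new.copy()
    out[:, :fade.size] = (1.0 - fade) * old[:, :fade.size] + fade * new[:, :fade.size]
    return out
```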

Alternatively, the spatial audio position at which the voice communication is rendered may be kept fixed, or its distance increased, during the reordering step. In some embodiments, the two separate audio streams are given equal prominence.

In still other embodiments, the user may select, from a user interface, the spatial audio position for at least one of the streams, or more preferably single or multiple positions for all of the streams.

FIG. 2 is a schematic diagram illustrating a system for simulating an audio source and a voice communication at different spatial audio positions according to some embodiments of the present invention. FIG. 2 depicts, in general terms, two different streams (202 and 204) entering the spatial audio positioning system, which uses a separate pair of filters for the first spatial audio position (i.e., filters 207, 208) and filters 209, 210 for the second spatial audio position. Gains 222-225 may be applied to all of the filtered streams before the signals for the left headphone cup are summed in adder 214 and the filtered results for the right headphone cup of headphones 216 are similarly summed in adder 215. Although this combination of hardware modules illustrates the basic principles involved, other embodiments use BRIRs or HRTFs stored in memory, such as the memory 732 of the audio rendering module 730 (for example, in a mobile phone) shown in FIG. 3. In some embodiments, the listener is aided in distinguishing the first and second spatial audio positions by the fact that those spatial audio positions are generated with personally selected transfer functions that include a room response in addition to the HRTF. In preferred embodiments, the first and second positions are determined using BRIRs customized for the listener.
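The block diagram of FIG. 2 (two streams, one filter pair per position, per-path gains 222-225, and per-ear adders 214/215) can be expressed roughly as in the sketch below; the gain values and function names are placeholders, not part of the patent.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_two_streams(stream1, stream2, brir_pos1, brir_pos2,
                       gains=(1.0, 1.0, 1.0, 1.0)):
    """stream1 -> filters 207/208 (position 1), stream2 -> filters 209/210 (position 2);
    gains correspond to 222-225; adders 214/215 sum the left and right paths."""
    (l1, r1), (l2, r2) = brir_pos1, brir_pos2
    g1l, g1r, g2l, g2r = gains
    left_1 = g1l * fftconvolve(stream1, l1)
    right_1 = g1r * fftconvolve(stream1, r1)
    left_2 = g2l * fftconvolve(stream2, l2)
    right_2 = g2r * fftconvolve(stream2, r2)
    n = min(len(left_1), len(right_1), len(left_2), len(right_2))
    left = left_1[:n] + left_2[:n]      # adder 214 (left ear)
    right = right_1[:n] + right_2[:n]   # adder 215 (right ear)
    return np.stack([left, right])      # two output channels to headphones 216
```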

The systems and methods for rendering through headphones work best when the HRTFs or BRTFs are individualized for the listener, whether through direct in-ear microphone measurement or through individualized BRIR/HRIR data sets obtained without in-ear microphone measurements. According to preferred embodiments of the present invention, a customization method for generating BRIRs is used that involves capturing image-based properties from the user and determining a suitable BRIR from a candidate library of BRIRs, as outlined in FIG. 3. In more detail, FIG. 3 illustrates a system according to an embodiment of the present invention for generating HRTFs for customization purposes, obtaining listener properties for customization, selecting customized HRTFs for the listener, providing rotation filters adapted to operate with the associated user head movements, and rendering audio as modified by the BRIRs. The capture device 702 is a device configured to identify and capture the audio-related physical properties of the listener. Although block 702 could be configured to measure those properties directly (e.g., the height of an ear), in preferred embodiments the relevant measurements are extracted from acquired images of the user that include at least the user's ear or ears. The processing necessary to extract those properties preferably takes place in the capture device 702, but may equally be located elsewhere. As a non-limiting example, the properties may be extracted by a processor at the remote server 710 after receipt of the images from the image sensor 704.

In preferred embodiments, the image sensor 704 acquires images of the user's ears, and the processor 706 is configured to extract the relevant properties for the user and transmit them to the remote server 710. For example, in one embodiment, an Active Shape Model may be used to identify landmarks in an image of the pinna, and those landmarks, together with their geometric relationships and linear distances, are used to identify properties of the user that are relevant to generating a customized BRIR from a set of stored BRIR data sets (i.e., from a candidate library of BRIR data sets). In other embodiments, a regression tree model (RGT) is used to extract the properties. In still other embodiments, machine learning, such as neural networks and other forms of artificial intelligence (AI), is used to extract the properties. One example of a neural network is a convolutional neural network. A complete discussion of several methods for identifying the unique physical properties of a new listener is provided in Application No. PCT/SG2016/050621, filed December 28, 2016 and entitled "Method for Generating Customized/Personalized Head Related Transfer Functions", the disclosure of which is incorporated herein by reference in its entirety.
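Purely to illustrate the kind of geometric properties mentioned (landmarks and their linear distances), here is a hypothetical sketch that turns 2-D pinna landmark coordinates into a feature vector of pairwise distances. It is not the Active Shape Model itself, and the landmark-detection step is assumed to exist elsewhere.

```python
import numpy as np
from itertools import combinations

def landmark_distance_features(landmarks_xy):
    """landmarks_xy: (K, 2) array of pinna landmark coordinates (pixels or mm).
    Returns the vector of all pairwise Euclidean distances between landmarks,
    one simple way of encoding their geometric relationships."""
    pts = np.asarray(landmarks_xy, dtype=float)
    return np.array([np.linalg.norm(pts[i] - pts[j])
                     for i, j in combinations(range(len(pts)), 2)])
```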

The remote server 710 is preferably accessible over a network such as the Internet. The remote server preferably includes a selection processor 712 that accesses a memory 714 in order to determine the best-matching BRIR data set using the physical properties, or other image-related properties, captured by the capture device 702. The selection processor 712 preferably accesses the memory 714, which holds a plurality of BRIR data sets. That is, each data set in the candidate library will preferably have a BRIR pair for each point at appropriate angles of azimuth and elevation, and perhaps also of head tilt. For example, measurements may be taken every 3 degrees in azimuth and elevation to produce the BRIR data set for each sampled individual making up the candidate library of BRIRs.

As discussed earlier, these are preferably derived from in-ear microphone measurements of a moderately sized population (i.e., more than 100 individuals), although the approach can work with smaller groups of individuals, and they are stored together with the similar image-related properties associated with each BRIR data set. These may be generated partly by direct measurement and partly by interpolation, to form a spherical grid of BRIR pairs. Even with a partly measured/partly interpolated grid, once appropriate azimuth and elevation values are used to identify the appropriate BRIR pair for a point from the BRIR data set, additional points that do not fall on the grid lines can be interpolated. For example, any suitable interpolation method may be used, including without limitation adjacent linear interpolation, bilinear interpolation, and spherical triangular interpolation, preferably in the frequency domain.
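As one illustration of frequency-domain interpolation on such a grid, the sketch below bilinearly blends the four surrounding grid points; it assumes a regular grid, that all four neighbouring points exist, and that complex spectra can be blended directly, which is a simplification (practical systems often treat magnitude and delay separately).

```python
def bilinear_brtf(grid, az, el, step=3.0):
    """Interpolate a BRTF pair at (az, el) degrees from a regular grid sampled
    every `step` degrees. `grid[(az, el)]` is assumed to return a (2, n_bins)
    complex array (left/right spectra) at the exact grid angles."""
    az0 = (az // step) * step
    el0 = (el // step) * step
    az1 = (az0 + step) % 360.0          # wrap azimuth
    el1 = el0 + step
    u = (az - az0) / step
    v = (el - el0) / step
    return ((1 - u) * (1 - v) * grid[(az0, el0)]
            + u * (1 - v) * grid[(az1, el0)]
            + (1 - u) * v * grid[(az0, el1)]
            + u * v * grid[(az1, el1)])
```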

In one embodiment, each BRIR data set stored in the memory 714 includes at least one complete spherical grid for the listener. In that case, any angle in azimuth (in the horizontal plane around the user, i.e., at ear height) or elevation may be selected for placement of a sound source. In other embodiments, the BRIR data sets are more limited; in one example they are limited to the BRIR pairs needed to produce loudspeaker placements in a room conforming to a conventional stereo setup (i.e., at +30 degrees and -30 degrees relative to the straight-ahead zero position), or to another subset of the complete spherical grid, and loudspeaker placements for multi-channel setups such as a 5.1 system or a 7.1 system are not excluded.

An HRIR is a head-related impulse response. It completely describes, in the time domain, the transmission of sound from a source to a receiver under anechoic conditions. Most of its information relates to the physiology and anthropometry of the person measured. An HRTF is a head-related transfer function. It is equivalent to the HRIR, except that it is a description in the frequency domain. A BRIR is a binaural room impulse response. It is equivalent to an HRIR, except that it is measured in a room, and it therefore additionally incorporates the room response for the particular configuration in which it was captured. A BRTF is the frequency-domain version of a BRIR. It should be appreciated that, throughout this specification, because BRIR and BRTF are readily interchangeable, and likewise HRIR and HRTF, embodiments of the invention are intended to cover those readily interchangeable steps even where they are not explicitly described. Thus, for example, when the description refers to accessing another BRIR data set, it should be understood to also cover accessing another BRTF.
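The time-domain/frequency-domain equivalence described above is simply a Fourier transform; a short sketch of the conversion (the same applies to BRIR/BRTF):

```python
import numpy as np

def hrir_to_hrtf(hrir, n_fft=None):
    """HRIR (time domain) -> HRTF (frequency domain) for one ear."""
    return np.fft.rfft(hrir, n=n_fft)

def hrtf_to_hrir(hrtf, n_fft=None):
    """Inverse transform back to an impulse response."""
    return np.fft.irfft(hrtf, n=n_fft)
```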

FIG. 3 further depicts a sample logical relationship for the data stored in the memory. The memory is shown as including, in row 716, BRIR data sets for several individuals (e.g., HRTF DS1A, HRTF DS2A, and so on). These are indexed and accessed by the properties (preferably image-related properties) associated with each BRIR data set. The associated properties shown in row 715 enable the properties of a new listener to be matched against the properties of the associated BRIRs, those BRIR-associated properties having been measured and stored in rows 716, 717, and 718. That is, they serve as an index into the candidate library of BRIR data sets shown in those rows. Row 717 relates to stored BRIRs at the zero reference position, which are associated with the remaining BRIR data sets and which, when the listener's head rotation is monitored and followed, can be combined with rotation filters for efficient storage and processing. A further description of this option is detailed in co-pending Application No. 16/136,211, filed September 19, 2018 and entitled "Method of Generating Customized Spatial Audio by Head Tracking", which is incorporated herein by reference in its entirety.
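A hypothetical sketch of how this logical layout (properties index in row 715, first-distance data set in row 716, reference-position BRIR in row 717, second-distance data set in row 718) might be represented in code is given below; the field names and types are illustrative assumptions only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class CandidateEntry:
    properties: np.ndarray                                 # row 715: image-derived feature vector
    brir_grid: Dict[Tuple[float, float], np.ndarray]       # row 716: (az, el) -> BRIR pair, first distance
    reference_brir: np.ndarray                             # row 717: BRIR at the zero reference position
    brir_grid_far: Dict[Tuple[float, float], np.ndarray]   # row 718: grid measured at a second distance

# The candidate library is then a list of such entries, searched by comparing
# a new listener's feature vector against each entry's `properties`.
candidate_library: List[CandidateEntry] = []
```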

In summary, one purpose of accessing the candidate library of BRIR (or HRTF) data sets is to generate customized audio response characteristics (such as a BRIR data set) for an individual. In some embodiments, these are used to process input audio signals, such as voice communications and media streams, so as to position them for accurate perception of the spatial audio associated with the first and second positions, as described above. In some embodiments, generating such customized audio response characteristics, such as an individualized BRIR, includes extracting image-related properties for the individual, such as biometric data. For example, the biometric data may include data about the pinna of the ear, the individual's entire ear, head, and/or shoulders. In further embodiments, processing strategies such as (1) multiple-match, (2) multiple-recognizer, and (3) cluster-based strategies are used to generate intermediate data sets, which are later combined (where multiple hits result) to produce the customized BRIR data set for the individual. These may be combined by using a weighted sum, among other methods. Where there is only a single match, there is no need to combine intermediate results. In one embodiment, the intermediate data sets are based at least in part on the closeness of match of the retrieved BRIR data sets (from the candidate library) with respect to the extracted properties. In other embodiments, a multiple-recognizer matching step is used, whereby the processor retrieves one or more data sets based on a plurality of training parameters corresponding to the biometric data. In still other embodiments, a cluster-based processing strategy is used, whereby potential data sets are clustered based on the extracted data (e.g., biometric data). A cluster comprises multiple data sets that have a relationship, where they are clustered or grouped together to form a model with corresponding BRIR data sets that match the data (e.g., biometric measurements) extracted from the images.
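To make the weighted-sum idea concrete, here is a hedged sketch of one possible multiple-match combination: the closest candidates are found by feature distance and their BRIR grids are blended with proximity-based weights. It reuses the illustrative CandidateEntry layout sketched above; the distance metric, the value of k, and the weighting rule are all assumptions, not details taken from the patent.

```python
import numpy as np

def combine_nearest(query_features, candidate_library, k=3):
    """Return a customized BRIR grid as a proximity-weighted sum of the k
    closest candidates (weights ~ 1/distance). Assumes all candidates share
    the same grid keys and BRIR lengths."""
    dists = np.array([np.linalg.norm(query_features - c.properties)
                      for c in candidate_library])
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-9)
    weights /= weights.sum()
    keys = candidate_library[nearest[0]].brir_grid.keys()
    return {key: sum(w * candidate_library[i].brir_grid[key]
                     for w, i in zip(weights, nearest))
            for key in keys}
```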

In some embodiments of the present invention, two or more distance spheres are stored. This refers to spherical grids produced at two different distances from the user. In one embodiment, one reference-position BRIR is stored and associated with two or more spherical-grid distance spheres. In other embodiments, each spherical grid has its own reference BRIR for use with the applicable rotation filters. The selection processor 712 is used to match the properties in the memory 714 with the extracted properties received from the capture device 702 for the new listener. Various methods are used to match the associated properties so that the correct BRIR data set can be derived. As noted above, these methods include comparing biometric data using a multiple-match processing strategy, a multiple-recognizer processing strategy, a cluster-based processing strategy, and the other methods of U.S. Patent Application No. 15/969,767, filed May 2, 2018 and entitled "System and Processing Method for Customized Audio Experience", the disclosure of which is incorporated herein by reference in its entirety. Row 718 refers to the group of BRIR data sets for the measured individuals at a second distance. That is, this row holds the BRIR data sets recorded for the measured individuals at the second distance. As one example, the first BRIR data sets in row 716 may be acquired at 1.0 m to 1.5 m from the listener, whereas the BRIR data sets in row 718 may refer to those measured at 5 m from the listener. Ideally, a BRIR data set forms a complete spherical grid, but embodiments of the invention apply to any and all subsets of the complete spherical grid, including without limitation subsets containing the BRIR pairs of a conventional stereo setup, a 5.1 multi-channel setup, a 7.1 multi-channel setup, and all other variations and subsets of spherical grids, including BRIR pairs every 3 degrees or less in both azimuth and elevation, and spherical grids whose density is irregular. For example, this might include a spherical grid in which the density of grid points at frontal positions is much greater than the density of grid points behind the listener. Furthermore, the arrangement of the contents of rows 716 and 718 applies not only to stored BRIR pairs as derived by measurement and interpolation, but also to those further refined by generating BRIR data sets reflecting their conversion into BRIRs containing rotation filters.

After one or more matching or computed BRIR data sets have been determined, the data sets are transmitted to the audio rendering device 730 for storage of the entire BRIR data set determined for the new listener by the matching or other techniques described above, or, in some embodiments, of a subset corresponding to selected spatialized audio positions. In one embodiment, the audio rendering device then selects the BRIR pair for the desired azimuth and elevation position and applies it to the input audio signal in order to provide spatialized audio to the headphones 735. In other embodiments, the selected BRIR data set is stored in a separate module coupled to the audio rendering device 730 and/or the headphones 735. In other embodiments, where only limited storage is available in the rendering device, the rendering device stores only an identification of the associated property data that best matches the listener, or an identification of the best-matching BRIR data set, and downloads the desired BRIR pairs (for the selected azimuth and elevation) from the remote server 710 in real time as needed. As discussed earlier, these BRIR pairs are preferably derived from in-ear microphone measurements of a moderately sized population (i.e., more than 100 individuals) and are stored together with the similar image-related properties associated with each BRIR data set. Rather than acquiring all 7200 points, these may be generated partly by direct measurement and partly by interpolation to form a spherical grid of BRIR pairs. Even with a partly measured/partly interpolated grid, additional points that do not fall on the grid lines can be interpolated, once appropriate azimuth and elevation values are used to identify the appropriate BRIR pair for a point from the BRIR data set.

Once the HRTF or BRIR data set selected for the customer has been chosen for the individual, these individualized transfer functions are used to enable the user or the system to provide at least first and second spatial audio positions for locating the respective media stream and voice communication. In other words, a pair of transfer functions is used for each of the first and second spatial audio positions to virtually place those streams, and their separate spatial audio positions thereby enable the listener to concentrate on his preferred audio stream (e.g., the telephone call or the media stream). The scope of the invention is intended to cover all media streams, including without limitation audio associated with video, and music.

Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.

102: first audio position (foreground position)
103: headphones
104: second position (background position)
105: listener (head)
110: path
112, 114: intermediate points (intermediate positions)
116, 118: spatial audio positions
202, 204: streams
207, 208: filters
209, 210: filters
214, 215: adders
216: headphones
222, 223, 224, 225: gains
700: system
702: capture device
704: image sensor
706: processor
710: remote server
712: selection processor
714: memory
715, 716, 717, 718: rows
720: BRIR generation module
730: audio rendering module
732: memory
735: headphones

[FIG. 1] is a schematic diagram illustrating spatial audio positions for audio processed according to some embodiments of the present invention.

[FIG. 2] is a schematic diagram illustrating a system for presenting, at different spatial audio positions, an audio source (such as from any of several different types of media) and a voice communication, according to some embodiments of the present invention.

[FIG. 3] is a schematic diagram illustrating a system according to an embodiment of the present invention for generating BRIRs for customization, obtaining listener properties for customization, selecting a customized BRIR for the listener, and rendering audio modified by the BRIRs.

700: system
702: capture device
704: image sensor
706: processor
710: remote server
712: selection processor
714: memory
715, 716, 717, 718: rows
720: BRIR generation module
730: audio rendering module
732: memory
735: headphones

Claims (20)

1. An audio processing device for processing events by using a spatial audio position transfer function data set, the device comprising:
an audio rendering module configured to position first and second audio signals, respectively comprising at least a voice communication stream and a media stream, at selected ones of at least a first spatial audio position and a second spatial audio position, each of the first and second spatial audio positions being rendered by using respective first and second transfer functions from the spatial audio position transfer function data set;
a monitoring module for monitoring the initiation of a voice communication event, the event comprising receiving a telephone call, and, upon initiation of the telephone call, processing the first and second audio signals by positioning the voice communication at the first spatial audio position and positioning the media stream at the second spatial position; and
an output module configured to render the combined audio through two output channels to a coupled pair of headphones.

2. The audio processing device of claim 1, wherein the spatial audio position transfer function data set is one of an individualized Head Related Impulse Response (HRIR) data set or an individualized Binaural Room Impulse Response (BRIR) data set customized for an individual.

3. The audio processing device of claim 2, further comprising a second processor configured to extract image-based properties for the individual from an input image and to transmit the image-based properties to a selection processor, the selection processor being configured to determine the individualized HRIR or BRIR data set from a memory having a candidate library of a plurality of HRIR or BRIR data sets provided for a group of individuals, each of the HRIR or BRIR data sets being associated with its corresponding image-based properties.

4. The audio processing device of claim 3, wherein the selection processor determines the individualized BRIR data set by accessing the candidate library and by comparing the image-based properties extracted for the individual with the properties extracted for the candidate library, identifying one or more BRIR data sets based on a proximity metric, and wherein the processing strategy used is one of multiple-match, multiple-recognizer, and cluster-based.
如請求項2所述之音訊處理裝置,其中來自確定的所述個體化BRIR資料集之所述第一與第二空間音訊位置是從在所述記憶體中的捕捉資料集而藉由內插法或其他計算方法所導出,且其中所述第一與第二空間音訊位置分別包含前景與背景位置。The audio processing device according to claim 2, wherein the first and second spatial audio positions from the determined individualized BRIR data set are obtained from a captured data set in the memory by interpolation Method or other calculation methods, and wherein the first and second spatial audio positions include foreground and background positions, respectively. 如請求項5所述之音訊處理裝置,其中在由所述個體聆聽者確定所述語音通話為較低優先順序且產生對應控制訊號時,所述語音通話被指向到所述背景位置且所述音樂被指向到所述前景位置。The audio processing device according to claim 5, wherein when the individual listener determines that the voice call is of lower priority and generates a corresponding control signal, the voice call is directed to the background position and the The music is directed to the foreground position. 如請求項2所述之音訊處理裝置,其中在由所述個體聆聽者確定所述語音通話為較低優先順序且產生對應控制訊號時,使用對應於針對於相同方向的不同距離之個體化BRIR,所述語音通話的視在距離被增大且所述音樂的視在距離被減小。The audio processing device according to claim 2, wherein when the individual listener determines that the voice call is of lower priority and generates the corresponding control signal, the individualized BRIR corresponding to different distances in the same direction is used , The apparent distance of the voice call is increased and the apparent distance of the music is reduced. 如請求項2所述之音訊處理裝置,其中從所述語音通訊和所述媒體串流之各別初始位置,到所述第一空間音訊位置的所述語音通訊之所述定位、以及到所述第二空間音訊位置的所述媒體串流之所述定位是以突發方式來實行。The audio processing device according to claim 2, wherein from the respective initial positions of the voice communication and the media stream, the positioning of the voice communication to the first spatial audio position, and to the The positioning of the media stream of the second spatial audio position is performed in a burst mode. 如請求項2所述之音訊處理裝置,其更包括裝配以取得所述輸入影像之可攜式影像捕捉裝置,且其中所述音訊處理裝置是取得所述影像且擷取所述基於影像性質之行動電話、通訊裝置、或平板電腦的一者。The audio processing device according to claim 2, which further includes a portable image capturing device equipped to obtain the input image, and wherein the audio processing device obtains the image and captures the image-based One of a mobile phone, a communication device, or a tablet computer. 如請求項1所述之音訊處理裝置,其中所述音訊處理裝置被裝配以在所述語音通訊串流之終止時而將所述媒體串流重定位到所述第一空間音訊位置。The audio processing device according to claim 1, wherein the audio processing device is configured to relocate the media stream to the first spatial audio position when the voice communication stream is terminated. 如請求項1所述之音訊處理裝置,其中所述媒體串流包含音樂。The audio processing device according to claim 1, wherein the media stream includes music. 如請求項1所述之音訊處理裝置,其中使用來自對應於針對於相同方向的不同距離之個體化BRIR的各別第一與第二空間音訊位置聲音轉移函數,所述語音通話的視在距離被增大且所述音樂的視在距離被減小。The audio processing device according to claim 1, wherein respective first and second spatial audio position sound transfer functions from individualized BRIRs corresponding to different distances in the same direction are used, and the apparent distance of the voice call Is increased and the apparent distance of the music is reduced. 如請求項1所述之音訊處理裝置,其中所述輸出模組是經由無線連接與有線連接的一者而耦接到所述頭戴式耳機。The audio processing device according to claim 1, wherein the output module is coupled to the headset via one of a wireless connection and a wired connection. 如請求項1所述之音訊處理裝置,其中所述輸出模組包括數位至類比轉換器,且到所述頭戴式耳機之所述耦接是透過類比埠。The audio processing device of claim 1, wherein the output module includes a digital-to-analog converter, and the coupling to the headset is through an analog port. 
如請求項1所述之音訊處理裝置,其中所述輸出模組被裝配以將數位訊號通過到所述頭戴式耳機,且所述頭戴式耳機包括數位至類比轉換器。The audio processing device according to claim 1, wherein the output module is equipped to pass digital signals to the headset, and the headset includes a digital-to-analog converter. 如請求項1所述之音訊處理裝置,其更包含裝配以供選取針對於所述第一空間音訊位置與所述第二空間音訊位置的至少一者的位置之使用者介面。The audio processing device according to claim 1, further comprising a user interface configured to select a position for at least one of the first spatial audio position and the second spatial audio position. 一種用於處理到一組頭戴式耳機的音訊串流之方法,所述方法包含: 定位分別包含至少一語音通訊串流與一媒體串流之第一與第二音訊訊號於至少第一空間音訊位置與第二空間音訊位置之經選擇者,所述第一與第二空間音訊位置的各者是藉由使用來自空間音訊位置轉移函數資料集的各別第一與第二轉移函數所顯現; 監測語音通訊事件之起始,所述事件包含接到電話通話,且在所述電話通話之起始時,藉由將所述語音通訊定位到所述第一空間音訊位置且將所述媒體串流定位到所述第二空間音訊位置以處理所述第一與第二音訊訊號,其中至少一個關聯場所脈衝響應是針對於所述第二空間音訊位置而存在;且 透過二個輸出頻道來將合成的所述音訊顯現到耦接對的頭戴式耳機。A method for processing audio streaming to a set of headsets, the method comprising: Locating the first and second audio signals of at least one voice communication stream and a media stream in the selected ones of at least a first spatial audio position and a second spatial audio position, respectively, the first and second spatial audio positions Each of is revealed by using the respective first and second transfer functions from the spatial audio position transfer function data set; Monitor the start of a voice communication event, the event includes receiving a phone call, and at the start of the phone call, by positioning the voice communication to the first spatial audio position and linking the media string The stream is positioned to the second spatial audio location to process the first and second audio signals, wherein at least one associated place impulse response exists for the second spatial audio location; and The synthesized audio is displayed to the headset of the coupling pair through two output channels. 如請求項17所述之方法,其中所述空間音訊位置轉移函數資料集是針對於個體所客製化之HRIR資料集或BRIR資料集的一者。The method according to claim 17, wherein the spatial audio position transfer function data set is one of the HRIR data set or the BRIR data set customized for the individual. 如請求項18所述之方法,其中所述客製化包括從輸入影像來擷取針對於所述個體的基於影像性質,且將所述基於影像性質傳送到選擇處理器,所述選擇處理器被裝配以從具有已經針對於一群個體所提供的複數個HRIR或BRIR資料集的候選庫之記憶體而確定個體化HRIR或BRIR資料集,所述HRIR或BRIR資料集各自關聯於其對應的基於影像性質。The method of claim 18, wherein the customization includes extracting image-based properties for the individual from the input image, and transmitting the image-based properties to a selection processor, the selection processor Is assembled to determine an individualized HRIR or BRIR data set from a memory with a plurality of candidate HRIR or BRIR data sets that have been provided for a group of individuals, each of which is associated with its corresponding Image nature. 如請求項19所述之方法,其中確定所述個體化BRIR資料集包含在所述候選庫中的現存BRIR資料集之間的進行內插。The method according to claim 19, wherein it is determined that the individualized BRIR data set is included in the candidate library to perform interpolation between existing BRIR data sets.
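
To make the rendering step of claims 1 and 17 concrete, the sketch below shows one conventional way of placing two mono streams at two spatial audio positions: each stream is convolved with the left and right impulse responses associated with its position, and the results are mixed into two output channels for headphones. The array shapes, the dictionary keyed by position label, and the peak normalisation are illustrative assumptions for this sketch rather than the patent's implementation.

import numpy as np
from scipy.signal import fftconvolve

def render_two_streams(voice, media, brir_set, voice_pos, media_pos):
    """Binaurally render a voice stream and a media stream at two selected
    spatial audio positions and mix them into a two-channel output.

    brir_set maps a position label to a pair (left_ir, right_ir) of
    one-dimensional impulse responses for that position.
    """
    rendered = []
    for signal, position in ((voice, voice_pos), (media, media_pos)):
        left_ir, right_ir = brir_set[position]
        left = fftconvolve(signal, left_ir)
        right = fftconvolve(signal, right_ir)
        rendered.append(np.stack([left, right]))

    # Pad both rendered streams to a common length and sum them.
    length = max(r.shape[1] for r in rendered)
    mix = sum(np.pad(r, ((0, 0), (0, length - r.shape[1]))) for r in rendered)
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix  # shape (2, length), for headphones

# Example with toy signals and impulse responses.
rng = np.random.default_rng(0)
brirs = {
    "foreground": (rng.standard_normal(256), rng.standard_normal(256)),
    "background": (0.3 * rng.standard_normal(256), 0.3 * rng.standard_normal(256)),
}
out = render_two_streams(rng.standard_normal(48000), rng.standard_normal(48000),
                         brirs, "foreground", "background")
print(out.shape)  # (2, 48255)

At the start of a telephone call the voice stream would be rendered with the foreground impulse-response pair and the media stream with the background pair; on termination of the call the media stream would again be rendered with the foreground pair.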
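
The monitoring behaviour in claims 1, 6, 7, and 10 — placing the call in the foreground when it arrives, demoting it when the listener signals lower priority, and restoring the media stream when the call ends — can be summarised as a small state machine. The position labels and method names in this sketch are invented for illustration only and are not drawn from the patent.

class StreamPositioner:
    """Tracks which spatial audio position each active stream is rendered
    at and repositions streams on voice-communication events."""

    FOREGROUND = "foreground"
    BACKGROUND = "background"

    def __init__(self):
        # Before any call, the media stream occupies the foreground position.
        self.positions = {"media": self.FOREGROUND}

    def on_call_start(self):
        # Incoming telephone call: the voice stream takes the first
        # (foreground) position and the media stream moves to the second.
        self.positions["voice"] = self.FOREGROUND
        self.positions["media"] = self.BACKGROUND

    def on_priority_lowered(self):
        # Listener control signal: demote the call, restore the media stream.
        self.positions["voice"] = self.BACKGROUND
        self.positions["media"] = self.FOREGROUND

    def on_call_end(self):
        # Call terminated: drop the voice stream and return the media
        # stream to the first spatial audio position.
        self.positions.pop("voice", None)
        self.positions["media"] = self.FOREGROUND


positioner = StreamPositioner()
positioner.on_call_start()
print(positioner.positions)  # {'media': 'background', 'voice': 'foreground'}
positioner.on_call_end()
print(positioner.positions)  # {'media': 'foreground'}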
TW108142945A 2018-12-07 2019-11-26 Devices and methods for spatial repositioning of multiple audio streams TWI808277B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16/213,979 2018-12-07
US16/213,979 US10966046B2 (en) 2018-12-07 2018-12-07 Spatial repositioning of multiple audio streams

Publications (2)

Publication Number Publication Date
TW202028929A (en) 2020-08-01
TWI808277B TWI808277B (en) 2023-07-11

Family

ID=68732857

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108142945A TWI808277B (en) 2018-12-07 2019-11-26 Devices and methods for spatial repositioning of multiple audio streams

Country Status (7)

Country Link
US (1) US10966046B2 (en)
EP (1) EP3664477A1 (en)
JP (1) JP2020108143A (en)
KR (1) KR20200070110A (en)
CN (1) CN111294724B (en)
SG (1) SG10201911051PA (en)
TW (1) TWI808277B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG10201800147XA (en) 2018-01-05 2019-08-27 Creative Tech Ltd A system and a processing method for customizing audio experience
US10390171B2 (en) 2018-01-07 2019-08-20 Creative Technology Ltd Method for generating customized spatial audio with head tracking
US11418903B2 (en) 2018-12-07 2022-08-16 Creative Technology Ltd Spatial repositioning of multiple audio streams
WO2022108494A1 (en) * 2020-11-17 2022-05-27 Dirac Research Ab Improved modeling and/or determination of binaural room impulse responses for audio applications
CN118044231A * 2021-10-06 2024-05-14 Sony Group Corporation Information processing apparatus and data structure
US11871208B2 (en) * 2022-01-14 2024-01-09 Verizon Patent And Licensing Inc. Methods and systems for spatial rendering of multi-user voice communication
CN114696961B * 2022-05-23 2022-11-15 Honor Device Co., Ltd. Multimedia data transmission method and equipment

Family Cites Families (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US984946A (en) * 1909-06-16 1911-02-21 Watson Simpson Lennon Car-coupling.
US6996244B1 (en) 1998-08-06 2006-02-07 Vulcan Patents Llc Estimation of head-related transfer functions for spatial sound representative
GB0419346D0 (en) 2004-09-01 2004-09-29 Smyth Stephen M F Method and apparatus for improved headphone virtualisation
US7756281B2 (en) 2006-05-20 2010-07-13 Personics Holdings Inc. Method of modifying audio content
US7555354B2 (en) 2006-10-20 2009-06-30 Creative Technology Ltd Method and apparatus for spatial reformatting of multi-channel audio content
US8078188B2 (en) * 2007-01-16 2011-12-13 Qualcomm Incorporated User selectable audio mixing
EP2405670B1 (en) * 2010-07-08 2012-09-12 Harman Becker Automotive Systems GmbH Vehicle audio system with headrest incorporated loudspeakers
US20120183161A1 (en) 2010-09-03 2012-07-19 Sony Ericsson Mobile Communications Ab Determining individualized head-related transfer functions
TWI573131B * 2011-03-16 2017-03-01 DTS, Inc. Methods for encoding or decoding an audio soundtrack, audio encoding processor, and audio decoding processor
US9030545B2 (en) 2011-12-30 2015-05-12 GNR Resound A/S Systems and methods for determining head related transfer functions
WO2013149645A1 (en) 2012-04-02 2013-10-10 Phonak Ag Method for estimating the shape of an individual ear
CN104010265A * 2013-02-22 2014-08-27 Dolby Laboratories Licensing Corporation Audio space rendering device and method
EP2869599B1 (en) * 2013-11-05 2020-10-21 Oticon A/s A binaural hearing assistance system comprising a database of head related transfer functions
US10382880B2 (en) * 2014-01-03 2019-08-13 Dolby Laboratories Licensing Corporation Methods and systems for designing and applying numerically optimized binaural room impulse responses
DE102014214143B4 (en) * 2014-03-14 2015-12-31 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for processing a signal in the frequency domain
US9900722B2 (en) 2014-04-29 2018-02-20 Microsoft Technology Licensing, Llc HRTF personalization based on anthropometric features
DE102014210215A1 (en) * 2014-05-28 2015-12-03 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Identification and use of hearing room optimized transfer functions
US9226090B1 (en) * 2014-06-23 2015-12-29 Glen A. Norris Sound localization for an electronic call
KR101627652B1 2015-01-30 2016-06-07 Gaudi Audio Lab, Inc. An apparatus and a method for processing audio signal to perform binaural rendering
US9544706B1 (en) 2015-03-23 2017-01-10 Amazon Technologies, Inc. Customized head-related transfer functions
JP6754619B2 2015-06-24 2020-09-16 Samsung Electronics Co., Ltd. Face recognition method and device
CN107924571A (en) 2015-08-14 2018-04-17 汤姆逊许可公司 Three-dimensional reconstruction is carried out to human ear from a cloud
FR3040807B1 (en) 2015-09-07 2022-10-14 3D Sound Labs METHOD AND SYSTEM FOR DEVELOPING A TRANSFER FUNCTION RELATING TO THE HEAD ADAPTED TO AN INDIVIDUAL
ES2883874T3 (en) 2015-10-26 2021-12-09 Fraunhofer Ges Forschung Apparatus and method for generating a filtered audio signal by performing elevation rendering
SG10201510822YA (en) 2015-12-31 2017-07-28 Creative Tech Ltd A method for generating a customized/personalized head related transfer function
SG10201800147XA (en) 2018-01-05 2019-08-27 Creative Tech Ltd A system and a processing method for customizing audio experience
US9774979B1 (en) 2016-03-03 2017-09-26 Google Inc. Systems and methods for spatial audio adjustment
FR3051951B1 (en) 2016-05-27 2018-06-15 Mimi Hearing Technologies GmbH METHOD FOR PRODUCING A DEFORMABLE MODEL IN THREE DIMENSIONS OF AN ELEMENT, AND SYSTEM THEREOF
US9584946B1 (en) * 2016-06-10 2017-02-28 Philip Scott Lyren Audio diarization system that segments audio input
US10187740B2 (en) 2016-09-23 2019-01-22 Apple Inc. Producing headphone driver signals in a digital audio signal processing binaural rendering environment
US10219095B2 (en) * 2017-05-24 2019-02-26 Glen A. Norris User experience localizing binaural sound during a telephone call
US10390171B2 (en) 2018-01-07 2019-08-20 Creative Technology Ltd Method for generating customized spatial audio with head tracking

Also Published As

Publication number Publication date
KR20200070110A (en) 2020-06-17
SG10201911051PA (en) 2020-07-29
US10966046B2 (en) 2021-03-30
US20200186954A1 (en) 2020-06-11
CN111294724B (en) 2023-08-15
TWI808277B (en) 2023-07-11
EP3664477A1 (en) 2020-06-10
JP2020108143A (en) 2020-07-09
CN111294724A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
TWI808277B (en) Devices and methods for spatial repositioning of multiple audio streams
KR102574082B1 (en) Method for generating customized spatial audio with head tracking
US9131305B2 (en) Configurable three-dimensional sound system
US10257630B2 (en) Computer program and method of determining a personalized head-related transfer function and interaural time difference function
US11418903B2 (en) Spatial repositioning of multiple audio streams
CN112005559B (en) Method for improving positioning of surround sound
US20230254644A1 (en) Audio apparatus and method of audio processing
CN113196805B (en) Method for obtaining and reproducing a binaural recording
US20190394596A1 (en) Transaural synthesis method for sound spatialization
US10142760B1 (en) Audio processing mechanism with personalized frequency response filter and personalized head-related transfer function (HRTF)
US20230283978A1 (en) Playing Binaural Sound Clips During an Electronic Communication
US10419870B1 (en) Applying audio technologies for the interactive gaming environment
Geronazzo et al. Acoustic selfies for extraction of external ear features in mobile audio augmented reality
GB2581785A (en) Transfer function dataset generation system and method
Tan Binaural recording methods with analysis on inter-aural time, level, and phase differences
Dodds et al. Full Reviewed Paper at ICSA 2019
dos Santos et al. 3-D Audio Synthesis: A DIY Approach for HRIR Database Acquisition