BR102019004378A2

BR102019004378A2 - METHOD FOR TRANSFORMING STEREO AUDIO TO 3D AUDIO WITH INTELLIGENT MOVEMENT OF THE HUMAN VOICE ACCORDING TO ITS INTERPRETATION

Info

Publication number: BR102019004378A2
Application number: BR102019004378-4A
Authority: BR
Inventors: Pablo Forvença Coitino
Original assignee: Pablo Fervença Coitino
Priority date: 2019-03-05
Filing date: 2019-03-05
Publication date: 2020-10-06

Abstract

The present patent relates to a method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation. The result can be reproduced on an ambisonics system, which allows user interaction in virtual reality systems, and/or in binaural.

Description

METHOD FOR TRANSFORMING STEREO AUDIO TO 3D AUDIO WITH INTELLIGENT MOVEMENT OF THE HUMAN VOICE ACCORDING TO ITS INTERPRETATION

[001] The present patent relates to a method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation. With the arrival of smartphones supplied with earphones, listening to music through headphones has become popular. As a result, binaural three-dimensional immersive audio has become very attractive to these users, since it is a highly accessible, compact form of virtual reality.

[002] It is known that, in the coding of three-dimensional immersive audio, also known as spatial audio, 360 audio or 3D audio, a multi-channel audio signal is processed so that the signals to be reproduced on different audio channels differ from one another, giving listeners the impression of a spatial effect around the audio source. The spatial effect can be created by recording the audio directly in formats suitable for binaural or multi-channel reproduction, or it can be created artificially in any multi-channel audio signal, which is known as spatialization.

[003] It is public knowledge that, to improve headphone reproduction, artificial spatialization can be performed with HRTF (Head-Related Transfer Function) filters, which produce binaural signals for the listener's right and left ears. The sound source signals are filtered with filters derived from the HRTFs corresponding to their direction of origin. An HRTF is the transfer function measured from a sound source to the ear of a human or artificial head, divided by the transfer function to a microphone that replaces the head and is located at the position of the middle of the head. An artificial room effect (for example, early reflections and/or late reverberation) can be added to the spatialized signals in order to improve the externalization and naturalness of the source.

[004] With the growth and popularization of interactive and audio listening devices, compatibility becomes more important. Among spatial audio formats, compatibility is pursued through downmixing and upmixing techniques. In general, algorithms for converting a multi-channel audio signal to stereo format are well known.

[005] However, in this type of processing the spatial image of the original multi-channel audio signal is not fully reproduced. The best way to convert a multi-channel audio signal for headphone listening is to replace the original loudspeakers with virtual loudspeakers, using HRTF filters, and to reproduce the loudspeaker channel signals through them.

[006] This process for generating a binaural signal has the disadvantage that the multi-channel mix must always be produced first: the multiple channels are first decoded and synthesized, and the HRTFs are then applied to each signal to form a binaural signal.

[007] Currently there are only 2 ways to produce binaural audio with movement. The first is to perform the movement physically, using microphones, which may be binaural or an ambisonics system. The second is digital mixing. In the second case, the producer takes separate audio tracks, usually mono, and uses a digital processor to emulate that sound in a three-dimensional space.

[008] Today, to produce binaural three-dimensional music after recording with traditional microphones, with the human voice in three-dimensional movement, the producer needs all the elements of the music as separate tracks. The producer then spatializes element by element, placing each instrument at a given position using digital tools that emulate a physical environment, and remixes all the instruments before decoding to binaural for the human ears. All the properties of the original mix of the music are thereby lost, and the hours of work required for each phonogram make it unfeasible to popularize the immersive effect.

[009] In view of these deficiencies, and with the aim of creating an innovative form of complete audio immersion, obtained through a simple, efficient and low-cost process, the method for transforming stereo audio into 3D audio with intelligent movement of the human voice, the object of this patent, was developed.

[010] The object of this patent is an automatic method for transforming stereo audio that contains the closed mix (with all the elements of the music mixed into 2 channels, Left and Right) into 3D audio in real time. All the same elements of the music are perceived, but with the human voice spatialized and actively moving around the listener: it can approach or move away and travel in any direction (front, back, left, right, and above or below the head), continuously, according to the desired setting, automatically and in very close to real time.

[011] On any digital or physical platform, the system is set up as follows, and its output can be played on a physical system with multiple loudspeakers or decoded for headphones.

[012] The system starts with a reflection filter, which attenuates the reverberation present in the stereo signal. Next, the M/S (Mid/Side) technique developed by Blumlein is used, but in a different way. This technique is known to use one microphone as the Mid (M) component, originally a cardioid, with its axis pointed at the center of the sound source, and a Side (S) microphone, a bidirectional microphone oriented sideways (90 degrees from the source) so that its null pickup axis is aligned with the axis of greatest sensitivity of the Mid microphone. The Side microphone is positioned vertically coincident with the Mid microphone, to minimize any phase differences in the arrival time of the signals at the two microphones. However, instead of positioning a single sound source as explained above, 2 sound sources are introduced, one in front of and one behind the bidirectional microphone, at an average distance of 3 meters.

[013] These 2 signals are the L channel and the R channel (Left/Right) of the stereo signal, played simultaneously. To reproduce the same technique digitally, a B-format ambisonics spatial processor can be used. Because this is a spatial capture in polar format (a spherical environment), it is configured with the AZIMUTH parameter at 90 degrees for the R (Right) audio signal and -90 degrees for the L (Left) audio signal, with the distance parameter set to 3 meters from the virtual ambisonics microphone. The distance serves to generate room ambience whenever a virtual environment is present in the system, and can be adjusted according to the desired result.

[014] As a result there are either 2 mono signals from the conventional M/S technique, using 2 microphones as described above, giving one Mid (M) signal and one Side (S) signal, or 4 mono signals from the ambisonics system, captured by a virtual Soundfield microphone and converted to B-format. In the latter case only the first 2 signals are used: W (omnidirectional), analogous to the Mid (M) signal, and Y (figure-of-eight), analogous to the Side (S) signal.
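The W and Y components described above can be illustrated with a minimal sketch. It assumes the common first-order encoding gains (W gain of 1, Y gain of sin(azimuth) at zero elevation) and the sign convention in which the patent's +90 degrees for R gives a positive Y gain. The function name `encode_wy` is hypothetical, and the 3 meter distance (room ambience) is not modeled.

```python
import math

def encode_wy(left, right, az_left_deg=-90.0, az_right_deg=90.0):
    """Encode two mono sources into first-order W and Y components.

    Assumed gains (not stated in the patent): W = 1, Y = sin(azimuth),
    elevation = 0. Distance and room ambience are ignored here.
    """
    gain_l = math.sin(math.radians(az_left_deg))   # -1 at -90 degrees
    gain_r = math.sin(math.radians(az_right_deg))  # +1 at +90 degrees
    w = [l + r for l, r in zip(left, right)]                     # Mid-like sum
    y = [gain_l * l + gain_r * r for l, r in zip(left, right)]   # Side-like difference
    return w, y
```

With these gains W reduces to L + R and Y to R - L, which is why the text treats W as analogous to Mid and Y as analogous to Side.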

[015] The resulting SIDE audio passes through a high-pass filter at around 256 Hz, which can be adjusted up or down as desired.

[016] The resulting MID audio is duplicated. The two copies are called MID1 and MID2.

[017] The MID1 audio passes through a low-pass filter at 256 Hz.
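The 256 Hz band split applied to SIDE, MID1 and (per paragraph [018]) MID2 could be sketched as follows. The patent does not specify the filter type or order, so this sketch assumes second-order Butterworth-style biquads with the widely used Audio EQ Cookbook coefficient formulas. The names `biquad_coeffs`, `biquad` and `split_bands` are hypothetical.

```python
import math

def biquad_coeffs(kind, cutoff_hz, sample_rate, q=1 / math.sqrt(2)):
    """Return normalized (b0, b1, b2, a1, a2) for a low-pass or high-pass biquad."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    cw = math.cos(w0)
    if kind == "lowpass":
        b0, b1, b2 = (1 - cw) / 2, 1 - cw, (1 - cw) / 2
    elif kind == "highpass":
        b0, b1, b2 = (1 + cw) / 2, -(1 + cw), (1 + cw) / 2
    else:
        raise ValueError(f"unknown filter kind: {kind}")
    a0, a1, a2 = 1 + alpha, -2 * cw, 1 - alpha
    return (b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0)

def biquad(signal, coeffs):
    """Run a direct-form I biquad over a list of samples."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def split_bands(mid, side, sample_rate, cutoff_hz=256.0):
    """Return (MID1 low-passed, MID2 high-passed, SIDE high-passed)."""
    lp = biquad_coeffs("lowpass", cutoff_hz, sample_rate)
    hp = biquad_coeffs("highpass", cutoff_hz, sample_rate)
    return biquad(mid, lp), biquad(mid, hp), biquad(side, hp)
```

The cutoff is left as a parameter because the text allows it to be adjusted up or down to suit the material.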

[018] The MID2 audio passes through a high-pass filter at around 256 Hz, or adjusted to suit the music, then through a reflection filter, and is spatialized through the ambisonics system. From there it can be sent to a binaural decoder with active user interactivity, converted directly to binaural, or captured through an already binaural microphone, which offers no active user interactivity. An algorithm automatically moves the AZIMUTH axis from -180 degrees to 180 degrees at a constant rate of 0.0250 Hz in the positive direction, with a negative activation controller (sidechain) triggered according to the energy (power) of the voice in the frequency range from 473 Hz to 1622 Hz. The algorithm also controls the distance parameter randomly at a rate of 0.0833 Hz for the movement around the listener, with the option of driving the elevation axis linearly as the listener wishes.
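The azimuth automation can be sketched as a simple control track. This is an illustration under stated assumptions, not the patented implementation: a rate of 0.025 Hz is read as one full 360 degree rotation every 40 seconds, and the sidechain is reduced to a per-step boolean that reverses the rotation at half speed (the "2x slower" of the claims). The names `wrap_degrees` and `azimuth_track` are hypothetical, and the random distance and elevation controls are not modeled.

```python
def wrap_degrees(angle):
    """Wrap an angle into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def azimuth_track(voice_loud, step_s, rate_hz=0.025, start_deg=-180.0):
    """Return one azimuth value (degrees) per control step.

    voice_loud: iterable of booleans, True while the voice-band energy
                exceeds the preset threshold (sidechain active).
    step_s:     duration of one control step in seconds.
    """
    forward = 360.0 * rate_hz   # degrees per second, positive direction
    backward = -forward / 2.0   # reversed, at half the speed
    azimuth = start_deg
    track = []
    for loud in voice_loud:
        track.append(azimuth)
        speed = backward if loud else forward
        azimuth = wrap_degrees(azimuth + speed * step_s)
    return track
```

For example, `azimuth_track([False, True, True], step_s=1.0)` advances by 9 degrees on the quiet step and then moves back by 4.5 degrees per step while the voice is loud.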

[019] The final mix then consists of the following audio signals: the stereo SIDE signal, mixed as suggested in the Blumlein technique [(+S) to the left and (-S) to the right], with a signal limiter; plus the MID1 signal, with a signal limiter; plus the MID2 signal from the ambisonics system, which can either continue into a separate ambisonics system, physical or interactive with the user on a smartphone and making use of the free movement that such a system allows, or be converted directly to a binaural signal, also with a signal limiter.
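The summing described in this paragraph might look like the sketch below for the binaural case. The limiter is an assumption: the patent only says "signal limiter", so a hard clamp stands in for what would normally be a dynamic gain-reduction stage. The names `limit` and `final_mix` are hypothetical, and `mid2_left` / `mid2_right` stand for the already binauralized MID2 signal.

```python
def limit(sample, ceiling=1.0):
    """Stand-in limiter: clamp one sample to the range [-ceiling, +ceiling]."""
    return max(-ceiling, min(ceiling, sample))

def final_mix(side, mid1, mid2_left, mid2_right, ceiling=1.0):
    """Sum limited SIDE (+S left, -S right), MID1 and binaural MID2 into stereo."""
    left, right = [], []
    for s, m1, bl, br in zip(side, mid1, mid2_left, mid2_right):
        left.append(limit(s, ceiling) + limit(m1, ceiling) + limit(bl, ceiling))
        right.append(limit(-s, ceiling) + limit(m1, ceiling) + limit(br, ceiling))
    return left, right
```

Each of the three components is limited separately before summing, following the order in which the paragraph lists them.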

[020] Most stereo music mixes are, by default, recorded using the Blumlein technique, with the voice, the bass and the kick drum positioned in the center. This microphone technique, called MID/SIDE, was discovered and patented by the British engineer Alan Blumlein in 1934 (US Patent 2,218,902, of June 5, 1937); in it, 2 microphones are positioned to give more realism to the sound (one bidirectional and the other cardioid or omnidirectional, generating a phase correlation). Through a MID/SIDE matrix it is possible to hear separately the sounds at the center of the stereo image and the sounds at the sides.

[021] In this way we can hear all the sounds that are in the center; in most cases the voice, the bass and the drums are in the center, while the other instruments and backing vocals stand out at the sides.
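The MID/SIDE matrix mentioned here is conventionally a sum and difference of the two stereo channels. A minimal sketch follows, assuming the common 0.5 scaling on the encode side (other scalings are also in use). The function names are hypothetical.

```python
def ms_encode(left, right):
    """Split stereo into Mid (center content) and Side (lateral content)."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Rebuild stereo: Left = M + S, Right = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

Content that is identical in both channels, such as a centered voice, ends up entirely in Mid, while content that differs between the channels appears in Side.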

[022] This is not limited to music: it can be used for anything that contains a voice, such as an audiobook.

[023] The low-frequency signal is allowed to flow straight to the listener, as is the signal from the sides of the music, which provides the stereo effect. Only the voice region is automated. It can be extracted with a simple high-pass filter, or more complex machine learning tools can be used for a more detailed separation of the voice. The key that steers the movement is the volume of the voice within a specific frequency range of the human voice. When a high volume is detected in this frequency range, the Azimuth rotation, which until then ran automatically and constantly in a circle from one side to the other, stops and reverses at a slower speed, giving each phonogram a unique movement according to the performer's interpretation.
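One way to derive the "high volume in the voice range" trigger is to measure the power of a short block between 473 Hz and 1622 Hz and compare it with a preset threshold. The sketch below assumes a plain DFT over the block for clarity rather than speed; the patent does not say how the energy is measured, and the threshold value is left to the caller. The names `band_energy` and `voice_is_loud` are hypothetical.

```python
import cmath
import math

def band_energy(block, sample_rate, lo_hz=473.0, hi_hz=1622.0):
    """Mean power of `block` contributed by DFT bins inside [lo_hz, hi_hz]."""
    n = len(block)
    total = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if lo_hz <= freq <= hi_hz:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(block))
            total += abs(coeff) ** 2
    # Factor 2 accounts for the mirrored negative-frequency bins.
    return 2.0 * total / (n * n)

def voice_is_loud(block, sample_rate, threshold):
    """True when the voice-band power of the block exceeds the threshold."""
    return band_energy(block, sample_rate) > threshold
```

The resulting boolean, evaluated once per block, is the kind of signal that could drive the reversal of the Azimuth rotation.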

[024] Description of the figures

[025] These figures are merely illustrative and may vary, provided they do not depart from what was initially claimed.

[026] In this case we have:

[027] FIGURE 1/5 shows the model patented by the British engineer Alan Blumlein in 1934 (US Patent 2,218,902, of June 5, 1937), in which 2 microphones are positioned to give more realism to the sound;

[028] FIGURE 2/5 shows the same model patented by the British engineer Alan Blumlein in 1934 (US Patent 2,218,902, of June 5, 1937), in which 2 microphones are positioned to give more realism to the sound, but drawn with real microphones to simplify the diagram of Figure 1/5;

[029] FIGURE 3/5 highlights the diagram of how the stereo sound is applied, capturing the M/S signal together with the reflections of the environment, the object of this patent;

[030] FIGURE 4/5 highlights the Azimuth, Elevation and distance axes that are automated within the ambisonics system in B-format (ambiX);

[031] FIGURE 5/5 presents the block diagram of the patent's system.

[032] The field of application of the method for transforming stereo audio into 3D audio with intelligent movement of the speaker's voice according to its interpretation, the object of this patent, is immense: it can be used in cinemas, theaters, films, automobiles, apps and software, among other applications that will only be discovered once it is put into operation.

Claims

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by transforming stereo audio into three-dimensional audio for ambisonics or binaural systems.

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by giving movement to the speaker, making the speaker part of the physical environment in which the listener is located, and moving the voice according to its expression and dynamics.

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by filtering out the strong reverberation of the stereo media, using the impulse response to calculate an inverse filter that cancels the effect of the early reflections.

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by modifying the M/S technique, using 2 mono sound sources (L and R from the stereo signal) simultaneously, at an average distance of 3 meters in front of and behind the bidirectional microphone, in order to add ambience to the signal.

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by duplicating the Mid (M) signal, using one copy, Mid1, with a low-pass filter configured at around 256 Hz, and the other copy, Mid2, with a high-pass filter configured at 256 Hz.

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by applying the reflection filter to the Mid2 signal and moving it around the spatial microphone.

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by using 3 variables contained in the sphere of the system's pickup field (rotation around the listener, from left to right and vice versa, in a positive waveform, in the variable called AZIMUTH).

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by reversing the rotation relative to the constant rotation (AZIMUTH), in a negative waveform, triggered only when the voice (identified by a specific frequency range of Mid2) exceeds a preset volume, and at half the speed of the constant rotation around the listener.

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by offering 2 experience options to the user: the Mid2 signal can be delivered to an ambisonics reproduction system or decoded to a binaural stereo signal.

Method for transforming stereo audio into 3D audio with intelligent movement of the human voice according to its interpretation, characterized by mixing the SIDE and MID1 (low-pass filtered) signals to recompose the stereo signal, plus the Mid2 3D signal, which can be played through an ambisonics system or already converted to binaural.