CN114598985B - Audio processing method and device - Google Patents


Info

Publication number: CN114598985B
Application number: CN202210216580.2A
Authority: CN (China)
Prior art keywords: target object, audio, head-related transfer function
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN114598985A (en)
Inventor: 黎镭
Current Assignee: Anker Innovations Co Ltd (the listed assignees may be inaccurate)
Original Assignee: Anker Innovations Co Ltd

Events:
Application filed by Anker Innovations Co Ltd
Priority to CN202210216580.2A
Publication of CN114598985A
Application granted
Publication of CN114598985B
Status: Active

Classifications

    • H04S7/302: Electronic adaptation of stereophonic sound system to listener position or orientation (under H04S7/30, control circuits for electronic adaptation of the sound field)
    • H04S1/00: Two-channel systems
    • H04S7/308: Electronic adaptation dependent on speaker or headphone connection
    • H04S2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S2420/01: Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTFs] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The present application relates to the field of audio processing and provides an audio processing method and apparatus. The audio processing method includes: acquiring binaural (two-channel) audio data; inputting the binaural audio data into a pre-trained music source separation model and outputting a single-channel signal for each of the musical instruments contained in the binaural audio data; determining the spatial position of each instrument; acquiring a head-related transfer function of a target object according to the user's personal characteristics and the instruments' spatial positions; and determining the audio signals received at the target object's two ears according to the head-related transfer function and the spatial positions of the single-channel signals. The application can turn the binaural audio on wearable devices such as headphones and glasses into spatial audio, giving a better user experience.

Description

Audio processing method and device
Technical Field
The present application relates to the field of audio processing, and in particular, to an audio processing method and apparatus.
Background
With the development of science and technology, spatial audio has attracted growing attention in recent years, and the spatial and directional information of sound matters more and more to listeners. At present, wearable devices such as headphones and glasses are prone to the in-head localization effect, which shrinks the sound field and the spatial orientation cues perceived by the user; that is, the prior art cannot turn the ordinary two-channel stereo played on such wearable devices into spatial audio, and the user experience is poor.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, the present application provides an audio processing method and apparatus.
In a first aspect, the present application provides an audio processing method, including:
acquiring binaural audio data;
inputting the binaural audio data into a pre-trained music source separation model and decomposing it to obtain at least two single-channel signals;
acquiring the spatial position of each of the at least two single-channel signals;
acquiring a head-related transfer function of a target object; and
determining the audio signals received at the target object's two ears according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
In a second aspect, the present application provides an audio processing apparatus comprising:
a data acquisition module, used for acquiring binaural audio data;
a signal output module, used for inputting the binaural audio data into a pre-trained music source separation model and decomposing it to obtain at least two single-channel signals;
a position acquisition module, used for acquiring the spatial position of each of the at least two single-channel signals;
a function acquisition module, used for acquiring a head-related transfer function of the target object; and
an audio processing module, used for determining the audio signals received at the target object's two ears according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
In a third aspect, a computer device is provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the audio processing method according to any one of the embodiments of the first aspect when executing a program stored on a memory.
In a fourth aspect, there is provided a wearable device comprising: a memory, a processor and an audio processing program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the audio processing method as in any of the embodiments of the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the audio processing method as in any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
According to the method provided by the embodiments of the application, binaural audio data is acquired and input into a pre-trained music source separation model, which decomposes it into at least two single-channel signals; the spatial position of each of the single-channel signals is then acquired, laying the foundation for subsequently generating spatial audio data. In this way, the binaural audio on wearable devices such as headphones and glasses can be turned into spatial audio, giving a better user experience.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in describing the embodiments or the prior art are briefly introduced below; it will be obvious to a person skilled in the art that other drawings can be derived from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of an audio processing method according to an embodiment of the present application;
Fig. 2 is a flowchart of an embodiment of the audio processing method according to an embodiment of the present application;
Fig. 3 is a flowchart of another embodiment of the audio processing method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms involved in the present application are explained as follows:
Head-related transfer function (HRTF): a function used in acoustic localization that describes the transmission of sound waves from a sound source to the two ears. It is the result of the combined filtering of sound waves by human physiological structures such as the head, pinnae, and torso.
With the development of science and technology, spatial audio has attracted growing attention in recent years, and the spatial and directional information of sound matters more and more to listeners. At present, wearable devices such as headphones and glasses are prone to the in-head localization effect, which shrinks the sound field and the spatial orientation cues perceived by the user; that is, the prior art cannot turn the ordinary two-channel stereo played on such wearable devices into spatial audio, and the user experience is poor. In order to solve the above problems, an embodiment of the present application provides an audio processing method.
Fig. 1 is a flow chart of an audio processing method according to an embodiment of the present application, as shown in fig. 1, where the audio processing method includes:
Step 101, obtaining binaural audio data.
In this embodiment, the binaural audio data is two-channel audio. Specifically, it may be a complete piece of two-channel music generated by a wearable device (such as headphones or glasses) in music mode, or a segment of two-channel music; if distinguished by quality, it may be either high-quality or low-quality two-channel music data.
Step 102, inputting the binaural audio data into a pre-trained music source separation model, and decomposing the binaural audio data to obtain at least two single-channel signals.
As an embodiment, after the binaural audio data is input into the pre-trained music source separation model, the model may output one single-channel signal for each of the musical instruments present in the binaural audio data, the instruments corresponding one-to-one with the single-channel signals.
In this embodiment, the music source separation model may be a deep learning model (including but not limited to Wave-U-Net, Conv-TasNet, MMDenseLSTM, and the like), and the training process is as follows:
obtaining sample data, wherein the sample data comprises: a plurality of binaural audio data, and a single channel signal of a plurality of musical instruments corresponding to the binaural audio data;
And training by using sample data through a deep learning model to obtain a music source separation model.
As an embodiment, a first data set may be acquired over a network, the first data set comprising: a plurality of binaural audio data and the single-channel signals of the instruments corresponding to each. A music source separation model is trained on this data set with a deep learning model. In use, binaural audio data is input into the trained music source separation model, which outputs the corresponding single-channel signals.
The instruments may include: electronic organ, guitar, vocals, drums, bass, and the like.
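The separation step above fixes an input/output contract: two-channel audio in, one mono stem per instrument out, instruments and stems in one-to-one correspondence. The following Python sketch illustrates only that contract; `toy_model` and the instrument names are illustrative stand-ins, not the patent's actual model (which would be a trained network such as Wave-U-Net or Conv-TasNet).

```python
import numpy as np

def separate_sources(stereo, model):
    """Feed two-channel audio (shape (2, n)) to a separation model and
    return a dict mapping instrument name -> mono stem (shape (n,))."""
    assert stereo.ndim == 2 and stereo.shape[0] == 2
    return model(stereo)

# Trivial stand-in for a trained separation model: it downmixes the
# input and attributes half of it to each of two "instruments".
def toy_model(stereo):
    mono = stereo.mean(axis=0)
    return {"vocals": 0.5 * mono, "drums": 0.5 * mono}

stereo = np.random.default_rng(0).normal(size=(2, 44100))  # 1 s of stereo
stems = separate_sources(stereo, toy_model)
```

A real model would be trained on pairs of two-channel mixes and per-instrument stems, as in the first-data-set description above.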
Step 103, acquiring the spatial position of each single-channel signal in the at least two single-channel signals.
Based on the embodiment of step 102, it should be appreciated that, since the instruments correspond one-to-one with the single-channel signals, the spatial position of an instrument is the spatial position of its single-channel signal.
As for acquiring the spatial positions of the instruments, as an embodiment, a second data set may be collected in advance over the network. The second data set contains scene information; for example, in a concert scene, with the user at the centre of the sphere, the instruments may be placed relative to the user as follows: electronic organ (front left), guitar (centre), vocals (centre), drums (rear centre), bass (front right). On this basis, a preset neural network model is trained with scene information such as concert halls and concerts, together with the corresponding placement positions of the instruments, yielding a trained instrument placement model. In use there are two methods. First, the scene information of the user's current scene, or preset scene information, is acquired and input into the trained instrument placement model, which outputs the placement positions of the instruments. Second, the user may customize the spatial placement of the instruments according to preference or experience; the instrument placement model is then trained with the user-preset placement positions, and the final placements are selected according to the user.
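As a concrete illustration of the first method (scene-driven placement), the lookup below stands in for the trained instrument placement model, with a user override standing in for the second method. The scene name, instrument names, and angles are assumptions for illustration, not values from the patent.

```python
# Hypothetical scene -> instrument-azimuth table; angles in degrees,
# listener at the sphere centre (0 = front, negative = left, 180 = rear).
SCENE_LAYOUTS = {
    "concert": {
        "organ": -45.0,   # front left
        "guitar": 0.0,    # centre
        "vocals": 0.0,    # centre
        "drums": 180.0,   # rear centre
        "bass": 45.0,     # front right
    },
}

def instrument_positions(scene, user_override=None):
    """Return per-instrument azimuths for a scene; a user-defined layout,
    if given, takes precedence (the second method described above)."""
    layout = dict(SCENE_LAYOUTS[scene])
    if user_override:
        layout.update(user_override)
    return layout

pos = instrument_positions("concert", user_override={"bass": 90.0})
```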
Step 104, acquiring a head related transfer function of the target object.
In this embodiment, the target object may be a user.
The head-related transfer function of the target object may be acquired either from field measurement data or by matching against the head-related transfer functions of existing objects.
For example, the head-related transfer function of the target object may be obtained by one of the following methods:
1. Match a head HRTF from public data.
For example, the user may sort through data already published on the network, make a subjective judgment, and choose the head HRTF the user deems most suitable; the head-related transfer function in this case is determined from the data directly selected by the user.
2. First cluster the head HRTFs of many people in a database, then use the user's personalized pinna characteristics to select the best-matching cluster of HRTFs from the public database as the final function.
Specifically, a plurality of pre-stored head-related transfer functions may be clustered, and a head-related transfer function matching the target object may be selected from the clusters according to the pinna characteristics of the target object's ears.
3. Measure the user's HRTF directly in an anechoic chamber. That is, the user's HRTF is measured according to the user's personal characteristics (height, pinna information, etc.) and the spatial placement of the instruments in the anechoic chamber.
4. Measure the user's HRTF directly in a concert hall. That is, the user's HRTF is measured according to the user's personal characteristics (height, pinna information, etc.) and the spatial placement of the instruments in the concert hall.
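Method 2 above (cluster stored HRTFs, then match by pinna features) can be sketched as follows. The feature vectors are synthetic, and the minimal k-means is a stand-in for whatever clustering a real system would use; nothing here comes from the patent itself.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means for clustering a bank of stored HRTF feature
    vectors. Returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - C[None], axis=2)  # point-centroid dists
        a = d.argmin(axis=1)                              # nearest centroid
        for j in range(k):
            if (a == j).any():
                C[j] = X[a == j].mean(axis=0)
    return C, a

def match_hrtf(pinna_features, centroids):
    """Pick the cluster whose centroid is nearest to the listener's
    pinna feature vector (a stand-in for the matching step)."""
    return int(np.linalg.norm(centroids - pinna_features, axis=1).argmin())

# Two well-separated synthetic "HRTF feature" groups
rng = np.random.default_rng(1)
bank = np.vstack([rng.normal(5, 1, (20, 8)), rng.normal(-5, 1, (20, 8))])
cents, _ = kmeans(bank, k=2)
best = match_hrtf(np.full(8, 5.0), cents)
```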
Step 105, determining the audio signals received at the target object's two ears according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
As an embodiment, to accurately obtain the audio signals received at the target object's two ears, determining them from the head-related transfer function and the spatial positions of the at least two single-channel signals includes:
convolving the i-th single-channel signal in the time domain with the head-related transfer function according to that signal's spatial position to obtain the i-th processed audio signal, where i is a positive integer greater than or equal to 1; and
obtaining the audio signals received at the target object's two ears from the plurality of processed audio signals.
For example, if five kinds of instruments (electronic organ, guitar, vocals, drums, and bass) are separated from one piece of binaural music, then after the five instruments are placed in space they correspond to five directions relative to the user's left and right ears. The single-channel signal from each of the five directions is convolved with the corresponding left-ear head-related transfer function (equivalently, multiplied by it in the frequency domain), and the five results are summed to obtain the audio signal received at the target object's left ear. Similarly, the five single-channel signals are convolved with the corresponding right-ear head-related transfer functions and summed to obtain the audio signal received at the right ear. The audio signals received at the target object's two ears are thus obtained.
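The per-direction convolution and summation just described can be sketched as follows. The pure delay-and-gain "HRIRs" are toy stand-ins for measured head-related impulse responses; real responses would come from the HRTF acquisition step above.

```python
import numpy as np

def render_binaural(stems, hrirs):
    """Convolve each mono stem with its direction's left/right head-related
    impulse responses and sum, giving the signals at the two ears.
    stems: instrument -> mono signal; hrirs: instrument -> (h_left, h_right)."""
    L = R = None
    for name, x in stems.items():
        yl = np.convolve(x, hrirs[name][0])
        yr = np.convolve(x, hrirs[name][1])
        L = yl if L is None else L + yl
        R = yr if R is None else R + yr
    return L, R

# Toy HRIR: a pure delay plus gain standing in for a measured response.
def toy_hrir(delay, gain, n=8):
    h = np.zeros(n)
    h[delay] = gain
    return h

stems = {"guitar": np.ones(100), "bass": np.ones(100)}
hrirs = {"guitar": (toy_hrir(0, 1.0), toy_hrir(2, 0.7)),   # source on the left
         "bass":   (toy_hrir(2, 0.7), toy_hrir(0, 1.0))}   # source on the right
left, right = render_binaural(stems, hrirs)
```

With this symmetric toy layout the two ear signals come out identical, which is a quick sanity check on the summation.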
Further, considering that different instruments occupy different spatial positions, in order to ensure the quality of the audio signals received at the user's ears, summing the processed audio signals received at each of the target object's ears to obtain the binaural audio signals includes:
summing, according to the spatial position of each single-channel signal, the processed audio signals received at the left ear in a preset proportion, and summing the processed audio signals received at the right ear in the preset proportion, thereby obtaining the audio signals received at the target object's two ears.
In practice, the preset proportion may be determined by how far each instrument is from the user's ears: for example, when an instrument is farther from the user's left ear its proportion is set smaller, and when it is closer the proportion is set slightly larger.
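A minimal version of such a distance-dependent proportion, assuming a simple inverse-distance law (the patent does not specify the exact rule, so the law and names below are illustrative):

```python
def distance_gains(distances, ref=1.0):
    """Map instrument -> distance from an ear (arbitrary units) to
    instrument -> gain, so nearer instruments contribute more."""
    return {name: ref / max(d, ref) for name, d in distances.items()}

gains = distance_gains({"drums": 4.0, "vocals": 1.0})
```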
In order to make the immersive sound effect of the final audio signal more realistic, after the head-related transfer function of the target object is obtained, as shown in Fig. 2, the audio processing method further includes:
Step 201, acquiring the angle information of each single-channel signal relative to the head of the target object;
Step 202, fusing the angle information of each single-channel signal relative to the head of the target object with the head-related transfer function.
It should be understood that, for smart wearable devices such as headphones and glasses, the built-in gyroscope can track the motion of the user's head, yielding the angle of each single-channel signal (that is, of each instrument) relative to the head of the target object. Fusing this angle information with the head-related transfer function achieves real-time remapping, so that each sound source stays at a fixed position in space even when the user's head moves.
Specifically, the angle information of each single-channel signal relative to the head of the target object is contained in the Euler angles (yaw, pitch, and roll) provided by the gyroscope; the Euler angles can be obtained by quaternion conversion and fused with the dynamic HRTF in the geodetic coordinate system.
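The quaternion-to-Euler conversion mentioned above is standard; a minimal version for the ZYX (yaw-pitch-roll) convention, with the pitch term clamped against rounding error:

```python
import math

def quat_to_euler(w, x, y, z):
    """Convert a unit quaternion (e.g. from the headset gyroscope) to
    ZYX Euler angles (yaw, pitch, roll) in radians."""
    yaw = math.atan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z))
    pitch = math.asin(max(-1.0, min(1.0, 2 * (w * y - x * z))))
    roll = math.atan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y))
    return yaw, pitch, roll

# A 90-degree head turn about the vertical axis
s = math.sin(math.pi / 4)
yaw, pitch, roll = quat_to_euler(math.cos(math.pi / 4), 0.0, 0.0, s)
```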
Fig. 3 is a flowchart of another embodiment of the audio processing method according to an embodiment of the present application. In order to enhance the sense of space of the audio signals received at the target object's two ears, after those signals are obtained, as shown in Fig. 3, the audio processing method further includes:
Step 301, performing reverberation processing on the audio signals received at the target object's two ears.
It should be understood that adding reverberation mainly further increases the sense of space; the reverberation of various scenes such as a concert hall, a concert, or a recording studio may be added. Music performed in a concert hall sounds good largely because of the hall's architectural acoustics.
As an embodiment, take the binaural audio signals of the target object as the audio to be processed. On wearable devices with poor processing performance, adding reverberation to the audio to be processed takes a long time, and such devices can be considered not to support adding reverberation. Whether the current wearable device supports adding reverberation to the audio to be processed can therefore be determined from its performance parameters, according to preset processing logic or a preset performance-parameter table. The performance-parameter table may store the correspondence between a device's performance parameters and whether adding reverberation is supported.
In this embodiment, determining whether the current wearable device supports adding reverberation to the audio to be processed may also include: acquiring a device information set; and determining whether the current device supports adding reverberation to the audio to be processed according to the device information set.
The device information may be any information capable of identifying the wearable device, for example a device model or device name. As an example, the device information in the set may be that of wearable devices that do not support adding reverberation; it is then determined whether the current wearable device's information is in the set. If it is, the current device does not support adding reverberation to the audio to be processed; otherwise, it does. It will be appreciated that, in practice, the set may instead list the device information of wearable devices that do support adding reverberation.
If the current equipment is determined to support adding reverberation to the audio to be processed, a reverberation algorithm is determined according to the reverberation category selected by the user.
In a specific implementation, reverberation algorithms can be divided into different categories according to the environmental effect they simulate; for example, a hall effect, a studio effect, or a valley effect. For each reverberation category, category information (such as a name or picture) may be presented to the user, each item of category information being associated with the reverberation category it indicates, so that the user can select a category through an operation (e.g., a click) on the category information.
In this embodiment, determining a reverberation algorithm according to a reverberation category selected by a user includes: performing audio characteristic analysis on the audio to be processed to obtain audio characteristic data; and determining a reverberation algorithm according to the audio characteristic data and the reverberation category selected by the user.
As an example, the audio to be processed may be analyzed by some existing audio analysis application or open source toolkit. Wherein the audio characteristics include frequency, bandwidth, amplitude, etc. of the audio. On this basis, a reverberation algorithm (e.g., a 3D surround algorithm) may be determined based on the audio characteristic data and the user-selected reverberation category.
Specifically, a reverberation algorithm matching the obtained audio characteristic data and the user-selected reverberation category may be looked up in a pre-established correspondence table between (audio characteristic data, reverberation category) pairs and reverberation algorithms. The audio to be processed is then fed through at least one filter according to the determined reverberation algorithm, yielding the processed audio. The number and types of filters corresponding to each reverberation algorithm may be preset; each reverberation algorithm may be built as a combination of at least one filter, for example a comb filter combined with an all-pass filter. Depending on implementation requirements, a filter may be a hardware module or a software module in the current wearable device.
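The comb-plus-all-pass combination mentioned above is the classic Schroeder reverberator structure; below is a minimal, unoptimized sketch of one comb and one all-pass in series. The delay lengths and gains are illustrative, not values from the patent.

```python
import numpy as np

def feedback_comb(x, delay, g):
    """y[n] = x[n] + g * y[n - delay]"""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - delay] if n >= delay else 0.0)
    return y

def allpass(x, delay, g):
    """y[n] = -g * x[n] + x[n - delay] + g * y[n - delay]"""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def toy_reverb(x):
    """One comb plus one all-pass in series: a minimal stand-in for the
    filter combinations described above."""
    return allpass(feedback_comb(x, delay=113, g=0.7), delay=37, g=0.5)

impulse = np.zeros(1000)
impulse[0] = 1.0
tail = toy_reverb(impulse)  # impulse response with a decaying echo tail
```

A production reverberator would run several combs in parallel followed by all-passes, with delays tuned to the simulated room.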
As can be seen from the above, the audio processing method provided by the embodiments of the present application acquires binaural audio data, inputs it into a pre-trained music source separation model, and outputs a single-channel signal for each of the corresponding instruments; it then determines the spatial position of each single-channel signal and, from the acquired head-related transfer function of the target object and those spatial positions, converts the binaural audio data into spatial audio data and determines the audio signals received at the target object's two ears. In other words, the application can turn the binaural audio on wearable devices such as headphones and glasses into spatial audio, giving a good user experience.
Based on the same inventive concept, the embodiments of the present application also provide an audio processing apparatus, as follows. Since the principle by which the audio processing apparatus solves the problem is similar to that of the audio processing method, the implementation of the apparatus can refer to that of the method, and repeated descriptions are omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application, as shown in fig. 4, where the audio processing apparatus includes:
the data acquisition module 401 is configured to acquire binaural audio data.
The signal output module 402 is configured to input the binaural audio data into a pre-trained music source separation model, and decompose the binaural audio data to obtain at least two single-channel signals.
A position acquisition module 403, configured to acquire a spatial position of each of the at least two single-channel signals.
A function obtaining module 404, configured to obtain a head related transfer function of the target object.
An audio processing module 405, configured to determine an audio signal received by both ears of the target object according to the head related transfer function and the spatial positions of the at least two single channel signals.
Optionally, the audio processing module 405 is further configured to:
perform convolution processing on the i-th single-channel signal in the time domain according to the head-related transfer function and that signal's spatial position to obtain the i-th processed audio signal, where i is a positive integer greater than or equal to 1; and
obtain the audio signals received at the target object's two ears from the plurality of processed audio signals.
Optionally, the audio processing module 405 is further configured to:
sum, according to the spatial position of each single-channel signal, the processed audio signals received at the left ear in a preset proportion, and sum the processed audio signals received at the right ear in the preset proportion, thereby obtaining the audio signals received at the target object's two ears.
Optionally, the audio processing device further includes:
and the reverberation processing module is used for carrying out reverberation processing on the audio signals received by the double ears of the target object.
Optionally, the audio processing device further includes:
the angle information acquisition module is used for acquiring the angle information of each single-channel signal relative to the head of the target object;
and the fusion processing module is used for carrying out fusion processing on the angle information of each single-channel signal relative to the head of the target object and the head-related transfer function.
In the embodiment of the application, the head related transfer function of the target object is obtained from field measurement data or by matching against the head related transfer function of an existing object.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device 500 shown in Fig. 5 includes: at least one processor 501, a memory 502, at least one network interface 504, and other user interfaces 503. The various components in the computer device 500 are coupled together by a bus system 505. It is understood that the bus system 505 is used to enable communication connections among these components. In addition to a data bus, the bus system 505 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 505 in Fig. 5.
The user interface 503 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen, etc.).
It will be appreciated that the memory 502 in embodiments of the invention can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 502 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 502 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system 5021 and application programs 5022.
The operating system 5021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 5022 include various application programs, such as a Media Player and a Browser, for implementing various application services. A program for implementing the method of the embodiment of the present invention may be included in the application programs 5022.
In the embodiment of the present invention, the processor 501 is configured to execute the method steps provided by the method embodiments by calling a program or instruction stored in the memory 502, specifically a program or instruction stored in the application programs 5022, for example, including:
acquiring dual-channel audio data;
inputting the dual-channel audio data into a pre-trained music source separation model, and decomposing the dual-channel audio data to obtain at least two single-channel signals;
acquiring the spatial position of each of the at least two single-channel signals;
acquiring a head related transfer function of a target object;
and determining the audio signal received by both ears of the target object according to the head related transfer function and the spatial positions of the at least two single-channel signals.
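The five steps executed by the processor 501 can be sketched end to end as follows. Here `separate` and `hrtf_lookup` are hypothetical stand-ins for the pre-trained music source separation model and the target object's HRTF lookup; neither name is from the document, and all stems and HRIR pairs are assumed to have equal lengths.

```python
import numpy as np

def spatialize(dual_channel_audio, separate, positions, hrtf_lookup, weights):
    """End-to-end sketch of the claimed steps: separate the dual-channel
    audio into mono stems, convolve each stem with the HRIR pair for its
    spatial position, and mix per ear in preset proportions."""
    stems = separate(dual_channel_audio)        # at least two single-channel signals
    left = right = 0.0
    for stem, pos, w in zip(stems, positions, weights):
        hl, hr = hrtf_lookup(pos)               # target object's HRTF at this position
        left = left + w * np.convolve(stem, hl)   # time-domain convolution, left ear
        right = right + w * np.convolve(stem, hr) # time-domain convolution, right ear
    return left, right
```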
In an alternative embodiment, the determining of the audio signal received by both ears of the target object according to the head related transfer function and the spatial positions of the at least two single-channel signals includes:
performing convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, where i is a positive integer greater than or equal to 1;
and obtaining the audio signal received by both ears of the target object according to the plurality of processed audio signals.
In an alternative embodiment, the obtaining of the audio signal received by both ears of the target object according to the plurality of processed audio signals includes:
according to the spatial position of each single-channel signal, adding the plurality of processed audio signals received by the left ear in a preset proportion, and adding the plurality of processed audio signals received by the right ear in the preset proportion, so as to obtain the audio signal received by both ears of the target object.
In an alternative embodiment, after the obtaining of the audio signal received by both ears of the target object, the method further comprises:
performing reverberation processing on the audio signal received by both ears of the target object.
In an alternative embodiment, after the obtaining of the audio signal received by both ears of the target object, the method further comprises:
acquiring the angle information of both ears of the target object;
and fusing the angle information of both ears of the target object with the plurality of processed audio signals received by both ears of the target object to obtain a fused audio signal.
In an alternative embodiment, the acquiring of the head related transfer function of the target object specifically includes:
obtaining the head related transfer function from field measurement data, or by matching against the head related transfer functions of existing objects.
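One plausible way to realize the matching against existing objects' head related transfer functions is anthropometric nearest-neighbour selection: reuse the measured HRTF set of the stored subject most similar to the target object. The feature names and the database layout below are illustrative assumptions, not details given in the document.

```python
def match_hrtf(target_features, database):
    """Return the stored HRTF of the database subject whose
    anthropometric features (e.g. head width, pinna height, in cm)
    are closest to the target object's, by squared Euclidean distance."""
    def sq_dist(entry):
        return sum((entry["features"][k] - v) ** 2
                   for k, v in target_features.items())
    return min(database, key=sq_dist)["hrtf"]
```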
The method disclosed in the above embodiment of the present invention may be applied to the processor 501 or implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by an integrated logic circuit of hardware in the processor 501 or by instructions in the form of software. The processor 501 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software units may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or electrically erasable programmable memory, or a register. The storage medium is located in the memory 502, and the processor 501 reads information from the memory 502 and, in combination with its hardware, performs the steps of the above method.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The computer device provided in this embodiment may be the computer device shown in Fig. 5, and may perform all the steps of the audio processing method shown in Figs. 1-3, thereby achieving the technical effects of the audio processing method shown in Figs. 1-3; for brevity, the detailed description is omitted herein.
The embodiment of the invention also provides a storage medium (computer readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as a random access memory; it may also include non-volatile memory, such as a read-only memory, a flash memory, a hard disk, or a solid state disk; it may also include a combination of the above types of memory.
When the one or more programs in the storage medium are executable by one or more processors, the audio processing method described above is implemented.
The processor is used for executing the audio processing program stored in the memory to realize the following steps of the audio processing method:
acquiring dual-channel audio data;
inputting the dual-channel audio data into a pre-trained music source separation model, and decomposing the dual-channel audio data to obtain at least two single-channel signals;
acquiring the spatial position of each of the at least two single-channel signals;
acquiring a head related transfer function of a target object;
and determining the audio signal received by both ears of the target object according to the head related transfer function and the spatial positions of the at least two single-channel signals.
In an alternative embodiment, the determining of the audio signal received by both ears of the target object according to the head related transfer function and the spatial positions of the at least two single-channel signals includes:
performing convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, where i is a positive integer greater than or equal to 1;
and obtaining the audio signal received by both ears of the target object according to the plurality of processed audio signals.
In an alternative embodiment, the obtaining of the audio signal received by both ears of the target object according to the plurality of processed audio signals includes:
according to the spatial position of each single-channel signal, adding the plurality of processed audio signals received by the left ear in a preset proportion, and adding the plurality of processed audio signals received by the right ear in the preset proportion, so as to obtain the audio signal received by both ears of the target object.
In an alternative embodiment, after the obtaining of the audio signal received by both ears of the target object, the method further comprises:
performing reverberation processing on the audio signal received by both ears of the target object.
In an alternative embodiment, after the obtaining of the audio signal received by both ears of the target object, the method further comprises:
acquiring the angle information of both ears of the target object;
and fusing the angle information of both ears of the target object with the plurality of processed audio signals received by both ears of the target object to obtain a fused audio signal.
In an alternative embodiment, the acquiring of the head related transfer function of the target object specifically includes:
obtaining the head related transfer function from field measurement data, or by matching against the head related transfer functions of existing objects.
In summary, by acquiring dual-channel audio data, inputting the dual-channel audio data into a pre-trained music source separation model, decomposing it to obtain at least two single-channel signals, and acquiring the spatial position of each of the at least two single-channel signals, the method and the device lay the foundation for subsequently obtaining spatial audio data. The application can turn the binaural audio of wearable devices such as headphones and glasses into spatial audio, giving a better user experience. When the application is applied to a wearable device (such as headphones or glasses), ordinary stereo music can be rendered as concert-hall-level spatial audio in the music playing mode of the wearable device.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative elements and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in a Random Access Memory (RAM), a flash memory, a Read-Only Memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing is merely exemplary of embodiments of the present invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. An audio processing method, comprising:
acquiring dual-channel audio data;
inputting the dual-channel audio data into a pre-trained music source separation model, and decomposing the dual-channel audio data to obtain at least two single-channel signals;
acquiring the spatial position of each of the at least two single-channel signals;
acquiring a head related transfer function of a target object;
determining an audio signal received by both ears of the target object according to the head related transfer function and the spatial positions of the at least two single-channel signals;
wherein after the acquiring of the head related transfer function of the target object, the method further comprises:
tracking the head of the target object, and acquiring the angle information of each single-channel signal relative to the head of the target object;
fusing the angle information of each single-channel signal relative to the head of the target object with the head related transfer function, so as to keep the sound at a fixed position relative to the head of the target object in the case that the head of the target object moves;
wherein the determining of the audio signal received by both ears of the target object according to the head related transfer function and the spatial positions of the at least two single-channel signals comprises:
performing convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, wherein i is a positive integer greater than or equal to 1;
obtaining the audio signal received by both ears of the target object according to the plurality of processed audio signals;
wherein the obtaining of the audio signal received by both ears of the target object according to the plurality of processed audio signals comprises:
according to the spatial position of each single-channel signal, adding the plurality of processed audio signals received by the left ear in a preset proportion, and adding the plurality of processed audio signals received by the right ear in the preset proportion, so as to obtain the audio signal received by both ears of the target object.
2. The method of claim 1, wherein after the obtaining of the audio signal received by both ears of the target object, the method further comprises:
performing reverberation processing on the audio signal received by both ears of the target object.
3. The method according to claim 1, wherein the acquiring of the head related transfer function of the target object specifically comprises:
obtaining the head related transfer function from field measurement data, or by matching against the head related transfer functions of existing objects.
4. An audio processing apparatus, comprising:
a data acquisition module, configured to acquire dual-channel audio data;
a signal output module, configured to input the dual-channel audio data into a pre-trained music source separation model and decompose the dual-channel audio data to obtain at least two single-channel signals;
a position acquisition module, configured to acquire the spatial position of each of the at least two single-channel signals;
a function acquisition module, configured to acquire a head related transfer function of a target object;
an audio processing module, configured to determine the audio signal received by both ears of the target object according to the head related transfer function and the spatial positions of the at least two single-channel signals;
wherein the apparatus further comprises:
an angle information acquisition module, configured to track the head of the target object and acquire the angle information of each single-channel signal relative to the head of the target object;
a fusion processing module, configured to fuse the angle information of each single-channel signal relative to the head of the target object with the head related transfer function, so as to keep the sound at a fixed position relative to the head of the target object in the case that the head of the target object moves;
wherein the audio processing module is further configured to:
perform convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, wherein i is a positive integer greater than or equal to 1;
obtain the audio signal received by both ears of the target object according to the plurality of processed audio signals;
wherein the audio processing module is further configured to:
according to the spatial position of each single-channel signal, add the plurality of processed audio signals received by the left ear in a preset proportion, and add the plurality of processed audio signals received by the right ear in the preset proportion, so as to obtain the audio signal received by both ears of the target object.
5. A computer device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the audio processing method of any one of claims 1-3 when executing the program stored in the memory.
6. A wearable device, the wearable device comprising: a memory, a processor and an audio processing program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the audio processing method according to any one of claims 1 to 3.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the audio processing method according to any of claims 1-3.
CN202210216580.2A 2022-03-07 2022-03-07 Audio processing method and device Active CN114598985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210216580.2A CN114598985B (en) 2022-03-07 2022-03-07 Audio processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210216580.2A CN114598985B (en) 2022-03-07 2022-03-07 Audio processing method and device

Publications (2)

Publication Number Publication Date
CN114598985A CN114598985A (en) 2022-06-07
CN114598985B true CN114598985B (en) 2024-05-03

Family

ID=81815840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210216580.2A Active CN114598985B (en) 2022-03-07 2022-03-07 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN114598985B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007080224A1 (en) * 2006-01-09 2007-07-19 Nokia Corporation Decoding of binaural audio signals
WO2017135063A1 (en) * 2016-02-04 2017-08-10 ソニー株式会社 Audio processing device, audio processing method and program
CN112037738A (en) * 2020-08-31 2020-12-04 腾讯音乐娱乐科技(深圳)有限公司 Music data processing method and device and computer storage medium
CN113747337A (en) * 2021-09-03 2021-12-03 杭州网易云音乐科技有限公司 Audio processing method, medium, device and computing equipment
CN113821190A (en) * 2021-11-25 2021-12-21 广州酷狗计算机科技有限公司 Audio playing method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10848899B2 (en) * 2016-10-13 2020-11-24 Philip Scott Lyren Binaural sound in visual entertainment media
US11417347B2 (en) * 2020-06-19 2022-08-16 Apple Inc. Binaural room impulse response for spatial audio reproduction


Also Published As

Publication number Publication date
CN114598985A (en) 2022-06-07

Similar Documents

Publication Publication Date Title
US10924875B2 (en) Augmented reality platform for navigable, immersive audio experience
US10349197B2 (en) Method and device for generating and playing back audio signal
CN110972053B (en) Method and related apparatus for constructing a listening scene
US9131305B2 (en) Configurable three-dimensional sound system
JP4921470B2 (en) Method and apparatus for generating and processing parameters representing head related transfer functions
GB2543275A (en) Distributed audio capture and mixing
CN111294724B (en) Spatial repositioning of multiple audio streams
CN109996166A (en) Sound processing apparatus and method and program
Ziemer Source width in music production. methods in stereo, ambisonics, and wave field synthesis
Yeoward et al. Real-time binaural room modelling for augmented reality applications
CN113347552B (en) Audio signal processing method and device and computer readable storage medium
US20240005897A1 (en) Sound editing device, sound editing method, and sound editing program
CN114598985B (en) Audio processing method and device
Guthrie Stage acoustics for musicians: A multidimensional approach using 3D ambisonic technology
CA3044260A1 (en) Augmented reality platform for navigable, immersive audio experience
WO2023109278A1 (en) Accompaniment generation method, device, and storage medium
CN113347551B (en) Method and device for processing single-sound-channel audio signal and readable storage medium
Bargum et al. Virtual reconstruction of the ambisonic concert hall of the Royal Danish Academy of Music
Picinali et al. Chapter Reverberation and its Binaural Reproduction: The Trade-off between Computational Efficiency and Perceived Quality
CN113747337A (en) Audio processing method, medium, device and computing equipment
Zea Binaural In-Ear Monitoring of acoustic instruments in live music performance
CN117998274B (en) Audio processing method, device and storage medium
Pellegrini Comparison of data-and model-based simulation algorithms for auditory virtual environments
CN112954548B (en) Method and device for combining sound collected by terminal microphone and headset
WO2023173285A1 (en) Audio processing method and apparatus, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant