CN114598985B - Audio processing method and device - Google Patents
- Publication number
- CN114598985B CN114598985B CN202210216580.2A CN202210216580A CN114598985B CN 114598985 B CN114598985 B CN 114598985B CN 202210216580 A CN202210216580 A CN 202210216580A CN 114598985 B CN114598985 B CN 114598985B
- Authority
- CN
- China
- Prior art keywords
- target object
- audio
- head
- transfer function
- related transfer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S1/00—Two-channel systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/308—Electronic adaptation dependent on speaker or headphone connection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/11—Positioning of individual sound objects, e.g. moving airplane, within a sound field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Stereophonic System (AREA)
Abstract
The application relates to an audio processing method and device in the field of audio processing. The audio processing method includes: acquiring two-channel audio data; inputting the two-channel audio data into a pre-trained music source separation model, which outputs a single-channel signal for each of the musical instruments in the audio; determining the spatial position of each instrument; acquiring a head-related transfer function of a target object according to the user's personal characteristics and the spatial positions of the instruments; and determining the audio signals received at the target object's two ears according to the head-related transfer function and the spatial positions of the single-channel signals. The application can change the two-channel audio on wearable devices such as headphones and glasses into spatial audio, giving a better user experience.
Description
Technical Field
The present application relates to the field of audio processing, and in particular, to an audio processing method and apparatus.
Background
With the development of science and technology, spatial audio has attracted increasing attention in recent years, as have the spatial and directional information of sound. At present, wearable devices such as headphones and glasses are prone to the in-head localization effect, which reduces the sound field and spatial direction information perceived by the user; that is, the prior art cannot convert the ordinary two-channel stereo on wearable devices such as headphones and glasses into spatial audio, and the user experience is poor.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, the present application provides an audio processing method and apparatus.
In a first aspect, the present application provides an audio processing method, including:
acquiring two-channel audio data;
inputting the two-channel audio data into a pre-trained music source separation model, and decomposing the two-channel audio data to obtain at least two single-channel signals;
acquiring the spatial position of each of the at least two single-channel signals;
acquiring a head-related transfer function of a target object; and
determining the audio signals received at the target object's two ears according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
In a second aspect, the present application provides an audio processing apparatus comprising:
a data acquisition module, configured to acquire two-channel audio data;
a signal output module, configured to input the two-channel audio data into a pre-trained music source separation model and decompose the two-channel audio data to obtain at least two single-channel signals;
a position acquisition module, configured to acquire the spatial position of each of the at least two single-channel signals;
a function acquisition module, configured to acquire a head-related transfer function of a target object; and
an audio processing module, configured to determine the audio signals received at the target object's two ears according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
In a third aspect, a computer device is provided, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the steps of the audio processing method of any embodiment of the first aspect when executing the program stored in the memory.
In a fourth aspect, there is provided a wearable device comprising: a memory, a processor and an audio processing program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the audio processing method as in any of the embodiments of the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the audio processing method as in any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
According to the method provided by the embodiment of the application, two-channel audio data is acquired and input into the pre-trained music source separation model, which decomposes it into at least two single-channel signals; the spatial position of each of the at least two single-channel signals is then acquired, laying the foundation for subsequently producing spatial audio, and the audio signals received at the target object's two ears are determined from the head-related transfer function and those spatial positions. The method and device can thus change the two-channel audio on wearable devices such as headphones and glasses into spatial audio, giving a better user experience.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of an audio processing method according to an embodiment of the present application;
Fig. 2 is a flowchart of an embodiment of an audio processing method according to an embodiment of the present application;
fig. 3 is a flowchart of another embodiment of an audio processing method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an audio processing device according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms involved in the present application are explained as follows:
Head-related transfer function (HRTF): a function used in acoustic localization that describes the transmission of sound waves from a sound source to the two ears. It is the result of the combined filtering of sound waves by physiological structures such as the head, pinnae and torso.
As noted in the Background above, on wearable devices such as headphones and glasses the in-head localization effect readily occurs, reducing the sound field and spatial direction information perceived by the user; that is, the prior art cannot convert ordinary two-channel stereo on such devices into spatial audio, and the user experience is poor. To solve this problem, an embodiment of the present application provides an audio processing method.
Fig. 1 is a flow chart of an audio processing method according to an embodiment of the present application, as shown in fig. 1, where the audio processing method includes:
Step 101, obtaining dual-channel audio data.
In this embodiment, the audio data has two channels. Specifically, it may be a complete piece of two-channel music produced by a wearable device (such as headphones or glasses) in music mode, or a segment of two-channel music; distinguished by quality, it may be either high-quality or low-quality two-channel music data.
Step 102, inputting the binaural audio data into a pre-trained music source separation model, and decomposing the binaural audio data to obtain at least two single-channel signals.
In one embodiment, after the binaural audio data is input into the pre-trained music source separation model, the model outputs a single-channel signal for each of the musical instruments present in the binaural audio data, the instruments corresponding one-to-one with the single-channel signals.
In this embodiment, the music source separation model may be a deep learning model (including but not limited to Wave-U-Net, Conv-TasNet, MMDenseLSTM and similar models), and its training process is as follows:
obtaining sample data, where the sample data includes a plurality of two-channel audio data items and, for each, the single-channel signals of its constituent instruments; and
training a deep learning model with the sample data to obtain the music source separation model.
In one embodiment, a first data set may be acquired over a network, the data set containing audio data together with the single-channel signals of the instruments it comprises; the music source separation model is obtained by training a deep learning model on this data set. In use, the two-channel audio data is input into the trained music source separation model, which outputs the corresponding single-channel signals.
The instruments may include: electronic organ, guitar, human voice, drums, bass, etc.
Step 103, acquiring the spatial position of each single-channel signal in the at least two single-channel signals.
Based on the embodiment in step 102, it should be appreciated that since the instruments correspond one-to-one with the single-channel signals, the spatial position of an instrument is the spatial position of its single-channel signal.
For acquiring the spatial positions of the instruments, in one embodiment a second data set may be collected in advance over the network. The second data set contains scene information and the placement of each instrument within each scene; for example, in a concert scene with the user at the centre of the sphere, the instruments may be placed relative to the user as: electronic organ (front left), guitar (centre), vocals (centre), drums (rear centre), bass (front right). A preset neural network model is then trained with scene information (such as concert hall and concert scenes) and the corresponding instrument placements, producing a trained instrument placement model. In use, there are two options. First, the scene information of the user's current scene (or preset scene information) is input into the trained instrument placement model, which outputs the placement of each instrument. Second, the user may customize the spatial placement of the instruments according to preference or experience; the instrument placement model is then trained on these user-preset positions, and the final placements are those chosen by the user.
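As a concrete illustration of what such a placement model might output, the concert-scene layout above can be encoded as a listener-centred table of spherical coordinates. The `SCENE_PLACEMENTS` name and all angle and distance values are illustrative assumptions, not figures from the patent:

```python
import math

# Hypothetical placement table for the concert scene described above.
# Azimuth/elevation are degrees on a listener-centred sphere
# (0 deg azimuth = straight ahead, negative = left, positive = right);
# every value here is an assumption for illustration only.
SCENE_PLACEMENTS = {
    "concert": {
        "electronic_organ": {"azimuth": -45.0, "elevation": 0.0, "distance": 3.0},  # front left
        "guitar":           {"azimuth":   0.0, "elevation": 0.0, "distance": 3.0},  # centre
        "vocals":           {"azimuth":   0.0, "elevation": 0.0, "distance": 2.0},  # centre
        "drums":            {"azimuth":   0.0, "elevation": 0.0, "distance": 5.0},  # rear centre (farther back)
        "bass":             {"azimuth":  45.0, "elevation": 0.0, "distance": 3.0},  # front right
    },
}

def to_cartesian(pos):
    """Convert one placement entry to listener-centred (x, y, z) in metres."""
    az = math.radians(pos["azimuth"])
    el = math.radians(pos["elevation"])
    r = pos["distance"]
    return (r * math.cos(el) * math.sin(az),   # x: positive to the right
            r * math.cos(el) * math.cos(az),   # y: positive straight ahead
            r * math.sin(el))                  # z: positive upwards
```

The Cartesian form is convenient for the later per-ear rendering, while the spherical form maps directly onto how HRTFs are usually indexed (by direction).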
Step 104, acquiring a head related transfer function of the target object.
In this embodiment, the target object may be a user.
The head-related transfer function of the target object is obtained either from field measurement data or by matching against the head-related transfer functions of existing subjects.
For example, the head-related transfer function of the target object may be obtained by one of the following methods:
1. Match against publicly available head HRTF data.
For example, the user can browse data already published on the network, make a subjective judgment, and match the head HRTF he or she considers most suitable; the head-related transfer function is then determined directly from the data the user selects.
2. First cluster the head HRTFs of many people in a database, then use the user's personalized pinna characteristics to select the best-matching cluster from the public database as the final function.
Specifically, a plurality of pre-stored head-related transfer functions may be clustered, and the head-related transfer function matching the target object may be selected from the clustered functions according to the pinna characteristics of the target object's ears.
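A minimal sketch of this cluster-then-match selection, using a small pure-NumPy k-means over HRTF feature vectors; the feature representation, the value of `k`, and the helper name `select_hrtf` are all assumptions for illustration, since the patent does not fix them:

```python
import numpy as np

def select_hrtf(hrtf_db, pinna_features, k=3, iters=20, seed=0):
    """Cluster a database of HRTF feature vectors with a tiny k-means,
    then return the index of the database entry closest to the user's
    pinna feature vector within the best-matching cluster."""
    hrtf_db = np.asarray(hrtf_db, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = hrtf_db[rng.choice(len(hrtf_db), k, replace=False)]
    for _ in range(iters):
        # assign each database entry to its nearest centroid
        dists = np.linalg.norm(hrtf_db[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = hrtf_db[labels == j].mean(axis=0)
    # pick the cluster whose centroid is nearest the user's pinna features
    best = np.linalg.norm(centroids - pinna_features, axis=1).argmin()
    members = np.flatnonzero(labels == best)
    # within that cluster, pick the single closest HRTF as the final function
    return members[np.linalg.norm(hrtf_db[members] - pinna_features, axis=1).argmin()]
```

In practice the feature vectors would be derived from measured pinna geometry or HRTF magnitude spectra; here the clustering-then-matching structure is the point, not the specific features.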
3. Measure the user's HRTF directly in an anechoic chamber, i.e., measure it according to the user's personal characteristics (height, pinna information, etc.) and the spatial placement of the instruments in the chamber.
4. Measure the user's HRTF directly in a concert hall, i.e., measure it according to the user's personal characteristics (height, pinna information, etc.) and the spatial placement of the instruments in the hall.
Step 105, determining the audio signal received by the target object in double ears according to the head related transfer function and the spatial positions of the at least two single channel signals.
In one embodiment, to accurately obtain the audio signals received at the target object's two ears, determining them according to the head-related transfer function and the spatial positions of the at least two single-channel signals includes:
convolving the i-th single-channel signal in the time domain according to the head-related transfer function and the spatial position of the i-th single-channel signal, to obtain the i-th processed audio signal, where i is a positive integer greater than or equal to 1; and
obtaining the audio signals received at the target object's two ears from the plurality of processed audio signals.
For example, if one piece of two-channel music is split into 5 instrument stems (electronic organ, guitar, voice, drums and bass), then after the 5 instruments are placed in space they correspond to 5 directions relative to the user's left and right ears. The single-channel signal from each of the 5 directions is convolved, in the time domain, with the left-ear head-related transfer function for that direction, and the 5 results are summed to obtain the audio signal received at the target object's left ear. Similarly, each single-channel signal is convolved with the corresponding right-ear head-related transfer function, and the 5 results are summed to obtain the audio signal received at the right ear. Together these form the audio signals received at the target object's two ears.
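The per-ear rendering described in this example can be sketched as follows, assuming each stem comes with a pair of head-related impulse responses (HRIRs) for its direction; the data layout and the function name `render_binaural` are illustrative assumptions:

```python
import numpy as np

def render_binaural(stems, hrirs):
    """Render mono instrument stems to a two-channel binaural mix.

    stems : dict  name -> 1-D mono signal (NumPy array)
    hrirs : dict  name -> (hrir_left, hrir_right), the head-related
            impulse responses for that stem's spatial direction
            (hypothetical data layout).
    """
    n = max(len(s) for s in stems.values())
    tail = max(len(h[0]) for h in hrirs.values()) - 1
    left = np.zeros(n + tail)
    right = np.zeros_like(left)
    for name, sig in stems.items():
        hl, hr = hrirs[name]
        # time-domain convolution of each single-channel signal with the
        # HRIR for its direction, then summation per ear (step 105)
        yl = np.convolve(sig, hl)
        yr = np.convolve(sig, hr)
        left[:len(yl)] += yl
        right[:len(yr)] += yr
    return np.stack([left, right])
```

Convolving with the HRIR in the time domain is equivalent to multiplying by the HRTF in the frequency domain, which is why the description above can speak of either operation.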
Further, considering that different instruments occupy different spatial positions, to guarantee the quality of the audio signals received at the user's ears, summing the plurality of processed audio signals for each ear to obtain the audio signals received at the target object's two ears includes:
according to the spatial position of each single-channel signal, adding the processed audio signals received by the left ear in a preset proportion, and adding the processed audio signals received by the right ear in the preset proportion, to obtain the audio signals received at the target object's two ears.
In practice, the preset proportion may be determined by how far each instrument is from the user's ears: for an instrument farther from the user's left ear the proportion is set smaller, and for an instrument closer to the left ear it is set slightly larger.
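One plausible way to realise such a distance-dependent preset proportion is simple inverse-distance weighting; the 1/d law below is our assumption, since the patent leaves the exact proportion unspecified:

```python
def distance_gains(distances, ref=1.0):
    """Map instrument name -> per-ear mixing gain.

    distances : dict  name -> distance (same units as ref) from the ear.
    Farther instruments contribute proportionally less (the "preset
    proportion" described above); the inverse-distance law is an
    illustrative assumption. Distances below ref are clamped so the
    gain never exceeds 1.
    """
    return {name: ref / max(d, ref) for name, d in distances.items()}
```

The gains would be applied to each processed (HRIR-convolved) stem before the per-ear summation.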
To make the immersive effect of the final audio signal more realistic, after the head-related transfer function of the target object is obtained, as shown in fig. 2, the audio processing method further includes:
step 201, acquiring angle information of each single-channel signal relative to the head of a target object;
Step 202, fusion processing is performed on the angle information of each single-channel signal relative to the head of the target object and the head related transfer function.
It should be understood that for smart wearable devices such as headphones and glasses, the built-in gyroscope can track the motion of the wearer's head, giving the angle of each single-channel signal (i.e., of each instrument) relative to the target object's head. The angle information of each single-channel signal relative to the head is then fused with the head-related transfer function, achieving a real-time mapping effect, so that each sound source remains at a fixed position in space even when the user's head moves.
Specifically, the angle of each single-channel signal relative to the target object's head is carried in the Euler angles (yaw, pitch and roll) of gyroscope technology. The Euler angles can be obtained from the quaternion by conversion, and the fusion with the dynamic HRTF can be performed in the geodetic coordinate system.
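The quaternion-to-Euler conversion mentioned here follows a standard formula; below is a sketch under the common aerospace (Z-Y-X) axis convention, which is an assumption since the patent does not fix one:

```python
import math

def quat_to_euler(w, x, y, z):
    """Convert a unit quaternion (e.g. from the headset gyroscope) into
    (yaw, pitch, roll) in radians, Z-Y-X aerospace convention."""
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    s = 2.0 * (w * y - z * x)
    pitch = math.asin(max(-1.0, min(1.0, s)))  # clamp for numerical safety
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return yaw, pitch, roll
```

The resulting yaw can then, for example, be subtracted from each source's azimuth before the HRTF for that direction is selected, which is what keeps the sources fixed in space as the head turns.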
Fig. 3 is a flowchart of another embodiment of an audio processing method according to an embodiment of the present application. To increase the sense of space of the audio signals received at the target object's two ears, after those audio signals are obtained, as shown in fig. 3, the audio processing method further includes:
step 301, performing reverberation processing on the audio signals received at the target object's two ears.
It should be understood that adding reverberation mainly serves to further increase the sense of space; the reverberation of various scenes, such as a concert hall, a concert or a recording studio, may be added. Music performed in a concert hall sounds good precisely because of the hall's architectural acoustics.
In one embodiment, with the audio signals received at the target object's two ears as the audio to be processed, a wearable device with poor processing performance may need too long to add reverberation to the audio to be processed, and such a device can be considered not to support adding reverberation. Whether the current wearable device supports adding reverberation to the audio to be processed can therefore be determined from its performance parameters, according to preset processing logic or a preset performance parameter table. The performance parameter table may store the correspondence between a device's performance parameters and whether adding reverberation is supported.
In this embodiment, determining whether the current wearable device supports adding reverberation to the audio to be processed may also include: acquiring a device information set, and determining from that set whether the current device supports adding reverberation to the audio to be processed.
The device information may be any information that identifies the wearable device, such as a device model or a device name. As an example, the device information set may list the wearable devices that do not support adding reverberation; it is then checked whether the current wearable device's information is in the set. If it is, the current device does not support adding reverberation to the audio to be processed; otherwise, it does. It will be appreciated that, in practice, the device information set may instead list the wearable devices that do support adding reverberation.
If the current device is determined to support adding reverberation to the audio to be processed, a reverberation algorithm is determined according to the reverberation category selected by the user.
In a specific implementation, reverberation algorithms can be divided into categories according to the simulated environmental effect; for example, the categories may include a hall effect, a studio effect and a valley effect. For the various reverberation categories, category information (e.g., names and pictures) can be presented to the user, each item of category information being associated with the reverberation category it indicates, so that the user can select a reverberation category by performing an operation (e.g., a click) on the category information.
In this embodiment, determining a reverberation algorithm according to a reverberation category selected by a user includes: performing audio characteristic analysis on the audio to be processed to obtain audio characteristic data; and determining a reverberation algorithm according to the audio characteristic data and the reverberation category selected by the user.
As an example, the audio to be processed may be analyzed by some existing audio analysis application or open source toolkit. Wherein the audio characteristics include frequency, bandwidth, amplitude, etc. of the audio. On this basis, a reverberation algorithm (e.g., a 3D surround algorithm) may be determined based on the audio characteristic data and the user-selected reverberation category.
Specifically, a reverberation algorithm matching the obtained audio characteristic data and the user-selected reverberation category may be looked up in a pre-established table of correspondences between audio characteristic data and reverberation categories on one side and reverberation algorithms on the other. The audio to be processed is then passed through at least one filter selected according to the determined reverberation algorithm, yielding the processed audio. The number and types of filters for each reverberation algorithm may be preset; each reverberation algorithm may be built as a combination of at least one filter, for example a comb filter combined with an all-pass filter. Depending on implementation requirements, a filter may be a hardware module or a software module of the current wearable device.
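A comb-plus-all-pass combination of the kind mentioned here is classically realised as a Schroeder reverberator; the sketch below shows the structure, with delay lengths, gains and wet/dry mix as illustrative choices rather than parameters from the patent:

```python
import numpy as np

def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = np.array(x, dtype=float)
    for n in range(delay, len(y)):
        y[n] += g * y[n - delay]
    return y

def allpass(x, delay, g):
    """Schroeder all-pass: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def schroeder_reverb(x):
    """Parallel combs followed by series all-passes; all delay lengths,
    gains and the 70/30 dry/wet mix are illustrative assumptions."""
    wet = sum(comb(x, d, 0.8) for d in (1687, 1601, 2053, 2251)) / 4.0
    for d, g in ((347, 0.7), (113, 0.7)):
        wet = allpass(wet, d, g)
    return 0.7 * np.asarray(x, dtype=float) + 0.3 * wet
```

Different reverberation categories (hall, studio, valley) would correspond to different delay/gain sets; a production implementation would also vectorise or run the filters in hardware rather than looping per sample.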
As can be seen from the foregoing, the audio processing method provided by the embodiment of the present application acquires the two-channel audio data, inputs it into the pre-trained music source separation model to output the single-channel signal of each instrument in the audio, determines the spatial position of each single-channel signal, and, from the acquired head-related transfer function of the target object and those spatial positions, converts the two-channel audio data into spatial audio and determines the audio signals received at the target object's two ears. That is, the application can change the two-channel audio on wearable devices such as headphones and glasses into spatial audio, giving a good user experience.
Based on the same inventive concept, the embodiments of the present application also provide an audio processing apparatus, as follows. Since the principle of the audio processing apparatus for solving the problem is similar to that of the audio processing method, the implementation of the audio processing apparatus can refer to the implementation of the audio processing method, and the repetition is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application, as shown in fig. 4, where the audio processing apparatus includes:
the data acquisition module 401 is configured to acquire binaural audio data.
The signal output module 402 is configured to input the binaural audio data into a pre-trained music source separation model, and decompose the binaural audio data to obtain at least two single-channel signals.
A position acquisition module 403, configured to acquire a spatial position of each of the at least two single-channel signals.
A function obtaining module 404, configured to obtain a head related transfer function of the target object.
An audio processing module 405, configured to determine an audio signal received by both ears of the target object according to the head related transfer function and the spatial positions of the at least two single channel signals.
Optionally, the audio processing module 405 is further configured to:
perform convolution processing on the i-th single-channel signal in the time domain according to the head-related transfer function and the spatial position of the i-th single-channel signal, to obtain an i-th processed audio signal, where i is a positive integer greater than or equal to 1; and
obtain the audio signals received by both ears of the target object according to the plurality of processed audio signals.
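A minimal sketch of this per-source step: pick the measured HRIR pair nearest to the source's spatial position (here reduced to azimuth only, a simplifying assumption) and convolve the mono signal with it in the time domain. The two-tap HRIRs in the usage below are synthetic placeholders for measured data.

```python
import numpy as np

def nearest_hrir(azimuth_deg, hrir_grid):
    # hrir_grid maps a measured azimuth in degrees to a (left_ir, right_ir)
    # pair; return the pair measured closest to the requested direction.
    best = min(hrir_grid, key=lambda a: abs(a - azimuth_deg))
    return hrir_grid[best]

def apply_hrir(mono, hrir_pair):
    # Convolve a mono source with the left and right HRIRs, truncating the
    # convolution tail so the output length matches the input length.
    left = np.convolve(mono, hrir_pair[0])[:len(mono)]
    right = np.convolve(mono, hrir_pair[1])[:len(mono)]
    return np.stack([left, right])  # shape (2, len(mono))
```

Measured HRIR sets are typically sampled on a sparse direction grid, so production code would interpolate between neighboring directions rather than snap to the nearest one.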
Optionally, the audio processing module 405 is further configured to:
according to the spatial position of each single-channel signal, add the plurality of processed audio signals received by the left ear in a preset proportion, and add the plurality of processed audio signals received by the right ear in the preset proportion, to obtain the audio signals received by both ears of the target object.
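The proportional summation can be sketched as follows. The per-source weights stand in for the "preset proportion"; the patent does not fix their values, so the ones used here are hypothetical.

```python
import numpy as np

def mix_binaural(processed, weights):
    # processed: list of (2, N) arrays, one per source (left row, right row);
    # weights: one preset proportion per source. Left rows are summed with
    # these weights, and likewise the right rows.
    out = np.zeros_like(processed[0], dtype=float)
    for sig, w in zip(processed, weights):
        out += w * np.asarray(sig, dtype=float)
    return out
```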
Optionally, the audio processing apparatus further includes:
a reverberation processing module, configured to perform reverberation processing on the audio signals received by both ears of the target object.
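One common way to realize such a reverberation step, offered here only as an illustrative stand-in since the patent does not specify the algorithm, is to convolve each ear's signal with a synthetic exponentially decaying noise tail and blend the result with the dry signal:

```python
import numpy as np

def add_reverb(binaural, rt60_s=0.3, fs=44100, mix=0.3, seed=0):
    # binaural: (2, N) array. Convolve each channel with a synthetic
    # exponentially decaying noise impulse response (a stand-in for a
    # measured room response), then mix wet and dry signals.
    rng = np.random.default_rng(seed)
    n = max(1, int(rt60_s * fs))
    t = np.arange(n) / fs
    ir = rng.standard_normal(n) * np.exp(-6.91 * t / rt60_s)  # ~60 dB decay
    wet = np.stack([np.convolve(ch, ir)[:binaural.shape[1]]
                    for ch in binaural])
    return (1.0 - mix) * binaural + mix * wet
```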
Optionally, the audio processing device further includes:
an angle information acquisition module, configured to acquire angle information of each single-channel signal relative to the head of the target object; and
a fusion processing module, configured to fuse the angle information of each single-channel signal relative to the head of the target object with the head-related transfer function.
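The intent of this fusion, keeping a sound source world-fixed as the head turns, can be sketched for the azimuth component alone (a simplification; full head tracking involves 3-D rotation):

```python
def world_locked_azimuth(source_az_deg, head_yaw_deg):
    # If the head yaws by head_yaw_deg, the source's azimuth relative to the
    # head shifts the opposite way, so the rendered sound stays fixed in the
    # room. The result is wrapped to the interval (-180, 180].
    rel = (source_az_deg - head_yaw_deg) % 360.0
    return rel - 360.0 if rel > 180.0 else rel
```

The compensated azimuth would then be fed to the HRIR selection step, so the binaural rendering follows the head movement.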
In the embodiment of the present application, the head-related transfer function of the target object is obtained based on field measurement data, or is obtained by matching against head-related transfer functions of existing objects.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device 500 shown in Fig. 5 includes: at least one processor 501, a memory 502, at least one network interface 504, and other user interfaces 503. The various components in the computer device 500 are coupled together by a bus system 505, which enables communication connections between these components. In addition to a data bus, the bus system 505 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 505 in Fig. 5.
The user interface 503 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen, etc.).
It will be appreciated that the memory 502 in embodiments of the invention can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 502 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 502 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system 5021 and application programs 5022.
The operating system 5021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 5022 include various application programs, such as a media player (Media Player), a browser (Browser), and the like, for implementing various application services. A program implementing the method of the embodiment of the present invention may be included in the application programs 5022.
In the embodiment of the present invention, the processor 501 is configured to execute the method steps provided by the method embodiments by calling a program or an instruction stored in the memory 502, specifically, a program or an instruction stored in the application 5022, for example, including:
acquiring dual-channel audio data;
Inputting the binaural audio data into a pre-trained music source separation model, and decomposing the binaural audio data to obtain at least two single-channel signals;
Acquiring the spatial position of each single-channel signal in the at least two single-channel signals;
acquiring a head related transfer function of a target object;
and determining the audio signals received by both ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
In an alternative embodiment, the determining the audio signals received by both ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals comprises:
performing convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, wherein i is a positive integer greater than or equal to 1;
and obtaining the audio signals received by both ears of the target object according to the processed audio signals.
In an alternative embodiment, the obtaining the audio signals received by both ears of the target object according to the plurality of processed audio signals includes:
adding, according to the spatial position of each single-channel signal, the plurality of processed audio signals received by the left ear in a preset proportion, and adding the plurality of processed audio signals received by the right ear in the preset proportion, to obtain the audio signals received by both ears of the target object.
In an alternative embodiment, after the obtaining the audio signals received by both ears of the target object, the method further comprises:
performing reverberation processing on the audio signals received by both ears of the target object.
In an alternative embodiment, after the obtaining the audio signals received by both ears of the target object, the method further comprises:
acquiring angle information of both ears of the target object;
and fusing the angle information of both ears of the target object with the plurality of processed audio signals received by both ears of the target object, to obtain fused audio signals.
In an alternative embodiment, the acquiring the head-related transfer function of the target object specifically includes:
obtaining it based on field measurement data, or obtaining it by matching against head-related transfer functions of existing objects.
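The matching branch could, for example, be a nearest-neighbor lookup over anthropometric features of already-measured subjects. The feature set (head width, pinna height, and so on) and the Euclidean distance used here are purely hypothetical; the patent does not specify the matching criterion.

```python
import numpy as np

def match_hrtf(user_features, database):
    # database: list of (feature_vector, hrtf_set) pairs from measured
    # subjects; return the HRTF set of the subject whose hypothetical
    # anthropometric features are closest to the user's.
    dists = [np.linalg.norm(np.asarray(user_features, dtype=float) -
                            np.asarray(feats, dtype=float))
             for feats, _ in database]
    return database[int(np.argmin(dists))][1]
```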
The method disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware, or by instructions in software, in the processor 501. The processor 501 may be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as being executed by a hardware decoding processor, or as being executed by a combination of hardware and software units in a decoding processor. The software units may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or electrically erasable programmable memory, or a register. The storage medium is located in the memory 502, and the processor 501 reads the information in the memory 502 and, in combination with its hardware, performs the steps of the above method.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The computer device provided in this embodiment may be a computer device as shown in fig. 5, and may perform all the steps of the audio processing method shown in fig. 1-3, so as to achieve the technical effects of the audio processing method shown in fig. 1-3, and the detailed description will be omitted herein for brevity.
The embodiment of the invention also provides a storage medium (computer readable storage medium). The storage medium here stores one or more programs. Wherein the storage medium may comprise volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, hard disk, or solid state disk; the memory may also comprise a combination of the above types of memories.
When one or more programs in the storage medium are executed by one or more processors, the audio processing method described above is implemented.
The processor is configured to execute the audio processing program stored in the memory to implement the following steps of the audio processing method:
acquiring dual-channel audio data;
Inputting the binaural audio data into a pre-trained music source separation model, and decomposing the binaural audio data to obtain at least two single-channel signals;
Acquiring the spatial position of each single-channel signal in the at least two single-channel signals;
acquiring a head related transfer function of a target object;
and determining the audio signals received by both ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
In an alternative embodiment, the determining the audio signals received by both ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals comprises:
performing convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, wherein i is a positive integer greater than or equal to 1;
and obtaining the audio signals received by both ears of the target object according to the processed audio signals.
In an alternative embodiment, the obtaining the audio signals received by both ears of the target object according to the plurality of processed audio signals includes:
adding, according to the spatial position of each single-channel signal, the plurality of processed audio signals received by the left ear in a preset proportion, and adding the plurality of processed audio signals received by the right ear in the preset proportion, to obtain the audio signals received by both ears of the target object.
In an alternative embodiment, after the obtaining the audio signals received by both ears of the target object, the method further comprises:
performing reverberation processing on the audio signals received by both ears of the target object.
In an alternative embodiment, after the obtaining the audio signals received by both ears of the target object, the method further comprises:
acquiring angle information of both ears of the target object;
and fusing the angle information of both ears of the target object with the plurality of processed audio signals received by both ears of the target object, to obtain fused audio signals.
In an alternative embodiment, the acquiring the head-related transfer function of the target object specifically includes:
obtaining it based on field measurement data, or obtaining it by matching against head-related transfer functions of existing objects.
In summary, the present application acquires binaural audio data, inputs it into a pre-trained music source separation model to decompose it into at least two single-channel signals, and acquires the spatial position of each of those single-channel signals, thereby laying the foundation for subsequently obtaining spatial audio data. The present application can thus turn the binaural audio on wearable devices such as headphones and glasses into spatial audio, improving the user experience. When applied to a wearable device (such as headphones or glasses), the application can, in the device's music playing mode, turn ordinary stereo music into concert-hall-grade spatial audio.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing is merely exemplary of embodiments of the present invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. An audio processing method, comprising:
acquiring dual-channel audio data;
Inputting the binaural audio data into a pre-trained music source separation model, and decomposing the binaural audio data to obtain at least two single-channel signals;
Acquiring the spatial position of each single-channel signal in the at least two single-channel signals;
acquiring a head related transfer function of a target object;
determining audio signals received by both ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals;
wherein after the obtaining the head-related transfer function of the target object, the method further comprises:
tracking the head of the target object, and acquiring angle information of each single-channel signal relative to the head of the target object;
fusing the angle information of each single-channel signal relative to the head of the target object with the head-related transfer function, so that the sound is kept at a fixed position relative to the head of the target object when the head of the target object moves;
wherein the determining the audio signals received by both ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals comprises:
performing convolution processing on the i-th single-channel signal in the time domain according to the head-related transfer function and the spatial position of the i-th single-channel signal, to obtain an i-th processed audio signal, wherein i is a positive integer greater than or equal to 1;
obtaining the audio signals received by both ears of the target object according to the processed audio signals;
wherein the obtaining the audio signals received by both ears of the target object according to the processed audio signals comprises:
adding, according to the spatial position of each single-channel signal, the plurality of processed audio signals received by the left ear in a preset proportion, and adding the plurality of processed audio signals received by the right ear in the preset proportion, to obtain the audio signals received by both ears of the target object.
2. The method of claim 1, wherein after the obtaining the audio signals received by both ears of the target object, the method further comprises:
performing reverberation processing on the audio signals received by both ears of the target object.
3. The method according to claim 1, wherein the acquiring the head-related transfer function of the target object specifically comprises:
obtaining it based on field measurement data, or obtaining it by matching against head-related transfer functions of existing objects.
4. An audio processing apparatus, comprising:
a data acquisition module, configured to acquire dual-channel audio data;
a signal output module, configured to input the binaural audio data into a pre-trained music source separation model and decompose the binaural audio data to obtain at least two single-channel signals;
a position acquisition module, configured to acquire the spatial position of each of the at least two single-channel signals;
a function acquisition module, configured to acquire a head-related transfer function of a target object;
an audio processing module, configured to determine audio signals received by both ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals;
wherein the apparatus further comprises:
an angle information acquisition module, configured to track the head of the target object and acquire angle information of each single-channel signal relative to the head of the target object;
a fusion processing module, configured to fuse the angle information of each single-channel signal relative to the head of the target object with the head-related transfer function, so that the sound is kept at a fixed position relative to the head of the target object when the head of the target object moves;
wherein the audio processing module is further configured to:
perform convolution processing on the i-th single-channel signal in the time domain according to the head-related transfer function and the spatial position of the i-th single-channel signal, to obtain an i-th processed audio signal, wherein i is a positive integer greater than or equal to 1; and
obtain the audio signals received by both ears of the target object according to the processed audio signals;
wherein the audio processing module is further configured to:
add, according to the spatial position of each single-channel signal, the plurality of processed audio signals received by the left ear in a preset proportion, and add the plurality of processed audio signals received by the right ear in the preset proportion, to obtain the audio signals received by both ears of the target object.
5. A computer device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the audio processing method of any one of claims 1-3 when executing a program stored on a memory.
6. A wearable device, the wearable device comprising: a memory, a processor and an audio processing program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the audio processing method according to any one of claims 1 to 3.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the audio processing method according to any of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210216580.2A CN114598985B (en) | 2022-03-07 | 2022-03-07 | Audio processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114598985A CN114598985A (en) | 2022-06-07 |
CN114598985B true CN114598985B (en) | 2024-05-03 |
Family
ID=81815840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210216580.2A Active CN114598985B (en) | 2022-03-07 | 2022-03-07 | Audio processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114598985B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007080224A1 (en) * | 2006-01-09 | 2007-07-19 | Nokia Corporation | Decoding of binaural audio signals |
WO2017135063A1 (en) * | 2016-02-04 | 2017-08-10 | ソニー株式会社 | Audio processing device, audio processing method and program |
CN112037738A (en) * | 2020-08-31 | 2020-12-04 | 腾讯音乐娱乐科技(深圳)有限公司 | Music data processing method and device and computer storage medium |
CN113747337A (en) * | 2021-09-03 | 2021-12-03 | 杭州网易云音乐科技有限公司 | Audio processing method, medium, device and computing equipment |
CN113821190A (en) * | 2021-11-25 | 2021-12-21 | 广州酷狗计算机科技有限公司 | Audio playing method, device, equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10848899B2 (en) * | 2016-10-13 | 2020-11-24 | Philip Scott Lyren | Binaural sound in visual entertainment media |
US11417347B2 (en) * | 2020-06-19 | 2022-08-16 | Apple Inc. | Binaural room impulse response for spatial audio reproduction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||