CN114598985A - Audio processing method and device

Info

Publication number: CN114598985A (granted as CN114598985B)
Application number: CN202210216580.2A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 黎镭
Applicant and assignee: Anker Innovations Co Ltd
Legal status: Active (granted)
Prior art keywords: audio, channel, target object, head, transfer function

Classifications

    • H04S (Stereophonic systems; under H04, Electric communication technique; Section H, Electricity)
    • H04S 1/00: Two-channel systems
    • H04S 7/302: Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/308: Electronic adaptation dependent on speaker or headphone connection
    • H04S 2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S 2420/01: Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTFs] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The application relates to an audio processing method and an audio processing apparatus in the field of audio processing. The audio processing method includes: acquiring two-channel audio data; inputting the two-channel audio data into a pre-trained music source separation model, and outputting a single-channel signal for each of a plurality of musical instruments corresponding to the two-channel audio data; determining the spatial position of each instrument; acquiring a head-related transfer function of a target object according to the personal characteristics of a user and the spatial position of each instrument; and determining the audio signals received by the two ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals. The application can convert the two-channel audio on wearable devices such as earphones and glasses into spatial audio, giving a better user experience.

Description

Audio processing method and device
Technical Field
The present application relates to the field of audio processing, and in particular, to an audio processing method and apparatus.
Background
With the development of science and technology, spatial audio has attracted much attention in recent years, and people pay increasing attention to the spatial and orientation information of sound. At present, an in-head effect easily occurs on wearable devices such as earphones and glasses, and the sound field and spatial orientation information perceived by the user are reduced; that is, the prior art cannot convert the ordinary two-channel stereo sound on wearable devices such as earphones and glasses into spatial audio, and the user experience is relatively poor.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, the present application provides an audio processing method and apparatus.
In a first aspect, the present application provides an audio processing method, including:
acquiring two-channel audio data;
inputting the two-channel audio data into a pre-trained music source separation model, and decomposing the two-channel audio data to obtain at least two single-channel signals;
acquiring the spatial position of each of the at least two single-channel signals;
acquiring a head-related transfer function of a target object;
determining the audio signals received by the two ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
In a second aspect, the present application provides an audio processing apparatus comprising:
the data acquisition module is used for acquiring the two-channel audio data;
the signal output module is used for inputting the two-channel audio data into a pre-trained music source separation model and decomposing the two-channel audio data to obtain at least two single-channel signals;
the position acquisition module is used for acquiring the spatial position of each single-channel signal in the at least two single-channel signals;
the function acquisition module is used for acquiring a head-related transfer function of the target object;
the audio processing module is used for determining the audio signals received by the two ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
In a third aspect, a computer device is provided, which includes a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the audio processing method as described in any one of the embodiments of the first aspect when executing the program stored in the memory.
In a fourth aspect, there is provided a wearable device comprising: a memory, a processor and an audio processing program stored on the memory and executable on the processor, the audio processing program, when executed by the processor, implementing the steps of the audio processing method as defined in any one of the embodiments of the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the audio processing method as defined in any one of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the method provided by the embodiment of the application, the two-channel audio data are acquired and input into the pre-trained music source separation model, the two-channel audio data are decomposed to obtain at least two single-channel signals, and the spatial position of each single-channel signal in the at least two single-channel signals is acquired, so that a foundation is laid for subsequently acquiring the spatial audio data. Through the head-related transfer function of the target object, and according to the head-related transfer function and the spatial positions of the at least two single-channel signals, conversion from the two-channel audio data to the spatial audio data can be achieved, the audio signals received by the two ears of the target object are determined, namely, the two-channel audio on wearable devices such as earphones and glasses can be changed into the spatial audio, and user experience is good.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of an audio processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart illustrating an embodiment of an audio processing method according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart illustrating an audio processing method according to another embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The following explains the terms referred to in this application:
Head-Related Transfer Function (HRTF): a function used in sound localization that describes the transmission of sound waves from a sound source to the two ears. It is the result of the comprehensive filtering of sound waves by human physiological structures (such as the head, pinna, and torso).
With the development of science and technology, spatial audio has attracted much attention in recent years, and people pay increasing attention to the spatial and orientation information of sound. At present, an in-head effect easily occurs on wearable devices such as earphones and glasses, and the sound field and spatial orientation information perceived by the user are reduced; that is, the prior art cannot convert the ordinary two-channel stereo sound on wearable devices such as earphones and glasses into spatial audio, and the user experience is relatively poor. In order to solve the above problem, an embodiment of the present application provides an audio processing method.
Fig. 1 is a schematic flowchart of an audio processing method according to an embodiment of the present application, and as shown in fig. 1, the audio processing method includes:
step 101, acquiring binaural audio data.
In this embodiment, the two-channel audio data is two-channel stereo sound. Specifically, it may be a complete piece of two-channel music data generated by a wearable device (such as earphones or glasses) in a music mode, or a segment of two-channel music data. In terms of quality, it may be high-quality or low-quality two-channel music data.
And 102, inputting the two-channel audio data into a pre-trained music source separation model, and decomposing the two-channel audio data to obtain at least two single-channel signals.
As an embodiment, after the binaural audio data is input to the pre-trained music source separation model, a plurality of single channel signals of a plurality of instruments corresponding to the binaural audio data may be output, where the instruments correspond to the single channel signals one to one.
In this embodiment, the music source separation model may be a deep learning model (including but not limited to Wave-U-Net, Conv-TasNet, MMDenseLSTM, etc.), and its training process is as follows:
acquiring sample data, wherein the sample data comprises: a plurality of binaural audio data and single-channel signals of a plurality of musical instruments corresponding to the binaural audio data;
and obtaining a music source separation model through deep learning model training by utilizing the sample data.
As an example, a first data set may be obtained over a network, the first data set including a plurality of two-channel stereo audio data together with the single-channel signals of the plurality of musical instruments corresponding to each. A music source separation model is obtained from this data through deep learning model training. In use, the two-channel audio data are input into the trained music source separation model, and the single-channel signals of the plurality of musical instruments corresponding to the two-channel audio data are output.
Wherein the plurality of musical instruments includes: electronic organ, guitar, vocal, drum, bass, etc.
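The separation step above can be sketched as an input/output contract. The `MusicSourceSeparator` class below is a hypothetical stand-in for a trained Wave-U-Net/Conv-TasNet/MMDenseLSTM network: it only downmixes the stereo input so the shapes and keys can be checked, whereas a real model would predict a distinct stem per instrument.

```python
INSTRUMENTS = ["organ", "guitar", "vocals", "drums", "bass"]

class MusicSourceSeparator:
    """Hypothetical stand-in for a trained source-separation network.

    It simply downmixes the stereo input and copies it once per
    instrument, so only the interface (stereo in, one mono stem per
    instrument out) is demonstrated."""

    def separate(self, left, right):
        mono = [(l + r) / 2.0 for l, r in zip(left, right)]
        return {name: list(mono) for name in INSTRUMENTS}

left = [0.1, 0.2, -0.1, 0.0]
right = [0.0, 0.1, 0.1, 0.2]
stems = MusicSourceSeparator().separate(left, right)  # instrument -> mono samples
```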
And 103, acquiring the spatial position of each single-channel signal in the at least two single-channel signals.
Based on the embodiment in step 102, it should be understood that, since the instruments are in one-to-one correspondence with the single-channel signals, the spatial positions of the instruments are the spatial positions of the single-channel signals.
As an embodiment of the method for acquiring the spatial positions of the plurality of musical instruments, a second data set may be acquired in advance over a network, the second data set including scene information (such as a concert hall or a concert) and the placement position of each musical instrument corresponding to that scene information. For example, in the concert scene, with the user at the center of a sphere, the placement positions of the instruments relative to the user are: electronic organ (front left), guitar (center), vocals (center), drums (center rear), and bass (front right). On this basis, a preset neural network model is trained with the scene information and the corresponding instrument placement positions, generating a trained instrument placement model. In application, there are two methods. First, the scene information of the scene where the user is located (or preset scene information) is obtained and input into the trained instrument placement model, which outputs the placement positions of the instruments. Second, the user may customize the spatial placement positions of the instruments according to personal preference or experience; the instrument placement model is then trained according to the placement positions preset by the user, and the placement positions of the instruments are finally selected according to the user.
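The concert-scene layout described above can be captured as a simple lookup table; in practice this table would be replaced by the output of the trained instrument placement model or by the user's custom layout. The azimuth values below are illustrative assumptions, not values from the patent.

```python
# Azimuth in degrees: listener at the centre, 0 = straight ahead,
# negative = left, 180 = behind. Values are illustrative assumptions.
SCENE_LAYOUTS = {
    "concert": {
        "organ":  -45.0,   # front left
        "guitar":   0.0,   # centre
        "vocals":   0.0,   # centre
        "drums":  180.0,   # centre rear
        "bass":    45.0,   # front right
    },
}

def instrument_position(scene, instrument):
    """Return the placement azimuth of an instrument in a scene."""
    return SCENE_LAYOUTS[scene][instrument]

pos = instrument_position("concert", "organ")
```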
And 104, acquiring a head related transfer function of the target object.
In this embodiment, the target object may be a user.
The method for acquiring the head related transfer function of the target object specifically comprises the following steps: based on field measurement data or based on head-related transfer function matching of existing objects.
For example, the head-related transfer function of the target object may be obtained by one of the following methods:
firstly, matching HRTFs of a certain person's head by using public data.
For example, a user may make a subjective judgment by comparing data already published on the network and match the HRTF of the head that the user considers most suitable; in this case the head-related transfer function is determined according to the data directly selected by the user.
Secondly, clustering the head HRTFs of a plurality of persons in a database, and then, using the personalized auricle characteristics of the user, selecting the best-matched class of HRTFs from the public database as the final function.
Specifically, the pre-stored plurality of head related transfer functions may be clustered, and according to the auricle characteristics of the ears of the target object, a head related transfer function matched with the target object is screened from the clustered plurality of head related transfer functions.
And thirdly, directly measuring the HRTF of the user in the anechoic chamber. That is, the HRTF of the user is measured from the personal characteristics (height, pinna information, etc.) of the user and the placement space position of the musical instrument in the anechoic chamber.
And fourthly, directly measuring the HRTF of the user in the concert hall. That is, the HRTF of the user is measured according to the personal characteristics (height, pinna information, etc.) of the user and the spatial positions of the instruments placed in the concert hall.
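The second matching method (clustered database HRTFs selected by pinna features) can be sketched as a nearest-centroid lookup. The feature vectors and cluster names below are invented for illustration; a real system would compare measured pinna characteristics against a public HRTF database.

```python
def nearest_hrtf_cluster(user_features, cluster_centroids):
    """Pick the HRTF cluster whose pinna-feature centroid is closest
    (squared Euclidean distance) to the user's measured features."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(cluster_centroids,
               key=lambda cid: dist2(user_features, cluster_centroids[cid]))

# Invented centroids, e.g. (pinna height cm, concha depth cm).
centroids = {"small_pinna": [5.1, 2.0], "large_pinna": [6.8, 3.1]}
best = nearest_hrtf_cluster([6.5, 3.0], centroids)
```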
Step 105, determining the audio signals received by ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
As an embodiment, in order to accurately obtain the audio signals received by the two ears of the target object, determining the audio signals received by the two ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals includes:
performing convolution processing on the ith single-channel signal on a time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, wherein i is a positive integer greater than or equal to 1;
and obtaining the audio signals which are received by the ears of the target object according to the plurality of processed audio signals.
For example, if a piece of two-channel music is split into the signals of five instruments (electronic organ, guitar, vocals, drums, and bass), the five instruments are placed spatially and correspond to five directions relative to the user's left and right ears. The single-channel signal of each of the five instruments is convolved with the head-related transfer function for the left ear (equivalently, multiplied in the frequency domain), and the five results are added to obtain the audio signal received by the left ear of the target object. In the same way, the single-channel signals of the five instruments are convolved with the head-related transfer function for the right ear and the results added to obtain the audio signal received by the right ear. The audio signals received by the two ears of the target object are thus obtained.
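The per-ear rendering described above can be sketched in the time domain: each stem is convolved with a head-related impulse response (the time-domain form of the HRTF) chosen by the stem's direction, and the per-instrument results are summed. The two-tap HRIRs below are toy values, not measured responses.

```python
def convolve(signal, kernel):
    """Direct-form time-domain convolution."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def render_ear(stems, hrirs):
    """Convolve each instrument stem with that instrument's HRIR for one
    ear (selected by the stem's spatial position) and sum the results."""
    rendered = [convolve(stems[name], hrirs[name]) for name in stems]
    n = max(len(r) for r in rendered)
    return [sum(r[i] for r in rendered if i < len(r)) for i in range(n)]

# Toy stems and short left-ear HRIRs for two instruments.
stems = {"guitar": [1.0, 0.0], "bass": [0.0, 1.0]}
left_hrirs = {"guitar": [0.5, 0.25], "bass": [0.9]}
left_ear = render_ear(stems, left_hrirs)
```

The right-ear signal is produced the same way with the right-ear HRIRs, giving the binaural pair.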
Further, considering that different musical instruments are located at different spatial positions, in order to ensure the quality of the audio signals received by the user's ears, obtaining the audio signals received by the two ears of the target object according to the plurality of processed audio signals includes:
and adding the plurality of processed audio signals received by the left ear according to a preset proportion according to the spatial position of each single-channel signal, and adding the plurality of processed audio signals received by the right ear according to the preset proportion to obtain the audio signals received by the two ears of the target object.
In one embodiment, the predetermined scale may be determined according to how far the instrument is from the user's ears, for example, when the instrument is farther from the user's left ear, the scale is set smaller, and when the instrument is closer to the user's left ear, the scale is set larger.
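One simple way to realize the distance-dependent proportion described above is inverse-distance weighting normalised to sum to one; the patent leaves the exact proportions open, so this choice is an assumption for the sketch.

```python
def mixing_weights(distances):
    """Inverse-distance weights, normalised to sum to one: the nearer an
    instrument is to the ear, the larger its mixing proportion."""
    inv = {name: 1.0 / d for name, d in distances.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}

# Guitar 1 m from the left ear, drums 4 m away (illustrative distances).
w = mixing_weights({"guitar": 1.0, "drums": 4.0})
```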
In order to make the immersive sound effect of the finally obtained audio signal more realistic, after the head-related transfer function of the target object is acquired, as shown in fig. 2, the audio processing method further includes:
step 201, obtaining angle information of each single-channel signal relative to the head of a target object;
and 202, carrying out fusion processing on the angle information of each single-channel signal relative to the head of the target object and the head-related transfer function.
It should be understood that, for smart wearable devices such as earphones and glasses, the motion of the human head can be tracked with a built-in gyroscope to obtain the angle information of each single-channel signal relative to the head of the target object, i.e., the angle information of each musical instrument relative to the head of the target object. The angle information of each single-channel signal relative to the head of the target object is then fused with the head-related transfer function, achieving a real-time mapping effect: even if the user's head moves, the sound image is kept at a fixed position.
Specifically, the angle information of each single-channel signal relative to the head of the target object is expressed by the Euler angles (yaw, pitch, and roll) used in gyroscope technology. The Euler angles can be obtained by quaternion conversion, and can be fused with a dynamic HRTF in the geodetic coordinate system.
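The quaternion-to-Euler conversion mentioned above is standard; a common ZYX (yaw, pitch, roll) form is sketched below.

```python
import math

def quaternion_to_euler(w, x, y, z):
    """Convert a unit quaternion to (yaw, pitch, roll) in radians,
    ZYX convention, clamping pitch at +/-90 degrees."""
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    sinp = 2.0 * (w * y - z * x)
    pitch = math.copysign(math.pi / 2, sinp) if abs(sinp) >= 1.0 else math.asin(sinp)
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return yaw, pitch, roll

yaw, pitch, roll = quaternion_to_euler(1.0, 0.0, 0.0, 0.0)  # identity: no rotation
```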
Fig. 3 is a flowchart of another embodiment of an audio processing method according to an embodiment of the present application, where in order to increase a spatial sense of an audio signal received by two ears of a target object, as shown in fig. 3, after obtaining the audio signal received by two ears of the target object, the audio processing method further includes:
step 301 performs reverberation processing on the audio signal received by both ears of the target object.
It should be understood that reverberation is added mainly to further increase the sense of space, and the reverberation of various scenes, such as a concert hall, a concert, or a recording studio, may be added. For example, music played in a concert hall sounds good in a way that is closely related to the hall's architectural acoustics.
As an embodiment, the audio signals received by the two ears of the target object are taken as the audio to be processed. A wearable device with poor processing performance needs a longer processing time when adding reverberation to the audio to be processed, and such devices can be regarded as not supporting the addition of reverberation. Accordingly, whether the current wearable device supports adding reverberation to the audio to be processed is determined from the performance parameters of the current wearable device according to preset processing logic or a preset performance parameter table. The performance parameter table may store the correspondence between device performance parameters and whether adding reverberation is supported.
In this embodiment, determining whether the current wearable device supports adding reverberation to the audio to be processed may also include: acquiring a device information set; and determining whether the current equipment supports adding reverberation to the audio to be processed according to the equipment information set.
The device information may be any information that can identify the wearable device. As an example, the device information may be a device model, a device name, and the like. As an example, the device information in the device information set may be device information of a wearable device that does not support adding reverberation. At this time, it may be determined whether the device information of the currently worn device is in the above-described device information set. If so, it may be determined that the current device does not support adding reverberation to the audio to be processed. Otherwise, it may be determined that the current wearable device supports adding reverberation to the audio to be processed. It is to be understood that, in practice, the device information in the device information set may also be the device information of the wearable device that supports adding reverberation.
And if the current equipment supports adding reverberation to the audio to be processed, determining a reverberation algorithm according to the reverberation category selected by the user.
In specific implementation, the reverberation algorithm can be divided into different reverberation categories according to different simulated environmental effects. For example, the reverberation category may be a hall effect, a studio effect, a valley effect, and the like. For various reverberation categories, category information (e.g., name, picture, etc.) for each category may be presented to the user. Wherein each category information is associated with a reverberation category indicated by the category information. So that the user can perform some operation (e.g., a click operation) on the category information to select the reverberation category.
In this embodiment, determining a reverberation algorithm according to a reverberation category selected by a user includes: carrying out audio characteristic analysis on the audio to be processed to obtain audio characteristic data; and determining a reverberation algorithm according to the audio characteristic data and the reverberation category selected by the user.
As an example, the audio to be processed may be analyzed by an existing audio analysis application or an open-source toolkit. The audio characteristics include the frequency, bandwidth, amplitude, etc. of the audio. On the basis of the audio characteristic data and the reverberation category selected by the user, a reverberation algorithm (such as a 3D surround algorithm) can be determined.
Specifically, a reverberation algorithm matching the obtained audio characteristic data and the reverberation category selected by the user may be queried in a pre-established correspondence table between the audio characteristic data and the reverberation category and the reverberation algorithm. The audio to be processed may then be input to at least one filter set according to the determined reverberation algorithm to obtain processed audio. The number of types of filters corresponding to each reverberation algorithm may be preset. Each reverberation algorithm can be obtained by combining at least one filter. For example, a combination of a comb filter and an all-pass filter may be selected. The filter may be a hardware module in the current wearable device or a software module in the current wearable device according to implementation requirements.
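The comb plus all-pass combination named above is the classic Schroeder reverberator structure; a minimal one-comb, one-all-pass sketch follows. The delay lengths and gains are illustrative, not tuned for any particular reverberation category.

```python
def comb_filter(x, delay, gain):
    """Feedback comb filter: y[n] = x[n] + gain * y[n - delay]."""
    y = list(x)
    for n in range(delay, len(y)):
        y[n] += gain * y[n - delay]
    return y

def allpass_filter(x, delay, gain):
    """Schroeder all-pass: y[n] = -g*x[n] + x[n-d] + g*y[n-d]."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -gain * x[n] + xd + gain * yd
    return y

def reverberate(x):
    """One comb followed by one all-pass; real designs chain several
    parallel combs with longer, mutually prime delays."""
    return allpass_filter(comb_filter(x, delay=3, gain=0.5), delay=2, gain=0.5)

out = reverberate([1.0] + [0.0] * 7)  # impulse response of the toy reverb
```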
As can be seen from the above, in the audio processing method provided in the embodiments of the present application, two-channel audio data are acquired and input into a pre-trained music source separation model, which outputs the single-channel signals of the plurality of musical instruments corresponding to the two-channel audio data; the spatial position of each single-channel signal is then determined. According to the acquired head-related transfer function of the target object and the spatial positions of the single-channel signals, the two-channel audio data can be converted into spatial audio data and the audio signals received by the two ears of the target object can be determined. That is, two-channel audio on wearable devices such as earphones and glasses can be changed into spatial audio, and the user experience is better.
Based on the same inventive concept, the embodiment of the present application further provides an audio processing apparatus, such as the following embodiments. Since the principle of the audio processing apparatus for solving the problem is similar to the audio processing method, the implementation of the audio processing apparatus can refer to the implementation of the audio processing method, and repeated descriptions are omitted. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application, and as shown in fig. 4, the audio processing apparatus includes:
a data obtaining module 401, configured to obtain the binaural audio data.
A signal output module 402, configured to input the binaural audio data into a pre-trained music source separation model, and decompose the binaural audio data to obtain at least two single-channel signals.
A position obtaining module 403, configured to obtain a spatial position of each of the at least two single-channel signals.
A function obtaining module 404, configured to obtain a head-related transfer function of the target object.
An audio processing module 405, configured to determine the audio signals received by the two ears of the target object according to the head-related transfer function and the spatial positions of the at least two single-channel signals.
Optionally, the audio processing module 405 is further configured to:
and performing convolution processing on the ith single-channel signal in a time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain the ith processed audio signal, wherein i is a positive integer greater than or equal to 1.
And obtaining the audio signals which are received by the ears of the target object according to the plurality of processed audio signals.
Optionally, the audio processing module 405 is further configured to:
and adding the plurality of processed audio signals received by the left ear according to a preset proportion according to the spatial position of each single-channel signal, and adding the plurality of processed audio signals received by the right ear according to the preset proportion to obtain the audio signals received by the two ears of the target object.
Optionally, the audio processing apparatus further includes:
and the reverberation processing module is used for performing reverberation processing on the audio signals received by the two ears of the target object.
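A minimal sketch of the reverberation step, assuming a simple wet/dry convolution with a room impulse response. The patent does not specify the reverberation algorithm, so `add_reverb`, the mono `rir`, and the `wet` ratio are illustrative assumptions:

```python
import numpy as np

def add_reverb(left, right, rir, wet=0.3):
    """Apply convolution reverb to a binaural pair.

    `rir` is a mono room impulse response applied identically to both
    ears; `wet` blends the reverberant signal with the dry one.
    """
    wl = np.convolve(left, rir)
    wr = np.convolve(right, rir)
    n = len(wl)
    # Zero-pad the dry signals to the length of the wet ones.
    dry_l = np.pad(left, (0, n - len(left)))
    dry_r = np.pad(right, (0, n - len(right)))
    return (1 - wet) * dry_l + wet * wl, (1 - wet) * dry_r + wet * wr
```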
Optionally, the audio processing apparatus further includes:
the angle information acquisition module is used for acquiring the angle information of each single-channel signal relative to the head of the target object;
and the fusion processing module is used for fusing the angle information of each single-channel signal relative to the head of the target object with the head-related transfer function.
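One way to fuse a source's angle information with the head-related transfer function is to look up the measured HRIR pair closest to that angle. The dictionary layout of `hrtf_set` and the nearest-azimuth rule are assumptions for illustration; real HRTF sets are sampled on an azimuth/elevation grid and often interpolated:

```python
import numpy as np

def pick_hrir(azimuth_deg, hrtf_set):
    """Select the HRIR pair measured nearest to a source's azimuth.

    `hrtf_set` maps measured azimuths in degrees to
    (hrir_left, hrir_right) pairs (hypothetical structure).
    """
    angles = np.array(sorted(hrtf_set))
    # Wrap-around distance on the circle, so 359 degrees matches 0.
    diff = np.abs(angles - azimuth_deg % 360)
    dist = np.minimum(diff, 360 - diff)
    return hrtf_set[angles[np.argmin(dist)]]
```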
In the embodiment of the application, the head-related transfer function of the target object is obtained from field measurement data, or by matching against the head-related transfer function of an existing object.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention, where the computer device 500 shown in fig. 5 includes: at least one processor 501, memory 502, at least one network interface 504, and other user interfaces 503. The various components in the computer device 500 are coupled together by a bus system 505. It is understood that the bus system 505 is used to enable connection communications between these components. The bus system 505 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 505 in FIG. 5.
The user interface 503 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen).
It is to be understood that the memory 502 in embodiments of the present invention may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 502 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 502 stores elements, executable units or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system 5021 and application programs 5022.
The operating system 5021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application 5022 includes various applications, such as a Media Player (Media Player), a Browser (Browser), and the like, for implementing various application services. The program for implementing the method according to the embodiment of the present invention may be included in the application program 5022.
In the embodiment of the present invention, by calling a program or an instruction stored in the memory 502, specifically, a program or an instruction stored in the application 5022, the processor 501 is configured to execute the method steps provided by the method embodiments, for example, including:
acquiring two-channel audio data;
inputting the two-channel audio data into a pre-trained music source separation model, and decomposing the two-channel audio data to obtain at least two single-channel signals;
acquiring the spatial position of each single-channel signal in the at least two single-channel signals;
acquiring a head-related transfer function of a target object;
determining the audio signals received by the two ears of the target object based on the head-related transfer function and the spatial positions of the at least two single-channel signals.
In an alternative embodiment, the determining the audio signals received by the two ears of the target object based on the head-related transfer function and the spatial positions of the at least two single-channel signals includes:
performing convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, wherein i is a positive integer greater than or equal to 1;
and obtaining the audio signals received by the two ears of the target object according to the plurality of processed audio signals.
In an alternative embodiment, the obtaining the audio signals received by the two ears of the target object according to the plurality of processed audio signals includes:
and adding the plurality of processed audio signals received by the left ear according to a preset proportion based on the spatial position of each single-channel signal, and adding the plurality of processed audio signals received by the right ear according to the preset proportion, to obtain the audio signals received by the two ears of the target object.
In an alternative embodiment, after obtaining the audio signals received by the two ears of the target object, the method further includes:
and performing reverberation processing on the audio signals received by the two ears of the target object.
In an alternative embodiment, after obtaining the audio signals received by the two ears of the target object, the method further includes:
acquiring angle information of the two ears of the target object;
and fusing the angle information of the two ears of the target object with the plurality of processed audio signals received by the two ears of the target object, to obtain fused audio signals.
In an optional embodiment, the obtaining a head-related transfer function of a target object specifically includes:
obtaining the head-related transfer function based on field measurement data, or by matching against the head-related transfer functions of existing objects.
The method disclosed in the above-mentioned embodiments of the present invention may be applied to the processor 501, or implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 501. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software units in the decoding processor. The software units may be located in RAM, flash memory, ROM, PROM, EEPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 502, and the processor 501 reads the information in the memory 502 and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The computer device provided in this embodiment may be the computer device shown in fig. 5, and can execute all the steps of the audio processing method shown in figs. 1 to 3, thereby achieving the technical effects of that method; for details, reference is made to the description of figs. 1 to 3, which is not repeated here for brevity.
The embodiment of the invention also provides a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; or it may comprise a combination of the above kinds of memory.
The one or more programs in the storage medium are executable by one or more processors to implement the audio processing method described above.
The processor is configured to execute the audio processing program stored in the memory, so as to implement the following steps of the audio processing method:
acquiring two-channel audio data;
inputting the two-channel audio data into a pre-trained music source separation model, and decomposing the two-channel audio data to obtain at least two single-channel signals;
acquiring the spatial position of each of the at least two single-channel signals;
acquiring a head-related transfer function of a target object;
determining the audio signals received by the two ears of the target object based on the head-related transfer function and the spatial positions of the at least two single-channel signals.
In an alternative embodiment, the determining the audio signals received by the two ears of the target object based on the head-related transfer function and the spatial positions of the at least two single-channel signals includes:
performing convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, wherein i is a positive integer greater than or equal to 1;
and obtaining the audio signals received by the two ears of the target object according to the plurality of processed audio signals.
In an alternative embodiment, the obtaining the audio signals received by the two ears of the target object according to the plurality of processed audio signals includes:
and adding the plurality of processed audio signals received by the left ear according to a preset proportion based on the spatial position of each single-channel signal, and adding the plurality of processed audio signals received by the right ear according to the preset proportion, to obtain the audio signals received by the two ears of the target object.
In an alternative embodiment, after obtaining the audio signals received by the two ears of the target object, the method further includes:
and performing reverberation processing on the audio signals received by the two ears of the target object.
In an alternative embodiment, after obtaining the audio signals received by the two ears of the target object, the method further includes:
acquiring angle information of the two ears of the target object;
and fusing the angle information of the two ears of the target object with the plurality of processed audio signals received by the two ears of the target object, to obtain fused audio signals.
In an optional embodiment, the obtaining a head-related transfer function of a target object specifically includes:
obtaining the head-related transfer function based on field measurement data, or by matching against the head-related transfer functions of existing objects.
In summary, by acquiring the two-channel audio data, inputting the two-channel audio data into a pre-trained music source separation model to decompose it into at least two single-channel signals, and acquiring the spatial position of each of the at least two single-channel signals, the application lays the foundation for subsequently obtaining spatial audio data. With the head-related transfer function of the target object, and according to that function and the spatial positions of the at least two single-channel signals, the conversion from two-channel audio data to spatial audio data can be achieved and the audio signals received by the two ears of the target object can be determined. That is, the two-channel audio on wearable devices such as earphones and glasses can be turned into spatial audio, giving a good user experience. When applied to a wearable device (such as earphones or glasses), the application can, in the music playing mode of the wearable device, turn ordinary stereo music into concert-hall-level spatial audio.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. An audio processing method, comprising:
acquiring two-channel audio data;
inputting the two-channel audio data into a pre-trained music source separation model, and decomposing the two-channel audio data to obtain at least two single-channel signals;
acquiring the spatial position of each of the at least two single-channel signals;
acquiring a head-related transfer function of a target object;
determining the audio signals received by the two ears of the target object based on the head-related transfer function and the spatial positions of the at least two single-channel signals.
2. The method of claim 1, wherein the determining the audio signals received by the two ears of the target object based on the head-related transfer function and the spatial positions of the at least two single-channel signals comprises:
performing convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, wherein i is a positive integer greater than or equal to 1;
and obtaining the audio signals received by the two ears of the target object according to the plurality of processed audio signals.
3. The method of claim 2, wherein the obtaining the audio signals received by the two ears of the target object according to the plurality of processed audio signals comprises:
and adding the plurality of processed audio signals received by the left ear according to a preset proportion based on the spatial position of each single-channel signal, and adding the plurality of processed audio signals received by the right ear according to the preset proportion, to obtain the audio signals received by the two ears of the target object.
4. The method according to claim 2 or 3, wherein after obtaining the audio signals received by the two ears of the target object, the method further comprises:
and performing reverberation processing on the audio signals received by the two ears of the target object.
5. The method of claim 1, wherein after obtaining the head-related transfer function of the target object, the method further comprises:
acquiring angle information of each single-channel signal relative to the head of the target object;
and carrying out fusion processing on the angle information of each single-channel signal relative to the head of the target object and the head-related transfer function.
6. The method according to claim 1, wherein the obtaining a head-related transfer function of the target object specifically comprises:
obtaining the head-related transfer function based on field measurement data, or by matching against the head-related transfer functions of existing objects.
7. An audio processing apparatus, comprising:
the data acquisition module is used for acquiring the two-channel audio data;
the signal output module is used for inputting the two-channel audio data into a pre-trained music source separation model and decomposing the two-channel audio data to obtain at least two single-channel signals;
the position acquisition module is used for acquiring the spatial position of each single-channel signal in the at least two single-channel signals;
the function acquisition module is used for acquiring a head-related transfer function of the target object;
and the audio processing module is used for determining the audio signals received by the two ears of the target object based on the head-related transfer function and the spatial positions of the at least two single-channel signals.
8. The apparatus of claim 7, wherein the audio processing module is further configured to:
performing convolution processing on the ith single-channel signal in the time domain according to the head-related transfer function and the spatial position of the ith single-channel signal to obtain an ith processed audio signal, wherein i is a positive integer greater than or equal to 1;
and obtaining the audio signals received by the two ears of the target object according to the plurality of processed audio signals.
9. A computer device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the audio processing method of any of claims 1-6 when executing a program stored in the memory.
10. A wearable device, characterized in that the wearable device comprises: memory, a processor and an audio processing program stored on the memory and executable on the processor, the audio processing program, when executed by the processor, implementing the steps of the audio processing method according to any one of claims 1 to 6.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the audio processing method according to any one of claims 1 to 6.
CN202210216580.2A 2022-03-07 2022-03-07 Audio processing method and device Active CN114598985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210216580.2A CN114598985B (en) 2022-03-07 2022-03-07 Audio processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210216580.2A CN114598985B (en) 2022-03-07 2022-03-07 Audio processing method and device

Publications (2)

Publication Number Publication Date
CN114598985A true CN114598985A (en) 2022-06-07
CN114598985B CN114598985B (en) 2024-05-03

Family

ID=81815840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210216580.2A Active CN114598985B (en) 2022-03-07 2022-03-07 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN114598985B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007080224A1 (en) * 2006-01-09 2007-07-19 Nokia Corporation Decoding of binaural audio signals
WO2017135063A1 (en) * 2016-02-04 2017-08-10 ソニー株式会社 Audio processing device, audio processing method and program
US20180109900A1 (en) * 2016-10-13 2018-04-19 Philip Scott Lyren Binaural Sound in Visual Entertainment Media
CN112037738A (en) * 2020-08-31 2020-12-04 腾讯音乐娱乐科技(深圳)有限公司 Music data processing method and device and computer storage medium
CN113747337A (en) * 2021-09-03 2021-12-03 杭州网易云音乐科技有限公司 Audio processing method, medium, device and computing equipment
CN113821190A (en) * 2021-11-25 2021-12-21 广州酷狗计算机科技有限公司 Audio playing method, device, equipment and storage medium
US20210398545A1 (en) * 2020-06-19 2021-12-23 Apple Inc. Binaural room impulse response for spatial audio reproduction


Also Published As

Publication number Publication date
CN114598985B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN106797525B (en) For generating and the method and apparatus of playing back audio signal
US10645518B2 (en) Distributed audio capture and mixing
CN105027580B (en) Method for outputting a modified audio signal
JP6665379B2 (en) Hearing support system and hearing support device
US9131305B2 (en) Configurable three-dimensional sound system
JP4921470B2 (en) Method and apparatus for generating and processing parameters representing head related transfer functions
Hohmann et al. The virtual reality lab: Realization and application of virtual sound environments
CN109891503B (en) Acoustic scene playback method and device
US10924875B2 (en) Augmented reality platform for navigable, immersive audio experience
CN111294724B (en) Spatial repositioning of multiple audio streams
US20230239642A1 (en) Three-dimensional audio systems
Bujacz et al. Sound of Vision-Spatial audio output and sonification approaches
WO2023109278A1 (en) Accompaniment generation method, device, and storage medium
CN113784274A (en) Three-dimensional audio system
CN113347552B (en) Audio signal processing method and device and computer readable storage medium
Geronazzo et al. Superhuman hearing-virtual prototyping of artificial hearing: a case study on interactions and acoustic beamforming
CN114598985B (en) Audio processing method and device
Vennerød Binaural reproduction of higher order ambisonics-a real-time implementation and perceptual improvements
CA3044260A1 (en) Augmented reality platform for navigable, immersive audio experience
Guthrie Stage acoustics for musicians: A multidimensional approach using 3D ambisonic technology
CN113347551B (en) Method and device for processing single-sound-channel audio signal and readable storage medium
CN114173256A (en) Method, device and equipment for restoring sound field space and tracking posture
Barrett Spatial music composition
CN112954548B (en) Method and device for combining sound collected by terminal microphone and headset
WO2023173285A1 (en) Audio processing method and apparatus, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant