CN115460515A - Immersive audio generation method and system

Immersive audio generation method and system

Info

Publication number
CN115460515A
Authority
CN
China
Prior art keywords
audio
immersive
units
position information
audio data
Prior art date
Legal status
Pending
Application number
CN202210919846.XA
Other languages
Chinese (zh)
Inventor
马士超
Current Assignee
LEONIS (BEIJING) INFORMATION TECHNOLOGY CO LTD
Original Assignee
LEONIS (BEIJING) INFORMATION TECHNOLOGY CO LTD
Priority date
Filing date
Publication date
Application filed by LEONIS (BEIJING) INFORMATION TECHNOLOGY CO LTD
Priority to CN202210919846.XA
Publication of CN115460515A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00: Circuits for transducers, loudspeakers or microphones
    • H04R3/12: Circuits for transducers, loudspeakers or microphones for distributing signals to two or more loudspeakers
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00: Signal processing covered by H04R, not provided for in its groups


Abstract

The present disclosure relates to audio and video technologies, and in particular to an immersive audio generation method and system. The method comprises: receiving immersive audio data for different audio objects through different channels, wherein the immersive audio data comprise the position information of the audio objects and the audio data of the audio objects; calculating the plurality of audio playing units corresponding to an audio object according to the position information of the audio object carried in the immersive audio data and the position information of a plurality of audio playing units; and sending the audio data of the audio object to the corresponding audio playing units in the spatial field. With the immersive audio generation system of the embodiments herein, the user's sense of the spatial position of an audio object during audio reproduction can be made more realistic.

Description

Immersive audio generation method and system
Technical Field
The present disclosure relates to audio and video technologies, and in particular, to an immersive audio generation method and system.
Background
The current development of immersive audio focuses mainly on ear-mounted devices, most of which simulate a sense of space with two channels and therefore lack a sufficiently immersive experience.
Chinese patent application CN104919821A discloses a playback arrangement using directivity, such as special directional speakers, room-wall reflections, and room device arrangements; it calculates the position corresponding to an audio object by vector calculation, forms a beam, and plays the audio. The problems are that fitting out the audio playing space requires many professional devices, the cost is high, and the approach is ill-suited to home theater scenarios.
How to achieve immersive audio playing in the home theater application scenario is therefore a problem in the prior art that urgently needs to be solved.
Disclosure of Invention
To solve the problems in the prior art, embodiments herein provide an immersive audio generation method and system, in which the position information of an audio object carried in the immersive audio data is used to restore the accurate position of the audio object within a spatial field formed by the audio playing units, thereby implementing an immersive audio playing experience.
An immersive audio generation method for a master device, comprising,
receiving immersive audio data for different audio objects through different channels, wherein the immersive audio data includes position information of the audio objects and audio data of the audio objects;
calculating a plurality of audio playing units corresponding to the audio objects according to the position information of the audio objects and the position information of the plurality of audio playing units in the immersive audio data;
and sending the audio data of the audio object to a plurality of corresponding audio playing units in the spatial field.
According to one aspect of embodiments herein, transmitting the audio data of the audio object to the corresponding plurality of audio playback units in the spatial field further comprises,
measuring a clock deviation from each audio playback unit;
and sending the audio data of the audio object to a plurality of corresponding audio playing units in the spatial field according to the clock deviation.
According to one aspect of embodiments herein, measuring the clock offset from each audio playback unit further comprises,
a master clock of the master control device sends a synchronous message to slave clocks of all audio playing units, and the corresponding audio playing units record the sending and receiving time of the synchronous message at all ends;
receiving a request message fed back from a clock of each audio playing unit, and recording the sending and receiving time of the request message of each end by the corresponding audio playing unit;
the master clock of the master control device sends response messages to the slave clocks of all the audio playing units, and the corresponding audio playing units record the sending and receiving time of the response messages of all the ends;
and the corresponding audio playing units calculate the clock deviation of each audio playing unit according to the sending and receiving time of the synchronous message, the sending and receiving time of the request message and the sending and receiving time of the response message, so that the main control device can obtain the clock deviation and send the audio data of the audio object to a plurality of corresponding audio playing units in the spatial field.
According to an aspect of embodiments herein, transmitting the audio data of the audio object to the corresponding plurality of audio playback units in the spatial field further comprises,
dividing the audio data of the audio object according to a frequency threshold value;
outputting the audio data of the audio objects lower than the frequency threshold value to a background audio playing unit in a spatial field for rendering;
and outputting the audio data of the audio objects which are higher than the frequency threshold value to a plurality of corresponding audio playing units in the spatial field for rendering.
According to one aspect of embodiments herein, calculating the plurality of audio playback units corresponding to the audio object from the position information of the audio object in the immersive audio data and the position information of the plurality of audio playback units further comprises,
according to the coordinates in the position information of the audio playing units, taking a user as a sphere center, and making an inscribed curved surface space in the space surrounded by the audio playing units;
converting the position information of the audio object into coordinates in the inscribed surface space;
determining a plurality of audio playback units that are proximate to the coordinates of the audio object as a plurality of audio playback units corresponding to the audio object.
According to one aspect of embodiments herein, calculating the plurality of audio playback units corresponding to the audio object from the position information of the audio object in the immersive audio data and the position information of the plurality of audio playback units further comprises,
and inputting the position information of the audio object and the position information of the audio playing units in the immersive audio data into a fine tuning model, and calculating the sound pressure fine tuning amount of the audio playing units corresponding to the audio object.
According to an aspect of embodiments herein, inputting the position information of the audio object and the position information of the plurality of audio playing units in the immersive audio data into a fine tuning model, calculating the sound pressure fine tuning amount of the plurality of audio playing units corresponding to the audio object further comprises,
the fine tuning model is trained by inputting, as the environment, the position information of the audio object in the immersive audio data, the position information of the plurality of audio playback units, and the evaluation score.
According to an aspect of embodiments herein, the fine tuning model may be formed of a Q-value network model, the method further comprising,
adding an attention extraction layer between the convolution layers of the Q-value network model at the current moment, to extract features of interest from the features obtained by the previous convolution layer;
multiplying the features by the features of interest, then obtaining the sound pressure fine adjustment amount at the current moment through a convolution layer;
taking the sound pressure fine adjustment amount at the current moment and the environment input at the next moment as the input of the Q-value network model at the next moment, to obtain the sound pressure fine adjustment amount at the next moment;
and iterating the above steps in time sequence, stopping the iteration via the error function of the fine tuning model, to obtain the fine tuning model's sound pressure fine adjustment amounts for the plurality of audio playing units corresponding to the audio object.
According to one aspect of embodiments herein, the attention-extracting layer further comprises,
the global pooling layer reduces the dimension of the features obtained by the previous convolution layer, and the softmax layer ranks the dimension-reduced features to form the features of interest.
According to one aspect of embodiments herein, the error function is:
Loss(ε) = E(Q_target − Q_present(s, a, ε))²
wherein Q_target is the sound pressure fine adjustment amount output by the target Q-value network, Q_present is the sound pressure fine adjustment amount output by the Q-value network at the current moment, s is the environment input together with the sound pressure fine adjustment amount at the previous moment, a is the sound pressure fine adjustment amount at the current moment, and ε denotes the parameters of the network training;
said Q_target is calculated as follows: Q_target = Reward(ε) + γ × Q_max(s′, a′, ε);
wherein Reward(ε) is the reward function and γ is a hyperparameter; Q_max is the maximum sound pressure fine adjustment amount output by Q_present over the time series; s′ is the environment input at each moment together with the sound pressure fine adjustment amount at the previous moment; a′ is the sound pressure fine adjustment amount at each moment; ε denotes the parameters of the network training; and the reward function is Reward(ε) = e^(−S_ear/S_mic), where S_ear is the evaluation score of subjective human auditory perception and S_mic is the evaluation score of the audio data collected by the microphone.
Embodiments herein also provide an immersive audio generation system comprising,
the system comprises a sound source device, a main control device and a plurality of audio playing units;
the audio source device outputs immersive audio data aiming at different audio objects to the main control device through different channels, wherein the immersive audio data comprises position information of the audio objects and audio data of the audio objects;
the master control device is used for receiving immersive audio data aiming at different audio objects through different channels; calculating a plurality of audio playing units corresponding to the audio objects according to the position information of the audio objects and the position information of the plurality of audio playing units in the immersive audio data; sending the audio data of the audio object to a plurality of corresponding audio playing units in the spatial field;
the audio playing unit is used for playing the audio data.
Embodiments herein also provide a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method when executing the computer program.
Embodiments herein also provide a computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the above-described method.
With the immersive audio generation system of the embodiments herein, the user's sense of the spatial position of an audio object during audio reproduction is made more realistic: the audio playing units can be arranged arbitrarily by the user according to the characteristics of the spatial field (the room), and the main control device restores the immersive audio object within the user's specific spatial field using only the position of each audio playing unit and the position of the audio object. Moreover, by using fine adjustment values trained with human judgment in the loop, a generally optimal subjective auditory effect for the human ear can be obtained.
Drawings
In order to illustrate the embodiments herein or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present disclosure, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic block diagram of an immersive audio generation system according to embodiments herein;
FIG. 2 is a flow diagram illustrating a method of immersive audio generation in accordance with embodiments herein;
FIG. 3 is a schematic diagram illustrating a curved surface formed by a plurality of audio playing units according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram showing more audio playback units within a spatial field;
FIG. 5 is a flowchart illustrating synchronization between a main control device and an audio playback unit according to an embodiment of the disclosure;
FIG. 6 is a flow chart illustrating a method for playing according to audio data frequency according to an embodiment of the present disclosure;
FIG. 7 is a flow diagram illustrating the conversion of audio object position information to coordinates in inscribed surface space according to an embodiment herein;
FIG. 8 is a schematic diagram of a coordinate transformation performed by embodiments herein;
FIG. 9 is a schematic diagram illustrating a fine-tuning model training process according to an embodiment herein;
FIG. 10 is a schematic diagram illustrating an internal structure of a Q-value network model according to an embodiment of the present invention;
fig. 11 illustrates a computer device provided in an embodiment herein.
[ description of reference ]
101. A sound source device;
102. a master control device;
103. an audio playing unit;
301. an audio playing unit;
302. a curved surface;
303. an audio object;
304. a user;
401. an audio playing unit;
402. a background audio playing unit;
1102. a computer device;
1104. a processor;
1106. a memory;
1108. a drive mechanism;
1110. an input/output module;
1112. an input device;
1114. an output device;
1116. a presentation device;
1118. a graphical user interface;
1120. a network interface;
1122. a communication link;
1124. a communication bus.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the scope of protection given herein.
Fig. 1 is a schematic structural diagram of an immersive audio generation system according to an embodiment of the present disclosure, in which audio is played at different sound pressures under the control of a master control device through a plurality of audio playing units in a cinema environment or a home cinema environment, so as to simulate a position of an audio object in a spatial field, and provide an immersive audio experience for a user. The system comprises a sound source device 101, a main control device 102 and a plurality of audio playing units 103.
Wherein the sound source device 101 outputs immersive audio data for different audio objects to the master device through different channels, wherein the immersive audio data include position information of the audio objects and audio data of the audio objects. The sound source device 101 may be, for example, an audio/video player, a mobile phone, a computer, or other equipment having an audio data processing function and a communication function.
The master device 102 is configured to receive immersive audio data for different audio objects through different channels; calculating a plurality of audio playing units corresponding to the audio object according to the position information of the audio object and the position information of a plurality of audio playing units in the immersive audio data; and sending the audio data of the audio object to a plurality of corresponding audio playing units in the spatial field. The main control device 102 and the audio source device 101 may be connected in a wired manner or in a wireless manner, where the channels refer to audio playing units distributed at different positions in space, and each audio playing unit is a channel.
The audio object is a source of sound in a virtual scene formed by video or audio, for example, a human voice spoken at a certain spatial position in the virtual scene, or a bird singing at a certain spatial position in the virtual scene. The audio object in the embodiment herein includes specific audio data, such as a human speech sound or a bird song, and also includes position information of the audio object in the virtual scene, which may be three-dimensional coordinates.
The master control device 102 may be implemented by a DSP and incorporates components such as a network communication module that communicates with each audio playback unit 103.
The audio playing units 103 are disposed at different positions of the spatial field where the user is located, for example as shown in fig. 3 or fig. 4. Fig. 3 is a schematic diagram of a curved surface formed by a plurality of audio playing units in this embodiment; the figure shows, by way of example, a single curved surface 302 formed by three audio playing units 301. The position information of an audio object is converted into spatial coordinates within the curved surface; as shown in the figure, the position of the audio object can be expressed by spatial coordinates relative to the audio playing units I1, I2 and I3, so that a virtual audio object 303 is formed in the space within the curved surface 302. The three audio playing units 301 are controlled by the main control device to output the audio data of the audio object at different sound pressures, and the user 304, located at the central spatial position facing the curved surface 302, experiences an immersive audio playing effect.
Fig. 4 shows an example with more audio playing units 401 in the spatial field: in addition to 7 audio playing units 401 arranged at the top of the spatial field, 7 audio playing units 401 are arranged at the bottom, so that the user obtains a more realistic audio sensation. Every three audio playing units 401 may form a curved surface, or a curved surface may be formed by more audio playing units 401, and the position of an audio object is converted into the curved surface space so formed. In addition, fig. 4 includes a plurality of background audio playing units 402 for playing background audio, that is, audio data below a certain frequency threshold, while the audio playing units 401 exclusively play audio data above that threshold to render the audio objects.
The audio playing unit 103 may be a common and conventional speaker, and receives the audio data controlled and sent by the main control device 102 in a wired or wireless manner, and the audio playing unit 103 may also include a power amplifier driver, a network communication module, and other components, which are used to drive the speaker and perform time synchronization with the main control device 102.
With the immersive audio generation system of the embodiments herein, the user's sense of the spatial position of an audio object during audio reproduction is made more realistic: the audio playing units 103 can be arranged arbitrarily by the user according to the characteristics of the spatial field (the room), and the master control device 102 restores the immersive audio object within the user's specific spatial field using only the position of each audio playing unit 103 and the position of the audio object. Moreover, by using fine adjustment values trained with human judgment in the loop, a generally optimal subjective auditory effect for the human ear can be obtained.
Fig. 2 is a flowchart of an immersive audio generation method according to an embodiment herein, describing how the master device in the system of fig. 1 processes immersive audio data. The method may be executed by a DSP or by a CPU, and the steps below are not limited to the order given; those skilled in the art may, without inventive effort, carry them out in other orders. The method specifically comprises:
step 201, receiving immersive audio data for different audio objects through different channels, wherein the immersive audio data comprise position information of the audio objects and audio data of the audio objects;
step 202, calculating a plurality of audio playing units corresponding to the audio object according to the position information of the audio object and the position information of a plurality of audio playing units in the immersive audio data;
step 203, sending the audio data of the audio object to a plurality of corresponding audio playing units in the spatial field.
As a further embodiment herein, transmitting audio data of an audio object to a corresponding plurality of audio playback units in a spatial field further comprises,
measuring a clock deviation from each audio playback unit;
and sending the audio data of the audio object to a plurality of corresponding audio playing units in the spatial field according to the clock deviation.
In this embodiment, as can be seen from the foregoing example of fig. 4, when a large number of audio playing units are used for audio object rendering, some audio playing units may be far from the main control device, and the playing of the audio data of the audio objects may fall out of synchronization, degrading the user's listening experience. The embodiments herein therefore provide that the audio data the main control device sends to each audio playing unit are adjusted according to the clock deviation (i.e., the delay or time difference) between that audio playing unit and the main control device, reducing the loss of synchronization between audio playing units.
As a further embodiment herein, measuring the clock offset from each audio playback unit further comprises,
a master clock of the master control device sends a synchronous message to slave clocks of all audio playing units, and the corresponding audio playing units record the sending and receiving time of the synchronous message at all ends;
receiving a request message fed back from a clock of each audio playing unit, and recording the sending and receiving time of the request message of each end by the corresponding audio playing unit;
the master clock of the master control device sends response messages to the slave clocks of all the audio playing units, and the corresponding audio playing units record the sending and receiving time of the response messages of all the ends;
and the corresponding audio playing units calculate the clock deviation of each audio playing unit according to the synchronous message sending and receiving time, the request message sending and receiving time and the response message sending and receiving time, so that the main control device can obtain the clock deviation and send the audio data of the audio object to a plurality of corresponding audio playing units in the spatial field.
In this embodiment, fig. 5 is a flowchart illustrating synchronization between the main control device and the audio playing units. The delay between the main control device and each audio playing unit is calculated from the times at which they exchange messages, and the timing of the audio data sent to each audio playing unit is adjusted accordingly, so that each audio playing unit is synchronized with the main control device. Specifically, the method includes:
step 501, a master clock of a master control device sends a synchronization message (Sync) to a slave clock of an audio playing unit, and records sending time t1; and after receiving the message from the clock of the audio playing unit, recording the receiving time t2.
Step 502, after the master clock of the master control device sends a Sync message, a Follow-Up message carrying t1 is sent to the slave clock of the audio playing unit.
Step 503, the slave clock of the audio playing unit sends a request message (Pdelay _ Req) to the master clock of the master control device, for initiating the calculation of the reverse transmission delay, and recording the sending time t3; and after receiving the message, the master clock of the master control device records the receiving time t4.
Step 504, the master clock of the master control device replies a response message (Pdelay _ Resp) carrying t4, and records the sending time t5; after receiving the message from the clock of the audio playing unit, the receiving time t6 is recorded.
In step 505, the master clock of the master control device sends a Pdelay _ Resp _ Follow _ Up message carrying t5 to the slave clock of the audio playing unit.
The recorded times may be conveyed as described above, with the master clock of the master control device sending its times to the slave clock of the audio playing unit; alternatively, the slave clock of the audio playing unit may report the times it recorded back to the master clock, or both sides may exchange their recorded times so that each side holds all message sending and receiving times, which is not limited herein.
In step 506, the slave clock of the audio playing unit now holds six time stamps t1 to t6, so the total round-trip delay between the master clock and the slave clock can be calculated as (t4 − t3) + (t6 − t5); since the network is symmetric, the one-way delay between the master clock and the slave clock is [(t4 − t3) + (t6 − t5)]/2. The clock deviation of the slave clock relative to the master clock is therefore: Offset = (t2 − t1) − [(t4 − t3) + (t6 − t5)]/2.
The clock deviation in this step may be calculated by the audio playing unit, or by the main control device once it holds all the time information, which is not limited herein. By having each audio playing unit perform the calculation and letting the master control device simply query the result, the master control device does not need to record the clock deviation of each channel in turn, which avoids an error in the clock deviation of one channel (audio playing unit) propagating to the clock deviations of all other channels.
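For illustration, the offset computation of step 506 can be sketched in a few lines. The following Python sketch assumes the symmetric network path described above; the function name and example timestamps are ours, not part of the patent.

```python
# A minimal sketch of the clock-offset computation (steps 501 to 506).
def clock_offset(t1: float, t2: float, t3: float,
                 t4: float, t5: float, t6: float) -> float:
    """Offset of an audio playing unit's slave clock relative to the
    master clock, assuming a symmetric network path.

    t1/t2: Sync sent by master / received by slave
    t3/t4: Pdelay_Req sent by slave / received by master
    t5/t6: Pdelay_Resp sent by master / received by slave
    """
    one_way_delay = ((t4 - t3) + (t6 - t5)) / 2.0
    return (t2 - t1) - one_way_delay

# Example: a slave clock running 3 ms ahead over a symmetric 1 ms link.
offset = clock_offset(t1=0.000, t2=0.004, t3=0.010, t4=0.008,
                      t5=0.009, t6=0.013)
print(offset)  # 0.003
```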
In step 507, the main control device packetizes the audio data according to a packetization standard with time interval T (generally 4 to 8 frames, i.e., about 200 ms), defines T + Offset + t′ (where t′ is the current time of the master clock of the main control device) as the timestamp of the RTP (Real-time Transport Protocol) packet, encodes the packetized audio data, packs it into an RTP packet, and adds the encoding format to the packet.
The N subprograms of the main control device responsible for packing and sending the audio data send it, respectively, to the plurality of audio playing units corresponding to the audio objects in the audio data.
In step 508, the main control device sends the packed RTP protocol data packets through its network card and port number to the switch, which forwards them to the power amplifier of each audio playing unit.
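Steps 507 and 508 can be illustrated with a minimal RTP packing sketch. This is a sketch under stated assumptions, not the patent's implementation: the 12-byte fixed header follows RFC 3550, the payload type and SSRC values are placeholders, and the timestamp argument is the T + Offset + t′ value computed by the caller.

```python
import struct

def rtp_packet(payload: bytes, seq: int, timestamp: int,
               ssrc: int = 0x1234, payload_type: int = 96) -> bytes:
    """Pack one RTP packet: RFC 3550 fixed header (no CSRC list or
    header extension) followed by the encoded audio chunk."""
    v_p_x_cc = 0x80                      # version 2, no padding/extension/CSRC
    m_pt = payload_type & 0x7F           # marker bit clear, payload type
    header = struct.pack("!BBHII", v_p_x_cc, m_pt, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

# The caller stamps each chunk with T + Offset + t' (step 507), e.g. in
# 48 kHz sample ticks, before the packet is forwarded via the switch
# to the power amplifier of each audio playing unit (step 508).
chunk = b"\x00" * 960                    # one encoded chunk (illustrative)
pkt = rtp_packet(chunk, seq=0, timestamp=9600 + 144 + 1234567)
```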
As an aspect of embodiments herein, before the audio data of the audio object are transmitted to the corresponding plurality of audio playback units in the spatial field, the method further comprises,
dividing the audio data of the audio object according to a frequency threshold value;
outputting the audio data of the audio objects lower than the frequency threshold value to a background audio playing unit in a spatial field for rendering;
and outputting the audio data of the audio objects which are higher than the frequency threshold value to a plurality of corresponding audio playing units in the spatial field for rendering.
In this embodiment, the audio data may be distributed to the corresponding audio playing units to realize the immersive audiovisual experience; alternatively, the audio data may be divided, with one frequency band sent to the audio playing units and rendered (played) according to the position information of the audio object, and the other frequency band sent to the background audio playing units for rendering (playing). Specifically, fig. 6 is a flowchart of a method for playing according to audio data frequency in this embodiment, comprising the following steps:
step 601, frequency division is performed on the obtained audio data.
Continuous objects in the immersive audio data are split frame by frame (a frame is generally 25 ms), the spatio-temporal attribute information of each object in each frame is separated from the corresponding frame data, and the frame data are decoded to obtain the audio data.
The audio data may be divided into a low-frequency part and a high-frequency part, for example with a frequency threshold of 100 Hz: audio data below 100 Hz fall into the low-frequency part, and audio data above 100 Hz into the high-frequency part.
Low frequencies (e.g., below 100 Hz), such as environmental white noise and background sounds, do not require the position of the audio object to be restored; high frequencies (e.g., above 100 Hz), such as the human voice, are the audio features in a movie theater for which restoring the position of the audio object matters most.
Step 602, the audio data of the high frequency part and the low frequency part are processed respectively.
The low frequency part is directly rendered to a background audio playing unit channel after being processed by the bass unit, and the high frequency part is rendered and restored to the position of an audio object in a three-dimensional space and is mapped to a plurality of audio playing units of a spatial field.
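As a sketch of steps 601 and 602, the frequency split can be implemented with a complementary low-pass/high-pass filter pair. The 100 Hz threshold, the Butterworth filters, the filter order, and the 48 kHz sample rate below are illustrative assumptions; a streaming implementation would also carry filter state across frames.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def split_bands(frame: np.ndarray, fs: int = 48000,
                f_cut: float = 100.0, order: int = 4):
    """Split one decoded audio frame at the frequency threshold: the
    low band goes to the background audio playing units, the high band
    is rendered at the audio object's position."""
    sos_lo = butter(order, f_cut, btype="lowpass", fs=fs, output="sos")
    sos_hi = butter(order, f_cut, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos_lo, frame), sosfilt(sos_hi, frame)

# Example: one 25 ms frame of a decoded object track.
frame = np.random.randn(48000 * 25 // 1000)   # 1200 samples
low, high = split_bands(frame)
```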
As one embodiment herein, calculating the plurality of audio playback units corresponding to the audio object from the position information of the audio object in the immersive audio data and the position information of the plurality of audio playback units further comprises,
according to the coordinates in the position information of the audio playing units, taking a user as a sphere center, and forming an inscribed curved surface space in the space surrounded by the audio playing units;
converting the position information of the audio object into coordinates in the inscribed surface space;
determining a plurality of audio playback units that are proximate to the coordinates of the audio object as a plurality of audio playback units corresponding to the audio object.
In this embodiment, reference may be made to fig. 7, a flowchart illustrating the conversion of audio object position information into coordinates in the inscribed surface space. The figure illustrates an example with 8 audio playing units and 2 audio objects (see fig. 8, a schematic diagram of the coordinate conversion of this embodiment); other embodiments may have more or fewer audio playing units.
Step 701, converting the position information of the audio object into coordinates in the inscribed surface space.
As shown in FIG. 8, assume that at the current time there are two audio objects P1 and P2, located at the positions (ρ1, φ1, θ1) and (ρ2, φ2, θ2), where ρ1, φ1, θ1 and ρ2, φ2, θ2 are the position information of the audio objects P1 and P2 in spherical coordinates. Eight loudspeakers I1 to I8 are set up in the spatial field. After the scene is set, the main control device presets the spatial coordinates (x_i, y_i, z_i) of the eight loudspeakers (i being the loudspeaker index number); the center of the sphere inscribed in the spatial field constructed by the loudspeakers is O (O is also the origin of the Cartesian coordinate system and the optimal viewing position). In other embodiments, if the number of loudspeakers is insufficient, the center position O corresponding to an inscribed surface of the spatial field may be constructed instead.
Step 702, the k loudspeakers nearest to the audio object are determined by the k-nearest-neighbor method as the plurality of loudspeakers corresponding to the audio object, and these k loudspeakers are chosen to produce the sound.
In the spatial field of the present embodiment, taking k = 3 as an example to construct a triangular surface space, P1 and P2 are converted into the spatial coordinates of the Cartesian coordinate system by the standard spherical-to-Cartesian conversion:
x = ρ sin θ cos φ, y = ρ sin θ sin φ, z = ρ cos θ
According to the Euclidean distance formula
d = √((x − x_i)² + (y − y_i)² + (z − z_i)²),
the loudspeakers numbered i are traversed in turn and the distance from the audio object P1 to each loudspeaker is calculated. The distances d_P1,1 to d_P1,8 are sorted in ascending order, and the 3 loudspeakers with the smallest distances are selected as the nearest-neighbor loudspeakers to produce the sound. This step determines that I1, I2 and I3 are the loudspeakers nearest to P1; similarly, I1, I5 and I6 are the loudspeakers nearest to P2.
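A minimal sketch of steps 701 and 702 follows, assuming the common physics convention for the spherical coordinates (ρ, φ, θ) used above; the speaker coordinates in the example are illustrative, not the patent's.

```python
import numpy as np

def spherical_to_cartesian(rho: float, phi: float, theta: float) -> np.ndarray:
    """Convert an audio object position (rho, phi, theta) about the
    sphere centre O into Cartesian coordinates (step 701)."""
    return np.array([rho * np.sin(theta) * np.cos(phi),
                     rho * np.sin(theta) * np.sin(phi),
                     rho * np.cos(theta)])

def nearest_speakers(obj_sph, speaker_xyz: np.ndarray, k: int = 3):
    """Step 702: pick the k loudspeakers with the smallest Euclidean
    distance to the audio object."""
    p = spherical_to_cartesian(*obj_sph)
    d = np.linalg.norm(speaker_xyz - p, axis=1)
    idx = np.argsort(d)[:k]
    return idx, d[idx]

# Eight loudspeakers I1..I8 at preset coordinates (illustrative values).
speakers = np.array([[ 1,  1,  1], [-1,  1,  1], [ 1, -1,  1], [-1, -1,  1],
                     [ 1,  1, -1], [-1,  1, -1], [ 1, -1, -1], [-1, -1, -1]],
                    dtype=float)
idx, dist = nearest_speakers((0.8, 0.3, 1.2), speakers, k=3)
```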
Step 703, calculating the playing sound pressures of the selected plurality of speakers.
Assuming the full sound pressure of a loudspeaker is S, i.e., the sound pressure of the sound source at point O, this embodiment determines the final sound pressure of each loudspeaker by a fixed-distance, approximately fixed sound pressure method.
The distances from the three loudspeakers I1, I2 and I3 to P1 are d_P1,1, d_P1,2 and d_P1,3 respectively, and the approximately fixed sound pressure S_I1 of loudspeaker I1 is determined from the full sound pressure S and these distances according to an equation given only in the figures of the original application. Similarly, the fixed sound pressures S_I2 and S_I3 of the loudspeakers I2 and I3 are obtained.
As one embodiment herein, calculating the plurality of audio playback units corresponding to the audio object from the position information of the audio object in the immersive audio data and the position information of the plurality of audio playback units further comprises,
and inputting the position information of the audio object and the position information of the audio playing units in the immersive audio data into a fine tuning model, and calculating the sound pressure fine tuning amount of the audio playing units corresponding to the audio object.
In this embodiment, after the fixed approximate sound pressure is determined, the sound pressure fine adjustment amount for each loudspeaker may be obtained from an intelligent model. The intelligent model is trained in advance in a laboratory from the auditory perception (subjective data and objective data) of users after a triangular surface space has been formed by three loudspeakers, with different coordinate origins and audio object coordinate points set. The model consists of a convolutional neural network whose parameters are the weight and bias information of each layer; its input is the coordinates of the three loudspeakers and the object coordinates, and its output is the sound pressure fine adjustment amounts ΔS_I1, ΔS_I2 and ΔS_I3 for the loudspeakers I1 to I3. The final sound pressure of each loudspeaker is then its approximately fixed sound pressure corrected by its fine adjustment amount, i.e., S′_Ii = S_Ii + ΔS_Ii.
As an embodiment herein, inputting the position information of the audio object and the position information of the plurality of audio playing units in the immersive audio data to a fine tuning model, calculating the sound pressure fine tuning amount of the plurality of audio playing units corresponding to the audio object further comprises,
and training a Q value network model by taking the position information of the audio object in the immersive audio data, the position information of the audio playing units and the evaluation scores as environment input to obtain the fine tuning model.
In this embodiment, the Q-value network model may adopt the double deep Q-network (DDQN), a network structure known in current reinforcement learning practice for fast convergence, ensuring that an optimal operation result can be obtained. A deep Q network obeys the Markov chain principle: the next sound pressure fine adjustment amount output depends only on the current environment input information and the previously output sound pressure fine adjustment amount. During training, the environment input information (loudspeaker positions, audio object position, and evaluation score) and the corresponding current fine-adjusted sound pressure are continuously accumulated, and the accumulated data are used as training data to train the Q-value network model iteratively through a memory playback unit.
The evaluation score comprises two parts. One part comes from objective data: the main sound pressure position of the sound is collected by an omnidirectional microphone, and the score S_mic is calculated from the angle difference Δφ between the target angle and the measured angle (the scoring equation appears only as an image in the original application).
The other part comes from subjective data: several experimenters at the coordinate origin audit the output of the loudspeakers and score it manually (the scoring range is 0 to 1, with 1 the best); the scores are then clustered by the k-means method, and S_ear is taken as the mean of the most populated cluster.
In other embodiments, network models such as LSTM, RCNN, or Transformer may also be trained to construct the fine tuning model.
As one embodiment herein, the training process of the fine tuning model is as follows:
adding an attention extraction layer between the convolution layers of the Q-value network model at the current moment, to extract features of interest from the features obtained by the previous convolution layer;
multiplying the features by the features of interest, then obtaining the sound pressure fine adjustment amount at the current moment through a convolution layer;
taking the sound pressure fine adjustment amount at the current moment and the environment input at the next moment as the input of the Q-value network model at the next moment, to obtain the sound pressure fine adjustment amount at the next moment;
and iterating the above steps in time sequence, stopping the iteration via the error function of the fine tuning model, to obtain the fine tuning model's sound pressure fine adjustment amounts for the plurality of audio playing units corresponding to the audio object.
In this embodiment, fig. 9 is a schematic diagram of the fine-tuning model training process. An attention extraction layer (SE-Net) is added to an existing Q-value network model, and a target Q-value network model and an error function are set. The coordinates of the loudspeakers in the spatial field, the audio object positions, and the evaluation scores are input to the current-moment Q-value network model as the environment, together with the acquired current sound pressures of the loudspeakers. Through a playback memory unit, the accumulated data above and the reward value at the current moment serve as training data to train the current-moment Q-value network model iteratively in time sequence.
The sound pressure fine adjustment amount output by the target Q-value network model can be expressed as:
Q_target = Reward(ε) + γ × Q_max(s′, a′, ε)
The reward function Reward(ε) forms part of the sound pressure fine adjustment amount output by the target Q-value network model and is calculated as Reward(ε) = e^(−S_ear/S_mic). γ is a hyperparameter (the Q-value network model trains better when the hyperparameter is tuned); Q_max is the largest of the sound pressure fine adjustment amounts (Q_present) output by the current-moment Q-value network model over the time series; s′ is the environment input at each moment together with the sound pressure fine adjustment amount at the previous moment; a′ is the sound pressure fine adjustment amount at each moment; and ε denotes the parameters of the network model training.
Here S_ear is the subjective human auditory evaluation data and S_mic is the evaluation score of the audio data collected by the microphone. The reward function computes a reward value from the environment inputs S_ear and S_mic; the reward value is fed into the playback memory unit to form the training samples for training the current-moment Q-value network model, and is also output to the target Q-value network model and the error function.
At each moment, the current-moment Q-value network model outputs its sound pressure fine adjustment amount (Q_present) to the error function, where it is iterated against the sound pressure fine adjustment amount (Q_target) output by the target Q-value network model to obtain the parameters ε that adjust the current-moment Q-value network model. After N moments, these parameters are copied into the parameters ε of the target Q-value network model, so that the accumulated environment inputs and the corresponding sound pressure fine adjustment data serve as training data and the optimized fine tuning model outputs more accurate sound pressure fine adjustments.
The current-moment Q-value network model is optimized with a gradient descent algorithm on the error function, so that it does not deviate too far from the target Q-value network model. The error function may be:
Loss(ε) = E(Q_target − Q_present(s, a, ε))²
wherein Q_target is the sound pressure fine adjustment amount output by the target Q-value network, Q_present is the sound pressure fine adjustment amount output by the current-moment Q-value network, s is the environment input together with the sound pressure fine adjustment amount at the previous moment, a is the sound pressure fine adjustment amount at the current moment, and ε denotes the parameters of the network training.
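A hedged PyTorch sketch of the target and loss computation described above: it treats the fine adjustment amounts as a discrete action set for illustration (a detail the text does not fix), and Reward(ε) = e^(−S_ear/S_mic) follows the text. Network architectures and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def reward(s_ear: torch.Tensor, s_mic: torch.Tensor) -> torch.Tensor:
    """Reward(eps) = exp(-S_ear / S_mic)."""
    return torch.exp(-s_ear / s_mic)

def dqn_loss(q_present, q_target_net, s, a, s_next, s_ear, s_mic,
             gamma: float = 0.99) -> torch.Tensor:
    """Loss(eps) = E[(Q_target - Q_present(s, a, eps))^2] with
    Q_target = Reward(eps) + gamma * max_a' Q(s', a', eps)."""
    # Q value of the fine adjustment (action) actually taken.
    q_sa = q_present(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                      # target network is frozen
        q_max = q_target_net(s_next).max(dim=1).values
        q_tgt = reward(s_ear, s_mic) + gamma * q_max
    return F.mse_loss(q_sa, q_tgt)
```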
As one embodiment herein, the attention-extracting layer further comprises,
the global pooling layer reduces the dimension of the features obtained by the previous convolution layer, and the softmax layer ranks the dimension-reduced features to form the features of interest.
In this embodiment, fig. 10 is a schematic diagram of the internal structure of the Q-value network model. An attention extraction layer (SE-Net) is added to an existing three-layer convolution structure to extract the features of interest from the features acquired at the previous moment. Ht−1 is the sound pressure fine adjustment amount output at moment t−1, Ht is the sound pressure fine adjustment amount output at moment t, Xt−1 is the environment input at moment t−1, and Xt is the environment input at moment t; the input of the Q-value network model at moment t also includes the sound pressure fine adjustment amount output at moment t−1.
The attention extraction layer comprises a global pooling layer and a softmax layer. The global pooling layer reduces the feature vector matrix of the previous convolution layer: in this embodiment, the multi-dimensional vector matrix output by the second convolution layer is reduced to a 1 × 1 × C feature vector matrix, where C is the feature dimension. The softmax layer ranks the feature vectors of the dimension-reduced matrix by magnitude to obtain a feature vector matrix F′ carrying an attention mechanism; that is, the global pooling layer and the softmax layer select the maxima of the feature channels. The feature vector matrix F produced by the second convolution layer is multiplied by the attention matrix F′ to obtain the attention-weighted feature vector matrix, after which the third convolution layer outputs the sound pressure fine adjustment amount for that moment. In the time-sequence dimension, the network transmission follows an LSTM-style structure: the sound pressure fine adjustment amount of the previous moment is used at the current moment, and the features of the current moment are passed on to the next moment, so that time-series features are obtained and the network output is further optimized.
In other embodiments, the features of interest may also be extracted by, for example, PCA (principal component analysis) or a two-layer convolution-based encoder to implement the functions of the attention extraction layer of the above embodiments.
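For reference, the attention extraction layer as described (global pooling followed by softmax channel weighting) can be sketched as below. A standard SE-Net block would also insert two fully connected layers with a sigmoid gate; since the text describes only pooling and softmax, those are omitted here.

```python
import torch
import torch.nn as nn

class AttentionExtraction(nn.Module):
    """Sketch of the attention extraction layer of fig. 10: global
    pooling reduces the C-channel feature map F to a 1x1xC descriptor,
    softmax turns it into channel weights F', and F is multiplied by
    F' to give the attention-weighted features."""
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        w = self.pool(f).flatten(1)                 # (N, C) descriptor
        w = torch.softmax(w, dim=1)                 # channel weights F'
        return f * w.unsqueeze(-1).unsqueeze(-1)    # F multiplied by F'
```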
Through the solution of the embodiments herein, the loudspeakers can cover more spatial coordinates around the human ear, guaranteeing fairly accurate expression of audio object spatial positions and good sound resolution. In a well-arranged loudspeaker scene, three or more loudspeakers are installed above and below the plane of the human ear, so that, with the human ear as the center, a dead-angle-free spatial coordinate expression capability is formed within a certain radius. The user's sense of the spatial position of an audio object during audio reproduction becomes more realistic; the audio playing units can be arranged arbitrarily by the user according to the characteristics of the spatial field (the room), and the main control device restores the immersive audio object within the user's specific spatial field using only the position of each audio playing unit and the position of the audio object.
Fig. 11 shows a computer device provided in an embodiment of the present disclosure, where the master control apparatus in the embodiment of the present disclosure may be a computer device in the embodiment of the present disclosure, and perform the method described herein. The computer device 1102 may include one or more processors 1104, such as one or more Central Processing Units (CPUs), each of which may implement one or more hardware threads. The computer device 1102 may also include any memory 1106 for storing any kind of information, such as code, settings, data, etc. For example, and without limitation, memory 1106 may include any one or more of the following in combination: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may use any technology to store information. Further, any memory may provide volatile or non-volatile retention of information. Further, any memories may represent fixed or removable components of computer device 1102. In one case, when the processor 1104 executes the associated instructions, which are stored in any memory or combination of memories, the computer device 1102 can perform any of the operations of the associated instructions. The computer device 1102 also includes one or more drive mechanisms 1108, such as a hard disk drive mechanism, an optical disk drive mechanism, etc., for interacting with any memory.
Computer device 1102 can also include input/output module 1110 (I/O) for receiving various inputs (via input device 1112) and for providing various outputs (via output device 1114). One particular output mechanism may include a presentation device 1116 and an associated Graphical User Interface (GUI) 1118. In other embodiments, input/output module 1110 (I/O), input device 1112, and output device 1114 may also be excluded, as only one computer device in a network. Computer device 1102 can also include one or more network interfaces 1120 for exchanging data with other devices via one or more communication links 1122. One or more communication buses 1124 couple the above-described components together.
Communication link 1122 may be implemented in any manner, e.g., via a local area network, a wide area network (e.g., the Internet), a point-to-point connection, etc., or any combination thereof. Communications link 1122 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
Embodiments herein also provide computer-readable instructions which, when executed by a processor, cause the processor to perform the method described above.
It should be understood that, in various embodiments herein, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments herein.
It should also be understood that, in the embodiments herein, the term "and/or" is only one kind of association relation describing an associated object, and means that there may be three kinds of relations. For example, a and/or B, may represent: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or a combination of both; the components and steps of the examples have been described above generally in terms of their functionality in order to illustrate clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided herein, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purposes of the embodiments herein.
In addition, functional units in the embodiments herein may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions herein, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments herein. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The principles and embodiments of this document are explained using specific examples, which are presented only to aid in understanding the method and its core concepts. A person of ordinary skill in the art may, following the ideas herein, make changes to the specific implementation and scope of application. In summary, the contents of this description should not be construed as limiting this document.

Claims (13)

1. An immersive audio generation method for a master control device, comprising:
receiving immersive audio data for different audio objects through different channels, wherein the immersive audio data includes position information of the audio objects and audio data of the audio objects;
calculating a plurality of audio playback units corresponding to an audio object according to the position information of the audio object in the immersive audio data and position information of the plurality of audio playback units; and
sending the audio data of the audio object to the corresponding plurality of audio playback units in the spatial field.
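For orientation only, the following minimal Python sketch mirrors the flow of claim 1: an audio object arrives with its position and samples, the nearest playback units are chosen, and the samples are routed to them. The function name, the plain Euclidean-distance selection, and k=3 are illustrative assumptions, not claim text (claim 5 refines the unit selection).

```python
import numpy as np

def dispatch_audio_object(obj_position, obj_audio, unit_positions, k=3):
    """Route one audio object's samples to the k playback units nearest
    to the object's position (a simplified stand-in for claims 1 and 5)."""
    distances = np.linalg.norm(unit_positions - obj_position, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the chosen units
    return {int(i): obj_audio for i in nearest}

# One object per channel: an (x, y, z) position plus its PCM samples.
unit_positions = np.array([[1.0, 0, 0], [0, 1.0, 0], [-1.0, 0, 0], [0, -1.0, 0]])
sends = dispatch_audio_object(np.array([0.6, 0.7, 0.0]), np.zeros(48000), unit_positions)
```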
2. The immersive audio generation method of claim 1, wherein sending the audio data of the audio object to the corresponding plurality of audio playback units in the spatial field further comprises:
measuring a clock deviation from each audio playback unit; and
sending the audio data of the audio object to the corresponding plurality of audio playback units in the spatial field according to the clock deviation.
3. The immersive audio generation method of claim 2, wherein measuring the clock deviation from each audio playback unit further comprises:
the master clock of the master control device sending a synchronization message to the slave clock of each audio playback unit, each audio playback unit recording the sending and receiving times of the synchronization message at its end;
receiving a request message fed back by the slave clock of each audio playback unit, each audio playback unit recording the sending and receiving times of its request message;
the master clock of the master control device sending a response message to the slave clock of each audio playback unit, each audio playback unit recording the sending and receiving times of the response message at its end; and
each audio playback unit calculating its clock deviation according to the sending and receiving times of the synchronization message, the request message, and the response message, so that the master control device can obtain the clock deviation and send the audio data of the audio object to the corresponding plurality of audio playback units in the spatial field.
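The synchronization/request/response exchange of claim 3 follows the same pattern as the IEEE 1588 (PTP) delay-request mechanism; under that reading, the deviation each unit computes can be sketched as below. The timestamp names t1–t4 and the symmetric-delay assumption are ours, not the patent's.

```python
def clock_deviation(t1, t2, t3, t4):
    """PTP-style offset estimate from the three-message exchange
    (an assumed reading of claim 3):
      t1: master sends the synchronization message (master clock)
      t2: unit receives the synchronization message (unit clock)
      t3: unit sends the request message (unit clock)
      t4: master receives the request message (master clock)
    The response message carries t4 back to the unit, which then
    computes its deviation, assuming a symmetric network delay."""
    return ((t2 - t1) - (t4 - t3)) / 2.0

# Unit clock 5 ms ahead of the master, one-way delay 2 ms:
print(clock_deviation(t1=0.000, t2=0.007, t3=0.010, t4=0.007))  # -> 0.005
```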
4. The immersive audio generation method of claim 1, wherein sending the audio data of the audio object to the corresponding plurality of audio playback units in the spatial field further comprises:
dividing the audio data of the audio object according to a frequency threshold;
outputting the audio data of the audio object below the frequency threshold to a background audio playback unit in the spatial field for rendering; and
outputting the audio data of the audio object above the frequency threshold to the corresponding plurality of audio playback units in the spatial field for rendering.
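Claim 4 describes a crossover: content below the frequency threshold is rendered by a background unit, the rest by the object's own units. A sketch, assuming SciPy is available and an illustrative 120 Hz threshold that the claim itself does not fix:

```python
import numpy as np
from scipy.signal import butter, lfilter

def split_by_frequency(audio, fs, f_threshold=120.0):
    """Split one object's audio at the crossover frequency (claim 4):
    the low band goes to the background playback unit, the high band
    to the object's corresponding playback units."""
    b_lo, a_lo = butter(4, f_threshold, btype="low", fs=fs)
    b_hi, a_hi = butter(4, f_threshold, btype="high", fs=fs)
    return lfilter(b_lo, a_lo, audio), lfilter(b_hi, a_hi, audio)

low_band, high_band = split_by_frequency(np.random.randn(48000), fs=48000)
```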
5. The immersive audio generation method of claim 1, wherein calculating the plurality of audio playback units corresponding to the audio object according to the position information of the audio object in the immersive audio data and the position information of the plurality of audio playback units further comprises:
constructing, according to the coordinates in the position information of the plurality of audio playback units, an inscribed curved-surface space within the space enclosed by the audio playback units, with the user as the center;
converting the position information of the audio object into coordinates in the inscribed curved-surface space; and
determining a plurality of audio playback units close to the coordinates of the audio object as the plurality of audio playback units corresponding to the audio object.
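Reading claim 5's inscribed curved surface as a sphere centered on the user, the selection can be sketched as follows: project the object's position onto the inscribed sphere, then rank units by angular closeness. The spherical simplification and k=3 are our assumptions.

```python
import numpy as np

def nearest_units_inscribed_sphere(obj_pos, unit_positions, k=3):
    """Claim 5, simplified: with the user at the origin, inscribe a
    sphere whose radius is the smallest user-to-unit distance, convert
    the object's position to a point on that sphere, and pick the k
    units whose directions best match the object's direction."""
    radii = np.linalg.norm(unit_positions, axis=1)
    r = radii.min()                                  # inscribed radius
    direction = obj_pos / np.linalg.norm(obj_pos)
    obj_on_surface = r * direction                   # converted coordinates
    unit_dirs = unit_positions / radii[:, None]
    closeness = unit_dirs @ direction                # cosine similarity
    return obj_on_surface, np.argsort(-closeness)[:k]
```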
6. The immersive audio generation method of claim 5, wherein calculating the plurality of audio playback units corresponding to the audio object according to the position information of the audio object in the immersive audio data and the position information of the plurality of audio playback units further comprises:
inputting the position information of the audio object in the immersive audio data and the position information of the plurality of audio playback units into a fine-tuning model, and calculating sound pressure fine-tuning amounts for the plurality of audio playback units corresponding to the audio object.
7. The immersive audio generation method of claim 6, wherein inputting the position information of the audio object in the immersive audio data and the position information of the plurality of audio playback units into the fine-tuning model and calculating the sound pressure fine-tuning amounts for the plurality of audio playback units corresponding to the audio object further comprises:
training the fine-tuning model by inputting, as the environment, the position information of the audio object in the immersive audio data, the position information of the plurality of audio playback units, and an evaluation score.
8. The immersive audio generation method of claim 7, wherein the fine-tuning model is a Q-value network model, the method further comprising:
adding an attention extraction layer between convolution layers of the Q-value network model at the current moment, and extracting features of interest from the features produced by the preceding convolution layer;
multiplying the features by the features of interest, and obtaining the sound pressure fine-tuning amount at the current moment through a further convolution layer;
taking the sound pressure fine-tuning amount at the current moment and the environment input at the next moment as the input of the Q-value network model at the next moment, to obtain the sound pressure fine-tuning amount at the next moment; and
iterating the above steps in time order, and stopping the iteration by means of the error function of the fine-tuning model, to obtain from the fine-tuning model the sound pressure fine-tuning amounts for the plurality of audio playback units corresponding to the audio object.
9. The immersive audio generation method of claim 8, wherein the attention extraction layer further comprises:
a global pooling layer that reduces the dimensionality of the features produced by the preceding convolution layer, and a softmax layer that ranks the dimension-reduced features to form the features of interest.
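Claims 8 and 9 describe a Q-value network with an attention extraction layer (global pooling followed by softmax) inserted between convolution layers. A PyTorch sketch under that reading; every layer size, the 1-D convolution choice, and the final averaging are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionQNet(nn.Module):
    """Q-value network per claims 8-9: an attention extraction layer
    between convolution layers.  Global pooling reduces the previous
    layer's feature map, softmax turns the result into per-channel
    weights (the "features of interest"), the features are multiplied
    by those weights, and a further convolution yields the per-unit
    sound pressure fine-tuning amounts."""
    def __init__(self, channels=16, n_units=8):
        super().__init__()
        self.conv1 = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)           # global pooling layer
        self.conv2 = nn.Conv1d(channels, n_units, kernel_size=3, padding=1)

    def forward(self, env):                           # env: (batch, 1, length)
        feats = torch.relu(self.conv1(env))
        interest = torch.softmax(self.pool(feats), dim=1)   # (batch, C, 1)
        weighted = feats * interest                   # features x interest
        return self.conv2(weighted).mean(dim=2)       # one amount per unit

q = AttentionQNet()(torch.randn(1, 1, 25))            # -> shape (1, 8)
```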
10. The immersive audio generation method of claim 8, wherein the error function is:
Loss(ε) = E[(Q_target − Q_present(s, a, ε))²]
wherein Q_target is the sound pressure fine-tuning amount output by the target Q-value network; Q_present is the sound pressure fine-tuning amount output by the current Q-value network; E[·]² denotes the mean of the squared differences over one round of training; s is the environment input together with the sound pressure fine-tuning amount at the previous moment; a is the sound pressure fine-tuning amount at the current moment; and ε denotes the parameters of the network training;
Q_target is calculated by the following formula: Q_target = Reward(ε) + γ × Q_max(s′, a′, ε)
wherein Reward(ε) is the reward function, γ is a hyperparameter, Q_max is the maximum sound pressure fine-tuning amount output by Q_present over the time series, s′ is the environment input at each moment together with the sound pressure fine-tuning amount at the previous moment, a′ is the sound pressure fine-tuning amount at each moment, and ε denotes the parameters of the network training; the reward function is Reward(ε) = e^(−S_ear/S_mic), wherein S_ear is the evaluation score of subjective human auditory perception, and S_mic is the evaluation score of the audio data collected by the microphone.
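The error function and target of claim 10 are ordinary deep-Q-learning quantities, so they translate directly into code. A sketch, with gamma=0.99 as an assumed hyperparameter value and tensor inputs assumed throughout:

```python
import torch

def reward(s_ear, s_mic):
    """Reward(eps) = e^(-S_ear / S_mic), per claim 10: subjective
    listening score over the microphone-measured score (tensors)."""
    return torch.exp(-s_ear / s_mic)

def q_loss(q_present, rew, q_max_next, gamma=0.99):
    """Loss(eps) = E[(Q_target - Q_present)^2] with
    Q_target = Reward(eps) + gamma * Q_max (claim 10).  The target is
    detached so gradients flow only through the current network."""
    q_target = rew + gamma * q_max_next
    return torch.mean((q_target.detach() - q_present) ** 2)
```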
11. An immersive audio generation system, comprising:
a sound source device, a master control device, and a plurality of audio playback units; wherein
the sound source device outputs immersive audio data for different audio objects to the master control device through different channels, wherein the immersive audio data includes position information of the audio objects and audio data of the audio objects;
the master control device is configured to receive the immersive audio data for the different audio objects through the different channels, calculate the plurality of audio playback units corresponding to an audio object according to the position information of the audio object in the immersive audio data and the position information of the plurality of audio playback units, and send the audio data of the audio object to the corresponding plurality of audio playback units in the spatial field; and
the audio playback units are configured to play the audio data.
12. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1 to 10.
13. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 10.
CN202210919846.XA 2022-08-01 2022-08-01 Immersive audio generation method and system Pending CN115460515A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210919846.XA CN115460515A (en) 2022-08-01 2022-08-01 Immersive audio generation method and system

Publications (1)

Publication Number Publication Date
CN115460515A true CN115460515A (en) 2022-12-09

Family

ID=84296558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210919846.XA Pending CN115460515A (en) 2022-08-01 2022-08-01 Immersive audio generation method and system

Country Status (1)

Country Link
CN (1) CN115460515A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106465034A (en) * 2014-03-26 2017-02-22 弗劳恩霍夫应用研究促进协会 Apparatus and method for audio rendering employing a geometric distance definition
CN107172527A (en) * 2016-03-08 2017-09-15 中兴通讯股份有限公司 Volume adjustment method and apparatus for collaborative playback, and collaborative playback device
CN109671446A (en) * 2019-02-20 2019-04-23 西华大学 Deep learning sound enhancement method based on absolute hearing threshold
US20210021949A1 (en) * 2019-07-18 2021-01-21 International Business Machines Corporation Spatial-based audio object generation using image information
CN112333623A (en) * 2019-07-18 2021-02-05 国际商业机器公司 Spatial-based audio object generation using image information
KR102065030B1 (en) * 2019-09-05 2020-03-03 주식회사 지브이코리아 Control method, apparatus and program of audio tuning system using artificial intelligence model
CN114554355A (en) * 2022-03-17 2022-05-27 中科雷欧(常熟)科技有限公司 Vehicle-mounted immersive audio transmission method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Yingruo (李英若) et al., "Immersive audio technology and its applications in the context of the MPEG-H format" (《MPEG-H格式背景下沉浸式音响技术及其应用》), Film & Television Production (《影视制作》), 15 February 2019 (2019-02-15), pages 69-74 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination