CN110166927B - Virtual sound image reconstruction method based on positioning correction - Google Patents

Virtual sound image reconstruction method based on positioning correction

Info

Publication number
CN110166927B
Authority
CN
China
Prior art keywords
loudspeaker
gain
azimuth
sound image
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910392966.7A
Other languages
Chinese (zh)
Other versions
CN110166927A (en)
Inventor
涂卫平
翟双星
郑佳玺
余智勇
万言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910392966.7A
Publication of CN110166927A
Application granted
Publication of CN110166927B
Active legal status
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 5/00: Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 7/00: Indicating arrangements; Control arrangements, e.g. balance control

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The invention provides a virtual sound image reconstruction method based on positioning correction, which comprises the following steps: first, the loudspeaker azimuths and the azimuth of the target reconstructed sound image are determined, and loudspeaker gains are allocated by a vector base amplitude panning method; a binaural signal is then synthesized and interaural cues are extracted, from which a virtual sound image estimation model estimates the azimuth of the virtual sound image. The estimated azimuth is compared with the target azimuth, and the loudspeaker gain values are adjusted by a bisection method until the deviation between the estimated azimuth and the target azimuth is smaller than the minimum audible angle; the adjusted loudspeaker gains are then output, thereby correcting the vector base amplitude panning method. The invention makes the azimuth of the sound image reconstructed by vector base amplitude panning consistent with the target azimuth.

Description

Virtual sound image reconstruction method based on positioning correction
Technical Field
The invention relates to the technical field of audio, in particular to a virtual sound image reconstruction method based on positioning correction.
Background
In virtual reality, a realistic spatial perception of sound images depends on undistorted reconstruction of virtual sound images, so improving the accuracy of virtual sound image reconstruction has become a research hotspot in the multimedia field at home and abroad. The most widely used method for reconstructing a virtual sound image is amplitude panning (AP). AP techniques include the sine-law panning technique, the tangent-law panning technique, vector base amplitude panning (VBAP), multiple-direction amplitude panning (MDAP), and the like. Virtual sound image reconstruction based on AP adopts a simple geometric model: direction vectors from the listening point to each loudspeaker are established, gains are allocated to each loudspeaker by vector synthesis, and a sound image in the target direction is thereby synthesized.
Although AP is computationally simple, its geometric model of loudspeakers and listening point ignores the filtering effect of the listener's head, torso and so on as sound travels to the ears. This causes the azimuth perceived by the listener to deviate from the estimated azimuth, so the synthesized virtual sound image deviates from the target sound image. For this reason, the vector base amplitude panning technique needs to be corrected.
Disclosure of Invention
The invention provides a virtual sound image reconstruction method based on positioning correction, which corrects the vector base amplitude panning method so that the virtual sound image reconstructed by vector base amplitude panning is more accurate; the method comprises the following steps:
Step 1: determining the loudspeaker azimuths and the target azimuth, wherein the number of loudspeakers is 2 or 3 and the target azimuth is the ideal virtual sound image azimuth expected to be reconstructed;
Step 2: allocating an initial gain to each loudspeaker by a vector base amplitude panning method according to the loudspeaker azimuths and the target azimuth;
Step 3: synthesizing the binaural signal corresponding to the initial virtual sound image through the summing localization criterion according to the loudspeaker gain values, and extracting interaural cues;
Step 4: inputting the interaural cues extracted in step 3 into an existing virtual sound image azimuth estimation model, wherein the estimation model is used for estimating the azimuth represented by a binaural signal;
Step 5: judging whether the azimuth estimated by the virtual sound image azimuth estimation model is consistent with the target azimuth, where consistent means that the difference between the estimated azimuth and the target azimuth is within the minimum audible angle of the target azimuth; if they are consistent, taking the current loudspeaker gains as the corrected gains for vector base amplitude panning;
Step 6: if the estimated azimuth is not consistent with the target azimuth, calculating the loudspeaker gain ratio, dividing the gain ratio interval, determining the median gain ratio by bisection, recalculating the loudspeaker gains, and repeating steps 3-6, wherein the gain ratio is the ratio of the right loudspeaker gain to the left loudspeaker gain.
Preferably, the extraction of the interaural cues in step 3 specifically includes:
Step 3.1: selecting the corresponding HRTF data according to each loudspeaker azimuth and the target azimuth, wherein the HRTF data are stored in an HRTF database in which the left- and right-ear HRTF data corresponding to each spatial position are recorded;
Step 3.2: obtaining each loudspeaker signal by applying each loudspeaker gain to the sound source signal, convolving each loudspeaker signal with the left- and right-ear HRTF data, and summing to obtain the left- and right-ear signals;
Step 3.3: extracting the interaural cues from the left- and right-ear signals, wherein the interaural cues are cues used for localizing the sound source position and comprise binaural cues and monaural cues.
Preferably, determining the median gain ratio by bisection in step 6 successively approximates the corrected loudspeaker gains, and specifically includes:
Step 6.1: calculating the gain ratio from the loudspeaker gains, and dividing the original gain ratio interval into a left interval and a right interval with the gain ratio as the critical point;
Step 6.2: selecting the gain ratio variation interval from the two intervals of step 6.1 according to the deviation of the target azimuth from the estimated azimuth;
Step 6.3: calculating the median gain ratio from the left and right limit values of the gain ratio interval, and solving the gains of the left and right loudspeakers by gain normalization.
Drawings
FIG. 1: spatial positions of the loudspeakers and the head in an embodiment of the invention;
FIG. 2: synthesis of the binaural signal from the left and right loudspeakers;
FIG. 3: structure of the neural network;
FIG. 4: flow chart of the vector base amplitude panning correction of the invention;
FIG. 5: method for adjusting the loudspeaker gains in an embodiment of the invention;
FIG. 6: spatial positions of the three loudspeakers;
FIG. 7: mapping of the estimated sound image onto the plane of loudspeakers 1 and 2;
FIG. 8: mapping of the estimated sound image onto the plane of loudspeakers 2 and 3.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a virtual sound image reconstruction method based on positioning correction, which solves the problem that the azimuth of the reconstructed virtual sound image deviates from the target azimuth because the existing vector base amplitude panning technique neglects the listener's disturbance of the sound field.
The technical scheme in the embodiment of the application has the following general idea:
First, the loudspeaker azimuths and the target azimuth are determined, and a corresponding gain value is allocated to each loudspeaker by the vector base amplitude panning method. A virtual sound image is then synthesized based on an HRTF database and interaural cues are extracted, and the azimuth of the currently synthesized virtual sound image is estimated with a virtual sound image azimuth estimation model. Next, the loudspeaker gains are adjusted by bisection according to the difference between the target azimuth and the estimated azimuth, iterating until this difference is smaller than the minimum audible angle; the loudspeaker gains at that point are recorded as the finally corrected loudspeaker gains.
The method predicts the virtual sound image azimuth in real time and keeps adjusting the loudspeaker gains by bisection, terminating only when the difference between the predicted azimuth and the target azimuth is smaller than the minimum audible angle. Therefore, as long as the prediction error of the virtual sound image azimuth estimation model is small, the method effectively reduces the localization deviation of vector base amplitude panning; most existing virtual sound image azimuth estimation models have good prediction performance.
The technical scheme of the invention is explained in detail in the following by combining the drawings and the embodiment.
The invention provides a virtual sound image synthesis method and device based on positioning correction, which solve the problem that the azimuth deviation of a virtual sound image synthesized by the vector base amplitude panning method is large. The implementation flow of the embodiment comprises the following steps:
Step 1: determining the loudspeaker azimuths and the target azimuth, wherein the number of loudspeakers is 2 or 3 and the target azimuth is the virtual sound image azimuth expected to be reconstructed;
Determining the loudspeaker azimuths and the target azimuth in step 1 specifically comprises:
The vector base amplitude panning method is suitable for two or three loudspeakers. Taking the case of 2 loudspeakers as an example, a coordinate system is established with the head as the origin, and the 2 loudspeakers are located on a circle centered on the listening point (the head). Straight ahead of the head is defined as 0 degrees, and the directions of the left and right ears are -90 and 90 degrees respectively; the angles of the 2 loudspeakers are -θ and θ, and the azimuth of the target sound image is φ_T.
Step 2: calculating the initial gain value of each loudspeaker by vector base amplitude panning according to the loudspeaker azimuths and the target azimuth (the virtual sound image azimuth expected to be reconstructed);
In step 2, the initial gain values g1 and g2 of the loudspeakers are calculated by vector base amplitude panning from the loudspeaker azimuths and the target azimuth.
Specifically, the principle of the vector base amplitude panning method is as follows: given 2 or 3 loudspeakers at the same distance from the listening point, and assuming that the virtual sound image and the loudspeakers lie on a sphere of that radius around the center point, each loudspeaker position and the center point define a unit vector, and the unit vector of the virtual sound image is synthesized from these vectors.
In a specific implementation with 2 loudspeakers, referred to as the left and right loudspeakers according to their relative azimuths, the initial gains follow from the vector base amplitude panning solution for loudspeakers at -θ and θ:
g1 = sin(θ - φ_T) / sin(2θ)
g2 = sin(θ + φ_T) / sin(2θ)
after which the gains may be normalized so that g1² + g2² = 1.
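For illustration, a minimal Python sketch of this two-loudspeaker gain allocation is given below; the function name vbap_gains_2d and the power normalization g1² + g2² = 1 are illustrative assumptions rather than part of the patent text.

    import numpy as np

    def vbap_gains_2d(theta_deg, phi_deg):
        """Initial VBAP gains for loudspeakers at -theta and +theta and a
        target azimuth phi (degrees, right of the listener positive)."""
        theta, phi = np.radians(theta_deg), np.radians(phi_deg)
        g1 = np.sin(theta - phi)       # left loudspeaker
        g2 = np.sin(theta + phi)       # right loudspeaker
        norm = np.hypot(g1, g2)        # power normalization: g1^2 + g2^2 = 1
        return g1 / norm, g2 / norm

    # Example: loudspeakers at +/-45 degrees, target sound image at 20 degrees
    g1, g2 = vbap_gains_2d(45.0, 20.0)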
Step 3: synthesizing the binaural signal corresponding to the initial virtual sound image through the summing localization criterion according to the loudspeaker gain values, and extracting the interaural cues;
In step 3, the binaural signal corresponding to the initial virtual sound image is synthesized through the summing localization criterion according to the loudspeaker gain values, and the interaural cues are extracted, as follows:
The corresponding head-related transfer function (HRTF) is determined according to each loudspeaker position. The HRTFs are stored in an HRTF database in which the left-ear and right-ear HRTFs corresponding to each spatial position are recorded. The corresponding HRTFs are obtained according to the loudspeaker positions and, combined with the initial loudspeaker gain values obtained in step 2, the binaural signal synthesized by the two loudspeakers at the ears is calculated and the interaural cues are extracted.
Specifically, the HRTF underlies a sound localization method: pulse signals are used to record the free-field transmission of sound waves from a sound source to the listener's two ears, including the combined filtering by the listener's head, pinnae, torso and so on, and the results are stored as an HRTF database. Different positions correspond to different HRTFs, and the HRTF depends on individual characteristics. Available HRTF databases include the CIPIC, ARI, PKU and SADIE databases; they differ in data volume and sampling precision, and can be selected as required.
As an optional implementation, the loudspeakers comprise a left loudspeaker and a right loudspeaker, and the selected database is the CIPIC database. In step 3, under the left/right loudspeaker configuration, synthesizing the virtual sound image based on the HRTF database and extracting the interaural cues specifically includes:
Step 3.1: selecting the corresponding HRTFs from the CIPIC database according to each loudspeaker azimuth and the target azimuth; the CIPIC library records the left- and right-ear HRTF data corresponding to each spatial position, covering M = 1250 spatial positions in total;
Step 3.2: according to the left- and right-ear HRTFs corresponding to the left and right loudspeaker positions, combined with the left and right loudspeaker gains, the binaural signal corresponding to the virtual sound image synthesized by the left and right loudspeakers can be calculated;
Specifically, using the CIPIC HRTF database, let s be the sound source signal, g1 the gain of the left loudspeaker and g2 the gain of the right loudspeaker, so that the left loudspeaker signal is sl = s·g1 and the right loudspeaker signal is sr = s·g2. Each loudspeaker signal is convolved with the left-ear HRTF to obtain its left-ear contribution and with the right-ear HRTF to obtain its right-ear contribution. As shown in fig. 2, the left-ear signal is the sum of the signals al and bl transmitted to the left ear by the left and right loudspeakers respectively; the right-ear signal is the sum of the signals ar and br transmitted to the right ear by the left and right loudspeakers respectively. The left- and right-ear signals are obtained by:
xl = s·g1 * hrtf_ll + s·g2 * hrtf_rl
xr = s·g1 * hrtf_lr + s·g2 * hrtf_rr
where * denotes convolution, xl is the left-ear signal and xr the right-ear signal; hrtf_ll is the left-ear HRTF corresponding to the left loudspeaker, hrtf_rl the left-ear HRTF corresponding to the right loudspeaker, hrtf_lr the right-ear HRTF corresponding to the left loudspeaker, and hrtf_rr the right-ear HRTF corresponding to the right loudspeaker.
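A minimal Python sketch of this synthesis step follows, assuming hrtf_ll, hrtf_rl, hrtf_lr and hrtf_rr hold time-domain head-related impulse responses selected from the database for the two loudspeaker positions:

    import numpy as np

    def synthesize_binaural(s, g1, g2, hrtf_ll, hrtf_rl, hrtf_lr, hrtf_rr):
        """Summing localization: each loudspeaker signal is the source s
        scaled by its gain, convolved with the impulse response from that
        loudspeaker to each ear; the per-ear contributions are summed."""
        sl, sr = g1 * s, g2 * s                                    # loudspeaker signals
        xl = np.convolve(sl, hrtf_ll) + np.convolve(sr, hrtf_rl)   # left ear
        xr = np.convolve(sl, hrtf_lr) + np.convolve(sr, hrtf_rr)   # right ear
        return xl, xr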
Step 3.3: extracting the interaural cues from the binaural signal, where the interaural cues are the cues used by the human ears to judge the position of a sound source and comprise binaural cues, monaural cues and the like;
Specifically, the interaural cues include the interaural time difference (ITD), the interaural level difference (ILD), the binaural cross-correlation function (CCF), monaural cues and the like. The monaural cue used here is a monaural spectral cue, represented by the energy values (GFE) of the left- and right-ear signals after passing through a gammatone filter bank. The interaural cues can be selected as required.
In a specific implementation, the binaural signal obtained in step 3.2 is framed and one frame of the signal is taken for calculation.
The ILD is calculated as follows:
ILD = 10·log10( Σ_n |Xr(n)|² / Σ_n |Xl(n)|² )
where Xl is the left-ear signal and Xr the right-ear signal of the current frame.
The CCF is calculated as follows:
CCF(τ) = Σ_{n=1..N} xl(n)·xr(n+τ) / sqrt( Σ_{n=1..N} xl²(n) · Σ_{n=1..N} xr²(n) )
where xl(n) is the left-ear signal, xr(n) the right-ear signal, n the time index, τ the delay of the right-ear signal relative to the left-ear signal, and N the total length of the signals.
The ITD is the delay τ at which the CCF reaches its peak. The GFE values are obtained by passing the left- and right-ear signals through a 20-channel gammatone filter bank and taking the signal energy in each channel, yielding 40 GFE values in total.
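A Python sketch of the ILD, CCF and ITD computations for one frame follows; the ±1 ms lag search range is an illustrative assumption, and the 20-channel gammatone filtering that yields the GFE values is omitted:

    import numpy as np

    def interaural_cues(xl, xr, fs):
        """ILD (dB), normalized cross-correlation and ITD for one frame."""
        ild = 10.0 * np.log10(np.sum(xr**2) / np.sum(xl**2))
        max_lag = int(1e-3 * fs)                # search the ITD within +/-1 ms
        lags = np.arange(-max_lag, max_lag + 1)
        denom = np.sqrt(np.sum(xl**2) * np.sum(xr**2))
        ccf = np.array([np.sum(xl[max(0, -t):len(xl) - max(0, t)] *
                               xr[max(0, t):len(xr) - max(0, -t)])
                        for t in lags]) / denom
        itd = lags[np.argmax(ccf)] / fs         # delay (s) at the CCF peak
        return ild, ccf, itd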
Step 4: estimating the sound image azimuth represented by the binaural signal with the virtual sound image estimation model to obtain the estimated azimuth.
Specifically, the virtual sound image estimation model is a sound image estimation method based on a BP neural network model whose input is the interaural cues and whose output is the corresponding sound image azimuth. The network structure is shown in fig. 3 and comprises an input layer, two hidden layers and an output layer; the input layer contains 75 nodes, each hidden layer contains 151 nodes, and the output layer contains 2 nodes. When training the neural network, the activation function of the hidden layers is set to the sigmoid function, the learning rate is 0.001, and the number of iterations is 350. Verification shows that the average error of the sound image azimuth estimated by the neural network is smaller than the average minimum audible angle, so the localization of the neural network model is considered accurate.
In the specific implementation process, the interaural cues extracted in step 3 are input into the neural network model to obtain the estimated azimuth.
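A sketch of a network with this structure, using scikit-learn's MLPRegressor as an assumed stand-in for the patent's BP network implementation (the random training data are placeholders for real cue/azimuth pairs):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    model = MLPRegressor(hidden_layer_sizes=(151, 151),  # two hidden layers
                         activation='logistic',          # sigmoid units
                         learning_rate_init=0.001,
                         max_iter=350)

    # Placeholder data: rows are 75-dimensional interaural-cue vectors,
    # targets are the 2-node azimuth encoding used by the model.
    X_train = np.random.randn(200, 75)
    y_train = np.random.randn(200, 2)
    model.fit(X_train, y_train)
    estimated = model.predict(X_train[:1])   # estimated azimuth encoding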
Step 5: judging whether the azimuth estimated by the virtual sound image azimuth estimation model is consistent with the target azimuth, where consistent means that the difference between the estimated azimuth and the target azimuth is smaller than the minimum audible angle of the target azimuth; if they are consistent, the current loudspeaker gains are used as the corrected gains for vector base amplitude panning.
In particular, the difference between the target azimuth φ_T and the estimated azimuth φ_E is defined as Δφ = φ_T - φ_E. If the difference between the estimated azimuth and the target azimuth is smaller than the minimum audible angle (MAA), that is, when
|φ_T - φ_E| < MAA,
the current loudspeaker gains are output.
Step 6: if the estimated azimuth is not consistent with the target azimuth, calculating the loudspeaker gain ratio, dividing the gain ratio interval, determining the median gain ratio by bisection, recalculating the loudspeaker gains, and repeating steps 3-6, where the gain ratio is the ratio of the right loudspeaker gain to the left loudspeaker gain.
Specifically, the loudspeaker gain values are adjusted by bisection until the estimated azimuth output by the neural network is consistent with the target azimuth, and the loudspeaker gains at that point are recorded as the corrected gains for vector base amplitude panning.
In practice, when the estimated azimuth is not consistent with the target azimuth, that is, |φ_T - φ_E| ≥ MAA, the loudspeaker gains are adjusted; the general flow is shown in fig. 5. The adjustment specifically comprises the following steps:
Step 6.1: first calculate the current loudspeaker gain ratio g, set the adjustment interval of the gain ratio to [a, b], and divide the gain ratio interval at g into two intervals [a, g] and [g, b];
Step 6.2: if φ_T - φ_E < 0, i.e. the estimated azimuth lies to the right of the target, choose the gain ratio interval [a, g]; if φ_T - φ_E > 0, choose the gain ratio interval [g, b];
Step 6.3: calculate the median gain ratio from the gain ratio interval, i.e. take the average of the left and right limit values of the interval as the median gain ratio, then solve the gains of the left and right loudspeakers by gain normalization, and repeat steps 3 to 6.
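A Python sketch of this correction loop follows; the MAA value, the interval limits [a, b] and the callable estimate_azimuth, which stands for steps 3-4 (binaural synthesis plus the estimation model), are illustrative assumptions:

    import numpy as np

    MAA = 2.0   # minimum audible angle in degrees (illustrative value)

    def correct_gains(phi_target, estimate_azimuth, g1, g2,
                      a=0.0, b=10.0, max_iter=50):
        """Bisection on the gain ratio g = g2/g1 until the estimated
        azimuth is within the MAA of the target (steps 3-6)."""
        for _ in range(max_iter):
            phi_est = estimate_azimuth(g1, g2)   # steps 3-4: synthesis + model
            diff = phi_target - phi_est
            if abs(diff) < MAA:                  # step 5: consistent, stop
                break
            g = g2 / g1                          # step 6.1: current ratio
            if diff < 0:
                b = g                            # step 6.2: take [a, g]
            else:
                a = g                            # step 6.2: take [g, b]
            g_mid = 0.5 * (a + b)                # step 6.3: median ratio
            g1 = 1.0 / np.hypot(1.0, g_mid)      # normalize g1^2 + g2^2 = 1
            g2 = g_mid * g1
        return g1, g2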
In the implementation with three loudspeakers, as shown in figs. 6 to 8, the target azimuth φ_T is first mapped: its mapping onto the plane formed by the listening point, loudspeaker 1 and loudspeaker 2 is φ_T12, and its mapping onto the plane formed by the listening point, loudspeaker 2 and loudspeaker 3 is φ_T23. Likewise, the mapping of the estimated azimuth φ_E onto the plane formed by the listening point, loudspeaker 1 and loudspeaker 2 is φ_E12, and its mapping onto the plane formed by the listening point, loudspeaker 2 and loudspeaker 3 is φ_E23. The loudspeaker adjusting steps are as follows:
Define Δφ12 = φ_T12 - φ_E12. The specific adjustment follows the bisection method used for two loudspeakers, in the same way as steps 6.1 to 6.3, until the azimuth φ_E12 of the virtual sound image synthesized by loudspeaker 1 and loudspeaker 2 is consistent with φ_T12.
Define Δφ23 = φ_T23 - φ_E23. The specific adjustment likewise follows the bisection method used for two loudspeakers, in the same way as steps 6.1 to 6.3, until the azimuth φ_E23 of the virtual sound image synthesized by loudspeaker 2 and loudspeaker 3 is consistent with φ_T23.
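Under the same assumptions, the three-loudspeaker case can be sketched as two applications of the two-loudspeaker routine correct_gains from the sketch above, one per loudspeaker-pair plane; the per-plane target azimuths and estimators are assumed inputs:

    def correct_three_speakers(phi_t12, phi_t23, estimate_12, estimate_23,
                               g1, g2, g3):
        """Apply the two-loudspeaker bisection within each pair's plane."""
        # Plane of loudspeakers 1 and 2: adjust g2/g1 toward phi_t12
        g1, g2 = correct_gains(phi_t12, estimate_12, g1, g2)
        # Plane of loudspeakers 2 and 3: adjust g3/g2 toward phi_t23
        g2, g3 = correct_gains(phi_t23, estimate_23, g2, g3)
        return g1, g2, g3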
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A virtual sound image reconstruction method based on localization correction is characterized by comprising the following steps:
Step 1: determining loudspeaker azimuths and a target azimuth, wherein the number of loudspeakers is 2 or 3 and the target azimuth is the virtual sound image azimuth expected to be reconstructed;
Step 2: allocating an initial gain to each loudspeaker by a vector base amplitude panning method according to the loudspeaker azimuths and the target azimuth;
Step 3: synthesizing a binaural signal corresponding to the initial virtual sound image through a summing localization criterion according to the loudspeaker gain values, and extracting interaural cues;
Step 4: inputting the interaural cues extracted in step 3 into a virtual sound image azimuth estimation model, wherein the estimation model is used for estimating the azimuth represented by the binaural signal;
Step 5: judging whether the azimuth estimated by the virtual sound image azimuth estimation model is consistent with the target azimuth, wherein consistent means that the difference between the estimated azimuth and the target azimuth is within the minimum audible angle range of the target azimuth, and if they are consistent, taking the current loudspeaker gains as the corrected gains for vector base amplitude panning;
Step 6: if the estimated azimuth is not consistent with the target azimuth, calculating the loudspeaker gain ratio, dividing the gain ratio interval, determining the median gain ratio by a bisection method, calculating the loudspeaker gains, and repeating steps 3-6, wherein the gain ratio is the ratio of the right loudspeaker gain to the left loudspeaker gain.
2. The method of claim 1, wherein the extraction of the interaural cues in step 3 specifically comprises:
Step 3.1: selecting corresponding HRTF data according to each loudspeaker azimuth and the target azimuth, wherein the HRTF data are stored in an HRTF database in which the left- and right-ear HRTF data corresponding to each spatial position are recorded;
Step 3.2: obtaining each loudspeaker signal by applying each loudspeaker gain to the sound source signal, convolving each loudspeaker signal with the left- and right-ear HRTF data, and summing to obtain the left- and right-ear signals;
Step 3.3: extracting the interaural cues from the left- and right-ear signals, wherein the interaural cues are cues used for localizing the sound source position and comprise binaural cues and monaural cues.
3. The method according to claim 1, wherein determining the median gain ratio by the bisection method in step 6 successively approximates the corrected loudspeaker gains, and specifically comprises:
Step 6.1: calculating the gain ratio from the loudspeaker gains, and dividing the original gain ratio interval into a left interval and a right interval with the gain ratio as the critical point;
Step 6.2: selecting the gain ratio variation interval from the two intervals of step 6.1 according to the deviation of the target azimuth from the estimated azimuth;
Step 6.3: calculating the median gain ratio from the left and right limit values of the gain ratio interval, and solving the gains of the left and right loudspeakers by gain normalization.
CN201910392966.7A 2019-05-13 2019-05-13 Virtual sound image reconstruction method based on positioning correction Active CN110166927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910392966.7A CN110166927B (en) 2019-05-13 2019-05-13 Virtual sound image reconstruction method based on positioning correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910392966.7A CN110166927B (en) 2019-05-13 2019-05-13 Virtual sound image reconstruction method based on positioning correction

Publications (2)

Publication Number Publication Date
CN110166927A CN110166927A (en) 2019-08-23
CN110166927B true CN110166927B (en) 2020-05-12

Family

ID=67634306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910392966.7A Active CN110166927B (en) 2019-05-13 2019-05-13 Virtual sound image reconstruction method based on positioning correction

Country Status (1)

Country Link
CN (1) CN110166927B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106385660A (en) * 2015-08-07 2017-02-08 杜比实验室特许公司 Audio signal processing based on object
US9648438B1 (en) * 2015-12-16 2017-05-09 Oculus Vr, Llc Head-related transfer function recording using positional tracking
CN109068262A (en) * 2018-08-03 2018-12-21 武汉大学 A kind of acoustic image personalization replay method and device based on loudspeaker

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110268285A1 (en) * 2007-08-20 2011-11-03 Pioneer Corporation Sound image localization estimating device, sound image localization control system, sound image localization estimation method, and sound image localization control method
CN107205207B (en) * 2017-05-17 2019-01-29 华南理工大学 A kind of virtual sound image approximation acquisition methods based on middle vertical plane characteristic

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106385660A (en) * 2015-08-07 2017-02-08 杜比实验室特许公司 Audio signal processing based on object
US9648438B1 (en) * 2015-12-16 2017-05-09 Oculus Vr, Llc Head-related transfer function recording using positional tracking
CN109068262A (en) * 2018-08-03 2018-12-21 武汉大学 A kind of acoustic image personalization replay method and device based on loudspeaker

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Sound Image Reproduction Model Based on Personalized Weight Vectors; Zheng Jiaxi, Tu Weiping, Zhang Xiong; 19th Pacific-Rim Conference on Multimedia (PCM); 2018-09-22; full text *
Gain Factors Calibration in 3D Sound Reproduction Using VBAP; Hu Ruimin, Zhang Maosheng, Yang Yuhong; 9th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP); 2013-10-18; full text *
Real-time 3D audio generation technology and implementation; Tu Weiping, Yao Xuechun, Zhang Maosheng, Hu Ruimin, Yang Cheng; Journal of Frontiers of Computer Science and Technology (计算机科学与探索); 2015-02-05; vol. 9, no. 7; full text *

Also Published As

Publication number Publication date
CN110166927A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
US9838825B2 (en) Audio signal processing device and method for reproducing a binaural signal
CN110021306B (en) Method for generating custom spatial audio using head tracking
US20240098445A1 (en) Concept for generating an enhanced sound field description or a modified sound field description using a multi-point sound field description
US9961474B2 (en) Audio signal processing apparatus
US7231054B1 (en) Method and apparatus for three-dimensional audio display
US20190230436A1 (en) Method, systems and apparatus for determining audio representation(s) of one or more audio sources
US8437485B2 (en) Method and device for improved sound field rendering accuracy within a preferred listening area
US20180310114A1 (en) Distributed Audio Capture and Mixing
US20150156599A1 (en) Efficient personalization of head-related transfer functions for improved virtual spatial audio
CN106664501A (en) System, apparatus and method for consistent acoustic scene reproduction based on informed spatial filtering
US20090067636A1 (en) Optimization of Binaural Sound Spatialization Based on Multichannel Encoding
CN107820158B (en) Three-dimensional audio generation device based on head-related impulse response
US10652686B2 (en) Method of improving localization of surround sound
US10966046B2 (en) Spatial repositioning of multiple audio streams
Zhong et al. Head-related transfer functions and virtual auditory display
TW202022853A (en) Method and apparatus for decoding encoded audio signal in ambisonics format for l loudspeakers at known positions and computer readable storage medium
JP2009077379A (en) Stereoscopic sound reproduction equipment, stereophonic sound reproduction method, and computer program
Garí et al. Flexible binaural resynthesis of room impulse responses for augmented reality research
Salvador et al. Design theory for binaural synthesis: Combining microphone array recordings and head-related transfer function datasets
Lopez et al. Elevation in wave-field synthesis using HRTF cues
Breebaart et al. Phantom materialization: A novel method to enhance stereo audio reproduction on headphones
CN110166927B (en) Virtual sound image reconstruction method based on positioning correction
US11388540B2 (en) Method for acoustically rendering the size of a sound source
Koyama Boundary integral approach to sound field transform and reproduction
US20200275232A1 (en) Transfer function dataset generation system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant