WO2018066376A1 - Signal processing device, method, and program - Google Patents

Signal processing device, method, and program

Info

Publication number
WO2018066376A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
sound source
signal
signal processing
spatial frequency
Prior art date
Application number
PCT/JP2017/034138
Other languages
English (en)
Japanese (ja)
Inventor
Yu Maeno
Hideaki Iwaki
Original Assignee
Sony Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation
Publication of WO2018066376A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00: Details of transducers, loudspeakers or microphones
    • H04R 1/20: Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32: Arrangements for obtaining desired directional characteristic only
    • H04R 1/40: Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R 3/00: Circuits for transducers, loudspeakers or microphones

Definitions

  • The present technology relates to a signal processing device, method, and program, and more particularly, to a signal processing device, method, and program capable of emphasizing a desired sound.
  • the sound that the viewer wants to focus on can be emphasized. It can also be used for effects intended by the content producer.
  • sound source position information indicating the position of the object is usually included as additional information in a recording signal for reproducing the sound of the object.
  • the volume of each object can be easily controlled in the reproduction of the object-based sound in which the sound source position information exists.
  • a technique related to speech enhancement a mixing method for increasing the volume of sound of a specific virtual sound source in accordance with the virtual sound source position and the position of the viewer has been proposed (see, for example, Patent Document 1).
  • object-based sound has sound source position information
  • the position of the object in the space can be specified, so that the sound of any object can be emphasized.
  • the present technology has been made in view of such a situation, and enables a desired sound to be emphasized.
  • A signal processing device includes a control unit that performs, on an audio signal, processing for emphasizing the sound of a sound source in a specific direction in the spatial frequency domain and outputs the audio signal obtained by the processing.
  • the control unit can perform a process of giving directivity in the specific direction to the sound of the sound source as the process.
  • the control unit can perform beam forming as the processing.
  • the control unit can perform gain adjustment of the audio signal as the processing.
  • the control unit can perform the processing on a spatial frequency spectrum expressed as a spherical harmonic as the audio signal.
  • the specific direction can be the user's viewpoint direction in space.
  • the viewpoint direction can be detected by performing image recognition on an image taken with the user as a subject, or detecting the orientation of the user's head using a sensor.
  • the specific direction can be the direction of the specific sound source as viewed from the user in space.
  • the direction of the specific sound source viewed from the user may be obtained based on position information indicating the position of the specific sound source included in metadata of content including sound based on the audio signal.
  • the specific direction can be the direction of a specific position viewed from the user in space.
  • The specific direction can be the direction of an object that is visible to the user in the space, as seen from the user.
  • the direction of the object that can be seen by the user can be obtained based on information indicating the arrangement of the object in the space.
  • the control unit can perform the processing for a plurality of the specific directions.
  • the control unit can perform the processing on the audio signal when the user is in a predetermined area in space.
  • the control unit can perform a process of attenuating the sound of a sound source in a predetermined direction in the spatial frequency domain on the audio signal.
  • A signal processing method or program includes a step of performing, on an audio signal, a process of enhancing the sound of a sound source in a specific direction in the spatial frequency domain and outputting the audio signal obtained by the process.
  • processing for emphasizing the sound of a sound source in a specific direction in the spatial frequency domain is performed on the audio signal, and the audio signal obtained by the processing is output.
  • a desired sound can be emphasized.
  • This technology emphasizes sound by using the position of the viewer in the space and emphasis direction information indicating the emphasis direction, that is, the direction of the sound to be emphasized, regardless of whether the sound is object-based or scene-based.
  • the desired sound can be emphasized with free viewpoint content.
  • the emphasis direction can be determined as shown in FIG.
  • the viewpoint direction of the viewer U11 is the direction indicated by the arrow A11, that is, the direction in which the sound source AS12 exists when viewed from the viewer U11.
  • the viewer U11 faces the direction of the sound source AS12 that is the viewpoint direction.
  • the origin position of the coordinate system that is the reference for the free viewpoint content is the position of the head of the viewer U11.
  • the viewpoint direction of the viewer U11 with respect to the coordinate axis serving as a reference for the free viewpoint content is detected, and the viewpoint direction is set as the enhancement direction.
  • The recorded signal for reproducing the sound of the free viewpoint content is subjected to processing for enhancing the sound of the sound source in the enhancement direction in the spatial frequency domain, so that the sound coming toward the viewer U11 is emphasized.
  • As the processing for emphasizing the sound of the sound source in the emphasis direction in the spatial frequency domain, a process of giving the sound of the sound source directivity in the emphasis direction, more specifically beamforming, for example, is performed on the recorded signal.
  • As a result, a sound field having directivity in the direction indicated by the arrow A11, that is, the enhancement direction, is formed when the free viewpoint content is reproduced, and the viewer U11 hears the sound from the sound source AS12 in that direction more loudly. That is, the sound of the sound source AS12 is emphasized more than the sound of the sound source AS11.
  • In this way, the sound of the sound source in the emphasis direction can be emphasized even when it is not possible to determine whether a sound source actually exists in the emphasis direction in the space, as is the case with scene-based sound.
  • Thus, desired sounds, such as the sound of the sound source in the viewpoint direction of the viewer U11, can be emphasized.
  • Position information indicating the position of a specific object may be included in the video or audio metadata of the free viewpoint content, or may be added as metadata during editing after the free viewpoint content is recorded.
  • The direction of a specific sound source viewed from the viewer U11 in the coordinate system serving as the reference for the free viewpoint content can also be set as the enhancement direction, as indicated by the arrow Q12.
  • In this example, a sound source AS13 and a sound source AS14 are present diagonally forward right of and to the right of the viewer U11 of the free viewpoint content in the space, respectively.
  • the direction of the sound source AS14 viewed from the viewer U11 in the coordinate system serving as a reference for the free viewpoint content is set as the enhancement direction.
  • processing for enhancing the sound of the sound source in the enhancement direction in the spatial frequency domain is performed on the recorded signal for reproducing the sound of the free viewpoint content.
  • the viewer U11 can hear more sound from the sound source AS14 in the emphasis direction. That is, the sound of the sound source AS14 is emphasized more greatly than the sound of the sound source AS13.
  • Ordinarily, the reproduced sound field of the sound of a point sound source, that is, the state of the sound wavefront, is as shown by the arrow Q21 in the figure.
  • the shading in the part indicated by the arrow Q21 indicates the amplitude of the wavefront of the sound of the point sound source.
  • the position indicated by the arrow B11 indicates the position of the point sound source, and it can be seen that the sound wavefront spreads uniformly in all directions, that is, concentrically around the position of the point sound source.
  • the sound pressure at each position in the space is as indicated by an arrow Q22, and it can be seen that there is no particular directivity.
  • the shading at each position indicates the sound pressure at those positions.
  • the viewer's position is, for example, the position indicated by the arrow B21. That is, the viewer's position is the lower position in the point sound source diagram.
  • the wavefront of the sound from the point sound source does not spread equally in all directions but propagates with directivity in the direction of the viewer.
  • the sound pressure at each position in the space is as indicated by an arrow Q32, and it can be seen that the sound pressure has a strong directivity especially in the direction of the viewer. That is, it can be seen that the sound is emphasized so that the sound of the point sound source can be heard particularly loud in the direction of the viewer.
  • the shading at each position indicates the sound pressure at those positions.
  • FIG. 4 is a diagram illustrating a configuration example of an embodiment of a sound field enhancement device to which the present technology is applied.
  • the sound field enhancing device 11 is a signal processing device that emphasizes sound from a sound source in a desired direction.
  • the enhancement direction acquisition unit 21 includes, for example, a camera and an acceleration sensor, acquires enhancement direction information indicating the enhancement direction of the sound of the free viewpoint content to be reproduced, and supplies the enhancement direction information to the enhancement speech generation unit 22.
  • the free viewpoint content is content including video and audio in a three-dimensional space.
  • the emphasis direction may be only one direction or a plurality of directions.
  • The enhanced speech generation unit 22 functions as a control unit that uses the enhancement direction information supplied from the enhancement direction acquisition unit 21 to perform processing for enhancing the sound of the sound source in the enhancement direction on the recorded signal, that is, the audio signal for reproducing the sound of the free viewpoint content received by a reception unit such as an antenna.
  • The recorded signal of the free viewpoint content is a scene-based sound signal obtained by recording a sound field with a microphone array or the like, that is, by collecting sound.
  • the recorded signal may be an object-based sound signal.
  • A signal in the spatial frequency domain, that is, a spatial frequency spectrum, is supplied to the enhanced speech generation unit 22 as the recorded signal.
  • The enhanced speech generation unit 22 supplies the spatial frequency spectrum obtained by the process of enhancing the sound of the sound source in the enhancement direction in the spatial frequency domain, that is, the process of enhancing the sound field in the enhancement direction, to the spatial frequency synthesis unit 23.
  • the spatial frequency synthesizer 23 performs spatial frequency inverse transform on the spatial frequency spectrum supplied from the emphasized speech generator 22 based on speaker arrangement information indicating the arrangement position of the speakers constituting the speaker array 25 supplied from the outside. Then, the time frequency spectrum obtained as a result is supplied to the time frequency synthesis unit 24.
  • The time-frequency synthesis unit 24 performs time-frequency synthesis on the time-frequency spectrum supplied from the spatial frequency synthesis unit 23, and supplies the resulting time signal to the speaker array 25 as speaker drive signals for reproducing the sound of the free viewpoint content emphasized in the emphasis direction.
  • The speaker array 25 is composed of a plurality of arranged speakers, such as a linear speaker array, a planar speaker array, an annular speaker array, or a spherical speaker array, and reproduces the sound of the free viewpoint content based on the speaker drive signals supplied from the time-frequency synthesis unit 24.
  • For example, the enhancement direction acquisition unit 21 detects the viewer's viewpoint direction by performing image recognition on an image of the viewer taken by a camera, or detects the orientation of the viewer's head from the output of an acceleration sensor as the viewpoint direction, and acquires the enhancement direction information by setting the obtained viewpoint direction as the enhancement direction.
  • the sensor used to detect the orientation of the viewer's head is not limited to the acceleration sensor, and may be any other sensor. Furthermore, it is assumed that the emphasis direction acquisition unit 21 can also specify the position of the viewer in the space from image recognition for the image taken by the camera, sensor output, and the like.
  • Alternatively, the enhancement direction acquisition unit 21 detects an object by performing object recognition on the video (images) of the free viewpoint content, determines the direction of the object viewed from the viewer in the space based on the recognition result, and obtains the enhancement direction information by setting that direction as the enhancement direction.
  • The emphasis direction acquisition unit 21 may also acquire the position information of a specific object (sound source) from the video or audio metadata of the free viewpoint content or from metadata added at the time of editing, or acquire position information input by the viewer or the like. The emphasis direction acquisition unit 21 then obtains the emphasis direction information by calculating the direction of the sound source viewed from the viewer from the acquired position information.
  • Not only the direction of an object viewed from the viewer, that is, the direction of a sound source, but also the direction of a predetermined specific position viewed from the viewer in the space may be set as the emphasis direction.
  • A direction directly input by the viewer or the like may also be used as the enhancement direction as it is.
  • The coordinate system in the space is, for example, as shown in FIG. 5, a coordinate system whose origin O, serving as the reference of the coordinate system, is the position of the viewer's (user's) head, and whose coordinate axes with respect to the origin O are the x axis, the y axis, and the z axis.
  • a straight line connecting the object OB11 and the origin O is a straight line LN
  • a straight line obtained by projecting the straight line LN from the z-axis direction onto the xy plane is a straight line LN ′.
  • The angle $\phi$ formed by the x axis and the straight line LN′ is the azimuth angle indicating the direction of the object OB11 as viewed from the origin O in the xy plane.
  • The angle $\theta$ formed by the z axis and the straight line LN is the elevation angle indicating the direction of the object OB11 as viewed from the origin O in a plane perpendicular to the xy plane.
  • The emphasis direction acquisition unit 21 outputs, as the emphasis direction information, the relative angle information $(\theta, \phi)$ between the viewer and the specific object in the space, consisting of the elevation angle $\theta$ and the azimuth angle $\phi$.
  • Hereinafter, the elevation angle and azimuth angle of the enhancement direction indicated by the enhancement direction information are written as $\theta_{BF}$ and $\phi_{BF}$, and the enhancement direction is also written as $(\theta_{BF}, \phi_{BF})$.
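  • As a concrete illustration of this geometry, the following is a minimal sketch that computes the relative angle information (θ, φ) from an object position given in the Cartesian coordinates of FIG. 5. The function name and the NumPy usage are ours, not part of the patent.

```python
import numpy as np

def relative_angles(obj_pos):
    """Relative angle information (theta, phi) of FIG. 5 for an object at
    obj_pos = (x, y, z), with the viewer's head at the origin O."""
    x, y, z = obj_pos
    r = np.sqrt(x * x + y * y + z * z)
    theta = np.arccos(z / r)  # elevation: angle between the z axis and line LN
    phi = np.arctan2(y, x)    # azimuth: angle between the x axis and line LN'
    return theta, phi
```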
  • Using the enhancement direction information supplied from the enhancement direction acquisition unit 21, the enhanced speech generation unit 22 performs a process of enhancing the sound of the sound source in the enhancement direction on the recorded signal.
  • The recorded signal supplied to the enhanced speech generation unit 22 is a spatial frequency spectrum.
  • Here, the recorded signal is assumed to be a scene-based sound signal obtained by recording a sound field with a spherical microphone array, in which a plurality of microphones are arranged on a spherical surface.
  • However, any microphone array may be used for obtaining the recorded signal as long as it is composed of a plurality of microphones, such as a linear microphone array or an annular microphone array.
  • Angle information consisting of the elevation angle $\theta_i$ and the azimuth angle $\phi_i$ indicating the direction of the microphone unit with microphone index i is referred to as the microphone angle $(\theta_i, \phi_i)$.
  • The elevation angle $\theta_i$ and the azimuth angle $\phi_i$ are the elevation and azimuth angles indicating the direction of the microphone unit as viewed from a predetermined reference origin.
  • Consider converting the time-frequency spectrum $S(i, n_{tf})$ of the recorded signal into the spatial frequency domain using the spherical harmonic series expansion defined below, to obtain the spatial frequency spectrum $S'^{m}_{n}(n_{tf})$. Here, i is the microphone index, $n_{tf}$ is the time-frequency index, and n and m are the orders in the spherical harmonic domain.
  • The sound field S on a certain sphere can be expressed by the following equation (1):

    $S = Y W S'$ … (1)

  • In equation (1), Y is a spherical harmonic function matrix, W is a weighting coefficient determined by the sphere radius and the spatial frequency order, and S′ is the spatial frequency spectrum.
  • Accordingly, the spatial frequency spectrum S′ can be obtained by the spatial frequency transform of the following equation (2):

    $S' = W^{-1} Y^{+} S$ … (2)

  • In equation (2), $Y^{+}$ is the pseudo-inverse matrix of the spherical harmonic function matrix Y, which is obtained from the transposed matrix $Y^{T}$ by the following equation (3):

    $Y^{+} = (Y^{T} Y)^{-1} Y^{T}$ … (3)
  • From the above, it can be seen that the vector S′ consisting of the spatial frequency spectra $S'^{m}_{n}(n_{tf})$ can be obtained by the following equation (4):

    $S' = (Y_{mic}^{T} Y_{mic})^{-1} Y_{mic}^{T} S$ … (4)

  • In equation (4), S′ is the vector of spatial frequency spectra of equation (5), and S is the vector of time-frequency spectra $S(i, n_{tf})$ of equation (6), where I is the number of microphone units:

    $S' = [S'^{0}_{0}(n_{tf}),\ S'^{-1}_{1}(n_{tf}),\ S'^{0}_{1}(n_{tf}),\ S'^{1}_{1}(n_{tf}),\ \ldots]^{T}$ … (5)

    $S = [S(0, n_{tf}),\ S(1, n_{tf}),\ \ldots,\ S(I-1, n_{tf})]^{T}$ … (6)

  • $Y_{mic}$ in equation (4) is the spherical harmonic function matrix of the following equation (7), whose element in row i and in the column for orders (n, m) is $Y_n^m(\theta_i, \phi_i)$, and $Y_{mic}^{T}$ is the transposed matrix of $Y_{mic}$. The spherical harmonic function matrix $Y_{mic}$ corresponds to the spherical harmonic function matrix Y in equation (3), and the weighting factor corresponding to W in equation (2) is omitted here.

    $Y_{mic} = \begin{pmatrix} Y_0^0(\theta_0, \phi_0) & Y_1^{-1}(\theta_0, \phi_0) & \cdots \\ \vdots & \vdots & \ddots \\ Y_0^0(\theta_{I-1}, \phi_{I-1}) & Y_1^{-1}(\theta_{I-1}, \phi_{I-1}) & \cdots \end{pmatrix}$ … (7)

  • $Y_n^m(\theta_i, \phi_i)$ in equation (7) is the spherical harmonic function of the following equation (8), where n and m are the orders in the spherical harmonic domain, j is the pure imaginary unit, and $P_n^{m}$ is the associated Legendre function; $\theta_i$ and $\phi_i$ are the elevation angle and azimuth angle of the microphone angle $(\theta_i, \phi_i)$ of each microphone unit:

    $Y_n^m(\theta, \phi) = \sqrt{\dfrac{2n+1}{4\pi}\,\dfrac{(n-m)!}{(n+m)!}}\; P_n^{m}(\cos\theta)\, e^{jm\phi}$ … (8)
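  • The transform of equation (4) can be sketched in a few lines of Python. This is a minimal sketch under our own naming (`sh_matrix`, `spatial_spectrum`); note that SciPy's `sph_harm(m, n, azimuth, polar)` takes the azimuth angle first, and that `numpy.linalg.pinv` applies a conjugate transpose where equation (4) writes a plain transpose.

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, theta, phi):
    """Spherical harmonic matrix: one row per direction, one column per
    order pair (n, m), as in equations (7) and (18)."""
    cols = [sph_harm(m, n, phi, theta)  # scipy argument order: (m, n, azimuth, polar)
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=-1)

def spatial_spectrum(S, theta_mic, phi_mic, order):
    """Equation (4): S' = (Y_mic^T Y_mic)^{-1} Y_mic^T S (weighting W omitted)."""
    Y_mic = sh_matrix(order, np.asarray(theta_mic), np.asarray(phi_mic))
    return np.linalg.pinv(Y_mic) @ S  # pinv realizes the pseudo-inverse of eq. (3)
```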
  • Thus, the spatial frequency spectrum $S'^{m}_{n}(n_{tf})$ expressed in the spherical harmonic domain, that is, the vector S′, is supplied to the enhanced speech generation unit 22 as the recorded signal.
  • The enhanced speech generation unit 22 performs beamforming in the spherical harmonic domain to enhance, in the enhancement direction, the sound field represented by the spatial frequency spectrum $S'^{m}_{n}(n_{tf})$ serving as the recorded signal.
  • As described above, the enhancement direction is expressed as $(\theta_{BF}, \phi_{BF})$ using the elevation angle $\theta_{BF}$ and the azimuth angle $\phi_{BF}$.
  • The beam pattern in the spherical harmonic domain for the enhancement direction $(\theta_{BF}, \phi_{BF})$ is expressed by the following equation (9), where $c_n$ is a weighting coefficient, $Y_n^m(\theta_{BF}, \phi_{BF})^{*}$ is the complex conjugate of the spherical harmonic function $Y_n^m(\theta_{BF}, \phi_{BF})$, and n and m are the orders in the spherical harmonic domain:

    $F_n^m(\theta_{BF}, \phi_{BF}) = c_n\, Y_n^m(\theta_{BF}, \phi_{BF})^{*}$ … (9)

  • The beam pattern $F_n^m(\theta_{BF}, \phi_{BF})$ of equation (9) is a spatial frequency spectrum representing a beam pattern having directivity in the enhancement direction $(\theta_{BF}, \phi_{BF})$. In other words, it can be regarded as a weighting coefficient that, in the spatial frequency domain, gives a certain sound field, that is, the reproduced sound, directivity in the enhancement direction $(\theta_{BF}, \phi_{BF})$.
  • The enhanced speech generation unit 22 performs beamforming by calculating the following equation (10) from the matrix F composed of the beam patterns $F_n^m(\theta_{BF}, \phi_{BF})$ of each order n and m obtained for the enhancement direction $(\theta_{BF}, \phi_{BF})$, shown in equation (11), and the vector S′ composed of the spatial frequency spectra $S'^{m}_{n}(n_{tf})$ serving as the recorded signal:

    $S'_{BF} = F S'$ … (10)

    $F = \operatorname{diag}\big(F_0^0(\theta_{BF}, \phi_{BF}),\ F_1^{-1}(\theta_{BF}, \phi_{BF}),\ \ldots\big)$ … (11)

  • As a result, the vector $S'_{BF}$ composed of the spatial frequency spectra $S'^{m}_{n\,\mathrm{BF}}(n_{tf})$ of the sound field in which the enhancement direction $(\theta_{BF}, \phi_{BF})$ is enhanced is obtained. That is, by computing the product of the vector S′, a signal in the spatial frequency domain, and the matrix F, in other words by a computation in the spatial frequency domain on the signal expressed in the spherical harmonic domain, the vector $S'_{BF}$ subjected to enhancement in the enhancement direction is calculated.
  • In other words, spatial waveform shaping is performed in the spatial frequency domain so that the sound field formed by the original recorded signal acquires directivity in the enhancement direction. When a sound field is formed from the obtained spatial frequency spectrum $S'^{m}_{n\,\mathrm{BF}}(n_{tf})$, it is the same as the sound field based on the spatial frequency spectrum $S'^{m}_{n}(n_{tf})$ of the recorded signal, except that it has directivity in the enhancement direction.
  • In this process, the spatial frequency spectrum $S'^{m}_{n}(n_{tf})$ expressed in the spherical harmonic domain is used as the recorded signal. The higher the orders n and m up to which the beam patterns $F_n^m(\theta_{BF}, \phi_{BF})$ constituting the matrix F are non-zero, the stronger the directivity of the resulting sound field. Therefore, a more appropriate sound field can be obtained by determining up to which orders non-zero $F_n^m(\theta_{BF}, \phi_{BF})$ are used, according to how strong the directivity of the sound field should be, that is, according to the degree of emphasis in the emphasis direction.
  • When there are a plurality of enhancement directions, the formed sound field has directivity in each of those enhancement directions.
  • By performing beamforming in the spatial frequency domain in this way, a spatial frequency spectrum $S'^{m}_{n\,\mathrm{BF}}(n_{tf})$ that has directivity in the enhancement direction while forming a sound field similar to the original one can be obtained.
  • By contrast, if the emphasis in the emphasis direction were performed in the time-frequency domain, for example, the sense of localization of the sound image would be lost, and a sound field could not be formed with the reproducibility obtained when beamforming is performed in the spatial frequency domain.
  • The enhanced speech generation unit 22 supplies the vector $S'_{BF}$ obtained as described above, that is, the spatial frequency spectra $S'^{m}_{n\,\mathrm{BF}}(n_{tf})$, to the spatial frequency synthesis unit 23.
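  • The beamforming of equations (9) to (11) reduces to building a diagonal matrix of conjugated spherical harmonics and multiplying. A minimal sketch, reusing `sph_harm` and NumPy from the sketch above and assuming the weights $c_n$ are all 1 and that F is diagonal as in equation (11):

```python
def beam_matrix(theta_bf, phi_bf, order, c_n=None):
    """Matrix F of equation (11): diagonal of the beam patterns
    F_n^m = c_n * conj(Y_n^m(theta_BF, phi_BF)) of equation (9)."""
    if c_n is None:
        c_n = np.ones(order + 1)  # per-order weights c_n, assumed to be 1 here
    diag = [c_n[n] * np.conj(sph_harm(m, n, phi_bf, theta_bf))
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.diag(diag)

# Equation (10): S'_BF = F S', applied to the recorded spatial spectrum.
# S_bf = beam_matrix(theta_bf, phi_bf, order) @ S_prime
```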
  • sound from a sound source in a desired direction can also be attenuated. That is, the sound from the desired direction can be lowered or eliminated. Thereby, for example, editing such as lowering the gain of sound such as unnecessary noise can be performed.
  • the attenuation direction may be a predetermined direction, a direction opposite to the viewpoint direction, or a direction determined from the arrangement of objects or the like in the space.
  • the enhancement direction acquisition unit 21 acquires attenuation direction information indicating the attenuation direction by some method and supplies it to the enhancement speech generation unit 22.
  • the enhanced speech generation unit 22 performs a process of attenuating the sound of the sound source in the attenuation direction in the spatial frequency domain on the recorded signal.
  • Specifically, from among predetermined directions covering almost all directions as viewed from the viewer, such as front, rear, left, and right, the enhanced speech generation unit 22 selects every direction other than those close to the attenuation direction as an enhancement direction, obtains the matrix F for each of those enhancement directions, and takes the sum of these matrices F as the final matrix F.
  • The enhanced speech generation unit 22 then calculates equation (10) using the obtained matrix F, thereby obtaining a vector $S'_{BF}$ in which the directions other than the enhancement directions, that is, the attenuation direction, are relatively attenuated.
  • the direction close to the attenuation direction may be, for example, a direction in which an angle formed with the attenuation direction is a predetermined threshold value or less.
  • In this way, a sound field in which the sound from the attenuation direction is attenuated can be formed, that is, such a vector $S'_{BF}$ can be obtained.
  • The directivity of the obtained sound field can also be widened. If the directivity is widened in this way, not only the sound coming from directions far from the attenuation direction but also the sound coming from directions close to the attenuation direction is heard loudly, so that the sound from the attenuation direction is relatively attenuated.
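  • The attenuation scheme described above can be sketched as follows. The candidate-direction grid, the threshold test, and the helper names are assumptions of this sketch; `beam_matrix` is reused from the sketch above.

```python
def angular_distance(a, b):
    """Great-circle angle between two (elevation, azimuth) directions."""
    (t1, p1), (t2, p2) = a, b
    c = np.cos(t1) * np.cos(t2) + np.sin(t1) * np.sin(t2) * np.cos(p1 - p2)
    return np.arccos(np.clip(c, -1.0, 1.0))

def attenuation_matrix(candidate_dirs, att_dir, threshold, order):
    """Sum the matrices F over every candidate direction that is not close to
    the attenuation direction; using this F in equation (10) relatively
    attenuates sound arriving from the attenuation direction."""
    dim = (order + 1) ** 2
    F = np.zeros((dim, dim), dtype=complex)
    for d in candidate_dirs:
        if angular_distance(d, att_dir) > threshold:
            F += beam_matrix(d[0], d[1], order)
    return F
```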
  • When the recorded signal is an object-based sound, the vector $S'_{BF}$ can be obtained by performing similar processing.
  • For example, the spatial frequency spectrum $P'^{m}_{n}$ of the sound field formed by an object-based sound source at a predetermined position in the space can be expressed in the spherical harmonic domain as in the following equation (12), where j is the imaginary unit, ω is the angular frequency, c is the speed of sound, $h_n^{(2)}$ is the spherical Hankel function of the second kind, $(r_s, \theta_s, \phi_s)$ is the position of the sound source expressed in spherical coordinates, and $X(n_{tf})$ is the sound source signal corresponding to the recorded signal:

    $P'^{m}_{n} = -j\,\dfrac{\omega}{c}\, h_n^{(2)}\!\Big(\dfrac{\omega}{c}\, r_s\Big)\, Y_n^m(\theta_s, \phi_s)^{*}\, X(n_{tf})$ … (12)

  • That is, instead of a recorded signal obtained by sound collection with a microphone array, a time-frequency-domain signal for each object sound source, that is, the time-frequency spectrum $X(n_{tf})$, is supplied to the enhanced speech generation unit 22.
  • The enhanced speech generation unit 22 calculates equation (12) from the input sound source signal $X(n_{tf})$ to obtain the spatial frequency spectrum $P'^{m}_{n}$ of the object sound source expressed in the spherical harmonic domain, and then performs the calculation of the following equation (13):

    $S'_{BF} = F P'$ … (13)

  • That is, the product of the vector P′ composed of the spatial frequency spectra $P'^{m}_{n}$ of each order n and m and the matrix F of equation (11) is calculated, and the vector $S'_{BF}$ consisting of the spatial frequency spectrum of the sound field enhanced in the enhancement direction is obtained. Here too, beamforming is performed by a computation in the spatial frequency domain on the signal expressed in the spherical harmonic domain.
  • The vector P′ is composed of the spatial frequency spectra $P'^{m}_{n}$ of each order n and m, as in the following equation (14):

    $P' = [P'^{0}_{0},\ P'^{-1}_{1},\ P'^{0}_{1},\ P'^{1}_{1},\ \ldots]^{T}$ … (14)
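  • A sketch of equation (12) follows, building the spherical Hankel function of the second kind from SciPy's spherical Bessel functions; the function name and array layout are ours.

```python
from scipy.special import spherical_jn, spherical_yn

def object_spatial_spectrum(X, r_s, theta_s, phi_s, omega, c, order):
    """Equation (12): spherical harmonic spectrum P' of a point source at
    (r_s, theta_s, phi_s) driven by the time-frequency spectrum value X."""
    k = omega / c
    P = []
    for n in range(order + 1):
        # spherical Hankel function of the second kind: h_n^(2) = j_n - i y_n
        h2 = spherical_jn(n, k * r_s) - 1j * spherical_yn(n, k * r_s)
        for m in range(-n, n + 1):
            P.append(-1j * k * h2 * np.conj(sph_harm(m, n, phi_s, theta_s)) * X)
    return np.array(P)
```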
  • Alternatively, the sound field may be emphasized in the emphasis direction simply by increasing the gain of the sound source signal $X(n_{tf})$ of an object sound source close to the enhancement direction $(\theta_{BF}, \phi_{BF})$.
  • In that case, the enhanced speech generation unit 22 selects, as an object sound source close to the emphasis direction, any object sound source for which the angle (difference in angle) between the direction of the object sound source as viewed from the viewer in the space and the emphasis direction is equal to or less than a threshold.
  • The enhanced speech generation unit 22 then multiplies the vector P′ obtained for each object sound source close to the enhancement direction by a predetermined gain value GA greater than 1, and obtains the vector $S'_{BF}$ by adding the gain-adjusted vectors P′ to the vectors P′ obtained for the remaining object sound sources that are not close to the enhancement direction.
  • In this way, by performing gain adjustment of the vector P′, that is, of the spatial frequency spectra $P'^{m}_{n}$, of the object sound sources close to the enhancement direction, the sound of a desired direction can be emphasized by a simple process.
  • Conversely, the vectors P′ of the object sound sources other than those near the emphasis direction may be multiplied by a gain value of less than 1, so that the sound in the emphasis direction is relatively emphasized by lowering the other sounds.
  • Similarly, the sound from an object sound source in a specific direction may be attenuated by multiplying the vector P′ of the desired object sound source by a gain value of less than 1.
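  • The simple gain-adjustment alternative can be sketched as below, reusing `angular_distance` from the earlier sketch; the gain value and the threshold are illustrative, not values from the patent.

```python
def gain_emphasize(P_list, source_dirs, bf_dir, threshold, gain=2.0):
    """Multiply the P' vectors of object sound sources whose direction is
    within `threshold` radians of the enhancement direction by a gain GA > 1
    (here `gain`), leave the rest unchanged, and sum, as described above."""
    S_bf = np.zeros_like(P_list[0])
    for P, d in zip(P_list, source_dirs):
        g = gain if angular_distance(d, bf_dir) <= threshold else 1.0
        S_bf = S_bf + g * P
    return S_bf
```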
  • The spatial frequency synthesis unit 23 performs a spatial frequency inverse transform on the spatial frequency spectrum $S'^{m}_{n\,\mathrm{BF}}(n_{tf})$ supplied from the enhanced speech generation unit 22, using a spherical harmonic function matrix formed from the angles $(\theta_l, \phi_l)$ that indicate the directions of the speakers constituting the speaker array 25 and are indicated by the speaker arrangement information, and obtains a time-frequency spectrum. That is, the spatial frequency inverse transform is performed as the spatial frequency synthesis.
  • each speaker constituting the speaker array 25 is also referred to as a speaker unit.
  • The number of speaker units constituting the speaker array 25 is L, and the speaker unit index indicating each speaker unit is l, where l = 0, 1, ..., L-1.
  • The speaker arrangement information supplied from the outside to the spatial frequency synthesis unit 23 consists of the angles $(\theta_l, \phi_l)$ indicating the directions of the speaker units indicated by the speaker unit indices l.
  • $\theta_l$ and $\phi_l$ constituting the angle $(\theta_l, \phi_l)$ of a speaker unit are the elevation angle and the azimuth angle of the speaker unit, corresponding to the elevation angle $\theta_i$ and the azimuth angle $\phi_i$ described above, measured from a predetermined reference direction.
  • The spatial frequency synthesis unit 23 performs the spatial frequency inverse transform by calculating the following equation (15) from the spherical harmonic functions $Y_n^m(\theta_l, \phi_l)$ obtained for the angles $(\theta_l, \phi_l)$ indicating the directions of the speaker units and the spatial frequency spectrum $S'^{m}_{n\,\mathrm{BF}}(n_{tf})$, and obtains the time-frequency spectra $D(l, n_{tf})$:

    $D = Y_{SP}\, S'_{BF}$ … (15)

  • In equation (15), D is the vector of the time-frequency spectra $D(l, n_{tf})$ of equation (16), $S'_{BF}$ is the vector of the spatial frequency spectra $S'^{m}_{n\,\mathrm{BF}}(n_{tf})$ of equation (17), and $Y_{SP}$ is the spherical harmonic function matrix of equation (18), composed of the spherical harmonic functions $Y_n^m(\theta_l, \phi_l)$:

    $D = [D(0, n_{tf}),\ D(1, n_{tf}),\ \ldots,\ D(L-1, n_{tf})]^{T}$ … (16)

    $S'_{BF} = [S'^{0}_{0\,\mathrm{BF}}(n_{tf}),\ S'^{-1}_{1\,\mathrm{BF}}(n_{tf}),\ \ldots]^{T}$ … (17)

    $Y_{SP} = \begin{pmatrix} Y_0^0(\theta_0, \phi_0) & Y_1^{-1}(\theta_0, \phi_0) & \cdots \\ \vdots & \vdots & \ddots \\ Y_0^0(\theta_{L-1}, \phi_{L-1}) & Y_1^{-1}(\theta_{L-1}, \phi_{L-1}) & \cdots \end{pmatrix}$ … (18)
  • The spatial frequency synthesis unit 23 supplies the time-frequency spectra $D(l, n_{tf})$ thus obtained to the time-frequency synthesis unit 24.
  • The time-frequency synthesis unit 24 performs time-frequency synthesis by applying an IDFT (Inverse Discrete Fourier Transform) to the time-frequency spectra $D(l, n_{tf})$ supplied from the spatial frequency synthesis unit 23, calculating the following equation (19), and computes the speaker drive signals $d(l, n_d)$, which are time signals. In equation (19), $n_d$ is the time index, $M_{dt}$ is the number of IDFT samples, and j is the pure imaginary unit:

    $d(l, n_d) = \dfrac{1}{M_{dt}} \displaystyle\sum_{n_{tf}=0}^{M_{dt}-1} D(l, n_{tf})\, e^{\,j \frac{2\pi n_{tf} n_d}{M_{dt}}}$ … (19)
  • The time-frequency synthesis unit 24 supplies the speaker drive signals $d(l, n_d)$ thus obtained to the speaker units constituting the speaker array 25 to reproduce the sound of the free viewpoint content.
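  • Equations (15) and (19) amount to a matrix product followed by an inverse DFT per speaker unit. A sketch, reusing `sh_matrix` from the earlier sketch; NumPy's `ifft` already contains the $1/M_{dt}$ factor and the exponential kernel of equation (19).

```python
def synthesize(S_bf, theta_sp, phi_sp, order):
    """S_bf: spatial spectra of shape ((order+1)**2, M_dt) over time bins.
    Returns the speaker drive signals d(l, n_d), one row per speaker unit."""
    Y_sp = sh_matrix(order, np.asarray(theta_sp), np.asarray(phi_sp))  # eq. (18)
    D = Y_sp @ S_bf                      # equation (15): D = Y_SP S'_BF
    return np.fft.ifft(D, axis=-1).real  # equation (19): IDFT per speaker unit
```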
  • In step S11, the emphasis direction acquisition unit 21 acquires the emphasis direction information and supplies it to the enhanced speech generation unit 22.
  • For example, the enhancement direction acquisition unit 21 generates the enhancement direction information by detecting the viewpoint direction or the direction of a specific object and setting it as the enhancement direction, and supplies the enhancement direction information to the enhanced speech generation unit 22.
  • In step S12, the enhanced speech generation unit 22 uses the enhancement direction information supplied from the enhancement direction acquisition unit 21 to perform, on the supplied recorded signal, the process of enhancing the sound of the sound source in the enhancement direction in the spatial frequency domain.
  • That is, the enhanced speech generation unit 22 calculates equation (10) from the matrix F of equation (11) obtained for the enhancement direction information and the vector S′ composed of the spatial frequency spectra $S'^{m}_{n}(n_{tf})$ serving as the recorded signal, and computes the vector $S'_{BF}$ by beamforming.
  • The enhanced speech generation unit 22 supplies the vector $S'_{BF}$ thus obtained, that is, the spatial frequency spectra $S'^{m}_{n\,\mathrm{BF}}(n_{tf})$, to the spatial frequency synthesis unit 23.
  • When the recorded signal is an object-based sound signal, the enhanced speech generation unit 22 performs the calculations of equations (12) and (13) described above to compute the vector $S'_{BF}$.
  • In step S12, not only the process of enhancing the sound of the sound source in the enhancement direction but also the process of attenuating the sound of the sound source in the attenuation direction may be performed at the same time.
  • In step S13, the spatial frequency synthesis unit 23 calculates the above-described equation (15) based on the spatial frequency spectrum $S'^{m}_{n\,\mathrm{BF}}(n_{tf})$ supplied from the enhanced speech generation unit 22 and the speaker arrangement information supplied from the outside, and performs the spatial frequency inverse transform.
  • The spatial frequency synthesis unit 23 supplies the time-frequency spectra $D(l, n_{tf})$ obtained by the spatial frequency inverse transform to the time-frequency synthesis unit 24.
  • In step S14, the time-frequency synthesis unit 24 performs time-frequency synthesis on the time-frequency spectra $D(l, n_{tf})$ supplied from the spatial frequency synthesis unit 23 by calculating the above equation (19), and computes the speaker drive signals $d(l, n_d)$.
  • The time-frequency synthesis unit 24 supplies the obtained speaker drive signals $d(l, n_d)$ to the speaker units constituting the speaker array 25.
  • In step S15, the speaker array 25 reproduces the sound of the free viewpoint content based on the speaker drive signals $d(l, n_d)$ supplied from the time-frequency synthesis unit 24, and the sound field enhancement process ends.
  • the sound field enhancing device 11 performs processing for enhancing the sound of the sound source in the enhancement direction with respect to the recorded signal in the spatial frequency domain, and reproduces the sound of the free viewpoint content.
  • the sound field enhancement device is configured as shown in FIG. 7, for example.
  • FIG. 7 portions corresponding to those in FIG. 4 are denoted by the same reference numerals, and description thereof is omitted as appropriate.
  • The sound field enhancement device 51 shown in FIG. 7 includes an enhancement direction acquisition unit 21, a speech enhancement filter coefficient recording unit 61, a filter unit 62, and a speaker array 25.
  • the speech enhancement filter coefficient recording unit 61 records a speech enhancement filter coefficient that is a coefficient of an audio filter for realizing a process of enhancing sound from that direction for each direction in the space.
  • the speech enhancement filter coefficient recording unit 61 selects a speech enhancement filter coefficient in the enhancement direction indicated by the enhancement direction information supplied from the enhancement direction acquisition unit 21 from among the speech enhancement filter coefficients recorded in advance in each direction. This is supplied to the filter unit 62.
  • The filter unit 62 convolves the sound source signal $x(n_d)$ of the object sound source supplied from the outside with the speech enhancement filter coefficients supplied from the speech enhancement filter coefficient recording unit 61, calculates the speaker drive signals $d(l, n_d)$, and supplies them to the speaker array 25.
  • The sound source signal $x(n_d)$ of the object sound source is a signal in the time domain, that is, a time waveform signal.
  • Specifically, the calculations of equations (12), (13), (15), and (19) described above are performed with the sound source signal $X(n_{tf})$ set to 1, that is, with the sound source signal $X(n_{tf})$ excluded from the calculation.
  • The speaker drive signal $d(l, n_d)$ obtained in this way represents the filter coefficients themselves, independent of the sound source. Therefore, the speech enhancement filter coefficients $h(l, m)$ for the enhancement direction $(\theta_{BF}, \phi_{BF})$ are obtained by replacing the time index $n_d$ of the speaker drive signal $d(l, n_d)$ with the tap index m.
  • Speech enhancement filter coefficients $h(l, m)$ are obtained in advance for all directions that can serve as the enhancement direction $(\theta_{BF}, \phi_{BF})$ and recorded in the speech enhancement filter coefficient recording unit 61.
  • At the time of reproduction, the filter unit 62 calculates the following equation (20), convolving the sound source signal $x(n_d)$ supplied from the outside with the speech enhancement filter coefficients $h(l, m)$, to compute the speaker drive signals $d(l, n_d)$. In equation (20), N is the filter length:

    $d(l, n_d) = \displaystyle\sum_{m=0}^{N-1} h(l, m)\, x(n_d - m)$ … (20)
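  • Equation (20) is an ordinary FIR convolution per speaker unit; a minimal sketch, with the array shapes being our assumption:

```python
def drive_signals(x, h):
    """x: source signal x(n_d); h: filter coefficients h(l, m) of shape (L, N).
    Returns the speaker drive signals d(l, n_d) of equation (20)."""
    return np.stack([np.convolve(h_l, x)[: len(x)] for h_l in h])
```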
  • The process in step S41 is the same as the process in step S11 described above.
  • In step S42, the speech enhancement filter coefficient recording unit 61 selects, from among the speech enhancement filter coefficients recorded in advance for each direction, the speech enhancement filter coefficients for the enhancement direction indicated by the enhancement direction information supplied from the enhancement direction acquisition unit 21, and supplies them to the filter unit 62.
  • In step S43, the filter unit 62 calculates equation (20), performing a filter process that convolves the sound source signal supplied from the outside with the speech enhancement filter coefficients supplied from the speech enhancement filter coefficient recording unit 61.
  • the filter unit 62 supplies the speaker drive signal obtained by the filter process to the speaker array 25.
  • In step S44, the speaker array 25 reproduces the sound of the free viewpoint content based on the speaker drive signals supplied from the filter unit 62, and the sound field enhancement process ends.
  • As described above, the sound field enhancement device 51 performs the process of enhancing the sound of the sound source in the enhancement direction in the spatial frequency domain by means of the filter process, and reproduces the sound of the free viewpoint content. In this way, a desired sound can be emphasized.
  • <Modification> When the arrangement of objects and the like in the space can be determined from video information such as the video metadata of the free viewpoint content, sound enhancement or attenuation may be performed according to the arrangement of those objects.
  • the information indicating the arrangement of objects and the like in space is not limited to metadata, and may be any other information.
  • In the example of FIG. 9, it is assumed that there are three sound sources AS41 to AS43 and a predetermined object OB41 in the space. In FIG. 9, parts corresponding to those in FIG. 1 are denoted by the same reference numerals, and their description is omitted as appropriate.
  • In this example, the sound source AS41 and the sound source AS42 are visible from the viewer U11, but the object OB41 is a shielding object, and the sound source AS43 cannot be seen from the position of the viewer U11. That is, the sound source AS43 is hidden behind the object OB41.
  • In such a case, the enhancement direction acquisition unit 21 obtains, based on the metadata of the free viewpoint content, the direction, as seen from the viewer U11, of each sound source (object) that is visible to the viewer U11, and sets it as an enhancement direction.
  • Similarly, the direction, as seen from the viewer U11, of each sound source (object) that is not visible to the viewer U11 is obtained and set as an attenuation direction.
  • That is, the enhancement direction acquisition unit 21 sets the direction of the sound source AS41 and the direction of the sound source AS42 as seen from the viewer U11 as enhancement directions, and sets the direction of the sound source AS43 as seen from the viewer U11 as an attenuation direction.
  • Alternatively, the direction of the sound source AS43 may not be set as an attenuation direction, so that the direction of the sound source AS43 is neither emphasized nor attenuated.
  • As a result, the sound from the sound source AS41 and the sound source AS42 is emphasized, the sound from the sound source AS43 that cannot be seen by the viewer U11 is attenuated, and a higher sense of presence can be obtained when the sound of the free viewpoint content is reproduced.
  • In addition, a volume adjustment area for which a specific volume adjustment is determined in advance may be provided in the space, and a sound field may be formed so that the sound of a specific sound source is emphasized, or the sound of a specific sound source is reproduced, only while the viewer is in the volume adjustment area.
  • In that case, the enhancement direction acquisition unit 21 sets a predetermined direction as the enhancement direction while the viewer is in the volume adjustment area, and sets no enhancement direction while the viewer is outside it. This makes it possible to adjust the volume of sounds or change the sound to be reproduced, for example by emphasizing the sound from a desired sound source, depending on whether the viewer is inside or outside the volume adjustment area.
  • a desired sound can be emphasized even when headphones or the like are used as a reproduction apparatus.
  • In that case, however, some of the spatial information of the sound is lost and the reproducibility of the sound wavefront is reduced.
  • the series of processes described above can be executed by hardware or can be executed by software.
  • a program constituting the software is installed in the computer.
  • Here, the computer includes a computer incorporated in dedicated hardware and, for example, a general-purpose computer capable of executing various functions by installing various programs.
  • FIG. 10 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processing by a program.
  • In the computer, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to one another via a bus 504.
  • An input / output interface 505 is further connected to the bus 504.
  • An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input / output interface 505.
  • the input unit 506 includes a keyboard, a mouse, a microphone, an image sensor, and the like.
  • the output unit 507 includes a display, a speaker, and the like.
  • the recording unit 508 includes a hard disk, a nonvolatile memory, and the like.
  • the communication unit 509 includes a network interface or the like.
  • the drive 510 drives a removable recording medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
  • In the computer configured as described above, the CPU 501 loads the program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes it, whereby the series of processes described above is performed.
  • the program executed by the computer (CPU 501) can be provided by being recorded in a removable recording medium 511 as a package medium or the like, for example.
  • the program can be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
  • the program can be installed in the recording unit 508 via the input / output interface 505 by attaching the removable recording medium 511 to the drive 510. Further, the program can be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. In addition, the program can be installed in advance in the ROM 502 or the recording unit 508.
  • The program executed by the computer may be a program whose processing is performed in time series in the order described in this specification, or a program whose processing is performed in parallel or at necessary timing, such as when a call is made.
  • the present technology can take a cloud computing configuration in which one function is shared by a plurality of devices via a network and is jointly processed.
  • each step described in the above flowchart can be executed by one device or can be shared by a plurality of devices.
  • Further, when one step includes a plurality of processes, those processes can be executed by one device or shared and executed by a plurality of devices.
  • the present technology can be configured as follows.
  • a signal processing apparatus comprising: a control unit that performs processing for enhancing sound of a sound source in a specific direction in a spatial frequency domain on an audio signal and outputs the audio signal obtained by the processing.
  • the signal processing apparatus according to (1) wherein the control unit performs a process of giving directivity to the sound of the sound source in the specific direction as the process.
  • the control unit performs gain adjustment of the audio signal as the processing.
  • the signal processing apparatus according to any one of (1) to (5), wherein the specific direction is a user's viewpoint direction in space.
  • the signal processing apparatus wherein the viewpoint direction is detected by performing image recognition on an image captured with the user as a subject, or detecting the orientation of the user's head using a sensor.
  • the signal processing device according to any one of (1) to (5), wherein the specific direction is a direction of a specific sound source viewed from a user in space.
  • The signal processing device according to (8), wherein the direction of the specific sound source viewed from the user is obtained based on position information indicating the position of the specific sound source included in metadata of content including sound based on the audio signal.
  • the signal processing device according to any one of (1) to (13), wherein the control unit performs the processing on the audio signal when the user is in a predetermined area in space.
  • the signal processing apparatus according to any one of (1) to (14), wherein the control unit performs a process of attenuating sound of a sound source in a predetermined direction in a spatial frequency domain with respect to the audio signal.
  • a signal processing method comprising: performing a process of enhancing a sound of a sound source in a specific direction in a spatial frequency domain on an audio signal, and outputting the audio signal obtained by the process.
  • 11 sound field enhancement device, 21 enhancement direction acquisition unit, 22 enhanced speech generation unit, 23 spatial frequency synthesis unit, 24 time-frequency synthesis unit, 25 speaker array

Abstract

The present technology relates to a signal processing device, method, and program configured so as to be capable of emphasizing a desired sound. The signal processing device includes a control unit that performs processing on an audio signal so as to emphasize sound from a sound source in a specified direction in the spatial frequency domain, and outputs the audio signal obtained from this processing. This technology can be applied to sound field enhancement devices.
PCT/JP2017/034138 2016-10-05 2017-09-21 Signal processing device, method, and program WO2018066376A1

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016196889 2016-10-05
JP2016-196889 2016-10-05

Publications (1)

Publication Number Publication Date
WO2018066376A1 2018-04-12

Family

ID=61831367

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/034138 WO2018066376A1 2016-10-05 2017-09-21 Signal processing device, method, and program

Country Status (1)

Country Link
WO (1) WO2018066376A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09275533A * 1996-04-08 1997-10-21 Sony Corp Signal processing device
JP2010206265A * 2009-02-27 2010-09-16 Toshiba Corp Sound image control device, sound image control method, stream data structure, and stream generation device
WO2015159731A1 * 2014-04-16 2015-10-22 Sony Corporation Sound field reproduction apparatus, method, and program
WO2016152511A1 * 2015-03-23 2016-09-29 Sony Corporation Sound source separation device and method, and program

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113109763A * 2020-01-13 2021-07-13 Beijing Horizon Robotics Technology R&D Co., Ltd. Sound source position determination method and device, readable storage medium, and electronic device
CN113109763B * 2020-01-13 2023-08-25 Beijing Horizon Robotics Technology R&D Co., Ltd. Sound source position determination method and device, readable storage medium, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17858215

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17858215

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP