WO2024161995A1 - Signal processing device, signal processing method, and signal processing program - Google Patents

Signal processing device, signal processing method, and signal processing program

Info

Publication number
WO2024161995A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
sound
target sound
volume
sound signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2024/001058
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
萌絵 高田
亮太 宮中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Intellectual Property Management Co Ltd
Original Assignee
Panasonic Intellectual Property Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panasonic Intellectual Property Management Co Ltd
Priority to JP2024574408A: JPWO2024161995A1 (ja)
Priority to EP24749949.4A: EP4641565A4 (en)
Publication of WO2024161995A1: WO2024161995A1 (ja)
Priority to US19/286,570: US20250365537A1 (en)
Anticipated expiration
Legal status: Ceased

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324 Details of processing therefor
    • G10L21/034 Automatic adjustment
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers
    • H04R3/04 Circuits for transducers for correcting frequency response
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00 Signal processing covered by H04R, not provided for in its groups
    • H04R2430/01 Aspects of volume control, not necessarily automatic, in sound systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
    • H04R2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R2499/10 General applications
    • H04R2499/13 Acoustic transducers and sound field adaptation in vehicles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/13 Aspects of volume control, not necessarily automatic, in stereophonic sound systems

Definitions

  • This disclosure relates to technology for reproducing audio source signals.
  • Patent Document 1 discloses a technology for performing spectrum emphasis according to the degree of deterioration of the frequency selectivity of a hearing aid user. Specifically, Patent Document 1 discloses separating an input audio signal into a first band audio signal and a second band audio signal that is lower in frequency than the first band audio signal, performing a Fourier transform on the separated first band audio signal, extracting a fundamental wave component of a vowel and a part of its harmonics from the obtained signal, generating an attenuation waveform (emphasis waveform) according to the degree of deterioration of the individual's frequency selectivity based on the extracted fundamental wave component and its harmonic components, convolving the generated attenuation waveform into the first band audio signal, and adding the convolved sound data to the second band audio signal.
  • Patent Document 2 discloses a technique for effectively emphasizing voice components and background components contained in a sound source signal. Specifically, Patent Document 2 discloses separating an input sound source signal into a voice signal and a background sound signal, multiplying the voice signal by a first gain, multiplying the background sound signal by a second gain, and adding and outputting the voice signal multiplied by the first gain and the background sound signal multiplied by the second gain.
  • This disclosure was made in consideration of these issues, and aims to provide technology that makes it easier to hear target sounds in noisy environments.
  • A signal processing device according to one aspect of the present disclosure includes an acquisition unit that acquires a sound source signal, a separation unit that separates the acquired sound source signal into a target sound signal and a background sound signal, a volume adjustment unit that adjusts the volume of the separated target sound signal to emphasize the target sound signal, an addition unit that generates an output signal by adding the emphasized target sound signal to the sound source signal, and an output unit that outputs a sound represented by the output signal from a speaker.
  • This disclosure makes it easier to hear the target sound in a noisy environment.
  • FIG. 1 is an installation diagram of an audio device according to an embodiment of the present disclosure.
  • FIG. 2 is a block diagram showing an example of a configuration of an audio device according to an embodiment of the present disclosure.
  • FIG. 3 is a block diagram showing an example of the configuration of a separation unit configured using Conv-Tasnet.
  • FIG. 4 is a flowchart illustrating an example of processing performed by a signal processing device according to an embodiment of the present disclosure.
  • FIG. 5 is a diagram showing a state of signal processing in a comparative example in which auto gain control is not applied.
  • FIG. 6 is a diagram illustrating the effect of auto gain control.
  • FIG. 7 is an explanatory diagram of a compressor process.
  • FIG. 8 is a diagram showing an overview of processing performed by a signal processing device according to the present embodiment.
  • FIG. 11 is a waveform
  • In Patent Document 1, an attenuation waveform (emphasis waveform) is generated from a first band audio signal according to the degree of deterioration of an individual's frequency selectivity, the generated attenuation waveform is convolved with the first band audio signal, and the resulting sound data is added to the second band audio signal to generate an output signal.
  • Because the sound data obtained by convolution is added to the second band audio signal, any distortion that occurs in the process of generating the attenuation waveform may be output as is without being suppressed.
  • In Patent Document 2, the output signal is generated by adding a voice signal multiplied by a first gain and a background sound signal multiplied by a second gain, so any distortion that occurs in the voice signal multiplied by the first gain may likewise be output as is without being suppressed.
  • the inventors therefore discovered that if the emphasized target sound signal is added to a sound source signal that has not been processed and therefore is undistorted, the distortion that occurs in the process of emphasizing the target sound signal is compensated for by the sound source signal, thereby making it easier to hear the target sound in a noisy environment, and this led to the various aspects of the present disclosure.
  • A signal processing device according to one aspect of the present disclosure includes an acquisition unit that acquires a sound source signal, a separation unit that separates the acquired sound source signal into a target sound signal and a background sound signal, a volume adjustment unit that adjusts the volume of the separated target sound signal to emphasize the target sound signal, an addition unit that generates an output signal by adding the emphasized target sound signal to the sound source signal, and an output unit that outputs a sound represented by the output signal from a speaker.
  • With this configuration, the emphasized target sound signal is added to the sound source signal to generate an output signal. Therefore, even if distortion occurs in the process of generating the emphasized target sound signal, the distortion is compensated for by the sound source signal and is suppressed. This makes it easier to hear the target sound in a noisy environment.
  • the separation unit is configured with a learning model that is generated in advance to separate the sound source signal into the target sound signal and the background sound signal, and the learning data used to train the learning model may be generated by combining the target sound signal and at least one type of background sound signal.
  • the learning data is generated by combining the target sound signal with at least one type of background sound signal, so that it is easy to generate learning data that can handle a variety of cases. And because the learning model is trained using such learning data, the target sound signal and background sound signal can be accurately separated from various sound source signals.
  • the emphasized target sound signal and the sound source signal may each be a time signal, and the adder may add the emphasized target sound signal and the sound source signal in the time domain.
  • the emphasized target sound signal and the sound source signal are each a time signal and are added in the time domain, so that the occurrence of distortion can be further suppressed.
  • the volume adjustment unit generates the emphasized target sound signal by auto gain control, and the auto gain control may amplify the target sound signal if the volume of the target sound signal does not exceed a reference volume, and attenuate the target sound signal if the volume of the target sound signal exceeds the reference volume so that the volume of the target sound signal becomes smaller than the reference volume.
  • the target sound signal that does not exceed the reference volume is amplified, and the target sound signal that exceeds the reference volume is attenuated so as not to exceed the reference volume, so that small sounds contained in the target sound signal can be emphasized while preventing the target sound signal from exceeding the reference volume.
  • the output unit may perform a compressor process to compress the output signal so that the volume of the output signal does not exceed a maximum volume.
  • the output signal is compressed so that its volume does not exceed the maximum volume, preventing clipping of the output signal.
  • the speaker may be an array speaker.
  • the output signal can be heard only within a specified area.
  • the target sound signal may be a speech signal indicating a voice spoken by a person.
  • This configuration makes it possible to avoid speech signals becoming difficult to hear in noisy environments.
  • the sound source signal may be an interior sound signal indicating an interior sound of a moving vehicle, and the target sound signal may be a signal indicating an alarm sound or a sound output from a car navigation system.
  • This configuration makes it possible to hear the surrounding environmental sounds while avoiding the problem of it being difficult to hear the alarm sound or the sound output from the car navigation system inside a moving vehicle.
  • the sound source signal may be an acoustic signal representing the sounds of a plurality of musical instruments, and the target sound signal may be a signal representing the sound of a specific musical instrument among the plurality of musical instruments.
  • This configuration makes it possible to clearly hear the sound of a specific instrument from an audio signal.
  • the sound source signal may be a content sound signal indicating a content sound included in a video content, and the target sound signal may be a signal indicating a specific sound effect from among the content sounds.
  • This configuration makes it possible to clearly hear certain sound effects from the content sounds.
  • the signal processing device may be installed in a booth provided inside a vehicle.
  • This configuration makes it easier to hear the target sound, and prevents noise inside the vehicle from making the target sound signal difficult to hear.
  • the signal processing device may be installed in a booth provided inside a vehicle, and the reference volume may be the volume of the emphasized target sound signal that is expected to leak outside the booth when the sound is output from the speaker.
  • the volume of the target sound signal is reduced below the reference volume by the auto gain control, preventing the sound output from the speaker from leaking outside the booth.
  • the auto gain control may amplify the target sound signal with a predetermined gain if the volume of the target sound signal does not exceed the reference volume, and the predetermined gain may have a value that can make the volume of a whisper included in the target sound signal louder than the volume of noise heard by the user.
  • the automatic gain control makes the volume of the whispering louder than the volume of the noise around the speaker, allowing the user to hear the whispering.
  • A signal processing method in another aspect of the present disclosure is a signal processing method in a signal processing device, which acquires a sound source signal, separates the acquired sound source signal into a target sound signal and a background sound signal, adjusts the volume of the separated target sound signal to emphasize the target sound signal, generates an output signal by adding the emphasized target sound signal to the sound source signal, and outputs the sound represented by the output signal from a speaker.
  • This configuration provides a signal processing method that can prevent the target sound signal from becoming difficult to hear in a noisy environment.
  • A signal processing program causes a processor to execute a process of acquiring a sound source signal, separating the acquired sound source signal into a target sound signal and a background sound signal, adjusting the volume of the separated target sound signal to emphasize the target sound signal, generating an output signal by adding the emphasized target sound signal to the sound source signal, and outputting the sound represented by the output signal from a speaker.
  • This configuration makes it possible to provide a signal processing program that can prevent the target sound signal from becoming difficult to hear in a noisy environment.
  • the present disclosure can also be realized as a signal processing system that operates according to such a signal processing program. It goes without saying that such a computer program can be distributed on a non-transitory computer-readable recording medium such as a CD-ROM or via a communication network such as the Internet.
  • FIG. 1 is an installation diagram of an audio device 1 according to an embodiment of the present disclosure.
  • the audio device 1 is installed inside a booth 2.
  • the booth 2 is, for example, a partition provided for each seat 3 inside an airplane.
  • the booth 2 is installed so as to surround the seat 3.
  • the audio device 1 includes a speaker 13.
  • the booth 2 includes a side wall 2a provided on one side of the seat 3 and a side wall 2b provided on the other side of the seat 3.
  • the speaker 13 is provided on, for example, the side wall 2a.
  • the speaker 13 may be a pair of speakers.
  • the pair of speakers 13 are installed on, for example, the side wall 2a and the side wall 2b.
  • the installation position of the speaker 13 is not particularly limited.
  • the speaker 13 is, for example, an array speaker. This sets a reproduction area for the sound output from the speaker 13 inside the booth 2 and a non-reproduction area outside the booth 2, which prevents sound from leaking from the speaker 13 to the outside of the booth 2. Because engine noise, wind noise, and other noise are intense inside an airplane, the user U has difficulty hearing the sound output from the speaker 13. In particular, content such as a movie is often reproduced by the audio device 1 in an airplane, and the noise makes it hard for the user U to hear spoken voice, such as lines, among the sounds of the content. On the other hand, raising the overall volume of the sound output from the speaker 13 may cause sound leakage. The audio device 1 is therefore equipped with a signal processing device 10 (FIG. 2) that emphasizes the spoken voice so that it can be heard easily.
  • the signal of the sound to be emphasized, such as the spoken voice, is referred to as the target sound signal.
  • the audio device 1 includes a signal processing device 10 and a speaker 13.
  • the signal processing device 10 includes a processor 11 and a memory 12.
  • the processor 11 is, for example, a CPU or a signal processing circuit.
  • the processor 11 includes an acquisition unit 111, a separation unit 112, a volume adjustment unit 113, an addition unit 114, an output unit 115, and a learning model generation unit 116.
  • the acquisition unit 111 to the learning model generation unit 116 may be realized by the processor 11 executing a signal processing program, or may be configured as a dedicated hardware circuit.
  • all or some of the components of the signal processing device 10 may be provided in a cloud server.
  • the memory 12 is composed of a non-volatile rewritable storage device such as a flash memory.
  • the memory 12 stores a sound source signal D0.
  • the sound source signal D0 is a sound signal included in content such as a movie.
  • the learning model generation unit 116 may be provided in a learning device different from the audio device 1.
  • the acquisition unit 111 acquires the sound source signal D0 from the memory 12.
  • the separation unit 112 is configured with a learning model that is generated in advance to separate the sound source signal D0 acquired by the acquisition unit 111 into a target sound signal D1 and a background sound signal D2 (not shown).
  • the target sound indicated by the target sound signal D1 is, for example, a person's speech (e.g., a line) among the sounds contained in the content.
  • the background sound indicated by the background sound signal D2 is a sound other than speech among the sounds contained in the content, such as traffic noise, music that does not include vocals, and sound effects.
  • the learning model may be, for example, a model constructed by a deep neural network.
  • the learning data used for learning the learning model is generated by combining a target sound signal and at least one type of background sound signal.
  • examples of the target sound signal D1 include a person's speech, a speech translated from a first language into a second language, a person's whispering voice, and an emotional sound emitted by a person to express an emotion.
  • the first language is the speaker's native language, and the second language is a language other than the native language.
  • examples of the second language are English, French, German, Chinese, etc.
  • the types of background sound signals D2 are determined so as to correspond to various scenes in a movie. Examples of the types of background sound signals D2 include traffic noise, music without vocals, and sound effects. Note that the learning data may include multiple types of background sound signals. For example, one combination of learning data is one type of target sound signal (e.g., a Japanese speech signal) and two types of background sound signals (e.g., traffic noise and music).
  • the learning model generation unit 116 acquires various types of target sound signals D1 and background sound signals D2 to be used for learning, for example, from the memory 12 or an external server. The learning model generation unit 116 then randomly selects one type of target sound signal D1 and one or more types of background sound signals D2 from the acquired signals and superimposes them to generate a learning sound source signal. The learning model generation unit 116 then generates the learning model by training it on the learning sound source signals.
  • the learning model generation unit 116 adjusts the parameters of the learning model so as to minimize the error between the target sound signal D1 output from the learning model when the learning sound source signal is input to the learning model and the target sound signal D1 constituting the input learning sound source signal, and the error between the background sound signal D2 output from the learning model when the learning sound source signal is input to the learning model and the background sound signal D2 constituting the input learning sound source signal.
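  • The data generation and error described above can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: the function names, the `max_backgrounds` parameter, the use of plain Python lists as time signals, and the choice of mean squared error as the error measure are all assumptions.

```python
import random


def make_training_mixture(target_signals, background_signals,
                          max_backgrounds=2, rng=None):
    """Randomly pick one target signal and one or more background signals,
    then superimpose them into a learning sound source signal."""
    rng = rng or random.Random()
    target = rng.choice(target_signals)
    count = rng.randint(1, min(max_backgrounds, len(background_signals)))
    chosen = rng.sample(background_signals, count)
    # Truncate everything to the shortest signal so samples line up.
    n = min(len(target), *(len(b) for b in chosen))
    target = target[:n]
    background = [sum(b[i] for b in chosen) for i in range(n)]
    mixture = [t + b for t, b in zip(target, background)]
    return mixture, target, background


def separation_loss(est_target, est_background, ref_target, ref_background):
    """Sum of the two errors to be minimized (MSE is an assumed error measure)."""
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return mse(est_target, ref_target) + mse(est_background, ref_background)
```

  • In a real training loop, the estimated signals would come from the learning model and the loss would drive its parameter updates; that part is omitted here.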
  • FIG. 3 is a block diagram showing an example of the configuration of the separation unit 112 configured with Conv-Tasnet.
  • the separation unit 112 includes an encoder 201, a separator 202, and a decoder 203.
  • the encoder 201 detects the features of the sound source signal D0.
  • the separator 202 estimates a target sound mask and a background sound mask from the features detected by the encoder 201.
  • the target sound mask is a separation mask for extracting the features of the target sound signal D1 from the features of the sound source signal D0.
  • the background sound mask is a separation mask for extracting the features of the background sound signal D2 from the features of the sound source signal D0.
  • the decoder 203 calculates the features of the target sound signal D1 by multiplying the features detected by the encoder 201 by the target sound mask, and generates the target sound signal D1 by converting the features of the target sound signal D1 into a sound signal.
  • the decoder 203 calculates the features of the background sound signal D2 by multiplying the features detected by the encoder 201 by the background sound mask, and generates the background sound signal D2 by converting the features of the background sound signal D2 into a sound signal.
  • the sound source signal D0 is separated into the target sound signal D1 and the background sound signal D2.
  • the target sound mask, background sound mask, parameters for estimating the target sound mask, parameters for estimating the background sound mask, and parameters for detecting features are learned through learning using training data.
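  • The encoder, mask, and decoder flow described above can be illustrated with a toy sketch. The `encode` and `decode` functions here are hypothetical stand-ins (simple framing and concatenation) for the learned layers of encoder 201 and decoder 203, and the masks are passed in by hand rather than estimated by a separator network.

```python
def encode(signal, frame=2):
    """Stand-in for encoder 201: split the signal into feature frames."""
    return [signal[i:i + frame] for i in range(0, len(signal), frame)]


def decode(features):
    """Stand-in for decoder 203: turn feature frames back into a signal."""
    return [x for row in features for x in row]


def apply_mask(features, mask):
    """Element-wise multiplication of features by a separation mask."""
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(features, mask)]


def separate(d0, target_mask, background_mask):
    """Return (D1, D2) by masking the features of D0 and decoding each result."""
    features = encode(d0)
    d1 = decode(apply_mask(features, target_mask))
    d2 = decode(apply_mask(features, background_mask))
    return d1, d2


# Usage with hand-written masks (a real separator 202 would estimate them).
d0 = [1.0, 2.0, 3.0, 4.0]
target_mask = [[1.0, 0.0], [0.5, 0.5]]
background_mask = [[0.0, 1.0], [0.5, 0.5]]
d1, d2 = separate(d0, target_mask, background_mask)
```

  • Because the two example masks sum to one at every position and the stand-in decoder is linear, D1 and D2 add back up to D0 in this toy case; nothing in the source requires that property of the learned masks.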
  • the separation unit 112 is configured with a Conv-Tasnet, but this is just one example. A Tasnet may be adopted instead, and any learning model may be used as long as it can separate the target sound signal D1 and the background sound signal D2 from the sound source signal D0.
  • the separation unit 112 may separate the target sound signal D1 and the background sound signal D2 from the sound source signal D0 by a method other than machine learning.
  • for example, the separation unit 112 may separate the target sound signal D1 and the background sound signal D2 from the sound source signal D0 by performing a Fourier transform on the sound source signal D0 and applying a time-frequency mask to the resulting frequency-domain signal.
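  • As a rough illustration of mask-based separation without machine learning, the sketch below applies a mask to a single discrete Fourier transform frame. It is a simplification under stated assumptions: a practical system would mask each frame of a short-time Fourier transform, the mask values are chosen by hand, and the background mask is assumed to be the complement of the target mask.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def separate_by_frequency_mask(d0, target_mask):
    """Split D0 into (D1, D2) using a per-bin mask and its complement."""
    spec = dft(d0)
    d1 = idft([s * m for s, m in zip(spec, target_mask)])
    d2 = idft([s * (1.0 - m) for s, m in zip(spec, target_mask)])
    return d1, d2


# Usage: treat the tone in bins 1 and 7 as the target, the rest as background.
n = 8
d0 = [math.cos(2 * math.pi * t / n) + math.cos(2 * math.pi * 3 * t / n)
      for t in range(n)]
target_mask = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
d1, d2 = separate_by_frequency_mask(d0, target_mask)
```

  • The mask keeps bins 1 and 7 together so that the recovered target signal stays real-valued.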
  • the volume adjustment unit 113 emphasizes the target sound signal D1 by adjusting the volume of the target sound signal D1 separated by the separation unit 112. For example, the volume adjustment unit 113 adjusts the volume of the target sound signal D1 to an optimal volume by applying auto gain control to the target sound signal D1.
  • an example of auto gain control is a process of amplifying the target sound signal D1 with a first gain G1 (an example of a predetermined gain) when the volume of the target sound signal D1 does not exceed a predetermined reference volume, and attenuating the target sound signal D1 with a second gain G2 that reduces the volume of the target sound signal D1 below the reference volume when the volume of the target sound signal D1 exceeds the reference volume.
  • the first gain G1 is set to a value that allows the volume of a person's whispering to be louder than the volume of the surrounding sounds that the user is expected to hear in booth 2, for example.
  • the second gain G2 may be set, for example, so that the degree of attenuation increases as the amount by which the volume of the target sound signal D1 exceeds the reference volume increases.
  • the target sound signal D1 whose volume has been adjusted by the volume adjustment unit 113 is referred to as an emphasized target sound signal D3.
  • the reference volume is the volume at which the emphasized target sound signal D3 output from the speaker 13 is expected to leak outside the booth 2.
  • the addition unit 114 generates a sum signal D4 by adding the sound source signal D0 input from the acquisition unit 111 and the emphasized target sound signal D3 input from the volume adjustment unit 113. As a result, even if distortion occurs in the process of generating the emphasized target sound signal D3, the distortion is compensated for by the sound source signal D0, and a sum signal D4 with suppressed distortion is obtained.
  • both the sound source signal D0 and the emphasized target sound signal D3 are time signals. Therefore, the addition unit 114 adds the sound source signal D0 and the emphasized target sound signal D3 in the time domain.
  • the output unit 115 generates an output signal D5 from the sum signal D4 and outputs the sound indicated by the output signal D5 from the speaker 13. As a result, the sound of the content is output from the speaker 13.
  • the output unit 115 may execute a compressor process to compress the sum signal D4 so that the volume of the sum signal D4 does not exceed a predetermined maximum volume.
  • the maximum volume is a volume at which clipping occurs if the volume increases any further.
  • the output unit 115 may directly use the sum signal D4 as the output signal D5.
  • the output unit 115 may generate the output signal D5 so that area reproduction is realized in which the inside of the booth 2 is a reproduction area and the outside of the booth 2 is a non-reproduction area.
  • the output unit 115 may realize area reproduction by adjusting, for example, the phase of the output signal input to each of the multiple speaker elements that make up the speaker 13. Note that the method of area reproduction is publicly known, so a detailed description will be omitted here.
  • Speaker 13 is an array speaker with multiple speaker elements arranged in a line. This allows area reproduction to be achieved.
  • FIG. 4 is a flowchart showing an example of processing by the signal processing device 10 in an embodiment of the present disclosure.
  • the acquisition unit 111 acquires the sound source signal D0 from the memory 12 (step S1).
  • the separation unit 112 separates the sound source signal D0 acquired in step S1 into a target sound signal D1 and a background sound signal D2 (step S2).
  • the volume adjustment unit 113 applies auto gain control to the target sound signal D1 separated in step S2 to generate an emphasized target sound signal D3 (step S3).
  • FIG. 5 is a diagram showing the state of signal processing in a comparative example in which auto gain control is not applied.
  • the left diagram in FIG. 5 shows the sound source signal D0, and the right diagram in FIG. 5 shows the sum signal D400 to which auto gain control has not been applied.
  • the vertical axis represents volume and the horizontal axis represents time. This is the same as in FIG. 6 and FIG. 7, which will be described later.
  • the sum signal D400 is generated by adding the sound source signal D0 to the target sound signal D1, which has been uniformly amplified by the first gain G1 regardless of whether the volume exceeds the reference volume.
  • region 511 indicates a whisper, and region 512 indicates a normal speech sound.
  • the target sound signal D1 is amplified by the first gain G1, so that the user U can hear a whisper.
  • the normal speech sound is also amplified by the first gain G1, so that the normal speech sound is over-amplified as shown in region 512. In the comparative example, therefore, the sound output from the speaker 13 may leak outside the booth 2. For this reason, in this embodiment, auto gain control is applied to the target sound signal D1 to generate an emphasized target sound signal D3.
  • FIG. 6 is a diagram illustrating the effect of auto gain control.
  • the left diagram in FIG. 6 shows an emphasized target sound signal D300 in a comparative example, and the right diagram in FIG. 6 shows an emphasized target sound signal D3 in this embodiment.
  • the upper reference volume limit TH1 has a value on the positive side of the reference volume, and the lower reference volume limit TH2 has a value on the negative side of the reference volume.
  • the emphasized target sound signal D300 is generated by amplifying the target sound signal D1 with a first gain G1.
  • the emphasized target sound signal D300 has multiple peaks, some of which exceed the upper reference volume limit TH1 and some of which are below the lower reference volume limit TH2. Therefore, in the comparative example, the sound output from the speaker 13 may leak outside the booth 2.
  • the volume adjustment unit 113 applies auto gain control to the target sound signal D1. Specifically, when the volume of the target sound signal D1 exceeds the upper reference volume limit TH1 or falls below the lower reference volume limit TH2, the volume adjustment unit 113 attenuates the target sound signal D1 by the second gain G2. On the other hand, when the volume of the target sound signal D1 is within the range between the upper reference volume limit TH1 and the lower reference volume limit TH2, the volume adjustment unit 113 amplifies the target sound signal D1 by the first gain G1. As a result, the volume of the emphasized target sound signal D3 is kept within the range between the upper reference volume limit TH1 and the lower reference volume limit TH2.
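  • One possible way to realize this behavior is sketched below. It is a simplified, frame-based illustration under several assumptions that are not in the source: the volume of a frame is taken to be its peak absolute sample value, TH1 and TH2 are taken to be `+reference` and `-reference`, the second gain G2 is computed as `reference / peak`, and the first gain G1 is capped so that amplified frames also stay within the limits.

```python
def auto_gain_control(d1, reference, g1, frame=4):
    """Amplify quiet frames by G1 and attenuate loud frames below the reference.

    d1        : target sound signal D1 as a list of samples
    reference : reference volume (TH1 = +reference, TH2 = -reference assumed)
    g1        : first gain G1 applied to frames that do not exceed the reference
    frame     : frame length in samples (hypothetical parameter)
    """
    d3 = []
    for start in range(0, len(d1), frame):
        block = d1[start:start + frame]
        peak = max(abs(x) for x in block)
        if peak == 0.0:
            gain = 1.0                        # silent frame, nothing to scale
        elif peak > reference:
            gain = reference / peak           # second gain G2 (< 1): attenuate
        else:
            gain = min(g1, reference / peak)  # first gain G1, capped (assumption)
        d3.extend(x * gain for x in block)
    return d3


# Usage: a whisper-like frame followed by a loud frame.
d1 = [0.01, -0.02, 0.01, 0.0, 0.9, -1.2, 0.6, 0.3]
d3 = auto_gain_control(d1, reference=0.5, g1=4.0)
```

  • With `reference / peak` as G2, the attenuation grows as the excess over the reference volume grows, which matches the optional behavior described for the second gain G2. A production AGC would normally smooth the gain over time to avoid audible steps between frames.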
  • In step S4, the adder 114 generates a sum signal D4 by adding the emphasized target sound signal D3 generated in step S3 to the sound source signal D0 acquired in step S1.
  • In step S5, the output unit 115 generates an output signal D5 by applying a compressor process to the sum signal D4.
  • FIG. 7 is an explanatory diagram of the compressor process.
  • The sum signal D4 is generated by adding the emphasized target sound signal D3 to the sound source signal D0, and the output signal D5 is generated from this sum signal D4.
  • Without the compressor process, the volume of the output signal D5 may exceed the upper maximum volume TH3 or fall below the lower maximum volume TH4.
  • The upper maximum volume TH3 has a value on the positive side of the maximum volume, and the lower maximum volume TH4 has a value on the negative side of the maximum volume.
  • The output unit 115 applies a compressor process to the sum signal D4 generated by the adder 114 to generate the output signal D5.
  • The compressor process compresses the sum signal D4 so that the volume of the output signal D5 falls within the range between the upper maximum volume TH3 and the lower maximum volume TH4. This prevents the output signal D5 from being clipped.
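One way to picture the compressor process is a simple static compressor with a final limit at TH3 and TH4. The sketch below is an assumption-laden illustration: the function name `compress`, the `knee` fraction, and the `ratio` value are hypothetical parameters that do not appear in the disclosure, which states only that the sum signal D4 is compressed into the range between TH3 and TH4.

```python
def compress(d4, th3, th4, knee=0.8, ratio=4.0):
    """Sketch of a static compressor on sum-signal samples D4 (hypothetical).

    th3: upper maximum volume (positive), th4: lower maximum volume (negative).
    knee and ratio are illustrative assumptions, not values from the source.
    """
    d5 = []
    for x in d4:
        limit = th3 if x >= 0 else -th4  # magnitude limit for this polarity
        start = knee * limit             # level where compression begins
        mag = abs(x)
        if mag > start:
            # Reduce the excess above the knee by the compression ratio.
            mag = start + (mag - start) / ratio
        mag = min(mag, limit)            # never leave the [th4, th3] range
        d5.append(mag if x >= 0 else -mag)
    return d5


# Example with placeholder values (not taken from the disclosure).
d5 = compress([0.5, 1.2, -2.0, -0.5], th3=1.0, th4=-1.0)
```

Samples below the knee pass through unchanged, so only the loud peaks of the sum signal are reshaped.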
  • In step S6, the output unit 115 inputs the output signal D5 to the speaker 13, causing the speaker 13 to output the sound of the content.
  • FIG. 8 is a diagram showing an overview of the processing of the signal processing device 10 in this embodiment.
  • The sound source signal D0 is separated into a target sound signal D1 and a background sound signal D2 by the separation unit 112.
  • The volume of the separated target sound signal D1 is adjusted through auto gain control by the volume adjustment unit 113, and an emphasized target sound signal D3 is generated.
  • The emphasized target sound signal D3 is added to the sound source signal D0 by the adder 114, and a sum signal D4 is generated.
  • The sum signal D4 is compressed by the output unit 115 to generate an output signal D5, and the output signal D5 is input to the speaker 13. As a result, the sound of the content is output from the speaker 13.
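The overall flow of FIG. 8 (separate, adjust volume, add, compress) can be sketched as a small pipeline. Everything below is a hypothetical illustration: the function `process` and its callable parameters are names introduced here, and the toy stand-ins in the example replace the trained separation model, the auto gain control, and the compressor, none of which are specified at this level of detail in the source.

```python
def process(d0, separate, adjust_volume, compress):
    """Sketch of the D0 -> D5 flow; all callables are supplied by the caller.

    separate(d0)      -> (d1, d2): target sound and background sound
    adjust_volume(d1) -> d3: emphasized target sound
    compress(d4)      -> d5: output signal
    """
    d1, _d2 = separate(d0)                 # separation unit 112
    d3 = adjust_volume(d1)                 # volume adjustment unit 113
    d4 = [a + b for a, b in zip(d0, d3)]   # adder 114: D4 = D0 + D3
    return compress(d4)                    # output unit 115


# Toy stand-ins (assumptions for illustration only).
def toy_separate(d0):
    half = [0.5 * x for x in d0]
    return half, half

def toy_adjust(d1):
    return [2.0 * x for x in d1]

def toy_compress(d4):
    return [max(-1.0, min(1.0, x)) for x in d4]

d5 = process([0.2, 0.8, -0.7], toy_separate, toy_adjust, toy_compress)
```

Note that the emphasized signal is added to the original sound source signal D0 rather than to the separated background signal D2, which matches the description above.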
  • FIG. 9 is a waveform diagram of an output signal in a comparative example.
  • FIG. 10 is a waveform diagram of an output signal in this embodiment.
  • In the comparative example, the volume of the speech signal, which is the target sound signal D1, is not adjusted.
  • FIGS. 9 and 10 show output signals 801 and 802, each consisting of two-channel stereo sound.
  • Spectrogram 803 is a spectrogram of output signal 801, and spectrogram 804 is a spectrogram of output signal 802.
  • In waveform 800, the high-level section is the speech section.
  • In spectrograms 803 and 804, the vertical axis is frequency and the horizontal axis is time.
  • In spectrograms 803 and 804, brighter points indicate louder volume.
  • Areas B1 and B2 shown in FIG. 10 correspond to areas A1 and A2 shown in FIG. 9 and are speech sections. Areas B1 and B2 are louder overall than areas A1 and A2 shown in FIG. 9. This shows that the speech signal is emphasized in this embodiment.
  • The emphasized target sound signal D3, obtained by emphasizing the target sound signal D1, is added to the sound source signal D0 to generate the output signal D5. Therefore, even if distortion occurs in the process of generating the emphasized target sound signal D3, the distortion is compensated for by the sound source signal D0 and is suppressed. This prevents the target sound from becoming difficult to hear in a noisy environment.
  • The volume adjustment unit 113 may emphasize only a specific band rather than all bands of the target sound signal D1. For example, the volume adjustment unit 113 may apply a filtering process to the target sound signal D1 using a filter that increases the volume of only a specific band.
  • The specific band may be, for example, a female voice band or a male voice band.
  • By applying a filter that increases the volume of only the specific band to the target sound signal D1, the volume adjustment unit 113 may generate an emphasized target sound signal D3 in which only the specific band is emphasized.
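A filter that raises only one band can be sketched with a standard peaking-EQ biquad (coefficients in the widely used Audio EQ Cookbook form). The disclosure does not name a filter type, so this choice is a swapped-in technique, and the function name `emphasize_band` and the parameters `center_hz`, `gain_db`, and `q` are assumptions; the source gives no numeric values for the voice bands.

```python
import math


def emphasize_band(d1, sample_rate, center_hz, gain_db, q=1.0):
    """Sketch: boost one band of D1 with a peaking biquad (hypothetical)."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)

    # Peaking-EQ coefficients (Audio EQ Cookbook form).
    b0, b1, b2 = 1.0 + alpha * a_lin, -2.0 * cos_w0, 1.0 - alpha * a_lin
    a0, a1, a2 = 1.0 + alpha / a_lin, -2.0 * cos_w0, 1.0 - alpha / a_lin

    x1 = x2 = y1 = y2 = 0.0  # filter state (previous inputs and outputs)
    d3 = []
    for x in d1:
        # Direct form I difference equation.
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        d3.append(y)
    return d3


# Example: boost a band around a placeholder center frequency by 6 dB.
fs = 16000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(fs)]
boosted = emphasize_band(tone, fs, center_hz=1000.0, gain_db=6.0, q=1.0)
```

Components near `center_hz` are amplified by roughly `gain_db`, while components far from it pass at close to unity gain.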
  • The target sound signal D1 is a speech signal, but this is only one example, and the target sound signal D1 may be a signal other than a speech signal.
  • The target sound signal D1 may be a signal indicating a specific sound (e.g., a sound effect, background sound, or music) included in the content.
  • The separation unit 112 may separate the target sound signal D1 and the remaining sound signal from the sound source signal D0 using a learning model that has been trained to separate the sound source signal D0 into the specific sound and the remaining sounds.
  • The audio device 1 is installed inside an airplane, but it may also be installed on a train, a bus, or the like. The audio device 1 may also be installed in a booth located in a busy area or in a booth in an office.
  • The sound source signal may be an interior sound signal indicating the interior sound of a moving vehicle.
  • The target sound signal may be a signal indicating an alarm sound or a sound output from a car navigation system.
  • The alarm sound may be, for example, an alarm sound output by an emergency vehicle such as an ambulance, fire engine, or police car.
  • The moving vehicle may be, for example, a four-wheeled vehicle.
  • The sound output from the car navigation system may be, for example, a guidance voice or a caution sound for route guidance.
  • The sound source signal may be, for example, an acoustic signal representing musical sounds that include the sounds of multiple instruments, such as orchestral or light music.
  • The target sound signal may be a signal representing the sound of a specific instrument among the multiple instruments. Examples of the specific instrument include guitar, violin, bass, drums, keyboard, piano, woodwind instruments, and brass instruments.
  • The sound source signal may be a content sound signal indicating content sound contained in video content.
  • A movie or a landscape video may be used as the video content.
  • A sound signal used in a movie, or sounds of the natural world captured when shooting the landscape, may be used as the sound source signal.
  • A signal indicating a specific sound effect may be used as the target sound signal.
  • The specific sound effect may be a sound effect used in the movie.
  • The specific sound effect may be the calls of birds, the sound of a river, or the sound of the sea.
  • The present disclosure makes it easier to hear a target sound in a noisy environment and is therefore useful for audio devices installed on aircraft and the like.

PCT/JP2024/001058 2023-02-02 2024-01-17 Signal processing device, signal processing method, and signal processing program Ceased WO2024161995A1 (ja)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2024574408A JPWO2024161995A1 2023-02-02 2024-01-17
EP24749949.4A EP4641565A4 (en) 2023-02-02 2024-01-17 SIGNAL PROCESSING DEVICE, METHOD AND PROGRAM
US19/286,570 US20250365537A1 (en) 2023-02-02 2025-07-31 Signal processing device, signal processing method, and non-transitory computer readable recording medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2023-014776 2023-02-02
JP2023014776 2023-02-02

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/286,570 Continuation US20250365537A1 (en) 2023-02-02 2025-07-31 Signal processing device, signal processing method, and non-transitory computer readable recording medium

Publications (1)

Publication Number Publication Date
WO2024161995A1 (ja) 2024-08-08

Family

ID=92146472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/001058 Ceased WO2024161995A1 (ja) 2023-02-02 2024-01-17 Signal processing device, signal processing method, and signal processing program

Country Status (4)

Country Link
US (1) US20250365537A1
EP (1) EP4641565A4
JP (1) JPWO2024161995A1
WO (1) WO2024161995A1

Also Published As

Publication number Publication date
US20250365537A1 (en) 2025-11-27
EP4641565A4 (en) 2026-03-25
EP4641565A1 (en) 2025-10-29
JPWO2024161995A1 2024-08-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24749949

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2024574408

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2024574408

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 2024749949

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2024749949

Country of ref document: EP

Effective date: 20250723

NENP Non-entry into the national phase

Ref country code: DE

WWP Wipo information: published in national office

Ref document number: 2024749949

Country of ref document: EP