WO2016076123A1 - Sound processing device, sound processing method, and program

Info

Publication number
WO2016076123A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
filter
signal
sound
beam forming
Prior art date
Application number
PCT/JP2015/080481
Other languages
English (en)
Japanese (ja)
Inventor
慶一 大迫
堅一 牧野
宏平 浅田
徹徳 板橋
Original Assignee
Sony Corporation (ソニー株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation
Priority to JP2016558971A priority Critical patent/JP6686895B2/ja
Priority to EP15859486.1A priority patent/EP3220659B1/fr
Priority to US15/522,628 priority patent/US10034088B2/en
Publication of WO2016076123A1 publication Critical patent/WO2016076123A1/fr

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/04 Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R29/00 Monitoring arrangements; Testing arrangements
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/406 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R2499/10 General applications
    • H04R2499/11 Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's

Definitions

  • The present technology relates to a sound processing device, a sound processing method, and a program, and in particular to a sound processing device, a sound processing method, and a program that can extract a desired voice by removing noise.
  • A user interface using voice is employed, for example, when making a phone call or searching for information on a mobile phone (a device such as a smartphone).
  • In Patent Document 1, it is proposed to implement a generalized sidelobe canceller by enhancing speech with a fixed beamformer unit and enhancing noise with a blocking matrix unit. It is further proposed that a switching unit switches the coefficients of the fixed beamformer, the switching being performed between two filters: one for when speech is present and one for when it is not.
  • However, in Patent Document 1, when switching between filters with different characteristics depending on whether speech is present, the correct filter cannot be selected unless the speech interval is detected accurately. Since accurately detecting the speech interval is difficult, there is a possibility that the wrong filter will be selected.
  • Also, in Patent Document 1, since the filter is switched abruptly between when there is speech and when there is not, the sound quality changes suddenly, which may sound unnatural to the user.
  • The present technology has been made in view of such a situation, and makes it possible to switch filters appropriately and acquire a desired sound.
  • A sound processing apparatus according to one aspect of the present technology includes a sound collection unit that collects sound, an application unit that applies a predetermined filter to the signal collected by the sound collection unit, a selection unit that selects the filter coefficients of the filter applied by the application unit, and a correction unit that corrects the signal from the application unit.
  • the selection unit may select the filter coefficient based on the signal collected by the sound collection unit.
  • The selection unit can create, from the signal collected by the sound collection unit, a histogram that associates the direction in which the sound is generated with the intensity of the sound, and can select the filter coefficient from the histogram.
  • the selection unit can create the histogram from the signal accumulated for a predetermined time.
  • the selection unit may select a filter coefficient of a filter that suppresses the sound in a region other than a region including the maximum value of the histogram.
  • The apparatus may further include a conversion unit that converts the signal collected by the sound collection unit into a frequency-domain signal, and the selection unit may select one set of filter coefficients for all frequency bands using the signal from the conversion unit.
  • Alternatively, the apparatus may further include a conversion unit that converts the signal collected by the sound collection unit into a frequency-domain signal, and the selection unit may select the filter coefficient for each frequency band using the signal from the conversion unit.
  • The application unit may include a first application unit and a second application unit, and a mixing unit that mixes the signals from the first and second application units may further be provided. The first application unit applies a filter based on a first filter coefficient, the second application unit applies a filter based on a second filter coefficient, and the mixing unit can mix the signal from the first application unit and the signal from the second application unit at a predetermined mixing ratio.
  • The first application unit can then start applying a filter based on the second filter coefficient, and the second application unit can stop its processing.
  • the selection unit can select the filter coefficient based on an instruction from a user.
  • The correction unit may perform correction that further suppresses the signal suppressed by the application unit and, when the signal collected by the sound collection unit is smaller than the signal to which the predetermined filter has been applied by the application unit, correction that suppresses the signal amplified by the application unit.
  • the application unit may suppress stationary noise, and the correction unit may suppress sudden noise.
  • A sound processing method according to one aspect of the present technology includes collecting sound, applying a predetermined filter to the collected signal, selecting the filter coefficients of the filter to be applied, and correcting the signal to which the predetermined filter has been applied.
  • A program according to one aspect of the present technology causes a computer to execute processing that includes the steps of collecting sound, applying a predetermined filter to the collected signal, selecting the filter coefficients of the filter to be applied, and correcting the signal to which the predetermined filter has been applied.
  • In one aspect of the present technology, sound is collected, a predetermined filter is applied to the collected signal, the filter coefficients of the filter to be applied are selected, and the filtered signal is corrected.
  • a desired sound can be acquired by appropriately switching filters.
  • FIG. 1 is a diagram illustrating an external configuration of a voice processing device to which the present technology is applied.
  • the present technology can be applied to an apparatus that processes an audio signal.
  • For example, the present technology can be applied to mobile phones (including devices called smartphones), to the part of a game machine that processes signals from its microphone, to noise-canceling headphones, earphones, and the like.
  • the present invention can also be applied to a device equipped with an application for realizing hands-free calling, voice dialogue system, voice command input, voice chat, and the like.
  • The sound processing device to which the present technology is applied may be a mobile terminal or a device installed and used at a predetermined position. It can also be applied to glasses-type terminals, terminals worn on the arm, and other devices called wearable devices.
  • FIG. 1 is a diagram showing an external configuration of the mobile phone 10.
  • a speaker 21, a display 22, and a microphone 23 are provided on one surface of the mobile phone 10.
  • Speaker 21 and microphone 23 are used when making a voice call.
  • the display 22 displays various information.
  • the display 22 may be a touch panel.
  • the microphone 23 has a function of collecting voice uttered by the user, and is a part to which voice to be processed later is input.
  • the microphone 23 is an electret condenser microphone, a MEMS microphone, or the like.
  • The sampling rate of the microphone 23 is, for example, 16000 Hz.
  • In FIG. 1, only one microphone 23 is shown, but two or more microphones 23 are provided, as will be described later.
  • In FIG. 3 and subsequent figures, the plurality of microphones 23 is depicted as a sound collection unit.
  • the sound collection unit includes two or more microphones 23.
  • the installation position of the microphone 23 on the mobile phone 10 is an example, and does not indicate that the installation position is limited to the lower central portion as shown in FIG.
  • For example, one microphone 23 may be provided on each of the left and right sides of the lower part of the mobile phone 10, or the microphones may be provided on a surface other than that of the display 22, such as a side surface of the mobile phone 10.
  • the installation position and the number of the microphones 23 differ depending on the device in which the microphones 23 are provided, and it is sufficient that the microphones 23 are installed at appropriate installation positions for each device.
  • FIG. 2A is a diagram for explaining stationary noise.
  • the microphone 51-1 and the microphone 51-2 are located in a substantially central portion.
  • When there is no need to distinguish between the microphone 51-1 and the microphone 51-2, they are simply referred to as the microphone 51; other parts are described in the same manner.
  • The noise emitted from the sound source 61 is noise that continues to come from the same direction, such as the fan noise of a projector or the sound of air conditioning. Such noise is defined here as stationary noise.
  • FIG. 2B is a diagram for explaining sudden noise.
  • the situation shown in FIG. 2B is a state in which stationary noise is emitted from the sound source 61 and sudden noise is emitted from the sound source 62.
  • Sudden noise is, for example, noise that occurs abruptly from a direction different from that of the stationary noise, such as the sound of a pen being dropped or a person coughing or sneezing, and it has a relatively short duration.
  • When a filter designed to remove stationary noise and extract the desired voice is in use, sudden noise cannot be handled: it may pass through unremoved and adversely affect extraction of the desired voice. Conversely, if, while stationary noise is being processed with a given filter, sudden noise occurs and the filter is switched to one for sudden noise and then immediately switched back, filter switching occurs frequently, and noise due to the switching itself may be produced.
  • FIG. 3 is a diagram showing a configuration of the 1-1 speech processing apparatus 100.
  • the voice processing device 100 is provided inside the mobile phone 10 and constitutes a part of the mobile phone 10.
  • The voice processing apparatus 100 shown in FIG. 3 includes a sound collection unit 101, a time-frequency conversion unit 102, a beam forming unit 103, a filter selection unit 104, a filter coefficient holding unit 105, a signal correction unit 106, a correction coefficient calculation unit 107, and a time-frequency inverse conversion unit 108.
  • the mobile phone 10 also has a communication unit for functioning as a telephone, a function for connecting to a network, and the like.
  • Here, only the configuration of the voice processing apparatus 100 related to voice processing is illustrated; illustration and description of the other functions are omitted.
  • the sound collection unit 101 includes a plurality of microphones 23.
  • the sound collection unit 101 includes M microphones 23-1 to 23-M.
  • the audio signal collected by the sound collection unit 101 is supplied to the time frequency conversion unit 102.
  • the time-frequency conversion unit 102 converts the supplied time-domain signal into a frequency-domain signal, and supplies the signal to the beamforming unit 103, the filter selection unit 104, and the correction coefficient calculation unit 107.
  • The beam forming unit 103 performs beam forming processing using the audio signals of the microphones 23-1 to 23-M supplied from the time-frequency conversion unit 102 and the filter coefficients supplied from the filter coefficient holding unit 105.
  • the beam forming unit 103 has a function of performing processing to which a filter is applied, and an example thereof is beam forming.
  • The beam forming executed by the beam forming unit 103 is an additive or subtractive beam forming process.
  • the filter selection unit 104 calculates an index of the filter coefficient used for beam forming by the beam forming unit 103 for each frame.
  • the filter coefficient holding unit 105 holds the filter coefficient used in the beam forming unit 103.
  • the audio signal output from the beam forming unit 103 is supplied to the signal correction unit 106 and the correction coefficient calculation unit 107.
  • The correction coefficient calculation unit 107 receives the audio signal from the time-frequency conversion unit 102 and the beam-formed signal from the beam forming unit 103, and uses these signals to calculate the correction coefficient used by the signal correction unit 106.
  • the signal correction unit 106 corrects the signal output from the beam forming unit 103 using the correction coefficient calculated by the correction coefficient calculation unit 107.
  • the signal corrected by the signal correction unit 106 is supplied to the time frequency inverse conversion unit 108.
  • the time-frequency inverse transform unit 108 converts the supplied frequency band signal into a time-domain signal and outputs it to a subsequent unit (not shown).
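As an illustrative sketch only, the flow of one frame through these units can be written as follows in Python. The function name, the Hann window, the use of microphone 1 as the reference, and the correction rule (pull the beamformed output back down wherever it exceeds the reference microphone's level) are all assumptions made for the sketch, not details taken from the patent:

```python
import numpy as np

def process_frame(frames, coeffs):
    """One frame through the FIG. 3 chain.
    frames: (M, N) time-domain samples from the M microphones.
    coeffs: (M, N//2 + 1) complex beamforming weights (one filter pattern)."""
    window = np.hanning(frames.shape[1])
    X = np.fft.rfft(frames * window, axis=1)        # 102: time-frequency conversion
    D = np.sum(coeffs.conj() * X, axis=0)           # 103: beam forming
    ref = np.abs(X[0])                              # reference microphone magnitude
    g = np.minimum(1.0, ref / (np.abs(D) + 1e-12))  # 107: correction coefficient (assumed rule)
    Y = g * D                                       # 106: signal correction
    return np.fft.irfft(Y)                          # 108: time-frequency inverse conversion
```

With M = 2 microphones, N = 512 samples, and uniform weights, this returns one enhanced 512-sample frame.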
  • In step S101, an audio signal is collected by each of the microphones 23-1 to 23-M of the sound collection unit 101.
  • the voice collected here is a voice uttered by the user, noise, a sound in which they are mixed, or the like.
  • In step S102, the input signal is cut out frame by frame.
  • Sampling at the time of extraction is performed at 16000 Hz, for example.
  • The frame signal cut out from the microphone 23-1 is denoted x1(n), the frame signal cut out from the microphone 23-2 is denoted x2(n), and, in general, the frame signal cut out from the microphone 23-m is denoted xm(n).
  • Here, m represents the index of the microphone (1 to M), and n represents the sample number of the collected signal.
  • The extracted signals x1(n) to xm(n) are each supplied to the time-frequency conversion unit 102.
  • In step S103, the time-frequency conversion unit 102 converts the supplied signals x1(n) to xm(n) into time-frequency signals.
  • The time-frequency conversion unit 102 receives the time-domain signals x1(n) to xm(n) and converts them individually into frequency-domain signals.
  • The time-domain signal x1(n) is converted into the frequency-domain signal x1(f, k), the time-domain signal x2(n) into x2(f, k), and, in general, the time-domain signal xm(n) into xm(f, k); the description continues on this basis.
  • In (f, k), f is an index indicating the frequency band, and k is a frame index.
  • The processing of the time-frequency conversion unit 102 is described below, taking the input time-domain signal x1(n) as an example (the same applies to x2(n) through xm(n)). The signal is divided into frames of N samples each, a window function is applied, and each frame is converted into a frequency-domain signal by FFT (Fast Fourier Transform). In the frame division, the extraction interval is shifted by N/2 samples.
  • As an example, the frame size N is set to 512 and the shift size to 256. That is, the input signal x1(n) is divided into frames of frame size N = 512, a window function is applied, and an FFT operation is performed to convert each frame into a frequency-domain signal.
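The framing just described (frame size N = 512, shift N/2 = 256, window function, then FFT) can be sketched as follows; the Hann window and the helper's name are illustrative choices, since the patent does not specify a particular window:

```python
import numpy as np

def frames_to_spectra(x, frame_size=512, hop=256):
    """Split a time-domain signal into half-overlapping frames of
    `frame_size` samples, apply a Hann window to each, and FFT it,
    yielding the frequency-domain signals x(f, k)."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(x) - frame_size) // hop
    spectra = []
    for k in range(n_frames):
        frame = x[k * hop : k * hop + frame_size]
        spectra.append(np.fft.rfft(frame * window))   # one frame -> one spectrum
    return np.array(spectra)   # shape: (n_frames, frame_size // 2 + 1)

x = np.random.randn(16000)   # one second of audio at 16000 Hz
X = frames_to_spectra(x)     # X.shape == (61, 257): 61 frames, 257 bins
```

For one second of audio at the 16000 Hz sampling rate mentioned above, this yields 61 frames of 257 frequency bins each.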
  • In step S103, the signals x1(f, k) to xm(f, k) converted into frequency-domain signals by the time-frequency conversion unit 102 are supplied to the beam forming unit 103, the filter selection unit 104, and the correction coefficient calculation unit 107, respectively.
  • In step S104, the filter selection unit 104 calculates the filter coefficient index I(k) used for beam forming for each frame.
  • the calculated index I (k) is sent to the filter coefficient holding unit 105.
  • the filter selection process is performed in three steps described below.
  • As a first step, the filter selection unit 104 estimates the sound source direction using the time-frequency signals x1(f, k) to xm(f, k) supplied from the time-frequency conversion unit 102.
  • the estimation of the sound source direction can be performed based on, for example, a MUSIC (Multiple signal classification) method. With respect to the MUSIC method, methods described in the following documents can be applied.
  • Let the estimation result of the filter selection unit 104 be P(f, k); P(f, k) takes a scalar value from −90 degrees to +90 degrees.
  • the direction of the sound source may be estimated by other estimation methods.
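For illustration, a minimal MUSIC-style direction estimate for a two-microphone array at a single frequency bin might look like the sketch below. The microphone spacing, frequency, sound speed, and single-source assumption are hypothetical parameters chosen for the example, not values from the patent:

```python
import numpy as np

def music_direction(X, mic_dist=0.05, freq=1000.0, c=343.0, n_sources=1):
    """Minimal MUSIC sketch for a 2-microphone array at one frequency bin.
    X: (2, snapshots) complex observations x(f, k).  Returns the azimuth
    estimate in degrees on the -90..+90 scale used in the text."""
    R = X @ X.conj().T / X.shape[1]           # spatial covariance matrix
    _, v = np.linalg.eigh(R)                  # eigenvectors, ascending eigenvalues
    En = v[:, : X.shape[0] - n_sources]       # noise subspace
    angles = np.arange(-90, 91)
    tau = mic_dist * np.sin(np.deg2rad(angles)) / c          # inter-mic delay per angle
    A = np.stack([np.ones_like(tau, dtype=complex),
                  np.exp(-2j * np.pi * freq * tau)])         # steering vectors a(theta)
    denom = np.sum(np.abs(En.conj().T @ A) ** 2, axis=0)
    spectrum = 1.0 / np.maximum(denom, 1e-12)                # MUSIC pseudo-spectrum
    return int(angles[np.argmax(spectrum)])
```

The peak of the pseudo-spectrum falls where the steering vector is orthogonal to the noise subspace, i.e. at the source direction.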
  • As a second step, a histogram of the sound source distribution is created.
  • the results estimated in the first step are accumulated.
  • the accumulation time can be, for example, the past 10 seconds.
  • A histogram is created from the estimation results over this accumulation time. Providing such an accumulation time makes it possible to cope with sudden noise: because a brief burst barely changes the histogram, the filter is not switched in the subsequent processing. This prevents the filter from being switched frequently under the influence of sudden noise and improves stability.
  • FIG. 7 shows an example of a histogram created from data (sound source estimation result) accumulated for a predetermined time.
  • the horizontal axis of the histogram shown in FIG. 7 represents the direction of the sound source, and is a scalar value from ⁇ 90 degrees to +90 degrees as described above.
  • the vertical axis represents the frequency of the sound source azimuth estimation result P (f, k).
  • Such a histogram may be created for each frequency or may be created for all frequencies.
  • Here, the case where one histogram is created for all frequencies together is described as an example.
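The accumulation described above can be sketched as a rolling buffer of per-frame direction estimates. The buffer length (roughly 10 seconds of frames) and the bin layout are illustrative assumptions:

```python
import numpy as np
from collections import deque

class DirectionHistogram:
    """Rolling histogram of sound-source direction estimates P(f, k).
    Only roughly the last `max_frames` estimates are kept (e.g. ~10 s of
    frames), so a short burst of sudden noise cannot reshape the
    histogram and trigger a filter switch."""
    def __init__(self, max_frames=625, n_bins=3):
        self.buf = deque(maxlen=max_frames)       # old estimates fall off the back
        self.edges = np.linspace(-90, 90, n_bins + 1)

    def push(self, direction_deg):
        self.buf.append(direction_deg)

    def histogram(self):
        counts, _ = np.histogram(list(self.buf), bins=self.edges)
        return counts
```

With n_bins = 3, the bin edges (−90, −30, 30, 90) line up with the three regions A, B, and C used in the filter determination below.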
  • As a third step, the filter to be used is determined.
  • Here, it is assumed that the filter coefficient holding unit 105 holds the three filter patterns shown in FIG. 8 and that the filter selection unit 104 selects one of the three patterns.
  • FIG. 8 shows the patterns of filter A, filter B, and filter C, respectively.
  • the horizontal axis represents the angle from ⁇ 90 ° to 90 °
  • the vertical axis represents the gain.
  • the filters A to C are filters that selectively extract sounds coming from a predetermined angle, in other words, reduce sounds coming from an angle other than the predetermined angle.
  • Filter A is a filter that greatly reduces the gain on the left side (-90 degrees azimuth) when viewed from the sound processing device.
  • The filter A is selected when, for example, it is desired to acquire sound on the right side (+90 degrees azimuth) as viewed from the audio processing apparatus, or when it is determined that there is noise on the left side and it is desired to reduce that noise.
  • Filter B is a filter that increases the gain at the center (0-degree azimuth) when viewed from the sound processing device and reduces the gain in other directions as compared to the central portion.
  • The filter B is selected when, for example, it is desired to acquire sound near the center (0-degree azimuth) as viewed from the speech processing apparatus, when it is determined that there is noise on both the left and right sides and it is desired to reduce that noise, or when neither filter A nor filter C (described later) can be applied.
  • Filter C is a filter that greatly reduces the gain on the right side (90-degree azimuth) when viewed from the sound processing device.
  • The filter C is selected when, for example, it is desired to acquire sound on the left side (−90 degrees azimuth) as viewed from the audio processing apparatus, or when it is determined that there is noise on the right side and it is desired to reduce that noise.
  • Each filter extracts the voice that is to be collected and suppresses voices other than it; it is sufficient that such filters are provided and can be switched.
  • In other words, a plurality of filters matched to a plurality of environmental noises are set in advance, each with fixed coefficients, and the filter suited to the current noise is selected.
  • FIG. 9 shows the histogram of FIG. 7 divided into three regions, as an example of how the histogram generated in the second step is divided.
  • the area is divided into three areas, area A, area B, and area C.
  • the area A is an area from ⁇ 90 degrees to ⁇ 30 degrees
  • the area B is an area from ⁇ 30 degrees to 30 degrees
  • the area C is an area from 30 degrees to 90 degrees.
  • The highest signal strengths of the three regions are compared.
  • the highest signal strength in region A is strength Pa
  • the highest signal strength in region B is strength Pb
  • the highest signal strength in region C is strength Pc.
  • Assuming that the intensity Pb in region B corresponds to the target voice, the remaining intensities Pa and Pc are each likely to be noise. Suppose that, of the intensity Pa in region A and the intensity Pc in region C, Pa is the stronger. In this case, it is considered preferable to suppress the high-intensity noise in region A.
  • Therefore, filter A is selected. With filter A, the sound in region A is suppressed, and the sounds in regions B and C are output without being suppressed.
  • In this way, a histogram is generated, divided into as many regions as there are filters, and the filter is selected by comparing the signal intensities of the divided regions.
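The three-region comparison can be sketched as follows. The assumption that the center region B holds the target voice, and the threshold for falling back to filter B when both sides are comparably noisy, are hypothetical readings of the example above:

```python
import numpy as np

def select_filter(counts, both_sides=0.8):
    """Pick filter 'A', 'B' or 'C' from a direction histogram.
    Splits the histogram into regions A, B, C and compares the peak
    heights Pa and Pc of the two side regions; region B is assumed to
    hold the target voice.  `both_sides` is a hypothetical threshold
    for choosing the centre-pass filter B when both sides are noisy."""
    region_a, region_b, region_c = np.array_split(np.asarray(counts), 3)
    pa, pc = region_a.max(), region_c.max()
    if min(pa, pc) > both_sides * max(pa, pc):
        return 'B'                  # comparable noise on both sides
    return 'A' if pa > pc else 'C'  # suppress the louder side
```

For a histogram whose left region dominates, this returns 'A' (suppress region A), matching the example in the text.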
  • Because the histogram is created by accumulating past data, even a sudden change such as sudden noise does not change the histogram greatly, so unnecessary filter switching can be prevented.
  • The case of three filters has been described as an example, but the number of filters may of course be other than three.
  • the number of filters and the number of divisions of the histogram have been described as the same number, they may be different.
  • the filter A and the filter C shown in FIG. 8 may be held, and the filter B may be generated by combining the filter A and the filter C. It is also possible to select a plurality of filters, such as applying filters A and C.
  • a plurality of filter groups including a plurality of filters may be held, and the filter group may be selected.
  • the filter is determined from the histogram, but the scope of application of the present technology is not limited to this method.
  • a means may be adopted in which the relationship between the histogram shape and the optimum filter is learned in advance by a machine learning algorithm, and the filter to be selected is determined.
  • The signals x1(f, k) to xm(f, k) converted into frequency-domain signals by the time-frequency conversion unit 102 are input to the filter selection unit 104, and one filter index I(k) is output per frame.
  • Alternatively, the signals x1(f, k) to xm(f, k) converted into frequency-domain signals by the time-frequency conversion unit 102 may be input to the filter selection unit 104 and a filter index I(f, k) obtained for each frequency band. Obtaining a filter index for each frequency band allows finer filter control.
  • the description will be continued assuming that one filter index is output to the filter coefficient holding unit 105 for each frame, as shown in FIG.
  • the description of the filter will be continued by taking the case of the filters A to C shown in FIG. 8 as an example.
  • In step S104, when the filter selection unit 104 has determined the filter to be used for beam forming as described above, the process proceeds to step S105.
  • In step S105, it is determined whether the filter has been changed. For example, when the filter selection unit 104 sets a filter in step S104, it stores the index of the filter it set; it then compares the index stored the previous time with the newly set index and judges whether they are the same. The processing in step S105 is performed by executing such a comparison.
  • If it is determined in step S105 that the filter has not been changed, the processing in step S106 is skipped and the process proceeds to step S107 (FIG. 5). If it is determined that the filter has been changed, the process proceeds to step S106.
  • In step S106, the filter coefficient is read from the filter coefficient holding unit 105 and supplied to the beam forming unit 103.
  • the beam forming unit 103 performs beam forming.
  • the beam forming performed by the beam forming unit 103 and the filter index read from the filter coefficient holding unit 105 used when the beam forming is performed will be described.
  • Beam forming is a process of collecting sound using a plurality of microphones (microphone arrays) and performing addition and subtraction by adjusting the phase input to each microphone. According to this beam forming, the sound in a specific direction can be emphasized or attenuated.
  • the speech enhancement process can be performed by additive beamforming.
  • Delay and Sum (hereinafter referred to as DS) is additive beam forming that emphasizes the gain in the target sound direction.
  • the sound attenuation process can be performed by attenuation beam forming.
  • Null beam forming (hereinafter referred to as NBF) is subtractive beam forming that attenuates the gain in the target sound direction.
  • The beam forming unit 103 receives the signals x1(f, k) to xm(f, k) from the time-frequency conversion unit 102 and the filter coefficient vector C(f, k) from the filter coefficient holding unit 105, and outputs the signal D(f, k) to the signal correction unit 106 and the correction coefficient calculation unit 107 as the processing result.
  • When the beam forming unit 103 performs voice enhancement processing based on DS beam forming, it has the configuration shown in FIG. 11.
  • the beam forming unit 103 includes a delay unit 131 and an adder 132.
  • In FIG. 11B, illustration of the time-frequency conversion unit 102 is omitted. Further, in FIG. 11B, the case where two microphones 23 are used is described as an example.
  • The audio signal from the microphone 23-1 is supplied to the adder 132, and the audio signal from the microphone 23-2 is delayed by a predetermined time by the delay unit 131 and then supplied to the adder 132. Because the microphones 23-1 and 23-2 are separated by a predetermined distance, the same sound is received by each as a signal with a different propagation delay, owing to the path difference.
  • a signal from one microphone 23 is delayed so as to compensate for a propagation delay related to a signal arriving from a predetermined direction. This delay is performed by a delay unit 131.
  • a delay device 131 is provided on the microphone 23-2 side.
  • Here, the microphone 23-1 side is −90°, the microphone 23-2 side is +90°, and the front direction of the microphones 23, perpendicular to the axis passing through the microphone 23-1 and the microphone 23-2, is 0°.
  • an arrow directed to the microphone 23 represents a sound wave of a sound emitted from a predetermined sound source.
  • the directivity characteristic is a plot of the beamforming output gain for each direction.
  • At the input of the adder 132, the phases of signals arriving from a predetermined direction, in this case a direction between 0° and 90°, are aligned.
  • the signal coming from that direction is emphasized.
  • signals arriving from directions other than the predetermined direction are not emphasized as much as signals arriving from the predetermined direction because the phases do not match each other.
  • the signal D (f, k) output from the beam forming unit 103 has directivity characteristics as shown in C of FIG. 11.
  • the signal D (f, k) output from the beam forming unit 103 is a signal in which the voice to be extracted, that is, the voice uttered by the user (hereinafter referred to as the target voice as appropriate), and the noise to be suppressed are mixed.
  • the target voice in the signal D (f, k) output from the beam forming unit 103 is emphasized compared with the target voice included in the signals x 1 (f, k) to x m (f, k) input to the beam forming unit 103. Further, the noise in the signal D (f, k) output from the beam forming unit 103 is reduced compared with the noise included in the signals x 1 (f, k) to x m (f, k) input to the beam forming unit 103.
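As a minimal sketch (not the patent's implementation), the two-microphone delay-and-sum operation described above can be written as follows. The sampling rate, delay, and test tone are illustrative assumptions; only integer-sample delays are handled.

```python
import numpy as np

def ds_beamform(x1, x2, delay_samples):
    """Delay-and-sum: delay microphone 23-2's signal so that a wave arriving
    from the target direction lines up with microphone 23-1's signal, then add
    (delay unit 131 followed by adder 132)."""
    x2_delayed = np.roll(x2, delay_samples)  # integer-sample delay
    x2_delayed[:delay_samples] = 0.0         # clear wrapped-around samples
    return x1 + x2_delayed

# Illustrative setup: a 440 Hz wave reaches mic 23-2 first and mic 23-1
# 'delay' samples later, so delaying mic 23-2 aligns the two copies.
fs, delay = 16000, 4
t = np.arange(1024) / fs
src = np.sin(2 * np.pi * 440 * t)
x2 = src.copy()
x1 = np.roll(src, delay)
x1[:delay] = 0.0
out = ds_beamform(x1, x2, delay)
```

Aligned components add coherently, so the on-axis output is about twice the single-microphone amplitude, while arrivals from other directions are only partially reinforced.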
  • NULL beam forming (NBF)
  • when the beam forming unit 103 performs voice attenuation processing based on NULL beam forming, it has a configuration as shown in A of FIG. 12.
  • the beam forming unit 103 includes a delay device 141 and a subtracter 142.
  • the time-frequency conversion unit 102 is not shown.
  • in A of FIG. 12, a case where two microphones 23 are used will be described as an example.
  • the audio signal from the microphone 23-1 is supplied to the subtractor 142, and the audio signal from the microphone 23-2 is delayed by a predetermined time by the delay device 141 and then supplied to the subtractor 142.
  • the configuration for performing NULL beam forming and the configuration for performing DS beam forming described with reference to FIG. 11 are basically the same; the only difference is whether the signals are added by the adder 132 or subtracted by the subtractor 142. Therefore, a detailed description of the configuration is omitted here, and descriptions of the parts that are the same as in FIG. 11 are omitted as appropriate.
  • at the input of the subtractor 142, the phases of signals arriving from a predetermined direction coincide.
  • the signal coming from that direction is therefore attenuated; theoretically, it is attenuated to zero.
  • signals arriving from directions other than the predetermined direction are not attenuated as much as signals arriving from the predetermined direction because the phases do not match each other.
  • the signal D (f, k) output from the beam forming unit 103 has directivity characteristics as shown in B of FIG. 12.
  • the signal D (f, k) output from the beam forming unit 103 is a signal in which the target voice is canceled and noise remains.
  • the target voice in the signal D (f, k) output from the beam forming unit 103 is attenuated compared with the target voice included in the signals x 1 (f, k) to x m (f, k) input to the beam forming unit 103. Further, the noise in the signal D (f, k) output from the beam forming unit 103 remains at approximately the same level as the noise included in the signals x 1 (f, k) to x m (f, k) input to the beam forming unit 103.
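The delay-and-subtract operation differs from the delay-and-sum sketch above only in the final subtraction (subtractor 142 instead of adder 132). A minimal sketch, with the same illustrative sampling rate and delay assumptions:

```python
import numpy as np

def null_beamform(x1, x2, delay_samples):
    """Delay-and-subtract: align the wave arriving from the target direction,
    then subtract (delay unit 141 followed by subtractor 142), so the aligned
    component cancels -- ideally to zero -- while off-axis sound remains."""
    x2_delayed = np.roll(x2, delay_samples)
    x2_delayed[:delay_samples] = 0.0
    return x1 - x2_delayed

# Illustrative check: a wave arriving from the target direction cancels.
fs, delay = 16000, 4
t = np.arange(1024) / fs
src = np.sin(2 * np.pi * 440 * t)
x2 = src.copy()
x1 = np.roll(src, delay)
x1[:delay] = 0.0
residual = null_beamform(x1, x2, delay)
```

For the perfectly aligned on-axis wave the residual is zero, matching the theoretical attenuation to zero stated above; signals from other directions do not cancel.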
  • the beam forming of the beam forming unit 103 can be expressed by the following equations (1) to (4).
  • f is the sampling frequency
  • n is the number of FFT points
  • dm is the position of the microphone m
  • θ is the direction to be emphasized
  • i is the imaginary unit
  • s is a constant representing the speed of sound.
  • the superscript “.T” represents transposition.
  • the beam forming unit 103 performs beam forming by substituting values into the equations (1) to (4).
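Since equations (1) to (4) are not reproduced in this text, the following sketch uses the textbook DS steering vector instead; the sampling rate, FFT size, microphone positions, and speed of sound are illustrative assumptions, not values from the patent.

```python
import numpy as np

def steering_vector(f_bin, mic_pos, theta_deg, fs=16000, n_fft=512, c=340.0):
    """Coefficient vector C(f) for bin f_bin: phase terms compensating each
    microphone's propagation delay d_m * sin(theta) / c, normalized so the
    gain toward theta is 1 (textbook DS form, not the patent's Eqs. (1)-(4))."""
    f_hz = f_bin * fs / n_fft
    delays = np.asarray(mic_pos, dtype=float) * np.sin(np.deg2rad(theta_deg)) / c
    return np.exp(-2j * np.pi * f_hz * delays) / len(mic_pos)

def beamform_bin(c_f, x_f):
    """D(f,k) = C(f,k)^H x(f,k): conjugate inner product of the coefficient
    vector with the per-microphone spectra (np.vdot conjugates its first arg)."""
    return np.vdot(c_f, x_f)

# A plane wave from theta arrives with exactly these relative phases,
# so the beamformer recovers its amplitude unchanged.
mic_pos = [0.0, 0.05]                      # assumed 5 cm spacing
c_f = steering_vector(64, mic_pos, 30.0)
f_hz = 64 * 16000 / 512
delays = np.array(mic_pos) * np.sin(np.deg2rad(30.0)) / 340.0
x_f = 3.0 * np.exp(-2j * np.pi * f_hz * delays)
d = beamform_bin(c_f, x_f)
```

Applying the same coefficient vector to every frame k gives the frequency-domain equivalent of the time-domain delay-and-sum above.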
  • DS beam forming has been described as an example, but other beam forming methods such as adaptive beam forming, as well as speech enhancement or speech attenuation processing by methods other than beam forming, can also be applied to the present technology.
  • in step S107, when the beam forming process is performed in the beam forming unit 103, the result is supplied to the signal correction unit 106 and the correction coefficient calculation unit 107.
  • in step S108, the correction coefficient calculation unit 107 calculates a correction coefficient from the input signal and the beam-formed signal.
  • the calculated correction coefficient is supplied from the correction coefficient calculation unit 107 to the signal correction unit 106 in step S109.
  • in step S110, the signal correction unit 106 corrects the signal after beam forming using the correction coefficient.
  • in other words, the processing of steps S108 to S110, that is, the processing of the correction coefficient calculation unit 107 and the signal correction unit 106, will now be described.
  • the signal correcting unit 106 receives the beam-formed signal D (f, k) from the beam forming unit 103 and outputs the corrected signal Z (f, k).
  • the signal correction unit 106 performs correction based on the following equation (5).
  • G (f, k) represents a correction coefficient supplied from the correction coefficient calculation unit 107.
  • the correction coefficient G (f, k) is calculated by the correction coefficient calculation unit 107.
  • the correction coefficient calculation unit 107 is supplied with the signals x 1 (f, k) to x m (f, k) from the time frequency conversion unit 102 and the signal D (f, k) after beam forming from the beam forming unit 103.
  • the correction coefficient calculation unit 107 calculates the correction coefficient in the following two steps.
  First step: calculation of the signal change rate
  Second step: determination of the gain value
  • in the first step, using the levels of the input signals x (f, k) from the time frequency conversion unit 102 and the signal D (f, k) from the beam forming unit 103, a change rate Y (f, k) representing how much the signal has changed by beam forming is calculated based on the following equations (6) and (7).
  • the change rate Y (f, k) is obtained as the ratio of the absolute value of the signal D (f, k) after beam forming to the absolute value of the average of the input signals x 1 (f, k) to x m (f, k).
  • Expression (7) is an expression for calculating an average value of the input signals x 1 (f, k) to x m (f, k).
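A direct transcription of equations (6) and (7) for one time-frequency bin might look like this; the small epsilon guarding against division by zero is my addition, not part of the patent text.

```python
import numpy as np

def change_rate(x_mics, d, eps=1e-12):
    """Y(f,k) = |D(f,k)| / |x_mean(f,k)|, where x_mean is the average of the
    input signals x_1(f,k)..x_m(f,k) over the microphones (Eqs. (6)-(7))."""
    x_mean = np.mean(x_mics)          # Eq. (7): microphone average
    return np.abs(d) / (np.abs(x_mean) + eps)

# Example: beam forming doubled the bin's magnitude, so Y = 2.
y = change_rate(np.array([1.0 + 0j, 1.0 + 0j]), 2.0 + 0j)
```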
  • Second step: determination of the gain value
  • the change rate Y (f, k) obtained in the first step is used to determine the correction coefficient G (f, k).
  • the correction coefficient G (f, k) is determined using, for example, a table as shown in FIG.
  • the table shown in FIG. 14 is one example; in any case, the table satisfies the following conditions 1 to 3.
  • Condition 1 is a case where the absolute value of the signal D (f, k) after beam forming is equal to or less than the absolute value of the average value of the input signals x 1 (f, k) to x m (f, k). That is, the rate of change Y (f, k) is 1 or less.
  • Condition 2 is a case where the absolute value of the signal D (f, k) after beam forming is equal to or greater than the absolute value of the average value of the input signals x 1 (f, k) to x m (f, k). That is, the change rate Y (f, k) is 1 or more.
  • Condition 3 is a case where the absolute value of the signal D (f, k) after beam forming and the absolute value of the average value of the input signals x 1 (f, k) to x m (f, k) are the same. That is, the change rate Y (f, k) is 1.
  • when condition 1 is satisfied, correction is performed such that the signal D (f, k) after beam forming is further suppressed, so that the influence of sound increased by sudden noise is suppressed.
  • when condition 2 is satisfied, correction is performed to suppress the signal D (f, k) after beam forming that has been amplified by the processing of the beam forming unit 103.
  • condition 2 arises because the sudden noise occurs in a direction different from the direction in which noise is suppressed, so the sudden noise is also amplified by the beam forming process, and the signal D (f, k) after beam forming becomes larger than the average of the input signals x 1 (f, k) to x m (f, k).
  • in this case, correction is performed to suppress the signal D (f, k) after beam forming that has been amplified by the processing of the beam forming unit 103.
  • when condition 3 is met, no correction is made. In this case, since no sudden noise has occurred, there is no significant change in the sound, and the signal D (f, k) after beam forming and the average of the input signals x 1 (f, k) to x m (f, k) remain at substantially the same level, so no correction is necessary.
  • the table shown in FIG. 14 is an example and does not indicate limitation.
  • for example, another table set based on more detailed conditions instead of three conditions (three ranges) may be used.
  • the table can be arbitrarily set by the designer.
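Since the actual values of the FIG. 14 table are not reproduced in this text, the following sketch uses hypothetical gains that merely satisfy conditions 1 to 3:

```python
def gain_from_change_rate(y):
    """Map change rate Y to correction coefficient G so that:
    condition 1 (Y < 1): further suppress the already-attenuated bin,
    condition 2 (Y > 1): suppress the amplification added by beam forming,
    condition 3 (Y = 1): leave the bin unchanged.
    The numeric values are illustrative, not those of FIG. 14."""
    if y < 1.0:
        return 0.5        # hypothetical suppression factor
    if y > 1.0:
        return 1.0 / y    # hypothetical: roughly cancel the amplification
    return 1.0
```

The signal correction unit then applies Z(f, k) = G(f, k) · D(f, k) per equation (5).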
  • in step S110, the signal corrected by the signal correction unit 106 is output to the time-frequency inverse transform unit 108.
  • in step S111, the time-frequency inverse conversion unit 108 converts the time-frequency signal Z (f, k) from the signal correction unit 106 into a time signal z (n).
  • the time-frequency inverse transform unit 108 adds the frames while shifting them to generate an output signal z (n).
  • the time-frequency inverse conversion unit 108 performs an inverse FFT for each frame, and the resulting 512-sample outputs are superimposed while shifting by 256 samples to generate the output signal z (n).
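The 512-sample / 256-sample-shift reassembly can be sketched as a plain overlap-add; the frame size and hop are as stated above, while any synthesis windowing that typically accompanies this is omitted here.

```python
import numpy as np

def overlap_add(frames, hop=256):
    """Sum inverse-FFT output frames while shifting each by 'hop' samples,
    producing the time-domain output signal z(n)."""
    n_frames, frame_len = frames.shape
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
    return out

# Tiny example: two 4-sample frames with hop 2 overlap in the middle.
z = overlap_add(np.ones((2, 4)), hop=2)
```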
  • the generated output signal z (n) is output from the time-frequency inverse transform unit 108 to a subsequent processing unit (not shown) in step S113.
  • FIG. 15 shows the voice processing apparatus 100 shown in FIG. 3 divided into two parts: the part including the beam forming unit 103, the filter selection unit 104, and the filter coefficient holding unit 105 is a first portion 151, and the part including the signal correction unit 106 and the correction coefficient calculation unit 107 is a second portion 152.
  • the first portion 151 is a portion that reduces stationary noise, for example, the sound of a fan of a projector and the sound of air conditioning, by beam forming.
  • the filter held by the filter coefficient holding unit 105 is a linear filter, so that it can be operated with high sound quality and stability.
  • the first portion 151 also performs follow-up processing so that an optimal filter is appropriately selected when, for example, the direction of the noise changes or the position of the sound processing apparatus 100 itself changes.
  • the follow-up speed can be adjusted through the accumulation time used when creating the histogram.
  • as a result, processing can be performed so that the sound does not change instantaneously, as it does in adaptive beam forming, and does not cause a sense of incongruity.
  • the second portion 152 is a portion that reduces sudden noise coming from other than the direction attenuated by beamforming.
  • the stationary noise reduced by beam forming is further reduced depending on the situation.
  • FIG. 16 is a diagram illustrating a relationship between a filter and noise set at a certain time.
  • the filter A described with reference to FIG. 8 is applied.
  • the stationary noise 171 is determined to be in the −90 degree direction, so the filter A is applied.
  • by applying the filter A, the sound in the direction of the stationary noise 171 is suppressed, and a sound with the stationary noise 171 suppressed can be acquired.
  • sudden noise 172 occurs in a direction of 90 degrees at time T2.
  • since the filter A is applied, sound from the 90-degree direction is amplified (the gain is high), so if sudden noise occurs in that direction, the sudden noise is also amplified.
  • however, because the signal correction unit 106 performs correction to reduce the gain, the sound that is finally output is prevented from being increased by the sudden noise.
  • the second portion 152 performs correction for suppressing the amplification amount. As a result, the influence of sudden noise can be suppressed.
  • when the noise source moves, the filter can be appropriately switched in accordance with the direction of the sound source, while frequent switching of the filter can be prevented.
  • with the present technology, for example, it is possible to obtain the target voice using only small omnidirectional microphones and signal processing, without using a directional microphone (gun microphone) having a large housing, which contributes to reducing the size and weight of the apparatus. Further, the present technology can also be applied when a directional microphone is used, and the same effects can be expected in that case.
  • the desired sound can be collected by reducing the influence of stationary noise and sudden noise, it is possible to improve the accuracy of speech processing such as speech recognition rate.
  • the above-described 1-1 speech processing apparatus 100 uses the speech signal from the time-frequency conversion unit 102 to select a filter, whereas the 1-2 speech processing apparatus 200 (FIG. 17) differs in that it selects a filter using information input from the outside.
  • FIG. 17 is a diagram showing a configuration of the first-second audio processing apparatus 200.
  • in the speech processing apparatus 200 shown in FIG. 17, parts having the same functions as those of the 1-1 speech processing apparatus 100 shown in FIG. 3 are denoted by the same reference numerals, and descriptions thereof are omitted as appropriate.
  • the audio processing device 200 shown in FIG. 17 differs from the speech processing apparatus 100 shown in FIG. 3 in that the information necessary for selecting a filter is supplied to the filter instruction unit 201 from the outside, and the signal from the time-frequency conversion unit 102 is not supplied to the filter instruction unit 201.
  • Information necessary for selecting a filter supplied to the filter instruction unit 201 is, for example, information input by the user.
  • it may be configured such that the user selects the direction of sound to be collected and the selected information is input.
  • a screen as shown in FIG. 18 is displayed on the display 22 of the mobile phone 10 (FIG. 1) including the audio processing device 200.
  • a message “What is the direction of the sound to be collected?” is displayed at the top, and options for selecting one of three areas are displayed below the message.
  • the options are composed of a left area 221, a middle area 222, and a right area 223.
  • the user looks at the message and the options and selects, from the options, the direction from which sound is to be collected. For example, when the sound to be collected is in the middle (front), the area 222 is selected. Such a screen may be presented to the user so that the user can select the direction of the sound to be collected.
  • in the above example, the direction of the sound to be collected is selected, but instead a message such as “Which direction is loud?” may be displayed, and the user may be allowed to select the direction in which noise is present.
  • a list of filters may be displayed, a user may select a filter from the list, and the selected information may be input.
  • for example, descriptions such as “a filter used when there is a large amount of noise in the right direction” or “a filter used when collecting sound from a wide range” may be displayed in a list on the display 22 (FIG. 1) so that the user can recognize each filter and select one.
  • a filter switching switch (not shown) may be provided in the voice processing apparatus 200 so that operation information of the switch is input.
  • the filter instruction unit 201 acquires such information, and instructs the filter coefficient holding unit 105 to specify the index of the filter coefficient used for beamforming from the acquired information.
  • steps S201 to S203 are performed in the same manner as the processes of steps S101 to S103 shown in FIG. 4.
  • in the 1-1 speech processing apparatus 100, the process of determining the filter is executed in step S104; in the 1-2 speech processing apparatus 200, such a process is not necessary and is omitted from the processing flow. Instead, in the 1-2 speech processing apparatus 200, it is determined in step S204 whether or not there has been a filter change instruction.
  • if it is determined in step S204 that there has been an instruction to change the filter, for example, an instruction from the user by the method described above, the process proceeds to step S205; if it is determined that there has been no instruction to change the filter, the process of step S205 is skipped and the process proceeds to step S206 (FIG. 20).
  • step S205 the filter coefficient is read from the filter coefficient holding unit 105 and sent to the beam forming unit 103 as in step S106 (FIG. 4).
  • steps S206 to S212 are basically performed in the same manner as the processes of steps S107 to S113 shown in FIG.
  • in the 1-2 audio processing apparatus 200, information for selecting a filter is input from the outside (the user).
  • in the 1-2 speech processing apparatus 200, as in the 1-1 speech processing apparatus 100, an appropriate filter is selected and sudden noise and the like can be handled appropriately, so it is possible to improve the accuracy of voice processing such as the speech recognition rate.
  • FIG. 21 is a diagram illustrating a configuration of the second-first audio processing device 300.
  • the voice processing device 300 is provided inside the mobile phone 10 and constitutes a part of the mobile phone 10.
  • the audio processing device 300 shown in FIG. 21 includes a sound collection unit 101, a time frequency conversion unit 102, a filter selection unit 104, a filter coefficient holding unit 105, a signal correction unit 106, a correction coefficient calculation unit 107, and a time frequency inverse conversion unit 108, as well as a beam forming unit 301 and a signal transition unit 304.
  • the beam forming unit 301 includes a main beam forming unit 302 and a sub beam forming unit 303. Parts having the same functions as those of the speech processing apparatus 100 shown in FIG. 3 are denoted by the same reference numerals, and description thereof will be omitted as appropriate.
  • the speech processing apparatus 300 in the second embodiment differs in that the beam forming unit 103 (FIG. 3) is replaced with a beam forming unit 301 including a main beam forming unit 302 and a sub beam forming unit 303, and in that a signal transition unit 304 is provided for switching between the signals from the main beam forming unit 302 and the sub beam forming unit 303.
  • the beam forming unit 301 includes a main beam forming unit 302 and a sub beam forming unit 303, and the main beam forming unit 302 and the sub beam forming unit 303 are respectively supplied from the time frequency conversion unit 102. Signals x 1 (f, k) to x m (f, k) converted to signals in the frequency domain are supplied.
  • the beam forming unit 301 includes a main beam forming unit 302 and a sub beam forming unit 303 in order to prevent the sound from changing at the moment when the filter coefficient C (f, k) supplied from the filter coefficient holding unit 105 is switched. Prepare.
  • the beam forming unit 301 performs the following operation.
  • when the filter coefficient is switched, both the main beam forming unit 302 and the sub beam forming unit 303 of the beam forming unit 301 operate: the main beam forming unit 302 executes the process with the old filter coefficient (the filter coefficient before switching), and the sub beam forming unit 303 executes the process with the new filter coefficient (the filter coefficient after switching).
  • after a predetermined number of frames, here t frames, has elapsed, the main beam forming unit 302 starts operating with the new filter coefficient, and the sub beam forming unit 303 stops operating.
  • t is the number of transition frames and is arbitrarily set.
  • when the filter coefficient C (f, k) is switched, the beam forming unit 301 outputs a beam-formed signal from each of the main beam forming unit 302 and the sub beam forming unit 303.
  • the signal transition unit 304 performs a process of mixing the signals output from the main beam forming unit 302 and the sub beam forming unit 303, respectively.
  • the signal transition unit 304 may perform the mixing with a fixed mixing ratio, or while gradually changing the mixing ratio. For example, immediately after the filter coefficient C (f, k) is switched, processing is performed with a mixing ratio that includes more of the signal from the main beam forming unit 302 than of the signal from the sub beam forming unit 303; then the ratio of the signal from the main beam forming unit 302 is gradually reduced, and the mixing ratio is changed so that a larger amount of the signal from the sub beam forming unit 303 is included.
  • the signal transition unit 304 performs the following operation.
  • while the filter coefficient is not being switched, the signal from the main beam forming unit 302 is output to the signal correction unit 106 as it is.
  • until t frames elapse after the filter coefficient C (f, k) is switched, the signal from the main beam forming unit 302 and the signal from the sub beam forming unit 303 are mixed based on the following equation (8), and the mixed signal is output to the signal correction unit 106.
  • α is a coefficient that takes a value of 0.0 to 1.0 and can be arbitrarily set by the designer.
  • the coefficient α may be a fixed value, with the same value used from when the filter coefficient C (f, k) is switched until t frames elapse.
  • alternatively, the coefficient α may be a variable value: for example, it may be set to 1.0 immediately after the switch, decrease with time, and reach 0.0 when t frames have elapsed.
  • the output signal D (f, k) from the signal transition unit 304 after the filter coefficient is switched is the sum of the signal D main (f, k) from the main beam forming unit 302 multiplied by α and the signal D sub (f, k) from the sub beam forming unit 303 multiplied by (1−α).
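Equation (8) with the linearly decreasing α schedule described above can be sketched per frame as follows; the linear schedule is one of the options the text allows, not the only one.

```python
def transition_mix(d_main, d_sub, frame_idx, t_frames):
    """D = alpha * D_main + (1 - alpha) * D_sub (Eq. (8)), with alpha falling
    linearly from 1.0 at the moment of the switch to 0.0 after t frames."""
    alpha = max(0.0, 1.0 - frame_idx / t_frames)
    return alpha * d_main + (1.0 - alpha) * d_sub

# Immediately after the switch the old (main) filter dominates; after t
# frames the output comes entirely from the new (sub) filter.
```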
  • the speech processing apparatus 300 including the main beam forming unit 302 and the sub beam forming unit 303 and including the signal transition unit 304 will be described with reference to the flowcharts of FIGS.
  • since the parts having the same functions as those of the audio processing apparatus 100 of the 1-1 embodiment basically perform the same processes, descriptions thereof are omitted as appropriate.
  • in steps S301 to S305, processing by the sound collection unit 101, the time frequency conversion unit 102, and the filter selection unit 104 is executed. Since the processing of steps S301 to S305 is performed in the same manner as steps S101 to S105 (FIG. 4), description thereof is omitted.
  • if it is determined in step S305 that there is no change in the filter, the process proceeds to step S306.
  • in step S306, the main beam forming unit 302 performs the beam forming process using the filter coefficient C (f, k) set at that time; that is, the process with the currently set filter coefficient is continued.
  • the signal after beam forming from the main beam forming unit 302 is supplied to the signal transition unit 304.
  • the signal transition unit 304 outputs the supplied signal to the signal correction unit 106 as it is.
  • in step S312, the correction coefficient calculation unit 107 calculates a correction coefficient from the input signal and the beam-formed signal.
  • Each process of steps S312 to S317 performed by the signal correction unit 106, the correction coefficient calculation unit 107, and the time-frequency inverse transform unit 108 is performed by the 1-1 speech processing apparatus 100 in steps S108 to S113 (FIG. 5). Since it is performed in the same manner as the process to be executed, the description thereof is omitted.
  • on the other hand, if it is determined in step S305 that the filter is to be changed, the process proceeds to step S306.
  • in step S306, the filter coefficient is read from the filter coefficient holding unit 105 and supplied to the sub beam forming unit 303.
  • in step S307, the main beam forming unit 302 and the sub beam forming unit 303 perform beam forming processing.
  • the main beam forming unit 302 performs beam forming with the filter coefficient before the filter change (hereinafter, the old filter coefficient), and the sub beam forming unit 303 performs beam forming with the filter coefficient after the filter change (hereinafter, the new filter coefficient).
  • that is, in the process of step S307, the main beam forming unit 302 continues the beam forming process without changing its filter coefficient, and the sub beam forming unit 303 starts the beam forming process using the new filter coefficient supplied from the filter coefficient holding unit 105.
  • in step S309, the signal transition unit 304 mixes the signal from the main beam forming unit 302 and the signal from the sub beam forming unit 303 based on the above-described equation (8), and outputs the mixed signal to the signal correction unit 106.
  • in step S310, it is determined whether or not the number of signal transition frames has elapsed. If not, the process returns to step S309 and the subsequent processing is repeated; that is, until the number of signal transition frames has elapsed, the signal transition unit 304 continues to mix and output the signals from the main beam forming unit 302 and the sub beam forming unit 303.
  • the processing of steps S312 to S317 is performed on the output from the signal transition unit 304, so a signal continues to be supplied to the processing unit (not shown) in the subsequent stage.
  • if it is determined in step S310 that the number of signal transition frames has elapsed, the process proceeds to step S311.
  • in step S311, a process of moving the new filter coefficient to the main beam forming unit 302 is executed. After that, the main beam forming unit 302 performs the beam forming process using the new filter coefficient, and the sub beam forming unit 303 stops the beam forming process.
  • in this way, when the filter coefficient is changed, the signals from the main beam forming unit 302 and the sub beam forming unit 303 are mixed to prevent the output signal from changing suddenly, so even if the filter coefficient changes, the user does not feel uncomfortable with the output signal.
  • the above-described effects of the 1-1 speech processing apparatus 100 and the 1-2 speech processing apparatus 200 can also be obtained in the 2-1 speech processing apparatus 300.
  • FIG. 25 is a diagram showing a configuration of the 2-2 speech processing apparatus 400.
  • the same reference numerals are given to the portions having the same functions as those of the 2-1 audio processing device 300 shown in FIG. 21, and the description thereof is omitted.
  • the audio processing apparatus 400 shown in FIG. 25 differs from the speech processing apparatus 300 shown in FIG. 21 in that the information necessary for selecting a filter is supplied to the filter instruction unit 401 from the outside, and the signal from the time-frequency conversion unit 102 is not supplied to the filter instruction unit 401.
  • the filter instruction unit 401 may have the same configuration as the filter instruction unit 201 of the first-second audio processing device 200.
  • Information necessary for selecting a filter supplied to the filter instruction unit 401 is, for example, information input by the user.
  • it may be configured such that the user selects the direction of sound to be collected and the selected information is input.
  • the screen as shown in FIG. 18 already described is displayed on the display 22 of the mobile phone 10 (FIG. 1) including the audio processing device 400, and such a screen is used to accept an instruction from the user. You may do it.
  • a list of filters may be displayed, a user may select a filter from the list, and the selected information may be input.
  • a filter switching switch (not shown) may be provided in the audio processing device 400 so that operation information of the switch is input.
  • the filter instruction unit 401 obtains such information, and instructs the filter coefficient holding unit 105 of the index of the filter coefficient used for beam forming from the obtained information.
  • steps S401 to S403 are performed in the same manner as the processes of steps S301 to S303 shown in FIG.
  • in the 2-1 speech processing apparatus 300, the process of determining the filter is executed in step S304, but in the 2-2 speech processing apparatus 400 such a process is not necessary and is omitted from the process flow.
  • instead, it is determined in step S404 whether or not there has been a filter change instruction.
  • if it is determined in step S404 that there is no filter change instruction, the process proceeds to step S405; if it is determined that there is a filter change instruction, the process proceeds to step S406.
  • since the processes of steps S405 to S416 are basically performed in the same manner as the processes of steps S306 to S317 shown in FIGS. 23 and 24, description thereof is omitted.
  • in the 2-2 speech processing apparatus 400, information for selecting a filter is input from the outside (the user).
  • in the 2-2 speech processing apparatus 400, as in the 1-1 speech processing apparatus 100, the 1-2 speech processing apparatus 200, and the 2-1 speech processing apparatus 300, an appropriate filter is selected.
  • in addition, it becomes possible to prevent the user from feeling uncomfortable with the output signal even if the filter coefficient changes.
  • the series of processes described above can be executed by hardware or can be executed by software.
  • a program constituting the software is installed in the computer.
  • the computer may be a computer incorporated in dedicated hardware, or may be, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
  • FIG. 28 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processing by a program.
  • in the computer, a CPU (Central Processing Unit) 1001, a ROM (Read Only Memory) 1002, and a RAM (Random Access Memory) 1003 are connected to one another by a bus 1004.
  • An input / output interface 1005 is further connected to the bus 1004.
  • An input unit 1006, an output unit 1007, a storage unit 1008, a communication unit 1009, and a drive 1010 are connected to the input / output interface 1005.
  • the input unit 1006 includes a keyboard, a mouse, a microphone, and the like.
  • the output unit 1007 includes a display, a speaker, and the like.
  • the storage unit 1008 includes a hard disk, a nonvolatile memory, and the like.
  • the communication unit 1009 includes a network interface.
  • the drive 1010 drives a removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
  • in the computer configured as described above, the CPU 1001 loads the program stored in the storage unit 1008 into the RAM 1003 via the input / output interface 1005 and the bus 1004 and executes it, whereby the above-described series of processing is performed.
  • the program executed by the computer (CPU 1001) can be provided by being recorded on the removable medium 1011 as a package medium, for example.
  • the program can be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
  • the program can be installed in the storage unit 1008 via the input / output interface 1005 by attaching the removable medium 1011 to the drive 1010. Further, the program can be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008. In addition, the program can be installed in advance in the ROM 1002 or the storage unit 1008.
  • The program executed by the computer may be a program whose processing is performed in time series in the order described in this specification, or a program whose processing is performed in parallel or at a necessary timing, such as when a call is made.
  • In this specification, a system represents an entire apparatus composed of a plurality of apparatuses.
  • Note that the present technology can also take the following configurations.
  • (2) The sound processing apparatus according to (1), wherein the selection unit selects the filter coefficient based on a signal collected by the sound collection unit.
  • (3) The sound processing apparatus according to (1) or (2), wherein the selection unit creates, from the signal collected by the sound collection unit, a histogram in which the direction in which the sound is generated and the intensity of the sound are associated with each other, and selects the filter coefficient from the histogram.
  • (4) The sound processing apparatus according to (3), wherein the selection unit creates the histogram from the signal accumulated for a predetermined time.
  • (5) The sound processing apparatus, wherein a filter coefficient of a filter that suppresses the sound in a region other than a region including the maximum value of the histogram is selected.
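The histogram-based filter selection described above can be sketched as follows. This is a minimal Python illustration, not the disclosed implementation; the bin count, the intensity weighting, and the synthetic direction estimates are assumptions introduced here.

```python
import numpy as np

def select_suppression_region(directions_deg, intensities, num_bins=36):
    """Build a histogram associating sound-arrival direction with intensity,
    accumulated over a predetermined time, and return the bin containing the
    histogram maximum (the region to keep) together with all remaining bins
    (the regions whose sound a selected filter would suppress)."""
    bins = np.linspace(0.0, 360.0, num_bins + 1)
    # Each direction estimate contributes its intensity, not just a count.
    hist, _ = np.histogram(directions_deg, bins=bins, weights=intensities)
    keep_bin = int(np.argmax(hist))  # region including the maximum value
    suppress_bins = [b for b in range(num_bins) if b != keep_bin]
    return keep_bin, suppress_bins, hist

# Synthetic example: most energy arrives from around 90 degrees,
# with weaker diffuse sound from all directions.
rng = np.random.default_rng(0)
dirs = np.concatenate([rng.normal(90.0, 5.0, 200), rng.uniform(0.0, 360.0, 50)])
inten = np.concatenate([np.full(200, 1.0), np.full(50, 0.3)])
keep, suppress, hist = select_suppression_region(dirs % 360.0, inten)
```

A filter selection unit could then pick, from a held set of coefficients, the beam-forming filter whose suppressed directions best match `suppress`.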
  • (6) The sound processing apparatus according to any one of (1) to (5), further including a conversion unit that converts the signal collected by the sound collection unit into a frequency domain signal, wherein the selection unit selects the filter coefficient for all frequency bands using a signal from the conversion unit.
  • (7) The sound processing apparatus according to any one of (1) to (5), further including a conversion unit that converts the signal collected by the sound collection unit into a frequency domain signal, wherein the selection unit selects the filter coefficient for each frequency band using a signal from the conversion unit.
  • (8) The sound processing apparatus according to any one of (1) to (7), wherein the application unit includes a first application unit and a second application unit, and a mixing unit that mixes signals from the first application unit and the second application unit is further included; when switching from a first filter coefficient to a second filter coefficient, the first application unit applies the filter based on the first filter coefficient, the second application unit applies the filter based on the second filter coefficient, and the mixing unit mixes the signal from the first application unit and the signal from the second application unit at a predetermined mixing ratio.
  • (9) The sound processing apparatus according to (8), wherein, after a predetermined time has elapsed, the first application unit starts a process of applying the filter based on the second filter coefficient, and the second application unit stops its process.
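The two-application-unit switching described above amounts to a cross-fade between two filter outputs. The Python sketch below illustrates one way the mixing unit's predetermined mixing ratio could be ramped; the linear ramp and the callable toy filters are assumptions, not details from this disclosure.

```python
import numpy as np

def crossfade_switch(x, old_filter, new_filter, fade_len):
    """Run both application units on the same block and mix their outputs:
    the mixing weight moves from the old filter's output to the new one so
    the filter-coefficient change is not audible as a discontinuity."""
    y_old = old_filter(x)   # first application unit (first coefficient)
    y_new = new_filter(x)   # second application unit (second coefficient)
    ramp = np.clip(np.arange(len(x)) / float(fade_len), 0.0, 1.0)
    return (1.0 - ramp) * y_old + ramp * y_new

# Toy filters: the old one mutes the block, the new one passes it through.
block = np.ones(8)
out = crossfade_switch(block, lambda s: 0.0 * s, lambda s: s, fade_len=4)
```

Once the ramp reaches 1 only the second unit's output remains, so the first unit can take over the new coefficient and the second unit can stop, matching the behavior after the predetermined time elapses.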
  • (10) The sound processing apparatus, wherein the selection unit selects the filter coefficient based on an instruction from a user.
  • (11) The sound processing apparatus according to any one of (1) to (10), wherein the correction unit performs correction to further suppress the signal suppressed by the application unit when the signal collected by the sound collection unit is smaller than the signal to which the predetermined filter has been applied by the application unit, and performs correction to suppress the signal amplified by the application unit when the signal collected by the sound collection unit is larger than the signal to which the predetermined filter has been applied by the application unit.
  • (12) The sound processing apparatus according to any one of (1) to (11), wherein the application unit suppresses stationary noise, and the correction unit suppresses sudden noise.
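The correction unit described above (handling sudden noise after the filter suppresses stationary noise) can be sketched per frequency band as follows. This is one plausible reading in Python; the magnitude-spectrum representation and the correction gain of 0.5 are assumptions introduced here, not values from this disclosure.

```python
import numpy as np

def correction_unit(collected_mag, filtered_mag, gain=0.5):
    """Per-frequency-band correction applied after the application unit.
    Bands where the collected signal is smaller than the filtered one were
    amplified by the filter (e.g. by leaked sudden noise) and are pulled back
    below the collected level; bands where it is larger were already
    suppressed by the filter and are suppressed a little further."""
    amplified = collected_mag < filtered_mag
    out = np.empty_like(filtered_mag)
    out[amplified] = collected_mag[amplified] * gain   # suppress amplified bands
    out[~amplified] = filtered_mag[~amplified] * gain  # suppress further
    return out

collected = np.array([1.0, 1.0])
filtered = np.array([2.0, 0.5])   # band 0 was amplified, band 1 suppressed
corrected = correction_unit(collected, filtered)
```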
  • (13) A sound processing method including the steps of: collecting sound; applying a predetermined filter to the collected signal; selecting a filter coefficient of the filter to be applied; and correcting the signal to which the predetermined filter has been applied.
  • (14) A program for causing a computer to execute processing including the steps of: collecting sound; applying a predetermined filter to the collected signal; selecting a filter coefficient of the filter to be applied; and correcting the signal to which the predetermined filter has been applied.
  • 100 sound processing device, 101 sound collection unit, 102 time-frequency conversion unit, 103 beam forming unit, 104 filter selection unit, 105 filter coefficient holding unit, 106 signal correction unit, 108 time-frequency inverse conversion unit, 200 sound processing device, 201 filter instruction unit, 300 sound processing device, 301 beam forming unit, 302 main beam forming unit, 303 sub beam forming unit, 304 signal transition unit, 400 sound processing device, 401 filter instruction unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention relates to a sound processing device, a sound processing method, and a program that make it possible to collect a desired sound. Provided are: a sound collection unit that collects sound; an application unit that applies a predetermined filter to a signal collected by the sound collection unit; a selection unit that selects a filter coefficient of the filter applied by the application unit; and a correction unit that corrects the signal supplied from the application unit. The selection unit selects the filter coefficient on the basis of the signal collected by the sound collection unit. From the signal collected by the sound collection unit, the selection unit creates a histogram in which the direction in which the sound occurs and the intensity of the sound are associated with each other, and selects the filter coefficient from the histogram. The present invention can be applied to sound processing devices.
PCT/JP2015/080481 2014-11-11 2015-10-29 Sound processing device, sound processing method and program WO2016076123A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2016558971A JP6686895B2 (ja) 2014-11-11 2015-10-29 Sound processing device, sound processing method, and program
EP15859486.1A EP3220659B1 (fr) 2014-11-11 2015-10-29 Sound processing device, sound processing method and program
US15/522,628 US10034088B2 (en) 2014-11-11 2015-10-29 Sound processing device and sound processing method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014-228896 2014-11-11
JP2014228896 2014-11-11

Publications (1)

Publication Number Publication Date
WO2016076123A1 true WO2016076123A1 (fr) 2016-05-19

Family

ID=55954215

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/080481 WO2016076123A1 (fr) 2014-11-11 2015-10-29 Sound processing device, sound processing method and program

Country Status (4)

Country Link
US (1) US10034088B2 (fr)
EP (1) EP3220659B1 (fr)
JP (1) JP6686895B2 (fr)
WO (1) WO2016076123A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019207912A1 * 2018-04-23 2019-10-31 ソニー株式会社 Information processing device and information processing method
JP2020018015A (ja) * 2017-07-31 2020-01-30 日本電信電話株式会社 Acoustic signal processing device, method and program

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2557219A (en) * 2016-11-30 2018-06-20 Nokia Technologies Oy Distributed audio capture and mixing controlling
US10699727B2 (en) 2018-07-03 2020-06-30 International Business Machines Corporation Signal adaptive noise filter
KR102327441B1 * 2019-09-20 2021-11-17 엘지전자 주식회사 Artificial intelligence device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001100800A * 1999-09-27 2001-04-13 Toshiba Corp Noise component suppression processing apparatus and noise component suppression processing method
JP2013120987A * 2011-12-06 2013-06-17 Sony Corp Signal processing apparatus and signal processing method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6577966B2 (en) * 2000-06-21 2003-06-10 Siemens Corporate Research, Inc. Optimal ratio estimator for multisensor systems
EP1184676B1 * 2000-09-02 2004-05-06 Nokia Corporation System and method for processing a signal emitted from a target signal source in a noisy environment
CA2354858A1 * 2001-08-08 2003-02-08 Dspfactory Ltd. Directional processing of sub-band audio signals using an oversampled filterbank
JP2010091912A (ja) 2008-10-10 2010-04-22 Equos Research Co Ltd Speech enhancement system
US8724829B2 (en) * 2008-10-24 2014-05-13 Qualcomm Incorporated Systems, methods, apparatus, and computer-readable media for coherence detection
EP2222091B1 (fr) * 2009-02-23 2013-04-24 Nuance Communications, Inc. Procédé pour déterminer un ensemble de coefficients de filtre pour un moyen de compensation d'écho acoustique
US9552840B2 (en) * 2010-10-25 2017-01-24 Qualcomm Incorporated Three-dimensional sound capturing and reproducing with multi-microphones
EP2642768B1 * 2010-12-21 2018-03-14 Nippon Telegraph And Telephone Corporation Sound enhancement method, device, program and recording medium
US9232310B2 (en) * 2012-10-15 2016-01-05 Nokia Technologies Oy Methods, apparatuses and computer program products for facilitating directional audio capture with multiple microphones
US8666090B1 (en) * 2013-02-26 2014-03-04 Full Code Audio LLC Microphone modeling system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHIGEKI TATSUTA ET AL.: "Blind Source Separation by the method of Orientation Histograms", TECHNICAL REPORT OF IEICE, June 2005 (2005-06-01), pages 1 - 6, XP009502892, Retrieved from the Internet <URL:http://ci.nii.ac.jp/naid/10016576608> [retrieved on 20160115] *

Also Published As

Publication number Publication date
US20170332172A1 (en) 2017-11-16
US10034088B2 (en) 2018-07-24
JP6686895B2 (ja) 2020-04-22
EP3220659A1 (fr) 2017-09-20
JPWO2016076123A1 (ja) 2017-08-17
EP3220659B1 (fr) 2021-06-23
EP3220659A4 (fr) 2018-05-30

Similar Documents

Publication Publication Date Title
JP5805365B2 (ja) Noise estimation device and method, and noise reduction device using the same
US10580428B2 (en) Audio noise estimation and filtering
JP5573517B2 (ja) Noise removal device and noise removal method
JP5762956B2 (ja) System and method for providing noise suppression utilizing null processing noise subtraction
JP6686895B2 (ja) Sound processing device, sound processing method, and program
US9042573B2 (en) Processing signals
US20130083943A1 (en) Processing Signals
US9747921B2 (en) Signal processing apparatus, method, and program
EP2752848B1 (fr) Method and apparatus for generating a noise-reduced audio signal using a microphone array
JP4448464B2 (ja) Noise reduction method, apparatus, program and recording medium
JP6241520B1 (ja) Sound collection device, program, and method
JP6638248B2 (ja) Speech determination device, method and program, and speech signal processing device
JP6854967B1 (ja) Noise suppression device, noise suppression method, and noise suppression program
JP6631127B2 (ja) Speech determination device, method and program, and speech processing device
JP6263890B2 (ja) Audio signal processing device and program
JP6295650B2 (ja) Audio signal processing device and program
JP6544182B2 (ja) Speech processing device, program and method
JP6903947B2 (ja) Non-target sound suppression device, method and program
JP6221463B2 (ja) Audio signal processing device and program
Takahashi et al. Structure selection algorithm for less musical-noise generation in integration systems of beamforming and spectral subtraction
JP2017067990A (ja) Speech processing device, program and method
JP2015025914A (ja) Audio signal processing device and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15859486

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016558971

Country of ref document: JP

Kind code of ref document: A

REEP Request for entry into the european phase

Ref document number: 2015859486

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2015859486

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 15522628

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE