WO2010109711A1

WO2010109711A1 - Audio processing device, audio processing method, and program

Info

Publication number: WO2010109711A1
Application number: PCT/JP2009/068239
Authority: WO
Inventors: 浩司藤村
Original assignee: 株式会社東芝
Priority date: 2009-03-26
Filing date: 2009-10-23
Publication date: 2010-09-30
Also published as: JP2010232862A

Abstract

Provided are an audio processing device, an audio processing method, and a program which suitably reduce the noise of sound signals input from microphones in a variety of environments. An audio processing device (100) is provided with a position pattern detection section (102) for detecting an index of the relative position of a sound source and a plurality of microphones, a processing determination section (103) for determining the audio processing to be applied to the sound signal input from each of the microphones on the basis of the index of the relative position, and a signal processing section (104) for executing the determined audio processing on the sound signals.

Description

Voice processing apparatus, voice processing method, and program

The present invention relates to a voice processing apparatus, and more particularly to a voice processing apparatus capable of obtaining a target sound with high SNR by classifying positions of a sound source and a microphone into N patterns and performing processing corresponding to each position pattern.

A known conventional technique collects voice with a plurality of microphones, called a microphone array, and applies signal processing to the collected signals in order to estimate the direction of a target sound source, suppress noise, and extract the signal from the target sound source at a high SNR.

For example, Non-Patent Document 1 describes the so-called delay-and-sum method, in which a target sound is received by a microphone array, the arrival time difference of the target sound at each microphone is corrected for each received signal, and the corrected signals are then added to obtain a signal in which the target sound is emphasized. The invention disclosed in Non-Patent Document 1 is based on the assumption that a signal in which a target sound and noise are mixed is input to every microphone.

Another known method uses two microphones, one for collecting noise and the other for collecting a target sound mixed with noise. Noise is reduced by subtracting the output of the noise-collecting microphone from the signal of the microphone collecting the target sound, so that the target sound is extracted more clearly.

As an example, Japanese Patent Application Laid-Open No. 2004-226656 (Patent Document 1) discloses a technique, such as a speaker distance detection device, that uses two microphones, calculates the distance between the lips and a reference microphone selected in advance from the difference between the signal levels of the reference microphone and the other microphone, and adjusts, according to that distance, the amount subtracted when the signal of the other microphone is subtracted from the signal of the reference microphone.

The invention disclosed in Patent Document 1 is based on the premise that one microphone receives a signal in which a target sound and noise are mixed, while the other microphone receives only noise, or only a relatively small amount of the target sound even if the target sound is mixed in.

Japanese Patent Application Publication No. 2004-226656

However, sound signals generated using a plurality of microphones may come from different environments: one in which the target sound and noise are mixed at every microphone, and one in which the target sound and noise enter one microphone while mainly noise enters the other microphones. If the same processing is performed in both environments, the target sound may not be processed properly. This is not taken into consideration in the inventions disclosed in Non-Patent Document 1 and Patent Document 1 above.

The present invention has been made in view of the above points to solve these problems, and it is an object of the present invention to suitably reduce noise of a sound signal input from a microphone in a plurality of environments.

In order to solve the problems described above and to achieve the object, an aspect of the present invention includes a position pattern detection unit that detects an index of the relative position between a sound source and a plurality of microphones, a processing determination unit that determines, based on the index of the relative position, the audio processing to be applied to the sound signal input from each of the plurality of microphones, and a signal processing unit that executes the determined audio processing on the sound signal.

According to the present invention, it is possible to suitably reduce noise of a sound signal input from a microphone in a plurality of environments.

FIG. 1 is a block diagram showing the speech processing apparatus according to the first embodiment.
FIG. 2 is a flowchart showing processing in the speech processing apparatus according to the first embodiment.
FIG. 3 is a diagram (part 1) showing a position pattern.
FIG. 4 is a diagram (part 2) showing a position pattern.
FIG. 5 is a diagram (part 3) showing a position pattern.
FIG. 6 is a diagram showing an example of the speech processing apparatus when classified as position pattern (2).
FIG. 7 is a diagram showing an example of the speech processing apparatus when classified as position pattern (1).
FIG. 8 is a block diagram showing the speech processing apparatus according to the second embodiment.
FIG. 9 is a flowchart showing the operation of the speech processing apparatus according to the second embodiment.
FIG. 10 is a block diagram showing the speech processing apparatus according to the third embodiment.
FIG. 11 is a flowchart showing the operation of the speech processing apparatus according to the third embodiment.
FIG. 12 is a diagram explaining an example in which an angle sensor is provided in a mobile phone.
FIG. 13 is an explanatory view showing the hardware configuration of the speech processing apparatus according to the present embodiment.

First Embodiment
FIG. 1 is a block diagram showing an audio processing apparatus according to the first embodiment. The speech processing apparatus 100 according to the first embodiment collates the input sound signal with the position pattern of the sound source held in advance, and executes speech processing corresponding to each position pattern. The voice processing apparatus 100 includes a sound input unit 101, a position pattern detection unit 102, a process determination unit 103, a signal processing unit 104, and a pattern database (hereinafter referred to as "pattern DB") 109.

The sound input unit 101 converts input sounds from a plurality of microphones into digitized sound signals, and detects the start and end of sound. The position pattern detection unit 102 detects an index of the position pattern of the sound source and the microphone from the sound signal. The process determining unit 103 determines the process to be performed on the sound signal by comparing the index of the position pattern with the position pattern held in advance. The signal processing unit 104 performs processing in accordance with the determination of the processing determination unit 103.

The pattern DB 109 holds information relating to position patterns of a plurality of microphones and sound sources. The position pattern represents the relative position (relative position) of the plurality of microphones and the sound source. In the pattern DB 109, indexes of patterns of sound signals input from a plurality of microphones are stored in association with each position pattern. The pattern stored in the pattern DB 109 is called by the process determination unit 103 and is compared with the index detected by the position pattern detection unit 102.

FIG. 2 is a flowchart showing processing in the speech processing apparatus 100 according to the first embodiment. Here, an example in which position patterns of a sound source and a microphone are acquired using two microphones will be described. Let two microphones be a microphone 1 and a microphone 2, respectively.

In step S101, the sound input unit 101 converts the sound input to the microphone from an analog signal to a digital signal using an AD converter.

In step S102, in order to detect the start and end of the speech to be subjected to noise processing, speech section detection is performed using, for example, the number of times of zero crossing. This voice section detection is performed using the microphone outputs of the microphone 1 and the microphone 2.

More specifically, the number of zero crossings is calculated for the sound signal acquired by the microphone 1 and for the sound signal acquired by the microphone 2, and if a voice section is detected at either microphone, the sound from the detection point onward is treated as speech.

Here, the start points detected by the microphone 1 and the microphone 2 are held as S1 and S2, respectively. The processing of FIG. 2 ends after the end of the voice is determined in the output of the microphone at which the voice start was detected latest. The section detection method is not limited to this, and various section detection methods can be applied. For example, a section detection method specific to a plurality of microphones may be applied.
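
As an illustration only, the voice section detection of step S102 could be sketched as follows in Python. The frame length, the crossing-count threshold, and the function names `zero_crossings` and `detect_start` are assumptions introduced for this sketch and do not appear in the description.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def detect_start(signal, frame_len=4, min_crossings=2):
    """Return the index of the first frame whose zero-crossing count
    reaches min_crossings, or None if no such frame exists.
    frame_len and min_crossings are illustrative values (assumptions)."""
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        if zero_crossings(signal[start:start + frame_len]) >= min_crossings:
            return start
    return None


# Hypothetical signals: microphone 2 picks up the same sound 4 samples later.
mic1 = [0.0] * 4 + [1.0, -1.0, 1.0, -1.0] * 2
mic2 = [0.0] * 8 + [1.0, -1.0, 1.0, -1.0] * 2
s1, s2 = detect_start(mic1), detect_start(mic2)  # start points S1 and S2
```

In this sketch the start points S1 and S2 are sample indices; a practical detector would also need an end-point rule, which is omitted here.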

In step S103, the position pattern detection unit 102 detects the index of the position pattern of the sound source and the microphone using the audio signal detected by the sound input unit 101. The index uses, for example, the time difference of voice arrival time between microphones and the signal level ratio.

More specifically, for example, in the case of using the microphone 1 as a reference, the sound source is closer to the microphone 1 as the difference in time of voice arrival to the microphone 2 becomes larger. The sound source is closer to the microphone 1 as the signal level on the microphone 1 side is higher than that of the microphone 2.

These two indices are calculated from the initial voice section of the target voice, that is, a section of fixed length immediately after the sound is detected. The voice arrival time difference between microphones is calculated using cross-correlation. The start point of the microphone at which the voice start was detected earliest is taken as time 0. Let x1 be the voice signal input to the microphone 1, let the correlation calculation section of x1 run from time ts to time te (S1 < ts < te), and let x1' be the waveform in that section normalized by power. Let x2 be the audio signal input to the microphone 2. The time T of the section [0, T] used to obtain the voice arrival time difference is set so that, counted from the start point of the microphone at which the voice start was detected latest, the section covers at least the interval from ts to te. For example, when S1 < S2, T is set according to the following equation (1).

With T set by equation (1), the voice arrival time difference td is expressed by equation (5) using the following equations (2) to (4).

The arrival time difference td is one of the position pattern determination indexes of the sound source and each microphone. In the case of two microphones, if the arrival time difference td from the microphone 1 to the microphone 2 is determined, the arrival time difference from the microphone 2 to the microphone 1 can be obtained by reversing the positive and negative signs.
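
The following Python sketch shows one common way to estimate an arrival time difference by normalized cross-correlation. It is not a reproduction of equations (1) to (5): the search range `max_lag`, the per-window normalization of x2, and the function names are assumptions. The sign convention (td > 0 when the sound reaches the microphone 1 first) follows the classification described for step S104.

```python
import math


def normalize(segment):
    """Scale a segment to unit energy (left unchanged if it is silent)."""
    energy = math.sqrt(sum(v * v for v in segment))
    return [v / energy for v in segment] if energy > 0 else list(segment)


def arrival_time_difference(x1, x2, ts, te, max_lag):
    """Return (td, r_max): the lag in samples at which x2 best matches the
    reference section x1[ts:te], and the correlation value at that lag.
    max_lag is an assumed search range, not a value from the description."""
    ref = normalize(x1[ts:te])
    best_lag, best_r = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        lo, hi = ts + lag, te + lag
        if lo < 0 or hi > len(x2):
            continue  # the shifted window would fall outside x2
        r = sum(a * b for a, b in zip(ref, normalize(x2[lo:hi])))
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r


# Hypothetical example: x2 is x1 delayed by 3 samples and attenuated by half.
x1 = [0.0] * 5 + [1.0, 2.0, -1.0, 0.5] + [0.0] * 7
x2 = [0.0] * 3 + [0.5 * v for v in x1[:13]]
td, r_max = arrival_time_difference(x1, x2, ts=5, te=9, max_lag=4)
```

Because both windows are normalized by power, the attenuation of x2 does not affect the lag estimate; only the waveform shape is compared.
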
The signal level ratio dd between the microphone 1 and the microphone 2 can be obtained by the following equation using td obtained earlier.

The signal level ratio dd in equation (6) is one of the position pattern determination indexes of the sound source and each microphone. The position pattern determination index is not limited to those described above, and various criteria can be applied. For example, the maximum value of the correlation calculated above can be used. If the maximum correlation value is higher than a certain reference, the sound source can be regarded as equidistant from the two microphones; if it is lower than the reference, a position pattern can be derived in which the sound source is near one microphone and far from the other. The maximum correlation value r_max is calculated by the following equation (7).
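
A signal level ratio of this kind could be computed as sketched below. The definition used here (mean power of the microphone 1 section divided by mean power of the td-shifted microphone 2 section) is an assumption made for illustration and is not taken from equation (6); the function name is likewise hypothetical.

```python
def signal_level_ratio(x1, x2, ts, te, td):
    """Assumed definition: mean power of x1[ts:te] divided by mean power
    of the td-shifted window x2[ts+td:te+td]. Values above 1 mean that
    microphone 1 received the louder signal. The shifted window is
    assumed to lie inside x2."""
    seg1 = x1[ts:te]
    seg2 = x2[ts + td:te + td]
    p1 = sum(v * v for v in seg1) / len(seg1)
    p2 = sum(v * v for v in seg2) / len(seg2)
    return p1 / p2 if p2 > 0 else float("inf")


# Hypothetical example: microphone 2 receives the sound 2 samples later
# at half the amplitude, so the power ratio is 4.
x1 = [0.0, 0.0, 2.0, -2.0, 2.0, -2.0, 0.0, 0.0, 0.0, 0.0]
x2 = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 1.0, -1.0, 0.0, 0.0]
dd = signal_level_ratio(x1, x2, ts=2, te=6, td=2)
```

Shifting the microphone 2 window by td compares the same portion of the sound at both microphones rather than the same instant in time.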

In step S104, the processing determination unit 103 uses the position pattern determination indexes calculated by the position pattern detection unit 102 to check which of the following three position patterns (1) to (3) applies. FIGS. 3 to 5 show the three position patterns.

(1) The sound source approaches the microphone 1 (FIG. 3).
(2) The sound source approaches the microphone 2 (FIG. 4).
(3) The sound source is not close to either of the microphones (FIG. 5).

Let the arrival time difference determination threshold t_thre and the signal level difference determination thresholds dd_thre1 and dd_thre2 be constants (where t_thre > 0 and dd_thre1 > dd_thre2 > 0). When td > 0 and the following equations (8) and (9) hold, the position pattern is classified into (1).

Further, when td <= 0, when the following equations (10) and (11) hold, the position pattern is classified into (2).

If it is not classified into either of (1) and (2), the position pattern is classified into (3).
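
The three-way classification can be sketched as below. The specific comparisons are assumptions standing in for equations (8) to (11): pattern (1) is assumed to require td above t_thre and dd above dd_thre1, and pattern (2) is assumed to require the magnitude of td above t_thre and dd below dd_thre2.

```python
def classify_position(td, dd, t_thre, dd_thre1, dd_thre2):
    """Return 1, 2 or 3 for position patterns (1), (2) and (3).
    The inequalities are assumed forms, not the patent's equations."""
    if td > 0 and td > t_thre and dd > dd_thre1:
        return 1  # sound source close to microphone 1
    if td <= 0 and -td > t_thre and dd < dd_thre2:
        return 2  # sound source close to microphone 2
    return 3      # sound source not close to either microphone


# Hypothetical thresholds satisfying t_thre > 0 and dd_thre1 > dd_thre2 > 0.
pattern = classify_position(td=5, dd=4.0, t_thre=2, dd_thre1=2.0, dd_thre2=0.5)
```

Requiring both indexes to agree before choosing (1) or (2) makes pattern (3) the fallback whenever the evidence is ambiguous.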

In step S105, the signal processing unit 104 performs predetermined processing in accordance with the classified position pattern. FIGS. 6 and 7 are diagrams showing the operation when signal processing is switched in the signal processing unit 104. FIG. 6 shows an example of the speech processing apparatus when the position pattern is classified as (2), and FIG. 7 shows an example when it is classified as (1).

Hereinafter, switching of signal processing will be described.

When the position pattern is classified into (1), the voice input to the microphone 1 is set as the target voice, and the sound input to the microphone 2 is processed as noise. Specifically, assuming that α is a constant (0 ≦ α), the output speech o of the speech processing apparatus 100 can be expressed by the following equation (12) using the delay time td calculated earlier.

At this time, the signal may be converted to the frequency domain to perform spectral subtraction. Alternatively, it is also possible to remove noise components from x1 using x2 as a reference signal using an adaptive filter that is often used in an echo canceller or the like.

When the position pattern is classified into (2), the voice input to the microphone 2 is set as the target voice, and the sound input to the microphone 1 is processed as noise. The specific process is the same as when the microphone 1 and the microphone 2 of the process in the case of position pattern (1) are interchanged. At this time, the output speech o is expressed by the following equation (13).

As described above, the processing when the sound source is classified as a position pattern in which the sound source approaches a specific microphone can be considered in other ways. For example, α may be a function of the maximum correlation value to adjust the subtraction amount. At this time, the value of α can be controlled by the following equation (14) by a linear function with a and b as constants.

By expressing α as in equation (14), the amount of subtraction can be reduced when the maximum correlation value is high, and the amount of subtraction can be increased when the maximum correlation value is low.
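
A linear dependence of α on the maximum correlation value, with the behavior described above, could look like the following sketch. The constants a = -1 and b = 1 and the clipping at zero (to keep α ≧ 0) are assumptions chosen for illustration.

```python
def subtraction_weight(r_max, a=-1.0, b=1.0):
    """Linear function of the maximum correlation value r_max.
    A negative slope a (assumed) lowers the subtraction amount when
    r_max is high; the result is clipped so that alpha stays >= 0."""
    return max(0.0, a * r_max + b)


alpha_high_corr = subtraction_weight(1.0)   # little or no subtraction
alpha_low_corr = subtraction_weight(0.25)   # stronger subtraction
```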

When the position pattern is classified into (3), the delay-and-sum array process is performed using both the voices input to the microphone 1 and the microphone 2. When a delay and sum array is used, the output speech o is expressed by the following equation (15).
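
The switching among the three kinds of processing can be illustrated in the time domain as follows. This is a sketch under assumptions rather than the exact form of equations (12), (13) and (15): alignment is done by an integer sample shift with zero padding, and the delay-and-sum output is taken as the average of the aligned signals.

```python
def shift(x, delay):
    """Delay x by `delay` samples (advance if negative), zero-padded,
    keeping the original length."""
    n = len(x)
    return [x[i - delay] if 0 <= i - delay < n else 0.0 for i in range(n)]


def process(x1, x2, td, pattern, alpha=1.0):
    """Apply the processing selected by the position pattern (assumed forms)."""
    if pattern == 1:
        # Target at microphone 1: align x2 to x1 and subtract it as noise.
        noise = shift(x2, -td)
        return [a - alpha * b for a, b in zip(x1, noise)]
    if pattern == 2:
        # Target at microphone 2: align x1 to x2 and subtract it as noise.
        noise = shift(x1, td)
        return [b - alpha * a for a, b in zip(noise, x2)]
    # Pattern 3: delay-and-sum, here the average of the aligned signals.
    aligned = shift(x2, -td)
    return [(a + b) / 2 for a, b in zip(x1, aligned)]


# Hypothetical example: x2 is x1 delayed by 2 samples.
x1 = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
x2 = shift(x1, 2)
enhanced = process(x1, x2, td=2, pattern=3)
```

With identical aligned signals, the delay-and-sum branch returns the original waveform, while uncorrelated noise at the two microphones would be attenuated by the averaging.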

Note that the array processing is not limited to the above method. For example, by applying Griffiths-Jim type array processing, the two microphones can form a blind spot (null) toward noise arriving from a certain angle, so that the target voice o can be extracted with a high SNR.
Returning to FIG. 2, in step S106, the sound input unit 101 detects the end of the voice, and the audio processing ends.

Although the embodiment has been described taking two microphones as an example, using exactly two microphones is not essential to practicing the present invention, and the invention can also be extended to three or more microphones. In the case of three microphones, denoted microphone 1, microphone 2 and microphone 3, the following seven position patterns are prepared.

(1') The sound source is close to the microphone 1.
(2') The sound source is close to the microphone 2.
(3') The sound source is close to the microphone 3.
(4') The sound source is close to the microphones 1 and 2.
(5') The sound source is close to the microphones 2 and 3.
(6') The sound source is close to the microphones 1 and 3.
(7') The sound source is not close to any of the microphones.

For the input sound signal, the position pattern is determined using the arrival time difference and signal level difference described above. More specifically, the arrival time difference from the microphone 1 to the microphone 2 is denoted td_12 and is calculated by equations (2) to (5). The arrival time differences between the other microphones are calculated in the same way as td_13, td_21, td_23, td_31 and td_32. Further, the signal level difference between the microphone 1 and the microphone 2 is denoted dd_12 and is calculated by equation (6). The signal level differences between the other microphones are calculated in the same way as dd_13, dd_21, dd_23, dd_31 and dd_32.

At this time, the arrival time difference between the microphone n1 closest to the sound source and the other two microphones becomes a positive value. The arrival time difference of the second microphone n2 closest to the sound source is positive with respect to the remaining one microphone, and is negative with respect to the microphone n1. The microphone n3 farthest from the sound source has a negative arrival time difference with the other two microphones. Therefore, based on this characteristic, it is first determined which microphone will be the microphone n1, the microphone n2 and the microphone n3.
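
The ordering rule described above (n1 has two positive arrival time differences, n2 has one, n3 has none) can be sketched as follows. The dictionary layout `td[(i, j)]` and the function name are assumptions made for this illustration, and ties caused by zero differences are not handled specially.

```python
def order_microphones(td):
    """td[(i, j)] is the arrival time difference from microphone i to
    microphone j, positive when the sound reaches microphone i first.
    Returns (n1, n2, n3), ordered from nearest to farthest."""
    mics = sorted({i for i, _ in td})
    positives = {
        m: sum(1 for other in mics if other != m and td[(m, other)] > 0)
        for m in mics
    }
    return tuple(sorted(mics, key=lambda m: -positives[m]))


# Hypothetical arrival time differences in samples: microphone 1 is nearest.
td = {(1, 2): 3, (1, 3): 5, (2, 3): 2,
      (2, 1): -3, (3, 1): -5, (3, 2): -2}
n1, n2, n3 = order_microphones(td)
```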

Let td_n1n2 be the arrival time difference between the microphone n1 and the microphone n2, td_thre1 the arrival time difference threshold, dd_n1n2 the signal level difference between the microphone n1 and the microphone n2, and dd_thre1 the signal level difference threshold. When the following equations (16) and (17) are satisfied, the position pattern is classified as (1') if the microphone 1 is the microphone n1, as (2') if the microphone 2 is the microphone n1, and as (3') if the microphone 3 is the microphone n1.

Next, let td_n1n2 be the arrival time difference between the microphone n1 and the microphone n2, td_thre1 its threshold, td_n2n3 the arrival time difference between the microphone n2 and the microphone n3, and td_thre2 its threshold. Likewise, let dd_n1n2 be the signal level difference between the microphone n1 and the microphone n2, dd_thre1 its threshold, dd_n2n3 the signal level difference between the microphone n2 and the microphone n3, and dd_thre2 its threshold. When the following equations (18) to (21) are satisfied, the position pattern is classified as (4') if the microphone 3 is the microphone n3, as (5') if the microphone 1 is the microphone n3, and as (6') if the microphone 2 is the microphone n3.

Also, if it is not classified into any position pattern from (1 ') to (6'), the distance of the sound source to all the microphones is considered to be far, and is classified into the position pattern (7 ').

After being classified in this way, processing is switched according to each pattern. More specifically, in the case of (1 ′), (2 ′), (3 ′), the processing according to equation (22) is performed to subtract the noise from the target sound of the microphone close to the sound source.

Here, α1 and α2 are constants, and α1 ≧ 0 and α2 ≧ 0.

Moreover, in the case of (4 ′), (5 ′), (6 ′), the processing by the equation (23) is performed. As a result, the two microphones near the sound source are voice-emphasized in the delay sum array, and the output of the microphone farthest from the sound source is used for noise subtraction.

Further, in the case of (7 ′), the process is performed according to equation (24). As a result, speech is emphasized in the delay and sum array using all the microphones.

Thus, it can be easily expanded to three microphones.

Also, three or more microphones may be used to estimate the sound source position in a three-dimensional space. When the sound source position can be estimated, the distance from each microphone to the sound source can be calculated. Let the distances between the microphone and the sound source obtained by this processing be ld ₁ , ld ₂ and ld ₃ respectively.

At this time, when the distance threshold ld_thre is a constant and the following expression (25) is satisfied, the position pattern is classified into (1′).

Similarly, classification into position patterns (2 ′) to (7 ′) can also be realized.

Second Embodiment
FIG. 8 is a block diagram showing an audio processing apparatus according to the second embodiment. The speech processing apparatus 100a according to the second embodiment selects and performs processing corresponding to each position of the sound source acquired by the position sensor on the sound signal. The voice processing device 100a includes a sound input unit 101, a position pattern detection unit 102a, a process determination unit 103a, a signal processing unit 104, and a pattern DB 109a.

The sound input unit 101 detects the start and end of voice from the input sound. The position pattern detection unit 102a detects an index of the position pattern of the sound source and the microphone based on the signal from the position sensor. The process determination unit 103a determines the process to be performed by collating the index of the position pattern with the position pattern held in advance. The signal processing unit 104 performs processing in accordance with the determination of the processing determination unit 103a.

The pattern DB 109a holds position patterns of the sound source and the microphone. In the pattern DB 109a, the index of the signal input from the position sensor is associated with each position pattern of the relative position between the sound source and the microphone. The position pattern stored in the pattern DB 109a is read from the position pattern detection unit 102a, and is collated with the input from the position sensor.

FIG. 9 is a flowchart showing the operation of the speech processing apparatus according to the second embodiment. The description uses an example in which two microphones (microphone 1 and microphone 2) are used to process a target voice. Using exactly two microphones is not essential, and implementation is possible with any number of microphones of two or more. Nor is it essential that the target sound be voice. The operation of the present embodiment is the same as that of the first embodiment except for the operations of the position sensor, the position pattern detection unit 102a, and the processing determination unit 103a, and description of the operations that are the same as in the first embodiment is omitted.

In step S203, the measurement results output from position sensors attached near each microphone are used as the position pattern determination index. Specifically, each position sensor is an infrared sensor or the like that can measure the distance from the microphone to the target object corresponding to the sound source, and the distance from each microphone to the sound source is measured. Two microphones are used, and the distances from the microphone 1 and the microphone 2 to the sound source are ld₁ and ld₂, respectively.

In step S204, the process determining unit 103a uses the position pattern determination index calculated by the position pattern detecting unit 102a to classify to which of the three position patterns it belongs. Three patterns are shown below.
(1A) The sound source is approaching the microphone 1.
(2A) The sound source approaches the microphone 2.
(3A) The sound source is not close to either of the microphones.

At this time, assuming that the distance threshold ld_thre is a constant, the position pattern is classified into (1A) if the following equation (26) holds, and into (2A) if the following equation (27) holds.

If none of the above, it is classified into the position pattern (3A). The processing in step S205 and step S206 after position pattern classification is the same as step S105 and step S106 in FIG. 2, and thus the description thereof is omitted here.
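
A distance-based classification of this kind might be sketched as follows. The conditions are assumptions standing in for equations (26) and (27): a microphone is treated as close to the sound source when its measured distance is below ld_thre and not larger than the distance measured at the other microphone.

```python
def classify_by_distance(ld_1, ld_2, ld_thre):
    """Return "1A", "2A" or "3A" from the sensor distances ld_1 and ld_2.
    The comparisons are assumed forms, not the patent's equations."""
    if ld_1 < ld_thre and ld_1 <= ld_2:
        return "1A"  # sound source close to microphone 1
    if ld_2 < ld_thre and ld_2 < ld_1:
        return "2A"  # sound source close to microphone 2
    return "3A"      # sound source not close to either microphone


# Hypothetical distances in meters with a 0.10 m threshold.
pattern = classify_by_distance(ld_1=0.05, ld_2=0.30, ld_thre=0.10)
```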

Third Embodiment
FIG. 10 is a block diagram showing an audio processing apparatus according to the third embodiment. The voice processing apparatus 100b according to the third embodiment detects a position pattern of a sound source based on an input from a position sensor and a sound signal, and executes voice processing corresponding to each position pattern.

The voice processing device 100b includes a sound input unit 101, a position pattern detection unit 102b, a process determination unit 103b, a signal processing unit 104, and a pattern DB 109b.
The sound input unit 101 converts the input sound from the microphone into a digitized sound signal, and detects the start and end of sound. The position pattern detection unit 102b detects an index of the position pattern of the sound source and the microphone from the input from the position sensor and the voice. The process determining unit 103b determines the process to be performed by collating the index of the position pattern with the position pattern held in advance. The signal processing unit 104 performs processing in accordance with the determination of the processing determination unit 103b.

The pattern DB 109b holds position patterns of the microphone and the sound source. In the pattern DB 109b, combinations of the index of the signal input from the position sensor and the index of the sound signal are associated with each position pattern of the relative position between the microphone and the sound source. The pattern stored in the pattern DB 109b is called from the position pattern detection unit 102b, and is collated with the sound signal acquired by the sound input unit 101 and the input from the position sensor.

FIG. 11 is a flowchart showing the operation of the speech processing apparatus according to the third embodiment. Here, an example in which two microphones (microphone 1 and microphone 2) are used to process a target voice will be described. Using exactly two microphones is not essential, and any number of microphones of two or more is sufficient. Nor is it essential that the target sound be voice. The operation of the present embodiment is the same as that of the second embodiment except for the operations of the position pattern detection unit 102b and the process determination unit 103b, and description of the operations that are the same is omitted.

The voice processing device 100b may use, for example, a distance sensor as a position sensor. In step S303, the position pattern detection unit 102b acquires the measurement result by the distance sensor and the voice information as a position pattern determination index.

More specifically, an infrared sensor or the like is used as a position sensor to measure the distance from the device to the sound source. Further, two microphones for acquiring a sound signal are used, and a distance from the sound processing apparatus 100b acquired using a sensor to a sound source is ld. Further, as in the first embodiment, the voice arrival time difference td and the signal level ratio dd are also determined.

In step S304, the process determining unit 103b uses the position pattern determination index calculated by the position pattern detecting unit 102b to classify to which of the three position patterns it belongs. Three patterns are shown below.
(1B) The sound source is approaching the microphone 1.
(2B) The sound source is approaching the microphone 2.
(3B) The sound source is not close to either of the microphones.

The arrival time difference determination threshold t_thre, the signal level difference determination thresholds dd_thre1 and dd_thre2, and the distance determination threshold ld_thre are constants (where t_thre > 0, dd_thre1 > dd_thre2 > 0, and ld_thre > 0). When td > 0, the position pattern is classified into (1B) if all of the following expressions (28) hold.

Further, in the case of td <= 0, the position pattern is classified into (2B) when the following equation (29) is all satisfied.

Also, position patterns that are neither (1B) nor (2B) are classified as (3B). Processing similar to (1), (2) and (3) of the first embodiment is performed for each of the three position patterns.

The output from an angle sensor can also be used as a position pattern determination index. FIG. 12 is a diagram for explaining an example in which an angle sensor is provided in a mobile phone. In the example of FIG. 12, the mobile phone is held sideways during operation and vertically during a call. In such a device, the angle is detected using an angle sensor attached to the device body. An example of the detected angle θ is shown in the figure. The angle θ is defined, for example, as 0 degrees when the line connecting the two microphones is parallel to the ground. Also, as in the first embodiment, the voice arrival time difference td and the signal level ratio dd are determined.

In the example of FIG. 12, the position patterns are classified into (1B), (2B), and (3B) using the following equations (30) and (31), where the arrival time difference determination threshold t_thre, the signal level difference determination thresholds dd_thre1 and dd_thre2, and the angle determination threshold θ_thre are constants (t_thre > 0, dd_thre1 > dd_thre2 > 0, θ_thre ≧ 0).

When td> 0, the position pattern is classified into (1B) when the following equation (30) holds.

In the case of td <= 0, the position pattern is classified into (2B) when the following equation (31) holds.

If the position pattern is neither (1B) nor (2B), it is classified as (3B).

(Implementation by computer etc. Minimum configuration)
Next, the hardware configuration of the speech processing apparatus according to the present embodiment will be described with reference to FIG. FIG. 13 is an explanatory view showing a hardware configuration of the speech processing apparatus according to the present embodiment.

The voice processing apparatus according to the present embodiment includes a control device such as a central processing unit (CPU) 51, storage devices such as a read only memory (ROM) 52 and a random access memory (RAM) 53, a communication I/F 54 that connects to a network to perform communication, and a bus 61 that connects the units.

The program executed by the voice processing apparatus according to the present embodiment is provided by being incorporated in advance in the ROM 52 or the like.

The program executed by the voice processing apparatus according to the present embodiment may be recorded as a file in an installable format or an executable format on a computer-readable recording medium such as a CD-ROM (Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (Compact Disk Recordable), or a DVD (Digital Versatile Disk), and provided in that form.

Furthermore, the program executed by the voice processing apparatus according to the present embodiment may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. Further, the program executed by the voice processing apparatus according to the present embodiment may be provided or distributed via a network such as the Internet.

The program executed by the voice processing apparatus according to the present embodiment has a module configuration including the above-described units. In terms of actual hardware, the CPU 51 reads the program from the ROM 52 and executes it, whereby each of the units is loaded onto the main storage device and generated there.
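
As a rough illustration of such a module configuration, the following Python sketch chains a position pattern detection unit, a processing determination unit, and a signal processing unit for two microphones. Everything specific in it is an assumption made for illustration and is not taken from the embodiment: td is estimated by a brute-force cross-correlation peak, dd is taken as a ratio of RMS levels, the single threshold dd_thre and the weights 0.75 and 0.25 are arbitrary, and all function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

Signal = List[float]


@dataclass
class PositionIndex:
    td: int    # arrival time difference in samples (assumed > 0 when mic 2 lags mic 1)
    dd: float  # level ratio: RMS of mic 1 over RMS of mic 2 (assumed definition)


def rms(x: Signal) -> float:
    return math.sqrt(sum(v * v for v in x) / len(x))


def detect_position_index(x1: Signal, x2: Signal, max_lag: int) -> PositionIndex:
    """Position pattern detection unit (sketch): estimate td and dd."""
    def corr(lag: int) -> float:
        # Cross-correlation of x1 with x2 shifted by `lag` samples.
        return sum(x1[n] * x2[n + lag]
                   for n in range(len(x1)) if 0 <= n + lag < len(x2))
    td = max(range(-max_lag, max_lag + 1), key=corr)
    dd = rms(x1) / max(rms(x2), 1e-12)  # guard against division by zero
    return PositionIndex(td, dd)


def delay_and_sum(x1: Signal, x2: Signal, td: int) -> Signal:
    """Advance x2 by td samples to align it with x1, then average."""
    out = []
    for n in range(len(x1)):
        m = n + td
        s2 = x2[m] if 0 <= m < len(x2) else 0.0
        out.append(0.5 * (x1[n] + s2))
    return out


def weighted_sum(x1: Signal, x2: Signal, w1: float, w2: float) -> Signal:
    return [w1 * a + w2 * b for a, b in zip(x1, x2)]


def determine_processing(index: PositionIndex,
                         dd_thre: float) -> Callable[[Signal, Signal], Signal]:
    """Processing determination unit (sketch): choose the processing."""
    if index.dd >= dd_thre:          # source judged much closer to mic 1
        return lambda x1, x2: weighted_sum(x1, x2, 0.75, 0.25)
    if index.dd <= 1.0 / dd_thre:    # source judged much closer to mic 2
        return lambda x1, x2: weighted_sum(x1, x2, 0.25, 0.75)
    return lambda x1, x2: delay_and_sum(x1, x2, index.td)


def process(x1: Signal, x2: Signal,
            max_lag: int = 4, dd_thre: float = 2.0) -> Signal:
    """Signal processing unit (sketch): run the determined processing."""
    index = detect_position_index(x1, x2, max_lag)
    return determine_processing(index, dd_thre)(x1, x2)
```

For example, if the second input is the first delayed by two samples at the same level, the sketch selects delay-and-sum and returns the aligned average; if the first input is four times louder than the second, it selects the weighted sum that favors the first microphone.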

The present invention is not limited to the above embodiment as it is, and at the implementation stage, the constituent elements can be modified and embodied without departing from the scope of the invention. In addition, various inventions can be formed by appropriate combinations of a plurality of constituent elements disclosed in the above embodiment. For example, some components may be deleted from all the components shown in the embodiment. Furthermore, components in different embodiments may be combined as appropriate.

As described above, the voice processing apparatus according to the embodiment of the present invention is useful for noise removal, and is particularly suitable for processing a sound signal input from a microphone array.

1, 2, 3 microphones
100, 100a, 100b audio processing apparatus
101 sound input unit
102, 102a, 102b position pattern detection unit
103, 103a, 103b processing determination unit
104 signal processing unit

Claims

1. A voice processing device comprising:
a position pattern detection unit that detects an index of the relative position between a sound source and a plurality of microphones;
a processing determination unit that determines audio processing to be applied to a sound signal input from each of the plurality of microphones, based on the index of the relative position; and
a signal processing unit that executes the determined audio processing on the sound signal.

2. The voice processing device according to claim 1, wherein the index of the relative position includes an arrival time difference of the sound signals input from the plurality of microphones and a level difference of the sound signals.

3. The voice processing device according to claim 1, wherein the index of the relative position includes a distance measured by a distance sensor provided at a predetermined position with respect to each of the plurality of microphones.

4. The voice processing device according to claim 1, wherein the index of the relative position includes an inclination of the microphones measured by an angle sensor provided at a predetermined position with respect to each of the plurality of microphones.

5. The voice processing device according to claim 1, wherein the processing determination unit determines to perform audio processing that gives a greater weight to a sound signal input from a microphone whose distance to the sound source is smaller than a predetermined value than to a sound signal input from a microphone whose distance to the sound source is equal to or greater than the predetermined value.

6. The voice processing device according to claim 1, wherein the processing determination unit determines to perform audio processing that obtains a delay sum of the sound signals input from the plurality of microphones when the distance between the plurality of microphones and the sound source is equal to or greater than a predetermined value.

7. A program for causing a computer to function as:
a position pattern detection unit that detects an index of the relative position between a sound source and a plurality of microphones;
a processing determination unit that determines audio processing to be applied to a sound signal input from each of the plurality of microphones, based on the index of the relative position; and
a signal processing unit that executes the determined audio processing on the sound signal.

8. An audio processing method comprising:
a position pattern detection step in which a position pattern detection unit detects an index of the relative position between a sound source and a plurality of microphones;
a processing determination step of determining audio processing to be applied to a sound signal input from each of the plurality of microphones, based on the index of the relative position; and
a signal processing step in which a signal processing unit executes the determined audio processing on the sound signal.