WO2007122729A1

WO2007122729A1 - Communication system, communication apparatus and sound source direction determining apparatus

Info

Publication number: WO2007122729A1
Application number: PCT/JP2006/308487
Authority: WO
Inventors: Akira Date
Original assignee: Hitachi, Ltd.
Priority date: 2006-04-18
Filing date: 2006-04-18
Publication date: 2007-11-01
Also published as: JPWO2007122729A1

Abstract

A communication system comprises a first communication apparatus having a processor, a memory and an interface, and a second communication apparatus connected to the first communication apparatus and also having a processor, a memory and an interface. The first communication apparatus, which has a directional microphone for acquiring sounds from a particular direction, determines the position of a sound source and controls the directivity of the directional microphone so as to acquire the sounds from the determined position of the sound source. The first communication apparatus then notifies the second communication apparatus of both the sounds acquired by the directional microphone and the determined position of the sound source. The second communication apparatus then localizes a sound image based on the notified position of the sound source and outputs the notified sounds.

Description

Specification

Communication system, communication device, and sound source direction specifying device

Technical field

The present invention relates to a communication system including a first communication device and a second communication device, and more particularly to a technique for collecting voices of participants.

Background art

In recent years, broadband Internet access has spread, and applications that use broadband data communication are becoming widespread. For example, in the communication field, video conferencing systems using images and sound are becoming popular. To hold a smooth conference over a video conference system, it is important to convey the nuances of the participants' speech. It is therefore important for the video conference system to convey a sense of realism.

A television video apparatus is disclosed in Japanese Patent Application Laid-Open No. 2000-312336. The television video apparatus includes input means for inputting a recording, reproducing means for reproducing the input recording, photographing means for photographing a person's face, image recording means for storing face image data obtained by digitally analyzing the face image of a person in the reproducing means, image changing means for instructing an image change, image processing means for changing the face image of the person in the reproducing means to the face image obtained by the photographing means, display means for displaying the edited recording with the face changed, and audio means for outputting the sound.

Further, a method for authenticating motion in moving images is disclosed in Japanese Patent Application Laid-Open No. 2005-092346. In this authentication method, feature data is extracted from three-dimensional data of a moving image using a three-dimensional higher-order local autocorrelation feature extraction method. Next, the extracted feature data is converted by a statistical method such as multivariate analysis to generate new feature data. Then, the motion is authenticated by comparing the generated feature data with registered data.

Disclosure of the invention

In conventional video conferencing systems, conference participants use fixed microphones. However, the participants move. Conventional video conference systems therefore had the problem that the participants' voices were cut off while being picked up.

One possible remedy is for participants to use wireless microphones. However, the radio conditions of a wireless microphone are invisible and unstable. Therefore, even if participants use wireless microphones, the video conference system cannot always transmit voice stably.

In addition, the conventional video conference system plays back the received image and sound in a fixed manner. Therefore, in the conventional video conference system, participants feel uncomfortable because of the mismatch between image and sound.

The present invention has been made in view of the above-described problems, and an object thereof is to provide a realistic communication system.

A representative embodiment of the present invention is a communication system including: a first communication device including a processor, a memory, and an interface; and a second communication device including a processor, a memory, and an interface, and connected to the first communication device. The first communication device includes a directional microphone that acquires sound from a specific direction, specifies the position of a sound source, controls the directivity of the directional microphone so that sound from the specified position of the sound source is acquired, and notifies the second communication device of the sound acquired by the directional microphone and the specified position of the sound source. The second communication device localizes a sound image based on the notified position of the sound source and outputs the notified sound.

According to the representative embodiment of the present invention, a realistic communication environment can be provided.

Brief Description of Drawings

FIG. 1 is a configuration diagram of a communication system according to a first embodiment of this invention.

FIG. 2 is a block diagram of a transmission/reception unit provided in the communication system according to the first embodiment of this invention.

FIG. 3 is a flowchart of the sound source direction estimation process according to the first embodiment of the present invention.

FIG. 4 is an explanatory diagram of the direction of the sound source according to the first embodiment of the present invention.

FIG. 5 is a flowchart of the processing of the noise reduction unit according to the first embodiment of the present invention.

FIG. 6 is an explanatory diagram of packet transmission processing of the communication processing unit according to the first embodiment of this invention.

FIG. 7 is a flowchart of processing of the localization processing unit according to the first embodiment of the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

(First embodiment)

FIG. 1 illustrates the communication system, which is installed in an auditorium 1 and a conference room 2. The auditorium 1 and the conference room 2 are located at a distance from each other.

The auditorium 1 includes a podium 4 and audience seats. At the podium 4, speakers 5A and 5B speak while moving around. In the audience seats, a viewer 9A watches the lecture. The viewer 9A can also view the conference room 2. Meanwhile, in the conference room 2, a viewer 9B watches the lecture.

In the auditorium 1, a microphone array 6A, directional microphones 7A and 7B, a camera 8A, a display 10A, loudspeakers 12A and 12B, and a transmission/reception unit 11A are installed. In the conference room 2, a microphone array 6B, a display 10B, loudspeakers 12C and 12D, and a transmission/reception unit 11B are installed.

The microphone arrays 6A and 6B each include a plurality of microphones. The microphones included in the microphone array 6A are arranged at a microphone interval d in the horizontal direction (x-axis direction) of FIG. 1. The microphones included in the microphone array 6B are arranged in the vertical direction (y-axis direction) of FIG. 1.

The directional microphones 7A and 7B acquire sound emitted from a specific direction. The transmission/reception unit 11A controls the directions of the directional microphones 7A and 7B with reference to the sound acquired by the microphone array 6A and the video captured by the camera 8A. For example, the transmission/reception unit 11A points the directional microphone 7A toward the speaker 5A or the speaker 5B. Similarly, the transmission/reception unit 11A points the directional microphone 7B toward the speaker 5A or the speaker 5B. As a result, the directional microphones 7A and 7B can acquire the voices of the speakers 5A and 5B.

The cameras 8A and 8B capture video. Specifically, the camera 8A photographs the area around the podium 4 in the auditorium 1. The transmission/reception unit 11A controls the direction of the camera 8A with reference to the sound acquired by the microphone array 6A and the video captured by the camera 8A. For example, the transmission/reception unit 11A points the camera 8A toward the speaker 5A or the speaker 5B. Accordingly, the camera 8A can photograph at least one of the speaker 5A and the speaker 5B.

On the other hand, the camera 8B photographs the conference room 2. The transmission/reception unit 11B controls the direction of the camera 8B with reference to the sound acquired by the microphone array 6B and the video captured by the camera 8B. For example, the transmission/reception unit 11B points the camera 8B toward the viewer 9B. As a result, the camera 8B can photograph the viewer 9B.

The display 10A displays the video captured by the camera 8B installed in the conference room 2. As a result, the viewer 9A in the auditorium 1 can see the video of the conference room 2. On the other hand, the display 10B displays the video captured by the camera 8A installed in the auditorium 1. As a result, the viewer 9B in the conference room 2 can see the video of the auditorium 1.

The loudspeakers 12A and 12B output the sound acquired by the microphone array 6B installed in the conference room 2. As a result, the viewer 9A in the auditorium 1 can listen to the audio of the conference room 2. At this time, the transmission/reception unit 11A performs sound image localization processing by controlling the output timing of the sound from the loudspeakers 12A and 12B. To obtain the effect of sound image localization, the loudspeaker 12A and the loudspeaker 12B are preferably placed equidistant from the center line of the display 10A.

On the other hand, the loudspeakers 12C and 12D output the sound acquired by at least one of the microphone array 6A and the directional microphones 7A and 7B installed in the auditorium 1. As a result, the viewer 9B in the conference room 2 can listen to the audio of the auditorium 1. At this time, the transmission/reception unit 11B performs sound image localization processing by controlling the output timing of the sound from the loudspeakers 12C and 12D. To obtain the effect of sound image localization, the loudspeaker 12C and the loudspeaker 12D are preferably placed equidistant from the center line of the display 10B.

The transmission/reception unit 11A and the transmission/reception unit 11B are connected by a communication line 3. The communication line 3 may be of any type as long as it can transmit digital signals in real time. For example, the communication line 3 is a local area network (LAN), a dedicated line, a public line, or a wireless communication line.

The transmission/reception units 11A and 11B transmit and receive the audio acquired by the microphone arrays 6A and 6B and the directional microphones 7A and 7B. The transmission/reception units 11A and 11B also transmit and receive the video captured by the cameras 8A and 8B. In addition, the transmission/reception units 11A and 11B control the directions of the directional microphones 7A and 7B. Furthermore, the transmission/reception units 11A and 11B perform sound image localization processing. Details of the transmission/reception units 11A and 11B will be described with reference to FIG. 2.

Note that the communication system according to the first embodiment may be installed in any location. For example, the communication system may be installed in two auditoriums or in two conference rooms.

FIG. 2 is a block diagram of the transmission/reception unit 11 included in the communication system according to the first embodiment of this invention.

The transmission/reception unit 11 includes a noise reduction unit 20, a sound source direction estimation unit 21, an audio processing unit 22, a speaker position specifying unit 23, a video processing unit 24, a communication processing unit 25, a person extraction unit 26, an in-screen position specifying unit 27, a position control unit 28, and a localization processing unit 29.

The transmission/reception unit 11 includes a processor, a memory, and an interface. The processor included in the transmission/reception unit 11 executes a program stored in the memory, thereby realizing the noise reduction unit 20, the sound source direction estimation unit 21, the audio processing unit 22, the speaker position specifying unit 23, the video processing unit 24, the communication processing unit 25, the person extraction unit 26, the in-screen position specifying unit 27, the position control unit 28, and the localization processing unit 29.

The memory provided in the transmission/reception unit 11 stores the program executed by the processor and information needed by the processor. The interface provided in the transmission/reception unit 11 is connected to the microphone array 6, the directional microphone 7, the camera 8, and the loudspeakers 12. Furthermore, the interface provided in the transmission/reception unit 11 is connected to another transmission/reception unit 11 via the communication line 3.

The noise reduction unit 20 extracts the speaker voice signal from the audio signal acquired by the directional microphone 7. In the present embodiment, the speakers are the speakers 5A and 5B and the viewer 9B. The noise reduction unit 20 may calculate a noise reduction signal instead of the speaker voice signal. The noise reduction signal is a signal in which the noise included in the audio signal acquired by the directional microphone 7 is reduced. Details of the processing of the noise reduction unit 20 will be described with reference to FIG. 5.

The sound source direction estimation unit 21 estimates the direction of the sound source based on the phase difference and the intensity of the audio signals acquired by the microphone array 6. When there are a plurality of sound sources, the sound source direction estimation unit 21 estimates the direction of each of the sound sources. The processing of the sound source direction estimation unit 21 will be described in detail with reference to FIG. 3.

The audio processing unit 22 converts the speaker voice extracted by the noise reduction unit 20 into a signal suited to the characteristics of the communication line 3. For example, the audio processing unit 22 performs processing such as encoding on the speaker voice extracted by the noise reduction unit 20.

The video processing unit 24 converts the video captured by the camera 8 into a signal suited to the characteristics of the communication line 3. For example, the video processing unit 24 performs processing such as encoding on the video captured by the camera 8.

The person extraction unit 26 extracts the region of a person appearing in the video captured by the camera 8. In the present embodiment, the persons are the speakers 5A and 5B and the viewer 9B. The person extraction unit 26 extracts the person region from the video captured by the camera 8 by a general method. One such method is described in "Extracting human regions from image sequences" (Toru Tamaki, Satoshi Yamamura, Noboru Onishi: The Institute of Electrical Engineers of Japan, Transactions, Volume C, Vol. 119-C, No. 1, pp. 37-43).

The in-screen position specifying unit 27 specifies the position, within the video captured by the camera 8, of the person region extracted by the person extraction unit 26. Further, the in-screen position specifying unit 27 estimates the direction of the person with the camera 8 as a base point, based on the position of the person region in the video and the current direction of the camera 8.

The speaker position specifying unit 23 specifies the position of the speaker based on the direction of the sound source relative to the microphone array 6, as estimated by the sound source direction estimation unit 21, and the direction of the person relative to the camera 8, as estimated by the in-screen position specifying unit 27.

Specifically, the speaker position specifying unit 23 specifies the position (x, y) of the speaker using Equation (1) and Equation (2).

$$x = \frac{\tan\phi}{\tan\phi - \tan\theta} \qquad (1)$$

$$y = \frac{\tan\theta \times \tan\phi}{\tan\phi - \tan\theta} \qquad (2)$$

In this embodiment, the horizontal direction in FIG. 1 is the x axis, and the vertical direction in FIG. 1 is the y axis. In the auditorium 1, the position of a specific viewer 9A is the origin. Similarly, in the conference room 2, the position of a specific viewer 9B is the origin.

Here, θ is the direction of the sound source with the microphone array 6 as a base point. Specifically, θ is the angle between the x axis and the line connecting the microphone array 6 and the sound source. φ is the direction of the sound source with the camera 8 as a base point. Specifically, φ is the angle between the x axis and the line connecting the camera 8 and the person (sound source).
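Equations (1) and (2) have the form of the intersection of two bearing lines. The following Python sketch illustrates that reading under an assumption the text does not state: the microphone array 6 is placed at (0, 0) and the camera 8 at (1, 0) on the x axis, so distances are in units of their separation. The function name `locate_speaker` and the parallel-bearing check are hypothetical additions for illustration.

```python
import math


def locate_speaker(theta, phi):
    """Return (x, y) from Equations (1) and (2).

    theta: bearing of the sound source from the microphone array (radians).
    phi:   bearing of the person from the camera (radians).
    Assumes (not stated in the text) the array at (0, 0) and the camera at (1, 0).
    """
    tan_theta = math.tan(theta)
    tan_phi = math.tan(phi)
    denom = tan_phi - tan_theta
    if abs(denom) < 1e-12:
        # Parallel bearings never intersect, so no position can be computed.
        raise ValueError("bearings are parallel; position is undefined")
    x = tan_phi / denom               # Equation (1)
    y = tan_theta * tan_phi / denom   # Equation (2)
    return x, y


# A source at (0.5, 0.5) is seen at 45 degrees from (0, 0)
# and at 135 degrees from (1, 0).
x, y = locate_speaker(math.radians(45), math.radians(135))
```

When the two bearings are nearly equal, the denominator approaches zero and small angle errors produce large position errors, which is why the sketch rejects that case.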

The position control unit 28 controls the directions of the directional microphone 7 and the camera 8 based on the position of the speaker specified by the speaker position specifying unit 23. Specifically, the position control unit 28 controls the direction of the directional microphone 7 so that sound from the position of the speaker specified by the speaker position specifying unit 23 is acquired. As a result, the directional microphone 7 can clearly acquire the voice uttered by the speaker. In addition, the position control unit 28 controls the direction of the camera 8 so that the position of the speaker specified by the speaker position specifying unit 23 is photographed. As a result, the camera 8 can accurately photograph the speaker.

Note that the position control unit 28 may control the direction of the directional microphone 7 so that sound is acquired from the direction of the sound source estimated by the sound source direction estimation unit 21, rather than from the position of the speaker. Alternatively, the position control unit 28 may control the direction of the directional microphone 7 so that sound is acquired from the direction of the person specified by the in-screen position specifying unit 27.

Similarly, the position control unit 28 may control the direction of the camera 8 so that the direction of the sound source estimated by the sound source direction estimation unit 21 is captured. In addition, the position control unit 28 may control the direction of the camera 8 so that the direction of the person specified by the in-screen position specifying unit 27 is photographed.

The communication processing unit 25 communicates, via the communication line 3, with the communication processing unit 25 provided in the other transmission/reception unit 11. Specifically, the communication processing unit 25 transmits and receives the audio signal converted by the audio processing unit 22, the video signal converted by the video processing unit 24, and the speaker position specified by the speaker position specifying unit 23. For example, the communication processing unit 25 transmits the audio signal converted by the audio processing unit 22, the video signal converted by the video processing unit 24, and the speaker position specified by the speaker position specifying unit 23 in a single packet. The packet transmission process of the communication processing unit 25 will be described in detail with reference to FIG. 6.

In addition, the communication processing unit 25 obtains the audio signal, the video signal, and the position of the speaker from a received packet.

The video processing unit 24 decodes the video signal received by the communication processing unit 25. Then, the video processing unit 24 outputs the decoded video to the display 10. In addition, the audio processing unit 22 decodes the audio signal received by the communication processing unit 25. Then, the audio processing unit 22 delivers the decoded audio signal to the localization processing unit 29.

The localization processing unit 29 performs sound image localization processing on the audio signal received from the audio processing unit 22, based on the speaker position received by the communication processing unit 25. That is, the localization processing unit 29 localizes the sound image of the audio received from the audio processing unit 22 and outputs it from the loudspeakers 12.

FIG. 3 is a flowchart of the processing of the sound source direction estimation unit 21 according to the first embodiment of the present invention.

The sound source direction estimating unit 21 converts the audio signal acquired by the microphone array 6 into a digital signal (S41).

Next, the sound source direction estimating unit 21 obtains a time difference Δt between audio signals acquired by a plurality of microphones included in the microphone array 6. Specifically, the sound source direction estimating unit 21 calculates a time difference Δt that satisfies the following formula (3) (S42).

$$M_n(t) = \alpha M_k(t + \Delta t) \qquad (3)$$

Here, n and k are the positions of microphones in the microphone array 6, counted from the right or the left. M_n(t) is the audio signal acquired at time t by the n-th microphone included in the microphone array 6. Similarly, M_k(t + Δt) is the audio signal acquired at time t + Δt by the k-th microphone included in the microphone array 6. α is the ratio of the amplitude of the signal M_n(t) to the amplitude of the signal M_k(t + Δt).

Next, the sound source direction estimation unit 21 calculates the direction θ of the sound source with the microphone array 6 as a base point so as to satisfy Equation (4) (S43). The direction θ of the sound source relative to the microphone array 6 will be described in detail with reference to FIG. 4.

$$d + v_a \times t \times \cos\theta = v_a \times (t + \Delta t) \times \cos\theta \qquad (4)$$

Here, d is the microphone interval of the microphone array 6, and v_a is the speed of sound in air. The sound source direction estimation unit 21 then ends this process.

As described above, the sound source direction estimation unit 21 calculates the direction θ of the sound source.
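Step S42 requires a time difference Δt that satisfies Equation (3), but the text does not say how it is found. One common approach, swapped in here as an assumption rather than taken from the source, is to try every candidate lag and keep the one with the largest cross-correlation; α then follows as a least-squares amplitude ratio at that lag. The function name `estimate_delay` and the parameters `max_lag` and `sample_rate_hz` are hypothetical.

```python
def estimate_delay(m_n, m_k, max_lag, sample_rate_hz):
    """Find (delta_t, alpha) such that m_n[i] ~= alpha * m_k[i + lag].

    m_n, m_k: equally sampled signals from the n-th and k-th microphones.
    max_lag:  largest lag, in samples, to try in either direction.
    """
    def overlap(lag):
        # Indices i for which both m_n[i] and m_k[i + lag] exist.
        return range(max(0, -lag), min(len(m_n), len(m_k) - lag))

    def correlation(lag):
        return sum(m_n[i] * m_k[i + lag] for i in overlap(lag))

    # Brute-force search for the lag with the strongest correlation.
    best_lag = max(range(-max_lag, max_lag + 1), key=correlation)
    energy = sum(m_k[i + best_lag] ** 2 for i in overlap(best_lag))
    alpha = correlation(best_lag) / energy if energy else 0.0
    return best_lag / sample_rate_hz, alpha


# Synthetic check: m_n is m_k advanced by 3 samples and halved in amplitude.
pulse = [1.0, 3.0, 5.0, 3.0, 1.0]
m_k = [0.0] * 8 + pulse + [0.0] * 7
m_n = [0.0] * 5 + [0.5 * v for v in pulse] + [0.0] * 10
delta_t, alpha = estimate_delay(m_n, m_k, max_lag=5, sample_rate_hz=8000)
```

The resulting Δt is quantized to one sample period; a real system would likely interpolate around the correlation peak, which this sketch omits.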

FIG. 4 is an explanatory diagram of the direction θ of the sound source according to the first embodiment of the present invention.

The direction θ of the sound source with the microphone array 6A as a base point is the angle between the direction in which the microphones included in the microphone array 6A are lined up and the line connecting the microphone array 6A and the sound source. In other words, the direction θ of the sound source with the microphone array 6A as a base point is the angle between the x axis and the line connecting the microphone array 6A and the sound source. The sound source direction estimation unit 21 can therefore calculate the direction θ of the sound source using Equation (4).

FIG. 5 is a flowchart of the processing of the noise reduction unit 20 according to the first embodiment of this invention.

First, the noise reduction unit 20 receives the audio signal M_g(t) acquired by the directional microphone 7 (S51).

Next, the noise reduction unit 20 applies a voice band filter to the audio signal M_g(t) acquired by the directional microphone 7 (S52). The voice band filter is a filter that passes only signals in the voice band. As a result, the noise reduction unit 20 obtains the speaker voice signal A_s(t).

Next, the noise reduction unit 20 obtains a noise signal N(t) using Equation (5).

$$N(t) = M_g(t) - A_s(t) \qquad (5)$$

Next, the noise reduction unit 20 obtains the noise reduction signal A_r(t) using Equation (6) (S53).

$$A_r(t) = M_g(t) - k \times N(t) \qquad (6)$$

Here, k is a number set in advance by the user. The noise reduction signal A_r(t) is a signal in which the noise included in the audio signal M_g(t) acquired by the directional microphone 7 is reduced. In other words, the noise reduction signal A_r(t) is a signal in which the speaker voice signal A_s(t) and the reduced noise signal N(t) are combined. The noise reduction unit 20 then ends this process.

Note that the noise reduction unit 20 may obtain at least one of the noise reduction signal A_r(t) and the speaker voice signal A_s(t). If the noise reduction unit 20 obtains only the speaker voice signal A_s(t), step S53 is omitted.
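Equations (5) and (6) can be illustrated sample by sample with a minimal Python sketch. The design of the voice band filter is not given in the text, so the sketch takes the filter output A_s(t) as an input instead of computing it; the function name `reduce_noise` and the sample values are hypothetical.

```python
def reduce_noise(m_g, a_s, k):
    """Apply Equations (5) and (6) sample by sample.

    m_g: samples from the directional microphone, M_g(t).
    a_s: the same samples after the voice band filter, A_s(t).
    k:   reduction factor set in advance by the user.
    """
    noise = [m - a for m, a in zip(m_g, a_s)]        # Equation (5)
    return [m - k * n for m, n in zip(m_g, noise)]   # Equation (6)


m_g = [1.0, 2.0, 3.0]
a_s = [0.5, 1.5, 2.0]
a_r = reduce_noise(m_g, a_s, k=0.5)
```

Substituting Equation (5) into Equation (6) gives A_r(t) = A_s(t) + (1 - k) × N(t), which matches the description of A_r(t) as the speaker voice signal combined with reduced noise: k = 1 returns the filtered signal unchanged, and k = 0 returns the raw microphone signal.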

FIG. 6 is an explanatory diagram of packet transmission processing of the communication processing unit 25 according to the first embodiment of this invention.

First, the communication processing unit 25 stores the audio signal converted by the audio processing unit 22 in the audio queue 111. Further, the communication processing unit 25 stores the video signal converted by the video processing unit 24 in the video queue 112. Further, the communication processing unit 25 stores the position information of the speaker specified by the speaker position specifying unit 23 in the data queue 113. Note that a part of the memory provided in the transmission/reception unit 11 is used for the audio queue 111, the video queue 112, and the data queue 113.

Meanwhile, the communication processing unit 25 sequentially extracts audio signals from the audio queue 111. Further, the communication processing unit 25 sequentially extracts video signals from the video queue 112. Further, the communication processing unit 25 sequentially extracts speaker position information from the data queue 113 (114).

Next, the communication processing unit 25 creates a packet that includes the extracted audio signal, video signal, and speaker position information (115). Then, the communication processing unit 25 transmits the created packet to the communication processing unit 25 provided in the other transmission/reception unit 11 (116).

Then, the communication processing unit 25 ends the packet transmission process.
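The three-queue flow described above can be sketched in Python as follows. The `Packet` fields, the function name `build_packet`, and the rule of waiting until all three queues hold data are assumptions for illustration; the text does not specify a packet layout or what happens when a queue is empty.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    audio: bytes              # encoded audio signal
    video: bytes              # encoded video signal
    speaker_position: tuple   # (x, y) of the speaker


def build_packet(audio_queue, video_queue, data_queue):
    """Take the oldest entry from each queue and bundle them into one packet."""
    if not (audio_queue and video_queue and data_queue):
        return None  # Assumption: wait until all three queues have data.
    return Packet(audio_queue.popleft(),
                  video_queue.popleft(),
                  data_queue.popleft())


audio_queue = deque([b"a0", b"a1"])
video_queue = deque([b"v0", b"v1"])
data_queue = deque([(1.0, 2.0), (1.5, 2.0)])

first = build_packet(audio_queue, video_queue, data_queue)
second = build_packet(audio_queue, video_queue, data_queue)
third = build_packet(audio_queue, video_queue, data_queue)  # queues now empty
```

Carrying the speaker position in the same packet as the audio keeps the two in step, so the receiving side can localize each audio segment with the position that was current when it was captured.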

FIG. 7 is a flowchart of the processing of the localization processing unit 29 according to the first embodiment of the present invention.

First, the localization processing unit 29 calculates the speaker direction ψ based on the position (x, y) of the speaker received by the communication processing unit 25. Specifically, the localization processing unit 29 calculates the speaker direction ψ using Equation (7) (S61). Note that the speaker direction ψ is the direction of the speaker 5A or 5B on the display 10B, with the specific viewer 9B as the base point.

$$\psi = \arccos\left(\frac{x}{\sqrt{x^2 + y^2}}\right) \qquad (7)$$

Next, the localization processing unit 29 calculates the delay time Δu using Equation (8) (S62).

$$\Delta u = \frac{d_{sp} \times \cos\psi}{v_a} \qquad (8)$$

Note that d_sp is the distance between the loudspeakers 12. In the auditorium 1, d_sp is the distance between the loudspeaker 12A and the loudspeaker 12B. In the conference room 2, d_sp is the distance between the loudspeaker 12C and the loudspeaker 12D.

Next, the localization processing unit 29 localizes the sound image of the audio received from the audio processing unit 22 and outputs it from the loudspeakers 12 (S63). At this time, the localization processing unit 29 delays the sound output from one of the two loudspeakers 12 by the delay time Δu.

First, the sound image localization process in the auditorium 1 will be described. The localization processing unit 29 determines whether or not the speaker direction ψ is π/2 or less. When the speaker direction ψ is π/2 or less, the localization processing unit 29 delays the sound output from the loudspeaker 12 installed on the left side of the auditorium 1 by the delay time Δu. On the other hand, when the speaker direction ψ is larger than π/2, the localization processing unit 29 delays the sound output from the loudspeaker 12 installed on the right side of the auditorium 1 by the delay time Δu.

Next, the sound image localization process in the conference room 2 will be described. The localization processing unit 29 determines whether or not the speaker direction ψ is π/2 or less. When the speaker direction ψ is π/2 or less, the localization processing unit 29 delays the sound output from the loudspeaker 12C installed on the left side of the conference room 2 by the delay time Δu. On the other hand, when the speaker direction ψ is larger than π/2, the localization processing unit 29 delays the sound output from the loudspeaker 12D installed on the right side of the conference room 2 by the delay time Δu.
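Steps S61 to S63 can be sketched in Python as follows. The function name `loudspeaker_delays`, the default speed of sound of 340 m/s, and the use of the magnitude of Δu (cos ψ is negative when ψ exceeds π/2) are assumptions of this sketch rather than details given in the text.

```python
import math


def loudspeaker_delays(x, y, d_sp, v_a=340.0):
    """Return (left_delay, right_delay) in seconds for a speaker at (x, y).

    d_sp: distance between the two loudspeakers, in metres.
    v_a:  speed of sound in air; 340 m/s is an assumed value.
    """
    r = math.hypot(x, y)
    if r == 0.0:
        raise ValueError("speaker is at the origin; direction is undefined")
    # Equation (7); the clamp guards against rounding just outside [-1, 1].
    psi = math.acos(max(-1.0, min(1.0, x / r)))
    delta_u = d_sp * math.cos(psi) / v_a   # Equation (8)
    # Assumption: the magnitude is used, since cos(psi) < 0 for psi > pi/2.
    if psi <= math.pi / 2:
        return abs(delta_u), 0.0   # delay the left loudspeaker
    return 0.0, abs(delta_u)       # delay the right loudspeaker


left, right = loudspeaker_delays(1.0, 1.0, d_sp=3.4)
mirror_left, mirror_right = loudspeaker_delays(-1.0, 1.0, d_sp=3.4)
```

A speaker directly ahead (ψ = π/2) gives a delay of essentially zero, so both loudspeakers play together and the sound image sits at the center of the display.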

As described above, in the embodiment of the present invention, the transmission/reception unit 11 specifies the position of the speaker (sound source). Then, based on the specified position of the speaker, the transmission/reception unit 11 controls the directions of the directional microphone 7 and the camera 8 and performs sound image localization.

Industrial applicability

The present invention can be applied to a video conference system.

Claims

The scope of the claims

1. A communication system comprising: a first communication device comprising a processor, a memory and an interface; and a second communication device comprising a processor, a memory and an interface and connected to the first communication device,

wherein the first communication device

comprises a directional microphone that acquires sound from a specific direction,

specifies the position of a sound source,

controls the directivity of the directional microphone so that sound from the specified position of the sound source is acquired, and

notifies the second communication device of the sound acquired by the directional microphone and the specified position of the sound source, and

wherein the second communication device

localizes a sound image based on the notified position of the sound source and outputs the notified sound.

2. The communication system according to claim 1,

wherein the first communication device

comprises a microphone array including a plurality of microphones,

estimates, based on the sound acquired by the microphone array, the direction of the sound source with the microphone array as a base point, and

specifies the position of the sound source based on the estimated direction of the sound source.

3. The communication system according to claim 1,

wherein the first communication device comprises a camera that captures video,

estimates, based on the video captured by the camera, the direction of the sound source with the camera as a base point, and

specifies the position of the sound source based on the estimated direction of the sound source.

4. The communication system according to claim 1,

wherein the first communication device

comprises a microphone array including a plurality of microphones and a camera that captures video, and estimates, based on the sound acquired by the microphone array, the direction of the sound source with the microphone array as a base point,

estimates, based on the video captured by the camera, the direction of the sound source with the camera as a base point, and

specifies the position of the sound source based on the estimated direction of the sound source relative to the microphone array and the estimated direction of the sound source relative to the camera.

5. The communication system according to claim 1,

wherein the first communication device

comprises a camera, and

controls the direction of the camera so that the specified position is photographed.

6. A communication device comprising a processor, a memory and an interface, and connected to an output device that localizes a sound image and outputs sound,

wherein the communication device comprises a directional microphone that acquires sound from a specific direction, specifies the position of a sound source,

and transmits the specified position of the sound source to the output device as sound image localization data, together with the sound acquired by the directional microphone.

7. The communication device according to claim 6,

further comprising a microphone array including a plurality of microphones,

wherein, based on the sound acquired by the microphone array, the direction of the sound source with the microphone array as a base point is estimated,

8. The communication device according to claim 6,

further comprising a camera that captures video,

wherein, based on the video captured by the camera, the direction of the sound source with the camera as a base point is estimated,

9. The communication device according to claim 6,

further comprising a microphone array including a plurality of microphones and a camera that captures video, wherein, based on the sound acquired by the microphone array, the direction of the sound source with the microphone array as a base point is estimated, and, based on the video captured by the camera, the direction of the sound source with the camera as a base point is estimated, and

the position of the sound source is specified based on the estimated direction of the sound source relative to the microphone array and the estimated direction of the sound source relative to the camera.

10. The communication device according to claim 6,

further comprising a camera that captures video,

wherein the direction of the camera is controlled so that the specified position is photographed.

11. A sound source direction specifying device comprising a processor, a memory and an interface,

the device further comprising a microphone array including a plurality of microphones and a directional microphone that acquires sound from a specific direction,

wherein the direction of a sound source is estimated based on the sound acquired by the microphone array, and the directivity of the directional microphone is controlled so that sound from the estimated direction of the sound source is acquired.