WO2015198964A1

WO2015198964A1 - Imaging device provided with audio input/output function and videoconferencing system

Info

Publication number: WO2015198964A1
Application number: PCT/JP2015/067628
Authority: WO
Inventors: 大坪　宏安
Original assignee: 日立マクセル株式会社
Priority date: 2014-06-24
Filing date: 2015-06-18
Publication date: 2015-12-30
Also published as: JP2016010010A

Abstract

Provided are an imaging device with an audio input/output function for videoconferencing, and a videoconferencing system, which are inexpensive to produce. An imaging device (1) with an audio input/output function that is used in a videoconferencing system includes microphones (5), a speaker (6) and an omnidirectional camera (7), which are arranged close to one another. The imaging device (1) is used while placed on a table around which the participants in a videoconference sit. A person speaking at the conference speaks towards the microphones (5) and is highly likely to face the speaker (6) in order to hear the sound it outputs. As a result, the person speaking is imaged from the front by the omnidirectional camera (7).

Description

Image pickup apparatus with audio input / output function and video conference system

The present invention relates to an imaging apparatus with a voice input / output function and a video conference system.

In recent years, conference rooms have sometimes been equipped with a video conference system that includes a voice input / output device having a microphone and a speaker and placed on a table, a display placed near the table, and a video camera (for example, for recording) placed near the display, so that a so-called video conference using images and sound can be held with another, remote conference room.

In such a video conference system, the angle of view of the TV camera is often adjusted so that all participants in the conference fall within the shooting range. In this case, the seating positions of the participants may be restricted, or it may be difficult to keep all the participants within the shooting range. Further, adjusting the angle of view, zoom, and the like of the TV camera before the conference may take some time, causing a delay between the time all the participants have gathered and the start of the conference.

In addition, if the main speaker is determined in advance, measures can be taken such as having that person sit as close as possible to the center of the shooting range of the TV camera. If it is not known which participant will speak, however, the speaker may be near the edge of the shooting range and hard to see.

Therefore, a proposal has been made in which a plurality of microphones for voice input and a plurality of wide-angle cameras are provided, the position of the speaker is identified from the audio signals of the microphones and the image data of the wide-angle cameras, and, based on that position, the microphones are controlled so as to mainly pick up the voice of the speaker and the cameras are controlled so as to mainly photograph the speaker (see Patent Document 1).

In recent conference systems, a PTZ camera is used as the camera. A PTZ camera is capable of pan (P), which swings the camera left and right, tilt (T), which swings the camera up and down, and zoom (Z), which enlarges the image, so that the direction and zoom of the camera can be controlled. Further, in a system in which the position of the speaker can be identified as described above, the PTZ camera can be directed to the speaker automatically.

Patent Document 1: Japanese Patent Laid-Open No. 10-145763

In the invention of Patent Document 1, the position of the speaker is identified using a plurality of microphones and cameras, and, based on the identified position, the cameras are controlled so that the speaker is mainly photographed and the microphones are controlled so that the voice of the speaker is mainly input. Patent Document 1 therefore requires a plurality of microphones and cameras as well as a control device that controls them, which increases the cost of the conference system.

When there are several tens of participants or more in one conference room, for example, it may be necessary to identify the position of the speaker, control a camera to image the identified speaker, and control the microphones to extract the voice of the speaker. When there are fewer than ten participants in one conference room, however, such a configuration is poor in cost performance.

The present invention has been made in view of the above circumstances, and an object thereof is to provide an imaging device with a voice input / output function for video conferencing that can be manufactured at low cost, and a video conference system having the imaging device with a voice input / output function.

In order to solve the above-described problem, an imaging apparatus with a voice input / output function according to the present invention includes:
an omnidirectional camera that captures its surroundings;
an audio output device that is provided in the vicinity of the omnidirectional camera and outputs an externally input audio signal as sound; and
an audio input device that is provided in the vicinity of the omnidirectional camera and inputs ambient sound as an audio signal,
wherein the image data captured by the omnidirectional camera and the audio signal input by the audio input device are output.

According to such a configuration, when the imaging device with a voice input / output function is used as the imaging device, the speaker (audio output device), and the microphone (audio input device) of a conference system, all the participants can be photographed by the omnidirectional camera by placing the imaging device on a table and having the participants sit around the table. In this case, each participant sitting around the table looks either at the imaging device with a voice input / output function on the table or at a display on which the other venue of the video conference is shown.

However, a person who speaks basically speaks toward the microphone serving as the audio input device, and is also highly likely to face the speaker from which the voices of the participants at the other venue are output. In general, sound is easier to hear when its source is in front of the face, so people often turn toward a sound source to hear it better. That is, when the conference participants sit around the omnidirectional camera on the table, at least the person speaking faces the microphone or the speaker, and thus faces the omnidirectional camera in their vicinity. The omnidirectional camera therefore photographs that participant from the front, and in the resulting image data the person speaking appears to be talking to the participants at the other venue who are watching the image data.

In other words, by placing the microphone, the speaker, and the camera at approximately the same position, a participant can be prompted to face the camera at least while speaking, and the image of the person speaking can be captured clearly.

Also, since the omnidirectional camera is placed on the table and photographs participants sitting around the table, the distance to the participants is short and the difference in distance among the participants is small. Therefore, the participants can be photographed adequately even by an omnidirectional camera that does not have a high resolution, and the cost can be reduced compared with the case of using a high-resolution omnidirectional camera.

Note that if the omnidirectional image data captured by the omnidirectional camera is output projected onto a plane as it is, the image is distorted. Image processing that removes the distortion is therefore necessary, for example conversion into a panoramic image or into an image of each conference participant as an individual subject. Examples of the omnidirectional camera include a fisheye camera using a fisheye lens, a camera using a mirror having a shape close to a cone, and a full-sphere camera. The audio output device is, for example, a speaker. The audio input device is, for example, a microphone.

In the above configuration of the present invention, a plurality of displays that display externally input image data are preferably provided in the vicinity of the omnidirectional camera, at positions that do not interfere with the imaging of the surroundings by the omnidirectional camera, so as to be visible from a plurality of directions.

According to such a configuration, the conference participants are basically likely to face the display on which the participants at the other venue are shown, the speaker from which the speech of those participants is output, and the microphone used to talk to them. Because the display, the microphone, and the speaker are close to one another, most of the conference participants naturally face the omnidirectional camera, and the display at the other venue shows them facing the participants at that venue.

In addition, when the omnidirectional camera is placed on a table, the distance between each participant and the display becomes short, and the participants at the other venue can be identified even on a relatively small display. Therefore, even when a plurality of displays are used, the cost can be lower than when one large display is used. When the participants sit facing each other in two rows at a rectangular table, two displays suffice. When the participants sit in a circle around a round table, or sit along three or more of the four sides of a rectangular table, three or more displays are preferable.

In the above configuration of the present invention, it is preferable that the voice input device includes at least three microphones facing mutually different surrounding directions, and that the imaging device further includes:
a sound source direction recognition device that identifies the direction of the sound source from the volume of the sound input to each microphone; and
an image processing device that converts the omnidirectional image data captured by the omnidirectional camera into image data centered on the direction of the sound source identified by the sound source direction recognition device.

According to such a configuration, the participant who is speaking can be identified, and a panoramic image in which the speaker is located at approximately the horizontal center, or an image in which the speaker is extracted, can be displayed on the display at the other venue. In the present invention, the participants are located around the omnidirectional camera and the microphones in its vicinity, so the direction of the speaker as a sound source can be identified relatively easily by comparing the volumes of the microphones, even if the microphones are not highly directional. It suffices to identify the direction of the sound source in order to locate the speaker on the omnidirectional image; the position of the sound source need not be determined. Neither a microphone array for locating the sound source nor highly directional microphones are required, so the cost can be reduced. In addition, image data showing mainly the speaker (as the main subject) identified by the microphones can be created easily by specifying the direction on the omnidirectional image data.

In the above configuration of the present invention, it is preferable that the imaging device includes:
an image recognition device that recognizes the faces of the persons imaged in the image data captured by the omnidirectional camera and identifies, from the movement of the mouth of each recognized face, the imaged person who is speaking; and
an image processing device that converts the omnidirectional image data captured by the omnidirectional camera into image data centered on the imaged person identified as speaking by the image recognition device.

According to such a configuration, as in the case of using voice, image data showing mainly the speaker can be created once the direction of the speaker is identified. The position of the speaker need not be determined, so no equipment for determining the position is needed and the cost can be reduced. Also as in the case of identifying the direction of the speaker by voice, image data showing mainly the identified speaker can be created easily by specifying the direction on the omnidirectional image data.

The video conference system according to the present invention includes a plurality of the imaging devices with a voice input / output function according to the present invention, and each imaging device includes a communication device that outputs the image data and the audio signal to another imaging device with a voice input / output function and receives the image data and the audio signal output from the other imaging device.

According to such a configuration, the video conference system of the present invention achieves the above-described effects of each imaging device with a voice input / output function. The imaging device with a voice input / output function may be configured without a display; in that case, the imaging device that receives image data captured by another imaging device with a voice input / output function can output that image data to an external display.

According to the imaging device with a voice input / output function and the video conference system of the present invention, the device and the system can be manufactured at low cost, and a speaker shown on the display appears to be facing the person looking at the display, which makes the video conference feel more natural.

A view of the imaging device with a voice input / output function of the first embodiment, with the cover shown as translucent, in which (a) is a top view and (b) is a side view.
A diagram for explaining how the imaging device with a voice input / output function is used.
A diagram for explaining an image captured by the omnidirectional camera of the imaging device with a voice input / output function.
FIG. 4 is a diagram for explaining images output from the imaging device with a voice input / output function, in which (a) shows an outline of a panoramic image converted from an omnidirectional image, (b) shows the panoramic image converted from the omnidirectional image divided into two rows, (c) shows the panoramic image with an image of the speaker added, and (d) shows panoramic images converted from omnidirectional images captured at three different locations.
A view of the imaging device with a voice input / output function of the second embodiment, with the cover shown as translucent, in which (a) is a top view and (b) is a side view.
A view of the imaging device with a voice input / output function of the third embodiment, with the cover shown as translucent, in which (a) is a top view and (b) is a side view.
A view of the imaging device with a voice input / output function of the fourth embodiment, in which (a) is a front view and (b) is a rear view.

The first embodiment of the present invention will be described below with reference to the drawings.
The video conference system according to the present embodiment is constructed by arranging the imaging devices 1 with a voice input / output function shown in FIGS. 1A and 1B in a plurality of conference rooms that are remote from one another.

The imaging device 1 with a voice input / output function shown in FIG. 1 includes a substantially disc-shaped base plate 2, a substantially dome-shaped cover 3 that covers the base plate 2, microphones (audio input devices) 5 that are arranged at equal intervals along the circumferential direction on the outer peripheral portion of the base plate 2 and connected to a control board 4 described later, a speaker (audio output device) 6 arranged between the base plate 2 and the cover 3 so as to be covered by the cover 3, and an omnidirectional camera 7 fixed on the cover 3. The microphones 5, the speaker 6, and the omnidirectional camera 7 are arranged close to one another. Further, the speaker 6 and the omnidirectional camera 7 are arranged so that their central axes substantially coincide, and the microphones 5 are arranged at positions substantially equidistant from that central axis.

The base plate 2 is provided with an attachment structure for attaching the microphone 5, the speaker 6 and the control board 4 on the upper surface thereof. An attachment structure for attaching a circular lower edge portion (outer peripheral edge portion) of the cover 3 having the same diameter as the base plate 2 is provided on the outer peripheral portion of the disc-shaped base plate 2.

The cover 3 is provided with one or a plurality of holes (not shown) at positions corresponding to the microphones 5 so as not to interfere with voice input to the microphones 5. The dome-shaped cover 3 is provided with an opening 3a for outputting sound from the speaker 6 at the upper part (central part). The opening 3a of the cover 3 is provided with a bridge-like camera fixing portion 3b for fixing the omnidirectional camera 7 to the central portion of the upper portion of the cover 3.

The microphones 5 are, for example, directional, and the direction of highest sensitivity of each is aligned with a radial direction orthogonal to the central axis of, for example, the hemispherical or cylindrical imaging range of the omnidirectional camera 7. The microphones 5 are arranged at equal radial distances from the central axis of the imaging range and at positions shifted from one another by 90 degrees (at equal intervals in the circumferential direction). Note that omnidirectional microphones may be used as the microphones 5. Each microphone 5 is connected to the control board 4, converts sound into an audio signal, and inputs the signal to the control board 4. The audio signal may be analog or digital.

The speaker 6 is of an omnidirectional type, and a single speaker 6 outputs sound in almost all directions. Alternatively, a plurality of (for example, three or four) non-omnidirectional speakers may be used. The speaker 6 is connected to the control board 4, converts the audio signal output from the control board 4 into sound, and outputs the sound to the surroundings.

The omnidirectional camera 7 is, for example, a fisheye camera having a hemispherical imaging range, and images its surroundings. It may instead be, for example, a camera that obtains the omnidirectional image data F (shown in FIG. 2) from images captured by a plurality of cameras, a camera that captures the surroundings through a substantially conical mirror, or a full-sphere camera. The omnidirectional camera 7 only needs to be able to image the participants, as subjects, sitting around the table T from the imaging device 1 with a voice input / output function placed on the table T. For example, image data of the upward direction is not required.

Further, when the omnidirectional camera 7 is placed at a high position, for example higher than the heads of the seated participants, the upper bodies of the participants can no longer be photographed within a hemispherical imaging range. When the omnidirectional camera 7 is placed at such a high position, a full-sphere camera can preferably be used.

As the sound source direction recognition device, the control board 4 identifies the direction of the sound source from the volume levels of the audio signals input from the four microphones 5. In the present embodiment, the position of the sound source, which would require both the direction of and the distance to the sound source, is not determined; only the direction of the sound source is determined from the volume levels of the four microphones 5. For example, the two adjacent microphones 5 with the highest volume levels are identified, and the direction of the sound source between these two microphones 5 is determined from the difference in volume between them.

For example, if there is no difference in volume between the two microphones 5, the sound source is determined to be in the direction midway between them. If one of the two microphones 5 is louder, the sound source is determined to be between that midway direction and the direction in which the louder microphone 5 faces. If the volume of the second-loudest microphone 5 is substantially the same as that of the third-loudest microphone 5, the sound source is determined to be in the direction in which the loudest microphone 5 faces.
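
The volume comparison described above can be sketched roughly as follows. This is only an illustrative sketch, not the method of the embodiment: the function name `estimate_direction`, the `tie_ratio` threshold, the assumption that microphone i faces i × 360 / n degrees, and the linear weighting between adjacent microphones are all hypothetical choices made for the example.

```python
def estimate_direction(levels, tie_ratio=0.05):
    """Estimate the sound source direction in degrees from per-microphone volume levels.

    Assumes microphone i faces i * 360 / n degrees (an illustrative layout).
    """
    n = len(levels)
    spacing = 360.0 / n
    loudest = max(range(n), key=lambda i: levels[i])
    left, right = (loudest - 1) % n, (loudest + 1) % n
    # Neighbours about equally loud: the source lies where the loudest microphone faces.
    if abs(levels[left] - levels[right]) <= tie_ratio * levels[loudest]:
        return loudest * spacing
    neighbour = right if levels[right] > levels[left] else left
    # Hypothetical linear weighting: equal levels give the midpoint (half the spacing),
    # a much quieter neighbour pulls the estimate back toward the loudest microphone.
    weight = levels[neighbour] / (levels[loudest] + levels[neighbour])
    sign = 1.0 if neighbour == right else -1.0
    return (loudest * spacing + sign * spacing * weight) % 360.0
```

With four microphones, two equally loud adjacent microphones give the direction midway between them, while a single dominant microphone gives that microphone's own direction.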

The direction of the sound source may also be identified from the phase shift of the sound among the microphones 5. That is, a well-known method may be used that identifies the direction of the sound source from the differences in arrival time of the sound at the microphones 5, which result from their different distances from the sound source.
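
As one possible illustration of an arrival-time approach, the sketch below estimates the delay between two microphone signals by brute-force cross-correlation and converts it into an angle. It is a minimal sketch under stated assumptions: a single pair of microphones, a far-field source, and a speed of sound of 343 m/s. The function names and parameters are hypothetical and are not taken from the embodiment.

```python
import math


def best_lag(x, y, max_lag):
    """Return the lag in samples (positive if y is delayed relative to x)
    that maximizes the cross-correlation of the two signals."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, xi in enumerate(x):
            j = i + lag
            if 0 <= j < len(y):
                score += xi * y[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def arrival_angle(lag, sample_rate, mic_spacing, speed_of_sound=343.0):
    """Convert a lag into an angle in degrees from the broadside of the microphone pair,
    assuming a distant source (plane wave)."""
    delay = lag / sample_rate
    # Clamp so that measurement noise cannot push the value outside the domain of asin.
    s = max(-1.0, min(1.0, speed_of_sound * delay / mic_spacing))
    return math.degrees(math.asin(s))
```

A lag of zero corresponds to a source equidistant from both microphones; larger lags correspond to a source further toward one microphone of the pair.
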
The control board 4 as an image recognition device is adapted to specify the direction of the speaker from the omnidirectional image data F input from the omnidirectional camera 7. Basically, the direction of each participant is specified by recognizing the face of each participant (imaged person) from the omnidirectional image data F by well-known face recognition. Further, each participant's mouth is image-recognized, it is determined whether or not the mouth (lips) is moving, and the direction of the face determined that the mouth is moving is set as the direction of the speaker.

Note that programs for image processing and image recognition can be created easily using Intel (registered trademark) OpenCV (Open Source Computer Vision Library). For example, a face recognition program can use the object detection program provided in OpenCV. Image recognition in principle consists of a learning phase and a recognition phase: feature values are extracted from images and the features of an object are learned by a learning algorithm, which makes image recognition such as face recognition possible. OpenCV uses Haar-like features as the image feature values and an algorithm called AdaBoost as the learning algorithm. By having the object detection program perform machine learning based on these features, the program can recognize a face image as a face. OpenCV need not be used for the image recognition program; an existing program or a chip equipped with an existing image recognition circuit may be used instead. The movement of the speaker's mouth can also be recognized by having the OpenCV object detection program learn, for example, the difference between a speaking mouth and a silent mouth.
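
To give a feel for what a Haar-like feature measures, the following sketch computes a simple two-rectangle feature (left half minus right half of a window) using an integral image. It is a plain-Python illustration only, not OpenCV's implementation, and the function names are hypothetical.

```python
def integral_image(img):
    """Build a (h+1) x (w+1) table whose entry [y][x] is the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half (w should be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

A large absolute value indicates a strong vertical brightness edge inside the window. A boosted detector combines many such features evaluated at different positions and scales.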

In this embodiment, face recognition is performed to recognize the direction of each participant, and mouth movement is detected to recognize the direction of the speaker. As described above, the control board 4 also identifies the direction of the speaker as a sound source from the voice. In the present embodiment, when the two directions obtained by sound source direction recognition and by image recognition agree within a predetermined angle range (for example, within 0 to 10 degrees), one of them, for example the direction obtained by image recognition, is used as the direction of the speaker.

If the direction of the sound source obtained by sound source direction recognition and the direction of the speaker obtained by image recognition do not agree within the predetermined angle range, it is determined that there is no speaker. This prevents a participant who is whispering privately, yawning, or making a loud noise while moving a chair from being recognized as the speaker, even temporarily, and shown enlarged on the display 8 at another venue. Note that the direction of the speaker may be determined by sound source direction recognition alone or by image recognition alone.
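
The agreement check between the two direction estimates might look like the following sketch. The function names and the use of `None` to mean "no estimate" or "no speaker" are assumptions made for illustration; the 10-degree default follows the example range given above.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two directions in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def speaker_direction(audio_deg, image_deg, tolerance_deg=10.0):
    """Return the image-based direction when both estimates agree, otherwise None (no speaker)."""
    if audio_deg is None or image_deg is None:
        return None
    if angular_difference(audio_deg, image_deg) <= tolerance_deg:
        return image_deg % 360.0
    return None
```

The wrap-around handling matters because directions such as 355 degrees and 3 degrees are only 8 degrees apart and should be treated as agreeing.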

Further, the control board 4 functions as an image processing device that converts the omnidirectional image data F input from the omnidirectional camera 7 into a panoramic image by known image processing. The panoramic image data is created by determining where in the omnidirectional image data F the right and left ends of the panoramic image lie. When the direction of the speaker has been identified as described above, the omnidirectional image data F is cut open at the position 180 degrees from the direction of the speaker, that is, in the direction opposite to the speaker, and this position becomes the right and left ends of the panoramic image. When there is no speaker, for example, the intervals between the participants whose faces have been recognized are determined, and the center of the widest interval is used as the left and right ends of the panoramic image.
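
The choice of where to cut the omnidirectional image open can be sketched as below. The function name `panorama_seam`, the representation of participants as a list of directions in degrees, and the fallback of 0 degrees when no face is recognized are hypothetical details added for the example.

```python
def panorama_seam(face_dirs, speaker_dir=None):
    """Return the direction in degrees at which to cut the omnidirectional image open."""
    if speaker_dir is not None:
        # Cut opposite the speaker so that the speaker lands at the horizontal center.
        return (speaker_dir + 180.0) % 360.0
    if not face_dirs:
        return 0.0  # Arbitrary fallback when nobody is recognized (an assumption).
    dirs = sorted(d % 360.0 for d in face_dirs)
    best_gap, seam = -1.0, 0.0
    for i, start in enumerate(dirs):
        end = dirs[(i + 1) % len(dirs)]
        # With a single participant the "gap" is the full circle.
        gap = 360.0 if len(dirs) == 1 else (end - start) % 360.0
        if gap > best_gap:
            best_gap = gap
            seam = (start + gap / 2.0) % 360.0
    return seam
```

Cutting at the center of the widest gap keeps every participant away from the left and right edges of the panoramic image.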

Also, when the direction of the speaker has been identified, the control board 4 creates image data of the speaker, showing mainly the participant whose face has been recognized in that direction. The image portion of the face-recognized participant may be extracted and used as this image data, or the image portion within a predetermined angle range around the identified direction of the speaker may be used as the image data of the speaker.
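
For the second option, taking a fixed angle range around the speaker, the corresponding pixel columns of a panoramic image could be computed as in this sketch. It assumes a panorama that spans 360 degrees linearly across its width and whose left edge corresponds to a known cut direction `seam_deg`. The names and the 20-degree default half-width are hypothetical.

```python
def speaker_columns(pano_width, seam_deg, speaker_deg, half_width_deg=20.0):
    """Return (left, right) pixel columns of the window around the speaker in the panorama."""
    # Horizontal position of the speaker, measured from the left edge of the panorama.
    center = ((speaker_deg - seam_deg) % 360.0) / 360.0 * pano_width
    half = half_width_deg / 360.0 * pano_width
    left = max(0, int(round(center - half)))
    right = min(pano_width, int(round(center + half)))
    return left, right
```

When the panorama has been cut opposite the speaker, the window sits around the horizontal center and does not wrap around the image edges.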

In addition, the control board 4 serves as a communication device and performs data communication with another imaging device 1 with a voice input / output function via a local area network (LAN), the Internet, a public telephone network, a mobile phone network, a dedicated communication line, or the like. It transmits to the other imaging device 1 the audio signal input from the microphones 5, together with the panoramic image data and the speaker image data obtained by applying the image processing described above to the omnidirectional image data F captured by the omnidirectional camera 7.

It also receives the audio signal, panoramic image data, speaker image data, and the like transmitted from the other imaging device 1 with a voice input / output function. The image data of the speaker is transmitted and received only when it has been created. In the present embodiment, the imaging device 1 does not have the display 8, so the received image data is output to a connection terminal for the display 8 and displayed on the display 8 connected to that terminal. As described later, the imaging device 1 may itself include the display 8, in which case the received image data is output to that display 8.

In the above description the control board 4 performs sound source direction recognition, image recognition, image processing, and the like. Alternatively, the control board 4 may mainly control the input and output of the audio signals and image data, and the sound source direction recognition, image recognition, and image processing may be performed by a personal computer (PC; illustrated in FIG. 2) connected to the control board 4 by wired LAN, wireless LAN, USB, or the like. In addition, instead of being performed by the imaging device 1 whose omnidirectional camera 7 captured the omnidirectional image, the various kinds of image processing may be performed by the imaging device 1 that receives the image data, or by a personal computer PC connected to it. That is, the omnidirectional image data F captured by the omnidirectional camera 7 may be transmitted as it is, and the receiving imaging device 1 may process the image and display it on the display 8.

The imaging device 1 with a voice input / output function of such a video conference system is used by being placed on a table T in a conference room, for example, as shown in FIG. 2. The conference participants P sit around the table T; here, they sit in two rows along the two long sides of the rectangular table T. In FIG. 2, the personal computer PC is used as described above, the display 8 is connected via the personal computer PC, and image data processed by the personal computer PC is displayed on the display 8.

The omnidirectional image data F captured by the omnidirectional camera 7 in the state shown in FIG. 2 is as shown in FIG. 3. In FIG. 3, the three-dimensional omnidirectional image data F is shown in simplified form, projected onto a plane. The control board 4 performs image processing on this omnidirectional image data F to produce, for display on the display 8, the panoramic image G1 shown in FIG. 4A or the two divided panoramic images G2 and G3 shown in FIG. 4B.
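
The conversion from a circular omnidirectional image to a panoramic image can be illustrated by a simple polar unwrapping, sketched below. This is only a geometric illustration under assumptions: it ignores the actual lens projection model, assumes the image circle is centered at (`cx`, `cy`), maps the top row of the panorama to radius `r_inner` and the bottom row to `r_outer`, and all names are hypothetical.

```python
import math


def panorama_to_fisheye(u, v, pano_w, pano_h, cx, cy, r_inner, r_outer, seam_deg=0.0):
    """Map panorama pixel (u, v) to the (x, y) position in the circular source image.

    Column u selects the angle around the image center, starting at seam_deg;
    row v selects the radius between r_inner (top row) and r_outer (bottom row).
    """
    theta = math.radians(seam_deg) + 2.0 * math.pi * u / pano_w
    r = r_inner + (r_outer - r_inner) * v / (pano_h - 1)
    return cx + r * math.cos(theta), cy + r * math.sin(theta)
```

Filling every panorama pixel from the source position returned by this mapping (with interpolation) yields the unwrapped strip; changing `seam_deg` moves the position at which the circle is cut open.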

In the present embodiment, as shown in FIG. 4B, the intervals between the participants P in the omnidirectional image data F are determined, and if there is an interval equal to or greater than a predetermined interval (angle), the panoramic image G1 is separated at that interval and the interval portion is cut out, so that the left-right widths of the resulting panoramic images G2 and G3 are reduced. Note that when creating the panoramic images G1, G2, and G3, all the intervals between the participants P may be cut out. Alternatively, a panoramic image may be created by generating image data of each participant P with a predetermined width (predetermined angle range) and arranging these side by side. In this case as well, the intervals between the participants P are not displayed. In FIG. 4B, the image data separated into two is displayed in upper and lower rows, so that the panoramic images G2 and G3 are displayed large.
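
Separating the panoramic image at wide gaps between participants can be sketched as follows. Participant positions are given as angles measured from the left edge of the panorama (0 to 360 degrees). The function name, the 40-degree gap threshold, and the 10-degree margin kept around each group are hypothetical values chosen for the example.

```python
def split_segments(positions_deg, max_gap_deg=40.0, margin_deg=10.0):
    """Group participants into panorama segments, cutting wherever a gap exceeds max_gap_deg.

    Returns a list of (start, end) angle ranges to keep, each padded by margin_deg.
    """
    if not positions_deg:
        return []
    pos = sorted(positions_deg)
    segments = []
    start = prev = pos[0]
    for p in pos[1:]:
        if p - prev > max_gap_deg:
            # Close the current group and start a new one after the wide gap.
            segments.append((max(0.0, start - margin_deg), min(360.0, prev + margin_deg)))
            start = p
        prev = p
    segments.append((max(0.0, start - margin_deg), min(360.0, prev + margin_deg)))
    return segments
```

For two rows of participants facing each other across a table, this typically yields two segments, which can then be stacked in upper and lower rows.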

Further, when the speaker is specified, as shown in FIG. 4C, in addition to the panoramic image G1 shown in FIG. 4A, an image G10 mainly composed of the speaker is displayed separately. In addition, since the video conference is not necessarily held in only two places and may be held in three or more places, in that case, for example, as shown in FIG. And panoramic images G1, G4, and G5 are displayed at the respective divided portions. In FIG. 4D, a video conference is performed by connecting four places, and images of three meeting rooms other than the meeting room with the display 8 are displayed.

In the video conference system using the imaging device 1 with the voice input / output function, the control board 4 of the imaging device 1 installed in each conference room acts as a communication device and, as described above, transmits and receives the image data captured and the audio signals input in each conference room. As a result, the images of the participants in the other conference rooms are displayed on the display 8 as described above, and the audio signals input in the other conference rooms are output from the speaker 6.

In such an imaging apparatus 1 with a voice input / output function and a video conference system, the omnidirectional camera 7, the microphone 5, and the speaker 6 are substantially integrated as described above, and a participant who speaks (the speaker) basically tries to speak into the microphone 5. Since the omnidirectional camera 7 is in the vicinity of the microphone 5, the speaker ends up speaking toward the omnidirectional camera 7 and is therefore photographed from the front. Consequently, when the speaker's image G10 is displayed on the display 8, the speaker is likely to appear to be talking directly to the participants in the other conference room who are looking at the display 8.

Also, while talking with participants in other conference rooms, the speaker hears the voice of the person speaking in the other conference room from the speaker 6 near the omnidirectional camera 7, and therefore tends to face the speaker 6 in order to hear the sound easily. As a result, the speaker ends up speaking toward the omnidirectional camera 7, and this is achieved inexpensively. Therefore, as described above, it is easy to obtain an image in which the speaker appears to be speaking toward the participants in the other conference room. For these reasons, it is possible to suppress the sense of incongruity peculiar to video conferences that arises when the speaker shown on the screen of the display 8 is speaking in a direction other than toward the omnidirectional camera 7. In other words, the speaker is led to face the omnidirectional camera 7 naturally, without making a conscious effort to do so.

In addition, since all the participants sitting around the table T are basically photographed at substantially the same size by the omnidirectional camera 7, the omnidirectional camera 7 described above can be used without any particular control. Thus, once the participant who is speaking is identified, an image of the speaker can be obtained easily.
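One way the speaking participant's direction might be identified is from the volume picked up by each microphone 5, as recited for the sound source direction recognition device in the claims. The source states only that the direction is identified from the volume of the sound input to each microphone; the volume-weighted vector sum used below, the function name `estimate_direction`, and the microphone angles are assumptions made for illustration.

```python
import math
from typing import Sequence


def estimate_direction(mic_angles_deg: Sequence[float],
                       volumes: Sequence[float]) -> float:
    """Estimate the azimuth of a sound source in degrees (0 to 360).

    Each microphone contributes a vector pointing in the direction it
    faces, scaled by the volume it picked up; the estimate is the
    direction of the vector sum. This weighting is an assumed method.
    """
    x = sum(v * math.cos(math.radians(a))
            for a, v in zip(mic_angles_deg, volumes))
    y = sum(v * math.sin(math.radians(a))
            for a, v in zip(mic_angles_deg, volumes))
    if math.hypot(x, y) < 1e-9:
        # Equal volume on all sides: no direction can be singled out.
        raise ValueError("sound source direction is ambiguous")
    return math.degrees(math.atan2(y, x)) % 360.0


# Three microphones facing directions 120 degrees apart (hypothetical).
mics = [0.0, 120.0, 240.0]
print(estimate_direction(mics, [1.0, 1.0, 0.0]))  # roughly 60 degrees
```

The estimated azimuth could then be used as the center of the image data extracted from the omnidirectional image data F. A practical device would likely need smoothing over time and a volume threshold, which this sketch omits.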

Next, a second embodiment of the present invention will be described.
As shown in FIGS. 5A and 5B, the imaging apparatus 1a with the voice input / output function of the second embodiment includes, like the imaging apparatus 1 with the voice input / output function of the first embodiment, a base plate 11, a cover 12, a control board (not shown; the control board 4 in FIG. 1), a microphone 5, a speaker 6, and an omnidirectional camera 7. The imaging apparatus 1a with a voice input / output function of the second embodiment further includes a display 8. That is, the difference between the imaging apparatus 1 of the first embodiment and the imaging apparatus 1a of the second embodiment is whether the display 8 is separate from the apparatus or is provided in the apparatus itself.

In this embodiment, the base plate 11 is formed in a rectangular plate shape, and a microphone 5 is provided at each of the four corners. In addition, a display (for example, a liquid crystal display) 8 is attached to each of a pair of opposing side edges of the base plate 11, with the display screens facing in opposite directions (outward). A control board and a speaker 6 (not shown) are disposed between the two displays 8 on the base plate 11.

The cover 12 is formed in a rectangular parallelepiped shape corresponding to the rectangular base plate 11 and is attached so as to cover the base plate 11. On each of the two side surfaces of the cover 12 that correspond to the two mutually parallel displays 8 described above, a window portion 12a is provided through which the display screen of the display 8 can be seen from the outside. Further, the top plate of the cover 12 is provided with an opening 12b corresponding to the speaker 6. A camera fixing portion 12c is provided in a bridge shape across the opening 12b of the cover 12, and the omnidirectional camera 7 is attached to the camera fixing portion 12c. One or more holes may be provided in the cover 12 at positions corresponding to the microphones 5.

In addition, the imaging apparatus 1a with a voice input / output function has its two displays 8 facing in opposite directions so that it is well suited to the case where, as shown in FIG. 2, participants P sit side by side along each of two parallel side edges of the table T. Further, a display with a relatively small screen of about 7 inches to 15 inches, for example, is used as the display 8, so that when the apparatus is placed on the table T it does not block the lines of sight of participants sitting facing each other. This also reduces the cost of the display 8.

According to the imaging apparatus 1a with a voice input / output function of the second embodiment, substantially the same operational effects as the imaging apparatus 1 with a voice input / output function of the first embodiment can be obtained. In addition, since the display 8 is provided in the vicinity of the omnidirectional camera 7, a participant seated around the table T as described above can view the display 8 without difficulty, with the head facing the front and not inclined.

When the participant P faces the display 8, the omnidirectional camera 7 is located close to the display 8, so the participant is effectively looking toward the omnidirectional camera 7. Image data can therefore be obtained in which the participant appears to be looking at the participants in the other venues. That is, the structure of the first embodiment mainly leads the speaker to speak while looking toward the omnidirectional camera 7, but because the display 8 is in a different place from the omnidirectional camera 7, the other participants look at the display 8 rather than the camera. In the first embodiment it was therefore difficult to avoid capturing images in which participants other than the speaker face away from the omnidirectional camera 7.

On the other hand, in the second embodiment, the display 8 is arranged in the vicinity of the omnidirectional camera 7, so when a participant looks at the display 8, the participant's face is directed toward the omnidirectional camera 7. In addition, the speaker does not need to turn the face away from the direction of the microphone 5, the speaker 6, and the omnidirectional camera 7 in order to view the display 8, and therefore keeps facing the omnidirectional camera 7 while speaking.

Next, a third embodiment of the present invention will be described.
As shown in FIGS. 6A and 6B, the imaging device 1b with the voice input / output function of the third embodiment includes, like the imaging device 1 with the voice input / output function of the first embodiment, a base plate 21, a cover 22, a control board (control board 4 in FIG. 1), a microphone 5, a speaker 6, and an omnidirectional camera 7. The imaging apparatus 1b with a voice input / output function of the third embodiment also includes a display 8, as in the second embodiment.

In this embodiment, the base plate 21 is formed in a triangular plate shape, and a microphone 5 is provided at each of the three corners. A display 8 is attached to each of the three side edges of the base plate 21 with the display screen facing outward. Further, a control board and a speaker 6 (not shown) are arranged inside the three displays 8 on the base plate 21.

The cover 22 is formed in a triangular prism shape corresponding to the triangular base plate 21 and is attached so as to cover the base plate 21. At positions corresponding to the displays 8 on each of the three side surfaces of the cover 22, a window portion 22a is provided through which the display screen of the display 8 can be seen from the outside. In addition, the top plate of the cover 22 is provided with an opening 22b corresponding to the speaker 6. A camera fixing portion 22c is provided in a Y-bridge shape at the opening 22b of the cover 22, and the omnidirectional camera 7 is attached to the camera fixing portion 22c. One or more holes may be provided in the cover 22 at positions corresponding to the microphones 5.

The imaging apparatus 1b with a voice input / output function according to the third embodiment has basically the same structure as the imaging apparatus 1a with a voice input / output function of the second embodiment, except for the number of displays 8 and microphones 5 and whether the planar shape is a rectangle or a triangle, and it exhibits the same effects. In the third embodiment, since the displays 8 face three directions 120 degrees apart from each other, the blind-spot directions around the table T from which no screen of the display 8 can be seen are reduced. Note that, in the shape of the second embodiment, displays 8 may be provided on all the side surfaces of the cover 12 to give an imaging device 1a with a voice input / output function having four displays 8.
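The effect of the number of displays 8 on the blind spot can be checked with a simple count. The sketch below assumes that a screen is usable from any seating direction within a fixed off-axis angle of the screen's outward normal. That viewing limit (60 degrees in the example) and the function name `blind_spot_degrees` are hypothetical values chosen for illustration; the embodiment does not specify them.

```python
from typing import Sequence


def blind_spot_degrees(display_normals_deg: Sequence[float],
                       max_off_axis_deg: float) -> int:
    """Count the whole-degree seating directions around the table from
    which no display is within max_off_axis_deg of being seen head-on."""
    blind = 0
    for seat in range(360):
        visible = False
        for normal in display_normals_deg:
            # Smallest angle between the seat direction and screen normal.
            diff = abs((seat - normal + 180.0) % 360.0 - 180.0)
            if diff <= max_off_axis_deg:
                visible = True
                break
        if not visible:
            blind += 1
    return blind


# Two opposed displays (second embodiment) versus three displays
# 120 degrees apart (third embodiment), with an assumed 60-degree limit.
print(blind_spot_degrees([0, 180], 60))
print(blind_spot_degrees([0, 120, 240], 60))
```

Under this assumed 60-degree limit, the two-display arrangement leaves 118 of the 360 sampled directions without a usable screen, while the three-display arrangement leaves none.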

Next, a fourth embodiment of the present invention will be described.
As shown in FIGS. 7A and 7B, the imaging device 1c with the voice input / output function of the fourth embodiment includes, like the imaging device 1 with the voice input / output function of the first embodiment, a control board (control board 4 in FIG. 1), a microphone 5a, a speaker 6a, and an omnidirectional camera 7a. The imaging apparatus 1c with a voice input / output function according to the fourth embodiment also includes a display 8a, as in the second and third embodiments.

In the present embodiment, a display larger than 15 inches is used as the display 8a; for example, a display 8a of about 20 to 32 inches (or more) may be used. The control board, the microphone 5a, and the speaker 6a of the imaging device 1c with a voice input / output function are incorporated in the display 8a, and an omnidirectional camera 7a is attached to the center of the upper surface of the display 8a. That is, the control board and the omnidirectional camera 7a are added to a display, such as a display for a personal computer, that incorporates a speaker and a microphone. However, the display 8a may be connected to a personal computer PC, and the personal computer PC may take over the functions of the control board other than data input / output. In this case, the display 8a and the personal computer PC can be connected in the same manner as when a display of a type including a speaker, a microphone, and a camera is connected to the personal computer PC.

As shown in FIGS. 7A and 7B, in the fourth embodiment, the display 8a has display screens 14a and 14b on its front and back surfaces, respectively, so that participants sitting facing each other see the different display screens 14a and 14b. Alternatively, the display screen 14b on the back side may be omitted from the imaging device 1c with a voice input / output function, and a plurality of such imaging devices 1c may be used together. In this case there would be a plurality of omnidirectional cameras 7a, but a plurality of omnidirectional cameras 7a is not necessarily required. Therefore, a type having an omnidirectional camera 7a may be combined with a type having no omnidirectional camera 7a. Further, depending on the size of the display 8a, the omnidirectional camera 7a may be positioned too high relative to the participants, so that with a hemispherical shooting range only the part of each participant above the chest, or only the upper part of the participant's face, is captured. Therefore, the omnidirectional camera 7a preferably has a shooting range that is wider than a hemisphere and close to a full sphere.
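The reason a camera mounted high on the display 8a needs more than a hemispherical range can be seen from simple geometry. The sketch below assumes that the optical axis of the omnidirectional camera 7a points straight up, so that a hemispherical range reaches exactly to the camera's horizontal plane (90 degrees from the zenith). The function name `required_zenith_angle_deg` and the example heights and distance are hypothetical.

```python
import math


def required_zenith_angle_deg(camera_height_m: float,
                              target_height_m: float,
                              distance_m: float) -> float:
    """Angle from the zenith, in degrees, at which a target point appears
    to an upward-facing camera. 90 means the point lies on the edge of a
    hemispherical range; a larger value means it falls outside it."""
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")
    # Elevation of the target above (+) or below (-) the camera's horizon.
    elevation = math.degrees(
        math.atan2(target_height_m - camera_height_m, distance_m))
    return 90.0 - elevation


# Hypothetical numbers: camera 0.5 m above the table top, a participant's
# chest 0.2 m above the table top, seated 1.0 m away horizontally.
angle = required_zenith_angle_deg(0.5, 0.2, 1.0)
print(angle > 90.0)  # the chest is below the edge of a hemispherical range
```

Under these assumptions, any part of a participant lower than the camera lies beyond 90 degrees from the zenith, which is consistent with the preference for a shooting range wider than a hemisphere.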

According to the imaging apparatus with a voice input / output function 1c of the fourth embodiment, it is possible to obtain substantially the same operational effects as those of the first and second embodiments.

1, 1a, 1b, 1c Image pickup apparatus with voice input / output function
4 Control board (image recognition device, image processing device, sound source direction recognition device, communication device)
5, 5a Microphone (voice input device)
6, 6a Speaker (audio output device)
7, 7a Omnidirectional camera
8, 8a Display

Claims

1. An imaging device with a voice input / output function, comprising:
an omnidirectional camera that captures images of its surroundings;
an audio output device that is provided in the vicinity of the omnidirectional camera and outputs, as sound, an audio signal input from the outside; and
an audio input device that is provided in the vicinity of the omnidirectional camera and inputs ambient sound as an audio signal,
wherein the imaging device outputs image data captured by the omnidirectional camera and the audio signal input by the audio input device.
2. The imaging device with a voice input / output function according to claim 1, wherein a plurality of displays that display image data input from the outside so as to be visible from a plurality of surrounding directions are provided in the vicinity of the omnidirectional camera, at positions that do not interfere with imaging of the surroundings by the omnidirectional camera.
3. The imaging device with a voice input / output function according to claim 1, wherein the audio input device includes at least three microphones facing respectively different surrounding directions, the imaging device further comprising:
a sound source direction recognition device that identifies the direction of a sound source from the volume of the sound input to each microphone; and
an image processing device that converts the omnidirectional image data captured by the omnidirectional camera into image data centered on the direction of the sound source identified by the sound source direction recognition device.
4. The imaging device with a voice input / output function according to claim 1, further comprising:
an image recognition device that recognizes the faces of the persons imaged in the image data captured by the omnidirectional camera and identifies, based on the movement of the mouth of each recognized face, which imaged person is speaking; and
an image processing device that converts the omnidirectional image data captured by the omnidirectional camera into image data centered on the imaged person identified as speaking by the image recognition device.
5. A video conferencing system comprising a plurality of the imaging devices with a voice input / output function according to claim 1, wherein each of the imaging devices with a voice input / output function includes a communication device that outputs the image data and the audio signal to the other imaging devices with a voice input / output function, and inputs the image data and the audio signal output from the other imaging devices with a voice input / output function.