JP2012054897A

JP2012054897A - Conference system, information processing apparatus, and information processing method

Info

Publication number: JP2012054897A
Application number: JP2010198098A
Authority: JP
Inventors: Masahiro Sakaguchi; 昌弘坂口; Masaaki Toyoda; 将哲豊田; Daisuke Igarashi; 大輔五十嵐; Daisuke Tani; 大輔谷
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2010-09-03
Filing date: 2010-09-03
Publication date: 2012-03-15

Abstract

PROBLEM TO BE SOLVED: To provide a conference system, an information processing apparatus, and an information processing method that can reduce the unpleasantness and discomfort caused to other conference participants by controlling the transmission and reception of image data and voice data on the basis of motion analysis by image recognition. SOLUTION: A conference server apparatus 3 receives the images captured by the cameras 15 of the respective terminal devices 1, 1, ..., performs image recognition on the received images, and detects inappropriate behavior from the recognition result. When inappropriate behavior is detected, the conference server apparatus 3 stops transmission of the image, or processes part of it, so that the image showing the inappropriate behavior cannot be seen on the other terminal devices 1, 1, ....

Description

The present invention relates to a conference system in which a plurality of information processing apparatuses exchange images captured by cameras or sound collected by microphones, so that participants can hold a conference even when they are at remote locations. In particular, it relates to a conference system, an information processing apparatus, and an information processing method that detect the behavior of a conference participant and control the transmission and reception of images or sound according to the detected behavior, thereby reducing the unpleasantness and discomfort caused to other conference participants.

With advances in communication technology, image processing technology, and the like, video conference systems have been realized in which a plurality of information processing apparatuses installed at two or more remote sites hold a conference over a network. Because large volumes of data can be transmitted and received, conference systems (so-called Web conference systems) have been put into practical use in which not only is the voice data collected at a terminal device transmitted to other terminal devices so that a speaker's remarks are shared among a plurality of terminal devices, but each terminal device also captures video of its conference participant and transmits the video data to the other terminal devices, enabling a conference that includes facial expressions, gestures, and the like.

Conventional conference systems have been realized by having each information processing apparatus specify a telephone number or an IP (Internet Protocol) address to establish a direct connection with another information processing apparatus, with the two apparatuses exchanging voice data and image data one-to-one. To realize a conference among three or more information processing apparatuses, one apparatus serves as a parent device and the others as child devices; each child device establishes a connection with the parent device, and the parent device relays the data exchanged among the child devices.

To realize a conference system among a larger number of sites, there is a configuration in which a plurality of information processing apparatuses are connected in a star topology to an MCU (Multipoint Control Unit), and the MCU relays the data exchanged among them. In a conference system using an MCU, the number of information processing apparatuses (sites) that can participate depends on the performance of the MCU, that is, on the number of apparatuses it can connect (for example, the number of communication ports).

Alternatively, there is a configuration in which the information processing apparatuses used by conference participants connect as client devices to a server device through a communication network such as a LAN (Local Area Network) or the Internet, and the server device relays the data exchange. Although such a server-client configuration is limited by the processing capacity of the server device and the communication speed (available bandwidth) of the network, it has advantages over the MCU-based configuration, such as allowing the number of participating sites (information processing apparatuses) to be increased or decreased easily.

Thus, with either the MCU-based configuration or the server-client configuration, a conference can be held with each information processing apparatus used by one participant (or a small number of participants). Each information processing apparatus is provided with a display using a liquid crystal panel, an organic EL panel, or the like to show a shared screen, a camera that captures the participant using the apparatus, a microphone that collects that participant's speech, a speaker that outputs sound, and so on. The apparatus transmits and receives the captured video (image) data and the collected voice data through the MCU or the server device. This allows the participants to hold a conference while sharing one another's remarks, facial expressions, gestures, and the like.

Patent Document 1 discloses an invention for conferences held, using a conference system that exchanges sound and images, among participants from different cultural spheres. The gestures of each participant are rendered as moving images using a three-dimensional avatar or the like, and a gesture made by a participant from one cultural sphere is partially modified so that it is perceived as socially appropriate in the other cultural spheres as well.

Patent Document 2 discloses an invention in which a conference system recognizes participants' remarks by speech recognition, converts them into text, and displays the text on a common screen after deleting unnecessary words and phrases.

Patent Document 1: JP 2009-77380 A
Patent Document 2: JP H10-301927 A

Conference systems that let people at remote sites hold a conference while seeing one another's images, together with their peripheral technologies, play a major role in improving communication among people of different cultures and languages. On the other hand, in a conference system that transmits and receives images of the participants, each participant's behavior is conveyed to the others in real time. This can give rise to problems and concerns that did not exist in communication through other media.

For example, a conference system that exchanges video of participants among a plurality of sites transmits each participant's video signal to the other sites in real time. In this case, the participants' nonverbal behavior and facial expressions become important. This can be a problem particularly when the participants come from diverse cultural and linguistic backgrounds. Whether a given behavior or expression is appropriate to the occasion naturally differs greatly from culture to culture. A behavior that is appropriate and acceptable in one culture is often perceived as inappropriate and unacceptable in another, which causes misunderstanding.

In business settings involving international dealings, the appropriateness of participants' behavior when using a conference system is important. Depending on the behavior, it is possible to build more trust than through verbal communication alone. A moderate gaze at the right moment and appropriate gestures and facial expressions can convey trust and can determine whether a deal succeeds. Conversely, an inappropriate gesture at an inappropriate moment can seriously damage trust even when it is made unconsciously.

Participants can study the behavior appropriate to another culture in advance, but it is very difficult to remain appropriate in behavior that does not follow set patterns. It is also unrealistic to expect a participant to know in advance every difference in custom and tradition of the cultures of the other participants in every conference he or she might attend.

It is likewise unrealistic to correct in advance the habits that a participant exhibits unconsciously and that may offend other participants.

With the technique disclosed in Patent Document 1, even when one participant makes a gesture that other participants would perceive as culturally or socially inappropriate, the gesture is replaced with another gesture by means of an avatar, or the display is switched from the participant's image to another image, for example a shared screen showing conference materials. This avoids offending the other participants.

However, to determine whether a behavior is inappropriate, the technique disclosed in Patent Document 1 must acquire image data of the participant and analyze the behavior according to cultural model data. Patent Document 1 states that data obtained from a motion sensor or an acceleration sensor, or information from sound, may also be used for the motion analysis, but it gives no examples of what is actually analyzed or how that information is used in the analysis. Moreover, because the motion analysis follows cultural model data, supporting diverse cultures requires building a database of an enormous amount of cultural model data for each culture in advance, and the analysis load grows correspondingly large, which is impractical.

With the technique disclosed in Patent Document 2, it is also possible to delete words and phrases that would be perceived as inappropriate from a participant's remarks before displaying them on the common screen. This makes the common screen easier for participants to read, and because the remarks have already been converted into text, it also makes minutes easier to prepare.

However, although the technique disclosed in Patent Document 2 can delete information perceived as inappropriate from the displayed text, the sound and the video of the participant speaking have already been shared and conveyed to the other participants. It therefore cannot prevent the participants other than the speaker from feeling unpleasantness or discomfort.

The present invention has been made in view of these circumstances. Its object is to provide a conference system that controls the transmission and reception of image data and voice data on the basis of motion analysis by image recognition and thereby reduces the unpleasantness and discomfort caused to other conference participants, as well as an information processing apparatus that controls the conference system and an information processing method.

A conference system according to the present invention includes a plurality of first information processing apparatuses, each having an imaging device or a sound collecting device, means for acquiring an image from the imaging device or sound from the sound collecting device, and transmitting/receiving means for transmitting and receiving the acquired image or sound, and a second information processing apparatus that is connected to the plurality of first information processing apparatuses through a communication medium and relays the images or sound transmitted and received by each first information processing apparatus, the conference system realizing a conference by displaying or outputting common images or sound on the plurality of first information processing apparatuses so that information is shared, and is characterized by comprising: recognition means for recognizing the motion of a person appearing in an image acquired by a first information processing apparatus; detection means for detecting the presence or absence of inappropriate behavior on the basis of the recognition result of the recognition means; and transmission control means for controlling, according to the detection result of the detection means, whether the image or sound is received from the first information processing apparatus or transmitted to the other first information processing apparatuses, an increase or decrease in the transmission rate, or the processing of part of the image.

In the conference system according to the present invention, when the detection means detects inappropriate behavior, the transmission control means prohibits reception of the image or sound from the first information processing apparatus at which the inappropriate behavior was detected or its transmission to the other first information processing apparatuses, reduces the transmission rate, or processes part of the image.

The conference system according to the present invention further comprises means for measuring how long the behavior continues after the detection means detects inappropriate behavior, and determination means for determining whether the duration is equal to or longer than a predetermined time. When the determination means determines that the duration is equal to or longer than the predetermined time, the transmission control means prohibits reception of the image or sound from the first information processing apparatus at which the inappropriate behavior was detected or transmission of the image or sound to the other first information processing apparatuses, reduces the transmission rate, or processes part of the image.

The conference system according to the present invention comprises a table listing image recognition results registered in advance as inappropriate behavior, and the detection means detects the presence or absence of inappropriate behavior according to whether the recognition result of the recognition means corresponds to a recognition result contained in the table.

In the conference system according to the present invention, when the detection means detects behavior accompanied by inappropriate sound or vocalization, reception of the image or sound from the first information processing apparatus at which the inappropriate behavior was detected, or transmission of the image or sound to the other first information processing apparatuses, is prohibited.

In the conference system according to the present invention, the transmission control means processes the image from the first information processing apparatus at which inappropriate behavior was detected by superimposing another image on the part of the image that corresponds to the inappropriate behavior.

In the conference system according to the present invention, the recognition means and the detection means are provided in the second information processing apparatus.

The conference system according to the present invention comprises means for instructing, when the detection means detects inappropriate behavior, a change in the imaging direction of the imaging device or the sound collection direction of the sound collecting device of the first information processing apparatus at which the inappropriate behavior was detected.

An information processing apparatus according to the present invention is connected to a plurality of other apparatuses through a communication medium and has means for transmitting and receiving images or sound to and from each of them, and is characterized by comprising: means for recognizing the motion of a person appearing in a received image; means for detecting the presence or absence of inappropriate behavior on the basis of the recognition result; and means for controlling, according to the detection result, whether the image or sound is received from the other apparatus or transmitted to the other apparatuses, an increase or decrease in the transmission rate, or the processing of part of the image.

An information processing method according to the present invention controls the transmission and reception of images or sound to and from first information processing apparatuses in a system that includes a plurality of first information processing apparatuses, each having an imaging device or a sound collecting device, means for acquiring an image from the imaging device or sound from the sound collecting device, and transmitting/receiving means for transmitting and receiving the acquired image or sound, and a second information processing apparatus that is connected to the plurality of first information processing apparatuses through a communication medium and relays the images or sound transmitted and received by each first information processing apparatus. The method is characterized by recognizing the motion of a person appearing in an image acquired by a first information processing apparatus, detecting the presence or absence of inappropriate behavior on the basis of the recognition result, and controlling, according to the detection result, whether the image or sound is received from the first information processing apparatus or transmitted to the other first information processing apparatuses, an increase or decrease in the transmission rate, or the processing of part of the image.

In the present invention, image recognition is used to detect whether inappropriate behavior appears in an image captured by a first information processing apparatus used by a conference participant, that is, an image that is supposed to show the participant. According to the detection result, control is performed on the acquired image or sound, such as permitting or prohibiting its transmission and reception, increasing or decreasing the transmission rate, or processing part of the image.

Specifically, when inappropriate behavior is detected in an image that is supposed to show a conference participant, transmission of the image or sound from that first information processing apparatus to the other first information processing apparatuses is prohibited, or the image is transmitted at a reduced rate so that frames are dropped, or part of the picture is processed. Alternatively, reception of the image or sound from that first information processing apparatus at the second information processing apparatus is prohibited. As a result, the image showing the inappropriate behavior cannot be viewed on the other first information processing apparatuses, is difficult to make out clearly, or is partly concealed so that the behavior cannot be seen.
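The three kinds of control described above (prohibiting transmission, lowering the transmission rate so that frames are dropped, and processing part of the picture) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `Action`, `Frame`, `control_frame`, and the `keep_every` parameter are hypothetical and do not appear in the specification, and the masking itself is represented by a flag rather than by real image processing.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Action(Enum):
    """Hypothetical names for the kinds of transmission control."""
    FORWARD = "forward"          # relay the frame unchanged
    BLOCK = "block"              # do not relay the frame at all
    REDUCE_RATE = "reduce_rate"  # relay only every n-th frame (frames are dropped)
    MASK = "mask"                # relay the frame with part of it concealed


@dataclass(frozen=True)
class Frame:
    terminal_id: str      # which terminal captured the frame
    index: int            # running frame number from that terminal
    payload: bytes        # encoded image data (opaque in this sketch)
    masked: bool = False  # set when part of the picture must be concealed


def control_frame(frame: Frame, inappropriate: bool, action: Action,
                  keep_every: int = 5) -> Optional[Frame]:
    """Return the frame to relay to the other terminals, or None to drop it."""
    if not inappropriate or action is Action.FORWARD:
        return frame
    if action is Action.BLOCK:
        return None
    if action is Action.REDUCE_RATE:
        # Keep one frame out of every `keep_every`; the rest are dropped.
        return frame if frame.index % keep_every == 0 else None
    # Action.MASK: mark the frame so that a later step conceals part of it.
    return replace(frame, masked=True)
```

In this sketch the relay would call `control_frame` once per received frame and forward only non-`None` results; which action to apply is left to configuration.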

In the present invention, when inappropriate behavior is detected continuously for a predetermined time or longer, transmission and reception are prohibited so that the image cannot be viewed, the transmission rate is reduced so that frames are dropped, or part of the image is processed and concealed. Monitoring whether the duration has reached the predetermined time lets the conference proceed more smoothly than restricting transmission on a single detection. It also leaves time to warn the person who behaved inappropriately before the predetermined time elapses.
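The duration check can be illustrated with a small state holder that remembers when the behavior was first detected and reports whether it has persisted for the predetermined time. The class name `DurationGate` and its parameters are assumptions made for this sketch; the specification does not prescribe how the timing is implemented.

```python
from typing import Optional


class DurationGate:
    """Tracks how long a detected behavior has continued (hypothetical helper)."""

    def __init__(self, threshold_s: float) -> None:
        self.threshold_s = threshold_s            # the "predetermined time" in seconds
        self._started_at: Optional[float] = None  # time of the first detection, if any

    def update(self, detected: bool, now_s: float) -> bool:
        """Return True once the behavior has continued for threshold_s or longer."""
        if not detected:
            # The behavior stopped, so the measurement starts over next time.
            self._started_at = None
            return False
        if self._started_at is None:
            self._started_at = now_s
        return now_s - self._started_at >= self.threshold_s
```

For example, with `DurationGate(3.0)` a detection at 0 s and 2 s does not yet trigger control, a detection at 3 s does, and a frame without the behavior resets the measurement. The interval before the threshold is reached is where a warning to the participant could be issued.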

In the present invention, inappropriate behavior is detected by comparison against a table of image recognition results registered in advance as inappropriate behavior. Defining these in advance reduces the load of detecting inappropriate behavior by image recognition. Keeping them in a table also makes it easy to change what is detected as inappropriate behavior by updating the table as needed, without revising the detection processing itself.
Examples of inappropriate behavior include talking on the telephone, dozing, chatting, quarreling, crying, loud laughter, leaving one's seat, looking away, smoking, eating, and sticking out one's tongue, and the recognition result is compared with the image recognition results corresponding to these behaviors. This makes it possible to exclude from the conference system the images or sound of a participant who engages in them.
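A table-driven check of this kind might look like the following sketch. It assumes that the image recognition step outputs a text label per frame; the label strings and the names `INAPPROPRIATE_BEHAVIORS` and `is_inappropriate` are hypothetical, chosen only to mirror the examples listed above.

```python
from typing import AbstractSet

# Hypothetical labels mirroring the example behaviors named in the text.
INAPPROPRIATE_BEHAVIORS = {
    "phone_call", "dozing", "chatting", "quarreling", "crying",
    "loud_laughter", "leaving_seat", "looking_away", "smoking",
    "eating", "sticking_out_tongue",
}


def is_inappropriate(recognition_label: str,
                     table: AbstractSet[str] = INAPPROPRIATE_BEHAVIORS) -> bool:
    """Return True when the recognition result matches an entry in the table."""
    return recognition_label in table


# The detection function stays unchanged; only the table is edited.
custom_table = set(INAPPROPRIATE_BEHAVIORS)
custom_table.discard("eating")  # e.g. a lunch meeting where eating is acceptable
custom_table.add("yawning")     # e.g. an additional behavior to screen out
```

Because the lookup is a set-membership test, changing what counts as inappropriate is a data change rather than a code change, which is the maintenance benefit the paragraph describes.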

In the present invention, inappropriate behavior accompanied by sound or vocalization is detected not by speech recognition but by image recognition based on, for example, the opening and closing of the mouth. Inappropriate sound or remarks can therefore be detected and excluded from the conference system before they are conveyed to the other participants.
Inappropriate behavior accompanied by sound or vocalization is, for example, talking on the telephone, chatting, quarreling, crying, or loud laughter. In these cases, image recognition that combines the movement of the mouth with the movement of the hands or arms may be able to detect the behavior before an inappropriate remark is actually made.

In the present invention, when inappropriate behavior is detected, another image is superimposed on the part of the image that shows the behavior, so that the other participants do not see it. The superimposed image may be an image filled with a single color such as white or black, a mosaic image, or the like.
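As a minimal sketch of the single-color case, the function below fills a rectangular region of a grayscale image with one value. Representing the image as nested lists and the region as `(top, left, height, width)` are assumptions made to keep the example self-contained; how the region is located is outside this sketch.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def overlay_solid(image: Image, region: Tuple[int, int, int, int],
                  color: int = 0) -> Image:
    """Return a copy of image with the rectangle (top, left, height, width)
    filled with a single color; the input image is left untouched."""
    top, left, height, width = region
    out = [row[:] for row in image]
    # Clamp the rectangle to the image bounds before filling.
    for y in range(max(top, 0), min(top + height, len(out))):
        row = out[y]
        for x in range(max(left, 0), min(left + width, len(row))):
            row[x] = color
    return out
```

A mosaic variant would replace each block inside the region with its average value instead of a fixed color; the single-color fill is shown here because it is the simpler of the two options named in the text.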

In the present invention, image recognition and the detection of inappropriate behavior are performed by the second information processing apparatus, which receives the images from all the first information processing apparatuses centrally. This reduces the processing load on each first information processing apparatus and makes it possible to realize a conference system that does not offend other participants even with first information processing apparatuses that lack specific functions such as image recognition or motion detection.

In the present invention, when inappropriate behavior is detected, control is performed so that the participant engaging in it is not imaged, or so that sound from that participant is not collected. This prevents images or sound that the other participants would find unpleasant from being transmitted from the second information processing apparatus to the other first information processing apparatuses.

According to the present invention, inappropriate behavior by a conference participant operating a first information processing apparatus is detected from the result of image recognition, and the transmission and reception of the captured images of participants and the collected sound of their speech, which make up the conference, are controlled according to the behavior.

Thus, when a participant operating a particular first information processing apparatus is, for example, talking on the telephone, dozing, chatting, quarreling, crying, laughing loudly, away from the seat, looking away, smoking, eating, or sticking out the tongue, transmission and reception of the image, the sound, or both are prohibited (stopped), or processing such as superimposing another image is performed. This reduces the unpleasantness and discomfort that might otherwise be caused to the other participants and realizes a comfortable conference system.

In addition, control such as stopping the transmission and reception of captured images or collected sound that might cause unpleasantness or discomfort to other participants makes it possible to limit the increase in communication load caused by images or sound and the processing load on the server device (second information processing apparatus) that relays them in the conference system.

Configuration diagram showing the configuration of the conference system of Embodiment 1.
Block diagram showing the internal configuration of a terminal device constituting the conference system of Embodiment 1.
Block diagram showing the internal configuration of the conference server apparatus constituting the conference system of Embodiment 1.
Schematic diagram showing the transmission and reception of images and sound realized by the conference system of Embodiment 1.
Explanatory diagram showing an example of the contents of the image recognition table stored in the storage unit of the conference server apparatus.
Explanatory diagram showing an example of the contents of the motion detection table.
Flowchart showing an example of the procedure for transmitting and receiving images and sound and for detecting inappropriate behavior in the conference server apparatus of Embodiment 1.
Explanatory diagram showing an example of the transmission control performed when inappropriate behavior is detected in the conference system of Embodiment 1.
Block diagram showing the internal configuration of a terminal device constituting the conference system of Embodiment 2.
Flowchart showing an example of the procedure for transmitting and receiving images and sound and for detecting inappropriate behavior in the conference server apparatus of Embodiment 2.

The present invention is described in detail below with reference to the drawings showing its embodiments.

(Embodiment 1)
FIG. 1 is a configuration diagram showing the configuration of the conference system of Embodiment 1. The conference system includes terminal devices 1, 1, ... used by the respective conference participants, a network 2 to which the terminal devices 1, 1, ... are connected, and a conference server device 3 that realizes the transmission, reception, and sharing of images (video) and audio among the terminal devices 1, 1, ....

The network 2 may be an in-house LAN of the organization holding the conference, or a public communication network such as the Internet. The network 2 includes a plurality of access points 21 so that the terminal devices 1, 1, ... can communicate with the conference server device 3 by wireless communication.

In the conference system configured in this way, the terminal devices 1, 1, ... are authenticated for connection to the conference server device 3. Each authenticated terminal device 1 exchanges shared image (video) and audio information with the conference server device 3 and outputs the received images (video) and audio, thereby sharing them with the other terminal devices 1, 1, ... and realizing a conference over the network.

Note that the conference server device 3 can run a plurality of different conferences, conference 1 and conference 2, in parallel. The conference server device 3 recognizes the terminal devices 1, 1, ... in association with their respective groups, conference 1 and conference 2, and can relay images (video) and audio among the terminal devices 1, 1, ... independently within each group.
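The per-group relaying described above can be illustrated with a minimal Python sketch. The class `ConferenceRelay`, its method names, and the string identifiers are hypothetical and are not taken from the specification; the sketch only assumes that each terminal belongs to exactly one conference and that media is relayed only among members of the same conference.

```python
from collections import defaultdict


class ConferenceRelay:
    """Hypothetical bookkeeping for conferences running in parallel."""

    def __init__(self):
        self._members = defaultdict(set)  # conference id -> terminal ids
        self._conference_of = {}          # terminal id -> conference id

    def join(self, terminal_id, conference_id):
        """Associate a terminal with one conference (group)."""
        self._members[conference_id].add(terminal_id)
        self._conference_of[terminal_id] = conference_id

    def group_members(self, terminal_id):
        """Return the sorted terminals in the same conference as terminal_id."""
        conference_id = self._conference_of[terminal_id]
        return sorted(self._members[conference_id])


relay = ConferenceRelay()
relay.join("001", "conference1")
relay.join("002", "conference1")
relay.join("003", "conference2")
print(relay.group_members("001"))
```

Because the lookup always goes through the terminal's own conference, media from conference 1 never reaches a terminal registered in conference 2.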

FIG. 2 is a block diagram showing the internal configuration of a terminal device 1 constituting the conference system of Embodiment 1.

The terminal device 1 includes a control unit 100, a temporary storage unit 101, a storage unit 102, an input processing unit 103, a display processing unit 104, a video processing unit 105, an input audio processing unit 106, an output audio processing unit 107, a communication processing unit 108, and an encoding/decoding processing unit 109. The terminal device 1 further includes, either built in or externally connected, a tablet 13, a display 14, a camera 15, a microphone 16 (abbreviated as "mic" in the drawings), a speaker 17, and a wireless communication unit 18.

The control unit 100 uses an arithmetic processing device such as a CPU or MPU. By reading the conference terminal program stored in the storage unit 102 into the temporary storage unit 101 and executing it, the control unit 100 causes the dedicated conference system terminal to operate as the information processing apparatus according to the present invention.

The temporary storage unit 101 uses a RAM such as an SRAM or DRAM. The temporary storage unit 101 stores the program read out as described above, as well as information generated by the processing of the control unit 100.

The storage unit 102 uses a nonvolatile memory such as an EEPROM (Electrically Erasable Programmable ROM) or a flash memory. The storage unit 102 stores in advance the programs and data for realizing the functions of the terminal device 1, such as the conference terminal program and the conditions that the control unit 100 refers to during control, for example thresholds for detected values described later. Other application software programs of the terminal device 1 may also be stored. An external device such as a hard disk or SSD may be used for the storage unit 102.

The input processing unit 103 is connected to the tablet 13, which is built onto the display 14 and accepts operations for character input or graphic input with a terminal pen 4. The input processing unit 103 accepts information such as the pressing of buttons (click buttons) operated by the conference participant using the terminal device 1 and coordinate information indicating a position within the screen being displayed on the display, determines whether an input operation has occurred and what it is, and notifies the control unit 100. A pointing device (input device) such as a mouse or a keyboard, not shown, is also connected to the input processing unit 103, and signals corresponding to operations accepted by such a device may be input.

The display processing unit 104 is connected to the touch-panel display 14, which uses a liquid crystal panel, an organic EL panel, or the like. The control unit 100 outputs the application screen for the conference terminal to the display 14 via the display processing unit 104 and displays the images (video) to be shared within the application screen. As described later, the shared images include images that were transmitted from the other terminal devices 1, 1, ... and received from the conference server device 3. When an image transmitted from the conference server device 3 is encoded according to a standard such as H.261, H.263, H.264, or MPEG, the control unit 100 decodes the image before outputting it to the display processing unit 104.

The video processing unit 105 uses a video card. The video processing unit 105 is connected to the camera 15 of the terminal device 1, controls the operation of the camera 15, and acquires the video data captured by the camera 15. The camera 15 is mounted above the display 14 provided in the housing of the terminal device 1, facing a direction in which it captures the user's face or upper body. The camera 15 captures images at a rate of, for example, several tens or several hundreds of times per second and outputs the resulting image signals continuously to the video processing unit 105 as video data. The video processing unit 105 may convert (encode) the video data acquired from the camera 15 into data of a video standard such as H.261, H.263, H.264, or MPEG.

The input audio processing unit 106 is connected to the microphone 16 of the terminal device 1 and has an A/D conversion function that samples the sound collected by the microphone 16, converts it into digital audio data, and outputs it to the control unit 100. The input audio processing unit 106 may incorporate a mixer that performs processing such as signal level adjustment and band limiting of the collected sound, and an echo canceller that removes echo components. The input audio processing unit 106 may also encode the collected sound into audio data of a standard such as G.711, G.722, G.728, G.729, or MPEG Audio.

The output audio processing unit 107 is connected to the speaker 17 of the terminal device 1. The output audio processing unit 107 has a D/A conversion function so that, when audio data is supplied from the control unit 100, it is output as sound from the speaker 17. When the audio data transmitted from the conference server device 3 is encoded according to a standard such as G.711, G.722, G.728, G.729, or MPEG Audio, the control unit 100 decodes the audio data before outputting it to the speaker 17.

The communication processing unit 108 realizes communication of the terminal device 1 via the network 2. The communication processing unit 108 is connected to the wireless communication unit 18 and realizes wireless communication with the conference server device 3 or another terminal device 1 via an access point 21. More specifically, the communication processing unit 108 packetizes the information to be transmitted and reads information from received packets. The control unit 100 can transmit and receive image (video) and audio data through the communication processing unit 108. The communication protocol corresponds to the communication protocol used by the communication processing unit 37 of the conference server device 3, described later.

The encoding/decoding processing unit 109 uses an encoder/decoder chip and realizes video (image) encoding and decoding based on a standard such as H.261, H.263, H.264, or MPEG, and audio encoding and decoding based on a standard such as G.711, G.722, G.728, G.729, or MPEG Audio. When the control unit 100 receives encoded video, encoded audio, or multiplexed video data from the conference server device 3, it passes the data to the encoding/decoding processing unit 109 for decoding. The video (image) and audio standards may be other than those listed above.

In Embodiment 1, the terminal device 1 is a dedicated terminal equipped with the touch-panel display 14. However, the terminal device 1 is not limited to this and may be configured by connecting a camera and a speaker to a personal computer to which a desktop display is separately connected. It may also be realized by connecting a camera, a speaker, and a network card to a general-purpose display and connecting a device that realizes the functions of the control unit 100 described later. The conference system may include terminal devices 1 of various different configurations.

FIG. 3 is a block diagram showing the internal configuration of the conference server device 3 constituting the conference system of Embodiment 1.

The conference server device 3 uses a server computer and includes a control unit 30, a temporary storage unit 31, a storage unit 32, an encoding/decoding processing unit 33, an image processing unit 34, an audio processing unit 35, an image recognition unit 36, a communication processing unit 37, and a network I/F unit 38.

The control unit 30 uses an arithmetic processing device such as a CPU (Central Processing Unit) or MPU (Micro Processing Unit). By reading the conference server program 3P stored in the storage unit 32 into the temporary storage unit 31 and executing it, the control unit 30 causes the server computer to operate as the conference server device 3 of Embodiment 1.

The temporary storage unit 31 uses a RAM such as an SRAM (Static Random Access Memory) or DRAM (Dynamic Random Access Memory). The conference server program 3P is temporarily read into it as described above, and information generated by the processing of the control unit 30 is temporarily stored in it.

The storage unit 32 uses an external storage device such as a hard disk or SSD (Solid State Drive). The storage unit 32 stores the conference server program 3P described above. The storage unit 32 also stores authentication data for authenticating the terminal devices 1, 1, ... used by the conference participants, and a conference information DB 321 containing document data and other items to be shared among the terminal devices 1, 1, .... The document data may be text data, photograph data, figure data, or the like, in any format. The storage unit 32 further stores an image recognition table 322 and a motion detection table 323, described in detail later, which can be referenced by the control unit 30 and the other components.

The encoding/decoding processing unit 33 uses an encoder/decoder chip and includes an image encoding unit that encodes images (video) based on a standard such as H.261, H.263, H.264, or MPEG (Moving Picture Experts Group), and an image decoding unit that decodes encoded images. When the images transmitted from the terminal devices 1, 1, ... are encoded, the control unit 30 has them decoded by the encoding/decoding processing unit 33 and passes them to the image processing unit 34. The image composited by the image processing unit 34 is passed back to the encoding/decoding processing unit 33, encoded, and then transmitted to the terminal devices 1, 1, ... by the communication processing unit 37.

The encoding/decoding processing unit 33 also includes an audio encoding unit that encodes audio based on a standard such as G.711, G.722, G.728, G.729, or MPEG Audio, and an audio decoding unit that decodes encoded audio. When the audio transmitted from the terminal devices 1, 1, ... is encoded, the control unit 30 has it decoded by the encoding/decoding processing unit 33 and passes it to the audio processing unit 35. The audio mixed by the audio processing unit 35 is passed back to the encoding/decoding processing unit 33, encoded, and then transmitted to the terminal devices 1, 1, ... by the communication processing unit 37. In addition to encoding images and audio individually, the encoding/decoding processing unit 33 may time-synchronize and multiplex the images and audio before output.

In response to instructions from the control unit 30, the image processing unit 34 composites an image from the plurality of image data items transmitted from the respective terminal devices 1, 1, .... The image processing unit 34 also has a function of accepting, from among the shared documents contained in the conference information DB 321 of the storage unit 32, the document data to be displayed on each terminal device 1, 1, ..., converting that document data into an image, and outputting it. The image processing unit 34 can further perform various kinds of image processing such as enlargement and reduction, edge enhancement, and color adjustment.

In response to instructions from the control unit 30, the audio processing unit 35 mixes audio from the plurality of audio data items transmitted from the respective terminal devices 1, 1, .... The audio processing unit 35 can also perform various kinds of audio processing such as noise removal and volume adjustment.

The image recognition unit 36 recognizes a person's face (contour), hands, and arms in the image supplied from the control unit 30, and further recognizes the mouth, eyes, eyebrows, and so on within the face. The image recognition unit 36 also recognizes specific objects (for example, tableware, cigarettes, and telephones) and outputs the recognition results. Specifically, it outputs information such as the presence or absence and the coordinates of the recognized contour, mouth, eyes, and so on. From a plurality of images supplied in time series by the control unit 30, the image recognition unit 36 recognizes changes in the position within the image of the recognized face, hands, arms, mouth, eyes, eyebrows, and other specific objects. The image recognition unit 36 may also have a function of classifying a person's facial expression from the positions of the mouth, eyes, and eyebrows in the face and outputting the classification result, such as a smiling face, an angry face, or a crying face.

The communication processing unit 37 realizes communication of the conference server device 3 via the network 2. The communication processing unit 37 is connected to the network I/F unit 38, which uses a network card connected to the network 2, and performs packetization and reading of information from packets when images or audio are exchanged with the terminal devices 1, 1, ... via the network 2. The control unit 30 can transmit and receive images (video) and audio through the communication processing unit 37. To realize the conference system of Embodiment 1, the communication processing unit 37 may use a protocol such as H.323, SIP (Session Initiation Protocol), or HTTP (Hypertext Transfer Protocol) for transmitting and receiving images and audio; the communication protocol is not limited to these. The network I/F unit 38 may include an antenna, and the communication processing unit 37 may be configured to perform wireless communication.

FIG. 4 is a schematic diagram showing the transmission and reception of images and audio realized by the conference system of Embodiment 1. In the conference system configured as described above, as shown in FIG. 4, each of the terminal devices 1, 1, ... basically continues, under the control of its control unit 100, to transmit video (or still image) data of the conference participant's face or upper body captured by the camera 15 to the conference server device 3 via the communication processing unit 108 and the wireless communication unit 18.

When the conference server device 3 receives the image data and audio data captured, collected, and transmitted by each of the terminal devices 1, 1, ..., it separates the image data from the audio data, passes each to the encoding/decoding processing unit 33, and decodes each. As shown in FIG. 4, the control unit 30 then instructs the image processing unit 34 to composite the decoded images (video) from the plurality of terminal devices 1, 1, ... so that they are displayed side by side. The control unit 30 instructs the audio processing unit 35 to superimpose the decoded audio from the plurality of terminal devices 1, 1, .... The control unit 30 passes the composited image and mixed audio to the encoding/decoding processing unit 33, has them encoded or multiplexed, and transmits them to each of the terminal devices 1, 1, ... via the communication processing unit 37.
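As a rough illustration of the compositing and mixing step, the following Python sketch computes where each participant's image could be placed in a side-by-side layout and superimposes audio samples. The specification does not state a particular layout or mixing algorithm, so the near-square grid, the 16-bit sample range, and the function names `tile_layout` and `mix_audio` are all assumptions made for this example.

```python
import math


def tile_layout(n, frame_w, frame_h):
    """Return one (x, y, w, h) tile per image, arranged in a near-square grid."""
    if n <= 0:
        return []
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    w, h = frame_w // cols, frame_h // rows
    return [((i % cols) * w, (i // cols) * h, w, h) for i in range(n)]


def mix_audio(streams, lo=-32768, hi=32767):
    """Superimpose equal-length PCM sample lists by summing and clipping."""
    return [max(lo, min(hi, sum(samples))) for samples in zip(*streams)]


# Four participants composited into a 640x480 frame.
print(tile_layout(4, 640, 480))
# Two audio streams superimposed; the second sample is clipped to the 16-bit range.
print(mix_audio([[100, 30000], [200, 10000]]))
```

Clipping is the simplest way to keep the mixed signal in range; a real mixer would more likely scale or normalize the sum to avoid audible distortion.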

In each of the terminal devices 1, 1, ..., the composited image and mixed audio transmitted from the conference server device 3 are decoded by the encoding/decoding processing unit 109 and passed to the display processing unit 104 and the output audio processing unit 107, respectively, so that images showing the faces or upper bodies of the conference participants using the terminal devices 1, 1, ... are displayed side by side. The conference participant using each terminal device 1 can thus listen to what the participants, including himself or herself, say while checking the facial expressions and gestures of the other participants.

In the conference system of Embodiment 1, however, when the action of the conference participant shown in the image transmitted from any one of the terminal devices 1 is inappropriate, transmission to the other terminal devices 1 of the image of that participant's face or upper body from that terminal device 1, and of the sound uttered by that participant, is prohibited.

Detection of an inappropriate action by a conference participant is performed by the conference server device 3 that receives the images from the terminal device 1 used by that participant. The inappropriate-action detection processing in the conference server device 3 is described in detail below.

So that the conference server device 3 can detect inappropriate actions by conference participants, the storage unit 32 stores the image recognition table 322 and the motion detection table 323.

FIG. 5 is an explanatory diagram showing an example of the contents of the image recognition table 322 stored in the storage unit 32 of the conference server device 3. The image recognition table 322 contains a number assigned to each kind of inappropriate action, a description of each action, and its recognition pattern.

In the example shown in FIG. 5, the inappropriate actions listed are "telephone, dozing, chatting, fighting, crying, loud laughter, leaving the seat, looking away, smoking, eating, and sticking out the tongue". For example, "telephone" as an inappropriate action during a conference can be identified by the hand or arm being brought close to the ear while the mouth is moving. Accordingly, the recognition pattern registered in advance for "1: telephone" includes the position of the arm relative to the body (arm raised or lowered), the mouth opening and closing across a plurality of images, or the presence of an image recognized as a telephone.

"Dozing" may involve both eyes being closed and the head tilting. Accordingly, for "2: dozing", the registered pattern includes the orientation of the face (facing downward, or the center line of the face tilted by a predetermined angle or more), whether both eyes are closed, or the presence of periodic movement of the head (face).

The inappropriate action "crying" can be identified from the image by the face being turned downward, the hands covering the eyes, or a crying facial expression. Accordingly, for "3: crying", the registered pattern includes the face being turned downward, the hands or arms being positioned so as to cover the face, or a crying face being recognized.

The inappropriate action "loud laughter" may involve laughing with the mouth wide open. Accordingly, for "4: loud laughter", the registered pattern includes whether the recognized mouth is at least a predetermined size, or its proportion of the face is at least a predetermined value, and whether a smiling face has been recognized.

"Leaving the seat" is a situation in which no person is recognized in the first place, so for "5: leaving the seat" the registered pattern is the case in which no face (person) can be recognized.

"Looking away" is a state of facing another direction without looking at the screen. Accordingly, for "6: looking away", the registered pattern includes the face being oriented in a direction other than the predetermined direction, or tilted greatly from the predetermined direction.

"Smoking" can be identified by whether the participant is holding a cigarette in the hand, holding a cigarette in the mouth, or repeatedly bringing a cigarette to the mouth. Accordingly, for "7: smoking", the registered pattern includes a cigarette image being recognized, the hand being at the mouth, or the hand repeatedly moving to the mouth.

"Eating" can be identified by the action of holding tableware such as chopsticks or a fork and bringing it to the mouth, or by the mouth moving while closed. If chewing gum is also counted as the inappropriate action of eating, it can likewise be identified by the mouth moving while closed. Accordingly, for "8: eating", the registered pattern includes a tableware image being recognized, movement of the hand or arm to the mouth, or the mouth moving repeatedly while closed.

"Sticking out the tongue" is a state in which the tongue protrudes from the mouth, so for "9: sticking out the tongue" the registered pattern includes whether a tongue image is recognized near the mouth.

"Fighting" is a situation such as several people grappling with each other, so for "10: fighting" the registered pattern includes a plurality of persons being recognized and those persons being judged to be entangled with each other, or an angry face being recognized as a person's expression.

"Chatting" is a situation in which several people are talking about something other than the conference agenda, for example two adjacent people facing each other with their mouths moving. Accordingly, for "11: chatting", the registered pattern includes a plurality of persons being recognized and the faces of those persons being turned toward each other.
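One way to picture the image recognition table is as a list of numbered entries, each carrying a predicate over the output of the image recognition unit. The Python sketch below encodes three of the eleven entries. The recognition-result keys (`phone_seen`, `arm_raised`, `mouth_moving`, `mouth_ratio`, `expression`, `face_count`), the 0.15 threshold, and the way the conditions are combined are assumptions for illustration only; the specification lists the cues but not a concrete data format or threshold values.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Recognition = Dict[str, Any]  # hypothetical summary of recognition results


@dataclass(frozen=True)
class PatternEntry:
    number: int                              # number assigned to the action
    action: str                              # description of the action
    matches: Callable[[Recognition], bool]   # recognition pattern as a predicate


IMAGE_RECOGNITION_TABLE = [
    PatternEntry(1, "telephone",
                 lambda r: r.get("phone_seen", False)
                 or (r.get("arm_raised", False) and r.get("mouth_moving", False))),
    PatternEntry(4, "loud laughter",
                 lambda r: r.get("mouth_ratio", 0.0) >= 0.15  # assumed threshold
                 and r.get("expression") == "smile"),
    PatternEntry(5, "leaving the seat",
                 lambda r: r.get("face_count", 0) == 0),
]


def detect(recognition):
    """Return the numbers of all table entries whose pattern matches."""
    return [e.number for e in IMAGE_RECOGNITION_TABLE if e.matches(recognition)]


print(detect({"face_count": 1, "arm_raised": True, "mouth_moving": True}))
```

Keeping the patterns in a table rather than in program logic mirrors the intent stated in the text: entries can be added or removed without touching the detection loop.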

The control unit 30 of the conference server device 3 supplies the image transmitted from one terminal device 1 to the image recognition unit 36 and obtains recognition results for the person's face, hands, arms, and so on. Because some inappropriate actions cannot be recognized from a single image, the control unit 30 supplies a plurality of images covering a predetermined time, for example 0.3 seconds, to obtain the recognition results. The control unit 30 compares the obtained recognition results with the recognition pattern of each inappropriate action stored in the image recognition table 322 and determines whether each inappropriate action is present. In this way, the control unit 30 can detect inappropriate actions based on the presence or absence determined for each one. To add or remove an inappropriate action, it suffices to add or delete the corresponding entry in the image recognition table 322, or to set the determination for each action as enabled (permitted) or disabled (not permitted). Modifying the contents of the image recognition table 322 thus makes it easy to change the details of inappropriate-action detection as needed. Of course, even in a configuration without the image recognition table 322, the details of the inappropriate actions to be detected can be changed as needed by modifying the conference server program 3P.
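The two ideas in this paragraph, judging over a short window of frames and switching individual actions on or off, can be sketched in Python as follows. The function names, the frame-rate parameter, and the rule that a mouth counts as moving when its open or closed state changes at least once within the window are assumptions for illustration; only the 0.3-second example window and the enabled/disabled setting come from the text.

```python
def window_size(fps, window_seconds=0.3):
    """Number of frames covering the judgment window (at least one)."""
    return max(1, round(fps * window_seconds))


def mouth_is_moving(mouth_open_flags):
    """True when the open/closed state changes at least once across the window."""
    flags = list(mouth_open_flags)
    return any(a != b for a, b in zip(flags, flags[1:]))


def judge_actions(matched_numbers, enabled):
    """Return {action number: present?} for enabled actions only."""
    return {n: n in matched_numbers for n, on in sorted(enabled.items()) if on}


print(window_size(30))
print(mouth_is_moving([False, True, False]))
# Action 7 is disabled, so it is not judged at all.
print(judge_actions({4}, {1: True, 4: True, 7: False}))
```

A single frame can never satisfy `mouth_is_moving`, which is the reason the text calls for several images per judgment.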

FIG. 6 is an explanatory diagram showing an example of the contents of the motion detection table 323. The motion detection table 323 holds the results of the determinations that the control unit 30 makes, for each terminal device 1, on each inappropriate motion listed in the image recognition table 322. The motion detection table 323 may be stored in the temporary storage unit 31 instead of the storage unit 32. It contains an identification number identifying the terminal device that sent the image, the number of the inappropriate motion, the determination result, and the duration. The information identifying the terminal device is not limited to a number. The inappropriate-motion number corresponds to the number assigned to the motion registered in advance in the image recognition table 322. The duration in FIG. 6 is the number of consecutive times the determination result has been “1: present”; it may instead be a measured number of seconds.

FIG. 6 shows the results of the control unit 30 of the conference server device 3 determining whether inappropriate motions were detected in the images from the terminal device 1 with the identification number “001”. Based on the images obtained from that terminal device 1, the control unit 30 stores in the motion detection table 323 the determination result for each of the inappropriate motions numbered 1 to 11. In the example of FIG. 6, inappropriate motion number 4, “loud laughter”, has been detected in the images from the terminal device 1 with the identification number “001”, and its duration is “0005”.

The control unit 30 of the conference server device 3 performs motion detection on the images periodically transmitted from the terminal devices 1, 1, ..., either each time an image is received or once every several receptions. The control unit 30 performs detection for each inappropriate motion. When the previous determination result was “0: absent” and the current result is “1: present”, it stores the determination result “1” in the motion detection table 323 and stores the timer value “0001”. When the previous result was “1: present” and the current result is also “1: present”, it stores the determination result “1” and increments the timer value by 1 or by the elapsed number of seconds. When the previous result was “1: present” and the current result is “0: absent”, it stores the determination result “0” and clears the timer value to “0”.
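The three update cases above can be sketched as a small state update for one entry of the motion detection table 323. This is a minimal sketch; the names `DetectionEntry` and `update_entry` and the `step` parameter are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DetectionEntry:
    result: int = 0    # 0: absent, 1: present
    duration: int = 0  # timer value (count of consecutive detections, or seconds)


def update_entry(entry: DetectionEntry, detected: bool, step: int = 1) -> DetectionEntry:
    """Update one table entry from the current determination result."""
    if detected:
        if entry.result == 1:
            entry.duration += step  # present -> present: add 1 (or elapsed seconds)
        else:
            entry.result = 1        # absent -> present: start counting at 1
            entry.duration = 1
    else:
        entry.result = 0            # present -> absent (or still absent): clear
        entry.duration = 0
    return entry
```

Feeding the determinations present, present, present, absent, present through `update_entry` yields the timer values 1, 2, 3, 0, 1.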

In this way, the detection results for inappropriate motions based on the images from the terminal devices 1, 1, ... are updated successively in the motion detection table 323. Note that the motion detection table 323 is stored and updated only when an inappropriate motion is detected from a terminal device 1.

In the conference system configured as described above, images and audio are transmitted from the terminal devices 1, 1, ..., received by the conference server device 3, and sent to each of the terminal devices 1, 1, ... as shared images and audio. The procedure for detecting inappropriate motions that is carried out during this process is described below with reference to flowcharts.

FIG. 7 is a flowchart showing an example of the procedure for image and audio transmission/reception and inappropriate-motion detection in the conference server device 3 of Embodiment 1. The processing shown below is the processing for one terminal device 1. For the multiple terminal devices making up the conference, the following processing may be performed independently and in parallel, or each step shown below may be performed for all terminal devices.

When the control unit 30 of the conference server device 3 authenticates a conference participation request from the terminal device 1 and starts the conference, it first starts a timer for measuring the duration of inappropriate motions (step S1). The control unit 30 receives image and audio data from the terminal device 1 (step S2). If the images and audio are encoded, the control unit 30 passes them to the encoding/decoding processing unit 33 for decoding before performing the following processing.

The control unit 30 passes the image received in step S2 to the image recognition unit 36 for recognition processing (step S3). The control unit 30 compares the recognition result obtained from the image recognition unit 36 with the image recognition table 322 to determine whether any inappropriate motion is present (step S4).

When an inappropriate motion is present, that is, when the control unit 30 has detected an inappropriate motion (S4: YES), it determines whether at least one of the numbers of the motions just determined to be present matches the number of a motion whose previous determination result in the motion detection table 323 was “1: present” (step S5).

When the control unit 30 determines that at least one number matches (S5: YES), it updates the motion detection table 323 by incrementing the duration of each matching inappropriate motion by one, or by the elapsed time, and by newly setting the determination result to “1: present” for each inappropriate motion that does not match (step S6). The control unit 30 then determines whether the duration of the matching inappropriate motion has exceeded a predetermined time (step S7). Specifically, it determines whether the timer value for the matching inappropriate motion is greater than a predetermined number by one or more. When the control unit 30 determines that the predetermined time has elapsed (S7: YES), it performs control to restrict transmission of the image from the source terminal device 1 to the other terminal devices 1 (step S8). Time measurement by the timer continues during this control.

In detail, step S8 performs processing such as the following. As one method, the control unit 30 does not pass the image received from the terminal device 1 at which the inappropriate motion occurred to the image processing unit 34, and stops its transmission to the other terminal devices 1, 1, .... As another method, the control unit 30 superimposes a white-filled or black-filled image, or a mosaic image, on the position in the image where the inappropriate motion was detected, and then passes the image to the image processing unit 34 to be combined with the images from the other terminal devices 1. As another method, the control unit 30 may stop receiving images from the terminal device 1 at which the inappropriate motion occurred. As yet another method, the control unit 30 may also stop receiving audio from that terminal device 1. The control unit 30 may also drop frames when transmitting the images from that terminal device 1 to the other terminal devices 1, 1, ..., that is, reduce the rate (quality).

The detailed processing of the control unit 30 in step S8 may also vary according to the content of the detected inappropriate motion. For example, when any of the inappropriate motions “telephone”, “dozing”, “chat”, “fight”, “crying”, “loud laughter”, “leaving the seat”, “looking away”, “smoking”, “eating”, or “sticking out the tongue” is detected, the control unit 30 stops transmission of the “image” from the terminal device 1 at which the motion was detected to the other terminal devices 1, 1, ..., superimposes a white-filled image on the relevant part of the image, or reduces the image transmission rate. When an inappropriate motion involving the mouth, namely “smoking”, “eating”, or “sticking out the tongue”, is detected, the control unit 30 may superimpose a white-filled image (or another image) on the mouth area. When a motion accompanied by inappropriate sound or speech, such as “telephone”, “chat”, “fight”, “crying”, or “loud laughter”, is detected, the control unit 30 stops transmission of the “audio” from the terminal device 1 at which the motion was detected to the other terminal devices 1, 1, ....

The specific transmission control, combining stopping image transmission and stopping audio transmission, may be stored in advance in the storage unit 32 for each inappropriate motion. The content of the transmission control is set and stored in association with the number of each inappropriate motion in the image recognition table 322. For example, “image: stop / audio: stop” is stored in association with “1: telephone”, “image: stop / audio: continue transmission” with “2: dozing”, and “image: superimpose white fill / audio: stop” with “10: fight”. In addition, a setting is stored so that, when multiple persons appear in the image, a white-filled image is superimposed on the location where the inappropriate motion was detected. Storing the specific transmission control settings in the storage unit 32 in this way allows the processing to be adapted as needed simply by changing the stored settings.
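Such per-motion settings can be sketched as a lookup table. The three entries below come from the examples in the text; the default for unlisted motions and the rule for combining several simultaneously detected motions (most restrictive image action, audio sent only if every motion allows it) are assumptions added for illustration, as are all of the names.

```python
from enum import IntEnum
from typing import Dict, Iterable, NamedTuple


class ImageAction(IntEnum):
    SEND = 0     # transmit the image unchanged
    OVERLAY = 1  # superimpose a white-filled image on the detected region
    STOP = 2     # stop transmitting the image


class Policy(NamedTuple):
    image: ImageAction
    send_audio: bool


POLICIES: Dict[int, Policy] = {
    1: Policy(ImageAction.STOP, False),      # 1: telephone
    2: Policy(ImageAction.STOP, True),       # 2: dozing
    10: Policy(ImageAction.OVERLAY, False),  # 10: fight
}
DEFAULT_POLICY = Policy(ImageAction.STOP, True)  # assumption for unlisted motions


def combined_policy(motions: Iterable[int]) -> Policy:
    """Combine the policies of all detected motions (assumed rule: most restrictive)."""
    image, audio = ImageAction.SEND, True
    for no in motions:
        policy = POLICIES.get(no, DEFAULT_POLICY)
        image = max(image, policy.image)
        audio = audio and policy.send_audio
    return Policy(image, audio)
```

Because the behavior lives in `POLICIES`, changing how a given motion is handled only requires editing the stored settings.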

Next, the control unit 30 determines whether the conference in the conference system has ended (step S9). If it determines that the conference has not ended (S9: NO), it returns the processing to step S2 and continues; if it determines that the conference has ended (S9: YES), it ends the processing.

When no inappropriate motion is present in step S4, that is, when the control unit 30 has not detected an inappropriate motion (S4: NO), it initializes all determination results in the motion detection table 323 to “0: absent” (step S10) and advances the processing to step S9.

When the control unit 30 determines that none of the numbers of the motions just determined to be present matches the number of a motion whose previous determination result in the motion detection table 323 was “1: present” (S5: NO), it updates the motion detection table 323 (step S11) and ends the processing. At this time, for each inappropriate motion whose previous determination result was “1: present”, the duration is cleared to “0” and the determination result is stored as “0: absent”. For each inappropriate motion whose current determination result is “1: present”, “1: present” is stored together with the duration.

When the control unit 30 determines in step S7 that the predetermined time has not elapsed (S7: NO), it advances the processing to step S9.

In step S7 above, whether the predetermined time has elapsed is determined uniformly for all inappropriate motions, but the determination may instead use a different predetermined time for each kind of detected inappropriate motion. Such per-motion predetermined times may be set in association with the number of each inappropriate motion in the image recognition table 322 and stored in the storage unit 32.

FIG. 8 is an explanatory diagram showing an example of the transmission control performed when an inappropriate motion is detected in the conference system of Embodiment 1. FIG. 8 shows the image that can be seen on the display 14 of a terminal device 1 other than the terminal device 1 at which the inappropriate motion was detected.

In the example shown in FIG. 8, the upper part shows the image before the transmission restriction is applied. Three persons are recognized in the upper image. The control unit 30 passes the image to the image processing unit 34 to obtain information such as the positions of the face (outline), mouth, eyes, and hands and arms of the three persons, and compares this information with the recognition patterns in the image recognition table 322. In the upper image of FIG. 8, the face of the leftmost of the three persons can be judged to be turned in a direction different from the imaging direction of the camera 15, which runs from the front of the figure toward the back. The face should be facing “front”, within a predetermined inclination; because the face of the leftmost person is turned by the predetermined inclination or more, the control unit 30 detects “looking away”. Because the recognized hand and arm are at ear height and the recognized mouth is opening and closing, it also detects “telephone”. In this case, based on the position in the image where the inappropriate motions were detected, the control unit 30 superimposes a white-filled image on part of the image so that these motions are not shown. As a result, as shown in the lower part of FIG. 8, the image seen is one in which the person looking away is concealed. When an image showing multiple persons is transmitted from the terminal device 1, the image is processed in this way, for example by covering part of it with another image. This avoids hindering the participation of the other conference participants who appear in the same image.
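The partial concealment described for FIG. 8 can be sketched on a grayscale image held as a list of pixel rows. The function name, the `(x0, y0, x1, y1)` box convention, and the fill value 255 for white are assumptions for this example; an actual implementation would operate on decoded video frames.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale pixel values, 0 (black) to 255 (white)


def overlay_white(image: Image, box: Tuple[int, int, int, int], white: int = 255) -> Image:
    """Return a copy of the image with the box region filled with white.

    box is (x0, y0, x1, y1), with x1 and y1 exclusive; parts of the box that
    fall outside the image are ignored.
    """
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]  # leave the input image untouched
    for y in range(max(y0, 0), min(y1, len(out))):
        row = out[y]
        for x in range(max(x0, 0), min(x1, len(row))):
            row[x] = white
    return out
```

Only the region around the detected person is overwritten, so the other participants in the same frame remain visible.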

When the control unit 30 of the conference server device 3 stops transmission of images or audio from the terminal device 1 at which an inappropriate motion was detected in step S8, it may send that terminal device 1 a message stating that transmission is being stopped. The control unit 30 may also, before the predetermined time elapses, instruct the terminal device 1 at which the inappropriate motion was detected to display a warning or sound an alarm, thereby alerting the participant who is behaving inappropriately. This helps the conference proceed smoothly. The control unit 30 may also send the other terminal devices 1, 1, ... a message stating that transmission of images from the terminal device at which the inappropriate motion was detected is being stopped. This allows the participants using the other terminal devices 1 to understand the situation and continue the conference.

With the above configuration, when the behavior of a conference participant using a particular terminal device 1 is inappropriate (for example, telephoning, dozing, chatting, fighting, crying, laughing loudly, leaving the seat, looking away, smoking, eating, or sticking out the tongue), the image showing that behavior or the audio accompanying it is kept from reaching the other conference participants. An image showing an inappropriate motion is either prohibited from being transmitted to the other terminal devices 1, 1, ..., transmitted at a reduced rate so that frames are dropped, or transmitted with part of the picture processed. As a result, at the other terminal devices 1, 1, ..., the image showing the inappropriate motion need not be viewed, cannot be viewed, is difficult to see clearly, or is partly concealed. This suppresses discomfort and unease among the other conference participants and realizes a comfortable conference system. In addition, because transmission of images capturing inappropriate motions and of the collected audio data is restricted, unnecessary data is not transmitted or received, and the communication load on the network 2 can be reduced.

In Embodiment 1, inappropriate motions are detected by image recognition. That is, inappropriate motions accompanied by sound or speech are detected not by speech recognition but by image recognition based on, for example, the opening and closing of the mouth. Inappropriate motions accompanied by sound or speech include, for example, telephoning, chatting, fighting, crying, and laughing loudly. In these cases, it may be possible to detect the motion from the movement of the mouth, combined with the movement of the hands or arms, before an inappropriate remark is actually made. Image recognition therefore makes it possible to detect an inappropriate motion before the inappropriate sound or speech is actually uttered.

In Embodiment 1, the control unit 30 is configured to prevent transmission to the other terminal devices 1, 1, ... when an inappropriate motion is detected continuously for the predetermined time or longer. Compared with stopping transmission immediately after a single detection, which would cut the image off abruptly, this lets the conference proceed more smoothly. It also leaves time to warn the person who behaved inappropriately before the predetermined time elapses, which reduces the possibility of needlessly restricting participation and disrupting the conference. However, the present invention is not limited to this; the determination of whether the duration is equal to or longer than the predetermined time (S7) may be omitted so that transmission is restricted immediately. Which configuration is used can be changed as needed if the conference administrator is allowed to set whether the predetermined time is 0.

(Embodiment 2)
In Embodiment 2, transmission of images or audio that would make other conference participants uncomfortable is avoided by control that changes the imaging direction of the camera or the sound collection direction of the microphone at the terminal device where an inappropriate motion was detected.

The conference system of Embodiment 2 comprises the terminal devices 5, 5, ... used by the respective conference participants, the network 2 to which the terminal devices 5, 5, ... are connected, and the conference server device 3, which realizes the transmission, reception, and sharing of images (video) and audio among the terminal devices 5, 5, .... In other words, Embodiment 2 differs from Embodiment 1 in that it includes the terminal devices 5; the rest of the configuration is the same as in Embodiment 1. In the following description, devices and internal components shared with Embodiment 1 are therefore given the same reference numerals, and their detailed description is omitted.

Like the terminal device 1 of Embodiment 1, the terminal device 5 is a dedicated conference system terminal equipped with a display with a built-in tablet, and its external appearance is also the same.

FIG. 9 is a block diagram showing the internal configuration of the terminal device 5 constituting the conference system of Embodiment 2.

The terminal device 5 includes a control unit 500, a temporary storage unit 501, a storage unit 502, an input processing unit 503, a display processing unit 504, a video processing unit 505, an input audio processing unit 506, an output audio processing unit 507, a communication processing unit 508, and an encoding/decoding processing unit 509. The terminal device 5 further includes, either built in or externally connected, a tablet 53, a display 54, a camera 55, a microphone 56, a speaker 57, and a wireless communication unit 58, as well as a drive unit 59.

Apart from the drive unit 59, the components of the terminal device 5 are the same as the corresponding components in Embodiment 1, and their detailed description is therefore omitted.

The drive unit 59 can change the imaging direction of the camera 55 in response to a control signal that the control unit 500 issues based on an instruction from the conference server device 3. In Embodiment 2, the camera 55 is supported so that it can be moved inside the housing of the terminal device 5. The drive unit 59 is arranged in contact with the support of the camera 55 inside the housing, includes a mechanism such as a stepping motor, and changes the imaging direction by changing the orientation of the camera support in accordance with the control signal from the control unit 500.

The control unit 500 reads and executes the conference terminal program stored in the storage unit 502, and at the start of the conference the camera is directed at the conference participant using the terminal device 5. When the control unit 500 receives an instruction from the conference server device 3 to change the imaging direction, it changes the imaging direction of the camera 55 so that the conference participant is no longer in view.

FIG. 10 is a flowchart showing an example of the procedure for image and audio transmission/reception and inappropriate-motion detection in the conference server device of Embodiment 2. Among the steps shown below, those shared with the procedure of Embodiment 1 shown in FIG. 7 are given the same step numbers, and their detailed description is omitted.

When the control unit 30 determines that the predetermined time has elapsed (S7: YES), it transmits an instruction to change the imaging direction of the camera 55 to the terminal device 5 that sent the image (step S20) and advances the processing to step S9. That is, step S20 is performed in place of step S8 of Embodiment 1.

In response, the control unit 500 of the terminal device 5 sends a control signal to the drive unit 59 to adjust the imaging direction of the camera 55 so that video unsuitable for the meeting, which would result from imaging the participant using the terminal device 5, is not shown. It also performs processing to change the sound collection direction.

With this configuration, the terminal device 5 at which a participant's inappropriate motion has been detected can keep images and audio that might make other conference participants uncomfortable from reaching the conference server device 3, and a comfortable conference system can be realized.

In Embodiment 2, the control unit 500 of the terminal device 5 is configured to change the imaging direction of the camera 55 when an inappropriate state of the conference participant is detected. However, the present invention is not limited to this. A highly directional microphone 56 may be used, and the input audio processing unit 506 may either change the sound collection direction of the microphone 56 or remove sound arriving from a specific direction (right, left, and so on) after the fact. This prevents audio that would make other conference participants uncomfortable from being transmitted to the other terminal devices via the conference server device 3.

In addition, the control unit 30 of the conference server device 3 is configured to transmit an instruction to change the imaging direction and the sound collection direction. However, the present invention is not limited to this, and the control unit 30 may instead transmit, to the terminal device 5 in which an inappropriate motion was detected, an instruction to stop imaging and sound collection and to prohibit transmission of image and audio data. This not only suppresses discomfort and unease but also ensures that image and audio data that would not be forwarded to the other terminal devices 5, 5, ... are never sent onto the network 2 in the first place. It thereby also limits the increase in communication load caused by images or sound, as well as the processing load on the conference server device 3, which relays images and sound in the conference system. Furthermore, beyond controlling transmission from the terminal device 1, that is, stopping reception from that terminal device 1 at the conference server device 3, the conference server device 3 may reduce the amount of image data transmitted from the terminal device 5 in which an inappropriate motion was detected. Specifically, the control unit 30 instructs the terminal device 5 to transmit only some of the images acquired by the video processing unit 505. In response, the control unit 500 transmits the image data with frames dropped, that is, at a reduced rate. Alternatively, the control unit 30 may have the terminal device 1 lower, through the video processing unit 505, the imaging rate of the camera 55, or lower the sampling rate in the input audio processing unit 506 to reduce the amount of audio data. Because transmission to the conference server device 3 of images and collected sound capturing the inappropriate motion is thus restricted, unnecessary data is not exchanged, and the increase in communication load on the network 2 is reduced.
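The two data-reduction options described above (dropping frames, and lowering the audio sampling rate) can be illustrated with a minimal Python sketch. The function names `drop_frames` and `reduce_sampling_rate`, the parameters `keep_every` and `factor`, and the use of block averaging for the audio are assumptions made for illustration; the source does not specify how the control unit 500 or the input audio processing unit 506 implements the reduction.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def drop_frames(frames: Sequence[T], keep_every: int) -> List[T]:
    """Keep one frame out of every `keep_every`, starting with the first.

    Hypothetical stand-in for transmitting only some of the acquired images.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    return list(frames[::keep_every])


def reduce_sampling_rate(samples: Sequence[float], factor: int) -> List[float]:
    """Average each complete block of `factor` samples into one sample.

    Block averaging acts as a crude low-pass filter before decimation.
    A trailing partial block is discarded.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]
```

For example, `drop_frames(list(range(10)), 3)` keeps frames 0, 3, 6 and 9, so roughly one third of the image data is sent. A real implementation would more likely lower the capture or encoder rate directly, as the paragraph above also suggests.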

In the first and second embodiments, the conference server device 3 includes the image recognition unit 36, which performs image recognition on images received from the terminal devices 1 (5), 1 (5), .... This allows the inappropriate motion patterns registered in advance in the image recognition table 322 to be managed in one place. However, the entity that performs the image recognition processing and detects inappropriate motions is not limited to the conference server device 3. To reduce the load on the control unit 30 of the conference server device 3, these may be performed by another device or by each of the terminal devices 1 (5), 1 (5), .... When they are performed by the terminal devices, each device must run image recognition processing, so the processing load on each terminal device 1 (5), 1 (5), ... increases. On the other hand, a terminal device 1 (5) in which an inappropriate motion is detected can stop transmitting images and sound to the conference server device 3, that is, the conference server device 3 can be kept from receiving images and sound from that terminal device 1. This has effects such as reducing the communication load on the network 2.
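The terminal-side variation described above can be sketched as follows: each recognition result is looked up in a table of pre-registered inappropriate patterns, and matching frames are withheld before they reach the network. This is a minimal sketch under stated assumptions. The table entries `"pattern_a"` and `"pattern_b"`, the string-label form of a recognition result, and the function names are hypothetical; the source does not define the data format of the image recognition table 322.

```python
from typing import FrozenSet, Iterable, Iterator, Tuple

# Hypothetical table contents; the actual registered patterns are not given here.
IMAGE_RECOGNITION_TABLE: FrozenSet[str] = frozenset({"pattern_a", "pattern_b"})


def is_inappropriate(
    result: str, table: FrozenSet[str] = IMAGE_RECOGNITION_TABLE
) -> bool:
    """Return True when the recognition result matches a registered pattern."""
    return result in table


def frames_to_transmit(
    recognized: Iterable[Tuple[bytes, str]],
    table: FrozenSet[str] = IMAGE_RECOGNITION_TABLE,
) -> Iterator[bytes]:
    """Yield only frames whose recognition result is not in the table.

    `recognized` pairs each frame with the label produced by an image
    recognition step that is assumed to run on the terminal device.
    """
    for frame, result in recognized:
        if not is_inappropriate(result, table):
            yield frame
```

Because filtering happens before transmission, withheld frames never consume network bandwidth, which is the trade-off against the added per-terminal recognition load noted above.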

The disclosed embodiments should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the foregoing description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

1 Terminal device (first information processing apparatus)
100 Control unit
105 Video processing unit (acquiring means)
106 Audio processing unit (acquiring means)
108 Communication processing unit (transmission means, transmission/reception means)
15 Camera (imaging device)
16 Microphone (sound collecting device)
2 Network
3 Conference server device (second information processing apparatus, information processing apparatus)
30 Control unit (recognition means, detection means, transmission control means)
31 Temporary storage unit
32 Storage unit
322 Image recognition table
36 Image recognition unit (recognition means)
37 Communication processing unit

Claims

A conference system comprising: a plurality of first information processing apparatuses each including an imaging device or a sound collecting device, means for acquiring an image from the imaging device or sound from the sound collecting device, and transmission/reception means for transmitting and receiving the acquired image or sound; and a second information processing apparatus connected to the first information processing apparatuses via a communication medium and relaying the images or sound transmitted and received by each first information processing apparatus, the conference system realizing a conference by displaying or outputting common images or sound on the plurality of first information processing apparatuses so that information is shared, the conference system further comprising:
recognition means for recognizing the motion of a person appearing in an image acquired by a first information processing apparatus;
detection means for detecting the presence or absence of an inappropriate motion based on the recognition result of the recognition means; and
transmission control means for controlling, according to the detection result of the detection means, whether to receive an image or sound from the first information processing apparatus or to transmit it to the other first information processing apparatuses, an increase or decrease in the transmission rate, or processing of a part of the image.

The conference system according to claim 1, wherein, when the detection means detects an inappropriate motion, the transmission control means prohibits reception of images or sound from the first information processing apparatus in which the inappropriate motion was detected or transmission thereof to the other first information processing apparatuses, reduces the transmission rate, or processes a part of the image.

The conference system according to claim 1, further comprising:
means for measuring the duration of the motion after the detection means detects an inappropriate motion; and
determination means for determining whether the duration is equal to or longer than a predetermined time,
wherein, when the determination means determines that the duration is equal to or longer than the predetermined time, the transmission control means prohibits reception of images or sound from the first information processing apparatus in which the inappropriate motion was detected or transmission thereof to the other first information processing apparatuses, reduces the transmission rate, or processes a part of the image.

The conference system according to claim 1, further comprising a table listing image recognition results registered in advance as inappropriate motions,
wherein the detection means detects the presence or absence of an inappropriate motion based on whether the recognition result of the recognition means matches a recognition result contained in the table.

The conference system according to claim 1, wherein, when the detection means detects an inappropriate motion accompanied by sound or an utterance, reception of images or sound from the first information processing apparatus in which the inappropriate motion was detected, or transmission thereof to the other first information processing apparatuses, is prohibited.

The conference system according to claim 1, wherein the transmission control means superimposes another image on the part, corresponding to the inappropriate motion, of the image from the first information processing apparatus in which the inappropriate motion was detected.

The conference system according to claim 1, wherein the recognition means and the detection means are provided in the second information processing apparatus.

The conference system according to claim 1, further comprising means for instructing, when the detection means detects an inappropriate motion, a change in the imaging direction of the imaging device or in the sound collection direction of the sound collecting device of the first information processing apparatus.

An information processing apparatus comprising means for connecting to other devices via a communication medium and transmitting and receiving images or sound to and from each device, the information processing apparatus further comprising:
means for recognizing the motion of a person appearing in a received image;
means for detecting the presence or absence of an inappropriate motion based on the recognition result; and
means for controlling, according to the detection result, whether to receive an image or sound from the other device or to transmit the image or sound to another device, an increase or decrease in the transmission rate, or processing of a part of the image.

An information processing method for controlling transmission and reception of images or sound in a system including: a plurality of first information processing apparatuses each including an imaging device or a sound collecting device, means for acquiring an image from the imaging device or sound from the sound collecting device, and transmission/reception means for transmitting and receiving the acquired image or sound; and a second information processing apparatus connected to the first information processing apparatuses via a communication medium and relaying the images or sound transmitted and received by each first information processing apparatus, the method comprising:
recognizing the motion of a person appearing in an image acquired by a first information processing apparatus;
detecting the presence or absence of an inappropriate motion based on the recognition result; and
controlling, according to the detection result, whether to receive an image or sound from the first information processing apparatus or to transmit the image or sound to the first information processing apparatuses, an increase or decrease in the transmission rate, or processing of a part of the image.