JP5043940B2

JP5043940B2 - Video surveillance system and method combining video and audio recognition

Info

Publication number: JP5043940B2
Application number: JP2009522745A
Authority: JP
Inventors: キンツレー、マーティン、ジー; シェイニン、ヴァディム
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2006-08-03
Filing date: 2006-08-03
Publication date: 2012-10-10
Anticipated expiration: 2026-08-03
Also published as: CN101501564B; CA2656268A1; CN101501564A; WO2008016360A1; BRPI0621897A2; JP2009545911A; BRPI0621897B1; MX2009001254A

Description

The present invention relates generally to surveillance systems and methods for ensuring security, and more specifically to a novel online (real-time) video and audio recognition system and process for surveillance systems.

Conventional video surveillance systems typically include neither functions nor equipment for monitoring audio; that is, they have no audio input at all. Typical video surveillance systems, such as those described in US Pat. Nos. 6,724,421 and 6,175,382, at best record visual and audio information simultaneously. In both types of video surveillance system described in these references, video data is analyzed by a smart surveillance engine and compressed for digital storage. These engines implement various recognition algorithms, such as face recognition, motion detection, panic detection, and detection of stabbing motions. For example, when the entrance of a high-rise building is being monitored, an alarm may be raised when one person suddenly makes a quick movement toward another, suggesting a possible robbery, assault, or similar act. In this case, the smart surveillance engine recognizes the sudden quick movement (with a success rate of less than 100%) and generates an alarm at the monitoring station. As a result of the alarm, police can be dispatched to the monitored location. Of course, the sudden quick movement may equally have been caused by a child running toward his or her parent or friend. In that case the alarm is a false alarm, yet it still triggers a costly police dispatch. Another consequence of misdetection by the smart surveillance engine is that no alarm is raised in a real emergency. This can happen, for example, when two or more people are present at the scene. Yet another drawback of current surveillance systems is that police are not dispatched when a true emergency is occurring.

FIG. 1 shows a conventional video-only surveillance system. A camera array 10 supplies video information via a video link 11 to a video compression engine 12. The video information is compressed and sent via link 16 to a storage device 14 for long-term storage. The video information is also supplied via the same video link 11 to a video recognition engine 13. The video recognition engine 13 performs face recognition, motion detection, and other video recognition tasks and generates events and alarms, which are sent via link 17 to an event database 15 and a monitoring station 18. The monitoring station 18 includes a manned monitoring station at which an operator can visually monitor a certain number of cameras in real time. When the operator judges that an emergency has occurred, he or she decides whether to dispatch police or another emergency response team to the monitored area. From the above description, it is clear that although audio information is very often available in the monitored area, that information is wasted.

FIG. 2 shows a prior-art video surveillance system with audio recording. A camera array 20 supplies video information via a video link 21 to a video and audio compression engine 22. At the same time, audio information is supplied from a microphone array 29 via an audio link 30 to the video and audio compression engine 22. The video and audio information is compressed and sent via link 26 to a storage device 24 for long-term storage. The video information is likewise supplied via the same video link 21 to a video recognition engine 23. The video recognition engine 23 performs face recognition, motion detection, and other video recognition tasks and generates events and alarms, which are sent via link 27 to a database 25 and a monitoring station 28. The monitoring station 28 is a manned monitoring station at which an operator visually monitors a certain number of cameras. When the operator judges that an emergency has occurred, he or she decides whether to dispatch police or another emergency response team to the monitored area. From the above description, it is clear that although the audio signal acquired from the monitored area very often contains useful information, that information is not extracted.

As noted above, the second type of surveillance system records video and audio information simultaneously and incorporates a smart surveillance engine for various video recognition tasks. In these systems today, audio information is compressed and recorded but never analyzed.
US Pat. No. 6,724,421
US Pat. No. 6,175,382

Today's surveillance systems make no use of valuable audio information when analyzing video input. Yet this audio information is readily obtainable and can be put to very wide use in many surveillance situations.

It would therefore be highly desirable to incorporate the use of audio information into video surveillance systems. Using this audio information is expected to reduce the number of false alarms generated by the surveillance system, increase the proportion of true alarms detected, and at the same time give more information to the person evaluating the alarm. Furthermore, events that would go undetected using video information alone can be detected using audio and video information together.

Accordingly, it is an object of the present invention to provide a video surveillance system and method that uses video information in combination with audio information acquired from the monitored area.

The surveillance system of the present invention includes both a video signal input and an audio signal input. The video input is supplied by digital or analog cameras, and the audio input is received from microphones installed in the monitored area. The video and audio information is compressed and sent to a digital storage device. Compressing the audio and video information is preferred in order to reduce the amount of digital storage required for all of the cameras and microphones deployed. Concurrently with recording, the video and audio inputs are fed to a high-performance recognition engine. This engine performs video recognition and audio recognition and immediately correlates the results of the two to detect or recognize a specific set of events indicating a panic situation, such as high-pitched screams, explosions, or gunshots. Alarms generated by the high-performance recognition engine can be sent to a monitoring station, where a human operator decides whether to dispatch police or emergency response personnel to the monitored area.

According to one aspect of the present invention, the high-performance recognition engine executes available video recognition algorithms, such as face recognition and motion detection, together with audio/speech recognition algorithms for recognizing specific spoken words (“help”, “robbery”, and the like). The audio recognition engine can be trained to recognize special audio signals such as gunshots and explosions, as well as high-pitched voices and other vocal features that indicate an alarm or emergency.

A microphone array arranged in a particular orientation can be used to determine the direction of a sound. The directional audio information can then be sent to a camera control unit to point the camera or cameras in the direction of interest, so that video/audio recognition can be performed more efficiently. For example, an explosion in the monitored area can be detected by the audio recognition engine using the microphone array. The camera is then turned toward the explosion, and the video recognition engine carries out the subsequent operations, from alerting the monitoring station to scene recognition and understanding. Using the results of video and audio recognition immediately, both to evaluate the recorded audio and video and to improve the recording of new video and audio input, is advantageous: it improves detection accuracy, shortens the time needed to determine the nature of an alarm, and gives more information to the human operator assessing the situation.

The outputs of the video recognition engine and the audio recognition engine are analyzed by a joint recognition engine, which generates the final alarm and forwards it to the monitoring station.

In accordance with a preferred aspect of the present invention, a surveillance system and method and a computer program are provided to achieve these and other objects. The system includes:
means for generating a real-time video signal containing video information acquired over the monitored area;
means for acquiring a real-time audio signal containing audio information from the monitored area;
means for simultaneously receiving the video signal and the audio signal, determining associated video and audio recognition information from them, and correlating the real-time audio and video information with each other to determine the likelihood that a specific event has occurred; and
means for generating an alarm condition based on the occurrence of the specific event.

Further features, aspects, and advantages of the structures and methods of the present invention will be more fully understood with reference to the following description, the appended claims, and the accompanying drawings.

FIG. 3 shows a video surveillance system using video and audio recognition according to the present invention. As shown in FIG. 3, a camera array 40 includes one or more color or monochrome still or video electronic cameras, such as CCD or CMOS cameras, or an equivalent combination of components. It captures the monitored area and supplies a video signal via a video communication link 41 to a digital video and audio compression engine 42. The movement and operation of each camera device in the camera array 40 can be controlled by received control signals, for example under the control of a computer, software, or both. In addition, the operating parameters of each camera in the camera array 40, including control of the pan/tilt mirror, lens system, focus motor, pan motor, and tilt motor, are controlled by the received control signals, as described in more detail later in this specification. Before the digital video signal is output, a number of signal processing techniques can be applied, for example to reduce noise or to perform filtering or image enhancement.

At the same time, a microphone array 49, comprising microphone sensor devices capable of converting acoustic pressure into electrical signals (omnidirectional microphones, highly directional microphones, or both), is provided to supply an audio signal via an audio communication link 50 to the digital video and audio compression engine 42. As is known to those skilled in the art, the directivity of a microphone array varies with the frequency of the sound, so the number of microphones and the distance between them can be chosen with regard to the frequency range over which a given directivity is required. The microphones incorporated into the array can be controlled, for example under software control, to achieve these objectives, and can include transducers configured with pickup patterns that can be distinctly biased toward reception of various frequencies, for example in the range of human voices, explosions, or gunshots. In this way, the microphone array is ensured to be sensitive enough to respond with high accuracy to the sound field of an acoustic event. Further audio signal conditioning techniques can be applied, for example digitizing the acquired analog audio signal with an A/D converter and performing gain control or noise reduction/removal. The digitized video and audio information is digitally compressed and sent via link 46 to a memory storage device 44 for long-term storage. Device 44 is, for example, a magnetic or optical medium, including but not limited to a database, hard disk drive, CD-ROM, DVD, tape, platter, or disk array. The output of each camera in the camera array 40 is stored on the storage medium in a compressed format such as MPEG-1 or MPEG-2. Furthermore, the output of each camera in the array can be stored at a specific location on the storage medium associated with that camera, or stored together with an indication of which camera each stored output corresponds to.
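The dependence of microphone spacing on frequency range can be illustrated with a minimal sketch. It assumes a uniformly spaced linear array and applies the commonly used half-wavelength rule (spacing no larger than half the shortest wavelength of interest, to avoid spatial aliasing). The rule, the function name, and the numeric values are illustrative assumptions and are not taken from the specification.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air (about 20 degrees C)


def max_spacing_m(f_max_hz, c=SPEED_OF_SOUND_M_S):
    """Largest microphone spacing (metres) that satisfies the half-wavelength
    rule up to f_max_hz. Hypothetical helper, not part of the patent."""
    if f_max_hz <= 0:
        raise ValueError("f_max_hz must be positive")
    return c / (2.0 * f_max_hz)


# Illustrative use: an array meant to resolve direction for sounds up to 4 kHz.
spacing = max_spacing_m(4000.0)
print(f"maximum spacing: {spacing * 100:.2f} cm")
```

A higher upper frequency calls for closer spacing, which is one way to read the statement that the number of microphones and the distance between them depend on the frequency range of interest.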

As further shown in FIG. 3, the same video and audio information is also supplied simultaneously to the high-performance recognition engine 43 via the respective video link 41 and audio link 50. It will be understood that the communication links 41 and 50 between the camera array, the audio microphone array, the video and audio compression engine 42, and the high-performance recognition engine 43 can be hard-wired or can use wireless links. It is also within the scope of the present invention for these communication links to take the form of cable, satellite, RF and microwave transmission, optical fiber, and the like.

As described in more detail later herein and as shown in FIG. 4, the high-performance recognition engine 43 includes a video recognition engine 62, an audio recognition engine 63, and a joint recognition engine and alarm generation module 64. The high-performance recognition engine 43 incorporates software that executes methods and processes for controlling a computer device to implement video recognition and face recognition algorithms. These can be run in combination with motion detection algorithms (for example, well-known patch correlation or tracking algorithms that track individual points) to estimate the motion of features in the image stream. The high-performance recognition engine 43 further incorporates software that executes methods and processes for controlling a computer device to implement audio recognition and speech recognition algorithms. Speech recognition algorithms, implemented as computer-readable instructions, data structures, program modules, and the like, can be used to recognize specific spoken words considered to indicate an emergency or a situation that should raise an alarm (for example, “help” or “robbery”).
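As a rough illustration of the word-recognition step, the following sketch checks a text transcript for alarm words. It assumes that a separate speech recognizer has already converted the audio to text; the function name and the word list (beyond the two example words given in the description) are hypothetical.

```python
from typing import Iterable, List

# "help" and "robbery" are the examples given in the description;
# a deployed word list would be chosen by the operator (assumption).
ALERT_WORDS = frozenset({"help", "robbery"})


def find_alert_words(transcript: str,
                     alert_words: Iterable[str] = ALERT_WORDS) -> List[str]:
    """Return the alert words found in a transcript, in spoken order."""
    wanted = set(alert_words)
    # Strip surrounding punctuation and normalise case before matching.
    tokens = (token.strip(".,!?;:\"'").lower() for token in transcript.split())
    return [token for token in tokens if token in wanted]


hits = find_alert_words("Help! This is a robbery.")
```

Here `hits` contains the two alert words in the order spoken. A non-empty result could be passed to the joint recognition stage as one audio recognition event.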

The audio recognition engine 63, which comprises computer-readable instructions, data structures, program modules, or other data, can be trained to recognize special audio signals such as gunshots, explosions, high-pitched voices such as shouts and screams, and other vocal features associated with known events considered to warrant an alarm. It will be understood, however, that various recognition algorithms that do not require conventional training can also be used in accordance with the present invention.

The computing device or devices used include general-purpose computer devices such as PCs, laptops, mobile devices, and similar devices, having components including, but not limited to, a processing unit, a system memory, and a system bus that couples various system components, including the system memory, to the processing unit. The computer device implements these components in order to run the high-performance recognition engine and the audio recognition engine. These engines are stored on well-known computer-readable media, which include any available media accessible by the computer device, including removable, non-removable, volatile, and non-volatile media. The computer-readable storage can be centralized in one location or distributed, for example across computer systems connected via a network. The computer-readable recognition algorithms can be stored on computer-readable recording media and executed in a distributed fashion.

Returning to FIG. 3, the microphone array 49 can be used in a particular orientation to determine the direction of a sound. Directional information about a detected audio event is sent via a wired or wireless communication link 53 to a camera/microphone control module 52. The camera/microphone control module 52 includes all the software needed to perform motor position control so that control signals 54 point the camera or camera array 40 in the direction of interest and control the position of the microphone array 49. For example, the control signals can be input to the camera array 40 to adjust or control the camera pan/tilt mirror, lens system or systems, focus motor, pan motor, and tilt motor components and subsystems. These control signals are also used to steer the camera's field of view automatically so as to obtain a centered, zoomed, focused, or higher-resolution image that carries more information about the actual alarm or event. In one non-limiting example, in response to audio recognition of a gunshot by the high-performance recognition engine, a control signal is generated that points one or more cameras of the camera array at the scene and “locks” them onto the direction of the gunshot. “Crime event” recognition is more useful when audio recognition of the gunshot has pointed the video camera array at the location of the crime, because more information about the gunshot becomes available. Alternatively, or in addition, these control signals can be generated and used to adjust the orientation of the microphones and the distance between them automatically so as to receive the accompanying audio information better. The orientation of the microphones can also be adjusted with regard to detecting audio signals in the required frequency range, or to give any required directivity. Thus, for example, in response to a video recognition event, one or more microphones can be turned to “listen” in a particular direction.
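The specification does not state how the direction of a sound is computed. One common approach, shown here only as a sketch under stated assumptions, estimates the time difference of arrival between two microphones by cross-correlation and converts it to a far-field bearing that could serve as a pan angle. The function names, the two-microphone geometry, the sample rate, and the spacing are all hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air (about 20 degrees C)


def estimate_delay_samples(mic_a, mic_b, max_lag):
    """Return the lag (in samples) by which mic_b trails mic_a.

    Brute-force cross-correlation: try every lag in [-max_lag, max_lag]
    and keep the one with the largest correlation score.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(mic_a):
            j = i + lag
            if 0 <= j < len(mic_b):
                score += sample * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_deg(delay_s, spacing_m, c=SPEED_OF_SOUND_M_S):
    """Far-field bearing in degrees measured from the array broadside.

    Positive values mean the sound reached mic_a first, so the source
    lies on mic_a's side of the array.
    """
    ratio = max(-1.0, min(1.0, c * delay_s / spacing_m))  # clamp against noise
    return math.degrees(math.asin(ratio))


# Hypothetical example: an impulse reaches mic_b two samples after mic_a.
sample_rate_hz = 16000
mic_a = [0, 0, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 1, 0, 0, 0]
lag = estimate_delay_samples(mic_a, mic_b, max_lag=3)
pan_deg = bearing_deg(lag / sample_rate_hz, spacing_m=0.08575)
```

With these illustrative numbers the estimated lag is two samples and the bearing is about 30 degrees toward mic_a. In a system of the kind described, a value like `pan_deg` would be handed to the camera/microphone control module to drive the pan motor; that interface is not specified in the text.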

More specifically, as shown in FIG. 4, the outputs of the video recognition engine 62 and the audio recognition engine 63 are analyzed by the joint recognition engine 64, which processes the simultaneously received video and audio recognition information and ultimately determines whether an alarm condition exists. In this way, an alarm can be generated and forwarded via a communication link 47 to a manned monitoring station 48. The recognition processes used in the joint recognition engine 64, implemented as computer-readable instructions, data structures, program modules, and the like, are generally based on pattern matching, hypothesis evaluation, or both. During the evaluation stage, estimates of the probabilities of various events are determined. This is done by determining, from the real-time video recognition information and the audio signal, how strongly each recognized video scene correlates with the recognized speech or audio features that accompany it. In one example of a recognition event, when a stabbing motion is to be recognized, the video information is used to evaluate the probabilities of various video scenes. If such scenes are known to be accompanied by a high-pitched voice (such as a scream), detecting a high-pitched voice in the audio input raises the probability that it results from the stabbing motion captured in the video signal. The operator visually monitors the specific area covered by the camera array 40 and, when the alarm generation unit gives an alarm indication, decides whether to dispatch police or emergency response personnel to the monitored area. From the above description it will be clear that useful information is extracted from the audio input and combined with video recognition events to improve the overall operation of the surveillance system.
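The description says only that the joint recognition engine relies on pattern matching or hypothesis evaluation and estimates event probabilities. One simple way to picture the stabbing-motion example is a Bayes update in which the video engine supplies a prior probability and an audio observation (a scream detected or not) adjusts it. The sketch below assumes the audio observation is conditionally independent of the video evidence given the event; the function name, the likelihood values, and the alarm threshold are hypothetical.

```python
def fuse_audio_evidence(p_video, p_audio_given_event, p_audio_given_no_event):
    """Update a video-based event probability with one audio observation.

    p_video: probability of the event from video recognition alone.
    p_audio_given_event / p_audio_given_no_event: probability of the audio
    observation if the event is, or is not, taking place.
    """
    numerator = p_video * p_audio_given_event
    denominator = numerator + (1.0 - p_video) * p_audio_given_no_event
    if denominator == 0.0:
        return p_video  # the observation carries no usable evidence
    return numerator / denominator


ALARM_THRESHOLD = 0.7  # hypothetical operating point

p_video = 0.4  # video alone: a sudden quick movement, ambiguous
# A scream is detected: assumed likely during an attack, unlikely otherwise.
p_with_scream = fuse_audio_evidence(p_video, 0.8, 0.1)
# No scream is detected: the complementary likelihoods apply.
p_without_scream = fuse_audio_evidence(p_video, 0.2, 0.9)

raise_alarm = p_with_scream >= ALARM_THRESHOLD
```

With these illustrative numbers, the video evidence alone stays below the threshold, a detected scream lifts the estimate above it, and the absence of a scream lowers it. This mirrors the two effects the description attributes to audio: fewer false alarms and more true alarms detected.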

As further shown in FIG. 4, the communication link 60 between the video recognition engine 62 and the joint recognition engine 64 is bidirectional, as is the communication link 61 between the audio recognition engine 63 and the joint recognition engine 64. The bidirectional nature of links 60 and 61 allows the video and audio recognition algorithms to interact as described above, which in turn raises the recognition level of both video and audio and makes it possible to detect specific events that previously could not be detected.

While the present invention has been particularly shown and described with respect to illustrative embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and detail may be made without departing from the spirit and scope of the invention, which is limited only by the scope of the claims.

FIG. 1 shows a video-only surveillance system according to the prior art.
FIG. 2 shows a video surveillance system with an audio recording function according to the prior art.
FIG. 3 shows a video surveillance system using video and audio recognition according to the present invention.
FIG. 4 shows details of the high-performance recognition engine according to the present invention.

Claims

A surveillance system using video and audio recognition,
One or more video camera devices for generating a real-time video signal containing video information acquired on the monitored area;
One or more microphone devices for obtaining real-time audio signals including audio information from the monitored area;
Processing means for simultaneously receiving the real-time video signal and the real-time audio signal, determining associated video recognition information and audio recognition information from them, and correlating the video recognition information and the audio recognition information with each other to determine the likelihood of occurrence of a specific event;
Means for generating an alarm condition based on the occurrence of the specific event,
wherein the processing means includes joint recognition means for correlating the audio recognition information and the video recognition information and detecting the occurrence of a specific event, the system further comprising means for generating, in response to the joint recognition means recognizing the occurrence of the event based on the video recognition information of a potential event, a control signal for directing one or more microphones of the microphone devices toward the specific event so as to enable capture of audio recognition information, and wherein each of the microphone devices, in response to the control signal, automatically adjusts the orientation of the microphone in consideration of detecting audio signals in the required frequency range.

The system of claim 1, wherein the processing means includes a first recognition engine for processing the video signal for determining the video recognition information.

The system of claim 1, wherein the processing means includes a second recognition engine for processing the audio signal for determining the audio recognition information.

The system according to any one of claims 1 to 3, wherein each of the video camera devices comprises one or more of a mirror, a lens system, a focus motor, a pan motor, and a tilt motor component responsive to the control signal to adjust one or more of the pan, tilt, zoom, rotation, dolly, and translation control parameters of the video camera device.

The surveillance system of claim 1, further comprising means for generating, in response to the joint recognition means recognizing the occurrence of the event based on the audio recognition information of a potential event, a control signal for directing one or more of the video camera devices in the direction of the specific event so as to enable capture of a video signal.

A surveillance system using video and audio recognition, comprising:
one or more video camera devices for generating a real-time video signal containing video information acquired from a monitored area;
one or more microphone devices for obtaining a real-time audio signal including audio information from the monitored area;
processing means for simultaneously receiving the real-time video signal and the real-time audio signal, determining associated video recognition information and audio recognition information from them, and correlating the video recognition information and the audio recognition information with each other to determine the likelihood of occurrence of a specific event; and
means for generating an alarm condition based on the occurrence of the specific event,
wherein the processing means includes joint recognition means for correlating the audio recognition information and the video recognition information and detecting the occurrence of the specific event, and further includes means for generating, in response to the joint recognition means recognizing the occurrence of the event based on the video recognition information of a potential event, a control signal for directing one or more microphones of the microphone devices in the direction of the specific event so as to enable capture of audio recognition information, and
wherein each of the microphone devices, in response to the control signal, automatically adjusts the orientation of the microphone in consideration of reception of the audio signal at any given directivity.

A monitoring method using video and audio recognition, comprising:
simultaneously receiving, in a processing means, a real-time video signal including video information acquired from a monitored area and a real-time audio signal including audio information from the monitored area;
determining associated video recognition information and audio recognition information from the received real-time video signal and real-time audio signal;
correlating the real-time audio recognition information and video recognition information with each other to determine the likelihood of occurrence of a specific event; and
generating an alarm condition based on the occurrence of the specific event,
wherein the processing means includes joint recognition means for correlating the audio recognition information and the video recognition information and detecting the occurrence of the specific event, and, in response to the joint recognition means recognizing the occurrence of the event based on the video recognition information of a potential event, generates a control signal for directing one or more microphones of the microphone devices in the direction of the specific event so as to enable capture of audio recognition information, and
wherein each of the microphone devices, in response to the control signal, automatically adjusts the orientation of the microphone in consideration of detection of an audio signal in the required frequency range.

The method of claim 7, wherein each of the video camera devices comprises one or more of a mirror, a lens system, a focus motor, a pan motor, and a tilt motor component responsive to the control signal to adjust one or more of the pan, tilt, zoom, rotation, dolly, and translation control parameters of the video camera device.

The method of claim 7, further comprising generating, in response to the joint recognition means recognizing the occurrence of the event based on the audio recognition information of a potential event, a control signal for directing one or more of the video camera devices in the direction of the specific event so as to enable capture of a video signal.

A monitoring method using video and audio recognition, comprising:
simultaneously receiving, in a processing means, a real-time video signal including video information acquired from a monitored area and a real-time audio signal including audio information from the monitored area;
determining associated video recognition information and audio recognition information from the received real-time video signal and real-time audio signal;
correlating the real-time audio recognition information and video recognition information with each other to determine the likelihood of occurrence of a specific event; and
generating an alarm condition based on the occurrence of the specific event,
wherein the processing means includes joint recognition means for correlating the audio recognition information and the video recognition information and detecting the occurrence of the specific event, and, in response to the joint recognition means recognizing the occurrence of the event based on the video recognition information of a potential event, generates a control signal for directing one or more microphones of the microphone devices in the direction of the specific event so as to enable capture of audio recognition information, and
wherein each of the microphone devices, in response to the control signal, automatically adjusts the orientation of the microphone in consideration of reception of the audio signal at any given directivity.

A program for causing a computer to perform each step of the method according to any one of claims 7 to 10.
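
The method steps recited in the claims (derive video and audio recognition information, steer a microphone toward a potential event seen on video, correlate the two results, and raise an alarm) can be illustrated with a minimal Python sketch. It is not the claimed implementation. The `Recognition` structure, both thresholds, the weighted-average fusion rule, and the planar bearing calculation are assumptions introduced only for illustration, and the recognition engines themselves are taken as given.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed values; the claims do not specify any thresholds.
POTENTIAL_EVENT_THRESHOLD = 0.5  # video confidence that triggers microphone steering
ALARM_THRESHOLD = 0.7            # joint likelihood that raises an alarm condition


@dataclass
class Recognition:
    """Hypothetical output of one recognition engine for one time window."""
    event: str                                   # label, e.g. "assault"
    confidence: float                            # 0.0 .. 1.0
    position: Tuple[float, float] = (0.0, 0.0)   # estimated event location (x, y)


def microphone_bearing(mic_position: Tuple[float, float],
                       event_position: Tuple[float, float]) -> float:
    """Bearing in degrees from the microphone to the event (flat 2-D layout assumed)."""
    dx = event_position[0] - mic_position[0]
    dy = event_position[1] - mic_position[1]
    return math.degrees(math.atan2(dy, dx))


def joint_likelihood(video: Recognition,
                     audio: Optional[Recognition],
                     video_weight: float = 0.5) -> float:
    """Correlate the two results; audio counts only if it reports the same event."""
    agrees = audio is not None and audio.event == video.event
    audio_confidence = audio.confidence if agrees else 0.0
    return video_weight * video.confidence + (1.0 - video_weight) * audio_confidence


def process_window(video: Recognition,
                   audio: Optional[Recognition],
                   mic_position: Tuple[float, float]):
    """Return (control signal or None, joint likelihood, alarm flag).

    `audio` stands for the audio recognition result captured after any
    steering, or None if nothing was recognized.
    """
    control_signal = None
    if video.confidence >= POTENTIAL_EVENT_THRESHOLD:
        # Potential event recognized from video: point the microphone at it.
        control_signal = microphone_bearing(mic_position, video.position)
    likelihood = joint_likelihood(video, audio)
    return control_signal, likelihood, likelihood >= ALARM_THRESHOLD
```

With these assumed numbers, a sudden movement seen on video with confidence 0.6 raises an alarm only when the audio engine also reports the same event with high confidence. Video alone stays below the alarm threshold, which corresponds to the false-alarm case (a child running toward a parent) discussed in the description.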