JP2014171187A - Monitor camera system

Info

Publication number: JP2014171187A
Application number: JP2013043141A
Authority: JP
Inventors: Yoshiji Iwabuchi; 義嗣岩渕
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2013-03-05
Filing date: 2013-03-05
Publication date: 2014-09-18
Anticipated expiration: 2033-03-05
Also published as: JP6112913B2

Abstract

PROBLEM TO BE SOLVED: To enable an image to be stored in an image format suited to its type by determining the type of the image, such as a snap image or a monitoring image, from the sound of the subject.

SOLUTION: A monitor camera system for recording captured image data on a server includes: imaging means for imaging a subject; one or more audio input means capable of receiving sound emitted by the subject as an audio signal; subject state detection means for detecting the state of the subject on the basis of the audio signal input by the audio input means; format change means for changing the format of the image data to be recorded on the server; and control means for controlling the operation of the whole device. The control means controls the format change means to change the format of the image data to be recorded in accordance with the detection result of the subject state detection means.

Description

The present invention relates to a surveillance camera system, and more particularly to a technique suitable for use in a surveillance camera system that transmits video data captured at a remote location over a network such as a LAN.

In recent years, many network cameras have been proposed that transmit video captured by a video camera or the like over a network such as the Internet or a LAN so that it can be monitored from a remote client (computer terminal).

A surveillance camera system in which such a network camera is installed in a house for security purposes, and video captured while the residents are away is transmitted to and recorded by a client such as a security company, is also well known. Furthermore, surveillance cameras are increasingly installed indoors to prevent accidents in the home. A surveillance camera installed in a house in this way can also be used to capture snap video of family members and others, and the snap video data can be saved on a home personal computer.

In general, video that people want to keep as snap video often shows the subject speaking, for example in conversation. Many methods have been proposed that use the subject's voice to detect the speaker or the state of the subject, or that perform image processing based on such detection results (see Patent Document 1 and Patent Document 2).

Patent Document 1: JP 2007-010995 A
Patent Document 2: JP 2007-193824 A

Video data captured by a surveillance camera is encoded in a video format suitable for monitoring, transmitted, and recorded by the client. Specifically, it is encoded in a video format suited to an image quality at which a person's face can be identified, or in a video format suited to the capacity of the network used for transmission to the client.

On the other hand, when video data captured by the surveillance camera is recorded as snap video, it must be encoded in a video format suitable for viewing and editing, transmitted, and recorded by the client. Specifically, it must be encoded in a video format suited to high image quality, or in a still image format that can be printed as a snapshot, before it is saved.

Therefore, to record snap video with a surveillance camera installed in a house, data in two or more video formats must be recorded simultaneously, because monitoring video and snap video serve different purposes. Recording multiple videos in this way increases the storage capacity required at the client, increases the client's recording load, and, because more data is transmitted, runs into the limits of the network line capacity.

In addition, snap video has to be extracted from the recorded video data. After all captured video has been recorded, the video data must be selected and edited by hand, which is very laborious.

As described above, recording snap video with a surveillance camera has two problems: simultaneous recording of monitoring video and snap video increases the required storage capacity, and it is difficult to extract the snap video from all of the recorded video.
In view of these problems, an object of the present invention is to determine the type of video, such as snap video or monitoring video, from the sound of the subject, and to save the video in a video format suited to that type.

The surveillance camera system of the present invention is a surveillance camera system that records captured video data on a server, and comprises: photographing means for photographing a subject; one or more audio input means capable of inputting sound emitted by the subject as an audio signal; subject state detection means for detecting the state of the subject from the audio signal input by the audio input means; format changing means for changing the format of the video data recorded on the server; and control means for controlling the operation of the entire apparatus. The control means controls the format changing means so as to change the format of the video data to be recorded in accordance with the detection result of the subject state detection means.

According to the present invention, the type of video, such as snap video or surveillance video, is determined from the sound of the subject, and the video is saved in a recording format suited to that type. This reduces the storage capacity needed for recording and makes it easy to extract snap video.

FIG. 1 is a block diagram showing a configuration example of a surveillance camera system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing an example of the control procedure of the surveillance camera system of the first embodiment.
FIG. 3 is a flowchart showing an example of the audio detection procedure of the surveillance camera system of the first embodiment.
FIG. 4 is a diagram explaining the audio spectrum of the surveillance camera system of the first embodiment.
FIG. 5 is a diagram explaining the audio spectrum of the surveillance camera system of the first embodiment.
FIG. 6 is a flowchart showing an example of the control procedure of the surveillance camera system of the second embodiment.
FIG. 7 is a flowchart showing an example of the control procedure of the surveillance camera system of the third embodiment.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
<First Embodiment>
FIG. 1 is a block diagram illustrating a configuration example of a surveillance camera system according to an embodiment of the present invention. The surveillance camera system consists of a camera server (device) 100 and a client 120 connected via a network 130.
The imaging unit 101 includes control circuits (not shown) for zoom, focus, exposure, and the like, and controls the optical system according to control data from the lens control unit 108 to capture optimal video.

Video data captured by the imaging unit 101 is sent to the video processing unit 103 and encoded by the encoding unit 104. Video encoding methods include JPEG, H.264, and MPEG4; since the present embodiment does not depend on the encoding method, a detailed description of the encoding methods is omitted.

So that the imaging unit 101 can be rotated to a desired position, the pan head unit 102 on which the imaging unit 101 is mounted is driven by a motor 111. The motor 111 rotates to the desired position under the control of the pan head control unit 109. The camera server also has a microphone 112 for inputting sound emitted by the subject as an audio signal. The subject's audio data input from the microphone 112 is sent to the audio processing unit 113, which determines whether the subject is suitable for snap video; the encoding unit 104 then encodes the audio data. Audio encoding methods include G.711; since the present embodiment does not depend on the encoding method, a detailed description is omitted.

The video processing unit 103, audio processing unit 113, encoding unit 104, CPU 105, ROM 106, RAM 107, lens control unit 108, and pan head control unit 109 are connected through an internal bus. The camera server can also transmit video data and audio data to the network 130 and receive shooting-angle request signals through the communication unit 110. The computer system consisting of the CPU 105, ROM 106, and RAM 107 controls the operation of the entire apparatus. The client 120 can play back and record video data and audio data captured by the camera server 100, and functions as a recording server. It can also remotely control the shooting angle of the camera server 100.

The present embodiment is characterized in that subject state detection is performed to detect, from the subject's audio signal, the state of the subject, such as being in conversation or being away from home, and the video data encoding method is changed according to state information indicating the detection result. The processing procedure (method) for changing the video data encoding method is described below with reference to the configuration diagram of FIG. 1 and the flowchart of FIG. 2. The processing of the flowchart in FIG. 2 is realized by loading a program stored in the ROM 106 into the RAM 107 and executing it on the CPU 105.

In S201, the subject's audio signal input from the microphone 112 is analyzed to determine whether there has been silence for a predetermined time. If silence is determined in S201, the encoding unit 104 encodes the video data in MPEG4 format at 640 × 480 pixels (VGA size) in S206. The encoded video data is transmitted to the client 120 via the communication unit 110 and the network 130, and the client 120 starts recording the video data.
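The silence check of S201 could be sketched as follows. The source only says "silence for a predetermined time"; the RMS measure, the threshold, the frame count, and the names `rms` and `is_silent` are all illustrative assumptions, not values from the patent.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_silent(frames, threshold=0.01, min_silent_frames=50):
    """Return True if the most recent `min_silent_frames` frames are all quiet.

    `threshold` and `min_silent_frames` are hypothetical stand-ins for the
    "predetermined time" of S201; the source gives no concrete values.
    """
    if len(frames) < min_silent_frames:
        return False  # not enough history yet to cover the predetermined time
    recent = frames[-min_silent_frames:]
    return all(rms(frame) < threshold for frame in recent)
```

Requiring every recent frame to be quiet, rather than averaging over the whole period, keeps a single loud event from being masked by a long quiet stretch.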

In S207, it is determined whether a resident of the house where the camera is installed has returned home. One way to detect a return home is to register the sound of the door being unlocked and the operating sound of the air conditioning equipment in advance and compare the registered sounds with the sound input from the microphone 112, but the present embodiment does not depend on any particular method.

Here, as a method for comparing a registered sound with a drive sound, a processing procedure (method) that compares audio signals as spectra to identify the sound source is described with reference to the flowchart of FIG. 3 and to FIGS. 4 and 5.

In S301, a window function is applied to the audio signal (audio input waveform) of the drive sound in FIG. 4(a), input from the microphone 112, to convert it into data of a fixed duration t.
In S302, the windowed data is Fourier transformed to obtain the audio spectrum, that is, the power at each frequency.
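Steps S301 and S302 can be illustrated with a small standard-library sketch. The source names neither the window function nor the transform size, so the Hann window and the direct (O(n²)) DFT used here are assumptions chosen for readability; a real implementation would more likely use an FFT. The function names are hypothetical.

```python
import cmath
import math


def hann_window(n):
    """Hann window of length n (an assumed choice; the source names no window)."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(samples):
    """Window the samples (S301), then take a DFT and return power per bin (S302).

    Only bins 0..n//2 are returned, since the input is real-valued.
    """
    n = len(samples)
    windowed = [s * w for s, w in zip(samples, hann_window(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def bin_frequency(k, n, sample_rate):
    """Centre frequency in Hz of DFT bin k for an n-sample frame."""
    return k * sample_rate / n
```

For example, a pure tone that completes eight cycles in a 64-sample frame produces its largest power value in bin 8, which `bin_frequency` maps to 1000 Hz at a sample rate of 8000 Hz.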

In S303, the pre-registered spectrum shown in FIG. 4(b) is selected. The registered spectrum in FIG. 4(b) consists of the spectrum of the door unlocking sound and the spectrum of the operating sound of the air conditioning equipment. The door unlocking sound is characterized by frequency range Fk and level Lk, and the operating sound of the air conditioning equipment by frequency range Fa and level La.

In S304, the audio spectrum obtained in S302 (FIG. 4(c)) is compared with the registered spectrum (FIG. 4(b)) to determine whether the spectrum of the door unlocking sound has been input.
If the spectrum of the door unlocking sound is detected, the door unlocking flag is set to "1" and the process proceeds to the next step.

In S305, the audio spectrum obtained in S302 (FIG. 4(d)) is compared with the registered spectrum (FIG. 4(b)) to determine whether the spectrum of the operating sound of the air conditioning equipment has been input. If the spectrum of the operating sound of the air conditioning equipment is detected, the air conditioning operation flag is set to "1" and the process proceeds to the next step.

In S306, the door unlocking flag and the air conditioning operation flag are checked, and if both are "1", it is determined that a resident has returned home.
In this way, converting audio signals input from the microphone, such as the door unlocking sound and the sound of the air conditioning equipment, into frequency spectra before comparing them makes their characteristics clear and helps prevent erroneous determinations.
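The comparison in S303 to S306 might look like the sketch below. The source describes each registered sound only by a frequency range and a level (Fk/Lk for the door, Fa/La for the air conditioning) and does not say how the comparison is computed; treating a sound as detected when the mean power inside its band reaches the registered level is an assumption, as are all names and numeric values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredSound:
    """A pre-registered sound signature (S303): a frequency range and a level."""
    name: str
    f_low: float   # lower edge of the frequency range, Hz
    f_high: float  # upper edge of the frequency range, Hz
    level: float   # minimum mean power expected inside the range


def band_mean_power(spectrum, sample_rate, n_samples, f_low, f_high):
    """Mean power of the bins whose centre frequency lies in [f_low, f_high]."""
    powers = [p for k, p in enumerate(spectrum)
              if f_low <= k * sample_rate / n_samples <= f_high]
    return sum(powers) / len(powers) if powers else 0.0


def matches(spectrum, sample_rate, n_samples, registered):
    """S304/S305 (assumed rule): detected if the band's mean power reaches the level."""
    mean_power = band_mean_power(spectrum, sample_rate, n_samples,
                                 registered.f_low, registered.f_high)
    return mean_power >= registered.level


def resident_returned(door_flag, aircon_flag):
    """S306: a return home is assumed only when both flags are set."""
    return door_flag and aircon_flag
```

Because the door and the air conditioning normally sound at different moments, the two flags would be set from different audio frames and held until S306 checks them together.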

Furthermore, applying a band-pass filter that extracts only the frequency band to be compared to the audio spectrum of FIG. 5(a) makes it possible to exclude spectra such as the ambient sound of the environment in which the surveillance camera is installed and telephone ring tones, as in the audio spectrum of FIG. 5(b). This improves the accuracy of the sound comparison.
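One simple way to picture this band-pass step is as a mask applied to the power spectrum. The sketch below uses an ideal (brick-wall) mask, which is an assumption made for brevity; the source does not specify the filter design, and the function name is hypothetical.

```python
def band_pass(spectrum, sample_rate, n_samples, f_low, f_high):
    """Keep only the bins whose centre frequency lies in [f_low, f_high].

    Bins outside the band are set to zero, so ambient noise or ring tones
    at other frequencies no longer influence the comparison.
    """
    return [p if f_low <= k * sample_rate / n_samples <= f_high else 0.0
            for k, p in enumerate(spectrum)]
```

With a bin spacing of 100 Hz, `band_pass([1.0, 2.0, 3.0, 4.0, 5.0], 800, 8, 100, 200)` keeps only the 100 Hz and 200 Hz bins.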

When a resident's return home is detected in S207, using a detection method such as the one described above, transmission of the video data is stopped in S205 and the client 120 stops recording the video data.
On the other hand, if analysis of the subject's audio signal input from the microphone 112 in S201 shows that sound has been input, S202 detects whether the input sound is conversation.
Here, a method using speech recognition is described as an example of detecting conversation. Speech recognition can be performed using statistical data on typical utterances. Features are extracted from training data in which many utterances have been recorded and from the audio signal from the microphone 112, and the two are compared to estimate the words spoken.

If the surveillance camera has such a speech recognition function, it can determine that a sound is conversation when specific words, such as a child's name or "tadaima" ("I'm home"), are uttered. Using directivity information from multiple microphones, the spectrum information described above, and the like also allows a more accurate determination.
The present embodiment does not depend on any particular speech recognition algorithm; it uses the result of speech recognition to determine whether a conversation is taking place.

If it is determined in S202 that the subject's voice is conversation, the encoding unit 104 encodes the video data in H.264 format at 1920 × 1080 pixels (HD size) in S203. The encoded video data is transmitted to the client 120 via the communication unit 110 and the network 130, and the client 120 starts recording the video data.
When speech recognition confirms in S204 that the conversation has ended, transmission of the video data is stopped in S205 and recording at the client 120 is stopped.
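The format selection of FIG. 2 can be summarized in a short sketch. The two formats and their resolutions come from S203 and S206; the type names, the `OTHER_SOUND` state, and the choice to return `None` for sound that is not conversation are assumptions, since the text does not describe that branch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SubjectState(Enum):
    SILENT = "silent"              # S201: silence for the predetermined time
    CONVERSATION = "conversation"  # S202: input sound recognized as conversation
    OTHER_SOUND = "other_sound"    # hypothetical: sound that is not conversation


@dataclass(frozen=True)
class VideoFormat:
    codec: str
    width: int
    height: int


MONITORING_FORMAT = VideoFormat("MPEG4", 640, 480)  # S206
SNAP_FORMAT = VideoFormat("H.264", 1920, 1080)      # S203


def select_format(state: SubjectState) -> Optional[VideoFormat]:
    """Map the detected subject state to the format used for recording."""
    if state is SubjectState.SILENT:
        return MONITORING_FORMAT
    if state is SubjectState.CONVERSATION:
        return SNAP_FORMAT
    return None  # assumption: the source does not say what happens here


def should_stop(fmt, resident_returned=False, conversation_ended=False):
    """S205 trigger: S207 ends monitoring recording, S204 ends snap recording."""
    if fmt == MONITORING_FORMAT:
        return resident_returned
    if fmt == SNAP_FORMAT:
        return conversation_ended
    return False
```

Keeping the state-to-format mapping in one function makes it straightforward to extend to the other variations the text mentions, such as a change of compression rate or a still image format.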

As described above, the state of the subject, such as being away from home or being in conversation, can be determined from the subject's audio signal, and the video can be recorded with the encoding method changed accordingly. Possible encoding methods include H.264, MPEG4, and JPEG.
In the present embodiment, the encoding method and resolution were described as examples of changing the recording method, but the same processing can be applied to changing other factors, such as the compression rate, that contribute to the high image quality desired for snap video. Also, although a change of moving image format was described as an example of changing the encoding method, the same processing can be applied to a format change to the still image format needed for snapshots.

<Second Embodiment>
In the second embodiment, the number of people in conversation is detected from the subject's audio signal, and the client 120 changes the format of the video data to be recorded according to that number. This processing procedure (method) is described with reference to the configuration diagram of FIG. 1 and the flowchart of FIG. 6. The processing of the flowchart in FIG. 6 is realized by loading a program stored in the ROM 106 into the RAM 107 and executing it on the CPU 105.

In S601, the encoding unit 104 of the camera server 100 encodes the video data in two video formats: MPEG4, suited to monitoring video, and H.264, suited to snap video. The two encoded video streams are transmitted to the client 120 via the communication unit 110 and the network 130.

In S602, the client 120 starts recording the received MPEG4-format video data.
In S603, the process waits until the subject's sound is input from the microphone 112; when sound is input, the audio signal is analyzed to determine whether it is speech. If speech is found in S603, the number of people in conversation is detected in S604, and the audio processing unit 113 adds the headcount information to the audio data.

Here, a method using speaker recognition is described as an example of detecting the number of people in conversation. Speaker recognition can be performed using an individual's voiceprint information. To register an individual's voiceprint information in advance, the person being registered reads a predetermined text aloud. The speaker's features are extracted from the utterance of the predetermined text, and a representative speech model is built up through learning. The speaker can then be identified by comparing the input audio signal with such a speech model.

If the surveillance camera has such a speaker recognition function, it can identify the speakers and the number of speakers, for example which family member is speaking or whether a visitor from outside the family is speaking. There are also methods that combine speaker recognition with face recognition from the video for more accurate recognition.
The present embodiment does not depend on any particular speaker recognition algorithm; it uses the result of speaker recognition to determine who is in conversation and how many people are taking part.

In S605, the client 120 checks the headcount information added to the audio data. If three or more people are confirmed, transmission of the MPEG4 video data is stopped in S606, and the client 120 stops recording the MPEG4 video data.
In S607, the client 120 starts recording the received H.264-format video data.
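The client-side switch of S602 and S605 to S607 can be sketched as a small state holder. The threshold of three speakers comes from S605; the class and method names are hypothetical, and the one-way switch (no return to MPEG4) is an assumption, because the text does not describe switching back.

```python
class RecordingClient:
    """Tracks which of the two received streams the client records."""

    SPEAKER_THRESHOLD = 3  # S605: three or more people in conversation

    def __init__(self):
        self.recording = "MPEG4"  # S602: start with the monitoring stream

    def on_audio(self, num_speakers):
        """Handle the headcount attached to incoming audio data (S605)."""
        if self.recording == "MPEG4" and num_speakers >= self.SPEAKER_THRESHOLD:
            # S606 stops the MPEG4 recording; S607 starts the H.264 recording.
            self.recording = "H.264"
        return self.recording
```

A client created this way keeps recording MPEG4 while one or two speakers are reported, and switches to H.264 the first time three or more are reported.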

As described above, in the present embodiment, information such as how often a specific speaker is detected by speaker recognition is added to the audio data as a priority for saving the video as snap video. The client 120 can change the video format it records according to this priority information.

In the present embodiment, detecting the number of people in conversation and changing the video encoding method accordingly was described as an example of changing the recording method. From the standpoint of how important a conversation is, the same processing can also be driven by factors such as the duration or volume of the conversation.

<Third Embodiment>
In the third embodiment, the speaker in a conversation is detected using the speaker recognition described above, and the video format and the recording server are changed according to the speaker. This processing procedure (method) is described with reference to the flowchart of FIG. 7. The processing of the flowchart in FIG. 7 is realized by loading a program stored in the ROM 106 into the RAM 107 and executing it on the CPU 105.

In S701, the subject's audio signal input from the microphone 112 is analyzed, and speaker recognition is used to identify the speaker by comparison with pre-registered voices of family members.
If the speaker is recognized as a family member in S701, then in S702 the encoding unit 104 of the camera server 100 encodes the video data in H.264 format at 1920 × 1080 pixels, and the data is transmitted to and recorded on a home server (not shown) in the house.

In S703, the subject's audio signal input from the microphone 112 is analyzed, and speaker recognition is used to detect speakers outside the family. If a speaker outside the family is detected in S703, then in S704 the encoding unit 104 of the camera server 100 encodes the video data in MPEG4 format at 320 × 240 pixels, and the data is transmitted to and recorded on a monitoring server installed outside the house, for example at a security company.

In S705, the subject's audio signal input from the microphone 112 is analyzed, and speaker recognition is used to detect that only family voices remain. When only family voices are confirmed in S705, transmission of the video data to the monitoring server is stopped in S706.

As described above, speaker recognition is used to determine whether video should be monitored; video to be monitored is transmitted to the monitoring server and snap video is transmitted to the home server, so that each can be recorded in the video format best suited to it.
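The routing of FIG. 7 can be sketched as a function from the set of recognized speakers to the streams that should be active. The formats and destinations follow S702 and S704; the names, and the choice to keep the home-server stream running while a visitor is also speaking, are assumptions, since the text does not state what happens to that stream in that case.

```python
from typing import NamedTuple


class Stream(NamedTuple):
    destination: str
    codec: str
    width: int
    height: int


HOME_STREAM = Stream("home_server", "H.264", 1920, 1080)         # S702
MONITORING_STREAM = Stream("monitoring_server", "MPEG4", 320, 240)  # S704


def plan_streams(speakers, family):
    """Return the streams to send for the currently recognized speakers.

    `speakers` holds the identifiers of voices detected now; `family` holds
    the identifiers registered in advance. Both are hypothetical inputs that
    a speaker recognition stage would provide.
    """
    streams = []
    if any(s in family for s in speakers):
        streams.append(HOME_STREAM)
    if any(s not in family for s in speakers):
        streams.append(MONITORING_STREAM)
    # When only family voices remain (S705), the monitoring stream is
    # simply no longer planned, which corresponds to S706.
    return streams
```

Recomputing the plan each time the speaker set changes covers S706 without a separate stop step: once the visitor's voice disappears, the monitoring stream drops out of the result.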

In the present embodiment, changing the recording server was described. Where the aim is to change the recording location, the same processing can also be used to change the recording folder or recording medium in accordance with the change of video encoding format.

Speaker recognition makes it possible to change the encoding method using recognition results, such as a conversation among family members only or a conversation with a visitor, as triggers. Beyond speaker recognition, voice qualities such as screams and laughter, or characteristic sounds such as doors and air conditioning, can also be used. Such characteristic sounds are registered in advance as spectra, and comparing their characteristics with the spectrum of the input sound enables accurate recognition.
Preferred embodiments of the present invention have been described above, but the present invention is not limited to these embodiments, and various modifications and changes are possible within the scope of its gist.

(Other embodiments)
The present invention can also be realized by executing the following processing: software (a computer program) that implements the functions of the above embodiments is supplied to a system or apparatus via a network or any of various computer-readable storage media, and a computer (or CPU, MPU, or the like) of that system or apparatus reads and executes the program.

DESCRIPTION OF SYMBOLS
100 Camera server
101 Imaging unit
102 Pan head unit
103 Video processing unit
104 Encoding unit
105 CPU
106 ROM
107 RAM
108 Lens control unit
109 Pan head control unit
110 Communication unit
111 Motor
112 Microphone
113 Audio processing unit
120 Client
130 Network

Claims

A surveillance camera system for recording captured video data on a server, comprising:
photographing means for photographing a subject;
one or more audio input means capable of inputting sound emitted by the subject as an audio signal;
subject state detection means for detecting the state of the subject from the audio signal input by the audio input means;
format changing means for changing the format of the video data recorded on the server; and
control means for controlling the operation of the entire apparatus,
wherein the control means controls the format changing means to change the format of the video data to be recorded in accordance with the detection result of the subject state detection means.

The surveillance camera system according to claim 1, wherein the subject state detection means receives the subject's voice, compares the input audio data with a pre-registered speech model for recognizing speakers, and detects the speaker who produced the voice.

The surveillance camera system according to claim 1 or 2, wherein the subject state detection means receives the subject's voice, compares the input audio data with a speech model for recognizing language, and detects the language spoken.

The surveillance camera system according to any one of claims 1 to 3, wherein the subject state detection means detects video to be monitored and video to be saved, and
the control means controls the format changing means so as to change any of the encoding method, resolution, or compression rate of the moving images and still images to be recorded, according to the detection result of the subject state detection means.

The surveillance camera system according to claim 4, wherein the encoding method is H.264, MPEG4, or JPEG, and the resolution includes 1920 × 1080 pixels or 640 × 480 pixels.

The status information according to the detection result of the subject status detection means is transmitted to a server, and the server changes the format of the video data to be recorded according to the status information. The surveillance camera system according to item.

The surveillance camera system according to any one of claims 1 to 5, wherein the subject state detection means receives the subject's voice and changes the format of the video data to be recorded on the server according to the frequency at which a specific language or a specific speaker is detected in the input audio data.

The surveillance camera system according to claim 1, wherein the control means changes the server on which video data is recorded according to the detection result of the subject state detection means.

The surveillance camera system according to claim 1, wherein the control means changes the folder in which video data is recorded according to the detection result of the subject state detection means.