JP7532963B2 - Image processing device and image processing method

Info

Publication number: JP7532963B2
Application number: JP2020117846A
Authority: JP
Inventors: 領平須永
Original assignee: JVCKenwood Corp
Current assignee: JVCKenwood Corp
Priority date: 2020-07-08
Filing date: 2020-07-08
Publication date: 2024-08-14
Anticipated expiration: 2040-07-08
Also published as: JP2022015167A

Description

The present invention relates to an image processing device and an image processing method.

Images (including still images and videos) with audio from inside a vehicle cabin are captured by an on-board camera and an on-board microphone and stored as a record of a journey. When such stored images are reviewed later, it is effective to have added effects such as subtitles to the images in advance.
For example, Patent Document 1 discloses a display mode determination device that extracts words from the audio included in an image and determines the display mode of subtitles according to how frequently the words are used.
The same applies not only to images of a vehicle cabin but also to images of the interior of other mobile objects, or images of other spaces, that are stored as records of events.

Patent Document 1: JP 2019-062332 A

However, the images described above contain many redundant parts and are therefore poorly suited to being reviewed after shooting as a record of events. This problem is particularly noticeable when the playback time is long.

The present invention has been made to solve the above problem, and provides an image processing device and an image processing method that allow a user, after shooting, to look back on important scenes from the shooting in an easy-to-understand digest image.

An image processing device according to one aspect of the present invention includes: an acquisition unit that acquires voice data and image data of a target space; a voice recognition unit that converts the voice data into text data; a situation evaluation unit that evaluates the situation of the target space based on the image data and calculates a situation evaluation value of the situation in the target space; and an image processing unit that, when an event occurs indicating a situation in which the situation evaluation value satisfies a predetermined condition, generates superimposed image data in which at least one of an image based on the text data and an image related to the situation is superimposed on an image of the image data for a predetermined period including the period during which the event is occurring.

An image processing method according to one aspect of the present invention includes the steps of: acquiring voice data and image data of a target space; converting the voice data into text data; evaluating the situation of the target space based on the image data and calculating a situation evaluation value of the situation in the target space; and, when an event occurs indicating a situation in which the situation evaluation value satisfies a predetermined condition, generating superimposed image data in which at least one of an image based on the text data and an image related to the situation is superimposed on an image of the image data for a predetermined period including the period during which the event is occurring.

According to the present invention, it is possible to provide an image processing device and an image processing method that allow a user, after shooting, to look back on important scenes from the shooting in an easy-to-understand digest image.

FIG. 1 is a block diagram showing an example of the configuration of the image processing device according to the first embodiment.
FIG. 2 is a diagram showing an example of the data structure of the situation table according to the first embodiment.
FIG. 3 is a diagram showing an example of the data structure of the image processing mode table according to the first embodiment.
FIG. 4 is a flowchart showing an outline of the processing of the image processing device according to the first embodiment.
FIG. 5 is a flowchart showing a first example of the processing of the image processing device according to the first embodiment.
FIG. 6 is a flowchart showing a second example of the processing of the image processing device according to the first embodiment.
FIG. 7 is a diagram showing an example of an image processing mode of the image processing device according to the first embodiment.
FIG. 8 is a diagram showing an example of an image processing mode of the image processing device according to the first embodiment.
FIG. 9 is a diagram showing an example of an image processing mode of the image processing device according to the first embodiment.
FIG. 10 is a block diagram showing an example of the configuration of the image processing device according to the second embodiment.
FIG. 11 is a diagram showing an example of the data structure of the situation table according to the second embodiment.
FIG. 12 is a diagram showing an example of the data structure of the image processing mode table according to the second embodiment.
FIG. 13 is a flowchart showing the processing of the image processing device according to the second embodiment.
FIG. 14 is a diagram showing an example of an image processing mode of the image processing device according to the second embodiment.

The present invention is described below through embodiments, but the invention according to the claims is not limited to the following embodiments. Furthermore, not all of the configurations described in the embodiments are necessarily essential as means for solving the problem. For clarity, the following description and drawings are omitted and simplified where appropriate. In the drawings, the same elements are given the same reference numerals, and duplicate explanations are omitted where unnecessary.

<Embodiment 1>
First, Embodiment 1 of the present invention is described with reference to FIGS. 1 to 9. FIG. 1 is a block diagram showing an example of the configuration of an image processing system 1 according to Embodiment 1. The image processing system 1 is a computer system that performs image processing, such as adding subtitles, on images captured of a target space containing people, and generates a digest image. In Embodiment 1, the target space is the interior space of a vehicle with occupants (hereinafter sometimes called the vehicle cabin). However, the target space is not limited to this and may be the interior of another mobile object with occupants, such as a ship or an airplane, or an indoor or outdoor space of a building where people are present.

The image processing system 1 includes an image processing device 10, a camera 30, and a microphone 40. In Embodiment 1, the image processing device 10, the camera 30, and the microphone 40 constitute an in-vehicle equipment system that includes in-vehicle equipment such as a drive recorder, a driver monitor, or an in-cabin monitor.

The camera 30 is mounted at any position in the vehicle, captures the scene inside the vehicle cabin, and generates image data of the captured images. As an example, the camera 30 captures the occupants sitting in the seats of the vehicle. The camera 30 is communicably connected to the image processing device 10. The camera 30 generates image data at, for example, 30 frames per second (30 fps) and supplies the generated image data to the image processing device 10 every 1/30 of a second. In addition to the scene inside the vehicle cabin, the camera 30 may also capture the scene outside the vehicle and supply it to the image processing device 10.

The microphone 40 is mounted at any position in the vehicle, picks up the sound inside the vehicle cabin, and generates voice data. As an example, the microphone 40 picks up the voices of the occupants sitting in the seats of the vehicle. The microphone 40 is also communicably connected to the image processing device 10 and supplies the generated voice data to the image processing device 10 in the same manner as the camera 30. The microphone 40 is not limited to sound inside the vehicle cabin and may also pick up sound outside the vehicle and supply it to the image processing device 10.

The image processing device 10 is a computer or hardware installed in association with the in-vehicle equipment. The image processing device 10 may be built into the in-vehicle equipment or attached externally. The image processing device 10 may also be installed at a location away from the vehicle and communicate with the camera 30 and the microphone 40 by wireless communication means. The image processing device 10 performs image processing, such as adding subtitles, on the image data and voice data acquired from the camera 30 and the microphone 40, respectively, and generates a digest image. The image processing device 10 may be configured to include the camera 30 and the microphone 40.

The image processing device 10 has an acquisition unit 11, a situation evaluation unit 12, a voice recognition unit 15, an image processing unit 16, an image recording unit 17, and a table storage unit 18.

The acquisition unit 11 acquires the voice data and the image data of the vehicle cabin to be captured from the microphone 40 and the camera 30, respectively. The acquisition unit 11 may be an interface, such as an input terminal or an input circuit, connected to the microphone 40 and the camera 30 by wire, or may be a communication receiving unit connected to the microphone 40 and the camera 30 wirelessly. The acquisition unit 11 supplies the acquired voice data and image data to the situation evaluation unit 12 and the voice recognition unit 15.

The situation evaluation unit 12 evaluates the situation inside the vehicle cabin. Here, the situation inside the vehicle cabin means a categorized state of the facial expressions or manner of speech of the one or more people (occupants) in that space. Examples of situations include a state in which there is a smiling occupant, a state in which there is a sleeping occupant, and a state in which an occupant who is loudly angry and a smiling occupant are both present.

At this time, the situation evaluation unit 12 calculates a situation evaluation value for each evaluation target situation. An evaluation target situation is a situation that is subject to evaluation, for example a situation designated as an evaluation target among the situations listed in the situation table 19 described later. The evaluation target situations may be all or only some of the situations listed in the situation table 19.

The situation evaluation value is calculated for each evaluation target situation and indicates the degree to which the situation inside the vehicle cabin corresponds to that evaluation target situation.

Here, the situation evaluation unit 12 includes an image evaluation unit 13.
The image evaluation unit 13 calculates, based on the image data acquired by the acquisition unit 11, a facial expression evaluation statistic for each facial expression related to the evaluation target situation. More specifically, the image evaluation unit 13 performs image recognition processing on the image data to recognize the people included in the image data, recognizes the facial expressions of the recognized people, and evaluates them by expression type, such as "laughter," "sadness," "anger," and "drowsiness." Known techniques such as machine learning are used for the image recognition processing.

The facial expression evaluation statistic is one of the situation evaluation indexes included in the situation evaluation value. The facial expression evaluation statistic is a statistic of the facial expression evaluation values, taking all occupants appearing in the image data as the population.
Here, the facial expression evaluation value is defined for each type of facial expression and is an index of the degree to which an individual occupant's expression corresponds to that defined expression. The facial expression evaluation value is calculated for each occupant appearing in the image data. For example, the facial expression evaluation value may be calculated based on the difference between the measured value obtained from the image data and a predetermined reference value for the size, position, or shape of each part of the face, or for the angle of the face or neck. The facial expression evaluation value may also be calculated by a facial expression evaluation model trained on images of people with various facial expressions as training data. The facial expression evaluation statistic is calculated based on the facial expression evaluation values of all occupants appearing in the image data. The facial expression evaluation statistic may be the number of occupants whose facial expression evaluation value is equal to or greater than a threshold, or the proportion of such occupants among all occupants. It may also be the result of determining whether that number or proportion exceeds a predetermined threshold. It may also be the sum or the average of the facial expression evaluation values. For example, the facial expression evaluation statistic for "laughter" may take a higher value as the number of occupants with a high "laughter" facial expression evaluation value increases. Using the facial expression evaluation statistic as the situation evaluation value in this way makes it possible to evaluate the overall situation inside the vehicle cabin more accurately.
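The alternative statistics listed above (count, proportion, sum, and average) can be illustrated with a minimal Python sketch. The function name `expression_statistics`, the dictionary keys, and the sample scores are assumptions made for illustration and do not appear in the description.

```python
from statistics import mean


def expression_statistics(scores, threshold):
    """Summarise per-occupant evaluation values for one expression type.

    `scores` holds one facial expression evaluation value per occupant
    (hypothetical values); `threshold` is the per-occupant threshold.
    """
    if not scores:
        raise ValueError("at least one occupant score is required")
    # Occupants whose evaluation value is equal to or greater than the threshold.
    at_or_above = [s for s in scores if s >= threshold]
    return {
        "count": len(at_or_above),                # number of such occupants
        "ratio": len(at_or_above) / len(scores),  # proportion among all occupants
        "total": sum(scores),                     # sum of evaluation values
        "mean": mean(scores),                     # average of evaluation values
    }


# Four occupants scored on an assumed 1-to-5 "laughter" scale.
stats = expression_statistics([5, 4, 1, 3], threshold=3)
print(stats)
```

Any one of these values could serve as the statistic; which one a given implementation would use is left open by the description.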

In Embodiment 1, the facial expression evaluation statistic is used as the situation evaluation value. That is, in Embodiment 1, the image evaluation unit 13 evaluates the facial expressions of the captured people based on the image data, calculates the facial expression evaluation statistic of each expression related to the evaluation target situation in the vehicle cabin, and uses that statistic as the situation evaluation value.

The situation evaluation unit 12 then uses the situation table 19 to determine whether the situation evaluation value satisfies a predetermined condition and, if it does, recognizes that an event of the evaluation target situation has occurred. In Embodiment 1, for each facial expression evaluation statistic related to the evaluation target situation, the situation evaluation unit 12 determines whether that statistic is equal to or greater than the threshold predetermined for it, and determines whether an event has occurred based on these results. When the situation evaluation unit 12 recognizes that an event has occurred, it notifies the voice recognition unit 15 and the image processing unit 16.
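The threshold check described above might look like the following Python sketch. The description only says the event decision is "based on" the individual threshold results; combining them with a logical AND is an assumption here, as are the function name `event_occurred`, the statistic names, and the numeric values.

```python
def event_occurred(statistics, thresholds):
    """Return True when every statistic named in `thresholds` reaches its threshold.

    `statistics` maps an expression name to its current statistic;
    `thresholds` maps the expression names relevant to one evaluation
    target situation to their event thresholds. A statistic that was
    not computed is treated as 0.0. (AND combination is an assumption.)
    """
    if not thresholds:
        return False
    return all(
        statistics.get(name, 0.0) >= limit
        for name, limit in thresholds.items()
    )


# Hypothetical proportions of occupants showing each expression.
current = {"laughter": 0.5, "sadness": 0.25}
# Hypothetical thresholds for a "sad occupant and smiling occupant" situation.
sad_and_smiling = {"sadness": 0.2, "laughter": 0.4}

print(event_occurred(current, sad_and_smiling))            # both thresholds met
print(event_occurred({"laughter": 0.5}, sad_and_smiling))  # sadness not observed
```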

The voice recognition unit 15 converts the voice data into text data containing character strings. The voice recognition unit 15 may use a known speech-to-text technique.

The image processing unit 16 uses the image processing mode table 20 to perform image processing, according to the evaluation target situation, on the images of the image data in the event occurrence period. The event occurrence period is a period that includes the period during which the event is actually occurring and predetermined periods before and after the event. Instead of both the preceding and following periods, the event occurrence period may include only the period before the event or only the period after it, or may include neither. The image processing unit 16 generates superimposed image data in which at least one of an image based on the text data and an image related to the evaluation target situation is superimposed on the images of the image data in the event occurrence period. One example of an image based on the text data is a subtitle image in which the text data is rendered as a character string in a predetermined font. When the text data is in Japanese, for example, it may be converted into kanji or katakana before being converted into a subtitle image. The image based on the text data may also omit part of the text data or add predetermined text data. The image based on the text data may also be a mark image associated in advance with the text data by a database (not shown). The image related to the evaluation target situation is a mark image associated with a situation ID by the image processing mode table 20 described later. For example, if the evaluation target situation has a situation ID that includes "sadness," the image related to the evaluation target situation may be a teardrop-shaped mark image.
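As a rough illustration of the event occurrence period, the Python sketch below converts an event's frame range plus optional margins before and after it into the frame range to be processed. The 30 fps rate is the example given for the camera 30; the function name `event_period_frames`, the parameter names, and the margin values are assumptions.

```python
FPS = 30  # example frame rate given for the camera 30


def event_period_frames(event_start, event_end, pre_seconds=0.0,
                        post_seconds=0.0, fps=FPS, last_frame=None):
    """Return (first_frame, last_frame) of the event occurrence period.

    `event_start` and `event_end` are frame indices of the period in
    which the event is actually occurring. Setting a margin to 0 models
    the variants that omit the period before or after the event.
    """
    if event_end < event_start:
        raise ValueError("event_end must not precede event_start")
    start = max(0, event_start - round(pre_seconds * fps))  # clamp at recording start
    end = event_end + round(post_seconds * fps)
    if last_frame is not None:
        end = min(end, last_frame)                          # clamp at recording end
    return start, end


# Event from frame 300 to 450 with hypothetical 5-second margins on both sides.
print(event_period_frames(300, 450, pre_seconds=5, post_seconds=5))
# Event near the start of the recording, with a margin before the event only.
print(event_period_frames(60, 90, pre_seconds=5))
```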

The image processing unit 16 then records the generated superimposed image data in the image recording unit 17. The image recording unit 17 is a storage medium that stores the superimposed image data generated by the image processing unit 16.

At this time, the image processing unit 16 preferably records the data in the image recording unit 17 in a recording format that prohibits overwriting. For example, when the image processing device 10 is mounted in a drive recorder, the image processing unit 16 may record to the image recording unit 17 in a recording mode separate from the recording modes of an ordinary drive recorder, such as continuous recording (overwritten), abnormality recording (recorded with overwriting prohibited, for example when an impact is detected), and manual recording. The image recording unit 17 is preferably a recording area different from the recording area used by those ordinary recording modes. The image processing unit 16 preferably does not perform the image superimposition described above on images recorded in the abnormality recording mode. This allows the recorded files to be distinguished from data recorded in the ordinary recording modes and viewed easily after driving. To make viewing even easier, it is preferable, although not required, to identify the recorded files by giving them folder names and file names according to the recording mode. Note that the drive recorder need not be equipped with all or some of the recording modes described above.

The table storage unit 18 is a storage medium that stores the situation table 19 and the image processing mode table 20.

FIG. 2 is a diagram showing an example of the data structure of the situation table 19 according to Embodiment 1.
The situation table 19 associates situation identification information (a situation ID) with the content of the situation, the type of situation evaluation value, and an event threshold.

In Embodiment 1, the content of a situation refers to a facial expression related to the situation or a combination of such expressions. For example, the content of situation IDs 1 to 3 is that there is a smiling occupant. The content of situation IDs 4 and 5 is that there is a sad occupant and a smiling occupant. The content of situation ID 6 is that there is an angry occupant and a smiling occupant. The content of situation ID 7 is that there is a sleeping occupant, and the content of situation ID 8 is that there is a sleeping occupant and a smiling occupant.

The type of situation evaluation value indicates the type of situation evaluation value corresponding to each item included in the content of the situation. For example, situation IDs 1 to 3 use the first facial expression evaluation statistic, which is the statistic for "laughter," as the situation evaluation value. Situation IDs 4 and 5 use the second facial expression evaluation statistic, which is the statistic for "sadness," together with the first facial expression evaluation statistic. Situation ID 6 uses the third facial expression evaluation statistic, which is the statistic for "anger," together with the first facial expression evaluation statistic. Situation ID 7 uses the fourth facial expression evaluation statistic, which is the statistic for "drowsiness," and situation ID 8 uses the fourth and first facial expression evaluation statistics.
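The associations described for the situation table 19 can be pictured as rows in a small lookup structure. The Python sketch below is only one possible in-memory representation; the class name `SituationRow`, the field names, and the helper `situations_using` are assumptions, and the event threshold column is omitted because its concrete values are not given in the text.

```python
from dataclasses import dataclass

# Labels for the four facial expression evaluation statistics.
LAUGHTER, SADNESS, ANGER, DROWSINESS = "first", "second", "third", "fourth"


@dataclass(frozen=True)
class SituationRow:
    situation_id: int
    content: str            # facial expression(s) that define the situation
    statistic_kinds: tuple  # statistics used as the situation evaluation value


SITUATION_TABLE = [
    SituationRow(1, "smiling occupant", (LAUGHTER,)),
    SituationRow(2, "smiling occupant", (LAUGHTER,)),
    SituationRow(3, "smiling occupant", (LAUGHTER,)),
    SituationRow(4, "sad occupant and smiling occupant", (SADNESS, LAUGHTER)),
    SituationRow(5, "sad occupant and smiling occupant", (SADNESS, LAUGHTER)),
    SituationRow(6, "angry occupant and smiling occupant", (ANGER, LAUGHTER)),
    SituationRow(7, "sleeping occupant", (DROWSINESS,)),
    SituationRow(8, "sleeping occupant and smiling occupant", (DROWSINESS, LAUGHTER)),
]


def situations_using(kind):
    """Return the situation IDs whose evaluation value includes `kind`."""
    return [row.situation_id for row in SITUATION_TABLE
            if kind in row.statistic_kinds]


print(situations_using(DROWSINESS))
print(situations_using(SADNESS))
```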

The event threshold is the threshold that the situation evaluation unit 12 uses to determine whether an event has occurred. For example, situations are classified into three levels, situation IDs 1 to 3, according to the magnitude of the first facial expression evaluation statistic. If the statistic is X_1 or more, it is determined that an event of situation ID 1, in which all occupants are smiling, has occurred. If the statistic is X_2 or more and less than X_1, it is determined that an event of situation ID 2, in which a majority of the occupants are smiling, has occurred. If the statistic is X_3 or more and less than X_2, it is determined that an event of situation ID 3, in which a small number of occupants are smiling, has occurred. As a modification, the first facial expression evaluation statistic may be the evaluation value of the occupants' degree of smiling described later, and the determination may be, for example, situation ID 1 when the facial expression evaluation value is 5 or more and situation ID 2 when it is 3 or 4.
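The three-level classification by X_1, X_2, and X_3 can be sketched in Python as follows. The function name `classify_smile_event` is an assumption, and the numeric thresholds in the example (treating the statistic as the proportion of smiling occupants) are hypothetical because the text gives no concrete values.

```python
def classify_smile_event(first_statistic, x1, x2, x3):
    """Map the first (laughter) statistic to situation ID 1, 2, or 3.

    Returns None when the statistic is below X_3, meaning that no
    event of situation IDs 1 to 3 has occurred.
    """
    if not (x1 > x2 > x3):
        raise ValueError("thresholds must satisfy X_1 > X_2 > X_3")
    if first_statistic >= x1:
        return 1  # all occupants smiling
    if first_statistic >= x2:
        return 2  # majority of occupants smiling
    if first_statistic >= x3:
        return 3  # small number of occupants smiling
    return None


# Hypothetical thresholds on the proportion of smiling occupants.
X1, X2, X3 = 1.0, 0.5, 0.25
for value in (1.0, 0.6, 0.25, 0.1):
    print(value, classify_smile_event(value, X1, X2, X3))
```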

A specific example of how the various facial expression evaluation statistics are calculated is now described. First, the image evaluation unit 13 of the situation evaluation unit 12 generates, in advance, a normalized image of each occupant's face from the acquired image data. When the facial expression related to the situation is "laughter," the image evaluation unit 13 calculates the decrease in eye width, the change in mouth size, and the rise of the mouth corners of the occupant in the normalized image, relative to the average eye size, mouth size, and mouth-corner position for a neutral expression. Based on these amounts of change and on conditions such as whether teeth are visible, the image evaluation unit 13 calculates the "laughter" facial expression evaluation value as, for example, a five-level value from 1 (neutral expression) to 5 (broadest smile). The value may instead be a 0/1 evaluation of whether or not the occupant is laughing. This evaluation value may be calculated from parameters such as the change in mouth size and whether teeth are visible, as described above, or the degree of smiling may be evaluated using, for example, a machine-learned recognition dictionary for the degree of smiling. The same applies to "sadness," "anger," and "drowsiness" described below. The "laughter" facial expression evaluation value may also be calculated by a facial expression evaluation model trained on images of "laughter" expressions of various people as training data.
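The description does not give a formula for the five-level "laughter" value, so the following Python sketch is purely illustrative. It assumes each amount of change has already been normalized to the range 0.0 to 1.0 relative to the neutral-expression average, weights the three changes equally, and adds a fixed bonus when teeth are visible. The function name `smile_score`, the equal weighting, and the bonus of 0.2 are all assumptions.

```python
def smile_score(eye_width_decrease, mouth_size_change,
                mouth_corner_rise, teeth_visible):
    """Return a "laughter" evaluation value from 1 (neutral) to 5 (broadest smile).

    Inputs are hypothetical normalized changes in the range 0.0 to 1.0
    relative to the neutral expression; values outside it are clamped.
    """
    changes = (eye_width_decrease, mouth_size_change, mouth_corner_rise)
    clamped = [min(1.0, max(0.0, c)) for c in changes]
    strength = sum(clamped) / len(clamped)   # equal weighting (assumption)
    if teeth_visible:
        strength = min(1.0, strength + 0.2)  # fixed bonus (assumption)
    return 1 + round(strength * 4)           # map 0.0..1.0 onto 1..5


print(smile_score(0.0, 0.0, 0.0, teeth_visible=False))
print(smile_score(0.5, 0.5, 0.5, teeth_visible=False))
print(smile_score(1.0, 1.0, 1.0, teeth_visible=True))
```

A learned model, which the description also allows, would replace this hand-written mapping entirely.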

When the facial expression related to the situation is "sadness," the image evaluation unit 13 calculates the decrease in eye width and the changes in mouth shape and in neck and face angle of the occupant in the normalized image, relative to the average eye size, mouth shape, and neck and face angle for a neutral expression. The image evaluation unit 13 calculates the "sadness" facial expression evaluation value based on these amounts of change and on conditions such as the presence or absence of tears. The "sadness" facial expression evaluation value may also be calculated by a facial expression evaluation model trained on images of "sadness" expressions of various people as training data.

また状況に関連する表情が「怒り」である場合、画像評価部１３は、無表情の場合の目の大きさ、口の形状、眉毛の位置並びに首および顔の角度の平均値から、正規化画像に写る乗員の目の大きさ、口の形状、眉毛の位置並びに首および顔の角度の変化量を算出する。画像評価部１３は、各種変化量に基づいて、「怒り」の表情評価値を算出する。また「怒り」の表情評価値は、被写体として様々な表情の人物が写る「怒り」の表情の画像を教師データとして学習した表情評価モデルにより算出されてもよい。 Furthermore, if the facial expression associated with the situation is "anger," the image evaluation unit 13 calculates the amount of change in the eye size, mouth shape, eyebrow position, and neck and face angle of the occupant appearing in the normalized image from the average values of the eye size, mouth shape, eyebrow position, and neck and face angle in the case of a neutral expression. The image evaluation unit 13 calculates a facial expression evaluation value for "anger" based on the various amounts of change. The facial expression evaluation value for "anger" may also be calculated using a facial expression evaluation model trained using images of "anger" facial expressions of people with various facial expressions as training data.

また状況に関連する表情が「眠気」である場合、画像評価部１３は、無表情の場合の目の大きさ並びに首および顔の角度の平均値から、正規化画像に写る乗員の目が閉じているか否かを判定し、首および顔の角度の変化量を算出する。画像評価部１３は、判定結果および各種変化量に基づいて、「眠気」の表情評価値を算出する。また「眠気」の表情評価値は、被写体として様々な表情の人物が写る「眠気」の表情の画像を教師データとして学習した表情評価モデルにより算出されてもよい。 If the facial expression related to the situation is "drowsiness," the image evaluation unit 13 determines whether the eyes of the occupant in the normalized image are closed from the average values of the eye size and the angles of the neck and face when the occupant has a neutral expression, and calculates the amount of change in the angle of the neck and face. The image evaluation unit 13 calculates the facial expression evaluation value of "drowsiness" based on the determination result and the various amounts of change. The facial expression evaluation value of "drowsiness" may also be calculated by a facial expression evaluation model that has been trained using images of "drowsiness" facial expressions of people with various facial expressions as training data.

そして画像評価部１３は、算出した各乗員の表情評価値に基づいて、各種表情評価統計値を算出する。 The image evaluation unit 13 then calculates various facial expression evaluation statistics based on the calculated facial expression evaluation values for each occupant.

図３は、実施形態１にかかる画像処理態様テーブル２０のデータ構造の一例を示す図である。画像処理態様テーブル２０は、状況ＩＤと、画像処理態様とを関連付ける。画像処理態様は、字幕の文字サイズ、フォント、色および字幕重畳位置等の重畳する字幕の表示態様に関する事項と、その他の画像処理態様とを含む。なお重畳する字幕の態様に関する事項は、本図では、文字サイズ、フォントおよび字幕重畳位置である。その他の画像処理態様は、マークの追加、状況に応じた字幕の追加、黒ベタ処理、色の変更、カラー画像のモノクロ化処理、白黒反転処理およびモザイク処理等が挙げられる。図３における文字サイズは、例として５（最大）から１（最小）までの５段階で示しているが、１０段階であってもよく、実際の文字サイズ（例えばフォント高さのピクセル数）で指定してもよい。 Figure 3 is a diagram showing an example of the data structure of the image processing mode table 20 according to the first embodiment. The image processing mode table 20 associates a situation ID with an image processing mode. The image processing mode includes matters related to the display mode of the superimposed subtitles, such as the character size, font, color, and subtitle superimposition position of the subtitles, and other image processing modes. In this figure, matters related to the mode of the superimposed subtitles are the character size, font, and subtitle superimposition position. Other image processing modes include adding marks, adding subtitles according to the situation, solid black processing, changing colors, converting color images to monochrome, black and white inversion processing, and mosaic processing. The character size in Figure 3 is shown as five levels from 5 (maximum) to 1 (minimum) as an example, but may be ten levels or may be specified by the actual character size (for example, the number of pixels of the font height).
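
For illustration only, a table that associates a situation ID with an image processing mode could be represented as a simple mapping. Every row below is a placeholder: the actual contents of the image processing mode table 20 are not reproduced in this passage, and the names `ProcessingMode`, `MODE_TABLE` and `lookup_mode` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingMode:
    font_size: int       # 1 (smallest) .. 5 (largest), as in the five-level example
    font: str            # placeholder font name
    position: str        # placeholder subtitle superimposition position
    effects: tuple = ()  # other image processing, e.g. added marks or mosaic


# Placeholder rows keyed by situation ID (assumed values, not those of FIG. 3).
MODE_TABLE = {
    1: ProcessingMode(5, "bold", "center"),
    2: ProcessingMode(4, "bold", "bottom"),
    3: ProcessingMode(3, "regular", "bottom"),
    8: ProcessingMode(3, "regular", "bottom", ("zzz_mark", "nose_balloon")),
}


def lookup_mode(situation_id: int) -> ProcessingMode:
    """Return the image processing mode associated with a situation ID."""
    return MODE_TABLE[situation_id]
```

Specifying the character size in pixels instead of levels, as the description allows, would only change the meaning of `font_size`.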

なお画像処理部１６は、状況評価値、つまり本実施形態１では表情評価統計値に基づいて、重畳される字幕の表示態様を決定してもよい。これにより乗員の表情の統計値によって判定した室内の盛り上がりの程度に応じた、字幕演出を行うことができる。 The image processing unit 16 may determine the display mode of the superimposed subtitles based on the situation evaluation value, that is, in this embodiment 1, the facial expression evaluation statistical value. This allows subtitle presentation according to the degree of excitement in the cabin determined by the statistical value of the facial expressions of the occupants.

図４は、実施形態１にかかる画像処理装置１０の処理の概要を示すフローチャートである。
まずステップＳ１０において、画像処理装置１０の取得部１１は、車室内の音声データおよび画像データを、マイク４０およびカメラ３０からそれぞれ取得する。取得部１１は、取得した音声データおよび画像データを、状況評価部１２および音声認識部１５に供給する。 FIG. 4 is a flowchart showing an outline of the process of the image processing apparatus 10 according to the first embodiment.
First, in step S10, the acquisition unit 11 of the image processing device 10 acquires voice data and image data in the vehicle cabin from the microphone 40 and the camera 30, respectively. The acquisition unit 11 supplies the acquired voice data and image data to the situation evaluation unit 12 and the voice recognition unit 15.

次にステップＳ１１において、状況評価部１２の画像評価部１３は、画像データに基づいて、認識した各乗員の表情について評価対象状況に関連する表情評価値を算出し、各乗員の表情評価値に基づいて表情評価統計値を算出する。評価対象状況に関連する表情評価統計値が複数ある場合には、画像評価部１３は、各表情評価統計値を算出する。なお、画像評価部１３は状況テーブル１９を用いて、評価対象状況に関連する表情評価統計値の種別を確認してよい。 Next, in step S11, the image evaluation unit 13 of the situation evaluation unit 12 calculates a facial expression evaluation value related to the situation to be evaluated for each recognized facial expression of the occupant based on the image data, and calculates a facial expression evaluation statistical value based on the facial expression evaluation value of each occupant. If there are multiple facial expression evaluation statistical values related to the situation to be evaluated, the image evaluation unit 13 calculates each facial expression evaluation statistical value. Note that the image evaluation unit 13 may use the situation table 19 to check the type of facial expression evaluation statistical value related to the situation to be evaluated.

そしてステップＳ１２において、状況評価部１２は、評価対象状況に関連する状況評価値種別ごとに状況評価値を算出する。本実施形態１では、状況評価部１２は、ステップＳ１１で算出した表情評価統計値を状況評価値とする。 Then, in step S12, the situation evaluation unit 12 calculates a situation evaluation value for each situation evaluation value type related to the situation to be evaluated. In this embodiment 1, the situation evaluation unit 12 sets the facial expression evaluation statistics calculated in step S11 as the situation evaluation value.

次にステップＳ１３において、状況評価部１２は、状況テーブル１９を用いて、ステップＳ１２で算出した状況評価値種別ごとの状況評価値から、評価対象状況のイベントが発生したか否かを判定する。状況評価部１２は、イベントが発生したと判定した場合（ステップＳ１３でＹｅｓ）、音声認識部１５にその旨を通知し、処理をステップＳ１４に進める。一方、状況評価部１２は、そうでない場合（ステップＳ１３でＮｏ）、処理をステップＳ１０に戻す。 Next, in step S13, the situation evaluation unit 12 uses the situation table 19 to determine whether or not an event of the situation to be evaluated has occurred, based on the situation evaluation value for each situation evaluation value type calculated in step S12. If the situation evaluation unit 12 determines that an event has occurred (Yes in step S13), it notifies the voice recognition unit 15 of this and proceeds to step S14. On the other hand, if the situation evaluation unit 12 does not determine that an event has occurred (No in step S13), it returns the process to step S10.

ステップＳ１４において、音声認識部１５は、状況評価部１２から通知を受けたことに応じて、イベント発生期間の音声データをテキストデータに変換し、変換したテキストデータを画像処理部１６に供給し、処理をステップＳ１５に進める。 In step S14, in response to receiving a notification from the situation evaluation unit 12, the voice recognition unit 15 converts the voice data during the event occurrence period into text data, supplies the converted text data to the image processing unit 16, and proceeds to step S15.

ステップＳ１５において、画像処理部１６は、イベント発生期間の画像データに対して、画像処理態様テーブル２０を用いて、評価対象状況の状況ＩＤに応じたテキストデータに基づく画像および評価対象状況に関連する画像のうち少なくとも一方を生成し、画像データに重畳させる画像処理を実行し、重畳画像データを生成し、処理をステップＳ１６に進める。 In step S15, the image processing unit 16 uses the image processing mode table 20 to generate at least one of an image based on text data corresponding to the situation ID of the situation to be evaluated and an image related to the situation to be evaluated for the image data of the event occurrence period, performs image processing to superimpose the image data, generates superimposed image data, and proceeds to step S16.

そしてステップＳ１６において、画像処理部１６は、イベント発生期間の音声データおよび重畳画像データを、上書き禁止で画像記録部１７に記録し、処理を終了する。ステップＳ１６はなくともよい。 Then, in step S16, the image processing unit 16 records the audio data and superimposed image data during the event occurrence period in the image recording unit 17 with overwriting prohibited, and ends the process. Step S16 may be omitted.

なおステップＳ１５はステップＳ１４よりも前に実行されてもよく、並行して実行されてもよい。またステップＳ１４に代えて、音声認識部１５は、状況評価部１２から通知の有無に関わらず、音声データを取得したことに応じて音声データをテキストデータに変換してもよい。また音声認識部１５は、乗員の口の動きに基づいて乗員が発話中であると判定される場合にのみ、音声データをテキストデータに変換してもよい。なお上記判定は、画像評価部１３により行われてよい。 Note that step S15 may be executed before step S14, or may be executed in parallel. Also, instead of step S14, the voice recognition unit 15 may convert the voice data into text data in response to acquisition of the voice data, regardless of whether or not a notification is received from the situation evaluation unit 12. Also, the voice recognition unit 15 may convert the voice data into text data only when it is determined that the occupant is speaking based on the movement of the occupant's mouth. Note that the above determination may be made by the image evaluation unit 13.
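
The flow of steps S11 to S16 can be sketched, under stated assumptions, as one pass over data already acquired in step S10. The function `run_once` and its callable parameters are hypothetical stand-ins for the units described above; they are injected so that the sketch does not presume any particular implementation of evaluation, speech recognition or rendering.

```python
def run_once(audio, image, evaluate, detect_event, transcribe,
             superimpose, record=None):
    """One pass through steps S11 to S16 for already acquired data."""
    # S11 and S12: compute the situation evaluation value(s) from the image.
    values = evaluate(image)
    # S13: decide whether an event of the evaluated situation occurred.
    situation_id = detect_event(values)
    if situation_id is None:
        return None  # the caller goes back to acquisition (S10)
    # S14: convert the audio of the event period into text.
    text = transcribe(audio)
    # S15: generate the superimposed image for this situation ID.
    result = superimpose(image, text, situation_id)
    # S16: recording is optional, as noted in the description.
    if record is not None:
        record(audio, result)
    return result
```

Because the callables are independent, the variations mentioned above (transcribing regardless of notification, or changing the order of steps S14 and S15) amount to reordering the calls.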

次に図５～６を用いて、画像処理装置１０の処理の具体例について説明する。図５は、実施形態１にかかる画像処理装置１０の処理の第１の例を示すフローチャートである。第１の例では、状況テーブル１９の状況ＩＤ１～３が評価対象状況として予め指定されているものとする。なお図４のステップＳ１０に示す処理については、その記載を省略する。 Next, a specific example of the processing of the image processing device 10 will be described with reference to Figures 5 and 6. Figure 5 is a flowchart showing a first example of the processing of the image processing device 10 according to the first embodiment. In the first example, it is assumed that situation IDs 1 to 3 in the situation table 19 are designated in advance as situations to be evaluated. Note that a description of the processing shown in step S10 in Figure 4 will be omitted.

ステップＳ２０において、画像処理装置１０の状況評価部１２は、各乗員の第１表情評価値を算出し、第１表情評価値が所定閾値以上である乗員がいるか否か、つまり笑顔の乗員が検出された否かを判定する。画像評価部１３は、笑顔の乗員が検出されたと判定した場合（ステップＳ２０でＹｅｓ）、処理をステップＳ２１に進め、そうでない場合（ステップＳ２０でＮｏ）、ステップＳ２０に示す処理を繰り返す。 In step S20, the situation evaluation unit 12 of the image processing device 10 calculates a first facial expression evaluation value for each occupant and determines whether or not there is an occupant whose first facial expression evaluation value is equal to or greater than a predetermined threshold value, that is, whether or not a smiling occupant has been detected. If the image evaluation unit 13 determines that a smiling occupant has been detected (Yes in step S20), the process proceeds to step S21; otherwise (No in step S20), the process shown in step S20 is repeated.

次にステップＳ２１において、状況評価部１２は、第１表情評価統計値を算出し、第１表情評価統計値が状況テーブル１９に示す状況ＩＤ３の条件を満たすか否か、つまり所定人数以上の乗員が笑顔であるか否かを判定する。状況評価部１２は、所定人数以上の乗員が笑顔であると判定した場合（ステップＳ２１でＹｅｓ）、話が盛り上がるイベントが発生したとして処理をステップＳ２２に進め、そうでない場合（ステップＳ２１でＮｏ）、処理をステップＳ２０に戻す。 Next, in step S21, the situation evaluation unit 12 calculates a first facial expression evaluation statistic and determines whether or not the first facial expression evaluation statistic satisfies the condition of situation ID 3 shown in the situation table 19, that is, whether or not a predetermined number of occupants or more are smiling. If the situation evaluation unit 12 determines that a predetermined number of occupants or more are smiling (Yes in step S21), it treats this as the occurrence of an event in which the conversation becomes lively and proceeds to step S22; if not (No in step S21), it returns the process to step S20.

ステップＳ２２において、状況評価部１２は、笑顔の乗員を検出した前後の画像データを所定の記録領域にバックアップする。 In step S22, the situation assessment unit 12 backs up the image data before and after the smiling occupant is detected in a specified storage area.

次にステップＳ２３において、音声認識部１５は、笑顔の乗員を検出した前後の音声データをテキストデータに変換する。 Next, in step S23, the voice recognition unit 15 converts the voice data before and after detecting the smiling occupant into text data.

次にステップＳ２４において、状況評価部１２は、第１表情評価統計値が状況テーブル１９に示す状況ＩＤ１の条件を満たすか否か、つまり乗員の全員が笑顔であるか否かを判定する。状況評価部１２は、乗員の全員が笑顔であると判定した場合（ステップＳ２４でＹｅｓ）、処理をステップＳ２５に進め、そうでない場合（ステップＳ２４でＮｏ）、処理をステップＳ２６に進める。 Next, in step S24, the situation evaluation unit 12 determines whether the first facial expression evaluation statistical value satisfies the condition of situation ID 1 shown in the situation table 19, that is, whether all occupants are smiling. If the situation evaluation unit 12 determines that all occupants are smiling (Yes in step S24), the process proceeds to step S25; otherwise (No in step S24), the process proceeds to step S26.

ステップＳ２５において、画像処理部１６は、画像処理態様テーブル２０の状況ＩＤ１の画像処理態様で、バックアップした画像データの画像に対して、字幕等の画像を重畳させる。たとえば画像処理部１６は、通常の字幕に比べて、文字サイズを大きくし、字幕に対してより目立つフォントおよび色を付与する。そして画像処理部１６は処理をステップＳ２９に進める。 In step S25, the image processing unit 16 superimposes an image such as a subtitle onto the image of the backed up image data in the image processing mode of situation ID 1 in the image processing mode table 20. For example, the image processing unit 16 makes the character size larger than normal subtitles and gives the subtitles a more eye-catching font and color. The image processing unit 16 then advances the process to step S29.

ステップＳ２６において、状況評価部１２は、第１表情評価統計値が状況テーブル１９に示す状況ＩＤ２の条件を満たすか否か、つまり乗員の過半数が笑顔であるか否かを判定する。状況評価部１２は、乗員の過半数が笑顔であると判定した場合（ステップＳ２６でＹｅｓ）、処理をステップＳ２７に進め、そうでない場合（ステップＳ２６でＮｏ）、処理をステップＳ２８に進める。 In step S26, the situation evaluation unit 12 determines whether the first facial expression evaluation statistic satisfies the condition of situation ID 2 shown in the situation table 19, that is, whether the majority of the occupants are smiling. If the situation evaluation unit 12 determines that the majority of the occupants are smiling (Yes in step S26), the process proceeds to step S27; otherwise (No in step S26), the process proceeds to step S28.

ステップＳ２７において、画像処理部１６は、画像処理態様テーブル２０の状況ＩＤ２の画像処理態様で、バックアップした画像データの画像に対して、字幕等の画像を重畳させる。そして画像処理部１６は処理をステップＳ２９に進める。 In step S27, the image processing unit 16 superimposes an image such as a subtitle onto the image of the backed up image data in the image processing mode of situation ID 2 in the image processing mode table 20. The image processing unit 16 then advances the process to step S29.

ステップＳ２８において、画像処理部１６は、画像処理態様テーブル２０の状況ＩＤ３の画像処理態様で、バックアップした画像データの画像に対して、字幕等の画像を重畳させる。そして画像処理部１６は処理をステップＳ２９に進める。 In step S28, the image processing unit 16 superimposes an image such as a subtitle onto the image of the backed up image data in the image processing mode of situation ID 3 in the image processing mode table 20. The image processing unit 16 then advances the process to step S29.

ステップＳ２９において、画像処理部１６は、重畳画像データおよびこれに対応する音声データを上書き禁止で画像記録部１７に記録し、処理を終了する。ステップＳ２９はなくてもよい。 In step S29, the image processing unit 16 records the superimposed image data and the corresponding audio data in the image recording unit 17 with overwrite protection, and ends the process. Step S29 may be omitted.

本例では、画像処理態様は、笑顔の数を考慮した第１表情評価統計値によって決定される。したがって、乗員の表情によって判定された盛り上がりの程度に応じた字幕演出を行うことができる。 In this example, the image processing mode is determined by the first facial expression evaluation statistic, which takes into account the number of smiles. Therefore, it is possible to perform subtitle presentation according to the level of excitement determined by the facial expressions of the passengers.
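
The cascade of determinations in steps S20, S21, S24 and S26 of this first example can be sketched as follows. The smile threshold and the predetermined number of occupants are assumed values, and `select_smile_situation` is a hypothetical name; the sketch only returns the situation ID whose image processing mode would be applied.

```python
def select_smile_situation(scores, smile_threshold=3, min_smilers=2):
    """Return situation ID 1, 2 or 3, or None when no event is detected.

    scores: first facial expression evaluation value of each occupant.
    smile_threshold and min_smilers are assumed, not taken from the source.
    """
    smilers = sum(1 for s in scores if s >= smile_threshold)
    total = len(scores)
    if smilers == 0:
        return None  # S20: no smiling occupant detected
    if smilers < min_smilers:
        return None  # S21: fewer than the predetermined number are smiling
    if smilers == total:
        return 1     # S24: all occupants are smiling
    if smilers * 2 > total:
        return 2     # S26: a majority of occupants are smiling
    return 3         # S28: only the predetermined number or more are smiling
```

Checking the strictest condition first means that each event receives exactly one image processing mode, matching the order of the flowchart.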

図６は、実施形態１にかかる画像処理装置１０の処理の第２の例を示すフローチャートである。第２の例では、状況テーブル１９の状況ＩＤ７～８が評価対象状況として予め指定されているものとする。なお図４のステップＳ１０に示す処理については、その記載を省略する。 Figure 6 is a flowchart showing a second example of the processing of the image processing device 10 according to the first embodiment. In the second example, it is assumed that situation IDs 7 to 8 in the situation table 19 are designated in advance as situations to be evaluated. Note that the description of the processing shown in step S10 in Figure 4 is omitted.

ステップＳ３０において、画像処理装置１０の状況評価部１２は、第４表情評価統計値を算出し、第４表情評価統計値が状況テーブル１９に示す状況ＩＤ７～８の条件を満たすか否か、つまり睡眠中の乗員が検出された否かを判定する。状況評価部１２は、睡眠中の乗員が検出されたと判定した場合（ステップＳ３０でＹｅｓ）、処理をステップＳ３１に進め、そうでない場合（ステップＳ３０でＮｏ）、ステップＳ３０に示す処理を繰り返す。 In step S30, the situation evaluation unit 12 of the image processing device 10 calculates a fourth facial expression evaluation statistical value and determines whether or not the fourth facial expression evaluation statistical value satisfies the conditions of situation IDs 7 to 8 shown in the situation table 19, that is, whether or not a sleeping occupant has been detected. If the situation evaluation unit 12 determines that a sleeping occupant has been detected (Yes in step S30), the process proceeds to step S31; otherwise (No in step S30), the process shown in step S30 is repeated.

ステップＳ３１において、状況評価部１２は、第１表情評価統計値を算出し、第１表情評価統計値が状況テーブル１９に示す状況ＩＤ８の条件を満たすか否か、つまり所定人数以上の乗員が笑顔であるか否かを判定する。状況評価部１２は、所定人数以上の乗員が笑顔であると判定した場合（ステップＳ３１でＹｅｓ）、睡眠中の乗員を他の乗員が面白がるというイベントが発生したとして処理をステップＳ３２に進める。一方、状況評価部１２は、そうでない場合（ステップＳ３１でＮｏ）、処理をステップＳ３５に進める。 In step S31, the situation evaluation unit 12 calculates a first facial expression evaluation statistic and determines whether the first facial expression evaluation statistic satisfies the condition of situation ID 8 shown in the situation table 19, that is, whether a predetermined number of occupants or more are smiling. If the situation evaluation unit 12 determines that a predetermined number of occupants or more are smiling (Yes in step S31), it determines that an event has occurred in which other occupants are amused by the sleeping occupant, and proceeds to step S32. On the other hand, if this is not the case (No in step S31), the situation evaluation unit 12 proceeds to step S35.

ステップＳ３２において、画像処理部１６は、睡眠中の乗員を検出した前後のその乗員（ターゲット乗員）の顔のズームアップ画像を生成し、所定の記録領域にバックアップする。 In step S32, the image processing unit 16 generates zoomed-in images of the face of the sleeping occupant (target occupant) before and after the sleeping occupant is detected, and backs up the images in a specified recording area.

次にステップＳ３３において、音声認識部１５は、睡眠中の乗員を検出した前後の音声データをテキストデータに変換する。 Next, in step S33, the voice recognition unit 15 converts the voice data before and after detecting the sleeping occupant into text data.

次にステップＳ３４において、画像処理部１６は、画像処理態様テーブル２０の状況ＩＤ８の画像処理態様で、バックアップした画像データの画像に対して、ステップＳ３３で変換したテキストデータに基づく画像および評価対象状況に関連する画像のうち少なくとも一方を生成し、画像データに重畳させる画像処理を実行し、重畳画像データを生成する。そして画像処理部１６は処理をステップＳ３７に進める。 Next, in step S34, the image processing unit 16 performs image processing in which, in the image processing mode of situation ID 8 in the image processing mode table 20, at least one of an image based on the text data converted in step S33 and an image related to the situation to be evaluated is generated for the image of the backed up image data, and the image processing unit 16 superimposes the image data to generate superimposed image data. Then, the image processing unit 16 proceeds to the process of step S37.

ステップＳ３５において、状況評価部１２は、ステップＳ３２と同様に、ターゲット乗員の顔のズームアップ画像を生成し、所定の記録領域にバックアップする。 In step S35, the situation assessment unit 12 generates a zoomed-in image of the target occupant's face, similar to step S32, and backs it up in a specified recording area.

次にステップＳ３６において、画像処理部１６は、画像処理態様テーブル２０の状況ＩＤ７の画像処理態様で、バックアップした画像データの画像に対して、字幕および状況に関連する画像を重畳させる。そして画像処理部１６は処理をステップＳ３７に進める。 Next, in step S36, the image processing unit 16 superimposes subtitles and an image related to the situation onto the image of the backed up image data in the image processing mode of situation ID 7 in the image processing mode table 20. The image processing unit 16 then advances the process to step S37.

ステップＳ３７において、画像処理部１６は、重畳画像データおよびこれに対応する音声データを上書き禁止で画像記録部１７に記録し、処理を終了する。ステップＳ３７はなくてもよい。 In step S37, the image processing unit 16 records the superimposed image data and the corresponding audio data in the image recording unit 17 with overwrite protection, and ends the process. Step S37 may be omitted.
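
The branch between situation IDs 7 and 8 in this second example can be sketched in the same way. The thresholds and the `Occupant` structure are assumptions for illustration, and treating the drowsiness value as the fourth facial expression evaluation value is inferred from the order in which the expressions are described.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Occupant:
    smile_score: int       # "laughing" evaluation value, 1..5
    drowsiness_score: int  # "drowsiness" evaluation value, 1..5


def select_sleep_situation(
    occupants: List[Occupant],
    sleep_threshold: int = 4,   # assumed threshold
    smile_threshold: int = 3,   # assumed threshold
    min_smilers: int = 1,       # assumed predetermined number
) -> Optional[Tuple[int, int]]:
    """Return (situation ID, index of the target occupant), or None."""
    # S30: look for a sleeping occupant, who becomes the zoom-up target.
    target = next(
        (i for i, o in enumerate(occupants)
         if o.drowsiness_score >= sleep_threshold),
        None,
    )
    if target is None:
        return None
    # S31: count the occupants who are smiling.
    smilers = sum(1 for o in occupants if o.smile_score >= smile_threshold)
    if smilers >= min_smilers:
        return (8, target)  # others are amused by the sleeping occupant
    return (7, target)      # a sleeping occupant only
```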

図７～９は、実施形態１にかかる画像処理装置１０の画像処理態様の一例を示す図である。 Figures 7 to 9 are diagrams showing an example of the image processing mode of the image processing device 10 according to the first embodiment.

図７には、状況ＩＤ８に対応するイベントが発生した場合に生成される重畳画像データの画像ＩＭＧ１が示される。
画像処理部１６は、画像処理態様テーブル２０の、状況ＩＤ８に基づいた画像データ、例えば睡眠中の擬音を表す「ｚｚｚ」マーク画像ＭＫ１および鼻風船の画像ＭＫ２を生成し、睡眠中のターゲット乗員の顔のズームアップ画像に対して重畳させる。
そして画像処理部１６は、他の乗員の発話内容である音声データをテキストデータに変換した「起きないね。」という字幕画像データを生成し、ＳＢ１を画像の下部に重畳させる。 FIG. 7 shows an image IMG1 of the superimposed image data that is generated when an event corresponding to situation ID 8 occurs.
The image processing unit 16 generates image data based on the situation ID 8 in the image processing mode table 20, for example, a "zzz" mark image MK1 representing the onomatopoeia of sleep and a nose balloon image MK2, and superimposes them on a zoomed-up image of the face of the target occupant who is sleeping.
The image processing unit 16 then generates subtitle image data SB1 reading "He's not waking up, is he?", obtained by converting the voice data of another occupant's speech into text data, and superimposes SB1 at the bottom of the image.

図８には、状況ＩＤ５に対応するイベントが発生した場合に生成される重畳画像データの画像ＩＭＧ２が示される。
画像処理部１６は、悲しんでいるターゲット乗員の顔のズームアップ画像に対して、その目の周辺に悲しんでいることを表す涙マーク画像ＭＫ３を生成して重畳させ、額周辺に縦斜線マークＭＫ４を重畳させる。
そして画像処理部１６は、ターゲット乗員の発話内容である音声データをテキストデータに変換した「ひいいい」という字幕画像ＳＢ２を生成し、その乗員の左側に重畳させる。 FIG. 8 shows an image IMG2 of the superimposed image data that is generated when an event corresponding to situation ID 5 occurs.
The image processing unit 16 generates and superimposes a teardrop mark image MK3 representing sadness around the eyes of the zoomed-up image of the sad target occupant's face, and superimposes a vertical diagonal line mark MK4 around the forehead.
The image processing unit 16 then converts the voice data, which is the speech content of the target occupant, into text data to generate a subtitle image SB2 saying "Hiiiiii" and superimposes it on the left side of the occupant.

図９には、状況ＩＤ６に対応するイベントが発生した場合に生成される重畳画像データの画像ＩＭＧ３が示される。
画像処理部１６は、怒っているターゲット乗員の顔のズームアップ画像に対して、その額の周辺に怒っていることを表す怒りマーク画像ＭＫ５を生成して重畳させる。
そして画像処理部１６は、ターゲット乗員の発話内容である音声データをテキストデータに変換した「何だとっ！」という字幕画像ＳＢ３を生成し、怒っていることを表す吹き出しとともにその乗員の右側に重畳させる。 FIG. 9 shows an image IMG3 of the superimposed image data that is generated when an event corresponding to situation ID 6 occurs.
The image processing unit 16 generates an anger mark image MK5 representing anger and superimposes it around the forehead in the zoomed-up image of the angry target occupant's face.
The image processing unit 16 then converts the voice data, which is the speech content of the target occupant, into text data to generate a subtitle image SB3 saying "What!" and superimposes it to the right of the occupant together with a speech bubble expressing anger.

このように実施形態１によれば、画像処理装置１０は、画像データから検出される情報に基づいて対象空間の状況を評価し、状況に対応するイベントが発生した期間周辺のダイジェスト画像を生成する。これにより撮影後に、撮影中の重要な場面、特に盛り上がった場面をダイジェスト画像で振り返ることができる。 As described above, according to the first embodiment, the image processing device 10 evaluates the situation of the target space based on information detected from the image data, and generates a digest image of the period around the time when an event corresponding to the situation occurred. This allows the user to look back on important scenes during shooting, particularly exciting scenes, using the digest image after shooting.

また画像処理装置１０は、撮影した画像に対して、評価した状況に合わせた字幕演出を行う。これにより、撮影後に分かりやすいバラエティ番組のような字幕付きの画像で旅の思い出を振り返ることができる。 The image processing device 10 also applies subtitles to the captured images in accordance with the evaluated situation. This allows the user to look back on their travel memories after capturing the images with easy-to-understand subtitles, like those in a variety show.

画像処理装置１０は、対象空間の状況を評価するために、画像データから乗員の表情を検出し、単一の表情だけでなく、乗員の表情の組み合わせを考慮してもよい。これにより、単に「笑い」だけでなく、「悲しみ」および「笑い」の融合など、相反する表情が乗員間で同時に起きた等の複雑な状況についても、盛り上がった場面として画像を記録することができる。 To evaluate the situation in the target space, the image processing device 10 may detect the facial expressions of the occupants from the image data and consider combinations of the occupants' facial expressions, rather than just a single facial expression. This makes it possible to record images of not just "laughing" but also complex situations in which opposing facial expressions, such as a combination of "sadness" and "laughing," occur simultaneously among the occupants, as scenes of excitement.

＜実施形態２＞
次に図１０～１４を用いて、本発明の実施形態２について説明する。実施形態２は、状況評価部が、画像データに加えて音声データに基づいて、対象空間の状況を評価することに特徴を有する。 <Embodiment 2>
Next, a second embodiment of the present invention will be described with reference to Figures 10 to 14. The second embodiment is characterized in that the situation assessment unit assesses the situation of the target space based on audio data in addition to image data.

図１０は、実施形態２にかかる画像処理装置１０ａの構成の一例を示すブロック図である。実施形態２にかかる画像処理装置１０ａは、実施形態１にかかる画像処理装置１０と基本的に同様の構成および機能を有する。ただし画像処理装置１０ａは、状況評価部１２、状況テーブル１９および画像処理態様テーブル２０に代えて、状況評価部１２ａ、状況テーブル１９ａおよび画像処理態様テーブル２０ａを備える点で画像処理装置１０と相違する。 Figure 10 is a block diagram showing an example of the configuration of an image processing device 10a according to the second embodiment. The image processing device 10a according to the second embodiment has basically the same configuration and functions as the image processing device 10 according to the first embodiment. However, the image processing device 10a differs from the image processing device 10 in that it includes a situation evaluation unit 12a, a situation table 19a, and an image processing mode table 20a instead of the situation evaluation unit 12, the situation table 19, and the image processing mode table 20.

状況評価部１２ａは、状況評価部１２と基本的に同様の機能を有するが、画像評価部１３に加えて、音声評価部１４を有する。 The situation evaluation unit 12a has basically the same functions as the situation evaluation unit 12, but in addition to the image evaluation unit 13, it also has an audio evaluation unit 14.

音声評価部１４は、音声データから取得される特徴を評価し、当該特徴に基づいて音声評価値を算出する。たとえば音声評価値は、音声データに含まれる発話の声量の程度を示す第１音声評価値と、発話される音声を認識し、予め定められるキーワードを抽出して、その出現頻度を示す第２音声評価値とを含む。 The voice evaluation unit 14 evaluates features obtained from the voice data and calculates a voice evaluation value based on the features. For example, the voice evaluation value includes a first voice evaluation value indicating the level of volume of the speech included in the voice data, and a second voice evaluation value that recognizes the spoken voice, extracts predetermined keywords, and indicates their frequency of occurrence.
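
As a rough sketch of the two voice evaluation values, the volume level could be measured as a root-mean-square amplitude and the keyword frequency as a simple count over recognized words. Both choices are assumptions, since the description does not fix how volume or frequency is measured; the function names are hypothetical, and the words are assumed to come from a separate speech recognition step.

```python
import math
from collections import Counter


def volume_level(samples):
    """First voice evaluation value: RMS of samples normalized to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def keyword_frequency(words, keywords):
    """Second voice evaluation value: occurrences of each predetermined keyword."""
    return dict(Counter(w for w in words if w in keywords))
```

For example, `keyword_frequency(["sushi", "is", "sushi"], {"sushi"})` counts two occurrences of "sushi".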

状況評価部１２ａは、画像評価部１３が算出した表情評価統計値と、音声評価部１４が算出した音声評価値とに基づいて、対象空間の状況を評価する。したがって状況評価値は、状況評価指標として、表情評価統計値に加えて音声評価値を含む。 The situation evaluation unit 12a evaluates the situation in the target space based on the facial expression evaluation statistics calculated by the image evaluation unit 13 and the audio evaluation value calculated by the audio evaluation unit 14. Therefore, the situation evaluation value includes the audio evaluation value in addition to the facial expression evaluation statistics as a situation evaluation index.

図１１は、実施形態２にかかる状況テーブル１９ａのデータ構造の一例を示す図である。状況テーブル１９ａは、状況テーブル１９に代えて、状況評価値の１つとして音声評価値が採用される状況ＩＤ９～１０の状況のレコードである。状況テーブル１９ａは、状況テーブル１９の状況ＩＤを含んで構成されてもよい。
状況ＩＤ９～１０の状況の内容は、状況に関連する表情に加えて、発話態様またはこれらの組み合わせを示す。
たとえば状況ＩＤ９の状況の内容は、笑顔の乗員がいて、かつ乗員が所定値以上の声量（大声）で発話していることを示す。この場合、画像データに基づく第１表情評価統計値と、音声データに基づく第１音声評価値とが、状況評価値とされる。
また状況ＩＤ１０の状況の内容は、笑顔の乗員がいる期間中に、所定頻度以上で繰り返し発話される単語を示す高頻出キーワードがあることを示す。この場合、画像データに基づく第１表情評価統計値と、音声データに基づく第２音声評価値とが、状況評価値とされる。 FIG. 11 is a diagram showing an example of the data structure of the situation table 19a according to the second embodiment. The situation table 19a takes the place of the situation table 19 and contains records of the situations with situation IDs 9 to 10, in which a voice evaluation value is adopted as one of the situation evaluation values. The situation table 19a may also be configured to include the situation IDs of the situation table 19.
The contents of the situations with situation IDs 9 to 10 indicate, in addition to the facial expressions related to the situation, a manner of speech or a combination of these.
For example, the content of the situation of situation ID 9 indicates that there is a smiling occupant and that the occupant is speaking at a volume (loud voice) equal to or greater than a predetermined value. In this case, the first facial expression evaluation statistical value based on the image data and the first voice evaluation value based on the voice data are set as the situation evaluation value.
The content of the situation of situation ID 10 indicates that there is a high frequency keyword that indicates a word that is repeatedly uttered at a frequency equal to or higher than a predetermined frequency during a period when a smiling occupant is present. In this case, the first facial expression evaluation statistic based on the image data and the second voice evaluation value based on the voice data are set as the situation evaluation value.
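
The conditions of situation IDs 9 and 10 combine a facial expression statistic with a voice evaluation value, which might be checked as sketched below. The threshold values and the name `detect_voice_situations` are assumptions for illustration.

```python
def detect_voice_situations(smilers, volume, keyword_counts,
                            volume_threshold=0.3, frequency_threshold=3):
    """Return the situation IDs (9 and/or 10) whose conditions are met.

    smilers: number of smiling occupants (facial expression statistic).
    volume: first voice evaluation value.
    keyword_counts: second voice evaluation value as {keyword: count}.
    Both thresholds are assumed values.
    """
    ids = []
    if smilers >= 1:
        # ID 9: a smiling occupant is present and the speech is loud.
        if volume >= volume_threshold:
            ids.append(9)
        # ID 10: a keyword is repeated at or above the set frequency.
        if any(c >= frequency_threshold for c in keyword_counts.values()):
            ids.append(10)
    return ids
```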

図１２は、実施形態２にかかる画像処理態様テーブル２０ａのデータ構造の一例を示す図である。
画像処理態様テーブル２０ａは、画像処理態様テーブル２０に加えて、状況ＩＤ９～１０の状況のレコードを含む。 FIG. 12 is a diagram showing an example of the data structure of the image processing mode table 20a according to the second embodiment.
The image processing mode table 20a includes, in addition to the image processing mode table 20, records of the situations with the situation IDs 9 to 10.

なお画像処理部１６は、表情評価統計値に加えてまたは代えて、音声評価値に基づいて、重畳される字幕の態様を決定してよい。これにより盛り上がりの程度に応じた字幕演出をすることができる。 The image processing unit 16 may determine the style of the subtitles to be superimposed based on the audio evaluation value in addition to or instead of the facial expression evaluation statistics. This allows the subtitles to be displayed in a way that matches the level of excitement.

図１３は、実施形態２にかかる画像処理装置１０ａの処理を示すフローチャートである。図１３に示すステップは、図４に示すステップＳ１２，１３，１５に代えて、ステップＳ４０～４３を含む。図４に示すステップと同一のステップについては、適宜説明を省略する。 Figure 13 is a flowchart showing the processing of the image processing device 10a according to the second embodiment. The steps shown in Figure 13 include steps S40 to S43 instead of steps S12, S13, and S15 shown in Figure 4. Explanations of steps that are the same as those shown in Figure 4 will be omitted as appropriate.

なおステップＳ１４に示す音声データのテキスト変換処理については，ステップＳ１０および１１の間で実行されてよい。 The voice data text conversion process shown in step S14 may be performed between steps S10 and S11.

ステップＳ４０において、状況評価部１２ａの音声評価部１４は、取得部１１から供給された音声データから各種音声評価値を算出する。本ステップは、ステップＳ１１の前に行われてもよく、ステップＳ１１と並行して行われてもよい。 In step S40, the voice evaluation unit 14 of the situation evaluation unit 12a calculates various voice evaluation values from the voice data supplied from the acquisition unit 11. This step may be performed before step S11 or in parallel with step S11.

ステップＳ４１において、状況評価部１２ａは、評価対象状況が状況ＩＤ９または１０に該当する場合、表情評価統計値に加えて、音声評価値を状況評価値とする。 In step S41, if the situation to be evaluated corresponds to situation ID 9 or 10, the situation evaluation unit 12a sets the voice evaluation value as the situation evaluation value in addition to the facial expression evaluation statistics.

ステップＳ４２において、状況評価部１２ａは、状況テーブル１９ａを用いて評価対象状況のイベントが発生したか否かを判定する。状況評価部１２ａは、イベントが発生したと判定した場合（ステップＳ４２でＹｅｓ）、状況評価値の情報を画像処理部１６に供給し、処理をステップＳ４３に進める。一方、状況評価部１２ａは、そうでない場合（ステップＳ４２でＮｏ）、処理をステップＳ１０に戻す。 In step S42, the situation evaluation unit 12a judges whether or not an event of the evaluation target situation has occurred by using the situation table 19a. If the situation evaluation unit 12a judges that an event has occurred (Yes in step S42), it supplies information on the situation evaluation value to the image processing unit 16 and proceeds to step S43. On the other hand, if the situation evaluation unit 12a does not judge that an event has occurred (No in step S42), it returns the process to step S10.

ステップＳ４３において、画像処理部１６は、イベント発生期間の画像データに対して、画像処理態様テーブル２０ａを用いて、評価対象状況に応じた画像処理を実行し、重畳画像データを生成する。 In step S43, the image processing unit 16 uses the image processing mode table 20a to perform image processing on the image data during the event occurrence period according to the evaluation target situation, and generates superimposed image data.

図１４は、実施形態２にかかる画像処理装置１０ａの画像処理態様の一例を示す図である。本図には、状況ＩＤ１０に対応するイベントが発生した場合に生成される重畳画像データの画像ＩＭＧ４が示される。状況ＩＤ１０においては、笑顔の乗員がいる中でキーワードが閾値以上の頻度で出現する場合に、状況評価部１２ａは、話が盛り上がっており、そのキーワードが話のキーポイントであると判定する。 Figure 14 is a diagram showing an example of an image processing mode of the image processing device 10a according to the second embodiment. This figure shows an image IMG4 of superimposed image data that is generated when an event corresponding to situation ID 10 occurs. In situation ID 10, when a keyword appears with a frequency equal to or greater than a threshold value in the presence of smiling passengers, the situation evaluation unit 12a determines that the conversation is lively and that the keyword is a key point of the conversation.

The image processing unit 16 then converts the captured image data into, for example, a solid black image. The image processing unit 16 generates superimposed image data in which the frequently occurring keyword "sushi" is rendered as subtitle data in an emphasized display mode, for example by enlarging the character size, using a font different from that of the other words, changing the color, or increasing the color density, and superimposes it on the converted solid black image.
When such an image is played back as a video, the solid black image with the superimposed image data may be played back together with the corresponding audio. In that case, the audio may also be played back so as to emphasize the frequently occurring keyword, for example by increasing its volume or converting its frequency.
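The emphasis described above can be sketched as follows, again as a non-limiting illustration. The style values, the gain, and both function names are hypothetical. Actual rendering onto the solid black image and frequency conversion are outside the scope of this sketch.

```python
BASE_STYLE = {"size": 32, "font": "regular", "color": "#FFFFFF"}
EMPHASIS_STYLE = {"size": 64, "font": "bold", "color": "#FFD400"}


def style_subtitle(words, keywords):
    """Pair each subtitle word with a style, emphasizing frequent keywords."""
    emphasized = set(keywords)
    return [
        (word, dict(EMPHASIS_STYLE if word in emphasized else BASE_STYLE))
        for word in words
    ]


def boost_keyword_audio(samples, spans, gain=1.5, limit=1.0):
    """Scale samples inside keyword spans by gain, clipped to [-limit, limit].

    spans: (start, end) sample index pairs, end exclusive, marking where the
    keyword is spoken. Samples outside the spans are returned unchanged.
    """
    out = list(samples)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(out))):
            out[i] = max(-limit, min(limit, out[i] * gain))
    return out
```

Keeping the styling decision separate from rendering lets the same keyword list drive both the subtitle emphasis and the audio emphasis.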

As described above, according to the second embodiment, the image processing device 10a calculates the situation evaluation value based on features obtained from the voice data, such as voice volume and keyword repetition, in addition to the image data. This allows the image processing device 10a to detect lively scenes accurately and to apply a wide variety of image effects.
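As one possible illustration of a voice volume feature, the root-mean-square level of a block of samples can be computed as below. Using RMS is an assumption made for this sketch. The embodiment does not state how voice volume is measured, and `rms_volume` is a hypothetical name.

```python
import math


def rms_volume(samples):
    """Return the root-mean-square level of a block of audio samples."""
    if not samples:
        return 0.0  # an empty block carries no energy
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

A value of this kind could serve as one input to the voice evaluation value, alongside the keyword repetition count.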

In the above embodiments, the present invention has been described as a hardware configuration, but the present invention is not limited to this. The present invention can also realize any of the processes by causing a processor to execute a computer program.

In the above example, the program can be stored and supplied to a computer using various types of non-transitory computer readable media. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (e.g., flexible disks, magnetic tapes, hard disk drives), magneto-optical recording media (e.g., magneto-optical disks), CD-ROMs (Read Only Memory), CD-Rs, CD-R/Ws, DVDs (Digital Versatile Discs), BDs (Blu-ray (registered trademark) Discs), and semiconductor memories (e.g., mask ROMs, PROMs (Programmable ROMs), EPROMs (Erasable PROMs), flash ROMs, and RAMs (Random Access Memory)). The program may also be supplied to the computer by various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to the computer via a wired communication path, such as an electric wire or an optical fiber, or via a wireless communication path.

In the above embodiments, the computer is configured as a computer system including a personal computer, a word processor, or the like. However, the computer is not limited to this and may also be configured as a server of a LAN (local area network), a host for computer (personal computer) communication, a computer system connected to the Internet, or the like. It is also possible to distribute the functions among the devices on a network so that the network as a whole constitutes the computer.

The processes in the systems and methods shown in the claims, the specification, and the drawings may be executed in any order, unless the order is explicitly indicated by expressions such as "before" or "prior to," and unless the output of an earlier process is used in a later process. Even if an operational flow in the claims, the specification, or the drawings is described using "first," "next," and the like for convenience, this does not mean that the processes must be performed in that order.

The present invention is not limited to the above embodiments and can be modified as appropriate without departing from its spirit. Part or all of the above embodiments can also be described as in the following appendices, but are not limited to them.
(Appendices)
(Appendix 1)
An image processing system comprising:
an acquisition unit that acquires voice data and image data of a target space;
a voice recognition unit that converts the voice data into text data;
a situation evaluation unit that evaluates a situation of the target space based on the image data and calculates a situation evaluation value of the situation in the target space; and
an image processing unit that, when an event occurs indicating a situation in which the situation evaluation value satisfies a predetermined condition, generates superimposed image data by superimposing at least one of an image based on the text data and an image related to the situation on an image of the image data for a predetermined period including a period during which the event occurs.
(Appendix 2)
The image processing system according to Appendix 1, wherein
the situation evaluation unit evaluates a facial expression of a captured person based on the image data and calculates a facial expression evaluation statistic of the facial expression related to the situation in the target space, and
the situation evaluation value includes the facial expression evaluation statistic.
(Appendix 3)
The image processing system according to Appendix 1 or 2, wherein the image processing unit determines, based on the situation evaluation value, a mode of the image based on the text data that is to be superimposed.
(Appendix 4)
The image processing system according to any one of Appendices 1 to 3, further comprising
a voice evaluation unit that calculates a voice evaluation value based on features acquired from the voice data, wherein
the situation evaluation value includes the voice evaluation value.
(Appendix 5)
An image processing method comprising:
acquiring voice data and image data of a target space;
converting the voice data into text data;
evaluating a situation of the target space based on the image data and calculating a situation evaluation value of the situation in the target space; and
when an event occurs indicating a situation in which the situation evaluation value satisfies a predetermined condition, generating superimposed image data by superimposing at least one of an image based on the text data and an image related to the situation on an image of the image data for a predetermined period including a period during which the event occurs.
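The "predetermined period including a period during which the event occurs" can be pictured with a small sketch. The padding lengths, the parameter names, and the clamping to the recording length are assumptions made for illustration only; the appendices do not fix how the period is chosen.

```python
def clip_window(event_start, event_end, pre_roll=2.0, post_roll=2.0, total=None):
    """Return (start, end) in seconds covering the event plus padding.

    pre_roll and post_roll extend the event period on each side. The start is
    clamped to 0 and, when total is given, the end is clamped to the length of
    the recording.
    """
    if event_end < event_start:
        raise ValueError("event_end must not precede event_start")
    start = max(0.0, event_start - pre_roll)
    end = event_end + post_roll
    if total is not None:
        end = min(end, total)
    return start, end
```

The superimposed image data would then be generated only for frames whose timestamps fall inside the returned window.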

REFERENCE SIGNS LIST
1, 1a Image processing system
10, 10a Image processing device
11 Acquisition unit
12, 12a Situation evaluation unit
13 Image evaluation unit
14 Voice evaluation unit
15 Voice recognition unit
16 Image processing unit
17 Image recording unit
18 Table storage unit
19, 19a Situation table
20, 20a Image processing mode table
30 Camera
40 Microphone

Claims

1. An image processing device comprising:
an acquisition unit that acquires voice data and image data of a target space, which is an interior space of a moving body;
a voice recognition unit that converts the voice data into text data;
a situation evaluation unit that evaluates a situation of the target space based on the image data and calculates a situation evaluation value of the situation in the target space;
an image storage unit that stores in advance an image from when an abnormality occurs in the moving body; and
an image processing unit that, when an event occurs indicating a situation in which the situation evaluation value satisfies a predetermined condition, generates superimposed image data by superimposing at least one of an image based on the text data and an image related to the situation on an image of the image data for a predetermined period including a period in which the event occurs, and records the superimposed image data in the image storage unit separately from the images from when an abnormality occurs in the moving body that have been recorded in advance.

2. The image processing device according to claim 1, wherein
the acquisition unit acquires image data of a person present in the target space, and
the image processing unit generates, from the image data, an image in which the person is zoomed in on, and generates superimposed image data in which at least one of an image based on the text data and an image related to the situation is superimposed on the zoomed-in image.

3. The image processing device according to claim 1, wherein the image processing unit determines, based on the situation evaluation value, a mode of the image based on the text data that is to be superimposed.

4. The image processing device according to claim 1, further comprising
a voice evaluation unit that calculates a voice evaluation value based on features acquired from the voice data, wherein
the situation evaluation value includes the voice evaluation value.

5. An image processing method comprising:
acquiring voice data and image data of a target space, which is an interior space of a moving body;
converting the voice data into text data;
evaluating a situation of the target space based on the image data and calculating a situation evaluation value of the situation in the target space; and
when an event occurs indicating a situation in which the situation evaluation value satisfies a predetermined condition, generating superimposed image data by superimposing at least one of an image based on the text data and an image related to the situation on an image of the image data for a predetermined period including a period during which the event occurs, and recording the superimposed image data in an image storage unit separately from images from when an abnormality occurs in the moving body that have been recorded in advance in the image storage unit.