JP7430088B2

JP7430088B2 - speech control device

Info

Publication number: JP7430088B2
Application number: JP2020052773A
Authority: JP
Inventors: 瞳山口; 純洙權
Original assignee: Fujita Corp
Current assignee: Fujita Corp
Priority date: 2020-03-24
Filing date: 2020-03-24
Publication date: 2024-02-09
Anticipated expiration: 2040-03-24
Also published as: JP2021152741A

Description

The present invention relates to a speech control device that can be applied to, for example, an interpersonal robot having a speech function.

Conventionally, there is a known technique that detects a person and causes a robot or the like to speak (see, for example, Patent Document 1). In this prior art, when an interactive robot is used in an environment where multiple people come and go, the robot determines each person's degree of interest, that is, whether the person intends to interact or is interested, and narrows down the candidates for dialogue in advance. To do so, the prior art detects a person from image information supplied by an imaging device built into the robot, tracks the detected person across multiple images from the imaging device, calculates the tracked person's degree of interest from changes in the orientation of the person's face and torso across those images, and selects dialogue candidates based on the calculated degree of interest.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2019-217558 (JP2019-217558A)

The prior art described above extracts a rectangular region containing a person from the image information and then performs complex processing: detecting the head region within that region, detecting the face, estimating the head orientation from information within the head region, and further estimating the orientation of the person's torso. Such processing can be suitably executed by an artificial intelligence model having a neural network that uses convolution filters.

However, if the final decision to call out is made only after accurately estimating not just the person region but also the head region, the face position, the head orientation, and even the torso orientation, the artificial intelligence applied must be a highly accurate, highly functional model. Processing time is correspondingly long, and a delay arises between the input of image information and the output of the determination result. On the other hand, simply substituting a type of artificial intelligence model that processes faster than a high-accuracy model sacrifices determination accuracy, so the required level cannot be met.

The present invention therefore provides a technique that can control speech appropriately while speeding up person determination.

The present invention provides a speech control device. This speech control device performs determination (detection) of a person in an image using an artificial intelligence model whose determination capability allows high-speed processing. Speech control using such a high-speed artificial intelligence model has little delay in the response time from person determination to output of the speech sound. This offers the major advantage that, particularly for speech (calling out) directed at a person moving at random, there is no noticeable delay in timing and the person can reliably be made to notice the speech content. However, accuracy is sacrificed as a trade-off for the faster determination, so a method of compensating for that loss is needed.

That is, even when an imaging area in which a person is actually present is imaged, the results of determining the person from those images include, in a certain proportion, both successes (person determined) and failures (no person determined), and the number and order of their occurrences are irregular. If the output of the speech sound is controlled on the assumption that every determination result is correct, the device may repeat the same utterance to the same person (call out repeatedly) or may fail to speak even though a person is present.

The speech control device of the present invention therefore applies a filtering technique to the person determination results. That is, rather than feeding the series of determination results directly into the control of speech output, it secondarily generates a deemed (presumed) person detection result from the determination results obtained. Whereas the series of determination results swings sensitively between success and failure (changing from one extreme to the other), the deemed detection result is smoothed so that it is deemed, with a certain degree of certainty, to be either "detected" or "not detected (undetected)".

Then, for the person indicated by such a deemed detection result, the speech sound is output at the timing when the person is determined to have entered a predetermined detection area. Because the detection result used for speech output has been filtered (smoothed), problems such as the same utterance being repeated to the same person, or no utterance being made because a determination failed, can be reliably prevented.

The detection area can be defined, for example, based on a distance at which, given the positional relationship between the speech source and the person, the speech content is considered likely to reach the person and to be easy to hear. Thus, even in an environment where unspecified people move in random directions through arbitrary places (for example, a construction site), the device can exploit the responsiveness of person determination with a high-speed model and output the speech sound at the timing when the distance to the person is optimal. This makes it easier for the person to notice that they have been spoken to and easier to hear what was said.

The filtering technique used by the speech control device includes the following preferred aspects.
(1) A detection result that deems a person detected or not detected is generated from the ratio of successes (person determined) to failures (no person determined) in the series of determination results from the high-speed model. For example, looking at a group of a certain number of consecutive determination results, if successes make up at least a predetermined proportion of the group, a detection result deemed "person detected" is generated. Conversely, if successes do not reach the predetermined proportion within the group, a detection result deemed "no person detected (undetected)" is generated. Therefore, even if the high-speed model's determination results swing temporarily (momentarily), the generated detection result does not swing widely and is smoothed.

(2) When the high-speed model yields a successful determination (person determined) a predetermined number of times in succession, a detection result that deems the person to be in the detected state is generated; when, after that, a successful determination is not obtained a predetermined number of times in succession, a detection result that deems the person to be in the undetected state is generated. In this case, once the high-speed model has successfully determined the person the predetermined number of consecutive times, the state thereafter is "person detected". Even if an unsuccessful determination (no person determined) occurs in this state, the filtered detection result remains "person detected". Therefore, the detection result is smoothed and does not swing because of fewer failures than the predetermined number.
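The two filtering aspects (1) and (2) can be sketched as small state machines. This is only an illustrative sketch, not the implementation described in the patent: the class names, the window length, the ratio threshold, and the consecutive counts `n_on` and `n_off` are all assumptions chosen for the example, since the text only says "a certain number" and "a predetermined proportion/number".

```python
from collections import deque


class RatioFilter:
    """Aspect (1): deem 'detected' when the share of successes in the
    last `window` raw results reaches `min_ratio` (both values assumed)."""

    def __init__(self, window=6, min_ratio=0.5):
        self.window = window
        self.min_ratio = min_ratio
        self.history = deque(maxlen=window)  # keeps only the newest results

    def update(self, success):
        self.history.append(bool(success))
        # Divide by the full window so a partly filled window counts
        # missing samples as failures (a conservative assumption).
        return sum(self.history) / self.window >= self.min_ratio


class ConsecutiveFilter:
    """Aspect (2): switch to 'detected' after `n_on` consecutive successes
    and back to 'undetected' after `n_off` consecutive failures."""

    def __init__(self, n_on=2, n_off=3):
        self.n_on = n_on
        self.n_off = n_off
        self.detected = False
        self.run_success = 0
        self.run_failure = 0

    def update(self, success):
        if success:
            self.run_success += 1
            self.run_failure = 0
            if self.run_success >= self.n_on:
                self.detected = True
        else:
            self.run_failure += 1
            self.run_success = 0
            if self.run_failure >= self.n_off:
                self.detected = False
        return self.detected


# Illustrative raw results from a fast but noisy detector.
raw = [True, True, False, True, False, False, True, True]
ratio_out = [f(r) for f in [RatioFilter().update] for r in raw]
consecutive_out = [f(r) for f in [ConsecutiveFilter().update] for r in raw]
```

With these assumed parameters, both filters settle into the "detected" state and stay there despite the isolated failures in `raw`. The ratio filter reacts more slowly at start-up because the window must fill first; the consecutive filter reacts after two successes and tolerates up to two failures in a row.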

In either case, with filtering aspects (1) and (2) above, a small number of determinations by the high-speed model may fail (no person determined) even after a detection result deemed "person detected" has been generated. Left as is, a person detection result based on the high-speed model's determination result could not be generated at that moment. Therefore, when an unsuccessful determination is obtained after a successful one, the speech control device generates the deemed person detection result based on the last (most recent) successful determination result. This prevents gaps (dropouts) after a detection result deemed "person detected" has been generated, and allows output control of the speech sound to be performed stably.
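The gap-filling step can be sketched as follows: while the filtered state says a person is present, a frame with no raw result reuses the most recent successful one. This is a minimal sketch under stated assumptions; the `Box` layout, the class name `LastBoxHolder`, and the choice to discard the stored box once the filtered state becomes "undetected" are all hypothetical and not taken from the patent text.

```python
from typing import Optional, Tuple

# Hypothetical bounding-box layout: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


class LastBoxHolder:
    """Supplies a bounding box for frames where the raw determination
    failed but the filtered state is still 'person detected'."""

    def __init__(self) -> None:
        self.last_box: Optional[Box] = None

    def update(self, filtered_detected: bool, box: Optional[Box]) -> Optional[Box]:
        if box is not None:
            # Raw determination succeeded: remember the newest box.
            self.last_box = box
        elif not filtered_detected:
            # Person is considered gone: drop the stale box (assumption).
            self.last_box = None
        # Report a box only while the filtered state is 'detected'.
        return self.last_box if filtered_detected else None


holder = LastBoxHolder()
first = holder.update(True, (10, 20, 30, 40))   # raw success
held = holder.update(True, None)                # raw failure, box reused
gone = holder.update(False, None)               # filtered state: undetected
```

The reused box is only as fresh as the last success, so any position derived from it lags behind a moving person; the short runs of failures tolerated by the filter keep that lag small.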

According to the present invention, speech can be controlled appropriately.

A diagram showing an example of an application scene of the speech control device.
A diagram illustrating a scene in which the mobile robot RB outputs speech within the construction site CS.
A block diagram showing a configuration example of the speech control device 100 of one embodiment.
A diagram showing an overview of processing by the calling system 110.
A diagram showing an overview of processing (1) by the filtering unit 142.
A diagram showing an overview of processing (2) by the filtering unit 142.
A diagram showing an overview of processing by the detection area determination unit 140.
A flowchart showing an example procedure for filtering processing (1).
A flowchart showing an example procedure for filtering processing (2).
A flowchart showing an example procedure for the calling voice output processing.

Embodiments of the present invention are described below with reference to the drawings. The following embodiments describe an example in which the speech control device is applied to voice output by a mobile robot (self-propelled robot), but the present invention is not limited to this example.

FIG. 1 is a diagram showing an example of an application scene of the speech control device. This embodiment assumes use at, for example, a construction site CS of a building such as a large office building, a condominium, a medical facility, or a welfare facility. At this construction site CS, the building structure (concrete beams BM, walls WL, floors FL, columns CL, and so on) has been completed to some extent, and people (workers and others) can walk around inside. Although not shown in FIG. 1, the construction site CS also contains passages, rooms, elevator shafts, stairwells, and the like in addition to open spaces.

For example, a self-propelled mobile robot RB is deployed at this construction site CS. The mobile robot RB can move around the construction site CS on, for example, four wheels WH. The mobile robot RB can also image its surroundings using a built-in IP camera 112, and can collect sound and emit sound (output speech) using a microphone/speaker 128.

Information that the mobile robot RB obtains while moving around the construction site CS is uploaded via wireless communication to, for example, a cloud computer. The mobile robot RB can also update its system by downloading update information from the cloud computer as needed. Such a mobile robot RB is equipped with a known autonomous movement control system and environment sensing system, many of which are already available, so a detailed description is omitted. The mobile robot RB may also be of a legged (walking) type.

The speech control device of this embodiment suitably realizes control of speech output by the mobile robot RB given in this application example. Hereinafter, speech output by the mobile robot RB is also referred to as "calling out".

FIG. 2 is a diagram illustrating a scene in which the mobile robot RB outputs speech within the construction site CS. The mobile robot RB uses various sensors and AI (artificial intelligence) to recognize the date and time, the surrounding environment, and people, and calls out in a manner suited to each person's situation and to the surrounding weather conditions and environment relevant to the construction work.

(A) in FIG. 2: The mobile robot RB recognizes a worker within the construction site CS, for example, and selects speech content suited to the date and time, the environment, and the situation of the person being addressed. In this example, judging comprehensively from circumstances such as the person standing still, the current time being daytime, and the ambient temperature exceeding a certain threshold, the robot calls out with content such as "Hello. It's hot, so please drink some water." The robot can also identify the individual by face recognition and call out with the person's name attached, such as "Mr./Ms. ○○".

(B) in FIG. 2: The mobile robot RB also recognizes, for example, a worker within the construction site CS together with construction-related information. In this example, as construction-related information, the robot judges that the person has climbed a scaffold SC and is working at height, and calls out with content such as "That's dangerous! Please work carefully."

Compared with having the mobile robot RB call out with a fixed utterance, this calling mechanism contributes greatly to improved safety. That is, with a pattern in which the mobile robot RB goes around the construction site CS and "mechanically calls out with fixed speech content whenever it recognizes a person", the speech content hardly registers with a person who is working and does not raise their attention. In contrast, a calling pattern that gives workers spoken guidance on specific health information, hazard information, and construction-related information suited to the situation at hand does draw the target person's attention and contributes more to improved safety.

[Balance between processing speed and accuracy]
The issue addressed by the speech control device of this embodiment is the balance between the processing speed and the accuracy of person detection by the mobile robot RB. That is, when the mobile robot RB recognizes people (those involved in the work) at various locations while moving autonomously within the construction site CS, it must output speech at an appropriate timing each time. The timing at which to output speech is determined by the positional relationship with the person when the mobile robot RB is taken as the speech source; specifically, it depends on the distance to the person. However, people do not always stay in one place; they move to do their work, and the mobile robot RB is also moving autonomously. Consequently, when the mobile robot RB determines (detects) a person and calls out based on the positional relationship, if recognizing the person takes too long, the person moves on in the meantime and the call is late.

One approach is therefore to speed up the person detection processing. For person recognition by the mobile robot RB, an artificial intelligence model that determines a person from images captured by the IP camera 112 is used. Applying an artificial intelligence model with a higher processing speed makes it possible to determine (detect) a person in an image almost instantly, but it is also true that the faster the model, the lower its determination accuracy. Consequently, using an artificial intelligence model specialized for high-speed processing introduces uncertainty into person determination (intuitively, "flicker" or "jitter"), which can cause the robot to call out repeatedly or, conversely, not to call out at all. On the other hand, person determination with a high-speed model has the characteristic that, although the detection rate is lower, the delay is small and the number of determinations per unit time is several times that of a high-accuracy model.

In view of these characteristics, this embodiment builds a mechanism that compensates for the inaccuracy introduced by an artificial intelligence model specialized for high-speed processing and allows the mobile robot RB to call out optimally. The calling mechanism used in this embodiment is described below.

[Configuration of speech control device]
FIG. 3 is a block diagram showing a configuration example of the speech control device 100 of one embodiment. Note that FIG. 3 also shows some components of the mobile robot RB.

The speech control device 100 is built around a calling system 110. The calling system 110 takes signals from the IP camera 112 and the microphone/speaker 128 as input, performs AI processing (with the high-speed model) and various computations internally, and then controls the output of speech from the microphone/speaker 128.

The microphone/speaker 128 is used, for example, to measure the ambient noise level and to output speech from the mobile robot RB. The microphone/speaker 128 may also be of a separate-unit configuration (with the microphone and the speaker as separate units).

The IP camera 112 is used to image the surrounding environment, including people. A known commercially available product, for example, can be used as the IP camera 112. The IP camera 112 is a network camera with so-called pan, tilt, and zoom (PTZ) functions, but this embodiment does not specifically use the PTZ functions (although it may). The IP camera 112 is built into the body (for example, the head) of the mobile robot RB (see FIG. 1). Here, the IP camera 112 is oriented toward the front of the mobile robot RB in its direction of travel.

An AI processing acceleration device 114 is also attached to the calling system 110. A known commercially available product, for example, can be used as the AI processing acceleration device 114, which helps speed up the AI processing executed inside the calling system 110.

The calling system 110 cooperates with a control unit 130 of the mobile robot RB. The control unit 130 controls a movement device 132 of the mobile robot RB in cooperation with the calling system 110. For example, when the calling system 110 calls out, the control unit 130 stops the mobile robot RB or adjusts its positional relationship with the target person. Alternatively, the calling system 110 may call out while the control unit 130 keeps the mobile robot RB moving.

The calling system 110 can be implemented using computer equipment including, for example, a CPU (central processing unit) and its peripherals (not shown). The calling system 110 may be separate hardware added to the system of the mobile robot RB, or software installed on hardware that the mobile robot RB already has.

The calling system 110 includes various functional blocks, for example a person determination unit 136, a detection area determination unit 140, a filtering unit 142, and a computation unit 122. These functional blocks can be realized by, for example, AI processing or software processing performed with a computer program. In this embodiment, a high-speed AI model is used for the processing of the person determination unit 136. The functional blocks execute their processing while cooperating with one another through an internal bus (virtual bus) of the calling system 110.

The calling system 110 also includes a storage unit 124 and an output device 126. The storage unit 124 is, for example, a semiconductor memory or a magnetic recording device. The storage unit 124 stores, for example, audio data for the speech content that the calling system 110 causes the mobile robot RB to output. The output device 126 is, for example, a driver amplifier that drives the microphone/speaker 128. The audio data can be updated as appropriate.

FIG. 4 is a diagram showing an overview of processing by the calling system 110. The specific processing is described in detail later using separate flowcharts.

For example, as shown in (A) to (H) in FIG. 4, imaging signals from the IP camera 112 built into the mobile robot RB (omitted from FIG. 4) are input to the calling system 110. The IP camera 112 captures images continuously (for example, at 30 to 60 frames per second (fps)), and the frame images are input continuously to the calling system 110. For simplicity, only a thinned-out subset of frames is shown here (and likewise hereafter).

[Imaging area]
As shown in the central region of FIG. 4, the imaging area is defined by the angle of view of the IP camera 112 (for example, about 64° left and right in the horizontal direction, and about 28° upward and about 10° downward in the vertical direction). A frame image captures the surrounding environment that falls within this angle of view (field of view). The range (angles) of the imaging area is not limited to this example.

[Detection area]
The calling system 110 defines a detection area DA (the range shown in gray in FIG. 4) in advance within the imaging area. The detection area DA is, for example, a fixed range whose reference point is the center of the mobile robot RB (the point from which the IP camera 112 captures images); here it is a roughly fan-shaped band defined by radii R1 to R3 (for example, 2 m to 5 m). The detection area DA includes the speech point, at the distance considered optimal (for example, 4 m) for calling out from the mobile robot RB. The distance to the speech point and the extent of the detection area DA are not limited to this example.
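A membership test for a fan-shaped band like the detection area DA can be sketched as below. This is only an illustration under assumptions: it presumes the person's position is already available as ground-plane coordinates relative to the robot (how that position is obtained is not covered here), the function and parameter names are hypothetical, and the 32° half-angle is one possible reading of the horizontal angle of view given as an example.

```python
import math


def in_detection_area(x_m, y_m, r_min_m=2.0, r_max_m=5.0, half_angle_deg=32.0):
    """Return True if a point lies inside an annular sector in front of the robot.

    x_m: forward distance from the robot center in metres (assumed frame).
    y_m: leftward offset from the robot center in metres (assumed frame).
    r_min_m, r_max_m: inner and outer radii (2 m and 5 m are the example values).
    half_angle_deg: half of the horizontal opening angle (an assumption).
    """
    distance = math.hypot(x_m, y_m)
    bearing_deg = math.degrees(math.atan2(y_m, x_m))  # 0 degrees = straight ahead
    within_band = r_min_m <= distance <= r_max_m
    within_view = abs(bearing_deg) <= half_angle_deg
    return within_band and within_view


straight_ahead_4m = in_detection_area(4.0, 0.0)   # at the example speech distance
too_close = in_detection_area(1.0, 0.0)           # inside the inner radius
off_to_the_side = in_detection_area(0.0, 4.0)     # right distance, outside the view
```

Keeping the radii and the angle as parameters reflects the statement that the extent of the detection area is not limited to the example values.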

[Person determination unit]
The person determination unit 136 performs person determination on the consecutive frame images using the high-speed AI model. Person determination is performed by image recognition processing using, for example, a convolutional neural network. Here, with the support of the AI processing acceleration device 114, people can be determined at high speed, for example at a rate of several times (3 to 4 times) per second or more.

[Determination accuracy]
As noted above, however, the person determination results from the high-speed AI model contain a mixture of successful and unsuccessful samples. For example, in frame images (A) and (B) in FIG. 4, the image region in which a person was determined is indicated by a rectangular frame (bounding box) drawn with a dash-dot line, which means that the person determination unit 136 succeeded in determining (detected) the person. In the next frame image, (C) in FIG. 4, however, the bounding box has disappeared, which means that the person determination unit 136 failed to determine the person (undetected).

Likewise, the person is successfully determined (detected) in frame image (D) in FIG. 4, but determination fails (undetected) in both of the following frame images (E) and (F). The person is again successfully determined (detected) in frame images (G) and (H), but determination failed (undetected) twice between (D) and those frames.

このような場合、人物判定部１３６で得られた一連の判定結果をそのまま声掛けの制御に用いようとすると、移動ロボットＲＢでは、図４中の中央領域に「検出」を付した各位置の実線で示される人物については認識（検出）できているが、「未検出」を付した各位置の二点鎖線で示される人物については認識（検出）できていないことになる。すなわち、移動ロボットＲＢからは、（Ａ）のフレーム画像の位置で認識（検出）していた人物が（Ｂ）のフレーム画像の位置に移動した後、途中が抜けて（Ｄ）のフレーム画像の位置に大きく移動し、次の瞬間（Ｇ）のフレーム画像の位置に大きく移動したように認識されることになる。 In such a case, if the series of determination results obtained by the person determination unit 136 were used as they are to control the greeting, the mobile robot RB would recognize (detect) the person drawn with solid lines at each position marked "detected" in the central region of FIG. 4, but would not recognize (detect) the person drawn with two-dot chain lines at each position marked "undetected". In other words, from the viewpoint of the mobile robot RB, the person recognized (detected) at the position of frame image (A) moves to the position of frame image (B), then, with the intermediate position missing, appears to jump to the position of frame image (D), and at the next moment appears to jump again to the position of frame image (G).

このような人物の判定（検出）結果からダイレクトに移動ロボットＲＢから声掛けさせると、例えば（Ｄ）のフレーム画像の位置で声掛けした後で、（Ｇ）のフレーム画像の位置でも同じ内容を声掛けするといった連呼の問題が発生する。これでは、せっかくの声掛けが人物に対する煩わしさや違和感となってしまう。 If the mobile robot RB were made to call out directly on the basis of such person determination (detection) results, a repeated-call problem would arise: for example, after calling out at the position of frame image (D), the robot would call out the same content again at the position of frame image (G). The greeting would then become annoying or feel unnatural to the person.

あるいは、（Ｅ），（Ｆ）のフレーム画像の位置まで人物が接近しているにもかかわらず、移動ロボットＲＢからは人物が判定（検出）できていないため、何も声掛けしないという無反応の問題が発生する。これでは、せっかく（Ａ）のフレーム画像のような遠方の位置から人物の存在を判定（検出）できていたにも関わらず、適切な位置関係となったときに声掛けする機会を逸したことになる。特に（Ｆ）のフレーム画像の位置は検出エリアＤＡ内であるため、ここで声掛けしていないのは好ましくない。 Alternatively, a no-response problem arises: although the person has approached as far as the positions of frame images (E) and (F), the mobile robot RB has not determined (detected) the person and therefore says nothing. In that case, even though the presence of the person was determined (detected) from a distant position such as that of frame image (A), the opportunity to call out once the positional relationship became appropriate is lost. In particular, because the position of frame image (F) is inside the detection area DA, failing to call out there is undesirable.

〔フィルタリング部〕
このため本実施形態では、フィルタリング部１４２による処理を用いる。図５及び図６は、フィルタリング部１４２による処理の概要を示す図である。本実施形態のフィルタリング部１４２は、例えば異なる２つの態様でフィルタリング処理を実行することができる。このため、フィルタリング処理（１）の概要を図５に示し、フィルタリング処理（２）の概要を図６にしめしている。以下、フィルタリング部１４２の処理について説明する。 [Filtering section]
For this reason, this embodiment uses processing by the filtering unit 142. FIGS. 5 and 6 are diagrams outlining the processing by the filtering unit 142. The filtering unit 142 of this embodiment can perform filtering in, for example, two different modes. An outline of filtering process (1) is shown in FIG. 5, and an outline of filtering process (2) is shown in FIG. 6. The processing of the filtering unit 142 is described below.

〔フィルタリング処理（１）〕
フィルタリング部１４２は、人物判定部１３６による人物の判定結果を連続的に観測する。この例では、図５中の上部枠内に（検出データＡ）、（検出データＢ）、（検出データＣ）、（検出データＤ）、（検出データＥ）、（検出データＦ）、（検出データなし）、（検出データＧ）、（検出データＨ）、（検出データＩ）、（検出データＪ）、（検出データなし）、（検出データＫ）、（検出データなし）、（検出データなし）、（検出データなし）で示される一連のフレーム画像毎に判定結果が得られている。 [Filtering process (1)]
The filtering unit 142 continuously observes the person determination results from the person determination unit 136. In this example, a determination result is obtained for each of the series of frame images shown in the upper frame of FIG. 5 as (detection data A), (detection data B), (detection data C), (detection data D), (detection data E), (detection data F), (no detection data), (detection data G), (detection data H), (detection data I), (detection data J), (no detection data), (detection data K), (no detection data), (no detection data), and (no detection data).

ここで、（検出データＡ）、（検出データＢ）、・・・（検出データＫ）は、それぞれのフレーム画像内で人物が判定（検出）されていることを表している。また、Ａ、Ｂ、・・・Ｋの符号は、フレーム画像別の判定結果を識別するものである。例えば、（検出データＡ）～（検出データＦ）と（検出データＧ）～（検出データＫ）とでは、人物を判定したバウンディングボックスの大きさが違っており、人物の位置が異なることを意味している。したがって、（検出データＡ）～（検出データＦ）と（検出データＧ）～（検出データＫ）とでは、移動ロボットＲＢから人物までの距離が異なっている。また、（検出データＧ）から（検出データＫ）に向かって人物との距離は小さくなっている。 Here, (detection data A), (detection data B), ... (detection data K) each indicate that a person has been determined (detected) in the corresponding frame image. The letters A, B, ... K identify the determination result of each frame image. For example, the size of the bounding box in which the person was determined differs between (detection data A) to (detection data F) and (detection data G) to (detection data K), which means that the position of the person differs. Accordingly, the distance from the mobile robot RB to the person differs between (detection data A) to (detection data F) and (detection data G) to (detection data K). In addition, the distance to the person decreases from (detection data G) toward (detection data K).

フィルタリング部１４２による処理は、図５中の下部領域に示す処理テーブルを用いて説明することができる。この処理テーブルは、例えばメモリ空間に展開されたデータ配列を便宜的に視覚化したものである。このとき、処理テーブルには、縦方向に「検出結果」、「内部状態」及び「出力」のデータ領域が定義されており、横方向には各データ領域に対応するデータが時系列に配列されている。 The processing by the filtering unit 142 can be explained using the processing table shown in the lower region of FIG. 5. This processing table is a visualization, for convenience, of a data array laid out in, for example, memory space. The processing table defines the data areas "detection result", "internal state", and "output" in the vertical direction, and the data corresponding to each data area are arranged in time series in the horizontal direction.

〔検出結果のデータ配列〕
処理テーブルの上段に示されているように、「検出結果」のデータ領域には、左（時系列の最古）から右（最新）に向かって人物判定部１３６による一連の判定結果（検出結果）が順次配列される。ここでは、左から３個目までのフレームが全てデータなしであり、４個目から９個目までのフレームには、「Ａ」～「Ｆ」の検出データが順に配列されている。また、１０個目のフレームがデータなしであり、１１個目から１４個目のフレームには「Ｇ」～「Ｊ」の検出データが順に配列されている。１５個目のフレームが再度データなしであるが、１６個目のフレームには「Ｋ」の検出データが配列されている。そして、１７個目以降のフレームはデータなしが連続している。このようなデータ配列は、図５中の上部枠内に示した一連のフレーム画像毎の判定結果に対応している。 [Detection result data array]
As shown in the top row of the processing table, the "detection result" data area holds the series of determination results (detection results) from the person determination unit 136, arranged in order from left (oldest in the time series) to right (latest). Here, the first three frames from the left all have no data, and the 4th to 9th frames hold detection data "A" to "F" in order. The 10th frame has no data, and the 11th to 14th frames hold detection data "G" to "J" in order. The 15th frame again has no data, but the 16th frame holds detection data "K". From the 17th frame onward, frames with no data continue. This data array corresponds to the determination results for the series of frame images shown in the upper frame of FIG. 5.

〔内部状態のデータ配列〕
処理テーブルの中段に示される「内部状態」のデータ配列は、上段の「検出結果」のデータ配列に基づいて決定される。具体的には、フィルタリング部１４２は、連続するｎ個（例えば３個）のデータ中に検出データが所定割合（例えば６割）以上含まれる場合、内部状態を「検出状態」とし、所定割合に満たない場合は内部状態を「未検出状態」とする。この例では、左から３個のフレームには検出データがないため、ここまでの内部状態は「未検出状態」となっている。２個目から４個目のフレームには検出データＡが１つあるが、６割に満たないため内部状態は「未検出状態」のままである。３個目から５個目のフレームには検出データＡ及びＢがあり、６割以上となることから、ここから内部状態は「検出状態」となる。以後も同様に、連続するｎ個のデータ中に６割以上の検出データがあれば、内部状態は「検出状態」となる。そして、１５個目から１７個目のフレームには検出データＫが１つとなり、ここから内部状態は「未検出状態」となる。 [Internal state data array]
The "internal state" data array shown in the middle row of the processing table is determined from the "detection result" data array in the top row. Specifically, when n consecutive data items (for example, 3) contain detection data at a predetermined ratio (for example, 60%) or more, the filtering unit 142 sets the internal state to the "detection state"; when the ratio is not reached, it sets the internal state to the "undetected state". In this example, the first three frames from the left contain no detection data, so the internal state up to that point is the "undetected state". The 2nd to 4th frames contain one item of detection data, A, but this is less than 60% (1 of 3), so the internal state remains the "undetected state". The 3rd to 5th frames contain detection data A and B, which is 60% or more (2 of 3), so the internal state becomes the "detection state" from this point. Likewise thereafter, whenever n consecutive data items contain 60% or more detection data, the internal state is the "detection state". The 15th to 17th frames contain only one item of detection data, K, so the internal state becomes the "undetected state" from this point.

〔出力のデータ配列〕
処理テーブルの下段に示される「出力」のデータ配列は、フィルタリング部１４２が出力する検出データを示している。フィルタリング部１４２からの出力は、人物判定部１３６の判定結果に基づいて生成した擬制的な検出結果である。具体的には、「内部状態」が「検出状態」である場合、フィルタリング部１４２は、最後に得られた検出データをその時点での検出結果と擬制して（みなして）出力する。この例では、時系列で最初に内部状態が「検出状態」となった時点では、最後に得られた検出データＢを出力している。以後は順次、検出データＣ、Ｄ、Ｅ、Ｆを出力するが、１０個目のフレームで検出データなしとなった場合、この時点で最後に得られていた検出データＦを出力している。次からは再び、検出データＧ、Ｈ、Ｉ、Ｊが出力されるが、１５個目のフレームでは検出データなしとなっているため、この時点で最後に得られていた検出データＪを出力している。そして、１６個目では検出データＫが最後となるため、この時点で検出データＫを出力する。 [Output data array]
The "output" data array shown in the bottom row of the processing table indicates the detection data output by the filtering unit 142. The output from the filtering unit 142 is a deemed detection result generated from the determination results of the person determination unit 136. Specifically, when the "internal state" is the "detection state", the filtering unit 142 outputs the most recently obtained detection data, deeming it to be the detection result at that point in time. In this example, at the point where the internal state first becomes the "detection state" in the time series, the most recently obtained detection data B is output. Detection data C, D, E, and F are then output in order; when the 10th frame has no detection data, detection data F, the most recent data obtained by that point, is output. Detection data G, H, I, and J are then output, and because the 15th frame has no detection data, detection data J, the most recent data obtained by that point, is output. At the 16th frame, detection data K is the most recent, so detection data K is output at that point.
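The rule of filtering process (1) described above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the implementation of the embodiment: the function name `filter_ratio`, the use of `None` for a frame with no detection data, and the list `FIG5_DETECTIONS` (transcribed from the description of the processing table, 19 frames in total) are illustrative. Only n = 3 and the 60% ratio are taken from the text.

```python
from typing import List, Optional, Tuple

# Detection results per frame as described for FIG. 5 (None = no detection data).
FIG5_DETECTIONS: List[Optional[str]] = (
    [None] * 3 + list("ABCDEF") + [None] + list("GHIJ") + [None, "K"] + [None] * 3
)

def filter_ratio(
    detections: List[Optional[str]], n: int = 3, ratio: float = 0.6
) -> Tuple[List[bool], List[Optional[str]]]:
    """Return (internal states, outputs) for each frame.

    The internal state is True ("detection state") when the latest n frames
    contain detection data at `ratio` or more. Assumption: before n frames
    have been seen, the state is treated as undetected.
    """
    states: List[bool] = []
    outputs: List[Optional[str]] = []
    last: Optional[str] = None  # most recently obtained detection data
    for i, det in enumerate(detections):
        if det is not None:
            last = det
        window = detections[max(0, i - n + 1): i + 1]
        hits = sum(d is not None for d in window)
        detected = len(window) == n and hits / n >= ratio
        states.append(detected)
        # In the detection state, the last obtained data is deemed the result.
        outputs.append(last if detected else None)
    return states, outputs

states, outputs = filter_ratio(FIG5_DETECTIONS)
print(outputs[4], outputs[9], outputs[14], outputs[16])  # B F J None
```

With these inputs the sketch bridges the single missing frames (the 10th and 15th) by repeating F and J, and it stops producing output from the 17th frame, in line with the table described in the text.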

〔フィルタリング処理（２）〕
図６に示されるフィルタリング処理（２）は、上記のフィルタリング処理（１）と異なるロジックで「内部状態」及び「出力」を処理する。すなわち、図６中の上部枠内に示される判定結果は同じであるが、下部領域に示される処理テーブル中段の「内部状態」及び下段の「出力」のデータ配列が図５と異なっている。なお、処理テーブル上段の「検出結果」は図５と同じである。 [Filtering process (2)]
Filtering process (2) shown in FIG. 6 processes "internal state" and "output" using a logic different from that of filtering process (1) described above. That is, although the determination results shown in the upper frame in FIG. 6 are the same, the data arrangement of the "internal state" in the middle part of the processing table and the "output" in the lower part of the processing table shown in the lower part are different from those in FIG. Note that the "detection results" at the top of the processing table are the same as in FIG. 5.

〔内部状態のデータ配列〕
例えば、検出データが未だ得られていない初期の段階では、内部状態が「未検出状態」となっている。ここから、ｎフレーム（例えば３フレーム）連続で検出データが得られた場合、フィルタリング部１４２は内部状態を「検出状態」とする。この例では、太枠で示す４個目から６個目のフレームには検出データＡ、Ｂ及びＣがあり、ｎフレーム連続していることから、ここから内部状態は「検出状態」となる。そして、これ以後は同じ内部状態を継続し、ｎフレーム連続して検出データが得られなかった場合は内部状態を「未検出状態」とする。この例では、太枠で示す１７個目から１９個目のフレームがデータなしとなっており、ｎフレーム連続していることから、ここから内部状態は「未検出状態」となる。 [Internal state data array]
For example, at an early stage when detection data is not yet obtained, the internal state is in the "undetected state." From here, when detection data is obtained for n consecutive frames (for example, 3 frames), the filtering unit 142 sets the internal state to the "detection state." In this example, the fourth to sixth frames indicated by thick frames include detection data A, B, and C, and since they are continuous for n frames, the internal state becomes the "detection state" from this point on. Thereafter, the same internal state continues, and if no detection data is obtained for n consecutive frames, the internal state is set to "undetected state." In this example, the 17th to 19th frames indicated by thick frames have no data, and since there are n consecutive frames, the internal state becomes "undetected state" from this point on.

〔出力のデータ配列〕
フィルタリング処理（２）でも同様に、「内部状態」が「検出状態」である場合、フィルタリング部１４２は、最後に得られた検出データをその時点での検出結果と擬制して（みなして）出力する。この例では、時系列で最初に内部状態が「検出状態」となった時点では、最後に得られた検出データＣから出力する点がフィルタリング処理（１）と異なる。以後は順次、検出データＤ、Ｅ、Ｆを出力し、１０個目のフレームで検出データなしとなった場合、この時点で最後に得られていた検出データＦを出力する点は同じである。次からは、検出データＧ、Ｈ、Ｉ、Ｊが出力されるが、１５個目のフレームでは検出データなしとなっているため、この時点で最後に得られていた検出データＪを出力し、そして、１７個目と１８個目のフレームでは検出データＫが最後となるため、それぞれ検出データＫを出力する。 [Output data array]
Similarly, in filtering process (2), when the "internal state" is the "detection state", the filtering unit 142 outputs the most recently obtained detection data, deeming it to be the detection result at that point in time. This example differs from filtering process (1) in that, at the point where the internal state first becomes the "detection state" in the time series, output starts from the most recently obtained detection data C. As before, detection data D, E, and F are then output in order, and when the 10th frame has no detection data, detection data F, the most recent data obtained by that point, is output. Detection data G, H, I, and J are then output, and because the 15th frame has no detection data, detection data J, the most recent data obtained by that point, is output. In the 17th and 18th frames, detection data K is the most recent, so detection data K is output for each of them.
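Filtering process (2) behaves like a switch with hysteresis: n consecutive detections turn the state on, and n consecutive misses turn it off. The following Python sketch illustrates that rule under assumptions: the function name `filter_consecutive`, the counter names `hits` and `misses`, and the use of `None` for a frame with no detection data are hypothetical, and the input list is transcribed from the description of the processing table.

```python
from typing import List, Optional, Tuple

# Detection results per frame as described for FIG. 6 (same input as FIG. 5).
FIG6_DETECTIONS: List[Optional[str]] = (
    [None] * 3 + list("ABCDEF") + [None] + list("GHIJ") + [None, "K"] + [None] * 3
)

def filter_consecutive(
    detections: List[Optional[str]], n: int = 3
) -> Tuple[List[bool], List[Optional[str]]]:
    """Return (internal states, outputs) for each frame.

    The state starts as undetected, switches to detected after n consecutive
    frames with detection data, and switches back after n consecutive frames
    without detection data.
    """
    detected = False
    hits = 0    # consecutive frames with detection data
    misses = 0  # consecutive frames without detection data
    last: Optional[str] = None
    states: List[bool] = []
    outputs: List[Optional[str]] = []
    for det in detections:
        if det is not None:
            hits, misses, last = hits + 1, 0, det
            if hits >= n:
                detected = True
        else:
            hits, misses = 0, misses + 1
            if misses >= n:
                detected = False
        states.append(detected)
        outputs.append(last if detected else None)
    return states, outputs

states, outputs = filter_consecutive(FIG6_DETECTIONS)
print(outputs[5], outputs[16], outputs[17], outputs[18])  # C K K None
```

Compared with filtering process (1), this variant starts output one frame later (from C instead of B) but keeps outputting K through the 17th and 18th frames, because the state only drops after three consecutive frames without data.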

〔発話タイミング〕
図７は、検出エリア判定部１４０による処理の概要を示す図である。検出エリア判定部１４０は、フィルタリング部１４２による検出結果（検出データＢ、Ｃ、・・・Ｋ）で示される人物Ｐに基づいて、人物Ｐが検出エリアＤＡに進入したか否かを判定する。このとき、人物Ｐがどの場所（距離）にいるかについては、各検出データに示されるバウンディングボックスの大きさから推定する。人物Ｐまでの距離とバウンディングボックスの大きさ（高さ）との関係を予め相関データとして記憶しておくことで、各検出データに示されるバウンディングボックスの大きさから人物Ｐまでの距離を推定する。 [Speech timing]
FIG. 7 is a diagram outlining the processing by the detection area determination unit 140. The detection area determination unit 140 determines whether the person P has entered the detection area DA, based on the person P indicated by the detection results (detection data B, C, ... K) from the filtering unit 142. The location (distance) of the person P is estimated from the size of the bounding box indicated in each item of detection data. By storing the relationship between the distance to the person P and the size (height) of the bounding box in advance as correlation data, the distance to the person P is estimated from the size of the bounding box indicated in each item of detection data.
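One way to picture the correlation data is a small lookup table that maps bounding-box height to distance, with interpolation between entries. The sketch below is purely illustrative: the table values, the names `CORRELATION` and `estimate_distance`, and the choice of linear interpolation are all assumptions, since the text only states that the relationship is stored in advance.

```python
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical calibration pairs: (bounding-box height in pixels, distance in metres).
# Taller boxes correspond to a person closer to the camera.
CORRELATION: List[Tuple[float, float]] = [
    (60.0, 10.0),
    (120.0, 5.0),
    (200.0, 3.0),
    (400.0, 1.5),
]

def estimate_distance(box_height: float,
                      table: List[Tuple[float, float]] = CORRELATION) -> float:
    """Estimate distance from box height by linear interpolation in the table.

    Heights outside the table are clamped to the nearest entry.
    """
    heights = [h for h, _ in table]
    if box_height <= heights[0]:
        return table[0][1]
    if box_height >= heights[-1]:
        return table[-1][1]
    i = bisect_left(heights, box_height)
    (h0, d0), (h1, d1) = table[i - 1], table[i]
    t = (box_height - h0) / (h1 - h0)
    return d0 + t * (d1 - d0)

print(estimate_distance(120.0))  # 5.0
print(estimate_distance(160.0))  # 4.0
```

A real system would obtain the table from measurements with the actual camera; the numbers here exist only to make the sketch runnable.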

検出エリア判定部１４０は、フィルタリング部１４２からの出力に基づいて検出エリアＤＡ外の遠方から人物Ｐを追跡し、常時、その距離を推定している。その結果、人物Ｐが検出エリアＤＡ（この例では５ｍ以内）に進入したと判定すると、そのタイミングで検出エリア判定部１４０は演算部１２２に判定結果を出力する。これを受けて、演算部１２２が出力装置１２６を駆動し、マイク・スピーカ１２８から発話音声を出力させる。これにより、実際に人物Ｐが検出エリアＤＡに進入したタイミングで、直ちに（遅延することなく）移動ロボットＲＢから「こんにちは熱中症に注意してください」といった声掛けが適切に実行されることになる。なお、声掛けの内容はこれに限定されない。 Based on the output from the filtering unit 142, the detection area determination unit 140 tracks the person P from a distant position outside the detection area DA and constantly estimates the distance. When it determines that the person P has entered the detection area DA (within 5 m in this example), the detection area determination unit 140 outputs the determination result to the calculation unit 122 at that timing. In response, the calculation unit 122 drives the output device 126 to output speech from the microphone/speaker 128. As a result, at the moment the person P actually enters the detection area DA, the mobile robot RB immediately (without delay) and appropriately calls out something like "Hello, please be careful of heatstroke." The content of the greeting is not limited to this.
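The entry determination can be illustrated as edge detection on the estimated distance: speech is triggered only on the frame where the person changes from outside to inside the detection area. The sketch below is an assumption-laden illustration, not the logic of the embodiment: the function name `entry_frames`, the distance values, and the choice to treat a frame with no output (`None`) as "no person inside" are all hypothetical. Only the 5 m radius comes from the text.

```python
from typing import List, Optional

DETECTION_RADIUS_M = 5.0  # "within 5 m in this example"

def entry_frames(distances: List[Optional[float]],
                 radius_m: float = DETECTION_RADIUS_M) -> List[int]:
    """Return the 1-based frame numbers at which a person newly enters the area."""
    inside = False
    frames: List[int] = []
    for frame_no, d in enumerate(distances, start=1):
        now_inside = d is not None and d <= radius_m
        if now_inside and not inside:
            frames.append(frame_no)  # speech would be triggered on this frame
        inside = now_inside
    return frames

# Hypothetical distance estimates for an approaching person.
filtered = [9.0, 7.5, 6.0, 4.8, 4.8, 3.9, 3.0]   # gap bridged by the filter
raw = [9.0, 7.5, 6.0, 4.8, None, 3.9, 3.0]       # same walk with one missed frame

print(entry_frames(filtered))  # [4]
print(entry_frames(raw))       # [4, 6]
```

Under these assumptions the unfiltered sequence triggers a second greeting after the dropout, which corresponds to the repeated-call problem described earlier, while the filtered sequence triggers exactly once.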

〔処理プログラムの例〕
以上の説明で声掛けシステム１１０の各機能ブロックによる処理の概要は明らかとなっているが、以下では、フローチャートを用いて具体的な処理の手順を説明する。 [Example of processing program]
The above description has clarified the outline of the processing by each functional block of the calling system 110, but below, the specific processing procedure will be explained using a flowchart.

〔フィルタリング処理（１）〕
図８は、フィルタリング部１４２で実行されるプログラムの一部として、フィルタリング処理（１）の手順例を示すフローチャートである。この処理は、図５に示す処理テーブルに対応する。以下、手順例に沿って説明する。 [Filtering process (1)]
FIG. 8 is a flowchart showing an example procedure of filtering process (1), as part of the program executed by the filtering unit 142. This process corresponds to the processing table shown in FIG. 5. It is described below following the example procedure.

ステップＳ１００：フィルタリング部１４２は、ｎフレーム数を初回定義する。ここでは、例えばｎフレーム数を「３個」と定義する。なお、定義は初回のフレームに対して処理を実行した場合のみ行い、以後のフレームで繰り返し処理を実行した場合には重ねて定義しない。また、ここで定義するｎフレーム数の値は声掛けシステム１１０に対して任意に書き換え可能とする。 Step S100: The filtering unit 142 defines the number of n frames for the first time. Here, for example, the number of n frames is defined as "3". Note that the definition is performed only when the process is executed for the first frame, and is not defined again when the process is repeatedly executed for subsequent frames. Further, the value of the number of n frames defined here can be arbitrarily rewritten in the calling system 110.

ステップＳ１０２：フィルタリング部１４２は、毎フレームの人物判定部１３６の判定結果（検出データ）を入力する。ここで入力する判定結果は、各フレームの（検出データＡ）、（検出データＢ）、・・・（検出データＫ）、（検出データなし）等である。 Step S102: The filtering unit 142 inputs the determination result (detection data) of the person determining unit 136 for each frame. The determination results input here are (detection data A), (detection data B), . . . (detection data K), (no detection data), etc. of each frame.

〔１フレーム目の処理〕
ステップＳ１０４：フィルタリング部１４２は、検出データがある場合（Ｙｅｓ）、ステップＳ１０６に進むが、図５の処理テーブルの例では、１個目のフレームに検出データがないため（Ｎｏ）、ステップＳ１１８に進む。 [1st frame processing]
Step S104: If detection data is present (Yes), the filtering unit 142 proceeds to step S106. In the example of the processing table in FIG. 5, however, the first frame has no detection data (No), so the filtering unit 142 proceeds to step S118.

ステップＳ１１８：フィルタリング部１４２は、変数Ｎが０より大か確認する。ここで、変数Ｎは初期値０に設定されているため、ここでは変数Ｎは０より大とならず（Ｎｏ）、ステップＳ１２４に進む。 Step S118: The filtering unit 142 checks whether the variable N is greater than 0. Here, since the variable N is set to the initial value 0, the variable N is not greater than 0 (No), and the process advances to step S124.

ステップＳ１２４：フィルタリング部１４２は、変数Ｎを１インクリメントする。ここでは、初期値０であった変数Ｎに値「１」が代入される。
ステップＳ１２６：フィルタリング部１４２、内部状態を「未検出」に設定する。したがって、図５の処理テーブルの例では、１個目のフレームで内部状態が「未検出」となる。 Step S124: The filtering unit 142 increments the variable N by 1. Here, the value "1" is assigned to the variable N, which had an initial value of 0.
Step S126: The filtering unit 142 sets the internal state to "undetected". Therefore, in the example of the processing table of FIG. 5, the internal state becomes "undetected" in the first frame.

ステップＳ１２８：フィルタリング部１４２は、検出データ「なし」を出力する。すなわち、図５の処理テーブルの例では、１個目のフレームで出力なしとなる。
フィルタリング部１４２は、ここで本処理を一旦離脱（リターン）する。そして、２フレーム目について本処理を実行する。 Step S128: The filtering unit 142 outputs the detection data "none". That is, in the example of the processing table in FIG. 5, there is no output in the first frame.
The filtering unit 142 temporarily leaves (returns) this process. Then, this process is executed for the second frame.

〔２フレーム目の処理〕
ステップＳ１１８：２フレーム目の処理では、検出データなし（ステップＳ１０４＝Ｎｏ）の場合でも変数Ｎが０より大となっているため（Ｙｅｓ）、ステップＳ１０６に進む。
ステップＳ１０６：フィルタリング部１４２は、変数Ｎを１インクリメントする。２フレーム目では、変数Ｎに値「２」が代入されることになる。 [2nd frame processing]
Step S118: In the process of the second frame, even if there is no detected data (step S104=No), the variable N is greater than 0 (Yes), so the process proceeds to step S106.
Step S106: The filtering unit 142 increments the variable N by 1. In the second frame, the value "2" is assigned to the variable N.

ステップＳ１０８：フィルタリング部１４２は、変数Ｎが定義したフレーム数ｎに等しければ（Ｙｅｓ）、ステップＳ１１０に進むが、ここではフレーム数ｎ（３個）に満たないため（Ｎｏ）、ステップＳ１２６に進む。 Step S108: If the variable N is equal to the defined number of frames n (Yes), the filtering unit 142 proceeds to step S110. Here, however, N is less than the number of frames n (3) (No), so the process proceeds to step S126.

ステップＳ１２６：フィルタリング部１４２、内部状態を「未検出」に設定する。したがって、図５の処理テーブルの例では、２個目のフレームで内部状態が「未検出」となる。
ステップＳ１２８：そして、フィルタリング部１４２は、検出データ「なし」を出力する。すなわち、図５の処理テーブルの例では、２個目のフレームで出力なしとなる。
フィルタリング部１４２は、ここで本処理を一旦離脱（リターン）する。そして、３フレーム目について本処理を実行する。 Step S126: The filtering unit 142 sets the internal state to "undetected". Therefore, in the example of the processing table of FIG. 5, the internal state becomes "undetected" in the second frame.
Step S128: Then, the filtering unit 142 outputs the detection data "none". That is, in the example of the processing table in FIG. 5, there is no output in the second frame.
The filtering unit 142 temporarily leaves (returns) this process. Then, this process is executed for the third frame.

〔３フレーム目の処理〕
ステップＳ１１８：３フレーム目の処理では、検出データなし（ステップＳ１０４＝Ｎｏ）の場合でも変数Ｎが０より大となっているため（Ｙｅｓ）、ステップＳ１０６に進む。
ステップＳ１０６：フィルタリング部１４２は、変数Ｎを１インクリメントする。３フレーム目では、変数Ｎに値「３」が代入されることになる。 [3rd frame processing]
Step S118: In the third frame process, even if there is no detected data (step S104=No), the variable N is greater than 0 (Yes), so the process proceeds to step S106.
Step S106: The filtering unit 142 increments the variable N by 1. In the third frame, the value "3" is assigned to the variable N.

ステップＳ１０８：この場合、変数Ｎが定義したフレーム数ｎに等しいため（Ｙｅｓ）、ステップＳ１１０に進む。
ステップＳ１１０：フィルタリング部１４２は、ｎフレーム中の検出データ数と閾値ｘ（例えばｘ＝２）とを比較し、閾値ｘ以上（Ｙｅｓ）の場合はステップＳ１１２に進む。ただし、図５の処理テーブルの例では、３フレーム目で検出データ数は未だ０であるため（Ｎｏ）、ステップＳ１２０に進む。なお、閾値ｘの値は任意に書き換え可能である。 Step S108: In this case, since the variable N is equal to the defined number of frames n (Yes), the process proceeds to step S110.
Step S110: The filtering unit 142 compares the number of detected data in n frames with a threshold value x (for example, x=2), and if the number is equal to or greater than the threshold value x (Yes), the process proceeds to step S112. However, in the example of the processing table in FIG. 5, the number of detected data is still 0 in the third frame (No), so the process advances to step S120. Note that the value of the threshold x can be arbitrarily rewritten.

ステップＳ１２０：フィルタリング部１４２は、内部状態を「未検出」に設定する。したがって、図５の処理テーブルの例では、３個目のフレームで内部状態が「未検出」となる。 Step S120: The filtering unit 142 sets the internal state to "undetected". Therefore, in the example of the processing table of FIG. 5, the internal state becomes "undetected" in the third frame.

ステップＳ１２２：そして、フィルタリング部１４２は、検出データ「なし」を出力する。すなわち、図５の処理テーブルの例では、３個目のフレームで出力なしとなる。
ステップＳ１１６：ここで、フィルタリング部１４２は変数Ｎを１デクリメントする。これにより、変数Ｎに値「２＝３－１」が代入されることになる。
フィルタリング部１４２は、ここで本処理を一旦離脱（リターン）する。そして、４フレーム目について本処理を実行する。 Step S122: Then, the filtering unit 142 outputs the detection data "none". That is, in the example of the processing table in FIG. 5, there is no output in the third frame.
Step S116: Here, the filtering unit 142 decrements the variable N by 1. As a result, the value "2=3-1" is assigned to the variable N.
The filtering unit 142 temporarily leaves (returns) this process. Then, this process is executed for the fourth frame.

〔４フレーム目の処理〕
ステップＳ１０４：図５の処理テーブルの例では、４フレーム目で検出データＡが入力されている。このため、検出データありとなり（Ｙｅｓ）、ステップＳ１０６に進む。
ステップＳ１０６：フィルタリング部１４２は、変数Ｎを１インクリメントする。４フレーム目では、再び変数Ｎに値「３＝２＋１」が代入されることになる。 [4th frame processing]
Step S104: In the example of the processing table in FIG. 5, detection data A is input in the fourth frame. Therefore, it is determined that there is detection data (Yes), and the process advances to step S106.
Step S106: The filtering unit 142 increments the variable N by 1. In the fourth frame, the value "3=2+1" is assigned to the variable N again.

ステップＳ１０８：この場合、変数Ｎが定義したフレーム数ｎに等しいため（Ｙｅｓ）、ステップＳ１１０に進む。
ステップＳ１１０：図５の処理テーブルの例では、４フレーム目で検出データ数は１であるため（Ｎｏ）、ステップＳ１２０に進む。 Step S108: In this case, since the variable N is equal to the defined number of frames n (Yes), the process proceeds to step S110.
Step S110: In the example of the processing table in FIG. 5, the number of detected data is 1 in the fourth frame (No), so the process proceeds to step S120.

ステップＳ１２０：フィルタリング部１４２は、内部状態を「未検出」に設定する。したがって、図５の処理テーブルの例では、４個目のフレームで内部状態が「未検出」となる。 Step S120: The filtering unit 142 sets the internal state to "undetected". Therefore, in the example of the processing table of FIG. 5, the internal state becomes "undetected" in the fourth frame.

ステップＳ１２２：そして、フィルタリング部１４２は、検出データ「なし」を出力する。すなわち、図５の処理テーブルの例では、４個目のフレームで出力なしとなる。
ステップＳ１１６：また、フィルタリング部１４２は変数Ｎを１デクリメントする。これにより、再び変数Ｎに値「２＝３－１」が代入されることになる。
フィルタリング部１４２は、ここで本処理を一旦離脱（リターン）する。そして、５フレーム目について本処理を実行する。 Step S122: Then, the filtering unit 142 outputs the detection data "none". That is, in the example of the processing table in FIG. 5, there is no output in the fourth frame.
Step S116: Also, the filtering unit 142 decrements the variable N by 1. As a result, the value "2=3-1" is assigned to the variable N again.
The filtering unit 142 temporarily leaves (returns) this process. Then, this process is executed for the fifth frame.

〔５フレーム目の処理〕
ステップＳ１０４：図５の処理テーブルの例では、５フレーム目で検出データＢが入力されている。このため、検出データありとなり（Ｙｅｓ）、ステップＳ１０６に進む。
ステップＳ１０６：フィルタリング部１４２は、変数Ｎを１インクリメントする。５フレーム目では、再び変数Ｎに値「３＝２＋１」が代入される。 [5th frame processing]
Step S104: In the example of the processing table in FIG. 5, detection data B is input in the fifth frame. Therefore, it is determined that there is detection data (Yes), and the process advances to step S106.
Step S106: The filtering unit 142 increments the variable N by 1. In the fifth frame, the value "3=2+1" is assigned to the variable N again.

ステップＳ１０８：この場合、変数Ｎが定義したフレーム数ｎに等しいため（Ｙｅｓ）、ステップＳ１１０に進む。
ステップＳ１１０：図５の処理テーブルの例では、５フレーム目で検出データ数は２であるため（Ｙｅｓ）、ステップＳ１１２に進む。 Step S108: In this case, since the variable N is equal to the defined number of frames n (Yes), the process proceeds to step S110.
Step S110: In the example of the processing table of FIG. 5, the number of detected data is 2 in the 5th frame (Yes), so the process proceeds to step S112.

ステップＳ１１２：ここでフィルタリング部１４２は、内部状態を「検出」に設定する。したがって、図５の処理テーブルの例では、５個目のフレームで内部状態が「検出」となる。 Step S112: Here, the filtering unit 142 sets the internal state to "detection". Therefore, in the example of the processing table in FIG. 5, the internal state becomes "detected" in the fifth frame.

ステップＳ１１４：そして、フィルタリング部１４２は、最新の検出データを出力する。すなわち、図５の処理テーブルの例では、５個目のフレームで最新の検出データＢが出力されることになる。
ステップＳ１１６：また、フィルタリング部１４２は変数Ｎを１デクリメントする。これにより、再び変数Ｎに値「２＝３－１」が代入されることになる。
フィルタリング部１４２は、ここで本処理を一旦離脱（リターン）する。そして、６フレーム目以降についても順次、本処理を実行する。 Step S114: Then, the filtering unit 142 outputs the latest detection data. That is, in the example of the processing table in FIG. 5, the latest detection data B is output in the fifth frame.
Step S116: Also, the filtering unit 142 decrements the variable N by 1. As a result, the value "2=3-1" is assigned to the variable N again.
The filtering unit 142 temporarily leaves (returns) this process. Then, this process is sequentially executed for the sixth frame and subsequent frames.

〔１０フレーム目の処理〕
１０フレーム目の処理は以下となる。
ステップＳ１１８：図５の処理テーブルの例では、１０フレーム目の処理で検出データなし（ステップＳ１０４＝Ｎｏ）の場合でも、変数Ｎが０より大となっており（Ｙｅｓ）、ステップＳ１０６に進む。
ステップＳ１０６：フィルタリング部１４２は、変数Ｎを１インクリメントする。１０フレーム目では、変数Ｎに値「３」が代入されることになる。 [10th frame processing]
The processing for the 10th frame is as follows.
Step S118: In the example of the processing table of FIG. 5, even if there is no detected data in the processing of the 10th frame (step S104=No), the variable N is greater than 0 (Yes), and the process proceeds to step S106.
Step S106: The filtering unit 142 increments the variable N by 1. In the 10th frame, the value "3" is assigned to the variable N.

ステップＳ１０８：この場合、変数Ｎが定義したフレーム数ｎに等しいため（Ｙｅｓ）、ステップＳ１１０に進む。
ステップＳ１１０：図５の処理テーブルの例では、１０フレーム目で検出データ数は２であるため（Ｙｅｓ）、ステップＳ１１２に進む。 Step S108: In this case, since the variable N is equal to the defined number of frames n (Yes), the process proceeds to step S110.
Step S110: In the example of the processing table of FIG. 5, the number of detected data is 2 in the 10th frame (Yes), so the process proceeds to step S112.

ステップＳ１１２：フィルタリング部１４２は、内部状態を「検出」に設定する。したがって、図５の処理テーブルの例では、１０個目のフレームで内部状態が「検出」となる。 Step S112: The filtering unit 142 sets the internal state to "detection". Therefore, in the example of the processing table in FIG. 5, the internal state becomes "detected" in the 10th frame.

ステップＳ１１４：そして、フィルタリング部１４２は、最新の検出データを出力する。すなわち、図５の処理テーブルの例では、１０個目のフレームで最新の検出データＦが出力されることになる。
ステップＳ１１６：また、フィルタリング部１４２は変数Ｎを１デクリメントする。これにより、再び変数Ｎに値「２＝３－１」が代入されることになる。
フィルタリング部１４２は、ここで本処理を一旦離脱（リターン）する。そして、１１フレーム目以降についても順次、本処理を実行する。 Step S114: Then, the filtering unit 142 outputs the latest detection data. That is, in the example of the processing table in FIG. 5, the latest detection data F is output in the tenth frame.
Step S116: Also, the filtering unit 142 decrements the variable N by 1. As a result, the value "2=3-1" is assigned to the variable N again.
The filtering unit 142 temporarily leaves (returns) this process. Then, this process is executed sequentially for the 11th frame and thereafter.

〔１７フレーム目の処理〕
１７フレーム目の処理は以下となる。
ステップＳ１１８：図５の処理テーブルの例では、１７フレーム目の処理で検出データなし（ステップＳ１０４＝Ｎｏ）の場合でも、変数Ｎが０より大となっており（Ｙｅｓ）、ステップＳ１０６に進む。
ステップＳ１０６：フィルタリング部１４２は、変数Ｎを１インクリメントする。１７フレーム目では、変数Ｎに値「３」が代入されることになる。 [17th frame processing]
The processing for the 17th frame is as follows.
Step S118: In the example of the processing table of FIG. 5, even though there is no detection data in the processing of the 17th frame (step S104=No), the variable N is greater than 0 (Yes), and the process proceeds to step S106.
Step S106: The filtering unit 142 increments the variable N by 1. In the 17th frame, the value "3" is assigned to the variable N.

ステップＳ１０８：この場合、変数Ｎが定義したフレーム数ｎに等しいため（Ｙｅｓ）、ステップＳ１１０に進む。
ステップＳ１１０：図５の処理テーブルの例では、１７フレーム目で検出データ数は１であるため（Ｎｏ）、ステップＳ１２０に進む。 Step S108: In this case, since the variable N is equal to the defined number of frames n (Yes), the process proceeds to step S110.
Step S110: In the example of the processing table in FIG. 5, the number of detected data is 1 in the 17th frame (No), so the process proceeds to step S120.

ステップＳ１２０：フィルタリング部１４２は、ここで内部状態を「未検出」に設定する。したがって、図５の処理テーブルの例では、１７個目のフレームで内部状態が「未検出」となる。 Step S120: The filtering unit 142 sets the internal state to "undetected" here. Therefore, in the example of the processing table of FIG. 5, the internal state becomes "undetected" in the 17th frame.

ステップＳ１２２：そして、フィルタリング部１４２は、検出データ「なし」を出力する。すなわち、図５の処理テーブルの例では、１７個目のフレームで出力なしとなる。
ステップＳ１１６：また、フィルタリング部１４２は変数Ｎを１デクリメントする。これにより、再び変数Ｎに値「２＝３－１」が代入されることになる。
フィルタリング部１４２は、ここで本処理を一旦離脱（リターン）する。そして、１８フレーム目以降についても順次、本処理を実行する。 Step S122: Then, the filtering unit 142 outputs the detection data "none". That is, in the example of the processing table in FIG. 5, there is no output at the 17th frame.
Step S116: Also, the filtering unit 142 decrements the variable N by 1. As a result, the value "2=3-1" is assigned to the variable N again.
The filtering unit 142 temporarily leaves (returns) this process. Then, this process is sequentially executed for the 18th frame and thereafter.
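The per-frame walkthrough above can be condensed into a small stateful sketch that follows the step numbers S100 to S128. It is a hedged reading of the described flow, not the program of the embodiment: the class name `FilterOne`, the method `step`, and in particular the use of a `deque` to hold the latest n detection results are assumptions, because the text does not say how the number of detections within n frames is stored.

```python
from collections import deque
from typing import Deque, Optional

class FilterOne:
    """Hypothetical per-frame sketch of filtering process (1)."""

    def __init__(self, n: int = 3, x: int = 2) -> None:
        self.n = n                      # S100: number of frames n, defined once
        self.x = x                      # threshold x for detections within n frames
        self.N = 0                      # counter variable N, initial value 0
        self.state = "undetected"
        self.last: Optional[str] = None
        # Assumption: the latest n results are kept in a bounded deque.
        self.window: Deque[Optional[str]] = deque(maxlen=n)

    def step(self, det: Optional[str]) -> Optional[str]:
        """Process one frame (S102) and return the output for that frame."""
        self.window.append(det)
        if det is not None:
            self.last = det
        if det is None and self.N == 0:  # S104 = No and S118 = No
            self.N += 1                  # S124
            self.state = "undetected"    # S126
            return None                  # S128
        self.N += 1                      # S106
        if self.N != self.n:             # S108 = No
            self.state = "undetected"    # S126
            return None                  # S128
        hits = sum(d is not None for d in self.window)
        if hits >= self.x:               # S110 = Yes
            self.state = "detected"      # S112
            out = self.last              # S114: latest detection data
        else:                            # S110 = No
            self.state = "undetected"    # S120
            out = None                   # S122
        self.N -= 1                      # S116
        return out

f = FilterOne()
frames = [None] * 3 + list("ABCDEF") + [None] + list("GHIJ") + [None, "K"] + [None] * 3
print([f.step(d) for d in frames][:6])  # [None, None, None, None, 'B', 'C']
```

Read this way, the counter N only serves to wait until n frames have been observed; from the third frame onward it alternates between 3 and 2 on every call, which matches the repeated "2 = 3 - 1" assignments in the walkthrough.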

〔フィルタリング処理（２）〕
図９は、フィルタリング処理（２）の手順例を示すフローチャートである。この処理は、図６に示す処理テーブルに対応する。以下、手順例に沿って説明する。 [Filtering process (2)]
FIG. 9 is a flowchart showing an example procedure of filtering process (2). This process corresponds to the processing table shown in FIG. 6. It is described below following the example procedure.

ステップＳ２００：フィルタリング部１４２は、ｎフレーム数を初回定義する。処理の内容はフィルタリング処理（１）のステップＳ１００と同様である。
ステップＳ２０２：フィルタリング部１４２は、毎フレームの人物判定部１３６の判定結果（検出データ）を入力する。処理の内容はフィルタリング処理（１）のステップＳ１０２と同様である。 Step S200: The filtering unit 142 defines the number of n frames for the first time. The content of the process is the same as step S100 of filtering process (1).
Step S202: The filtering unit 142 inputs the determination result (detection data) of the person determining unit 136 for each frame. The content of the process is the same as step S102 of filtering process (1).

〔１フレーム目の処理〕
ステップＳ２０４：フィルタリング部１４２は、検出データがある場合（Ｙｅｓ）、ステップＳ２０６に進むが、図６の処理テーブルの例では、１個目のフレームに検出データがないため（Ｎｏ）、ステップＳ２１６に進む。 [1st frame processing]
Step S204: If detection data is present (Yes), the filtering unit 142 proceeds to step S206. In the example of the processing table in FIG. 6, however, the first frame has no detection data (No), so the filtering unit 142 proceeds to step S216.

ステップＳ２１６：フィルタリング部１４２は、変数Ｎ_１を値「０」にリセットし、変数Ｎ_２を１インクリメントする。変数Ｎ_２は初期値０に設定されているため、ここでは変数Ｎ_２に値「１」が代入される。なお、変数Ｎ_１も初期値０である。
ステップＳ２１８：フィルタリング部１４２は、変数Ｎ_２が定義したフレーム数ｎに等しければ（Ｙｅｓ）、ステップＳ２２４に進むが、ここではフレーム数ｎ（３個）に満たないため（Ｎｏ）、ステップＳ２２０に進む。 Step S216: The filtering unit 142 resets the variable _N1 to the value "0" and increments the variable _N2 by 1. Since the variable _N2 is set to an initial value of 0, the value "1" is assigned to the variable _N2 here. Note that the variable _N1 also has an initial value of 0.
Step S218: If the variable _N2 is equal to the defined number of frames n (Yes), the filtering unit 142 proceeds to Step S224, but here, since the number of frames is less than n (3) (No), the filtering unit 142 proceeds to Step S220. move on.

ステップＳ２２０：フィルタリング部１４２は、内部状態が「検出」である場合（Ｙｅｓ）、ステップＳ２１４に進む。ただし、図６の処理テーブルの例では、１フレーム目の内部状態は「未検出」であるため（Ｎｏ）、ステップＳ２２２に進む。 Step S220: If the internal state is "detected" (Yes), the filtering unit 142 proceeds to step S214. However, in the example of the processing table of FIG. 6, the internal state of the first frame is "undetected" (No), so the process advances to step S222.

ステップＳ２２２：フィルタリング部１４２は、検出データ「なし」を出力する。すなわち、図６の処理テーブルの例では、１個目のフレームで出力なしとなる。
フィルタリング部１４２は、ここで本処理を一旦離脱（リターン）する。そして、２フレーム目以降についても順次、上記と同様に本処理を実行する。 Step S222: The filtering unit 142 outputs the detection data "none". That is, in the example of the processing table in FIG. 6, there is no output in the first frame.
The filtering unit 142 temporarily leaves (returns) this process. Then, this process is sequentially executed in the same manner as above for the second and subsequent frames.

[Processing of the 4th frame]
The processing for the fourth frame is as follows.
Step S204: In the example of the processing table in FIG. 6, detection data A is present at the fourth frame (Yes), so the process proceeds to step S206.
Step S206: The filtering unit 142 resets the variable N_2 to the value "0" and increments the variable N_1 by 1. Because the variable N_1 has an initial value of 0, the value "1" is assigned to N_1 here.

Step S208: If the variable N_1 equals the defined number of frames n (Yes), the filtering unit 142 proceeds to step S210. Here, however, N_1 has not reached the number of frames n (three) (No), so the process proceeds to step S220.

Step S220: If the internal state is "detected" (Yes), the filtering unit 142 proceeds to step S214. In the example of the processing table in FIG. 6, however, the internal state at the fourth frame is "undetected" (No), so the process proceeds to step S222.

Step S222: The filtering unit 142 outputs the detection data "none". That is, in the example of the processing table in FIG. 6, there is no output at the fourth frame.
The filtering unit 142 then temporarily exits (returns from) this process. The process is subsequently executed in the same manner for the fifth frame.

[Processing of the 6th frame]
The processing for the sixth frame is as follows.
Step S204: In the example of the processing table in FIG. 6, detection data C is present at the sixth frame (Yes), so the process proceeds to step S206.
Step S206: The filtering unit 142 resets the variable N_2 to the value "0" and increments the variable N_1 by 1. Because the value "2 = 1 + 1" was assigned to N_1 in the preceding processing of the fifth frame, the value "3 = 2 + 1" is assigned to N_1 here.

Step S208: Because the variable N_1 equals the defined number of frames n (Yes), the filtering unit 142 proceeds to step S210.

Step S210: The filtering unit 142 sets the internal state to "detected". As a result, in the example of the processing table in FIG. 6, the internal state at the sixth frame becomes "detected".
Step S212: The variable N_1 is then reset to the value "0".

Step S214: The filtering unit 142 outputs the latest detection data. That is, in the example of the processing table in FIG. 6, the latest detection data C is output at the sixth frame.
The filtering unit 142 then temporarily exits (returns from) this process. The process is subsequently executed in the same manner for the seventh frame. From the 7th frame through the 18th frame, the variable N_2 does not reach the defined number of frames n, so the internal state remains "detected".

[Processing of the 19th frame]
The processing for the 19th frame is as follows. By the end of the preceding processing of the 18th frame, the variable N_2 has the value "2".
Step S204: In the example of the processing table in FIG. 6, the 19th frame has no detection data (No), so the process proceeds to step S216.

Step S216: The variable N_1 is reset to the value "0" and the variable N_2 is incremented by 1. Because the value "2 = 1 + 1" was assigned to N_2 in the preceding processing of the 18th frame, the value "3 = 2 + 1" is assigned to N_2 here.

Step S218: Because the variable N_2 equals the defined number of frames n (Yes), the filtering unit 142 proceeds to step S224.

Step S224: The filtering unit 142 sets the internal state to "undetected". As a result, in the example of the processing table in FIG. 6, the internal state at the 19th frame becomes "undetected".
Step S226: The variable N_2 is then reset to the value "0".

Step S222: The filtering unit 142 outputs the detection data "none". That is, in the example of the processing table in FIG. 6, there is no output at the 19th frame.
The filtering unit 142 then temporarily exits (returns from) this process. The process is subsequently executed in the same manner for the 20th and later frames.
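
The branching of steps S204 to S226 can be summarized as a small state machine. The following Python code is a minimal sketch only, not the implementation of the embodiment: the class name `ConsecutiveFilter`, the method name `step`, and the use of `None` for "no detection data" are hypothetical, and storing the most recent detection data on every frame that carries data is an assumption made so that step S214 has "latest" data to output. The input sequence is likewise made up for illustration and only partly mirrors FIG. 6 (frames 1 to 6).

```python
from typing import List, Optional


class ConsecutiveFilter:
    """Sketch of filtering process (2); all names are hypothetical."""

    def __init__(self, n: int = 3) -> None:
        self.n = n                  # S200: number of frames n
        self.n1 = 0                 # N_1: consecutive frames with detection data
        self.n2 = 0                 # N_2: consecutive frames without detection data
        self.detected = False       # internal state: "detected" / "undetected"
        self.latest: Optional[str] = None  # assumption: last detection data seen

    def step(self, data: Optional[str]) -> Optional[str]:
        """Process one frame (S202) and return the filtered output."""
        if data is not None:                # S204: Yes
            self.latest = data
            self.n2 = 0                     # S206
            self.n1 += 1
            if self.n1 == self.n:           # S208: Yes
                self.detected = True        # S210
                self.n1 = 0                 # S212
                return self.latest          # S214
        else:                               # S204: No
            self.n1 = 0                     # S216
            self.n2 += 1
            if self.n2 == self.n:           # S218: Yes
                self.detected = False       # S224
                self.n2 = 0                 # S226
                return None                 # S222
        # S220: output depends on the current internal state
        return self.latest if self.detected else None


# Hypothetical input: None means the person determination failed for that frame.
frames: List[Optional[str]] = [None, None, None, "A", "B", "C",
                               None, "D", None, None, None, "E"]
filt = ConsecutiveFilter(n=3)
outputs = [filt.step(f) for f in frames]
print(outputs)
# [None, None, None, None, None, 'C', 'C', 'D', 'D', 'D', None, None]
```

In this trace the output switches on only at the third consecutive success (frame 6) and switches off only at the third consecutive failure (frame 11); isolated failures in between are bridged with the last successful data.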

[Greeting voice output process]
FIG. 10 is a flowchart showing an example procedure for the greeting voice output process, which is part of the program executed by the calculation unit 122. It is described below following the example procedure.

Step S300: The calculation unit 122 receives detection data from the filtering unit 142. The detection data received here is fictitiously generated, as shown in the processing table examples of FIG. 5 or FIG. 6.
Step S302: If detection data is present (Yes), the process proceeds to step S304. If there is no detection data (No), the process is temporarily exited (returned from) at this point.

Step S304: The calculation unit 122 obtains the determination result of the detection area determination unit 140. If it is determined that the person has entered the detection area DA (Yes), step S306 is executed next. Otherwise (No), the process is temporarily exited (returned from) at this point.

Step S306: The calculation unit 122 instructs the output device 126 to output the greeting voice. As a result, the speech sound is output from the microphone/speaker 128 at the speech timing at which the person enters the detection area DA.

After executing the above procedure, the calculation unit 122 exits (returns from) this process. The same procedure is then executed repeatedly.
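
The control flow of steps S300 to S306 amounts to two guard conditions followed by an output instruction. The Python code below is a sketch under stated assumptions: the function name `greeting_step` and the callbacks `in_detection_area` and `speak` are hypothetical stand-ins for the detection area determination unit 140 and the output device 126, whose actual interfaces are not described in the text.

```python
from typing import Callable, List, Optional


def greeting_step(
    detection: Optional[str],
    in_detection_area: Callable[[str], bool],
    speak: Callable[[], None],
) -> bool:
    """One pass of the greeting voice output process; returns True if speech was requested."""
    # S300/S302: no (filtered) detection data -> return without speaking
    if detection is None:
        return False
    # S304: the person has not entered the detection area DA -> return
    if not in_detection_area(detection):
        return False
    # S306: instruct the output device to output the greeting voice
    speak()
    return True


# Hypothetical stand-ins: the area test is a set lookup, speech appends to a log.
inside = {"C"}
log: List[str] = []
results = [
    greeting_step(d, lambda x: x in inside, lambda: log.append("greeting"))
    for d in [None, "B", "C"]
]
print(results, log)
```

Only the third call, which has detection data for a person inside the area, reaches the speech instruction.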

In this way, the units of the greeting system 110 execute their processing in coordination or cooperation, so that the mobile robot RB calls out to people appropriately.

Note that, for convenience, the above processing outputs status information such as "no detection data" in the undetected state; alternatively, the detection information itself may simply not be output in the undetected state.

According to the speech control device 100 of the embodiment described above, determining (detecting) a person at high speed makes it possible to control speech without missing the appropriate timing and without calling out repeatedly. As a result, even where unspecified persons move about randomly, as at a construction site CS, the mobile robot RB can call out to workers at the right time while moving autonomously around the construction site CS during the day, and can ensure that the person hears the content of the call. In addition, the uncertainty (low detection rate) that comes with using a high-speed AI model is appropriately compensated, so a practical greeting system 110 that does not feel unnatural can be realized.

In addition, at a construction site CS or the like, there may be frames in which the IP camera 112 cannot capture a person clearly because the surroundings are not bright enough, or frames in which the image of the person is blurred because the person moves faster than expected. In these cases it frequently happens that detection data cannot be obtained for n consecutive frames, so the detection rate of the high-speed model becomes even lower. With the logic of filtering process (1), however, detection data is deemed present whenever the ratio of frames with data to frames without data within n frames meets the condition (detection data at or above a predetermined proportion), so the absolute number of undetected frames can be kept low.
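
The idea of deeming detection data present from a ratio within n frames can be illustrated with a sliding window. The Python code below is a hedged sketch of that idea only and does not reproduce the flowchart of filtering process (1), which is described elsewhere in the specification: the class name `RatioFilter`, the parameter `min_hits` (the "predetermined proportion" expressed as a count), and the choice of a sliding window are all assumptions.

```python
from collections import deque
from typing import Deque, List, Optional


class RatioFilter:
    """Deem detection present when at least min_hits of the last n frames have data."""

    def __init__(self, n: int = 3, min_hits: int = 2) -> None:
        self.window: Deque[bool] = deque(maxlen=n)  # success flags of the last n frames
        self.min_hits = min_hits                    # assumed threshold, not from the source
        self.latest: Optional[str] = None           # last detection data seen

    def step(self, data: Optional[str]) -> Optional[str]:
        self.window.append(data is not None)
        if data is not None:
            self.latest = data
        # Enough successes in the window -> output the last successful data
        if sum(self.window) >= self.min_hits:
            return self.latest
        return None


# Hypothetical input with a dropped frame in the middle of a detection run.
frames: List[Optional[str]] = ["A", "B", None, "C", None, None]
ratio_filter = RatioFilter(n=3, min_hits=2)
ratio_outputs = [ratio_filter.step(f) for f in frames]
print(ratio_outputs)
# [None, 'B', 'B', 'C', None, None]
```

Here the single missed frame (frame 3) does not interrupt the output, which is the effect the paragraph above attributes to ratio-based filtering.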

The present invention is not limited to the embodiment described above and can be implemented with various modifications.
As already mentioned, the speech control device 100 is not limited to application to the mobile robot RB. It may be applied to a stationary robot, to a vehicle or other machine that does not take the form of a robot, or to a fixed-installation device.

The number, positions, shapes, orientations, and so on of the IP cameras 112 and microphone/speakers 128 can be selected or changed as appropriate. The AI processing acceleration device 114 is not essential and need not be used.

The example procedures given for the various processes (FIGS. 8 to 10) can be changed as appropriate, and the processing need not be carried out exactly as in the examples. The trigger for executing each process (interrupt event processing or trigger event processing) may also be decided as appropriate.

In addition, the structures shown in the drawings and described in the embodiment and elsewhere are merely preferred examples. It goes without saying that the present invention can be suitably implemented even if various elements are added to the basic structure or parts of it are replaced.

100 Speech control device
110 Greeting system
112 IP camera
118 Interpersonal distance determination unit
122 Calculation unit (audio output unit)
126 Output device (audio output unit)
128 Microphone/speaker (audio output unit)
136 Person determination unit
140 Detection area determination unit
142 Filtering unit
DA Detection area

Claims

A speech control device comprising:
a person determination unit having a determination capability such that, when person determination is performed continuously on images obtained by continuously capturing an imaging area in which a person is present, the series of determination results irregularly includes cases in which determination of the person succeeds and cases in which it fails;
a filtering unit that fictitiously determines the continuity of a detected state or an undetected state of the person from the ratio of successful cases to unsuccessful cases included in the series of determination results obtained by the person determination unit, and, based on that determination, generates a fictitious detection result that places the person in the detected state or the undetected state;
a detection area determination unit that defines a predetermined detection area within the imaging area and determines whether the person indicated by the fictitious detection result from the filtering unit has entered the detection area; and
an audio output unit that outputs a speech sound at the timing when the detection area determination unit determines that the person has entered the detection area.

A speech control device comprising:
a person determination unit having a determination capability such that, when person determination is performed continuously on images obtained by continuously capturing an imaging area in which a person is present, the series of determination results irregularly includes cases in which determination of the person succeeds and cases in which it fails;
a filtering unit that, when the person determination unit obtains a successful determination result a predetermined number of consecutive times, generates a detection result that fictitiously places the person in a detected state based on that continuity, and that thereafter, when no successful determination result is obtained a predetermined number of consecutive times, generates a detection result that fictitiously places the person in an undetected state based on the continuity of no determination result being obtained;
a detection area determination unit that defines a predetermined detection area within the imaging area and determines whether the person indicated by the fictitious detection result from the filtering unit has entered the detection area; and
an audio output unit that outputs a speech sound at the timing when the detection area determination unit determines that the person has entered the detection area.

The speech control device according to claim 1 or 2, wherein
the filtering unit,
when an unsuccessful determination result is obtained after a successful determination result has been obtained by the person determination unit, generates a detection result that fictitiously keeps the person in the detected state, based on the last successful determination result obtained.