JP7453193B2

JP7453193B2 - Mobile device, program, and method for controlling speech based on speech synthesis in conjunction with user's surrounding situation

Info

Publication number: JP7453193B2
Application number: JP2021156307A
Authority: JP
Inventors: 朋広小原; 剣明呉; 亮一川田
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2024-03-19
Anticipated expiration: 2041-09-27
Also published as: JP2023047411A

Description

The present invention relates to a technique for presenting information based on object recognition to a user. It is particularly suited to smart glasses and smartphones.

Smart glasses are see-through eyeglasses that serve as an augmented reality (AR) wearable computer. They are worn in front of the user's (wearer's) field of view and carry a display that projects onto the lenses. Unlike virtual reality (VR), which displays an unreal space, they show information to the wearer superimposed on the view of real space.

In recent years, declining quality of care services has become a social issue at elderly care facilities and medical institutions, driven by a growing number of care recipients and a shortage of caregivers. Caregivers must respond differently to each care recipient. In facilities where many care recipients reside, caregivers naturally cannot memorize every recipient's personal information (for example, symptoms and other information needed for care).

To address this, there is a technology in which a caregiver wears smart glasses and a care recipient's personal information is displayed on the lenses hands-free (see, for example, Non-Patent Document 1). With this technology, the care recipient is identified from a face image captured by the camera of the smart glasses, and that person's personal information is immediately shown on the lenses. The personal information can also be converted to speech and played to the caregiver through a bone conduction speaker. The care recipient therefore does not need to wear a wireless tag or marker for identification.

FIG. 1 is an explanatory diagram showing the field of view through smart glasses.

For example, assume that a caregiver at a care site is wearing the smart glasses 1. The caregiver (wearer) sees the care recipient in the field of view through the lenses of the smart glasses 1 and, at the same time, sees that care recipient's "personal information" shown on the lenses. The caregiver can thus carry out care work hands-free while reading the care recipient's personal information or listening to it.

Non-Patent Document 1: KDDI Research Institute and Zenkokai news release, "KDDI Research Institute and Zenkokai develop hands-free nursing care work support system using AR glasses" (February 2, 2021), [online], [retrieved September 17, 2021], Internet <URL: https://www.kddi-research.jp/newsrelease/2021/020201.html>
Non-Patent Document 2: "A story about the units of intrinsic parameters (focal length): conversion between pixels and mm", [online], [retrieved September 17, 2021], Internet <URL: https://mem-archive.com/2018/02/25/post-201/>
Non-Patent Document 3: "Survey of techniques for restoring 3D data from images: principles of calculating 3D coordinates (epipolar geometry and camera pose estimation)", [online], [retrieved September 17, 2021], Internet <URL: https://qiita.com/akaiteto/items/f5857c7774794a6e5f5e>

With the technology described in Non-Patent Document 1, when a care recipient's personal information shown on the smart glasses is played back as speech, the text of that information is synthesized as is. The more characters the personal information contains, the longer the audio playback takes; the fewer characters, the shorter the playback.

The inventors of the present application considered that, depending on the surroundings of a camera-equipped mobile device, a user may want to hear only a summary of the information on an object, among the multiple objects appearing in the video, in as short a time as possible. In other words, there may be situations in which the wearer wants to listen carefully to information about an object in view, and other situations in which the wearer wants to hear it as quickly as possible.

Therefore, an object of the present invention is to provide a mobile device, a program, and a method for controlling speech based on speech synthesis in conjunction with the user's surrounding situation.

According to the present invention, a mobile device having a camera that captures video and a speaker that outputs audio comprises:
presentation information storage means that stores in advance, for each object ID (identifier), presentation information in the form of a sentence associated with that object ID;
object recognition means that recognizes an image area of an object in the video, detects the size of the image area, and identifies an object ID from the image area;
reading information amount control means that, when the size of the image area is larger than a predetermined range, controls the reading information amount, which affects the audio playback duration of the sentence of the presentation information corresponding to the object ID, so that it is smaller than when the size of the image area is equal to or less than the predetermined range; and
speech synthesis means that synthesizes speech for reading out the presentation information corresponding to the object ID based on the reading information amount and outputs it from the speaker.

According to another embodiment of the mobile device of the present invention, it is also preferable that:
the object ID is a personal ID,
the presentation information is personal information, and
the image area in the object recognition means is a face area.

According to another embodiment of the mobile device of the present invention, it is also preferable that the reading information amount control means uses the speaking speed as the reading information amount.

According to another embodiment of the mobile device of the present invention, it is also preferable that the reading information amount control means, as the reading information amount:
selects the normal speaking speed when the size of the image area is equal to or less than the predetermined range, and
selects a speaking speed faster than the normal speaking speed when the size of the image area is larger than the predetermined range.

According to another embodiment of the mobile device of the present invention, it is also preferable that:
the presentation information storage means stores in advance, as presentation information, a normal sentence and a short sentence shorter than that normal sentence in association with each other, and
the reading information amount control means, as the reading information amount,
selects the normal sentence when the size of the image area is equal to or less than the predetermined range, and
selects the short sentence when the size of the image area is larger than the predetermined range.

According to another embodiment of the mobile device of the present invention, it is also preferable that:
the presentation information storage means stores in advance a normal sentence as presentation information, and
the reading information amount control means, as the reading information amount,
selects the normal sentence when the size of the image area is equal to or less than the predetermined range, and
selects only the inflecting words (yōgen: verbs and adjectives) and nominals (taigen: nouns and pronouns) of the normal sentence when the size of the image area is larger than the predetermined range.
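The selection of inflecting words and nominals described in this embodiment can be illustrated with a minimal sketch. It assumes the normal sentence has already been split into tokens with part-of-speech tags; the tag names, the `CONTENT_POS` set, and the `keep_content_words` helper are hypothetical and not taken from the patent, and a real system would obtain the tags from a Japanese morphological analyzer.

```python
# Hypothetical part-of-speech labels treated as taigen (nominals)
# or yōgen (inflecting words). Everything else is dropped.
CONTENT_POS = {"noun", "pronoun", "verb", "adjective"}


def keep_content_words(tagged_tokens):
    """Return the surface forms of tokens whose tag is in CONTENT_POS."""
    return [surface for surface, pos in tagged_tokens if pos in CONTENT_POS]


# Hand-tagged fragment of the normal sentence used in the description
# (illustrative tags only; no analyzer is called here).
tokens = [
    ("外傷", "noun"), ("チェック", "noun"), ("を", "particle"),
    ("行い", "verb"), ("、", "symbol"),
    ("痛み", "noun"), ("有無", "noun"), ("の", "particle"), ("確認", "noun"),
]

print(keep_content_words(tokens))
```

Dropping particles and symbols shortens the text passed to speech synthesis while keeping the words that carry most of the meaning.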

According to another embodiment of the mobile device of the present invention, it is also preferable that the predetermined range is based on the distance to the subject: the farther the subject, the smaller the image area, and the closer the subject, the larger the image area.

According to another embodiment of the mobile device of the present invention, it is also preferable that:
the camera is integrally mounted so as to capture video of the user's field of view, and
the mobile device is a pair of see-through smart glasses or a smartphone.

According to another embodiment of the mobile device of the present invention, the speaker may be a bone conduction speaker.

According to the present invention, a program for a computer mounted in a mobile device having a camera that captures video and a speaker that outputs audio causes the computer to function as:
presentation information storage means that stores in advance, for each object ID (identifier), presentation information in the form of a sentence associated with that object ID;
object recognition means that recognizes an image area of an object in the video, detects the size of the image area, and identifies an object ID from the image area;
reading information amount control means that, when the size of the image area is larger than a predetermined range, controls the reading information amount, which affects the audio playback duration of the sentence of the presentation information corresponding to the object ID, so that it is smaller than when the size of the image area is equal to or less than the predetermined range; and
speech synthesis means that synthesizes speech for reading out the presentation information corresponding to the object ID based on the reading information amount and outputs it from the speaker.

According to the present invention, in an information presentation method for a mobile device having a camera that captures video and a speaker that outputs audio,
the mobile device stores in advance, for each object ID (identifier), presentation information in the form of a sentence associated with that object ID, and executes:
a first step of recognizing an image area of an object in the video, detecting the size of the image area, and identifying an object ID from the image area;
a second step of, when the size of the image area is larger than a predetermined range, controlling the reading information amount, which affects the audio playback duration of the sentence of the presentation information corresponding to the object ID, so that it is smaller than when the size of the image area is equal to or less than the predetermined range; and
a third step of synthesizing speech for reading out the presentation information corresponding to the object ID based on the reading information amount and outputting it from the speaker.

According to the mobile device, program, and method of the present invention, speech based on speech synthesis can be controlled in conjunction with the user's surrounding situation.

An explanatory diagram showing the field of view through smart glasses.
A functional configuration diagram of the smart glasses according to the present invention.
An explanatory diagram showing normal sentences and short sentences in the presentation information storage unit of the present invention.
A flowchart of the reading information amount control unit of the present invention.
A functional configuration diagram of the speech synthesis unit.
An explanatory diagram showing speech synthesis when the distance between the wearer and the other person is long.
An explanatory diagram showing speech synthesis when the distance between the wearer and the other person is short.
An explanatory diagram showing a usage pattern of a smartphone according to the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 2 is a functional configuration diagram of the smart glasses according to the present invention.

The smart glasses 1 are a see-through device worn in front of the user's eyes.
As shown in FIG. 2, the smart glasses 1 include, as hardware, a display 101 that projects information onto the lenses, a camera 102 that captures people and objects, and a speaker 103 that outputs audio to the wearer.
The smart glasses 1 also include, as software, a presentation information storage unit 100, an object recognition unit 11, a reading information amount control unit 13, and a speech synthesis unit 14. These functional components are realized by executing a program that causes a computer mounted in the smart glasses to function.

[Display 101]
The display 101 projects the presentation information onto the lenses through which the wearer of the smart glasses 1 sees. This lets the wearer perceive the presentation information visually. The presentation information is generally text, but may of course be an image or video.

[Camera 102]
The camera 102 may be built into the smart glasses 1 or connected to them externally. The camera 102 captures video of the wearer's field of view. The video may contain a person (for example, a care recipient) or an object (for example, facility equipment). The camera 102 continuously outputs the captured video to the object recognition unit 11.

[Speaker 103]
The speaker 103 outputs an audio signal to the wearer of the smart glasses 1. The audio signal may be speech synthesized from the text to be shown on the display 101.
The speaker 103 may be, for example, a bone conduction speaker. By vibrating against the wearer's temple, it lets the wearer hear the speech without it being audible to the other person.

[Presentation information storage unit 100]
The presentation information storage unit 100 stores presentation information in advance in association with each object ID (identifier).
The object ID may be a personal ID, and the presentation information may be personal information. For a care site, for example, the personal ID may be an ID assigned to each care recipient, and the presentation information may be that care recipient's care information.

The presentation information storage unit 100 may store the personal information for each personal ID in advance in a memory device (stand-alone type), or may download it by accessing a server over a network (server-client type).

FIG. 3 is an explanatory diagram showing normal sentences and short sentences in the presentation information storage unit of the present invention.

As shown in FIG. 3, the presentation information storage unit 100 associates multiple items of personal information with each personal ID. A "normal sentence" is associated, as presentation information, with the combination of those items.
In FIG. 3, personal ID 0001 is associated with the following personal information and a corresponding "normal sentence".
<Personal ID = 0001>
Injury check: performed
Check for pain: performed
Check of walking condition: performed
Result: no particular problems
Complaints from the person: none
<Normal sentence> "An injury check was performed, and the presence of pain and the walking condition were checked. As a result, there were no particular problems and no complaints from the person."
In addition to the normal sentence, a "short sentence" such as the following may also be associated.
<Short sentence> "No problems with injuries, pain, or walking condition. No complaints from the person."

As another embodiment, the care recipient's personal information may include items such as the following. Text serving as presentation information may be generated from this personal information.
(Basic information)
Name, age, care level, date of birth, room number, care plan, etc.
(Activity information)
Wake-up time, meals (done/not done), hydration (done/not done), medication (done/not done), excretion (yes/no), bathing (done/not done), etc.
(Handover information)
Health status, body temperature, blood pressure (systolic/diastolic), pulse, procedure information, symptoms, treatment information, number of days without excretion, excretion time, excretion abnormality (yes/no), breakfast/lunch/dinner times, breakfast/lunch/dinner amounts, hydration time, fluid amount, medication time, bathing time, bathing abnormality (yes/no), etc.
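As a rough illustration of the presentation information storage unit 100 described for FIG. 3, the following sketch keeps a normal sentence and an optional short sentence per personal ID in an in-memory dictionary (the stand-alone case). The `PresentationInfo` class, the `STORE` dictionary, and the `lookup` function are hypothetical names chosen for this example, not part of the patent.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PresentationInfo:
    """Presentation information for one object ID (illustrative layout)."""
    normal: str                  # the "normal sentence"
    short: Optional[str] = None  # the optional "short sentence"


# In-memory stand-in for the storage unit, keyed by personal ID.
STORE: Dict[str, PresentationInfo] = {
    "0001": PresentationInfo(
        normal=("An injury check was performed, and the presence of pain and "
                "the walking condition were checked. As a result, there were "
                "no particular problems and no complaints from the person."),
        short=("No problems with injuries, pain, or walking condition. "
               "No complaints from the person."),
    ),
}


def lookup(object_id: str, want_short: bool) -> str:
    """Return the short sentence if requested and present, else the normal one."""
    info = STORE[object_id]
    if want_short and info.short is not None:
        return info.short
    return info.normal


print(lookup("0001", want_short=True))
```

Falling back to the normal sentence when no short sentence is stored is a design choice made for this sketch; the description only states that a short sentence may additionally be associated.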

[Object recognition unit 11]
The object recognition unit 11 recognizes target objects in the video captured by the camera 102.
The object recognition unit 11 recognizes the image areas of one or more objects in the video, detects the size of each image area, and identifies an object ID from each image area. It is a general-purpose machine learning engine of the kind applied to deep-learning-based image classification and face recognition, and it recognizes specific classes (categories such as people and things) appearing in the video.
The image area may be a person's face image, and the recognition task may be identifying that person.

The object recognition unit 11 has an <image area detection function> that detects the image area of an object and an <object detection function> that detects the object in that image area.

<Image area detection function>
The image area detection function detects the image area of a target object in the video by cutting out an object region (for example, a bounding box) from each video frame. Specifically, a method such as R-CNN (Regions with Convolutional Neural Networks) or SSD (Single Shot MultiBox Detector) is used.
R-CNN combines rectangular object regions with convolutional neural network features. It first detects a subset of candidate object regions (region proposals), then extracts CNN features from each region proposal, and finally adjusts the bounding boxes of the region proposals using support vector machines trained in advance on CNN features.
SSD is a machine-learning algorithm for general object detection that determines rectangular frames (bounding boxes) starting from "default boxes". Many default boxes of different sizes are overlaid on a single image, and predictions are computed for each box. For each default box, SSD predicts how far the box is from the object and how much its size differs from the object's.

<Object detection function>
The object detection function detects the target object in the image area. The target may be a human body or a face, or a thing such as a piece of equipment.
Because this embodiment assumes a care site, the object detection function recognizes a person's face area in the video and identifies the personal ID from that face area.
The object detection function stores in advance the feature vectors of face images of the actual individuals to be identified, for example those of the care recipients.
The object detection function then uses a face recognition model to convert the cut-out face area into a 128-, 256-, or 512-dimensional feature vector that can be compared by Euclidean distance. Specifically, Google (registered trademark)'s FaceNet (registered trademark) algorithm can be used as the face recognition model.
The object recognition unit 11 then compares this feature vector against the stored set of individuals' facial feature vectors and identifies the personal ID whose feature vector is at the shortest distance or within a predetermined threshold.
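The matching step described above can be sketched as a nearest-neighbor search over stored feature vectors. This is only an illustration: the `identify` function, the `gallery` dictionary, and the `max_distance` parameter are hypothetical names, the vectors are three-dimensional toy values rather than 128- to 512-dimensional embeddings, and combining "shortest distance" with a threshold check is one possible reading of the description.

```python
import math
from typing import Dict, Optional, Sequence


def identify(query: Sequence[float],
             gallery: Dict[str, Sequence[float]],
             max_distance: float) -> Optional[str]:
    """Return the personal ID whose stored vector is closest to `query`
    in Euclidean distance, or None if even the closest one is farther
    than `max_distance`."""
    best_id, best_dist = None, math.inf
    for personal_id, stored in gallery.items():
        dist = math.dist(query, stored)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_id, best_dist = personal_id, dist
    return best_id if best_dist <= max_distance else None


# Toy gallery of pre-registered feature vectors (illustrative values only).
gallery = {
    "0001": [0.0, 0.0, 0.0],
    "0002": [1.0, 1.0, 1.0],
}

print(identify([0.1, 0.0, 0.0], gallery, max_distance=0.5))
print(identify([5.0, 5.0, 5.0], gallery, max_distance=0.5))
```

Returning `None` for a face that is far from every registered vector avoids reading out another person's information when an unregistered person appears in view.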

The learning model of the object recognition unit 11 is made lightweight so that it can run on a small IoT device. The amount of computation is greatly reduced by selecting intermediate-layer features of the trained model (removing layers) and by substituting a lightweight architecture. Specifically, using a native C++ library, 10,000 people can be identified in one second even on a small device such as smart glasses.

The relationship between the size of the image area in which a target object appears and the distance to that object is explained here.
When the target object is a human face, the face is about 15 cm wide, so the distance to the person can be estimated from the width in pixels of the recognized face area (see, for example, Non-Patent Document 2). Depending on the camera, it may be assumed, for example, that 10 pixels correspond to 1 cm. Alternatively, the size of a specific object may be measured beforehand, the object photographed from a known distance, and a fixed ratio between size and distance calculated from the size (number of pixels) of its image area.
The number of pixels in the image area is directly proportional to the resolution of the camera. The number of pixels in the image area corresponding to each distance can therefore be determined in advance according to the camera's resolution.
As another embodiment, there is also a technique that estimates distance from multiple images of the camera video without measuring the size of a specific object (see, for example, Non-Patent Document 3). In this case, three-dimensional coordinates can be calculated from multiple monocular camera images, and the Z coordinate can be taken as the estimated distance from the camera to the object.
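One common way to turn the pixel width of a face area into a distance is the pinhole camera relation, distance = focal length (in pixels) × real width / width in pixels. The sketch below assumes that model; the patent text itself only states that the distance can be estimated from the pixel width. The function name `estimate_distance_cm` and the focal length value are hypothetical, and only the 15 cm face width comes from the description.

```python
FACE_WIDTH_CM = 15.0  # approximate face width given in the description


def estimate_distance_cm(face_width_px: float, focal_length_px: float) -> float:
    """Estimate camera-to-face distance with the pinhole camera model.

    focal_length_px is the camera's focal length expressed in pixels,
    an assumed calibration value that depends on the camera and resolution.
    """
    if face_width_px <= 0:
        raise ValueError("face_width_px must be positive")
    return focal_length_px * FACE_WIDTH_CM / face_width_px


# With an assumed focal length of 1000 px, a face 150 px wide
# is estimated to be 100 cm away, and a face 300 px wide 50 cm away.
print(estimate_distance_cm(150, focal_length_px=1000.0))
print(estimate_distance_cm(300, focal_length_px=1000.0))
```

Because the estimated distance is inversely proportional to the pixel width, a threshold on distance can equivalently be expressed as a threshold on the size of the image area, which is how the reading information amount control unit uses it.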

[Reading information amount control unit 13]
The reading information amount control unit 13 controls the "reading information amount" of the presentation information corresponding to the object ID, based on the size of the image area.
The reading information amount affects the "audio playback time" of the synthesized speech for the sentence or passage of the presentation information. That is, controlling the reading information amount controls the audio playback time.
The reading information amount can be controlled in any of the following three patterns.
(Pattern 1) Normal amount: play the normal sentence at normal speaking speed. Reduced amount: play the normal sentence at a fast speaking speed.
(Pattern 2) Normal amount: play the normal sentence. Reduced amount: play the short sentence.
(Pattern 3) Normal amount: play the normal sentence. Reduced amount: play only the inflecting words (yōgen) and nominals (taigen) of the normal sentence.
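The three patterns can be summarized as a small decision function. This is a sketch under stated assumptions: the "predetermined range" is modeled as a single pixel threshold, and the names `Pattern`, `plan_readout`, `threshold_px`, and `fast_rate` are hypothetical. The reduced texts (short sentence, content words only) are passed in ready-made rather than generated here.

```python
from enum import Enum
from typing import Tuple


class Pattern(Enum):
    FAST_SPEECH = 1     # Pattern 1: same text, faster speaking speed
    SHORT_SENTENCE = 2  # Pattern 2: pre-stored short sentence
    CONTENT_WORDS = 3   # Pattern 3: inflecting words and nominals only


def plan_readout(size_px: float, threshold_px: float, pattern: Pattern,
                 normal: str, short: str, content_words: str,
                 fast_rate: float = 1.5) -> Tuple[str, float]:
    """Return (text to synthesize, speaking-rate multiplier)."""
    if size_px <= threshold_px:
        # Image area at or below the threshold: subject is far, read normally.
        return normal, 1.0
    # Image area above the threshold: subject is near, reduce the amount.
    if pattern is Pattern.FAST_SPEECH:
        return normal, fast_rate
    if pattern is Pattern.SHORT_SENTENCE:
        return short, 1.0
    return content_words, 1.0


print(plan_readout(80, 100, Pattern.SHORT_SENTENCE, "normal", "short", "words"))
print(plan_readout(150, 100, Pattern.FAST_SPEECH, "normal", "short", "words"))
```

Returning the text and the rate together keeps the control decision separate from the speech synthesis step, which only needs those two values.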

FIG. 4 is a flowchart of the reading information amount control unit of the present invention.

For the object ID recognized by the object recognition unit 11, the reading information amount control unit 13 determines whether the size of its image area is equal to or less than a "predetermined range".
The predetermined range is based on the distance to the subject: the farther the subject, the smaller the image area, and the closer the subject, the larger the image area.

FIG. 5 is an explanatory diagram showing speech synthesis when the distance between the wearer and the other person is long.
FIG. 6 is an explanatory diagram showing speech synthesis when the distance between the wearer and the other person is short.

In FIGS. 5 and 6, for example at a care site, the caregiver (wearer) sees the care recipient "Mr. A" in the field of view through the smart glasses 1. The object recognition unit 11 has already identified the personal ID of the care recipient "Mr. A" from the face area using the object detection function.

<S1: When the size of the image area is less than or equal to the predetermined range>
As shown in FIG. 5, the reading information amount control unit 13 determines that the care recipient "Mr. A" is far away. In this case, playback uses the "normal" reading information amount. Specifically, the "normal sentence" is played back at the "normal speaking speed".
A caregiver has time to spare before coming face to face with a person or object that is relatively far away. The wearer is therefore given the chance to listen carefully to the information about the person or object in the field of view.

<S2: When the size of the image area is larger than the predetermined range>
As shown in FIG. 6, the reading information amount control unit 13 determines that the care recipient "Mr. A" is close. In this case, playback "reduces" the reading information amount. Specifically, playback is performed using any one of S21 to S23 below.

(S21) Select a speaking speed faster than the normal speaking speed
Even a normal sentence can be heard by the wearer in a short time if it is played back at a faster speaking speed. However, the upper limit is about 1.5 times the normal speed; anything faster becomes harder to understand.

(S22) Play back a short sentence in place of the normal sentence
The presentation information storage unit 100 stores in advance, as presentation information, a normal sentence in association with a short sentence that is shorter than the normal sentence. Even at the normal speaking speed, playing back the short sentence lets the wearer hear it in a short time.
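One way to picture the association between normal and short sentences is a small lookup keyed by object ID. This is only a sketch under assumptions: the `PresentationInfo` type, the `STORE` dictionary, the ID string, and the example sentences are all made up for illustration and do not come from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresentationInfo:
    normal: str  # normal sentence
    short: str   # shorter sentence stored in association with it


# Hypothetical store keyed by object ID; the entry below is a made-up example.
STORE = {
    "person-A": PresentationInfo(
        normal="Mr. A is hard of hearing in the right ear, so please speak from the left side.",
        short="Mr. A: speak from the left.",
    ),
}


def select_sentence(store: dict, object_id: str, near: bool) -> str:
    """Return the short sentence when the subject is near (S2), else the normal one (S1)."""
    info = store[object_id]
    return info.short if near else info.normal


print(select_sentence(STORE, "person-A", near=True))
```

Storing both variants ahead of time keeps the selection at playback time to a simple lookup.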

(S23) Play back only the predicate words (yōgen) and nominals (taigen) of the normal sentence
"Predicate words" (yōgen) are verbs, adjectives, and adjectival verbs, that is, words that can serve as predicates. "Nominals" (taigen) are nouns, that is, words that can become a subject when followed by a particle such as "ga" or "wa". A word string consisting only of predicate words and nominals is extracted from the normal sentence recorded as presentation information in the presentation information storage unit 100. Even at the normal speaking speed, playing back that word string lets the wearer hear it in a short time.
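The extraction of 用言 (verbs, adjectives, adjectival verbs) and 体言 (nouns) can be sketched as a part-of-speech filter. The sketch below assumes the sentence has already been split into tokens and tagged by some morphological analyzer; the tag names, the `extract_content_words` function, and the hand-tagged example sentence are hypothetical.

```python
# Part-of-speech labels assumed to come from a morphological analyzer
# (the label names here are hypothetical).
CONTENT_POS = {"noun", "verb", "adjective", "adjectival_verb"}


def extract_content_words(tagged_tokens):
    """Keep only yōgen (verbs, adjectives, adjectival verbs) and taigen (nouns)."""
    return [surface for surface, pos in tagged_tokens if pos in CONTENT_POS]


# Hand-tagged example: 「Aさんは右耳が遠い」 (Mr. A is hard of hearing in the right ear)
tagged = [
    ("Aさん", "noun"),
    ("は", "particle"),
    ("右耳", "noun"),
    ("が", "particle"),
    ("遠い", "adjective"),
]
print(extract_content_words(tagged))
```

Dropping the particles shortens the utterance while keeping the words that carry most of the meaning.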

A caregiver needs to come face to face and start a conversation right away with a person or object that is relatively close, so there is no time to spare. The wearer is therefore made to hear the information about the person or object in the field of view in as short a time as possible.
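The branching between S1 and S2, together with the three reduction options S21 to S23, can be summarized as follows. This is a sketch under assumptions rather than the implementation of the embodiment: the `Reduction` enum, the `ReadoutPlan` type, the `plan_readout` function, and the default `fast_speed` are hypothetical names and values; only the 1.5 times upper limit comes from the description of S21.

```python
from dataclasses import dataclass
from enum import Enum

MAX_SPEED = 1.5  # upper limit mentioned for S21; faster speech is hard to follow


class Reduction(Enum):
    FAST_SPEED = "S21"
    SHORT_SENTENCE = "S22"
    CONTENT_WORDS = "S23"


@dataclass(frozen=True)
class ReadoutPlan:
    text_mode: str  # "normal", "short", or "content_words"
    speed: float    # multiplier relative to the normal speaking speed


def plan_readout(area_size: float, threshold: float,
                 reduction: Reduction, fast_speed: float = 1.5) -> ReadoutPlan:
    """Choose how to read out presentation information from the image-area size."""
    if area_size <= threshold:
        # S1: the subject is far, so read the normal sentence at normal speed.
        return ReadoutPlan("normal", 1.0)
    # S2: the subject is near, so reduce the reading information amount.
    if reduction is Reduction.FAST_SPEED:
        # Clamp the requested speed to the range [1.0, MAX_SPEED].
        return ReadoutPlan("normal", min(max(fast_speed, 1.0), MAX_SPEED))
    if reduction is Reduction.SHORT_SENTENCE:
        return ReadoutPlan("short", 1.0)
    return ReadoutPlan("content_words", 1.0)


print(plan_readout(120.0, 60.0, Reduction.FAST_SPEED, fast_speed=2.0))
```

Clamping the speed keeps a misconfigured value from producing speech faster than the stated limit.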

Specifically, the distance corresponding to the predetermined range of the image area size may be, for example, about 4 m. Although it varies from person to person, a person's stride length is considered to be roughly height × 0.45. The stride length of a person 170 cm tall is therefore approximately 77 cm, so a distance of about 4 m corresponds to roughly five steps ahead. Because the wearer will come face to face almost immediately with a person who is closer than five steps away, the wearer is made to hear that person's personal information in as short a time as possible.
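The stride arithmetic above can be checked with a few lines of code. The function names are hypothetical; the 0.45 ratio, the 170 cm height, and the 4 m distance are the figures given in the text.

```python
def stride_m(height_cm: float, ratio: float = 0.45) -> float:
    """Stride length in metres, using the height x 0.45 rule of thumb from the text."""
    return height_cm * ratio / 100.0


def steps_to(distance_m: float, height_cm: float) -> float:
    """Number of strides needed to cover distance_m."""
    return distance_m / stride_m(height_cm)


stride = stride_m(170)      # 0.765 m, which the text rounds to about 77 cm
steps = steps_to(4.0, 170)  # a little over 5 steps
print(f"{stride:.3f} m per step, {steps:.1f} steps to 4 m")
```

The result, slightly more than five strides, matches the "about 5 steps" stated in the description.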

[Speech synthesis unit 14]
The speech synthesis unit 14 synthesizes speech that reads out the presentation information corresponding to the object ID in accordance with the reading information amount, and outputs it from the speaker 103.

FIG. 7 is a functional configuration diagram of the speech synthesis unit.
FIG. 7 shows a general speech synthesis configuration. Presentation information in text form can be output as audio from the speaker 103 by speech synthesis. For speech synthesis, a lightweight Japanese text-to-speech function is used, of the kind that can run on a standalone microcontroller board for IoT and embedded applications.

FIG. 8 is an explanatory diagram showing how a smartphone is used in the present invention.

In the embodiments described above, the portable device of the present invention was described as see-through smart glasses, but it is not limited to these and may be a portable terminal such as a smartphone. A smartphone has an integrated camera and can also be connected to an external speaker such as earphones.

As described in detail above, the portable device, program, and method of the present invention make it possible to control speech based on speech synthesis in conjunction with the user's surrounding situation.
According to the present invention, the reading information amount based on speech synthesis can be changed dynamically according to the surrounding situation of the portable device and the distance to the person or object captured by the camera.

For example, at a nursing care site, the personal information of the care recipient can be played back as audio for a length of time that fits the time the caregiver has available, depending on the distance between the caregiver and the care recipient. Specifically, a caregiver needs to come face to face and start a conversation right away with a person or object that is relatively close, so there is no time to spare. The wearer is therefore made to hear the information about the person or object in the field of view in as short a time as possible.

Consequently, because, for example, "a caregiver at a nursing care site can, by wearing smart glasses or a smartphone, have the personal information of a care recipient played back as audio for an appropriate length of time", the invention can contribute to Goal 3 of the United Nations-led Sustainable Development Goals (SDGs), "Ensure healthy lives and promote well-being for all at all ages".

Various changes, modifications, and omissions to the embodiments of the present invention described above can easily be made by those skilled in the art within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting in any way. The present invention is limited only by the claims and their equivalents.

100 Presentation information storage unit
101 Display
102 Camera
103 Speaker
11 Object recognition unit
13 Reading information amount control unit
14 Speech synthesis unit

Claims

1. A portable device having a camera that captures video and a speaker that outputs audio, the portable device comprising:
presentation information storage means for storing sentence presentation information in association with each object ID (identifier);
object recognition means for recognizing an image area of an object from the video, detecting the size of the image area, and identifying an object ID from the image area;
reading information amount control means for controlling a reading information amount, which affects the audio playback time of the sentence of presentation information corresponding to the object ID, so that the reading information amount is smaller when the size of the image area is larger than a predetermined range than when the size of the image area is less than or equal to the predetermined range; and
speech synthesis means for synthesizing speech that reads out the presentation information corresponding to the object ID in accordance with the reading information amount, and outputting the speech from the speaker.

2. The portable device according to claim 1, wherein the object ID is a personal ID, the presentation information is personal information, and the image area in the object recognition means is a face area.

3. The portable device according to claim 1, wherein the reading information amount control means uses a speaking speed as the reading information amount.

4. The portable device according to claim 3, wherein the reading information amount control means selects the normal speaking speed when the size of the image area is less than or equal to the predetermined range, and selects a speaking speed faster than the normal speaking speed when the size of the image area is larger than the predetermined range.

5. The portable device according to claim 1, wherein the presentation information storage means stores in advance, as presentation information, a normal sentence in association with a short sentence that is shorter than the normal sentence, and the reading information amount control means selects the normal sentence when the size of the image area is less than or equal to the predetermined range, and selects the short sentence when the size of the image area is larger than the predetermined range.

6. The portable device according to claim 1, wherein the presentation information storage means stores in advance a normal sentence as presentation information, and the reading information amount control means selects the normal sentence when the size of the image area is less than or equal to the predetermined range, and selects only the predicate words (yōgen) and nominals (taigen) of the normal sentence when the size of the image area is larger than the predetermined range.

7. The portable device according to any one of claims 4 to 6, wherein the predetermined range is based on the distance to the subject, the size of the image area being smaller the farther away the subject is and larger the closer the subject is.

8. The portable device according to any one of claims 1 to 7, wherein the camera is integrally mounted so as to capture the user's field of view, and the portable device is see-through smart glasses or a smartphone.

9. The portable device according to claim 8, wherein the speaker is a bone conduction speaker.

10. A program for causing a computer installed in a portable device having a camera that captures video and a speaker that outputs audio to function as:
presentation information storage means for storing sentence presentation information in association with each object ID (identifier);
object recognition means for recognizing an image area of an object from the video, detecting the size of the image area, and identifying an object ID from the image area;
reading information amount control means for controlling a reading information amount, which affects the audio playback time of the sentence of presentation information corresponding to the object ID, so that the reading information amount is smaller when the size of the image area is larger than a predetermined range than when the size of the image area is less than or equal to the predetermined range; and
speech synthesis means for synthesizing speech that reads out the presentation information corresponding to the object ID in accordance with the reading information amount, and outputting the speech from the speaker.

11. An information presentation method for a portable device having a camera that captures video and a speaker that outputs audio, wherein the portable device stores in advance sentence presentation information in association with each object ID (identifier), and executes:
a first step of recognizing an image area of an object from the video, detecting the size of the image area, and identifying an object ID from the image area;
a second step of controlling a reading information amount, which affects the audio playback time of the sentence of presentation information corresponding to the object ID, so that the reading information amount is smaller when the size of the image area is larger than a predetermined range than when the size of the image area is less than or equal to the predetermined range; and
a third step of synthesizing speech that reads out the presentation information corresponding to the object ID in accordance with the reading information amount, and outputting the speech from the speaker.