JP2023112556A - Visualization device, visualization method and program

Info

Publication number: JP2023112556A
Application number: JP2022014425A
Authority: JP
Inventors: 恵理子渡邊; Eriko Watanabe
Original assignee: NEC Platforms Ltd
Current assignee: NEC Platforms Ltd
Priority date: 2022-02-01
Filing date: 2022-02-01
Publication date: 2023-08-14

Abstract

PROBLEM TO BE SOLVED: To provide a visualization device, a visualization method, and a program that contribute to smooth communication even when the mouth is hidden behind a mask or the like during a conversation.
SOLUTION: A visualization device is provided that has: a voice acquisition unit that acquires a voice; a voice recognition unit that recognizes the voice to produce recognition data; a visualization unit that generates visualization data by visualizing the recognition data; and a presentation unit that presents the visualization data as video.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a visualization device, a visualization method, and a program.

Some people with hearing disabilities read the other person's mouth movements as an aid to conversation. However, as more people wear masks for disease prevention and similar reasons, mouth movements can no longer be read, and communication may become difficult.

Patent Document 1 discloses the following information processing apparatus. When a first actor is a voice user and a second actor is a sign language user, the apparatus converts a message spoken by the first actor into a video of hands performing the sign language gestures corresponding to that message, and displays the video superimposed on the first actor as seen on a transmissive display.

International Publication No. WO2019/003616

The disclosure of the above prior art document is incorporated herein by reference. The following analysis was made by the present inventors.

However, many people with hearing impairments cannot communicate in sign language, and there is a demand to communicate using a small amount of audible speech together with the other person's mouth movements.

An object of the present invention is to provide a visualization device, a visualization method, and a program that enable smooth communication even when the mouth is hidden by a mask or the like during conversation.

According to a first aspect of the present invention or disclosure, there is provided a visualization device having: a voice acquisition unit that acquires a voice; a voice recognition unit that recognizes the voice to produce recognition data; a visualization unit that generates visualization data by visualizing the recognition data; and a presentation unit that presents the visualization data as video.

According to a second aspect of the present invention or disclosure, there is provided a visualization method having: a step of acquiring a voice; a step of recognizing the voice to produce recognition data; a step of generating visualization data by visualizing the recognition data; and a step of presenting the visualization data.

According to a third aspect of the present invention or disclosure, there is provided a program for causing a computer to execute: a process of acquiring a voice; a process of recognizing the voice to produce recognition data; a process of generating visualization data by visualizing the recognition data; and a process of presenting the visualization data.

According to each aspect of the present invention or disclosure, there are provided a visualization device, a visualization method, and a program that contribute to smooth communication even when the mouth is hidden by a mask or the like during conversation.

FIG. 1 is a block diagram showing an example of the configuration of a visualization device according to one embodiment.
FIG. 2 is a block diagram showing an example of the configuration of the visualization device according to the first embodiment.
FIG. 3 is a schematic diagram showing an outline of processing in the visualization device according to the first embodiment.
FIG. 4 is a flowchart showing an example of the operation of the visualization device according to the first embodiment.
FIG. 5 is a schematic diagram showing the hardware configuration of the visualization device according to the first embodiment.
FIG. 6 is a schematic diagram showing processing of the auxiliary device A corresponding to the visualization device according to the second embodiment.
FIG. 7 is a schematic diagram showing the configuration of the auxiliary device A corresponding to the visualization device according to the second embodiment.
FIG. 8 is another schematic diagram showing processing of the auxiliary device A corresponding to the visualization device according to the second embodiment.
FIG. 9 is a flowchart showing details of the operation of the voice analysis unit according to the second embodiment.
FIG. 10 is a flowchart showing details of the operation of the screen display unit according to the second embodiment.

First, an overview of one embodiment is described. The drawing reference numerals in this overview are attached to each element for convenience, as an example to aid understanding, and the overview is not intended to be limiting in any way. The connecting lines between blocks in each figure include both bidirectional and unidirectional lines. A unidirectional arrow schematically shows the main flow of a signal (data) and does not exclude bidirectionality. Furthermore, in the circuit diagrams, block diagrams, internal configuration diagrams, connection diagrams, and the like in this disclosure, an input port and an output port exist at the input end and the output end of each connecting line, respectively, although they are not explicitly shown. The same applies to input/output interfaces.

FIG. 1 is a block diagram showing an example of the configuration of a visualization device according to one embodiment. As shown in the figure, the visualization device 10 according to one embodiment has a voice acquisition unit 11, a voice recognition unit 12, a visualization unit 13, and a presentation unit 14.

The voice acquisition unit 11 acquires a voice. "Voice" refers to speech as it is uttered, but the speaker need not be physically in front of the listener. For example, announcements at stations, in vehicles, or over disaster-prevention radio systems are also included in "voice". "Acquisition" means capturing voice with an input device such as a microphone and temporarily storing it in a storage area as voice data; it also includes obtaining data that has already been converted to voice data via a network or a storage medium. The format of the voice data is not particularly limited. However, when visualization is performed while recognizing speech in real time during a conversation, decompression takes time and degrades responsiveness, so uncompressed data or data with a low compression ratio is desirable. The stored voice data is sent to the voice recognition unit 12 described below.

The voice recognition unit 12 recognizes the voice to produce recognition data. Specifically, the voice data is input to a voice recognition program, which outputs recognition data. The recognition method of the voice recognition program is not particularly limited, but because immediate processing is important during conversation, a method that is fast and that can perform recognition while voice data is still being input is desirable. "Recognition data" often refers to text data produced from voice data, but it need not be text; for example, it may be a sequence of phonemes produced from the voice data. For further speed, only the vowels among the phonemes may be recognized and output as data.
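The vowel-only option can be illustrated with a minimal Python sketch. It assumes the recognizer already yields a romanized phoneme sequence; the phoneme labels, the vowel set, and the name `extract_vowels` are hypothetical and are not taken from the patent.

```python
# Hypothetical vowel inventory (the five Japanese vowels, romanized).
VOWELS = {"a", "i", "u", "e", "o"}


def extract_vowels(phonemes):
    """Keep only the vowel phonemes, preserving their order."""
    return [p for p in phonemes if p in VOWELS]


# Assumed phoneme sequence for the greeting "konnichiwa".
phonemes = ["k", "o", "N", "n", "i", "ch", "i", "w", "a"]
print(extract_vowels(phonemes))  # ['o', 'i', 'i', 'a']
```

Reducing the recognition data to vowels shrinks the set of mouth shapes that the later visualization stage has to handle, which is one plausible reading of the speed-up mentioned above.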

As another example, voice waveform data may be associated with mouth shapes, which are elements of the visualization data for an utterance, and a mouth shape may be output according to the similarity between that waveform data and the acquired voice data. Directly associating a language-independent acoustic model with visualization data in this way makes it possible to visualize speech in foreign languages as well as Japanese.

The visualization unit 13 generates visualization data by visualizing the recognition data. "Visualization" refers to converting features of the voice into a visually recognizable form. This includes, for example, text, gestures such as sign language, and mouth movements, as well as conversion into more abstract forms such as spatial patterns, their movement, and color. "Generation" may mean selecting visualization data prepared in advance, such as mouth shapes, that is associated with the recognition data, or it may mean producing visualization data on the fly according to the recognition data. The generated visualization data is sent to the presentation unit 14.

The presentation unit 14 presents the visualization data as video. The generated visualization data is output through an output interface such as a display device. Any visually recognizable form of presentation may be used, and the presentation is not limited to a single form.
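The four units described above form a simple pipeline, which can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class name `VisualizationDevice`, the callable signatures, and the placeholder lambdas are all hypothetical, and real acquisition, recognition, and display are replaced by stubs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VisualizationDevice:
    """Hypothetical stand-in for visualization device 10."""
    acquire: Callable[[], str]                    # voice acquisition unit 11
    recognize: Callable[[str], List[str]]         # voice recognition unit 12
    visualize: Callable[[List[str]], List[str]]   # visualization unit 13
    present: Callable[[List[str]], None]          # presentation unit 14

    def step(self) -> None:
        """Run one pass: acquire, recognize, visualize, present."""
        audio = self.acquire()
        recognition_data = self.recognize(audio)
        visualization_data = self.visualize(recognition_data)
        self.present(visualization_data)


shown: List[str] = []
device = VisualizationDevice(
    acquire=lambda: "a i",                                # stub for microphone input
    recognize=lambda audio: audio.split(),                # stub recognizer
    visualize=lambda data: [f"mouth_{p}" for p in data],  # stub mouth-shape lookup
    present=shown.extend,                                 # stub display
)
device.step()
print(shown)  # ['mouth_a', 'mouth_i']
```

Passing each unit in as a callable mirrors the text's point that neither the recognition method nor the form of presentation is fixed.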

Specific embodiments are described in more detail below with reference to the drawings. In each embodiment, the same components are given the same reference numerals, and their description is omitted.

[First embodiment]
FIG. 2 is a block diagram showing an example of the configuration of the visualization device of this embodiment. The visualization device 10 of this embodiment has the following configuration. That is, like the embodiment shown in FIG. 1, the visualization device 10 according to the first embodiment has a voice acquisition unit 11, a voice recognition unit 12, a visualization unit 13, and a presentation unit 14. In addition to these components, the visualization device 10 of this embodiment has a body recognition unit 15.

The visualization unit 13 of this embodiment may generate visualization data that is an image representing the shape of the mouth when the recognition data recognized by the voice recognition unit 12 is uttered. FIGS. 3(A) to 3(F) are schematic diagrams showing an outline of processing in the visualization device 10 of this embodiment. The "image representing the shape of the mouth" refers to the image inside the dotted line in FIG. 3(B). An image in which the mouth is not hidden, as in FIG. 3(A), is shot in advance; a cropping range that includes the mouth is determined as shown in FIG. 3(B); and a cropped image like FIG. 3(C) is generated. At this point, the size of the cropping range, such as its width and height, and its position, for example its position relative to both eyes, are acquired and held in a storage area in association with the image.

Next, an image of the person wearing a mask, as shown in FIG. 3(D), is shot, and the cropped image of FIG. 3(C) is superimposed on it. As shown in FIG. 3(E), the superimposition position is calculated, for example, from the position relative to both eyes. If the angle of view differs between FIG. 3(B) and FIG. 3(E), the image to be superimposed, FIG. 3(C), is enlarged or reduced as appropriate to adjust its size. FIG. 3(F) is an example of the video presented by the presentation unit 14.
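The placement and resizing described above can be sketched as a small geometric calculation. The sketch below makes assumptions that the patent does not state: the stored offset is measured from the midpoint between the eyes to the top-left corner of the crop, and the scale factor is the ratio of the eye-to-eye distance in the new frame to that in the reference image. The names `MouthCrop` and `place_crop` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MouthCrop:
    """Crop metadata stored with the reference image (hypothetical layout)."""
    width: float
    height: float
    offset_x: float      # crop top-left x minus eye midpoint x, in the reference image
    offset_y: float      # crop top-left y minus eye midpoint y, in the reference image
    eye_distance: float  # distance between the eyes in the reference image


def place_crop(crop, left_eye, right_eye):
    """Return (x, y, width, height) of the overlay in the masked frame."""
    mid_x = (left_eye[0] + right_eye[0]) / 2
    mid_y = (left_eye[1] + right_eye[1]) / 2
    # Assumed scaling rule: resize in proportion to the eye-to-eye distance.
    scale = math.dist(left_eye, right_eye) / crop.eye_distance
    return (
        mid_x + crop.offset_x * scale,
        mid_y + crop.offset_y * scale,
        crop.width * scale,
        crop.height * scale,
    )


crop = MouthCrop(width=80, height=40, offset_x=-40, offset_y=60, eye_distance=100)
print(place_crop(crop, left_eye=(200, 100), right_eye=(250, 100)))
# (205.0, 130.0, 40.0, 20.0)
```

In this example the eyes are half as far apart as in the reference image, so the crop is drawn at half size and its offset from the eye midpoint is halved as well.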

Because the cropped image of FIG. 3(C) to be superimposed differs depending on the recognized sound, images representing the mouth shape corresponding to each recognized sound may be collected in advance. Shooting a cropped image for every sound takes time and effort, so at least one image, but fewer than the number of kinds of recognized sounds, may be shot and then deformed by image processing to generate mouth-shape images for all recognized sounds.

The presentation unit 14 may present the visualization data on a transmissive display device such as a transparent display. For example, the image of FIG. 3(C) may be aligned with the actual speaker as seen through the transmissive display device, so that a view like FIG. 3(F) is presented. In this case, preprocessing by the body recognition unit 15, described below, is required.

The body recognition unit 15 generates body recognition data by recognizing body parts of the conversation partner. "Body recognition data" refers to data obtained by recognizing body parts such as the eyes, mouth, and nose from their shapes and determining the on-screen positions of the recognized parts.

The body recognition unit 15 may further generate body recognition data in which the position of the conversation partner's mouth is recognized, and the presentation unit 14 may use that body recognition data to place the mouth-shape image of the generated visualization data at the position of the conversation partner's mouth. For example, when a mouth-shape image like FIG. 3(C) is superimposed while being aligned with the real person seen through a transmissive display device, the body recognition unit recognizes body parts such as the eyes and head, determines their positions, and uses them as a reference, which makes it possible to position the mouth-shape image.

The voice recognition unit 12 may further acquire emotion data by estimating the conversation partner's emotion from the voice, and the visualization unit 13 may generate the visualization data based on the acquired emotion data. For example, the loudness and intonation of the voice input during recognition may be compared with an acoustic model of normal speech, and emotions such as joy, anger, sorrow, and pleasure may be estimated from specific differences in feature values held in advance. The visualization unit may then deform the mouth-shape image of FIG. 3(C) by image processing according to the estimated emotion data, generating a mouth shape that expresses the emotion. For example, when "joy" is estimated, an image with the corners of the mouth raised may be generated by image processing.
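One way to read "specific differences in feature values held in advance" is as a nearest-profile lookup, sketched below in Python. Everything numeric here is invented for illustration: the baseline values, the per-emotion difference profiles, the choice of loudness and pitch range as the two features, and the Euclidean distance are all assumptions rather than details from the patent.

```python
import math

# Hypothetical baseline for normal speech: loudness in dB, pitch range in semitones.
BASELINE = (60.0, 4.0)

# Hypothetical stored differences from the baseline for each emotion.
EMOTION_PROFILES = {
    "neutral": (0.0, 0.0),
    "joy": (5.0, 3.0),
    "anger": (10.0, 1.0),
    "sadness": (-5.0, -2.0),
}


def estimate_emotion(loudness, pitch_range):
    """Return the emotion whose stored difference profile is closest to the input."""
    diff = (loudness - BASELINE[0], pitch_range - BASELINE[1])
    return min(EMOTION_PROFILES, key=lambda name: math.dist(diff, EMOTION_PROFILES[name]))


print(estimate_emotion(65.0, 7.0))  # joy
print(estimate_emotion(60.0, 4.0))  # neutral
```

The returned label could then select an image-processing step, such as raising the corners of the mouth for "joy" as the text describes.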

The voice recognition unit 12 may also recognize the voice to produce recognition data in text form, and the presentation unit 14 may present that text-form recognition data. For example, the presentation unit 14 may display the text recognition data on part of the screen of the transmissive display device, so that the recognized text appears like a caption while the viewer looks at the speaker through the display.

[Explanation of operation]
An example of the operation of the visualization device 10 of this embodiment is described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of the operation of the visualization device 10 according to the first embodiment.

When the device starts operating, a device such as a camera scans the image of the speaker to generate body recognition data (step S41). This process may be executed only once after the visualization device starts operating (the solid-line path for N at step S46), or, if the speaker's movement changes the position of the mouth on the screen, it may be executed repeatedly during the conversation (the dotted-line path for N at step S46). Next, voice is acquired with a device such as a microphone (step S42). The acquired voice is then recognized to produce recognition data (step S43), and visualization data is generated from the recognition data (step S44). Next, the visualization data is presented as video based on the body recognition data (step S45). The device then determines whether to end (step S46). If so (step S46, Y), the operation ends; if not (step S46, N), processing returns to acquiring voice (step S42) or to generating body recognition data (step S41).

The step of generating body recognition data (step S41) need only be completed before the step of presenting the visualization data as video (step S45), and it may be executed in parallel with other processing.
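The flow of steps S41 to S46 can be sketched as a plain loop. This is a sequential illustration only: the function name `run_loop`, its parameters, and the stub callables are hypothetical, and the parallel execution of S41 that the text allows is not modeled.

```python
def run_loop(recognize_body, acquire, recognize, visualize, present, finished,
             rescan=False):
    """Sequential sketch of the FIG. 4 flow; all callables are hypothetical stubs."""
    body = recognize_body()                  # S41
    while True:
        audio = acquire()                    # S42
        recognition = recognize(audio)       # S43
        visual = visualize(recognition)      # S44
        present(visual, body)                # S45
        if finished():                       # S46: Y
            break
        if rescan:                           # S46: N, dotted-line path back to S41
            body = recognize_body()


queue = ["a", "i", "u"]   # stand-in for three utterances
log = []                  # what was presented, and where
scans = []                # how many times body recognition ran


def scan():
    scans.append(1)
    return (100, 200)     # assumed mouth position on screen


run_loop(
    recognize_body=scan,
    acquire=lambda: queue.pop(0),
    recognize=lambda audio: audio,
    visualize=lambda r: f"mouth_{r}",
    present=lambda v, body: log.append((v, body)),
    finished=lambda: not queue,
    rescan=True,
)
print(len(log), len(scans))  # 3 3
```

With `rescan=False` the body position is measured once and reused, which corresponds to the solid-line path in the flowchart.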

[Hardware configuration]
The visualization device 10 of this embodiment can be configured as an information processing device (computer) and has the configuration illustrated in FIG. 5. For example, the visualization device 10 includes a CPU (Central Processing Unit) 51, a memory 52, an input/output interface 53, and a NIC (Network Interface Card) 54 serving as a communication means, which are interconnected by an internal bus 55.

However, the configuration shown in FIG. 5 is not intended to limit the hardware configuration of the visualization device 10. The visualization device 10 may include hardware that is not shown. The number of CPUs and other components in the visualization device 10 is also not limited to the example in FIG. 5; for example, the visualization device 10 may include multiple CPUs.

The memory 52 is a RAM (Random Access Memory), a ROM (Read Only Memory), or an auxiliary storage device (such as a hard disk).

The input/output interface 53 serves as an interface to a display device and an input device, which are not shown. The display device is, for example, a transparent display, which is a transmissive display device, or smart glasses equipped with one. Input devices include, for example, devices that accept user operations, such as a keyboard, a mouse, or a sensor that accepts gesture input, and devices that capture video or audio of a subject, such as a microphone or a camera.

The functions of the visualization device 10 are realized by a group of programs (processing modules) stored in the memory 52, such as a voice acquisition program, a voice recognition program, a visualization program, a presentation program, and a body recognition program, together with a group of data such as an acoustic model for voice recognition and visualization data for presenting mouth shapes. The processing modules are realized, for example, by the CPU 51 executing the programs stored in the memory 52. The programs can be downloaded over a network or updated using a storage medium that stores them. The processing modules may also be realized by a semiconductor chip. In other words, it is sufficient that some hardware and/or software means exists to perform the functions of the processing modules.

[Hardware operation]
When the visualization device 10 starts operating, the body recognition program is loaded from the memory 52 and executed by the CPU 51. The program controls the camera, acquires video of the body of the speaker who is the conversation partner, and, by shape recognition, acquires body recognition data such as the reference position of the eyes and the relative positions of the mouth and nose. The acquired data is temporarily stored in the memory 52.

Next, the voice acquisition program is loaded from the memory 52 and executed by the CPU 51. The program controls the microphone and acquires the speaker's voice. The acquired voice is temporarily stored in the memory 52 as voice data.

Next, the voice recognition program is loaded from the memory 52 and executed by the CPU 51. The program reads the voice data temporarily stored in the memory 52 and, for example, matches it against the feature values of the acoustic model stored in the memory 52. The voice recognition program can use any of various existing voice recognition methods. The recognition result may be, for example, in phoneme units or at the word level using a language model. The recognition data is temporarily stored in the memory 52.

Next, the visualization program is loaded from the memory 52 and executed by the CPU 51. The program reads the recognition data temporarily stored in the memory 52, obtains the image associated with that recognition data that represents the mouth shape during utterance, and sends it to the presentation program described below. Alternatively, it deforms a base mouth-shape image by image processing to generate an image that serves as the visualization data corresponding to the recognition result, and sends that to the presentation program.

Next, the presentation program is loaded from the memory 52 and executed by the CPU 51. The program reads the visualization data from the visualization program and outputs it to a display device such as a transparent display. Here, the program reads the body recognition data temporarily stored in the memory 52 and performs positioning for displaying the visualization data. Specifically, for example, it takes the position of the speaker's eyes as seen on the transparent display as a reference and places the visualization data at a position offset from the eyes by the relative distance included in the body recognition data. This aligns the visualization data, that is, a mouth-shape image like FIG. 3(C), with the position of the real speaker's mouth as seen through the transparent display, so that the speaker can be shown as in FIG. 3(F).

[Explanation of effect]
The visualization device according to the first embodiment enables smooth communication even when the mouth is hidden by a mask or the like during conversation. In addition, when visualization data is superimposed on the real speaker as seen on a transparent display, mouth movements can be presented in a natural way even if the speaker moves during the conversation.

[Second embodiment]
This embodiment describes a device (auxiliary device A, corresponding to the visualization device 10) that superimposes a mouth-shape image on the speaker as seen through smart glasses, which are a transmissive display device. FIG. 6 is a schematic diagram showing processing of the auxiliary device A corresponding to the visualization device of this embodiment. As shown in the figure, the auxiliary device A has smart glasses 601 with a transmissive display, a microphone 602, a voice analysis function, and a function for displaying a screen on the display. The listener 603 wears the smart glasses 601. The voice of the speaker 604 picked up by the microphone 602 is decomposed and analyzed into single-sound units by the voice analysis unit, and based on the result of the analysis the screen display unit displays output on the smart glasses 601.

The auxiliary device A of this embodiment has the following configuration. FIG. 7 is a schematic diagram showing the configuration of the auxiliary device A corresponding to the visualization device of this embodiment. FIG. 8 is another schematic diagram showing processing of the auxiliary device A corresponding to the visualization device according to the second embodiment. As shown in FIG. 7, the auxiliary device A is, for example, a glasses-type wearable device composed of a microphone 701 (corresponding to the voice acquisition unit 11), a transmissive display 702 (corresponding to the presentation unit 14), a voice analysis unit 703 (corresponding to the voice recognition unit 12), and a screen display unit 704 (corresponding to the visualization unit 13 and the presentation unit 14). The voice analysis unit 703 analyzes the voice input from the microphone and generates a mouth-shape image, and the screen display unit 704 displays the generated mouth-shape image 801 on the transmissive display 803 of the smart glasses 802, as shown in FIG. 8.

FIG. 9 is a flowchart showing details of the operation of the voice analysis unit 703 according to the second embodiment. First, input voice data is accumulated (step S902) for a fixed period (step S901). After that period, the accumulated voice data is analyzed and split into vowel units (step S903). The split voice data is then sent to the screen display unit one sound at a time (step S904).
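Steps S901 to S904 can be sketched as a windowed generator. The sketch assumes that an upstream classifier has already labeled each short audio frame with a vowel or `None` for non-vowel frames; that classifier, the frame labels, and the names `split_into_sounds` and `analyze` are hypothetical and not part of the patent.

```python
from itertools import groupby


def split_into_sounds(frame_labels):
    """S903: group consecutive frames with the same vowel label into one sound.

    Returns (vowel, frame_count) pairs; frames labeled None are skipped.
    """
    return [
        (label, len(list(group)))
        for label, group in groupby(frame_labels)
        if label is not None
    ]


def analyze(frames, window):
    """Accumulate `window` frames at a time (S901, S902), then emit sounds (S904)."""
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        for sound in split_into_sounds(chunk):
            yield sound  # sent to the screen display unit one sound at a time


frames = ["a", "a", "a", None, "i", "i", "u"]
print(list(analyze(frames, window=4)))  # [('a', 3), ('i', 2), ('u', 1)]
```

Note that a sound straddling a window boundary would be emitted as two pieces in this sketch; how the actual device handles that case is not described in the source.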

FIG. 10 is a flowchart showing details of the operation of the screen display unit 704 according to the second embodiment. From the sound data for one sound received from the voice analysis unit, an image of the mouth shape matching that sound is generated (step S1001) and may be displayed on the display (step S1002) for a fixed period (step S1003).
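The per-sound display step can be sketched as a lookup followed by a fixed hold time. The image file names, the 200 ms hold time, and the function name `display_schedule` are hypothetical; the sketch only computes when each image would be shown and does not drive a real display.

```python
# Hypothetical mapping from a vowel to a prepared mouth-shape image.
MOUTH_IMAGES = {
    "a": "mouth_a.png",
    "i": "mouth_i.png",
    "u": "mouth_u.png",
    "e": "mouth_e.png",
    "o": "mouth_o.png",
}


def display_schedule(sounds, hold_ms=200):
    """Return (start_time_ms, image) pairs, holding each image for hold_ms."""
    schedule = []
    start = 0
    for sound in sounds:
        image = MOUTH_IMAGES[sound]       # S1001: pick the matching mouth shape
        schedule.append((start, image))   # S1002: show it on the display
        start += hold_ms                  # S1003: keep it for a fixed period
    return schedule


print(display_schedule(["o", "i", "a"]))
# [(0, 'mouth_o.png'), (200, 'mouth_i.png'), (400, 'mouth_a.png')]
```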

Some or all of the above-described embodiments can also be described as in the following appendices. However, the following appendices are merely examples of the present invention, and the present invention is not limited to them.
[Appendix 1]
The visualization device according to the first viewpoint described above.
[Appendix 2]
The visualization device, preferably according to Appendix 1, wherein the visualization unit generates visualization data that is a video representing the shape of the mouth when the recognition data is uttered.
[Appendix 3]
The visualization device, preferably according to Appendix 1 or 2, wherein the voice recognition unit extracts the vowel contained in each sound constituting the voice and uses the vowels as the recognition data.
[Appendix 4]
The visualization device, preferably according to any one of Appendices 1 to 3, wherein the presentation unit presents the visualization data on a transmissive display device.
[Appendix 5]
The visualization device, preferably according to any one of Appendices 1 to 4, further comprising a body recognition unit that generates body recognition data by recognizing a body part of a conversation partner, wherein the presentation unit presents the visualization data based on the body recognition data.
[Appendix 6]
The visualization device, preferably according to Appendix 5, wherein the body recognition unit generates body recognition data by recognizing the position of the mouth of the conversation partner, and the presentation unit uses the body recognition data to place the video representing the shape of the mouth in the visualization data at the position of the mouth of the conversation partner.
[Appendix 7]
The visualization device, preferably according to any one of Appendices 1 to 6, wherein the voice recognition unit further recognizes the voice and generates recognition data in text format, and the presentation unit further presents the recognition data in text format.
[Appendix 8]
The visualization device, preferably according to any one of Appendices 1 to 7, wherein the voice recognition unit further acquires emotion data by estimating the emotion of the conversation partner from the voice, and the visualization unit further generates the visualization data based on the emotion data.
[Appendix 9]
The visualization method according to the second viewpoint described above.
[Appendix 10]
The program according to the third viewpoint described above.
Note that Appendices 9 and 10 can be expanded into the forms of Appendices 2 to 8 in the same manner as Appendix 1.
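The placement described in Appendix 6 can be illustrated with a small coordinate calculation. The sketch below assumes that the body recognition data is a mouth center point already expressed in display pixel coordinates and that the mouth-shape video is a rectangle of known size; the function name `place_overlay` and these data representations are assumptions made for illustration, not details given in this document.

```python
from typing import Tuple

Point = Tuple[int, int]  # (x, y) in display pixels
Size = Tuple[int, int]   # (width, height) in display pixels


def place_overlay(mouth_center: Point, overlay_size: Size, display_size: Size) -> Point:
    """Return the top-left corner that centers the overlay on the mouth.

    The result is clamped so that the overlay stays fully on the display.
    """
    cx, cy = mouth_center
    w, h = overlay_size
    dw, dh = display_size
    x = cx - w // 2
    y = cy - h // 2
    x = max(0, min(x, dw - w))
    y = max(0, min(y, dh - h))
    return (x, y)


# A 40x20 overlay centered on a mouth detected at (100, 50) on a 640x480 display.
print(place_overlay((100, 50), (40, 20), (640, 480)))  # (80, 40)
```

Clamping to the display bounds is a choice made in this sketch; a real device might instead hide the overlay when the partner's mouth leaves the field of view.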

The disclosures of the patent documents and the like cited above are incorporated herein by reference. Within the framework of the entire disclosure of the present invention (including the claims), the embodiments and examples can be changed and adjusted on the basis of its basic technical concept. Also within the framework of the entire disclosure of the present invention, various combinations or selections (including partial deletion) of the various disclosed elements (including each element of each claim, each element of each embodiment or example, and each element of each drawing) are possible. That is, the present invention naturally includes the various variations and modifications that a person skilled in the art could make in accordance with the entire disclosure, including the claims, and the technical concept. In particular, for any numerical range stated herein, any numerical value or sub-range within that range should be construed as being specifically stated, even where not otherwise indicated.

10: Visualization device
11: Voice acquisition unit
12: Voice recognition unit
13: Visualization unit
14: Presentation unit
15: Body recognition unit
51: CPU
52: Memory
53: Input/output interface
55: Internal bus
601: Smart glasses
602: Microphone
603: Listener
604: Speaker
701: Microphone
702: Transmissive display
703: Voice analysis unit
704: Screen display unit
801: Image
802: Smart glasses
803: Transmissive display

Claims

1. A visualization device comprising:
a voice acquisition unit that acquires a voice;
a voice recognition unit that recognizes the voice and generates recognition data;
a visualization unit that generates visualization data by visualizing the recognition data; and
a presentation unit that presents the visualization data as a video.

2. The visualization device according to claim 1, wherein the visualization unit generates visualization data that is a video representing the shape of the mouth when the recognition data is uttered.

3. The visualization device according to claim 1, wherein the voice recognition unit extracts the vowel contained in each sound constituting the voice and uses the vowels as the recognition data.

4. The visualization device according to any one of claims 1 to 3, wherein the presentation unit presents the visualization data on a transmissive display device.

5. The visualization device according to any one of claims 1 to 4, further comprising a body recognition unit that generates body recognition data by recognizing a body part of a conversation partner,
wherein the presentation unit presents the visualization data based on the body recognition data.

6. The visualization device according to claim 5, wherein the body recognition unit generates body recognition data by recognizing the position of the mouth of the conversation partner, and
the presentation unit uses the body recognition data to place the video representing the shape of the mouth in the visualization data at the position of the mouth of the conversation partner.

7. The visualization device according to any one of claims 1 to 6, wherein the voice recognition unit further recognizes the voice and generates recognition data in text format, and
the presentation unit further presents the recognition data in text format.

8. The visualization device according to any one of claims 1 to 7, wherein the voice recognition unit further acquires emotion data by estimating the emotion of a conversation partner from the voice, and
the visualization unit further generates the visualization data based on the emotion data.

9. A visualization method comprising:
acquiring a voice;
recognizing the voice and generating recognition data;
generating visualization data by visualizing the recognition data; and
presenting the visualization data.

10. A program that causes a computer to execute:
a process of acquiring a voice;
a process of recognizing the voice and generating recognition data;
a process of generating visualization data by visualizing the recognition data; and
a process of presenting the visualization data.