JP2021108843A

JP2021108843A - Cognitive function determination apparatus, cognitive function determination system, and computer program

Info

Publication number: JP2021108843A
Application number: JP2020001799A
Authority: JP
Inventors: 真輝大西; Masaki Onishi
Original assignee: Exa Wizards Inc
Current assignee: Exa Wizards Inc
Priority date: 2020-01-09
Filing date: 2020-01-09
Publication date: 2021-08-02
Anticipated expiration: 2040-01-09
Also published as: JP6712028B1

Abstract

To provide a cognitive function determination apparatus, a cognitive function determination system, a computer program, and a cognitive function determination method capable of determining a cognitive function regardless of the attributes of an object person. SOLUTION: The cognitive function determination apparatus comprises: an acquisition unit for acquiring a voice of an object person; a conversion unit for converting the acquired voice to a reference voice; and a determination unit for determining a cognitive function of the object person on the basis of the converted reference voice. SELECTED DRAWING: Figure 1

Description

The present invention relates to a cognitive function determination device, a cognitive function determination system, a computer program, and a cognitive function determination method.

In recent years, the growing number of dementia patients has become a concern, and techniques for the early detection of dementia have been developed using various approaches. Patent Document 1 discloses a device that extracts prosodic features from a user's voice data and calculates the risk of cognitive dysfunction using a learning model constructed in advance.

Patent Document 1: Japanese Unexamined Patent Publication No. 2011-255106

However, voice elements such as prosody differ depending on attributes of the subject such as age, gender, and physique, so the cognitive function may not be determined accurately when the attributes differ. Moreover, determining the cognitive function accurately would require preparing a separate learning model for each attribute such as age, which is not practical.

The present invention has been made in view of these circumstances, and its object is to provide a cognitive function determination device, a cognitive function determination system, a computer program, and a cognitive function determination method capable of determining the cognitive function regardless of the attributes of the subject.

A cognitive function determination device according to an embodiment of the present invention includes an acquisition unit that acquires the voice of a subject, a conversion unit that converts the voice acquired by the acquisition unit into a reference voice, and a determination unit that determines the cognitive function of the subject based on the reference voice converted by the conversion unit.

A cognitive function determination system according to an embodiment of the present invention includes an acquisition unit that acquires the voice of a subject, a conversion unit that converts the voice acquired by the acquisition unit into a reference voice, and a determination unit that determines the cognitive function of the subject based on the reference voice converted by the conversion unit.

A computer program according to an embodiment of the present invention causes a computer to execute a process of acquiring the voice of a subject, a process of converting the acquired voice into a reference voice, and a process of determining the cognitive function of the subject based on the converted reference voice.

A cognitive function determination method according to an embodiment of the present invention acquires the voice of a subject, converts the acquired voice into a reference voice, and determines the cognitive function of the subject based on the converted reference voice.

According to the present invention, the cognitive function can be determined regardless of attributes of the subject such as age, gender, and physique.

FIG. 1 is a schematic diagram showing an example of the configuration of the cognitive function determination system of the present embodiment.

FIG. 2 is a schematic diagram showing an example of the voice waveform of a dialogue voice.

FIG. 3 is a schematic diagram showing an example of the attributes corresponding to the reference voice.

FIG. 4 is a schematic diagram showing an example of the configuration of the voice conversion unit.

FIG. 5 is an explanatory diagram showing a first example of the configuration of the parameter conversion unit.

FIG. 6 is an explanatory diagram showing a second example of the configuration of the parameter conversion unit.

FIG. 7 is a schematic diagram showing a first example of the configuration of the cognitive function determination unit.

FIG. 8 is a schematic diagram showing a second example of the configuration of the cognitive function determination unit.

FIG. 9 is a schematic diagram showing a third example of the configuration of the cognitive function determination unit.

FIG. 10 is a flowchart showing an example of the processing procedure of the cognitive function determination system.

FIG. 11 is a flowchart showing an example of a method of generating a trained model.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a schematic diagram showing an example of the configuration of the cognitive function determination system of the present embodiment. The cognitive function determination system includes a cognitive function determination device 50 and a terminal device 10, which are connected via a communication network 1. The terminal device 10 can be an information processing device such as a personal computer, a tablet, a smartphone, or a smart speaker. A microphone 11 is connected to the terminal device 10 and can acquire the voices of the subject and of an interlocutor who talks with the subject. The microphone 11 may instead be built into the terminal device 10, as long as it can acquire the voices of the subject and the interlocutor. The subject is the person who undergoes the dementia determination, and the interlocutor is a person who converses with the subject, such as a doctor, a nurse, a counselor, or a caregiver. When an interlocutor is present, the subject can converse with the interlocutor; when no interlocutor is present, the subject can read aloud a predetermined text or the like. The voice of the subject, or the voices of the subject and the interlocutor, are collected by the microphone 11 and transmitted to the cognitive function determination device 50 via the terminal device 10.

The dialogue between the subject and the interlocutor may also take place online via the communication network 1. In this case, the voice of the subject is acquired by a microphone 11A connected to a terminal device 10A used by the subject, the voice of the interlocutor is acquired by a microphone 11B connected to a terminal device 10B used by the interlocutor, and the voices acquired by the microphones 11A and 11B are transmitted to the cognitive function determination device 50.

The cognitive function determination device 50 includes a control unit 51 that controls the entire device, a communication unit 52, a voice identification unit 53, a storage unit 54, a voice conversion unit 55, a cognitive function determination unit 56, and a learning processing unit 57. The control unit 51 can be composed of a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The communication unit 52 can be composed of the required communication module. The voice identification unit 53 can be composed of a CPU. The storage unit 54 can be composed of a hard disk, a flash memory, or the like. The voice conversion unit 55 and the cognitive function determination unit 56 can be composed of, for example, neural networks. The learning processing unit 57 can be configured by combining hardware such as a CPU (for example, a multiprocessor with a plurality of processor cores), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), and an FPGA (Field-Programmable Gate Array). The functions of the control unit 51, the voice identification unit 53, the storage unit 54, the voice conversion unit 55, the cognitive function determination unit 56, and the learning processing unit 57 of the cognitive function determination device 50 may all be provided in the terminal device 10 so that the terminal device 10 determines the cognitive function level, or only some of the functions of the cognitive function determination device 50 (for example, the voice identification unit 53 and the voice conversion unit 55) may be provided in the terminal device 10. The functions of the cognitive function determination device 50 may also be distributed across a plurality of devices.

The communication unit 52 communicates with the terminal device 10 via the communication network 1 and can exchange the required information with the terminal device 10. The communication unit 52 functions as an acquisition unit and can acquire, from the terminal device 10, either the dialogue voice of the subject and the interlocutor or the voice of the subject alone.

FIG. 2 is a schematic diagram showing an example of the voice waveform of a dialogue voice. The vertical axis shows the amplitude of the voice signal, and the horizontal axis shows time. In the example of FIG. 2, the utterances occur in the order interlocutor's voice 1, subject's voice 1, interlocutor's voice 2, subject's voice 2, and interlocutor's voice 3. There is a response delay time between the interlocutor's voice 1 and the subject's voice 1, and likewise between the interlocutor's voice 2 and the subject's voice 2.

The voice identification unit 53 functions as an identification unit and can distinguish the voice of the subject from the voice of the interlocutor in the dialogue voice acquired via the communication unit 52. The voices can be identified by storing voice data of the subject and the interlocutor in the storage unit 54 in advance and collating the acquired voice against the stored voice data. Machine learning may also be used for voice identification. For example, the voices of interlocutors such as doctors, nurses, counselors, and caregivers can be identified by machine learning, and the voice of the subject may also be learned in advance. As another method, the interlocutor operates an operation button or the like provided on the terminal device 10 when the subject speaks or when the interlocutor speaks, and the terminal device 10 transmits an identification flag, indicating that the button or the like has been operated, to the cognitive function determination device 50 in synchronization with the voice data. The voice identification unit 53 acquires the identification flag and can identify whether a voice is that of the subject or of the interlocutor according to the presence or absence of the flag. As yet another method, the directivity of the microphone 11 can be used. For example, the microphone 11 can be placed so that the subject is inside its region of high directivity, and the two voices can be distinguished by the magnitude of the distortion of the sound.
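The identification-flag method described above can be illustrated with a short sketch. It assumes, purely for illustration, that the voice data arrives as a sequence of frames with one boolean flag per frame and that a set flag marks the subject's speech; the function name `split_by_flag` and this frame layout are hypothetical and are not specified in the description.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def split_by_flag(
    frames: Sequence[Frame],
    flags: Sequence[bool],
    flag_means_subject: bool = True,
) -> Tuple[List[Frame], List[Frame]]:
    """Split synchronised voice frames into (subject, interlocutor) streams.

    Assumption: flags[i] is True while the operation button is pressed
    during frames[i]. Which speaker the flag marks is configurable.
    """
    if len(frames) != len(flags):
        raise ValueError("frames and flags must be synchronised (same length)")
    subject: List[Frame] = []
    interlocutor: List[Frame] = []
    for frame, flagged in zip(frames, flags):
        # Presence or absence of the flag decides the speaker.
        if flagged == flag_means_subject:
            subject.append(frame)
        else:
            interlocutor.append(frame)
    return subject, interlocutor
```

With four frames and flags `[False, True, True, False]`, the two middle frames are attributed to the subject and the outer two to the interlocutor.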

The voice conversion unit 55 functions as a conversion unit and converts the voice acquired via the communication unit 52 into a reference voice.

FIG. 3 is a schematic diagram showing an example of the attributes corresponding to the reference voice. Voice is produced mainly by the vocal cords and the vocal tract. The vocal cords act as an on-off valve and vibrate periodically under the breath exhaled from the lungs. The vocal tract is the cavity formed by the oral cavity, the nasal cavity, and so on. Because the vocal cords and vocal tract differ according to attributes such as a person's age, gender, and physique, voice quality also differs from person to person. In the example of FIG. 3, the attributes of subjects are denoted C1, C2, C3, ..., and the age, gender, height, and weight differ for each attribute. The attributes of the interlocutor are handled in the same way as those of the subject. The reference voice, in contrast, can be a voice whose attributes such as age, gender, and physique are fixed to predetermined values. As shown in FIG. 3, the predetermined attributes can be, for example, those of a 50-year-old man of standard build (for example, 170 cm tall and weighing 70 kg). The attributes corresponding to the reference voice are not limited to the example of FIG. 3.

The voice conversion unit 55 can convert the voice quality while retaining the phonological information contained in the voice of the subject or the interlocutor.

FIG. 4 is a schematic diagram showing an example of the configuration of the voice conversion unit 55. As shown in FIG. 4, the voice conversion unit 55 includes a parameter extraction unit 551, a parameter conversion unit 552, and a voice synthesis unit 553. When the voice signal of the subject or the interlocutor (or of both) is input, the voice conversion unit 55 converts the input voice signal into a reference voice signal (the voice signal of the reference voice) and outputs the converted reference voice signal.

The parameter extraction unit 551 extracts parameters such as the pitch X and the formant frequency Y from the input voice signal and outputs them to the parameter conversion unit 552. The formant frequency Y can include a first formant frequency Y1, a second formant frequency Y2, a third formant frequency Y3, a fourth formant frequency Y4, and so on. The pitch X relates to how high or low the voice is and to the shape of the vocal tract (for example, its length), so differences in attributes appear as differences in the pitch X. The formant frequency Y likewise relates to the shape of the vocal tract and the like, so differences in attributes appear as differences in the formant frequency Y. In this specification, the parameters extracted by the parameter extraction unit 551 may be any parameters that can express differences in vocal tract shape or vocal cords, and include the pitch and formant frequencies described above. The parameters may overlap with some of the voice features described later.
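The description does not say how the parameter extraction unit 551 obtains the pitch. As one common possibility, the sketch below estimates the pitch of a single voiced frame by autocorrelation using only the Python standard library. The function name `estimate_pitch`, the search range of 60 to 400 Hz, and the use of autocorrelation itself are assumptions made for illustration; formant extraction is not covered.

```python
import math
from typing import Sequence


def estimate_pitch(
    samples: Sequence[float],
    sample_rate: int,
    fmin: float = 60.0,
    fmax: float = 400.0,
) -> float:
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    Illustrative autocorrelation method; not taken from the description.
    """
    n = len(samples)
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), n - 1)
    if min_lag < 1 or max_lag < min_lag:
        raise ValueError("frame too short for the requested pitch range")

    # Remove the DC offset so that it does not dominate the correlation.
    mean = sum(samples) / n
    x = [s - mean for s in samples]

    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation: fewer terms at longer lags, so the
        # shortest period is preferred over its integer multiples.
        score = sum(x[i] * x[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag
```

For a pure 200 Hz tone sampled at 8000 Hz, the strongest correlation occurs at a lag of 40 samples, which corresponds to 200 Hz. Real speech would additionally need voiced/unvoiced detection and framing, which this sketch omits.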

The parameter conversion unit 552 converts the parameters of the input voice into the parameters of the reference voice. For example, the parameter conversion unit 552 converts the pitch X into a reference pitch P and the formant frequency Y into a reference formant frequency F, and outputs the converted parameters (the reference pitch P and the reference formant frequency F) to the voice synthesis unit 553. The reference formant frequency F can include a first formant frequency F1, a second formant frequency F2, a third formant frequency F3, a fourth formant frequency F4, and so on. The reference pitch P and the reference formant frequency F are, for example, the parameters corresponding to the reference voice illustrated in FIG. 3.

The voice synthesis unit 553 can generate the reference voice from the voice parameters input from the parameter conversion unit 552 (the reference pitch P and the reference formant frequency F) and output it to the cognitive function determination unit 56.

FIG. 5 is an explanatory diagram showing a first example of the configuration of the parameter conversion unit 552. As shown in FIG. 5, the parameter conversion unit 552 can be configured as a conversion table 552a. For example, when the pitch extracted from the voice of the subject or the interlocutor is X1 and the first to fourth formant frequencies are Y11 to Y41, they can be converted into the reference pitch P and the reference formant frequency F using a conversion formula FF1. Similarly, when the extracted pitch is X2 and the first to fourth formant frequencies are Y12 to Y42, they can be converted into the reference pitch P and the reference formant frequency F using a conversion formula FF2. The same applies to other attributes. In this way, the parameter conversion unit 552 can convert parameters into those of the reference voice on a rule basis. Instead of conversion formulas, the numerical values of the reference pitch P and the reference formant frequency F may be recorded in the conversion table. In that case, the pitch X1 and the first to fourth formant frequencies Y11 to Y41 are simply replaced with the numerical values of the reference pitch P and the reference formant frequency F.
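A rule-based conversion table of this kind can be sketched as a lookup from an attribute class to a conversion formula. The description does not give the form of FF1 or FF2, so the simple scaling used here, the ratio values, and the names `VoiceParams`, `make_scaling_formula`, and `to_reference` are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class VoiceParams:
    pitch_hz: float
    formants_hz: Tuple[float, ...]  # first to fourth formant frequencies


Formula = Callable[[VoiceParams], VoiceParams]


def make_scaling_formula(pitch_ratio: float, formant_ratio: float) -> Formula:
    """Build a conversion formula that scales pitch and formants.

    Scaling is an assumed, illustrative form; the actual formulas are
    not disclosed in the description.
    """
    def formula(p: VoiceParams) -> VoiceParams:
        return VoiceParams(
            pitch_hz=p.pitch_hz * pitch_ratio,
            formants_hz=tuple(f * formant_ratio for f in p.formants_hz),
        )
    return formula


# Conversion table: attribute class -> formula (ratios are made-up values).
CONVERSION_TABLE: Dict[str, Formula] = {
    "C1": make_scaling_formula(0.5, 0.75),  # stands in for FF1
    "C2": make_scaling_formula(1.0, 1.0),   # stands in for FF2
}


def to_reference(attribute: str, params: VoiceParams) -> VoiceParams:
    """Convert extracted parameters to reference-voice parameters by rule."""
    return CONVERSION_TABLE[attribute](params)
```

For instance, `to_reference("C1", VoiceParams(200.0, (800.0, 1200.0)))` yields a pitch of 100.0 Hz and formants of 600.0 Hz and 900.0 Hz under these placeholder ratios. The variant in which the table stores fixed reference values would replace each formula with one that ignores its input and returns constants.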

FIG. 6 is an explanatory diagram showing a second example of the configuration of the parameter conversion unit 552. As shown in FIG. 6, the parameter conversion unit 552 can be configured as a neural network 552b. The neural network 552b functions as a first learning model and includes an input layer, an intermediate layer, and an output layer. For example, a DNN, an RNN, a CNN, or an autoencoder can be used as the neural network 552b, but other models may also be used. The learning processing unit 57 can generate the trained neural network 552b using learning data. The learning processing unit 57 can be configured by combining hardware such as a CPU (for example, a multiprocessor with a plurality of processor cores), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), and an FPGA (Field-Programmable Gate Array). A quantum processor can also be combined.

The neural network 552b can be generated by supplying parameters of a person's voice data to the input layer and supplying the corresponding reference-voice parameters to the output layer. In this case, the parameters supplied to the input layer can be the pitch and formant frequencies extracted from the voice data of a person with a certain attribute, and the parameters supplied to the output layer can be the pitch and formant frequencies extracted from the voice data of the reference voice. The learning data can therefore consist of the pitch and formant frequencies extracted from the voice data of a person with a certain attribute, together with the pitch and formant frequencies extracted from the voice data of the reference voice. Such learning data can be prepared as first training data by collecting voice data from people with various attributes. Because the voice of a person with an arbitrary attribute supplied to the input layer and the reference voice supplied to the output layer must form pairs of parameters corresponding to the same phonemes, the two utterances must have the same content (that is, both speakers say the same thing). The voice conversion unit 55 can thereby use the neural network 552b to convert the voices of the subject and the interlocutor into the reference voice.
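The paired training scheme described above can be sketched with a deliberately tiny network. This is a minimal illustration in pure Python, assuming a single hidden tanh layer trained by stochastic gradient descent on a mean squared error; the description names DNN, RNN, CNN, and autoencoder as options but does not fix an architecture, loss, or optimiser, and the function name `train_mlp`, its hyperparameters, and the requirement that inputs be pre-normalised are assumptions of this sketch.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (input params, reference params)


def train_mlp(
    pairs: Sequence[Pair],
    hidden: int = 4,
    lr: float = 0.05,
    epochs: int = 200,
    seed: int = 0,
) -> Tuple[Callable[[Sequence[float]], List[float]], List[float]]:
    """Train a one-hidden-layer network on (speaker, reference) parameter pairs.

    Inputs and targets are assumed to be normalised to a small range.
    Returns the prediction function and the per-epoch mean squared error.
    """
    rng = random.Random(seed)
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(n_out)]
    b2 = [0.0] * n_out

    def forward(x: Sequence[float]) -> Tuple[List[float], List[float]]:
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        y = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]
        return h, y

    def mse() -> float:
        total = 0.0
        for x, t in pairs:
            total += sum((yp - yt) ** 2 for yp, yt in zip(forward(x)[1], t))
        return total / len(pairs)

    history = [mse()]
    for _ in range(epochs):
        for x, t in pairs:
            h, y = forward(x)
            # Gradient of the squared error with respect to the outputs.
            dy = [2.0 * (yp - yt) for yp, yt in zip(y, t)]
            # Back-propagate through the output layer and the tanh units
            # (computed before any weights are updated).
            dh = [sum(dy[k] * w2[k][j] for k in range(n_out)) * (1.0 - h[j] ** 2)
                  for j in range(hidden)]
            for k in range(n_out):
                for j in range(hidden):
                    w2[k][j] -= lr * dy[k] * h[j]
                b2[k] -= lr * dy[k]
            for j in range(hidden):
                for i in range(n_in):
                    w1[j][i] -= lr * dh[j] * x[i]
                b1[j] -= lr * dh[j]
        history.append(mse())

    return (lambda x: forward(x)[1]), history
```

Each training pair holds the normalised pitch and formant values of one speaker and the reference-voice values for the same utterance content, matching the pairing requirement stated above. A practical system would use an established framework and far more data; the sketch only shows the input-to-target arrangement.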

The cognitive function determination unit 56 functions as a determination unit and determines the cognitive function of the subject based on the reference voice converted by the voice conversion unit 55. The determination can be based on, for example, voice features of the reference voice, such as the pitch (related to how high the voice is), the formant frequencies (related to the characteristics of vowels and consonants), and the mel-frequency cepstral coefficients (MFCC, related to vocal tract characteristics). For the determination, a rule base or a learning model such as a support vector machine (SVM, a machine learning method) or a neural network can be used. In this specification, a voice feature is a feature from which cognitive dysfunction can be determined, and may be any feature that can characterize the prosody of the voice. The voice features include, for example, the pitch, formant frequencies, and mel-frequency cepstral coefficients described above, or a combination of these.

As described above, the voice of the subject is converted into the reference voice and the cognitive function is determined from the converted reference voice. Therefore, even when the voice quality of each subject differs according to attributes such as age, gender, and physique, there is no need to prepare in advance a learning model or determination device suited to each attribute. In other words, the cognitive function can be determined regardless of attributes such as the subject's age, gender, and physique, without preparing attribute-specific learning models or determination devices. The determination can also be applied to young people as well as to the elderly.

The cognitive function determination unit 56 can also determine the cognitive function of the subject based on the reference voice of the interlocutor in addition to the reference voice of the subject. That is, the reference voice of the interlocutor who talks with the subject can also be used as an element of the determination. Because the subject's responses, such as answers, to the interlocutor's utterances, such as questions, can be used in the determination, how the subject reacts when spoken to becomes part of the evidence, which can improve the accuracy of the determination.

Next, the details of the cognitive function determination unit 56 will be described.

FIG. 7 is a schematic diagram showing a first example of the configuration of the cognitive function determination unit 56. As shown in FIG. 7, the cognitive function determination unit 56 includes a voice feature extraction unit 561 and a DNN (Deep Neural Network) 562. Based on the voice waveform of the interlocutor and the voice waveform of the subject (for example, a waveform that groups an interlocutor's question with the subject's answer to it), the voice feature extraction unit 561 extracts the voice features of the subject (for example, pitch, formant frequencies, and mel-frequency cepstral coefficients (MFCC)) and the voice features of the interlocutor (for example, pitch, formant frequencies, and MFCC). Of the three elements of voice (prosody, sound quality, and phonology), prosody is known to be particularly important nonverbal information for identifying cognitive dysfunction. The DNN 562 can therefore be trained using the pitch, formant frequencies, and MFCC as voice features that characterize prosody.

An identification flag can be input to the voice feature extraction unit 561. The identification flag can be a subject flag or an interlocutor flag. For example, when the interlocutor's voice is input to the voice feature extraction unit 561, the interlocutor flag may be input continuously while the voice is being input, or it may be input at the start and end of the interlocutor's voice. Likewise, when the subject's voice is input, the subject flag may be input continuously while the voice is being input, or at the start and end of the subject's voice. The voice feature extraction unit 561 can thereby tell the subject and the interlocutor apart, whether only the subject's voice is input or the voices of the subject and the interlocutor are input alternately. The voice feature extraction unit 561 inputs the extracted features of the subject and of the interlocutor to the DNN 562.

また、ＤＮＮ５６２には、対話者の質問等の発話に対する対象者の回答等の応答時間を入力してもよい。ＤＮＮ５６２は、健常者及び認知機能障害者と対話する対話者の発話に対する健常者及び認知機能障害者の応答時間を含む学習用データを用いて生成されている。応答時間は、対話者の発話の終了時点から健常者及び認知機能障害者の回答の開始時点までの時間とすることができる。認知機能が低下するのに応じて応答時間は長くなる傾向があると考えられるので、応答時間を学習用データに含めることにより、ＤＮＮ５６２の認知機能の判定の精度を向上させることができる。 Further, the response time such as the answer of the target person to the utterance of the interlocutor's question or the like may be input to the DNN 562. DNN562 is generated using learning data including the response times of the healthy and cognitively impaired persons to the utterances of the interlocutors who interact with the healthy and cognitively impaired persons. The response time can be the time from the end of the interlocutor's utterance to the start of the response of the healthy person and the cognitively impaired person. Since it is considered that the response time tends to increase as the cognitive function declines, the accuracy of determining the cognitive function of DNN562 can be improved by including the response time in the learning data.
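The response time defined above, measured from the end of the interlocutor's utterance to the start of the subject's answer, can be computed from time-stamped turns. This is a minimal sketch; the `Turn` record and the assumption that turn boundaries are already available in seconds are illustrative and not part of the embodiment.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: str  # "interlocutor" or "subject"
    start: float  # seconds from the start of the recording
    end: float

def response_times(turns: List[Turn]) -> List[float]:
    """Gap from the end of each interlocutor turn to the start of the subject's reply."""
    gaps = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "interlocutor" and cur.speaker == "subject":
            gaps.append(cur.start - prev.end)
    return gaps

dialogue = [
    Turn("interlocutor", 0.0, 2.0),
    Turn("subject", 3.5, 6.0),
    Turn("interlocutor", 6.5, 8.0),
    Turn("subject", 8.75, 10.0),
]
gaps = response_times(dialogue)
```

Each gap (or a summary such as the mean) could then be supplied to the DNN 562 alongside the voice feature amounts.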

ＤＮＮ５６２は、人の音声が変換された基準音声が入力された場合に、当該人の認知機能レベルを出力することができる。図７の例では、認知機能レベル（認知機能障害のレベル）をレベル「１」からレベル「ｍ」までのｍ個に区分している。認知機能レベルｍが重度の認知機能障害に相当し、レベルを示す数値が小さいほど、認知機能障害は軽くなる。認知機能レベルが所定の閾値以下であれば健常者と判定し、所定の閾値を超える場合には認知症と判定してもよい。 The DNN562 can output the cognitive function level of the person when the reference voice converted from the person's voice is input. In the example of FIG. 7, the cognitive function level (level of cognitive dysfunction) is divided into m from level "1" to level "m". The cognitive function level m corresponds to severe cognitive dysfunction, and the smaller the numerical value indicating the level, the less the cognitive dysfunction. If the cognitive function level is equal to or lower than a predetermined threshold value, it may be determined to be a healthy person, and if it exceeds a predetermined threshold value, it may be determined to be dementia.
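The threshold rule above can be written in a few lines. The default values `m=5` and `threshold=2` are placeholders chosen for illustration; the embodiment leaves the threshold unspecified.

```python
def classify_level(level: int, m: int = 5, threshold: int = 2) -> str:
    """Map a cognitive function level in 1..m to a two-way determination."""
    if not 1 <= level <= m:
        raise ValueError(f"level must be between 1 and {m}")
    # At or below the threshold: healthy. Above it: dementia.
    return "healthy" if level <= threshold else "dementia"
```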

学習処理部５７は、学習用データ（第２訓練データ）を用いて学習済のＤＮＮ５６２を生成することができる。ＤＮＮ５６２は、健常者及び認知機能障害者それぞれの音声が変換された基準音声の基準音声データを入力層に与え、入力層に与える基準音声データに対応する健常者及び認知機能障害者それぞれの認知機能レベルを出力層に与えて生成することができる。この場合、健常者及び認知機能障害者の音声データから音声特徴量（例えば、ピッチ、フォルマント周波数、メル周波数スペクトラム係数など）を抽出し、抽出した音声特徴量を学習用データとして用いることができる。また、認知機能レベルは、例えば、数値で１〜５の如く５段階に区分してもよい（図７の例ではｍ＝５）。この場合、認知機能レベルが所定の閾値以下であれば健常者と判定し、所定の閾値を超える場合には認知症と判定してもよい。また、認知機能レベルは、正常、軽度認知症及び重度認知症の如く３段階に区分してもよく、正常及び認知症の如く２段階で区分してもよい。これにより、ＤＮＮ５６２は、変換された基準音声に基づいて対象者の認知機能レベルを判定することができる。 The learning processing unit 57 can generate the trained DNN562 using the learning data (second training data). The DNN562 gives the input layer the reference voice data of the reference voice in which the voices of the healthy person and the cognitively impaired person are converted, and the cognitive function of each of the healthy person and the cognitively impaired person corresponding to the reference voice data given to the input layer. It can be generated by giving a level to the output layer. In this case, a voice feature amount (for example, pitch, formant frequency, mel frequency spectrum coefficient, etc.) can be extracted from the voice data of a healthy person and a cognitively impaired person, and the extracted voice feature amount can be used as learning data. Further, the cognitive function level may be divided into 5 stages such as 1 to 5 numerically (m = 5 in the example of FIG. 7). In this case, if the cognitive function level is equal to or lower than a predetermined threshold value, it may be determined to be a healthy person, and if it exceeds the predetermined threshold value, it may be determined to be dementia. In addition, the cognitive function level may be divided into three stages such as normal, mild dementia and severe dementia, and may be divided into two stages such as normal and dementia. Thereby, the DNN 562 can determine the cognitive function level of the subject based on the converted reference voice.

また、ＤＮＮ５６２は、健常者及び認知機能障害者それぞれの音声が変換された基準音声に加えて、対話者の音声が変換された基準音声の基準音声データを入力層に与え、入力層に与える基準音声データに対応する健常者及び認知機能障害者それぞれの認知機能レベルを出力層に与えて生成することができる。 The DNN 562 can also be generated by giving the input layer, in addition to the reference voice obtained by converting the voices of healthy persons and cognitively impaired persons, the reference voice data of the reference voice obtained by converting the interlocutor's voice, and by giving the output layer the cognitive function level of each healthy person and cognitively impaired person corresponding to the reference voice data given to the input layer.

図８は認知機能判定部５６の構成の第２例を示す模式図である。図８に示すように、認知機能判定部５６は、ＲＮＮ（Recurrent Neural Network：再帰型ニューラルネットワーク）５６３を備える。図８に示すように、対話者の音声波形と対象者の音声波形（例えば、対話者の質問と質問に対する対象者の回答を１つの纏まりとする音声波形）がＲＮＮ５６３に入力されると、ＲＮＮ５６３は、対象者の認知機能レベルを出力することができる。図８の例では、認知機能レベルをレベル「１」からレベル「ｍ」までのｍ個に区分している。認知機能レベルｍが重度の認知機能障害に相当し、レベルを示す数値が小さいほど、認知機能障害は軽くなる。 FIG. 8 is a schematic diagram showing a second example of the configuration of the cognitive function determination unit 56. As shown in FIG. 8, the cognitive function determination unit 56 includes an RNN (Recurrent Neural Network) 563. As shown in FIG. 8, when the voice waveform of the interlocutor and the voice waveform of the subject (for example, the voice waveform that combines the question of the interlocutor and the answer of the subject to the question) are input to the RNN563, the RNN563 Can output the cognitive function level of the subject. In the example of FIG. 8, the cognitive function level is divided into m from level "1" to level "m". The cognitive function level m corresponds to severe cognitive dysfunction, and the smaller the numerical value indicating the level, the less the cognitive dysfunction.

ＲＮＮ５６３には、識別フラグを入力することができる。識別フラグは、対象者フラグ及び対話者フラグとすることができる。例えば、ＲＮＮ５６３に対話者の音声が入力される場合、音声が入力されている間、対話者フラグを入力し続けてもよく、対話者の音声の開始と終了時に対話者フラグを入力してもよい。ＲＮＮ５６３に対象者の音声が入力される場合、音声が入力されている間、対象者フラグを入力し続けてもよく、対象者の音声の開始と終了時に対象者フラグを入力してもよい。これにより、ＲＮＮ５６３は、対象者のみの音声が入力される場合でも、対象者の音声と対話者の音声が順番に繰り返し入力される場合でも、対象者と対話者の別を識別することができる。なお、対話者フラグは入力しなくてもよい。例えば、対象者の音声だけがＲＮＮ５６３に入力される場合、あるいは、対象者の音声と対話者の音声とが予め識別される場合には、対話者フラグは不要である。 An identification flag can be input to the RNN 563. The identification flag can be a subject flag and an interlocutor flag. For example, when the interlocutor's voice is input to the RNN 563, the interlocutor flag may be input continuously while the voice is being input, or it may be input at the start and at the end of the interlocutor's voice. Likewise, when the subject's voice is input to the RNN 563, the subject flag may be input continuously while the voice is being input, or it may be input at the start and at the end of the subject's voice. As a result, the RNN 563 can distinguish the subject from the interlocutor both when only the subject's voice is input and when the subject's voice and the interlocutor's voice are input alternately and repeatedly. The interlocutor flag does not have to be input. For example, when only the subject's voice is input to the RNN 563, or when the subject's voice and the interlocutor's voice have been identified in advance, the interlocutor flag is unnecessary.

学習処理部５７は、学習用データを用いて学習済のＲＮＮ５６３を生成することができる。ＲＮＮ５６３は、健常者及び認知機能障害者それぞれの音声が変換された基準音声の基準音声データを入力層に与え、入力層に与える基準音声データに対応する健常者及び認知機能障害者それぞれの認知機能レベルを出力層に与えて生成することができる。この場合、学習用データとしての音声データは、健常者及び認知機能障害者の音声データでもよく、健常者及び認知機能障害者と対話者の両方の音声データでもよい。音声データは、そのまま学習用データとして直接用いることができる。また、識別フラグをＲＮＮ５６３に入力して学習させてもよい。ＲＮＮ５６３は、変換された基準音声に基づいて対象者の認知機能レベルを判定することができる。 The learning processing unit 57 can generate the trained RNN563 using the learning data. The RNN563 gives the reference voice data of the reference voice to which the voices of the healthy person and the cognitively impaired person are converted to the input layer, and the cognitive function of each of the healthy person and the cognitively impaired person corresponding to the reference voice data given to the input layer. It can be generated by giving a level to the output layer. In this case, the voice data as the learning data may be voice data of a healthy person and a cognitively impaired person, or may be voice data of both a healthy person and a cognitively impaired person and an interlocutor. The voice data can be directly used as learning data as it is. Further, the identification flag may be input to the RNN563 for learning. The RNN563 can determine the cognitive function level of the subject based on the converted reference speech.

また、ＲＮＮ５６３は、健常者及び認知機能障害者それぞれの音声が変換された基準音声に加えて、対話者の音声が変換された基準音声の基準音声データを入力層に与え、入力層に与える基準音声データに対応する健常者及び認知機能障害者それぞれの認知機能レベルを出力層に与えて生成することができる。 The RNN 563 can also be generated by giving the input layer, in addition to the reference voice obtained by converting the voices of healthy persons and cognitively impaired persons, the reference voice data of the reference voice obtained by converting the interlocutor's voice, and by giving the output layer the cognitive function level of each healthy person and cognitively impaired person corresponding to the reference voice data given to the input layer.

図９は認知機能判定部５６の構成の第３例を示す模式図である。図９に示すように、認知機能判定部５６は、ＦＦＴ変換部５６５、及びＣＮＮ（Convolutional Neural Network：畳み込みニューラルネットワーク）５６４を備える。ＦＦＴ（Fast Fourier Transform：高速フーリエ変換）変換部５６５は、対話者の音声波形と対象者の音声波形（例えば、対話者の質問と質問に対する対象者の回答を１つの纏まりとする音声波形）をスペクトログラムに変換し、変換した、対象者及び対話者それぞれのスペクトログラムをＣＮＮ５６４に出力する。スペクトログラムは、２次元マップであり、縦軸は周波数を示し、横軸は時間を示し、２次元上の各点（座標）の明るさ又は色等によって、その点での周波数の振幅（強さ）を表すことができる。スペクトログラムは、対話者と対象者の音声波形にどのような周波数成分が含まれるかを示すことができる。 FIG. 9 is a schematic diagram showing a third example of the configuration of the cognitive function determination unit 56. As shown in FIG. 9, the cognitive function determination unit 56 includes an FFT conversion unit 565 and a CNN (Convolutional Neural Network) 564. The FFT (Fast Fourier Transform) transform unit 565 converts the interlocutor's voice waveform and the subject's voice waveform (for example, a voice waveform that combines the interlocutor's question and the subject's answer to the question). It is converted into a spectrogram, and the converted spectrograms of the subject and the interlocutor are output to CNN564. The spectrogram is a two-dimensional map, the vertical axis shows the frequency, the horizontal axis shows the time, and the amplitude (strength) of the frequency at that point depends on the brightness or color of each point (coordinates) in the two dimensions. ) Can be expressed. The spectrogram can show what frequency components are included in the speech waveforms of the interlocutor and the subject.
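The conversion of a waveform into a spectrogram (time on one axis, frequency on the other, magnitude as the value at each point) can be sketched with a short-time FFT. The recursive radix-2 `fft`, the Hann window, the frame length of 64 and the hop of 32 are illustrative choices; the embodiment does not state the parameters used by the FFT conversion unit 565.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def spectrogram(samples, frame_len=64, hop=32):
    """Return rows of magnitudes: one row per time frame, one column per frequency bin."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]          # periodic Hann window
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = fft(frame)
        rows.append([abs(c) for c in spectrum[:frame_len // 2 + 1]])
    return rows

# Synthetic input: a 1000 Hz tone at 8 kHz, so the peak falls in bin 1000 / (8000 / 64) = 8.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(signal)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

The resulting two-dimensional array is the kind of input that a convolutional network such as the CNN 564 would consume.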

ＦＦＴ変換部５６５には、識別フラグを入力することができる。識別フラグは、対象者フラグ及び対話者フラグとすることができる。例えば、ＦＦＴ変換部５６５に対話者の音声が入力される場合、音声が入力されている間、対話者フラグを入力し続けてもよく、対話者の音声の開始と終了時に対話者フラグを入力してもよい。ＦＦＴ変換部５６５に対象者の音声が入力される場合、音声が入力されている間、対象者フラグを入力し続けてもよく、対象者の音声の開始と終了時に対象者フラグを入力してもよい。これにより、ＦＦＴ変換部５６５は、対象者のみの音声が入力される場合でも、対象者の音声と対話者の音声が順番に繰り返し入力される場合でも、対象者と対話者の別を識別することができる。ＣＮＮ５６４は、スペクトログラムが入力されると、対象者の認知機能レベルを出力することができる。図９の例では、認知機能レベルをレベル「１」からレベル「ｍ」までのｍ個に区分している。 An identification flag can be input to the FFT conversion unit 565. The identification flag can be a subject flag and an interlocutor flag. For example, when the interlocutor's voice is input to the FFT conversion unit 565, the interlocutor flag may be input continuously while the voice is being input, or it may be input at the start and at the end of the interlocutor's voice. Likewise, when the subject's voice is input to the FFT conversion unit 565, the subject flag may be input continuously while the voice is being input, or it may be input at the start and at the end of the subject's voice. As a result, the FFT conversion unit 565 can distinguish the subject from the interlocutor both when only the subject's voice is input and when the subject's voice and the interlocutor's voice are input alternately and repeatedly. When a spectrogram is input, the CNN 564 can output the cognitive function level of the subject. In the example of FIG. 9, the cognitive function level is divided into m levels, from level "1" to level "m".

学習処理部５７は、学習用データを用いて学習済のＣＮＮ５６４を生成することができる。ＣＮＮ５６４は、健常者及び認知機能障害者の基準音声の音声データから変換されたスペクトログラムと、当該健常者及び認知機能障害者の認知機能レベルとを学習用データを用いて生成することができる。なお、スペクトログラムに代えて、音声波形を２次元マップとして捉えると、この２次元マップは、各点（座標）の明るさ又は色等によって、その点での音声信号の有無を表すことができる。そこで、２次元マップとして捉えた音声波形をＣＮＮ５６４に入力してもよい。 The learning processing unit 57 can generate the trained CNN 564 using learning data. The CNN 564 can be generated using learning data that pairs spectrograms converted from the reference-voice data of healthy persons and cognitively impaired persons with the cognitive function levels of those persons. Instead of a spectrogram, the voice waveform itself can be treated as a two-dimensional map, in which the brightness or color of each point (coordinate) indicates the presence or absence of a voice signal at that point. The voice waveform treated as a two-dimensional map may therefore be input to the CNN 564.

また、ＣＮＮ５６４は、健常者及び認知機能障害者の基準音声の音声データから変換されたスペクトログラムに加えて、対話者の基準音声の音声データから変換されたスペクトログラムと、当該健常者及び認知機能障害者の認知機能レベルとを学習用データを用いて生成することができる。 The CNN 564 can also be generated using learning data that includes, in addition to the spectrograms converted from the reference-voice data of healthy persons and cognitively impaired persons, spectrograms converted from the reference-voice data of the interlocutor, together with the cognitive function levels of those healthy persons and cognitively impaired persons.

本実施の形態において、認知機能の判定は、図７〜図９に例示した、各構成のいずれかを用いてもよく、各構成を組み合わせてもよい。例えば、図７と図８の各構成の両方を用いて認知機能の判定を行ってもよく、図７と図９の各構成の両方を用いて認知機能の判定を行ってもよい。構成を組み合わせる場合には、各構成の判定結果を総合的に判定して最終判定とすればよい。 In the present embodiment, the determination of the cognitive function may use any of the configurations illustrated in FIGS. 7 to 9, or may combine the configurations. For example, the cognitive function may be determined using both the configurations of FIGS. 7 and 8, and the cognitive function may be determined using both the configurations of FIGS. 7 and 9. When combining the configurations, the determination results of each configuration may be comprehensively determined and used as the final determination.
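The embodiment states only that the results of the combined configurations are "comprehensively determined". One possible reading, shown here purely as an assumption, is to average the per-level probabilities output by each network and take the most probable level. The function name `combine_levels` and the example probabilities are hypothetical.

```python
def combine_levels(probabilities_per_model):
    """Average per-level probabilities across models; return (level, mean), level is 1-based."""
    n_models = len(probabilities_per_model)
    n_levels = len(probabilities_per_model[0])
    mean = [sum(p[i] for p in probabilities_per_model) / n_models
            for i in range(n_levels)]
    best_index = max(range(n_levels), key=lambda i: mean[i])
    return best_index + 1, mean

# Made-up outputs for m = 3 levels from the three example configurations.
dnn_probs = [0.1, 0.6, 0.3]
cnn_probs = [0.1, 0.3, 0.6]
rnn_probs = [0.0, 0.3, 0.7]
level, mean = combine_levels([dnn_probs, cnn_probs, rnn_probs])
```

Majority voting or a weighted average would fit the same description equally well.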

上述のように、ＤＮＮ５６２、ＲＮＮ５６３、ＣＮＮ５６４は、健常者及び認知機能障害者と対話する対話者の音声が変換された基準音声の音声データを含む学習用データを用いて生成されている。健常者及び認知機能障害者の基準音声だけでなく、健常者及び認知機能障害者と対話する対話者の基準音声も認知機能の判定の要素とすることができる。すなわち、対話者の質問等の発話に対する健常者及び認知機能障害者の回答等の応答を認知機能の判定に用いることができるので、人の問いかけに対して、健常者及び認知機能障害者がどのように反応しているかを学習することでき、ＤＮＮ５６２、ＲＮＮ５６３、ＣＮＮ５６４の認知機能の判定の精度を向上させることができる。 As described above, the DNN562, RNN563, and CNN564 are generated using learning data including the voice data of the reference voice in which the voices of the interlocutors interacting with the healthy person and the cognitively impaired person are converted. Not only the reference voice of the healthy person and the cognitively impaired person, but also the reference voice of the interlocutor who interacts with the healthy person and the cognitively impaired person can be a factor for determining the cognitive function. That is, since the responses such as the answers of the healthy person and the cognitively impaired person to the utterance of the interlocutor's question can be used to judge the cognitive function, which of the healthy person and the cognitively impaired person responds to the human question. It is possible to learn how to react in this way, and it is possible to improve the accuracy of determining the cognitive function of DNN562, RNN563, and CNN564.

本実施の形態において、ＤＮＮ５６２、ＲＮＮ５６３、ＣＮＮ５６４は、自身が判定した対象者の認知機能レベルを、医師が判断した認知機能レベル（修正認知機能レベル）に更新した学習用データを用いて再学習することができる。例えば、ＤＮＮ５６２が、ある対象者の認知機能レベルをレベル「３」と判定したとする。医師が診察によって当該対象者の認知機能レベルをレベル「４」と判定した場合、当該対象者の基準音声と認知機能レベルを「４」に更新した学習用データを用いてＤＮＮ５６２を再学習させることができる。ＲＮＮ５６３、ＣＮＮ５６４についても同様である。これにより、ＤＮＮ５６２、ＲＮＮ５６３、ＣＮＮ５６４の認知機能の判定の精度を向上させることができる。 In the present embodiment, the DNN 562, RNN 563, and CNN 564 can be retrained using learning data in which the cognitive function level that the network itself determined for a subject has been updated to the cognitive function level determined by a doctor (the corrected cognitive function level). For example, suppose that the DNN 562 determines the cognitive function level of a certain subject to be level "3". If a doctor examines the subject and determines the cognitive function level to be level "4", the DNN 562 can be retrained using learning data that consists of the subject's reference voice and the cognitive function level updated to "4". The same applies to the RNN 563 and CNN 564. As a result, the accuracy with which the DNN 562, RNN 563, and CNN 564 determine cognitive function can be improved.

図１０は認知機能判定システムの処理手順の一例を示すフローチャートである。端末装置１０は、対話音声を取得し（Ｓ１１）、取得した対話音声を認知機能判定装置５０へ送信する（Ｓ１２）。認知機能判定装置５０は、対話音声を受信し（Ｓ１３）、対象者の音声と対話者の音声とを識別する（Ｓ１４）。認知機能判定装置５０は、対話者の音声及び対象者の音声を基準音声に変換し（Ｓ１５）、変換した基準音声に基づいて対象者の認知機能レベルを判定する（Ｓ１６）。認知機能判定装置５０は、判定結果を端末装置１０へ送信し（Ｓ１７）、後述のステップＳ１９の処理を行う。 FIG. 10 is a flowchart showing an example of the processing procedure of the cognitive function determination system. The terminal device 10 acquires the dialogue voice (S11) and transmits the acquired dialogue voice to the cognitive function determination device 50 (S12). The cognitive function determination device 50 receives the dialogue voice (S13) and discriminates between the voice of the subject and the voice of the dialogue person (S14). The cognitive function determination device 50 converts the voice of the interlocutor and the voice of the subject into a reference voice (S15), and determines the cognitive function level of the subject based on the converted reference voice (S16). The cognitive function determination device 50 transmits the determination result to the terminal device 10 (S17), and performs the process of step S19 described later.

端末装置１０は、判定結果を受信して出力し（Ｓ１８）、処理を終了する。認知機能判定装置５０は、基準音声に基づいて判定した認知機能レベルに対する医師の修正認知機能レベルを取得したか否かを判定し（Ｓ１９）、修正認知機能レベルを取得した場合（Ｓ１９でＹＥＳ）、当該基準音声及び修正認知機能レベルを再学習データとして記憶部５４に記憶し（Ｓ２０）、処理を終了する。認知機能判定装置５０は、修正認知機能レベルを取得していない場合（Ｓ１９でＮＯ）、処理を終了する。 The terminal device 10 receives the determination result and outputs it (S18), and ends the process. The cognitive function determination device 50 determines whether or not the doctor's modified cognitive function level has been acquired for the cognitive function level determined based on the reference voice (S19), and when the modified cognitive function level is acquired (YES in S19). , The reference voice and the modified cognitive function level are stored in the storage unit 54 as re-learning data (S20), and the process is terminated. When the cognitive function determination device 50 has not acquired the modified cognitive function level (NO in S19), the cognitive function determination device 50 ends the process.
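Steps S19 and S20 above amount to a conditional append to a store of re-learning data. The sketch below uses an in-memory list as a hypothetical stand-in for the storage unit 54; the names `RelearningStore` and `handle_correction`, and the use of a string identifier in place of the reference voice itself, are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class RelearningStore:
    """Hypothetical stand-in for the re-learning data held in the storage unit 54."""
    records: List[Tuple[str, int]] = field(default_factory=list)

def handle_correction(store: RelearningStore,
                      reference_voice_id: str,
                      corrected_level: Optional[int]) -> bool:
    """S19: was a corrected level acquired? S20: if so, store it with the reference voice."""
    if corrected_level is None:   # NO in S19: nothing to store
        return False
    store.records.append((reference_voice_id, corrected_level))   # S20
    return True

store = RelearningStore()
stored = handle_correction(store, "ref-001", 4)      # doctor corrected the level to 4
skipped = handle_correction(store, "ref-002", None)  # no correction was provided
```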

図１１は学習済みモデルの生成方法の一例を示すフローチャートである。認知機能判定装置５０は、音声及び当該音声に対応する基準音声を含む第１訓練データを複数取得し（Ｓ３１）、取得した複数の第１訓練データを用いて第１学習済みモデルを生成する（Ｓ３２）。認知機能判定装置５０は、基準音声及び当該基準音声の話者の認知機能レベルを含む第２訓練データを複数取得し（Ｓ３３）、取得した複数の第２訓練データを用いて第２学習済みモデルを生成し（Ｓ３４）、処理を終了する。 FIG. 11 is a flowchart showing an example of a method of generating a trained model. The cognitive function determination device 50 acquires a plurality of first training data including a voice and a reference voice corresponding to the voice (S31), and generates a first trained model using the acquired plurality of first training data (S31). S32). The cognitive function determination device 50 acquires a plurality of second training data including the reference voice and the cognitive function level of the speaker of the reference voice (S33), and uses the acquired plurality of second training data to obtain a second trained model. Is generated (S34), and the process is terminated.
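The order of steps S31 to S34 can be illustrated with two deliberately simple stand-ins: a least-squares scalar gain in place of the first trained model (voice to reference voice) and a nearest-centroid classifier in place of the second trained model (reference voice to cognitive function level). Both are toy substitutes operating on a single made-up feature value; they are not the neural networks of the embodiment, and all names and numbers are hypothetical.

```python
def train_first_model(pairs):
    """S31-S32: fit a gain that maps a voice feature to its reference-voice feature."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    gain = num / den                      # least-squares solution of y = gain * x
    return lambda x: gain * x

def train_second_model(examples):
    """S33-S34: nearest-centroid mapping from a reference-voice feature to a level."""
    sums, counts = {}, {}
    for feature, level in examples:
        sums[level] = sums.get(level, 0.0) + feature
        counts[level] = counts.get(level, 0) + 1
    centroids = {lvl: sums[lvl] / counts[lvl] for lvl in sums}
    return lambda f: min(centroids, key=lambda lvl: abs(centroids[lvl] - f))

# First training data: (voice feature, corresponding reference-voice feature).
first_training_data = [(100.0, 120.0), (200.0, 240.0), (150.0, 180.0)]
to_reference = train_first_model(first_training_data)

# Second training data: (reference-voice feature, cognitive function level of the speaker).
second_training_data = [(110.0, 1), (130.0, 1), (230.0, 3), (250.0, 3)]
determine_level = train_second_model(second_training_data)

# Inference chains the two models: convert first, then determine the level.
level = determine_level(to_reference(195.0))
```

The point of the sketch is the data flow: the second model is trained and applied only on reference-voice data, which is what lets one model serve subjects with different attributes.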

認知機能判定装置５０は、ＣＰＵ（プロセッサ）、ＧＰＵ、ＲＡＭ（メモリ）などを備えた汎用コンピュータを用いて実現することもできる。すなわち、図１０及び図１１に示すような、各処理の手順を定めたコンピュータプログラムをコンピュータに備えられたＲＡＭ（メモリ）にロードし、コンピュータプログラムをＣＰＵ（プロセッサ）で実行することにより、コンピュータ上で認知機能判定装置５０を実現することができる。コンピュータプログラムは記録媒体に記録され流通されてもよい。学習済のＤＮＮ５６２、ＲＮＮ５６３、ＣＮＮ５６４は、それぞれ学習処理部を備える他のサーバ等で生成して、認知機能判定装置５０にダウンロードしてもよい。 The cognitive function determination device 50 can also be realized by using a general-purpose computer including a CPU (processor), a GPU, a RAM (memory), and the like. That is, as shown in FIGS. 10 and 11, a computer program that defines the procedure for each process is loaded into a RAM (memory) provided in the computer, and the computer program is executed by the CPU (processor) on the computer. The cognitive function determination device 50 can be realized. The computer program may be recorded and distributed on a recording medium. The learned DNN562, RNN563, and CNN564 may be generated by another server or the like having a learning processing unit, and downloaded to the cognitive function determination device 50.

また、本実施の形態の認知機能判定装置５０をロボットやスマートスピーカに組み込むことができる。ロボットやスマートスピーカは、対象者と対話することにより、対象者の音声を取得し、基準音声に変換して認知機能レベルを判定することができる。この場合、ロボットやスマートスピーカの発話は、例えば、聞き取りにくい話し方と聞き取りやすい話し方の両方の音声を出力して対象者の反応を取得することができる。判定結果は、対象者の携帯端末（例えば、スマートフォン、タブレット）に出力してもよく、音声で判定結果を通知してもよい。このようなロボットは、病院、診療所、役所、店舗などに設置することができる。また、スマートスピーカは、対象者や家族の自宅に設置することにより、例えば、見守りサービスを実現できる。 Further, the cognitive function determination device 50 of the present embodiment can be incorporated into a robot or a smart speaker. By interacting with the subject, the robot or smart speaker can acquire the subject's voice and convert it into a reference voice to determine the cognitive function level. In this case, the utterance of the robot or the smart speaker can acquire the reaction of the target person by outputting the voices of both the difficult-to-hear and easy-to-hear speeches, for example. The determination result may be output to the target person's mobile terminal (for example, a smartphone or tablet), or the determination result may be notified by voice. Such robots can be installed in hospitals, clinics, government offices, stores, and the like. Further, by installing the smart speaker at the home of the target person or family, for example, a watching service can be realized.

また、本実施の形態の認知機能判定装置５０を、スマートフォン、タブレット、パーソナルコンピュータ、カメラ等に組み込み、対象者が電話やＴＶ電話を行う際に、音声を取得し、認知機能レベルを判定することができる。判定結果は、スマートフォン、タブレット、パーソナルコンピュータ、カメラに記録され、必要に応じて、あるいは定期的に表示又は出力するようにしてもよい。これにより、対象者は、自身の認知機能レベルの履歴をいつでも確認することができる。 Further, the cognitive function determination device 50 of the present embodiment is incorporated into a smartphone, tablet, personal computer, camera, or the like, and when the subject makes a telephone call or a videophone call, the voice is acquired and the cognitive function level is determined. Can be done. The determination result may be recorded in a smartphone, tablet, personal computer, or camera, and may be displayed or output as needed or periodically. As a result, the subject can check the history of his / her cognitive function level at any time.

本実施の形態の認知機能判定装置は、対象者の音声を取得する取得部と、前記取得部で取得した音声を基準音声に変換する変換部と、前記変換部で変換した基準音声に基づいて前記対象者の認知機能を判定する判定部とを備える。 The cognitive function determination device of the present embodiment is based on an acquisition unit that acquires the voice of the target person, a conversion unit that converts the voice acquired by the acquisition unit into a reference voice, and a reference voice converted by the conversion unit. It is provided with a determination unit for determining the cognitive function of the subject.

本実施の形態の認知機能判定システムは、対象者の音声を取得する取得部と、前記取得部で取得した音声を基準音声に変換する変換部と、前記変換部で変換した基準音声に基づいて前記対象者の認知機能を判定する判定部とを備える。 The cognitive function determination system of the present embodiment is based on an acquisition unit that acquires the voice of the target person, a conversion unit that converts the voice acquired by the acquisition unit into a reference voice, and a reference voice converted by the conversion unit. It is provided with a determination unit for determining the cognitive function of the subject.

本実施の形態のコンピュータプログラムは、コンピュータに、対象者の音声を取得する処理と、取得した音声を基準音声に変換する処理と、変換した基準音声に基づいて前記対象者の認知機能を判定する処理とを実行させる。 The computer program of the present embodiment causes a computer to execute a process of acquiring the voice of a subject, a process of converting the acquired voice into a reference voice, and a process of determining the cognitive function of the subject based on the converted reference voice.

取得部は、対象者の音声を取得する。対象者の音声は、対象者と対話する対話者との対話音声とすることができる。変換部は、取得した音声を基準音声に変換する。基準音声は、例えば、年齢、性別、体格等の様々な属性が所定の属性の音声とすることができる。所定の属性は、例えば、５０歳の男性で標準的な体格（例えば、身長が１７０ｃｍ、体重が７０ｋｇなど）とすることができる。変換部は、対象者の音声に含まれる音韻情報を保持したまま声質を変換することができる。判定部は、変換した基準音声に基づいて対象者の認知機能を判定する。認知機能の判定には、例えば、基準音声の音声特徴量（例えば、音声の高さに関連するピッチ、母音や子音の特徴に関連するフォルマント周波数、声道特性に関連するメル周波数スペクトラム係数（ＭＦＣＣ）など）に基づいて行うことができる。認知機能の判定には、例えば、ルールベース、機械学習の一手法であるサポートベクターマシン（ＳＶＭ）、ニューラルネットワークなどの学習モデルを用いることができる。 The acquisition unit acquires the voice of the subject. The voice of the subject can be the voice of a dialogue between the subject and an interlocutor. The conversion unit converts the acquired voice into a reference voice. The reference voice can be, for example, a voice in which attributes such as age, gender, and physique are set to predetermined attributes. The predetermined attributes can be, for example, those of a 50-year-old man of standard physique (for example, a height of 170 cm and a weight of 70 kg). The conversion unit can convert the voice quality while retaining the phonological information contained in the subject's voice. The determination unit determines the cognitive function of the subject based on the converted reference voice. The determination can be made based on, for example, the voice feature amounts of the reference voice (for example, pitch, which relates to the height of the voice; formant frequency, which relates to the characteristics of vowels and consonants; and mel frequency spectrum coefficients (MFCC), which relate to vocal tract characteristics). For the determination of the cognitive function, for example, a rule base or a learning model such as a support vector machine (SVM), which is one machine learning technique, or a neural network can be used.

本実施の形態の認知機能判定装置において、前記取得部は、前記対象者と対話する対話者の音声を取得し、前記変換部は、前記対話者の音声を基準音声に変換し、前記判定部は、前記対話者の変換された基準音声に基づいて前記対象者の認知機能を判定する。 In the cognitive function determination device of the present embodiment, the acquisition unit acquires the voice of the interlocutor interacting with the target person, the conversion unit converts the voice of the interlocutor into the reference voice, and the determination unit. Determines the cognitive function of the subject based on the converted reference voice of the interlocutor.

取得部は、対象者と対話する対話者の音声を取得し、変換部は、対話者の音声を基準音声に変換する。判定部は、対話者の変換された基準音声に基づいて対象者の認知機能を判定する。対象者の基準音声だけでなく、対象者と対話する対話者の基準音声も認知機能の判定の要素とすることができる。すなわち、対話者の質問等の発話に対する対象者の回答等の応答を、対象者の認知機能の判定に用いることができるので、人の問いかけに対して、対象者がどのように反応しているかを判断材料とすることができ、認知機能の判定の精度を向上させることができる。 The acquisition unit acquires the voice of the interlocutor who interacts with the target person, and the conversion unit converts the voice of the interlocutor into the reference voice. The determination unit determines the cognitive function of the subject based on the converted reference voice of the interlocutor. Not only the reference voice of the subject, but also the reference voice of the interlocutor who interacts with the subject can be a factor for determining the cognitive function. That is, since the response such as the answer of the subject to the utterance of the interlocutor's question can be used to judge the cognitive function of the subject, how the subject reacts to the question of the person. Can be used as a judgment material, and the accuracy of judgment of cognitive function can be improved.

本実施の形態の認知機能判定装置において、前記変換部は、対象者の音声が入力された場合に、基準音声を出力する第１学習済みモデルを含む。 In the cognitive function determination device of the present embodiment, the conversion unit includes a first trained model that outputs a reference voice when the voice of the target person is input.

本実施の形態の認知機能判定装置は、前記判定部は、前記基準音声が入力された場合に、対象者の認知機能レベルを出力する第２学習済みモデルを含む。 In the cognitive function determination device of the present embodiment, the determination unit includes a second learned model that outputs the cognitive function level of the subject when the reference voice is input.

本実施の形態の認知機能判定装置において、前記第２学習済みモデルは、対象者及び対話者の対話を前記第１学習済みモデルに入力して出力された前記基準音声が入力された場合に、前記対象者の認知機能レベルを出力する。 In the cognitive function determination device of the present embodiment, the second trained model inputs the dialogue between the target person and the interlocutor into the first trained model, and when the output reference voice is input, the second trained model is input. The cognitive function level of the subject is output.

第２学習済みモデルは、対象者及び対話者の対話を第１学習済みモデルに入力して出力された対象者及び対話者の基準音声が入力された場合に、対象者の認知機能レベルを出力することができる。第２学習済みモデルは、健常者及び認知機能障害者と対話する対話者の音声が変換された基準音声の音声データを含む学習用データを用いて生成されている。健常者及び認知機能障害者の基準音声だけでなく、健常者及び認知機能障害者と対話する対話者の基準音声も認知機能の判定の要素とすることができる。すなわち、対話者の質問等の発話に対する健常者及び認知機能障害者の回答等の応答を認知機能の判定に用いることができるので、人の問いかけに対して、健常者及び認知機能障害者がどのように反応しているかを学習することでき、第２学習済みモデルの認知機能の判定の精度を向上させることができる。 The second trained model outputs the cognitive function level of the subject when the dialogue between the subject and the interlocutor is input to the first trained model and the reference voice of the subject and the interlocutor is input. can do. The second trained model is generated using learning data including voice data of a reference voice to which the voice of an interlocutor interacting with a healthy person and a cognitively impaired person is converted. Not only the reference voice of the healthy person and the cognitively impaired person, but also the reference voice of the interlocutor who interacts with the healthy person and the cognitively impaired person can be a factor for determining the cognitive function. That is, since the responses such as the answers of the healthy person and the cognitively impaired person to the utterance of the interlocutor's question can be used to judge the cognitive function, which of the healthy person and the cognitively impaired person responds to the human question. It is possible to learn whether or not the reaction is occurring, and it is possible to improve the accuracy of determining the cognitive function of the second trained model.

本実施の形態の認知機能判定装置において、前記第２学習済みモデルは、前記基準音声の音声データから抽出される、ピッチ、フォルマント周波数及びメル周波数スペクトラム係数の少なくとも一つを含む音声特徴量が入力された場合に、前記対象者の認知機能レベルを出力する。 In the cognitive function determination device of the present embodiment, the second trained model is input with a voice feature amount including at least one of pitch, formant frequency, and mel frequency spectrum coefficient extracted from the voice data of the reference voice. When this is done, the cognitive function level of the subject is output.

第２学習済みモデルは、基準音声の音声データから抽出される、ピッチ、フォルマント周波数及びメル周波数スペクトラム係数の少なくとも一つを含む音声特徴量が入力された場合に、対象者の認知機能レベルを出力することができる。第２学習済みモデルは、基準音声の音声データから抽出される、ピッチ、フォルマント周波数及びメル周波数スペクトラム係数（ＭＦＣＣ）の少なくとも一つを含む音声特徴量を含む学習用データを用いて生成されている。認知機能障害を特定するには、音声の３つの要素（韻律、音質及び音韻）のうち、特に韻律が重要な非言語情報であることが知られている。そこで、韻律を特徴付ける音声特徴量として、ピッチ、フォルマント周波数及びメル周波数スペクトラム係数を用いて第２学習済みモデルを生成する。例えば、第２学習済みモデルが出力する認知機能レベルを、１〜５の５段階とする。認知機能レベルが予め「３」であると分かっている音声データから抽出されるピッチ、フォルマント周波数及びメル周波数スペクトラム係数を学習用データとして第２学習モデルに与えるとともに教師ラベルとして認知機能レベル「３」を第２学習モデルに与える。他の認知機能レベルについても同様である。 The second trained model can output the cognitive function level of the subject when a voice feature amount including at least one of pitch, formant frequency, and mel frequency spectrum coefficients, extracted from the voice data of the reference voice, is input. The second trained model is generated using learning data that includes voice feature amounts including at least one of pitch, formant frequency, and mel frequency spectrum coefficients (MFCC), extracted from the voice data of the reference voice. Of the three elements of speech (prosody, sound quality, and phonology), prosody is known to be particularly important non-verbal information for identifying cognitive dysfunction. Therefore, the second trained model is generated using pitch, formant frequency, and mel frequency spectrum coefficients as the voice feature amounts that characterize prosody. For example, suppose the cognitive function level output by the second trained model has five levels, from 1 to 5. The pitch, formant frequency, and mel frequency spectrum coefficients extracted from voice data whose cognitive function level is known in advance to be "3" are given to the second learning model as learning data, and the cognitive function level "3" is given to the second learning model as the teacher label. The same applies to the other cognitive function levels.

In the cognitive function determination device of the present embodiment, the second trained model outputs the cognitive function level of the subject when the response time of the subject's reply to an utterance of the interlocutor is additionally input.

The second trained model can output the cognitive function level of the subject when the response time of the subject's reply to an utterance of the interlocutor is additionally input. The second trained model is generated using training data that include the response times of healthy persons and persons with cognitive impairment to the utterances of an interlocutor who converses with them. The response time can be defined as the delay from the end of the interlocutor's utterance to the start of the reply by the healthy person or the person with cognitive impairment. Because response time is thought to grow longer as cognitive function declines, including it in the training data can improve the accuracy with which the second trained model determines cognitive function.
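The definition of response time given above (end of the interlocutor's utterance to start of the subject's reply) can be sketched as follows. This is an illustrative sketch under assumptions: the `Utterance` record, its speaker labels, and the `response_times` function are hypothetical names, and utterance boundaries are assumed to be available in seconds.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Utterance:
    speaker: str    # "interlocutor" or "subject" (hypothetical labels)
    start_s: float  # utterance start time in seconds
    end_s: float    # utterance end time in seconds


def response_times(utterances: Iterable[Utterance]) -> List[float]:
    """Return the delay between each interlocutor utterance and the subject's next reply."""
    delays: List[float] = []
    prev = None
    for utt in sorted(utterances, key=lambda u: u.start_s):
        # Only an interlocutor utterance immediately followed by a subject utterance counts.
        if prev is not None and prev.speaker == "interlocutor" and utt.speaker == "subject":
            # Overlapping speech is clamped to a delay of zero.
            delays.append(max(0.0, utt.start_s - prev.end_s))
        prev = utt
    return delays


dialogue = [
    Utterance("interlocutor", 0.0, 2.0),
    Utterance("subject", 3.5, 5.0),
    Utterance("interlocutor", 6.0, 7.0),
    Utterance("subject", 7.25, 9.0),
]
delays = response_times(dialogue)  # [1.5, 0.25]
```

The resulting delays could be appended to the speech feature vector, or summarized (for example as a mean) before being supplied to the model; which form is used is a design choice the text leaves open.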

The cognitive function determination device of the present embodiment acquires a doctor's corrected cognitive function level for the cognitive function level that the second trained model output in response to input of the reference voice, and stores the reference voice and the corrected cognitive function level as retraining data for the second trained model.

The device acquires a doctor's corrected cognitive function level for the cognitive function level that the second trained model output in response to input of the reference voice, and stores the reference voice and the corrected cognitive function level as retraining data for the second trained model. The second trained model can be retrained when a cognitive function level updated according to the doctor's judgment is input for a person it has assessed. That is, the second trained model can be retrained using training data in which the cognitive function level determined for the subject has been replaced by the level determined by the doctor. For example, suppose the second trained model determines the cognitive function level of a certain subject to be level "3". If a doctor examines the subject and determines the level to be "4", the second trained model can be retrained using training data consisting of the subject's reference voice and the cognitive function level updated to "4". This can improve the accuracy with which the second trained model determines cognitive function.
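Storing the reference voice together with the doctor's corrected level can be sketched as follows. This is a minimal sketch, not the storage scheme of the embodiment: the `RetrainingStore` class and its method are hypothetical, the reference voice is represented as a plain sequence of numbers, and recording an entry even when the doctor agrees with the model is an assumption made here for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class RetrainingStore:
    """Hypothetical store of (reference voice, corrected level) pairs for retraining."""
    records: List[Tuple[Tuple[float, ...], int]] = field(default_factory=list)

    def add_correction(self, reference_voice: Sequence[float],
                       predicted_level: int, doctor_level: int) -> bool:
        """Record the doctor's level as the new label; return True if it differs from the model's."""
        self.records.append((tuple(reference_voice), doctor_level))
        return doctor_level != predicted_level


store = RetrainingStore()
# The model output level 3; the doctor's examination found level 4.
changed = store.add_correction([0.1, 0.2], predicted_level=3, doctor_level=4)
```

The accumulated `records` would later be fed to the same training procedure that produced the second trained model.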

The cognitive function determination device of the present embodiment includes an identification unit that distinguishes between the voice of the subject and the voice of the interlocutor who converses with the subject.

The identification unit distinguishes the voice of the subject from the voice of the interlocutor within the dialogue voice. Several methods can be used for this: storing voice data of the subject and the interlocutor in advance and matching the dialogue voice against the stored data; having the interlocutor operate a button or the like when the subject speaks or when the interlocutor speaks; or exploiting the directivity of the microphone. In this way, the voice of the subject and the voice of the interlocutor can be told apart.

In the cognitive function determination device of the present embodiment, the acquisition unit acquires an identification flag that distinguishes the subject from the interlocutor.

The acquisition unit acquires an identification flag that distinguishes the subject from the interlocutor. The identification flag can consist of a subject flag and an interlocutor flag. This makes it possible to tell whether a given piece of audio contains only the subject's voice, only the interlocutor's voice, or the voices of both.
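One way to represent a subject flag and an interlocutor flag so that all three cases (subject only, interlocutor only, both) can be distinguished is a bit flag, sketched below. The `SpeakerFlag` type and the `describe` function are hypothetical and are not taken from the embodiment.

```python
from enum import Flag


class SpeakerFlag(Flag):
    """Hypothetical identification flags; the two bits can be combined."""
    SUBJECT = 1
    INTERLOCUTOR = 2


def describe(flags: SpeakerFlag) -> str:
    """Classify a piece of audio by which speaker flags are set."""
    if flags == SpeakerFlag.SUBJECT | SpeakerFlag.INTERLOCUTOR:
        return "both"
    if flags == SpeakerFlag.SUBJECT:
        return "subject only"
    if flags == SpeakerFlag.INTERLOCUTOR:
        return "interlocutor only"
    return "unlabeled"  # neither flag set


label = describe(SpeakerFlag.SUBJECT | SpeakerFlag.INTERLOCUTOR)
```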

In the cognitive function determination device of the present embodiment, the reference voice is a voice of a person whose attributes, including age, gender, and physique, are predetermined.

The cognitive function determination device of the present embodiment includes a plurality of the conversion units, one for each set of personal attributes including age, gender, and physique.

A plurality of conversion units can be provided, one for each set of attributes including age, gender, and physique (for example, height and weight). This allows the voices of a wide range of subjects, from young people to the elderly and regardless of gender or physique, to be converted into the reference voice more accurately.
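Selecting one of several conversion units according to the subject's attributes can be sketched as a lookup keyed by attribute group. Everything here is an assumption made for illustration: the age bands, the use of only age and gender as the key, the `select_converter` function, and the placeholder converters (which merely scale samples and do not perform real voice conversion).

```python
from typing import Callable, Dict, List, Tuple

Voice = List[float]                    # placeholder representation of a voice signal
Converter = Callable[[Voice], Voice]   # a conversion unit: voice -> reference voice


def age_band(age: int) -> str:
    """Map an age to a coarse band (the band boundaries are arbitrary assumptions)."""
    if age < 40:
        return "young"
    if age < 65:
        return "middle"
    return "senior"


def select_converter(converters: Dict[Tuple[str, str], Converter],
                     age: int, gender: str) -> Converter:
    """Pick the conversion unit registered for the subject's attribute group."""
    key = (age_band(age), gender)
    try:
        return converters[key]
    except KeyError:
        raise LookupError(f"no conversion unit registered for {key}") from None


# Placeholder conversion units standing in for trained voice converters.
converters: Dict[Tuple[str, str], Converter] = {
    ("senior", "female"): lambda v: [x * 0.5 for x in v],
    ("young", "male"): lambda v: list(v),
}

convert = select_converter(converters, age=72, gender="female")
reference = convert([0.2, -0.4])
```

Physique (height, weight) could be added to the key in the same way if conversion units are prepared at that granularity.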

1 Communication network
10 Terminal device
11 Microphone
50 Cognitive function determination device
51 Control unit
52 Communication unit
53 Voice identification unit
54 Storage unit
55 Voice conversion unit
551 Parameter extraction unit
552 Parameter conversion unit
552a Conversion table
552b Neural network
553 Voice synthesis unit
56 Cognitive function determination unit
561 Speech feature quantity extraction unit
562 DNN
563 RNN
564 CNN
565 FFT conversion unit
57 Learning processing unit

The present invention relates to a cognitive function determination device, a cognitive function determination system, and a computer program.

The present invention has been made in view of these circumstances, and its object is to provide a cognitive function determination device, a cognitive function determination system, and a computer program that can determine cognitive function regardless of the attributes of the subject.

Claims

A cognitive function determination device comprising:
an acquisition unit that acquires the voice of a subject;
a conversion unit that converts the voice acquired by the acquisition unit into a reference voice; and
a determination unit that determines the cognitive function of the subject based on the reference voice converted by the conversion unit.

The cognitive function determination device according to claim 1, wherein
the acquisition unit acquires the voice of an interlocutor who converses with the subject,
the conversion unit converts the voice of the interlocutor into a reference voice, and
the determination unit determines the cognitive function of the subject based on the converted reference voice of the interlocutor.

The cognitive function determination device according to claim 1 or claim 2, wherein the conversion unit includes a first trained model that outputs a reference voice when the voice of the subject is input.

The cognitive function determination device according to claim 3, wherein the determination unit includes a second trained model that outputs the cognitive function level of the subject when the reference voice is input.

The cognitive function determination device according to claim 4, wherein the second trained model outputs the cognitive function level of the subject when it receives the reference voice that the first trained model outputs in response to input of the dialogue voice between the subject and the interlocutor.

The cognitive function determination device according to claim 4 or claim 5, wherein the second trained model outputs the cognitive function level of the subject when a speech feature quantity that is extracted from the voice data of the reference voice and includes at least one of pitch, formant frequency, and mel frequency spectrum coefficients is input.

The cognitive function determination device according to any one of claims 4 to 6, wherein the second trained model outputs the cognitive function level of the subject when the response time of the subject's reply to an utterance of the interlocutor is additionally input.

The cognitive function determination device according to any one of claims 4 to 7, which acquires a doctor's corrected cognitive function level for the cognitive function level that the second trained model output in response to input of the reference voice, and stores the reference voice and the corrected cognitive function level as retraining data for the second trained model.

The cognitive function determination device according to any one of claims 1 to 8, further comprising an identification unit that distinguishes between the voice of the subject and the voice of the interlocutor who converses with the subject.

The cognitive function determination device according to claim 9, wherein the acquisition unit acquires an identification flag that distinguishes the subject from the interlocutor.

The cognitive function determination device according to any one of claims 1 to 10, wherein the reference voice is a voice of a person whose attributes, including age, gender, and physique, are predetermined.

The cognitive function determination device according to any one of claims 1 to 11, comprising a plurality of the conversion units, one for each set of personal attributes including age, gender, and physique.

A cognitive function determination system comprising:
an acquisition unit that acquires the voice of a subject;
a conversion unit that converts the voice acquired by the acquisition unit into a reference voice; and
a determination unit that determines the cognitive function of the subject based on the reference voice converted by the conversion unit.

A computer program that causes a computer to execute:
a process of acquiring the voice of a subject;
a process of converting the acquired voice into a reference voice; and
a process of determining the cognitive function of the subject based on the converted reference voice.

A cognitive function determination method comprising:
acquiring the voice of a subject;
converting the acquired voice into a reference voice; and
determining the cognitive function of the subject based on the converted reference voice.