WO2024116254A1 - Information processing device, information processing method, information processing system, and information processing program - Google Patents
Information processing device, information processing method, information processing system, and information processing program Download PDFInfo
- Publication number
- WO2024116254A1 (PCT/JP2022/043832)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- information processing
- user
- time
- voice data
- Prior art date
Links
- 230000010365 information processing Effects 0.000 title claims abstract description 88
- 238000003672 processing method Methods 0.000 title claims description 5
- 201000010099 disease Diseases 0.000 claims abstract description 94
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 claims abstract description 94
- 238000012545 processing Methods 0.000 claims abstract description 74
- 208000024891 symptom Diseases 0.000 claims abstract description 42
- 238000000034 method Methods 0.000 claims description 54
- 238000005070 sampling Methods 0.000 claims description 24
- 238000004364 calculation method Methods 0.000 claims description 14
- 238000004891 communication Methods 0.000 claims description 2
- 238000010586 diagram Methods 0.000 description 23
- 238000007781 pre-processing Methods 0.000 description 15
- 238000013500 data storage Methods 0.000 description 12
- 208000020016 psychiatric disease Diseases 0.000 description 12
- 238000003860 storage Methods 0.000 description 11
- 230000008602 contraction Effects 0.000 description 9
- 208000012902 Nervous system disease Diseases 0.000 description 8
- 208000018737 Parkinson disease Diseases 0.000 description 7
- 230000000994 depressogenic effect Effects 0.000 description 7
- 208000024714 major depressive disease Diseases 0.000 description 7
- 208000024827 Alzheimer disease Diseases 0.000 description 6
- 239000000284 extract Substances 0.000 description 6
- 238000010606 normalization Methods 0.000 description 6
- 230000000694 effects Effects 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 5
- 208000028698 Cognitive impairment Diseases 0.000 description 3
- 206010060860 Neurological symptom Diseases 0.000 description 3
- 208000010877 cognitive disease Diseases 0.000 description 3
- 238000009826 distribution Methods 0.000 description 3
- 238000011156 evaluation Methods 0.000 description 3
- 238000000692 Student's t-test Methods 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 239000004065 semiconductor Substances 0.000 description 2
- 238000012353 t test Methods 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 208000025966 Neurological disease Diseases 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 208000023504 respiratory system disease Diseases 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 230000001052 transient effect Effects 0.000 description 1
Classifications
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B10/00—Other methods or instruments for diagnosis, e.g. instruments for taking a cell sample, for biopsy, for vaccination diagnosis; Sex determination; Ovulation-period determination; Throat striking implements
Definitions
- the disclosed technology relates to an information processing device, an information processing method, an information processing system, and an information processing program.
- WO 2020/013296 discloses a device for predicting whether a user has a psychiatric or neurological disorder. This device calculates various acoustic parameters from the user's voice data and uses these acoustic parameters to predict whether the user has a psychiatric or neurological disorder.
- the device disclosed in International Publication No. 2020/013296 estimates diseases using acoustic parameters calculated from voice data, but there is room for improvement in terms of accuracy.
- the disclosed technology has been made in consideration of the above circumstances, and provides an information processing device, information processing method, information processing system, and information processing program that can accurately estimate whether a user has a specified disease or symptom by applying a dynamic time warping method to voice data, which is time-series data of voice uttered by a user.
- a first aspect of the present disclosure is an information processing device including an acquisition unit that acquires voice data, which is time-series data of voice uttered by a user; a processing unit that generates preprocessed voice data representing data from the voice data acquired by the acquisition unit that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data; a generation unit that generates processing result data by applying dynamic time warping to the preprocessed voice data generated by the processing unit; a calculation unit that calculates a score representing the degree to which the user has a predetermined disease or symptom based on the processing result data generated by the generation unit; and an estimation unit that estimates whether or not the user has the predetermined disease or symptom based on the score calculated by the calculation unit.
- a second aspect of the present disclosure is an information processing method that causes a computer to execute the following processes: acquire voice data that is time-series data of voice uttered by a user; generate preprocessed voice data representing data from the acquired voice data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data; generate processing result data by applying dynamic time warping to the generated preprocessed voice data; calculate a score representing the degree to which the user has a specified disease or symptom based on the generated processing result data; and estimate whether or not the user has the specified disease or symptom based on the calculated score.
- a third aspect of the present disclosure is an information processing program for causing a computer to execute a process of acquiring voice data, which is time-series data of a voice uttered by a user, generating preprocessed voice data representing data from the acquired voice data that is a first time or later after a start point of the voice data and a second time or earlier before an end point of the voice data, generating processing result data by applying dynamic time warping to the generated preprocessed voice data, calculating a score representing the degree to which the user has a predetermined disease or symptom based on the generated processing result data, and estimating whether or not the user has the predetermined disease or symptom based on the calculated score.
- the disclosed technology has the effect of being able to accurately estimate whether a user has a specific disease or specific symptoms by applying a dynamic time warping method to voice data, which is time-series data of the voice uttered by the user.
- FIG. 1 is a diagram illustrating an example of a schematic configuration of an information processing system according to a first embodiment.
- FIG. 2 is a diagram for explaining an overview of the first embodiment.
- FIG. 3 is a diagram illustrating a schematic diagram of audio data for a predetermined period.
- FIG. 4 is a diagram for explaining a shift process for audio data.
- FIG. 5 is a diagram for explaining a sampling process for audio data.
- FIG. 6 is a diagram illustrating an example of a usage form of the information processing system according to the first embodiment.
- FIG. 7 illustrates an example of a computer constituting an information processing device.
- FIG. 8 is a diagram illustrating an example of a process executed by the information processing apparatus of the first embodiment.
- FIG. 9 is a diagram for explaining an overview of a second embodiment.
- FIG. 10 is a diagram illustrating an example of a usage form of an information processing system according to the second embodiment.
- FIG. 11 is a diagram illustrating an example of a usage form of the information processing system according to the second embodiment.
- FIG. 12 is a diagram showing experimental results according to an embodiment.
- FIG. 13 is a diagram showing experimental results according to an embodiment.
- FIG. 14 is a diagram showing experimental results according to an embodiment.
- FIG. 15 is a diagram showing experimental results according to an embodiment.
- FIG. 16 is a diagram showing experimental results according to an embodiment.
- FIG. 17 is a diagram showing experimental results according to an embodiment.
- FIG. 18 is a diagram showing experimental results according to an embodiment.
- FIG. 19 is a diagram showing experimental results according to an embodiment.
- FIG. 20 is a diagram showing experimental results according to an embodiment.
- FIG. 21 is a diagram showing experimental results according to an embodiment.
- FIG. 1 shows an information processing system 10 according to the first embodiment.
- the information processing system 10 according to the first embodiment includes a microphone 12, an information processing device 14, and a display device 16.
- the information processing system 10 estimates whether or not the user has a specified disease or a specified symptom (hereinafter simply referred to as "disease, etc.") based on the user's voice collected by the microphone 12. Note that the information processing system 10 of this embodiment estimates whether or not the user has a psychiatric disease or a neurological disease, or a mental disorder symptom or a cognitive impairment symptom, as an example of a specified disease or a specified symptom.
- the information processing device 14 of the information processing system 10 of the first embodiment performs a predetermined preprocessing on the voice data, which is time-series data of the voice uttered by the user, to generate preprocessed data. Then, the information processing device 14 determines whether or not the user has a disease, etc., based on the result of applying dynamic time warping to the preprocessed data.
- in dynamic time warping, the distance between one time series and another time series is calculated.
- the processing result data obtained by dynamic time warping is used to estimate whether the user has a disease or the like.
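- for reference, the sketch below is a minimal, illustrative implementation of dynamic time warping for two one-dimensional NumPy sequences; the patent does not prescribe any particular DTW implementation or library.

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Return the DTW distance between two 1-D time series x and y."""
    n, m = len(x), len(y)
    # cost[i, j] = minimum cumulative distance aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])              # local distance between samples
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# Example: two similar waveforms that are locally stretched in time
t = np.linspace(0.0, 1.0, 200)
a = np.sin(2 * np.pi * 5 * t)
b = np.sin(2 * np.pi * 5 * t ** 1.1)
print(dtw_distance(a, b))
```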
- the information processing device 14 functionally comprises an acquisition unit 20, a voice data storage unit 22, a reference data storage unit 24, a processing unit 26, a generation unit 28, a calculation unit 30, an estimation unit 32, and an output unit 34.
- the information processing device 14 is realized by a computer as described below.
- the acquisition unit 20 acquires voice data, which is time-series data of the voice uttered by the user.
- the acquisition unit 20 then stores the voice data in the voice data storage unit 22.
- the voice data storage unit 22 stores the voice data acquired by the acquisition unit 20.
- the reference data storage unit 24 stores voice data of reference users for whom it is known whether or not they have a disease, etc.
- the processing unit 26 reads out the voice data stored in the voice data storage unit 22. The processing unit 26 then performs a predetermined preprocessing on the voice data to generate preprocessed voice data. The method for generating the preprocessed voice data will be described in detail below.
- Figure 2 shows a diagram for explaining the preprocessed voice data.
- when estimating whether a user has a disease or the like based on voice data uttered by the user, it is preferable to use voice data in which the user's speech is stable.
- in this regard, the initial part of the time-series data represented by the voice data is data from the time when the user started speaking, so it is often not desirable to use that part to estimate diseases, etc. For example, if a user suddenly starts speaking after not making any sound, it is expected that the user's voice will become hoarse or the volume will become low because the user's speech is unstable, and accurate results cannot be expected even if such data is used. Furthermore, it is often also not desirable to use the part of the time-series data near the end point; for example, when a user utters a long sound, the user may run out of breath and be unable to keep the voice going, or the end of the utterance may become unclear.
- the processing unit 26 of the information processing device 14 in this embodiment therefore extracts central data from the audio data, which is time-series data.
- specifically, as shown in FIG. 2, the processing unit 26 generates data D2 representing data from the voice data D1 that is a first time T1 or later after the start point of the voice data D1 and a second time T2 or earlier before the end point of the voice data D1.
- the data D2 is data that corresponds to a time period T3 in the voice data D1. This generates data for the central portion where the user's speech is stable.
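- as an illustrative sketch (not part of the patent), the central-portion extraction described above could be written as follows, assuming a NumPy waveform and example values for T1 and T2.

```python
import numpy as np

def extract_center(voice: np.ndarray, sr: int, t1_sec: float, t2_sec: float) -> np.ndarray:
    """Keep samples at least t1_sec after the start and at least t2_sec before the end."""
    start = int(t1_sec * sr)
    stop = len(voice) - int(t2_sec * sr)
    return voice[start:stop]

sr = 16_000                               # assumed sampling rate
voice = np.random.randn(5 * sr)           # stand-in for 5 s of recorded speech (D1)
center = extract_center(voice, sr, t1_sec=1.0, t2_sec=1.0)   # corresponds to D2 over span T3
print(len(center) / sr, "seconds kept")
```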
- Fig. 3 is a diagram for explaining the predetermined period of data.
- the audio data Df is time-series data, and a predetermined signal is repeated.
- a similar signal waveform is repeated for each time interval T.
- as described below, when estimating whether or not a user has a disease or the like, the voice data uttered by the target user may be compared with the voice data of a reference user for whom it is known whether or not they have the disease or the like. For this reason, it is preferable that the predetermined period of data extracted from the target user's voice data is aligned with the corresponding predetermined period of data in the reference user's voice data. Therefore, for example, the processing unit 26 extracts, from the extracted central portion of data, data for a predetermined period that is the same as the period of the reference user's voice data. This predetermined period is, for example, set in advance. Alternatively, the predetermined period may be changed depending on the type of data.
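- a minimal sketch of cutting out a predetermined number of cycles is shown below; it assumes the repetition period is already known or estimated, since the patent does not prescribe how the period is obtained.

```python
import numpy as np

def extract_cycles(center: np.ndarray, sr: int, period_sec: float, n_cycles: int) -> np.ndarray:
    """Return the first n_cycles whole periods of the signal."""
    samples_per_cycle = int(round(period_sec * sr))
    needed = samples_per_cycle * n_cycles
    if needed > len(center):
        raise ValueError("signal shorter than the requested number of cycles")
    return center[:needed]

sr = 16_000
period_sec = 0.008                        # e.g. a 125 Hz sustained vowel (assumed value)
center = np.random.randn(2 * sr)          # stand-in for the extracted central portion
segment = extract_cycles(center, sr, period_sec, n_cycles=10)
print(segment.shape)
```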
- FIG. 4 shows a diagram for explaining the shift of data in the time axis direction.
- as shown in FIG. 4, consider a case where the voice data Ds contains a signal that repeats with a period Ts and the reference user's voice data D Ref contains a signal that repeats with a period T Ref . In this case, if the start portion P1 of the segment cut out from the voice data Ds and the start portion P2 of the reference user's voice data D Ref are not aligned, the value representing the distance between the two calculated by dynamic time warping may become large even if the voice data Ds and the reference user's voice data D Ref are similar.
- the processing unit 26 shifts the extracted data for a predetermined period in the time axis direction.
- the processing unit 26 shifts the data for a predetermined period shown in FIG. 4 in the time axis direction represented by the arrow S by a predetermined amount.
- the amount of shift for this predetermined amount of time is set in advance.
- the amount of shift may be changed depending on the type of data.
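- the sketch below illustrates one possible way to realize the time-axis shift; the patent only states that the shift amount is set in advance, so choosing it by cross-correlation against the reference data is an assumption of this example.

```python
import numpy as np

def align_by_shift(segment: np.ndarray, reference: np.ndarray, max_shift: int) -> np.ndarray:
    """Drop up to max_shift leading samples so the segment best matches the reference start."""
    window = min(len(reference), len(segment) - max_shift)
    best_shift, best_corr = 0, -np.inf
    for s in range(max_shift + 1):
        corr = float(np.dot(segment[s:s + window], reference[:window]))
        if corr > best_corr:
            best_shift, best_corr = s, corr
    return segment[best_shift:]

sr = 16_000
t = np.arange(sr) / sr
reference = np.sin(2 * np.pi * 125 * t)              # stand-in for the reference user's data
segment = np.sin(2 * np.pi * 125 * (t + 0.003))      # same waveform with a phase offset
aligned = align_by_shift(segment, reference, max_shift=int(0.008 * sr))
print(len(segment) - len(aligned), "leading samples trimmed")
```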
- the processing unit 26 extracts sampling data obtained by sampling from the data shifted in the time axis direction.
- as described above, when estimating whether or not a user has a disease or the like, the voice data uttered by the target user may be compared with the voice data of a reference user for whom it is known whether or not they have the disease or the like. Therefore, it is preferable that the sampling rate for the target user's voice data and the sampling rate for the reference user's voice data are the same.
- for example, consider sampled data D A and sampled data D B obtained by sampling the same audio data D at two different sampling rates.
- if the distance between sampled data D A generated at a sampling rate A and sampled data D B generated at a sampling rate B is calculated using the dynamic time warping method, a certain non-zero distance value is calculated even though the original audio data D is the same.
- the processing unit 26 generates sampling data extracted at the same sampling rate as the sampling rate of the reference user's voice data.
- This sampling rate is set in advance.
- the sampling rate may be changed depending on the type of data. For example, 200 sampling points per cycle are extracted from the data for the predetermined period.
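- as an illustration, resampling to a fixed number of points per cycle could look like the following sketch; the 200 points per cycle follow the example above, while the use of linear interpolation is an assumption.

```python
import numpy as np

def resample_per_cycle(segment: np.ndarray, n_cycles: int, points_per_cycle: int = 200) -> np.ndarray:
    """Resample the segment to n_cycles * points_per_cycle evenly spaced points."""
    target_len = n_cycles * points_per_cycle
    old_x = np.linspace(0.0, 1.0, num=len(segment))
    new_x = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_x, old_x, segment)           # linear interpolation (assumed)

segment = np.random.randn(12_800)                     # stand-in: 10 cycles of raw samples
resampled = resample_per_cycle(segment, n_cycles=10)
print(resampled.shape)                                # (2000,)
```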
- the processing unit 26 performs a time-axis expansion/contraction process on the sampling data obtained by sampling from the voice data.
- the voice data uttered by the target user whose disease or the like is to be estimated may be compared with the voice data of a reference user for whom it is known whether or not they have the disease or the like.
- the processing unit 26 executes a predetermined expansion/contraction process in the time axis direction on the data D3 shown in FIG. 2.
- the method of the predetermined expansion/contraction process is set in advance. Alternatively, for example, the method of expansion/contraction process may be changed depending on the type of data.
- the processing unit 26 performs expansion/contraction processing in the amplitude direction on the data that has been subjected to the expansion/contraction processing in the time axis direction.
- voice data uttered by a user whose disease or the like is to be estimated may be compared with the voice data of a reference user for whom it is known whether or not they have the disease or the like.
- the processing unit 26 executes a predetermined stretching process in the amplitude direction on the data D4 shown in FIG. 2.
- the method of the predetermined stretching process is set in advance. Alternatively, for example, the method of stretching process may be changed depending on the type of data.
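- the sketch below illustrates the two expansion/contraction steps together, stretching the time axis to a common length and normalizing the amplitude; the concrete target length and peak normalization are assumptions, since the patent only says the methods are set in advance.

```python
import numpy as np

def stretch_time(x: np.ndarray, target_len: int) -> np.ndarray:
    """Expand or contract x along the time axis to target_len samples."""
    return np.interp(np.linspace(0.0, 1.0, target_len),
                     np.linspace(0.0, 1.0, len(x)), x)

def normalize_amplitude(x: np.ndarray) -> np.ndarray:
    """Scale x so its maximum absolute amplitude is 1 (peak normalization, assumed)."""
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

d3 = np.random.randn(1_800)                           # stand-in for data D3 after sampling
d5 = normalize_amplitude(stretch_time(d3, target_len=2_000))   # preprocessed voice data D5
print(d5.shape, float(np.max(np.abs(d5))))
```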
- the processing unit 26 generates preprocessed audio data by performing multiple preprocessing processes as described above on the audio data.
- the generating unit 28 generates processing result data by applying dynamic time warping to the preprocessed audio data generated by the processing unit 26.
- the processing result data obtained by applying dynamic time warping is calculated as a distance matrix representing the distance between each point of one time series data and each point of another time series data.
- the generation unit 28 reads out the voice data of the reference user stored in the reference data storage unit 24. Then, as shown in Fig. 2, the generation unit 28 applies a dynamic time warping method to the preprocessed voice data D5 and the voice data D Ref of the reference user to generate processing result data representing the distance between the preprocessed voice data D5 and the voice data D Ref of the reference user. Note that the voice data of the reference user may also be subjected to the preprocessing as described above.
- the generating unit 28 may generate the processing result data using only the preprocessed audio data. For example, the generating unit 28 may apply a dynamic time warping method to first audio data representing data in a first time interval in the preprocessed audio data and second audio data representing data in a second time interval in the preprocessed audio data, thereby generating processing result data representing the distance between the first audio data and the second audio data.
- the generation unit 28 applies a dynamic time warping method to first audio data D5-1 representing data in a first time interval in the preprocessed audio data D5, and second audio data D5-2 representing data in a second time interval in the preprocessed audio data D5, thereby generating processing result data representing the distance between the first audio data D5-1 and the second audio data D5-2.
- the generation unit 28 applies a dynamic time warping method to second audio data D5-2 representing data in a second time interval in the preprocessed audio data D5, and third audio data D5-3 representing data in a third time interval in the preprocessed audio data D5, thereby generating processing result data representing the distance between the second audio data D5-2 and the third audio data D5-3.
- the generation unit 28 generates processing result data representing the distance between the first audio data D5-1 and the third audio data D5-3 by applying the dynamic time warping method to the first audio data D5-1 and the third audio data D5-3. In this way, the generation unit 28 generates processing result data for each pair of audio data D5-1 to D5-9 within a predetermined time period.
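- a minimal sketch of this intra-person pairing is shown below; it reuses the dtw_distance function from the earlier DTW sketch and assumes nine equal-length segments, matching the segments D5-1 to D5-9 described above.

```python
import itertools
import numpy as np

def intra_person_distances(d5: np.ndarray, n_segments: int = 9) -> np.ndarray:
    """Return DTW distances for every pair of segments of the preprocessed data d5."""
    segments = np.array_split(d5, n_segments)          # D5-1 ... D5-9
    return np.array([dtw_distance(a, b)                # dtw_distance from the earlier sketch
                     for a, b in itertools.combinations(segments, 2)])

d5 = np.random.randn(1_800)                            # stand-in for preprocessed voice data
dists = intra_person_distances(d5)
print(dists.shape)                                     # (36,) = 9 choose 2 pairs
```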
- the calculation unit 30, which will be described later, may calculate a score representing the degree to which the user has a disease or the like based on the processing result data generated in this manner using only the preprocessed voice data D5.
- the calculation unit 30 calculates a score representing the degree to which the user has a disease or the like based on the processing result data generated by the generation unit 28. For example, the calculation unit 30 calculates a score representing the degree to which the user has the specified disease or symptom by a known method using the average value, maximum value, minimum value, standard deviation, and median value of the elements of the distance matrix generated by the generation unit 28.
- the estimation unit 32 estimates whether or not the user has a disease, etc., based on the score calculated by the calculation unit 30. For example, if the score is equal to or greater than a predetermined threshold, the estimation unit 32 estimates that the user has a disease, etc., and if the score is less than the predetermined threshold, the estimation unit 32 estimates that the user does not have a disease, etc.
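- as one hypothetical illustration of the score calculation and estimation, the sketch below combines summary statistics of the DTW distances into a score and compares it with a threshold; the specific weights and threshold are assumptions, since the patent only refers to known methods.

```python
import numpy as np

def score_from_distances(distances: np.ndarray) -> float:
    """Combine summary statistics of the DTW distances into a single score."""
    feats = np.array([distances.mean(), distances.max(), distances.min(),
                      distances.std(), np.median(distances)])
    weights = np.array([0.4, 0.2, 0.1, 0.2, 0.1])      # assumed weights for illustration
    return float(feats @ weights)

def estimate(score: float, threshold: float = 1.0) -> bool:
    """Return True if the user is estimated to have the predetermined disease or symptom."""
    return score >= threshold                          # threshold value is assumed

dists = np.abs(np.random.randn(36))                    # stand-in for processing result data
s = score_from_distances(dists)
print(s, estimate(s))
```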
- the output unit 34 outputs the estimation result estimated by the estimation unit 32. Note that the output unit 34 may output the score itself as the estimation result.
- the display device 16 displays the estimation results output from the estimation unit 32.
- the medical professional or user who operates the information processing device 14 checks the estimation results output from the display device 16 and confirms what disease or symptoms the user may have.
- the information processing system 10 of this embodiment is expected to be used, for example, under conditions such as those shown in FIG. 6.
- a medical professional H such as a doctor holds a tablet terminal, which is an example of the information processing system 10.
- the medical professional H uses a microphone (not shown) provided on the tablet terminal to collect voice data from a user U, who is a subject.
- the tablet terminal estimates whether or not the user U has any disease or symptom based on the voice data of the user U, and outputs the estimation result to a display unit (not shown).
- the medical professional H refers to the estimation result displayed on the display unit (not shown) of the tablet terminal to determine whether or not the user U has any disease or symptom.
- the information processing device 14 can be realized, for example, by a computer 50 shown in FIG. 7.
- the computer 50 has a CPU 51, a memory 52 as a temporary storage area, and a non-volatile storage unit 53.
- the computer 50 also has an input/output interface (I/F) 54 to which input devices and output devices are connected, and a read/write (R/W) unit 55 that controls reading and writing of data to a recording medium.
- the computer 50 also has a network I/F 56 that is connected to a network such as the Internet.
- the CPU 51, memory 52, storage unit 53, input/output I/F 54, R/W unit 55, and network I/F 56 are connected to each other via a bus 57.
- the storage unit 53 can be realized by a Hard Disk Drive (HDD), a Solid State Drive (SSD), a flash memory, etc.
- the storage unit 53 as a storage medium stores programs for causing the computer 50 to function.
- the CPU 51 reads the programs from the storage unit 53, expands them into the memory 52, and sequentially executes the processes contained in the programs.
- the information processing device 14 of the information processing system 10 executes each process shown in FIG. 8.
- step S100 the acquisition unit 20 acquires the user's voice data collected by the microphone 12. Then, the acquisition unit 20 stores the voice data in the voice data storage unit 22.
- step S102 the processing unit 26 reads out the voice data stored in the voice data storage unit 22. Then, the processing unit 26 extracts the central part of the voice data, which is data within a predetermined time period, from the voice data.
- step S104 the processing unit 26 extracts a predetermined period of data from the central portion of the audio data acquired in step S102.
- step S105 the processing unit 26 performs a shift process on the audio data for the predetermined period acquired in step S104.
- step S106 the processing unit 26 generates sampling data by performing a predetermined sampling process on the shifted data for a predetermined period obtained in step S105.
- step S108 the processing unit 26 performs an amplitude stretching process on the sampling data generated in step S106.
- step S110 the processing unit 26 performs expansion/contraction processing in the time axis direction on the sampling data that has been expanded/contracted in the amplitude direction and obtained in step S108.
- in this way, preprocessed audio data is generated by performing the above preprocessing on the audio data.
- step S112 the estimation unit 32 applies dynamic time warping to the preprocessed voice data and the reference user's voice data stored in the reference data storage unit 24, thereby generating processing result data representing the distance between the preprocessed voice data and the reference user's voice data.
- reference users who have a specified disease, etc. and reference users who do not have the specified disease are set as reference users.
- the estimation unit 32 generates processing result data between the preprocessed voice data and the voice data of a reference user who has a disease or the like, which is stored in the reference data storage unit 24.
- the estimation unit 32 generates processing result data between the preprocessed voice data and the voice data of a reference user who does not have a disease or the like, which is stored in the reference data storage unit 24.
- step S114 the calculation unit 30 calculates a score representing the degree to which the user has a disease or the like, based on the processing result data generated in step S112 above.
- the score may be, for example, a value that becomes larger as the degree to which the user has a disease or the like becomes higher.
- alternatively, the score may be a value that becomes smaller as the degree to which the user has a disease or the like becomes higher.
- for example, when the distance between the preprocessed voice data and the voice data of a reference user who has the disease or the like is small, the calculation unit 30 calculates the score so that the degree to which the user has the disease, etc. is high.
- when the distance between the preprocessed voice data and the voice data of a reference user who has the disease or the like is large, the calculation unit 30 calculates the score so that the degree to which the user has the disease, etc. is low.
- when the distance between the preprocessed voice data and the voice data of a reference user who does not have the disease or the like is small, the calculation unit 30 calculates the score so that the degree to which the user has the disease, etc. is low.
- when the distance between the preprocessed voice data and the voice data of a reference user who does not have the disease or the like is large, the calculation unit 30 calculates the score so that the degree to which the user has the disease, etc. is high.
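- the sketch below illustrates this score logic in a minimal form; combining the two distances by subtraction is an assumption, since the description above only fixes the direction of the relationship.

```python
def score_against_references(dist_to_sick: float, dist_to_healthy: float) -> float:
    """Higher score = higher estimated degree of having the disease (subtraction is assumed)."""
    return dist_to_healthy - dist_to_sick

print(score_against_references(dist_to_sick=0.8, dist_to_healthy=2.4))   # high degree
print(score_against_references(dist_to_sick=2.4, dist_to_healthy=0.8))   # low degree
```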
- step S116 the estimation unit 32 estimates whether or not the user has a disease, etc., based on the score calculated in step S114 above. For example, if the score is equal to or greater than a predetermined threshold, the estimation unit 32 estimates that the user has a disease, etc., and if the score is less than the predetermined threshold, the estimation unit 32 estimates that the user does not have a disease, etc.
- the estimation unit 32 may also estimate which disease, etc. the user has based on the processing result data for each of the voice data of the reference user having disease A, the voice data of the reference user having disease B, and the voice data of the reference user having disease C.
- step S118 the output unit 34 outputs the estimation result estimated in step S116.
- the display device 16 displays the inference results output from the output unit 34.
- the medical professional or user operating the information processing device 14 checks the inference results output from the display device 16 and confirms what disease or symptoms the user is likely to have.
- the information processing system 10 of the first embodiment acquires voice data, which is time-series data of voice uttered by the user, and generates preprocessed data.
- the information processing device 14 then generates processing result data by applying dynamic time warping to the generated preprocessed voice data, and calculates a score representing the degree to which the user has a specified disease or symptom based on the generated processing result data.
- the information processing device 14 estimates whether or not the user has a specified disease or symptom based on the calculated score. In this way, by applying dynamic time warping to the voice data, which is time-series data of voice uttered by the user, it is possible to accurately estimate whether or not the user has a specified disease or specified symptom.
- the preprocessed voice data is the central portion of the acquired voice data, that is, data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data.
- the preprocessed voice data is also data for a predetermined period.
- the preprocessed voice data is also data obtained by shifting data in the time axis direction.
- the preprocessed voice data is also data obtained by performing a predetermined sampling process.
- the preprocessed voice data is also data obtained by performing a process to expand and contract the voice data in the time axis direction.
- the preprocessed voice data is also data obtained by performing a process to expand and contract the voice data in the amplitude direction.
- FIG. 9 shows an information processing system 310 according to the second embodiment.
- the information processing system 310 includes a user terminal 18 and an information processing device 314.
- the information processing device 314 further includes a communication unit 36.
- the information processing device 314 of the information processing system 310 estimates whether the user has a disease or the like based on the user's voice collected by the microphone 12 provided on the user terminal 18.
- the information processing system 310 of the second embodiment is expected to be used, for example, under the conditions shown in Figures 10 and 11.
- a medical professional H such as a doctor operates an information processing device 314, and a user U, who is a subject, operates a user terminal 18.
- the user U collects his/her own voice data "XXXX" using the microphone 12 of the user terminal 18 that he/she operates.
- the user terminal 18 then transmits the voice data to the information processing device 314 via a network 19 such as the Internet.
- the information processing device 314 receives the voice data "XXX" of the user U transmitted from the user terminal 18. The information processing device 314 then estimates whether or not the user U has any disease or symptom based on the received voice data, and outputs the estimation result to the display unit 315 of the information processing device 314.
- the medical worker H refers to the estimation result displayed on the display unit 315 of the information processing device 314 and determines whether or not the user U has any disease or symptom.
- the subject user U collects his/her own voice data using the microphone 12 of the user terminal 18 that he/she operates.
- the user terminal 18 then transmits the voice data to the information processing device 314 via a network 19 such as the Internet.
- the information processing device 314 receives the user U's voice data transmitted from the user terminal 18.
- the information processing device 314 estimates whether or not the user U has any disease or symptom based on the received voice data, and transmits the estimation result to the user terminal 18.
- the user terminal 18 receives the estimation result transmitted from the information processing device 314, and displays the estimation result on a display unit (not shown). The user U checks the estimation result and confirms what disease or symptom he or she is likely to have.
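- as a hypothetical illustration of this client-server flow, the user terminal could upload the recorded voice and receive the estimation result as follows; the endpoint URL and response format are assumptions, since the patent does not define a transport protocol.

```python
import requests

def request_estimation(wav_path: str,
                       url: str = "https://example.com/api/estimate") -> dict:
    """Send a voice recording to a (hypothetical) estimation endpoint and return its reply."""
    with open(wav_path, "rb") as f:
        response = requests.post(url, files={"voice": f}, timeout=30)
    response.raise_for_status()
    return response.json()    # e.g. {"score": 0.92, "estimated": true} (assumed format)

# result = request_estimation("user_voice.wav")   # path and endpoint are placeholders
# print(result)
```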
- the information processing device 314 executes an information processing routine similar to that shown in FIG. 8 above.
- the information processing system of the second embodiment can estimate whether a user has a psychiatric disorder, a neurological disorder, or symptoms thereof, using the information processing device 314 deployed in the cloud.
- FIG. 12 is a graph plotting speech data obtained from subjects evaluated as depressed patients (indicated by square marks in FIG. 12) and speech data obtained from subjects evaluated as healthy subjects (indicated by circles in FIG. 12).
- FIG. 12 shows data obtained using the preprocessing and DTW of this embodiment.
- the horizontal axis dist2 of the graph in FIG. 12 represents the distance from the average reference for healthy subjects, and the vertical axis dist3 represents the distance from the average reference for depressed patients.
- the data represented by square marks, which is the speech data of depressed patients, tends to have a long distance from the average reference for healthy subjects and a short distance from the average reference for depressed patients.
- FIG. 12 shows the ROC curve and the AUC value.
- the AUC value is 1.0 when depression is judged by combining the distance dist2 from the average reference for healthy subjects and the distance dist3 from the average reference for depressed patients.
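- the sketch below illustrates, on toy data, how the two distances dist2 and dist3 could be combined and evaluated with an ROC/AUC analysis; the combination rule and the synthetic data are assumptions, while the AUC of 1.0 above refers to the patent's actual experiment.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Toy data: healthy subjects (label 0) lie near the healthy reference,
# depressed patients (label 1) lie near the depressed reference.
dist2 = np.concatenate([rng.normal(1.0, 0.3, 20), rng.normal(3.0, 0.3, 20)])  # to healthy ref
dist3 = np.concatenate([rng.normal(3.0, 0.3, 20), rng.normal(1.0, 0.3, 20)])  # to depressed ref
labels = np.array([0] * 20 + [1] * 20)

combined_score = dist2 - dist3        # larger = more like a depressed patient (assumed rule)
print(roc_auc_score(labels, combined_score))
```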
- FIG. 13 is a table of the data shown in FIG. 12 and other experimental results.
- HAMD in FIG. 13 represents the score for depression evaluation.
- HAMD ≥ 7 indicates that only subjects with a HAMD score of 7 or higher were included in the evaluation.
- MDD represents patients with depression, PD represents patients with Parkinson's disease, AD represents patients with Alzheimer's disease, and HE represents healthy individuals.
- Intra-Person DTW represents the case where a pair is generated between a certain section and another section in the speech data of one subject, features are generated, and DTW is performed.
- the AUC value when distinguishing between HE and MDD is 0.9643
- the AUC value when distinguishing between HE and PD is 0.9173.
- FIG. 14 shows the experimental results indicating the effects of the various pre-processing methods used in this embodiment.
- the results in the top row of FIG. 14 are the baseline results.
- the results from the second row onwards indicate the effect of applying each pre-processing method to the data, and it can be seen that each pre-processing method contributes to the depression assessment performance. Note that in the data in the bottom row of the table in FIG. 14, the performance evaluation (AUC) values are reversed, and this is explained below.
- Figure 15 shows the difference in distance calculated by DTW when no amplitude adjustment is performed (labeled "without amplitude normalization" in Figure 15) and when amplitude adjustment is performed (labeled "with amplitude normalization" in Figure 15).
- the vertical axis of the graph shown in Figure 15 is the distance calculated by DTW.
- Figure 15 shows the distance values calculated by DTW for HE_HospitalA, which represents data obtained from multiple healthy subjects at Hospital A, multiple depressed patients MDD, and HE_HospitalB, which represents data obtained from multiple healthy subjects at Hospital B.
- Figure 16 shows various conditions and the results of discrimination between healthy individuals (HE) and patients (Sick) suffering from major depressive disorder (MDD), Alzheimer's disease (AD), and Parkinson's disease (PD) using the Intra-Person DTW of this embodiment.
- Figure 17 shows the ROC curve corresponding to the performance evaluation AUC shown in Figure 16 above.
- Figure 18 shows the DTW value calculated under the conditions shown in Figure 16 above.
- Figure 19 shows the actual symptoms (labeled "Actual" in Figure 19) and the prediction results using the method of this embodiment (labeled "Prediction" in Figure 19).
- Figure 20 shows the results of a multiple comparison test.
- the AUC value when distinguishing between healthy individuals (HE) and patients suffering from some disease (Sick) is 0.8486.
- a multiple comparison test of the average DTW values shows that the distributions of healthy individuals (HE) and those with each disease (MDD, AD, PD) are different (significant difference in the average: p < 0.01).
- the symbol E in the table stands for "×10", and the number following E is the exponent (power of ten).
- the program can also be provided by storing it on a computer-readable recording medium.
- the processing that the CPU performs by reading and executing software may instead be executed by various processors other than a CPU.
- examples of such processors include a PLD (Programmable Logic Device) such as an FPGA (Field-Programmable Gate Array), whose circuit configuration can be changed after manufacture, and a dedicated electric circuit such as an ASIC (Application Specific Integrated Circuit), which is a processor having a circuit configuration designed exclusively to execute a specific process.
- a GPGPU (General-Purpose Graphics Processing Unit) may also be used.
- Each process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (e.g., multiple FPGAs, a combination of a CPU and an FPGA, etc.). More specifically, the hardware structure of these various processors is an electric circuit that combines circuit elements such as semiconductor elements.
- the program is described as being pre-stored (installed) in storage, but this is not limiting.
- the program may be provided in a form stored in a non-transient storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory.
- the program may also be downloaded from an external device via a network.
- each process of this embodiment may be implemented by a computer or server equipped with a general-purpose processor and storage device, and each process may be executed by a program.
- This program is stored in a storage device, and can be recorded on a recording medium such as a magnetic disk, optical disk, or semiconductor memory, or can be provided via a network.
- the components described above do not have to be implemented by a single computer or server, and may be distributed across multiple computers connected by a network.
- in the above embodiments, the case where a psychiatric disease or a nervous system disease, or a psychiatric disorder symptom or a cognitive impairment symptom, is estimated as an example of the predetermined disease or predetermined symptom has been described, but the present invention is not limited to this.
- the predetermined disease or predetermined symptom may be of any kind. It is assumed that various diseases or symptoms are reflected in the voice data; for example, not only respiratory diseases and their symptoms but also psychiatric diseases and the like are reflected in the voice data.
- accordingly, any disease or symptom may be estimated as long as its effect is reflected in the voice data.
- in the above embodiment, all of the multiple preprocessing processes described above are executed when generating the preprocessed audio data, but this is not limiting.
- the preprocessed audio data may be generated using at least one of the preprocessing processes described above.
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Heart & Thoracic Surgery (AREA)
- Pathology (AREA)
- Molecular Biology (AREA)
- Surgery (AREA)
- Animal Behavior & Ethology (AREA)
- General Health & Medical Sciences (AREA)
- Public Health (AREA)
- Veterinary Medicine (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
Provided is an information processing device that acquires speech data, which is time-series data of speech uttered by a user, and generates preprocessed speech data representing the data that, from out of the acquired speech data, is no earlier than a first time from the start point of the speech data and no later than a second time from the end of the speech data. Additionally, this information processing device generates processing result data by applying dynamic time warping to the preprocessed speech data that is generated. The information processing device calculates a score representing the degree to which the user has a predetermined disease or symptom on the basis of the generated processing result data, and estimates whether or not the user has the predetermined disease or symptom on the basis of the calculated score.
Description
開示の技術は、情報処理装置、情報処理方法、情報処理システム、及び情報処理プログラムに関する。
The disclosed technology relates to an information processing device, an information processing method, an information processing system, and an information processing program.
国際公開第2020/013296号公報には、精神系疾患又は神経系疾患を推定する装置が開示されている。この装置は、ユーザの音声データから各種の音響パラメータを算出し、それらの音響パラメータを用いて、ユーザが精神系疾患又は神経系疾患であるか否かを推定する。
International Publication No. WO 2020/013296 discloses a device for predicting whether a user has a psychiatric or neurological disorder. This device calculates various acoustic parameters from the user's voice data and uses these acoustic parameters to predict whether the user has a psychiatric or neurological disorder.
上記国際公開第2020/013296号公報に開示されている装置は、音声データから算出される音響パラメータを用いて疾患を推定するものの、その精度に関しては改善の余地がある。
The device disclosed in International Publication No. 2020/013296 estimates diseases using acoustic parameters calculated from voice data, but there is room for improvement in terms of accuracy.
開示の技術は、上記の事情を鑑みてなされたものであり、ユーザが発した音声の時系列データである音声データに対して動的時間伸縮法を適用することにより、ユーザが所定の疾患又は症状を有しているか否かを精度良く推定することができる、情報処理装置、情報処理方法、情報処理システム、及び情報処理プログラムを提供する。
The disclosed technology has been made in consideration of the above circumstances, and provides an information processing device, information processing method, information processing system, and information processing program that can accurately estimate whether a user has a specified disease or symptom by applying a dynamic time warping method to voice data, which is time-series data of voice uttered by a user.
上記の目的を達成するために本開示の第1態様は、ユーザが発した音声の時系列データである音声データを取得する取得部と、前記取得部により取得された前記音声データのうちの、前記音声データの開始点から第1時間以後のデータであって、かつ前記音声データの終了点よりも第2時間以前のデータを表す前処理済み音声データを生成する処理部と、前記処理部によって生成された前記前処理済み音声データに対して動的時間伸縮法(Dynamic Time Warping)を適用することにより、処理結果データを生成する生成部と、前記生成部により生成された前記処理結果データに基づいて、前記ユーザが所定の疾患又は症状を有している度合いを表すスコアを算出する算出部と、前記算出部により算出された前記スコアに基づいて、前記ユーザが所定の疾患又は症状を有しているか否かを推定する推定部と、を含む情報処理装置である。
In order to achieve the above object, a first aspect of the present disclosure is an information processing device including an acquisition unit that acquires voice data, which is time-series data of voice uttered by a user; a processing unit that generates preprocessed voice data representing data from the voice data acquired by the acquisition unit that is data that is a first hour or later from the start point of the voice data and that is a second hour or earlier than the end point of the voice data; a generation unit that generates processing result data by applying dynamic time warping to the preprocessed voice data generated by the processing unit; a calculation unit that calculates a score representing the degree to which the user has a predetermined disease or symptom based on the processing result data generated by the generation unit; and an estimation unit that estimates whether or not the user has a predetermined disease or symptom based on the score calculated by the calculation unit.
本開示の第2態様は、ユーザが発した音声の時系列データである音声データを取得し、取得された前記音声データのうちの、前記音声データの開始点から第1時間以後のデータであって、かつ前記音声データの終了点よりも第2時間以前のデータを表す前処理済み音声データを生成し、生成された前記前処理済み音声データに対して動的時間伸縮法(Dynamic Time Warping)を適用することにより、処理結果データを生成し、生成された前記処理結果データに基づいて、前記ユーザが所定の疾患又は症状を有している度合いを表すスコアを算出し、算出された前記スコアに基づいて、前記ユーザが所定の疾患又は症状を有しているか否かを推定する、処理をコンピュータに実行させる情報処理方法である。
A second aspect of the present disclosure is an information processing method that causes a computer to execute the following processes: acquire voice data that is time-series data of voice uttered by a user; generate preprocessed voice data representing data from the acquired voice data that is a first hour or later from the start point of the voice data and a second hour or earlier than the end point of the voice data; generate processing result data by applying dynamic time warping to the generated preprocessed voice data; calculate a score representing the degree to which the user has a specified disease or symptom based on the generated processing result data; and estimate whether or not the user has the specified disease or symptom based on the calculated score.
本開示の第3態様は、ユーザが発した音声の時系列データである音声データを取得し、取得された前記音声データのうちの、前記音声データの開始点から第1時間以後のデータであって、かつ前記音声データの終了点よりも第2時間以前のデータを表す前処理済み音声データを生成し、生成された前記前処理済み音声データに対して動的時間伸縮法(Dynamic Time Warping)を適用することにより、処理結果データを生成し、生成された前記処理結果データに基づいて、前記ユーザが所定の疾患又は症状を有している度合いを表すスコアを算出し、算出された前記スコアに基づいて、前記ユーザが所定の疾患又は症状を有しているか否かを推定する、処理をコンピュータに実行させるための情報処理プログラムである。
A third aspect of the present disclosure is an information processing program for causing a computer to execute a process of acquiring voice data, which is time-series data of a voice uttered by a user, generating preprocessed voice data representing data from the acquired voice data that is a first hour or later from a start point of the voice data and a second hour or earlier than an end point of the voice data, generating processing result data by applying dynamic time warping to the generated preprocessed voice data, calculating a score representing the degree to which the user has a predetermined disease or symptom based on the generated processing result data, and inferring whether or not the user has the predetermined disease or symptom based on the calculated score.
開示の技術によれば、ユーザが発した音声の時系列データである音声データに対して動的時間伸縮法を適用することにより、ユーザが所定の疾患又は所定の症状を有しているか否かを精度良く推定することができる、という効果が得られる。
The disclosed technology has the effect of being able to accurately estimate whether a user has a specific disease or specific symptoms by applying a dynamic time warping method to voice data, which is time-series data of the voice uttered by the user.
以下、図面を参照して開示の技術の実施形態を詳細に説明する。
Below, an embodiment of the disclosed technology will be described in detail with reference to the drawings.
<第1実施形態の情報処理システム>
<First embodiment of information processing system>
図1に、第1実施形態に係る情報処理システム10を示す。図1に示されるように、第1実施形態の情報処理システム10は、マイク12と、情報処理装置14と、表示装置16とを備えている。
FIG. 1 shows an information processing system 10 according to the first embodiment. As shown in FIG. 1, the information processing system 10 according to the first embodiment includes a microphone 12, an information processing device 14, and a display device 16.
情報処理システム10は、マイク12により集音されたユーザの音声に基づいて、ユーザが所定の疾患又は所定の症状(以下、単に「疾患等」と称する。)を有しているか否かを推定する。なお、本実施形態の情報処理システム10は、所定の疾患又は所定の症状の一例として、精神系疾患若しくは神経系疾患、又は、精神障害症状若しくは認知機能障害症状を有しているか否かを推定する。
The information processing system 10 estimates whether or not the user has a specified disease or a specified symptom (hereinafter simply referred to as "disease, etc.") based on the user's voice collected by the microphone 12. Note that the information processing system 10 of this embodiment estimates whether or not the user has a psychiatric disease or a neurological disease, or a mental disorder symptom or a cognitive impairment symptom, as an example of a specified disease or a specified symptom.
第1実施形態の情報処理システム10の情報処理装置14は、ユーザが発した音声の時系列データである音声データに対して所定の前処理を施し、前処理済みのデータを生成する。そして、情報処理装置14は、前処理済みのデータに対して動的時間伸縮法(Dynamic Time Warping)を適用した結果に基づいて、ユーザが疾患等を有しているか否かを判定する。
The information processing device 14 of the information processing system 10 of the first embodiment performs a predetermined preprocessing on the voice data, which is time-series data of the voice uttered by the user, to generate preprocessed data. Then, the information processing device 14 determines whether or not the user has a disease, etc., based on the result of applying dynamic time warping to the preprocessed data.
動的時間伸縮法では、ある時系列データと別の時系列データとの間の距離が計算される。本実施形態では、動的時間伸縮法によって得られる処理結果データを用いて、ユーザが疾患等を有しているか否かを推定する。
In dynamic time warping, the distance between one time series data and another time series data is calculated. In this embodiment, the processing result data obtained by dynamic time warping is used to estimate whether the user has a disease or the like.
以下、具体的に説明する。
The details are explained below.
図1に示されるように、情報処理装置14は、機能的には、取得部20と、音声データ記憶部22と、参照データ記憶部24と、処理部26と、生成部28と、算出部30と、推定部32と、出力部34とを備えている。情報処理装置14は、後述するようなコンピュータにより実現される。
As shown in FIG. 1, the information processing device 14 functionally comprises an acquisition unit 20, a voice data storage unit 22, a reference data storage unit 24, a processing unit 26, a generation unit 28, a calculation unit 30, an estimation unit 32, and an output unit 34. The information processing device 14 is realized by a computer as described below.
取得部20は、ユーザが発した音声の時系列データである音声データを取得する。そして、取得部20は、音声データを音声データ記憶部22へ格納する。
The acquisition unit 20 acquires voice data, which is time-series data of the voice uttered by the user. The acquisition unit 20 then stores the voice data in the voice data storage unit 22.
音声データ記憶部22には、取得部20により取得された音声データが格納される。
The voice data storage unit 22 stores the voice data acquired by the acquisition unit 20.
参照データ記憶部24には、疾患等を有しているか否かが既知である参照用ユーザの音声データが格納されている。
The reference data storage unit 24 stores voice data of reference users who are known to have or have not had a disease, etc.
処理部26は、音声データ記憶部22に記憶されている音声データを読み出す。そして、処理部26は、音声データに対して所定の前処理を施し、前処理済み音声データを生成する。前処理済み音声データの生成方法について、以下、具体的に説明する。図2に、前処理済み音声データを説明するための図を示す。
The processing unit 26 reads out the voice data stored in the voice data storage unit 22. The processing unit 26 then performs a predetermined preprocessing on the voice data to generate preprocessed voice data. The method for generating the preprocessed voice data will be described in detail below. Figure 2 shows a diagram for explaining the preprocessed voice data.
(音声データの中心部分の抽出)
(Extracting the central part of the audio data)
ユーザが発した音声データに基づいて当該ユーザが疾患等を有しているか否かを推定する際には、ユーザの発声が安定している音声データを用いる方が好ましい。
When estimating whether a user has a disease or the like based on voice data produced by the user, it is preferable to use voice data in which the user's speech is stable.
この点、音声データが表す時系列データのうちの初期の箇所は、ユーザが音声を発し始めた時刻のデータであるため、その箇所のデータを疾患等の推定に利用するのは好ましくない場合が多い。例えば、ユーザが声を発していない状態からいきなり声を発する場合、ユーザの発声が安定しないことにより、声がかすれてしまったり、声量が小さくなってしまうといった事態が予想される。このようなデータを疾患等の推定に利用したとしても、精度の良い結果は得られないことが予想される。
In this regard, since the initial part of the time series data represented by the voice data is data from the time when the user started speaking, it is often not desirable to use the data from that part to infer diseases, etc. For example, if a user suddenly starts speaking after not making any sound, it is expected that the user's voice will become hoarse or the volume will become low due to the user's unstable speech. Even if such data is used to infer diseases, etc., it is expected that accurate results will not be obtained.
さらに、音声データが表す時系列データのうちの終点に近い箇所も、疾患等の推定に利用するのは好ましくない場合が多い。例えば、ユーザが長い発音の声を発した場合にユーザが息切れをしてしまい声が続かなかったり、語尾があいまいな発音となってしまうといった事態が予想される。
Furthermore, it is often not desirable to use the points near the end of the time series data represented by the voice data to infer illnesses, etc. For example, if a user speaks a long pronunciation, it is expected that the user may run out of breath and not be able to continue speaking, or the pronunciation of the end of the word may become unclear.
そこで、本実施形態の情報処理装置14の処理部26は、時系列データである音声データから中心部分のデータを抽出する。
The processing unit 26 of the information processing device 14 in this embodiment therefore extracts central data from the audio data, which is time-series data.
具体的には、処理部26は、図2に示されるように、音声データD1のうちの、音声データD1の開始点から第1時間T1以後のデータであって、かつ音声データD1の終了点よりも第2時間T2以前のデータを表すデータD2を生成する。データD2は、音声データD1のうちの時間区間T3に相当するデータである。これにより、ユーザの発声が安定している中心部分のデータが生成される。
Specifically, as shown in FIG. 2, the processing unit 26 generates data D2 representing data from the voice data D1 that is data from a first time T1 onward from the start point of the voice data D1 and that is data from a second time T2 onward from the end point of the voice data D1. The data D2 is data that corresponds to a time period T3 in the voice data D1. This generates data for the central portion where the user's speech is stable.
(所定周期分のデータの抽出)
さらに、処理部26は、抽出された中心部分のデータから所定周期分のデータを抽出する。図3に、所定周期分のデータを説明するための図を示す。図3に示されるように、音声データDfは時系列データであり、所定信号の繰り返しが存在する。例えば、図3に示される例では、時間区間T毎に、同様の信号波形が繰り返されている。 (Extraction of data for a given period)
Furthermore, the processing unit 26 extracts a predetermined period of data from the extracted central portion of data. Fig. 3 is a diagram for explaining the predetermined period of data. As shown in Fig. 3, the audio data Df is time-series data, and a predetermined signal is repeated. For example, in the example shown in Fig. 3, a similar signal waveform is repeated for each time interval T.
さらに、処理部26は、抽出された中心部分のデータから所定周期分のデータを抽出する。図3に、所定周期分のデータを説明するための図を示す。図3に示されるように、音声データDfは時系列データであり、所定信号の繰り返しが存在する。例えば、図3に示される例では、時間区間T毎に、同様の信号波形が繰り返されている。 (Extraction of data for a given period)
Furthermore, the processing unit 26 extracts a predetermined period of data from the extracted central portion of data. Fig. 3 is a diagram for explaining the predetermined period of data. As shown in Fig. 3, the audio data Df is time-series data, and a predetermined signal is repeated. For example, in the example shown in Fig. 3, a similar signal waveform is repeated for each time interval T.
後述するように、ユーザが疾患等を有しているか否かを推定する際には、疾患等を推定する対象のユーザが発した音声データと、疾患等を有しているか否かが既知である参照用ユーザの音声データとが比較される場合がある。そのため、疾患等を推定する対象のユーザが発した音声データから切り出される所定周期分のデータと、参照用ユーザの音声データにおける所定周期分のデータとは揃えられている方が好ましい。このため、例えば、処理部26は、抽出された中心部分のデータから、参照用ユーザの音声データの周期と同一の所定周期分のデータを抽出する。この所定周期は、例えば、予め設定される。または、例えば、データの種類に応じて、所定周期を変化させるようにしてもよい。
As described below, when estimating whether or not a user has a disease, voice data uttered by the user whose disease is to be estimated may be compared with voice data of a reference user whose disease is known to be present. For this reason, it is preferable that a predetermined period of data extracted from the voice data uttered by the user whose disease is to be estimated is aligned with a predetermined period of data in the voice data of the reference user. For this reason, for example, the processing unit 26 extracts data of a predetermined period that is the same as the period of the voice data of the reference user from the extracted central portion of data. This predetermined period is, for example, set in advance. Alternatively, for example, the predetermined period may be changed depending on the type of data.
(Extracting data by shifting along the time axis)
Next, the processing unit 26 shifts the extracted data for the predetermined period along the time axis. FIG. 4 is a diagram for explaining this shift. As shown in FIG. 4, consider a case where the voice data Ds contains a signal that repeats with period Ts, and the reference user's voice data DRef contains a signal that repeats with period TRef. In this case, as shown in FIG. 4, if the start P1 of the cut-out of the voice data Ds and the start P2 of the reference user's voice data DRef are not aligned, the value representing the distance between Ds and DRef calculated by dynamic time warping may become large even if the two signals are similar.
Therefore, the processing unit 26 shifts the extracted data for the predetermined period along the time axis. For example, the processing unit 26 shifts the data for the predetermined period shown in FIG. 4 by a predetermined amount of time in the direction indicated by the arrow S. This shift amount is, for example, set in advance. Alternatively, the shift amount may be changed depending on the type of data.
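The description only states that the shift amount is preset or chosen by data type. As one hypothetical way such a shift could be determined, the sketch below slides the cut-out over a small range of lags and keeps the lag whose start best correlates with the reference; the variable names and the correlation criterion are assumptions.

```python
import numpy as np

def shift_to_align(segment: np.ndarray, reference: np.ndarray, max_shift: int) -> np.ndarray:
    """Drop up to max_shift leading samples so that the segment's start lines up with the reference."""
    n = min(len(segment) - max_shift, len(reference))
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_shift + 1):
        corr = float(np.dot(segment[lag:lag + n], reference[:n]))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return segment[best_lag:]
```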
(Extraction of data by sampling at a predetermined sampling rate)
Next, the processing unit 26 extracts sampling data obtained by sampling the data that has been shifted along the time axis. As described above, in this embodiment, when estimating whether or not a user has a disease or the like, the voice data uttered by the user to be evaluated may be compared with the voice data of a reference user whose disease status is known. Therefore, it is preferable that the sampling rate applied to the target user's voice data and the sampling rate applied to the reference user's voice data be the same.
For example, as shown in FIG. 5, consider sampled data DA and DB obtained by sampling the same voice data D. If the distance between the sampled data DA generated at sampling rate A and the sampled data DB generated at sampling rate B is calculated using dynamic time warping, a non-zero distance value is obtained even though the underlying voice data D is identical.
For this reason, for example, the processing unit 26 generates sampling data extracted at the same sampling rate as that of the reference user's voice data. This sampling rate is set in advance. Alternatively, the sampling rate may be changed depending on the type of data. For example, 200 sampling points per period are extracted from the data for the predetermined period.
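A minimal resampling sketch is shown below; it linearly interpolates the cut-out so that every period is represented by the same number of points (200 per cycle in the example above). The use of linear interpolation is an assumption; the description only requires that the effective sampling rate match that of the reference data.

```python
import numpy as np

def resample_per_period(periods: np.ndarray, n_periods: int, points_per_period: int = 200) -> np.ndarray:
    """Resample so that each of the n_periods repetitions is represented by points_per_period samples."""
    target_len = n_periods * points_per_period
    old_x = np.linspace(0.0, 1.0, num=len(periods))
    new_x = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_x, old_x, periods)
```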
(Expansion and contraction of data along the time axis)
Next, the processing unit 26 performs an expansion/contraction process along the time axis on the sampling data obtained by sampling the voice data. As described above, in this embodiment, when estimating whether or not a user has a disease or the like, the voice data uttered by the user to be evaluated may be compared with the voice data of a reference user whose disease status is known.
For this reason, it is preferable that the time-axis intervals of the voice data uttered by the target user be aligned with those of the reference user's voice data. Therefore, for example, the processing unit 26 performs a predetermined expansion/contraction process along the time axis on the data D3 shown in FIG. 2. The method of this expansion/contraction process is set in advance. Alternatively, the method may be changed depending on the type of data.
(Data expansion and contraction in the amplitude direction)
Next, the processing unit 26 performs an expansion/contraction process in the amplitude direction on the data that has undergone the expansion/contraction process along the time axis. As described above, in this embodiment, when estimating whether or not a user has a disease or the like, the voice data uttered by the user to be evaluated may be compared with the voice data of a reference user whose disease status is known.
For this reason, it is preferable that the amplitude of the voice data uttered by the target user be aligned with the amplitude of the reference user's voice data. Therefore, for example, the processing unit 26 performs a predetermined expansion/contraction process in the amplitude direction on the data D4 shown in FIG. 2. The method of this expansion/contraction process is set in advance. Alternatively, the method may be changed depending on the type of data.
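The description leaves the exact amplitude adjustment open. As one common choice, and the one suggested by the "amplitude normalization" experiments described later, the sketch below rescales the waveform to a fixed peak amplitude; treating peak scaling as the method is an assumption.

```python
import numpy as np

def normalize_amplitude(x: np.ndarray, target_peak: float = 1.0) -> np.ndarray:
    """Stretch or shrink the waveform in the amplitude direction to a fixed peak value."""
    peak = float(np.max(np.abs(x)))
    return x.copy() if peak == 0.0 else x * (target_peak / peak)
```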
The processing unit 26 generates preprocessed audio data by performing multiple preprocessing processes as described above on the audio data.
The generating unit 28 generates processing result data by applying dynamic time warping to the preprocessed audio data generated by the processing unit 26. The processing result data obtained by applying dynamic time warping is calculated as a distance matrix representing the distance between each point of one time series data and each point of another time series data.
Specifically, the generation unit 28 reads out the voice data of the reference user stored in the reference data storage unit 24. Then, as shown in FIG. 2, the generation unit 28 applies a dynamic time warping method to the preprocessed voice data D5 and the voice data DRef of the reference user to generate processing result data representing the distance between the preprocessed voice data D5 and the voice data DRef of the reference user. Note that the voice data of the reference user may also be subjected to the preprocessing described above.
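For reference, a textbook dynamic-time-warping sketch is shown below: the local cost |a_i - b_j| over all point pairs corresponds to the distance matrix mentioned above, and the accumulated value at the end of the optimal warping path serves as the distance between the two series. This is a generic formulation, not necessarily the exact one used in the embodiment.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping distance between two 1-D series with absolute-difference local cost."""
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])   # one element of the point-to-point distance matrix
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])

# Example: distance between a user's preprocessed segment and a reference segment (dummy data).
d5 = np.sin(np.linspace(0, 2 * np.pi, 200))
d_ref = np.sin(np.linspace(0, 2 * np.pi, 200) + 0.1)
print(dtw_distance(d5, d_ref))
```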
The generating unit 28 may generate the processing result data using only the preprocessed audio data. For example, the generating unit 28 may apply a dynamic time warping method to first audio data representing data in a first time interval in the preprocessed audio data and second audio data representing data in a second time interval in the preprocessed audio data, thereby generating processing result data representing the distance between the first audio data and the second audio data.
More specifically, for example, as shown in FIG. 2, the generation unit 28 applies a dynamic time warping method to first audio data D5-1 representing data in a first time interval in the preprocessed audio data D5 and second audio data D5-2 representing data in a second time interval in the preprocessed audio data D5, thereby generating processing result data representing the distance between the first audio data D5-1 and the second audio data D5-2.
Next, as shown in FIG. 2, the generation unit 28 applies a dynamic time warping method to second audio data D5-2 representing data in the second time interval in the preprocessed audio data D5 and third audio data D5-3 representing data in a third time interval in the preprocessed audio data D5, thereby generating processing result data representing the distance between the second audio data D5-2 and the third audio data D5-3.
Furthermore, the generation unit 28 generates processing result data representing the distance between the first audio data D5-1 and the third audio data D5-3 by applying the dynamic time warping method to the first audio data D5-1 and the third audio data D5-3. In this way, the generation unit 28 generates processing result data for each pair of audio data D5-1 to D5-9 within a predetermined time period.
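A sketch of this pairwise ("Intra-Person DTW") computation over the segments D5-1 to D5-9 might look as follows; it reuses the dtw_distance helper from the sketch above, and segment extraction is assumed to have been done already.

```python
import numpy as np

def pairwise_dtw(segments: list[np.ndarray]) -> np.ndarray:
    """DTW distance for every pair of segments taken from the same speaker's preprocessed data."""
    k = len(segments)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            d = dtw_distance(segments[i], segments[j])   # dtw_distance defined in the earlier sketch
            dist[i, j] = dist[j, i] = d
    return dist

# Example with nine dummy segments standing in for D5-1 ... D5-9.
segments = [np.sin(np.linspace(0, 2 * np.pi, 200) + 0.05 * i) for i in range(9)]
intra_person = pairwise_dtw(segments)
```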
The calculation unit 30, which will be described later, may calculate a score representing the degree to which the user has a disease or the like based on the processing result data generated in this manner using only the preprocessed voice data D5.
The calculation unit 30 calculates a score representing the degree to which the user has a disease or the like based on the processing result data generated by the generation unit 28. For example, the calculation unit 30 uses the mean, maximum, minimum, standard deviation, and median of the elements of the distance matrix generated by the generation unit 28 to calculate, by a known method, a score representing the degree to which the user has a specified disease or symptom.
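A hypothetical sketch of deriving such summary statistics from a pairwise distance matrix is shown below; how these statistics are turned into the final score is described only as "a known method", so that last step is not shown.

```python
import numpy as np

def distance_statistics(dist: np.ndarray) -> dict:
    """Mean, max, min, standard deviation, and median of the upper-triangle distance values."""
    vals = dist[np.triu_indices_from(dist, k=1)]
    return {
        "mean": float(np.mean(vals)),
        "max": float(np.max(vals)),
        "min": float(np.min(vals)),
        "std": float(np.std(vals)),
        "median": float(np.median(vals)),
    }
```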
The estimation unit 32 estimates whether or not the user has a disease, etc., based on the score calculated by the calculation unit 30. For example, if the score is equal to or greater than a predetermined threshold, the estimation unit 32 estimates that the user has a disease, etc., and if the score is less than the predetermined threshold, the estimation unit 32 estimates that the user does not have a disease, etc.
The output unit 34 outputs the estimation result estimated by the estimation unit 32. Note that the output unit 34 may output the score itself as the estimation result.
The display device 16 displays the estimation results output from the estimation unit 32.
The medical professional or user who operates the information processing device 14 checks the estimation results output from the display device 16 and confirms what disease or symptoms the user may have.
The information processing system 10 of this embodiment is expected to be used, for example, under conditions such as those shown in FIG. 6.
In the example of FIG. 6, a medical professional H, such as a doctor, holds a tablet terminal, which is an example of the information processing system 10. The medical professional H uses a microphone (not shown) provided on the tablet terminal to collect voice data from a user U, who is a subject. The tablet terminal then estimates whether or not the user U has any disease or symptom based on the voice data of the user U, and outputs the estimation result to a display unit (not shown). The medical professional H refers to the estimation result displayed on the display unit (not shown) of the tablet terminal to determine whether or not the user U has any disease or symptom.
The information processing device 14 can be realized, for example, by a computer 50 shown in FIG. 7. The computer 50 has a CPU 51, a memory 52 as a temporary storage area, and a non-volatile storage unit 53. The computer 50 also has an input/output interface (I/F) 54 to which external devices and output devices are connected, and a read/write (R/W) unit 55 that controls reading and writing of data to the recording medium. The computer 50 also has a network I/F 56 that is connected to a network such as the Internet. The CPU 51, memory 52, storage unit 53, input/output I/F 54, R/W unit 55, and network I/F 56 are connected to each other via a bus 57.
The storage unit 53 can be realized by a Hard Disk Drive (HDD), a Solid State Drive (SSD), a flash memory, etc. The storage unit 53 as a storage medium stores programs for causing the computer 50 to function. The CPU 51 reads the programs from the storage unit 53, expands them into the memory 52, and sequentially executes the processes contained in the programs.
[Operation of the information processing system of the first embodiment]
Next, the specific operation of the information processing system 10 of the first embodiment will be described. The information processing device 14 of the information processing system 10 executes each process shown in FIG. 8.
First, in step S100, the acquisition unit 20 acquires the user's voice data collected by the microphone 12. Then, the acquisition unit 20 stores the voice data in the voice data storage unit 22.
Next, in step S102, the processing unit 26 reads out the voice data stored in the voice data storage unit 22. Then, the processing unit 26 extracts the central part of the voice data, which is data within a predetermined time period, from the voice data.
In step S104, the processing unit 26 extracts a predetermined period of data from the central portion of the audio data acquired in step S102.
In step S105, the processing unit 26 performs a shift process on the audio data for the predetermined period acquired in step S104.
In step S106, the processing unit 26 generates sampling data by performing a predetermined sampling process on the shifted data for a predetermined period obtained in step S105.
In step S108, the processing unit 26 performs an amplitude stretching process on the sampling data generated in step S106.
In step S110, the processing unit 26 performs expansion/contraction processing in the time axis direction on the sampling data that has been expanded/contracted in the amplitude direction and obtained in step S108.
By executing each process from step S102 to step S110, preprocessed audio data is generated by performing preprocessing on the audio data.
In step S112, the estimation unit 32 applies dynamic time warping to the preprocessed voice data and the reference user's voice data stored in the reference data storage unit 24, thereby generating processing result data representing the distance between the preprocessed voice data and the reference user's voice data.
In addition, reference users who have a specified disease, etc. and reference users who do not have the specified disease are set as reference users.
For this reason, for example, the estimation unit 32 generates processing result data between the preprocessed voice data and the voice data of a reference user who has a disease or the like, which is stored in the reference data storage unit 24. Alternatively, for example, the estimation unit 32 generates processing result data between the preprocessed voice data and the voice data of a reference user who does not have a disease or the like, which is stored in the reference data storage unit 24.
In step S114, the calculation unit 30 calculates a score representing the degree to which the user has a disease or the like, based on the processing result data generated in step S112. For example, the score may take a larger value as the degree to which the user has the disease increases. Alternatively, the score may take a smaller value as that degree increases.
For example, consider a case where the score takes a larger value as the degree to which the user has the disease increases. In this case, when the distance between the preprocessed voice data and the voice data of a reference user who has the disease is small, the calculation unit 30 calculates the score so that the degree to which the user has the disease is high. Conversely, when that distance is large, the calculation unit 30 calculates the score so that the degree to which the user has the disease is low.
Also, for example, when the distance between the preprocessed voice data and the voice data of a reference user who does not have a disease, etc. is small, the calculation unit 30 calculates the score so that the degree to which the user has a disease, etc. is low. On the other hand, when the distance between the preprocessed voice data and the voice data of a reference user who does not have a disease, etc. is large, the calculation unit 30 calculates the score so that the degree to which the user has a disease, etc. is high.
In step S116, the estimation unit 32 estimates whether or not the user has a disease, etc., based on the score calculated in step S114 above. For example, if the score is equal to or greater than a predetermined threshold, the estimation unit 32 estimates that the user has a disease, etc., and if the score is less than the predetermined threshold, the estimation unit 32 estimates that the user does not have a disease, etc.
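A toy sketch of this scoring and thresholding logic is shown below. The way the two distances are combined into a score, the sign convention, and the threshold value are all assumptions; the description only states that the score rises as the user's data gets closer to the diseased reference and farther from the healthy reference, and that the result is compared against a preset threshold.

```python
def estimate_from_distances(dist_to_healthy: float, dist_to_patient: float,
                            threshold: float = 0.0) -> bool:
    """Return True (disease suspected) when the user's voice is relatively closer to the
    patient reference than to the healthy reference."""
    score = dist_to_healthy - dist_to_patient   # larger when closer to the patient reference
    return score >= threshold

# Example: DTW distance 3.2 to the healthy reference, 1.1 to the patient reference.
print(estimate_from_distances(3.2, 1.1))   # True under this toy rule
```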
The estimation unit 32 may also estimate which disease, etc. the user has based on the processing result data for each of the voice data of the reference user having disease A, the voice data of the reference user having disease B, and the voice data of the reference user having disease C.
In step S118, the output unit 34 outputs the estimation result estimated in step S116.
The display device 16 displays the inference results output from the output unit 34. The medical professional or user operating the information processing device 14 checks the inference results output from the display device 16 and confirms what disease or symptoms the user is likely to have.
As described above, the information processing system 10 of the first embodiment acquires voice data, which is time-series data of voice uttered by the user, and generates preprocessed data. The information processing device 14 then generates processing result data by applying dynamic time warping to the generated preprocessed voice data, and calculates a score representing the degree to which the user has a specified disease or symptom based on the generated processing result data. The information processing device 14 then estimates whether or not the user has a specified disease or symptom based on the calculated score. In this way, by applying dynamic time warping to the voice data, which is time-series data of voice uttered by the user, it is possible to accurately estimate whether or not the user has a specified disease or specified symptom.
The preprocessed voice data is the central portion of the acquired voice data, that is, the data from a first time after the start point of the voice data up to a second time before the end point of the voice data. By using the central portion of the voice data as the preprocessed voice data, the stable central part of the user's utterance can be used to accurately estimate whether the user has a specified disease or symptom.
The preprocessed voice data is also data for a predetermined period. It is also data obtained by shifting the data along the time axis, data obtained by performing a predetermined sampling process, data obtained by an expansion/contraction process along the time axis, and data obtained by an expansion/contraction process in the amplitude direction. By performing these preprocessing steps on the voice data, the voice data can be put into a form suited to estimating diseases and the like, and it is possible to accurately estimate whether or not the user has a disease or the like.
<Second embodiment of information processing system>
Next, the second embodiment will be described. Note that, among the configurations of the information processing system of the second embodiment, the parts that are similar to those of the first embodiment will be given the same reference numerals and the description will be omitted.
FIG. 9 shows an information processing system 310 according to the second embodiment. As shown in FIG. 9, the information processing system 310 includes a user terminal 18 and an information processing device 314. The information processing device 314 further includes a communication unit 36.
The information processing device 314 of the information processing system 310 estimates whether the user has a disease or the like based on the user's voice collected by the microphone 12 provided on the user terminal 18.
The information processing system 310 of the second embodiment is expected to be used, for example, under the conditions shown in FIG. 10 and FIG. 11.
In the example of FIG. 10, a medical professional H such as a doctor operates an information processing device 314, and a user U, who is a subject, operates a user terminal 18. The user U collects his/her own voice data "XXXX" using the microphone 12 of the user terminal 18 that he/she operates. The user terminal 18 then transmits the voice data to the information processing device 314 via a network 19 such as the Internet.
The information processing device 314 receives the voice data "XXXX" of the user U transmitted from the user terminal 18. The information processing device 314 then estimates whether or not the user U has any disease or symptom based on the received voice data, and outputs the estimation result to the display unit 315 of the information processing device 314. The medical worker H refers to the estimation result displayed on the display unit 315 of the information processing device 314 and determines whether or not the user U has any disease or symptom.
On the other hand, in the example of FIG. 11, the subject, user U, collects his/her own voice data using the microphone 12 of the user terminal 18 that he/she operates. The user terminal 18 then transmits the voice data to the information processing device 314 via a network 19 such as the Internet. The information processing device 314 receives the user U's voice data transmitted from the user terminal 18. The information processing device 314 then estimates whether or not the user U has any disease or symptom based on the received voice data, and transmits the estimation result to the user terminal 18. The user terminal 18 receives the estimation result transmitted from the information processing device 314, and displays it on a display unit (not shown). The user checks the estimation result and confirms what disease or symptom he or she is likely to have.
The information processing device 314 executes an information processing routine similar to that shown in FIG. 8 above.
As described above, the information processing system of the second embodiment can estimate whether a user has a psychiatric disorder, a neurological disorder, or symptoms thereof, using the information processing device 314 installed on the cloud.
Next, an example will be described. In this example, experimental results regarding the effect of the preprocessing described in this embodiment are shown.
FIG. 12 is a graph plotting speech data obtained from subjects evaluated as depressed patients (indicated by square marks in FIG. 12) and speech data obtained from subjects evaluated as healthy subjects (indicated by circles in FIG. 12). FIG. 12 shows data obtained using the preprocessing and DTW of this embodiment. The horizontal axis dist2 of the graph in FIG. 12 represents the distance from the healthy-subject average reference, and the vertical axis dist3 represents the distance from the depressed-patient average reference. As shown in FIG. 12, the square-mark data from depressed patients tend to have a long distance from the healthy-subject average reference and a short distance from the depressed-patient average reference. Conversely, the circle-mark data from healthy subjects tend to have a short distance from the healthy-subject average reference and a long distance from the depressed-patient average reference. FIG. 12 also shows the ROC curve and the AUC value. As shown in FIG. 12, the AUC value is 1.0 when depression is judged by combining the distance dist2 from the healthy-subject average reference and the distance dist3 from the depressed-patient average reference.
FIG. 13 is a table of the data shown in FIG. 12 and other experimental results. HAMD in FIG. 13 represents the score for depression evaluation. HAMD≧7 represents that only those with a score of 7 or more were evaluated. MDD represents patients with depression, PD represents patients with Parkinson's disease, AD represents patients with Alzheimer's disease, and HE represents healthy individuals. MDD=20 represents the use of data from 20 patients with depression, and HE=14 represents the use of data from 14 healthy individuals. Intra-Person DTW represents the case where a pair is generated between a certain section and another section in the speech data of one subject, features are generated, and DTW is performed. FIG. 13 shows the performance when the period adjustment, which is the preprocessing of this embodiment, is not performed, and the performance value is AUC=0.7893, which is a lower performance value than when preprocessing is performed. As shown in the results at the top of FIG. 13, the AUC value when distinguishing between HE and MDD is 0.9643, and as shown in the results at the bottom, the AUC value when distinguishing between HE and PD is 0.9173.
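For orientation only, the following sketch shows how an AUC of the kind reported here could be computed from per-subject distances dist2 (to the healthy-average reference) and dist3 (to the depressed-average reference). The dummy labels, the dist2 - dist3 combination, and the use of scikit-learn are assumptions, not the exact evaluation procedure behind the reported figures.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Dummy per-subject values: label 1 = depressed patient, 0 = healthy subject.
labels = np.array([1, 1, 1, 0, 0, 0])
dist2 = np.array([0.9, 0.8, 0.7, 0.2, 0.3, 0.1])   # distance to healthy-average reference
dist3 = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 0.7])   # distance to depressed-average reference

scores = dist2 - dist3          # higher when farther from healthy and closer to depressed
print(roc_auc_score(labels, scores))
```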
FIG. 14 shows the experimental results indicating the effects of the various preprocessing methods used in this embodiment. The results in the top row of FIG. 14 are the baseline results. The results from the second row onwards indicate the effect of applying each preprocessing method to the data, and it can be seen that each preprocessing method contributes to the depression assessment performance. Note that in the data in the bottom row of the table in FIG. 14, the performance evaluation (AUC) values are reversed, and this is explained below.
FIG. 15 shows the difference in distance calculated by DTW when no amplitude adjustment is performed (labeled "without amplitude normalization" in FIG. 15) and when amplitude adjustment is performed (labeled "with amplitude normalization" in FIG. 15). The vertical axis of the graph shown in FIG. 15 is the distance calculated by DTW.
FIG. 15 shows the distance values calculated by DTW for HE_HospitalA, which represents data obtained from multiple healthy subjects at Hospital A, multiple depressed patients MDD, and HE_HospitalB, which represents data obtained from multiple healthy subjects at Hospital B.
In "without amplitude normalization" in FIG. 15, it can be seen that there is a large difference between HE_HospitalA, which represents data obtained from multiple healthy subjects at Hospital A, and HE_HospitalB, which represents data obtained from multiple healthy subjects at Hospital B. This is thought to be due to differences in the recording environment and recording settings of the voice data. When the sound pressure of the recorded voice data is low, the distance value calculated by DTW is small, while when the sound pressure of the voice data is high, the distance value calculated by DTW is large. As a result, as shown in "without amplitude normalization" in FIG. 15, there is a difference between the data distributions of HE_HospitalA and HE_HospitalB, which were recorded in different environments (there is a significant difference in the mean values by t-test: p<0.01).
In contrast, in the case of "with amplitude normalization" in FIG. 15, the difference between the recording conditions of HE_HospitalA and HE_HospitalB is corrected, and no difference is observed between the data distributions of HE_HospitalA and HE_HospitalB, which were recorded in different environments (no significant difference in the mean values by t-test: p>0.1). In this way, by incorporating amplitude normalization into the preprocessing, it is possible to correctly classify whether a user has a psychiatric disorder, a neurological disorder, or symptoms thereof, without being affected by differences in recording conditions.
Note that diseases such as Alzheimer's disease and Parkinson's disease can also be predicted in a similar manner. FIG. 16 shows various conditions and the results of discrimination between healthy individuals (HE) and patients (Sick) suffering from major depressive disorder (MDD), Alzheimer's disease (AD), and Parkinson's disease (PD) using the Intra-Person DTW of this embodiment. FIG. 17 shows the ROC curve corresponding to the performance evaluation AUC shown in FIG. 16. FIG. 18 shows the DTW values calculated under the conditions shown in FIG. 16. FIG. 19 shows the actual symptoms (labeled "Actual" in FIG. 19) and the prediction results using the method of this embodiment (labeled "Prediction" in FIG. 19). FIG. 20 shows the results of a multiple comparison test.
As shown in FIG. 16, the AUC value when distinguishing between healthy individuals (HE) and patients suffering from some disease (Sick) is 0.8486. Also, as shown in FIG. 20, a multiple comparison test of the average DTW values shows that the distributions of healthy individuals (HE) and those with each disease (MDD, AD, PD) are different (significant difference in the means: p<0.01). Note that "E" in the table stands for "×10" and the number next to it stands for the exponent.
The results shown in FIG. 16 to FIG. 20 also show that the method of this embodiment can accurately estimate whether a user has a psychiatric disorder, a neurological disorder, or symptoms thereof.
The technology disclosed herein is not limited to the above-described embodiments, and various modifications and applications are possible without departing from the spirit and scope of the invention.
For example, although the present specification has described an embodiment in which the program is pre-installed, the program can also be provided by storing it on a computer-readable recording medium.
The processing that the CPU performs in the above embodiments by reading and executing software (a program) may be executed by various processors other than the CPU. Examples of such processors include a PLD (Programmable Logic Device) such as an FPGA (Field-Programmable Gate Array), whose circuit configuration can be changed after manufacture, and a dedicated electric circuit such as an ASIC (Application Specific Integrated Circuit), which is a processor having a circuit configuration designed specifically to execute a specific process. Alternatively, a GPGPU (General-purpose graphics processing unit) may be used as the processor. Each process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (e.g., multiple FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electric circuit that combines circuit elements such as semiconductor elements.
In each of the above embodiments, the program has been described as being stored (installed) in advance in storage, but this is not limiting. The program may be provided in a form stored on a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.
Furthermore, each process of this embodiment may be implemented by a computer or server equipped with a general-purpose processor and storage device, and each process may be executed by a program. This program is stored in a storage device, and can be recorded on a recording medium such as a magnetic disk, optical disk, or semiconductor memory, or can be provided via a network. Of course, any other components do not have to be implemented by a single computer or server, and may be distributed across multiple computers connected by a network.
In each of the above embodiments, the case of estimating whether the user has a psychiatric or neurological disease, or a psychiatric disorder symptom or a cognitive impairment symptom, has been described as an example of the predetermined disease or symptom, but the present disclosure is not limited to this. The predetermined disease or symptom may be of any kind. Voice data is assumed to reflect a wide variety of diseases and symptoms: not only respiratory diseases and symptoms but also psychiatric diseases and the like leave their effects in the voice data. Therefore, any disease or symptom may be estimated as long as its effects appear in the voice data.
In each of the above embodiments, the case where all of the multiple preprocessing steps are executed when generating the preprocessed voice data has been described as an example, but this is not limiting. The preprocessed voice data may be generated using at least one of the preprocessing steps described above.
All publications, patent applications, and technical standards described in this specification are incorporated by reference into this specification to the same extent as if each individual publication, patent application, and technical standard was specifically and individually indicated to be incorporated by reference.
Claims (11)
- An information processing device comprising: an acquisition unit that acquires voice data, which is time-series data of a voice uttered by a user; a processing unit that generates preprocessed voice data representing data of the acquired voice data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data; a generation unit that generates processing result data by applying dynamic time warping to the preprocessed voice data generated by the processing unit; a calculation unit that calculates a score representing a degree to which the user has a predetermined disease or symptom, based on the processing result data generated by the generation unit; and an estimation unit that estimates whether or not the user has the predetermined disease or symptom, based on the score calculated by the calculation unit.
- The information processing device according to claim 1, wherein the processing unit generates, as the preprocessed voice data, data for a predetermined period among the data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data.
- The information processing device according to claim 1 or claim 2, wherein the processing unit generates, as the preprocessed voice data, data obtained by performing a predetermined sampling process on the data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data.
- The information processing device according to any one of claims 1 to 3, wherein the processing unit generates the preprocessed voice data by performing a process of expanding or contracting, in the time axis direction, the data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data.
- The information processing device according to any one of claims 1 to 4, wherein the processing unit generates the preprocessed voice data by performing a process of expanding or contracting, in the amplitude direction, the data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data.
- The information processing device according to any one of claims 1 to 5, wherein the processing unit generates the preprocessed voice data by shifting, in the time axis direction, the data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data.
- The information processing device according to any one of claims 1 to 6, wherein the generation unit generates the processing result data representing a distance between the preprocessed voice data and the voice data of a reference user, for whom it is known whether or not the reference user has the predetermined disease or symptom, by applying the dynamic time warping method to the preprocessed voice data and the voice data of the reference user.
- The information processing device according to any one of claims 1 to 6, wherein the generation unit generates the processing result data representing a distance between first voice data representing data in a first time interval in the preprocessed voice data and second voice data representing data in a second time interval in the preprocessed voice data, by applying the dynamic time warping method to the first voice data and the second voice data.
- An information processing system including a user terminal equipped with a microphone and the information processing device according to any one of claims 1 to 8, wherein the user terminal transmits the voice data acquired by the microphone to the information processing device, the acquisition unit of the information processing device acquires the voice data transmitted from the user terminal, a communication unit of the information processing device transmits an estimation result estimated by the estimation unit to the user terminal, and the user terminal receives the estimation result transmitted from the information processing device.
- An information processing method for causing a computer to execute processing comprising: acquiring voice data, which is time-series data of a voice uttered by a user; generating preprocessed voice data representing data of the acquired voice data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data; generating processing result data by applying dynamic time warping to the generated preprocessed voice data; calculating a score representing a degree to which the user has a predetermined disease or symptom, based on the generated processing result data; and estimating whether or not the user has the predetermined disease or symptom, based on the calculated score.
- An information processing program for causing a computer to execute processing comprising: acquiring voice data, which is time-series data of a voice uttered by a user; generating preprocessed voice data representing data of the acquired voice data that is a first time or later after the start point of the voice data and a second time or earlier before the end point of the voice data; generating processing result data by applying dynamic time warping to the generated preprocessed voice data; calculating a score representing a degree to which the user has a predetermined disease or symptom, based on the generated processing result data; and estimating whether or not the user has the predetermined disease or symptom, based on the calculated score.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/043832 WO2024116254A1 (en) | 2022-11-28 | 2022-11-28 | Information processing device, information processing method, information processing system, and information processing program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/043832 WO2024116254A1 (en) | 2022-11-28 | 2022-11-28 | Information processing device, information processing method, information processing system, and information processing program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024116254A1 true WO2024116254A1 (en) | 2024-06-06 |
Family
ID=91323408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/043832 WO2024116254A1 (en) | 2022-11-28 | 2022-11-28 | Information processing device, information processing method, information processing system, and information processing program |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024116254A1 (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019188405A1 (en) * | 2018-03-29 | 2019-10-03 | パナソニックIpマネジメント株式会社 | Cognitive function evaluation device, cognitive function evaluation system, cognitive function evaluation method and program |
WO2020013296A1 (en) * | 2018-07-13 | 2020-01-16 | Pst株式会社 | Apparatus for estimating mental/neurological disease |
JP2021113965A (en) * | 2020-01-16 | 2021-08-05 | 國立中正大學 | Device and method for generating synchronous voice |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6337362B1 (en) | Cognitive function evaluation apparatus and cognitive function evaluation system | |
Tsanas et al. | Accurate telemonitoring of Parkinson’s disease progression by non-invasive speech tests | |
JP2019084249A (en) | Dementia diagnosis apparatus, dementia diagnosis method, and dementia diagnosis program | |
JP6604113B2 (en) | Eating and drinking behavior detection device, eating and drinking behavior detection method, and eating and drinking behavior detection computer program | |
JP6515670B2 (en) | Sleep depth estimation device, sleep depth estimation method, and program | |
TW201923735A (en) | Cognitive function evaluation device, cognitive function evaluation system, cognitive function evaluation method and program | |
WO2020151155A1 (en) | Method and device for building alzheimer's disease detection model | |
JP6845404B2 (en) | Sleep stage determination method, sleep stage determination device, and sleep stage determination program | |
EP3866687A1 (en) | A method and apparatus for diagnosis of maladies from patient sounds | |
JP5803125B2 (en) | Suppression state detection device and program by voice | |
JP7430398B2 (en) | Information processing device, information processing method, information processing system, and information processing program | |
WO2024116254A1 (en) | Information processing device, information processing method, information processing system, and information processing program | |
TW201742053A (en) | Estimation method, estimation program, estimation device, and estimation system | |
Akafi et al. | Assessment of hypernasality for children with cleft palate based on cepstrum analysis | |
WO2021132289A1 (en) | Pathological condition analysis system, pathological condition analysis device, pathological condition analysis method, and pathological condition analysis program | |
JPWO2016207951A1 (en) | Shunt sound analysis device, shunt sound analysis method, computer program, and recording medium | |
Morales et al. | Glottal Airflow Estimation Using Neck Surface Acceleration and Low-Order Kalman Smoothing | |
JP6925056B2 (en) | Sleep stage determination method, sleep stage determination device, and sleep stage determination program | |
JP7246664B1 (en) | Information processing device, information processing method, information processing system, and information processing program | |
den Brinker et al. | Performance requirements for cough classifiers in real-world applications | |
JP6782940B2 (en) | Tongue position / tongue habit judgment device, tongue position / tongue habit judgment method and program | |
JP6627625B2 (en) | Response support device, response support method, response support program, response evaluation device, response evaluation method, and response evaluation program | |
JP2021519122A (en) | Detection of subjects with respiratory disabilities | |
WO2024209534A1 (en) | Cochlear nerve feature amount extraction device, hearing ability estimation device, cochlear nerve feature amount extraction method, and program | |
JP7497023B2 (en) | Hearing test system and hearing test method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22967089; Country of ref document: EP; Kind code of ref document: A1 |