CN115547484A - Method and device for detecting Alzheimer's disease based on voice analysis - Google Patents


Info

Publication number
CN115547484A
CN115547484A
Authority
CN
China
Prior art keywords
information
voice
feature vector
features
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210791816.5A
Other languages
Chinese (zh)
Inventor
黄立
苏里
纪丽燕
周善斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN JINGXIANG TECHNOLOGY CO LTD
Original Assignee
SHENZHEN JINGXIANG TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN JINGXIANG TECHNOLOGY CO LTD filed Critical SHENZHEN JINGXIANG TECHNOLOGY CO LTD
Priority to CN202210791816.5A priority Critical patent/CN115547484A/en
Publication of CN115547484A publication Critical patent/CN115547484A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 - Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Pathology (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Computing Systems (AREA)
  • Educational Technology (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Developmental Disabilities (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Veterinary Medicine (AREA)
  • Databases & Information Systems (AREA)

Abstract

The application provides a detection method, a detection device, a training method, a training device, an electronic device, and a non-volatile computer-readable storage medium for detecting Alzheimer's disease based on voice analysis. The method comprises the following steps: acquiring voice information, the voice information comprising the voice of a user performing a preset description task; converting the voice information into text information; extracting a plurality of first features from the voice information to generate a first feature vector, and extracting a plurality of second features from the text information to generate a second feature vector; splicing the first feature vector and the second feature vector to generate a feature vector to be detected; and inputting the feature vector to be detected into a preset detection model to output a detection result. The method can be deployed on any electronic device with a microphone and can therefore be widely applied in various scenarios. Fusing text and voice allows more feature information to be extracted and improves the accuracy of the detection model.

Description

Method and device for detecting Alzheimer's disease based on voice analysis
Technical Field
The present application relates to the field of detection technologies, and in particular to a detection method and a training method for Alzheimer's disease based on voice analysis, a detection apparatus, a training apparatus, an electronic device, and a non-volatile computer-readable storage medium.
Background
Magnetic Resonance Imaging (MRI) scans require heavy medical equipment; because of this, identification of Alzheimer's Disease (AD) and Mild Cognitive Impairment (MCI) based on MRI cannot be widely applied in non-hospital settings.
Disclosure of Invention
The embodiments of the application provide a detection method, a training method, a detection device, a training device, an electronic device, and a non-volatile computer-readable storage medium for detecting Alzheimer's disease based on voice analysis.
The embodiment of the application provides a method for detecting Alzheimer's disease based on voice analysis. The detection method comprises the following steps: acquiring voice information, the voice information comprising the voice of a user performing a preset description task; converting the voice information into text information; extracting a plurality of first features of the voice information to generate a first feature vector, and extracting a plurality of second features of the text information to generate a second feature vector; splicing the first feature vector and the second feature vector to generate a feature vector to be detected; and inputting the feature vector to be detected into a preset detection model to output a detection result.
The embodiment of the application provides a training method. The training method comprises the following steps: acquiring a training sample, the training sample comprising a voice sample and a text sample converted from the voice sample, together with label information comprising a normal probability, a mild cognitive impairment probability, and an Alzheimer's disease probability; extracting a plurality of first features of the voice sample to generate a first feature vector, and extracting a plurality of second features of the text sample to generate a second feature vector; splicing the first feature vector and the second feature vector to generate a feature vector of the sample to be detected; and inputting the feature vector of the sample to be detected together with the label information into a preset detection model to train the detection model until it converges.
The embodiment of the application provides a detection device. The detection device comprises a first acquisition module, a conversion module, a first extraction module, a first splicing module, and a detection module. The first acquisition module is configured to acquire voice information, the voice information comprising the voice of a user performing a preset description task; the conversion module is configured to convert the voice information into text information; the first extraction module is configured to extract a plurality of first features of the voice information to generate a first feature vector and to extract a plurality of second features of the text information to generate a second feature vector; the first splicing module is configured to splice the first feature vector and the second feature vector to generate a feature vector to be detected; and the detection module is configured to input the feature vector to be detected into a preset detection model to output a detection result.
The embodiment of the application provides a training device. The training device comprises a second acquisition module, a second extraction module, a second splicing module, and a training module. The second acquisition module is configured to acquire a training sample, the training sample comprising a voice sample and a text sample converted from the voice sample, together with label information comprising a normal probability, a mild cognitive impairment probability, and an Alzheimer's disease probability; the second extraction module is configured to extract a plurality of first features of the voice sample to generate a first feature vector and to extract a plurality of second features of the text sample to generate a second feature vector; the second splicing module is configured to splice the first feature vector and the second feature vector to generate a feature vector of the sample to be detected; and the training module is configured to input the feature vector of the sample to be detected together with the label information into a preset detection model to train the detection model until it converges.
The embodiment of the application provides an electronic device. The electronic device comprises a processor configured to: acquire voice information, the voice information comprising the voice of a user performing a preset description task; convert the voice information into text information; extract a plurality of first features of the voice information to generate a first feature vector and a plurality of second features of the text information to generate a second feature vector; splice the first feature vector and the second feature vector to generate a feature vector to be detected; and input the feature vector to be detected into a preset detection model to output a detection result. Alternatively, the processor is configured to: acquire a training sample, the training sample comprising a voice sample and a text sample converted from the voice sample, together with label information comprising a normal probability, a mild cognitive impairment probability, and an Alzheimer's disease probability; extract a plurality of first features of the voice sample to generate a first feature vector and a plurality of second features of the text sample to generate a second feature vector; splice the first feature vector and the second feature vector to generate a feature vector of the sample to be detected; and input the feature vector of the sample to be detected together with the label information into a preset detection model to train the detection model until it converges.
The present embodiments provide a non-transitory computer-readable storage medium having a computer program stored thereon. The computer program, when executed by a processor, implements the detection method or the training method. The detection method comprises the following steps: acquiring voice information, the voice information comprising the voice of a user performing a preset description task; converting the voice information into text information; extracting a plurality of first features of the voice information to generate a first feature vector, and extracting a plurality of second features of the text information to generate a second feature vector; splicing the first feature vector and the second feature vector to generate a feature vector to be detected; and inputting the feature vector to be detected into a preset detection model to output a detection result. The training method comprises the following steps: acquiring a training sample, the training sample comprising a voice sample and a text sample converted from the voice sample, together with label information comprising a normal probability, a mild cognitive impairment probability, and an Alzheimer's disease probability; extracting a plurality of first features of the voice sample to generate a first feature vector, and extracting a plurality of second features of the text sample to generate a second feature vector; splicing the first feature vector and the second feature vector to generate a feature vector of the sample to be detected; and inputting the feature vector of the sample to be detected together with the label information into a preset detection model to train the detection model until it converges.
In the detection method, training method, detection device, training device, electronic device, and non-volatile computer-readable storage medium for detecting Alzheimer's disease based on voice analysis, voice information of a user performing a preset description task is acquired and converted into text information; features are extracted from the voice information and the text information respectively; and the spliced feature vector to be detected is finally input into the detection model to output a detection result, thereby estimating the probability that the user suffers from AD, MCI, and the like. The method can be deployed on any electronic device with a microphone and can therefore be widely applied in various scenarios. Moreover, since the first feature vector and the second feature vector are obtained from the voice information and the text information respectively and spliced into the feature vector to be detected, this fusion of text and voice allows more feature information to be extracted, alleviates the problem of incomplete data caused by objective factors during voice acquisition, and improves the accuracy of the detection model.
Additional aspects and advantages of embodiments of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a detection method according to certain embodiments of the present application;
FIG. 2 is a schematic illustration of a preset image of a detection method according to some embodiments of the present application;
FIG. 3 is a schematic illustration of the principle of the detection method of certain embodiments of the present application;
FIG. 4 is a schematic flow chart diagram of a training method according to some embodiments of the present application;
FIG. 5 is a block schematic diagram of a detection device according to certain embodiments of the present application;
FIG. 6 is a block diagram of a training device according to certain embodiments of the present application;
FIG. 7 is a schematic plan view of an electronic device of some embodiments of the present application; and
FIG. 8 is a schematic diagram of the interaction of a non-volatile computer readable storage medium and a processor of certain embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the embodiments of the present application, and are not to be construed as limiting the embodiments of the present application.
The terms appearing in the present application are explained first below:
Machine Learning (ML) is a multi-disciplinary field drawing on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Deep Learning (DL) is a branch of machine learning: a family of algorithms that attempt high-level abstraction of data using multiple processing layers composed of complex structures or multiple non-linear transformations. Deep learning learns the internal regularities and representation levels of training sample data, and the information obtained in this process greatly helps the interpretation of data such as text, images, and sound. Its ultimate goal is to enable machines to analyze and learn like humans and to recognize data such as text, images, and sounds. Deep learning is a complex machine learning approach that has achieved results in speech and image recognition far surpassing the prior related art.
Alzheimer's Disease (AD), commonly known as senile dementia, is a neurodegenerative disease with a slow onset that progressively worsens over time. According to statistics, by 2019 the number of Alzheimer's patients in China exceeded 10 million, the largest number of any country in the world. The disease progressively damages the patient's neurons and their neural connections, ultimately leading to death from the disease or its complications. The early stage of AD is Mild Cognitive Impairment (MCI), in which patients retain normal capacity for daily living but show progressive cognitive decline. From a therapeutic point of view, AD is irreversible and very difficult to treat, but treatment during the MCI stage can effectively delay the onset of dementia.
Automatic Speech Recognition (ASR) technology aims to let computers "listen to" continuous speech spoken by different people; it is also known as a "speech dictation machine" and realizes the conversion from "voice" to "text". Automatic speech recognition is also called Speech Recognition or Computer Speech Recognition.
Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained language representation model based on the Transformer architecture. Rather than pre-training with a traditional one-way language model or by shallowly splicing two one-way language models, it uses a masked language model to produce deep bidirectional language representations. The goal of the BERT model is to use large-scale unlabeled corpora to train a representation of text rich in semantic information, namely a semantic representation of the text. A minimal usage sketch is given below.
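As an illustration of how such a semantic representation can be obtained in practice, the following is a minimal sketch using the HuggingFace transformers library; the checkpoint name bert-base-chinese and the mean-pooling step are illustrative assumptions, since the patent does not specify which pre-trained weights or pooling scheme its migration model uses.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "bert-base-chinese" is an assumed checkpoint for Chinese text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

text = "今天天气很好"  # example sentence
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool the token embeddings into one fixed-length semantic vector.
sentence_vector = outputs.last_hidden_state.mean(dim=1)  # shape [1, 768]
```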
Convolutional Neural Networks (CNN) are a class of feedforward neural networks that contain convolution computations and have a deep structure; they are among the representative algorithms of deep learning. Convolutional neural networks have representation learning ability and can perform shift-invariant classification of input information according to their hierarchical structure, for which reason they are also called "shift-invariant artificial neural networks".
Referring to fig. 1, a method for detecting Alzheimer's disease based on speech analysis according to an embodiment of the present application includes:
step 011: and acquiring voice information, wherein the voice information comprises voice of a user executing a preset description task.
Specifically, the impaired cognitive function caused by AD affects the ability to express language, which in turn is reflected in the process and content of language expression. Therefore, by collecting the voice of the user performing the preset description task (for example, through a microphone of the electronic device) to obtain voice information, algorithmic recognition based on voice analysis can evaluate the degree of the user's cognitive impairment by examining the user's language expression ability, and thereby detect AD and MCI from the voice information.
The preset description task may include describing the content of a preset image and naming different target objects of a target type within a preset time period. It can be understood that cognitive impairment affects the accuracy with which the preset image is described: the more severe the impairment, the poorer the description of the image content. As shown in fig. 2, when the description task is executed, the preset image may be displayed on the display screen of the electronic device, and the accuracy with which a normal user and a cognitively impaired user describe it differs markedly. Similarly, cognitive impairment also affects fluency of description; for example, naming different target objects of a target type within a preset duration may mean speaking as many animal names as possible within that duration (e.g., 30 seconds, 1 minute, or 2 minutes). Collecting the voice information of the user performing the preset description task therefore makes it possible to detect the degree of cognitive impairment and hence the probability that the user suffers from AD or MCI.
Please refer to fig. 3, step 012: the voice information is converted into text information.
In principle, during gradient-descent training, the model automatically adjusts its parameters to express the influence of different features on the detection result output by the detection model; more effective information therefore improves the generalization ability of the detection model and reduces misjudgment. Converting the voice information into text information means that both modalities contain the speech content produced while the user performs the preset description task, so more effective information can be obtained from different angles.
The voice information is converted into text information through ASR. When ASR performs the conversion, not only the character corresponding to each utterance but also the tone information of each character can be obtained, so richer and more accurate text information can be produced. A minimal illustrative sketch of this step follows.
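The following is a minimal sketch of the speech-to-text step, using the open-source SpeechRecognition package as a stand-in ASR engine; the patent does not name a specific engine, the file name is hypothetical, and per-character tone information would require an ASR that actually exposes it.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
# "description_task.wav" is a hypothetical recording of the user
# performing the preset description task.
with sr.AudioFile("description_task.wav") as source:
    audio = recognizer.record(source)
# Google's free Web Speech API, used here purely for illustration.
text = recognizer.recognize_google(audio, language="zh-CN")
print(text)
```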
Step 013: extracting a plurality of first features of the voice information to generate a first feature vector, and extracting a plurality of second features of the text information to generate a second feature vector.
Specifically, after the voice information and text information produced by the user performing the description task have been acquired, feature extraction can be performed: a plurality of first features are extracted from the voice information to generate a first feature vector, and a plurality of second features are extracted from the text information to generate a second feature vector.
Features are extracted from the voice information through a convolutional neural network, which collects speech features such as pause information and speech continuity information to obtain the plurality of first features. The collected speech features are then converted, through a pooling layer, a ReLU activation function, and a fully connected layer with a Sigmoid activation function, into a first feature vector of dimension [n x 100], where n is a preset value that can be determined from the longest individual first feature, ensuring that the feature vector formed from all first features can contain all the information of every feature. A minimal sketch of such a branch is given below.
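The following PyTorch sketch illustrates the speech-feature branch just described. The layer sizes, kernel size, and input shape are illustrative assumptions; the patent specifies only the layer types (convolution, pooling, ReLU, fully connected layer with Sigmoid) and the [n x 100] output shape.

```python
import torch
import torch.nn as nn

class SpeechFeatureExtractor(nn.Module):
    """Maps n frame-level speech feature sequences to an [n x 100] matrix."""
    def __init__(self, in_channels=1, out_dim=100):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, 32, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveAvgPool1d(64)   # pooling layer
        self.relu = nn.ReLU()                  # ReLU activation
        self.fc = nn.Linear(32 * 64, out_dim)  # fully connected layer
        self.sigmoid = nn.Sigmoid()            # Sigmoid activation

    def forward(self, x):                      # x: [n, in_channels, T]
        h = self.relu(self.pool(self.conv(x)))
        h = h.flatten(start_dim=1)             # [n, 32 * 64]
        return self.sigmoid(self.fc(h))        # [n, 100]

frames = torch.randn(8, 1, 400)  # n = 8 first features, 400 frames each
first_feature_vector = SpeechFeatureExtractor()(frames)  # [8, 100]
```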
When recognizing the pause information and continuity information of the voice in the voice information, the pause durations between different sentences are first recognized to determine the pause information; for example, the pause information may comprise the number of pauses in different duration bands. It can be understood that the more numerous and longer the pauses, the more severely the user's cognitive function is impaired, and vice versa. The continuity information of the voice is then determined from the variance of the pause durations: a smaller variance indicates more consistent pausing and better speech continuity, while a larger variance indicates less consistent pausing and poorer continuity. In this way, by extracting a plurality of speech features related to the degree of cognitive impairment, the accuracy of AD and MCI detection can be improved. A minimal sketch of these computations is given below.
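The following sketch computes the pause and continuity features, assuming sentence boundaries (start and end times in seconds) have already been obtained from the ASR output; the duration bands are illustrative assumptions.

```python
import numpy as np

def pause_features(sentence_times):
    """sentence_times: list of (start, end) tuples in temporal order."""
    # A pause is the gap between the end of one sentence and the
    # start of the next.
    pauses = np.array([s2 - e1 for (_, e1), (s2, _) in
                       zip(sentence_times, sentence_times[1:])])
    # Pause information: counts of pauses in assumed duration bands.
    pause_counts = [int(np.sum(pauses < 0.5)),
                    int(np.sum((pauses >= 0.5) & (pauses < 2.0))),
                    int(np.sum(pauses >= 2.0))]
    # Continuity information: variance of pause durations; a smaller
    # variance means more consistent pausing, i.e. better continuity.
    continuity = float(np.var(pauses)) if len(pauses) else 0.0
    return pause_counts, continuity

counts, continuity = pause_features([(0.0, 2.1), (3.0, 5.4), (9.0, 10.2)])
```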
Feature extraction from the text information can be carried out through a migration (transfer-learning) model built on the BERT model, extracting text features such as part-of-speech information, repeated vocabulary information, and nonsense vocabulary information to obtain the plurality of second features. After feature processing, the second features are likewise converted into a second feature vector of dimension [m x 100], where m is a preset value that can be determined from the longest individual second feature, ensuring that the feature vector formed from all second features can contain all the information of every feature.
When recognizing part-of-speech information, repeated vocabulary information, and nonsense vocabulary information in the text information as the plurality of second features, the number of words of each part of speech in the text can be recognized as the part-of-speech information; it can be understood that patients with more severe cognitive impairment are likely to use words of different parts of speech differently, so extracting part-of-speech information can improve detection accuracy. Similarly, the number of repetitions of each word in the text can be recognized as the repeated vocabulary information; the more severe the impairment, the more often a patient is likely to repeat words. Finally, the number of nonsense words in the text can be recognized as the nonsense vocabulary information; the text may contain filler words such as "uh" or "okay", and patients with more severe cognitive impairment tend to produce more of them. In this way, by extracting a plurality of text features related to the degree of cognitive impairment, the accuracy of AD and MCI detection can be improved. A sketch of these three features follows.
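The following sketch computes the three text features, assuming the jieba library for Chinese word segmentation and part-of-speech tagging; the filler-word list is a hypothetical example, and the patent does not name a tagger.

```python
from collections import Counter
import jieba.posseg as pseg

FILLERS = {"嗯", "啊", "呃", "那个"}  # hypothetical nonsense-word list

def text_features(text):
    words, pos_tags = [], []
    for token in pseg.cut(text):       # segment and tag parts of speech
        words.append(token.word)
        pos_tags.append(token.flag)
    pos_info = Counter(pos_tags)       # word counts per part of speech
    repeat_info = {w: c for w, c in Counter(words).items() if c > 1}
    nonsense_count = sum(1 for w in words if w in FILLERS)
    return pos_info, repeat_info, nonsense_count
```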
It can be understood that the ASR, the CNN, the migration model built on the BERT model, and the detection model may all be deployed on the electronic device, so that a single electronic device can perform AD and MCI detection on its own.
Step 014: splicing the first feature vector and the second feature vector to generate a feature vector to be detected.
Specifically, in the present application the temporal order of the features has essentially no effect on detection accuracy, so when splicing the first feature vector and the second feature vector they can be directly concatenated into one complete matrix; for example, splicing the [n x 100] first feature vector and the [m x 100] second feature vector yields the [(m + n) x 100] feature vector to be detected, which is used for subsequent detection, as sketched below.
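For example, the splicing step amounts to the following row-wise concatenation (a sketch with assumed values n = 8 and m = 5):

```python
import torch

first_vec = torch.rand(8, 100)    # [n x 100] speech-branch output
second_vec = torch.rand(5, 100)   # [m x 100] text-branch output
to_detect = torch.cat([first_vec, second_vec], dim=0)  # [(m + n) x 100]
print(to_detect.shape)            # torch.Size([13, 100])
```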
Step 015: inputting the feature vector to be detected into a preset detection model to output a detection result.
Specifically, after the feature vector to be detected has been obtained, it can be input into the preset detection model, which outputs the AD and MCI detection result; for example, the detection result includes a normal probability, a mild cognitive impairment probability, and an Alzheimer's disease probability. The detection model is a classification model trained in advance; it processes the feature vector to be detected through a fully connected network to output the three probabilities, along the lines of the sketch below.
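A minimal sketch of such a classification head follows; the hidden width and the use of softmax to normalize the three outputs into probabilities are illustrative assumptions, since the patent specifies only a fully connected network with three probability outputs.

```python
import torch
import torch.nn as nn

class DetectionModel(nn.Module):
    def __init__(self, m_plus_n=13, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                      # [(m + n) x 100] -> flat
            nn.Linear(m_plus_n * 100, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),              # normal, MCI, AD
        )

    def forward(self, x):                      # x: [batch, m + n, 100]
        return torch.softmax(self.net(x), dim=-1)

probs = DetectionModel()(torch.rand(1, 13, 100))  # e.g. [[0.6, 0.3, 0.1]]
```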
According to the detection method, voice information of the user performing the preset description task is acquired and converted into text information; features related to AD and MCI are extracted from both; and the spliced feature vector to be detected is input into the detection model to output a detection result, thereby estimating the probability that the user suffers from AD, MCI, and the like. The detection method can be deployed on any electronic device with a microphone and can therefore be widely applied in various scenarios. Moreover, since the first feature vector and the second feature vector are obtained from the voice information and the text information respectively and spliced into the feature vector to be detected, this fusion of text and voice allows more feature information to be extracted, alleviates the problem of incomplete data caused by objective factors during voice acquisition, and improves the accuracy of the detection model.
Referring to fig. 4, the present application further provides a training method, including:
step 021: the method comprises the steps of obtaining a training sample, wherein the training sample comprises a voice sample and a text sample converted from the voice sample, the training sample comprises label information, and the label information comprises normal probability, mild cognitive impairment probability and Alzheimer disease probability.
To train the detection model, a large number of training samples must be obtained in advance. The training samples include voice samples from three types of subjects (AD, MCI, and normal) and text samples obtained by converting the voice samples through ASR, and each training sample carries label information indicating the true normal probability, mild cognitive impairment probability, and Alzheimer's disease probability of the corresponding subject.
Step 022: extracting a plurality of first features of the voice sample to generate a first feature vector, and extracting a plurality of second features of the text sample to generate a second feature vector.
Step 023: splicing the first feature vector and the second feature vector to generate a feature vector of the sample to be detected;
For detailed descriptions of steps 022 and 023, please refer to steps 013 and 014 respectively; the feature extraction and feature vector generation schemes are essentially the same and are not repeated here.
Step 024: inputting the feature vector of the sample to be detected and the label information into a preset detection model to train the detection model until it converges.
After the feature vector of the sample to be detected is obtained, it can be input into the preset detection model, which outputs an initial detection result. A loss value is then computed from the detection result and the label information corresponding to the sample feature vector; for example, the normal probability, mild cognitive impairment probability, and Alzheimer's disease probability in the detection result are each differenced against the corresponding probabilities in the label information to obtain three difference values from which the loss value is determined. The parameters of the detection model are adjusted according to the loss value so that the loss between the model's output and the corresponding label information gradually decreases; once the loss falls below a preset threshold, the detection model can be considered converged.
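A minimal training-loop sketch under these assumptions follows; the stand-in model, the mean absolute difference used as the loss (one plausible reading of the "three difference values" above), and the threshold value are all illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in detection model
    nn.Flatten(), nn.Linear(13 * 100, 256), nn.ReLU(),
    nn.Linear(256, 3), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
threshold = 0.05                            # preset convergence threshold

sample = torch.rand(1, 13, 100)             # spliced sample feature vector
label = torch.tensor([[0.1, 0.7, 0.2]])     # normal / MCI / AD labels

loss = torch.tensor(1.0)
while loss.item() > threshold:              # train until loss falls below it
    optimizer.zero_grad()
    pred = model(sample)
    # Three per-class differences between prediction and label, averaged.
    loss = (pred - label).abs().mean()
    loss.backward()
    optimizer.step()
```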
In this way, training on sample feature vectors that fuse speech features and text features improves the accuracy of the detection model once it has been trained to convergence.
In order to better implement the detection method of the embodiment of the present application, the embodiment of the present application further provides a detection apparatus 10. Referring to fig. 5, the detecting device 10 may include:
a first acquisition module 11, configured to acquire voice information, where the voice information includes the voice of a user performing a preset description task;
a conversion module 12, configured to convert the voice information into text information;
a first extraction module 13, configured to extract a plurality of first features of the voice information to generate a first feature vector, and extract a plurality of second features of the text information to generate a second feature vector;
the first extraction module 13 is specifically configured to:
the method comprises the steps of extracting a plurality of first features of voice information based on a preset migration model to generate a first feature vector, and extracting a plurality of second features of text information based on a preset convolution neural network model to generate a second feature vector.
The first extraction module 13 is further specifically configured to:
extracting pause information and continuity information of voice in the voice information as a plurality of first features;
part-of-speech information, repeated vocabulary information, and nonsense vocabulary information in the text information are extracted as the plurality of second features.
The first extraction module 13 is specifically further configured to:
recognizing pause duration between different sentences in the voice information to determine pause information; and
determining continuity information of the voice according to the variance of the plurality of pause durations;
recognizing the number of vocabularies with different parts of speech in the text information as part of speech information;
recognizing the repeated times of different vocabularies in the text information to serve as repeated vocabulary information; and
the number of nonsense words in the text information is recognized as nonsense word information.
The first splicing module 14 is configured to splice the first feature vector and the second feature vector to generate a feature vector to be detected;
and the detection module 15 is configured to input the feature vector to be detected into a preset detection model to output a detection result.
The detection module 15 is further configured to process the feature vector to be detected through a fully connected network in the detection model to output a normal probability, a mild cognitive impairment probability, and an Alzheimer's disease probability.
In order to better implement the training method of the embodiment of the present application, the embodiment of the present application further provides a training device 20. Referring to fig. 6, the training device 20 may include:
the second obtaining module 21 is configured to obtain a training sample, where the training sample includes a voice sample and a text sample converted from the voice sample, the training sample includes label information, and the label information includes a normal probability, a mild cognitive impairment probability, and an alzheimer disease probability;
a second extraction module 22, configured to extract a plurality of first features of the voice sample to generate a first feature vector, and extract a plurality of second features of the text sample to generate a second feature vector;
the second splicing module 23 is configured to splice the first feature vector and the second feature vector to generate a feature vector of the sample to be detected;
the training module 24 is configured to input the characteristics and the label information of the sample to be detected to a preset detection model, so as to train the detection model to converge.
The modules in the detection device 10 and the training device 20 may be implemented in whole or in part by software, hardware, and combinations thereof. The modules may be embedded in hardware or independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor calls and executes operations corresponding to the modules.
Referring to fig. 7, an electronic device 100 according to an embodiment of the present disclosure includes a processor 30. The processor 30 is configured to perform the detection method or the training method according to any of the above embodiments, and for brevity, the description is omitted here.
Among other things, the electronic device 100 may be a mobile phone, a smart phone, a Personal Digital Assistant (PDA), a tablet computer, a video game device, a portable terminal (e.g., a notebook computer), or a larger-sized device (e.g., a desktop computer or a television).
Referring to fig. 8, the present embodiment further provides a computer-readable storage medium 300 on which a computer program 310 is stored; when the computer program 310 is executed by the processor 30, the steps of the detection method or the training method of any of the above embodiments are implemented, which, for brevity, are not repeated here.
It will be appreciated that the computer program 310 comprises computer program code. The computer program code may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), a software distribution medium, and the like.
In the description of the present specification, reference to "one embodiment", "some embodiments", "illustrative embodiments", "examples", "specific examples", "some examples", or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples, and the features thereof, described in this specification, provided they do not contradict one another.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application and that variations, modifications, substitutions and alterations in the above embodiments may be made by those of ordinary skill in the art within the scope of the present application.

Claims (11)

1. A method for detecting Alzheimer's disease based on voice analysis is characterized by comprising the following steps:
acquiring voice information, wherein the voice information comprises voice of a user executing a preset description task;
converting the voice information into text information;
extracting a plurality of first features of the voice information to generate a first feature vector and extracting a plurality of second features of the text information to generate a second feature vector;
splicing the first feature vector and the second feature vector to generate a feature vector to be detected; and
and inputting the feature vector to be detected into a preset detection model to output a detection result.
2. The method of claim 1, wherein the extracting a plurality of first features of the speech information to generate a first feature vector and extracting a plurality of second features of the text information to generate a second feature vector comprises:
extracting a plurality of first features of the voice information based on a preset convolutional neural network model to generate the first feature vector, and extracting a plurality of second features of the text information based on a preset migration model to generate the second feature vector.
3. The detection method according to claim 1, wherein the extracting a plurality of first features of the speech information comprises:
extracting pause information and continuity information of voice in the voice information as a plurality of first features;
the extracting a plurality of second features of the text information comprises:
and extracting part-of-speech information, repeated vocabulary information and nonsense vocabulary information in the text information as a plurality of second features.
4. The detection method according to claim 3, wherein the extracting of pause information and continuity information of the voice in the voice information as the plurality of first features comprises:
recognizing pause duration between different sentences in the voice information to determine pause information; and
determining continuity information of the voice according to the variance of the pause durations;
the extracting, as a plurality of the second features, part-of-speech information, repeated vocabulary information, and nonsense vocabulary information in the text information includes:
recognizing the number of vocabularies with different parts of speech in the text information as the part of speech information;
recognizing the number of repetitions of different words in the text information as the repeated vocabulary information; and
recognizing the number of nonsense words in the text information as the nonsense vocabulary information.
5. The detection method according to claim 1, wherein the preset description task includes describing the content of a preset image and names of different target objects describing the target type within a preset time period.
6. The detection method according to claim 1, wherein the detection result includes a normal probability, a mild cognitive impairment probability and an Alzheimer's disease probability, and the inputting the feature vector to be detected into a preset detection model to output the detection result includes:
and processing the feature vector to be detected through a fully connected network in the detection model to output the normal probability, the mild cognitive impairment probability, and the Alzheimer's disease probability.
7. A method of training, comprising:
acquiring a training sample, wherein the training sample comprises a voice sample and a text sample converted from the voice sample, the training sample comprises label information, and the label information comprises a normal probability, a mild cognitive impairment probability, and an Alzheimer's disease probability;
extracting a plurality of first features of the voice sample to generate a first feature vector and extracting a plurality of second features of the text sample to generate a second feature vector;
splicing the first feature vector and the second feature vector to generate a feature vector of a sample to be detected;
inputting the feature vector of the sample to be detected and the label information into a preset detection model, so as to train the detection model until it converges.
8. A detection device, comprising:
the first acquisition module is used for acquiring voice information, wherein the voice information comprises voice of a user executing a preset description task;
the conversion module is used for converting the voice information into text information;
the first extraction module is used for extracting a plurality of first features of the voice information to generate a first feature vector and extracting a plurality of second features of the text information to generate a second feature vector;
the first splicing module is used for splicing the first feature vector and the second feature vector to generate a feature vector to be detected; and
and the detection module is used for inputting the feature vector to be detected into a preset detection model so as to output a detection result.
9. A training device, comprising:
The second acquisition module is used for acquiring a training sample, wherein the training sample comprises a voice sample and a text sample converted from the voice sample, the training sample comprises label information, and the label information comprises a normal probability, a mild cognitive impairment probability, and an Alzheimer's disease probability;
the second extraction module is used for extracting a plurality of first features of the voice sample to generate a first feature vector and extracting a plurality of second features of the text sample to generate a second feature vector;
the second splicing module is used for splicing the first feature vector and the second feature vector to generate a feature vector of a sample to be detected;
and the training module is used for inputting the feature vector of the sample to be detected and the label information into a preset detection model, so as to train the detection model until it converges.
10. An electronic device, comprising a processor configured to perform the detection method of any one of claims 1-6; or to perform the training method of claim 7.
11. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, implements the detection method of any one of claims 1-6 or the training method of claim 7.
CN202210791816.5A 2022-07-05 2022-07-05 Method and device for detecting Alzheimer's disease based on voice analysis Pending CN115547484A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210791816.5A CN115547484A (en) 2022-07-05 2022-07-05 Method and device for detecting Alzheimer's disease based on voice analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210791816.5A CN115547484A (en) 2022-07-05 2022-07-05 Method and device for detecting Alzheimer's disease based on voice analysis

Publications (1)

Publication Number Publication Date
CN115547484A true CN115547484A (en) 2022-12-30

Family

ID=84723776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210791816.5A Pending CN115547484A (en) 2022-07-05 2022-07-05 Method and device for detecting Alzheimer's disease based on voice analysis

Country Status (1)

Country Link
CN (1) CN115547484A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493968A (en) * 2018-11-27 2019-03-19 科大讯飞股份有限公司 A kind of cognition appraisal procedure and device
CN112908317A (en) * 2019-12-04 2021-06-04 中国科学院深圳先进技术研究院 Voice recognition system for cognitive impairment
US20220108714A1 (en) * 2020-10-02 2022-04-07 Winterlight Labs Inc. System and method for alzheimer's disease detection from speech
CN113935330A (en) * 2021-10-22 2022-01-14 平安科技(深圳)有限公司 Voice-based disease early warning method, device, equipment and storage medium
CN114512236A (en) * 2022-04-18 2022-05-17 山东师范大学 Intelligent auxiliary diagnosis system for Alzheimer's disease

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
顾曰国 (Gu Yueguo) et al.: 《老年语言学与多模态研究》 (Gerontolinguistics and Multimodal Research), Shanghai: Tongji University Press, pages 216-220 *

Similar Documents

Publication Publication Date Title
CN110728997B (en) Multi-modal depression detection system based on context awareness
Zhang et al. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching
CN106782603B (en) Intelligent voice evaluation method and system
WO2024000867A1 (en) Emotion recognition method and apparatus, device, and storage medium
CN112329438B (en) Automatic lie detection method and system based on domain countermeasure training
CN110147548A (en) The emotion identification method initialized based on bidirectional valve controlled cycling element network and new network
CN115329779A (en) Multi-person conversation emotion recognition method
CN113592251B (en) Multi-mode integrated teaching state analysis system
CN113851131A (en) Cross-modal lip language identification method
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN116522212B (en) Lie detection method, device, equipment and medium based on image text fusion
CN115424108B (en) Cognitive dysfunction evaluation method based on audio-visual fusion perception
CN116244474A (en) Learner learning state acquisition method based on multi-mode emotion feature fusion
CN115346657B (en) Training method and device for improving identification effect of senile dementia by utilizing transfer learning
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
CN115547484A (en) Method and device for detecting Alzheimer's disease based on voice analysis
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN112700796B (en) Voice emotion recognition method based on interactive attention model
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
Bai Pronunciation Tutor for Deaf Children based on ASR
Roy A computational model of word learning from multimodal sensory input
CN115186083B (en) Data processing method, device, server, storage medium and product
KR102564570B1 (en) System and method for analyzing multimodal emotion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221230