CN113823267B - Automatic depression recognition method and device based on voice recognition and machine learning - Google Patents

Automatic depression recognition method and device based on voice recognition and machine learning

Info

Publication number
CN113823267B
CN113823267B (application CN202110986901.2A)
Authority
CN
China
Prior art keywords
features
term
feature
frequency
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110986901.2A
Other languages
Chinese (zh)
Other versions
CN113823267A (en)
Inventor
张莉
辛逸男
刘欣阳
刘志宽
邓冉琪
吴鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities filed Critical South Central University for Nationalities
Priority to CN202110986901.2A priority Critical patent/CN113823267B/en
Publication of CN113823267A publication Critical patent/CN113823267A/en
Application granted granted Critical
Publication of CN113823267B publication Critical patent/CN113823267B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis characterised by the analysis technique
    • G10L 25/51 Speech or voice analysis specially adapted for comparison or discrimination
    • G10L 25/66 Speech or voice analysis for extracting parameters related to health condition
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/24 Classification techniques
    • G06F 18/24323 Tree-organised classifiers
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion of classification results, e.g. of results related to same input data
    • G06F 18/259 Fusion by voting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses an automatic depression recognition method and device based on voice recognition and machine learning, comprising the following steps: S1, acquiring voice data of a patient; S2, selecting features of the voice data and recombining the selected features to generate long-term features; and S3, identifying the degree of depression from the long-term features with a random forest algorithm. The technical scheme of the invention effectively addresses the difficulty of detecting depression patients at an early stage and lowers the barrier to diagnosis for depression patients.

Description

Automatic depression recognition method and device based on voice recognition and machine learning
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to an automatic depression recognition method and device based on voice recognition and machine learning.
Background
As of 2014, the prevalence of depression in China was 2.1%, and by 2017 there were 5.81 million registered patients with severe mental disorders nationwide; the disease causes serious harm to patients, families, and society. A notice and work plan on exploring special services for depression prevention and treatment, published in September 2020, indicates that public awareness, diagnosis, and treatment rates for depression in China remain low: only about one tenth of all depression patients receive treatment, diagnosis and treatment still depend on doctors in psychiatric hospitals, and China is expanding the training of doctors in non-psychiatric hospitals. Automation is therefore particularly important for the early diagnosis of depression.
Disclosure of Invention
The technical problem the invention aims to solve is to provide an automatic depression recognition method and device based on voice recognition and machine learning, which record the voice data of ordinary people through designed questions and answers and then use a machine learning algorithm to recognize and classify the voice data, effectively addressing the difficulty of detecting depression patients at an early stage and lowering the barrier to diagnosis for depression patients.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an automatic depression recognition method based on voice recognition and machine learning comprises the following steps:
s1, acquiring voice data of a patient;
s2, selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
and step S3, identifying the degree of depression from the long-term features according to a random forest algorithm.
Preferably, step S2 includes:
step 2.1, carrying out feature extraction on the voice data by framing and windowing;
step 2.2, selecting the extracted features according to the decision tree;
and 2.3, recombining the selected characteristics to generate long-term characteristics.
Preferably, the features extracted in step 2.1 are time domain features and frequency domain features; the time domain features include short-time energy, zero-crossing rate, and energy entropy, and the frequency domain features include spectral entropy, fundamental frequency, and centroid.
Preferably, in step 2.3, the short-term features are discretized: thresholds are set at the lower and upper tertile points of each feature's values, dividing each feature into three discrete values (low, median, high); the discretized features are then combined by co-occurrence, and after combination the frequency with which the combined features occur within one frame of the speech signal is counted to generate the long-term features.
Preferably, in step S3, each feature value in the long-term features represents the frequency with which several specific discrete feature values of the speech data co-occur within one frame; when classifying by feature value, classification is therefore performed according to the frequency of co-occurrence of the discrete features of the speech data within one frame.
The invention also provides an automatic depression recognition device based on voice recognition and machine learning, which comprises:
the acquisition module is used for acquiring voice data of a patient;
the combination module is used for selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
and the identification module is used for identifying the degree of depression from the long-term features according to a random forest algorithm.
Preferably, the combination module includes:
the extraction unit is used for carrying out feature extraction on the voice data by framing and windowing;
the selection unit is used for selecting the extracted characteristics according to the decision tree;
and the combination unit is used for recombining the selected characteristics to generate long-term characteristics.
Preferably, the extracted features are time domain features and frequency domain features, the time domain features comprising: short-time energy, zero crossing rate and energy entropy, and frequency domain features comprise: spectral entropy, fundamental frequency, and centroid.
Preferably, the combining unit includes:
the discretization component is used for discretizing the short-term features, setting thresholds at the lower and upper tertile points of each feature's values, and dividing each feature into three discrete values: low, median, and high;
the combination component is used for combining the features after discretization in a co-occurrence manner;
and the generation component is used for generating the long-term features by counting the frequency with which the combined features occur within one frame of the voice signal.
Preferably, each feature value of the long-term features in the recognition module represents the frequency with which several specific discrete feature values of the speech data co-occur within one frame; when classifying by feature value, classification is performed according to the frequency of co-occurrence of the discrete features of the speech data within one frame.
According to the invention, by collecting voice signals, selecting their features, recombining them into new long-term features, and identifying the degree of depression with the random forest algorithm from machine learning, people can be helped to detect and diagnose depression at an early stage in a simpler way.
Drawings
FIG. 1 is a flow chart of an automatic depression recognition method based on voice recognition and machine learning technology of the invention;
FIG. 2 is a schematic diagram of voice data acquisition and recording;
FIG. 3 is a schematic diagram of threshold points;
FIG. 4 is a schematic diagram of a combined long-term feature machine learning classification;
fig. 5 is a schematic structural diagram of an automatic depression recognition device based on voice recognition and machine learning technology.
Detailed Description
The invention will be described in further detail below with reference to the drawings by means of specific embodiments.
As shown in fig. 1, the present invention provides an automatic depression recognition method based on voice recognition and machine learning technology, comprising the following steps:
s1, acquiring voice data of a patient;
s2, selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
and S2, identifying the depression degree of the long-term features according to a random forest algorithm.
Further, in step S1, voice data is collected by a smart device such as a mobile phone or a wristband, as shown in fig. 2. To build the voice dataset, multiple subjects may be invited to an interview and recorded; the interviewer talks with each subject through fixed, pre-designed questions covering family, work, mood, self-assessment of depression, and other topics. The interview is recorded with an ordinary smartphone with recording capability, in a quiet environment and at a microphone sampling frequency of 44.1 kHz; the voice data is saved in wav format with an average duration of 20 minutes, and the start and stop times of each utterance in the interview are recorded in a text document. Before the interview, all subjects fill in a psychiatric questionnaire (PHQ-8) with scores from 0 to 30; subjects scoring 15 or above are considered depressed and those scoring below 15 non-depressed.
Further, the features of the voice data in step S2 include time domain features and frequency domain features: the time domain features extracted by the invention are short-time energy, zero-crossing rate, and energy entropy, and the frequency domain features are spectral entropy, fundamental frequency, and centroid. The different features capture different physical properties of speech. Short-time energy reflects whether speech is present: in silent segments the energy drops sharply, approaching 0. The zero-crossing rate is the number of times the signal changes sign at zero and is generally used to distinguish noise segments from voiced segments. The energy entropy measures the richness of the information contained in the voice data; different speech segments carry different content, and energy entropy quantifies that richness. Spectral entropy distinguishes unvoiced from voiced sounds and reflects the breathiness of the speaker's voice; it is the frequency-domain counterpart of the information-richness measure. The fundamental frequency is the lowest frequency among the sine waves composing the speech signal and reflects the pitch of the voice. The centroid represents the center of the distribution of the speech waveform. These features of the speech signal can be diagnostic of the degree of depression.
step S2 comprises the steps of:
step 2.1, extracting the characteristics of the voice data
First, according to the utterance start and stop times in the recorded file, the interviewer's speech is removed from the voice data and the subject's speech segments are spliced together. A speech signal is non-stationary overall but approximately stationary within 10-50 ms, so the voice data is framed and windowed: the speech signal is first divided into segments, and features are extracted through a sliding Hamming window whose hop is smaller than the frame length, preventing the information loss caused by the Hamming window's side lobes. The Hamming window is calculated as follows:
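The standard Hamming window of length N samples, consistent with the description above (supplied here; the source does not reproduce the formula), is:

$$ w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 $$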
short time energy:
the short-time energy is an average value of the energy of the voice data, the voice signal of each sampling point is set as A (n), the short-time energy E (energy) is defined as follows, and k is the sampling frequency of the window.
Zero-crossing rate:
the zero crossing rate refers to the number of times the energy of the voice signal passes through the zero crossing point, and is abbreviated as ZCR, and the ZCR rate is calculated as shown in the following formula.
The sgn function in the above formula is the sign function, defined as follows.
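In the usual convention:

$$ \operatorname{sgn}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases} $$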
Energy entropy:
the energy entropy represents an information uncertainty measure of the energy of the voice data in the time domain, and is calculated as shown in the following formula.
$$ EE = -\sum_{i} p_i \log(p_i) $$
where p_i is the ratio of the energy of a given value to the total energy of the speech segment.
Fundamental frequency:
the speech signal may be considered to consist of sine waves of different frequencies, the lowest frequency sine wave being the fundamental tone, the fundamental frequency representing the frequency of the fundamental tone.
Centroid:
the centroid may reflect the instability of the speech signal, and is calculated as shown in the following equation, assuming a centroid (spectral centroid, SC).
where f(n) is the signal frequency and E(n) is the spectral energy at the corresponding frequency of the voice signal A(n) after the short-time Fourier transform.
Spectral entropy:
the spectral entropy is calculated in the frequency domain, unlike the energy entropy, and the spectral entropy is calculated after the voice signal is subjected to short-time fourier transform, and is calculated as shown in the following formula.
$$ SE = -\sum_{i} p_i \log(p_i) $$
where p_i is the ratio of the value at a given sample point to the sum over all sample points.
Step 2.2, selecting the extracted features
Because combining features can produce an excessive number of combined features, the extracted features are first selected according to a decision tree, and the four most important features are chosen for study; the method is not limited to these four features.
Step 2.3, recombining the selected features
Short-term features are extracted by framing and windowing and correspond to 10-50 ms speech segments. The short-term features of a single window carry too little information and vary widely; long-term features, whose values are the averages of all windows in a frame of roughly 2-4 s, in turn fail to preserve the information in the short-term features, and neither classifies well on its own. The invention therefore adopts a feature combination method. The short-term features are first discretized: thresholds are set at the lower and upper tertile points of each feature's values, dividing each feature into three values (low, median, high), and the discretized features are then combined. Specifically, the four selected features are each discretized into low, median, and high values using their lower and upper tertile points as thresholds; choosing tertile points as thresholds also reduces the influence of outliers to some extent. The thresholds of each feature are shown in Table 1, where the two values in the threshold column are the lower and upper tertile points respectively, and the feature values '0', '1', and '2' correspond to low, median, and high. A discretization schematic is shown in fig. 3.
TABLE 1
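As a concrete illustration, the tertile discretization can be sketched in Python as follows (a minimal sketch with stand-in data; in the patented method the thresholds are the precomputed per-feature values of Table 1):

```python
import numpy as np

def discretize_tertiles(values):
    """Map short-term feature values to 0 (low), 1 (median), 2 (high)
    using the lower and upper tertile points as thresholds."""
    lower, upper = np.percentile(values, [100 / 3, 200 / 3])
    return np.digitize(values, [lower, upper])  # 0 below lower, 2 at/above upper

# Hypothetical per-window zero-crossing-rate values (stand-in data)
zcr_windows = np.random.rand(500)
zcr_discrete = discretize_tertiles(zcr_windows)  # array of 0/1/2 labels
```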
The feature combination method combines the discrete values of any two short-term features into new short-term features by co-occurrence. Taking energy entropy and zero-crossing rate as an example, after discretization the energy entropy EE is divided into three features EE(0), EE(1), and EE(2), representing its low, median, and high values respectively; after combination, energy entropy and zero-crossing rate generate a 9-dimensional feature vector, as shown in Table 2. The method is not limited to pairwise combination; three or four short-term features can also be combined to generate new feature vectors.
TABLE 2
After feature combination, the generated feature vector is still a short-term feature. To overcome the shortcomings of both short-term and long-term features, the invention counts the frequency with which the combined features occur within one frame of the speech signal, generating long-term features with richer information, calculated as shown in the following formula.
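The formula is not reproduced in this text; one plausible reconstruction, consistent with the symbol definitions that follow and offered only as an assumption, is an indicator of co-occurrence accumulated over the frame:

$$ W_{ZCR\times EE}(t) = \begin{cases} 1, & a_1 \le ZCR(t) \le a_2 \ \text{and} \ b_1 \le EE(t) \le b_2 \\ 0, & \text{otherwise} \end{cases} \qquad V_{ZCR\times EE} = \frac{1}{\Delta t}\sum_{t=1}^{\Delta t} W_{ZCR\times EE}(t) $$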
where ZCR(t) and EE(t) denote the zero-crossing rate and energy entropy values at time t; a_1, a_2 and b_1, b_2 are the lower and upper thresholds of the zero-crossing rate and the energy entropy respectively; W_{ZCR×EE}(t) is the combined feature vector at time t; and V_{ZCR×EE} is the combined long-term feature obtained by counting the co-occurrences of zero-crossing rate and energy entropy over a duration Δt. Because the features above differ greatly in order of magnitude, the combined features are normalized so that differences between speech segments caused by depression severity can be compared and a more stable model built; normalization does not change the original distribution of the data, it only rescales it.
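A minimal Python sketch of this step (assuming two already-discretized short-term feature sequences for one frame; names are illustrative, not the patent's):

```python
import numpy as np
from itertools import product

def combined_long_term_feature(feat_a, feat_b):
    """For one frame, count the frequency of each of the 9 co-occurring
    discrete value pairs (e.g. ZCR x EE), giving a 9-dimensional
    combined long-term feature; frequencies already lie in [0, 1]."""
    return np.array([np.mean((feat_a == a) & (feat_b == b))
                     for a, b in product(range(3), repeat=2)])

# Stand-in discretized windows of one 2-4 s frame
zcr_discrete = np.random.randint(0, 3, 200)
ee_discrete = np.random.randint(0, 3, 200)
frame_vector = combined_long_term_feature(zcr_discrete, ee_discrete)  # shape (9,)
```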
Further, step S3 adopts a machine learning classification algorithm based on the combined features. The random forest algorithm is an ensemble learning method composed of multiple decision trees; it is a strong classifier that outputs its prediction by majority vote over the predictions of the individual trees. A single decision tree is a weak classifier, meaning its classification result is only slightly better than random; by the law of large numbers, the majority vote of many decision trees is clearly better than a single tree. In the random forest algorithm, with N total samples and M features, n (n < N) samples and m (m < M) features are drawn, the single decision tree splits its left and right subtrees on the best feature, and the sampling is repeated T times to generate T decision trees forming the random forest; the final prediction is produced by majority vote of the T trees. When generating a single tree, random sampling ensures that each tree sees different samples and thus produces different predictions; sampling with replacement (bagging) ensures the samples of different trees overlap, preventing the trees from diverging too much after training, and the random selection of features avoids overfitting. The feature importance of a single decision tree is measured by how much a feature reduces the Gini index at the tree's nodes, calculated as shown in the following equation.
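The standard Gini index of a node N, matching the definitions that follow, is:

$$ Gini(N) = 1 - \sum_{c=1}^{C} p(c\mid N)^{2} $$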
where p(c|N) is the proportion of samples in node N belonging to class c and C is the total number of classes. If all samples in node N belong to the same class, the Gini index is 0; if every class occupies the same proportion of node N, the Gini index takes its maximum value. The smaller the Gini index, the purer the class separation. A random forest represents feature importance by averaging, over all trees, the reduction in the Gini index attributable to each feature. The random forest algorithm can therefore handle complex classification tasks and rank feature importance; it is also easy to train in parallel, has low hardware requirements, can run on a CPU, and is suitable for all kinds of smart devices with recording functions. In the combined long-term features created by the invention, each feature value does not represent a single property of the voice data but the frequency with which several specific discrete values co-occur within one frame; when the decision tree classifies by these feature values, it classifies by the combined long-term feature values of one frame of voice data. Compared with plain long-term or short-term features, the combined long-term features carry richer information, the contribution of different feature values to the classifier can be judged, and the magnitude of a feature value corresponds to particular acoustic characteristics. Because the invention generates a long-term feature from a single frame of voice data, short speech segments can be classified, so people can screen for depression through a short everyday conversation; depression can thus be detected early, lowering the threshold for depressed patients to seek help. The combined-feature random forest classification framework is shown in fig. 4.
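A sketch of the classification stage with scikit-learn (an assumption about tooling; the patent does not name a library, and the data here is a stand-in):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One combined long-term feature vector per frame, labelled 1 (depressed) / 0 (non-depressed)
X = np.random.rand(300, 9)            # stand-in feature matrix
y = np.random.randint(0, 2, 300)      # stand-in PHQ-8-derived labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# T trees (n_estimators), m < M random features per split (max_features),
# bootstrap sampling with replacement (bagging) by default
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))          # accuracy by majority vote
print(clf.feature_importances_)       # mean Gini-index reduction per feature
```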
The present invention marks the depression category as 1 (positive) and the non-depression category as 0 (negative). The classification performance of the system is evaluated by accuracy, sensitivity, specificity, and F1 score, calculated by the following formulas:
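The standard definitions of these measures, in terms of the quantities explained below, are:

$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN}, \quad Sensitivity = R = \frac{TP}{TP+FN}, \quad Specificity = \frac{TN}{TN+FP}, \quad P = \frac{TP}{TP+FP}, \quad F1 = \frac{2PR}{P+R} $$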
where TP denotes positive samples predicted by the model as positive, TN negative samples predicted as negative, FP negative samples predicted as positive, and FN positive samples predicted as negative; R is the recall, which is equivalent to the sensitivity, and P is the precision.
As shown in fig. 5, the present invention further provides an automatic depression recognition device based on voice recognition and machine learning, comprising:
the acquisition module is used for acquiring voice data of a patient;
the combination module is used for selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
and the identification module is used for identifying the degree of depression from the long-term features according to a random forest algorithm.
Further, the combination module includes:
the extraction unit is used for carrying out feature extraction on the voice data by framing and windowing;
the selection unit is used for selecting the extracted characteristics according to the decision tree;
and the combination unit is used for recombining the selected characteristics to generate long-term characteristics.
Further, the extracted features are time domain features and frequency domain features, the time domain features comprising: short-time energy, zero crossing rate and energy entropy, and frequency domain features comprise: spectral entropy, fundamental frequency, and centroid.
Further, the combining unit includes:
the discretization component is used for discretizing the short-term features, setting thresholds at the lower and upper tertile points of each feature's values, and dividing each feature into three values: low, median, and high;
the combination component is used for carrying out feature combination on the discretized features;
and the generation component is used for generating the long-term features by counting the frequency with which the combined features occur within one frame of the voice signal.
Further, each feature value in the long-term features in the recognition module represents the frequency with which several specific discrete feature values of the speech data co-occur within one frame; when classifying by feature value, classification is performed according to the frequency of co-occurrence of the discrete features of the speech data within one frame.
Voice data from depression patients is easy to obtain: it can be collected with devices such as wristbands and mobile phones, and it carries rich emotional information. Compared with the speech of healthy people, the speech of depression patients differs markedly: their speech rate is slower, their tone is flatter, their hoarseness increases, and their voice is breathier. Early diagnosis of depression can therefore be achieved by analyzing the informational characteristics of voice data. The invention processes the voice signal, extracts several of its features, and combines them with a machine learning algorithm to provide a novel method for recognizing depression. The method can recognize depression from a short conversation; during recognition the extracted short-term speech features are combined, and effective classification is then performed. By combining machine learning theory with feature combination, the invention judges the everyday degree of depression of a population effectively and accurately.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme and inventive concept of the present invention, within the scope disclosed herein, shall be covered by the scope of the present invention.

Claims (2)

1. An automatic depression recognition method based on voice recognition and machine learning is characterized by comprising the following steps:
s1, acquiring voice data of a patient;
s2, selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
s3, recognizing the depression degree of the long-term features according to a random forest algorithm;
the step S2 comprises the following steps:
step 2.1, carrying out feature extraction on the voice data by framing and windowing;
step 2.2, selecting the extracted features according to the decision tree;
step 2.3, recombining the selected characteristics to generate long-term characteristics;
the extracted features in step 2.1 are time domain features and frequency domain features, wherein the time domain features comprise: short-time energy, zero crossing rate and energy entropy, and frequency domain features comprise: spectral entropy, fundamental frequency, and centroid;
in step 2.3, the short-term features are discretized: thresholds are set at the lower and upper tertile points of each feature's values, each feature is divided into three discrete values of low, median, and high, and the discretized features are then combined by co-occurrence; after feature combination, the frequency with which the features occur within one frame of the speech signal is counted to generate the long-term features;
in step S3, each feature value in the long-term features represents the frequency with which several specific discrete feature values of the speech data co-occur within one frame, and classification by feature value is performed according to the frequency of co-occurrence of the discrete features of the speech data within one frame.
2. An automatic depression recognition device based on voice recognition and machine learning, comprising:
the acquisition module is used for acquiring voice data of a patient;
the combination module is used for selecting the characteristics of the voice data, and recombining the selected characteristics to generate long-term characteristics;
the identification module is used for identifying the degree of depression from the long-term features according to a random forest algorithm;
the combination module comprises:
the extraction unit is used for carrying out feature extraction on the voice data by framing and windowing;
the selection unit is used for selecting the extracted characteristics according to the decision tree;
a combination unit for recombining the selected features to generate long-term features;
the extracted features are time domain features and frequency domain features, the time domain features comprising: short-time energy, zero crossing rate and energy entropy, and frequency domain features comprise: spectral entropy, fundamental frequency, and centroid;
the combination unit includes:
the discretization component is used for discretizing the short-term features, setting thresholds at the lower and upper tertile points of each feature's values, and dividing each feature into three discrete values of low, median, and high;
the combination component is used for combining the features after discretization in a co-occurrence manner;
the generation component is used for generating the long-term features by counting the frequency with which the combined features occur within one frame of the voice signal;
each feature value in the long-term features in the identification module represents the frequency with which several specific discrete feature values of the voice data co-occur within one frame, and classification by feature value is performed according to the frequency of co-occurrence of the discrete features of the voice data within one frame.
CN202110986901.2A 2021-08-26 2021-08-26 Automatic depression recognition method and device based on voice recognition and machine learning Active CN113823267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110986901.2A CN113823267B (en) 2021-08-26 2021-08-26 Automatic depression recognition method and device based on voice recognition and machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110986901.2A CN113823267B (en) 2021-08-26 2021-08-26 Automatic depression recognition method and device based on voice recognition and machine learning

Publications (2)

Publication Number Publication Date
CN113823267A CN113823267A (en) 2021-12-21
CN113823267B true CN113823267B (en) 2023-12-29

Family

ID=78913628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110986901.2A Active CN113823267B (en) 2021-08-26 2021-08-26 Automatic depression recognition method and device based on voice recognition and machine learning

Country Status (1)

Country Link
CN (1) CN113823267B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109599129A (en) * 2018-11-13 2019-04-09 杭州电子科技大学 Voice depression recognition methods based on attention mechanism and convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10147423B2 (en) * 2016-09-29 2018-12-04 Intel IP Corporation Context-aware query recognition for electronic devices
US20190110754A1 (en) * 2017-10-17 2019-04-18 Satish Rao Machine learning based system for identifying and monitoring neurological disorders

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109599129A (en) * 2018-11-13 2019-04-09 杭州电子科技大学 Voice depression recognition methods based on attention mechanism and convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Predictive Modeling of Depression with a Large Claim Dataset; Riyi Qiu et al.; 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 1589-1595 *
Application of machine learning in the field of depression; Dong Jianyu et al.; Advances in Psychological Science; Vol. 28, No. 2; 266-274 *

Also Published As

Publication number Publication date
CN113823267A (en) 2021-12-21

Similar Documents

Publication Publication Date Title
CN112750465B (en) Cloud language ability evaluation system and wearable recording terminal
Ooi et al. Multichannel weighted speech classification system for prediction of major depression in adolescents
US10478111B2 (en) Systems for speech-based assessment of a patient&#39;s state-of-mind
Low et al. Detection of clinical depression in adolescents’ speech during family interactions
Levitan et al. Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection.
WO2023139559A1 (en) Multi-modal systems and methods for voice-based mental health assessment with emotion stimulation
Liu et al. Speech personality recognition based on annotation classification using log-likelihood distance and extraction of essential audio features
Islam et al. Early detection of COVID-19 patients using chromagram features of cough sound recordings with machine learning algorithms
Xiao et al. Hierarchical classification of emotional speech
Houari et al. Study the Influence of Gender and Age in Recognition of Emotions from Algerian Dialect Speech.
Joshy et al. Dysarthria severity assessment using squeeze-and-excitation networks
Liu et al. A novel decision tree for depression recognition in speech
Alimuradov et al. A method to determine cepstral markers of speech signals under psychogenic disorders
Vasuki Speech Emotion Recognition Using Adaptive Ensemble of Class Specific Classifiers
CN113823267B (en) Automatic depression recognition method and device based on voice recognition and machine learning
Ding et al. Automatic recognition of student emotions based on deep neural network and its application in depression detection
Milani et al. A real-time application to detect human voice disorders
Yagnavajjula et al. Detection of neurogenic voice disorders using the fisher vector representation of cepstral features
Suwannakhun et al. Characterizing Depressive Related Speech with MFCC
Mangalam et al. Emotion Recognition from Mizo Speech: A Signal Processing Approach
Bhavya et al. Speech Emotion Analysis Using Machine Learning for Depression Recognition: a Review
Kumar et al. Towards Improving the Performance of Dysarthric Speech Severity Assessment System
Fathan et al. An Ensemble Approach for the Diagnosis of COVID-19 from Speech and Cough Sounds
Manikandan et al. Speaker identification using a novel prosody with fuzzy based hierarchical decision tree approach
Akkaralaertsest et al. Classification of depressed speech samples with spectral energy ratios as depression indicator

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant