CN116312647A - Real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition - Google Patents

Real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition

Info

Publication number
CN116312647A
CN116312647A (application CN202310091119.3A)
Authority
CN
China
Prior art keywords
mean
var
time
feature
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310091119.3A
Other languages
Chinese (zh)
Inventor
周伟 (Zhou Wei)
刘骁 (Liu Xiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202310091119.3A priority Critical patent/CN116312647A/en
Publication of CN116312647A publication Critical patent/CN116312647A/en
Pending legal-status Critical Current

Classifications

    • AHUMAN NECESSITIES
    • A61MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61BDIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00Measuring for diagnostic purposes; Identification of persons
    • A61B5/45For evaluating or diagnosing the musculoskeletal system or teeth
    • A61B5/4538Evaluating a particular part of the muscoloskeletal system or a particular medical condition
    • A61B5/4542Evaluating the mouth, e.g. the jaw
    • A61B5/4557Evaluating bruxism
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention belongs to the technical field of computer-aided diagnosis, and specifically relates to a real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition. The method comprises: collecting audio in the sleep environment; converting the audio into short-time stationary signals by framing; extracting multiple time-domain and frequency-domain features from each audio sample and summarizing, over the per-frame feature arrays, feature groups of their variance and mean distributions; and selecting the best predictive model for bruxism diagnosis from several machine learning models. The invention can distinguish bruxism episodes from other sleep states with high accuracy in a complex sleep environment, requires no costly preprocessing such as denoising or filtering of the audio signal, achieves non-contact monitoring and diagnosis of bruxism with a simple workflow, avoids discomfort to patients, and provides a practical reference and theoretical basis for subsequent bruxism intervention.

Description

Real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition
Technical Field
The invention belongs to the technical field of computer-aided diagnosis, and specifically relates to a real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition.
Background
According to the 2018 international consensus on the assessment of bruxism, sleep bruxism is a masticatory muscle activity during sleep characterized as rhythmic (phasic) or non-rhythmic (tonic); in otherwise healthy individuals it is not itself a movement disorder or a sleep disorder, but it increases the risk of other problems such as tooth wear, masticatory muscle hypertrophy, temporomandibular joint dysfunction and facial muscle soreness, and can even lead to headaches. Reliable studies report that bruxism is widely distributed in the population, with a prevalence of about 8% among adolescents. Accurate diagnosis and treatment of bruxism, which reduces its negative impact both on patients and on those around them, is therefore of great significance for life and health.
The current diagnostic methods for bruxism are mainly questionnaires, clinical evaluation, and portable-device diagnosis. Questionnaires and clinical evaluation have low specificity for bruxism and are, in essence, assessments made only after bruxism has already caused damage, so they cannot prevent its occurrence. Portable devices based on the surface electromyographic (sEMG) signal have been widely studied, and real-time diagnosis of bruxism and grading of its severity from EMG signals have improved to some extent. However, existing sEMG-based diagnosis algorithms, both domestic and international, depend heavily on the specific device used to collect the EMG signal: EMG recordings of bruxism lack an authoritative, unified standard, and an identification algorithm is typically applicable only to the acquisition device developed by its own authors rather than to night-time bruxism signals collected by other EMG equipment. As a result, publicly available sleep bruxism data sets and mature sEMG-based portable diagnostic devices are almost nonexistent. Most importantly, the majority of people who brux are otherwise healthy individuals (i.e., the bruxism poses no pathological hazard to them) and show no serious pathological behavior, the main complaint being the disturbance caused to a bed partner or to rest; for these healthy bruxers, wearing contact devices such as EMG sensors or pressure sensors causes sleep discomfort in most cases.
Therefore, a more refined non-contact diagnostic method is needed to detect and diagnose the occurrence of bruxism in real time. Such a method would contribute substantially, at the source, to reducing the harm of bruxism and would meet the needs of the large population of healthy bruxers, while also monitoring other sleep states such as sneezing, yawning, sighing, moaning, and throat clearing.
Disclosure of Invention
To improve existing methods for diagnosing bruxism and overcome the shortcomings of current clinical diagnosis, namely its low specificity for bruxism and its inability to perform non-contact real-time diagnosis, the invention provides a portable, non-contact real-time diagnosis method and system for sleep bruxism (including related sleep states, hereinafter the same) based on non-verbal audio feature recognition.
The real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition can rapidly recognize original lossless audio while maintaining stable robustness and accuracy. The invention can monitor the occurrence of bruxism in real time, adds a new, broadly applicable algorithm for portable bruxism diagnosis devices, and provides an important basis for subsequent research on the early prevention and treatment of bruxism.
The invention provides a real-time sleep bruxism diagnosis method based on non-verbal audio feature recognition, which comprises the following specific steps:
step S1, audio acquisition: using an audio acquisition module to acquire audio data in the sleep environment;
step S2, framing and downsampling: converting the collected audio data into short-time stationary audio signals by framing in the time domain, and halving the sampling frequency;
step S3, feature extraction: extracting time-domain features and frequency-domain features from the framed audio samples;
step S4, feature group summarization: summarizing, from the per-frame feature arrays, feature groups of the variance and mean distributions of the various time-domain features and of the various frequency-domain features;
step S5, machine learning model selection: based on the above time-domain, frequency-domain and time-frequency-domain feature groups, inputting them into several machine learning models and selecting the optimal model for the diagnosis and evaluation of bruxism (and its other sleep states).
Further:
In step S1, using an audio acquisition module to acquire audio data in the sleep environment means that, in the natural sleep environment, a microphone is used to collect sound; the microphone is fixed in the central region of the bed, 30-50 cm vertically above the head of the sleep bruxism patient. Single-channel data are obtained with a sampling frequency of 44100 Hz, a sampling depth of 16 bits and the lossless WAV audio format, and the length of each captured audio data sample is 1 s.
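By way of illustration only, the acquisition step can be sketched in Python; the sounddevice and soundfile packages stand in for the microphone front-end described above, and the package choice, function name and file path are assumptions rather than part of the invention:

```python
import sounddevice as sd
import soundfile as sf

SR = 44100        # sampling frequency specified in step S1 (Hz)
DURATION = 1.0    # each captured audio sample is 1 s long

def capture_sample(path="sample.wav"):
    # Record 1 s of single-channel 16-bit audio from the default microphone.
    audio = sd.rec(int(SR * DURATION), samplerate=SR, channels=1, dtype="int16")
    sd.wait()                                    # block until the recording finishes
    sf.write(path, audio, SR, subtype="PCM_16")  # store as lossless WAV
    return audio.squeeze()                       # 1-D int16 array of 44100 samples
```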
In step S2, the collected audio data are converted into short-time stationary audio signals by framing in the time domain, and the sampling frequency is halved. Specifically: the collected audio data are downsampled; the non-stationary audio signal is then converted into short-time stationary audio signals by framing in the time domain, with a frame size of 1024 sampling points and a hop (frame shift) of 512 points.
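A possible implementation of this step, sketched with the librosa library (the choice of librosa and the halved target rate of 22050 Hz are illustrative assumptions consistent with the description above):

```python
import numpy as np
import librosa

FRAME_LENGTH = 1024   # sampling points per frame (step S2)
HOP_LENGTH = 512      # frame shift of 512 points

def frame_signal(y, sr=44100):
    """Downsample by half and split the signal into short-time frames."""
    y = np.asarray(y, dtype=np.float32)
    y_ds = librosa.resample(y, orig_sr=sr, target_sr=sr // 2)   # 44100 Hz -> 22050 Hz
    frames = librosa.util.frame(y_ds, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH)
    return y_ds, frames.T    # frames.T has shape (number of frames, 1024)
```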
In step S3, extracting the time-domain features and frequency-domain features from the framed audio samples comprises the following sub-steps:
step S3-1, extracting time-domain features, namely AE (amplitude envelope), RMS (root-mean-square energy) and ZCR (zero-crossing rate), calculated specifically as follows:
the AE formula is:
$$AE_t = \max_{k = tK}^{(t+1)K - 1} |s(k)|, \tag{1}$$
the RMS calculation formula is:
$$RMS_t = \sqrt{\frac{1}{K} \sum_{k = tK}^{(t+1)K - 1} s(k)^2}, \tag{2}$$
the ZCR calculation formula is:
$$ZCR_t = \frac{1}{2} \sum_{k = tK}^{(t+1)K - 1} \left| \operatorname{sgn}\big(s(k)\big) - \operatorname{sgn}\big(s(k+1)\big) \right|, \tag{3}$$
In the above formulas, s(k) denotes the audio signal and k the index of its sampling points; K denotes the number of sampling points in one frame, and t denotes the t-th frame, so that tK is the position of the first sampling point of the t-th frame and (t+1)K-1 is the position of its last sampling point; sgn is the sign function;
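The three time-domain features can be computed per frame directly from the framed signal of step S2; a minimal numpy sketch following equations (1)-(3) (the function name is illustrative only):

```python
import numpy as np

def time_domain_features(frames):
    """frames: array of shape (number of frames, K) from step S2."""
    ae = np.max(np.abs(frames), axis=1)            # amplitude envelope, eq. (1)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))    # root-mean-square energy, eq. (2)
    # zero-crossing rate, eq. (3): half the number of sign changes within each frame
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)
    return ae, rms, zcr
```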
step S3-2, extracting frequency domain features, namely BER (Band Energy Ratio), SC (Spectral centroid), BW (Bandwidth) and SR (Spectral Roll-off), wherein the frequency domain features are calculated as follows:
BER is calculated as:
$$BER_t = \frac{\sum_{n=1}^{F-1} m_t(n)^2}{\sum_{n=F}^{N} m_t(n)^2}, \tag{4}$$
the SC calculation formula is:
$$SC_t = \frac{\sum_{n=1}^{N} n \, m_t(n)}{\sum_{n=1}^{N} m_t(n)}, \tag{5}$$
BW is calculated as:
$$BW_t = \frac{\sum_{n=1}^{N} |n - SC_t| \, m_t(n)}{\sum_{n=1}^{N} m_t(n)}, \tag{6}$$
the SR calculation formula is:
$$\sum_{i=1}^{f_c} m_i \geq \eta \sum_{i=1}^{N} m_i, \tag{7}$$
In the above formulas, t denotes the t-th frame and m_t(n) the magnitude of the n-th spectral point of frame t; F is the split frequency (bin) chosen for the band-energy ratio, N is the number of spectral points per frame, f_c is the roll-off point satisfying the inequality, η is the chosen roll-off proportion (commonly 0.85), and m_i denotes the spectral energy of point i.
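A sketch of these frequency-domain features computed from the short-time magnitude spectrum; librosa's built-in spectral descriptors are used where available (they return values in Hz rather than bin index, which differs from equations (5)-(6) only by a constant scale), and the band-energy ratio follows equation (4) directly. The 2000 Hz split frequency and the 0.85 roll-off proportion are illustrative assumptions:

```python
import numpy as np
import librosa

def frequency_domain_features(y, sr=22050, n_fft=1024, hop=512, split_hz=2000):
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))       # magnitude spectrum per frame
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    split = np.searchsorted(freqs, split_hz)                        # bin index F in eq. (4)
    ber = np.sum(S[:split] ** 2, axis=0) / (np.sum(S[split:] ** 2, axis=0) + 1e-12)  # eq. (4)
    sc = librosa.feature.spectral_centroid(S=S, sr=sr)[0]           # eq. (5), in Hz
    bw = librosa.feature.spectral_bandwidth(S=S, sr=sr)[0]          # close to eq. (6), in Hz
    roll = librosa.feature.spectral_rolloff(S=S, sr=sr, roll_percent=0.85)[0]  # eq. (7)
    return ber, sc, bw, roll
```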
In terms of frequency-domain features, the complementary harmonic and perceptual (Harmonics and Perceptual) features are calculated as:
$$H(t) = A \sin(2\pi f t + \varphi), \tag{8}$$
$$S = K \cdot 10^{L/10}, \tag{9}$$
where H(t) is the amplitude function of the harmonic, representing the amplitude of the harmonic at time t; A is the amplitude of the harmonic, representing its maximum amplitude; f is the frequency of the harmonic in hertz; t is time in seconds; φ is the phase of the harmonic, representing its phase difference relative to the fundamental frequency; S is the sound pressure level, L is the sound intensity level, and K is a constant.
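The description does not tie the harmonic and perceptual features to a specific toolkit; one plausible reading, sketched below purely as an assumption, takes the per-frame energy of the harmonic component (from harmonic-percussive separation) as the "harmonics" feature and a perceptually weighted power spectrum as the "perceptual" feature:

```python
import numpy as np
import librosa

def harmonic_perceptual_features(y, sr=22050, n_fft=1024, hop=512):
    # Assumed interpretation of "harmonics": RMS energy of the harmonic component per frame.
    y_harm = librosa.effects.harmonic(y)
    harmonics = librosa.feature.rms(y=y_harm, frame_length=n_fft, hop_length=hop)[0]
    # Assumed interpretation of "perceptual": A-weighted power spectrum, averaged per frame (dB).
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    perceptual = librosa.perceptual_weighting(S, freqs).mean(axis=0)
    return harmonics, perceptual
```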
Step S3-3, extracting the time-frequency-domain features, namely the mel-frequency cepstral coefficients (MFCC). An FFT (fast Fourier transform) is applied to each frame from step S2 to obtain the spectrum and then the magnitude spectrum; a mel filter bank is applied to the magnitude spectrum, the logarithm of the filter-bank outputs is taken, and finally a discrete cosine transform (DCT) yields the MFCC, expressed as:
$$MFCC_i = \sum_{l=1}^{L} m(l) \cos\!\left[\frac{\pi i}{L}\left(l - \frac{1}{2}\right)\right], \tag{10}$$
where i indexes the i-th MFCC coefficient, N is the number of sampling points per frame, l indexes the l-th mel filter, m(l) is the (log) output of the l-th filter, and L is the total number of filters.
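This chain (framing, FFT, mel filter bank, logarithm, DCT) is what librosa.feature.mfcc implements; a sketch producing 20 coefficients per frame, the count implied by the mfcc1-mfcc20 names in step S4, with parameter values carried over from steps S1-S2:

```python
import librosa

def mfcc_features(y, sr=22050, n_fft=1024, hop=512, n_mfcc=20):
    # FFT -> magnitude spectrum -> mel filter bank -> log -> DCT, per frame (eq. (10)).
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)   # shape (20, number of frames)
```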
In step S4, feature groups of the time-domain variance and mean distributions and of the frequency-domain variance and mean distributions are summarized from the per-frame feature arrays. Specifically, the features of each frame are computed with the formulas of step S3 to obtain a feature array per feature; for example, for the RMS feature, a feature array denoted RMS-1, RMS-2, RMS-3, RMS-4, ..., RMS-i is obtained, where i is the number of frames. From each feature array, the variance-distribution and mean-distribution features (e.g. RMS_mean, RMS_var) are obtained, giving the following set of 56 feature values in total (an aggregation sketch follows the list below), specifically:
'envelope_mean','envelope_var','RMS_mean','RMS_var','zero_mean','zero_var',
'centroids_mean','centroids_var','bandwith_mean','bandwidth_var','rolloff_mean',
'rolloff_var','harmonics_mean','harmonics_var','perceptrual_mean','perceptrual_var',
'mfcc1_mean','mfcc2_mean','mfcc3_mean','mfcc4_mean','mfcc5_mean','mfcc6_mean',
'mfcc7_mean','mfcc8_mean','mfcc9_mean','mfcc10_mean','mfcc11_mean','mfcc12_mean',
'mfcc13_mean','mfcc14_mean','mfcc15_mean','mfcc16_mean','mfcc17_mean','mfcc18_mean','mfcc19_mean','mfcc20_mean',
'mfcc1_var','mfcc2_var','mfcc3_var','mfcc4_var','mfcc5_var','mfcc6_var','mfcc7_var',
'mfcc8_var','mfcc9_var','mfcc10_var','mfcc11_var','mfcc12_var','mfcc13_var',
'mfcc14_var','mfcc15_var','mfcc16_var','mfcc17_var','mfcc18_var','mfcc19_var','mfcc20_var';
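A sketch of the mean/variance aggregation that turns the per-frame feature arrays into the 56-dimensional feature vector listed above; the dictionary keys follow the naming convention of the list, and the helper structure is illustrative only:

```python
import numpy as np

def aggregate_features(named_arrays):
    """named_arrays: dict such as {'envelope': ae, 'RMS': rms, 'zero': zcr, 'mfcc1': mfcc[0], ...},
    each value being the per-frame feature array of one 1 s audio sample."""
    feats = {}
    for name, arr in named_arrays.items():
        feats[f"{name}_mean"] = float(np.mean(arr))   # mean over all frames
        feats[f"{name}_var"] = float(np.var(arr))     # variance over all frames
    return feats    # 8 scalar features x 2 + 20 MFCCs x 2 = 56 values
```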
In step S5, a machine learning method is used to construct the optimal model for predicting sleep bruxism and the other sleep states, and the model is used for the diagnosis of bruxism and the evaluation of the other sleep states:
After a correlation analysis of the feature values, the feature values from step S4 are input into typical machine learning network models, including but not limited to: Naive Bayes, K-nearest neighbor (KNN), stochastic gradient descent (Stochastic Gradient Descent), decision tree (Decision Tree), random forest (Random Forest), support vector machine (Support Vector Machine), logistic regression (Logistic Regression), neural network (Neural Net), Catboost, Cross Gradient Booster, and Cross Gradient Booster (Random Forest). These 11 models are compared, and the best model for the application of real-time diagnosis of bruxism and its other sleep states from non-verbal audio features is selected.
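A comparison sketch using scikit-learn implementations of eight of the eleven listed models (Catboost and the two gradient-boosting variants would be added analogously through their own packages); the hyper-parameters, train/test split and macro averaging of F1 are assumptions, not prescriptions of the invention:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def compare_models(X, y):
    """X: (n_samples, 56) feature matrix from step S4; y: sleep-state labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    models = {
        "Naive Bayes": GaussianNB(),
        "KNN": KNeighborsClassifier(),
        "Stochastic Gradient Descent": SGDClassifier(),
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(),
        "Support Vector Machine": SVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Neural Net": MLPClassifier(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        clf = make_pipeline(StandardScaler(), model)   # standardize the 56 features
        clf.fit(X_tr, y_tr)
        # F1 balances precision and recall across the 16 sleep-state classes
        scores[name] = f1_score(y_te, clf.predict(X_te), average="macro")
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))
```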
The criterion for selecting the optimal model among the different models is the value of the F-score, a combined trade-off between precision (Precision) and recall (Recall). The F-score is introduced as a comprehensive index, i.e. to balance the influence of precision and recall and evaluate a classifier more comprehensively; it is the harmonic mean of precision and recall.
The calculation formula for the F-score (F1) is:
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively: a true positive (TP) is a positive sample predicted by the model as the positive class, a true negative (TN) is a negative sample predicted as the negative class, a false positive (FP) is a negative sample predicted as the positive class, and a false negative (FN) is a positive sample predicted as the negative class.
In the invention, the workflow of machine learning network model training is as follows: the audio samples are augmented to obtain a data set, which is divided into a training set and a test set; the training set is downsampled and framed to facilitate the subsequent algorithm; after framing, the feature array of each sample is obtained using the feature formulas above; after a correlation analysis of the feature values, they are input into the different network models for training, and the strengths and weaknesses of the different machine learning network models are verified on the test set; the optimal network model is selected according to the F1 index, and the prediction results for bruxism and the other sleep states are analyzed.
The workflow of the invention is as follows: in the sleep environment, an audio acquisition module placed near the patient collects the sounds made during sleep; the module captures 1 s of audio data at a time; the MCU downsamples and frames the audio data; the signal processing unit performs feature analysis on the processed audio; and the feature data are fed into the trained optimal model, which outputs a result in real time, thereby realizing the diagnosis and prediction of sleep bruxism and the other sleep states.
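Putting the pieces together, a minimal real-time loop following the above workflow; every helper function is one of the illustrative sketches given earlier, best_model is the classifier selected in step S5, and the feature ordering must match the one used during training (the band-energy ratio is computed in step S3-2 but is not among the 56 aggregated values of step S4, so it is omitted here):

```python
import numpy as np

def diagnose_once(best_model):
    """Capture one 1 s sample, extract the 56 features, and return the predicted sleep state."""
    y = capture_sample().astype(np.float32) / 32768.0     # 16-bit PCM -> [-1, 1]
    y_ds, frames = frame_signal(y)                         # step S2: downsample + frame
    ae, rms, zcr = time_domain_features(frames)            # step S3-1
    _ber, sc, bw, roll = frequency_domain_features(y_ds)   # step S3-2 (BER not aggregated)
    harm, perc = harmonic_perceptual_features(y_ds)        # step S3-2 (complementary)
    mfcc = mfcc_features(y_ds)                             # step S3-3
    named = {"envelope": ae, "RMS": rms, "zero": zcr, "centroids": sc,
             "bandwidth": bw, "rolloff": roll, "harmonics": harm, "perceptrual": perc}
    named.update({f"mfcc{i + 1}": mfcc[i] for i in range(mfcc.shape[0])})
    feats = aggregate_features(named)                      # step S4: 56-dimensional vector
    x = np.array([list(feats.values())])                   # key order must match training
    return best_model.predict(x)[0]                        # e.g. "teeth-grinding"
```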
The invention also provides a real-time sleep bruxism diagnosis system based on the above real-time diagnosis method. The system comprises the following modules: an audio acquisition module, a framing and downsampling module, a feature extraction module, a feature group summarization module, and a machine learning model selection module; these 5 modules in turn perform the 5 steps of the real-time diagnosis method. The audio acquisition module performs the audio acquisition of step S1; the framing and downsampling module performs the framing and downsampling of step S2; the feature extraction module performs the feature extraction of step S3; the feature group summarization module performs the feature group summarization of step S4; and the machine learning model selection module performs the machine learning model selection of step S5.
The real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition provided by the invention have the following advantages:
1. by rapidly recognizing the audio signal, the invention can diagnose the occurrence of bruxism and other sleep states in real time; the audio acquisition system has no physical contact with the patient and, unlike the various attached portable devices and intra-oral devices, causes no discomfort during sleep, thereby meeting the needs of patients without pathological bruxism;
2. by rapidly recognizing the audio signal, the invention can diagnose bruxism the moment it occurs and issue a diagnostic indication promptly, avoiding the previous practice of summarizing a diagnosis from whole-night data only after the fact, which does nothing to mitigate the harm bruxism does to the body;
3. the invention uses audio in the standard lossless WAV format as the input signal, ensuring the accuracy and reliability of the data and the reproducibility of the experiments;
4. the data set comes from an open-source database, so the experimental data are rich and broadly applicable;
5. the algorithm fully exploits machine learning features, adopts several commonly used machine learning models, and compares them;
6. the algorithm has high specificity and robustness: in a natural sleep environment it can distinguish, with high accuracy, the various noises related to teeth-grinding sounds and the other sleep states, namely 16 sleep states in total: sneezing, screaming, moaning, crying, yawning, tongue-clicking, laughing, lip-popping, throat-clearing, lip-smacking, nose-blowing, coughing, sighing, teeth-grinding, teeth-chattering and panting;
7. the invention can monitor the onset of bruxism in real time and can provide key information such as the count of bruxism episodes and their energy ratio, which is of great significance for bruxism diagnosis reports and subsequent treatment guidance; it can also monitor the other sleep states.
Drawings
FIG. 1 is a block diagram of the real-time sleep bruxism diagnosis framework based on non-verbal audio feature recognition according to the present invention.
FIG. 2 is a comparative analysis of the network model selection according to the present invention.
FIG. 3 is the confusion matrix of the optimal network for recognizing teeth-grinding sounds (including the other sleep states).
FIG. 4 is a signal flow diagram for identifying teeth-grinding and the other sleep states.
FIG. 5 is a schematic diagram of the hardware in an example of identifying teeth-grinding and its other sleep states.
Detailed Description
The technical solutions of the embodiments of the present invention will be described fully and clearly below with reference to the accompanying drawings. It is apparent that the described embodiments are only representative embodiments of the present invention, not all of them. Based on the embodiments of the present invention, other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Embodiments of the present invention comprise two parts: a network model training portion and an implementation instance portion. The process of training the network model comprises the following steps:
step S1, an audio acquisition module is used for acquiring audio data in a sleep environment;
step S2, converting the acquired audio data into short-time stationary audio signals by framing in the time domain, and halving the sampling frequency;
step S3, extracting time-domain features, frequency-domain features and time-frequency-domain features from the framed audio samples;
step S4, summarizing, from the per-frame feature arrays, feature groups of the variance and mean distributions of the various time-domain features and of the various frequency-domain features;
step S5, based on the time-domain, frequency-domain and time-frequency-domain feature groups, constructing the optimal prediction model for sleep bruxism and its other sleep states by a machine learning method, for the diagnosis of bruxism and the evaluation of the other sleep states.
In step S1, an audio acquisition module is used to acquire audio data in the sleep environment: in the natural sleep environment, a microphone fixed in the central region of the bed, 30-50 cm vertically above the head of the sleep bruxism patient, collects the sound. Single-channel data are obtained with a sampling frequency of 44100 Hz, a sampling depth of 16 bits and the lossless WAV audio format, and the length of each captured audio data sample is 1 s.
Step S2 comprises the following two sub-steps: in step S2-1, the collected audio data are downsampled; in step S2-2, the non-stationary audio signal is converted into short-time stationary audio signals by framing in the time domain, with a frame size of 1024 sampling points and a hop (frame shift) of 512 points.
The feature extraction part of step S3 includes the following sub-steps:
Step S3-1 extracts the time-domain features after step S2; the time-domain features AE, RMS and ZCR are calculated as follows:
the AE (amplitude envelope) is calculated as:
$$AE_t = \max_{k = tK}^{(t+1)K - 1} |s(k)|,$$
the RMS (root-mean-square energy) is calculated as:
$$RMS_t = \sqrt{\frac{1}{K} \sum_{k = tK}^{(t+1)K - 1} s(k)^2},$$
the ZCR (zero-crossing rate) is calculated as:
$$ZCR_t = \frac{1}{2} \sum_{k = tK}^{(t+1)K - 1} \left| \operatorname{sgn}\big(s(k)\big) - \operatorname{sgn}\big(s(k+1)\big) \right|,$$
In the above formulas, s(k) denotes the audio signal and k the index of its sampling points; K denotes the number of sampling points in one frame, and t denotes the t-th frame, so that tK is the position of the first sampling point of the t-th frame and (t+1)K-1 is the position of its last sampling point; sgn is the sign function.
Step S3-2 extracts the frequency-domain features BER, SC, BW and SR after step S2, based on the following calculations:
the BER (band energy ratio) is calculated as:
$$BER_t = \frac{\sum_{n=1}^{F-1} m_t(n)^2}{\sum_{n=F}^{N} m_t(n)^2},$$
the SC (spectral centroid) is calculated as:
$$SC_t = \frac{\sum_{n=1}^{N} n \, m_t(n)}{\sum_{n=1}^{N} m_t(n)},$$
the BW (bandwidth) is calculated as:
$$BW_t = \frac{\sum_{n=1}^{N} |n - SC_t| \, m_t(n)}{\sum_{n=1}^{N} m_t(n)},$$
the SR (spectral roll-off) is the point f_c satisfying:
$$\sum_{i=1}^{f_c} m_i \geq \eta \sum_{i=1}^{N} m_i,$$
In the above formulas, t denotes the t-th frame and m_t(n) the magnitude of the n-th spectral point of frame t; F is the split frequency (bin) chosen for the band-energy ratio, N is the number of spectral points per frame, f_c is the roll-off point satisfying the inequality, η is the chosen roll-off proportion (commonly 0.85), and m_i denotes the spectral energy of point i.
In terms of frequency-domain features, the complementary harmonic and perceptual (Harmonics and Perceptual) features are calculated as:
$$H(t) = A \sin(2\pi f t + \varphi),$$
$$S = K \cdot 10^{L/10},$$
where H(t) is the amplitude function of the harmonic, representing the amplitude of the harmonic at time t; A is the amplitude of the harmonic, representing its maximum amplitude; f is the frequency of the harmonic in hertz; t is time in seconds; φ is the phase of the harmonic, representing its phase difference relative to the fundamental frequency; S is the sound pressure level, L is the sound intensity level, and K is a constant.
Step S3-3, extracting the time-frequency-domain features, namely the mel-frequency cepstral coefficients (MFCC). An FFT (fast Fourier transform) is applied to each frame from step S2 to obtain the spectrum and then the magnitude spectrum; a mel filter bank is applied to the magnitude spectrum, the logarithm of the filter-bank outputs is taken, and finally a discrete cosine transform (DCT) yields the MFCC, expressed as:
$$MFCC_i = \sum_{l=1}^{L} m(l) \cos\!\left[\frac{\pi i}{L}\left(l - \frac{1}{2}\right)\right],$$
where i indexes the i-th MFCC coefficient, N is the number of sampling points per frame, l indexes the l-th mel filter, m(l) is the (log) output of the l-th filter, and L is the total number of filters.
In step S4, according to the feature calculation formulas of step S3, the features of each frame (e.g. the RMS feature) are computed to obtain a feature array (e.g. [RMS-1, RMS-2, RMS-3, RMS-4, ..., RMS-i], where i is the number of frames), and from each feature array the variance-distribution and mean-distribution features (e.g. RMS_mean, RMS_var) are obtained, giving 56 features in total:
'envelope_mean','envelope_var','RMS_mean','RMS_var','zero_mean','zero_var',
'centroids_mean','centroids_var','bandwith_mean','bandwidth_var','rolloff_mean',
'rolloff_var','harmonics_mean','harmonics_var','perceptrual_mean','perceptrual_var',
'mfcc1_mean','mfcc2_mean','mfcc3_mean','mfcc4_mean','mfcc5_mean','mfcc6_mean',
'mfcc7_mean','mfcc8_mean','mfcc9_mean','mfcc10_mean','mfcc11_mean','mfcc12_mean',
'mfcc13_mean','mfcc14_mean','mfcc15_mean','mfcc16_mean','mfcc17_mean','mfcc18_mean','mfcc19_mean','mfcc20_mean',
'mfcc1_var','mfcc2_var','mfcc3_var','mfcc4_var','mfcc5_var','mfcc6_var','mfcc7_var',
'mfcc8_var','mfcc9_var','mfcc10_var','mfcc11_var','mfcc12_var','mfcc13_var',
'mfcc14_var','mfcc15_var','mfcc16_var','mfcc17_var','mfcc18_var','mfcc19_var','mfcc20_var';
The summary features are as follows:
Feature class | Count
envelope 2
RMS 2
zero 2
centroids 2
bandwidth 2
rolloff 2
harmonics 2
perceptual 2
MFCC 40
Total 56
In step S5, after the correlation analysis of the feature values, the feature values from step S4 are input into typical machine learning network models, including but not limited to: Naive Bayes, K-nearest neighbor (KNN), stochastic gradient descent (Stochastic Gradient Descent), decision tree (Decision Tree), random forest (Random Forest), support vector machine (Support Vector Machine), logistic regression (Logistic Regression), neural network (Neural Net), Catboost, Cross Gradient Booster, and Cross Gradient Booster (Random Forest). These 11 models are compared, and the analysis yields the best model for the application of real-time diagnosis of sleep bruxism and its other sleep states from non-verbal audio features.
In step S5, the criterion for selecting the optimal model among the different models is the value of the F-score, a combined trade-off between precision (Precision) and recall (Recall). The F-score is introduced as a comprehensive index, i.e. to balance the influence of precision and recall and evaluate a classifier more comprehensively; it is the harmonic mean of precision and recall.
F1 (i.e. the F-score) is calculated as:
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively: TP is a positive sample predicted by the model as the positive class, TN is a negative sample predicted as the negative class, FP is a negative sample predicted as the positive class, and FN is a positive sample predicted as the negative class.
Applying the F1-score selection criterion to the above network models, the results are summarized as follows:
Model Score
0 Cat-boost 0.88827
1 Cross Gradient Booster 0.84358
2 Random Forest 0.83240
3 Support Vector Machine 0.76536
4 Logistic Regression 0.74302
5 Cross Gradient Booster(Random Forest) 0.72905
6 Neural Nets 0.71229
7 KNN 0.62011
8 Naive Bayes 0.60335
9 Stochastic Gradient Descent 0.58380
10 Decision trees 0.58380
Among these 11 commonly used machine learning network models, Catboost scores highest. The precision and recall on the test set for the bruxism-related sounds (teeth-grinding and teeth-chattering) and the 14 other sleep states are summarized in the following table:
Sleep state Precision Recall F1-score
coughing 0.85 0.96 0.90
crying 1.00 0.84 0.91
laughing 0.77 0.59 0.67
lip-popping 0.90 0.96 0.93
lip-smacking 0.79 0.83 0.81
moaning 0.85 0.88 0.87
nose-blowing 0.88 0.84 0.86
panting 0.92 0.92 0.92
screaming 1.00 0.97 0.98
sighing 1.00 0.80 0.89
sneezing 0.93 0.81 0.87
teeth-chattering 0.94 0.91 0.92
teeth-grinding 0.83 0.86 0.84
throat-clearing 0.71 0.95 0.82
tongue-clicking 0.96 0.96 0.96
yawning 0.94 0.89 0.91
Embodiments of the present invention comprise two parts: a network model training part and an implementation example part. The network model training part is shown in FIG. 1, and the implementation can be built as the following simple system in the manner of FIG. 5:
the invention is based on the method of the non-language audio frequency characteristic recognition and the real-time diagnosis of the night-time tooth and other sleep states, adopting an audio acquisition module, an MCU microcontroller, a power module, a signal processing unit, a storage module and an OLCD display screen; the MCU drives an audio acquisition module to acquire 1s of unit audio signal each time, the format is lossless WAV standard format audio, and the sampling frequency is 44100HZ; after the audio module acquires the unit audio, the signal processing unit calculates the characteristic group of the unit audio and stores the data in the memory; inputting the characteristic group into the trained neural network, outputting specific indication parameters corresponding to the specific sleep molar state and reserving the indication parameters. The OLCD display module is used for realizing the identification instruction of the occurrence of the molar. The MCU identifies the occurrence of tooth grinding through the algorithm processing of fig. 4, and the identification result instruction is displayed on a display screen, so that the OLCD can carry out decision identification on each input signal under the control of the MCU and display the signals in real time; the storage module is used for storing the total count of the occurrence times of the whole late teeth grinding and storing the characteristic parameters. Providing reliable data basis for doctor's evaluation and subsequent treatment.

Claims (8)

1. A real-time sleep bruxism diagnosis method based on non-verbal audio feature recognition, characterized by comprising the following specific steps:
step S1, audio acquisition: an audio acquisition module is used for acquiring audio data in a sleep environment;
step S2, framing and downsampling: converting the collected audio data into short-time stationary audio signals by framing in the time domain, and halving the sampling frequency;
step S3, feature extraction: extracting time-domain features and frequency-domain features from the framed audio samples;
step S4, feature group summarization: summarizing, from the per-frame feature arrays, feature groups of the variance and mean distributions of the various time-domain features and of the various frequency-domain features;
step S5, machine learning model selection: based on the time-domain, frequency-domain and time-frequency-domain feature groups, inputting them into several machine learning models and selecting the optimal model for the diagnosis and evaluation of bruxism.
2. The method according to claim 1, characterized in that in step S1, using an audio acquisition module to acquire audio data in the sleep environment means that, in the natural sleep environment, a microphone fixed in the central region of the bed, 30-50 cm vertically above the head of the sleep bruxism patient, collects the sound; single-channel data are obtained with a sampling frequency of 44100 Hz, a sampling depth of 16 bits and the lossless WAV audio format, and the length of each captured audio data sample is 1 s.
3. The method according to claim 2, characterized in that in step S2, the collected audio data are converted into short-time stationary audio signals by framing in the time domain, and the sampling frequency is halved; specifically: the collected audio data are downsampled, and the non-stationary audio signal is converted into short-time stationary audio signals by framing in the time domain, with a frame size of 1024 sampling points and a hop (frame shift) of 512 points.
4. The method according to claim 3, characterized in that step S3 of extracting time-domain features and frequency-domain features from the framed audio samples comprises the following sub-steps:
step S3-1, extracting time-domain features, namely AE, RMS and ZCR, calculated specifically as follows:
the AE formula is:
$$AE_t = \max_{k = tK}^{(t+1)K - 1} |s(k)|, \tag{1}$$
the RMS calculation formula is:
$$RMS_t = \sqrt{\frac{1}{K} \sum_{k = tK}^{(t+1)K - 1} s(k)^2}, \tag{2}$$
the ZCR calculation formula is:
$$ZCR_t = \frac{1}{2} \sum_{k = tK}^{(t+1)K - 1} \left| \operatorname{sgn}\big(s(k)\big) - \operatorname{sgn}\big(s(k+1)\big) \right|, \tag{3}$$
wherein s(k) denotes the audio signal and k the index of its sampling points; K denotes the number of sampling points in one frame, and t denotes the t-th frame, so that tK is the position of the first sampling point of the t-th frame and (t+1)K-1 is the position of its last sampling point; sgn is the sign function;
step S3-2, extracting frequency domain features, wherein the frequency domain features are BER, SC, BW, SR, and the calculation mode is as follows:
BER is calculated as:
$$BER_t = \frac{\sum_{n=1}^{F-1} m_t(n)^2}{\sum_{n=F}^{N} m_t(n)^2}, \tag{4}$$
the SC calculation formula is:
$$SC_t = \frac{\sum_{n=1}^{N} n \, m_t(n)}{\sum_{n=1}^{N} m_t(n)}, \tag{5}$$
BW is calculated as:
$$BW_t = \frac{\sum_{n=1}^{N} |n - SC_t| \, m_t(n)}{\sum_{n=1}^{N} m_t(n)}, \tag{6}$$
the SR calculation formula is:
$$\sum_{i=1}^{f_c} m_i \geq \eta \sum_{i=1}^{N} m_i, \tag{7}$$
in the above formulas, t denotes the t-th frame and m_t(n) the magnitude of the n-th spectral point of frame t; F is the split frequency (bin) chosen for the band-energy ratio, N is the number of spectral points per frame, f_c is the roll-off point satisfying the inequality, η is the chosen roll-off proportion, and m_i denotes the spectral energy of point i;
in terms of frequency-domain features, the complementary harmonic and perceptual (Harmonics and Perceptual) features are calculated as:
$$H(t) = A \sin(2\pi f t + \varphi), \tag{8}$$
$$S = K \cdot 10^{L/10}, \tag{9}$$
wherein H (t) is the amplitude function of the harmonic, representing the amplitude of the harmonic at time t; a is the amplitude of the harmonic, representing the maximum amplitude of the harmonic; f is the frequency of the harmonic in hertz; t is time in seconds; phi is the phase of the harmonic, representing the phase difference of the harmonic relative to the fundamental frequency; s is sound pressure level, L is sound intensity level, K is constant;
step S3-3, extracting the time-frequency-domain features, namely the mel-frequency cepstral coefficients (MFCC); an FFT (fast Fourier transform) is applied to each frame from step S2 to obtain the spectrum and then the magnitude spectrum, a mel filter bank is applied to the magnitude spectrum, the logarithm of the filter-bank outputs is taken, and finally a discrete cosine transform (DCT) yields the MFCC, expressed as:
$$MFCC_i = \sum_{l=1}^{L} m(l) \cos\!\left[\frac{\pi i}{L}\left(l - \frac{1}{2}\right)\right], \tag{10}$$
where i indexes the i-th MFCC coefficient, N is the number of sampling points per frame, l indexes the l-th mel filter, m(l) is the (log) output of the l-th filter, and L is the total number of filters.
5. The method according to claim 4, characterized in that in step S4, the feature groups of the time-domain variance and mean distributions and of the frequency-domain variance and mean distributions are summarized from the per-frame feature arrays; specifically, the features of each frame are computed with the feature formulas of step S3 to obtain a feature array per feature, and from each feature array the variance-distribution and mean-distribution features are obtained, giving the following 56 feature values in total, specifically:
'envelope_mean','envelope_var','RMS_mean','RMS_var','zero_mean','zero_var',
'centroids_mean','centroids_var','bandwith_mean','bandwidth_var','rolloff_mean',
'rolloff_var','harmonics_mean','harmonics_var','perceptrual_mean','perceptrual_var',
'mfcc1_mean','mfcc2_mean','mfcc3_mean','mfcc4_mean','mfcc5_mean','mfcc6_mean',
'mfcc7_mean','mfcc8_mean','mfcc9_mean','mfcc10_mean','mfcc11_mean','mfcc12_mean',
'mfcc13_mean','mfcc14_mean','mfcc15_mean','mfcc16_mean','mfcc17_mean','mfcc18_mean','mfcc19_mean','mfcc20_mean',
'mfcc1_var','mfcc2_var','mfcc3_var','mfcc4_var','mfcc5_var','mfcc6_var','mfcc7_var',
'mfcc8_var','mfcc9_var','mfcc10_var','mfcc11_var','mfcc12_var','mfcc13_var',
'mfcc14_var','mfcc15_var','mfcc16_var','mfcc17_var','mfcc18_var','mfcc19_var','mfcc20_var'.
6. The method according to claim 5, characterized in that in step S5, a machine learning method is used to construct the optimal prediction model for sleep bruxism and its other sleep states, for the diagnosis of bruxism and the evaluation of the other sleep states, specifically:
the feature values from step S4 are input into several machine learning network models; the models are compared, and the optimal model for real-time bruxism diagnosis is selected; the criterion for selecting the optimal model among the different models is the value of the F-score; the F-score is a combined trade-off between precision (Precision) and recall (Recall), and is the harmonic mean of precision and recall:
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},$$
wherein TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives respectively; f1 is F-score.
7. The method of claim 6, wherein the machine learning network models comprise Naive Bayes, K-nearest neighbor, stochastic gradient descent, decision tree, random forest, support vector machine, logistic regression, neural network, Catboost, Cross Gradient Booster, and Cross Gradient Booster (Random Forest).
8. A real-time sleep bruxism diagnosis system based on non-verbal audio feature recognition according to the real-time diagnosis method of one of claims 1 to 7, characterized by comprising the following modules: an audio acquisition module, a framing and downsampling module, a feature extraction module, a feature group summarization module, and a machine learning model selection module; these 5 modules in turn perform the 5 steps of the real-time diagnosis method.
CN202310091119.3A 2023-02-09 2023-02-09 Real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition Pending CN116312647A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310091119.3A CN116312647A (en) 2023-02-09 2023-02-09 Real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310091119.3A CN116312647A (en) 2023-02-09 2023-02-09 Real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition

Publications (1)

Publication Number Publication Date
CN116312647A true CN116312647A (en) 2023-06-23

Family

ID=86798732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310091119.3A Pending CN116312647A (en) 2023-02-09 2023-02-09 Real-time sleep bruxism diagnosis method and system based on non-verbal audio feature recognition

Country Status (1)

Country Link
CN (1) CN116312647A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination