CN118016106A - Elderly emotion health analysis and support system - Google Patents


Info

Publication number
CN118016106A
Authority
CN
China
Prior art keywords: voice, signal, emotion, frequency, representing
Legal status: Pending
Application number
CN202410411579.4A
Other languages
Chinese (zh)
Inventor
朱彤
姜惠
李曼曼
王莹
王惠
Current Assignee: Shandong Provincial Hospital Affiliated to Shandong First Medical University
Original Assignee: Shandong Provincial Hospital Affiliated to Shandong First Medical University
Application filed by Shandong Provincial Hospital Affiliated to Shandong First Medical University
Priority to CN202410411579.4A
Publication of CN118016106A


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the technical field of voice analysis, and in particular to an emotion health analysis and support system for the elderly. The system comprises a voice acquisition device, a voice processing device and a voice emotion analysis device. The voice acquisition device acquires voice signals of a target elderly person within a set time period. The voice processing device applies a pre-emphasis filter to the collected voice signal to balance its spectrum, obtains a pre-processed signal, extracts MFCC features, fundamental frequency features and energy features from the pre-processed signal, and assembles these features into a feature vector. The voice emotion analysis device performs emotion analysis on the feature vectors of the voice segments using a pre-trained voice emotion analysis model, judges the emotion characteristics of the voice segments, and issues an early-warning signal when emotion intervention is required. The invention can monitor the emotional state of the elderly in real time, accurately identify emotion characteristics, issue early-warning signals in time, and perform emotion intervention.

Description

Elderly emotion health analysis and support system
Technical Field
The invention belongs to the technical field of voice analysis, and particularly relates to an emotion health analysis and support system for the elderly.
Background
With the continued development of society and the acceleration of population aging, the emotional health of the elderly is attracting increasing attention. The emotional health of the elderly has an important influence on their quality of life and social participation, so developing a system that can monitor and support the emotional health of the elderly in a timely manner is of great significance. Traditional emotional health monitoring mainly relies on face-to-face evaluation by medical institutions or professionals; it consumes substantial resources, is costly, offers poor real-time performance, and is difficult to scale to large-scale emotional health monitoring. It is therefore important to develop a system that can automatically monitor the emotional well-being of the elderly and provide support and intervention when necessary.
In recent years, as the fields of speech processing and emotion analysis have developed, more and more research has focused on emotion recognition and health monitoring using speech signals. Traditional speech emotion analysis methods rely mainly on machine-learning models such as support vector machines (SVM) and deep neural networks (DNN). These methods extract features from speech signals and train models on speech samples labeled with emotion categories to realize emotion recognition and classification. However, they often require a large amount of labeled data and a complex model-training process, and suffer from problems such as low accuracy and poor generalization in practical applications.
In addition to machine-learning-based methods, some methods based on signal processing and feature extraction have been proposed for speech emotion analysis, for example emotion recognition using features such as the fundamental frequency, energy and Mel-frequency cepstral coefficients (MFCC) of the speech signal. These methods generally offer better real-time performance and applicability, but their accuracy and robustness still need improvement.
In addition, some attempts at systems for monitoring the emotional health of the elderly have been made in academia and industry. These systems typically include components such as voice acquisition devices, emotion analysis algorithms and supporting intervention mechanisms, and are intended to monitor and support emotional well-being through analysis of the voice signals of the elderly. However, existing systems often lack deep analysis of the voice characteristics of the elderly and effective emotion recognition algorithms, so their performance in practical applications is poor, with problems such as high false-alarm rates and low accuracy.
Therefore, in the field of emotional health analysis and support for the elderly, a system is needed that combines voice processing technology with an emotion analysis algorithm to accurately monitor and support the emotional state of the elderly. Such a system should be able to efficiently extract and analyze the features of the voice signals of the elderly and, combined with an advanced emotion analysis algorithm, accurately judge the emotional state of the elderly and intervene in a timely manner. At the same time, the system should also take the personalized needs and privacy protection of the elderly into account, to ensure its acceptability and reliability in practical applications.
Disclosure of Invention
The main purpose of the invention is to provide an emotion health analysis and support system for the elderly, which realizes comprehensive monitoring and support of the emotional health of the elderly through the collection, processing and emotion analysis of voice signals, and which has important application prospects and social significance. The invention can monitor the emotional state of the elderly in real time, accurately identify emotion characteristics, issue early-warning signals in time, and perform emotion intervention and support, thereby providing effective guarantee and support for the emotional health of the elderly.
In order to solve the problems, the technical scheme of the invention is realized as follows:
An emotion health analysis and support system for the elderly, the system comprising a voice acquisition device, a voice processing device and a voice emotion analysis device; the voice acquisition device is used for acquiring voice signals of a target elderly person within a set time period and performing a preliminary signal analysis on the voice signals to judge whether voice emotion analysis is needed, specifically comprising: statistically analyzing the average voice energy and the voice frequency duty ratio of the voice signal, the average voice energy being defined as the ratio of the total energy of the voice signal within the set time period to the length of the time period, and the voice frequency duty ratio being defined as the ratio of the length of the voice signal within the set time period to the length of the time period; if the average voice energy or the voice frequency duty ratio is within its corresponding threshold range, it is judged that voice emotion analysis is not needed, otherwise it is judged that voice emotion analysis is needed; the voice processing device is used for applying a pre-emphasis filter to the collected voice signal to balance its spectrum when the voice acquisition device judges that voice emotion analysis is needed, obtaining a pre-processed signal, extracting MFCC features, fundamental frequency features and energy features from the pre-processed signal, assembling these features into a feature vector, and dividing the pre-processed signal into voice segments and non-voice segments using a zero-crossing-rate method based on the feature vector; the voice emotion analysis device is used for performing emotion analysis on the feature vectors of the voice segments using a pre-trained voice emotion analysis model and judging the emotion characteristics of the voice segments, the emotion characteristics comprising negative emotion, neutral emotion and positive emotion; if, within the set time period, the ratio of the total number of frames of the voice segments with negative emotion characteristics to the length of the time period exceeds a set threshold, the target elderly person is judged to be in a negative emotion and an early-warning signal requiring emotion intervention is issued.
Further, the voice acquisition device comprises an acquisition unit, an enhancement unit, a preliminary analysis unit and a noise separation unit; the acquisition unit is used for judging, by voice recognition, whether a voice signal detected within the set time period is uttered by the target elderly person, and if so, collecting the voice signal; the enhancement unit is used for performing signal enhancement on the voice signal to obtain an enhanced voice signal; the preliminary analysis unit performs the preliminary signal analysis on the voice signal to judge whether voice emotion analysis is needed; the noise separation unit is used for separating background noise from the voice signal when it is judged that voice emotion analysis is needed.
Further, the enhancement unit performs signal enhancement on the voice signal to obtain the enhanced voice signal as follows: a short-time Fourier transform based on an autoregressive model is applied to the speech signal x(n) to obtain a time-frequency representation X(m, k), wherein m denotes the time-slice index of the short-time Fourier transform, k denotes the frequency index, N is the window length of each time segment, w(n) is the window function, a_p are the coefficients of the autoregressive model, P is the order of the autoregressive model, j is the imaginary unit, and n is the time-domain index; the time-frequency representation is then enhanced with a Wiener filter having a nonlinear dynamic-range compression characteristic, wherein P_n(m, k) and P_x(m, k) denote the power-spectrum estimates of the noise and of the speech signal, respectively, and S(m, k) is the frequency-domain representation of the enhanced speech signal; the enhanced frequency-domain signal S(m, k) is subjected to an inverse short-time Fourier transform to obtain the enhanced speech signal.
Further, when it is judged that voice emotion analysis is needed, the noise separation unit separates the background noise from the voice signal as follows: the speech signal is represented as a waveform function x(t) in the time domain and converted to the frequency domain by a short-time Fourier transform, giving a frequency-domain representation X(f); let the length of the time period be T, and segment the signal with a window function of window length L_w, with an overlap length L_o between adjacent windows; the window function is chosen as a Hamming window, defined as w(n) = 0.54 − 0.46·cos(2πn/(L_w − 1)), where n is the sample index within the window; the window function is applied to each time segment of the speech signal and zero padding is applied to extend each segment to length L_w, giving the time-domain windowed signals x_i(n); a discrete Fourier transform is applied to each windowed signal x_i(n) to obtain its frequency-domain representation X_i(f); the background noise is assumed to be stationary and linearly superimposed on the speech signal, and is modeled and estimated in the frequency domain using an adaptive filter; let S(f) denote the clean spectrum of the speech signal and N(f) the spectrum of the background noise, and let H_t(f) denote the frequency-domain response of the adaptive filter at time t; using the adaptive filter H_t(f), the speech signal is reconstructed in the frequency domain to obtain the reconstructed signal Y_i(f); the reconstructed signal is converted back to the time domain to obtain the speech signal y(t) from which the background noise has been separated.
Further, when the voice acquisition device judges that voice emotion analysis is needed, the voice processing device applies a pre-emphasis filter to the collected voice signal to balance the spectrum and then divides the signal into overlapping frames:
y(n) = x(n) − α·x(n − 1),    y_i(n) = y(i·S + n), 0 ≤ n < L,
where x(n) is the original signal, y(n) is the pre-emphasized signal, α is the pre-emphasis coefficient, i denotes the i-th frame, L is the frame length, and S is the frame shift, which determines the overlap between adjacent frames.
Further, the voice processing device applies a window function to each frame of the signal, applies a discrete Fourier transform, and then processes the result with a Mel filter bank to extract the MFCC features:
X_i(k) = Σ_{n=0}^{N−1} x_i(n)·e^{−j2πkn/N},    E_i(m) = ln( Σ_{k=0}^{K−1} |X_i(k)|²·H_m(k) ),    c_i(q) = Σ_{m=1}^{M} E_i(m)·cos( πq(m − 0.5)/M ),
where x_i(n) is the n-th sample in the i-th speech frame; X_i(k) is the k-th frequency-domain coefficient after the discrete Fourier transform; N is the number of points of the discrete Fourier transform and also the number of samples per speech frame; K, equal to N/2 + 1, represents the number of independent frequency components in the discrete Fourier transform; H_m(k) is the gain of the m-th Mel filter at the k-th frequency point; the Mel filters are a group of overlapping triangular band-pass filters that simulate the frequency perception of the human ear and perform a nonlinear Mel-scale conversion of the frequency axis; M is the number of Mel filters, i.e. the number of frequency bands divided on the Mel scale; E_i(m), obtained by applying the m-th Mel filter to the discrete Fourier transform coefficients and taking the logarithm, represents the logarithmic energy of the m-th frequency band; c_i(q) is the q-th Mel-frequency cepstral coefficient, obtained by applying a discrete cosine transform to E_i(m), which converts the logarithmic energy spectrum of the Mel filters into time-domain cepstral coefficients, reduces the correlation between features, and highlights the shape features of the spectrum; Q is the number of MFCC features finally extracted.
Further, the fundamental frequency feature F0 is calculated as
F0 = f_s / τ_max,
where R(τ) is the autocorrelation function of the speech frame, τ_max is the position of the peak of R(τ), and f_s is the sampling frequency; the energy feature ΔE_i is calculated as
ΔE_i = E_i − E_{i−1},    E_i = Σ_n x_i(n)²,
where E_i is the energy of the i-th speech frame; the resulting feature vector is
F = [F0, ΔE, c(1), …, c(Q)].
Further, the pre-trained speech emotion analysis model is a three-class support vector machine model whose category label set contains three emotion categories, each element corresponding to one emotion type (negative, neutral and positive emotion); the speech emotion analysis model is expressed as a constrained optimization problem in which w denotes the normal vector of the decision hyperplane, used to define the classification boundary; b denotes the bias term, also called the intercept, a parameter of the support vector machine model used to translate the classification boundary; v denotes the weight vector of an additional decision function, used to define an additional classification boundary; d denotes the bias term of the additional decision function, used to translate the additional classification boundary; ξ_i denotes the slack variables, representing the extent to which deviation from the hyperplane is allowed; C denotes the regularization parameter, which controls the importance of the slack variables; the larger its value, the more severely misclassification is penalized; z_i denotes an indicator variable; and g(x) is the additional decision function. Feature vectors of historical voice segments and their corresponding category labels are collected; part of this model training-and-test data is used as training data to train the speech emotion analysis model, finding an optimal decision boundary that classifies the samples in the training set as correctly as possible while preserving the generalization ability of the model; the remaining part of the data is used as test data to evaluate the trained support vector machine model, with accuracy as the evaluation index; if the accuracy exceeds a set accuracy threshold, training stops, otherwise the parameters of the model are adjusted and training continues until the accuracy exceeds the set accuracy threshold.
Further, the additional decision function g(x) is a linear decision function defined by the additional weight vector v and the additional bias term d.
The emotion health analysis and support system for the elderly has the following beneficial effects:
Firstly, through the combined use of the voice signal acquisition device, the processing device and the emotion analysis device, the invention realizes comprehensive monitoring of the emotional state of the elderly. The voice acquisition device collects the voice signals of the target elderly person within a set time period and performs a preliminary analysis to judge whether emotion analysis is needed. The collected speech signal is passed through the pre-emphasis filter of the processing device to balance the spectrum, then split into overlapping frames, and features such as MFCC, fundamental frequency and energy features are extracted in preparation for the subsequent emotion analysis. The emotion analysis device analyzes the extracted features with a pre-trained voice emotion analysis model and judges the emotional state of the elderly person, thereby realizing real-time monitoring of the emotional health of the elderly.
Secondly, the invention adopts an emotion analysis model based on machine learning, combined with methods such as the support vector machine, so that the emotional state of the elderly can be judged accurately. Using the collected voice features, the pre-trained voice emotion analysis model can effectively identify the emotion characteristics in the voice signals, including positive emotion, neutral emotion and negative emotion, and thus accurately judge the emotional state of the elderly. Through training and testing of the model, the model parameters can be continuously optimized, improving the accuracy and robustness of the emotion analysis and providing reliable support for monitoring the emotional health of the elderly.
Thirdly, the invention provides a real-time intervention mechanism that can issue early-warning signals in time when the emotional state of the elderly person is abnormal, so that emotion intervention and support can be carried out. By monitoring the emotion analysis results, the system can, when it detects that the elderly person is in a negative emotional state, issue an early-warning signal in time to prompt relevant staff or family members to intervene and provide support. Such a timely intervention mechanism can effectively prevent emotional health problems of the elderly and improve their quality of life and well-being.
Drawings
Fig. 1 is a schematic structural diagram of the elderly emotion health analysis and support system according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solution in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
Example 1: referring to fig. 1, an affective health analysis and support system for elderly people, the system comprising: the voice processing device is used for processing the voice emotion analysis; the voice acquisition device is used for acquiring voice signals of the target old people in a set time period, and performing signal preliminary analysis on the voice signals to judge whether voice emotion analysis is needed or not, and specifically comprises the following steps: statistically analyzing the voice average energy and voice frequency duty ratio of the voice signal; the average energy of the voice is defined as the ratio of the total energy of the voice signal to the time period in a set time period; the voice frequency duty ratio is defined as the ratio of the length of a voice signal to the time period in a set time period; if the average energy of the voice or the duty ratio of the voice frequency are in the respective corresponding threshold range, judging that voice emotion analysis is not needed, otherwise, judging that voice emotion analysis is needed; the voice processing device is used for applying a pre-emphasis filter to the collected voice signal to balance the frequency spectrum when the voice collecting device judges that voice emotion analysis is needed, obtaining a pre-processing signal, extracting MFCC characteristics, fundamental frequency characteristics and energy characteristics from the pre-processing signal, taking the MFCC characteristics, the fundamental frequency characteristics and the energy characteristics as elements in characteristic vectors to form the characteristic vectors, and dividing the pre-processing signal into a voice section and a non-voice section by using a zero-crossing rate method based on the characteristic vectors; the voice emotion analysis device is used for performing emotion analysis on the feature vector of the voice segment by using a pre-trained voice emotion analysis model, and judging emotion characteristics of the voice segment, wherein the emotion characteristics comprise: if the ratio of the total frame number of the voice segments in the negative emotion characteristics to the length of the time period exceeds a set threshold value in a set time period, the target old person is judged to be in the negative emotion, and an early warning signal requiring emotion intervention is sent.
Specifically, the voice acquisition device captures the voice signal of the target elderly person within the set time period using a built-in microphone or external equipment. The acquisition device starts to work when the target elderly person interacts with the system or a voice signal occurs during a monitoring period set by the system. The speech signal is transmitted as an analog electrical signal into the system for further processing and analysis. The voice processing device receives the voice signal from the voice acquisition device and performs pre-processing and feature extraction on it. In the pre-processing stage, a pre-emphasis filter is typically applied to balance the spectrum and compensate the high-frequency part of the speech signal. Sound features such as Mel-frequency cepstral coefficients (MFCC), fundamental frequency features and energy features are then extracted from the pre-processed signal. These feature-extraction steps can be implemented with digital signal processing techniques.
MFCC features are a set of coefficients obtained by fourier transforming a speech signal and then discrete cosine transforming the spectrum on the mel frequency scale. These coefficients reflect the characteristics of the speech signal in the mel frequency domain, including the harmonic structure and formants of sound. The MFCC features can well characterize the spectral structure and acoustic properties of the speech signal, and more fully describe the speech content of the speech signal. In emotion analysis, the MFCC features may capture information such as sound quality, pitch variation, etc. of the speech signal, thereby providing important features that help determine emotion. The fundamental frequency is the most dominant frequency component in sound and determines the pitch of the sound. Fundamental frequency features are typically used to describe the pitch or tone of a sound. The fundamental frequency features can capture pitch variations of the speech signal, such as pitch and heave, intonation variations, etc. These tone changes are often closely related to emotional states, such as high tones during high mood, low tones during depression, etc. Therefore, the fundamental frequency features are of great significance for emotion analysis. The fundamental frequency is the periodic oscillation of the lowest frequency in the speech signal, typically corresponding to the pitch of the sound. Fundamental frequency features are typically used to describe the pitch or level of the sound of a speech signal. The fundamental frequency characteristic is selected because pitch is one of the important indicators of emotion expression in a speech signal. People tend to change the pitch of sound in different emotional states, e.g., sound higher when pleasant and sound lower when depressed. Thus, the fundamental frequency features can provide important clues about emotion information in the speech signal. The energy characteristic is the energy distribution of the speech signal and is generally used to describe the intensity or volume of the speech signal. The energy characteristics are chosen because emotional expressions are often accompanied by changes in the intensity and volume of sound. For example, the sound may be loud when angry and may be less loud when sad. Thus, the energy signature can provide important information about the emotional intensity in the speech signal.
The energy, fundamental frequency, and MFCC represent different aspects of the speech signal. The energy features reflect the intensity and amplitude variations of the speech signal, the fundamental frequency features reflect the pitch variations of the speech signal, and the MFCC features extract the spectral features of the speech signal. The three characteristics are combined for use, so that various emotion related information in the voice signal can be more comprehensively captured, including emotion intensity, pitch variation, voice frequency spectrum characteristics and the like. The three features of energy, fundamental frequency and MFCC each represent different aspects of the speech signal, and the combined use of these features can improve the accuracy of the emotional state of the speaker. For example, when the energy of sound increases, the fundamental frequency becomes high, and the spectral characteristics of the sound change, this indicates that the speaker is in an angry or excited emotional state; and when the energy of the sound decreases, the fundamental frequency becomes low, and the spectral characteristics of the sound change, this indicates that the speaker is in a sad or depressed emotional state. Because each feature has its unique information expression pattern, there is complementarity and independence between them. Therefore, the three characteristics are combined for use, so that the robustness of the emotion analysis system can be enhanced, the sensitivity to environmental noise and individual difference of speakers is reduced, and the generalization capability and applicability of the system are improved. The three features of energy, fundamental frequency, and MFCC can be seen as abstract representations of different aspects of the speech signal. The combination of the features can provide a more comprehensive and multidimensional feature space, which is helpful for better describing emotion information in the voice signal and improving the performance of an emotion analysis system.
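For the zero-crossing-rate segmentation step mentioned in this embodiment, the following Python sketch illustrates one plausible realization; the frame length, frame shift and the energy and zero-crossing-rate thresholds are assumed values, not parameters specified in the patent.

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.sign(frame)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return float(np.mean(signs[:-1] != signs[1:]))

def segment_voice(signal: np.ndarray, frame_len: int = 400, frame_shift: int = 160,
                  zcr_max: float = 0.25, energy_min: float = 1e-4) -> np.ndarray:
    """Return a boolean mask per frame: True = voice frame, False = non-voice frame.

    A frame is treated as voice when its short-time energy is high enough and its
    zero-crossing rate is below a threshold (voiced speech crosses zero less often
    than broadband noise). Thresholds here are illustrative assumptions.
    """
    n_frames = max(0, 1 + (len(signal) - frame_len) // frame_shift)
    mask = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = signal[i * frame_shift : i * frame_shift + frame_len]
        energy = np.mean(frame ** 2)
        mask[i] = (energy >= energy_min) and (zero_crossing_rate(frame) <= zcr_max)
    return mask

# Example: 1 s of silence followed by a 200 Hz tone at 16 kHz
fs = 16000
t = np.arange(fs) / fs
sig = np.concatenate([np.zeros(fs), 0.5 * np.sin(2 * np.pi * 200 * t)])
print(segment_voice(sig).astype(int))
```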
Example 2: the voice acquisition device comprises: the device comprises an acquisition unit, an enhancement unit, a preliminary analysis unit and a noise separation unit; the acquisition unit is used for judging whether the sent voice signal is sent by the target old people or not through voice recognition in a set time period, and if yes, acquiring the voice signal; the enhancement unit is used for carrying out signal enhancement on the voice signal to obtain an enhanced voice signal; the primary analysis unit performs signal primary analysis on the voice signal to judge whether voice emotion analysis is needed; the noise separation unit is used for separating background noise from the voice signal when the voice emotion analysis is judged to be needed.
Specifically, voice recognition is used to judge whether a detected voice signal was uttered by the target elderly person, so as to avoid subsequent processing and analysis of invalid voice signals: in daily life all kinds of irrelevant sounds occur, and analyzing and processing every voice signal indiscriminately would waste system resources. The noise separation unit separates the background noise in the environment from the target voice signal using noise-suppression algorithms such as adaptive filters or spectral subtraction, which improves the accuracy and reliability of the subsequent emotion analysis and lets the system focus better on the voice emotion analysis of the target elderly person.
The method of performing the preliminary signal analysis on the voice signal to judge whether voice emotion analysis is needed specifically comprises: statistically analyzing the average voice energy and the voice frequency duty ratio of the voice signal; the average voice energy is defined as the ratio of the total energy of the voice signal within the set time period to the length of the time period; the voice frequency duty ratio is defined as the ratio of the length of the voice signal within the set time period to the length of the time period; if the average voice energy or the voice frequency duty ratio is within its corresponding threshold range, it is judged that voice emotion analysis is not needed, otherwise it is judged that voice emotion analysis is needed. The average voice energy reflects the intensity of the voice signal over the period. By analyzing it, the volume of the voice signal can be assessed in a preliminary way, so as to judge whether the speaker is producing strong, emotionally colored speech: a high-energy voice signal suggests that the speaker is excited, angry or in another aroused emotional state, while a low-energy voice signal suggests that the speaker is calm or in a low mood. The voice frequency duty ratio reflects the proportion of the total period occupied by speech. By analyzing it, the speaker's voice activity, i.e. how much and how often the speaker talks within the period, can be assessed in a preliminary way: a higher duty ratio suggests larger emotional fluctuations and frequent utterances, while a lower duty ratio suggests a more stable or inactive emotional state. Through this basic feature analysis of the voice signal, the current emotional state of the speaker, including the intensity and activity of the emotion, is understood in a preliminary way. If both the energy and the frequency of the voice signal lie in a relatively calm range, the speaker's emotion is calm or normal and no further emotion analysis is required; conversely, if the energy and frequency of the voice signal exceed the preset threshold ranges, the speaker is in an emotional state such as excitement, anger or depression, and further speech emotion analysis is needed to better understand the speaker's emotional state and provide corresponding support and intervention. The enhancement unit applies signal-processing techniques such as filtering and noise reduction to the collected voice signals, removes interference such as noise and distortion, obtains a clearer and more reliable enhanced voice signal, and thus provides better input for the subsequent analysis.
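The preliminary screening described above (average voice energy and voice frequency duty ratio compared against threshold ranges) could look roughly like the following sketch; the threshold ranges are illustrative assumptions, and `voice_mask`/`frame_shift` are assumed to come from a separate voice-activity step such as the zero-crossing-rate segmentation sketched earlier.

```python
import numpy as np

def needs_emotion_analysis(signal: np.ndarray, fs: int, period_s: float,
                           voice_mask: np.ndarray, frame_shift: int,
                           energy_range=(0.001, 0.05), duty_range=(0.05, 0.6)) -> bool:
    """Preliminary screening described in the embodiment (threshold values are assumptions).

    - average voice energy  = total signal energy / length of the monitoring period
    - voice frequency ratio = total voiced duration / length of the monitoring period
    """
    avg_energy = np.sum(signal ** 2) / period_s
    voiced_seconds = np.count_nonzero(voice_mask) * frame_shift / fs
    duty_ratio = voiced_seconds / period_s

    energy_calm = energy_range[0] <= avg_energy <= energy_range[1]
    duty_calm = duty_range[0] <= duty_ratio <= duty_range[1]
    # Per the described rule: if either statistic is inside its threshold range,
    # no emotion analysis is needed; otherwise analysis is triggered.
    return not (energy_calm or duty_calm)
```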
Example 3: the enhancement unit is used for carrying out signal enhancement on the voice signal, and the method for obtaining the enhanced voice signal comprises the following steps: the following formula is used to apply the speech signalPerforming short-time Fourier transform based on autoregressive model to obtain time-frequency representation
The original speech signal is divided into several short periods of time, and a window function is used for truncation in each short period of time. The window function serves to limit the time domain length of the signal in each time segment and gradually attenuate it to zero to reduce the spectral leakage effects in time domain analysis. And in each time segment, converting the time domain signal into a frequency domain by adopting Fourier transformation to obtain the frequency spectrum representation of the voice signal in each time segment. This procedure makes it possible to observe the energy distribution of the speech signal at different frequencies, so that the acoustic properties of the speech signal are better understood. Furthermore, the second term in the formula is part of an autoregressive model for capturing the autocorrelation of the speech signal. The autoregressive model assumes that the voice signal at the current moment has a linear relation with the signals at a plurality of previous moments, and can obtain the autocorrelation information of the voice signal by solving the autoregressive coefficient, so that the representation capability of the voice signal is further enhanced.
Here m denotes the time-slice index of the short-time Fourier transform, k denotes the frequency index, N is the window length of each time segment, w(n) is the window function, a_p are the coefficients of the autoregressive model, P is the order of the autoregressive model, j is the imaginary unit, and n is the time-domain index. The time-frequency representation is then enhanced with a Wiener filter having a nonlinear dynamic-range compression characteristic.
The numerator of the Wiener gain, P_x(m, k), is the power spectrum of the original speech signal in the frequency domain, i.e. the distribution of the signal energy over frequency; computing it shows how strong the speech signal is at each frequency. The denominator involves the ratio between the power spectrum of the original speech signal and the noise power spectrum, where P_n(m, k)/P_x(m, k) is the power-spectrum ratio of the noise to the speech signal: the larger this ratio, the higher the proportion of noise in the signal, and conversely, the smaller the ratio, the lower the proportion of noise. Comparing the power spectra of the signal and of the noise in this way suppresses the noise during enhancement. The Wiener filter is a classical signal-processing filter; with its nonlinear dynamic-range compression characteristic it can effectively suppress noise and improve signal quality. The formula uses the principle of the Wiener filter, enhancing the signal by comparing the power spectrum of the original speech signal with the noise power spectrum, where P_n(m, k) and P_x(m, k) denote the power-spectrum estimates of the noise and of the speech signal, respectively, and S(m, k) is the frequency-domain representation of the enhanced speech signal; the enhanced frequency-domain signal S(m, k) is subjected to an inverse short-time Fourier transform to obtain the enhanced speech signal.
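The patent's exact autoregressive-model STFT and nonlinearly compressed Wiener filter are not reproduced here; the following sketch shows a conventional STFT-domain Wiener enhancement in the same spirit, with the assumption that the first half second of the recording contains noise only.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(noisy: np.ndarray, fs: int, noise_seconds: float = 0.5,
                   win_len: int = 512, hop: int = 256) -> np.ndarray:
    """STFT-domain Wiener enhancement used as a stand-in for the embodiment's
    AR-model-based transform and nonlinearly compressed Wiener filter."""
    f, t, X = stft(noisy, fs=fs, nperseg=win_len, noverlap=win_len - hop)
    noise_frames = max(1, int(noise_seconds * fs / hop))
    noise_psd = np.mean(np.abs(X[:, :noise_frames]) ** 2, axis=1, keepdims=True)

    speech_psd = np.maximum(np.abs(X) ** 2 - noise_psd, 1e-12)   # crude speech PSD estimate
    gain = speech_psd / (speech_psd + noise_psd)                  # Wiener gain in [0, 1]
    _, enhanced = istft(X * gain, fs=fs, nperseg=win_len, noverlap=win_len - hop)
    return enhanced[: len(noisy)]
```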
Example 4: the noise separation unit is used for separating background noise from a voice signal when the voice emotion analysis is judged to be needed, and the method comprises the following steps: representing a speech signal as a waveform function in the time domain; Converting it to the frequency domain by short-time fourier transformation, resulting in a frequency domain representation/>; Let the length of the time period be/>Length of use is/>Which is segmented by a window function of window length/>The overlap length between windows is/>; The window function selects a hamming window, defining the window function as:
Wherein, A sample index representing a window; extending the window function to a length/>, by applying it to each time segment of the speech signal and applying zero paddingTo obtain window signal/>, in time domain; For each window signalApplying a discrete fourier transform to obtain a frequency domain representation/>; Setting background noise to be steady state and linearly overlapped with the voice signal; modeling and estimating background noise using an adaptive filter on the frequency domain; let/>Representing clean spectrum of speech signal,/>A spectrum representing background noise; defining the frequency domain response of the adaptive filter as:
Wherein, Is at time/>An adaptive filter frequency domain response at; the adaptive filter/>, is used by the following formulaReconstructing the voice signal in the frequency domain to obtain a reconstructed signal:
Wherein, Reconstructing the signal; converting the reconstructed signal back to the time domain to obtain a speech signal/>, wherein the speech signal/>, is obtained by separating background noise from the speech signal
Specifically, the original speech signal is first represented as a waveform function x(t) in the time domain and then converted to the frequency domain by a short-time Fourier transform, giving the frequency-domain representation X(f); the purpose of this step is to move the speech signal from the time domain to the frequency domain so that it can be processed there. The frequency-domain represented speech signal is segmented, each segment having length L_w, and a Hamming window function is applied to reduce spectral-leakage effects; this divides the speech signal into a number of window segments that can be processed independently while the window function limits spectral leakage. Zero padding is applied to each windowed signal to extend it to length L_w, and a discrete Fourier transform is then applied to the extended window signal to obtain the frequency-domain representation X_i(f), converting each windowed signal to the frequency domain for processing there. Assuming that the background noise is stationary and linearly superimposed on the speech signal, the background noise is modeled and estimated with an adaptive filter in the frequency domain; the purpose of this step is to estimate the spectral characteristics of the background noise for the subsequent separation. From the clean spectrum S(f) of the speech signal and the spectrum N(f) of the background noise, the frequency-domain response H_t(f) of the adaptive filter is calculated; this derives the adaptive filter from the spectral characteristics of the speech signal and of the background noise for the subsequent reconstruction of the speech signal. Using the frequency-domain response of the adaptive filter, the speech signal is reconstructed in the frequency domain, giving the reconstructed signal Y_i(f); this de-noises the speech signal according to the filter response and separates out the background-noise component, yielding a cleaner speech signal. Finally, the reconstructed signal is converted back to the time domain to obtain the clean speech signal y(t) with the background noise separated out, ready for subsequent speech emotion analysis and other processing.
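As a rough illustration of the adaptive noise-separation idea (stationary noise estimated in the frequency domain and removed frame by frame), the sketch below uses a recursively smoothed noise-magnitude estimate in place of the patent's adaptive filter H_t(f); the smoothing constant and spectral floor are assumptions.

```python
import numpy as np

def separate_noise(noisy: np.ndarray, frame_len: int = 512, hop: int = 256,
                   alpha: float = 0.98, floor: float = 0.05) -> np.ndarray:
    """Frame-wise noise separation sketch with a recursively updated noise spectrum."""
    window = np.hamming(frame_len)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    noise_mag = None

    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag = np.abs(spec)

        # Recursive noise estimate: slowly track the magnitude spectrum (assumes
        # stationary noise that is linearly superimposed on the speech).
        noise_mag = mag if noise_mag is None else alpha * noise_mag + (1 - alpha) * mag

        # Filter response H = clean / (clean + noise), with spectral flooring.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        H = clean_mag / np.maximum(mag, 1e-12)

        rec = np.fft.irfft(H * spec, n=frame_len) * window   # overlap-add synthesis
        out[start:start + frame_len] += rec
        norm[start:start + frame_len] += window ** 2

    return out / np.maximum(norm, 1e-12)
```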
Example 5: and the voice processing device is used for balancing the frequency spectrum by applying a pre-emphasis filter to the collected voice signals when the voice collection device judges that voice emotion analysis is needed, and then dividing the voice signals into overlapped frames according to the following formula:
Wherein, Is the original signal; /(I)Is a pre-emphasized signal; /(I)Is a pre-emphasis coefficient; /(I)Represents the/>A frame; Is the frame length; /(I) Is a frame shift, representing the overlap between adjacent frames.
Specifically, the first formula, y(n) = x(n) − α·x(n − 1), expresses the application of the pre-emphasis filter. The pre-emphasis filter balances the spectrum of the speech signal and strengthens the energy of the high-frequency part so as to improve the signal-to-noise ratio. In this formula, x(n) is the clean speech signal after background-noise separation, y(n) is the pre-emphasized speech signal, and α is the pre-emphasis coefficient. Pre-emphasis compensates the attenuation that the high-frequency part of the speech signal suffers during transmission and thereby improves the clarity and recognizability of the speech signal. The second formula, y_i(n) = y(i·S + n), describes the division of the pre-emphasized speech signal into overlapping frames. Here i denotes the i-th frame, L is the frame length, and S is the frame shift (frame step), which determines the overlap between adjacent frames. Dividing the speech signal into frames segments it in time so that the signal within each frame can be regarded as short-time stationary, which makes subsequent feature extraction and analysis easier.
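A minimal sketch of the pre-emphasis and framing step follows; the pre-emphasis coefficient 0.97 and the 25 ms / 10 ms framing at 16 kHz are common defaults, not values fixed by the patent.

```python
import numpy as np

def pre_emphasize(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """y(n) = x(n) - alpha * x(n-1); alpha = 0.97 is a typical choice."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def split_frames(y: np.ndarray, frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Slice the pre-emphasized signal into overlapping frames of shape (n_frames, frame_len)."""
    if len(y) < frame_len:
        y = np.pad(y, (0, frame_len - len(y)))      # pad short signals to one full frame
    n_frames = 1 + (len(y) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return y[idx]
```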
Example 6: the voice processing device performs window function processing on each frame of signal, applies discrete Fourier transform, and then processes the signal through a Mel filter to extract the MFCC characteristics, and the formula is as follows:
Wherein, First/>First/>, in speech frameA sample number; /(I)Representing the discrete Fourier transform processed firstFrequency domain coefficients; /(I)The number of points representing the discrete fourier transform, and also the number of samples per speech frame; /(I)Equal toRepresenting the number of independent frequency components in the discrete fourier transform; /(I)Is/>The individual Mel filter is at the/>Gain for each frequency point; the Mel filter is a group of overlapped triangular band-pass filters, which are used for simulating the frequency perception of human ears and carrying out nonlinear Mel scale conversion on the frequency; /(I)The number of mel filters represents the number of frequency bands divided on the mel scale; /(I)By passing the (th) >The result of applying the individual Mel filter to the coefficients of the discrete Fourier transform and taking the logarithm is representative of the/>Logarithmic energy of the individual frequency bands; /(I)For/>The individual mel-frequency cepstrum coefficients are pair/>The discrete cosine transform result is applied to convert the logarithmic energy spectrum of the Mel filter into a cepstrum coefficient of a time domain, reduce the correlation between features, and highlight the shape features of the frequency spectrum; /(I)Is the number of MFCC features that are ultimately extracted.
Specifically, first, the Discrete Fourier Transform (DFT) portion of the equation functions to convert the time-domain speech signal into a frequency-domain representation. By performing discrete Fourier transform on each frame of voice signal, the voice signal can be analyzed on the frequency domain to obtain amplitude and phase information of different frequency components. This step is critical for speech emotion analysis because different emotional states tend to appear on the spectral structure of the speech signal, such as anger leading to an increase or decrease in the sound frequency and sadness leading to a change in the sound frequency. Next, the mel-filter processing section maps the discrete fourier-transformed frequency domain signal onto a mel-frequency scale. The mel frequency scale is more in line with the perception characteristics of the human auditory system, and important frequency components in the voice signal can be better captured. The mel filter divides the spectrum into a series of triangular band pass filters that weight energy in different frequency ranges to more accurately represent the perception of different frequencies by human hearing. The effect of this step is to emphasize the important frequency components of the speech signal while suppressing the unimportant frequency components, thereby improving the degree of differentiation and robustness of the features. Finally, the MFCC feature extraction portion converts the frequency domain energy spectrum processed by the mel filter into a set of cepstrum coefficients. These cepstrum coefficients reflect the spectral characteristics of the speech signal on the mel frequency axis, including the sound color and resonance characteristics of the speech signal. By employing a Discrete Cosine Transform (DCT), the correlation between features can be reduced and the frequency domain energy spectrum converted to a more compact and stable representation of the features. The MFCC features have better discrimination and robustness and are suitable for use in speech emotion analysis and other speech processing tasks.
The first part of the formula, X_i(k), represents the discrete Fourier transform (DFT) applied to each speech frame x_i(n). The discrete Fourier transform converts a time-domain signal into the frequency domain; it turns the speech signal from a time-domain representation into a frequency-domain representation and yields the frequency-domain coefficients X_i(k), where k is the frequency index and N the number of samples per speech frame. The second part, E_i(m), expresses the processing by the Mel filters. A Mel filter bank is a set of filters that imitates the frequency perception of the human ear and performs a nonlinear Mel-scale conversion of the spectrum along the frequency axis; H_m(k) is the gain of the m-th Mel filter at the k-th frequency point, and K is the number of independent frequency components in the discrete Fourier transform. Processing by the Mel filters converts the frequency axis to the Mel scale, which better matches the auditory characteristics of the human ear, and produces the logarithmic energy spectrum E_i(m). The third part of the formula, c_i(q), expresses the extraction of the MFCC features. MFCC is a commonly used speech-feature extraction method that applies a discrete cosine transform to the logarithmic energy spectrum E_i(m) to obtain a set of cepstral coefficients, the MFCCs; these coefficients reduce the correlation between features and highlight the spectral shape of the speech signal, providing a more discriminative feature representation.
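The MFCC chain described above (windowing, DFT, Mel filter bank, logarithm, DCT) can be sketched as follows; the number of Mel filters and of retained coefficients are common defaults rather than values from the patent.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, fs: int) -> np.ndarray:
    """Triangular filters spaced evenly on the mel scale; shape (n_filters, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(frames: np.ndarray, fs: int, n_filters: int = 26, n_mfcc: int = 13) -> np.ndarray:
    """Windowing -> DFT -> mel filter bank -> log -> DCT, following the chain described above."""
    n_fft = frames.shape[1]
    windowed = frames * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed, n=n_fft, axis=1)) ** 2
    log_energy = np.log(power @ mel_filterbank(n_filters, n_fft, fs).T + 1e-12)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```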
Example 7: fundamental frequency characteristicsThe method is calculated by the following formula:
Wherein, Is an autocorrelation function; /(I)Is/>Peak position of (2); /(I)Is the sampling frequency; energy characteristics/>The method is calculated by the following formula:
Wherein, For/>The energy of the individual speech frames; the feature vectors obtained are:
Specifically, the fundamental frequency feature F0 is obtained by analyzing the autocorrelation function R(τ). The autocorrelation function describes the correlation between a signal and a delayed copy of itself; by computing the autocorrelation values R(τ) of the speech frame x_i(n) at different delays τ, the peak position τ_max of the autocorrelation function can be found. The fundamental frequency feature F0 is defined as the ratio of the sampling frequency f_s to the peak position τ_max of the autocorrelation function; in speech-signal analysis it usually represents the fundamental frequency of the speech, i.e. the pitch of the sound. Secondly, the energy feature ΔE_i is obtained from the difference of the speech-frame energies E_i. The energy E_i is the sum of the squared amplitudes of all samples of the speech frame x_i(n) and indicates the overall energy of that frame, while the energy feature ΔE_i, the difference between the energies of adjacent speech frames, represents the energy variation between frames; it is commonly used to describe energy changes in speech, such as changes in sound intensity or the recognition of pauses and active portions of the speech. Finally, the feature vector F contains the fundamental frequency feature F0, the energy feature ΔE and the series of cepstral coefficients c(1), …, c(Q) extracted as MFCCs. The MFCC coefficients reflect the spectral characteristics of the speech signal on the Mel frequency axis and have good discriminability and robustness. By combining the fundamental frequency feature, the energy feature and the MFCC coefficients, the spectrum and the energy characteristics of the speech signal can be described more comprehensively, providing richer feature information for subsequent speech emotion analysis.
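The fundamental-frequency, delta-energy and feature-vector construction of this example might be sketched as follows; the pitch search range of 60-400 Hz is an assumption introduced for robustness and is not part of the patent text.

```python
import numpy as np

def fundamental_frequency(frame: np.ndarray, fs: int,
                          f_min: float = 60.0, f_max: float = 400.0) -> float:
    """F0 = fs / (lag of the strongest autocorrelation peak), searched inside a
    typical speech pitch range (the search bounds are assumptions)."""
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    lag_max = min(lag_max, len(ac) - 1)
    if lag_max <= lag_min:
        return 0.0
    peak_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / peak_lag

def frame_feature_vector(frames: np.ndarray, fs: int,
                         mfcc_matrix: np.ndarray, i: int) -> np.ndarray:
    """Feature vector of frame i: [F0, delta-energy, MFCC...], as described in the embodiment."""
    energy = np.sum(frames ** 2, axis=1)
    delta_e = energy[i] - energy[i - 1] if i > 0 else 0.0
    f0 = fundamental_frequency(frames[i], fs)
    return np.concatenate(([f0, delta_e], mfcc_matrix[i]))
```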
Example 8: the pre-trained voice emotion analysis model is a three-branch vector-holding model, and the category labels are as follows:
Three emotion characteristics and category labels Each element of the list corresponds to an emotion type; the speech emotion analysis model is expressed using the following formula:
Wherein, A normal vector representing a decision hyperplane for defining a classification boundary; /(I)Representing bias terms, also called intercepts, which are parameters in the support vector machine model for translating the classification boundaries; /(I)A weight vector representing an additional decision function for defining an additional classification boundary; /(I)A bias term representing an additional decision function for translating the additional classification boundary; /(I)Representing a relaxation variable, representing the extent to which deviation from the hyperplane is allowed; /(I)Representing regularization parameters,/>The importance of the relaxation variable is controlled, and the larger the value of the relaxation variable is, the more serious the punishment to misclassification is; /(I)Representing an indicator variable; /(I)Is an additional decision function; collecting feature vectors of historical voice segments and corresponding class labels, taking part of model training test data as training data, training a voice emotion analysis model, and finding an optimal decision boundary to correctly classify samples in a training set to the greatest extent and keeping generalization capability of the model; and taking the part of the model training test data except the training data as the test data, using the test data to evaluate the trained support vector machine model, wherein the evaluation index is the accuracy, stopping training if the accuracy exceeds a set accuracy threshold, otherwise, adjusting parameters of the model, and continuing training until the accuracy exceeds the set accuracy threshold.
In particular, Example 8 describes a support vector machine (SVM) model for speech emotion analysis whose goal is to classify an input speech feature vector x into one of three emotion categories: negative emotion, neutral emotion and positive emotion. The model learns an optimal decision boundary by optimizing an objective function so as to distinguish the different emotion classes of speech. The objective function consists of two parts. The first part minimizes the squared norm of the weight vector w, i.e. ||w||²; its goal is to find a suitable decision hyperplane that classifies the samples in the training set as correctly as possible. At the same time, to avoid over-fitting the training data, the objective function also contains a regularization term C·Σ_i ξ_i, where C is a regularization parameter that controls the complexity of the model and ξ_i are the slack variables, which allow a certain degree of misclassification. The second part is the set of constraints. The constraints ensure that the training samples are correctly classified with respect to the decision boundary: for each training sample, the margin obtained from its feature vector x_i and its corresponding category label y_i must be at least 1 − ξ_i. These constraints guarantee that most samples are classified correctly, while the slack variables ξ_i tolerate a limited number of classification errors. Additional constraints involve the auxiliary decision function g(x), which is used in parallel with the main decision function to improve the classification performance of the model; this auxiliary decision function is subject to similar constraints to ensure that it also classifies correctly. Furthermore, z_i is an indicator variable whose value is determined by the category label y_i and decides whether the additional constraint is enabled: if the category label y_i is the neutral emotion, z_i is set to 1 and the additional constraint is enabled, otherwise it is set to 0 and the additional constraint is not enabled. During model training, the feature vectors of historical voice segments and their corresponding category labels serve as the training-and-test data. This data is divided into training data used to train the model and test data used to evaluate its performance. The evaluation index is typically the accuracy, i.e. the proportion of correctly classified samples. If the accuracy reaches the set threshold, training stops; otherwise the model parameters are adjusted and training continues until the set accuracy threshold is reached.
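The training-and-evaluation loop of this example can be approximated with a standard three-class SVM, as in the sketch below; the custom objective with the additional decision function is replaced by scikit-learn's SVC, and the parameter grid and the 0.85 accuracy threshold are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_emotion_svm(features: np.ndarray, labels: np.ndarray,
                      accuracy_threshold: float = 0.85):
    """Train/evaluate loop in the spirit of the embodiment.

    `labels` holds the three emotion categories, e.g. -1 (negative), 0 (neutral), +1 (positive);
    these label values are an assumption for illustration only."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.3, random_state=0, stratify=labels)

    best_model, best_acc = None, 0.0
    for C in (0.1, 1.0, 10.0, 100.0):            # "adjust parameters and continue training"
        model = SVC(C=C, kernel='rbf', gamma='scale').fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        if acc > best_acc:
            best_model, best_acc = model, acc
        if acc >= accuracy_threshold:            # stop once the accuracy threshold is exceeded
            break
    return best_model, best_acc
```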
Example 9: additional decision functionThe expression is used as follows:
Specifically, the additional decision function g(x) is introduced to add the classification of a further category. In a support vector machine (SVM) model, the original decision function usually handles only a two-class classification problem, but in practical applications more emotion categories sometimes need to be handled. The role of the additional decision function is to introduce an additional classification boundary so that the support vector machine model can handle classification problems with more than two categories, improving the applicability and performance of the model. Concretely, the additional decision function provides, through the additional weight vector v and the additional bias term d, an extra classification boundary for the additional emotion category. This additional classification boundary, together with the original classification boundary, forms a multi-class classification model that can distinguish and recognize several different emotion categories at the same time. By adjusting the parameters of the additional classification boundary, the classification accuracy for each emotion category can be tuned, achieving more accurate emotion classification.
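One possible reading of how the main and additional decision functions could jointly assign one of the three labels is sketched below; this combination rule is purely illustrative and is not stated in the patent.

```python
import numpy as np

def classify_with_extra_boundary(x: np.ndarray, w: np.ndarray, b: float,
                                 v: np.ndarray, d: float) -> int:
    """Illustrative interpretation: the main hyperplane f(x) = w.x + b separates negative
    from positive emotion, and the additional function g(x) = v.x + d carves out the
    neutral class near the boundary. The band width of 1.0 is an assumption."""
    f = float(np.dot(w, x) + b)
    g = float(np.dot(v, x) + d)
    if abs(g) <= 1.0:        # inside the band defined by the additional boundary -> neutral
        return 0
    return 1 if f > 0 else -1
```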
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. An affective health analysis and support system for elderly people, said system comprising a voice acquisition device, a voice processing device and a voice emotion analysis device, wherein: the voice acquisition device is used for collecting a voice signal of a target elderly person within a set time period and performing a preliminary signal analysis on the voice signal to judge whether voice emotion analysis is needed, specifically: statistically analyzing the average voice energy and the voice duty ratio of the voice signal, the average voice energy being defined as the ratio of the total energy of the voice signal within the set time period to the length of the time period, and the voice duty ratio being defined as the ratio of the total duration of speech detected within the set time period to the length of the time period; if the average voice energy or the voice duty ratio lies within its corresponding threshold range, it is judged that voice emotion analysis is not needed, otherwise it is judged that voice emotion analysis is needed; the voice processing device is used for applying, when the voice acquisition device judges that voice emotion analysis is needed, a pre-emphasis filter to the collected voice signal to balance its spectrum and obtain a pre-processed signal, extracting MFCC features, fundamental frequency features and energy features from the pre-processed signal, forming a feature vector with the MFCC features, the fundamental frequency features and the energy features as its elements, and dividing the pre-processed signal into voice segments and non-voice segments by a zero-crossing-rate method on the basis of the feature vector; the voice emotion analysis device is used for performing emotion analysis on the feature vectors of the voice segments with a pre-trained speech emotion analysis model and judging the emotion characteristic of each voice segment, the emotion characteristics comprising negative, neutral and positive emotion; and if, within the set time period, the ratio of the total number of frames of the voice segments carrying the negative emotion characteristic to the length of the time period exceeds a set threshold, the target elderly person is judged to be in a negative emotion and an early warning signal requiring emotion intervention is issued.
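As a rough sketch of the preliminary screening described in claim 1, the following Python function computes the average voice energy and the voice duty ratio over a recorded period and requests emotion analysis only when one of them leaves its threshold range. All numeric thresholds, the 25 ms frame length and the energy floor used to count voiced frames are placeholder assumptions, not values from this disclosure.

```python
import numpy as np

def needs_emotion_analysis(signal, fs, frame_len_s=0.025, energy_floor=1e-4,
                           energy_range=(0.01, 0.2), duty_range=(0.2, 0.8)):
    """Preliminary screening: average energy and voiced duty ratio vs. thresholds."""
    signal = np.asarray(signal, dtype=float)
    period_s = len(signal) / fs

    avg_energy = np.sum(signal ** 2) / period_s          # total energy / period length

    n = int(frame_len_s * fs)                            # samples per analysis frame
    frames = signal[: (len(signal) // n) * n].reshape(-1, n)
    frame_energy = np.sum(frames ** 2, axis=1)
    voiced_s = np.count_nonzero(frame_energy > energy_floor * n) * frame_len_s
    duty_ratio = voiced_s / period_s                     # voiced time / period length

    in_energy_band = energy_range[0] <= avg_energy <= energy_range[1]
    in_duty_band = duty_range[0] <= duty_ratio <= duty_range[1]
    # Analysis is skipped if either statistic stays inside its normal range.
    return not (in_energy_band or in_duty_band)
```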
2. The system for analyzing and supporting the emotional well-being of elderly people according to claim 1, wherein the voice acquisition device comprises an acquisition unit, an enhancement unit, a preliminary analysis unit and a noise separation unit; the acquisition unit is used for judging, through voice recognition within the set time period, whether a detected voice signal is uttered by the target elderly person and, if so, collecting the voice signal; the enhancement unit is used for performing signal enhancement on the voice signal to obtain an enhanced voice signal; the preliminary analysis unit performs the preliminary signal analysis on the voice signal to judge whether voice emotion analysis is needed; and the noise separation unit is used for separating background noise from the voice signal when it is judged that voice emotion analysis is needed.
3. The system for analyzing and supporting the emotional well-being of elderly people according to claim 2, wherein the method by which the enhancement unit performs signal enhancement on the voice signal to obtain the enhanced voice signal comprises: subjecting the voice signal $x(n)$ to a short-time Fourier transform based on an autoregressive model to obtain a time-frequency representation $X(m,k)$, wherein $m$ represents the time-segment index of the short-time Fourier transform; $k$ represents the frequency index; $N$ is the window length of each time segment; $w(n)$ is the window function; $a_p$ are the coefficients of the autoregressive model and $P$ is its order; $j$ is the imaginary unit; and $n$ is the time-domain index; the time-frequency representation is then enhanced with a Wiener filter having nonlinear dynamic-range-compression characteristics:

$$\hat{S}(m,k)=\frac{P_s(m,k)}{P_s(m,k)+P_n(m,k)}\,X(m,k)$$

wherein $P_n(m,k)$ and $P_s(m,k)$ represent the power-spectrum estimates of the noise and of the speech signal, respectively; $\hat{S}(m,k)$ is the frequency-domain representation of the enhanced speech signal; and the enhanced frequency-domain signal $\hat{S}(m,k)$ is subjected to an inverse short-time Fourier transform to obtain the enhanced voice signal.
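A minimal sketch of the enhancement step of claim 3, assuming Python with SciPy: an STFT, a Wiener-style gain computed from noise and speech power estimates, and an inverse STFT. Estimating the noise power from the first half second of the recording is an assumption of the example, and the autoregressive pre-modeling of the claim is omitted for brevity.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(x, fs, noise_seconds=0.5, nperseg=512):
    """STFT -> Wiener-style gain -> inverse STFT."""
    f, t, X = stft(x, fs=fs, nperseg=nperseg)
    hop = nperseg // 2                                    # default 50% overlap
    noise_frames = max(1, int(noise_seconds * fs / hop))
    p_noise = np.mean(np.abs(X[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    p_speech = np.maximum(np.abs(X) ** 2 - p_noise, 1e-10)
    gain = p_speech / (p_speech + p_noise)                # Wiener gain in [0, 1)
    _, x_hat = istft(gain * X, fs=fs, nperseg=nperseg)
    return x_hat[: len(x)]
```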
4. The system for analyzing and supporting the emotional well-being of elderly people according to claim 3, wherein the method by which the noise separation unit separates the background noise from the voice signal, when it is judged that voice emotion analysis is needed, comprises: representing the voice signal as a waveform function $x(t)$ in the time domain; converting it to the frequency domain by a short-time Fourier transform to obtain a frequency-domain representation; letting the length of the time period be $T$, segmenting the signal with a window function of window length $L$, with an overlap length $D$ between windows; selecting a Hamming window as the window function, defined as:

$$w(n)=0.54-0.46\cos\Big(\frac{2\pi n}{L-1}\Big),\quad 0\le n\le L-1$$

wherein $n$ represents the sample index within the window; applying the window function to each time segment of the voice signal and extending each segment by zero padding to the required transform length to obtain the time-domain window signals $x_t(n)$; applying a discrete Fourier transform to each window signal $x_t(n)$ to obtain its frequency-domain representation $X_t(f)$; assuming the background noise to be stationary and linearly superimposed on the voice signal; modeling and estimating the background noise in the frequency domain with an adaptive filter; letting $S(f)$ represent the clean spectrum of the voice signal and $N(f)$ the spectrum of the background noise; defining the frequency-domain response of the adaptive filter as:

$$W(f,t)=\frac{|S(f)|^{2}}{|S(f)|^{2}+|N(f)|^{2}}$$

wherein $W(f,t)$ is the frequency-domain response of the adaptive filter at time $t$; reconstructing the voice signal in the frequency domain with the adaptive filter $W(f,t)$ to obtain the reconstructed signal:

$$\hat{S}(f,t)=W(f,t)\,X_t(f)$$

wherein $\hat{S}(f,t)$ is the reconstructed signal; and converting the reconstructed signal back to the time domain to obtain the voice signal $s(t)$ from which the background noise has been separated.
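The following sketch mirrors the frame-wise separation of claim 4 under simplifying assumptions: a Hamming window with 50% overlap, zero-padded DFT frames, a stationary noise estimate taken as the per-bin minimum power over all frames (standing in for the adaptive filter of the claim), and overlap-add reconstruction. The window length, FFT size and noise estimator are illustrative choices.

```python
import numpy as np

def separate_background_noise(x, win_len=1024, hop=512, nfft=2048):
    """Hamming-window framing, per-bin suppression gain, overlap-add resynthesis."""
    x = np.asarray(x, dtype=float)
    x = np.pad(x, (0, max(0, win_len - len(x))))           # guard very short inputs
    w = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * w for i in range(n_frames)])
    spectra = np.fft.rfft(frames, n=nfft, axis=1)           # zero-padded DFT per frame

    noise_psd = np.min(np.abs(spectra) ** 2, axis=0) + 1e-12   # stationary-noise estimate
    speech_psd = np.maximum(np.abs(spectra) ** 2 - noise_psd, 1e-12)
    gain = speech_psd / (speech_psd + noise_psd)

    cleaned = np.fft.irfft(gain * spectra, n=nfft, axis=1)[:, :win_len]
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for i in range(n_frames):                                # overlap-add
        out[i * hop:i * hop + win_len] += cleaned[i]
        norm[i * hop:i * hop + win_len] += w
    return out / np.maximum(norm, 1e-8)
```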
5. The system according to claim 4, wherein the voice processing means, when the voice acquisition means judges that voice emotion analysis is required, applies a pre-emphasis filter to the acquired voice signal to balance the spectrum and then divides the signal into overlapping frames, using the following formulas:

$$y(n)=x(n)-\alpha\,x(n-1)$$
$$x_i(n)=y(iS+n),\quad 0\le n<N$$

wherein $x(n)$ is the original signal; $y(n)$ is the pre-emphasized signal; $\alpha$ is the pre-emphasis coefficient; $i$ represents the $i$-th frame; $N$ is the frame length; and $S$ is the frame shift, representing the overlap between adjacent frames.
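A short sketch of the pre-emphasis and framing of claim 5, assuming Python/NumPy; alpha = 0.97 and the 400-sample frame with 160-sample shift (25 ms / 10 ms at 16 kHz) are common defaults, not values fixed by the claim.

```python
import numpy as np

def preemphasize_and_frame(x, alpha=0.97, frame_len=400, frame_shift=160):
    """y(n) = x(n) - alpha*x(n-1), then overlapping frames x_i(n) = y(i*S + n)."""
    x = np.asarray(x, dtype=float)
    y = np.append(x[0], x[1:] - alpha * x[:-1])             # pre-emphasis filter
    y = np.pad(y, (0, max(0, frame_len - len(y))))           # guard very short inputs
    n_frames = 1 + (len(y) - frame_len) // frame_shift
    return np.stack([y[i * frame_shift:i * frame_shift + frame_len]
                     for i in range(n_frames)])
```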
6. The system of claim 5, wherein the speech processing means applies a window function to each frame of the signal, applies a discrete Fourier transform, and then processes the result with Mel filters to extract the MFCC features, as follows:

$$X_i(k)=\sum_{n=0}^{N-1} x_i(n)\,e^{-j2\pi kn/N},\quad k=0,1,\dots,K$$
$$E(m)=\ln\Big(\sum_{k=0}^{K}|X_i(k)|^{2}\,H_m(k)\Big),\quad m=1,\dots,M$$
$$c(q)=\sum_{m=1}^{M}E(m)\cos\Big(\frac{\pi q\,(m-0.5)}{M}\Big),\quad q=1,\dots,Q$$

wherein $x_i(n)$ is the $n$-th sample in the $i$-th speech frame; $X_i(k)$ represents the $k$-th frequency-domain coefficient after the discrete Fourier transform; $N$ is the number of points of the discrete Fourier transform and also the number of samples per speech frame; $K$, equal to $N/2$, represents the number of independent frequency components in the discrete Fourier transform; $H_m(k)$ is the gain of the $m$-th Mel filter at the $k$-th frequency point; the Mel filters are a group of overlapping triangular band-pass filters that simulate the frequency perception of the human ear and apply a nonlinear Mel-scale conversion to the frequency; $M$ is the number of Mel filters, representing the number of frequency bands divided on the Mel scale; $E(m)$, obtained by applying the $m$-th Mel filter to the coefficients of the discrete Fourier transform and taking the logarithm, represents the logarithmic energy of the $m$-th frequency band; $c(q)$ is the $q$-th Mel-frequency cepstrum coefficient, obtained by applying a discrete cosine transform to $E(m)$, which converts the logarithmic energy spectrum of the Mel filters into time-domain cepstrum coefficients, reduces the correlation between features and highlights the shape characteristics of the spectrum; and $Q$ is the number of MFCC features finally extracted.
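A compact sketch of the MFCC chain of claim 6 (DFT, triangular Mel filterbank, log energies, DCT), assuming Python with NumPy/SciPy; the 512-point FFT, 26 filters and 13 coefficients are conventional settings, not values specified in the claim.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular Mel filters H_m(k) over the first n_fft//2 + 1 DFT bins."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frames, fs, n_fft=512, n_filters=26, n_coeffs=13):
    """DFT -> Mel filterbank log energies -> DCT, per claim 6."""
    spectrum = np.abs(np.fft.rfft(frames * np.hamming(frames.shape[1]), n=n_fft)) ** 2
    log_energies = np.log(spectrum @ mel_filterbank(n_filters, n_fft, fs).T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, :n_coeffs]
```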
7. The system for analyzing and supporting the emotional well-being of elderly people according to claim 6, wherein the fundamental frequency feature $F_0$ is calculated by the following formula:

$$F_0=\frac{f_s}{\tau_{\max}}$$

wherein $R(\tau)$ is the autocorrelation function; $\tau_{\max}$ is the position of the peak of $R(\tau)$; and $f_s$ is the sampling frequency; the energy feature $E$ is calculated by the following formula:

$$E_i=\sum_{n=0}^{N-1}x_i(n)^{2}$$

wherein $E_i$ is the energy of the $i$-th speech frame; and the resulting feature vector is:

$$F=\big[c(1),\dots,c(Q),\,F_0,\,E\big]$$
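A sketch of the fundamental-frequency and energy features of claim 7, assuming Python/NumPy: the frame energy is the sum of squared samples, and F0 is the sampling frequency divided by the lag of the autocorrelation peak, searched within an assumed 60-400 Hz pitch band.

```python
import numpy as np

def frame_f0_and_energy(frame, fs, f0_min=60.0, f0_max=400.0):
    """Autocorrelation-based F0 and frame energy for one speech frame."""
    frame = np.asarray(frame, dtype=float)
    energy = float(np.sum(frame ** 2))                     # E = sum_n x(n)^2

    acorr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_lo, lag_hi = int(fs / f0_max), min(int(fs / f0_min), len(acorr) - 1)
    if lag_hi <= lag_lo:
        return 0.0, energy                                  # frame too short for pitch
    peak_lag = lag_lo + int(np.argmax(acorr[lag_lo:lag_hi]))
    f0 = fs / peak_lag if acorr[peak_lag] > 0 else 0.0
    return f0, energy

def feature_vector(mfcc_row, f0, energy):
    """Concatenate MFCCs, F0 and energy into one feature vector."""
    return np.concatenate([np.asarray(mfcc_row, float), [f0, energy]])
```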
8. The system of claim 7, wherein the pre-trained speech emotion analysis model is a three-class support vector machine model, the category label $y_i$ taking one of three values corresponding to negative, neutral and positive emotion, each element of the label set corresponding to one emotion type; the speech emotion analysis model is expressed with the following optimization problem:

$$\min_{w,\,b,\,w_a,\,b_a,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2}+C\sum_{i}\xi_i$$
$$\text{s.t.}\quad y_i\big(w^{\top}x_i+b\big)\ge 1-\xi_i,\qquad \delta_i\,g(x_i)\ge \delta_i\,(1-\xi_i),\qquad \xi_i\ge 0$$

wherein $w$ represents the normal vector of the decision hyperplane, used to define the classification boundary; $b$ represents the bias term, also called the intercept, a parameter of the support vector machine model used to translate the classification boundary; $w_a$ represents the weight vector of the additional decision function, used to define the additional classification boundary; $b_a$ represents the bias term of the additional decision function, used to translate the additional classification boundary; $\xi_i$ represents the relaxation variable, indicating the extent to which deviation from the hyperplane is allowed; $C$ represents the regularization parameter, which controls the importance of the relaxation variables; the larger its value, the more severely misclassification is penalized; $\delta_i$ represents the indicator variable; and $g(x_i)$ is the additional decision function; feature vectors of historical voice segments and the corresponding category labels are collected as model training and test data; part of this data is used as training data to train the speech emotion analysis model, finding an optimal decision boundary that classifies the samples of the training set as correctly as possible while preserving the generalization capability of the model; the remaining part of the model training and test data is used as test data to evaluate the trained support vector machine model, the evaluation index being the accuracy; if the accuracy exceeds a set accuracy threshold, training is stopped, otherwise the parameters of the model are adjusted and training continues until the accuracy exceeds the set accuracy threshold.
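Complementing the scikit-learn sketch after embodiment 8, the following illustrates the soft-margin objective of claim 8 directly, via a simple subgradient (Pegasos-style) loop on binary labels; the auxiliary boundary for the neutral class is omitted, and the learning rate and epoch count are assumptions outside what the claim specifies.

```python
import numpy as np

def train_linear_svm_sgd(X, y, C=1.0, lr=1e-3, epochs=200):
    """Subgradient descent on:
        minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
    with binary labels y in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1.0:                   # hinge-loss subgradient
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:
                w -= lr * w                    # only the regularizer contributes
    return w, b
```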
9. The system for emotional well-being analysis and support of elderly people according to claim 8, wherein the additional decision function $g(x_i)$ uses the following expression:

$$g(x_i)=w_a^{\top}x_i+b_a$$
CN202410411579.4A 2024-04-08 2024-04-08 Elderly emotion health analysis and support system Pending CN118016106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410411579.4A CN118016106A (en) 2024-04-08 2024-04-08 Elderly emotion health analysis and support system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410411579.4A CN118016106A (en) 2024-04-08 2024-04-08 Elderly emotion health analysis and support system

Publications (1)

Publication Number Publication Date
CN118016106A true CN118016106A (en) 2024-05-10

Family

ID=90950288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410411579.4A Pending CN118016106A (en) 2024-04-08 2024-04-08 Elderly emotion health analysis and support system

Country Status (1)

Country Link
CN (1) CN118016106A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040038419A (en) * 2002-11-01 2004-05-08 에스엘투(주) A method and apparatus for recognizing emotion from a speech
RU2008141478A (en) * 2008-10-22 2010-04-27 Александр Вадимович Баклаев (RU) SYSTEM OF EMOTIONAL STABILIZATION OF SPEECH COMMUNICATIONS "EMOS"
CN102184408A (en) * 2011-04-11 2011-09-14 西安电子科技大学 Autoregressive-model-based high range resolution profile radar target recognition method
CN108682432A (en) * 2018-05-11 2018-10-19 南京邮电大学 Speech emotion recognition device
CN111684522A (en) * 2019-05-15 2020-09-18 深圳市大疆创新科技有限公司 Voice recognition method, interaction method, voice recognition system, computer-readable storage medium, and removable platform
WO2020233504A1 (en) * 2019-05-17 2020-11-26 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for emotion recognition
CN110491416A (en) * 2019-07-26 2019-11-22 广东工业大学 It is a kind of based on the call voice sentiment analysis of LSTM and SAE and recognition methods
CN110890096A (en) * 2019-10-12 2020-03-17 深圳供电局有限公司 Intelligent voice system and method based on voice analysis
CN114005432A (en) * 2021-10-21 2022-02-01 江苏信息职业技术学院 Chinese dialect identification method based on active learning
CN114792517A (en) * 2022-03-30 2022-07-26 重庆工程职业技术学院 Voice recognition method and device for intelligent water cup

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
费业泰 (FEI Yetai): "误差理论与数据处理" (Error Theory and Data Processing), 31 May 1995, 机械工业出版社 (China Machine Press), pages: 173 - 174 *

Similar Documents

Publication Publication Date Title
CN107610715B (en) Similarity calculation method based on multiple sound characteristics
CN108896878A (en) A kind of detection method for local discharge based on ultrasound
Shama et al. Study of harmonics-to-noise ratio and critical-band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology
CN109034046B (en) Method for automatically identifying foreign matters in electric energy meter based on acoustic detection
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN107657964A (en) Depression aided detection method and grader based on acoustic feature and sparse mathematics
Kuresan et al. Fusion of WPT and MFCC feature extraction in Parkinson’s disease diagnosis
CN109256127B (en) Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
KR20060044629A (en) Isolating speech signals utilizing neural networks
WO2001016937A9 (en) System and method for classification of sound sources
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN109448726A (en) A kind of method of adjustment and system of voice control accuracy rate
CN108682432B (en) Speech emotion recognition device
CN108899052A (en) A kind of Parkinson's sound enhancement method based on mostly with spectrum-subtraction
Gopalakrishna et al. Real-time automatic tuning of noise suppression algorithms for cochlear implant applications
CN110942784A (en) Snore classification system based on support vector machine
CN101930733B (en) Speech emotional characteristic extraction method for speech emotion recognition
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
Murugappan et al. DWT and MFCC based human emotional speech classification using LDA
CN112820319A (en) Human snore recognition method and device
CN115346561A (en) Method and system for estimating and predicting depression mood based on voice characteristics
CN110415824B (en) Cerebral apoplexy disease risk assessment device and equipment
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
CN113974607A (en) Sleep snore detecting system based on impulse neural network
Porieva et al. Investigation of lung sounds features for detection of bronchitis and COPD using machine learning methods

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination