CN113053417A - Method, system, equipment and storage medium for recognizing emotion of voice with noise - Google Patents

Method, system, equipment and storage medium for recognizing emotion of voice with noise

Info

Publication number
CN113053417A
Authority
CN
China
Prior art keywords
voice
iteration
noise
residual
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110332451.5A
Other languages
Chinese (zh)
Other versions
CN113053417B (en)
Inventor
姜晓庆
陈贞翔
杨倩
郑永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Sizheng Information Technology Co Ltd
University of Jinan
Original Assignee
Shandong Sizheng Information Technology Co Ltd
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Sizheng Information Technology Co Ltd and University of Jinan
Priority to CN202110332451.5A
Publication of CN113053417A
Application granted
Publication of CN113053417B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a method, a system, equipment and a storage medium for recognizing the emotion of noisy speech. A noisy speech signal to be recognized is acquired; endpoint detection is performed on the noisy speech signal, and a plurality of voiced speech segments are obtained according to the detected endpoints; feature extraction is performed on the voiced segments to obtain speech features; and the speech features are input into a trained speech emotion recognition model, which outputs the emotion category. During sample reconstruction, the endpoint detection method calculates the conditional entropy between the predicted residual and the previous iteration's signal estimate within the iterations of the orthogonal matching pursuit algorithm, and directly gives the endpoint detection result of the reconstructed sample, at the same time as the reconstruction is completed, from the difference of the residual conditional entropy before and after the iterations. It thereby makes full use of the data generated during sample reconstruction and saves subsequent analysis and processing time; and because it is built on a compressed sensing reconstruction algorithm, it is robust to noise.

Description

Method, system, equipment and storage medium for recognizing emotion of voice with noise
Technical Field
The present application relates to the field of speech emotion recognition technology, and in particular, to a method, a system, a device, and a storage medium for recognizing speech emotion with noise.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Voice endpoint detection has wide and important applications in the field of speech signal processing, and is of significant research value for reducing the amount of data to be processed, learning effective speech characteristics, and improving the accuracy of speech recognition and speech emotion recognition.
Ubiquitous noise tends to degrade the accuracy of speech endpoint detection, and existing research has shown that compressed sensing (CS) also performs well at denoising speech signals. According to CS theory, the observation values obtained after a speech signal is transformed by a suitable sparse basis and an observation matrix contain all of the useful information in the signal. After transmission, the observation values can be used to reconstruct the speech signal at the receiving end with an appropriate reconstruction algorithm; because noise is not sparse, it cannot be reconstructed. Compressed sensing can therefore greatly reduce the amount of transmitted speech data while denoising the signal during reconstruction. Moreover, the unvoiced segments of a speech signal have noise-like characteristics and are suppressed during reconstruction, so the unvoiced/voiced division of the reconstructed sample is more accurate, which improves the accuracy of subsequently extracted speech feature parameters. Existing research also shows that speech samples reconstructed under compressed sensing theory can be applied effectively to noisy speech emotion recognition.
Conventional research focuses on signal reconstruction and neglects the parameters and data characteristics generated during the reconstruction process, which wastes data resources. For example, if endpoint detection must be performed on a reconstructed sample, the sample must first be obtained and then analyzed with a separate endpoint detection algorithm; the reconstruction cannot be completed and an endpoint detection result given at the same time, so the existing approach inevitably increases the processing delay of the system. In addition, existing endpoint detection algorithms operate directly on the speech signal, which has high data dimensionality and therefore low computational efficiency.
Disclosure of Invention
In order to solve the defects of the prior art, the application provides a method and a system for recognizing the speech emotion with noise;
in a first aspect, the application provides a method for recognizing speech emotion with noise;
the method for recognizing the emotion of the voice with the noise comprises the following steps:
acquiring a noisy speech signal to be recognized;
performing endpoint detection on the noisy speech signal to be recognized, and obtaining a plurality of voiced speech segments according to the detected endpoints;
performing feature extraction on the voiced speech segments to obtain speech features;
and inputting the speech features into a trained speech emotion recognition model, and outputting the emotion category.
In a second aspect, the present application provides a noisy speech emotion recognition system;
a noisy speech emotion recognition system comprising:
an acquisition module configured to: acquire a noisy speech signal to be recognized;
an endpoint detection module configured to: perform endpoint detection on the noisy speech signal to be recognized, and obtain a plurality of voiced speech segments according to the detected endpoints;
a feature extraction module configured to: perform feature extraction on the voiced speech segments to obtain speech features;
an output module configured to: input the speech features into a trained speech emotion recognition model, and output the emotion category.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the beneficial effects of this application are:
(1) The invention provides a speech endpoint detection method based on the residual conditional entropy difference generated during the iterative process, and applies it effectively to noisy speech emotion recognition. During sample reconstruction, the endpoint detection method calculates the conditional entropy between the predicted residual and the previous iteration's signal estimate within the iterations of the orthogonal matching pursuit (OMP) algorithm, and directly gives the endpoint detection result of the reconstructed sample, at the same time as the reconstruction is completed, from the difference of the residual conditional entropy before and after the iterations. It makes full use of the data generated during sample reconstruction and saves subsequent analysis and processing time; and because it is built on a compressed sensing reconstruction algorithm, it is robust to noise.
(2) The emotional speech component of an emotional video is processed using compressed sensing theory: the sparse transformation of the emotional speech is completed with the discrete cosine transform, a Gaussian random matrix serves as the observation matrix, the orthogonal matching pursuit (OMP) algorithm serves as the reconstruction algorithm, and a prediction residual conditional entropy parameter is proposed for the compressed sensing reconstruction of emotional speech;
(3) an analysis based on the difference of the residual conditional entropy before and after the OMP reconstruction iterations is proposed;
(4) according to the residual conditional entropy difference and a threshold, the endpoint detection result is given at the same time as the sample reconstruction is completed;
(5) speech emotion recognition of noisy emotional speech test samples is realized based on the endpoint detection result.
(6) The speech signal endpoint detection method using the residual conditional entropy difference is based on compressed sensing theory, and endpoint detection is completed during sample reconstruction; because noise is not sparse and cannot be reconstructed, the endpoint detection result obtained by the method is robust to noise;
(7) when the speech is reconstructed, the method obtains the decision of whether a speech frame is a voiced segment from the calculated residual conditional entropy difference, without any further processing of the reconstructed speech sample, so the delay is small and a fast decision can be made;
(8) by computing information-theoretic parameters, the method deeply and effectively mines the data characteristics of the reconstruction process, makes full use of the data generated during sample reconstruction, and saves computing resources;
(9) the speech signal endpoint detection method using the residual conditional entropy difference can be applied effectively to noisy speech emotion recognition.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2(a) is a schematic time-domain waveform of a speech sample according to the first embodiment;
FIG. 2(b) is a schematic time-domain waveform of the corresponding noisy speech according to the first embodiment;
FIG. 2(c) is the residual conditional entropy difference and the threshold for endpoint detection of the first embodiment;
fig. 3 is a flowchart of endpoint detection according to the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such a process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a method for recognizing noisy speech emotion;
as shown in fig. 1, the method for recognizing emotion of noisy speech includes:
S100: acquiring a noisy speech signal to be recognized;
S200: performing endpoint detection on the noisy speech signal to be recognized, and obtaining a plurality of voiced speech segments according to the detected endpoints;
S300: performing feature extraction on the voiced speech segments to obtain speech features;
S400: inputting the speech features into a trained speech emotion recognition model, and outputting the emotion category.
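For illustration only, a minimal end-to-end sketch of steps S100 to S400 is given below. Python is assumed as the language; the helpers detect_endpoints_omp and extract_features, and the emotion_model object, are hypothetical placeholders (sketched later in this description or supplied by the user), not names taken from the disclosure.

```python
import numpy as np

def recognize_noisy_speech_emotion(noisy_signal, sr, emotion_model,
                                   frame_len=256, frame_shift=128):
    """Sketch of S100-S400: endpoint detection -> feature extraction -> classification."""
    # S200: OMP-based endpoint detection on the noisy signal (see the sketch below);
    # assumed to return a list of (start_sample, end_sample) pairs of voiced segments.
    segments = detect_endpoints_omp(noisy_signal, frame_len, frame_shift)

    # S300: feature extraction for every voiced segment.
    features = [extract_features(noisy_signal[s:e], sr) for s, e in segments]

    # S400: feed the features to the trained speech emotion recognition model.
    return [emotion_model.predict(f) for f in features]
```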
As one or more embodiments, S200 (performing endpoint detection on the noisy speech signal to be recognized, and obtaining a plurality of voiced speech segments according to the detected endpoints) specifically comprises the following steps:
S201: performing sparse transformation on the noisy speech signal to be recognized;
S202: randomly generating a Gaussian random matrix for the sparsely transformed speech signal, and taking the Gaussian random matrix as the observation matrix of the speech signal;
S203: based on the observation matrix, performing sample reconstruction with the orthogonal matching pursuit (OMP) algorithm to obtain the endpoint detection result.
Further, S201 (performing sparse transformation on the noisy speech signal to be recognized) specifically comprises:
performing the sparse transformation of the noisy speech signal to be recognized with the discrete cosine transform.
Further, in S202, a Gaussian random matrix is randomly generated for the sparsely transformed speech signal; the Gaussian random matrix follows a normal distribution with mean 0, variance 1 and standard deviation 1.
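As a hedged illustration of S201 and S202, the sketch below builds a DCT sparse basis Ψ, a Gaussian random observation matrix Φ with standard-normal entries (mean 0, variance 1, as stated above), and the per-frame observation y = Φx. The frame length N and measurement count M are example values chosen for the sketch, not taken from the disclosure.

```python
import numpy as np
from scipy.fftpack import idct

N = 256   # frame length (example value)
M = 96    # number of measurements per frame (example value, M < N)

# Sparse basis: Psi is formed by DCT bases, so that x = Psi @ alpha.
Psi = idct(np.eye(N), norm='ortho', axis=0)

# Observation matrix: Gaussian random matrix with N(0, 1) entries.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((M, N))

# Sensing matrix Theta = Phi @ Psi; observation y = Phi @ x = Theta @ alpha.
Theta = Phi @ Psi

def observe(frame):
    """Compressed-sensing observation of one speech frame (S2031)."""
    return Phi @ frame
```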
As one or more embodiments, as shown in fig. 3, S203 (based on the observation matrix, performing sample reconstruction with the orthogonal matching pursuit (OMP) algorithm to obtain the endpoint detection result) specifically comprises the following steps:
S2031: obtaining a speech observation value for each frame according to the observation matrix;
S2032: in the first iteration, setting the residual to the speech observation value and calculating the correlation coefficients between the residual and the sensing matrix;
in subsequent iterations, calculating the residual between the previous iteration's estimate and the speech observation value, and the correlation coefficients between this residual and the sensing matrix;
S2033: finding the atom of the sensing matrix with the largest correlation coefficient, and using it to update the support set of the signal reconstruction;
S2034: based on the support set, approximating the observation value with the least squares method to obtain an estimate of the signal;
S2035: updating the residual, and calculating the residual conditional entropy;
S2036: judging whether the sparsity condition still holds (the number of iterations is less than the sparsity K); if so, returning to S2032; if not, calculating the residual conditional entropy difference between the first iteration and the last iteration, and taking the current signal estimate as the reconstructed sample;
S2037: judging whether the residual conditional entropy difference between the first iteration and the last iteration is higher than a set threshold; if it is higher, the current speech frame is regarded as a voiced segment; if it is lower, the current speech frame is regarded as a silent segment, giving the endpoint detection result for the current frame;
S2038: based on the endpoint detection results, obtaining the voiced speech segments of the reconstructed samples.
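The sketch below illustrates S2031 to S2038 for one frame, under stated assumptions: it is not the patented implementation. The sparsity K, the decision threshold, and the entropy estimator entropy_fn are placeholders (one possible histogram-based estimator is sketched after the σ_e formula below); using a zero vector as the "previous estimate" in the first iteration, and taking the entropy difference as the last-iteration value minus the first-iteration value (following the wording of the description), are both assumptions.

```python
import numpy as np

def omp_endpoint_detection(y, Theta, K, entropy_fn, threshold):
    """OMP reconstruction of one frame plus residual-conditional-entropy endpoint decision.

    y          : observation vector of the frame (S2031)
    Theta      : sensing matrix (Phi @ Psi)
    K          : sparsity, used as the iteration budget (S2036)
    entropy_fn : callable estimating the conditional entropy between the current residual
                 and the previous iteration's estimate (assumed helper, see sketch below)
    threshold  : empirical decision threshold on the entropy difference (S2037)
    Returns (alpha_hat, is_voiced).
    """
    _, N = Theta.shape
    residual = y.copy()                    # S2032, first iteration: residual = observation
    support = []                           # indices of selected atoms (support set, S2033)
    entropies = []
    prev_estimate_obs = np.zeros_like(y)   # assumption: no previous estimate before iteration 1

    for t in range(K):                     # sparsity condition: iterate while t < K (S2036)
        # S2032/S2033: correlate the residual with every column of Theta and
        # pick the atom with the largest absolute correlation.
        correlations = Theta.T @ residual
        support.append(int(np.argmax(np.abs(correlations))))

        # S2034: least-squares estimate on the current support set.
        A_t = Theta[:, support]
        alpha_support, *_ = np.linalg.lstsq(A_t, y, rcond=None)

        # S2035: update the residual and record the residual conditional entropy.
        new_estimate_obs = A_t @ alpha_support
        residual = y - new_estimate_obs
        entropies.append(entropy_fn(residual, prev_estimate_obs))
        prev_estimate_obs = new_estimate_obs

    # S2036: the reconstructed sparse coefficients after the last iteration.
    alpha_hat = np.zeros(N)
    alpha_hat[support] = alpha_support

    # S2037: entropy difference between the first and the last iteration
    # (last minus first, following the description's wording).
    entropy_diff = entropies[-1] - entropies[0]
    is_voiced = entropy_diff > threshold
    return alpha_hat, is_voiced
```

If a different sign convention is intended for the entropy difference, only the comparison against the threshold in the last two lines changes.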
Further, S2031 (obtaining a speech observation value for each frame according to the observation matrix) specifically comprises:
if one frame of the speech signal is x, its sparse transformation is completed by the discrete cosine transform, so that the signal is represented by the discrete cosine coefficients α, i.e. x = Ψα, where Ψ is the sparse basis matrix formed by the DCT bases; the observation value is then y = Θα, where Θ = ΦΨ and Φ is the observation matrix.
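As a brief worked check of the relations x = Ψα and y = Φx = Θα, here is a self-contained sketch (the random frame x stands in for real speech; N and M are the same illustrative values as above):

```python
import numpy as np
from scipy.fftpack import dct, idct

N, M = 256, 96
Psi = idct(np.eye(N), norm='ortho', axis=0)            # DCT basis matrix
Phi = np.random.default_rng(0).standard_normal((M, N))  # Gaussian observation matrix
Theta = Phi @ Psi                                        # sensing matrix

x = np.random.default_rng(1).standard_normal(N)   # stand-in for one speech frame
alpha = dct(x, norm='ortho')                       # sparse-domain coefficients

assert np.allclose(Psi @ alpha, x)                 # x = Psi @ alpha
assert np.allclose(Phi @ x, Theta @ alpha)         # y = Phi x = Theta alpha
```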
Further, the S2032: calculating a residual error between the last iteration estimation value and the voice observation value and a correlation coefficient between the residual error and the sensing matrix; the method specifically comprises the following steps:
The reconstructed residual r_t obtained in the t-th iteration is calculated as:

r_t = y − A_t·α̂_t

where A_t is the support set formed by the atoms of the sensing matrix selected in the t-th iteration of the OMP algorithm, α̂_t is the estimate calculated by the least squares method in the t-th iteration, and y is the observation value.
Further, the correlation coefficient of the residual error and the sensing matrix is calculated by using the inner product of the residual error and the column vector of the sensing matrix.
It should be understood that the sensing matrix is obtained by multiplying a sparsity matrix of sparse transformation and an observation matrix, and can ensure that signals can be sampled and compressed simultaneously.
Further, the S2033: searching atoms with the maximum correlation coefficient in the sensing matrix, and updating a support set reconstructed by the signals by using the atoms with the maximum correlation coefficient; the support set is a set of columns found from the sensing matrix according to the correlation coefficient.
Further, the S2035: updating the residual error, and calculating the conditional entropy of the residual error; the method specifically comprises the following steps:
storing the residual error obtained by each iteration and updating the residual error;
based on the updated residual, a residual conditional entropy is calculated.
Further, the residual conditional entropy is calculated based on the updated residual. The residual conditional entropy σ_e is calculated as:

σ_e = H(r_t | A_{t−1}·α̂_{t−1})

where A_{t−1} is the support set formed by the atoms of the sensing matrix in the (t−1)-th iteration of the OMP algorithm, and α̂_{t−1} is the estimate calculated by the least squares method in the (t−1)-th iteration, so that A_{t−1}·α̂_{t−1} is the previous iteration's estimate of the observation.
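The disclosure does not spell out how this conditional entropy is computed numerically. Purely as an assumption, the sketch below estimates H(r_t | A_{t−1}·α̂_{t−1}) from a joint histogram of the two vectors, using H(R, E) − H(E), which is one common discrete estimator; it could serve as the entropy_fn placeholder in the OMP sketch above.

```python
import numpy as np

def residual_conditional_entropy(residual, prev_estimate, bins=16):
    """Histogram estimate of H(residual | previous estimate) = H(R, E) - H(E)."""
    joint, _, _ = np.histogram2d(residual, prev_estimate, bins=bins)
    joint = joint / joint.sum()                         # joint probability p(r, e)
    p_e = joint.sum(axis=0)                             # marginal p(e)

    nz = joint > 0
    h_joint = -np.sum(joint[nz] * np.log2(joint[nz]))   # H(R, E)
    nz_e = p_e > 0
    h_e = -np.sum(p_e[nz_e] * np.log2(p_e[nz_e]))       # H(E)
    return h_joint - h_e
```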
Further, the S2036: judging whether the sparsity condition is reached, if so, returning to S2032; if not, calculating a residual conditional entropy difference value between the first iteration and the last iteration; the method specifically comprises the following steps:
and subtracting the residual conditional entropy obtained by the first iteration from the residual conditional entropy obtained by the last iteration to obtain a difference value.
Further, the sparsity condition means that, after each iteration in the sample reconstruction process, the number of completed iterations is compared with the sparsity K to decide whether to terminate: if the number of iterations is less than K, the iteration continues; otherwise, the iteration terminates.
Further, S300: performing feature extraction on each voiced speech segment to obtain speech features. The speech features specifically include: prosodic features (e.g., fundamental frequency, short-time energy, and duration-related features such as sample duration, voiced-segment duration and speech rate), psychoacoustic features (e.g., the first, second and third formants), spectral features (e.g., MFCC parameters), and statistical parameters (maximum, minimum, mean) of the above features, etc.
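A hedged sketch of extracting features of this kind follows. The librosa library is assumed (it is not named in the disclosure), the parameter values are illustrative, and formant extraction (e.g., via LPC analysis) is omitted from the sketch even though formants are listed above; this is not the patented feature set.

```python
import numpy as np
import librosa

def extract_features(segment, sr):
    """Per-segment feature vector: prosodic and spectral features plus simple statistics."""
    # Prosodic: fundamental frequency (F0) and short-time energy.
    f0 = librosa.yin(segment, fmin=50, fmax=500, sr=sr)
    energy = librosa.feature.rms(y=segment)[0]

    # Spectral: MFCC parameters.
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13)

    # Duration-related feature: segment length in seconds.
    duration = len(segment) / sr

    def stats(x):
        return [np.max(x), np.min(x), np.mean(x)]

    return np.hstack([stats(f0), stats(energy),
                      np.hstack([stats(row) for row in mfcc]),
                      duration])
```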
Further, the step S400: inputting the voice features into the trained voice emotion recognition model, and outputting emotion types; the training step of the trained speech emotion recognition model comprises the following steps:
constructing a neural network model; the neural network model is a convolutional neural network;
constructing a training set, wherein the training set comprises voice features of known emotion classes;
and inputting the training set into a neural network model for training, and stopping training when the loss function reaches the minimum value or reaches the set iteration times to obtain the trained speech emotion recognition model.
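The disclosure states only that the model is a convolutional neural network trained until the loss is minimized or a set number of iterations is reached; the architecture and hyper-parameters are not given. The small 1-D CNN and the Adam optimizer below are therefore assumptions, shown purely as a sketch of the training step.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Minimal 1-D CNN over fixed-length feature vectors (architecture assumed)."""
    def __init__(self, n_emotions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, n_emotions),
        )

    def forward(self, x):               # x: (batch, feature_dim)
        return self.net(x.unsqueeze(1))

def train(model, loader, epochs=50, lr=1e-3):
    """Train for a set number of epochs (loader yields float features and int emotion labels)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            opt.step()
    return model
```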
When compressed sensing is applied to speech signal processing, the discrete cosine transform is selected to complete the sparse transformation of the speech signal, a Gaussian random matrix is adopted as the observation matrix, and the orthogonal matching pursuit (OMP) algorithm is adopted as the sample reconstruction algorithm.
The invention provides a speech signal endpoint detection method using the residual conditional entropy difference, based on the prediction residual generated during the iterative execution of OMP. The OMP algorithm is commonly used for speech signal reconstruction: in each iteration it calculates the residual between the current estimate and the observation value and the correlation between that residual and the sensing matrix, and uses them to update the support set of the signal reconstruction, until the sparsity condition is reached and the signal reconstruction is complete. The calculation of the residual is a key step of the OMP algorithm; from an information-theoretic perspective, the acquisition of speech information during the iterations means a reduction of the residual entropy. The invention therefore introduces the conditional entropy σ_e between the residual of the t-th iteration and the signal estimate of the previous iteration to judge the degree of extraction of the speech component in the reconstructed residual.
In the OMP algorithm, the reconstructed residual r_t obtained in the t-th iteration is calculated as:

r_t = y − A_t·α̂_t

where A_t is the support set formed by the atoms of the sensing matrix in the t-th iteration of the OMP algorithm, α̂_t is the estimate calculated by the least squares method in the t-th iteration, and y is the observation value.

σ_e is calculated as:

σ_e = H(r_t | A_{t−1}·α̂_{t−1})

where A_{t−1} is the support set formed by the atoms of the sensing matrix in the (t−1)-th iteration of the OMP algorithm, and α̂_{t−1} is the estimate calculated by the least squares method in the (t−1)-th iteration.
And when the iteration is finished, solving a residual conditional entropy difference value between the last iteration and the first iteration, and obtaining an endpoint detection result through threshold judgment.
Fig. 2(a) shows a speech time domain waveform in a process of reconstructing a certain speech sample by using an OMP algorithm, fig. 2(b) shows a time domain waveform of a noisy speech, and fig. 2(c) shows a residual conditional entropy difference value and a threshold value between a last iteration and a first iteration.
As can be seen from the figure, the sample has a strong noise level: the signal-to-noise ratio of the noisy sample is 0 dB and the speech signal is buried in the noise. Nevertheless, the residual conditional entropy difference produced by the algorithm remains stable in the noisy environment and shows good robustness, and the start and end points of the noisy speech can be detected by setting a small threshold.
It can be seen that the residual conditional entropy difference of the iterative process corresponds well to the active components of the speech sample: the variation trend of σ_e matches the positions of the voiced segments (including unvoiced and voiced sounds) in the original waveform, and the start and end point decision for the reconstructed speech sample can be completed with an empirical threshold condition. Endpoint detection of noisy speech may then be achieved using a lower threshold (e.g., 0.01), as in FIG. 2(c). Moreover, the algorithm yields the endpoints of the reconstructed samples at the same time as it reconstructs them, so no additional endpoint detection algorithm needs to be applied to the reconstructed samples.
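For completeness, here is a small sketch of how per-frame entropy differences and the empirical threshold (e.g., 0.01 in FIG. 2(c)) could be turned into start/end points. The frame shift is an illustrative value, and the merging of consecutive voiced frames into segments is an assumption, not quoted from the disclosure.

```python
import numpy as np

def frames_to_endpoints(entropy_diffs, threshold=0.01, frame_shift=128):
    """Group consecutive frames whose entropy difference exceeds the threshold
    into (start_sample, end_sample) voiced segments."""
    voiced = np.asarray(entropy_diffs) > threshold
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        segments.append((start * frame_shift, len(voiced) * frame_shift))
    return segments
```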
The overall flow chart of the noisy speech emotion recognition of the speech signal endpoint detection method using the residual conditional entropy difference is shown in fig. 1. As can be seen from fig. 1, when the noisy emotion speech is reconstructed, an endpoint detection result of the reconstructed sample can be obtained, subsequent feature extraction and feature learning can be performed according to the endpoint detection result, and an effective emotion recognition model can be trained by using a feature parameter set of the emotion speech, thereby realizing the noisy speech emotion recognition.
Example two
The embodiment provides a system for recognizing noisy speech emotion;
a noisy speech emotion recognition system comprising:
an acquisition module configured to: acquire a noisy speech signal to be recognized;
an endpoint detection module configured to: perform endpoint detection on the noisy speech signal to be recognized, and obtain a plurality of voiced speech segments according to the detected endpoints;
a feature extraction module configured to: perform feature extraction on the voiced speech segments to obtain speech features;
an output module configured to: input the speech features into a trained speech emotion recognition model, and output the emotion category.
It should be noted here that the above acquisition module, endpoint detection module, feature extraction module and output module correspond to steps S100 to S400 of the first embodiment; the examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to the contents disclosed in the first embodiment. It should be noted that the modules described above, as parts of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method of the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for recognizing the emotion of noisy speech, characterized by comprising the following steps:
acquiring a noisy speech signal to be recognized;
performing endpoint detection on the noisy speech signal to be recognized, and obtaining a plurality of voiced speech segments according to the detected endpoints;
performing feature extraction on the voiced speech segments to obtain speech features;
and inputting the speech features into a trained speech emotion recognition model, and outputting the emotion category.
2. The noisy speech emotion recognition method of claim 1, wherein the endpoint detection performed on the noisy speech signal to be recognized, and the obtaining of a plurality of voiced speech segments according to the detected endpoints, specifically comprise:
performing sparse transformation on the noisy speech signal to be recognized;
randomly generating a Gaussian random matrix for the sparsely transformed speech signal, and taking the Gaussian random matrix as the observation matrix of the speech signal;
and based on the observation matrix, performing sample reconstruction with the orthogonal matching pursuit (OMP) algorithm to obtain the endpoint detection result.
3. The noisy speech emotion recognition method of claim 2, wherein the sparse transformation of the noisy speech signal to be recognized specifically comprises:
performing the sparse transformation of the noisy speech signal to be recognized with the discrete cosine transform.
4. The noisy speech emotion recognition method of claim 2, wherein the sample reconstruction with the orthogonal matching pursuit (OMP) algorithm based on the observation matrix, to obtain the endpoint detection result, specifically comprises:
(1): obtaining a speech observation value for each frame according to the observation matrix;
(2): in the first iteration, setting the residual to the speech observation value and calculating the correlation coefficients between the residual and the sensing matrix;
in subsequent iterations, calculating the residual between the previous iteration's estimate and the speech observation value, and the correlation coefficients between this residual and the sensing matrix;
(3): finding the atom of the sensing matrix with the largest correlation coefficient, and using it to update the support set of the signal reconstruction;
(4): based on the support set, approximating the observation value with the least squares method to obtain an estimate of the signal;
(5): updating the residual, and calculating the residual conditional entropy;
(6): judging whether the sparsity condition still holds; if so, returning to step (2); if not, calculating the residual conditional entropy difference between the first iteration and the last iteration, and taking the current signal estimate as the reconstructed sample;
(7): judging whether the residual conditional entropy difference between the first iteration and the last iteration is higher than a set threshold; if it is higher, the current speech frame is regarded as a voiced segment; if it is lower, the current speech frame is regarded as a silent segment, giving the endpoint detection result for the current frame;
(8): based on the endpoint detection results, obtaining the voiced speech segments of the reconstructed samples.
5. The noisy speech emotion recognition method of claim 4, wherein the residual between the previous iteration's estimate and the speech observation value, and the correlation coefficients between the residual and the sensing matrix, are calculated as follows:
the reconstructed residual r_t obtained in the t-th iteration is calculated as
r_t = y − A_t·α̂_t
where A_t is the support set formed by the atoms of the sensing matrix in the t-th iteration of the OMP algorithm, α̂_t is the estimate calculated by the least squares method in the t-th iteration, and y is the observation value.
6. The noisy speech emotion recognition method of claim 4, wherein the updating of the residual and the calculation of the residual conditional entropy specifically comprise:
storing the residual obtained in each iteration and updating the residual;
calculating the residual conditional entropy based on the updated residual;
the residual conditional entropy σ_e is calculated as
σ_e = H(r_t | A_{t−1}·α̂_{t−1})
where A_{t−1} is the support set formed by the atoms of the sensing matrix in the (t−1)-th iteration of the OMP algorithm, and α̂_{t−1} is the estimate calculated by the least squares method in the (t−1)-th iteration.
7. The noisy speech emotion recognition method of claim 4, wherein the sparsity condition means that, after each iteration in the sample reconstruction process, the number of completed iterations is compared with the sparsity K to decide whether to terminate the iteration; if the number of iterations is less than K, the iteration continues, otherwise the iteration terminates.
8. A noisy speech emotion recognition system, characterized by comprising:
an acquisition module configured to: acquire a noisy speech signal to be recognized;
an endpoint detection module configured to: perform endpoint detection on the noisy speech signal to be recognized, and obtain a plurality of voiced speech segments according to the detected endpoints;
a feature extraction module configured to: perform feature extraction on the voiced speech segments to obtain speech features;
an output module configured to: input the speech features into a trained speech emotion recognition model, and output the emotion category.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202110332451.5A 2021-03-29 2021-03-29 Method, system, equipment and storage medium for recognizing emotion of voice with noise Active CN113053417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110332451.5A CN113053417B (en) 2021-03-29 2021-03-29 Method, system, equipment and storage medium for recognizing emotion of voice with noise


Publications (2)

Publication Number Publication Date
CN113053417A true CN113053417A (en) 2021-06-29
CN113053417B CN113053417B (en) 2022-04-19

Family

ID=76516320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110332451.5A Active CN113053417B (en) 2021-03-29 2021-03-29 Method, system, equipment and storage medium for recognizing emotion of voice with noise

Country Status (1)

Country Link
CN (1) CN113053417B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345923A (en) * 2013-07-26 2013-10-09 电子科技大学 Sparse representation based short-voice speaker recognition method
CN103474066A (en) * 2013-10-11 2013-12-25 福州大学 Ecological voice recognition method based on multiband signal reconstruction
CN107293302A (en) * 2017-06-27 2017-10-24 苏州大学 A kind of sparse spectrum signature extracting method being used in voice lie detection system
CN107657964A (en) * 2017-08-15 2018-02-02 西北大学 Depression aided detection method and grader based on acoustic feature and sparse mathematics
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN111081280A (en) * 2019-12-30 2020-04-28 苏州思必驰信息科技有限公司 Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANURAG SINGH ET AL.: "Compressed sensing framework of data reduction at multiscale level for eigenspace multichannel ECG signals", 《2015 TWENTY FIRST NATIONAL CONFERENCE ON COMMUNICATIONS (NCC)》 *
P N JAYANTHI ET AL.: "Sparse channel estimation for MIMO-OFDM systems using compressed sensing", 《2016 IEEE INTERNATIONAL CONFERENCE ON RECENT TRENDS IN ELECTRONICS, INFORMATION & COMMUNICATION TECHNOLOGY (RTEICT)》 *
PARVIN AHMADI ET AL.: "A new method for voice activity detection based on sparse representation", 《2014 7TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING》 *

Also Published As

Publication number Publication date
CN113053417B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
US6985858B2 (en) Method and apparatus for removing noise from feature vectors
CN113076847B (en) Multi-mode emotion recognition method and system
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN109448746B (en) Voice noise reduction method and device
Kheder et al. Additive noise compensation in the i-vector space for speaker recognition
JP2010078650A (en) Speech recognizer and method thereof
JP4705414B2 (en) Speech recognition apparatus, speech recognition method, speech recognition program, and recording medium
CN113643693A (en) Acoustic model conditioned on sound features
CN101123090B (en) Speech recognition by statistical language using square-rootdiscounting
CN112489625A (en) Voice emotion recognition method, system, mobile terminal and storage medium
CN110648655B (en) Voice recognition method, device, system and storage medium
Helali et al. Real time speech recognition based on PWP thresholding and MFCC using SVM
US7966179B2 (en) Method and apparatus for detecting voice region
CN113053417B (en) Method, system, equipment and storage medium for recognizing emotion of voice with noise
CN113065449B (en) Face image acquisition method and device, computer equipment and storage medium
KR20170088165A (en) Method and apparatus for speech recognition using deep neural network
Mendiratta et al. Automatic speech recognition using optimal selection of features based on hybrid ABC-PSO
Nicolson et al. Sum-product networks for robust automatic speaker identification
CN112397087B (en) Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal
CN112216285A (en) Multi-person session detection method, system, mobile terminal and storage medium
Alimuradov Research of frequency-selective properties of empirical mode decomposition methods for speech signals' pitch frequency estimation
Tu et al. Computational auditory scene analysis based voice activity detection
KR102002535B1 (en) Apparatus and method for analyzing sound
Morales et al. Adding noise to improve noise robustness in speech recognition.
Tan et al. Feature enhancement using sparse reference and estimated soft-mask exemplar-pairs for noisy speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant