CN111696579B - Speech emotion recognition method, device, equipment and computer storage medium - Google Patents

Speech emotion recognition method, device, equipment and computer storage medium Download PDF

Info

Publication number
CN111696579B
CN111696579B · CN202010554606.5A
Authority
CN
China
Prior art keywords
features
emotion
phoneme
spectrogram
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010554606.5A
Other languages
Chinese (zh)
Other versions
CN111696579A (en)
Inventor
陈剑超
肖龙源
李稀敏
刘晓葳
叶志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010554606.5A priority Critical patent/CN111696579B/en
Publication of CN111696579A publication Critical patent/CN111696579A/en
Application granted granted Critical
Publication of CN111696579B publication Critical patent/CN111696579B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a speech emotion recognition method, apparatus, device and computer storage medium. The method comprises the following steps: acquiring a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary; extracting a one-hot vector for each phoneme label and splicing the one-hot vectors into a two-dimensional matrix along the time domain to generate phoneme features; acquiring speech features of the user; and respectively inputting the spectrogram features and the phoneme features into a neural network model to obtain emotion prediction output features, and identifying the speech emotion of the user according to the emotion prediction output features. The method identifies speech emotion by combining spectrogram features and phoneme features, improves the classification accuracy with the aid of phoneme information, and reduces manual interference.

Description

Speech emotion recognition method, device, equipment and computer storage medium
Technical Field
The present invention relates to the field of speech research, and in particular, to a speech emotion recognition method, apparatus, device, and computer storage medium.
Background
Speech emotion recognition is an important branch of the speech research field. It is the technology of judging a person's emotion from his or her speech, and it touches core problems of several areas of speech research such as signal processing, feature extraction and pattern recognition. In recent years, with the rapid development of information technology, speech emotion recognition has found important applications in many scenarios, for example: 1. Call-center systems and large commercial establishments need to handle thousands of customer calls every day, and keeping telephone customers satisfied is an important measure for preventing customer loss; the dissatisfaction of customers during calls therefore needs to be detected and flagged in time. 2. In education, research shows that a learner's learning performance is strongly related to his or her emotional state, and mildly negative emotion during learning can help the generation of critical thinking. 3. In advertising, advertisers could previously only place advertisements over a wide audience in order to maximize coverage of potential customers, a placement mode that is costly and poorly targeted. The emotional tendency of the audience is their most direct feedback on an advertisement. Based on a speech emotion recognition system, the emotional state of the audience can be acquired, which helps the advertiser obtain this evaluation feedback, adjust the placement strategy and reduce cost.
However, because emotion is complex, there are two ways of defining it: discrete emotion definition and continuous emotion definition. Discrete emotion definition is intuitive and simple: evaluators label an utterance, according to their own subjective feeling, with one of a set of well-defined emotion categories such as "happy", "sad" or "angry". Continuous emotion definition measures emotion not with categories but with scores along certain psychological dimensions; a common model is the intensity-valence (arousal-valence) model, in which intensity reflects certain characteristics of the utterance: more intense speech usually carries more energy in its high-frequency part and has a higher pitch. However, some emotions cannot be distinguished by intensity alone and must additionally be separated by valence. In the prior art, a complete speech emotion recognition framework mainly comprises three steps, namely speech feature extraction, acquisition of emotion-discriminative information, and classifier training, after which an emotion label prediction is obtained. Referring to fig. 1: the first step of a speech emotion system is to extract, from the raw waveform, speech features that can be used for model training; the features used in speech emotion recognition are varied and can be grouped into vocal-tract features, prosodic features, statistical features and the like. Information capable of distinguishing the emotion categories is then further obtained from the extracted speech features. Traditional methods usually increase the discriminability of emotions through carefully designed feature combinations; with the development of deep learning, this is increasingly accomplished by the high-level outputs of a neural network. Finally, once the emotion-discriminative information is obtained, a classifier can be trained to produce emotion predictions on test data; classifiers fall into generative classifiers and discriminative classifiers, and in a neural network a fully connected layer is used for classification. The conventional approach, however, has a major drawback: it is highly sensitive to the features, so most studies have to go through a complicated feature selection process before training the classifier, and the extraction of emotion-discriminative information is to some extent delegated to the feature selection algorithm, which introduces more manual interference.
Disclosure of Invention
In view of the foregoing problems, an object of the present invention is to provide a speech emotion recognition method, apparatus, device and storage medium that recognize speech emotion by combining spectrogram features and phoneme features, improve the classification accuracy with the aid of phoneme information, and reduce manual interference.
The embodiment of the invention provides a speech emotion recognition method, which comprises the following steps:
acquiring a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
acquiring voice characteristics of a user;
extracting a one-hot vector for each phoneme label, and splicing the one-hot vectors into a two-dimensional matrix along the time domain to generate phoneme features;
and respectively inputting the voice features and the phoneme features into a neural network model to obtain emotion prediction output features, and identifying the voice emotion of the user according to the emotion prediction output features.
Preferably, the emotion data set comprises at least one of: happy, sad, and upset; the speech features are spectrogram features; the phoneme label is defined using 39 phonemes.
Preferably, the speech features and the phoneme features are respectively input into a neural network model to obtain emotion prediction output features, and the speech emotion of the user is recognized according to the emotion prediction output features, specifically:
normalizing the spectrogram features to extract spectrogram image texture features;
segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and respectively inputting the spectrogram features and the phoneme features of the training batch into a neural network model, where they are spliced to obtain emotion prediction output features, and recognizing the speech emotion of the user according to the emotion prediction output features.
In a second aspect, an embodiment of the present invention provides a speech emotion recognition apparatus, including:
a phoneme label obtaining unit, configured to obtain a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
the voice feature acquisition unit is used for acquiring the voice features of the user;
a one-hot vector extraction unit, configured to extract one-hot vectors of each phoneme label, and splice each one-hot vector into a two-dimensional matrix according to a time domain to generate a phoneme feature;
and the voice emotion recognition unit is used for respectively inputting the voice features and the phoneme features into a neural network model so as to obtain emotion prediction output features, and recognizing the voice emotion of the user according to the emotion prediction output features.
Preferably, the emotion data set comprises at least one of: happy, sad, and upset; the speech features are spectrogram features; the phoneme label is defined using 39 phonemes.
Preferably, the speech emotion recognition unit includes:
the normalization module is used for performing normalization processing on the spectrogram features so as to extract spectrogram image texture features;
the segmentation module is used for segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and the speech emotion recognition module is used for respectively inputting the spectrogram features and the phoneme features of the training batch into the neural network model, where they are spliced to obtain emotion prediction output features, and recognizing the speech emotion of the user according to the emotion prediction output features.
The embodiment of the invention also provides voice emotion recognition equipment which comprises a processor, a memory and a computer program stored in the memory, wherein the computer program can be executed by the processor to realize the voice emotion recognition method in the embodiment.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the device where the computer-readable storage medium is located is controlled to execute the speech emotion recognition method described in the above embodiment.
In this embodiment, the speech emotion is recognized by combining the spectrogram features and the phoneme features, the classification accuracy is improved with the aid of the phoneme information, and manual interference is reduced.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart illustrating a speech emotion recognition method according to a first embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a CNN speech emotion recognition network combined with phoneme information according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of texture features of a spectrogram image provided in an embodiment of the present invention.
FIG. 4 is a schematic structural diagram of a speech emotion recognition apparatus according to a second embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 to fig. 3, a first embodiment of the present invention provides a speech emotion recognition method, which can be executed by a speech emotion recognition device, and in particular, executed by one or more processors in the speech emotion recognition device, and at least includes the following steps:
s101, acquiring a phoneme label of a user; the phoneme label is a pronunciation dictionary for judging the emotion data set of the user;
in this embodiment, the phoneme label of the user is obtained by the G2P tool, and it should be noted that the phoneme definition used in the phoneme label text in the present application is from the university of kanji merlon pronunciation dictionary, and 39 phoneme definitions are used. Of course, it is to be appreciated that the emotion data set includes at least one of: happiness, sadness, difficulty and happiness, and the invention will not be described herein.
S102, acquiring the voice characteristics of the user.
In this embodiment, the candidate speech features include MFCC features, Mel filterbank features, PLP features, prosodic features and spectrogram features. The features extracted from the spectrogram are finer and give better results, while the other features are coarser and shallower; they may work well for a particular kind of emotion, such as happiness, but worse for others. Therefore, in the present application the speech features are spectrogram features. The candidate features are listed in Table 1:

Table 1:

Feature           | Parameters                                                                       | Dimension
MFCC              | 25 ms window length, 10 ms window shift, 26 Mel filter banks                     | 39
Mel Filterbank    | 25 ms window length, 10 ms window shift, 40 Mel filter banks                     | 40
PLP               | 25 ms window length, 10 ms window shift, 5th-order perceptual linear prediction  | 18
Prosodic features | Fundamental frequency, voicing probability, loudness curve                       | 3
Spectrogram       | 40 ms window length, 10 ms window shift, 1600 FFT points                         | 800
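As an illustration of the spectrogram row of Table 1, the sketch below computes a magnitude spectrogram with a 40 ms window, a 10 ms shift and 1600 FFT points, which yields the 800 retained frequency bins; the 16 kHz sampling rate and the Hann window are assumptions not stated in the table.

```python
import numpy as np

def magnitude_spectrogram(wave: np.ndarray, sr: int = 16000,
                          win_ms: float = 40.0, hop_ms: float = 10.0,
                          n_fft: int = 1600) -> np.ndarray:
    """Frame the waveform, window it, and take the FFT magnitude (Table 1 parameters)."""
    win = int(sr * win_ms / 1000)          # 640 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)          # 160 samples at 16 kHz
    window = np.hanning(win)
    frames = []
    for start in range(0, len(wave) - win + 1, hop):
        frame = wave[start:start + win] * window
        spec = np.abs(np.fft.rfft(frame, n=n_fft))[:800]   # keep 800 frequency bins
        frames.append(spec)
    return np.asarray(frames)               # shape: (n_frames, 800)

# Example on one second of random noise
spec = magnitude_spectrogram(np.random.randn(16000))
print(spec.shape)                            # (97, 800)
```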
S103, extracting a one-hot vector for each phoneme label, and splicing the one-hot vectors into a two-dimensional matrix along the time domain to generate the phoneme features.
In this embodiment, the phoneme features are described using one-hot vectors; that is, each phoneme p is described by a 39-dimensional zero-one vector x defined as follows:

$$x_i = \begin{cases} 1, & \text{if } p \text{ is the } i\text{-th of the 39 defined phonemes} \\ 0, & \text{otherwise} \end{cases} \qquad i = 1, \dots, 39$$

That is, if the phoneme p is the i-th of the 39 defined phonemes, the i-th dimension of the vector x is 1 and all remaining dimensions are 0.
It should be noted that, for the phoneme sequence of each sentence, the obtained one-hot vectors are spliced along the time domain into a two-dimensional matrix to form the phoneme features. The phoneme features are also trained in batches, and the order of sentences in each batch is kept consistent with the order used for the spectrogram features, so that the speech emotion can be identified accurately.
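A minimal sketch of this step, assuming an ARPAbet-style 39-phoneme inventory (the exact phoneme list is an assumption): each phoneme becomes a 39-dimensional one-hot vector, and the vectors of one sentence are stacked along the time axis into a two-dimensional matrix.

```python
import numpy as np

# Assumed 39-phoneme inventory (ARPAbet-style base phonemes without stress markers).
PHONEMES = [
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER", "EY",
    "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW", "OY", "P",
    "R", "S", "SH", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH",
]
PHONEME_INDEX = {p: i for i, p in enumerate(PHONEMES)}

def one_hot(phoneme: str) -> np.ndarray:
    """Return the 39-dimensional zero-one vector x for a single phoneme p."""
    x = np.zeros(len(PHONEMES), dtype=np.float32)
    x[PHONEME_INDEX[phoneme]] = 1.0
    return x

def phoneme_feature_matrix(phoneme_sequence) -> np.ndarray:
    """Stack the one-hot vectors of one sentence along the time axis into a (T, 39) matrix."""
    return np.stack([one_hot(p) for p in phoneme_sequence], axis=0)

feats = phoneme_feature_matrix(["HH", "AH", "L", "OW"])
print(feats.shape)    # (4, 39)
```

When batches are formed, the utterances would be fed in the same order as the spectrogram batches, as required above.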
And S104, respectively inputting the voice features and the phoneme features into a neural network model to obtain emotion prediction output features, and identifying the voice emotion of the user according to the emotion prediction output features.
In this embodiment, a convolutional neural network is also used to learn the phoneme features, and a single convolutional layer followed by global mean pooling is used to obtain the high-level feature output. The design details of the phoneme convolution network are shown in fig. 2. The convolution kernels have a height of 39 so as to cover all phonemes, and a width of 3, so each kernel covers the information of 3 adjacent phonemes. The output dimension of the convolutional layer is 32; the output features pass through a ReLU activation function and then through global mean pooling to obtain a 32-dimensional feature vector, which is concatenated with the output of the speech branch into a 112-dimensional vector, from which the emotion prediction output features are obtained with a fully connected layer.
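The following is a minimal PyTorch sketch of the phoneme branch described above; the input layout (batch, 1, T, 39) and the time-axis padding are assumptions.

```python
import torch
import torch.nn as nn

class PhonemeBranch(nn.Module):
    """One conv layer whose kernel spans all 39 phoneme dimensions and 3 adjacent
    phonemes in time, then ReLU and global mean pooling to a 32-d vector."""

    def __init__(self, n_phonemes: int = 39, out_channels: int = 32):
        super().__init__()
        # Treat the (T, 39) one-hot matrix as a one-channel image; kernel size (3, 39).
        self.conv = nn.Conv2d(1, out_channels, kernel_size=(3, n_phonemes), padding=(1, 0))
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T, 39)
        h = self.relu(self.conv(x))    # (batch, 32, T, 1)
        return h.mean(dim=(2, 3))      # global mean pooling -> (batch, 32)

out = PhonemeBranch()(torch.randn(2, 1, 17, 39))
print(out.shape)                       # torch.Size([2, 32])
```

The 32-dimensional output is then concatenated with the output of the speech branch to form the 112-dimensional vector mentioned above, which implies an 80-dimensional output from the speech branch.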
Specifically, the S104 includes the following steps:
s1041, performing normalization processing on the spectrogram characteristics to extract spectrogram image texture characteristics;
referring to fig. 3, in the present embodiment, the normalization is to take the logarithm of the spectrogram, but the μ rate compression is used in the present application, and the formula is as follows:
Figure BDA0002543838140000062
the mu rate compression can improve the proportion of the low-amplitude part of the spectrogram, on one hand, the numerical difference of the spectrogram is reduced, the training process is more stable, and on the other hand, the neural network can utilize more information.
S1042, segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch (a minimal segmentation sketch is given below);
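A possible implementation of this segmentation, assuming a segment length of 200 frames (chosen to match the 400x200 network input described below, not stated explicitly as the segment length):

```python
import numpy as np

def segment_spectrogram(spec: np.ndarray, segment_len: int = 200) -> np.ndarray:
    """Cut a (freq, time) spectrogram into equal-length segments along the time axis,
    zero-padding the tail so every segment in a batch has the same shape."""
    n_freq, n_frames = spec.shape
    n_segments = int(np.ceil(n_frames / segment_len))
    padded = np.zeros((n_freq, n_segments * segment_len), dtype=spec.dtype)
    padded[:, :n_frames] = spec                  # zero filling for the short tail
    # (n_freq, n_segments * segment_len) -> (n_segments, n_freq, segment_len)
    return padded.reshape(n_freq, n_segments, segment_len).transpose(1, 0, 2)

segments = segment_spectrogram(np.random.rand(400, 430))
print(segments.shape)    # (3, 400, 200)
```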
and S1043, respectively inputting the spectrogram characteristics and the phoneme characteristics after the training batch is formed into a neural network model for splicing to obtain emotion prediction output characteristics, and recognizing the speech emotion of the user according to the emotion prediction output characteristics.
In this embodiment, the network input features are normalized amplitude spectra of size 400x200. The first convolutional layer transforms them into 16-dimensional features, and the output dimension of each subsequent layer grows by 16 so as to further extract high-level features; a max pooling layer with a 2x2 pooling window follows each of the first three convolutional layers. After the fifth convolutional layer, a global mean pooling layer downsamples each high-level emotional feature map to a single value. The loss function used is the cross-entropy loss:

$$L = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

where C is the number of emotion categories, y_c is the one-hot emotion label and \hat{y}_c is the predicted probability of class c.
the network training adopts a random gradient descent method, a cosine attenuation function is adopted for the setting of the learning rate, and the initial learning rate is set to be 0.05.
In summary, by acquiring a phoneme label of a user and the speech features of the user, extracting the phoneme features from the phoneme label, then respectively inputting the speech features and the phoneme features into a neural network model to obtain emotion prediction output features, and recognizing the speech emotion of the user from these output features, the invention recognizes speech emotion by combining spectrogram features and phoneme features, improves the classification accuracy with the aid of phoneme information, and reduces manual interference.
The second embodiment of the present invention:
referring to fig. 4, a second embodiment of the present invention provides a speech emotion recognition apparatus, including:
a phoneme label acquiring unit 100, configured to acquire a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
a voice feature obtaining unit 200, configured to obtain a voice feature of a user;
a one-hot vector extracting unit 300, configured to extract one-hot vectors of each phoneme label, and splice each one-hot vector into a two-dimensional matrix according to a time domain to generate a phoneme feature;
and a speech emotion recognition unit 400, configured to input the speech features and the phoneme features into a neural network model respectively to obtain emotion prediction output features, and recognize speech emotion of the user according to the emotion prediction output features.
In the foregoing embodiment, in a preferred embodiment of the present invention, the emotion data set includes at least one of: happy, sad, and upset; the speech features are spectrogram features; and the phoneme label is defined using 39 phonemes.
In the foregoing embodiment, in a preferred embodiment of the present invention, the speech emotion recognition unit includes:
the normalization module is used for performing normalization processing on the spectrogram characteristics so as to extract spectrogram image texture characteristics;
the segmentation module is used for segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and the speech emotion recognition module is used for respectively inputting the spectrogram features and the phoneme features of the training batch into the neural network model, where they are spliced to obtain emotion prediction output features, and recognizing the speech emotion of the user according to the emotion prediction output features.
The embodiment of the invention also provides voice emotion recognition equipment which comprises a processor, a memory and a computer program stored in the memory, wherein the computer program can be executed by the processor to realize the voice emotion recognition method in the embodiment.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the device where the computer-readable storage medium is located is controlled to execute the speech emotion recognition method described in the above embodiment.
Illustratively, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to accomplish the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the speech emotion recognition device.
The speech emotion recognition device can include, but is not limited to, a processor and a memory. It will be understood by those skilled in the art that the schematic diagram is merely an example of the speech emotion recognition device and does not constitute a limitation; the device may include more or fewer components than those shown, combine certain components, or use different components; for example, the speech emotion recognition device may further include an input/output device, a network access device, a bus, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the speech emotion recognition device and connects the various parts of the entire device using various interfaces and lines.
The memory can be used for storing the computer programs and/or modules, and the processor realizes the various functions of the speech emotion recognition device by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the integrated unit of the speech emotion recognition device is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the above embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments can be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (6)

1. A speech emotion recognition method is characterized by comprising the following steps:
acquiring a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
acquiring voice characteristics of a user;
extracting one-hot vectors of each phoneme label, and splicing each one-hot vector into a two-dimensional matrix according to a time domain to generate phoneme characteristics;
respectively inputting the voice features and the phoneme features into a neural network model to obtain emotion prediction output features, and identifying the voice emotion of the user according to the emotion prediction output features; the voice features are spectrogram features; the method comprises the following specific steps:
normalizing the spectrogram features to extract spectrogram image texture features;
segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and respectively inputting the spectrogram features and the phoneme features of the training batch into the neural network model, where they are spliced to obtain the emotion prediction output features, and identifying the speech emotion of the user according to the emotion prediction output features.
2. The speech emotion recognition method of claim 1, wherein the emotion data set includes at least one of: happy, sad, and upset; and the phoneme label is defined using 39 phonemes.
3. A speech emotion recognition apparatus, comprising:
a phoneme label obtaining unit, configured to obtain a phoneme label of a user, wherein the phoneme label is obtained for the user's emotion data set by means of a pronunciation dictionary;
the voice feature acquisition unit is used for acquiring the voice features of the user;
a one-hot vector extraction unit, configured to extract one-hot vectors of each phoneme label, and splice each one-hot vector into a two-dimensional matrix according to a time domain to generate phoneme features;
the voice emotion recognition unit is used for respectively inputting the voice features and the phoneme features into a neural network model so as to obtain emotion prediction output features, and recognizing the voice emotion of the user according to the emotion prediction output features; the voice features are spectrogram features;
the speech emotion recognition unit comprises:
the normalization module is used for performing normalization processing on the spectrogram characteristics so as to extract spectrogram image texture characteristics;
the segmentation module is used for segmenting the spectrogram image texture features into segment-level features of equal length, and zero-padding the parts whose length is insufficient, to form the spectrogram features of a training batch;
and the speech emotion recognition module is used for respectively inputting the spectrogram features and the phoneme features of the training batch into the neural network model, where they are spliced to obtain emotion prediction output features, and recognizing the speech emotion of the user according to the emotion prediction output features.
4. The speech emotion recognition device of claim 3, wherein the emotion data set includes at least one of: happy, sad, and upset; and the phoneme label is defined using 39 phonemes.
5. A speech emotion recognition device comprising a processor, a memory and a computer program stored in the memory, the computer program being executable by the processor to implement the speech emotion recognition method as claimed in any one of claims 1 to 2.
6. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the speech emotion recognition method according to any one of claims 1-2.
CN202010554606.5A 2020-06-17 2020-06-17 Speech emotion recognition method, device, equipment and computer storage medium Active CN111696579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010554606.5A CN111696579B (en) 2020-06-17 2020-06-17 Speech emotion recognition method, device, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010554606.5A CN111696579B (en) 2020-06-17 2020-06-17 Speech emotion recognition method, device, equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN111696579A CN111696579A (en) 2020-09-22
CN111696579B true CN111696579B (en) 2022-10-28

Family

ID=72481723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010554606.5A Active CN111696579B (en) 2020-06-17 2020-06-17 Speech emotion recognition method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN111696579B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750468A (en) * 2020-12-28 2021-05-04 厦门嘉艾医疗科技有限公司 Parkinson disease screening method, device, equipment and storage medium
CN113257225B (en) * 2021-05-31 2021-11-02 之江实验室 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN114566189B (en) * 2022-04-28 2022-10-04 之江实验室 Speech emotion recognition method and system based on three-dimensional depth feature fusion

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107845390A (en) * 2017-09-21 2018-03-27 太原理工大学 A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
WO2019139430A1 (en) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
CN110050302A (en) * 2016-10-04 2019-07-23 纽昂斯通讯有限公司 Speech synthesis
CN110148406A (en) * 2019-04-12 2019-08-20 北京搜狗科技发展有限公司 A kind of data processing method and device, a kind of device for data processing
CN111079794A (en) * 2019-11-21 2020-04-28 华南师范大学 Sound data enhancement method based on inter-category mutual fusion
CN111210807A (en) * 2020-02-21 2020-05-29 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10249294B2 (en) * 2016-09-09 2019-04-02 Electronics And Telecommunications Research Institute Speech recognition system and method
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110050302A (en) * 2016-10-04 2019-07-23 纽昂斯通讯有限公司 Speech synthesis
CN107845390A (en) * 2017-09-21 2018-03-27 太原理工大学 A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features
WO2019139430A1 (en) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN110148406A (en) * 2019-04-12 2019-08-20 北京搜狗科技发展有限公司 A kind of data processing method and device, a kind of device for data processing
CN111079794A (en) * 2019-11-21 2020-04-28 华南师范大学 Sound data enhancement method based on inter-category mutual fusion
CN111210807A (en) * 2020-02-21 2020-05-29 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech emotion recognition algorithm based on extracting deep spatial attention features from spectrograms; Wang Jinhua et al.; Telecommunications Science; No. 7, July 2019; pp. 100-108 *
Research on spectrogram feature extraction algorithms for speech emotion recognition; Tang Guichen et al.; Computer Engineering and Applications; Vol. 52, No. 21, 2016; pp. 152-156, 174 *

Also Published As

Publication number Publication date
CN111696579A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111696579B (en) Speech emotion recognition method, device, equipment and computer storage medium
Jing et al. Prominence features: Effective emotional features for speech emotion recognition
US20190005943A1 (en) Speech recognition system using machine learning to classify phone posterior context information and estimate boundaries in speech from combined boundary posteriors
US8676574B2 (en) Method for tone/intonation recognition using auditory attention cues
CN109036381A (en) Method of speech processing and device, computer installation and readable storage medium storing program for executing
KR20130133858A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN110083716A (en) Multi-modal affection computation method and system based on Tibetan language
Sethu et al. Speech based emotion recognition
CN113223560A (en) Emotion recognition method, device, equipment and storage medium
WO2021175031A1 (en) Information prompting method and apparatus, electronic device, and medium
Patnaik Speech emotion recognition by using complex MFCC and deep sequential model
Chittaragi et al. Automatic dialect identification system for Kannada language using single and ensemble SVM algorithms
CN111179910A (en) Speed of speech recognition method and apparatus, server, computer readable storage medium
Airaksinen et al. Data augmentation strategies for neural network F0 estimation
CN108899046A (en) A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification
CN114927126A (en) Scheme output method, device and equipment based on semantic analysis and storage medium
Shah et al. Speech emotion recognition based on SVM using MATLAB
CN113539243A (en) Training method of voice classification model, voice classification method and related device
Trouvain et al. Canary song decoder: Transduction and implicit segmentation with ESNs and LTSMs
Noroozi et al. A study of language and classifier-independent feature analysis for vocal emotion recognition
China Bhanja et al. Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system
Sinha et al. Fusion of multi-stream speech features for dialect classification
CN114566156A (en) Keyword speech recognition method and device
Yadav et al. Speech Emotion Recognition using Convolutional Recurrent Neural Network
Bakshi et al. Improving Indian spoken-language identification by feature selection in duration mismatch framework

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant