CN115273907A - Speech emotion recognition method and device
- Publication number: CN115273907A (application number CN202210908406.4A)
- Authority: CN (China)
- Prior art keywords: voice, text, word, information, voice file
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for comparison or discrimination, for estimating an emotional state
- G10L15/26 - Speech recognition; speech to text systems
- G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The application discloses a speech emotion recognition method and a speech emotion recognition device, which can be applied to the field of big data or the field of finance. The method comprises the following steps: acquiring a voice file; preprocessing the voice file to obtain voice characteristic information corresponding to the voice file; starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information; weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and a weighted text vector; fusing the weighted voice characteristic information and the weighted text vector to obtain a fusion feature corresponding to the voice file; and inputting the fusion feature into a preset maximum pooling layer and a preset full connection layer for emotion analysis to obtain the emotion type corresponding to the voice file. With the method provided by the invention, speech emotion can be recognized by combining text information with the speech characteristics, which improves the accuracy of speech emotion recognition.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice emotion recognition method and device.
Background
With the development of artificial intelligence, affective computing has become increasingly important. Affective computing attempts to endow robots with the ability to observe, understand and generate various emotions, so that a robot possesses emotion and becomes more human-like. Speech, as an important medium of human communication, carries a large amount of emotional information, and speech emotion recognition can greatly improve a machine's ability to understand the emotion in human speech. It is therefore widely applied in human-computer dialogue, making human-computer interaction more natural and harmonious.
Speech emotion recognition methods in the prior art perform emotion classification on speech features through a neural network, but these methods focus only on the acoustic aspect of the signal, so the prior-art methods still cannot adequately express the emotion contained in the speech.
Disclosure of Invention
In view of the above, the present invention provides a speech emotion recognition method, by which speech emotion recognition can be performed in combination with text information in addition to speech features, and speech emotion recognition accuracy is improved.
The invention also provides a speech emotion recognition apparatus, which is used to ensure that the above method can be implemented and applied in practice.
A speech emotion recognition method includes:
acquiring a voice file;
preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information;
weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file;
inputting the fusion characteristics into a preset maximum pooling layer and a preset full connection layer for emotion analysis, and obtaining the emotion types corresponding to the voice files.
Optionally, in the method, the preprocessing the voice file to obtain the voice feature information corresponding to the voice file includes:
acquiring MFCC features in the voice file;
and processing the MFCC features by applying a preset BiLSTM to obtain the voice characteristic information corresponding to the voice file.
Optionally, the method for converting the voice file into text information by using a preset text processing tool includes:
enabling the text processing tool to convert the voice file into initial text information;
and performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
Optionally, the above method, generating a text vector corresponding to the text information includes:
performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK, and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain a text vector corresponding to the text information output by the BiLSTM.
Optionally, in the method, the fusing the weighted speech feature information with the weighted text vector to obtain a fusion feature corresponding to the speech file includes:
acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
weighting to obtain word voice characteristics of voice corresponding to each word based on the word characteristics of each word and the frame voice characteristics by applying a preset attention mechanism;
and splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
A speech emotion recognition apparatus comprising:
an acquisition unit configured to acquire a voice file;
the first processing unit is used for preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
the conversion unit is used for starting a preset text processing tool, converting the voice file into text information and generating a text vector corresponding to the text information;
the second processing unit is used for carrying out weighting processing on the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
the feature fusion unit is used for fusing the weighted voice feature information and the weighted text vector to obtain fusion features corresponding to the voice file;
and the analysis unit is used for inputting the fusion characteristics into a preset maximum pooling layer and a preset full connection layer for emotion analysis to obtain the emotion types corresponding to the voice files.
The above apparatus, optionally, the first processing unit includes:
a first obtaining subunit, configured to obtain an MFCC feature in the voice file;
and the first processing subunit is used for processing the MFCC features by applying a preset BiLSTM to obtain the voice characteristic information corresponding to the voice file.
The above apparatus, optionally, the conversion unit includes:
the first conversion subunit is used for starting the text processing tool and converting the voice file into initial text information;
and the data cleaning subunit is used for performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
The above apparatus, optionally, the conversion unit includes:
the second processing subunit is used for performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
the second conversion subunit is used for performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and the input subunit is used for inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain a text vector corresponding to the text information output by the BiLSTM.
The above apparatus, optionally, the feature fusion unit includes:
the second acquiring subunit is used for acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
the weighting subunit is used for weighting and obtaining the word voice characteristics of the voice corresponding to each word based on the word characteristics of each word and the frame voice characteristics by applying a preset attention mechanism;
and the splicing subunit is used for splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
A storage medium, the storage medium comprising stored instructions, wherein when the instructions are executed, a device on which the storage medium is located is controlled to execute the above-mentioned speech emotion recognition method.
An electronic device comprising a memory, one or more processors, and one or more instructions stored in the memory and configured to be executed by the one or more processors to perform the above speech emotion recognition method.
Compared with the prior art, the invention has the following advantages:
the invention provides a speech emotion recognition method, which comprises the following steps: acquiring a voice file; preprocessing the voice file to obtain voice characteristic information corresponding to the voice file; starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information; weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector; fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file; inputting the fusion characteristics into a preset maximum pooling layer and a preset full connection layer for emotion analysis, and obtaining the emotion types corresponding to the voice files. By applying the method provided by the invention, the speech emotion can be recognized by combining text information besides the speech characteristics, and the recognition precision of the speech emotion is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a flowchart of a method for speech emotion recognition according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method of speech emotion recognition provided in the embodiments of the present invention;
FIG. 3 is a flowchart of another speech emotion recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech emotion recognition apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In this application, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions, and the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The invention is operational with numerous general purpose or special purpose computing device environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multi-processor apparatus, distributed computing environments that include any of the above devices or equipment, and the like.
The embodiment of the invention provides a speech emotion recognition method, which can be applied to various system platforms, wherein the execution main body of the method can be a computer terminal or a processor of various mobile devices, and a flow chart of the method is shown in figure 1, and the method specifically comprises the following steps:
S101: and acquiring a voice file.
In the present invention, the voice file contains voice data.
S102: and preprocessing the voice file to obtain voice characteristic information corresponding to the voice file.
The preprocessing of the voice file comprises the following steps:
acquiring MFCC features in the voice file;
and processing the MFCC features by applying a preset BiLSTM to obtain the voice characteristic information corresponding to the voice file.
It should be noted that MFCC (Mel-Frequency Cepstral Coefficient) features are a low-dimensional representation of the voice. The low-dimensional, frame-based MFCC features of the voice are obtained first, and a BiLSTM is then applied to produce a high-dimensional frame-based feature representation, yielding the voice characteristic information corresponding to the voice file.
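As an illustrative sketch of this step, the snippet below extracts frame-based MFCC features and passes them through a BiLSTM encoder. The use of librosa and PyTorch, as well as the dimensions chosen (40 MFCC coefficients, 128 hidden units), are assumptions made for the example and are not fixed by the invention.

```python
# Hypothetical sketch of step S102: MFCC extraction followed by a BiLSTM encoder.
# Library choice and dimensions are assumptions, not values specified in the patent.
import librosa
import torch
import torch.nn as nn

def acoustic_features(path: str) -> torch.Tensor:
    """Return per-frame voice features of shape (1, n_frames, 256)."""
    audio, sr = librosa.load(path, sr=16000)                  # read the voice file
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)    # (40, n_frames), low-dimensional
    frames = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)  # (1, n_frames, 40)

    # The BiLSTM lifts the frame-based MFCCs to a higher-dimensional frame representation.
    # In a real system this BiLSTM would be trained as part of the recognition model.
    encoder = nn.LSTM(input_size=40, hidden_size=128,
                      bidirectional=True, batch_first=True)
    speech_feats, _ = encoder(frames)                         # (1, n_frames, 256)
    return speech_feats
```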
S103: and starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information.
It should be noted that the text processing tool may be an Automatic Speech Recognition (ASR) technique.
Further, when converting a voice file into text information, the following process may be performed:
enabling the text processing tool to convert the voice file into initial text information;
and performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
It should be noted that, owing to unclear speech, polyphonic characters and the like, the conversion of the voice data in the voice file may contain some text errors. The final text information is therefore obtained by cleaning the data, correcting the text content of the initial text information, and removing the invalid characters and stop words therein.
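A minimal sketch of such a cleaning step is given below, assuming a simple regular expression for invalid characters and the NLTK English stop-word list; the patent does not prescribe a particular character set or stop-word vocabulary.

```python
# Illustrative cleaning of an ASR transcript. The regular expression and the
# English stop-word list are assumptions made for this sketch.
import re
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

def clean_transcript(initial_text: str) -> str:
    # Drop invalid characters, keeping only letters, digits and whitespace.
    text = re.sub(r"[^0-9A-Za-z\s]", " ", initial_text)
    # Remove stop words.
    stops = set(stopwords.words("english"))
    words = [w for w in text.lower().split() if w not in stops]
    return " ".join(words)
```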
S104: and performing weighting processing on the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector.
Specifically, an Attention mechanism can be used to dynamically learn the weight of each word's text feature and the feature of each frame of voice. Within the whole voice file, each frame of voice carries a different amount of information, and some frames contain the key information. The invention therefore multiplies the features of each frame of voice by the weights derived from the text features, thereby determining the importance of each frame of voice; this is the weighting process. The weighted features of the frames are then summed with the text features of each word to obtain the voice alignment feature of each word, the aligned features are concatenated with the text features to obtain the fused features, and finally these features are input into the BiLSTM for feature processing.
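The sketch below illustrates this frame-level weighting under the assumption of dot-product attention scores between each word's text vector and each frame's voice features, with both assumed to have been projected to a common dimension; the patent states only that an attention mechanism weights the frames, not the exact scoring function.

```python
# A minimal attention sketch, assuming dot-product scoring. A learned projection
# would be needed in practice if the text and voice feature dimensions differ.
import torch
import torch.nn.functional as F

def align_speech_to_words(word_vecs: torch.Tensor,    # (n_words, d) text features
                          frame_feats: torch.Tensor   # (n_frames, d) voice features
                          ) -> torch.Tensor:
    """Return one attention-weighted voice feature per word, shape (n_words, d)."""
    scores = word_vecs @ frame_feats.T                 # (n_words, n_frames)
    weights = F.softmax(scores, dim=-1)                # importance of each frame for each word
    aligned = weights @ frame_feats                    # weighted sum over the frames
    return aligned
```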
S105: and fusing the weighted voice characteristic information and the weighted text vector to obtain fusion characteristics corresponding to the voice file.
Wherein, the fusion of the features can be performed by the BiLSTM.
S106: inputting the fusion characteristics into a preset maximum pooling layer and a preset full connection layer for emotion analysis, and obtaining the emotion types corresponding to the voice files.
It should be noted that the maximum pooling layer and the full connection layer may be processing modules in the BiLSTM.
In the method provided by the embodiment of the present invention, a voice file is acquired and preprocessed to obtain the corresponding voice characteristic information; at the same time, the voice file is converted into text information, and the text information is converted into a text vector. After the voice characteristic information and the text vector are weighted, they are fused, and the maximum pooling layer and the full connection layer can analyse the fused features while taking into account, for example, the context, word meaning, speech rate and the like, so as to obtain the emotion type corresponding to the voice file.
Further, the emotion number corresponding to the emotion type is output, and the emotion of the user corresponding to the voice file can be obtained according to the emotion number.
By applying the method provided by the embodiment of the invention, speech emotion can be recognized by combining text information with the speech characteristics, and the recognition accuracy of speech emotion is improved.
In the method provided in the embodiment of the present invention, the process of generating the text vector corresponding to the text information is shown in fig. 2, and specifically may include:
S201: and performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information.
S202: and performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK, and converting each word into a corresponding 300-dimensional vector based on the part of speech of each word.
Wherein the 300-dimensional vector for each word contains additional contextual meaning between the words.
S203: and inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain a text vector corresponding to the text information output by the BiLSTM.
It should be noted that BiLSTM, i.e., bidirectional LSTM, is composed of two separate LSTMs combined together, one processing the input sequence forward and the other backward.
In the present invention, text information is extracted from the speech file with high accuracy using an Automatic Speech Recognition (ASR) technique. The present invention uses the processed textual information as another modality for predicting the emotion category of a given signal. To use the textual information, the speech transcript is tokenized and encoded into a token sequence using the Natural Language Toolkit (NLTK). Each token is then passed through a word embedding layer that converts the word index into a corresponding 300-dimensional vector containing additional contextual meaning between the words. The sequence of embedded tokens is fed into the text RNN, and finally the emotion class is predicted from the last hidden state of the text RNN using the SoftMax function.
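A hedged sketch of this text branch follows: NLTK tokenization and part-of-speech tagging, a 300-dimensional word embedding layer, a BiLSTM text RNN, and a SoftMax over its last hidden state. The vocabulary handling, hidden size and number of emotion classes are assumptions made only for illustration.

```python
# Hypothetical text branch: tokenize and POS-tag with NLTK, embed each token as a
# 300-dimensional vector, run a BiLSTM, and predict the emotion class with SoftMax.
import nltk                      # requires the 'punkt' and POS tagger resources
import torch
import torch.nn as nn

def tokens_with_pos(transcript: str):
    tokens = nltk.word_tokenize(transcript)
    return nltk.pos_tag(tokens)  # [(word, POS), ...]

class TextEmotionRNN(nn.Module):
    def __init__(self, vocab_size: int, n_classes: int = 4):   # 4 classes is an assumption
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)              # 300-dimensional word vectors
        self.bilstm = nn.LSTM(300, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        vecs = self.embed(token_ids)            # (batch, n_words, 300)
        out, _ = self.bilstm(vecs)              # (batch, n_words, 256)
        logits = self.fc(out[:, -1, :])         # last hidden state of the text RNN
        return torch.softmax(logits, dim=-1)    # emotion class probabilities
```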
In the method provided in the embodiment of the present invention, the process of fusing the weighted speech feature information and the weighted text vector to obtain the fusion feature corresponding to the speech file is shown in fig. 3, and may specifically include:
S301: acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
S302: weighting to obtain word voice characteristics of voice corresponding to each word based on the word characteristics of each word and the frame voice characteristics by applying a preset attention mechanism;
S303: and splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
In the invention, an Attention mechanism is used to dynamically learn the weight of each word's text feature and the feature of each frame of voice; the voice alignment feature of each word is then obtained by weighted summation; the aligned features and the text features are spliced together and fused by the BiLSTM; and finally the maximum pooling layer and the full connection layer are used for emotion classification.
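The sketch below puts this last stage together under the same assumptions as the earlier snippets: the attention-aligned per-word voice features (see align_speech_to_words above) are spliced with the 300-dimensional word vectors, fused by a BiLSTM, max-pooled over the word axis, and classified by a full connection (fully connected) layer. Layer sizes and the number of emotion classes are illustrative only.

```python
# Hypothetical fusion and classification head: splice aligned voice features with the
# word vectors, fuse with a BiLSTM, apply max pooling, and classify with a full
# connection layer. All dimensions are assumptions made for this sketch.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, speech_dim: int = 256, text_dim: int = 300, n_classes: int = 4):
        super().__init__()
        self.bilstm = nn.LSTM(speech_dim + text_dim, 128,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, aligned_speech: torch.Tensor,   # (batch, n_words, speech_dim)
                word_vecs: torch.Tensor               # (batch, n_words, text_dim)
                ) -> torch.Tensor:
        fused = torch.cat([aligned_speech, word_vecs], dim=-1)  # splice per word
        out, _ = self.bilstm(fused)                              # (batch, n_words, 256)
        pooled, _ = out.max(dim=1)                               # max pooling over the words
        return self.fc(pooled)                                   # emotion class scores
```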
The specific implementation procedures and derivatives thereof of the above embodiments are within the scope of the present invention.
Corresponding to the method described in fig. 1, an embodiment of the present invention further provides a speech emotion recognition apparatus, which is used for specifically implementing the method in fig. 1, where the speech emotion recognition apparatus provided in the embodiment of the present invention may be applied to a computer terminal or various mobile devices, and a schematic structural diagram of the speech emotion recognition apparatus is shown in fig. 4, and specifically includes:
an obtaining unit 401 configured to obtain a voice file;
a first processing unit 402, configured to pre-process the voice file, and obtain voice feature information corresponding to the voice file;
a conversion unit 403, configured to start a preset text processing tool, convert the voice file into text information, and generate a text vector corresponding to the text information;
a second processing unit 404, configured to perform weighting processing on the speech feature information and the text vector, so as to obtain weighted speech feature information and weighted text vector;
a feature fusion unit 405, configured to fuse the weighted speech feature information and the weighted text vector to obtain a fusion feature corresponding to the speech file;
and the analysis unit 406 is configured to input the fusion feature into a preset maximum pooling layer and a preset full connection layer for emotion analysis, so as to obtain an emotion type corresponding to the voice file.
In the device provided by the embodiment of the present invention, a voice file is acquired and preprocessed to obtain the corresponding voice characteristic information; at the same time, the voice file is converted into text information, and the text information is converted into a text vector. After the voice characteristic information and the text vector are weighted, they are fused, and the maximum pooling layer and the full connection layer can analyse the fused features while taking into account, for example, the context, word meaning, speech rate and the like, so as to obtain the emotion type corresponding to the voice file.
By applying the device provided by the embodiment of the invention, speech emotion can be recognized by combining text information with the speech characteristics, and the recognition accuracy of speech emotion is improved.
In the apparatus provided in the embodiment of the present invention, the first processing unit 402 includes:
a first obtaining subunit, configured to obtain an MFCC feature in the voice file;
and the first processing subunit is used for processing the MFCC features by applying the BiLSTM to obtain the voice characteristic information corresponding to the voice file.
In the apparatus provided in the embodiment of the present invention, the conversion unit 403 includes:
the first conversion subunit is used for starting the text processing tool and converting the voice file into initial text information;
and the data cleaning subunit is used for performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
In the apparatus provided in the embodiment of the present invention, the conversion unit 403 includes:
the second processing subunit is used for performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
the second conversion subunit is used for performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and the input subunit is used for inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain a text vector corresponding to the text information output by the BiLSTM.
In the apparatus provided in the embodiment of the present invention, the feature fusion unit 405 includes:
the second acquiring subunit is used for acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
the weighting subunit is used for weighting and obtaining the word speech characteristics of the speech corresponding to each word based on the word characteristics of each word and the frame speech characteristics by applying a preset attention mechanism;
and the splicing subunit is used for splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
The specific working processes of each unit and sub-unit in the speech emotion recognition apparatus disclosed in the above embodiment of the present invention can refer to the corresponding contents in the speech emotion recognition method disclosed in the above embodiment of the present invention, and are not described herein again.
It should be noted that the speech emotion recognition method and device provided by the invention can be applied to the field of cloud computing or the field of finance. The foregoing is merely an example, and does not limit the application field of the speech emotion recognition method and apparatus provided by the present invention.
The speech emotion recognition method and the speech emotion recognition device can be used in the financial field or other fields, for example, can be used in speech service application scenes in the financial field. Other fields are any fields other than the financial field, for example, the cloud computing field. The foregoing is only an example, and does not limit the application field of the speech emotion recognition method and apparatus provided by the present invention.
The embodiment of the invention also provides a storage medium, which comprises a stored instruction, wherein when the instruction runs, the equipment where the storage medium is located is controlled to execute the speech emotion recognition method.
An electronic device is provided in an embodiment of the present invention, and the structural diagram of the electronic device is shown in fig. 5, which specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501, and are configured to be executed by one or more processors 503 to perform the following operations according to the one or more instructions 502:
acquiring a voice file;
preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information;
weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file;
and inputting the fusion features into a preset maximum pooling layer and a full connection layer for emotion analysis to obtain the emotion types corresponding to the voice files.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments, which are substantially similar to the method embodiments, are described in a relatively simple manner, and reference may be made to some descriptions of the method embodiments for relevant points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
To clearly illustrate this interchangeability of hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A speech emotion recognition method is characterized by comprising the following steps:
acquiring a voice file;
preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information;
weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
fusing the weighted voice characteristic information and the weighted text vector to obtain fusion characteristics corresponding to the voice file;
inputting the fusion characteristics into a preset maximum pooling layer and a preset full connection layer for emotion analysis, and obtaining the emotion types corresponding to the voice files.
2. The method according to claim 1, wherein the preprocessing the voice file to obtain the voice feature information corresponding to the voice file includes:
obtaining MFCC features in the voice file;
and processing the MFCC features by using a preset BiLSTM to obtain the voice characteristic information corresponding to the voice file.
3. The method of claim 1, wherein the enabling of a pre-configured text processing tool to convert the voice file into text information comprises:
enabling the text processing tool to convert the voice file into initial text information;
and performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
4. The method according to claim 2, wherein the generating a text vector corresponding to the text information comprises:
performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK, and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain a text vector corresponding to the text information output by the BiLSTM.
5. The method according to claim 4, wherein the fusing the weighted speech feature information with the weighted text vector to obtain the corresponding fused feature of the speech file comprises:
acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
weighting and obtaining the word voice characteristics of the voice corresponding to each word by applying a preset attention mechanism based on the word characteristics of each word and the frame voice characteristics;
and splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
6. A speech emotion recognition apparatus, comprising:
an acquisition unit configured to acquire a voice file;
the first processing unit is used for preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
the conversion unit is used for starting a preset text processing tool, converting the voice file into text information and generating a text vector corresponding to the text information;
the second processing unit is used for carrying out weighting processing on the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
the feature fusion unit is used for fusing the weighted voice feature information and the weighted text vector to obtain fusion features corresponding to the voice file;
and the analysis unit is used for inputting the fusion characteristics into a preset maximum pooling layer and a preset full connection layer for emotion analysis to obtain the emotion types corresponding to the voice files.
7. The apparatus of claim 6, wherein the first processing unit comprises:
a first obtaining subunit, configured to obtain an MFCC feature in the voice file;
and the first processing subunit is used for processing the MFCC features by applying a preset BiLSTM to obtain the voice characteristic information corresponding to the voice file.
8. The apparatus of claim 6, wherein the conversion unit comprises:
the first conversion subunit is used for starting the text processing tool and converting the voice file into initial text information;
and the data cleaning subunit is used for performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
9. The apparatus of claim 7, wherein the conversion unit comprises:
the second processing subunit is used for performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
the second conversion subunit is used for performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and the input subunit is used for inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain a text vector corresponding to the text information output by the BiLSTM.
10. The apparatus of claim 9, wherein the feature fusion unit comprises:
the second acquiring subunit is used for acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
the weighting subunit is used for weighting and obtaining the word speech characteristics of the speech corresponding to each word based on the word characteristics of each word and the frame speech characteristics by applying a preset attention mechanism;
and the splicing subunit is used for splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210908406.4A | 2022-07-29 | 2022-07-29 | Speech emotion recognition method and device |

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210908406.4A | 2022-07-29 | 2022-07-29 | Speech emotion recognition method and device |

Publications (1)

Publication Number | Publication Date |
---|---|
CN115273907A | 2022-11-01 |

Family
ID=83770565

Family Applications (1)

Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210908406.4A (pending) | Speech emotion recognition method and device | 2022-07-29 | 2022-07-29 |

Country Status (1)

Country | Link |
---|---|
CN | CN115273907A |

2022-07-29: Application CN202210908406.4A filed; published as CN115273907A, status Pending.
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |