CN115273907A - Speech emotion recognition method and device - Google Patents

Speech emotion recognition method and device

Info

Publication number
CN115273907A
Authority
CN
China
Prior art keywords
voice
text
word
information
voice file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210908406.4A
Other languages
Chinese (zh)
Inventor
殷素素
汪兰叶
吕雨慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN202210908406.4A
Publication of CN115273907A
Legal status: Pending (Current)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech emotion recognition method and a speech emotion recognition apparatus, which can be applied to the field of big data or the field of finance. The method comprises the following steps: acquiring a voice file; preprocessing the voice file to obtain voice feature information corresponding to the voice file; starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information; weighting the voice feature information and the text vector to obtain weighted voice feature information and a weighted text vector; fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file; and inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file. By applying the provided method, speech emotion can be recognized by combining text information with the speech features, which improves the accuracy of speech emotion recognition.

Description

Speech emotion recognition method and device
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a speech emotion recognition method and device.
Background
With the development of artificial intelligence, affective computing has become increasingly important. Affective computing attempts to endow machines with the ability to perceive, understand and express emotions, making them more human-like. Speech, an important medium of human communication, carries a large amount of emotional information. Speech emotion recognition improves a machine's ability to understand the emotion in human speech and is therefore widely used in human-computer dialogue, making human-computer interaction more natural and harmonious.
Speech emotion recognition methods in the prior art classify emotions from speech features through a neural network. However, because these methods focus only on the acoustic aspect of the signal, they still cannot adequately capture the emotion expressed in the speech.
Disclosure of Invention
In view of the above, the present invention provides a speech emotion recognition method by which speech emotion can be recognized by combining text information with speech features, improving the accuracy of speech emotion recognition.
The invention also provides a speech emotion recognition apparatus to ensure that the method can be implemented and applied in practice.
A speech emotion recognition method includes:
acquiring a voice file;
preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information;
weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and a weighted text vector;
fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file;
inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
Optionally, in the method, the preprocessing the voice file to obtain the voice feature information corresponding to the voice file includes:
acquiring MFCC features in the voice file;
and processing the MFCC features by applying a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
Optionally, the method for converting the voice file into text information by using a preset text processing tool includes:
enabling the text processing tool to convert the voice file into initial text information;
and performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
Optionally, the above method, generating a text vector corresponding to the text information includes:
performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK, and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
Optionally, in the method, the fusing the weighted speech feature information with the weighted text vector to obtain a fusion feature corresponding to the speech file includes:
acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
applying a preset attention mechanism to obtain, by weighting, the word-level speech feature corresponding to each word, based on the word feature of that word and the frame speech features;
and splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
A speech emotion recognition apparatus comprising:
an acquisition unit configured to acquire a voice file;
the first processing unit is used for preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
the conversion unit is used for starting a preset text processing tool, converting the voice file into text information and generating a text vector corresponding to the text information;
the second processing unit is used for carrying out weighting processing on the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
the feature fusion unit is used for fusing the weighted voice feature information and the weighted text vector to obtain fusion features corresponding to the voice file;
and the analysis unit is used for inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
The above apparatus, optionally, the first processing unit includes:
a first obtaining subunit, configured to obtain an MFCC feature in the voice file;
and the first processing subunit is used for processing the MFCC features by applying a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
The above apparatus, optionally, the conversion unit includes:
the first conversion subunit is used for starting the text processing tool and converting the voice file into initial text information;
and the data cleaning subunit is used for performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
The above apparatus, optionally, the conversion unit includes:
the second processing subunit is used for performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
the second conversion subunit is used for performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and the input subunit is used for inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
The above apparatus, optionally, the feature fusion unit includes:
the second acquiring subunit is used for acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
the weighting subunit is used for weighting and obtaining the word voice characteristics of the voice corresponding to each word based on the word characteristics of each word and the frame voice characteristics by applying a preset attention mechanism;
and the splicing subunit is used for splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
A storage medium, the storage medium comprising stored instructions, wherein when the instructions are executed, a device on which the storage medium is located is controlled to execute the above-mentioned speech emotion recognition method.
An electronic device comprising a memory, one or more processors, and one or more instructions stored in the memory and configured to be executed by the one or more processors to perform the above speech emotion recognition method.
Compared with the prior art, the invention has the following advantages:
the invention provides a speech emotion recognition method, which comprises the following steps: acquiring a voice file; preprocessing the voice file to obtain voice characteristic information corresponding to the voice file; starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information; weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector; fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file; inputting the fusion characteristics into a preset maximum pooling layer and a preset full connection layer for emotion analysis, and obtaining the emotion types corresponding to the voice files. By applying the method provided by the invention, the speech emotion can be recognized by combining text information besides the speech characteristics, and the recognition precision of the speech emotion is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only embodiments of the present invention, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flowchart of a method for speech emotion recognition according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method of speech emotion recognition provided in the embodiments of the present invention;
FIG. 3 is a flowchart of yet another speech emotion recognition method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a speech emotion recognition apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In this application, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions, and the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The invention is operational with numerous general purpose or special purpose computing device environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multi-processor apparatus, distributed computing environments that include any of the above devices or equipment, and the like.
The embodiment of the invention provides a speech emotion recognition method which can be applied to various system platforms. The method may be executed by a computer terminal or by the processor of any of various mobile devices. A flowchart of the method is shown in FIG. 1, and the method specifically comprises the following steps:
s101: and acquiring a voice file.
In the present invention, the voice file contains voice data.
S102: and preprocessing the voice file to obtain voice characteristic information corresponding to the voice file.
Preprocessing the voice file comprises the following steps:
acquiring MFCC features from the voice file;
processing the MFCC features by applying a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
It should be noted that MFCCs (Mel-Frequency Cepstral Coefficients) are low-dimensional speech features. The low-dimensional frame-based MFCC features of the speech are obtained first, and a BiLSTM is then used to produce a high-dimensional frame-based feature representation, giving the voice feature information corresponding to the voice file.
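A minimal sketch of this preprocessing step is given below, assuming librosa for MFCC extraction and a PyTorch bidirectional LSTM; the sampling rate, the 40 MFCC coefficients, the 128 hidden units and the file name sample.wav are illustrative assumptions rather than values fixed by this disclosure.

```python
# Illustrative sketch: frame-based MFCC extraction followed by a BiLSTM encoder.
# Library choices (librosa, PyTorch) and all dimensions are assumptions, not part of the disclosure.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(wav_path, n_mfcc=40):
    """Load a voice file and return its frame-based MFCC features, shape (frames, n_mfcc)."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return torch.tensor(mfcc.T, dtype=torch.float32)              # (frames, n_mfcc)

class SpeechEncoder(nn.Module):
    """BiLSTM that lifts the low-dimensional frame MFCCs to a higher-dimensional representation."""
    def __init__(self, n_mfcc=40, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)

    def forward(self, mfcc):                    # mfcc: (batch, frames, n_mfcc)
        frame_features, _ = self.bilstm(mfcc)   # (batch, frames, 2 * hidden)
        return frame_features

# Usage: per-frame speech feature information for one voice file (hypothetical file name).
mfcc = extract_mfcc("sample.wav").unsqueeze(0)  # (1, frames, 40)
speech_features = SpeechEncoder()(mfcc)         # (1, frames, 256)
```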
S103: and starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information.
It should be noted that the text processing tool may be an Automatic Speech Recognition (ASR) technique.
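The disclosure does not name a specific ASR engine. Purely as an example, the open-source SpeechRecognition package could produce the initial text information from a voice file; the Google Web Speech backend and the file name are assumptions.

```python
# Illustrative only: obtaining the initial text information from a voice file with the
# SpeechRecognition package; the recognizer backend and the file name are assumptions.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:          # WAV/AIFF/FLAC input
    audio = recognizer.record(source)
initial_text = recognizer.recognize_google(audio)   # initial text information, may contain errors
```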
Further, when converting a voice file into text information, the following process may be performed:
enabling the text processing tool to convert the voice file into initial text information;
and performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
It should be noted that the initial text may contain conversion errors caused by unclear pronunciation, polyphonic characters and the like. Data cleaning corrects the text content of the initial text information and removes the invalid characters and stop words in it, giving the final text information.
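A hedged sketch of the data-cleaning step follows; the regular expression and the NLTK English stop-word list are illustrative choices, since the disclosure does not specify them.

```python
# Illustrative sketch of cleaning the initial text information: remove invalid characters
# and stop words. The regular expression and the stop-word source are assumptions.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_transcript(initial_text: str) -> str:
    """Return the cleaned text information corresponding to the voice file."""
    text = re.sub(r"[^\w\s]", " ", initial_text)            # drop invalid (non-word) characters
    tokens = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(tokens)                                  # stop words removed

cleaned = clean_transcript("Well, I am really not happy with this service!!!")
```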
S104: and performing weighting processing on the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector.
Specifically, an attention mechanism can be used to dynamically learn the weight of each word's text feature with respect to the feature of each frame of speech. Within a voice file, different frames carry different amounts of information, and some frames contain the key content. The invention therefore multiplies the feature of each frame of speech by the weight derived from the text features, which determines the importance of that frame; this is the weighting process. The weighted frame features and the text feature of each word are then added to obtain the speech-aligned feature of that word, the aligned features and the text features are concatenated to obtain the fused features, and the fused features are finally fed into the BiLSTM for further feature processing.
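One plausible reading of this weighting step is scaled dot-product attention between word text features and frame speech features, sketched below; the disclosure does not fix the attention formula, so this formulation is an assumption. The aligned features are then concatenated with the word text features and fused, as described in S105 and sketched after S303 below.

```python
# Illustrative sketch of the attention weighting between word text features and frame
# speech features; scaled dot-product attention is an assumption, not quoted from the disclosure.
import torch
import torch.nn.functional as F

def align_speech_to_words(word_feats, frame_feats):
    """
    word_feats:  (batch, n_words,  dim)  text feature of each word
    frame_feats: (batch, n_frames, dim)  speech feature of each frame
    Returns the speech feature aligned to each word, shape (batch, n_words, dim).
    """
    dim = word_feats.size(-1)
    scores = torch.matmul(word_feats, frame_feats.transpose(1, 2)) / dim ** 0.5  # (B, W, F)
    weights = F.softmax(scores, dim=-1)        # importance of every frame for every word
    return torch.matmul(weights, frame_feats)  # weighted sum of frame features per word
```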
S105: and fusing the weighted voice characteristic information and the weighted text vector to obtain fusion characteristics corresponding to the voice file.
The feature fusion can be performed through a BiLSTM.
S106: inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
It should be noted that the max pooling layer and the fully connected layer may be processing modules within the BiLSTM model.
In the method provided by the embodiment of the present invention, a voice file is obtained and preprocessed to obtain the corresponding voice feature information; at the same time, the voice file is converted into text information, and the text information is converted into a text vector. After the voice feature information and the text vector are weighted, they are fused, and the fused features are analyzed by the max pooling layer and the fully connected layer, which can take into account, for example, context, word meaning and speaking rate, to determine the emotion type corresponding to the voice file.
Further, the emotion number corresponding to the emotion type is output, and the emotion of the user associated with the voice file can be obtained from this number.
By applying the method provided by the embodiment of the invention, the speech emotion can be recognized by combining text information besides the speech characteristics, and the recognition precision of the speech emotion is improved.
In the method provided in the embodiment of the present invention, the process of generating the text vector corresponding to the text information is shown in fig. 2, and specifically may include:
s201: and performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information.
S202: and performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK, and converting each word into a corresponding 300-dimensional vector based on the part of speech of each word.
The 300-dimensional vector of each word carries additional contextual meaning between the words.
S203: inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
It should be noted that a BiLSTM (bidirectional LSTM) consists of two separate LSTMs combined together, one reading the sequence forward and the other backward.
In the present invention, text information is extracted from the speech file with high accuracy using automatic speech recognition (ASR) technology. The invention uses the processed text information as another modality for predicting the emotion category of a given signal. To use the text information, the speech transcript is tokenized and encoded into a token sequence using the Natural Language Toolkit (NLTK). Each token is then passed through a word-embedding layer that converts the word index into a corresponding 300-dimensional vector carrying additional contextual meaning between the words. The sequence of embedded tokens is fed into the text RNN, and finally the emotion class is predicted from the last hidden state of the text RNN using the softmax function.
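A sketch of this text branch is shown below, assuming the 300-dimensional embedding stated in the disclosure, a BiLSTM text RNN and a toy vocabulary; the vocabulary size, the four emotion classes and the example sentence are illustrative assumptions.

```python
# Illustrative sketch of the text branch: NLTK tokenization and POS tagging, a 300-dimensional
# word-embedding layer, a BiLSTM text RNN, and a softmax emotion prediction from its last state.
# Vocabulary size, the four emotion classes and the example sentence are assumptions.
import nltk
import torch
import torch.nn as nn

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

class TextEmotionRNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden=128, n_emotions=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)       # word index -> 300-d vector
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, token_ids):                                  # (batch, n_words)
        embedded = self.embedding(token_ids)                       # (batch, n_words, 300)
        outputs, _ = self.bilstm(embedded)
        last_hidden = outputs[:, -1, :]                            # last hidden state of the text RNN
        return torch.softmax(self.classifier(last_hidden), dim=-1) # emotion class probabilities

tokens = nltk.word_tokenize("i am very disappointed with this service")
tagged = nltk.pos_tag(tokens)                                      # (token, part-of-speech) pairs
vocab = {tok: idx for idx, (tok, _) in enumerate(tagged)}          # toy vocabulary for the example
token_ids = torch.tensor([[vocab[tok] for tok, _ in tagged]])
probs = TextEmotionRNN()(token_ids)                                # (1, 4) emotion distribution
```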
In the method provided in the embodiment of the present invention, the process of fusing the weighted voice feature information and the weighted text vector to obtain the fusion feature corresponding to the voice file is shown in FIG. 3 and may specifically include:
S301: acquiring the frame speech feature of each frame of speech in the voice feature information;
S302: applying a preset attention mechanism to obtain, by weighting, the word-level speech feature corresponding to each word, based on the word feature of that word and the frame speech features;
S303: splicing the word-level speech feature corresponding to each word with the 300-dimensional vector of that word to obtain the fusion feature corresponding to the voice file.
In the invention, an attention mechanism is used to dynamically learn the weights between each word's text feature and the feature of each frame of speech; the speech feature aligned to each word is then obtained by weighted summation; the aligned features and the text features are spliced and fused by the BiLSTM; and finally the max pooling layer and the fully connected layer are used for emotion classification.
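Tying the above together, the sketch below concatenates the word-aligned speech features (for example, from align_speech_to_words above) with the word text features, fuses them with a BiLSTM, and applies max pooling and a fully connected layer for emotion classification; all dimensions and the class count are assumptions.

```python
# Illustrative sketch of the final stage: splice aligned speech features with word text features,
# fuse with a BiLSTM, then apply max pooling and a fully connected layer for emotion classification.
# Feature dimensions and the number of emotion classes are assumptions.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, n_emotions=4):
        super().__init__()
        self.fusion_bilstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_emotions)

    def forward(self, aligned_speech, word_feats):                # both: (batch, n_words, feat_dim)
        fused = torch.cat([aligned_speech, word_feats], dim=-1)   # splice per word
        fused, _ = self.fusion_bilstm(fused)                      # feature fusion
        pooled, _ = fused.max(dim=1)                              # max pooling over words
        return self.fc(pooled)                                    # emotion logits per voice file

# Usage (with the earlier sketches):
#   aligned = align_speech_to_words(word_feats, frame_feats)
#   logits  = FusionClassifier()(aligned, word_feats)
```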
The specific implementation procedures and derivatives thereof of the above embodiments are within the scope of the present invention.
Corresponding to the method described in fig. 1, an embodiment of the present invention further provides a speech emotion recognition apparatus, which is used for specifically implementing the method in fig. 1, where the speech emotion recognition apparatus provided in the embodiment of the present invention may be applied to a computer terminal or various mobile devices, and a schematic structural diagram of the speech emotion recognition apparatus is shown in fig. 4, and specifically includes:
an obtaining unit 401 configured to obtain a voice file;
a first processing unit 402, configured to pre-process the voice file, and obtain voice feature information corresponding to the voice file;
a conversion unit 403, configured to start a preset text processing tool, convert the voice file into text information, and generate a text vector corresponding to the text information;
a second processing unit 404, configured to perform weighting processing on the speech feature information and the text vector, so as to obtain weighted speech feature information and weighted text vector;
a feature fusion unit 405, configured to fuse the weighted speech feature information and the weighted text vector to obtain a fusion feature corresponding to the speech file;
and the analysis unit 406 is configured to input the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis, so as to obtain the emotion type corresponding to the voice file.
In the device provided in the embodiment of the present invention, a voice file is obtained and preprocessed to obtain the corresponding voice feature information; the voice file is also converted into text information, and the text information is converted into a text vector. The voice feature information and the text vector are weighted and then fused, and the fused features are analyzed by the max pooling layer and the fully connected layer, which can take into account, for example, context, word meaning and speaking rate, to determine the emotion type corresponding to the voice file.
By applying the device provided by the embodiment of the invention, the speech emotion can be recognized by combining text information besides the speech characteristics, and the recognition precision of the speech emotion is improved.
In the apparatus provided in the embodiment of the present invention, the first processing unit 402 includes:
a first obtaining subunit, configured to obtain an MFCC feature in the voice file;
and the first processing subunit is used for processing the MFCC features by applying the BiLSTM to obtain the voice feature information corresponding to the voice file.
In the apparatus provided in the embodiment of the present invention, the conversion unit 403 includes:
the first conversion subunit is used for starting the text processing tool and converting the voice file into initial text information;
and the data cleaning subunit is used for performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
In the apparatus provided in the embodiment of the present invention, the conversion unit 403 includes:
the second processing subunit is used for performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
the second conversion subunit is used for performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and the input subunit is used for inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
In the apparatus provided in the embodiment of the present invention, the feature fusion unit 405 includes:
the second acquiring subunit is used for acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
the weighting subunit is used for weighting and obtaining the word speech characteristics of the speech corresponding to each word based on the word characteristics of each word and the frame speech characteristics by applying a preset attention mechanism;
and the splicing subunit is used for splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
The specific working processes of each unit and sub-unit in the speech emotion recognition apparatus disclosed in the above embodiment of the present invention can refer to the corresponding contents in the speech emotion recognition method disclosed in the above embodiment of the present invention, and are not described herein again.
It should be noted that the speech emotion recognition method and device provided by the invention can be applied to the field of cloud computing or the field of finance. The foregoing is merely an example, and does not limit the application field of the speech emotion recognition method and apparatus provided by the present invention.
The speech emotion recognition method and the speech emotion recognition device can be used in the financial field or other fields, for example, can be used in speech service application scenes in the financial field. Other fields are any fields other than the financial field, for example, the cloud computing field. The foregoing is only an example, and does not limit the application field of the speech emotion recognition method and apparatus provided by the present invention.
The embodiment of the invention also provides a storage medium, which comprises a stored instruction, wherein when the instruction runs, the equipment where the storage medium is located is controlled to execute the speech emotion recognition method.
An electronic device is provided in an embodiment of the present invention, and the structural diagram of the electronic device is shown in fig. 5, which specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501, and are configured to be executed by one or more processors 503 to perform the following operations according to the one or more instructions 502:
acquiring a voice file;
preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information;
weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file;
and inputting the fusion feature into a preset max pooling layer and a fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments, which are substantially similar to the method embodiments, are described in a relatively simple manner, and reference may be made to some descriptions of the method embodiments for relevant points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
To clearly illustrate this interchangeability of hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A speech emotion recognition method is characterized by comprising the following steps:
acquiring a voice file;
preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information;
weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
fusing the weighted voice characteristic information and the weighted text vector to obtain fusion characteristics corresponding to the voice file;
inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
2. The method according to claim 1, wherein the preprocessing the voice file to obtain the voice feature information corresponding to the voice file includes:
obtaining MFCC features in the voice file;
and processing the MFCC features by using a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
3. The method of claim 1, wherein the enabling of a pre-configured text processing tool to convert the voice file into text information comprises:
enabling the text processing tool to convert the voice file into initial text information;
and performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
4. The method according to claim 2, wherein the generating a text vector corresponding to the text information comprises:
performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK, and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
5. The method according to claim 4, wherein the fusing the weighted speech feature information with the weighted text vector to obtain the corresponding fused feature of the speech file comprises:
acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
weighting and obtaining the word voice characteristics of the voice corresponding to each word by applying a preset attention mechanism based on the word characteristics of each word and the frame voice characteristics;
and splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
6. A speech emotion recognition apparatus, comprising:
an acquisition unit configured to acquire a voice file;
the first processing unit is used for preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
the conversion unit is used for starting a preset text processing tool, converting the voice file into text information and generating a text vector corresponding to the text information;
the second processing unit is used for carrying out weighting processing on the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
the feature fusion unit is used for fusing the weighted voice feature information and the weighted text vector to obtain fusion features corresponding to the voice file;
and the analysis unit is used for inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
7. The apparatus of claim 6, wherein the first processing unit comprises:
a first obtaining subunit, configured to obtain an MFCC feature in the voice file;
and the first processing subunit is used for processing the MFCC features by applying a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
8. The apparatus of claim 6, wherein the conversion unit comprises:
the first conversion subunit is used for starting the text processing tool and converting the voice file into initial text information;
and the data cleaning subunit is used for performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
9. The apparatus of claim 7, wherein the conversion unit comprises:
the second processing subunit is used for performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
the second conversion subunit is used for performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and the input subunit is used for inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
10. The apparatus of claim 9, wherein the feature fusion unit comprises:
the second acquiring subunit is used for acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
the weighting subunit is used for weighting and obtaining the word speech characteristics of the speech corresponding to each word based on the word characteristics of each word and the frame speech characteristics by applying a preset attention mechanism;
and the splicing subunit is used for splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
CN202210908406.4A 2022-07-29 2022-07-29 Speech emotion recognition method and device Pending CN115273907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210908406.4A CN115273907A (en) 2022-07-29 2022-07-29 Speech emotion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210908406.4A CN115273907A (en) 2022-07-29 2022-07-29 Speech emotion recognition method and device

Publications (1)

Publication Number Publication Date
CN115273907A (en) 2022-11-01

Family

ID=83770565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210908406.4A Pending CN115273907A (en) 2022-07-29 2022-07-29 Speech emotion recognition method and device

Country Status (1)

Country Link
CN (1) CN115273907A (en)

Similar Documents

Publication Publication Date Title
EP3582119B1 (en) Spoken language understanding system and method using recurrent neural networks
CN111312245B (en) Voice response method, device and storage medium
WO2021000497A1 (en) Retrieval method and apparatus, and computer device and storage medium
CN112259089B (en) Speech recognition method and device
CN111522916B (en) Voice service quality detection method, model training method and device
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN110223134B (en) Product recommendation method based on voice recognition and related equipment
CN115690553A (en) Emotion analysis method and system based on multi-modal dialog content combined modeling
CN110890097A (en) Voice processing method and device, computer storage medium and electronic equipment
CN114005446A (en) Emotion analysis method, related equipment and readable storage medium
CN115935182A (en) Model training method, topic segmentation method in multi-turn conversation, medium, and device
CN115687565A (en) Text generation method and device
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
CN117892237B (en) Multi-modal dialogue emotion recognition method and system based on hypergraph neural network
CN114239607A (en) Conversation reply method and device
CN117853175A (en) User evaluation information prediction method and device and electronic equipment
CN115273907A (en) Speech emotion recognition method and device
CN114519094A (en) Method and device for conversational recommendation based on random state and electronic equipment
CN114067842A (en) Customer satisfaction degree identification method and device, storage medium and electronic equipment
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN111414468A (en) Method and device for selecting dialect and electronic equipment
CN115169367B (en) Dialogue generating method and device, and storage medium
CN118377909B (en) Customer label determining method and device based on call content and storage medium
CN117409780B (en) AI digital human voice interaction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination