CN115273907A - Speech emotion recognition method and device - Google Patents

Speech emotion recognition method and device

Info

Publication number
CN115273907A
Authority
CN
China
Prior art keywords
voice
text
word
information
voice file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210908406.4A
Other languages
Chinese (zh)
Inventor
殷素素
汪兰叶
吕雨慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN202210908406.4A
Publication of CN115273907A
Legal status: Pending (Current)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech emotion recognition method and a speech emotion recognition apparatus, which can be applied to the field of big data or the field of finance. The method comprises the following steps: acquiring a voice file; preprocessing the voice file to obtain voice feature information corresponding to the voice file; starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information; weighting the voice feature information and the text vector to obtain weighted voice feature information and a weighted text vector; fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file; and inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file. By applying the provided method, speech emotion can be recognized by combining text information with the speech features, which improves the accuracy of speech emotion recognition.

Description

Speech emotion recognition method and device
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a speech emotion recognition method and device.
Background
With the development of artificial intelligence, affective computing has become increasingly important. Affective computing attempts to endow machines with the ability to perceive, understand and express emotions, making them more human-like. Speech, an important medium of human communication, carries a large amount of emotional information. Speech emotion recognition improves a machine's ability to understand the emotion in human speech and is therefore widely used in human-computer dialogue, making human-computer interaction more natural and harmonious.
Speech emotion recognition methods in the prior art classify emotions from speech features through a neural network. However, because these methods focus only on the acoustic aspect of the signal, they still cannot adequately capture the emotion expressed in the speech.
Disclosure of Invention
In view of the above, the present invention provides a speech emotion recognition method by which speech emotion can be recognized by combining text information with speech features, improving the accuracy of speech emotion recognition.
The invention also provides a speech emotion recognition apparatus to ensure that the method can be implemented and applied in practice.
A speech emotion recognition method includes:
acquiring a voice file;
preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information;
weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and a weighted text vector;
fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file;
inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
Optionally, in the method, the preprocessing the voice file to obtain the voice feature information corresponding to the voice file includes:
acquiring MFCC features in the voice file;
and processing the MFCC features by applying a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
Optionally, the method for converting the voice file into text information by using a preset text processing tool includes:
enabling the text processing tool to convert the voice file into initial text information;
and performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
Optionally, the above method, generating a text vector corresponding to the text information includes:
performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK, and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
Optionally, in the method, the fusing the weighted speech feature information with the weighted text vector to obtain a fusion feature corresponding to the speech file includes:
acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
applying a preset attention mechanism to obtain, by weighting, the word-level speech feature corresponding to each word, based on the word feature of that word and the frame speech features;
and splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
A speech emotion recognition apparatus comprising:
an acquisition unit configured to acquire a voice file;
the first processing unit is used for preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
the conversion unit is used for starting a preset text processing tool, converting the voice file into text information and generating a text vector corresponding to the text information;
the second processing unit is used for carrying out weighting processing on the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
the feature fusion unit is used for fusing the weighted voice feature information and the weighted text vector to obtain fusion features corresponding to the voice file;
and the analysis unit is used for inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
The above apparatus, optionally, the first processing unit includes:
a first obtaining subunit, configured to obtain an MFCC feature in the voice file;
and the first processing subunit is used for processing the MFCC features by applying a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
The above apparatus, optionally, the conversion unit includes:
the first conversion subunit is used for starting the text processing tool and converting the voice file into initial text information;
and the data cleaning subunit is used for performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
The above apparatus, optionally, the conversion unit includes:
the second processing subunit is used for performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
the second conversion subunit is used for performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and the input subunit is used for inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
The above apparatus, optionally, the feature fusion unit includes:
the second acquiring subunit is used for acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
the weighting subunit is used for weighting and obtaining the word voice characteristics of the voice corresponding to each word based on the word characteristics of each word and the frame voice characteristics by applying a preset attention mechanism;
and the splicing subunit is used for splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
A storage medium, the storage medium comprising stored instructions, wherein when the instructions are executed, a device on which the storage medium is located is controlled to execute the above-mentioned speech emotion recognition method.
An electronic device comprising a memory, one or more processors, and one or more instructions stored in the memory and configured to be executed by the one or more processors to perform the above speech emotion recognition method.
Compared with the prior art, the invention has the following advantages:
the invention provides a speech emotion recognition method, which comprises the following steps: acquiring a voice file; preprocessing the voice file to obtain voice characteristic information corresponding to the voice file; starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information; weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector; fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file; inputting the fusion characteristics into a preset maximum pooling layer and a preset full connection layer for emotion analysis, and obtaining the emotion types corresponding to the voice files. By applying the method provided by the invention, the speech emotion can be recognized by combining text information besides the speech characteristics, and the recognition precision of the speech emotion is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only embodiments of the present invention, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flowchart of a method for speech emotion recognition according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method of speech emotion recognition provided in the embodiments of the present invention;
FIG. 3 is a flowchart of yet another speech emotion recognition method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a speech emotion recognition apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In this application, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions, and the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The invention is operational with numerous general purpose or special purpose computing device environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multi-processor apparatus, distributed computing environments that include any of the above devices or equipment, and the like.
The embodiment of the invention provides a speech emotion recognition method which can be applied to various system platforms. The method may be executed by a computer terminal or by the processor of any of various mobile devices. A flowchart of the method is shown in FIG. 1, and the method specifically comprises the following steps:
s101: and acquiring a voice file.
In the present invention, the voice file contains voice data.
S102: and preprocessing the voice file to obtain voice characteristic information corresponding to the voice file.
Preprocessing the voice file comprises the following steps:
acquiring MFCC features from the voice file;
processing the MFCC features by applying a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
It should be noted that MFCCs (Mel-Frequency Cepstral Coefficients) are low-dimensional speech features. The low-dimensional frame-based MFCC features of the speech are obtained first, and a BiLSTM is then used to produce a high-dimensional frame-based feature representation, giving the voice feature information corresponding to the voice file.
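A minimal sketch of this preprocessing step is given below, assuming librosa for MFCC extraction and a PyTorch bidirectional LSTM; the sampling rate, the 40 MFCC coefficients, the 128 hidden units and the file name sample.wav are illustrative assumptions rather than values fixed by this disclosure.

```python
# Illustrative sketch: frame-based MFCC extraction followed by a BiLSTM encoder.
# Library choices (librosa, PyTorch) and all dimensions are assumptions, not part of the disclosure.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(wav_path, n_mfcc=40):
    """Load a voice file and return its frame-based MFCC features, shape (frames, n_mfcc)."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return torch.tensor(mfcc.T, dtype=torch.float32)              # (frames, n_mfcc)

class SpeechEncoder(nn.Module):
    """BiLSTM that lifts the low-dimensional frame MFCCs to a higher-dimensional representation."""
    def __init__(self, n_mfcc=40, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)

    def forward(self, mfcc):                    # mfcc: (batch, frames, n_mfcc)
        frame_features, _ = self.bilstm(mfcc)   # (batch, frames, 2 * hidden)
        return frame_features

# Usage: per-frame speech feature information for one voice file (hypothetical file name).
mfcc = extract_mfcc("sample.wav").unsqueeze(0)  # (1, frames, 40)
speech_features = SpeechEncoder()(mfcc)         # (1, frames, 256)
```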
S103: and starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information.
It should be noted that the text processing tool may be an Automatic Speech Recognition (ASR) technique.
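The disclosure does not name a specific ASR engine. Purely as an example, the open-source SpeechRecognition package could produce the initial text information from a voice file; the Google Web Speech backend and the file name are assumptions.

```python
# Illustrative only: obtaining the initial text information from a voice file with the
# SpeechRecognition package; the recognizer backend and the file name are assumptions.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:          # WAV/AIFF/FLAC input
    audio = recognizer.record(source)
initial_text = recognizer.recognize_google(audio)   # initial text information, may contain errors
```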
Further, when converting a voice file into text information, the following process may be performed:
enabling the text processing tool to convert the voice file into initial text information;
and performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
It should be noted that the initial text may contain conversion errors caused by unclear pronunciation, polyphonic characters and the like. Data cleaning corrects the text content of the initial text information and removes the invalid characters and stop words in it, giving the final text information.
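A hedged sketch of the data-cleaning step follows; the regular expression and the NLTK English stop-word list are illustrative choices, since the disclosure does not specify them.

```python
# Illustrative sketch of cleaning the initial text information: remove invalid characters
# and stop words. The regular expression and the stop-word source are assumptions.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_transcript(initial_text: str) -> str:
    """Return the cleaned text information corresponding to the voice file."""
    text = re.sub(r"[^\w\s]", " ", initial_text)            # drop invalid (non-word) characters
    tokens = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(tokens)                                  # stop words removed

cleaned = clean_transcript("Well, I am really not happy with this service!!!")
```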
S104: and performing weighting processing on the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector.
Specifically, an attention mechanism can be used to dynamically learn the weight of each word's text feature with respect to the feature of each frame of speech. Within a voice file, different frames carry different amounts of information, and some frames contain the key content. The invention therefore multiplies the feature of each frame of speech by the weight derived from the text features, which determines the importance of that frame; this is the weighting process. The weighted frame features and the text feature of each word are then added to obtain the speech-aligned feature of that word, the aligned features and the text features are concatenated to obtain the fused features, and the fused features are finally fed into the BiLSTM for further feature processing.
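One plausible reading of this weighting step is scaled dot-product attention between word text features and frame speech features, sketched below; the disclosure does not fix the attention formula, so this formulation is an assumption. The aligned features are then concatenated with the word text features and fused, as described in S105 and sketched after S303 below.

```python
# Illustrative sketch of the attention weighting between word text features and frame
# speech features; scaled dot-product attention is an assumption, not quoted from the disclosure.
import torch
import torch.nn.functional as F

def align_speech_to_words(word_feats, frame_feats):
    """
    word_feats:  (batch, n_words,  dim)  text feature of each word
    frame_feats: (batch, n_frames, dim)  speech feature of each frame
    Returns the speech feature aligned to each word, shape (batch, n_words, dim).
    """
    dim = word_feats.size(-1)
    scores = torch.matmul(word_feats, frame_feats.transpose(1, 2)) / dim ** 0.5  # (B, W, F)
    weights = F.softmax(scores, dim=-1)        # importance of every frame for every word
    return torch.matmul(weights, frame_feats)  # weighted sum of frame features per word
```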
S105: and fusing the weighted voice characteristic information and the weighted text vector to obtain fusion characteristics corresponding to the voice file.
The feature fusion can be performed through a BiLSTM.
S106: inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
It should be noted that the max pooling layer and the fully connected layer may be processing modules within the BiLSTM model.
In the method provided by the embodiment of the present invention, a voice file is obtained and preprocessed to obtain the corresponding voice feature information; at the same time, the voice file is converted into text information, and the text information is converted into a text vector. After the voice feature information and the text vector are weighted, they are fused, and the fused features are analyzed by the max pooling layer and the fully connected layer, which can take into account, for example, context, word meaning and speaking rate, to determine the emotion type corresponding to the voice file.
Further, the emotion number corresponding to the emotion type is output, and the emotion of the user associated with the voice file can be obtained from this number.
By applying the method provided by the embodiment of the invention, the speech emotion can be recognized by combining text information besides the speech characteristics, and the recognition precision of the speech emotion is improved.
In the method provided in the embodiment of the present invention, the process of generating the text vector corresponding to the text information is shown in fig. 2, and specifically may include:
s201: and performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information.
S202: and performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK, and converting each word into a corresponding 300-dimensional vector based on the part of speech of each word.
The 300-dimensional vector of each word carries additional contextual meaning between the words.
S203: inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
It should be noted that a BiLSTM (bidirectional LSTM) consists of two separate LSTMs combined together, one reading the sequence forward and the other backward.
In the present invention, text information is extracted from the speech file with high accuracy using automatic speech recognition (ASR) technology. The invention uses the processed text information as another modality for predicting the emotion category of a given signal. To use the text information, the speech transcript is tokenized and encoded into a token sequence using the Natural Language Toolkit (NLTK). Each token is then passed through a word-embedding layer that converts the word index into a corresponding 300-dimensional vector carrying additional contextual meaning between the words. The sequence of embedded tokens is fed into the text RNN, and finally the emotion class is predicted from the last hidden state of the text RNN using the softmax function.
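A sketch of this text branch is shown below, assuming the 300-dimensional embedding stated in the disclosure, a BiLSTM text RNN and a toy vocabulary; the vocabulary size, the four emotion classes and the example sentence are illustrative assumptions.

```python
# Illustrative sketch of the text branch: NLTK tokenization and POS tagging, a 300-dimensional
# word-embedding layer, a BiLSTM text RNN, and a softmax emotion prediction from its last state.
# Vocabulary size, the four emotion classes and the example sentence are assumptions.
import nltk
import torch
import torch.nn as nn

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

class TextEmotionRNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden=128, n_emotions=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)       # word index -> 300-d vector
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, token_ids):                                  # (batch, n_words)
        embedded = self.embedding(token_ids)                       # (batch, n_words, 300)
        outputs, _ = self.bilstm(embedded)
        last_hidden = outputs[:, -1, :]                            # last hidden state of the text RNN
        return torch.softmax(self.classifier(last_hidden), dim=-1) # emotion class probabilities

tokens = nltk.word_tokenize("i am very disappointed with this service")
tagged = nltk.pos_tag(tokens)                                      # (token, part-of-speech) pairs
vocab = {tok: idx for idx, (tok, _) in enumerate(tagged)}          # toy vocabulary for the example
token_ids = torch.tensor([[vocab[tok] for tok, _ in tagged]])
probs = TextEmotionRNN()(token_ids)                                # (1, 4) emotion distribution
```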
In the method provided in the embodiment of the present invention, the process of fusing the weighted voice feature information and the weighted text vector to obtain the fusion feature corresponding to the voice file is shown in FIG. 3 and may specifically include:
S301: acquiring the frame speech feature of each frame of speech in the voice feature information;
S302: applying a preset attention mechanism to obtain, by weighting, the word-level speech feature corresponding to each word, based on the word feature of that word and the frame speech features;
S303: splicing the word-level speech feature corresponding to each word with the 300-dimensional vector of that word to obtain the fusion feature corresponding to the voice file.
In the invention, an attention mechanism is used to dynamically learn the weights between each word's text feature and the feature of each frame of speech; the speech feature aligned to each word is then obtained by weighted summation; the aligned features and the text features are spliced and fused by the BiLSTM; and finally the max pooling layer and the fully connected layer are used for emotion classification.
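Tying the above together, the sketch below concatenates the word-aligned speech features (for example, from align_speech_to_words above) with the word text features, fuses them with a BiLSTM, and applies max pooling and a fully connected layer for emotion classification; all dimensions and the class count are assumptions.

```python
# Illustrative sketch of the final stage: splice aligned speech features with word text features,
# fuse with a BiLSTM, then apply max pooling and a fully connected layer for emotion classification.
# Feature dimensions and the number of emotion classes are assumptions.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, n_emotions=4):
        super().__init__()
        self.fusion_bilstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_emotions)

    def forward(self, aligned_speech, word_feats):                # both: (batch, n_words, feat_dim)
        fused = torch.cat([aligned_speech, word_feats], dim=-1)   # splice per word
        fused, _ = self.fusion_bilstm(fused)                      # feature fusion
        pooled, _ = fused.max(dim=1)                              # max pooling over words
        return self.fc(pooled)                                    # emotion logits per voice file

# Usage (with the earlier sketches):
#   aligned = align_speech_to_words(word_feats, frame_feats)
#   logits  = FusionClassifier()(aligned, word_feats)
```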
The specific implementation procedures and derivatives thereof of the above embodiments are within the scope of the present invention.
Corresponding to the method described in fig. 1, an embodiment of the present invention further provides a speech emotion recognition apparatus, which is used for specifically implementing the method in fig. 1, where the speech emotion recognition apparatus provided in the embodiment of the present invention may be applied to a computer terminal or various mobile devices, and a schematic structural diagram of the speech emotion recognition apparatus is shown in fig. 4, and specifically includes:
an obtaining unit 401 configured to obtain a voice file;
a first processing unit 402, configured to pre-process the voice file, and obtain voice feature information corresponding to the voice file;
a conversion unit 403, configured to start a preset text processing tool, convert the voice file into text information, and generate a text vector corresponding to the text information;
a second processing unit 404, configured to perform weighting processing on the speech feature information and the text vector, so as to obtain weighted speech feature information and weighted text vector;
a feature fusion unit 405, configured to fuse the weighted speech feature information and the weighted text vector to obtain a fusion feature corresponding to the speech file;
and the analysis unit 406 is configured to input the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis, so as to obtain the emotion type corresponding to the voice file.
In the device provided in the embodiment of the present invention, a voice file is obtained and preprocessed to obtain the corresponding voice feature information; the voice file is also converted into text information, and the text information is converted into a text vector. The voice feature information and the text vector are weighted and then fused, and the fused features are analyzed by the max pooling layer and the fully connected layer, which can take into account, for example, context, word meaning and speaking rate, to determine the emotion type corresponding to the voice file.
By applying the device provided by the embodiment of the invention, the speech emotion can be recognized by combining text information besides the speech characteristics, and the recognition precision of the speech emotion is improved.
In the apparatus provided in the embodiment of the present invention, the first processing unit 402 includes:
a first obtaining subunit, configured to obtain an MFCC feature in the voice file;
and the first processing subunit is used for processing the MFCC features by applying the BiLSTM to obtain the voice feature information corresponding to the voice file.
In the apparatus provided in the embodiment of the present invention, the conversion unit 403 includes:
the first conversion subunit is used for starting the text processing tool and converting the voice file into initial text information;
and the data cleaning subunit is used for performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
In the apparatus provided in the embodiment of the present invention, the conversion unit 403 includes:
the second processing subunit is used for performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
the second conversion subunit is used for performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and the input subunit is used for inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
In the apparatus provided in the embodiment of the present invention, the feature fusion unit 405 includes:
the second acquiring subunit is used for acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
the weighting subunit is used for weighting and obtaining the word speech characteristics of the speech corresponding to each word based on the word characteristics of each word and the frame speech characteristics by applying a preset attention mechanism;
and the splicing subunit is used for splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
The specific working processes of each unit and sub-unit in the speech emotion recognition apparatus disclosed in the above embodiment of the present invention can refer to the corresponding contents in the speech emotion recognition method disclosed in the above embodiment of the present invention, and are not described herein again.
It should be noted that the speech emotion recognition method and device provided by the invention can be applied to the field of cloud computing or the field of finance. The foregoing is merely an example, and does not limit the application field of the speech emotion recognition method and apparatus provided by the present invention.
The speech emotion recognition method and the speech emotion recognition device can be used in the financial field or other fields, for example, can be used in speech service application scenes in the financial field. Other fields are any fields other than the financial field, for example, the cloud computing field. The foregoing is only an example, and does not limit the application field of the speech emotion recognition method and apparatus provided by the present invention.
The embodiment of the invention also provides a storage medium, which comprises a stored instruction, wherein when the instruction runs, the equipment where the storage medium is located is controlled to execute the speech emotion recognition method.
An electronic device is provided in an embodiment of the present invention, and the structural diagram of the electronic device is shown in fig. 5, which specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501, and are configured to be executed by one or more processors 503 to perform the following operations according to the one or more instructions 502:
acquiring a voice file;
preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information;
weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
fusing the weighted voice feature information and the weighted text vector to obtain a fusion feature corresponding to the voice file;
and inputting the fusion feature into a preset max pooling layer and a fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments, which are substantially similar to the method embodiments, are described in a relatively simple manner, and reference may be made to some descriptions of the method embodiments for relevant points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
To clearly illustrate this interchangeability of hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A speech emotion recognition method is characterized by comprising the following steps:
acquiring a voice file;
preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
starting a preset text processing tool, converting the voice file into text information, and generating a text vector corresponding to the text information;
weighting the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
fusing the weighted voice characteristic information and the weighted text vector to obtain fusion characteristics corresponding to the voice file;
inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
2. The method according to claim 1, wherein the preprocessing the voice file to obtain the voice feature information corresponding to the voice file includes:
obtaining MFCC features in the voice file;
and processing the MFCC features by using a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
3. The method of claim 1, wherein the enabling of a pre-configured text processing tool to convert the voice file into text information comprises:
enabling the text processing tool to convert the voice file into initial text information;
and performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
4. The method according to claim 2, wherein the generating a text vector corresponding to the text information comprises:
performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK, and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
5. The method according to claim 4, wherein the fusing the weighted speech feature information with the weighted text vector to obtain the corresponding fused feature of the speech file comprises:
acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
weighting and obtaining the word voice characteristics of the voice corresponding to each word by applying a preset attention mechanism based on the word characteristics of each word and the frame voice characteristics;
and splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
6. A speech emotion recognition apparatus, comprising:
an acquisition unit configured to acquire a voice file;
the first processing unit is used for preprocessing the voice file to obtain voice characteristic information corresponding to the voice file;
the conversion unit is used for starting a preset text processing tool, converting the voice file into text information and generating a text vector corresponding to the text information;
the second processing unit is used for carrying out weighting processing on the voice characteristic information and the text vector to obtain weighted voice characteristic information and weighted text vector;
the feature fusion unit is used for fusing the weighted voice feature information and the weighted text vector to obtain fusion features corresponding to the voice file;
and the analysis unit is used for inputting the fusion feature into a preset max pooling layer and a preset fully connected layer for emotion analysis to obtain the emotion type corresponding to the voice file.
7. The apparatus of claim 6, wherein the first processing unit comprises:
a first obtaining subunit, configured to obtain an MFCC feature in the voice file;
and the first processing subunit is used for processing the MFCC features by applying a preset BiLSTM to obtain the voice feature information corresponding to the voice file.
8. The apparatus of claim 6, wherein the conversion unit comprises:
the first conversion subunit is used for starting the text processing tool and converting the voice file into initial text information;
and the data cleaning subunit is used for performing data cleaning on the initial text information, removing invalid characters and stop words in the initial text information, and obtaining text information corresponding to the voice file.
9. The apparatus of claim 7, wherein the conversion unit comprises:
the second processing subunit is used for performing word segmentation processing on the text information by using a preset word segmentation tool to obtain a plurality of words corresponding to the text information;
the second conversion subunit is used for performing part-of-speech tagging on each word by applying a preset natural language processing tool NLTK and converting each word into a corresponding 300-dimensional vector based on the part-of-speech of each word;
and the input subunit is used for inputting the 300-dimensional vector corresponding to each word into the BiLSTM to obtain the text vector corresponding to the text information output by the BiLSTM.
10. The apparatus of claim 9, wherein the feature fusion unit comprises:
the second acquiring subunit is used for acquiring frame voice characteristics of each frame of voice in the voice characteristic information;
the weighting subunit is used for weighting and obtaining the word speech characteristics of the speech corresponding to each word based on the word characteristics of each word and the frame speech characteristics by applying a preset attention mechanism;
and the splicing subunit is used for splicing the word voice feature of the voice corresponding to each word with the 300-dimensional vector corresponding to the word to obtain the fusion feature corresponding to the voice file.
CN202210908406.4A 2022-07-29 2022-07-29 Speech emotion recognition method and device Pending CN115273907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210908406.4A CN115273907A (en) 2022-07-29 2022-07-29 Speech emotion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210908406.4A CN115273907A (en) 2022-07-29 2022-07-29 Speech emotion recognition method and device

Publications (1)

Publication Number Publication Date
CN115273907A (en) 2022-11-01

Family

ID=83770565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210908406.4A Pending CN115273907A (en) 2022-07-29 2022-07-29 Speech emotion recognition method and device

Country Status (1)

Country Link
CN (1) CN115273907A (en)

Similar Documents

Publication Publication Date Title
EP3582119B1 (en) Spoken language understanding system and method using recurrent neural networks
CN111312245B (en) Voice response method, device and storage medium
WO2021000497A1 (en) Retrieval method and apparatus, and computer device and storage medium
CN112259089B (en) Speech recognition method and device
CN111522916B (en) Voice service quality detection method, model training method and device
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN110223134B (en) Product recommendation method based on voice recognition and related equipment
CN115690553A (en) Emotion analysis method and system based on multi-modal dialog content combined modeling
CN110890097A (en) Voice processing method and device, computer storage medium and electronic equipment
CN114005446A (en) Emotion analysis method, related equipment and readable storage medium
CN115935182A (en) Model training method, topic segmentation method in multi-turn conversation, medium, and device
CN115687565A (en) Text generation method and device
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
CN117892237B (en) Multi-modal dialogue emotion recognition method and system based on hypergraph neural network
CN114239607A (en) Conversation reply method and device
CN117853175A (en) User evaluation information prediction method and device and electronic equipment
CN115273907A (en) Speech emotion recognition method and device
CN114519094A (en) Method and device for conversational recommendation based on random state and electronic equipment
CN114067842A (en) Customer satisfaction degree identification method and device, storage medium and electronic equipment
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN111414468A (en) Method and device for selecting dialect and electronic equipment
CN115169367B (en) Dialogue generating method and device, and storage medium
CN118377909B (en) Customer label determining method and device based on call content and storage medium
CN117409780B (en) AI digital human voice interaction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination