CN113362858B - Voice emotion classification method, device, equipment and medium - Google Patents

Voice emotion classification method, device, equipment and medium

Info

Publication number
CN113362858B
CN113362858B (application CN202110850075.9A)
Authority
CN
China
Prior art keywords
result
target
voice recognition
bert model
preprocessing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110850075.9A
Other languages
Chinese (zh)
Other versions
CN113362858A (en)
Inventor
刘广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110850075.9A priority Critical patent/CN113362858B/en
Publication of CN113362858A publication Critical patent/CN113362858A/en
Application granted granted Critical
Publication of CN113362858B publication Critical patent/CN113362858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a voice emotion classification method, a device, computer equipment and a storage medium, relating to artificial intelligence technology. Voice data to be recognized is first obtained and voice recognition is performed to obtain a voice recognition result; a target voice recognition sub-result selected from the voice data to be recognized is then preprocessed according to a character preprocessing strategy to obtain a preprocessing result; features are extracted from the preprocessing result by a target BERT model to obtain a final vector expression result; finally, the final vector expression result is input into a pre-trained emotion classification model for operation to obtain the corresponding emotion classification result. The method realizes feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of speakers, weights the features at both the neuron and the vector granularity so that feature fusion is finer-grained, and finally obtains a more accurate emotion recognition result.

Description

Voice emotion classification method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence speech semantic technology, and in particular, to a speech emotion classification method, apparatus, computer device, and storage medium.
Background
Emotion recognition is an important branch of the field of artificial intelligence, especially in conversational scenarios. During a conversation, a speaker receives emotional influence on the one hand from the other speaker, which tends to change the speaker's emotion, and on the other hand from the speaker themself, which tends to preserve the speaker's emotion. To model these two classes of emotional influence, existing methods use two recurrent-neural-network-based model structures, the "flat" structure and the "hierarchical" structure.
However, 1) existing methods are all based on recurrent neural networks and do not exploit a powerful pre-trained BERT model; 2) the flat model concatenates the emotion expressions of different speakers into the same time sequence and therefore cannot distinguish between speakers; 3) the hierarchical model connects the emotion expressions of the same speaker in series in the same time sequence at its branch layer, but the emotional influences of different speakers are still mixed in the same time sequence at the trunk layer and cannot be distinguished.
Disclosure of Invention
The embodiments of the invention provide a voice emotion classification method, a device, computer equipment and a storage medium, and aim to solve the problem that, with existing models, emotion recognition results for dialogue in a multi-person conversation scene are inaccurate.
In a first aspect, an embodiment of the present invention provides a method for classifying speech emotion, including:
responding to a voice emotion classification instruction, acquiring voice data to be recognized according to the voice emotion classification instruction, and performing voice recognition to obtain a voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged according to time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data;
acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
preprocessing a target voice recognition sub-result selected in the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and extracting features of the preprocessing result through the target BERT model to obtain a final vector expression result; and
and calling a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result.
In a second aspect, an embodiment of the present invention provides a speech emotion classification device, including:
the speaker recognition unit is used for responding to the voice emotion classification instruction if the voice data to be recognized sent by the user side or other servers are detected, acquiring the voice data to be recognized according to the voice emotion classification instruction and performing voice recognition to obtain a voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged according to time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data;
The target model selection unit is used for acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
the final vector acquisition unit is used for preprocessing a target voice recognition sub-result selected in the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and extracting features of the preprocessing result through the target BERT model to obtain a final vector expression result; and
and the emotion classification unit is used for calling a pre-trained emotion classification model, inputting the final vector expression result into the emotion classification model for operation, and obtaining a corresponding emotion classification result.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the method for classifying speech emotion according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium which stores a computer program that, when executed by a processor, causes the processor to perform the method for classifying speech emotion according to the first aspect.
The embodiments of the invention provide a voice emotion classification method, a device, computer equipment and a storage medium. Voice data to be recognized is first obtained and voice recognition is performed to obtain a voice recognition result; a target voice recognition sub-result selected from the voice data to be recognized is then preprocessed according to a character preprocessing strategy to obtain a preprocessing result; features are extracted from the preprocessing result by a target BERT model to obtain a final vector expression result; finally, the final vector expression result is input into a pre-trained emotion classification model for operation to obtain the corresponding emotion classification result. The method realizes feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of speakers, weights the features at both the neuron and the vector granularity so that feature fusion is finer-grained, and finally obtains a more accurate emotion recognition result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario of a speech emotion classification method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for classifying speech emotion according to an embodiment of the present invention;
FIG. 2a is a schematic diagram of a flat BERT model of a speech emotion classification method according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of a hierarchical BERT model in a speech emotion classification method according to an embodiment of the present invention;
FIG. 2c is a schematic diagram of a space-time BERT model in a speech emotion classification method according to an embodiment of the present invention;
FIG. 2d is a schematic sub-flowchart of a method for classifying speech emotion according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a speech emotion classification device according to an embodiment of the present invention;
fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic diagram of an application scenario of a voice emotion classification method according to an embodiment of the present invention; fig. 2 is a schematic flow chart of a voice emotion classification method according to an embodiment of the present invention, where the voice emotion classification method is applied to a server, and the method is executed by application software installed in the server.
As shown in fig. 2, the method includes steps S101 to S106.
S101, responding to a voice emotion classification instruction, acquiring voice data to be recognized according to the voice emotion classification instruction, and performing voice recognition to obtain a voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data.
In this embodiment, in order to understand the technical solution of the present application more clearly, the execution subject concerned is described in detail below. The technical solution of the present application is described with the server as the execution subject.
The voice data to be recognized may be collected on site at a user terminal during a multi-person conversation, or a plurality of user terminals may participate in the same online video conference, in which case the background server receives the voice data to be recognized collected and uploaded by the plurality of user terminals during the multi-person conversation in the same video-conference scene. After the server receives the voice data to be recognized, speaker recognition can be performed on the voice data to be recognized.
A speaker recognition model is stored in the server to recognize the speakers of the voice data to be recognized uploaded by the user side, so as to obtain a voice recognition result; and a pre-trained BERT model set is stored in the server to perform speaker-context-based emotion recognition on the voice recognition result.
In specific implementation, speaker recognition (Speaker Recognition, abbreviated as SR) technology is also called voiceprint recognition (Voiceprint Recognition, abbreviated as VPR) technology, and mainly adopts MFCC features (Mel-frequency cepstral coefficients) and a GMM (Gaussian mixture model) framework. Through this speaker recognition technology, speaker recognition can be effectively performed on the voice data to be recognized, giving a voice recognition result corresponding to the voice data to be recognized; the voice recognition result comprises a plurality of voice recognition sub-results arranged in ascending chronological order, and each voice recognition sub-result corresponds to one speaker and the corresponding speaking content data.
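As an illustration only, the following sketch shows how MFCC features and per-speaker GMMs could be combined to attribute a speech segment to a speaker. It is a minimal sketch assuming librosa and scikit-learn are available, not the patented implementation; the enrollment recordings per speaker and the segmentation step are assumed to exist elsewhere.

    # Minimal MFCC + GMM speaker scoring sketch (assumed libraries: librosa, scikit-learn).
    import librosa
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_gmm(wav_paths, n_components=16):
        """Fit one GMM on the MFCC frames of a speaker's enrollment audio."""
        feats = []
        for path in wav_paths:
            y, sr = librosa.load(path, sr=16000)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, frames)
            feats.append(mfcc.T)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(np.vstack(feats))
        return gmm

    def identify_speaker(segment, sr, speaker_gmms):
        """Return the speaker whose GMM gives the highest average log-likelihood."""
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13).T
        return max(speaker_gmms, key=lambda spk: speaker_gmms[spk].score(mfcc))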
For example, a voice recognition result corresponding to a section of voice data to be recognized is denoted U, where U = {u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2}. Specifically, the speech recognition result U comprises 8 dialogues arranged in ascending chronological order: the first dialogue is spoken by speaker 1 (denoted u_1^1, where the subscript 1 indicates chronological position 1, the superscript 1 indicates the speaker number identification of speaker 1, and u_1^1 as a whole represents the speaking content of speaker 1), the second dialogue is spoken by speaker 2 (denoted u_2^2), the third dialogue is spoken by speaker 1 (denoted u_3^1), the fourth dialogue is spoken by speaker 1 (denoted u_4^1), the fifth dialogue is spoken by speaker 3 (denoted u_5^3), the sixth dialogue is spoken by speaker 2 (denoted u_6^2), the seventh dialogue is spoken by speaker 1 (denoted u_7^1), and the eighth dialogue is spoken by speaker 2 (denoted u_8^2). Through the above speaker recognition technology, the speaking content of each speaker in the multi-person conversation is effectively distinguished.
S102, acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model.
In this embodiment, a pre-trained BERT model set is stored locally on the server, where the BERT model set includes at least a flat BERT model, a hierarchical BERT model, and a spatiotemporal BERT model. When the server performs BERT model selection, one of the flat BERT model, the hierarchical BERT model and the space-time BERT model is randomly selected. The three models can be used for effectively extracting the effective vector expression result in the voice recognition result so as to carry out the follow-up accurate emotion recognition. If the target BERT model is a flat BERT model, the character preprocessing strategy is a first character preprocessing strategy; if the target BERT model is a hierarchical BERT model, the character preprocessing strategy is a second character preprocessing strategy; and if the target BERT model is a space-time BERT model, the character preprocessing strategy is a third character preprocessing strategy.
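A minimal sketch of how the random selection of a target BERT model and its paired character preprocessing strategy could be organized on the server; all names in the registry below are assumptions made for illustration, not identifiers from the patent.

    import random

    # Hypothetical registry pairing each pre-trained BERT variant with the
    # character preprocessing strategy the description associates with it.
    BERT_MODEL_SET = {
        "flat":           {"model": "flat_bert", "preprocess": "first_strategy"},   # mixed context
        "hierarchical":   {"model": "hier_bert", "preprocess": "second_strategy"},  # intra context per predecessor
        "spatiotemporal": {"model": "st_bert",   "preprocess": "third_strategy"},   # standard + intra context
    }

    def pick_target_bert_model():
        """Randomly select one of the three variants, as described for step S102."""
        name = random.choice(list(BERT_MODEL_SET))
        return name, BERT_MODEL_SET[name]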
S103, preprocessing the target voice recognition sub-result selected in the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and extracting features of the preprocessing result through the target BERT model to obtain a final vector expression result.
In this embodiment, when a target speech recognition sub-result is arbitrarily selected from the speech data to be recognized, for example when the speech recognition sub-result u_7^1 is selected from the above speech recognition result U as the target speech recognition sub-result, the target speech recognition sub-result can be preprocessed according to the corresponding character preprocessing strategy to obtain a preprocessing result, and the preprocessing result is finally input into the target BERT model for feature extraction to obtain a final vector expression result. That is, when different target BERT models are used for feature extraction, the target speech recognition sub-result is first preprocessed with the corresponding character preprocessing strategy to obtain a preprocessing result, which increases the information dimension of the target speech recognition sub-result and makes feature extraction more accurate.
In one embodiment, as shown in fig. 2d, step S103 includes:
S1031, acquiring any BERT model in a pre-trained BERT model set as a target BERT model; wherein the BERT model set at least comprises a flat BERT model, a hierarchical BERT model and a space-time BERT model;
s1032, when the target BERT model is determined to be a flat BERT model, preprocessing a first target voice recognition sub-result selected from the voice recognition results according to a pre-stored first character preprocessing strategy to obtain a first preprocessing result, and extracting features of the first preprocessing result through the target BERT model to obtain a final vector expression result; the first character preprocessing strategy is used for adding a mixed context sequence to a first target voice recognition sub-result;
s1033, when the target BERT model is determined to be a hierarchical BERT model, preprocessing a second target voice recognition sub-result selected from the voice recognition results according to a second character preprocessing strategy stored in advance to obtain a second preprocessing result, and extracting features of the second preprocessing result through the target BERT model to obtain a final vector expression result; the second character preprocessing strategy is used for acquiring a precursor result of a second target voice recognition sub-result and respectively adding an internal context sequence into each voice recognition sub-result and the second target voice recognition sub-result in the precursor result;
S1034, when the target BERT model is determined to be a space-time BERT model, preprocessing a third target voice recognition sub-result selected from the voice recognition results according to a pre-stored third character preprocessing strategy to obtain a third preprocessing result, and extracting features of the third preprocessing result through the target BERT model to obtain a final vector expression result; the third character preprocessing strategy is used for respectively adding a standard context sequence and an internal context sequence into a third target voice recognition sub-result.
In this embodiment, the flat BERT model corresponds to a BERT model with a flat structure, and the final vector expression result corresponding to the first target speech recognition sub-result can be obtained by processing the first target speech recognition sub-result selected from the speech recognition results into an input variable and directly inputting the input variable into the BERT model for operation, and the model structure diagram of the specific flat BERT model is shown in fig. 2a. The final vector expression result is the extraction of the most effective features in the speech recognition sub-result, and can provide effective input features for subsequent emotion recognition.
In one embodiment, step S1032 includes:
acquiring a mixed context sequence from the voice recognition result according to a preset context window size value and a selected first target voice recognition sub-result;
Splicing the first target voice recognition sub-result and the mixed context sequence into a first sequence according to a preset first splicing strategy;
and inputting the first sequence into a flat BERT model for operation to obtain a corresponding first vector expression result, and taking the first vector expression result as a final vector expression result corresponding to the first target voice recognition sub-result.
In this embodiment, in order to more clearly understand the following technical solutions, three context sequences involved in extracting the speech recognition result U are described in detail below:
The mixed context sequence (conv-context), denoted ψ: for example, ψ_7 = {u_2^2, u_3^1, u_4^1, u_5^3, u_6^2} represents the mixed context sequence extracted from the speech recognition result U with u_7^1 as the first target speech recognition sub-result and 5 as the preset context window size value K. When extracting the mixed context sequence, the window is pushed 5 positions forward directly from the first target speech recognition sub-result as the starting point, giving ψ_7 = {u_2^2, u_3^1, u_4^1, u_5^3, u_6^2}. The mixed context sequence is obtained by pushing forward in reverse chronological order by the preset context window size value, without distinguishing speakers.
The standard context sequence (inter-context), denoted Φ: for example, Φ_7 = {u_2^2, u_5^3, u_6^2} represents the standard context sequence extracted from the speech recognition result U with u_7^1 as the first target speech recognition sub-result and 5 as the preset context window size value K. The window is pushed 5 positions forward from the first target speech recognition sub-result as the starting point to obtain an initial sequence, and all speech recognition sub-results of the same speaker as the first target speech recognition sub-result are removed, giving the standard context sequence Φ_7 = {u_2^2, u_5^3, u_6^2}.
The internal context sequence (intra-context): for example, {u_3^1, u_4^1} is the internal context sequence extracted from the speech recognition result U with u_7^1 as the first target speech recognition sub-result and 5 as the preset context window size value K. The window is pushed 5 positions forward from the first target speech recognition sub-result as the starting point to obtain an initial sequence, and the speech recognition sub-results of all speakers different from the first target speech recognition sub-result are removed, giving the internal context sequence {u_3^1, u_4^1}.
And acquiring a mixed context sequence from the voice recognition result according to a preset context window size value and the selected first target voice recognition sub-result, namely acquiring voice recognition sub-results with the same number as the context window size value from the voice recognition result in reverse order by taking the selected first target voice recognition sub-result as a starting point to form the mixed context sequence.
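The following sketch illustrates the three context-sequence extractions with an assumed utterance data structure, reproducing the example around u_7^1 with K = 5; it is a minimal sketch, not the patented implementation.

    from dataclasses import dataclass

    @dataclass
    class Utterance:
        order: int       # chronological index (1-based)
        speaker: int     # speaker number identification
        text: str        # speaking content data

    def context_sequences(result, target_idx, K=5):
        """result: utterances in ascending chronological order; target_idx: 0-based index of the target."""
        target = result[target_idx]
        window = result[max(0, target_idx - K):target_idx]             # the K preceding utterances
        mixed = window                                                  # conv-context
        standard = [u for u in window if u.speaker != target.speaker]   # inter-context
        internal = [u for u in window if u.speaker == target.speaker]   # intra-context
        return mixed, standard, internal

    # With U = [u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2] and target u_7^1:
    # mixed -> [u_2^2, u_3^1, u_4^1, u_5^3, u_6^2], standard -> [u_2^2, u_5^3, u_6^2],
    # internal -> [u_3^1, u_4^1]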
In an embodiment, the splicing the first target voice recognition sub-result and the mixed context sequence into the first sequence according to a preset first splicing policy includes:
the characters included in the first target voice recognition sub-result are coded through double bytes to obtain a corresponding first coding result, and a pre-stored first word embedding vector is spliced at the tail end of the first coding result to obtain a first processing result;
the characters included in the mixed context sequence are coded through double bytes to obtain a corresponding second coding result, and a second processing result is obtained by splicing a pre-stored second class word embedded vector at the tail end of the second coding result;
adding a first preset character string before the first processing result, adding a second preset character string between the first processing result and the second processing result, and adding a second preset character string after the second processing result to obtain a first initial time sequence;
and splicing the corresponding position embedded vectors at the tail of each character in the first initial time sequence to obtain a first time sequence.
In this embodiment, for the flat BERT model the goal is to predict the emotion of the i-th speech recognition sub-result, and the input is structured as:
X_i = ([CLS], u_i, [SEP], ψ_i, [SEP])
where u_i = (w_1, ..., w_T) represents an expression sequence comprising T words, ψ_i represents the mixed context sequence comprising the words of the K preceding speech recognition sub-results, and K is the preset context window size value. After splicing and converting into embeddings, the sequence is input into the BERT model to obtain the vector representation: r_i = BERT(X_i).
When the target BERT model is determined to be a flat BERT model, the input of the flat BERT model involves the following key points: 1) the first target voice recognition sub-result u_i (which can also be understood as the selected target expression) and the mixed context sequence ψ_i are spliced into one time sequence; 2) the [CLS] special character (where [CLS] is the first preset string) is added at the head of the time sequence to explicitly mark the output position; 3) the [SEP] special character (where [SEP] is the second preset string) is used to distinguish the target expression from the mixed context sequence; 4) all characters are converted to WordPiece embeddings; 5) type-A embeddings (i.e. the pre-stored first-class word embedding vectors) are spliced for the first target voice recognition sub-result and type-B embeddings (i.e. the pre-stored second-class word embedding vectors) are spliced for the mixed context sequence, in order to enhance the distinction between the two pieces of text; 6) a position embedding is spliced for each character to preserve the position information of the time sequence. The first time sequence constructed in this way supports longer time-sequence modeling and mines a deeper network structure.
The first time sequence is then input into the flat BERT model for feature extraction, and the output at the [CLS] position of the last BERT layer is taken as the vector expression of the whole time sequence, i.e. the first vector expression result.
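A minimal sketch of the flat-model input construction described above, using the HuggingFace tokenizer's text-pair mode so that [CLS]/[SEP] insertion, WordPiece conversion, type-A/type-B segment embeddings and position embeddings follow the standard BERT convention; the model name and the plain concatenation of the mixed context are assumptions, not details from the patent.

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")

    def flat_vector(target_text, mixed_context_texts):
        # Concatenate the mixed context utterances in chronological order (assumed joining scheme).
        context = "".join(mixed_context_texts)
        # Text-pair encoding adds [CLS]/[SEP] and assigns type-A / type-B segment ids.
        inputs = tokenizer(target_text, context, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = bert(**inputs)
        return outputs.last_hidden_state[:, 0]  # [CLS] output of the last layer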
In this embodiment, when the target BERT model is determined to be a hierarchical BERT model, the hierarchical BERT model corresponds to a BERT model with a multi-layer structure that at least includes a BERT layer and a Transformer layer. The second target speech recognition sub-result selected from the speech recognition results and the screened speech recognition sub-results are preprocessed and then respectively input into the BERT model of the BERT layer for operation, so that second vector expression results corresponding to the second target speech recognition sub-result and the screened speech recognition sub-results can be obtained, and the second vector expression result set formed by these second vector expression results is used as the final vector expression result corresponding to the second target speech recognition sub-result; the model structure diagram of the specific hierarchical BERT model is shown in fig. 2b. The final vector expression result is the extraction of the most effective features in the speech recognition sub-result, and can provide effective input features for subsequent emotion recognition.
In one embodiment, step S1033 includes:
and according to a preset context window size value, acquiring voice recognition sub-results with the number equal to that of the context window size value in the voice recognition result in a reverse order from the second target voice recognition sub-result which is selected as a starting point to form a target voice recognition sub-result set, preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to a pre-stored second character preprocessing strategy to obtain a second preprocessing result, and extracting features of the second preprocessing result through the target BERT model to obtain a final vector expression result.
And sequentially inputting the preprocessing results respectively corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set to a BERT layer and a Transformer layer in a target BERT model for feature extraction to obtain a second vector expression result corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set.
When extracting the second vector expression result corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set, the preprocessing results corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set are input to the BERT layer and the Transformer layer of the target BERT model for feature extraction. Through the extraction by these two layers of models, the obtained second vector expression result is weighted at both the neuron and the vector granularity, so that finer-grained features are fused and the features gain a hierarchical character.
In an embodiment, the preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to a pre-stored second character preprocessing strategy to obtain a second preprocessing result, and extracting features of the second preprocessing result through the target BERT model to obtain a final vector expression result, including:
acquiring an ith target voice recognition sub-result in the target voice recognition sub-results set; wherein, the initial value of i is 1;
acquiring an ith internal context sequence from the voice recognition result according to a preset context window size value and an ith target voice recognition sub-result;
splicing the ith target voice recognition sub-result and the ith internal context sequence into an ith sub-time sequence according to a preset second splicing strategy;
updating the value of i by increasing i by 1, and judging whether the value of i exceeds the size value of the context window; if the i value does not exceed the size value of the context window, returning to the step of acquiring the i-th target voice recognition sub-result in the target voice recognition sub-result set;
if the i value exceeds the size value of the context window, sequentially acquiring the 1 st sub-time sequence to the i-1 st sub-time sequence;
Splicing the second target voice recognition sub-result and the corresponding target internal context sequence into an ith sub-time sequence according to the second splicing strategy;
respectively inputting the 1 st sub-time sequence to the i th sub-time sequence to a BERT layer in a target BERT model for feature extraction to obtain second vector initial expression results respectively corresponding to the 1 st sub-time sequence to the i th sub-time sequence;
splicing the second vector initial expression results respectively corresponding to the 1 st sub-time sequence to the i th sub-time sequence to obtain a first splicing result;
and inputting the first splicing result to a Transformer layer in the target BERT model to perform feature extraction, so as to obtain a second vector expression result.
In this embodiment, for example, with the preset context window size value K = 5, the speech recognition result U = {u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2} and the second target speech recognition sub-result u_7^1, the target speech recognition sub-result set is {u_2^2, u_3^1, u_4^1, u_5^3, u_6^2}. The 1st target speech recognition sub-result in the target speech recognition sub-result set is u_2^2, and its corresponding 1st internal context sequence is the empty set; similarly, the 2nd target speech recognition sub-result is u_3^1, and its corresponding 2nd internal context sequence is {u_1^1}; the 3rd target speech recognition sub-result is u_4^1, and its corresponding 3rd internal context sequence is {u_1^1, u_3^1}; the 4th target speech recognition sub-result is u_5^3, and its corresponding 4th internal context sequence is the empty set; the 5th target speech recognition sub-result is u_6^2, and its corresponding 5th internal context sequence is {u_2^2}.
When the ith target voice recognition sub-result and the ith internal context sequence are spliced into the ith sub-time sequence according to a preset second splicing strategy, the method specifically comprises the following steps: the characters included in the ith target voice recognition sub-result are subjected to double-byte coding to obtain a corresponding ith group of first sub-coding results, and a pre-stored first-class word embedding vector is spliced at the tail end of the ith group of first sub-coding results to obtain an ith group of first processing results; the characters included in the ith internal context sequence are coded through double bytes to obtain a corresponding ith group of second sub-coding results, and a pre-stored second-class word embedding vector is spliced at the tail end of the ith group of second sub-coding results to obtain an ith group of second processing results; [CLS] characters are added before the ith group of first sub-coding results, [SEP] characters are added between the ith group of first sub-coding results and the ith group of second processing results, and [SEP] characters are added after the ith group of second processing results to obtain an ith group of initial time sequences; and the corresponding position embedding vectors are spliced at the tail of each character in the ith group of initial time sequences to obtain the ith sub-time sequence. Compared with the flat structure, the hierarchical structure of this hierarchical BERT model, which improves on the hierarchical recurrent neural network, can effectively distinguish speakers.
The 1st to 6th sub-time sequences are obtained in turn and input to the BERT layer in the target BERT model for feature extraction, giving a second vector initial expression result that carries the speaker's own context at each moment: inputting the 1st sub-time sequence into the BERT layer of the target BERT model yields the initial expression result corresponding to u_2^2; inputting the 2nd sub-time sequence yields the initial expression result corresponding to u_3^1; inputting the 3rd sub-time sequence yields the initial expression result corresponding to u_4^1; inputting the 4th sub-time sequence yields the initial expression result corresponding to u_5^3; inputting the 5th sub-time sequence yields the initial expression result corresponding to u_6^2; and inputting the 6th sub-time sequence yields the initial expression result corresponding to u_7^1. After the 6 second vector initial expression results are obtained, they are spliced in ascending order of their subscripts to obtain the first splicing result. Finally, the first splicing result is input to the Transformer layer in the target BERT model for feature extraction (the number of layers of the Transformer layer is 6), obtaining the second vector expression result.
For the second vector expression result, the output at the position corresponding to the target speech recognition sub-result in the last layer of the encoder part of the Transformer layer is taken as the vector expression ultimately used for emotion classification.
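A minimal sketch of the hierarchical flow, assuming a shared BERT encoder, a 6-layer Transformer encoder and a 768-dimensional vector: one BERT pass per sub-time sequence, then a Transformer pass over the stacked [CLS] vectors, taking the output at the target position. The class and parameter names are assumptions for illustration.

    import torch
    import torch.nn as nn

    class HierarchicalBert(nn.Module):
        def __init__(self, bert, d_f=768, n_layers=6, n_heads=8):
            super().__init__()
            self.bert = bert  # shared BERT layer
            layer = nn.TransformerEncoderLayer(d_model=d_f, nhead=n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

        def forward(self, sub_sequences):
            # sub_sequences: list of tokenized inputs, one per sub-time sequence,
            # ordered by ascending chronological index, with the target last.
            cls_vectors = [self.bert(**s).last_hidden_state[:, 0] for s in sub_sequences]
            stacked = torch.stack(cls_vectors, dim=1)   # (1, n_sub, d_f): the first splicing result
            encoded = self.encoder(stacked)             # Transformer layer
            return encoded[:, -1]                       # output at the target position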
In this embodiment, when it is determined that the target BERT model is a space-time BERT model, the space-time BERT model corresponds to a BERT model considered comprehensively from both the temporal and the spatial perspective. The third target speech recognition sub-result selected from the speech recognition results is processed into two input variables (one input variable is obtained by a splicing process based on the third target speech recognition sub-result and its corresponding current standard context sequence, and the other input variable is obtained by a splicing process based on the third target speech recognition sub-result and its corresponding current internal context sequence), the two input variables are then directly input into the BERT model for operation, and the respective operation results are subjected to fusion processing by a fusion model, so that the third vector expression result corresponding to the third target speech recognition sub-result is obtained as the final vector expression result; the model structure diagram of the specific space-time BERT model is shown in fig. 2c. Likewise, the final vector expression result is the extraction of the most effective features in the speech recognition sub-result, which can provide effective input features for subsequent emotion recognition.
In an embodiment, in step S1034, preprocessing the third target speech recognition sub-result selected from the speech recognition results according to a pre-stored third character preprocessing strategy to obtain a third preprocessed result, and extracting features of the third preprocessed result by using the target BERT model to obtain a final vector expression result, which includes:
and acquiring a current standard context sequence and a current internal context sequence respectively corresponding to the third target voice recognition sub-result in the voice recognition result, splicing the third target voice recognition sub-result with the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence respectively, and inputting the current first time sequence and the current second time sequence respectively to a BERT layer and a fusion model layer in the target BERT model for feature extraction to obtain a third vector expression result.
In this embodiment, when extracting the third vector expression result corresponding to the third target voice recognition sub-result, the current first time sequence and the current second time sequence obtained by processing the third target voice recognition sub-result are first input respectively to the BERT layer in the target BERT model from the temporal perspective, and the two resulting expressions are then fused from the spatial perspective (i.e., they are input to the fusion model layer in the target BERT model for fusion) to obtain the third vector expression result. From the temporal perspective, the emotional influence of the speakers can be distinguished; from the spatial perspective, the features can be weighted at both the neuron and the vector granularity, giving finer feature-fusion granularity.
In an embodiment, the obtaining the current standard context sequence and the current internal context sequence corresponding to the third target voice recognition sub-result in the voice recognition result respectively, splicing the third target voice recognition sub-result with the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence respectively, and inputting the first time sequence and the current second time sequence to a BERT layer and a fusion model layer in a target BERT model respectively to perform feature extraction to obtain a third vector expression result, where the method includes:
respectively acquiring a current standard context sequence and a current internal context sequence from the voice recognition result according to a preset context window size value and a selected third target voice recognition sub-result;
splicing the third target voice recognition sub-result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, and splicing the third target voice recognition sub-result and the current internal context sequence into a current second time sequence according to the third splicing strategy;
The current first time sequence is input to a BERT layer in a target BERT model to perform feature extraction to obtain a current first vector initial expression result, and the current second time sequence is input to the BERT layer in the target BERT model to perform feature extraction to obtain a current second vector initial expression result;
longitudinally splicing the current first vector initial expression result and the current second vector initial expression result to obtain a current splicing result;
and inputting the current splicing result to a fusion model layer in a target BERT model for fusion processing to obtain a third vector expression result corresponding to the third target voice recognition sub-result.
In this embodiment, a current standard context sequence is obtained from the speech recognition results according to a preset context window size value and the selected third target speech recognition sub-result, that is, the speech recognition sub-results with the same number as the context window size value are obtained from the speech recognition results in reverse order with the selected third target speech recognition sub-result as the starting point, and all the speech recognition sub-results with the same speaker as the third target speech recognition sub-result are removed to form the current standard context sequence.
And acquiring a current internal context sequence from the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result, namely acquiring voice recognition sub-results with the same number as the context window size value from the voice recognition result in reverse order by taking the selected third target voice recognition sub-result as a starting point, and removing voice recognition sub-results of all speakers which are different from the third target voice recognition sub-result to form the current internal context sequence.
And inputting the current first time sequence to a BERT layer in a target BERT model to perform feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence to the BERT layer in the target BERT model to perform feature extraction to obtain a current second vector initial expression result, wherein the output of the [ CLS ] position of the last BERT layer is used as the vector expression of the whole time sequence.
Splicing the third target voice recognition sub-result and the current standard context sequence into the current first time sequence according to the preset third splicing strategy specifically comprises the following steps: the characters included in the third target voice recognition sub-result are coded through double bytes to obtain a corresponding current first coding result, and a pre-stored first-class word embedding vector is spliced at the tail end of the current first coding result to obtain a current first processing result; the characters included in the current standard context sequence are coded through double bytes to obtain a corresponding current second coding result, and a pre-stored second-class word embedding vector is spliced at the tail end of the current second coding result to obtain a current second processing result; [CLS] characters are added before the current first processing result, [SEP] characters are added between the current first processing result and the current second processing result, and [SEP] characters are added after the current second processing result to obtain a current first initial time sequence; and the corresponding position embedding vectors are spliced at the tail of each character in the current first initial time sequence to obtain the current first time sequence. The third target voice recognition sub-result and the current internal context sequence are spliced into the current second time sequence according to the same third splicing strategy; the splicing acquisition process of the current second time sequence is the same as that of the current first time sequence.
For example, the third target speech recognition sub-result is u_7^1 and the preset context window size value is 5; then the current standard context sequence is Φ_7 = {u_2^2, u_5^3, u_6^2} and the current internal context sequence is {u_3^1, u_4^1}. The third target speech recognition sub-result u_7^1 and the current standard context sequence Φ_7 are spliced into the current first time sequence through the preset third splicing strategy, and the third target speech recognition sub-result u_7^1 and the current internal context sequence are spliced into the current second time sequence through the preset third splicing strategy. The current first time sequence is input to the BERT layer in the target BERT model for feature extraction to obtain the current first vector initial expression result r_inter, and the current second time sequence is input to the BERT layer in the target BERT model for feature extraction to obtain the current second vector initial expression result r_intra, where r_inter ∈ R^(d_f), r_intra ∈ R^(d_f), and d_f is the vector dimension; the current first vector initial expression result r_inter and the current second vector initial expression result r_intra are the representations of the two emotion influences obtained from the time dimension.
The current first vector initial expression result r_inter and the current second vector initial expression result r_intra are longitudinally spliced to obtain the current splicing result R = [r_inter; r_intra] ∈ R^(2×d_f). Finally, the current splicing result R is input to the fusion model layer in the target BERT model for fusion processing, for which a specific implementation can adopt a tensor operation, namely:
r_7 = RELU(W_b ⊙ (W_a R) + b)
where RELU() represents the linear rectification function, W_b assigns a neuron-level weight to each neuron (i.e. each neuron has a different weight), W_a assigns vector-level weights to the two row vectors of R (i.e. the neurons within one row vector share the same weight), and b represents a bias term. The current splicing result R is thus input to the fusion model layer of the target BERT model, and the tensor operation yields the third vector expression result.
S104, invoking a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result.
In this embodiment, the final vector expression result obtained in step S1032, step S1033 or step S1034 can be denoted r_i (in the specific examples given above it is r_7). A pre-trained emotion classification model is invoked, and the final vector expression result is input into the emotion classification model for operation, namely:
o_i = tanh(W_o r_i)
ŷ_i = softmax(W_y o_i)
where tanh() is the hyperbolic tangent function, W_o is the first weight corresponding to r_i, softmax() can be understood as a linear classifier, W_y is the second weight corresponding to o_i, and ŷ_i is the finally predicted emotion classification result.
The method realizes feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of speakers, weights the features at both the neuron and the vector granularity so that feature fusion is finer-grained, and finally obtains a more accurate emotion recognition result.
The embodiment of the invention also provides a voice emotion classification device which is used for executing any embodiment of the voice emotion classification method. Specifically, referring to fig. 3, fig. 3 is a schematic block diagram of a speech emotion classification device according to an embodiment of the present invention. The speech emotion classification device 100 may be configured in a server.
As shown in fig. 3, the speech emotion classification device 100 includes: speaker recognition unit 101, object model selection unit 102, final vector acquisition unit 103, emotion classification unit 104.
A speaker recognition unit 101, configured to respond to a voice emotion classification instruction, obtain voice data to be recognized according to the voice emotion classification instruction, and perform voice recognition to obtain a voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data.
In this embodiment, speaker recognition (Speaker Recognition, abbreviated as SR) technology is also called voiceprint recognition (Voiceprint Recognition, abbreviated as VPR) technology, and mainly adopts MFCC features (Mel-frequency cepstral coefficients) and a GMM (Gaussian mixture model) framework. Through this speaker recognition technology, speaker recognition can be effectively performed on the voice data to be recognized, giving a voice recognition result corresponding to the voice data to be recognized; the voice recognition result comprises a plurality of voice recognition sub-results arranged in ascending chronological order, and each voice recognition sub-result corresponds to one speaker and the corresponding speaking content data.
For example, a voice recognition result corresponding to a section of voice data to be recognized is denoted U, where U = {u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2}. Specifically, the speech recognition result U comprises 8 dialogues arranged in ascending chronological order: the first dialogue is spoken by speaker 1 (denoted u_1^1, where the subscript 1 indicates chronological position 1, the superscript 1 indicates the speaker number identification of speaker 1, and u_1^1 as a whole represents the speaking content of speaker 1), the second dialogue is spoken by speaker 2 (denoted u_2^2), the third dialogue is spoken by speaker 1 (denoted u_3^1), the fourth dialogue is spoken by speaker 1 (denoted u_4^1), the fifth dialogue is spoken by speaker 3 (denoted u_5^3), the sixth dialogue is spoken by speaker 2 (denoted u_6^2), the seventh dialogue is spoken by speaker 1 (denoted u_7^1), and the eighth dialogue is spoken by speaker 2 (denoted u_8^2). Through the above speaker recognition technology, the speaking content of each speaker in the multi-person conversation is effectively distinguished.
The target model selecting unit 102 is configured to obtain a pre-trained target BERT model and a character preprocessing policy corresponding to the target BERT model.
In this embodiment, a pre-trained BERT model set is pre-stored locally on the server, where the BERT model set includes at least a flat BERT model, a hierarchical BERT model, and a spatiotemporal BERT model. When the server performs BERT model selection, one of the flat BERT model, the hierarchical BERT model and the space-time BERT model is randomly selected. The three models can be used for effectively extracting the effective vector expression result in the voice recognition result so as to carry out the follow-up accurate emotion recognition. If the target BERT model is a flat BERT model, the character preprocessing strategy is a first character preprocessing strategy; if the target BERT model is a hierarchical BERT model, the character preprocessing strategy is a second character preprocessing strategy; and if the target BERT model is a space-time BERT model, the character preprocessing strategy is a third character preprocessing strategy.
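A minimal sketch of the random model selection and of pairing each BERT model with its character preprocessing strategy is given below; the function and dictionary names are illustrative assumptions, not the server's actual implementation.

```python
import random

# Illustrative stand-ins for the three character preprocessing strategies.
def first_char_preprocessing(target, result):   # adds a mixed context sequence
    ...

def second_char_preprocessing(target, result):  # adds an internal context per precursor sub-result
    ...

def third_char_preprocessing(target, result):   # adds standard and internal context sequences
    ...

STRATEGY_BY_MODEL = {
    "flat": first_char_preprocessing,
    "hierarchical": second_char_preprocessing,
    "spatiotemporal": third_char_preprocessing,
}

def select_target_model(bert_model_set):
    """bert_model_set maps 'flat'/'hierarchical'/'spatiotemporal' to a pre-trained BERT model."""
    kind = random.choice(sorted(bert_model_set))  # random selection among the stored models
    return bert_model_set[kind], STRATEGY_BY_MODEL[kind]
```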
And a final vector obtaining unit 103, configured to pre-process the target speech recognition sub-result selected in the speech data to be recognized according to the character pre-processing policy to obtain a pre-processing result, and perform feature extraction on the pre-processing result through the target BERT model to obtain a final vector expression result.
In this embodiment, when a target speech recognition sub-result is arbitrarily selected from the speech data to be recognized, for example when u_7^1 in the above-mentioned speech recognition result U is selected as the target voice recognition sub-result, the target voice recognition sub-result can be preprocessed according to the corresponding character preprocessing strategy to obtain a preprocessing result, and the preprocessing result is finally input into the target BERT model for feature extraction to obtain a final vector expression result. That is, when different target BERT models are adopted for feature extraction, the corresponding character preprocessing strategy is adopted to preprocess the target voice recognition sub-result beforehand to obtain the preprocessing result, which increases the information dimension of the target voice recognition sub-result and makes feature extraction more accurate.
In one embodiment, as shown in fig. 3, the final vector obtaining unit 103 includes:
a target model obtaining unit 1031, configured to obtain any one BERT model in a pre-trained BERT model set as a target BERT model; wherein the BERT model set at least comprises a flat BERT model, a hierarchical BERT model and a space-time BERT model;
a first model processing unit 1032, configured to, when it is determined that the target BERT model is a flat BERT model, perform preprocessing on a first target speech recognition sub-result selected from the speech recognition results according to a first character preprocessing policy stored in advance to obtain a first preprocessed result, and perform feature extraction on the first preprocessed result by using the target BERT model to obtain a final vector expression result; the first character preprocessing strategy is used for adding a mixed context sequence to a first target voice recognition sub-result;
a second model processing unit 1033, configured to, when determining that the target BERT model is a hierarchical BERT model, perform preprocessing on a second target speech recognition sub-result selected from the speech recognition results according to a second character preprocessing policy stored in advance to obtain a second preprocessed result, and perform feature extraction on the second preprocessed result by using the target BERT model to obtain a final vector expression result; the second character preprocessing strategy is used for acquiring a precursor result of a second target voice recognition sub-result and respectively adding an internal context sequence into each voice recognition sub-result and the second target voice recognition sub-result in the precursor result;
A third model processing unit 1034, configured to, when determining that the target BERT model is a space-time BERT model, pre-process a selected third target speech recognition sub-result in the speech recognition results according to a pre-stored third character pre-processing policy to obtain a third pre-processing result, and perform feature extraction on the third pre-processing result by using the target BERT model to obtain a final vector expression result; the third character preprocessing strategy is used for respectively adding a standard context sequence and an internal context sequence into a third target voice recognition sub-result.
In this embodiment, the flat BERT model corresponds to a BERT model with a flat structure, and the final vector expression result corresponding to the first target speech recognition sub-result can be obtained by processing the first target speech recognition sub-result selected from the speech recognition results into an input variable and directly inputting the input variable into the BERT model for operation, and the model structure diagram of the specific flat BERT model is shown in fig. 2a. The final vector expression result is the extraction of the most effective features in the speech recognition sub-result, and can provide effective input features for subsequent emotion recognition.
In one embodiment, the first model processing unit 1032 includes:
The mixed context sequence acquisition unit is used for acquiring a mixed context sequence from the voice recognition result according to a preset context window size value and the selected first target voice recognition sub-result;
the first time sequence acquisition unit is used for splicing the first target voice recognition sub-result and the mixed context sequence into a first time sequence according to a preset first splicing strategy;
the first operation unit is used for inputting the first time sequence into a flat BERT model for operation to obtain a corresponding first vector expression result, and the first vector expression result is used as a final vector expression result corresponding to the first target voice recognition sub-result.
In this embodiment, in order to more clearly understand the following technical solutions, three context sequences involved in extracting the speech recognition result U are described in detail below:
Mixed context sequences (i.e., conv-context), denoted by ψ: for example, ψ(u_7^1, 5) represents extracting a mixed context sequence from the speech recognition result U with u_7^1 as the first target voice recognition sub-result and 5 as the preset context window size value K. When extracting the mixed context sequence, 5 bits are pushed forward directly with the first target voice recognition sub-result as the starting point, obtaining {u_2^2, u_3^1, u_4^1, u_5^3, u_6^2}; that is, the mixed context sequence is obtained by pushing forward in reverse chronological order according to the preset context window size value without distinguishing speakers.
Standard context sequences (i.e., inter-context), denoted by φ: for example, φ(u_7^1, 5) represents extracting a standard context sequence from the speech recognition result U with u_7^1 as the first target voice recognition sub-result and 5 as the preset context window size value K. Taking the first target voice recognition sub-result as the starting point, 5 bits are pushed forward to obtain an initial sequence, and all voice recognition sub-results of the same speaker as the first target voice recognition sub-result are removed, obtaining the standard context sequence {u_2^2, u_5^3, u_6^2}.
Internal context sequences (i.e., intra-context), denoted intra: for example, intra(u_7^1, 5) represents extracting an internal context sequence from the speech recognition result U with u_7^1 as the first target voice recognition sub-result and 5 as the preset context window size value K. Taking the first target voice recognition sub-result as the starting point, 5 bits are pushed forward to obtain an initial sequence, and the voice recognition sub-results of all speakers different from the first target voice recognition sub-result are removed, obtaining the internal context sequence {u_3^1, u_4^1}.
And acquiring a mixed context sequence from the voice recognition result according to a preset context window size value and the selected first target voice recognition sub-result, namely acquiring voice recognition sub-results with the same number as the context window size value from the voice recognition result in reverse order by taking the selected first target voice recognition sub-result as a starting point to form the mixed context sequence.
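The three context sequences can be sketched as follows, treating the voice recognition result as a chronologically ordered list of (speaker, text) sub-results; the function names are illustrative assumptions. Applied to the 8-turn example U with u_7^1 as the target and K=5, the sketch reproduces the three sequences given above.

```python
def mixed_context(result, t, K):
    """Conv-context: the K sub-results immediately preceding the target, regardless of speaker."""
    return result[max(0, t - K):t]

def standard_context(result, t, K):
    """Inter-context: the preceding K sub-results with the target speaker's own turns removed."""
    spk = result[t][0]
    return [u for u in result[max(0, t - K):t] if u[0] != spk]

def internal_context(result, t, K):
    """Intra-context: the preceding K sub-results keeping only the target speaker's own turns."""
    spk = result[t][0]
    return [u for u in result[max(0, t - K):t] if u[0] == spk]

# The 8-turn example: speaker ids 1,2,1,1,3,2,1,2 in chronological order; target index 6 is u_7^1.
U = [(1, "u1"), (2, "u2"), (1, "u3"), (1, "u4"), (3, "u5"), (2, "u6"), (1, "u7"), (2, "u8")]
assert mixed_context(U, 6, 5)    == [(2, "u2"), (1, "u3"), (1, "u4"), (3, "u5"), (2, "u6")]
assert standard_context(U, 6, 5) == [(2, "u2"), (3, "u5"), (2, "u6")]
assert internal_context(U, 6, 5) == [(1, "u3"), (1, "u4")]
```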
In an embodiment, the first time sequence acquisition unit includes:
the first splicing unit is used for obtaining a corresponding first coding result by double-byte coding of characters included in the first target voice recognition sub-result, and splicing a pre-stored first word embedding vector at the tail end of the first coding result to obtain a first processing result;
the second splicing unit is used for obtaining a corresponding second coding result through double-byte coding on characters included in the mixed context sequence, and splicing a pre-stored second-class word embedded vector at the tail end of the second coding result to obtain a second processing result;
the third splicing unit is used for adding a first preset character string before the first processing result, adding a second preset character string between the first processing result and the second processing result, and adding a second preset character string after the second processing result to obtain a first initial time sequence;
And the fourth splicing unit is used for splicing the corresponding position embedded vector at the tail part of each character in the first initial time sequence to obtain the first time sequence.
In this embodiment, for the flat BERT model the goal is to predict the emotion of the i-th speech recognition sub-result u_i, and the input is structured as:
X_i = ([CLS], u_i, [SEP], ψ(u_i, K), [SEP]), where u_i represents an expression sequence comprising T words, ψ(u_i, K) represents the mixed context sequence comprising the words of the preceding sub-results, and K is the preset context window size value. After splicing and conversion into embeddings, the sequence is input into the BERT model to obtain the vector representation: r_i = BERT(X_i).
When the target BERT model is determined to be a flat BERT model, the input of the flat BERT model involves the following key points: 1) the first target voice recognition sub-result (which can also be understood as the selected target expression) and the mixed context sequence are spliced into one time sequence; 2) the [CLS] special character is added at the head of the time sequence to explicitly mark the output position; 3) the [SEP] special character is used to distinguish the target expression from the mixed context sequence; 4) all characters are converted to WordPiece embeddings; 5) the type A embedding (i.e. the pre-stored first word embedding vector) is spliced for the first target voice recognition sub-result and the type B embedding (i.e. the pre-stored second-class word embedding vector) is spliced for the mixed context sequence, in order to enhance the degree of distinction between the two segments of text; 6) the position embedding is spliced for each character, preserving the position information of the time sequence. The first time sequence constructed in the above manner supports longer time sequence modeling and allows a deeper network structure to be mined.
And then inputting the first time sequence into the flat BERT model for feature extraction, wherein the obtained first vector expression result is expressed by taking the output of the [ CLS ] position of the last layer of the BERT as the vector expression of the whole time sequence.
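A hedged sketch of the first splicing strategy and the flat BERT feature extraction is given below, using the Hugging Face transformers library as an assumed implementation: the tokenizer supplies the [CLS]/[SEP] characters, WordPiece conversion, type A/B segment embeddings and position embeddings described above, and the bert-base-chinese checkpoint is an assumption, not something prescribed by this embodiment.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")

def flat_vector(target_text, mixed_context_texts):
    # Segment A = target expression, segment B = spliced mixed context sequence.
    context = "".join(mixed_context_texts)
    enc = tokenizer(target_text, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    # Output at the [CLS] position of the last layer = vector expression of the whole sequence.
    return out.last_hidden_state[:, 0, :]
```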
In this embodiment, when the target BERT model is determined to be a hierarchical BERT model, the hierarchical BERT model corresponds to a BERT model with a multi-layer structure comprising at least a BERT layer and a Transformer layer. The second target speech recognition sub-result selected from the speech recognition results and the screened speech recognition sub-results are respectively input into the BERT model of the BERT layer for operation, and the second vector expression result set formed by the resulting second vector expression results is used as the final vector expression result corresponding to the second target speech recognition sub-result; the model structure diagram of the specific hierarchical BERT model is shown in fig. 2b. The final vector expression result is the extraction of the most effective features in the speech recognition sub-result and can provide effective input features for subsequent emotion recognition.
In an embodiment, the second model processing unit 1033 is further configured to:
And according to a preset context window size value, acquiring voice recognition sub-results with the number equal to that of the context window size value in the voice recognition result in a reverse order from the second target voice recognition sub-result which is selected as a starting point to form a target voice recognition sub-result set, preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to a pre-stored second character preprocessing strategy to obtain a second preprocessing result, and extracting features of the second preprocessing result through the target BERT model to obtain a final vector expression result.
And sequentially inputting the preprocessing results respectively corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set to a BERT layer and a Transformer layer in a target BERT model for feature extraction to obtain a second vector expression result corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set.
When extracting the second vector expression result corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set, the preprocessing results corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set are input into the BERT layer and the Transformer layer of the target BERT model for feature extraction; through this two-layer extraction, the obtained second vector expression result weights the two granularities of neurons and vectors, so finer-grained features are fused and the features have a sense of hierarchy.
In an embodiment, the second model processing unit 1033 is further configured to:
acquiring an ith target voice recognition sub-result in the target voice recognition sub-results set; wherein, the initial value of i is 1;
acquiring an ith internal context sequence from the voice recognition result according to a preset context window size value and an ith target voice recognition sub-result;
splicing the ith target voice recognition sub-result and the ith internal context sequence into an ith sub-time sequence according to a preset second splicing strategy;
updating the value of i by increasing i by 1, and judging whether the value of i exceeds the size value of the context window; if the i value does not exceed the size value of the context window, returning to the step of acquiring the i-th target voice recognition sub-result in the target voice recognition sub-result set;
if the i value exceeds the size value of the context window, sequentially acquiring the 1 st sub-time sequence to the i-1 st sub-time sequence;
splicing the second target voice recognition sub-result and the corresponding target internal context sequence into an ith sub-time sequence according to the second splicing strategy;
respectively inputting the 1 st sub-time sequence to the i th sub-time sequence to a BERT layer in a target BERT model for feature extraction to obtain second vector initial expression results respectively corresponding to the 1 st sub-time sequence to the i th sub-time sequence;
Splicing the second vector initial expression results respectively corresponding to the 1 st sub-time sequence to the i th sub-time sequence to obtain a first splicing result;
and inputting the first splicing result to a Transformer layer in the target BERT model for feature extraction to obtain a second vector expression result.
In this embodiment, for example, with the preset context window size value K=5, the speech recognition result U = {u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2} and the second target speech recognition sub-result u_7^1, the target voice recognition sub-result set is {u_2^2, u_3^1, u_4^1, u_5^3, u_6^2}. The 1st target speech recognition sub-result in the set is u_2^2 and its corresponding 1st internal context sequence is the empty set; similarly, the 2nd target speech recognition sub-result is u_3^1 and its corresponding 2nd internal context sequence is {u_1^1}; the 3rd target speech recognition sub-result is u_4^1 and its corresponding 3rd internal context sequence is {u_1^1, u_3^1}; the 4th target speech recognition sub-result is u_5^3 and its corresponding 4th internal context sequence is the empty set; the 5th target speech recognition sub-result is u_6^2 and its corresponding 5th internal context sequence is {u_2^2}.
When the ith target voice recognition sub-result and the ith internal context sequence are spliced into the ith sub-time sequence according to the preset second splicing strategy, the method specifically comprises the following steps: the characters included in the ith target voice recognition sub-result are double-byte encoded to obtain a corresponding ith group first sub-coding result, and the pre-stored first-class word embedding vector is spliced at the tail end of the ith group first sub-coding result to obtain an ith group first processing result; the characters included in the ith internal context sequence are double-byte encoded to obtain a corresponding ith group second sub-coding result, and the pre-stored second-class word embedding vector is spliced at the tail end of the ith group second sub-coding result to obtain an ith group second processing result; the [CLS] character is added before the ith group first processing result, the [SEP] character is added between the ith group first processing result and the ith group second processing result, and the [SEP] character is added after the ith group second processing result to obtain an ith group initial time sequence; and the corresponding position embedding vector is spliced at the tail of each character in the ith group initial time sequence to obtain the ith sub-time sequence. Compared with the flat structure, this hierarchical BERT model, an improvement on the hierarchical recurrent neural network, can effectively distinguish speakers through its hierarchical structure.
The 1st to 6th sub-time sequences are then acquired in turn and each is input into the BERT layer in the target BERT model for feature extraction, obtaining a second vector initial expression result that carries the speaker's own context at each moment: inputting the 1st sub-time sequence into the BERT layer yields the second vector initial expression result corresponding to u_2^2, the 2nd yields the result corresponding to u_3^1, the 3rd yields the result corresponding to u_4^1, the 4th yields the result corresponding to u_5^3, the 5th yields the result corresponding to u_6^2, and the 6th yields the result corresponding to the second target speech recognition sub-result u_7^1. After the 6 second vector initial expression results are obtained, they are spliced in ascending order of their indices to obtain the first splicing result. Finally, the first splicing result is input into the Transformer layer in the target BERT model for feature extraction (the number of elements input into the Transformer layer is 6) to obtain the second vector expression result.
The second vector expression result takes the output at the corresponding position of the last layer of the encoder part of the Transformer layer as the vector expression that is ultimately used for emotion classification.
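A minimal sketch of this hierarchical flow is given below, assuming Hugging Face transformers for the BERT layer and torch.nn for the Transformer layer; the checkpoint name, number of attention heads and the choice of the last position as the target position are assumptions of the sketch.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class HierarchicalBert(nn.Module):
    def __init__(self, hidden=768, layers=6):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, sub_sequences):
        # sub_sequences: list of tokenizer outputs, one per sub-time sequence (context turns + target).
        cls_vectors = []
        for enc in sub_sequences:
            out = self.bert(**enc)
            cls_vectors.append(out.last_hidden_state[:, 0, :])  # per-turn initial expression result
        stacked = torch.stack(cls_vectors, dim=1)               # splice in ascending index order
        fused = self.transformer(stacked)                       # Transformer layer over the turns
        return fused[:, -1, :]  # output at the target sub-result's position, used for classification
```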
In this embodiment, when the target BERT model is determined to be a space-time BERT model, the space-time BERT model corresponds to a BERT model considered comprehensively from both the temporal and the spatial perspective. Based on the third target speech recognition sub-result selected from the speech recognition results, two input variables are constructed (one input variable is obtained by splicing the third target speech recognition sub-result with its corresponding current standard context sequence, and the other is obtained by splicing the third target speech recognition sub-result with its corresponding current internal context sequence); both are then input directly into the BERT model for operation, and the respective operation results are fused by the fusion model layer to obtain the third vector expression result corresponding to the third target speech recognition sub-result as the final vector expression result. The model structure diagram of the specific space-time BERT model is shown in fig. 2c. Likewise, the final vector expression result is the extraction of the most effective features in the speech recognition sub-result and can provide effective input features for subsequent emotion recognition.
In an embodiment, the third model processing unit 1034 further includes:
and the second hierarchical extraction unit is used for acquiring a current standard context sequence and a current internal context sequence respectively corresponding to the third target voice recognition sub-result in the voice recognition result, splicing the third target voice recognition sub-result with the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence respectively, and respectively inputting the current first time sequence and the current second time sequence into the BERT layer and the fusion model layer in the target BERT model for feature extraction to obtain a third vector expression result.
In this embodiment, when extracting the third vector expression result corresponding to the third target voice recognition sub-result, the current first time sequence and the current second time sequence obtained by processing the third target voice recognition sub-result are first input separately into the BERT layer in the target BERT model from the temporal perspective, and the two outputs are then combined from the spatial perspective (i.e., they are input into the fusion model layer in the target BERT model for fusion) to obtain the third vector expression result. From the temporal perspective, the emotional influence of the speaker can be distinguished; from the spatial perspective, the features can be weighted at the two granularities of neurons and vectors, so the feature fusion granularity is finer.
In an embodiment, the third model processing unit 1034 is further configured to:
respectively acquiring a current standard context sequence and a current internal context sequence from the voice recognition result according to a preset context window size value and a selected third target voice recognition sub-result;
splicing the third target voice recognition sub-result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, and splicing the third target voice recognition sub-result and the current internal context sequence into a current second time sequence according to the third splicing strategy;
the current first time sequence is input to a BERT layer in a target BERT model to perform feature extraction to obtain a current first vector initial expression result, and the current second time sequence is input to the BERT layer in the target BERT model to perform feature extraction to obtain a current second vector initial expression result;
longitudinally splicing the current first vector initial expression result and the current second vector initial expression result to obtain a current splicing result;
and inputting the current splicing result to a fusion model layer in a target BERT model for fusion processing to obtain a third vector expression result corresponding to the third target voice recognition sub-result.
In this embodiment, a current standard context sequence is obtained from the speech recognition results according to a preset context window size value and the selected third target speech recognition sub-result, that is, the speech recognition sub-results with the same number as the context window size value are obtained from the speech recognition results in reverse order with the selected third target speech recognition sub-result as the starting point, and all the speech recognition sub-results with the same speaker as the third target speech recognition sub-result are removed to form the current standard context sequence.
And acquiring a current internal context sequence from the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result, namely acquiring voice recognition sub-results with the same number as the context window size value from the voice recognition result in reverse order by taking the selected third target voice recognition sub-result as a starting point, and removing voice recognition sub-results of all speakers which are different from the third target voice recognition sub-result to form the current internal context sequence.
And inputting the current first time sequence to a BERT layer in a target BERT model to perform feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence to the BERT layer in the target BERT model to perform feature extraction to obtain a current second vector initial expression result, wherein the output of the [ CLS ] position of the last BERT layer is used as the vector expression of the whole time sequence.
Splicing the third target voice recognition sub-result and the current standard context sequence into the current first time sequence according to the preset third splicing strategy specifically comprises the following steps: the characters included in the third target voice recognition sub-result are double-byte encoded to obtain a corresponding current first coding result, and the pre-stored first word embedding vector is spliced at the tail end of the current first coding result to obtain a current first processing result; the characters included in the current standard context sequence are double-byte encoded to obtain a corresponding current second coding result, and the pre-stored second-class word embedding vector is spliced at the tail end of the current second coding result to obtain a current second processing result; the [CLS] character is added before the current first processing result, the [SEP] character is added between the current first processing result and the current second processing result, and the [SEP] character is added after the current second processing result to obtain a current first initial time sequence; and the corresponding position embedding vector is spliced at the tail of each character in the current first initial time sequence to obtain the current first time sequence. The third target voice recognition sub-result and the current internal context sequence are spliced into the current second time sequence according to the preset third splicing strategy in the same manner as the current first time sequence is obtained.
For example, the third target speech recognition sub-result is u_7^1 and the preset context window size value is 5; the current standard context sequence is {u_2^2, u_5^3, u_6^2} and the current internal context sequence is {u_3^1, u_4^1}. The third target speech recognition sub-result u_7^1 and the current standard context sequence are spliced into the current first time sequence through the preset third splicing strategy, and u_7^1 and the current internal context sequence are spliced into the current second time sequence through the preset third splicing strategy. The current first time sequence is input into the BERT layer in the target BERT model for feature extraction to obtain the current first vector initial expression result, and the current second time sequence is input into the BERT layer in the target BERT model for feature extraction to obtain the current second vector initial expression result; both results have dimension d_f, where d_f is the vector dimension, and the current first vector initial expression result and the current second vector initial expression result are representations of two emotion influence vectors obtained from the time dimension.
The current first vector initial expression result and the current second vector initial expression result are spliced longitudinally to obtain the current splicing result, a matrix with two row vectors of dimension d_f. Finally, the current splicing result is input into the fusion model layer in the target BERT model for fusion processing; the specific implementation can adopt a tensor operation in which RELU() denotes the linear rectification function, W_b assigns a neuron-level weight to each neuron of the current splicing result (i.e. each neuron has a different weight), W_a assigns a vector-level weight to each of the two row vectors of the current splicing result (i.e. the neurons within a row vector share the same weight), and a bias term is added. The current splicing result is thus subjected to the tensor operation in the fusion model layer of the target BERT model to obtain the third vector expression result.
And the emotion classification unit 104 is configured to invoke a pre-trained emotion classification model, and input the final vector expression result to the emotion classification model for operation, so as to obtain a corresponding emotion classification result.
In this embodiment, the final vector expression result obtained by the first model processing unit 1032, the second model processing unit 1033 or the third model processing unit 1034 can be denoted by r_i (in the specific examples given above it is r_7). A pre-trained emotion classification model is invoked and the final vector expression result is input into the emotion classification model for operation, as follows:
The intermediate result o_i is obtained by applying the hyperbolic tangent function tanh() to r_i with the first weight W_o corresponding to r_i; softmax(), which can be understood as a linear classifier, is then applied to o_i with the second weight corresponding to o_i, and its output is the final predicted emotion classification result.
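A minimal sketch of the emotion classification model as described (a tanh projection followed by a softmax linear classifier) is given below; keeping the dimension of the intermediate result o_i equal to d_f is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """tanh projection of r_i followed by a softmax linear classifier over emotion labels."""
    def __init__(self, d_f: int, num_emotions: int):
        super().__init__()
        self.W_o = nn.Linear(d_f, d_f)              # first weight applied to r_i
        self.W_smax = nn.Linear(d_f, num_emotions)  # second weight applied to o_i

    def forward(self, r_i: torch.Tensor) -> torch.Tensor:
        o_i = torch.tanh(self.W_o(r_i))
        return torch.softmax(self.W_smax(o_i), dim=-1)  # predicted emotion distribution
```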
The device realizes feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of the speaker, weights the features at the two granularities of neurons and vectors with finer feature fusion granularity, and finally obtains a more accurate emotion recognition result.
The above-described speech emotion classification apparatus may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 is a server, and the server may be a stand-alone server or a server cluster formed by a plurality of servers.
With reference to FIG. 4, the computer device 500 includes a processor 502, a memory, and a network interface 505, connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a speech emotion classification method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a speech emotion classification method.
The network interface 505 is used for network communication, such as providing for transmission of data information, etc. It will be appreciated by those skilled in the art that the architecture shown in fig. 4 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting of the computer device 500 to which the present inventive arrangements may be implemented, and that a particular computer device 500 may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
The processor 502 is configured to execute a computer program 5032 stored in a memory, so as to implement the speech emotion classification method disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of the computer device shown in fig. 4 is not limiting of the specific construction of the computer device, and in other embodiments, the computer device may include more or less components than those shown, or certain components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor, and in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in fig. 4, and will not be described again.
It should be appreciated that in an embodiment of the invention, the processor 502 may be a central processing unit (Central Processing Unit, CPU); the processor 502 may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a nonvolatile computer readable storage medium or a volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program when executed by a processor implements the speech emotion classification method disclosed in the embodiment of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units is merely a logical function division, there may be another division manner in actual implementation, or units having the same function may be integrated into one unit, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units may be stored in a storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the technical solution of the present invention is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (9)

1. A method for classifying speech emotion, comprising:
responding to a voice emotion classification instruction, acquiring voice data to be recognized according to the voice emotion classification instruction, and performing voice recognition to obtain a voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged according to time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data;
acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
preprocessing a target voice recognition sub-result selected in the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and extracting features of the preprocessing result through the target BERT model to obtain a final vector expression result; and
Invoking a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result;
the target BERT model is any one of a flat BERT model, a hierarchical BERT model and a space-time BERT model; if the target BERT model is a flat BERT model, the character preprocessing strategy is a first character preprocessing strategy; if the target BERT model is a hierarchical BERT model, the character preprocessing strategy is a second character preprocessing strategy; if the target BERT model is a space-time BERT model, the character preprocessing strategy is a third character preprocessing strategy;
the preprocessing of the target voice recognition sub-result selected in the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and the feature extraction of the preprocessing result through the target BERT model to obtain a final vector expression result comprises the following steps:
any BERT model in a pre-trained BERT model set is obtained to serve as a target BERT model; wherein the BERT model set at least comprises a flat BERT model, a hierarchical BERT model and a space-time BERT model;
When the target BERT model is determined to be a flat BERT model, preprocessing a first target voice recognition sub-result selected from the voice recognition results according to a pre-stored first character preprocessing strategy to obtain a first preprocessing result, and extracting features of the first preprocessing result through the target BERT model to obtain a final vector expression result; the first character preprocessing strategy is used for adding a mixed context sequence to a first target voice recognition sub-result;
when the target BERT model is determined to be a hierarchical BERT model, preprocessing a second target voice recognition sub-result selected from the voice recognition results according to a second character preprocessing strategy stored in advance to obtain a second preprocessing result, and extracting features of the second preprocessing result through the target BERT model to obtain a final vector expression result; the second character preprocessing strategy is used for acquiring a precursor result of a second target voice recognition sub-result and respectively adding an internal context sequence into each voice recognition sub-result and the second target voice recognition sub-result in the precursor result;
when the target BERT model is determined to be a space-time BERT model, preprocessing a third target voice recognition sub-result selected from the voice recognition results according to a pre-stored third character preprocessing strategy to obtain a third preprocessing result, and extracting features of the third preprocessing result through the target BERT model to obtain a final vector expression result; the third character preprocessing strategy is used for respectively adding a standard context sequence and an internal context sequence into a third target voice recognition sub-result;
wherein the mixed context sequence is denoted by ψ; when the mixed context sequence is extracted, K bits are pushed forward in the voice recognition result U with the target voice recognition sub-result as the starting point to obtain the mixed context sequence; in U, the subscript i of a voice recognition sub-result denotes its time sequence order, the superscript denotes the speaker number identification, and the sub-result itself represents the speaking content of that speaker at the i-th time sequence order in U;
the standard context sequence is denoted by φ; when the standard context sequence is extracted, K bits are pushed forward in U with the target voice recognition sub-result as the starting point, and all speaking content of the same speaker as the target voice recognition sub-result is removed to obtain the standard context sequence;
the internal context sequence is denoted intra; when the internal context sequence is extracted, K bits are pushed forward in U with the target voice recognition sub-result as the starting point, and all speaking content of speakers other than the speaker of the target voice recognition sub-result is removed to obtain the internal context sequence.
2. The method for classifying speech emotion according to claim 1, wherein when the target BERT model is determined to be a flat BERT model, preprocessing a first target speech recognition sub-result selected from the speech recognition results according to a pre-stored first character preprocessing strategy to obtain a first preprocessing result, and extracting features of the first preprocessing result by the target BERT model to obtain a final vector expression result, wherein the method comprises:
Acquiring a mixed context sequence from the voice recognition result according to a preset context window size value and a selected first target voice recognition sub-result;
splicing the first target voice recognition sub-result and the mixed context sequence into a first sequence according to a preset first splicing strategy;
and inputting the first sequence into a flat BERT model for operation to obtain a corresponding first vector expression result, and taking the first vector expression result as a final vector expression result corresponding to the first target voice recognition sub-result.
3. The method of claim 2, wherein the splicing the first target speech recognition sub-result and the mixed context sequence into a first sequence according to a preset first splicing policy comprises:
the characters included in the first target voice recognition sub-result are coded through double bytes to obtain a corresponding first coding result, and a pre-stored first word embedding vector is spliced at the tail end of the first coding result to obtain a first processing result;
the characters included in the mixed context sequence are coded through double bytes to obtain a corresponding second coding result, and a second processing result is obtained by splicing a pre-stored second class word embedded vector at the tail end of the second coding result;
Adding a first preset character string before the first processing result, adding a second preset character string between the first processing result and the second processing result, and adding a second preset character string after the second processing result to obtain a first initial time sequence;
and splicing the corresponding position embedded vectors at the tail of each character in the first initial time sequence to obtain a first time sequence.
4. The method for classifying speech emotion according to claim 1, wherein preprocessing the second target speech recognition sub-result selected from the speech recognition results according to a second character preprocessing strategy stored in advance to obtain a second preprocessed result, and extracting features of the second preprocessed result by the target BERT model to obtain a final vector expression result, includes:
according to a preset context window size value, a target voice recognition sub-result set is formed by acquiring voice recognition sub-results with the number equal to that of the context window size value in the voice recognition result in a reverse order by taking the selected second target voice recognition sub-result as a starting point, the second target voice recognition sub-result and the target voice recognition sub-result set are preprocessed according to a pre-stored second character preprocessing strategy to obtain a second preprocessing result, and the target BERT model is used for extracting features of the second preprocessing result to obtain a final vector expression result;
The preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to a pre-stored second character preprocessing strategy to obtain a second preprocessing result, and extracting features of the second preprocessing result through the target BERT model to obtain a final vector expression result, including:
acquiring an ith target voice recognition sub-result in the target voice recognition sub-results set; wherein, the initial value of i is 1;
acquiring an ith internal context sequence from the voice recognition result according to a preset context window size value and an ith target voice recognition sub-result;
splicing the ith target voice recognition sub-result and the ith internal context sequence into an ith sub-time sequence according to a preset second splicing strategy;
updating the value of i by increasing i by 1, and judging whether the value of i exceeds the size value of the context window; if the i value does not exceed the size value of the context window, returning to the step of acquiring the i-th target voice recognition sub-result in the target voice recognition sub-result set;
if the i value exceeds the size value of the context window, sequentially acquiring the 1 st sub-time sequence to the i-1 st sub-time sequence;
Splicing the second target voice recognition sub-result and the corresponding target internal context sequence into an ith sub-time sequence according to the second splicing strategy;
respectively inputting the 1 st sub-time sequence to the i th sub-time sequence to a BERT layer in a target BERT model for feature extraction to obtain second vector initial expression results respectively corresponding to the 1 st sub-time sequence to the i th sub-time sequence;
splicing the second vector initial expression results respectively corresponding to the 1 st sub-time sequence to the i th sub-time sequence to obtain a first splicing result;
and inputting the first splicing result to a Transformer layer in the target BERT model for feature extraction to obtain a second vector expression result.
5. The method for classifying speech emotion according to claim 1, wherein preprocessing the selected third target speech recognition sub-result in the speech recognition results according to a pre-stored third character preprocessing strategy to obtain a third preprocessed result, and extracting features of the third preprocessed result by the target BERT model to obtain a final vector expression result, includes:
and acquiring a current standard context sequence and a current internal context sequence respectively corresponding to the third target voice recognition sub-result in the voice recognition result, respectively splicing the third target voice recognition sub-result with the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence, and respectively inputting the current first time sequence and the current second time sequence into a BERT layer and a fusion model layer in the target BERT model for feature extraction to obtain a third vector expression result.
6. The method for classifying speech emotion according to claim 5, wherein the obtaining a current standard context sequence and a current internal context sequence respectively corresponding to the third target speech recognition sub-result in the speech recognition result, splicing the third target speech recognition sub-result with the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence respectively, and inputting the current first time sequence and the current second time sequence into a BERT layer and a fusion model layer in the target BERT model respectively for feature extraction to obtain a third vector expression result, comprises:
respectively acquiring a current standard context sequence and a current internal context sequence from the voice recognition result according to a preset context window size value and a selected third target voice recognition sub-result;
splicing the third target voice recognition sub-result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, and splicing the third target voice recognition sub-result and the current internal context sequence into a current second time sequence according to the third splicing strategy;
The current first time sequence is input to a BERT layer in a target BERT model to perform feature extraction to obtain a current first vector initial expression result, and the current second time sequence is input to the BERT layer in the target BERT model to perform feature extraction to obtain a current second vector initial expression result;
longitudinally splicing the current first vector initial expression result and the current second vector initial expression result to obtain a current splicing result;
and inputting the current splicing result to a fusion model layer in a target BERT model for fusion processing to obtain a third vector expression result corresponding to the third target voice recognition sub-result.
7. A speech emotion classification device, comprising:
the speaker recognition unit is used for responding to the voice emotion classification instruction, acquiring voice data to be recognized according to the voice emotion classification instruction and performing voice recognition to obtain a voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged according to time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data;
the target model selection unit is used for acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
The final vector acquisition unit is used for preprocessing a target voice recognition sub-result selected in the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and extracting features of the preprocessing result through the target BERT model to obtain a final vector expression result; and
the emotion classification unit is used for calling a pre-trained emotion classification model, inputting the final vector expression result into the emotion classification model for operation, and obtaining a corresponding emotion classification result;
the target BERT model is any one of a flat BERT model, a hierarchical BERT model and a space-time BERT model; if the target BERT model is a flat BERT model, the character preprocessing strategy is a first character preprocessing strategy; if the target BERT model is a hierarchical BERT model, the character preprocessing strategy is a second character preprocessing strategy; if the target BERT model is a space-time BERT model, the character preprocessing strategy is a third character preprocessing strategy;
the final vector acquisition unit includes:
the target model acquisition unit is used for acquiring any BERT model in the pre-trained BERT model set as a target BERT model; wherein the BERT model set at least comprises a flat BERT model, a hierarchical BERT model and a space-time BERT model;
the first model processing unit is used for, when the target BERT model is determined to be a flat BERT model, preprocessing a first target voice recognition sub-result selected from the voice recognition result according to a pre-stored first character preprocessing strategy to obtain a first preprocessing result, and extracting features of the first preprocessing result through the target BERT model to obtain a final vector expression result; the first character preprocessing strategy is used for adding a mixed context sequence to the first target voice recognition sub-result;
the second model processing unit is used for, when the target BERT model is determined to be a hierarchical BERT model, preprocessing a second target voice recognition sub-result selected from the voice recognition result according to a pre-stored second character preprocessing strategy to obtain a second preprocessing result, and extracting features of the second preprocessing result through the target BERT model to obtain a final vector expression result; the second character preprocessing strategy is used for acquiring a precursor result of the second target voice recognition sub-result and adding an internal context sequence to each voice recognition sub-result in the precursor result and to the second target voice recognition sub-result, respectively;
the third model processing unit is used for, when the target BERT model is determined to be a space-time BERT model, preprocessing a third target voice recognition sub-result selected from the voice recognition result according to a pre-stored third character preprocessing strategy to obtain a third preprocessing result, and extracting features of the third preprocessing result through the target BERT model to obtain a final vector expression result; the third character preprocessing strategy is used for adding a standard context sequence and an internal context sequence to the third target voice recognition sub-result, respectively;
wherein, for the mixed context sequence: when the mixed context sequence is extracted, the position of u_i in U is taken as the starting point and the preceding K items are taken to obtain the mixed context sequence; U is the voice recognition result, i is the time sequence order, s_i is the speaker sequence number identification, and u_i represents the speaking content of the speaker s_i at the i-th time sequence position in U;
for the standard context sequence: when the standard context sequence is extracted, the position of u_i in U is taken as the starting point, the preceding K items are examined, and the speaking content belonging to the speaker s_i among them is taken to obtain the standard context sequence;
for the internal context sequence: when the internal context sequence is extracted, the position of u_i in U is taken as the starting point, the preceding K items are examined, and the speaking content of all speakers other than s_i among them is taken to obtain the internal context sequence.
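To make the three context definitions above concrete, the sketch below extracts the mixed, standard and internal context sequences from a voice recognition result U stored as (speaker, speaking content) pairs in ascending time order. The data layout, the function name context_sequences, and the reading of "the preceding K items" as the K utterances immediately before position i are illustrative assumptions rather than the claimed procedure.

```python
# Hedged sketch of extracting the mixed / standard / internal context sequences.
from typing import List, Tuple

Utterance = Tuple[str, str]  # (speaker sequence number identification s_i, speaking content u_i)

def context_sequences(U: List[Utterance], i: int, K: int):
    """Return (mixed, standard, internal) context sequences for the i-th sub-result."""
    window = U[max(0, i - K):i]      # the K items immediately preceding position i
    speaker_i = U[i][0]
    mixed = [u for _, u in window]                        # every speaker's content
    standard = [u for s, u in window if s == speaker_i]   # only speaker s_i's content
    internal = [u for s, u in window if s != speaker_i]   # content of all other speakers
    return mixed, standard, internal

U = [("A", "你好，请问有什么可以帮您"),
     ("B", "我的订单一直没有发货"),
     ("A", "抱歉，我马上帮您查询"),
     ("B", "已经等了一个星期了")]
print(context_sequences(U, i=3, K=2))
```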
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the speech emotion classification method of any one of claims 1 to 6.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the speech emotion classification method of any of claims 1 to 6.
CN202110850075.9A 2021-07-27 2021-07-27 Voice emotion classification method, device, equipment and medium Active CN113362858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110850075.9A CN113362858B (en) 2021-07-27 2021-07-27 Voice emotion classification method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110850075.9A CN113362858B (en) 2021-07-27 2021-07-27 Voice emotion classification method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113362858A CN113362858A (en) 2021-09-07
CN113362858B true CN113362858B (en) 2023-10-31

Family

ID=77540332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110850075.9A Active CN113362858B (en) 2021-07-27 2021-07-27 Voice emotion classification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113362858B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933795A (en) * 2019-03-19 2019-06-25 上海交通大学 Based on context-emotion term vector text emotion analysis system
CN110147452A (en) * 2019-05-17 2019-08-20 北京理工大学 A kind of coarseness sentiment analysis method based on level BERT neural network
CN110334210A (en) * 2019-05-30 2019-10-15 哈尔滨理工大学 A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN
CN110609899A (en) * 2019-08-29 2019-12-24 成都信息工程大学 Specific target emotion classification method based on improved BERT model
CN110647619A (en) * 2019-08-01 2020-01-03 中山大学 Common sense question-answering method based on question generation and convolutional neural network
CN111581966A (en) * 2020-04-30 2020-08-25 华南师范大学 Context feature fusion aspect level emotion classification method and device
CN112464657A (en) * 2020-12-07 2021-03-09 上海交通大学 Hybrid text abstract generation method, system, terminal and storage medium
WO2021139108A1 (en) * 2020-01-10 2021-07-15 平安科技(深圳)有限公司 Intelligent emotion recognition method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN113362858A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN109785824B (en) Training method and device of voice translation model
CN110223705B (en) Voice conversion method, device, equipment and readable storage medium
CN110444223B (en) Speaker separation method and device based on cyclic neural network and acoustic characteristics
KR102209689B1 (en) Apparatus and method for generating an acoustic model, Apparatus and method for speech recognition
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110570853A (en) Intention recognition method and device based on voice data
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
KR20220130565A (en) Keyword detection method and apparatus thereof
CN112863489B (en) Speech recognition method, apparatus, device and medium
US11380315B2 (en) Characterizing accuracy of ensemble models for automatic speech recognition by determining a predetermined number of multiple ASR engines based on their historical performance
CN113987179A (en) Knowledge enhancement and backtracking loss-based conversational emotion recognition network model, construction method, electronic device and storage medium
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
CN112735385A (en) Voice endpoint detection method and device, computer equipment and storage medium
CN113362858B (en) Voice emotion classification method, device, equipment and medium
CN113948090B (en) Voice detection method, session recording product and computer storage medium
CN112489651B (en) Voice recognition method, electronic device and storage device
CN113793599A (en) Training method of voice recognition model and voice recognition method and device
CN115713939A (en) Voice recognition method and device and electronic equipment
CN115273862A (en) Voice processing method, device, electronic equipment and medium
CN109767790A (en) A kind of speech-emotion recognition method and system
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
JP7291099B2 (en) Speech recognition method and device
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
WO2020185408A1 (en) Characterizing accuracy of ensemble models for automatic speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant