CN113362858A - Voice emotion classification method, device, equipment and medium - Google Patents
- Publication number
- CN113362858A (application number CN202110850075.9A / CN202110850075A)
- Authority
- CN
- China
- Prior art keywords
- result
- target
- bert model
- preprocessing
- voice recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a speech emotion classification method, a speech emotion classification device, computer equipment and a storage medium, and relates to artificial intelligence technology. The method realizes feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of each speaker, weights features at two granularities (neuron-level and vector-level) so that the feature fusion granularity is finer, and finally obtains a more accurate emotion recognition result.
Description
Technical Field
The invention relates to the technical field of artificial intelligence voice semantics, in particular to a voice emotion classification method and device, computer equipment and a storage medium.
Background
Emotion recognition is an important branch of the field of artificial intelligence, especially in conversational scenes. During a conversation, a speaker receives emotional influence from other speakers, which tends to change the speaker's mood, and from the speaker's own previous mood, which tends to maintain it. To model these two types of emotional influence, existing methods use both "flat" and "hierarchical" model structures based on recurrent neural networks.
However: 1) existing methods are all based on recurrent neural networks and do not utilize a powerful pre-trained BERT model; 2) the "flat" model serially connects the emotional expressions of different speakers in the same time sequence and therefore cannot distinguish different speakers; 3) although the "hierarchical" model connects the emotional expressions of the same speaker in series in its branch layer, the emotional influences of different speakers are still mixed in the same time sequence of the main layer and cannot be distinguished.
Disclosure of Invention
The embodiment of the invention provides a speech emotion classification method, a speech emotion classification device, computer equipment and a storage medium, and aims to solve the problem that in the prior art, emotion recognition results of conversations in a multi-person conversation scene are inaccurate based on existing models.
In a first aspect, an embodiment of the present invention provides a speech emotion classification method, which includes:
responding to a voice emotion classification instruction, acquiring voice data to be recognized according to the voice emotion classification instruction, and performing voice recognition to obtain a voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data;
acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
preprocessing the target voice recognition sub-result selected from the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and performing feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result; and
and calling a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result.
In a second aspect, an embodiment of the present invention provides a speech emotion classification apparatus, which includes:
the speaker recognition unit is used for, if the speech data to be recognized sent by the user side or another server is detected, responding to the speech emotion classification instruction, acquiring the speech data to be recognized according to the speech emotion classification instruction and performing speech recognition to obtain a speech recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in ascending time order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data;
the target model selection unit is used for acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
a final vector obtaining unit, configured to preprocess the target speech recognition sub-result selected from the speech data to be recognized according to the character preprocessing policy to obtain a preprocessing result, and perform feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result; and
and the emotion classification unit is used for calling a pre-trained emotion classification model, inputting the final vector expression result to the emotion classification model for operation, and obtaining a corresponding emotion classification result.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the speech emotion classification method according to the first aspect when executing the computer program.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the speech emotion classification method according to the first aspect.
The embodiment of the invention provides a speech emotion classification method, a speech emotion classification device, computer equipment and a storage medium. Speech data to be recognized are obtained and subjected to speech recognition to obtain a speech recognition result; a target speech recognition sub-result selected from the speech data to be recognized is then preprocessed according to a character preprocessing strategy to obtain a preprocessing result; a target BERT model performs feature extraction on the preprocessing result to obtain a final vector expression result; and finally the final vector expression result is input to a pre-trained emotion classification model for operation to obtain a corresponding emotion classification result. The method realizes feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of each speaker, weights features at two granularities (neuron-level and vector-level) so that the feature fusion granularity is finer, and finally obtains a more accurate emotion recognition result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of an application scenario of a speech emotion classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a speech emotion classification method according to an embodiment of the present invention;
FIG. 2a is a model structure diagram of a flat BERT model of a speech emotion classification method according to an embodiment of the present invention;
FIG. 2b is a model structure diagram of a hierarchical BERT model in the speech emotion classification method according to the embodiment of the present invention;
FIG. 2c is a model structure diagram of a spatiotemporal BERT model in a speech emotion classification method according to an embodiment of the present invention;
FIG. 2d is a sub-flow diagram of a speech emotion classification method according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a speech emotion classification apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a speech emotion classification method according to an embodiment of the present invention; fig. 2 is a schematic flowchart of a speech emotion classification method according to an embodiment of the present invention, where the speech emotion classification method is applied to a server and is executed by application software installed in the server.
As shown in fig. 2, the method includes steps S101 to S106.
S101, responding to a speech emotion classification instruction, acquiring speech data to be recognized according to the speech emotion classification instruction, and performing speech recognition to obtain a speech recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data.
In the present embodiment, in order to more clearly understand the technical solution of the present application, the following describes the related execution body in detail. The technical scheme is described by taking the server as an execution subject.
The background server receives the voice data to be recognized which are collected and uploaded based on the plurality of user terminals when the plurality of users communicate under the same video conference scene. After the server receives the voice data to be recognized, speaker recognition can be carried out on the voice data to be recognized.
The server is used for storing a speaker recognition model so as to perform speaker recognition on the voice data to be recognized uploaded by the user side to obtain a voice recognition result; and a set of pre-trained BERT models is stored in the server so as to perform emotion recognition based on the speaker context on the voice recognition result.
In specific implementation, Speaker Recognition (SR) technology, also called Voiceprint Recognition (VPR) technology, mainly adopts an MFCC (Mel-frequency cepstral coefficient) plus GMM (Gaussian mixture model) framework. Through the speaker recognition technology, speaker recognition can be effectively performed on the voice data to be recognized to obtain a voice recognition result corresponding to the voice data to be recognized; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in ascending time order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data.
For example, a speech recognition result corresponding to a piece of speech data to be recognized is represented by U = {u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2}. The speech recognition result U is embodied in 8 dialogues arranged in ascending time order, where in each u_t^s the subscript t denotes the chronological order and the superscript s denotes the speaker number identification. Thus the first dialogue u_1^1 is spoken by speaker 1 (u_1^1 as a whole being the content spoken by speaker 1), the second dialogue u_2^2 by speaker 2, the third dialogue u_3^1 and the fourth dialogue u_4^1 by speaker 1, the fifth dialogue u_5^3 by speaker 3, the sixth dialogue u_6^2 by speaker 2, the seventh dialogue u_7^1 by speaker 1, and the eighth dialogue u_8^2 by speaker 2. By the speaker recognition technology, the speaking content of each speaker in the multi-person conversation is effectively distinguished.
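As an illustrative sketch (not part of the patent's own implementation, and with hypothetical names), the speech recognition result above can be represented as a time-ordered list of (speaker, utterance) pairs, from which each speaker's content is easily separated:

```python
# Hypothetical sketch: a speech recognition result as a time-ordered list of
# (speaker_id, utterance_text) pairs, mirroring the 8-dialogue example above.
U = [
    (1, "utterance 1"),  # time order 1, speaker 1
    (2, "utterance 2"),  # time order 2, speaker 2
    (1, "utterance 3"),  # time order 3, speaker 1
    (1, "utterance 4"),  # time order 4, speaker 1
    (3, "utterance 5"),  # time order 5, speaker 3
    (2, "utterance 6"),  # time order 6, speaker 2
    (1, "utterance 7"),  # time order 7, speaker 1
    (2, "utterance 8"),  # time order 8, speaker 2
]

def utterances_by_speaker(result):
    """Group utterance texts by speaker while keeping their time order."""
    groups = {}
    for speaker, text in result:
        groups.setdefault(speaker, []).append(text)
    return groups

groups = utterances_by_speaker(U)
# Speaker 1 spoke 4 times, speaker 2 spoke 3 times, speaker 3 once.
```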
S102, obtaining a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model.
In this embodiment, a set of pre-trained BERT models is stored locally in the server, where the set of BERT models at least includes a flat BERT model, a hierarchical BERT model, and a spatio-temporal BERT model. When the server selects the BERT model, one of the flat BERT model, the level BERT model and the space-time BERT model is selected at random. Effective vector expression results in the voice recognition results can be effectively extracted through the three models so as to perform subsequent accurate emotion recognition. If the target BERT model is a flat BERT model, the character preprocessing strategy is a first character preprocessing strategy; if the target BERT model is a hierarchical BERT model, the character preprocessing strategy is a second character preprocessing strategy; and if the target BERT model is a space-time BERT model, the character preprocessing strategy is a third character preprocessing strategy.
S103, preprocessing the target voice recognition sub-result selected from the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and performing feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result.
In this embodiment, when a target speech recognition sub-result is arbitrarily selected from the speech data to be recognized (for example, one of the sub-results in the speech recognition result U above is taken as the target speech recognition sub-result), the target speech recognition sub-result can be preprocessed according to the corresponding character preprocessing strategy to obtain a preprocessing result, and the preprocessing result is finally input into the target BERT model for feature extraction to obtain a final vector expression result. That is, when different target BERT models are adopted for feature extraction, the corresponding character preprocessing strategy is adopted for preprocessing the target speech recognition sub-result; this increases the information dimension of the target speech recognition sub-result and makes the feature extraction more accurate.
In one embodiment, as shown in fig. 2d, step S103 includes:
s1031, obtaining any one BERT model in a pre-trained BERT model set as a target BERT model; wherein the BERT model set at least comprises a flat BERT model, a level BERT model and a space-time BERT model;
s1032, when the target BERT model is determined to be a flat BERT model, preprocessing a first target voice recognition sub-result selected from the voice recognition result according to a pre-stored first character preprocessing strategy to obtain a first preprocessing result, and performing feature extraction on the first preprocessing result through the target BERT model to obtain a final vector expression result; wherein the first character preprocessing strategy is used to add a mixed context sequence to the first target voice recognition sub-result;
s1033, when the target BERT model is determined to be a hierarchical BERT model, preprocessing a selected second target voice recognition sub-result in the voice recognition result according to a prestored second character preprocessing strategy to obtain a second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain a final vector expression result; the second character preprocessing strategy is used for acquiring a preceding result of a second target voice recognition sub-result and respectively adding an internal context sequence in each voice recognition sub-result in the preceding result and the second target voice recognition sub-result;
s1034, when the target BERT model is determined to be a space-time BERT model, preprocessing a third target voice recognition sub-result selected from the voice recognition result according to a pre-stored third character preprocessing strategy to obtain a third preprocessing result, and performing feature extraction on the third preprocessing result through the target BERT model to obtain a final vector expression result; wherein the third character preprocessing strategy is used for adding a standard context sequence and an internal context sequence, respectively, to the third target voice recognition sub-result.
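The three-branch selection in steps S1031 to S1034 amounts to pairing each BERT model variant with its own preprocessing strategy, which can be sketched as a simple dispatch table. The function names and return structures below are hypothetical illustrations, not the patent's implementation:

```python
# Hypothetical dispatch: each BERT model variant is paired with its own
# character preprocessing strategy, as described in steps S1032-S1034.
def preprocess_flat(sub_result, context):
    # Flat model: attach the mixed context sequence (conv-context).
    return {"target": sub_result, "mixed_context": context}

def preprocess_hierarchical(sub_result, context):
    # Hierarchical model: attach the internal context sequence (intra-context).
    return {"target": sub_result, "intra_context": context}

def preprocess_spatiotemporal(sub_result, context):
    # Space-time model: attach both standard (inter) and internal (intra) contexts.
    return {"target": sub_result, "inter_context": context, "intra_context": context}

STRATEGY_BY_MODEL = {
    "flat": preprocess_flat,
    "hierarchical": preprocess_hierarchical,
    "spatiotemporal": preprocess_spatiotemporal,
}

def preprocess(model_name, sub_result, context):
    """Select the character preprocessing strategy matching the target BERT model."""
    return STRATEGY_BY_MODEL[model_name](sub_result, context)
```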
In this embodiment, the flat BERT model corresponds to a BERT model with a flat structure: the first target speech recognition sub-result selected from the speech recognition result is processed into an input variable and then directly input into the BERT model for operation, so as to obtain a final vector expression result corresponding to the first target speech recognition sub-result; a specific model structure diagram of the flat BERT model is shown in fig. 2a. The final vector expression result is the extraction of the most effective features in the speech recognition sub-result, and can provide effective input features for subsequent emotion recognition.
In one embodiment, step S1032 includes:
acquiring a mixed context sequence in the voice recognition result according to a preset context window size value and the selected first target voice recognition sub-result;
splicing the first target voice recognition sub-result and the mixed context sequence into a first time sequence according to a preset first splicing strategy;
and inputting the first time sequence into the flat BERT model for operation to obtain a corresponding first vector expression result, and taking the first vector expression result as the final vector expression result corresponding to the first target voice recognition sub-result.
In this embodiment, in order to more clearly understand the subsequent technical solutions, the three context sequences involved in processing the speech recognition result U are described in detail below:

Mixed context sequence (i.e., conv-context), denoted by Ψ. For a first target speech recognition sub-result u_i and a preset context window size value K = 5, the mixed context sequence is extracted by directly pushing back 5 positions from the first target speech recognition sub-result as a starting point, giving Ψ = {u_{i-1}, u_{i-2}, u_{i-3}, u_{i-4}, u_{i-5}}. When the mixed context sequence is obtained, speakers are not distinguished; the sequence is obtained directly by pushing forward in reverse order according to the preset context window size value.

Standard context sequence (i.e., inter-context), denoted by φ. For a first target speech recognition sub-result u_i and a preset context window size value K = 5, an initial sequence is obtained by pushing back 5 positions from the first target speech recognition sub-result as a starting point, and all speech recognition sub-results whose speaker is the same as that of the first target speech recognition sub-result are removed, so as to obtain the standard context sequence (only the sub-results of the other speakers remain).

Internal context sequence (i.e., intra-context). For a first target speech recognition sub-result u_i and a preset context window size value K = 5, an initial sequence is obtained by pushing back 5 positions from the first target speech recognition sub-result as a starting point, and all speech recognition sub-results whose speaker is different from that of the first target speech recognition sub-result are removed, so as to obtain the internal context sequence (only the target speaker's own sub-results remain).
The mixed context sequence is thus obtained from the speech recognition result according to the preset context window size value and the selected first target speech recognition sub-result; that is, taking the selected first target speech recognition sub-result as a starting point, a number of speech recognition sub-results equal to the context window size value is obtained from the speech recognition result in reverse time order to form the mixed context sequence.
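The three context extractions described above can be sketched as follows, assuming (as a hypothetical representation, not the patent's own data format) a time-ordered list of (speaker_id, text) pairs and a 0-based target index:

```python
# Hypothetical sketch of the three context sequences with window size K,
# over a time-ordered list U of (speaker_id, text) pairs; i is the 0-based
# index of the target speech recognition sub-result.
def mixed_context(U, i, K):
    """conv-context: the K sub-results immediately preceding the target,
    in reverse time order, without distinguishing speakers."""
    return list(reversed(U[max(0, i - K):i]))

def standard_context(U, i, K):
    """inter-context: the preceding window with all sub-results of the
    target's own speaker removed (other speakers only)."""
    target_speaker = U[i][0]
    return [u for u in mixed_context(U, i, K) if u[0] != target_speaker]

def internal_context(U, i, K):
    """intra-context: the preceding window with all sub-results of other
    speakers removed (the target speaker only)."""
    target_speaker = U[i][0]
    return [u for u in mixed_context(U, i, K) if u[0] == target_speaker]

# Example: the 8-utterance result from earlier, target u_6 (index 5, speaker 2).
U = [(1, "u1"), (2, "u2"), (1, "u3"), (1, "u4"),
     (3, "u5"), (2, "u6"), (1, "u7"), (2, "u8")]
mixed = mixed_context(U, 5, 5)  # speakers 3, 1, 1, 2, 1 in reverse time order
```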
In an embodiment, concatenating the first target speech recognition sub-result and the mixed context sequence into a first time sequence according to a preset first concatenation strategy includes:
obtaining a corresponding first coding result by applying WordPiece (byte-pair) encoding to the characters included in the first target voice recognition sub-result, and splicing a pre-stored first word embedding vector at the tail of the first coding result to obtain a first processing result;
obtaining a corresponding second coding result by applying WordPiece (byte-pair) encoding to the characters included in the mixed context sequence, and splicing a pre-stored second word embedding vector at the tail of the second coding result to obtain a second processing result;
adding a first preset character string before the first processing result, adding a second preset character string between the first processing result and the second processing result, and adding a second preset character string after the second processing result to obtain a first initial time sequence;
and splicing the corresponding position embedding vector at the tail of each character in the first initial time sequence to obtain a first time sequence.
In the present embodiment, for the flat BERT model, the goal is to predict the emotion of the i-th speech recognition sub-result, and the input is constructed as:

X_i = [CLS] u_i [SEP] Ψ(u_i) [SEP]

where u_i = (w_1, w_2, ..., w_T) represents an expression sequence comprising T words, Ψ(u_i) represents the mixed context sequence comprising the K preceding sub-results, and K is the preset context window size value. After splicing and conversion into embeddings, the input is fed into the BERT model to obtain the vector representation: r_i = BERT(X_i).
When the target BERT model is determined to be a flat BERT model, the input to the flat BERT model is constructed according to the following key points: 1) splicing the first target speech recognition sub-result u_i (which can also be understood as the selected target expression) and the mixed context sequence Ψ(u_i) into one time sequence; 2) adding the special character [CLS] (which can be understood as the first preset character string) at the head of the time sequence to provide an unambiguous output position; 3) using the special character [SEP] (which can be understood as the second preset character string) to separate the target expression from the mixed context sequence; 4) converting all characters into WordPiece embeddings; 5) splicing type-A embeddings (i.e., the pre-stored first word embedding vector) to the characters of the first target speech recognition sub-result and type-B embeddings (i.e., the pre-stored second word embedding vector) to the characters of the mixed context sequence, so as to enhance the distinction between the two; 6) splicing a position embedding to each character to retain the position information of the time sequence. The first time sequence constructed in this way can model a longer time sequence and mine a deeper network structure.
When the first time sequence is input into the flat BERT model for feature extraction, the output at the [CLS] position of the last BERT layer is used as the output, so the obtained first vector expression result is the vector expression of the whole time sequence.
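As a sketch of key points 1) to 3) and 6) above, the splicing of the first time sequence can be illustrated in Python as follows. The function name and token lists are hypothetical; following the standard BERT input convention, the type-A/B embeddings of key point 5) are represented here by segment ids 0/1:

```python
def build_flat_input(target_tokens, context_tokens):
    """Splice the target expression and its mixed context into one time
    sequence: [CLS] target [SEP] context [SEP], with type-A/B segment ids
    and a position id for each character."""
    tokens = ["[CLS]"] + target_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # type-A (0) for the target part including [CLS] and the first [SEP],
    # type-B (1) for the mixed context part including the final [SEP]
    segment_ids = [0] * (len(target_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # position embeddings retain the order information of the time sequence
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids
```

In a real pipeline these three id sequences would be looked up in the embedding tables and summed before entering the BERT encoder.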
In this embodiment, when the target BERT model is determined to be a hierarchical BERT model, the hierarchical BERT model corresponds to a BERT model with a multilayer structure comprising at least a BERT layer and a Transformer layer. The second target speech recognition sub-result selected from the speech recognition result and the speech recognition sub-results obtained by screening are preprocessed and then respectively input into the BERT model of the BERT layer for operation, so that second vector expression results respectively corresponding to the second target speech recognition sub-result and the screened speech recognition sub-results are obtained; the second vector expression result set formed from these second vector expression results is used as the final vector expression result corresponding to the second target speech recognition sub-result. A model structure diagram of the specific hierarchical BERT model is shown in fig. 2 b. The final vector expression result is the extraction of the most effective features in the speech recognition sub-result, and can provide effective input features for subsequent emotion recognition.
In one embodiment, step S1033 includes:
and forward acquiring a number of voice recognition sub-results equal to the size value of the context window in the voice recognition results in a reverse sequence by taking the selected second target voice recognition sub-result as a starting point according to a preset size value of the context window to form a target voice recognition sub-result set, preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to a prestored second character preprocessing strategy to obtain a second preprocessing result, and extracting the characteristics of the second preprocessing result through the target BERT model to obtain a final vector expression result.
And sequentially inputting the preprocessing results respectively corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set into a BERT layer and a Transformer layer in a target BERT model for feature extraction, and obtaining a second vector expression result corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set.
When the second vector expression result corresponding to the second target speech recognition sub-result and the target speech recognition sub-result set is extracted, the preprocessing results corresponding to the second target speech recognition sub-result and the target speech recognition sub-result set need to be input into the BERT layer and the Transformer layer of the target BERT model for feature extraction. The second vector expression result obtained through the two layers of models is weighted at the two granularities of neurons and vectors, so that finer-grained features are fused and the features have a stronger sense of 'hierarchy'.
In an embodiment, the preprocessing the second target speech recognition sub-result and the target speech recognition sub-result set according to a pre-stored second character preprocessing policy to obtain a second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain a final vector expression result includes:
acquiring an ith target voice recognition sub-result in the target voice recognition sub-result set; wherein the initial value of i is 1;
acquiring an ith internal context sequence from the voice recognition result according to a preset context window size value and an ith target voice recognition sub-result;
splicing the ith target voice recognition sub-result and the ith internal context sequence into an ith sub-time sequence according to a preset second splicing strategy;
increasing 1 to the i value, and judging whether the i value exceeds the size value of the context window; if the value i does not exceed the size value of the context window, returning to execute the step of obtaining the ith target voice recognition sub-result in the target voice recognition sub-result set;
if the value of i exceeds the size value of the context window, sequentially acquiring a 1 st sub-timing sequence to an i-1 st sub-timing sequence;
splicing the second target voice recognition sub-result and the corresponding target internal context sequence into an ith sub-time sequence according to the second splicing strategy;
respectively inputting the 1 st to ith sub-time sequence sequences into a BERT layer in a target BERT model for feature extraction, and obtaining second vector initial expression results respectively corresponding to the 1 st to ith sub-time sequences;
splicing second vector initial expression results corresponding to the 1 st to the ith sub-time sequences respectively to obtain a first splicing result;
and inputting the first splicing result into a Transformer layer in a target BERT model for feature extraction to obtain a second vector expression result.
In this embodiment, for example, the preset context window size value K is 5, the second target speech recognition sub-result is u_7^1, and the speech recognition result is U = (u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2). Then the 1st target speech recognition sub-result in said target speech recognition sub-result set is u_2^2, and its corresponding 1st internal context sequence is an empty set; similarly, the 2nd target speech recognition sub-result in the target speech recognition sub-result set is u_3^1, and its corresponding 2nd internal context sequence is (u_1^1); the 3rd target speech recognition sub-result in the target speech recognition sub-result set is u_4^1, and its corresponding 3rd internal context sequence is (u_1^1, u_3^1); the 4th target speech recognition sub-result in the target speech recognition sub-result set is u_5^3, and its corresponding 4th internal context sequence is an empty set; the 5th target speech recognition sub-result in the target speech recognition sub-result set is u_6^2, and its corresponding 5th internal context sequence is (u_2^2).
When the ith target speech recognition sub-result and the ith internal context sequence are spliced into the ith sub-time sequence according to a preset second splicing strategy, the method specifically comprises the following steps: obtaining a corresponding ith group of first sub-coding results by double-byte coding of the characters included in the ith target speech recognition sub-result, and splicing a pre-stored first-class word embedding vector at the tail end of the ith group of first sub-coding results to obtain an ith group of first processing results; obtaining a corresponding ith group of second sub-coding results by double-byte coding of the characters included in the ith internal context sequence, and splicing a pre-stored second-class word embedding vector at the tail end of the ith group of second sub-coding results to obtain an ith group of second processing results; adding a [CLS] character before the ith group of first processing results, adding a [SEP] character between the ith group of first processing results and the ith group of second processing results, and adding a [SEP] character after the ith group of second processing results to obtain an ith group of initial time sequences; and splicing the corresponding position embedding vector at the tail of each character in the ith group of initial time sequences to obtain the ith sub-time sequence. Compared with a flat structure, the hierarchical structure of the hierarchical BERT model, which improves on the hierarchical recurrent neural network, can effectively distinguish speakers.
The 1st to the ith sub-time sequences are sequentially obtained and input into the BERT layer of the target BERT model for feature extraction, so as to obtain a second vector initial expression result with speaker context for each moment. That is, the 1st sub-time sequence is input into the BERT layer of the target BERT model for feature extraction to obtain the second vector initial expression result corresponding to u_2^2; the 2nd sub-time sequence is input into the BERT layer of the target BERT model for feature extraction to obtain the result corresponding to u_3^1; the 3rd sub-time sequence is input to obtain the result corresponding to u_4^1; the 4th sub-time sequence is input to obtain the result corresponding to u_5^3; the 5th sub-time sequence is input to obtain the result corresponding to u_6^2; and the 6th sub-time sequence is input to obtain the result corresponding to u_7^1. After the 6 second vector initial expression results are obtained, they are spliced in ascending order of their subscripts to obtain a first splicing result. Finally, the first splicing result is input into the Transformer layer of the target BERT model for feature extraction (specifically into the encoder part of the Transformer layer, where the number of layers of the encoder part is 6) to obtain a second vector expression result.
The second vector expression result is obtained by taking the output at the position corresponding to the target expression u_7^1 in the last layer of the encoder part of the Transformer layer as the vector expression used for final emotion classification.
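The windowed construction of the 1st to 6th sub-time sequences above can be sketched as follows. This is a minimal illustration assuming each speech recognition sub-result is a (chronological order, speaker number) pair, with the internal context of each sub-result taken as the same-speaker sub-results among the K positions preceding it; the function name is hypothetical:

```python
def hierarchical_subsequences(utts, target_idx, K):
    """utts: list of (time, speaker) pairs in ascending time order.
    Returns, for the K predecessors of the target and then the target
    itself, each utterance paired with its internal context (the
    same-speaker utterances among the K positions preceding it)."""
    out = []
    start = target_idx - K  # indices of the K predecessors
    for i in range(start, target_idx + 1):
        t, spk = utts[i]
        window = utts[max(0, i - K):i]          # reverse-order window of size <= K
        internal = [u for u in window if u[1] == spk]
        out.append(((t, spk), internal))
    return out
```

Run on the example U with K = 5 and target u_7^1, this reproduces the six (sub-result, internal context) pairs listed above, each of which would then be spliced into a sub-time sequence for the BERT layer.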
In this embodiment, when the target BERT model is determined to be a space-time BERT model, the space-time BERT model corresponds to a BERT model considered comprehensively from both the time and space perspectives. The third target speech recognition sub-result selected from the speech recognition result is processed into two input variables (one input variable is obtained by splicing the third target speech recognition sub-result with its corresponding current standard context sequence, and the other is obtained by splicing the third target speech recognition sub-result with its corresponding current internal context sequence), which are then directly input into the BERT model for operation; the respective operation results are fused by the fusion model, and the resulting third vector expression result corresponding to the third target speech recognition sub-result is used as the final vector expression result. A model structure diagram of the specific space-time BERT model is shown in fig. 2 c. Similarly, the final vector expression result is the extraction of the most effective features in the speech recognition sub-result, and can provide effective input features for subsequent emotion recognition.
In an embodiment, in step S1034, the preprocessing the selected third target speech recognition sub-result in the speech recognition result according to a pre-stored third character preprocessing policy to obtain a third preprocessing result, and performing feature extraction on the third preprocessing result through the target BERT model to obtain a final vector expression result, including:
and acquiring a current standard context sequence and a current internal context sequence which correspond to the third target voice identifier result in the voice identification result respectively, splicing the third target voice identifier result, the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence respectively, and inputting the first time sequence and the current second time sequence into a BERT layer and a fusion model layer in a target BERT model respectively for feature extraction to obtain a third vector expression result.
In this embodiment, when the third vector expression result corresponding to the third target speech recognition sub-result is extracted, from the time perspective, the current first time sequence and the current second time sequence obtained by processing the third target speech recognition sub-result are respectively input into the BERT layer of the target BERT model; then, from the space perspective, the resulting vector expressions are fused (that is, input into the fusion model layer of the target BERT model) to obtain the third vector expression result. From the time perspective, the emotional influence of the speaker can be distinguished explicitly; from the space perspective, the features can be weighted at the two granularities of neurons and vectors, so the feature fusion granularity is finer.
In an embodiment, the obtaining of the current standard context sequence and the current internal context sequence of the third target speech recognition sub-result respectively corresponding to the speech recognition result, respectively splicing the third target speech recognition sub-result with the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence, and respectively inputting the first time sequence and the current second time sequence into a BERT layer and a fusion model layer in a target BERT model to perform feature extraction, so as to obtain a third vector expression result, includes:
respectively acquiring a current standard context sequence and a current internal context sequence in the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result;
splicing the third target voice recognition sub-result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, and splicing the third target voice recognition sub-result and the current internal context sequence into a current second time sequence according to the third splicing strategy;
inputting the current first time sequence into a BERT layer in a target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result;
longitudinally splicing the current first vector initial expression result and the current second vector initial expression result to obtain a current splicing result;
and inputting the current splicing result into a fusion model layer in a target BERT model for fusion processing to obtain a third vector expression result corresponding to the third target voice identifier result.
In this embodiment, the current standard context sequence is obtained from the speech recognition result according to the preset size value of the context window and the selected third target speech recognition sub-result, that is, the speech recognition sub-results with the same number as the size value of the context window are obtained from the speech recognition result in reverse order by using the selected third target speech recognition sub-result as a starting point, and all the speech recognition sub-results with the same speaker as the third target speech recognition sub-result are removed to form the current standard context sequence.
And obtaining a current internal context sequence in the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result, namely obtaining the voice recognition sub-results with the same number as the context window size value in the voice recognition result in a reverse sequence by taking the selected third target voice recognition sub-result as a starting point, and removing all the voice recognition sub-results of the speakers which are different from the third target voice recognition sub-result to form the current internal context sequence.
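The two context extractions described in the previous two paragraphs share one reverse-order window and differ only in the speaker filter; a minimal sketch, assuming each speech recognition sub-result is a (chronological order, speaker number) pair and with a hypothetical function name:

```python
def split_contexts(utts, target_idx, K):
    """Take the K utterances preceding the target (reverse-order window),
    then split them by speaker: the current standard (inter) context drops
    same-speaker utterances; the current internal (intra) context keeps
    only same-speaker utterances."""
    _, spk = utts[target_idx]
    window = utts[max(0, target_idx - K):target_idx]
    standard = [u for u in window if u[1] != spk]   # different speakers
    internal = [u for u in window if u[1] == spk]   # same speaker
    return standard, internal
```

On the running example U with target u_7^1 and K = 5, this yields the standard context (u_2^2, u_5^3, u_6^2) and the internal context (u_3^1, u_4^1).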
And inputting the current first time sequence into a BERT layer in a target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result, wherein the output of the position of the last layer [ CLS ] of the BERT is used as the vector expression of the whole time sequence.
And splicing the third target speech recognizer result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, wherein the method specifically comprises the following steps: obtaining a corresponding current first coding result by double-byte coding of characters included in the third target voice recognition result, and splicing a pre-stored first word embedding vector at the tail end of the current first coding result to obtain a current first processing result; obtaining a corresponding current second coding result by double-byte coding of characters included in the current standard context sequence, and splicing a prestored second-class word embedding vector at the tail end of the current second coding result to obtain a current second processing result; adding a [ CLS ] character before the current first processing result, adding a [ SEP ] character between the current first processing result and the current second processing result, and adding a [ SEP ] character after the current second processing result to obtain a current first initial time sequence; and splicing corresponding position embedded vectors at the tail of each character in the first current initial time sequence to obtain a current first time sequence. And the splicing acquisition process of splicing the third target voice identifier result and the current internal context sequence into the current second time sequence according to a preset third splicing strategy is also an acquisition process of referring to the current first time sequence.
For example, the third target speech recognition sub-result is u_7^1 and the preset context window size value is 5; then the current standard context sequence is (u_2^2, u_5^3, u_6^2) and the current internal context sequence is (u_3^1, u_4^1). The third target speech recognition sub-result u_7^1 is spliced with the current standard context sequence into the current first time sequence through the preset third splicing strategy, and u_7^1 is spliced with the current internal context sequence into the current second time sequence through the preset third splicing strategy. The current first time sequence is input into the BERT layer of the target BERT model for feature extraction to obtain the current first vector initial expression result, and the current second time sequence is input into the BERT layer of the target BERT model for feature extraction to obtain the current second vector initial expression result; both results lie in R^(d_f), where d_f is the vector dimension. The current first vector initial expression result and the current second vector initial expression result are two emotional influence vector expressions obtained from the time dimension.
The current first vector initial expression result and the current second vector initial expression result are longitudinally spliced to obtain the current splicing result C ∈ R^(2×d_f). Finally, when the current splicing result C is input into the fusion model layer of the target BERT model for fusion processing, a tensor operation can be specifically implemented, namely:
r_i = RELU(W_a(W_b ⊙ C) + b)
wherein RELU() represents a linear rectification function, W_b ∈ R^(2×d_f) assigns a neuron-level weight to each neuron in C (i.e. the weights of all neurons are different), W_a ∈ R^(1×2) assigns vector-level weights to the two row vectors in C (i.e. the neurons in one row vector are assigned the same weight), ⊙ denotes the element-wise product, and b represents a bias term. Therefore, the tensor operation performed when the current splicing result C is input into the fusion model layer of the target BERT model yields the third vector expression result.
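The exact form of the fusion formula is not fully recoverable from the text, so the following is an illustrative reconstruction under the stated assumption that an element-wise (neuron-level) weighting W_b and a row-wise (vector-level) weighting W_a are applied to the 2×d_f splicing result before the RELU; function and variable names are hypothetical:

```python
def relu(xs):
    """Linear rectification applied element-wise."""
    return [max(0.0, v) for v in xs]

def fuse(c, w_b, w_a, bias):
    """c: 2 x d_f splicing result (the two time-dimension expressions).
    w_b: 2 x d_f neuron-level weights (one weight per neuron).
    w_a: 2 vector-level weights (one shared weight per row vector).
    bias: d_f bias term.  Computes RELU(w_a . (w_b (*) c) + bias)."""
    d_f = len(c[0])
    out = []
    for j in range(d_f):
        # combine the two rows: each neuron weighted individually by w_b,
        # each whole row weighted by its w_a entry
        s = sum(w_a[r] * w_b[r][j] * c[r][j] for r in range(2))
        out.append(s + bias[j])
    return relu(out)
```

The result is a single d_f-dimensional vector, matching the role of the third vector expression result.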
And S104, calling a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result.
In this embodiment, the final vector expression result obtained in step S1032, step S1033 or step S1034 can be denoted r_i (e.g. the results obtained from the above specific examples are all denoted r_7). A pre-trained emotion classification model is called, and the final vector expression result is input into the emotion classification model for the following operation:
o_i = tanh(W_o r_i)
ŷ_i = softmax(W_ŷ o_i)
wherein tanh() is a hyperbolic tangent function, W_o is the first weight value corresponding to r_i, softmax() can be understood as a linear classifier, W_ŷ is the second weight value corresponding to o_i, and ŷ_i is the final predicted emotion classification result.
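The two operations above can be sketched as a minimal pure-Python classification head, with W_o and W_ŷ treated as plain weight matrices whose shapes are illustrative assumptions:

```python
import math

def emotion_probs(r, W_o, W_y):
    """o = tanh(W_o r); y_hat = softmax(W_y o).
    r: d-dim final vector expression, W_o: h x d, W_y: n_classes x h."""
    o = [math.tanh(sum(w * x for w, x in zip(row, r))) for row in W_o]
    z = [sum(w * x for w, x in zip(row, o)) for row in W_y]
    m = max(z)                          # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]           # probability per emotion class
```

The index of the largest probability gives the predicted emotion classification result.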
The method realizes feature extraction with a deeper network structure, can also explicitly distinguish the emotional influence of the speaker, and weights the features at the two granularities of neurons and vectors, so the feature fusion granularity is finer and a more accurate emotion recognition result is finally obtained.
The embodiment of the invention also provides a speech emotion classification device, which is used for executing any embodiment of the speech emotion classification method. Specifically, please refer to fig. 3, fig. 3 is a schematic block diagram of a speech emotion classification apparatus according to an embodiment of the present invention. The speech emotion classification apparatus 100 may be configured in a server.
As shown in fig. 3, the speech emotion classification apparatus 100 includes: a speaker recognition unit 101, a target model selection unit 102, a final vector acquisition unit 103, and an emotion classification unit 104.
The speaker recognition unit 101 is configured to respond to a speech emotion classification instruction, acquire speech data to be recognized according to the speech emotion classification instruction, perform speech recognition, and obtain a speech recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data.
In this embodiment, Speaker Recognition (abbreviated as SR) technology is also called Voiceprint Recognition (abbreviated as VPR) technology. The voiceprint recognition technology mainly adopts an MFCC (mel-frequency cepstral coefficient) and GMM (Gaussian mixture model) framework, and speaker recognition can be effectively performed on the voice data to be recognized by the speaker recognition technology to obtain a voice recognition result corresponding to the voice data to be recognized; the voice recognition result comprises a plurality of voice recognition sub-results arranged in time-series ascending order, and each voice recognition sub-result corresponds to one speaker and the corresponding speaking content data.
For example, a speech recognition result corresponding to a piece of speech data to be recognized is represented by U, and the expression of U is U = (u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2). The speech recognition result U is embodied as 8 dialogues arranged in time-series ascending order: the first dialogue is spoken by speaker 1 (denoted u_1^1, where the subscript 1 denotes chronological order 1, the superscript 1 denotes the speaker number identification of speaker 1, and u_1^1 is the whole speaking content of speaker 1); the second dialogue is spoken by speaker 2 (denoted u_2^2, where the subscript 2 denotes chronological order 2 and the superscript 2 denotes the speaker number identification of speaker 2); the third dialogue is spoken by speaker 1 (denoted u_3^1, where the subscript 3 denotes chronological order 3 and the superscript 1 denotes the speaker number identification of speaker 1); the fourth dialogue is spoken by speaker 1 (denoted u_4^1, where the subscript 4 denotes chronological order 4); the fifth dialogue is spoken by speaker 3 (denoted u_5^3, where the subscript 5 denotes chronological order 5 and the superscript 3 denotes the speaker number identification of speaker 3); the sixth dialogue is spoken by speaker 2 (denoted u_6^2, where the subscript 6 denotes chronological order 6); the seventh dialogue is spoken by speaker 1 (denoted u_7^1, where the subscript 7 denotes chronological order 7); and the eighth dialogue is spoken by speaker 2 (denoted u_8^2, where the subscript 8 denotes chronological order 8 and the superscript 2 denotes the speaker number identification of speaker 2). By the speaker recognition technology, the speaking content of each speaker in the multi-person conversation is effectively distinguished.
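The notation above can be mirrored by a simple data structure; in this hypothetical sketch each sub-result is a (chronological order, speaker number, content) triple and the content strings are placeholders, since the actual speaking content is not given in the example:

```python
# Each speech recognition sub-result: (chronological order, speaker number, content).
U = [
    (1, 1, "..."), (2, 2, "..."), (3, 1, "..."), (4, 1, "..."),
    (5, 3, "..."), (6, 2, "..."), (7, 1, "..."), (8, 2, "..."),
]

def speaker_turns(utts):
    """Group the sub-results by speaker number, preserving time order,
    so the speaking turns of each speaker in the conversation are
    distinguished."""
    turns = {}
    for t, spk, _content in utts:
        turns.setdefault(spk, []).append(t)
    return turns
```

For the example U this grouping recovers that speaker 1 speaks at moments 1, 3, 4 and 7, speaker 2 at moments 2, 6 and 8, and speaker 3 at moment 5.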
The target model selecting unit 102 is configured to obtain a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model.
In this embodiment, a pre-trained BERT model set is pre-stored locally in the server, where the BERT model set at least includes a flat BERT model, a hierarchical BERT model, and a spatio-temporal BERT model. When the server selects the BERT model, one of the flat BERT model, the level BERT model and the space-time BERT model is selected at random. Effective vector expression results in the voice recognition results can be effectively extracted through the three models so as to perform subsequent accurate emotion recognition. If the target BERT model is a flat BERT model, the character preprocessing strategy is a first character preprocessing strategy; if the target BERT model is a hierarchical BERT model, the character preprocessing strategy is a second character preprocessing strategy; and if the target BERT model is a space-time BERT model, the character preprocessing strategy is a third character preprocessing strategy.
And a final vector obtaining unit 103, configured to perform preprocessing on the target speech recognizer result selected in the speech data to be recognized according to the character preprocessing policy to obtain a preprocessing result, and perform feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result.
In this embodiment, when a target speech recognition sub-result is arbitrarily selected from the speech data to be recognized, for example the speech recognition sub-result u_7^1 selected from the speech recognition result U as the target speech recognition sub-result, the target speech recognition sub-result can be preprocessed according to the corresponding character preprocessing strategy to obtain a preprocessing result, and the preprocessing result is finally input into the target BERT model for feature extraction to obtain the final vector expression result. That is, when different target BERT models are determined to be adopted for feature extraction, the corresponding character preprocessing strategy is adopted to preprocess the target speech recognition sub-result to obtain the preprocessing result, so that the information dimension of the target speech recognition sub-result can be increased and the feature extraction is more accurate.
In an embodiment, as shown in fig. 3, the final vector obtaining unit 103 includes:
a target model obtaining unit 1031, configured to obtain any one of the BERT models in the pre-trained BERT model set as a target BERT model; wherein the BERT model set at least comprises a flat BERT model, a level BERT model and a space-time BERT model;
a first model processing unit 1032, configured to, when the target BERT model is determined to be a flat BERT model, pre-process the first target speech recognition sub-result selected from the speech recognition result according to a pre-stored first character pre-processing policy to obtain a first pre-processing result, and perform feature extraction on the first pre-processing result through the target BERT model to obtain a final vector expression result; wherein the first character pre-processing strategy is used to add a mixed context sequence in a first target speech recognizer result;
a second model processing unit 1033, configured to, when the target BERT model is determined to be a hierarchical BERT model, pre-process the second target speech recognizer result selected from the speech recognition results according to a pre-stored second character pre-processing policy to obtain a second pre-processing result, and perform feature extraction on the second pre-processing result through the target BERT model to obtain a final vector expression result; the second character preprocessing strategy is used for acquiring a preceding result of a second target voice recognition sub-result and respectively adding an internal context sequence in each voice recognition sub-result in the preceding result and the second target voice recognition sub-result;
a third model processing unit 1034, configured to, when it is determined that the target BERT model is a space-time BERT model, pre-process a third target speech recognizer result selected from the speech recognition results according to a pre-stored third character pre-processing policy to obtain a third pre-processing result, and perform feature extraction on the third pre-processing result through the target BERT model to obtain a final vector expression result; wherein the third character preprocessing strategy is used for adding a standard context sequence and an internal context sequence, respectively, in a third target speech recognizer result.
In this embodiment, the flat BERT model corresponds to a BERT model with a flat structure, and the first target speech recognition sub-result selected from the speech recognition result is processed into an input variable and then directly input into the BERT model for operation, so as to obtain a final vector expression result corresponding to the first target speech recognition sub-result, where a specific model structure diagram of the flat BERT model is shown in fig. 2 a. The final vector expression result is the extraction of the most effective features in the speech recognizer result, and can provide effective input features for subsequent emotion recognition.
In an embodiment, the first model processing unit 1032 includes:
a mixed context sequence obtaining unit, configured to obtain a mixed context sequence in the speech recognition result according to a preset context window size value and the selected first target speech recognition sub-result;
a first time sequence obtaining unit, configured to splice the first target speech identifier result and the mixed context sequence into a first time sequence according to a preset first splicing strategy;
and the first operation unit is used for inputting the first time sequence into the flat BERT model for operation to obtain a corresponding first vector expression result, and taking the first vector expression result as a final vector expression result corresponding to the first target voice identifier result.
In this embodiment, in order to more clearly understand the subsequent technical solutions, three context sequences involved in extracting the speech recognition result U will be described in detail below:
Mixed context sequence (i.e., conv-context), denoted by Ψ: for a speech recognition result U, a selected first target speech recognition sub-result, and a preset context window size value K (for example, K = 5), the mixed context sequence is extracted by directly taking the first target speech recognition sub-result as a starting point and pushing backwards by K bits, i.e., taking the K speech recognition sub-results immediately preceding it. When the mixed context sequence is obtained, speakers are not distinguished, and the sequence is obtained directly in reverse order according to the preset context window size value.
Standard context sequence (i.e., inter-context), denoted by φ: for the same speech recognition result U, first target speech recognition sub-result, and preset context window size value K (for example, K = 5), an initial sequence is first obtained by pushing backwards K bits from the first target speech recognition sub-result as the starting point, and then all speech recognition sub-results whose speaker is the same as that of the first target speech recognition sub-result are removed, so as to obtain the standard context sequence.
Internal context sequence (i.e., intra-context): for the same speech recognition result U, first target speech recognition sub-result, and preset context window size value K (for example, K = 5), an initial sequence is first obtained by pushing backwards K bits from the first target speech recognition sub-result as the starting point, and then all speech recognition sub-results whose speaker is different from that of the first target speech recognition sub-result are removed, so as to obtain the internal context sequence.
The mixed context sequence is obtained from the speech recognition result according to the preset size value of the context window and the selected first target speech recognition sub-result, namely, the mixed context sequence is formed by obtaining the speech recognition sub-results with the same number as the size value of the context window from the speech recognition result in reverse order by taking the selected first target speech recognition sub-result as a starting point.
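The three context-sequence extraction rules described above can be sketched as follows. This is an illustrative Python sketch: names such as `Utterance`, `mixed_context`, `standard_context`, and `internal_context` are assumptions for illustration, not identifiers from the patent.

```python
# Illustrative sketch of the three context-sequence extractions; all names
# here are hypothetical and chosen only to mirror the patent's description.
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    text: str

def mixed_context(utts, i, K):
    """conv-context: the K sub-results immediately preceding utts[i],
    speakers not distinguished."""
    return utts[max(0, i - K):i]

def standard_context(utts, i, K):
    """inter-context: the preceding window with same-speaker sub-results removed."""
    return [u for u in mixed_context(utts, i, K) if u.speaker != utts[i].speaker]

def internal_context(utts, i, K):
    """intra-context: the preceding window with different-speaker sub-results removed."""
    return [u for u in mixed_context(utts, i, K) if u.speaker == utts[i].speaker]
```

For a two-speaker dialogue alternating between A and B, the standard context of an A-utterance keeps only B's utterances, while the internal context keeps only A's earlier utterances within the window.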
In an embodiment, the first timing sequence obtaining unit includes:
the first splicing unit is used for encoding the characters included in the first target speech recognition sub-result into a corresponding first coding result through double-byte coding, and splicing a pre-stored first word embedding vector at the tail end of the first coding result to obtain a first processing result;
the second splicing unit is used for carrying out double-byte coding on the characters included in the mixed context sequence to obtain a corresponding second coding result, and splicing a pre-stored second word embedding vector at the tail end of the second coding result to obtain a second processing result;
a third splicing unit, configured to add a first preset character string before the first processing result, add a second preset character string between the first processing result and the second processing result, and add a second preset character string after the second processing result, to obtain a first initial timing sequence;
and the fourth splicing unit is used for splicing the corresponding position embedded vector at the tail of each character in the first initial time sequence to obtain the first time sequence.
In the present embodiment, for the flat BERT model, the goal is to predict the emotion of the ith speech recognition sub-result, and the input is constructed as:
X_i = { [CLS], u_i, [SEP], Ψ_i, [SEP] }
wherein u_i represents an expression sequence comprising T words, Ψ_i represents the mixed context sequence comprising the K expressions preceding u_i, and K is the preset context window size value. After splicing and conversion into embeddings, the input is fed into the BERT model to obtain the vector representation: r_i = BERT(X_i).
When the target BERT model is determined to be a flat BERT model, the input of the flat BERT model involves the following key points: 1) splicing the first target speech recognition sub-result (which may also be understood as the selected target expression) and the mixed context sequence into a time sequence; 2) adding the [CLS] special character at the head of the time sequence to clarify the output position; 3) using the [SEP] special character to separate the target expression and the mixed context sequence; 4) converting all characters into WordPiece embeddings; 5) splicing type A embeddings (namely the pre-stored first word embedding vector) to the characters of the first target speech recognition sub-result, and splicing type B embeddings (namely the pre-stored second word embedding vector) to the characters of the mixed context sequence, so as to enhance the degree of distinction between the two; 6) splicing a position embedding to each character to retain the position information of the time sequence. Through the first time sequence constructed in the above way, a longer time sequence can be modeled and a deeper network structure can be mined.
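The six key points above can be sketched as a small input-construction helper. The `build_flat_input` function and its id conventions (segment 0 for the target expression, segment 1 for the mixed context, sequential position ids) are illustrative assumptions following standard BERT practice, not the patent's actual implementation.

```python
# Hedged sketch of the flat-BERT input splicing; tokenization into WordPiece
# pieces is assumed to have happened already.
def build_flat_input(target_tokens, context_tokens):
    # 1)-3): splice target and mixed context, with [CLS] at the head to mark
    # the output position and [SEP] separating the two segments
    tokens = ["[CLS]"] + target_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # 5): type A embeddings for the target expression (with its [CLS]/[SEP]),
    # type B embeddings for the mixed context, to enhance their distinction
    type_ids = [0] * (len(target_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # 6): position embeddings retain the order information of the time sequence
    position_ids = list(range(len(tokens)))
    return tokens, type_ids, position_ids
```

A real model would map `tokens` to vocabulary ids and sum token, segment, and position embeddings before the encoder stack.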
When the first time sequence is input into the flat BERT model for feature extraction, the obtained first vector expression result uses the output at the last-layer [CLS] position of BERT as the vector expression of the whole time sequence.
In this embodiment, when the target BERT model is determined to be a hierarchical BERT model, the hierarchical BERT model corresponds to a BERT model with a multilayer structure, which at least comprises a BERT layer and a Transformer layer. On this basis, the selected second target speech recognition sub-result and the selected speech recognition sub-results in the speech recognition result are respectively input into the BERT model of the BERT layer for operation, so as to obtain second vector expression results respectively corresponding to them, and the second vector expression result set formed by these second vector expression results is used as the final vector expression result corresponding to the second target speech recognition sub-result. A model structure diagram of the hierarchical BERT model is shown in fig. 2b. The final vector expression result is the extraction of the most effective features in the speech recognition sub-result, and can provide effective input features for subsequent emotion recognition.
In an embodiment, the second model processing unit 1033 is further configured to:
and forward acquiring a number of voice recognition sub-results equal to the size value of the context window in the voice recognition results in a reverse sequence by taking the selected second target voice recognition sub-result as a starting point according to a preset size value of the context window to form a target voice recognition sub-result set, preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to a prestored second character preprocessing strategy to obtain a second preprocessing result, and extracting the characteristics of the second preprocessing result through the target BERT model to obtain a final vector expression result.
And sequentially inputting the preprocessing results respectively corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set into a BERT layer and a Transformer layer in a target BERT model for feature extraction, and obtaining a second vector expression result corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set.
When extracting the second vector expression result corresponding to the second target speech recognition sub-result and the target speech recognition sub-result set, the preprocessing results corresponding to them need to be input into the BERT layer and the Transformer layer of the target BERT model for feature extraction. The second vector expression result obtained through the extraction of the two layers of models is weighted at the two granularities of neurons and vectors, so that features of finer granularity are fused and the features have a stronger sense of "hierarchy".
In one embodiment, the second model processing unit 1033 is further configured to:
acquiring an ith target voice recognition sub-result in the target voice recognition sub-result set; wherein the initial value of i is 1;
acquiring an ith internal context sequence from the voice recognition result according to a preset context window size value and an ith target voice recognition sub-result;
splicing the ith target voice recognition sub-result and the ith internal context sequence into an ith sub-time sequence according to a preset second splicing strategy;
incrementing the value of i by 1, and judging whether the value of i exceeds the context window size value; if the value of i does not exceed the context window size value, returning to execute the step of acquiring the ith target speech recognition sub-result in the target speech recognition sub-result set;
if the value of i exceeds the size value of the context window, sequentially acquiring a 1 st sub-timing sequence to an i-1 st sub-timing sequence;
splicing the second target voice recognition sub-result and the corresponding target internal context sequence into an ith sub-time sequence according to the second splicing strategy;
respectively inputting the 1 st to ith sub-time sequence sequences into a BERT layer in a target BERT model for feature extraction, and obtaining second vector initial expression results respectively corresponding to the 1 st to ith sub-time sequences;
splicing second vector initial expression results corresponding to the 1 st to the ith sub-time sequences respectively to obtain a first splicing result;
and inputting the first splicing result into a Transformer layer in a target BERT model for feature extraction to obtain a second vector expression result.
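The steps above can be sketched at the shape level. In this sketch, toy stand-ins replace the real BERT layer and Transformer-layer encoder: `encode_utterance` and `transformer_pool` are hypothetical helpers for illustration only, and the hidden size is an arbitrary toy value.

```python
# Shape-level sketch of the hierarchical flow: one vector per sub-time
# sequence from a BERT-layer stand-in, then self-attention over the spliced
# initial expression results as a Transformer-layer stand-in.
import numpy as np

D = 8  # toy hidden size, an assumption for illustration

def encode_utterance(token_ids):
    """Stand-in for the BERT layer: one [CLS]-style vector per sub-time sequence."""
    rng = np.random.default_rng(sum(token_ids))
    return rng.standard_normal(D)

def transformer_pool(vectors):
    """Stand-in for the Transformer-layer encoder: scaled dot-product
    self-attention over the spliced results, returning the vector at the
    target (last) position."""
    X = np.stack(vectors)                  # (n, D): the first splicing result
    scores = X @ X.T / np.sqrt(D)          # attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ X)[-1]               # output at the target position

def hierarchical_encode(sub_sequences):
    initial = [encode_utterance(s) for s in sub_sequences]  # BERT layer, per sequence
    return transformer_pool(initial)                         # Transformer layer
```

A real implementation would replace both stand-ins with pretrained models; the point of the sketch is the two-stage flow and the splice in ascending subscript order.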
In this embodiment, for example, the preset context window size value K is 5 for a given speech recognition result. The 1st target speech recognition sub-result in the target speech recognition sub-result set then corresponds to a 1st internal context sequence that is an empty set; similarly, the 2nd and 3rd target speech recognition sub-results correspond to their respective 2nd and 3rd internal context sequences; the 4th target speech recognition sub-result corresponds to a 4th internal context sequence that is an empty set; and the 5th target speech recognition sub-result corresponds to its 5th internal context sequence.
When the ith target speech recognition sub-result and the ith internal context sequence are spliced into the ith sub-time sequence according to the preset second splicing strategy, the method specifically comprises the following steps: encoding the characters included in the ith target speech recognition sub-result through double-byte coding to obtain a corresponding ith group of first sub-coding results, and splicing the pre-stored first word embedding vector at the tail end of the ith group of first sub-coding results to obtain an ith group of first processing results; encoding the characters included in the ith internal context sequence through double-byte coding to obtain a corresponding ith group of second sub-coding results, and splicing the pre-stored second word embedding vector at the tail end of the ith group of second sub-coding results to obtain an ith group of second processing results; adding the [CLS] character before the ith group of first processing results, adding the [SEP] character between the ith group of first processing results and the ith group of second processing results, and adding the [SEP] character after the ith group of second processing results to obtain an ith group of initial time sequences; and splicing the corresponding position embedding vector at the tail of each character in the ith group of initial time sequences to obtain the ith sub-time sequence. For the hierarchical BERT model, which improves on the hierarchical recurrent neural network, the hierarchical structure can effectively distinguish speakers compared with the flat structure.
The 1st to ith sub-time sequences are sequentially input into the BERT layer in the target BERT model for feature extraction, so as to obtain a second vector initial expression result carrying the speaker context at each moment: the 1st sub-time sequence is input into the BERT layer in the target BERT model for feature extraction to obtain the 1st result, the 2nd sub-time sequence yields the 2nd result, and so on, up to the 6th sub-time sequence, which yields the 6th result. After the 6 second vector initial expression results are obtained, they are spliced in ascending order of their subscripts to obtain the first splicing result. Finally, the first splicing result is input into the Transformer layer in the target BERT model for feature extraction (specifically, into the encoder part of the Transformer layer, the number of layers of the encoder part being 6), so as to obtain the second vector expression result.
The obtained second vector expression result is the output at the target position of the last layer of the encoder part of the Transformer layer, and is used as the vector expression for final emotion classification.
In this embodiment, when the target BERT model is determined to be a space-time BERT model, the space-time BERT model corresponds to a BERT model considered comprehensively from both the time and space perspectives. The third target speech recognition sub-result selected from the speech recognition result is processed into two input variables (one input variable is obtained by splicing the third target speech recognition sub-result and its corresponding current standard context sequence, and the other is obtained by splicing the third target speech recognition sub-result and its corresponding current internal context sequence), which are then directly input into the BERT model for operation; the respective operation results are subjected to fusion processing by the fusion model, and a third vector expression result corresponding to the third target speech recognition sub-result is obtained as the final vector expression result. A model structure diagram of the concrete space-time BERT model is shown in fig. 2c. Similarly, the final vector expression result is the extraction of the most effective features in the speech recognition sub-result, and can provide effective input features for subsequent emotion recognition.
In an embodiment, the third model processing unit 1034 further includes:
and the second hierarchical extraction unit is used for acquiring a current standard context sequence and a current internal context sequence respectively corresponding to the third target speech recognition sub-result in the speech recognition result, splicing the third target speech recognition sub-result with the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence respectively, and inputting the current first time sequence and the current second time sequence into the BERT layer and the fusion model layer in the target BERT model respectively for feature extraction, so as to obtain a third vector expression result.
In this embodiment, when the third vector expression result corresponding to the third target speech recognition sub-result is extracted, from the time perspective, the current first time sequence and the current second time sequence obtained by processing the third target speech recognition sub-result are respectively input into the BERT layer in the target BERT model; from the space perspective, the resulting outputs are then fused (that is, input into the fusion model layer in the target BERT model) to obtain the third vector expression result. From the time perspective, the emotional influence of the speaker can be explicitly discerned; from the space perspective, the features can be weighted at the two granularities of neurons and vectors, so that the feature fusion granularity is finer.
In an embodiment, the third model processing unit 1034 is further configured to:
respectively acquiring a current standard context sequence and a current internal context sequence in the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result;
splicing the third target voice recognition sub-result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, and splicing the third target voice recognition sub-result and the current internal context sequence into a current second time sequence according to the third splicing strategy;
inputting the current first time sequence into a BERT layer in a target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result;
longitudinally splicing the current first vector initial expression result and the current second vector initial expression result to obtain a current splicing result;
and inputting the current splicing result into the fusion model layer in the target BERT model for fusion processing, to obtain a third vector expression result corresponding to the third target speech recognition sub-result.
In this embodiment, the current standard context sequence is obtained from the speech recognition result according to the preset size value of the context window and the selected third target speech recognition sub-result, that is, the speech recognition sub-results with the same number as the size value of the context window are obtained from the speech recognition result in reverse order by using the selected third target speech recognition sub-result as a starting point, and all the speech recognition sub-results with the same speaker as the third target speech recognition sub-result are removed to form the current standard context sequence.
And obtaining a current internal context sequence in the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result, namely obtaining the voice recognition sub-results with the same number as the context window size value in the voice recognition result in a reverse sequence by taking the selected third target voice recognition sub-result as a starting point, and removing all the voice recognition sub-results of the speakers which are different from the third target voice recognition sub-result to form the current internal context sequence.
And inputting the current first time sequence into a BERT layer in a target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result, wherein the output of the position of the last layer [ CLS ] of the BERT is used as the vector expression of the whole time sequence.
Splicing the third target speech recognition sub-result and the current standard context sequence into the current first time sequence according to the preset third splicing strategy specifically comprises the following steps: encoding the characters included in the third target speech recognition sub-result through double-byte coding to obtain a corresponding current first coding result, and splicing the pre-stored first word embedding vector at the tail end of the current first coding result to obtain a current first processing result; encoding the characters included in the current standard context sequence through double-byte coding to obtain a corresponding current second coding result, and splicing the pre-stored second word embedding vector at the tail end of the current second coding result to obtain a current second processing result; adding the [CLS] character before the current first processing result, adding the [SEP] character between the current first processing result and the current second processing result, and adding the [SEP] character after the current second processing result to obtain a current first initial time sequence; and splicing the corresponding position embedding vector at the tail of each character in the current first initial time sequence to obtain the current first time sequence. The process of splicing the third target speech recognition sub-result and the current internal context sequence into the current second time sequence according to the preset third splicing strategy follows the same acquisition process as the current first time sequence.
For example, given a third target speech recognition sub-result and a preset context window size of 5, the current standard context sequence and the current internal context sequence are first obtained. The third target speech recognition sub-result is spliced with the current standard context sequence into the current first time sequence through the preset third splicing strategy, and spliced with the current internal context sequence into the current second time sequence through the preset third splicing strategy. The current first time sequence is input into the BERT layer in the target BERT model for feature extraction to obtain a current first vector initial expression result (denoted v1), and the current second time sequence is input into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result (denoted v2), wherein v1 and v2 both belong to R^(d_f) and d_f is the vector dimension. The current first vector initial expression result v1 and the current second vector initial expression result v2 are two emotional influence vector expressions obtained from the time dimension.
The current first vector initial expression result (denoted v1) and the current second vector initial expression result (denoted v2) are longitudinally spliced to obtain the current splicing result C, which belongs to R^(2×d_f). Finally, when the current splicing result C is input into the fusion model layer in the target BERT model for fusion processing, a tensor operation can be specifically implemented, namely:
r_i = RELU(W_a (W_b ⊙ C) + b)
wherein RELU() represents a linear rectification function, ⊙ represents element-wise multiplication, W_b is a matrix of the same shape as C that assigns a neuron-level weight to each neuron in C (i.e., the weights of all neurons are different), W_a assigns vector-level weights to the two row vectors in C (i.e., the neurons in a row vector are assigned the same weight), and b represents a bias term. The current splicing result C is thus subjected to the tensor operation in the fusion model layer of the target BERT model to obtain the third vector expression result.
And the emotion classification unit 104 is used for calling a pre-trained emotion classification model, inputting the final vector expression result to the emotion classification model for operation, and obtaining a corresponding emotion classification result.
In this embodiment, the final vector expression result obtained by the first model processing unit 1032, the second model processing unit 1033, or the third model processing unit 1034 can be denoted by r_i (for example, the results obtained in the above specific examples are all denoted by r_7). A pre-trained emotion classification model is called, and the final vector expression result is input into the emotion classification model for operation, specifically:
o_i = tanh(W_o r_i), ŷ_i = softmax(W_s o_i)
wherein tanh() is the hyperbolic tangent function, W_o is the first weight value corresponding to r_i, softmax() can be understood as a linear classifier, W_s is the second weight value corresponding to o_i, and ŷ_i is the final predicted emotion classification result.
The device realizes feature extraction with a deeper network structure, can also explicitly distinguish the emotional influence of the speaker, and weights the features at the two granularities of neurons and vectors, so that the feature fusion granularity is finer and the finally obtained emotion recognition result is more accurate.
The speech emotion classification apparatus may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 4, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to perform a speech emotion classification method.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the operation of the computer program 5032 in the storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be enabled to execute the speech emotion classification method.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with aspects of the present invention and is not intended to limit the computing device 500 to which aspects of the present invention may be applied, and that a particular computing device 500 may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The processor 502 is configured to run a computer program 5032 stored in the memory to implement the speech emotion classification method disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 4 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 4, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the Processor 502 may be a Central Processing Unit (CPU), and the Processor 502 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer-readable storage medium may be a nonvolatile computer-readable storage medium or a volatile computer-readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program is executed by a processor to implement the speech emotion classification method disclosed by the embodiment of the invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; the components and steps of the examples have been described above in general terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only a logical division, and there may be other divisions when the actual implementation is performed, or units having the same function may be grouped into one unit, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part thereof that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, it is not limited thereto; various equivalent modifications and substitutions will readily occur to those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A speech emotion classification method is characterized by comprising the following steps:
in response to a voice emotion classification instruction, acquiring voice data to be recognized according to the voice emotion classification instruction and performing voice recognition to obtain a voice recognition result; wherein the voice recognition result comprises a plurality of voice recognition sub-results arranged in ascending time order, each voice recognition sub-result corresponding to one speaker and that speaker's speech content data;
acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
preprocessing the target voice recognition sub-result selected from the voice recognition result according to the character preprocessing strategy to obtain a preprocessing result, and performing feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result; and
calling a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result.
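For illustration only, the four-step pipeline of claim 1 can be sketched in Python. Every function body below is a hypothetical stand-in (toy heuristics, not the patented recognition, BERT, or classification models), used solely to show how the stages chain together:

```python
# Illustrative sketch of the claimed pipeline; all bodies are toy stand-ins.

def speech_recognition(voice_data):
    # Each sub-result pairs a speaker with their utterance, in time order.
    return [("speaker_A", "hello there"), ("speaker_B", "I am upset")]

def preprocess(sub_result, strategy):
    # The character preprocessing strategy depends on the chosen BERT variant;
    # here we only wrap the text in special strings.
    speaker, text = sub_result
    return f"[CLS] {text} [SEP]"

def extract_features(preprocessed):
    # Stand-in for feature extraction by the target BERT model.
    return [float(len(tok)) for tok in preprocessed.split()]

def classify_emotion(vector):
    # Stand-in for the pre-trained emotion classification model.
    return "negative" if sum(vector) > 10 else "neutral"

sub_results = speech_recognition(b"...")   # step 1: voice recognition
target = sub_results[1]                     # selected target sub-result
vector = extract_features(preprocess(target, "flat"))  # steps 2-3
emotion = classify_emotion(vector)          # step 4
print(emotion)
```

The sketch makes the data flow explicit: speaker-attributed sub-results in, one emotion label per selected sub-result out.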
2. The method for classifying speech emotion according to claim 1, wherein the target BERT model is any one of a flat BERT model, a hierarchical BERT model and a spatiotemporal BERT model; if the target BERT model is the flat BERT model, the character preprocessing strategy is a first character preprocessing strategy; if the target BERT model is the hierarchical BERT model, the character preprocessing strategy is a second character preprocessing strategy; and if the target BERT model is the spatiotemporal BERT model, the character preprocessing strategy is a third character preprocessing strategy;
wherein the preprocessing the target voice recognition sub-result selected from the voice recognition result according to the character preprocessing strategy to obtain a preprocessing result, and performing feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result comprises:
acquiring any one BERT model in a pre-trained BERT model set as the target BERT model, wherein the BERT model set comprises at least a flat BERT model, a hierarchical BERT model and a spatiotemporal BERT model;
when the target BERT model is determined to be the flat BERT model, preprocessing a first target voice recognition sub-result selected from the voice recognition result according to a pre-stored first character preprocessing strategy to obtain a first preprocessing result, and performing feature extraction on the first preprocessing result through the target BERT model to obtain the final vector expression result, wherein the first character preprocessing strategy is used to add a mixed context sequence to the first target voice recognition sub-result;
when the target BERT model is determined to be the hierarchical BERT model, preprocessing a second target voice recognition sub-result selected from the voice recognition result according to a pre-stored second character preprocessing strategy to obtain a second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain the final vector expression result, wherein the second character preprocessing strategy is used to acquire the preceding sub-results of the second target voice recognition sub-result and to add an internal context sequence to each voice recognition sub-result among the preceding sub-results and the second target voice recognition sub-result; and
when the target BERT model is determined to be the spatiotemporal BERT model, preprocessing a third target voice recognition sub-result selected from the voice recognition result according to a pre-stored third character preprocessing strategy to obtain a third preprocessing result, and performing feature extraction on the third preprocessing result through the target BERT model to obtain the final vector expression result, wherein the third character preprocessing strategy is used to add a standard context sequence and an internal context sequence, respectively, to the third target voice recognition sub-result.
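The variant-to-strategy mapping of claim 2 amounts to a simple dispatch. A minimal sketch, with illustrative strategy descriptions standing in for the actual preprocessing routines:

```python
# Hypothetical dispatch from BERT variant to character preprocessing strategy.
# The string values are shorthand descriptions, not the real routines.

STRATEGIES = {
    "flat": "add mixed context sequence",
    "hierarchical": "add internal context to target and preceding sub-results",
    "spatiotemporal": "add standard and internal context sequences",
}

def select_strategy(model_name):
    # Claim 2 pairs each BERT variant with exactly one strategy.
    if model_name not in STRATEGIES:
        raise ValueError(f"unknown BERT variant: {model_name}")
    return STRATEGIES[model_name]

print(select_strategy("flat"))
```

A real implementation would map to callables rather than strings, but the one-to-one pairing is the point the claim fixes.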
3. The method of claim 2, wherein, when the target BERT model is determined to be the flat BERT model, the preprocessing the first target voice recognition sub-result selected from the voice recognition result according to the pre-stored first character preprocessing strategy to obtain a first preprocessing result, and performing feature extraction on the first preprocessing result through the target BERT model to obtain the final vector expression result comprises:
acquiring a mixed context sequence from the voice recognition result according to a preset context window size value and the selected first target voice recognition sub-result;
splicing the first target voice recognition sub-result and the mixed context sequence into a first time sequence according to a preset first splicing strategy; and
inputting the first time sequence into the flat BERT model for operation to obtain a corresponding first vector expression result, and taking the first vector expression result as the final vector expression result corresponding to the first target voice recognition sub-result.
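The flat-model path of claim 3 can be sketched as follows. The window semantics (symmetric neighbourhood around the target, any speaker) and the splicing format are assumptions made for illustration, not definitions from the patent:

```python
# Sketch of the flat path: gather a mixed context window around the selected
# sub-result, then splice target and context into one time sequence.

def mixed_context(sub_results, target_idx, window):
    # Neighbouring sub-results (from any speaker) within the window,
    # excluding the target itself — an assumed window semantics.
    lo = max(0, target_idx - window)
    hi = min(len(sub_results), target_idx + window + 1)
    return [s for i, s in enumerate(sub_results[lo:hi], start=lo)
            if i != target_idx]

def splice_flat(target, context):
    # Assumed first splicing strategy: [CLS] target [SEP] context [SEP].
    return "[CLS] " + target + " [SEP] " + " ".join(context) + " [SEP]"

utts = ["good morning", "hi", "this is terrible", "calm down"]
seq = splice_flat(utts[2], mixed_context(utts, 2, 1))
print(seq)
```

The resulting single sequence is what would be fed to the flat BERT model in one pass.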
4. The method for classifying speech emotion according to claim 3, wherein the splicing the first target voice recognition sub-result and the mixed context sequence into a first time sequence according to a preset first splicing strategy comprises:
performing double-byte encoding on the characters included in the first target voice recognition sub-result to obtain a corresponding first encoding result, and splicing a pre-stored first word embedding vector to the end of the first encoding result to obtain a first processing result;
performing double-byte encoding on the characters included in the mixed context sequence to obtain a corresponding second encoding result, and splicing a pre-stored second word embedding vector to the end of the second encoding result to obtain a second processing result;
adding a first preset character string before the first processing result, a second preset character string between the first processing result and the second processing result, and a second preset character string after the second processing result to obtain a first initial time sequence; and
splicing a corresponding position embedding vector to the tail of each character in the first initial time sequence to obtain the first time sequence.
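A structural sketch of claim 4's splicing. Here character-level lists stand in for the double-byte encoding, `<seg_a>`/`<seg_b>` markers stand in for the pre-stored word embedding vectors, `[CLS]`/`[SEP]` stand in for the preset character strings, and an integer index stands in for each position embedding vector; all of these concrete choices are assumptions:

```python
# Toy model of the first splicing strategy; markers and indices stand in
# for the claimed embedding vectors and preset character strings.

def encode(text):
    # Stand-in for the claimed character encoding step.
    return list(text)

def build_first_time_sequence(target, context):
    first = encode(target) + ["<seg_a>"]    # word-embedding marker appended
    second = encode(context) + ["<seg_b>"]
    chars = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # Attach a position index to the tail of each character.
    return [(c, pos) for pos, c in enumerate(chars)]

seq = build_first_time_sequence("ab", "cd")
print(seq)
```

The output makes the claimed layout visible: preset string, encoded target plus its segment marker, separator, encoded context plus its marker, separator, each element position-tagged.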
5. The method of claim 2, wherein, when the target BERT model is determined to be the hierarchical BERT model, the preprocessing the second target voice recognition sub-result selected from the voice recognition result according to the pre-stored second character preprocessing strategy to obtain a second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain the final vector expression result comprises:
according to a preset context window size value, acquiring, in reverse time order with the selected second target voice recognition sub-result in the voice recognition result as a starting point, a number of preceding voice recognition sub-results equal to the context window size value to form a target voice recognition sub-result set; and preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to the pre-stored second character preprocessing strategy to obtain the second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain the final vector expression result;
wherein the preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to the pre-stored second character preprocessing strategy to obtain the second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain the final vector expression result comprises:
acquiring the ith target voice recognition sub-result in the target voice recognition sub-result set, wherein the initial value of i is 1;
acquiring an ith internal context sequence from the voice recognition result according to the preset context window size value and the ith target voice recognition sub-result;
splicing the ith target voice recognition sub-result and the ith internal context sequence into an ith sub-time sequence according to a preset second splicing strategy;
incrementing the value of i by 1, and judging whether the value of i exceeds the context window size value; if the value of i does not exceed the context window size value, returning to the step of acquiring the ith target voice recognition sub-result in the target voice recognition sub-result set;
if the value of i exceeds the context window size value, sequentially acquiring the 1st to (i-1)th sub-time sequences;
splicing the second target voice recognition sub-result and the corresponding target internal context sequence into the ith sub-time sequence according to the second splicing strategy;
respectively inputting the 1st to ith sub-time sequences into a BERT layer in the target BERT model for feature extraction to obtain second vector initial expression results respectively corresponding to the 1st to ith sub-time sequences;
splicing the second vector initial expression results respectively corresponding to the 1st to ith sub-time sequences to obtain a first splicing result; and
inputting the first splicing result into a Transformer layer in the target BERT model for feature extraction to obtain a second vector expression result.
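The two-stage structure of claim 5 — one sub-time sequence per preceding sub-result plus one for the target, each encoded by a shared BERT layer, then fused by a Transformer layer — can be sketched compactly. The explicit i-counter loop is folded into a list comprehension, and both layers are toy stand-ins, not the patented models:

```python
# Sketch of the hierarchical path: per-sub-result encoding, then fusion.
# bert_layer and transformer_layer are toy stand-ins.

def bert_layer(seq):
    # Stand-in BERT layer: one scalar per token (its character length).
    return [float(len(tok)) for tok in seq.split()]

def transformer_layer(vectors):
    # Stand-in Transformer/fusion step: average each initial expression result.
    return [sum(v) / len(v) for v in vectors]

def hierarchical_extract(preceding, target, splice):
    # One sub-time sequence per preceding sub-result, plus the target's own.
    sub_sequences = [splice(s) for s in preceding] + [splice(target)]
    initial = [bert_layer(s) for s in sub_sequences]   # BERT layer, per sequence
    return transformer_layer(initial)                  # fusion across sequences

splice = lambda s: "[CLS] " + s + " [SEP]"
result = hierarchical_extract(["hi there", "all good"], "I am angry", splice)
print(result)
```

The key design point the claim encodes is that cross-sub-result interaction happens only in the upper Transformer layer, after each sub-time sequence has been encoded independently.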
6. The method of claim 2, wherein, when the target BERT model is determined to be the spatiotemporal BERT model, the preprocessing the third target voice recognition sub-result selected from the voice recognition result according to the pre-stored third character preprocessing strategy to obtain a third preprocessing result, and performing feature extraction on the third preprocessing result through the target BERT model to obtain the final vector expression result comprises:
acquiring a current standard context sequence and a current internal context sequence respectively corresponding to the third target voice recognition sub-result in the voice recognition result, splicing the third target voice recognition sub-result with the current standard context sequence and with the current internal context sequence respectively to obtain a current first time sequence and a current second time sequence, and inputting the current first time sequence and the current second time sequence into a BERT layer and a fusion model layer in the target BERT model for feature extraction to obtain a third vector expression result.
7. The method according to claim 6, wherein the acquiring a current standard context sequence and a current internal context sequence respectively corresponding to the third target voice recognition sub-result in the voice recognition result, splicing the third target voice recognition sub-result with the current standard context sequence and with the current internal context sequence respectively to obtain a current first time sequence and a current second time sequence, and inputting the current first time sequence and the current second time sequence into a BERT layer and a fusion model layer in the target BERT model for feature extraction to obtain a third vector expression result comprises:
respectively acquiring a current standard context sequence and a current internal context sequence in the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result;
splicing the third target voice recognition sub-result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, and splicing the third target voice recognition sub-result and the current internal context sequence into a current second time sequence according to the third splicing strategy;
inputting the current first time sequence into a BERT layer in a target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result;
longitudinally splicing the current first vector initial expression result and the current second vector initial expression result to obtain a current splicing result;
and inputting the current splicing result into the fusion model layer in the target BERT model for fusion processing to obtain the third vector expression result corresponding to the third target voice recognition sub-result.
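The spatiotemporal path of claims 6-7 builds two parallel sequences from the same target, encodes both with one BERT layer, stacks the results vertically, and fuses them. In the sketch below the BERT and fusion layers are toy stand-ins (token lengths and element-wise maximum), chosen only so the two-branch structure is visible:

```python
# Sketch of the spatiotemporal path: two context views, one fusion.
# bert_layer and fusion_layer are toy stand-ins, not the patented layers.

def bert_layer(seq):
    return [float(len(tok)) for tok in seq.split()]

def fusion_layer(stacked):
    # Toy fusion: element-wise maximum over the two stacked rows.
    return [max(col) for col in zip(*stacked)]

def spatiotemporal_extract(target, standard_ctx, internal_ctx):
    # Same target spliced once with the standard context, once with the
    # internal (same-speaker) context.
    first = bert_layer("[CLS] " + target + " [SEP] " + standard_ctx)
    second = bert_layer("[CLS] " + target + " [SEP] " + internal_ctx)
    stacked = [first, second]            # vertical (row-wise) splicing
    return fusion_layer(stacked)         # third vector expression result

vec = spatiotemporal_extract("so sad", "why is that", "I was fine")
print(vec)
```

Keeping the two context views separate until the fusion layer is what distinguishes this path from simply concatenating all context into one flat sequence.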
8. A speech emotion classification apparatus, comprising:
a speaker recognition unit, used for, in response to a speech emotion classification instruction, acquiring speech data to be recognized according to the speech emotion classification instruction and performing speech recognition to obtain a speech recognition result, wherein the speech recognition result comprises a plurality of speech recognition sub-results arranged in ascending time order, each speech recognition sub-result corresponding to one speaker and that speaker's speech content data;
a target model selection unit, used for acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
a final vector obtaining unit, used for preprocessing the target speech recognition sub-result selected from the speech recognition result according to the character preprocessing strategy to obtain a preprocessing result, and performing feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result; and
an emotion classification unit, used for calling a pre-trained emotion classification model and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method for speech emotion classification as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method of speech emotion classification according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110850075.9A CN113362858B (en) | 2021-07-27 | 2021-07-27 | Voice emotion classification method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113362858A true CN113362858A (en) | 2021-09-07 |
CN113362858B CN113362858B (en) | 2023-10-31 |
Family
ID=77540332
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110850075.9A Active CN113362858B (en) | 2021-07-27 | 2021-07-27 | Voice emotion classification method, device, equipment and medium |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114565964A (en) * | 2022-03-03 | 2022-05-31 | 网易(杭州)网络有限公司 | Emotion recognition model generation method, recognition method, device, medium and equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933795A (en) * | 2019-03-19 | 2019-06-25 | 上海交通大学 | Based on context-emotion term vector text emotion analysis system |
CN110147452A (en) * | 2019-05-17 | 2019-08-20 | 北京理工大学 | A kind of coarseness sentiment analysis method based on level BERT neural network |
CN110334210A (en) * | 2019-05-30 | 2019-10-15 | 哈尔滨理工大学 | A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN |
CN110609899A (en) * | 2019-08-29 | 2019-12-24 | 成都信息工程大学 | Specific target emotion classification method based on improved BERT model |
CN110647619A (en) * | 2019-08-01 | 2020-01-03 | 中山大学 | Common sense question-answering method based on question generation and convolutional neural network |
CN111581966A (en) * | 2020-04-30 | 2020-08-25 | 华南师范大学 | Context feature fusion aspect level emotion classification method and device |
CN112464657A (en) * | 2020-12-07 | 2021-03-09 | 上海交通大学 | Hybrid text abstract generation method, system, terminal and storage medium |
WO2021139108A1 (en) * | 2020-01-10 | 2021-07-15 | 平安科技(深圳)有限公司 | Intelligent emotion recognition method and apparatus, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||