CN113362858A - Voice emotion classification method, device, equipment and medium - Google Patents

Voice emotion classification method, device, equipment and medium

Info

Publication number
CN113362858A
CN113362858A
Authority
CN
China
Prior art keywords
result
target
bert model
preprocessing
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110850075.9A
Other languages
Chinese (zh)
Other versions
CN113362858B (en)
Inventor
刘广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110850075.9A priority Critical patent/CN113362858B/en
Publication of CN113362858A publication Critical patent/CN113362858A/en
Application granted granted Critical
Publication of CN113362858B publication Critical patent/CN113362858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a speech emotion classification method and apparatus, computer equipment and a storage medium, relating to artificial intelligence technology. The method achieves feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of each speaker, and weights the features at both the neuron and vector granularities, so that feature fusion is finer-grained and the final emotion recognition result is more accurate.

Description

Voice emotion classification method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence voice semantics, in particular to a voice emotion classification method and device, computer equipment and a storage medium.
Background
Emotion recognition is an important branch of the field of artificial intelligence, especially in conversational scenes. During a conversation, a speaker is subject to emotional influence from other speakers, which tends to change the speaker's emotion, and to emotional inertia from the speaker's own earlier utterances, which tends to maintain it. To model these two types of emotional influence, existing methods use "flat" and "hierarchical" model structures based on recurrent neural networks.
However: 1) existing methods are all based on recurrent neural networks and do not make use of powerful pre-trained BERT models; 2) the "flat" model chains the emotional expressions of different speakers into a single time sequence and therefore cannot distinguish the speakers; 3) although the "hierarchical" model chains the emotional expressions of the same speaker into a branch-layer time sequence, the emotional influences of different speakers are still mixed together in the main-layer time sequence and cannot be distinguished.
Disclosure of Invention
The embodiment of the invention provides a speech emotion classification method, a speech emotion classification device, computer equipment and a storage medium, and aims to solve the problem that in the prior art, emotion recognition results of conversations in a multi-person conversation scene are inaccurate based on existing models.
In a first aspect, an embodiment of the present invention provides a speech emotion classification method, which includes:
responding to a voice emotion classification instruction, acquiring voice data to be recognized according to the voice emotion classification instruction, and performing voice recognition to obtain a voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data;
acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
preprocessing the target speech recognition sub-result selected from the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and performing feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result; and
and calling a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result.
In a second aspect, an embodiment of the present invention provides a speech emotion classification apparatus, which includes:
the speaker recognition unit is used for responding to the speech emotion classification instruction, acquiring the speech data to be recognized according to the speech emotion classification instruction and performing speech recognition to obtain a speech recognition result if the speech data to be recognized sent by the user side or other servers is detected; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data;
the target model selection unit is used for acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
a final vector obtaining unit, configured to preprocess the target speech recognition sub-result selected from the speech data to be recognized according to the character preprocessing policy to obtain a preprocessing result, and perform feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result; and
and the emotion classification unit is used for calling a pre-trained emotion classification model, inputting the final vector expression result to the emotion classification model for operation, and obtaining a corresponding emotion classification result.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the speech emotion classification method according to the first aspect when executing the computer program.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the speech emotion classification method according to the first aspect.
The embodiments of the invention provide a speech emotion classification method and apparatus, computer equipment and a storage medium. Speech data to be recognized are obtained and subjected to speech recognition to obtain a speech recognition result; a target speech recognition sub-result selected from the speech data to be recognized is then preprocessed according to a character preprocessing strategy to obtain a preprocessing result; a target BERT model performs feature extraction on the preprocessing result to obtain a final vector expression result; and finally the final vector expression result is input into a pre-trained emotion classification model for operation to obtain the corresponding emotion classification result. The method achieves feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of each speaker, and weights the features at both the neuron and vector granularities, so that feature fusion is finer-grained and the final emotion recognition result is more accurate.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of an application scenario of a speech emotion classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a speech emotion classification method according to an embodiment of the present invention;
FIG. 2a is a model structure diagram of a flat BERT model of a speech emotion classification method according to an embodiment of the present invention;
FIG. 2b is a model structure diagram of a hierarchical BERT model in the speech emotion classification method according to the embodiment of the present invention;
FIG. 2c is a model structure diagram of a spatiotemporal BERT model in a speech emotion classification method according to an embodiment of the present invention;
FIG. 2d is a sub-flow diagram of a speech emotion classification method according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a speech emotion classification apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a speech emotion classification method according to an embodiment of the present invention; fig. 2 is a schematic flowchart of a speech emotion classification method according to an embodiment of the present invention, where the speech emotion classification method is applied to a server and is executed by application software installed in the server.
As shown in fig. 2, the method includes steps S101 to S106.
S101, responding to a speech emotion classification instruction, acquiring speech data to be recognized according to the speech emotion classification instruction, and performing speech recognition to obtain a speech recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data.
In the present embodiment, in order to make the technical solution of the present application clearer, the relevant execution body is first described in detail. The technical solution is described here with the server as the execution body.
The background server receives the voice data to be recognized which are collected and uploaded based on the plurality of user terminals when the plurality of users communicate under the same video conference scene. After the server receives the voice data to be recognized, speaker recognition can be carried out on the voice data to be recognized.
The server is used for storing a speaker recognition model so as to perform speaker recognition on the voice data to be recognized uploaded by the user side to obtain a voice recognition result; and a set of pre-trained BERT models is stored in the server so as to perform emotion recognition based on the speaker context on the voice recognition result.
In a specific implementation, speaker recognition (Speaker Recognition, SR) technology, also called voiceprint recognition (Voiceprint Recognition, VPR) technology, mainly adopts an MFCC (mel-frequency cepstrum coefficient) plus GMM (Gaussian mixture model) framework. Through speaker recognition, the voice data to be recognized can be effectively processed to obtain a corresponding voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results arranged in ascending chronological order, and each voice recognition sub-result corresponds to one speaker and the corresponding utterance content data.
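As an illustrative aside, a minimal sketch of such an MFCC + GMM voiceprint pipeline is given below, assuming the librosa and scikit-learn libraries; the function names, parameters and enrollment workflow are assumptions for illustration only, not the embodiment's actual implementation.

```python
# Hypothetical sketch of MFCC + GMM speaker recognition, assuming librosa and scikit-learn.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=16000, n_mfcc=20):
    """Load an utterance and return its frame-level MFCC features (frames x n_mfcc)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T

def train_speaker_models(enroll_wavs):
    """Fit one GMM per enrolled speaker from that speaker's enrollment utterances."""
    models = {}
    for speaker_id, paths in enroll_wavs.items():
        feats = np.vstack([extract_mfcc(p) for p in paths])
        models[speaker_id] = GaussianMixture(n_components=16, covariance_type="diag").fit(feats)
    return models

def identify_speaker(models, wav_path):
    """Score an utterance against every speaker GMM and return the best-scoring speaker."""
    feats = extract_mfcc(wav_path)
    return max(models, key=lambda s: models[s].score(feats))
```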
For example, the speech recognition result corresponding to a piece of speech data to be recognized is denoted by U, with

U = {u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2}

The speech recognition result U consists of 8 dialogues (speech recognition sub-results) arranged in ascending chronological order. The first dialogue is spoken by speaker 1 (denoted u_1^1, where the subscript 1 indicates chronological position 1, the superscript 1 indicates the speaker number of speaker 1, and u_1^1 as a whole denotes the utterance content of speaker 1); the second dialogue is spoken by speaker 2 (denoted u_2^2, where the subscript 2 indicates chronological position 2 and the superscript 2 indicates the speaker number of speaker 2); the third dialogue is spoken by speaker 1 (denoted u_3^1); the fourth dialogue is spoken by speaker 1 (denoted u_4^1); the fifth dialogue is spoken by speaker 3 (denoted u_5^3); the sixth dialogue is spoken by speaker 2 (denoted u_6^2); the seventh dialogue is spoken by speaker 1 (denoted u_7^1); and the eighth dialogue is spoken by speaker 2 (denoted u_8^2, where the subscript 8 indicates chronological position 8 and the superscript 2 indicates the speaker number of speaker 2). Through speaker recognition technology, the utterance content of each speaker in the multi-person conversation is effectively distinguished.
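For reference, the example speech recognition result U above can be represented as a simple ordered structure of (chronological position, speaker, utterance) records; this representation is only an illustrative assumption, not a format prescribed by the embodiment.

```python
# Illustrative representation of the speech recognition result U from the example above:
# 8 sub-results in ascending chronological order, each tied to a speaker number.
speech_recognition_result = [
    {"position": 1, "speaker": 1, "text": "..."},  # u_1^1
    {"position": 2, "speaker": 2, "text": "..."},  # u_2^2
    {"position": 3, "speaker": 1, "text": "..."},  # u_3^1
    {"position": 4, "speaker": 1, "text": "..."},  # u_4^1
    {"position": 5, "speaker": 3, "text": "..."},  # u_5^3
    {"position": 6, "speaker": 2, "text": "..."},  # u_6^2
    {"position": 7, "speaker": 1, "text": "..."},  # u_7^1
    {"position": 8, "speaker": 2, "text": "..."},  # u_8^2
]
```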
S102, obtaining a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model.
In this embodiment, a set of pre-trained BERT models is stored locally in the server, and the BERT model set at least comprises a flat BERT model, a hierarchical BERT model and a space-time BERT model. When the server selects a BERT model, any one of the flat BERT model, the hierarchical BERT model and the space-time BERT model is selected. Through these three models, effective vector expression results can be extracted from the speech recognition result for subsequent accurate emotion recognition. If the target BERT model is the flat BERT model, the character preprocessing strategy is the first character preprocessing strategy; if the target BERT model is the hierarchical BERT model, the character preprocessing strategy is the second character preprocessing strategy; and if the target BERT model is the space-time BERT model, the character preprocessing strategy is the third character preprocessing strategy.
S103, preprocessing the target speech recognition sub-result selected from the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and performing feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result.
In this embodiment, a target speech recognition sub-result is arbitrarily selected from the speech data to be recognized; for example, the speech recognition sub-result u_7^1 in the speech recognition result U is taken as the target speech recognition sub-result. The target speech recognition sub-result can be preprocessed according to the corresponding character preprocessing strategy to obtain a preprocessing result, and the preprocessing result is finally input into the target BERT model for feature extraction to obtain a final vector expression result. That is, once it is determined which target BERT model is used for feature extraction, the corresponding character preprocessing strategy is applied to the target speech recognition sub-result to obtain the preprocessing result; this increases the information dimension of the target speech recognition sub-result and makes the feature extraction more accurate.
In one embodiment, as shown in fig. 2d, step S103 includes:
S1031, obtaining any one BERT model in a pre-trained BERT model set as a target BERT model; wherein the BERT model set at least comprises a flat BERT model, a hierarchical BERT model and a space-time BERT model;
s1032, when the target BERT model is determined to be a flat BERT model, preprocessing a first target voice recognizer result selected from the voice recognition result according to a first pre-stored character preprocessing strategy to obtain a first preprocessing result, and performing feature extraction on the first preprocessing result through the target BERT model to obtain a final vector expression result; wherein the first character pre-processing strategy is used to add a mixed context sequence in a first target speech recognizer result;
s1033, when the target BERT model is determined to be a hierarchical BERT model, preprocessing a selected second target voice recognition sub-result in the voice recognition result according to a prestored second character preprocessing strategy to obtain a second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain a final vector expression result; the second character preprocessing strategy is used for acquiring a preceding result of a second target voice recognition sub-result and respectively adding an internal context sequence in each voice recognition sub-result in the preceding result and the second target voice recognition sub-result;
s1034, when the target BERT model is determined to be a space-time BERT model, preprocessing a selected third target voice recognizer result in the voice recognition result according to a prestored third character preprocessing strategy to obtain a third preprocessing result, and performing feature extraction on the third preprocessing result through the target BERT model to obtain a final vector expression result; wherein the third character preprocessing strategy is used for adding a standard context sequence and an internal context sequence, respectively, in a third target speech recognizer result.
In this embodiment, the flat BERT model corresponds to a BERT model with a flat structure, and the first target speech recognition sub-result selected from the speech recognition result is processed into an input variable and then directly input into the BERT model for operation, so as to obtain a final vector expression result corresponding to the first target speech recognition sub-result, where a specific model structure diagram of the flat BERT model is shown in fig. 2 a. The final vector expression result is the extraction of the most effective features in the speech recognizer result, and can provide effective input features for subsequent emotion recognition.
In one embodiment, step S1032 includes:
acquiring a mixed context sequence in the voice recognition result according to a preset context window size value and the selected first target voice recognition sub-result;
splicing the first target voice identifier result and the mixed context sequence into a first time sequence according to a preset first splicing strategy;
and inputting the first time sequence into a flat BERT model for operation to obtain a corresponding first vector expression result, and taking the first vector expression result as a final vector expression result corresponding to the first target voice recognizer result.
In this embodiment, in order to make the subsequent technical solutions clearer, the three context sequences involved in processing the speech recognition result U are described in detail below.

Mixed context sequence (i.e., conv-context), denoted by Ψ. For example, Ψ_7^5 denotes the mixed context sequence extracted from the speech recognition result U by taking u_7^1 as the first target speech recognition sub-result and 5 as the preset context window size value K. When the mixed context sequence is extracted, the window simply steps back 5 positions from the first target speech recognition sub-result, giving Ψ_7^5 = {u_6^2, u_5^3, u_4^1, u_3^1, u_2^2}. When the mixed context sequence is obtained, speakers are not distinguished; the sub-results are taken directly in reverse chronological order according to the preset context window size value.

Standard context sequence (i.e., inter-context), denoted by φ. For example, φ_7^5 denotes the standard context sequence extracted from the speech recognition result U by taking u_7^1 as the first target speech recognition sub-result and 5 as the preset context window size value K. An initial sequence is obtained by stepping back 5 positions from the first target speech recognition sub-result, and all speech recognition sub-results spoken by the same speaker as the first target speech recognition sub-result are then removed, giving the standard context sequence φ_7^5 = {u_6^2, u_5^3, u_2^2}.

Internal context sequence (i.e., intra-context). For example, the internal context sequence extracted from the speech recognition result U by taking u_7^1 as the first target speech recognition sub-result and 5 as the preset context window size value K is obtained by stepping back 5 positions from the first target speech recognition sub-result and removing all speech recognition sub-results spoken by speakers different from that of the first target speech recognition sub-result, giving the internal context sequence {u_4^1, u_3^1}.

In other words, the mixed context sequence is obtained from the speech recognition result according to the preset context window size value and the selected first target speech recognition sub-result: taking the selected first target speech recognition sub-result as the starting point, a number of speech recognition sub-results equal to the context window size value are taken from the speech recognition result in reverse chronological order to form the mixed context sequence.
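The three context sequences can be computed directly from an ordered list of sub-results such as the one shown earlier. The sketch below is a minimal, assumed implementation of the window-K extraction rules described above; the function and field names are illustrative.

```python
def mixed_context(results, target_idx, k):
    """conv-context: the k sub-results immediately preceding the target,
    in reverse chronological order, without distinguishing speakers."""
    start = max(0, target_idx - k)
    return list(reversed(results[start:target_idx]))

def standard_context(results, target_idx, k):
    """inter-context: same window, keeping only sub-results whose speaker differs from the target's."""
    target_speaker = results[target_idx]["speaker"]
    return [r for r in mixed_context(results, target_idx, k) if r["speaker"] != target_speaker]

def internal_context(results, target_idx, k):
    """intra-context: same window, keeping only sub-results spoken by the target's own speaker."""
    target_speaker = results[target_idx]["speaker"]
    return [r for r in mixed_context(results, target_idx, k) if r["speaker"] == target_speaker]

# With the example U and K = 5, the target u_7^1 sits at index 6:
# mixed_context    -> [u_6^2, u_5^3, u_4^1, u_3^1, u_2^2]
# standard_context -> [u_6^2, u_5^3, u_2^2]
# internal_context -> [u_4^1, u_3^1]
```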
In an embodiment, the concatenating the first target speech recognizer result and the mixed context sequence into a first time sequence according to a preset first concatenation strategy includes:
obtaining a corresponding first coding result by double-byte coding of characters included in the first target voice recognition sub-result, and splicing a pre-stored first word embedding vector at the tail end of the first coding result to obtain a first processing result;
obtaining a corresponding second coding result by double-byte coding of characters included in the mixed context sequence, and splicing a prestored second word embedding vector at the tail end of the second coding result to obtain a second processing result;
adding a first preset character string before the first processing result, adding a second preset character string between the first processing result and the second processing result, and adding a second preset character string after the second processing result to obtain a first initial time sequence;
and splicing the corresponding position embedding vector at the tail of each character in the first initial time sequence to obtain a first time sequence.
In the present embodiment, for the flat BERT model the goal is to predict the emotion of the i-th speech recognition sub-result, and the input is constructed as:

X_i = ([CLS], u_i, [SEP], Ψ_i^K, [SEP])

where u_i denotes an expression sequence comprising T words, Ψ_i^K denotes the mixed context sequence of u_i (the words of the K preceding speech recognition sub-results), and K is the preset context window size value. After splicing, the input is converted into embeddings and fed into the BERT model to obtain the vector representation r_i = BERT(X_i).

When the target BERT model is determined to be a flat BERT model, the input to the flat BERT model is constructed according to the following key points: 1) the first target speech recognition sub-result u_i (which can also be understood as the selected target expression) and the mixed context sequence Ψ_i^K are spliced into one time sequence; 2) the special character [CLS] (which can be understood as the first preset character string) is added at the head of the time sequence to provide an unambiguous output position; 3) the special character [SEP] (which can be understood as the second preset character string) is used to separate the target expression from the mixed context sequence; 4) all characters are converted into WordPiece embeddings; 5) type-A embeddings (i.e., the pre-stored first word embedding vector) are spliced onto the characters of the first target speech recognition sub-result and type-B embeddings (i.e., the pre-stored second word embedding vector) are spliced onto the characters of the mixed context sequence, to enhance the distinction between the two; 6) a position embedding is spliced onto each character to retain the position information of the time sequence. The first time sequence constructed in this way can model a longer time sequence and mine a deeper network structure.

When the first time sequence is input into the flat BERT model for feature extraction, the output of the last BERT layer at the [CLS] position is taken as the first vector expression result, i.e., the vector expression of the whole time sequence.
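A minimal sketch of this flat-model input construction is shown below, assuming the HuggingFace transformers library, in which encoding a text pair yields exactly the [CLS] target [SEP] context [SEP] layout with type-A/type-B segment embeddings and position embeddings added by the model; the checkpoint name and helper function are assumptions for illustration.

```python
# Hypothetical flat-BERT feature extraction, assuming HuggingFace transformers.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # checkpoint name is an assumption
bert = BertModel.from_pretrained("bert-base-chinese")

def flat_vector_expression(target_text, mixed_context_texts):
    """Encode '[CLS] target [SEP] mixed-context [SEP]' and return the last-layer [CLS] vector."""
    context_text = " ".join(mixed_context_texts)           # splice the window-K context into one segment
    inputs = tokenizer(target_text, context_text,           # text pair -> segment A / segment B type embeddings
                       return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]                  # output at the [CLS] position of the last layer
```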
In this embodiment, when the target BERT model is determined to be a hierarchical BERT model, the hierarchical BERT model corresponds to a BERT model with a multilayer structure, and at least includes a BERT layer and a Transformer layer, the second target speech recognition sub-result selected from the speech recognition results and the speech recognition sub-result obtained by screening are preprocessed and then respectively input into the BERT model of the BERT layer for operation, so that second vector expression results respectively corresponding to the second target speech recognition sub-result and the speech recognition sub-result obtained by screening are obtained, a second vector expression result set formed by corresponding to the second vector expression results is used as a final vector expression result corresponding to the second target speech recognition sub-result, and a model structure diagram of the specific hierarchical BERT model is shown in fig. 2 b. The final vector expression result is the extraction of the most effective features in the speech recognizer result, and can provide effective input features for subsequent emotion recognition.
In one embodiment, step S1033 includes:
and forward acquiring a number of voice recognition sub-results equal to the size value of the context window in the voice recognition results in a reverse sequence by taking the selected second target voice recognition sub-result as a starting point according to a preset size value of the context window to form a target voice recognition sub-result set, preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to a prestored second character preprocessing strategy to obtain a second preprocessing result, and extracting the characteristics of the second preprocessing result through the target BERT model to obtain a final vector expression result.
And sequentially inputting the preprocessing results respectively corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set into a BERT layer and a Transformer layer in a target BERT model for feature extraction, and obtaining a second vector expression result corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set.
When extracting the second vector expression result corresponding to the second target speech recognition sub-result and the target speech recognition sub-result set, the second target speech recognition sub-result and the preprocessing result corresponding to the target speech recognition sub-result set need to be input to a BERT layer and a transform layer of the target BERT model for feature extraction, and the second vector expression result obtained through the extraction of the two layers of models is subjected to weighting of two granularities of neurons and vectors, so that the features of finer granularity are fused, and the features have more 'hierarchical' feeling.
In an embodiment, the preprocessing the second target speech recognition sub-result and the target speech recognition sub-result set according to a pre-stored second character preprocessing policy to obtain a second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain a final vector expression result includes:
acquiring an ith target voice recognition sub-result in the target voice recognition sub-result set; wherein the initial value of i is 1;
acquiring an ith internal context sequence from the voice recognition result according to a preset context window size value and an ith target voice recognition sub-result;
splicing the ith target voice recognition sub-result and the ith internal context sequence into an ith sub-time sequence according to a preset second splicing strategy;
increasing 1 to the i value, and judging whether the i value exceeds the size value of the context window; if the value i does not exceed the size value of the context window, returning to execute the step of obtaining the ith target voice recognition sub-result in the target voice recognition sub-result set;
if the value of i exceeds the size value of the context window, sequentially acquiring a 1 st sub-timing sequence to an i-1 st sub-timing sequence;
splicing the second target voice recognition sub-result and the corresponding target internal context sequence into an ith sub-time sequence according to the second splicing strategy;
respectively inputting the 1 st to ith sub-time sequence sequences into a BERT layer in a target BERT model for feature extraction, and obtaining second vector initial expression results respectively corresponding to the 1 st to ith sub-time sequences;
splicing second vector initial expression results corresponding to the 1 st to the ith sub-time sequences respectively to obtain a first splicing result;
and inputting the first splicing result into a Transformer layer in a target BERT model for feature extraction to obtain a second vector expression result.
In this embodiment, suppose for example that the preset context window size value K is 5, the speech recognition result is U = {u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2}, and the second target speech recognition sub-result is u_7^1. Then the 1st target speech recognition sub-result in the target speech recognition sub-result set is u_2^2, and its corresponding 1st internal context sequence is an empty set; similarly, the 2nd target speech recognition sub-result in the set is u_3^1, and its corresponding 2nd internal context sequence is {u_1^1}; the 3rd target speech recognition sub-result is u_4^1, and its corresponding 3rd internal context sequence is {u_3^1, u_1^1}; the 4th target speech recognition sub-result is u_5^3, and its corresponding 4th internal context sequence is an empty set; and the 5th target speech recognition sub-result is u_6^2, and its corresponding 5th internal context sequence is {u_2^2}.
When the i-th target speech recognition sub-result and the i-th internal context sequence are spliced into the i-th sub-time sequence according to the preset second splicing strategy, the procedure is specifically as follows: the characters included in the i-th target speech recognition sub-result are double-byte encoded to obtain the corresponding i-th group of first sub-coding results, and the pre-stored first-class word embedding vector is spliced at the tail of the i-th group of first sub-coding results to obtain the i-th group of first processing results; the characters included in the i-th internal context sequence are double-byte encoded to obtain the corresponding i-th group of second sub-coding results, and the pre-stored second-class word embedding vector is spliced at the tail of the i-th group of second sub-coding results to obtain the i-th group of second processing results; the [CLS] character is added before the i-th group of first sub-coding results, the [SEP] character is added between the i-th group of first sub-coding results and the i-th group of second processing results, and the [SEP] character is added after the i-th group of second processing results to obtain the i-th group of initial time sequences; and the corresponding position embedding vector is spliced at the tail of each character in the i-th group of initial time sequences to obtain the i-th sub-time sequence. The hierarchical BERT model improves on the hierarchical recurrent neural network: compared with the flat structure, its hierarchical structure can effectively distinguish speakers.
The 1st to the i-th sub-time sequences are obtained in turn and input into the BERT layer in the target BERT model for feature extraction, giving a second vector initial expression result that carries the speaker's own context at each moment. That is, in the example above, the 1st sub-time sequence is input into the BERT layer of the target BERT model for feature extraction to obtain the second vector initial expression result corresponding to u_2^2; the 2nd sub-time sequence is input into the BERT layer to obtain the result corresponding to u_3^1; the 3rd sub-time sequence gives the result corresponding to u_4^1; the 4th sub-time sequence gives the result corresponding to u_5^3; the 5th sub-time sequence gives the result corresponding to u_6^2; and the 6th sub-time sequence gives the result corresponding to the second target speech recognition sub-result u_7^1. After these 6 second vector initial expression results are obtained, they are spliced in ascending order of their subscripts to obtain the first splicing result. Finally, the first splicing result is input into the Transformer layer in the target BERT model for feature extraction (specifically into the encoder part of the Transformer layer, which has 6 encoder layers) to obtain the second vector expression result.

For the second vector expression result, the output of the last encoder layer of the Transformer layer at the position corresponding to u_7^1 is taken as the vector expression used for the final emotion classification.
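A condensed sketch of this hierarchical pipeline is given below, reusing the flat-style per-utterance encoding: each sub-time sequence (target-set element plus its intra-context) is encoded by the BERT layer, the resulting [CLS] vectors are stacked in chronological order, and the stack is passed through a 6-layer Transformer encoder whose output at the target position is used for classification. The Transformer-encoder hyperparameters and helper names are assumptions.

```python
# Hypothetical hierarchical-BERT sketch (PyTorch), built on the flat_vector_expression helper above.
import torch
import torch.nn as nn

d_model = 768  # assumed hidden size of the BERT layer
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=6)  # encoder part, 6 layers

def hierarchical_vector_expression(sub_sequences):
    """sub_sequences: list of (utterance_text, intra_context_texts) pairs for the K preceding
    sub-results plus the target, in ascending chronological order."""
    cls_vectors = [flat_vector_expression(text, intra) for text, intra in sub_sequences]
    stacked = torch.stack(cls_vectors, dim=1)   # shape (1, K+1, d_model): the first splicing result
    encoded = transformer(stacked)              # Transformer-layer feature extraction
    return encoded[:, -1]                       # output at the target's (last) position
```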
In this embodiment, when the target BERT model is determined to be a space-time BERT model, the space-time BERT model corresponds to a BERT model considered comprehensively from both the time and space perspectives. The third target speech recognition sub-result selected from the speech recognition result is processed into two input variables (one obtained by splicing the third target speech recognition sub-result with its corresponding current standard context sequence, the other obtained by splicing it with its corresponding current internal context sequence); these are input into the BERT model for operation, and the two resulting operation results are fused by the fusion model to obtain a third vector expression result corresponding to the third target speech recognition sub-result, which serves as the final vector expression result. A model structure diagram of the space-time BERT model is shown in fig. 2 c. As before, the final vector expression result is the extraction of the most effective features in the speech recognition sub-result and can provide effective input features for subsequent emotion recognition.
In an embodiment, in step S1034, the preprocessing the selected third target speech recognition sub-result in the speech recognition result according to a pre-stored third character preprocessing policy to obtain a third preprocessing result, and performing feature extraction on the third preprocessing result through the target BERT model to obtain a final vector expression result, including:
and acquiring a current standard context sequence and a current internal context sequence which correspond to the third target voice identifier result in the voice identification result respectively, splicing the third target voice identifier result, the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence respectively, and inputting the first time sequence and the current second time sequence into a BERT layer and a fusion model layer in a target BERT model respectively for feature extraction to obtain a third vector expression result.
In this embodiment, when a third vector expression result corresponding to the third target speech recognition sub-result is extracted, a current first time sequence and a current second time sequence obtained by processing the third target speech recognition sub-result and respectively inputting the third target speech recognition sub-result to a BERT layer in a target BERT model are obtained from a time perspective, and then the first time sequence and the current second time sequence are spliced from a space perspective (that is, the current first time sequence and the current second time sequence are input to a fusion model layer in the target BERT model) to obtain a third vector expression result. From a time perspective, the emotional impact of the speaker can be discerned explicitly; from a spatial perspective, the weighting of two granularities, neuron and vector, can be performed for the features, and the feature fusion granularity is finer.
In an embodiment, the obtaining of the current standard context sequence and the current internal context sequence of the third target speech recognition sub-result respectively corresponding to the speech recognition result, respectively splicing the third target speech recognition sub-result with the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence, and respectively inputting the first time sequence and the current second time sequence into a BERT layer and a fusion model layer in a target BERT model to perform feature extraction, so as to obtain a third vector expression result, includes:
respectively acquiring a current standard context sequence and a current internal context sequence in the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result;
splicing the third target voice recognition sub-result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, and splicing the third target voice recognition sub-result and the current internal context sequence into a current second time sequence according to the third splicing strategy;
inputting the current first time sequence into a BERT layer in a target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result;
longitudinally splicing the current first vector initial expression result and the current second vector initial expression result to obtain a current splicing result;
and inputting the current splicing result into a fusion model layer in a target BERT model for fusion processing to obtain a third vector expression result corresponding to the third target voice identifier result.
In this embodiment, the current standard context sequence is obtained from the speech recognition result according to the preset size value of the context window and the selected third target speech recognition sub-result, that is, the speech recognition sub-results with the same number as the size value of the context window are obtained from the speech recognition result in reverse order by using the selected third target speech recognition sub-result as a starting point, and all the speech recognition sub-results with the same speaker as the third target speech recognition sub-result are removed to form the current standard context sequence.
And obtaining a current internal context sequence in the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result, namely obtaining the voice recognition sub-results with the same number as the context window size value in the voice recognition result in a reverse sequence by taking the selected third target voice recognition sub-result as a starting point, and removing all the voice recognition sub-results of the speakers which are different from the third target voice recognition sub-result to form the current internal context sequence.
And inputting the current first time sequence into a BERT layer in a target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result, wherein the output of the position of the last layer [ CLS ] of the BERT is used as the vector expression of the whole time sequence.
And splicing the third target speech recognizer result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, wherein the method specifically comprises the following steps: obtaining a corresponding current first coding result by double-byte coding of characters included in the third target voice recognition result, and splicing a pre-stored first word embedding vector at the tail end of the current first coding result to obtain a current first processing result; obtaining a corresponding current second coding result by double-byte coding of characters included in the current standard context sequence, and splicing a prestored second-class word embedding vector at the tail end of the current second coding result to obtain a current second processing result; adding a [ CLS ] character before the current first processing result, adding a [ SEP ] character between the current first processing result and the current second processing result, and adding a [ SEP ] character after the current second processing result to obtain a current first initial time sequence; and splicing corresponding position embedded vectors at the tail of each character in the first current initial time sequence to obtain a current first time sequence. And the splicing acquisition process of splicing the third target voice identifier result and the current internal context sequence into the current second time sequence according to a preset third splicing strategy is also an acquisition process of referring to the current first time sequence.
For example, the third target speech recognition sub-result is u_7^1 and the preset context window size value is 5. The current standard context sequence is then φ_7^5 = {u_6^2, u_5^3, u_2^2} and the current internal context sequence is {u_4^1, u_3^1}. The third target speech recognition sub-result u_7^1 is spliced with the current standard context sequence φ_7^5 into the current first time sequence through the preset third splicing strategy, and the third target speech recognition sub-result u_7^1 is spliced with the current internal context sequence into the current second time sequence through the preset third splicing strategy. The current first time sequence is input into the BERT layer in the target BERT model for feature extraction to obtain the current first vector initial expression result, and the current second time sequence is input into the BERT layer in the target BERT model for feature extraction to obtain the current second vector initial expression result; both results are vectors of dimension d_f. The current first vector initial expression result and the current second vector initial expression result are the two emotional influence vector expressions obtained from the time dimension.
The current first vector initial expression result and the current second vector initial expression result are longitudinally spliced to obtain the current splicing result, a 2 x d_f matrix whose two row vectors are the two emotional influence vector expressions. Finally, the current splicing result is input into the fusion model layer in the target BERT model for fusion processing, which can be implemented specifically as a tensor operation in which RELU() denotes the linear rectification function; W_b is a weight tensor of the same shape as the current splicing result and assigns a neuron-level weight to each neuron (i.e., every neuron receives a different weight); W_a assigns vector-level weights to the two row vectors (i.e., the neurons within one row vector share the same weight); and b denotes a bias term. The current splicing result is therefore subjected to this tensor operation in the fusion model layer of the target BERT model to obtain the third vector expression result.
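The exact algebraic composition of this fusion is not fully recoverable from the text above. The sketch below is one plausible reading under the stated assumptions: W_b applies an element-wise (neuron-level) weight to the 2 x d_f splicing result, W_a applies vector-level weights to its two row vectors, a bias is added, and RELU is applied; the composition order is an assumption, not the patent's asserted formula.

```python
# Hypothetical fusion-model layer (PyTorch); the composition of W_a, W_b and the bias is an assumption.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, d_f=768):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(2, d_f))   # neuron-level weights: one weight per element
        self.W_a = nn.Parameter(torch.randn(1, 2))     # vector-level weights: one weight per row vector
        self.bias = nn.Parameter(torch.zeros(d_f))     # bias term

    def forward(self, c):
        """c: the current splicing result, shape (2, d_f);
        rows = the inter-context and intra-context [CLS] vectors."""
        weighted = self.W_b * c                           # element-wise neuron-level weighting
        fused = self.W_a @ weighted                       # vector-level weighting over the two rows -> (1, d_f)
        return torch.relu(fused.squeeze(0) + self.bias)   # third vector expression result, shape (d_f,)

# Usage sketch: c = torch.stack([inter_cls_vector, intra_cls_vector]); r_i = FusionLayer()(c)
```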
And S104, calling a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result.
In this embodiment, the final vector expression result obtained in step S1032, step S1033 or step S1034 is denoted by r_i (in the specific examples above it is r_7 in each case). The pre-trained emotion classification model is called, and the final vector expression result is input into the emotion classification model for the following operation:

o_i = tanh(W_o · r_i)

P_i = softmax(W_smax · o_i)

ŷ_i = argmax_k(P_i[k])

where tanh() is the hyperbolic tangent function, W_o is the first weight corresponding to r_i, softmax() can be understood as a linear classifier, W_smax is the second weight corresponding to o_i, and ŷ_i is the final predicted emotion classification result.
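A minimal sketch of this emotion classification head is given below; the weight names (W_o, W_smax) follow the reconstruction above, and the number of emotion classes is an assumption for illustration.

```python
# Hypothetical emotion classification head (PyTorch) implementing o_i = tanh(W_o r_i),
# P_i = softmax(W_smax o_i), y_i = argmax_k P_i[k].
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, d_f=768, num_classes=6):          # number of emotion classes is an assumption
        super().__init__()
        self.W_o = nn.Linear(d_f, d_f)
        self.W_smax = nn.Linear(d_f, num_classes)

    def forward(self, r_i):
        o_i = torch.tanh(self.W_o(r_i))                   # o_i = tanh(W_o r_i)
        p_i = torch.softmax(self.W_smax(o_i), dim=-1)     # class probability distribution P_i
        return torch.argmax(p_i, dim=-1)                  # final predicted emotion class

# Example: classifier = EmotionClassifier(); label = classifier(r_7)
```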
The method thus achieves feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of each speaker, and weights the features at both the neuron and vector granularities, so that feature fusion is finer-grained and the final emotion recognition result is more accurate.
The embodiment of the invention also provides a speech emotion classification device, which is used for executing any embodiment of the speech emotion classification method. Specifically, please refer to fig. 3, fig. 3 is a schematic block diagram of a speech emotion classification apparatus according to an embodiment of the present invention. The speech emotion classification apparatus 100 may be configured in a server.
As shown in fig. 3, the speech emotion classification apparatus 100 includes: a speaker recognition unit 101, a target model selection unit 102, a final vector acquisition unit 103, and an emotion classification unit 104.
The speaker recognition unit 101 is configured to respond to a speech emotion classification instruction, acquire speech data to be recognized according to the speech emotion classification instruction, perform speech recognition, and obtain a speech recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data.
In this embodiment, speaker recognition (Speaker Recognition, SR) technology, also called voiceprint recognition (Voiceprint Recognition, VPR) technology, mainly adopts an MFCC (mel-frequency cepstrum coefficient) plus GMM (Gaussian mixture model) framework. Through speaker recognition, the voice data to be recognized can be effectively processed to obtain a corresponding voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results arranged in ascending chronological order, and each voice recognition sub-result corresponds to one speaker and the corresponding utterance content data.
For example, a speech recognition result corresponding to a piece of speech data to be recognized is denoted U, and U can be written as U = {u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2}.

The speech recognition result U is embodied as 8 dialogues arranged in ascending time order: the first dialogue is spoken by speaker 1 (denoted u_1^1, where the subscript 1 denotes chronological order 1, the superscript 1 denotes the speaker number identification of speaker 1, and u_1^1 as a whole denotes the speech content of speaker 1 for that turn), the second dialogue is spoken by speaker 2 (u_2^2, where the subscript 2 denotes chronological order 2 and the superscript 2 denotes the speaker number identification of speaker 2), the third dialogue is spoken by speaker 1 (u_3^1), the fourth dialogue is spoken by speaker 1 (u_4^1), the fifth dialogue is spoken by speaker 3 (u_5^3), the sixth dialogue is spoken by speaker 2 (u_6^2), the seventh dialogue is spoken by speaker 1 (u_7^1), and the eighth dialogue is spoken by speaker 2 (u_8^2, where the subscript 8 denotes chronological order 8 and the superscript 2 denotes the speaker number identification of speaker 2). Through the speaker recognition technology, the speaking content of each speaker in a multi-person conversation is effectively distinguished.
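For readability, the structure of such a recognition result can be pictured as an ordered list of (chronological order, speaker number, content) triples; the sketch below is purely illustrative and the placeholder texts are not from the patent:

from dataclasses import dataclass

@dataclass
class SubResult:
    order: int    # chronological position (the subscript)
    speaker: int  # speaker number identification (the superscript)
    text: str     # speaking content data for this turn

# U = {u_1^1, u_2^2, u_3^1, u_4^1, u_5^3, u_6^2, u_7^1, u_8^2}
U = [
    SubResult(1, 1, "..."), SubResult(2, 2, "..."),
    SubResult(3, 1, "..."), SubResult(4, 1, "..."),
    SubResult(5, 3, "..."), SubResult(6, 2, "..."),
    SubResult(7, 1, "..."), SubResult(8, 2, "..."),
]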
The target model selecting unit 102 is configured to obtain a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model.
In this embodiment, a pre-trained BERT model set is pre-stored locally in the server, where the BERT model set at least includes a flat BERT model, a hierarchical BERT model and a space-time BERT model. When selecting the BERT model, the server arbitrarily selects one of the flat BERT model, the hierarchical BERT model and the space-time BERT model as the target BERT model. Effective vector expression results can be extracted from the voice recognition result through any of the three models so as to perform subsequent accurate emotion recognition. If the target BERT model is a flat BERT model, the character preprocessing strategy is the first character preprocessing strategy; if the target BERT model is a hierarchical BERT model, the character preprocessing strategy is the second character preprocessing strategy; and if the target BERT model is a space-time BERT model, the character preprocessing strategy is the third character preprocessing strategy.
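The pairing between the selected target BERT model and its character preprocessing strategy amounts to a simple lookup; the following sketch uses hypothetical names (not names used in the patent) to show the dispatch:

import random

PREPROCESS_STRATEGY = {
    # target BERT model -> matching character preprocessing strategy
    "flat": "first_strategy",              # adds a mixed context sequence
    "hierarchical": "second_strategy",     # adds internal context sequences
    "spatiotemporal": "third_strategy",    # adds standard and internal context sequences
}

def select_target_model(model_set=("flat", "hierarchical", "spatiotemporal")):
    # Arbitrarily pick one pre-trained model and return it with its strategy.
    target = random.choice(model_set)
    return target, PREPROCESS_STRATEGY[target]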
And a final vector obtaining unit 103, configured to perform preprocessing on the target speech recognizer result selected in the speech data to be recognized according to the character preprocessing policy to obtain a preprocessing result, and perform feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result.
In this embodiment, a target speech recognition sub-result is arbitrarily selected from the speech data to be recognized; for example, the sub-result u_7^1 in the speech recognition result U is taken as the target speech recognition sub-result. The target speech recognition sub-result can then be preprocessed according to the corresponding character preprocessing strategy to obtain a preprocessing result, and the preprocessing result is finally input into the target BERT model for feature extraction to obtain a final vector expression result. That is, once it is determined which target BERT model is adopted for feature extraction, the corresponding character preprocessing strategy is adopted to preprocess the target speech recognition sub-result into a preprocessing result; this increases the information dimension of the target speech recognition sub-result and makes the feature extraction more accurate.
In an embodiment, as shown in fig. 3, the final vector obtaining unit 103 includes:
a target model obtaining unit 1031, configured to obtain any one of the BERT models in the pre-trained BERT model set as a target BERT model; wherein the BERT model set at least comprises a flat BERT model, a level BERT model and a space-time BERT model;
a first model processing unit 1032, configured to, when the target BERT model is determined to be a flat BERT model, pre-process the first target speech recognition sub-result selected from the speech recognition result according to a pre-stored first character pre-processing policy to obtain a first pre-processing result, and perform feature extraction on the first pre-processing result through the target BERT model to obtain a final vector expression result; wherein the first character pre-processing strategy is used to add a mixed context sequence in a first target speech recognizer result;
a second model processing unit 1033, configured to, when the target BERT model is determined to be a hierarchical BERT model, pre-process the second target speech recognizer result selected from the speech recognition results according to a pre-stored second character pre-processing policy to obtain a second pre-processing result, and perform feature extraction on the second pre-processing result through the target BERT model to obtain a final vector expression result; the second character preprocessing strategy is used for acquiring a preceding result of a second target voice recognition sub-result and respectively adding an internal context sequence in each voice recognition sub-result in the preceding result and the second target voice recognition sub-result;
a third model processing unit 1034, configured to, when it is determined that the target BERT model is a space-time BERT model, pre-process a third target speech recognizer result selected from the speech recognition results according to a pre-stored third character pre-processing policy to obtain a third pre-processing result, and perform feature extraction on the third pre-processing result through the target BERT model to obtain a final vector expression result; wherein the third character preprocessing strategy is used for adding a standard context sequence and an internal context sequence, respectively, in a third target speech recognizer result.
In this embodiment, the flat BERT model corresponds to a BERT model with a flat structure, and the first target speech recognition sub-result selected from the speech recognition result is processed into an input variable and then directly input into the BERT model for operation, so as to obtain a final vector expression result corresponding to the first target speech recognition sub-result, where a specific model structure diagram of the flat BERT model is shown in fig. 2 a. The final vector expression result is the extraction of the most effective features in the speech recognizer result, and can provide effective input features for subsequent emotion recognition.
In an embodiment, the first model processing unit 1032 includes:
a mixed context sequence obtaining unit, configured to obtain a mixed context sequence in the speech recognition result according to a preset context window size value and the selected first target speech recognition sub-result;
a first time sequence obtaining unit, configured to splice the first target speech identifier result and the mixed context sequence into a first time sequence according to a preset first splicing strategy;
and the first operation unit is used for inputting the first time sequence into the flat BERT model for operation to obtain a corresponding first vector expression result, and taking the first vector expression result as a final vector expression result corresponding to the first target voice identifier result.
In this embodiment, in order to more clearly understand the subsequent technical solutions, three context sequences involved in extracting the speech recognition result U will be described in detail below:
The mixed context sequence (i.e., conv-context), denoted Ψ: for example, Ψ extracted from the speech recognition result U with u_7^1 as the first target speech recognition sub-result and 5 as the preset context window size value K. When extracting the mixed context sequence, speakers are not distinguished; starting from the first target speech recognition sub-result, the sequence is pushed backwards by 5 positions in reverse order according to the preset context window size value, giving Ψ = {u_6^2, u_5^3, u_4^1, u_3^1, u_2^2}.

The standard context sequence (i.e., inter-context), denoted φ: for example, φ extracted from the speech recognition result U with u_7^1 as the first target speech recognition sub-result and 5 as the preset context window size value K. The goal is to push backwards by 5 positions from the first target speech recognition sub-result to obtain an initial sequence and then remove all speech recognition sub-results whose speaker is the same as that of the first target speech recognition sub-result, giving φ = {u_6^2, u_5^3, u_2^2}.

The internal context sequence (i.e., intra-context): for example, the internal context sequence extracted from the speech recognition result U with u_7^1 as the first target speech recognition sub-result and 5 as the preset context window size value K. It is obtained by pushing backwards by 5 positions from the first target speech recognition sub-result to obtain an initial sequence and then removing all speech recognition sub-results whose speaker differs from that of the first target speech recognition sub-result, giving the internal context sequence {u_4^1, u_3^1}.
The mixed context sequence is obtained from the speech recognition result according to the preset size value of the context window and the selected first target speech recognition sub-result, namely, the mixed context sequence is formed by obtaining the speech recognition sub-results with the same number as the size value of the context window from the speech recognition result in reverse order by taking the selected first target speech recognition sub-result as a starting point.
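Assuming the SubResult list sketched earlier, the three context sequences can be extracted as below; this is a sketch under the "push backwards in reverse order" reading of the description, not the patent's code:

def mixed_context(U, target_idx, K):
    # conv-context: the K sub-results immediately before the target, speakers not distinguished.
    start = max(0, target_idx - K)
    return list(reversed(U[start:target_idx]))

def standard_context(U, target_idx, K):
    # inter-context: same window, but drop every sub-result spoken by the target's speaker.
    spk = U[target_idx].speaker
    return [u for u in mixed_context(U, target_idx, K) if u.speaker != spk]

def internal_context(U, target_idx, K):
    # intra-context: same window, keep only the target speaker's own sub-results.
    spk = U[target_idx].speaker
    return [u for u in mixed_context(U, target_idx, K) if u.speaker == spk]

# Example with u_7^1 (list index 6) and K = 5:
#   mixed    -> [u_6^2, u_5^3, u_4^1, u_3^1, u_2^2]
#   standard -> [u_6^2, u_5^3, u_2^2]
#   internal -> [u_4^1, u_3^1]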
In an embodiment, the first timing sequence obtaining unit includes:
the first splicing unit is used for splicing the characters included in the first target voice recognition sub-result into a corresponding first coding result through double-byte coding, and splicing a first pre-stored word embedding vector at the tail end of the first coding result to obtain a first processing result;
the second splicing unit is used for carrying out double-byte coding on the characters included in the mixed context sequence to obtain a corresponding second coding result, and splicing a pre-stored second word embedding vector at the tail end of the second coding result to obtain a second processing result;
a third splicing unit, configured to add a first preset character string before the first processing result, add a second preset character string between the first processing result and the second processing result, and add a second preset character string after the second processing result, to obtain a first initial timing sequence;
and the fourth splicing unit is used for splicing the corresponding position embedded vector at the tail of each character in the first initial time sequence to obtain the first time sequence.
In the present embodiment, for the flat BERT model the goal is to predict the emotion of the i-th speech recognition sub-result, and the input is constructed as

X_i = {[CLS], u_i, [SEP], Ψ, [SEP]}

where u_i denotes the target expression sequence comprising T words, Ψ denotes the mixed context sequence containing the words of the K preceding speech recognition sub-results, and K is the preset context window size value. After splicing and conversion into embeddings, the input is fed into the BERT model to obtain the vector representation r_i = BERT(X_i).
When the target BERT model is determined to be a flat BERT model, the construction of its input comprises the following key points: 1) the first target speech recognition sub-result u_7^1 (which can also be understood as the selected target expression) and the mixed context sequence Ψ are spliced into one time sequence; 2) the [CLS] special character is added at the head of the time sequence to mark the output position; 3) the [SEP] special character is used to distinguish the target expression from the mixed context sequence; 4) all characters are converted into WordPiece embeddings; 5) type A embeddings (i.e., the pre-stored first word embedding vector) are spliced to the characters of the first target speech recognition sub-result and type B embeddings (i.e., the pre-stored second word embedding vector) are spliced to the characters of the mixed context sequence to enhance the distinction between the two; 6) a position embedding is spliced for each character to retain the position information of the time sequence. The first time sequence constructed in this way can model a longer time sequence and mine a deeper network structure.
When the first time sequence is input into the flat BERT model for feature extraction, the output at the [CLS] position of the last BERT layer is taken as the obtained first vector expression result, i.e., the vector expression of the whole time sequence.
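A minimal sketch of the flat-model input construction using the Hugging Face transformers library (the library, the bert-base-chinese checkpoint and the plain string join of the context are assumptions; the patent only fixes the [CLS]/[SEP]/segment/position layout): the tokenizer's sentence-pair interface reproduces points 2) to 6), and the last-layer [CLS] vector serves as the first vector expression result.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def flat_vector(target_text, mixed_context_texts):
    # Target expression as segment A, mixed context (joined) as segment B:
    # [CLS] target [SEP] context [SEP], with WordPiece, segment and position embeddings.
    context = "".join(mixed_context_texts)
    inputs = tokenizer(target_text, context, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]  # last-layer [CLS] position: r_i = BERT(X_i)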
In this embodiment, when the target BERT model is determined to be a hierarchical BERT model, the hierarchical BERT model corresponds to a BERT model with a multilayer structure comprising at least a BERT layer and a Transformer layer. On this basis, the selected second target speech recognition sub-result and the selected speech recognition sub-results in the speech recognition result are respectively input into the BERT model of the BERT layer for operation to obtain the corresponding second vector expression results, and the second vector expression result set formed from them is used as the final vector expression result corresponding to the second target speech recognition sub-result; a model structure diagram of the hierarchical BERT model is shown in fig. 2b. The final vector expression result is the extraction of the most effective features in the speech recognition sub-result and can provide effective input features for subsequent emotion recognition.
In an embodiment, the second model processing unit 1033 is further configured to:
and forward acquiring a number of voice recognition sub-results equal to the size value of the context window in the voice recognition results in a reverse sequence by taking the selected second target voice recognition sub-result as a starting point according to a preset size value of the context window to form a target voice recognition sub-result set, preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to a prestored second character preprocessing strategy to obtain a second preprocessing result, and extracting the characteristics of the second preprocessing result through the target BERT model to obtain a final vector expression result.
And sequentially inputting the preprocessing results respectively corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set into a BERT layer and a Transformer layer in a target BERT model for feature extraction, and obtaining a second vector expression result corresponding to the second target voice recognition sub-result and the target voice recognition sub-result set.
When extracting the second vector expression result corresponding to the second target speech recognition sub-result and the target speech recognition sub-result set, the preprocessing results corresponding to the second target speech recognition sub-result and the target speech recognition sub-result set need to be input into the BERT layer and the Transformer layer of the target BERT model for feature extraction. The second vector expression result obtained through these two layers of models is weighted at both the neuron and vector granularities, so finer-grained features are fused and the features have a stronger sense of "hierarchy".
In one embodiment, the second model processing unit 1033 is further configured to:
acquiring an ith target voice recognition sub-result in the target voice recognition sub-result set; wherein the initial value of i is 1;
acquiring an ith internal context sequence from the voice recognition result according to a preset context window size value and an ith target voice recognition sub-result;
splicing the ith target voice recognition sub-result and the ith internal context sequence into an ith sub-time sequence according to a preset second splicing strategy;
increasing the value of i by 1, and judging whether the value of i exceeds the size value of the context window; if the value of i does not exceed the size value of the context window, returning to execute the step of acquiring the ith target voice recognition sub-result in the target voice recognition sub-result set;
if the value of i exceeds the size value of the context window, sequentially acquiring a 1 st sub-timing sequence to an i-1 st sub-timing sequence;
splicing the second target voice recognition sub-result and the corresponding target internal context sequence into an ith sub-time sequence according to the second splicing strategy;
respectively inputting the 1 st to ith sub-time sequence sequences into a BERT layer in a target BERT model for feature extraction, and obtaining second vector initial expression results respectively corresponding to the 1 st to ith sub-time sequences;
splicing second vector initial expression results corresponding to the 1 st to the ith sub-time sequences respectively to obtain a first splicing result;
and inputting the first splicing result into a Transformer layer in a target BERT model for feature extraction to obtain a second vector expression result.
In this embodiment, for example, the preset context window size value K is 5 and the second target speech recognition sub-result selected from the speech recognition result U is u_7^1, so the target speech recognition sub-result set is {u_2^2, u_3^1, u_4^1, u_5^3, u_6^2}. Then the 1st target speech recognition sub-result in said set of target speech recognition sub-results is u_2^2, and its corresponding 1st internal context sequence is an empty set; similarly, the 2nd target speech recognition sub-result in the target speech recognition sub-result set is u_3^1, and its corresponding 2nd internal context sequence is {u_1^1}; the 3rd target speech recognition sub-result in the target speech recognition sub-result set is u_4^1, and its corresponding 3rd internal context sequence is {u_3^1, u_1^1}; the 4th target speech recognition sub-result in the target speech recognition sub-result set is u_5^3, and its corresponding 4th internal context sequence is an empty set; the 5th target speech recognition sub-result in the target speech recognition sub-result set is u_6^2, and its corresponding 5th internal context sequence is {u_2^2}.
When the ith target speech recognition sub-result and the ith internal context sequence are spliced into the ith sub-timing sequence according to the preset second splicing strategy, the steps are specifically as follows: the characters included in the ith target speech recognition sub-result are double-byte coded into the corresponding ith group of first sub-coding results, and the pre-stored first word embedding vector is spliced at the tail end of the ith group of first sub-coding results to obtain the ith group of first processing results; the characters included in the ith internal context sequence are double-byte coded into the corresponding ith group of second sub-coding results, and the pre-stored second word embedding vector is spliced at the tail end of the ith group of second sub-coding results to obtain the ith group of second processing results; the [CLS] character is added before the ith group of first processing results, the [SEP] character is added between the ith group of first processing results and the ith group of second processing results, and the [SEP] character is added after the ith group of second processing results to obtain the ith group of initial timing sequences; and the corresponding position embedding vector is spliced at the tail of each character in the ith group of initial timing sequences to obtain the ith sub-timing sequence. For the hierarchical BERT model, which improves on hierarchical recurrent neural networks, this hierarchical structure can distinguish speakers more effectively than the flat structure.
The 1st to ith sub-timing sequences are sequentially obtained and input into the BERT layer of the target BERT model for feature extraction, yielding for each moment a second vector initial expression result that carries the context of the corresponding speaker. That is, inputting the 1st sub-timing sequence into the BERT layer of the target BERT model yields the initial expression result for u_2^2; inputting the 2nd sub-timing sequence yields the initial expression result for u_3^1; inputting the 3rd sub-timing sequence yields the initial expression result for u_4^1; inputting the 4th sub-timing sequence yields the initial expression result for u_5^3; inputting the 5th sub-timing sequence yields the initial expression result for u_6^2; and inputting the 6th sub-timing sequence (constructed from the second target speech recognition sub-result u_7^1 and its internal context sequence) yields the initial expression result for u_7^1. After these 6 second vector initial expression results are obtained, they are spliced in ascending order of their subscripts to obtain the first splicing result. Finally, the first splicing result is input into the Transformer layer of the target BERT model for feature extraction (specifically, into the encoder part of the Transformer layer, which has 6 layers) to obtain the second vector expression result.

The obtained second vector expression result is the output of the last encoder layer of the Transformer at the position corresponding to u_7^1, and this output is used as the vector expression for the final emotion classification.
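A sketch of the two-stage hierarchical extraction, assuming the per-sub-sequence [CLS] vectors from a BERT layer are already computed (for example with the flat_vector helper above) and using torch.nn.TransformerEncoder with 6 layers to stand in for the encoder part of the Transformer layer; the head count and the batch-first layout are assumptions:

import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    def __init__(self, d_model=768, n_layers=6, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, initial_results):
        # initial_results: (batch, K+1, d_model), the spliced second vector initial
        # expression results in ascending order of their subscripts (target last).
        encoded = self.encoder(initial_results)
        return encoded[:, -1]  # output at the target utterance's (last) position

# Usage sketch: stack the per-sub-sequence vectors, then encode.
# initial = torch.stack([cls_vec for cls_vec in per_subsequence_vectors], dim=1)
# r_i = HierarchicalHead()(initial)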
In this embodiment, when the target BERT model is determined to be a space-time BERT model, the space-time BERT model corresponds to a BERT model that considers both the time and space perspectives. The third target speech recognition sub-result selected from the speech recognition result is processed into two input variables (one obtained by splicing the third target speech recognition sub-result with its corresponding current standard context sequence, the other by splicing it with its corresponding current internal context sequence), the two input variables are input into the BERT model for operation, and the two operation results are fused by the fusion model to obtain a third vector expression result corresponding to the third target speech recognition sub-result as the final vector expression result; a model structure diagram of the space-time BERT model is shown in fig. 2 c. Similarly, the final vector expression result is the extraction of the most effective features in the speech recognition sub-result and can provide effective input features for subsequent emotion recognition.
In an embodiment, the third model processing unit 1034 further includes:
and the second hierarchical extraction unit is used for acquiring a current standard context sequence and a current internal context sequence which correspond to the third target voice identifier result in the voice identification result respectively, splicing the third target voice identifier result, the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence respectively, and inputting the first time sequence and the current second time sequence into a BERT layer and a fusion model layer in a target BERT model respectively for feature extraction to obtain a third vector expression result.
In this embodiment, when the third vector expression result corresponding to the third target speech recognition sub-result is extracted, the current first time sequence and the current second time sequence obtained by processing the third target speech recognition sub-result are first input separately into the BERT layer of the target BERT model (the time perspective), and the two resulting initial expression results are then spliced and input into the fusion model layer of the target BERT model (the space perspective) to obtain the third vector expression result. From the time perspective, the emotional impact of the speaker can be discerned explicitly; from the space perspective, the features can be weighted at both the neuron and vector granularities, so the feature fusion granularity is finer.
In an embodiment, the third model processing unit 1034 is further configured to:
respectively acquiring a current standard context sequence and a current internal context sequence in the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result;
splicing the third target voice recognition sub-result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, and splicing the third target voice recognition sub-result and the current internal context sequence into a current second time sequence according to the third splicing strategy;
inputting the current first time sequence into a BERT layer in a target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result;
longitudinally splicing the current first vector initial expression result and the current second vector initial expression result to obtain a current splicing result;
and inputting the current splicing result into a fusion model layer in a target BERT model for fusion processing to obtain a third vector expression result corresponding to the third target voice identifier result.
In this embodiment, the current standard context sequence is obtained from the speech recognition result according to the preset size value of the context window and the selected third target speech recognition sub-result, that is, the speech recognition sub-results with the same number as the size value of the context window are obtained from the speech recognition result in reverse order by using the selected third target speech recognition sub-result as a starting point, and all the speech recognition sub-results with the same speaker as the third target speech recognition sub-result are removed to form the current standard context sequence.
And obtaining a current internal context sequence in the voice recognition result according to a preset context window size value and the selected third target voice recognition sub-result, namely obtaining the voice recognition sub-results with the same number as the context window size value in the voice recognition result in a reverse sequence by taking the selected third target voice recognition sub-result as a starting point, and removing all the voice recognition sub-results of the speakers which are different from the third target voice recognition sub-result to form the current internal context sequence.
And inputting the current first time sequence into a BERT layer in a target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result, wherein the output of the position of the last layer [ CLS ] of the BERT is used as the vector expression of the whole time sequence.
And splicing the third target speech recognizer result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, wherein the method specifically comprises the following steps: obtaining a corresponding current first coding result by double-byte coding of characters included in the third target voice recognition result, and splicing a pre-stored first word embedding vector at the tail end of the current first coding result to obtain a current first processing result; obtaining a corresponding current second coding result by double-byte coding of characters included in the current standard context sequence, and splicing a prestored second-class word embedding vector at the tail end of the current second coding result to obtain a current second processing result; adding a [ CLS ] character before the current first processing result, adding a [ SEP ] character between the current first processing result and the current second processing result, and adding a [ SEP ] character after the current second processing result to obtain a current first initial time sequence; and splicing corresponding position embedded vectors at the tail of each character in the first current initial time sequence to obtain a current first time sequence. And the splicing acquisition process of splicing the third target voice identifier result and the current internal context sequence into the current second time sequence according to a preset third splicing strategy is also an acquisition process of referring to the current first time sequence.
For example, the third target speech recognition sub-result is u_7^1 and the preset context window size value is 5, so the current standard context sequence is {u_6^2, u_5^3, u_2^2} and the current internal context sequence is {u_4^1, u_3^1}. The third target speech recognition sub-result u_7^1 is spliced with the current standard context sequence into the current first time sequence through the preset third splicing strategy, and u_7^1 is spliced with the current internal context sequence into the current second time sequence through the preset third splicing strategy. The current first time sequence is input into the BERT layer of the target BERT model for feature extraction to obtain the current first vector initial expression result, and the current second time sequence is input into the BERT layer of the target BERT model for feature extraction to obtain the current second vector initial expression result; both initial expression results are d_f-dimensional vectors, where d_f is the vector dimension. The current first vector initial expression result and the current second vector initial expression result are two emotional-impact vector expressions obtained from the time dimension.
The current first vector initial expression result and the current second vector initial expression result are longitudinally spliced to obtain the current splicing result, a 2 × d_f matrix whose two row vectors are the two initial expression results. Finally, when the current splicing result is input into the fusion model layer of the target BERT model for fusion processing, a tensor operation is specifically implemented: the splicing result is weighted and passed through a RELU activation, where RELU() denotes the linear rectification function, W_b assigns a neuron-level weight to each neuron of the current splicing result (i.e., every neuron has a different weight), W_a assigns vector-level weights to the two row vectors of the current splicing result (i.e., the neurons within one row vector share the same weight), and b denotes a bias term. The current splicing result thus undergoes the tensor operation in the fusion model layer of the target BERT model to obtain the third vector expression result.
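One plausible way to realize the fusion tensor operation is sketched below; the exact parameterization is an assumption (the description only fixes that W_b weights every neuron individually, W_a weights the two row vectors, b is a bias, and RELU is applied):

import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    # Fuses the 2 x d_f splicing result (standard-context and internal-context rows) into one d_f vector.
    def __init__(self, d_f=768):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(2, d_f))  # neuron-level weights (one per neuron)
        self.W_a = nn.Parameter(torch.randn(1, 2))    # vector-level weights (one per row vector)
        self.b = nn.Parameter(torch.zeros(d_f))       # bias term

    def forward(self, spliced):
        # spliced: (batch, 2, d_f), the vertical concatenation of the two initial results.
        weighted = spliced * self.W_b             # neuron-level weighting
        mixed = torch.matmul(self.W_a, weighted)  # vector-level weighting -> (batch, 1, d_f)
        return torch.relu(mixed.squeeze(1) + self.b)

# spliced = torch.stack([r_standard, r_internal], dim=1)  # (batch, 2, d_f)
# r_i = FusionLayer()(spliced)                            # third vector expression result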
And the emotion classification unit 104 is used for calling a pre-trained emotion classification model, inputting the final vector expression result to the emotion classification model for operation, and obtaining a corresponding emotion classification result.
In this embodiment, the final vector expression result obtained by the first model processing unit 1032, the second model processing unit 1033, or the third model processing unit 1034 can be denoted r_i (in the specific examples above it is r_7). The pre-trained emotion classification model is called, and the final vector expression result is input into the emotion classification model for the following operation:

o_i = tanh(W_o r_i)

P_i = softmax(W_smax o_i)

ŷ_i = argmax_k(P_i[k])

where tanh() is the hyperbolic tangent function, W_o is the first weight corresponding to r_i, softmax() can be understood as a linear classifier, W_smax is the second weight corresponding to o_i, and ŷ_i is the final predicted emotion classification result.
The device realizes feature extraction with a deeper network structure, can explicitly distinguish the emotional influence of each speaker, and weights the features at both the neuron and vector granularities, so that the feature fusion granularity is finer and the finally obtained emotion recognition result is more accurate.
The speech emotion classification apparatus may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 4, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to perform a speech emotion classification method.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the operation of the computer program 5032 in the storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be enabled to execute the speech emotion classification method.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with aspects of the present invention and is not intended to limit the computing device 500 to which aspects of the present invention may be applied, and that a particular computing device 500 may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The processor 502 is configured to run a computer program 5032 stored in the memory to implement the speech emotion classification method disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 4 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 4, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the Processor 502 may be a Central Processing Unit (CPU), and the Processor 502 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer-readable storage medium may be a nonvolatile computer-readable storage medium or a volatile computer-readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program is executed by a processor to implement the speech emotion classification method disclosed by the embodiment of the invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only a logical division, and there may be other divisions when the actual implementation is performed, or units having the same function may be grouped into one unit, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A speech emotion classification method is characterized by comprising the following steps:
responding to a voice emotion classification instruction, acquiring voice data to be recognized according to the voice emotion classification instruction, and performing voice recognition to obtain a voice recognition result; the voice recognition result comprises a plurality of voice recognition sub-results which are arranged in a time sequence ascending order, and each voice recognition sub-result corresponds to one speaker and corresponding speaking content data;
acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
preprocessing the target voice identifier result selected in the voice data to be recognized according to the character preprocessing strategy to obtain a preprocessing result, and performing feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result; and
and calling a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for operation to obtain a corresponding emotion classification result.
2. The method for classifying speech emotion according to claim 1, wherein the target BERT model is any one of a flat BERT model, a hierarchical BERT model and a spatiotemporal BERT model; if the target BERT model is a flat BERT model, the character preprocessing strategy is a first character preprocessing strategy; if the target BERT model is a hierarchical BERT model, the character preprocessing strategy is a second character preprocessing strategy; if the target BERT model is a space-time BERT model, the character preprocessing strategy is a third character preprocessing strategy;
the pre-processing the target speech recognizer result selected in the speech data to be recognized according to the character pre-processing strategy to obtain a pre-processing result, and performing feature extraction on the pre-processing result through the target BERT model to obtain a final vector expression result, comprising:
acquiring any one BERT model in a pre-trained BERT model set as a target BERT model; wherein the BERT model set at least comprises a flat BERT model, a level BERT model and a space-time BERT model;
when the target BERT model is determined to be a flat BERT model, preprocessing a first target voice identifier result selected from the voice recognition results according to a pre-stored first character preprocessing strategy to obtain a first preprocessing result, and performing feature extraction on the first preprocessing result through the target BERT model to obtain a final vector expression result; wherein the first character pre-processing strategy is used to add a mixed context sequence in a first target speech recognizer result;
when the target BERT model is determined to be a hierarchical BERT model, preprocessing a second target voice identifier result selected from the voice recognition result according to a second pre-stored character preprocessing strategy to obtain a second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain a final vector expression result; the second character preprocessing strategy is used for acquiring a preceding result of a second target voice recognition sub-result and respectively adding an internal context sequence in each voice recognition sub-result in the preceding result and the second target voice recognition sub-result;
when the target BERT model is determined to be a space-time BERT model, preprocessing a third target voice identifier result selected from the voice recognition result according to a prestored third character preprocessing strategy to obtain a third preprocessing result, and performing feature extraction on the third preprocessing result through the target BERT model to obtain a final vector expression result; wherein the third character preprocessing strategy is used for adding a standard context sequence and an internal context sequence, respectively, in a third target speech recognizer result.
3. The method of claim 2, wherein when the target BERT model is determined to be a flat BERT model, the step of preprocessing the selected first target speech recognition sub-result in the speech recognition result according to a pre-stored first character preprocessing strategy to obtain a first preprocessing result, and the step of extracting features of the first preprocessing result through the target BERT model to obtain a final vector expression result comprises:
acquiring a mixed context sequence in the voice recognition result according to a preset context window size value and the selected first target voice recognition sub-result;
splicing the first target voice identifier result and the mixed context sequence into a first time sequence according to a preset first splicing strategy;
and inputting the first time sequence into a flat BERT model for operation to obtain a corresponding first vector expression result, and taking the first vector expression result as a final vector expression result corresponding to the first target voice recognizer result.
4. The method for classifying speech emotion according to claim 3, wherein the concatenating the first target speech recognizer result and the mixed context sequence into a first time sequence according to a preset first concatenation strategy comprises:
obtaining a corresponding first coding result by double-byte coding of characters included in the first target voice recognition sub-result, and splicing a pre-stored first word embedding vector at the tail end of the first coding result to obtain a first processing result;
obtaining a corresponding second coding result by double-byte coding of characters included in the mixed context sequence, and splicing a prestored second word embedding vector at the tail end of the second coding result to obtain a second processing result;
adding a first preset character string before the first processing result, adding a second preset character string between the first processing result and the second processing result, and adding a second preset character string after the second processing result to obtain a first initial time sequence;
and splicing the corresponding position embedding vector at the tail of each character in the first initial time sequence to obtain a first time sequence.
5. The method of claim 2, wherein the step of preprocessing the selected second target speech recognition sub-result in the speech recognition result according to a pre-stored second character preprocessing strategy to obtain a second preprocessing result, and the step of extracting features of the second preprocessing result by using the target BERT model to obtain a final vector expression result comprises:
according to a preset context window size value, forward obtaining a number of voice recognition sub-results equal to the size value of the context window from the selected second target voice recognition sub-result serving as a starting point in the voice recognition result in a reverse sequence to form a target voice recognition sub-result set, preprocessing the second target voice recognition sub-result and the target voice recognition sub-result set according to a prestored second character preprocessing strategy to obtain a second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain a final vector expression result;
the preprocessing the second target speech recognition result and the target speech recognition result set according to a pre-stored second character preprocessing strategy to obtain a second preprocessing result, and performing feature extraction on the second preprocessing result through the target BERT model to obtain a final vector expression result, including:
acquiring an ith target voice recognition sub-result in the target voice recognition sub-result set; wherein the initial value of i is 1;
acquiring an ith internal context sequence from the voice recognition result according to a preset context window size value and an ith target voice recognition sub-result;
splicing the ith target voice recognition sub-result and the ith internal context sequence into an ith sub-time sequence according to a preset second splicing strategy;
increasing 1 to the i value, and judging whether the i value exceeds the size value of the context window; if the value i does not exceed the size value of the context window, returning to execute the step of obtaining the ith target voice recognition sub-result in the target voice recognition sub-result set;
if the value of i exceeds the size value of the context window, sequentially acquiring a 1 st sub-timing sequence to an i-1 st sub-timing sequence;
splicing the second target voice recognition sub-result and the corresponding target internal context sequence into an ith sub-time sequence according to the second splicing strategy;
respectively inputting the 1 st to ith sub-time sequence sequences into a BERT layer in a target BERT model for feature extraction, and obtaining second vector initial expression results respectively corresponding to the 1 st to ith sub-time sequences;
splicing second vector initial expression results corresponding to the 1 st to the ith sub-time sequences respectively to obtain a first splicing result;
and inputting the first splicing result into a Transformer layer in a target BERT model for feature extraction to obtain a second vector expression result.
6. The method of claim 2, wherein the step of preprocessing the selected third target speech recognition sub-result in the speech recognition result according to a pre-stored third character preprocessing strategy to obtain a third preprocessing result, and the step of extracting features of the third preprocessing result by using the target BERT model to obtain a final vector expression result comprises:
and acquiring a current standard context sequence and a current internal context sequence which correspond to the third target voice identifier result in the voice identification result respectively, splicing the third target voice identifier result, the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence respectively, and inputting the first time sequence and the current second time sequence into a BERT layer and a fusion model layer in a target BERT model respectively for feature extraction to obtain a third vector expression result.
7. The method according to claim 6, wherein the obtaining of the current standard context sequence and the current internal context sequence corresponding to the third target speech recognition sub-result in the speech recognition result respectively, the splicing of the third target speech recognition sub-result with the current standard context sequence and the current internal context sequence into a current first time sequence and a current second time sequence, and the inputting of the first time sequence and the current second time sequence into a BERT layer and a fusion model layer in a target BERT model respectively for feature extraction to obtain a third vector expression result comprises:
respectively acquiring a current standard context sequence and a current internal context sequence in the speech recognition result according to a preset context window size value and the selected third target speech recognition sub-result;
splicing the third target speech recognition sub-result and the current standard context sequence into a current first time sequence according to a preset third splicing strategy, and splicing the third target speech recognition sub-result and the current internal context sequence into a current second time sequence according to the third splicing strategy;
inputting the current first time sequence into a BERT layer in the target BERT model for feature extraction to obtain a current first vector initial expression result, and inputting the current second time sequence into the BERT layer in the target BERT model for feature extraction to obtain a current second vector initial expression result;
longitudinally splicing the current first vector initial expression result and the current second vector initial expression result to obtain a current splicing result;
and inputting the current splicing result into a fusion model layer in the target BERT model for fusion processing to obtain a third vector expression result corresponding to the third target speech recognition sub-result.
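Again for illustration only, the dual-context flow of claims 6 and 7 (one sequence spliced with the standard context, one with the internal context, both encoded by the same BERT layer, their vectors longitudinally spliced and passed through a fusion model layer) could be sketched as follows. The [SEP]-based splicing, the [CLS] pooling, and the Flatten-plus-Linear fusion stack are assumptions of the sketch; the claims do not fix these choices.

```python
# Minimal sketch of claims 6-7 (assumptions: [SEP] splicing, [CLS] pooling,
# a Flatten + Linear stack standing in for the claimed "fusion model layer").
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
fusion_model_layer = torch.nn.Sequential(
    torch.nn.Flatten(start_dim=1),          # (1, 2, 768) -> (1, 1536)
    torch.nn.Linear(2 * 768, 768),
    torch.nn.Tanh(),
)

def encode(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return bert(**enc).last_hidden_state[:, 0, :]                 # [CLS] vector, (1, 768)

def third_vector_expression(target_sub_result, standard_context, internal_context):
    # current first / second time sequences: target sub-result spliced with each context sequence
    first_time_sequence = " [SEP] ".join([target_sub_result] + list(standard_context))
    second_time_sequence = " [SEP] ".join([target_sub_result] + list(internal_context))
    v1 = encode(first_time_sequence)        # current first vector initial expression result
    v2 = encode(second_time_sequence)       # current second vector initial expression result
    current_splicing_result = torch.stack([v1, v2], dim=1)            # longitudinal splice, (1, 2, 768)
    return fusion_model_layer(current_splicing_result)                # third vector expression result
```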
8. A speech emotion classification apparatus, comprising:
the speaker recognition unit is used for responding to a speech emotion classification instruction, acquiring speech data to be recognized according to the speech emotion classification instruction and performing speech recognition to obtain a speech recognition result; the speech recognition result comprises a plurality of speech recognition sub-results arranged in ascending time order, and each speech recognition sub-result corresponds to one speaker and the corresponding speaking content data;
the target model selection unit is used for acquiring a pre-trained target BERT model and a character preprocessing strategy corresponding to the target BERT model;
the final vector obtaining unit is used for preprocessing the target speech recognition sub-result selected in the speech recognition result according to the character preprocessing strategy to obtain a preprocessing result, and performing feature extraction on the preprocessing result through the target BERT model to obtain a final vector expression result; and
the emotion classification unit is used for calling a pre-trained emotion classification model, and inputting the final vector expression result into the emotion classification model for computation to obtain a corresponding emotion classification result.
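For orientation, the four units of this apparatus claim can be read as a simple pipeline. The sketch below is an assumption-laden paraphrase in Python: the ASR backend, the preprocessing callable, the feature extractor, and the classifier are placeholders supplied by the caller, not components defined by the patent.

```python
# Illustrative pipeline mirroring the four units of claim 8; every dependency is a
# caller-supplied placeholder (assumptions), not an implementation from the patent.
class SpeechEmotionClassifier:
    def __init__(self, asr, preprocess_strategy, target_bert, emotion_model):
        self.asr = asr                          # speaker recognition unit backend (speech -> sub-results)
        self.preprocess = preprocess_strategy   # character preprocessing strategy
        self.target_bert = target_bert          # pre-trained target BERT model (feature extractor)
        self.emotion_model = emotion_model      # pre-trained emotion classification model

    def classify(self, speech_data):
        # speech recognition: time-ordered sub-results, one per speaker turn
        recognition_result = self.asr(speech_data)
        # final vector obtaining: preprocess the selected sub-result, then extract features
        preprocessing_result = self.preprocess(recognition_result)
        final_vector = self.target_bert(preprocessing_result)
        # emotion classification: map the final vector expression result to a label
        return self.emotion_model(final_vector)
```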
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method for speech emotion classification as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method of speech emotion classification according to any of claims 1 to 7.
CN202110850075.9A 2021-07-27 2021-07-27 Voice emotion classification method, device, equipment and medium Active CN113362858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110850075.9A CN113362858B (en) 2021-07-27 2021-07-27 Voice emotion classification method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113362858A true CN113362858A (en) 2021-09-07
CN113362858B CN113362858B (en) 2023-10-31

Family

ID=77540332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110850075.9A Active CN113362858B (en) 2021-07-27 2021-07-27 Voice emotion classification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113362858B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933795A (en) * 2019-03-19 2019-06-25 上海交通大学 Based on context-emotion term vector text emotion analysis system
CN110147452A (en) * 2019-05-17 2019-08-20 北京理工大学 A kind of coarseness sentiment analysis method based on level BERT neural network
CN110334210A (en) * 2019-05-30 2019-10-15 哈尔滨理工大学 A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN
CN110609899A (en) * 2019-08-29 2019-12-24 成都信息工程大学 Specific target emotion classification method based on improved BERT model
CN110647619A (en) * 2019-08-01 2020-01-03 中山大学 Common sense question-answering method based on question generation and convolutional neural network
CN111581966A (en) * 2020-04-30 2020-08-25 华南师范大学 Context feature fusion aspect level emotion classification method and device
CN112464657A (en) * 2020-12-07 2021-03-09 上海交通大学 Hybrid text abstract generation method, system, terminal and storage medium
WO2021139108A1 (en) * 2020-01-10 2021-07-15 平安科技(深圳)有限公司 Intelligent emotion recognition method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN113362858B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
CN108305641B (en) Method and device for determining emotion information
CN110444223B (en) Speaker separation method and device based on cyclic neural network and acoustic characteristics
CN109785824B (en) Training method and device of voice translation model
CN108305643B (en) Method and device for determining emotion information
Chen et al. A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition.
CN111144127B (en) Text semantic recognition method, text semantic recognition model acquisition method and related device
WO2018194960A1 (en) Multi-stage machine learning and recognition
WO2021128044A1 (en) Multi-turn conversation method and apparatus based on context, and device and storage medium
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
CN115362497A (en) Sequence-to-sequence speech recognition with delay threshold
KR20220130565A (en) Keyword detection method and apparatus thereof
CN111639529A (en) Speech technology detection method and device based on multi-level logic and computer equipment
CN111344717A (en) Interactive behavior prediction method, intelligent device and computer-readable storage medium
CN113948090B (en) Voice detection method, session recording product and computer storage medium
CN113362858A (en) Voice emotion classification method, device, equipment and medium
Amiriparian et al. On the impact of word error rate on acoustic-linguistic speech emotion recognition: An update for the deep learning era
CN113793599A (en) Training method of voice recognition model and voice recognition method and device
CN112837700A (en) Emotional audio generation method and device
CN115104152A (en) Speaker recognition device, speaker recognition method, and program
CN112183106A (en) Semantic understanding method and device based on phoneme association and deep learning
CN111563161A (en) Sentence recognition method, sentence recognition device and intelligent equipment
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
JP7291099B2 (en) Speech recognition method and device
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant