WO2021174757A1 - Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium - Google Patents

Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium

Info

Publication number
WO2021174757A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
emotion
audio
voice
recognition model
Prior art date
Application number
PCT/CN2020/105543
Other languages
French (fr)
Chinese (zh)
Inventor
王德勋
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021174757A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device, electronic device, and computer-readable storage medium for voice emotion recognition.
  • Emotional computing is an important technology that gives intelligent machines the ability to perceive, understand, and express various emotional states.
  • As an important carrier of emotional information, voice technology has received increasing attention.
  • Although current voice emotion detection achieves good results, the inventor realized that, limited by problems such as data set quality and the subjectivity of emotion annotation, most models can only judge a single emotion, the number of emotion categories that can be judged is small, and the hidden emotions in complex speech cannot be accurately described.
  • For the multiple emotions that may be contained in a segment of speech, the boundaries are also difficult to determine; these problems greatly limit the promotion and development of speech emotion recognition technology.
  • In order to solve the above technical problems, an object of the present application is to provide a voice emotion recognition method, apparatus, electronic device, and computer-readable storage medium.
  • In a first aspect, a voice emotion recognition method includes: when a user voice is received, extracting multiple types of audio features of the user voice; respectively matching the audio features with feature samples in an emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; inputting the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; and obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • In a second aspect, a voice emotion recognition apparatus includes: an extraction module for extracting multiple types of audio features of a user voice when the user voice is received; a matching module for respectively matching the audio features with feature samples in an emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; a construction module for constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; a prediction module for inputting the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; and a determination module for obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • In a third aspect, an electronic device includes: a processor; and a memory for storing computer program instructions for the processor; wherein the processor is configured to perform the above method by executing the computer program instructions.
  • In a fourth aspect, a computer-readable storage medium has computer program instructions stored thereon; when the computer program instructions are executed by a processor, the above method is implemented.
  • First, when the user voice is received, multiple types of audio features of the user voice are extracted; the multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different perspectives, that is, they can characterize the user's emotions from different angles. Then, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions expressed by each feature can be obtained, which in turn guides the identification of the user's various hidden emotions in the subsequent steps.
  • The feature label matrix is then input into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and the corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix.
  • Finally, the scene label matched by the voice scene of the user voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user; in this way, the speech emotion recognition result is obtained according to the matching of the real scene of the voice. In this manner, various potential emotions can be recognized efficiently and accurately from speech.
  • Fig. 1 schematically shows a flow chart of a method for speech emotion recognition.
  • Fig. 2 schematically shows an example diagram of an application scenario of a voice emotion recognition method.
  • Fig. 3 schematically shows a flow chart of a feature extraction method.
  • Fig. 4 schematically shows a block diagram of a voice emotion recognition device.
  • Fig. 5 schematically shows an example block diagram of an electronic device for implementing the above-mentioned voice emotion recognition method.
  • Fig. 6 schematically shows a computer-readable storage medium for implementing the aforementioned voice emotion recognition method.
  • This exemplary embodiment first provides a voice emotion recognition method.
  • The voice emotion recognition method can be run on a server, a server cluster, a cloud server, or the like.
  • Referring to Fig. 1, the voice emotion recognition method may include the following steps:
  • Step S110, when a user voice is received, extract multiple types of audio features of the user voice;
  • Step S120, respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
  • Step S130, based on the audio features and the emotion labels corresponding to the matched feature samples, construct a feature label matrix of the user voice;
  • Step S140, input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
  • Step S150, obtain the scene label matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • In the above voice emotion recognition method, first, when a user voice is received, multiple types of audio features of the user voice are extracted; the multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different perspectives, that is, they can characterize the user's emotions from different angles.
  • Then, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions expressed by each feature can be obtained, which in turn guides the identification of the user's various hidden emotions in the subsequent steps.
  • Then, the feature label matrix is input into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and the corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix.
  • Finally, the scene label matched by the voice scene of the user voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user; in this way, the speech emotion recognition result is obtained according to the matching of the real scene of the voice. In this manner, various potential emotions can be recognized efficiently and accurately from speech.
  • In step S110, when a user voice is received, multiple types of audio features of the user voice are extracted.
  • In this example embodiment, referring to Fig. 2, the server 201 receives the user voice sent by the server 202, and the server 201 can then extract multiple types of audio features of the user voice and perform emotion recognition in the subsequent steps.
  • The server 201 can be any terminal capable of executing program instructions and storing data, such as a cloud server, a mobile phone, or a computer; the server 202 can be any terminal with a storage function, such as a mobile phone or a computer.
  • The audio features can be various audio features such as a zero-crossing rate feature, a short-term energy feature, a short-term average amplitude difference feature, a pronunciation frame number feature, a pitch frequency feature, a formant feature, a harmonic-to-noise ratio feature, and a Mel cepstrum coefficient feature. These features can be extracted from a piece of audio using existing audio feature extraction methods.
  • The extracted multiple types of audio features of the user's voice can reflect the change characteristics of the user's voice from different angles, that is, they can characterize the user's emotions from different angles. For example, short-term energy reflects the strength of the signal at different moments and can therefore reflect how the stability of the user's emotion changes over a segment of speech. Audio has periodic characteristics, and under stationary noise the short-term average amplitude difference makes this periodicity easier to observe, so it can reflect the periodicity of the user's emotion within a segment of speech. A formant is the resonance characteristic produced when quasi-periodic pulses at the glottis excite the vocal tract, producing a set of resonance frequencies called formant frequencies, or formants for short. Formant parameters, which include the formant frequency and the bandwidth, are important parameters for distinguishing different finals and can characterize the user's emotions from a linguistic perspective.
  • In this way, by extracting multiple types of audio features of the user's voice, the user's emotions can be analyzed based on these features in the subsequent steps.
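  • For illustration only (not part of the patent text), a minimal Python sketch of extracting several of these feature types with the open-source librosa library might look as follows; the sample rate, frame parameters, and the simplified magnitude-difference computation are assumptions, and formant and harmonic-to-noise features would require an additional tool such as Praat:

```python
import numpy as np
import librosa

def extract_features(path, frame_length=1024, hop_length=512):
    # Load the user voice (16 kHz mono is an assumption, not specified by the patent)
    y, sr = librosa.load(path, sr=16000)

    # Zero-crossing rate and short-term (RMS) energy, one value per frame
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)[0]
    energy = librosa.feature.rms(y=y, frame_length=frame_length,
                                 hop_length=hop_length)[0]

    # Simplified per-frame average magnitude difference (a stand-in for AMDF)
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    amdf = np.mean(np.abs(np.diff(frames, axis=0)), axis=0)

    # Pitch (fundamental frequency) per frame via the YIN estimator
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                     frame_length=frame_length, hop_length=hop_length)

    # Mel cepstrum coefficients, averaged over frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    return {"zcr": zcr, "energy": energy, "amdf": amdf, "f0": f0, "mfcc": mfcc}
```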
  • In one implementation of this example, referring to Fig. 3, when a user voice is received, extracting the multiple types of audio features of the user voice includes:
  • Step S310, when the user voice is received, converting the user voice into text;
  • Step S320, matching the text with the text samples in a feature extraction category database to obtain a text sample matching the text;
  • Step S330, extracting, from the user voice, audio features of multiple feature categories associated with the text sample.
  • When the user's voice is received, it is converted into text, so the actual content expressed by the user can be obtained. The converted text is then matched against the text samples in the feature extraction category database to obtain the text sample that matches the converted text.
  • The feature extraction category database stores, for texts with different semantics, the feature categories of the audio features that most clearly reflect emotion when that text is spoken. By then extracting, from the user's voice, the audio features of the feature categories associated with the matched text sample, emotion recognition can be performed efficiently and accurately in the subsequent steps.
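  • As a purely illustrative sketch of steps S310 to S330 (the speech-to-text function and the contents of the feature extraction category database are hypothetical placeholders, not defined by the patent), the text-driven selection of feature categories could be organized as follows:

```python
from difflib import SequenceMatcher

# Hypothetical database: text sample -> feature categories that best expose emotion
feature_category_db = {
    "I want a refund right now": ["short_term_energy", "pitch", "zero_crossing_rate"],
    "thank you so much for your help": ["mfcc", "formant", "pitch"],
}

def best_matching_sample(text, db):
    # Pick the stored text sample most similar to the converted text
    return max(db, key=lambda sample: SequenceMatcher(None, text, sample).ratio())

def select_feature_categories(audio_path, speech_to_text, db=feature_category_db):
    text = speech_to_text(audio_path)        # step S310: convert speech to text
    sample = best_matching_sample(text, db)  # step S320: match against the database
    return db[sample]                        # step S330: categories to extract
```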
  • In one embodiment, the multiple types of audio features include at least three of: the zero-crossing rate feature, the short-term energy feature, the short-term average amplitude difference feature, the pronunciation frame number feature, the pitch frequency feature, the formant feature, the harmonic-to-noise ratio feature, and the Mel cepstrum coefficient feature.
  • Because the multiple types of audio features include at least three of these features, multi-emotion recognition can be achieved with relatively high accuracy.
  • In step S120, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature.
  • In this example embodiment, the emotion feature library stores feature samples of audio features of various categories, and each feature sample is associated with an emotion label of one category.
  • When the audio features are matched with the feature samples in the emotion feature library, the similarity between an audio feature and a feature sample can be computed using, for example, the Euclidean distance or the Hamming distance, and the emotion labels corresponding to the multiple feature samples that match each audio feature (for example, feature samples with a similarity greater than 50%) are then obtained. In this way, the multiple candidate emotions expressed by each feature can be obtained, which guides the identification of the user's various hidden emotions in the subsequent steps.
  • In one implementation of this example, respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples that match each audio feature includes: respectively comparing the audio features with the feature samples in the emotion feature library to obtain a plurality of feature samples whose similarity to each audio feature exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of audio features; and obtaining, from the emotion feature library, the emotion label corresponding to each of the feature samples.
  • The predetermined threshold can be set according to accuracy requirements.
  • The predetermined threshold corresponds to the number of audio features; that is, its value is determined by the number of audio features, and the larger the number of audio features, the smaller the predetermined threshold can be. By respectively comparing the audio features with the feature samples in the emotion feature library to obtain, for each audio feature, the multiple feature samples whose similarity exceeds the predetermined threshold, and then obtaining the emotion label corresponding to each feature sample from the emotion feature library, the reliability of the emotion recognition of each audio feature can be ensured.
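  • A minimal sketch of this matching step is shown below; the library layout, the Euclidean-distance-based similarity score, and the exact rule tying the threshold to the number of features are assumptions made for illustration:

```python
import numpy as np

def similarity(a, b):
    # Map Euclidean distance to a similarity score in (0, 1]
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(a) - np.asarray(b)))

def match_emotion_labels(audio_features, library, base_threshold=0.5):
    # One possible rule: the more feature types compared, the lower the threshold
    threshold = base_threshold / (1.0 + np.log1p(len(audio_features)))
    matches = {}
    for name, vec in audio_features.items():
        matches[name] = [
            (entry["label"], similarity(vec, entry["vector"]))
            for entry in library.get(name, [])   # feature samples of this category
            if similarity(vec, entry["vector"]) > threshold
        ]
    return matches
```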
  • In step S130, a feature label matrix of the user voice is constructed based on the audio features and the emotion labels corresponding to the matched feature samples.
  • In this example embodiment, the feature label matrix stores the audio features of the user's voice and the corresponding emotion labels, that is, emotion labels reflecting the possible emotions expressed by those audio features. The different types of audio features and the emotion labels of different likelihoods embodied by the feature samples of the corresponding similarities are structurally linked through the feature label matrix, with the emotion labels forming constraints on the combination of audio features; the matrix can therefore reflect possible patterns of latent emotional change.
  • In one implementation of this example, constructing the feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples includes:
  • adding the emotion labels corresponding to each audio feature to the column corresponding to that audio feature, in descending order of the similarity between each feature sample and the audio feature, to obtain the feature label matrix, where each row of the matrix corresponds to a similarity range.
  • Each audio feature is first added to the first row of an empty matrix, so that each column corresponds to one audio feature.
  • The emotion labels corresponding to each audio feature are then added to the column of that audio feature in descending order of the similarity between each feature sample and the audio feature to obtain the feature label matrix. For example, if the similarity between audio feature A and feature sample A1 is 63%, the emotion label corresponding to feature sample A1 can be added to the row for the 60%-70% interval in the column where audio feature A is located.
  • Each row of the matrix corresponds to a similarity range, for example, the row for the 60%-70% similarity range.
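  • The following sketch shows one possible way to assemble such a feature label matrix from the matches obtained in step S120; the 10% similarity bins are an assumption derived from the 60%-70% example above:

```python
def build_feature_label_matrix(matches,
                               bins=((0.9, 1.0), (0.8, 0.9), (0.7, 0.8),
                                     (0.6, 0.7), (0.5, 0.6))):
    # matches: {feature_name: [(emotion_label, similarity), ...]} from step S120
    features = list(matches)
    matrix = [features]                 # first row: one column per audio feature
    for low, high in bins:              # one row per descending similarity range
        row = []
        for name in features:
            row.append([label for label, sim in matches[name] if low < sim <= high])
        matrix.append(row)
    return matrix
```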
  • In step S140, the feature label matrix is input into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set.
  • In this example embodiment, the multi-emotion recognition model is a pre-trained machine learning model that can recognize multiple emotions at once.
  • The feature label matrix is input into the multi-emotion recognition model; based on the structured label matrix and the constraints formed by the multiple types of audio features, the machine learning model can readily compute the possible emotions of the user's voice, obtain multiple emotion combinations, and predict multiple emotion sets of the user's voice together with a scene label for the possible scene of each emotion set (such as a shopping scene or a chat scene). In this way, multiple possible scenes and the corresponding multiple emotions can be analyzed efficiently and accurately based on the feature label matrix through the multi-emotion recognition model.
  • In one implementation of this example, the method for constructing the multi-emotion recognition model includes:
  • using a multi-layer fully connected layer as the classifier of a pre-training model to obtain a recognition model, and training the recognition model with a labeled speech emotion data set to obtain the multi-emotion recognition model.
  • The restnet34 model is first trained using the AISHELL Chinese voiceprint database.
  • The first n layers of the trained network are then taken out as the pre-training model.
  • A multi-layer fully connected layer is used as the classifier.
  • Finally, the labeled speech emotion data set is used to train the model to obtain the final model.
  • During training, the ratio of positive and negative samples can be calculated in each training batch and used as the weighting matrix of the loss function, so that the model pays more attention to small-sample data, improving its accuracy.
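  • A hedged PyTorch sketch of this construction is given below; the use of torchvision's resnet34 as a stand-in for the patent's restnet34 voiceprint model, the layer split, the classifier sizes, and the encoding of the feature label matrix as a 3-channel image-like tensor are all assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

def build_recognition_model(num_emotions, n_pretrained_children=6):
    # Backbone assumed to carry AISHELL-trained voiceprint weights (loading not shown)
    backbone = resnet34(weights=None)
    pretrain = nn.Sequential(*list(backbone.children())[:n_pretrained_children])
    classifier = nn.Sequential(          # multi-layer fully connected classifier
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, 256), nn.ReLU(),  # 128 channels after layer2 of resnet34
        nn.Linear(256, num_emotions),
    )
    return nn.Sequential(pretrain, classifier)

def weighted_bce_loss(logits, targets):
    # Per-batch positive/negative ratio used as the loss weighting so that
    # under-represented (small-sample) labels receive more attention
    pos = targets.sum(dim=0).clamp(min=1.0)
    neg = (targets.shape[0] - pos).clamp(min=1.0)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=neg / pos)
```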
  • The first multi-emotion recognition model and the second multi-emotion recognition model are initialized at the same time; the first multi-emotion recognition model is trained using labeled and unlabeled raw data to obtain a first prediction value, and the classification error loss value of the labeled data part is obtained.
  • The first multi-emotion recognition model is updated using the sum of the classification error loss value and the consistency loss value.
  • The original model can be improved by means of the semi-supervised learning method Mean-Teacher, so that a large amount of unlabeled data can be reused.
  • The moving average can make the model more robust on the test data.
  • The noise-added data is input into Model_teacher to obtain the predicted value P_teacher; the error between P_teacher and P_student is calculated as the consistency loss value loss_consistency, and the first multi-emotion recognition model Model_student is updated with the loss value loss_classification + loss_consistency.
  • In this way, transfer learning and semi-supervised learning techniques can be used to effectively improve the classification performance of the model on small data sets, and also to alleviate the model overfitting problem to a certain extent.
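  • A hedged sketch of the Mean-Teacher style update described above is shown below; the Gaussian noise model, the loss weighting, and the EMA decay are assumptions, and the teacher is taken to be a structurally identical copy of the student whose weights track a moving average of the student:

```python
import torch
import torch.nn.functional as F

def mean_teacher_step(student, teacher, optimizer, x, labels, labeled_mask,
                      ema_decay=0.99, noise_std=0.1):
    p_student = student(x)                                         # predictions on raw data
    with torch.no_grad():
        p_teacher = teacher(x + noise_std * torch.randn_like(x))   # noise-added data

    # Classification error loss on the labeled part of the batch only
    loss_classification = F.binary_cross_entropy_with_logits(
        p_student[labeled_mask], labels[labeled_mask])
    # Consistency loss between student and teacher predictions on all data
    loss_consistency = F.mse_loss(torch.sigmoid(p_student), torch.sigmoid(p_teacher))

    loss = loss_classification + loss_consistency                  # summed loss updates the student
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher weights follow an exponential moving average of the student weights
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(ema_decay).add_(s_p, alpha=1 - ema_decay)
    return float(loss)
```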
  • In this way, the scheme can not only accurately detect the emotions explicitly expressed in the voice, but also accurately identify a variety of potential emotions, improving and expanding voice emotion recognition technology.
  • In step S150, the scene label matched by the voice scene of the user's voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • The scene of the user's voice can be determined by pre-calibration or by locating the voice source (such as a customer service voice).
  • The emotion set corresponding to the scene label matched by the scene of the user's voice is determined as the recognized speech emotion of the user, which ensures the accuracy of the recognition boundary and can further ensure the accuracy of the emotion recognition of the user's voice.
  • In this way, the speech emotion recognition result of the voice is obtained according to the matching of the real scene of the voice.
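  • Step S150 then reduces to selecting the emotion set whose scene label matches the calibrated scene of the voice; a trivial sketch follows (the scene and emotion names are illustrative only, not from the patent):

```python
def recognize_emotions(emotion_sets_by_scene, voice_scene_tag):
    # emotion_sets_by_scene: {scene_label: emotion_set} predicted by the model in step S140
    return emotion_sets_by_scene.get(voice_scene_tag, set())

# Example: a customer-service call pre-calibrated as a "shopping" scene
result = recognize_emotions({"shopping": {"impatience", "anger"},
                             "chat": {"calm"}}, "shopping")
```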
  • Further, the present application also provides a voice emotion recognition apparatus.
  • Referring to Fig. 4, the voice emotion recognition apparatus may include an extraction module 410, a matching module 420, a construction module 430, a prediction module 440, and a determination module 450, in which:
  • the extraction module 410 may be used to extract multiple types of audio feature vectors of the user voice when the user voice is received;
  • the matching module 420 may be configured to respectively match the audio feature vector with feature vector samples in the emotion feature library to obtain an emotion label corresponding to each feature vector sample that matches the audio feature vector;
  • the construction module 430 may be configured to construct a vector label matrix of the user voice based on the audio feature vector and the corresponding emotion label of the matched feature vector sample;
  • the prediction module 440 may be used to input the vector label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set;
  • the determining module 450 may be used to obtain a scene tag matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  • Although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
  • The example embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which can be a personal computer, a server, a mobile terminal, a network device, etc.) to execute the method according to the embodiments of the present application.
  • an electronic device capable of implementing the above method is also provided.
  • the electronic device 500 according to this embodiment of the present invention will be described below with reference to FIG. 5.
  • the electronic device 500 shown in FIG. 5 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
  • the electronic device 500 is represented in the form of a general-purpose computing device.
  • the components of the electronic device 500 may include, but are not limited to: the aforementioned at least one processing unit 510, the aforementioned at least one storage unit 520, and a bus 530 connecting different system components (including the storage unit 520 and the processing unit 510).
  • The storage unit stores program code, and the program code can be executed by the processing unit 510, so that the processing unit 510 executes the steps according to the various exemplary embodiments described in the "Exemplary Method" section of this specification.
  • For example, the processing unit 510 may perform the steps shown in Fig. 1:
  • Step S110, when a user voice is received, extract multiple types of audio features of the user voice;
  • Step S120, respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
  • Step S130, based on the audio features and the emotion labels corresponding to the matched feature samples, construct a feature label matrix of the user voice;
  • Step S140, input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
  • Step S150, obtain the scene label matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • the storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 5201 and/or a cache storage unit 5202, and may further include a read-only storage unit (ROM) 5203.
  • The storage unit 520 may also include a program/utility tool 5204 having a set (at least one) of program modules 5205.
  • Such program modules 5205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment.
  • The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • The electronic device 500 can also communicate with one or more external devices 700 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (such as a router, modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 550, and a display unit 540 may also be connected to the input/output (I/O) interface 550.
  • the electronic device 500 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 560.
  • the network adapter 560 communicates with other modules of the electronic device 500 through the bus 530.
  • Other hardware and/or software modules can be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
  • The example embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which can be a personal computer, a server, a terminal device, a network device, etc.) to execute the method according to the embodiments of the present application.
  • a computer-readable storage medium is also provided.
  • The computer-readable storage medium may be non-volatile or volatile, and has stored thereon a program product capable of implementing the above-mentioned methods of this specification.
  • In some possible implementations, various aspects of the present invention may also be implemented in the form of a program product, which includes program code.
  • When the program product runs on a terminal device, the program code is used to cause the terminal device to execute the steps according to the various exemplary embodiments of the present invention described in the above "Exemplary Method" section of this specification.
  • Referring to Fig. 6, a program product 600 for implementing the above method according to an embodiment of the present invention is described; it may adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device, for example a personal computer.
  • the program product of the present invention is not limited to this.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or combined with an instruction execution system, device, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, in which readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • the program code used to perform the operations of the present invention can be written in any combination of one or more programming languages.
  • The programming languages include object-oriented programming languages, such as Java, C++, and the like, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • The program code can be executed entirely on the client computing device, partly on the client device, as an independent software package, partly on the client computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • The remote computing device can be connected to the client computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present application relates to the technical field of artificial intelligence and relates to a method and apparatus for recognizing the emotion in a voice, an electronic device and a computer-readable storage medium. The method comprises: when a user voice is received, extracting multiple types of audio features of the user voice; matching the audio features with feature samples in an emotion feature library, and obtaining an emotion tag corresponding to a feature sample that matches each audio feature; constructing a feature tag matrix of the user voice on the basis of the audio features and the emotion tags corresponding to the matched feature samples; inputting the feature tag matrix into a multi-emotion recognition model, and obtaining a plurality of emotion sets and scene tags corresponding to the emotion sets; and acquiring a scene tag that matches the voice scene of the user voice so as to determine the emotion set corresponding to the matched scene tag as a recognized emotion in the user voice. According to the present application, various potential emotions can be efficiently and accurately recognized from a voice.

Description

Voice emotion recognition method, apparatus, electronic device and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 3, 2020, with application number 202010138561.3 and the invention title "Voice Emotion Recognition Method, Apparatus, Medium and Electronic Equipment", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a voice emotion recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
Emotional computing is an important technology that gives intelligent machines the ability to perceive, understand, and express various emotional states, and voice technology, as an important carrier of emotional information, has received increasing attention. Although current voice emotion detection achieves good results, the inventor realized that, limited by problems such as data set quality and the subjectivity of emotion annotation, most models can only judge a single emotion, the number of emotion categories that can be judged is small, and the hidden emotions in complex speech cannot be accurately described; for the multiple emotions that may be contained in a segment of speech, the boundaries are also difficult to determine. These problems greatly limit the promotion and development of speech emotion recognition technology.
Summary of the Invention
In order to solve the above technical problems, an object of the present application is to provide a voice emotion recognition method, apparatus, electronic device, and computer-readable storage medium.
The technical solutions adopted in this application are as follows:
In a first aspect, a voice emotion recognition method includes: when a user voice is received, extracting multiple types of audio features of the user voice; respectively matching the audio features with feature samples in an emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; inputting the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; and obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
In a second aspect, a voice emotion recognition apparatus includes: an extraction module for extracting multiple types of audio features of a user voice when the user voice is received; a matching module for respectively matching the audio features with feature samples in an emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; a construction module for constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; a prediction module for inputting the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; and a determination module for obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
In a third aspect, an electronic device includes: a processor; and a memory for storing computer program instructions for the processor; wherein the processor is configured to perform the above method by executing the computer program instructions.
In a fourth aspect, a computer-readable storage medium has computer program instructions stored thereon; when the computer program instructions are executed by a processor, the above method is implemented.
In the above technical solution, first, when the user voice is received, multiple types of audio features of the user voice are extracted; the multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different perspectives, that is, they can characterize the user's emotions from different angles. Then, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions expressed by each feature can be obtained, which in turn guides the identification of the user's various hidden emotions in the subsequent steps. Then, a feature label matrix of the user voice is constructed based on the audio features and the emotion labels corresponding to the matched feature samples; the different types of audio features and the emotion labels of different likelihoods embodied by the feature samples of the corresponding similarities are structurally linked through the feature label matrix, which can reflect possible patterns of emotional change. Furthermore, the feature label matrix is input into the multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and the corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix. Finally, the scene label matched by the voice scene of the user voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user; in this way, the speech emotion recognition result is obtained according to the matching of the real scene of the voice. In this manner, various potential emotions can be recognized efficiently and accurately from speech.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the application.
Description of the Drawings
Fig. 1 schematically shows a flow chart of a voice emotion recognition method.
Fig. 2 schematically shows an example diagram of an application scenario of a voice emotion recognition method.
Fig. 3 schematically shows a flow chart of a feature extraction method.
Fig. 4 schematically shows a block diagram of a voice emotion recognition apparatus.
Fig. 5 schematically shows an example block diagram of an electronic device for implementing the above voice emotion recognition method.
Fig. 6 schematically shows a computer-readable storage medium for implementing the above voice emotion recognition method.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this application will be more comprehensive and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in one or more embodiments in any suitable manner. In the following description, many specific details are provided to give a thorough understanding of the embodiments of the present application. However, those skilled in the art will appreciate that the technical solutions of the present application can be practiced without one or more of the specific details, or other methods, components, devices, steps, and the like can be used. In other cases, well-known technical solutions are not shown or described in detail so as not to obscure aspects of the present application.
In addition, the drawings are only schematic illustrations of the application and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, so their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities; these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
This exemplary embodiment first provides a voice emotion recognition method. The voice emotion recognition method can be run on a server, a server cluster, a cloud server, or the like; of course, those skilled in the art can also run the method of the present invention on other platforms according to their needs, which is not specifically limited in this exemplary embodiment. Referring to Fig. 1, the voice emotion recognition method may include the following steps:
Step S110, when a user voice is received, extract multiple types of audio features of the user voice;
Step S120, respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature;
Step S130, based on the audio features and the emotion labels corresponding to the matched feature samples, construct a feature label matrix of the user voice;
Step S140, input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set;
Step S150, obtain the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
In the above voice emotion recognition method, first, when a user voice is received, multiple types of audio features of the user voice are extracted; the multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different perspectives, that is, they can characterize the user's emotions from different angles. Then, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions expressed by each feature can be obtained, which in turn guides the identification of the user's various hidden emotions in the subsequent steps. Then, a feature label matrix of the user voice is constructed based on the audio features and the emotion labels corresponding to the matched feature samples; the different types of audio features and the emotion labels of different likelihoods embodied by the feature samples of the corresponding similarities are structurally linked through the feature label matrix, which can reflect possible patterns of emotional change. Furthermore, the feature label matrix is input into the multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and the corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix. Finally, the scene label matched by the voice scene of the user voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user; in this way, the speech emotion recognition result is obtained according to the matching of the real scene of the voice. In this manner, various potential emotions can be recognized efficiently and accurately from speech.
Hereinafter, each step of the above voice emotion recognition method in this exemplary embodiment will be explained and described in detail with reference to the accompanying drawings.
In step S110, when a user voice is received, multiple types of audio features of the user voice are extracted.
In this example embodiment, referring to Fig. 2, the server 201 receives the user voice sent by the server 202, and the server 201 can then extract multiple types of audio features of the user voice and perform emotion recognition in the subsequent steps. The server 201 can be any terminal capable of executing program instructions and storing data, such as a cloud server, a mobile phone, or a computer; the server 202 can be any terminal with a storage function, such as a mobile phone or a computer.
The audio features can be various audio features such as a zero-crossing rate feature, a short-term energy feature, a short-term average amplitude difference feature, a pronunciation frame number feature, a pitch frequency feature, a formant feature, a harmonic-to-noise ratio feature, and a Mel cepstrum coefficient feature. These features can be extracted from a piece of audio using existing audio feature extraction methods. The extracted multiple types of audio features of the user's voice can reflect the change characteristics of the user's voice from different angles, that is, they can characterize the user's emotions from different angles. For example, short-term energy reflects the strength of the signal at different moments and can therefore reflect how the stability of the user's emotion changes over a segment of speech. Audio has periodic characteristics, and under stationary noise the short-term average amplitude difference makes this periodicity easier to observe, so it can reflect the periodicity of the user's emotion within a segment of speech. A formant is the resonance characteristic produced when quasi-periodic pulses at the glottis excite the vocal tract, producing a set of resonance frequencies called formant frequencies, or formants for short. Formant parameters, which include the formant frequency and the bandwidth, are important parameters for distinguishing different finals and can characterize the user's emotions from a linguistic perspective.
In this way, by extracting multiple types of audio features of the user's voice, the user's emotions can be analyzed based on these features in the subsequent steps.
In one implementation of this example, referring to Fig. 3, when a user voice is received, extracting the multiple types of audio features of the user voice includes:
Step S310, when the user voice is received, converting the user voice into text;
Step S320, matching the text with the text samples in a feature extraction category database to obtain a text sample matching the text;
Step S330, extracting, from the user voice, audio features of multiple feature categories associated with the text sample.
When the user's voice is received, it is converted into text, so the actual content expressed by the user can be obtained. The converted text is then matched against the text samples in the feature extraction category database to obtain the text sample that matches the converted text. The feature extraction category database stores, for texts with different semantics, the feature categories of the audio features that most clearly reflect emotion when that text is spoken. By then extracting, from the user's voice, the audio features of the feature categories associated with the matched text sample, emotion recognition can be performed efficiently and accurately in the subsequent steps.
In one embodiment, the multiple types of audio features include at least three of: the zero-crossing rate feature, the short-term energy feature, the short-term average amplitude difference feature, the pronunciation frame number feature, the pitch frequency feature, the formant feature, the harmonic-to-noise ratio feature, and the Mel cepstrum coefficient feature.
Because the multiple types of audio features include at least three of these features, multi-emotion recognition can be achieved with relatively high accuracy.
在步骤S120中，分别将所述音频特征与情绪特征库中的特征样本进行匹配，得到与每个所述音频特征匹配的特征样本相应的情绪标签。In step S120, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to each feature sample that matches each of the audio features.
在本示例的实施方式中，情绪特征库中保存了各个类别的音频特征的特征样本，每个特征样本关联于一个类别的情绪标签。将音频特征与情绪特征库中的特征样本进行匹配，可以通过欧氏距离或者汉明距离等计算音频特征与特征样本的相似度，进而得到与每个音频特征匹配的多个特征样本(如相似度大于50%的特征样本)相应的情绪标签，这样可以获取到每个音频特征所表现的多个具有嫌疑的情绪，进而可以指导后续步骤中识别到用户潜在的各种隐藏的情绪。In this exemplary embodiment, the emotion feature library stores feature samples for audio features of each category, and each feature sample is associated with an emotion label of one category. When the audio features are matched with the feature samples in the emotion feature library, the similarity between an audio feature and a feature sample can be computed with, for example, the Euclidean distance or the Hamming distance, so as to obtain the emotion labels corresponding to the multiple feature samples that match each audio feature (for example, feature samples with a similarity greater than 50%). In this way, the several candidate emotions expressed by each audio feature are obtained, which guides the subsequent steps in identifying the various hidden emotions the user may have.
在本示例的一种实施方式中,所述分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签,包括:In an implementation manner of this example, the respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples matching each of the audio features includes:
分别将所述音频特征与情绪特征库中的特征样本进行对比,得到与每个所述音频特征相似度超过预定阈值的多个特征样本,所述预定阈值与所述音频特征的个数对应;Respectively comparing the audio feature with the feature samples in the emotional feature library to obtain a plurality of feature samples whose similarity to each of the audio features exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of the audio features;
从所述情绪特征库中获取每个所述特征样本对应的情绪标签。Obtain the emotion label corresponding to each of the feature samples from the emotion feature library.
预定阈值可以根据精确度需求进行设定，预定阈值与音频特征的个数对应，即预定阈值的取值通过音频特征的个数决定，可以是音频特征的个数越多，预定阈值的取值越小。这样通过分别将音频特征与情绪特征库中的特征样本进行对比，得到与每个音频特征相似度超过预定阈值的多个特征样本，然后，从情绪特征库中获取每个特征样本对应的情绪标签，可以保证每个音频特征的情绪识别的可靠性。The predetermined threshold can be set according to the required accuracy. The predetermined threshold corresponds to the number of audio features, that is, its value is determined by the number of audio features; for example, the more audio features there are, the smaller the predetermined threshold may be. By comparing the audio features with the feature samples in the emotion feature library respectively, multiple feature samples whose similarity to each audio feature exceeds the predetermined threshold are obtained, and the emotion label corresponding to each of these feature samples is then obtained from the emotion feature library, which ensures the reliability of the emotion recognition for each audio feature.
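The matching described above can be sketched as follows. The distance-based similarity score and the threshold schedule are illustrative assumptions; the patent only requires the threshold to be tied to the number of extracted audio features.

```python
# Sketch of step S120: match each extracted audio feature against the emotion
# feature library and keep samples whose similarity exceeds the threshold.
import numpy as np


def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Map the Euclidean distance between two vectors into a (0, 1] score."""
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))


def match_emotion_labels(audio_features: dict, feature_library: dict) -> dict:
    """audio_features: {category: vector}; feature_library:
    {category: [(sample_vector, emotion_label), ...]}. Returns, per category,
    the (label, similarity) pairs above a threshold that shrinks as the
    number of extracted features grows (an assumed schedule)."""
    threshold = max(0.5, 0.9 - 0.05 * len(audio_features))
    matches = {}
    for category, vector in audio_features.items():
        hits = []
        for sample_vector, label in feature_library.get(category, []):
            sim = similarity(vector, sample_vector)
            if sim > threshold:
                hits.append((label, sim))
        matches[category] = sorted(hits, key=lambda x: -x[1])
    return matches
```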
在步骤S130中,基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵。In step S130, a feature tag matrix of the user voice is constructed based on the audio feature and the emotion tag corresponding to the matched feature sample.
在本示例的实施方式中，特征标签矩阵存储用户语音的音频特征及相应的情绪标签，既可以反映用户语音的音频特征，也可以反映这些特征所体现的可能情绪。将不同类别的音频特征及相应的各个相似度的特征样本所体现的不同可能性的情绪标签，通过特征标签矩阵结构化地联系起来，通过情绪标签形成音频特征组合的约束，从而可以反映出可能的潜在情绪变化规律。In this exemplary embodiment, the feature label matrix stores the audio features of the user's voice and the corresponding emotion labels, so it reflects both the audio features of the user's voice and the possible emotions these features convey. The emotion labels of different possibilities, embodied by the different categories of audio features and by the feature samples at their respective similarities, are linked together in a structured way through the feature label matrix; the emotion labels thereby constrain the combinations of audio features, so that potential patterns of hidden emotion change can be reflected.
在本示例的一种实施方式中,所述基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵,包括:In an implementation manner of this example, the constructing the feature tag matrix of the user voice based on the audio feature and the corresponding emotion tag of the matched feature sample includes:
将所述音频特征添加到矩阵的第一行;Adding the audio feature to the first row of the matrix;
将每个所述音频特征相应的所述情绪标签，按照每个所述特征样本与所述音频特征的相似度由高到低的顺序，添加到每个所述音频特征对应的列得到所述特征标签矩阵，其中，所述矩阵每行对应于一个相似度范围。The emotion label corresponding to each of the audio features is added, in descending order of the similarity between each feature sample and the audio feature, to the column corresponding to that audio feature to obtain the feature label matrix, wherein each row of the matrix corresponds to a similarity range.
各个音频特征添加到空矩阵的第一行，然后，每一列对应于一个音频特征。每个音频特征相应的情绪标签，按照每个特征样本与音频特征的相似度由高到低的顺序，添加到每个音频特征对应的列，得到特征标签矩阵。例如，A音频特征与A1特征样本相似度为63%，则可以将A1特征样本对应的情绪标签添加到A音频特征所在列的60%-70%区间的行。矩阵每行对应于一个相似度范围，例如，相似度范围为60%-70%区间的行。Each audio feature is added to the first row of an empty matrix, so that each column corresponds to one audio feature. The emotion label corresponding to each audio feature is then added to the column of that audio feature, in descending order of the similarity between each feature sample and the audio feature, to obtain the feature label matrix. For example, if the similarity between audio feature A and feature sample A1 is 63%, the emotion label corresponding to feature sample A1 can be added to the row for the 60%-70% interval in the column of audio feature A. Each row of the matrix corresponds to a similarity range, for example, the row for the 60%-70% similarity range.
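A sketch of building the feature label matrix is shown below. The 10% band width of each row is an assumption taken from the 60%-70% example above, and the `matches` input is the per-category label/similarity list produced by the matching sketch for step S120.

```python
# Sketch of step S130: first row holds the audio feature categories, each later
# row is a 10% similarity band in descending order, and the matched emotion
# labels fall into the cell of their band.
def build_feature_label_matrix(matches: dict) -> list:
    categories = list(matches)
    matrix = [categories]                      # first row: the audio features
    bands = [(hi / 100, (hi + 10) / 100) for hi in range(90, 40, -10)]
    for low, high in bands:                    # 0.9-1.0, 0.8-0.9, ..., 0.5-0.6
        row = []
        for cat in categories:
            row.append([label for label, sim in matches.get(cat, [])
                        if low <= sim < high or (high == 1.0 and sim == 1.0)])
        matrix.append(row)
    return matrix
```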
在步骤S140中，将所述特征标签矩阵输入多情绪识别模型，得到多个情绪集及每个所述情绪集对应的场景标签。In step S140, the feature label matrix is input into the multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each of the emotion sets.
在本示例的实施方式中，多情绪识别模型是预先训练好的、可以一次性识别出多种情绪的机器学习模型。将特征标签矩阵输入多情绪识别模型，可以基于结构化标签矩阵对多类音频特征的约束，使机器学习模型容易地计算得到用户语音可能的情绪，得到多个情绪组合，预测出用户语音的多个情绪集，及每个情绪集可能的场景(如购物场景、聊天场景)的场景标签。这样可以通过多情绪识别模型，高效准确地基于特征标签矩阵分析出多个可能的场景及对应的多个情绪。In this exemplary embodiment, the multi-emotion recognition model is a pre-trained machine learning model that can recognize multiple emotions at once. When the feature label matrix is input into the multi-emotion recognition model, the constraints that the structured label matrix imposes on the multiple categories of audio features allow the machine learning model to easily compute the possible emotions of the user's voice, obtain multiple emotion combinations, and predict multiple emotion sets of the user's voice together with the scene label of the possible scene for each emotion set (such as a shopping scene or a chat scene). In this way, multiple possible scenes and the corresponding multiple emotions can be analyzed efficiently and accurately from the feature label matrix by the multi-emotion recognition model.
在本示例的一种实施方式中,所述多情绪识别模型的构建方法,包括:In an implementation of this example, the method for constructing the multi-emotion recognition model includes:
利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型；Use the AISHELL Chinese voiceprint database to train a resnet34 model, and after training take the first n layers of the network as a pre-trained model;
为所述预训练模型接入多层全连接层作为分类器,得到识别模型,以使用标注好的语音情绪数据集对所述识别模型进行训练得到多情绪识别模型。A multi-layer fully connected layer is used as a classifier for the pre-training model to obtain a recognition model, and a multi-emotion recognition model is obtained by training the recognition model using the labeled speech emotion data set.
首先利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型，在这之后接入多层全连接层作为分类器，最后再使用标注好的语音情绪数据集对该模型进行训练得到最终模型。针对其中遇到的正负样本不均衡问题，可以在每个训练批次中计算正负样本比例作为损失函数的加权矩阵，使其更加关注少样本数据，提高模型的准确度。First, a resnet34 model is trained on the AISHELL Chinese voiceprint database; after training, the first n layers of the network are taken as a pre-trained model, a multi-layer fully connected layer is then attached as the classifier, and finally the labeled speech emotion data set is used to train this model to obtain the final model. To deal with the imbalance between positive and negative samples encountered during training, the ratio of positive to negative samples can be computed in each training batch and used as a weighting matrix for the loss function, so that the model pays more attention to classes with few samples and its accuracy is improved.
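The model construction can be sketched as follows in PyTorch. The ResNet-34 trunk stands in for the network pre-trained on the AISHELL voiceprint data (that pre-training is not reproduced here), and the layer count, feature sizes and the use of a multi-label BCE loss with per-batch class weighting are illustrative assumptions.

```python
# Hedged PyTorch sketch of the multi-emotion recognition model of this
# embodiment: truncated ResNet-34 trunk + multi-layer fully connected
# classifier, with a per-batch positive/negative weighting of the loss.
import torch
import torch.nn as nn
from torchvision.models import resnet34


class MultiEmotionNet(nn.Module):
    def __init__(self, n_emotions: int, n_pretrained_layers: int = 7):
        super().__init__()
        backbone = resnet34(weights=None)        # load AISHELL-pretrained weights here
        backbone.conv1 = nn.Conv2d(1, 64, 7, 2, 3, bias=False)  # 1-channel spectrograms
        # "前n层网络作为预训练模型": keep the first n children as the trunk.
        self.trunk = nn.Sequential(*list(backbone.children())[:n_pretrained_layers])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(         # multi-layer fully connected classifier
            nn.Flatten(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_emotions),
        )

    def forward(self, x):
        return self.classifier(self.pool(self.trunk(x)))


def weighted_bce_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """targets: multi-hot float labels. The per-batch positive/negative ratio
    is used as the loss weighting so rare emotions get more attention."""
    pos = targets.sum(dim=0).clamp(min=1.0)
    neg = (targets.size(0) - pos).clamp(min=1.0)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=neg / pos)
```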
在本示例的一种实施方式中，同时初始化第一多情绪识别模型和第二多情绪识别模型，并使用有标签混合无标签的原始数据在所述第一多情绪识别模型上进行训练得到第一预测值，并得到有标签数据部分的分类误差损失值；In an implementation of this example, a first multi-emotion recognition model and a second multi-emotion recognition model are initialized at the same time, original data in which labeled samples are mixed with unlabeled samples is used to train the first multi-emotion recognition model to obtain a first predicted value, and the classification error loss value of the labeled data part is obtained;
利用指数滑动平均更新所述第二多情绪识别模型,并将加上噪声的数据输入更新后的所述第二多情绪识别模型训练得到第二预测值;Updating the second multi-emotion recognition model by using an exponential moving average, and inputting noise-added data into the updated second multi-emotion recognition model to obtain a second predicted value;
计算所述第一预测值和所述第二预测值之间的误差作为一致性损失值;Calculating an error between the first predicted value and the second predicted value as a consistency loss value;
利用所述分类误差损失值与所述一致性损失值之和更新所述第一多情绪识别模型。The first multiple emotion recognition model is updated by using the sum of the classification error loss value and the consistency loss value.
可以使用半监督学习Mean-Teacher的方式改进原始模型，可以重复利用大量无标签数据。同时初始化两个模型：第一多情绪识别模型Model_student和第二多情绪识别模型Model_teacher，使用有标签混合无标签的原始数据在Model_student上进行训练得到各情绪概率值P_student，同时得到有标签数据部分的分类误差损失值loss_classification；然后，利用指数滑动平均来更新Model_teacher，滑动平均可以使模型在测试数据上更健壮。然后，将加上噪声的数据输入Model_teacher训练得到预测值P_teacher，计算P_teacher和P_student之间的误差作为一致性损失值loss_consistency，利用loss_classification+loss_consistency的损失值更新第一多情绪识别模型Model_student。The original model can be improved with the semi-supervised Mean-Teacher approach, which makes it possible to reuse a large amount of unlabeled data. Two models are initialized at the same time: the first multi-emotion recognition model Model_student and the second multi-emotion recognition model Model_teacher. The original data mixing labeled and unlabeled samples is used to train Model_student to obtain the probability value of each emotion P_student, and the classification error loss value loss_classification of the labeled data part is obtained at the same time. The exponential moving average is then used to update Model_teacher; the moving average makes the model more robust on test data. Next, the noise-added data is input into Model_teacher to obtain the predicted value P_teacher, the error between P_teacher and P_student is computed as the consistency loss value loss_consistency, and the loss value loss_classification + loss_consistency is used to update the first multi-emotion recognition model Model_student.
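A minimal Mean-Teacher training step, building on the MultiEmotionNet sketch above, might look as follows; the EMA decay, the noise level and the equal weighting of the two losses are illustrative values that the embodiment does not fix.

```python
# Minimal Mean-Teacher training step for the student/teacher pair.
import copy
import torch
import torch.nn.functional as F

student = MultiEmotionNet(n_emotions=8)              # n_emotions is an assumption
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
EMA_DECAY = 0.99


def train_step(x, y, labeled_mask):
    """x: spectrogram batch; y: multi-hot float labels (zeros for unlabeled
    samples); labeled_mask: boolean tensor marking which samples carry labels."""
    optimizer.zero_grad()

    p_student = student(x)                                        # first predicted value
    loss_classification = weighted_bce_loss(p_student[labeled_mask], y[labeled_mask])

    with torch.no_grad():
        # Exponential moving average update of the teacher weights.
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(EMA_DECAY).add_(s, alpha=1 - EMA_DECAY)
        # Second predicted value on the noise-added input.
        p_teacher = teacher(x + 0.05 * torch.randn_like(x))

    # Consistency loss between the two predictions, over all samples.
    loss_consistency = F.mse_loss(torch.sigmoid(p_student), torch.sigmoid(p_teacher))

    (loss_classification + loss_consistency).backward()
    optimizer.step()
```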
结合上述两个实施例构建多情绪识别模型，可以利用迁移学习和半监督学习技术有效地改善在少量数据集下模型的分类效果，也在一定程度上缓解模型过拟合问题。经过测试，该方案不仅可以准确地检测出语音中的显示情绪，也能准确识别出多种潜在情绪，改善和拓展了语音情绪识别技术。By combining the above two embodiments to construct the multi-emotion recognition model, transfer learning and semi-supervised learning can be used to effectively improve the classification performance of the model on small data sets and to alleviate model overfitting to a certain extent. Tests show that this scheme can not only accurately detect the overtly displayed emotions in speech but also accurately identify a variety of hidden emotions, improving and extending speech emotion recognition technology.
在步骤S150中,获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。In step S150, a scene tag matched by the voice scene of the user's voice is acquired, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
在本示例的实施方式中,通过事先标定或者定位语音来源(如客服语音)可以确定用户语音的场景。将用户语音的场景匹配的场景标签对应的情绪集确定为识别出的用户语音情绪,保证识别边界准确性,可以进一步保证用户语音的情绪识别准确性。根据语音真实场景的匹配,获取到语音的语音情绪识别结果。In the implementation of this example, the scene of the user's voice can be determined by pre-calibrating or locating the voice source (such as customer service voice). The emotion set corresponding to the scene tag matched by the scene of the user's voice is determined as the recognized user's voice emotion, so as to ensure the accuracy of the recognition boundary, which can further ensure the accuracy of the emotion recognition of the user's voice. According to the matching of the real scene of the voice, the voice emotion recognition result of the voice is obtained.
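As a simple illustration of step S150, assuming the model output is represented as a list of (scene tag, emotion set) candidates and that the real scene of the utterance is known from prior calibration or from the call source:

```python
# Select the emotion set whose scene tag matches the real voice scene.
def select_emotions(candidates: list, actual_scene: str) -> set:
    for scene_tag, emotions in candidates:
        if scene_tag == actual_scene:
            return emotions
    return set()          # no matching scene: no emotions are reported


# Example: a customer-service call known to be a shopping scene.
candidates = [("shopping", {"anxious", "impatient"}), ("chat", {"calm"})]
print(select_emotions(candidates, "shopping"))   # {'anxious', 'impatient'}
```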
以这种方式可以实现高效地、准确地从语音中识别出各类潜在情绪。In this way, various potential emotions can be recognized efficiently and accurately from speech.
本申请还提供了一种语音情绪识别装置。参考图4所示，该语音情绪识别装置可以包括提取模块410、匹配模块420、构建模块430、预测模块440以及确定模块450。其中：The application also provides a voice emotion recognition device. As shown in FIG. 4, the voice emotion recognition device may include an extraction module 410, a matching module 420, a construction module 430, a prediction module 440 and a determination module 450, wherein:
提取模块410可以用于当接收到用户语音，提取所述用户语音的多类音频特征；The extraction module 410 may be configured to extract multiple types of audio features of the user's voice when the user's voice is received;
匹配模块420可以用于分别将所述音频特征与情绪特征库中的特征样本进行匹配，得到与每个所述音频特征匹配的特征样本相应的情绪标签；The matching module 420 may be configured to respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to each feature sample that matches the audio features;
构建模块430可以用于基于所述音频特征及所述匹配的特征样本相应的情绪标签，构建所述用户语音的特征标签矩阵；The construction module 430 may be configured to construct the feature label matrix of the user's voice based on the audio features and the emotion labels corresponding to the matched feature samples;
预测模块440可以用于将所述特征标签矩阵输入多情绪识别模型，得到多个情绪集及每个所述情绪集对应的场景标签；The prediction module 440 may be configured to input the feature label matrix into the multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each emotion set;
确定模块450可以用于获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。The determining module 450 may be used to obtain a scene tag matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
上述语音情绪识别装置中各模块的具体细节已经在对应的语音情绪识别方法中进行了详细的描述,因此此处不再赘述。The specific details of each module in the above-mentioned voice emotion recognition device have been described in detail in the corresponding voice emotion recognition method, so it will not be repeated here.
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本申请的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory. In fact, according to the embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
此外,尽管在附图中以特定顺序描述了本申请中方法的各个步骤,但是,这并非要求或者暗示必须按照该特定顺序来执行这些步骤,或是必须执行全部所示的步骤才能实现期望的结果。附加的或备选的,可以省略某些步骤,将多个步骤合并为一个步骤执行,以及/或者将一个步骤分解为多个步骤执行等。In addition, although the various steps of the method in the present application are described in a specific order in the drawings, this does not require or imply that these steps must be performed in the specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution, etc.
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式 可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、移动终端、或者网络设备等)执行根据本申请实施方式的方法。Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , Including several instructions to make a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) execute the method according to the embodiment of the present application.
在本申请的示例性实施例中,还提供了一种能够实现上述方法的电子设备。In an exemplary embodiment of the present application, an electronic device capable of implementing the above method is also provided.
所属技术领域的技术人员能够理解,本发明的各个方面可以实现为系统、方法或程序产品。因此,本发明的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。Those skilled in the art can understand that various aspects of the present invention can be implemented as a system, a method, or a program product. Therefore, various aspects of the present invention can be specifically implemented in the following forms, namely: complete hardware implementation, complete software implementation (including firmware, microcode, etc.), or a combination of hardware and software implementations, which may be collectively referred to herein as "Circuit", "Module" or "System".
下面参照图5来描述根据本发明的这种实施方式的电子设备500。图5显示的电子设备500仅仅是一个示例,不应对本发明实施例的功能和使用范围带来任何限制。The electronic device 500 according to this embodiment of the present invention will be described below with reference to FIG. 5. The electronic device 500 shown in FIG. 5 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
如图5所示,电子设备500以通用计算设备的形式表现。电子设备500的组件可以包括但不限于:上述至少一个处理单元510、上述至少一个存储单元520、连接不同系统组件(包括存储单元520和处理单元510)的总线530。As shown in FIG. 5, the electronic device 500 is represented in the form of a general-purpose computing device. The components of the electronic device 500 may include, but are not limited to: the aforementioned at least one processing unit 510, the aforementioned at least one storage unit 520, and a bus 530 connecting different system components (including the storage unit 520 and the processing unit 510).
其中，所述存储单元存储有程序代码，所述程序代码可以被所述处理单元510执行，使得所述处理单元510执行本说明书上述"示例性方法"部分中描述的根据本发明各种示例性实施方式的步骤。例如，所述处理单元510可以执行如图1中所示的：Wherein, the storage unit stores program code, and the program code can be executed by the processing unit 510, so that the processing unit 510 executes the steps according to various exemplary embodiments of the present invention described in the "Exemplary Method" section of this specification. For example, the processing unit 510 may perform the steps shown in FIG. 1:
步骤S110,当接收到用户语音,提取所述用户语音的多类音频特征;Step S110, when a user voice is received, extract multiple types of audio features of the user voice;
步骤S120,分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;Step S120, respectively matching the audio features with the feature samples in the emotion feature library, to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
步骤S130,基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;Step S130, based on the audio features and the emotion labels corresponding to the matched feature samples, construct a feature label matrix of the user voice;
步骤S140,将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;Step S140: Input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
步骤S150,获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。Step S150: Obtain a scene tag matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
存储单元520可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)5201和/或高速缓存存储单元5202,还可以进一步包括只读存储单元(ROM)5203。The storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 5201 and/or a cache storage unit 5202, and may further include a read-only storage unit (ROM) 5203.
存储单元520还可以包括具有一组(至少一个)程序模块5205的程序/实用工具5204,这样的程序模块5205包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。The storage unit 520 may also include a program/utility tool 5204 having a set (at least one) program module 5205. Such program module 5205 includes but is not limited to: an operating system, one or more application programs, other program modules, and program data, Each of these examples or some combination may include the implementation of a network environment.
总线530可以为表示几类总线结构中的一种或多种，包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
电子设备500也可以与一个或多个外部设备700(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得客户能与该电子设备500交互的设备通信,和/或与使得该电子设备500能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口550进行,还可以包括与输入/输出(I/O)接口550连接的显示单元540。并且,电子设备500还可以通过网络适配器560与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器560通过总线530与电子设备500的其它模块通信。应当明白,尽管图中未示出,可以结合电子设备500使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。The electronic device 500 can also communicate with one or more external devices 700 (such as keyboards, pointing devices, Bluetooth devices, etc.), and can also communicate with one or more devices that enable customers to interact with the electronic device 500, and/or communicate with Any device (such as a router, modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 550, and may also include a display unit 540 connected to the input/output (I/O) interface 550. In addition, the electronic device 500 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 560. As shown in the figure, the network adapter 560 communicates with other modules of the electronic device 500 through the bus 530. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本申请实施方式的方法。Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , Including several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiment of the present application.
在本申请的示例性实施例中,参考图6所示,还提供了一种计算机可读存储介质,该计算机可读存储介质可以是非易失性,也可以是易失性,其上存储有能够实现本说明书上述方法的程序产品。在一些可能的实施方式中,本发明的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当所述程序产品在终端设备上运行时,所述程序代码用于使所述终端设备执行本说明书上述“示例性方法”部分中描述的根据本发明各种示例性实施方式的步骤。In an exemplary embodiment of the present application, as shown in FIG. 6, a computer-readable storage medium is also provided. The computer-readable storage medium may be non-volatile or volatile, and stored thereon Program products that can implement the above-mentioned methods in this specification. In some possible implementation manners, various aspects of the present invention may also be implemented in the form of a program product, which includes program code. When the program product runs on a terminal device, the program code is used to enable the The terminal device executes the steps according to various exemplary embodiments of the present invention described in the above-mentioned "Exemplary Method" section of this specification.
参考图6所示,描述了根据本发明的实施方式的用于实现上述方法的程序产品600,其可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在终端设备,例如个人电脑上运行。然而,本发明的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。Referring to FIG. 6, a program product 600 for implementing the above method according to an embodiment of the present invention is described. It can adopt a portable compact disk read-only memory (CD-ROM) and include program code, and can be installed in a terminal device, For example, running on a personal computer. However, the program product of the present invention is not limited to this. In this document, the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or combined with an instruction execution system, device, or device.
所述程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。The program product can use any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable Type programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承 载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。The computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, in which readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
可以以一种或多种程序设计语言的任意组合来编写用于执行本发明操作的程序代码,所述程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在客户计算设备上执行、部分地在客户设备上执行、作为一个独立的软件包执行、部分在客户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到客户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。The program code used to perform the operations of the present invention can be written in any combination of one or more programming languages. The programming languages include object-oriented programming languages—such as Java, C++, etc., as well as conventional procedural styles. Programming language-such as "C" language or similar programming language. The program code can be executed entirely on the client computing device, partly executed on the client device, executed as an independent software package, partly executed on the client computing device and partly executed on the remote computing device, or entirely on the remote computing device or server Executed on. In the case of a remote computing device, the remote computing device can be connected to a client computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, using Internet service providers). Business to connect via the Internet).
此外,上述附图仅是根据本发明示例性实施例的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。In addition, the above-mentioned drawings are merely schematic illustrations of the processing included in the method according to the exemplary embodiment of the present invention, and are not intended for limitation. It is easy to understand that the processing shown in the above drawings does not indicate or limit the time sequence of these processings. In addition, it is easy to understand that these processes can be executed synchronously or asynchronously in multiple modules, for example.
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其他实施例。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由权利要求指出。After considering the specification and practicing the invention disclosed herein, those skilled in the art will easily think of other embodiments of the present application. This application is intended to cover any variations, uses, or adaptive changes of this application. These variations, uses, or adaptive changes follow the general principles of this application and include common knowledge or customary technical means in the technical field that are not disclosed in this application. . The description and the embodiments are only regarded as exemplary, and the true scope and spirit of the application are pointed out by the claims.

Claims (20)

  1. 一种语音情绪识别方法,其中,包括:A voice emotion recognition method, which includes:
    当接收到用户语音,提取所述用户语音的多类音频特征;When the user voice is received, extract multiple types of audio features of the user voice;
    分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;Respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples that match each of the audio features;
    基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;Constructing a feature tag matrix of the user's voice based on the audio feature and the emotion tag corresponding to the matched feature sample;
    将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;Input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
    获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。Acquire the scene tag matched by the voice scene of the user's voice to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  2. 根据权利要求1所述的方法,其中,所述当接收到用户语音,提取所述用户语音的多类音频特征,包括:The method according to claim 1, wherein, when the user voice is received, extracting multiple types of audio features of the user voice comprises:
    当接收到用户语音,将所述用户语音转化为文本;When the user voice is received, convert the user voice into text;
    将所述文本与特征提取类别数据库中的文本样本匹配,得到与所述文本匹配的文本样本;Matching the text with a text sample in a feature extraction category database to obtain a text sample matching the text;
    从所述用户语音,提取与所述文本样本关联的多个特征类别的音频特征。From the user voice, audio features of a plurality of feature categories associated with the text sample are extracted.
  3. 根据权利要求1所述的方法,其中,所述分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签,包括:The method according to claim 1, wherein the respectively matching the audio feature with the feature samples in the emotional feature library to obtain the emotional label corresponding to the feature sample matched with each of the audio features comprises:
    分别将所述音频特征与情绪特征库中的特征样本进行对比,得到与每个所述音频特征相似度超过预定阈值的多个特征样本,所述预定阈值与所述音频特征的个数对应;Respectively comparing the audio feature with the feature samples in the emotional feature library to obtain a plurality of feature samples whose similarity to each of the audio features exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of the audio features;
    从所述情绪特征库中获取每个所述特征样本对应的情绪标签。Obtain the emotion label corresponding to each of the feature samples from the emotion feature library.
  4. 根据权利要求1所述的方法,其中,所述基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵,包括:The method according to claim 1, wherein the constructing a feature label matrix of the user voice based on the audio feature and the corresponding emotion label of the matched feature sample comprises:
    将所述音频特征添加到矩阵的第一行;Adding the audio feature to the first row of the matrix;
    将每个所述音频特征相应的所述情绪标签，按照每个所述特征样本与所述音频特征的相似度由高到低的顺序，添加到每个所述音频特征对应的列得到所述特征标签矩阵，其中，所述矩阵每行对应于一个相似度范围。The emotion label corresponding to each audio feature is added to the column corresponding to each audio feature in the descending order of the similarity between each feature sample and the audio feature to obtain the feature label matrix, wherein each row of the matrix corresponds to a similarity range.
  5. 根据权利要求1所述的方法,其中,所述多情绪识别模型的构建方法,包括:The method according to claim 1, wherein the method for constructing the multi-emotion recognition model comprises:
    利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型；Use the AISHELL Chinese voiceprint database to train the resnet34 model, and take out the first n-layer network as a pre-training model after the training is complete;
    为所述预训练模型接入多层全连接层作为分类器,得到识别模型,以使用标注好的语音情绪数据集对所述识别模型进行训练得到多情绪识别模型。A multi-layer fully connected layer is used as a classifier for the pre-training model to obtain a recognition model, and a multi-emotion recognition model is obtained by training the recognition model using the labeled speech emotion data set.
  6. 根据权利要求5所述的方法,其中,还包括:The method according to claim 5, further comprising:
    同时初始化第一多情绪识别模型和第二多情绪识别模型，并使用有标签混合无标签的原始数据在所述第一多情绪识别模型上进行训练得到第一预测值，并得到有标签数据部分的分类误差损失值；Initialize the first multi-emotion recognition model and the second multi-emotion recognition model at the same time, and use the labeled mixed unlabeled raw data to train on the first multi-emotion recognition model to obtain the first predicted value, and obtain the classification error loss value of the labeled data part;
    利用指数滑动平均更新所述第二多情绪识别模型,并将加上噪声的数据输入更新后的所述第二多情绪识别模型训练得到第二预测值;Updating the second multi-emotion recognition model by using an exponential moving average, and inputting noise-added data into the updated second multi-emotion recognition model to obtain a second predicted value;
    计算所述第一预测值和所述第二预测值之间的误差作为一致性损失值;Calculating an error between the first predicted value and the second predicted value as a consistency loss value;
    利用所述分类误差损失值与所述一致性损失值之和更新所述第一多情绪识别模型。The first multiple emotion recognition model is updated by using the sum of the classification error loss value and the consistency loss value.
  7. 根据权利要求1或2所述的方法，其中，所述多类音频特征至少包括过零率特征、短时能量特征、短时平均幅度差特征、发音帧数特征、基音频率特征、共振峰特征、谐波噪声比特征以及梅尔倒谱系数特征中三个。The method according to claim 1 or 2, wherein the multiple types of audio features include at least three of: the zero-crossing rate feature, short-time energy feature, short-time average magnitude difference feature, voiced frame count feature, pitch frequency feature, formant feature, harmonic-to-noise ratio feature and Mel cepstral coefficient feature.
  8. 一种语音情绪识别装置,其中,包括:A voice emotion recognition device, which includes:
    提取模块,用于当接收到用户语音,提取所述用户语音的多类音频特征;The extraction module is used to extract multiple types of audio features of the user voice when the user voice is received;
    匹配模块,用于分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;A matching module, configured to respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
    构建模块,用于基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;A construction module, configured to construct a feature tag matrix of the user voice based on the audio feature and the corresponding emotion tag of the matched feature sample;
    预测模块,用于将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;A prediction module, configured to input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
    确定模块,用于获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。The determining module is configured to obtain a scene tag matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  9. 一种电子设备,其中,包括:处理器;以及存储器,用于存储所述处理器的计算机程序指令;其中,所述处理器配置为经由执行所述计算机程序指令来执行以下处理:An electronic device, comprising: a processor; and a memory for storing computer program instructions of the processor; wherein the processor is configured to execute the following processing by executing the computer program instructions:
    当接收到用户语音,提取所述用户语音的多类音频特征;When the user voice is received, extract multiple types of audio features of the user voice;
    分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;Respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples that match each of the audio features;
    基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;Constructing a feature tag matrix of the user's voice based on the audio feature and the emotion tag corresponding to the matched feature sample;
    将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;Input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
    获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。Acquire the scene tag matched by the voice scene of the user's voice to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  10. 根据权利要求9所述的电子设备,其中,所述当接收到用户语音,提取所述用户语音的多类音频特征,包括:The electronic device according to claim 9, wherein said extracting multiple types of audio features of the user voice when the user voice is received comprises:
    当接收到用户语音,将所述用户语音转化为文本;When the user voice is received, convert the user voice into text;
    将所述文本与特征提取类别数据库中的文本样本匹配，得到与所述文本匹配的文本样本；Matching the text with a text sample in the feature extraction category database to obtain a text sample matching the text;
    从所述用户语音,提取与所述文本样本关联的多个特征类别的音频特征。From the user voice, audio features of a plurality of feature categories associated with the text sample are extracted.
  11. 根据权利要求9所述的电子设备,其中,所述分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签,包括:The electronic device according to claim 9, wherein the matching the audio features with the feature samples in the emotional feature library respectively to obtain the emotional label corresponding to the feature sample that matches with each of the audio features comprises:
    分别将所述音频特征与情绪特征库中的特征样本进行对比,得到与每个所述音频特征相似度超过预定阈值的多个特征样本,所述预定阈值与所述音频特征的个数对应;Respectively comparing the audio feature with the feature samples in the emotional feature library to obtain a plurality of feature samples whose similarity to each of the audio features exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of the audio features;
    从所述情绪特征库中获取每个所述特征样本对应的情绪标签。Obtain the emotion label corresponding to each of the feature samples from the emotion feature library.
  12. 根据权利要求9所述的电子设备，其中，所述基于所述音频特征及所述匹配的特征样本相应的情绪标签，构建所述用户语音的特征标签矩阵，包括：The electronic device according to claim 9, wherein the constructing the feature label matrix of the user voice based on the audio feature and the emotion label corresponding to the matched feature sample comprises:
    将所述音频特征添加到矩阵的第一行;Adding the audio feature to the first row of the matrix;
    将每个所述音频特征相应的所述情绪标签，按照每个所述特征样本与所述音频特征的相似度由高到低的顺序，添加到每个所述音频特征对应的列得到所述特征标签矩阵，其中，所述矩阵每行对应于一个相似度范围。The emotion label corresponding to each audio feature is added to the column corresponding to each audio feature in the descending order of the similarity between each feature sample and the audio feature to obtain the feature label matrix, wherein each row of the matrix corresponds to a similarity range.
  13. 根据权利要求9所述的电子设备,其中,所述多情绪识别模型的构建方法,包括:The electronic device according to claim 9, wherein the method for constructing the multi-emotion recognition model comprises:
    利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型；Use the AISHELL Chinese voiceprint database to train the resnet34 model, and take out the first n-layer network as a pre-training model after the training is complete;
    为所述预训练模型接入多层全连接层作为分类器,得到识别模型,以使用标注好的语音情绪数据集对所述识别模型进行训练得到多情绪识别模型。A multi-layer fully connected layer is used as a classifier for the pre-training model to obtain a recognition model, and a multi-emotion recognition model is obtained by training the recognition model using the labeled speech emotion data set.
  14. 根据权利要求13所述的电子设备,其中,还包括:The electronic device according to claim 13, further comprising:
    同时初始化第一多情绪识别模型和第二多情绪识别模型，并使用有标签混合无标签的原始数据在所述第一多情绪识别模型上进行训练得到第一预测值，并得到有标签数据部分的分类误差损失值；Initialize the first multi-emotion recognition model and the second multi-emotion recognition model at the same time, and use the labeled mixed unlabeled raw data to train on the first multi-emotion recognition model to obtain the first predicted value, and obtain the classification error loss value of the labeled data part;
    利用指数滑动平均更新所述第二多情绪识别模型,并将加上噪声的数据输入更新后的所述第二多情绪识别模型训练得到第二预测值;Updating the second multi-emotion recognition model by using an exponential moving average, and inputting noise-added data into the updated second multi-emotion recognition model to obtain a second predicted value;
    计算所述第一预测值和所述第二预测值之间的误差作为一致性损失值;Calculating an error between the first predicted value and the second predicted value as a consistency loss value;
    利用所述分类误差损失值与所述一致性损失值之和更新所述第一多情绪识别模型。The first multiple emotion recognition model is updated by using the sum of the classification error loss value and the consistency loss value.
  15. 一种计算机可读存储介质,其上存储有计算机程序指令,其中,所述计算机程序指令被处理器执行时执行以下处理:A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions execute the following processing when executed by a processor:
    当接收到用户语音,提取所述用户语音的多类音频特征;When the user voice is received, extract multiple types of audio features of the user voice;
    分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;Respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples that match each of the audio features;
    基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;Constructing a feature tag matrix of the user's voice based on the audio feature and the emotion tag corresponding to the matched feature sample;
    将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;Input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
    获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。Acquire the scene tag matched by the voice scene of the user's voice to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  16. 根据权利要求15所述的计算机可读存储介质，其中，所述当接收到用户语音，提取所述用户语音的多类音频特征，包括：The computer-readable storage medium according to claim 15, wherein the extracting multiple types of audio features of the user voice when the user voice is received includes:
    当接收到用户语音,将所述用户语音转化为文本;When the user voice is received, convert the user voice into text;
    将所述文本与特征提取类别数据库中的文本样本匹配,得到与所述文本匹配的文本样本;Matching the text with a text sample in a feature extraction category database to obtain a text sample matching the text;
    从所述用户语音,提取与所述文本样本关联的多个特征类别的音频特征。From the user voice, audio features of a plurality of feature categories associated with the text sample are extracted.
  17. 根据权利要求15所述的计算机可读存储介质，其中，所述分别将所述音频特征与情绪特征库中的特征样本进行匹配，得到与每个所述音频特征匹配的特征样本相应的情绪标签，包括：The computer-readable storage medium according to claim 15, wherein said respectively matching said audio feature with a feature sample in an emotional feature library to obtain an emotional tag corresponding to each feature sample matching said audio feature comprises:
    分别将所述音频特征与情绪特征库中的特征样本进行对比,得到与每个所述音频特征相似度超过预定阈值的多个特征样本,所述预定阈值与所述音频特征的个数对应;Respectively comparing the audio feature with the feature samples in the emotional feature library to obtain a plurality of feature samples whose similarity to each of the audio features exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of the audio features;
    从所述情绪特征库中获取每个所述特征样本对应的情绪标签。Obtain the emotion label corresponding to each of the feature samples from the emotion feature library.
  18. 根据权利要求15所述的计算机可读存储介质，其中，所述基于所述音频特征及所述匹配的特征样本相应的情绪标签，构建所述用户语音的特征标签矩阵，包括：The computer-readable storage medium according to claim 15, wherein the constructing the feature label matrix of the user voice based on the audio feature and the emotion label corresponding to the matched feature sample comprises:
    将所述音频特征添加到矩阵的第一行;Adding the audio feature to the first row of the matrix;
    将每个所述音频特征相应的所述情绪标签，按照每个所述特征样本与所述音频特征的相似度由高到低的顺序，添加到每个所述音频特征对应的列得到所述特征标签矩阵，其中，所述矩阵每行对应于一个相似度范围。The emotion label corresponding to each audio feature is added to the column corresponding to each audio feature in the descending order of the similarity between each feature sample and the audio feature to obtain the feature label matrix, wherein each row of the matrix corresponds to a similarity range.
  19. 根据权利要求15所述的计算机可读存储介质，其中，所述多情绪识别模型的构建方法，包括：The computer-readable storage medium of claim 15, wherein the method for constructing the multi-emotion recognition model comprises:
    利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型；Use the AISHELL Chinese voiceprint database to train the resnet34 model, and take out the first n-layer network as a pre-training model after the training is complete;
    为所述预训练模型接入多层全连接层作为分类器,得到识别模型,以使用标注好的语音情绪数据集对所述识别模型进行训练得到多情绪识别模型。A multi-layer fully connected layer is used as a classifier for the pre-training model to obtain a recognition model, and a multi-emotion recognition model is obtained by training the recognition model using the labeled speech emotion data set.
  20. 根据权利要求19所述的计算机可读存储介质,其中,还包括:The computer-readable storage medium according to claim 19, further comprising:
    同时初始化第一多情绪识别模型和第二多情绪识别模型，并使用有标签混合无标签的原始数据在所述第一多情绪识别模型上进行训练得到第一预测值，并得到有标签数据部分的分类误差损失值；Initialize the first multi-emotion recognition model and the second multi-emotion recognition model at the same time, and use the labeled mixed unlabeled raw data to train on the first multi-emotion recognition model to obtain the first predicted value, and obtain the classification error loss value of the labeled data part;
    利用指数滑动平均更新所述第二多情绪识别模型,并将加上噪声的数据输入更新后的所述第二多情绪识别模型训练得到第二预测值;Updating the second multi-emotion recognition model by using an exponential moving average, and inputting noise-added data into the updated second multi-emotion recognition model to obtain a second predicted value;
    计算所述第一预测值和所述第二预测值之间的误差作为一致性损失值;Calculating an error between the first predicted value and the second predicted value as a consistency loss value;
    利用所述分类误差损失值与所述一致性损失值之和更新所述第一多情绪识别模型。The first multiple emotion recognition model is updated by using the sum of the classification error loss value and the consistency loss value.
PCT/CN2020/105543 2020-03-03 2020-07-29 Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium WO2021174757A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010138561.3A CN111429946A (en) 2020-03-03 2020-03-03 Voice emotion recognition method, device, medium and electronic equipment
CN202010138561.3 2020-03-03

Publications (1)

Publication Number Publication Date
WO2021174757A1 true WO2021174757A1 (en) 2021-09-10

Family

ID=71551972

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105543 WO2021174757A1 (en) 2020-03-03 2020-07-29 Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111429946A (en)
WO (1) WO2021174757A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429946A (en) * 2020-03-03 2020-07-17 深圳壹账通智能科技有限公司 Voice emotion recognition method, device, medium and electronic equipment
CN112017670B (en) * 2020-08-13 2021-11-02 北京达佳互联信息技术有限公司 Target account audio identification method, device, equipment and medium
CN112423106A (en) * 2020-11-06 2021-02-26 四川长虹电器股份有限公司 Method and system for automatically translating accompanying sound
CN112466324A (en) * 2020-11-13 2021-03-09 上海听见信息科技有限公司 Emotion analysis method, system, equipment and readable storage medium
CN113806586B (en) * 2021-11-18 2022-03-15 腾讯科技(深圳)有限公司 Data processing method, computer device and readable storage medium
CN114093389B (en) * 2021-11-26 2023-03-28 重庆凡骄网络科技有限公司 Speech emotion recognition method and device, electronic equipment and computer readable medium
CN114242070B (en) * 2021-12-20 2023-03-24 阿里巴巴(中国)有限公司 Video generation method, device, equipment and storage medium
CN115460317A (en) * 2022-09-05 2022-12-09 西安万像电子科技有限公司 Emotion recognition and voice feedback method, device, medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014062521A1 (en) * 2012-10-19 2014-04-24 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice
CN108363706A (en) * 2017-01-25 2018-08-03 北京搜狗科技发展有限公司 The method and apparatus of human-computer dialogue interaction, the device interacted for human-computer dialogue
CN108922564A (en) * 2018-06-29 2018-11-30 北京百度网讯科技有限公司 Emotion identification method, apparatus, computer equipment and storage medium
CN109784414A (en) * 2019-01-24 2019-05-21 出门问问信息科技有限公司 Customer anger detection method, device and electronic equipment in a kind of phone customer service
CN110136723A (en) * 2019-04-15 2019-08-16 深圳壹账通智能科技有限公司 Data processing method and device based on voice messaging
CN111429946A (en) * 2020-03-03 2020-07-17 深圳壹账通智能科技有限公司 Voice emotion recognition method, device, medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961776A (en) * 2017-12-18 2019-07-02 上海智臻智能网络科技股份有限公司 Speech information processing apparatus
CN110288974B (en) * 2018-03-19 2024-04-05 北京京东尚科信息技术有限公司 Emotion recognition method and device based on voice
CN109885713A (en) * 2019-01-03 2019-06-14 刘伯涵 Facial expression image recommended method and device based on voice mood identification
CN110120231B (en) * 2019-05-15 2021-04-02 哈尔滨工业大学 Cross-corpus emotion recognition method based on self-adaptive semi-supervised non-negative matrix factorization

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113903363A (en) * 2021-09-29 2022-01-07 平安银行股份有限公司 Violation detection method, device, equipment and medium based on artificial intelligence
CN113889150A (en) * 2021-10-15 2022-01-04 北京工业大学 Speech emotion recognition method and device
CN113889150B (en) * 2021-10-15 2023-08-29 北京工业大学 Speech emotion recognition method and device
CN114121041A (en) * 2021-11-19 2022-03-01 陈文琪 Intelligent accompanying method and system based on intelligent accompanying robot
CN114121041B (en) * 2021-11-19 2023-12-08 韩端科技(深圳)有限公司 Intelligent accompanying method and system based on intelligent accompanying robot
CN114169440A (en) * 2021-12-08 2022-03-11 北京百度网讯科技有限公司 Model training method, data processing method, device, electronic device and medium
CN114912502A (en) * 2021-12-28 2022-08-16 天翼数字生活科技有限公司 Bimodal deep semi-supervised emotion classification method based on expressions and voices
CN114912502B (en) * 2021-12-28 2024-03-29 天翼数字生活科技有限公司 Double-mode deep semi-supervised emotion classification method based on expressions and voices
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment
CN114666618B (en) * 2022-03-15 2023-10-13 广州欢城文化传媒有限公司 Audio auditing method, device, equipment and readable storage medium
CN115113781A (en) * 2022-06-28 2022-09-27 广州博冠信息科技有限公司 Interactive icon display method, device, medium and electronic equipment
WO2024040793A1 (en) * 2022-08-26 2024-02-29 天翼电子商务有限公司 Multi-modal emotion recognition method combined with hierarchical policy
CN115460166A (en) * 2022-09-06 2022-12-09 网易(杭州)网络有限公司 Instant voice communication method and device, electronic equipment and storage medium
CN115414042A (en) * 2022-09-08 2022-12-02 北京邮电大学 Multi-modal anxiety detection method and device based on emotion information assistance
CN116306686B (en) * 2023-05-22 2023-08-29 中国科学技术大学 Method for generating multi-emotion-guided co-emotion dialogue
CN116306686A (en) * 2023-05-22 2023-06-23 中国科学技术大学 Method for generating multi-emotion-guided co-emotion dialogue
CN116564281B (en) * 2023-07-06 2023-09-05 世优(北京)科技有限公司 Emotion recognition method and device based on AI
CN116564281A (en) * 2023-07-06 2023-08-08 世优(北京)科技有限公司 Emotion recognition method and device based on AI

Also Published As

Publication number Publication date
CN111429946A (en) 2020-07-17

Similar Documents

Publication Title
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN109036384B (en) Audio recognition method and device
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN107481717B (en) Acoustic model training method and system
CN108428446A (en) Audio recognition method and device
EP3940693A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
CN112885336B (en) Training and recognition method and device of voice recognition system and electronic equipment
CN111653274B (en) Wake-up word recognition method, device and storage medium
US11322151B2 (en) Method, apparatus, and medium for processing speech signal
CN114330371A (en) Session intention identification method and device based on prompt learning, and electronic equipment
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN111966798A (en) Intention identification method and device based on multi-round K-means algorithm and electronic equipment
CN117115581A (en) Intelligent misoperation early warning method and system based on multi-mode deep learning
CN116361442A (en) Business hall data analysis method and system based on artificial intelligence
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
CN112735432B (en) Audio identification method, device, electronic equipment and storage medium
CN115512692A (en) Voice recognition method, device, equipment and storage medium
CN112951270B (en) Voice fluency detection method, device and electronic equipment
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
CN115116443A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN111883133A (en) Customer service voice recognition method, customer service voice recognition device, customer service voice recognition server and storage medium
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
Fennir et al. Acoustic scene classification for speaker diarization

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20923083
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established
    Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.01.2023)

122 Ep: pct application non-entry in european phase
    Ref document number: 20923083
    Country of ref document: EP
    Kind code of ref document: A1