WO2021174757A1 - Voice emotion recognition method and apparatus, electronic device, and computer-readable storage medium - Google Patents
Voice emotion recognition method and apparatus, electronic device, and computer-readable storage medium
- Publication number
- WO2021174757A1 (PCT/CN2020/105543)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- emotion
- audio
- voice
- recognition model
- Prior art date
Links
- 230000008451 emotion Effects 0.000 title claims abstract description 155
- 238000000034 method Methods 0.000 title claims abstract description 46
- 239000011159 matrix material Substances 0.000 claims abstract description 55
- 230000008909 emotion recognition Effects 0.000 claims description 36
- 230000002996 emotional effect Effects 0.000 claims description 27
- 238000012549 training Methods 0.000 claims description 20
- 238000000605 extraction Methods 0.000 claims description 12
- 238000012545 processing Methods 0.000 claims description 12
- 238000004590 computer program Methods 0.000 claims description 8
- 238000010276 construction Methods 0.000 claims description 4
- 238000013473 artificial intelligence Methods 0.000 abstract description 2
- 239000013598 vector Substances 0.000 description 16
- 238000005516 engineering process Methods 0.000 description 6
- 230000006870 function Effects 0.000 description 6
- 238000010586 diagram Methods 0.000 description 4
- 230000003287 optical effect Effects 0.000 description 4
- 230000003044 adaptive effect Effects 0.000 description 2
- 238000010801 machine learning Methods 0.000 description 2
- 230000000737 periodic effect Effects 0.000 description 2
- 230000000644 propagated effect Effects 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 230000001133 acceleration Effects 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 210000004704 glottis Anatomy 0.000 description 1
- 230000036651 mood Effects 0.000 description 1
- 239000013307 optical fiber Substances 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000013526 transfer learning Methods 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
Definitions
- This application relates to the field of artificial intelligence technology, and in particular to a method, device, electronic device, and computer-readable storage medium for voice emotion recognition.
- Affective computing is an important technology that gives intelligent machines the ability to perceive, understand, and express various emotional states.
- With its development, emotion recognition in voice technology has also received more and more attention.
- Although existing voice emotion detection achieves good results, the inventor realized that, due to problems such as data-set quality and the subjectivity of emotion annotation, most models can only judge a single emotion, the range of emotions they can distinguish is small, and complex speech cannot be described accurately.
- For the hidden emotions in complex speech, it is difficult to determine the boundaries of the multiple emotions that may be contained in one utterance.
- An object of the present application is to provide a voice emotion recognition method, device, electronic device, and computer-readable storage medium.
- A voice emotion recognition method includes: when a user voice is received, extracting multiple types of audio features of the user voice; matching the audio features with feature samples in an emotion feature library to obtain the emotion labels corresponding to the feature samples matched by each audio feature; constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; inputting the feature label matrix into a multi-emotion recognition model to obtain a plurality of emotion sets and a scene label corresponding to each emotion set; and obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized voice emotion of the user.
- A voice emotion recognition device includes: an extraction module for extracting multiple types of audio features of a user voice when the user voice is received; a matching module for respectively matching the audio features with feature samples in an emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; a construction module for constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; a prediction module for inputting the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; and a determination module for obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized voice emotion of the user.
- an electronic device includes: a processor; and a memory for storing computer program instructions of the processor; wherein the processor is configured to execute the above method by executing the computer program instructions.
- a computer-readable storage medium has computer program instructions stored thereon, and when the computer program instructions are executed by a processor, the above method is implemented.
- When a user voice is received, multiple types of audio features of the user voice are extracted; the multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different angles, that is, they characterize the user's emotions from different perspectives. Then the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions represented by each feature vector can be obtained, which in turn can guide the identification of the user's various hidden emotions in the subsequent steps.
- Next, the feature label matrix is input into a multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and their corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix.
- Finally, the scene label matched by the voice scene of the user's voice is acquired, and the emotion set corresponding to the matched scene label is determined as the recognized voice emotion of the user; in this way, the speech emotion recognition result is obtained according to the real scene of the voice, and various potential emotions can be recognized efficiently and accurately from speech.
- Fig. 1 schematically shows a flow chart of a method for speech emotion recognition.
- Fig. 2 schematically shows an example diagram of an application scenario of a voice emotion recognition method.
- Fig. 3 schematically shows a flow chart of a feature extraction method.
- Fig. 4 schematically shows a block diagram of a voice emotion recognition device.
- Fig. 5 schematically shows an example block diagram of an electronic device for implementing the above-mentioned voice emotion recognition method.
- Fig. 6 schematically shows a computer-readable storage medium for implementing the aforementioned voice emotion recognition method.
- This exemplary embodiment first provides a voice emotion recognition method.
- The voice emotion recognition method can be run on a server, a server cluster, a cloud server, or the like.
- the voice emotion recognition method may include the following steps:
- Step S110: when a user voice is received, extract multiple types of audio features of the user voice;
- Step S120: respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
- Step S130: construct a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples;
- Step S140: input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each of the emotion sets;
- Step S150: obtain the scene label matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
- The multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different angles, that is, they characterize the user's emotions from different perspectives.
- The audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions represented by each feature vector can be obtained, which in turn can guide the identification of the user's various hidden emotions in the subsequent steps.
- The feature label matrix is input into a multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and their corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix.
- The scene label matched by the voice scene of the user's voice is acquired, and the emotion set corresponding to the matched scene label is determined as the recognized voice emotion of the user; in this way, the speech emotion recognition result is obtained according to the real scene of the voice, and various potential emotions can be recognized efficiently and accurately from speech.
- In step S110, when a user voice is received, multiple types of audio features of the user voice are extracted.
- Referring to Fig. 2, the server 201 receives the user voice sent by the server 202, and then the server 201 can extract multiple types of audio features of the user voice and perform emotion recognition in the subsequent steps.
- The server 201 can be any terminal capable of executing program instructions and storing data, such as a cloud server, a mobile phone, or a computer; the server 202 can be any terminal with a storage function, such as a mobile phone or a computer.
- The audio features can include: zero-crossing rate features, short-term energy features, short-term average amplitude difference features, pronunciation frame number features, pitch frequency features, formant features, harmonic-to-noise ratio features, Mel cepstrum coefficient features, and other audio characteristics. These features can be extracted from a piece of audio using existing audio feature extraction methods, for example as sketched below.
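- The sketch below is purely illustrative and not part of the application: the librosa library, the frame parameters, and the chosen subset of features are assumptions made for illustration only.

```python
# Minimal sketch, assuming librosa is available; the frame parameters and the
# selected feature subset are illustrative choices, not prescribed by the application.
import numpy as np
import librosa

def extract_audio_features(wav_path, frame_length=1024, hop_length=512):
    """Extract a few of the frame-level features named above from one utterance."""
    y, sr = librosa.load(wav_path, sr=None)
    return {
        # Zero-crossing rate per frame
        "zero_crossing_rate": librosa.feature.zero_crossing_rate(
            y, frame_length=frame_length, hop_length=hop_length)[0],
        # Short-term energy: sum of squared samples per frame
        "short_term_energy": np.array([
            np.sum(y[i:i + frame_length] ** 2)
            for i in range(0, max(len(y) - frame_length, 1), hop_length)]),
        # Pitch (fundamental frequency) track via the YIN estimator
        "pitch_frequency": librosa.yin(y, fmin=50, fmax=500, sr=sr),
        # Mel cepstrum coefficient (MFCC) features
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length),
    }
```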
- The extracted multiple types of audio features of the user's voice can reflect the change characteristics of the user's voice from different angles, that is, they can represent the user's emotions from different angles. For example, the short-term energy reflects the strength of the signal at different times and can therefore reflect how the stability of the user's emotion changes over a segment of speech. Audio also has periodic characteristics, and the short-term average amplitude difference makes these periodic characteristics easier to observe in the presence of steady noise, so it can reflect the periodicity of the user's mood within the speech. The formant arises from the resonance characteristics produced when the quasi-periodic pulses at the glottis excite the vocal tract, resulting in a set of resonance frequencies; this set of resonance frequencies is called the formant frequencies, or formants for short.
- The formant parameters, including the formant frequency and the width of the frequency band, are important parameters for distinguishing different finals (vowel endings) and can characterize the user's emotions from a linguistic perspective.
- the user's emotions can be analyzed based on the multiple types of audio features in the subsequent steps.
- Extracting the multiple types of audio features of the user voice includes:
- Step S310: when the user voice is received, convert the user voice into text;
- Step S320: match the text with the text samples in a feature extraction category database to obtain a text sample matching the text;
- Step S330: extract audio features of the multiple feature categories associated with the matched text sample from the user voice.
- When the user's voice is received, it is converted into text, so that the real content expressed by the user can be obtained. Then the converted text is matched with the text samples in the feature extraction category database to obtain a text sample that matches the converted text.
- The feature extraction category database stores, for texts with different semantic meanings, the categories of the audio features that most clearly reflect emotion when such texts are expressed. By then extracting from the user's voice the audio features of the multiple feature categories associated with the matched text sample, emotion recognition can be performed efficiently and accurately in the subsequent steps; a sketch of this lookup is given below.
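- In the sketch, the speech-to-text step of S310 is assumed to be handled by any ASR engine, and the feature extraction category database is modelled as a hypothetical in-memory table; none of these names are defined by the application.

```python
# Sketch under stated assumptions: the category database below is a hypothetical
# placeholder, not a structure defined by the application.
from difflib import SequenceMatcher

# Hypothetical feature extraction category database:
# (text sample, feature categories that clearly reflect emotion for such text)
FEATURE_CATEGORY_DB = [
    ("i want to cancel my order", ["short_term_energy", "pitch_frequency", "mfcc"]),
    ("thank you for your help",   ["zero_crossing_rate", "formant", "mfcc"]),
]

def select_feature_categories(user_voice_text):
    """Steps S320/S330: match the recognized text against the stored text samples
    and return the feature categories associated with the best-matching sample."""
    best_categories, best_score = [], 0.0
    for sample_text, categories in FEATURE_CATEGORY_DB:
        score = SequenceMatcher(None, user_voice_text.lower(), sample_text).ratio()
        if score > best_score:
            best_categories, best_score = categories, score
    return best_categories
```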
- The multiple types of audio features include at least three of the following: zero-crossing rate features, short-term energy features, short-term average amplitude difference features, pronunciation frame number features, pitch frequency features, formant features, harmonic-to-noise ratio features, and Mel cepstrum coefficient features.
- Because at least three of these feature types are used, multiple-emotion recognition can be realized with high accuracy.
- In step S120, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each of the audio features.
- the emotion feature library stores feature samples of audio features of various categories, and each feature sample is associated with a category of emotion label.
- The audio features are matched with the feature samples in the emotion feature library, and the similarity between an audio feature and a feature sample can be calculated using, for example, the Euclidean distance or the Hamming distance. The emotion labels corresponding to the multiple feature samples that match each audio feature (for example, feature samples with a similarity greater than 50%) are then obtained, so that the multiple candidate emotions represented by each feature vector can be obtained, which guides the identification of the user's various hidden emotions in the subsequent steps.
- The step of respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples matching each of the audio features includes:
- the predetermined threshold can be set according to accuracy requirements.
- The predetermined threshold corresponds to the number of audio features; that is, its value is determined by the number of audio features: the more audio features there are, the smaller the predetermined threshold. In this way, by respectively comparing the audio features with the feature samples in the emotion feature library, multiple feature samples whose similarity to each audio feature exceeds the predetermined threshold are obtained, and the emotion label corresponding to each of those feature samples is then obtained from the emotion feature library, which ensures the reliability of the emotion labels obtained for each audio feature. A sketch of this matching step follows.
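- The sketch assumes each audio feature has been summarized as a fixed-length vector, derives similarity from the Euclidean distance, and uses an illustrative threshold formula in which more feature types yield a smaller threshold; none of these specifics are fixed by the application.

```python
# Sketch only: the distance-to-similarity mapping and the threshold formula are
# illustrative assumptions; the application only requires that more feature
# types lead to a smaller predetermined threshold.
import numpy as np

def similarity(a, b):
    """Map the Euclidean distance between two fixed-length vectors to (0, 1]."""
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(a) - np.asarray(b)))

def match_emotion_labels(audio_features, emotion_feature_library, n_feature_types):
    """Return, per audio feature, the (emotion label, similarity) pairs of all
    library samples whose similarity exceeds the predetermined threshold."""
    threshold = max(0.3, 0.8 - 0.05 * n_feature_types)  # more feature types -> smaller threshold
    matches = {}
    for name, feature in audio_features.items():
        matched = []
        for sample in emotion_feature_library.get(name, []):
            score = similarity(feature, sample["vector"])
            if score > threshold:
                matched.append((sample["emotion_label"], score))
        matches[name] = matched
    return matches
```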
- In step S130, a feature label matrix of the user voice is constructed based on the audio features and the emotion labels corresponding to the matched feature samples.
- The feature label matrix stores the audio features of the user's voice together with the corresponding emotion labels, i.e. the emotion labels reflecting the possible emotions expressed by those audio features. The different types of audio features and the emotion labels of the different possibilities embodied by their respective similar feature samples are structurally linked through the feature label matrix; the emotion labels thus form constraints on the combination of audio features, and the matrix can reflect the underlying pattern of potential emotion changes.
- Constructing the feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples includes:
- The emotion labels corresponding to each audio feature are added to the column corresponding to that audio feature, in descending order of the similarity between each feature sample and the audio feature, to obtain the feature label matrix, wherein each row of the matrix corresponds to a similarity range.
- Each audio feature is added to the first row of the empty matrix, and then each column corresponds to an audio feature.
- The emotion labels corresponding to each audio feature are added to that audio feature's column in order of the similarity between each feature sample and the audio feature, to obtain the feature label matrix. For example, if the similarity between audio feature A and feature sample A1 is 63%, the emotion label corresponding to feature sample A1 is added to the row covering the 60%-70% interval of the column in which audio feature A is located.
- Each row of the matrix corresponds to a similarity range, for example a row covering the 60%-70% similarity range; a sketch of this construction follows.
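- The sketch below assumes 10% similarity bands; the band width is an illustrative choice, not fixed by the application.

```python
# Sketch only: one column per audio feature, one row per similarity band,
# each cell holding the emotion labels whose matched samples fall in that band.
def build_feature_label_matrix(matches, band_width=10):
    """`matches` maps feature name -> list of (emotion_label, similarity in [0, 1])."""
    feature_names = list(matches)                        # column order
    n_bands = 100 // band_width                          # e.g. 10 rows: 0-10%, ..., 90-100%
    matrix = [[[] for _ in feature_names] for _ in range(n_bands)]

    for col, name in enumerate(feature_names):
        for label, sim in matches[name]:
            row = min(int(sim * 100) // band_width, n_bands - 1)
            matrix[row][col].append(label)               # e.g. 63% lands in the 60%-70% row
    return feature_names, matrix
```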
- In step S140, the feature label matrix is input into a multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each of the emotion sets.
- the multi-emotion recognition model is a pre-trained machine learning model that can recognize multiple emotions at once.
- The feature label matrix is input into the multi-emotion recognition model; based on the structured label matrix, the constraints among the multiple types of audio features allow the machine learning model to readily compute the possible emotions of the user's voice, obtain multiple emotion combinations, and predict multiple emotion sets of the user's voice together with the scene label of the possible scenario for each emotion set (such as a shopping scene or a chat scene). In this way, multiple possible scenes and their corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix through the multi-emotion recognition model, as sketched below.
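- The following is an illustration only: `model.predict` and its output format are assumptions made for the sketch, not an API defined by the application.

```python
# Sketch only: the multi-emotion recognition model is treated here as a
# multi-label classifier scoring (scene label, emotion) pairs; emotions scoring
# above a cut-off are grouped into one emotion set per scene label.
def predict_emotion_sets(model, feature_label_matrix, cut_off=0.5):
    """Return {scene_label: set of emotions} predicted from the matrix."""
    scores = model.predict(feature_label_matrix)   # assumed: {(scene, emotion): probability}
    emotion_sets = {}
    for (scene, emotion), p in scores.items():
        if p >= cut_off:
            emotion_sets.setdefault(scene, set()).add(emotion)
    return emotion_sets
```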
- the method for constructing the multi-emotion recognition model includes:
- A multi-layer fully connected layer is added as a classifier on top of the pre-training model to obtain a recognition model, and the multi-emotion recognition model is obtained by training this recognition model with the labeled speech emotion data set.
- First, the ResNet-34 model is trained using the AISHELL Chinese voiceprint database.
- The first n layers of this network are then taken out as the pre-training model.
- A multi-layer fully connected layer is appended as the classifier.
- Finally, the labeled speech emotion data set is used to train this model to obtain the final model.
- During training, the ratio of positive to negative samples can be calculated in each training batch and used as the weighting matrix of the loss function, so that the model pays more attention to the small-sample data, which improves the model's accuracy. A sketch of this model construction, including the batch-wise loss weighting, is given below.
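- In the sketch, a torchvision ResNet-34 stands in for the network pre-trained on the AISHELL Chinese voiceprint database, and the layer sizes, the rendering of the feature label matrix as model input, and the loss-weighting formula are assumptions made for illustration.

```python
# Sketch only: hyper-parameters, the input rendering of the feature label
# matrix, and the weighting formula are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34

class MultiEmotionRecognizer(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        backbone = resnet34(weights=None)   # stand-in for the AISHELL pre-trained network
        # "first n layers" of the pre-trained network reused as the pre-training model
        self.pretrained = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # multi-layer fully connected classifier
        self.classifier = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, x):
        # x: feature label matrix rendered as a 3-channel 2-D tensor (assumption)
        h = self.pool(self.pretrained(x)).flatten(1)
        return self.classifier(h)

def weighted_bce_loss(logits, targets):
    """Weight positives by the per-batch negative/positive ratio so that
    small-sample (rare) labels contribute more to the loss."""
    pos = targets.sum(dim=0).clamp(min=1.0)   # positives per label in this batch
    neg = targets.size(0) - pos               # negatives per label in this batch
    pos_weight = (neg / pos).clamp(min=1.0)
    return F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)
```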
- The first multi-emotion recognition model and the second multi-emotion recognition model are initialized at the same time; raw data, both labeled and unlabeled, is input into the first multi-emotion recognition model to obtain a first prediction value, and the classification error loss value of the labeled data portion is obtained;
- the first multi-emotion recognition model is then updated using the sum of the classification error loss value and the consistency loss value.
- That is, the original model can be improved by means of the Mean-Teacher semi-supervised learning method, so that a large amount of unlabeled data can be reused.
- the moving average can make the model more robust on the test data.
- Specifically, the noise-added data is input into the teacher model (Model_teacher) to obtain the predicted value P_teacher; the error between P_teacher and the student prediction P_student is calculated as the consistency loss value loss_consistency; and the first multi-emotion recognition model, the student model (Model_student), is updated with the loss value loss_classification + loss_consistency.
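- A minimal sketch of one Mean-Teacher training step along these lines follows; the noise model, the MSE consistency measure, and the EMA decay are assumptions, and both models are assumed to start from identical weights (e.g. the teacher as a deep copy of the student).

```python
# Sketch only: noise_std, ema_decay, and the MSE consistency measure are
# illustrative assumptions, not values fixed by the application.
import torch
import torch.nn.functional as F

def mean_teacher_step(student, teacher, optimizer, x, labels, labeled_mask,
                      noise_std=0.1, ema_decay=0.99):
    # Student (first model) prediction on the raw labelled + unlabelled batch
    p_student = student(x)
    # Classification error loss on the labelled portion only
    loss_classification = F.cross_entropy(p_student[labeled_mask], labels[labeled_mask])

    # Teacher (second model) prediction on the noise-added data
    with torch.no_grad():
        p_teacher = teacher(x + noise_std * torch.randn_like(x))
    # Consistency loss between student and teacher predictions
    loss_consistency = F.mse_loss(p_student.softmax(dim=1), p_teacher.softmax(dim=1))

    # Update the student with loss_classification + loss_consistency
    loss = loss_classification + loss_consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher weights track the student by exponential moving average
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(ema_decay).add_(s_p, alpha=1.0 - ema_decay)
    return loss.item()
```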
- In this way, transfer learning and semi-supervised learning techniques can be used to effectively improve the classification performance of the model on small data sets, and also alleviate model overfitting to a certain extent.
- Therefore, the scheme can not only accurately detect the emotions overtly expressed in the voice, but also accurately identify a variety of potential emotions, improving and extending speech emotion recognition technology.
- In step S150, the scene label matched by the voice scene of the user's voice is acquired, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
- the scene of the user's voice can be determined by pre-calibrating or locating the voice source (such as customer service voice).
- the emotion set corresponding to the scene tag matched by the scene of the user's voice is determined as the recognized user's voice emotion, so as to ensure the accuracy of the recognition boundary, which can further ensure the accuracy of the emotion recognition of the user's voice.
- In this way, a speech emotion recognition result that matches the real scene of the voice is obtained; a minimal sketch of this final selection step follows.
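- A minimal sketch of this final selection, assuming the scene tag for the current voice has already been resolved by pre-calibration or from the voice source; the labels shown are hypothetical.

```python
# Sketch only: emotion_sets is the {scene label: emotion set} mapping produced
# in step S140; the scene tag is assumed to be known for the current voice.
def recognize_voice_emotion(emotion_sets, voice_scene_tag):
    return emotion_sets.get(voice_scene_tag, set())

# Example (hypothetical labels):
#   emotion_sets = {"shopping": {"impatient", "angry"}, "chat": {"calm"}}
#   recognize_voice_emotion(emotion_sets, "shopping") -> {"impatient", "angry"}
```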
- the application also provides a voice emotion recognition device.
- The voice emotion recognition device may include an extraction module 410, a matching module 420, a construction module 430, a prediction module 440, and a determination module 450, wherein:
- the extraction module 410 may be used to extract multiple types of audio feature vectors of the user voice when the user voice is received;
- the matching module 420 may be configured to respectively match the audio feature vector with feature vector samples in the emotion feature library to obtain an emotion label corresponding to each feature vector sample that matches the audio feature vector;
- the construction module 430 may be configured to construct a vector label matrix of the user voice based on the audio feature vector and the corresponding emotion label of the matched feature vector sample;
- the prediction module 440 may be used to input the vector label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set;
- the determining module 450 may be used to obtain a scene tag matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
- Although modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
- the features and functions of two or more modules or units described above may be embodied in one module or unit.
- the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
- The example embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a mobile terminal, a network device, etc.) execute the method according to the embodiments of the present application.
- an electronic device capable of implementing the above method is also provided.
- the electronic device 500 according to this embodiment of the present invention will be described below with reference to FIG. 5.
- the electronic device 500 shown in FIG. 5 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
- the electronic device 500 is represented in the form of a general-purpose computing device.
- the components of the electronic device 500 may include, but are not limited to: the aforementioned at least one processing unit 510, the aforementioned at least one storage unit 520, and a bus 530 connecting different system components (including the storage unit 520 and the processing unit 510).
- The storage unit stores program code, and the program code can be executed by the processing unit 510, so that the processing unit 510 executes the steps of the various exemplary embodiments described in the "Exemplary Method" section of this specification.
- For example, the processing unit 510 may perform the steps shown in FIG. 1:
- Step S110: when a user voice is received, extract multiple types of audio features of the user voice;
- Step S120: respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
- Step S130: construct a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples;
- Step S140: input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each of the emotion sets;
- Step S150: obtain the scene label matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
- the storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 5201 and/or a cache storage unit 5202, and may further include a read-only storage unit (ROM) 5203.
- The storage unit 520 may also include a program/utility tool 5204 having a set of (at least one) program modules 5205.
- The program modules 5205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
- The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
- The electronic device 500 can also communicate with one or more external devices 700 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable users to interact with the electronic device 500, and/or with any device (such as a router or modem) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 550; a display unit 540 may also be connected to the input/output (I/O) interface 550.
- the electronic device 500 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 560.
- the network adapter 560 communicates with other modules of the electronic device 500 through the bus 530.
- Other hardware and/or software modules can be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
- The example embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a terminal device, a network device, etc.) execute the method according to the embodiments of the present application.
- a computer-readable storage medium is also provided.
- The computer-readable storage medium may be non-volatile or volatile, and has stored thereon a program product capable of implementing the above-mentioned methods of this specification.
- In some possible implementations, various aspects of the present invention may also be implemented in the form of a program product, which includes program code.
- When the program product runs on a terminal device, the program code is used to enable the terminal device to execute the steps according to the various exemplary embodiments of the present invention described in the above-mentioned "Exemplary Method" section of this specification.
- A program product 600 for implementing the above method according to an embodiment of the present invention is described. It can adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer.
- the program product of the present invention is not limited to this.
- the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or combined with an instruction execution system, device, or device.
- the program product can use any combination of one or more readable media.
- the readable medium may be a readable signal medium or a readable storage medium.
- The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, in which readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
- the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
- the program code used to perform the operations of the present invention can be written in any combination of one or more programming languages.
- The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- The program code can be executed entirely on the client computing device, partly on the client device, as an independent software package, partly on the client computing device and partly on a remote computing device, or entirely on a remote computing device or server.
- The remote computing device can be connected to the client computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, through the Internet using an Internet service provider).
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- User Interface Of Digital Computer (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application relates to the technical field of artificial intelligence, and provides a voice emotion recognition method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises the following steps: when a user voice is received, extracting multiple types of audio features of the user voice; matching the audio features with feature samples in an emotion feature library, and obtaining an emotion label corresponding to the feature sample that matches each audio feature; constructing a feature label matrix of the user voice according to the audio features and the emotion labels corresponding to the matched feature samples; inputting the feature label matrix into a multi-emotion recognition model, and obtaining a plurality of emotion sets and the scene labels corresponding to the emotion sets; and acquiring a scene label that matches the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized emotion of the user's voice. According to the present application, various potential emotions can be recognized efficiently and accurately from a voice.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010138561.3 | 2020-03-03 | ||
CN202010138561.3A CN111429946A (zh) | 2020-03-03 | 2020-03-03 | 语音情绪识别方法、装置、介质及电子设备 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021174757A1 true WO2021174757A1 (fr) | 2021-09-10 |
Family
ID=71551972
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/105543 WO2021174757A1 (fr) | 2020-03-03 | 2020-07-29 | Procédé et appareil de reconnaissance d'émotions dans la voix, dispositif électronique et support de stockage lisible par ordinateur |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111429946A (fr) |
WO (1) | WO2021174757A1 (fr) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113889150A (zh) * | 2021-10-15 | 2022-01-04 | 北京工业大学 | 语音情感识别方法及装置 |
CN113903363A (zh) * | 2021-09-29 | 2022-01-07 | 平安银行股份有限公司 | 基于人工智能的违规行为检测方法、装置、设备及介质 |
CN114121041A (zh) * | 2021-11-19 | 2022-03-01 | 陈文琪 | 一种基于智伴机器人智能陪伴方法及系统 |
CN114153956A (zh) * | 2021-11-22 | 2022-03-08 | 深圳市北科瑞声科技股份有限公司 | 多意图识别方法、装置、设备及介质 |
CN114169440A (zh) * | 2021-12-08 | 2022-03-11 | 北京百度网讯科技有限公司 | 模型训练方法、数据处理方法、装置、电子设备及介质 |
CN114464210A (zh) * | 2022-02-15 | 2022-05-10 | 游密科技(深圳)有限公司 | 声音处理方法、装置、计算机设备、存储介质 |
CN114565964A (zh) * | 2022-03-03 | 2022-05-31 | 网易(杭州)网络有限公司 | 情绪识别模型的生成方法、识别方法、装置、介质和设备 |
CN114912502A (zh) * | 2021-12-28 | 2022-08-16 | 天翼数字生活科技有限公司 | 一种基于表情与语音的双模态深度半监督情感分类方法 |
CN115113781A (zh) * | 2022-06-28 | 2022-09-27 | 广州博冠信息科技有限公司 | 互动图标显示方法、装置、介质与电子设备 |
CN115414042A (zh) * | 2022-09-08 | 2022-12-02 | 北京邮电大学 | 基于情感信息辅助的多模态焦虑检测方法及装置 |
CN115460166A (zh) * | 2022-09-06 | 2022-12-09 | 网易(杭州)网络有限公司 | 即时语音通信方法、装置、电子设备及存储介质 |
CN116306686A (zh) * | 2023-05-22 | 2023-06-23 | 中国科学技术大学 | 一种多情绪指导的共情对话生成方法 |
CN116564281A (zh) * | 2023-07-06 | 2023-08-08 | 世优(北京)科技有限公司 | 基于ai的情绪识别方法及装置 |
CN114666618B (zh) * | 2022-03-15 | 2023-10-13 | 广州欢城文化传媒有限公司 | 音频审核方法、装置、设备及可读存储介质 |
WO2024040793A1 (fr) * | 2022-08-26 | 2024-02-29 | 天翼电子商务有限公司 | Procédé de reconnaissance d'émotion multimodale combiné à une politique hiérarchique |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111429946A (zh) * | 2020-03-03 | 2020-07-17 | 深圳壹账通智能科技有限公司 | 语音情绪识别方法、装置、介质及电子设备 |
CN112017670B (zh) * | 2020-08-13 | 2021-11-02 | 北京达佳互联信息技术有限公司 | 一种目标账户音频的识别方法、装置、设备及介质 |
CN112423106A (zh) * | 2020-11-06 | 2021-02-26 | 四川长虹电器股份有限公司 | 一种自动翻译伴音的方法及系统 |
CN112466324A (zh) * | 2020-11-13 | 2021-03-09 | 上海听见信息科技有限公司 | 一种情绪分析方法、系统、设备及可读存储介质 |
CN113806586B (zh) * | 2021-11-18 | 2022-03-15 | 腾讯科技(深圳)有限公司 | 数据处理方法、计算机设备以及可读存储介质 |
CN114093389B (zh) * | 2021-11-26 | 2023-03-28 | 重庆凡骄网络科技有限公司 | 语音情绪识别方法、装置、电子设备和计算机可读介质 |
CN114242070B (zh) * | 2021-12-20 | 2023-03-24 | 阿里巴巴(中国)有限公司 | 一种视频生成方法、装置、设备及存储介质 |
CN115374418B (zh) * | 2022-08-31 | 2024-09-03 | 中国电信股份有限公司 | 情绪鉴权方法、情绪鉴权装置、存储介质及电子设备 |
CN115547308B (zh) * | 2022-09-01 | 2024-09-20 | 北京达佳互联信息技术有限公司 | 一种音频识别模型训练方法、音频识别方法、装置、电子设备及存储介质 |
CN115460317A (zh) * | 2022-09-05 | 2022-12-09 | 西安万像电子科技有限公司 | 一种情绪识别及语音反馈方法、装置、介质及电子设备 |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109961776A (zh) * | 2017-12-18 | 2019-07-02 | 上海智臻智能网络科技股份有限公司 | 语音信息处理装置 |
CN110288974B (zh) * | 2018-03-19 | 2024-04-05 | 北京京东尚科信息技术有限公司 | 基于语音的情绪识别方法及装置 |
CN109885713A (zh) * | 2019-01-03 | 2019-06-14 | 刘伯涵 | 基于语音情绪识别的表情图像推荐方法以及装置 |
CN110120231B (zh) * | 2019-05-15 | 2021-04-02 | 哈尔滨工业大学 | 基于自适应半监督非负矩阵分解的跨语料情感识别方法 |
-
2020
- 2020-03-03 CN CN202010138561.3A patent/CN111429946A/zh active Pending
- 2020-07-29 WO PCT/CN2020/105543 patent/WO2021174757A1/fr active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014062521A1 (fr) * | 2012-10-19 | 2014-04-24 | Sony Computer Entertainment Inc. | Reconnaissance d'émotion à l'aide de repères d'attention auditive extraits de la voix d'utilisateurs |
CN108363706A (zh) * | 2017-01-25 | 2018-08-03 | 北京搜狗科技发展有限公司 | 人机对话交互的方法和装置、用于人机对话交互的装置 |
CN108922564A (zh) * | 2018-06-29 | 2018-11-30 | 北京百度网讯科技有限公司 | 情绪识别方法、装置、计算机设备及存储介质 |
CN109784414A (zh) * | 2019-01-24 | 2019-05-21 | 出门问问信息科技有限公司 | 一种电话客服中客户情绪检测方法、装置及电子设备 |
CN110136723A (zh) * | 2019-04-15 | 2019-08-16 | 深圳壹账通智能科技有限公司 | 基于语音信息的数据处理方法及装置 |
CN111429946A (zh) * | 2020-03-03 | 2020-07-17 | 深圳壹账通智能科技有限公司 | 语音情绪识别方法、装置、介质及电子设备 |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113903363A (zh) * | 2021-09-29 | 2022-01-07 | 平安银行股份有限公司 | 基于人工智能的违规行为检测方法、装置、设备及介质 |
CN113889150B (zh) * | 2021-10-15 | 2023-08-29 | 北京工业大学 | 语音情感识别方法及装置 |
CN113889150A (zh) * | 2021-10-15 | 2022-01-04 | 北京工业大学 | 语音情感识别方法及装置 |
CN114121041A (zh) * | 2021-11-19 | 2022-03-01 | 陈文琪 | 一种基于智伴机器人智能陪伴方法及系统 |
CN114121041B (zh) * | 2021-11-19 | 2023-12-08 | 韩端科技(深圳)有限公司 | 一种基于智伴机器人智能陪伴方法及系统 |
CN114153956A (zh) * | 2021-11-22 | 2022-03-08 | 深圳市北科瑞声科技股份有限公司 | 多意图识别方法、装置、设备及介质 |
CN114169440A (zh) * | 2021-12-08 | 2022-03-11 | 北京百度网讯科技有限公司 | 模型训练方法、数据处理方法、装置、电子设备及介质 |
CN114912502B (zh) * | 2021-12-28 | 2024-03-29 | 天翼数字生活科技有限公司 | 一种基于表情与语音的双模态深度半监督情感分类方法 |
CN114912502A (zh) * | 2021-12-28 | 2022-08-16 | 天翼数字生活科技有限公司 | 一种基于表情与语音的双模态深度半监督情感分类方法 |
CN114464210A (zh) * | 2022-02-15 | 2022-05-10 | 游密科技(深圳)有限公司 | 声音处理方法、装置、计算机设备、存储介质 |
CN114565964A (zh) * | 2022-03-03 | 2022-05-31 | 网易(杭州)网络有限公司 | 情绪识别模型的生成方法、识别方法、装置、介质和设备 |
CN114666618B (zh) * | 2022-03-15 | 2023-10-13 | 广州欢城文化传媒有限公司 | 音频审核方法、装置、设备及可读存储介质 |
CN115113781A (zh) * | 2022-06-28 | 2022-09-27 | 广州博冠信息科技有限公司 | 互动图标显示方法、装置、介质与电子设备 |
WO2024040793A1 (fr) * | 2022-08-26 | 2024-02-29 | 天翼电子商务有限公司 | Procédé de reconnaissance d'émotion multimodale combiné à une politique hiérarchique |
CN115460166A (zh) * | 2022-09-06 | 2022-12-09 | 网易(杭州)网络有限公司 | 即时语音通信方法、装置、电子设备及存储介质 |
CN115414042A (zh) * | 2022-09-08 | 2022-12-02 | 北京邮电大学 | 基于情感信息辅助的多模态焦虑检测方法及装置 |
CN116306686B (zh) * | 2023-05-22 | 2023-08-29 | 中国科学技术大学 | 一种多情绪指导的共情对话生成方法 |
CN116306686A (zh) * | 2023-05-22 | 2023-06-23 | 中国科学技术大学 | 一种多情绪指导的共情对话生成方法 |
CN116564281B (zh) * | 2023-07-06 | 2023-09-05 | 世优(北京)科技有限公司 | 基于ai的情绪识别方法及装置 |
CN116564281A (zh) * | 2023-07-06 | 2023-08-08 | 世优(北京)科技有限公司 | 基于ai的情绪识别方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
CN111429946A (zh) | 2020-07-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021174757A1 (fr) | Procédé et appareil de reconnaissance d'émotions dans la voix, dispositif électronique et support de stockage lisible par ordinateur | |
CN109036384B (zh) | 语音识别方法和装置 | |
CN106683680B (zh) | 说话人识别方法及装置、计算机设备及计算机可读介质 | |
CN108428446A (zh) | 语音识别方法和装置 | |
CN112885336B (zh) | 语音识别系统的训练、识别方法、装置、电子设备 | |
EP3940693A1 (fr) | Procédé et appareil de vérification d'informations basée sur une interaction vocale et dispositif et support de stockage sur ordinateur | |
CN117115581A (zh) | 一种基于多模态深度学习的智能误操作预警方法及系统 | |
CN111653274B (zh) | 唤醒词识别的方法、装置及存储介质 | |
US11989514B2 (en) | Identifying high effort statements for call center summaries | |
CN114399995A (zh) | 语音模型的训练方法、装置、设备及计算机可读存储介质 | |
CN114330371A (zh) | 基于提示学习的会话意图识别方法、装置和电子设备 | |
US11322151B2 (en) | Method, apparatus, and medium for processing speech signal | |
CN110647613A (zh) | 一种课件构建方法、装置、服务器和存储介质 | |
CN111966798A (zh) | 一种基于多轮K-means算法的意图识别方法、装置和电子设备 | |
CN112735432B (zh) | 音频识别的方法、装置、电子设备及存储介质 | |
CN113593523B (zh) | 基于人工智能的语音检测方法、装置及电子设备 | |
CN115116443A (zh) | 语音识别模型的训练方法、装置、电子设备及存储介质 | |
CN113763939B (zh) | 基于端到端模型的混合语音识别系统及方法 | |
CN113555005B (zh) | 模型训练、置信度确定方法及装置、电子设备、存储介质 | |
CN115512692A (zh) | 语音识别方法、装置、设备及存储介质 | |
CN112951270B (zh) | 语音流利度检测的方法、装置和电子设备 | |
Sartiukova et al. | Remote Voice Control of Computer Based on Convolutional Neural Network | |
CN111883133A (zh) | 客服语音识别方法、装置、服务器及存储介质 | |
Fennir et al. | Acoustic scene classification for speaker diarization | |
CN113793598B (zh) | 语音处理模型的训练方法和数据增强方法、装置及设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20923083; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.01.2023) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20923083; Country of ref document: EP; Kind code of ref document: A1 |