WO2021232594A1 - Speech emotion recognition method and apparatus, electronic device, and storage medium - Google Patents

Speech emotion recognition method and apparatus, electronic device, and storage medium

Info

Publication number
WO2021232594A1
WO2021232594A1 (PCT/CN2020/106010)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voiceprint
speech
fused
frequency
Prior art date
Application number
PCT/CN2020/106010
Other languages
English (en)
Chinese (zh)
Inventor
王德勋
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021232594A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to an artificial intelligence-based voice emotion recognition method, device, electronic equipment, and computer-readable storage medium.
  • the present application provides a method, device, electronic device, and computer-readable storage medium for voice emotion recognition, the main purpose of which is to improve the recognition ability of voice emotion recognition.
  • a voice emotion recognition method including:
  • Using a pre-trained voice emotion detection model to perform voice emotion detection on the labeled fusion voiceprint set to obtain a voice emotion detection result.
  • a voice emotion recognition device includes:
  • the segmentation module is used to receive voice data, segment the voice segment of the voice data, and mark the voice segmentation point in the voice segment;
  • An extraction module configured to extract the characteristic voiceprint of the voice segment according to the voice segmentation point, and generate a characteristic voiceprint set
  • the fusion module is used to fuse the same characteristic voiceprints in the characteristic voiceprint set to obtain a fused voiceprint set;
  • An identification module configured to identify the user information corresponding to the fused voiceprint in the fused voiceprint set, and mark the user information into the corresponding fused voiceprint;
  • the detection module is configured to use the pre-trained voice emotion detection model to perform voice emotion detection on the labeled fusion voiceprint set to obtain a voice emotion detection result.
  • An electronic device which includes:
  • Memory storing at least one instruction
  • the processor executes the instructions stored in the memory to implement the following steps:
  • Using a pre-trained voice emotion detection model to perform voice emotion detection on the labeled fusion voiceprint set to obtain a voice emotion detection result.
  • a computer-readable storage medium storing at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the following steps:
  • Using a pre-trained voice emotion detection model to perform voice emotion detection on the labeled fusion voiceprint set to obtain a voice emotion detection result.
  • FIG. 1 is a schematic flowchart of a voice emotion recognition method provided by an embodiment of this application
  • FIG. 2 is a schematic diagram of modules of a voice emotion recognition method provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of the internal structure of an electronic device of a voice emotion recognition method provided by an embodiment of the application.
  • the execution subject of the voice emotion recognition method provided in the embodiments of the present application includes, but is not limited to, at least one of the electronic devices that can be configured to execute the method provided in the embodiments of the present application, such as a server and a terminal.
  • the voice emotion recognition method can be executed by software or hardware installed on a terminal device or a server device, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, etc.
  • This application provides a voice emotion recognition method.
  • FIG. 1 it is a schematic flowchart of a voice emotion recognition method provided by an embodiment of this application.
  • the voice emotion recognition method includes:
  • S1. Receive voice data, segment the voice segment of the voice data, and mark the voice segmentation point in the voice segment.
  • the voice data includes voices made by one or more users in different scenarios.
  • the scenarios may be: kitchens, conference rooms, and gymnasiums.
  • Since the voice data includes different sounds made by the user at different time periods, the voice data may carry different voice emotions at different time periods. Therefore, the present application segments the voice data into voice segments in order to recognize the voice emotion of the user's voice in different time periods, which improves the accuracy of subsequent voice emotion recognition. It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned voice segments, they may also be stored in a node of a blockchain.
  • Segmenting the voice segment from the voice data includes:
  • Obtaining the voice signal of the voice data, performing framing processing on the voice signal to obtain the voice sequence of the voice signal, calculating the voice frequency of the voice sequence, and intercepting the voice signal whose voice frequency falls within a preset range as the voice segment.
  • the voice signal is subjected to framing processing by using an overlapping segmentation method.
  • the following method is used to calculate the speech frequency of the speech sequence:
  • B(f) represents the speech frequency
  • f represents the expected speech frequency of the speech sequence.
  • the preset range is a voice frequency of 0-50 Hz.
  • the voice segmentation points are marked in the intercepted voice fragments to speed up the query of subsequent voice fragments, thereby improving the timeliness of voice emotion detection.
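  • As an illustration of this segmentation step, the following is a minimal sketch assuming NumPy: the signal is framed with overlapping windows, a per-frame frequency is estimated, and only frames whose frequency falls within the preset range are kept as voice segments, with their start positions kept as segmentation points. The frame and hop lengths and the dominant-frequency estimate are illustrative assumptions; only the 0-50 Hz preset range is taken from the embodiment above.

```python
import numpy as np

def segment_speech(signal: np.ndarray, sr: int,
                   frame_len: float = 0.025, hop_len: float = 0.010,
                   freq_range: tuple = (0.0, 50.0)):
    """Frame the signal with overlapping windows and keep in-range frames.

    Returns the kept frames (voice segments) and their start-sample indices
    (the marked voice segmentation points).
    """
    frame = int(frame_len * sr)
    hop = int(hop_len * sr)
    segments, points = [], []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = signal[start:start + frame]
        # crude per-frame frequency estimate: peak of the magnitude spectrum
        spectrum = np.abs(np.fft.rfft(chunk))
        freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
        dominant = freqs[np.argmax(spectrum)]
        if freq_range[0] <= dominant <= freq_range[1]:
            segments.append(chunk)
            points.append(start)  # mark the voice segmentation point
    return segments, points
```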
  • the characteristic voiceprint is used to characterize the characteristic sound of the speech segment. According to the extraction of the characteristic voiceprint, the time for subsequent voice emotion detection can be reduced and the efficiency of voice emotion detection can be improved.
  • the extracting the characteristic voiceprint of the voice segment according to the voice segmentation point to generate a characteristic voiceprint set includes:
  • The voice frequency of the corresponding voice segment is obtained, the dimensional parameter of the voice frequency is calculated, the voiceprint features of the standard voice data are generated according to the dimensional parameter, and the characteristic voiceprint set is obtained according to the voiceprint features.
  • the dimensional parameters include: intonation value, speech rate value, etc.
  • the voiceprint characteristics include: peace, coherence, sweetness, and the like.
  • the following method is used to calculate the dimensional parameters of the voice frequency:
  • d(n) represents the dimensional parameter of the speech frequency
  • i represents the frame rate of the speech frequency
  • n represents the amplitude of the speech frequency
  • B(f) represents the speech frequency
  • k represents the number of preceding and following speech frames combined with the current speech frame; its value is usually 2, which represents a linear combination of the current speech frame with the two preceding and two following speech frames.
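  • The variables listed above (a frame index, the per-frame speech frequency B(f), and a window of k preceding and following frames with k usually set to 2) match the shape of a standard delta-feature computation. The sketch below implements that standard formula over per-frame features; it is offered as an assumption, not as the patent's verbatim equation.

```python
import numpy as np

def delta_parameters(frames: np.ndarray, k: int = 2) -> np.ndarray:
    """Delta-style dimensional parameters over per-frame features.

    frames: array of shape (num_frames, num_features), e.g. per-frame speech
    frequencies. With k = 2, each output combines the current frame with the
    two preceding and two following frames, as described above.
    """
    num_frames = frames.shape[0]
    denom = 2 * sum(i * i for i in range(1, k + 1))
    padded = np.pad(frames, ((k, k), (0, 0)), mode="edge")  # repeat edge frames
    deltas = np.zeros_like(frames, dtype=float)
    for n in range(num_frames):
        acc = np.zeros(frames.shape[1])
        for i in range(1, k + 1):
            acc += i * (padded[n + k + i] - padded[n + k - i])
        deltas[n] = acc / denom
    return deltas
```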
  • The same characteristic voiceprints in the characteristic voiceprint set are merged, that is, identical characteristic voiceprints are fused into one, so that the sounds made by multiple users are adapted to the same scene, which improves the recognition ability of subsequent voice emotion detection.
  • the currently known k-means algorithm is used to fuse the same feature voiceprints in the feature voiceprint set to obtain a fused voiceprint set.
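  • A minimal sketch of this fusion step with the k-means algorithm from scikit-learn is shown below, assuming each characteristic voiceprint is a fixed-length feature vector; the number of clusters and the choice of averaging each cluster into a single fused voiceprint are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def fuse_voiceprints(feature_voiceprints: np.ndarray, n_clusters: int = 4):
    """Cluster similar characteristic voiceprints and merge each cluster.

    feature_voiceprints: (num_voiceprints, num_features) array.
    Returns one fused voiceprint per cluster (the cluster mean) and the
    cluster label assigned to each characteristic voiceprint.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(feature_voiceprints)
    fused = np.stack([
        feature_voiceprints[labels == c].mean(axis=0)
        for c in range(n_clusters)
    ])
    return fused, labels
```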
  • Because the fused voiceprints in the fused voiceprint set contain voices from different users, if voice emotion detection were performed directly on the fused voiceprint set, the user information corresponding to each fused voiceprint could not be determined, which would limit the recognition effect of voice emotion detection. Therefore, in the embodiment of this application, the user information corresponding to each fused voiceprint in the fused voiceprint set is identified, and the user information is marked into the corresponding fused voiceprint, which enhances the recognition effect of subsequent voice emotion detection.
  • the following method is used to identify the user information corresponding to the fused voiceprint in the fused voiceprint set:
  • p(X,Y,Z) represents the user information corresponding to the fused voiceprint
  • X represents the fused voiceprint set
  • Y represents user information
  • Z represents the change of user information
  • T represents the number of users
  • x_t represents the fused voiceprint of the t-th user
  • y_t represents the user information of the t-th user
  • x_{t-1} represents the fused voiceprint of the (t-1)-th user
  • y_{t-1} represents the user information of the (t-1)-th user.
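  • The equation itself is not reproduced in this text. Based only on the variable definitions above (a fused voiceprint sequence X, user information Y, its change Z, and T users indexed by t), one plausible first-order factorization, stated here purely as an assumption rather than the patent's formula, is:

```latex
p(X, Y, Z) = \prod_{t=1}^{T} p(x_t \mid y_t)\, p(y_t \mid y_{t-1}, Z)
```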
  • S5. Perform voice emotion detection on the labeled fusion voiceprint set using the pre-trained voice emotion detection model to obtain a voice emotion detection result.
  • The pre-trained voice emotion detection model is obtained by pre-collecting a large number of voice voiceprints and corresponding labels for training. For example, when a person's emotion is happy, the voice voiceprint tends to exhibit features such as sweetness and softness; therefore, the embodiment of this application establishes a happy voice emotion label for voice voiceprints with features such as sweetness and softness. When a person's emotion is angry, the voice voiceprint tends to be high-pitched; therefore, the embodiment of this application establishes an angry voice emotion label for voice voiceprints with such features.
  • the voice emotion detection model includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer
  • the training process of the voice emotion detection model includes:
  • Receiving the pre-collected voice voiceprints and corresponding labels through the input layer; performing a convolution operation on the voice voiceprints through the convolutional layer to obtain feature vectors of the voice voiceprints; performing a pooling operation on the feature vectors through the pooling layer; calculating the pooled feature vectors through the activation function of the activation layer to obtain training values; calculating the loss function value of the training values and the labels using the loss function of the fully connected layer; and, if the loss function value is greater than a preset threshold, adjusting the parameters of the voice emotion detection model until the loss function value is not greater than the preset threshold, thereby obtaining the pre-trained voice emotion detection model.
  • the preset threshold value described in the embodiment of the present application is 0.1.
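  • A minimal sketch of this training procedure in PyTorch is shown below, assuming a small 1D convolutional network over fixed-length voiceprint feature vectors. The layer sizes, optimizer, feature dimension, and number of emotion classes are illustrative assumptions; the stop condition of a loss not greater than 0.1 follows the preset threshold described above.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Input -> convolutional layer -> pooling layer -> fully connected layer."""
    def __init__(self, num_features: int = 40, num_classes: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(1, 16, kernel_size=3, padding=1)     # convolutional layer
        self.pool = nn.MaxPool1d(2)                                 # pooling layer
        self.fc = nn.Linear(16 * (num_features // 2), num_classes)  # fully connected layer

    def forward(self, x):                 # x: (batch, num_features) float tensor
        x = x.unsqueeze(1)                # add a channel dimension
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        return self.fc(x.flatten(1))      # raw scores; softmax is applied inside the loss

def train(model, voiceprints, labels, threshold: float = 0.1, max_epochs: int = 1000):
    criterion = nn.CrossEntropyLoss()     # softmax + cross-entropy between training values and labels
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        optimizer.zero_grad()
        loss = criterion(model(voiceprints), labels)  # labels: (batch,) long tensor
        loss.backward()
        optimizer.step()
        if loss.item() <= threshold:      # stop once the loss is not greater than the threshold
            break
    return model
```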
  • the activation function includes:
  • O_j represents the training value of the j-th neuron in the activation layer
  • I_j represents the input value of the j-th neuron in the activation layer
  • t represents the total number of neurons in the activation layer
  • e is an infinite non-recurring decimal.
  • the loss function includes:
  • L(s) represents the value of the loss function
  • s represents the error value between the training value and the label value
  • k represents the number of pre-collected voice voiceprints
  • y_i represents the label value
  • y'_i represents the training value.
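  • The formulas themselves are not reproduced in this text. The variable definitions above are consistent with a softmax activation and an averaged cross-entropy loss, written here only as an assumed reconstruction:

```latex
O_j = \frac{e^{I_j}}{\sum_{m=1}^{t} e^{I_m}},
\qquad
L(s) = -\frac{1}{k} \sum_{i=1}^{k} y_i \log y'_i
```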
  • the embodiment of the present application uses the pre-trained voice emotion detection model to perform voice emotion detection on the labeled fusion voiceprint set to obtain a voice emotion detection result.
  • The embodiment of the present application first segments the voice data into voice segments, which can improve the timeliness of voice emotion detection. Secondly, the embodiment extracts the characteristic voiceprints of the voice segments to generate a characteristic voiceprint set, which can improve the efficiency of voice emotion detection, and fuses the same characteristic voiceprints in the characteristic voiceprint set to obtain a fused voiceprint set, so that the sounds made by multiple users are adapted to the same scene, thereby improving the recognition ability of subsequent voice emotion detection. The user information corresponding to each fused voiceprint in the fused voiceprint set is then identified and marked into the corresponding fused voiceprint, which enhances the recognition effect of subsequent voice emotion detection. Further, the embodiment uses a pre-trained voice emotion detection model to perform voice emotion detection on the labeled fused voiceprint set to obtain a voice emotion detection result. Therefore, the voice emotion recognition method proposed in this application can improve the recognition ability of voice emotion recognition.
  • FIG. 2 it is a functional block diagram of the voice emotion recognition device of the present application.
  • the voice emotion recognition device 100 described in this application can be installed in an electronic device.
  • the voice emotion recognition device may include a segmentation module 101, an extraction module 102, a fusion module 103, a recognition module 104, and a detection module 105.
  • the module described in the present invention can also be called a unit, which refers to a series of computer program segments that can be executed by the processor of an electronic device and can complete fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the segmentation module 101 is configured to receive voice data, segment the voice segment of the voice data, and mark the voice segmentation point in the voice segment.
  • the voice data includes voices made by one or more users in different scenarios.
  • the scenarios may be: kitchens, conference rooms, and gymnasiums.
  • Since the voice data includes different sounds made by the user at different time periods, the voice data may carry different voice emotions at different time periods. Therefore, the present application segments the voice data into voice segments in order to recognize the voice emotion of the user's voice in different time periods, which improves the accuracy of subsequent voice emotion recognition. It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned voice segments, they may also be stored in a node of a blockchain.
  • Segmenting the voice segment from the voice data includes:
  • Obtaining the voice signal of the voice data, performing framing processing on the voice signal to obtain the voice sequence of the voice signal, calculating the voice frequency of the voice sequence, and intercepting the voice signal whose voice frequency falls within a preset range as the voice segment.
  • the voice signal is subjected to framing processing by using an overlapping segmentation method.
  • the following method is used to calculate the speech frequency of the speech sequence:
  • B(f) represents the speech frequency
  • f represents the expected speech frequency of the speech sequence.
  • the preset range is a voice frequency of 0-50 Hz.
  • the voice segmentation points are marked in the intercepted voice fragments to speed up the query of subsequent voice fragments, thereby improving the timeliness of voice emotion detection.
  • the extraction module 102 is configured to extract the characteristic voiceprint of the voice segment according to the voice segmentation point, and generate a characteristic voiceprint set.
  • the characteristic voiceprint is used to characterize the characteristic sound of the speech segment. According to the extraction of the characteristic voiceprint, the time for subsequent voice emotion detection can be reduced and the efficiency of voice emotion detection can be improved.
  • the extracting the characteristic voiceprint of the voice segment according to the voice segmentation point to generate a characteristic voiceprint set includes:
  • The voice frequency of the corresponding voice segment is obtained, the dimensional parameter of the voice frequency is calculated, the voiceprint features of the standard voice data are generated according to the dimensional parameter, and the characteristic voiceprint set is obtained according to the voiceprint features.
  • the dimensional parameters include: intonation value, speech rate value, etc.
  • the voiceprint characteristics include: peace, coherence, sweetness, and the like.
  • the following method is used to calculate the dimensional parameters of the voice frequency:
  • d(n) represents the dimensional parameter of the speech frequency
  • i represents the frame rate of the speech frequency
  • n represents the amplitude of the speech frequency
  • B(f) represents the speech frequency
  • k represents the number of preceding and following speech frames combined with the current speech frame; its value is usually 2, which represents a linear combination of the current speech frame with the two preceding and two following speech frames.
  • the fusion module 103 is configured to fuse the same characteristic voiceprints in the characteristic voiceprint set to obtain a fused voiceprint set.
  • The same characteristic voiceprints in the characteristic voiceprint set are merged, that is, identical characteristic voiceprints are fused into one, so that the sounds made by multiple users are adapted to the same scene, which improves the recognition ability of subsequent voice emotion detection.
  • the currently known k-means algorithm is used to fuse the same feature voiceprints in the feature voiceprint set to obtain a fused voiceprint set.
  • the recognition module 104 is configured to identify the user information corresponding to each fused voiceprint in the fused voiceprint set, and mark the user information into the corresponding fused voiceprint.
  • Because the fused voiceprints in the fused voiceprint set contain voices from different users, if voice emotion detection were performed directly on the fused voiceprint set, the user information corresponding to each fused voiceprint could not be determined, which would limit the recognition effect of voice emotion detection. Therefore, in the embodiment of this application, the user information corresponding to each fused voiceprint in the fused voiceprint set is identified, and the user information is marked into the corresponding fused voiceprint, which enhances the recognition effect of subsequent voice emotion detection.
  • the following method is used to identify the user information corresponding to the fused voiceprint in the fused voiceprint set:
  • p(X,Y,Z) represents the user information corresponding to the fused voiceprint
  • X represents the fused voiceprint set
  • Y represents user information
  • Z represents the change of user information
  • T represents the number of users
  • x_t represents the fused voiceprint of the t-th user
  • y_t represents the user information of the t-th user
  • x_{t-1} represents the fused voiceprint of the (t-1)-th user
  • y_{t-1} represents the user information of the (t-1)-th user.
  • the detection module 105 is configured to use a pre-trained voice emotion detection model to perform voice emotion detection on the labeled fusion voiceprint set to obtain a voice emotion detection result.
  • The pre-trained voice emotion detection model is obtained by pre-collecting a large number of voice voiceprints and corresponding labels for training. For example, when a person's emotion is happy, the voice voiceprint tends to exhibit features such as sweetness and softness; therefore, the embodiment of this application establishes a happy voice emotion label for voice voiceprints with features such as sweetness and softness. When a person's emotion is angry, the voice voiceprint tends to be high-pitched; therefore, the embodiment of this application establishes an angry voice emotion label for voice voiceprints with such features.
  • the voice emotion detection model includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer
  • the training process of the voice emotion detection model includes:
  • Receiving the pre-collected voice voiceprints and corresponding labels through the input layer; performing a convolution operation on the voice voiceprints through the convolutional layer to obtain feature vectors of the voice voiceprints; performing a pooling operation on the feature vectors through the pooling layer; calculating the pooled feature vectors through the activation function of the activation layer to obtain training values; calculating the loss function value of the training values and the labels using the loss function of the fully connected layer; and, if the loss function value is greater than a preset threshold, adjusting the parameters of the voice emotion detection model until the loss function value is not greater than the preset threshold, thereby obtaining the pre-trained voice emotion detection model.
  • the preset threshold value described in the embodiment of the present application is 0.1.
  • the activation function includes:
  • O_j represents the training value of the j-th neuron in the activation layer
  • I_j represents the input value of the j-th neuron in the activation layer
  • t represents the total number of neurons in the activation layer
  • e is an infinite non-recurring decimal.
  • the loss function includes:
  • L(s) represents the value of the loss function
  • s represents the error value between the training value and the label value
  • k represents the number of pre-collected voice voiceprints
  • y_i represents the label value
  • y'_i represents the training value.
  • the embodiment of the present application uses the pre-trained voice emotion detection model to perform voice emotion detection on the labeled fusion voiceprint set to obtain a voice emotion detection result.
  • The embodiment of the present application first segments the voice data into voice segments, which can improve the timeliness of voice emotion detection. Secondly, the embodiment extracts the characteristic voiceprints of the voice segments to generate a characteristic voiceprint set, which can improve the efficiency of voice emotion detection, and fuses the same characteristic voiceprints in the characteristic voiceprint set to obtain a fused voiceprint set, so that the sounds made by multiple users are adapted to the same scene, thereby improving the recognition ability of subsequent voice emotion detection. The user information corresponding to each fused voiceprint in the fused voiceprint set is then identified and marked into the corresponding fused voiceprint, which enhances the recognition effect of subsequent voice emotion detection. Further, the embodiment uses a pre-trained voice emotion detection model to perform voice emotion detection on the labeled fused voiceprint set to obtain a voice emotion detection result. Therefore, the voice emotion recognition device proposed in this application can improve the recognition ability of voice emotion recognition.
  • FIG. 3 it is a schematic diagram of the structure of an electronic device that implements the voice emotion recognition method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program stored in the memory 11 and running on the processor 10, such as a voice emotion recognition program.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example: SD or DX memory, etc.), magnetic memory, magnetic disk, CD etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, for example, a mobile hard disk of the electronic device 1.
  • the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card (Flash Card) equipped on the electronic device 1, etc.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the electronic device 1, such as the code of a voice emotion recognition program, etc., but also to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits in some embodiments, for example, may be composed of a single packaged integrated circuit, or may be composed of multiple integrated circuits with the same function or different functions, including combinations of one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, various control chips, and the like.
  • the processor 10 is the control unit of the electronic device, which uses various interfaces and lines to connect the various components of the entire electronic device, and runs or executes programs or modules stored in the memory 11 (such as executing Voice emotion recognition program, etc.), and call the data stored in the memory 11 to execute various functions of the electronic device 1 and process data.
  • the bus may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection and communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 3 only shows an electronic device with some of the components. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, and the electronic device may include fewer or more components than shown in the figure, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power source (such as a battery) for supplying power to various components.
  • the power source may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device.
  • the power supply may also include any components such as one or more DC or AC power supplies, recharging devices, power failure detection circuits, power converters or inverters, and power status indicators.
  • the electronic device 1 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface.
  • the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may also include a user interface.
  • the user interface may be a display (Display) and an input unit (such as a keyboard (Keyboard)).
  • the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like.
  • the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the electronic device 1 and to display a visualized user interface.
  • the voice emotion recognition program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple instructions. When running in the processor 10, it can realize:
  • Using a pre-trained voice emotion detection model to perform voice emotion detection on the labeled fusion voiceprint set to obtain a voice emotion detection result.
  • If the integrated module/unit of the electronic device 1 is implemented in the form of a software functional unit and is sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile, and the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database, a chain of data blocks associated with one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention relates to a speech emotion recognition method, comprising: receiving voice data, segmenting the voice data to obtain voice segments, and marking voice segmentation points in the voice segments (S1); extracting characteristic voiceprints of the voice segments according to the voice segmentation points, so as to generate a characteristic voiceprint set (S2); fusing identical characteristic voiceprints in the characteristic voiceprint set to obtain a fused voiceprint set (S3); identifying user information corresponding to a fused voiceprint in the fused voiceprint set, and marking the corresponding fused voiceprint with the user information (S4); and performing voice emotion detection on the labeled fused voiceprint set by means of a pre-trained voice emotion detection model, so as to obtain a voice emotion detection result (S5). The voice segments can be deployed in a blockchain node. The recognition ability of voice emotion recognition is thereby improved.
PCT/CN2020/106010 2020-05-22 2020-07-30 Speech emotion recognition method and apparatus, electronic device, and storage medium WO2021232594A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010445602.3 2020-05-22
CN202010445602.3A CN111681681A (zh) 2020-05-22 2020-05-22 语音情绪识别方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021232594A1 true WO2021232594A1 (fr) 2021-11-25

Family

ID=72453527

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/106010 WO2021232594A1 (fr) 2020-05-22 2020-07-30 Speech emotion recognition method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN111681681A (fr)
WO (1) WO2021232594A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093389A (zh) * 2021-11-26 2022-02-25 重庆凡骄网络科技有限公司 语音情绪识别方法、装置、电子设备和计算机可读介质
CN114387997A (zh) * 2022-01-21 2022-04-22 合肥工业大学 一种基于深度学习的语音情感识别方法
CN116528438A (zh) * 2023-04-28 2023-08-01 广州力铭光电科技有限公司 一种灯具的智能调光方法和装置

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232276B (zh) * 2020-11-04 2023-10-13 上海企创信息科技有限公司 一种基于语音识别和图像识别的情绪检测方法和装置
CN112786054B (zh) * 2021-02-25 2024-06-11 深圳壹账通智能科技有限公司 基于语音的智能面试评估方法、装置、设备及存储介质
CN113113048B (zh) * 2021-04-09 2023-03-10 平安科技(深圳)有限公司 语音情绪识别方法、装置、计算机设备及介质
CN113422876B (zh) * 2021-06-24 2022-05-10 广西电网有限责任公司 基于ai的电力客服中心辅助管理方法、系统及介质
CN113378226A (zh) * 2021-06-24 2021-09-10 平安普惠企业管理有限公司 生物数据处理方法、装置、设备及计算机可读存储介质
CN113674755B (zh) * 2021-08-19 2024-04-02 北京百度网讯科技有限公司 语音处理方法、装置、电子设备和介质
CN114898775B (zh) * 2022-04-24 2024-05-28 中国科学院声学研究所南海研究站 一种基于跨层交叉融合的语音情绪识别方法及系统
CN117041807B (zh) * 2023-10-09 2024-01-26 深圳市迪斯声学有限公司 蓝牙耳机播放控制方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012215668A (ja) * 2011-03-31 2012-11-08 Fujitsu Ltd 話者状態検出装置、話者状態検出方法及び話者状態検出用コンピュータプログラム
CN107452385A (zh) * 2017-08-16 2017-12-08 北京世纪好未来教育科技有限公司 一种基于语音的数据评价方法及装置
US20180218750A1 (en) * 2017-02-01 2018-08-02 Wipro Limited Integrated system and a method of identifying and learning emotions in conversation utterances
CN109256150A (zh) * 2018-10-12 2019-01-22 北京创景咨询有限公司 基于机器学习的语音情感识别系统及方法
CN109451188A (zh) * 2018-11-29 2019-03-08 平安科技(深圳)有限公司 差异性自助应答的方法、装置、计算机设备和存储介质
CN109841230A (zh) * 2017-11-29 2019-06-04 威刚科技股份有限公司 语音情绪辨识系统与方法以及使用其的智能机器人

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709402A (zh) * 2015-11-16 2017-05-24 优化科技(苏州)有限公司 基于音型像特征的真人活体身份验证方法
CN106782564B (zh) * 2016-11-18 2018-09-11 百度在线网络技术(北京)有限公司 用于处理语音数据的方法和装置
CN109256136B (zh) * 2018-08-31 2021-09-17 三星电子(中国)研发中心 一种语音识别方法和装置
CN109448728A (zh) * 2018-10-29 2019-03-08 苏州工业职业技术学院 融合情感识别的多方会话可视化方法和系统
CN110222719B (zh) * 2019-05-10 2021-09-24 中国科学院计算技术研究所 一种基于多帧音视频融合网络的人物识别方法及系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012215668A (ja) * 2011-03-31 2012-11-08 Fujitsu Ltd 話者状態検出装置、話者状態検出方法及び話者状態検出用コンピュータプログラム
US20180218750A1 (en) * 2017-02-01 2018-08-02 Wipro Limited Integrated system and a method of identifying and learning emotions in conversation utterances
CN107452385A (zh) * 2017-08-16 2017-12-08 北京世纪好未来教育科技有限公司 一种基于语音的数据评价方法及装置
CN109841230A (zh) * 2017-11-29 2019-06-04 威刚科技股份有限公司 语音情绪辨识系统与方法以及使用其的智能机器人
CN109256150A (zh) * 2018-10-12 2019-01-22 北京创景咨询有限公司 基于机器学习的语音情感识别系统及方法
CN109451188A (zh) * 2018-11-29 2019-03-08 平安科技(深圳)有限公司 差异性自助应答的方法、装置、计算机设备和存储介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093389A (zh) * 2021-11-26 2022-02-25 重庆凡骄网络科技有限公司 语音情绪识别方法、装置、电子设备和计算机可读介质
CN114387997A (zh) * 2022-01-21 2022-04-22 合肥工业大学 一种基于深度学习的语音情感识别方法
CN114387997B (zh) * 2022-01-21 2024-03-29 合肥工业大学 一种基于深度学习的语音情感识别方法
CN116528438A (zh) * 2023-04-28 2023-08-01 广州力铭光电科技有限公司 一种灯具的智能调光方法和装置
CN116528438B (zh) * 2023-04-28 2023-10-10 广州力铭光电科技有限公司 一种灯具的智能调光方法和装置

Also Published As

Publication number Publication date
CN111681681A (zh) 2020-09-18

Similar Documents

Publication Publication Date Title
WO2021232594A1 (fr) Speech emotion recognition method and apparatus, electronic device, and storage medium
CN107492379B (zh) 一种声纹创建与注册方法及装置
CN109145766B (zh) 模型训练方法、装置、识别方法、电子设备及存储介质
WO2022116420A1 (fr) Procédé et appareil de détection d'événement vocal, dispositif électronique, et support de stockage informatique
CN109583332B (zh) 人脸识别方法、人脸识别系统、介质及电子设备
WO2022105179A1 (fr) Procédé et appareil de reconnaissance d'image de caractéristiques biologiques, dispositif électronique et support de stockage lisible
WO2021208696A1 (fr) Procédé d'analyse d'intention d'utilisateur, appareil, dispositif électronique et support de stockage informatique
CN112527994A (zh) 情绪分析方法、装置、设备及可读存储介质
CN114648392B (zh) 基于用户画像的产品推荐方法、装置、电子设备及介质
US10423817B2 (en) Latent fingerprint ridge flow map improvement
CN111666415A (zh) 话题聚类方法、装置、电子设备及存储介质
CN113064994A (zh) 会议质量评估方法、装置、设备及存储介质
CN112988963A (zh) 基于多流程节点的用户意图预测方法、装置、设备及介质
CN112732949A (zh) 一种业务数据的标注方法、装置、计算机设备和存储介质
CN114677650B (zh) 地铁乘客行人违法行为智能分析方法及装置
CN114077841A (zh) 基于人工智能的语义提取方法、装置、电子设备及介质
JP2023530893A (ja) データ処理および取引決定システム
CN112634017A (zh) 远程开卡激活方法、装置、电子设备及计算机存储介质
US10755074B2 (en) Latent fingerprint pattern estimation
US11321397B2 (en) Composition engine for analytical models
CN113220828B (zh) 意图识别模型处理方法、装置、计算机设备及存储介质
CN113205814B (zh) 语音数据标注方法、装置、电子设备及存储介质
CN112528903B (zh) 人脸图像获取方法、装置、电子设备及介质
CN113591881A (zh) 基于模型融合的意图识别方法、装置、电子设备及介质
CN113254814A (zh) 网络课程视频打标签方法、装置、电子设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20936507

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03/04/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20936507

Country of ref document: EP

Kind code of ref document: A1