CN113129927A - Voice emotion recognition method, device, equipment and storage medium - Google Patents

Voice emotion recognition method, device, equipment and storage medium

Info

Publication number
CN113129927A
Authority
CN
China
Prior art keywords
emotion
emotion recognition
target
voice
model
Prior art date
Legal status
Granted
Application number
CN202110408995.5A
Other languages
Chinese (zh)
Other versions
CN113129927B (en)
Inventor
顾艳梅
凌波
王少军
马骏
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110408995.5A (granted as CN113129927B)
Publication of CN113129927A
Application granted
Publication of CN113129927B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention relates to the field of artificial intelligence and discloses a speech emotion recognition method, device, equipment and storage medium, which are used to improve the precision and efficiency of speech emotion recognition. The speech emotion recognition method comprises the following steps: acquiring a conversation voice information set; performing voice data preprocessing and emotion type division on the conversation voice information set to obtain a voice training sample set; performing iterative pruning and model training on a preset bidirectional long short-term memory network model based on the voice training sample set to obtain a target emotion recognition model; transplanting the target emotion recognition model to a mobile terminal and performing emotion recognition on user voice information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal can display text information and/or pattern information corresponding to the target emotion category. In addition, the invention relates to blockchain technology: the voice training sample set can be stored in a blockchain node.

Description

Voice emotion recognition method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence voice classification, and in particular to a voice emotion recognition method, device, equipment and storage medium.
Background
In recent years, with the application of large-scale data and deep learning technologies, human-computer interaction technology has become increasingly mature, and its range of application has expanded into industries such as education, medical care and transportation. Voice emotion recognition is an important part of a human-computer interaction system: effectively recognizing the emotion of the interlocutor helps the system better understand the interlocutor's speaking intention, thereby improving the communication quality of the whole human-computer interaction process.
A great deal of research has been done on Speech Emotion Recognition (SER) both at home and abroad. Early SER tasks mainly used convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM) and other neural networks to train recognition models; however, these models have a low recognition rate. Researchers subsequently adopted combinations of two network structures, for example CNN + LSTM, in which the CNN extracts sufficient emotion information and the LSTM performs the emotion recognition task, improving the recognition rate to a certain extent. On this basis, an attention mechanism was added to the SER task, further improving the emotion recognition rate.
Analysis of current SER research shows that, although the recognition effect has reached a certain level, the network models adopted are complex: they not only require hardware support but also run the speech emotion recognition engine on the server side, which results in complicated recognition steps and low recognition efficiency.
Disclosure of Invention
The invention provides a voice emotion recognition method, a voice emotion recognition device, voice emotion recognition equipment and a storage medium, which are used for improving emotion recognition precision and voice emotion recognition efficiency.
In order to achieve the above object, a first aspect of the present invention provides a speech emotion recognition method, including: acquiring a conversation voice information set, wherein the conversation voice information set is used for indicating conversation voice information between a user and a target customer service, and the target customer service comprises a robot customer service and an artificial customer service; carrying out voice data preprocessing on the conversation voice information set to obtain a processed voice information set, and carrying out emotion type division on the processed voice information set to obtain a voice training sample set; performing iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the voice training sample set to obtain a target emotion recognition model; transplanting the target emotion recognition model to a mobile terminal, and performing emotion recognition processing on user voice information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays character information and/or pattern information corresponding to the target emotion category.
Optionally, in a first implementation manner of the first aspect of the present invention, the performing voice data preprocessing on the dialog speech information set to obtain a processed speech information set, and performing emotion type division on the processed speech information set to obtain a speech training sample set includes: sequentially carrying out noise reduction processing, voice enhancement processing and voice fragment interception on the dialogue voice information set to obtain an initial voice information set; extracting corresponding Mel frequency spectrum emotion characteristics from the initial voice information set through a preset audio feature extraction tool, and storing the Mel frequency spectrum emotion characteristics corresponding to each initial voice information as a comma separated value format file to obtain a processed voice information set; performing emotion recognition on the processed voice information sets respectively through a preset voice emotion classification model to obtain emotion types corresponding to the processed voice information; and setting corresponding classification labels for the processed voice information according to the emotion types corresponding to the processed voice information, and storing the processed voice information set into a preset file directory in a classification manner according to the classification labels corresponding to the processed voice information to obtain a voice training sample set.
Optionally, in a second implementation manner of the first aspect of the present invention, the performing iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the speech training sample set to obtain a target emotion recognition model includes: dividing the voice training sample set according to a preset proportion to obtain a voice training set and a voice testing set; performing pruning processing and quantification processing on a preset bidirectional long-short term memory network model in a neural network frame through a preset weight pruning algorithm to obtain a pruned bidirectional long-short term memory network model; performing model training and model testing on the pruned bidirectional long-short term memory network model according to preset training times based on the voice training set and the voice testing set to obtain a trained emotion recognition model; when the trained emotion recognition model meets a preset iteration termination condition, determining that the trained emotion recognition model is a target emotion recognition model, and storing the target emotion recognition model as a first model file, wherein the preset iteration termination condition is that the pruning times are equal to the preset times and/or the precision of the trained emotion recognition model is smaller than a preset precision threshold value, and the first model file is a file in a standard model format.
Optionally, in a third implementation manner of the first aspect of the present invention, the transplanting the target emotion recognition model to a mobile terminal, and performing emotion recognition processing on user voice information acquired in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays text information and/or pattern information corresponding to the target emotion category, includes: converting the first model file into a second model file, installing a dependency package of a machine learning framework in the mobile terminal, and executing the second model file to obtain an execution result, wherein the second model file is a file in a migration model format; when the execution result is that the execution is successful, determining that the target emotion recognition model is successfully transplanted to the mobile terminal; and performing emotion recognition processing on the user voice information acquired in real time through a target emotion recognition model in the mobile terminal to obtain a target emotion category, and determining corresponding character information or pattern information according to the target emotion category so that the mobile terminal displays the character information or pattern information corresponding to the target emotion category.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the performing emotion recognition processing on the user voice information collected in real time through a target emotion recognition model in the mobile terminal to obtain a target emotion category, and determining corresponding text information or pattern information according to the target emotion category, so that the mobile terminal displays the text information or pattern information corresponding to the target emotion category, includes: carrying out feature extraction on user voice information acquired in real time in the mobile terminal to obtain voice emotion feature information, and carrying out classification processing on the voice emotion feature information through a target emotion recognition model in the mobile terminal to obtain a target emotion category; setting the target emotion category as a target key, and inquiring a preset memory database according to the target key to obtain character information or pattern information corresponding to the target emotion category; and calling a preset real-time data push interface to send the text information or the pattern information corresponding to the target emotion category to the mobile terminal, so that the mobile terminal displays the text information or the pattern information corresponding to the target emotion category.
Optionally, in a fifth implementation manner of the first aspect of the present invention, after the target emotion recognition model is transplanted to a mobile terminal, and emotion recognition processing is performed on user voice information acquired in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays text information and/or pattern information corresponding to the target emotion category, the voice emotion recognition method further includes: acquiring a user identifier, and generating a user emotion portrait according to the user identifier, the user voice information acquired in real time and the target emotion category; and pushing the user emotion portrait to the mobile terminal at regular time according to a preset notification mode so that the mobile terminal displays the user emotion portrait.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the target emotion recognition model is transplanted to a mobile terminal, and emotion recognition processing is performed on user voice information acquired in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays text information and/or pattern information corresponding to the target emotion category, the voice emotion recognition method further includes: counting a plurality of emotion categories of the user within a preset time length to obtain a comprehensive emotion state of the user, wherein the plurality of emotion categories of the user comprise the target emotion category; evaluating the health condition of the user according to the comprehensive emotional state of the user and preset user health information; customizing a health diet plan and a health exercise plan based on the health condition of the user, and pushing the health diet plan and the health exercise plan to the mobile terminal.
A second aspect of the present invention provides a speech emotion recognition apparatus, including: the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a conversation voice information set, the conversation voice information set is used for indicating conversation voice information between a user and a target customer service, and the target customer service comprises a robot customer service and an artificial customer service; the dividing module is used for preprocessing voice data of the conversation voice information set to obtain a processed voice information set and dividing emotion types of the processed voice information set to obtain a voice training sample set; the pruning training module is used for carrying out iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the voice training sample set to obtain a target emotion recognition model; and the recognition module is used for transplanting the target emotion recognition model to a mobile terminal, performing emotion recognition processing on the user voice information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, and enabling the mobile terminal to display character information and/or pattern information corresponding to the target emotion category.
Optionally, in a first implementation manner of the second aspect of the present invention, the dividing module is specifically configured to: sequentially carry out noise reduction processing, voice enhancement processing and voice fragment interception on the dialogue voice information set to obtain an initial voice information set; extract corresponding Mel frequency spectrum emotion characteristics from the initial voice information set through a preset audio feature extraction tool, and store the Mel frequency spectrum emotion characteristics corresponding to each initial voice information as a comma separated value format file to obtain a processed voice information set; perform emotion recognition on the processed voice information sets respectively through a preset voice emotion classification model to obtain emotion types corresponding to the processed voice information; and set corresponding classification labels for the processed voice information according to the emotion types corresponding to the processed voice information, and store the processed voice information set into a preset file directory in a classification manner according to the classification labels corresponding to the processed voice information to obtain a voice training sample set.
Optionally, in a second implementation manner of the second aspect of the present invention, the pruning training module is specifically configured to: dividing the voice training sample set according to a preset proportion to obtain a voice training set and a voice testing set; performing pruning processing and quantification processing on a preset bidirectional long-short term memory network model in a neural network frame through a preset weight pruning algorithm to obtain a pruned bidirectional long-short term memory network model; performing model training and model testing on the pruned bidirectional long-short term memory network model according to preset training times based on the voice training set and the voice testing set to obtain a trained emotion recognition model; when the trained emotion recognition model meets a preset iteration termination condition, determining that the trained emotion recognition model is a target emotion recognition model, and storing the target emotion recognition model as a first model file, wherein the preset iteration termination condition is that the pruning times are equal to the preset times and/or the precision of the trained emotion recognition model is smaller than a preset precision threshold value, and the first model file is a file in a standard model format.
Optionally, in a third implementation manner of the second aspect of the present invention, the identification module includes: the installation unit is used for converting the first model file into a second model file, installing a dependency package of a machine learning framework in the mobile terminal, and executing the second model file to obtain an execution result, wherein the second model file is a file in a migration model format; a determining unit, configured to determine that the target emotion recognition model has been successfully transplanted into the mobile terminal when the execution result is that execution is successful; and the recognition unit is used for performing emotion recognition processing on the user voice information acquired in real time through a target emotion recognition model in the mobile terminal to obtain a target emotion category, and determining corresponding character information or pattern information according to the target emotion category so that the mobile terminal displays the character information or pattern information corresponding to the target emotion category.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the identification unit is specifically configured to: carrying out feature extraction on user voice information acquired in real time in the mobile terminal to obtain voice emotion feature information, and carrying out classification processing on the voice emotion feature information through a target emotion recognition model in the mobile terminal to obtain a target emotion category; setting the target emotion category as a target key, and inquiring a preset memory database according to the target key to obtain character information or pattern information corresponding to the target emotion category; and calling a preset real-time data push interface to send the text information or the pattern information corresponding to the target emotion category to the mobile terminal, so that the mobile terminal displays the text information or the pattern information corresponding to the target emotion category.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the speech emotion recognition apparatus further includes: the processing module is used for acquiring a user identifier and generating a user emotion portrait according to the user identifier, the user voice information acquired in real time and the target emotion category; and the pushing module is used for pushing the user emotion portrait to the mobile terminal at regular time according to a preset notification mode so that the mobile terminal can display the user emotion portrait.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the speech emotion recognition apparatus further includes: the counting module is used for counting a plurality of emotion categories of the user within a preset time length to obtain a comprehensive emotion state of the user, wherein the emotion categories of the user comprise the target emotion category; the evaluation module is used for evaluating the health condition of the user according to the comprehensive emotional state of the user and preset user health information; and the customizing module is used for customizing a healthy diet plan and a healthy exercise plan based on the health condition of the user and pushing the healthy diet plan and the healthy exercise plan to the mobile terminal.
A third aspect of the present invention provides a speech emotion recognition apparatus comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the speech emotion recognition device to perform the speech emotion recognition method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-described speech emotion recognition method.
In the technical scheme provided by the invention, a conversation voice information set is obtained, wherein the conversation voice information set is used for indicating the conversation voice information between a user and a target customer service, and the target customer service comprises a robot customer service and an artificial customer service; carrying out voice data preprocessing on the conversation voice information set to obtain a processed voice information set, and carrying out emotion type division on the processed voice information set to obtain a voice training sample set; performing iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the voice training sample set to obtain a target emotion recognition model; transplanting the target emotion recognition model to a mobile terminal, and performing emotion recognition processing on user voice information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays character information and/or pattern information corresponding to the target emotion category. In the embodiment of the invention, the preset bidirectional long-short term memory network model is subjected to iterative pruning treatment and model training through the voice training sample set to obtain the target emotion recognition model, so that the emotion recognition precision is improved, and the time consumption and hardware consumption are reduced; after the target emotion recognition model is deployed to the mobile terminal, emotion recognition processing is carried out on user voice information collected in real time through the target emotion recognition model in the mobile terminal, and character information and/or pattern information corresponding to the target emotion category is displayed, so that voice emotion recognition efficiency is improved.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a speech emotion recognition method in an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a speech emotion recognition method in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a speech emotion recognition apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of the speech emotion recognition apparatus in the embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a speech emotion recognition device in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a voice emotion recognition method, a voice emotion recognition device, voice emotion recognition equipment and a storage medium, which are used for combining a pruning algorithm with a bidirectional long-short term memory network model, improving emotion recognition accuracy and reducing hardware consumption; and the target emotion recognition model is deployed to the mobile terminal, so that the speech emotion recognition efficiency is improved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For understanding, a specific flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a speech emotion recognition method in an embodiment of the present invention includes:
101. Acquiring a conversation voice information set, wherein the conversation voice information set is used for indicating the conversation voice information between the user and a target customer service, and the target customer service comprises a robot customer service and a manual customer service.
The conversation voice information set is derived from a plurality of conversation recording data in a preset service customer service system; the conversation recording data comprise voice information of users, voice information of customer service staff and voice information of robots. The conversation voice information set is used for indicating the conversation voice information between a user and a target customer service, and the target customer service comprises robot customer service and manual customer service. Further, the server extracts a plurality of corresponding conversation recording data from the preset service customer service system according to a preset dialogue scene, a preset topic type and a preset time period to obtain the conversation voice information set, whose storage format can be the wav format. The amount of dialogue recording data in the preset customer service system is large and its voice emotion types are rich, so that daily conversational voice emotions are basically all covered. Voice information of the robot customer service is also included, in consideration of increasing the volume and richness of the voice data.
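As a small illustration of this selection step, the following sketch filters recording metadata by scene, topic type and time period; the record fields and their values are hypothetical, since the patent does not specify a data schema.

```python
from datetime import date

def select_dialogue_recordings(records, scene, topic, start, end):
    """Pick the customer-service recordings matching a preset dialogue scene, topic type and period.

    Each record is assumed to look like
    {"path": "call_0001.wav", "scene": "after_sales", "topic": "refund", "date": date(2021, 3, 2)}.
    """
    return [r["path"] for r in records
            if r["scene"] == scene and r["topic"] == topic and start <= r["date"] <= end]
```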
It is to be understood that the executing subject of the present invention may be a speech emotion recognition device, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
102. Carrying out voice data preprocessing on the conversation voice information set to obtain a processed voice information set, and carrying out emotion type division on the processed voice information set to obtain a voice training sample set.
Specifically, the server sequentially performs pre-emphasis, framing, windowing, Fourier transform and mel filtering on the voice sample data in the dialog voice information set to obtain a processed voice information set, which consists of a plurality of preprocessed voice fragments. The server then divides the preprocessed voice information into emotion types according to a preset emotion type set to obtain a voice training sample set, wherein each voice training sample in the set is a voice sample carrying emotion and has a unique identifier, and the server stores each voice training sample in the comma-separated value (csv) format in a preset file directory corresponding to its emotion type.
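A minimal sketch of this preprocessing chain, assuming the librosa tool mentioned later in the description; the sample rate, frame length, hop size and number of mel bands are illustrative assumptions rather than values given in the patent.

```python
import librosa
import numpy as np

def extract_mel_emotion_features(wav_path, sr=16000, n_fft=512, hop_length=256, n_mels=64):
    """Pre-emphasis, framing/windowing, Fourier transform and mel filtering of one speech clip."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y)                 # boost high frequencies before framing
    # melspectrogram applies framing, a Hann window and the STFT internally,
    # then maps the power spectrum onto the mel filter bank.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)        # log-mel emotion features (n_mels x frames)
```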
It should be noted that the preset emotion type set includes the happy, sad, angry, excited and normal emotions, and the like. Further, the server may store the voice training sample set in a blockchain database, which is not limited herein.
103. Carrying out iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the voice training sample set to obtain a target emotion recognition model.
Specifically, the server uses a preset weight pruning algorithm to remove unnecessary network weights from the weight tensors of the preset bidirectional long short-term memory network (Bi-LSTM), retains the weight parameters that are important to the Bi-LSTM network, and reduces the connections between different network layers, obtaining an optimized network model, namely the pruned bidirectional long short-term memory network model. The server then repeatedly performs model training and model testing (for model verification) on the pruned bidirectional long short-term memory network model according to the processed voice information set to obtain a trained emotion recognition model, and when the trained emotion recognition model meets a preset iteration termination condition, the server determines that the trained emotion recognition model is the target emotion recognition model.
It can be understood that the Bi-LSTM adds bidirectional input and output, so its network structure is more complex; a deeper network structure can, to a certain extent, capture more speech emotion feature information and thus improve the speech emotion recognition effect. The pruned bidirectional long short-term memory network model indicates a Bi-LSTM network optimized with a weight pruning algorithm, which reduces the number of calculations and improves the speech emotion recognition efficiency and recognition effect on the mobile terminal.
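The patent does not give the exact network topology; the following Keras sketch shows one plausible Bi-LSTM emotion classifier of the kind described, with layer sizes, input dimensions and a five-class emotion set assumed for illustration.

```python
import tensorflow as tf

NUM_EMOTIONS = 5              # e.g. happy, sad, angry, excited, normal (assumed label set)
N_MELS, MAX_FRAMES = 64, 300  # illustrative feature dimensions

def build_bilstm_classifier():
    """Bidirectional LSTM over mel-spectrum frames followed by a softmax emotion head."""
    inputs = tf.keras.Input(shape=(MAX_FRAMES, N_MELS))
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(inputs)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```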
104. Transplanting the target emotion recognition model to the mobile terminal, and performing emotion recognition processing on the user voice information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays character information and/or pattern information corresponding to the target emotion category.
Specifically, the server first adds a dependency item to the mobile terminal and transplants the target emotion recognition model to the mobile terminal based on that dependency item. The server then collects user voice information in real time through the mobile terminal (for example, in a voice call scene, the call information between the mobile terminal and the robot customer service) to obtain the user voice information collected in real time; this information is stored in wav format as a file with a unique identifier, and the server performs voice feature analysis on the user voice information through the feature extraction tool librosa to obtain speech emotion feature information. Next, the server classifies the speech emotion feature information through the target emotion recognition model deployed in the mobile terminal to obtain a target emotion category, that is, it determines whether the target emotion category is the happy, sad, angry, excited or normal emotion. For example, in a single conversation, the server identifies the emotion type of each speaker sentence segment through the target emotion recognition model in the mobile terminal and thereby obtains the target emotion category, for instance the normal emotion. The server obtains a preset service scene type, analyzes the user intention based on the preset service scene type and the target emotion category, and queries a preset database according to the user intention to obtain the text information and/or pattern information corresponding to the target emotion category; the server then displays this text information and/or pattern information through a target page in the mobile terminal, for example outputting the text and/or pattern information corresponding to the normal emotion to the target page. The text information may be a specific description of the pattern information. The preset database may include a preset knowledge graph database and a preset memory database, and may also be another type of database, which is not limited herein.
Furthermore, the mobile terminal can convert the text information into voice information and play the voice information. The target emotion recognition model can be applied to audio and video call service scenes such as online education or online inquiry.
In the embodiment of the invention, the preset bidirectional long-short term memory network model is subjected to iterative pruning treatment and model training through the voice training sample set to obtain the target emotion recognition model, so that the emotion recognition precision is improved, and the time consumption and hardware consumption are reduced; after the target emotion recognition model is deployed to the mobile terminal, emotion recognition processing is carried out on user voice information collected in real time through the target emotion recognition model in the mobile terminal, and character information and/or pattern information corresponding to the target emotion category is displayed, so that voice emotion recognition efficiency is improved.
Referring to fig. 2, another embodiment of the speech emotion recognition method according to the embodiment of the present invention includes:
201. Acquiring a conversation voice information set, wherein the conversation voice information set is used for indicating the conversation voice information between the user and a target customer service, and the target customer service comprises a robot customer service and a manual customer service.
The execution process of step 201 is similar to the execution process of step 101, and detailed description thereof is omitted here.
202. Carrying out voice data preprocessing on the conversation voice information set to obtain a processed voice information set, and carrying out emotion type division on the processed voice information set to obtain a voice training sample set.
Further, the server divides the dialogue voice information set into a plurality of voice fragments according to a preset time period, namely the processed voice information set, and divides the processed voice information set into the voice training sample set according to a preset emotion type set. Optionally, the server performs noise reduction, speech enhancement and speech segment interception on the voice information set to obtain an initial voice information set. Specifically, the server performs noise reduction on the voice information set to be processed through a trained generative adversarial network (GAN) model to obtain a noise-reduced voice information set, and performs speech enhancement on the noise-reduced voice information set through a preset speech enhancement network model to obtain an enhanced voice information set; the preset speech enhancement network model may be a trained conditional generative adversarial network model or a recurrent neural network model, which is not specifically limited here. The server determines the start and stop positions of each speech segment in each piece of enhanced voice information through a trained weighted finite-state transducer (WFST) network model, intercepts speech segments from each piece of enhanced voice information according to those start and stop positions to obtain a plurality of candidate voice segments, and deletes blank segments among the candidate voice segments to obtain the initial voice information set. The server then extracts the corresponding mel-spectrum emotion features from the initial voice information set through the preset audio feature extraction tool librosa, which is used for analyzing and processing audio, and stores the mel-spectrum emotion features corresponding to each piece of initial voice information as a comma-separated value file (that is, a file in csv format) to obtain the processed voice information set. The server performs emotion recognition on the processed voice information through a preset speech emotion classification model to obtain the emotion type corresponding to each piece of processed voice information; the preset speech emotion classification model may be a long short-term memory (LSTM) model with an attention mechanism or another emotion classification model, which is not specifically limited here. Finally, the server sets a corresponding classification label for each piece of processed voice information according to its emotion type, and stores the processed voice information set into a preset file directory by category according to the classification labels, obtaining the voice training sample set. That is, each piece of processed voice information has a corresponding classification label, and there is a correspondence between classification labels and emotion types.
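A sketch of how each clip's mel features could be written to a comma-separated-value file and filed under a per-emotion directory as described above; the directory layout, label names and file naming are illustrative assumptions.

```python
import os
import numpy as np

def save_training_sample(mel_features, emotion_label, clip_id, root_dir="speech_train_samples"):
    """Store one clip's mel-spectrum features as a csv file under its emotion-type directory."""
    target_dir = os.path.join(root_dir, emotion_label)      # e.g. speech_train_samples/happy
    os.makedirs(target_dir, exist_ok=True)
    csv_path = os.path.join(target_dir, f"{clip_id}.csv")   # clip_id is the sample's unique identifier
    np.savetxt(csv_path, mel_features, delimiter=",")       # one row per mel band, one column per frame
    return csv_path
```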
203. Carrying out iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the voice training sample set to obtain a target emotion recognition model.
It should be noted that the preset bidirectional long short-term memory network model (Bi-LSTM) improves the recognition effect but increases the required computing power, so the server uses a weight pruning algorithm to optimize the Bi-LSTM network model, that is, to remove unnecessary parts of the weight parameters and thereby reduce the number of calculations. Optionally, the server divides the voice training sample set according to a preset ratio to obtain a voice training set and a voice test set, where the preset ratio may be 8:2 or 7:3 and is not limited here. The server performs pruning and quantization on the preset bidirectional long short-term memory network model in a neural network framework through the preset weight pruning algorithm to obtain the pruned bidirectional long short-term memory network model; the neural network framework may be the deep learning framework Keras or another framework, which is not specifically limited here, and the pruned bidirectional long short-term memory network model is a sparse model. The weight pruning algorithm includes weight quantization, which means representing a weight with fewer bits; for example, the server performs Huffman coding on the weights to reduce network redundancy. The server performs model training and model testing on the pruned bidirectional long short-term memory network model according to a preset number of training iterations based on the voice training set and the voice test set to obtain a trained emotion recognition model, where the voice training set is used for model training of the pruned bidirectional long short-term memory network model and the voice test set is used for its model testing. When the trained emotion recognition model meets a preset iteration termination condition, the server determines that the trained emotion recognition model is the target emotion recognition model and stores it as a first model file; the preset iteration termination condition is that the number of pruning rounds equals a preset number and/or the precision of the trained emotion recognition model is smaller than a preset precision threshold, and the first model file is a file in a standard model format, for example the pth format or another format. The initial value of the number of pruning rounds is 0.
Further, when the trained emotion recognition model does not meet the preset iteration termination condition, the server extracts the misrecognized test samples from the voice test set, adds them to the voice training sample set to obtain an updated voice training sample set, and iteratively trains the trained emotion recognition model based on the updated voice training sample set until the trained emotion recognition model meets the preset iteration termination condition, obtaining the target emotion recognition model and storing it as the first model file. In the embodiment of the invention, by pruning and retraining the trained recognition model, the server progressively sets a preset proportion of the weights in each layer of the Bi-LSTM to 0, thereby shrinking the weight tensors; when the proportion of zero-valued weights is set to 0.6, the server confirms, by printing the model parameters, that the number of parameters after pruning is significantly reduced, and then compiles and trains the pruned Bi-LSTM model until the target emotion recognition model is obtained, at which point model training stops.
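The patent names no specific pruning library, so the following is only a minimal sketch of iterative magnitude pruning on a compiled Keras Bi-LSTM (such as the one sketched earlier): the 0.6 zero-weight proportion comes from the description, while the round count, retraining epochs and precision threshold are assumptions. A full implementation would also keep a pruning mask so that zeroed connections stay at zero during retraining.

```python
import numpy as np

def magnitude_prune(model, sparsity=0.6):
    """Set the smallest-magnitude fraction of every kernel to zero (weight pruning)."""
    for layer in model.layers:
        updated = []
        for w in layer.get_weights():
            if w.ndim >= 2:                                  # prune kernels, leave biases untouched
                threshold = np.quantile(np.abs(w), sparsity)
                w = np.where(np.abs(w) >= threshold, w, 0.0)
            updated.append(w)
        layer.set_weights(updated)

def iterative_prune_and_train(model, x_train, y_train, x_test, y_test,
                              max_rounds=5, sparsity=0.6, min_accuracy=0.85):
    """Prune, retrain and re-test until the preset round count is reached
    or test accuracy falls below the preset precision threshold."""
    for _ in range(max_rounds):
        magnitude_prune(model, sparsity)
        model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)
        _, accuracy = model.evaluate(x_test, y_test, verbose=0)
        if accuracy < min_accuracy:
            break
    return model
```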
204. Transplanting the target emotion recognition model to the mobile terminal, and performing emotion recognition processing on the user voice information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays character information and/or pattern information corresponding to the target emotion category.
The target emotion recognition model adds a pruning algorithm to the bidirectional long short-term memory network model to achieve a high compression ratio, so that speech emotion recognition accuracy is improved while training time and hardware consumption are reduced. In addition, the server deploys the emotion recognition model on the mobile terminal, which collects the user's voice and extracts speech emotion features, and the deployed model performs the emotion category classification; unlike deploying the target emotion recognition model on the server side, this simplifies the speech emotion recognition process, improves recognition efficiency, and allows the user's emotion category to be fed back in real time.
Optionally, the server converts the first model file into a second model file, installs the dependency package of the machine learning framework in the mobile terminal, and executes the second model file to obtain an execution result, where the second model file is a file in the migration model format. For example, the server converts the first model file module_a.pth, which conforms to the Keras format (i.e., the standard model format), into the second model file module_a.pb, which conforms to the machine learning framework TensorFlow format (i.e., the migration model format); the dependency package of the machine learning framework may be a TensorFlow Mobile dependency file or another dependency package, which is not limited here. When the execution result indicates success, the server determines that the target emotion recognition model has been successfully transplanted to the mobile terminal. The execution result is either success or failure: the server judges whether the execution result equals a preset value; if it does, the server determines that execution succeeded, otherwise that it failed, and when execution fails the server generates model migration exception information and sends it to the mobile terminal. The server then performs emotion recognition on the user voice information collected in real time through the target emotion recognition model in the mobile terminal to obtain the target emotion category, and determines the corresponding text information or pattern information according to the target emotion category so that the mobile terminal can display it.
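The .pb migration format and the TensorFlow Mobile dependency mentioned above target a runtime that is now deprecated; as a present-day stand-in, this sketch converts the trained Keras model with the TensorFlow Lite converter and runs one dummy inference as the "execution result" check. File names, paths and the fallback-ops setting are assumptions.

```python
import numpy as np
import tensorflow as tf

def convert_for_mobile(model, out_path="module_a.tflite"):
    """Convert the pruned emotion model to an on-device format and smoke-test the converted file."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # LSTM ops may need the TensorFlow fallback op set to convert cleanly.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                           tf.lite.OpsSet.SELECT_TF_OPS]
    with open(out_path, "wb") as f:
        f.write(converter.convert())

    # "Execution result" check: load the converted model and run one dummy inference.
    interpreter = tf.lite.Interpreter(model_path=out_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=np.float32))
    interpreter.invoke()
    return True   # a conversion or execution failure would raise and be reported as migration failure
```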
Further, the server acquires the user voice information collected in real time from the mobile terminal and performs feature extraction on it to obtain speech emotion feature information; for example, the server performs voice feature analysis on the user voice information collected in real time through librosa to obtain the target mel-spectrum features (namely the speech emotion feature information). The server then classifies the speech emotion feature information through the target emotion recognition model in the mobile terminal to obtain the target emotion category. For example, the call information between the user and the robot customer service is collected at the mobile terminal to obtain the user voice information collected in real time, which is stored in the wav file format. The server sets the target emotion category as the target key and queries a preset in-memory database according to the target key to obtain the text information or pattern information corresponding to the target emotion category, where the target emotion category corresponds to the text information or pattern information. The server then calls a preset real-time data push interface to send the text information or pattern information corresponding to the target emotion category to the mobile terminal, so that the mobile terminal displays it. There may be one or more target emotion categories, which is not limited here, and the text information or pattern information is displayed in a target page of the mobile terminal as a data sequence.
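The "preset memory database" and "real-time data push interface" are not named in the patent; the sketch below uses Redis and a plain HTTP POST purely as illustrative stand-ins, with the key naming, label order and push URL assumed.

```python
import numpy as np
import redis
import requests

EMOTIONS = ["happy", "sad", "angry", "excited", "normal"]   # assumed output label order

def recognize_and_push(interpreter, mel_features, push_url="https://example.com/emotion-push"):
    """Classify one clip on the device, look up display text by emotion key, and push it to the terminal."""
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    # Assumes the clip has already been padded/cropped to the model's expected frame count.
    interpreter.set_tensor(inp["index"], mel_features[np.newaxis, ...].astype(np.float32))
    interpreter.invoke()
    probs = interpreter.get_tensor(out["index"])[0]
    target_emotion = EMOTIONS[int(np.argmax(probs))]

    # Use the emotion category as the key into an in-memory store (Redis as a stand-in).
    store = redis.Redis(host="localhost", port=6379, decode_responses=True)
    display_text = store.get(f"emotion:{target_emotion}")

    # Hand the text/pattern information to a push interface so the mobile terminal can display it.
    requests.post(push_url, json={"emotion": target_emotion, "text": display_text}, timeout=5)
    return target_emotion
```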
205. Acquiring a user identifier, and generating a user emotion portrait according to the user identifier, the user voice information acquired in real time and the target emotion category.
Specifically, the server receives a portrait generation request and parses it to obtain the user identifier, the user voice information collected in real time and the target emotion category; obtains the user's basic data and historical behavior data from a preset database according to the user identifier, where the basic data comprises data such as the user's height, gender and name; performs multi-dimensional analysis on the user voice information collected in real time, the target emotion category, the basic data and the historical behavior data to obtain a plurality of portrait features of the user; and generates the user emotion portrait from these portrait features, which can be used to indicate the user's emotion change trend and emotion information.
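The structure of the emotion portrait is not defined in the patent; the dictionary below is only one possible shape, with its fields assumed for illustration.

```python
from collections import Counter

def build_emotion_portrait(user_id, emotion_records, basic_data):
    """Aggregate per-utterance emotion categories into a simple user emotion portrait."""
    counts = Counter(emotion for _, emotion in emotion_records)   # records are (timestamp, emotion) pairs
    total = sum(counts.values()) or 1
    return {
        "user_id": user_id,
        "name": basic_data.get("name"),
        "gender": basic_data.get("gender"),
        # Share of each emotion over the analysed utterances, a rough emotion-trend signal.
        "emotion_distribution": {e: round(n / total, 3) for e, n in counts.items()},
        "dominant_emotion": counts.most_common(1)[0][0] if counts else "normal",
    }
```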
It should be noted that the server obtains information such as user voice information, target emotion category, basic data and historical behavior data of the user, which are collected in real time, and the information is authorized by the user and does not relate to user privacy.
206. Pushing the user emotion portrait to the mobile terminal at regular times according to a preset notification mode so that the mobile terminal displays the user emotion portrait.
Specifically, the server performs data encapsulation on the user emotion portrait to obtain notification push data, and the server pushes the notification push data to the mobile terminal at regular time according to a preset notification mode, so that the mobile terminal displays the user emotion portrait according to the notification push data, and the preset notification mode includes a message push mode and can also be other push modes, which is not limited specifically here.
Further, the server counts a plurality of emotion categories of the user within a preset time length to obtain a comprehensive emotion state of the user, wherein the emotion categories of the user comprise a target emotion category; the server evaluates the health condition of the user according to the comprehensive emotional state of the user and preset user health information, specifically, the server determines the mental health information of the user and the physiological health information of the user according to the comprehensive emotional state of the user and the preset user health information, and the server evaluates the health condition of the user according to the mental health information of the user and the physiological health information of the user; the server customizes the health diet plan and the health exercise plan based on the health condition of the user and pushes the health diet plan and the health exercise plan to the mobile terminal.
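A minimal sketch of the statistics step described above: the dominant emotion over a recent window stands in for the "comprehensive emotional state". The window length and the fallback value are assumptions; mapping that state to health, diet and exercise advice is left out.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import List, Tuple

def comprehensive_emotional_state(records: List[Tuple[datetime, str]],
                                  window: timedelta = timedelta(days=7)) -> str:
    """Return the most frequent emotion category within the preset time window."""
    cutoff = datetime.now() - window
    recent = [emotion for ts, emotion in records if ts >= cutoff]
    if not recent:
        return "normal"                      # assumed fallback when no recent records exist
    return Counter(recent).most_common(1)[0][0]
```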
In the embodiment of the invention, the preset bidirectional long-short term memory network model is subjected to iterative pruning treatment and model training through the voice training sample set to obtain the target emotion recognition model, so that the emotion recognition precision is improved, and the time consumption and hardware consumption are reduced; after the target emotion recognition model is deployed to the mobile terminal, emotion recognition processing is carried out on user voice information collected in real time through the target emotion recognition model in the mobile terminal, and character information and/or pattern information corresponding to the target emotion category is displayed, so that voice emotion recognition efficiency is improved. This scheme can be applied to the smart medical field to promote the construction of smart cities.
The speech emotion recognition method in the embodiment of the present invention is described above. Referring to fig. 3, a speech emotion recognition apparatus in the embodiment of the present invention is described below; an embodiment of the speech emotion recognition apparatus in the embodiment of the present invention includes:
an obtaining module 301, configured to obtain a conversation voice information set, where the conversation voice information set is used to indicate communication voice information between a user and a target customer service, and the target customer service includes a robot customer service and a manual customer service;
a dividing module 302, configured to perform voice data preprocessing on the conversational voice information set to obtain a processed voice information set, and perform emotion type division on the processed voice information set to obtain a voice training sample set;
a pruning training module 303, configured to perform iterative pruning processing and model training on a preset bidirectional long-and-short term memory network model based on the voice training sample set, to obtain a target emotion recognition model;
the recognition module 304 is configured to transplant the target emotion recognition model to the mobile terminal, perform emotion recognition processing on the user voice information collected in real time through the target emotion recognition model in the mobile terminal, and obtain a target emotion category, so that the mobile terminal displays text information and/or pattern information corresponding to the target emotion category.
Further, the speech training sample set is stored in the blockchain database, which is not limited herein.
In the embodiment of the invention, the preset bidirectional long-short term memory network model is subjected to iterative pruning treatment and model training through the voice training sample set to obtain the target emotion recognition model, so that the emotion recognition precision is improved, and the time consumption and hardware consumption are reduced; after the target emotion recognition model is deployed to the mobile terminal, emotion recognition processing is carried out on user voice information collected in real time through the target emotion recognition model in the mobile terminal, and character information and/or pattern information corresponding to the target emotion category is displayed, so that voice emotion recognition efficiency is improved.
Referring to fig. 4, another embodiment of the speech emotion recognition apparatus according to the embodiment of the present invention includes:
an obtaining module 301, configured to obtain a conversation voice information set, where the conversation voice information set is used to indicate communication voice information between a user and a target customer service, and the target customer service includes a robot customer service and a manual customer service;
a dividing module 302, configured to perform voice data preprocessing on the conversational voice information set to obtain a processed voice information set, and perform emotion type division on the processed voice information set to obtain a voice training sample set;
a pruning training module 303, configured to perform iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the voice training sample set, to obtain a target emotion recognition model;
the recognition module 304 is configured to transplant the target emotion recognition model to the mobile terminal, perform emotion recognition processing on the user voice information collected in real time through the target emotion recognition model in the mobile terminal, and obtain a target emotion category, so that the mobile terminal displays text information and/or pattern information corresponding to the target emotion category.
Optionally, the dividing module 302 may be further specifically configured to:
sequentially performing noise reduction, speech enhancement, and speech segment clipping on the conversation voice information set to obtain an initial voice information set;
extracting corresponding Mel-spectrum emotion features from the initial voice information set through a preset audio feature extraction tool, and storing the Mel-spectrum emotion features corresponding to each piece of initial voice information as a comma-separated-value format file to obtain a processed voice information set;
performing emotion recognition on each piece of processed voice information in the processed voice information set through a preset speech emotion classification model to obtain the emotion type corresponding to each piece of processed voice information;
and setting a corresponding classification label for each piece of processed voice information according to its emotion type, and storing the processed voice information set into a preset file directory by category according to the classification labels, to obtain a voice training sample set.
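A minimal sketch of the feature-extraction and storage steps is given below for illustration. It assumes librosa as the audio feature extraction tool, a 16 kHz sampling rate, 64 Mel bands, and a one-directory-per-emotion-class layout; none of these specifics are mandated by the embodiment.

```python
import csv
import os

import librosa
import numpy as np

def extract_mel_features(wav_path, sr=16000, n_mels=64):
    # Load the (already denoised / clipped) utterance and compute a log-Mel spectrogram.
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    return log_mel.T  # shape: (frames, n_mels)

def save_as_csv(features, emotion_label, out_dir):
    # Store the features as a comma-separated-value file under a per-label directory,
    # mirroring the classification-label directory layout described above.
    label_dir = os.path.join(out_dir, emotion_label)
    os.makedirs(label_dir, exist_ok=True)
    out_path = os.path.join(label_dir, f"{len(os.listdir(label_dir))}.csv")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(features.tolist())
    return out_path

# Usage example (paths and label are hypothetical):
# feats = extract_mel_features("call_0001.wav")
# save_as_csv(feats, "angry", "train_samples")
```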
Optionally, the pruning training module 303 may be further specifically configured to:
dividing a voice training sample set according to a preset proportion to obtain a voice training set and a voice testing set;
performing pruning and quantization on a preset bidirectional long-short term memory network model within a neural network framework through a preset weight pruning algorithm to obtain a pruned bidirectional long-short term memory network model;
performing model training and model testing on the pruned bidirectional long-short term memory network model according to preset training times based on the voice training set and the voice testing set to obtain a trained emotion recognition model;
when the trained emotion recognition model meets a preset iteration termination condition, determining the trained emotion recognition model as the target emotion recognition model, and storing the target emotion recognition model as a first model file, wherein the preset iteration termination condition is that the number of pruning iterations reaches a preset number and/or the accuracy of the trained emotion recognition model falls below a preset accuracy threshold, and the first model file is a file in a standard model format.
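As a sketch of the iterative pruning-and-training loop only, the Python example below uses a Keras bidirectional LSTM classifier and simple per-layer magnitude pruning. The layer sizes, sparsity schedule, epoch counts, round limit, and accuracy floor are all hypothetical, and magnitude pruning merely stands in for whatever weight pruning algorithm is preset in the embodiment.

```python
import numpy as np
import tensorflow as tf

def build_bilstm(time_steps, feat_dim, num_classes):
    # Bidirectional LSTM classifier over Mel-spectrum feature frames.
    inputs = tf.keras.Input(shape=(time_steps, feat_dim))
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(inputs)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def magnitude_prune(model, sparsity):
    # Zero the smallest-magnitude entries in every kernel; biases are left untouched.
    for layer in model.layers:
        weights = layer.get_weights()
        if not weights:
            continue
        pruned = []
        for w in weights:
            if w.ndim >= 2:
                threshold = np.quantile(np.abs(w), sparsity)
                w = np.where(np.abs(w) < threshold, 0.0, w).astype(w.dtype)
            pruned.append(w)
        layer.set_weights(pruned)

def iterative_prune_and_train(model, x_tr, y_tr, x_te, y_te,
                              max_rounds=4, sparsity_step=0.2, acc_floor=0.85):
    # Prune to an increasing sparsity target each round, retrain, and evaluate; stop
    # when the round limit is reached or test accuracy drops below the preset floor.
    # (A full implementation would keep a pruning mask so retraining does not
    # revive pruned weights; this sketch omits that bookkeeping.)
    for round_idx in range(max_rounds):
        target_sparsity = min(0.9, sparsity_step * (round_idx + 1))
        magnitude_prune(model, target_sparsity)
        model.fit(x_tr, y_tr, epochs=3, batch_size=32, verbose=0)
        _, acc = model.evaluate(x_te, y_te, verbose=0)
        if acc < acc_floor:
            break
    return model
```

After the loop terminates, the surviving model would be saved as the first model file in the framework's standard format (for Keras, typically a SavedModel directory or .h5 file).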
Optionally, the identifying module 304 may further include:
an installing unit 3041, configured to convert the first model file into a second model file, install a dependency package of the machine learning framework in the mobile terminal, and execute the second model file to obtain an execution result, where the second model file is a file in a migration model format;
a determining unit 3042, configured to determine that the target emotion recognition model has been successfully transplanted into the mobile terminal when the execution result is that the execution is successful;
the identifying unit 3043 is configured to perform emotion identification processing on the user voice information collected in real time through a target emotion identification model in the mobile terminal to obtain a target emotion category, and determine corresponding text information or pattern information according to the target emotion category, so that the mobile terminal displays the text information or pattern information corresponding to the target emotion category.
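Purely as an example of such a conversion, the sketch below assumes the first model file is a TensorFlow SavedModel (consistent with the Keras sketch above) and the second model file is a TensorFlow Lite flat buffer executed by the mobile runtime. The file paths and the choice of framework are assumptions, not requirements of the embodiment.

```python
import tensorflow as tf

# Convert the trained model ("first model file", standard format) into a
# mobile-friendly flat buffer ("second model file", migration format).
converter = tf.lite.TFLiteConverter.from_saved_model("saved_emotion_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
# LSTM ops are not always covered by built-in TFLite kernels, so allow TF fallback ops.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()

with open("emotion_model.tflite", "wb") as f:
    f.write(tflite_model)

# Smoke test: if the interpreter loads and allocates tensors, the "execution result"
# checked by the determining unit can be treated as successful.
interpreter = tf.lite.Interpreter(model_path="emotion_model.tflite")
interpreter.allocate_tensors()
print(interpreter.get_input_details())
```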
Optionally, the identifying unit 3043 may be further specifically configured to:
carrying out feature extraction on user voice information acquired in real time in a mobile terminal to obtain voice emotion feature information, and carrying out classification processing on the voice emotion feature information through a target emotion recognition model in the mobile terminal to obtain a target emotion category;
setting the target emotion category as a target key, and inquiring a preset memory database according to the target key to obtain character information or pattern information corresponding to the target emotion category;
and calling a preset real-time data push interface to send the text information or the pattern information corresponding to the target emotion category to the mobile terminal, so that the mobile terminal displays the text information or the pattern information corresponding to the target emotion category.
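To make the classification and lookup steps concrete, the sketch below runs the transplanted model as a TensorFlow Lite interpreter and treats the preset in-memory database as Redis with keys of the form `emotion:<category>`. The label order, key scheme, and database choice are illustrative assumptions only.

```python
import numpy as np
import redis
import tensorflow as tf

EMOTION_LABELS = ["neutral", "happy", "angry", "sad"]  # hypothetical label order

def classify_emotion(interpreter, mel_features):
    # Run the on-device emotion recognition model on one utterance's feature matrix.
    in_detail = interpreter.get_input_details()[0]
    out_detail = interpreter.get_output_details()[0]
    x = mel_features[np.newaxis, ...].astype(in_detail["dtype"])
    interpreter.set_tensor(in_detail["index"], x)
    interpreter.invoke()
    probs = interpreter.get_tensor(out_detail["index"])[0]
    return EMOTION_LABELS[int(np.argmax(probs))]

def display_content_for(emotion_label):
    # Use the target emotion category as the key into the in-memory database;
    # the stored value holds the text or pattern information to display.
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    return r.get(f"emotion:{emotion_label}") or "(no content configured)"
```

The returned text or pattern information would then be handed to the real-time data push interface for display on the mobile terminal.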
Optionally, the speech emotion recognition apparatus further includes:
the processing module 305 is used for acquiring a user identifier, and generating a user emotion portrait according to the user identifier, user voice information acquired in real time and a target emotion category;
the pushing module 306 is configured to push the user emotion figure to the mobile terminal at regular time according to a preset notification manner, so that the mobile terminal displays the user emotion figure.
Optionally, the speech emotion recognition apparatus further includes:
the counting module 307 is configured to count a plurality of emotion categories of the user within a preset time period to obtain a comprehensive emotion state of the user, where the plurality of emotion categories of the user include a target emotion category;
the evaluation module 308 is used for evaluating the health condition of the user according to the comprehensive emotional state of the user and preset user health information;
and the customizing module 309 is configured to customize the health diet plan and the health exercise plan based on the health condition of the user, and push the health diet plan and the health exercise plan to the mobile terminal.
In the embodiment of the invention, iterative pruning and model training are performed on the preset bidirectional long-short term memory network model through the voice training sample set to obtain the target emotion recognition model, which improves emotion recognition accuracy while reducing time and hardware consumption; after the target emotion recognition model is deployed to the mobile terminal, emotion recognition is performed on user voice information collected in real time through the model in the mobile terminal, and text information and/or pattern information corresponding to the target emotion category is displayed, which improves speech emotion recognition efficiency.
Figs. 3 and 4 describe the speech emotion recognition apparatus in the embodiment of the present invention in detail from a modular perspective; the speech emotion recognition device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a speech emotion recognition device 500 according to an embodiment of the present invention. The speech emotion recognition device 500 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 510, a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. A program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the speech emotion recognition device 500. Furthermore, the processor 510 may be configured to communicate with the storage medium 530 to execute the series of instruction operations in the storage medium 530 on the speech emotion recognition device 500.
The speech emotion recognition device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the device structure shown in fig. 5 does not limit the speech emotion recognition device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the speech emotion recognition method.
The invention further provides a speech emotion recognition device, which comprises a memory and a processor, wherein the memory stores instructions, and the instructions, when executed by the processor, cause the processor to execute the steps of the speech emotion recognition method in the above embodiments.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
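As a minimal illustration of "data blocks linked by cryptographic methods" only (field names and the hashing choice are assumptions for the example, not part of this disclosure):

```python
import hashlib
import json

def make_block(transactions, prev_hash):
    # Each block stores a batch of data plus the hash of the previous block, so
    # tampering with any earlier block invalidates every later block's hash.
    body = {"transactions": transactions, "prev_hash": prev_hash}
    block_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": block_hash}

genesis = make_block(["voice training sample set stored"], prev_hash="0" * 64)
next_block = make_block(["new samples appended"], prev_hash=genesis["hash"])
```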
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A speech emotion recognition method, characterized in that the speech emotion recognition method comprises:
acquiring a conversation voice information set, wherein the conversation voice information set is used for indicating conversation voice information between a user and a target customer service, and the target customer service comprises a robot customer service and an artificial customer service;
carrying out voice data preprocessing on the conversation voice information set to obtain a processed voice information set, and carrying out emotion type division on the processed voice information set to obtain a voice training sample set;
performing iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the voice training sample set to obtain a target emotion recognition model;
transplanting the target emotion recognition model to a mobile terminal, and performing emotion recognition processing on user voice information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays character information and/or pattern information corresponding to the target emotion category.
2. The speech emotion recognition method of claim 1, wherein the performing speech data preprocessing on the conversational speech information set to obtain a processed speech information set, and performing emotion type classification on the processed speech information set to obtain a speech training sample set comprises:
sequentially performing noise reduction, speech enhancement, and speech segment clipping on the conversation voice information set to obtain an initial voice information set;
extracting corresponding Mel-spectrum emotion features from the initial voice information set through a preset audio feature extraction tool, and storing the Mel-spectrum emotion features corresponding to each piece of initial voice information as a comma-separated-value format file to obtain a processed voice information set;
performing emotion recognition on the processed voice information sets respectively through a preset voice emotion classification model to obtain emotion types corresponding to the processed voice information;
and setting corresponding classification labels for the processed voice information according to the emotion types corresponding to the processed voice information, and storing the processed voice information set into a preset file directory in a classification manner according to the classification labels corresponding to the processed voice information to obtain a voice training sample set.
3. The speech emotion recognition method of claim 1, wherein the performing iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the speech training sample set to obtain a target emotion recognition model comprises:
dividing the voice training sample set according to a preset proportion to obtain a voice training set and a voice testing set;
performing pruning and quantization on a preset bidirectional long-short term memory network model within a neural network framework through a preset weight pruning algorithm to obtain a pruned bidirectional long-short term memory network model;
performing model training and model testing on the pruned bidirectional long-short term memory network model according to preset training times based on the voice training set and the voice testing set to obtain a trained emotion recognition model;
when the trained emotion recognition model meets a preset iteration termination condition, determining the trained emotion recognition model as the target emotion recognition model, and storing the target emotion recognition model as a first model file, wherein the preset iteration termination condition is that the number of pruning iterations reaches a preset number and/or the accuracy of the trained emotion recognition model falls below a preset accuracy threshold, and the first model file is a file in a standard model format.
4. The speech emotion recognition method of claim 3, wherein the transplanting of the target emotion recognition model to a mobile terminal and emotion recognition processing on the user speech information collected in real time through the target emotion recognition model in the mobile terminal obtain a target emotion category, so that the mobile terminal displays text information and/or pattern information corresponding to the target emotion category, comprises:
converting the first model file into a second model file, installing a dependency package of a machine learning framework in the mobile terminal, and executing the second model file to obtain an execution result, wherein the second model file is a file in a migration model format;
when the execution result is that the execution is successful, determining that the target emotion recognition model is successfully transplanted to the mobile terminal;
and performing emotion recognition processing on the user voice information acquired in real time through a target emotion recognition model in the mobile terminal to obtain a target emotion category, and determining corresponding character information or pattern information according to the target emotion category so that the mobile terminal displays the character information or pattern information corresponding to the target emotion category.
5. The speech emotion recognition method of claim 4, wherein the emotion recognition processing is performed on the user speech information collected in real time through a target emotion recognition model in the mobile terminal to obtain a target emotion category, and corresponding text information or pattern information is determined according to the target emotion category, so that the mobile terminal displays the text information or pattern information corresponding to the target emotion category, and the method includes:
carrying out feature extraction on user voice information acquired in real time in the mobile terminal to obtain voice emotion feature information, and carrying out classification processing on the voice emotion feature information through a target emotion recognition model in the mobile terminal to obtain a target emotion category;
setting the target emotion category as a target key, and inquiring a preset memory database according to the target key to obtain character information or pattern information corresponding to the target emotion category;
and calling a preset real-time data push interface to send the text information or the pattern information corresponding to the target emotion category to the mobile terminal, so that the mobile terminal displays the text information or the pattern information corresponding to the target emotion category.
6. The speech emotion recognition method according to any one of claims 1 to 5, wherein after the target emotion recognition model is transplanted to a mobile terminal, and emotion recognition processing is performed on user speech information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays text information and/or pattern information corresponding to the target emotion category, the speech emotion recognition method further comprises:
acquiring a user identifier, and generating a user emotion portrait according to the user identifier, the user voice information acquired in real time and the target emotion category;
and pushing the user emotion portrait to the mobile terminal at regular time according to a preset notification mode so that the mobile terminal displays the user emotion portrait.
7. The speech emotion recognition method according to any one of claims 1 to 5, wherein after the target emotion recognition model is transplanted to a mobile terminal, and emotion recognition processing is performed on user speech information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, so that the mobile terminal displays text information and/or pattern information corresponding to the target emotion category, the speech emotion recognition method further comprises:
counting a plurality of emotion categories of the user within a preset time length to obtain a comprehensive emotion state of the user, wherein the plurality of emotion categories of the user comprise the target emotion category;
evaluating the health condition of the user according to the comprehensive emotional state of the user and preset user health information;
customizing a health diet plan and a health exercise plan based on the health condition of the user, and pushing the health diet plan and the health exercise plan to the mobile terminal.
8. A speech emotion recognition apparatus, characterized in that the speech emotion recognition apparatus comprises:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a conversation voice information set, the conversation voice information set is used for indicating conversation voice information between a user and a target customer service, and the target customer service comprises a robot customer service and an artificial customer service;
the dividing module is used for preprocessing voice data of the conversation voice information set to obtain a processed voice information set and dividing emotion types of the processed voice information set to obtain a voice training sample set;
the pruning training module is used for carrying out iterative pruning processing and model training on a preset bidirectional long-short term memory network model based on the voice training sample set to obtain a target emotion recognition model;
and the recognition module is used for transplanting the target emotion recognition model to a mobile terminal, performing emotion recognition processing on the user voice information collected in real time through the target emotion recognition model in the mobile terminal to obtain a target emotion category, and enabling the mobile terminal to display character information and/or pattern information corresponding to the target emotion category.
9. A speech emotion recognition apparatus characterized by comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the speech emotion recognition device to perform the speech emotion recognition method of any of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the speech emotion recognition method according to any of claims 1-7.
CN202110408995.5A 2021-04-16 2021-04-16 Voice emotion recognition method, device, equipment and storage medium Active CN113129927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110408995.5A CN113129927B (en) 2021-04-16 2021-04-16 Voice emotion recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113129927A (en) 2021-07-16
CN113129927B CN113129927B (en) 2023-04-07

Family

ID=76777446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110408995.5A Active CN113129927B (en) 2021-04-16 2021-04-16 Voice emotion recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113129927B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705807A (en) * 2017-08-24 2018-02-16 平安科技(深圳)有限公司 Voice quality detecting method, device, equipment and storage medium based on Emotion identification
CN108874997A (en) * 2018-06-13 2018-11-23 广东外语外贸大学 A kind of named entity recognition method for film comments
CN110888059A (en) * 2019-12-03 2020-03-17 西安科技大学 Algorithm based on improved random forest combined cubature Kalman power battery state of charge estimation
CN111081279A (en) * 2019-12-24 2020-04-28 深圳壹账通智能科技有限公司 Voice emotion fluctuation analysis method and device
US20210064827A1 (en) * 2019-08-29 2021-03-04 Oracle International Corporation Adjusting chatbot conversation to user personality and mood
CN112527994A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113794803A (en) * 2021-08-06 2021-12-14 维沃移动通信(杭州)有限公司 Unread message prompting method, unread message prompting device, electronic equipment and unread message prompting medium
CN113925376A (en) * 2021-11-15 2022-01-14 杭州赛孝健康科技有限公司 Intelligent toilet chair
CN113925376B (en) * 2021-11-15 2022-11-15 杭州赛孝健康科技有限公司 Intelligent potty chair
CN114553520A (en) * 2022-02-21 2022-05-27 华南师范大学 Network attack data stream synthesis method and device, electronic equipment and storage medium
CN114553520B (en) * 2022-02-21 2023-11-21 华南师范大学 Network attack data stream synthesis method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113129927B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN113129927B (en) Voice emotion recognition method, device, equipment and storage medium
CN110223705B (en) Voice conversion method, device, equipment and readable storage medium
CN110120224B (en) Method and device for constructing bird sound recognition model, computer equipment and storage medium
US10878823B2 (en) Voiceprint recognition method, device, terminal apparatus and storage medium
EP2122610B1 (en) Customizable method and system for emotional recognition
CN110853649A (en) Label extraction method, system, device and medium based on intelligent voice technology
CN108428446A (en) Audio recognition method and device
Pawar et al. Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients
CN104538024A (en) Speech synthesis method, apparatus and equipment
CN113436634B (en) Voice classification method and device based on voiceprint recognition and related equipment
CN112712813A (en) Voice processing method, device, equipment and storage medium
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN113990352A (en) User emotion recognition and prediction method, device, equipment and storage medium
CN114151293B (en) Fault early warning method, system, equipment and storage medium of fan variable pitch system
CN115249479A (en) BRNN-based power grid dispatching complex speech recognition method, system and terminal
US20040193894A1 (en) Methods and apparatus for modeling based on conversational meta-data
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN1714390A (en) Speech recognition device and method
CN113077821A (en) Audio quality detection method and device, electronic equipment and storage medium
CN113590771A (en) Data mining method, device, equipment and storage medium
CN112309372A (en) Tone-based intention identification method, device, equipment and storage medium
CN115512698B (en) Speech semantic analysis method
CN110782221A (en) Intelligent interview evaluation system and method
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN114242109A (en) Intelligent outbound method and device based on emotion recognition, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant