CN113314150A - Emotion recognition method and device based on voice data and storage medium - Google Patents

Emotion recognition method and device based on voice data and storage medium

Info

Publication number
CN113314150A
Authority
CN
China
Prior art keywords
emotion
text
voice
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110575150.5A
Other languages
Chinese (zh)
Inventor
邓真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN202110575150.5A priority Critical patent/CN113314150A/en
Publication of CN113314150A publication Critical patent/CN113314150A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application belong to the technical field of artificial intelligence and relate to an emotion recognition method based on voice data that improves the accuracy of emotion recognition. The method comprises the following steps: acquiring voice data of a user; converting the voice data into text data; performing emotion recognition on the voice data by adopting a preset voice emotion model and outputting a voice emotion label, wherein the voice emotion model is a combined model comprising an extreme gradient boosting XGBoost model and a long short-term memory network LSTM model; performing emotion recognition on the text data by adopting a preset text emotion model and outputting a text emotion label; and determining a comprehensive emotion label according to the voice emotion label and the text emotion label. The application also provides an emotion recognition device based on voice data, a computer device and a storage medium. In addition, the application relates to blockchain technology: the voice data of the user can be stored in a blockchain.

Description

Emotion recognition method and device based on voice data and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for emotion recognition based on speech data, a computer device, and a storage medium.
Background
Emotion recognition is currently an important research direction in Natural Language Processing (NLP). Specifically, it uses Artificial Intelligence (AI) technology to automatically identify an individual's emotional state from acquired physiological or non-physiological signals, and it is an important component of affective computing. Emotion recognition research covers facial expressions, voice, heart rate, behavior, text, physiological signals and other aspects, from which the emotional state of the user is judged. The human voice is an important behavioral signal that reflects human emotion: the tone, frequency fluctuations and the text content corresponding to the voice can all reflect part of a person's emotion.
Emotion recognition based on speech signals has been widely studied in recent years. However, because emotion recognition performed only on the user's voice relies on few detection dimensions, the recognition result is often not accurate enough.
Disclosure of Invention
The embodiments of the application aim to provide an emotion recognition method and apparatus based on voice data, a computer device and a storage medium, mainly to solve the technical problem that existing voice-based emotion recognition relies on too few detection dimensions and is therefore not accurate enough.
In order to solve the above technical problem, an embodiment of the present application provides an emotion recognition method based on voice data, which adopts the following technical solutions:
acquiring voice data of a user;
converting the voice data into text data;
performing emotion recognition on the voice data by adopting a preset voice emotion model, and outputting a voice emotion label, wherein the voice emotion model is a combined model comprising an Xgboost model and an LSTM model;
performing emotion recognition on the text data by adopting a preset text emotion model, and outputting a text emotion label;
and determining a comprehensive emotion label according to the voice emotion label and the text emotion label.
In order to solve the above technical problem, an embodiment of the present application further provides an emotion recognition apparatus based on voice data, which adopts the following technical scheme:
an acquisition unit for acquiring voice data of a user;
a conversion unit for converting the voice data into text data;
the first emotion recognition unit is used for performing emotion recognition on the voice data by adopting a preset voice emotion model and outputting a voice emotion label, wherein the voice emotion model is a combined model comprising an Xgboost model and an LSTM model;
the second emotion recognition unit is used for performing emotion recognition on the text data by adopting a preset text emotion model and outputting a text emotion label;
and the third emotion recognition unit is used for determining a comprehensive emotion label according to the voice emotion label and the text emotion label.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
a computer device comprising a memory storing computer readable instructions and a processor which, when executing the computer readable instructions, performs the steps of the emotion recognition method based on voice data described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of a speech data based emotion recognition method as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
in this proposal, after the voice data of the user is acquired, the voice data is first converted into text data. And then, performing emotion recognition on the voice data and the text data of the user by respectively using the voice emotion model and the text emotion model, and determining a voice emotion tag and a text emotion tag. And finally, determining a comprehensive emotion label according to the voice emotion label and the text emotion label. In the processing process, emotion recognition is carried out on the basis of the data of the two dimensions of the voice data and the text data of the user, and the final result of the emotion recognition can represent the voice data and the text data, so that the final result of the emotion recognition has better accuracy.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of an embodiment of a method for emotion recognition based on speech data in an embodiment of the present application;
FIG. 3 is a flowchart of one embodiment after step S240;
FIG. 4 is a flow chart of yet another embodiment of a method for emotion recognition based on speech data in an embodiment of the present application;
FIG. 5 is a flow chart of an embodiment of an emotion recognition apparatus based on speech data in an embodiment of the present application;
FIG. 6 is a flow chart of yet another embodiment of an emotion recognition apparatus based on speech data in an embodiment of the present application;
FIG. 7 is a flow chart of yet another embodiment of an emotion recognition apparatus based on speech data in an embodiment of the present application;
FIG. 8 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, MPEG compression standard audio layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, MPEG compression standard audio layer 4), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the emotion recognition method based on voice data provided in the embodiments of the present application is generally performed by a server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a method of emotion recognition based on speech data is shown, in accordance with the present application. The emotion recognition method based on the voice data comprises the following steps:
step S210, acquiring voice data of the user.
In the present embodiment, the electronic device on which the emotion recognition method based on voice data runs (for example, the server/terminal device shown in FIG. 1) may acquire the voice data of the user directly, or may receive voice data of the user collected by other equipment through a wired or wireless connection. The wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, an Ultra Wideband (UWB) connection, and other wireless connection modes now known or developed in the future. In addition, it should be noted that the voice data of the user may be collected in real time during a conversation or may come from a historical recording of the user, and the file format of the voice data may include common audio formats such as CD, WAVE, AIFF, MPEG, MP3, MPEG-4, or the like.
It is emphasized that, to further ensure the privacy and security of the voice data, the voice data may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing information about a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step S220, converting the voice data into text data.
Specifically, an Automatic Speech Recognition (ASR) technique may be employed to convert the voice data into text data. During conversion, the original voice data may be converted into a PCM file or the like, and then preprocessed by denoising, framing, windowing (for example with a Hamming window) and the like to obtain windowed voice data. After that, algorithms such as Linear Predictive Cepstral Coefficients (LPCC) or Mel-Frequency Cepstral Coefficients (MFCC) can be used to process the windowed voice data, turning the waveform of each frame into a multi-dimensional feature vector containing the acoustic information. A preset acoustic model then maps the multi-dimensional feature vectors to the corresponding pinyin text, and the pinyin text is converted into a plurality of candidate text data according to a preset dictionary. Finally, a preset language model determines the probability that the characters or words contained in the candidate text data are correlated with each other, the finally recognized text data is determined according to these probabilities, and the result is decoded and output.
Each sub-step of the above conversion process is a common technique in current ASR and will not be described in detail here.
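Purely as an informal illustration of the framing, windowing and feature-extraction sub-steps above, the following sketch computes frame-level MFCC features with the third-party librosa library; the file name, sample rate and frame parameters are assumptions rather than values prescribed by this application.
    # Minimal sketch (assumed parameters): framing, Hamming windowing and MFCC
    # extraction using the third-party librosa library.
    import librosa

    def extract_mfcc(path="user_call.wav", n_mfcc=13):
        signal, sr = librosa.load(path, sr=16000)   # load and resample to 16 kHz mono
        # librosa frames the signal and applies the window internally before the STFT;
        # 400-sample frames (25 ms) with a 160-sample hop (10 ms) are a common choice.
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=160, window="hamming")
        return mfcc.T   # one n_mfcc-dimensional feature vector per frame

    frames = extract_mfcc()
    print(frames.shape)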
Step S230, performing emotion recognition on the voice data by using a preset voice emotion model, and outputting a voice emotion tag, wherein the voice emotion model is a combined model comprising an XGBoost model and an LSTM model.
Specifically, the voice emotion model may include an eXtreme Gradient Boosting (XGBoost) model and a long short-term memory (LSTM) model. The XGBoost model is a tree model that can perform multi-class classification tasks, and the LSTM model can be used for sequence classification tasks. When emotion recognition is performed on the voice data, the voice data can first be preprocessed by denoising, framing and windowing to obtain windowed voice data. A short-time Fourier transform (STFT) then converts the windowed voice data into audio features, which are input into the two models respectively to generate two predicted probability values. Finally, either a voting mechanism selects the larger of the two predicted probability values, or a weighted sum of the two determines the final probability value, and the voice emotion tag corresponding to the final probability value is taken as the voice emotion tag of the voice data.
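Purely as an illustrative sketch and not the application's reference implementation, the weighted-sum variant of this combined model could look as follows; the model file names, feature shape and weight alpha are assumed, and the xgboost and Keras APIs stand in for any trained XGBoost and LSTM classifiers.
    # Minimal sketch of combining XGBoost and LSTM predictions by weighted summation.
    import numpy as np
    import xgboost as xgb
    from tensorflow.keras.models import load_model

    def predict_voice_emotion(stft_features, alpha=0.5):
        """stft_features: (frames, freq_bins) magnitude spectrogram of one utterance."""
        # XGBoost expects a fixed-length vector, so pool the frame axis first.
        pooled = stft_features.mean(axis=0).reshape(1, -1)
        booster = xgb.XGBClassifier()
        booster.load_model("voice_xgb.json")          # hypothetical trained model file
        p_xgb = booster.predict_proba(pooled)         # shape (1, n_emotions)

        lstm = load_model("voice_lstm.h5")            # hypothetical trained model file
        p_lstm = lstm.predict(stft_features[np.newaxis, ...])   # shape (1, n_emotions)

        # Weighted sum of the two predicted probability vectors; argmax picks the tag.
        p_final = alpha * p_xgb + (1 - alpha) * p_lstm
        return int(np.argmax(p_final))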
Step S240, performing emotion recognition on the text data by using a preset text emotion model, and outputting a text emotion tag.
Specifically, after the text data is obtained, emotion recognition can be performed on it with a preset text emotion model, and a text emotion tag is output. The emotion types that can be detected in voice may differ from those that can be detected in text, so the recognizable emotion tags configured in the text emotion model may differ from those configured in the voice emotion model. For example, voice emotion tags may include joy, sorrow, anger, etc., while text emotion tags may include joy, anger, confidence, fear, etc.
It should be noted that the execution order of steps S230 and S240 is not particularly limited.
Step S250, determining a comprehensive emotion tag according to the voice emotion tag and the text emotion tag.
In particular, the emotion conveyed by a person's voice and the emotion conveyed by the corresponding text can often differ greatly. For example, in a joking scene, the voice emotion tag recognized from the user's speech may be a positive emotion tag while the text may be recognized as a negative emotion tag because it contains negative vocabulary; conversely, in a sarcastic scene, the voice emotion tag may be negative while the text may be recognized as a positive emotion tag because it contains positive vocabulary.
Because such scenes exist, after determining the voice emotion tag and the text emotion tag, the present proposal further determines a comprehensive emotion tag expressed in the user's speech from the two.
There are several ways to determine the comprehensive emotion tag:
First, a different score may be set for each emotion tag in advance, and the final emotion tag is determined according to the emotional tendencies expressed by the score of the voice emotion tag and the score of the text emotion tag.
Specifically, the score indicates how positive or negative an emotion is. For example, positive emotions may be assigned scores in the range [0, 1], where a value closer to 1 means a stronger positive tendency, and negative emotions may be assigned scores in the range [-1, 0], where a value closer to -1 means a stronger negative tendency. After the voice emotion tag and the text emotion tag are obtained, whichever tag's score is closer to 1 or -1, that is, has the larger absolute emotional tendency, prevails and is taken as the final result.
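A minimal sketch of this score-based combination follows; the score table is purely illustrative and would be configured per emotion tag in a real deployment.
    # Hypothetical tendency scores: positive emotions in (0, 1], negative in [-1, 0).
    EMOTION_SCORES = {"happy": 0.8, "calm": 0.2, "sad": -0.6, "angry": -0.9}

    def combine_by_score(voice_tag, text_tag):
        s_voice, s_text = EMOTION_SCORES[voice_tag], EMOTION_SCORES[text_tag]
        # Whichever score lies closer to 1 or -1 (larger absolute value) prevails.
        return voice_tag if abs(s_voice) >= abs(s_text) else text_tag

    print(combine_by_score("happy", "angry"))   # -> "angry", since |-0.9| > |0.8|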
Second, a correspondence rule for determining the comprehensive emotion tag from the voice emotion tag and the text emotion tag may be preset, and the comprehensive emotion tag is then determined according to this rule.
Specifically, the correspondence rule may be a mapping that associates each combination of a voice emotion tag and a text emotion tag with a comprehensive emotion tag. For example, the correspondence may be schematically shown in Table 1 below:
Voice emotion tag | Text emotion tag | Comprehensive emotion tag
Happy | Sarcastic, angry | Joking
Sad | Sad, inferior | Sad
Sarcastic | Quzhan | Sarcastic
... | ... | ...
TABLE 1
As shown in Table 1 above, the comprehensive emotion tag corresponding to any combination of voice emotion tag and text emotion tag can be looked up according to the correspondence indicated by the correspondence rule.
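A minimal sketch of such a rule lookup is shown below; the entries only mirror the illustrative rows of Table 1, and the default value is an assumption.
    # Hypothetical correspondence rule: (voice tag, text tag) -> comprehensive tag.
    COMBINATION_RULES = {
        ("happy", "angry"): "joking",
        ("sad", "sad"): "sad",
    }

    def combine_by_rule(voice_tag, text_tag, default="unknown"):
        # Look up the comprehensive tag for this combination; fall back when no rule matches.
        return COMBINATION_RULES.get((voice_tag, text_tag), default)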
In the embodiment of the application, after the voice data of the user is acquired, the voice data is converted into the text data. And then, performing emotion recognition on the voice data and the text data of the user by respectively using the voice emotion model and the text emotion model, and determining a voice emotion tag and a text emotion tag. And finally, determining a comprehensive emotion label according to the voice emotion label and the text emotion label. In the processing process, emotion recognition is carried out on the basis of the data of the two dimensions of the voice data and the text data of the user, and the final result of the emotion recognition can represent the voice data and the text data, so that the final result of the emotion recognition has better accuracy.
In some optional implementations of this embodiment, if the voice data of the user is obtained in a conversation scene, a subsequent conversation emotion task may also be executed after the voice emotion tag and the text emotion tag are determined. Conversation emotion tasks may include analysis processing or other tasks of preset operation types. For example, in a customer service scene, whether violations such as a poor attitude or threats and intimidation exist in the sentences of a worker may be determined according to the voice emotion tags and text emotion tags recognized from the worker's speech in a conversation; in a conversation robot scene, the conversation robot may determine its next reply sentence according to the emotion recognition result for the user's previous utterance.
Specifically, the emotion recognition method based on voice data can be applied to a scene for detecting whether a worker violates rules when the worker serves a customer. Referring specifically to fig. 3, after step S240, the emotion recognition method may further include:
and step S360, judging whether the voice data has violation according to the voice emotion label.
Specifically, the voice emotion tag is examined, and if the detected voice emotion tag belongs to any one of the voice emotion tags such as anger, insult, hatred, aversion, fighting, falseness, contempt, shame, jealousy, conflict and annoyance, it is determined that a violation exists in the voice data.
Step S370, judging whether the voice data has a violation according to the text emotion tag.
Specifically, the text emotion tag is examined, and if the detected text emotion tag indicates a poor attitude or a threat, it is determined that the voice data has a violation.
Step S380, determining the violation degree of the voice data according to the violation judgment results respectively indicated by the voice emotion tag and the text emotion tag.
Specifically, if both the voice emotion tag and the text emotion tag of the worker's voice data are judged to be violation tags, the worker's statement is judged to contain a serious violation; if only the text emotion tag indicates a violation, the violation is judged to be moderate; if only the voice emotion tag indicates a violation, the violation is judged to be slight; otherwise, it is judged that there is no violation.
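As an illustrative sketch, the decision of steps S360 to S380 can be written as a simple rule; the degree names follow the text above, while the return values themselves are assumptions.
    def violation_degree(voice_violates: bool, text_violates: bool) -> str:
        # Both dimensions flag a violation: serious; text only: moderate; voice only: slight.
        if voice_violates and text_violates:
            return "serious"
        if text_violates:
            return "moderate"
        if voice_violates:
            return "slight"
        return "none"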
It should be noted that, here, the conversation emotion task specifically refers to determining, according to the voice emotion tag and the text emotion tag, whether the voice data of the worker in the conversation with the client contains a violation, and the executed process is a judgment process.
In this embodiment of the application, when a worker is detected to be in a conversation with a client, the voice emotion tag and the text emotion tag indicated by the worker's voice data are considered together to determine whether the worker violates the rules, which can effectively improve the accuracy of violation detection during the conversation.
In some possible implementations, the text emotion model may include an XLNet model, and in step S240, performing emotion recognition on the text data by using a preset text emotion model and outputting the text emotion tag may include:
performing emotion recognition on the text data by using the XLNet model, and outputting a text emotion tag.
Specifically, the XLNet model is an improvement on the BERT model and can be used in the field of Natural Language Processing (NLP), for example in scenarios such as question answering, text classification and natural language understanding.
In this embodiment, emotion recognition on text data is implemented with an XLNet model: the XLNet model is trained in advance, and the text data is input into the trained XLNet model, which outputs the text emotion tag corresponding to the text data.
In some possible implementations, when the text emotion model is an XLNet model, in step S370, determining whether the voice data has a violation according to the text emotion tag may include:
when the text emotion tag is a violation emotion tag, performing soft-violence vocabulary matching on the text data, and determining that the voice data has a violation when a soft-violence word is found in the text data; and when the text emotion tag is not a violation emotion tag, or is a violation emotion tag but no soft-violence word is matched in the text data, determining that the voice data has no violation.
Specifically, when the text emotion tag is determined to be a violation emotion tag, soft-violence vocabulary matching may be further applied to the text data at the suspected violation point to further determine whether the user's voice data has a violation. The soft-violence vocabulary can be set and updated by developers and may include, for example: sealing up, freezing, mortgage, disposal, auction, property preservation, laws and regulations, courts and legal procedures, etc. When such soft-violence words appear in the text data, it is determined that the user's voice data contains a violation. If the text emotion tag is not a violation emotion tag, or no soft-violence word exists in the text data, it is determined that the voice data has no violation.
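A minimal sketch of this check follows; the word list only mirrors the examples above, the violation tag set is assumed, and simple substring matching stands in for whatever matching method an implementation actually uses.
    SOFT_VIOLENCE_WORDS = ["seal up", "freeze", "mortgage", "disposal", "auction",
                           "property preservation", "court", "legal procedure"]
    VIOLATION_TAGS = {"poor attitude", "threat"}          # assumed violation emotion tags

    def text_indicates_violation(text_emotion_tag, text_data):
        if text_emotion_tag not in VIOLATION_TAGS:
            return False                                   # not a violation emotion tag
        # A violation is confirmed only if a soft-violence word also appears in the text.
        return any(word in text_data for word in SOFT_VIOLENCE_WORDS)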
In this embodiment of the application, combining XLNet model detection with soft-violence vocabulary matching on the text data can effectively improve the accuracy of determining, from the text data, whether the user's voice data violates the rules.
In some possible implementations, the text emotion model may include an XLNet model and a HAN model, and the step S240 of performing emotion recognition on the text data by using a preset text emotion model and outputting a text emotion tag may include:
inputting the text data into an XLNet model and outputting a first text emotion tag corresponding to the text data; and if the first text emotion tag is a violation emotion tag, inputting the text data and the context text data into a HAN model and outputting a second text emotion tag.
Specifically, the XLNet model is an improvement on the BERT model and can be used in the field of Natural Language Processing (NLP), for example in scenarios such as question answering, text classification and natural language understanding.
In this embodiment, emotion recognition on text data is implemented with an XLNet model: the XLNet model is trained in advance, and the text data is input into the trained XLNet model, which outputs the first text emotion tag corresponding to the text data.
The Hierarchical Attention Network (HAN), in contrast, is a full-text model. When the first text emotion tag is determined to be a violation emotion tag, the HAN model may scan the text data together with its context text data and output a second text emotion tag. Specifically, during scanning, five sentence pairs can be used as the input from which the second text emotion tag is produced. The reason is that if the whole text is fed to the HAN model, the full-text features related to the violation that can be extracted are limited, which lowers the violation recognition accuracy; using five sentence pairs as the model input was found to work best, as it effectively controls the length of the input text and improves the recognition accuracy of the model. It should be noted that other numbers of sentences may also be used: five sentences merely gave the best efficiency and accuracy during experiments, and if another number proves better in practice it can be set flexibly; this embodiment is not specifically limited in this respect.
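The two-stage check can be sketched as below; xlnet_predict and han_predict stand in for the trained XLNet and HAN models (their loading is omitted), the violation tag set is assumed, and the five-sentence window follows the description above.
    VIOLATION_TAGS = {"poor attitude", "threat"}           # assumed violation emotion tags

    def two_stage_text_tag(sentences, idx, xlnet_predict, han_predict, window=5):
        first_tag = xlnet_predict(sentences[idx])          # first text emotion tag
        if first_tag not in VIOLATION_TAGS:
            return first_tag
        # Only suspected violations are re-checked together with surrounding context.
        start = max(0, idx - window // 2)
        context = sentences[start:start + window]          # roughly five sentences
        return han_predict(context)                        # second text emotion tag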
Step S370, determining whether the voice data has a violation according to the text emotion tag, which may include:
and determining whether the voice data has a violation according to the second text emotion tag.
Specifically, if the second text emotion tag indicates a poor attitude or a threat, it is determined that the voice data has a violation.
In this embodiment of the application, applying XLNet model detection and then HAN model detection to the text data can effectively improve the accuracy of determining, from the text data, whether the user's voice data violates the rules.
In some optional implementations of the present embodiment, the emotion recognition method based on voice data may be applied in an application scenario of a conversation robot. Fig. 4 is a schematic diagram of another embodiment of an emotion recognition method based on speech data in the embodiment of the present application, wherein,
step S210, acquiring voice data of a user, may include:
step S411, acquiring voice data of the user in the dialogue between the dialogue robot and the user.
Specifically, the conversation robot may be a physical device, or an application running on an intelligent terminal, capable of communicating with a person in natural language; its specific form is not limited here. When the conversation robot converses with a person, the voice data input by the user in the conversation can be recorded.
After step S250, the emotion recognition method may further include:
and step S460, generating a reply sentence according to the comprehensive emotion label.
Specifically, in this scenario, at least one standard reply sentence may be set in advance for each comprehensive emotion tag, so that the standard reply sentence corresponding to the user's comprehensive emotion tag can be selected to reply to the user. The reply sentence can be in voice or text form. For example, if the voice emotion tag and the text emotion tag together indicate that the current user is angry, a corresponding reply sentence can be selected from preset scripts to reply to the user; if they indicate that the current user is impatient, a reply sentence can be selected from preset comforting phrases.
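A minimal sketch of this reply selection follows; the reply table and default sentence are illustrative assumptions rather than content of this application.
    import random

    STANDARD_REPLIES = {        # hypothetical standard reply sentences per comprehensive tag
        "angry": ["I am sorry for the inconvenience; let me help you right away."],
        "impatient": ["Thank you for your patience; I will keep this brief."],
    }

    def generate_reply(comprehensive_tag, default="Is there anything else I can help with?"):
        candidates = STANDARD_REPLIES.get(comprehensive_tag)
        # Pick one of the preset standard reply sentences for this emotion tag, if any.
        return random.choice(candidates) if candidates else default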
In some possible implementations, the text emotion model may include an XLNet model, which is then used to determine the text emotion tag indicated by the text data.
In this embodiment of the application, when the conversation robot is detected to be in a conversation with the user, the voice emotion tag and the text emotion tag indicated by the user's voice data can be considered together, so that reply sentences can be set more accurately according to both tags.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing relevant hardware with computer readable instructions, which can be stored in a computer readable storage medium; when the instructions are executed, the processes of the embodiments of the methods described above may be included. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a Read-Only Memory (ROM), or a volatile storage medium such as a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps need not be performed in the exact order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to fig. 5, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an emotion recognition apparatus based on voice data, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 5, the emotion recognition apparatus based on speech data according to the present embodiment may include:
an obtaining unit 510, configured to obtain voice data of a user;
a conversion unit 520 for converting the voice data into text data;
a first emotion recognition unit 530, configured to perform emotion recognition on the voice data by using a preset voice emotion model, and output a voice emotion tag, where the voice emotion model is a combined model including an Xgboost model and an LSTM model;
the second emotion recognition unit 540 is configured to perform emotion recognition on the text data by using a preset text emotion model, and output a text emotion tag;
and a third emotion recognition unit 550, configured to determine a comprehensive emotion tag according to the voice emotion tag and the text emotion tag.
In some possible implementations, referring specifically to fig. 6, the emotion recognition apparatus based on voice data may further include:
a first violation detecting unit 660, configured to determine whether the voice data has a violation according to the voice emotion tag;
a second violation detecting unit 670, configured to determine whether the voice data has a violation according to the text emotion tag;
and a violation degree determination unit 680, configured to determine the violation degree of the voice data according to the violation judgment results respectively indicated by the voice emotion tag and the text emotion tag.
In some possible implementations, the text emotion model includes an XLNet model, and the second emotion recognition unit 540 may include:
a first XLNet detection module, configured to input the text data into an XLNet model and output a text emotion tag corresponding to the text data.
In some possible implementations, the second violation detecting unit 670 specifically includes:
the first text violation detection module is used for performing soft violent vocabulary matching on the text data when the text emotional tag is a violation emotional tag, and determining that the voice data is violated when the soft violent vocabulary is determined to exist in the text data; when the text emotion tag is not a violation emotion tag or is a violation emotion tag but the soft violent vocabulary is not matched in the text data, determining that the speech data has no violation.
In some possible implementations, the text emotion model includes an XLNet model and a HAN model, and the second emotion recognition unit 540 may include:
a second XLNet detection module, configured to input the text data into an XLNet model and output a first text emotion tag corresponding to the text data;
a HAN detection module, configured to input the text data and the context text data into a HAN model and output a second text emotion tag if the first text emotion tag is a violation emotion tag;
the second violation detecting unit 670 specifically includes:
and the second text violation detection module is used for determining whether the voice data has a violation according to the second text emotion tag.
In some possible implementations, further reference is made to fig. 7, wherein,
the obtaining unit 510 may include:
the acquisition module 711 is used for acquiring voice data of a user in a conversation between the conversation robot and the user;
the emotion recognition apparatus based on voice data may further include:
a reply unit 760, configured to generate a reply sentence according to the comprehensive emotion tag.
In the embodiment of the application, after the emotion recognition device based on voice data acquires the voice data of a user, the voice data is converted into text data. And then, performing emotion recognition on the voice data and the text data of the user by respectively using the voice emotion model and the text emotion model, and determining a voice emotion tag and a text emotion tag. And finally, determining a comprehensive emotion label according to the voice emotion label and the text emotion label. In the processing process, emotion recognition is carried out on the basis of the data of the two dimensions of the voice data and the text data of the user, and the final result of the emotion recognition can represent the voice data and the text data, so that the final result of the emotion recognition has better accuracy.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 8, fig. 8 is a block diagram of a basic structure of the computer device according to the present embodiment.
The computer device includes a memory 810, a processor 820, and a network interface 830 communicatively coupled to each other via a system bus. It is noted that only a computer device having components 810 to 830 is shown, but it is understood that not all of the shown components are required and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 810 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 810 may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory 810 may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device. Of course, the memory 810 may also include both internal and external storage devices of the computer device. In this embodiment, the memory 810 is generally used for storing an operating system and various types of application software installed on the computer device, such as computer readable instructions of the aforementioned emotion recognition method based on voice data. The memory 810 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 820 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 820 is generally used to control the overall operation of the computer device. In this embodiment, the processor 820 is configured to execute computer readable instructions stored in the memory 810 or process data, such as executing computer readable instructions of the emotion recognition method based on voice data.
The network interface 830 may include a wireless network interface or a wired network interface, and the network interface 830 is generally used for establishing a communication link between the computer device and other electronic devices.
In the embodiment of the application, after the computer device obtains the voice data of the user, the voice data is converted into the text data. And then, performing emotion recognition on the voice data and the text data of the user by respectively using the voice emotion model and the text emotion model, and determining a voice emotion tag and a text emotion tag. And finally, determining a comprehensive emotion label according to the voice emotion label and the text emotion label. In the processing process, emotion recognition is carried out on the basis of the data of the two dimensions of the voice data and the text data of the user, and the final result of the emotion recognition can represent the voice data and the text data, so that the final result of the emotion recognition has better accuracy.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the speech data based emotion recognition method as described above.
In the embodiment of the application, after the voice data of the user is acquired, the voice data is converted into the text data. And then, performing emotion recognition on the voice data and the text data of the user by respectively using the voice emotion model and the text emotion model, and determining a voice emotion tag and a text emotion tag. And finally, determining a comprehensive emotion label according to the voice emotion label and the text emotion label. In the processing process, emotion recognition is carried out on the basis of the data of the two dimensions of the voice data and the text data of the user, and the final result of the emotion recognition can represent the voice data and the text data, so that the final result of the emotion recognition has better accuracy.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not restrictive, of the broad invention, and that the appended drawings illustrate preferred embodiments of the invention and do not limit the scope of the invention. This application is capable of embodiments in many different forms and is provided for the purpose of enabling a thorough understanding of the disclosure of the application. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the present application may be practiced without modification or with equivalents of some of the features described in the foregoing embodiments. All equivalent structures made by using the contents of the specification and the drawings of the present application are directly or indirectly applied to other related technical fields and are within the protection scope of the present application.

Claims (10)

1. A method for emotion recognition based on voice data, comprising the steps of:
acquiring voice data of a user;
converting the voice data into text data;
performing emotion recognition on the voice data by adopting a preset voice emotion model, and outputting a voice emotion label, wherein the voice emotion model is a combined model comprising an extreme gradient boosting XGBoost model and a long short-term memory network LSTM model;
performing emotion recognition on the text data by adopting a preset text emotion model, and outputting a text emotion label;
and determining a comprehensive emotion label according to the voice emotion label and the text emotion label.
2. The emotion recognition method of claim 1, wherein after the steps of performing emotion recognition on the text data using a preset text emotion model and outputting a text emotion tag, the emotion recognition method further comprises:
judging whether the voice data has a violation according to the voice emotion tag;
judging whether the voice data has a violation according to the text emotion tag;
and determining the violation degree of the voice data according to violation judgment results respectively indicated by the voice emotion tag and the text emotion tag.
3. The emotion recognition method of claim 2, wherein the text emotion model includes an XLNet model, and the step of performing emotion recognition on the text data using a preset text emotion model and outputting the text emotion label specifically includes:
performing emotion recognition on the text data by adopting the XLNet model, and outputting a text emotion label.
4. The emotion recognition method of claim 3, wherein the step of determining whether the voice data has a violation according to the text emotion tag specifically comprises:
when the text emotion tag is a violation emotion tag, performing soft-violence vocabulary matching on the text data, and determining that the voice data has a violation when a soft-violence word is determined to exist in the text data;
and when the text emotion tag is not a violation emotion tag, or is a violation emotion tag but no soft-violence word is matched in the text data, determining that the voice data has no violation.
5. The emotion recognition method of claim 2, wherein the text emotion model includes an XLNet model and a HAN model, and the step of performing emotion recognition on the text data by using a preset text emotion model and outputting a text emotion label specifically includes:
performing emotion recognition on the text data by adopting the XLNet model, and outputting a first text emotion label;
if the first text emotion label is a violation emotion label, inputting the text data and the context text data into a hierarchical attention network HAN model, and outputting a second text emotion label;
the step of judging whether the voice data has a violation according to the text emotion tag specifically comprises the following steps:
and determining whether the voice data has a violation according to the second text emotion tag.
6. The emotion recognition method of claim 1,
the step of acquiring the voice data of the user specifically includes:
acquiring voice data of a user in a conversation between a conversation robot and the user;
after the step of determining a composite emotion tag according to the voice emotion tag and the text emotion tag, the emotion recognition method further includes:
and generating a reply sentence according to the comprehensive emotion label.
7. The emotion recognition method according to any one of claims 1 to 6, wherein the step of determining a comprehensive emotion label based on the speech emotion label and the text emotion label specifically includes:
and determining a comprehensive emotion label corresponding to the combination of the voice emotion label and the text emotion label according to a preset corresponding rule, wherein the corresponding rule comprises the corresponding relation between the combination of different voice emotion labels and different text emotion labels and the comprehensive emotion label.
8. An emotion recognition apparatus based on voice data, characterized by comprising:
an acquisition unit for acquiring voice data of a user;
a conversion unit configured to convert the voice data into text data;
the first emotion recognition unit is used for performing emotion recognition on the voice data by adopting a preset voice emotion model and outputting a voice emotion label, wherein the voice emotion model is a combined model comprising an Xgboost model and an LSTM model;
the second emotion recognition unit is used for performing emotion recognition on the text data by adopting a preset text emotion model and outputting a text emotion label;
and the third emotion recognition unit is used for determining a comprehensive emotion label according to the voice emotion label and the text emotion label.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor which when executed performs the steps of the speech data based emotion recognition method as claimed in any of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the steps of the emotion recognition method based on voice data according to any one of claims 1 to 7.
CN202110575150.5A 2021-05-26 2021-05-26 Emotion recognition method and device based on voice data and storage medium Pending CN113314150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110575150.5A CN113314150A (en) 2021-05-26 2021-05-26 Emotion recognition method and device based on voice data and storage medium

Publications (1)

Publication Number Publication Date
CN113314150A true CN113314150A (en) 2021-08-27

Family

ID=77374759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110575150.5A Pending CN113314150A (en) 2021-05-26 2021-05-26 Emotion recognition method and device based on voice data and storage medium

Country Status (1)

Country Link
CN (1) CN113314150A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110172999A1 (en) * 2005-07-20 2011-07-14 At&T Corp. System and Method for Building Emotional Machines
CN109767765A (en) * 2019-01-17 2019-05-17 平安科技(深圳)有限公司 Talk about art matching process and device, storage medium, computer equipment
CN110570879A (en) * 2019-09-11 2019-12-13 深圳壹账通智能科技有限公司 Intelligent conversation method and device based on emotion recognition and computer equipment
CN110910901A (en) * 2019-10-08 2020-03-24 平安科技(深圳)有限公司 Emotion recognition method and device, electronic equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chen Jun et al.: "Speech Emotion Recognition Based on a Multimodal Combination Model", Software *
Huang Lu et al.: "A Text Representation Learning Model Incorporating a Task-Specific Information Attention Mechanism", Data Analysis and Knowledge Discovery *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744445A (en) * 2021-09-06 2021-12-03 北京雷石天地电子技术有限公司 Match voting method, device, computer equipment and storage medium
CN113876354A (en) * 2021-09-30 2022-01-04 深圳信息职业技术学院 Processing method and device of fetal heart rate signal, electronic equipment and storage medium
CN113876354B (en) * 2021-09-30 2023-11-21 深圳信息职业技术学院 Fetal heart rate signal processing method and device, electronic equipment and storage medium
CN114548114A (en) * 2022-02-23 2022-05-27 平安科技(深圳)有限公司 Text emotion recognition method, device, equipment and storage medium
CN114548114B (en) * 2022-02-23 2024-02-02 平安科技(深圳)有限公司 Text emotion recognition method, device, equipment and storage medium
CN114758676A (en) * 2022-04-18 2022-07-15 哈尔滨理工大学 Multi-modal emotion recognition method based on deep residual shrinkage network

Similar Documents

Publication Title
CN108428446A (en) Audio recognition method and device
CN113314150A (en) Emotion recognition method and device based on voice data and storage medium
CN112328761B (en) Method and device for setting intention label, computer equipment and storage medium
CN112468659B (en) Quality evaluation method, device, equipment and storage medium applied to telephone customer service
CN112466314A (en) Emotion voice data conversion method and device, computer equipment and storage medium
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN112863683A (en) Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN114007131A (en) Video monitoring method and device and related equipment
CN112671985A (en) Agent quality inspection method, device, equipment and storage medium based on deep learning
CN116414959A (en) Digital person interaction control method and device, electronic equipment and storage medium
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN114398466A (en) Complaint analysis method and device based on semantic recognition, computer equipment and medium
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model
CN112669850A (en) Voice quality detection method and device, computer equipment and storage medium
CN111178226A (en) Terminal interaction method and device, computer equipment and storage medium
CN115242927A (en) Customer service object distribution method and device, computer equipment and storage medium
CN114783423A (en) Speech segmentation method and device based on speech rate adjustment, computer equipment and medium
CN113035230B (en) Authentication model training method and device and electronic equipment
CN114218356A (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN114637831A (en) Data query method based on semantic analysis and related equipment thereof
CN113421554A (en) Voice keyword detection model processing method and device and computer equipment
CN111369975A (en) University music scoring method, device, equipment and storage medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210827