CN116467455A - Emotion recognition method, emotion recognition device, electronic device, and storage medium - Google Patents

Emotion recognition method, emotion recognition device, electronic device, and storage medium

Info

Publication number
CN116467455A
CN116467455A (application CN202310504904.7A)
Authority
CN
China
Prior art keywords
emotion
target
recognition
category
sentence
Prior art date
Legal status
Pending
Application number
CN202310504904.7A
Other languages
Chinese (zh)
Inventor
欧阳升
王健宗
程宁
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202310504904.7A
Publication of CN116467455A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/253 Grammatical analysis; Style critique
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30 Semantic analysis
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides an emotion recognition method, an emotion recognition device, electronic equipment and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring target dialogue data, wherein the target dialogue data comprises target speaking audio of a target object; extracting the content of the target speaking audio to obtain a target text sentence corresponding to the target speaking audio; coding the target text sentence based on a preset emotion recognition model to obtain target sentence coding characteristics; carrying out emotion recognition on the coding features of the target sentences based on the emotion recognition model to obtain initial emotion categories of the target text sentences; carrying out emotion feature word recognition on the target text sentence based on a preset emotion word list to obtain a recognition result; obtaining a target emotion category corresponding to the target text sentence according to the identification result and the initial emotion category; and obtaining emotion state data of the target object based on the target emotion type. According to the embodiment of the application, the accuracy of emotion recognition can be improved.

Description

Emotion recognition method, emotion recognition device, electronic device, and storage medium
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to an emotion recognition method, an emotion recognition device, an electronic device, and a storage medium.
Background
At present, in many business fields (such as the insurance industry and commodity marketing), the emotional state of a user often needs to be understood in order to better conduct business. In the related art, the emotional state of the user is usually judged based on the subjective experience of a worker. This approach leads to frequent misjudgment of emotion and affects the accuracy of emotion recognition, so how to improve the accuracy of emotion recognition has become a technical problem to be solved urgently.
Disclosure of Invention
The embodiment of the application mainly aims to provide an emotion recognition method, an emotion recognition device, electronic equipment and a storage medium, and aims to improve the accuracy of emotion recognition.
To achieve the above object, a first aspect of an embodiment of the present application proposes an emotion recognition method, including:
obtaining target dialogue data, wherein the target dialogue data comprises target speaking audio of a target object;
extracting the content of the target speaking audio to obtain a target text sentence corresponding to the target speaking audio;
coding the target text sentence based on a preset emotion recognition model to obtain a target sentence coding feature;
carrying out emotion recognition on the target sentence coding features based on the emotion recognition model to obtain an initial emotion category of the target text sentence;
carrying out emotion feature word recognition on the target text sentence based on a preset emotion word list to obtain a recognition result;
obtaining a target emotion category corresponding to the target text sentence according to the identification result and the initial emotion category;
and obtaining emotion state data of the target object based on the target emotion category.
In some embodiments, the content extracting the target speech audio to obtain a target text sentence corresponding to the target speech audio includes:
extracting semantic features of the target speaking audio to obtain a target audio semantic representation;
performing format conversion on the target audio semantic representation to obtain an initial text sentence;
and carrying out text correction on the initial text sentence based on a preset grammar rule to obtain the target text sentence.
In some embodiments, the performing emotion recognition on the target sentence coding feature based on the emotion recognition model to obtain an initial emotion category of the target text sentence includes:
carrying out emotion scoring on the target sentence coding features based on the emotion classifier and the candidate emotion labels of the emotion recognition model to obtain sentence emotion scores;
and screening the candidate emotion labels based on the sentence emotion scores to obtain the initial emotion category of the target text sentence.
In some embodiments, after the obtaining the emotional state data of the target object based on the target emotional category, the emotion recognition method further includes:
acquiring object basic information of the target object;
carrying out portrait construction on the target object based on the object basic information and the emotion state data to obtain an object personality portrait;
screening target dialogue strategy information from preset candidate dialogue strategy information based on the object personality portrait;
pushing the target dialogue strategy information to a feedback object so that the feedback object can carry out dialogue with the target object according to the target dialogue strategy information.
In some embodiments, the identifying the emotion feature word of the target text sentence based on the preset emotion word list to obtain an identification result includes:
word segmentation processing is carried out on the target text sentence, so that a plurality of text word segments are obtained;
and comparing the text word segment with the reference emotion feature words in the preset emotion word list to obtain the recognition result, wherein the recognition result is used for representing that the reference emotion feature words exist in the text word segment or that the reference emotion feature words do not exist in the text word segment.
In some embodiments, the obtaining, according to the recognition result and the initial emotion category, a target emotion category corresponding to the target text sentence includes:
if the recognition result indicates that the text word segment has the reference emotion feature word, selecting the reference emotion feature word which is the same as the text word segment as a target emotion feature word;
inquiring vocabulary emotion categories corresponding to the target emotion feature words;
and obtaining the target emotion category according to a preset priority order, the vocabulary emotion category and the initial emotion category.
In some embodiments, the obtaining the target emotion category according to the preset priority order, the vocabulary emotion category and the initial emotion category includes:
comparing the vocabulary emotion category with the initial emotion category;
if the vocabulary emotion category is the same as the initial emotion category, the vocabulary emotion category or the initial emotion category is used as the target emotion category;
and if the vocabulary emotion category is different from the initial emotion category, selecting whichever of the vocabulary emotion category and the initial emotion category has the higher priority as the target emotion category according to the priority order.
To achieve the above object, a second aspect of the embodiments of the present application proposes an emotion recognition device, the device including:
the data acquisition module is used for acquiring target dialogue data, wherein the target dialogue data comprises target speaking audio of a target object;
the content extraction module is used for extracting the content of the target speaking audio to obtain a target text sentence corresponding to the target speaking audio;
the coding module is used for coding the target text sentence based on a preset emotion recognition model to obtain target sentence coding characteristics;
the emotion recognition module is used for carrying out emotion recognition on the target sentence coding features based on the emotion recognition model to obtain an initial emotion category of the target text sentence;
the feature word recognition module is used for recognizing the emotion feature words of the target text sentences based on a preset emotion word list to obtain recognition results;
the emotion type determining module is used for obtaining a target emotion type corresponding to the target text sentence according to the identification result and the initial emotion type;
and the emotion state data generation module is used for obtaining the emotion state data of the target object based on the target emotion category.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory, a processor, where the memory stores a computer program, and the processor implements the method described in the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of the first aspect.
According to the emotion recognition method, the emotion recognition device, the electronic equipment and the storage medium, target dialogue data are obtained, wherein the target dialogue data comprise target speaking audio of a target object; and extracting the content of the target speaking audio to obtain a target text sentence corresponding to the target speaking audio, and converting the target speaking audio into a target text sentence form. Further, coding the target text sentence based on a preset emotion recognition model to obtain a target sentence coding feature; carrying out emotion recognition on the coding features of the target sentences based on the emotion recognition model to obtain initial emotion categories of the target text sentences; carrying out emotion feature word recognition on the target text sentence based on a preset emotion word list to obtain a recognition result; according to the identification result and the initial emotion category, the target emotion category corresponding to the target text sentence is obtained, and finally, the emotion state data of the target object is obtained based on the target emotion category.
Drawings
FIG. 1 is a flow chart of an emotion recognition method provided by an embodiment of the present application;
fig. 2 is a flowchart of step S102 in fig. 1;
fig. 3 is a flowchart of step S104 in fig. 1;
fig. 4 is a flowchart of step S105 in fig. 1;
fig. 5 is a flowchart of step S106 in fig. 1;
fig. 6 is a flowchart of step S503 in fig. 5;
FIG. 7 is another flow chart of a method of emotion recognition provided by an embodiment of the present application;
fig. 8 is a schematic structural diagram of an emotion recognition device according to an embodiment of the present application;
fig. 9 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several nouns referred to in this application are parsed:
artificial intelligence (artificial intelligence, AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding the intelligence of people; artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new intelligent machine that can react in a manner similar to human intelligence, research in this field including robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of consciousness and thinking of people. Artificial intelligence is also a theory, method, technique, and application system that utilizes a digital computer or digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Natural language processing (natural language processing, NLP): NLP is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often referred to as computational linguistics; it processes, understands, and applies human languages (e.g., Chinese, English, etc.). Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in technical fields such as machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and it involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and the like.
Information extraction (Information Extraction, IE): a text processing technique that extracts specified types of factual information, such as entities, relations, and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is made up of specific units, such as sentences, paragraphs, and chapters, and text information is made up of smaller specific units, such as words, phrases, sentences, paragraphs, or combinations of these units. Extracting noun phrases, person names, place names, and the like from text data are all forms of text information extraction; of course, the information extracted by text information extraction technology can be of various types.
Coding (encoder): converts an input sequence into a fixed-length vector;
Decoding (decoder): converts the previously generated fixed-length vector into an output sequence; the input sequence can be words, speech, images, or video, and the output sequence can be text or images.
At present, in many business fields (such as the insurance industry and commodity marketing), the emotional state of a user often needs to be understood in order to better conduct business. In the related art, the emotional state of the user is usually judged based on the subjective experience of a worker. This approach leads to frequent misjudgment of emotion and affects the accuracy of emotion recognition, so how to improve the accuracy of emotion recognition has become a technical problem to be solved urgently.
Based on this, the embodiment of the application provides an emotion recognition method, an emotion recognition device, electronic equipment and a storage medium, aiming at improving the accuracy of emotion recognition.
The emotion recognition method, the emotion recognition device, the electronic apparatus and the storage medium provided in the embodiments of the present application are specifically described through the following embodiments, and the emotion recognition method in the embodiments of the present application is described first.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the present application provides an emotion recognition method, relating to the technical field of artificial intelligence. The emotion recognition method provided by the embodiment of the present application can be applied to a terminal, a server side, or software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc.; the server side may be configured as an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms; the software may be an application that implements the emotion recognition method, but is not limited to the above forms.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to data related to user identity or characteristics, such as user information, user behavior data, user voice data, user history data, and user location information, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of an emotion recognition method provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, target dialogue data is obtained, wherein the target dialogue data comprises target speaking audio of a target object;
step S102, extracting the content of the target speaking audio to obtain a target text sentence corresponding to the target speaking audio;
step S103, coding the target text sentence based on a preset emotion recognition model to obtain a target sentence coding feature;
step S104, carrying out emotion recognition on the coding features of the target sentences based on the emotion recognition model to obtain initial emotion categories of the target text sentences;
step S105, identifying emotion feature words of the target text sentence based on a preset emotion word list, and obtaining an identification result;
step S106, obtaining a target emotion category corresponding to the target text sentence according to the identification result and the initial emotion category;
step S107, obtaining emotion state data of the target object based on the target emotion type.
In steps S101 to S107 illustrated in the embodiments of the present application, content extraction is performed on the target speech audio by obtaining target dialogue data, so as to obtain a target text sentence corresponding to the target speech audio, and the target speech audio can be converted into a target text sentence form. Further, coding the target text sentence based on a preset emotion recognition model to obtain a target sentence coding feature; carrying out emotion recognition on the coding features of the target sentences based on the emotion recognition model to obtain initial emotion categories of the target text sentences; carrying out emotion feature word recognition on the target text sentence based on a preset emotion word list to obtain a recognition result; according to the identification result and the initial emotion category, the target emotion category corresponding to the target text sentence is obtained, and finally, the emotion state data of the target object is obtained based on the target emotion category.
In step S101 of some embodiments, the target dialogue data of the target object may be acquired by conducting image-text consultation, call-back visits, video answering, and the like with the target object, or by acquiring voice evaluation questionnaire information in which the target object participates and extracting the target dialogue data from it. In the actual acquisition process, technical means such as a web crawler may be adopted to crawl data, for example by writing a web crawler, setting a data source, and then performing targeted crawling to obtain the target dialogue data; this approach can improve the rationality of the target dialogue data. The target object may be, without limitation, a consumer, a user, or a similar group. The target dialogue data includes the target speaking audio of the target object, which may include dialogue audio in a variety of formats such as MP3, WAV, WMA, MP2, FLAC, etc.
Referring to fig. 2, in some embodiments, step S102 may include, but is not limited to, steps S201 to S203:
step S201, semantic feature extraction is carried out on target speaking audio to obtain target audio semantic representation;
step S202, performing format conversion on the target audio semantic representation to obtain an initial text sentence;
step S203, performing text correction on the initial text sentence based on a preset grammar rule to obtain a target text sentence.
In step S201 of some embodiments, semantic feature extraction may be performed on the target speech audio based on a speech recognition technology (for example, ASR technology), so as to obtain speech content information of the target speech audio, and obtain a semantic representation of the target audio.
In step S202 of some embodiments, the target audio semantic representation is converted from a speech vector form to a text form, resulting in an initial text sentence.
In step S203 of some embodiments, text correction is performed on the initial text sentence based on a preset grammar rule, where the text correction includes correction of wrongly written characters and text typesetting, and the preset grammar rule may be, for example, a commonly used subject-verb-object sentence structure or a part-of-speech specification for words. In this way, the textual normalization and semantic rationality of the initial text sentence can be improved, sentences with semantic errors or disordered semantic logic are avoided, and the sentence quality of the generated target text sentence is improved.
Through the steps S201 to S203, the target speaking audio can be converted into a plurality of target text sentences more conveniently, so that emotion recognition can be performed based on the target text sentences, extraction and recognition processing of emotion semantic feature information of the target speaking audio by the model are facilitated, and the accuracy of emotion recognition can be improved.
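By way of illustration only, the content-extraction stage of steps S201 to S203 could be sketched as follows. The Hugging Face ASR pipeline and the Whisper model name stand in for the unspecified speech recognizer, and the correction table is a hypothetical placeholder for the preset grammar rules.

```python
# Minimal sketch of steps S201-S203: speech -> initial text sentences -> corrected
# target text sentences. The ASR backend and the correction rules are assumptions.
import re
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def extract_target_text_sentences(audio_path: str) -> list[str]:
    transcript = asr(audio_path)["text"]          # S201/S202: semantic extraction + text form
    sentences = [s.strip() for s in re.split(r"[。！？.!?]", transcript) if s.strip()]
    corrections = {"teh": "the", "alot": "a lot"}  # hypothetical correction table
    fixed = []
    for s in sentences:                            # S203: rule-based text correction
        for wrong, right in corrections.items():
            s = s.replace(wrong, right)
        fixed.append(s)
    return fixed
```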
Prior to step S103 of some embodiments, the emotion recognition method further includes pre-training an emotion recognition model. The emotion recognition model may be constructed based on a deep convolutional neural network model or a recurrent neural network model, and it includes a pre-training layer and a prediction layer; for example, the emotion recognition model may be a 4-class model consisting of a RoBERTa network (i.e., the pre-training layer) and a fully connected layer (i.e., the prediction layer). The pre-training layer is mainly used for encoding the input text and extracting the sentence emotion representation information in the input text, and the prediction layer is mainly used for performing emotion recognition according to the extracted sentence emotion representation information and judging the emotion category contained in each text sentence of the input text. The emotion categories in the embodiment of the present application are divided into four types, namely a first emotion category, a second emotion category, a third emotion category, and a fourth emotion category. The first emotion category indicates that no non-positive emotion representation exists in the text sentence; the second emotion category indicates that the text sentence contains an emotion representation with a certain non-positive tendency, for example, a slight complaint word; the third emotion category indicates that the text sentence contains emotion words with an obvious non-positive tendency; the fourth emotion category indicates that the text sentence contains more extreme emotion words, for example, words such as complaints. Compared with the approach in the related art of judging the emotional state of a user based on the subjective experience of a worker, using the emotion recognition model for emotion recognition can effectively reduce the possibility of misjudging the emotion, and can improve the accuracy and computational efficiency of emotion recognition.
In step S103 of some embodiments, the pre-training layer of the emotion recognition model may include an encoder and a downsampling unit, where the encoding process is performed on the target text sentence based on the encoder, sentence emotion characterization information in the target text sentence is extracted to obtain initial sentence encoding features, and then feature sampling is performed on the initial sentence encoding features based on the downsampling unit to obtain target sentence encoding features.
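As a non-limiting sketch of the model structure described above (a RoBERTa pre-training layer followed by a fully connected prediction layer over four emotion categories), the following is one possible implementation; the specific pretrained checkpoint name and the [CLS]-style pooling used as the down-sampling step are assumptions.

```python
# Sketch of the emotion recognition model: RoBERTa encoder + linear 4-class head.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EmotionRecognizer(nn.Module):
    def __init__(self, pretrained="hfl/chinese-roberta-wwm-ext", num_classes=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained)          # pre-training layer
        self.classifier = nn.Linear(self.encoder.config.hidden_size,  # prediction layer
                                    num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        sentence_feature = hidden[:, 0]           # pooled target sentence coding feature
        return self.classifier(sentence_feature)  # logits over the 4 candidate emotion labels

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = EmotionRecognizer()
batch = tokenizer(["这个服务太让人失望了"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])  # shape [1, 4]
```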
Referring to fig. 3, in some embodiments, step S104 may include, but is not limited to, steps S301 to S302:
step S301, carrying out emotion scoring on the coding features of the target sentences based on an emotion classifier and a candidate emotion label of the emotion recognition model to obtain sentence emotion scores;
step S302, screening the candidate emotion labels based on the emotion scores of the sentences to obtain initial emotion categories of the target text sentences.
In step S301 of some embodiments, the emotion classifier may be a softmax classifier or the like, without limitation. Taking a softmax classifier as an example, a probability distribution of the target sentence coding features over each candidate emotion label is created based on the softmax classifier, so that emotion scoring of the target sentence coding features is realized, and the probability of each candidate emotion label is used as the sentence emotion score of the target text sentence on that candidate emotion label.
In step S302 of some embodiments, the size of the sentence emotion score directly reflects the likelihood that the target text sentence belongs to each candidate emotion label; that is, the larger the sentence emotion score, the more the emotion features in the target text sentence are biased toward the candidate emotion label corresponding to that score. Therefore, the largest sentence emotion score can be selected from the sentence emotion scores as the target score, and the candidate emotion label corresponding to the target score is taken as the initial emotion category of the target text sentence, where the initial emotion category represents the emotion category expressed by the target text sentence.
Because at least one target text sentence exists in the target speaking audio and there are certain differences in the emotion categories represented by different target text sentences, performing emotion recognition on the speaking audio within a period of time makes it possible to clearly determine, according to the generation order of the target text sentences, whether the emotional state of the target object changes and what trend that change follows.
It should be noted that, in the embodiment of the present application, the candidate emotion labels may be divided into four types, namely a first emotion category, a second emotion category, a third emotion category, and a fourth emotion category. The first emotion category indicates that no non-positive emotion representation exists in the text sentence; the second emotion category indicates that the text sentence contains an emotion representation with a certain non-positive tendency, for example, a slight complaint word; the third emotion category indicates that the text sentence contains emotion words with an obvious non-positive tendency, for example, uncivil expressions appear in the text sentence; the fourth emotion category indicates that the text sentence contains more extreme emotion words, such as words about complaints or public exposure.
Through the steps S301 to S302, the emotion state condition of the target object can be clearly determined according to the target speaking audio of the target object, the probability distribution of the target text sentence on each candidate emotion label can be predicted based on the target text sentence corresponding to the target speaking audio, and the recognition accuracy of the emotion state of the target object can be better improved.
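A minimal sketch of steps S301 to S302 is given below, assuming the logits produced by a 4-class model such as the one sketched earlier; the label names are illustrative placeholders for the four candidate emotion labels.

```python
# Softmax scoring over candidate emotion labels and selection of the initial emotion category.
import torch
import torch.nn.functional as F

CANDIDATE_LABELS = ["first_emotion", "second_emotion", "third_emotion", "fourth_emotion"]

def initial_emotion_category(logits: torch.Tensor) -> tuple[str, float]:
    scores = F.softmax(logits, dim=-1)   # sentence emotion scores (probability distribution)
    best = int(scores.argmax(dim=-1))    # index of the largest sentence emotion score
    return CANDIDATE_LABELS[best], float(scores[0, best])
```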
Referring to fig. 4, in some embodiments, step S105 may include, but is not limited to, steps S401 to S402:
step S401, word segmentation processing is carried out on the target text sentence, and a plurality of text word segments are obtained;
step S402, comparing the text word segment with the reference emotion feature words in the preset emotion word list to obtain a recognition result, wherein the recognition result is used for representing that the text word segment has the reference emotion feature words or the text word segment does not have the reference emotion feature words.
In step S401 of some embodiments, a Jieba tokenizer may be used to segment the target text sentence; the target text sentence is divided into a plurality of segments according to conventional grammar rules and word parts of speech, so as to obtain a plurality of text word segments, where a text word segment may include a character, a word, and so on.
In step S402 of some embodiments, the recognition result may be obtained by comparing the text word segments with the reference emotion feature words in the preset emotion word list in various ways, such as a cosine similarity algorithm, Euclidean distance, or a field value, where the recognition result indicates either that a reference emotion feature word exists in the text word segments or that no reference emotion feature word exists in the text word segments. Taking the cosine similarity algorithm as an example, the text word segment and the reference emotion feature word are vectorized and their cosine similarity is calculated. If the cosine similarity between a reference emotion feature word and the text word segment is higher than a preset threshold, the reference emotion feature word is considered to match the text word segment, that is, the reference emotion feature word exists in the text word segment; if the cosine similarity between every reference emotion feature word and the text word segment is smaller than or equal to the preset threshold, no reference emotion feature word in the preset emotion word list can match the text word segment, that is, no reference emotion feature word exists in the text word segment.
Through the steps S401 to S402, whether each target text sentence contains the emotion feature words representing the emotion can be clearly identified, and the preset emotion word list is introduced to perform emotion identification, so that the emotion identification process is more diversified, and the emotion identification precision can be improved.
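For illustration, steps S401 to S402 could be sketched as follows with a Jieba tokenizer and exact matching against a small, hypothetical emotion word list; the similarity-based matching described above would replace the simple membership test.

```python
# Word segmentation plus emotion-word lookup against a preset emotion word list.
import jieba

EMOTION_VOCAB = {                      # hypothetical preset emotion word list
    "投诉": "second_vocab_emotion",    # more extreme tendency, e.g. complaint-type words
    "失望": "first_vocab_emotion",     # obvious non-positive tendency
}

def recognize_emotion_words(sentence: str) -> dict:
    segments = jieba.lcut(sentence)                      # text word segments
    hits = [w for w in segments if w in EMOTION_VOCAB]   # matched reference emotion feature words
    return {"has_emotion_word": bool(hits), "target_emotion_words": hits}
```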
Referring to fig. 5, in some embodiments, step S106 may include, but is not limited to, steps S501 to S503:
step S501, if the recognition result indicates that the text word segment has the reference emotion feature word, selecting the reference emotion feature word identical to the text word segment as a target emotion feature word;
step S502, inquiring vocabulary emotion categories corresponding to target emotion feature words;
step S503, obtaining a target emotion type according to a preset priority order, vocabulary emotion type and initial emotion type.
In step S501 of some embodiments, if the recognition result indicates that a reference emotion feature word exists in the text word segments, the vocabulary emotion category of the target text sentence needs to be further determined according to the recognized reference emotion feature word. Therefore, the reference emotion feature word identical to the text word segment needs to be extracted from the preset emotion word list and taken as the target emotion feature word.
In step S502 of some embodiments, since the preset emotion vocabulary includes vocabulary emotion categories corresponding to each of the plurality of reference emotion feature words, the vocabulary emotion categories corresponding to the target emotion feature words may be directly queried from the preset emotion vocabulary. It should be noted that the preset emotion vocabulary may be set based on experience of related personnel, or may be obtained in other manners, which is not limited.
It should be noted that, in order to improve the accuracy of emotion recognition, the vocabulary emotion categories in the preset emotion vocabulary also include two types, namely a first vocabulary emotion category and a second vocabulary emotion category. The first vocabulary emotion category indicates that the reference emotion feature word contains an obvious non-positive emotion expression, for example, an uncivil expression; the reference emotion feature words corresponding to the second vocabulary emotion category contain more extreme emotion tendencies, for example, words about complaints or public exposure.
In step S503 of some embodiments, when a vocabulary emotion category and the initial emotion category exist at the same time, there may be a difference in the emotion levels they characterize, and therefore the vocabulary emotion category needs to be compared with the initial emotion category. When the vocabulary emotion category and the initial emotion category are the same, either one is taken as the target emotion category; if they are different, whichever of the vocabulary emotion category and the initial emotion category has the higher priority is selected as the target emotion category according to the priority order.
It should be noted that, when the recognition result indicates that reference emotion feature words exist in the text word segments, a plurality of reference emotion feature words may be extracted as target emotion feature words; that is, several text word segments in a given target text sentence may each have a corresponding reference emotion feature word. For example, if a text word segment A and a text word segment B both exist in the target text sentence and both have corresponding reference emotion feature words, two reference emotion feature words are extracted as target emotion feature words. In that case, the vocabulary emotion categories of the two target emotion feature words are queried separately, their emotion levels are compared, and the vocabulary emotion category with the higher emotion level is compared with the initial emotion category to obtain the target emotion category.
In addition, if the recognition result indicates that the text word segment does not have the reference emotion feature word, the initial emotion type of the target text sentence is directly used as the target emotion type.
Through the steps S501 to S503, the two forms of the emotion recognition model and the preset emotion word list can be combined to comprehensively determine the target emotion type of the target text sentence in the target speaking audio, and compared with the emotion recognition in a single mode, the emotion recognition model and the preset emotion word list can be adopted to more objectively and accurately evaluate the emotion state of each target text sentence representation, so that the accuracy of emotion recognition can be effectively improved.
Referring to fig. 6, in some embodiments, step S503 includes, but is not limited to, steps S601 to S603:
step S601, comparing the vocabulary emotion type and the initial emotion type;
step S602, if the vocabulary emotion type and the initial emotion type are the same, the vocabulary emotion type or the initial emotion type is used as a target emotion type;
step S603, if the vocabulary emotion type and the initial emotion type are different, selecting the higher priority of the vocabulary emotion type and the initial emotion type as the target emotion type according to the priority order.
In step S601 of some embodiments, when the vocabulary emotion category and the initial emotion category exist at the same time, there may be a difference in the emotion levels characterized by the vocabulary emotion category and the initial emotion category, and thus, it is necessary to compare the vocabulary emotion category and the initial emotion category.
In step S602 of some embodiments, when the vocabulary emotion type and the initial emotion type are the same, it is indicated that the emotion intensities of the target objects represented by the vocabulary emotion type and the initial emotion type are at the same level, i.e. the vocabulary emotion type or the initial emotion type is regarded as the target emotion type.
In step S603 of some embodiments, when the vocabulary emotion type and the initial emotion type are different, the emotion levels of the vocabulary emotion type and the initial emotion type need to be compared, and a higher priority of the vocabulary emotion type and the initial emotion type is selected as the target emotion type according to the priority order.
It should be noted that, in the embodiment of the present application, the priority order is formulated according to the emotion levels. Among all the candidate emotion labels, the priority of the first emotion category is lower than that of the second emotion category, the priority of the second emotion category is lower than that of the third emotion category, and the priority of the third emotion category is lower than that of the fourth emotion category. Among the vocabulary emotion categories, the first vocabulary emotion category has a lower priority than the second vocabulary emotion category. When candidate emotion labels and vocabulary emotion categories are compared, the priorities of the first vocabulary emotion category and the second vocabulary emotion category are higher than those of the first emotion category and the second emotion category, the priority of the first vocabulary emotion category is the same as that of the third emotion category, and the priority of the second vocabulary emotion category is the same as that of the fourth emotion category.
Through the steps S601 to S603, the two forms of the emotion recognition model and the preset emotion word list can be combined to comprehensively determine the target emotion type of the target text sentence in the target speaking audio, so that the accuracy of emotion recognition can be effectively improved.
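As a worked sketch of steps S601 to S603, the priority order described above can be encoded numerically as follows; the numeric levels and label names are an assumed encoding of that order, not part of the patent.

```python
# Merge the vocabulary emotion category with the initial emotion category by priority.
from typing import Optional

PRIORITY = {
    "first_emotion": 1, "second_emotion": 2, "third_emotion": 3, "fourth_emotion": 4,
    "first_vocab_emotion": 3,   # same priority as the third emotion category
    "second_vocab_emotion": 4,  # same priority as the fourth emotion category
}

def target_emotion_category(initial_category: str, vocab_category: Optional[str]) -> str:
    if vocab_category is None:                                  # no reference emotion word found
        return initial_category
    if PRIORITY[vocab_category] == PRIORITY[initial_category]:  # same emotion level
        return initial_category
    # Different emotion levels: keep whichever category has the higher priority.
    return max(initial_category, vocab_category, key=PRIORITY.get)

print(target_emotion_category("second_emotion", "second_vocab_emotion"))  # -> second_vocab_emotion
```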
In step S107 of some embodiments, since there are also certain differences among the emotion categories represented by different target text sentences, performing emotion recognition on the speaking audio within a period of time makes it possible to clearly determine, according to the generation order of the target text sentences, whether the emotional state of the target object changes and what trend that change follows. Therefore, according to the target emotion category of each target text sentence, an emotion state curve can be drawn for the corresponding speaking audio to obtain the emotion state data of the target object.
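A small sketch of the curve-drawing idea in step S107: the per-sentence target emotion categories, taken in generation order, are mapped to numeric levels to form the emotion state data; the level encoding is assumed.

```python
# Emotion state data as a simple per-sentence curve over the speaking audio.
LEVEL = {"first_emotion": 0, "second_emotion": 1, "third_emotion": 2, "fourth_emotion": 3,
         "first_vocab_emotion": 2, "second_vocab_emotion": 3}

def emotion_state_data(sentence_categories: list) -> list:
    # One point per target text sentence, in the order the sentences were spoken.
    return [LEVEL[c] for c in sentence_categories]

curve = emotion_state_data(["first_emotion", "second_emotion", "second_vocab_emotion"])
# e.g. [0, 1, 3]: the target object's emotional state is trending toward a more extreme level
```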
Referring to fig. 7, in some embodiments, after step S107, the emotion recognition method may further include, but is not limited to, steps S701 to S704:
step S701, obtaining object basic information of a target object;
step S702, constructing a target object portrait based on the object basic information and the emotion state data to obtain an object personality portrait;
step S703, screening target dialogue strategy information from preset candidate dialogue strategy information based on the object character portraits;
step S704, pushing the target dialogue strategy information to a feedback object, so that the feedback object performs a dialogue with the target object according to the target dialogue strategy information.
In step S701 of some embodiments, the object basic information of the target object may be directly called from a preset database, where the object basic information includes data of name, age, sex, occupation, and the like of the target object.
In step S702 of some embodiments, a preset portrait construction tool may be invoked to construct a portrait of the target object from the object basic information and the emotion state data, resulting in an object personality portrait.
In step S703 of some embodiments, the object personality portrait and the candidate dialogue policy information may be matched, and the candidate dialogue policy information conforming to the object personality portrait may be selected as the target dialogue policy information.
In step S704 of some embodiments, the target dialogue strategy information is pushed to the feedback object in the form of an email or a network platform message, where the feedback object may be a business person in a specific business field, so that the feedback object can carry out a dialogue with the target object according to the target dialogue strategy information; that is, the feedback object can communicate with the target object using the question-and-answer scripts, dialogue rhythm, dialogue topics, and the like in the target dialogue strategy information.
Through the steps S701 to S704, a character image of the target object can be constructed according to the emotion state characteristics of the target object and the basic information thereof, dialogue strategy information corresponding to the character of the target object is selected in a targeted manner based on the character image, and dialogue is performed according to the selected dialogue strategy information, so that the dialogue quality and the accuracy of dialogue questions and answers can be improved, and the dialogue experience of the target object is improved.
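Purely as an illustration of steps S701 to S704, a simple portrait-and-matching scheme could look like the sketch below; the portrait fields, strategy tags, and scoring rule are all assumptions made for the example.

```python
# Build a simple object personality portrait and pick the best-matching dialogue strategy.
def build_portrait(basic_info: dict, emotion_curve: list) -> dict:
    portrait = dict(basic_info)
    portrait["emotion_level"] = max(emotion_curve) if emotion_curve else 0
    return portrait

def select_strategy(portrait: dict, candidates: list) -> dict:
    # Score each candidate dialogue strategy by how many of its tags the portrait satisfies.
    def score(strategy: dict) -> int:
        return sum(1 for k, v in strategy["tags"].items() if portrait.get(k) == v)
    return max(candidates, key=score)

candidates = [
    {"name": "soothing_script", "tags": {"emotion_level": 3}},
    {"name": "standard_script", "tags": {"emotion_level": 0}},
]
strategy = select_strategy(build_portrait({"age": 35}, [0, 1, 3]), candidates)
# The selected strategy would then be pushed to the feedback object, e.g. by mail or platform message.
```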
According to the emotion recognition method, target dialogue data are obtained; and extracting the content of the target speaking audio to obtain a target text sentence corresponding to the target speaking audio, and converting the target speaking audio into a target text sentence form. Further, coding the target text sentence based on a preset emotion recognition model to obtain a target sentence coding feature; carrying out emotion recognition on the coding features of the target sentences based on the emotion recognition model to obtain initial emotion categories of the target text sentences; carrying out emotion feature word recognition on the target text sentence based on a preset emotion word list to obtain a recognition result; according to the identification result and the initial emotion category, the target emotion category corresponding to the target text sentence is obtained, and finally, the emotion state data of the target object is obtained based on the target emotion category. Further, the method and the device can construct the character image of the target object according to the emotion state characteristics of the target object and the basic information thereof, select dialogue strategy information corresponding to the character of the target object based on the character image in a targeted manner, and conduct dialogue according to the selected dialogue strategy information, so that dialogue quality and accuracy of dialogue questions and answers can be improved.
Referring to fig. 8, an embodiment of the present application further provides an emotion recognition device, which may implement the emotion recognition method, where the device includes:
a data acquisition module 801, configured to acquire target dialogue data, where the target dialogue data includes target speech audio of a target object;
the content extraction module 802 is configured to perform content extraction on the target speech audio to obtain a target text sentence corresponding to the target speech audio;
the encoding module 803 is configured to encode the target text sentence based on a preset emotion recognition model, so as to obtain a target sentence encoding feature;
the emotion recognition module 804 is configured to perform emotion recognition on the target sentence coding feature based on the emotion recognition model, so as to obtain an initial emotion category of the target text sentence;
the feature word recognition module 805 is configured to perform emotion feature word recognition on the target text sentence based on a preset emotion word list, so as to obtain a recognition result;
the emotion type determining module 806 is configured to obtain a target emotion type corresponding to the target text sentence according to the recognition result and the initial emotion type;
the emotional state data generating module 807 is configured to obtain emotional state data of the target object based on the target emotion type.
The specific implementation of the emotion recognition device is basically the same as the specific embodiment of the emotion recognition method, and will not be described herein.
The embodiment of the application also provides electronic equipment, which comprises: the emotion recognition system comprises a memory, a processor, a program stored on the memory and capable of running on the processor, and a data bus for realizing connection communication between the processor and the memory, wherein the program is executed by the processor to realize the emotion recognition method. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used for executing related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902 may be implemented in the form of read-only memory (Read Only Memory, ROM), static storage, dynamic storage, or random access memory (Random Access Memory, RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 902, and the processor 901 invokes and executes the emotion recognition method of the embodiments of the present application;
An input/output interface 903 for inputting and outputting information;
the communication interface 904 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);
a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the emotion recognition method.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments of the present application provide an emotion recognition method, an emotion recognition device, an electronic device, and a computer readable storage medium. Target dialogue data is acquired, and content extraction is performed on the target speaking audio to obtain a target text sentence corresponding to the target speaking audio, that is, the target speaking audio is converted into the form of a target text sentence. Further, the target text sentence is encoded based on a preset emotion recognition model to obtain a target sentence coding feature; emotion recognition is performed on the target sentence coding feature based on the emotion recognition model to obtain an initial emotion category of the target text sentence; emotion feature word recognition is performed on the target text sentence based on a preset emotion word list to obtain a recognition result; the target emotion category corresponding to the target text sentence is obtained according to the recognition result and the initial emotion category; and finally, the emotion state data of the target object is obtained based on the target emotion category. Further, the method and the device can construct a personality portrait of the target object according to the emotion state data of the target object and its basic information, select, in a targeted manner, dialogue strategy information matching the personality of the target object based on the portrait, and conduct the dialogue according to the selected strategy, so that dialogue quality and the accuracy of dialogue questions and answers can be improved.
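By way of non-limiting illustration only, the following Python sketch wires the stages summarized above into one flow. The interfaces `asr_model.transcribe`, `emotion_model.encode` and `emotion_model.classify`, as well as the example emotion word list and priority order, are hypothetical placeholders introduced for this sketch and are not part of the disclosed implementation.

```python
from dataclasses import dataclass

# Hypothetical preset emotion word list: reference emotion feature word -> vocabulary emotion category.
EMOTION_WORD_LIST = {"furious": "anger", "annoyed": "anger", "glad": "happiness", "heartbroken": "sadness"}
# Hypothetical preset priority order: earlier categories take precedence when results disagree.
PRIORITY_ORDER = ["anger", "sadness", "happiness", "neutral"]

@dataclass
class EmotionResult:
    text_sentence: str
    initial_category: str
    target_category: str

def priority(category: str) -> int:
    # Categories not in the preset order fall back to the lowest priority.
    return PRIORITY_ORDER.index(category) if category in PRIORITY_ORDER else len(PRIORITY_ORDER)

def recognize_emotion(target_speaking_audio, asr_model, emotion_model) -> EmotionResult:
    # 1. Content extraction: target speaking audio -> target text sentence.
    text_sentence = asr_model.transcribe(target_speaking_audio)
    # 2. Encode the sentence, then 3. classify the encoding to obtain the initial emotion category.
    sentence_encoding = emotion_model.encode(text_sentence)
    initial_category = emotion_model.classify(sentence_encoding)
    # 4. Emotion feature word recognition against the preset emotion word list.
    word_segments = text_sentence.lower().split()
    matched = [w for w in word_segments if w in EMOTION_WORD_LIST]
    # 5. Combine the word-list result with the model prediction according to the priority order.
    if matched:
        vocabulary_category = EMOTION_WORD_LIST[matched[0]]
        target_category = min(vocabulary_category, initial_category, key=priority)
    else:
        target_category = initial_category
    return EmotionResult(text_sentence, initial_category, target_category)
```

Passing the speech and emotion models in as parameters keeps the sketch independent of any particular model architecture; individual steps are illustrated in more detail after the corresponding claims below.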
The embodiments described herein are intended to describe the technical solutions of the embodiments of the present application more clearly and do not constitute a limitation on the technical solutions provided by the embodiments of the present application. As those skilled in the art will appreciate, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 do not limit the embodiments of the present application, and may include more or fewer steps than shown, combine certain steps, or include different steps.
The apparatus embodiments described above are merely illustrative, and the units illustrated as separate components may or may not be physically separate; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A is present, only B is present, and both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and the like means any combination of these items, including any combination of a single item or a plurality of items. For example, "at least one of a, b or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the related art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of emotion recognition, the method comprising:
obtaining target dialogue data, wherein the target dialogue data comprises target speaking audio of a target object;
extracting the content of the target speaking audio to obtain a target text sentence corresponding to the target speaking audio;
coding the target text sentence based on a preset emotion recognition model to obtain a target sentence coding feature;
carrying out emotion recognition on the target sentence coding features based on the emotion recognition model to obtain an initial emotion category of the target text sentence;
carrying out emotion feature word recognition on the target text sentence based on a preset emotion word list to obtain a recognition result;
obtaining a target emotion category corresponding to the target text sentence according to the recognition result and the initial emotion category;
and obtaining emotion state data of the target object based on the target emotion category.
2. The emotion recognition method of claim 1, wherein the extracting the content of the target speaking audio to obtain a target text sentence corresponding to the target speaking audio comprises:
extracting semantic features of the target speaking audio to obtain a target audio semantic representation;
performing format conversion on the target audio semantic representation to obtain an initial text sentence;
and carrying out text correction on the initial text sentence based on a preset grammar rule to obtain the target text sentence.
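By way of non-limiting illustration of the step recited in claim 2, a minimal Python sketch is given below; the `acoustic_encoder`, `decoder` and `GRAMMAR_RULES` names are hypothetical stand-ins, since the claim does not prescribe any particular speech model or rule set.

```python
import re

# Hypothetical grammar rules: (pattern, replacement) pairs applied to the raw transcript.
GRAMMAR_RULES = [
    (re.compile(r"\s+"), " "),    # collapse repeated whitespace
    (re.compile(r"\bi\b"), "I"),  # capitalise the pronoun "I"
]

def extract_text_sentence(target_speaking_audio, acoustic_encoder, decoder) -> str:
    # Semantic feature extraction: audio -> target audio semantic representation.
    semantic_representation = acoustic_encoder.extract(target_speaking_audio)
    # Format conversion: semantic representation -> initial text sentence.
    initial_sentence = decoder.to_text(semantic_representation)
    # Text correction based on the preset grammar rules -> target text sentence.
    corrected = initial_sentence.strip()
    for pattern, replacement in GRAMMAR_RULES:
        corrected = pattern.sub(replacement, corrected)
    return corrected
```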
3. The emotion recognition method of claim 1, wherein the carrying out emotion recognition on the target sentence coding features based on the emotion recognition model to obtain an initial emotion category of the target text sentence comprises:
carrying out emotion scoring on the target sentence coding features based on the emotion classifier and the candidate emotion labels of the emotion recognition model to obtain sentence emotion scores;
and screening the candidate emotion labels based on the sentence emotion scores to obtain the initial emotion category of the target text sentence.
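A minimal sketch of the scoring and screening recited in claim 3, assuming a simple linear classifier with a softmax over candidate emotion labels; the label set and weight layout are hypothetical examples rather than the claimed classifier.

```python
import math

CANDIDATE_EMOTION_LABELS = ["happiness", "sadness", "anger", "neutral"]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def initial_emotion_category(sentence_encoding, classifier_weights, classifier_bias):
    # Emotion scoring: one logit per candidate emotion label.
    logits = [
        sum(w * x for w, x in zip(classifier_weights[label], sentence_encoding)) + classifier_bias[label]
        for label in CANDIDATE_EMOTION_LABELS
    ]
    scores = softmax(logits)  # sentence emotion scores
    # Screening: keep the candidate label with the highest score as the initial category.
    best = max(range(len(CANDIDATE_EMOTION_LABELS)), key=scores.__getitem__)
    return CANDIDATE_EMOTION_LABELS[best], scores

# Example (hypothetical 3-dimensional encoding and weights):
# weights = {label: [0.1, 0.2, 0.3] for label in CANDIDATE_EMOTION_LABELS}
# bias = {label: 0.0 for label in CANDIDATE_EMOTION_LABELS}
# initial_emotion_category([0.5, -0.2, 0.1], weights, bias)
```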
4. The emotion recognition method of claim 1, wherein after the obtaining of the emotion state data of the target object based on the target emotion category, the emotion recognition method further comprises:
acquiring object basic information of the target object;
constructing a portrait of the target object based on the object basic information and the emotion state data to obtain an object personality portrait;
screening target dialogue strategy information from preset candidate dialogue strategy information based on the object personality portrait;
pushing the target dialogue strategy information to a feedback object so that the feedback object can carry out dialogue with the target object according to the target dialogue strategy information.
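A minimal sketch of the portrait construction and strategy screening recited in claim 4, assuming the emotion state data is a count of observed emotion categories and the candidate strategies are keyed by the dominant emotion; all field names and strategies are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PersonalityPortrait:
    age_group: str
    dominant_emotion: str

# Hypothetical preset candidate dialogue strategy information, keyed by dominant emotion.
CANDIDATE_STRATEGIES = {
    "anger": "acknowledge the complaint first, keep a calm tone, offer remediation",
    "sadness": "use empathetic wording and slow the pace of questions",
    "neutral": "follow the standard question-and-answer script",
}

def build_portrait(basic_info: dict, emotion_state_data: dict) -> PersonalityPortrait:
    # Dominant emotion = the category observed most often in the emotion state data.
    dominant = max(emotion_state_data, key=emotion_state_data.get) if emotion_state_data else "neutral"
    return PersonalityPortrait(age_group=basic_info.get("age_group", "unknown"), dominant_emotion=dominant)

def screen_dialogue_strategy(portrait: PersonalityPortrait) -> str:
    # Screening: pick the candidate strategy matching the portrait's dominant emotion.
    return CANDIDATE_STRATEGIES.get(portrait.dominant_emotion, CANDIDATE_STRATEGIES["neutral"])
```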
5. The emotion recognition method according to any one of claims 1 to 4, wherein the carrying out emotion feature word recognition on the target text sentence based on a preset emotion word list to obtain a recognition result comprises:
performing word segmentation processing on the target text sentence to obtain a plurality of text word segments;
and comparing the text word segment with the reference emotion feature words in the preset emotion word list to obtain the recognition result, wherein the recognition result is used for representing that the reference emotion feature words exist in the text word segment or the reference emotion feature words do not exist in the text word segment.
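A minimal sketch of the word segmentation and word-list comparison recited in claim 5, assuming whitespace tokenisation and a small illustrative word list; a production system would typically use a dedicated word segmenter (for example for Chinese text).

```python
# Hypothetical reference emotion feature words from the preset emotion word list.
REFERENCE_EMOTION_FEATURE_WORDS = {"angry", "furious", "glad", "delighted", "sad"}

def recognize_emotion_feature_words(target_text_sentence: str) -> dict:
    # Word segmentation processing -> a plurality of text word segments.
    word_segments = target_text_sentence.lower().split()
    # Compare the word segments with the reference emotion feature words.
    matches = [w for w in word_segments if w in REFERENCE_EMOTION_FEATURE_WORDS]
    # The recognition result indicates whether reference emotion feature words exist.
    return {"found": bool(matches), "matched_words": matches}
```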
6. The emotion recognition method of claim 5, wherein the obtaining the target emotion category corresponding to the target text sentence according to the recognition result and the initial emotion category includes:
if the recognition result indicates that the reference emotion feature word exists in the text word segments, selecting the reference emotion feature word that is the same as the text word segment as a target emotion feature word;
querying the vocabulary emotion category corresponding to the target emotion feature word;
and obtaining the target emotion category according to a preset priority order, the vocabulary emotion category and the initial emotion category.
7. The emotion recognition method of claim 6, wherein the obtaining the target emotion category according to a preset priority order, the vocabulary emotion category and the initial emotion category includes:
comparing the vocabulary emotion category with the initial emotion category;
if the vocabulary emotion category is the same as the initial emotion category, taking the vocabulary emotion category or the initial emotion category as the target emotion category;
and if the vocabulary emotion category is different from the initial emotion category, selecting, from the vocabulary emotion category and the initial emotion category, the category with the higher priority as the target emotion category according to the priority order.
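A minimal sketch of the category combination recited in claims 6 and 7, assuming a hypothetical preset priority order in which earlier categories take precedence; the claims do not fix any particular ranking.

```python
# Hypothetical preset priority order; earlier entries have higher priority.
PRESET_PRIORITY_ORDER = ["anger", "sadness", "happiness", "neutral"]

def merge_emotion_categories(vocabulary_category: str, initial_category: str) -> str:
    # Same category from both sources: either one serves as the target emotion category.
    if vocabulary_category == initial_category:
        return initial_category

    # Different categories: keep the one with the higher priority (earlier in the order).
    def priority(category: str) -> int:
        return PRESET_PRIORITY_ORDER.index(category) if category in PRESET_PRIORITY_ORDER else len(PRESET_PRIORITY_ORDER)

    return min(vocabulary_category, initial_category, key=priority)
```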
8. An emotion recognition device, the device comprising:
the data acquisition module is used for acquiring target dialogue data, wherein the target dialogue data comprises target speaking audio of a target object;
the content extraction module is used for extracting the content of the target speaking audio to obtain a target text sentence corresponding to the target speaking audio;
the coding module is used for coding the target text sentence based on a preset emotion recognition model to obtain target sentence coding characteristics;
the emotion recognition module is used for carrying out emotion recognition on the target sentence coding features based on the emotion recognition model to obtain an initial emotion category of the target text sentence;
the feature word recognition module is used for carrying out emotion feature word recognition on the target text sentence based on a preset emotion word list to obtain a recognition result;
the emotion category determination module is used for obtaining a target emotion category corresponding to the target text sentence according to the recognition result and the initial emotion category;
and the emotion state data generation module is used for obtaining the emotion state data of the target object based on the target emotion category.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the emotion recognition method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the emotion recognition method of any one of claims 1 to 7.
CN202310504904.7A 2023-05-06 2023-05-06 Emotion recognition method, emotion recognition device, electronic device, and storage medium Pending CN116467455A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310504904.7A CN116467455A (en) 2023-05-06 2023-05-06 Emotion recognition method, emotion recognition device, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310504904.7A CN116467455A (en) 2023-05-06 2023-05-06 Emotion recognition method, emotion recognition device, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN116467455A true CN116467455A (en) 2023-07-21

Family

ID=87173522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310504904.7A Pending CN116467455A (en) 2023-05-06 2023-05-06 Emotion recognition method, emotion recognition device, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN116467455A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination