WO2020233504A1 - Systems and methods for emotion recognition - Google Patents
Systems and methods for emotion recognition
- Publication number: WO2020233504A1 (PCT/CN2020/090435)
- Authority: WO — WIPO (PCT)
- Prior art keywords: user, emotion, real time, scene, acoustic
- Prior art date
Classifications
- G10L15/26—Speech recognition; speech to text systems
- G10L15/1822—Speech classification or search using natural language modelling; parsing for meaning understanding
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
- G10L25/21—The extracted parameters being power information
- G10L25/24—The extracted parameters being the cepstrum
- G10L25/30—Analysis technique using neural networks
- G10L25/63—Specially adapted for estimating an emotional state
- A63F13/424—Processing input control signals of video game devices by mapping the input signals into game commands, involving acoustic input signals, e.g. by using the results of pitch or rhythm extraction or voice recognition
- A63F13/822—Special adaptations for executing a specific game genre or game mode; strategy games; role-playing games
Definitions
- the present disclosure generally relates to emotion recognition, and specifically, to systems and methods for emotion recognition for voice control.
- In some scenarios, a client terminal (e.g., a game console or game machine, a mobile phone) may be configured with a voice pickup device (e.g., a microphone) to acquire voice signals of users.
- The voice signals of users may indicate emotions of the users playing in RPGs. It is desirable to provide systems and methods for emotion recognition with improved accuracy.
- a system for emotion recognition may include at least one storage medium storing a set of instructions and at least one processor configured to communicate with the at least one storage medium.
- the at least one processor may be directed to cause the system to obtain voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user.
- the at least one processor may be further directed to cause the system to optionally determine an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user.
- the at least one processor may be further directed to cause the system to optionally determine a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user.
- the at least one processor may be further directed to cause the system to determine a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
- the target real time emotion determination step may comprise the sub-step of using the content based emotion recognition model to perform a correction of the acoustic based real time emotion of the user to obtain a corrected real time emotion as the target real time emotion of the user.
- the correction of the real time emotion may comprise using the content based real time emotion of the user as the corrected real time emotion of the user.
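- As a minimal sketch of the correction described above (hypothetical names, not prescribed by the patent), the content based real time emotion, when available, simply replaces the acoustic based one:

```python
def correct_emotion(acoustic_emotion, content_emotion):
    """Correction sub-step: use the content based real time emotion as the
    corrected (target) emotion when it is available; otherwise keep the
    acoustic based real time emotion."""
    return content_emotion if content_emotion is not None else acoustic_emotion
```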
- the target real time emotion determination step may comprise the sub-step of determining the target real time emotion of the user by comparing the acoustic based real time emotion and the content based real time emotion of the user.
- the at least one processor may be further directed to cause the system to use the acoustic based emotion recognition model to determine a first confidence level for the acoustic based real time emotion.
- the at least one processor may be further directed to cause the system to use the content based emotion recognition model to determine a second confidence level for the content based real time emotion.
- the at least one processor may be further directed to cause the system to compare the first confidence level and the second confidence level to determine one of the acoustic based real time emotion and the content based real time emotion that corresponds to a higher confidence level as the target real time emotion.
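- A possible reading of the confidence comparison above, as a short Python sketch (illustrative only; the label/confidence pairs are assumed to come from the two recognition models):

```python
def pick_by_confidence(acoustic, content):
    """Select the target real time emotion as the candidate whose model
    reported the higher confidence level.

    Each argument is a (label, confidence) pair, e.g. ("angry", 0.82).
    """
    acoustic_label, first_confidence = acoustic
    content_label, second_confidence = content
    return acoustic_label if first_confidence >= second_confidence else content_label

# Example: the content based model is more confident, so its label is kept.
target = pick_by_confidence(("neutral", 0.55), ("angry", 0.80))  # -> "angry"
```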
- the at least one processor may be further directed to cause the system to obtain base acoustic characteristics of the user acquired before the scene of the user.
- the at least one processor may be further directed to cause the system to calibrate the acoustic characteristics of the user in the scene with the base acoustic characteristics of the user to obtain calibrated acoustic characteristics of the user in the scene.
- the at least one processor may be further directed to cause the system to use the acoustic based emotion recognition model to determine, based on the calibrated acoustic characteristics of the user in the scene, the acoustic based real time emotion of the user.
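- The patent does not fix a calibration formula; one plausible sketch, assuming the base acoustic characteristics are summarized by per-feature mean and standard deviation, is a per-user z-score normalization:

```python
import numpy as np

def calibrate(features: np.ndarray, base_mean: np.ndarray, base_std: np.ndarray) -> np.ndarray:
    """Normalize in-scene acoustic features (e.g., pitch, energy, speaking
    rate) against baseline statistics of the same user acquired before the
    scene, so the acoustic based model sees deviations from that baseline."""
    return (features - base_mean) / np.maximum(base_std, 1e-8)
```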
- the content based real time emotion determination step may comprise the sub-step of using a speech recognition model to convert the audio data of the user in the scene into a text content.
- the content based real time emotion determination step may also comprise the sub-step of using the content based emotion recognition model to determine, based on the text content, the content based real time emotion of the user.
- the at least one processor may be further directed to cause the system to obtain a plurality of groups of universal audio data of one or more subjects communicating in one or more circumstances.
- the at least one processor may be further directed to cause the system to determine a universal speech recognition model by training a machine learning model using the plurality of groups of universal audio data.
- the at least one processor may be further directed to cause the system to obtain a plurality of groups of special audio data of one or more subjects associated with the scene.
- the at least one processor may be further directed to cause the system to use the plurality of groups of special audio data to train the universal speech recognition model to determine the speech recognition model.
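- The two-stage training of the speech recognition model (universal audio data first, then scene-specific audio data) can be sketched as pretraining followed by fine-tuning. The tiny PyTorch model and synthetic data below are placeholders, not the patent's actual architecture:

```python
import torch
from torch import nn

class TinyAcousticModel(nn.Module):
    """Deliberately small stand-in mapping per-frame features to token logits."""
    def __init__(self, n_features: int = 13, n_tokens: int = 30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_tokens)
        )

    def forward(self, x):
        return self.net(x)

def fake_dataset(n_batches=4, batch=8, n_features=13, n_tokens=30):
    # Synthetic placeholder for the "groups of audio data" described in the text.
    return [(torch.randn(batch, n_features), torch.randint(0, n_tokens, (batch,)))
            for _ in range(n_batches)]

def fit(model, dataset, lr, epochs=3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, tokens in dataset:
            opt.zero_grad()
            loss = loss_fn(model(features), tokens)
            loss.backward()
            opt.step()
    return model

# Stage 1: universal speech recognition model trained on general-purpose audio data.
model = fit(TinyAcousticModel(), fake_dataset(), lr=1e-3)
# Stage 2: continue training on scene-specific (e.g., RPG) audio data to obtain
# the special speech recognition model; a smaller learning rate preserves stage 1.
model = fit(model, fake_dataset(), lr=1e-4)
```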
- the at least one processor may be further directed to cause the system to obtain a plurality of groups of acoustic characteristics associated with the scene of users.
- the at least one processor may be further directed to cause the system to use the plurality of groups of acoustic characteristics to train a first machine learning model to determine the acoustic based emotion recognition model.
- the first machine learning model may include a support vector machine.
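- A hedged illustration of training the acoustic based emotion recognition model with a support vector machine, using scikit-learn (the feature layout and labels below are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

# One row of acoustic characteristics (e.g., pitch, energy, MFCC statistics)
# per labeled utterance; random values stand in for real training data.
X = np.random.rand(200, 20)
y = np.random.choice(["happy", "angry", "neutral"], size=200)

svm_model = SVC(kernel="rbf", probability=True)  # the "first machine learning model"
svm_model.fit(X, y)

# At run time the model returns a probability per predetermined emotion,
# which can also serve as the confidence level mentioned earlier.
probabilities = svm_model.predict_proba(np.random.rand(1, 20))
```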
- the at least one processor may be further directed to cause the system to obtain a plurality of groups of audio data associated with the scene of users.
- the at least one processor may be further directed to cause the system to convert each group of the audio data into a text content.
- the at least one processor may be further directed to cause the system to use the text content to train a second machine learning model to determine the content based emotion recognition model.
- the second machine learning model may include a text classifier.
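- Likewise, a simple text classifier can stand in for the content based emotion recognition model; the TF-IDF plus logistic regression pipeline below is one common choice, not mandated by the patent:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder corpus: text contents converted from labeled audio groups.
texts = ["we won the battle", "why did you do that", "open the door please"]
labels = ["happy", "angry", "neutral"]

content_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
content_model.fit(texts, labels)

print(content_model.predict(["this is so frustrating"]))
```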
- the voice signals of the user are acquired when the user plays an RPG.
- the at least one processor performs additional operations including adjusting, based on the target real time emotion of the user in the scene, a plot of the RPG subsequent to the scene.
- the user may have a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG, and to adjust, based on the target real time emotion of the user, a plot of the RPG, the at least one processor may be further directed to cause the system to determine, based on the target real time emotion of the user, the relationship between the user and the one or more real life players or the one or more characters in the RPG. The at least one processor may be further directed to cause the system to adjust, based on the determined relationship, the plot of the RPG.
- the at least one processor may be further directed to cause the system to adjust, based on the target real time emotion of the user in the scene, an element of the RPG in the scene.
- the element of the RPG includes at least one of a vision effect associated with the RPG in the scene, a sound effect associated with the RPG in the scene, a display interface element associated with the RPG in the scene or one or more props used in the RPG in the scene.
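- A sketch of how a recognized target real time emotion might drive such an adjustment; the emotion-to-adjustment table is purely hypothetical:

```python
# Hypothetical adjustment hooks; the patent leaves the concrete game logic open.
PLOT_RULES = {
    "bored":   "introduce_side_quest",
    "anxious": "lower_difficulty",
    "happy":   "keep_current_branch",
    "angry":   "trigger_reconciliation_scene",
}

def adjust_rpg(target_emotion: str) -> str:
    """Map the target real time emotion recognized in the scene to an
    adjustment of the subsequent plot (or of scene elements such as
    sound effects or interface elements)."""
    return PLOT_RULES.get(target_emotion, "keep_current_branch")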
- a method for emotion recognition may include obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user.
- the method may further include optionally determining an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user.
- the method may further include optionally determining a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user.
- the method may further include determining a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
- a non-transitory computer readable medium storing instructions, the instructions, when executed by a computer, may cause the computer to implement a method.
- the method may include one or more of the following operations.
- the method may include obtaining voice signals of a user playing in a scene, the voice signals comprising acoustic characteristics and audio data of the user.
- the method may further include optionally determining an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user.
- the method may further include optionally determining a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user.
- the method may further include determining a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
- a system for emotion recognition may include an obtaining module and an emotion recognition module.
- the obtaining module may be configured to obtain voice signals of a user, the voice signals comprising acoustic characteristics and audio data of the user.
- the emotion recognition module may be configured to optionally determine an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user.
- the emotion recognition module may be also configured to optionally determine a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user.
- a system for emotion recognition may include at least one storage medium storing a set of instructions and at least one processor configured to communicate with the at least one storage medium.
- the at least one processor may be directed to cause the system to obtain voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user.
- the at least one processor may be directed to cause the system to determine one or more acoustic characteristics of the user from the voice signals.
- the at least one processor may be directed to cause the system to determine one or more text contents derived from the audio data of the user.
- the at least one processor may be directed to cause the system to determine a target real time emotion of the user in the scene based on the one or more acoustic characteristics and the one or more text contents.
- the at least one processor may be further directed to cause the system to send the target real time emotion of the user and the one or more text contents to a terminal device for voice control.
- a method may be implemented on a computing device including a storage device and at least one processor for emotion recognition.
- the method may include obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user.
- the method may include determining one or more acoustic characteristics of the user from the voice signals.
- the method may include determining one or more text contents derived from the audio data of the user.
- the method may include determining a target real time emotion of the user in the scene based on the one or more acoustic characteristics and the one or more text contents.
- a non-transitory computer readable medium storing instructions, the instructions, when executed by a computer, may cause the computer to implement a method.
- the method may include one or more of the following operations.
- the method may include obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user.
- the method may include determining one or more acoustic characteristics of the user from the voice signals.
- the method may include determining one or more text contents derived from the audio data of the user.
- the method may further include determining a target real time emotion of the user in the scene based on the one or more acoustic characteristics and the one or more text contents.
- FIG. 1 is a schematic diagram illustrating an exemplary emotion recognition system according to some embodiments of the present disclosure
- FIG. 2 is a schematic diagram illustrating exemplary hardware and software components of a computing device according to some embodiments of the present disclosure
- FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device on which a terminal may be implemented according to some embodiments of the present disclosure
- FIG. 4A is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure.
- FIG. 4B is a block diagram illustrating an exemplary model determination module according to some embodiments of the present disclosure.
- FIG. 5 is a flowchart illustrating an exemplary process for adjusting the plot of an RPG according to some embodiments of the present disclosure
- FIG. 6 is a flowchart illustrating an exemplary process for adjusting the plot of an RPG according to some embodiments of the present disclosure
- FIG. 7 is a flowchart illustrating an exemplary process for adjusting the plot of an RPG according to some embodiments of the present disclosure
- FIG. 8 is a flowchart illustrating an exemplary process for adjusting the plot of an RPG according to some embodiments of the present disclosure
- FIG. 9 is a flowchart illustrating an exemplary process for obtaining a speech recognition model according to some embodiments of the present disclosure.
- FIG. 10A is a flowchart illustrating an exemplary process for determining an acoustic based emotion recognition model according to some embodiments of the present disclosure
- FIG. 10B is a flowchart illustrating an exemplary process for determining a content based emotion recognition model according to some embodiments of the present disclosure
- FIG. 11 is a flowchart illustrating an exemplary process for determining an emotion of a user according to some embodiments of the present disclosure
- FIG. 12 is a flowchart illustrating an exemplary process for determining a first probability corresponding to each of one or more predetermined emotions according to some embodiments of the present disclosure
- FIG. 13 is a flowchart illustrating an exemplary process for determining a second probability corresponding to each of multiple predetermined emotions according to some embodiments of the present disclosure
- FIG. 14 is a flowchart illustrating an exemplary process for determining a target portion in audio data according to some embodiments of the present disclosure
- FIG. 15 is a flowchart illustrating an exemplary process for determining a second probability corresponding to each of multiple predetermined emotions according to some embodiments of the present disclosure.
- FIG. 16 is a flowchart illustrating an exemplary process for determining an emotion of a user based on at least one of a text content and one or more acoustic characteristics in a scene according to some embodiments of the present disclosure.
- The terms "system," "engine," "unit," "module," and/or "block" used herein are one way to distinguish different components, elements, parts, sections, or assemblies of different levels in ascending order. However, the terms may be replaced by other expressions if they achieve the same purpose.
- The terms "module," "unit," or "block," as used herein, refer to logic embodied in hardware or firmware, or to a collection of software instructions.
- a module, a unit, or a block described herein may be implemented as software and/or hardware and may be stored in any type of non-transitory computer-readable medium or another storage device.
- a software module/unit/block may be compiled and linked into an executable program. It will be appreciated that software modules can be callable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts.
- Software modules/units/blocks configured for execution on computing devices may be provided on a computer-readable medium, such as a compact disc, a digital video disc, a flash drive, a magnetic disc, or any other tangible medium, or as a digital download (and can be originally stored in a compressed or installable format that needs installation, decompression, or decryption prior to execution) .
- Such software code may be stored, partially or fully, on a storage device of the executing computing device, for execution by the computing device.
- Software instructions may be embedded in firmware, such as an erasable programmable read-only memory (EPROM) .
- hardware modules/units/blocks may be included in connected logic components, such as gates and flip-flops, and/or can be included in programmable units, such as programmable gate arrays or processors.
- the modules/units/blocks or computing device functionality described herein may be implemented as software modules/units/blocks but may be represented in hardware or firmware.
- the modules/units/blocks described herein refer to logical modules/units/blocks that may be combined with other modules/units/blocks or divided into sub-modules/sub-units/sub-blocks despite their physical organization or storage. The description may be applicable to a system, an engine, or a portion thereof.
- the flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may be implemented out of order. Conversely, the operations may be implemented in an inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts, and one or more operations may be removed from the flowcharts.
- Embodiments of the present disclosure may be applied to different transportation systems including but not limited to land transportation, sea transportation, air transportation, space transportation, or the like, or any combination thereof.
- a vehicle of the transportation systems may include a rickshaw, travel tool, taxi, chauffeured car, hitch, bus, rail transportation (e.g., a train, a bullet train, high-speed rail, and subway) , ship, airplane, spaceship, hot-air balloon, driverless vehicle, or the like, or any combination thereof.
- the transportation system may also include any transportation system that applies management and/or distribution, for example, a system for sending and/or receiving an express.
- the application scenarios of different embodiments of the present disclosure may include, but are not limited to, one or more webpages, browser plugins and/or extensions, client terminals, custom systems, intracompany analysis systems, artificial intelligence robots, or the like, or any combination thereof. It should be understood that the application scenarios of the system and method disclosed herein are only some examples or embodiments. Those having ordinary skill in the art, without further creative efforts, may apply the present disclosure to other application scenarios, for example, another similar server.
- a method may include obtaining voice signals of a user playing in a scene of a Role-playing game (RPG) .
- the voice signals may comprise acoustic characteristics and audio data of the user.
- the method may include optionally determining an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user.
- the method may also include optionally determining a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user.
- the method may also include determining a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
- the target real time emotion of the user may be determined based on the acoustic based real time emotion of the user and/or the content based real time emotion, which may improve an accuracy of the recognized emotion of the user.
- the method may further include adjusting, based on the target real time emotion of the user in the scene, a plot of the RPG subsequent to the scene.
- the development of the game's plot may be driven by interaction or communication, through speech, between users or characters of the RPG, which can provide users with a better game experience, make the RPG more engaging, and attract more users.
- FIG. 1 is a schematic diagram illustrating an exemplary emotion recognition system 100 according to some embodiments of the present disclosure.
- the emotion recognition system 100 may be a platform for data and/or information processing, for example, training a machine learning model for emotion recognition and/or data classification, such as text classification, etc.
- the emotion recognition system 100 may be applied in online games (e.g., a role-playing game (RPG)), artificial intelligence (AI) customer service, AI shopping guidance, AI tourist guidance, driving systems (e.g., an automatic pilot system), lie detection systems, or the like, or a combination thereof.
- a plot of an RPG may be adjusted and/or controlled based on emotions of users identified by the emotion recognition system 100.
- personalized information associated with different users may be recommended based on emotions of users identified by the emotion recognition system 100.
- the emotion recognition system 100 may recognize an emotion of a user based on, for example, facial expression images, voice signals, etc.
- the emotion recognition system 100 may include a server 110, a storage device 120, terminals 130 and 140, and a network 150.
- the server 110 may process information and/or data relating to emotion recognition.
- the server 110 may be a single server or a server group.
- the server group may be centralized, or distributed (e.g., the server 110 may be a distributed system) .
- the server 110 may be implemented on a cloud platform.
- the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
- the server 110 may be implemented on a computing device having one or more components illustrated in FIG. 2 in the present disclosure.
- the server 110 may include a processing device 112.
- the processing device 112 may process information and/or data relating to emotion recognition to perform one or more functions described in the present disclosure.
- the processing device 112 may receive voice signals including acoustic characteristics and audio data of a user communicating or speaking in a scene from the terminal 130 or the terminal 140.
- the processing device 112 may obtain an acoustic based emotion recognition model and a content based emotion recognition model from the storage device 120.
- the processing device 112 may determine an acoustic based real time emotion of the user based on the acoustic characteristics using the acoustic based emotion recognition model.
- the processing device 112 may determine a content based real time emotion of the user based on a text content derived from the audio data of the user using the content based emotion recognition model.
- the text content may be derived from the audio data of the user using a speech recognition model.
- the processing device 112 may determine a target real time emotion of the user based on acoustic based emotion and the content based emotion of the user.
- the processing device 112 may adjust a plot subsequent to the scene based on the target real time emotion of the user.
- the scene may be associated with an RPG, an AI customer service, an AI shopping guidance, an AI tourist guidance, a driving, a lie detection, etc.
- the determination and/or updating of models may be performed on a processing device, while the application of the models may be performed on a different processing device. In some embodiments, the determination and/or updating of the models may be performed on a processing device of a system different than the emotion recognition system 100 or a server different than the server 110 on which the application of the models is performed.
- the determination and/or updating of the models may be performed on a first system of a vendor who provides and/or maintains such a machine learning model and/or has access to training samples used to determine and/or update the machine learning model, while emotion recognition based on the provided machine learning model may be performed on a second system of a client of the vendor.
- the determination and/or updating of the models may be performed online in response to a request for emotion recognition. In some embodiments, the determination and/or updating of the models may be performed offline.
- the processing device 112 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) .
- the processing device 112 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
- the storage device 120 may store data and/or instructions related to content identification and/or data classification. In some embodiments, the storage device 120 may store data obtained/acquired from the terminal 130 and/or the terminal 140. In some embodiments, the storage device 120 may store data and/or instructions that the server 110 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, the storage device 120 may include a mass storage device, a removable storage device, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof. Exemplary mass storage devices may include a magnetic disk, an optical disk, a solid-state drive, etc.
- Exemplary removable storage devices may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
- Exemplary volatile read-and-write memory may include a random access memory (RAM) .
- Exemplary RAM may include a dynamic RAM (DRAM) , a double date rate synchronous dynamic RAM (DDR SDRAM) , a static RAM (SRAM) , a thyristor RAM (T-RAM) , and a zero-capacitor RAM (Z-RAM) , etc.
- Exemplary ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (PEROM) , an electrically erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc.
- the storage device 120 may be implemented on a cloud platform.
- the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
- the storage device 120 may be connected to or communicate with the server 110.
- the server 110 may access data or instructions stored in the storage device 120 directly or via a network.
- the storage device 120 may be a part of the server 110.
- the terminal 130 and/or the terminal 140 may provide data and/or information related to emotion recognition and/or data classification.
- the data and/or information may include images, text files, voice segments, web pages, video recordings, user requests, programs, applications, algorithms, instructions, computer codes, or the like, or a combination thereof.
- the terminal 130 and/or the terminal 140 may provide the data and/or information to the server 110 and/or the storage device 120 of the emotion recognition system 100 for processing (e.g., train a machine learning model for emotion recognition) .
- the terminal 130 and/or the terminal 140 may be a device, a platform, or other entity interacting with the server 110.
- the terminal 130 may be implemented in a device with data acquisition and/or data storage, such as a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, a server 130-4, a storage device (not shown), or the like, or any combination thereof.
- the mobile devices 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, a game machine (or a game console) or the like, or any combination thereof.
- the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof.
- the wearable device may include a smart bracelet, a smart footgear, a smart glass, a smart helmet, a smartwatch, a smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof.
- the smart mobile device may include a smartphone, a personal digital assistant (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof.
- the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, an augmented reality glass, an augmented reality patch, or the like, or any combination thereof.
- the virtual reality device and/or the augmented reality device may include a Google Glass, an Oculus Rift, a HoloLens, a Gear VR, etc.
- the servers 130-4 may include a database server, a file server, a mail server, a web server, an application server, a computing server, a media server, a communication server, etc.
- the terminal 140 may be similar to or same as the terminal 130.
- the terminal 140 may be implemented in a device with data acquisition and/or data storage, such as a mobile device 140-1, a tablet computer 140-2, a laptop computer 140-3, and a server 140-4, a storage device (not shown) , or the like, or any combination thereof.
- the terminal 130 may be a client terminal.
- the client terminal may send and/or receive information for emotion recognition to the processing device 112 via a user interface.
- the user interface may be in the form of an application for an online game (e.g., an RPG platform) or emotion recognition implemented on the terminal 130 and/or the terminal 140.
- the user interface implemented on the terminal 130 and/or the terminal 140 may be configured to facilitate communication between users of the terminal 130 and/or the terminal 140, and the processing device 112.
- each of the terminal 130 and/or the terminal 140 may be configured with a voice pickup device for acquiring voice signals of users.
- the terminal 130 and/or the terminal 140 may be installed with the same RPG platform.
- Each of the users of the terminal 130 and the terminal 140 may be a player of the RPG and have a game character in the RPG.
- the users of the terminal 130 and the terminal 140 may communicate with each other via the voice pickup device in the RPG platform.
- the game characters of the users playing in the RPG may communicate or interact with each other based on communication of the users via the voice pickup device.
- the processing device 112 may obtain voice signals of the users playing the RPG from the terminal 130 and the terminal 140.
- the processing device 112 may determine a real time emotion of at least one of the users of the terminal 130 and the terminal 140 based on methods as described elsewhere in the present disclosure.
- the processing device 112 may further adjust a plot of the RPG associated with at least one of the users based on the real time emotion.
- the terminal 130 may be a server terminal.
- the terminal 130 may be a game server used to process and/or store data in response to one or more service requests when a user plays an online game (e.g., an RPG) .
- the terminal may obtain a real time emotion of the user playing the online game determined by the server 110 (e.g., the processing device 112) according to a method for emotion recognition as described elsewhere in the present disclosure.
- the terminal 130 may adjust a plot of the online game (e.g., an RPG) based on the real time emotion of the user.
- the network 150 may facilitate exchange of information and/or data.
- one or more components in the emotion recognition system 100 (e.g., the server 110, the terminal 130, the terminal 140, or the storage device 120) may send information and/or data to other components of the emotion recognition system 100 via the network 150.
- the network 150 may be any type of wired or wireless network, or any combination thereof.
- the network 150 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, an internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a wide area network (WAN) , a public telephone switched network (PTSN) , a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof.
- the network 150 may include one or more network access points.
- the network 150 may include wired or wireless network access points such as base stations and/or internet exchange points 150-1, 150-2...through which one or more components of the emotion recognition system 100 may be connected to the network 150 to exchange data and/or information.
- FIG. 2 illustrates a schematic diagram of an exemplary computing device 200 according to some embodiments of the present disclosure.
- the computing device 200 may be a computer, such as the server 110 in FIG. 1 and/or a computer with specific functions, configured to implement any particular system according to some embodiments of the present disclosure.
- the computing device 200 may be configured to implement any component that performs one or more functions disclosed in the present disclosure.
- for example, the server 110 (e.g., the processing device 112) may be implemented on the computing device 200.
- FIG. 2 depicts only one computing device.
- the functions of the computing device may be implemented by a group of similar platforms in a distributed mode to disperse the processing load of the system.
- the computing device 200 may include a communication terminal 250 that may connect with a network that may implement the data communication.
- the computing device 200 may also include a processor 220 that is configured to execute instructions and includes one or more processors.
- the schematic computer platform may include an internal communication bus 210, different types of program storage units and data storage units (e.g., a hard disk 270, a read-only memory (ROM) 230, a random-access memory (RAM) 240) , various data files applicable to computer processing and/or communication, and some program instructions executed possibly by the processor 220.
- the computing device 200 may also include an I/O device 260 that may support the input and output of data flows between the computing device 200 and other components. Moreover, the computing device 200 may receive programs and data via the communication network.
- FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device on which the terminal 130, the terminal 140, and the server 110, may be implemented according to some embodiments of the present disclosure.
- the mobile device 300 may include a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, a mobile operating system (OS) 370, one or more applications 380, and a storage 390.
- any other suitable component including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
- the mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™, etc.) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340.
- the applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to image processing or other information from the emotion recognition system 100.
- User interactions with the information stream may be achieved via the I/O 350 and provided to the storage device 120, the server 110 and/or other components of the emotion recognition system 100.
- the mobile device 300 may be an exemplary embodiment corresponding to a terminal associated with the emotion recognition system 100, such as the terminal 130 and/or the terminal 140.
- computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein.
- a computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device.
- a computer may also act as a system if appropriately programmed.
- FIG. 4A is a block diagram illustrating an exemplary processing device 112 according to some embodiments of the present disclosure.
- the processing device 112 may include an obtaining module 410, a model determination module 420, an emotion recognition module 430, an adjustment module 440, and a sending module 450.
- the obtaining module 410 may be configured to obtain audio data of a user in a scene.
- the audio data may be acquired from voice signals of the user playing in the scene.
- the voice signals may be generated when a user plays in a scene of a role-playing game (RPG).
- the obtaining module 410 may be configured to obtain voice signals of a user playing in a scene of a role-playing game (RPG) .
- the voice signals of the user may comprise acoustic characteristics and audio data of the user.
- the voice signals of the user may be obtained by the obtaining module 410 from the terminal 130, the terminal 140, a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as described elsewhere in the present disclosure.
- the obtaining module 410 may use speech recognition (e.g., a speech recognition model) to convert the audio data of the user in the scene into a result of the speech recognition comprising a text content of the user's voice signals.
- the obtaining module 410 may also obtain models used in a process for emotion recognition, for example, an acoustic based emotion recognition model, a content based emotion recognition model, a speech recognition model, etc.
- the acoustic based emotion recognition model may be configured to determine an emotion of the user based on one or more acoustic characteristics of the user.
- the content based emotion recognition model may be configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user using a speech recognition model.
- the model determination module 420 may be configured to determine one or more models used in a process for emotion recognition, for example, an acoustic based emotion recognition model, a content based emotion recognition model, a speech recognition model, a trained audio category identification model, etc. For example, the model determination module 420 may use a plurality of groups of voice signals to train a machine learning model to obtain an acoustic based emotion recognition model. The model determination module 420 may also use speech recognition to convert each group of the plurality of groups of audio data to obtain a result of the speech recognition comprising a text content of each of the plurality of groups of audio data. Further, the model determination module 420 may use the text content of each group of audio data to train a machine learning model to obtain a content based emotion recognition model.
- the machine learning model may include a linear regression model, a Kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof.
- the model determination module 420 may determine a universal speech recognition model and/or a special speech recognition model according to process 900. In some embodiments, the model determination module 420 may determine a trained audio category identification model according to process 1400.
- the model determination module 420 may be configured to determine a first probability corresponding to each of one or more predetermined emotions based on a text vector corresponding to the text content. In some embodiments, the model determination module 420 may determine a word vector corresponding to each of one or more words in a text content. The model determination module 420 may determine a text vector by summing the word vectors. The model determination module 420 may determine a first probability corresponding to each of the one or more predetermined emotions by inputting the text vector into a content based emotion recognition model.
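- A sketch of that first-probability computation, assuming a pretrained word-embedding table and a content based model trained on summed word vectors (note this differs from the TF-IDF stand-in shown earlier):

```python
import numpy as np

def first_probabilities(words, word_vectors, content_model):
    """Sum the word vectors of the words in the text content to form a text
    vector, then let the content based emotion recognition model map it to a
    probability per predetermined emotion (the "first probability").

    `word_vectors` is assumed to be a dict word -> np.ndarray (e.g., from a
    pretrained embedding table); `content_model` is assumed to expose
    predict_proba over text vectors.
    """
    dim = len(next(iter(word_vectors.values())))
    text_vector = sum((word_vectors.get(w, np.zeros(dim)) for w in words), np.zeros(dim))
    return content_model.predict_proba(text_vector.reshape(1, -1))[0]
```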
- the model determination module 420 may determine a second probability corresponding to the each of one or more predetermined emotions based on acoustic characteristics of the audio data.
- the model determination module 420 may determine a Mel-frequency cepstral coefficient (MFCC) corresponding to each of multiple frames of the audio data by performing a Fourier transform on the audio data.
- the model determination module 420 may identify each of the multiple frames based on the MFCC to obtain a target portion of the audio data.
- the model determination module 420 may determine a second probability corresponding to each of multiple predetermined emotions based on the target portion of the audio data.
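- The per-frame MFCC extraction and target-portion selection can be sketched with librosa; the energy-threshold rule below is only a stand-in for the frame identification step, which the description does not spell out:

```python
import librosa
import numpy as np

# Placeholder file path; any mono recording of the user's voice would do.
signal, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

frame_energy = mfcc[0]                          # 0th coefficient tracks log energy
keep = frame_energy > np.median(frame_energy)   # crude active-frame mask (assumption)
target_portion = mfcc[:, keep]                  # frames passed on for the second probability
```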
- the model determination module 420 may determine an emotion degree corresponding to each of the one or more predetermined emotions based on at least one of the first probability and the second probability.
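- One simple way to turn the first and second probabilities into an emotion degree is a weighted average (the weight is an assumption; the description only requires that the degree be based on at least one of the two probabilities):

```python
import numpy as np

def emotion_degree(first_prob, second_prob, w_text=0.5):
    """Combine the content based (first) and acoustic based (second)
    probabilities into an emotion degree per predetermined emotion."""
    first_prob = np.asarray(first_prob, dtype=float)
    second_prob = np.asarray(second_prob, dtype=float)
    return w_text * first_prob + (1.0 - w_text) * second_prob

degrees = emotion_degree([0.7, 0.2, 0.1], [0.4, 0.5, 0.1])
predicted_index = int(np.argmax(degrees))   # index of the recognized emotion
```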
- the emotion recognition module 430 may be configured to determine an emotion of the user based on at least one of the text content and the one or more acoustic characteristics. In some embodiments, the emotion recognition module 430 may determine the emotion of the user based on the at least one of the text content and the one or more acoustic characteristics using the acoustic based emotion recognition model and/or the content based emotion recognition model. In some embodiments, the emotion recognition module 430 may determine the emotion of the user based on the emotion degree corresponding to each of the one or more predetermined emotions.
- the emotion recognition module 430 may be configured to determine a real time emotion of the user in the scene based on the voice signals using at least one of the acoustic based emotion recognition model or the content based emotion recognition model.
- the emotion recognition module 430 may first determine an acoustic based real time emotion using the acoustic based emotion recognition model. Further, the emotion recognition module 430 may optionally perform correction of the acoustic based real time emotion of the user by determining a content based real time emotion of the user using the content based emotion recognition model.
- the emotion recognition module 430 may first determine the content based real time emotion using the content based emotion recognition model. Further, the emotion recognition module 430 may optionally perform correction of the content based real time emotion of the user by determining the acoustic based real time emotion of the user using the acoustic based emotion recognition model.
- the emotion recognition module 430 may determine the content based real time emotion using the content based emotion recognition model and the acoustic based real time emotion using the acoustic based emotion recognition model. The emotion recognition module 430 may compare the content based real time emotion and the acoustic based real time emotion. Further, the emotion recognition module 430 may determine the real time emotion of the user based on the comparison.
- the adjustment module 440 may be configured to adjust the plot of the RPG subsequent to the scene based on the determined real time emotion of the user in the scene.
- the emotion of the user (e.g., player) of the RPG may reflect user experience of the RPG, and the adjustment module 440 may adjust the plot of the RPG subsequent to the scene to improve the user experience based on the determined or corrected real time emotion of the user.
- the user may have a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG.
- the adjustment module 440 may determine the relationship between the user and at least one of the one or more real-life players of the RPG or the one or more characters in the RPG based on the determined or corrected real time emotion of the user.
- the adjustment module 440 may adjust the plot of the RPG based on the relationship between the user and the at least one of the one or more real-life players of the RPG or the one or more characters in the RPG.
- the sending module 450 may be configured to send the emotion and the text content to a terminal device.
- the terminal device may recognize the user's actual intention through the text content and the emotion to perform operations in the scene (e.g., adjusting a plot of the RPG, pushing a plot of the RPG) .
- FIG. 4B is a block diagram illustrating an exemplary model determination module 420 according to some embodiments of the present disclosure.
- the model determination module 420 may include a speech recognition model determination unit 422, an emotion recognition model determination unit 424, and a storage unit 426.
- the speech recognition model determination unit 422 may be configured to use a plurality of groups of universal audio data to train a machine learning model to obtain a universal speech recognition model. Further, the speech recognition model determination unit 422 may use a plurality of groups of special audio data to train the universal speech recognition model to obtain a special speech recognition model.
- the emotion recognition model determination unit 424 may be configured to use a plurality of groups of voice signals to train a machine learning model to obtain an acoustic based emotion recognition model.
- the emotion recognition model determination unit 424 may be also configured to use speech recognition to convert each group of the plurality of groups of audio data to obtain a result of the speech recognition comprising a text content of each of the plurality of groups of audio data.
- a speech recognition model may be used to obtain the text content of each group of audio data.
- the emotion recognition model determination unit 424 may use the text content of each group of audio data to train a machine learning model to obtain a content based emotion recognition model.
- the storage unit 426 may be configured to store information.
- the information may include programs, software, algorithms, data, text, number, images and/or some other information.
- the information may include data that may be used for the emotion recognition of the user.
- the information may include the models for the emotion recognition of the user.
- the information may include training data for model determination.
- any module mentioned above may be implemented in two or more separate units. Additionally or alternatively, one or more modules mentioned above may be omitted.
- FIG. 5 is a flowchart illustrating an exemplary process 500 for adjusting the plot of an RPG according to some embodiments of the present disclosure. At least a portion of process 500 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 500 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 500 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in the form of electronic current or electrical signals.
- the processing device 112 may obtain voice signals of a user playing in a scene of a role-playing game (RPG) .
- the voice signals of the user may comprise acoustic characteristics and audio data of the user.
- the voice signals of the user may be obtained by the obtaining module 410 from the terminal 130, the terminal 140, a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as described elsewhere in the present disclosure.
- the voice signals of the user may be picked up by a voice pickup device (e.g., a microphone) of the terminal 130 (or the terminal 140) in real time.
- the obtaining module 410 may obtain the voice signals from the terminal 130 (or the terminal 140) or the voice pickup device in real time.
- the obtaining module 410 may obtain the voice signals of the user from the storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) , periodically.
- the acoustic characteristics of the user may include one or more features associated with duration, one or more features associated with energy, one or more features associated with fundamental frequency, one or more features associated with frequency spectrum, etc.
- a feature associated with duration may also be referred to as a duration feature.
- Exemplary duration features may include a speaking speed, a short time average zero-over rate, a zero-crossing rate, etc.
- a feature associated with energy may also be referred to as an energy or amplitude feature.
- Exemplary energy or amplitude features may include a short time average energy, a Root-Mean-Square (RMS) energy, a short time average amplitude, a short time energy gradient, an average amplitude change rate, a short time maximum amplitude, etc.
- a feature associated with fundamental frequency may be also referred to as a fundamental frequency feature.
- Exemplary fundamental frequency features may include a fundamental frequency, a pitch of the fundamental frequency (also referred to as F0) , an average fundamental frequency, a maximum fundamental frequency, a fundamental frequency range, etc.
- Exemplary features associated with frequency spectrum may include formant features, linear predictive coding cepstrum coefficients (LPCC) , mel-frequency cepstrum coefficients (MFCC) , Harmonics to Noise Ratio (HNR) , etc.
- the acoustic characteristics of the user may be identified and/or determined from the voice signals or the audio data of the user using an acoustic characteristic extraction technique.
- Exemplary acoustic characteristic extraction techniques may include using an autocorrelation function (ACF) algorithm, an average amplitude difference function (AMDF) algorithm, a nonlinear feature extraction algorithm based on Teager energy operator (TEO) , a linear predictive analysis (LPC) algorithm, a deep learning algorithm (e.g., a Laplacian Eigenmaps, a principal component analysis (PCA) , a local preserved projection (LPP) , etc. ) , etc.
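- The listed acoustic characteristics can be illustrated with a short numpy sketch; the exact feature definitions and the pitch-range bounds below are assumptions, not the patent's algorithms.

```python
import numpy as np

def frame_features(frame, sr):
    # Zero-crossing rate (a duration-related feature).
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    # Short time RMS energy (an energy/amplitude feature).
    rms = np.sqrt(np.mean(frame ** 2))
    # Rough fundamental frequency estimate from the autocorrelation
    # function (ACF), searching a 50-400 Hz pitch range.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(sr / 400), int(sr / 50)
    peak = lag_min + int(np.argmax(acf[lag_min:lag_max]))
    f0 = sr / peak
    return {"zero_crossing_rate": zcr, "rms_energy": rms, "fundamental_frequency": f0}
```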
- the audio data of the user may include semantic information of the voice signals of the user that may reflect the content of the voice signals of the user.
- the audio data may include a plurality of phoneme sets, each of which may include one or more phonemes. Each phoneme set may correspond to a pronunciation of a word.
- the audio data may include a plurality of word sets, each of which includes one or more words.
- the audio data may include a plurality of phrase sets, each of which includes one or more phrases. For example, when the user speaks “Oh my god” , three phoneme sets A, B, and C may be used to represent three words “Oh, ” “my, ” “god, ” respectively.
- the audio data of the user may be determined based on the voice signals of the user.
- the voice signals of the user may be analog signals.
- the audio data of the user may be obtained by performing an analog to digital converting operation on the voice signals (i.e., analog signals) of the user.
- the voice signals may be digital signals, which may be also referred to as the audio data.
- the processing device 112 may obtain an acoustic based emotion recognition model.
- the processing device 112 may obtain the acoustic based emotion recognition model from a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) .
- the acoustic based emotion recognition model may be configured to determine an emotion of the user based on one or more acoustic characteristics of the user.
- the emotion determined based on one or more acoustic characteristics may also be referred to as an acoustic based real time emotion.
- the acoustic based emotion recognition model may be configured to determine the emotion of the user based on the one or more acoustic characteristics of the user according to one or more dimensions, such as category, degree, etc.
- the acoustic based emotion recognition model may be configured to classify the emotion of the user into a category.
- the category may be one of positive, negative, and else (e.g., neutral) .
- the category may be one of “joy” , “anger” , “fear” , “disgust” , “surprise” , “sadness” , and else (e.g., neutral) .
- the category may be one of “interest” , “desire” , “sorrow” , “wonder” , “surprise” , “happiness” , and else (e.g., neutral) .
- the category may be one of “anxiety” , “anger” , “sadness” , “disgust” , “happiness” , and else (e.g., neutral) .
- the category may be one of “pleasure” , “pain” , and else (e.g., neutral) .
- the acoustic based emotion recognition model may be configured to determine a degree of the emotion of the user.
- the degree of an emotion may be used to denote an intensity of the emotion.
- the degree of an emotion may include several levels, such as strong and weak, or first level, second level, and third level, etc.
- the acoustic based emotion recognition model may be determined by training a machine learning model using a training set.
- the training set may include a plurality of groups of audio data or acoustic characteristics of audio data.
- at least a portion of the plurality of groups of audio data or acoustic characteristics of audio data may be obtained from an emotion voice database, such as Harbor emotion voice database.
- at least a portion of the plurality of groups of audio data or acoustic characteristics of audio data may be obtained by one or more testers simulating playing in one or more scenes (e.g., a scene of the RPG) .
- Each group of the plurality of groups of acoustic characteristics may correspond to a known emotion.
- Exemplary machine learning models may include a support vector machine (SVM) , a naive Bayes, maximum entropy, a neural network model (e.g., a deep learning model) , or the like, or any combination thereof.
- Exemplary deep learning models may include a convolutional neural network (CNN) model, a long short-term memory (LSTM) model, an extreme learning machine (ELM) model, or the like, or any combination thereof. More descriptions for the determination of the acoustic based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIG. 10A, and the descriptions thereof) .
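- A minimal sketch of training such a model with scikit-learn, assuming `X` holds one row of acoustic characteristics per group and `y` the known emotion labels; the SVM choice mirrors one of the model families listed above but is not the patent's exact configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_acoustic_emotion_model(X, y):
    # Scale the acoustic characteristics, then fit an SVM that can also
    # report class probabilities (useful for confidence levels later on).
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    model.fit(np.asarray(X), np.asarray(y))
    return model
```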
- the processing device 112 may obtain a content based emotion recognition model.
- the processing device 112 may obtain the content based emotion recognition model from a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) .
- the content based emotion recognition model may be configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user using a speech recognition model. More descriptions for the determination of the speech recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 6 and 9, and the descriptions thereof) .
- the emotion of the user based on one or more text contents may be also referred to as a content based real time emotion.
- the content based emotion recognition model may be configured to determine the emotion of the user based on the one or more text contents of the audio data of the user according to one or more dimensions, such as category, degree, etc.
- the content based emotion recognition model may be configured to classify the emotion of the user into a category.
- the category may be positive, negative, or neutral.
- the category may be “joy, ” “anger, ” “fear, ” “disgust, ” “surprise, ” or “sadness. ”
- the category may be “interest” , “desire” , “sorrow” , “wonder” , “surprise” , or “happiness. ”
- the category may be “anxiety” , “anger” , “sadness” , “disgust” , or “happiness. ”
- the category may be “pleasure” , or “pain. ”
- the content based emotion recognition model may be configured to determine a degree of the emotion of the user.
- the degree of an emotion may be used to denote an intensity of the emotion.
- the degree of an emotion may include several levels, such as strong and weak, or first level, second level, and third level, etc.
- the content based emotion recognition model may be determined by training a machine learning model using a training set.
- the training set may include a plurality of groups of text contents.
- at least a portion of the plurality of groups of text contents may be obtained from an emotion voice database, such as Harbor emotion voice database.
- the audio data in the emotion voice database may be recognized using a speech recognition technique to generate text contents to form the training set.
- at least a portion of the plurality of groups of text contents may be obtained by one or more testers simulating playing in one or more scenes (e.g., a scene of the RPG) .
- the audio data of the one or more testers may be recognized using a speech recognition technique to generate text contents to form the training set.
- Each group of the plurality of groups of text contents may correspond to a known emotion.
- the content based emotion recognition model may be constructed based on a linear regression model, a Kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model (e.g., a deep learning model) , or the like, or any combination thereof. More descriptions for the determination of the content based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIG. 10B, and the descriptions thereof) .
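- Under similar assumptions, the content based model could be trained on the summed word vectors of each text content; the small neural network below is only one of the model families named above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_content_emotion_model(text_vectors, labels):
    # text_vectors: one summed word-vector per group of text content
    # labels: the known emotion corresponding to each group
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    model.fit(np.asarray(text_vectors), np.asarray(labels))
    return model
```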
- the processing device 112 may determine a real time emotion of the user in the scene based on the voice signals using at least one of the acoustic based emotion recognition model or the content based emotion recognition model.
- the processing device 112 may first determine the acoustic based real time emotion using the acoustic based emotion recognition model.
- the processing device 112 may determine the acoustic based real time emotion as the real time emotion of the user. If the acoustic based real time emotion of the user determined by the acoustic based emotion recognition model is neither a negative emotion nor a positive emotion, the processing device 112 (e.g., the emotion recognition module 430) may correct the acoustic based real time emotion using the content based emotion recognition model. More descriptions for the correction of the acoustic based real time emotion may be found in FIG. 6 and the descriptions thereof.
- the processing device 112 may first determine the content based real time emotion using the content based emotion recognition model.
- the processing device 112 may determine the content based real time emotion as the real time emotion of the user. If the content based real time emotion of the user determined by the content based emotion recognition model is neither a negative emotion nor a positive emotion, the processing device 112 (e.g., the emotion recognition module 430) may correct the content based real time emotion using the acoustic based emotion recognition model. More descriptions for the correction of the content based real time emotion may be found in FIG. 7 and the descriptions thereof.
- the processing device 112 may determine the content based real time emotion using the content based emotion recognition model and the acoustic based real time emotion using the acoustic based emotion recognition model. The processing device 112 may compare the content based real time emotion and the acoustic based real time emotion. The processing device 112 may determine the real time emotion of the user based on the comparison. More descriptions of the determination of the real time emotion of the user may be found elsewhere in the present disclosure (e.g., FIG. 8 and the descriptions thereof) .
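- The acoustic-first strategy above can be sketched as follows, assuming `acoustic_model.predict` and `content_model.predict` wrap the two trained recognizers and return an emotion category.

```python
def real_time_emotion(voice_signals, text_content, acoustic_model, content_model):
    # Use the acoustic result directly when it is clearly positive or
    # negative; otherwise correct it with the content based model.
    acoustic_emotion = acoustic_model.predict(voice_signals)
    if acoustic_emotion in ("positive", "negative"):
        return acoustic_emotion
    return content_model.predict(text_content)
```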
- the processing device 112 may adjust the plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined real time emotion of the user in the scene.
- when receiving the real time emotion of the user and the text content (s) from the processing device 112, the terminal device (e.g., the terminal 130, the terminal 140) may recognize the user's actual intention through the text content (s) and the emotion to perform operations in the scene (e.g., adjusting a plot of the RPG, pushing a plot of the RPG) .
- For example, if the emotion is “happy, ” and the text content is “agree, ” the terminal device may perform the operation of “agree” in the scene.
- If the emotion is “unhappy, ” and the text content is “agree, ” the terminal device may perform an operation different from “agree, ” such as “disagree. ”
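- The intention-resolution behaviour in this example could look like the following sketch; the labels and the single "agree"/"disagree" rule are illustrative only.

```python
def resolve_intention(text_content, emotion):
    # If the spoken words and the recognized emotion disagree (e.g., an
    # unhappy "agree"), flip the literal action; otherwise act on the text.
    if text_content == "agree" and emotion in ("unhappy", "negative"):
        return "disagree"
    return text_content
```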
- the emotion of the user (e.g., player) of the RPG may reflect user experience of the RPG, and the processing device 112 (e.g., the adjustment module 440) may adjust the plot of the RPG subsequent to the scene to improve the user experience based on the determined or corrected real time emotion of the user.
- the user may have a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG.
- the processing device 112 may adjust the plot of the RPG based on the relationship between the user and at least one of the one or more real-life players of the RPG or the one or more characters in the RPG. For example, if the determined or corrected real time emotion of the user is a negative emotion, the processing device 112 (e.g., the adjustment module 440) may determine that the relationship between the user and a real-life player of the RPG or a character of the real-life player in the RPG is bad or poor. The processing device 112 (e.g., the adjustment module 440) may decrease the plot of the RPG associated with the character of the user and the character of the real-life player in the RPG.
- the processing device 112 may adjust the plot of the RPG so that the user and his/her partner are not in the same team.
- the processing device 112 may adjust the difficulty of the plot of the RPG to make it easier to pass.
- the processing device 112 may adjust, based on the determined real time emotion of the user in the scene, the element of the RPG in the scene.
- the element of the RPG in the scene may include a vision effect in the scene of the RPG, a sound effect in the scene of the RPG, a display interface element associated with the scene of the RPG, one or more props used in the scene of the RPG, or the like, or a combination thereof.
- For example, if the RPG is a horror game, the scene may be associated with a horror plot.
- the processing device 112 may adjust the vision effect (e.g., changing painting style) in the scene of the RPG, the sound effect in the scene of the RPG, the display interface element associated with the scene of the RPG, the one or more props used in the scene of the RPG, etc., to increase a degree of terror of the RPG.
- the processing device 112 may adjust the vision effect (e.g., changing painting style) in the scene of the RPG, the sound effect in the scene of the RPG, the display interface element associated with the scene of the RPG, the one or more props used in the scene of the RPG, etc., to decrease a degree of terror of the RPG.
- process 500 may further include obtaining an image based emotion recognition model configured to identify an emotion of a user based on an image of the face of the user (also referred to as image based real time emotion) .
- the real time emotion of the user may be determined based on at least one of the image based real time emotion, the acoustic based real time emotion, and the content based real time emotion.
- the acoustic based emotion recognition model and the content based emotion recognition model may be integrated into one single model.
- the one single model may be configured to identify an emotion of the user based on the acoustic characteristics of the user and the text content of the audio data of the user.
- FIG. 6 is a flowchart illustrating an exemplary process for adjusting the plot of a role-playing game (RPG) according to some embodiments of the present disclosure. At least a portion of process 600 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 600 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 600 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in the form of electronic current or electrical signals.
- the processing device 112 may obtain voice signals of a user playing in a scene of a role-playing game (RPG) .
- the voice signals of the user may comprise acoustic characteristics and audio data of the user.
- the user may be also referred to as a player of the RPG.
- the voice signals of the user may be obtained by the obtaining module 410 from the terminal 130, the terminal 140, a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as described elsewhere in the present disclosure.
- the voice signals of the user may be obtained as described in connection with 510 as illustrated in FIG. 5.
- the processing device 112 may determine the user’s real time emotion based on the acoustic characteristics of the user using an acoustic based emotion recognition model.
- a real time emotion determined using the acoustic based emotion recognition model may be also referred to as an acoustic based real time emotion.
- the acoustic based emotion recognition model may be obtained as described elsewhere in the present disclosure (e.g., FIGs. 5 and 10A, and the descriptions thereof) .
- the acoustic based emotion recognition model may be configured to identify the real time emotion of the user based on one or more acoustic characteristics.
- the voice signals may include the plurality of acoustic characteristics.
- the processing device 112 may determine the voice signals including one or more acoustic characteristics as an input of the acoustic based emotion recognition model. For example, the processing device 112 (e.g., the emotion recognition module 430) may input the voice signals of the user into the acoustic based emotion recognition model.
- the acoustic based emotion recognition model may identify the acoustic characteristics (e.g., the real time fundamental frequency and the real time amplitude) of the user from the voice signals.
- the acoustic based emotion recognition model may be used to determine the type of the real time emotion of the user and/or the degree of the real time emotion of the user.
- the acoustic based emotion recognition model may be used to determine the type of the real time emotion of the user and/or the degree of the real time emotion of the user based on the inputted acoustic characteristics of the user. More descriptions for the determination of the acoustic based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 10A, and the descriptions thereof) .
- the processing device 112 may perform a calibration operation on the acoustic characteristics of the user. For example, before starting the RPG, the processing device 112 (e.g., the emotion recognition module 430) may obtain one or more base voice signals (i.e., standard voice signals) of the user. The base voice signals may include a series of selected base acoustic characteristics. The processing device 112 (e.g., the emotion recognition module 430) may calibrate the acoustic characteristics of the user based on the base acoustic characteristics of the user.
- the processing device 112 may calibrate the acoustic characteristics of the user based on the base acoustic characteristics of the user. For example, the processing device 112 (e.g., the emotion recognition module 430) may normalize the acoustic characteristics of the user based on the base acoustic characteristics of the user. As a further example, the processing device 112 (e.g., the emotion recognition module 430) may determine the average value of an acoustic characteristic (e.g., a fundamental frequency) of the prerecorded voice signals as a base acoustic characteristic. The processing device 112 (e.g., the emotion recognition module 430) may normalize the acoustic characteristic of the user by subtracting the base acoustic characteristic from the acoustic characteristic of the user.
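- A minimal sketch of this calibration step, assuming the base acoustic characteristics were averaged from prerecorded (standard) voice signals before the game starts.

```python
import numpy as np

def base_characteristic(prerecorded_values):
    # e.g., the average fundamental frequency of the prerecorded voice signals.
    return float(np.mean(prerecorded_values))

def calibrate(characteristics, base_characteristics):
    # Normalize each real time acoustic characteristic by subtracting the
    # user's base value for that characteristic.
    return {name: np.asarray(value) - base_characteristics.get(name, 0.0)
            for name, value in characteristics.items()}
```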
- the processing device 112 may use speech recognition to convert the audio data of the user in the scene to obtain a result of the speech recognition comprising a text content of the user’s voice signals.
- a speech recognition model may be used to obtain the result of the speech recognition.
- Exemplary speech recognition models may include a Hidden Markov model (HMM) , a dynamic time warping (DTW) -based speech recognition model, an artificial neural network model, an end-to-end automatic speech recognition model, or the like, or any combination thereof.
- the speech recognition model may be a universal speech recognition model (e.g. a deep neural network model) .
- the universal speech recognition model may be trained using universal training data.
- the universal training data may include a plurality of groups of universal audio data corresponding to universal audio scenes, such as, a meeting scene, a working scene, a game scene, a party scene, a travel scene, a play scene, or the like, or any combination thereof.
- the speech recognition model may be a special speech recognition model for the RPG.
- the special speech recognition model may be obtained by transfer learning from the universal speech recognition model or a machine learning model using special training data.
- the special training data may include special audio data corresponding to special audio scenes of the RPG. More descriptions for the determination of the speech recognition model may be found elsewhere in the present disclosure (e.g., FIG. 9 and the descriptions thereof) .
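- The patent's universal/special speech recognition models are not published, so the sketch below substitutes the off-the-shelf SpeechRecognition package purely to illustrate the audio-to-text step that feeds the content based model.

```python
import speech_recognition as sr

def transcribe(wav_path):
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)          # read the whole file
    # Any recognizer backend would do; Google's free web API is used here
    # only as a readily available example.
    return recognizer.recognize_google(audio)
```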
- the processing device 112 may optionally perform a correction of the determined real time emotion of the user by determining a real time emotion of the user in the scene based on the text content using a content based emotion recognition model to obtain a corrected real time emotion of the user.
- the real time emotion of the user in the scene determined based on the text content using the content based emotion recognition model may be also referred to as a content based real time emotion.
- the content based real time emotion may be generated by inputting the text content into the content based emotion recognition model. More descriptions for the determination of the content based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 10B and the descriptions thereof) .
- If the processing device 112 determines that the acoustic based real time emotion of the user in the scene determined in 620 is a neutral emotion, the correction of the determined real time emotion may be performed using the content based emotion recognition model.
- the processing device 112 may determine an average emotion between the acoustic based real time emotion of the user in the scene determined in 620 and the content based real time emotion as the corrected real time emotion.
- If the processing device 112 determines that the acoustic based real time emotion of the user in the scene determined in 620 is a non-neutral emotion (e.g., a positive emotion or a negative emotion) , the correction of the determined acoustic based real time emotion may not be performed. Operations 630 and 640 may be omitted.
- Alternatively, even if the processing device 112 determines that the acoustic based real time emotion of the user in the scene determined in 620 is a non-neutral emotion (e.g., a positive emotion or a negative emotion) , the correction of the acoustic based real time emotion may be performed based on the content based real time emotion.
- the processing device 112 may designate the content based real time emotion as the corrected real time emotion.
- the processing device 112 may designate the degree of the content based real time emotion as the degree of the corrected real time emotion.
- the processing device 112 may adjust the plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined or corrected real time emotion of the user.
- the processing device 112 may adjust the plot of the RPG subsequent to the scene to improve the user experience based on the determined or corrected real time emotion of the user.
- the user may have a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG.
- the processing device 112 may determine the relationship between the user and one or more real-life players of the RPG or one or more characters in the RPG based on the determined or corrected real time emotion of the user.
- the processing device 112 may decrease the plot of the RPG associated with the one or more real-life players of the RPG or one or more characters in the RPG or determine a bad ending between the one or more characters in the RPG.
- the processing device 112 may adjust the plot of the RPG so that the user and his/her partner are not in the same team.
- the processing device 112 may adjust the difficulty of the plot of the RPG to make it easier to pass. More descriptions of adjusting the plot of the RPG may be found elsewhere in the present disclosure (e.g., FIG. 5, and the descriptions thereof) .
- one or more operations may be omitted and/or one or more additional operations may be added.
- operation 630 may be combined into operation 640.
- operation 630 and operation 640 may be omitted.
- one or more operations in processes 1000 and 1050 may be added into the process 600 to obtain the acoustic based emotion recognition model and the content based emotion recognition model.
- FIG. 7 is a flowchart illustrating an exemplary process 700 for adjusting the plot of a role-playing game (RPG) according to some embodiments of the present disclosure. At least a portion of process 700 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 700 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 700 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in the form of electronic current or electrical signals.
- the processing device 112 may obtain voice signals of a user.
- the voice signals of the user may comprise acoustic characteristics and audio data of the user playing in a scene of an RPG. More description of the voice signals of a user may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 6, and the descriptions thereof) .
- the processing device 112 may use speech recognition to convert the audio data of the user in the scene to obtain results of the speech recognition comprising text of the user’s voice signals. More description of obtaining results of the speech recognition comprising text of the user’s voice signals may be found elsewhere in the present disclosure (e.g., FIG. 6, and the descriptions thereof) .
- the processing device 112 may determine the user’s real time emotion based on the text using a content based emotion recognition model.
- the emotion of the user in the scene determined based on the text content of the user may also be referred to as the content based real time emotion.
- the processing device 112 may optionally perform correction of the determined real time emotion of the user by determining an emotion of the user in the scene based on the acoustic characteristics of the user in the scene using an acoustic based emotion recognition model to obtain a corrected real time emotion of the user.
- the real time emotion of the user in the scene determined based on the acoustic characteristics of the user in the scene using the acoustic based emotion recognition model may be also referred to as an acoustic based real time emotion.
- the acoustic based real time emotion may be generated by inputting the voice signals or the acoustic characteristics of the user into the acoustic based emotion recognition model. More descriptions for the determination of the acoustic based real time emotion using the acoustic based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 6, and the descriptions thereof) .
- If the processing device 112 determines the content based real time emotion of the user is a neutral emotion, the correction of the determined real time emotion may be performed using the acoustic based emotion recognition model.
- the processing device 112 may determine an average emotion between the content based real time emotion and the acoustic based real time emotion as the corrected real time emotion.
- If the processing device 112 determines that the content based real time emotion of the user in the scene is a non-neutral emotion (e.g., a positive emotion or a negative emotion) , the correction of the determined content based real time emotion may not be performed. Operation 740 may be omitted.
- the processing device 112 may determine whether the content based real time emotion and the acoustic based real time emotion are different.
- the correction of the content based real time emotion may be performed based on the acoustic based real time emotion if the content based real time emotion and the acoustic based real time emotion are different.
- the processing device 112 may designate the acoustic based real time emotion as the corrected real time emotion.
- If the processing device 112 determines that the types of the content based real time emotion and the acoustic based real time emotion are the same (e.g., are both positive emotions) while the degrees of the content based real time emotion and the acoustic based real time emotion are different, the processing device 112 may designate the degree of the acoustic based real time emotion as the degree of the corrected real time emotion.
- the processing device 112 may adjust the plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined or corrected real time emotion of the user. More descriptions of the adjusting the plot of the RPG may be found elsewhere in the present disclosure (e.g., FIG. 5, and the descriptions thereof) .
- FIG. 8 is a flowchart illustrating an exemplary process 800 for adjusting the plot of a role-playing game (RPG) according to some embodiments of the present disclosure. At least a portion of process 800 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 800 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 800 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in the form of electronic current or electrical signals.
- the processing device 112 may obtain voice signals of a user.
- the voice signals of the user may comprise acoustic characteristics and audio data of the user playing in a scene of an RPG. More description of the voice signals of a user may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 6, and the descriptions thereof) .
- the processing device 112 may determine a first real time emotion of the user in the scene based on the acoustic characteristics of the user.
- the first real time emotion of the user in the scene determined based on the acoustic characteristics of the user may be also referred to as an acoustic based real time emotion.
- the first real time emotion of the user may be determined using an acoustic based emotion recognition model.
- the acoustic based emotion recognition model may be configured to identify the first real time emotion of the user based on one or more acoustic characteristics.
- More descriptions of the determination of the first real time emotion of the user (i.e., the acoustic based real time emotion) based on the acoustic characteristics may be found elsewhere in the present disclosure (e.g., FIG. 6, and the descriptions thereof) .
- the processing device 112 may use speech recognition to convert the audio data of the user in the scene to obtain results of the speech recognition comprising text of the user’s voice signals. More description of obtaining results of the speech recognition comprising text of the user’s voice signals may be found elsewhere in the present disclosure (e.g., FIG. 6, and the descriptions thereof) .
- the processing device 112 may determine a second real time emotion of the user in the scene based on the text of the user’s voice signal in the scene using a content based emotion recognition model.
- the second real time emotion of the user in the scene determined based on the text content of the user may also be referred to as a content based real time emotion.
- the processing device 112 may determine a target real time emotion of the user by comparing the first real time emotion and the second real time emotion of the user in the scene. In some embodiments, the processing device 112 may determine whether the first real time emotion is consistent with or same as the second real time emotion. The processing device 112 may determine the target real time emotion of the user based on the determination.
- the first real time emotion being consistent with or the same as the second real time emotion may refer to the type and the degree of the first real time emotion both being consistent with or the same as those of the second real time emotion.
- the first real time emotion being inconsistent with or different from the second real time emotion may refer to the type and/or the degree of the first real time emotion being inconsistent with or different from those of the second real time emotion.
- the processing device 112 may determine the consistent real time emotion of the user (i.e., the first real time emotion or the second real time emotion) as the target real time emotion of the user.
- the processing device 112 may determine either of the first real time emotion and the second real time emotion (e.g., the second real time emotion of the user) as the target real time emotion of the user.
- the processing device 112 may use the acoustic based emotion recognition model to determine a first confidence level for the first real time emotion (i.e., the acoustic based real time emotion) .
- the processing device 112 may use the content based emotion recognition model to determine a second confidence level for the second real time emotion (i.e., the content based real time emotion) .
- the processing device 112 may compare the first confidence level and the second confidence level to determine one of the acoustic based real time emotion and the content based real time emotion that corresponds to a higher confidence level as the target real time emotion.
- the processing device 112 may further determine whether the first real time emotion or the second real time emotion of the user is a neutral emotion. If the processing device 112 determines that the first emotion is a neutral emotion, the processing device 112 (e.g., the emotion recognition module 430) may determine the second real time emotion as the target real time emotion of the user. If the processing device 112 determines that the second real time emotion is a neutral emotion, the processing device 112 (e.g., the emotion recognition module 430) may determine the first real time emotion as the target real time emotion of the user.
- the processing device 112 may further determine the target real time emotion based on the first real time emotion and the second real time emotion. For example, if the degrees of the first real time emotion and the second real time emotion are inconsistent, the processing device 112 may determine the degree of the target real time emotion by averaging the degrees of the first real time emotion and the second real time emotion. As another example, the processing device 112 may compare the degree of the first real time emotion and the degree of the second real time emotion. The processing device 112 may determine the larger or smaller of the degrees of the first real time emotion and the second real time emotion as the degree of the target real time emotion.
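- A compact sketch of the comparison logic described above; emotions are represented as (type, degree) pairs and the confidence levels come from the two recognition models, all of which are assumptions about interfaces the patent does not fix.

```python
def target_emotion(first, second, first_conf, second_conf):
    # first/second: (type, degree) from the acoustic based and the content
    # based recognizers; *_conf: their respective confidence levels.
    if first == second:
        return first
    if first[0] == "neutral":
        return second
    if second[0] == "neutral":
        return first
    if first[0] == second[0]:
        # Same type but different degree: average the degrees.
        return (first[0], (first[1] + second[1]) / 2.0)
    # Otherwise keep the result with the higher confidence level.
    return first if first_conf >= second_conf else second
```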
- the processing device 112 may adjust the plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined target real time emotion of the user. Operation 860 may be performed as described in connection with 550 illustrated in FIG. 5.
- FIG. 9 is a flowchart illustrating an exemplary process 900 for obtaining a speech recognition model according to some embodiments of the present disclosure. At least a portion of process 900 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 900 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 900 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in the form of electronic current or electrical signals.
- the processing device 112 may obtain a plurality of groups of universal audio data of one or more users communicating in one or more scenes (or circumstances) .
- the one or more scenes may include a meeting scene, a working scene, a game scene, a party scene, a travel scene, a play scene, or the like, or any combination thereof.
- One group of universal audio data may include information of a communication of the user in one of the one or more scenes.
- a passenger and a driver may make a dialogue during the travel.
- the communication between the passenger and the driver may be recorded or picked up as voice signals by a voice pickup device (e.g., a microphone) installed in the vehicle of the driver or a mobile device associated with the driver or the passenger.
- the voice signals may be converted into the audio data of the driver and/or the passenger.
- a group of universal audio data may include a plurality of phoneme sets, each of which includes one or more phonemes. Each phoneme set may correspond to a pronunciation of a word.
- a group of universal audio data may include a plurality of word sets, each of which includes one or more words.
- a group of universal audio data may include a plurality of phrase sets, each of which includes one or more phrases.
- Each group of the plurality of groups of universal audio data may correspond to an actual text content indicating semantic information of a communication of the user in a scene. The actual text content may be denoted by one or more words or phrases.
- the actual text content corresponding to each group of the plurality of groups of universal audio data may be determined based on each group of the plurality of groups of universal audio data by an operator (e.g., an engineer) manually.
- the processing device 112 may use the plurality of groups of universal audio data to train a machine learning model to obtain a universal speech recognition model.
- the machine learning model may include a linear regression model, a Kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model (e.g., a deep learning model) , or the like, or any combination thereof.
- the universal speech recognition model may be obtained by training a neural network model using a neural network model training algorithm.
- Exemplary neural network training algorithms may include a gradient descent algorithm, a Newton’s algorithm, a Quasi-Newton algorithm, a Levenberg-Marquardt algorithm, a conjugate gradient algorithm, or the like, or a combination thereof.
- the universal speech recognition model may be obtained by performing a plurality of iterations. For each of the plurality of iterations, a specific group of universal audio data may first be inputted into the machine learning model.
- the machine learning model may extract one or more phonemes, letters, characters, words, phrases, sentences, etc., included in the specific group of universal audio data. Based on the extracted phonemes, letters, characters, words, phrases, sentences, etc., the machine learning model may determine a predicted text content corresponding to the specific group of universal audio data. The predicted text content may then be compared with an actual text content (i.e., a desired text content) corresponding to the specific group of universal audio data based on a cost function.
- the cost function of the machine learning model may be configured to assess a difference between an estimated value (e.g., the predicted text content) of the machine learning model and a desired value (e.g., the actual text content) . If the value of the cost function exceeds a threshold in a current iteration, parameters of the machine learning model may be adjusted and updated to make the value of the cost function (i.e., the difference between the predicted text content and the actual text content) smaller than the threshold. Accordingly, in a next iteration, another group of universal audio data may be inputted into the machine learning model to train the machine learning model as described above. Then the plurality of iterations may be performed to update the parameters of the machine learning model until a termination condition is satisfied.
- the termination condition may provide an indication of whether the machine learning model is sufficiently trained. For example, the termination condition may be satisfied if the value of the cost function associated with the machine learning model is minimal or smaller than a threshold (e.g., a constant) . As another example, the termination condition may be satisfied if the value of the cost function converges. The convergence may be deemed to have occurred if the variation of the values of the cost function in two or more consecutive iterations is smaller than a threshold (e.g., a constant) . As still another example, the termination condition may be satisfied when a specified number of iterations are performed in the training process.
- the trained machine learning model may be determined based on the updated parameters. In some embodiments, the trained machine learning model (i.e., the universal speech recognition model) may be transmitted to the storage device 120, the storage module 408, or any other storage device for storage.
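- The iterative training just described can be summarized by a schematic loop; `model.predict`, `cost_fn`, and `update_fn` are placeholders for whatever speech model, cost function, and parameter-update rule are actually used.

```python
import numpy as np

def train(model, groups, actual_texts, cost_fn, update_fn,
          threshold=1e-3, max_iter=10_000):
    # Predict a text content, compare it with the actual (desired) text via
    # the cost function, update the parameters, and stop once the cost is
    # below the threshold, converges, or the iteration budget is spent.
    prev_cost = np.inf
    for i in range(max_iter):
        group = groups[i % len(groups)]
        actual = actual_texts[i % len(groups)]
        predicted = model.predict(group)
        cost = cost_fn(predicted, actual)
        if cost < threshold or abs(prev_cost - cost) < threshold:
            break
        update_fn(model, group, actual)   # adjust/update model parameters
        prev_cost = cost
    return model
```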
- the processing device 112 may obtain a plurality of groups of special audio data of one or more users playing in a scene of an RPG.
- a group of special audio data may include information associated with a communication of a user (e.g., player) occurring in the scene of the RPG.
- the user may communicate with a real-life player or a character in the RPG to generate voice signals picked up by a voice pickup device (e.g., a microphone) associated with a terminal (e.g., a game machine) of the user.
- the voice signals may be transformed into special audio data and be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390, the storage unit 426) .
- a group of special audio data may include a plurality of phoneme sets, each of which includes one or more phonemes. Each phoneme set may correspond to a pronunciation of a word.
- a group of special audio data may include a plurality of word sets, each of which includes one or more words.
- a group of special audio data may include a plurality of phrase sets, each of which includes one or more phrases.
- Each group of the plurality of groups of special audio data may correspond to an actual text content indicating semantic information of a communication of the user in the scene of the RPG. The actual text content may be denoted by one or more words or phrases.
- the actual text content corresponding to each group of the plurality of groups of special audio data may be determined based on the each group of the plurality of groups of special audio data by an operator (e.g., an engineer) manually.
- the processing device 112 may use the plurality of groups of special audio data to train the universal speech recognition model to obtain a special speech recognition model.
- a training process of the special speech recognition model may refer to training the universal speech recognition model to obtain the special speech recognition model.
- the special speech recognition model may be obtained by training the universal speech recognition model using the plurality of groups of special audio data.
- the training process of the special speech recognition model may be similar to or same as the training process of the universal speech recognition model as described in operation 920.
- the special speech recognition model may be obtained by training the universal speech recognition model via performing a plurality of iterations. For each of the plurality of iterations, a specific group of special audio data may first be inputted into the universal speech recognition model.
- the universal speech recognition model may extract one or more phonemes, letters, characters, words, phrases, sentences etc., included in the specific group of special audio data.
- the universal speech recognition model may determine a predicted text content corresponding to the specific group of special audio data.
- the predicted text content may then be compared with an actual text content (i.e., a desired text content) corresponding to the specific group of special audio data based on a cost function. If the value of the cost function exceeds a threshold in a current iteration, parameters of the universal speech recognition model may be adjusted and updated to make the value of the cost function (i.e., the difference between the predicted text content and the actual text content) smaller than the threshold.
- another group of special audio data may be inputted into the universal speech recognition model to train the universal speech recognition model as described above. Then the plurality of iterations may be performed to update the parameters of the universal speech recognition model until a termination condition is satisfied.
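- Under the same schematic assumptions as the loop sketched earlier, obtaining the special model amounts to continuing training from the universal model's parameters on the special (RPG) audio data.

```python
# Fine-tune the already trained universal model on the RPG-specific data;
# `universal_model`, `special_groups`, `special_texts`, `cost_fn`, and
# `update_fn` are the same placeholder names as in the earlier sketch.
special_model = train(universal_model, special_groups, special_texts,
                      cost_fn, update_fn)
```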
- training sets of the universal speech recognition model and/or the special speech recognition model may be updated based on added data (e.g., the audio data of the user obtained in 502) over a period (e.g., every other month, every two months, etc. ) .
- the universal speech recognition model and/or the special speech recognition model may be updated according to an instruction of a user, clinical demands, the updated training set, or a default setting of the emotion recognition system 100.
- the universal speech recognition model and/or the special speech recognition model may be updated at set intervals (e.g., every other month, every two months, etc. ) .
- the universal speech recognition model and/or the special speech recognition model may be updated based on added data in the training sets of the universal speech recognition model and/or the special speech recognition model over a period (e.g., every other month, every two months, etc. ) . If the quantity of the added data in the training set over a period of time is greater than a threshold, the universal speech recognition model and/or the special speech recognition model may be updated based on the updated training set.
- one or more operations may be omitted and/or one or more additional operations may be added.
- operation 910 and operation 920 may be combined into a single operation to obtain the universal speech recognition model.
- one or more operations may be added into the process 900.
- the universal audio data may be preprocessed by one or more preprocessing operations (e.g., a denoising operation) .
- FIG. 10A is a flowchart illustrating an exemplary process 1000 for determining an acoustic based emotion recognition model according to some embodiments of the present disclosure. At least a portion of process 1000 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1000 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 1000 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in a form of electronic current or electrical signals.
- the processing device 112 may obtain a plurality of groups of voice signals. Each group of the plurality of voice signals may include one or more acoustic characteristics.
- the processing device 112 e.g., the obtaining module 430
- the one or more acoustic characteristics may include one or more features associated with duration, one or more features associated with energy, one or more features associated with fundamental frequency, one or more features associated with frequency spectrum, etc.
- a feature associated with duration may be also referred to as a duration feature.
- Exemplary duration features may include a speaking speed, a short time average zero-crossing rate, etc.
- a feature associated with energy may be also referred to as an energy or amplitude feature.
- Exemplary amplitude features may include a short time average energy, a short time average amplitude, a short time energy gradient, an average amplitude change rate, a short time maximum amplitude, etc.
- a feature associated with fundamental frequency may be also referred to as a fundamental frequency feature.
- Exemplary fundamental frequency features may include a pitch, a fundamental frequency, an average fundamental frequency, a maximum fundamental frequency, a fundamental frequency range, etc.
- Exemplary features associated with frequency spectrum may include formant features, linear predictive coding cepstrum coefficients (LPCC) , mel-frequency cepstrum coefficients (MFCC) , features of the smoothed pitch contour and its derivatives, etc.
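- As an illustration of how duration, energy, fundamental frequency, and spectral features of this kind might be extracted in practice, a minimal sketch using the librosa library is given below; the disclosure does not mandate any particular toolkit, and the file name and parameter values are assumptions.

```python
import librosa
import numpy as np

# Hypothetical input file; any mono voice recording would do.
y, sr = librosa.load("voice_sample.wav", sr=16000)

zcr = librosa.feature.zero_crossing_rate(y)           # short time zero-crossing rate (duration-related)
rms = librosa.feature.rms(y=y)                        # short time energy / amplitude feature
f0, voiced_flag, voiced_prob = librosa.pyin(          # fundamental frequency (pitch) contour
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # spectral (MFCC) features

# Simple per-utterance summary, e.g., average fundamental frequency and F0 range.
f0_valid = f0[~np.isnan(f0)]
summary = {
    "rms_mean": float(rms.mean()),
    "zcr_mean": float(zcr.mean()),
    "f0_mean": float(f0_valid.mean()) if f0_valid.size else 0.0,
    "f0_range": float(f0_valid.max() - f0_valid.min()) if f0_valid.size else 0.0,
}
```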
- the plurality of groups of voice signals may be generated by different users communicating in different scenes.
- the voice signals may be generated by a speechmaker and/or a participant communicating in a meeting scene.
- the voice signals may be obtained from a passenger and/or a driver in a travel scene.
- the plurality of groups of voice signals may be generated by one or more users communicating in a same scene.
- the plurality of groups of voice signals may be generated by one or more users playing in one or more scenes of an RPG.
- the plurality of groups of voice signals may be generated by one or more testers.
- Each group of the plurality of groups of voice signals or acoustic characteristics may correspond to a label indicating an actual emotion that each group of the plurality of groups of voice signals or acoustic characteristics reflects.
- the label corresponding to each group of the plurality of groups of voice signals or acoustic characteristics may denote a category and/or degree of the actual emotion that each group of the plurality of groups of voice signals or acoustic characteristics reflects.
- the label may be one of positive, negative, and else (e.g., neutral) .
- the label may be one of “joy” , “anger” , “fear” , “disgust” , “surprise” , “sadness” , and else (e.g., neutral) .
- the label may be one of “interest” , “desire” , “sorrow” , “wonder” , “surprise” , “happiness” , and else (e.g., neutral) .
- the label may be one of “anxiety” , “anger” , “sadness” , “disgust” , “happiness” , and else (e.g., neutral) .
- the label may be one of “pleasure” , “pain” , and else (e.g., neutral) .
- the label may include strong and weak, or first level, second level, and third level, etc.
- the label corresponding to each group of the plurality of groups of voice signals or acoustic characteristics may be determined manually by an operator (e.g., an engineer) based on each group of the plurality of groups of voice signals or acoustic characteristics.
- the processing device 112 may use the plurality of groups of voice signals to train a machine learning model to obtain an acoustic based emotion recognition model.
- the machine learning model may include a linear regression model, a Kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof.
- the machine learning model may be trained by performing a plurality of iterations. For each of the plurality of iterations, a specific group of voice signals or acoustic characteristics may first be inputted into the machine learning model.
- the machine learning model may determine a predicted emotion corresponding to the specific group of voice signals or acoustic characteristics.
- the predicted emotion may then be compared with a label (i.e., an actual emotion) of the specific group of voice signals or acoustic characteristics based on a cost function.
- the cost function of the machine learning model may be configured to assess a difference between an estimated value (e.g., the predicted emotion) of the machine learning model and a desired value (e.g., the label or the actual emotion) . If the value of the cost function exceeds a threshold in a current iteration, parameters of the machine learning model may be adjusted and updated to make the value of the cost function (i.e., the difference between the predicted emotion and the actual emotion) smaller than the threshold.
- a termination condition may provide an indication of whether the machine learning model is sufficiently trained. For example, the termination condition may be satisfied if the value of the cost function associated with the machine learning model is minimal or smaller than a threshold (e.g., a constant) . As another example, the termination condition may be satisfied if the value of the cost function converges.
- the convergence may be deemed to have occurred if the variation of the values of the cost function in two or more consecutive iterations is smaller than a threshold (e.g., a constant) .
- the termination condition may be satisfied when a specified number of iterations have been performed in the training process.
- the trained machine learning model (i.e., the acoustic based emotion recognition model) may be determined based on the updated parameters.
- the acoustic based emotion recognition model may be configured to estimate an emotion based on one or more acoustic characteristics. For example, the acoustic based emotion recognition model may determine a category and/or degree of an emotion based on one or more acoustic characteristics. The category and/or degree of an emotion estimated by the acoustic based emotion recognition model may be associated with labels of the plurality of groups of voice signals or acoustic characteristics in a training set.
- the category of an emotion estimated by the acoustic based emotion recognition model may be one of positive, negative, and else (e.g., neutral) .
- the category of an emotion estimated by the acoustic based emotion recognition model may be one of “joy” , “anger” , “fear” , “disgust” , “surprise” , “sadness” , and else (e.g., neutral) .
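- As the support vector machine (SVM) model is one of the candidate machine learning models listed above, a minimal training sketch using scikit-learn is given below; the feature matrix, labels, and hyperparameters are hypothetical placeholders for the labeled groups of voice signals described above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical training set: one acoustic feature vector per group of voice signals,
# with a manually assigned emotion label per group.
X_train = np.random.rand(200, 40)
y_train = np.random.choice(
    ["joy", "anger", "fear", "disgust", "surprise", "sadness", "neutral"], 200)

acoustic_model = make_pipeline(StandardScaler(), SVC(probability=True))
acoustic_model.fit(X_train, y_train)

# At inference time the model returns one probability per predetermined emotion.
probs = acoustic_model.predict_proba(np.random.rand(1, 40))
```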
- FIG. 10B is a flowchart illustrating an exemplary process 1050 for determining a content based emotion recognition model according to some embodiments of the present disclosure. At least a portion of process 1050 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1050 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 1050 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in a form of electronic current or electrical signals.
- the processing device 112 may obtain a plurality of groups of audio data.
- a group of audio data may include a plurality of phoneme sets, each of which includes one or more phonemes. Each phoneme set may correspond to a pronunciation of a word.
- a group of universal audio data may include a plurality of word sets, each of which includes one or more words.
- a group of universal audio data may include a plurality of phrase sets, each of which includes one or more phrases.
- the plurality of groups of audio data may be generated by different users communicating in different scenes.
- a group of audio data may be generated by a speechmaker and/or a participant communicating in a meeting scene.
- a group of audio data may be obtained from a passenger and/or a driver in a travel scene.
- the plurality of groups of audio data may be generated by one or more users communicating in a same scene.
- the plurality of groups of audio data may be generated by one or more users playing in a scene of RPG.
- the plurality of groups of audio data may be generated by one or more testers.
- the processing device 112 may use speech recognition to convert each group of the plurality of groups of audio data to obtain a result of the speech recognition comprising a text content of each of the plurality of groups of audio data.
- a speech recognition model may be used to obtain the text content of each group of audio data.
- Exemplary speech recognition models may include a Hidden Markov model (HMMs) , a dynamic time warping (DTW) -based speech recognition model, an artificial neural network model, an end-to-end automatic speech recognition model, or the like, or any combination thereof.
- the speech recognition model may be a universal speech recognition model (e.g. a deep neural network model) .
- the universal speech recognition model may be trained using universal training data.
- the universal training data may include a plurality of groups of universal audio data corresponding to universal audio scenes, such as, a meeting scene, a working scene, a game scene, a party scene, a travel scene, a play scene, or the like, or any combination thereof.
- the speech recognition model may be a special speech recognition model for the RPG.
- the special speech recognition model may be obtained by training the universal speech recognition model or a machine learning model using special training data.
- the special training data may include special audio data corresponding to special audio scenes of the RPG. More descriptions for the speech recognition model may be found elsewhere in the present disclosure (e.g., FIG. 9, and the descriptions thereof) .
- the text content of each group of the plurality of groups of audio data may correspond to a label indicating an actual emotion that each group of the plurality of groups of audio data reflects.
- the label corresponding to each group of the plurality of groups of audio data may denote a category and/or degree of the actual emotion that each group of the plurality of groups of audio data reflects.
- the label may be one of positive, negative, and else (e.g., neutral) .
- the label may be one of “joy” , “anger” , “fear” , “disgust” , “surprise” , “sadness” , and else (e.g., neutral) .
- the label may be one of “interest” , “desire” , “sorrow” , “wonder” , “surprise” , “happiness” , and else (e.g., neutral) .
- the label may be one of “anxiety” , “anger” , “sadness” , “disgust” , “happiness” , and else (e.g., neutral) .
- the label may be one of “pleasure” , “pain” , and else (e.g., neutral) .
- the label may include strong and weak, or first level, second level, and third level, etc.
- the label corresponding to each group of the plurality of groups of audio data may be determined manually by an operator (e.g., an engineer) based on each group of the plurality of groups of audio data.
- the processing device 112 may use the text content of each group of audio data to train a machine learning model to obtain a content based emotion recognition model.
- the machine learning model may include a linear regression model, a Kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof.
- the machine learning model may be a fast text model which may quickly classify the text content of each group of the plurality of groups of audio data into different text types.
- a training process of the content based emotion recognition model may be similar to or the same as the training process of the acoustic based emotion recognition model.
- the content based emotion recognition model may be obtained by performing a plurality of iterations. For each of the plurality of iterations, a text content of a specific group of audio data may first be inputted into the machine learning model.
- the machine learning model may determine a predicted emotion corresponding to the text content of the specific group of audio data.
- the predicted emotion may then be compared with an actual emotion (i.e., a label) corresponding to the text content of the specific group of audio data based on a cost function.
- parameters of the machine learning model may be adjusted and updated to make the value of the cost function (i.e., the difference between the predicted emotion and the actual emotion) smaller than the threshold. Accordingly, in a next iteration, another text content of another group of audio data may be inputted into the machine learning model to train the machine learning model as described above. The plurality of iterations may be performed to update the parameters of the machine learning model until a termination condition is satisfied. The content based emotion recognition model may be obtained based on the updated parameters.
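- Assuming the fast text model mentioned above refers to a fastText-style classifier, training might look like the sketch below; the label prefix, file name, and hyperparameters are assumptions, and the disclosure only requires that each text content carry an emotion label.

```python
import fasttext

# Hypothetical training file "emotion_train.txt", one labeled text content per line, e.g.:
#   __label__anger I have had enough of this
#   __label__joy that was a wonderful surprise
model = fasttext.train_supervised(input="emotion_train.txt", epoch=25, lr=0.5, wordNgrams=2)

# Predict a probability for every predetermined emotion (k=-1 returns all labels).
labels, probabilities = model.predict("oh my god", k=-1)
```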
- one or more operations may be omitted and/or one or more additional operations may be added.
- the process 1000 and the process 1050 may be combined into a single process to train a fixed emotion recognition model.
- the fixed emotion recognition model may be composed of an acoustic based emotion recognition model and a text based emotion recognition model.
- FIG. 11 is a flowchart illustrating an exemplary process for determining an emotion of a user according to some embodiments of the present disclosure. At least a portion of process 1100 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1100 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 1100 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in a form of electronic current or electrical signals.
- the real intention of the user may be affected by different emotions under the same text content corresponding to an inputted voice (i.e., audio data) .
- the real intention may be the same as or opposite to the original meaning of the text content.
- the text content of a voice may be positive, negative, or neutral.
- the text content of the voice may be positive, indicating that the real meaning of the inputted voice is the same as the original meaning of the word “agree.”
- the processing device 112 may use speech recognition to convert audio data of the user in the scene to obtain a result of the speech recognition comprising a text content of the user’s voice signal.
- the processing device 112 may use a speech recognition model to obtain the result of the speech recognition.
- exemplary speech recognition models may include a Hidden Markov model (HMMs) , a dynamic time warping (DTW) -based speech recognition model, an artificial neural network model, an end-to-end automatic speech recognition model, or the like, or any combination thereof.
- the speech recognition model may be a universal speech recognition model (e.g. a deep neural network model) . More descriptions for speech recognition may be found elsewhere in the present disclosure (e.g., operation 630, FIG. 9 and the descriptions thereof) .
- the processing device 112 may determine a first probability corresponding to each of one or more predetermined emotions based on a text vector corresponding to the text content.
- the predetermined emotions may include “joy, ” “anger, ” “fear, ” “disgust, ” “surprise, ” or “sadness. ”
- the predetermined emotions may include “interest, ” “desire, ” “sorrow, ” “wonder, ” “surprise, ” or “happiness. ”
- the predetermined emotions may include “anxiety, ” “anger, ” “sadness, ” “disgust, ” or “happiness. ”
- the predetermined emotions may include “pleasure” or “pain.”
- the first probability may indicate a possibility of the text content expressing each of the predetermined emotions.
- the first probability may include a probability of the text content expressing “anger, ” a probability of the text content expressing “happiness, ” a probability of the text content expressing “sadness, ” a probability of the text content expressing “disgust, ” a probability of the text content expressing “surprise, ” a probability of the text content expressing “fear, ” etc.
- the first probability of the text content expressing each of the predetermined emotions may be determined based on the text vector corresponding to the text content. More descriptions about the determination of the first probability may be found elsewhere in the present disclosure. See, for example, FIG. 12 and descriptions thereof.
- the processing device 112 may determine a second probability corresponding to the each of one or more predetermined emotions based on acoustic characteristics of the audio data.
- the acoustic characteristics of the audio data may be identified and/or determined from the audio data of the user using an acoustic characteristic extraction technique (e.g., an ACF algorithm, an AMDF algorithm, etc. ) .
- the acoustic characteristics may include a zero-crossing rate, a root-mean-square (RMS) energy, F0 (or referred to as pitch or fundamental frequency) , a harmonics-to-noise ratio (HNR) , mel-frequency cepstral coefficients (MFCC) , etc.
- the acoustic characteristics may be set according to actual needs, and the present disclosure is not intended to be limiting.
- the acoustic characteristics may include other characteristics as described elsewhere in the present disclose (e.g., operation 502 and descriptions thereof) .
- the acoustic characteristics of the audio data, such as tone and intonation, may represent emotions of a user when he/she inputs a voice (e.g., the audio data) .
- the acoustic characteristics may indicate whether the text content of the voice (e.g., the audio data) is positive or negative.
- the second probability may indicate a possibility of the acoustic characteristics expressing each of the one or more predetermined emotions.
- the processing device 112 may determine the second probability corresponding to the each of one or more predetermined emotions based on the acoustic characteristics of the audio data. For example, the processing device 112 may determine the second probability corresponding to the each of one or more predetermined emotions based on an MFCC.
- the processing device 112 may determine the second probability using an acoustic based emotion recognition model as described elsewhere in the present disclosure (e.g., FIG. 5 and the descriptions thereof) . More descriptions about determination of the second probability may be found elsewhere in the present disclosure (e.g., FIG. 13 and descriptions thereof) .
- the processing device 112 may determine an emotion degree corresponding to each of the one or more predetermined emotions based on at least one of the first probability and the second probability.
- the emotion degree may be used to denote an intensity of each of the predetermined emotions of the user in the scene.
- the emotion degree may be denoted by a level, such as strong or weak, or first level, second level, or third level, etc.
- the emotion degree may be denoted by a score, such as high or low. The higher the emotion degree corresponding to each of the one or more predetermined emotions is, the more likely the emotion represented by the user is to be the predetermined emotion.
- the emotion degree corresponding to each of the one or more predetermined emotions may be determined based on the first probability and the second probability corresponding to the each of the one or more predetermined emotions.
- the emotion degree may be determined based on the first probability, the second probability, and weight values assigned to the first probability and the second probability.
- the weight values may be used to represent importance degrees of the text content (e.g., represented by the first probability) and the acoustic characteristics (e.g., represented by the second probability) in determining emotions of the voice signal.
- the emotion degree may be determined accurately.
- the first weight value may be assigned to the first probability corresponding to each of the predetermined emotions based on the text content
- the second weight value may be assigned to the second probability corresponding to each of the predetermined emotions based on the acoustic characteristics.
- the first weight value may be 2, and the second weight value may be 1.
- the weight values may be default settings or set under different conditions. It should be noted that first weight values assigned to first probabilities corresponding to predetermined emotions may be the same or different, and the second weight values assigned to first probabilities corresponding to predetermined emotions may be the same or different.
- an emotion degree corresponding to the same predetermined emotion may be obtained by the following Equation (1) :
- p denotes the second probability
- q denotes the first probability
- W1 denotes a weight value of the first probability
- W2 denotes a weight value of the second probability
- y5 denotes the emotion degree.
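- The image of Equation (1) is not reproduced in this text. Based on the variable definitions above, one plausible reading (an assumption, not the literal equation of the original filing) is a weighted combination of the first probability and the second probability:

```latex
y_{5} = W_{1} \cdot q + W_{2} \cdot p
```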
- the weight values assigned to the first probability corresponding to each of the predetermined emotions may be the same or different.
- the weight values assigned to the second probability corresponding to each of the predetermined emotions may be the same or different.
- the processing device 112 may determine the emotion degree corresponding to each of the one or more predetermined emotions based on the first probability.
- the processing device 112 may determine the emotion degree corresponding to each of the one or more predetermined emotions based on the second probability. In some embodiments, the processing device 112 may determine a first emotion degree corresponding to each of the one or more predetermined emotions based on the first probability. The processing device 112 may determine a second emotion degree corresponding to each of the one or more predetermined emotions based on the second probability. In some embodiments, the processing device 112 may compare the first emotion degree and the second emotion degree and determine a maximum or minimum of the first emotion degree and the second emotion degree as the emotion degree.
- the processing device 112 may determine an emotion of the user based on the emotion degree corresponding to each of the one or more predetermined emotions.
- the processing device 112 may rank the predetermined emotions corresponding to the emotion degrees according to levels or scores representing the emotion degrees (e.g., in an ascending or descending order) .
- the processing device 112 may determine a predetermined emotion with the highest level or highest score of the emotion degree as an emotion of the user.
- the processing device 112 may send the emotion and the text content to a terminal device.
- the terminal device when receiving the text content and the emotion from the processing device 112, the terminal device (e.g., the terminal 130, the terminal 140) may recognize the user's actual intention through the text content and the emotion to perform operations in the scene (e.g., adjusting a plot of the RPG, pushing a plot of the RPG) .
- if the emotion is “happy” and the text content is “agree,” the terminal device may perform the operation of “agree” in the scene.
- if the emotion is “unhappy” and the text content is “agree,” the terminal device may perform an operation different from “agree,” such as “disagree.”
- one or more operations may be omitted and/or one or more additional operations may be added.
- operation 1130 may be omitted and an emotion degree corresponding to each of the one or more predetermined emotions may be determined based on the first probability in 1140.
- FIG. 12 is a flowchart illustrating an exemplary process for determining a first probability corresponding to each of one or more predetermined emotions according to some embodiments of the present disclosure. At least a portion of process 1200 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1200 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 1200 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in a form of electronic current or electrical signals.
- the process 1200 may be performed to accomplish operation 1120 in FIG. 11.
- the processing device 112 may determine a word vector corresponding to each of one or more words in a text content.
- the processing device 112 may determine a word vector corresponding to each of one or more words in the text content based on a word vector dictionary.
- the word vector dictionary may provide a mapping relationship between a set of words and word vectors. Each of the set of words in the word vector dictionary corresponds to one of the word vectors.
- the word vector dictionary may be set in advance, and stored in the storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) .
- the processing device 112 may search for each of the one or more words in the text content from the word vector dictionary and determine a word vector corresponding to the each of the one or more words in the word vector dictionary.
- the processing device 112 may determine three word vectors corresponding to three words “Oh, ” “my, ” “god, ” respectively, from the word vector dictionary.
- the three word vectors may be denoted as word vector 1, word vector 2, and word vector 3.
- the processing device 112 may determine a text vector by summing word vectors.
- the text vector may correspond to the text content.
- the text vector may be determined by summing word vectors.
- the obtained word vectors corresponding to three words “Oh, ” “my, ” “god” may include word vector 1, word vector 2, and word vector 3.
- the processing device 112 may sum word vector 1, word vector 2, and word vector 3 to obtain a sum result, i.e., the text vector.
- the sum result may be determined as the text vector corresponding to the text content “Oh my god. ”
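- A minimal sketch of this lookup-and-sum step is given below; the toy word vector dictionary and its three-dimensional vectors are purely illustrative.

```python
import numpy as np

# Hypothetical word vector dictionary (real dictionaries map to hundreds of dimensions).
word_vector_dictionary = {
    "oh":  np.array([0.1, 0.3, 0.0]),
    "my":  np.array([0.2, 0.1, 0.4]),
    "god": np.array([0.5, 0.2, 0.1]),
}

def text_to_vector(text_content):
    """Look up a word vector for each word and sum them into a text vector."""
    words = text_content.lower().split()
    dim = len(next(iter(word_vector_dictionary.values())))
    text_vector = np.zeros(dim)
    for word in words:
        text_vector += word_vector_dictionary.get(word, np.zeros(dim))  # unknown words contribute zero
    return text_vector

text_vector = text_to_vector("Oh my god")   # word vector 1 + word vector 2 + word vector 3
```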
- the processing device 112 may determine a first probability corresponding to each of one or more predetermined emotions by inputting the text vector into a content based emotion recognition model.
- the content based emotion recognition model may be configured to determine the first probability based on the text vector.
- the content based emotion recognition model may be determined by training a machine learning model using a training set.
- the training set may include a plurality of text vectors obtained from a plurality of text contents of a plurality of groups of audio data.
- the text vector may be input into the content based emotion recognition model to determine the first probability corresponding to each of the predetermined emotions expressed by the text content.
- the content based emotion recognition model herein may be represented by Equation (2) .
- the first probability may be determined after N iterations through the following Equation (2) :
- W_H1 denotes a learnable parameter
- x1 denotes an input parameter in the nth iteration
- n belongs to [1, N] and is a positive integer
- N is a positive integer greater than or equal to 1
- H1 denotes a function, which is different according to the value of n
- y1 denotes the first probability.
- when the value of n is less than N, H1 denotes the function relu (W_H1 · x1) .
- when the value of n is N, H1 denotes the function softmax (W_H1 · x1) .
- when the value of n is 1, the text vector may be used as the input parameter (i.e., x1) .
- when the value of n belongs to [2, N] , the result of the last iteration is used as the input parameter of the current iteration.
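- The image of Equation (2) is likewise not reproduced here. The description suggests an N-layer feed-forward network whose hidden iterations apply relu(W_H1 · x1) and whose final iteration applies softmax; the sketch below makes only that assumption concrete, with arbitrary layer sizes.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_emotion_probabilities(text_vector, weights):
    """Iterate the assumed form of Equation (2): relu layers for n < N, softmax at n = N.

    `weights` is a list of learnable matrices W_H1 (one per iteration); their values
    and shapes are hypothetical and would be learned during training.
    """
    x1 = text_vector
    for n, W_H1 in enumerate(weights, start=1):
        if n < len(weights):
            x1 = relu(W_H1 @ x1)       # result of this iteration feeds the next
        else:
            x1 = softmax(W_H1 @ x1)    # y1: first probability per predetermined emotion
    return x1

# Toy usage: a 3-dimensional text vector mapped to 7 predetermined emotions in N = 2 iterations.
weights = [np.random.randn(16, 3), np.random.randn(7, 16)]
y1 = content_emotion_probabilities(np.array([0.8, 0.6, 0.5]), weights)
```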
- FIG. 13 is a flowchart illustrating an exemplary process for determining a second probability corresponding to each of multiple predetermined emotions according to some embodiments of the present disclosure. At least a portion of process 1300 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1300 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 1300 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in a form of electronic current or electrical signals.
- the process 1300 may be performed to accomplish operation 1130 in FIG. 11.
- the processing device 112 may determine an MFCC corresponding to each of multiple frames of the audio data by performing a Fourier transform on the audio data.
- the audio data may include a target portion, a mute portion, and a noise portion.
- the target portion of the audio data may refer to speech to be recognized input by the user.
- the mute portion of the audio data may refer to one or more pauses (e.g., there is no voice) among words and/or sentences during speaking.
- the noise portion may be caused by noise hindrance from the surroundings (e.g., voices from other people, walking sounds, etc. ) during speaking.
- the target portion of the audio data needs to be identified and processed since only the target portion of the audio data relates to voice control, thereby reducing the amount of data processing.
- the processing device 112 may identify the target portion of the audio data based on different acoustic characteristics corresponding to the target portion, the mute portion and the noise portion of the audio data. For example, the processing device 112 may determine the MFCC corresponding to each of multiple frames in the audio data by performing the Fourier transform on the audio data. Based on different MFCCs of multiple frames corresponding to the target portion, the mute portion and the noise portion of the audio data, the processing device 112 may determine the target portion.
- the processing device 112 may identify each of the multiple frames based on the MFCC to obtain a target portion of the audio data.
- the processing device 112 may determine a fourth probability that each of multiple frames in audio data belongs to each of multiple audio categories by inputting an MFCC corresponding to each of the multiple frames into a trained audio category identification model.
- the processing device 112 may designate a specific audio category of a frame that corresponds to a maximum fourth probability among multiple fourth probabilities of the multiple frames as an audio category of the audio data.
- the processing device 112 may determine a target portion of the audio data based on the audio category of the audio data. More descriptions about identification of each of the multiple frames based on the MFCC may be found elsewhere in the present disclosure. See, FIG. 14 and descriptions thereof.
- the processing device 112 may determine a second probability corresponding to each of multiple predetermined emotions based on the target portion of the audio data.
- the acoustic characteristics of the target portion may be used to obtain the second probability.
- the processing device 112 may determine a difference between acoustic characteristics of each two adjacent frames in a target portion of audio data.
- the processing device 112 may determine statistics of each acoustic characteristic of the target portion of the audio data by determining a same acoustic characteristic in a first feature set and a second feature set.
- the processing device 112 may determine a second probability by inputting the statistics of each acoustic characteristic into an acoustic based emotion recognition model as described elsewhere in the present disclosure (e.g., FIG. 5 and the descriptions thereof) . More descriptions for determination of a second probability corresponding to each of multiple predetermined emotions based on the MFCC may be found elsewhere in the present disclosure. See, FIG. 15 and descriptions thereof.
- FIG. 14 is a flowchart illustrating an exemplary process for determining a target portion in audio data according to some embodiments of the present disclosure. At least a portion of process 1400 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1400 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 1400 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in a form of electronic current or electrical signals.
- the process 1400 may be performed to accomplish operation 1320 in FIG. 13.
- the processing device 112 may determine a fourth probability that each of multiple frames in audio data belongs to each of multiple audio categories by inputting an MFCC corresponding to each of the multiple frames into a trained audio category identification model.
- the audio categories may include a target category, a mute category, and a noise category.
- the fourth probability may include a probability that each of multiple frames in the audio data belongs to the target category, a probability that each of multiple frames in the audio data belongs to the mute category, and a probability that each of multiple frames in the audio data belongs to the noise category.
- a machine learning model may be previously obtained by using a training set including samples (e.g., audio data) each of which includes a target portion, samples each of which includes a mute portion, and samples each of which includes a noise portion.
- the trained category identification model may be obtained by training the machine learning model.
- the trained category identification model may identify the MFCC of each frame in the audio data. For recognizing the MFCC of each frame in the audio data, an M-layer MLP network may be used, and each layer of the network may use the following Equation (3) to perform calculations.
- the trained audio category identification model may be represented by Equation (3) .
- the fourth probability that each frame in the audio data belongs to each of multiple audio categories may be determined based on following Equation (3) :
- W_H2 denotes a learnable parameter
- x2 denotes an input parameter in the mth iteration
- m belongs to [1, M] and is a positive integer
- M is a positive integer greater than or equal to 1.
- when the value of m is less than M, H2 denotes a function of relu (W_H2 · x2) .
- when the value of m is M, H2 denotes a function of softmax (W_H2 · x2) .
- when the value of m is 1, the MFCC of each frame in the audio data is used as the input parameter.
- when the value of m belongs to [2, M] , the result of the last iteration is used as the input parameter of the current iteration.
- the processing device 112 may designate a specific audio category that corresponds to a maximum fourth probability among multiple fourth probabilities of each of the multiple frames as an audio category of the frame.
- the specific audio category corresponding to the maximum fourth probability among the multiple fourth probabilities may be designated as the audio category of the specific frame.
- the processing device 112 may determine a target portion of the audio data based on the audio category of the each of the multiple frames.
- the processing device 112 may determine the frames in the audio data each of which has the audio category of the target category to obtain the target portion of the audio data.
- for example, if the audio data includes 10 frames and the audio category of each of the first, fifth, and eighth frames is the target category, the first, fifth, and eighth frames may be determined as the target portion of the audio data.
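- The per-frame classification and target-frame selection can be sketched as below; audio_category_model is a hypothetical callable standing in for the trained audio category identification model and returns one fourth probability per audio category.

```python
import numpy as np

CATEGORIES = ["target", "mute", "noise"]

def select_target_frames(frame_mfccs, audio_category_model):
    """Pick the frames whose most probable audio category is the target category.

    frame_mfccs: array of shape (num_frames, num_mfcc). audio_category_model is a
    placeholder for the trained model of FIG. 14 and returns one probability per category.
    """
    target_frames = []
    for index, mfcc in enumerate(frame_mfccs):
        fourth_probabilities = audio_category_model(mfcc)            # e.g., [0.7, 0.2, 0.1]
        category = CATEGORIES[int(np.argmax(fourth_probabilities))]  # category with maximum probability
        if category == "target":
            target_frames.append(index)
    return target_frames

# Toy usage with a dummy stand-in for the trained model.
dummy_model = lambda mfcc: np.array([0.7, 0.2, 0.1]) if mfcc.mean() > 0 else np.array([0.1, 0.6, 0.3])
target_indices = select_target_frames(np.random.randn(10, 13), dummy_model)
```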
- FIG. 15 is a flowchart illustrating an exemplary process for determining a second probability corresponding to each of multiple predetermined emotions according to some embodiments of the present disclosure. At least a portion of process 1500 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1500 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 1500 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in a form of electronic current or electrical signals.
- the process 1500 may be performed to accomplish operation 1330 in FIG. 13.
- the processing device 112 may determine a difference between each of acoustic characteristics of each two adjacent frames in a target portion of audio data.
- the target portion of the audio data may be determined as described elsewhere in the present disclosure (e.g., FIG. 14 and the descriptions thereof) .
- the acoustic characteristics of each two adjacent frames may include a zero-crossing rate, an RMS energy, F0, HNR, MFCC, etc.
- the difference between each of the acoustic characteristics of each two adjacent frames may include a difference between zero-crossing rates of each two adjacent frames, a difference between RMS energies of each two adjacent frames, a difference between F0s of each two adjacent frames, a difference between HNRs of each two adjacent frames, a difference between MFCCs of each two adjacent frames, etc.
- the difference between an acoustic characteristic of each two adjacent frames may be determined by subtracting the acoustic characteristic of a previous frame in the each two adjacent frames from the acoustic characteristic of a frame next to the previous frame in the each two adjacent frames in the target portion of the audio data.
- the difference between the zero-crossing rate of each two adjacent frames may include a difference between zero-crossing rates of the first frame and the fifth frame, a difference between zero-crossing rates of the fifth frame and the eighth frame;
- the difference between the RMS energy of each two adjacent frames may include a difference between RMS energies of the first frame and the fifth frame, a difference between RMS energies of the fifth frame and the eighth frame;
- the difference between the F0 of each two adjacent frames may include a difference between F0s of the first frame and the fifth frame, a difference between F0s of the fifth frame and the eighth frame;
- the difference between the HNR of each two adjacent frames may include a difference between HNRs of the first frame and the fifth frame, a difference between HNRs of the fifth frame and the eighth frame;
- the differences between zero-crossing rates of the first frame and the fifth frame, the fifth frame and the eighth frame may be determined by subtracting a zero-crossing rate of the first frame from a zero-crossing rate of the fifth frame, a zero-crossing rate of the fifth frame from a zero-crossing rate of the eighth frame, respectively.
- the other differences between acoustic characteristics of each two adjacent frames may be determined in the same way.
- a frame before the first frame may be regarded as frame 0, and the acoustic characteristics of frame 0 may be set equal to 0.
- the first frame may then be regarded as frame 1, and the difference between frame 1 and frame 0 may be equal to the acoustic characteristics of frame 1.
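- The adjacent-frame differencing, with a zero-valued frame 0 prepended before the first target frame, might be implemented as below; the placeholder feature matrix stands in for the acoustic characteristics of the target frames.

```python
import numpy as np

def adjacent_frame_differences(target_features):
    """target_features: shape (num_target_frames, num_characteristics), one row per
    frame of the target portion (zero-crossing rate, RMS energy, F0, HNR, MFCC, ...).

    A zero-valued frame 0 is prepended so that the first difference equals the
    acoustic characteristics of the first target frame, as described above.
    """
    padded = np.vstack([np.zeros((1, target_features.shape[1])), target_features])
    return np.diff(padded, axis=0)   # next frame minus previous frame

target_features = np.random.rand(3, 5)                       # e.g., the first, fifth, and eighth frames
first_feature_set = adjacent_frame_differences(target_features)
second_feature_set = target_features                         # raw characteristics of each target frame
```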
- the processing device 112 may determine a statistic result associated with each acoustic characteristic of the target portion of the audio data in each of a first feature set and a second feature set.
- the first feature set may include the difference between each of the acoustic characteristics of each two adjacent frames in the target portion of the audio data, an acoustic characteristic of the first frame in the target portion of the audio data, or the like, or any combination thereof.
- the second feature set may include acoustic characteristics of each frame in the target portion of the audio data.
- the statistic result may include a first statistic result associated with each acoustic characteristic in the first feature set and a second statistic result associated with each acoustic characteristic in the second feature set.
- the processing device 112 may determine a first statistic result associated with the differences of the MFCC in the first feature set and determine a second statistic result associated with the MFCC in the second feature set.
- the processing device 112 may determine the first statistic result associated with each acoustic characteristic of the target portion of the audio data by performing a statistical calculation based on the first feature set and/or the second statistic result associated with each acoustic characteristic of the target portion of the audio data by performing a statistical calculation based on the second feature set.
- the statistic result may include one or more statistics associated with one or more statistic factors.
- Exemplary statistic factors may include a mean, a variance, a skewness, a kurtosis, extreme point information (e.g., an extreme point value, an extreme point position, an extreme point range) of the statistic, a slope after linear regression, or the like, or any combination thereof.
- a count of the one or more statistics of acoustic characteristics of the target portion of the audio data in the first feature set and the second feature set may be associated with a count of the one or more statistic factors of an acoustic characteristic (denoted as X) and a count of the acoustic characteristics of a frame (denoted as Y) .
- the count of the one or more statistics of the acoustic characteristics of the target portion of the audio data may be 2*X*Y.
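- A sketch of the statistics step is given below; it computes a mean, variance, skewness, kurtosis, extreme-point values, and a linear-regression slope for every acoustic characteristic in both feature sets, giving the 2*X*Y statistics mentioned above. The factor list follows the examples in the text, and the placeholder arrays are hypothetical.

```python
import numpy as np
from scipy import stats

first_feature_set = np.random.rand(3, 5)    # adjacent-frame differences (placeholder)
second_feature_set = np.random.rand(3, 5)   # raw per-frame characteristics (placeholder)

def characteristic_statistics(feature_set):
    """feature_set: shape (num_frames, num_characteristics). Returns X statistic
    factors for each of the Y characteristics (columns)."""
    results = []
    frame_index = np.arange(feature_set.shape[0])
    for column in feature_set.T:                                   # one acoustic characteristic at a time
        slope = np.polyfit(frame_index, column, 1)[0] if column.size > 1 else 0.0
        results.append({
            "mean": float(column.mean()),
            "variance": float(column.var()),
            "skewness": float(stats.skew(column)),
            "kurtosis": float(stats.kurtosis(column)),
            "max": float(column.max()),                            # extreme point information
            "min": float(column.min()),
            "slope": slope,                                        # slope after linear regression
        })
    return results

# 2 * X * Y statistics: X factors per characteristic, Y characteristics, 2 feature sets.
statistic_result = (characteristic_statistics(first_feature_set)
                    + characteristic_statistics(second_feature_set))
```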
- the processing device 112 may determine a second probability by inputting the statistic result of each acoustic characteristic into an acoustic based emotion recognition model.
- the acoustic based emotion recognition model may be determined by training a machine learning model (e.g., a classifier) by a processing device that is the same as or different from the processing device 112.
- the acoustic based emotion recognition model may be configured to determine the second probability corresponding to each predetermined emotion expressed by the target portion of the audio data.
- the second probability corresponding to each predetermined emotion may be determined by inputting the statistic result of each acoustic characteristic into the acoustic based emotion recognition model.
- an iteration result may be determined after L iterations through the following Equation (4) :
- W_H3 and W_T denote learnable parameters
- x3 denotes an input parameter in the kth iteration
- k belongs to [1, L] and is a positive integer
- L is a positive integer greater than or equal to 1
- H3 denotes the function of relu (W_H3 · x3)
- T denotes Sigmoid (W_T · x3) .
- when k is equal to 1, the statistic result of each acoustic characteristic of the target portion of the audio data is used as the input parameter.
- when k belongs to [2, L] , the result of the last iteration is used as the input parameter of the current iteration.
- the second probability may be obtained by the following Equation (5) :
- H4 denotes Softmax (W_H4 · x4)
- W_H4 is a learnable parameter
- x4 denotes the iteration result obtained from Equation (4) .
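- The images of Equations (4) and (5) are not reproduced in this text. Based on the descriptions of H3, T, and H4, one consistent (but assumed) reading is a gated, highway-style transformation iterated L times, followed by a softmax output layer; the sketch below makes only that assumption concrete, with arbitrary layer sizes.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def acoustic_emotion_probabilities(statistics_vector, hidden_weights, gate_weights, W_H4):
    """Assumed form of Equations (4)-(5): in each of L iterations, combine
    H3 = relu(W_H3 . x3) with the gate T = sigmoid(W_T . x3); Equation (5) then
    applies softmax(W_H4 . x4) to the final iteration result. All weights are
    hypothetical learnable parameters."""
    x3 = statistics_vector
    for W_H3, W_T in zip(hidden_weights, gate_weights):
        H3 = relu(W_H3 @ x3)
        T = sigmoid(W_T @ x3)
        x3 = H3 * T + x3 * (1.0 - T)       # assumed gated (highway-style) combination
    return softmax(W_H4 @ x3)              # second probability per predetermined emotion

# Toy usage with L = 2 iterations over a 10-dimensional statistics vector.
dims = 10
hidden = [np.random.randn(dims, dims) for _ in range(2)]
gates = [np.random.randn(dims, dims) for _ in range(2)]
W_H4 = np.random.randn(7, dims)
second_probability = acoustic_emotion_probabilities(np.random.rand(dims), hidden, gates, W_H4)
```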
- the training of the acoustic based emotion recognition model may be set according to actual needs, and is not specifically limited herein.
- FIG. 16 is a flowchart illustrating an exemplary process for determining an emotion of a user based on at least one of a text content and one or more acoustic characteristics in a scene according to some embodiments of the present disclosure. At least a portion of process 1600 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1600 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1.
- one or more operations in the process 1600 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300.
- the instructions may be transmitted in a form of electronic current or electrical signals.
- the processing device 112 may acquire audio data of a user in a scene.
- the audio data may be acquired from voice signals of the user playing in the scene.
- the voice signals may be generated when a user plays in a scene of a role-playing game (RPG) .
- the voice signals of the user may be obtained by the obtaining module 410 from the terminal 130, the terminal 140, a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as described elsewhere in the present disclosure.
- the audio data of the user may include semantic information of the voice signals of the user that may reflect the text content of the voice signals of the user.
- Exemplary audio data may include a plurality of phoneme sets, a plurality of word sets, a plurality of phrase sets, etc. More description for acquiring of the audio data may be found elsewhere in the present disclosure (e.g., FIG. 5 and the descriptions thereof) .
- the processing device 112 may use speech recognition to convert audio data of the user in the scene to obtain a result of the speech recognition comprising a text content of the user’s voice signal. More description of obtaining results of the speech recognition comprising text of the user’s voice signals may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 11, and the descriptions thereof) .
- the processing device 112 may determine one or more acoustic characteristics from the audio data.
- the acoustic characteristics of the user may be determined from the audio data of the user using an acoustic characteristic extraction technique.
- Exemplary acoustic characteristic extraction techniques may include using an autocorrelation function (ACF) algorithm, an average amplitude difference function (AMDF) algorithm, a nonlinear feature extraction algorithm based on teager energy operator (TEO) , a linear predictive analysis (LPC) algorithm, a deep learning algorithm (e.g., a Laplacian Eigenmaps, a principal component analysis (PCA) , a local preserved projection (LPP) , etc. ) , etc. More description for determining acoustic characteristics may be found elsewhere in the present disclosure (e.g., FIG. 5 and FIG. 11, and the descriptions thereof) .
- the processing device 112 may determine an emotion of the user based on at least one of the text content and the one or more acoustic characteristics.
- the processing device 112 may obtain an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user.
- the processing device 112 may obtain a content based emotion recognition model configured to determine an emotion of the user based on the text content derived from the audio data of the user.
- the processing device 112 may determine the emotion of the user based on the at least one of the text content and the one or more acoustic characteristics using the acoustic based emotion recognition model and/or the content based emotion recognition model. More descriptions for determining the emotion of the user using the acoustic based emotion recognition model and/or the content based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 5-8, and the descriptions thereof) .
- the processing device 112 may determine a first probability corresponding to each of one or more predetermined emotions based on a text vector corresponding to the text content and a second probability corresponding to the each of one or more predetermined emotions based on acoustic characteristics of the audio data.
- the processing device 112 may determine an emotion degree corresponding to each of the one or more predetermined emotions based on at least one of the first probability and the second probability.
- the processing device 112 may determine the emotion of the user based on the emotion degree corresponding to each of the one or more predetermined emotions. More description of the determination of the emotion based on the emotion degree may be found elsewhere in the present disclosure (e.g., FIGs. 11-14 and descriptions thereof) .
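- Putting operations 1610 through 1640 together, a highly simplified end-to-end sketch is given below. The speech recognizer and the two recognition models are hypothetical callables standing in for the trained models of this disclosure, and the weight values follow the illustrative 2:1 example given earlier.

```python
import numpy as np

EMOTIONS = ["joy", "anger", "fear", "disgust", "surprise", "sadness", "neutral"]

def recognize_emotion(audio_data, speech_recognizer, content_model, acoustic_model,
                      w_text=2.0, w_acoustic=1.0):
    """Sketch of process 1600: text content + acoustic characteristics -> emotion.

    speech_recognizer, content_model, and acoustic_model are placeholders for the
    trained models; w_text / w_acoustic mirror the illustrative weights above."""
    text_content = speech_recognizer(audio_data)                  # operation 1620
    first_probability = content_model(text_content)               # per predetermined emotion
    second_probability = acoustic_model(audio_data)               # per predetermined emotion
    emotion_degree = w_text * first_probability + w_acoustic * second_probability
    emotion = EMOTIONS[int(np.argmax(emotion_degree))]            # highest emotion degree wins
    return emotion, text_content

# Toy usage with dummy stand-ins for the trained components.
dummy_recognizer = lambda audio: "agree"
dummy_content = lambda text: np.array([0.5, 0.1, 0.05, 0.05, 0.1, 0.1, 0.1])
dummy_acoustic = lambda audio: np.array([0.1, 0.4, 0.1, 0.1, 0.1, 0.1, 0.1])
emotion, text = recognize_emotion(np.zeros(16000), dummy_recognizer, dummy_content, dummy_acoustic)
```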
- the processing device 112 may send at least one of the emotion and the text content to a terminal device.
- the terminal device may perform voice control based on the text content and/or the emotion. For example, the terminal device may adjust a plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined real time emotion of the user in the scene. More descriptions for adjustment a plot of the RPG subsequent to the scene and/or an element of the RPG may be found elsewhere in the present disclosure (e.g., FIG. 5, and the descriptions thereof) .
- aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or context including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely hardware, entirely software (including firmware, resident software, micro-code, etc. ) or combining software and hardware implementation that may all generally be referred to herein as a "block, " “module, ” “engine, ” “unit, ” “component, ” or “system. ” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
- a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof.
- a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the “C” programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a software as a service (SaaS) .
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- User Interface Of Digital Computer (AREA)
- Machine Translation (AREA)
Abstract
An emotion recognition method is provided. The method includes: obtaining voice signals including audio data of a user in a scene (1610); using speech recognition to convert audio data of the user in the scene to obtain a result of the speech recognition comprising a text content of the user's voice signal (1620); determining one or more acoustic characteristics from the audio data (1630); determining an emotion of the user based on at least one of the text content and the one or more acoustic characteristics (1640); sending at least one of the emotion and the text content to a terminal device (1650).
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 201910411095.9, filed on May 17, 2019, the contents of which are incorporated herein by reference.
The present disclosure generally relates to emotion recognition, and specifically, to systems and methods for emotion recognition for voice control.
With the rapid development of online games, more and more people choose to relax through role-playing games (RPGs). At present, the interaction and/or communication among real-life players and/or game characters in RPGs is performed by manually clicking a mouse or keyboard to advance the game's plot, which cannot provide players with a good game experience. Usually, a client terminal (e.g., a game console or machine, a mobile phone) associated with an RPG may be configured with a voice pickup device (e.g., a microphone) that may acquire voice signals of users of RPGs. The voice signals of users may indicate emotions of the users playing RPGs. It is desirable to provide systems and methods for emotion recognition with improved accuracy.
SUMMARY
According to an aspect of the present disclosure, a system for emotion recognition is provided. The system may include at least one storage medium storing a set of instructions and at least one processor configured to communicate with the at least one storage medium. When executing the set of instructions, the at least one processor may be directed to cause the system to obtain voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user. The at least one processor may be further directed to cause the system to optionally determine an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user. The at least one processor may be further directed to cause the system to optionally determine a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user. The at least one processor may be further directed to cause the system to determine a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
In some embodiments, the target real time emotion determination step may comprise the sub step of using the content based emotion recognition model to perform a correction of the acoustic based real time emotion of the user to obtain a corrected real time emotion as the target real time emotion of the user.
In some embodiments, the correction of the real time emotion may comprise using the content based real time emotion of the user as the corrected real time emotion of the user.
In some embodiments, the target real time emotion determination step may comprise the sub step of determining the target real time emotion of the user by comparing the acoustic based real time emotion and the content based real time emotion of the user.
In some embodiments, to determine the target real time emotion of the user by comparing the acoustic based real time emotion and the content based real time emotion of the user, the at least one processor may be further directed to cause the system to use the acoustic based emotion recognition model to determine a first confidence level for the acoustic based real time emotion. The at least one processor may be further directed to cause the system to use the content based emotion recognition model to determine a second confidence level for the content based real time emotion. The at least one processor may be further directed to cause the system to compare the first confidence level and the second confidence level to determine one of the acoustic based real time emotion and the content based real time emotion that corresponds to a higher confidence level as the target real time emotion.
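For illustration only, the following is a minimal Python sketch of this comparison, assuming each recognition model returns an emotion label together with a confidence level; the function and variable names are hypothetical and are not part of the claimed system.

```python
def pick_target_emotion(acoustic_result, content_result):
    """Select the real time emotion whose model reported the higher confidence level."""
    acoustic_emotion, first_confidence = acoustic_result
    content_emotion, second_confidence = content_result
    # Keep the acoustic based emotion when the first confidence level is higher (or tied).
    if first_confidence >= second_confidence:
        return acoustic_emotion
    return content_emotion

# Example: the content based model is more confident, so its emotion is kept.
print(pick_target_emotion(("anger", 0.62), ("neutral", 0.78)))  # neutral
```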
In some embodiments, to determine the acoustic based real time emotion of the user, the at least one processor may be further directed to cause the system to obtain base acoustic characteristics of the user acquired before the scene of the user. The at least one processor may be further directed to cause the system to calibrate the acoustic characteristics of the user in the scene with the base acoustic characteristics of the user to obtain calibrated acoustic characteristics of the user in the scene. The at least one processor may be further directed to cause the system to use the acoustic based emotion recognition model to determine, based on the calibrated acoustic characteristics of the user in the scene, the acoustic based real time emotion of the user.
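A minimal sketch of one possible calibration, assuming it amounts to removing the user's baseline acoustic characteristics acquired before the scene; the feature layout and the subtraction rule are assumptions made only for illustration.

```python
import numpy as np

def calibrate_acoustic_characteristics(scene_features, base_features):
    """Remove the user's base acoustic characteristics from the in-scene characteristics."""
    return np.asarray(scene_features) - np.asarray(base_features)

# Hypothetical feature vector: [average F0 (Hz), RMS energy, speaking speed (words/min)].
scene = [220.0, 0.08, 150.0]
base = [180.0, 0.05, 120.0]
print(calibrate_acoustic_characteristics(scene, base))  # [40.    0.03  30.  ]
```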
In some embodiments, the content based real time emotion determination step may comprise the sub steps of using a speech recognition model to convert the audio data of the user in the scene into a text content. The content based real time emotion determination step may also comprise the sub steps of using the content based emotion recognition model to determine, based on the text content, the content based real time emotion of the user.
In some embodiments, to obtain the speech recognition model, the at least one processor may be further directed to cause the system to obtain a plurality of groups of universal audio data of one or more subjects communicating in one or more circumstances. The at least one processor may be further directed to cause the system to determine a universal speech recognition model by training a machine learning model using the plurality of groups of universal audio data. The at least one processor may be further directed to cause the system to obtain a plurality of groups of special audio data of one or more subjects associated with the scene. The at least one processor may be further directed to cause the system to use the plurality of groups of special audio data to train the universal speech recognition model to determine the speech recognition model.
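As a rough illustration of the two-stage idea only (train a universal model first, then continue training on scene-specific data), the sketch below uses a generic incremental classifier from scikit-learn in place of a real speech recognition model; the random arrays merely stand in for audio features and transcription labels.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
universal_X = rng.normal(size=(500, 20))     # placeholder universal audio features
universal_y = rng.integers(0, 5, size=500)   # placeholder transcription labels
special_X = rng.normal(size=(80, 20))        # placeholder scene-specific audio features
special_y = rng.integers(0, 5, size=80)

model = SGDClassifier()
# Stage 1: train a "universal" model on the universal audio data.
model.partial_fit(universal_X, universal_y, classes=np.arange(5))
# Stage 2: continue training the same model on special audio data associated with the scene.
for _ in range(5):
    model.partial_fit(special_X, special_y)
```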
In some embodiments, to obtain the acoustic based emotion recognition model, the at least one processor may be further directed to cause the system to obtain a plurality of groups of acoustic characteristics associated with the scene of users. The at least one processor may be further directed to cause the system to use the plurality of groups of acoustic characteristics to train a first machine learning model to determine the acoustic based emotion recognition model.
In some embodiments, the first machine learning model may include a support vector machine.
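A hedged sketch of training such a support vector machine with scikit-learn; the synthetic feature vectors and labels below are placeholders for real groups of acoustic characteristics and their annotated emotions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 16))   # placeholder acoustic characteristic vectors
y_train = rng.choice(["anger", "happiness", "sadness", "neutral"], size=300)

# Train the acoustic based emotion recognition model as an SVM classifier.
acoustic_emotion_model = SVC(kernel="rbf", probability=True)
acoustic_emotion_model.fit(X_train, y_train)

sample = rng.normal(size=(1, 16))
print(acoustic_emotion_model.predict(sample))        # predicted emotion label
print(acoustic_emotion_model.predict_proba(sample))  # per-emotion probabilities
```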
In some embodiments, to obtain the content based emotion recognition model, the at least one processor may be further directed to cause the system to obtain a plurality of groups of audio data associated with the scene of users. The at least one processor may be further directed to cause the system to convert each group of the audio data into a text content. The at least one processor may be further directed to cause the system to use the text content to train a second machine learning model to determine the content based emotion recognition model.
In some embodiments, the second machine learning model may include a text classifier.
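Similarly, a minimal sketch of a content based emotion recognition model as a plain text classifier; the tiny labelled corpus is fabricated solely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I can't believe you did that",
         "this is wonderful news",
         "leave me alone",
         "I am so excited to play"]
labels = ["anger", "happiness", "sadness", "happiness"]

# Train the content based emotion recognition model on text contents.
content_emotion_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
content_emotion_model.fit(texts, labels)
print(content_emotion_model.predict(["what wonderful news to hear"]))
```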
In some embodiments, the voice signals of the user may be acquired when the user plays an RPG, and the at least one processor may perform additional operations including adjusting, based on the target real time emotion of the user in the scene, a plot of the RPG subsequent to the scene.
In some embodiments, the user may have a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG, and to adjust, based on the target real time emotion of the user, a plot of the RPG, the at least one processor may be further directed to cause the system to determine, based on the target real time emotion of the user, the relationship between the user and the one or more real life players or the one or more characters in the RPG. The at least one processor may be further directed to cause the system to adjust, based on the determined relationship, the plot of the RPG.
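For illustration only, a toy sketch of mapping a target real time emotion to a relationship and then to a plot branch; the relationship labels and plot branches are invented for the example and are not taken from the disclosure.

```python
def infer_relationship(target_emotion):
    """Map the user's real time emotion toward a player or character to a relationship."""
    if target_emotion == "happiness":
        return "friendly"
    if target_emotion == "anger":
        return "hostile"
    return "neutral"

def adjust_plot(target_emotion):
    """Choose the plot subsequent to the scene from the inferred relationship."""
    relationship = infer_relationship(target_emotion)
    return {"friendly": "alliance_quest",
            "hostile": "rivalry_quest",
            "neutral": "main_quest"}[relationship]

print(adjust_plot("anger"))  # rivalry_quest
```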
In some embodiments, the at least one processor may be further directed to cause the system to adjust, based on the target real time emotion of the user in the scene, an element of the RPG in the scene. The element of the RPG includes at least one of a vision effect associated with the RPG in the scene, a sound effect associated with the RPG in the scene, a display interface element associated with the RPG in the scene or one or more props used in the RPG in the scene.
According to another aspect of the present disclosure, a method for emotion recognition is provided. The method may include obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user. The method may further include optionally determining an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user. The method may further include optionally determining a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user. The method may further include determining a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
According to still a further aspect of the present disclosure, a non-transitory computer readable medium is provided. The non-transitory computer readable medium may store instructions that, when executed by a computer, cause the computer to implement a method. The method may include one or more of the following operations. The method may include obtaining voice signals of a user playing in a scene, the voice signals comprising acoustic characteristics and audio data of the user. The method may further include optionally determining an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user. The method may further include optionally determining a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user. The method may further include determining a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
According to another aspect of the present disclosure, a system for emotion recognition is provided. The system may include an obtaining module and an emotion recognition module. The obtaining module may be configured to obtain voice signals of a user, the voice signals comprising acoustic characteristics and audio data of the user. The emotion recognition module may be configured to optionally determine an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user. The emotion recognition module may also be configured to optionally determine a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user.
According to another aspect of the present disclosure, a system for emotion recognition is provided. The system may include at least one storage medium storing a set of instructions and at least one processor configured to communicate with the at least one storage medium. When executing the set of instructions, the at least one processor may be directed to cause the system to obtain voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user. The at least one processor may be directed to cause the system to determine one or more acoustic characteristics of the user from the voice signals. The at least one processor may be directed to cause the system to determine one or more text contents derived from the audio data of the user. The at least one processor may be directed to cause the system to determine a target real time emotion of the user in the scene based on the one or more acoustic characteristics and the one or more text contents.
In some embodiments, the at least one processor may be further directed to cause the system to send the target real time emotion of the user and the one or more text contents to a terminal device for voice control.
According to another aspect of the present disclosure, a method is provided. The method may be implemented on a computing device including a storage device and at least one processor for emotion recognition. The method may include obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user. The method may include determining one or more acoustic characteristics of the user from the voice signals. The method may include determining one or more text contents derived from the audio data of the user. The method may include determining a target real time emotion of the user in the scene based on the one or more acoustic characteristics and the one or more text contents.
According to another aspect of the present disclosure, a non-transitory computer readable medium is provided. The non-transitory computer readable medium may store instructions that, when executed by a computer, cause the computer to implement a method. The method may include one or more of the following operations. The method may include obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user. The method may include determining one or more acoustic characteristics of the user from the voice signals. The method may include determining one or more text contents derived from the audio data of the user. The method may further include determining a target real time emotion of the user in the scene based on the one or more acoustic characteristics and the one or more text contents.
Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
FIG. 1 is a schematic diagram illustrating an exemplary emotion recognition system according to some embodiments of the present disclosure;
FIG. 2 is a schematic diagram illustrating exemplary hardware and software components of a computing device according to some embodiments of the present disclosure;
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device on which a terminal may be implemented according to some embodiments of the present disclosure;
FIG. 4A is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure;
FIG. 4B is a block diagram illustrating an exemplary model determination module according to some embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating an exemplary process for adjusting the plot of an RPG according to some embodiments of the present disclosure;
FIG. 6 is a flowchart illustrating an exemplary process for adjusting the plot of an RPG according to some embodiments of the present disclosure;
FIG. 7 is a flowchart illustrating an exemplary process for adjusting the plot of an RPG according to some embodiments of the present disclosure;
FIG. 8 is a flowchart illustrating an exemplary process for adjusting the plot of an RPG according to some embodiments of the present disclosure;
FIG. 9 is a flowchart illustrating an exemplary process for obtaining a speech recognition model according to some embodiments of the present disclosure;
FIG. 10A is a flowchart illustrating an exemplary process for determining an acoustic based emotion recognition model according to some embodiments of the present disclosure;
FIG. 10B is a flowchart illustrating an exemplary process for determining a content based emotion recognition model according to some embodiments of the present disclosure;
FIG. 11 is a flowchart illustrating an exemplary process for determining an emotion of a user according to some embodiments of the present disclosure;
FIG. 12 is a flowchart illustrating an exemplary process for determining a first probability corresponding to each of one or more predetermined emotions according to some embodiments of the present disclosure;
FIG. 13 is a flowchart illustrating an exemplary process for determining a second probability corresponding to each of multiple predetermined emotions according to some embodiments of the present disclosure;
FIG. 14 is a flowchart illustrating an exemplary process for determining a target portion in audio data according to some embodiments of the present disclosure;
FIG. 15 is a flowchart illustrating an exemplary process for determining a second probability corresponding to each of multiple predetermined emotions according to some embodiments of the present disclosure; and
FIG. 16 is a flowchart illustrating an exemplary process for determining an emotion of a user based on at least one of a text content and one or more acoustic characteristics in a scene according to some embodiments of the present disclosure.
In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant disclosure. However, it should be apparent to those skilled in the art that the present disclosure may be practiced without such details. In other instances, well-known methods, procedures, systems, components, and/or circuitry have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present disclosure. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but to be accorded the widest scope consistent with the claims.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a, ” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise, ” “comprises, ” and/or “comprising, ” “include, ” “includes, ” and/or “including, ” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be understood that the term “system, ” “engine, ” “unit, ” “module, ” and/or “block” used herein are one method to distinguish different components, elements, parts, sections or assembly of different levels in ascending order. However, the terms may be displaced by another expression if they achieve the same purpose.
Generally, the word “module, ” “unit, ” or “block, ” as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions. A module, a unit, or a block described herein may be implemented as software and/or hardware and may be stored in any type of non-transitory computer-readable medium or another storage device. In some embodiments, a software module/unit/block may be compiled and linked into an executable program. It will be appreciated that software modules can be callable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts. Software modules/units/blocks configured for execution on computing devices may be provided on a computer-readable medium, such as a compact disc, a digital video disc, a flash drive, a magnetic disc, or any other tangible medium, or as a digital download (and can be originally stored in a compressed or installable format that needs installation, decompression, or decryption prior to execution) . Such software code may be stored, partially or fully, on a storage device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an erasable programmable read-only memory (EPROM) . It will be further appreciated that hardware modules/units/blocks may be included in connected logic components, such as gates and flip-flops, and/or can be included of programmable units, such as programmable gate arrays or processors. The modules/units/blocks or computing device functionality described herein may be implemented as software modules/units/blocks but may be represented in hardware or firmware. In general, the modules/units/blocks described herein refer to logical modules/units/blocks that may be combined with other modules/units/blocks or divided into sub-modules/sub-units/sub-blocks despite their physical organization or storage. The description may be applicable to a system, an engine, or a portion thereof.
It will be understood that when a unit, engine, module or block is referred to as being “on, ” “connected to, ” or “coupled to, ” another unit, engine, module, or block, it may be directly on, connected or coupled to, or communicate with the other unit, engine, module, or block, or an intervening unit, engine, module, or block may be present, unless the context clearly indicates otherwise. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
These and other features, and characteristics of the present disclosure, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, may become more apparent upon consideration of the following description with reference to the accompanying drawings, all of which form a part of this disclosure. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended to limit the scope of the present disclosure. It is understood that the drawings are not to scale.
The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may not be implemented in order. Conversely, the operations may be implemented in an inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
Embodiments of the present disclosure may be applied to different transportation systems including but not limited to land transportation, sea transportation, air transportation, space transportation, or the like, or any combination thereof. A vehicle of the transportation systems may include a rickshaw, travel tool, taxi, chauffeured car, hitch, bus, rail transportation (e.g., a train, a bullet train, high-speed rail, and subway) , ship, airplane, spaceship, hot-air balloon, driverless vehicle, or the like, or any combination thereof. The transportation system may also include any transportation system that applies management and/or distribution, for example, a system for sending and/or receiving an express.
The application scenarios of different embodiments of the present disclosure may include but are not limited to one or more webpages, browser plugins and/or extensions, client terminals, custom systems, intracompany analysis systems, artificial intelligence robots, or the like, or any combination thereof. It should be understood that the application scenarios of the system and method disclosed herein are only some examples or embodiments. Those having ordinary skill in the art, without further creative efforts, may apply these drawings to other application scenarios, for example, another similar server.
Some embodiments of the present disclosure provide systems and methods for emotion recognition for adjusting a plot of an RPG. A method may include obtaining voice signals of a user playing in a scene of a Role-playing game (RPG) . The voice signals may comprise acoustic characteristics and audio data of the user. The method may include optionally determining an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user. The method may also include optionally determining a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user. The method may also include determining a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene. The target real time emotion of the user may be determined based on the acoustic based real time emotion of the user and/or the content based real time emotion, which may improve an accuracy of the recognized emotion of the user.
In some embodiments, the method may further include adjusting, based on the target real time emotion of the user in the scene, a plot of the RPG subsequent to the scene. Thus, the development of the game's plot may be promoted based on spoken interaction or communication between users or characters of the RPG, which can bring users a better game experience, increase the appeal of the RPG, and attract more users.
FIG. 1 is a schematic diagram illustrating an exemplary emotion recognition system 100 according to some embodiments of the present disclosure. The emotion recognition system 100 may be a platform for data and/or information processing, for example, training a machine learning model for emotion recognition and/or data classification, such as text classification, etc. The emotion recognition system 100 may be applied in online game (e.g., a role-playing game (RPG) ) , artificial intelligence (AI) customer service, AI shopping guidance, AI tourist guidance, driving system (e.g., an automatic pilot system) , lie detection system, or the like, or a combination thereof. For example, for online games (e.g., a role-playing game (RPG) ) , a plot of an RPG may be adjusted and/or controlled based on emotions of users identified by the emotion recognition system 100. As another example, for artificial intelligence (AI) customer services, personalized information associated with different users may be recommended based on emotions of users identified by the emotion recognition system 100. The emotion recognition system 100 may recognize an emotion of a user based on, for example, facial expression images, voice signals, etc. The emotion recognition system 100 may include a server 110, a storage device 120, terminals 130 and 140, and a network 150.
The server 110 may process information and/or data relating to emotion recognition. In some embodiments, the server 110 may be a single server or a server group. The server group may be centralized, or distributed (e.g., the server 110 may be a distributed system) . In some embodiments, the server 110 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, the server 110 may be implemented on a computing device having one or more components illustrated in FIG. 2 in the present disclosure.
In some embodiments, the server 110 may include a processing device 112. The processing device 112 may process information and/or data relating to emotion recognition to perform one or more functions described in the present disclosure. For example, the processing device 112 may receive voice signals including acoustic characteristics and audio data of a user communicating or speaking in a scene from the terminal 130 or the terminal 140. The processing device 112 may obtain an acoustic based emotion recognition model and a content based emotion recognition model from the storage device 120. As another example, the processing device 112 may determine an acoustic based real time emotion of the user based on the acoustic characteristics using the acoustic based emotion recognition model. The processing device 112 may determine a content based real time emotion of the user based on a text content derived from the audio data of the user using the content based emotion recognition model. The text content may be derived from the audio data of the user using a speech recognition model. The processing device 112 may determine a target real time emotion of the user based on acoustic based emotion and the content based emotion of the user. As a further example, the processing device 112 may adjust a plot subsequent to the scene based on the target real time emotion of the user. The scene may be associated with an RPG, an AI customer service, an AI shopping guidance, an AI tourist guidance, a driving, a lie detection, etc.
In some embodiments, the determination and/or updating of models (e.g., the acoustic based emotion recognition model, the content based emotion recognition model, the speech recognition model) may be performed on a processing device, while the application of the models may be performed on a different processing device. In some embodiments, the determination and/or updating of the models may be performed on a processing device of a system different than the emotion recognition system 100 or a server different than the server 110 on which the application of the models is performed. For instance, the determination and/or updating of the models may be performed on a first system of a vendor who provides and/or maintains such a machine learning model and/or has access to training samples used to determine and/or update the machine learning model, while emotion recognition based on the provided machine learning model may be performed on a second system of a client of the vendor. In some embodiments, the determination and/or updating of the models may be performed online in response to a request for emotion recognition. In some embodiments, the determination and/or updating of the models may be performed offline.
In some embodiments, the processing device 112 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) . Merely by way of example, the processing device 112 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
The storage device 120 may store data and/or instructions related to content identification and/or data classification. In some embodiments, the storage device 120 may store data obtained/acquired from the terminal 130 and/or the terminal 140. In some embodiments, the storage device 120 may store data and/or instructions that the server 110 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, the storage device 120 may include a mass storage device, a removable storage device, a volatile read-and-write memory, a read-only memory (ROM), or the like, or any combination thereof. Exemplary mass storage devices may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage devices may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. Exemplary volatile read-and-write memory may include a random access memory (RAM). Exemplary RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), and a zero-capacitor RAM (Z-RAM), etc. Exemplary ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (PEROM), an electrically erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), and a digital versatile disk ROM, etc. In some embodiments, the storage device 120 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, the storage device 120 may be connected to or communicate with the server 110. The server 110 may access data or instructions stored in the storage device 120 directly or via a network. In some embodiments, the storage device 120 may be a part of the server 110.
The terminal 130 and/or the terminal 140 may provide data and/or information related to emotion recognition and/or data classification. The data and/or information may include images, text files, voice segments, web pages, video recordings, user requests, programs, applications, algorithms, instructions, computer codes, or the like, or a combination thereof. In some embodiments, the terminal 130 and/or the terminal 140 may provide the data and/or information to the server 110 and/or the storage device 120 of the emotion recognition system 100 for processing (e.g., train a machine learning model for emotion recognition) .
In some embodiments, the terminal 130 and/or the terminal 140 may be a device, a platform, or other entity interacting with the server 110. In some embodiments, the terminal 130 may be implemented in a device with data acquisition and/or data storage, such as a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, and a server 130-4, a storage device (not shown) , or the like, or any combination thereof. In some embodiments, the mobile devices 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, a game machine (or a game console) or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, a smart footgear, a smart glass, a smart helmet, a smartwatch, a smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a personal digital assistant (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, an augmented reality glass, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include a Google Glass, an Oculus Rift, a HoloLens, a Gear VR, etc. In some embodiments, the servers 130-4 may include a database server, a file server, a mail server, a web server, an application server, a computing server, a media server, a communication server, etc. The terminal 140 may be similar to or same as the terminal 130. For example, the terminal 140 may be implemented in a device with data acquisition and/or data storage, such as a mobile device 140-1, a tablet computer 140-2, a laptop computer 140-3, and a server 140-4, a storage device (not shown) , or the like, or any combination thereof.
In some embodiments, the terminal 130 (or the terminal 140) may be a client terminal. The client terminal may send and/or receive information for emotion recognition to the processing device 112 via a user interface. The user interface may be in the form of an application for an online game (e.g., an RPG platform) or emotion recognition implemented on the terminal 130 and/or the terminal 140. The user interface implemented on the terminal 130 and/or the terminal 140 may be configured to facilitate communication between users of the terminal 130 and/or the terminal 140, and the processing device 112. For example, each of the terminal 130 and/or the terminal 140 may be configured with a voice pickup device for acquiring voice signals of users. The terminal 130 and/or the terminal 140 may be installed with the same RPG platform. Each of the users of the terminal 130 and the terminal 140 may be a player of the RPG and have a game character in the RPG. The users of the terminal 130 and the terminal 140 may communicate with each other via the voice pickup device in the RPG platform. The game characters of the users playing in the RPG may communicate or interact with each other based on communication of the users via the voice pickup device. As another example, the processing device 112 may obtain voice signals of the users playing the RPG from the terminal 130 and the terminal 140. The processing device 112 may determine a real time emotion of at least one of the users of the terminal 130 and the terminal 140 based on methods as described elsewhere in the present disclosure. The processing device 112 may further adjust a plot of the RPG associated with at least one of the users based on the real time emotion.
In some embodiments, the terminal 130 (or the terminal 140) may be a server terminal. For example, the terminal 130 (or the terminal 140) may be a game server used to process and/or store data in response to one or more service requests when a user plays an online game (e.g., an RPG). In some embodiments, the terminal 130 (or the terminal 140) may obtain a real time emotion of the user playing the online game determined by the server 110 (e.g., the processing device 112) according to a method for emotion recognition as described elsewhere in the present disclosure. The terminal 130 (or the terminal 140) may adjust a plot of the online game (e.g., an RPG) based on the real time emotion of the user.
The network 150 may facilitate exchange of information and/or data. In some embodiments, one or more components in the emotion recognition system 100 (e.g., the server 110, the terminal 130, the terminal 140, or the storage device 120) may send information and/data to another component (s) in the emotion recognition system 100 via the network 150. In some embodiments, the network 150 may be any type of wired or wireless network, or any combination thereof. Merely by way of example, the network 150 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, an internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a wide area network (WAN) , a public telephone switched network (PTSN) , a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof. In some embodiments, the network 150 may include one or more network access points. For example, the network 150 may include wired or wireless network access points such as base stations and/or internet exchange points 150-1, 150-2…through which one or more components of the emotion recognition system 100 may be connected to the network 150 to exchange data and/or information.
FIG. 2 illustrates a schematic diagram of an exemplary computing device 200 according to some embodiments of the present disclosure. The computing device 200 may be a computer, such as the server 110 in FIG. 1 and/or a computer with specific functions, configured to implement any particular system according to some embodiments of the present disclosure. The computing device 200 may be configured to implement any component that performs one or more functions disclosed in the present disclosure. For example, the server 110 (e.g., the processing device 112) may be implemented in hardware devices, software programs, firmware, or any combination thereof of a computer like the computing device 200. For brevity, FIG. 2 depicts only one computing device. In some embodiments, the functions of the computing device may be implemented by a group of similar platforms in a distributed mode to disperse the processing load of the system.
The computing device 200 may include a communication terminal 250 that may connect with a network that may implement the data communication. The computing device 200 may also include a processor 220 that is configured to execute instructions and includes one or more processors. The schematic computer platform may include an internal communication bus 210, different types of program storage units and data storage units (e.g., a hard disk 270, a read-only memory (ROM) 230, a random-access memory (RAM) 240) , various data files applicable to computer processing and/or communication, and some program instructions executed possibly by the processor 220. The computing device 200 may also include an I/O device 260 that may support the input and output of data flows between the computing device 200 and other components. Moreover, the computing device 200 may receive programs and data via the communication network.
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device on which the terminal 130, the terminal 140, and the server 110, may be implemented according to some embodiments of the present disclosure. As illustrated in FIG. 3, the mobile device 300 may include, a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, a mobile operating system (OS) 370, application (s) , and a storage 390. In some embodiments, any other suitable component, including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
In some embodiments, the mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™, etc.) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340. The applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to image processing or other information from the emotion recognition system 100. User interactions with the information stream may be achieved via the I/O 350 and provided to the storage device 120, the server 110, and/or other components of the emotion recognition system 100. In some embodiments, the mobile device 300 may be an exemplary embodiment corresponding to a terminal associated with the emotion recognition system 100, e.g., the terminal 130 and/or the terminal 140.
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein. A computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device. A computer may also act as a system if appropriately programmed.
FIG. 4A is a block diagram illustrating an exemplary processing device 112 according to some embodiments of the present disclosure. The processing device 112 may include an obtaining module 410, a model determination module 420, an emotion recognition module 430, an adjustment module 440, and a sending module 450.
The obtaining module 410 may be configured to obtain audio data of a user in a scene. The audio data may be acquired from voice signals of the user playing in the scene. For example, the voice signals may be generated when a user plays in a scene of a role-playing game (RPG). The obtaining module 410 may be configured to obtain the voice signals of the user playing in the scene of the RPG. The voice signals of the user may comprise acoustic characteristics and audio data of the user. The voice signals of the user may be obtained by the obtaining module 410 from the terminal 130, the terminal 140, or a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390), as described elsewhere in the present disclosure.
The obtaining module 410 may use speech recognition to convert the audio data of the user in the scene to obtain a result of the speech recognition comprising a text content of the user's voice signal, for example, by using a speech recognition model to obtain the result of the speech recognition.
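As one possible illustration (not the disclosed speech recognition model), the widely used SpeechRecognition package can convert recorded audio into a text content; the file name is hypothetical and the call requires network access to a hosted recognition service.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
# "user_scene.wav" is a hypothetical recording of the user's voice signal in the scene.
with sr.AudioFile("user_scene.wav") as source:
    audio_data = recognizer.record(source)

# Convert the audio data into a text content via a hosted recognizer.
text_content = recognizer.recognize_google(audio_data)
print(text_content)
```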
The obtaining module 410 may also obtain models used in a process for emotion recognition, for example, an acoustic based emotion recognition model, a content based emotion recognition model, a speech recognition model, etc. The acoustic based emotion recognition model may be configured to determine an emotion of the user based on one or more acoustic characteristics of the user. The content based emotion recognition model may be configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user using a speech recognition model.
The model determination module 420 may be configured to determine one or more models used in a process for emotion recognition, for example, an acoustic based emotion recognition model, a content based emotion recognition model, a speech recognition model, a trained audio category identification model, etc. For example, the model determination module 420 may use a plurality of groups of voice signals to train a machine learning model to obtain an acoustic based emotion recognition model. The model determination module 420 may also use speech recognition to convert each group of the plurality of groups of audio data to obtain a result of the speech recognition comprising a text content of each of the plurality of groups of audio data. Further, the model determination module 420 may use the text content of each group of audio data to train a machine learning model to obtain a content based emotion recognition model. In some embodiments, the machine learning model may include a linear regression model, a kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof. As another example, the model determination module 420 may determine a universal speech recognition model and/or a special speech recognition model according to process 900. In some embodiments, the model determination module 420 may determine a trained audio category identification model according to process 1400.
The model determination module 420 may be configured to determine a first probability corresponding to each of one or more predetermined emotions based on a text vector corresponding to the text content. In some embodiments, the model determination module 420 may determine a word vector corresponding to each of one or more words in a text content. The model determination module 420 may determine a text vector by summing the word vectors. The model determination module 420 may determine a first probability corresponding to each of one or more predetermined emotions by inputting the text vector into a content based emotion recognition model.
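A schematic Python sketch of this step, in which random arrays stand in for a trained word embedding table and for the content based emotion recognition model; the softmax at the end is an assumption about how scores become first probabilities.

```python
import numpy as np

rng = np.random.default_rng(7)
word_vectors = {w: rng.normal(size=8) for w in ["i", "am", "so", "happy", "today"]}
emotions = ["anger", "happiness", "sadness", "neutral"]
W = rng.normal(size=(8, len(emotions)))   # placeholder classifier weights

def first_probabilities(text_content):
    """Sum word vectors into a text vector and score each predetermined emotion."""
    tokens = [t for t in text_content.lower().split() if t in word_vectors]
    text_vector = np.sum([word_vectors[t] for t in tokens], axis=0)
    scores = text_vector @ W
    probs = np.exp(scores - scores.max())
    return dict(zip(emotions, probs / probs.sum()))

print(first_probabilities("I am so happy today"))
```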
The model determination module 420 may determine a second probability corresponding to each of the one or more predetermined emotions based on acoustic characteristics of the audio data. In some embodiments, the model determination module 420 may determine an MFCC corresponding to each of multiple frames of the audio data by performing a Fourier transform on the audio data. The model determination module 420 may identify each of the multiple frames based on the MFCC to obtain a target portion of the audio data. The model determination module 420 may determine a second probability corresponding to each of multiple predetermined emotions based on the target portion of the audio data. The model determination module 420 may determine an emotion degree corresponding to each of the one or more predetermined emotions based on at least one of the first probability and the second probability.
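A hedged sketch of this pipeline using librosa for the MFCCs; the synthetic tone stands in for the user's audio data, the simple energy rule stands in for the frame identification step, and the equal-weight average used for the emotion degree is an assumption rather than the disclosed formula.

```python
import numpy as np
import librosa

sr_hz = 16000
audio = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr_hz) / sr_hz)   # 1 s placeholder signal

mfcc = librosa.feature.mfcc(y=audio, sr=sr_hz, n_mfcc=13)          # shape: (13, n_frames)
frame_score = np.abs(mfcc[0])                                      # crude per-frame score
target_portion = mfcc[:, frame_score > np.median(frame_score)]     # retained "target" frames

emotions = ["anger", "happiness", "sadness", "neutral"]
first_probability = np.array([0.1, 0.6, 0.1, 0.2])    # e.g., obtained from the text content
second_probability = np.array([0.2, 0.5, 0.1, 0.2])   # e.g., obtained from the target portion
emotion_degree = 0.5 * first_probability + 0.5 * second_probability
print(dict(zip(emotions, emotion_degree)))
```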
The emotion recognition module 430 may be configured to determine an emotion of the user based on at least one of the text content and the one or more acoustic characteristics. In some embodiments, the emotion recognition module 430 may determine the emotion of the user based on the at least one of the text content and the one or more acoustic characteristics using the acoustic based emotion recognition model and/or the content based emotion recognition model. In some embodiments, the emotion recognition module 430 may determine the emotion of the user based on the emotion degree corresponding to each of the one or more predetermined emotions.
The emotion recognition module 430 may be configured to determine a real time emotion of the user in the scene based on the voice signals using at least one of the acoustic based emotion recognition model or the content based recognition model.
In some embodiments, the emotion recognition module 430 may first determine an acoustic based real time emotion using the acoustic based emotion recognition model. Further, the emotion recognition module 430 may optionally perform correction of the acoustic based real time emotion of the user by determining a content based real time emotion of the user using the content based emotion recognition model.
In some embodiments, the emotion recognition module 430 may first determine the content based real time emotion using the content based emotion recognition model. Further, the emotion recognition module 430 may optionally perform correction of the content based real time emotion of the user by determining the acoustic based real time emotion of the user using the acoustic based emotion recognition model.
In some embodiments, the emotion recognition module 430 may determine the content based real time emotion using the content based emotion recognition model and the acoustic based real time emotion using the acoustic based emotion recognition model. The emotion recognition module 430 may compare the content based real time emotion and the acoustic based real time emotion. Further, the emotion recognition module 430 may determine the real time emotion of the user based on the comparison.
The adjustment module 440 may be configured to adjust the plot of the RPG subsequent to the scene based on the determined real time emotion of the user in the scene. In some embodiments, the emotion of the user (e.g., a player) of the RPG may reflect the user experience of the RPG, and the adjustment module 440 may adjust the plot of the RPG subsequent to the scene to improve the user experience based on the determined or corrected real time emotion of the user. In some embodiments, the user may have a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG. The adjustment module 440 may determine the relationship between the user and the at least one of the one or more real-life players of the RPG or the one or more characters in the RPG based on the determined or corrected real time emotion of the user. The adjustment module 440 may adjust the plot of the RPG based on the determined relationship.
The sending module 450 may be configured to send the emotion and the text content to a terminal device. In some embodiments, when receiving the text content and the emotion from the processing device 112, the terminal device (e.g., the terminal 130, the terminal 140) may recognize the user's actual intention through the text content and the emotion to perform operations in the scene (e.g., adjusting a plot of the RPG, pushing a plot of the RPG) .
FIG. 4B is a block diagram illustrating an exemplary model determination module 420 according to some embodiments of the present disclosure. The model determination module 420 may include a speech recognition model determination unit 422, an emotion recognition model determination unit 424, and a storage unit 426.
The speech recognition model determination unit 422 may be configured to use a plurality of groups of universal audio data to train a machine learning model to obtain a universal speech recognition model. Further, the speech recognition model determination unit 422 may use a plurality of groups of special audio data to train the universal speech recognition model to obtain a special speech recognition model.
The emotion recognition model determination unit 424 may be configured to use a plurality of groups of voice signals to train a machine learning model to obtain an acoustic based emotion recognition model. The emotion recognition model determination unit 424 may also be configured to use speech recognition to convert each group of the plurality of groups of audio data to obtain a result of the speech recognition comprising a text content of each of the plurality of groups of audio data. In some embodiments, a speech recognition model may be used to obtain the text content of each group of audio data. Further, the emotion recognition model determination unit 424 may use the text content of each group of audio data to train a machine learning model to obtain a content based emotion recognition model.
The storage unit 426 may be configured to store information. The information may include programs, software, algorithms, data, text, number, images and/or some other information. For example, the information may include data that may be used for the emotion recognition of the user. As another example, the information may include the models for the emotion recognition of the user. As still an example, the information may include training data for model determination.
It should be noted that the above description of the processing device 112 is provided for the purposes of illustration, and is not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, any module mentioned above may be implemented in two or more separate units. Additionally or alternatively, one or more modules mentioned above may be omitted.
FIG. 5 is a flowchart illustrating an exemplary process 500 for adjusting the plot of an RPG according to some embodiments of the present disclosure. At least a portion of process 500 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 500 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 500 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
In 502, the processing device 112 (e.g., the obtaining module 410) may obtain voice signals of a user playing in a scene of a role-playing game (RPG) . The voice signals of the user may comprise acoustic characteristics and audio data of the user. The voice signals of the user may be obtained by the obtaining module 410 from the terminal 130, the terminal 140, a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as described elsewhere in the present disclosure. For example, the voice signals of the user may be picked up by a voice pickup device (e.g., a microphone) of the terminal 130 (or the terminal 140) in real time. The obtaining module 410 may obtain the voice signals from the terminal 130 (or the terminal 140) or the voice pickup device in real time. As another example, the voice pickup device (e.g., a microphone) of the terminal 130 may transmit the voice signals of the user to the storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) . The obtaining module 410 may obtain the voice signals of the user from the storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) , periodically.
The acoustic characteristics of the user may include one or more features associated with duration, one or more features associated with energy, one or more features associated with fundamental frequency, one or more features associated with frequency spectrum, etc. A feature associated with duration may also be referred to as a duration feature. Exemplary duration features may include a speaking speed, a short time average zero-over rate, a zero-crossing rate, etc. A feature associated with energy may also be referred to as an energy or amplitude feature. Exemplary energy or amplitude features may include a short time average energy, a Root-Mean-Square (RMS) energy, a short time average amplitude, a short time energy gradient, an average amplitude change rate, a short time maximum amplitude, etc. A feature associated with fundamental frequency may be also referred to as a fundamental frequency feature. Exemplary fundamental frequency features may include a fundamental frequency, a pitch of the fundamental frequency (also referred to as F0) , an average fundamental frequency, a maximum fundamental frequency, a fundamental frequency range, etc. Exemplary features associated with frequency spectrum may include formant features, linear predictive coding cepstrum coefficients (LPCC) , mel-frequency cepstrum coefficients (MFCC) , Harmonics to Noise Ratio (HNR) , etc. The acoustic characteristics of the user may be identified and/or determined from the voice signals or the audio data of the user using an acoustic characteristic extraction technique. Exemplary acoustic characteristic extraction techniques may include using an autocorrelation function (ACF) algorithm, an average amplitude difference function (AMDF) algorithm, a nonlinear feature extraction algorithm based on Teager energy operator (TEO) , a linear predictive analysis (LPC) algorithm, a deep learning algorithm (e.g., a Laplacian Eigenmaps, a principal component analysis (PCA) , a local preserved projection (LPP) , etc. ) , etc. Different emotions may correspond to different acoustic characteristics. For example, “anger” may correspond to a wider fundamental frequency range than “fear” or “sadness” . “Happiness” may correspond to a higher short time average amplitude than “fear” or “sadness. ” “Anger” may correspond to a higher average fundamental frequency than “happiness. ”
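By way of illustration only, the following is a minimal Python sketch of extracting a few of the acoustic characteristics named above (zero-crossing rate, RMS energy, short time maximum amplitude, and a fundamental frequency estimated with the autocorrelation function (ACF) method) from a single audio frame; the sample rate, frame length, and 50-400 Hz pitch search range are assumptions, not values prescribed by the present disclosure.

```python
import numpy as np

def extract_acoustic_features(frame: np.ndarray, sr: int = 16000) -> dict:
    """Return a few duration, energy, and fundamental frequency features of one frame."""
    # Duration feature: zero-crossing rate, the fraction of adjacent samples whose signs differ.
    signs = np.sign(frame)
    zcr = float(np.mean(signs[:-1] != signs[1:]))

    # Energy/amplitude features: RMS energy and short time maximum amplitude.
    rms_energy = float(np.sqrt(np.mean(frame ** 2)))
    max_amplitude = float(np.max(np.abs(frame)))

    # Fundamental frequency via the autocorrelation function (ACF): take the lag
    # with the strongest autocorrelation inside an assumed 50-400 Hz pitch range.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    min_lag, max_lag = sr // 400, sr // 50
    lag = min_lag + int(np.argmax(acf[min_lag:max_lag]))
    f0 = sr / lag if acf[lag] > 0 else 0.0

    return {"zero_crossing_rate": zcr, "rms_energy": rms_energy,
            "max_amplitude": max_amplitude, "fundamental_frequency": f0}

# Example with a synthetic 200 Hz tone standing in for a voice frame.
t = np.arange(0, 0.064, 1 / 16000)
print(extract_acoustic_features(np.sin(2 * np.pi * 200 * t)))
```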
The audio data of the user may include semantic information of the voice signals of the user that may reflect the content of the voice signals of the user. In some embodiments, the audio data may include a plurality of phoneme sets, each of which may include one or more phonemes. Each phoneme set may correspond to a pronunciation of a word. In some embodiments, the audio data may include a plurality of word sets, each of which includes one or more words. In some embodiments, the audio data may include a plurality of phrase sets, each of which includes one or more phrases. For example, when the user speaks “Oh my god” , three phoneme sets A, B, and C may be used to represent three words “Oh, ” “my, ” “god, ” respectively. In some embodiments, the audio data of the user may be determined based on the voice signals of the user. For example, the voice signals of the user may be analog signals. The audio data of the user may be obtained by performing an analog to digital converting operation on the voice signals (i.e., analog signals) of the user. In some embodiments, the voice signals may be digital signals, which may be also referred to as the audio data.
In 504, the processing device 112 (e.g., the obtaining module 410) may obtain an acoustic based emotion recognition model. In some embodiments, the processing device 112 (e.g., the obtaining module 410) may obtain the acoustic based emotion recognition model from a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) . The acoustic based emotion recognition model may be configured to determine an emotion of the user based on one or more acoustic characteristics of the user. The emotion determined based on one or more acoustic characteristics may also be referred to as an acoustic based real time emotion. The acoustic based emotion recognition model may be configured to determine the emotion of the user based on the one or more acoustic characteristics of the user according to one or more dimensions, such as category, degree, etc. In some embodiments, the acoustic based emotion recognition model may be configured to classify the emotion of the user into a category. For example, the category may be one of positive, negative, and else (e.g., neutral) . As another example, the category may be “joy” , “anger” , “fear” , “disgust” , “surprise” , “sadness” , and else (e.g., neutral) . As another example, the category may be one of “interest” , “desire” , “sorrow” , “wonder” , “surprise” , “happiness” , and else (e.g., neutral) . As still another example, the category may be one of “anxiety” , “anger” , “sadness” , “disgust” , “happiness” , and else (e.g., neutral) . As still another example, the category may be one of “pleasure” , “pain” , and else (e.g., neutral) . In some embodiments, the acoustic based emotion recognition model may be configured to determine a degree of the emotion of the user. The degree of an emotion may be used to denote an intensity of the emotion. For example, the degree of an emotion may include several levels, such as strong and weak, or first level, second level, and third level, etc.
In some embodiments, the acoustic based emotion recognition model may be determined by training a machine learning model using a training set. The training set may include a plurality of groups of audio data or acoustic characteristics of audio data. In some embodiments, at least a portion of the plurality of groups of audio data or acoustic characteristics of audio data may be obtained from an emotion voice database, such as Belfast emotion voice database. In some embodiments, at least a portion of the plurality of groups of audio data or acoustic characteristics of audio data may be obtained by one or more testers simulating playing in one or more scenes (e.g., a scene of the RPG) . Each group of the plurality of groups of acoustic characteristics may correspond to a known emotion. Exemplary machine learning models may include a support vector machine (SVM) , a naive Bayes, maximum entropy, a neural network model (e.g., a deep learning model) , or the like, or any combination thereof. Exemplary deep learning models may include a convolutional neural network (CNN) model, a long short-term memory (LSTM) model, an extreme learning machine (ELM) model, or the like, or any combination thereof. More descriptions for the determination of the acoustic based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIG. 10A, and the descriptions thereof) .
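As a hedged illustration of such training, the sketch below fits a support vector machine on a placeholder training set with scikit-learn; the feature values, the four-dimensional characteristic vectors, and the emotion labels are invented stand-ins for groups of acoustic characteristics paired with known emotions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder training set: 40 groups x 4 acoustic characteristics, 10 groups per
# known emotion. A real training set would come from an emotion voice database or
# from testers simulating scenes of the RPG.
rng = np.random.default_rng(0)
X = rng.random((40, 4))
y = ["anger", "happiness", "sadness", "neutral"] * 10

# Scale the characteristics, then fit an SVM with probability estimates so the
# model can also report a confidence level for each predicted emotion.
acoustic_model = make_pipeline(StandardScaler(), SVC(probability=True))
acoustic_model.fit(X, y)

# Acoustic based real time emotion (and its confidence) for a new group of characteristics.
new_group = rng.random((1, 4))
print(acoustic_model.predict(new_group)[0],
      acoustic_model.predict_proba(new_group).max())
```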
In 506, the processing device 112 (e.g., the obtaining module 410) may obtain a content based emotion recognition model. In some embodiments, the processing device 112 (e.g., the obtaining module 410) may obtain the content based emotion recognition model from a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) . The content based emotion recognition model may be configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user using a speech recognition model. More descriptions for the determination of the speech recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 6 and 9, and the descriptions thereof) . The emotion of the user based on one or more text contents may be also referred to as a content based real time emotion. The content based emotion recognition model may be configured to determine the emotion of the user based on the one or more text contents of the audio data of the user according to one or more dimensions, such as category, degree, etc. In some embodiments, the content based emotion recognition model may be configured to classify the emotion of the user into a category. For example, the category may be positive, negative, or neutral. As another example, the category may be “joy, ” “anger, ” “fear, ” “disgust, ” “surprise, ” or “sadness. ” As another example, the category may be “interest” , “desire” , “sorrow” , “wonder” , “surprise” , or “happiness. ” As still another example, the category may be “anxiety” , “anger” , “sadness” , “disgust” , or “happiness. ” As still another example, the category may be “pleasure” or “pain. ” In some embodiments, the content based emotion recognition model may be configured to determine a degree of the emotion of the user. The degree of an emotion may be used to denote an intensity of the emotion. For example, the degree of an emotion may include several levels, such as strong and weak, or first level, second level, and third level, etc.
In some embodiments, the content based emotion recognition model may be determined by training a machine learning model using a training set. The training set may include a plurality of groups of text contents. In some embodiments, at least a portion of the plurality of groups of text contents may be obtained from an emotion voice database, such as Belfast emotion voice database. For example, the audio data in the emotion voice database may be recognized using a speech recognition technique to generate text contents to form the training set. In some embodiments, at least a portion of the plurality of groups of text contents may be obtained by one or more testers simulating playing in one or more scenes (e.g., a scene of the RPG) . For example, the audio data of the one or more testers may be recognized using a speech recognition technique to generate text contents to form the training set. Each group of the plurality of groups of text contents may correspond to a known emotion. The content based emotion recognition model may be constructed based on a linear regression model, a Kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model (e.g., a deep learning model) , or the like, or any combination thereof. More descriptions for the determination of the content based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIG. 10B, and the descriptions thereof) .
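Similarly, the following is a minimal sketch, assuming scikit-learn, of training a content based classifier from text contents paired with known emotions; the example sentences and labels are illustrative placeholders rather than data from any actual emotion voice database.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder text contents with known emotions, standing in for text recognized
# from an emotion voice database or from testers playing scenes of the RPG.
texts = ["this is amazing, we finally beat the boss",
         "i hate this level, it is so unfair",
         "okay, let us move to the next room",
         "why does my partner keep abandoning the team"]
emotions = ["happiness", "anger", "neutral", "anger"]

# TF-IDF features of the text content feeding a linear SVM classifier.
content_model = make_pipeline(TfidfVectorizer(), LinearSVC())
content_model.fit(texts, emotions)

# Content based real time emotion for a newly recognized text content.
print(content_model.predict(["we lost again and i am so angry"])[0])
```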
In 508, the processing device 112 (e.g., the emotion recognition module 430) may determine a real time emotion of the user in the scene based on the voice signals using at least one of the acoustic based emotion recognition model or the content based emotion recognition model.
In some embodiments, the processing device 112 (e.g., the emotion recognition module 430) may first determine the acoustic based real time emotion using the acoustic based emotion recognition model. The processing device 112 (e.g., the emotion recognition module 430) may determine whether the acoustic based real time emotion satisfies a condition, for example, is a positive emotion or a negative emotion. The processing device 112 (e.g., the emotion recognition module 430) may optionally perform correction of the acoustic based real time emotion of the user by determining the content based real time emotion of the user using the content based emotion recognition model. For example, if the acoustic based real time emotion of the user determined by the acoustic based emotion recognition model is a negative emotion or a positive emotion, the processing device 112 (e.g., the emotion recognition module 430) may determine the acoustic based real time emotion as the real time emotion of the user. If the acoustic based real time emotion of the user determined by the acoustic based emotion recognition model is neither a negative emotion nor a positive emotion, the processing device 112 (e.g., the emotion recognition module 430) may correct the acoustic based real time emotion using the content based emotion recognition model. More descriptions for the correction of the acoustic based real time emotion may be found in FIG. 6 and the descriptions thereof.
In some embodiments, the processing device 112 (e.g., the emotion recognition module 430) may first determine the content based real time emotion using the content based emotion recognition model. The processing device 112 (e.g., the emotion recognition module 430) may determine whether the content based real time emotion satisfies a condition, for example, is a positive emotion or a negative emotion. The processing device 112 (e.g., the emotion recognition module 430) may optionally perform correction of the content based real time emotion of the user by determining the acoustic based real time emotion of the user using the acoustic based emotion recognition model. For example, if the content based real time emotion of the user determined by the content based emotion recognition model is a negative emotion or a positive emotion, the processing device 112 (e.g., the emotion recognition module 430) may determine the content based real time emotion as the real time emotion of the user. If the content based real time emotion of the user determined by the content based emotion recognition model is neither a negative emotion nor a positive emotion, the processing device 112 (e.g., the emotion recognition module 430) may correct the content based real time emotion using the acoustic based emotion recognition model. More descriptions for the correction of the content based real time emotion may be found in FIG. 7 and the descriptions thereof.
In some embodiments, the processing device 112 may determine the content based real time emotion using the content based emotion recognition model and the acoustic based real time emotion using the acoustic based emotion recognition model. The processing device 112 may compare the content based real time emotion and the acoustic based real time emotion. The processing device 112 may determine the real time emotion of the user based on the comparison. More descriptions of the determination of the real time emotion of the user may be found elsewhere in the present disclosure (e.g., FIG. 8 and the descriptions thereof) .
In 510, the processing device 112 (e.g., the adjustment module 440) may adjust the plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined real time emotion of the user in the scene.
In some embodiments, when receiving the real time emotion of the user and the text content (s) from the processing device 112, the terminal device (e.g., the terminal 130, the terminal 140) may recognize the user's actual intention through the text content (s) and the emotion to perform operations in the scene (e.g., adjusting a plot of the RPG, pushing a plot of the RPG) . For example, if the emotion is “happy” and the text content is “agree, ” the terminal device may perform the operation of “agree” in the scene. As another example, if the emotion is “unhappy” and the text content is “agree, ” the terminal device may perform an operation different from “agree, ” such as “disagree. ” Thus, the accuracy of the voice control may be improved since the terminal device can obtain both the text content and the emotion of the audio data.
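A minimal sketch of this intention resolution is given below; the mapping rules, emotion names, and text contents are hypothetical examples of how a terminal device might combine the emotion with the text content, not a rule set defined by the present disclosure.

```python
def resolve_intention(text_content: str, emotion: str) -> str:
    """Map recognized text plus recognized emotion to the operation to perform."""
    if text_content == "agree":
        # A negative emotion reverses the literal meaning of the utterance.
        return "disagree" if emotion in ("unhappy", "anger", "sadness") else "agree"
    return text_content

print(resolve_intention("agree", "happy"))    # performs "agree"
print(resolve_intention("agree", "unhappy"))  # performs "disagree"
```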
In some embodiments, the emotion of the user (e.g., player) of the RPG may reflect user experience of the RPG, and the processing device 112 (e.g., the adjustment module 440) may adjust the plot of the RPG subsequent to the scene to improve the user experience based on the determined or corrected real time emotion of the user. In some embodiments, the user may have a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG. The processing device 112 (e.g., the adjustment module 440) may determine the relationship between the user and at least one of the one or more real-life players of the RPG or the one or more characters in the RPG based on the determined or corrected real time emotion of the user. The processing device 112 (e.g., the adjustment module 440) may adjust the plot of the RPG based on the relationship between the user and at least one of the one or more real-life players of the RPG or the one or more characters in the RPG. For example, if the determined or corrected real time emotion of the user is a negative emotion, the processing device 112 (e.g., the adjustment module 440) may determine that the relationship between the user and a real-life player of the RPG or a character of the real-life player in the RPG is bad or poor. The processing device 112 (e.g., the adjustment module 440) may decrease the plot of the RPG associated with the character of the user and the character of the real-life player in the RPG. As another example, when the user expresses an emotion of hating his/her partner, the processing device 112 may adjust the plot of the RPG to make the user and his/her partner not in a team. As another example, when the user expresses an emotion of being sad for not passing a test (e.g., failure of beating a monster) in the plot of the RPG, the processing device 112 may adjust the difficulty of the plot of the RPG to make it easier to pass.
In some embodiments, the processing device 112 may adjust, based on the determined real time emotion of the user in the scene, the element of the RPG in the scene. The element of the RPG in the scene may include a vision effect in the scene of the RPG, a sound effect in the scene of the RPG, a display interface element associated with the scene of the RPG, one or more props used in the scene of the RPG, or the like, or a combination thereof. For example, assume that the RPG is a horror game and the scene is associated with a horror plot. If the determined real time emotion of the user is not “fear” , or an intensity of the “fear” is relatively low (e.g., smaller than a threshold) , the processing device 112 may adjust the vision effect (e.g., changing painting style) in the scene of the RPG, the sound effect in the scene of the RPG, the display interface element associated with the scene of the RPG, the one or more props used in the scene of the RPG, etc., to increase a degree of terror of the RPG. If the determined real time emotion of the user is “fear” , and an intensity of the “fear” is relatively high (e.g., exceeds a threshold) , the processing device 112 may adjust the vision effect (e.g., changing painting style) in the scene of the RPG, the sound effect in the scene of the RPG, the display interface element associated with the scene of the RPG, the one or more props used in the scene of the RPG, etc., to decrease the degree of terror of the RPG.

It should be noted that the above description regarding the process 500 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more operations may be omitted and/or one or more additional operations may be added. For example, operation 510 may be omitted. As another example, one or more operations in process 600 (for example, operation 630) may be added into the process 500. In some embodiments, process 500 may further include obtaining an image based emotion recognition model configured to identify an emotion of a user based on an image of the face of the user (also referred to as image based real time emotion) . The real time emotion of the user may be determined based on at least one of the image based real time emotion, the acoustic based real time emotion, and the content based real time emotion. In some embodiments, the acoustic based emotion recognition model and the content based emotion recognition model may be integrated into one single model. The one single model may be configured to identify an emotion of the user based on the acoustic characteristics of the user and the text content of the audio data of the user.
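Referring back to the horror game example above, the following is a minimal sketch of how the recognized emotion and its intensity might drive the scene element adjustment; the threshold value and the adjust_scene_elements callback are hypothetical.

```python
FEAR_THRESHOLD = 0.6  # assumed intensity threshold on a 0-1 scale

def adjust_horror_scene(emotion: str, intensity: float, adjust_scene_elements) -> None:
    """Raise or lower the degree of terror of the scene based on the user's fear."""
    if emotion != "fear" or intensity < FEAR_THRESHOLD:
        # Player not frightened enough: darker visuals, tenser sound, scarier props.
        adjust_scene_elements(direction="increase_terror")
    elif intensity > FEAR_THRESHOLD:
        # Player too frightened: soften the vision and sound effects.
        adjust_scene_elements(direction="decrease_terror")

adjust_horror_scene("fear", 0.2, lambda direction: print(direction))  # increase_terror
adjust_horror_scene("fear", 0.9, lambda direction: print(direction))  # decrease_terror
```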
FIG. 6 is a flowchart illustrating an exemplary process for adjusting the plot of a role-playing game (RPG) according to some embodiments of the present disclosure. At least a portion of process 600 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 600 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 600 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in the form of electronic current or electrical signals.
In 610, the processing device 112 (e.g., the obtaining module 410) may obtain voice signals of a user playing in a scene of a role-playing game (RPG) . The voice signals of the user may comprise acoustic characteristics and audio data of the user. As used herein, the user may be also referred to as a player of the RPG. The voice signals of the user may be obtained by the obtaining module 410 from the terminal 130, the terminal 140, or a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as described elsewhere in the present disclosure. The voice signals of the user may be obtained as described in connection with 502 as illustrated in FIG. 5.
In 620, the processing device 112 (e.g., the emotion recognition module 430) may determine the user’s real time emotion based on the acoustic characteristics of the user using an acoustic based emotion recognition model. As used herein, a real time emotion determined using an acoustic based emotion recognition model may be also referred to as an acoustic based real time emotion. The acoustic based emotion recognition model may be obtained as described elsewhere in the present disclosure (e.g., FIGs. 5 and 10A, and the descriptions thereof) . The acoustic based emotion recognition model may be configured to identify the real time emotion of the user based on one or more acoustic characteristics. As illustrated above, the voice signals may include the plurality of acoustic characteristics.
In some embodiments, to determine the user’s real time emotion, the processing device 112 (e.g., the emotion recognition module 430) may determine the voice signals including one or more acoustic characteristics as an input of the acoustic based emotion recognition model. For example, the processing device 112 (e.g., the emotion recognition module 430) may input the voice signals of the user into the acoustic based emotion recognition model. The acoustic based emotion recognition model may identify the acoustic characteristics (e.g., the real time fundamental frequency and the real time amplitude) of the user from the voice signals. Then the acoustic based emotion recognition model may be used to determine the type of the real time emotion of the user and/or the degree of the real time emotion of the user. In some embodiments, the processing device 112 (e.g., the emotion recognition module 430) may determine the acoustic characteristics of the user as an input of the acoustic based emotion recognition model. For example, the processing device 112 (e.g., the emotion recognition module 430) may input the acoustic characteristics of the user into the acoustic based emotion recognition model. Then the acoustic based emotion recognition model may be used to determine the type of the real time emotion of the user and/or the degree of the real time emotion of the user based on the inputted acoustic characteristics of the user. More descriptions for the determination of the acoustic based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 10A, and the descriptions thereof) .
In some embodiments, before determining the user’s real time emotion based on the acoustic characteristics of the user, the processing device 112 (e.g., the emotion recognition module 430) may perform a calibration operation on the acoustic characteristics of the user. For example, before starting the RPG, the processing device 112 (e.g., the emotion recognition module 430) may obtain one or more base voice signals (i.e., standard voice signals) of the user. The base voice signals may include a series of selected base acoustic characteristics. The processing device 112 (e.g., the emotion recognition module 430) may calibrate the acoustic characteristics of the user based on the base acoustic characteristics of the user. For example, the processing device 112 (e.g., the emotion recognition module 430) may normalize the acoustic characteristics of the user based on the base acoustic characteristics of the user. As a further example, the processing device 112 (e.g., the emotion recognition module 430) may determine the average value of an acoustic characteristic (e.g., a fundamental frequency) of the prerecorded voice signals as a base acoustic characteristic. The processing device 112 (e.g., the emotion recognition module 430) may normalize the acoustic characteristic of the user by subtracting the base acoustic characteristic from the acoustic characteristic of the user.
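The calibration described above might be sketched as follows; the characteristic names and sample values are illustrative, and subtraction is used as one simple normalization, as in the example above.

```python
import numpy as np

def compute_base_characteristics(prerecorded: list) -> dict:
    """Average each acoustic characteristic over the prerecorded base voice signals."""
    return {key: float(np.mean([frame[key] for frame in prerecorded]))
            for key in prerecorded[0]}

def calibrate(real_time: dict, base: dict) -> dict:
    """Normalize each real time characteristic by subtracting the user's base value."""
    return {key: real_time[key] - base[key] for key in real_time}

base = compute_base_characteristics([
    {"fundamental_frequency": 118.0, "rms_energy": 0.04},
    {"fundamental_frequency": 124.0, "rms_energy": 0.05},
])
print(calibrate({"fundamental_frequency": 176.0, "rms_energy": 0.09}, base))
```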
In 630, the processing device 112 (e.g., the emotion recognition module 430) may use speech recognition to convert the audio data of the user in the scene to obtain a result of the speech recognition comprising a text content of the user’s voice signals. In some embodiments, a speech recognition model may be used to obtain the result of the speech recognition. Exemplary speech recognition models may include a Hidden Markov model (HMMs) , a dynamic time warping (DTW) -based speech recognition model, an artificial neural network model, an end-to-end automatic speech recognition model, or the like, or any combination thereof. In some embodiments, the speech recognition model may be a universal speech recognition model (e.g. a deep neural network model) . The universal speech recognition model may be trained using universal training data. The universal training data may include a plurality of groups of universal audio data corresponding to universal audio scenes, such as, a meeting scene, a working scene, a game scene, a party scene, a travel scene, a play scene, or the like, or any combination thereof. In some embodiments, the speech recognition model may be a special speech recognition model for the RPG. The special speech recognition model may be obtained by transfer learning from the universal speech recognition model or a machine learning model using special training data. The special training data may include special audio data corresponding to special audio scenes of the RPG. More descriptions for the determination of the speech recognition model may be found elsewhere in the present disclosure (e.g., FIG. 9 and the descriptions thereof) .
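As a loose illustration of obtaining a special speech recognition model by transfer learning from a universal one, the sketch below freezes the lower layers of a placeholder PyTorch network and fine-tunes only the upper layer on special audio data; UniversalASRNet, the per-frame token targets, and the loss are simplified assumptions and do not represent a real speech recognition architecture.

```python
import torch
import torch.nn as nn

class UniversalASRNet(nn.Module):
    """Toy stand-in for a universal speech recognition network."""
    def __init__(self, n_features=40, n_tokens=500):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                                     nn.Linear(256, 256), nn.ReLU())
        self.decoder = nn.Linear(256, n_tokens)  # token logits per frame

    def forward(self, x):
        return self.decoder(self.encoder(x))

universal_model = UniversalASRNet()
# universal_model.load_state_dict(torch.load("universal_asr.pt"))  # hypothetical pretrained weights

# Freeze the encoder; fine-tune only the decoder on RPG-specific audio data.
for p in universal_model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(universal_model.decoder.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(features, token_targets):
    """One update on a batch of special (RPG) audio features and target tokens."""
    optimizer.zero_grad()
    logits = universal_model(features)                       # (batch, frames, tokens)
    loss = criterion(logits.transpose(1, 2), token_targets)  # (batch, tokens, frames) vs (batch, frames)
    loss.backward()
    optimizer.step()
    return float(loss)

features = torch.randn(4, 50, 40)         # 4 utterances, 50 frames, 40 features each
targets = torch.randint(0, 500, (4, 50))  # toy per-frame token indices
print(fine_tune_step(features, targets))
```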
In 640, the processing device 112 (e.g., the emotion recognition module 430) may optionally perform a correction of the determined real time emotion of the user by determining a real time emotion of the user in the scene based on the text content using a content based emotion recognition model to obtain a corrected real time emotion of the user. As used herein, the real time emotion of the user in the scene determined based on the text content using the content based emotion recognition model may be also referred to as a content based real time emotion. The content based real time emotion may be generated by inputting the text content into the content based emotion recognition model. More descriptions for the determination of the content based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 10B and the descriptions thereof) .
In some embodiments, if the processing device 112 (e.g., the emotion recognition module 430) determines that the acoustic based real time emotion of the user in the scene determined in 620 is a neutral emotion, the correction of the determined real time emotion may be performed using the content based emotion recognition model. For example, the processing device 112 (e.g., the emotion recognition module 430) may designate the content based real time emotion as the corrected real time emotion. As another example, the processing device 112 (e.g., the emotion recognition module 430) may correct the acoustic based real time emotion of the user in the scene determined in 620 based on the content based real time emotion. As a further example, the processing device 112 (e.g., the emotion recognition module 430) may determine an average emotion between the acoustic based real time emotion of the user in the scene determined in 620 and the content based real time emotion as the corrected real time emotion. As still an example, the processing device 112 (e.g., the emotion recognition module 430) may determine an emotion close to or similar to one of the acoustic based real time emotion of the user in the scene determined in 620 and the content based real time emotion as the corrected real time emotion.
In some embodiments, if the processing device 112 (e.g., the emotion recognition module 430) determines that the acoustic based real time emotion of the user in the scene determined in 620 is a non-neutral emotion (e.g., a positive emotion or a negative emotion) , the correction of the determined acoustic based real time emotion may not be performed. Operation 630 and operation 640 may be omitted. In some embodiments, if the processing device 112 (e.g., the emotion recognition module 430) determines that the acoustic based real time emotion of the user in the scene determined in 620 is a non-neutral emotion (e.g., a positive emotion or a negative emotion) , the correction of the acoustic based real time emotion may be performed based on the content based real time emotion. For example, the processing device 112 (e.g., the emotion recognition module 430) may determine whether the acoustic based real time emotion and the content based real time emotion are different. If the processing device 112 (e.g., the emotion recognition module 430) determines that the acoustic based real time emotion and the content based real time emotion are different, the processing device 112 may designate the content based real time emotion as the corrected real time emotion. As a further example, if the processing device 112 (e.g., the emotion recognition module 430) determines that the types of the acoustic based real time emotion and the content based real time emotion are the same (e.g., are both positive emotions) while the degrees of the acoustic based real time emotion and the content based real time emotion are different, the processing device 112 may designate the degree of the content based real time emotion as the degree of the corrected real time emotion.
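One possible reading of the correction alternatives in operations 620 through 640 is sketched below; emotions are represented as hypothetical (category, degree) pairs, and the rules chosen are only some of the examples described above.

```python
def correct_acoustic_emotion(acoustic, content):
    """acoustic and content are (category, degree) pairs, e.g. ("anger", 2)."""
    a_cat, a_deg = acoustic
    c_cat, c_deg = content
    if a_cat == "neutral":
        return content         # neutral acoustic result: fall back to the content based emotion
    if a_cat != c_cat:
        return content         # categories differ: designate the content based emotion
    if a_deg != c_deg:
        return (a_cat, c_deg)  # same category, different degree: keep the content based degree
    return acoustic            # consistent: no correction needed

print(correct_acoustic_emotion(("neutral", 1), ("anger", 2)))  # -> ("anger", 2)
print(correct_acoustic_emotion(("joy", 1), ("joy", 3)))        # -> ("joy", 3)
```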
In 650, the processing device 112 (e.g., the adjustment module 440) may adjust the plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined or corrected real time emotion of the user. In some embodiments, as the emotion of the user (e.g., player) of the RPG may reflect user experience of the RPG, the processing device 112 (e.g., the adjustment module 440) may adjust the plot of the RPG subsequent to the scene to improve the user experience based on the determined or corrected real time emotion of the user. In some embodiments, the user may have a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG. The processing device 112 (e.g., the adjustment module 440) may determine the relationship between the user and at least one of the one or more real-life players of the RPG or the one or more characters in the RPG based on the determined or corrected real time emotion of the user. The processing device 112 (e.g., the adjustment module 440) may adjust the plot of the RPG based on the relationship between the user and at least one of the one or more real-life players of the RPG or the one or more characters in the RPG. For example, if the determined or corrected real time emotion of the user is a negative emotion, the processing device 112 (e.g., the adjustment module 440) may determine that the relationship between the user and the one or more real-life players of the RPG or the one or more characters in the RPG is bad or poor. The processing device 112 (e.g., the adjustment module 440) may decrease the plot of the RPG associated with the one or more real-life players of the RPG or the one or more characters in the RPG, or determine a bad ending between the one or more characters in the RPG. As another example, when the user expresses an emotion of hating his/her partner, the processing device 112 may adjust the plot of the RPG to make the user and his/her partner not in a team. As another example, when the user expresses an emotion of being sad for not passing a test (e.g., failure of beating a monster) in the plot of the RPG, the processing device 112 may adjust the difficulty of the plot of the RPG to make it easier to pass. More descriptions of adjusting the plot of the RPG may be found elsewhere in the present disclosure (e.g., FIG. 5, and the descriptions thereof) .
It should be noted that the above description regarding the process 600 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more operations may be omitted and/or one or more additional operations may be added. For example, operation 630 may be combined into operation 640, or operation 630 and operation 640 may be omitted. As another example, one or more operations in processes 1000 and 1050 may be added into the process 600 to obtain the acoustic based emotion recognition model and the content based emotion recognition model.
FIG. 7 is a flowchart illustrating an exemplary process 700 for adjusting the plot of a role-playing game (RPG) according to some embodiments of the present disclosure. At least a portion of process 700 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 700 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 700 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
In 710, the processing device 112 (e.g., the obtaining module 410) may obtain voice signals of a user. The voice signals of the user may comprise acoustic characteristics and audio data of the user playing in a scene of an RPG. More description of the voice signals of a user may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 6, and the descriptions thereof) .
In 720, the processing device 112 (e.g., the emotion recognition module 430) may use speech recognition to convert the audio data of the user in the scene to obtain results of the speech recognition comprising text of the user’s voice signals. More description of obtaining results of the speech recognition comprising text of the user’s voice signals may be found elsewhere in the present disclosure (e.g., FIG. 6, and the descriptions thereof) .
In 730, the processing device 112 (e.g., the emotion recognition module 430) may determine the user’s real time emotion based on the text using a content based emotion recognition model. The emotion of the user in the scene determined based on the text content of the user may also be referred to as the content based real time emotion. As illustrated above, the processing device 112 (e.g., the emotion recognition module 430) may input the text content of the user’s voice signal in the scene into the content based emotion recognition model to determine the content based real time emotion of the user. More descriptions for the determination of the content based real time emotion using the content based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 6, and the descriptions thereof) .
In 740, the processing device 112 (e.g., the emotion recognition module 430) may optionally perform correction of the determined real time emotion of the user by determining an emotion of the user in the scene based on the acoustic characteristics of the user in the scene using an acoustic based emotion recognition model to obtain a corrected real time emotion of the user. As used herein, the real time emotion of the user in the scene determined based on the acoustic characteristics of the user in the scene using the acoustic based emotion recognition model may be also referred to as an acoustic based real time emotion. The acoustic based real time emotion may be generated by inputting the voice signals or the acoustic characteristics of the user into the acoustic based emotion recognition model. More descriptions for the determination of the acoustic based real time emotion using the acoustic based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 6, and the descriptions thereof) .
In some embodiments, if the processing device 112 (e.g., the emotion recognition module 430) determines the content based real time emotion of the user is a neutral emotion, the correction of the determined real time emotion (i.e., the content based real time emotion) may be performed using the acoustic based emotion recognition model. For example, the processing device 112 (e.g., the emotion recognition module 430) may designate the acoustic based real time emotion as the corrected real time emotion. As another example, the processing device 112 (e.g., the emotion recognition module 430) may correct the content based real time emotion of the user in the scene based on the acoustic based real time emotion. As a further example, the processing device 112 (e.g., the emotion recognition module 430) may determine an average emotion between the content based real time emotion and the acoustic based real time emotion as the corrected real time emotion. As still an example, the processing device 112 (e.g., the emotion recognition module 430) may determine an emotion close to or similar to one of the acoustic based real time emotion of the user in the scene and the content based real time emotion as the corrected real time emotion.
In some embodiments, if the processing device 112 (e.g., the emotion recognition module 430) determines that the content based real time emotion of the user in the scene is a non-neutral emotion (e.g., a positive emotion or a negative emotion) , the correction of the determined content based real time emotion may not be performed. Operation 740 may be omitted. In some embodiments, if the processing device 112 (e.g., the emotion recognition module 430) determines that the content based real time emotion of the user in the scene is a non-neutral emotion (e.g., a positive emotion or a negative emotion) , the processing device 112 (e.g., the emotion recognition module 430) may determine whether the content based real time emotion and the acoustic based real time emotion are different. The correction of the content based real time emotion may be performed based on the acoustic based real time emotion if the content based real time emotion and the acoustic based real time emotion are different. For example, the processing device 112 may designate the acoustic based real time emotion as the corrected real time emotion. As another example, if the processing device 112 (e.g., the emotion recognition module 430) determines that the types of the content based real time emotion and the acoustic based real time emotion are the same (e.g., are both positive emotions) while the degrees of the content based real time emotion and the acoustic based real time emotion are different, the processing device 112 may designate the degree of the acoustic based real time emotion as the degree of the corrected real time emotion.
In 750, the processing device 112 (e.g., the adjustment module 440) may adjust the plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined or corrected real time emotion of the user. More descriptions of adjusting the plot of the RPG may be found elsewhere in the present disclosure (e.g., FIG. 5, and the descriptions thereof) .
It should be noted that the above description regarding the process 700 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure.
FIG. 8 is a flowchart illustrating an exemplary process 800 for adjusting the plot of a role-playing game (RPG) according to some embodiments of the present disclosure. At least a portion of process 800 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 800 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 800 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
In 810, the processing device 112 (e.g., the obtaining module 410) may obtain voice signals of a user. The voice signals of the user may comprise acoustic characteristics and audio data of the user playing in a scene of an RPG. More description of the voice signals of a user may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 6, and the descriptions thereof) .
In 820, the processing device 112 (e.g., the emotion recognition module 430) may determine a first real time emotion of the user in the scene based on the acoustic characteristics of the user. The first real time emotion of the user in the scene determined based on the acoustic characteristics of the user may be also referred to as an acoustic based real time emotion. The first real time emotion of the user may be determined using an acoustic based emotion recognition model. The acoustic based emotion recognition model may be configured to identify the first real time emotion of the user based on one or more acoustic characteristics. More descriptions for the determination of the first real time emotion of the user (i.e., the acoustic based real time emotion) based on the acoustic characteristics may be found elsewhere in the present disclosure (e.g., FIG. 6, and the descriptions thereof) .
In 830, the processing device 112 (e.g., the emotion recognition module 430) may use speech recognition to convert the audio data of the user in the scene to obtain results of the speech recognition comprising text of the user’s voice signals. More description of obtaining results of the speech recognition comprising text of the user’s voice signals may be found elsewhere in the present disclosure (e.g., FIG. 6, and the descriptions thereof) .
In 840, the processing device 112 (e.g., the emotion recognition module 430) may determine a second real time emotion of the user in the scene based on the text of the user’s voice signal in the scene using a content based emotion recognition model. The second real time emotion of the user in the scene determined based on the text content of the user may also be referred to as a content based real time emotion. As illustrated above, the processing device 112 (e.g., the emotion recognition module 430) may input the text content of the user’s voice signal in the scene into the content based emotion recognition model to determine the second real time emotion of the user. More descriptions for the determination of the second real time emotion of the user in the scene (i.e., the content based real time emotion) based on the text using the content based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIG. 6, and the descriptions thereof) .
In 850, the processing device 112 (e.g., the emotion recognition module 430) may determine a target real time emotion of the user by comparing the first real time emotion and the second real time emotion of the user in the scene. In some embodiments, the processing device 112 may determine whether the first real time emotion is consistent with or the same as the second real time emotion. The processing device 112 may determine the target real time emotion of the user based on the determination. As used herein, the first real time emotion being consistent with or the same as the second real time emotion may refer to that the type and the degree of the first real time emotion are both consistent with or the same as those of the second real time emotion. The first real time emotion being inconsistent with or different from the second real time emotion may refer to that the type and/or the degree of the first real time emotion are inconsistent with or different from those of the second real time emotion. In some embodiments, if the processing device 112 determines that the first real time emotion is consistent with or the same as the second real time emotion of the user in the scene, the processing device 112 (e.g., the emotion recognition module 430) may determine the consistent real time emotion of the user (i.e., the first real time emotion or the second real time emotion) as the target real time emotion of the user. In some embodiments, if the processing device 112 determines that the first real time emotion is inconsistent with or different from the second real time emotion of the user in the scene, the processing device 112 (e.g., the emotion recognition module 430) may determine either of the first real time emotion and the second real time emotion (e.g., the second real time emotion of the user) as the target real time emotion of the user. In some embodiments, the processing device 112 may use the acoustic based emotion recognition model to determine a first confidence level for the first real time emotion (i.e., the acoustic based real time emotion) . The processing device 112 may use the content based emotion recognition model to determine a second confidence level for the second real time emotion (i.e., the content based real time emotion) . The processing device 112 may compare the first confidence level and the second confidence level to determine the one of the acoustic based real time emotion and the content based real time emotion that corresponds to the higher confidence level as the target real time emotion.
In some embodiments, if the processing device 112 determines that the first real time emotion is inconsistent with or different from the second real time emotion of the user in the scene, the processing device 112 (e.g., the emotion recognition module 430) may further determine whether the first real time emotion or the second real time emotion of the user is a neutral emotion. If the processing device 112 determines that the first real time emotion is a neutral emotion, the processing device 112 (e.g., the emotion recognition module 430) may determine the second real time emotion as the target real time emotion of the user. If the processing device 112 determines that the second real time emotion is a neutral emotion, the processing device 112 (e.g., the emotion recognition module 430) may determine the first real time emotion as the target real time emotion of the user.
In some embodiments, if the processing device 112 determines that the first real time emotion is inconsistent with or different from the second real time emotion of the user in the scene, the processing device 112 (e.g., the emotion recognition module 430) may further determine the target real time emotion based on the first real time emotion and the second real time emotion. For example, if the degrees of the first real time emotion and the second real time emotion are inconsistent, the processing device 112 may determine the degree of the target real time emotion by averaging the degrees of the first real time emotion and the second real time emotion. As another example, the processing device 112 may compare the degree of the first real time emotion and the degree of the second real time emotion. The processing device 112 may determine the greater or the lesser of the degrees of the first real time emotion and the second real time emotion as the degree of the target real time emotion.
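A minimal sketch of operation 850, combining several of the alternatives described above (consistency check, neutral fallback, degree averaging, and confidence comparison), is given below; the (category, degree, confidence) representation and the ordering of the rules are assumptions.

```python
def determine_target_emotion(first, second):
    """first: acoustic based, second: content based; each is (category, degree, confidence)."""
    f_cat, f_deg, f_conf = first
    s_cat, s_deg, s_conf = second
    if (f_cat, f_deg) == (s_cat, s_deg):
        return (f_cat, f_deg)                # consistent: keep either one
    if f_cat == "neutral":
        return (s_cat, s_deg)                # neutral acoustic result: use the content based emotion
    if s_cat == "neutral":
        return (f_cat, f_deg)                # neutral content result: use the acoustic based emotion
    if f_cat == s_cat:
        return (f_cat, (f_deg + s_deg) / 2)  # same type, different degree: average the degrees
    # Different types: keep the emotion with the higher confidence level.
    return (f_cat, f_deg) if f_conf >= s_conf else (s_cat, s_deg)

print(determine_target_emotion(("anger", 2, 0.7), ("neutral", 0, 0.9)))  # -> ("anger", 2)
print(determine_target_emotion(("joy", 1, 0.6), ("sadness", 2, 0.8)))    # -> ("sadness", 2)
```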
In 860, the processing device 112 (e.g., the adjustment module 440) may adjust the plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined target real time emotion of the user. Operation 860 may be performed as described in connection with 510 illustrated in FIG. 5.
It should be noted that the above description regarding the process 800 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure.
FIG. 9 is a flowchart illustrating an exemplary process 900 for obtaining a speech recognition model according to some embodiments of the present disclosure. At least a portion of process 900 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 900 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 900 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
In 910, the processing device 112 (e.g., the obtaining module 410) may obtain a plurality of groups of universal audio data of one or more users communicating in one or more scenes (or circumstances) . The processing device 112 (e.g., the obtaining module 410) may obtain the plurality of groups of universal audio data from the terminal 130, the terminal 140, the storage device 120, or any other storage device. In some embodiments, the one or more scenes may include a meeting scene, a working scene, a game scene, a party scene, a travel scene, a play scene, or the like, or any combination thereof. One group of universal audio data may include information of a communication of the user in one of the one or more scenes. For example, in a travel scene, a passenger and a driver may make a dialogue during the travel. The communication between the passenger and the driver may be recorded or picked up as voice signals by a voice pickup device (e.g., a microphone) installed in the vehicle of the driver or a mobile device associated with the driver or the passenger. The voice signals may be converted into the audio data of the driver and/or the passenger.
In some embodiments, a group of universal audio data may include a plurality of phoneme sets, each of which includes one or more phonemes. Each phoneme set may correspond to a pronunciation of a word. In some embodiments, a group of universal audio data may include a plurality of word sets, each of which includes one or more words. In some embodiments, a group of universal audio data may include a plurality of phrase sets, each of which includes one or more phrases. Each group of the plurality of groups of universal audio data may correspond to an actual text content indicating semantic information of a communication of the user in a scene. The actual text content may be denoted by one or more words or phrases. In some embodiments, the actual text content corresponding to each group of the plurality of groups of universal audio data may be determined based on each group of the plurality of groups of universal audio data by an operator (e.g., an engineer) manually.
In 920, the processing device 112 (e.g., the speech recognition model determination unit 422) may use the plurality of groups of universal audio data to train a machine learning model to obtain a universal speech recognition model. The machine learning model may include a linear regression model, a Kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model (e.g., a deep learning model) , or the like, or any combination thereof. In some embodiments, the universal speech recognition model may be obtained by training a neural network model using a neural network model training algorithm. Exemplary neural network training algorithms may include a gradient descent algorithm, a Newton’s algorithm, a Quasi-Newton algorithm, a Levenberg-Marquardt algorithm, a conjugate gradient algorithm, or the like, or a combination thereof.
In some embodiments, the universal speech recognition model may be obtained by performing a plurality of iterations. For each of the plurality of iterations, a specific group of universal audio data may first be inputted into the machine learning model. The machine learning model may extract one or more phonemes, letters, characters, words, phrases, sentences, etc., included in the specific group of universal audio data. Based on the extracted phonemes, letters, characters, words, phrases, sentences, etc., the machine learning model may determine a predicted text content corresponding to the specific group of universal audio data. The predicted text content may then be compared with an actual text content (i.e., a desired text content) corresponding to the specific group of universal audio data based on a cost function. The cost function of the machine learning model may be configured to assess a difference between an estimated value (e.g., the predicted text content) of the machine learning model and a desired value (e.g., the actual text content). If the value of the cost function exceeds a threshold in a current iteration, parameters of the machine learning model may be adjusted and updated to reduce the value of the cost function (i.e., the difference between the predicted text content and the actual text content) below the threshold. Accordingly, in a next iteration, another group of universal audio data may be inputted into the machine learning model to train the machine learning model as described above. The plurality of iterations may be performed to update the parameters of the machine learning model until a termination condition is satisfied. The termination condition may provide an indication of whether the machine learning model is sufficiently trained. For example, the termination condition may be satisfied if the value of the cost function associated with the machine learning model is minimal or smaller than a threshold (e.g., a constant). As another example, the termination condition may be satisfied if the value of the cost function converges. The convergence may be deemed to have occurred if the variation of the values of the cost function in two or more consecutive iterations is smaller than a threshold (e.g., a constant). As still another example, the termination condition may be satisfied when a specified number of iterations have been performed in the training process. The trained machine learning model (i.e., the universal speech recognition model) may be determined based on the updated parameters. In some embodiments, the trained machine learning model (i.e., the universal speech recognition model) may be transmitted to the storage device 120, the storage module 408, or any other storage device for storage.
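For illustration only, the iterative training with a termination condition described above can be summarized as a simple loop. The sketch below is a minimal Python illustration, not the patent's implementation; the `model` object with `predict`, `loss`, and `update` methods is a hypothetical stand-in for the machine learning model and its parameter update step.

```python
# A minimal sketch of the iterative training loop of operation 920.
# All names (model.predict, model.loss, model.update) are illustrative
# assumptions, not part of the disclosure.

def train_universal_model(model, universal_groups, actual_texts,
                          loss_threshold=0.05, max_iterations=10_000):
    """Iterate over groups of universal audio data until a termination
    condition is met: the cost falls below a threshold, the cost converges,
    or a maximum number of iterations is reached."""
    previous_loss = None
    for iteration in range(max_iterations):
        audio = universal_groups[iteration % len(universal_groups)]
        target = actual_texts[iteration % len(actual_texts)]

        predicted_text = model.predict(audio)       # predicted text content
        loss = model.loss(predicted_text, target)   # value of the cost function

        # Termination condition 1: cost below a threshold.
        if loss < loss_threshold:
            break
        # Termination condition 2: cost has converged.
        if previous_loss is not None and abs(previous_loss - loss) < 1e-6:
            break
        previous_loss = loss

        # Otherwise adjust and update the model parameters and continue.
        model.update(audio, target)
    return model
```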
In 930, the processing device 112 (e.g., the obtaining module 430) may obtain a plurality of groups of special audio data of one or more users playing in a scene of an RPG. In some embodiments, a group of special audio data may include information associated with a communication of a user (e.g., a player) occurring in the scene of the RPG. For example, the user may communicate with a real-life player or a character in the RPG to generate voice signals picked up by a voice pickup device (e.g., a microphone) associated with a terminal (e.g., a game machine) of the user. The voice signals may be transformed into special audio data and be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390, the storage unit 426). The processing device 112 (e.g., the obtaining module 430) may obtain the group of special audio data of the user from the storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390, the storage unit 426). In some embodiments, the processing device 112 (e.g., the obtaining module 430) may obtain the plurality of groups of special audio data directly from the voice pickup device (e.g., a microphone) associated with the terminal (e.g., a game machine) of the user.
In some embodiments, a group of special audio data may include a plurality of phoneme sets, each of which includes one or more phonemes. Each phoneme set may correspond to a pronunciation of a word. In some embodiments, a group of special audio data may include a plurality of word sets, each of which includes one or more words. In some embodiments, a group of special audio data may include a plurality of phrase sets, each of which includes one or more phrases. Each group of the plurality of groups of special audio data may correspond to an actual text content indicating semantic information of a communication of the user in the scene of the RPG. The actual text content may be denoted by one or more words or phrases. In some embodiments, the actual text content corresponding to each group of the plurality of groups of special audio data may be determined based on each group of the plurality of groups of special audio data by an operator (e.g., an engineer) manually.
In 940, the processing device 112 (e.g., the speech recognition model determination unit 422) may use the plurality of groups of special audio data to train the universal speech recognition model to obtain a special speech recognition model. Training the special speech recognition model may refer to further training the universal speech recognition model to obtain the special speech recognition model.
The special speech recognition model may be obtained by training the universal speech recognition model using the plurality of groups of special audio data. The training process of the special speech recognition model may be similar to or the same as the training process of the universal speech recognition model as described in operation 920. For example, the special speech recognition model may be obtained by training the universal speech recognition model via performing a plurality of iterations. For each of the plurality of iterations, a specific group of special audio data may first be inputted into the universal speech recognition model. The universal speech recognition model may extract one or more phonemes, letters, characters, words, phrases, sentences, etc., included in the specific group of special audio data. Based on the extracted phonemes, letters, characters, words, phrases, sentences, etc., the universal speech recognition model may determine a predicted text content corresponding to the specific group of special audio data. The predicted text content may then be compared with an actual text content (i.e., a desired text content) corresponding to the specific group of special audio data based on a cost function. If the value of the cost function exceeds a threshold in a current iteration, parameters of the universal speech recognition model may be adjusted and updated to reduce the value of the cost function (i.e., the difference between the predicted text content and the actual text content) below the threshold. Accordingly, in a next iteration, another group of special audio data may be inputted into the universal speech recognition model to train the universal speech recognition model as described above. The plurality of iterations may be performed to update the parameters of the universal speech recognition model until a termination condition is satisfied.
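For illustration, the training in operation 940 can be sketched by reusing the kind of training loop shown above with the already-trained universal model as the starting point; the function and variable names below are assumptions, not part of the disclosure.

```python
# Sketch of operation 940 (illustrative names only): the special speech
# recognition model is obtained by continuing to train the universal model
# on the RPG-specific audio, rather than training a new model from scratch.

def train_special_model(universal_model, special_groups, special_texts, train_fn):
    # `train_fn` is the same kind of iterative training loop sketched above
    # for operation 920 (e.g., the hypothetical train_universal_model).
    return train_fn(universal_model, special_groups, special_texts)
```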
In some embodiments, training sets of the universal speech recognition model and/or the special speech recognition model may be updated based on added data (e.g., the audio data of the user obtained in 502) over a period (e.g., every other month, every two months, etc. ) . In some embodiments, the universal speech recognition model and/or the special speech recognition model may be updated according to an instruction of a user, clinical demands, the updated training set, or a default setting of the emotion recognition system 100. For example, the universal speech recognition model and/or the special speech recognition model may be updated at set intervals (e.g., every other month, every two months, etc. ) . As another example, the universal speech recognition model and/or the special speech recognition model may be updated based on added data in the training sets of the universal speech recognition model and/or the special speech recognition model over a period (e.g., every other month, every two months, etc. ) . If the quantity of the added data in the training set over a period of time is greater than a threshold, the universal speech recognition model and/or the special speech recognition model may be updated based on the updated training set.
It should be noted that the above description regarding the process 900 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more operations may be omitted and/or one or more additional operations may be added. For example, operation 910 and operation 920 may be combined into a single operation to obtain the universal speech recognition model. As another example, one or more operations may be added to the process 900. For example, the universal audio data may be preprocessed by one or more preprocessing operations (e.g., a denoising operation).
FIG. 10A is a flowchart illustrating an exemplary process 1000 for determining an acoustic based emotion recognition model according to some embodiments of the present disclosure. At least a portion of process 1000 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1000 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 1000 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
In 1010, the processing device 112 (e.g., the obtaining module 430) may obtain a plurality of groups of voice signals. Each group of the plurality of voice signals may include one or more acoustic characteristics. The processing device 112 (e.g., the obtaining module 430) may obtain the plurality of groups of voice signals from the terminal 130, the terminal 140, the storage device 120, or any other storage device. In some embodiments, the one or more acoustic characteristics may include one or more features associated with duration, one or more features associated with energy, one or more features associated with fundamental frequency, one or more features associated with frequency spectrum, etc. A feature associated with duration may also be referred to as a duration feature. Exemplary duration features may include a speaking speed, a short-time average zero-crossing rate, etc. A feature associated with energy may also be referred to as an energy or amplitude feature. Exemplary amplitude features may include a short-time average energy, a short-time average amplitude, a short-time energy gradient, an average amplitude change rate, a short-time maximum amplitude, etc. A feature associated with fundamental frequency may also be referred to as a fundamental frequency feature. Exemplary fundamental frequency features may include a pitch, a fundamental frequency, an average fundamental frequency, a maximum fundamental frequency, a fundamental frequency range, etc. Exemplary features associated with frequency spectrum may include formant features, linear predictive coding cepstrum coefficients (LPCC), mel-frequency cepstrum coefficients (MFCC), features of the smoothed pitch contour and its derivatives, etc.
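As a hedged illustration of how such acoustic characteristics might be extracted in practice, the following sketch uses the open-source librosa library, which is not named in the disclosure; the file path and the choice of summary statistics are assumptions.

```python
import numpy as np
import librosa  # assumption: librosa is available; the disclosure does not name a library

# Sketch: extract some of the acoustic characteristics listed above from one
# voice signal (zero-crossing rate, short-time energy, fundamental frequency, MFCC).
y, sr = librosa.load("voice_sample.wav", sr=None)   # hypothetical file path

zcr  = librosa.feature.zero_crossing_rate(y)                  # duration-related feature
rms  = librosa.feature.rms(y=y)                               # short-time energy
f0   = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                   fmax=librosa.note_to_hz("C7"), sr=sr)      # fundamental frequency (pitch)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # spectral features

# Summarize each characteristic over time so one group of voice signals
# yields a fixed-length feature vector for an emotion recognition model.
features = np.concatenate([
    [zcr.mean(), rms.mean(), f0.mean(), f0.max() - f0.min()],
    mfcc.mean(axis=1),
])
print(features.shape)
```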
In some embodiments, the plurality of groups of voice signals may be generated by different users communicating in different scenes. For example, the voice signals may be generated by a speechmaker and/or a participant communicating in a meeting scene. As another example, the voice signals may be obtained from a passenger and/or a driver in a travel scene. In some embodiments, the plurality of groups of voice signals may be generated by one or more users communicating in a same scene. For example, the plurality of groups of voice signals may be generated by one or more users playing in one or more scenes of an RPG. In some embodiments, the plurality of groups of voice signals may be generated by one or more testers.
Each group of the plurality of groups of voice signals or acoustic characteristics may correspond to a label indicating an actual emotion that each group of the plurality of groups of voice signals or acoustic characteristics reflects. The label corresponding to each group of the plurality of groups of voice signals or acoustic characteristics may denote a category and/or degree of the actual emotion that each group of the plurality of groups of voice signals or acoustic characteristics reflects. For example, the label may be one of positive, negative, and else (e.g., neutral). As another example, the label may be one of “joy”, “anger”, “fear”, “disgust”, “surprise”, “sadness”, and else (e.g., neutral). As another example, the label may be one of “interest”, “desire”, “sorrow”, “wonder”, “surprise”, “happiness”, and else (e.g., neutral). As still another example, the label may be one of “anxiety”, “anger”, “sadness”, “disgust”, “happiness”, and else (e.g., neutral). As still another example, the label may be one of “pleasure”, “pain”, and else (e.g., neutral). In some embodiments, the label may include strong and weak, or first level, second level, and third level, etc. In some embodiments, the label corresponding to each group of the plurality of groups of voice signals or acoustic characteristics may be determined based on each group of the plurality of groups of voice signals or acoustic characteristics by an operator (e.g., an engineer) manually.
In 1020, the processing device 112 (e.g., the model determination module 420, the emotion recognition unit 424) may use the plurality of groups of voice signals to train a machine learning model to obtain an acoustic based emotion recognition model. In some embodiments, the machine learning model may include a linear regression model, a Kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof. In some embodiments, the machine learning model may be trained by performing a plurality of iterations. For each of the plurality of iterations, a specific group of voice signals or acoustic characteristics may first be inputted into the machine learning model. The machine learning model may determine a predicted emotion corresponding to the specific group of voice signals or acoustic characteristics. The predicted emotion may then be compared with a label (i.e., an actual emotion) of the specific group of voice signals or acoustic characteristics based on a cost function. The cost function of the machine learning model may be configured to assess a difference between an estimated value (e.g., the predicted emotion) of the machine learning model and a desired value (e.g., the label or the actual emotion). If the value of the cost function exceeds a threshold in a current iteration, parameters of the machine learning model may be adjusted and updated to reduce the value of the cost function (i.e., the difference between the predicted emotion and the actual emotion) below the threshold. Accordingly, in a next iteration, another group of voice signals or acoustic characteristics may be inputted into the machine learning model to train the machine learning model as described above. The plurality of iterations may be performed to update the parameters of the machine learning model until a termination condition is satisfied. The termination condition may provide an indication of whether the machine learning model is sufficiently trained. For example, the termination condition may be satisfied if the value of the cost function associated with the machine learning model is minimal or smaller than a threshold (e.g., a constant). As another example, the termination condition may be satisfied if the value of the cost function converges. The convergence may be deemed to have occurred if the variation of the values of the cost function in two or more consecutive iterations is smaller than a threshold (e.g., a constant). As still another example, the termination condition may be satisfied when a specified number of iterations have been performed in the training process. The trained machine learning model (i.e., the acoustic based emotion recognition model) may be determined based on the updated parameters. In some embodiments, the trained machine learning model (i.e., the acoustic based emotion recognition model) may be transmitted to the storage device 120, the storage module 408, or any other storage device for storage.
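For illustration only, the following sketch trains one of the model families named above (a support vector machine) on pre-extracted acoustic characteristics using scikit-learn; the feature and label files are hypothetical placeholders for a manually labeled training set.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: one row of acoustic characteristics per group of
# voice signals, and one manually assigned emotion label per group.
X = np.load("acoustic_features.npy")   # placeholder file name
y = np.load("emotion_labels.npy")      # e.g. "joy", "anger", ..., "neutral"

# probability=True exposes per-emotion probability estimates after training.
acoustic_model = SVC(probability=True)
acoustic_model.fit(X, y)

# Estimating the emotion of a new group of acoustic characteristics.
new_features = X[:1]
print(acoustic_model.predict(new_features))        # predicted emotion category
print(acoustic_model.predict_proba(new_features))  # probability per emotion
```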
The acoustic based emotion recognition model may be configured to estimate an emotion based on one or more acoustic characteristics. For example, the acoustic based emotion recognition model may determine a category and/or degree of an emotion based on one or more acoustic characteristics. The category and/or degree of an emotion estimated by the acoustic based emotion recognition model may be associated with labels of the plurality of groups of voice signals or acoustic characteristics in a training set. For example, if the labels of the plurality of groups of voice signals or acoustic characteristics include positive, negative, and else (e.g., neutral) , the category of an emotion estimated by the acoustic based emotion recognition model may be one of positive, negative, and else (e.g., neutral) . If the labels of the plurality of groups of voice signals or acoustic characteristics include “joy” , “anger” , “fear” , “disgust” , “surprise” , “sadness” , and else (e.g., neutral) , the category of an emotion estimated by the acoustic based emotion recognition model may be one of “joy” , “anger” , “fear” , “disgust” , “surprise” , “sadness” , and else (e.g., neutral) .
FIG. 10B is a flowchart illustrating an exemplary process 1050 for determining a content based emotion recognition model according to some embodiments of the present disclosure. At least a portion of process 1050 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1050 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 1050 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
In 1030, the processing device 112 (e.g., the obtaining module 430) may obtain a plurality of groups of audio data. In some embodiments, a group of audio data may include a plurality of phoneme sets, each of which includes one or more phonemes. Each phoneme set may correspond to a pronunciation of a word. In some embodiments, a group of audio data may include a plurality of word sets, each of which includes one or more words. In some embodiments, a group of audio data may include a plurality of phrase sets, each of which includes one or more phrases. In some embodiments, the plurality of groups of audio data may be generated by different users communicating in different scenes. For example, a group of audio data may be generated by a speechmaker and/or a participant communicating in a meeting scene. As another example, a group of audio data may be obtained from a passenger and/or a driver in a travel scene. In some embodiments, the plurality of groups of audio data may be generated by one or more users communicating in a same scene. For example, the plurality of groups of audio data may be generated by one or more users playing in a scene of an RPG. In some embodiments, the plurality of groups of audio data may be generated by one or more testers.
In 1040, the processing device 112 (e.g., the model determination module 420, the emotion recognition unit 424) may use speech recognition to convert each group of the plurality of groups of audio data to obtain a result of the speech recognition comprising a text content of each of the plurality of groups of audio data. In some embodiments, a speech recognition model may be used to obtain the text content of each group of audio data. Exemplary speech recognition models may include a Hidden Markov model (HMMs) , a dynamic time warping (DTW) -based speech recognition model, an artificial neural network model, an end-to-end automatic speech recognition model, or the like, or any combination thereof. In some embodiments, the speech recognition model may be a universal speech recognition model (e.g. a deep neural network model) . The universal speech recognition model may be trained using universal training data. The universal training data may include a plurality of groups of universal audio data corresponding to universal audio scenes, such as, a meeting scene, a working scene, a game scene, a party scene, a travel scene, a play scene, or the like, or any combination thereof. In some embodiments, the speech recognition model may be a special speech recognition model for the RPG. The special speech recognition model may be obtained by training the universal speech recognition model or a machine learning model using special training data. The special training data may include special audio data corresponding to special audio scenes of the RPG. More descriptions for the speech recognition model may be found elsewhere in the present disclosure (e.g., FIG. 9, and the descriptions thereof) .
In some embodiments, the text content of each group of the plurality of groups of audio data may correspond to a label indicating an actual emotion that each group of the plurality of groups of audio data reflects. The label corresponding to each group of the plurality of groups of audio data may denote a category and/or degree of the actual emotion that each group of the plurality of groups of audio data reflects. For example, the label may be one of positive, negative, and else (e.g., neutral). As another example, the label may be one of “joy”, “anger”, “fear”, “disgust”, “surprise”, “sadness”, and else (e.g., neutral). As another example, the label may be one of “interest”, “desire”, “sorrow”, “wonder”, “surprise”, “happiness”, and else (e.g., neutral). As still another example, the label may be one of “anxiety”, “anger”, “sadness”, “disgust”, “happiness”, and else (e.g., neutral). As still another example, the label may be one of “pleasure”, “pain”, and else (e.g., neutral). In some embodiments, the label may include strong and weak, or first level, second level, and third level, etc. In some embodiments, the label corresponding to each group of the plurality of groups of audio data may be determined based on each group of the plurality of groups of audio data by an operator (e.g., an engineer) manually.
In 1050, the processing device 112 (e.g., the model determination module 420, the emotion recognition unit 424) may use the text content of each group of audio data to train a machine learning model to obtain a content based emotion recognition model. In some embodiments, the machine learning model may include a linear regression model, a Kernel function model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof. As used herein, the machine learning model may be a fast text model, which may quickly classify the text content of each group of the plurality of groups of audio data into different text types.
A training process of the content based emotion recognition model may be similar to or the same as the training process of the acoustic based emotion recognition model. For example, the content based emotion recognition model may be obtained by performing a plurality of iterations. For each of the plurality of iterations, a text content of a specific group of audio data may first be inputted into the machine learning model. The machine learning model may determine a predicted emotion corresponding to the text content of the specific group of audio data. The predicted emotion may then be compared with an actual emotion (i.e., a label) corresponding to the text content of the specific group of audio data based on a cost function. If the value of the cost function exceeds a threshold in a current iteration, parameters of the machine learning model may be adjusted and updated to reduce the value of the cost function (i.e., the difference between the predicted emotion and the actual emotion) below the threshold. Accordingly, in a next iteration, the text content of another group of audio data may be inputted into the machine learning model to train the machine learning model as described above. The plurality of iterations may be performed to update the parameters of the machine learning model until a termination condition is satisfied. The content based emotion recognition model may be obtained based on the updated parameters.
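As an illustrative sketch of a content based emotion classifier, the snippet below uses the open-source fastText library, on the assumption that the "fast text model" mentioned above refers to a fastText-style classifier; the training file name and label format are assumptions.

```python
import fasttext

# emotion_train.txt is a hypothetical file with one labeled text content per
# line, in fastText's supervised format, e.g.:
#   __label__anger  I can't believe you did that again
content_model = fasttext.train_supervised(input="emotion_train.txt")

# Predict the top emotions (and their probabilities) for a new text content.
labels, probabilities = content_model.predict("oh my god", k=3)
print(labels, probabilities)
```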
It should be noted that the above description regarding the process 1000 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more operations may be omitted and/or one or more additional operations may be added. For example, the process 1000 and the process 1050 may be combined into a single process to train a fixed emotion recognition model. The fixed emotion recognition model may be composed of an acoustic based emotion recognition model and a text based emotion recognition model.
FIG. 11 is a flowchart illustrating an exemplary process for determining an emotion of a user according to some embodiments of the present disclosure. At least a portion of process 1100 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1100 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 1100 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
For voice control, it is important to accurately determine a real intention of a user because the real intention of the user may be affected by different emotions under the same text content corresponding to an inputted voice (i.e., audio data) . For example, the real intention may be the same as or opposite to the original meaning of the text content. In some embodiments, the text content of a voice may be positive, negative, or neutral. For example, when the user is happy and the inputted voice is "agree" , the text content of the voice may be positive, indicating that the real meaning of the inputted voice is the same as the original meaning of the word "agree. " When the user is unhappy and the inputted voice is "agree" , the text content may be negative, indicating that the real meaning of the inputted voice is opposite to the original meaning of the word "agree. " Therefore, in order to improve the accuracy of voice control, it is necessary to determine an emotion of the user based on a text content and an emotion corresponding to voice.
In 1110, the processing device 112 (e.g., the obtaining module 410) may use speech recognition to convert audio data of the user in the scene to obtain a result of the speech recognition comprising a text content of the user’s voice signal.
In some embodiments, the processing device 112 may use a speech recognition model to obtain the result of the speech recognition. Exemplary speech recognition models may include a Hidden Markov model (HMMs) , a dynamic time warping (DTW) -based speech recognition model, an artificial neural network model, an end-to-end automatic speech recognition model, or the like, or any combination thereof. In some embodiments, the speech recognition model may be a universal speech recognition model (e.g. a deep neural network model) . More descriptions for speech recognition may be found elsewhere in the present disclosure (e.g., operation 630, FIG. 9 and the descriptions thereof) .
In 1120, the processing device 112 (e.g., the model determination module 420) may determine a first probability corresponding to each of one or more predetermined emotions based on a text vector corresponding to the text content.
In some embodiments, the predetermined emotions may include “joy, ” “anger, ” “fear, ” “disgust, ” “surprise, ” or “sadness. ” In some embodiments, the predetermined emotions may include “interest, ” “desire, ” “sorrow, ” “wonder, ” “surprise, ” or “happiness. ” In some embodiments, the predetermined emotions may include “anxiety, ” “anger, ” “sadness, ” “disgust, ” or “happiness. ” In some embodiments, the predetermined emotions may include “pleasure, ” or “pain. ” The first probability may indicate a possibility of the text content expressing each of the predetermined emotions. For example, the first probability may include a probability of the text content expressing “anger, ” a probability of the text content expressing “happiness, ” a probability of the text content expressing “sadness, ” a probability of the text content expressing “disgust, ” a probability of the text content expressing “surprise, ” a probability of the text content expressing “fear, ” etc. The first probability of the text content expressing each of the predetermined emotions may be determined based on the text vector corresponding to the text content. More descriptions about the determination of the first probability may be found elsewhere in the present disclosure. See, for example, FIG. 12 and descriptions thereof.
In 1130, the processing device 112 (e.g., the model determination module 420) may determine a second probability corresponding to the each of one or more predetermined emotions based on acoustic characteristics of the audio data.
The acoustic characteristics of the audio data may be identified and/or determined from the audio data of the user using an acoustic characteristic extraction technique (e.g., an ACF algorithm, an AMDF algorithm, etc.). Merely by way of illustration, the acoustic characteristics may include a zero-crossing rate, a root-mean-square (RMS) energy, F0 (also referred to as pitch or fundamental frequency), a harmonics-to-noise ratio (HNR), mel-frequency cepstral coefficients (MFCC), etc. It should be noted that the acoustic characteristics may be set according to actual needs, and the present disclosure is not intended to be limiting. For example, the acoustic characteristics may include other characteristics as described elsewhere in the present disclosure (e.g., operation 502 and the descriptions thereof).
The acoustic characteristics of the audio data may represent emotions of a user when he/she inputs a voice (e.g., the audio data), such as tone and intonation. The acoustic characteristics may indicate whether the text content of the voice (e.g., the audio data) is positive or negative. The second probability may indicate a possibility of the acoustic characteristics expressing each of the one or more predetermined emotions. The processing device 112 may determine the second probability corresponding to each of the one or more predetermined emotions based on the acoustic characteristics of the audio data. For example, the processing device 112 may determine the second probability corresponding to each of the one or more predetermined emotions based on an MFCC. In some embodiments, the processing device 112 may determine the second probability using an acoustic based emotion recognition model as described elsewhere in the present disclosure (e.g., FIG. 5 and the descriptions thereof). More descriptions about determination of the second probability may be found elsewhere in the present disclosure (e.g., FIG. 13 and descriptions thereof).
In 1140, the processing device 112 (e.g., the model determination module 420) may determine an emotion degree corresponding to each of the one or more predetermined emotions based on at least one of the first probability and the second probability.
The emotion degree may be used to denote an intensity of each of the predetermined emotions of the user in the scene. For example, the emotion degree may be denoted by a level, such as strong or weak, or first level, second level, or third level, etc. As another example, the emotion degree may be denoted by a score, such as high or low. The higher the emotion degree corresponding to a predetermined emotion is, the more likely the emotion represented by the user is to be that predetermined emotion.
In some embodiments, the emotion degree corresponding to each of the one or more predetermined emotions may be determined based on the first probability and the second probability corresponding to the each of the one or more predetermined emotions. For example, the emotion degree may be determined based on the first probability, the second probability, and weight values assigned to the first probability and the second probability. The weight values may be used to represent importance degrees of the text content (e.g., represented by the first probability) and the acoustic characteristics (e.g., represented by the second probability) in determining emotions of the voice signal. Thus, the emotion degree may be determined accurately. For example, the first weight value may be assigned to the first probability corresponding to each of the predetermined emotions based on the text content, and the second weight value may be assigned to the second probability corresponding to each of the predetermined emotions based on the acoustic characteristics. Merely by way of illustration, the first weight value may be 2, and the second weight value may be 1. The weight values may be default settings or set under different conditions. It should be noted that the first weight values assigned to the first probabilities corresponding to the predetermined emotions may be the same or different, and the second weight values assigned to the second probabilities corresponding to the predetermined emotions may be the same or different.
In some embodiments, based on a first probability corresponding to a predetermined emotion, the second probability corresponding to the same predetermined emotion, and weight values assigned to the first probability and the second probability, an emotion degree corresponding to the same predetermined emotion may be obtained by the following Equation (1) :
y5 = W2·log(p) + W1·log(q)        (1)
wherein p denotes the second probability, q denotes the first probability, W1 denotes a weight value of the first probability, W2 denotes a weight value of the second probability, and y5 denotes the emotion degree. In some embodiments, if there are 5 predetermined emotions, 5 emotion degrees corresponding to the 5 predetermined emotions may be obtained by Equation (1). In some embodiments, the weight values assigned to the first probability corresponding to each of the predetermined emotions may be the same or different. In some embodiments, the weight values assigned to the second probability corresponding to each of the predetermined emotions may be the same or different. In some embodiments, the processing device 112 may determine the emotion degree corresponding to each of the one or more predetermined emotions based on the first probability. In some embodiments, the processing device 112 may determine the emotion degree corresponding to each of the one or more predetermined emotions based on the second probability. In some embodiments, the processing device 112 may determine a first emotion degree corresponding to each of the one or more predetermined emotions based on the first probability. The processing device 112 may determine a second emotion degree corresponding to each of the one or more predetermined emotions based on the second probability. In some embodiments, the processing device 112 may compare the first emotion degree and the second emotion degree and determine the maximum or minimum of the first emotion degree and the second emotion degree as the emotion degree.
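The following is a minimal numerical sketch of Equation (1), with illustrative probabilities and the example weights mentioned above (W1 = 2, W2 = 1), followed by the selection of the highest-degree emotion described in operation 1150 below.

```python
import numpy as np

# Illustrative values only; in practice q and p come from the content based
# and acoustic based emotion recognition models, respectively.
emotions = ["joy", "anger", "fear", "disgust", "surprise"]
q = np.array([0.50, 0.20, 0.10, 0.10, 0.10])   # first probabilities (text content)
p = np.array([0.40, 0.30, 0.10, 0.10, 0.10])   # second probabilities (acoustics)
W1, W2 = 2.0, 1.0                              # example weights from the text

emotion_degree = W2 * np.log(p) + W1 * np.log(q)   # Equation (1) per emotion
recognized = emotions[int(np.argmax(emotion_degree))]
print(recognized)  # predetermined emotion with the highest emotion degree
```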
In 1150, the processing device 112 (e.g., the emotion recognition module 430) may determine an emotion of the user based on the emotion degree corresponding to each of the one or more predetermined emotions.
In some embodiments, the processing device 112 may rank the predetermined emotions corresponding to the emotion degrees according to levels or scores representing the emotion degrees (e.g., in an ascending or descending order) . The processing device 112 may determine a predetermined emotion with the highest level or highest score of the emotion degree as an emotion of the user.
In 1160, the processing device 112 (e.g., the sending module 450) may send the emotion and the text content to a terminal device.
In some embodiments, when receiving the text content and the emotion from the processing device 112, the terminal device (e.g., the terminal 130, the terminal 140) may recognize the user's actual intention through the text content and the emotion to perform operations in the scene (e.g., adjusting a plot of the RPG, pushing a plot of the RPG). For example, if the emotion is “happy” and the text content is “agree,” the terminal device may perform the operation of “agree” in the scene. As another example, if the emotion is “unhappy” and the text content is “agree,” the terminal device may perform an operation different from “agree,” such as “disagree.” Thus, it is beneficial to improve the accuracy of the voice control since the terminal device can obtain both the text content and the emotion of the audio data.
It should be noted that the above description regarding the process 1100 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more operations may be omitted and/or one or more additional operations may be added. For example, operation 1130 may be omitted and an emotion degree corresponding to each of the one or more predetermined emotions may be determined based on the first probability in 1140.
FIG. 12 is a flowchart illustrating an exemplary process for determining a first probability corresponding to each of one or more predetermined emotions according to some embodiments of the present disclosure. At least a portion of process 1200 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1200 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 1200 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals. The process 1200 may be performed to accomplish operation 1120 in FIG. 11.
In 1210, the processing device 112 (e.g., the model determination module 420) may determine a word vector corresponding to each of one or more words in a text content.
In some embodiments, the processing device 112 may determine a word vector corresponding to each of one or more words in the text content based on a word vector dictionary. The word vector dictionary may provide a mapping relationship between a set of words and word vectors. Each of the set of words in the word vector dictionary corresponds to one of the word vectors. The word vector dictionary may be set in advance, and stored in the storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) . In some embodiments, the processing device 112 may search for each of the one or more words in the text content from the vector dictionary and determine a word vector corresponding to the each of the one or more words in the word vector dictionary. For example, when the user speaks “Oh my god” , the processing device 112 may determine three word vectors corresponding to three words “Oh, ” “my, ” “god, ” respectively, from the word vector dictionary. The three word vectors may be denoted as word vector 1, word vector 2, and word vector 3.
In 1220, the processing device 112 (e.g., the model determination module 420) may determine a text vector by summing word vectors.
The text vector may correspond to the text content. The text vector may be determined by summing word vectors. For example, the obtained word vectors corresponding to three words “Oh, ” “my, ” “god” may include word vector 1, word vector 2, and word vector 3. The processing device 112 may sum word vector 1, word vector 2, and word vector 3 to obtain a sum result, i.e., the text vector. The sum result may be determined as the text vector corresponding to the text content “Oh my god. ”
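A minimal sketch of the word vector lookup and summation in operations 1210 and 1220, assuming a toy word vector dictionary; the vectors are illustrative placeholders for the stored dictionary.

```python
import numpy as np

# Toy stand-in for the word vector dictionary stored in the storage device.
word_vector_dictionary = {
    "oh":  np.array([0.1, 0.3, 0.0]),
    "my":  np.array([0.2, 0.1, 0.4]),
    "god": np.array([0.5, 0.0, 0.2]),
}

text_content = "oh my god"
# Look up each word's vector and sum them to obtain the text vector.
text_vector = sum(word_vector_dictionary[w] for w in text_content.split())
print(text_vector)   # text vector corresponding to "oh my god"
```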
In 1230, the processing device 112 (e.g., the model determination module 420) may determine a first probability corresponding to each of one or more predetermined emotions by inputting the text vector into a content based emotion recognition model.
The content based emotion recognition model may be configured to determine the first probability based on the text vector. In some embodiments, the content based emotion recognition model may be determined by training a machine learning model using a training set. The training set may include a plurality of text vectors obtained from a plurality of text contents of a plurality of groups of audio data. The text vector may be input into the content based emotion recognition model to determine the first probability corresponding to each of the predetermined emotions expressed by the text content. In some embodiments, the content based emotion recognition model herein may be represented by Equation (2) . The first probability may be determined after N iterations through the following Equation (2) :
y1 = H1(x1, W_H1)        (2)
wherein W_H1 denotes a learnable parameter, x1 denotes the input parameter in the nth iteration, n belongs to [1, N] and is a positive integer, N is a positive integer greater than or equal to 1, H1 denotes a function that differs according to the value of n, and y1 denotes the first probability. When n belongs to [1, N-1], H1 denotes the function relu(W_H1·x1). When the value of n is N, H1 denotes the function softmax(W_H1·x1). When the value of n is 1, the text vector may be used as the input parameter (i.e., x1). When n belongs to [2, N], the result of the last iteration is used as the input parameter of the current iteration.
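For illustration, the N-iteration computation of Equation (2) can be sketched as a small multilayer network with relu layers followed by a final softmax layer; the weight matrices below are random placeholders, not trained parameters of the content based emotion recognition model.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def first_probability(text_vector, weights):
    """Apply Equation (2) for N iterations: relu for n in [1, N-1], softmax at n = N."""
    x1 = text_vector
    for n, W_H1 in enumerate(weights, start=1):
        if n < len(weights):
            x1 = relu(W_H1 @ x1)       # iterations 1 .. N-1
        else:
            x1 = softmax(W_H1 @ x1)    # iteration N
    return x1                           # one probability per predetermined emotion

# Illustrative random weights: 3-dimensional text vector, 6 predetermined emotions.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 3)),
           rng.standard_normal((6, 8)),
           rng.standard_normal((6, 6))]
print(first_probability(np.array([0.8, 0.4, 0.6]), weights))
```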
It should be noted that the above description regarding the process 1200 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more operations may be omitted and/or one or more additional operations may be added.
FIG. 13 is a flowchart illustrating an exemplary process for determining a second probability corresponding to each of multiple predetermined emotions according to some embodiments of the present disclosure. At least a portion of process 1300 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1300 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 1300 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals. The process 1300 may be performed to accomplish operation 1130 in FIG. 11.
In 1310, the processing device 112 (e.g., the model determination module 420) may determine an MFCC corresponding to each of multiple frames of the audio data by performing a Fourier transform on the audio data.
The audio data may include a target portion, a mute portion, and a noise portion. The target portion of the audio data may refer to the speech to be recognized that is input by the user. The mute portion of the audio data may refer to one or more pauses (i.e., periods with no voice) among words and/or sentences during speaking. The noise portion may be caused by noise from the surroundings (e.g., voices from other people, walking sounds, etc.) during speaking. The target portion of the audio data needs to be identified and processed since only the target portion of the audio data relates to voice control, thereby reducing the amount of data processing. In some embodiments, the processing device 112 may identify the target portion of the audio data based on different acoustic characteristics corresponding to the target portion, the mute portion, and the noise portion of the audio data. For example, the processing device 112 may determine the MFCC corresponding to each of multiple frames in the audio data by performing the Fourier transform on the audio data. Based on the different MFCCs of the multiple frames corresponding to the target portion, the mute portion, and the noise portion of the audio data, the processing device 112 may determine the target portion.
In 1320, the processing device 112 (e.g., the model determination module 420) may identify each of the multiple frames based on the MFCC to obtain a target portion of the audio data.
In some embodiments, the processing device 112 may determine a fourth probability that each of multiple frames in the audio data belongs to each of multiple audio categories by inputting an MFCC corresponding to each of the multiple frames into a trained audio category identification model. The processing device 112 may designate the specific audio category that corresponds to the maximum fourth probability among the multiple fourth probabilities of a frame as the audio category of the frame. The processing device 112 may determine a target portion of the audio data based on the audio categories of the multiple frames. More descriptions about the identification of each of the multiple frames based on the MFCC may be found elsewhere in the present disclosure. See FIG. 14 and descriptions thereof.
In 1330, the processing device 112 (e.g., the model determination module 420) may determine a second probability corresponding to each of multiple predetermined emotions based on the target portion of the audio data.
The acoustic characteristics of the target portion may be used to obtain the second probability. In some embodiments, the processing device 112 may determine a difference between the acoustic characteristics of each two adjacent frames in the target portion of the audio data. The processing device 112 may determine a statistical result of each acoustic characteristic of the target portion of the audio data in each of a first feature set and a second feature set. The processing device 112 may determine the second probability by inputting the statistical results of each acoustic characteristic into an acoustic based emotion recognition model as described elsewhere in the present disclosure (e.g., FIG. 5 and the descriptions thereof). More descriptions for the determination of a second probability corresponding to each of multiple predetermined emotions based on the MFCC may be found elsewhere in the present disclosure. See FIG. 15 and descriptions thereof.
It should be noted that the above description regarding the process 1300 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more operations may be omitted and/or one or more additional operations may be added.
FIG. 14 is a flowchart illustrating an exemplary process for determining a target portion in audio data according to some embodiments of the present disclosure. At least a portion of process 1400 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1400 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 1400 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals. The process 1400 may be performed to accomplish operation 1320 in FIG. 13.
In 1410, the processing device 112 (e.g., the model determination module 420) may determine a fourth probability that each of multiple frames in audio data belongs to each of multiple audio categories by inputting an MFCC corresponding to each of the multiple frames into a trained audio category identification model.
The audio categories may include a target category, a mute category, and a noise category. The fourth probability may include a probability that each of multiple frames in the audio data belongs to the target category, a probability that each of multiple frames in the audio data belongs to the mute category, and a probability that each of multiple frames in the audio data belongs to the noise category.
A machine learning model may be previously obtained by using a training set including samples (e.g., audio data) each of which includes a target portion, samples each of which includes a mute portion, and samples each of which includes a noise portion. The trained audio category identification model may be obtained by training the machine learning model. The trained audio category identification model may identify the MFCC of each frame in the audio data. For recognizing the MFCC of each frame in the audio data, an MLP with an M-layer network may be used, and each layer of the network may use the following Equation (3) to perform calculations. The trained audio category identification model may be represented by Equation (3). The fourth probability that each frame in the audio data belongs to each of multiple audio categories may be determined based on the following Equation (3):
y2 = H2(x2, W_H2)        (3)
wherein W_H2 denotes a learnable parameter, x2 denotes the input parameter in the mth iteration, m belongs to [1, M] and is a positive integer, and M is a positive integer greater than or equal to 1. When m belongs to [1, M-1], H2 denotes a function of relu(W_H2·x2). When the value of m is M, H2 denotes a function of softmax(W_H2·x2). When the value of m is 1, the MFCC of each frame in the audio data is used as the input parameter. When the value of m belongs to [2, M], the result of the last iteration is used as the input parameter of the current iteration.
In 1420, the processing device 112 (e.g., the model determination module 420) may designate a specific audio category that corresponds to a maximum fourth probability among multiple fourth probabilities of each of the multiple frames as an audio category of the frame.
After determining the multiple fourth probabilities of a specific frame, including the probability that the specific frame belongs to the target category (i.e., the probability that the specific frame is a target voice frame), the probability that the specific frame belongs to the mute category (i.e., the probability that the specific frame is a mute frame), and the probability that the specific frame belongs to the noise category (i.e., the probability that the specific frame is a noise frame), the specific audio category corresponding to the maximum fourth probability among the multiple fourth probabilities may be designated as the audio category of the specific frame.
In 1430, the processing device 112 (e.g., the model determination module 420) may determine a target portion of the audio data based on the audio category of the each of the multiple frames.
The processing device 112 may determine the frames in the audio data whose audio category is the target category to obtain the target portion of the audio data. For example, if the audio data includes 10 frames and the audio category of each of the first, fifth, and eighth frames is the target category, the first, fifth, and eighth frames may be determined as the target portion of the audio data.
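A minimal sketch of operations 1420 and 1430, assuming the trained audio category identification model has already produced a matrix of fourth probabilities for three frames (illustrative values only).

```python
import numpy as np

# Illustrative per-frame fourth probabilities (rows: frames, columns: categories).
categories = ["target", "mute", "noise"]
fourth_probabilities = np.array([
    [0.7, 0.2, 0.1],   # frame 1 -> target
    [0.1, 0.8, 0.1],   # frame 2 -> mute
    [0.6, 0.1, 0.3],   # frame 3 -> target
])

# Operation 1420: the category with the maximum fourth probability per frame.
frame_categories = np.argmax(fourth_probabilities, axis=1)

# Operation 1430: the target portion is the set of frames classified as target.
target_frame_indices = np.where(frame_categories == categories.index("target"))[0]
print(target_frame_indices)   # indices of the frames forming the target portion
```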
It should be noted that the above description regarding the process 1400 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more operations may be omitted and/or one or more additional operations may be added.
FIG. 15 is a flowchart illustrating an exemplary process for determining a second probability corresponding to each of multiple predetermined emotions according to some embodiments of the present disclosure. At least a portion of process 1500 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1500 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 1500 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals. The process 1500 may be performed to accomplish operation 1330 in FIG. 13.
In 1510, the processing device 112 (e.g., the model determination module 420) may determine a difference between each acoustic characteristic of each two adjacent frames in a target portion of audio data. The target portion of the audio data may be determined as described elsewhere in the present disclosure (e.g., FIG. 14 and the descriptions thereof) .
In some embodiments, the acoustic characteristics of each two adjacent frames may include a zero-crossing rate, an RMS energy, F0, HNR, MFCC, etc. The difference between each of the acoustic characteristics of each two adjacent frames may include a difference between zero-crossing rates of each two adjacent frames, a difference between RMS energies of each two adjacent frames, a difference between F0s of each two adjacent frames, a difference between HNRs of each two adjacent frames, a difference between MFCCs of each two adjacent frames, etc. In some embodiments, the difference between an acoustic characteristic of each two adjacent frames may be determined by subtracting the acoustic characteristic of the previous frame in the each two adjacent frames from the acoustic characteristic of the frame next to the previous frame in the target portion of the audio data. Assuming that the acoustic characteristics include a zero-crossing rate, an RMS energy, F0, HNR, and MFCC, and the target portion of the audio data includes the first frame, the fifth frame, and the eighth frame, the difference between the zero-crossing rates of each two adjacent frames may include a difference between zero-crossing rates of the first frame and the fifth frame, and a difference between zero-crossing rates of the fifth frame and the eighth frame; the difference between the RMS energies of each two adjacent frames may include a difference between RMS energies of the first frame and the fifth frame, and a difference between RMS energies of the fifth frame and the eighth frame; the difference between the F0s of each two adjacent frames may include a difference between F0s of the first frame and the fifth frame, and a difference between F0s of the fifth frame and the eighth frame; the difference between the HNRs of each two adjacent frames may include a difference between HNRs of the first frame and the fifth frame, and a difference between HNRs of the fifth frame and the eighth frame; and the difference between the MFCCs of each two adjacent frames may include a difference between MFCCs of the first frame and the fifth frame, and a difference between MFCCs of the fifth frame and the eighth frame. The difference between zero-crossing rates of the first frame and the fifth frame may be determined by subtracting the zero-crossing rate of the first frame from the zero-crossing rate of the fifth frame, and the difference between zero-crossing rates of the fifth frame and the eighth frame may be determined by subtracting the zero-crossing rate of the fifth frame from the zero-crossing rate of the eighth frame. The other differences between acoustic characteristics of each two adjacent frames may be determined in the same way. In some embodiments, the first frame may be frame 0, and the acoustic characteristics of frame 0 may be equal to 0. In some embodiments, the first frame may be frame 1, and the difference between frame 1 and frame 0 may be equal to the acoustic characteristics of frame 1.
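The following sketch computes such frame-to-frame differences with NumPy. It assumes the per-frame characteristics have already been stacked into a matrix (one row per frame of the target portion, one column per characteristic) and, as in one of the embodiments above, treats the first frame's own characteristics as its difference.

```python
import numpy as np

def adjacent_frame_differences(target_frames):
    """target_frames: (n_frames, n_characteristics) matrix of acoustic characteristics
    (e.g., zero-crossing rate, RMS energy, F0, HNR, MFCC) of the frames in the
    target portion, kept in their original order."""
    diffs = np.diff(target_frames, axis=0)        # next frame minus previous frame
    return np.vstack([target_frames[:1], diffs])  # first frame kept as its own "difference"
```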
In 1520, the processing device 112 (e.g., the model determination module 420) may determine a statistic result associated with each acoustic characteristic of the target portion of the audio data in each of a first feature set and a second feature set.
The first feature set may include the difference between each of the acoustic characteristics of each two adjacent frames in the target portion of the audio data, an acoustic characteristic of the first frame in the target portion of the audio data, or the like, or any combination thereof. The second feature set may include the acoustic characteristics of each frame in the target portion of the audio data. In some embodiments, the statistic result may include a first statistic result associated with each acoustic characteristic in the first feature set and a second statistic result associated with each acoustic characteristic in the second feature set. For example, the processing device 112 may determine a first statistic result associated with the differences of the MFCC in the first feature set and determine a second statistic result associated with the MFCC in the second feature set. In some embodiments, the processing device 112 may determine the first statistic result associated with each acoustic characteristic of the target portion of the audio data by performing a statistical calculation based on the first feature set and/or the second statistic result associated with each acoustic characteristic of the target portion of the audio data by performing a statistical calculation based on the second feature set.
In some embodiments, the statistic result may include one or more statistics associated with one or more statistic factors. Exemplary statistic factors may include a mean, a variance, a skewness, a kurtosis, extreme point information (e.g., an extreme point value, an extreme point position, an extreme point range) of the statistic, a slope after linear regression, or the like, or any combination thereof. In some embodiments, a count of the one or more statistics of the acoustic characteristics of the target portion of the audio data in the first feature set and the second feature set may be associated with a count of the one or more statistic factors of an acoustic characteristic (denoted as X) and a count of the acoustic characteristics of a frame (denoted as Y) . For example, the count of the one or more statistics of the acoustic characteristics of the target portion of the audio data may be 2*X*Y.
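A minimal sketch of such a statistic result is given below. It assumes SciPy for skewness and kurtosis and a simple choice of extreme point information (peak value, position, and range); applying it to both the first feature set (differences) and the second feature set (raw characteristics) yields the 2*X*Y statistics mentioned above.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def statistic_result(feature_set):
    """feature_set: (n_frames, Y) matrix; returns X statistic factors per characteristic."""
    idx = np.arange(feature_set.shape[0])
    stats = []
    for col in feature_set.T:
        stats.extend([
            col.mean(),                  # mean
            col.var(),                   # variance
            skew(col),                   # skewness
            kurtosis(col),               # kurtosis
            col.max(),                   # extreme point value
            float(np.argmax(col)),       # extreme point position
            col.max() - col.min(),       # extreme point range
            np.polyfit(idx, col, 1)[0],  # slope after linear regression
        ])
    return np.asarray(stats)             # length X * Y
```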
In 1530, the processing device 112 (e.g., the model determination module 420) may determine a second probability by inputting the statistic result of each acoustic characteristic into an acoustic based emotion recognition model.
The acoustic based emotion recognition model may be determined by training a machine learning model (e.g., a classifier) by a processing device that is the same as or different from the processing device 112. The acoustic based emotion recognition model may be configured to determine the second probability corresponding to each predetermined emotion expressed by the target portion of the audio data. The second probability corresponding to each predetermined emotion may be determined by inputting the statistic result of each acoustic characteristic into the acoustic based emotion recognition model. In the trained classifier (i.e., the acoustic based emotion recognition model) , an iteration result may be determined after L iterations through the following Equation (4) :
y3 = H3(x3, W_H3) · T(x3, W_T) + x3 · (1 - T(x3, W_T))     (4)
wherein W_H3 and W_T denote learnable parameters, x3 denotes an input parameter in the kth iteration, k belongs to [1, L] and is a positive integer, L is a positive integer greater than or equal to 1, H3 denotes the function relu(W_H3·x3), and T denotes the function sigmoid(W_T·x3). When k is equal to 1, the statistic result of each acoustic characteristic of the target portion of the audio data is used as the input parameter. When k belongs to [2, L], the result of the previous iteration is used as the input parameter of the current iteration.
The second probability may be obtained by the following Equation (5) :
y4 = H4(x4, W_H4)     (5)
wherein H4 denotes the function softmax(W_H4·x4), W_H4 is a learnable parameter, and x4 denotes the iteration result obtained from Equation (4).
It should be noted that the training of the acoustic based emotion recognition model may be set according to actual needs, and is not specifically limited herein.
It should be noted that the above description regarding the process 1500 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more operations may be omitted and/or one or more additional operations may be added.
FIG. 16 is a flowchart illustrating an exemplary process for determining an emotion of a user based on at least one of a text content and one or more acoustic characteristics in a scene according to some embodiments of the present disclosure. At least a portion of process 1600 may be implemented on the computing device 200 as illustrated in FIG. 2 or the mobile device 300 as illustrated in FIG. 3. In some embodiments, one or more operations of process 1600 may be implemented in the emotion recognition system 100 as illustrated in FIG. 1. In some embodiments, one or more operations in the process 1600 may be stored in a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as a form of instructions, and invoked and/or executed by the server 110 (e.g., the processing device 112 in the server 110, or the processor 220 of the computing device 200) or the CPU 340 of the mobile device 300. In some embodiments, the instructions may be transmitted in a form of electronic current or electrical signals.
In 1610, the processing device 112 (e.g., the obtaining module 410) may acquire audio data of a user in a scene. The audio data may be acquired from voice signals of the user playing in the scene. For example, the voice signals may be generated when the user plays in a scene of a role-playing game (RPG) . The voice signals of the user may be obtained by the obtaining module 410 from the terminal 130, the terminal 140, or a storage device (e.g., the storage device 120, the ROM 230, the RAM 240, the storage 390) as described elsewhere in the present disclosure. The audio data of the user may include semantic information of the voice signals of the user that may reflect the text content of the voice signals of the user. Exemplary audio data may include a plurality of phoneme sets, a plurality of word sets, a plurality of phrase sets, etc. More descriptions for acquiring the audio data may be found elsewhere in the present disclosure (e.g., FIG. 5 and the descriptions thereof) .
In 1620, the processing device 112 (e.g., the obtaining module 410) may perform speech recognition on the audio data of the user in the scene to obtain a result of the speech recognition comprising a text content of the user’s voice signals. More descriptions of obtaining the result of the speech recognition comprising the text content of the user’s voice signals may be found elsewhere in the present disclosure (e.g., FIGs. 5 and 11, and the descriptions thereof) .
In 1630, the processing device 112 (e.g., the model determination module 420) may determine one or more acoustic characteristics from the audio data. The acoustic characteristics of the user may be determined from the audio data of the user using an acoustic characteristic extraction technique. Exemplary acoustic characteristic extraction techniques may include using an autocorrelation function (ACF) algorithm, an average amplitude difference function (AMDF) algorithm, a nonlinear feature extraction algorithm based on teager energy operator (TEO) , a linear predictive analysis (LPC) algorithm, a deep learning algorithm (e.g., a Laplacian Eigenmaps, a principal component analysis (PCA) , a local preserved projection (LPP) , etc. ) , etc. More description for determining acoustic characteristics may be found elsewhere in the present disclosure (e.g., FIG. 5 and FIG. 11, and the descriptions thereof) .
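As a hedged example of such extraction, the sketch below uses the librosa library (an assumption; the disclosure does not name a specific library) with a hypothetical file path and default frame settings. HNR is omitted here because it typically comes from a dedicated tool.

```python
import librosa

# Load the user's audio; the path and sampling rate are illustrative assumptions.
y, sr = librosa.load("user_audio.wav", sr=16000)

zcr = librosa.feature.zero_crossing_rate(y)         # zero-crossing rate per frame
rms = librosa.feature.rms(y=y)                      # RMS energy per frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # MFCCs per frame
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)       # fundamental frequency (F0) per frame
```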
In 1640, the processing device 112 (e.g., the emotion recognition module 430) may determine an emotion of the user based on at least one of the text content and the one or more acoustic characteristics.
In some embodiments, the processing device 112 may obtain an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user. The processing device 112 may obtain a content based emotion recognition model configured to determine an emotion of the user based on the text content derived from the audio data of the user. The processing device 112 may determine the emotion of the user based on the at least one of the text content and the one or more acoustic characteristics using the acoustic based emotion recognition model and/or the content based emotion recognition model. More descriptions for determining the emotion of the user using the acoustic based emotion recognition model and/or the content based emotion recognition model may be found elsewhere in the present disclosure (e.g., FIGs. 5-8, and the descriptions thereof) .
In some embodiments, the processing device 112 may determine a first probability corresponding to each of one or more predetermined emotions based on a text vector corresponding to the text content and a second probability corresponding to the each of one or more predetermined emotions based on acoustic characteristics of the audio data. The processing device 112 may determine an emotion degree corresponding to each of the one or more predetermined emotions based on at least one of the first probability and the second probability. The processing device 112 may determine the emotion of the user based on the emotion degree corresponding to each of the one or more predetermined emotions. More description of the determination of the emotion based on the emotion degree may be found elsewhere in the present disclosure (e.g., FIGs. 11-14 and descriptions thereof) .
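One possible way to combine the two probabilities into an emotion degree is sketched below. The weighted-sum rule, the weight value, and the emotion labels are assumptions for illustration, since the actual combination is described elsewhere in the disclosure.

```python
import numpy as np

EMOTIONS = ["happy", "angry", "sad", "anxious", "neutral"]  # hypothetical label set

def emotion_from_probabilities(first_prob, second_prob, w_content=0.5):
    """first_prob: content based probabilities; second_prob: acoustic based probabilities."""
    degree = w_content * np.asarray(first_prob) + (1.0 - w_content) * np.asarray(second_prob)
    return EMOTIONS[int(np.argmax(degree))], degree  # emotion of the user and its degrees
```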
In 1650, the processing device 112 (e.g., the sending module 450) may send at least one of the emotion and the text content to a terminal device.
In some embodiments, the terminal device may perform voice control based on the text content and/or the emotion. For example, the terminal device may adjust a plot of the RPG subsequent to the scene and/or an element of the RPG based on the determined real time emotion of the user in the scene. More descriptions for adjusting a plot of the RPG subsequent to the scene and/or an element of the RPG may be found elsewhere in the present disclosure (e.g., FIG. 5, and the descriptions thereof) .
Having thus described the basic concepts, it may be rather apparent to those skilled in the art after reading this detailed disclosure that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting. Various alterations, improvements, and modifications may occur and are intended to those skilled in the art, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the exemplary embodiments of this disclosure.
Moreover, certain terminology has been used to describe embodiments of the present disclosure. For example, the terms “one embodiment, ” “an embodiment, ” and/or “some embodiments” mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment, ” “one embodiment, ” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the present disclosure.
Further, it will be appreciated by one skilled in the art that aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or context including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc. ) or in a combination of software and hardware implementation that may all generally be referred to herein as a "block, " “module, ” “engine, ” “unit, ” “component, ” or “system. ” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the “C” programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a software as a service (SaaS) .
Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations, therefore, is not intended to limit the claimed processes and methods to any order except as may be specified in the claims. Although the above disclosure discusses through various examples what is currently considered to be a variety of useful embodiments of the disclosure, it is to be understood that such detail is solely for that purpose, and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover modifications and equivalent arrangements that are within the spirit and scope of the disclosed embodiments. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server or mobile device.
Similarly, it should be appreciated that in the foregoing description of embodiments of the present disclosure, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the various embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, claimed subject matter may lie in less than all features of a single foregoing disclosed embodiment.
Claims (36)
- A system for emotion recognition, comprising:at least one storage medium storing a set of instructions;at least one processor in communication with the at least one storage medium to execute the set of instructions to perform operations including:obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user;optionally determining an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user;optionally determining a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user; anddetermining a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
- The system of claim 1, wherein the target real time emotion determination step comprises the sub step of using the content based emotion recognition model to perform a correction of the acoustic based real time emotion of the user to obtain a corrected real time emotion as the target real time emotion of the user.
- The system of claim 2, wherein the correction of the real time emotion comprises:using the content based real time emotion of the user as the corrected real time emotion of the user.
- The system of claim 1, wherein the target real time emotion determination step comprises the sub step of:determining the target real time emotion of the user by comparing the acoustic based real time emotion and the content based real time emotion of the user.
- The system of claim 1, wherein to determine the target real time emotion of the user by comparing the acoustic based real time emotion and the content based real time emotion of the user, the at least one processor performs additional operations including:using the acoustic based emotion recognition model to determine a first confidence level for the acoustic based real time emotion;using the content based emotion recognition model to determine a second confidence level for the content based real time emotion;comparing the first confidence level and the second confidence level to determine one of the acoustic based real time emotion and the content based real time emotion that corresponds to a higher confidence level as the target real time emotion.
- The system of any one of claims 1 to 5, wherein to determine the acoustic based real time emotion of the user, the at least one processor performs additional operations including:obtaining base acoustic characteristics of the user acquired before the scene of the user;calibrating the acoustic characteristics of the user in the scene with the base acoustic characteristics of the user to obtain calibrated acoustic characteristics of the user in the scene; andusing the acoustic based emotion recognition model to determine, based on the calibrated acoustic characteristics of the user in the scene, the acoustic based real time emotion of the user.
- The system of any one of claims 1 to 6, wherein the content based real time emotion determination step comprises the sub steps of:using a speech recognition model to convert the audio data of the user in the scene into a text content; andusing the content based emotion recognition model to determine, based on the text content, the content based real time emotion of the user.
- The system of claim 7, wherein the speech recognition model is obtained by:obtaining a plurality of groups of universal audio data of one or more subjects communicating in one or more circumstances;determining a universal speech recognition model by training a machine learning model using the plurality of groups of universal audio data;obtaining a plurality of groups of special audio data of one or more subjects associated with the scene; andusing the plurality of groups of special audio data to train the universal speech recognition model to determine the speech recognition model.
- The system of any one of claims 1 to 8, wherein the acoustic based emotion recognition model is obtained by:obtaining a plurality of groups of acoustic characteristics associated with the scene of users; andusing the plurality of groups of acoustic characteristics to train a first machine learning model to determine the acoustic based emotion recognition model.
- The system of claim 9, wherein the first machine learning model includes a support vector machine.
- The system of any one of claims 1 to 10, wherein the content based emotion recognition model is obtained by:obtaining a plurality of groups of audio data associated with the scene of users;converting each group of the audio data into a text content; andusing the text content to train a second machine learning model to determine the content based emotion recognition model.
- The system of claim 11, wherein the second machine learning model includes a text classifier.
- The system of any one of claims 1 to 12, wherein the voice signals of the user are acquired when the user plays a role playing game (RPG) , and the at least one processor performs additional operations including:adjusting, based on the target real time emotion of the user in the scene, a plot of the RPG subsequent to the scene.
- The system of claim 13, wherein the user has a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG, and to adjust, based on the target real time emotion of the user, a plot of the RPG, the at least one processor performs operations including:determining, based on the target real time emotion of the user, the relationship between the user and the one or more real life players or the one or more characters in the RPG; andadjusting, based on the determined relationship, the plot of the RPG.
- The system of claim 13 or claim 14, wherein the at least one processor performs additional operations including:adjusting, based on the target real time emotion of the user in the scene, an element of the RPG in the scene, wherein the element of the RPG includes at least one of:a vision effect associated with the RPG in the scene;a sound effect associated with the RPG in the scene;a display interface element associated with the RPG in the scene; orone or more props used in the RPG in the scene.
- A method implemented on a computing device including a storage device and at least one processor for emotion recognition, the method comprising:obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user;optionally determining an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user;optionally determining a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user; anddetermining a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
- The method of claim 16, wherein the target real time emotion determination step comprises the sub step of using the content based emotion recognition model to perform a correction of the acoustic based real time emotion of the user to obtain a corrected real time emotion as the target real time emotion of the user.
- The method of claim 17, wherein the correction of the real time emotion comprises:using the content based real time emotion of the user as the corrected real time emotion of the user.
- The method of claim 16, wherein the target real time emotion determination step comprises the sub step of:determining the target real time emotion of the user by comparing the acoustic based real time emotion and the content based real time emotion of the user.
- The method of claim 16, wherein to determine the target real time emotion of the user by comparing the acoustic based real time emotion and the content based real time emotion of the user, the at least one processor performs additional operations including:using the acoustic based emotion recognition model to determine a first confidence level for the acoustic based real time emotion;using the content based emotion recognition model to determine a second confidence level for the content based real time emotion;comparing the first confidence level and the second confidence level to determine one of the acoustic based real time emotion and the content based real time emotion that corresponds to a higher confidence level as the target real time emotion.
- The method of any one of claims 16 to 20, wherein to determine the acoustic based real time emotion of the user, the at least one processor performs additional operations including:obtaining base acoustic characteristics of the user;calibrating the acoustic characteristics of the user in the scene with the base acoustic characteristics of the user to obtain calibrated acoustic characteristics of the user in the scene; andusing the acoustic based emotion recognition model to determine, based on the calibrated acoustic characteristics of the user in the scene, the acoustic based real time emotion of the user.
- The method of any one of claims 16 to 21, wherein the content based real time emotion determination step comprises the sub steps of:using a speech recognition model to convert the audio data of the user in the scene into a text content; andusing the content based emotion recognition model to determine, based on the text content, the content based real time emotion of the user.
- The method of claim 22, wherein the speech recognition model is obtained by:obtaining a plurality of groups of universal audio data of one or more subjects communicating in one or more circumstances;determining a universal speech recognition model by training a machine learning model using the plurality of groups of universal audio data;obtaining a plurality of groups of special audio data of one or more subjects playing the RPG; andusing the plurality of groups of special audio data to train the universal speech recognition model to determine the speech recognition model.
- The method of any one of claims 16 to 23, wherein the acoustic based emotion recognition model is obtained by:obtaining a plurality of groups of acoustic characteristics associated with users; andusing the plurality of groups of acoustic characteristics to train a first machine learning model to determine the acoustic based emotion recognition model.
- The method of claim 24, wherein the first machine learning model includes a support vector machine.
- The method of any one of claims 16 to 25, wherein the content based emotion recognition model is obtained by:obtaining a plurality of groups of audio data associated with users;converting each group of the audio data into a text content; andusing the text content to train a second machine learning model to determine the content based emotion recognition model.
- The method of claim 26, wherein the second machine learning model includes a text classifier.
- The method of any one of claims 16 to 27, wherein the voice signals of the user are acquired when the user plays a role playing game (RPG) , and the at least one processor performs additional operations including:adjusting, based on the target real time emotion of the user in the scene, a plot of the RPG subsequent to the scene.
- The method of claim 28, wherein the user has a relationship with at least one of one or more real-life players of the RPG or one or more characters in the RPG, and to adjust, based on the target real time emotion of the user, a plot of the RPG, the at least one processor performs operations including:determining, based on the target real time emotion of the user, the relationship between the user and the one or more real life players or the one or more characters in the RPG; andadjusting, based on the determined relationship, the plot of the RPG.
- The method of claim 28, wherein the at least one processor performs operations including:adjusting, based on the target real time emotion of the user in the scene, an element of the RPG in the scene, wherein the element of the RPG includes at least one of:a vision effect associated with the RPG in the scene;a sound effect associated with the RPG in the scene;a display interface element associated with the RPG in the scene; orone or more props used in the RPG in the scene.
- A non-transitory computer readable medium storing instructions, the instructions, when executed by at least one processor, causing the at least one processor to implement a method comprising:obtaining voice signals of a user playing in a scene, the voice signals comprising acoustic characteristics and audio data of the user;optionally determining an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user;optionally determining a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user; anddetermining a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
- A system for emotion recognition, comprising:an obtaining module configured to obtain voice signals of a user, the voice signals comprising acoustic characteristics and audio data of the user;an emotion recognition module configured to:optionally determine an acoustic based real time emotion of the user in the scene using an acoustic based emotion recognition model configured to determine an emotion of the user based on one or more acoustic characteristics of the user;optionally determine a content based real time emotion of the user in the scene using a content based emotion recognition model configured to determine an emotion of the user based on one or more text contents derived from the audio data of the user; anddetermine a target real time emotion of the user in the scene based on at least one of the acoustic based real time emotion of the user in the scene or the content based real time emotion of the user in the scene.
- A system for emotion recognition, comprising:at least one storage medium storing a set of instructions;at least one processor in communication with the at least one storage medium to execute the set of instructions to perform operations including:obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user;determining one or more acoustic characteristics of the user from the voice signals;determining one or more text contents derived from the audio data of the user; anddetermining a target real time emotion of the user in the scene based on the one or more acoustic characteristics and the one or more text contents.
- The system of claim 33, wherein the at least one processor performs additional operations including:sending the target real time emotion of the user and the one or more text contents to a terminal device for voice control.
- A method implemented on a computing device including a storage device and at least one processor for emotion recognition, the method comprising:obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user;determining one or more acoustic characteristics of the user from the voice signals;determining one or more text contents derived from the audio data of the user; anddetermining a target real time emotion of the user in the scene based on the one or more acoustic characteristics and the one or more text contents.
- A non-transitory computer readable medium storing instructions, the instructions, when executed by at least one processor, causing the at least one processor to implement a method comprising:obtaining voice signals of a user in a scene, the voice signals comprising acoustic characteristics and audio data of the user;determining one or more acoustic characteristics of the user from the voice signals;determining one or more text contents derived from the audio data of the user; and determining a target real time emotion of the user in the scene based on the one or more acoustic characteristics and the one or more text contents.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910411095.9A CN111862984B (en) | 2019-05-17 | 2019-05-17 | Signal input method, device, electronic equipment and readable storage medium |
CN201910411095.9 | 2019-05-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020233504A1 true WO2020233504A1 (en) | 2020-11-26 |
Family
ID=72966076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/090435 WO2020233504A1 (en) | 2019-05-17 | 2020-05-15 | Systems and methods for emotion recognition |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111862984B (en) |
WO (1) | WO2020233504A1 (en) |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2505400B (en) * | 2012-07-18 | 2015-01-07 | Toshiba Res Europ Ltd | A speech processing system |
KR102222122B1 (en) * | 2014-01-21 | 2021-03-03 | 엘지전자 주식회사 | Mobile terminal and method for controlling the same |
KR101564176B1 (en) * | 2014-12-15 | 2015-10-28 | 연세대학교 산학협력단 | An emotion recognition system and a method for controlling thereof |
CN109313892B (en) * | 2017-05-17 | 2023-02-21 | 北京嘀嘀无限科技发展有限公司 | Robust speech recognition method and system |
CN107274906A (en) * | 2017-06-28 | 2017-10-20 | 百度在线网络技术(北京)有限公司 | Voice information processing method, device, terminal and storage medium |
CN107818785A (en) * | 2017-09-26 | 2018-03-20 | 平安普惠企业管理有限公司 | A kind of method and terminal device that information is extracted from multimedia file |
CN108122552B (en) * | 2017-12-15 | 2021-10-15 | 上海智臻智能网络科技股份有限公司 | Voice emotion recognition method and device |
CN108197115B (en) * | 2018-01-26 | 2022-04-22 | 上海智臻智能网络科技股份有限公司 | Intelligent interaction method and device, computer equipment and computer readable storage medium |
CN109472207B (en) * | 2018-10-11 | 2023-06-30 | 平安科技(深圳)有限公司 | Emotion recognition method, device, equipment and storage medium |
- 2019-05-17: CN CN201910411095.9A patent/CN111862984B/en active Active
- 2020-05-15: WO PCT/CN2020/090435 patent/WO2020233504A1/en active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120078607A1 (en) * | 2010-09-29 | 2012-03-29 | Kabushiki Kaisha Toshiba | Speech translation apparatus, method and program |
US20140112556A1 (en) * | 2012-10-19 | 2014-04-24 | Sony Computer Entertainment Inc. | Multi-modal sensor based emotion recognition and emotional interface |
US20160350801A1 (en) * | 2015-05-29 | 2016-12-01 | Albert Charles VINCENT | Method for analysing comprehensive state of a subject |
CN105427869A (en) * | 2015-11-02 | 2016-03-23 | 北京大学 | Session emotion autoanalysis method based on depth learning |
CN106297826A (en) * | 2016-08-18 | 2017-01-04 | 竹间智能科技(上海)有限公司 | Speech emotional identification system and method |
CN106503805A (en) * | 2016-11-14 | 2017-03-15 | 合肥工业大学 | A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method |
CN107944008A (en) * | 2017-12-08 | 2018-04-20 | 神思电子技术股份有限公司 | A kind of method that Emotion identification is carried out for natural language |
CN109192225A (en) * | 2018-09-28 | 2019-01-11 | 清华大学 | The method and device of speech emotion recognition and mark |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20220063818A (en) * | 2020-11-09 | 2022-05-18 | 주식회사 스피랩 | System and method for analyzing emotion of speech |
KR102429365B1 (en) | 2020-11-09 | 2022-08-05 | 주식회사 스피랩 | System and method for analyzing emotion of speech |
CN112925292B (en) * | 2021-01-24 | 2024-05-14 | 国网辽宁省电力有限公司电力科学研究院 | Generator set process monitoring and fault diagnosis method based on layered and segmented |
CN112925292A (en) * | 2021-01-24 | 2021-06-08 | 国网辽宁省电力有限公司电力科学研究院 | Generator set process monitoring and fault diagnosis method based on layered partitioning |
CN113033450B (en) * | 2021-04-02 | 2022-06-24 | 山东大学 | Multi-mode continuous emotion recognition method, service inference method and system |
CN115376544A (en) * | 2021-05-19 | 2022-11-22 | 漳州立达信光电子科技有限公司 | Data processing method and device and terminal equipment |
CN113421543A (en) * | 2021-06-30 | 2021-09-21 | 深圳追一科技有限公司 | Data labeling method, device and equipment and readable storage medium |
CN113421543B (en) * | 2021-06-30 | 2024-05-24 | 深圳追一科技有限公司 | Data labeling method, device, equipment and readable storage medium |
CN113643046A (en) * | 2021-08-17 | 2021-11-12 | 中国平安人寿保险股份有限公司 | Common situation strategy recommendation method, device, equipment and medium suitable for virtual reality |
CN113643046B (en) * | 2021-08-17 | 2023-07-25 | 中国平安人寿保险股份有限公司 | Co-emotion strategy recommendation method, device, equipment and medium suitable for virtual reality |
CN114065742A (en) * | 2021-11-19 | 2022-02-18 | 马上消费金融股份有限公司 | Text detection method and device |
CN114065742B (en) * | 2021-11-19 | 2023-08-25 | 马上消费金融股份有限公司 | Text detection method and device |
CN114120425A (en) * | 2021-12-08 | 2022-03-01 | 云知声智能科技股份有限公司 | Emotion recognition method and device, electronic equipment and storage medium |
CN115396715A (en) * | 2022-08-18 | 2022-11-25 | 咪咕数字传媒有限公司 | Table game interaction method, system and storage medium |
CN115396715B (en) * | 2022-08-18 | 2024-01-30 | 咪咕数字传媒有限公司 | Table game interaction method, system and storage medium |
CN115101074B (en) * | 2022-08-24 | 2022-11-11 | 深圳通联金融网络科技服务有限公司 | Voice recognition method, device, medium and equipment based on user speaking emotion |
CN115101074A (en) * | 2022-08-24 | 2022-09-23 | 深圳通联金融网络科技服务有限公司 | Voice recognition method, device, medium and equipment based on user speaking emotion |
CN117475360B (en) * | 2023-12-27 | 2024-03-26 | 南京纳实医学科技有限公司 | Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN |
CN117475360A (en) * | 2023-12-27 | 2024-01-30 | 南京纳实医学科技有限公司 | Biological sign extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN |
CN118016106A (en) * | 2024-04-08 | 2024-05-10 | 山东第一医科大学附属省立医院(山东省立医院) | Elderly emotion health analysis and support system |
Also Published As
Publication number | Publication date |
---|---|
CN111862984B (en) | 2024-03-29 |
CN111862984A (en) | 2020-10-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020233504A1 (en) | Systems and methods for emotion recognition | |
US11854527B2 (en) | Electronic device and method of controlling speech recognition by electronic device | |
US10403268B2 (en) | Method and system of automatic speech recognition using posterior confidence scores | |
CN109859772B (en) | Emotion recognition method, emotion recognition device and computer-readable storage medium | |
US10008209B1 (en) | Computer-implemented systems and methods for speaker recognition using a neural network | |
KR102679375B1 (en) | Electronic apparatus and method for controlling thereof | |
US20170358306A1 (en) | Neural network-based voiceprint information extraction method and apparatus | |
US8719019B2 (en) | Speaker identification | |
KR102577589B1 (en) | Voice recognizing method and voice recognizing appratus | |
KR101984283B1 (en) | Automated Target Analysis System Using Machine Learning Model, Method, and Computer-Readable Medium Thereof | |
KR20200097993A (en) | Electronic device and Method for controlling the electronic device thereof | |
CN113643693A (en) | Acoustic model conditioned on sound features | |
CN111210805A (en) | Language identification model training method and device and language identification method and device | |
CN114913859B (en) | Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium | |
CN112910761A (en) | Instant messaging method, device, equipment, storage medium and program product | |
US20230317092A1 (en) | Systems and methods for audio signal generation | |
US11600263B1 (en) | Natural language configuration and operation for tangible games | |
JPWO2017094121A1 (en) | Speech recognition device, speech enhancement device, speech recognition method, speech enhancement method, and navigation system | |
US11798578B2 (en) | Paralinguistic information estimation apparatus, paralinguistic information estimation method, and program | |
KR102559074B1 (en) | Method and apparatus for providing english education services to a learner terminal and a parent terminal using a neural network | |
CN112863486B (en) | Voice-based spoken language evaluation method and device and electronic equipment | |
US11645947B1 (en) | Natural language configuration and operation for tangible games | |
Chorianopoulou et al. | Speech Emotion Recognition Using Affective Saliency. | |
Hariprasad et al. | Voice Stimulated Inclusive Multiplayer Game Development with Speaker Recognition | |
CN112992184B (en) | Pronunciation evaluation method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20809772; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20809772; Country of ref document: EP; Kind code of ref document: A1 |