CN114495988A - Emotion processing method of input information and electronic equipment - Google Patents

Emotion processing method of input information and electronic equipment

Info

Publication number
CN114495988A
Authority
CN
China
Prior art keywords
emotion
voice
electronic equipment
target
input information
Prior art date
Legal status
Granted
Application number
CN202111017308.3A
Other languages
Chinese (zh)
Other versions
CN114495988B
Inventor
张婧颖
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202310257753.XA (CN116705072A)
Priority to CN202111017308.3A (CN114495988B)
Publication of CN114495988A
Application granted
Publication of CN114495988B
Legal status: Active


Classifications

    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G06F16/3344: Query execution using natural language analysis
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G10L13/02: Methods for producing synthetic speech; speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/03: Speech or voice analysis characterised by the type of extracted parameters
    • G10L25/18: Extracted parameters being spectral information of each sub-band
    • G10L25/30: Speech or voice analysis using neural networks


Abstract

An emotion processing method of input information and an electronic device relate to the technical field of terminals, and enable the same input information to be expressed with different emotions so as to produce different listening effects, thereby enriching the emotional expression of voice. The method includes the following steps: the electronic equipment displays an input interface, where the input interface is used for receiving input information input by a user, the input information includes voice information or text information, and the emotion of the input information includes a first emotion; the electronic equipment determines a target emotion type; and the electronic equipment performs emotion processing on the input information according to the target emotion type to obtain target voice, where the emotion of the target voice includes a second emotion, the content of the target voice is the same as that of the input information, and the first emotion is different from the second emotion.

Description

Emotion processing method of input information and electronic equipment
Technical Field
The present application relates to the field of terminal technologies, and in particular, to an emotion processing method for input information and an electronic device.
Background
With the development of intelligent electronic devices, voice processing technology has advanced greatly and is widely applied in users' daily lives. For example, speech processing techniques are used in scenarios such as short video shooting, live streaming, and text-to-speech. In these scenarios, speech processing can enhance the effect of the user's voice, making everyday use more engaging and meeting users' diverse needs.
However, current human voice enhancement functions are mainly tone equalization and voice noise reduction, where tone equalization produces a deeper, fuller sound or a sharper, brighter sound by changing the energy distribution across voice frequency bands. It can be seen that current speech processing technology can only change the timbre of speech, so the resulting speech is monotonous in emotional expression and poor in emotional richness.
Disclosure of Invention
The application provides an emotion processing method of input information and electronic equipment, which can enable the same input information to be expressed by different emotions to generate different listening effects, so that the emotion expression of voice is enriched.
In a first aspect, an embodiment of the present application provides an emotion processing method for input information. The method comprises the following steps: the electronic equipment displays an input interface; the input interface is used for receiving input information input by a user, and the input information comprises voice information or text information; the emotion of the input information comprises a first emotion; the electronic equipment determines a target emotion type; the electronic equipment carries out emotion processing on the input information according to the target emotion type to obtain target voice; the emotion of the target voice comprises a second emotion, the content of the target voice is the same as the content of the input information, and the first emotion is different from the second emotion.
By adopting the scheme, the electronic equipment can carry out emotion processing on the input information according to the determined target emotion type to obtain target voice; because the emotion of the input information comprises the first emotion, the emotion of the target voice comprises the second emotion, the content of the input information is the same as that of the target voice, and the first emotion and the second emotion are different, the same input information can be expressed by different emotions to generate different listening effects, so that the emotion expression of the voice is enriched.
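For illustration only, the following Python sketch mirrors the three claimed steps (receive input information, determine a target emotion type, perform emotion processing to obtain target voice). The names EmotionType, determine_target_emotion and emotion_convert are hypothetical placeholders and are not defined by this application.

```python
from enum import Enum, auto

class EmotionType(Enum):
    # Emotion types mentioned later in the description (illustrative only)
    NEUTRAL = auto()
    ANGRY = auto()
    DISGUST = auto()
    FEAR = auto()
    JOY = auto()
    SADNESS = auto()
    SURPRISE = auto()

def determine_target_emotion(user_selection=None, auto_matched=None):
    """Target emotion type: taken from a user-selected emotion control if
    present, otherwise from automatic matching (placeholder logic)."""
    return user_selection or auto_matched or EmotionType.NEUTRAL

def emotion_processing(input_info, target_emotion, emotion_convert):
    """Emotion processing: the content of the target voice stays the same
    as the input information; only its emotion is changed."""
    return emotion_convert(input_info, target_emotion)
```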
In a possible design of the first aspect, the electronic device displays an input interface, including: the electronic equipment responds to the operation of starting the camera by a user and displays an input interface; the input interface is a preview interface before the electronic equipment shoots the video; or the input interface is an interface in a video shot by the electronic equipment; or the input interface is the interface after the electronic equipment shoots the video.
In the design mode, the input interface is an interface for shooting videos by the electronic equipment, namely the electronic equipment can beautify voice emotion of the shot videos, so that users can hear voices of different emotions when the shot videos are played, and emotion expressions of video shooting are enriched.
In a possible design of the first aspect, the electronic device includes a gallery application, a recording application, and a notepad application; an electronic device display input interface comprising: the electronic equipment responds to the operation of opening any video file in the gallery application by a user and displays an input interface; or the electronic equipment responds to the operation of opening any one sound recording file in the sound recording application by the user and displays an input interface; or the electronic equipment responds to the operation of opening any text file in the notepad application by the user and displays the input interface.
In this design, the electronic equipment can beautify the voice emotion of a video file when playing the video file in the gallery application; or the electronic equipment can beautify the voice emotion of a recording file when playing the recording file in the recording application; or the electronic equipment can give emotion to a text file in the notepad application in the process of converting the text file into voice, so that the converted voice carries emotional color, further enriching the expression of voice emotion.
In a possible design manner of the first aspect, the input interface includes a plurality of speech emotion controls, and one speech emotion control corresponds to one emotion type; determining a target emotion type, comprising: the electronic equipment responds to the operation of the user on at least one voice emotion control in the plurality of voice emotion controls and determines the target emotion type.
In the design mode, a user can select at least one voice emotion control in the multiple voice emotion controls, so that the user can customize the emotion type required by the user for input information, and when the electronic equipment plays the target voice, the voice heard by the user is the voice of the emotion type customized by the user, and the user experience is improved.
In a possible design of the first aspect, the input interface includes a shot picture; determining the target emotion type includes: the electronic equipment identifies the style of the shot picture and automatically matches the target emotion type corresponding to the style of the shot picture.
In the design mode, the electronic equipment can also identify the style of the shot picture and automatically match the target emotion type corresponding to the style of the shot picture, so that the user experience is further improved.
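As a rough illustration of this automatic matching, the sketch below maps a recognised picture style to a target emotion type. The style labels, the mapping table and the classify_style stub are assumptions made for this example, not details disclosed by the application.

```python
# Hypothetical mapping from picture style to target emotion type.
STYLE_TO_EMOTION = {
    "minimalist": "neutral",
    "exaggerated": "joy",
    "gloomy": "sadness",
}

def classify_style(frame):
    """Placeholder for an image-style classifier (e.g. a small CNN)."""
    raise NotImplementedError

def auto_match_emotion(frame, default="neutral"):
    """Automatically match the target emotion type to the style of the
    shot picture, falling back to a default when the style is unknown."""
    return STYLE_TO_EMOTION.get(classify_style(frame), default)
```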
In a possible embodiment of the first aspect, the input information is speech information; determining a target emotion type, comprising: the electronic equipment identifies the emotion in the voice information and automatically matches the emotion type corresponding to the emotion in the voice information to determine the target emotion type.
In the design mode, when the input information is the voice information, the electronic equipment can also recognize the emotion in the voice information and automatically match the emotion type corresponding to the emotion in the voice information to determine the target emotion type, so that the user experience is further improved.
In a possible design of the first aspect, the input information is text information; determining a target emotion type, comprising: the electronic equipment identifies the semantics of the text information and automatically matches the emotion types corresponding to the semantics of the text information to determine the target emotion types.
In the design mode, when the input information is the text information, the electronic equipment can also identify the semantics of the text information and automatically match the emotion type corresponding to the semantics of the text information to determine the target emotion type, so that the user experience is further improved.
In a possible design manner of the first aspect, the electronic device performs emotion processing on the input information according to the target emotion type to obtain the target speech, including: the electronic equipment inputs the input information into the speech emotion model to obtain target speech; the voice emotion model is used for modifying the emotion of the input information according to the target emotion type.
In the design mode, the electronic equipment inputs the input information into the voice emotion model to obtain the target voice, and the voice emotion model is used for modifying the emotion of the input information according to the target emotion type, so that the emotion of the target voice output by the electronic equipment is different from the emotion of the input information, and the emotion expression of the voice is enriched.
In a possible design manner of the first aspect, the inputting, by the electronic device, of the input information into the speech emotion model to obtain the target speech includes: the electronic equipment performs encoding processing on the input information to obtain time-frequency characteristics of the input information, where the encoding processing includes framing processing and Fourier transform, the framing processing is used for dividing the input information into a plurality of voice frames, and the time-frequency characteristics describe how the frequency and amplitude of each voice frame vary over time; the electronic equipment inputs the time-frequency characteristics of the input information into the speech emotion model to obtain time-frequency characteristics of the target voice; and the electronic equipment performs decoding processing and voice synthesis processing on the time-frequency characteristics of the target voice to obtain the target voice, where the decoding processing includes inverse Fourier transform and time-domain waveform superposition.
In this design, the electronic equipment first encodes the input information, then inputs the encoded input information into the speech emotion model to obtain the time-frequency characteristics of the target voice, and finally performs decoding processing and voice synthesis processing on the time-frequency characteristics of the target voice to obtain the target voice, so that the emotional expression of the target voice can be more accurate.
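The encoding and decoding steps described above (framing plus Fourier transform, then inverse Fourier transform plus time-domain waveform superposition) can be sketched with a short-time Fourier transform. The example below is a minimal sketch assuming SciPy; the frame length, hop size and sample rate are illustrative assumptions, and the emotion_model callable stands in for the speech emotion model.

```python
from scipy.signal import stft, istft

def encode(waveform, sr=16000, frame_len=512, hop=256):
    """Framing + Fourier transform: the complex STFT describes how the
    frequency content and amplitude of each speech frame vary over time."""
    _, _, spec = stft(waveform, fs=sr, nperseg=frame_len,
                      noverlap=frame_len - hop)
    return spec  # time-frequency characteristics of the input

def decode(spec, sr=16000, frame_len=512, hop=256):
    """Inverse Fourier transform + time-domain waveform superposition."""
    _, waveform = istft(spec, fs=sr, nperseg=frame_len,
                        noverlap=frame_len - hop)
    return waveform

def to_target_voice(waveform, emotion_model, target_emotion):
    """Encode, map to target time-frequency characteristics with the
    speech emotion model (placeholder call), then decode and synthesise."""
    spec_in = encode(waveform)
    spec_out = emotion_model(spec_in, target_emotion)  # hypothetical call
    return decode(spec_out)
```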
In a possible design of the first aspect, the method further includes: the electronic equipment acquires a voice emotion data set; the voice emotion data set comprises a plurality of pieces of emotion voice; the corresponding emotion types of each emotional voice in the plurality of emotional voices are different; aiming at each emotional voice in the voice emotional data set, the electronic equipment performs feature extraction processing on the emotional voice to obtain time-frequency features of the emotional voice; and the electronic equipment inputs the time-frequency characteristics of the emotional voice into the neural network model for emotional training to obtain a voice emotional model.
In this design, the electronic equipment can perform emotion training on the neural network model according to the speech emotion data set, so that the trained speech emotion model is more mature and the accuracy of modifying the emotion of the input information is improved.
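A minimal training sketch, assuming PyTorch and a dataset of (time-frequency features, emotion label) pairs, is shown below; the batch size, optimiser and loss are assumptions, and the model architecture is left unspecified.

```python
import torch
from torch import nn

def train_speech_emotion_model(model: nn.Module, dataset, epochs=10, lr=1e-3):
    """Emotion training on a speech emotion data set: each sample pairs the
    time-frequency features of an emotional voice with its emotion type,
    and the neural network is updated by backpropagation."""
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, emotion_label in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), emotion_label)
            loss.backward()   # backpropagation step
            optimizer.step()
    return model
```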
In one possible design of the first aspect, the speech emotion model includes a first model and a second model; the first model is used for indicating the mapping relation between the emotional voice and the emotional type; the second model is used to modify the emotion of the input information.
In this design, the speech emotion model comprises a first model and a second model; the first model is used for indicating the mapping relation between the emotional voice and the emotional type; the second model is used for modifying the emotion of the input information, so that after the input information is input to the voice emotion model by the electronic equipment, the emotion of the input information is firstly determined by the first model, and then the emotion of the input information is converted by the second model, so that the accuracy of modifying the emotion of the input information is improved.
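To make the two-model structure concrete, the sketch below wires a recognition model and a conversion model together. Both sub-networks are placeholders whose architectures are not given here; this is only one way such a split could be assembled, not the application's definitive implementation.

```python
from torch import nn

class SpeechEmotionModel(nn.Module):
    """Sketch of the two-part speech emotion model: a first model mapping
    emotional voice to an emotion type, and a second model that modifies
    the emotion of the input according to the target emotion type."""

    def __init__(self, recognizer: nn.Module, converter: nn.Module):
        super().__init__()
        self.recognizer = recognizer  # first model: features -> emotion type
        self.converter = converter    # second model: modifies the emotion

    def forward(self, features, target_emotion):
        source_emotion = self.recognizer(features).argmax(dim=-1)
        return self.converter(features, source_emotion, target_emotion)
```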
In a possible design manner of the first aspect, when the electronic device plays the audio picture, the electronic device outputs the target voice; the audio picture is a video picture; alternatively, the audio picture is an audio file.
In the design mode, when the electronic equipment plays the audio picture, the electronic equipment outputs the target voice, so that different listening effects are generated for a user, and the emotional expression of the voice is enriched.
In a possible design manner of the first aspect, when the electronic device plays an audio screen, the interface of the electronic device displays indication information; the indication information is used for indicating the emotion type corresponding to the target voice.
In the design mode, when the electronic equipment plays the audio image, the interface of the electronic equipment also displays the indication information, and the indication information is used for indicating the emotion type corresponding to the target voice, so that the user can see the emotion type of the target voice from the interface of the electronic equipment, and the user experience is improved.
In a second aspect, the present application provides an electronic device comprising a memory, a display screen, one or more cameras, and one or more processors. The memory, the display screen and the camera are coupled with the processor. Wherein the camera is configured to capture an image, the display screen is configured to display the image captured by the camera or an image generated by the processor, and the memory has stored therein computer program code comprising computer instructions that, when executed by the processor, cause the electronic device to perform the steps of: the electronic equipment displays an input interface; the input interface is used for receiving input information input by a user, and the input information comprises voice information or text information; the emotion of the input information comprises a first emotion; the electronic equipment determines a target emotion type; the electronic equipment carries out emotion processing on the input information according to the target emotion type to obtain target voice; the target voice emotion comprises a second emotion, the content of the target voice is the same as that of the input information, and the first emotion is different from the second emotion.
In a possible design of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to perform the following steps: the electronic equipment responds to the operation of starting the camera by a user and displays an input interface; the input interface is a preview interface before the electronic equipment shoots the video; or the input interface is an interface in a video shot by the electronic equipment; or, the input interface is the interface after the electronic equipment shoots the video.
In a possible design of the second aspect, the electronic device includes a gallery application, a recording application, and a notepad application; when executed by the processor, the computer instructions cause the electronic device to perform the following steps: the electronic equipment responds to the operation of opening any video file in the gallery application by a user and displays an input interface; or the electronic equipment responds to the operation of opening any one sound recording file in the sound recording application by a user and displays an input interface; or the electronic equipment responds to the operation of opening any text file in the notepad application by the user and displays the input interface.
In a possible design manner of the second aspect, the input interface includes a plurality of speech emotion controls, and one speech emotion control corresponds to one emotion type; when executed by the processor, the computer instructions cause the electronic device to perform the following steps: the electronic device determines a target emotion type in response to a user operation of at least one of the plurality of voice emotion controls.
In a possible design of the second aspect, the input interface includes a shot picture; when executed by a processor, the computer instructions cause the electronic device to perform the steps of: the electronic equipment identifies the style of the shot picture and automatically matches the target emotion type corresponding to the style of the shot picture.
In a possible embodiment of the second aspect, the input information is speech information; when executed by the processor, the computer instructions cause the electronic device to perform the following steps: the electronic equipment identifies the emotion in the voice information and automatically matches the emotion type corresponding to the emotion in the voice information to determine the target emotion type.
In a possible embodiment of the second aspect, the input information is text information; when executed by the processor, the computer instructions cause the electronic device to perform the following steps: the electronic device identifies semantics of the text information and automatically matches an emotion type corresponding to the semantics of the text information to determine a target emotion type.
In a possible design of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to perform the following steps: the electronic equipment inputs the input information into the speech emotion model to obtain target speech; the voice emotion model is used for modifying the emotion of the input information according to the target emotion type.
In a possible design of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to perform the following steps: the electronic equipment performs encoding processing on the input information to obtain time-frequency characteristics of the input information, where the encoding processing includes framing processing and Fourier transform, the framing processing is used for dividing the input information into a plurality of voice frames, and the time-frequency characteristics describe how the frequency and amplitude of each voice frame vary over time; the electronic equipment inputs the time-frequency characteristics of the input information into the speech emotion model to obtain time-frequency characteristics of the target voice; and the electronic equipment performs decoding processing and voice synthesis processing on the time-frequency characteristics of the target voice to obtain the target voice, where the decoding processing includes inverse Fourier transform and time-domain waveform superposition.
In a possible design of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: the electronic equipment acquires a voice emotion data set; the voice emotion data set comprises a plurality of pieces of emotion voice; the corresponding emotion types of each emotional voice in the plurality of emotional voices are different; aiming at each emotional voice in the voice emotional data set, the electronic equipment performs feature extraction processing on the emotional voice to obtain time-frequency features of the emotional voice; and the electronic equipment inputs the time-frequency characteristics of the emotional voice into the neural network model for emotional training to obtain a voice emotional model.
In one possible design of the second aspect, the speech emotion model includes a first model and a second model; the first model is used for indicating the mapping relation between the emotional voice and the emotional type; the second model is used to modify the emotion of the input information.
In a possible design of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: when the electronic equipment plays the audio picture, the electronic equipment outputs the target voice; the audio picture is a video picture; alternatively, the audio picture is an audio file.
In a possible design of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: when the electronic equipment plays the audio picture, the interface of the electronic equipment displays indication information; the indication information is used for indicating the emotion type corresponding to the target voice.
In a third aspect, the present application provides a computer-readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method according to the first aspect and any one of its possible design approaches.
In a fourth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method according to the first aspect and any one of its possible designs. The computer may be the electronic device described above.
It should be understood that, for the advantageous effects achieved by the electronic device according to the second aspect and any one of its possible designs, the computer storage medium according to the third aspect, and the computer program product according to the fourth aspect, reference may be made to the advantageous effects of the first aspect and any one of its possible designs, and details are not described herein again.
Drawings
Fig. 1 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a software structure of an electronic device according to an embodiment of the present application;
fig. 3a is a first schematic diagram of an input interface provided in an embodiment of the present application;
fig. 3b is a second schematic diagram of an input interface provided in an embodiment of the present application;
fig. 3c is a third schematic diagram of an input interface provided in an embodiment of the present application;
fig. 4a is a fourth schematic diagram of an input interface provided in an embodiment of the present application;
fig. 4b is a fifth schematic diagram of an input interface provided in an embodiment of the present application;
fig. 4c is a schematic interface diagram of an electronic device when playing a video according to an embodiment of the present application;
fig. 5a is a sixth schematic diagram of an input interface provided in an embodiment of the present application;
fig. 5b is a seventh schematic diagram of an input interface provided in an embodiment of the present application;
fig. 5c is an eighth schematic diagram of an input interface provided in an embodiment of the present application;
FIG. 6 is a first flowchart illustrating an emotion processing method for input information according to an embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating a second method for emotion processing of input information according to an embodiment of the present application;
fig. 8 is a schematic diagram of a real-time spectrogram according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of a process for training a speech emotion model according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a speech emotion data set provided by an embodiment of the present application;
fig. 11 is a schematic structural diagram of a chip system according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the embodiments herein, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects, and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, "a plurality" means two or more unless otherwise specified.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
In order to solve the problems in the background art, the embodiment of the application provides an emotion processing method for input information, which is applied to electronic equipment; by the method, the electronic equipment can perform emotion processing on the input information, so that the same input information can be expressed by different emotions to generate different listening effects, and further, the emotion expression of the voice is enriched. The input information may include voice or text, among others.
Specifically, the electronic equipment inputs input information into a pre-trained neural network model, so that the neural network model performs emotion processing on the input information to obtain target voice; the emotion type corresponding to the emotion of the target voice is different from the emotion type corresponding to the emotion of the input information; or, the input information does not contain emotion, and the target speech contains emotion; the emotion types may include, for example, neutral, angry, aversion, fear, joy, sadness, surprise, and the like. In some embodiments, when the input information is the speech to be processed, the electronic device inputs the speech to be processed into a pre-trained neural network model, so that the neural network model performs emotion transformation on the speech to be processed to obtain the target speech. For example, the emotion type corresponding to the emotion of the voice to be processed may be angry, and the emotion type corresponding to the emotion of the target voice may be cheerful. In another embodiment, when the input information is a text to be processed, the electronic device inputs the text to be processed into a pre-trained neural network model, so that the neural network model performs emotion processing on the text to be processed to obtain the target voice. For example, the text to be processed does not have emotion, and the emotion type corresponding to the emotion of the obtained target voice may be cheerful, for example.
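The two cases above (speech to be processed versus text to be processed) can be dispatched as in the sketch below; emotion_model and tts are hypothetical callables standing in for the pre-trained neural network and an emotional text-to-speech front end.

```python
def to_emotional_speech(input_info, target_emotion, emotion_model, tts):
    """Dispatch sketch: text to be processed is synthesised with the target
    emotion, while speech to be processed has its emotion converted so that
    the content stays the same and only the emotion changes."""
    if isinstance(input_info, str):          # text to be processed
        return tts(input_info, emotion=target_emotion)
    return emotion_model(input_info, target_emotion)  # speech to be processed
```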
The emotion processing method for input information provided by the embodiment of the application can be applied to electronic equipment comprising a smart voice device, such as a voice assistant, a smart sound box, a smart phone, a tablet computer, a computer, wearable electronic equipment, a smart robot and the like. In each of the above devices, a speech emotion desired by the user can be output. Several possible scenarios for applying emotion processing to input information are described below.
Application scenario 1: text to speech
In the application scene of text-to-speech, the text content can be combined, so that the speech heard by the user is speech with emotion, and the effect of speech tone quality can be enriched on the premise of ensuring higher accuracy of text-to-speech.
Application scenario 2: smart phone voice interaction
In the voice interaction scenario of a smart phone, the voice of the smart phone's voice assistant is no longer a monotonous machine voice, but a user-customized voice with emotion. For example, the user may customize the voice of the voice assistant to be a cheerful voice, so that the user hears a voice with cheerful emotion while communicating with the voice assistant.
Application scenario 3: short video shot
In the scene of short video shooting, for example, the voice of the user can be customized into personalized voice with specific emotion, such as anger, cheerfulness and the like, so that the voice of the user can be beautified in the process of video shooting, and the voice is adapted to the picture styles of different shot subjects.
For a better understanding of the embodiments of the present application, reference is made to the following detailed description of the embodiments of the present application taken in conjunction with the accompanying drawings.
Please refer to fig. 1, which is a schematic structural diagram of an electronic device 100 according to an embodiment of the present disclosure. As shown in fig. 1, the electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identity module (SIM) card interface 195, and the like.
The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, and a bone conduction sensor 180M.
It is to be understood that the illustrated structure of the present embodiment does not constitute a specific limitation to the electronic apparatus 100. In other embodiments, electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be a neural center and a command center of the electronic device 100. The controller can generate operation control signals according to the instruction operation codes and the timing signals to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
It should be understood that the interface connection relationship between the modules illustrated in the present embodiment is only an exemplary illustration, and does not constitute a structural limitation for the electronic device 100. In other embodiments, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may adopt a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (active-matrix organic light-emitting diode, AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and the like.
The electronic device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP to be converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is configured to perform fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor, which processes input information quickly by referring to a biological neural network structure, for example, by referring to a transfer mode between neurons of a human brain, and can also learn by itself continuously. Applications such as intelligent recognition of the electronic device 100 can be realized through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121. For example, in the embodiment of the present application, the processor 110 may execute instructions stored in the internal memory 121, and the internal memory 121 may include a program storage area and a data storage area.
The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The data storage area may store data (such as audio data, phone book, etc.) created during use of the electronic device 100, and the like. Further, the internal memory 121 may include a high-speed random access memory, and may also include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The electronic device 100 may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into analog audio signals for output, and also used to convert analog audio inputs into digital audio signals. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110. The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. The microphone 170C, also referred to as a "microphone," is used to convert sound signals into electrical signals.
The headphone interface 170D is used to connect a wired headphone. The headphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys. Or may be touch keys. The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration cues, as well as for touch vibration feedback. Indicator 192 may be an indicator light that may be used to indicate a charge status, a charge change, a message, a missed call, a notification, etc. The SIM card interface 195 is used to connect a SIM card. The SIM card can be brought into and out of contact with the electronic apparatus 100 by being inserted into the SIM card interface 195 or being pulled out of the SIM card interface 195. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support a Nano SIM card, a Micro SIM card, a SIM card, etc.
The emotion processing method for input information provided in the embodiment of the present application may be executed by the processor 110 included in the electronic device 100. Illustratively, the implementation may be performed by a neural network processor in the processor 110 described above. The neural network processor may be equipped with a neural network model to implement the method according to the embodiment of the present application.
The software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture or a cloud architecture. The embodiment of the present invention exemplarily illustrates a software architecture of the electronic device 100 by taking an Android system with a layered architecture as an example.
Fig. 2 is a block diagram of a software structure of the electronic device 100 according to the embodiment of the present application.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in fig. 2, the application packages may include applications such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message and voice assistant.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 2, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, and the like.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide communication functions of the electronic device 100. Such as management of call status (including on, off, etc.).
The resource manager provides various resources, such as localized characters, icons, pictures, layout files, video files, etc., to the application.
The notification manager enables the application to display notification information in the status bar, can be used to convey notification-type messages, can disappear automatically after a short dwell, and does not require user interaction. Such as a notification manager used to inform download completion, message alerts, etc. The notification manager may also be a notification that appears in the form of a chart or scroll bar text at the top status bar of the system, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. For example, prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, flashing an indicator light, etc.
The Android Runtime comprises a core library and a virtual machine. The Android Runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part is a function which needs to be called by java language, and the other part is a core library of android.
The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application layer and the application framework layer as binary files. The virtual machine is used for performing the functions of object life cycle management, stack management, thread management, safety and exception management, garbage collection and the like.
The system library may include a plurality of functional modules, for example: a surface manager, media libraries (Media Libraries), three-dimensional graphics processing libraries (e.g., OpenGL ES), 2D graphics engines (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files, among others. The media library may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, and the like.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, composition, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
The emotion processing method for input information provided in the embodiment of the present application is described in detail with the electronic device 100 as a mobile phone. It should be understood that the methods in the following embodiments may be implemented in an electronic device having the above-described hardware structure and software structure.
The mobile phone displays an input interface, the input interface is used for receiving input information input by a user, and the input information comprises voice information or text information; the emotion of the input information comprises a first emotion; then, the mobile phone responds to the operation of the user on the input interface and determines the target emotion type; finally, the mobile phone performs emotion processing on the input information according to the target emotion type (for example, the mobile phone converts the emotion of the input information from a first emotion to a second emotion), so as to obtain target voice; the emotion of the target voice comprises a second emotion, the content of the target voice is the same as the content of the input information, and the first emotion is different from the second emotion.
The emotion processing method for the input information provided by the embodiment of the application is described in detail according to different scenes with reference to the drawings of the specification.
In some embodiments, the emotion processing method for input information provided by the embodiments of the present application may be applied to a short video shooting scene, which may be, for example, a scene in which a user records a video using a camera application on a mobile phone. For example, a user may record a video using a video recording mode of a camera application. As another example, a user may record a video using a professional mode of a camera application. As another example, a user may record a video using a movie mode of a camera application.
It should be noted that the user may also record the video using other applications for short video shooting on the mobile phone. The short video shooting application may be a system application on the mobile phone, or may also be a third-party application on the mobile phone (for example, an application downloaded from an application market of the mobile phone or an application store), which is not limited in this embodiment of the present application.
Illustratively, the user may enter speech during the recording of the video. For example, when the user is recording a food video, the user can give a corresponding introduction or evaluation of the food scene. During video shooting, the mobile phone can perform emotion transformation on the voice input by the user, so that the introduction or evaluation of the food scene is delivered in a voice with a different emotion; that is, the emotion type of the voice output by the mobile phone is different from the emotion type of the voice input by the user. This produces a different listening effect and enriches the emotional expression of the voice.
In a possible implementation, the user can select different speech emotion types according to his or her own subjective awareness or preference, so that the speech emotion types corresponding to different videos shot by the user differ, thereby meeting the user's requirement for beautifying the voice effect.
In some embodiments, as shown in fig. 3 a-3 c, the mobile phone displays an input interface 201, wherein the input interface 201 may be a preview interface before the mobile phone takes a picture; or the input interface 201 can also be a shooting interface; alternatively, the input interface 201 may be an interface after the end of shooting. For example, the input interface 201 is an interface for recording a video in a video recording mode of a mobile phone.
Taking the input interface 201 as a preview interface before shooting by the mobile phone as an example, as shown in fig. 3a, the input interface 201 includes a speech emotion template 202; wherein the speech emotion template 202 comprises a plurality of different speech emotion types. For example, the speech emotion template 202 includes emotion types such as neutral, angry, disgust, fear, joy, sadness, and surprise. On this basis, the user can select an emotion type in the speech emotion template 202 before recording the video, and after the mobile phone finishes recording, a recorded video with audio is obtained. The voice in the recorded video is the emotional voice corresponding to the emotion type selected by the user.
For example, before recording a video, the user may select one of the speech emotion types in the speech emotion template 202 according to the style of the preview picture of the current preview interface, so that the speech emotion type matches the style of the preview picture. For example, if the style of the preview picture of the preview interface is a minimalist style, the user can select the neutral speech emotion type to match the minimalist style of the current preview picture. For another example, if the style of the preview picture of the preview interface is an exaggerated style, the user can select a cheerful speech emotion type to match the style of the current preview picture.
Taking the input interface 201 as an interface in mobile phone shooting as an example, as shown in fig. 3b, the input interface 201 includes a speech emotion template 202; wherein the speech emotion template 202 comprises a plurality of different speech emotion types. For example, the speech emotion template 202 includes emotion types such as neutral, angry, disgust, fear, joy, sadness, and surprise.
For example, during shooting, the user may select one of the speech emotion types in the speech emotion template 202 according to the style of the shot picture, so that it matches the style of the current shot picture. For example, if the style of the current shot picture is a minimalist style, the user can select a neutral speech emotion type to match it. For another example, if the style of the current shot picture is an exaggerated style, the user can select a cheerful speech emotion type to match the exaggerated style of the current shot picture. Alternatively, during video shooting, if the style of the current shot picture changes, the user can change the speech emotion type accordingly. For example, if the style of the current shot picture changes from a minimalist style to an exaggerated style, the user may first select a neutral speech emotion type and then select a cheerful speech emotion type, so that the same video may contain speech of two different emotion types.
Take the input interface 201 as the interface after the mobile phone completes shooting as an example. Illustratively, referring to FIG. 3c, the input interface 201 includes a speech emotion template 202; wherein the speech emotion template 202 comprises a plurality of different speech emotion types. For example, the speech emotion template 202 includes emotion types such as neutral, angry, disgust, fear, joy, sadness, and surprise.
For example, the user may select one of the speech emotion types in the speech emotion template 202 according to the overall style of the captured video, so that the selected speech emotion type matches the overall style of the captured video. In some embodiments, the user may determine the style of the finished video based on his or her own subjective awareness. For example, when the user determines that the overall style of the finished video leans toward a minimalist style, the user may select a neutral speech emotion type to match the overall style of the video. For another example, when the user determines that the overall style of the finished video leans toward an exaggerated style, the user may select a cheerful speech emotion type to match the overall style of the video. Then, the mobile phone can carry out emotion transformation on the voice in the shot video according to the speech emotion type selected by the user, and store the shot video. The voice in the stored video is the emotional voice corresponding to the speech emotion type selected by the user.
In other embodiments, as shown in fig. 4a and 4b, the mobile phone displays an input interface 203, wherein the input interface 203 may be a preview interface before shooting by the mobile phone; alternatively, the input interface 203 may also be an interface in shooting; alternatively, the input interface 203 may be an interface after the end of shooting. For example, the input interface is an interface for recording a video in a video recording mode by a mobile phone.
Illustratively, the input interface 203 includes speech emotion setting items 204, and the speech emotion setting items 204 may include, for example, a "neutral" setting item, an "angry" setting item, a "disgust" setting item, a "fear" setting item, and the like. In the embodiment of the present application, the method of the embodiment of the present application is described by taking the example that each speech emotion setting item 204 is a scroll bar as shown in fig. 4a and 4b.
It should be understood that the speech emotion template 202 in the above embodiment and the speech emotion setting item 204 in this embodiment are only different expressions, and the method for the user to select an emotion type matching the style of the screen of the input interface 203 is the same. For example, the user can select the emotion type according to the style of a preview picture of the preview interface; or the user can select the emotion type according to the style of the current shooting picture; or, the user may select the emotion type according to the overall style of the shot video, and for a specific example, reference may be made to the above embodiment, which is not described herein again.
For example, the user may slide the scroll bar of one of the speech emotion setting items 204 to select the speech emotion type that matches the style of the currently shot picture. The scroll bar ranges from 0 to 1; for example, when the scroll bar is slid to the position 0, it indicates that the user has not selected the setting item; correspondingly, when the scroll bar is slid to the position 1, it indicates that the user has selected the setting item. Referring to fig. 4a, for example, the scroll bar of the "neutral" setting item included in the speech emotion setting items 204 is slid to the position 1, and the scroll bars of the other setting items (such as the "angry" setting item, the "disgust" setting item, and the "fear" setting item) are slid to the position 0; that is, the user selects the neutral emotion type to match the style of the currently shot picture.
In some embodiments, positions between 0 and 1 on the scroll bar are also used to represent the intensity of the speech emotion. For example, the closer the scroll bar of a setting item is to the position 1, the stronger the emotional intensity of the emotion type corresponding to that setting item; the closer the scroll bar is to the position 0, the weaker the emotional intensity. Therefore, the user can also place the scroll bar at different positions between 0 and 1, so that different voices correspond to different emotion types and the emotion intensities of different emotion types also differ, thereby further meeting the user's requirement for beautifying the voice effect.
On this basis, the user can also slide several of the speech emotion setting items 204 so that the scroll bar positions corresponding to the setting items are not identical, that is, the intensities of the speech emotions corresponding to the setting items are not identical; the speech emotion output by the mobile phone can then include emotion types with different emotion intensities, further enriching the emotional expression of the voice. As shown in FIG. 4b, the speech emotion setting items 204 include a "neutral" setting item, an "angry" setting item, a "disgust" setting item, and a "fear" setting item, and the scroll bar positions corresponding to these setting items differ from one another. After video shooting is finished, the voice of the video can include a complex mixture of emotions such as neutral, angry, disgust, and fear with different emotion intensities, as sketched in the example below.
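As a concrete illustration of how the scroll-bar positions could be combined, the following Python sketch normalizes the per-emotion scroll-bar positions into an emotion-intensity vector. It is a minimal sketch, not part of the patent; the function name and the fallback behavior are assumptions.

```python
# Hypothetical sketch: mapping the 0-1 scroll-bar positions of the speech emotion
# setting items to a normalized emotion-intensity vector. Names are illustrative
# assumptions, not taken from the patent.

def emotion_vector(slider_positions: dict) -> dict:
    """Normalize the raw scroll-bar positions so the intensities sum to 1."""
    # Clamp each position into [0, 1]; 0 means the setting item was not selected.
    clamped = {k: min(max(v, 0.0), 1.0) for k, v in slider_positions.items()}
    total = sum(clamped.values())
    if total == 0:
        # No setting item selected: fall back to a purely neutral voice.
        return {"neutral": 1.0}
    return {k: v / total for k, v in clamped.items() if v > 0}

# Example corresponding to Fig. 4b: several sliders at different positions.
print(emotion_vector({"neutral": 0.4, "angry": 0.2, "disgust": 0.1, "fear": 0.3}))
# -> {'neutral': 0.4, 'angry': 0.2, 'disgust': 0.1, 'fear': 0.3}
```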
In another possible implementation, the mobile phone can automatically match a corresponding speech emotion type according to the style of the shot picture, thereby satisfying the user's requirement for beautifying the recording effect. The style of the shot picture may be, for example: a minimalist style, an exaggerated style, a gradient style, an ink-wash style, an elegant style, an artistic style, and the like. Of course, the style of the shot picture can also be another style, which is not listed here.
For example, when the mobile phone recognizes that the style of the current shot picture is a minimalist style, the mobile phone can automatically match a neutral speech emotion to the style of the current shot picture. For another example, when the mobile phone recognizes that the style of the current shot picture is an exaggerated style, the mobile phone can automatically match a cheerful speech emotion to the style of the current shot picture. Alternatively, the mobile phone may recognize the emotion of the voice in the video and match an emotion type corresponding to the emotion in the voice. For example, the mobile phone can detect keywords while the user is speaking in real time and automatically match the speech emotion type corresponding to the keywords. Illustratively, if the keywords detected by the mobile phone during shooting include "upset" and "wronged", the speech emotion type matched by the mobile phone according to the keywords is sad.
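A minimal sketch of the keyword-based matching described above is given below. The keyword lists, the function name, and the assumption that a speech-to-text transcript is available are illustrative, not taken from the patent.

```python
# Hypothetical keyword-to-emotion matcher. The keyword lists are assumptions
# chosen only to illustrate the matching step described above.
KEYWORD_EMOTIONS = {
    "sad": ["upset", "wronged", "cry"],
    "cheerful": ["great", "delicious", "wonderful"],
    "surprise": ["wow", "unbelievable"],
}

def match_emotion(transcript: str, default: str = "neutral") -> str:
    """Return the first emotion type whose keywords appear in the recognized speech."""
    text = transcript.lower()
    for emotion, keywords in KEYWORD_EMOTIONS.items():
        if any(word in text for word in keywords):
            return emotion
    return default

print(match_emotion("I feel so upset and wronged"))  # -> "sad"
```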
In combination with any of the above embodiments, in some embodiments, after the mobile phone finishes shooting the video, when the user plays the video, the emotion type of the voice heard by the user is the emotion type selected by the user before. In other embodiments, when the user plays the video, the indication information is displayed on the playing interface of the mobile phone. The indication information is used for indicating the emotion type of the voice of the video which is currently played. For example, as shown in fig. 4c, the indication information 205 is: the current speech emotion type is cheerful.
In the embodiment of the present application, the shot pictures of the mobile phone shown in fig. 3a to 4c when shooting videos are all illustrated by taking a person as an example. It should be understood that the shot pictures when the mobile phone shoots the video may also be pictures of scenery, food, and the like, and the actual shooting is taken as the standard, and the pictures are not listed here.
In some embodiments, the emotion processing method for input information provided by the embodiment of the present application may also be applied in a scenario of recording audio (hereinafter, referred to as recording), where the recording scenario may be a scenario in which a user uses a recording application on a mobile phone to record audio. Illustratively, the handset displays an input interface; the input interface can be an interface before mobile phone recording; or, the input interface may be an interface in the recording process; alternatively, the input interface may be an interface for playing the recording.
The input interface is exemplified as an interface when a recording (which may also be referred to as a recording file) is played. Illustratively, as shown in FIG. 5a, input interface 206 includes a speech emotion template 207; where speech emotion template 207 includes a plurality of different speech emotion types. For example, the speech emotion template 207 includes emotion types such as neutral, angry, disgust, fear, joy, sadness, and surprise. On the basis, the user can select the emotion type in the voice emotion template 207 before playing the recording so as to endow emotion to the recording; or when the user plays the recording, the mobile phone can identify the emotion in the recording and match the emotion type corresponding to the emotion in the recording. For example, the mobile phone can detect keywords in the recording in real time and match the emotion types corresponding to the keywords to give emotion to the recording. For example, when the keywords identified by the mobile phone include "too hard" and "cry", the emotion type matched by the mobile phone for the recording may be "sad". It should be understood that when the mobile phone plays the recording, the emotion of the recording heard by the user corresponds to the emotion type of the recording.
In some embodiments, the emotion processing method for input information provided by the embodiments of the present application may also be applied to a scene in which a video (also referred to as a video file) is played. Illustratively, if a user opens a gallery application of the mobile phone, a certain video file in the gallery application is played. The input interface is taken as an interface for playing the video file for illustration. Illustratively, as shown in FIG. 5b, input interface 208 includes a speech emotion template 209; where voice emotion template 209 includes a plurality of different voice emotion types. For example, the speech emotion templates 209 include emotion types such as neutral, angry, disgust, fear, joy, sadness, and surprise. On this basis, the user can select the emotion type in emotion template 209 before playing the video file to give emotion to the voice in the video file; or the mobile phone identifies the emotion in the video file and matches the emotion type corresponding to the emotion of the video file. For example, the mobile phone can identify keywords in the video file and match the emotion types corresponding to the keywords in the video file to assign emotion to the segment of video. For example, when the keywords identified by the mobile phone include "too hard" and "cry", the emotion type matched by the mobile phone for the video segment may be "sad". It should be understood that when the mobile phone plays the video file, the emotion of the voice in the video file heard by the user is the voice corresponding to the emotion type.
In some embodiments, the emotion processing method for input information provided by the embodiments of the present application may also be applied to a text-to-speech scenario. Illustratively, the text (which may also be referred to as a text file) may be a segment of text in a notepad application of a cell phone. The input interface is taken as an interface for converting text into voice. Illustratively, as shown in FIG. 5c, the input interface 210 includes a speech emotion template 211; wherein speech emotion template 211 comprises a plurality of different speech emotion types. For example, the speech emotion template 211 includes emotion types such as neutral, angry, disgust, fear, joy, sadness, and surprise. On the basis, when the user wants to convert the text into voice, the user can select the emotion type in the voice emotion template 211 to endow emotion to the converted voice; or the mobile phone can automatically identify the semantics of the text segment and match the corresponding emotion type for the text segment according to the semantics. After the mobile phone converts the text into voice, the emotion of the voice heard by the user is emotion voice corresponding to the emotion type selected by the user.
It should be noted that, in the above embodiments, the speech emotion templates included in the input interfaces shown in fig. 5a to 5c may also be presented as speech emotion setting items. For the illustration of the speech emotion setting items, reference may be made to the foregoing embodiments, and details are not repeated here. The embodiment of the application provides an emotion processing method of input information, which can be applied to electronic equipment. As shown in fig. 6, the method may include S301-S304.
S301, the electronic equipment acquires input information.
Wherein the input information comprises voice information or text information. The speech information may be, for example, a piece of speech and the text information may be, for example, a piece of text. The emotion of the input information is a first emotion; the first emotion corresponds to a first emotion type. Illustratively, the first emotion type may include neutral, angry, aversion, fear, joy, sadness, and surprise emotion types. Wherein, neutral emotion type refers to emotion type without any one emotion color. For example, when the input information is text, the first emotion type may be a neutral emotion type.
S302, the electronic equipment processes the input information to obtain time-frequency characteristics of the input information.
In connection with the above embodiments, the input information comprises speech or text. The processing of the input information by the electronic equipment includes the following steps: the electronic equipment first converts the input information into a voice signal, and then the electronic equipment encodes the voice signal to obtain the time-frequency characteristics corresponding to the voice signal. The voice signal refers to the signal of a sound wave (i.e., a voice waveform), and the voice signal is an information carrier of the wavelength and intensity of the sound wave.
Taking the input information being voice as an example, the electronic device may recognize the audio of the voice, so as to obtain the voice signal corresponding to the audio of the voice. Taking the input information being text as an example, the electronic device first converts the text into voice and then recognizes the audio of the converted voice, so as to obtain the voice signal corresponding to the text.
It should be noted that, for the specific description of the speech signal obtained by recognizing the audio of the speech, reference may be made to the speech recognition technology of the related art, and details thereof are not described herein.
In some embodiments, as shown in fig. 7, the encoding process may include, for example: framing processing and fourier transformation. As an example, the framing processing refers to dividing a speech signal into a plurality of speech frames according to a preset frame length and frame shift, so as to obtain a speech frame sequence corresponding to the speech signal. Each voice frame can be a voice segment, and then the voice signal can be processed frame by frame.
Illustratively, the frame length may be used to indicate the duration of each speech frame, and the frame shift may be used to indicate the offset between the starting points of adjacent speech frames (so that adjacent frames overlap). For example, when the frame length is 25 ms and the frame shift is 15 ms, the first speech frame covers 0-25 ms, the second speech frame covers 15-40 ms, and so on, thereby realizing the framing processing of the speech signal. It should be understood that the specific frame length and frame shift may be set according to practical situations, and the embodiment of the present application is not limited thereto.
Then, the electronic device performs Fourier transform processing on each speech frame in the speech frame sequence in sequence to obtain the time-frequency characteristics of each speech frame. The time-frequency characteristics describe how the frequency and amplitude of each speech frame vary with time. In some embodiments, the time-frequency features may be represented by a real-time spectrogram (abbreviated as a time spectrum). The real-time spectrogram can be a three-primary-color pixel map (namely, an RGB pixel map); alternatively, the real-time spectrogram may be a time-frequency waveform. Taking the real-time spectrogram being an RGB pixel map as an example, as shown in fig. 8, in the RGB pixel map the horizontal axis (i.e., the X axis) represents time, the vertical axis (i.e., the Y axis) represents frequency, and the color depth represents amplitude.
It should be noted that the fourier transform described in the above embodiments may be, for example, a short-time fourier transform; alternatively, the transform may be a fast fourier transform, which is not limited in this application.
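As a rough illustration of the encoding process (framing followed by a Fourier transform on each frame), the following numpy sketch uses the 25 ms frame length and 15 ms frame shift from the example above; the Hann window, the 16 kHz sample rate, and the magnitude-only output are assumptions, not requirements of the patent.

```python
# Minimal sketch of the encoding step: framing + per-frame FFT -> time-frequency feature.
import numpy as np

def encode(speech: np.ndarray, sr: int = 16000,
           frame_ms: float = 25.0, shift_ms: float = 15.0) -> np.ndarray:
    frame_len = int(sr * frame_ms / 1000)       # samples per speech frame (25 ms)
    frame_shift = int(sr * shift_ms / 1000)     # offset between adjacent frames (15 ms)
    window = np.hanning(frame_len)

    spectra = []
    for start in range(0, len(speech) - frame_len + 1, frame_shift):
        frame = speech[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))   # frequency vs. amplitude of this frame
    # Rows are speech frames (time), columns are frequency bins: a simple spectrogram.
    return np.stack(spectra)

spectrogram = encode(np.random.randn(16000))    # one second of dummy audio
print(spectrogram.shape)                        # (number of frames, number of frequency bins)
```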
S303, the electronic equipment inputs the time-frequency characteristics of the input information into the pre-trained speech emotion model to obtain the time-frequency characteristics of the target speech.
The voice emotion model is used for converting the emotion of the input information. Illustratively, the speech emotion model may be a model built based on a neural network, that is, a model trained based on a neural network. Neural networks here include, but are not limited to, combinations, stacks, and nestings of at least one of the following: a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a bidirectional long short-term memory network (BLSTM), a deep convolutional neural network (DCNN), and the like.
The neural network may be composed of neural units. Taking the neural network being a deep neural network as an example: a deep neural network may also be referred to as a multilayer neural network and may be understood as a neural network with multiple hidden layers. The layers of a deep neural network can be divided according to their positions. Illustratively, the layers inside a deep neural network can be divided into three categories: the input layer, the hidden layers, and the output layer. In general, the first layer of the deep neural network is the input layer, the last layer is the output layer, and the middle layers are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is necessarily connected with any neuron of the (i+1)-th layer.
In some embodiments, the speech emotion model can be obtained by training the neural network with a forward propagation algorithm and a back propagation algorithm. Illustratively, the forward propagation algorithm involves convolutional layers and pooling layers; the back propagation algorithm involves deconvolution and unpooling. In the embodiment of the application, during the training of the speech emotion model, the neural network adopts the back propagation algorithm to correct the values of the parameters in the initial neural network model, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the forward propagation algorithm produces an error loss, and the parameters in the initial neural network model are updated through back propagation of the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, aiming at obtaining the optimal parameters of the neural network model, such as the weight matrices. In other words, the back propagation algorithm can optimize the weights of the neural network model.
Additionally, in some embodiments, the speech emotion model includes a first model and a second model. The first model is used for indicating the mapping relation between the emotional voice and the emotional type; the second model is used to modify the emotion of the input information. Exemplarily, the electronic device inputs the time-frequency characteristics of the input information into the first model to obtain the time-frequency characteristics of the input information and a first emotion type corresponding to the time-frequency characteristics; and then, the electronic equipment takes the time-frequency characteristics of the input information and the first emotion types corresponding to the time-frequency characteristics as the input of the second model, so that the second model outputs the time-frequency characteristics of the target voice. The time-frequency characteristics of the target voice are different from the time-frequency characteristics of the input information, the time-frequency characteristics of the target voice correspond to the second emotion type, and the time-frequency characteristics of the input information correspond to the first emotion type.
For example, taking the first emotion type as anger, for example, the electronic device may input the input information into the first model, and may obtain the emotion type corresponding to the input information, that is, the electronic device may know that the emotion type of the input information is the first emotion type (that is, anger). It should be noted that the input information input to the first model is represented in a time-frequency characteristic manner. And then, the electronic equipment inputs the time-frequency characteristics of the input information into a second model, and the second model can modify the time-frequency characteristics of the input information to obtain the time-frequency characteristics of the target voice with the second emotion type.
Taking the time-frequency feature being a real-time spectrogram as an example, the modification of the time-frequency feature of the input information by the second model includes modifying the distribution of energy over frequency and time in the real-time spectrogram.
It should be noted that, for the example of the time-frequency characteristic of the target speech, reference may be made to the example of the time-frequency characteristic of the input information in the foregoing embodiment, and details are not repeated here. It should be understood that the time-frequency characteristics of the target speech are used to describe the frequency versus amplitude of the frame of speech of the target speech over time.
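To make the first-model/second-model split concrete, here is a hedged PyTorch sketch: a recognizer that maps the time-frequency feature to an emotion type, and a converter that rewrites the feature toward a target emotion. The layer sizes, the conditioning scheme, and all class names are assumptions for illustration; this is not the patent's actual network.

```python
# Hypothetical two-part speech emotion model (recognizer + converter).
import torch
import torch.nn as nn

NUM_EMOTIONS = 7  # neutral, angry, disgust, fear, joy, sadness, surprise

class EmotionRecognizer(nn.Module):
    """First model: time-frequency feature -> emotion type logits."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
        )
        self.classifier = nn.Linear(16 * 8 * 8, NUM_EMOTIONS)

    def forward(self, spec):                      # spec: (batch, 1, time, freq)
        return self.classifier(self.features(spec))

class EmotionConverter(nn.Module):
    """Second model: feature + source/target emotion -> modified feature."""
    def __init__(self, freq_bins: int = 201):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(freq_bins + 2 * NUM_EMOTIONS, 256), nn.ReLU(),
            nn.Linear(256, freq_bins),
        )

    def forward(self, spec, src_emotion, tgt_emotion):        # spec: (batch, time, freq)
        cond = torch.cat([src_emotion, tgt_emotion], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, spec.shape[1], -1)  # broadcast over time steps
        return self.net(torch.cat([spec, cond], dim=-1))

# Usage sketch: recognize the first emotion, then convert toward a chosen second emotion.
spec = torch.randn(1, 66, 201)                                # dummy time-frequency feature
src = torch.softmax(EmotionRecognizer()(spec.unsqueeze(1)), dim=-1)
tgt = nn.functional.one_hot(torch.tensor([4]), NUM_EMOTIONS).float()  # e.g. joy
converted_spec = EmotionConverter()(spec, src, tgt)           # feature of the target voice
```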
S304, the electronic equipment processes the time-frequency characteristics of the target voice to obtain the target voice.
Wherein the emotion of the target voice is a second emotion; the second emotion corresponds to a second emotion type; the content of the input information is the same as that of the target voice, and the first emotion is different from the second emotion, namely the first emotion type is different from the second emotion type.
In some embodiments, the electronic device may perform a decoding process and a speech synthesis process on the time-frequency characteristics of the target speech. Wherein the decoding process comprises inverse fourier transform and time domain waveform superposition. For example, the electronic device performs inverse fourier transform on the frequencies in the time-frequency characteristic of the target voice, the amplitude at each frequency, and the phase at each frequency to obtain a time-domain signal (i.e., a time-domain waveform) of the target voice.
As can be seen from the foregoing embodiments, since the speech signal is subjected to framing processing in the encoding processing process, the time domain signal of the target speech obtained in the embodiment is the time domain signal of each speech frame. Based on this, the electronic device also performs time domain waveform superposition on the time domain signal of each speech frame, so that a signal of the target speech (i.e., a target speech waveform) can be obtained. Correspondingly, the electronic device may synthesize the signal of the target speech by using a speech synthesis technique to obtain the target speech, i.e., the speech finally output by the electronic device.
It should be noted that the speech synthesis technology can refer to the speech synthesis technology of the related art, and is not described in detail here.
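For the decoding side, the following numpy sketch mirrors the encoding sketch above: each frame's spectrum goes through an inverse Fourier transform and the resulting time-domain frames are superposed by overlap-add. The assumption that per-frame phases are available alongside the magnitudes, and the omission of window compensation and of the final synthesis stage, are simplifications for illustration.

```python
# Minimal sketch of the decoding step: per-frame inverse FFT + time-domain overlap-add.
import numpy as np

def decode(magnitudes: np.ndarray, phases: np.ndarray, sr: int = 16000,
           frame_ms: float = 25.0, shift_ms: float = 15.0) -> np.ndarray:
    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(sr * shift_ms / 1000)
    waveform = np.zeros(frame_shift * (len(magnitudes) - 1) + frame_len)

    for i, (mag, phase) in enumerate(zip(magnitudes, phases)):
        spectrum = mag * np.exp(1j * phase)                 # complex spectrum of one frame
        frame = np.fft.irfft(spectrum, n=frame_len)         # inverse Fourier transform
        start = i * frame_shift
        waveform[start:start + frame_len] += frame          # time-domain waveform superposition
    return waveform                                         # signal of the target speech
```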
To sum up, in the embodiment of the application, the electronic device can perform emotion processing on the input information, so that the same input information can be expressed by different emotions to generate different listening effects, and thus, the emotion expression of voice is enriched.
Taking the input information as speech as an example, the input information is a section of speech spoken by a speaker (e.g., "landscape here"), and the emotion type corresponding to the emotion contained in the section of speech is neutral, i.e., the speaker does not inject any emotion color into the section of speech. After receiving the voice spoken by the speaker, the electronic device modifies the emotion of the voice and outputs the voice containing the emotion. For example, the emotion type corresponding to the emotion included in the speech output by the electronic device is cheerful. In other words, the emotion of the speech uttered by the speaker is neutral, and after the emotion processing by the electronic device, the emotion of the output speech is cheerful, that is, the speech heard by the listener is a speech with cheerful emotion.
The embodiment of the application also provides a model training method, which is used for training the speech emotion model in the embodiment. As shown in fig. 9, the method includes:
S401, the electronic equipment acquires a voice emotion data set.
Wherein the speech emotion data set comprises a plurality of pieces of emotion speech.
In some embodiments, the electronic device can collect the user's speech to build the speech emotion data set. The user may be, for example, a professional voice actor. For example, the user may speak the texts included in a corpus with different emotion types. The corpus may include texts, which may be standard Mandarin Chinese texts. For example, the corpus includes standard Mandarin texts such as "Did you eat?" and "The weather is really nice today!".
What is "did you eat? "this text is an example, and it is assumed that the user phonetically expresses the text with five different emotion types. Wherein, five different emotion types include: fear, joy, surprise, sadness and neutrality.
Exemplarily, as shown in fig. 10, the user uses fear emotion to perform speech expression on the text, so as to obtain a first piece of emotion speech; the user adopts the cheerful emotion to carry out voice expression on the text to obtain a second piece of emotion voice; the user adopts the surprised emotion to carry out voice expression on the text to obtain a third piece of emotion voice; and performing voice expression on the text by the user by adopting sad emotions to obtain a fourth emotional voice, and performing voice expression on the text by the user by adopting neutral emotions to obtain a fifth emotional voice. Correspondingly, for other texts in the corpus, the method can be adopted to obtain a plurality of emotional voices with different emotions corresponding to the other texts, so that the voice of the user can be collected based on the corpus, and a voice emotion data set is constructed.
Considering the richness of the user's emotional expression, when expressing a certain text the user may mix in other emotions (such as neutral, angry, and disgust) besides the specified emotion (such as fear); moreover, the emotional intensity of each emotion also differs. Based on this, after each piece of emotional voice corresponding to the different texts is obtained, the user can evaluate the emotion intensity of each piece of emotional voice, for example by scoring it. The score refers to the proportion (%) of each emotion among the plurality of emotions contained in the emotional voice. Note that the evaluation of the emotion intensity of each emotional voice may be based, for example, on the subjective awareness of the user.
As shown in fig. 10, for example, in the first emotional voice the emotion intensity proportions are: neutral 5%, angry 10%, disgust 15%, fear 70%, and 0 for the other emotions (such as cheerful, sad, and surprise). In the second emotional voice: neutral 5%, cheerful 70%, sad 10%, surprised 15%, and 0 for the other emotions (such as angry, disgust, and fear). In the third emotional voice: angry 5%, neutral 10%, cheerful 15%, surprised 75%, and 0 for the other emotions (such as disgust, sad, and fear). In the fourth emotional voice: neutral 5%, sad 70%, disgust 10%, fear 15%, and 0 for the other emotions (such as angry, cheerful, and surprise). In the fifth emotional voice: neutral 80%, cheerful 5%, sad 10%, surprised 5%, and 0 for the other emotions (such as angry, disgust, and fear).
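One way to represent such scored samples in code is sketched below; the class name, field names, and file path are hypothetical, and the proportions are taken from the first emotional voice above.

```python
# Hypothetical representation of one entry in the speech emotion data set:
# the recorded emotional speech plus the rated per-emotion intensity proportions.
from dataclasses import dataclass, field

@dataclass
class EmotionSample:
    text: str                                        # corpus text that was spoken
    audio_path: str                                  # path to the recorded emotional voice
    intensities: dict = field(default_factory=dict)  # emotion type -> proportion (0-1)

    def __post_init__(self):
        total = sum(self.intensities.values())
        assert abs(total - 1.0) < 1e-6, "intensity proportions must sum to 100%"

# First emotional voice from the example above: predominantly fear.
sample = EmotionSample(
    text="Did you eat?",
    audio_path="corpus/utt0001_fear.wav",            # hypothetical path
    intensities={"neutral": 0.05, "angry": 0.10, "disgust": 0.15, "fear": 0.70},
)
```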
S402, aiming at each emotional voice in the voice emotional data set, the electronic equipment carries out feature extraction processing on the emotional voice to obtain time-frequency features of the emotional voice.
In some embodiments, the feature extraction process may include, for example: framing processing and Fourier transform. In addition, after the feature extraction processing, the obtained time-frequency features of the emotional voice describe how the frequency and amplitude of each voice frame in the emotional voice vary with time.
It should be noted that, for the frame processing, the fourier transform, and the example of the time-frequency characteristic, reference may be made to the foregoing embodiments, and details are not repeated here.
S403, the electronic equipment inputs the time-frequency characteristics of the emotional voice into the neural network model for emotion training to obtain the voice emotion model.
Illustratively, the electronic device inputs the time-frequency characteristics of each emotion voice into the neural network model until the neural network model is completely converged, so as to obtain a mature voice emotion model. For example, the speech emotion model is used to convert the emotion type of the input information.
It should be noted that, for the illustration of the neural network model, reference may be made to the illustration of the neural network in the foregoing embodiment, and details are not described here.
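A hedged sketch of the emotion training in S403 is given below, reusing the hypothetical EmotionRecognizer from the earlier sketch: forward propagation produces an error loss, and back propagation updates the parameters until the loss converges. The optimizer, the loss choice (soft emotion-intensity targets), and the training schedule are assumptions, not the patent's prescribed procedure.

```python
# Hypothetical training loop for the neural network model (S403).
import torch
import torch.nn as nn

def train(model: nn.Module, dataloader, epochs: int = 10) -> nn.Module:
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # CrossEntropyLoss accepts probability targets (the per-emotion proportions)
    # in recent PyTorch versions.
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for spectrogram, emotion_proportions in dataloader:
            logits = model(spectrogram)        # forward propagation
            loss = loss_fn(logits, emotion_proportions)
            optimizer.zero_grad()
            loss.backward()                    # back propagation of the error loss
            optimizer.step()                   # update parameters to reduce the loss
    return model
```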
In some embodiments, the speech emotion model includes a first model and a second model. The first model is used for indicating the mapping relation between the time-frequency characteristics and the emotion types of the emotion voice; the second model is used for converting the emotion types of the emotion voice.
To sum up, in the embodiment of the application, the electronic device can train the neural network model based on the voice emotion data set to obtain the voice emotion model. The electronic device can then transform the emotion type of the input information according to the voice emotion model to obtain the target voice, that is, the voice with the emotion desired by the user, which enriches the emotional expression of the voice and improves the user experience.
An embodiment of the present application provides an electronic device, which may include: a display screen (e.g., a touch screen), a camera, a memory, and one or more processors. The display screen, camera, memory and processor are coupled. Wherein the display screen is for displaying an image captured by the camera or an image generated by the processor, and the memory is for storing computer program code comprising computer instructions. When the processor executes the computer instructions, the electronic device may perform various functions or steps performed by the mobile phone in the above method embodiments. The structure of the electronic device may refer to the structure of the electronic device 100 shown in fig. 1.
An embodiment of the present application further provides a chip system, as shown in fig. 11, the chip system 1800 includes at least one processor 1801 and at least one interface circuit 1802.
The processor 1801 and the interface circuit 1802 may be interconnected by wires. For example, the interface circuit 1802 may be used to receive signals from other devices (e.g., a memory of an electronic device). For another example, the interface circuit 1802 may be used to send signals to other devices (e.g., the processor 1801). Illustratively, the interface circuit 1802 may read instructions stored in the memory and send the instructions to the processor 1801. The instructions, when executed by the processor 1801, may cause the electronic device to perform the steps performed by the mobile phone in the above embodiments. Of course, the chip system may further include other discrete devices, which is not specifically limited in this embodiment of the present application.
Embodiments of the present application further provide a computer storage medium, where the computer storage medium includes computer instructions, and when the computer instructions are run on an electronic device, the electronic device is enabled to execute each function or step executed by a mobile phone in the foregoing method embodiments.
The embodiments of the present application further provide a computer program product, which when run on a computer, causes the computer to execute each function or step executed by the mobile phone in the above method embodiments.
Through the description of the above embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A method for emotion processing of input information, the method comprising:
the electronic equipment displays an input interface; the input interface is used for receiving input information input by a user, and the input information comprises voice information or text information; the emotion of the input information comprises a first emotion;
the electronic equipment determines a target emotion type;
the electronic equipment carries out emotion processing on the input information according to the target emotion type to obtain target voice; the emotion of the target voice comprises a second emotion, the content of the target voice is the same as the content of the input information, and the first emotion is different from the second emotion.
2. The method of claim 1, wherein the electronic device displays an input interface comprising:
the electronic equipment responds to the operation of starting a camera by a user and displays the input interface; the input interface is a preview interface before the electronic equipment shoots a video; or the input interface is an interface in a video shot by the electronic equipment; or the input interface is an interface after the electronic equipment shoots the video.
3. The method of claim 1, wherein the electronic device comprises a gallery application, a recording application, and a notepad application; the electronic device displays an input interface, comprising: the electronic equipment responds to the operation of opening any video file in the gallery application by a user and displays the input interface; or the electronic equipment responds to the operation of opening any sound recording file in the sound recording application by a user and displays the input interface; or the electronic equipment responds to the operation of opening any text file in the notepad application by the user and displays the input interface.
4. The method of claim 2 or 3, wherein the input interface comprises a plurality of speech emotion controls, one for each emotion type; the determining the target emotion type comprises the following steps:
the electronic device determines the target emotion type in response to a user operation of at least one of the plurality of voice emotion controls.
5. The method of claim 2, wherein the input interface comprises a shot picture; the determining the target emotion type comprises the following steps:
and the electronic equipment identifies the style of the shot picture and automatically matches the target emotion type corresponding to the style of the shot picture.
6. A method according to claim 2 or 3, characterized in that the input information is speech information; the determining the target emotion type comprises the following steps:
and the electronic equipment identifies the emotion in the voice information and automatically matches the emotion type corresponding to the emotion in the voice information so as to determine the target emotion type.
7. The method of claim 3, wherein the input information is text information; the determining the target emotion type comprises the following steps:
and the electronic equipment identifies the semantics of the text information and automatically matches the emotion types corresponding to the semantics of the text information to determine the target emotion types.
8. The method of claim 1, wherein the electronic device performs emotion processing on the input information according to the target emotion type to obtain a target voice, and the method comprises:
the electronic equipment inputs the input information into a voice emotion model to obtain target voice; the voice emotion model is used for modifying the emotion of the input information according to the target emotion type.
9. The method of claim 8, wherein the electronic device inputs the input information into a speech emotion model to obtain a target speech, comprising:
the electronic equipment carries out coding processing on the input information to obtain the time-frequency characteristics of the input information; the encoding process comprises framing and Fourier transform; the framing processing is used for dividing the input information into a plurality of voice frames, the time-frequency characteristics are used for describing the change along with time, and the relationship between the frequency and the amplitude of each voice frame;
the electronic equipment inputs the time-frequency characteristics of the input information into the voice emotion model to obtain the time-frequency characteristics of the target voice;
the electronic equipment performs decoding processing and voice synthesis processing on the time-frequency characteristics of the target voice to obtain the target voice; the decoding process includes inverse fourier transform and time domain waveform superposition.
10. The method according to claim 8 or 9, characterized in that the method further comprises:
the electronic equipment acquires a voice emotion data set; the voice emotion data set comprises a plurality of pieces of emotion voice; the emotion types corresponding to each emotion voice in the plurality of emotion voices are different;
for each emotional voice in the voice emotion data set, the electronic equipment performs feature extraction processing on the emotional voice to obtain a time-frequency feature of the emotional voice;
and the electronic equipment inputs the time-frequency characteristics of the emotional voice into a neural network model for emotional training to obtain the voice emotional model.
11. The method according to any one of claims 8 to 10,
the speech emotion model comprises a first model and a second model; the first model is used for indicating the mapping relation between the emotional voice and the emotional type; the second model is used to modify the emotion of the input information.
12. The method according to any one of claims 1 to 8,
when the electronic equipment plays an audio picture, the electronic equipment outputs the target voice; the audio picture is a video picture; or the audio picture is an audio file.
13. The method of claim 12,
when the electronic equipment plays an audio picture, indication information is displayed on an interface of the electronic equipment; the indication information is used for indicating the emotion type corresponding to the target voice.
14. An electronic device, comprising a memory, a display, one or more cameras, and one or more processors; the display screen is for displaying images captured by the camera or images generated by the processor, and the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any of claims 1-13.
15. A computer readable storage medium comprising computer instructions which, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-13.
16. A computer program product, characterized in that it comprises computer instructions which, when run on an electronic device, cause the electronic device to perform the method according to any of claims 1-13.
CN202111017308.3A 2021-08-31 2021-08-31 Emotion processing method of input information and electronic equipment Active CN114495988B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202310257753.XA CN116705072A (en) 2021-08-31 2021-08-31 Emotion processing method of input information and electronic equipment
CN202111017308.3A CN114495988B (en) 2021-08-31 2021-08-31 Emotion processing method of input information and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111017308.3A CN114495988B (en) 2021-08-31 2021-08-31 Emotion processing method of input information and electronic equipment

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202310257753.XA Division CN116705072A (en) 2021-08-31 2021-08-31 Emotion processing method of input information and electronic equipment

Publications (2)

Publication Number Publication Date
CN114495988A true CN114495988A (en) 2022-05-13
CN114495988B CN114495988B (en) 2023-04-18

Family

ID=81491849

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310257753.XA Pending CN116705072A (en) 2021-08-31 2021-08-31 Emotion processing method of input information and electronic equipment
CN202111017308.3A Active CN114495988B (en) 2021-08-31 2021-08-31 Emotion processing method of input information and electronic equipment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202310257753.XA Pending CN116705072A (en) 2021-08-31 2021-08-31 Emotion processing method of input information and electronic equipment

Country Status (1)

Country Link
CN (2) CN116705072A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107221344A (en) * 2017-04-07 2017-09-29 南京邮电大学 A kind of speech emotional moving method
CN111832331A (en) * 2019-04-17 2020-10-27 北京入思技术有限公司 Cross-modal emotion migration method and device
CN113051427A (en) * 2019-12-10 2021-06-29 华为技术有限公司 Expression making method and device
CN112446217A (en) * 2020-11-27 2021-03-05 广州三七互娱科技有限公司 Emotion analysis method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU, Jie: "A Survey of Speech Emotion Conversion Technology" (语音情感转换技术综述), Informatization Research (《信息化研究》) *

Also Published As

Publication number Publication date
CN114495988B (en) 2023-04-18
CN116705072A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN110941954B (en) Text broadcasting method and device, electronic equipment and storage medium
JP7222112B2 (en) Singing recording methods, voice correction methods, and electronic devices
US11941323B2 (en) Meme creation method and apparatus
CN111669515B (en) Video generation method and related device
CN110473546B (en) Media file recommendation method and device
CN113727017A (en) Shooting method, graphical interface and related device
CN113965694B (en) Video recording method, electronic device and computer readable storage medium
WO2023030098A1 (en) Video editing method, electronic device, and storage medium
CN115238111B (en) Picture display method and electronic equipment
CN113643728A (en) Audio recording method, electronic device, medium, and program product
CN111460231A (en) Electronic device, search method for electronic device, and medium
CN113689530A (en) Method and device for driving digital person and electronic equipment
CN113891150A (en) Video processing method, device and medium
CN114495988B (en) Emotion processing method of input information and electronic equipment
EP4293664A1 (en) Voiceprint recognition method, graphical interface, and electronic device
WO2023006001A1 (en) Video processing method and electronic device
WO2022170837A1 (en) Video processing method and apparatus
CN118052912A (en) Video generation method, device, computer equipment and storage medium
KR20220138669A (en) Electronic device and method for providing personalized audio information
CN113763517A (en) Facial expression editing method and electronic equipment
WO2023197949A1 (en) Chinese translation method and electronic device
CN115966198A (en) Audio processing method and device
CN116456035B (en) Enhanced vibration prompting method and electronic equipment
WO2023065832A1 (en) Video production method and electronic device
CN117764853B (en) Face image enhancement method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant