CN111354377A - Method and device for recognizing emotion through voice and electronic equipment - Google Patents
- Publication number
- CN111354377A (application number CN201910569691.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- emotion
- recognition result
- emotion recognition
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
Abstract
The invention discloses a method and a device for recognizing emotion through voice, and an electronic device. The method comprises the following steps: acquiring a voice signal of a recognition object; processing the voice signal to obtain a voice feature vector; inputting the voice feature vector into an emotion recognition model to obtain a first emotion recognition result; searching an emotion word database according to the voice feature vector to obtain a second emotion recognition result; and obtaining a final emotion recognition result from the first emotion recognition result and the second emotion recognition result. The invention can recognize emotion through voice.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and a device for recognizing emotion through voice, and an electronic device.
Background
Voice carries many distinguishing characteristics: a speaker can be identified by voice, and human beings can infer a speaker's emotion from different vocal characteristics. In the field of education, recognizing students' emotions through voice can help teachers keep track of students' conditions in time, making it convenient for teachers to adjust their teaching methods and improve teaching effectiveness, or to promptly identify students with abnormal emotions and provide positive guidance.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for recognizing emotion through voice, and an electronic device, which are capable of recognizing emotion through voice.
In view of the above object, the present invention provides a method for recognizing emotion by voice, comprising:
acquiring a voice signal of a recognition object;
processing the voice signal to obtain a voice characteristic vector;
inputting the voice feature vector into an emotion recognition model, and recognizing to obtain a first emotion recognition result;
searching an emotion word database according to the voice feature vector to obtain a second emotion recognition result;
and obtaining a final emotion recognition result according to the first emotion recognition result and the second emotion recognition result.
Optionally, the voice feature vector includes a mood feature, a speech rate feature, an intonation feature, a pronunciation frequency feature, an accent feature, and a word-usage feature.
Optionally, the mood feature, the speech rate feature, the intonation feature, and the pronunciation frequency feature are input into the emotion recognition model, and the first emotion recognition result is obtained through recognition.
Optionally, the emotion word database is searched by word usage according to the accent features to obtain the second emotion recognition result.
Optionally, the method further includes:
and searching an identity information database according to the voice feature vector to obtain identity information matched with the recognition object.
An embodiment of the present invention further provides a device for recognizing emotion through voice, including:
the voice acquisition module is used for acquiring a voice signal of the recognition object;
the voice processing module is used for processing the voice signal to obtain a voice characteristic vector;
the first recognition module is used for inputting the voice feature vector into an emotion recognition model and recognizing to obtain a first emotion recognition result;
the second recognition module is used for searching an emotion word database according to the voice feature vector to obtain a second emotion recognition result;
and the recognition result module is used for obtaining a final emotion recognition result according to the first emotion recognition result and the second emotion recognition result.
Optionally, the voice feature vector includes a mood feature, a speech rate feature, an intonation feature, a pronunciation frequency feature, an accent feature, and a word-usage feature.
Optionally, the first recognition module is configured to input the mood feature, the speech rate feature, the intonation feature, and the pronunciation frequency feature into the emotion recognition model and recognize the first emotion recognition result.
Optionally, the second recognition module is configured to search the emotion word database by word usage according to the accent features to obtain the second emotion recognition result.
Optionally, the apparatus further comprises:
and the identity recognition module is used for searching an identity information database according to the voice feature vector to obtain identity information matched with the recognition object.
An embodiment of the present invention further provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for recognizing emotion through voice when executing the program.
As can be seen from the above, the method, device, and electronic equipment for recognizing emotion through voice provided by the present invention obtain the voice signal of the recognition object and process it into a voice feature vector. A first emotion recognition result is obtained by applying the emotion recognition model to the voice feature vector, a second emotion recognition result is obtained by searching the emotion word database according to the voice feature vector, and a final emotion recognition result is derived from the first and second emotion recognition results. The invention can thus recognize emotion through voice.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used to distinguish between two entities or parameters that share the same name. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention; the following embodiments do not repeat this point.
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention. As shown in the figure, the method for recognizing emotion through voice according to the embodiment of the present invention includes:
s10: acquiring a voice signal of a recognition object;
in some embodiments, a voice signal of the recognition object may be collected by a voice collecting apparatus.
In a school application scenario, a voice acquisition device can be placed at each student's desk, and during class the voice signal of the corresponding student is collected through that device. The voice signals collected by the voice acquisition devices are transmitted to a server, which obtains the voice signals and performs subsequent voice recognition and analysis on them.
S11: processing a voice signal to obtain a voice characteristic vector;
The voice signal is processed to obtain a voice feature vector, which includes voice features such as a mood feature, a speech rate feature, an intonation feature, a pronunciation frequency feature, an accent feature, and word usage. Voice signal processing methods include frequency-domain signal processing, time-domain signal processing, denoising, and voice enhancement; these belong to the prior art, and the specific processing flow is therefore not described in detail.
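As one illustration of this step, the sketch below derives a tiny feature vector from a raw mono signal with numpy. The feature set (short-time energy, zero-crossing rate, and an autocorrelation pitch estimate) is an assumption for illustration only; the patent does not fix the exact features or algorithms.

```python
import numpy as np

def extract_voice_features(signal, sample_rate):
    """Derive a simple three-element feature vector from a mono voice signal.

    Hypothetical illustration: energy relates to mood/accent strength,
    zero-crossing rate to speaking rate and noisiness, and the
    autocorrelation peak to the fundamental frequency (intonation).
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Mean energy: a rough loudness measure.
    energy = float(np.mean(signal ** 2))
    # Zero-crossing rate: fraction of adjacent samples that change sign.
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal))) > 0))
    # Fundamental frequency via the autocorrelation peak, searched over
    # plausible human pitch lags (50-400 Hz).
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sample_rate / 400), int(sample_rate / 50)
    peak_lag = lo + int(np.argmax(ac[lo:hi]))
    f0 = sample_rate / peak_lag
    return np.array([energy, zcr, f0])
```

A real system would also perform the denoising and enhancement steps mentioned above, and would typically compute MFCCs frame by frame rather than one global vector.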
S12: inputting the voice feature vector into an emotion recognition model, and recognizing to obtain a first emotion recognition result;
In some embodiments, the emotion recognition model is pre-established as follows: voice signals of a plurality of recognition objects are acquired and processed into multiple groups of voice feature vectors, and these groups of voice feature vectors are input as training samples into a classifier for classification training, yielding the emotion recognition model. MFCC features, obtained by processing the voice signals with the Mel-frequency cepstral coefficient method, can be used as the model's training samples.
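The patent does not name the classifier, so the sketch below stands in with a nearest-centroid classifier trained on labeled feature vectors; a real implementation would likely use a stronger model (an SVM or neural network) on MFCC features.

```python
import numpy as np

class NearestCentroidEmotionModel:
    """Minimal stand-in for the patent's unspecified classifier.

    Training stores the mean feature vector (centroid) per emotion;
    prediction returns the emotion of the nearest centroid.
    """

    def fit(self, feature_vectors, labels):
        X = np.asarray(feature_vectors, dtype=float)
        y = np.asarray(labels)
        self.classes_ = sorted(set(labels))
        self.centroids_ = np.array(
            [X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, feature_vector):
        q = np.asarray(feature_vector, dtype=float)
        distances = np.linalg.norm(self.centroids_ - q, axis=1)
        return self.classes_[int(np.argmin(distances))]
```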
Optionally, the emotion recognition model can recognize one of several emotions, such as happiness, sadness, anger, fear, surprise, or confusion, from the mood, speech rate, intonation, and pronunciation frequency features in the input voice feature vector. For example: if the mood is moderate, the speech rate slow, the intonation falling, and the pronunciation frequency low, the first emotion recognition result output by the model is sadness; if the mood is questioning and the intonation rising, the result is doubt; if the mood is angry, the speech rate fast, the intonation rising, and the pronunciation frequency high, the result is anger. The mood type, speech rate, intonation type, and pronunciation frequency can each be determined against preset thresholds.
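The threshold-based examples above can be sketched as explicit rules. The categorical labels ("moderate", "fast", "rising", and so on) are hypothetical and are assumed to come from comparing the raw features against the preset thresholds the text mentions.

```python
def classify_by_rules(mood, speech_rate, intonation, frequency):
    """Rule-based sketch of the example mappings in the description."""
    # Moderate mood, slow speech, falling intonation, low frequency -> sad.
    if (mood, speech_rate, intonation, frequency) == ("moderate", "slow", "falling", "low"):
        return "sad"
    # Questioning mood with rising intonation -> doubt.
    if mood == "questioning" and intonation == "rising":
        return "doubtful"
    # Angry mood, fast speech, rising intonation, high frequency -> anger.
    if (mood, speech_rate, intonation, frequency) == ("angry", "fast", "rising", "high"):
        return "angry"
    return "neutral"  # fallback when no rule fires (an assumption)
```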
S13: searching an emotion word database according to the voice feature vector to obtain a second emotion recognition result;
In some embodiments, an emotion word database is established in advance, containing accented words corresponding to various emotions; it is searched by word usage according to the accent features in the voice feature vector to obtain the second emotion recognition result. For example, finding words such as "great", "haha", or "wonderful" gives a second emotion recognition result of happiness; finding the word "what" gives doubt or surprise; finding uncivil phrases gives anger; and so on.
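A minimal sketch of the word-database lookup, assuming the database is a mapping from emotion to a set of accented words; the example words are English stand-ins for the patent's original Chinese examples.

```python
# Hypothetical emotion word database, keyed by emotion.
EMOTION_WORDS = {
    "happy": {"great", "haha", "wonderful"},
    "doubtful": {"what", "why", "really"},
    "angry": {"shut up"},  # stand-in for the uncivil phrases the text mentions
}

def lookup_emotion(accented_words):
    """Return every emotion whose word list matches an accented word."""
    found = set(accented_words)
    hits = [emotion for emotion, words in EMOTION_WORDS.items()
            if found & words]
    return hits or ["unmatched"]
```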
S14: and obtaining a final emotion recognition result according to the first emotion recognition result and the second emotion recognition result.
In some embodiments, the first emotion recognition result is obtained by the emotion recognition model from the mood, speech rate, intonation, and pronunciation frequency features; the second emotion recognition result is obtained from the accent features and word usage via the emotion word database; and the final emotion recognition result is derived from the first and second emotion recognition results together. For example, if the first and second emotion recognition results are both happiness, the final result is happiness; if the first result is doubt and the second is doubt or surprise, the final result is doubt; and if the first result is anger while the second finds no match, the final result is anger.
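The description gives examples but no general fusion rule. The sketch below encodes one plausible policy consistent with those examples: agreement wins, an empty lookup falls back to the model, and how outright disagreement is resolved is an assumption.

```python
def fuse_results(first_result, second_candidates):
    """Combine the model's result with the word-lookup candidates.

    Policy (an assumption beyond the patent's examples):
    - the word lookup found nothing -> keep the model's result;
    - the model's result is among the candidates -> keep it;
    - otherwise an unambiguous lookup overrides the model.
    """
    candidates = set(second_candidates) - {"unmatched"}
    if not candidates or first_result in candidates:
        return first_result
    if len(candidates) == 1:
        return next(iter(candidates))
    return first_result
```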
In some embodiments, the system further comprises an identity information database that stores the voice feature vectors of recognition objects. Voice signals of each recognition object are collected in advance and processed into voice feature vectors, and the identity information of each recognition object is stored in the identity information database together with the corresponding voice feature vector. An acquired voice signal is processed into a voice feature vector to be matched, and the identity information database is searched with this vector; if a search result is obtained, it is taken as the matched identity information. In other words, the embodiment of the invention can identify the identity of the recognition object from its voice signal.
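The patent only says the identity database is "searched" with the feature vector. The sketch below assumes a nearest-neighbor search by cosine similarity with an acceptance threshold; both the metric and the threshold value are illustrative assumptions.

```python
import numpy as np

def match_identity(query_vector, identity_db, threshold=0.9):
    """Return the enrolled identity whose stored voice feature vector
    is most similar to the query, or None if nothing clears the threshold.

    `identity_db` is assumed to map identity info (e.g. a name) to its
    enrolled feature vector.
    """
    q = np.asarray(query_vector, dtype=float)
    q = q / np.linalg.norm(q)
    best_name, best_sim = None, -1.0
    for name, vec in identity_db.items():
        v = np.asarray(vec, dtype=float)
        sim = float(q @ (v / np.linalg.norm(v)))  # cosine similarity
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None
```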
In a school application scenario, the voice signals of each student, collected by the voice acquisition equipment on each desk, are sent to the server. The server processes each of the obtained channels of voice signals to obtain the corresponding voice feature vector. The identity information database is searched with each group of voice feature vectors to obtain the matched identity information, i.e., the student's identity (name, gender, class, and so on) is recognized from the voice feature vector. For each group of voice feature vectors, the emotion recognition model yields the corresponding first emotion recognition result and the emotion word database yields the second; the final emotion recognition result for each group is then derived from the two and combined with the recognized identity information to obtain each student's emotional state.
Fig. 2 is a schematic structural diagram of an apparatus according to an embodiment of the present invention. As shown in the drawings, an apparatus for recognizing emotion through voice according to an embodiment of the present invention includes:
the voice acquisition module is used for acquiring a voice signal of the recognition object;
the voice processing module is used for processing the voice signals to obtain voice characteristic vectors;
the first recognition module is used for inputting the voice feature vector into the emotion recognition model and recognizing to obtain a first emotion recognition result;
the second recognition module is used for searching the emotion word database according to the voice feature vector to obtain a second emotion recognition result;
and the recognition result module is used for obtaining a final emotion recognition result according to the first emotion recognition result and the second emotion recognition result.
In some embodiments, a voice signal of the recognition object may be collected by a voice collecting apparatus.
In a school application scenario, a voice acquisition device can be placed at each student's desk, and during class the voice signal of the corresponding student is collected through that device. The voice signals collected by the voice acquisition devices are transmitted to the server, and the voice acquisition module of the server obtains the voice signals and performs subsequent voice recognition and analysis on them.
In some embodiments, the voice processing module processes the voice signal to obtain a voice feature vector, which includes voice features such as a mood feature, a speech rate feature, an intonation feature, a pronunciation frequency feature, an accent feature, and word usage. Voice signal processing methods include frequency-domain signal processing, time-domain signal processing, denoising, and voice enhancement; these belong to the prior art, and the specific processing flow is therefore not described in detail.
In some embodiments, the emotion recognition model is pre-established by acquiring voice signals of a plurality of recognition objects, processing the voice signals to obtain a plurality of groups of voice feature vectors, and performing classification training by using the plurality of groups of voice feature vectors as training samples to obtain the emotion recognition model.
Using the emotion recognition model, the first recognition module can recognize one of several emotions, such as happiness, sadness, anger, fear, surprise, or confusion, from the mood, speech rate, intonation, and pronunciation frequency features in the input voice feature vector. For example: if the mood is moderate, the speech rate slow, the intonation falling, and the pronunciation frequency low, the first emotion recognition result output by the model is sadness; if the mood is questioning and the intonation rising, the result is doubt; if the mood is angry, the speech rate fast, the intonation rising, and the pronunciation frequency high, the result is anger.
In some embodiments, an emotion word database is established in advance, containing accented words corresponding to various emotions, and the second recognition module searches it by word usage according to the accent features in the voice feature vector to obtain the second emotion recognition result. For example, finding words such as "great", "haha", or "wonderful" gives a second emotion recognition result of happiness; finding the word "what" gives doubt or surprise; finding uncivil phrases gives anger; and so on.
In some embodiments, the first emotion recognition result is obtained by the emotion recognition model from the mood, speech rate, intonation, and pronunciation frequency features; the second emotion recognition result is obtained from the accent features and word usage via the emotion word database; and the recognition result module derives the final emotion recognition result from the first and second emotion recognition results together. For example, if both results are happiness, the final result is happiness; if the first result is doubt and the second is doubt or surprise, the final result is doubt; and if the first result is anger while the second finds no match, the final result is anger.
The device for recognizing emotion through voice of the embodiment of the present invention further includes:
and the identity recognition module is used for searching the identity information database according to the voice feature vector to obtain the identity information matched with the recognition object.
In some embodiments, the identity recognition module searches the identity information database according to the voice feature vector, and obtains the identity information of the recognition object according to the search result.
The identity information database stores the voice feature vectors of recognition objects. Voice signals of each recognition object are collected in advance and processed into voice feature vectors, and the identity information of each recognition object is stored in the identity information database together with the corresponding voice feature vector. An acquired voice signal is processed into a voice feature vector to be matched, and the identity information database is searched with this vector; if a search result is obtained, it is taken as the matched identity information. In other words, the embodiment of the invention can identify the identity of the recognition object from its voice signal.
In a school application scenario, the voice signals of each student, collected by the voice acquisition equipment on each desk, are sent to the server. The server processes each of the obtained channels of voice signals to obtain the corresponding voice feature vector. The identity information database is searched with each group of voice feature vectors to obtain the matched identity information, i.e., the student's identity (name, gender, class, and so on) is recognized from the voice feature vector. For each group of voice feature vectors, the emotion recognition model yields the corresponding first emotion recognition result and the emotion word database yields the second; the final emotion recognition result for each group is then derived from the two and combined with the recognized identity information to obtain each student's emotional state.
In view of the above object, the embodiment of the present invention further provides an embodiment of an apparatus for performing the method for recognizing emotion through voice. The device comprises:
one or more processors, and a memory.
The apparatus performing the method of recognizing emotion by voice may further include: an input device and an output device.
The processor, memory, input device, and output device may be connected by a bus or other means.
The memory, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the method of recognizing emotion through voice in the embodiments of the present invention. The processor executes various functional applications of the server and data processing by running the nonvolatile software programs, instructions and modules stored in the memory, that is, implements the method of recognizing emotion by voice of the above-described method embodiments.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to use of an apparatus performing the method of recognizing emotion through voice, and the like. Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the apparatus performing the method of recognizing emotion through voice via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device may receive input numeric or character information and generate key signal inputs related to user settings and function control of the device performing the method of recognizing emotion by voice. The output device may include a display device such as a display screen.
The one or more modules are stored in the memory and, when executed by the one or more processors, perform a method of recognizing emotion through voice in any of the method embodiments described above. The technical effect of the embodiment of the device for executing the method for recognizing emotion through voice is the same as or similar to that of any method embodiment.
An embodiment of the present invention further provides a non-transitory computer storage medium storing computer-executable instructions that can execute the method of recognizing emotion through voice in any of the above method embodiments. Embodiments of the non-transitory computer storage medium have the same or similar technical effect as any of the method embodiments described above.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by a computer program that can be stored in a computer-readable storage medium and that, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The technical effect of the embodiment of the computer program is the same as or similar to that of any of the method embodiments described above.
Furthermore, the apparatuses, devices, etc. described in the present disclosure may be various electronic terminal devices, such as a mobile phone, a Personal Digital Assistant (PDA), a tablet computer (PAD), a smart television, etc., and may also be large terminal devices, such as a server, etc., and therefore the scope of protection of the present disclosure should not be limited to a specific type of apparatus, device. The client disclosed by the present disclosure may be applied to any one of the above electronic terminal devices in the form of electronic hardware, computer software, or a combination of both.
Furthermore, the method according to the present disclosure may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method of the present disclosure.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the invention, features of the above embodiments or of different embodiments may be combined, steps may be implemented in any order, and many other variations of the different aspects of the invention exist as described above, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the discussed embodiments may be used with other memory architectures (e.g., dynamic RAM (DRAM)).
The embodiments of the invention are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (11)
1. A method for recognizing emotion through voice, comprising:
acquiring a voice signal of a recognition object;
processing the voice signal to obtain a voice characteristic vector;
inputting the voice feature vector into an emotion recognition model, and recognizing to obtain a first emotion recognition result;
searching an emotion word database according to the voice feature vector to obtain a second emotion recognition result;
and obtaining a final emotion recognition result according to the first emotion recognition result and the second emotion recognition result.
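The five steps of claim 1 can be sketched in code. The following is a minimal illustrative sketch only, not the patented implementation: the feature extraction, the stand-in model, the toy emotion word database `EMOTION_LEXICON`, and the confidence-based fusion rule are all assumptions, since the claim leaves these details open.

```python
import numpy as np

# Toy "emotion word database" (illustrative assumption).
EMOTION_LEXICON = {"great": "happy", "terrible": "angry", "fine": "neutral"}

def extract_features(signal):
    # Step 2: process the voice signal into a feature vector.
    # Here just toy statistics; a real system would extract mood,
    # speech rate, intonation, pronunciation frequency, and accent features.
    vec = np.array([signal.mean(), signal.std()])
    accented_words = ["great"]  # pretend these came from ASR + accent detection
    return vec, accented_words

def model_recognize(vec):
    # Step 3: stand-in for the trained emotion recognition model.
    return ("happy", 0.7) if vec[1] > 0.5 else ("neutral", 0.6)

def lexicon_recognize(words):
    # Step 4: look the accented words up in the emotion word database.
    hits = [EMOTION_LEXICON[w] for w in words if w in EMOTION_LEXICON]
    return (hits[0], 0.9) if hits else ("neutral", 0.1)

def recognize_emotion(signal):
    vec, words = extract_features(signal)   # steps 1-2
    first = model_recognize(vec)            # step 3: first recognition result
    second = lexicon_recognize(words)       # step 4: second recognition result
    # Step 5: fuse the two results; the claim does not fix the rule,
    # so the higher-confidence result is chosen here as one possibility.
    return max([first, second], key=lambda r: r[1])[0]
```

A system like this would combine an acoustic-model prediction with a lexical lookup, so that either cue can dominate when it is more confident.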
2. The method of claim 1, wherein the voice feature vector comprises mood features, speech rate features, intonation features, pronunciation frequency features, accent features, and vocabulary features.
3. The method according to claim 2, wherein the mood features, the speech rate features, the intonation features, and the pronunciation frequency features are input into the emotion recognition model, and the first emotion recognition result is obtained through recognition.
4. The method according to claim 2, wherein the emotion word database is searched word by word according to the accent features to obtain the second emotion recognition result.
5. The method of claim 1, further comprising:
and searching an identity information database according to the voice feature vector to obtain identity information matched with the recognition object.
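The identity lookup of claim 5 amounts to matching the voice feature vector against enrolled vectors. A minimal sketch, assuming a cosine-similarity match with a fixed threshold; the database contents, the similarity measure, and the threshold value are illustrative assumptions, not details given in the claim:

```python
import numpy as np

# Toy "identity information database": enrolled voice feature vectors.
# Names and vectors are illustrative assumptions.
IDENTITY_DB = {
    "alice": np.array([1.0, 0.0, 0.2]),
    "bob":   np.array([0.1, 0.9, 0.3]),
}

def match_identity(query, threshold=0.8):
    """Return the enrolled identity whose stored vector is most similar
    to the query feature vector, or None if no match clears the threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    best_name, best_vec = max(IDENTITY_DB.items(),
                              key=lambda kv: cos(query, kv[1]))
    return best_name if cos(query, best_vec) >= threshold else None
```

Returning None for low-similarity queries keeps the module from forcing a match when the speaker is not enrolled.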
6. An apparatus for recognizing emotion by voice, comprising:
the voice acquisition module is used for acquiring a voice signal of the recognition object;
the voice processing module is used for processing the voice signal to obtain a voice characteristic vector;
the first recognition module is used for inputting the voice feature vector into an emotion recognition model and recognizing to obtain a first emotion recognition result;
the second recognition module is used for searching an emotion word database according to the voice feature vector to obtain a second emotion recognition result;
and the recognition result module is used for obtaining a final emotion recognition result according to the first emotion recognition result and the second emotion recognition result.
7. The apparatus of claim 6, wherein the voice feature vector comprises mood features, speech rate features, intonation features, pronunciation frequency features, accent features, and vocabulary features.
8. The apparatus of claim 7,
and the first recognition module is used for inputting the mood features, the speech rate features, the intonation features, and the pronunciation frequency features into the emotion recognition model and recognizing to obtain the first emotion recognition result.
9. The apparatus of claim 7,
and the second recognition module is used for searching the emotion word database word by word according to the accent features to obtain the second emotion recognition result.
10. The apparatus of claim 6, further comprising:
and the identity recognition module is used for searching an identity information database according to the voice feature vector to obtain identity information matched with the recognition object.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910569691.XA CN111354377B (en) | 2019-06-27 | 2019-06-27 | Method and device for recognizing emotion through voice and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910569691.XA CN111354377B (en) | 2019-06-27 | 2019-06-27 | Method and device for recognizing emotion through voice and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111354377A true CN111354377A (en) | 2020-06-30 |
CN111354377B CN111354377B (en) | 2022-11-18 |
Family
ID=71198109
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910569691.XA Active CN111354377B (en) | 2019-06-27 | 2019-06-27 | Method and device for recognizing emotion through voice and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111354377B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112002348A (en) * | 2020-09-07 | 2020-11-27 | 复旦大学 | Method and system for recognizing speech anger emotion of patient |
CN113241096A (en) * | 2021-07-09 | 2021-08-10 | 明品云(北京)数据科技有限公司 | Emotion monitoring device and method |
CN117935865A (en) * | 2024-03-22 | 2024-04-26 | 江苏斑马软件技术有限公司 | User emotion analysis method and system for personalized marketing |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106650633A (en) * | 2016-11-29 | 2017-05-10 | 上海智臻智能网络科技股份有限公司 | Driver emotion recognition method and device |
CN107066514A (en) * | 2017-01-23 | 2017-08-18 | 深圳亲友科技有限公司 | The Emotion identification method and system of the elderly |
CN107818786A (en) * | 2017-10-25 | 2018-03-20 | 维沃移动通信有限公司 | A kind of call voice processing method, mobile terminal |
CN108764010A (en) * | 2018-03-23 | 2018-11-06 | 姜涵予 | Emotional state determines method and device |
CN109033257A (en) * | 2018-07-06 | 2018-12-18 | 中国平安人寿保险股份有限公司 | Talk about art recommended method, device, computer equipment and storage medium |
CN109036405A (en) * | 2018-07-27 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Voice interactive method, device, equipment and storage medium |
CN109087670A (en) * | 2018-08-30 | 2018-12-25 | 西安闻泰电子科技有限公司 | Mood analysis method, system, server and storage medium |
CN109254669A (en) * | 2017-07-12 | 2019-01-22 | 腾讯科技(深圳)有限公司 | A kind of expression picture input method, device, electronic equipment and system |
CN109410986A (en) * | 2018-11-21 | 2019-03-01 | 咪咕数字传媒有限公司 | A kind of Emotion identification method, apparatus and storage medium |
CN109767765A (en) * | 2019-01-17 | 2019-05-17 | 平安科技(深圳)有限公司 | Talk about art matching process and device, storage medium, computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111354377B (en) | 2022-11-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109741732B (en) | Named entity recognition method, named entity recognition device, equipment and medium | |
CN109461437B (en) | Verification content generation method and related device for lip language identification | |
CN110634472B (en) | Speech recognition method, server and computer readable storage medium | |
CN112530408A (en) | Method, apparatus, electronic device, and medium for recognizing speech | |
CN112786007A (en) | Speech synthesis method, device, readable medium and electronic equipment | |
CN109686383A (en) | A kind of speech analysis method, device and storage medium | |
CN111259148A (en) | Information processing method, device and storage medium | |
CN111028845A (en) | Multi-audio recognition method, device, equipment and readable storage medium | |
CN110544470B (en) | Voice recognition method and device, readable storage medium and electronic equipment | |
CN111354377B (en) | Method and device for recognizing emotion through voice and electronic equipment | |
CN112183107A (en) | Audio processing method and device | |
CN110890088A (en) | Voice information feedback method and device, computer equipment and storage medium | |
US11580994B2 (en) | Speech recognition | |
CN110826637A (en) | Emotion recognition method, system and computer-readable storage medium | |
CN111858876A (en) | Knowledge base generation method and text search method and device | |
CN109947971A (en) | Image search method, device, electronic equipment and storage medium | |
CN111179910A (en) | Speed of speech recognition method and apparatus, server, computer readable storage medium | |
KR20210071713A (en) | Speech Skill Feedback System | |
CN111339809A (en) | Classroom behavior analysis method and device and electronic equipment | |
CN107910005B (en) | Target service positioning method and device for interactive text | |
JP2015175859A (en) | Pattern recognition device, pattern recognition method, and pattern recognition program | |
CN111522937B (en) | Speaking recommendation method and device and electronic equipment | |
CN110544472B (en) | Method for improving performance of voice task using CNN network structure | |
Vasquez-Correa et al. | Wavelet-based time-frequency representations for automatic recognition of emotions from speech | |
CN114913859B (en) | Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||