CN111370030A - Voice emotion detection method and device, storage medium and electronic equipment - Google Patents

Voice emotion detection method and device, storage medium and electronic equipment

Info

Publication number
CN111370030A
Authority
CN
China
Prior art keywords
speech
sample
emotion
voice
tone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010261354.7A
Other languages
Chinese (zh)
Inventor
聂镭
王竹欣
聂颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Original Assignee
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Longma Zhixin Zhuhai Hengqin Technology Co ltd filed Critical Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority to CN202010261354.7A priority Critical patent/CN111370030A/en
Publication of CN111370030A publication Critical patent/CN111370030A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure belongs to the technical field of artificial intelligence, and relates to a voice emotion detection method and device, a storage medium and electronic equipment. The method comprises the following steps: acquiring initial voice data, and performing voice recognition processing on the initial voice data to obtain a voice recognition text; determining the voice duration of the initial voice data and determining the number of characters of the voice recognition text; determining a speech rate level according to the number of characters and the voice duration, and acquiring an intonation level and a tone level of the initial voice data; and inputting the speech rate level, the intonation level and the tone level into a machine learning model trained in advance to obtain the speech emotion category. On the one hand, the method avoids the subjectivity of manual voice emotion labeling, makes the labeling of voice emotion categories more objective, reduces labeling errors, and improves the accuracy of emotion category labeling; on the other hand, slow online voice recognition is avoided, which improves the speed of voice recognition and the efficiency of voice recognition and emotion category labeling.

Description

Voice emotion detection method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a speech emotion detection method, a speech emotion detection apparatus, a computer-readable storage medium, and an electronic device.
Background
With the improvement of living standards and the rapid development of science and technology, people hope to obtain information and services through human-computer interaction in a more natural way. In both speech recognition and speech synthesis scenarios, people have placed higher requirements on the accuracy of speech recognition.
At present, text emotion recognition is generally performed on the recognized text of voice audio, and voice emotion recognition can also be performed on the voiceprint features of the voice audio, so as to determine voice emotion information and text emotion information. However, because the voice and the text are simply labeled in combination, the accuracy of voice recognition is reduced.
In view of the above, there is a need in the art to develop a new method and apparatus for detecting speech emotion.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a speech emotion detection method, a speech emotion detection apparatus, a computer-readable storage medium, and an electronic device, so as to overcome, at least to some extent, the problem of low emotion detection accuracy caused by the limitations of the related art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the embodiments of the present invention, there is provided a speech emotion detection method, including: acquiring initial voice data, and performing voice recognition processing on the initial voice data to obtain a voice recognition text; determining the voice duration of the initial voice data and determining the number of characters of the voice recognition text; determining a speech rate level according to the number of characters and the voice duration, and acquiring an intonation level and a tone level of the initial voice data; and inputting the speech rate level, the intonation level and the tone level into a machine learning model trained in advance to obtain the speech emotion category.
In an exemplary embodiment of the invention, before the inputting of the speech rate level, the intonation level and the tone level into the pre-trained machine learning model, the method further includes: acquiring a speech rate sample, an intonation sample and a tone sample for training the machine learning model, and emotion category samples corresponding to the speech rate sample, the intonation sample and/or the tone sample; inputting the speech rate sample, the intonation sample and/or the tone sample into the machine learning model to be trained, so that the machine learning model to be trained outputs a speech emotion category corresponding to the speech rate sample, the intonation sample and/or the tone sample; and if the speech emotion category does not match the emotion category sample, adjusting the parameters of the machine learning model to be trained so that the speech emotion category becomes the same as the emotion category sample.
In an exemplary embodiment of the present invention, the acquiring of the speech rate sample, the intonation sample and the tone sample for training the machine learning model and the emotion category samples corresponding to the speech rate sample, the intonation sample and/or the tone sample includes: acquiring a speech rate sample, an intonation sample and a tone sample for training the machine learning model, and acquiring an emotion mapping table corresponding to the speech rate sample, the intonation sample and the tone sample; and determining an emotion category sample according to the speech rate sample, the intonation sample and/or the tone sample based on the emotion mapping table.
In an exemplary embodiment of the present invention, the acquiring initial voice data and performing voice recognition processing on the initial voice data to obtain a voice recognition text includes: acquiring initial voice data, and removing mute data in the initial voice data by using a voice endpoint detection algorithm to obtain voiced voice data; and carrying out voice recognition processing on the voiced voice data to obtain a voice recognition text.
In an exemplary embodiment of the present invention, the performing a speech recognition process on the voiced speech data to obtain a speech recognition text includes: if the initial voice data comprises voice data of at least two different speakers, separating the voiced voice data by using a speaker change detection algorithm to obtain separated voice data, and performing voice recognition processing on the separated voice data to obtain a voice recognition text.
In an exemplary embodiment of the present invention, the performing a speech recognition process on the separated speech data to obtain a speech recognition text includes: if the separated voice data is longer than the preset duration, cutting the separated voice data to obtain cut voice data; and performing voice recognition processing on the cut voice data to obtain a voice recognition text.
In an exemplary embodiment of the invention, after the speech recognition text is obtained, the method further includes: matching the speech recognition text with a corresponding standard text, and determining the regular matching words of the speech recognition text; and correcting the regular matching words using a regular algorithm to optimize the speech recognition processing.
According to a second aspect of the embodiments of the present invention, there is provided a speech emotion detection apparatus, including: a voice recognition module configured to acquire initial voice data and perform voice recognition processing on the initial voice data to obtain a voice recognition text; a parameter acquisition module configured to determine the voice duration of the initial voice data and determine the number of characters of the voice recognition text; a level determining module configured to determine a speech rate level according to the number of characters and the voice duration, and to acquire an intonation level and a tone level of the initial voice data; and an emotion detection module configured to input the speech rate level, the intonation level and the tone level into a machine learning model trained in advance to obtain a speech emotion category.
According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus including: a processor and a memory; wherein the memory has stored thereon computer readable instructions, which when executed by the processor, implement the speech emotion detection method of any of the above exemplary embodiments.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the speech emotion detection method in any of the above-mentioned exemplary embodiments.
According to the technical solution, the speech emotion detection method, the speech emotion detection apparatus, the computer storage medium and the electronic device in the exemplary embodiment of the present invention have at least the following advantages and positive effects:
in the method and apparatus provided by the exemplary embodiment of the present disclosure, by inputting the speech rate level of the speech recognition text and the intonation level and mood level of the initial speech data into the machine learning model, the corresponding speech emotion category may be obtained. On one hand, the subjectivity of manual voice emotion labeling is avoided, the voice emotion type labeling is more objective, the number of errors of emotion type labeling is reduced, and the emotion type labeling accuracy is improved; on the other hand, online voice recognition with slow recognition is avoided, the speed of voice recognition is improved, and the efficiency of voice recognition and emotion category marking is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 schematically shows a flow chart of a method for emotion detection of speech in an exemplary embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a method of deriving speech recognition text in an exemplary embodiment of the disclosure;
FIG. 3 schematically illustrates a flow chart of a method of separating voiced speech data in an exemplary embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart of a method of cutting separated speech data in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of optimizing speech recognition processing in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow diagram of a method of training a machine learning model in an exemplary embodiment of the disclosure;
FIG. 7 schematically illustrates a flow chart of a method of determining emotion classification samples in an exemplary embodiment of the disclosure;
FIG. 8 is a schematic structural diagram of a speech emotion detection apparatus in an exemplary embodiment of the present disclosure;
FIG. 9 schematically illustrates an electronic device for implementing a speech emotion detection method in an exemplary embodiment of the present disclosure;
FIG. 10 schematically shows a computer-readable storage medium for implementing a speech emotion detection method in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/parts/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second", etc. are used merely as labels, and are not limiting on the number of their objects.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
Aiming at the problems in the related art, the disclosure provides a voice emotion detection method. FIG. 1 shows a flow chart of a speech emotion detection method, as shown in FIG. 1, the speech emotion detection method at least includes the following steps:
and step S110, acquiring initial voice data, and performing voice recognition processing on the initial voice data to obtain a voice recognition text.
And S120, determining the voice time length of the initial voice data and determining the number of characters of the voice recognition text.
And S130, determining a speech rate level according to the number of characters and the voice duration, and acquiring an intonation level and a tone level of the initial voice data.
And S140, inputting the speech rate level, the intonation level and the tone level into a machine learning model trained in advance to obtain the speech emotion category.
In an exemplary embodiment of the present disclosure, by inputting the speech rate level of the speech recognition text and the intonation level and mood level of the initial speech data into the machine learning model, the corresponding speech emotion classification can be obtained. On one hand, the subjectivity of manual voice emotion labeling is avoided, the voice emotion type labeling is more objective, the number of errors of emotion type labeling is reduced, and the emotion type labeling accuracy is improved; on the other hand, online voice recognition with slow recognition is avoided, the speed of voice recognition is improved, and the efficiency of voice recognition and emotion category marking is improved.
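As a concrete illustration of the four steps above, the following Python sketch strings them together. It is a minimal sketch rather than the patent's implementation: it assumes the voice recognition text, the voice duration, the annotated intonation level (1-5) and tone level (1-3) are already available, and that model is any pre-trained classifier with a scikit-learn-style predict method; the speech rate boundaries are the ones given later in the description.

def detect_speech_emotion(recognized_text, duration_seconds, intonation_level, tone_level, model):
    char_count = len(recognized_text.replace(" ", ""))            # step S120: number of characters
    chars_per_second = char_count / duration_seconds              # step S130: speech rate
    rate_level = 5                                                # assumed clamp for very fast speech
    for level, upper in enumerate((3, 5, 8, 11, 13), start=1):    # level boundaries from the description
        if chars_per_second <= upper:
            rate_level = level
            break
    features = [[rate_level, intonation_level, tone_level]]
    return model.predict(features)[0]                             # step S140: speech emotion category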
The following describes each step of the speech emotion detection method in detail.
In step S110, initial voice data is acquired, and voice recognition processing is performed on the initial voice data to obtain a voice recognition text.
In an exemplary embodiment of the present disclosure, the initial voice data may be call voice data acquired by a recording device, or may be other voice data stored in other storage media, which is not limited in this exemplary embodiment. After the initial speech data is obtained, the corresponding speech recognition text may be recognized.
In an alternative embodiment, fig. 2 shows a flow diagram of a method of obtaining a speech recognition text. As shown in fig. 2, the method comprises at least the following steps: in step S210, initial voice data is obtained, and a voice endpoint detection algorithm is used to remove mute data in the initial voice data, so as to obtain voiced voice data. The voice endpoint detection algorithm (also known as voice activity detection, VAD) is a silence detection technique. Silence detection refers to detecting the silent (non-speech) portions, i.e., the mute data, in call voice data and other initial voice data. Specifically, the voice endpoint detection algorithm may be implemented using frame amplitude, frame energy, the short-time zero-crossing rate, a deep neural network, and the like. By removing the mute data from the initial voice data, the voice segments produced by the speaker or other sounding body are retained, which eliminates the interference of the silent portions in subsequent processing and effectively improves the efficiency and accuracy of that processing. The initial voice data with the mute data removed is the voiced voice data used in subsequent recognition.
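A simplified sketch of the frame-energy variant of voice endpoint detection mentioned above is shown below. It assumes samples is a 1-D NumPy float array normalized to [-1, 1]; the frame length, hop and energy threshold are illustrative assumptions, not values taken from the patent.

import numpy as np

def remove_silence(samples, sample_rate, frame_ms=25, hop_ms=10, energy_threshold=1e-4):
    frame = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    voiced_chunks = []
    for start in range(0, len(samples) - frame + 1, hop):
        window = samples[start:start + frame]
        if np.mean(window ** 2) > energy_threshold:        # frame energy above threshold => voiced
            voiced_chunks.append(samples[start:start + hop])
    if not voiced_chunks:
        return np.array([], dtype=samples.dtype)
    return np.concatenate(voiced_chunks)                   # voiced voice data with silence removed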
In step S220, voice recognition processing is performed on the voiced speech data to obtain a voice recognition text. When the initial voice data is the call voice data, the call voice data of at least two different speakers can be included, so that the voice separation of different speakers can be performed on the voiced voice data.
In an alternative embodiment, fig. 3 shows a flow diagram of a method of separating voiced voice data. As shown in fig. 3, the method comprises at least the following steps: in step S310, if the initial voice data includes voice data of at least two different speakers, the voiced voice data is separated using a speaker change detection algorithm to obtain separated voice data. The speaker change detection (SCD) algorithm detects speaker change points in a given audio stream. Speaker change detection is a preprocessing step for speaker identification, speaker verification, speaker tracking, automatic labeling, information extraction, topic detection, speech summarization and speech retrieval, and is widely applied in these fields. Generally, it may be implemented with a Bayesian information criterion algorithm or with other algorithms, which is not limited in this exemplary embodiment.
For example, when the initial voice data includes the voice data of two speakers and the voiced voice data is separated by the speaker change detection algorithm, the separated voice data may be: speaker A for 0-10 seconds, speaker B for 11-15 seconds, and speaker A again for 16-20 seconds. That is, the speaker change detection algorithm segments the complete voiced voice data into separated voice data, one segment per speaker turn.
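The sketch below turns detected speaker change points into separated voice data, matching the 0-10 s / 11-15 s / 16-20 s example above. The change points themselves are assumed to come from an SCD algorithm such as the Bayesian information criterion; detecting them is not shown here.

def split_at_change_points(samples, sample_rate, change_points_sec):
    boundaries = [0.0] + sorted(change_points_sec) + [len(samples) / sample_rate]
    segments = []
    for start_sec, end_sec in zip(boundaries[:-1], boundaries[1:]):
        start, end = int(start_sec * sample_rate), int(end_sec * sample_rate)
        segments.append(samples[start:end])                # one segment per speaker turn
    return segments

# For the example above: change points at about 10 s and 15 s produce three
# turns, attributed alternately to speaker A and speaker B.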
In step S320, the separated voice data is subjected to voice recognition processing to obtain a voice recognition text. Because audio that is too long reduces the efficiency of data labeling and voice recognition processing, the separated voice data can be further cut according to a preset duration after speaker change detection is performed on the voiced voice data.
In an alternative embodiment, fig. 4 shows a flow chart of a method for cutting the separated voice data. As shown in fig. 4, the method comprises at least the following steps: in step S410, if the separated voice data is longer than a preset duration, the separated voice data is cut to obtain cut voice data. The preset duration may be a manually set duration used to cut the separated voice data. Preferably, the preset duration is 15 seconds. Other durations may also be set according to the actual situation, which is not particularly limited in this exemplary embodiment.
For example, when the separated voice data is 33 seconds long and the preset duration is 15 seconds, it may be determined that the separated voice data is to be cut with a cutting interval of 15 seconds, that is, the separated voice data is cut every 15 seconds. The 33 seconds of separated voice data can therefore be cut into three pieces of cut voice data of 15 seconds, 15 seconds and 3 seconds.
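A sketch of this preset-duration cutting rule is given below; with the preferred preset duration of 15 seconds, a 33-second separated segment becomes chunks of 15 s, 15 s and 3 s.

def cut_segment(samples, sample_rate, preset_seconds=15):
    step = int(preset_seconds * sample_rate)
    return [samples[i:i + step] for i in range(0, len(samples), step)]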
In step S420, voice recognition processing is performed on the cut voice data to obtain a voice recognition text. Specifically, the terminal device may use a built-in voice recognition model, or a third-party commercial voice recognition module, to recognize the cut voice data. The voice recognition algorithm may be a linguistics-and-acoustics-based method, a stochastic model method, a neural network method, a probabilistic grammar analysis method, and the like, which is not particularly limited in this exemplary embodiment. After the voice recognition processing, the voice recognition text corresponding to the cut voice data is obtained.
In the exemplary embodiment, the voice recognition processing is performed on the separated voice data, so that a corresponding voice recognition text can be obtained, the interaction between the human and the machine is facilitated, and the emotion detection accuracy is improved.
The voice recognition samples can be used for subsequent emotion detection, and can also help the model or algorithm used in the voice recognition processing produce more accurate predictions.
In an alternative embodiment, fig. 5 shows a flow diagram of a method of optimizing the voice recognition processing. As shown in fig. 5, the method comprises at least the following steps: in step S510, the voice recognition text is matched with a corresponding standard text, and the regular matching words of the voice recognition text are determined. The standard text may be a manually labeled text or another text with higher accuracy, which is not particularly limited in this exemplary embodiment. For example, the voice recognition text of a customer service call typically begins with "hello, this is ……". If the voice recognition text does not match the standard text, the voice recognition text is incorrect, so "hello, this is" may be determined as the regular matching words; alternatively, the first few words may be set as the regular matching words according to the actual situation, which is not particularly limited in this exemplary embodiment.
In step S520, the regular matching words are corrected using the regular algorithm to optimize the voice recognition processing. When the voice recognition text does not match the standard text, it indicates that the accuracy of the voice recognition algorithm or model still needs to be improved. Therefore, the regular algorithm can be used to move the cut points of the regular matching words forward for correction, so as to obtain a corrected voice recognition text. Further, the corrected voice recognition text is fed back to the voice recognition model as a training sample for reinforcement, so as to optimize the voice recognition processing.
In the present exemplary embodiment, the speech recognition processing model is optimized by the corrected speech recognition text, improving the accuracy of the speech recognition processing result.
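The following sketch illustrates one possible form of the matching-and-correction step described above. The standard prefix and the greeting pattern are invented for illustration; the patent does not give concrete regular expressions, and this sketch only conveys the idea of detecting a mismatch against the standard text and moving the cut point of the regular matching words forward.

import re

STANDARD_PREFIX = "hello, this is"                                        # assumed standard-text prefix
GREETING_PATTERN = re.compile(r"^\s*\w+[,，]?\s*this is", re.IGNORECASE)  # assumed matching rule

def correct_recognition(recognized_text):
    if recognized_text.lower().startswith(STANDARD_PREFIX):
        return recognized_text, False                     # matches the standard text, no correction needed
    # Move the cut point of the regular matching words forward by replacing
    # the mis-recognized greeting with the standard prefix.
    corrected = GREETING_PATTERN.sub(STANDARD_PREFIX, recognized_text, count=1)
    return corrected, True                                # True: feed back as a training sample

A text flagged True would then be returned to the recognition model as an extra training sample, as described in step S520.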
In step S120, the voice duration of the initial voice data is determined, and the number of characters of the voice recognition text is determined.
In an exemplary embodiment of the present disclosure, when the initial voice data is acquired, the voice duration of the initial voice data may be correspondingly determined. Further, the number of characters of the voice recognition text can be obtained through statistics according to the voice recognition text obtained through the voice recognition processing.
In step S130, a speech rate level is determined according to the number of characters and the voice duration, and an intonation level and a tone level of the initial voice data are obtained.
In an exemplary embodiment of the present disclosure, the speech rate level may be calculated from the number of characters and the voice duration. Determining the speech rate level depends on how many characters per second each level covers. Since a normal person generally speaks 5-8 characters per second, speech rate levels 1-5 can be set accordingly, where level 1 corresponds to (2, 3) characters per second, level 2 to (3, 5), level 3 to (5, 8), level 4 to (8, 11), and level 5 to (11, 13).
For example, when the number of characters is 30 and the voice duration is 15 seconds, the speech rate is 30/15 = 2 characters per second, so the speech rate level may be determined to be level 1.
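The mapping can be written as a small helper that repeats the interval boundaries above. The boundaries are treated as inclusive upper bounds so that the 30-character / 15-second example lands in level 1, and rates above 13 characters per second are clamped to level 5; both choices are assumptions, since the patent does not specify them.

def speech_rate_level(char_count, duration_seconds):
    chars_per_second = char_count / duration_seconds
    for level, upper in enumerate((3, 5, 8, 11, 13), start=1):    # upper bounds of levels 1-5
        if chars_per_second <= upper:
            return level
    return 5                                                      # assumed clamp for very fast speech

print(speech_rate_level(30, 15))                                  # 30 characters / 15 s -> 2 per second -> level 1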
Because the intonation of a normal speaker is relatively steady, the intonation changes correspondingly when the speaker's emotion fluctuates. The intonation can therefore be divided into 5 levels, where level 1 corresponds to a slightly low intonation, level 2 to a low intonation, level 3 to a steady intonation, level 4 to a high intonation, and level 5 to a slightly high intonation. The annotator can select the intonation level of the initial voice data according to whether the pitch of the voice in the initial voice data rises or falls.
The tone of a normal speaker is likewise stable, and the tone changes correspondingly as the speaker's emotion fluctuates. The tone can therefore be divided into 3 levels, where level 1 corresponds to a deep tone, level 2 to a steady tone, and level 3 to a sharp tone. The annotator can select the tone level of the initial voice data according to the change of tone in the initial voice data.
In step S140, the speech rate level, the intonation level and the tone level are input into the machine learning model trained in advance to obtain the speech emotion category.
In an exemplary embodiment of the present disclosure, the machine learning model trained in advance may be trained from a speech rate sample, an intonation sample and a tone sample.
In an alternative embodiment, fig. 6 shows a flow diagram of a method of training the machine learning model. As shown in fig. 6, the method comprises at least the following steps: in step S610, a speech rate sample, an intonation sample and a tone sample for training the machine learning model, and the emotion category samples corresponding to the speech rate sample, the intonation sample and/or the tone sample, are obtained. The speech rate sample, the intonation sample and the tone sample can have one-to-one, many-to-one, one-to-many or many-to-many relations with the emotion category samples, and these relations can be determined through an emotion mapping table.
In an alternative embodiment, fig. 7 shows a flow chart of a method for determining emotion category samples. As shown in fig. 7, the method comprises at least the following steps: in step S710, a speech rate sample, an intonation sample and a tone sample for training the machine learning model are obtained, and an emotion mapping table corresponding to the speech rate sample, the intonation sample and the tone sample is obtained. The emotion mapping table may be set in advance based on experience or generated by an algorithm, which is not particularly limited in this exemplary embodiment. The emotion mapping table records the mapping relations between the speech rate samples, the intonation samples, the tone samples and the emotion category samples; the relations can be one-to-one, many-to-one, one-to-many or many-to-many. Preferably, the emotion category samples may include six categories: impatience, anger, calm, happiness, passion and excitement.
In step S720, the emotion category sample is determined according to the speech rate sample, the intonation sample and/or the tone sample based on the emotion mapping table. After the emotion mapping table is obtained, the corresponding emotion category sample can be determined from the speech rate sample, the intonation sample and/or the tone sample. For example, when the speech rate level is level 1, the emotion category sample may be determined to be calm; when the speech rate level is level 3 and the intonation level is level 4, the emotion category sample may be determined to be impatient.
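A table-lookup sketch of this step is shown below. Only the two combinations from the example above are filled in; a real emotion mapping table would be set in advance from experience or generated by an algorithm, and its relations may be one-to-one, many-to-one, one-to-many or many-to-many.

EMOTION_MAP = {
    (1, None, None): "calm",        # speech rate level 1
    (3, 4, None): "impatient",      # speech rate level 3 and intonation level 4
}

def emotion_category_sample(rate_level, intonation_level=None, tone_level=None):
    # Look up the most specific key first, then fall back to coarser keys.
    for key in ((rate_level, intonation_level, tone_level),
                (rate_level, intonation_level, None),
                (rate_level, None, None)):
        if key in EMOTION_MAP:
            return EMOTION_MAP[key]
    return None                      # no mapping defined for this combination

print(emotion_category_sample(3, 4))                     # -> "impatient"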
In the exemplary embodiment, the corresponding emotion category sample can be determined through the emotion mapping table, the determination mode is simple, and the accuracy is extremely high.
In step S620, the speech rate sample, the intonation sample and/or the tone sample are input to the machine learning model to be trained, so that the machine learning model to be trained outputs the speech emotion classification corresponding to the speech rate sample, the intonation sample and/or the tone sample. After the speech emotion type output by the machine learning model to be trained is obtained, the speech emotion type can be matched with the corresponding emotion type sample, whether the output speech emotion type is the same as the emotion type sample or not is judged, and whether the machine learning model to be trained is trained or not is determined according to a matching result.
In step S630, if the speech emotion type does not match the emotion type sample, the parameters of the machine learning model to be trained are adjusted so that the speech emotion type is the same as the emotion type sample. When the speech emotion type is not matched with the emotion type sample, the machine learning model to be trained is not trained, so that parameters of the machine learning model to be trained can be adjusted, the speech emotion type is consistent with the emotion type sample, and training of the machine learning model to be trained is completed.
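A minimal training sketch follows. The patent does not name a model family or a training procedure, so a scikit-learn decision tree over the three level features is used purely as an illustration, and the sample rows below are invented; fitting the model plays the role of adjusting its parameters until its outputs match the emotion category samples.

from sklearn.tree import DecisionTreeClassifier

X_samples = [
    [1, 3, 2],                                   # [speech rate level, intonation level, tone level]
    [3, 4, 3],
    [2, 5, 3],
]
y_samples = ["calm", "impatient", "angry"]       # corresponding emotion category samples

model = DecisionTreeClassifier(random_state=0)
model.fit(X_samples, y_samples)                  # fitting adjusts the model until its outputs match the samples

# Step S140: feed new speech rate, intonation and tone levels to the trained model.
print(model.predict([[1, 1, 1]]))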
In this exemplary embodiment, the complete training process of the machine learning model ensures the accuracy of the output speech emotion category and further improves and guarantees the correctness of the emotion labeling.
In view of this, when the speech rate level, the intonation level and the tone level are input into the machine learning model trained in advance, the machine learning model can output an accurate speech emotion category. For example, when the speech rate level, intonation level and tone level of an utterance are all level 1, they are input into the trained machine learning model, which can output the speech emotion category as happy.
In an exemplary embodiment of the present disclosure, by inputting the speech rate level of the speech recognition text and the intonation level and mood level of the initial speech data into the machine learning model, the corresponding speech emotion classification can be obtained. On one hand, the subjectivity of manual voice emotion labeling is avoided, the voice emotion type labeling is more objective, the number of errors of emotion type labeling is reduced, and the emotion type labeling accuracy is improved; on the other hand, online voice recognition with slow recognition is avoided, the speed of voice recognition is improved, and the efficiency of voice recognition and emotion category marking is improved.
In addition, in the exemplary embodiment of the present disclosure, a speech emotion detection apparatus is also provided. Fig. 8 shows a schematic structure of the speech emotion detection apparatus, and as shown in fig. 8, the speech emotion detection apparatus 800 may include: speech recognition module 810, parameter acquisition module 820, rank determination module 830, and emotion detection module 840. Wherein:
the voice recognition module 810 is configured to acquire initial voice data and perform voice recognition processing on the initial voice data to obtain a voice recognition text; a parameter obtaining module 820 configured to determine a voice duration of the initial voice data and determine a number of words of the voice recognition text; a level determination module 830 configured to input the speech rate level, the intonation level, and the mood level into a machine learning model trained in advance, so as to obtain a speech emotion category; and the emotion detection module 840 is configured to input the speech rate grade, the intonation grade and the tone grade into a machine learning model trained in advance to obtain the speech emotion category.
The specific details of the speech emotion detection apparatus 800 have been described in detail in the corresponding speech emotion detection method, and therefore are not described herein again.
It should be noted that although several modules or units of the speech emotion detection apparatus 800 are mentioned in the above detailed description, such division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
An electronic device 900 according to such an embodiment of the invention is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
Wherein the storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform steps according to various exemplary embodiments of the present invention described in the above section "exemplary methods" of the present specification.
The storage unit 920 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 921 and/or a cache memory unit 922, and may further include a read only memory unit (ROM) 923.
Storage unit 920 may also include a program/utility 924 having a set (at least one) of program modules 925, such program modules 925 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 930 can be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 1100 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via the input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 over the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.
Referring to fig. 10, a program product 1000 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A speech emotion detection method, characterized in that the method comprises:
acquiring initial voice data, and performing voice recognition processing on the initial voice data to obtain a voice recognition text;
determining the voice duration of the initial voice data and determining the number of characters of the voice recognition text;
determining a speech rate level according to the number of characters and the voice duration, and acquiring an intonation level and a tone level of the initial voice data;
and inputting the speech rate level, the intonation level and the tone level into a machine learning model trained in advance to obtain the speech emotion category.
2. The method according to claim 1, wherein before the inputting of the speech rate level, the intonation level and the tone level into the pre-trained machine learning model, the method further comprises:
acquiring a speech rate sample, an intonation sample and a tone sample for training the machine learning model and emotion category samples corresponding to the speech rate sample, the intonation sample and/or the tone sample;
inputting the speech rate sample, the intonation sample and/or the tone sample to a machine learning model to be trained so that the machine learning model to be trained outputs speech emotion categories corresponding to the speech rate sample, the intonation sample and/or the tone sample;
and if the speech emotion type is not matched with the emotion type sample, adjusting parameters of the machine learning model to be trained so as to enable the speech emotion type to be the same as the emotion type sample.
3. The method according to claim 2, wherein the obtaining of the speech rate sample, the intonation sample and the tone sample for training the machine learning model and the emotion category samples corresponding to the speech rate sample, the intonation sample and/or the tone sample comprises:
acquiring a speech rate sample, an intonation sample and a tone sample for training the machine learning model, and acquiring an emotion mapping table corresponding to the speech rate sample, the intonation sample and the tone sample;
and determining an emotion category sample according to the speech rate sample, the intonation sample and/or the tone sample based on the emotion mapping table.
4. The method for detecting speech emotion according to claim 1, wherein the obtaining initial speech data and performing speech recognition processing on the initial speech data to obtain a speech recognition text comprises:
acquiring initial voice data, and removing mute data in the initial voice data by using a voice endpoint detection algorithm to obtain voiced voice data;
and carrying out voice recognition processing on the voiced voice data to obtain a voice recognition text.
5. The method for detecting speech emotion according to claim 4, wherein the performing speech recognition processing on the voiced speech data to obtain a speech recognition text comprises:
if the initial voice data comprises voice data of at least two different speakers, separating the voiced voice data by using a speaker change detection algorithm to obtain separated voice data;
and carrying out voice recognition processing on the separated voice data to obtain a voice recognition text.
6. The method for detecting speech emotion according to claim 5, wherein the performing speech recognition processing on the separated speech data to obtain a speech recognition text comprises:
if the separated voice data is longer than the preset duration, cutting the separated voice data to obtain cut voice data;
and performing voice recognition processing on the cut voice data to obtain a voice recognition text.
7. The method of claim 6, wherein after obtaining the speech recognition text, the method further comprises:
matching the voice recognition text with a corresponding standard text, and determining a regular matching word of the voice recognition text;
correcting the regular matching words using a regular algorithm to optimize the speech recognition process.
8. A speech emotion detection apparatus, comprising:
the voice recognition module is configured to acquire initial voice data and perform voice recognition processing on the initial voice data to obtain a voice recognition text;
the parameter acquisition module is configured to determine the voice duration of the initial voice data and determine the number of characters of the voice recognition text;
the level determining module is configured to determine a speech rate level according to the number of characters and the voice duration, and to acquire an intonation level and a tone level of the initial voice data;
and the emotion detection module is configured to input the speech rate level, the intonation level and the tone level into a machine learning model trained in advance to obtain a speech emotion category.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method for detecting speech emotion of any one of claims 1-7.
10. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the speech emotion detection method of any of claims 1-7 via execution of the executable instructions.
CN202010261354.7A 2020-04-03 2020-04-03 Voice emotion detection method and device, storage medium and electronic equipment Pending CN111370030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010261354.7A CN111370030A (en) 2020-04-03 2020-04-03 Voice emotion detection method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010261354.7A CN111370030A (en) 2020-04-03 2020-04-03 Voice emotion detection method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111370030A true CN111370030A (en) 2020-07-03

Family

ID=71212390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010261354.7A Pending CN111370030A (en) 2020-04-03 2020-04-03 Voice emotion detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111370030A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037775A (en) * 2020-09-08 2020-12-04 北京嘀嘀无限科技发展有限公司 Voice recognition method, device, equipment and storage medium
CN112565880A (en) * 2020-12-28 2021-03-26 北京五街科技有限公司 Method for playing explanation videos
CN112565881A (en) * 2020-12-28 2021-03-26 北京五街科技有限公司 Self-adaptive video playing method
CN112992187A (en) * 2021-02-26 2021-06-18 平安科技(深圳)有限公司 Context-based voice emotion detection method, device, equipment and storage medium
CN113689886A (en) * 2021-07-13 2021-11-23 北京工业大学 Voice data emotion detection method and device, electronic equipment and storage medium
WO2022267322A1 (en) * 2021-06-24 2022-12-29 深圳前海微众银行股份有限公司 Method and apparatus for generating meeting summary, and terminal device and computer storage medium

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140126485A (en) * 2013-04-23 2014-10-31 네무스텍(주) Method of Emotion Reactive Type Mobile Private Secretary Service
KR20160142949A (en) * 2015-06-03 2016-12-14 (주)감성과학연구센터 Emotion analysis apparatus for the contact center and method thereof
CN107329996A (en) * 2017-06-08 2017-11-07 三峡大学 A kind of chat robots system and chat method based on fuzzy neural network
CN107393529A (en) * 2017-07-13 2017-11-24 珠海市魅族科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
CN107464566A (en) * 2017-09-21 2017-12-12 百度在线网络技术(北京)有限公司 Audio recognition method and device
US20170368456A1 (en) * 2016-06-27 2017-12-28 Echostar Technologies L.L.C. Media consumer data exchange
CN107645523A (en) * 2016-07-21 2018-01-30 北京快乐智慧科技有限责任公司 A kind of method and system of mood interaction
CN107995370A (en) * 2017-12-21 2018-05-04 广东欧珀移动通信有限公司 Call control method, device and storage medium and mobile terminal
CN108053826A (en) * 2017-12-04 2018-05-18 泰康保险集团股份有限公司 For the method, apparatus of human-computer interaction, electronic equipment and storage medium
CN108257602A (en) * 2018-01-30 2018-07-06 海信集团有限公司 License plate number character string antidote, device, server and terminal
CN109448730A (en) * 2018-11-27 2019-03-08 广州广电运通金融电子股份有限公司 A kind of automatic speech quality detecting method, system, device and storage medium
CN109697290A (en) * 2018-12-29 2019-04-30 咪咕数字传媒有限公司 A kind of information processing method, equipment and computer storage medium
CN109712646A (en) * 2019-02-20 2019-05-03 百度在线网络技术(北京)有限公司 Voice broadcast method, device and terminal
CN109753663A (en) * 2019-01-16 2019-05-14 中民乡邻投资控股有限公司 A kind of customer anger stage division and device
CN110149450A (en) * 2019-05-22 2019-08-20 欧冶云商股份有限公司 Intelligent customer service answer method and system
CN110147936A (en) * 2019-04-19 2019-08-20 深圳壹账通智能科技有限公司 Service evaluation method, apparatus based on Emotion identification, storage medium
CN110265012A (en) * 2019-06-19 2019-09-20 泉州师范学院 It can interactive intelligence voice home control device and control method based on open source hardware
CN110349564A (en) * 2019-07-22 2019-10-18 苏州思必驰信息科技有限公司 Across the language voice recognition methods of one kind and device
WO2019225801A1 (en) * 2018-05-23 2019-11-28 한국과학기술원 Method and system for simultaneously recognizing emotion, age, and gender on basis of voice signal of user

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140126485A (en) * 2013-04-23 2014-10-31 네무스텍(주) Method of Emotion Reactive Type Mobile Private Secretary Service
KR20160142949A (en) * 2015-06-03 2016-12-14 (주)감성과학연구센터 Emotion analysis apparatus for the contact center and method thereof
US20170368456A1 (en) * 2016-06-27 2017-12-28 Echostar Technologies L.L.C. Media consumer data exchange
CN107645523A (en) * 2016-07-21 2018-01-30 北京快乐智慧科技有限责任公司 A kind of method and system of mood interaction
CN107329996A (en) * 2017-06-08 2017-11-07 三峡大学 A kind of chat robots system and chat method based on fuzzy neural network
CN107393529A (en) * 2017-07-13 2017-11-24 珠海市魅族科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
CN107464566A (en) * 2017-09-21 2017-12-12 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN108053826A (en) * 2017-12-04 2018-05-18 泰康保险集团股份有限公司 For the method, apparatus of human-computer interaction, electronic equipment and storage medium
CN107995370A (en) * 2017-12-21 2018-05-04 广东欧珀移动通信有限公司 Call control method, device and storage medium and mobile terminal
CN108257602A (en) * 2018-01-30 2018-07-06 海信集团有限公司 License plate number character string antidote, device, server and terminal
WO2019225801A1 (en) * 2018-05-23 2019-11-28 한국과학기술원 Method and system for simultaneously recognizing emotion, age, and gender on basis of voice signal of user
CN109448730A (en) * 2018-11-27 2019-03-08 广州广电运通金融电子股份有限公司 A kind of automatic speech quality detecting method, system, device and storage medium
CN109697290A (en) * 2018-12-29 2019-04-30 咪咕数字传媒有限公司 A kind of information processing method, equipment and computer storage medium
CN109753663A (en) * 2019-01-16 2019-05-14 中民乡邻投资控股有限公司 A kind of customer anger stage division and device
CN109712646A (en) * 2019-02-20 2019-05-03 百度在线网络技术(北京)有限公司 Voice broadcast method, device and terminal
CN110147936A (en) * 2019-04-19 2019-08-20 深圳壹账通智能科技有限公司 Service evaluation method, apparatus based on Emotion identification, storage medium
CN110149450A (en) * 2019-05-22 2019-08-20 欧冶云商股份有限公司 Intelligent customer service answer method and system
CN110265012A (en) * 2019-06-19 2019-09-20 泉州师范学院 It can interactive intelligence voice home control device and control method based on open source hardware
CN110349564A (en) * 2019-07-22 2019-10-18 苏州思必驰信息科技有限公司 Across the language voice recognition methods of one kind and device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037775A (en) * 2020-09-08 2020-12-04 北京嘀嘀无限科技发展有限公司 Voice recognition method, device, equipment and storage medium
CN112565880A (en) * 2020-12-28 2021-03-26 北京五街科技有限公司 Method for playing explanation videos
CN112565881A (en) * 2020-12-28 2021-03-26 北京五街科技有限公司 Self-adaptive video playing method
CN112565881B (en) * 2020-12-28 2023-03-24 北京五街科技有限公司 Self-adaptive video playing method and system
CN112565880B (en) * 2020-12-28 2023-03-24 北京五街科技有限公司 Method and system for playing explanation videos
CN112992187A (en) * 2021-02-26 2021-06-18 平安科技(深圳)有限公司 Context-based voice emotion detection method, device, equipment and storage medium
WO2022267322A1 (en) * 2021-06-24 2022-12-29 深圳前海微众银行股份有限公司 Method and apparatus for generating meeting summary, and terminal device and computer storage medium
CN113689886A (en) * 2021-07-13 2021-11-23 北京工业大学 Voice data emotion detection method and device, electronic equipment and storage medium
CN113689886B (en) * 2021-07-13 2023-05-30 北京工业大学 Voice data emotion detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US10720164B2 (en) System and method of diarization and labeling of audio data
CN111370030A (en) Voice emotion detection method and device, storage medium and electronic equipment
CN108428446B (en) Speech recognition method and device
US9875739B2 (en) Speaker separation in diarization
US8914285B2 (en) Predicting a sales success probability score from a distance vector between speech of a customer and speech of an organization representative
EP1696421B1 (en) Learning in automatic speech recognition
US10217457B2 (en) Learning from interactions for a spoken dialog system
US10573307B2 (en) Voice interaction apparatus and voice interaction method
US20180068651A1 (en) System and method of automated evaluation of transcription quality
CN108281138B (en) Age discrimination model training and intelligent voice interaction method, equipment and storage medium
CN109192194A (en) Voice data mask method, device, computer equipment and storage medium
EP3667660A1 (en) Information processing device and information processing method
Kopparapu Non-linguistic analysis of call center conversations
CN113129895B (en) Voice detection processing system
CN111209367A (en) Information searching method, information searching device, electronic equipment and storage medium
US11430435B1 (en) Prompts for user feedback
WO2018169772A2 (en) Quality feedback on user-recorded keywords for automatic speech recognition systems
US11929070B1 (en) Machine learning label generation
Rahim et al. Improving Speaker Diarization for Low-Resourced Sarawak Malay Language Conversational Speech Corpus
CN113555016A (en) Voice interaction method, electronic equipment and readable storage medium
CN117594062A (en) Voice response method, device and medium applied to real-time session and electronic equipment
CN114792521A (en) Intelligent answering method and device based on voice recognition
CN114566151A (en) Data annotation method and device, electronic equipment and readable medium
CN113921016A (en) Voice processing method, device, electronic equipment and storage medium
Litman et al. Recognizing student emotions and attitudes on the basis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200703